Ask HN: Best way to keep the raw HTML of scraped pages?

warc-proxy

Mentions: 1 | Stars: 61 | Activity: 10.0 | Language: Python

Serving content from a WARC

I thought mitmproxy did this, but a cursory search didn't turn anything up. That said, its native format [1] has even higher fidelity (I'd guess it's comparable to a Wireshark capture).
Be aware that WARC is great for preservation, but getting content back out of it requires specialized tooling, à la https://github.com/alard/warc-proxy
[1]: https://github.com/mitmproxy/mitmproxy/blob/9.0.1/mitmproxy/...

mitmproxy

Mentions: 152 | Stars: 34,347 | Activity: 9.4 | Language: Python

An interactive TLS-capable intercepting HTTP proxy for penetration testers and software developers.

Scrapy

Mentions: 180 | Stars: 50,896 | Activity: 9.6 | Language: Python

Scrapy, a fast high-level web crawling & scraping framework for Python.

If you weren't already aware, Scrapy has strong support for this via its HttpCacheMiddleware. You can choose whether it behaves like an actual cache, returning already-scraped content when a request matches, or merely acts as a pass-through cache: https://docs.scrapy.org/en/2.7/topics/downloader-middleware....
Its out-of-the-box storage does what the sibling comment describes, SHA-1-ing the request and sharding the output path by the first two characters of the hash: https://github.com/scrapy/scrapy/blob/2.7.1/scrapy/extension...
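For reference, a minimal sketch of what turning this on might look like in a project's settings.py. The setting names are the ones documented for Scrapy's HTTP cache; the values (the directory name in particular) are only illustrative assumptions.

```python
# settings.py (sketch): enable Scrapy's HTTP cache so raw responses are kept on disk.

# Turn on HttpCacheMiddleware.
HTTPCACHE_ENABLED = True

# Directory for stored responses; "httpcache" is an illustrative choice.
# A relative path is resolved inside the project's data directory.
HTTPCACHE_DIR = "httpcache"

# 0 means cached responses never expire.
HTTPCACHE_EXPIRATION_SECS = 0

# DummyPolicy stores every response and replays it on a matching request,
# without looking at HTTP caching headers.
HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"

# Filesystem backend: one directory per request under HTTPCACHE_DIR.
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Optionally gzip the stored files to save disk space.
HTTPCACHE_GZIP = True
```

Swapping HTTPCACHE_POLICY to "scrapy.extensions.httpcache.RFC2616Policy" would make the cache honor HTTP caching headers instead of storing everything unconditionally.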
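If you are not using Scrapy, the same hash-and-shard idea is easy to sketch with the standard library. This is an illustration of the technique, not Scrapy's code: the function names are made up, and it hashes only the URL, whereas Scrapy fingerprints the whole request.

```python
import hashlib
from pathlib import Path
from typing import Optional


def shard_path(root: Path, url: str) -> Path:
    """Return root/<first two hex chars>/<full sha1 hex digest> for a URL."""
    # Assumption: the URL alone identifies the page. Include the method and
    # body in the hashed bytes if you also scrape POST requests.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return root / digest[:2] / digest


def save_raw_html(root: Path, url: str, body: bytes) -> Path:
    """Write the raw response bytes to the sharded location and return the path."""
    target = shard_path(root, url)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(body)
    return target


def load_raw_html(root: Path, url: str) -> Optional[bytes]:
    """Return the stored bytes for a URL, or None if it was never saved."""
    target = shard_path(root, url)
    return target.read_bytes() if target.exists() else None
```

Sharding by the first two hex characters spreads files over at most 256 subdirectories, which keeps any single directory from growing unmanageably large.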
