SaaSHub helps you find the best software and product alternatives Learn more →
Top 18 warc Open-Source Projects
-
ArchiveBox
🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
-
heritrix3
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
-
InfluxDB
Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
-
grab-site
The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns
-
SaaSHub
SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives
-
cdx_toolkit
A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
-
browsertrix
Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
-
wget-lua
Wget-AT is a modern Wget with Lua hooks, Zstandard (+dictionary) WARC compression and URL-agnostic deduplication.
-
forum-dl
Scrape posts, threads from forums, news aggregators, mail archives, export to JSONL, mailbox, WARC
-
abracabra
Eventually a search engine, but currently a filtering pipeline for HTML and soon WARC files.
-
SaaSHub
SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives
Project mention: Ask HN: What Underrated Open Source Project Deserves More Recognition? | news.ycombinator.com | 2024-03-07Two projects I greatly appreciate, allowing me to easily archive my bandcamp and GOG purchases (after the initial setup anyways):
https://github.com/easlice/bandcamp-downloader
https://github.com/Kalanyr/gogrepoc
And I recently learned about archivebox, which I think is going to be a fast favorite and finally let me clear out my mess of tabs/bookmarks: https://github.com/ArchiveBox/ArchiveBox
If anyone's interested in web crawling technology, check out Heretrix [1], been around since 2004 and while not the most performant it has incorporated many responsible disciplines in the design and as this article pointed out, WARC format.
1. https://heritrix.readthedocs.io
Project mention: YaCy, a distributed Web Search Engine, based on a peer-to-peer network | news.ycombinator.com | 2024-03-05
Project mention: Ask HN: How can I back up an old vBulletin forum without admin access? | news.ycombinator.com | 2024-01-29The format you want is WARC. Even the Library of Congress uses it. There are many many WARC scrapers. I'd look at what the Internet Archive recommends. A quick search turned up this from the Archive Team and Jason Scott https://github.com/ArchiveTeam/grab-site (https://wiki.archiveteam.org/index.php/Who_We_Are) but I found that in less than 15 seconds of searching so do your own diligence.
Project mention: Ask HN: How can I back up an old vBulletin forum without admin access? | news.ycombinator.com | 2024-01-29You can try https://replayweb.page/ as a test for viewing a WARC file. I do think you'll run into problems though with wanting to browse interconnected links in a forum format, but try this as a first step.
One potential option but definitely a bit more work would be, once you have all the warc files downloaded, you can open them all in python using the warctools module and maybe beautifulsoup and potentially parse/extract all of the data embedded in the WARC archives into your own "fresh" HTML webserver.
https://github.com/internetarchive/warctools
Project mention: Show HN: Command line tool for extracting secrets from WARC (Web ARChive) files | news.ycombinator.com | 2023-12-20
Project mention: Ask HN: How can I back up an old vBulletin forum without admin access? | news.ycombinator.com | 2024-01-29You can try forum-dl, a forum scraping tool I've been writing for this purpose: https://github.com/mikwielgus/forum-dl
It's single-threaded, alpha-quality software, and still isn't compatible with many forums and themes. But it can export WARCs and may just happen to work for you.
warc related posts
-
An Introduction to the WARC File
-
Ask HN: How can I back up an old vBulletin forum without admin access?
-
WARC'in the Crawler
-
Best practices for archiving websites
-
struggling to download websites
-
Selenium over scrapy
-
Internet Archive Down, will be up and running soon (i hope).
-
A note from our sponsor - SaaSHub
www.saashub.com | 14 May 2024
Index
What are some of the best open-source warc projects? This list will help you:
Project | Stars | |
---|---|---|
1 | ArchiveBox | 19,915 |
2 | heritrix3 | 2,698 |
3 | conifer | 1,457 |
4 | grab-site | 1,270 |
5 | replayweb.page | 626 |
6 | ipwb | 590 |
7 | WarcDB | 384 |
8 | warcio | 346 |
9 | bitextor | 280 |
10 | cdx_toolkit | 153 |
11 | troll-a | 130 |
12 | browsertrix | 126 |
13 | warc-parquet | 99 |
14 | wget-lua | 83 |
15 | forum-dl | 61 |
16 | chatnoir-resiliparse | 43 |
17 | warc2zim | 34 |
18 | abracabra | 0 |
Sponsored