warcio
Streaming WARC/ARC library for fast web archive IO (by webrecorder)
at-dataproc
Tools used to process/transform ArchiveTeam WARCs (by signalhunter)
| | warcio | at-dataproc |
---|---|---|
Mentions | 4 | 2 |
Stars | 353 | 1 |
Growth | 2.5% | - |
Activity | 5.0 | 10.0 |
Last commit | 18 days ago | over 1 year ago |
Language | Python | Python |
License | Apache License 2.0 | - |
The number of mentions indicates the total number of mentions that we've tracked plus the number of user-suggested alternatives.
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
warcio
Posts with mentions or reviews of warcio.
We have used some of these posts to build our list of alternatives
and similar projects. The last one was on 2022-10-04.
- Can anyone familiar with databases of Youtube archives help me? I don't know how to find what I'm looking for. Details in post.
What you're looking at are WARC (Web ARChive) files, which contain the raw API responses saved from YouTube. You need to parse them into usable data with something like warcio, then ingest it into a database.
- Any very noob friendly way to extract images and videos from WARC files?
- Help with WARC files/Extracting a portion of a full crawl
The tool you want is warcio, which is available on PyPI and has a command line interface as well. You can use that to extract contents, scan through WARCs, or build new WARCs.
- Is anyone working on a Yahoo Answers Archive yet, and if so where can we go to find it?
(Note: I use warcio for this, but the description above explains what is actually happening)
at-dataproc
Posts with mentions or reviews of at-dataproc.
We have used some of these posts to build our list of alternatives
and similar projects. The last one was on 2022-10-04.
- YouTube Discussions Tab dataset (245.3 million comments)
I've been processing ArchiveTeam's YouTube discussions dataset into something more workable than the unwieldy raw JSON responses saved from YouTube, and I would like to share it with anyone who's interested in the data. This all started when a reddit user asked if their channel's discussion tab was saved, and I challenged myself to process this dataset for fun. Here's some code that I wrote for this, if anyone is curious.
- Can anyone familiar with databases of Youtube archives help me? I don't know how to find what I'm looking for. Details in post.
Just a quick update: I'm currently processing all of the WARCs from the ArchiveTeam project, which will take about 2 days at current transfer rates from the Internet Archive (which is notoriously slow). I wrote my own software to do this, which is available here if you want to check it out.
What are some alternatives?
When comparing warcio and at-dataproc you can also consider the following projects:
pywb - Core Python Web Archiving Toolkit for replay and recording of web archives
youtube-discussions-grab
ArchiveBox - 🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
yahoo-answers-archiveteam-compose
youtube-discussions-archive - EXPERIMENTAL YouTube Discussion Tab Downloader