-
url-collector
An application that crawls the Common Crawl corpus for URLs with the specified file extensions.
-
library-of-alexandria
Library of Alexandria (LoA in short) is a project that aims to collect and archive documents from the internet.
I wrote an application that goes through a Common Crawl dataset. This file is the result of parsing the July/August 2021 crawl (3.15 billion web pages, or 360 TiB of uncompressed content).
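The core idea described above, scanning crawl data for URLs that end in specific file extensions, can be sketched roughly like this. This is not the actual url-collector code; the extension set and function names here are hypothetical, and the input is just a local list of URLs standing in for lines extracted from a Common Crawl index.

```python
from urllib.parse import urlparse

# Hypothetical example extensions; the real application takes these as parameters.
TARGET_EXTENSIONS = {".pdf", ".doc", ".epub"}


def has_target_extension(url: str, extensions=TARGET_EXTENSIONS) -> bool:
    """Return True if the URL's path ends with one of the wanted extensions.

    Parsing the path first ignores query strings like '?session=1'.
    """
    path = urlparse(url).path.lower()
    return any(path.endswith(ext) for ext in extensions)


def collect_urls(lines, extensions=TARGET_EXTENSIONS):
    """Filter an iterable of URL strings, keeping only matching documents."""
    return [url for url in lines if has_target_extension(url, extensions)]


if __name__ == "__main__":
    sample = [
        "https://example.com/paper.pdf",
        "https://example.com/index.html",
        "https://example.com/book.epub?session=1",
    ]
    print(collect_urls(sample))
```

At Common Crawl scale the filtering itself is trivial; the real work is streaming and decompressing hundreds of terabytes of records, which the snippet above deliberately leaves out.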
Well, some of you might know that I'm working on the Library of Alexandria project. This project is all about archiving PDF files and making them searchable, effectively building a privately owned library with a couple hundred million documents.
Related posts
-
A newspaper vanished from the internet. Did someone pay to kill it? | *digs into link rot and the loss of digital archives*
-
What do you do when your PC ran out of internal HDD cables?
-
Putting 5,998,794 books on IPFS
-
The r/DataHoarder community is mentioned in this: The Enduring Allure of the Library of Alexandria | On the Media | WNYC Studios
-
Anyone here with 50TB,100TB+ of personal storage that isn't mostly movies/TV/porn ??