-
url-collector
An application that crawls the Common Crawl corpus for URLs with the specified file extensions.
-
library-of-alexandria
Library of Alexandria (LoA in short) is a project that aims to collect and archive documents from the internet.
I wrote an application that goes through a Common Crawl dataset. This file is the result of parsing the July/August 2021 crawl (3.15 billion web pages, or 360 TiB of uncompressed content).
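The core idea described above, scanning crawl data for URLs that end in specific file extensions, can be sketched roughly like this. This is not the actual url-collector code; the extension set and function names here are hypothetical, and the input is just a local list of URLs standing in for lines extracted from a Common Crawl index.

```python
from urllib.parse import urlparse

# Hypothetical example extensions; the real application takes these as parameters.
TARGET_EXTENSIONS = {".pdf", ".doc", ".epub"}


def has_target_extension(url: str, extensions=TARGET_EXTENSIONS) -> bool:
    """Return True if the URL's path ends with one of the wanted extensions.

    Parsing the path first ignores query strings like '?session=1'.
    """
    path = urlparse(url).path.lower()
    return any(path.endswith(ext) for ext in extensions)


def collect_urls(lines, extensions=TARGET_EXTENSIONS):
    """Filter an iterable of URL strings, keeping only matching documents."""
    return [url for url in lines if has_target_extension(url, extensions)]


if __name__ == "__main__":
    sample = [
        "https://example.com/paper.pdf",
        "https://example.com/index.html",
        "https://example.com/book.epub?session=1",
    ]
    print(collect_urls(sample))
```

At Common Crawl scale the filtering itself is trivial; the real work is streaming and decompressing hundreds of terabytes of records, which the snippet above deliberately leaves out.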
Well, some of you might know that I'm working on the Library of Alexandria project. This project is all about archiving PDF files and making them searchable, effectively building a privately owned library with a couple hundred million documents.
Related posts
-
A newspaper vanished from the internet. Did someone pay to kill it? | *digs into link rot and the loss of digital archives*
-
What do you do when your PC ran out of internal HDD cables?
-
Putting 5,998,794 books on IPFS
-
The r/DataHoarder community is mentioned in this: The Enduring Allure of the Library of Alexandria | On the Media | WNYC Studios
-
Anyone here with 50TB,100TB+ of personal storage that isn't mostly movies/TV/porn ??