How I archived 100 million PDF documents... (Part 1)

This page summarizes the projects mentioned and recommended in the original post on dev.to

Our great sponsors
  • InfluxDB - Collect and Analyze Billions of Data Points in Real Time
  • Sonar - Write Clean Java Code. Always.
  • Mergify - Updating dependencies is time-consuming.
  • library-of-alexandria

    Library of Alexandria (LoA in short) is a project that aims to collect and archive documents from the internet.

    After a quick Google search, I figured out that only less than 1% of ancient texts survived to the modern day. This unfortunate fact was my inspiration to start working on an ambitious web crawling and archival project, called the Library of Alexandria.

  • mixnode-warcreader-java

    Read Web ARChive (WARC) files in Java.

    I found one Java library on Github (thanks Mixnode) that was able to read these files. Unfortunately, it was not maintained for the past couple of years. I picked it up and forked it to make it a little easier to use. (A couple of years later this repo was moved under the Bottomless Archive project as well.)

  • InfluxDB

    Collect and Analyze Billions of Data Points in Real Time. Manage all types of time series data in a single, purpose-built database. Run at any scale in any environment in the cloud, on-premises, or at the edge.

  • java-warc

    Read Web ARChive (WARC) files in Java.

    I found one Java library on Github (thanks Mixnode) that was able to read these files. Unfortunately, it was not maintained for the past couple of years. I picked it up and forked it to make it a little easier to use. (A couple of years later this repo was moved under the Bottomless Archive project as well.)

  • java-warc

    Read Web ARChive (WARC) files in Java. (by bottomless-archive-project)

    I found one Java library on Github (thanks Mixnode) that was able to read these files. Unfortunately, it was not maintained for the past couple of years. I picked it up and forked it to make it a little easier to use. (A couple of years later this repo was moved under the Bottomless Archive project as well.)

  • Apache PDFBox

    Mirror of Apache PDFBox

    So, when I started to view the documents, a lot of them simply failed to open. I had to look around for a library that could verify PDF documents. I had some experience with PDFBox in the past, so it seemed to be a good go-to solution. It had no way to verify documents by default, but it could open and parse them and that was enough to filter out the incorrect ones. It felt a little bit strange just to read the whole PDF into the memory to verify if it is correct or not, but hey I needed a simple fix for now and it worked really well.

  • jsoup

    jsoup: the Java HTML parser, built for HTML editing, cleaning, scraping, and XSS safety.

    Finally, at this point, I was able to go through a bunch of webpages (parsing them in the process with JSoup), grab all the links that contained pdf files based on the file extension then download them. Unsurprisingly, most of the pages (~60-80%) ended up being unavailable (404 Not Found and friends). After a quick cup of coffee, I got the 10.000 documents on my hard drive. This is when I realized that I have one more problem to solve.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts