PDF processing and analysis with open-source tools

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • pdfsizeopt

    PDF file size optimizer

    The list is missing the "pdfsizeopt" suite, which bundles statically compiled utilities to reduce PDF file size.

    Static compilation means that it will run on most Linux platforms without extra required software.

    I believe one of its features removes unused characters from embedded fonts.

    It really is quite impressive.


  • tesseract-ocr

    Tesseract Open Source OCR Engine (main repository)

    > Would love to find a cheaper (local) option vs AWS

    How about tesseract (https://github.com/tesseract-ocr/tesseract)

  • tesseract-ocr-for-php

    A wrapper to work with Tesseract OCR inside PHP.

    There’s even a library for PHP (https://github.com/thiagoalessio/tesseract-ocr-for-php). I haven’t used it. I have used Python's pytesseract, and it works fairly well.

  • Apache PDFBox

    Mirror of Apache PDFBox

    PDFBox can do this. It’s not part of the CLI but it wouldn’t be too hard to add:


  • pdf-diff

    A tool for visualizing differences between two pdf files.

    This tool might be helpful for comparing pdfs: https://github.com/serhack/pdf-diff

  • pdfsam

    PDFsam, a desktop application to split, merge, mix, rotate PDF files and extract pages

    I'd like to add this tool to the list: https://pdfsam.org/

  • WeasyPrint

    The awesome document factory
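
For the Tesseract suggestion above, local OCR can be sketched by calling the `tesseract` command-line tool from Python. This is a minimal sketch, not the commenters' own setup: it assumes the `tesseract` executable is installed and on the PATH, and the helper names `build_tesseract_command` and `ocr_image` are hypothetical. Tesseract reads image files, so PDF pages would have to be rasterized to images first (not shown here).

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, lang="eng", searchable_pdf=False):
    """Build the argv list for one Tesseract CLI run (hypothetical helper).

    The CLI form is: tesseract <image> <outputbase> [-l lang] [configfile]
    """
    cmd = ["tesseract", str(image), str(output_base), "-l", lang]
    if searchable_pdf:
        # The "pdf" config file makes Tesseract write <outputbase>.pdf
        # with a text layer instead of the default <outputbase>.txt.
        cmd.append("pdf")
    return cmd


def ocr_image(image, output_base, lang="eng", searchable_pdf=False):
    """Run Tesseract on one image and return the path of the output file."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    cmd = build_tesseract_command(image, output_base, lang, searchable_pdf)
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(cmd, check=True)
    suffix = ".pdf" if searchable_pdf else ".txt"
    return Path(str(output_base) + suffix)
```

Calling `ocr_image("page-1.png", "page-1")` would write `page-1.txt` next to the script, assuming `page-1.png` exists. Keeping the command construction in its own function makes it easy to inspect or log the exact command before running it.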
