Pdfsandwich

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • ripgrep-all

    rga: ripgrep, but also search in PDFs, E-Books, Office documents, zip, tar.gz, etc.

  • On an even more macro level I've had a great experience with ripgrep-all[0], which uses Tesseract internally.

    For example, I have a directory with all the weekly slides for one lecture, and can directly find where (both file and page) we learned something related to photosynthesis via `rga photosynthesis`.

    [0]: https://github.com/phiresky/ripgrep-all

  • InvoiceNet

    Deep neural network to extract intelligent information from invoice documents.

  • awesome-document-understanding

    A curated list of resources on the topic of Document Understanding (DU)

  • While trying to find a specific project I recalled, I encountered this list of projects which might be of interest: https://github.com/tstanislawek/awesome-document-understandi...

    The project I had in mind was similar to this one but I can't remember the name currently: https://github.com/tabulapdf/tabula

    However, if you're looking for an ML-based, invoice-specific project, it looks like the other comment to your reply might be more useful.

  • tabula

    Tabula is a tool for liberating data tables trapped inside PDF files

  • OCRmyPDF

    OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

  • A similar, perhaps more feature-rich tool is OCRmyPDF: https://github.com/jbarlow83/OCRmyPDF
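
As a rough illustration of the OCRmyPDF entry above, the following sketch wraps the `ocrmypdf` command line from Python to add a text layer to a scanned PDF. It assumes OCRmyPDF is installed and on the PATH; the helper names `build_ocrmypdf_command` and `add_text_layer` and the file names are hypothetical and chosen only for this example. The `--language` and `--skip-text` options are used here as one plausible configuration, not as a recommended default.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocrmypdf_command(src: Path, dst: Path, language: str = "eng") -> list[str]:
    """Build an OCRmyPDF command line for one scanned PDF (hypothetical helper)."""
    return [
        "ocrmypdf",
        "--language", language,  # Tesseract language code, e.g. "eng" or "deu"
        "--skip-text",           # leave pages that already contain text untouched
        str(src),
        str(dst),
    ]


def add_text_layer(src: Path, dst: Path, language: str = "eng") -> bool:
    """Run OCRmyPDF if it is installed; return True only when it exits with code 0."""
    if shutil.which("ocrmypdf") is None:
        return False  # tool not found on PATH, nothing was done
    result = subprocess.run(build_ocrmypdf_command(src, dst, language))
    return result.returncode == 0


if __name__ == "__main__":
    # "scan.pdf" and "scan_ocr.pdf" are placeholder file names.
    ok = add_text_layer(Path("scan.pdf"), Path("scan_ocr.pdf"))
    print("searchable PDF written" if ok else "OCRmyPDF unavailable or failed")
```

Keeping the command construction separate from the call to `subprocess.run` makes it easy to inspect or log the exact command before anything is executed.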
