Ask HN: What are you using to parse PDFs for RAG?

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • marker

    Convert PDF to markdown quickly with high accuracy

    Hi, I'm the author of marker - https://github.com/VikParuchuri/marker - from my testing, marker handles almost all the issues you mentioned. The biggest issue (that I'm working on fixing right now) is formatting tables properly.

  • pdf2htmlEX

    Convert PDF to HTML without losing text or format. (by pdf2htmlEX)

    I have previously used https://github.com/pdf2htmlEX/pdf2htmlEX to convert PDF to HTML at scale. You could potentially parse the output HTML into Markdown as a second stage.

  • PyMuPDF

    PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

  • RAG

    RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF (by pymupdf)

  • pandoc

    Universal markup converter

  • nougat

    Implementation of Nougat Neural Optical Understanding for Academic Documents

  • AI-in-a-Box

    https://github.com/Azure/AI-in-a-Box/tree/main/ai-services/d...

  • open-parse

    Improved file parsing for LLM’s

  • document-ai-samples

    Sample applications and demos for Document AI, the end-to-end document processing platform on Google Cloud

  • layout-parser

    A Unified Toolkit for Deep Learning Based Document Image Analysis

  • PDF-Extract-Kit

    A Comprehensive Toolkit for High-Quality PDF Content Extraction

  • PaddleOCR

    Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)

  • WDoc

    Summarize and query from a lot of heterogeneous documents. Any LLM provider, any filetype, scalable, under development

    For my RAG project [WDoc](https://github.com/thiswillbeyourgithub/WDoc/tree/dev) I use multiple PDF parsers, then use heuristics to keep the best one. The code is at https://github.com/thiswillbeyourgithub/WDoc/blob/654c05c5b2...

    The heuristics are partly based on using fasttext to detect languages: https://github.com/thiswillbeyourgithub/WDoc/blob/654c05c5b2...

    It's probably crap for tables but I don't want to rely on external parsers.
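
The "run several parsers, keep the best output" idea from the WDoc comment can be sketched roughly as below. This is not WDoc's actual code: the names `score_text` and `best_parse` are hypothetical, the parsers are passed in as plain callables, and the scoring uses two simple character and word-length checks as a stand-in for the fasttext language detection the comment mentions.

```python
import re
from typing import Callable, Dict, Tuple


def score_text(text: str) -> float:
    """Hypothetical quality score in [0, 1]; higher looks more like clean prose."""
    if not text.strip():
        return 0.0
    # Share of characters that are letters, digits, or whitespace.
    # Garbled extraction tends to be heavy on stray symbols.
    clean = sum(ch.isalnum() or ch.isspace() for ch in text) / len(text)
    words = re.findall(r"[^\W\d_]+", text)
    if not words:
        return 0.0
    # Share of words with a plausible length. Broken extraction often yields
    # isolated letters or long fused strings. Legitimate one-letter words
    # ("a", "I") are penalized too, which is acceptable for a rough ranking.
    plausible = sum(2 <= len(w) <= 20 for w in words) / len(words)
    return clean * plausible


def best_parse(
    path: str, parsers: Dict[str, Callable[[str], str]]
) -> Tuple[str, str]:
    """Run every parser on `path` and return (parser_name, text) for the top score."""
    results: Dict[str, str] = {}
    for name, parser in parsers.items():
        try:
            results[name] = parser(path)
        except Exception:
            # A parser that crashes on this file is simply left out.
            continue
    if not results:
        raise ValueError(f"no parser succeeded on {path!r}")
    best = max(results, key=lambda name: score_text(results[name]))
    return best, results[best]
```

Each entry in `parsers` would wrap one real library (for example a function that opens the file and returns its text); those wrappers are left out here because their details depend on the library chosen.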

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
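
The two-stage idea from the pdf2htmlEX entry (PDF to HTML first, then HTML to Markdown) could look something like the sketch below for the second stage. It uses only Python's standard `html.parser`, and it assumes the HTML contains ordinary `<h1>`..`<h6>` and `<p>` tags. That is an assumption made for illustration: real pdf2htmlEX output may not use such semantic tags, so the tag handling would need adapting. `SimpleMarkdownConverter` and `html_to_markdown` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List, Optional

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class SimpleMarkdownConverter(HTMLParser):
    """Collect headings and paragraphs as Markdown blocks; ignore everything else."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.blocks: List[str] = []
        # Markdown prefix of the block being collected, or None when outside one.
        self._prefix: Optional[str] = None
        self._buffer: List[str] = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in HEADINGS:
            self._prefix = "#" * int(tag[1]) + " "
            self._buffer = []
        elif tag == "p":
            self._prefix = ""
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in HEADINGS or tag == "p":
            if self._prefix is not None:
                # Collapse runs of whitespace left over from the HTML layout.
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.blocks.append(self._prefix + text)
            self._prefix = None

    def handle_data(self, data):
        if self._skip_depth == 0 and self._prefix is not None:
            self._buffer.append(data)


def html_to_markdown(html: str) -> str:
    """Return the headings and paragraphs of `html` as Markdown text."""
    converter = SimpleMarkdownConverter()
    converter.feed(html)
    converter.close()
    return "\n\n".join(converter.blocks)
```

Inline formatting, tables, and lists are dropped on purpose to keep the sketch short; only the text inside headings and paragraphs survives.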

