Ask HN: What are you using to parse PDFs for RAG?

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  1. marker

    Convert PDF to markdown + JSON quickly with high accuracy

    Hi, I'm the author of marker - https://github.com/VikParuchuri/marker - from my testing, marker handles almost all the issues you mentioned. The biggest issue (that I'm working on fixing right now) is formatting tables properly.

  3. pdf2htmlEX

    Convert PDF to HTML without losing text or format. (by pdf2htmlEX)

    I previously used https://github.com/pdf2htmlEX/pdf2htmlEX to convert PDF to HTML at scale; a possible second stage would be to parse the output HTML into Markdown.
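
The second stage suggested in this comment (HTML in, Markdown out) could be sketched roughly as below. This is a minimal illustration using only Python's standard library, not something taken from the thread. It assumes the HTML carries semantic tags such as `h1` and `p`; real pdf2htmlEX output is, as far as I know, mostly positioned `div` and `span` elements, so headings would likely have to be inferred from styling instead. The names `SimpleMarkdownConverter` and `html_to_markdown` are hypothetical.

```python
from html.parser import HTMLParser


class SimpleMarkdownConverter(HTMLParser):
    """Convert a small HTML subset (headings, p/div blocks, links) to Markdown."""

    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
    BLOCKS = {"p", "div"} | set(HEADINGS)
    SKIP = {"script", "style"}  # embedded CSS/JS is not document text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []   # finished Markdown blocks
        self.current = []  # text fragments of the block being built
        self.prefix = ""   # Markdown prefix of the current block ("# ", ...)
        self.href = None   # target of the link currently open, if any
        self.skip = 0      # nesting depth inside script/style

    def _flush(self):
        # Collapse runs of whitespace and store the block if it has any text.
        text = " ".join("".join(self.current).split())
        if text:
            self.blocks.append(self.prefix + text)
        self.current = []
        self.prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip += 1
        elif tag in self.BLOCKS:
            self._flush()
            self.prefix = self.HEADINGS.get(tag, "")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            if self.href:
                self.current.append("[")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip = max(0, self.skip - 1)
        elif tag in self.BLOCKS:
            self._flush()
        elif tag == "a":
            if self.href:
                self.current.append(f"]({self.href})")
            self.href = None

    def handle_data(self, data):
        if not self.skip:
            self.current.append(data)


def html_to_markdown(html: str) -> str:
    """Return Markdown blocks separated by blank lines."""
    parser = SimpleMarkdownConverter()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text outside a block element
    return "\n\n".join(parser.blocks)
```

Calling `html_to_markdown(open("out.html", encoding="utf-8").read())` on a converted file (the file name is a placeholder) would return the text as Markdown paragraphs. Tables, lists, and images are deliberately out of scope for this sketch.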

  4. PyMuPDF

    PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

  5. RAG

    RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF (by pymupdf)

  6. pandoc

    Universal markup converter

  7. nougat

    Implementation of Nougat Neural Optical Understanding for Academic Documents

  8. AI-in-a-Box

    AI-in-a-Box leverages the expertise of Microsoft across the globe to develop and provide AI and ML solutions to the technical community. Our intent is to present a curated collection of solution accelerators that can help engineers establish their AI/ML environments and solutions rapidly and with minimal friction.

    https://github.com/Azure/AI-in-a-Box/tree/main/ai-services/d...

  10. open-parse

    Improved file parsing for LLM’s

  11. document-ai-samples

    Sample applications and demos for Document AI, the end-to-end document processing platform on Google Cloud

  12. layout-parser

    A Unified Toolkit for Deep Learning Based Document Image Analysis

  13. PDF-Extract-Kit

    A Comprehensive Toolkit for High-Quality PDF Content Extraction

  14. PaddleOCR

    Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)

  15. wdoc

    Summarize and query from a lot of heterogeneous documents. Any LLM provider, any filetype, advanced RAG, advanced summaries, scriptable, etc

    For my RAG project [WDoc](https://github.com/thiswillbeyourgithub/WDoc/tree/dev) I use multiple PDF parsers, then use heuristics to keep the best output. The code is at https://github.com/thiswillbeyourgithub/WDoc/blob/654c05c5b2...

    The heuristics are partly based on using fastText to detect languages: https://github.com/thiswillbeyourgithub/WDoc/blob/654c05c5b2...

    It's probably crap for tables but I don't want to rely on external parsers.
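
The "run several parsers, keep the best output" idea from this comment could look roughly like the sketch below. It is not wdoc's code: the comment says its heuristics rely partly on fastText language detection, whereas this example swaps in simple character and word-length statistics so that it runs with the standard library alone. The names `text_quality` and `pick_best`, and the parser callables passed in, are hypothetical.

```python
from typing import Callable, Dict, Tuple


def text_quality(text: str) -> float:
    """Score extracted text in [0, 1]; higher looks more like clean prose.

    Hypothetical heuristic for illustration, not the one wdoc uses.
    """
    stripped = text.strip()
    if not stripped:
        return 0.0
    # Share of characters that are letters, digits, whitespace, or common punctuation.
    ok = sum(ch.isalnum() or ch.isspace() or ch in ".,;:!?'\"()-" for ch in stripped)
    clean_ratio = ok / len(stripped)
    # Share of whitespace-separated tokens with a plausible word length.
    words = stripped.split()
    plausible_ratio = sum(1 for w in words if len(w) <= 20) / len(words)
    return clean_ratio * plausible_ratio


def pick_best(path: str, parsers: Dict[str, Callable[[str], str]]) -> Tuple[str, str]:
    """Run every parser on `path` and return (parser_name, text) with the top score."""
    best_name, best_text, best_score = "", "", -1.0
    for name, parse in parsers.items():
        try:
            text = parse(path)
        except Exception:
            continue  # a parser that crashes on this file is simply skipped
        score = text_quality(text)
        if score > best_score:
            best_name, best_text, best_score = name, text, score
    if best_score < 0:
        raise RuntimeError(f"all parsers failed on {path}")
    return best_name, best_text
```

In practice each entry in `parsers` would be a thin wrapper around one of the libraries listed on this page, taking a file path and returning plain text. A score like this only ranks prose quality, so it says nothing about whether tables were extracted correctly.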

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.


