PDF

Top 23 PDF Open-Source Projects

  1. Stirling-PDF

    #1 Locally hosted web application that allows you to perform various operations on PDF files

    Project mention: A free, unlimited online PDF converter with Privacy focus | news.ycombinator.com | 2025-01-03

    Congrats on the launch, it is interesting. Do you have plans to open source the project?

    I'm a happy user of Stirling-PDF [1], which covers all my PDF needs. I host it on my own network, not accessible from the internet, for better privacy.

    [1] https://github.com/Stirling-Tools/Stirling-PDF

  3. siyuan

    Privacy-first, self-hosted, fully open-source personal knowledge management software, written in TypeScript and Golang.

    Project mention: French gov's open source alternative to Notion or Outline | news.ycombinator.com | 2025-03-16
  4. MinerU

    A high-quality, one-stop open-source data extraction tool that converts PDF to Markdown and JSON.

    Project mention: Gemini beats everyone on new OCR benchmark | news.ycombinator.com | 2025-02-14

    The systems they tested are mostly used as part of a larger system. A fairer comparison would be to use something like MinerU [1] and a proper benchmark like OHR Bench and Reducto's table bench. This paper is really bad...

    [1]: https://github.com/opendatalab/MinerU

  5. docling

    Get your documents ready for gen AI

    Project mention: Document Loading, Parsing, and Cleaning in AI Applications | dev.to | 2025-04-24

    Libraries like MarkItDown and Docling can convert PDFs and other formats to Markdown. Markdown has become one of the cleanest and most efficient formats for ingesting data into LLMs because it's nearly plaintext and token-efficient. It can also efficiently represent non-text data like tables.

  6. OCRmyPDF

    OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

    Project mention: 13 GitHub Projects that Supercharge Your AI and Development Journey 🚀 | dev.to | 2025-03-03

    Stars: 19,899 | Author: ocrmypdf

  7. paperless-ngx

    A community-supported supercharged version of paperless: scan, index and archive all your physical documents

    Project mention: Paperless-ngx: scan, index and archive all your physical documents | news.ycombinator.com | 2024-09-30
  8. Awesome-CV

    📄 Awesome CV is a LaTeX template for your outstanding job application

  10. awesome-english-ebooks

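    Free downloads of English-language magazines such as The Economist (with audio), The New Yorker, The Guardian, Wired, and The Atlantic, in EPUB, MOBI, and PDF formats, updated weekly.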

  11. koodo-reader

    A modern ebook manager and reader with sync and backup capabilities for Windows, macOS, Linux, Android, iOS and Web

    Project mention: Koodo Reader – A cross-platform eBook reader | news.ycombinator.com | 2025-04-21
  12. koreader

    An ebook reader application supporting PDF, DjVu, EPUB, FB2 and many more formats, running on Cervantes, Kindle, Kobo, PocketBook and Android devices

    Project mention: Lux – a luxurious package manager for Lua | news.ycombinator.com | 2025-04-07

    I know some projects like Koreader[1] use Lua as their primary application language. If you could convince one of them to switch, it would provide some assurances about the maturity and popularity of the idea.

    [1]: https://github.com/koreader/koreader

  13. Etherpad

    Etherpad: A modern really-real-time collaborative document editor.

  14. best-resume-ever

    👔 💼 Build multiple beautiful resumes quickly 🚀 and easily, and create your best CV ever! Made with Vue and LESS.

  15. react-pdf

    📄 Create PDF files using React

    Project mention: Pdf Generation Libraries Comparison | dev.to | 2025-01-27

    React PDF

  16. ai-pdf-chatbot-langchain

    AI PDF chatbot agent built with LangChain & LangGraph

  17. sumatrapdf

    SumatraPDF reader

  18. mit-deep-learning-book-pdf

    MIT Deep Learning Book in PDF format (complete and parts) by Ian Goodfellow, Yoshua Bengio and Aaron Courville

    Project mention: Top Github repositories for 10+ programming languages | dev.to | 2024-07-16

    MIT deep learning PDF

  19. QuestPDF

    QuestPDF is a modern open-source .NET library for PDF document generation. It offers a comprehensive layout engine powered by a concise and discoverable C# fluent API. Easily generate PDF reports, invoices, exports, etc.

    Project mention: QuestPDF HTML to PDF C# Alternatives For .NET Developers | dev.to | 2024-07-19

    PDF (Portable Document Format) is widely used to save or send data in a portable, secure format. When it comes to writing data into a PDF file or designing a document like an invoice, C# developers often turn to robust libraries. Two popular libraries for these tasks are IronPDF and QuestPDF. In this article, we'll delve into how to use QuestPDF for HTML to PDF conversion and compare its features with those of IronPDF.

  20. xournalpp

    Xournal++ is a handwriting notetaking software with PDF annotation support. Written in C++ with GTK3, supporting Linux (e.g. Ubuntu, Debian, Arch, SUSE), macOS and Windows 10. Supports pen input from devices such as Wacom Tablets.

    Project mention: Six useful tools for corporate tasks | dev.to | 2025-03-08

  21. h2ogpt

    Private chat with local GPT with document, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://gpt-docs.h2o.ai/

    Project mention: Major Technologies Worth Learning in 2025 for Data Professionals | dev.to | 2024-12-07

    Artificial Intelligence (AI) is becoming a ubiquitous, and dare I say, indispensable part of data workflows. Tools like ChatGPT have made it easier to review data and write reports. But diving even deeper, tools like DataRobot, H2O.ai, and Google’s AutoML are also simplifying machine learning pipelines and automating repetitive tasks, enabling professionals to focus on high-value activities like model optimization and data storytelling. Mastering these tools will not only boost productivity but also ensure you remain competitive in an AI-first world.

  22. milewski-ctfp-pdf

    Bartosz Milewski's 'Category Theory for Programmers' unofficial PDF and LaTeX source

    Project mention: Category Theory in Programming | news.ycombinator.com | 2024-12-01

    IMO Bartosz Milewski gave a pretty good answer to the "why" question in the preface to his book:

    > Second, there are many different kinds of math, and they appeal to different audiences. You might be allergic to calculus or algebra, but it doesn’t mean you won’t enjoy category theory. I would go as far as to argue that category theory is the kind of math that is particularly well suited for the minds of programmers. That’s because category theory — rather than dealing with particulars — deals with structure. It deals with the kind of structure that makes programs composable.

    Composition is at the very root of category theory — it’s part of the definition of the category itself. And I will argue strongly that composition is the essence of programming. We’ve been composing things forever, long before some great engineer came up with the idea of a subroutine. Some time ago the principles of structured programming revolutionized programming because they made blocks of code composable. Then came object oriented programming, which is all about composing objects. Functional programming is not only about composing functions and algebraic data structures — it makes concurrency composable — something that’s virtually impossible with other programming paradigms.

    https://bartoszmilewski.com/2014/10/28/category-theory-for-p...

    And regarding:

    > Anything that could be useful to you from CT can be explained in one afternoon over some coffee or beer.

    Yes, you can go through the definitions, but you won't understand all of those concepts in one afternoon unless you're a savant.

  23. unstructured

    Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

    Project mention: Parsing PDFs (and more) in Elixir using Rust | news.ycombinator.com | 2025-01-29

    I've been thinking a lot about how to accomplish various RAG things in Elixir (for LLM applications). PDF is one of the missing pieces, so glad to see work here. The really tricky part is not just parsing out the text (you can just call the pdftotext unix command line utility for that), but accurately pulling out things like complex tables, etc in a way that could be chunked/post processed in a useful way. I'd love to see something like Unstructured or Marker but in Rust that Elixir could NIF out to it.

    - https://github.com/Unstructured-IO/unstructured#eight_pointe...

    - https://github.com/VikParuchuri/marker

  24. Zettlr

    Your One-Stop Publication Workbench

    Project mention: Information flow - how I capture the notes | dev.to | 2024-08-26

    zettlr - great for long form, but missing some daily use functions.

  25. documenso

    The Open Source DocuSign Alternative.

    Project mention: The Open Source DocuSign Alternative | news.ycombinator.com | 2025-02-04
NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).
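
One of the comments quoted above (under unstructured) points out that plain text extraction can be handled by calling the pdftotext command-line utility, and that the harder part is getting output that can be chunked and post-processed. A minimal Python sketch of that simpler half follows. It assumes the pdftotext binary (from Poppler) is installed and on PATH; `extract_text` and `chunk_paragraphs` are hypothetical helper names chosen for illustration, and the greedy paragraph packing is only one possible chunking strategy. It does not attempt table extraction.

```python
import shutil
import subprocess


def extract_text(pdf_path: str) -> str:
    """Return the text of a PDF by shelling out to pdftotext.

    Assumption: the pdftotext binary (Poppler) is installed and on PATH.
    """
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    # "-layout" preserves the physical layout; "-" writes the text to stdout.
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def chunk_paragraphs(text: str, max_chars: int = 1000) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks.

    Hypothetical helper: a paragraph longer than max_chars is kept whole
    as its own oversized chunk rather than being split mid-paragraph.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow, so close the chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

A typical call would be `chunk_paragraphs(extract_text("report.pdf"))`, where `report.pdf` is a placeholder path. Keeping paragraphs intact is a deliberate simplification; complex tables would still need a layout-aware tool such as those listed above.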

PDF related posts

  • Practical Ways to Generate PDFs in Go: Libraries, LaTeX, Pandoc, Chrome

    4 projects | dev.to | 27 Apr 2025
  • Koodo Reader – A cross-platform eBook reader

    1 project | news.ycombinator.com | 21 Apr 2025
  • LLM.pdf – Run LLMs Inside a PDF

    1 project | news.ycombinator.com | 21 Apr 2025
  • A pandoc LaTeX template to convert Markdown files to PDF or LaTeX

    1 project | news.ycombinator.com | 14 Apr 2025
  • PDF Diff Tool – Show differences between two PDF files visually

    1 project | news.ycombinator.com | 14 Apr 2025
  • Docling “Enrichment Features”

    1 project | dev.to | 13 Apr 2025
  • PDF.js VS EmbedPDF - a user suggested alternative

    2 projects | 4 Apr 2025

Index

What are some of the best open-source PDF projects? This list will help you:

#   Project                      Stars
1   Stirling-PDF                 56,251
2   siyuan                       34,081
3   MinerU                       32,298
4   docling                      28,423
5   OCRmyPDF                     28,226
6   paperless-ngx                26,717
7   Awesome-CV                   24,278
8   awesome-english-ebooks       23,888
9   koodo-reader                 22,034
10  koreader                     20,624
11  Etherpad                     17,362
12  best-resume-ever             16,397
13  react-pdf                    15,557
14  ai-pdf-chatbot-langchain     15,403
15  sumatrapdf                   14,639
16  mit-deep-learning-book-pdf   13,247
17  QuestPDF                     12,779
18  xournalpp                    12,542
19  h2ogpt                       11,782
20  milewski-ctfp-pdf            11,215
21  unstructured                 10,985
22  Zettlr                       10,960
23  documenso                    10,822
