Top 23 PDF Open-Source Projects
-
Stirling-PDF
#1 Locally hosted web application that allows you to perform various operations on PDF files
Project mention: A free, unlimited online PDF converter with Privacy focus | news.ycombinator.com | 2025-01-03
Congrats on the launch, it is interesting. Do you have plans for open source the project?
I'm a happy user of Stirling-PDF [1] which provides all my PDF needs. I do host it in my network and not accessible from internet for better privacy.
[1] https://github.com/Stirling-Tools/Stirling-PDF
-
siyuan
A privacy-first, self-hosted, fully open source personal knowledge management software, written in typescript and golang.
Project mention: French gov's open source alternative to Notion or Outline | news.ycombinator.com | 2025-03-16
-
MinerU
A high-quality tool for converting PDF to Markdown and JSON. A one-stop, open-source, high-quality data extraction tool that converts PDFs into Markdown and JSON formats.
The system they tested are mostly used as a part of a larger system. A more fair comparison would be to use something like MinerU [1] and proper benchmark like the OHR Bench and Reductos table bench. This paper is really bad...
[1]: https://github.com/opendatalab/MinerU
-
Libraries like MarkItDown and Docling can convert PDFs and other formats to Markdown. Markdown has become one of the cleanest and most efficient formats for ingesting data into LLMs because it's nearly plaintext and token-efficient. It can also efficiently represent non-text data like tables.
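As a small illustration of why Markdown suits tabular data, the sketch below renders rows as a pipe-delimited Markdown table using only the Python standard library. The function name `to_markdown_table` is hypothetical and is not part of MarkItDown or Docling; the sample rows are taken from the index at the end of this page.

```python
def to_markdown_table(headers, rows):
    """Render a header list and a list of rows as a Markdown table string."""

    def fmt(cells):
        # Escape literal pipes so a cell value cannot break the column layout.
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"

    separator = "| " + " | ".join("---" for _ in headers) + " |"
    lines = [fmt(headers), separator]
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical usage with two rows from the index table.
    print(to_markdown_table(
        ["Project", "Stars"],
        [["Stirling-PDF", 56251], ["siyuan", 34081]],
    ))
```

The result is close to plain text: one line per row and only a few delimiter characters per cell, which is why the format tends to be economical for LLM input.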
-
Project mention: 13 GitHub Projects that Supercharge Your AI and Development Journey 🚀 | dev.to | 2025-03-03
Stars: 19899 Author: ocrmypdf Star the OCRmyPDF repository⭐
-
paperless-ngx
A community-supported supercharged version of paperless: scan, index and archive all your physical documents
Project mention: Paperless-ngx: scan, index and archive all your physical documents | news.ycombinator.com | 2024-09-30
-
-
koodo-reader
A modern ebook manager and reader with sync and backup capabilities for Windows, macOS, Linux, Android, iOS and Web
-
koreader
An ebook reader application supporting PDF, DjVu, EPUB, FB2 and many more formats, running on Cervantes, Kindle, Kobo, PocketBook and Android devices
I know some projects like Koreader[1] use Lua as their primary application language. If you could convince one of them to switch, it would provide some assurances about the maturity and popularity of the idea.
[1]: https://github.com/koreader/koreader
-
-
best-resume-ever
Build multiple beautiful resumes quickly and easily, and create your best CV ever! Made with Vue and LESS.
-
React PDF
-
-
-
mit-deep-learning-book-pdf
MIT Deep Learning Book in PDF format (complete and parts) by Ian Goodfellow, Yoshua Bengio and Aaron Courville
MIT deep learning PDF
-
QuestPDF
QuestPDF is a modern open-source .NET library for PDF document generation. It offers a comprehensive layout engine powered by a concise and discoverable C# Fluent API. Easily generate PDF reports, invoices, exports, etc.
PDF (Portable Document Format) is widely used to save or send data in a portable, secure format. When it comes to writing data into a PDF file or designing a document like an invoice, C# developers often turn to robust libraries. Two popular libraries for these tasks are IronPDF and QuestPDF. In this article, we'll delve into how to use QuestPDF for HTML to PDF conversion and compare its features with those of IronPDF.
-
xournalpp
Xournal++ is a handwriting notetaking software with PDF annotation support. Written in C++ with GTK3, supporting Linux (e.g. Ubuntu, Debian, Arch, SUSE), macOS and Windows 10. Supports pen input from devices such as Wacom Tablets.
-
h2ogpt
Private chat with local GPT with document, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://gpt-docs.h2o.ai/
Project mention: Major Technologies Worth Learning in 2025 for Data Professionals | dev.to | 2024-12-07
Artificial Intelligence (AI) is becoming a ubiquitous, and dare I say, indispensable part of data workflows. Tools like ChatGPT have made it easier to review data and write reports. But diving even deeper, tools like DataRobot, H2O.ai, and Google’s AutoML are also simplifying machine learning pipelines and automating repetitive tasks, enabling professionals to focus on high-value activities like model optimization and data storytelling. Mastering these tools will not only boost productivity but also ensure you remain competitive in an AI-first world.
-
milewski-ctfp-pdf
Bartosz Milewski's 'Category Theory for Programmers' unofficial PDF and LaTeX source
IMO Bartosz Milewski gave a pretty good answer to the "why" question in the preface to his book:
> Second, there are many different kinds of math, and they appeal to different audiences. You might be allergic to calculus or algebra, but it doesn’t mean you won’t enjoy category theory. I would go as far as to argue that category theory is the kind of math that is particularly well suited for the minds of programmers. That’s because category theory — rather than dealing with particulars — deals with structure. It deals with the kind of structure that makes programs composable.
> Composition is at the very root of category theory — it’s part of the definition of the category itself. And I will argue strongly that composition is the essence of programming. We’ve been composing things forever, long before some great engineer came up with the idea of a subroutine. Some time ago the principles of structured programming revolutionized programming because they made blocks of code composable. Then came object oriented programming, which is all about composing objects. Functional programming is not only about composing functions and algebraic data structures — it makes concurrency composable — something that’s virtually impossible with other programming paradigms.
https://bartoszmilewski.com/2014/10/28/category-theory-for-p...
And regarding:
> Anything that could be useful to you from CT can be explained in one afternoon over some coffee or beer.
Yes, you can go through the definitions, but you won't understand all of those concepts in one afternoon unless you're a savant.
-
unstructured
Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
I've been thinking a lot about how to accomplish various RAG things in Elixir (for LLM applications). PDF is one of the missing pieces, so glad to see work here. The really tricky part is not just parsing out the text (you can just call the pdftotext unix command line utility for that), but accurately pulling out things like complex tables, etc in a way that could be chunked/post processed in a useful way. I'd love to see something like Unstructured or Marker but in Rust that Elixir could NIF out to it.
- https://github.com/Unstructured-IO/unstructured#eight_pointe...
- https://github.com/VikParuchuri/marker
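The "easy part" the comment describes, plain text extraction followed by chunking, can be sketched roughly as follows. This assumes Poppler's `pdftotext` is installed and on the PATH; the helper names (`build_pdftotext_cmd`, `extract_text`, `chunk_paragraphs`) and the `max_chars` default are hypothetical. It does not attempt the hard part the comment points to, namely recovering complex tables.

```python
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path, layout=True):
    """Build the argument list for Poppler's pdftotext, writing text to stdout."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # try to preserve the physical page layout
    cmd += [pdf_path, "-"]  # "-" as the output file sends text to stdout
    return cmd


def extract_text(pdf_path):
    """Run pdftotext and return the extracted text (assumes it is installed)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; it ships with Poppler utilities")
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path), capture_output=True, text=True, check=True
    )
    return result.stdout


def chunk_paragraphs(text, max_chars=1000):
    """Greedily pack blank-line-separated paragraphs into chunks.

    Chunks stay within max_chars where possible; a single paragraph longer
    than max_chars is kept whole rather than split.
    """
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

A call such as `chunk_paragraphs(extract_text("report.pdf"))` (the file name is a placeholder) would yield text chunks for a retrieval pipeline. Paragraph-based chunking is only one simple choice; table-aware tools like those named above exist because this approach loses table structure.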
-
zettlr - great for long form, but missing some daily use functions.
-
PDF discussion
PDF related posts
-
Practical Ways to Generate PDFs in Go: Libraries, LaTeX, Pandoc, Chrome
-
Koodo Reader – A cross-platform eBook reader
-
LLM.pdf – Run LLMs Inside a PDF
-
A pandoc LaTeX template to convert Markdown files to PDF or LaTeX
-
PDF Diff Tool – Show differences between two PDF files visually
-
Docling “Enrichment Features”
-
PDF.js VS EmbedPDF - a user suggested alternative
2 projects | 4 Apr 2025
Index
What are some of the best open-source PDF projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | Stirling-PDF | 56,251 |
2 | siyuan | 34,081 |
3 | MinerU | 32,298 |
4 | docling | 28,423 |
5 | OCRmyPDF | 28,226 |
6 | paperless-ngx | 26,717 |
7 | Awesome-CV | 24,278 |
8 | awesome-english-ebooks | 23,888 |
9 | koodo-reader | 22,034 |
10 | koreader | 20,624 |
11 | Etherpad | 17,362 |
12 | best-resume-ever | 16,397 |
13 | react-pdf | 15,557 |
14 | ai-pdf-chatbot-langchain | 15,403 |
15 | sumatrapdf | 14,639 |
16 | mit-deep-learning-book-pdf | 13,247 |
17 | QuestPDF | 12,779 |
18 | xournalpp | 12,542 |
19 | h2ogpt | 11,782 |
20 | milewski-ctfp-pdf | 11,215 |
21 | unstructured | 10,985 |
22 | Zettlr | 10,960 |
23 | documenso | 10,822 |