Show HN: I am building a new Python library to read/write PDF files

pdfsyntax

8 420 8.5 Python

A Python library to inspect and modify the internal structure of a PDF file

I never knew about the J number suffix in Python: https://docs.python.org/3/reference/lexical_analysis.html#im... It appears to be used here to represent references: https://github.com/desgeeko/pdfsyntax/blob/main/tests/test_p...
I wish you good luck; this file format has tripped up many, many a developer. It blew up on a PDF I had lying around:
    ValueError: could not convert string to float: b'5.0.0'
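
Both points in the comment above can be illustrated in a few lines. Python reads a number followed by `j` or `J` as an imaginary literal, so a single complex value can carry two integers. The sketch below assumes a hypothetical encoding (imaginary part for the object number, real part for the generation number); it is not taken from the pdfsyntax source, and `make_ref`, `split_ref` and `parse_number` are names invented for this example. The last helper shows how `float()` rejects a token such as `b'5.0.0'` and one way a tokenizer might report that without crashing.

```python
from typing import Optional, Tuple


def make_ref(obj_num: int, gen_num: int = 0) -> complex:
    """Pack an indirect reference such as `12 0 R` into one complex number.

    Hypothetical encoding, assumed for illustration only:
    imaginary part = object number, real part = generation number.
    """
    return complex(gen_num, obj_num)


def split_ref(ref: complex) -> Tuple[int, int]:
    """Return (object number, generation number) from a packed reference."""
    return int(ref.imag), int(ref.real)


def parse_number(token: bytes) -> Optional[float]:
    """Convert a raw token to a float, or return None if it is not numeric."""
    try:
        return float(token)  # float() accepts bytes as well as str
    except ValueError:
        # This is the path taken by a token like b'5.0.0'.
        return None


ref = 12j  # the literal 12j is the same value as complex(0, 12)
```

Returning `None` is only one option; a real parser might instead fall back to treating the token as a name or string, depending on where it appeared.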

pdfquery

3 753 0.0 Python

A fast and friendly PDF scraping library.

In the field of PDF parsing, I think the most interesting project I have encountered is pdfquery[1], where the PDF is parsed as an XML tree and you can use XPath to query it. There are issues when you put it into production, but the idea makes you ask "why haven't I thought of this?", because a PDF has a tree-like structure.
[1]: https://github.com/jcushman/pdfquery
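
The idea of querying a page layout as an XML tree can be tried out with the standard library alone. The sketch below does not use pdfquery; the XML document, its element names (`pdf`, `page`, `textline`) and the helper `lines_on_page` are all made up for illustration, and `xml.etree.ElementTree` only supports a subset of XPath.

```python
import xml.etree.ElementTree as ET

# A hand-written stand-in for the kind of tree a layout-to-XML step might emit.
LAYOUT_XML = """
<pdf>
  <page number="1">
    <textline x0="72" y0="700">Invoice 2022-11</textline>
    <textline x0="72" y0="680">Total: 42.00</textline>
  </page>
  <page number="2">
    <textline x0="72" y0="700">Terms and conditions</textline>
  </page>
</pdf>
"""

root = ET.fromstring(LAYOUT_XML)


def lines_on_page(tree_root, number):
    """Return the text of every textline element on the given page."""
    query = f".//page[@number='{number}']/textline"
    return [element.text for element in tree_root.findall(query)]


print(lines_on_page(root, 1))
```

Once the document is a tree, selecting "the line that follows a given label" or "everything inside this bounding box" becomes a query rather than hand-written traversal code.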

kaitai_struct_formats

3 682 6.3 Kaitai Struct

Kaitai Struct: library of binary file formats (.ksy)

This is tangential to your submission, but PDF is the file format I use for exercising any library that claims to describe file formats declaratively (à la https://github.com/kaitai-io/kaitai_struct_formats#readme ).

pdfplumber

29 5,565 8.2 Python

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

I recently tried pdfplumber [1] to extract (relatively) awkwardly formatted tables from PDFs, and it was a great experience. I can recommend it.
[1]: https://github.com/jsvine/pdfplumber
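
A minimal sketch of what such a table extraction might look like. It relies on `pdfplumber.open`, `pdf.pages` and `page.extract_table()`, which returns a list of rows or `None` when no table is found; the helper names `rows_to_records` and `extract_first_table`, and the assumption that the first row is a header, are choices made for this example only.

```python
def rows_to_records(table):
    """Turn [[header...], [row...], ...] into a list of dicts keyed by header.

    Assumes the first row is a header row, which is not true of every table.
    """
    if not table:  # covers both None (no table found) and an empty list
        return []
    header, *rows = table
    return [dict(zip(header, row)) for row in rows]


def extract_first_table(path, page_index=0):
    """Extract the first table on one page of a PDF as a list of dicts."""
    import pdfplumber  # third-party package, imported lazily

    with pdfplumber.open(path) as pdf:
        table = pdf.pages[page_index].extract_table()
    return rows_to_records(table)
```

Empty cells come back as `None`, so downstream code should be ready for missing values.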

i7j-rups

3 248 5.3 Java

RUPS is an acronym for Reading and Updating PDF Syntax. RUPS is a tool built on top of iText® that allows you to look inside a PDF document and browse the different PDF objects and content streams.

> find a version of iText RUPS application from somewhere on the internet
You mean this, right? https://github.com/itext/i7j-rups#readme

WeasyPrint

43 6,646 9.5 Python

The awesome document factory

PDF.js

84 46,263 9.9 JavaScript

PDF Reader in JavaScript

The PDF format is diverse enough for such a new project to still have plenty of incompatibilities. If you wanted to find many of them quickly, you might want to have a look at the documents used as test cases in other projects, such as pdf.js:
https://github.com/mozilla/pdf.js/tree/master/test/pdfs
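
One way to act on this suggestion is a small harness that runs a parser over a directory of sample files and records which ones fail. The sketch below assumes the corpus has already been downloaded locally; `run_corpus` and `toy_parse` are hypothetical names, and `toy_parse` is only a stand-in for whatever entry point the library under test exposes.

```python
from pathlib import Path


def run_corpus(directory, parse):
    """Call parse(bytes) on every *.pdf under directory.

    Returns a dict mapping file name to a short description of the failure.
    """
    failures = {}
    for path in sorted(Path(directory).rglob("*.pdf")):
        try:
            parse(path.read_bytes())
        except Exception as exc:  # the harness should survive any failure
            failures[path.name] = f"{type(exc).__name__}: {exc}"
    return failures


def toy_parse(data):
    """Stand-in parser: only checks that the file starts with a PDF header."""
    if not data.startswith(b"%PDF-"):
        raise ValueError("missing %PDF- header")
```

Grouping the resulting messages by exception type tends to show quickly which incompatibilities are most common.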

bericht

1 14 0.0 Python

Incremental HTML to PDF converter.

polyfile

2 322 7.6 Python

A pure Python cleanroom implementation of libmagic, with instrumented parsing from Kaitai struct and an interactive hex viewer

Be careful with PDF! There are many ambiguities in the specification that are implemented differently between parsers, as well as malformations that almost all parsers will silently accept without warning. It is very easy to accidentally produce so-called file format schizophrenia, where the same file is rendered differently by two parsers. For example, what if you have a PDF stream object whose declared length doesn't agree with the position of its `endstream` token? What if you have a PDF dictionary with duplicate keys? Do you use the value of the first key or the second? What if you have two valid PDFs concatenated one after the other? Do you render the first or the second? What if an object in the XREF table has an incorrect offset?
Shameless plug: I am one of the maintainers of PolyFile, which, among other things, can produce an interactive HTML hex editor with an annotated syntax tree for dozens of file types, including PDF. For PDF, it uses a dynamically instrumented version of the PDFMiner parser. It sounds like it might satisfy your use case.
https://github.com/trailofbits/polyfile
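
Two of the ambiguities listed above can be detected with a few lines of byte-level scanning. This is a rough sketch under strong assumptions: it only recognizes streams whose dictionary ends with a direct integer `/Length` immediately before `>>`, it ignores indirect lengths and compressed object streams, and the function names are invented for this example.

```python
import re

# Matches "... /Length 5 >> stream<EOL>" when /Length is the last dictionary key.
STREAM_RE = re.compile(rb"/Length\s+(\d+)\s*>>\s*stream\r?\n")


def find_length_mismatches(data):
    """Return (offset, declared, actual) for streams whose /Length is wrong."""
    mismatches = []
    for match in STREAM_RE.finditer(data):
        declared = int(match.group(1))
        start = match.end()
        end = data.find(b"endstream", start)
        if end == -1:
            continue  # unterminated stream: a different kind of problem
        body = data[start:end]
        # An end-of-line marker before `endstream` is not part of the data.
        for eol in (b"\r\n", b"\n", b"\r"):
            if body.endswith(eol):
                body = body[: -len(eol)]
                break
        if len(body) != declared:
            mismatches.append((match.start(), declared, len(body)))
    return mismatches


def count_eof_markers(data):
    """Count %%EOF markers. More than one may mean concatenated files,
    but incremental updates also legitimately add extra markers."""
    return len(re.findall(rb"%%EOF", data))
```

A parser that finds a mismatch still has to choose which value to trust; logging the choice makes it possible to compare behavior with other readers.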

djot

43 1,580 5.8 HTML

A light markup language

> the most reliable way of generating PDFs
See pandoc: https://pandoc.org/
And a variety of intermediate or input text formats, where you can pick your preferred poison whether for book publishing, research papers, math papers, technical documentation, slides, etc.
Including the author's own djot: https://github.com/jgm/djot
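
For readers who want to script this, pandoc can be driven from Python through its command line. The sketch below uses the standard invocation `pandoc INPUT -o OUTPUT` and the `--pdf-engine` option; the function names and file names are placeholders, and pandoc plus a PDF engine (a LaTeX distribution, for example) are assumed to be installed.

```python
import subprocess


def build_pandoc_command(source, output, pdf_engine=None):
    """Build the argument list for converting one document with pandoc."""
    command = ["pandoc", str(source), "-o", str(output)]
    if pdf_engine:
        command.append(f"--pdf-engine={pdf_engine}")
    return command


def convert(source, output, pdf_engine=None):
    """Run pandoc and raise CalledProcessError if the conversion fails."""
    subprocess.run(build_pandoc_command(source, output, pdf_engine), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.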

pandoc

420 32,396 9.8 Haskell

Universal markup converter

PyMuPDF

5 4,053 9.8 Python

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

I think you might mean PyMuPDF (https://github.com/pymupdf/PyMuPDF), a Python library built on top of the MuPDF C library (https://mupdf.com/).
PyMuPDF and MuPDF are both available under dual open source AGPL and commercial licenses. They have been around for many years and are under continual development.
[Disclaimer: I work for Artifex, who wrote MuPDF and recently acquired PyMuPDF.]

mupdf

28 55 8.8 C

mirrored from git://git.ghostscript.com/mupdf.git (by ccxvii)

annotated-pdf-spec

1 5 10.0

Collection of useful hints for implementing a PDF library

As others have already written, there are many slightly invalid PDF files out there in the wild that many readers can display mostly fine and which your library should also be able to handle.
If you can, grab yourself a copy of the most recent PDF 2.0 specification since it contains much more information and is much more correct in terms of how to implement things. Also have a look at the errata at https://pdf-issues.pdfa.org/32000-2-2020/index.html.
As I'm implementing a PDF library (in Ruby), I have started to collect some situations that arise in the wild but are not spec-compliant; see https://github.com/gettalong/annotated-pdf-spec. That might help you in parsing some invalid PDFs.
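
As one example of the kind of leniency described above, a file may carry a few junk bytes before its `%PDF-` header and still open in many readers. The sketch below searches a leading window for the header instead of requiring it at offset 0. The 1024-byte window is an assumption based on commonly reported reader behavior, not a value taken from this thread, and `find_header` is a hypothetical helper.

```python
import re

HEADER_RE = re.compile(rb"%PDF-(\d\.\d)")


def find_header(data, window=1024):
    """Locate the PDF header within the first `window` bytes.

    Returns (offset, version string), or None if no header is found.
    """
    match = HEADER_RE.search(data[:window])
    if match is None:
        return None
    return match.start(), match.group(1).decode("ascii")
```

A non-zero offset matters later on: depending on the producer, the byte offsets in the cross-reference table may be relative to the header rather than to the start of the file.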

pdf-issues

2 62 8.8

Industry-based resolutions for issues and errata reported against any PDF-related specification

Merely for your consideration, if those were actual issues on that repo, (a) it would allow adding labels to them (as in https://github.com/pdf-association/pdf-issues/issues?q=is%3A... ), (b) folks could comment, acting as a low-rent Stack Overflow, and (c) it would allow anyone to contribute new ones versus the "PR against README.md" situation right now.
That also more closely matches the mental model of those items: bugs against the specification, whether the official PDF Association agrees that they are or not.


NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.


wkhtmltopdf - Convert HTML to PDF

3 projects | /r/commandline | 26 Mar 2021
Show HN: A new open-source library to design PDF using React

2 projects | news.ycombinator.com | 17 Feb 2024
1.5M PDFs in 25 Minutes

2 projects | news.ycombinator.com | 15 Feb 2024
Yara scanning PDF files

1 project | /r/computerforensics | 1 Jun 2023
I need help install PyPDF2 library on my computer

1 project | /r/learnpython | 12 May 2023

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
PDF Python Converter Specific Formats Processing Text
Post date: 17 Nov 2022
