Show HN: I am building a new Python library to read/write PDF files

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • pdfsyntax

    A Python library to inspect and modify the internal structure of a PDF file

    I never knew about the J number suffix in Python: https://docs.python.org/3/reference/lexical_analysis.html#im... It appears to be used to represent references: https://github.com/desgeeko/pdfsyntax/blob/main/tests/test_p...
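
    For readers who have not met the suffix either: appending `j` (or `J`) to a numeric literal produces a `complex` value with a zero real part. How pdfsyntax maps such literals onto PDF indirect references is not shown here, so the `as_object_number` helper below is a hypothetical illustration of how an imaginary literal could carry an object number, not the library's API.

```python
# Imaginary literals are part of the language: 12j == complex(0, 12).
ref = 12j


def as_object_number(value: complex) -> int:
    """Hypothetical helper: read an object number back out of an imaginary literal."""
    if value.real != 0 or value.imag != int(value.imag):
        raise ValueError(f"not a whole imaginary number: {value!r}")
    return int(value.imag)


print(type(ref).__name__, as_object_number(ref))
```

    The appeal of such an encoding is that a reference stays a plain built-in value that is still distinguishable from ordinary integers and floats.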

    I wish you good luck; this file format has tripped up many, many a developer. It blew up on a PDF I had lying around:

        ValueError: could not convert string to float: b'5.0.0'
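
    This is the error Python's `float()` raises when handed a byte string with two decimal points. A minimal sketch of a more forgiving numeric reader follows. `parse_pdf_number` is a hypothetical name, and falling back to the longest valid numeric prefix is only one possible policy for a lenient parser, not what pdfsyntax does.

```python
import re

# Optional sign, then digits with at most one decimal point (5, -3.2, .5, 4.).
_NUMBER_PREFIX = re.compile(rb"[+-]?(?:\d+\.?\d*|\.\d+)")


def parse_pdf_number(token: bytes) -> float:
    """Parse a numeric token, salvaging the longest valid prefix if needed."""
    try:
        return float(token)
    except ValueError:
        match = _NUMBER_PREFIX.match(token)
        if match is None:
            raise  # nothing numeric to salvage: keep the original error
        return float(match.group())


print(parse_pdf_number(b"5.0.0"))
```

    Whether to salvage, warn, or fail hard on such tokens is a design decision; silently accepting them is exactly the kind of leniency that makes parsers disagree with each other.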

  • pdfquery

    A fast and friendly PDF scraping library.

    In the field of PDF parsing, I think the most interesting project I have encountered is pdfquery[1], where the PDF is parsed as an XML tree and you can use XPath to query it. There are issues when you put it into production, but the idea makes you ask “why haven’t I thought of this?”, because PDF has a tree-like structure.

    [1]: https://github.com/jcushman/pdfquery
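
    The idea of querying a parsed page as a tree can be illustrated with the standard library alone. The sketch below does not use pdfquery; the element and attribute names (`page`, `textline`, `x0`, `y0`) are made up for illustration, and `xml.etree.ElementTree` supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# A hand-written stand-in for the tree a PDF-to-XML parser might produce.
DOC = """
<pdf>
  <page number="1">
    <textline x0="72" y0="700">Invoice 2023-001</textline>
    <textline x0="72" y0="680">Total: 42.00</textline>
  </page>
  <page number="2">
    <textline x0="72" y0="700">Terms and conditions</textline>
  </page>
</pdf>
"""

root = ET.fromstring(DOC)

# XPath-style query: every text line on page 1.
lines = [el.text for el in root.findall("./page[@number='1']/textline")]
print(lines)
```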

  • kaitai_struct_formats

    Kaitai Struct: library of binary file formats (.ksy)

    This is tangential to your submission, but PDF is the file format I use for exercising any library that claims to describe file formats declaratively (à la https://github.com/kaitai-io/kaitai_struct_formats#readme).

  • pdfplumber

    Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.

    I recently tried pdfplumber [1] to extract data from (relatively) difficult-to-parse tables in PDFs, and it was a great experience. I can recommend it.

    [1]: https://github.com/jsvine/pdfplumber

  • i7j-rups

    RUPS is an acronym for Reading and Updating PDF Syntax. RUPS is a tool built on top of iText® that allows you to look inside a PDF document and browse the different PDF objects and content streams.

    > find a version of iText RUPS application from somewhere on the internet

    You mean this, right? https://github.com/itext/i7j-rups#readme

  • WeasyPrint

    The awesome document factory

  • PDF.js

    PDF Reader in JavaScript

    The PDF format is diverse enough for such a new project to still have plenty of incompatibilities. If you wanted to find many of them quickly, you might want to have a look at the documents used as test cases in other projects, such as pdf.js:

    https://github.com/mozilla/pdf.js/tree/master/test/pdfs
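
    Running a parser over such a corpus can be sketched as a small harness that records every failure instead of stopping at the first one. The names `run_corpus` and `toy_parse` are hypothetical, and `toy_parse` is only a stand-in for a real parser entry point.

```python
from pathlib import Path
from typing import Callable, Dict


def run_corpus(directory: Path, parse: Callable[[bytes], object]) -> Dict[str, str]:
    """Feed every *.pdf under `directory` to `parse`; return {filename: error}."""
    failures = {}
    for path in sorted(directory.rglob("*.pdf")):
        try:
            parse(path.read_bytes())
        except Exception as exc:  # deliberately broad: collect every failure mode
            failures[path.name] = f"{type(exc).__name__}: {exc}"
    return failures


def toy_parse(data: bytes) -> None:
    """Stand-in parser (hypothetical): only checks for the %PDF- header."""
    if not data.startswith(b"%PDF-"):
        raise ValueError("missing %PDF- header")
```

    Pointing `run_corpus` at a local checkout of a test directory such as the one linked above would give a quick list of files the parser cannot yet handle.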

  • bericht

    Incremental HTML to PDF converter.

  • polyfile

    A pure Python cleanroom implementation of libmagic, with instrumented parsing from Kaitai struct and an interactive hex viewer

    Be careful with PDF! There are many ambiguities in the specification that are implemented differently between parsers, as well as implicitly accepted malformations that almost all parsers will silently accept without warning. It is very easy to accidentally produce so-called file format schizophrenia, where the same file is rendered differently by two parsers. For example, with PDF, what if you have a PDF object stream whose length doesn't agree with the position of its `endstream` token? What if you have a PDF dictionary with duplicate keys? Do you use the value of the first key or the second? What if you have two valid PDFs concatenated one after the other? Do you render the first or the second? What if an object in the XREF table has an incorrect offset?

    Shameless plug: I am one of the maintainers of PolyFile, which, among other things, can produce an interactive HTML hex editor with an annotated syntax tree for dozens of filetypes, including PDF. For PDF, it uses a dynamically instrumented version of the PDFminer parser. It sounds like it might satisfy your use case.

    https://github.com/trailofbits/polyfile
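
    The first ambiguity in the comment above, a stream whose declared length disagrees with the position of `endstream`, can be detected with a rough scan. This is a sketch under simplifying assumptions, not how PolyFile or any real parser works: `/Length` must be a direct integer, the stream dictionary must not contain nested dictionaries, and the first `endstream` after the data is taken as the terminator (which is itself one of the choices parsers disagree on). `mismatched_streams` is a hypothetical name.

```python
import re

# "/Length N ... >> stream<EOL>", skipping indirect lengths such as "/Length 12 0 R".
_STREAM_START = re.compile(
    rb"/Length\s+(\d+)\b(?!\s+\d+\s+R)[^<>]*>>\s*stream(?:\r\n|\n)"
)


def mismatched_streams(data: bytes):
    """Return (declared, actual) pairs where /Length disagrees with endstream."""
    problems = []
    for m in _STREAM_START.finditer(data):
        declared = int(m.group(1))
        end = data.find(b"endstream", m.end())
        if end == -1:
            continue  # unterminated stream: a different kind of malformation
        body = data[m.end():end]
        # One end-of-line marker before `endstream` is not part of the data.
        for eol in (b"\r\n", b"\n", b"\r"):
            if body.endswith(eol):
                body = body[: -len(eol)]
                break
        if len(body) != declared:
            problems.append((declared, len(body)))
    return problems
```

    A parser that trusts `/Length` and one that scans for `endstream` will extract different bytes from any stream this function reports.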

  • djot

    A light markup language

    > the most reliable way of generating PDFs

    See pandoc: https://pandoc.org/

    And a variety of intermediate or input text formats, where you can pick your preferred poison whether for book publishing, research papers, math papers, technical documentation, slides, etc.

    Including the author's own djot: https://github.com/jgm/djot
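
    Driving pandoc from a script can be sketched as follows. The file names are placeholders, `pandoc_pdf_command` and `convert` are hypothetical helper names, and the sketch assumes pandoc plus a PDF engine (a LaTeX installation, for instance) are available on the machine.

```python
import shutil
import subprocess


def pandoc_pdf_command(source: str, target: str, engine: str = "") -> list:
    """Build a pandoc command line that converts `source` to a PDF at `target`."""
    cmd = ["pandoc", source, "-o", target]
    if engine:
        # Optional: choose the program pandoc uses to produce the PDF.
        cmd.append(f"--pdf-engine={engine}")
    return cmd


def convert(source: str, target: str, engine: str = "") -> None:
    """Run pandoc, failing early with a clear message if it is not installed."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    subprocess.run(pandoc_pdf_command(source, target, engine), check=True)
```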

  • pandoc

    Universal markup converter

  • PyMuPDF

    PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

    I think you might mean PyMuPDF (https://github.com/pymupdf/PyMuPDF), a Python library built on top of the MuPDF C library (https://mupdf.com/).

    PyMuPDF and MuPDF are both available under dual open source AGPL and commercial licenses. They have been around for many years and are under continual development.

    [Disclaimer: I work for Artifex, who wrote MuPDF and recently acquired PyMuPDF.]

  • mupdf

    mirrored from git://git.ghostscript.com/mupdf.git (by ccxvii)

  • annotated-pdf-spec

    Collection of useful hints for implementing a PDF library

    As others have already written, there are many slightly invalid PDF files out there in the wild that many readers can display mostly fine and which your library should also be able to handle.

    If you can, grab yourself a copy of the most recent PDF 2.0 specification since it contains much more information and is much more correct in terms of how to implement things. Also have a look at the errata at https://pdf-issues.pdfa.org/32000-2-2020/index.html.

    As I'm implementing a PDF library (in Ruby), I have started to collect some situations that arise in the wild but are not spec-compliant; see https://github.com/gettalong/annotated-pdf-spec. That might help you in parsing some invalid PDFs.
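
    As an illustration of what handling a slightly invalid file can mean in practice, consider a PDF with junk bytes after its final `%%EOF`. This case is assumed for the sake of the example and is not taken from the linked collection. A lenient reader can search a window at the end of the file for the `startxref` keyword instead of expecting it at a fixed position. `find_startxref` is a hypothetical name and the 1024-byte window is an arbitrary choice.

```python
import re
from typing import Optional


def find_startxref(data: bytes, window: int = 1024) -> Optional[int]:
    """Return the offset given by the last `startxref` in the file tail, or None."""
    tail = data[-window:]
    matches = re.findall(rb"startxref\s+(\d+)", tail)
    return int(matches[-1]) if matches else None


damaged = b"%PDF-1.4\n(objects)\nstartxref\n1234\n%%EOF\n\x00\x00trailing junk"
print(find_startxref(damaged))
```

    Taking the last match rather than the first matters for incrementally updated files, where several `startxref` sections can appear.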

  • pdf-issues

    Industry-based resolutions for issues and errata reported against any PDF-related ISO standard

    Merely for your consideration, if those were actual issues on that repo, (a) it would allow adding labels to them (as in https://github.com/pdf-association/pdf-issues/issues?q=is%3A... ), (b) folks could comment, acting as a low-rent Stack Overflow, and (c) it would allow anyone to contribute new ones versus the "PR against README.md" situation right now.

    That also more closely matches the mental model of those items: bugs against the specification, whether the official PDF Association agrees that they are or not.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
