Crates for converting PDFs into Markdown

pdf-extract

1 318 6.9 Rust

A Rust library for extracting content from PDFs

https://github.com/jrmuizel/pdf-extract could be extended to do something like this (converting PDFs into Markdown).
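To make that suggestion concrete, here is a minimal sketch of what such an extension might look like. It assumes the PDF's text has already been extracted into a plain string (for instance by a crate like pdf-extract) and only covers the text-to-Markdown step. The function names (`text_to_markdown`, `render_block`, `bullet_item`, `looks_like_heading`) and the heuristics are hypothetical illustrations, not part of pdf-extract.

```rust
/// Convert plain extracted text into rough Markdown.
/// Hypothetical helper: blocks are separated by blank lines, and each block
/// becomes a heading, a bullet list, or a re-flowed paragraph.
pub fn text_to_markdown(text: &str) -> String {
    let mut blocks: Vec<String> = Vec::new();
    let mut current: Vec<&str> = Vec::new();
    // The trailing empty line is a sentinel that flushes the last block.
    for line in text.lines().map(str::trim).chain(std::iter::once("")) {
        if !line.is_empty() {
            current.push(line);
            continue;
        }
        if !current.is_empty() {
            blocks.push(render_block(&current));
            current.clear();
        }
    }
    blocks.join("\n\n")
}

/// Render one blank-line-delimited block as Markdown.
fn render_block(lines: &[&str]) -> String {
    // If every line starts with a bullet marker, emit a Markdown list.
    let items: Vec<&str> = lines.iter().copied().filter_map(bullet_item).collect();
    if items.len() == lines.len() {
        let rendered: Vec<String> = items.iter().map(|i| format!("- {}", i)).collect();
        return rendered.join("\n");
    }
    // A lone, short, unpunctuated line is assumed to be a heading.
    if lines.len() == 1 && looks_like_heading(lines[0]) {
        return format!("## {}", lines[0]);
    }
    // Otherwise undo hard line wrapping by joining lines into one paragraph.
    lines.join(" ")
}

/// Return the item text if the line starts with a common bullet marker.
fn bullet_item(line: &str) -> Option<&str> {
    for marker in &["• ", "- ", "* "] {
        if let Some(rest) = line.strip_prefix(*marker) {
            return Some(rest.trim());
        }
    }
    None
}

/// Heuristic (an assumption): starts uppercase, at most 60 characters,
/// and does not end in sentence punctuation.
fn looks_like_heading(line: &str) -> bool {
    let starts_upper = line.chars().next().map_or(false, |c| c.is_uppercase());
    let ends_like_sentence = line.chars().last().map_or(false, |c| ".,;:?!".contains(c));
    starts_upper && !ends_like_sentence && line.chars().count() <= 60
}
```

Calling `text_to_markdown` on the extracted string returns the Markdown as a `String`. Real PDFs would need more than this: font sizes, tables, and multi-column layouts are not recoverable from plain text alone, which is why this is only a starting point.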

layout-parser

6 4,453 0.0 Python

A Unified Toolkit for Deep Learning Based Document Image Analysis

I built my own solution using a combination of Tesseract and OpenCV (in Python). Even though the source PDF content is computer-generated, I still get sporadic OCR errors. After writing my solution, I came across https://github.com/Layout-Parser/layout-parser, which might be a better starting point for dealing with PDFs, but I haven't tried it yet.

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
