Show HN: Open-source Rule-based PDF parser for RAG

llmsherpa

6 918 6.9 Jupyter Notebook

Developer APIs to Accelerate LLM Projects

I wrote about split points and the need for including section hierarchy in this post: https://ambikasukla.substack.com/p/efficient-rag-with-docume...
All this is automated in the llmsherpa parser https://github.com/nlmatics/llmsherpa which you can use as an API over this library.
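
To make the idea of carrying section hierarchy into each chunk concrete, here is a minimal sketch. It is not the llmsherpa implementation: the `Block` type, the `chunk_with_hierarchy` helper, and the sample document are all hypothetical, and it assumes a parser has already labeled each block as a heading (with a level) or as body text.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical parsed layout block: heading (level >= 1) or body text (level 0)."""
    text: str
    level: int = 0


def chunk_with_hierarchy(blocks):
    """Return one chunk per body block, prefixed with its ancestor headings."""
    stack = []  # (level, heading text) for the sections that are currently open
    chunks = []
    for block in blocks:
        if block.level > 0:
            # A new heading closes every open section at the same or a deeper level.
            while stack and stack[-1][0] >= block.level:
                stack.pop()
            stack.append((block.level, block.text))
        else:
            # Body text: prepend the heading path so the chunk keeps its context.
            path = " > ".join(text for _, text in stack)
            chunks.append(f"{path}\n{block.text}" if path else block.text)
    return chunks


# Hand-written sample input, not the output of any real parser.
blocks = [
    Block("Methods", 1),
    Block("Data collection", 2),
    Block("We sampled 40 sites.", 0),
    Block("Results", 1),
    Block("Yield rose by 12%.", 0),
]
chunks = chunk_with_hierarchy(blocks)
for chunk in chunks:
    print(chunk, end="\n\n")
```

Prefixing each chunk with its heading path is one simple way to keep a retrieved passage tied to the section it came from; a real pipeline would likely also handle tables and lists.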

nlm-ingestor

3 786 7.1 Python

This repo provides the server-side code that the llmsherpa API connects to. It includes parsers for various file formats.

Here's another notebook from the repo with examples: https://github.com/nlmatics/nlm-ingestor/blob/main/notebooks...

txtai

355 6,953 9.3 Python

💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows

Nice project! I've long used Tika for document parsing given its maturity and the wide range of formats it supports. The XHTML output helps with chunking documents for RAG.
Here are a couple of examples:
- https://neuml.hashnode.dev/build-rag-pipelines-with-txtai
- https://neuml.hashnode.dev/extract-text-from-documents
Disclaimer: I'm the primary author of txtai (https://github.com/neuml/txtai).
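
As a rough illustration of why XHTML output helps with chunking, the sketch below splits markup into one chunk per heading or paragraph element using only the Python standard library. It is an assumption-laden example, not txtai's or Tika's code: the `XHTMLChunker` class, the `chunk_xhtml` helper, and the sample markup are hypothetical, and the sample is hand-written rather than real Tika output.

```python
from html.parser import HTMLParser


class XHTMLChunker(HTMLParser):
    """Collect the text of each heading, paragraph, or list item as one chunk."""

    BLOCK_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "li"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._buffer = None  # None while outside any block-level element

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS and self._buffer is not None:
            # Collapse runs of whitespace and skip empty elements.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.chunks.append((tag, text))
            self._buffer = None


def chunk_xhtml(xhtml):
    """Return a list of (tag, text) chunks in document order."""
    parser = XHTMLChunker()
    parser.feed(xhtml)
    parser.close()
    return parser.chunks


# Hand-written sample markup (assumption), standing in for a parser's XHTML output.
sample = """<html><body>
<h1>Introduction</h1>
<p>Tika emits   structured
markup.</p>
<p></p>
<p>Each paragraph becomes a chunk.</p>
</body></html>"""
chunks = chunk_xhtml(sample)
for tag, text in chunks:
    print(tag, "|", text)
```

Keeping the tag alongside the text lets a later step treat headings differently from body paragraphs, for example by attaching the nearest heading to each paragraph before embedding.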

grobid

11 3,057 9.2 Java

A machine learning software for extracting information from scholarly documents
paperetl

12 315 6.3 Python

📄 ⚙️ ETL processes for medical and scientific papers

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.

Txtai: An all-in-one embeddings database for semantic search and LLM workflows
1 project | news.ycombinator.com | 24 Jan 2024
Generate knowledge with Semantic Graphs and RAG
1 project | dev.to | 23 Jan 2024
Ten Noteworthy AI Research Papers of 2023
1 project | news.ycombinator.com | 6 Jan 2024
2023: The Year of AI
2 projects | news.ycombinator.com | 25 Dec 2023
Altering the prompt in an Extractor pipeline?
1 project | /r/txtai | 5 Dec 2023

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
Python Machine Learning Deep Learning Search ETL
Post date: 23 Jan 2024
