Document Parsing - an unsolved problem?

tika-python

4 1,420 2.2 Python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

At my previous job we had the same problem, which we solved by using Tika. We called it on the server along with other stuff, but there is also a Python binding.
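A minimal sketch of what using the Python binding might look like. It assumes the `tika` package is installed and that Java is available, since the binding talks to a Tika server in the background. The helper name `extract_text` and its `parse` parameter are hypothetical and exist only so the function can be tried without Tika present.

```python
def extract_text(path, parse=None):
    """Return the plain text of a document, or "" if nothing was extracted.

    `parse` is a hypothetical hook for this sketch: any callable that takes a
    path and returns a dict with a "content" key. By default it is
    tika-python's parser.from_file.
    """
    if parse is None:
        # Imported lazily so the function can be exercised without Tika.
        # Assumes `pip install tika` and a working Java runtime.
        from tika import parser
        parse = parser.from_file

    parsed = parse(path)
    # "content" can be None when Tika finds no text (e.g. a scanned image).
    content = parsed.get("content") or ""
    return content.strip()
```

Calling `extract_text("report.pdf")` would go through Tika; passing a stub for `parse` is handy when only the surrounding pipeline is under test.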

tika-docker

20 103 4.1 Shell

Convenience Docker images for Apache Tika Server

At my previous job we had the same problem, which we solved by using Tika. We called it on the server along with other stuff, but there is also a Python binding.
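For the server route, a rough sketch of calling a running Tika Server over HTTP with only the standard library. It assumes a server is already listening on `http://localhost:9998` (for example, one started from the Docker image) and that it accepts a `PUT` to `/tika` with the raw document as the body. The function names are made up for this example.

```python
import urllib.request

# Assumption: a Tika Server is reachable here; adjust to your deployment.
TIKA_URL = "http://localhost:9998"


def build_tika_request(document_bytes, base_url=TIKA_URL):
    """Build (but do not send) a request asking Tika for plain text."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/tika",
        data=document_bytes,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text_via_server(path, base_url=TIKA_URL):
    """Send a file to the Tika Server and return the extracted text."""
    with open(path, "rb") as handle:
        request = build_tika_request(handle.read(), base_url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Keeping request construction separate from sending makes it possible to check the URL and headers without a live server.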

PaddleOCR

60 38,704 8.7 Python

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)

We ended up just pulling the text off of the page / OCRing it in house and performing a vote of those in-house OCR sources' data against the GCP text output. Which tool you pick (Tesseract, PaddleOCR, etc.) depends on many considerations, but it can usually be boiled down to speed, cost, and accuracy. Good, fast, cheap. Pick two. :)
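The comment does not say how the vote was implemented. One possible reading, sketched here under that caveat: take the text each engine produced for the same region, vote token by token when the outputs line up, and otherwise fall back to the output that agrees most with the rest. All function names are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def vote_tokens(candidates):
    """Majority vote per token position; assumes equal token counts."""
    token_lists = [c.split() for c in candidates]
    voted = []
    for position in zip(*token_lists):
        # On a tie, Counter keeps first-seen order, so the earliest
        # engine in the list wins.
        voted.append(Counter(position).most_common(1)[0][0])
    return " ".join(voted)


def most_central(candidates):
    """Fallback: the candidate most similar, in total, to all the others."""
    def score(i):
        return sum(
            SequenceMatcher(None, candidates[i], other).ratio()
            for j, other in enumerate(candidates)
            if j != i
        )
    return candidates[max(range(len(candidates)), key=score)]


def vote(candidates):
    """Combine several OCR outputs for the same text region into one string."""
    if not candidates:
        raise ValueError("need at least one OCR output")
    token_counts = {len(c.split()) for c in candidates}
    if len(token_counts) == 1:
        return vote_tokens(candidates)
    # Engines disagree on tokenization, so positions cannot be compared.
    return most_central(candidates)
```

For example, `vote(["Invoice total 1O0", "Invoice total 100", "Invoice tota1 100"])` returns `"Invoice total 100"`: each engine's single error is outvoted by the other two.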

ocrpy

6 218 0.0 Jupyter Notebook

OCR, Archive, Index and Search: Implementation agnostic OCR framework.

GitHub: https://github.com/maxent-ai/ocrpy

txtai

356 7,033 9.3 Python

💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows

txtai has a component for document text extraction. It wraps the Tika library in Python. This component also has logic for splitting text into sentences/paragraphs.
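To illustrate the kind of splitting logic mentioned, here is a deliberately naive, standard-library sketch. It is not txtai's implementation (txtai exposes this through its Textractor pipeline); the function names and the regular expressions are assumptions chosen for the example.

```python
import re


def split_paragraphs(text):
    """Split on blank lines and collapse whitespace inside each paragraph."""
    parts = re.split(r"\n\s*\n", text)
    # " ".join(p.split()) folds hard line wraps from extraction into spaces.
    return [" ".join(p.split()) for p in parts if p.strip()]


def split_sentences(paragraph):
    """Naive sentence split after '.', '!' or '?' followed by whitespace.

    This breaks on abbreviations such as "e.g. this"; a real pipeline
    would use a proper sentence tokenizer.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
```

Splitting into paragraphs first and sentences second keeps chunks aligned with the document's own structure, which tends to matter when the text is later indexed for search.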

NOTE: The number of mentions on this list counts mentions in common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.

Related posts

What contributing to Open-source is, and what it isn't
1 project | news.ycombinator.com | 27 Apr 2024

Build knowledge graphs with LLM-driven entity extraction
1 project | dev.to | 21 Feb 2024

Bootstrap or VC?
1 project | news.ycombinator.com | 5 Feb 2024

txtai: An embeddings database for semantic search, graph networks and RAG
1 project | news.ycombinator.com | 3 Feb 2024

Langchain Is Pointless
16 projects | news.ycombinator.com | 8 Jul 2023

This page summarizes the projects mentioned and recommended in the original post on /r/LanguageTechnology
Tags: Python, NLP, OCR, semantic-search, tika-server
Post date: 19 Jul 2022
