-
tika-python
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
-
PaddleOCR
Awesome multilingual OCR toolkit based on PaddlePaddle (a practical, ultra-lightweight OCR system that recognizes 80+ languages, provides data annotation and synthesis tools, and supports training and deployment on server, mobile, embedded, and IoT devices)
-
txtai
💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows
At my previous job we had the same problem which we solved by using Tika. We called it on the server along with other stuff, but there is also a Python binding.
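For anyone who wants to try the Python binding, here is a minimal sketch of what a call might look like. It assumes tika-python is installed and Java is available, since the binding starts a local Tika server on first use. The helper name `extract_text` and its `parse` argument are made up for illustration; only `tika.parser.from_file` comes from the library.

```python
from typing import Callable, Optional


def extract_text(path: str, parse: Optional[Callable[[str], dict]] = None) -> str:
    """Return the plain text extracted from `path`, or "" if nothing was found.

    `parse` is a hypothetical hook so the helper can be exercised without a
    running Tika server; by default it uses tika-python's parser.from_file.
    """
    if parse is None:
        # Imported lazily: tika-python needs Java and a Tika server at call time.
        from tika import parser
        parse = parser.from_file

    parsed = parse(path)
    # The result is a dict; "content" can be None for files with no text layer.
    return (parsed.get("content") or "").strip()
```

Calling `extract_text("report.pdf")` (a placeholder file name) would go through Tika. Scanned documents without a text layer may come back empty unless OCR is set up on the Tika side.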
We ended up just pulling the text off the page or OCRing it in house, then voting the in-house OCR sources' output against the GCP text output. Which tool you pick (Tesseract, PaddleOCR, etc.) depends on many considerations, but it usually boils down to speed, cost, and accuracy. Good, fast, cheap. Pick two. :)
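The comment does not say how the vote was implemented, so the following is only a rough sketch of one way to do it: a majority vote over the text each source returned for the same field. The function names, the normalization rule, and the tie-break behavior are all assumptions, not the commenter's actual pipeline.

```python
from collections import Counter
from typing import Mapping


def normalize(text: str) -> str:
    """Collapse whitespace and case so trivial differences do not split the vote."""
    return " ".join(text.split()).casefold()


def vote(readings: Mapping[str, str], tiebreak: str) -> str:
    """Pick the reading most sources agree on.

    `readings` maps a source name (hypothetical labels such as "tesseract"
    or "gcp") to the text it produced. If the top count is shared by more
    than one reading, the text from the `tiebreak` source is returned.
    """
    counts = Counter(normalize(text) for text in readings.values())
    ranked = counts.most_common()
    winner, top = ranked[0]

    if sum(1 for _, n in ranked if n == top) > 1:
        # No clear majority: trust the designated source.
        return readings[tiebreak]

    # Return the original text from the first source that matches the winner.
    return next(t for t in readings.values() if normalize(t) == winner)
```

For example, with `{"tesseract": "Invoice 1234", "paddleocr": "invoice 1234", "gcp": "Invoice 1284"}` the two in-house engines outvote the GCP reading. In practice you would probably vote per word or per field rather than on whole pages.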
GitHub: https://github.com/maxent-ai/ocrpy
txtai has a component for document text extraction. It wraps the Tika library in Python. This component also has logic for splitting text into sentences/paragraphs.
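A minimal sketch of how that component might be used, assuming txtai is installed with its pipeline extras and that `Textractor` accepts a `paragraphs` flag as in the txtai documentation. The helpers `build_extractor` and `extract_chunks` are hypothetical wrappers added here for illustration.

```python
from typing import Callable, List, Union


def build_extractor(paragraphs: bool = True):
    """Create a txtai Textractor; imported lazily since txtai is a heavy dependency."""
    from txtai.pipeline import Textractor
    return Textractor(paragraphs=paragraphs)


def extract_chunks(path: str, extractor: Callable[[str], Union[str, List[str]]]) -> List[str]:
    """Run `extractor` on `path` and always return a list of text chunks.

    Assumption: the extractor returns a single string when no splitting
    option is set and a list of strings otherwise, so both are handled.
    """
    result = extractor(path)
    return [result] if isinstance(result, str) else list(result)
```

Usage would be along the lines of `extract_chunks("document.pdf", build_extractor())`, where the file name is a placeholder. Switching to sentence-level splitting should only require a different constructor flag.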