Document Parsing - an unsolved problem?

This page summarizes the projects mentioned and recommended in the original post on /r/LanguageTechnology.

  • tika-python

    Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

  • At my previous job we had the same problem, which we solved using Tika. We called it on the server along with other tools, but there is also a Python binding.
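The Python binding mentioned above can be sketched roughly as follows. This is a minimal illustration, not the commenter's actual setup: it assumes the third-party `tika` package is installed and that Java is available, since the binding starts or contacts a Tika server behind the scenes. The helper names `normalize_content` and `extract_text` are hypothetical.

```python
def normalize_content(content):
    """Tidy raw extracted text: strip each line and collapse runs of blank lines.

    Tika may report no content at all (None) for some files, so treat that as empty.
    """
    if not content:
        return ""
    cleaned = []
    for line in content.splitlines():
        line = line.strip()
        # Keep a single blank line between blocks, drop any extra ones.
        if line or (cleaned and cleaned[-1]):
            cleaned.append(line)
    return "\n".join(cleaned).strip()


def extract_text(path):
    """Extract text from a document via tika-python (assumed to be installed)."""
    # Imported lazily so the pure helper above works without the package.
    from tika import parser

    parsed = parser.from_file(path)  # dict that includes "metadata" and "content"
    return normalize_content(parsed.get("content"))
```

Calling `extract_text("report.pdf")` (a placeholder file name) would return the cleaned text, or an empty string if Tika found nothing to extract.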

  • tika-docker

    Convenience Docker images for Apache Tika Server

  • At my previous job we had the same problem, which we solved using Tika. We called it on the server along with other tools, but there is also a Python binding.
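Calling a Tika server directly, as the comment describes, might look like the sketch below. It assumes a server is already running locally on Tika's default port 9998 (for example, started from the `apache/tika` Docker image with that port published) and uses only the standard library. The URL constant and function names are illustrative assumptions.

```python
import urllib.request

# Assumption: Tika Server is reachable here; adjust host and port for your deployment.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data, url=TIKA_URL):
    """Build a PUT request asking Tika Server to return plain text for `data`."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path, url=TIKA_URL, timeout=60):
    """Send a file to the server and return the extracted text."""
    with open(path, "rb") as fh:
        request = build_tika_request(fh.read(), url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Keeping request construction separate from the network call makes it easy to check the request without a running server.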

  • PaddleOCR

    Multilingual OCR toolkit based on PaddlePaddle: a practical, ultra-lightweight OCR system that supports recognition of 80+ languages, provides data annotation and synthesis tools, and supports training and deployment on server, mobile, embedded, and IoT devices.

  • We ended up just pulling the text off the page or OCRing it in house, then voting those in-house OCR sources' outputs against the GCP text output. Which tool you pick (Tesseract, PaddleOCR, etc.) depends on many considerations, but it usually boils down to speed, cost, and accuracy. Good, fast, cheap. Pick two. :)
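The voting idea in the comment above could be sketched as a per-token majority vote. The commenter does not describe their actual method, so this is only one possible reading: it naively assumes that every source splits the text into the same sequence of whitespace-separated tokens, which real OCR output often violates. The function name `vote_tokens` is hypothetical.

```python
from collections import Counter
from itertools import zip_longest


def vote_tokens(candidates):
    """Combine several OCR transcriptions of the same text by majority vote.

    Tokens are compared position by position. On a tie, the token from the
    earliest source in `candidates` wins.
    """
    token_lists = [candidate.split() for candidate in candidates]
    voted = []
    for column in zip_longest(*token_lists):
        # Sources that ran out of tokens contribute None and are ignored.
        counts = Counter(token for token in column if token is not None)
        voted.append(counts.most_common(1)[0][0])
    return " ".join(voted)


outputs = [
    "Invoice total 100",   # e.g. one in-house OCR engine
    "Invoice tota1 100",   # e.g. a second engine with a character error
    "Invoice total 1OO",   # e.g. a cloud OCR result with a different error
]
print(vote_tokens(outputs))
```

A production version would need a proper alignment step (for example, edit-distance alignment) before voting, because a single dropped or merged word shifts every later position.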

  • ocrpy

    OCR, Archive, Index and Search: Implementation agnostic OCR framework.

  • GitHub: https://github.com/maxent-ai/ocrpy

  • txtai

    💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows

  • txtai has a component for document text extraction. It wraps the Tika library in Python. This component also has logic for splitting text into sentences/paragraphs.
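To illustrate what splitting extracted text into paragraphs and sentences involves, here is a small standard-library sketch. It is not txtai's implementation and only approximates the idea: paragraphs are assumed to be separated by blank lines, and sentences by whitespace following `.`, `!`, or `?`, which will mis-split abbreviations such as "e.g.". Both function names are hypothetical.

```python
import re


def split_paragraphs(text):
    """Split text on blank lines and collapse whitespace inside each paragraph."""
    blocks = re.split(r"\n\s*\n", text)
    return [" ".join(block.split()) for block in blocks if block.strip()]


def split_sentences(paragraph):
    """Split a paragraph after sentence-ending punctuation followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [part for part in parts if part]


text = "First sentence. Second one!\n\nNew paragraph\nwrapped over two lines."
for paragraph in split_paragraphs(text):
    print(split_sentences(paragraph))
```

Splitting into smaller units like this is typically done so each piece can be indexed or embedded separately.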

