Document Parsing - an unsolved problem?

This page summarizes the projects mentioned and recommended in the original post on reddit.com/r/LanguageTechnology.

  • tika-python

    Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

    At my previous job we had the same problem, which we solved by using Tika. We called it on the server along with other stuff, but there is also a Python binding.
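As a rough illustration of the Python binding mentioned above, the sketch below calls `tika.parser.from_file` and tidies the returned text. The helper names `clean_content` and `extract_text` are hypothetical, and the whitespace cleanup is an assumption for illustration, not something the original poster described. Running `extract_text` requires the `tika` package and a Java runtime, since the binding talks to a Tika server.

```python
from typing import Any, Dict


def clean_content(parsed: Dict[str, Any]) -> str:
    """Return the text from a Tika-style result dict, or "" if there is none.

    Hypothetical helper: strips each line and drops empty lines.
    """
    content = parsed.get("content")
    if not content:
        # Tika can return None for documents with no extractable text.
        return ""
    lines = [line.strip() for line in content.splitlines()]
    return "\n".join(line for line in lines if line)


def extract_text(path: str) -> str:
    """Parse a file with tika-python and return cleaned text."""
    # Imported lazily so clean_content works without tika installed.
    from tika import parser

    # from_file returns a dict that includes "metadata" and "content" keys.
    parsed = parser.from_file(path)
    return clean_content(parsed)
```

Dropping blank lines also discards paragraph breaks, so a pipeline that needs paragraph boundaries would keep them and split later.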

  • tika-docker

    Convenience Docker images for Apache Tika Server

    At my previous job we had the same problem which we solved by using Tika. We called it on the server along with other stuff, but there is also a Python binding.

  • PaddleOCR

    Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)

    We ended up just pulling the text off of the page / OCRing it in house and voting the in-house OCR sources' output against the GCP text output. What tool you pick (Tesseract, PaddleOCR, etc.) depends on many considerations, but they usually boil down to speed, cost, and accuracy. Good, fast, cheap. Pick two. :)
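The comment above does not say how the vote was carried out. One simple possibility, sketched here purely as an assumption, is to pick the OCR output that agrees most with all the others, using `difflib.SequenceMatcher` from the standard library as the similarity measure. The function name `pick_consensus` and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict


def pick_consensus(outputs: Dict[str, str]) -> str:
    """Return the name of the OCR source whose text agrees most with the rest.

    Assumed scheme: score each source by the sum of its similarity ratios
    to every other source, then take the highest score.
    """
    if not outputs:
        raise ValueError("need at least one OCR output")

    def agreement(name: str) -> float:
        return sum(
            SequenceMatcher(None, outputs[name], other_text).ratio()
            for other_name, other_text in outputs.items()
            if other_name != name
        )

    # On ties, max() keeps the first source in insertion order.
    return max(outputs, key=agreement)


# Made-up sample outputs for the same page from three sources.
sample = {
    "tesseract": "Invoice total: 100 USD",
    "paddleocr": "Invoice total: 100 USD",
    "gcp": "lnvoice tota1: IOO USD",
}
best = pick_consensus(sample)
```

This selects a whole output rather than voting word by word, which avoids having to align tokens across sources but cannot combine the correct parts of two imperfect outputs.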

  • ocrpy

    OCR, Archive, Index and Search: Implementation agnostic OCR framework.

    GitHub: https://github.com/maxent-ai/ocrpy

  • txtai

    💡 Build AI-powered semantic search applications

    txtai has a component for document text extraction. It wraps the Tika library in Python. This component also has logic for splitting text into sentences/paragraphs.
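To make the paragraph-splitting idea concrete, here is a naive sketch in plain Python. It is not txtai's implementation: it assumes paragraphs are separated by blank lines, and the name `split_paragraphs` and its `min_length` parameter are hypothetical.

```python
import re
from typing import List


def split_paragraphs(text: str, min_length: int = 1) -> List[str]:
    """Split extracted text into paragraphs on blank lines (naive sketch)."""
    # Assumption: one or more blank lines separate paragraphs.
    blocks = re.split(r"\n\s*\n", text)
    paragraphs = []
    for block in blocks:
        # Collapse line breaks and runs of whitespace into single spaces.
        cleaned = " ".join(block.split())
        if len(cleaned) >= min_length:
            paragraphs.append(cleaned)
    return paragraphs
```

In txtai itself, the component is exposed as `Textractor` in `txtai.pipeline`; constructing it with `paragraphs=True` or `sentences=True` and calling it on a file path should return the split text directly, though the exact options may vary by version.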
