Extracting information from scanned PDF docs, is it possible?

This page summarizes the projects mentioned and recommended in the original post on /r/learnpython.

  • Slavic-BERT-NER

    Shared BERT model for four languages: Bulgarian, Czech, Polish, and Russian. Slavic NER model.

  • Extracting the required data from the string. This is very specific to each use case, and most likely my use case won't overlap with yours, but in case it does: I'm trying to detect names of people and companies in the text, for which I'm using the Slavic NER model (note that my PDFs are not in English).

  • thefuzz

    Fuzzy String Matching in Python

  • Finally, even though Tesseract's output is usually very nice, it can sometimes make a mistake. Again, this is case-specific: if you're extracting numbers, for example, it will be very hard to check for errors. Since I'm extracting names, though, I can fuzzy-compare the names detected by Slavic NER against a database of names that I have. I do this fuzzy matching with the thefuzz library, and when I find a very high match with one of the names in my database, I simply fix the error by taking the name from there.

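The fuzzy correction step described in the post can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the poster's actual code: it swaps in the standard library's `difflib.SequenceMatcher` for thefuzz so it runs without extra installs (thefuzz offers `process.extractOne` for the same kind of best-match lookup). The helper names `similarity` and `correct_name`, the cutoff of 90, and the sample names are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Return a case-insensitive similarity score from 0 to 100."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def correct_name(detected: str, known_names: list[str], cutoff: int = 90) -> str:
    """Replace an OCR'd name with its closest known name if the match is strong.

    `cutoff` is a hypothetical threshold; tune it against your own data.
    """
    best_name, best_score = None, -1
    for candidate in known_names:
        score = similarity(detected, candidate)
        if score > best_score:
            best_name, best_score = candidate, score
    # Only trust the database entry when the match is very high;
    # otherwise keep the OCR output untouched.
    if best_score >= cutoff:
        return best_name
    return detected


if __name__ == "__main__":
    # Hypothetical name database and OCR outputs, for illustration only.
    database = ["Jan Kowalski", "Petr Novak", "Ivan Petrov"]
    print(correct_name("Jan Kowalsk1", database))   # one-character OCR slip
    print(correct_name("Maria Schmidt", database))  # no close match in the database
```

Keeping the original string when nothing clears the cutoff avoids "correcting" a genuinely new name into an unrelated database entry, which matters more as the database grows.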
