-
Slavic-BERT-NER
Shared BERT model for four languages: Bulgarian, Czech, Polish and Russian. Slavic NER model.
Extracting the required data from the string. This step is very specific to each use case, and mine most likely won't overlap with yours. In case it does: I'm trying to detect the names of people and companies in the text, for which I'm using the Slavic NER model (note that my PDFs are not in English).
Finally, even though Tesseract's output is usually very good, it can sometimes make a mistake. Again, this is case-specific: if you're extracting numbers, for example, errors will be very hard to check for. But since I'm extracting names, I can fuzzy-compare the names detected by Slavic NER against a database of names that I have. I do this fuzzy matching with the thefuzz library, and when I find a very high match with one of the names in my database, I simply fix the error by taking the name from there.
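The correction step described above can be sketched roughly as follows. This is only an illustration, not the actual code from the post: it uses Python's standard-library `difflib` in place of thefuzz so that it runs without extra dependencies (with thefuzz, `process.extractOne` would play a similar role). The function names, the sample names, and the threshold of 90 are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from typing import Sequence


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score between two strings, ignoring case."""
    return 100.0 * SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def correct_name(detected: str, known_names: Sequence[str], threshold: float = 90.0) -> str:
    """Swap an OCR'd name for its closest known name when the match is strong.

    `threshold` is a hypothetical cutoff; tune it on your own data.
    If nothing in `known_names` scores high enough, the detected name
    is returned unchanged.
    """
    if not known_names:
        return detected
    best = max(known_names, key=lambda name: similarity(detected, name))
    return best if similarity(detected, best) >= threshold else detected


# Hypothetical database of names and an OCR result with "l" misread as "1".
known = ["Jan Kowalski", "Petr Novák", "Ivan Petrov"]
fixed = correct_name("Jan Kowa1ski", known)
```

Keeping the threshold high means a name is only replaced when it is almost certainly an OCR slip of a known name; anything below the cutoff is left as detected rather than forced onto the nearest database entry.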