With scanned documents, the greatest utility is going to come from treating them as images instead of PDFs, and preprocessing the images to help the OCR algorithm. When a PDF program does OCR, it first tries to "cheat" by reading hard-coded strings, then it uses image analysis for everything else. Any OCR program is going to apply image processing that its authors think works for as many cases as possible, but if you have bespoke documents, doing your own preprocessing is the way to go. Here are some general strategies: https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md
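
To make "preprocessing" a bit more concrete, here is a rough sketch of one common step, binarization with an Otsu-style threshold (turning a grayscale scan into pure black and white before handing it to OCR). It is plain Python with no dependencies, and it assumes the page is already loaded as a nested list of 0..255 grayscale values. The names `otsu_threshold` and `binarize` are made up for this example; in practice you would probably do this with an image library, and the right steps depend on your documents.

```python
from typing import List

Image = List[List[int]]  # rows of grayscale pixels, each 0..255 (assumed input format)


def otsu_threshold(image: Image) -> int:
    """Pick the threshold that maximizes between-class variance (Otsu-style)."""
    hist = [0] * 256
    for row in image:
        for px in row:
            hist[px] += 1

    total = sum(hist)
    if total == 0:
        raise ValueError("empty image")
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of their intensities
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # no dark pixels yet
        w1 = total - w0
        if w1 == 0:
            break  # no bright pixels left
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2
        var = w0 * w1 * (mean0 - mean1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(image: Image) -> Image:
    """Map pixels above the threshold to white (255), everything else to black (0)."""
    t = otsu_threshold(image)
    return [[255 if px > t else 0 for px in row] for row in image]


if __name__ == "__main__":
    # Tiny fake "scan": dark ink (around 50) on a light background (around 200)
    page = [
        [200, 210, 190, 205],
        [200, 50, 60, 205],
        [195, 40, 55, 210],
    ]
    for row in binarize(page):
        print(row)
```

A single global threshold like this only works well when the lighting across the page is fairly even. For scans with shadows or uneven backgrounds, an adaptive (local) threshold is usually the better choice.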