-
PaddleOCR
Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)
Then you need some file handling to cover the different file types. Text documents and spreadsheets don't need OCR: you can use any Excel/Word reader library to parse the data and count the words. For PDFs and images, I would use PaddleOCR. It's free and works reasonably well. If you are only interested in words, do some postprocessing on the extracted text. The easy but inaccurate option is to check that a string is not just punctuation; you could also match against a dictionary or use NLP.
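A rough sketch of the dispatch and the postprocessing, standard library only. The extension sets and the names `needs_ocr`, `is_word` and `count_words` are made up for illustration; the actual text extraction (a Word/Excel reader, or PaddleOCR for PDFs and images) is left out and would feed its output into `count_words`.

```python
import string
from pathlib import Path

# Hypothetical extension groups; adjust to the files you actually receive.
TEXT_EXTS = {".txt", ".csv", ".docx", ".xlsx"}
OCR_EXTS = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def needs_ocr(path):
    """Return True for PDFs/images, False for text documents and spreadsheets."""
    ext = Path(path).suffix.lower()
    if ext in OCR_EXTS:
        return True
    if ext in TEXT_EXTS:
        return False
    raise ValueError(f"unsupported file type: {ext!r}")


def is_word(token):
    """Crude filter: reject tokens that consist only of punctuation."""
    return bool(token.strip(string.punctuation))


def count_words(text, dictionary=None):
    """Count whitespace-separated tokens that pass the punctuation filter.

    If a dictionary (a set of lowercase words) is given, only tokens found
    in it are counted, which is stricter but drops unknown words.
    """
    words = [t.strip(string.punctuation) for t in text.split() if is_word(t)]
    if dictionary is not None:
        words = [w for w in words if w.lower() in dictionary]
    return len(words)
```

The punctuation check still counts numbers and OCR noise such as stray letters as words, which is why the dictionary lookup (or an NLP tokenizer) is the more accurate route.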