Our great sponsors
-
textract-cli
CLI utility for using AWS Textract DetectDocumentText to OCR image files in synchronous mode without uploading to S3.
-
WorkOS
The modern identity platform for B2B SaaS. The APIs are flexible and easy-to-use, supporting authentication, user identity, and complex enterprise features like SSO and SCIM provisioning.
That's really neat: https://github.com/mbafford/textract-cli/blob/master/textrac... - my tool had to involve an S3 bucket too just because Textract won't let you upload a PDF directly to it without storing it in S3 first.
- Modern PDFs - if you wanna extract text and images, then the PDFSegmenter used in my example will work. If tables too, might need some additional jiggery-pokery, but definitely doable. I know other ppl using the same framework (Jina) who've accomplished it.
- Exact word search - pretty simple. I've focused on more advanced stuff because color vs colour is same same but different. Also just because it's pretty easy since I'm just using pre-defined building blocks, not manually integrating stuff
- Cross platform frontend - I've seen a lyrics search frontend [0] and I've built stuff in Streamlit before. Jina offers RESTful/gRPC/WebSockets gateways so it can't be too tough
- Lightweight? I mean how lightweight do you want it? C? Bash? Assembly? I've found Python good for text parsing
- Long-term: The notebook I wrote has a few (each of which have their own), but compared to others they're relatively lightweight.
- Gluing code: I've been using pre-existing building blocks, and writing new Executors (i.e. building blocks) is relatively straightforward, and then scaling them up with shards, replicas, etc is just a parameter away.
I'm more into the search side then the PDF stuff. The PDF side I've had experience with through bitter suffering and torment. Not a fun format to work with (unless you're into sado-masochism)
[0] https://github.com/jina-ai/examples/tree/master/multires-lyr...
Related posts
- Do what Google does: build a semantic search app powered by Jina AI's open source, neural search framework.
- A semantic search app powered by Jina AI's open source, neural search framework. Using this, you can index and search song lyrics using state-of-the-art machine learning language models
- [P] A week ago, I came across this super cool project to build Cross Modal Search. I will now share more details about the project
- Build your own Google Image search powered by deep-learning, open-source
- I was wrong! A big thank you to r/python members 🙏