pdfminer.six
Jina AI examples
Metric | pdfminer.six | Jina AI examples
---|---|---
Mentions | 14 | 22
Stars | 5,430 | 403
Growth | 4.2% | -
Activity | 7.1 | 9.6
Last commit | 16 days ago | over 2 years ago
Language | Python | Python
License | MIT License | Apache License 2.0
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
pdfminer.six
-
Code to extract text from pdf to excel
I love to use PDFMiner and PDFQuery for this: https://github.com/pdfminer/pdfminer.six and https://towardsdatascience.com/scrape-data-from-pdf-files-using-python-and-pdfquery-d033721c3b28
- Advanced PDF to Excel with documents and example code
-
how do I automate extracting data from two pdfs and input into an excel sheet according to an order number
Entering things in Excel is very easy. Extracting things from PDF is a pain. This (https://github.com/pdfminer/pdfminer.six) gets pretty close to what you need, but it may be easier to use this to just convert the entire PDF to text and parse the text to extract the info you need.
- Can I make a code to compare a pdf file and an excel sheet by line by line tell the difference in amounts?
-
How do I now access GPT-4? I click the link but it just takes me to the information page, I don’t have access to it on the API playground page.
Convert pdf to string https://github.com/pdfminer/pdfminer.six
- Extracting text from PDFs using pdfminer
-
Recommendations for parsing text from .pdf files
Now I see that the project is abandoned, but there's an active fork called pdfminer.six. Hope that helps.
-
Creating a python class for organizing courses I took in my education
Technically this information is on my transcript, so I will try to use pdfminer to extract that data. Is there a class you would recommend for use with that code? https://github.com/pdfminer/pdfminer.six
- Show HN: Search PDFs with Transformers and Python Notebook
- Best tools for PDF Scraping?
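Several of the posts above suggest converting the whole PDF to text and then parsing that text for the fields you need. A minimal sketch of the parsing half follows, using only the standard library. The sample text, the "Order Number" / "Total" layout, and the helper names `parse_orders` and `orders_to_csv` are all assumptions made for illustration; real PDF output will look different and the pattern will need adjusting. In practice the text would likely come from pdfminer.six's `extract_text` function rather than a hard-coded string.

```python
import csv
import io
import re

# Hypothetical sample of text as it might come out of a PDF. In practice it
# would be produced by pdfminer.six, for example:
#   from pdfminer.high_level import extract_text
#   text = extract_text("invoice.pdf")
SAMPLE_TEXT = """\
Order Number: 1001
Customer: Alice Example
Total: $1,250.00

Order Number: 1002
Customer: Bob Example
Total: $87.50
"""

# Assumed layout: each record has an "Order Number" line followed, some
# lines later, by a "Total" line with a dollar amount.
ORDER_RE = re.compile(
    r"Order Number:\s*(?P<order>\d+).*?Total:\s*\$(?P<total>[\d,]+\.\d{2})",
    re.DOTALL,
)


def parse_orders(text):
    """Return a list of (order_number, total) tuples found in the text."""
    rows = []
    for match in ORDER_RE.finditer(text):
        # Strip thousands separators before converting to a number.
        total = float(match.group("total").replace(",", ""))
        rows.append((match.group("order"), total))
    return rows


def orders_to_csv(rows):
    """Serialize rows as CSV text, a format Excel can open directly."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["order_number", "total"])
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(orders_to_csv(parse_orders(SAMPLE_TEXT)))
```

Writing CSV keeps the sketch dependency-free; if a native .xlsx file is required, the same rows could be handed to a spreadsheet library instead.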
Jina AI examples
-
Show HN: Search PDFs with Transformers and Python Notebook
- Modern PDFs - if you wanna extract text and images, then the PDFSegmenter used in my example will work. If tables too, might need some additional jiggery-pokery, but definitely doable. I know other ppl using the same framework (Jina) who've accomplished it.
- Exact word search - pretty simple. I've focused on more advanced stuff because color vs colour is same same but different. Also just because it's pretty easy since I'm just using pre-defined building blocks, not manually integrating stuff
- Cross platform frontend - I've seen a lyrics search frontend [0] and I've built stuff in Streamlit before. Jina offers RESTful/gRPC/WebSockets gateways so it can't be too tough
- Lightweight? I mean how lightweight do you want it? C? Bash? Assembly? I've found Python good for text parsing
- Long-term: The notebook I wrote has a few dependencies (each of which has its own), but compared to others they're relatively lightweight.
- Gluing code: I've been using pre-existing building blocks, and writing new Executors (i.e. building blocks) is relatively straightforward, and then scaling them up with shards, replicas, etc is just a parameter away.
I'm more into the search side than the PDF stuff. The PDF side I've had experience with through bitter suffering and torment. Not a fun format to work with (unless you're into sado-masochism)
[0] https://github.com/jina-ai/examples/tree/master/multires-lyr...
-
Getting started with Jina AI
Semantic Wikipedia Search
- Do what Google does: build a semantic search app powered by Jina AI's open source, neural search framework.
- A semantic search app powered by Jina AI's open source, neural search framework. Using this, you can index and search song lyrics using state-of-the-art machine learning language models
-
[P] A week ago, I came across this super cool project to build Cross Modal Search. I will now share more details about the project
I was looking for some projects based on search engines, and for a tool that could search across various types of data, and that's when I came across this GitHub project: https://github.com/jina-ai/jina/blob/master/.github/pages/hello-world.md#-multimodal-document-search. Encouraged by thorough, step-by-step instructions on how to build a search service that can use diverse modal features to provide accurate results, I ventured through the documents till I came to the latest updated version, here: https://github.com/jina-ai/examples/tree/master/cross-modal-search.
- Build your own Google Image search powered by deep-learning, open-source
-
[P] Open-source Neural Search framework to implement semantic search & multimedia search. Just released 2.0, seeking your feedback.
There are already some examples on music search, PDF search and video search that show proofs of concept of its capabilities around those use cases. You can discuss your specific use case in detail with the Jina community on Slack.
-
I was wrong! A big thank you to r/python members 🙏
Thank you so much for the appreciation and for sharing your use cases. Check out the examples for chatbot and financial analysis - https://github.com/jina-ai/examples
-
PDF search - Another project I built using Jina(AI Search framework)
git clone --depth 1 --filter=blob:none --sparse https://github.com/jina-ai/examples
cd examples
git sparse-checkout set multimodal-search-pdf
- Alternative to Google Images - Open-Source image search engine
What are some alternatives?
PDFMiner - Python PDF Parser (Not actively maintained). Check out pdfminer.six.
finetuner - :dart: Task-oriented embedding tuning for BERT, CLIP, etc.
pdfplumber - Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
jina - ☁️ Build multimodal AI applications with cloud-native stack
PyPDF2 - A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
jina-hub - An open-registry for hosting Jina executors via container images
OCRmyPDF - OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
jina-financial-qa-search
tabula-py - Simple wrapper of tabula-java: extract table from PDF into pandas DataFrame
jina-app-store-example - App store search example, using Jina as backend and Streamlit as frontend [Moved to: https://github.com/jina-ai/example-app-store]
PyPDF2 - A utility to read and write PDFs with Python [Moved to: https://github.com/py-pdf/PyPDF2]
jina-meme-search-example - Meme search engine built with Jina neural search framework. Search with captions or image files to find matching memes. [Moved to: https://github.com/jina-ai/example-meme-search]