Parsr
grobid
Metric | Parsr | grobid |
---|---|---|
Mentions | 7 | 11 |
Stars | 5,640 | 3,036 |
Growth | 0.9% | - |
Activity | 4.6 | 9.3 |
Last commit | 5 months ago | 7 days ago |
Language | JavaScript | Java |
License | Apache License 2.0 | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
Parsr
-
LlamaCloud and LlamaParse
I'm part of the team that built LlamaParse. It's a net improvement compared to other PDF->Structured Text extractors (I've built several in the past, including https://github.com/axa-group/Parsr).
For character extraction, LlamaParse uses a mixture of OCR and character extraction from the PDF (it's the only parser I'm aware of that addresses some of the buggy PDF font issues; check the 'text' mode to see the raw document before reconstruction). It then uses a mixture of heuristics and machine learning models to reconstruct the document.
Once plugged into a recursive retrieval strategy, it allows you to get SOTA results on question answering over complex text (see notebook: https://github.com/run-llama/llama_parse/blob/main/examples/...).
AMA
-
Issue getting Parsr GUI up and running
Link to the GitHub repository and predefined Docker Compose build file for Parsr
-
PDF GPT allows you to chat with the contents of your PDF file
I would check out https://github.com/Unstructured-IO/unstructured (what LangChain uses) or https://github.com/axa-group/Parsr (probably what unstructured copied to get their startup off the ground lol)
-
Converting PDF into HTML: is it possible?
Things I still want to try: - Parsr
-
Does anyone know where I can get access to a prebuilt general document understanding model?
I personally haven't used it yet, but I heard some good things about Parsr: https://github.com/axa-group/Parsr
- Turn your (PDF,Image) documents into structured data
-
[D] What pdf parser do you use for paragraph parsing for huggingface models
Parsing PDFs is a very non-trivial process. Google's and Amazon's parsers are largely based on OCR. There are some advanced state-of-the-art NN-based OCR approaches, but they are not very stable. A stable industry standard is Tesseract, and a nice all-in-one open source tool that brings a ton of tools together is https://github.com/axa-group/Parsr. Hope this helps.
grobid
- Show HN: Open-source Rule-based PDF parser for RAG
- How to ingest image based PDFs into private GPT model?
- 🥪 Best Sites For ebooks, articles, research papers etc..🥪
- Grobid – ML software for extracting information from scholarly documents
-
How to create a web app that turns academic papers into text documents
Interesting concept. Grobid tries to do the same https://github.com/kermitt2/grobid
-
Extract research paper's references
I would suggest using grobid - a pipeline for extracting scientific PDFs into a common XML format which can be easily parsed. Grobid has quite a nice mature REST API that I've used in some of my own projects. It parses references and matches them to their DOI using the CrossRef API with a reported 95% F1 score. This should make your job pretty simple as far as I can tell - all you'd need to do is run your papers through grobid and then build a citation graph by comparing document DOIs.
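The "compare document DOIs" step could look roughly like the sketch below. It assumes the TEI XML for each paper has already been obtained from Grobid, and that each parsed reference carries its DOI in an `<idno type="DOI">` element inside a `<biblStruct>`; that layout is an assumption about the output and is worth checking against your own files. The function names `extract_reference_dois` and `build_citation_graph` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# TEI namespace used in the XML (assumed here; check your own output).
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def extract_reference_dois(tei_xml: str) -> list[str]:
    """Return the DOIs found in the reference list, lowercased, in document order."""
    root = ET.fromstring(tei_xml)
    dois = []
    for bibl in root.iterfind(".//tei:listBibl/tei:biblStruct", TEI_NS):
        idno = bibl.find(".//tei:idno[@type='DOI']", TEI_NS)
        if idno is not None and idno.text:
            # DOIs are case-insensitive, so normalise before comparing.
            dois.append(idno.text.strip().lower())
    return dois


def build_citation_graph(papers: dict[str, str]) -> dict[str, set[str]]:
    """Map each paper's DOI to the DOIs of other papers in the corpus that it cites.

    `papers` maps a paper's own DOI to its TEI XML string.
    """
    known = {doi.lower() for doi in papers}
    graph = {}
    for doi, tei_xml in papers.items():
        cited = set(extract_reference_dois(tei_xml))
        # Keep only edges that point at papers we actually have.
        graph[doi.lower()] = (cited & known) - {doi.lower()}
    return graph
```

References without a matched DOI are simply skipped here, so the graph will miss those edges; matching on titles would be a separate fallback.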
- Free/open-source alternatives to Connected Papers...?
-
Seeking Advice: How to extract Abstract from scientific journals (.pdfs) 10k+.
Just use science-parse or GROBID. They have been designed for that exact reason.
-
Project to rebuild papers with plaintext markup languages
- I ended up using Grobid, which converts the PDF to a very detailed XML format. It is not a word processing format, though, but a format specifically for representing scientific documents. I don't know if it would, for example, contain tags for bold or italicized text. The tool works really well, but since you probably cannot use the output XML directly, it will need some postprocessing, which would be relatively simple with XML parsing libraries.
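As a rough idea of what that postprocessing might involve, here is a minimal sketch using only Python's standard XML library. It assumes a TEI layout with the title under `titleStmt`, the abstract under `profileDesc`, and body sections as `div` elements containing a `head` and `p` children; those paths are assumptions and may need adjusting for real output. `tei_to_plaintext` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# TEI namespace (assumed; verify against your own XML files).
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def _text(elem) -> str:
    """Flatten an element and its children to text with collapsed whitespace."""
    return " ".join("".join(elem.itertext()).split())


def tei_to_plaintext(tei_xml: str) -> dict:
    """Pull the title, abstract, and section paragraphs out of a TEI document."""
    root = ET.fromstring(tei_xml)
    title = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    abstract = root.find(".//tei:profileDesc/tei:abstract", TEI_NS)

    sections = []
    for div in root.iterfind(".//tei:text/tei:body/tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        sections.append({
            "heading": _text(head) if head is not None else "",
            # Inline markup such as citation refs is flattened into the text.
            "paragraphs": [_text(p) for p in div.iterfind("tei:p", TEI_NS)],
        })

    return {
        "title": _text(title) if title is not None else "",
        "abstract": _text(abstract) if abstract is not None else "",
        "sections": sections,
    }
```

From the returned dictionary it is a short step to emit Markdown or another plaintext markup, for example by writing each heading followed by its paragraphs.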
-
[D] What pdf parser do you use for paragraph parsing for huggingface models
A few years ago I evaluated a few open source tools. In the end I focused on GROBID. As usual, whether it works well for your use case depends on the type of document. There is some focus on it being "fast" (if that is a concern).
What are some alternatives?
Teedy - Lightweight document management system packed with all the features you can expect from big expensive solutions
CERMINE - Content ExtRactor and MINEr
unstructured - Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
Smile - Statistical Machine Intelligence & Learning Engine
Ambar - Document Search Engine
science-parse - Science Parse parses scientific papers (in PDF form) and returns them in structured form.
deriveODM - DeriveODM is a reactive ODM - Object Document Mapper - framework, a "wrapper" around MongoDB, that removes all the hassle of data-persistence by handling it transparently in the background, in a DRY manner.
datahub - The Metadata Platform for your Data Stack
gpt4-pdf-chatbot-langchain - GPT4 & LangChain Chatbot for large PDF docs
Tribuo - Tribuo - A Java machine learning library
edenai-javascript - The best AI engines in one API: vision, text, speech, translation, OCR, machine learning, etc. SDK and examples for JavaScript developers.
Deep Java Library (DJL) - An Engine-Agnostic Deep Learning Framework in Java