-
Hi, I'm the author of marker - https://github.com/VikParuchuri/marker - from my testing, marker handles almost all the issues you mentioned. The biggest issue (that I'm working on fixing right now) is formatting tables properly.
-
InfluxDB
InfluxDB – Built for High-Performance Time Series Workloads. InfluxDB 3 OSS is now GA. Transform, enrich, and act on time series data directly in the database. Automate critical tasks and eliminate the need to move data externally. Download now.
-
Previously have used https://github.com/pdf2htmlEX/pdf2htmlEX to convert PDF to HTML at scale, could potentially try and parse the output html to markdown as second stage.
-
PyMuPDF
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
-
-
-
-
AI-in-a-Box
AI-in-a-Box leverages the expertise of Microsoft across the globe to develop and provide AI and ML solutions to the technical community. Our intent is to present a curated collection of solution accelerators that can help engineers establish their AI/ML environments and solutions rapidly and with minimal friction.
https://github.com/Azure/AI-in-a-Box/tree/main/ai-services/d...
-
SaaSHub
SaaSHub - Software Alternatives and Reviews. SaaSHub helps you find the best software and product alternatives
-
-
document-ai-samples
Sample applications and demos for Document AI, the end-to-end document processing platform on Google Cloud
-
-
-
PaddleOCR
Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)
-
wdoc
Summarize and query from a lot of heterogeneous documents. Any LLM provider, any filetype, advanced RAG, advanced summaries, scriptable, etc
For my RAG projet [WDoc](https://github.com/thiswillbeyourgithub/WDoc/tree/dev) I use multiple pdf parser then use heuristics the keep the best one. The code is at https://github.com/thiswillbeyourgithub/WDoc/blob/654c05c5b2...
And the heurstics are partly based on using fasttext to detecr languages : https://github.com/thiswillbeyourgithub/WDoc/blob/654c05c5b2...
It's probably crap for tables but I don't want to rely on external parsers.