Marker Alternatives
Similar projects and alternatives to marker
- FLiPStackWeekly
FLaNK AI Weekly covering Apache NiFi, Apache Flink, Apache Kafka, Apache Spark, Apache Iceberg, Apache Ozone, Apache Pulsar, and more...
- PaddleOCR
Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)
- EasyOCR
Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic and etc.
- unstructured
Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
- PyMuPDF
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
- document-ai-samples
Sample applications and demos for Document AI, the end-to-end document processing platform on Google Cloud
- voyager
🛰️ An approximate nearest-neighbor search library for Python and Java with a focus on ease of use, simplicity, and deployability. (by spotify)
marker discussion
marker reviews and mentions
- Using Docling’s OCR features with RapidOCR
- German parliament votes as a Git contribution graph
I usually convert local laws from PDF to markdown when I need them as a reference for a functional specification. That way it's easier to point to the exact section.
It's not easy to convert general laws to markdown; it has involved online converters and manual fixes. I'm currently experimenting with marker [1] on local LLM hardware, and so far it is the best out there.
[1]: https://github.com/VikParuchuri/marker
- Mistral OCR
> with LLM as a judge
For anyone else interested, prompt is here [0]. The model used was gemini-2.0-flash-001.
Benchmarks are hard, and I understand the appeal of having something that seems vaguely deterministic rather than having a human in the loop, but I have a very hard time accepting any LLM-judged benchmarks at face value. This is doubly true when we're talking about something like OCR which, as you say, is a very hard problem for computers of any sort.
I'm assuming you've given this some thought—how did you arrive at using an LLM to benchmark OCR vs other LLMs? What limitations with your benchmark have you seen/are you aware of?
[0] https://github.com/VikParuchuri/marker/blob/master/benchmark...
- OlmOCR, Ai2's open-source tool to extract clean plain text from PDFs
I'm a fan of the team of Allen AI and their work. Unfortunately, the benchmarking of olmocr against marker (https://github.com/VikParuchuri/marker) is quite flawed.
Throughput - they benchmarked marker API cost vs local inference cost for olmocr. In our testing, marker locally gets 20-120 pages per second on an H100 (without custom kernels, etc). Olmocr in our testing gets between 0.4 (unoptimized) and 4 (sglang) pages per second on the same machine.
Accuracy - their quality benchmarks are based on win rate with only 75 samples - which are different between each tool pair. The samples were filtered down from a set of ~2000 based on opaque criteria. They then asked researchers at Allen AI to judge which output was better. When we benchmarked with our existing set and LLM as a judge, we got a 56% win rate for marker across 1,107 documents. We had to filter out non-English docs, since olmocr is English-only (marker is not).
Hallucinations/other problems - we noticed a lot of missing text and hallucinations with olmocr in our benchmark set. You can see sample output and LLM ratings here - https://huggingface.co/datasets/datalab-to/marker_benchmark_... .
You can see all benchmark code at https://github.com/VikParuchuri/marker/tree/master/benchmark... .
Happy to chat more with anyone at Allen AI who wants to discuss this. I think olmocr is a great contribution - happy to help you benchmark marker more fairly.
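To illustrate why the sample size matters in the accuracy comparison above, here is a minimal sketch of how much uncertainty surrounds a win rate at 75 samples versus 1,107. It is not taken from the marker or olmocr benchmark code. The function name `win_rate_interval` and the win counts are hypothetical: 620 of 1,107 approximates the reported 56%, and 42 of 75 applies the same 56% rate to the smaller sample purely for comparison. The interval uses a simple normal approximation.

```python
import math


def win_rate_interval(wins: int, total: int, z: float = 1.96) -> tuple[float, float, float]:
    """Return (win_rate, low, high) using a normal-approximation interval.

    z = 1.96 corresponds to a 95% interval. This helper is hypothetical
    and is not part of any benchmark code referenced in the comment.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = wins / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    # Clamp to [0, 1] because a rate cannot fall outside that range.
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Assumed counts for illustration only: both correspond to a ~56% win rate.
for wins, total in [(620, 1107), (42, 75)]:
    p, low, high = win_rate_interval(wins, total)
    print(f"n={total}: win rate {p:.3f}, 95% interval [{low:.3f}, {high:.3f}]")
```

Under these assumptions, the interval at 75 samples is roughly four times wider than at 1,107 and includes 50%, so a 56% win rate at that size would not distinguish the two tools, while the same rate at 1,107 documents stays above 50%.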
- Show HN: Benchmarking VLMs vs. Traditional OCR
- Ask HN: What is the best method for turning a scanned book as a PDF into text?
- Parsing PDFs (and more) in Elixir using Rust
Very cool, any plans for a dockerized API of marker similar to what Unstructured released? I know you have a very attractively priced serverless offering (https://www.datalab.to) but having something to develop against locally would be great (for those of us not in the Python world).
- Datalab – SOTA document intelligence models
- Ask HN: Who is hiring? (January 2025)
Datalab | NYC | Full-time | Software Engineer and Head of Business Ops | $250k-$350k + 1.5-3% equity | https://www.datalab.to
A significant % of useful data is locked away in tough-to-parse formats like PDFs. We build tools to extract it, like https://github.com/VikParuchuri/surya (15k Github stars), and https://github.com/VikParuchuri/marker (19k stars). We also run an inference API and product.
We do meaningful research (we’ve trained several SoTA models), ship product, and contribute to open source. We’re hiring for 2 roles to help us scale:
Senior fullstack software engineer
- work across our open source repos, inference api, and frontend product
- AI projects to watch: Mozilla's first Builders Accelerator cohort kicks off
Particularly noteworthy for me (as someone doing Elixir & ML at the moment) is the $100k allocated to Elixir Nx:
> Building a cutting-edge distributed ML framework with Elixir from Nx – Creating a scalable, distributed machine learning framework that outperforms current solutions, the Nx project leverages Elixir’s unique strengths in multitasking and fault-tolerance. By automatically spreading tasks across multiple devices, Nx aims to deliver a more scalable and efficient solution.
And lots of other interesting projects, including https://github.com/enjalot/latent-scope, https://github.com/Intelligent-Instruments-Lab/tolvera, https://ente.io/, https://github.com/VikParuchuri/marker ...
Stats
VikParuchuri/marker is an open source project licensed under the GNU General Public License v3.0 only, which is an OSI-approved license.
The primary programming language of marker is Python.