marker

Convert PDF to markdown + JSON quickly with high accuracy (by VikParuchuri)

Marker Alternatives

Similar projects and alternatives to marker

  1. tesseract-ocr

    Tesseract Open Source OCR Engine (main repository)

  3. FLiPStackWeekly

    FLaNK AI Weekly covering Apache NiFi, Apache Flink, Apache Kafka, Apache Spark, Apache Iceberg, Apache Ozone, Apache Pulsar, and more...

  4. PaddleOCR

    69 marker VS PaddleOCR

    Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)

  5. ripgrep-all

    rga: ripgrep, but also search in PDFs, E-Books, Office documents, zip, tar.gz, etc.

  6. EasyOCR

    42 marker VS EasyOCR

    Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic and etc.

  7. nougat

    19 marker VS nougat

    Implementation of Nougat Neural Optical Understanding for Academic Documents

  8. surya

    16 marker VS surya

    OCR, layout analysis, reading order, table recognition in 90+ languages

  10. unstructured

    Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

  11. PyMuPDF

    8 marker VS PyMuPDF

    PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

  12. llmsherpa

    6 marker VS llmsherpa

    Developer APIs to Accelerate LLM Projects

  13. zerox

    15 marker VS zerox

    OCR & Document Extraction using vision models

  14. SecureAI-Tools

    12 marker VS SecureAI-Tools

    Private and secure AI tools for everyone's productivity.

  15. document-ai-samples

    7 marker VS document-ai-samples

    Sample applications and demos for Document AI, the end-to-end document processing platform on Google Cloud

  16. voyager

    4 marker VS voyager

    🛰️ An approximate nearest-neighbor search library for Python and Java with a focus on ease of use, simplicity, and deployability. (by spotify)

  17. benchmark

    6 marker VS benchmark

    OCR Benchmark (by getomni-ai)

  18. narrator

    5 marker VS narrator

    David Attenborough narrates your life

  19. llama_cloud_services

    Knowledge Agents and Management in the Cloud

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a better marker alternative or higher similarity.

marker reviews and mentions

Posts with mentions or reviews of marker. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2025-04-03.
  • Using Docling’s OCR features with RapidOCR
    9 projects | dev.to | 3 Apr 2025
  • German parliament votes as a Git contribution graph
    6 projects | news.ycombinator.com | 24 Mar 2025
    I usually convert local laws from PDF to markdown when I need them as a reference for the functional specification. That way it's easier to point to the exact section.

    It's not easy to convert general law to markdown; it has involved online converters and manual fixes. I'm currently experimenting with marker [1] on local LLM hardware, and so far it is the best out there.

    [1]: https://github.com/VikParuchuri/marker

  • Mistral OCR
    7 projects | news.ycombinator.com | 6 Mar 2025
    > with LLM as a judge

    For anyone else interested, prompt is here [0]. The model used was gemini-2.0-flash-001.

    Benchmarks are hard, and I understand the appeal of having something that seems vaguely deterministic rather than having a human in the loop, but I have a very hard time accepting any LLM-judged benchmarks at face value. This is doubly true when we're talking about something like OCR which, as you say, is a very hard problem for computers of any sort.

    I'm assuming you've given this some thought—how did you arrive at using an LLM to benchmark OCR vs other LLMs? What limitations with your benchmark have you seen/are you aware of?

    [0] https://github.com/VikParuchuri/marker/blob/master/benchmark...

  • OlmOCR, Ai2's open-source tool to extract clean plain text from PDFs
    3 projects | news.ycombinator.com | 28 Feb 2025
    I'm a fan of the team of Allen AI and their work. Unfortunately, the benchmarking of olmocr against marker (https://github.com/VikParuchuri/marker) is quite flawed.

    Throughput - they benchmarked marker API cost vs local inference cost for olmocr. In our testing, marker run locally gets 20 to 120 pages per second on an H100 (without custom kernels, etc). Olmocr in our testing gets between 0.4 (unoptimized) and 4 (sglang) pages per second on the same machine.

    Accuracy - their quality benchmarks are based on win rate with only 75 samples - which are different between each tool pair. The samples were filtered down from a set of ~2000 based on opaque criteria. They then asked researchers at Allen AI to judge which output was better. When we benchmarked with our existing set and LLM as a judge, we got a 56% win rate for marker across 1,107 documents. We had to filter out non-English docs, since olmocr is English-only (marker is not).

    Hallucinations/other problems - we noticed a lot of missing text and hallucinations with olmocr in our benchmark set. You can see sample output and llm ratings here - https://huggingface.co/datasets/datalab-to/marker_benchmark_... .

    You can see all benchmark code at https://github.com/VikParuchuri/marker/tree/master/benchmark... .

    Happy to chat more with anyone at Allen AI who wants to discuss this. I think olmocr is a great contribution - happy to help you benchmark marker more fairly.
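
As a rough back-of-the-envelope sketch, the throughput figures quoted in the comment above can be turned into wall-clock estimates. The rates are the commenter's own numbers; the corpus size and the helper names `hours_for` and `speedup_range` are hypothetical, chosen only for illustration, and the calculation assumes a steady rate on a single machine.

```python
# Throughput (pages per second on one H100) as quoted in the comment above.
MARKER_PPS = (20.0, 120.0)  # marker, local inference (low, high)
OLMOCR_PPS = (0.4, 4.0)     # olmocr: unoptimized (low) to sglang (high)


def hours_for(pages: int, pages_per_second: float) -> float:
    """Wall-clock hours to process `pages` at a steady rate."""
    return pages / pages_per_second / 3600.0


def speedup_range(fast, slow):
    """Most conservative and most optimistic ratio between two (low, high) ranges."""
    return fast[0] / slow[1], fast[1] / slow[0]


if __name__ == "__main__":
    corpus = 1_000_000  # hypothetical corpus size, not from the source
    for name, (low, high) in (("marker", MARKER_PPS), ("olmocr", OLMOCR_PPS)):
        best = hours_for(corpus, high)
        worst = hours_for(corpus, low)
        print(f"{name}: {best:.1f} to {worst:.1f} hours")
    slowest_gap, widest_gap = speedup_range(MARKER_PPS, OLMOCR_PPS)
    print(f"speedup: {slowest_gap:.0f}x to {widest_gap:.0f}x")
```

The ratio is reported as a range because both tools were quoted as ranges: the conservative end compares marker's slowest figure with olmocr's fastest, and the optimistic end does the reverse.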

  • Show HN: Benchmarking VLMs vs. Traditional OCR
    6 projects | news.ycombinator.com | 20 Feb 2025
  • Ask HN: What is the best method for turning a scanned book as a PDF into text?
    13 projects | news.ycombinator.com | 16 Feb 2025
  • Parsing PDFs (and more) in Elixir using Rust
    3 projects | news.ycombinator.com | 29 Jan 2025
    Very cool, any plans for a dockerized API of marker similar to what Unstructured released? I know you have a very attractively priced serverless offering (https://www.datalab.to) but having something to develop against locally would be great (for those of us not in the Python world).
  • Datalab – SOTA document intelligence models
    1 project | news.ycombinator.com | 10 Jan 2025
  • Ask HN: Who is hiring? (January 2025)
    18 projects | news.ycombinator.com | 2 Jan 2025
    Datalab | NYC | Full-time | Software Engineer and Head of Business Ops | $250k-$350k + 1.5-3% equity | https://www.datalab.to

    A significant % of useful data is locked away in tough-to-parse formats like PDFs. We build tools to extract it, like https://github.com/VikParuchuri/surya (15k GitHub stars), and https://github.com/VikParuchuri/marker (19k stars). We also run an inference API and product.

    We do meaningful research (we’ve trained several SoTA models), ship product, and contribute to open source. We’re hiring for 2 roles to help us scale:

    Senior fullstack software engineer

    - work across our open source repos, inference api, and frontend product

  • AI projects to watch: Mozilla's first Builders Accelerator cohort kicks off
    4 projects | news.ycombinator.com | 28 Sep 2024
    Particularly noteworthy for me (as someone doing Elixir & ML at the moment) is the $100k allocated to Elixir Nx:

    > Building a cutting-edge distributed ML framework with Elixir from Nx – Creating a scalable, distributed machine learning framework that outperforms current solutions, the Nx project leverages Elixir’s unique strengths in multitasking and fault-tolerance. By automatically spreading tasks across multiple devices, Nx aims to deliver a more scalable and efficient solution.

    And lots of other interesting projects, including https://github.com/enjalot/latent-scope, https://github.com/Intelligent-Instruments-Lab/tolvera, https://ente.io/, https://github.com/VikParuchuri/marker ...


Stats

Basic marker repo stats
Mentions: 30
Stars: 24,879
Activity: 9.8
Last commit: 6 days ago
