Marker: Convert PDF to Markdown quickly with high accuracy

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

Our great sponsors
  • InfluxDB - Power Real-Time Data Analytics at Scale
  • WorkOS - The modern identity platform for B2B SaaS
  • SaaSHub - Software Alternatives and Reviews
  • marker

    Convert PDF to markdown quickly with high accuracy

  • tesseract-ocr

    Tesseract Open Source OCR Engine (main repository)

  • Last update was pretty recent, and the git mentions tesseract 5 as a dep. so it's likely moved on a bit from when you last tried it:

    https://github.com/tesseract-ocr/tesseract/releases

    I suppose it depends on your use-case. For personal tasks like this it should be more than sufficient, and won't need user details/cc or whatever to use it.

  • InfluxDB

    Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.

    InfluxDB logo
  • libgen_to_txt

    Convert all of libgen to high quality markdown

  • Author here - this is one of the reasons I made this. Also see https://github.com/VikParuchuri/libgen_to_txt , although I haven't integrated marker with it yet (it uses naive text extraction).

  • pdfcpu

    A PDF processor written in Go.

  • I can report that the closest I've came before is with PDFMiner (https://pypi.org/project/pdfminer/) for Python. The benefit of this one is that it retains styling information, so that italics and the like can be retained, at least with some post-processing (I think one might need to convert certain CSS-classes to actual or tags).

    The other option I have started looking into is the PDFCPU library for Go. It is a bit more low-level than PDFMiner, but one gets out very well structured info, that seem it might be possible to post-process quite well, for one's particular use case and PDF layouts: https://github.com/pdfcpu/pdfcpu

    I also now tried the Marker tool in the OT, and it seems to do a reasonable job. It did intermingle some columns though, at least in some tricky cases such as when there were a round shaped image in between the two columns. One note is that Marker doesn't seem to retain styling like italics though.

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

Suggest a related project

Related posts