Marker: Convert PDF to Markdown quickly with high accuracy

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • marker

    Convert PDF to markdown quickly with high accuracy

  • tesseract-ocr

    Tesseract Open Source OCR Engine (main repository)

  • Last update was pretty recent, and the git repo mentions Tesseract 5 as a dependency, so it has likely moved on a bit since you last tried it:

    https://github.com/tesseract-ocr/tesseract/releases

    I suppose it depends on your use-case. For personal tasks like this it should be more than sufficient, and won't need user details/cc or whatever to use it.

  • libgen_to_txt

    Convert all of libgen to high quality markdown

  • Author here - this is one of the reasons I made this. Also see https://github.com/VikParuchuri/libgen_to_txt , although I haven't integrated marker with it yet (it uses naive text extraction).

  • pdfcpu

    A PDF processor written in Go.

  • I can report that the closest I've come before is with PDFMiner (https://pypi.org/project/pdfminer/) for Python. The benefit of this one is that it retains styling information, so italics and the like can be preserved, at least with some post-processing (I think one might need to convert certain CSS classes to actual HTML tags).

    The other option I have started looking into is the PDFCPU library for Go. It is a bit more low-level than PDFMiner, but it outputs very well-structured info, which seems like it could be post-processed quite well for one's particular use case and PDF layouts: https://github.com/pdfcpu/pdfcpu

    I also now tried the Marker tool in the OT, and it seems to do a reasonable job. It did intermingle some columns though, at least in some tricky cases, such as when there was a round image between the two columns. One note is that Marker doesn't seem to retain styling like italics.
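
The post-processing step mentioned in the comment above (turning styling classes into real tags) could look roughly like the sketch below. It is only an illustration under stated assumptions: the class names `italic` and `it` are hypothetical, the function `spans_to_em` is made up for this example, and the input is assumed to be HTML with simple, non-nested `<span class="...">` elements. The actual markup produced by a given converter should be inspected first.

```python
import re

# Hypothetical class names that mark italic text; check what your
# converter actually emits before relying on these.
ITALIC_CLASSES = {"italic", "it"}

# Matches simple, non-nested <span class="...">...</span> elements.
SPAN_RE = re.compile(r'<span class="([^"]*)">(.*?)</span>', re.DOTALL)


def spans_to_em(html: str) -> str:
    """Replace spans carrying an italic class with <em> tags."""

    def repl(match: re.Match) -> str:
        classes = set(match.group(1).split())
        text = match.group(2)
        if classes & ITALIC_CLASSES:
            return f"<em>{text}</em>"
        # Leave spans with other classes untouched.
        return match.group(0)

    return SPAN_RE.sub(repl, html)


if __name__ == "__main__":
    sample = '<p>A <span class="italic">very</span> good <span class="bold">day</span></p>'
    print(spans_to_em(sample))
```

A regular expression is enough for flat spans like these; for nested or irregular markup a proper HTML parser would be the safer choice.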

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.

