Project to rebuild papers with plaintext markup languages

This page summarizes the projects mentioned and recommended in the original post on /r/Open_Science

  • Zettlr

    Your One-Stop Publication Workbench

  • pandoc

    Universal markup converter

    Many publishers also publish a supplementary HTML version these days, which may be an acceptable format or at least easy to convert to other formats with pandoc.

  • grobid

    A machine learning software for extracting information from scholarly documents

    - I ended up using Grobid, which converts the PDF to a very detailed XML format. It is not a word processing format, though, but one designed specifically for representing scientific documents. I don't know whether it would, for example, contain tags for bold or italicized text. The tool works really well, but since you probably cannot use the output XML directly, it will need some postprocessing, which should be relatively simple with XML parsing libraries.

  • pdfextract

    - An alternative is pdfextract by Crossref. They probably use it to build their own large database. It also works really well and gives you JSON that would probably need less postprocessing than Grobid's output. I didn't use it for some minor technical reason that I have since forgotten.

  • pdffigures2

    Given a scholarly PDF, extract figures, tables, captions, and section titles.

    - pdffigures2 is from the team behind Semantic Scholar, and they probably use it to extract the figures shown in their search engine. It extracts only figures and their captions, nothing else. I don't recall whether the other tools can also extract figures, but if they cannot, this would be a perfect supplement.

  • CERMINE

    Content ExtRactor and MINEr

    - Another alternative that's on my list but that I didn't try is Cermine.

  • s2orc-doc2json

    Parsers for scientific papers (PDF2JSON, TEX2JSON, JATS2JSON)

    Their library is built on Grobid and probably makes it easier to use, so it is probably the best overall choice.
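The postprocessing step mentioned in the Grobid comment above can be sketched with Python's standard XML library. This is only a minimal illustration, not a definitive pipeline: `SAMPLE_TEI` is a hand-written, heavily simplified stand-in for Grobid's TEI XML output (real output contains many more elements), and the helper names `tei_to_sections` and `sections_to_markdown` are hypothetical. It pulls out the title, section headings, and paragraphs, then renders them as Markdown.

```python
import xml.etree.ElementTree as ET

# Grobid emits TEI XML, whose elements live in this namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Assumption: a tiny hand-written sample, far simpler than real Grobid output.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title level="a" type="main">An Example Paper</title>
      </titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div>
        <head n="1">Introduction</head>
        <p>First paragraph with a <ref type="bibr">[1]</ref> citation.</p>
        <p>Second paragraph.</p>
      </div>
      <div>
        <head n="2">Methods</head>
        <p>We did things.</p>
      </div>
    </body>
  </text>
</TEI>
"""


def _text(element):
    """Return all text inside an element, with whitespace collapsed."""
    if element is None:
        return ""
    return " ".join("".join(element.itertext()).split())


def tei_to_sections(tei_xml):
    """Extract the title and the body sections from a TEI XML string."""
    root = ET.fromstring(tei_xml)
    title = _text(root.find(".//tei:titleStmt/tei:title", TEI_NS))
    sections = []
    for div in root.findall(".//tei:body/tei:div", TEI_NS):
        sections.append({
            "heading": _text(div.find("tei:head", TEI_NS)),
            "paragraphs": [_text(p) for p in div.findall("tei:p", TEI_NS)],
        })
    return {"title": title, "sections": sections}


def sections_to_markdown(doc):
    """Render the extracted structure as plain Markdown."""
    lines = [f"# {doc['title']}", ""]
    for section in doc["sections"]:
        if section["heading"]:
            lines += [f"## {section['heading']}", ""]
        for paragraph in section["paragraphs"]:
            lines += [paragraph, ""]
    return "\n".join(lines).strip() + "\n"


if __name__ == "__main__":
    print(sections_to_markdown(tei_to_sections(SAMPLE_TEI)))
```

Inline markup such as citation references is flattened to plain text here; keeping it would mean walking each paragraph's child elements instead of calling `itertext()`.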

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
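
For the pandoc suggestion in the list above (converting a publisher's supplementary HTML version), a small wrapper might look like the following. It is a sketch under stated assumptions: pandoc must be installed and on the PATH, the file names are placeholders, and the helper names `build_pandoc_command` and `convert_html` are hypothetical. Only pandoc's standard `--from`, `--to`, and `--output` options are used.

```python
import subprocess
from pathlib import Path


def build_pandoc_command(html_path, output_path, to_format="markdown"):
    """Build the pandoc argument list for converting an HTML file."""
    return [
        "pandoc",
        "--from", "html",
        "--to", to_format,
        "--output", str(output_path),
        str(html_path),
    ]


def convert_html(html_path, output_path, to_format="markdown"):
    """Run pandoc; raises CalledProcessError if the conversion fails."""
    command = build_pandoc_command(html_path, output_path, to_format)
    subprocess.run(command, check=True)
    return Path(output_path)
```

Calling `convert_html("paper.html", "paper.md")` would write a Markdown version next to the input. Publisher HTML usually carries navigation and other page chrome, so the result will likely still need manual cleanup.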
