pdf2doi: A Python library to retrieve the DOI (or other identifiers) from a PDF file

This page summarizes the projects mentioned and recommended in the original post on /r/Python

  • pdf2doi

    A Python library/command-line tool to extract the DOI or other identifiers of a scientific paper from a PDF file.

    For an arXiv paper it is slightly more complicated, because a query to export.arxiv.org returns data in XML format, so you need to parse it (I use feedparser) and then build a valid BibTeX string. You can take a look at the functions arxiv2bib and make_bibtex.
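The parse-then-build step described in the excerpt above can be sketched roughly as follows. This is not pdf2doi's arxiv2bib or make_bibtex: the function names `parse_arxiv_atom` and `to_bibtex` are hypothetical, the standard-library `xml.etree.ElementTree` is swapped in for feedparser so the sketch has no dependencies, and the sample feed is made up to resemble the Atom XML that the arXiv API returns.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace used by the arXiv feed

# Made-up sample shaped like an arXiv API response (not real data).
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <published>2020-01-15T00:00:00Z</published>
    <title>An Example
      Title</title>
    <author><name>Jane Doe</name></author>
    <author><name>John Roe</name></author>
  </entry>
</feed>"""


def parse_arxiv_atom(xml_text):
    """Return ID, title, authors and year of the first entry in the feed."""
    entry = ET.fromstring(xml_text).find(ATOM + "entry")
    if entry is None:
        raise ValueError("no <entry> element in feed")
    return {
        # The <id> element is a URL; keep only the part after /abs/.
        "id": entry.findtext(ATOM + "id").rsplit("/abs/", 1)[-1],
        # Titles may be wrapped over several lines; collapse the whitespace.
        "title": " ".join(entry.findtext(ATOM + "title").split()),
        "authors": [a.findtext(ATOM + "name")
                    for a in entry.findall(ATOM + "author")],
        "year": entry.findtext(ATOM + "published")[:4],
    }


def to_bibtex(meta):
    """Build a BibTeX entry; the key is first author's surname plus year."""
    key = meta["authors"][0].split()[-1].lower() + meta["year"]
    fields = {
        "title": meta["title"],
        "author": " and ".join(meta["authors"]),
        "year": meta["year"],
        "eprint": meta["id"],
        "archivePrefix": "arXiv",
    }
    body = ",\n".join(f"  {k} = {{{v}}}" for k, v in fields.items())
    return f"@article{{{key},\n{body}\n}}"


if __name__ == "__main__":
    print(to_bibtex(parse_arxiv_atom(SAMPLE_FEED)))
```

In real use the XML text would come from a query to export.arxiv.org instead of the embedded sample; characters that are special in BibTeX (such as `&` or `%`) would also need escaping, which this sketch leaves out.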

  • PyPDF2

    A pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

    Look into the metadata of the .pdf file (extracted via PyPDF2) and see if any string matches the pattern of a DOI or an arXiv ID. Priority is given to the metadata which contains the word 'doi' in its label.

  • textract

    extract text from any document. no muss. no fuss.

    Scan the text inside the .pdf file, and check for any string that matches the pattern of a DOI or an arXiv ID. The text is extracted with PyPDF2 and textract.

  • pdftitle

    a utility to extract the title from a PDF file

    Try to find possible titles of the publication. In the current version, possible titles are identified via pdftitle and from the file name. For each possible title, a Google search is performed and the plain text of the first results is scanned for valid identifiers.
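The lookup steps quoted in the entries above (metadata first, then the document text, then a title search) can be sketched as one fallback chain. This is a minimal illustration under stated assumptions, not pdf2doi's implementation: the regular expressions are simplified guesses at what a DOI or arXiv ID looks like, all function names are hypothetical, the metadata is a plain dict standing in for what PyPDF2 would return, and `search` is a caller-supplied function that returns the plain text of search results for a title (no real search API is assumed).

```python
import re

# Assumed, simplified patterns (not taken from pdf2doi).
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
ARXIV_RE = re.compile(r"arXiv:(\d{4}\.\d{4,5}(?:v\d+)?)", re.IGNORECASE)


def find_identifier(text):
    """Return ('doi' | 'arxiv', value) for the first match in text, else None."""
    m = DOI_RE.search(text)
    if m:
        # Drop punctuation that usually belongs to the sentence, not the DOI.
        return "doi", m.group(0).rstrip(".,;")
    m = ARXIV_RE.search(text)
    if m:
        return "arxiv", m.group(1)
    return None


def from_metadata(metadata):
    """Scan metadata values, checking keys whose label mentions 'doi' first."""
    # False sorts before True, so 'doi' keys come first; the sort is stable.
    keys = sorted(metadata, key=lambda k: "doi" not in k.lower())
    for key in keys:
        hit = find_identifier(str(metadata[key]))
        if hit:
            return hit
    return None


def from_titles(titles, search):
    """Search each candidate title and scan the returned text."""
    for title in titles:
        hit = find_identifier(search(title))
        if hit:
            return hit
    return None


def identify(metadata, page_texts, titles, search):
    """Try metadata, then page text, then title search; stop at the first hit."""
    return (from_metadata(metadata)
            or find_identifier("\n".join(page_texts))
            or from_titles(titles, search))
```

Each step only runs if the previous one found nothing, so the slow title search is a last resort. The identifiers used to exercise the sketch would be made-up values such as `10.1000/example`, and a real tool would also validate each candidate before accepting it.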


