pdf2doi: A Python library to retrieve the DOI (or other identifiers) from a PDF file

pdf2doi

2 84 4.4 Python

A Python library/command-line tool to extract the DOI or other identifiers of a scientific paper from a PDF file.

For an arXiv paper, it is slightly more complicated because a query to export.arxiv.org returns data in XML format, so you need to parse it (I use feedparser) and then build a valid BibTeX string. You can take a look at the functions arxiv2bib and make_bibtex.
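As a rough illustration of the parse-then-build step described above, here is a minimal sketch. It swaps feedparser for the standard-library xml.etree.ElementTree so it runs without extra dependencies. The function names fetch_arxiv_atom and atom_to_bibtex are hypothetical (they are not the arxiv2bib and make_bibtex functions mentioned above), and the choice of BibTeX fields and citation key is an assumption.

```python
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace used by the arXiv API


def fetch_arxiv_atom(arxiv_id):
    """Download the Atom XML (as bytes) for one arXiv ID; needs network access."""
    url = f"http://export.arxiv.org/api/query?id_list={arxiv_id}"
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def atom_to_bibtex(atom_xml, arxiv_id):
    """Build a BibTeX @article string from the first <entry> of an Atom feed."""
    root = ET.fromstring(atom_xml)
    entry = root.find(f"{ATOM}entry")
    if entry is None:
        raise ValueError("no <entry> element in the feed")
    # Collapse line breaks and runs of spaces inside the title.
    title = " ".join(entry.findtext(f"{ATOM}title", default="").split())
    authors = [
        " ".join(name.text.split())
        for name in entry.findall(f"{ATOM}author/{ATOM}name")
        if name.text
    ]
    year = entry.findtext(f"{ATOM}published", default="")[:4]
    # Hypothetical key scheme: last word of the first author's name plus the year.
    key = (authors[0].split()[-1] if authors else "arxiv") + year
    fields = {
        "title": title,
        "author": " and ".join(authors),
        "year": year,
        "eprint": arxiv_id,
        "archivePrefix": "arXiv",
    }
    body = ",\n".join(f"  {k} = {{{v}}}" for k, v in fields.items())
    return f"@article{{{key},\n{body}\n}}"
```

With network access, atom_to_bibtex(fetch_arxiv_atom(some_id), some_id) would chain the two steps. Characters that are special in BibTeX are not escaped here.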
PyPDF2

30 7,359 9.5 Python

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files

Look into the metadata of the .pdf file (extracted via PyPDF2) and see if any string matches the pattern of a DOI or an arXiv ID. Priority is given to metadata fields whose label contains the word 'doi'.
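The metadata lookup with 'doi' priority could look roughly like the sketch below. It takes the metadata as a plain dict of label to value, so the PyPDF2 extraction step is left out. The regular expressions are simplified assumptions rather than pdf2doi's own patterns, and the function names are hypothetical.

```python
import re

# Simplified patterns (assumptions, not pdf2doi's own expressions).
DOI_RE = re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")
ARXIV_RE = re.compile(r"arXiv:(\d{4}\.\d{4,5})(?:v\d+)?", re.IGNORECASE)


def find_identifier(text):
    """Return ("doi", value) or ("arxiv", value) for the first match, else None."""
    match = DOI_RE.search(text)
    if match:
        # Drop punctuation that often trails a DOI at the end of a sentence.
        return "doi", match.group(0).rstrip(".,;")
    match = ARXIV_RE.search(text)
    if match:
        return "arxiv", match.group(1)
    return None


def identifier_from_metadata(metadata):
    """Scan metadata values, visiting labels that contain 'doi' first."""
    items = [(str(k), str(v)) for k, v in metadata.items() if v is not None]
    # The sort is stable and False sorts before True, so 'doi' labels come first.
    items.sort(key=lambda kv: "doi" not in kv[0].lower())
    for _label, value in items:
        found = find_identifier(value)
        if found:
            return found
    return None
```

Because labels containing 'doi' are checked first, a DOI-like string that happens to appear in another field such as the title is only used when no 'doi' field yields a match.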

textract

4 3,771 3.7 HTML

extract text from any document. no muss. no fuss.

Scan the text inside the .pdf file, and check for any string that matches the pattern of a DOI or an arXiv ID. The text is extracted with PyPDF2 and textract.
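A full-text scan of this kind might be sketched as follows, assuming the page text has already been extracted into a list of strings (for example by PyPDF2 or textract, which are not called here). The patterns are simplified assumptions and scan_pages is a hypothetical name.

```python
import re

# Simplified patterns, chosen for illustration (assumptions).
DOI_RE = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")
ARXIV_RE = re.compile(r"\barXiv:\s*(\d{4}\.\d{4,5})(?:v\d+)?", re.IGNORECASE)


def scan_pages(pages):
    """Collect unique (kind, value) candidates, page by page.

    Within a page, DOIs are listed before arXiv IDs.
    """
    candidates = []
    for text in pages:
        for match in DOI_RE.finditer(text):
            # Strip punctuation that commonly follows a DOI in running text.
            candidate = ("doi", match.group(0).rstrip(".,;)"))
            if candidate not in candidates:
                candidates.append(candidate)
        for match in ARXIV_RE.finditer(text):
            candidate = ("arxiv", match.group(1))
            if candidate not in candidates:
                candidates.append(candidate)
    return candidates
```

Returning every candidate rather than only the first leaves room for a later validation step, since a paper's text usually also contains the DOIs of the works it cites.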
pdftitle

1 127 0.0 Python

a utility to extract the title from a PDF file

Try to find possible titles of the publication. In the current version, possible titles are identified via pdftitle and from the file name. For each possible title, a Google search is performed and the plain text of the first results is scanned for valid identifiers.
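The file-name part of this step can be illustrated with a small sketch. It only derives a candidate title and builds a search URL; no request is sent and no result page is parsed. Both function names are hypothetical, and treating underscores, hyphens, plus signs and dots as word separators is an assumption that will also split genuinely hyphenated words.

```python
import re
from pathlib import PurePath
from urllib.parse import urlencode


def title_from_filename(path):
    """Turn 'some_paper-title.pdf' into the search phrase 'some paper title'."""
    stem = PurePath(path).stem  # file name without directory or extension
    return " ".join(re.split(r"[\s_\-+.]+", stem)).strip()


def google_search_url(title):
    """Build a search URL for the quoted title (the request itself is not sent)."""
    return "https://www.google.com/search?" + urlencode({"q": f'"{title}"'})
```

For example, title_from_filename("papers/Observation_of_gravitational-waves.pdf") gives "Observation of gravitational waves", which can then be passed to google_search_url.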

NOTE: The number of mentions on this list indicates mentions on common posts plus user-suggested alternatives. Hence, a higher number means a more popular project.
