-
pdf2doi
A python library/command-line tool to extract the DOI or other identifiers of a scientific paper from a pdf file.
For an arXiv paper it is slightly more complicated, because a query to export.arxiv.org returns data in XML format, so you need to parse it (I use feedparser) and then build a valid BibTeX string. You can take a look at the functions arxiv2bib and make_bibtex.
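As a rough illustration of the parse-then-build step, here is a minimal sketch. It is not the code of arxiv2bib or make_bibtex: it swaps feedparser for the standard library's xml.etree.ElementTree so that it has no third-party dependencies, and the names fetch_arxiv_feed, atom_entry_to_bibtex and SAMPLE_FEED, the citation-key scheme, and the choice of BibTeX fields are all assumptions made for the example. The sample feed is made up and only mimics the shape of an Atom response.

```python
import re
import urllib.request
import xml.etree.ElementTree as ET

# Atom namespace used by the elements of the feed.
ATOM = "{http://www.w3.org/2005/Atom}"


def fetch_arxiv_feed(arxiv_id):
    """Download the raw Atom feed for one arXiv ID (hypothetical helper, needs network)."""
    url = f"http://export.arxiv.org/api/query?id_list={arxiv_id}"
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()


def atom_entry_to_bibtex(atom_xml):
    """Build a BibTeX @article string from the first <entry> of an Atom feed."""
    root = ET.fromstring(atom_xml)
    entry = root.find(f"{ATOM}entry")
    if entry is None:
        raise ValueError("no <entry> element found in the feed")

    # Titles may be wrapped over several lines; collapse the whitespace.
    title = " ".join(entry.findtext(f"{ATOM}title", default="").split())
    authors = [
        author.findtext(f"{ATOM}name", default="").strip()
        for author in entry.findall(f"{ATOM}author")
    ]
    year = entry.findtext(f"{ATOM}published", default="")[:4]

    # The <id> element is a URL such as http://arxiv.org/abs/<id>v<version>.
    url = entry.findtext(f"{ATOM}id", default="").strip()
    arxiv_id = re.sub(r"v\d+$", "", url.rsplit("/abs/", 1)[-1])

    # Assumed key scheme: last word of the first author's name plus the year.
    last_name = authors[0].split()[-1] if authors else "unknown"
    key = f"{last_name}{year}".lower()

    fields = {
        "title": title,
        "author": " and ".join(authors),
        "year": year,
        "eprint": arxiv_id,
        "archivePrefix": "arXiv",
        "url": url,
    }
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@article{{{key},\n{body}\n}}"


# Made-up feed, only for trying the parser without a network connection.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <published>2020-01-02T00:00:00Z</published>
    <title>An Example
      Title</title>
    <author><name>Ada Example</name></author>
    <author><name>Bob Sample</name></author>
  </entry>
</feed>"""

if __name__ == "__main__":
    print(atom_entry_to_bibtex(SAMPLE_FEED))
```

For a real paper you would pass the result of fetch_arxiv_feed to atom_entry_to_bibtex instead of the sample string. A real implementation would probably also need to escape special LaTeX characters in the title and author names, which this sketch does not do.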
-
PyPDF2
A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
Look into the metadata of the .pdf file (extracted via PyPDF2) and see if any string matches the pattern of a DOI or an arXiv ID. Priority is given to metadata entries whose label contains the word 'doi'.
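The metadata step could look roughly like the sketch below. It works on a plain dictionary of metadata labels and values, so it does not depend on PyPDF2. The regular expressions are simplified assumptions rather than the patterns pdf2doi actually uses, and find_identifier and identifier_from_metadata are hypothetical names.

```python
import re

# Simplified patterns (assumptions): a DOI starts with "10.", a registrant
# code and a slash; a new-style arXiv ID looks like arXiv:YYMM.NNNNN.
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')
ARXIV_RE = re.compile(r"arxiv:\s*(\d{4}\.\d{4,5})(?:v\d+)?", re.IGNORECASE)


def find_identifier(text):
    """Return ("doi", value) or ("arxiv", value) for the first match, else None."""
    match = DOI_RE.search(text)
    if match:
        # Drop punctuation that often trails a DOI in running text.
        return ("doi", match.group(0).rstrip(".,;:)]"))
    match = ARXIV_RE.search(text)
    if match:
        return ("arxiv", match.group(1))
    return None


def identifier_from_metadata(metadata):
    """Scan metadata values, looking first at labels that contain 'doi'."""
    # False sorts before True, so labels containing 'doi' come first;
    # sorted() is stable, so the other entries keep their original order.
    ordered = sorted(metadata.items(), key=lambda item: "doi" not in item[0].lower())
    for _label, value in ordered:
        found = find_identifier(str(value))
        if found:
            return found
    return None
```

With a recent PyPDF2 the dictionary can usually be obtained from `PdfReader(path).metadata`; older versions expose the same information through a different call, so check the version you have installed.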
-
Scan the text inside the .pdf file, and check for any string that matches the pattern of a DOI or an arXiv ID. The text is extracted with PyPDF2 and textract.
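A minimal sketch of the text-scanning step, assuming the text of each page has already been extracted into a list of strings (by PyPDF2, textract, or anything else). The patterns are simplified assumptions and scan_pages is a hypothetical name, not a function of pdf2doi.

```python
import re

# Simplified patterns (assumptions, not the ones used by pdf2doi).
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')
ARXIV_PATTERN = re.compile(r"arxiv:\s*(\d{4}\.\d{4,5})(?:v\d+)?", re.IGNORECASE)


def scan_pages(pages):
    """Collect every distinct DOI or arXiv ID found in the page texts, in order."""
    found = []
    for text in pages:
        for match in DOI_PATTERN.finditer(text):
            # Strip punctuation that commonly follows a DOI in a sentence.
            candidate = ("doi", match.group(0).rstrip(".,;:)]"))
            if candidate not in found:
                found.append(candidate)
        for match in ARXIV_PATTERN.finditer(text):
            # group(1) is the ID without the version suffix.
            candidate = ("arxiv", match.group(1))
            if candidate not in found:
                found.append(candidate)
    return found


if __name__ == "__main__":
    example_pages = [
        "Published version: https://doi.org/10.1000/abc.123.",
        "Preprint arXiv:2101.01234v3; see also doi:10.1000/abc.123",
    ]
    print(scan_pages(example_pages))
```

Returning all candidates rather than the first one leaves the caller free to validate each of them, which matters because the body of a paper often also contains the DOIs of the works it cites.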
-
Try to find possible titles of the publication. In the current version, candidate titles are obtained from pdftitle and from the file name. For each candidate title, a Google search is performed and the plain text of the first results is scanned for valid identifiers.