A Python library and command-line tool to extract the DOI or other identifiers of a scientific paper from a PDF file.
For an arXiv paper it is slightly more complicated, because a query to export.arxiv.org returns data in XML format, so you need to parse it (I use feedparser) and then build a valid BibTeX string. You can take a look at the functions arxiv2bib and make_bibtex.
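As a rough illustration of the parse-then-build step, here is a minimal sketch. It is not the arxiv2bib or make_bibtex code mentioned above. It swaps feedparser for the standard library's xml.etree.ElementTree so that it runs without extra packages, and it works on a hand-written sample feed (placeholder ID, title, and authors) instead of a live query. The function name atom_entry_to_bibtex and the choice of BibTeX fields are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace

# Hand-written placeholder in the shape of an Atom entry; not real query output.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <published>2020-01-15T00:00:00Z</published>
    <title>An Example
      Title</title>
    <author><name>Ada Lovelace</name></author>
    <author><name>Alan Turing</name></author>
  </entry>
</feed>"""


def atom_entry_to_bibtex(feed_xml: str) -> str:
    """Build a BibTeX @article string from the first entry of an Atom feed."""
    entry = ET.fromstring(feed_xml).find(f"{ATOM}entry")
    if entry is None:
        raise ValueError("feed contains no entry")

    # Titles can be wrapped across lines in the XML; collapse the whitespace.
    title = " ".join(entry.findtext(f"{ATOM}title", default="").split())
    authors = [
        a.findtext(f"{ATOM}name", default="").strip()
        for a in entry.findall(f"{ATOM}author")
    ]
    year = entry.findtext(f"{ATOM}published", default="")[:4]

    # Keep the part after '/abs/' and drop a trailing version suffix such as 'v1'.
    arxiv_id = entry.findtext(f"{ATOM}id", default="").rsplit("/abs/", 1)[-1]
    arxiv_id = re.sub(r"v\d+$", "", arxiv_id)

    # Citation key: first author's last name plus the year (an arbitrary convention).
    key = (authors[0].split()[-1].lower() if authors else "unknown") + year
    fields = {
        "title": title,
        "author": " and ".join(authors),
        "year": year,
        "eprint": arxiv_id,
        "archivePrefix": "arXiv",
    }
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@article{{{key},\n{body}\n}}"


if __name__ == "__main__":
    print(atom_entry_to_bibtex(SAMPLE_FEED))
```

A real entry would also need escaping of characters that are special in BibTeX, which this sketch leaves out.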
A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
Look into the metadata of the .pdf file (extracted via PyPDF2) and see if any string matches the pattern of a DOI or an arXiv ID. Priority is given to the metadata which contains the word 'doi' in its label.
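A minimal sketch of this metadata check, assuming the metadata has already been read into a plain dict of label/value pairs (for example by a PDF library). The regular expressions are simplified approximations of DOI and arXiv ID patterns, not the ones the tool uses, and the function name find_identifier_in_metadata and the sample values are made up for the example.

```python
import re

# Simplified patterns (assumptions): a DOI starts with '10.' and a registrant code,
# a modern arXiv ID looks like 'arXiv:YYMM.NNNNN' with an optional version.
DOI_RE = re.compile(r"10\.\d{4,9}/[^\s\"'<>]+")
ARXIV_RE = re.compile(r"arxiv:\s*(\d{4}\.\d{4,5})(?:v\d+)?", re.IGNORECASE)


def find_identifier_in_metadata(metadata):
    """Return ('doi' | 'arxiv', value) for the first match, or None."""
    # Labels containing 'doi' sort first; the stable sort keeps the rest in order.
    labels = sorted(metadata, key=lambda label: "doi" not in str(label).lower())
    for label in labels:
        value = str(metadata[label])
        match = DOI_RE.search(value)
        if match:
            # Drop punctuation that often trails a DOI in running text.
            return ("doi", match.group(0).rstrip(".,;)"))
        match = ARXIV_RE.search(value)
        if match:
            return ("arxiv", match.group(1))
    return None


if __name__ == "__main__":
    sample = {
        "/Title": "Some preprint arXiv:2001.01234v2",
        "/doi": "10.1234/abcd.5678",  # made-up DOI
    }
    print(find_identifier_in_metadata(sample))
```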
extract text from any document. no muss. no fuss.
Scan the text inside the .pdf file, and check for any string that matches the pattern of a DOI or an arXiv ID. The text is extracted with PyPDF2 and textract.
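The text scan can be sketched along the following lines, assuming the page text has already been extracted into a list of strings (one per page). The patterns are simplified approximations, and the function name scan_pages_for_identifiers is hypothetical. Unlike a first-match search, this version collects every distinct identifier in order of appearance, which is one possible design and not necessarily what the tool does.

```python
import re

# Simplified patterns (assumptions), not the tool's own expressions.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"'<>]+")
ARXIV_RE = re.compile(r"\barXiv:\s*(\d{4}\.\d{4,5})(?:v\d+)?", re.IGNORECASE)


def scan_pages_for_identifiers(pages):
    """Return unique (kind, value) pairs in order of first appearance."""
    found = []
    for text in pages:
        # Record the match position so hits of both kinds can be ordered together.
        hits = [
            (m.start(), "doi", m.group(0).rstrip(".,;:)]"))
            for m in DOI_RE.finditer(text)
        ]
        hits += [(m.start(), "arxiv", m.group(1)) for m in ARXIV_RE.finditer(text)]
        for _, kind, value in sorted(hits):
            if (kind, value) not in found:
                found.append((kind, value))
    return found


if __name__ == "__main__":
    pages = [
        "Title page. arXiv:2001.01234v3 [cs.DL]",
        "See https://doi.org/10.1234/abcd.5678. Also doi:10.1234/abcd.5678, again.",
    ]
    print(scan_pages_for_identifiers(pages))
```

Extracted text often breaks a DOI across lines or glues it to neighbouring words, so a production scan would need more cleanup than the trailing-punctuation strip shown here.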
a utility to extract the title from a PDF file
Try to find possible titles of the publication. In the current version, possible titles are identified via pdftitle and from the file name. For each possible title, a Google search is performed and the plain text of the first results is scanned for valid identifiers.
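The file-name part of this step might look roughly like the sketch below. It only turns a file name into a candidate title string; the pdftitle lookup and the web search are left out. The helper name title_candidate_from_filename and the set of separator characters are assumptions for illustration.

```python
import re
from pathlib import Path


def title_candidate_from_filename(path):
    """Turn a PDF file name into a plain-text title guess."""
    stem = Path(path).stem  # file name without directory and final extension
    # Treat common file-name separators as word breaks (an assumption).
    words = re.sub(r"[_\-.+]+", " ", stem)
    return " ".join(words.split())


if __name__ == "__main__":
    print(title_candidate_from_filename("papers/an_example-paper.title.pdf"))
```

Replacing hyphens also splits genuinely hyphenated words, so the result is only a search query candidate, not a reliable title.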