A Markdown Editor for the 21st century.
Universal markup converter
Many publishers also publish a supplementary HTML version these days, which may be an acceptable format or at least easy to convert to other formats with pandoc.
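As a rough illustration of the pandoc route, here is a minimal Python sketch that shells out to pandoc to turn a publisher's HTML page into Markdown. The file names and the helper names `pandoc_command` and `convert` are made up for this example; only the `--from`, `--to` and `--output` options belong to pandoc itself, and the sketch assumes pandoc is installed and on the PATH.

```python
import shutil
import subprocess


def pandoc_command(src, dst, to_format="markdown"):
    """Build the pandoc argument list for converting an HTML file.

    `src` and `dst` are hypothetical file paths chosen by the caller.
    """
    return ["pandoc", "--from", "html", "--to", to_format, "--output", dst, src]


def convert(src, dst, to_format="markdown"):
    """Run pandoc, failing early if the executable is not available."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on PATH")
    # check=True raises CalledProcessError if pandoc exits with an error.
    subprocess.run(pandoc_command(src, dst, to_format), check=True)
```

Calling `convert("paper.html", "paper.md")` would then write a Markdown version next to the HTML file; how clean the result is depends on how the publisher structured the HTML.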
A machine learning software for extracting information from scholarly documents
- I ended up using Grobid, which converts the PDF to a very detailed XML format. This is not a word-processing format but one designed specifically for representing scientific documents, so I don't know whether it would, for example, contain tags for bold or italicized text. The tool works really well, but since you probably cannot use the output XML directly, it will need some postprocessing, which should be relatively simple with an XML parsing library.
- An alternative is pdfextract by Crossref. They probably use this to build their own large database. It also works really well and gives you some JSON that would probably need less postprocessing than Grobid. I didn't use it for some minor technical reason that I forgot.
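To give an idea of what the postprocessing of Grobid's XML could look like, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. Grobid emits TEI XML in the `http://www.tei-c.org/ns/1.0` namespace; the `SAMPLE_TEI` string below is a hand-written, heavily simplified stand-in rather than real Grobid output, and `extract_sections` is a hypothetical helper name. Real output has many more elements, so the paths would likely need adjusting.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI documents such as the ones Grobid produces.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written, simplified sample (an assumption, not actual Grobid output).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc><titleStmt><title>Example Paper</title></titleStmt></fileDesc>
  </teiHeader>
  <text>
    <body>
      <div><head>Introduction</head><p>First paragraph.</p></div>
      <div><head>Methods</head><p>Second paragraph.</p></div>
    </body>
  </text>
</TEI>"""


def extract_sections(tei_xml):
    """Return the document title and a list of {heading, paragraphs} dicts."""
    root = ET.fromstring(tei_xml)
    title_el = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    title = "".join(title_el.itertext()).strip() if title_el is not None else ""

    sections = []
    for div in root.findall(".//tei:body/tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        heading = "".join(head.itertext()).strip() if head is not None else ""
        # itertext() flattens any inline markup inside a paragraph.
        paragraphs = ["".join(p.itertext()).strip()
                      for p in div.findall("tei:p", TEI_NS)]
        sections.append({"heading": heading, "paragraphs": paragraphs})
    return title, sections


if __name__ == "__main__":
    doc_title, doc_sections = extract_sections(SAMPLE_TEI)
    print(doc_title)
    for section in doc_sections:
        print(section["heading"], len(section["paragraphs"]))
```

From a structure like this it is a small step to write the sections out as Markdown or JSON, whichever is more useful downstream.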
Given a scholarly PDF, extract figures, tables, captions, and section titles.
- pdffigures2 is from the team behind Semantic Scholar, and they probably use it to extract the figures shown in their search engine. It extracts only figures and their captions, nothing else. I don't recall whether the other tools can also extract figures, but if not, this would be a perfect supplement.
Content ExtRactor and MINEr
- Another alternative that's on my list but that I didn't try is Cermine.
Parsers for scientific papers (PDF2JSON, TEX2JSON, JATS2JSON)
Their library is built on Grobid and probably makes it easier to use, so it is probably the best overall choice.