| | pdfplumber | djot |
|---|---|---|
| Mentions | 29 | 43 |
| Stars | 5,603 | 1,580 |
| Growth | - | - |
| Activity | 8.2 | 5.8 |
| Last commit | 16 days ago | 2 months ago |
| Language | Python | HTML |
| License | MIT License | MIT License |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
pdfplumber
- Running OCR against PDFs and images directly in the browser
- Google Scholar PDF Reader
- [pdfplumber](https://github.com/jsvine/pdfplumber)
- Parsing dates with PDFminer
- How to Extract Data from Tables in a Public Record PDF
  I recently published a story that was based on some data analysis I did of a report I obtained from the Department of Behavioral Health and Developmental Services in VA. I wanted to share a quick walkthrough of how I extracted the data from tables in a PDF using a Python module called pdfplumber. I also uploaded a video to YouTube if you prefer that.
- Code to extract text from pdf to excel
  I've been working with pdfplumber, which is built atop pdfminer.six. It allows one to break the page up into sections and extract text from them in turn, which may help keep columns separated better.
- I need to parse unstructured tables from a pdf into a json, what can I do
  You could try pdfplumber
- Advanced PDF to Excel with documents and example code
  I'm not sure if there is a way to reliably detect bold characters: https://github.com/jsvine/pdfplumber/issues/724
- how do I automate extracting data from two pdfs and input into an excel sheet according to an order number
  pdfplumber is also pretty good. It can help segment text a bit better than pdfminer can alone.
- Extracting particular things from pdf program?
  To handle machine-generated PDFs, one possible package is pdfplumber.
- Convert PDF to text for parsing
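
Several of the mentions above describe the same two tasks: splitting a page into sections so that columns stay separate, and turning extracted tables into JSON-like records. The following is a minimal sketch of both, not code from any of the linked posts. It assumes the page origin is at (0, 0), that equal-width columns are good enough, and that the first row of each table is a header. The helper names `column_bboxes`, `table_to_records`, and `extract_page` are hypothetical; the pdfplumber calls used are `pdfplumber.open`, `page.crop`, `page.extract_text`, and `page.extract_tables`.

```python
def column_bboxes(width, height, n_columns):
    """Split a page of the given size into equal-width vertical strips.

    Each box is (x0, top, x1, bottom), the order pdfplumber's crop() expects.
    Assumes the page's bounding box starts at (0, 0).
    """
    step = width / n_columns
    return [
        # Clamp x1 so float rounding never pushes a box past the page edge.
        (i * step, 0, min((i + 1) * step, width), height)
        for i in range(n_columns)
    ]


def table_to_records(table):
    """Turn one extracted table (a list of rows) into a list of dicts.

    The first row is treated as the header (an assumption about the PDF).
    Empty cells, which pdfplumber reports as None, become "".
    """
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    return [
        {key: (cell or "").strip() for key, cell in zip(header, row)}
        for row in table[1:]
    ]


def extract_page(pdf_path, page_number=0, n_columns=2):
    """Return per-column text and table records for a single page."""
    # Third-party dependency, imported here so the two helpers above
    # can be used and tested without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        columns = [
            page.crop(bbox).extract_text() or ""
            for bbox in column_bboxes(page.width, page.height, n_columns)
        ]
        tables = [table_to_records(t) for t in page.extract_tables()]
    return {"columns": columns, "tables": tables}
```

Calling `extract_page("report.pdf")` (a placeholder path) returns a dictionary that can be passed straight to `json.dumps`. Real documents rarely have perfectly even columns, so the bounding boxes would usually need adjusting by hand for each layout.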
djot
- LaTeX and Neovim for technical note-taking
  I know this doesn't solve your problem directly, but I recommend that people try out Djot[0], a markup language from the author of CommonMark.
  Djot has a single well-defined spec, and most of the basic formatting has the same syntax as Markdown, so switching is pretty painless. One of its main goals is to be legible and visually pleasing as-is, just like Markdown.
  What Djot adds is its _predictability_. Nested formatting, precedence order, line-break behavior, nested blocks, mixed inline and block formatting, and custom attributes are all laid out precisely in the spec in a well-thought-out manner. To this day I still can't remember how to put a line break within a list item in Markdown (and I'm sure there is more than one way).
  [0]: https://djot.net/
- Pandoc 3.1.12 Released
- Pandoc
  Worth noting that the author has also created a markup language, djot.
  https://github.com/jgm/djot
- Augmenting the Markdown Language for Great Python Graphical Interfaces
  Every time I see people doing something with Markdown, I wish they would just replace it with support for Djot[0] instead. It is a Markdown alternative by the creator of Pandoc and CommonMark that fixes all of the most egregious mistakes, while being legible and visually pleasant as-is. It is also syntactically similar to Markdown, which should ease adoption.
  [0] https://github.com/jgm/djot
- Djot is a light markup syntax
- Beyond Markdown
- HELP!!! Stuck forever
  Are you using Markdown? It might make sense to look at Djot as well: https://djot.net/. It's a new 'light' markup language conceived as a successor to CommonMark; development is led by none other than John MacFarlane (author of pandoc, who also led CommonMark standardization). Djot makes it really easy to attach arbitrary attributes to block elements as well as inline elements, and the parser records source positions in the output, all of which makes it really convenient to keep track of elements changing position or value.
- Is there a way to send data from neovim in real-time to other applications? Want to create a neovim qmk bridge.
  I have a simple script that sends a djot buffer (https://github.com/jgm/djot) to the parser, if there's a change, on the CursorHold event.
- wiki.vim v0.6 is released
  Since you mentioned you were considering moving to CommonMark, have you had time to look into Djot (also by jgm)? Djot is meant to be easier to parse, and I'm planning to write a tree-sitter grammar for it.
- Typst, a modern LaTeX alternative written in Rust, is now open source
  Another recent development here is https://djot.net/ (by the pandoc author). It indeed thoroughly solves both:
What are some alternatives?
PDFMiner - Python PDF Parser (Not actively maintained). Check out pdfminer.six.
typst - A new markup-based typesetting system that is powerful and easy to learn.
PyPDF2 - A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
mdBook - Create book from markdown files. Like Gitbook but implemented in Rust
OCRmyPDF - OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
Zato - ESB, SOA, REST, APIs and Cloud Integrations in Python
pdfminer.six - Community maintained fork of pdfminer - we fathom PDF
scroll - Tools for thought. An extensible alternative to Markdown.
py-pdf-parser - A Python tool to help extracting information from structured PDFs.
pdfsyntax - A Python library to inspect and modify the internal structure of a PDF file
PyMuPDF - PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
pdfquery - A fast and friendly PDF scraping library.