Top 23 Python Parser Projects
-
MinerU
A high-quality tool for converting PDF to Markdown and JSON. A one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON.
-
pydantic
Project mention: FastAPI, Pydantic, Psycopg3: the holy trinity for Python web APIs | dev.to | 2024-10-24
Pydantic is bundled with FastAPI and is excellent for modelling, validating, and serialising API responses.
-
sqlglot
Project mention: Writing Composable SQL Using Knex and Pipelines | news.ycombinator.com | 2024-11-28
You can compose SQL with https://ibis-project.org/tutorials/ibis-for-sql-users, which uses https://github.com/tobymao/sqlglot to parse the SQL under the hood.
As an alternative to parsing the SQL yourself, DuckDB's [relational API](https://duckdb.org/docs/api/python/relational_api) allows you to compose SQL expressions efficiently and lazily, which I've used when playing around with things like https://gist.github.com/ajfriend/eea0795546c7c44f1c24ab0560a...
-
-
Lark
Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
-
MegaParse
File Parser optimised for LLM Ingestion with no loss 🧠 Parse PDFs, Docx, PPTx in a format that is ideal for LLMs.
-
-
-
-
oletools
oletools - python tools to analyze MS OLE2 files (Structured Storage, Compound File Binary Format) and MS Office documents, for malware analysis, forensics and debugging.
-
PLY
I don't think PLY can handle the newest Python grammar, but I haven't looked into it.
For what it's worth, my Python grammar was based on an older project I did called GardenSnake, which is still available as a PLY example, at https://github.com/dabeaz/ply/tree/master/example/GardenSnak... .
I've been told it was one of the few examples of how to handle an indentation-based grammar using a yacc-esque parser generator.
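One common way to handle an indentation-based grammar with a yacc-style generator is to have the lexer turn changes in leading whitespace into explicit INDENT and DEDENT tokens, so the grammar can treat blocks like bracketed groups. The sketch below illustrates only that idea using the standard library. It is not GardenSnake's code and does not use PLY; the function name `indent_tokens` and the token kinds are hypothetical, and it assumes indentation is written with spaces.

```python
def indent_tokens(source):
    """Yield (kind, value) tokens, turning indentation changes into INDENT/DEDENT."""
    stack = [0]  # indentation widths of the currently open blocks
    for line in source.splitlines():
        if not line.strip():
            continue  # blank lines do not affect block structure
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            # deeper than the current block: open a new one
            stack.append(width)
            yield ("INDENT", width)
        while width < stack[-1]:
            # shallower: close blocks until a matching width is found
            stack.pop()
            yield ("DEDENT", width)
        if width != stack[-1]:
            raise IndentationError(f"inconsistent dedent: {line!r}")
        yield ("LINE", line.strip())
    # close any blocks still open at end of input
    while len(stack) > 1:
        stack.pop()
        yield ("DEDENT", 0)
```

A parser generator can then consume INDENT and DEDENT exactly as it would consume opening and closing braces.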
-
msgspec
A fast serialization and validation library, with builtin support for JSON, MessagePack, YAML, and TOML
gjson [1] and a few other go packages offer a way to parse arbitrary JSON without requiring structs to hold them.
re: Python. I like PyRight/PyLance for Python typing, it seems to "just work" afaict. I also like msgspec for dataclass like behavior [2].
---
1: https://github.com/tidwall/gjson
2: https://jcristharif.com/msgspec/
-
rdflib
RDFLib is a Python library for working with RDF, a simple yet powerful language for representing information.
-
-
typeguard
Project mention: Python type hints may not be not for me in practice | news.ycombinator.com | 2024-11-27
You want runtime typechecking.
See either beartype [1] or typeguard [2]. And if you're doing any kind of array-based programming (JAX or not), then jaxtyping [3].
[1] https://github.com/beartype/beartype/
[2] https://github.com/agronholm/typeguard
[3] https://github.com/patrick-kidger/jaxtyping
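To make "runtime typechecking" concrete, here is a minimal standard-library sketch of the general idea: a decorator that compares arguments and the return value against the function's annotations on every call. It is a deliberately simplified illustration, not the API or implementation of beartype, typeguard, or jaxtyping; the decorator name `check_types` is hypothetical, and it only handles plain classes (parameterised generics such as `list[int]` are skipped).

```python
import functools
import inspect
import typing


def _is_plain_class(hint):
    """True for annotations like int or str, False for generics like list[int]."""
    return typing.get_origin(hint) is None and isinstance(hint, type)


def check_types(func):
    """Check plain-class annotations on arguments and return value at call time."""
    hints = typing.get_type_hints(func)
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = hints.get(name)
            if _is_plain_class(expected) and not isinstance(value, expected):
                raise TypeError(f"{name}={value!r} is not {expected.__name__}")
        result = func(*args, **kwargs)
        expected = hints.get("return")
        if _is_plain_class(expected) and not isinstance(result, expected):
            raise TypeError(f"return value {result!r} is not {expected.__name__}")
        return result

    return wrapper


@check_types
def scale(value: float, factor: int) -> float:
    return value * factor
```

With this in place, `scale(2.0, 3)` runs normally, while `scale(2.0, "3")` raises `TypeError` at the call site instead of failing somewhere downstream. The dedicated libraries cover far more of the typing system than this sketch does.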
-
strictyaml
I don't think it is about discipline. At its core, a good test will take an example and do something with it to demonstrate an outcome.
That's exactly what how-to docs do, probably with the same examples.
Logically, they should be the same thing.
For example:
https://github.com/crdoconnor/strictyaml/blob/master/hitch/s...
And:
https://hitchdev.com/strictyaml/using/alpha/scalar/email-and...
You just need a (non-Turing-complete) language that is dual use: it generates docs and runs tests.
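As a rough illustration of the dual-use idea, the sketch below keeps examples as plain data and feeds the same data to a documentation renderer and a test runner. It is an assumption-laden toy, not how strictyaml's own specs work; the names `EXAMPLES`, `render_docs`, and `run_tests` are hypothetical, and the built-in `int` stands in for the code under test.

```python
# Each example is plain data: a title, an input, and the expected result.
EXAMPLES = [
    {"title": "Decimal digits", "given": "42", "expect": 42},
    {"title": "Surrounding whitespace is ignored", "given": " 7 ", "expect": 7},
    {"title": "Leading minus sign", "given": "-3", "expect": -3},
]


def render_docs(examples, call="int"):
    """Render the examples as how-to documentation."""
    lines = []
    for ex in examples:
        lines.append(ex["title"])
        lines.append(f"    {call}({ex['given']!r}) returns {ex['expect']!r}")
    return "\n".join(lines)


def run_tests(examples, func=int):
    """Run the same examples as tests; return the titles of any failures."""
    return [ex["title"] for ex in examples if func(ex["given"]) != ex["expect"]]
```

Because the examples are data rather than code, the documentation cannot drift from what the tests actually check.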
-
python-user-agents
A Python library that provides an easy way to identify devices like mobile phones, tablets and their capabilities by parsing (browser) user agent strings.
-
json_repair
Project mention: Crafting Structured {JSON} Responses: Ensuring Consistent Output from any LLM 🦙🤖 | dev.to | 2024-09-15
Generating perfectly formatted JSON from LLMs is a common yet challenging task. By guiding the JSON syntax, communicating its usage, and using validation tools like json-fixer, we can significantly improve the consistency and reliability of the output. By combining clear instructions, strategic prompting, and robust validation, we can transform our LLM interactions from a gamble into a reliable pipeline for structured data. That's all for the day, folks. Stay informed, iterate, and refine your approach to master the art of JSON generation from any LLM.
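The validation step can be sketched with the standard library alone: find the first JSON object embedded in a model reply, parse it, and confirm it has the expected keys. This is a minimal sketch under stated assumptions, not the json-fixer or json_repair API; the function name `extract_json` and its `required_keys` parameter are hypothetical, and it only validates well-formed JSON rather than repairing broken output.

```python
import json


def extract_json(reply, required_keys=()):
    """Return the first JSON object found in `reply`, or None if none validates."""
    decoder = json.JSONDecoder()
    start = reply.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value and ignores any trailing text.
            value, _ = decoder.raw_decode(reply, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, dict) and all(key in value for key in required_keys):
            return value
        # Try again from the next opening brace.
        start = reply.find("{", start + 1)
    return None
```

When this returns `None`, the caller can re-prompt the model or hand the text to a dedicated repair library.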
-
cinemagoer
Cinemagoer is a Python package useful to retrieve and manage the data of the IMDb (to which we are not affiliated in any way) movie database about movies, people, characters and companies
-
-
godot-gdscript-toolkit
The project uses a mix of Rust and GDScript, and for linting the GDScript code in GitHub Actions, I use godot-gdscript-toolkit.
-
Construct
Construct: Declarative data structures for python that allow symmetric parsing and building
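"Symmetric parsing and building" means one declaration of a binary layout is used both to read bytes and to write them. The sketch below shows that principle with the standard-library `struct` module rather than Construct itself, so none of it is Construct's API; the `Header` layout and the `build` and `parse` helpers are hypothetical.

```python
import struct
from collections import namedtuple

# One declaration drives both directions: big-endian uint16, uint16, uint8.
HEADER_FORMAT = struct.Struct(">HHB")
Header = namedtuple("Header", "width height depth")


def build(header):
    """Serialise a Header to bytes."""
    return HEADER_FORMAT.pack(*header)


def parse(data):
    """Parse bytes back into a Header."""
    return Header(*HEADER_FORMAT.unpack(data))
```

Because both helpers share `HEADER_FORMAT`, `parse(build(h))` returns `h` for any valid header, and a change to the layout only has to be made in one place.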
-
Python Parser discussion
Python Parser related posts
-
HOCON - secret behind .conf files
-
QuivrHQ/MegaParse: File Parser Optimised for LLM Ingestion with No Loss
-
MegaParse – File Parser Optimised for LLM Ingestion
-
Show HN: RasperDucky, an Implementation of DuckyScript3 for Raspberry Pico
-
Adding Syntax to the CPython Interpreter
-
Checkbox Extraction from PDFs - A Tutorial
-
Show HN: Griffe, load Python APIs, find breaking changes
-
A note from our sponsor - SaaSHub
www.saashub.com | 19 Jan 2025
Index
What are some of the best open-source Parser projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | MinerU | 24,613 |
2 | pydantic | 22,019 |
3 | sqlglot | 6,998 |
4 | pdfminer.six | 6,139 |
5 | Lark | 5,045 |
6 | MegaParse | 4,950 |
7 | sqlparse | 3,796 |
8 | phonenumbers | 3,537 |
9 | snoop | 3,125 |
10 | oletools | 2,975 |
11 | PLY | 2,815 |
12 | msgspec | 2,539 |
13 | rdflib | 2,189 |
14 | m3u8 | 2,094 |
15 | typeguard | 1,576 |
16 | strictyaml | 1,503 |
17 | python-user-agents | 1,449 |
18 | json_repair | 1,382 |
19 | cinemagoer | 1,243 |
20 | ViperMonkey | 1,065 |
21 | godot-gdscript-toolkit | 1,052 |
22 | Construct | 934 |
23 | wiktextract | 847 |