Top 23 Python Text processing Projects

pydantic

167 18,521 9.8 Python

Data validation using Python type hints

Project mention: Advanced RAG with guided generation | dev.to | 2024-04-18

First, note the method prefix_allowed_tokens_fn. This method applies a Pydantic model to constrain/guide how the LLM generates tokens. Next, see how that constraint can be applied to txtai's LLM pipeline.
fuzzywuzzy

20 9,067 0.0 Python

Fuzzy String Matching in Python
diff-match-patch

8 7,080 0.0 Python

Diff Match Patch is a high-performance library in multiple languages that manipulates plain text.

Project mention: Ideas for approaching pattern matching/distance problem | /r/learnprogramming | 2023-06-29

I also came across these diff match algorithms: https://github.com/google/diff-match-patch
Chinese Character to Pinyin Converter (Python version)

1 4,666 7.0 Python

Convert Chinese characters to pinyin (pypinyin)

Project mention: Pinyin Character Sorting | /r/ChineseLanguage | 2023-07-09

Could probably whip up a python script real quick with this library: https://github.com/mozillazg/python-pinyin. Probably need some extra logic to deal with heteronyms. Not sure what your goal is.
Lark

35 4,471 7.5 Python

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.

Project mention: Show HN: I wrote a RDBMS (SQLite clone) from scratch in pure Python | news.ycombinator.com | 2023-08-13

Lark supports, and recommends, writing and storing the grammar in a .lark file. We have syntax highlighting support in all major IDEs, and even in github itself. For example, here is Lark's built-in grammar for Python: https://github.com/lark-parser/lark/blob/master/lark/grammar...
You can also test grammars "live" in our online IDE: https://www.lark-parser.org/ide/
The rationale is that it's more terse and has less visual clutter than a DSL over Python, which makes it easier to read and write.
ftfy

1 3,704 5.7 Python

Fixes mojibake and other glitches in Unicode text, after the fact.
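
To make "mojibake" concrete: it typically appears when UTF-8 bytes are decoded with the wrong single-byte codec. The following is a minimal standard-library sketch of reversing one such round, not ftfy's actual algorithm. The helper name undo_mojibake and the assumption that the wrong codec was Latin-1 are illustrative only.

```python
def undo_mojibake(text: str, wrong_codec: str = "latin-1") -> str:
    """Reverse one round of 'UTF-8 bytes decoded as wrong_codec'.

    Hypothetical helper for illustration; returns the input unchanged
    when it does not look like this kind of mojibake.
    """
    try:
        # Re-create the original bytes, then decode them as UTF-8.
        return text.encode(wrong_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside wrong_codec, or the
        # recovered bytes are not valid UTF-8: leave the text alone.
        return text


# Simulate the damage: UTF-8 bytes read as Latin-1.
broken = "café".encode("utf-8").decode("latin-1")
print(broken)                 # cafÃ©
print(undo_mojibake(broken))  # café
```

A real fixer has to guess which codec caused the damage and avoid "repairing" text that was already correct, which is the hard part this sketch leaves out.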
sqlparse

7 3,574 8.2 Python

A non-validating SQL parser module for Python

Project mention: Show HN: Databasediagram.com – Private, Text to Entity-Relationship Diagram Tool | news.ycombinator.com | 2023-06-08

Suggest checking out the sqlparse library for a way to do the different flavours without needing to address each case directly: https://github.com/andialbrecht/sqlparse
phonenumbers

7 3,398 8.3 Python

Python port of Google's libphonenumber
TextDistance

6 3,296 7.0 Python

📐 Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.
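
As an illustration of what a sequence distance computes, here is a short pure-Python sketch of Levenshtein (edit) distance, one classic algorithm of this kind. It is written from scratch for this page and is not taken from TextDistance; the function name levenshtein is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

The two-row formulation keeps memory proportional to the shorter input while still running in time proportional to the product of both lengths.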
PLY

2 2,695 1.0 Python

Python Lex-Yacc
pyparsing

13 2,083 8.5 Python

Python library for creating PEG parsers

Project mention: Pyparsing 3.1.0 released | /r/pyparsing | 2023-06-19

After over a year since the last release of pyparsing, I've bundled up all the bug-fixes and changes, and they are now released as pyparsing 3.1.0. Visit this link for the details.
chardet

8 2,071 2.9 Python

Python character encoding detector
shortuuid

5 1,975 0.8 Python

A generator library for concise, unambiguous and URL-safe UUIDs.
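
One way to get short, unambiguous, URL-safe IDs is to re-encode a UUID's 128-bit integer in a larger alphabet that omits look-alike characters. The sketch below uses only the standard library and is not shortuuid's own code; the alphabet, the padding rule, and the names encode and decode are assumptions made for illustration.

```python
import math
import uuid

# Assumed alphabet: digits and letters minus the look-alikes 0, 1, I, O, l.
ALPHABET = "23456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
BASE = len(ALPHABET)  # 57
# Number of base-57 digits needed to hold any 128-bit value.
LENGTH = math.ceil(128 / math.log2(BASE))


def encode(u: uuid.UUID) -> str:
    """Encode a UUID as a fixed-length string over ALPHABET."""
    n = u.int
    digits = []
    while n:
        n, rem = divmod(n, BASE)
        digits.append(ALPHABET[rem])
    # Most significant digit first, left-padded with the zero symbol.
    return "".join(reversed(digits)).rjust(LENGTH, ALPHABET[0])


def decode(s: str) -> uuid.UUID:
    """Invert encode()."""
    n = 0
    for ch in s:
        n = n * BASE + ALPHABET.index(ch)
    return uuid.UUID(int=n)


u = uuid.uuid4()
short = encode(u)
print(len(str(u)), len(short))  # 36 22
```

Every character in the assumed alphabet is alphanumeric, so the result needs no escaping in a URL path or query string.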

Project mention: Ask HN: Could you show your personal blog here? | news.ycombinator.com | 2023-07-04

Unlike many people here, I don't like to write hundreds of mediocre posts. Instead, I prefer very few posts, that unfortunately are still mediocre.
If you're tired of all the perfection that exists on the internet, where every piece is deeply insightful and changes your life, I'd encourage you to read my articles, which only promise to shorten it:
https://www.stavros.io/
msgspec

31 1,839 8.9 Python

A fast serialization and validation library, with builtin support for JSON, MessagePack, YAML, and TOML

Project mention: Htmx, Rust and Shuttle: A New Rapid Prototyping Stack | news.ycombinator.com | 2023-11-01
python-slugify

1 1,446 5.6 Python

Returns unicode slugs
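
For readers unfamiliar with slugs, the following standard-library sketch shows the general idea of turning a title into a URL-friendly string. It is a simplified stand-in, not python-slugify's implementation: the function name simple_slug and its allow_unicode flag are hypothetical, and non-ASCII characters that cannot be decomposed are simply dropped rather than transliterated.

```python
import re
import unicodedata


def simple_slug(text: str, allow_unicode: bool = False) -> str:
    """Lowercase text, strip punctuation, and join words with hyphens."""
    if allow_unicode:
        # Keep non-ASCII letters, only normalizing their representation.
        text = unicodedata.normalize("NFKC", text)
    else:
        # Split accented letters into base letter + accent, drop the rest.
        text = (
            unicodedata.normalize("NFKD", text)
            .encode("ascii", "ignore")
            .decode("ascii")
        )
    # Remove everything except word characters, whitespace, and hyphens.
    text = re.sub(r"[^\w\s-]", "", text.lower())
    # Collapse runs of whitespace, underscores, and hyphens into one hyphen.
    return re.sub(r"[-\s_]+", "-", text).strip("-")


print(simple_slug("Héllo, Wörld!"))                  # hello-world
print(simple_slug("你好 世界", allow_unicode=True))  # 你好-世界
```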
typeguard

7 1,432 8.2 Python

Run-time type checker for Python
python-user-agents

0 1,404 0.0 Python

A Python library that provides an easy way to identify devices like mobile phones, tablets and their capabilities by parsing (browser) user agent strings.
DataProfiler

61 1,357 6.3 Python

What's in your data? Extract schema, statistics and entities from datasets

Project mention: LongRoPE: Extending LLM Context Window Beyond 2M Tokens | news.ycombinator.com | 2024-02-22

It's been possible to skip tokenization for a long time, my team and I did it here - https://github.com/capitalone/DataProfiler
For what it's worth, we were actually working with LSTMs with nearly a billion params back around 2016-2017. Transformers made it far more effective to train and execute, but ultimately LSTMs are able to achieve similar results, though they are slower and require more training data.
pyfiglet

4 1,299 7.4 Python

An implementation of figlet written in Python

Project mention: echo -e doesn't work | /r/bash | 2023-04-27

btw there's also python's native pyfiglet https://github.com/pwaller/pyfiglet
Construct

5 875 7.6 Python

Construct: Declarative data structures for python that allow symmetric parsing and building
xpinyin

0 809 2.6 Python

Translate Chinese hanzi to pinyin (拼音) with Python
python-nameparser

2 635 4.3 Python

A simple Python module for parsing human names into their individual components
Charset Normalizer

4 517 8.6 Python

Truly universal encoding detector in pure Python

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2024-04-18.

Python Text processing related posts

Advanced RAG with guided generation
2 projects | dev.to | 18 Apr 2024
LongRoPE: Extending LLM Context Window Beyond 2M Tokens
1 project | news.ycombinator.com | 22 Feb 2024
utype VS pydantic - a user suggested alternative
2 projects | 15 Feb 2024
Pydantic v2 ruined the elegance of Pydantic v1
1 project | news.ycombinator.com | 28 Jan 2024
Ask HN: Pydantic has too much deprecation. Why is it popular?
1 project | news.ycombinator.com | 3 Jan 2024
OpenAI uses Pydantic for their ChatCompletions API
1 project | news.ycombinator.com | 3 Dec 2023
If you're late, consider creating your CV with this Python code: RenderCV
1 project | /r/gradadmissions | 30 Nov 2023

Index

What are some of the best open-source Text processing projects in Python? This list will help you:

#	Project	Stars
1	pydantic	18,521
2	fuzzywuzzy	9,067
3	diff-match-patch	7,080
4	Chinese Character to Pinyin Converter (Python version)	4,666
5	Lark	4,471
6	ftfy	3,704
7	sqlparse	3,574
8	phonenumbers	3,398
9	TextDistance	3,296
10	PLY	2,695
11	pyparsing	2,083
12	chardet	2,071
13	shortuuid	1,975
14	msgspec	1,839
15	python-slugify	1,446
16	typeguard	1,432
17	python-user-agents	1,404
18	DataProfiler	1,357
19	pyfiglet	1,299
20	Construct	875
21	xpinyin	809
22	python-nameparser	635
23	Charset Normalizer	517