Python Text processing

Open-source Python projects categorized as Text processing

Top 23 Python Text processing Projects

  • pydantic

    Data parsing and validation using Python type hints

    Project mention: Strict Python Function Parameters | 2022-01-23

    Slightly off-topic, but everyone writing modern Python should be familiar with Pydantic and similar libraries that use type hints for validation and parsing.

    We're using Pydantic for Robusta and absolutely love it. You get the best of traditional Python (rapid prototyping and no boilerplate) while still being able to scale your codebase and keep it maintainable. Robusta is the first large project I've written in Python where I'm not encountering type errors at runtime left and right.

  • fuzzywuzzy

    Fuzzy String Matching in Python

    Project mention: I made a bot that stops muck chains, here are the phrases that he looks for to flag the comment as a muck comment. Are there any muck forms I forgot about? | 2021-12-08

    You can have a look at this library to use fuzzy search instead of looking for plaintext muck.
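
    A minimal sketch of the fuzzy-search idea from the comment above. It uses the standard library's difflib as a stand-in for fuzzywuzzy so it runs without extra installs. The function names, the 0.8 threshold and the sample comments are illustrative assumptions, not taken from the bot in question.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def contains_fuzzy(comment: str, target: str = "muck", threshold: float = 0.8) -> bool:
    """Return True if any word in the comment is close enough to the target.

    The threshold is an assumed value; tune it against real comments.
    """
    words = comment.split()
    # Strip common trailing/leading punctuation before comparing each word.
    return any(similarity(word.strip(".,!?"), target) >= threshold for word in words)


if __name__ == "__main__":
    for text in ("MUCK!", "so muuck today", "thank you very much"):
        print(text, "->", contains_fuzzy(text))
```

    With these settings a misspelling such as "muuck" is flagged, while the ordinary word "much" stays just under the threshold. That margin is narrow, which is why the cutoff should be checked against real data.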

  • diff-match-patch

    Diff Match Patch is a high-performance library in multiple languages that manipulates plain text.

    Project mention: Keeping track of changes made to xml file. | 2021-10-18

    A bit late to the party but have you checked this? google/diff-match-patch

  • 汉字拼音转换工具(Python 版) (Chinese character to pinyin conversion tool, Python version)


  • ftfy

    Fixes mojibake and other glitches in Unicode text, after the fact.

  • Lark

    Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.

    Project mention: Made a Programing language using python | 2021-11-29

    There's also lark, which is used by a plethora of projects (I haven't used it, but I heard about PreQL on a podcast where they talk for a bit about what it's like to develop a new language in lark).

  • phonenumbers

    Python port of Google's libphonenumber

    Project mention: Does anyone know where I can find official docs for python-phonenumbers package? | 2022-01-12

    This is the GitHub repo for the package.

  • sqlparse

    A non-validating SQL parser module for Python

    Project mention: Open Source SQL Parsers | 2021-10-08

    Regular expressions are a popular approach to extracting information from SQL statements. However, regular expressions quickly become too complex to handle common features like WITH, sub-queries, window clauses, aliases and quotes. sqlparse is a popular Python package that uses regular expressions to parse SQL.
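
    To make that limitation concrete, here is a small sketch that uses only the standard library's re module (not sqlparse). The pattern, the function name naive_tables and the sample queries are all made up for illustration.

```python
from __future__ import annotations

import re

# Naive assumption: whatever follows FROM or JOIN is a table name.
TABLE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def naive_tables(sql: str) -> list[str]:
    """Return every identifier that directly follows FROM or JOIN."""
    return TABLE_RE.findall(sql)


simple = "SELECT u.id FROM users u JOIN orders o ON u.id = o.user_id"
with_cte = "WITH recent AS (SELECT * FROM orders) SELECT * FROM recent"
with_quote = "SELECT 'copied from backup' AS note FROM logs"

if __name__ == "__main__":
    for query in (simple, with_cte, with_quote):
        print(naive_tables(query))
```

    The first query yields ['users', 'orders'] as intended. The second also reports the CTE alias recent as if it were a table, and the third picks up backup from inside a string literal. Handling each such case means piling on more patterns, which is the complexity the quote describes.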

  • TextDistance

    Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.

    Project mention: life4/textdistance: Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage. | 2021-09-06

  • PLY

    Python Lex-Yacc

  • chardet

    Python character encoding detector

    Project mention: 3 Ways to Handle non UTF-8 Characters in Pandas | 2022-01-20

    chardet is a library for detecting character encodings. Once installed, you can use it to determine the encoding of your data.

  • jellyfish

    🎐 a python library for doing approximate and phonetic matching of strings.

    Project mention: Comparing Strings (Street Names) With Machine Learning | 2021-11-11

    When comparing strings (in our case street names), there are plenty of off-the-shelf features that can be used, such as those provided by the jellyfish. This package also provides a number of phonetic encodings. We can combine an encoding with a metric, such as Levenshtein Distance, to measure the phonetic similarity between two street names.
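
    The combination described above (a phonetic encoding plus an edit-distance metric) can be sketched in pure Python. The functions below are hand-written stand-ins for illustration, not jellyfish's own implementations. The choice of American Soundex as the encoding, the helper names and the street names are assumptions.

```python
# Soundex digit for each consonant; vowels, H, W and Y have no code.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """American Soundex: first letter plus three digits, zero-padded."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (encoded + "000")[:4]


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev_row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        row = [i]
        for j, cb in enumerate(b, start=1):
            row.append(min(prev_row[j] + 1,                # deletion
                           row[j - 1] + 1,                 # insertion
                           prev_row[j - 1] + (ca != cb)))  # substitution
        prev_row = row
    return prev_row[-1]


def phonetic_distance(street_a: str, street_b: str) -> int:
    """Edit distance between the word-by-word Soundex keys of two names."""
    key_a = " ".join(soundex(w) for w in street_a.split())
    key_b = " ".join(soundex(w) for w in street_b.split())
    return levenshtein(key_a, key_b)


if __name__ == "__main__":
    print(levenshtein("Meyer Road", "Maier Road"))        # spelling distance
    print(phonetic_distance("Meyer Road", "Maier Road"))  # phonetic distance
```

    Here "Meyer Road" and "Maier Road" differ by two edits as plain strings but map to the same phonetic key, so their phonetic distance is zero. That is the kind of feature the article feeds into its comparison.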

  • shortuuid

    A generator library for concise, unambiguous and URL-safe UUIDs.

    Project mention: Building a Micro Business: What Services I Pay For | 2021-12-30

    skorokithakis: developer of django-annoying and shortuuid

  • pyparsing

    Python library for creating PEG parsers

    Project mention: Parser Combinators in Haskell | 2021-12-22

    Since it is not mentioned in the article: Python users may also want to check out pyparsing. It is slightly different from Parsec/FParsec (for instance, it ignores all whitespace by default), but I think it is a really good project.


  • python-user-agents

    A Python library that provides an easy way to identify devices like mobile phones, tablets and their capabilities by parsing (browser) user agent strings.

  • python-slugify

    Returns unicode slugs

  • pyfiglet

    An implementation of figlet written in Python

    Project mention: pyfiglet VS python-asciistuff - a user suggested alternative | 2022-01-15

  • xpinyin

    Translate Chinese hanzi to pinyin (拼音) with Python (汉字转拼音, "hanzi to pinyin")

  • Construct

    Construct: Declarative data structures for python that allow symmetric parsing and building

    Project mention: Binary serialization library for at least C++17? | 2021-10-10

    I myself am looking for a binary serializer/deserializer that's like construct in python or construct-js, but obviously I wouldn't need some of the types that they have, since C++ already has them.

  • python-nameparser

    A simple Python module for parsing human names into their individual components

  • awesome-slugify

    Python flexible slugify function

  • unicode-slugify

    A slugifier that works in unicode

  • Charset Normalizer

    🔎 Like Chardet. 🚀 Package for encoding & language detection. Charset detection.

    Project mention: Everything to know about Requests v2.26.0 | 2021-07-13

    Starting in v2.26.0, for Python 3 the new default library for encoding detection will be charset_normalizer, which is MIT licensed. The library itself is relatively young, so a lot of work has gone into making sure users aren't broken by this change, including extensive tests against real-life websites and comparing the results against chardet to ensure better performance and accuracy in every case.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2022-01-23.

What are some of the best open-source Text processing projects in Python? This list will help you:

 #   Project                       Stars
 1   pydantic                      8,781
 2   fuzzywuzzy                    8,587
 3   diff-match-patch              5,003
 4   汉字拼音转换工具(Python 版)   3,665
 5   ftfy                          3,178
 6   Lark                          2,969
 7   phonenumbers                  2,916
 8   sqlparse                      2,696
 9   TextDistance                  2,590
10   PLY                           2,090
11   chardet                       1,646
12   jellyfish                     1,606
13   shortuuid                     1,581
14   pyparsing                     1,370
15   python-user-agents            1,228
16   python-slugify                1,157
17   pyfiglet                        938
18   xpinyin                         735
19   Construct                       705
20   python-nameparser               516
21   awesome-slugify                 463
22   unicode-slugify                 301
23   Charset Normalizer              219