Python Text processing

Open-source Python projects categorized as Text processing

Top 23 Python Text processing Projects

  • pydantic

    Data validation using Python type hints

  • Project mention: JSON extra uses orjson instead of ujson | news.ycombinator.com | 2024-06-05

    I'm really surprised ijl got angry that his mail was quoted, it looks innocent enough to me.

    For reference it's been edited out here: https://github.com/pydantic/pydantic/issues/589

    But github shows edits, so the edit is meaningless for privacy. Here's the original mail (yes, I'm blatantly ignoring his request to not publish this, I'm just this evil.)

        I've looked into replacing ujson in pydantic with orjson

  • fuzzywuzzy

    Fuzzy String Matching in Python

  • diff-match-patch

    Diff Match Patch is a high-performance library in multiple languages that manipulates plain text.

  • Project mention: Ideas for approaching pattern matching/distance problem | /r/learnprogramming | 2023-06-29

    I also came across this diff match algorithms: https://github.com/google/diff-match-patch

  • Chinese Character to Pinyin Conversion Tool (Python version)

    Converts Chinese characters to pinyin (pypinyin)

  • Project mention: Pinyin Character Sorting | /r/ChineseLanguage | 2023-07-09

    Could probably whip up a python script real quick with this library: https://github.com/mozillazg/python-pinyin. Probably need some extra logic to deal with heteronyms. Not sure what your goal is.

  • Lark

    Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.

  • Project mention: Show HN: I wrote a RDBMS (SQLite clone) from scratch in pure Python | news.ycombinator.com | 2023-08-13

    Lark supports, and recommends, writing and storing the grammar in a .lark file. We have syntax highlighting support in all major IDEs, and even in github itself. For example, here is Lark's built-in grammar for Python: https://github.com/lark-parser/lark/blob/master/lark/grammar...

    You can also test grammars "live" in our online IDE: https://www.lark-parser.org/ide/

    The rationale is that it's more terse and has less visual clutter than a DSL over Python, which makes it easier to read and write.

  • ftfy

    Fixes mojibake and other glitches in Unicode text, after the fact.

  • Project mention: You can't just assume UTF-8 | news.ycombinator.com | 2024-04-29

    If you’re actually in a position where you need to guess the encoding, something like “ftfy” <https://github.com/rspeer/python-ftfy> (webapp: <https://ftfy.vercel.app/>) is a perfectly reasonable choice.

    But, you should always do your absolute utmost not to be put in a situation where guessing is your only choice.
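
    To make the term concrete: mojibake typically appears when bytes written in one encoding are decoded with another. The sketch below uses only the standard library, not ftfy itself, and handles a single known mix-up (UTF-8 bytes read as Latin-1). The helper name undo_latin1_mojibake is hypothetical and chosen for illustration; real-world text usually needs the broader heuristics that a dedicated library provides.

```python
# Illustration only: how mojibake arises and how one known case can be reversed.
# This is not ftfy's algorithm; it covers a single, assumed encoding mix-up.
original = "café"

# UTF-8 bytes misread as Latin-1 produce the classic garbling of accented letters.
garbled = original.encode("utf-8").decode("latin-1")


def undo_latin1_mojibake(text: str) -> str:
    """Reverse a UTF-8-read-as-Latin-1 mix-up; return the text unchanged if that fails."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not produced by this particular mix-up, so leave it alone.
        return text


repaired = undo_latin1_mojibake(garbled)
print(repr(garbled), "->", repr(repaired))
```

    The try/except fallback reflects the advice in the comment above: the round trip is only safe when the wrong encoding is actually known, and text that does not fit that assumption is left untouched.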

  • sqlparse

    A non-validating SQL parser module for Python

  • phonenumbers

    Python port of Google's libphonenumber

  • TextDistance

    📐 Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.

  • PLY

    Python Lex-Yacc

  • pyparsing

    Python library for creating PEG parsers

  • chardet

    Python character encoding detector

  • Project mention: This Week In Python | dev.to | 2024-05-10

    chardet – Python character encoding detector

  • shortuuid

    A generator library for concise, unambiguous and URL-safe UUIDs.

  • Project mention: Ask HN: Could you show your personal blog here? | news.ycombinator.com | 2023-07-04

    Unlike many people here, I don't like to write hundreds of mediocre posts. Instead, I prefer very few posts, that unfortunately are still mediocre.

    If you're tired of all the perfection that exists on the internet, where every piece is deeply insightful and changes your life, I'd encourage you to read my articles, which only promise to shorten it:

    https://www.stavros.io/

  • msgspec

    A fast serialization and validation library, with builtin support for JSON, MessagePack, YAML, and TOML

  • Project mention: Htmx, Rust and Shuttle: A New Rapid Prototyping Stack | news.ycombinator.com | 2023-11-01
  • python-slugify

    Returns unicode slugs

  • typeguard

    Run-time type checker for Python

  • python-user-agents

    A Python library that provides an easy way to identify devices like mobile phones, tablets and their capabilities by parsing (browser) user agent strings.

  • DataProfiler

    What's in your data? Extract schema, statistics and entities from datasets

  • Project mention: LongRoPE: Extending LLM Context Window Beyond 2M Tokens | news.ycombinator.com | 2024-02-22

    It's been possible to skip tokenization for a long time, my team and I did it here - https://github.com/capitalone/DataProfiler

    For what it's worth, we actually were working with LSTMs with nearly a billion params back in 2016-2017 area. Transformers made it far more effective to train and execute, but ultimately LSTMs are able to achieve similar results, though slow & require more training data.

  • pyfiglet

    An implementation of figlet written in Python

  • Construct

    Construct: Declarative data structures for python that allow symmetric parsing and building

  • xpinyin

    Translate Chinese hanzi to pinyin (拼音) in Python

  • python-nameparser

    A simple Python module for parsing human names into their individual components

  • Charset Normalizer

    Truly universal encoding detector in pure Python

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Index

What are some of the best open-source Text processing projects in Python? This list will help you:

Rank  Project                                                       Stars
1     pydantic                                                      19,420
2     fuzzywuzzy                                                     9,134
3     diff-match-patch                                               7,225
4     Chinese Character to Pinyin Conversion Tool (Python version)   4,733
5     Lark                                                           4,575
6     ftfy                                                           3,727
7     sqlparse                                                       3,622
8     phonenumbers                                                   3,438
9     TextDistance                                                   3,321
10    PLY                                                            2,724
11    pyparsing                                                      2,132
12    chardet                                                        2,107
13    shortuuid                                                      2,016
14    msgspec                                                        1,972
15    python-slugify                                                 1,463
16    typeguard                                                      1,466
17    python-user-agents                                             1,418
18    DataProfiler                                                   1,377
19    pyfiglet                                                       1,324
20    Construct                                                        894
21    xpinyin                                                          820
22    python-nameparser                                                639
23    Charset Normalizer                                               536
