| | polyglot | Jieba |
|---|---|---|
| Mentions | 1 | 8 |
| Stars | 2,321 | 33,624 |
| Growth | 0.3% | 0.3% |
| Activity | 0.0 | 0.0 |
| Last commit | about 1 year ago | 6 months ago |
| Language | Python | Python |
| License | GNU General Public License v3.0 or later | MIT License |
Stars: the number of stars a project has on GitHub. Growth: month-over-month growth in stars.
Activity: a relative number indicating how actively a project is being developed; recent commits carry more weight than older ones. For example, an activity of 9.0 indicates a project is among the top 10% of the most actively developed projects tracked here.
Posts with mentions or reviews of polyglot

Posts with mentions or reviews of Jieba
- Show HN: Mandarin Word Segmenter with Translation
Thanks for the kind words!
I'm using Jieba[0] because it hits a nice balance of fast and accurate. But I'm initializing it with a custom dictionary (~800k entries), and have added several layers of heuristic post-segmentation. For example, Jieba tends to split up chengyu into two words, but I've decided they should be displayed as a single word, since chengyu are typically a single entry in dictionaries.
[0] https://github.com/fxsjy/jieba
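As a rough sketch of the approach that comment describes, here is how a custom user dictionary is loaded through jieba's public API. The 800k-entry dictionary and the post-segmentation heuristics are the commenter's own and not public, so the file name and entries below are hypothetical stand-ins:

```python
import jieba

# Each user-dictionary line is: word [frequency] [POS tag], e.g.
# "画蛇添足 100 i". Frequency and tag are optional; a high enough
# frequency keeps a chengyu from being split into two words.
jieba.load_userdict("userdict.txt")  # hypothetical file

# Individual words can also be pinned at runtime:
jieba.add_word("画蛇添足", freq=100000)

print(jieba.lcut("他这样做简直是画蛇添足"))
# e.g. ['他', '这样', '做', '简直', '是', '画蛇添足']
```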
- PostgreSQL Full-Text Search in a Nutshell
Let's continue with jieba as an example. This is the main program logic behind pg_jieba; jieba itself is also a Python package, so let's use Python for the example.
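The excerpt stops before the article's own example. As a hedged stand-in, here is jieba's search-engine mode, the tokenizer variant commonly used for full-text indexing because it also emits the overlapping shorter tokens that queries can match; this is jieba's standard API, not the article's exact code:

```python
import jieba

# Search-engine mode over-generates subwords of long compounds,
# e.g. 中国科学院 also yields 中国, 科学, 学院, 科学院.
text = "小明硕士毕业于中国科学院计算所"
print(jieba.lcut_for_search(text))
# e.g. ['小明', '硕士', '毕业', '于', '中国', '科学', '学院',
#       '科学院', '中国科学院', '计算', '计算所']
```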
- [OC] How Many Chinese Characters You Need to Learn to Read Chinese!
jieba to do Chinese word segmentation
- Sentence parser for Mandarin?
Jieba: Chinese text segmenter
- How many in here use google sheets to keep track on their Chinese vocabulary? (2 pics) - More info in the comments
If you know some Python you can use a popular library called Jieba (结巴) to automatically get pinyin for every word. (Jieba has actually been ported to many languages.) You can also use it to break a Chinese text into a set of unique words for easy addition to your spreadsheet.
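A note on that workflow: jieba handles the segmentation half, while the pinyin lookup is usually delegated to a separate library. pypinyin is assumed here purely for illustration, since the comment doesn't name one. A minimal sketch:

```python
import jieba
from pypinyin import pinyin  # assumption: pypinyin for the pinyin step

text = "我们明天去图书馆看书"
words = sorted(set(jieba.lcut(text)))  # unique words, spreadsheet-ready

for word in words:
    # pinyin() returns a list of candidate readings per character;
    # take the first candidate for each.
    syllables = [s[0] for s in pinyin(word)]
    print(word, " ".join(syllables))
# e.g. 图书馆 tú shū guǎn
```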
- Where can I download a database of Chinese word classifications (noun, verb, etc.)?
- Learn vocabulary effortlessly while browsing the web [FR,EN,DE,PT,ES]
Since you're saying the main issue is segmentation, there are libraries to help with that. jieba is fantastic if you have a Python backend; nodejieba (50k downloads/week) if you're more on the JS side.
- I'm looking for a specific vocab list
https://github.com/fxsjy/jieba/ (has some good word frequency data)
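For what it's worth, that frequency data lives in jieba's bundled dict.txt, a plain-text file with one `word frequency POS-tag` entry per line. A minimal sketch of reading it, assuming a standard pip install layout:

```python
import os
import jieba

# dict.txt ships inside the jieba package (standard pip layout assumed)
dict_path = os.path.join(os.path.dirname(jieba.__file__), "dict.txt")

freqs = {}
with open(dict_path, encoding="utf-8") as f:
    for line in f:
        word, freq, _tag = line.split()
        freqs[word] = int(freq)

# print the ten highest-frequency entries
for word, freq in sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(word, freq)
```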
What are some alternatives?
spaCy - 💫 Industrial-strength Natural Language Processing (NLP) in Python
NLTK - NLTK Source
SnowNLP - Python library for processing Chinese text
TextBlob - Simple, Pythonic text processing: sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.
pkuseg-python - The pkuseg toolkit for multi-domain Chinese word segmentation
langid.py - Stand-alone language identification system
Stanza - Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
Pattern - Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.