SudachiPy
Python version of Sudachi, a Japanese tokenizer. (by WorksApplications)
Sudachi
A Japanese Tokenizer for Business (by WorksApplications)
 | SudachiPy | Sudachi
---|---|---
Mentions | 3 | 2
Stars | 348 | 747
Growth | - | 1.5%
Activity | 1.6 | 5.2
Last commit | over 1 year ago | 13 days ago
Language | Python | Java
License | Apache License 2.0 | Apache License 2.0
The number of mentions indicates the total number of mentions that we've tracked plus the number of user suggested alternatives.
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
SudachiPy
Posts with mentions or reviews of SudachiPy.
We have used some of these posts to build our list of alternatives
and similar projects. The last one was on 2022-09-01.
- Sakubun - a tool I made to help you practice kanji, with customized quiz questions and sentences
The current readings were generated with SudachiPy, with a little processing. UniDic seems pretty interesting, I'll check it out. Do you know how good its accuracy is, compared to Sudachi?
- software which turn hiragana and katakana into kanji
There are free tools for both of these things. I made game2text to do OCR and script matching. It includes Sudachi, a segmentation and normalization library, but I have not used its normalization feature in the app. I'm not sure anyone else even wants this feature, but it will be pretty straightforward to add if you're familiar with Python and vanilla JavaScript.
- Tokenizing / picking words out of non-english languages
spaCy uses SudachiPy internally (see the doc comment about that), so if you don't need any of spaCy's extra features or want more control over the tokenization, you could use SudachiPy directly.
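As a rough illustration of that last suggestion, calling SudachiPy directly might look like the sketch below. It assumes the `sudachipy` package and a dictionary package such as `sudachidict_core` are installed. The helper names `build_tokenizer` and `surfaces` are made up for this example and are not part of SudachiPy.

```python
def build_tokenizer():
    """Create a SudachiPy tokenizer object.

    The import is done lazily so this module still loads when SudachiPy
    is not installed.
    """
    # Assumption: sudachipy and a dictionary (e.g. sudachidict_core) are installed.
    from sudachipy import dictionary
    return dictionary.Dictionary().create()


def surfaces(tokenizer_obj, text, mode=None):
    """Return the surface string of every morpheme found in `text`."""
    if mode is None:
        # Let the tokenizer use its default split mode.
        morphemes = tokenizer_obj.tokenize(text)
    else:
        morphemes = tokenizer_obj.tokenize(text, mode)
    return [m.surface() for m in morphemes]


# Example usage (requires SudachiPy and a dictionary):
#   tok = build_tokenizer()
#   print(surfaces(tok, "国家公務員"))
```

Passing the tokenizer in as an argument keeps `surfaces` easy to exercise with a stand-in object, which is handy when the dictionary is not available.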
Sudachi
Posts with mentions or reviews of Sudachi.
We have used some of these posts to build our list of alternatives
and similar projects. The last one was on 2021-08-31.
- Python Text Parsing Project: Furigana Inserter for Anki
Instead of the common segmentation tool MeCab, this project will use Sudachi, which features multiple text segmentation modes as well as furigana retrieval.
- Gauging interest and plausibility of an overhaul of Anki's Morphman
I don't think there's anything special about Ichiran here; rather, as you observe, MeCab isn't quite the right tool for the job. A quick google suggests that people sometimes follow MeCab with J.DepP to end up with bunsetsu, which is (I think) what you'd want for Morphman. Sudachi has Python bindings and offers a couple of different levels of aggression/granularity.
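The two Sudachi features these posts mention, selectable granularity and reading retrieval, can be sketched through the Python bindings as follows. This is only an illustration: it assumes `sudachipy` and a dictionary package such as `sudachidict_core` are installed, and `kata_to_hira` and `readings_by_mode` are hypothetical helpers, not library functions. SudachiPy reports readings in katakana via `reading_form()`, so a furigana tool would typically convert them to hiragana first.

```python
def kata_to_hira(text):
    """Convert katakana to hiragana; all other characters pass through unchanged."""
    out = []
    for ch in text:
        code = ord(ch)
        # Katakana ァ (U+30A1) to ヶ (U+30F6) sit 0x60 above their hiragana counterparts.
        if 0x30A1 <= code <= 0x30F6:
            out.append(chr(code - 0x60))
        else:
            out.append(ch)
    return "".join(out)


def readings_by_mode(text):
    """Tokenize `text` in split modes A, B and C.

    Returns a dict mapping the mode name to a list of
    (surface, hiragana reading) pairs.
    """
    # Assumption: sudachipy plus a dictionary package is installed.
    from sudachipy import dictionary, tokenizer

    tok = dictionary.Dictionary().create()
    modes = {
        "A": tokenizer.Tokenizer.SplitMode.A,  # short units
        "B": tokenizer.Tokenizer.SplitMode.B,  # middle-length units
        "C": tokenizer.Tokenizer.SplitMode.C,  # longest units
    }
    return {
        name: [(m.surface(), kata_to_hira(m.reading_form()))
               for m in tok.tokenize(text, mode)]
        for name, mode in modes.items()
    }
```

The conversion helper is independent of Sudachi, so it can be reused with readings from any analyzer that emits katakana.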
What are some alternatives?
When comparing SudachiPy and Sudachi you can also consider the following projects:
spaCy - 💫 Industrial-strength Natural Language Processing (NLP) in Python
kagome - Self-contained Japanese Morphological Analyzer written in pure Go
momepy - Urban Morphology Measuring Toolkit
MorphMan - Anki plugin that reorders language cards based on the words you know
quanfima - Quanfima (Quantitative Analysis of Fibrous Materials)
wanakana-py - Port of wanakana by the WaniKani team
mecab - Yet another Japanese morphological analyzer
simplemma - Simple multilingual lemmatizer for Python, especially useful for speed and efficiency