icu_segmenter implements rule-based segmentation, so you can customize the segmentation rules to your needs by writing some TOML and feeding it to datagen. The concept of a "character" or "word" has no single cross-linguistic meaning; it is not uncommon to need to tailor these algorithms by use case or even just by the language being used. E.g. handling viramas in Indic scripts as part of grapheme segmentation is something people might need, or might not, and UAX29 doesn't support that at the moment¹. CLDR contains a bunch of common tailorings for specific locales, but as I mentioned, folks may tailor further based on use case.
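To make the "tailoring" idea concrete, here is a toy sketch of a rule-driven segmenter in plain Rust. It is not the icu_segmenter API and does not use real UAX29 or CLDR data: the `Class` enum, the tiny Devanagari-only `classify` table, and the `Rules` struct are all made up for illustration. The only point is that the break decision comes from a rule list, so adding one rule changes how a virama sequence is grouped.

```rust
// Toy rule-based segmenter. Hypothetical names throughout; this is NOT
// icu_segmenter and the character classes are a tiny hand-picked subset.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Class {
    Virama,
    Mark,
    Other,
}

// Classify a few Devanagari code points; everything else is `Other`.
fn classify(c: char) -> Class {
    match c {
        '\u{094D}' => Class::Virama,            // DEVANAGARI SIGN VIRAMA
        '\u{093E}'..='\u{094C}' => Class::Mark, // dependent vowel signs
        _ => Class::Other,
    }
}

// Each rule is (class before, class after); `None` matches any class.
// A matching rule means "do not break between these two characters".
struct Rules {
    no_break: Vec<(Option<Class>, Option<Class>)>,
}

// Baseline: never break before a mark or a virama.
fn default_rules() -> Rules {
    Rules {
        no_break: vec![(None, Some(Class::Mark)), (None, Some(Class::Virama))],
    }
}

// Tailoring: additionally keep whatever follows a virama in the same segment.
// (Deliberately crude: a real tailoring would restrict this to consonants.)
fn tailored_rules() -> Rules {
    let mut rules = default_rules();
    rules.no_break.push((Some(Class::Virama), Some(Class::Other)));
    rules
}

fn segment(text: &str, rules: &Rules) -> Vec<String> {
    let mut out: Vec<String> = Vec::new();
    let mut prev: Option<Class> = None;
    for c in text.chars() {
        let cur = classify(c);
        let joins = match prev {
            Some(p) => rules.no_break.iter().any(|(before, after)| {
                before.map_or(true, |b| b == p) && after.map_or(true, |a| a == cur)
            }),
            None => false, // always start a segment at the beginning of text
        };
        match out.last_mut() {
            Some(last) if joins => last.push(c),
            _ => out.push(c.to_string()),
        }
        prev = Some(cur);
    }
    out
}

fn main() {
    // KA + VIRAMA + SSA, i.e. the conjunct "क्ष"
    let text = "\u{0915}\u{094D}\u{0937}";
    println!("default:  {:?}", segment(text, &default_rules()));
    println!("tailored: {:?}", segment(text, &tailored_rules()));
}
```

With the baseline rules the conjunct splits into two segments after the virama; with the extra rule it stays as one. In the real library the equivalent change would live in the rule data rather than in code like this.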