Unicode data file compression: achieving 40-70% reduction over gzip alone

ziglyph

5 207 6.7 Zig

Unicode text processing for the Zig programming language.

Yes, sorry about that - I omitted some of that information for brevity.
If you want to play with allkeys.txt (which is far more sequential, simpler data than UnicodeData.txt), you only need to remove the non-NFD strings, since the first step of the Unicode Collation Algorithm decomposes the string's code points to canonical NFD form. That removes ~2,000 entries.
The full file parser code, which strips those out along with other unneeded information such as comments and version information, can be found here: https://github.com/jecolon/ziglyph/blob/main/src/collator/Al...
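For readers who want to try the allkeys.txt filtering without reading the Zig source, here is a minimal Python sketch of the idea. It is not ziglyph's parser: the function names are hypothetical, the weights in the usage example are made-up sample values, and it assumes the usual `code points ; weights # comment` line layout with `@`-prefixed directive lines. Python's `unicodedata` module may also target a different Unicode version than the file you download, so the number of dropped entries may differ.

```python
import unicodedata


def parse_allkeys_line(line):
    """Return (code_points, weights_text) for a data line, or None to skip it."""
    line = line.split("#", 1)[0].strip()   # drop the trailing comment, if any
    if not line or line.startswith("@"):   # blank lines and @version-style directives
        return None
    cps_text, _, weights = line.partition(";")
    code_points = tuple(int(cp, 16) for cp in cps_text.split())
    return code_points, weights.strip()


def is_nfd(code_points):
    """True if the code point sequence is already in canonical NFD form."""
    text = "".join(chr(cp) for cp in code_points)
    return unicodedata.normalize("NFD", text) == text


def filter_allkeys(lines):
    """Keep only the data entries whose code point sequence is in NFD form."""
    kept = []
    for line in lines:
        entry = parse_allkeys_line(line)
        if entry is not None and is_nfd(entry[0]):
            kept.append(entry)
    return kept
```

A possible usage, with illustrative weights: `filter_allkeys(["@version 13.0.0", "0041 ; [.1D7E.0020.0008] # A", "00C0 ; [.1D7E.0020.0008][.0000.0025.0002] # A WITH GRAVE"])` keeps only the entry for U+0041, because U+00C0 decomposes to U+0041 U+0300 and is therefore not in NFD form.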
If you want to play around with UnicodeData.txt (which is less sequential, more complex data), only two fields are used (the code point and the decomposition field), and only for records where the second of those fields, the decomposition, is not empty. The full decomposition type name in angle brackets is not needed; all that matters is whether it is present.
The full parser code for that file can be found here: https://github.com/jecolon/ziglyph/blob/main/src/normalizer/...
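As a rough illustration of that reduction, the following Python sketch extracts just the two fields described above from a UnicodeData.txt line. It is not the linked Zig parser; the function name is hypothetical. It relies on the standard layout of UnicodeData.txt, where fields are separated by `;`, the code point is field 0, and the decomposition is field 5, and it records the angle-bracket tag only as a boolean.

```python
def parse_decomposition(line):
    """Return (code_point, has_tag, mapping) or None if there is no decomposition.

    has_tag is True when the decomposition field starts with an angle-bracket
    type such as <noBreak>; the type name itself is discarded.
    """
    fields = line.rstrip("\n").split(";")
    code_point = int(fields[0], 16)
    decomposition = fields[5]
    if not decomposition:          # most records have no decomposition at all
        return None
    parts = decomposition.split()
    has_tag = parts[0].startswith("<")
    if has_tag:
        parts = parts[1:]          # drop the tag, keep only the code points
    mapping = tuple(int(p, 16) for p in parts)
    return code_point, has_tag, mapping
```

For example, the record for U+00C0 (`00C0;LATIN CAPITAL LETTER A WITH GRAVE;Lu;0;L;0041 0300;...`) yields `(0xC0, False, (0x41, 0x300))`, while the record for U+0041, which has an empty decomposition field, yields `None`.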
Hope that helps!


NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.


What are your favorite utility libraries?

1 project | /r/Zig | 21 Feb 2023
Resizable string in Zig?

2 projects | /r/Zig | 16 Nov 2021
uni-algo: Unicode Algorithms Implementation for C/C++

1 project | news.ycombinator.com | 25 Mar 2024
Chunking strings in Elixir: how difficult can it be?

2 projects | news.ycombinator.com | 4 Jan 2023
How do i make this jetpack system ? (I got everything working, but i don't know how to make those cool unicodes)

1 project | /r/MinecraftCommands | 2 Aug 2022


This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com
Unicode Zig Characters grapheme-clusters utf-8
Post date: 4 Jul 2021
