Unicode data file compression: achieving 40-70% reduction over gzip alone

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • ziglyph

    Unicode text processing for the Zig programming language.

  • Yes, sorry about that - I omitted a bit of that information for brevity.

    If you want to play with allkeys.txt (which is far more sequential and simpler than UnicodeData.txt), you only need to remove the non-NFD strings, since the first step of the Unicode Collation Algorithm decomposes the string's code points to canonical NFD form. That removes ~2,000 entries.

    The full file parser code, which strips those out along with other unneeded information such as comments and version information, can be found here: https://github.com/jecolon/ziglyph/blob/main/src/collator/Al...

    If you want to play around with UnicodeData.txt (which is less sequential, more complex data), only two fields are used, the code point and the decomposition field, and only for records where the decomposition field is not empty. The full decomposition type name in angle brackets is not needed; all that matters is whether it is present.

    The full parser code for that file can be found here: https://github.com/jecolon/ziglyph/blob/main/src/normalizer/...

    Hope that helps!
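
The filtering described in the comment above can be sketched in a few lines of Python. This is an illustration only, not the ziglyph parser: the function names `parse_decompositions` and `keep_nfd_entries` are hypothetical, the decomposition field is assumed to be the sixth semicolon-separated field of UnicodeData.txt (index 5), and the NFD check relies on Python's own `unicodedata` tables, which may target a different Unicode version than the data files being read.

```python
import unicodedata


def parse_decompositions(lines):
    """Keep only UnicodeData.txt records with a non-empty decomposition field.

    Returns {code_point: (is_compat, [decomposed code points])}. Only the
    presence of the <type> tag is recorded, not its name.
    """
    table = {}
    for line in lines:
        fields = line.strip().split(";")
        # Assumption: field 0 is the code point, field 5 the decomposition.
        if len(fields) < 6 or not fields[5]:
            continue
        parts = fields[5].split()
        is_compat = parts[0].startswith("<")
        if is_compat:
            parts = parts[1:]  # drop the tag itself, e.g. <noBreak>
        table[int(fields[0], 16)] = (is_compat, [int(p, 16) for p in parts])
    return table


def keep_nfd_entries(lines):
    """Drop comments, @-directives, and non-NFD keys from allkeys.txt lines.

    Returns a list of (key_string, raw_weights) pairs.
    """
    kept = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # strip trailing comments
        if not line or line.startswith("@"):  # blank line or version info
            continue
        key, _, weights = line.partition(";")
        text = "".join(chr(int(cp, 16)) for cp in key.split())
        if unicodedata.is_normalized("NFD", text):  # Python 3.8+
            kept.append((text, weights.strip()))
    return kept
```

Both functions accept any iterable of lines, so an open file object can be passed directly. Their output is only the reduced data set the comment describes; it says nothing about how that data is subsequently encoded or compressed.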

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.


Related posts

  • What are your favorite utility libraries?

    1 project | /r/Zig | 21 Feb 2023
  • Resizable string in Zig?

    2 projects | /r/Zig | 16 Nov 2021
  • uni-algo: Unicode Algorithms Implementation for C/C++

    1 project | news.ycombinator.com | 25 Mar 2024
  • Chunking strings in Elixir: how difficult can it be?

    2 projects | news.ycombinator.com | 4 Jan 2023
  • How do i make this jetpack system ? (I got everything working, but i don't know how to make those cool unicodes)

    1 project | /r/MinecraftCommands | 2 Aug 2022