Brian Kernighan adds Unicode support to Awk (May, 2022)

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • awk

    One true awk

    It needs the length for operations such as substring, or to apply length modifiers on regular expressions (such as \w{3,5}), which is a common thing in awk programs.

    In fact, u8_rune, as implemented in the branch we are discussing (https://github.com/onetrueawk/awk/compare/unicode-support), returns a length that is later used as an offset.

    This is not me saying it; it's the author. There is a code comment there:

    > For most of Awk, utf-8 strings just "work", since they look like null-terminated sequences of 8-bit bytes. Functions like length(), index(), and substr() have to operate in units of utf-8 characters. The u8_* functions in run.c handle this.

    I know there might be different ways of doing it, but we're talking about a specific implementation.

    I was wrong to assume he stores strings as UTF-32. He could have, but the existing code already made UTF-8 storage easier to implement.
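
    The idea of a decoder that returns a byte length, which the caller then uses as an offset, can be sketched as follows. This is a minimal illustration, not the code in run.c: the names u8_len, u8_strlen, and u8_substr are hypothetical, and invalid lead bytes are simply treated as one-byte characters.

```python
def u8_len(lead: int) -> int:
    """Return the byte length of a UTF-8 sequence, judged by its lead byte."""
    if lead < 0x80:
        return 1          # 0xxxxxxx: ASCII
    if lead >> 5 == 0b110:
        return 2          # 110xxxxx
    if lead >> 4 == 0b1110:
        return 3          # 1110xxxx
    if lead >> 3 == 0b11110:
        return 4          # 11110xxx
    return 1              # invalid lead byte: assumption, count it as one byte


def u8_strlen(s: bytes) -> int:
    """Count characters by hopping from lead byte to lead byte."""
    i = n = 0
    while i < len(s):
        i += u8_len(s[i])  # the returned length is the offset to the next character
        n += 1
    return n


def u8_substr(s: bytes, start: int, count: int) -> bytes:
    """Awk-style substr: 1-based start, both arguments in characters."""
    i = 0
    for _ in range(start - 1):   # skip start-1 characters
        if i >= len(s):
            break
        i += u8_len(s[i])
    j = i
    for _ in range(count):       # then take up to count characters
        if j >= len(s):
            break
        j += u8_len(s[j])
    return s[i:j]


text = "héllo wörld".encode("utf-8")
print(len(text), u8_strlen(text))       # 13 bytes, 11 characters
print(u8_substr(text, 2, 4).decode())   # éllo
```

    Both helpers walk the string from the start, which is why character-based length() and substr() cost time proportional to the string length when the storage is UTF-8.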

  • tectonic

    A modernized, complete, self-contained TeX/LaTeX engine, powered by XeTeX and TeXLive.

    > He says he wanted to try "XeTeX" (which supports Unicode) but "...I was going to download it as an experiment and they wanted 5 gigabytes and 5 gigabytes at the particular boonies place I'm living would...mmm..not be finished yet!"

    He can try "Tectonic" [0], a modern XeTeX-based TeX/LaTeX distribution that installs a minimal system and then downloads and installs dependencies on demand. Tectonic is written in C and Rust [1].

    [0] https://tectonic-typesetting.github.io/en-US/

    [1] https://github.com/tectonic-typesetting/tectonic

  • goawk

    A POSIX-compliant AWK interpreter written in Go, with CSV support

    Yes, that's right. My simplistic UTF-8-based implementation turned length(), for example, from O(1) to O(N), which turns O(N) algorithms that call length() into O(N^2). See this issue: https://github.com/benhoyt/goawk/issues/93

    It's similar with substr() and other string functions: operating on bytes they are O(1), but they become O(N) when they have to count code points in UTF-8.

    GNU Gawk has a fancier approach: it stores strings as UTF-8 as long as it can, but converts to UTF-32 when it needs to (e.g. the string is non-ASCII and you call substr).

    It looks like Brian Kernighan's code has the same issue with length() and substr(). I'm going to try to email him about this, as I think it's kind of a performance blocker.
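
    The "convert only when needed" approach described above can be sketched roughly like this. It is an illustration of the general idea, not Gawk's or GoAWK's actual code: the LazyStr class is hypothetical, Python's str stands in for the fixed-width (UTF-32-like) form, and substr is simplified to assume start >= 1.

```python
class LazyStr:
    """UTF-8 bytes plus a fixed-width decoded form that is built on first need."""

    def __init__(self, data: bytes):
        self.data = data
        self.is_ascii = data.isascii()   # one O(N) scan when the string is created
        self._chars = None               # decoded form, created lazily

    def _decoded(self) -> str:
        if self._chars is None:
            # One-time O(N) conversion; later calls reuse the cached result.
            self._chars = self.data.decode("utf-8")
        return self._chars

    def length(self) -> int:
        if self.is_ascii:
            return len(self.data)        # bytes == characters, O(1)
        return len(self._decoded())      # O(1) after the first conversion

    def substr(self, start: int, count: int) -> "LazyStr":
        """Awk-style substr with a 1-based start, in characters (simplified)."""
        lo = max(start - 1, 0)
        if self.is_ascii:
            return LazyStr(self.data[lo:lo + count])   # plain byte slicing
        return LazyStr(self._decoded()[lo:lo + count].encode("utf-8"))


s = LazyStr("naïve café".encode("utf-8"))
print(s.length())                          # 10
print(s.substr(3, 3).data.decode("utf-8")) # ïve
```

    With this layout, a loop that calls length() on the same string repeatedly pays for the scan once rather than on every call, and pure-ASCII strings never pay for a conversion at all.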

  • texlive-batch-installation

    Python Script for texlive batch installation

    The problem is that TeX Live still defaults to doing a full install.

    A full install means installing ~4000 packages, including their source files (tens of thousands of TeX files), built documentation (thousands of PDF files), and hundreds of free fonts (OTFs, TTFs, TeX's own format).

    This is huge (>7 GB, not just the 5 GB claimed here).

    However, you don't need 99% of this for any given document.

    Skipping the source files and documentation PDFs alone reduces the size by roughly half.

    Installing only the packages you really need on top of a minimal installation gives you a few hundred megabytes at most, even for complex documents.

    It's a bit annoying to get the list of packages needed, though, since there is not really any working dependency management.

    I wrote a Python wrapper around the TeX Live installer [1] to make this easy for CI jobs, see e.g. [2].

    On a side note: I'd recommend LuaTeX over XeTeX.

    - [1] https://github.com/maxnoe/texlive-batch-installation/

    - [2] https://github.com/pep-dortmund/toolbox-workshop/blob/8b00f0...
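
    A minimal install without sources and documentation is usually described to the TeX Live installer through a profile file. The sketch below generates such a profile as text. It is not the API of the wrapper linked above: minimal_profile is a hypothetical helper, the install path is a placeholder, and the option names are given as I recall them from the TeX Live installer documentation, so check them against your installer version.

```python
def minimal_profile(texdir: str) -> str:
    """Build the text of an installer profile for a minimal TeX Live install."""
    lines = [
        "selected_scheme scheme-minimal",   # smallest scheme instead of the full one
        f"TEXDIR {texdir}",                 # installation directory (placeholder)
        "tlpdbopt_install_docfiles 0",      # skip built documentation (assumed option name)
        "tlpdbopt_install_srcfiles 0",      # skip package sources (assumed option name)
    ]
    return "\n".join(lines) + "\n"


print(minimal_profile("/opt/texlive"), end="")
```

    The generated file would be handed to the installer through its profile option, and the packages a document actually needs would then be added one by one with tlmgr install.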

  • toolbox-workshop

    Materials for the PeP et al. Toolbox Workshop

  • regex

    An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.

    This is just false. UTS#18 specifies multiple levels of Unicode support. \X is part of level 2. It is perfectly valid to generally say "has Unicode support" even if it's just Level 1, assuming you document somewhere what precisely is supported.

    For example, I regularly say that Rust's regex crate has Unicode support. But it does not support \X. It's more precisely documented here: https://github.com/rust-lang/regex/blob/master/UNICODE.md
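
    The distinction can be seen in other engines too. The sketch below uses Python's standard re module as a stand-in (it is not Rust's regex crate, and its exact feature set differs): its character classes are Unicode-aware, yet it has no \X escape for extended grapheme clusters.

```python
import re
import unicodedata

# Unicode-aware character classes: \w matches accented and non-Latin letters.
assert re.fullmatch(r"\w+", "naïve") is not None
assert re.fullmatch(r"\w+", "日本語") is not None

# "e" followed by COMBINING ACUTE ACCENT is one user-perceived character
# but two code points, so "." matches only the base letter.
decomposed = "e\u0301"
first = re.match(r".", decomposed).group()
print(len(decomposed), repr(first))

# After NFC normalization the same text is a single code point.
print(len(unicodedata.normalize("NFC", decomposed)))

# \X (extended grapheme cluster) is rejected by this engine as an unknown escape.
try:
    re.compile(r"\X")
    has_grapheme_escape = True
except re.error:
    has_grapheme_escape = False
print(has_grapheme_escape)
```

    An engine like this can still reasonably be described as having Unicode support, as long as its documentation spells out which features are covered.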

