Why LuaJIT's interpreter is written in assembly

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • LuaJIT

    Mirror of the LuaJIT git repository

  • > There is nothing to add to it.

    I'm not sure that's true. Maybe LuaJIT was never going to add the features it's missing from Lua 5.2, 5.3 and 5.4. However, when Mike Pall stepped back in 2015 [0], he had still been planning to further improve the implementation - for example with a new garbage collector [1] and "hyperblock scheduling" [2] (which remain unimplemented), plus 64-bit pointer support (which was eventually completed by other people).

    [0] https://www.freelists.org/post/luajit/Looking-for-new-LuaJIT...

    [1] http://wiki.luajit.org/New-Garbage-Collector

    [2] https://github.com/LuaJIT/LuaJIT/issues/37

  • luajit2

    OpenResty's Branch of LuaJIT 2

  • https://github.com/openresty/luajit2

    It has a few extras, but they agree with the original LuaJIT author's opinion that not every Lua 5.2 feature can be supported in the JIT.

  • hexdump

    hexdump.c: Single file C library implementation of the BSD command-line utility (by wahern)

  • It's not just as fast, but the difference is much smaller than it used to be. For example, my hexdump (https://github.com/wahern/hexdump) clone compiles format specifications to bytecode for a tiny built-in VM. If I compile with VM_FASTER=1 (the default if __GNUC__ is defined), I still get an ~10% speed-up on an Intel Core i9 (2019 MacBook Pro) over compiling with VM_FASTER=0. Interestingly, I only get an ~5% speed-up on an M1 (2020 MacBook Air), and on my AMD EPYC 3251 an ~2% slow-down unless I compile with -march=native, in which case there is indeed no appreciable difference.

    Note that I'm rounding speed-up percentages down. I'm using Apple clang 12.0.0 on the i9 and M1, and GCC 9.3.0 on the EPYC, running `LC_ALL=C time ./hexdump -C /dev/null`. Switching between -O2 and -O3 doesn't seem to change the relative gains; ditto for -march=native, except on the EPYC.

    I don't have a pre-Haswell box around to test, but IIRC the difference used to be 30% or greater.

    Also, AFAIU GCC and Clang have improved their switch statement support. For many years they didn't actually optimize switch statements as well as the literature suggested or as students were taught, because aggressive switch optimizations caused too many performance regressions in real-world applications. Only in the past few years do they seem to have figured out how to apply those optimizations more aggressively while avoiding regressions.
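    To make the comparison above concrete, here is a minimal sketch of the two dispatch styles presumably being toggled by VM_FASTER: portable switch dispatch versus GCC/Clang's computed-goto ("labels as values") dispatch, where each opcode handler ends in its own indirect jump and so gives the branch predictor more context. The opcodes and VM below are invented for illustration and are not taken from hexdump.c.

    ```c
    #include <stdio.h>

    enum { OP_PUSH, OP_ADD, OP_HALT };

    /* Run a tiny stack-machine program and return the top of stack at halt.
       The opcode set here is hypothetical, purely to contrast dispatch styles. */
    static int run(const int *code) {
        int stack[16], sp = 0;
    #if defined(__GNUC__) && defined(VM_FASTER)
        /* Computed-goto dispatch: one indirect jump per handler, a GNU C
           extension ("labels as values"), hence the __GNUC__ guard. */
        static void *dispatch[] = { &&op_push, &&op_add, &&op_halt };
        #define NEXT goto *dispatch[*code++]
        NEXT;
    op_push: stack[sp++] = *code++; NEXT;
    op_add:  sp--; stack[sp-1] += stack[sp]; NEXT;
    op_halt: return stack[sp-1];
        #undef NEXT
    #else
        /* Portable switch dispatch: every iteration funnels through one
           shared indirect jump (or comparison tree) at the switch. */
        for (;;) {
            switch (*code++) {
            case OP_PUSH: stack[sp++] = *code++; break;
            case OP_ADD:  sp--; stack[sp-1] += stack[sp]; break;
            case OP_HALT: return stack[sp-1];
            }
        }
    #endif
    }

    int main(void) {
        /* Computes 2 + 3. */
        const int prog[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_HALT };
        printf("%d\n", run(prog)); /* prints 5 */
        return 0;
    }
    ```

    Both paths compute the same result; the interesting part is the generated dispatch code, which is what the VM_FASTER=0/1 timings above are measuring.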
