A copy-and-patch JIT compiler for CPython

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • CPython

    The Python programming language

  • > What I wonder is if the current approach, stated as "copy-and-patch auto-generated code for each opcode", can ever reach that point without being replaced by a completely different design along the way.

    Of course this approach produces worse code than a full compiler by definition---stencils are too rigid to be optimized further. A stencil conceptually maps to a single opcode, so the only way to break out of this restriction is to add more opcodes, and there are only so many opcodes and stencils you can prepare. But I think you are thinking too much about the possibility of making Python as fast as, say, C in at least some cases. I believe that won't happen at all, and the current approach clearly shows why.

    Let's consider a simple CPython opcode named `BINARY_ADD`, which has a stack effect of `(a b -- sum)`. Ideally it should eventually be compiled down to fully specialized machine code, something like `add rax, r12` plus some guards. But the actual implementation (`PyNumber_Add` [1]) is far more complex: it may make up to three "slot" calls that add or concatenate the arguments, some of which may call back into Python code.
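To make "copy-and-patch" concrete, here is a minimal Python sketch of the core mechanism: a pre-built machine-code template (a stencil) is copied and its "holes" are patched with runtime values. The byte contents below are placeholders, not real amd64 code, and the helper names are illustrative, not from CPython's JIT:

```python
import struct

# An 8-byte hole that the JIT fills in at patch time (e.g. with a
# pointer or immediate operand). The surrounding bytes stand in for
# pre-compiled machine code; they are fake placeholders here.
HOLE = b"\x00" * 8

def make_stencil(template: bytes) -> bytes:
    """A stencil is just a byte template containing at least one hole."""
    assert HOLE in template
    return template

def patch(stencil: bytes, value: int) -> bytes:
    """Copy the stencil and patch the hole with a little-endian u64."""
    return stencil.replace(HOLE, struct.pack("<Q", value), 1)

# Fake template: one filler byte, the hole, and a trailing filler byte.
tmpl = make_stencil(b"\x90" + HOLE + b"\xc3")
code = patch(tmpl, 0x1234)
```

Because the template is fixed at build time, the JIT can only choose *which* stencil to copy and *what* to patch in; it cannot restructure the code inside the stencil, which is the rigidity the comment above describes.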

    So let's assume that we have done type specialization and the arguments are known to be integers. That results in a single slot call to `PyLong_Add` [2], which is still complex because CPython has two integer representations. Even when both operands are "compact", i.e. at most 31/63 bits long, the sum may have to switch to the other representation when it is no longer compact. So fully specialized machine code is only possible when both arguments are known to be integers, compact, and small enough to leave one spare bit to absorb an overflow. That sounds far more restrictive.
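The "one spare bit" condition can be sketched as a guard. This is an illustrative Python model, not CPython's actual C code; the 63-bit width is an assumption for a 64-bit build, and `compact_add_fast_path` is a hypothetical name:

```python
COMPACT_BITS = 63  # assumed compact-int width on a 64-bit build

def compact_add_fast_path(a: int, b: int):
    """Return a + b if the fast path applies, else None to signal
    a fall back to the generic (representation-switching) slow path."""
    # Require one spare bit: |a|, |b| < 2**62, so |a + b| < 2**63
    # and the sum is guaranteed to stay compact.
    bound = 1 << (COMPACT_BITS - 1)
    if -bound <= a < bound and -bound <= b < bound:
        return a + b
    return None
```

Any operand outside that range, even one that is itself compact, forces the slow path, which is exactly why the fully specialized `add` is so narrowly applicable.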

    [1] https://github.com/python/cpython/blob/36adc79041f4d2764e1da...

    [2] https://github.com/python/cpython/blob/36adc79041f4d2764e1da...

    An uncomfortable truth is that all of this applies almost perfectly to JavaScript as well---the slot resolution would be the `[[ToNumber]]` internal function, and the multiple representations would be something like V8's Smi. Modern JS engines do exploit most of these opportunities, but at the expense of an extremely large codebase with a large potential attack surface. It is really expensive to maintain, and people don't realize that no performant JS engine has been developed by a small group of developers. You have to cut some corners.

    In comparison, CPython's approach is essentially inside out. Any JIT implementation requires you to split all those subtasks into small pieces that can either be optimized out or baked into the generated machine code. So what if we start with the subtasks without thinking about a JIT in the first place? This is what the specializing adaptive interpreter [3] does. The current CPython already has two tiers of interpreters, and micro-opcodes can only appear in the second tier. With them we can split larger opcodes into smaller ones, possibly with optimizations, but performance is still limited by the dispatch logic. The copy-and-patch JIT is not as powerful as a full compiler, but it eliminates the dispatch logic without large design changes, which makes it a good choice for this purpose.

    [3] https://peps.python.org/pep-0659/
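The dispatch overhead the comment describes can be illustrated with a toy micro-op interpreter versus the same trace "compiled" into chained closures, so the per-op dispatch loop disappears. This is an analogy in Python, not CPython's actual tier-two machinery; all names are illustrative:

```python
# Two toy micro-ops operating on an explicit value stack.
def uop_load(stack, v):
    stack.append(v)

def uop_add(stack, _):
    b = stack.pop()
    stack.append(stack.pop() + b)

# A recorded micro-op trace for "2 + 3".
TRACE = [(uop_load, 2), (uop_load, 3), (uop_add, None)]

def interpret(trace):
    """Tier-style execution: a dispatch step runs before every micro-op."""
    stack = []
    for op, arg in trace:
        op(stack, arg)
    return stack[-1]

def compile_trace(trace):
    """'Compile' the trace into a chain of closures: each step calls the
    next directly, so no dispatch loop runs at execution time."""
    def done(stack):
        return stack[-1]
    nxt = done
    for op, arg in reversed(trace):
        def step(stack, op=op, arg=arg, nxt=nxt):
            op(stack, arg)
            return nxt(stack)
        nxt = step
    return nxt

run = compile_trace(TRACE)  # run([]) and interpret(TRACE) both yield 5
```

The closure chain mirrors what the JIT buys: the same micro-op bodies, minus the interpreter's dispatch, while leaving each micro-op's internals untouched.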

  • compiler

    an incomplete toy barebones compiler backend for amd64 x86_64 in Python and an incomplete JIT compiler written in C (by samsquire)

  • Wow! Thank you for your hard work. I use Python for all experimental work, so this would speed up my scripting, such as processing data from API calls or the filesystem.

    I wrote a simple toy JIT for a JavaScript-like language [1]. It might be useful for others to learn from because it is written simply. I do lazy patching of callsites; I haven't gotten anywhere near advanced techniques such as tracing or copy-and-patch. Much of the code for this JIT was written in Python and then ported to C. The Java Virtual Machine's template interpreter is also interesting to research.

    I haven't gotten around to encoding amd64 x86_64 instructions as bitmasks yet, so I've hardcoded them.

    [1]: https://github.com/samsquire/compiler
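For reference, a single amd64 instruction can be encoded by hand in a few lines. The sketch below builds `mov rax, imm64` from its documented encoding (REX.W prefix `0x48`, opcode `0xB8` + register number, then the 64-bit little-endian immediate); `mov_rax_imm64` is an illustrative helper, not code from the linked repository:

```python
import struct

def mov_rax_imm64(value: int) -> bytes:
    """Encode `mov rax, imm64`:
    REX.W (0x48) + opcode 0xB8 (+rd; rax is register 0) + imm64 LE."""
    return b"\x48\xb8" + struct.pack("<Q", value)

# Example: load the constant 5 into rax, then `ret` (0xC3).
snippet = mov_rax_imm64(5) + b"\xc3"
```

Generalizing this to arbitrary registers and operand sizes is where the bitmask/prefix machinery (REX bits, ModRM bytes) comes in, which is presumably the part the comment above describes as not yet done.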


