Is anyone using PyPy for real work?

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • preshed

    💥 Cython hash tables that assume keys are pre-hashed

  • If you have very large dicts, you might find this hash table I wrote for spaCy helpful: https://github.com/explosion/preshed . You need to key the data with 64-bit keys. We use this wrapper around murmurhash for it: https://github.com/explosion/murmurhash

    There are no docs, so this might not be for you. But the software does work, and it's efficient. It's been executed many, many millions of times by now.
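
    As a rough illustration (a minimal sketch, not from the original comment): PreshMap from preshed.maps exposes dict-style access keyed by 64-bit integers. Whether murmurhash exposes a Python-level hash() helper depends on the release, and deriving a full 64-bit key from it is my assumption here, so treat the hashing step as hypothetical:

        # Sketch: a PreshMap keyed by pre-hashed 64-bit integer keys.
        from preshed.maps import PreshMap
        import murmurhash

        table = PreshMap()

        def key64(s: str) -> int:
            # murmurhash.hash() returns a 32-bit value in the Python API
            # (assumption); masking to uint64 is a hypothetical stand-in
            # for a proper 64-bit key derivation -- see spaCy's usage.
            return murmurhash.hash(s) & 0xFFFF_FFFF_FFFF_FFFF

        table[key64("apple")] = 1   # values are stored as integers
        table[key64("banana")] = 2
        assert table[key64("apple")] == 1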

  • murmurhash

    💥 Cython bindings for MurmurHash2 (by explosion)

  • Numba

    NumPy aware dynamic Python compiler using LLVM

  • Simulations are, at least in my experience, numba’s [0] wheelhouse.

    [0]: https://numba.pydata.org/
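
    To make the "wheelhouse" claim concrete, a minimal sketch (my own, with a made-up toy simulation): decorating a tight numeric loop with numba's @njit compiles it to machine code via LLVM on the first call, which is exactly the shape of workload simulations tend to have:

        # Sketch: numba JIT-compiling a toy numeric simulation loop.
        from numba import njit

        @njit
        def fall(steps, dt):
            # Tight scalar loop: the ideal target for numba's JIT.
            x, v = 0.0, 0.0
            for _ in range(steps):
                v -= 9.81 * dt
                x += v * dt
            return x

        fall(10, 0.01)                # first call triggers compilation
        print(fall(1_000_000, 1e-6))  # later calls run as machine code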

  • python-mysql-replication

    Pure Python implementation of the MySQL replication protocol, built on top of PyMySQL

  • I'm maintaining an internal change-data-capture application that uses a Python library to decode the MySQL binlog and store the change records as JSON in the data lake (like Debezium). For our busiest databases, a single CPython process couldn't process the amount of incoming changes in real time (thousands of events per second). It's not something that can be easily parallelized, as the bulk of the work happens in the binlog decoding library (https://github.com/julien-duponchelle/python-mysql-replicati...).

    So we made it configurable to run some instances with PyPy, which was able to work through the data in real time, i.e. without generating a lag in the data stream. The downside of using PyPy was increased memory usage (4-8x), which isn't really a problem. An actual problem that I never tracked down was that the test suite (running pytest) took 2-3 times longer with PyPy than with CPython.

    A few months ago I upgraded the system to run on CPython 3.11, and the 10-20% performance improvements that come with that version allowed us to drop PyPy and run only CPython, which is more convenient and makes the deployment and configuration less complex.
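
    For context, a minimal sketch of the kind of binlog-decoding loop described above, using python-mysql-replication's BinLogStreamReader as shown in its README (the connection settings and server_id here are hypothetical):

        # Sketch: stream MySQL row events and emit them as JSON records.
        import json
        from pymysqlreplication import BinLogStreamReader
        from pymysqlreplication.row_event import (
            DeleteRowsEvent, UpdateRowsEvent, WriteRowsEvent,
        )

        stream = BinLogStreamReader(
            connection_settings={"host": "127.0.0.1", "port": 3306,
                                 "user": "repl", "passwd": "secret"},
            server_id=100,        # must be unique among replicas
            blocking=True,        # wait for new binlog events
            resume_stream=True,
            only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
        )

        for event in stream:      # the pure-Python hot loop PyPy speeds up
            for row in event.rows:
                record = {"schema": event.schema, "table": event.table,
                          "type": type(event).__name__, "row": row}
                print(json.dumps(record, default=str))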

  • legion

    The Legion Parallel Programming System (by StanfordLegion)

  • We use PyPy for performing verification of our software stack [1], and also for profiling tools [2]. The verification tool is basically a complete reimplementation of our main product, and therefore encodes a massive amount of business logic (making it difficult, if not impossible, to rewrite in another language). As with other users, we found the switch to PyPy seamless; it gives us something like a 2.5x speedup out of the box, with (I think) higher speedups in some specific cases.

    We eventually rewrote the profiler tool in Rust for additional speedups, but as mentioned, the verification engine is probably too complicated to ever rewrite that way, so we really appreciate drop-in tools like PyPy that can speed up our code.

    [1]: https://github.com/StanfordLegion/legion/blob/master/tools/l...

    [2]: https://github.com/StanfordLegion/legion/blob/master/tools/l...

  • psycopg2cffi

    Port to cffi with some speed improvements

  • The only compatibility issue I've run into is database drivers.

    For PostgreSQL, psycopg2 is not supported. psycopg2cffi is largely unmaintained, and the 2.9.0 release on PyPI lacks some newer psycopg2 features: the `psycopg2.sql` module is missing, and empty result sets raise a RuntimeError on Python 3.7+. The latest commit on GitHub does include these fixes [1]. Psycopg 3 [2] and pg8000 [3] (as user tlocke mentioned elsewhere) are viable alternatives, provided you aren't stuck with older versions of PostgreSQL. I'm going to continue using psycopg2cffi until I can upgrade an old PostgreSQL 9.4 database.

    For Microsoft SQL Server, pymssql does not support PyPy [4]. It's under new maintainership, so it might gain support in the future. pypyodbc hasn't had any activity since 2022 and has had no new PyPI release since 2021 [5]. The datatypes returned can differ between libodbc1 versions; on Ubuntu 18.04 in particular, empty string columns are returned as a single space and integer columns as a Decimal. Also, if you encounter a mysterious HY010 error ("Function sequence error"), you may need to upgrade libodbc1 from v2.3.4 to v2.3.7+ using the Microsoft repos.

    [1]: https://github.com/chtd/psycopg2cffi
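
    Worth noting as a mitigation: psycopg2cffi ships a compatibility shim (documented in its README) so the rest of a codebase can keep importing psycopg2 unchanged; the interpreter check below is my own addition:

        # Sketch: register psycopg2cffi under the name "psycopg2" on PyPy.
        import platform

        if platform.python_implementation() == "PyPy":
            from psycopg2cffi import compat
            compat.register()  # "import psycopg2" now resolves to psycopg2cffi

        import psycopg2  # CPython: the C extension; PyPy: psycopg2cffi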

  • pymssql

    Official home for the pymssql source code.

  • gevent

    Coroutine-based concurrency library for Python

  • A sub-question for the folks here: is anyone using the combination of gevent and PyPy for a production application? Or, more generally, other libraries that do deep monkey-patching across the Python standard library?

    Things like https://github.com/gevent/gevent/issues/676 and the fix at https://github.com/gevent/gevent/commit/f466ec51ea74755c5bee... indicate to me that there are subtleties in how PyPy's memory management interacts with low-level tweaks like gevent's, which have relied on often-implicit historical assumptions about memory-management timing.

    Not sure if this is limited to gevent, either - other libraries like Sentry, NewRelic, and OpenTelemetry also have low-level monkey-patched hooks, and it's unclear whether they're low-level enough that they might run into similar issues.

    For a stack without any monkey-patching I'd be overjoyed to use PyPy - but between gevent and these monitoring tools, practically every project needs at least some monkey-patching, and I think that there's a lack of clarity on how battle-tested PyPy is with tools like these.
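
    For readers unfamiliar with how deep that patching goes, a minimal sketch of gevent's documented entry point: monkey.patch_all() swaps blocking stdlib primitives (sockets, threading, time.sleep) for cooperative versions at import time, which is exactly the kind of low-level rewiring that can trip over interpreter-specific memory-management behavior:

        # Sketch: gevent stdlib monkey-patching plus a cooperative socket call.
        from gevent import monkey
        monkey.patch_all()  # must run before anything else imports socket etc.

        import socket
        import gevent

        def probe(host):
            # After patch_all(), this blocking connect yields to the event
            # loop instead of stalling the whole process.
            conn = socket.create_connection((host, 80), timeout=5)
            conn.close()
            return host

        jobs = [gevent.spawn(probe, h) for h in ("example.com", "example.org")]
        gevent.joinall(jobs, timeout=10)
        print([job.value for job in jobs])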

  • dockerfiles

    docker files to reproduce environments (by tgbugs)

  • See for example [3].

    Thanks to you and the whole PyPy team!

    1. https://github.com/tgbugs/dockerfiles/blob/6f4ad5d873b7ab267...

  • sparc-curation

    code and files for SPARC curation workflows

  • 3. https://github.com/SciCrunch/sparc-curation/blob/0fdf393e26f...

  • direnv

    unclutter your .profile

  • Given you'll want to activate a virtual environment for most Python projects, and projects live in directories, I find myself constantly reaching for direnv. https://github.com/direnv/direnv/wiki/Python

        printf 'layout python\npip install --upgrade pip pip-tools setuptools wheel\npip-sync\n' > .envrc
        direnv allow  # direnv won't load a new .envrc until it's approved

  • Pyjion

    Pyjion - A JIT for Python based upon CoreCLR (by tonybaloney)

  • I've actually come across and started using Pyjion recently (https://github.com/tonybaloney/pyjion); how does PyPy compare, both in terms of performance and purpose? There seems to be a lot of overlap...
