Is anyone using PyPy for real work?

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • preshed

    💥 Cython hash tables that assume keys are pre-hashed

  • If you have very large dicts, you might find this hash table I wrote for spaCy helpful: https://github.com/explosion/preshed . You need to key the data with 64-bit keys. We use this wrapper around murmurhash for it: https://github.com/explosion/murmurhash

    There are no docs, so this might not be for you. But the software does work, and it's efficient. It's been executed many, many millions of times by now.
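
    As a rough illustration (a minimal sketch, not from the original comment): PreshMap from preshed.maps exposes dict-style access keyed by 64-bit integers. Whether murmurhash exposes a Python-level hash() helper depends on the release, and deriving a full 64-bit key from it is my assumption here, so treat the hashing step as hypothetical:

        # Sketch: a PreshMap keyed by pre-hashed 64-bit integer keys.
        from preshed.maps import PreshMap
        import murmurhash

        table = PreshMap()

        def key64(s: str) -> int:
            # murmurhash.hash() returns a 32-bit value in the Python API
            # (assumption); masking to uint64 is a hypothetical stand-in
            # for a proper 64-bit key derivation -- see spaCy's usage.
            return murmurhash.hash(s) & 0xFFFF_FFFF_FFFF_FFFF

        table[key64("apple")] = 1   # values are stored as integers
        table[key64("banana")] = 2
        assert table[key64("apple")] == 1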

  • murmurhash

    💥 Cython bindings for MurmurHash2 (by explosion)

  • Numba

    NumPy aware dynamic Python compiler using LLVM

  • Simulations are, at least in my experience, numba’s [0] wheelhouse.

    [0]: https://numba.pydata.org/
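
    To make the "wheelhouse" claim concrete, a minimal sketch (my own, with a made-up toy simulation): decorating a tight numeric loop with numba's @njit compiles it to machine code via LLVM on the first call, which is exactly the shape of workload simulations tend to have:

        # Sketch: numba JIT-compiling a toy numeric simulation loop.
        from numba import njit

        @njit
        def fall(steps, dt):
            # Tight scalar loop: the ideal target for numba's JIT.
            x, v = 0.0, 0.0
            for _ in range(steps):
                v -= 9.81 * dt
                x += v * dt
            return x

        fall(10, 0.01)                # first call triggers compilation
        print(fall(1_000_000, 1e-6))  # later calls run as machine code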

  • python-mysql-replication

    Pure Python implementation of the MySQL replication protocol, built on top of PyMySQL

  • I'm maintaining an internal change-data-capture application that uses a Python library to decode the MySQL binlog and store the change records as JSON in the data lake (like Debezium). For our busiest databases, a single CPython process couldn't process the amount of incoming changes in real time (thousands of events per second). It's not something that can be easily parallelized, as the bulk of the work happens in the binlog decoding library (https://github.com/julien-duponchelle/python-mysql-replicati...).

    So we made it configurable to run some instances with PyPy, which was able to work through the data in real time, i.e. without generating a lag in the data stream. The downside of using PyPy was increased memory usage (4-8x), which isn't really a problem. An actual problem that I never tracked down was that the test suite (running pytest) took 2-3 times longer with PyPy than with CPython.

    A few months ago I upgraded the system to run on CPython 3.11, and the 10-20% performance improvements that come with that version allowed us to drop PyPy and run only CPython, which is more convenient and makes the deployment and configuration less complex.
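
    For context, a minimal sketch of the kind of binlog-decoding loop described above, using python-mysql-replication's BinLogStreamReader as shown in its README (the connection settings and server_id here are hypothetical):

        # Sketch: stream MySQL row events and emit them as JSON records.
        import json
        from pymysqlreplication import BinLogStreamReader
        from pymysqlreplication.row_event import (
            DeleteRowsEvent, UpdateRowsEvent, WriteRowsEvent,
        )

        stream = BinLogStreamReader(
            connection_settings={"host": "127.0.0.1", "port": 3306,
                                 "user": "repl", "passwd": "secret"},
            server_id=100,        # must be unique among replicas
            blocking=True,        # wait for new binlog events
            resume_stream=True,
            only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
        )

        for event in stream:      # the pure-Python hot loop PyPy speeds up
            for row in event.rows:
                record = {"schema": event.schema, "table": event.table,
                          "type": type(event).__name__, "row": row}
                print(json.dumps(record, default=str))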

  • legion

    The Legion Parallel Programming System (by StanfordLegion)

  • We use PyPy for performing verification of our software stack [1], and also for profiling tools [2]. The verification tool is basically a complete reimplementation of our main product, and therefore encodes a massive amount of business logic (making it difficult, if not impossible, to rewrite in another language). As with other users, we found the switch to PyPy seamless; it gives us something like a 2.5x speedup out of the box, with (I think) higher speedups in some specific cases.

    We eventually rewrote the profiler tool in Rust for additional speedups, but as mentioned, the verification engine is probably too complicated to ever rewrite that way, so we really appreciate drop-in tools like PyPy that can speed up our code.

    [1]: https://github.com/StanfordLegion/legion/blob/master/tools/l...

    [2]: https://github.com/StanfordLegion/legion/blob/master/tools/l...

  • psycopg2cffi

    Port to cffi with some speed improvements

  • The only compatibility issue I've run into is database drivers.

    For PostgreSQL, psycopg2 is not supported. psycopg2cffi is largely unmaintained, and the 2.9.0 release on PyPI lacks some newer psycopg2 features: the `psycopg2.sql` module is missing, and empty result sets raise a RuntimeError on Python 3.7+. The latest commit on GitHub does include these fixes [1]. Psycopg 3 [2] and pg8000 [3] (as user tlocke mentioned elsewhere) are viable alternatives, provided you aren't stuck with older versions of PostgreSQL. I'm going to continue using psycopg2cffi until I can upgrade an old PostgreSQL 9.4 database.

    For Microsoft SQL Server, pymssql does not support PyPy [4]. It's under new maintainership, so it might gain support in the future. pypyodbc hasn't had any activity since 2022 and has had no new PyPI release since 2021 [5]. The datatypes returned can differ between libodbc1 versions; on Ubuntu 18.04 in particular, empty string columns are returned as a single space and integer columns as a Decimal. Also, if you encounter a mysterious HY010 error ("Function sequence error"), you may need to upgrade libodbc1 from v2.3.4 to v2.3.7+ using the Microsoft repos.

    [1]: https://github.com/chtd/psycopg2cffi
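
    Worth noting as a mitigation: psycopg2cffi ships a compatibility shim (documented in its README) so the rest of a codebase can keep importing psycopg2 unchanged; the interpreter check below is my own addition:

        # Sketch: register psycopg2cffi under the name "psycopg2" on PyPy.
        import platform

        if platform.python_implementation() == "PyPy":
            from psycopg2cffi import compat
            compat.register()  # "import psycopg2" now resolves to psycopg2cffi

        import psycopg2  # CPython: the C extension; PyPy: psycopg2cffi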

  • pymssql

    Official home for the pymssql source code.

  • gevent

    Coroutine-based concurrency library for Python

  • A sub-question for the folks here: is anyone using the combination of gevent and PyPy for a production application? Or, more generally, other libraries that do deep monkey-patching across the Python standard library?

    Things like https://github.com/gevent/gevent/issues/676 and the fix at https://github.com/gevent/gevent/commit/f466ec51ea74755c5bee... indicate to me that there are subtleties in how PyPy's memory management interacts with low-level tweaks like gevent's, which have relied on often-implicit historical assumptions about memory-management timing.

    Not sure if this is limited to gevent, either - other libraries like Sentry, NewRelic, and OpenTelemetry also have low-level monkey-patched hooks, and it's unclear whether they're low-level enough that they might run into similar issues.

    For a stack without any monkey-patching I'd be overjoyed to use PyPy - but between gevent and these monitoring tools, practically every project needs at least some monkey-patching, and I think that there's a lack of clarity on how battle-tested PyPy is with tools like these.
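
    For readers unfamiliar with how deep that patching goes, a minimal sketch of gevent's documented entry point: monkey.patch_all() swaps blocking stdlib primitives (sockets, threading, time.sleep) for cooperative versions at import time, which is exactly the kind of low-level rewiring that can trip over interpreter-specific memory-management behavior:

        # Sketch: gevent stdlib monkey-patching plus a cooperative socket call.
        from gevent import monkey
        monkey.patch_all()  # must run before anything else imports socket etc.

        import socket
        import gevent

        def probe(host):
            # After patch_all(), this blocking connect yields to the event
            # loop instead of stalling the whole process.
            conn = socket.create_connection((host, 80), timeout=5)
            conn.close()
            return host

        jobs = [gevent.spawn(probe, h) for h in ("example.com", "example.org")]
        gevent.joinall(jobs, timeout=10)
        print([job.value for job in jobs])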

  • dockerfiles

    docker files to reproduce environments (by tgbugs)

  • See for example [3].

    Thanks to you and the whole PyPy team!

    1. https://github.com/tgbugs/dockerfiles/blob/6f4ad5d873b7ab267...

  • sparc-curation

    code and files for SPARC curation workflows

  • 3. https://github.com/SciCrunch/sparc-curation/blob/0fdf393e26f...

  • direnv

    unclutter your .profile

  • Given you'll want to activate a virtual environment for most Python projects, and projects live in directories, I find myself constantly reaching for direnv. https://github.com/direnv/direnv/wiki/Python

        printf 'layout python\npip install --upgrade pip pip-tools setuptools wheel\npip-sync\n' > .envrc
        direnv allow  # direnv won't load a new .envrc until it's approved

  • Pyjion

    Pyjion - A JIT for Python based upon CoreCLR (by tonybaloney)

  • I've actually come across and started using Pyjion recently (https://github.com/tonybaloney/pyjion); how does PyPy compare, both in terms of performance and purpose? There seems to be a lot of overlap...
