Show HN: Up to 100x Faster FastAPI with simdjson and io_uring on Linux 5.19

This page summarizes the projects mentioned and recommended in the original post on news.ycombinator.com

  • ucall

    Remote Procedure Calls - 50x lower latency and 70x higher bandwidth than FastAPI, implementing JSON-RPC & 🔜 REST over io_uring and SIMDJSON ☎️

    You are right! For the convenience of Python users, we have to introspect the messages and parse JSON into Python objects, with every member of every dictionary allocated on the heap.

    To make it as fast as possible, we don't use PyBind, NanoBind, SWIG, or any high-level tooling. Our Python bindings are a pure CPython integration. As far as I know, there is no way to beat that combo.

    https://github.com/unum-cloud/ujrpc/blob/main/src/python.c
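For readers unfamiliar with the protocol, here is a minimal sketch of a JSON-RPC 2.0 request handler in plain Python. It is not ucall's API: the `METHODS` registry and the `handle` function are hypothetical names used only for illustration. It shows the step the comment above describes, where `json.loads` turns the incoming message into heap-allocated Python objects before the method is dispatched.

```python
import json

# Hypothetical registry of callable methods; names are illustrative only.
METHODS = {"sum": lambda a, b: a + b}


def error(code, message, request_id=None):
    """Serialize a JSON-RPC 2.0 error response."""
    body = {"jsonrpc": "2.0", "error": {"code": code, "message": message}, "id": request_id}
    return json.dumps(body)


def handle(raw: str) -> str:
    """Parse one JSON-RPC 2.0 request and return the serialized response."""
    try:
        # Every dict, string, and number in the message becomes a Python object here.
        req = json.loads(raw)
    except json.JSONDecodeError:
        return error(-32700, "Parse error")
    if not isinstance(req, dict):
        return error(-32600, "Invalid Request")
    func = METHODS.get(req.get("method"))
    if func is None:
        return error(-32601, "Method not found", req.get("id"))
    params = req.get("params", {})
    # JSON-RPC allows parameters by name (object) or by position (array).
    result = func(**params) if isinstance(params, dict) else func(*params)
    return json.dumps({"jsonrpc": "2.0", "result": result, "id": req.get("id")})


print(handle('{"jsonrpc": "2.0", "method": "sum", "params": {"a": 2, "b": 3}, "id": 1}'))
```

A production server would also validate parameters and handle batches and notifications; this sketch omits those for brevity.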

  • ustore

    Multi-Modal Database replacing MongoDB, Neo4J, and Elastic with 1 faster ACID solution, with NetworkX and Pandas interfaces, and bindings for C 99, C++ 17, Python 3, Java, GoLang 🗄️

    Yes, we also constantly think about that! In the document collections of UKV, for example, we have interoperability between JSON, BSON, and MessagePack objects [1]. CSV is another potential option, but text-based formats aren't ideal for large scale transmissions.

    One thing people do is use two protocols. That is the case with Apache Arrow Flight RPC: gRPC for tasks, Arrow for data. It is a viable path, but compiling gRPC is a nightmare, and we don't want to integrate it into our other libraries, as we generally compile everything from sources. Seemingly, UJRPC can replace gRPC, and for the payload we can continue using Arrow. We will see :)

    [1]: https://github.com/unum-cloud/ukv/blob/main/src/modality_doc...

  • simdjson

    Parsing gigabytes of JSON per second : used by Facebook/Meta Velox, the Node.js runtime, WatermelonDB, Apache Doris, Milvus, StarRocks

    I'm talking about doing this with simdjson specifically. Lemire suggested that reading line-by-line is not a good idea [0]. So I'm asking about the ideal approach using simdjson, not JSON parsers in general.

    [0] https://github.com/simdjson/simdjson/issues/188#issuecomment...

  • Apache Arrow

    Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing

    If anything, you'd probably want to send it in Arrow[1] format. CSVs don't even preserve data types.

    [1]: https://arrow.apache.org/
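The point about CSV losing data types can be seen with a small round trip through Python's standard `csv` module. This is only an illustrative sketch with made-up sample data; it does not use Arrow.

```python
import csv
import io

# Made-up sample record with an int, a float, and a bool.
rows = [{"id": 1, "price": 9.5, "active": True}]

# Write the record to an in-memory CSV file.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "price", "active"])
writer.writeheader()
writer.writerows(rows)

# Read it back: every value is now a plain string.
buf.seek(0)
restored = list(csv.DictReader(buf))

for key, value in restored[0].items():
    print(key, repr(value), type(value).__name__)
```

After the round trip the reader has to guess or be told that `"1"` was an integer and `"True"` a boolean, which is the information a typed format carries in its schema.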

  • yyjson

    The fastest JSON library in C

    How does yyjson[0] compare to simdjson? Their benchmarks suggest it could be a positive.

    [0] https://github.com/ibireme/yyjson

  • japronto

    Screaming-fast Python 3.5+ HTTP toolkit integrated with pipelining HTTP server based on uvloop and picohttpparser.

    100x faster than FastAPI seems easy. I wonder how it compares to other fast Python libraries like Japronto[1] and non-Python ones too.

    1 - https://github.com/squeaky-pl/japronto

  • data-analysis

    Absolutely interested, on my end at least. I wrote this to manage the transparency in coverage files: https://github.com/dolthub/data-analysis/tree/main/transpare... but I'm always looking for better techniques.

    Oh wow, I see you used it on those exact files. How about that.

  • FrameworkBenchmarks

    Source for the TechEmpower Framework Benchmarks project

    How large? Also I'm not sure the gRPC C++ server implementations you've tested are the fastest. If you're comparing to FastAPI (which is more of an HTTP server framework) then you should also compare to what is at the top of https://www.techempower.com/benchmarks/#section=data-r21.

  • is2

    embedded RESTy http(s) server library from Edgio

    I used the rapidjson streams with my little embedded REST HTTP(s) server library: https://github.com/Edgio/is2/

  • Cap'n Proto

    Cap'n Proto serialization/RPC system - core tools and C++ library

    Are there any synergies with capnproto [1] or is the focus here purely on huge payloads?

    I'm just an interested hobbyist when it comes to performant RPC frameworks but had some fun benchmarking capnproto for a small gamedev-related project and it was pretty awesome.

    [1] https://capnproto.org/

  • simdjson-go

    Golang port of simdjson: parsing gigabytes of JSON per second

    Speaking of Go, there's a simdjson implementation for golang too:

    > Performance wise, simdjson-go runs on average at about 40% to 60% of the speed of simdjson. Compared to Golang's standard package encoding/json, simdjson-go is about 10x faster.

    I haven't tried it yet but I don't really need that speed.

    https://github.com/minio/simdjson-go

  • jsplit

    A Go program to split large JSON files into many jsonl files

    Regarding the hard way, this little utility does a great job of splitting larger than memory JSON documents into collections of NDJSON files:

    https://github.com/dolthub/jsplit
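As a rough illustration of the output such a split produces, the sketch below turns a top-level JSON array into NDJSON blocks with a bounded number of records each. It is not how jsplit works internally: the function name `ndjson_chunks` is hypothetical, and for brevity the sample document is loaded fully into memory, which a larger-than-memory file would not allow. For real inputs of that size the `items` iterable would have to come from a streaming parser.

```python
import json
from itertools import islice
from typing import Iterable, Iterator


def ndjson_chunks(items: Iterable, lines_per_chunk: int) -> Iterator[str]:
    """Yield NDJSON text blocks holding at most lines_per_chunk records each."""
    it = iter(items)
    while True:
        batch = list(islice(it, lines_per_chunk))
        if not batch:
            return
        # One compact JSON value per line, newline-terminated.
        yield "".join(json.dumps(item, separators=(",", ":")) + "\n" for item in batch)


# In-memory sample for brevity; a real run would stream the records instead.
document = json.loads('[{"a": 1}, {"a": 2}, {"a": 3}]')
chunks = list(ndjson_chunks(document, 2))

for index, chunk in enumerate(chunks):
    print(f"--- chunk {index} ---")
    print(chunk, end="")
```

Each yielded block could be written to its own `.jsonl` file; because the function only consumes an iterator, memory use is bounded by the chunk size rather than by the document size.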

  • json-buffet

    Ha! Thanks to you, today I found out how big those uncompressed JSON files really are (the data wasn't accessible to me, so I shared the tool with my colleague and he was the one who ran the queries on his laptop): https://www.dolthub.com/blog/2022-09-02-a-trillion-prices/ .

    And yep, it was more or less the way you did it with ijson. I found ijson just a day after I finished the prototype. Rapidjson would probably be faster, especially after enabling SIMD. But the indexing was a one-time thing.

    We have open sourced the codebase. Here's the link: https://github.com/multiversal-ventures/json-buffet . Since this was a quick and dirty prototype, comments were sparse. I have updated the Readme, and added a sample json-fetcher. Hope this is more useful for you.

    Another unwritten TODO was to nudge the data providers towards more streaming-friendly compression formats - and then just create an index to fetch the data directly from their compressed archives. That would have saved everyone a LOT of $$$.

  • msgspec

    A fast serialization and validation library, with builtin support for JSON, MessagePack, YAML, and TOML

    If you're primarily targeting Python as an application layer, you may also want to check out my msgspec library[1]. All the perf benefits of e.g. yyjson, but with schema validation like pydantic. It regularly benchmarks[2] as the fastest JSON library for Python. Much of the overhead of decoding JSON -> Python comes from the python layer, and msgspec employs every trick I know to minimize that overhead.

    [1]: https://github.com/jcrist/msgspec

    [2]: https://github.com/TkTech/json_benchmark

  • json_benchmark

    Python JSON benchmarking and correctness.

  • typedload

    Python library to load dynamically typed data into statically typed data structures

    Author of typedload here!

    FastAPI relies on (not so fast) pydantic, which is one of the slowest libraries in that category.

    Don't expect to find such benchmarks on the pydantic documentation itself, but the competing libraries will have them.

    [0] https://ltworf.github.io/typedload/
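To make "loading dynamically typed data into statically typed data structures" concrete, here is a deliberately tiny sketch using only standard-library dataclasses. It is not typedload's or pydantic's API: the `load` helper and the `User` class are hypothetical, and the sketch handles only flat records with simple field types.

```python
from dataclasses import dataclass, fields


@dataclass
class User:
    name: str
    age: int


def load(data: dict, cls):
    """Build cls from a plain dict, rejecting values of the wrong type."""
    kwargs = {}
    for field in fields(cls):
        value = data[field.name]  # a missing key raises KeyError
        if not isinstance(value, field.type):
            raise TypeError(
                f"{field.name}: expected {field.type.__name__}, got {type(value).__name__}"
            )
        kwargs[field.name] = value
    return cls(**kwargs)


print(load({"name": "Ada", "age": 36}, User))
```

Real libraries in this category also handle nested types, unions, and optional fields, and the per-field checks are where much of the performance difference between them comes from.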

  • zsv

    zsv+lib: world's fastest (simd) CSV parser, bare metal or wasm, with an extensible CLI for SQL querying, format conversion and more

    Parsing CSV doesn't have to be slow if you use something like xsv or zsv (https://github.com/liquidaty/zsv) (disclaimer: I'm an author). The speed of CSV parsers is fast enough that unless you are doing something ultra-trivial such as "count rows", your bottleneck will be elsewhere.

    The benefits of CSV are:

    - human readable

    - does not need to be typed (sometimes, data in the raw such as date-formatted data is not amenable to typing without introducing a pre-processing layer that gets you further from the original data)

    - accessible to anyone: you don't need to be a data person to dbl-click and open in Excel or similar

    The main drawback is that if your data is already typed, CSV does not communicate what the type is. You can alleviate this through various approaches such as is described at https://github.com/liquidaty/zsv/blob/main/docs/csv_json_sql..., though I wouldn't disagree that if you can be assured that your starting data conforms to non-text data types, there are probably better formats than CSV.

    The main benefit of Arrow, IMHO, is less as a format for transmitting / communicating and more as a format for data at rest that would benefit from higher-performance column-based reads and compression.
