-
ucall
Web Serving and Remote Procedure Calls at 50x lower latency and 70x higher bandwidth than FastAPI, implementing JSON-RPC & REST over io_uring ☎️
You are right! For the convenience of Python users, we have to introspect the messages and parse JSON into Python objects, with every member of every dictionary allocated on the heap.
To make it as fast as possible we don't use pybind11, nanobind, SWIG, or any other high-level tooling. Our Python bindings are a pure CPython integration. There is just no way to beat that combo, not that I know of.
https://github.com/unum-cloud/ujrpc/blob/main/src/python.c
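Since UJRPC speaks standard JSON-RPC 2.0, any HTTP client can talk to it without special bindings. A minimal sketch with `requests`, assuming a server that accepts JSON-RPC over plain HTTP POST and exposes a `sum` method on port 8545 (the method name and port are placeholders):

```python
import requests

# Standard JSON-RPC 2.0 envelope; "sum" and port 8545 are hypothetical.
payload = {
    "jsonrpc": "2.0",
    "method": "sum",
    "params": {"a": 2, "b": 3},
    "id": 1,
}

response = requests.post("http://127.0.0.1:8545/", json=payload, timeout=5)
print(response.json())  # expected shape: {"jsonrpc": "2.0", "id": 1, "result": 5}
```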
-
ustore
Multi-Modal Database replacing MongoDB, Neo4J, and Elastic with 1 faster ACID solution, with NetworkX and Pandas interfaces, and bindings for C 99, C++ 17, Python 3, Java, GoLang 🗄️
Yes, we also constantly think about that! In the document collections of UKV, for example, we have interoperability between JSON, BSON, and MessagePack objects [1]. CSV is another potential option, but text-based formats aren't ideal for large-scale transfers.
One thing people do is use two protocols. That is the case with Apache Arrow Flight RPC: gRPC for tasks, Arrow for data. It is a viable path, but compiling gRPC is a nightmare, and we don't want to integrate it into our other libraries, as we generally compile everything from source. It seems UJRPC could replace gRPC, and for the payload we can keep using Arrow. We will see :)
[1]: https://github.com/unum-cloud/ukv/blob/main/src/modality_doc...
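For context, that two-protocol split looks like this from Python with pyarrow's Flight client: gRPC carries the call, Arrow carries the payload. The endpoint and ticket name below are placeholders:

```python
import pyarrow.flight as flight

# Connect to a Flight server; "grpc://localhost:8815" and the "trades"
# ticket are placeholder names for this sketch.
client = flight.connect("grpc://localhost:8815")

reader = client.do_get(flight.Ticket(b"trades"))
table = reader.read_all()  # an Arrow table streamed as record batches
print(table.schema)
```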
-
simdjson
Parsing gigabytes of JSON per second: used by Facebook/Meta Velox, the Node.js runtime, ClickHouse, WatermelonDB, Apache Doris, Milvus, StarRocks
I'm asking about simdjson specifically. Lemire suggested that reading line-by-line is not a good idea [0], so I'm asking about the ideal approach with simdjson, not JSON parsers in general.
[0] https://github.com/simdjson/simdjson/issues/188#issuecomment...
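For reference, the advice in that thread amounts to not paying parser setup costs per line: in C++ the canonical route for NDJSON is `parse_many`, and with the pysimdjson bindings the closest equivalent I know of is reusing one `Parser` and extracting what you need before parsing the next document. A rough sketch, where `records.ndjson` and the `amount` field are made up:

```python
import simdjson  # the pysimdjson bindings

parser = simdjson.Parser()  # reused across lines so internal buffers are recycled

total = 0
with open("records.ndjson", "rb") as f:  # hypothetical NDJSON input
    for line in f:
        doc = parser.parse(line)   # valid only until the next parse() call
        total += doc["amount"]     # pull out what you need immediately
print(total)
```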
-
Apache Arrow
Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
If anything you'd probably want to send it in Arrow [1] format. CSVs don't even preserve data types.
[1]: https://arrow.apache.org/
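To make the type-preservation point concrete, here's a small pyarrow sketch: the same table round-tripped through the Arrow IPC stream format keeps its schema, whereas a CSV round trip would have to re-infer it. Column names and values are invented:

```python
from datetime import datetime
import pyarrow as pa

# A table with a float and a timestamp column (made-up data).
table = pa.table({
    "price": [9.99, 12.50],
    "ts": [datetime(2023, 6, 29, 12, 0), datetime(2023, 6, 29, 12, 1)],
})

# Round-trip through the Arrow IPC stream format.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
restored = pa.ipc.open_stream(sink.getvalue()).read_all()

print(restored.schema)  # float64 and timestamp types survive; CSV would flatten both to text
```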
-
How does yyjson [0] compare to simdjson? Its benchmarks suggest it could come out ahead.
[0] https://github.com/ibireme/yyjson
-
japronto
Screaming-fast Python 3.5+ HTTP toolkit integrated with pipelining HTTP server based on uvloop and picohttpparser.
100x faster than FastAPI seems easy. I wonder how it compares to other fast Python libraries like Japronto[1] and non-Python ones too.
1 - https://github.com/squeaky-pl/japronto
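For anyone curious what the Japronto side looks like, its hello-world is roughly the following (adapted from my reading of its README, so treat the exact API as approximate):

```python
from japronto import Application

# A view takes the request and returns a Response object.
def hello(request):
    return request.Response(text="Hello world!")

app = Application()
app.router.add_route("/", hello)
app.run(debug=True)
```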
-
Absolutely interested, on my end at least. I wrote this to manage the transparency-in-coverage files: https://github.com/dolthub/data-analysis/tree/main/transpare... but I'm always looking for better techniques.
Oh wow, I see you used it on those exact files. How about that.
-
How large? Also I'm not sure the gRPC C++ server implementations you've tested are the fastest. If you're comparing to FastAPI (which is more of an HTTP server framework) then you should also compare to what is at the top of https://www.techempower.com/benchmarks/#section=data-r21.
-
I used rapidjson's streaming interface with my little embedded REST HTTP(S) server library: https://github.com/Edgio/is2/
-
Are there any synergies with capnproto [1] or is the focus here purely on huge payloads?
I'm just an interested hobbyist when it comes to performant RPC frameworks but had some fun benchmarking capnproto for a small gamedev-related project and it was pretty awesome.
[1] https://capnproto.org/
-
Speaking of Go, there's a simdjson port for it too:
> Performance wise, simdjson-go runs on average at about 40% to 60% of the speed of simdjson. Compared to Golang's standard package encoding/json, simdjson-go is about 10x faster.
I haven't tried it yet but I don't really need that speed.
https://github.com/minio/simdjson-go
-
Regarding the hard way, this little utility does a great job of splitting larger-than-memory JSON documents into collections of NDJSON files:
https://github.com/dolthub/jsplit
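jsplit is a Go tool, but the same idea is easy to sketch in Python with ijson's streaming parser. This assumes the oversized document is one top-level JSON array; the file names and chunk size are arbitrary, and `use_float` needs ijson >= 3.1:

```python
import json
import ijson  # streaming parser, reads the file incrementally

CHUNK = 100_000  # records per output file, arbitrary

with open("huge.json", "rb") as src:  # hypothetical larger-than-memory input
    out, count, part = None, 0, 0
    for record in ijson.items(src, "item", use_float=True):  # top-level array elements
        if count % CHUNK == 0:
            if out:
                out.close()
            part += 1
            out = open(f"part-{part:04d}.ndjson", "w")
        out.write(json.dumps(record) + "\n")  # one JSON document per line
        count += 1
    if out:
        out.close()
```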
-
Ha! Thanks to you, today I found out how big those uncompressed JSON files really are (the data wasn't accessible to me, so I shared the tool with my colleague and he was the one who ran the queries on his laptop): https://www.dolthub.com/blog/2022-09-02-a-trillion-prices/
And yep, it was more or less the way you did it with ijson. I found ijson just a day after I finished the prototype. RapidJSON would probably have been faster, especially with SIMD enabled, but the indexing was a one-time thing.
We have open-sourced the codebase. Here's the link: https://github.com/multiversal-ventures/json-buffet. Since this was a quick-and-dirty prototype, comments were sparse. I have updated the README and added a sample json-fetcher. Hope this is more useful for you.
Another unwritten TODO was to nudge the data providers towards more streaming-friendly compression formats, and then just create an index to fetch the data directly from their compressed archives. That would have saved everyone a LOT of $$$.
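For what it's worth, the "index into their compressed archives" idea boils down to plain HTTP range requests once the compression format allows random access; a toy sketch with requests, where the URL and byte offsets stand in for whatever a prebuilt index would return:

```python
import requests

# Offsets would come from a prebuilt index mapping keys -> byte ranges;
# the URL and numbers here are purely illustrative.
url = "https://example.com/prices-archive.zst"
start, end = 1_048_576, 2_097_151

resp = requests.get(url, headers={"Range": f"bytes={start}-{end}"}, timeout=30)
resp.raise_for_status()
chunk = resp.content  # only this slice travels over the wire, not the whole archive
print(len(chunk))
```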
-
msgspec
A fast serialization and validation library, with builtin support for JSON, MessagePack, YAML, and TOML
If you're primarily targeting Python as an application layer, you may also want to check out my msgspec library[1]. All the perf benefits of e.g. yyjson, but with schema validation like pydantic. It regularly benchmarks[2] as the fastest JSON library for Python. Much of the overhead of decoding JSON -> Python comes from the Python layer, and msgspec employs every trick I know to minimize that overhead.
[1]: https://github.com/jcrist/msgspec
[2]: https://github.com/TkTech/json_benchmark
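A minimal sketch of the decode-with-schema path described above; the struct fields are made up:

```python
import msgspec

class Trade(msgspec.Struct):
    symbol: str
    price: float
    size: int

data = b'{"symbol": "AAPL", "price": 189.25, "size": 100}'

# Parse and validate in one step; a type mismatch raises msgspec.ValidationError.
trade = msgspec.json.decode(data, type=Trade)
print(trade.symbol, trade.price)
```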
-
typedload
Discontinued Python library to load dynamically typed data into statically typed data structures
Author of typedload here!
FastAPI relies on (not-so-fast) pydantic, which is one of the slowest libraries in that category.
Don't expect to find such benchmarks in the pydantic documentation itself, but the competing libraries will have them [0].
[0] https://ltworf.github.io/typedload/
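For comparison, typedload's core call is just load(value, type), targeting plain dataclasses and NamedTuples rather than a special base class; a small sketch with made-up fields:

```python
from dataclasses import dataclass
import typedload

@dataclass
class User:
    name: str
    age: int

# Load already-parsed data (e.g. from json.loads) into a typed structure;
# mismatched types raise a typedload exception.
user = typedload.load({"name": "Ada", "age": 36}, User)
print(user)
```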
-
Parsing CSV doesn't have to be slow if you use something like xsv or zsv (https://github.com/liquidaty/zsv) (disclaimer: I'm an author). CSV parsers are fast enough that unless you are doing something ultra-trivial such as "count rows", your bottleneck will be elsewhere.
The benefits of CSV are:
- human readable
- does not need to be typed (sometimes raw data, such as date-formatted values, is not amenable to typing without introducing a pre-processing layer that gets you further from the original data)
- accessible to anyone: you don't need to be a data person to double-click and open it in Excel or similar
The main drawback is that if your data is already typed, CSV does not communicate what the type is. You can alleviate this through various approaches such as the one described at https://github.com/liquidaty/zsv/blob/main/docs/csv_json_sql..., though I wouldn't disagree that if you can be assured that your starting data conforms to non-text data types, there are probably better formats than CSV.
The main benefit of Arrow, IMHO, is less as a format for transmitting or communicating and more as a format for data at rest, which benefits from higher-performance column-based reads and compression.
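To make the "CSV doesn't carry types" point concrete, here's a small pyarrow sketch that reads a CSV with explicitly declared column types and then writes it to Parquet for at-rest storage; the file names and columns are invented:

```python
import pyarrow as pa
from pyarrow import csv
import pyarrow.parquet as pq

# CSV carries no type information, so we declare it at read time
# ("prices.csv" and its columns are made up for illustration).
convert = csv.ConvertOptions(column_types={
    "price": pa.float64(),
    "effective_date": pa.date32(),
})
table = csv.read_csv("prices.csv", convert_options=convert)

# For data at rest, a columnar format keeps both the types and the compression.
pq.write_table(table, "prices.parquet", compression="zstd")
```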