kafka-delta-ingest
polars
kafka-delta-ingest | polars | |
---|---|---|
6 | 144 | |
325 | 26,218 | |
3.4% | 2.9% | |
7.4 | 10.0 | |
18 days ago | 6 days ago | |
Rust | Rust | |
Apache License 2.0 | MIT License |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
kafka-delta-ingest
-
Using rust for DE activities?
Rust can offer incredible cost savings when you can use it in place of spark to interact with your delta lake. One such project was kafka-delta-ingest. The developers were able to reduce the cost of running the pipeline by over 90%. However, most of this stuff is still very experimental and not ready for production but you will definitely be seeing more projects like this just based on how much money can be saved.
-
Which lakehouse table format do you expect your organization will be using by the end of 2023?
This independence from a catalog allows for path based reads and writes. This is handy when writing from Kafka directly to Delta Lake for the first layer of ingestion. You don’t need a catalog (or even Spark). https://github.com/delta-io/kafka-delta-ingest/tree/main/src
-
Streaming Data and Postgres
As far as I know no. You certainly could use events on a streaming ledger like Kafka or Redpanda and then store to delta with https://github.com/delta-io/kafka-delta-ingest and process them with all the gis goodness of spark. However, this is fairly complicated and much different from a simple postgis drop in replacement. There are specialized meaning faster and more efficient systems out there for specialized tasks such as geo fencing in real-time
-
Rust is showing a lot of promise in the DataFrame / tabular data space
kafka-delta-ingest is a good project to get streaming data into a Delta Lake. Here's a great talk on the topic.
-
process millions of events per sec
What about https://github.com/delta-io/kafka-delta-ingest?
- Exactly once delivery from Kafka to Delta Lake with Rust
polars
-
Why Python's Integer Division Floors (2010)
This is because 0.1 is in actuality the floating point value value 0.1000000000000000055511151231257827021181583404541015625, and thus 1 divided by it is ever so slightly smaller than 10. Nevertheless, fpround(1 / fpround(1 / 10)) = 10 exactly.
I found out about this recently because in Polars I defined a // b for floats to be (a / b).floor(), which does return 10 for this computation. Since Python's correctly-rounded division is rather expensive, I chose to stick to this (more context: https://github.com/pola-rs/polars/issues/14596#issuecomment-...).
-
Polars
https://github.com/pola-rs/polars/releases/tag/py-0.19.0
-
Stuff I Learned during Hanukkah of Data 2023
That turned out to be related to pola-rs/polars#11912, and this linked comment provided a deceptively simple solution - use PARSE_DECLTYPES when creating the connection:
- Polars 0.20 Released
- Segunda linguagem
- Polars: Dataframes powered by a multithreaded query engine, written in Rust
- Summing columns in remote Parquet files using DuckDB
- Polars 0.34 is released. (A query engine focussing on DataFrame front ends)
What are some alternatives?
delta-rs - A native Rust library for Delta Lake, with bindings into Python
vaex - Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
dipa - dipa makes it easy to efficiently delta encode large Rust data structures.
modin - Modin: Scale your Pandas workflows by changing a single line of code
kafka-rust - Rust client for Apache Kafka
datafusion - Apache DataFusion SQL Query Engine
rust-rdkafka - A fully asynchronous, futures-based Kafka client library for Rust based on librdkafka
DataFrames.jl - In-memory tabular data in Julia
flowgger - A fast data collector in Rust
datatable - A Python package for manipulating 2-dimensional tabular data structures
arrow2 - Transmute-free Rust library to work with the Arrow format
Apache Arrow - Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing