Metric | arrow2 | duckdb
---|---|---
Mentions | 25 | 52
Stars | 1,071 | 16,902
Growth | - | 4.5%
Activity | 0.0 | 10.0
Last commit | 3 months ago | about 9 hours ago
Language | Rust | C++
License | Apache License 2.0 | MIT License
Stars - the number of stars that a project has on GitHub. Growth - month-over-month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
arrow2
-
Polars: Company Formation Announcement
One of the interesting components of Polars that I've been watching is the use of the Apache Arrow memory format, which is a standard layout for data in memory that enables processing (querying, iterating, calculating, etc.) in a language-agnostic way, in particular without having to copy/convert it into the local object format first. This enables cross-language data access by mmapping or transferring a single buffer, with zero [de]serialization overhead.
For some history, there has been a bit of contention between the official arrow-rs implementation and the arrow2 implementation created by the polars team, which includes some extra features that they find important. I think the current status is that everyone agrees that having two crates that implement the same standard is not ideal, and they are working to port any necessary features to the arrow-rs crate and plan on eventually switching to it and deprecating arrow2. But that's not easy.
https://github.com/apache/arrow-rs/issues/1176
https://github.com/jorgecarleitao/arrow2/pull/1476
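To make the "standard layout" idea above concrete, here is a minimal sketch in plain Python of an Arrow-style nullable int32 column: one validity bitmap (least-significant bit first) plus one contiguous little-endian data buffer. The function names `encode_int32_column` and `read_int32_column` are hypothetical and written only for this illustration; they are not part of arrow2, arrow-rs, or Polars, and the sketch leaves out the buffer padding and alignment that real Arrow buffers use.

```python
import struct


def encode_int32_column(values):
    """Pack optional ints into (validity bitmap, data buffer).

    Hypothetical helper for illustration only, not a library API.
    """
    n = len(values)
    validity = bytearray((n + 7) // 8)  # one bit per slot, 1 = value present
    data = bytearray(4 * n)             # 4 bytes per slot, nulls stay zeroed
    for i, value in enumerate(values):
        if value is not None:
            validity[i // 8] |= 1 << (i % 8)            # LSB-first bit order
            struct.pack_into("<i", data, 4 * i, value)  # little-endian int32
    return bytes(validity), bytes(data)


def read_int32_column(validity, data, length):
    """Read values directly from the buffers without converting them first."""
    view = memoryview(data)  # a view over the same bytes, not a copy
    out = []
    for i in range(length):
        is_valid = (validity[i // 8] >> (i % 8)) & 1
        out.append(struct.unpack_from("<i", view, 4 * i)[0] if is_valid else None)
    return out


validity, data = encode_int32_column([1, None, 3])
print(read_int32_column(validity, data, 3))
```

Because both buffers are plain bytes with a fixed layout, any language that knows the layout can read them in place, which is the property the comment above describes.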
-
Data Engineering with Rust
https://github.com/jorgecarleitao/arrow2 https://github.com/apache/arrow-datafusion https://github.com/apache/arrow-ballista https://github.com/pola-rs/polars https://github.com/duckdb/duckdb
-
Polars[Query Engine/ DataFrame] 0.28.0 released :)
Currently datafusion and polars aren't directly interoperable iirc because they use different underlying Arrow implementations, but there seems to be work being done on that here: https://github.com/jorgecarleitao/arrow2/issues/1429
- Arrow2 0.15 has been released. Happy festivities everyone =)
-
Rust is showing a lot of promise in the DataFrame / tabular data space
[arrow2](https://github.com/jorgecarleitao/arrow2) and [parquet2](https://github.com/jorgecarleitao/parquet2) are great foundational libraries for DataFrame libs in Rust.
-
Matano - Open source security lake built with Arrow2 + Rust
[1] https://github.com/jorgecarleitao/arrow2
-
Polars 0.23.0 released
In lockstep with arrow2's 0.13 release, we have published polars 0.23.0.
- Arrow2 v0.13.0, now with support to read Apache ORC and COW semantics!
-
::lending-iterator — Lending/streaming Iterators on Stable Rust (and a pinch of HKT)
This is so freaking life-saving! - we have been using StreamingIterator and FallibleStreamingIterator in libraries (arrow2 and parquet2) and the existing landscape is quite confusing for new users!
-
Mssql :(
arrow2 has support for mssql via ODBC (for which Microsoft has first-class support). Here are the integration tests we have (both read and write) against mssql specifically.
duckdb
- 🪄 DuckDB sql hack : get things SORTED w/ constraint CHECK
- DuckDB: Move to push-based execution model (2021)
-
DuckDB performance improvements with the latest release
I'm not sure if the fix is reassuring or not: https://github.com/duckdb/duckdb/pull/9411/files
-
Building a Distributed Data Warehouse Without Data Lakes
It's an interesting question!
The problem is that the data is spread everywhere - no choice about that. So with that in mind, how do you query that data? Today, the idea is that you HAVE to put it into a central location. With tools like Bacalhau[1] and DuckDB [2], you no longer have to - a single query can be sharded amongst all your data - EFFECTIVELY giving you a lot of what you want from a data lake.
It's not a replacement, but if you can do a few of these items WITHOUT moving the data, you will be able to see really significant cost and time savings.
[1] https://github.com/bacalhau-project/bacalhau
[2] https://github.com/duckdb/duckdb
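The idea of sending one query to the data instead of moving the data can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` as a stand-in for DuckDB, in-memory databases as stand-ins for remote data locations, and a plain loop in place of a scheduler such as Bacalhau. The `events` table and the helper names `make_shard` and `query_all_shards` are hypothetical.

```python
import sqlite3

# The same partial aggregate is sent to every shard (hypothetical schema).
PARTIAL_QUERY = "SELECT region, SUM(amount), COUNT(*) FROM events GROUP BY region"


def make_shard(rows):
    """Create an in-memory database standing in for one data location."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    return conn


def query_all_shards(shards):
    """Run the partial query where the data lives and merge the small results."""
    totals = {}
    for conn in shards:
        for region, amount_sum, row_count in conn.execute(PARTIAL_QUERY):
            prev_sum, prev_count = totals.get(region, (0.0, 0))
            totals[region] = (prev_sum + amount_sum, prev_count + row_count)
    return totals


shards = [
    make_shard([("eu", 10.0), ("us", 5.0)]),
    make_shard([("eu", 2.5)]),
]
print(query_all_shards(shards))
```

Only the per-region partial sums and counts leave each shard, so the amount of data transferred depends on the number of groups rather than the number of rows. Aggregates that cannot be merged this simply (for example exact medians) would need a different strategy.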
- DuckDB 0.9.0
-
Push or Pull, is this a question?
[4] Switch to Push-Based Execution Model by Mytherin · Pull Request #2393 · duckdb/duckdb (github.com)
-
Show HN: Hydra 1.0 – open-source column-oriented Postgres
it depends on your query obviously.
In general, I did very deep benchmarking of pg, clickhouse and duckdb, and I sure didn't make stupid mistakes like this: https://news.ycombinator.com/item?id=36990831
My dataset has 50B rows and 2 TB of data, and I think columnar dbs are very overhyped and I chose pg because:
- pg performance is acceptable, maybe 2-3x slower than clickhouse and duckdb on some queries if pg is configured correctly and run on compressed storage
- clickhouse and duckdb start falling apart very fast because they specialize in a very narrow type of query: https://github.com/ClickHouse/ClickHouse/issues/47520 https://github.com/ClickHouse/ClickHouse/issues/47521 https://github.com/duckdb/duckdb/discussions/6696
-
🦆 Effortless Data Quality w/duckdb on GitHub ♾️
This action installs duckdb with the version provided in input.
-
Using SQL inside Python pipelines with Duckdb, Glaredb (and others?)
Duckdb: https://github.com/duckdb/duckdb - seems pretty popular, been keeping an eye on this for close to a year now.
-
CSV or Parquet File Format
The Parquet-Go library is very complex, and I have not yet succeeded in using it. So I asked whether DuckDB can provide an API: https://github.com/duckdb/duckdb/issues/7776
What are some alternatives?
polars - Dataframes powered by a multithreaded, vectorized query engine, written in Rust
ClickHouse - ClickHouse® is a free analytics DBMS for big data
datafusion - Apache DataFusion SQL Query Engine
sqlite-worker - A simple, and persistent, SQLite database for Web and Workers.
db-benchmark - reproducible benchmark of database-like ops
datasette - An open source multi-tool for exploring and publishing data
arrow-rs - Official Rust implementation of Apache Arrow
octosql - OctoSQL is a query tool that allows you to join, analyse and transform data from multiple databases and file formats using SQL.
pyodide - Pyodide is a Python distribution for the browser and Node.js based on WebAssembly
metabase-clickhouse-driver - ClickHouse database driver for the Metabase business intelligence front-end
explorer - Series (one-dimensional) and dataframes (two-dimensional) for fast and elegant data exploration in Elixir