|6 days ago||7 days ago|
|Apache License 2.0||Apache License 2.0|
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
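As a rough illustration of how metrics like these could be computed, the sketch below assumes that growth is the relative change between two monthly star snapshots, that a commit's weight decays exponentially with its age, and that the activity number is a percentile rank scaled to 0-10. None of these formulas is stated on this page; the function names and the 30-day half-life are hypothetical.

```python
from datetime import date


def month_over_month_growth(stars_prev: int, stars_now: int) -> float:
    """Relative change in stars between two monthly snapshots (assumed definition)."""
    if stars_prev <= 0:
        raise ValueError("previous star count must be positive")
    return (stars_now - stars_prev) / stars_prev


def recency_weighted_commits(commit_dates, today, half_life_days=30.0):
    """Sum of commits where each commit's weight halves every half_life_days.

    The exponential decay and the half-life value are assumptions; they only
    illustrate "recent commits have higher weight than older ones".
    """
    return sum(0.5 ** ((today - d).days / half_life_days) for d in commit_dates)


def activity_score(raw, all_raw):
    """Scale the share of tracked projects with a lower raw value to 0-10."""
    below = sum(1 for r in all_raw if r < raw)
    return 10.0 * below / len(all_raw)


if __name__ == "__main__":
    print(month_over_month_growth(1000, 1100))
    print(recency_weighted_commits([date(2022, 1, 1)], date(2022, 1, 31)))
    print(activity_score(9, list(range(10))))
```

Under this reading, a project whose raw value exceeds that of 90% of tracked projects gets a score of 9.0, which matches the "top 10%" example above.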
How to use multiple Parquet files with Datafusion dataframe?
1 project | reddit.com/r/rust | 9 Jan 2022
Distributed systems you'd like to see in Rust?
8 projects | reddit.com/r/rust | 28 Dec 2021
This project looks cool: https://github.com/apache/arrow-datafusion
Any role that Rust could have in the Data world (Big Data, Data Science, Machine learning, etc.)?
8 projects | reddit.com/r/rust | 4 Dec 2021
Show HN: Box – Data Transformation Pipelines in Rust DataFusion
4 projects | news.ycombinator.com | 30 Nov 2021
A while ago I posted a link to [Arc](https://news.ycombinator.com/item?id=26573930), a declarative method for defining repeatable data pipelines which execute against [Apache Spark](https://spark.apache.org/).
Today I would like to present a proof-of-concept implementation of the [Arc declarative ETL framework](https://arc.tripl.ai) against [Apache DataFusion](https://arrow.apache.org/datafusion/), which is an ANSI SQL (Postgres) execution engine based upon Apache Arrow and built with Rust.
The idea of providing a declarative 'configuration' language for defining data pipelines was planned from the beginning of the Arc project to allow changing execution engines without having to rewrite the base business logic (the part that is valuable to your business). Instead, by defining an abstraction layer, we can change the execution engine and run the same logic with different execution characteristics.
The benefit of DataFusion over Apache Spark is a significant increase in speed and a reduction in execution resource requirements. Even through a Docker-for-Mac inefficiency layer, the same job completes in ~4 seconds with DataFusion vs ~24 seconds with Apache Spark (including JVM startup time). Without the Docker-for-Mac layer, end-to-end execution times of 0.5 seconds for the same example job (TPC-H) are possible. (The aim is not to start a benchmarking flamewar but to provide some indicative data.)
The purpose of this post is to gather feedback from the community on whether you would use a tool like this, what features would be required for you to use it (MVP), or whether you would be interested in contributing to the project. I would also like to highlight the excellent work being done by the DataFusion/Arrow (and Apache) community in providing such amazing tools to us all as open source projects.
Rust and what it needs to gain space in computation-oriented applications
7 projects | reddit.com/r/rust | 24 Nov 2021
You should check out polars, datafusion, influxdb iox and databend, all written in native Rust and powered by the Apache Arrow format. Polars in particular is pretty damn fast and has bindings for Python.
How to pass dataframes between Rust and Python?
4 projects | reddit.com/r/rust | 20 Nov 2021
A solution for either Polars or Datafusion (or something else?) would be fine. For both libraries, Python packages exist that contain the Python bindings: https://github.com/pola-rs/polars/tree/master/py-polars https://github.com/apache/arrow-datafusion/tree/master/python
Using an ECS as a general-purpose storage container?
1 project | reddit.com/r/rust_gamedev | 2 Nov 2021
DataFusion runs SQL queries against an in-memory column store. It aims for a subset of Postgres SQL. It specifically targets big data use cases, and can integrate with other big-data tools via the Parquet file format.
Arrow Datafusion includes Ballista, which does SIMD and GPU vectorized ops
1 project | news.ycombinator.com | 24 Oct 2021
Apache Arrow DataFusion (Rust query engine) now has an online user guide
1 project | reddit.com/r/rust | 22 Sep 2021
Show HN: Columnq brings OLAP to Unix pipes
2 projects | news.ycombinator.com | 13 Sep 2021
Thanks! It's using Datafusion as the query engine: https://github.com/apache/arrow-datafusion
Grep one-liners as CI tasks
7 projects | news.ycombinator.com | 14 Jan 2022
Top Github repo trends in 2021
47 projects | dev.to | 12 Jan 2022
No surprises here: deep learning is the most popular subcategory, with the Hugging Face Transformers repo, YOLOv5, TensorFlow and DeepMind’s AlphaFold all in the mix. The only proper infrastructure-ey repos on the list are Meilisearch and ClickHouse, a tad surprising given all the hype data infrastructure receives in VC-world, but again, probably just a question of the size of end-user populations + whether data scientists spend tons of time on GitHub vs. web developers…
Ask HN: Good open source alternatives to Google Analytics?
30 projects | news.ycombinator.com | 11 Jan 2022
go-faster/ch: fastest ClickHouse client, faster than Rust and C++
5 projects | reddit.com/r/golang | 5 Jan 2022
And by this do you mean this? https://clickhouse.com/
Ask HN: Top Skills to Learn for 2022?
3 projects | news.ycombinator.com | 20 Dec 2021
Enabling predictive capabilities in ClickHouse database
2 projects | dev.to | 16 Dec 2021
In this blog post, we will be reviewing how we can integrate predictive capabilities powered by machine learning with the ClickHouse database. ClickHouse is a fast, open-source, column-oriented SQL database that is very useful for data analysis and real-time analytics. The project is maintained and supported by ClickHouse, Inc. We will be exploring its features in tasks that require data preparation in support of machine learning.
Stream Processing Database
4 projects | reddit.com/r/Database | 28 Nov 2021
There's ksqlDB (open source, built with Java) and Materialize (there's a standalone edition); both need Kafka/Redpanda. There's also ClickHouse (open source, using a materialized view with a specific engine, but you need to buffer the inserts using a proxy like KittenHouse or a buffering library like ch-timed-buffer). Is there any other alternative to those 3 that similarly doesn't do a full scan to do aggregation?
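To make the insert buffering mentioned above concrete, here is a minimal sketch of the general idea: collect rows in memory and hand them off as one batch once a size or age limit is reached. It is not the API of KittenHouse or ch-timed-buffer; the `TimedBuffer` class, its parameters, and the `flush` callback are all hypothetical, and the callback stands in for whatever actually writes the batch to the database.

```python
import time


class TimedBuffer:
    """Collects rows and passes them to `flush` as one batch (hypothetical helper).

    A batch is flushed when it holds `max_rows` rows or when the oldest row
    is at least `max_age_s` seconds old. Age is only checked when a row is
    added; a production buffer would also flush from a background timer.
    """

    def __init__(self, flush, max_rows=1000, max_age_s=1.0, clock=time.monotonic):
        self._flush = flush          # callback that receives a list of rows
        self._max_rows = max_rows
        self._max_age_s = max_age_s
        self._clock = clock          # injectable so the age rule can be tested
        self._rows = []
        self._first_at = None        # arrival time of the oldest buffered row

    def add(self, row):
        if not self._rows:
            self._first_at = self._clock()
        self._rows.append(row)
        too_many = len(self._rows) >= self._max_rows
        too_old = self._clock() - self._first_at >= self._max_age_s
        if too_many or too_old:
            self.flush()

    def flush(self):
        """Send any buffered rows as a single batch and reset the buffer."""
        if self._rows:
            batch, self._rows = self._rows, []
            self._flush(batch)


if __name__ == "__main__":
    batches = []
    buf = TimedBuffer(batches.append, max_rows=3, max_age_s=60.0)
    for i in range(1, 5):
        buf.add(i)
    buf.flush()  # push out the remaining partial batch
    print(batches)
```

Batching this way trades a little latency for far fewer, larger inserts, which is the reason the post pairs ClickHouse with a buffering layer.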
Open Source Analytics Stack: Bringing Control, Flexibility, and Data-Privacy to Your Analytics
15 projects | dev.to | 25 Nov 2021
Moreover, using open-source warehouse tools can unlock additional insights from your data in real time and at a lower cost. PostgreSQL (website, repo) is a popular example of an efficient and low-cost data warehousing solution. Another example is ClickHouse (website, GitHub), an open-source, analytics-focused DBMS that allows generating analytical reports from data in real time using SQL.
Welcome to the free open-source OLAP server project
2 projects | dev.to | 15 Nov 2021
The most efficient way is to use column store databases as data sources for eMondrian. For example, ClickHouse could run as a powerful and fast query engine while eMondrian works as a proxy representing data as cubes and executing MDX queries.
How to speed up ClickHouse queries using materialized columns
1 project | dev.to | 11 Nov 2021
As of writing, there's a feature request on Github for adding specific commands for materializing specific columns on ClickHouse data parts.
What are some alternatives?
VictoriaMetrics - VictoriaMetrics: fast, cost-effective monitoring solution and time series database
Trino - Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
TimescaleDB - An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
RocksDB - A library that provides an embeddable, persistent key-value store for fast storage.
PostgreSQL - Mirror of the official PostgreSQL GIT repository. Note that this is just a *mirror* - we don't work with pull requests on github. To contribute, please see https://wiki.postgresql.org/wiki/Submitting_a_Patch
Adminer - Database management in a single PHP file
duckdb - DuckDB is an in-process SQL OLAP Database Management System
TileDB - The Universal Storage Engine
polars - Fast multi-threaded DataFrame library in Rust | Python | Node.js
loki - Like Prometheus, but for logs.
MySQL - MySQL Server, the world's most popular open source database, and MySQL Cluster, a real-time, open source transactional database.