Apache Arrow
ClickHouse
| | Apache Arrow | ClickHouse |
|---|---|---|
| Mentions | 83 | 231 |
| Stars | 14,854 | 38,466 |
| Growth | 1.1% | 1.7% |
| Activity | 9.9 | 10.0 |
| Latest commit | about 17 hours ago | 2 days ago |
| Language | C++ | C++ |
| License | Apache License 2.0 | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
Apache Arrow
- Unlocking DuckDB from Anywhere - A Guide to Remote Access with Apache Arrow and Flight RPC (gRPC)
Apache Arrow: a set of technologies that enable big data systems to process and move data fast.
- Using Polars in Rust for high-performance data analysis
One of the main selling points of Polars over similar solutions such as Pandas is performance. Polars is written in highly optimized Rust and uses the Apache Arrow container format.
- Kotlin DataFrame ❤️ Arrow
Kotlin DataFrame v0.14 comes with improvements for reading Apache Arrow format, especially loading a DataFrame from any ArrowReader. This improvement can be used to easily load results from analytical databases (such as DuckDB, ClickHouse) directly into Kotlin DataFrame.
- Random access string compression with FSST and Rust
- Declarative Multi-Engine Data Stack with Ibis
Apache Arrow
- Shades of Open Source - Understanding The Many Meanings of "Open"
It's this kind of certainty that underscores the vital role of the Apache Software Foundation (ASF). Many first encounter Apache through its pioneering project, the open-source web server framework that remains ubiquitous in web operations today. The ASF was initially created to hold the intellectual property and assets of the Apache project, and it has since evolved into a cornerstone for open-source projects worldwide. The ASF enforces strict standards for diverse contributions, independence, and activity in its projects, ensuring they can withstand the test of time as standards in software development. Many open-source projects strive to become Apache projects to gain the community credibility necessary for adoption as standard software building blocks, such as Apache Tomcat for Java web applications, Apache Arrow for in-memory data representation, and Apache Parquet for data file formatting, among others.
- The Simdjson Library
- Arrow Flight SQL in Apache Doris for 10X faster data transfer
Apache Doris 2.1 has a data transmission channel built on Arrow Flight SQL. (Apache Arrow is a software development platform designed for high data movement efficiency across systems and languages, and the Arrow format aims for high-performance, lossless data exchange.) It allows high-speed, large-scale data reading from Doris via SQL in various mainstream programming languages. For target clients that also support the Arrow format, the whole process will be free of serialization/deserialization, thus no performance loss. Another upside is, Arrow Flight can make full use of multi-node and multi-core architecture and implement parallel data transfer, which is another enabler of high data throughput.
- How moving from Pandas to Polars made me write better code without writing better code
In comes Polars: a brand new dataframe library or, as its author Ritchie Vink describes it, a query engine with a dataframe frontend. Polars is built on top of the Arrow memory format and is written in Rust, a modern, performant, and memory-safe systems programming language similar to C/C++.
- From slow to SIMD: A Go optimization story
I learned yesterday about GoLang's assembler https://go.dev/doc/asm - after browsing how arrow is implemented for different languages (my experience is mainly C/C++) - https://github.com/apache/arrow/tree/main/go/arrow/math - there are a bunch of .S ("asm") files and I'm still not able to comprehend how these work exactly (I guess it'll take more reading) - it seems very peculiar.
The last time I used inline assembly was back in Turbo/Borland Pascal, then a bit in Visual Studio (32-bit), until it got disabled. Then I did very little with gcc and its stricter specification (with the former you had to know how the ABI worked, with the latter too - but it was specced out).
Anyway - I wasn't expecting to find this in "Go" :) But I guess you can always start with .go code then produce assembly (-S) then optimize it, or find/hire someone to do it.
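Several of the posts above describe Arrow as a columnar in-memory format that lets systems exchange data without a serialization step. The following is only a toy illustration of that idea using Python's standard library; it is not Arrow's actual layout or API, and the names `rows`, `to_columns`, `id`, and `score` are made up for the example.

```python
from array import array

# Row-oriented data: one tuple per record (hypothetical sample values).
rows = [(1, 9.5), (2, 7.25), (3, 8.0)]


def to_columns(records):
    """Transpose row tuples into typed, contiguous per-column buffers."""
    ids = array("q", (r[0] for r in records))     # signed 64-bit integers
    scores = array("d", (r[1] for r in records))  # 64-bit floats
    return {"id": ids, "score": scores}


columns = to_columns(rows)

# An aggregate over one column scans a single contiguous buffer
# and never touches the other fields.
total_score = sum(columns["score"])

# memoryview exposes the same buffer to another consumer without copying,
# loosely analogous to handing a column to another system as-is.
view = memoryview(columns["score"])
```

Real Arrow arrays add validity bitmaps, offsets for variable-length types, and a language-independent schema on top of this basic idea.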
ClickHouse
- Should You Ditch Spark for DuckDB or Polars?
ClickHouse also has a managed service (https://clickhouse.com/)
- ClickHouse: The Key to Faster Insights
ClickHouse is rapidly gaining traction for its unmatched speed and efficiency in processing big data. Cloudflare, for example, uses ClickHouse to process millions of rows per second and reduce memory usage by over four times, making it a key player in large-scale analytics. With its advanced features and real-time query performance, ClickHouse is becoming a go-to choice for companies handling massive datasets. In this article, we'll explore why ClickHouse is increasingly favored for analytics, its key features, and how to deploy it on Kubernetes. We'll also cover some best practices for scaling ClickHouse to handle growing workloads and maximize performance.
- All Hacker News posts dataset on Google BigQuery
I have this dataset being updated in ClickHouse in real-time: https://play.clickhouse.com/play?user=play#U0VMRUNUIG1heCh0a...
I also provide a way to export it or attach it to clickhouse-local and analyze it locally: https://github.com/ClickHouse/ClickHouse/issues/29693#issuec...
- Show HN: PDF2MD – Rust+Redis+ClickHouse+VLLM conversion pipeline for PDFs
- Show HN: BemiDB – Postgres read replica optimized for analytics
And you can try it right now.
Install ClickHouse:
curl https://clickhouse.com/ | sh
- Kotlin DataFrame ❤️ Arrow
ClickHouse is a high-performance, column-oriented SQL database management system (DBMS) designed for online analytical processing (OLAP). ClickHouse allows using Arrow Stream as an output format.
- Vecint: Average Color
- Clickhouse for Embedded Analytics: First Impressions and Unexpected Challenges
We started to look for alternatives and quickly landed on ClickHouse.
- Lessons Learned #2: Your new feature could introduce a security vulnerability to your old feature (Clickhouse CVE-2024-22412)
In today’s story, we will discuss CVE-2024-22412, which affected ClickHouse, a popular open-source column-oriented database management system typically used for real-time online analytical processing (OLAP). You can find the full write-up of the vulnerability here.
- Show HN: Insights.hn – Real-time Hacker News posts and comments analytics
This is really great!
I can suggest more ideas that will be easy to add:
- a spark line or heat map of upvotes for every thread: https://github.com/ClickHouse/ClickHouse/issues/59020
- a built-in SQL editor for custom queries;
If you need help in supporting or hosting it, write to milovidov at clickhouse.com
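The Kotlin DataFrame post above notes that ClickHouse can emit Arrow Stream as an output format. A minimal sketch of requesting it over ClickHouse's HTTP interface follows; the helper name `arrow_stream_url` is hypothetical, and the host, port (8123 is the usual HTTP default), and sample query are assumptions to adapt to your setup.

```python
from urllib.parse import urlencode


def arrow_stream_url(sql, host="localhost", port=8123):
    """Build a ClickHouse HTTP-interface URL that asks for Arrow Stream output.

    host and port are assumptions; change them to match your server.
    """
    # Drop a trailing semicolon so the FORMAT clause can be appended cleanly.
    query = f"{sql.rstrip().rstrip(';')} FORMAT ArrowStream"
    return f"http://{host}:{port}/?{urlencode({'query': query})}"


url = arrow_stream_url("SELECT number FROM system.numbers LIMIT 3;")
```

The response body would be an Arrow IPC stream, so an Arrow library is needed to decode it (for example, `pyarrow.ipc.open_stream` in Python); that step is omitted here to keep the sketch dependency-free.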
What are some alternatives?
Airflow - Apache Airflow, a platform to programmatically author, schedule, and monitor workflows
loki - Like Prometheus, but for logs.
h5py - HDF5 for Python -- The h5py package is a Pythonic interface to the HDF5 binary data format.
DuckDB - DuckDB is an analytical in-process SQL database management system
Apache Spark - A unified analytics engine for large-scale data processing
Trino - Official repository of Trino, the distributed SQL query engine for big data, former
FlatBuffers - FlatBuffers: Memory Efficient Serialization Library
VictoriaMetrics - VictoriaMetrics: fast, cost-effective monitoring solution and time series database
polars - Dataframes powered by a multithreaded, vectorized query engine, written in Rust
RocksDB - A library that provides an embeddable, persistent key-value store for fast storage.
beam - Apache Beam is a unified programming model for Batch and Streaming data processing.
TimescaleDB - A time-series database for high-performance real-time analytics packaged as a Postgres extension