Apache Arrow vs Airflow

| | Apache Arrow | Airflow |
|---|---|---|
| Mentions | 83 | 182 |
| Stars | 14,854 | 38,302 |
| Growth | 1.1% | 2.4% |
| Activity | 9.9 | 10.0 |
| Latest commit | about 11 hours ago | 1 day ago |
| Language | C++ | Python |
| License | Apache License 2.0 | Apache License 2.0 |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
Apache Arrow
- Unlocking DuckDB from Anywhere - A Guide to Remote Access with Apache Arrow and Flight RPC (gRPC)
Apache Arrow: a set of technologies that enable big data systems to process and move data fast.
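For a sense of what remote access over Flight RPC looks like from the client side, here is a minimal sketch using the pyarrow.flight Python bindings; the endpoint address and the convention of sending a SQL string as the command descriptor are assumptions, not details taken from the article.

```python
# A minimal Flight RPC client sketch, assuming a Flight server (for
# example, one wrapping DuckDB) listens at grpc://localhost:8815 and
# accepts a SQL string as the command descriptor; the address and the
# command convention are illustrative.
import pyarrow.flight as flight

client = flight.connect("grpc://localhost:8815")

# Ask the server how to retrieve the result of the query.
descriptor = flight.FlightDescriptor.for_command(b"SELECT 42 AS answer")
info = client.get_flight_info(descriptor)

# Stream the Arrow record batches for the first endpoint and
# materialize them as a pyarrow.Table.
reader = client.do_get(info.endpoints[0].ticket)
table = reader.read_all()
print(table)
```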
- Using Polars in Rust for high-performance data analysis
One of the main selling points of Polars over similar solutions such as Pandas is performance. Polars is written in highly optimized Rust and uses the Apache Arrow columnar memory format.
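As a quick illustration of that Arrow foundation (shown here with the Python bindings rather than the Rust API the article uses), converting between Polars and pyarrow is typically zero-copy; the column names and values below are made up.

```python
# A small illustration of the Arrow memory format underpinning Polars;
# the column names and values are placeholders.
import polars as pl

df = pl.DataFrame({"id": [1, 2, 3], "value": [0.5, 1.5, 2.5]})

# Polars data is Arrow-backed, so exporting to a pyarrow.Table is
# typically a zero-copy operation.
arrow_table = df.to_arrow()

# An Arrow table can likewise be wrapped back into a Polars DataFrame.
df_again = pl.from_arrow(arrow_table)
print(df_again)
```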
- Kotlin DataFrame ❤️ Arrow
Kotlin DataFrame v0.14 comes with improvements for reading Apache Arrow format, especially loading a DataFrame from any ArrowReader. This improvement can be used to easily load results from analytical databases (such as DuckDB, ClickHouse) directly into Kotlin DataFrame.
- Random access string compression with FSST and Rust
- Declarative Multi-Engine Data Stack with Ibis
Apache Arrow
- Shades of Open Source - Understanding The Many Meanings of "Open"
It's this kind of certainty that underscores the vital role of the Apache Software Foundation (ASF). Many first encounter Apache through its pioneering project, the open-source web server that remains ubiquitous in web operations today. The ASF was initially created to hold the intellectual property and assets of the Apache project, and it has since evolved into a cornerstone for open-source projects worldwide. The ASF enforces strict standards for diverse contributions, independence, and activity in its projects, ensuring they can withstand the test of time as standards in software development. Many open-source projects strive to become Apache projects to gain the community credibility necessary for adoption as standard software building blocks, such as Apache Tomcat for Java web applications, Apache Arrow for in-memory data representation, and Apache Parquet for data file formatting, among others.
- The Simdjson Library
- Arrow Flight SQL in Apache Doris for 10X faster data transfer
Apache Doris 2.1 has a data transmission channel built on Arrow Flight SQL. (Apache Arrow is a software development platform designed for efficient data movement across systems and languages, and the Arrow format aims for high-performance, lossless data exchange.) It allows high-speed, large-scale data reading from Doris via SQL in various mainstream programming languages. For target clients that also support the Arrow format, the whole process is free of serialization/deserialization, so there is no performance loss. Another upside is that Arrow Flight can make full use of multi-node, multi-core architectures and transfer data in parallel, which further raises data throughput.
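A hedged sketch of what reading from an Arrow Flight SQL endpoint can look like from Python, using the ADBC Flight SQL driver; the host, port, credentials, and table name are placeholders, and the exact connection options can vary by driver and Doris version.

```python
# Reading query results over Arrow Flight SQL via ADBC; all connection
# details below are placeholders.
import adbc_driver_manager
import adbc_driver_flightsql.dbapi as flight_sql

conn = flight_sql.connect(
    uri="grpc://doris-fe-host:9090",  # placeholder Arrow Flight SQL address
    db_kwargs={
        adbc_driver_manager.DatabaseOptions.USERNAME.value: "root",
        adbc_driver_manager.DatabaseOptions.PASSWORD.value: "",
    },
)

cursor = conn.cursor()
cursor.execute("SELECT * FROM sales LIMIT 1000")

# The result arrives as Arrow record batches, so it can be handed to
# pandas, Polars, etc. without row-by-row deserialization.
table = cursor.fetch_arrow_table()
print(table.num_rows, table.schema)

cursor.close()
conn.close()
```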
- How moving from Pandas to Polars made me write better code without writing better code
In comes Polars: a brand-new dataframe library, or as the author Ritchie Vink describes it... a query engine with a dataframe frontend. Polars is built on top of the Arrow memory format and is written in Rust, a modern, performant, and memory-safe systems programming language similar to C/C++.
- From slow to SIMD: A Go optimization story
I learned yesterday about GoLang's assembler https://go.dev/doc/asm - after browsing how Arrow is implemented for different languages (my experience is mainly C/C++) - https://github.com/apache/arrow/tree/main/go/arrow/math - there are a bunch of .S ("asm") files and I'm still not able to comprehend how these work exactly (I guess it'll take more reading) - it seems very peculiar.
The last time I used inline assembly was back in Turbo/Borland Pascal, then a bit in Visual Studio (32-bit), until it got disabled. Then I did very little with gcc and its stricter specification (with the former you had to know how the ABI worked; with the latter too, but at least it was specced out).
Anyway - I wasn't expecting to find this in "Go" :) But I guess you can always start with .go code, produce assembly (-S), and then optimize it, or find/hire someone to do it.
Airflow
- AIOps, DevOps, MLOps, LLMOps – What’s the Difference?
Data pipelines: Apache Kafka and Airflow are often used for building data pipelines that can continuously feed data to models in production.
- Data Engineering with DLT and REST
This article demonstrates how to work with near real-time and historical data using the dlt package. Whether you need to scale data access across the enterprise or provide historical data for post-event analysis, you can use the same framework to provide customer data. In a future article, I'll demonstrate how to use dlt with a workflow orchestrator such as Apache Airflow or Dagster.
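To make the idea concrete, here is a minimal, hedged sketch of a dlt pipeline that loads records from a REST endpoint into DuckDB; the API URL, resource name, and destination are placeholders, not details from the article.

```python
# A minimal dlt pipeline sketch; endpoint, resource name, and
# destination are placeholders.
import dlt
import requests

@dlt.resource(name="events", write_disposition="append")
def events():
    # Pull one page of records from a hypothetical REST endpoint.
    response = requests.get("https://api.example.com/events")
    response.raise_for_status()
    yield response.json()

pipeline = dlt.pipeline(
    pipeline_name="rest_demo",
    destination="duckdb",
    dataset_name="raw",
)

# Run the load and print what was written where.
load_info = pipeline.run(events())
print(load_info)
```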
- Enabling Apache Airflow to copy large S3 objects
This approach means the API doesn't change, i.e., you can just replace the S3CopyObjectOperator instances with S3CopyOperator instances. Additionally, we only perform the extra work of doing the multipart upload when the simpler method is insufficient. The trade-off is that we're inefficient if almost every object is larger than 5GB because we're doing a "useless" API call first. As usual, it depends. A similar approach has been discussed in this GitHub issue in the Airflow repository.
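The fallback idea can be sketched roughly with plain boto3 as follows; this is an illustration of the trade-off described above, not the operator from the article, and the error handling is deliberately simplified.

```python
# A rough sketch of "try the cheap copy first, fall back to multipart";
# bucket and key names are placeholders.
import boto3
from botocore.exceptions import ClientError

def copy_s3_object(bucket: str, src_key: str, dst_key: str) -> None:
    s3 = boto3.client("s3")
    source = {"Bucket": bucket, "Key": src_key}
    try:
        # Single-request server-side copy; S3 rejects sources over 5 GB.
        s3.copy_object(Bucket=bucket, Key=dst_key, CopySource=source)
    except ClientError:
        # The cheap call failed (e.g. the object is too large), so fall
        # back to boto3's managed transfer, which performs a multipart
        # copy under the hood. A real operator would inspect the error
        # code before falling back.
        s3.copy(source, bucket, dst_key)
```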
- Deploy Apache Airflow on AWS Elastic Kubernetes Service (EKS)
helm repo add apache-airflow https://airflow.apache.org
- New Apache Airflow Operators for Google Generative AI
We only use KubernetesOperators, but this has many downsides, and it's very clearly an afterthought in the Airflow project. It creates confusion because users of Airflow expect features A, B, and C, and when using KubernetesOperators they aren't functional because your biz logic needs to be separated. There are a number of blog posts echoing a similar critique[1]. Using KubernetesOperators creates a lot of wrong abstractions, impedes testability, and makes Airflow as a whole a pretty overkill system just to monitor external tasks. At that point, you should have just had your orchestration in client code to begin with, and many other frameworks made this division between client and server correctly. That would also make it easier to support multiple languages.
According to their README: https://github.com/apache/airflow#approach-to-dependencies-o...
- Anyone Can Access Deleted and Private Repository Data on GitHub
> Nope, me too. The whole Repo network thing is not User facing at all.
There are some user-facing parts: You can find the fork network and some related bits under repo insights. (The UX is not great.)
https://github.com/apache/airflow/forks?include=active&page=...
- Data on Kubernetes: Part 3 - Managing Workflows with Job Schedulers and Batch-Oriented Workflow Orchestrators
There are several tools available that can help manage these workflows. Apache Airflow is a platform designed to programmatically author, schedule, and monitor workflows.
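For readers new to Airflow, "programmatically author" means a workflow is just Python code; a minimal sketch with the Airflow 2.x TaskFlow API might look like this (DAG name, schedule, and tasks are illustrative).

```python
# A minimal "workflows as code" sketch with Airflow's TaskFlow API;
# names, schedule, and task bodies are placeholders.
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def example_pipeline():
    @task
    def extract():
        return [1, 2, 3]

    @task
    def load(rows):
        print(f"loaded {len(rows)} rows")

    load(extract())

example_pipeline()
```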
- Ask HN: What's the right tool for this job?
From what I've seen, there are sort of two paths. I'll provide a well known example from each.
1. lang specific distributed task library
For example, in Python, celery is a pretty popular task system (a minimal sketch follows after this list). If you (the dev) are the one doing all the code and running the workflows, it might work well for you. You build the core code and functions, and it handles the processing and resource stuff with a little config.
* https://github.com/celery/celery
Or lower level:
* https://github.com/dask/dask
2. DAG Workflow systems
There are also whole systems for what you're describing. They've gotten especially popular in the ML ops and data engineering world. A common one is Airflow:
* https://github.com/apache/airflow
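As a rough illustration of the celery option above (the broker URL and task body are placeholders, not from the original comment):

```python
# tasks.py: a minimal Celery sketch; broker URL and task body are
# placeholders.
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")

@app.task
def process(item):
    # Whatever per-item work your workflow needs.
    return item * 2
```

You would then start a worker with `celery -A tasks worker` and enqueue work from your own code with `process.delay(21)`.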
- Apache Doris Job Scheduler for Task Automation
Job scheduling is an important part of data management, as it enables regular data updates and cleanups. In a data platform, it is often handled by workflow orchestration tools like Apache Airflow and Apache DolphinScheduler. However, adding another component to the data architecture also means investing extra resources in management and maintenance. That's why Apache Doris 2.1.0 introduces a built-in Job Scheduler. It is tailored to Apache Doris and brings higher scheduling flexibility and architectural simplicity.
- How I've implemented the Medallion architecture using Apache Spark and Apache Hadoop
A proper orchestration tool such as Apache Airflow or Dagster should replace the custom orchestrator I used.
What are some alternatives?
h5py - HDF5 for Python -- The h5py package is a Pythonic interface to the HDF5 binary data format.
Kedro - Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.
Apache Spark - Apache Spark - A unified analytics engine for large-scale data processing
dagster - An orchestration platform for the development, production, and observation of data assets.
FlatBuffers - FlatBuffers: Memory Efficient Serialization Library
n8n - Fair-code workflow automation platform with native AI capabilities. Combine visual building with custom code, self-host or cloud, 400+ integrations.
polars - Dataframes powered by a multithreaded, vectorized query engine, written in Rust
luigi - Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
ClickHouse - ClickHouse® is a real-time analytics database management system
beam - Apache Beam is a unified programming model for Batch and Streaming data processing.
Pandas - Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more