Apache Arrow vs Apache Spark
Metric | Apache Arrow | Apache Spark
---|---|---
Mentions | 83 | 112
Stars | 14,831 | 40,319
Growth | 1.1% | 0.6%
Activity | 9.9 | 10.0
Last commit | 3 days ago | 3 days ago
Language | C++ | Scala
License | Apache License 2.0 | Apache License 2.0
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
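As a rough illustration of the two ideas above, the following Python sketch weights commits by recency and maps a project's weighted value to a 0-10 score by percentile rank. The tracker's actual formula is not given here, so the exponential decay, the `half_life_days` parameter, and both function names are assumptions made for illustration only.

```python
from bisect import bisect_left


def weighted_commits(commit_ages_days, half_life_days=30.0):
    """Sum commits so that recent ones count more than older ones.

    Hypothetical weighting: a commit's weight halves every `half_life_days`.
    """
    return sum(0.5 ** (age / half_life_days) for age in commit_ages_days)


def activity_score(project_value, all_values):
    """Map a raw value to a 0-10 score from its percentile rank (assumed scale)."""
    ranked = sorted(all_values)
    less_active = bisect_left(ranked, project_value)  # projects strictly below
    return round(10 * less_active / len(ranked), 1)


# Ten hypothetical projects, each described by the ages (in days) of its commits.
projects = [[400] * n for n in range(1, 10)] + [[1, 2, 3, 5, 8]]
values = [weighted_commits(ages) for ages in projects]
print(activity_score(values[-1], values))
```

Under these assumptions the last project, the only one with recent commits, scores 9.0: nine of the ten tracked projects are less active, which places it in the top 10%.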
Apache Arrow
- Unlocking DuckDB from Anywhere - A Guide to Remote Access with Apache Arrow and Flight RPC (gRPC)
Apache Arrow: a set of technologies that enable big data systems to process and move data fast.
- Using Polars in Rust for high-performance data analysis
One of the main selling points of Polars over similar solutions such as Pandas is performance. Polars is written in highly optimized Rust and uses the Apache Arrow container format.
- Kotlin DataFrame ❤️ Arrow
Kotlin DataFrame v0.14 comes with improvements for reading Apache Arrow format, especially loading a DataFrame from any ArrowReader. This improvement can be used to easily load results from analytical databases (such as DuckDB, ClickHouse) directly into Kotlin DataFrame.
- Random access string compression with FSST and Rust
- Declarative Multi-Engine Data Stack with Ibis
Apache Arrow
- Shades of Open Source - Understanding The Many Meanings of "Open"
It's this kind of certainty that underscores the vital role of the Apache Software Foundation (ASF). Many first encounter Apache through its pioneering project, the open-source web server framework that remains ubiquitous in web operations today. The ASF was initially created to hold the intellectual property and assets of the Apache project, and it has since evolved into a cornerstone for open-source projects worldwide. The ASF enforces strict standards for diverse contributions, independence, and activity in its projects, ensuring they can withstand the test of time as standards in software development. Many open-source projects strive to become Apache projects to gain the community credibility necessary for adoption as standard software building blocks, such as Apache Tomcat for Java web applications, Apache Arrow for in-memory data representation, and Apache Parquet for data file formatting, among others.
- The Simdjson Library
- Arrow Flight SQL in Apache Doris for 10X faster data transfer
Apache Doris 2.1 has a data transmission channel built on Arrow Flight SQL. (Apache Arrow is a software development platform designed for high data movement efficiency across systems and languages, and the Arrow format aims for high-performance, lossless data exchange.) It allows high-speed, large-scale data reading from Doris via SQL in various mainstream programming languages. For target clients that also support the Arrow format, the whole process is free of serialization and deserialization, and so avoids the performance loss they cause. Another upside is that Arrow Flight can make full use of multi-node and multi-core architectures and transfer data in parallel, which is another enabler of high data throughput.
- How moving from Pandas to Polars made me write better code without writing better code
In comes Polars: a brand new dataframe library, or, as its author Ritchie Vink describes it, a query engine with a dataframe frontend. Polars is built on top of the Arrow memory format and is written in Rust, a modern, performant, and memory-safe systems programming language similar to C/C++.
- From slow to SIMD: A Go optimization story
I learned yesterday about GoLang's assembler https://go.dev/doc/asm - after browsing how arrow is implemented for different languages (my experience is mainly C/C++) - https://github.com/apache/arrow/tree/main/go/arrow/math - there are a bunch of .S ("asm") files and I'm still not able to comprehend how these work exactly (I guess it'll take more reading) - it seems very peculiar.
The last time I used inline assembly was back in Turbo/Borland Pascal, then a bit in Visual Studio (32-bit), until it got disabled. Then I did very little with gcc and its stricter specification (with the former you had to know how the ABI worked, and with the latter too, but it was specced out).
Anyway - I wasn't expecting to find this in "Go" :) But I guess you can always start with .go code then produce assembly (-S) then optimize it, or find/hire someone to do it.
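Several of the posts above credit Arrow's columnar in-memory format with removing serialization and copying costs. The sketch below illustrates that general idea using only Python's standard library; it is not Arrow's API, and the column name and values are made up for illustration.

```python
from array import array

# A "column" stored as one contiguous, typed buffer of float64 values
# (hypothetical data, standing in for a single column of a table).
prices = array("d", [9.5, 12.0, 7.25, 30.0])

# A memoryview exposes the same bytes to a consumer without copying them.
shared = memoryview(prices)
window = shared[1:3]  # slicing a memoryview is also zero-copy

# A write through the producer is visible through the consumer's view,
# which shows that both sides read the same underlying memory.
prices[1] = 99.0
print(window.tolist())
print(window.nbytes)
```

Because the consumer reads the producer's buffer directly, no encode or decode step sits between them, which is the property the excerpts above attribute to Arrow-aware clients.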
Apache Spark
- His Startup Is Now Worth $62B. It Gave Away Its First Product Free
- How to Install PySpark on Your Local Machine
If you’re stepping into the world of Big Data, you have likely heard of Apache Spark, a powerful distributed computing system. PySpark, the Python library for Apache Spark, is a favorite among data enthusiasts for its combination of speed, scalability, and ease of use. But setting it up on your local machine can feel a bit intimidating at first.
- How to Use PySpark for Machine Learning
According to the official Apache Spark website, PySpark lets you utilize the combined strengths of Apache Spark (simplicity, speed, scalability, versatility) and Python (rich ecosystem, matured libraries, simplicity) for “data engineering, data science, and machine learning on single-node machines or clusters.”
- Top FP technologies
spark
- Why Apache Spark RDD is immutable?
Apache Spark is a powerful and widely used framework for distributed data processing, beloved for its efficiency and scalability. At the heart of Spark’s magic lies the RDD, an abstraction that’s more than just a mere data collection. In this blog post, we’ll explore why RDDs are immutable and the benefits this immutability provides in the context of Apache Spark.
- Spark SQL is getting pipe syntax
- Intro to Ray on GKE
The Python Library components of Ray could be considered analogous to solutions like numpy, scipy, and pandas (which is most analogous to the Ray Data library specifically). As a framework and distributed computing solution, Ray could be used in place of a tool like Apache Spark or Python Dask. It’s also worthwhile to note that Ray Clusters can be used as a distributed computing solution within Kubernetes, as we’ve explored here, but Ray Clusters can also be created independent of Kubernetes.
- Avoid These Top 10 Mistakes When Using Apache Spark
We all know how easy it is to overlook small parts of our code, especially when we have powerful tools like Apache Spark to handle the heavy lifting. Spark's core engine is great at optimizing our messy, complex code into a sleek, efficient physical plan. But here's the catch: Spark isn't flawless. It's on a journey to perfection, sure, but it still has its limits. And Spark is upfront about those limitations, listing them out in the documentation (sometimes as little notes).
- IaaS vs PaaS vs SaaS: The Key Differences
One specific use case of the IaaS model is deploying software that would otherwise have been bought as SaaS. There are many such products, from email servers to databases. You can choose to deploy MySQL in your own infrastructure rather than buying from a MySQL SaaS provider. Other things you can deploy using the IaaS model include Mattermost for team collaboration, Apache Spark for data analytics, and SAP for Enterprise Resource Planning.
- How I've implemented the Medallion architecture using Apache Spark and Apache Hadoop
In this project, I'm exploring the Medallion Architecture, a data design pattern that organizes data into layers based on structure and/or quality. I'm creating a fictional scenario in which a large enterprise has several branches across the country. Each branch receives purchase orders from an app and delivers the goods to its customers. The enterprise wants to identify the branch that receives the most purchase requests and the branch that has the lowest average delivery time. To achieve that, I've used Apache Spark as a distributed compute engine and Apache Hadoop, in particular HDFS, as my data storage layer. Apache Spark ingests, processes, and stores the app's data on HDFS to be served to a custom dashboard app. You can find all about it in this GitHub repo.
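One of the posts above discusses why RDDs are immutable. As a toy illustration in plain Python (not Spark's API), the sketch below models a dataset whose transformations return new objects and record their lineage, so a result can always be recomputed from its source. `MiniRDD` and its methods are hypothetical names.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class MiniRDD:
    """Toy immutable dataset: either source data or a parent plus a step."""
    source: Tuple[int, ...] = ()
    parent: Optional["MiniRDD"] = None
    step: Optional[Callable[[List[int]], List[int]]] = None

    def map(self, f):
        # Returns a new dataset; the receiver is left untouched.
        return MiniRDD(parent=self, step=lambda rows: [f(r) for r in rows])

    def filter(self, keep):
        return MiniRDD(parent=self, step=lambda rows: [r for r in rows if keep(r)])

    def collect(self):
        # Replays the recorded lineage from the source each time it is called.
        if self.parent is None:
            return list(self.source)
        return self.step(self.parent.collect())


base = MiniRDD(source=(1, 2, 3, 4))
result = base.map(lambda x: x * 10).filter(lambda x: x > 15)
print(base.collect(), result.collect())

try:
    base.source = (0,)  # attempts to mutate are rejected
except FrozenInstanceError:
    print("MiniRDD is immutable")
```

Since no step can change its parent, any derived dataset can be rebuilt by replaying its chain of steps, which is the kind of benefit the immutability post alludes to.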
What are some alternatives?
Airflow - Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
Trino - Official repository of Trino, the distributed SQL query engine for big data, former
h5py - HDF5 for Python -- The h5py package is a Pythonic interface to the HDF5 binary data format.
Pytorch - Tensors and Dynamic neural networks in Python with strong GPU acceleration
FlatBuffers - FlatBuffers: Memory Efficient Serialization Library
polars - Dataframes powered by a multithreaded, vectorized query engine, written in Rust
Scalding - A Scala API for Cascading
ClickHouse - ClickHouse® is a real-time analytics database management system
mrjob - Run MapReduce jobs on Hadoop or Amazon Web Services
beam - Apache Beam is a unified programming model for Batch and Streaming data processing.
luigi - Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.