arquero
Apache Arrow

Metric | arquero | Apache Arrow
---|---|---
Mentions | 10 | 86
Stars | 1,417 | 15,690
Growth | 1.2% | 1.1%
Activity | 6.9 | 9.9
Last commit | about 2 months ago | 4 days ago
Language | JavaScript | C++
License | BSD 3-clause "New" or "Revised" License | Apache License 2.0
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
arquero
-
Show HN: JAQT – JavaScript Queries and Transformations
In a similar vein is https://pbeshai.github.io/tidy/ which I've used for 3+ years. It's a really nice lightweight transformer.
I've also used https://github.com/uwdata/arquero once (better performance for large datasets).
-
A New Package for Making Charts in Emacs: Eplot
Neat!
This is one of my favorite spaces, so I'll add some generic advice which may or may not be helpful.
I once had the privilege of working for Max Roser and Hannah Ritchie at Our World in Data, as one of the engineers on their Grapher library (https://github.com/owid/owid-grapher), and learned a ton from them (and others on the team) about making great charts.
My one piece of advice from looking at your examples would be: don't neglect title, subtitle, and caption! They would be so easy to do well because you've already created your "simple headers thingies". A few words go a long way. Check out "Storytelling with Data" by Cole Nussbaumer Knaflic for a great read on the subject. Owid's Grapher does those the best, IMO (followed closely by DataWrapper.de -- but that's not open source).
At some point, if you keep up with this, you'll also want to add a dataflow library and DSL. Hadley Wickham's dplyr in R was the GOAT, and I copied that in my Ohayo tool and in OWID Grapher's CoreTable library (https://github.com/owid/owid-grapher/tree/master/packages/%4...). Jeffrey Heer's newish Arquero (https://idl.uw.edu/arquero/) library is also along those lines.
Lately I've been delving into Mike Bostock's new thing Plot (https://observablehq.com/plot/). So far, excited by it, but only spent a day or two with it at this point.
I don't use emacs anymore, but hopefully something helpful in the comments above.
- Show HN: Matrices – explore, visualize, and share large datasets
-
Goodbye, Node.js Buffer
https://github.com/uwdata/arquero
- Arquero is a JavaScript library for query processing and transformation of array-backed data tables
- Arquero – data tables wrangling in JavaScript
-
Hal9: Data Science with JavaScript
Transformations: We found out that JavaScript in combination with D3.js has a pretty decent set of data transformation functions; however, it comes nowhere near Pandas or dplyr. We found out about Tidy.js quite early, loved it, and adopted it. The combination of Tidy.js, D3.js, and Plot.js is absolutely amazing for visualizations and data wrangling with small datasets, say 10-100K rows. We were very happy with this for a while; however, once you move away from visualizations into real-world data analysis, we found 100K rows restrictive, which gets worse with 100 or 1K columns. So we switched gears and started using Arquero.js, which happens to be columnar and enabled us to process 1M+ rows in the browser, a decent size for real-world data analysis.
- Arquero – Query processing and transformation of array-backed data tables
-
Apache Arrow 3.0.0 Release
Take a look at the arquero library from a research group at University of Washington (the same group that D3 came out of). https://github.com/uwdata/arquero
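The Hal9 excerpt above credits Arquero's columnar layout for handling larger tables. As a rough, language-agnostic illustration of that idea, here is a minimal Python sketch using only the standard library. It is not Arquero's API: the sample data and the helper names `mean_row_oriented` and `mean_columnar` are made up for this example.

```python
from array import array

# Row-oriented layout: one dict per record (hypothetical sample data).
rows = [
    {"city": "Oslo", "temp_c": 4.0},
    {"city": "Lima", "temp_c": 19.5},
    {"city": "Cairo", "temp_c": 28.0},
]

# Column-oriented layout: one array per field.
# Numeric columns are stored in a compact, contiguous typed array.
columns = {
    "city": [r["city"] for r in rows],
    "temp_c": array("d", (r["temp_c"] for r in rows)),
}


def mean_row_oriented(records, field):
    """Visit every record object and pull out a single field."""
    return sum(r[field] for r in records) / len(records)


def mean_columnar(table, field):
    """Scan a single column; the other columns are never touched."""
    col = table[field]
    return sum(col) / len(col)


row_mean = mean_row_oriented(rows, "temp_c")
col_mean = mean_columnar(columns, "temp_c")
```

Both functions return the same value. The difference is in what they touch: the columnar version reads one typed array, which is the property that tends to help aggregations over wide tables with many rows.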
Apache Arrow
- New Parquet writer allows easy insert/delete/edit
-
Show HN: Aiopandas – Async .apply() and .map() for Pandas, Faster API/LLMs Calls
https://github.com/apache/arrow/blob/main/python/pyarrow/tes...
pyarrow/src/arrow/python/async.h:
-
Adding concurrent read/write to DuckDB with Arrow Flight
@1egg0myegg0 that's great to hear. I'll check to see if it applies to Arrow.
Another performance issue with DuckDB/Arrow integration that we've been working to solve is that Arrow lacked a canonical way to pass statistics along with a stream of data. So for example if you're reading Parquet files and passing them to DuckDB, you would lose the ability to pass the Parquet column statistics to DuckDB for things like join order optimization. We recently added an API to Arrow to enable passing statistics, and the DuckDB devs are working to implement this. Discussion at https://github.com/apache/arrow/issues/38837.
-
Unlocking DuckDB from Anywhere - A Guide to Remote Access with Apache Arrow and Flight RPC (gRPC)
Apache Arrow: It contains a set of technologies that enable big data systems to process and move data fast.
-
Using Polars in Rust for high-performance data analysis
One of the main selling points of Polars over similar solutions such as Pandas is performance. Polars is written in highly optimized Rust and uses the Apache Arrow container format.
-
Kotlin DataFrame ❤️ Arrow
Kotlin DataFrame v0.14 comes with improvements for reading Apache Arrow format, especially loading a DataFrame from any ArrowReader. This improvement can be used to easily load results from analytical databases (such as DuckDB, ClickHouse) directly into Kotlin DataFrame.
- Random access string compression with FSST and Rust
-
Declarative Multi-Engine Data Stack with Ibis
Apache Arrow
-
Shades of Open Source - Understanding The Many Meanings of "Open"
It's this kind of certainty that underscores the vital role of the Apache Software Foundation (ASF). Many first encounter Apache through its pioneering project, the open-source web server framework that remains ubiquitous in web operations today. The ASF was initially created to hold the intellectual property and assets of the Apache project, and it has since evolved into a cornerstone for open-source projects worldwide. The ASF enforces strict standards for diverse contributions, independence, and activity in its projects, ensuring they can withstand the test of time as standards in software development. Many open-source projects strive to become Apache projects to gain the community credibility necessary for adoption as standard software building blocks, such as Apache Tomcat for Java web applications, Apache Arrow for in-memory data representation, and Apache Parquet for data file formatting, among others.
- The Simdjson Library
What are some alternatives?
perspective - A data visualization and analytics component, especially well-suited for large and/or streaming datasets.
Apache Spark - A unified analytics engine for large-scale data processing
hal9ai - Hal9 — Data apps powered by code and LLMs [Moved to: https://github.com/hal9ai/hal9]
FlatBuffers - FlatBuffers: Memory Efficient Serialization Library
vega-loader-arrow - Data loader for the Apache Arrow format.
h5py - HDF5 for Python -- The h5py package is a Pythonic interface to the HDF5 binary data format.
