Top 7 Python Bigdata Projects
-
vaex
Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
-
Optimus
Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark (by ironmussa)
-
reddit_sse_stream
A Server-Sent Events (SSE) stream that delivers Reddit comments and submissions to a client in near real time.
-
wbz
A parallel implementation of the bzip2 data compressor in Python. The compression pipeline uses the Burrows–Wheeler transform (BWT) and move-to-front (MTF) to improve Huffman compression. For now, the tool focuses only on compressing .csv files and other files in tabular format.
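To illustrate why the wbz entry lists the Burrows–Wheeler transform and move-to-front ahead of Huffman coding, here is a minimal pure-Python sketch of the two steps. It is not taken from wbz: the function names are hypothetical, the BWT sorts rotations naively (quadratic memory, fine only for tiny inputs), and no Huffman stage or parallelism is included. The point is only that BWT tends to cluster equal bytes, so MTF emits many small indices, which is the skewed distribution an entropy coder benefits from.

```python
from typing import List, Tuple


def bwt(data: bytes) -> Tuple[bytes, int]:
    """Naive Burrows-Wheeler transform.

    Returns the last column of the sorted rotation matrix and the row
    index at which the original input appears (needed to invert).
    """
    n = len(data)
    if n == 0:
        return b"", 0
    # Sort rotation start offsets by the rotation each one denotes.
    order = sorted(range(n), key=lambda i: data[i:] + data[:i])
    last = bytes(data[(i - 1) % n] for i in order)
    return last, order.index(0)


def inverse_bwt(last: bytes, primary: int) -> bytes:
    """Invert bwt() using the stable-sort mapping from first to last column."""
    n = len(last)
    if n == 0:
        return b""
    # next_row[j] is the position in `last` of the j-th byte of the first column.
    next_row = sorted(range(n), key=lambda i: last[i])  # sorted() is stable
    out = bytearray()
    row = primary
    for _ in range(n):
        row = next_row[row]
        out.append(last[row])
    return bytes(out)


def mtf_encode(data: bytes) -> List[int]:
    """Move-to-front: emit each byte's current index, then move it to the front."""
    alphabet = list(range(256))
    out = []
    for byte in data:
        idx = alphabet.index(byte)
        out.append(idx)
        alphabet.insert(0, alphabet.pop(idx))
    return out


def mtf_decode(indices: List[int]) -> bytes:
    """Invert mtf_encode() by replaying the same alphabet updates."""
    alphabet = list(range(256))
    out = bytearray()
    for idx in indices:
        byte = alphabet.pop(idx)
        out.append(byte)
        alphabet.insert(0, byte)
    return bytes(out)


if __name__ == "__main__":
    # Made-up tabular sample; repeated column values are typical of CSV data.
    sample = b"id,city\n1,Oslo\n2,Oslo\n3,Oslo\n"
    last, primary = bwt(sample)
    codes = mtf_encode(last)
    print(codes)  # runs of equal bytes in `last` show up as zeros here
    assert inverse_bwt(mtf_decode(codes), primary) == sample
```

A production compressor would replace the rotation sort with a suffix-array construction and process the input in blocks, which is also where parallelism would naturally fit.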
Project mention: Potential of the Julia programming language for high energy physics computing | news.ycombinator.com | 2023-12-04
> I wasn't proposing ROOT to be reimplemented in JS. That was what the GP attributed to me.
Sorry for assuming that. I really felt the pain at the thought of combining two things I hate so much (JS + ROOT).
> "Laypeople" may also think that code is optimized to the last cycle in something like HEP simulations. It's made fast enough and the optimization is nowhere near the level of e.g. graphics heavy games.
I understand that other areas may use more sophisticated optimizations, but that does not change much within the HEP community. And the code is optimized not only for simulations but for other tasks too. It is not a single-problem optimization.
> Real-time usage like high frequency large data collection will probably never happen on the "single language". But I'd guess ROOT is not used at that level either? Also at least last time I checked, ROOT is moving to Python (probably not for the hottest loops of the simulation though).
I did not mean to suggest that ROOT is used for online processing (in HEP terms). That is usually handled by optimized, compiled C++ code. My point is that you will probably never use JS or any interpreted language (or, to be pessimistic, anything other than C++) for that. At the end of the day, ROOT is much closer to C++ than to anything else, so the learning curve is not steep if you already have some C++ knowledge.
> Also at least last time I checked, ROOT is moving to Python (probably not for the hottest loops of the simulation though).
I think you mean PyROOT [1]? This is the official Python interface to ROOT. It provides a set of Python bindings to the ROOT C++ libraries, allowing Python scripts to interact directly with ROOT classes and methods as if they were native Python. But that does not amount to any rewriting. It does make things easier for end users doing analysis, while staying efficient in terms of performance, especially for operations that are heavily optimized in ROOT.
There is also uproot [2], which is a pure-Python reader and writer of ROOT files. It is not part of the official ROOT project and does not depend on the ROOT libraries. Instead, uproot re-implements ROOT's I/O functionality in Python. However, it does not provide an interface to the full range of ROOT functionality. It is particularly useful for integrating ROOT data into a Python-based data analysis pipeline, where libraries like NumPy, SciPy, Matplotlib, Pandas, etc. are used.
> Off-topic: C++ interpretation like done in ROOT seems like a really bad idea.)
I agree with you. But to be fair, the purpose of ROOT is interactive data analysis. Over the decades a lot of things got added, many experiments maintained their own soft forks, and things got messy quickly. As a result, there is not much momentum to fix problems and introduce improvements.
[1] https://root.cern/manual/python/
[2] https://github.com/scikit-hep/uproot5
Project mention: Show HN: Snowflake Data Quality Checks in Python | news.ycombinator.com | 2024-02-11
Project mention: Pushshift Live Again and How Moderators Can Request Pushshift Access | /r/pushshift | 2023-06-20
Will you still be providing the SSE Stream API using this new bearer token authentication?
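For readers unfamiliar with what an SSE stream looks like on the wire, the sketch below parses the text/event-stream line format into events using only the standard library. It makes no network call: the sample lines, the `comment` event name, and the JSON payload fields are made up for illustration and are not taken from reddit_sse_stream or Pushshift. A real client would read these lines from an HTTP response and, with bearer token authentication, would typically send the token in an `Authorization` header.

```python
import json
from typing import Dict, Iterable, Iterator, List, Optional


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, Optional[str]]]:
    """Yield one dict per event from text/event-stream lines.

    Handles the `event`, `data` and `id` fields, comment lines that
    start with ':', and blank lines as event terminators.
    """
    event_type = "message"          # default event type
    data_lines: List[str] = []
    event_id: Optional[str] = None  # the last seen id persists across events
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the buffered event, if it has data.
            if data_lines:
                yield {"event": event_type, "data": "\n".join(data_lines), "id": event_id}
            event_type, data_lines = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space after the colon is dropped
        if field == "event":
            event_type = value
        elif field == "data":
            data_lines.append(value)
        elif field == "id":
            event_id = value


# Hypothetical stream contents, for illustration only.
SAMPLE_STREAM = [
    ": keep-alive\n",
    "event: comment\n",
    "id: 1\n",
    'data: {"author": "example_user", "body": "hello"}\n',
    "\n",
    "data: line one\n",
    "data: line two\n",
    "\n",
]

if __name__ == "__main__":
    for event in parse_sse(SAMPLE_STREAM):
        if event["event"] == "comment":
            print(json.loads(event["data"])["body"])
        else:
            print(event["data"])
```

Because the parser takes any iterable of lines, the same function works on a list in a test and on a streamed response body in a client.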
Python Bigdata related posts
- Pushshift Live Again and How Moderators Can Request Pushshift Access
- Thoughts on a pushshift alternative
- Introducing Sunbelt, a new service similar to Pushshift
- High performance (for the consumer) time series storage?
- Python Pandas vs Dask for csv file reading
- For stocks, what historical data do you store and how do you store it?
- A Hybrid Apache Arrow/Numpy DataFrame with Vaex Version 4.0
Index
What are some of the best open-source Bigdata projects in Python? This list will help you:
# | Project | Stars
---|---|---
1 | vaex | 8,173
2 | dpark | 2,691
3 | Optimus | 1,446
4 | uproot5 | 218
5 | cuallee | 105
6 | reddit_sse_stream | 47
7 | wbz | 13