Top 23 Python Big Data Projects
-
data-science-ipython-notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
-
Project mention: I Use Nim Instead of Python for Data Processing | news.ycombinator.com | 2024-09-05
-
catboost
A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.
Project mention: CatBoost: Open-source gradient boosting library | news.ycombinator.com | 2024-03-05
-
Project mention: RFC: The Feast Kubernetes Operator (The Open Source Feature Store) | news.ycombinator.com | 2024-09-24
Hey folks!
I'm a maintainer for Feast (https://github.com/feast-dev/feast) (the Open Source Feature Store) and the Feast community is working on creating a Kubernetes Operator for deploying Feast on Kubernetes and would love any feedback you have before we get started!
Here is the GitHub issue: https://github.com/feast-dev/feast/issues/4561, a design doc: https://docs.google.com/document/d/1vGKMizf3_14IyiF_W_Ik7CR0..., and a Slack channel: https://communityinviter.com/apps/feastopensource/feast-the-...!
Thanks a ton in advance for your interest/comments!
-
Stream-Framework
Stream Framework is a Python library that lets you build news feeds, activity streams, and notification systems using Cassandra and/or Redis. The authors of Stream-Framework also provide a cloud service for feed technology.
Project mention: Building a Slack Clone with Next.js and TailwindCSS - Part One | dev.to | 2024-12-24
Stream is a platform that allows developers to add rich chat and video features to their applications. Instead of dealing with the complexity of creating chat and video from the ground up, Stream provides APIs and SDKs to help you add them quickly and easily.
-
img2dataset
Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
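As a rough sanity check on that figure, the sustained rate it implies can be worked out directly. This is only back-of-the-envelope arithmetic on the numbers quoted in the description; the helper name `urls_per_second` is made up for illustration and is not part of img2dataset.

```python
# Average throughput implied by "100M urls in 20h on one machine".
# Both constants come from the project description above; the helper
# function is a hypothetical illustration, not an img2dataset API.
URLS = 100_000_000  # 100M URLs
HOURS = 20          # 20 hours of wall-clock time


def urls_per_second(total_urls: int, hours: float) -> float:
    """Return the average rate needed to process total_urls in the given hours."""
    return total_urls / (hours * 3600)


rate = urls_per_second(URLS, HOURS)
print(f"{rate:.0f} URLs/s")  # about 1389 URLs/s
```

In other words, the claim corresponds to downloading, resizing, and packaging roughly 1,400 images per second on average for the whole run.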
-
scikit-learn-intelex
Intel(R) Extension for Scikit-learn is a seamless way to speed up your Scikit-learn application.
There is a somewhat similar project that supports Intel GPU offloading: https://github.com/intel/scikit-learn-intelex
-
listenbrainz-server
Server for the ListenBrainz project, including the front-end (javascript/react) code that it serves and all of the data processing components that LB uses.
It really is an incredible resource, and Picard is a wonderful app. Very satisfying getting a library properly tagged! Takes a while, but totally worth it. Shoutout to ListenBrainz as well, their scrobbling service: https://listenbrainz.org/
-
Project mention: LHC experiments at CERN observe quantum entanglement at the highest energy yet | news.ycombinator.com | 2024-09-22
We already make the data available publicly...
http://opendata.cern.ch/
-
lithops
A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀
Lithops is a great library for abstracting over different cloud vendors so you can interact with a uniform API.
https://github.com/lithops-cloud/lithops
-
Yeah I think it has to do with the memberwise splitting. https://github.com/scikit-hep/uproot5/issues/38
I understand this has not been a priority so far.
It kinda works if you open a magic file with a specific on-disk representation which bypasses this, but that’s not a solution at all.
-
amazon-s3-find-and-forget
Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)
-
GitHub URL: https://github.com/bodo-ai/Bodo
-
Python Big Data discussion
Python Big Data related posts
-
Github's Top 31 items of Dec 18, 2024
-
RFC: The Feast Kubernetes Operator (The Open Source Feature Store)
-
How to build a new Harlequin adapter with Poetry
-
Show HN: Interactive Graph by LLM (GPT-4o)
-
What's Happening with Feast?
-
Blaze: Fast query execution engine for Apache Spark
-
EP Showing Twice
Index
What are some of the best open-source Big Data projects in Python? This list will help you:
# | Project | Stars |
---|---|---|
1 | data-science-ipython-notebooks | 27,721 |
2 | Cookbook | 13,937 |
3 | Cython | 9,701 |
4 | catboost | 8,195 |
5 | feast | 5,722 |
6 | Stream-Framework | 4,732 |
7 | img2dataset | 3,845 |
8 | koalas | 3,346 |
9 | scikit-learn-intelex | 1,241 |
10 | DataEngineeringProject | 1,167 |
11 | Hail | 989 |
12 | NIPY | 752 |
13 | listenbrainz-server | 721 |
14 | opendata.cern.ch | 668 |
15 | eland | 659 |
16 | redislite | 585 |
17 | video2dataset | 560 |
18 | lithops | 325 |
19 | uproot5 | 241 |
20 | amazon-s3-find-and-forget | 240 |
21 | pmaw | 215 |
22 | Bodo | 176 |
23 | hazelcast-python-client | 110 |