Top 23 Python Big Data Projects

data-science-ipython-notebooks

1 26,459 0.0 Python

Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

Cython

79 8,912 9.8 Python

The most widely used Python to C compiler

Project mention: Ask HN: C/C++ developer wanting to learn efficient Python | news.ycombinator.com | 2024-04-10

catboost

8 7,744 9.9 Python

A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

Project mention: CatBoost: Open-source gradient boosting library | news.ycombinator.com | 2024-03-05

feast

8 5,255 9.3 Python

Feature Store for Machine Learning

Project mention: What's Happening with Feast? | news.ycombinator.com | 2023-12-07

Stream-Framework

34 4,719 0.0 Python

Stream Framework is a Python library, which allows you to build news feed, activity streams and notification systems using Cassandra and/or Redis. The authors of Stream-Framework also provide a cloud service for feed technology:

Project mention: Recommendations for an external messenger integration/API? | /r/rails | 2023-10-30

I have looked into a getstream.io integration; however, it seems that the Ruby SDK is really treated as a second-class citizen. There are bugs with the documented API (I'm having issues even creating and querying users), the usage of the gem is low, and there is an issue open since May that no one has even looked at, which doesn't give me hope for long-term support.

koalas

2 3,319 4.6 Python

Koalas: pandas API on Apache Spark

img2dataset

13 3,242 7.3 Python

Easily turn large sets of image URLs into an image dataset. Can download, resize and package 100M URLs in 20h on one machine.

Project mention: OpenAI sued for web scraping from millions of internet users in order to train ChatGPT | /r/ArtistHate | 2023-06-30

Lmao, no it doesn't. As we can see, their downloader uses very obscure "no ai" headers (which can be disabled, so it's useless). They only claim it respects "robots.txt" because the Google crawler respects it; if a site changes its robots.txt rules, they don't remove it from their dataset, and that is not "respecting". https://github.com/rom1504/img2dataset

scikit-learn-intelex

3 1,160 9.5 Python

Intel(R) Extension for Scikit-learn is a seamless way to speed up your Scikit-learn application

DataEngineeringProject

5 985 0.0 Python

Example end-to-end data engineering project.

Hail

5 934 9.8 Python

Cloud-native genomic dataframes and batch computing

NIPY

0 731 8.9 Python

Workflows and interfaces for neuroimaging packages

listenbrainz-server

61 643 9.9 Python

Server for the ListenBrainz project, including the front-end (JavaScript/React) code that it serves and all of the data processing components that LB uses.

Project mention: Analyzing Spotify Stream History | news.ycombinator.com | 2024-02-12

There's also ListenBrainz, run by the MusicBrainz org, which offers similar functionality without API restrictions or other paid features that Last.FM tries to push.
https://listenbrainz.org/
If you wish to use your scrobble data at all programmatically this is a far better tool to use.

opendata.cern.ch

13 635 9.3 Python

Source code for the CERN Open Data portal

Project mention: Observable 2.0, a static site generator for data apps | news.ycombinator.com | 2024-02-15

I think the idea of Framework is really good, but static data limits the applications, excluding monitoring and other cases in which the data is constantly changing, but the dashboard can stay as it is. For example, I'd love to see a revamped Framework version of the LHC beam monitor and related pages (see https://op-webtools.web.cern.ch/vistar/, but check again in 2 months or so, when the accelerator will be running).
In high-energy physics, ROOT is /the/ toolkit for data analysis, and I guess jsROOT (https://root.cern.ch/js/) could also be used to load data to be shown in Framework dashboards. I thought the idea of Framework as a blogging engine with powerful data visualization built-in could be very interesting. Think, for example, about physicists pulling open data (https://opendata.cern.ch) and writing about their analysis or someone pulling data from https://ourworldindata.org/ in their own visualizations to support their case while writing about a particular subject, etc.

eland

4 608 8.6 Python

Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch

Project mention: I'm getting elasticsearch.BadRequestError: BadRequestError(400, 'illegal_argument_exception', "specified fields can't be null or empty") using Eland library | /r/elasticsearch | 2023-05-02

We have a fix for this issue reported here merged and pending a release. Hopefully that release will happen in the next few days, then you can upgrade and the default experience for everyone won't be as confusing :)

redislite

1 564 5.0 Python

Redis in a Python module.

video2dataset

2 449 8.3 Python

Easily create large video datasets from video URLs

Project mention: FLaNK Stack Weekly for 17 July 2023 | dev.to | 2023-07-17

lithops

2 305 9.4 Python

A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀

amazon-s3-find-and-forget

3 232 7.2 Python

Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)

uproot5

2 218 9.2 Python

ROOT I/O in pure Python and NumPy.

Project mention: Potential of the Julia programming language for high energy physics computing | news.ycombinator.com | 2023-12-04

> I wasn't proposing ROOT to be reimplemented in JS. That was what the GP attributed to me.
Sorry for assuming that. I really felt the pain of thinking about the possibility of combining two things I hate so much (JS + ROOT).
> "Laypeople" may also think that code is optimized to the last cycle in something like HEP simulations. It's made fast enough and the optimization is nowhere near the level of e.g. graphics heavy games.
I understand that other areas might have more sophisticated optimizations, but that does not change things much inside the HEP community. And the code is not optimized only for simulations but for other things too. It is not a single-problem optimization.
> Real-time usage like high frequency large data collection will probably never happen on the "single language". But I'd guess ROOT is not used at that level either? Also at least last time I checked, ROOT is moving to Python (probably not for the hottest loops of the simulation though).
I did not mean to indicate that ROOT is being used to handle online processing (in HEP terms). That is usually handled by optimized, compiled C++ code. My point is that you will probably never use JS or any interpreted language (or anything other than C++, to be pessimistic) for that. ROOT at the end of the day is much closer to C++ than anything else, so the learning curve isn't steep if you come with some C++ knowledge.
> Also at least last time I checked, ROOT is moving to Python (probably not for the hottest loops of the simulation though).
I think you mean PyROOT [1]? This is the official Python ROOT interface. It provides a set of Python bindings to the ROOT C++ libraries, allowing Python scripts to interact directly with ROOT classes and methods as if they were native Python. But that does not represent any rewriting. It makes things easier for end users who are doing analysis, though, while being efficient in terms of performance, especially for operations that are heavily optimized in ROOT.
There is also uproot [2], which is a pure-Python reader and writer of ROOT files. It is not a part of the official ROOT project and does not depend on the ROOT libraries. Instead, uproot re-implements the I/O functionality of ROOT in Python. However, it does not provide an interface to the full range of ROOT functionality. It is particularly useful for integrating ROOT data into a Python-based data analysis pipeline, where libraries like NumPy, SciPy, Matplotlib, pandas, etc. are used.
> Off-topic: C++ interpretation like done in ROOT seems like a really bad idea.
I agree with you. But to be fair, the purpose of ROOT is interactive data analysis; over the decades a lot of things got added, many experiments had their own soft forks, and things got very messy quickly. As a result there is not much momentum to fix problems and introduce improvements.
[1] https://root.cern/manual/python/
[2] https://github.com/scikit-hep/uproot5

pmaw

21 212 0.0 Python

A multithread Pushshift.io API Wrapper for reddit.com comment and submission searches.

Project mention: I am making a replacement service to keep the 3rd party apps running. Join me. | /r/Save3rdPartyApps | 2023-06-05

Oh a whole lot of them. Pushshift was large enough to have its own libraries ( https://github.com/mattpodolak/pmaw ) so the first step will be to recreate the API 1:1 and get tools like reddit search and reveddit working again.

hazelcast-python-client

4 113 6.8 Python

Hazelcast Python Client

covid19-sir

3 108 9.3 Python

CovsirPhy: Python library for COVID-19 analysis with phase-dependent SIR-derived ODE models.
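
For readers unfamiliar with the term, a "phase-dependent SIR-derived ODE model" can be pictured roughly as follows. This is only a minimal sketch in plain Python, not CovsirPhy's API: the function name `simulate_sir_phases`, the forward Euler integration, and every parameter value are assumptions made for illustration. It integrates the basic SIR equations (dS/dt = -βSI/N, dI/dt = βSI/N - γI, dR/dt = γI) and lets β and γ change from one phase to the next.

```python
def simulate_sir_phases(population, infected0, phases, dt=0.1):
    """Integrate the basic SIR equations with forward Euler steps.

    phases is a list of (n_days, beta, gamma) tuples; the rates stay
    constant within a phase and change at phase boundaries.
    Returns one (day, S, I, R) tuple per day, starting at day 0.
    """
    s = float(population - infected0)
    i = float(infected0)
    r = 0.0
    steps_per_day = round(1 / dt)
    history = [(0, s, i, r)]
    day = 0
    for n_days, beta, gamma in phases:
        for _ in range(n_days):
            for _ in range(steps_per_day):
                new_infections = beta * s * i / population * dt
                new_recoveries = gamma * i * dt
                s -= new_infections
                i += new_infections - new_recoveries
                r += new_recoveries
            day += 1
            history.append((day, s, i, r))
    return history


if __name__ == "__main__":
    # Hypothetical two-phase scenario: fast spread, then reduced contact.
    phases = [(20, 0.3, 0.1), (20, 0.05, 0.1)]
    for day, s, i, r in simulate_sir_phases(1000, 1, phases)[::10]:
        print(f"day {day:2d}: S={s:7.1f} I={i:6.1f} R={r:6.1f}")
```

In this hypothetical scenario the infected count grows while β exceeds γ and declines once the second phase lowers β below γ. A real analysis would estimate the phase boundaries and rates from data, which is the part a dedicated library takes care of.
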
Lime-For-Time

1 92 0.0 Python

Application of the LIME algorithm by Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin to the domain of time series classification

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).

Python Big Data related posts

What's Happening with Feast?
1 project | news.ycombinator.com | 7 Dec 2023
Blaze: Fast query execution engine for Apache Spark
3 projects | news.ycombinator.com | 19 Oct 2023
EP Showing Twice
1 project | /r/lastfm | 21 Sep 2023
Running The Feast Feature Store With Dragonfly
1 project | dev.to | 8 Aug 2023
Seeking Feedback: Tracking Vinyl Listening History
1 project | /r/vinyl | 7 Jul 2023
OpenAI sued for web scraping from millions of internet users in order to train ChatGPT
1 project | /r/ArtistHate | 30 Jun 2023
Who kept the bots out? Stopping content being harvested by AI
2 projects | dev.to | 22 Jun 2023

Index

What are some of the best open-source Big Data projects in Python? This list will help you:

#	Project	Stars
1	data-science-ipython-notebooks	26,459
2	Cython	8,912
3	catboost	7,744
4	feast	5,255
5	Stream-Framework	4,719
6	koalas	3,319
7	img2dataset	3,242
8	scikit-learn-intelex	1,160
9	DataEngineeringProject	985
10	Hail	934
11	NIPY	731
12	listenbrainz-server	643
13	opendata.cern.ch	635
14	eland	608
15	redislite	564
16	video2dataset	449
17	lithops	305
18	amazon-s3-find-and-forget	232
19	uproot5	218
20	pmaw	212
21	hazelcast-python-client	113
22	covid19-sir	108
23	Lime-For-Time	92