Python Big Data

Open-source Python projects categorized as Big Data

Top 23 Python Big Data Projects

  1. data-science-ipython-notebooks

    Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

  3. Cookbook

    The Data Engineering Cookbook

  4. Cython

    The most widely used Python-to-C compiler

    Project mention: I Use Nim Instead of Python for Data Processing | news.ycombinator.com | 2024-09-05
  5. catboost

    A fast, scalable, high-performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression, and other machine learning tasks in Python, R, Java, and C++. Supports computation on CPU and GPU.

    Project mention: CatBoost: Open-source gradient boosting library | news.ycombinator.com | 2024-03-05
  6. feast

    The Open Source Feature Store for Machine Learning

    Project mention: RFC: The Feast Kubernetes Operator (The Open Source Feature Store) | news.ycombinator.com | 2024-09-24

    Hey folks!

    I'm a maintainer for Feast (https://github.com/feast-dev/feast) (the Open Source Feature Store) and the Feast community is working on creating a Kubernetes Operator for deploying Feast on Kubernetes and would love any feedback you have before we get started!

    Here is the GitHub issue: https://github.com/feast-dev/feast/issues/4561, a design doc: https://docs.google.com/document/d/1vGKMizf3_14IyiF_W_Ik7CR0..., and a Slack channel: https://communityinviter.com/apps/feastopensource/feast-the-...!

    Thanks a ton in advance for your interest/comments!

  7. Stream-Framework

    Stream Framework is a Python library that lets you build news feeds, activity streams, and notification systems using Cassandra and/or Redis. The authors of Stream-Framework also provide a cloud service for feed technology.

    Project mention: Building a Slack Clone with Next.js and TailwindCSS - Part One | dev.to | 2024-12-24

    Stream is a platform that allows developers to add rich chat and video features to their applications. Instead of dealing with the complexity of creating chat and video from the ground up, Stream provides APIs and SDKs to help you add them quickly and easily.

  8. img2dataset

    Easily turn large sets of image URLs into an image dataset. Can download, resize, and package 100M URLs in 20 hours on one machine.

  9. koalas

    Koalas: pandas API on Apache Spark

  10. scikit-learn-intelex

    Extension for Scikit-learn is a seamless way to speed up your Scikit-learn application

    Project mention: CuPy: NumPy and SciPy for GPU | news.ycombinator.com | 2024-09-20

    There is a somewhat similar project that supports Intel GPU offloading: https://github.com/intel/scikit-learn-intelex

  11. DataEngineeringProject

    Example end to end data engineering project.

  12. Hail

    Cloud-native genomic dataframes and batch computing

    Project mention: Ask HN: Who is hiring? (October 2024) | news.ycombinator.com | 2024-10-01
  13. NIPY

    Workflows and interfaces for neuroimaging packages

  14. listenbrainz-server

    Server for the ListenBrainz project, including the front-end (JavaScript/React) code that it serves and all of the data processing components that LB uses.

    Project mention: The Open Music Encyclopedia | news.ycombinator.com | 2024-09-30

    It really is an incredible resource, and Picard is a wonderful app. Very satisfying getting a library properly tagged! Takes a while, but totally worth it. Shoutout to ListenBrainz as well, their scrobbling service: https://listenbrainz.org/

  15. opendata.cern.ch

    Source code for the CERN Open Data portal

    Project mention: LHC experiments at CERN observe quantum entanglement at the highest energy yet | news.ycombinator.com | 2024-09-22

    We already make the data available publicly...

    http://opendata.cern.ch/

  16. eland

    Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch

  17. redislite

    Redis in a Python module.

  18. video2dataset

    Easily create large video datasets from video URLs

  19. lithops

    A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀

    Project mention: Why aren't we all serverless yet? | news.ycombinator.com | 2025-01-09

    Lithops is a great library for abstracting over different cloud vendors so you can interact with a uniform API.

    https://github.com/lithops-cloud/lithops

  20. uproot5

    ROOT I/O in pure Python and NumPy.

    Project mention: CERN Root | news.ycombinator.com | 2024-06-01

    Yeah I think it has to do with the memberwise splitting. https://github.com/scikit-hep/uproot5/issues/38

    I understand this has not been a priority so far.

    It kinda works if you open a magic file with a specific on-disk representation which bypasses this, but that’s not a solution at all.

  21. amazon-s3-find-and-forget

    Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)

  22. pmaw

    A multithreaded Pushshift.io API wrapper for reddit.com comment and submission searches.

  23. Bodo

    High-Performance Python Compute Engine for Data and AI

    Project mention: Github's Top 31 items of Dec 18, 2024 | dev.to | 2024-12-18

    GitHub URL: https://github.com/bodo-ai/Bodo

  24. hazelcast-python-client

    Hazelcast Python Client

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).
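
The ordering rule in the note can be illustrated with a short sketch: a descending sort on star count, followed by numbering from 1. The four star counts below are copied from this page's index; the tuple layout and variable names are illustrative assumptions, not the site's actual data model.

```python
# Star counts copied from this page's index; the (name, stars) tuple
# layout is an illustrative assumption, not the site's data model.
projects = [
    ("feast", 5722),
    ("data-science-ipython-notebooks", 27721),
    ("Cython", 9701),
    ("catboost", 8195),
]

# Sort by star count, highest first, then number the rows from 1.
ranked = sorted(projects, key=lambda item: item[1], reverse=True)
for position, (name, stars) in enumerate(ranked, start=1):
    print(f"{position} {name} {stars:,}")
```

Sorting on the second tuple element keeps the project name attached to its count, so the printed rank always reflects the stars shown next to it.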

Index

What are some of the best open-source Big Data projects in Python? This list will help you:

#   Project                          Stars
1   data-science-ipython-notebooks   27,721
2   Cookbook                         13,937
3   Cython                           9,701
4   catboost                         8,195
5   feast                            5,722
6   Stream-Framework                 4,732
7   img2dataset                      3,845
8   koalas                           3,346
9   scikit-learn-intelex             1,241
10  DataEngineeringProject           1,167
11  Hail                             989
12  NIPY                             752
13  listenbrainz-server              721
14  opendata.cern.ch                 668
15  eland                            659
16  redislite                        585
17  video2dataset                    560
18  lithops                          325
19  uproot5                          241
20  amazon-s3-find-and-forget        240
21  pmaw                             215
22  Bodo                             176
23  hazelcast-python-client          110
