Python Big Data

Open-source Python projects categorized as Big Data

Top 23 Python Big Data Projects

  • data-science-ipython-notebooks

    Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

  • Cython

    The most widely used Python to C compiler

    Project mention: Do you guys have any resources for learning C++ and/or Fortran programming for physics? | reddit.com/r/Physics | 2022-11-21

    If you're already familiar with python and numpy, you might want to look into cython as an intermediate step before going straight into C. It allows you to compile most python code into a static binary that can be imported into a python script just like any other library. This allows you to get performance close to raw C without having to invest much effort, and has a lot of bells and whistles like "auto"-parallelization with openmp.

  • Stream-Framework

    Stream Framework is a Python library that lets you build news feeds, activity streams, and notification systems using Cassandra and/or Redis. The authors of Stream-Framework also provide a cloud service for feed technology.

    Project mention: Is there any reason why you should not build chat functionality for your project and use chat as a service provider instead? | reddit.com/r/reactjs | 2022-11-16

    I have seen multiple developers on YouTube implement chat functionality using a third-party service provider, https://getstream.io/. It's an SDK that provides easy-to-use React components, which seems convenient; however, it starts with a price tag of $499 per month for 10K users.

  • feast

    Feature Store for Machine Learning

    Project mention: [D] Your 🫵 Preferred Feature Stores? | reddit.com/r/datascience | 2022-07-03

  • koalas

    Koalas: pandas API on Apache Spark

    Project mention: My new company uses Pyspark. I want to learn it before my starting date. Any advice? | reddit.com/r/datascience | 2022-11-10

    If they're using Databricks and you're familiar with pandas, koalas should be right up your alley.

  • img2dataset

    Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.

    Project mention: A Stable Diffusion prompt changes its output for the style of 1500 artists | news.ycombinator.com | 2022-10-02

    You can download it here:

    https://github.com/rom1504/img2dataset/blob/main/dataset_exa...

    You probably would want to stop after getting the metadata, unless you have 240TB available for the images :)

    More details and links to dataset explorers here: https://laion.ai/blog/laion-5b/

  • Hail

    Scalable genomic data analysis.

    Project mention: Software engineers: consider working on genomics | news.ycombinator.com | 2022-11-19

    I don't have any funding to hire right now, but I'm always happy to chat about the industry and my experience building Hail (https://hail.is, https://github.com/hail-is/hail), a tool widely used by folks with large collections of human sequences.

    The other posters are not wrong about compensation. Total compensation is off by a factor of two to three.

    However, it is absolutely possible to work with a group of top-notch engineers on serious distributed systems & compilers in service of an excellent scientific-user experience. I know because I do. We are lucky to have a PI who respects and hires a diversity of expertise within his lab.

    I enjoy being deeply embedded with our users. I do not have to guess what they need or want because I help them do it every day.

    I also enjoy enmeshing engineering with statistics, mathematics, and biology. Work is more interesting when so many disciplines conspire towards the end of improved human health.

  • scikit-learn-intelex

    Intel(R) Extension for Scikit-learn is a seamless way to speed up your Scikit-learn application

    Project mention: Machine Learning with PyTorch and Scikit-Learn – The *New* Python ML Book | news.ycombinator.com | 2022-02-25

  • NIPY

    Workflows and interfaces for neuroimaging packages

  • DataEngineeringProject

    Example end-to-end data engineering project.

    Project mention: What are your favourite GitHub repos that shows how data engineering should be done? | reddit.com/r/dataengineering | 2022-11-18

  • opendata.cern.ch

    Source code for the CERN Open Data portal

    Project mention: Good Series, Tutorial, or Book on Particle Physics Analysis using Python or Root for Undergraduates | reddit.com/r/ParticlePhysics | 2022-11-09

    CERN Open Data has lots of examples from various collaborations: https://opendata.cern.ch/

  • redislite

    Redis in a python module.

    Project mention: Devious SQL: Message Queuing Using Native PostgreSQL | news.ycombinator.com | 2022-01-28

    If you are developing on Linux, rq + redis is probably the simplest message queueing system around. Redis can even be installed from pip,

    https://pypi.org/project/redis/

    or run within python,

    https://github.com/yahoo/redislite

    it's pretty much the easiest thing to deploy.

  • listenbrainz-server

    Server for the ListenBrainz project, including the front-end (javascript/react) code that it serves and all of the data processing components that LB uses.

    Project mention: Music stats aggregator(?) | reddit.com/r/AppleMusic | 2022-11-28

  • eland

    Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch

    Project mention: hey, any idea about how to automatically extract data from kibana to a panda data frame in order to be analyzed? | reddit.com/r/kibana | 2022-06-13

  • lithops

    A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀

    Project mention: [D] For those of you who don't own a GPU, how do you run your experiments or train your models? | reddit.com/r/MachineLearning | 2021-12-19

    At work for non-ML/non-GPU stuff I've been using Lithops for running code on dynamically-provisioned cloud resources (serverless or VM). It pickles your code & runtime variables, sends them to cloud storage, runs the code & downloads the results, all relatively transparently. You're just calling Python functions with Python objects on your local computer and not having to worry about deploying your code, packaging your data, etc. Better still, you can scale up for things like hyperparameter sweeps by just dispatching more calls in parallel, and it will provision more resources.
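    The pickle, ship, run, and collect cycle described in this comment can be sketched locally with the standard library alone. This is only an illustration of the pattern, not Lithops' API: `run_remote` and `parallel_map` are hypothetical names, a thread pool stands in for dynamically provisioned cloud workers, and byte strings stand in for objects in cloud storage.

```python
import math
import pickle
from concurrent.futures import ThreadPoolExecutor


def run_remote(payload: bytes) -> bytes:
    """Hypothetical worker: unpickle the call, run it, pickle the result."""
    func, arg = pickle.loads(payload)
    return pickle.dumps(func(arg))


def parallel_map(func, items, max_workers=4):
    """Serialize each call, dispatch the calls in parallel, collect results in order."""
    # Assumption: plain pickle stores functions by reference, so `func` must be
    # importable on the worker side. A real remote system has to ship the code itself.
    payloads = [pickle.dumps((func, item)) for item in items]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_remote, payloads))
    return [pickle.loads(result) for result in results]


if __name__ == "__main__":
    # Scaling up means dispatching more calls, not changing the calling code.
    print(parallel_map(math.sqrt, [1, 4, 9]))  # [1.0, 2.0, 3.0]
```

    The caller only ever sees Python functions and Python objects, which is the convenience the comment describes.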

  • amazon-s3-find-and-forget

    Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)

    Project mention: Deleting particular data from S3 External Tables | reddit.com/r/dataengineering | 2022-10-31

    Take a look at this: https://github.com/awslabs/amazon-s3-find-and-forget. We use it for GDPR compliance; it will open a file, delete a row, and pack it back. It will modify the file, so watch out if you are using Glue job bookmarks. Because you are using external tables, the manifest file will also have to be updated with the proper length for the new, updated file. If you have hundreds of tables and thousands of files and you need to do this on a regular basis, this would be the scalable solution, but if you have only a few files, honestly I would do it manually.
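    The "open a file, delete a row and pack it back" step can be sketched on an in-memory CSV. This is a simplified illustration under stated assumptions, not the tool's implementation: `forget_rows` is a hypothetical helper, and no S3 access or data-lake file formats are involved.

```python
import csv
import io


def forget_rows(csv_text, column, ids_to_forget):
    """Drop every row whose `column` value is in `ids_to_forget`.

    Returns the rewritten text and its size in bytes, since anything that
    records the file length (such as a manifest) needs the new value.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = [row for row in reader if row[column] not in ids_to_forget]

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(kept)

    new_text = out.getvalue()
    return new_text, len(new_text.encode("utf-8"))


if __name__ == "__main__":
    original = "user_id,email\n1,a@example.com\n2,b@example.com\n"
    rewritten, size = forget_rows(original, "user_id", {"2"})
    print(rewritten, size)
```

    Because the rewritten file is shorter than the original, any stored length has to be refreshed, which is the manifest caveat raised in the comment.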

  • uproot5

    ROOT I/O in pure Python and NumPy.

  • pmaw

    A multithread Pushshift.io API Wrapper for reddit.com comment and submission searches.

    Project mention: Dev Diary #7 - Comment Scraping | reddit.com/r/RecMe | 2022-11-13

  • hazelcast-python-client

    Hazelcast Python Client

  • covid19-sir

    CovsirPhy: Python library for COVID-19 analysis with phase-dependent SIR-derived ODE models.
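    As a rough illustration of what a phase-dependent SIR model means, the sketch below integrates the textbook SIR equations with forward Euler steps and lets the rates change from one phase to the next. It does not use CovsirPhy: `simulate_sir`, the phase tuples, and all parameter values are assumptions made for this example.

```python
def simulate_sir(population, infected0, phases, dt=0.1):
    """Integrate S, I, R with forward Euler steps.

    `phases` is a list of (days, beta, gamma) tuples, so the infection rate
    beta and the recovery rate gamma can differ between phases.
    """
    s, i, r = float(population - infected0), float(infected0), 0.0
    history = [(s, i, r)]
    for days, beta, gamma in phases:
        for _ in range(round(days / dt)):
            new_infections = beta * s * i / population * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
            history.append((s, i, r))
    return history


if __name__ == "__main__":
    # Hypothetical scenario: fast spread for 30 days, then a lower beta.
    trajectory = simulate_sir(1000, 1, [(30, 0.4, 0.1), (30, 0.1, 0.1)])
    print(trajectory[-1])
```

    Each step moves people from S to I and from I to R, so the total population stays constant up to floating-point error.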

    Project mention: Show HN: Deptry, a tool to check for dependency issues in a Python project | news.ycombinator.com | 2022-09-14

    I have recently been working on a project called `deptry`, a command line tool to check for issues in the dependencies of Python projects. It can be used to find obsolete, missing, transitive and misplaced development dependencies. It supports the following types of projects:

    - Projects that use Poetry and a corresponding pyproject.toml file

    - Projects that use a requirements.txt file according to the pip standards

    ---

    * Documentation: https://fpgmaas.github.io/deptry/

    * GitHub repository: https://github.com/fpgmaas/deptry

    ---

    I am quite happy with the project in its current form, but I also realise there is still a lot of room left for improvement. Therefore, I hope some people are willing to give it a try and provide me with feedback. So; if you have a project with a long list of dependencies and a little bit of spare time on your hands, please give it a try and let me know what you think!

    If you encounter any issues, find a bug, or have any other form of feedback, please don't hesitate to raise an issue in the GitHub repository, or leave a comment here.

    Kind regards,

    Florian

    P.S. Many thanks to Hirokazu Takaya (https://github.com/lisphilar) for incorporating it in the CI/CD pipeline of his project covid19-sir (https://github.com/lisphilar/covid19-sir). It provided me with very valuable early feedback.

  • Lime-For-Time

    Application of the LIME algorithm by Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin to the domain of time series classification

  • yaetos

    Write data & AI pipelines in (SQL, Spark, Pandas) and deploy to the cloud, simplified

    Project mention: Yaetos-A framework to simplify the creation of data pipelines | news.ycombinator.com | 2022-04-06

  • pyspark-on-aws-emr

    The goal of this project is to offer an AWS EMR template using Spot Fleet and On-Demand Instances that you can use quickly. Just focus on writing pyspark code.

    Project mention: Data Engineering Projects for Beginners | dev.to | 2022-06-15

    Building Big Data Pipelines in the Cloud with AWS EMR

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2022-11-28.

Index

What are some of the best open-source Big Data projects in Python? This list will help you:

 #  Project                          Stars
 1  data-science-ipython-notebooks   24,323
 2  Cython                            7,487
 3  Stream-Framework                  4,655
 4  feast                             3,780
 5  koalas                            3,209
 6  img2dataset                       1,235
 7  Hail                                844
 8  scikit-learn-intelex                844
 9  NIPY                                654
10  DataEngineeringProject              636
11  opendata.cern.ch                    546
12  redislite                           528
13  listenbrainz-server                 521
14  eland                               456
15  lithops                             240
16  amazon-s3-find-and-forget           198
17  uproot5                             162
18  pmaw                                127
19  hazelcast-python-client             102
20  covid19-sir                          91
21  Lime-For-Time                        75
22  yaetos                               24
23  pyspark-on-aws-emr                   16