Python Big Data

Open-source Python projects categorized as Big Data

Top 19 Python Big Data Projects

  • GitHub repo data-science-ipython-notebooks

    Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

    Project mention: Beginner in Python for Data Science | reddit.com/r/learnpython | 2020-12-27

    data science ipython notebooks

  • GitHub repo spaCy

    💫 Industrial-strength Natural Language Processing (NLP) in Python

    Project mention: Two Methods to Scan for PII in Data Warehouses | dev.to | 2021-11-29

    NLP libraries such as Stanford NER Detector and spaCy

  • GitHub repo Cython

    The most widely used Python to C compiler

    Project mention: PyTorch: Where we are headed and why it looks a lot like Julia (but not exactly) | news.ycombinator.com | 2021-11-26

  • GitHub repo Stream-Framework

    Stream Framework is a Python library, which allows you to build news feed, activity streams and notification systems using Cassandra and/or Redis. The authors of Stream-Framework also provide a cloud service for feed technology:

    Project mention: Getstream.io | reddit.com/r/reactnative | 2021-12-04

    I also came across Getstream.io; it looks awesome.

  • GitHub repo koalas

    Koalas: pandas API on Apache Spark

    Project mention: Spark vs Pandas | reddit.com/r/dataengineering | 2021-02-18

    If you like excessive use of square brackets... I mean pandas, you might want to check out Koalas. Koalas is supposed to provide a pandas DataFrame API implementation atop Spark.

  • GitHub repo feast

    Feature Store for Machine Learning

    Project mention: [P] Announcing Feast 0.10: The simplest way to serve features in production | reddit.com/r/MachineLearning | 2021-04-15

    Github: https://github.com/feast-dev/feast

  • GitHub repo Hail

    Scalable genomic data analysis.

    Project mention: Ask HN: Who is hiring? (July 2021) | news.ycombinator.com | 2021-07-01

    Broad Institute of MIT and Harvard | Cambridge, MA | Associate Software Engineer | Onsite

    We are seeking an associate software engineer interested in contributing to an open-source data visualization library for analyzing the biological impact of human genetic variation. You will contribute to projects like gnomAD (https://gnomad.broadinstitute.org), the world's largest catalogue of human genetic variation, used by hundreds of thousands of researchers, and help us scale towards millions of genomes in the coming years. We are also developing next-generation tools for enabling genetic analyses of large biobanks across richly phenotyped individuals (https://genebass.org). In this role you will gain experience developing data-intensive web applications with TypeScript, React, Python, Terraform, and Google Cloud Platform, and will make use of the scalable data analysis library Hail (https://hail.is). Key to our success is growing a strong team with a diverse membership who foster a culture of continual learning, and who support the growth and success of one another. Towards this end, we are committed to seeking applications from women and from underrepresented groups. We know that many excellent candidates choose not to apply despite their capabilities; please allow us to enthusiastically counter this tendency.

    Please provide a CV and links to previous work or projects, ideally with contributions visible on GitHub.

    email: [email protected]

  • GitHub repo NIPY

    Workflows and interfaces for neuroimaging packages

  • GitHub repo listenbrainz-server

    Server for the ListenBrainz project, including the front-end (javascript/react) code that it serves and all of the data processing components that LB uses.

    Project mention: ListenBrainz | news.ycombinator.com | 2021-12-01

  • GitHub repo scikit-learn-intelex

    Intel(R) Extension for Scikit-learn is a seamless way to speed up your Scikit-learn application

    Project mention: Intel Extension for Scikit-Learn | news.ycombinator.com | 2021-11-01

    Looks like they are responding to https://github.com/intel/scikit-learn-intelex#-acceleration

    I completely agree. I hope some Intel competitor funds a scikit-learn developer to read this code and extract all the portable performance improvements.

  • GitHub repo uproot3

    ROOT I/O in pure Python and NumPy.

    Project mention: Root with python | reddit.com/r/ParticlePhysics | 2021-06-22

    Besides PyROOT, you can also use uproot to read ROOT files, if you want to avoid the ROOT-dependency. The current version (uproot4) does not yet support writing ROOT files, but the previous/deprecated version (uproot3) does. (Please note: uproot is not maintained by the ROOT project team).

  • GitHub repo DataEngineeringProject

    Example end to end data engineering project.

    Project mention: Is it me or are beginner-friendly ETL pipeline guides that explain from the ground-up how to incorporate the use of various technologies notoriously difficult to find. | reddit.com/r/dataengineering | 2021-07-23

  • GitHub repo eland

    Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch

    Project mention: Explore and analyze Elasticsearch data with Pandas-Compatible API | news.ycombinator.com | 2021-10-06

  • GitHub repo amazon-s3-find-and-forget

    Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)

    Project mention: How to handle GDPR requests for data stored in S3 ? | reddit.com/r/dataengineering | 2021-11-22

    S3 Find and Forget is probably worth looking into, even if just to get ideas on how to implement a similar solution for yourself

  • GitHub repo uproot4

    ROOT I/O in pure Python and NumPy.

    Project mention: Root with python | reddit.com/r/ParticlePhysics | 2021-06-22

    Besides PyROOT, you can also use uproot to read ROOT files, if you want to avoid the ROOT-dependency. The current version (uproot4) does not yet support writing ROOT files, but the previous/deprecated version (uproot3) does. (Please note: uproot is not maintained by the ROOT project team).

  • GitHub repo hazelcast-python-client

    Hazelcast IMDG Python Client

    Project mention: Contribution to Hazelcast | reddit.com/r/Python | 2021-07-05

    More code samples here: https://github.com/hazelcast/hazelcast-python-client/tree/master/examples

  • GitHub repo Lime-For-Time

    Application of the LIME algorithm by Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin to the domain of time series classification

    Project mention: Understanding LSTM predictions | reddit.com/r/learnmachinelearning | 2021-04-24

    I haven't personally tried it, but here's a Github Repo called LIME for Time. I'm not sure about the state of attention visualization for timeseries but this repo has several models using attention.

  • GitHub repo pmaw

    A multithread Pushshift.io API Wrapper for reddit.com comment and submission searches.

    Project mention: Collecting Removed/Deleted Comments | reddit.com/r/redditdev | 2021-11-02

    Requerying is slow. Ideally, I'd just stream at a 24 delay and check if the comment is still there, but I don't think that's an option? The alternative (what I have currently implemented) is checking each comment ID I have to see if it's still there which is slow with PRAW (around a comment per second). Pushshift is static after it's collected so I don't think it's appropriate here. I have not tried PMAW but I suspect it won't solve my problem.

  • GitHub repo BaseCase-3

    This is a Python application that can be used to gather all files of a certain type from any archive.org repository

    Project mention: What would be the best way to archive an archive.org account? This person has been uploading thousands of high quality rare vinyl rips with lossless high-resolution scans. I don't wanna lose it | reddit.com/r/DataHoarder | 2021-09-20

    Hi, I made a tool a little while ago to help me accomplish what I believe OP is asking for, here is the link: https://github.com/henry386/BaseCase-3

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2021-12-04.

Index

What are some of the best open-source Big Data projects in Python? This list will help you:

Rank  Project                         Stars
1     data-science-ipython-notebooks  21,907
2     spaCy                           21,870
3     Cython                          6,556
4     Stream-Framework                4,571
5     koalas                          3,029
6     feast                           2,512
7     Hail                            763
8     NIPY                            597
9     listenbrainz-server             452
10    scikit-learn-intelex            325
11    uproot3                         318
12    DataEngineeringProject          315
13    eland                           310
14    amazon-s3-find-and-forget       143
15    uproot4                         120
16    hazelcast-python-client         98
17    Lime-For-Time                   59
18    pmaw                            48
19    BaseCase-3                      2