Top 23 Python Big Data Projects
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
The most widely used Python to C compiler
Project mention: Do you guys have any resources for learning C++ and/or Fortran programming for physics? | reddit.com/r/Physics | 2022-11-21
If you're already familiar with Python and NumPy, you might want to look into Cython as an intermediate step before going straight into C. It lets you compile most Python code into an extension module that can be imported into a Python script just like any other library. This gets you performance close to raw C without having to invest much effort, and it has a lot of bells and whistles like "auto"-parallelization with OpenMP.
Stream Framework is a Python library that allows you to build news feeds, activity streams and notification systems using Cassandra and/or Redis. The authors of Stream-Framework also provide a cloud service for feed technology:
Project mention: Is there any reason why you should not build chat functionality for your project and use chat as a service provider instead? | reddit.com/r/reactjs | 2022-11-16
I have seen multiple developers on YouTube implementing chat functionality using a third-party service provider, https://getstream.io/ It's an SDK that provides easy-to-use React components, which seems convenient; however, it starts at $499 per month for 10K users.
Feature Store for Machine Learning
Project mention: [D] Your Preferred Feature Stores? | reddit.com/r/datascience | 2022-07-03
Koalas: pandas API on Apache Spark
Project mention: My new company uses Pyspark. I want to learn it before my starting date. Any advice? | reddit.com/r/datascience | 2022-11-10
If they're using Databricks and you're familiar with pandas, Koalas should be right up your alley.
Easily turn large sets of image URLs into an image dataset. Can download, resize and package 100M URLs in 20h on one machine.
Project mention: A Stable Diffusion prompt changes its output for the style of 1500 artists | news.ycombinator.com | 2022-10-02
You can download it here:
You probably would want to stop after getting the metadata, unless you have 240TB available for the images :)
More details and links to dataset explorers here: https://laion.ai/blog/laion-5b/
Scalable genomic data analysis.
Project mention: Software engineers: consider working on genomics | news.ycombinator.com | 2022-11-19
I don't have any funding to hire right now, but I'm always happy to chat about the industry and my experience building Hail (https://hail.is, https://github.com/hail-is/hail), a tool widely used by folks with large collections of human sequences.
The other posters are not wrong about compensation. Total compensation is off by a factor of two to three.
However, it is absolutely possible to work with a group of top-notch engineers on serious distributed systems & compilers in service of an excellent scientific-user experience. I know because I do. We are lucky to have a PI who respects and hires a diversity of expertise within his lab.
I enjoy being deeply embedded with our users. I do not have to guess what they need or want because I help them do it every day.
I also enjoy enmeshing engineering with statistics, mathematics, and biology. Work is more interesting when so many disciplines conspire towards the end of improved human health.
Intel(R) Extension for Scikit-learn is a seamless way to speed up your Scikit-learn application
Project mention: Machine Learning with PyTorch and Scikit-Learn – The *New* Python ML Book | news.ycombinator.com | 2022-02-25
Workflows and interfaces for neuroimaging packages
Example end-to-end data engineering project.
Project mention: What are your favourite GitHub repos that shows how data engineering should be done? | reddit.com/r/dataengineering | 2022-11-18
Source code for the CERN Open Data portal
Project mention: Good Series, Tutorial, or Book on Particle Physics Analysis using Python or Root for Undergraduates | reddit.com/r/ParticlePhysics | 2022-11-09
CERN Open Data has lots of examples from various collaborations: https://opendata.cern.ch/
Redis in a Python module.
Project mention: Devious SQL: Message Queuing Using Native PostgreSQL | news.ycombinator.com | 2022-01-28
If you are developing on Linux, rq + redis is probably the simplest message queueing system around. Redis can even be installed from pip,
or run within python,
it's pretty much the easiest thing to deploy.
Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
Project mention: hey, any idea about how to automatically extract data from kibana to a panda data frame in order to be analyzed? | reddit.com/r/kibana | 2022-06-13
A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀
Project mention: [D] For those of you who don't own a GPU, how do you run your experiments or train your models? | reddit.com/r/MachineLearning | 2021-12-19
At work for non-ML/non-GPU stuff I've been using Lithops for running code on dynamically-provisioned cloud resources (serverless or VM). It pickles your code & runtime variables, sends them to cloud storage, runs the code & downloads the results, all relatively transparently. You're just calling Python functions with Python objects on your local computer and not having to worry about deploying your code, packaging your data, etc. Better still, you can scale up for things like hyperparameter sweeps by just dispatching more calls in parallel, and it will provision more resources.
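The flow described in that comment (serialize the code and its arguments, push them to storage, run them elsewhere, pull the results back) can be illustrated with a small local simulation. This is only a sketch under stated assumptions, not how Lithops is implemented: the `storage` dict stands in for cloud object storage, the thread pool stands in for cloud workers, and `upload_call`, `remote_worker` and `run_map` are hypothetical names. It also uses `marshal` on the function's code object, which only works for simple functions with no closures or global references.

```python
import builtins
import marshal
import pickle
import types
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a cloud object store: key -> bytes.
storage = {}


def upload_call(key, func, arg):
    """Serialize the function's code and its argument, then 'upload' them."""
    payload = {
        "code": marshal.dumps(func.__code__),  # simple functions only
        "name": func.__name__,
        "arg": arg,
    }
    storage[key] = pickle.dumps(payload)


def remote_worker(key):
    """Pretend to be a cloud worker: fetch the payload, run it, store the result."""
    payload = pickle.loads(storage[key])
    code = marshal.loads(payload["code"])
    func = types.FunctionType(code, {"__builtins__": builtins}, payload["name"])
    storage[key + "/result"] = pickle.dumps(func(payload["arg"]))


def run_map(func, args):
    """Dispatch one call per argument in parallel, then download the results."""
    keys = [f"job/{i}" for i in range(len(args))]
    for key, arg in zip(keys, args):
        upload_call(key, func, arg)
    with ThreadPoolExecutor() as pool:
        list(pool.map(remote_worker, keys))  # wait for every call to finish
    return [pickle.loads(storage[key + "/result"]) for key in keys]


def square(x):
    return x * x


# Scaling up a sweep is just a longer argument list.
results = run_map(square, [1, 2, 3, 4])
```

From the caller's side this is still an ordinary Python function applied to ordinary Python objects, which is the convenience the comment is pointing at.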
Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)
Project mention: Deleting particular data from S3 External Tables | reddit.com/r/dataengineering | 2022-10-31
Take a look at this: https://github.com/awslabs/amazon-s3-find-and-forget We use it for GDPR compliance; it will open a file, delete a row and pack it back. It modifies the file, so watch out if you are using Glue job bookmarks. Because you are using external tables, the manifest file will also have to be updated with the proper length for the new, updated file. If you have hundreds of tables and thousands of files and need to do this on a regular basis, this would be the scalable solution, but if you have only a few files, honestly I would do it manually.
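The "open a file, delete a row and pack it back" step, and the reason the manifest length changes afterwards, can be sketched in a few lines. This is an illustration only, not the tool's implementation: it assumes gzip-compressed JSON Lines objects held in memory, and the `erase_rows` and `make_object` helpers, the `customer_id` column, the bucket URL and the manifest entry shape are all hypothetical.

```python
import gzip
import json


def make_object(rows):
    """Pack rows into a gzip-compressed JSON Lines blob (one JSON object per line)."""
    body = "".join(json.dumps(row) + "\n" for row in rows)
    return gzip.compress(body.encode("utf-8"))


def read_object(blob):
    """Unpack a gzip-compressed JSON Lines blob back into a list of rows."""
    text = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line]


def erase_rows(blob, column, match_ids):
    """Open the object, drop rows whose column value is in match_ids, repack it."""
    rows = read_object(blob)
    kept = [row for row in rows if row.get(column) not in match_ids]
    return make_object(kept), len(rows) - len(kept)


original = make_object([
    {"customer_id": "c-1", "amount": 10},
    {"customer_id": "c-2", "amount": 25},
    {"customer_id": "c-3", "amount": 7},
])

rewritten, removed = erase_rows(original, "customer_id", {"c-2"})

# The rewritten object has a different byte length, so any manifest that
# records object sizes must be refreshed (hypothetical entry shape).
manifest_entry = {
    "url": "s3://example-bucket/data/part-0000.json.gz",
    "content_length": len(rewritten),
}
```

Because the object is rewritten rather than patched in place, anything that tracks it by size or modification time sees it as changed, which is the caveat the comment raises.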
ROOT I/O in pure Python and NumPy.
A multithread Pushshift.io API Wrapper for reddit.com comment and submission searches.
Project mention: Dev Diary #7 - Comment Scraping | reddit.com/r/RecMe | 2022-11-13
Hazelcast Python Client
CovsirPhy: Python library for COVID-19 analysis with phase-dependent SIR-derived ODE models.
Project mention: Show HN: Deptry, a tool to check for dependency issues in a Python project | news.ycombinator.com | 2022-09-14
I have recently been working on a project called `deptry`, a command line tool to check for issues in the dependencies of Python projects. It can be used to find obsolete, missing, transitive and misplaced development dependencies. It supports the following types of projects:
- Projects that use Poetry and a corresponding pyproject.toml file
- Projects that use a requirements.txt file according to the pip standards
* Documentation: https://fpgmaas.github.io/deptry/
* GitHub repository: https://github.com/fpgmaas/deptry
I am quite happy with the project in its current form, but I also realise there is still a lot of room left for improvement. Therefore, I hope some people are willing to give it a try and provide me with feedback. So, if you have a project with a long list of dependencies and a little bit of spare time on your hands, please give it a try and let me know what you think!
If you encounter any issues, find a bug, or have any other form of feedback, please don't hesitate to raise an issue in the GitHub repository, or leave a comment here.
P.S. Many thanks to Hirokazu Takaya (https://github.com/lisphilar) for incorporating it in the CI/CD pipeline of his project covid19-sir (https://github.com/lisphilar/covid19-sir). It provided me with very valuable early feedback.
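To make the idea of "obsolete" and "missing" dependencies concrete, here is a minimal sketch of that kind of check built on the standard library's `ast` module. It is not deptry's implementation: `imported_modules` and `check_dependencies` are hypothetical helpers, the declared and standard-library sets are passed in by hand, and it assumes each package's import name equals its distribution name, which is not true in general. Transitive and misplaced development dependencies are not covered.

```python
import ast


def imported_modules(source):
    """Collect the top-level module names imported by a piece of Python source."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports such as "from . import x"
            modules.add(node.module.split(".")[0])
    return modules


def check_dependencies(source, declared, stdlib):
    """Compare imported third-party modules against the declared dependencies."""
    used = imported_modules(source) - stdlib
    return {
        "missing": sorted(used - declared),    # imported but not declared
        "obsolete": sorted(declared - used),   # declared but never imported
    }


example_source = """
import os
import numpy as np
from pandas.io import json
import requests
"""

report = check_dependencies(
    example_source,
    declared={"numpy", "pandas", "flask"},  # e.g. parsed from requirements.txt
    stdlib={"os", "sys", "json"},           # tiny hand-written subset
)
```

In this example `requests` is imported without being declared and `flask` is declared without being imported, so they land in the `missing` and `obsolete` lists respectively.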
Application of the LIME algorithm by Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin to the domain of time series classification
Write data & AI pipelines in (SQL, Spark, Pandas) and deploy to the cloud, simplified
Project mention: Yaetos-A framework to simplify the creation of data pipelines | news.ycombinator.com | 2022-04-06
The goal of this project is to offer an AWS EMR template using Spot Fleet and On-Demand Instances that you can use quickly. Just focus on writing PySpark code.
Project mention: Data Engineering Projects for Beginners | dev.to | 2022-06-15
Building Big Data Pipelines in the Cloud with AWS EMR
What are some of the best open-source Big Data projects in Python? This list will help you: