Python Data Science

Open-source Python projects categorized as Data Science

Top 23 Python Data Science Projects

  • Keras

    Deep Learning for humans

    Project mention: Weekly Quant Update 10.11.22 - Surviving a fundamental crisis with trading bots | reddit.com/r/u_KappaTrading | 2022-11-10

    All strategies share a common trait: they all use neural network libraries. Two use TensorFlow; the other uses the Python Keras library: https://github.com/keras-team/keras

  • scikit-learn

    scikit-learn: machine learning in Python

    Project mention: Scaling PostgresML to 1M Requests per Second | news.ycombinator.com | 2022-11-11

    Of course. The paper is at https://arxiv.org/abs/1408.3060.

    > Our method applies to any translation invariant and any dot-product kernel, such as the popular RBF kernels and polynomial kernels. We prove that the approximation is unbiased and has low variance. Experiments show that we achieve similar accuracy to full kernel expansions and Random Kitchen Sinks while being 100x faster and using 1000x less memory. These improvements, especially in terms of memory usage, make kernel methods more practical for applications that have large training sets and/or require real-time prediction.

    Sadly Fastfood didn't quite make it into Scikit[1], but did land in scikit-learn-extra[2].

    1. https://github.com/scikit-learn/scikit-learn/pull/3665. A shame, Scikit's equivalents scale very poorly.

    2. https://scikit-learn-extra.readthedocs.io/en/stable/generate...
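The random-feature idea in the abstract quoted above can be sketched with Random Kitchen Sinks (random Fourier features), the baseline that the abstract compares against. This is not Fastfood itself, which replaces the dense random matrix with a faster structured transform. The function names, the sample vectors, and the choice of `gamma` are assumptions made for illustration; the sketch uses only the standard library, whereas a practical version would vectorize with NumPy.

```python
import math
import random


def rbf_kernel(x, y, gamma):
    """Exact RBF kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def make_random_features(dim, n_features, gamma, seed=0):
    """Return a function that maps a vector to random Fourier features.

    Frequencies are drawn from N(0, 2 * gamma) per dimension, so that the
    inner product of two feature vectors approximates the RBF kernel.
    """
    rng = random.Random(seed)
    scale = math.sqrt(2.0 * gamma)  # standard deviation of the frequencies
    weights = [[rng.gauss(0.0, scale) for _ in range(dim)]
               for _ in range(n_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    norm = math.sqrt(2.0 / n_features)

    def transform(x):
        return [
            norm * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
            for w, b in zip(weights, offsets)
        ]

    return transform


# Hypothetical sample points, chosen only for the demonstration.
x = [0.2, -0.4, 0.1]
y = [0.5, 0.3, -0.2]
gamma = 0.5

transform = make_random_features(dim=3, n_features=4000, gamma=gamma)
approx = sum(a * b for a, b in zip(transform(x), transform(y)))
exact = rbf_kernel(x, y, gamma)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

The approximation error shrinks roughly with the square root of the number of random features, which is why methods such as Fastfood focus on making the feature map itself cheaper to compute and store.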

  • spaCy

    💫 Industrial-strength Natural Language Processing (NLP) in Python

    Project mention: Has anyone here ever used the seaNMF model for short text topic modeling, and be willing to help me get started with it? | reddit.com/r/LanguageTechnology | 2022-11-24

    Tokenize with NLTK, spaCy, or CoreNLP.

  • data-science-ipython-notebooks

    Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

  • Ray

    Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a toolkit of libraries (Ray AIR) for accelerating ML workloads.

    Project mention: Think about it for a second | reddit.com/r/mathmemes | 2022-10-19

    https://ray.io (just dropping the link)

  • ML-From-Scratch

    Machine Learning From Scratch. Bare bones NumPy implementations of machine learning models and algorithms with a focus on accessibility. Aims to cover everything from linear regression to deep learning.

    Project mention: Coding K-Means Clustering using Python and NumPy | dev.to | 2022-09-22

    ML From Scratch - An excellent Github repository containing implementations of many machine learning models and algorithms. Easy to understand and highly recommended.

  • streamlit

    Streamlit — The fastest way to build data apps in Python

    Project mention: Advent of Code - Day Downloader - Website | reddit.com/r/adventofcode | 2022-11-27

    I made a Python streamlit web page to select and download the question and/or input of multiple days on Advent of Code.

  • lightning

    Build and train PyTorch models and connect them to the ML lifecycle using Lightning App templates, without handling DIY infrastructure, cost management, scaling, and other headaches.

    Project mention: We just release a complete open-source solution for accelerating Stable Diffusion pretraining and fine-tuning! | reddit.com/r/StableDiffusion | 2022-11-11

    Our codebase for the diffusion models builds heavily on OpenAI's ADM codebase, lucidrains, Stable Diffusion, Lightning, and Hugging Face. Thanks for open-sourcing!

  • dash

    Analytical Web Apps for Python, R, Julia, and Jupyter. No JavaScript Required.

    Project mention: Sharing interactive Plotly graphs | reddit.com/r/datascience | 2022-11-18

    looks like you can get it manually (albeit with a loss of interactivity) https://github.com/plotly/dash/issues/145

  • matplotlib

    matplotlib: plotting with Python

    Project mention: How to model the hanging chain PDE using numerical methods in Python? | reddit.com/r/learnpython | 2022-11-25

    There are plenty of data visualization tools in Python, but probably the easiest to get started with is Matplotlib.

  • d2l-en

    Interactive deep learning book with multi-framework code, math, and discussions. Adopted at 400 universities from 60 countries including Stanford, MIT, Harvard, and Cambridge.

    Project mention: How to pre-train BERT on different objective tasks using HuggingFace | reddit.com/r/deeplearning | 2022-04-10

    There may be a library for pre-training BERT models in Hugging Face, but I suggest that you train a BERT model in native PyTorch to understand the details. Limu's course is recommended for you.

  • ipython

    Official repository for IPython itself. Other repos in the IPython organization contain things like the website, documentation builds, etc.

    Project mention: Pandas 1.5 released | reddit.com/r/Python | 2022-09-19

    !pip install is error-prone; it is better to use %pip install. IPython even warns about this: https://github.com/ipython/ipython/pull/12954/
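A rough illustration of why the distinction matters: `!pip` runs whichever `pip` comes first on the shell's PATH, which may not belong to the interpreter behind the running kernel, while `%pip` targets the kernel's own environment. Outside a notebook, a similar effect can be had by invoking pip through `sys.executable`. The helper names below are hypothetical, and `pandas` is only a placeholder package name.

```python
import subprocess
import sys


def pip_install_command(package):
    """Build a pip command bound to the interpreter running this code."""
    return [sys.executable, "-m", "pip", "install", package]


def pip_install(package):
    """Run the install; raises CalledProcessError if pip exits with an error."""
    subprocess.check_call(pip_install_command(package))


# Only build and show the command here; call pip_install() to actually run it.
command = pip_install_command("pandas")
print(" ".join(command))
```

Because the command starts with the current interpreter's path, the package lands in the same environment that will later import it.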

  • recommenders

    Best Practices on Recommendation Systems

    Project mention: There is framework for everything. | reddit.com/r/ProgrammerHumor | 2022-08-04

  • gensim

    Topic Modelling for Humans

    Project mention: Topic modeling --- allow multiple topics per statement | reddit.com/r/LanguageTechnology | 2022-11-22

    Try LDA as implemented in gensim: https://github.com/RaRe-Technologies/gensim

  • nni

    An open source AutoML toolkit to automate the machine learning lifecycle, including feature engineering, neural architecture search, model compression, and hyper-parameter tuning.

  • best-of-ml-python

    🏆 A ranked list of awesome machine learning Python libraries. Updated weekly.

    Project mention: Best-Of Machine Learning with Python | news.ycombinator.com | 2022-04-28

  • allennlp

    An open-source NLP research library, built on PyTorch.

    Project mention: How to solve ConfigurationError using HuggingFace Token Classifier | reddit.com/r/learnpython | 2022-10-08

    No clue. So what I did was google the error. Here's what I found: https://github.com/allenai/allennlp/issues/4319

  • dvc

    🦉Data Version Control | Git for Data & Models | ML Experiments Management

    Project mention: How do you manage results, plots, etc.? | reddit.com/r/bioinformatics | 2022-11-17

    Bioinf has a lot of biologists who have transitioned into more technical/coding-focused roles, so you'll find there aren't a lot of engineering workflow standards out there compared to DS or SWE. As others have said, Snakemake is the most common, but that's just a pipeline management tool; it doesn't manage data or outputs. I personally use DVC for data and pipeline management (and include Jupyter and papermill to make it all work), although I haven't yet gotten on board with their experiments feature (which is what would manage different parameters and figures/results beyond versioning). I looked into MLflow and some other options when I was getting started (I do tool development and bioinf analysis), but I wanted data versioning to ensure experiment reproducibility (kind of a critical part of science IMO), and many of the other solutions like Airflow (common in DS industry) seemed to be overkill for smaller bioinfo projects. DVC meets the requirements and I like it in concept, although in practice there have been many updates that have been a bit of a pain to keep up with/integrate. I've got a bioinfo/DS project template on GitHub that rolls together git, conda, DVC, Jupyter, and papermill to ensure experiment reproducibility, and is set up as a template that can be deployed with cookiecutter. Check it out if you like.

  • Prefect

    The easiest way to build, run, and monitor data pipelines at scale.

    Project mention: Example typescript project repos? | reddit.com/r/typescript | 2022-10-27

    If I were answering this question for Python, I'd recommend something like prefect, boto3, or tortoise-orm: not extremely complex and with a pretty comprehensible feature set.

  • seaborn

    Statistical data visualization in Python

    Project mention: Ever wondered why banking sites suck? | reddit.com/r/ProgrammerHumor | 2022-11-11

    As a practical example let's look up a repository of an open source project, these are the stats of the first and second contributor as ranked by github:

  • pandas-profiling

    Create HTML profiling reports from pandas DataFrame objects

    Project mention: Data analysts: what’re some initial steps you take to get familiar w datasets? | reddit.com/r/Python | 2022-11-09

    Since you already mention pandas, I can suggest that you profile the dataframe to get a better understanding of the data with respect to, e.g., distributions, data types, missing data, and so forth. There are handy tools for that, like pandas-profiling.

  • TFLearn

    Deep learning library featuring a higher-level API for TensorFlow.

    Project mention: Beginner Friendly Resources to Master Artificial Intelligence and Machine Learning with Python (2022) | dev.to | 2022-08-14

    TFLearn – Deep learning library featuring a higher-level API for TensorFlow

  • ludwig

    Data-centric declarative deep learning framework

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2022-11-27.

Index

What are some of the best open-source Data Science projects in Python? This list will help you:

Rank  Project                          Stars
1     Keras                            56,727
2     scikit-learn                     52,167
3     spaCy                            24,644
4     data-science-ipython-notebooks   24,227
5     Ray                              22,800
6     ML-From-Scratch                  21,680
7     streamlit                        21,567
8     lightning                        20,706
9     dash                             17,688
10    matplotlib                       16,454
11    d2l-en                           15,632
12    ipython                          15,594
13    recommenders                     14,582
14    gensim                           13,718
15    nni                              12,254
16    best-of-ml-python                11,952
17    allennlp                         11,300
18    dvc                              10,698
19    Prefect                          10,522
20    seaborn                          10,096
21    pandas-profiling                 9,868
22    TFLearn                          9,581
23    ludwig                           8,629