Python Data Science

Open-source Python projects categorized as Data Science

Top 23 Python Data Science Projects

  • Keras

    Deep Learning for humans

    Project mention: Can someone explain how keras code gets into the Tensorflow package? | /r/tensorflow | 2023-07-24

    I'm guessing the "real" keras code is coming from the keras repository. Is that a correct assumption? How does that version of Keras get there? If I wanted to write my own activation layer next to ELU, where exactly would I do that?

  • scikit-learn

    scikit-learn: machine learning in Python

    Project mention: Transformers as Support Vector Machines | news.ycombinator.com | 2023-09-03

    It looks like you've been the victim of some misinformation. As Dr_Birdbrain said, an SVM is a convex problem with a unique global optimum. sklearn.svm.SVC relies on libsvm, which initializes the weights to 0 [0]. The random state is only used to shuffle the data to make probability estimates with Platt scaling [1]. Of the random_state parameter, the sklearn documentation for SVC [2] says

    Controls the pseudo random number generation for shuffling the data for probability estimates. Ignored when probability is False. Pass an int for reproducible output across multiple function calls. See Glossary.

    [0] https://github.com/scikit-learn/scikit-learn/blob/2a2772a87b...

    [1] https://en.wikipedia.org/wiki/Platt_scaling

    [2] https://scikit-learn.org/stable/modules/generated/sklearn.sv...
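The claim in the comment above can be checked with a minimal sketch. It assumes scikit-learn and NumPy are installed; the synthetic dataset, the helper name `fit_svc`, and the seed values are illustrative choices, not part of the original discussion. It fits the same `SVC` twice with different `random_state` values while `probability=False` and compares the fitted models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Small synthetic binary classification problem (illustrative data only).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)


def fit_svc(seed):
    """Fit an RBF-kernel SVC with probability estimates disabled.

    probability=False is the default; it is spelled out here because the
    documentation says random_state is ignored in exactly this case.
    """
    return SVC(kernel="rbf", probability=False, random_state=seed).fit(X, y)


model_a = fit_svc(1)
model_b = fit_svc(12345)

# Compare which training points became support vectors and the decision values.
same_support = np.array_equal(model_a.support_, model_b.support_)
same_decision = np.allclose(
    model_a.decision_function(X), model_b.decision_function(X)
)
print("same support vectors:", same_support)
print("same decision values:", same_decision)
```

If the two fits agree, that is consistent with the documentation quoted above. With `probability=True`, the seed would instead affect the internal shuffling used for the Platt-scaling estimates.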

  • Pandas

    Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

    Project mention: Interacting with Amazon S3 using AWS Data Wrangler (awswrangler) SDK for Pandas: A Comprehensive Guide | dev.to | 2023-08-20

    AWS Data Wrangler is a Python library that simplifies the process of interacting with various AWS services, built on top of some useful data tools and open-source projects such as Pandas, Apache Arrow and Boto3. It offers streamlined functions to connect to, retrieve, transform, and load data from AWS services, with a strong focus on Amazon S3.

  • Ray

    Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

    Project mention: Fine-Tuning Llama-2: A Comprehensive Case Study for Tailoring Custom Models | news.ycombinator.com | 2023-08-11

    Training times for GSM8k are mentioned here: https://github.com/ray-project/ray/tree/master/doc/source/te...

  • streamlit

    Streamlit — A faster way to build and share data apps.

    Project mention: Show HN: Zero-dependency Java framework out of beta | news.ycombinator.com | 2023-09-25

    The 'batteries included' space is definitely a market. For example https://streamlit.io is wildly popular with data teams for quickly making a pre-styled, usable enough web UI to put on top of some model, with controls that are automatically reactive. Those ppl have zero interest in fiddling with modular systems or spending time optimizing and scaling web apps.

  • spaCy

    💫 Industrial-strength Natural Language Processing (NLP) in Python

    Project mention: Retrieval Augmented Generation (RAG): How To Get AI Models Learn Your Data & Give You Answers | dev.to | 2023-09-18

  • data-science-ipython-notebooks

    Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

  • lightning

    Deep learning framework to train, deploy, and ship AI products Lightning fast.

    Project mention: Best practice for saving logits/activation values of model in PyTorch Lightning | /r/deeplearning | 2023-07-19

    I've been wondering on what is the recommended method of saving logits/activations using PyTorch Lightning. I've looked at Callbacks, Loggers and ModelHooks but none of the use-cases seem to be for this kind of activity (even if I were to create my own custom variants of each utility). The ModelCheckpoint Callback in its utility makes me feel like custom Callbacks would be the way to go but I'm not quite sure. This closed GitHub issue does address my issue to some extent.

  • ML-From-Scratch

    Machine Learning From Scratch. Bare bones NumPy implementations of machine learning models and algorithms with a focus on accessibility. Aims to cover everything from linear regression to deep learning.

    Project mention: Tutorials on creating primitive ML algorithms from scratch? | /r/learnmachinelearning | 2023-01-24

    ml-from-scratch

  • gradio

    Build and share delightful machine learning apps, all in Python. 🌟 Star to support our work!

    Project mention: Gradio sharable link expires too soon ( 30 mins to 1 hour, instead of lasting 72 hours ) | /r/StableDiffusion | 2023-06-10

    I found an issue on gradio github but looks like it's closed so I am not sure if it's still a common issue or only I am facing it due to certain settings/absence of a fix. ( https://github.com/gradio-app/gradio/issues/3060 )

  • dash

    Data Apps & Dashboards for Python. No JavaScript Required.

    Project mention: [Python] NiceGUI: Let any browser be the frontend for your Python code | /r/aufdeutsch | 2023-04-25

    Of course there are valid use cases for splitting frontend and backend technologies. NiceGUI is for those who don’t want to leave the Python ecosystem and like to reap the benefits of having all code in one place. There are other options like Streamlit, Dash, Anvil, JustPy, and Pynecone. But we initially created NiceGUI to easily handle the state of external hardware like LEDs, motors, and cameras. Additionally, we wanted to offer a gentle learning curve while still providing the ability to go all the way down to HTML, CSS, and JavaScript if needed.

  • d2l-en

    Interactive deep learning book with multi-framework code, math, and discussions. Adopted at 500 universities from 70 countries including Stanford, MIT, Harvard, and Cambridge.

    Project mention: Which book to choose for deep learning: Ian Goodfellow or Francois Chollet | /r/learnmachinelearning | 2023-04-07

  • matplotlib

    matplotlib: plotting with Python

    Project mention: Tkinter, PyGame windows too large on Mac | /r/learnpython | 2023-06-29

    as suggested here.

  • recommenders

    Best Practices on Recommendation Systems

    Project mention: My kernel dies when I fit my LightFm model from Microsoft Recommenders | /r/Jupyter | 2023-06-16

  • ipython

    Official repository for IPython itself. Other repos in the IPython organization contain things like the website, documentation builds, etc.

    Project mention: The new pdbp (Pdb+) Python debugger! | dev.to | 2023-08-02

    If you’re already using ipython, this isn’t a problem because you’ll already need to download most of these dependencies anyway. But if you’re not using ipython… you’ll still need to download those dependencies.

  • gensim

    Topic Modelling for Humans

    Project mention: Aggregating news from different sources | /r/learnprogramming | 2023-07-08

  • best-of-ml-python

    🏆 A ranked list of awesome machine learning Python libraries. Updated weekly.

    Project mention: Ask HN: How to get back into AI? | news.ycombinator.com | 2022-12-10

    For Python, here's a nice compilation: https://github.com/ml-tooling/best-of-ml-python/blob/main/RE...

  • nni

    An open source AutoML toolkit to automate the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.

    Project mention: Filter Pruning for PyTorch | /r/deeplearning | 2023-04-13

  • Prefect

    The easiest way to build, run, and monitor data pipelines at scale.

    Project mention: self hosted Alternative to easycron.com? | /r/selfhosted | 2022-12-30

  • dvc

    🦉 Data Version Control | Git for Data & Models | ML Experiments Management

    Project mention: Exploring MLOps Tools and Frameworks: Enhancing Machine Learning Operations | dev.to | 2023-06-06

    DVC (Data Version Control):

  • ydata-profiling

    1 Line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames.

    Project mention: Data exploration is not dead | news.ycombinator.com | 2023-06-24

  • seaborn

    Statistical data visualization in Python

    Project mention: Best Portfolio Projects for Data Science | dev.to | 2023-09-19

    Seaborn Documentation

  • ludwig

    Low-code framework for building custom LLMs, neural networks, and other AI models

    Project mention: Python projects with best practices on Github? | /r/Python | 2023-02-14

    Two random examples I found from 30 seconds of googling: Here’s Netflix using it in their crisis management tool, and here’s Uber using it in their deep learning framework.

NOTE: The open source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020). The latest post mention was on 2023-09-25.

Index

What are some of the best open-source Data Science projects in Python? This list will help you:

Rank  Project                         Stars
1     Keras                           59,372
2     scikit-learn                    55,910
3     Pandas                          39,797
4     Ray                             27,697
5     streamlit                       27,263
6     spaCy                           27,161
7     data-science-ipython-notebooks  25,620
8     lightning                       24,714
9     ML-From-Scratch                 22,413
10    gradio                          22,164
11    dash                            19,382
12    d2l-en                          19,183
13    matplotlib                      18,108
14    recommenders                    16,359
15    ipython                         15,929
16    gensim                          14,649
17    best-of-ml-python               14,459
18    nni                             13,270
19    Prefect                         12,870
20    dvc                             12,010
21    ydata-profiling                 11,156
22    seaborn                         11,153
23    ludwig                          9,847