Top 23 Data Science Open-Source Projects

ML-For-Beginners

28 67,267 7.6 HTML

12 weeks, 26 lessons, 52 quizzes, classic Machine Learning for all

Project mention: Good coding groups for black women? | news.ycombinator.com | 2024-01-13

- https://github.com/microsoft/ML-For-Beginners
Also check out this list Pitt puts out every year:

Keras

79 61,044 9.9 Python

Deep Learning for humans

Project mention: Side Quest #3: maybe the real Deepfakes were the friends we made along the way | dev.to | 2024-05-20

import os

import tensorflow as tf

# Helpers (original_image_path_gen, ref_image_path_gen, load_image, load_masked_image,
# load_audio_data) and the *_SHAPE constants are not shown in this excerpt.


def batcher_from_directory(batch_size: int, dataset_path: str, shuffle=False, seed=None) -> tf.data.Dataset:
    """
    Return a tensorflow Dataset object that returns images and spectrograms as required.
    Partly inspired by https://github.com/keras-team/keras/blob/v3.3.3/keras/src/utils/image_dataset_utils.py

    Args:
        batch_size: The batch size.
        dataset_path: The path to the dataset folder which must contain the image folder and audio folder.
        shuffle: Whether to shuffle the dataset. Default to False.
        seed: The seed for the shuffle. Default to None.
    """
    image_dataset_path = os.path.join(dataset_path, "image")

    # create the foundation datasets
    og_dataset = tf.data.Dataset.from_generator(
        lambda: original_image_path_gen(image_dataset_path),
        output_signature=tf.TensorSpec(shape=(), dtype=tf.string),
    )
    og_dataset = og_dataset.repeat(None)  # repeat indefinitely
    ref_dataset = tf.data.Dataset.from_generator(
        lambda: ref_image_path_gen(image_dataset_path),
        output_signature=(
            tf.TensorSpec(shape=(), dtype=tf.string),
            tf.TensorSpec(shape=(), dtype=tf.bool),
        ),
    )
    ref_dataset = ref_dataset.repeat(None)  # repeat indefinitely

    # create the input datasets
    og_image_dataset = og_dataset.map(
        lambda x: tf.py_function(load_image, [x, tf.convert_to_tensor(False, dtype=tf.bool)], tf.float32),
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    masked_image_dataset = og_image_dataset.map(
        lambda x: tf.py_function(load_masked_image, [x], tf.float32),
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    ref_image_dataset = ref_dataset.map(
        lambda x, y: tf.py_function(load_image, [x, y], tf.float32),
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    audio_spec_dataset = og_dataset.map(
        lambda x: tf.py_function(load_audio_data, [x, dataset_path], tf.float64),
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    unsync_spec_dataset = ref_dataset.map(
        lambda x, _: tf.py_function(load_audio_data, [x, dataset_path], tf.float64),
        num_parallel_calls=tf.data.AUTOTUNE,
    )

    # ensure shape as tensorflow does not accept unknown shapes
    og_image_dataset = og_image_dataset.map(lambda x: tf.ensure_shape(x, IMAGE_SHAPE))
    masked_image_dataset = masked_image_dataset.map(lambda x: tf.ensure_shape(x, MASKED_IMAGE_SHAPE))
    ref_image_dataset = ref_image_dataset.map(lambda x: tf.ensure_shape(x, IMAGE_SHAPE))
    audio_spec_dataset = audio_spec_dataset.map(lambda x: tf.ensure_shape(x, AUDIO_SPECTROGRAM_SHAPE))
    unsync_spec_dataset = unsync_spec_dataset.map(lambda x: tf.ensure_shape(x, AUDIO_SPECTROGRAM_SHAPE))

    # multi input using https://discuss.tensorflow.org/t/train-a-model-on-multiple-input-dataset/17829/4
    full_dataset = tf.data.Dataset.zip(
        (masked_image_dataset, ref_image_dataset, audio_spec_dataset, unsync_spec_dataset),
        og_image_dataset,
    )

    # if shuffle:
    #     full_dataset = full_dataset.shuffle(buffer_size=batch_size * 8, seed=seed)  # not sure why buffer size is such

    # batch
    full_dataset = full_dataset.batch(batch_size=batch_size)

    return full_dataset

superset

138 59,473 9.9 TypeScript

Apache Superset is a Data Visualization and Data Exploration Platform

Project mention: Show HN: Open-source BI and analytics for engineers | news.ycombinator.com | 2024-05-15

We are looking at moving our Power BI stuff to Apache Superset [1]. How does this compare to Superset?
[1] https://superset.apache.org/

scikit-learn

82 58,344 9.9 Python

scikit-learn: machine learning in Python

Project mention: How to Build a Logistic Regression Model: A Spam-filter Tutorial | dev.to | 2024-05-05

Online Courses:
- Coursera: "Machine Learning" by Andrew Ng
- edX: "Introduction to Machine Learning" by MIT

Tutorials:
- Scikit-learn documentation: https://scikit-learn.org/
- Kaggle Learn: https://www.kaggle.com/learn

Books:
- "Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow" by Aurélien Géron
- "The Elements of Statistical Learning" by Trevor Hastie, Robert Tibshirani, and Jerome Friedman

By understanding the core concepts of logistic regression and its limitations, and by exploring further resources, you'll be well-equipped to navigate the exciting world of machine learning!

Pandas

399 42,159 10.0 Python

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

Project mention: The ultimate guide to creating a secure Python package | dev.to | 2024-05-08

It's also possible for you to give a package an alias by using the as keyword. For instance, you could use the pandas package as pd like this:
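The snippet that followed this sentence in the quoted post is not reproduced on this page. As a rough sketch of the same `as` aliasing, the example below uses standard-library modules so that it runs without extra installs; with pandas installed, the equivalent line would be `import pandas as pd`. The alias names `js` and `st` are illustrative choices, not established conventions.

```python
# Aliasing binds the imported module to a shorter local name.
# Standard-library modules are used here so the sketch runs anywhere;
# with pandas installed the equivalent would be: import pandas as pd
import json as js
import statistics as st

# Star counts for three projects, taken from this list.
stars = {"Keras": 61044, "scikit-learn": 58344, "Pandas": 42159}

# The alias is used exactly like the full module name would be.
median_stars = st.median(stars.values())
payload = js.dumps({"median_stars": median_stars})

print(payload)
```

The alias only changes the local name; the module object behind it is the same one a plain `import` would give.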

Made-With-ML

51 36,004 6.8 Jupyter Notebook

Learn how to design, develop, deploy and iterate on production-grade ML applications.

Project mention: [D] How do you keep up to date on Machine Learning? | /r/learnmachinelearning | 2023-08-13

Made With ML

Airflow

170 34,705 10.0 Python

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

Project mention: AI Strategy Guide: How to Scale AI Across Your Business | dev.to | 2024-05-11

Level 1 of MLOps is when you've put each lifecycle stage and its interfaces in an automated pipeline. The pipeline could be a Python or bash script, or it could be a directed acyclic graph run by some orchestration framework like Airflow, dagster or one of the cloud-provider offerings. AI- or data-specific platforms like MLflow, ClearML and dvc also feature pipeline capabilities.
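To make the "directed acyclic graph" idea concrete, here is a minimal sketch that runs pipeline stages in dependency order using only Python's standard-library `graphlib` (Python 3.9+). The stage names and the `run_stage` helper are hypothetical, and this is not how Airflow or dagster define pipelines; it only illustrates the ordering such tools rely on.

```python
from graphlib import TopologicalSorter

# Hypothetical lifecycle stages; each key maps to the stages it depends on.
pipeline = {
    "ingest": set(),
    "validate": {"ingest"},
    "train": {"validate"},
    "evaluate": {"train"},
    "deploy": {"evaluate"},
}


def run_stage(name: str, log: list) -> None:
    """Stand-in for real work: record that the stage ran."""
    log.append(name)


def run_pipeline(dag: dict) -> list:
    """Run every stage only after all of its dependencies have run."""
    executed = []
    # static_order() raises graphlib.CycleError if the graph contains a cycle.
    for stage in TopologicalSorter(dag).static_order():
        run_stage(stage, executed)
    return executed


order = run_pipeline(pipeline)
print(order)
```

An orchestration framework adds scheduling, retries and monitoring on top of this kind of ordering, which is why a plain script is often enough for small pipelines.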

streamlit

258 32,222 9.8 Python

Streamlit — A faster way to build and share data apps.

Project mention: Developing a Generic Streamlit UI to Test Amazon Bedrock Agents | dev.to | 2024-05-05

I decided to use Streamlit to build the UI as it is a popular and fitting choice. Streamlit is an open-source Python library used for building interactive web applications, especially for AI and data applications. Since the application code is written only in Python, it is easy to learn and build with.

Ray

43 31,414 10.0 Python

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

Project mention: Ray: Unified framework for scaling AI and Python applications | news.ycombinator.com | 2024-05-03

gradio

116 29,400 9.9 Python

Build and share delightful machine learning apps, all in Python. 🌟 Star to support our work!

Project mention: AI enthusiasm #9 - A multilingual chatbot📣🈸 | dev.to | 2024-05-01

gradio is a package developed to ease the development of app interfaces in Python and other languages (GitHub)

spaCy

107 28,887 9.2 Python

💫 Industrial-strength Natural Language Processing (NLP) in Python

Project mention: How I discovered Named Entity Recognition while trying to remove gibberish from a string. | dev.to | 2024-05-06

AI-Expert-Roadmap

30 28,527 0.0 JavaScript

Roadmap to becoming an Artificial Intelligence Expert in 2022

Project mention: Best AI ML DL DS Roadmap | /r/deeplearning | 2023-12-07

**[I.am.ai AI Expert Roadmap](https://i.am.ai/roadmap)**: This roadmap focuses more on AI and includes various aspects of machine learning and deep learning. It's suitable for those who want to delve deeper into AI, particularly in cutting-edge research and applications.

pytorch-lightning

9 27,064 9.9 Python

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.

Project mention: SB-1047 will stifle open-source AI and decrease safety | news.ycombinator.com | 2024-04-29

It's very easy to get started, right in your Terminal, no fees! No credit card at all.
And there are cloud providers like https://replicate.com/ and https://lightning.ai/ that will let you use your LLM via an API key just like you did with OpenAI if you need that.
You don't need OpenAI - nobody does.

Data-Science-For-Beginners

15 26,583 6.1 Jupyter Notebook

10 Weeks, 20 Lessons, Data Science for All!

Project mention: Welcome to 14 days of Data Science! | dev.to | 2024-03-07

Get started with Data Science in the Data Science for Beginners curriculum.

data-science-ipython-notebooks

1 26,545 0.0 Python

Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

30 26,406 0.0 Jupyter Notebook

aka "Bayesian Methods for Hackers": An introduction to Bayesian methods + probabilistic programming with a computation/understanding-first, mathematics-second point of view. All in pure Python ;)

Project mention: Probabilistic Programming and Bayesian Methods for Hackers (2013) | news.ycombinator.com | 2024-02-10

applied-ml

13 26,050 3.0

📚 Papers & tech blogs by companies sharing their work on data science & machine learning in production.
awesome-datascience

9 23,858 6.9

:memo: An awesome Data Science repository to learn and apply for real world problems.

Project mention: About Data analyst, data scientist and data engineer, resources and experiences | dev.to | 2024-03-26

Awesome Data Science by Academic

ML-From-Scratch

3 23,260 0.0 Python

Machine Learning From Scratch. Bare bones NumPy implementations of machine learning models and algorithms with a focus on accessibility. Aims to cover everything from linear regression to deep learning.
d2l-en

6 21,922 8.5 Python

Interactive deep learning book with multi-framework code, math, and discussions. Adopted at 500 universities from 70 countries including Stanford, MIT, Harvard, and Cambridge.
fastbook

23 20,860 3.5 Jupyter Notebook

The fastai book, published as Jupyter Notebooks

Project mention: The fastai book, published as Jupyter Notebooks | news.ycombinator.com | 2024-01-17

dash

56 20,613 9.6 Python

Data Apps & Dashboards for Python. No JavaScript Required.

Project mention: dash VS solara - a user suggested alternative | libhunt.com/r/dash | 2023-10-13

matplotlib

36 19,382 10.0 Python

matplotlib: plotting with Python

Project mention: How and where is matplotlib package making use of PySide? | /r/learnpython | 2023-12-07

NOTE: The open-source projects on this list are ordered by number of GitHub stars. The number of mentions indicates repo mentions in the last 12 months or since we started tracking (Dec 2020).
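As an illustration of the ordering described in the note, the sketch below parses a few of the per-project stat lines from this page and sorts them by stars. It assumes each line reads as mentions, stars, a third score, then an optional language; the note confirms the stars and mentions fields, while reading the third number as an activity score is an assumption. `ProjectStats` and `parse_stats` are illustrative names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProjectStats:
    name: str
    mentions: int
    stars: int
    activity: float          # assumed meaning of the third number
    language: Optional[str]  # missing for some entries, e.g. applied-ml


def parse_stats(name: str, line: str) -> ProjectStats:
    """Parse a line such as '28 67,267 7.6 HTML' into typed fields."""
    parts = line.split(maxsplit=3)
    mentions = int(parts[0])
    stars = int(parts[1].replace(",", ""))  # drop the thousands separator
    activity = float(parts[2])
    language = parts[3] if len(parts) > 3 else None
    return ProjectStats(name, mentions, stars, activity, language)


# Stat lines copied from entries on this page.
raw = {
    "Pandas": "399 42,159 10.0 Python",
    "ML-For-Beginners": "28 67,267 7.6 HTML",
    "applied-ml": "13 26,050 3.0",
    "fastbook": "23 20,860 3.5 Jupyter Notebook",
}

projects = [parse_stats(name, line) for name, line in raw.items()]
ranked = sorted(projects, key=lambda p: p.stars, reverse=True)

for rank, p in enumerate(ranked, start=1):
    print(rank, p.name, p.stars)
```

Limiting the split to three separators keeps multi-word languages such as "Jupyter Notebook" in one field.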

Data Science related posts

Side Quest #3: maybe the real Deepfakes were the friends we made along the way

3 projects | dev.to | 20 May 2024
Compdemocracy/polis: open-source AI for large scale open ended feedback

1 project | news.ycombinator.com | 14 May 2024
Lessons learned reinventing the Python notebook

3 projects | news.ycombinator.com | 11 May 2024
AI Strategy Guide: How to Scale AI Across Your Business

4 projects | dev.to | 11 May 2024
Ask HN: Why all these GitHub fake accounts starring my project

1 project | news.ycombinator.com | 9 May 2024
How I discovered Named Entity Recognition while trying to remove gibberish from a string.

1 project | dev.to | 6 May 2024
Alternative clouds are booming as companies seek cheaper access to GPUs

3 projects | news.ycombinator.com | 6 May 2024

Index

What are some of the best open-source Data Science projects? This list will help you:

#	Project	Stars
1	ML-For-Beginners	67,267
2	Keras	61,044
3	superset	59,473
4	scikit-learn	58,344
5	Pandas	42,159
6	Made-With-ML	36,004
7	Airflow	34,705
8	streamlit	32,222
9	Ray	31,414
10	gradio	29,400
11	spaCy	28,887
12	AI-Expert-Roadmap	28,527
13	pytorch-lightning	27,064
14	Data-Science-For-Beginners	26,583
15	data-science-ipython-notebooks	26,545
16	Probabilistic-Programming-and-Bayesian-Methods-for-Hackers	26,406
17	applied-ml	26,050
18	awesome-datascience	23,858
19	ML-From-Scratch	23,260
20	d2l-en	21,922
21	fastbook	20,860
22	dash	20,613
23	matplotlib	19,382