Top 23 Data Science Open-Source Projects
Deep Learning for humans
Project mention: Can someone explain how keras code gets into the Tensorflow package? | /r/tensorflow | 2023-07-24
I'm guessing the "real" keras code is coming from the keras repository. Is that a correct assumption? How does that version of Keras get there? If I wanted to write my own activation layer next to ELU, where exactly would I do that?
scikit-learn: machine learning in Python
Project mention: Transformers as Support Vector Machines | news.ycombinator.com | 2023-09-03
It looks like you've been the victim of some misinformation. As Dr_Birdbrain said, an SVM is a convex problem with a unique global optimum. sklearn.svm.SVC relies on libsvm, which initializes the weights to 0. The random state is only used to shuffle the data to make probability estimates with Platt scaling. Of the random_state parameter, the sklearn documentation for SVC says:
Controls the pseudo random number generation for shuffling the data for probability estimates. Ignored when probability is False. Pass an int for reproducible output across multiple function calls. See Glossary.
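The point above can be checked with a short sketch. It assumes scikit-learn and NumPy are installed; the synthetic dataset, the helper name fit_scores, and the two seed values are illustrative choices, not part of the original comment. With probability estimates left at their default (off), two fits that differ only in random_state should produce the same decision function.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic binary classification data (an assumption for illustration).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)


def fit_scores(seed):
    """Fit an RBF SVC with the given random_state and return its decision scores."""
    clf = SVC(kernel="rbf", random_state=seed)  # probability estimates stay off
    clf.fit(X, y)
    return clf.decision_function(X)


scores_a = fit_scores(1)
scores_b = fit_scores(12345)

# If random_state is ignored without probability estimates, the scores agree.
same = bool(np.allclose(scores_a, scores_b))
print("identical decision function:", same)
```

Turning probability estimates on is the case where the seed matters, since the documentation quoted above says the data is shuffled for those estimates.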
Apache Superset is a Data Visualization and Data Exploration Platform
Project mention: Apache Superset Is a Data Visualization and Data Exploration Platform | news.ycombinator.com | 2023-09-11
12 weeks, 26 lessons, 52 quizzes, classic Machine Learning for all
Project mention: is it worth learning NLP without master degree? | /r/MLQuestions | 2023-05-01
I don't recommend jumping directly into natural language processing without understanding artificial intelligence theory. I personally recommend that you start with the basics (regression, classification, and clustering, for example), and then move on to more advanced topics. You already know software development, so that's a big step already, and it should be easier to understand some concepts. Maybe follow Microsoft's machine learning for beginners curriculum? It looks like a good overall roadmap that avoids instantly burning out on NLP.
Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
Project mention: Interacting with Amazon S3 using AWS Data Wrangler (awswrangler) SDK for Pandas: A Comprehensive Guide | dev.to | 2023-08-20
AWS Data Wrangler is a Python library that simplifies the process of interacting with various AWS services, built on top of some useful data tools and open-source projects such as Pandas, Apache Arrow and Boto3. It offers streamlined functions to connect to, retrieve, transform, and load data from AWS services, with a strong focus on Amazon S3.
Learn how to design, develop, deploy and iterate on production-grade ML applications.
Project mention: [D] How do you keep up to date on Machine Learning? | /r/learnmachinelearning | 2023-08-13
Made With ML
Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
Project mention: Fine-Tuning Llama-2: A Comprehensive Case Study for Tailoring Custom Models | news.ycombinator.com | 2023-08-11
Training times for GSM8k are mentioned here: https://github.com/ray-project/ray/tree/master/doc/source/te...
Streamlit: A faster way to build and share data apps.
Project mention: Stop LLM/GenAI hallucination fast: Serverless Kendra RAG with GO | dev.to | 2023-09-20
In evaluating whether a technical solution is suitable, the focus is on the simplicity of development. So, the AWS sample uses the well-known langchain library and a streamlit server for the chat sample.
💫 Industrial-strength Natural Language Processing (NLP) in Python
Project mention: Retrieval Augmented Generation (RAG): How To Get AI Models Learn Your Data & Give You Answers | dev.to | 2023-09-18
Roadmap to becoming an Artificial Intelligence Expert in 2022
Project mention: Suggest which roadmap should I follow for ML? | /r/learnmachinelearning | 2023-07-20
aka "Bayesian Methods for Hackers": An introduction to Bayesian methods + probabilistic programming with a computation/understanding-first, mathematics-second point of view. All in pure Python ;)
Project mention: [Q] Bayesian statistics! | /r/statistics | 2023-06-11
Also, this is quite a nice practical introduction which might help with finding answers to your questions: https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
📚 Papers & tech blogs by companies sharing their work on data science & machine learning in production.
Project mention: [D] Favorite ML Youtube Channels/Blogs/Newsletters | /r/MachineLearning | 2023-04-08
Also, have any of you stumbled across any cool GitHub repos like this one: https://github.com/eugeneyan/applied-ml ?
Deep learning framework to train, deploy, and ship AI products Lightning fast.
Project mention: Best practice for saving logits/activation values of model in PyTorch Lightning | /r/deeplearning | 2023-07-19
I've been wondering what the recommended method is for saving logits/activations using PyTorch Lightning. I've looked at Callbacks, Loggers and ModelHooks, but none of the use cases seem to be for this kind of activity (even if I were to create my own custom variants of each utility). The ModelCheckpoint Callback makes me feel like custom Callbacks would be the way to go, but I'm not quite sure. This closed GitHub issue does address my issue to some extent.
10 Weeks, 20 Lessons, Data Science for All!
Project mention: Data Science for Beginners - A Curriculum | /r/programming | 2023-09-08
Machine Learning From Scratch. Bare bones NumPy implementations of machine learning models and algorithms with a focus on accessibility. Aims to cover everything from linear regression to deep learning.
Project mention: Tutorials on creating primitive ML algorithms from scratch? | /r/learnmachinelearning | 2023-01-24
Build and share delightful machine learning apps, all in Python. 🌟 Star to support our work!
Project mention: Gradio sharable link expires too soon ( 30 mins to 1 hour, instead of lasting 72 hours ) | /r/StableDiffusion | 2023-06-10
I found an issue on the Gradio GitHub, but it looks like it's closed, so I am not sure whether it's still a common issue or whether only I am facing it due to certain settings or the absence of a fix. (https://github.com/gradio-app/gradio/issues/3060)
📝 An awesome Data Science repository to learn and apply for real world problems.
Project mention: Mastering Data Science: Top 10 GitHub Repos You Need to Know | dev.to | 2023-04-24
9. Awesome Data Science
If you’re on the hunt for data science resources, Awesome Data Science is a goldmine. This curated list includes MOOCs, books, courses, blogs, podcasts, software, and more, all related to data science.
The fastai book, published as Jupyter Notebooks
Project mention: Fastai Chapter 4 - The important parts, Part 2: Building a regression model | dev.to | 2023-01-25
The book is available online here. The course is accessible here.
Interactive deep learning book with multi-framework code, math, and discussions. Adopted at 500 universities from 70 countries including Stanford, MIT, Harvard, and Cambridge.
Project mention: which book to chose for deep learning :lan Goodfellow or francois chollet | /r/learnmachinelearning | 2023-04-07
matplotlib: plotting with Python
Project mention: Tkinter, PyGame windows too large on Mac | /r/learnpython | 2023-06-29
as suggested here.
Best Practices on Recommendation Systems
Project mention: My kernel dies when I fit my LightFm model from Microsoft Recommenders | /r/Jupyter | 2023-06-16
Data Science related posts
Introducing SportsLabKit: A Toolkit for Multi-Object Tracking in Sports
1 project | /r/computervision | 24 Sep 2023
plotly-resampler: NEW Data - star count:800.0
1 project | /r/algoprojects | 23 Sep 2023
plotly-resampler: NEW Data - star count:800.0
1 project | /r/algoprojects | 22 Sep 2023
Jupyter Lab Extension to run your GPU-heavy stuff (for free for now) on somebody's else server without blocking yours
2 projects | /r/datascience | 22 Sep 2023
Is ClickHouse Moving Away from Open Source?
6 projects | news.ycombinator.com | 22 Sep 2023
Prism: the easiest way to create robust data workflows. Accessible via CLI
1 project | /r/coolgithubprojects | 21 Sep 2023
Stop LLM/GenAI hallucination fast: Serverless Kendra RAG with GO
2 projects | dev.to | 20 Sep 2023
What are some of the best open-source Data Science projects? This list will help you: