Top 23 Data Science Open-Source Projects
-
Project mention: Can someone explain how keras code gets into the Tensorflow package? | /r/tensorflow | 2023-07-24
I'm guessing the "real" keras code is coming from the keras repository. Is that a correct assumption? How does that version of Keras get there? If I wanted to write my own activation layer next to ELU, where exactly would I do that?
-
It looks like you've been the victim of some misinformation. As Dr_Birdbrain said, an SVM is a convex problem with a unique global optimum. sklearn.svm.SVC relies on libsvm, which initializes the weights to 0 [0]. The random state is only used to shuffle the data when computing probability estimates with Platt scaling [1]. Of the random_state parameter, the sklearn documentation for SVC [2] says:
Controls the pseudo random number generation for shuffling the data for probability estimates. Ignored when probability is False. Pass an int for reproducible output across multiple function calls. See Glossary.
[0] https://github.com/scikit-learn/scikit-learn/blob/2a2772a87b...
[1] https://en.wikipedia.org/wiki/Platt_scaling
[2] https://scikit-learn.org/stable/modules/generated/sklearn.sv...
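To make the Platt scaling step mentioned above more concrete, here is a minimal sketch in plain Python. It fits a sigmoid that maps raw decision scores to probabilities. The function names (`fit_platt`, `platt_probability`), the toy scores, and the use of plain gradient descent are all assumptions made for illustration; this is not libsvm's or scikit-learn's actual solver. In scikit-learn the scores fed to this step come from an internal cross-validation, which is where the data shuffling (and therefore random_state) enters.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_platt(scores, labels, lr=0.1, epochs=2000):
    """Fit P(y=1 | f) = sigmoid(a*f + b) by minimizing log loss.

    Hypothetical helper: uses full-batch gradient descent for brevity.
    Platt's formulation writes the model as 1 / (1 + exp(A*f + B)),
    which is the same model with a = -A and b = -B.
    """
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = 0.0
        grad_b = 0.0
        for f, y in zip(scores, labels):
            err = sigmoid(a * f + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * f
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def platt_probability(score, a, b):
    # Map one raw decision score to a calibrated probability.
    return sigmoid(a * score + b)


# Toy decision scores (e.g. signed distances to the margin) and true labels.
scores = [-2.1, -1.3, -0.4, 0.2, 0.3, 1.1, 1.8, 2.5]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

a, b = fit_platt(scores, labels)
print(platt_probability(-2.1, a, b), platt_probability(2.5, a, b))
```

The SVM fit itself is deterministic; only the data split that produces the held-out scores for this calibration step depends on a random seed.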
-
Project mention: Apache Superset Is a Data Visualization and Data Exploration Platform | news.ycombinator.com | 2023-09-11
-
I don't recommend jumping directly into natural language processing without understanding artificial intelligence theory. I personally recommend starting with the basics (regression, classification, and clustering, for example) and then moving on to more advanced topics. You already know software development, which is a big step, so some concepts should be easier to understand. Maybe follow Microsoft's Machine Learning for Beginners curriculum? It looks like a good overall roadmap for not burning out on NLP right away.
-
Pandas
Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
Project mention: Interacting with Amazon S3 using AWS Data Wrangler (awswrangler) SDK for Pandas: A Comprehensive Guide | dev.to | 2023-08-20
AWS Data Wrangler is a Python library that simplifies the process of interacting with various AWS services, built on top of some useful data tools and open-source projects such as Pandas, Apache Arrow and Boto3. It offers streamlined functions to connect to, retrieve, transform, and load data from AWS services, with a strong focus on Amazon S3.
-
Project mention: [D] How do you keep up to date on Machine Learning? | /r/learnmachinelearning | 2023-08-13
Made With ML
-
Ray
Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
Project mention: Fine-Tuning Llama-2: A Comprehensive Case Study for Tailoring Custom Models | news.ycombinator.com | 2023-08-11
Training times for GSM8k are mentioned here: https://github.com/ray-project/ray/tree/master/doc/source/te...
-
Project mention: Stop LLM/GenAI hallucination fast: Serverless Kendra RAG with GO | dev.to | 2023-09-20
In evaluating whether a technical solution is suitable, the focus is on the simplicity of development. So, the AWS sample uses the well-known langchain library and a streamlit server for the chat sample.
-
Project mention: Retrieval Augmented Generation (RAG): How To Get AI Models Learn Your Data & Give You Answers | dev.to | 2023-09-18
-
Project mention: Suggest which roadmap should I follow for ML? | /r/learnmachinelearning | 2023-07-20
1. https://i.am.ai/roadmap
-
Probabilistic-Programming-and-Bayesian-Methods-for-Hackers
aka "Bayesian Methods for Hackers": An introduction to Bayesian methods + probabilistic programming with a computation/understanding-first, mathematics-second point of view. All in pure Python ;)
Also, this is quite a nice practical introduction that might help with finding answers to your questions: https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers
-
data-science-ipython-notebooks
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
-
applied-ml
📚 Papers & tech blogs by companies sharing their work on data science & machine learning in production.
Project mention: [D] Favorite ML Youtube Channels/Blogs/Newsletters | /r/MachineLearning | 2023-04-08
Also, have any of you stumbled across any cool GitHub repos like this one: https://github.com/eugeneyan/applied-ml ?
-
Project mention: Best practice for saving logits/activation values of model in PyTorch Lightning | /r/deeplearning | 2023-07-19
I've been wondering what the recommended method is for saving logits/activations using PyTorch Lightning. I've looked at Callbacks, Loggers, and ModelHooks, but none of their use cases seem to cover this kind of activity (even if I were to create my own custom variants of each utility). The ModelCheckpoint callback makes me feel that custom Callbacks would be the way to go, but I'm not quite sure. This closed GitHub issue does address my issue to some extent.
-
ML-From-Scratch
Machine Learning From Scratch. Bare bones NumPy implementations of machine learning models and algorithms with a focus on accessibility. Aims to cover everything from linear regression to deep learning.
Project mention: Tutorials on creating primitive ML algorithms from scratch? | /r/learnmachinelearning | 2023-01-24
ml-from-scratch
-
Project mention: Gradio sharable link expires too soon ( 30 mins to 1 hour, instead of lasting 72 hours ) | /r/StableDiffusion | 2023-06-10
I found an issue on the Gradio GitHub, but it looks like it's closed, so I am not sure whether it's still a common issue or whether only I am facing it due to certain settings or a missing fix ( https://github.com/gradio-app/gradio/issues/3060 ).
-
awesome-datascience
📝 An awesome Data Science repository to learn and apply for real world problems.
9. Awesome Data Science: If you’re on the hunt for data science resources, Awesome Data Science is a goldmine. This curated list includes MOOCs, books, courses, blogs, podcasts, software, and more, all related to data science.
-
Project mention: [Python] NiceGUI: Let any browser be the frontend for your Python code | /r/aufdeutsch | 2023-04-25
Of course there are valid use cases for splitting frontend and backend technologies. NiceGUI is for those who don’t want to leave the Python ecosystem and like to reap the benefits of having all code in one place. There are other options like Streamlit, Dash, Anvil, JustPy, and Pynecone. But we initially created NiceGUI to easily handle the state of external hardware like LEDs, motors, and cameras. Additionally, we wanted to offer a gentle learning curve while still providing the ability to go all the way down to HTML, CSS, and JavaScript if needed.
-
Project mention: Fastai Chapter 4 - The important parts, Part 2: Building a regression model | dev.to | 2023-01-25
The book is available online here. The course is accessible here.
-
d2l-en
Interactive deep learning book with multi-framework code, math, and discussions. Adopted at 500 universities from 70 countries including Stanford, MIT, Harvard, and Cambridge.
Project mention: which book to chose for deep learning :lan Goodfellow or francois chollet | /r/learnmachinelearning | 2023-04-07
as suggested here.
-
Project mention: My kernel dies when I fit my LightFm model from Microsoft Recommenders | /r/Jupyter | 2023-06-16
-
Data Science related posts
- Introducing SportsLabKit: A Toolkit for Multi-Object Tracking in Sports
- plotly-resampler: NEW Data - star count:800.0
- Jupyter Lab Extension to run your GPU-heavy stuff (for free for now) on somebody's else server without blocking yours
- Is ClickHouse Moving Away from Open Source?
- Prism: the easiest way to create robust data workflows. Accessible via CLI
- Stop LLM/GenAI hallucination fast: Serverless Kendra RAG with GO
Index
What are some of the best open-source Data Science projects? This list will help you:
# | Project | Stars |
---|---|---|
1 | Keras | 59,372 |
2 | scikit-learn | 55,910 |
3 | superset | 54,269 |
4 | ML-For-Beginners | 53,590 |
5 | Pandas | 39,797 |
6 | Made-With-ML | 34,174 |
7 | Ray | 27,697 |
8 | streamlit | 27,263 |
9 | spaCy | 27,161 |
10 | AI-Expert-Roadmap | 26,781 |
11 | Probabilistic-Programming-and-Bayesian-Methods-for-Hackers | 25,865 |
12 | data-science-ipython-notebooks | 25,585 |
13 | applied-ml | 24,749 |
14 | lightning | 24,653 |
15 | Data-Science-For-Beginners | 22,550 |
16 | ML-From-Scratch | 22,390 |
17 | gradio | 21,988 |
18 | awesome-datascience | 21,847 |
19 | dash | 19,382 |
20 | fastbook | 19,211 |
21 | d2l-en | 19,183 |
22 | matplotlib | 18,108 |
23 | recommenders | 16,359 |