Top 23 Data Science Open-Source Projects
Deep Learning for humans
Project mention: That time I optimized a Python program by 5000x | reddit.com/r/Python | 2022-01-11
The report output for scalene does look much nicer, but the slowness kept me from continuing to use it. Maybe there's some bad interaction with tensorflow/pytest. I can try to make an example, but I'd guess that if you try running it on TensorFlow's actual unit tests (something like this) you'd get similar behavior.
scikit-learn: machine learning in Python
Project mention: scikit-learn test case results? | reddit.com/r/scikit_learn | 2022-01-05
Apache Superset is a Data Visualization and Data Exploration Platform
Project mention: Churn Prediction With BigQueryML to Increase Mobile Game Revenue | dev.to | 2022-01-10
You may recall a blog post in 2020 where we discussed how Torpedo was leveraging the RudderStack Unity SDK to send over 1 billion events per month to Redshift to power a robust analytics dashboard built with SuperSet. This cost-effective solution shed light on what player activities were important for driving engagement and helped refine the definition of high-value player cohorts as well as identify player churn. After crossing the 100 million download milestone, the teams at Wynn Resorts and Torpedo were looking for innovative ways to continue improving the gaming experience, user engagement and most importantly, reduce the churn on those high-value customers making in-app purchases.
Learn how to responsibly deliver value with ML.
Project mention: New to mlops, where do I need to start | reddit.com/r/mlops | 2021-11-01
Standing recommendation for beginners (we should eventually make a wiki) is https://madewithml.com/
12 weeks, 26 lessons, 52 quizzes, classic Machine Learning for all
Three educational courses: Web Dev, ML, and IoT for beginners. A note on using educational resources as a marketing strategy: at least the ML course links to various Azure services. Google does this a lot as well, with Colab notebooks often being used to demo educational materials.
aka "Bayesian Methods for Hackers": An introduction to Bayesian methods + probabilistic programming with a computation/understanding-first, mathematics-second point of view. All in pure Python ;)
Project mention: A/B test improved your website's conversion rate? Not so fast | news.ycombinator.com | 2022-01-07
Bayesian Methods For Hackers is a very popular one.
Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
💫 Industrial-strength Natural Language Processing (NLP) in Python
Machine Learning From Scratch. Bare bones NumPy implementations of machine learning models and algorithms with a focus on accessibility. Aims to cover everything from linear regression to deep learning.
Project mention: Neural Network from Scratch | news.ycombinator.com | 2022-01-04
Interesting find. Just FYI, this repo has been the OG for several years, when it comes to building NN from scratch:
An open source framework that provides a simple, universal API for building distributed applications. Ray is packaged with RLlib, a scalable reinforcement learning library, and Tune, a scalable hyperparameter tuning library.
Project mention: Is it normal to have a negative and near-zero explained variance in PPO? | reddit.com/r/reinforcementlearning | 2021-12-25
I guess I did, as I directly use the PPO agent provided by the RLlib.
📚 Papers & tech blogs by companies sharing their work on data science & machine learning in production.
The second repo I LOVE is Eugene Yan’s Applied ML repository. This is a brilliant idea to create and actually something I was planning on sort of casually doing in my non-existent free time… Anyhow, it is a curated list of technical posts from top engineering teams (Netflix, Amazon, Pinterest, LinkedIn, etc.) detailing how they built out different types of AI/ML systems (e.g. forecasting, recommenders, search and ranking, etc.). Ofc, it focuses on AI/ML, but something similar could be made for the traditional or BI-oriented analytics stack, as well as the streaming world, super high value for practitioners! Btw, one of my favorite things at BCG used to be looking at our IT architecture team’s reference architecture diagrams… the best way to understand technologies is to look at how a ton of stuff is architected… and it’s fun!
:memo: An awesome Data Science repository to learn and apply for real world problems.
Project mention: High income skills? | reddit.com/r/Fire | 2021-12-22
There are several on github, such as: https://github.com/academic/awesome-datascience
Streamlit — The fastest way to build data apps in Python
Project mention: How to Build a Machine Learning Demo in 2022 | dev.to | 2022-01-16
So what if you want something almost as flexible as what is possible with the full-stack approach, but without the development requirements? Well, you are in luck because the past few years have seen the emergence of Python libraries that allow the creation of impressively interactive demos with only a few lines of code. In this article, we are going to focus on two of the most promising libraries: Gradio and Streamlit. There are notable differences between the two that will be explored below, but the high level idea is the same: eliminate most of the painful back and front end work outlined in the full-stack section, albeit at the cost of some flexibility.
The lightweight PyTorch wrapper for high-performance AI research. Scale your models, not the boilerplate.
Project mention: [D] Are you using PyTorch or TensorFlow going into 2022? | reddit.com/r/MachineLearning | 2021-12-14
Is the problem the sheer number of options, or the fact that they are all together in one place? Would it be better if they were organized into the different trainer entrypoints (fit, validate, ...)? If that is the case, there was an RFC proposing this which you might find interesting, feel free to drop by and comment on the issue: https://github.com/PyTorchLightning/pytorch-lightning/issues/10444
You might be able to use Dash for your needs. https://plotly.com/dash/
Roadmap to becoming an Artificial Intelligence Expert in 2021
The AI Expert Roadmap (interactive web page) seems to have taken inspiration from the developer roadmap linked above and is awesome. I LOVE how they separate out different personas, from data scientist, to machine learning, to deep learning, to data engineering, etc. It’s really well done and fun to browse through! It is also kind of fun to juxtapose this with the aforementioned Developer Roadmap, as well as the Analytics Engineers Club, as they collectively cover so much of modern tech in slightly MECE² (#BCG) ways 😃
Official repository for IPython itself. Other repos in the IPython organization contain things like the website, documentation builds, etc.
Project mention: New IPython defaults makes it less useful for education purposes. [Raymond Hettinger on Twitter] | reddit.com/r/Python | 2022-01-15
That's not correct. In the feedback he got (GitHub, Twitter), most people (of the few that replied) came down on "opt-in" and against making it the default.
The fastai book, published as Jupyter Notebooks
Project mention: Starting a career as a Python developer | reddit.com/r/learnpython | 2021-12-20
I’m a fan of fastbook by fastai.
Topic Modelling for Humans
Project mention: Unsupervised Learning for String Matching in Python - can I have advice on how to go about this? | reddit.com/r/learnmachinelearning | 2021-12-16
VIP cheatsheets for Stanford's CS 229 Machine Learning
Project mention: Stanford University Probabilities and Statistics refresher | reddit.com/r/learnmachinelearning | 2021-03-24
A comprehensive list of PyTorch-related content on GitHub, such as different models, implementations, helper libraries, tutorials, etc.
Project mention: Similar open source long library list to TF like Pytorch "ECOSYSTEM TOOLS" | reddit.com/r/tensorflow | 2021-11-19
I got the following as a recommendation from elsewhere - https://github.com/jtoy/awesome-tensorflow and there is one for pt as well https://github.com/bharathgs/Awesome-pytorch-list . Thx for the help :D
Interactive deep learning book with multi-framework code, math, and discussions. Adopted at 300 universities from 55 countries including Stanford, MIT, Harvard, and Cambridge.
Project mention: The Transformer in Machine Translation | reddit.com/r/MindSporeOSS | 2022-01-13
GitHub's article on Dive into Deep Learning
Best Practices on Recommendation Systems
Project mention: Opinion on choice of model - Recommender System | reddit.com/r/datascience | 2021-04-10
Then I tried to find some more advanced models and I found this really good list, and in there I found the Microsoft one. So that's where we are now, with a bunch of different models and not much documentation or many tutorials out there.
Data Science related posts
Best Course to learn TensorFlow?
1 project | reddit.com/r/tensorflow | 17 Jan 2022
Deep Learning Interviews book: Hundreds of fully solved job interview questions from a wide range of key topics in AI.
1 project | reddit.com/r/learnmachinelearning | 16 Jan 2022
Machine Learning for Trading: Notebooks, resources and references accompanying the book Machine Learning for Algorithmic Trading. Courses - star count:5136.0
1 project | reddit.com/r/algoprojects | 16 Jan 2022
Statistical Rethinking (2022 Edition)
2 projects | news.ycombinator.com | 16 Jan 2022
GitHub - BoltzmannEntropy/interviews.ai: Deep Learning Interviews book: Hundreds of fully solved job interview questions from a wide range of key topics in AI
1 project | reddit.com/r/artificial | 16 Jan 2022
[P] Open-source tool for building NLP training sets with weak supervision and search queries
1 project | reddit.com/r/MachineLearning | 16 Jan 2022
akshare: NEW Derivatives and Hedging - star count:4425.0
1 project | reddit.com/r/algoprojects | 15 Jan 2022
What are some of the best open-source Data Science projects? This list will help you: