|6 days ago||9 days ago|
|BSD 3-clause "New" or "Revised" License||BSD 3-clause "New" or "Revised" License|
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
2 projects | reddit.com/r/u_Matusadona_Wild303 | 25 Oct 2022
Machine Learning Pipelines with Spark: Introductory Guide (Part 1)
5 projects | dev.to | 23 Oct 2022
The concepts are similar to the Scikit-learn project. They follow Spark’s “ease of use” characteristic giving you one more reason for adoption. You will learn more about these main concepts in this guide.
How do you programmers make sense of production-level code?
2 projects | reddit.com/r/learnprogramming | 20 Oct 2022
If you look at the README for scikit-learn on GitHub, they say this.
Take a smaller segment to look at. Opening up the front page of a GitHub repo can be quite daunting. https://github.com/scikit-learn/scikit-learn
Talking Data: What do we need for engaging data analytics?
4 projects | dev.to | 6 Oct 2022
Many data workers complain about fierce competition in the data field. Fortunately, the situation seems to be improving. Data analysts used to analyze distribution charts manually to find deep insights, but now they can use machine learning models to automate this process. Traditional data analysis and modeling skills are gradually becoming easier to pick up. For instance, Power BI and Tableau let users generate visual charts and models in a drag-and-drop, low-code fashion, whereas the old way was to import Python libraries such as pandas, matplotlib and sklearn and do the same in a Jupyter Notebook. The open-source projects Apache Superset and Metabase let users analyze data easily in web pages. This is quite similar to the evolution of cameras, from film cameras to digital cameras to the smartphone cameras everyone uses. With ever lower technical barriers, the whole industry can develop fast. "Everyone can be a data analyst" will no longer be a fantasy.
A few (unordered) thoughts about data (1/2)
6 projects | dev.to | 5 Oct 2022
Can anyone share some good examples of Python OOP Repos for DS?
4 projects | reddit.com/r/datascience | 17 Sep 2022
Beginner Friendly Resources to Master Artificial Intelligence and Machine Learning with Python (2022)
8 projects | dev.to | 14 Aug 2022
scikit-learn – Simple and efficient tools for predictive data analysis, built on NumPy, SciPy, and matplotlib
Why do many data scientist use C++ for machine learning?
4 projects | reddit.com/r/learnmachinelearning | 29 Jul 2022
For example, there is PyTorch which is primarily C++ but has Python bindings. Most people use the Python bindings, same for TensorFlow. JAX is mostly Python, same for scikit-learn.
Don't Waste Data! An Experiment with Machine Learning
3 projects | dev.to | 23 Jun 2022
Once we had determined the shape of the data and the features we should focus on, we set out to create a model. (There is a wealth of ML tools available across programming languages like Python and Julia.) We chose scikit-learn, one of the most popular ML libraries around, and plugged the data into a random forest regression. (Say what? Here’s a quick and dirty guide to random forest regression.) As input, we used the ZIP codes of the print partner and the destination of the mailpiece. Our output target was the metric we had calculated during pre-processing: the difference in days between the earliest and latest USPS events recorded for each mailpiece (the mailpiece's time in transit).
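As a rough illustration of the setup described in this excerpt, the sketch below fits a scikit-learn `RandomForestRegressor` on two ZIP-code features and a transit-time target. The data is synthetic, and the variable names, the numeric encoding of ZIP codes, and the hyperparameters are all assumptions made for the example, not details from the original post.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_mailpieces = 500

# Synthetic stand-in data (assumption): ZIP codes treated as plain integers.
origin_zip = rng.integers(10000, 99999, size=n_mailpieces)
destination_zip = rng.integers(10000, 99999, size=n_mailpieces)

# Made-up target: days in transit grow with the numeric ZIP difference, plus noise.
transit_days = (
    1.0
    + np.abs(origin_zip - destination_zip) / 20000
    + rng.normal(0.0, 0.5, size=n_mailpieces)
)

# Feature matrix: one row per mailpiece, columns = (origin ZIP, destination ZIP).
X = np.column_stack([origin_zip, destination_zip])
X_train, X_test, y_train, y_test = train_test_split(
    X, transit_days, test_size=0.2, random_state=0
)

# Fit the forest and predict transit time for the held-out mailpieces.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

Treating ZIP codes as integers is a simplification; a real pipeline might encode them differently, for example by region prefix.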
Dislike button would improve Spotify's recommendations
4 projects | news.ycombinator.com | 16 Oct 2021
I spent the latter half of 2019 trying to build this as a startup. Ultimately I pivoted (now I do newsletter recommendations instead), but if I hadn't made some mistakes I think it could've gotten more traction. Mostly I should've simplified the idea to make it easier to build. If anyone's interested in working on this, here's what I would do:
(But first some background: The way I saw it, you can split music recommendation into two tasks: (1) picking a song you already know that should be played right now, and (2) picking a new song you've never heard of before. (Music recommendation is unique in this way since in most other domains there isn't much value in re-recommending items). I think #1 is more important, and if you nail that, you can do a so-so job of #2 and still have a good system.)
Make a website that imports your Last.fm history. Organize the history into sessions (say, groups of listen events with a >= 30 minute gap in between). Feed those sessions into a collaborative filtering library like Surprise, as a CSV of `, , 1` (1 being a rating--in this case we only have positive ratings). Then make some UI that lets people create and export playlists. e.g. I pick a couple seed songs from my listening history, then the app uses Surprise to suggest more songs. Present a list of 10 songs at a time. Click a song to add it, and have a "skip all" button that gets a new list of songs. Save these interactions as ratings--e.g. if I skip a song, that's a -1 rating for this playlist. For some percentage of the suggestions (20% by default? Make it configurable), use Last.fm's or Spotify's API to pick a new song not in your history, based on the songs in the current playlist. Also sometimes include songs that were added to the playlist previously--if you skip them, they get removed from the playlist. Then you can spend a couple minutes every week refreshing your playlists. Export the playlists to Spotify/Apple Music/whatever.
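The session-splitting step described above can be sketched with the standard library alone. This is a minimal sketch under assumptions: the listen events are hypothetical `(timestamp, song_id)` pairs rather than real Last.fm data, and the CSV columns are assumed to be a session identifier, a song identifier, and the constant rating 1, since the exact column names are not legible in the comment.

```python
import csv
import io
from datetime import datetime, timedelta

# A new session starts when the gap between listens is at least 30 minutes.
SESSION_GAP = timedelta(minutes=30)


def split_into_sessions(listens, gap=SESSION_GAP):
    """Group (timestamp, song_id) pairs into lists of song ids, one per session."""
    sessions = []
    previous_time = None
    for played_at, song_id in sorted(listens):
        if previous_time is None or played_at - previous_time >= gap:
            sessions.append([])  # gap reached: open a new session
        sessions[-1].append(song_id)
        previous_time = played_at
    return sessions


def sessions_to_csv(sessions):
    """Write one row per (session, song) pair with a constant rating of 1."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for session_id, songs in enumerate(sessions):
        for song_id in sorted(set(songs)):  # one row per distinct song
            writer.writerow([session_id, song_id, 1])
    return buffer.getvalue()


# Hypothetical listening history used only for illustration.
listens = [
    (datetime(2021, 10, 16, 9, 0), "song_a"),
    (datetime(2021, 10, 16, 9, 4), "song_b"),
    (datetime(2021, 10, 16, 9, 45), "song_c"),  # 41-minute gap: new session
    (datetime(2021, 10, 16, 9, 49), "song_a"),
]
sessions = split_into_sessions(listens)
csv_text = sessions_to_csv(sessions)
```

In this layout each session plays the role a user normally plays in collaborative filtering, so a file like `csv_text` could presumably be handed to a library that accepts user, item, rating triples.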
As you get more users, you can do "regular" collaborative filtering (i.e. with different users) to recommend new songs instead of relying on external APIs. There are probably lots of other things you could do too--e.g. scrape wikipedia to figure out what artists have done collaborations or something. In general I think the right approach is to build a model for artist similarity rather than individual song similarity. At recommendation time, you pick an artist and then suggest their top songs (and sometimes pick an artist already in the user's history, and suggest songs they haven't heard yet--that's even easier).
This is the simplest thing I can think of that would solve my "I love music but I listen to the same old songs every day because I'm busy and don't want to futz around with curating my music library" problem. You wouldn't have to waste time building a crappy custom music app, and users wouldn't have to use said crappy custom music app (speaking from personal experience...). You wouldn't have to deal with music rights or integrating with Spotify/Apple Music since you're not actually playing any music.
If you want to go further with it, you could get traction first and then launch your own streaming service or something. (Reminds me a bit of Readwise starting with just highlights and then launching their own reader recently). I think it'd be neat to make an indie streaming service--kind of like Bandcamp but with an algorithm to help you find the good stuff. Let users upload and listen to their own MP3s so it can still work with popular music. Of course it'd be nicer for users in the short term if you just made deals with the big record labels, however this would help you not end up in Spotify's position of pivoting to podcasts so you can get out of paying record labels. And then maybe in a few decades all the good music won't be on the big labels anyway :).
Anyway if anyone is remotely interested in building something like this, I'll be your first user. I really need it. Otherwise I'll probably build it myself at some point in the next year or two as a side project.
What are some alternatives?
LightFM - A Python implementation of LightFM, a hybrid recommendation algorithm.
Keras - Deep Learning for humans
Prophet - Tool for producing high quality forecasts for time series data that has multiple seasonality with linear or non-linear growth.
tensorflow - An Open Source Machine Learning Framework for Everyone
gensim - Topic Modelling for Humans
H2O - H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
python-recsys - A python library for implementing a recommender system
seqeval - A Python framework for sequence labeling evaluation (named-entity recognition, POS tagging, etc.)
MLflow - Open source platform for the machine learning lifecycle
xgboost - Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow