|6 days ago||13 days ago|
|BSD 3-clause "New" or "Revised" License||GNU Lesser General Public License v2.1 only|
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
2 projects | reddit.com/r/u_Matusadona_Wild303 | 25 Oct 2022
Machine Learning Pipelines with Spark: Introductory Guide (Part 1)
5 projects | dev.to | 23 Oct 2022
The concepts are similar to the Scikit-learn project. They follow Spark’s “ease of use” characteristic giving you one more reason for adoption. You will learn more about these main concepts in this guide.
How do you programmers make sense of production-level code?
2 projects | reddit.com/r/learnprogramming | 20 Oct 2022
If you look at the README for scikit-learn on GitHub, they say this.
Take a smaller segment to look at. Opening up the front page to a Github repo can be quite daunting. https://github.com/scikit-learn/scikit-learn
Talking Data: What do we need for engaging data analytics?
4 projects | dev.to | 6 Oct 2022
Many data workers complain about fierce competition in the data field. Fortunately, the situation seems to be improving. Data analysts used to analyze distribution charts manually to find deep insights, but now they can use machine learning models to automate this process. Traditional data analysis and modeling skills have gradually become easier to acquire. For instance, Power BI and Tableau let users generate visual charts and models in a drag-and-drop, low-code fashion, whereas the old way is to import Python libraries such as pandas, matplotlib and sklearn and do the same in a Jupyter Notebook. The open-source projects Apache Superset and Metabase let users analyze data easily in web pages. This is quite similar to the evolution of cameras, from film cameras to digital cameras to the smartphone cameras used by everyone. With ever lower technical barriers, the whole industry can develop fast. "Everyone can be a data analyst" will no longer be a fantasy.
A few (unordered) thoughts about data (1/2)
6 projects | dev.to | 5 Oct 2022
Can anyone share some good examples of Python OOP Repos for DS?
4 projects | reddit.com/r/datascience | 17 Sep 2022
Beginner Friendly Resources to Master Artificial Intelligence and Machine Learning with Python (2022)
8 projects | dev.to | 14 Aug 2022
scikit-learn – Simple and efficient tools for predictive data analysis, built on NumPy, SciPy, and matplotlib
Why do many data scientist use C++ for machine learning?
4 projects | reddit.com/r/learnmachinelearning | 29 Jul 2022
For example, there is PyTorch which is primarily C++ but has Python bindings. Most people use the Python bindings, same for TensorFlow. JAX is mostly Python, same for scikit-learn.
Don't Waste Data! An Experiment with Machine Learning
3 projects | dev.to | 23 Jun 2022
Once we had determined the shape of the data and the features we should focus on, we set out to create a model. (There is a wealth of ML tools available across programming languages like Python and Julia.) We chose scikit-learn, one of the most popular ML libraries around, and plugged the data into a random forest regression. (Say what? Here’s a quick and dirty guide to random forest regression.) As input, we used the ZIP codes of the print partner and the destination of the mailpiece. Our output target was the metric we had calculated during pre-processing: the difference in days between the earliest and latest USPS events recorded for each mailpiece (the mailpiece's time in transit).
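The setup described in this excerpt can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the data is synthetic, and the column layout (origin ZIP, destination ZIP) and the way the target is generated are assumptions made only so the example runs.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption): the excerpt does not publish its dataset.
rng = np.random.default_rng(0)
n_samples = 200
origin_zip = rng.integers(10000, 99999, size=n_samples)  # print partner ZIP
dest_zip = rng.integers(10000, 99999, size=n_samples)    # mailpiece destination ZIP

# Features: the two ZIP codes, treated as plain numbers for simplicity.
X = np.column_stack([origin_zip, dest_zip])

# Target: days in transit. Here it is a made-up function of ZIP "distance"
# plus noise, purely so the model has something to fit.
y = np.abs(origin_zip - dest_zip) / 20000 + rng.normal(0, 0.5, size=n_samples)

# Fit a random forest regression on the features.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)

# Predict time in transit for the first five mailpieces.
predictions = model.predict(X[:5])
```

Treating ZIP codes as numbers is a simplification; a real model might encode them categorically or map them to coordinates.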
Is it home bias or is data wrangling for machine learning in python much less intuitive and much more burdensome than in R?
2 projects | reddit.com/r/rstats | 24 Aug 2022
Standout Python NLP libraries include spaCy and Gensim, as well as pre-trained model availability in Hugging Face. These libraries have widespread use in and support from industry and it shows. spaCy has best-in-class methods for pre-processing text for further applications. Gensim helps you manage your corpus of documents, and contains a lot of different tools for solving a common industry task, topic modeling.
Topic modelling with Gensim and SpaCy on startup news
3 projects | dev.to | 17 Jan 2022
For the topic modelling itself, I am going to use the Gensim library by Radim Rehurek, which is very developer friendly and easy to use.
Unsupervised Learning for String Matching in Python - can I have advice on how to go about this?
2 projects | reddit.com/r/learnmachinelearning | 16 Dec 2021
How to build a search engine with word embeddings
2 projects | dev.to | 22 Nov 2021
We will be using gensim to load our Google News pre-trained word vectors. Find the code for this here.
The Levenshtein Distance in Production
4 projects | news.ycombinator.com | 6 Jun 2021
> Problem statement: the Levenshtein distance is a string metric for measuring the difference between two sequences
Another variant is "I have a bunch of words (a dictionary) and one query word, and want to find all words from the dictionary that are close to the query word".
This leads to an interesting class of problems, because you can do clever things where you precompute search structures (Levenshtein automata) from the dictionary. The similarity queries then run (much) faster – in production, performance matters.
We recently merged a PR like that into Gensim.
This gave a ~1,500x speed-up compared to naively comparing all pairwise strings with Levenshtein distance. That is the difference between a training step that runs for years (=unusable) and one that runs in minutes.
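For reference, the naive baseline mentioned in this comment (compare the query word against every dictionary word) can be sketched in plain Python. This is not Gensim's implementation and it does not build a Levenshtein automaton; the function names `levenshtein` and `close_words` are hypothetical and chosen only for this illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def close_words(query, dictionary, max_distance=1):
    """Naive lookup: one full distance computation per dictionary word."""
    return [word for word in dictionary if levenshtein(query, word) <= max_distance]


matches = close_words("cat", ["cat", "cart", "dog", "bat"], max_distance=1)
```

Each query costs one distance computation per dictionary word, which is the cost that precomputed search structures are meant to avoid.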
Koan: A word2vec negative sampling implementation with correct CBOW update
2 projects | news.ycombinator.com | 2 Jan 2021
Apparently it did: https://github.com/RaRe-Technologies/gensim/issues/1873
What are some alternatives?
Keras - Deep Learning for humans
Surprise - A Python scikit for building and analyzing recommender systems
Prophet - Tool for producing high quality forecasts for time series data that has multiple seasonality with linear or non-linear growth.
tensorflow - An Open Source Machine Learning Framework for Everyone
BERTopic - Leveraging BERT and c-TF-IDF to create easily interpretable topics.
MLflow - Open source platform for the machine learning lifecycle
H2O - H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
seqeval - A Python framework for sequence labeling evaluation (named-entity recognition, POS tagging, etc.)