| |gensim|scikit-learn|
|Latest commit|8 days ago|about 5 hours ago|
|License|GNU Lesser General Public License v2.1 only|BSD 3-clause "New" or "Revised" License|
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
Gensim – a Python library for topic modelling, document indexing
1 project | news.ycombinator.com | 25 Nov 2021
How to build a search engine with word embeddings
2 projects | dev.to | 22 Nov 2021
We will be using gensim to load our Google News pre-trained word vectors. Find the code for this here.
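As a rough illustration of the idea behind embedding-based search, the sketch below ranks documents by cosine similarity between averaged word vectors. It is not the article's code and does not call gensim: `TOY_VECTORS`, `embed`, `cosine`, and `search` are hypothetical names, and the tiny 3-dimensional vectors stand in for real pre-trained embeddings.

```python
import math

# Hypothetical toy embeddings; real pre-trained vectors have far more dimensions.
TOY_VECTORS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "car": [0.0, 0.1, 0.9],
    "truck": [0.1, 0.0, 0.9],
}


def embed(text, vectors):
    """Average the vectors of the known words in text; None if no word is known."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query, documents, vectors):
    """Return the documents ordered from most to least similar to the query."""
    q = embed(query, vectors)
    if q is None:
        return []
    scored = []
    for doc in documents:
        d = embed(doc, vectors)
        if d is not None:
            scored.append((cosine(q, d), doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]
```

With real vectors, the dictionary lookup in `embed` would be replaced by a lookup in the loaded embedding model; the ranking logic stays the same.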
The unthinking application of this regex-efficiency check wasted our attention
1 project | news.ycombinator.com | 30 Sep 2021
The Levenshtein Distance in Production
4 projects | news.ycombinator.com | 6 Jun 2021
> Problem statement: the Levenshtein distance is a string metric for measuring the difference between two sequences
Another variant is "I have a bunch of words (a dictionary) and one query word, and want to find all words from the dictionary that are close to the query word".
This leads to an interesting class of problems, because you can do clever things where you precompute search structures (Levenshtein automata) from the dictionary. The similarity queries then run (much) faster; in production, performance matters.
We recently merged a PR like that into Gensim.
This gave a ~1,500x speed-up compared to naively comparing all pairwise strings with Levenshtein distance. A difference between the training step running for years (=unusable) and minutes.
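For reference, the naive baseline mentioned above can be sketched as follows: compute the Levenshtein distance between the query word and every dictionary word. This is only an illustrative sketch, not the Gensim implementation and not a Levenshtein automaton; `levenshtein` and `close_words` are hypothetical names.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def close_words(query, dictionary, max_distance):
    """Naive scan: compare the query against every word in the dictionary."""
    return [w for w in dictionary if levenshtein(query, w) <= max_distance]
```

Each query costs one full distance computation per dictionary word, which is the cost that precomputed search structures are meant to avoid.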
Superior tools to Gensim's similarity
1 project | reddit.com/r/LanguageTechnology | 20 Mar 2021
So Gensim's Similarity module seems like a good fit for this problem, especially soft cosine similarity. But I'm not fully comfortable with that choice, because transformers have become very popular lately.
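For context, soft cosine similarity extends ordinary cosine similarity with a term-similarity matrix, so that related but different words still contribute to the score. The sketch below shows the formula on plain lists; it does not use Gensim's API, and `soft_cosine` and the example matrix are assumptions made for illustration.

```python
import math


def soft_cosine(a, b, sim):
    """Soft cosine of bag-of-words vectors a and b.

    sim[i][j] is the assumed similarity between term i and term j
    (1.0 on the diagonal). With an identity matrix this reduces to
    ordinary cosine similarity.
    """
    def form(x, y):
        return sum(
            sim[i][j] * x[i] * y[j]
            for i in range(len(x))
            for j in range(len(y))
        )

    denom = math.sqrt(form(a, a)) * math.sqrt(form(b, b))
    return form(a, b) / denom if denom else 0.0


# Hypothetical vocabulary: ["play", "game", "car"]; "play" and "game" are related.
RELATED = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
IDENTITY = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
```

Two documents that share no words score zero under plain cosine (`IDENTITY`) but can score above zero once term relatedness is taken into account (`RELATED`).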
Koan: A word2vec negative sampling implementation with correct CBOW update
2 projects | news.ycombinator.com | 2 Jan 2021
Apparently it did: https://github.com/RaRe-Technologies/gensim/issues/1873
Data Science toolset summary from 2021
13 projects | dev.to | 13 Nov 2021
Scikit-learn - It is one of the most widely used frameworks for Python-based data science tasks. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. Link - https://scikit-learn.org/
Intel Extension for Scikit-Learn
4 projects | news.ycombinator.com | 1 Nov 2021
Some work is currently being done to improve the computational primitives of scikit-learn and enhance its overall performance natively.
You can have a look at this exploratory PR: https://github.com/scikit-learn/scikit-learn/pull/20254
This other PR is a clear revamp of this previous one:
Scikit-Learn Version 1.0
11 projects | news.ycombinator.com | 14 Sep 2021
Just to clarify, scikit-learn 1.0 has not been released yet. The latest tag in the GitHub repo is 1.0.rc2.
Top 10 Python Libraries for Machine Learning
14 projects | dev.to | 9 Sep 2021
Website: https://scikit-learn.org/
GitHub Repository: https://github.com/scikit-learn/scikit-learn
Developed By: SkLearn.org
Primary Purpose: Predictive Data Analysis and Data Modeling
where is binary_metric function in sklearn package
1 project | reddit.com/r/learnmachinelearning | 20 Aug 2021
There is a function named binary_metric in https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/metrics/_base.py
Use Scikit-Learn and Runflow
2 projects | dev.to | 6 Jul 2021
If you're not familiar with Scikit-learn and Runflow,
Confused as to what exactly a piece of code does
1 project | reddit.com/r/learnmachinelearning | 18 Jun 2021
well you can start at https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/model_selection/_validation.py, or maybe someone will guide you later
What Makes Python Libraries So Important For Data Science Learning?
3 projects | reddit.com/r/u_Snoo36930 | 16 Jun 2021
Next comes the complexity of drawing the maximum possible number of valuable insights. Using different Python libraries such as Scikit-Learn, PyTorch, Pandas, etc., complications of data analysis can be solved within a minute. And the complexity associated with visualisation gets handled by other data visualisation libraries like Matplotlib, PyTorch, etc.
Is there a way to map cluster centers back to a dataframe?
1 project | reddit.com/r/learnpython | 19 May 2021
To avoid the issue with convergence (and the discrepancy between the labels_ and cluster_centers_), you can set tol=0, though this can of course lead to issues if convergence is a problem. There was an issue about it here. Assuming it's converged, then the order is fine.
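The mapping itself can be sketched without any library: once each row has a cluster label, its center is simply the entry at that index in the list of centers. In the sketch below, `labels` and `centers` stand in for a fitted model's `labels_` and `cluster_centers_`, and `attach_centers` is a hypothetical helper, not part of scikit-learn or pandas.

```python
def attach_centers(rows, labels, centers):
    """Pair every row with its cluster label and that cluster's center.

    Assumes labels[i] is the index into centers for rows[i], which is how
    a fitted k-means model's labels and centers line up once converged.
    """
    if len(rows) != len(labels):
        raise ValueError("rows and labels must have the same length")
    return [
        {"row": row, "cluster": label, "center": centers[label]}
        for row, label in zip(rows, labels)
    ]


# Hypothetical data: three points assigned to two clusters.
rows = [[1.0, 1.0], [1.2, 0.8], [9.0, 9.0]]
labels = [0, 0, 1]
centers = [[1.1, 0.9], [9.0, 9.0]]
mapped = attach_centers(rows, labels, centers)
```

With a dataframe, the same indexing idea applies: index the array of centers by the label column to get one center per row.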
Any from scratch Hamming Loss implementations?
1 project | reddit.com/r/LearnML | 10 May 2021
The source code for the function you refer to is quite straightforward anyway. The definition of count_nonzero() is here.
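A from-scratch version is short. The sketch below treats Hamming loss as the fraction of individual labels that differ between the true and predicted values; it is an illustrative reimplementation on plain lists, not the scikit-learn source, and it ignores sample weights.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of labels that are predicted incorrectly.

    Accepts either flat lists (one label per sample) or lists of lists
    (multilabel indicator rows of equal length).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    def as_row(item):
        # Wrap single labels so both input shapes share one code path.
        return list(item) if isinstance(item, (list, tuple)) else [item]

    mismatches = 0
    total = 0
    for t_item, p_item in zip(y_true, y_pred):
        t_row, p_row = as_row(t_item), as_row(p_item)
        if len(t_row) != len(p_row):
            raise ValueError("label rows must have the same length")
        for t, p in zip(t_row, p_row):
            total += 1
            if t != p:
                mismatches += 1
    return mismatches / total if total else 0.0
```

For example, one wrong label out of four single-label predictions gives a loss of 0.25.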
What are some alternatives?
Keras - Deep Learning for humans
Surprise - A Python scikit for building and analyzing recommender systems
Prophet - Tool for producing high quality forecasts for time series data that has multiple seasonality with linear or non-linear growth.
tensorflow - An Open Source Machine Learning Framework for Everyone
MLflow - Open source platform for the machine learning lifecycle
TFLearn - Deep learning library featuring a higher-level API for TensorFlow.
seqeval - A Python framework for sequence labeling evaluation (named-entity recognition, POS tagging, etc.)
BERTopic - Leveraging BERT and c-TF-IDF to create easily interpretable topics.