homemade-machine-learning
hdbscan
| | homemade-machine-learning | hdbscan |
|---|---|---|
| Mentions | 7 | 6 |
| Stars | 21,324 | 2,445 |
| Growth | - | 2.1% |
| Activity | 5.2 | 6.9 |
| Last commit | 25 days ago | 2 months ago |
| Language | Jupyter Notebook | Jupyter Notebook |
| License | MIT License | BSD 3-clause "New" or "Revised" License |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
homemade-machine-learning
- ✨ 5 Best GitHub Repositories to Learn Machine Learning in 2022 for Free 💯
  4️⃣ Homemade Machine Learning
hdbscan
- Introducing the Semantic Graph
  A number of excellent topic modeling libraries exist in Python today. BERTopic and Top2Vec are two of the most popular. Both use sentence-transformers to encode data into vectors, UMAP for dimensionality reduction and HDBSCAN to cluster nodes.
- Introduction to K-Means Clustering
  Working in spatial data science, I rarely find applications where k-means is the best tool. The problem is that it is difficult to know how many clusters to expect on a map. Is it 5, 500, or 10,000? Here HDBSCAN [1] shines because it will cluster _and_ select the most suitable number of clusters when cutting the single-linkage cluster tree.
- [D] Good algorithm for clustering big data (sentences represented as embeddings)?
  Maybe use (H)DBSCAN, which I think should also work for huge datasets. I don't think there is a ready-to-use clustering implementation with a built-in cosine similarity metric, and you also won't be able to precompute the dense 100k × 100k similarity matrix. The only way to go is to L2-normalize your embeddings; the dot product then equals the cosine similarity, so Euclidean distance on the normalized vectors serves as a proxy for cosine distance. See also https://github.com/scikit-learn-contrib/hdbscan/issues/69
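The normalization trick in the last comment rests on a simple identity: for unit-length vectors, the squared Euclidean distance equals 2 × (1 − cosine similarity). The sketch below checks this numerically using only the standard library. The helper names (`l2_normalize`, `dot`, `cosine_similarity`, `squared_euclidean`) and the random 8-dimensional vectors are illustrative assumptions, not part of any of the libraries mentioned here.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine similarity computed on the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def squared_euclidean(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Two made-up "embeddings"; real ones would come from an encoder model.
rng = random.Random(0)
a = [rng.gauss(0, 1) for _ in range(8)]
b = [rng.gauss(0, 1) for _ in range(8)]

a_unit, b_unit = l2_normalize(a), l2_normalize(b)
cos_ab = cosine_similarity(a, b)

# After normalization the dot product is the cosine similarity ...
assert math.isclose(dot(a_unit, b_unit), cos_ab, abs_tol=1e-12)
# ... and squared Euclidean distance is a decreasing function of it.
assert math.isclose(
    squared_euclidean(a_unit, b_unit), 2.0 * (1.0 - cos_ab), abs_tol=1e-12
)
```

Because the relationship is monotonic, ranking neighbors by Euclidean distance on the normalized vectors gives the same order as ranking them by cosine distance. This is why a clusterer that only supports a Euclidean metric can stand in for a cosine-based one.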
What are some alternatives?
faiss - A library for efficient similarity search and clustering of dense vectors.
Top2Vec - Top2Vec learns jointly embedded topic, document and word vectors.
lego-mindstorms - My LEGO MINDSTORMS projects (using set 51515 electronics)
Milvus - A cloud-native vector database, storage for next generation AI applications
wordle-solver - For educational purposes, a simple script that assists in solving the word game Wordle.
rmi - A learned index structure
PyImpetus - PyImpetus is a Markov Blanket based feature subset selection algorithm that considers features both separately and together as a group in order to provide not just the best set of features but also the best combination of features
PythonRobotics - Python sample codes for robotics algorithms.
raku-jupyter-kernel - Raku Kernel for Jupyter/IPython notebooks
CFDPython - A sequence of Jupyter notebooks featuring the "12 Steps to Navier-Stokes" http://lorenabarba.com/
100DaysofMLCode - My journey to learn and grow in the domain of Machine Learning and Artificial Intelligence by performing the #100DaysofMLCode Challenge. Now supported by bright developers adding their learnings :+1:
AlgorithmicTrading - This repository contains three ways to obtain arbitrage which are Dual Listing, Options and Statistical Arbitrage. These are projects in collaboration with Optiver and have been peer-reviewed by staff members of Optiver.