Metric | pyLDAvis | Top2Vec
---|---|---
Mentions | 2 | 13
Stars | 1,785 | 2,848
Growth | - | -
Activity | 5.0 | 6.2
Latest commit | 22 days ago | 9 days ago
Language | Jupyter Notebook | Python
License | BSD 3-clause "New" or "Revised" License | BSD 3-clause "New" or "Revised" License
Stars - the number of stars that a project has on GitHub. Growth - month-over-month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits carry more weight than older ones.
For example, an activity of 9.0 indicates that a project is among the top 10% of the most actively developed projects that we are tracking.
pyLDAvis
- How can I group domain specific keywords based on their word embeddings?
- Doing a TF-IDF implementation to be able to create a Pandas DataFrame column called "TF-IDF"
  Also, although it unfortunately hasn't been updated in a while (probably because LDA has gone out of fashion as NLP has increasingly moved towards solutions with deep learning and 'large language models'), you'll probably find this very useful: https://github.com/bmabey/pyLDAvis
Top2Vec
- [D] Is it better to create a different set of Doc2Vec embeddings for each group in my dataset, rather than generating embeddings for the entire dataset?
  I'm using Top2Vec with Doc2Vec embeddings to find topics in a dataset of ~4000 social media posts. This dataset has three groups:
- Tips for best Top2Vec (HDBSCAN) usage
  I asked in a previous post for advice about how to find insight in unstructured text data. Almost everyone recommended BERTopic, but I wasn't able to run BERTopic on my machine locally (segmentation fault). Fortunately, I found Top2Vec, which uses HDBSCAN and UMAP to quickly find good topics in uncleaned(!) text data.
- How can I group domain specific keywords based on their word embeddings?
- Introducing the Semantic Graph
  A number of excellent topic modeling libraries exist in Python today. BERTopic and Top2Vec are two of the most popular. Both use sentence-transformers to encode data into vectors, UMAP for dimensionality reduction and HDBSCAN to cluster nodes.
- Top2Vec: Embed topics, documents and word vectors
- How to cluster articles about software vulnerabilities?
- Data Science - Text Classification
- Extracting topics from 250k facebook posts
  Since you already have the facebook posts, you can use top2vec https://github.com/ddangelov/Top2Vec
- [D] Good algorithm for clustering big data (sentences represented as embeddings)?
- SOTA for Topic Modeling
  Here's an implementation that uses UMAP and HDBSCAN: https://github.com/ddangelov/Top2Vec but you could use a semi-supervised algorithm in the clustering step if you wanted specific topics.
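
Several of the posts above describe the same pipeline: embed documents and words into one vector space, reduce dimensions with UMAP, and cluster with HDBSCAN. The sketch below illustrates only the step that follows clustering, roughly as Top2Vec describes it: take the centroid of each document cluster as a topic vector, then rank words by their similarity to it. It is a toy illustration under stated assumptions, not Top2Vec's implementation. The 2-D vectors, the words, and the hand-assigned cluster labels are all made up, and no Top2Vec, UMAP, or HDBSCAN calls are used.

```python
import math

# Toy 2-D "embeddings"; real models use hundreds of dimensions.
# All vectors, words, and cluster labels below are hypothetical.
doc_vectors = {
    "post_a": (0.9, 0.1),
    "post_b": (0.8, 0.2),
    "post_c": (0.1, 0.9),
    "post_d": (0.2, 0.8),
}
word_vectors = {
    "football": (1.0, 0.0),
    "goal": (0.9, 0.3),
    "election": (0.0, 1.0),
    "vote": (0.3, 0.9),
}
# Stand-in for the clustering step (HDBSCAN in the posts above).
cluster_of = {"post_a": 0, "post_b": 0, "post_c": 1, "post_d": 1}


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def topic_words(cluster_id, top_n=2):
    """Return the top_n words closest to the centroid of one cluster."""
    members = [doc_vectors[d] for d, c in cluster_of.items() if c == cluster_id]
    topic_vector = centroid(members)
    ranked = sorted(
        word_vectors,
        key=lambda w: cosine(topic_vector, word_vectors[w]),
        reverse=True,
    )
    return ranked[:top_n]


for cluster_id in sorted(set(cluster_of.values())):
    print(cluster_id, topic_words(cluster_id))
```

Swapping the hand-assigned `cluster_of` mapping for the output of a real clustering step is where a semi-supervised method, as suggested in the last post, would slot in.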
What are some alternatives?
BERTopic - Leveraging BERT and c-TF-IDF to create easily interpretable topics.
sentence-transformers - Multilingual Sentence & Image Embeddings with BERT
faiss - A library for efficient similarity search and clustering of dense vectors.
Milvus - A cloud-native vector database, storage for next generation AI applications
hdbscan - A high performance implementation of HDBSCAN clustering.
GuidedLDA - semi supervised guided topic model with custom guidedLDA
contextualized-topic-models - A python package to run contextualized topic modeling. CTMs combine contextualized embeddings (e.g., BERT) with topic models to get coherent topics. Published at EACL and ACL 2021 (Bianchi et al.).
umap - Uniform Manifold Approximation and Projection
nalcos - Search Git commits in natural language
currency_convert
cli-pomodoro - Pomodoro timer in the command line.