Top2Vec vs GuidedLDA
Metric | Top2Vec | GuidedLDA
---|---|---
Mentions | 13 | 1
Stars | 2,843 | 494
Growth | - | -
Activity | 7.0 | 0.0
Last commit | 5 months ago | over 3 years ago
Language | Python | Python
License | BSD 3-clause "New" or "Revised" License | Mozilla Public License 2.0
Stars - the number of stars that a project has on GitHub.
Growth - month-over-month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
Top2Vec
- [D] Is it better to create a different set of Doc2Vec embeddings for each group in my dataset, rather than generating embeddings for the entire dataset?
  I'm using Top2Vec with Doc2Vec embeddings to find topics in a dataset of ~4000 social media posts. This dataset has three groups:
- Tips for best Top2Vec (HDBSCAN) usage
  I asked in a previous post for advice about how to find insight in unstructured text data. Almost everyone recommended BERTopic, but I wasn't able to run BERTopic on my machine locally (segmentation fault). Fortunately, I found Top2Vec, which uses HDBSCAN and UMAP to quickly find good topics in uncleaned(!) text data.
- How can I group domain specific keywords based on their word embeddings?
- Introducing the Semantic Graph
  A number of excellent topic modeling libraries exist in Python today. BERTopic and Top2Vec are two of the most popular. Both use sentence-transformers to encode data into vectors, UMAP for dimensionality reduction and HDBSCAN to cluster nodes.
- Top2Vec: Embed topics, documents and word vectors
- How to cluster articles about software vulnerabilities?
- Data Science - Text Classification
- Extracting topics from 250k facebook posts
  Since you already have the facebook posts, you can use top2vec: https://github.com/ddangelov/Top2Vec
- [D] Good algorithm for clustering big data (sentences represented as embeddings)?
- SOTA for Topic Modeling
  Here's an implementation that uses UMAP and HDBSCAN: https://github.com/ddangelov/Top2Vec, but you could use a semi-supervised algorithm in the clustering step if you wanted specific topics.
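
Several of the posts above describe the same pipeline: embed documents and words in one vector space, cluster the document vectors, then describe each cluster by its nearest words. The snippet below is a minimal, standard-library sketch of that last step only. It is not Top2Vec's code or API: the function names (`cosine`, `centroid`, `topic_words`), the toy 2-D vectors and the cluster labels are all made up for illustration. It assumes the cluster labels were already produced by a clustering step such as HDBSCAN, which marks noise points with -1.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def topic_words(doc_vectors, labels, word_vectors, top_n=2):
    """Map each cluster label to the top_n words closest to the cluster centroid.

    Documents labelled -1 are skipped as noise (the HDBSCAN convention).
    """
    clusters = {}
    for vec, label in zip(doc_vectors, labels):
        if label == -1:
            continue
        clusters.setdefault(label, []).append(vec)

    topics = {}
    for label, vecs in clusters.items():
        topic_vec = centroid(vecs)  # the "topic vector" for this cluster
        ranked = sorted(
            word_vectors,
            key=lambda word: cosine(word_vectors[word], topic_vec),
            reverse=True,
        )
        topics[label] = ranked[:top_n]
    return topics


# Hypothetical toy data: 2-D vectors stand in for real embeddings.
doc_vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.5, 0.5]]
labels = [0, 0, 1, 1, -1]  # assumed output of a clustering step
word_vectors = {
    "goal": [1.0, 0.0],
    "match": [0.9, 0.3],
    "vote": [0.0, 1.0],
    "election": [0.2, 0.9],
}

topics = topic_words(doc_vectors, labels, word_vectors)
for label, words in sorted(topics.items()):
    print(label, words)
```

In a real setting the embeddings would come from Doc2Vec or sentence-transformers and would be reduced with UMAP before clustering; the semi-supervised variant suggested in the last post would change how `labels` is produced, not how the topic words are read off.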
GuidedLDA
What are some alternatives?
BERTopic - Leveraging BERT and c-TF-IDF to create easily interpretable topics.
sentence-transformers - Multilingual Sentence & Image Embeddings with BERT
gensim - Topic Modelling for Humans
faiss - A library for efficient similarity search and clustering of dense vectors.
corex_topic - Hierarchical unsupervised and semi-supervised topic models for sparse count data with CorEx
Milvus - A cloud-native vector database, storage for next generation AI applications
scikit-learn - scikit-learn: machine learning in Python
hdbscan - A high performance implementation of HDBSCAN clustering.
contextualized-topic-models - A Python package to run contextualized topic modeling. CTMs combine contextualized embeddings (e.g., BERT) with topic models to get coherent topics. Published at EACL and ACL 2021.
umap - Uniform Manifold Approximation and Projection
pyLDAvis - Python library for interactive topic model visualization. Port of the R LDAvis package.