[NLP] Topic Identification & Quantisation

corex_topic

5 622 3.8 Python

Hierarchical unsupervised and semi-supervised topic models for sparse count data with CorEx

We ended up settling on the HuggingFace transformer + HDBSCAN pipeline from BERTopic. I like it because it is straightforward to tune and test, and documents are assigned to clusters probabilistically, so once inference is done you can do interesting aggregation and sampling, such as selecting example texts from each cluster. Another option is Top2Vec, which does basically the same thing but without some of the guiding tools available in BERTopic. Either is suitable for what you’re doing. Older techniques include Latent Dirichlet Allocation (LDA) and CorEx.
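To make the "aggregation and sampling" point concrete, here is a minimal sketch in plain Python. It does not call BERTopic at all: the `docs` list and the `probs` matrix are made-up stand-ins for the document-topic probabilities a soft-clustering topic model would give you, and the helper names (`soft_topic_sizes`, `top_documents`, `sample_documents`) are hypothetical, chosen only for this illustration.

```python
import random

# Hypothetical inputs: five short documents and a hand-made
# document-topic probability matrix (rows: documents, columns: topics).
# In practice this matrix would come from the fitted topic model.
docs = [
    "refund not processed after two weeks",
    "charged twice for the same order",
    "app crashes when I open settings",
    "login screen freezes on startup",
    "billing page crashes on submit",
]
probs = [
    [0.90, 0.10],
    [0.85, 0.15],
    [0.05, 0.95],
    [0.10, 0.90],
    [0.50, 0.50],
]


def soft_topic_sizes(probs):
    """Sum each topic's probability mass across all documents."""
    n_topics = len(probs[0])
    return [sum(row[t] for row in probs) for t in range(n_topics)]


def top_documents(docs, probs, topic, k=2):
    """Return the k documents with the highest probability for a topic."""
    ranked = sorted(range(len(docs)), key=lambda i: probs[i][topic], reverse=True)
    return [docs[i] for i in ranked[:k]]


def sample_documents(docs, probs, topic, k=2, seed=0):
    """Draw k documents with replacement, weighted by topic probability."""
    rng = random.Random(seed)
    weights = [row[topic] for row in probs]
    return rng.choices(docs, weights=weights, k=k)


sizes = soft_topic_sizes(probs)                        # soft cluster sizes
billing_examples = top_documents(docs, probs, topic=0)  # most confident members
crash_sample = sample_documents(docs, probs, topic=1, k=3)
```

The soft sizes count a document that sits between two topics partly toward each, which a hard assignment would hide. Weighted sampling is one way to pull example texts for review without always seeing the same top-ranked documents.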

NOTE: The number of mentions on this list indicates mentions on common posts plus user suggested alternatives. Hence, a higher number means a more popular project.
