Our great sponsors
-
InfluxDB
Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
Hi Raza here, one of the other co-founders.
I know that HN likes to nerd out over technical details so thought I’d share a bit more on how we aggregate the noisy labels to clean them up.
At the moment we use the great Skweak [1] open source library to do this. Skweak uses an HMM to infer the most likely unobserved label given the evidence of the votes from each of the labelling functions.
This whole strategy of first training a label model and then training a neural net was pioneered by Snorkel. We’ve used this approach for now but we actually think there are big opportunities for improvement.
We’re working on an end-to-end approach that de-noises the labelling function and trains the model at the same time. So far we’ve seen improvements on the standard benchmarks [2] and are planning to submit to Neurips.
R
[1]: Skweak package: https://github.com/NorskRegnesentral/skweak