Our great sponsors
-
InfluxDB
Power Real-Time Data Analytics at Scale. Get real-time insights from all types of time series data with InfluxDB. Ingest, query, and analyze billions of data points in real-time with unbounded cardinality.
I think there's a new approach for “How do you get the data?” that wasn't available when this article was written in 2015. The new text and image generative models can now be used to synthesize training datasets.
I was working on an typing autocorrect project and needed a corpus of "text messages". Most of the traditional NLP corpuses like those available through NLTK [0] aren't suitable. But it was easy to script ChatGPT to generate thousands of believable text messages by throwing random topics at it.
Similarly, you can synthesize a training dataset by giving GPT the outputs/labels and asking it to generate a variety of inputs. For sentiment analysis... "Give me 1000 negative movie reviews" and "Now give me 1000 positive movie reviews".
The Alpaca folks used GPT-3 to generate high-quality instruction-following datasets [1] based on a small set of human samples.
Etc.
[0] https://www.nltk.org/nltk_data/
[1] https://crfm.stanford.edu/2023/03/13/alpaca.html