-
RedPajama-Data
The RedPajama-Data repository contains code for preparing large datasets for training large language models.
https://github.com/togethercomputer/RedPajama-Data
Even more than that, this is web-scraped data. There are trillions of tokens' worth of valuable text in PDFs, ebooks, and other documents that essentially have no web presence otherwise.
https://annas-archive.org/llm