-
delta
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs (by delta-io)
-
open_llama
OpenLLaMA, a permissively licensed open source reproduction of Meta AI’s LLaMA 7B trained on the RedPajama dataset
Databricks provides JupyterLab-like notebooks for analysis and ETL pipelines using Spark through PySpark, Spark SQL, or Scala. I think R is supported as well, but it doesn't interoperate with their newer features as well as Python and SQL do. It interfaces with cloud storage backends like S3 and offers some improvements over the plain Parquet format for querying data, allowing updates, ordering, and merges through https://delta.io . It integrates pretty seamlessly with other data visualisation tooling if you want to use it for that, but the built-in graphs are fine for most cases. They also have an "ML on rails" type of offering through menus and models, if I recall, but I typically don't use it for that. I've typically used it for ETL or ELT type workflows for data that's too big or isn't stored in a database.
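To make the "updating and merging" part concrete, here is a rough sketch of an upsert into a Delta table using the open-source `delta-spark` Python package. The helper names (`merge_condition`, `upsert`), the `t`/`s` aliases, and the key columns are made up for illustration; the sketch assumes you already have a SparkSession configured for Delta and a DataFrame of new rows.

```python
def merge_condition(keys, target="t", source="s"):
    """Build the SQL condition that matches target rows to source rows."""
    if not keys:
        raise ValueError("at least one key column is required")
    return " AND ".join(f"{target}.{k} = {source}.{k}" for k in keys)


def upsert(spark, table_path, updates_df, keys):
    """Merge updates_df into the Delta table at table_path.

    Rows whose keys match are updated; all other rows are inserted.
    Hypothetical helper: table_path and keys are supplied by the caller.
    """
    # Imported lazily so this file still loads without delta-spark installed.
    from delta.tables import DeltaTable

    target = DeltaTable.forPath(spark, table_path)
    (
        target.alias("t")
        .merge(updates_df.alias("s"), merge_condition(keys))
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )
```

Something like `upsert(spark, path, new_rows, ["id"])` would then update matching rows in place and append the rest, which is the kind of thing plain Parquet files don't give you.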
OpenLLaMA models up to 13B parameters have now been trained on 1T tokens:
https://github.com/openlm-research/open_llama
Mosaic's MPT models are already supported in GGML: https://github.com/ggerganov/ggml
Here's MPT-30B running in 4-bit precision on CPU :) https://twitter.com/abacaj/status/1673133443339763712?s=20
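For a sense of why 4-bit matters for running a 30B model on CPU, here is some back-of-the-envelope arithmetic for the weights alone. It is only a sketch: `weight_gb` is a made-up helper, and real quantized formats also store per-block scales, so the effective bits per weight and the real memory use come out somewhat higher.

```python
def weight_gb(n_params, bits_per_weight):
    """Approximate size of the model weights in gigabytes (10**9 bytes).

    Ignores quantization scales, activations, and any KV cache.
    """
    return n_params * bits_per_weight / 8 / 1e9


# A 30B-parameter model at 16-bit vs. 4-bit weights.
fp16_gb = weight_gb(30e9, 16)  # 60.0
q4_gb = weight_gb(30e9, 4)     # 15.0
print(f"fp16: {fp16_gb:.0f} GB, 4-bit: {q4_gb:.0f} GB")
```

Roughly 15 GB of weights can fit in the RAM of a well-equipped desktop, while 60 GB usually cannot.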
I used https://github.com/artidoro/qlora, but there are quite a few others that likely work better. It was literally my first attempt at doing anything like this: it took the better part of an evening to work through CUDA/Python issues before it would train, and then ~20 hours of training.