TheVault vs airoboros

TheVault

[EMNLP 2023] The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation (by FSoft-AI4Code)

airoboros

Customizable implementation of the self-instruct paper. (by jondurbin)

Project	TheVault	airoboros
Mentions	4	8
Stars	78	940
Growth	-	-
Activity	7.9	8.7
Latest Commit	3 months ago	about 2 months ago
Language	Jupyter Notebook	Python
License	MIT License	Apache License 2.0

The number of mentions indicates the total number of mentions that we've tracked plus the number of user suggested alternatives.
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
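
The activity description above can be illustrated with a small sketch. The site does not publish its formula, so everything here is an assumption for illustration: the exponential decay, the 30-day half-life, and the function names `weighted_commit_score` and `activity_scores` are all hypothetical. The sketch only shows one way that "recent commits weigh more" and "9.0 means top 10%" could fit together.

```python
from datetime import date


def weighted_commit_score(commit_dates, today, half_life_days=30.0):
    """Sum commit weights; a commit's weight halves every `half_life_days`.

    Assumption: exponential decay. The real weighting is not documented.
    """
    return sum(0.5 ** ((today - d).days / half_life_days) for d in commit_dates)


def activity_scores(raw_scores):
    """Map raw scores to a 0-10 scale by percentile rank.

    Assumption: score = 10 * (share of tracked projects with a lower raw
    score), so 9.0 means the project outranks 90% of tracked projects.
    """
    n = len(raw_scores)
    result = {}
    for name, score in raw_scores.items():
        below = sum(1 for other in raw_scores.values() if other < score)
        result[name] = round(10.0 * below / n, 1)
    return result


if __name__ == "__main__":
    today = date(2024, 1, 1)
    # Hypothetical commit histories for two made-up projects.
    raw = {
        "recently-active": weighted_commit_score(
            [date(2023, 12, 30), date(2023, 12, 31)], today
        ),
        "stale": weighted_commit_score(
            [date(2023, 6, 1), date(2023, 6, 2)], today
        ),
    }
    print(activity_scores(raw))
```

Under these assumptions, two projects with the same number of commits can receive different scores purely because of when the commits landed, which matches the relative nature of the metric described above.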

TheVault

Posts with mentions or reviews of TheVault. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-06-02.

(2/2) May 2023
14 projects | /r/dailyainews | 2 Jun 2023
A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation (https://github.com/FSoft-AI4Code/TheVault)

List of code generation datasets (open source)
4 projects | /r/datasets | 30 May 2023
TheVault

[P] Fine-tuning LLaMA on TheVault by AI4Code
2 projects | /r/LocalLLaMA | 30 May 2023
I essentially want to fine-tune LLaMA on a dataset that's geared towards code generation. After a bit of research I found TheVault which seems good enough for the job (let me know if there are better datasets tho).

[R] Introducing The Vault: A new multilingual dataset for advancing code understanding and generation.
1 project | /r/MachineLearning | 12 May 2023
Github page: https://github.com/FSoft-AI4Code/TheVault

airoboros

Posts with mentions or reviews of airoboros. We have used some of these posts to build our list of alternatives and similar projects. The last one was on 2023-09-04.

TinyLlama project aims to pretrain a 1.1B Llama model on 3T tokens
4 projects | news.ycombinator.com | 4 Sep 2023

Airoboros: Customizable implementation of the self-instruct paper
1 project | news.ycombinator.com | 24 Aug 2023

airoboros (tool) overhaul
1 project | /r/LocalLLaMA | 20 Jul 2023
Just wanted to drop a note that I overhauled the airoboros tool (not the models) to have most of the prompts I've been using to build the datasets, plus a couple extras.

(2/2) May 2023
14 projects | /r/dailyainews | 2 Jun 2023
airoboros: using large language models to fine-tune large language models (https://github.com/jondurbin/airoboros)

Airoboros [7B/13B]
1 project | /r/LocalLLM | 24 May 2023
This is a fine-tuned LLaMA model, using completely synthetic training data created by https://github.com/jondurbin/airoboros

airoboros-13b - 98% eval vs gpt-3.5-turbo
1 project | /r/LocalLLaMA | 21 May 2023
I used airoboros, a Python tool I wrote, to generate the synthetic instruction/response pairs, and included a jailbreak prompt to attempt to bypass OpenAI censorship. This is the only dataset used to fine-tune the model.

[P] airoboros 7b - instruction tuned on 100k synthetic instruction/responses
2 projects | /r/MachineLearning | 12 May 2023
This is a 7b parameter model, fine-tuned on 100k synthetic instruction/response pairs generated by gpt-3.5-turbo using my version of self-instruct, airoboros

[P] airoboros: a rewrite of self-instruct/alpaca synthetic prompt generation
1 project | /r/MachineLearning | 3 May 2023
GitHub Repo

What are some alternatives?

When comparing TheVault and airoboros you can also consider the following projects:

DB-GPT - AI Native Data App Development framework with AWEL (Agentic Workflow Expression Language) and Agents

WizardLM - Family of instruction-following LLMs powered by Evol-Instruct: WizardLM, WizardCoder and WizardMath

GirlfriendGPT - Girlfriend GPT is a Python project to build your own AI girlfriend using ChatGPT4.0

TinyLlama - The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.

tree-of-thoughts - Plug-and-play implementation of Tree of Thoughts: Deliberate Problem Solving with Large Language Models that Elevates Model Reasoning by at least 70%

WizardVicunaLM - LLM that combines the principles of wizardLM and vicunaLM

code_contests

datablations - Scaling Data-Constrained Language Models

waymo-open-dataset - Waymo Open Dataset

chain-of-thought-hub - Benchmarking large language models' complex reasoning ability with chain-of-thought prompting

whylogs - An open-source data logging library for machine learning models and data pipelines. 📚 Provides visibility into data quality & model performance over time. 🛡️ Supports privacy-preserving data collection, ensuring safety & robustness. 📈

gorilla - Gorilla: An API store for LLMs