TruthfulQA
TruthfulQA | auto-evaluator | |
---|---|---|
4 | 1 | |
508 | 708 | |
- | 1.6% | |
2.8 | 8.6 | |
6 months ago | 4 months ago | |
Jupyter Notebook | TypeScript | |
Apache License 2.0 | GNU General Public License v3.0 or later |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
TruthfulQA
-
airoboros gpt-4 instructed + context-obedient question answering
Dataset: https://github.com/sylinrl/TruthfulQA
-
Scaling Transformer to 1M tokens and beyond with RMT
this is a great point.
do you know of any benchmarks doing this today?
given the acute need to evaluate models on contextual factuality, we're exploring how to create a benchmark for this purpose but prefer existing benchmarks if possible.
openai's truthfulqa[0] is close but does not focus on contextual factuality and targets a much harder problem of absolute truth.
if none exist, and people are interested in contributing, please reach out.
[0] https://github.com/sylinrl/TruthfulQA
-
[D] Is all the talk about what GPT can do on Twitter and Reddit exaggerated or fairly accurate?
I agree they show that you can brute-force mimick uncertainty estimates to some degree, and that the model is generally well calibrated (though on what is basically a set of trivia questions, so YMMV)... yet:
-
[R] TruthfulQA: Measuring How Models Mimic Human Falsehoods
Code for https://arxiv.org/abs/2109.07958 found: https://github.com/sylinrl/TruthfulQA
auto-evaluator
-
airoboros gpt-4 instructed + context-obedient question answering
My notes have another option: https://github.com/langchain-ai/auto-evaluator
What are some alternatives?
safari - Convolutions for Sequence Modeling
FastChat - An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
recurrent-memory-transformer - [NeurIPS 22] [AAAI 24] Recurrent Transformer-based long-context architecture.
heinsen_routing - Reference implementation of "An Algorithm for Routing Vectors in Sequences" (Heinsen, 2022) and "An Algorithm for Routing Capsules in All Domains" (Heinsen, 2019), for composing deep neural networks.
JARVIS - JARVIS, a system to connect LLMs with ML community. Paper: https://arxiv.org/pdf/2303.17580.pdf