-
FastChat
An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Indeed, we are also eager to see more quantifiable benchmarks. Unfortunately, for questions that lack a ground truth and require critical thinking, there is currently no widely accepted evaluation method. Vicuna once used GPT-4 to score the outputs of various models. That is an interesting approach, but it is hard to prove that such an evaluation is unbiased (for example, GPT-4 might favor models trained by distilling GPT outputs). Currently, LM-Sys is hosting a competition among different LLMs in which human users run blind tests and rate the models. We have submitted a pull request to that project at https://github.com/lm-sys/FastChat/pull/655. If it is accepted, we hope to obtain a quantifiable performance score this way in the future.