llm-colosseum
Benchmark LLMs by fighting in Street Fighter III! The new way to evaluate the quality of an LLM.
Hello guys,
Tired of the current boring LLM benchmarks? I'm sharing a fun project we built during the Mistral AI SF hackathon.
Using an RL framework, we made LLMs fight each other in real time in Street Fighter III. You can find the repo here: https://github.com/OpenGenerativeAI/llm-colosseum.
Aside from the fact that it's very funny to watch Mistral and others throw Hadoukens, we found it's a great way to benchmark language models: they have to quickly understand their environment and act accordingly.
After more than 400 fights, check out the Elo ranking on the HF Space here: https://huggingface.co/spaces/junior-labs/llm-colosseum
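For anyone wondering how a pile of fights turns into a leaderboard: the post doesn't say exactly how the HF Space computes its ranking, so here's a rough sketch of a standard Elo update in Python. The K-factor of 32, the 1500 starting rating, recording fights as (winner, loser) pairs, and the model names are all assumptions for illustration, not what the project necessarily uses.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one fight.

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a draw.
    k = 32 is an assumed K-factor.
    """
    exp_a = expected_score(rating_a, rating_b)
    exp_b = 1.0 - exp_a
    score_b = 1.0 - score_a
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * (score_b - exp_b)
    return new_a, new_b


def rank(fights, initial: float = 1500.0, k: float = 32.0):
    """Replay (winner, loser) pairs in order and return models sorted by rating.

    The 1500 starting rating is an assumption.
    """
    ratings = {}
    for winner, loser in fights:
        ra = ratings.setdefault(winner, initial)
        rb = ratings.setdefault(loser, initial)
        ratings[winner], ratings[loser] = update_elo(ra, rb, 1.0, k)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical fight log; names are placeholders, not real results.
    fights = [("model-a", "model-b"), ("model-a", "model-c"), ("model-b", "model-c")]
    for name, rating in rank(fights):
        print(f"{name}: {rating:.1f}")
```

Note that this replays fights sequentially, so the final ratings depend on fight order; with a few hundred fights that effect gets smaller, but a real leaderboard might shuffle and average over several orderings or fit a Bradley-Terry model instead.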