| | kcores-llm-arena | NoLiMa |
|---|---|---|
| Mentions | 2 | 5 |
| Stars | 855 | 113 |
| Growth | 3.6% | 28.3% |
| Activity | 8.5 | 6.2 |
| Last commit | 2 months ago | 24 days ago |
| Language | HTML | Python |
| License | GNU General Public License v3.0 or later | GNU General Public License v3.0 or later |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
kcores-llm-arena
- RustAssistant: Using LLMs to Fix Compilation Errors in Rust Code
Gemini 2.5 Pro is far ahead of even Claude.
Chart:
https://raw.githubusercontent.com/KCORES/kcores-llm-arena/re...
Description of the challenges:
https://github.com/KCORES/kcores-llm-arena
- GPT-4.1 in the API – OpenAI
NoLiMa
- NoLiMa: Long-Context Evaluation Beyond Literal Matching
- GPT-4.1 in the API – OpenAI
Updated results from the authors: https://github.com/adobe-research/NoLiMa
It's the best known performer on this benchmark, but still falls off at even relatively modest context lengths. (Cutting-edge reasoning models like Gemini 2.5 Pro haven't been evaluated due to their cost and might do better.)
- Strong evidence suggesting Quasar Alpha is OpenAI's new model
I only ran the benchmark on Quasar Alpha*; the rest of the scores come from the original paper [0], which was published before Claude 3.7 was available. This is a pretty expensive benchmark to run if you're paying for API usage; I'd originally set out to run it on Llama 4 but abandoned that after estimating the cost.
* - I also reproduced the Llama 3.1 8B result to check my setup.
[0] - https://arxiv.org/abs/2502.05167 / https://github.com/adobe-research/NoLiMa
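The cost estimate the commenter mentions is simple arithmetic: tests × tokens per test × price per token. A minimal sketch of that back-of-the-envelope calculation, where the test count, context size, and per-token price are all illustrative placeholders rather than NoLiMa's actual figures or any provider's real pricing:

```python
# Back-of-the-envelope cost estimate for a long-context benchmark sweep.
# All numbers below are hypothetical placeholders, not NoLiMa's real test
# counts or any provider's real prices.

def run_cost_usd(num_tests: int, tokens_per_test: int, price_per_mtok: float) -> float:
    """Total input-token cost in USD for a benchmark run."""
    total_tokens = num_tests * tokens_per_test
    return total_tokens / 1_000_000 * price_per_mtok

# e.g. 1,000 needle placements at a 32K-token context, $2 per million input tokens
print(run_cost_usd(1_000, 32_000, 2.0))  # -> 64.0
```

Output tokens, retries, and multiple context lengths multiply this further, which is why a full sweep over a frontier model adds up quickly.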
- Gemini 2.5 Pro vs. Claude 3.7 Sonnet: Coding Comparison
They are testing very straightforward needle retrieval, since LLMs have traditionally been terrible at this over longer contexts.
There are more advanced tests where the results are far less impressive. Just a couple of days ago Adobe released one such test: https://github.com/adobe-research/NoLiMa
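To make the distinction concrete, here is a minimal sketch of the "straightforward" needle-in-a-haystack variant: bury a fact in filler text and check whether the model's answer contains it verbatim. NoLiMa's point is that this literal-match setup is too easy; its needles lack lexical overlap with the question. The `ask_model` call below is a hypothetical stand-in for whatever API client you use, so it is left commented out.

```python
# Minimal literal-match needle-in-a-haystack sketch (the simple variant
# that NoLiMa goes beyond). `ask_model` is a hypothetical stand-in for a
# real API client and is not called here.

def build_haystack(needle: str, filler: str, depth: float, total: int) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    sentences = [filler] * total
    sentences.insert(int(depth * total), needle)
    return " ".join(sentences)

def score(answer: str, expected: str) -> int:
    """1 if the expected fact appears verbatim in the answer, else 0."""
    return int(expected.lower() in answer.lower())

haystack = build_haystack(
    "The special number is 7481.",
    "The sky was a uniform grey that morning.",
    depth=0.5,
    total=200,
)
# prompt = haystack + "\n\nWhat is the special number?"
# print(score(ask_model(prompt), "7481"))
```

Sweeping `depth` and `total` over a grid is what produces the familiar retrieval heatmaps; NoLiMa instead rewrites the needle so the answer cannot be found by string matching alone.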
What are some alternatives?
chat.md - An md file as a chat interface and editable history in one.
mcp-gemini-tutorial - Building MCP Servers with Google Gemini
openai-cookbook - Examples and guides for using the OpenAI API
Elemental - Distributed-memory, arbitrary-precision, dense and sparse-direct linear algebra, conic optimization, and lattice reduction