kcores-llm-arena
chat.md
kcores-llm-arena | chat.md | |
---|---|---|
2 | 4 | |
855 | 36 | |
3.6% | - | |
8.5 | 9.2 | |
2 months ago | 10 days ago | |
HTML | TypeScript | |
GNU General Public License v3.0 or later | MIT License |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
kcores-llm-arena
-
RustAssistant: Using LLMs to Fix Compilation Errors in Rust Code
Gemini 2.5 pro is far ahead of even Claude
Chart:
https://raw.githubusercontent.com/KCORES/kcores-llm-arena/re...
Description of the challenges:
https://github.com/KCORES/kcores-llm-arena
- GPT-4.1 in the API – OpenAI
chat.md
-
OpenAI o3 and o4-mini – OpenAI
FWIW, in limited trials so far, agentic capabilities feel worse than gpt-4.1 with various tool calling error modes not seen in gpt-4.1 [4].
Attempt 1: couldn't understand correct syntax, aborted. [1]
Attempt 2: problems in understanding tool result and passing correct inputs, aborted [2]
Attempt 3: waits multiple times for user confirmation, shows lack of native agentic looping unlike sonnet. Finally has same failure as 2 [3]
Sonnet 3.7 obviously has least number of such errors, followed by gemini 2.5 pro [5]
[1] https://github.com/rusiaaman/chat.md/blob/main/samples/o4-mi...
[2] https://github.com/rusiaaman/chat.md/blob/main/samples/o4-mi...
[3] https://github.com/rusiaaman/chat.md/blob/main/samples/o4-mi...
[4] https://github.com/rusiaaman/chat.md/blob/main/samples/4.1/t...
[5] https://github.com/rusiaaman/chat.md/blob/main/samples/gemin...
-
GPT-4.1 in the API – OpenAI
Did some quick tests. I believe its the same model as Quasar. It struggles with agentic loop [1]. You'd have to force it to do tool calls.
Tool use ability feels ability better than gemini-2.5-pro-exp [2] which struggles with JSON schema understanding sometimes.
Llama 4 has suprising agentic capabilities, better than both of them [3] but isn't as intelligent as the others.
[1] https://github.com/rusiaaman/chat.md/blob/main/samples/4.1/t...
- Show HN: Chat.md
- Show HN: Chat.md – file as chat interface with editable history [MCP-client]
What are some alternatives?
NoLiMa - Official repository for "NoLiMa: Long-Context Evaluation Beyond Literal Matching"
klavis - Klavis AI (YC X25): Open Source MCP integration for AI applications
openai-cookbook - Examples and guides for using the OpenAI API
polaris - Distributed AI Agent Framework