| | NoLiMa | Elemental |
|---|---|---|
| Mentions | 7 | 1 |
| Stars | 198 | 516 |
| Growth | 3.5% | 0.0% |
| Activity | 6.2 | 5.5 |
| Last commit | 11 months ago | 3 months ago |
| Language | Python | C++ |
| License | GNU General Public License v3.0 or later | GNU General Public License v3.0 or later |
Stars - the number of stars that a project has on GitHub. Growth - month-over-month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
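The page does not give the formulas behind these numbers, so the following is only a minimal sketch of how such metrics could be computed. All function names, the 30-day half-life, and the percentile mapping are assumptions made for illustration, not the site's actual method.

```python
from bisect import bisect_left


def month_over_month_growth(stars_last_month: int, stars_now: int) -> float:
    """Percentage change in stars between two monthly snapshots."""
    if stars_last_month <= 0:
        raise ValueError("need a positive baseline star count")
    return 100.0 * (stars_now - stars_last_month) / stars_last_month


def raw_activity(commit_ages_days: list[float], half_life_days: float = 30.0) -> float:
    """Sum of per-commit weights that halve every `half_life_days`.

    Assumption: exponential decay is one simple way to give recent
    commits a higher weight than older ones.
    """
    return sum(0.5 ** (age / half_life_days) for age in commit_ages_days)


def activity_score(raw: float, all_raw: list[float]) -> float:
    """Map a raw activity value to a 0-10 score by percentile rank.

    Assumption: the score is 10 times the fraction of tracked projects
    with a strictly lower raw value, so 9.0 corresponds to the top 10%.
    """
    ranked = sorted(all_raw)
    below = bisect_left(ranked, raw)
    return round(10.0 * below / len(ranked), 1)
```

Under these assumptions, a project whose raw value exceeds that of 90 out of 100 tracked projects gets a score of 9.0, which matches the "top 10%" reading described above.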
NoLiMa
- Agentic Pelican on a Bicycle
Thanks for the answer, sir. OK, yes. That makes a lot more sense. I am context greedy ever since I read that Adobe research paper that I shared with you months ago. [0]
The whole "context engineering" seems like a thing, though I dislike throwing around the word "engineer" all willy-nilly like that. :)
In any case, thanks for the response. I just wanted to make sure I was not missing something.
[0] https://github.com/adobe-research/NoLiMa
- Claude Sonnet 4 now supports 1M tokens of context
- NoLiMa: Long-Context Evaluation Beyond Literal Matching
- GPT-4.1 in the API – OpenAI
Updated results from the authors: https://github.com/adobe-research/NoLiMa
It's the best known performer on this benchmark, but it still falls off at even relatively modest context lengths. (Cutting-edge reasoning models like Gemini 2.5 Pro haven't been evaluated due to their cost and might be better.)
- Strong evidence suggesting Quasar Alpha is OpenAI's new model
I only ran the benchmark on Quasar Alpha*; the rest of the scores come from the original paper [0] which was published before 3.7 was available. This is a pretty expensive benchmark to run if you're paying for API usage - I'd actually originally set out to run it on Llama 4 but abandoned that after estimating the cost.
* - I also reproduced the Llama 3.1 8B result to check my setup.
[0] - https://arxiv.org/abs/2502.05167 / https://github.com/adobe-research/NoLiMa
- Gemini 2.5 Pro vs. Claude 3.7 Sonnet: Coding Comparison
They are testing for very straightforward needle retrieval, as LLMs traditionally were terrible at this in longer contexts.
There are some more advanced tests where it's far less impressive. Just a couple of days ago Adobe released one such test: https://github.com/adobe-research/NoLiMa
Elemental
- Gemini 2.5 Pro vs. Claude 3.7 Sonnet: Coding Comparison
I know many people who can and will one-shot a rewrite of 500 LOC. In my world, 500 LOC is about the length of a single function. I don't understand why we should be talking about generating a high-level plan with multiple tests etc. for a single function.
And I don't think this is uncommon. Just a random example from GitHub: this file is 1800 LOC and 4 functions. It implements one very specific thing that's part of a broader library. (I have no affiliation with this code.)
https://github.com/elemental/Elemental/blob/master/src/optim...
What are some alternatives?
kcores-llm-arena - LLM Arena by KCORES team
mcp-gemini-tutorial - Building MCP Servers with Google Gemini
solvespace - Parametric 2d/3d CAD
chat.md - An md file as a chat interface and editable history in one.
dune3d - 3D CAD application