refactor-benchmark vs codespin

| | refactor-benchmark | codespin |
|---|---|---|
| Mentions | 2 | 5 |
| Stars | 21 | 59 |
| Growth | - | - |
| Activity | 5.9 | 9.5 |
| Latest Commit | 3 months ago | 4 days ago |
| Language | Python | TypeScript |
| License | Apache License 2.0 | MIT License |
Stars - the number of stars that a project has on GitHub. Growth - month-over-month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
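For a concrete sense of what weighting recent commits more heavily could look like, here is a minimal sketch using exponential decay by commit age; the site does not publish its actual formula, so the half-life and scaling here are pure assumptions:

```python
import math
from datetime import datetime, timezone

def activity_score(commit_dates, half_life_days=30.0):
    """Hypothetical activity metric: each commit contributes a weight
    that halves every `half_life_days`, so recent commits count more.
    Illustrative sketch only, not the formula behind the table above.
    """
    now = datetime.now(timezone.utc)
    score = 0.0
    for d in commit_dates:  # timezone-aware datetimes
        age_days = (now - d).total_seconds() / 86400
        score += math.exp(-math.log(2) * age_days / half_life_days)
    return score
```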
refactor-benchmark
- GPT-4 Turbo with Vision is a step backwards for coding
FWIW, I agree with you that each model has its own personality and that models may do better or worse on different kinds of coding tasks. Aider leans into both of these concepts.
The GPT-4 Turbo models have a lazy coding personality, and I spent months of effort figuring out how to both measure and reduce that laziness. This resulted in aider supporting "unified diffs" as a code editing format, which cut such laziness by 3X [0], and in the aider refactoring benchmark as a way to quantify these benefits [1].
The benchmark results I just shared about GPT-4 Turbo with Vision cover both smaller, toy coding problems [2] and larger edits to larger source files [3]. The new model slightly underperforms on the smaller coding tasks, and significantly underperforms on the larger edits, where laziness is often the culprit.
[0] https://aider.chat/2023/12/21/unified-diffs.html
[1] https://github.com/paul-gauthier/refactor-benchmark
[2] https://aider.chat/2024/04/09/gpt-4-turbo.html#code-editing-...
[3] https://aider.chat/2024/04/09/gpt-4-turbo.html#lazy-coding
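For context on the editing format mentioned above: a unified diff expresses an edit as hunks of context, `-`, and `+` lines, which pushes the model to emit the actual replacement code rather than a placeholder comment. A small illustrative example (the file and code are hypothetical):

```diff
--- a/calculator.py
+++ b/calculator.py
@@ -1,3 +1,3 @@
 def total(items):
-    # TODO: implement tax handling
-    return sum(items)
+    subtotal = sum(items)
+    return round(subtotal * 1.08, 2)  # apply 8% sales tax
```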
- OpenAI: Memory and New Controls for ChatGPT
1-2 sentences: Rather than writing code, GPT-4 Turbo often inserts comments like "... finish implementing function here ...". I made a benchmark that provokes and quantifies that behavior.
1-2 paragraphs:
I found that I could provoke lazy coding by giving GPT-4 Turbo refactoring tasks, where I ask it to refactor a large method out of a large class. I analyzed 9 popular open-source Python repos, found 89 such methods that were conceptually easy to refactor, and built them into a benchmark [0].
GPT succeeds on a task if it can remove the method from its original class and add it to the top level of the file without an inappropriate change to the size of the abstract syntax tree (AST). By measuring the size of the AST, we infer that GPT didn't replace a bunch of code with a comment like "... insert original method here ...". I also gathered other laziness metrics, like counting the number of new comments that contained "...", which correlated well with the AST-size test.
[0] https://github.com/paul-gauthier/refactor-benchmark
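To make the AST-size check concrete, here is a minimal sketch of that style of test using Python's `ast` module; the function names and the 10% tolerance are illustrative assumptions, not the benchmark's actual code:

```python
import ast

def ast_size(source: str) -> int:
    """Count the nodes in a Python module's abstract syntax tree."""
    return sum(1 for _ in ast.walk(ast.parse(source)))

def looks_lazy(original: str, refactored: str, tolerance: float = 0.10) -> bool:
    """Flag a refactor whose AST shrank too much -- a sign that real
    code was replaced by a placeholder comment instead of being moved.
    The 10% tolerance is a made-up number for illustration.
    """
    return ast_size(refactored) < (1 - tolerance) * ast_size(original)

def ellipsis_comments(source: str) -> int:
    """Count comment lines containing "...", the other laziness signal
    mentioned above (the real benchmark counts only *new* comments)."""
    return sum(
        1
        for line in source.splitlines()
        if "#" in line and "..." in line.split("#", 1)[1]
    )
```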
codespin
- GPT-4 Turbo with Vision is a step backwards for coding
Shameless plug. I have a VS Code extension that's very nearly ready.
Codespin CLI tools (ready to use): https://github.com/codespin-ai/codespin
VS Code extension for the CLI tool (soon): https://www.youtube.com/watch?v=2TJqosFmkao
I'll do a Show HN in a week or two.
- LLMs and Programming in the first days of 2024
Shameless plug: https://github.com/codespin-ai/codespin-cli
It's similar to aider (which is a great tool btw) in goals, but with a different recipe.
- Copying Angry Birds with nothing but AI
That AI is transformative for development is no longer in doubt. Just this past week, I've been able to build two medium-sized services (a couple of thousand lines of code in Python, a language I hadn't used for more than a decade!). What's truly impressive is that, for the most part, it's better than the code I'd have written anyway. Want a nice README.md? Just provide the source code that contains routes/CLI args/whatever, and it'll generate it for you. Want tests? Sure. Developers have never had it so easy.
Another thing to note is that for code generation, GPT-4 runs circles around GPT-3.5. GPT-3.5 is alright at copying if you provide very tight examples, but GPT-4 kinda "thinks".
Shameless plug: I have this open source app which automates a lot of grunt work in prompt generation - https://github.com/codespin-ai/codespin-cli
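The README workflow described above amounts to pasting the relevant source into a single prompt. A minimal sketch with the OpenAI Python client (the file names and prompt wording are illustrative; tools like codespin automate assembling prompts like this):

```python
from pathlib import Path

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical layout: point this at whatever file defines your
# routes / CLI args.
source = Path("app.py").read_text()

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "You write concise, accurate README.md files."},
        {"role": "user",
         "content": "Write a README.md for this service, documenting "
                    "its routes and CLI arguments:\n\n" + source},
    ],
)

Path("README.md").write_text(response.choices[0].message.content)
```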
- An Open Source Node.js-based CLI tool for Generating Code using GPT
- CodeSpin: Code generation framework and tools using OpenAI APIs
What are some alternatives?
llama-cpp-python - Python bindings for llama.cpp
matter-js - a 2D rigid body physics engine for the web
nitter - Alternative Twitter front-end