| | deep-swe | claude-code-system-prompts |
|---|---|---|
| Mentions | 11 | 8 |
| Stars | 101 | 10,977 |
| Growth | 0.0% | 14.1% |
| Activity | - | 9.6 |
| Last commit | 22 days ago | 6 days ago |
| Language | Shell | JavaScript |
| License | - | MIT License |
Stars - the number of stars that a project has on GitHub. Growth - month over month growth in stars.
Activity is a relative number indicating how actively a project is being developed. Recent commits have higher weight than older ones.
For example, an activity of 9.0 indicates that a project is amongst the top 10% of the most actively developed projects that we are tracking.
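The page does not publish the formula behind the activity number, so the following is only a minimal sketch of how a recency-weighted, percentile-based score of this kind could be computed. The exponential decay, the `half_life_days` parameter, and both function names are assumptions made for illustration, not the site's actual method.

```python
from bisect import bisect_left


def weighted_commit_score(commit_ages_days, half_life_days=30.0):
    """Sum commit weights so that recent commits count more than older ones.

    Assumption: each commit's weight halves every `half_life_days` days.
    """
    return sum(0.5 ** (age / half_life_days) for age in commit_ages_days)


def activity(project_score, all_scores):
    """Map a raw score to a 0-10 scale based on its rank among tracked projects.

    A result of 9.0 means 90% of tracked projects score lower,
    i.e. the project sits in the top 10%.
    """
    ranked = sorted(all_scores)
    below = bisect_left(ranked, project_score)  # projects with a strictly lower score
    return round(10.0 * below / len(ranked), 1)
```

Under these assumptions, a project with the highest raw score among ten tracked projects gets `activity(10, list(range(1, 11)))`, which evaluates to 9.0, matching the "top 10%" reading in the legend.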
deep-swe
-
AWS Bedrock to require sharing data with Anthropic for Mythos and future models
That remains to be seen.
It's notable that Anthropic are still using SWEBench as a coding benchmark rather than the newer, more difficult DeepSWE, which shows them well behind GPT 5.5.
https://deepswe.datacurve.ai/
Bear in mind that marketing feats such as solving an Erdos problem are the result of concerted RL training to impart those narrow capabilities. How much any benchmark result, or paid-shill vibe report, reflects improved performance for more general real-world use cases remains to be seen.
-
DeepSeek V4 Pro beats GPT-5.5 Pro on precision
This benchmark paints a very different picture, with GPT5.5 at the very top with 70% and DeepSeek at 8%.
https://deepswe.datacurve.ai
- DeepSWE results are unreliable – 3/3 DSv4 "failed" tasks solved with same model
- DeepSWE: Measuring frontier coding agents on original, long-horizon SWE tasks
- DeepSWE Audit: DeepSeek-v4-pro results are unreliable
-
DeepSWE: More and cheaper intelligence from maxed GPT 5.5 than maxed Opus 4.8
Source: https://deepswe.datacurve.ai
Just select the two models from the drop-down.
-
Claude Opus 4.8
Where did you get that idea? It uses mini-swe-agent, same as SWE-Bench.
https://github.com/datacurve-ai/deep-swe
- DeepSWE: Measuring coding agents on original, long-horizon engineering tasks
- DeepSWE Measuring frontier coding agents
claude-code-system-prompts
-
Claude Opus 4.8
It's interesting that (for example, for the explore agent https://github.com/Piebald-AI/claude-code-system-prompts/blo... ) they use a personality ("you are a file search specialist") and a "your strengths" framing. I thought that was now largely considered useless, or even counterproductive? Does anyone know more about this stuff?
-
Claude Code is unusable for complex engineering tasks with the Feb updates
This might be more complex than I imagined. It seems Claude Code dynamically customizes the system prompt, and they also update it with every version, so outright replacing it might miss out on updates.
https://github.com/Piebald-AI/claude-code-system-prompts
https://github.com/Piebald-AI/tweakcc
-
Shall I implement it? No
This is the prompt that Claude Code adds when you use /btw
https://github.com/Piebald-AI/claude-code-system-prompts/blo...
-
Why XML Tags Are So Fundamental to Claude
https://github.com/Piebald-AI/claude-code-system-prompts/blo... They seem to use XML-esque tags here, in the first prompt I looked at.
-
Karpathy on Programming
That refers to the sandbox "escape hatch" [1]: running a command without a sandbox is a separate approval, so you get another prompt even if that command has been pre-approved. Their system prompt [2] is too vague about what kinds of failures the sandbox can cause; in my experience, the agent always jumps straight to disabling the sandbox if a command fails. It's probably best to disable the escape hatch and deal with failures manually.
[1] https://code.claude.com/docs/en/sandboxing#configure-sandbox...
[2] https://github.com/Piebald-AI/claude-code-system-prompts/blo...
- Show HN: Claude Code system prompts with change log
What are some alternatives?
arena-ai-leaderboards - 📊 Daily auto-updated snapshots of all Arena AI (LMSYS Chatbot Arena) leaderboards — LLM, Vision, Code, Video, Image & more. Structured JSON with historical tracking.
claude-code-transcripts - Tools for publishing transcripts for Claude Code sessions