SWE-bench is one of the most widely used benchmarks for comparing AI models on software engineering work. It focuses on tasks such as coding, debugging, and generating patches. As of September 2025, models from Anthropic (Claude), OpenAI (GPT series), and Google (Gemini) lead the rankings, and each performs differently across the SWE-bench datasets and difficulty levels.
Imagine you're in school and the teacher gives you a test. That test shows whether you're good at maths, reading, or drawing. SWE benchmarks are like those school tests, but for AI models. Without tests, how would we know if an AI is genuinely capable or just pretending? Benchmarks help us see whether one AI is better at maths while another is better at telling jokes. It's like comparing superheroes: who's faster, stronger, or funnier?
SWE-Bench Verified vs SWE-Bench Pro
SWE-Bench Verified and SWE-Bench Pro both measure how well AI coding agents handle real software work, but they differ in difficulty, intended purpose, and the kind of verification they provide.
SWE-Bench Verified
- SWE-bench Verified remains the standard way to test AI models on real-world software engineering jobs.
- It contains a hand-picked set of 500 real bug-fix tasks from well-known open-source Python projects.
- Top AI models now resolve more than 70% of these tasks, which shows real progress but does not necessarily mean they deeply understand the code.
- SWE-bench Verified is well suited to establishing repeatable baselines on common, human-verifiable tasks; a short sketch of how to inspect the dataset follows this list.
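If you want to see what these 500 tasks look like before reading the scores below, you can load the dataset yourself. The sketch below is a minimal illustration and assumes the benchmark is published on Hugging Face as princeton-nlp/SWE-bench_Verified with fields such as repo, problem_statement, and patch; check the official SWE-bench release for the exact dataset name and field names.

```python
# Minimal sketch: inspect SWE-bench Verified tasks.
# Assumes the dataset is hosted on Hugging Face as "princeton-nlp/SWE-bench_Verified";
# field names may vary between releases.
from datasets import load_dataset

verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(f"{len(verified)} tasks")  # expected: 500 hand-verified issues

sample = verified[0]
print(sample["repo"])                     # source repository (an open-source Python project)
print(sample["problem_statement"][:300])  # the GitHub issue the model must resolve
print(sample["patch"][:300])              # the gold patch that defines the accepted fix
```

A model (or agent) is scored on whether its generated patch makes the task's failing tests pass without breaking the tests that already passed.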
SWE-Bench Pro
- The newer SWE-bench Pro is a much tougher test: 1,865 tasks drawn from 41 well-known repositories, curated to resist training-data contamination.
- Leading models resolve only about 23% of these tasks, compared with more than 70% on the original SWE-bench Verified.
- SWE-bench Pro is a demanding test of how well systems can adapt and reason on their own when faced with hard, real-world software problems, and it shows how much more current AI struggles with them.
- The drop from Verified to Pro highlights the difference between doing well on polished, familiar tasks and genuinely handling novel, enterprise-level issues.
| Feature | SWE-Bench Verified | SWE-Bench Pro |
|---|---|---|
| Dataset Size | 500 tasks | 1,865+ tasks |
| Task Source | Open source, manually curated | Open source (copyleft) + proprietary |
| Human Verification | Yes, 3+ annotators per sample | Yes, but the focus is on realism |
| Task Complexity | Single-file/minimal edits | Multi-file, long horizon |
| Pass Rate (Top Models) | 70%+ | 15–23% |
| Use Case | Baseline evaluation, research | Real-world, enterprise scenario test |
| Focus | Clarity, solvability | Generalisation, contamination-resistance |
Latest Results (2025) – Current Performance Leaders
The top models on the SWE-bench leaderboards as of September 2025 are Anthropic's Claude 4 Opus and OpenAI's GPT-5, with Google's Gemini 2.5 Pro also performing strongly. Claude 4 Opus is widely regarded as the leader on SWE-bench, especially for complex reasoning tasks.
Top Proprietary SWE-bench Models
Claude 4 Opus:
- 72.5% on SWE-bench Verified without extended thinking
- 79.4% with extended thinking and additional test-time compute
- Stays in front on sustained, long-horizon coding work
GPT-5: currently leads the headline software engineering benchmarks:
- 74.9% on SWE-bench Verified, a top result
- 88% on Aider Polyglot, showing strong code-editing skills across many languages
- Significant improvement over GPT-4o’s 30.8% on SWE-bench Verified
Gemini 2.5 Pro: delivers strong, though not leading, results:
- 63.8% on SWE-bench Verified with a custom agent setup
- 74.0% on Aider Polyglot (whole file editing)
- 70.4% on LiveCodeBench v5
- Strengths include a 1-million-token context window, which helps with large codebases
Claude 4 Sonnet:
- 72.7% on SWE-bench Verified in its standard configuration
- 80.2% with additional test-time compute
- Reaches about 69% when run with the SWE-agent scaffold, ahead of Claude 3.7 in the same setup
Claude 3.5 Sonnet (previous generation):
- 49–70.3% on SWE-bench Verified, depending on the agent setup
- A solid foundation, but the Claude 4 models pull ahead by a significant margin
OpenAI o3: shows strong ability on coding that requires careful reasoning.
- 69.1–71.7% on SWE-bench Verified, depending on the setup
- 52.8% with thinking mode enabled
- A significant improvement over o1, which scored 48.9%
GPT-4o: delivers solid but not leading performance:
- 30.8% on SWE-bench Verified
Key Metrics: Claude 4 Opus vs. GPT-5 vs. Gemini 2.5 Pro
| Model | Provider | SWE-bench Verified Score | Strengths |
|---|---|---|---|
| Claude 4 Opus | Anthropic | 72.5% | Complex reasoning, debugging |
| GPT-5 | OpenAI | 74.9% | Versatile, multimodal, reasoning |
| Gemini 2.5 Pro | Google | 63.8% | Cost-effective, large context |
| Claude 4 Sonnet | Anthropic | 72.7% | Speed, agentic coding tasks |
| Claude 3.5 Sonnet | Anthropic | 49–70.3% | Strong foundation |
| OpenAI o3 | OpenAI | 69.1–71.7% | Code reasoning ability, code intelligence |
| GPT-4o | OpenAI | 30.8% | Reliable in instruction-following |
These scores reflect publicly reported SWE-bench Verified results for proprietary LLMs and agents.
SWE-Bench Pro Top Proprietary Models
The best-performing proprietary models on the SWE-Bench Pro (public dataset) leaderboard, and their scores, are shown below:
- Claude Opus 4.1: 23.1% (public); 17.8% (proprietary)
- GPT-5: 23.3% (public); 14.9% (proprietary)
- Scores for all top models fall sharply compared with the easier Verified benchmark, showing that true generalisation is much harder.
| Model | Provider | SWE-Bench Pro Score (Public) |
|---|---|---|
| GPT-5 (2025-08-07) | OpenAI | 23.26 ± 3.06% |
| Claude Opus 4.1 (2025-08-05) | Anthropic | 22.71 ± 3.04% |
| Claude 4 Sonnet (2025-05-14) | Anthropic | 17.65 ± 2.76% |
| Gemini 2.5 Pro (Preview 06-05) | Google | 13.54 ± 2.48% |
These results reflect the state of play as of September 2025 and highlight the progress top AI model providers have made on coding and software engineering tasks.
Cost Effectiveness
The models that deliver the strongest SWE-bench results for the lowest price in 2025 are:
- Gemini 2.5 Pro
- OpenAI GPT-4.1 Mini/Nano
- OpenAI o3-mini
These three offer a good balance of strong SWE-bench scores and low per-token cost.
Claude 3.7 Sonnet and Claude 4 score higher on SWE-bench outright, but their token prices are also higher, which makes them a weaker fit for teams on a tight budget.
If a business wants strong SWE-benchmark results while keeping costs low, it should start with Gemini 2.5 Pro or GPT-4.1 Mini. Teams that need more advanced coding capability and deeper reasoning, and have room in the budget, may prefer Claude 3.7 Sonnet or OpenAI o3-mini, with Claude 3.7 Sonnet in particular coming at a higher cost. A rough way to compare models on cost per successfully resolved task is sketched below.
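A simple way to reason about cost effectiveness is dollars per successfully resolved task rather than dollars per token. The sketch below is illustrative only: the pass rates, token counts, and per-million-token prices are hypothetical placeholders, not published vendor pricing, so substitute your own measured usage and contract rates.

```python
# Illustrative cost-per-resolved-task comparison.
# All numbers below are hypothetical placeholders, not real vendor pricing.
def cost_per_resolved_task(pass_rate: float,
                           input_tokens: int,
                           output_tokens: int,
                           price_in_per_m: float,
                           price_out_per_m: float) -> float:
    """Expected spend (USD) per successfully resolved benchmark task."""
    cost_per_attempt = (input_tokens / 1e6) * price_in_per_m + \
                       (output_tokens / 1e6) * price_out_per_m
    return cost_per_attempt / pass_rate

# Hypothetical scenario: a cheaper mid-tier model vs. a pricier frontier model,
# both spending roughly 400k input / 30k output tokens per attempted task.
budget  = cost_per_resolved_task(0.64, 400_000, 30_000, price_in_per_m=1.25, price_out_per_m=10.0)
premium = cost_per_resolved_task(0.73, 400_000, 30_000, price_in_per_m=15.0, price_out_per_m=75.0)
print(f"budget model:  ${budget:.2f} per resolved task")
print(f"premium model: ${premium:.2f} per resolved task")
```

Under assumptions like these, a model with a somewhat lower pass rate can still be the cheaper way to get issues resolved, which is why per-token price and benchmark score need to be read together.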
Real-World Implications
The SWE benchmark results translate directly into practical help when you write code. Adopting AI models with top SWE scores is increasingly important for keeping pace, and using them well is how you stay ahead. With these models, ARC Infosoft can deliver work up to 10x faster, with higher accuracy and better use of time and money, creating a solid base for new ideas and rapid growth.
At ARC Infosoft, we now use the leading AI tools in the business, including OpenAI, Claude, and Gemini models, and we have incorporated them into our development process. The change has been substantial and has made our work much better.
- Our development speed is now almost 10 times faster, letting us finish projects much sooner than before.
- We deliver cost-efficient coding solutions, so clients get top quality without overpaying.
- These models help greatly with daily development tasks and also handle hard software engineering problems well.