SWE Benchmark For Various Proprietary Models

September 29, 2025

SWE-bench is the leading benchmark for comparing AI models on software engineering. It covers tasks such as coding, debugging, and generating patches. As of September 2025, models from Anthropic (Claude), OpenAI (GPT series), and Google (Gemini) lead the rankings, with each performing well on different SWE-bench datasets and difficulty levels.

Imagine you're in school and the teacher gives you a test. That test shows whether you're good at maths, reading, or drawing. SWE benchmarks are like those school tests, but for AI models. Without tests, how would we know whether an AI is smart or just pretending? Benchmarks show us whether one AI is better at maths while another is better at telling jokes. It's like comparing superheroes: who's faster, stronger, or funnier?

SWE-Bench Verified vs SWE-Bench Pro

SWE-Bench Verified and SWE-Bench Pro both measure how well AI coding agents perform on real software work, but they differ in difficulty, purpose, and the kind of verification they provide.

SWE-Bench Verified

  • SWE-bench Verified remains the standard way to evaluate AI models on real-world software engineering tasks.
  • It consists of a hand-picked set of 500 real bug fixes drawn from well-known open-source Python projects.
  • Top AI models now score above 70% on it, which shows steady progress but does not necessarily mean they deeply understand the code.
  • SWE-Bench Verified works well for establishing reproducible baselines on common, human-understandable tasks.

SWE-Bench Pro

  • The newer SWE-bench Pro raises the bar: 1,865 tasks drawn from 41 well-known repositories, with the dataset designed to resist training-data contamination.
  • Leading models score only about 23% on it, compared with over 70% on the original SWE-bench Verified.
  • SWE-bench Pro stresses how well systems can adapt and reason for themselves on real, hard software problems, and it shows that current AI struggles far more with these.
  • The gap between Verified and Pro marks the clear difference between doing well on polished demos and having the real skill to handle novel, enterprise-level issues.
Feature                | SWE-Bench Verified            | SWE-Bench Pro
Dataset Size           | 500 tasks                     | 1,865+ tasks
Task Source            | Open source, manually curated | Open source (copyleft) + proprietary
Human Verification     | Yes, 3+ annotators per sample | Yes, with a focus on realism
Task Complexity        | Single-file/minimal edits     | Multi-file, long horizon
Pass Rate (Top Models) | 70%+                          | 15–23%
Use Case               | Baseline evaluation, research | Real-world, enterprise scenario test
Focus                  | Clarity, solvability          | Generalisation, contamination-resistance

Latest Results (2025) – Current Performance Leaders

The top models on the SWE-bench leaderboards as of September 2025 are Claude 4 Opus from Anthropic and GPT-5 from OpenAI. Gemini 2.5 Pro from Google is also performing well. Many people recognize Claude 4 Opus as the leader on SWE-bench, especially in complex reasoning tasks.

Top Proprietary SWE-bench Models

Claude 4 Opus:

  • 72.5% on SWE-bench Verified without extended thinking
  • 79.4% with high compute and extended thinking
  • Stays ahead on sustained, long-horizon coding work

GPT-5: currently leads the software engineering benchmarks with exceptional results.

  • 74.9% on SWE-bench Verified – showing top results
  • 88% on Aider Polyglot – showing strong skills in editing many coding languages
  • Significant improvement over GPT-4o’s 30.8% on SWE-bench Verified

Gemini 2.5 Pro: delivers good, though not top-tier, results:

  • 63.8% on SWE-bench Verified with a custom agent setup
  • 74.0% on Aider Polyglot (whole file editing)
  • 70.4% on LiveCodeBench v5
  • Strengths include an extensive 1-million-token context window, which helps with large codebases

Claude 4 Sonnet:

  • 72.7% on SWE-bench Verified in the standard configuration.
  • 80.2% with additional compute.
  • A 69% better result compared with Claude 3.7 when using SWE-agent.

Claude 3.5 Sonnet (previous generation):

  • 49–70.3% on SWE-bench Verified, depending on the configuration.
  • A solid foundation, but the Claude 4 models pull ahead by a significant margin.

OpenAI o3: shows strong skill on coding tasks that demand heavy reasoning.

  • Scores between 69.1% and 71.7% on SWE-bench Verified, depending on the setup.
  • Achieves 52.8% with thinking mode enabled.
  • A significant improvement over o1, which scored 48.9%.

GPT-4o: delivers solid but not leading performance:

  • 30.8% on SWE-bench Verified

Key Metrics: Claude 4 Opus vs. GPT-5 vs. Gemini 2.5 Pro

Model             | Provider  | SWE Benchmark Score | Strengths
Claude 4 Opus     | Anthropic | 72.5%               | Complex reasoning, debugging
GPT-5             | OpenAI    | 74.9%               | Versatile, multimodal, reasoning
Gemini 2.5 Pro    | Google    | 63.8%               | Cost-effective, large context
Claude 4 Sonnet   | Anthropic | 72.7%               | Speed, agentic coding tasks
Claude 3.5 Sonnet | Anthropic | 49–70.3%            | Strong foundation
OpenAI o3         | OpenAI    | 69.1–71.7%          | Code reasoning ability, code intelligence
GPT-4o            | OpenAI    | 30.8%               | Reliable instruction-following

These scores reflect current public SWE-bench Verified results for proprietary LLMs and agents.

SWE-Bench Pro Top Proprietary Models

The best proprietary models on the SWE-Bench Pro (Public Dataset) leaderboard and what they scored are shown below:

  • Claude Opus 4.1: 23.1% (public); 17.8% (proprietary)
  • GPT-5: 23.3% (public); 14.9% (proprietary)
  • Every top model's score drops sharply compared with the easier Verified benchmark, showing that true generalization is much harder.
Model                          | Provider  | SWE-Bench Pro Score (Public)
GPT-5 (2025-08-07)             | OpenAI    | 23.26 ± 3.06%
Claude Opus 4.1 (2025-08-05)   | Anthropic | 22.71 ± 3.04%
Claude 4 Sonnet (2025-05-14)   | Anthropic | 17.65 ± 2.76%
Gemini 2.5 Pro (Preview 06-05) | Google    | 13.54 ± 2.48%
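The ± margins in the table are consistent with 95% confidence intervals on a binomial pass rate. As a rough sanity check, here is a minimal sketch that recomputes the margin from the pass rate alone; the public-split size of 731 tasks is an assumption for illustration and is not stated in this post:

```python
import math

def ci_margin(pass_rate: float, n_tasks: int, z: float = 1.96) -> float:
    """95% normal-approximation confidence margin for a benchmark pass rate."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n_tasks)

# GPT-5's public SWE-Bench Pro score: 23.26 ± 3.06%
margin = ci_margin(0.2326, 731)  # n=731 is an assumed public-split size
print(f"±{margin * 100:.2f}%")   # prints ±3.06%
```

This also explains why the margins shrink as scores fall: the variance term p(1 − p) is largest near 50% and smaller at the extremes.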

These results capture the state of play as of September 2025 and highlight the progress top AI providers have made on coding and software engineering tasks.

Cost Effectiveness

The models that deliver the best SWE-bench results at the lowest price in 2025 are:

  • Gemini 2.5 Pro
  • OpenAI GPT-4.1 Mini/Nano
  • OpenAI o3-mini

These three offer a good mix of high SWE-bench scores and low per-token costs.

Claude 3.7 Sonnet and Claude 4 score higher on SWE-bench itself, but their token prices are steeper, which makes them a weaker fit if you need to save money or manage a tight budget.

If a business wants strong SWE-bench results while keeping costs low, it should start with Gemini 2.5 Pro or GPT-4.1 Mini. Teams that need more advanced coding tools and deeper reasoning, and have room in the budget, can instead choose Claude 3.7 Sonnet or OpenAI o3-mini, accepting the higher cost.

Real-World Implications 

The SWE benchmark results above translate into real help when you write code. If you want to keep up, you have to put the AI models with top SWE scores to work. With these models, ARC Infosoft can deliver your work up to 10x faster, with greater accuracy and better use of time and money. That makes a strong base for new ideas and fast growth.

At ARC Infosoft, we now use the best AI tools in the business, including OpenAI, Claude, and Gemini, and we have incorporated them into our development process. The change has been significant and has made our work much better.

  • Our development is now almost 10 times faster, letting us finish projects much sooner than before.
  • We offer cost-efficient coding solutions, so you always get top quality without paying too much.
  • These models help a great deal with daily development tasks and also handle hard software engineering problems well.
If you want to get your AI-powered software projects moving faster and take advantage of new smart coding help, get in touch with ARC Infosoft today. Let’s make something great together.

Related Posts

Transform or Wait? Navigating AI for Modern Software Products

Artificial Intelligence (AI) is no longer a futuristic buzzword—it’s the reality shaping how businesses build modern software. From chatbots to predictive analytics, AI promises…

Proprietary vs Open Source AI Models – Choose the Right Path for Your Application

Artificial Intelligence (AI) is transforming how businesses deliver products and services. When building an AI-powered application, one of the first choices you’ll face is…

AWS vs Azure: Choosing the Best Cloud Platform for Scalable Cloud-Native Apps

In today’s digital-first world, businesses are shifting to cloud-native development to build scalable, resilient, and cost-effective applications. When it comes to cloud platforms, AWS…
