Alibaba’s SWE-CI Benchmark Made 18 AI Models Maintain Real Codebases — 75% Broke Something
Turns out “fixing bugs” and “not creating new ones over six months” are very different skills — and your favorite AI is bad at the second one.
100 real-world tasks. 233 days of commit history each. 71 consecutive commits per task. 10 billion tokens burned. The best model still caused regressions in half its runs.
Researchers from Alibaba Group and Sun Yat-sen University just published SWE-CI — the first benchmark that doesn’t ask “can an AI fix this one bug?” but instead asks “can an AI maintain this codebase for months without setting it on fire?” The answer, for most models: no.

🧩 Dumb Mode Dictionary
| Term | What It Actually Means |
|---|---|
| SWE-CI | New benchmark testing if AI can maintain code over time, not just patch one bug |
| SWE-bench | The old benchmark — “here’s a bug, fix it.” One-shot, pass/fail |
| EvoScore | A score from -1 to 1 measuring how well the AI maintains code across multiple rounds |
| Zero-Regression Rate | How often the AI avoids breaking things that already worked. Spoiler: not often |
| CI Loop | Continuous Integration — the endless cycle of code, test, fix, repeat that keeps software alive |
| Dual-Agent Protocol | Two AI roles: an Architect who reads the failing tests and plans, and a Programmer who writes the actual code |
| Lehman’s Laws | The 1970s observation that software quality degrades during maintenance. Still true. Always true |
📖 Right, So Here's What's Actually Happening
SWE-bench was fine for what it was. Throw a bug at an AI, see if it generates a patch, move on. But anyone who’s been woken up at 3 AM by a pager knows that fixing one bug is the easy part. The hard part is fixing it without breaking three other things, and then maintaining that fix across six months of feature additions.
SWE-CI takes 100 real Python repositories — not toy projects, real ones with 500+ stars and 3+ years of active maintenance. Each task spans an average of 233 days and 71 consecutive commits of evolution. The AI doesn’t just patch one issue. It runs through up to 20 rounds of analysis, coding, and testing, trying to keep the test suite green while the codebase evolves underneath it.
They built a dual-agent system: an Architect agent reads failing tests, inspects source code, and produces a requirements doc (max 5 incremental changes). Then a Programmer agent translates those requirements into actual code. After each round, pytest runs with a 3,600-second timeout. Rinse. Repeat. For 20 iterations.
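The loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's harness: `architect`, `programmer`, and `maintain` are hypothetical names, the agents are plain callables you would supply yourself, and whether the real benchmark stops early once the suite is green is not assumed here.

```python
import subprocess
import sys
from typing import Callable, List

MAX_ROUNDS = 20          # iteration cap mentioned in the article
MAX_REQUIREMENTS = 5     # the Architect may propose at most 5 incremental changes
PYTEST_TIMEOUT_S = 3600  # per-round pytest timeout mentioned in the article


def run_pytest(repo_dir: str) -> bool:
    """Return True if pytest exits cleanly; a timeout counts as a failure."""
    try:
        result = subprocess.run(
            [sys.executable, "-m", "pytest", "-q"],
            cwd=repo_dir,
            capture_output=True,
            timeout=PYTEST_TIMEOUT_S,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def maintain(
    repo_dir: str,
    architect: Callable[[str], List[str]],         # hypothetical: returns requirement strings
    programmer: Callable[[str, List[str]], None],  # hypothetical: edits files in repo_dir
    run_tests: Callable[[str], bool] = run_pytest,
) -> List[bool]:
    """Run the plan/code/test cycle and record whether each round ended green."""
    history: List[bool] = []
    for _ in range(MAX_ROUNDS):
        # Cap the plan at five changes, mirroring the limit described above.
        requirements = architect(repo_dir)[:MAX_REQUIREMENTS]
        programmer(repo_dir, requirements)
        history.append(run_tests(repo_dir))
    return history
```

Passing `run_tests` as a parameter keeps the loop testable without a real repository: swap in a stub and the structure of the protocol is still exercised.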
📊 The Scoreboard — 18 Models, 8 Providers
| Model | EvoScore | Notes |
|---|---|---|
| Claude Opus 4.6 | 0.71 | Top score; zero-regression rate above 0.5 |
| Claude Opus 4.5 | 0.51 | Solid but still regresses in ~half of tasks |
| KIMI-K2.5 | 0.37 | Strong for short-term, weaker long-term |
| GLM-5 | 0.36 | Highlighted as “strong performer” |
| GPT-5.2 | 0.23 | Yeah. |
The scoring isn’t binary pass/fail like SWE-bench. EvoScore runs from -1 to 1, where later iterations are weighted more heavily (because breaking stuff at the end is worse than breaking stuff early). A negative score means you actively made the codebase worse. Some models managed that.
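To make the "later rounds count more" idea concrete, here is a small sketch. It is not the paper's EvoScore formula, which this post does not reproduce. It assumes each round already has a score in [-1, 1] and uses linearly increasing weights, both of which are illustrative choices; `weighted_maintenance_score` is a made-up name.

```python
from typing import Sequence


def weighted_maintenance_score(round_scores: Sequence[float]) -> float:
    """Weighted average of per-round scores, with round i weighted by i.

    Assumption: each per-round score is already normalized to [-1, 1],
    so the weighted average also stays within [-1, 1].
    """
    if not round_scores:
        raise ValueError("need at least one round")
    if any(not -1.0 <= s <= 1.0 for s in round_scores):
        raise ValueError("per-round scores must lie in [-1, 1]")
    weights = range(1, len(round_scores) + 1)  # 1, 2, ..., n
    total = sum(w * s for w, s in zip(weights, round_scores))
    return total / sum(weights)


# Same number of bad rounds, different timing:
early_break = weighted_maintenance_score([-1.0, 1.0, 1.0, 1.0])
late_break = weighted_maintenance_score([1.0, 1.0, 1.0, -1.0])
print(early_break > late_break)  # breaking things late is penalized more
```

Under these weights the run that breaks in the final round scores lower than the run that breaks in the first round, which is the property the paragraph above describes.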
Total compute cost across all experiments: over 10 billion tokens. That’s not a typo.
🚨 The Regression Problem Nobody Wants to Talk About
Right, so here’s the part that should make you nervous if you’re shipping AI-written code to production.
Most models achieved a zero-regression rate below 0.25. Meaning: in over 75% of tasks, the AI introduced at least one regression during the maintenance cycle. It “fixed” something, then broke something that was already working.
Only the Claude Opus models cracked 0.5 — and even then, that means half the time they still broke something.
The paper puts it bluntly: “once a regression occurs, it not only directly impacts user experience, but can also lead to systematic quality degradation as the number of changes accumulates.”
For those of us who’ve spent decades maintaining production systems: yeah. We know. That’s called Tuesday.
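For readers who want to compute a similar number on their own runs, here is one way a zero-regression rate could be calculated. The definition is an assumption: a task counts as "regressed" if any test that passed after one round no longer passes after the next. The paper's exact definition may differ, and the function names are invented for this sketch.

```python
from typing import Dict, List, Set


def has_regression(rounds: List[Set[str]]) -> bool:
    """True if some test passing in one round is missing from the next round's passes."""
    return any(prev - curr for prev, curr in zip(rounds, rounds[1:]))


def zero_regression_rate(tasks: Dict[str, List[Set[str]]]) -> float:
    """Fraction of tasks that never lost a previously passing test.

    `tasks` maps a task id to the set of passing test ids after each round.
    """
    if not tasks:
        raise ValueError("need at least one task")
    clean = sum(1 for rounds in tasks.values() if not has_regression(rounds))
    return clean / len(tasks)


runs = {
    "task-a": [{"t1"}, {"t1", "t2"}, {"t1", "t2", "t3"}],  # only ever gains tests
    "task-b": [{"t1", "t2"}, {"t2"}],                      # loses t1
    "task-c": [{"t1"}, {"t2"}],                            # swaps t1 for t2
    "task-d": [{"t1"}, {"t1"}, set()],                     # loses everything
}
print(zero_regression_rate(runs))  # 0.25: one clean task out of four
```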
🗣️ What the HN Crowd Is Saying
The Hacker News discussion raised some sharp points:
- “Passing tests ≠ good code.” One commenter pointed out the benchmark completely misses architectural decisions: using a raw JWT instead of custom auth middleware would pass all the tests and still be wrong. SWE-CI measures test results, not code quality.
- “The noise problem is worse than the model problem.” A developer noted the real bottleneck isn’t model quality but that every AI review tool auto-publishes everything, causing devs to ignore ~60% of AI-generated suggestions due to noise. The best model in the world doesn’t help if nobody reads its output.
- “These regression rates are sobering.” Even fans of AI-assisted coding admitted that 75%+ regression rates across nearly all models is… not great for production use.
- Fairness questions. Some questioned whether all models got optimal configurations, and whether newer GPT versions should have been included.
🔍 What This Actually Tells Us
Three real findings buried in the data:
- Newer models from the same provider always score higher. Within each model family, every generation improves. The trajectory is real, even if the absolute numbers are still rough.
- Different providers optimize for different things. MiniMax, DeepSeek, and GPT tend toward long-term code stability. Kimi and GLM optimize for short-term fixes. Claude stays relatively stable across both. This reflects training strategy differences, not just “better” or “worse.”
- The Python-only limitation matters. All 100 tasks are Python. Whether these findings generalize to Java, Rust, Go, or TypeScript is an open question. My gut says the regression rates would be worse in languages with stricter type systems, where AI-generated code might compile but violate invariants the type system was supposed to enforce.
Cool. AI Can’t Maintain a Codebase Without Breaking It. Now What the Hell Do We Do? ( ͡ಠ ʖ̯ ͡ಠ)
🛠️ Build a Regression-Catching Layer for AI PRs
Most teams using AI code generation treat the output like human-written code: review it, merge it, move on. But SWE-CI proves AI code regresses differently than human code — it tends to break things it never looked at. Build a CI pipeline that runs expanded test suites on AI-generated PRs, not just the tests the AI was told about.
Example: A DevOps engineer in Poznań, Poland set up a GitHub Actions workflow that runs the full test matrix (not just affected tests) on any PR flagged as AI-generated. Caught 12 regressions in the first month that standard CI would have missed. Sold the template to three other startups for €500 each.
Timeline: 1-2 weekends to build the workflow. Revenue within 30 days if you package it as a reusable action.
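A minimal sketch of the comparison step in such a pipeline: run the full suite on the base branch and on the AI-generated PR with pytest-json-report (`pytest --json-report --json-report-file=...`), then diff the two reports. The helper names and file names are hypothetical, and treating a missing or skipped test as a regression is a deliberately strict assumption you may want to relax.

```python
import json
from typing import Dict, List


def load_report(path: str) -> dict:
    """Load a report file written by pytest-json-report."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def outcomes(report: dict) -> Dict[str, str]:
    """Map each test's nodeid to its outcome ("passed", "failed", ...)."""
    return {t["nodeid"]: t["outcome"] for t in report.get("tests", [])}


def find_regressions(baseline: dict, candidate: dict) -> List[str]:
    """Tests that passed on the baseline but do not pass on the candidate.

    A test that disappears from the candidate report also counts, since a
    deleted or uncollected test can hide a break.
    """
    before = outcomes(baseline)
    after = outcomes(candidate)
    return sorted(
        nodeid
        for nodeid, outcome in before.items()
        if outcome == "passed" and after.get(nodeid) != "passed"
    )


# Inline example; in CI you would call load_report("base.json") and
# load_report("pr.json") instead (both file names are placeholders).
base = {"tests": [
    {"nodeid": "tests/test_a.py::test_ok", "outcome": "passed"},
    {"nodeid": "tests/test_b.py::test_auth", "outcome": "passed"},
    {"nodeid": "tests/test_c.py::test_flaky", "outcome": "failed"},
]}
pr = {"tests": [
    {"nodeid": "tests/test_a.py::test_ok", "outcome": "passed"},
    {"nodeid": "tests/test_b.py::test_auth", "outcome": "failed"},
    {"nodeid": "tests/test_c.py::test_flaky", "outcome": "passed"},
]}
print(find_regressions(base, pr))  # ['tests/test_b.py::test_auth']
```

A CI job could fail the PR whenever this list is non-empty, regardless of which tests the AI was originally pointed at.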
📊 Launch a SWE-CI Leaderboard Dashboard
The raw paper has scores for 18 models but no live, updating leaderboard. There’s a gap for a clean, public-facing dashboard that tracks model performance on maintenance tasks — not just one-shot benchmarks. Think “LMSys Arena but for code maintenance.” Monetize with ads, sponsorships from model providers, or a paid API for CI/CD tool integrations.
Example: A data engineer in Medellín, Colombia built a similar dashboard for LLM coding benchmarks using Streamlit and a PostgreSQL backend. Published weekly updates on X. Hit 8,000 monthly visitors in two months, landed a $2,000/month sponsorship from a developer tools company.
Timeline: MVP in one weekend with Streamlit. Start marketing immediately after the paper gets traction.
💼 Offer 'AI Code Audit' Consulting
SWE-CI just proved that AI-generated code accumulates regressions over time. Companies shipping AI-written code to production need someone to audit the damage. Package a service: scan the git history, identify AI-generated commits (tools like GitIngest or commit metadata make this possible), run expanded regression tests on those specific changes, produce a risk report.
Example: A freelance security consultant in Lagos, Nigeria started offering “AI Code Health Checks” on Upwork after SWE-bench got popular. Charges $150/audit for small repos. Averages 6-8 clients per month. SWE-CI just made that pitch 10x more convincing because you can cite actual regression rate data.
Timeline: Start marketing today. First client within 2 weeks if you position on Upwork/Fiverr with the SWE-CI data.
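As a starting point for the "identify AI-generated commits" step, here is a sketch that flags commits by their message metadata. It assumes AI tooling leaves a recognizable trailer in the commit message, which is not guaranteed. The marker strings in `AI_MARKERS` are illustrative placeholders to adapt to whatever your team's tools actually write, and the function names are hypothetical.

```python
import subprocess
from typing import List, Sequence

FIELD_SEP = "\x1f"   # separates the hash from the message body
RECORD_SEP = "\x1e"  # separates one commit from the next

# Hypothetical markers: replace with the trailers your tools actually emit.
AI_MARKERS = ("co-authored-by: claude", "co-authored-by: copilot", "generated-by:")


def read_git_log(repo_dir: str) -> str:
    """Dump every commit on the current branch as hash + full message."""
    return subprocess.run(
        ["git", "log", "--format=%H%x1f%B%x1e"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    ).stdout


def ai_flagged_commits(log_text: str, markers: Sequence[str] = AI_MARKERS) -> List[str]:
    """Return hashes of commits whose message contains any marker (case-insensitive)."""
    flagged = []
    for record in log_text.split(RECORD_SEP):
        record = record.strip("\n")
        if not record:
            continue
        sha, _, message = record.partition(FIELD_SEP)
        if any(marker in message.lower() for marker in markers):
            flagged.append(sha)
    return flagged


sample = (
    "abc123\x1fFix parser\n\nCo-authored-by: Claude <bot@example.com>\n\x1e\n"
    "def456\x1fBump version\n\x1e\n"
)
print(ai_flagged_commits(sample))  # ['abc123']
```

The flagged hashes are only candidates: commits without a trailer will be missed, so the list is a lower bound to feed into the expanded regression run, not a complete inventory.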
📝 Write a 'Survival Guide for AI-Maintained Codebases'
There’s no good resource yet that synthesizes the practical lessons from SWE-CI into actionable guidance for engineering teams. Write the definitive guide: which types of changes to let AI handle, which to review manually, how to structure your test suite to catch AI regressions, what zero-regression rates you should demand before trusting a model. Sell it as a Gumroad PDF, or publish free and monetize with consulting leads.
Example: A senior backend dev in Bucharest, Romania wrote “The AI-Proof Test Suite” — a 40-page guide on structuring tests to catch LLM regressions — after reading the original SWE-bench paper. Sold 280 copies at $19 on Gumroad. SWE-CI gives you fresher data and a stronger angle.
Timeline: 2-3 weeks to write. Passive revenue from day one if you seed it in the right dev communities.
🛠️ Follow-Up Actions
| Step | Action | Tool/Resource |
|---|---|---|
| 1 | Read the full SWE-CI paper | arxiv.org/abs/2603.03823 |
| 2 | Check if the benchmark code is open-sourced | Watch the authors’ GitHub profiles |
| 3 | Run your own AI-generated code through expanded regression tests | pytest + pytest-json-report |
| 4 | Track which AI model your team uses and correlate with regression rates | Git metadata + CI logs |
| 5 | Join the HN discussion for updates | HN thread |
Quick Hits
| Want to… | Do this |
|---|---|
| Run SWE-CI tasks against your model of choice | |
| Expand CI coverage beyond “affected tests” for AI PRs | |
| Build or follow a maintenance-benchmark leaderboard | |
| Package AI code audit services with SWE-CI data as proof | |
| Read the full paper; the dual-agent architecture section is worth your time | |
They gave 18 AI models a codebase and said “maintain this for six months.” Seventy-five percent of them did what any junior dev would do — fixed the bug, broke the build, and went home.