
My first SWE-bench Lite run with Selene cleared 60.67%

This was my first real SWE-bench Lite pass with Selene, not a polished rerun. I used Claude Opus 4.6 in non-thinking mode, kept the default Selene agent, ran tasks sequentially, and still landed at 182 resolved out of 300.

Mar 14, 2026 · 6 min read
Engineering · Benchmarks · SWE-bench Lite · benchmark · Claude Opus 4.6 · Selene · agents · baseline
SWE-bench Lite CLI result showing 182 resolved out of 300

I wanted the first benchmark post here to be honest, not cleaned up after a bunch of retries. So this one is the real early pass: Claude Opus 4.6 in non-thinking mode, the default Selene agent, tasks processed one by one instead of in parallel, and a run that took about 18 hours to get through the full sweep.

Terminal output showing the first SWE-bench Lite result for Selene
The actual result from the first run: 60.67%, with 182 resolved out of 300 and all 300 tasks submitted.

That run landed at 182 resolved instances out of 300 total, which is 60.67% on the full set. The report also showed 293 successful runs, zero failed runs, seven errors, and no pending tasks. For a first real pass on a pretty plain setup, that was enough for me to take seriously.
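The headline number is just the resolve rate, which is worth sanity-checking. A two-line check (the numbers are taken straight from the run report above):

```python
resolved, total = 182, 300
rate = resolved / total * 100
print(f"{rate:.2f}% ({resolved}/{total} resolved)")
```

Rounded to two decimal places, 182/300 gives exactly the 60.67% reported by the harness.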

Why I care about this run

The part I care about is not just the number. It is that this run used the plain default Selene agent, not a benchmark-specialized coder prompt with heavily narrowed behavior. I wanted to see what the product could do in a harder environment before I started tuning anything around it.

That makes the result more useful to me. I do not read it as an optimized score. I read it as a baseline that already has real signal before I start squeezing the workflow, the prompt shape, or the model setup.

Selene desktop app handling a SWE-bench task inside the interface
One of the benchmark tasks inside Selene: real repo context, tool use, and patching work instead of a fake wrapper around the benchmark.

What the harness was actually doing

This was not a fake benchmark wrapper. The harness walked real SWE-bench instances, checked out the target repositories, handed each issue prompt to Selene, collected the patch output, and formatted predictions for evaluation. That matters because it means the run was exercising the actual agent stack rather than a toy shortcut.

In practice that meant auth, sessions, repository setup, file search, patch generation, retries, and evaluation formatting were all part of the path. That is much closer to product behavior than to a benchmark-only script.
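The loop the harness runs can be sketched roughly like this. This is a minimal illustration, not the real Selene harness: the `repo_url` field, the `agent.solve` call, and the `"selene-default"` model name are placeholders I made up, though the three prediction keys do match the JSONL shape the public SWE-bench evaluator expects.

```python
import json
import subprocess
import tempfile
from pathlib import Path

def format_prediction(instance_id: str, patch: str,
                      model: str = "selene-default") -> dict:
    """Shape one result the way the SWE-bench evaluator expects it."""
    return {
        "instance_id": instance_id,
        "model_name_or_path": model,
        "model_patch": patch,
    }

def run_instance(instance: dict, agent) -> dict:
    """Check out one task's repo, run the agent, and return a prediction."""
    workdir = Path(tempfile.mkdtemp(prefix=instance["instance_id"] + "-"))
    # Clone the target repository and pin it to the task's base commit.
    subprocess.run(["git", "clone", instance["repo_url"], str(workdir)],
                   check=True)
    subprocess.run(["git", "-C", str(workdir), "checkout",
                    instance["base_commit"]], check=True)
    # Hand the issue text to the agent and collect a unified diff; the
    # real harness also handles auth, sessions, file search, and retries
    # around this call.
    patch = agent.solve(issue=instance["problem_statement"], repo=workdir)
    return format_prediction(instance["instance_id"], patch)

def write_predictions(instances, agent, out_path="predictions.jsonl"):
    """Process tasks one by one (sequentially, as in this run)."""
    with open(out_path, "w") as f:
        for instance in instances:
            f.write(json.dumps(run_instance(instance, agent)) + "\n")
```

The sequential loop in `write_predictions` is exactly why this run took around 18 hours; fanning tasks out across workers is the obvious next step.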

What I would improve next

There is still obvious headroom here. The biggest thing is workflow shape: this run processed tasks sequentially, so it was basically a patient marathon instead of a faster distributed run. I also kept the default Selene agent and used Opus in non-thinking mode rather than pushing for a more aggressive coding setup.

So when I look at 60.67%, I do not read it as a ceiling. I read it as a useful first marker from a setup that was still intentionally pretty plain.

Why I am posting it anyway

I like posting this run early because it shows the kind of product I want Selene to become: not a benchmark-only machine, and not a hand-tuned lab curiosity, but a general agent system that can do product work and still show up with real numbers when pointed at something hard.

This one took a long time, used a default setup, and still crossed 60.67%. That feels like a good place to start from.
