How 15 AI models compare at finding real defects on web pages — across security, accessibility, privacy, performance, reliability, UX and more.
testers.ai ships five tunable modes — one knob (optimize_for) picks which trade-off you want. Each mode is its own row below; the "★ tuned for this" badge marks the mode specifically tuned for that leaderboard's metric.
Evaluating a bug-finding agent looks deceptively simple — "did it find bugs?" — until you start counting. A single page can have hundreds of latent issues a careful reviewer would catch over a long enough timeframe. Some are stark, some are subtle, and the boundary between “bug” and “weird-but-intentional” rarely sits where you'd want it to.
Concretely, every score we compute on this page sits inside a fog of uncertainty, and the thickest patch is boundary judgment. Is an autoplay hero video a bug? A 5-star review widget that grays out at 4.9? Reasonable people, and reasonable models, disagree. The judge has guidelines, but the line moves.
Net effect: any single number you read here has more uncertainty than the digit count suggests. We mitigate this by reporting multiple metrics with different cost-of-error assumptions, so you can choose the one that lines up with your team's actual risk profile rather than just trusting one.
A security team running a pre-launch audit pays a huge cost for any missed vulnerability and almost nothing for a false alarm — they want maximum recall, hallucinations be damned. A team feeding LLM findings into an auto-PR pipeline pays the opposite cost: a hallucinated bug becomes a real regression, so they need groundedness above all. A general engineering team triaging a backlog wants the best precision-weighted balance — F0.5. There’s no metric right for everyone; the right one is the one whose error costs match yours.
Because of that, we publish six leaderboards below — five specialized metrics plus a single Overall score that averages all five for the case where you have no specific bias. The Overall is the right place to start; the specialized boards are where you go once you know what your team actually pays for.
testers.ai exposes the trade-off explicitly through five tunable modes — the table below shows one row per mode so you can see exactly what each rule-pack gives up to win its target metric.
Every score on this page comes from a held-out cohort of three AI-generated web pages — a search-engine results page, a news article, and a social feed. Each page was produced by an LLM coding tool and then deliberately seeded with a known set of defects across a range of categories: accessibility violations, missing security headers, broken links, console errors, layout overflow, content typos, performance issues, and so on.
The seeded bugs explicitly include the failure modes AI coding tools commonly ship — missing alt text on images, hardcoded credentials in client-side code, unsanitized user input, dead-code links, broken aria-labels, mixed-protocol resources, and the long tail of "looks fine, fails when reviewed" defects that show up when an LLM writes the code. That's the point: this benchmark scores how well bug-finder agents catch the bugs that AI-generated code is actually producing right now.
Each page is fully self-contained (inline CSS, inline SVG, no network access required at scoring time). Bug counts per page are tracked in the eval’s ground_truth.json alongside each page’s artifacts.
Each model receives the same artifact bundle: rendered HTML, browser console log, network transcript (with response bodies), and a screenshot. No live browser, no network access, no model-specific harness tricks — just the four artifacts and a shared system prompt.
An independent LLM-as-judge then matches each model’s findings against the seeded ground-truth bugs. Matched findings count as true positives. Unmatched findings — potential false positives — go through a second-pass classifier that asks: "is this finding pointing at something real on the page that we just didn’t seed, or did the model fabricate this?" That second pass is what powers the Groundedness metric and lets us reward models for finding eval gaps rather than punishing them for it.
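For concreteness, here is a minimal sketch of that two-pass flow. The helper names match_finding and classify_unmatched are illustrative stand-ins for the two judge prompts, not the eval's actual API:

```python
# Illustrative two-pass scoring flow; match_finding and classify_unmatched
# stand in for the LLM-as-judge calls described above (hypothetical names).
def score_findings(findings, ground_truth, match_finding, classify_unmatched):
    tp = real_fp = hallucinated = 0
    unmatched_gt = set(ground_truth)
    for finding in findings:
        bug = match_finding(finding, unmatched_gt)  # pass 1: match vs seeded bugs
        if bug is not None:
            tp += 1
            unmatched_gt.discard(bug)
        elif classify_unmatched(finding):           # pass 2: real but unseeded?
            real_fp += 1                            # credited in Groundedness
        else:
            hallucinated += 1                       # fabricated finding
    return {"TP": tp, "FN": len(unmatched_gt),      # FN = seeded bugs this model missed
            "real_FP": real_fp, "hallucinated": hallucinated}
```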
Why this matters: every metric below has a known ground-truth denominator and a calibrated FP classifier. The numbers are reproducible — same artifacts, same judge prompt, same scoring math — even though the underlying models change.
Determinism: every model is invoked with decoding parameters set as deterministic as the provider allows — temperature=0.1, top_p=0.9 (or provider equivalents; reasoning-tier models that reject these are run with their stricter defaults). Every score on this page is averaged across the three pages in the cohort, not measured on one.
| Benchmark | What it scores | Best for |
|---|---|---|
| Overall | Equal-weighted average of all five metrics below | No specific bias — "give me a single number" |
| F0.5 | Precision-weighted F-score (FPs hurt 2× more than misses) | Engineering teams triaging real backlogs |
| Discovery F0.5 | Same shape, but rare bugs count more | Bug archaeology, finding eval gaps |
| Precision | Of every flagged bug, how many are real? | High-trust output: PR comments, customer-facing reports |
| Recall | Of every real bug, how many were caught? | Pre-launch audits, security sweeps |
| Groundedness | Fraction of findings that are real (not fabricated) | Auto-fix pipelines — avoid FPs at all costs (loses some discovery) |
What it measures: A single composite score combining F0.5, Discovery F0.5, Precision, Recall, and Groundedness. No bias toward any particular error type — just "how does this model do across the whole evaluation?"
Math: Overall = (F0.5 + Discovery + Precision + Recall + Groundedness) / 5. Each component is in [0, 1]; the average is also in [0, 1].
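As a quick sanity check, the top row of the table below recomputes exactly from its five published components:

```python
# testers.ai's five component scores, as published on the boards below.
components = [0.878, 0.717, 0.896, 0.811, 0.917]  # F0.5, Discovery, P, R, Groundedness
overall = sum(components) / len(components)
print(round(overall, 3))  # 0.844 -> the 84.4% in row 1
```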
Best for teams that: want the single safest pick across mixed workloads — an evaluation pass with no specific cost asymmetry, comparison tables for executives, "which model should we default to?" decisions.
Note: the "★ tuned for this" badge appears on the five specialized boards below to mark the testers.ai mode tuned for that metric. It doesn't apply to Overall — there is no dedicated optimize_for=overall mode; it's an unweighted average of the five.
| # | Model | Overall | F0.5 | Discovery | Precision | Recall | Groundedness |
|---|---|---|---|---|---|---|---|
| 1 | testers.ai | 84.4% | 87.8% | 71.7% | 89.6% | 81.1% | 91.7% |
| 2 | gemini-3.1-pro | 82.5% | 83.8% | 70.8% | 88.2% | 69.8% | 100.0% |
| 3 | gpt-5.4-nano | 80.1% | 80.2% | 67.2% | 83.3% | 69.8% | 100.0% |
| 4 | claude-haiku-4-5 | 79.9% | 82.9% | 70.7% | 84.6% | 76.7% | 84.6% |
| 5 | gpt-5.4 | 78.4% | 79.3% | 70.3% | 78.3% | 83.7% | 80.4% |
| 6 | testers.ai (groundedness) | 78.3% | 83.7% | 59.7% | 89.7% | 66.0% | 92.3% |
| 7 | gpt-5.3-codex | 77.8% | 83.8% | 56.9% | 91.7% | 62.3% | 94.4% |
| 8 | gemini-2.5-pro | 77.2% | 75.9% | 64.1% | 78.4% | 67.4% | 100.0% |
| 9 | gpt-5.4-mini | 76.5% | 78.2% | 62.2% | 88.5% | 53.5% | 100.0% |
| 10 | claude-opus-4-7 | 75.6% | 75.1% | 64.7% | 73.3% | 83.0% | 81.7% |
| 11 | gemini-3-flash | 74.9% | 77.8% | 61.9% | 83.9% | 60.5% | 90.3% |
| 12 | claude-sonnet-4-6 | 74.1% | 70.7% | 66.7% | 66.7% | 93.0% | 73.3% |
| 13 | gemini-3.1-flash-lite | 70.3% | 76.4% | 45.1% | 92.3% | 45.3% | 92.3% |
| 14 | testers.ai (recall+vision) | 67.5% | 65.2% | 60.0% | 61.3% | 86.8% | 64.0% |
| 15 | gemma-4-e4b (local) | 50.5% | 52.6% | 28.3% | 70.0% | 26.4% | 75.0% |
What it measures: A balance of precision (the fraction of flagged bugs that are real) and recall (the fraction of real bugs that were caught), weighted so that false positives hurt 2× more than misses. In other words: a model that catches 9 of 10 bugs cleanly beats one that catches all 10 buried under a pile of false alarms. Range 0–100%, higher is better.
Math: Fβ = (1+β²) · P · R / (β²·P + R) with β=0.5 → F0.5 = 1.25 · P · R / (0.25·P + R). Compare with F1 (β=1, equal weight on P and R) and F2 (β=2, recall-weighted).
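A minimal implementation, checked against the testers.ai row on the board below:

```python
def f_beta(p: float, r: float, beta: float = 0.5) -> float:
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

p, r = 0.896, 0.811                 # testers.ai precision / recall below
print(round(f_beta(p, r, 0.5), 3))  # 0.878 -> its F0.5 column
print(round(f_beta(p, r, 1.0), 3))  # 0.851 -> its F1 column
```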
Best for teams that: triage a real bug backlog, where every false alarm eats engineer time; CI gates that fail builds on AI findings; bug-tracker auto-assignment; any workflow where a person reads each finding and a wrong one costs more than a missing one.
Skip this metric if: you're running a security audit (use Recall), or feeding output to an automated pipeline (use Groundedness).
| # | Model | F0.5 | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | testers.ai ★ tuned for this | 87.8% | 89.6% | 81.1% | 85.1% |
| 2 | gemini-3.1-pro | 83.8% | 88.2% | 69.8% | 77.9% |
| 3 | gpt-5.3-codex | 83.8% | 91.7% | 62.3% | 74.2% |
| 4 | testers.ai (groundedness) | 83.7% | 89.7% | 66.0% | 76.1% |
| 5 | claude-haiku-4-5 | 82.9% | 84.6% | 76.7% | 80.5% |
| 6 | gpt-5.4-nano | 80.2% | 83.3% | 69.8% | 75.9% |
| 7 | gpt-5.4 | 79.3% | 78.3% | 83.7% | 80.9% |
| 8 | gpt-5.4-mini | 78.2% | 88.5% | 53.5% | 66.7% |
| 9 | gemini-3-flash | 77.8% | 83.9% | 60.5% | 70.3% |
| 10 | gemini-3.1-flash-lite | 76.4% | 92.3% | 45.3% | 60.8% |
| 11 | gemini-2.5-pro | 75.9% | 78.4% | 67.4% | 72.5% |
| 12 | claude-opus-4-7 | 75.1% | 73.3% | 83.0% | 77.9% |
| 13 | claude-sonnet-4-6 | 70.7% | 66.7% | 93.0% | 77.7% |
| 14 | testers.ai (recall+vision) | 65.2% | 61.3% | 86.8% | 71.9% |
| 15 | gemma-4-e4b (local) | 52.6% | 70.0% | 26.4% | 38.4% |
What it measures: Like F0.5, but recall is rarity-weighted: a bug caught by only one of the 15 models counts 15× more than a bug everyone catches. Rewards finding the long-tail issues your other tools miss; a model that just rediscovers what every model already finds gets a low score here.
Math: Replace plain recall with RareR = Σ_{b ∈ caught} (1/n_b) / Σ_{b ∈ GT} (1/n_b), where n_b is the number of models that caught bug b. Then DiscF0.5 = 1.25 · P · RareR / (0.25·P + RareR).
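A sketch of the rarity weighting on a toy three-bug page (the bug names and catch counts are made up for illustration):

```python
def rare_recall(caught: set, models_catching: dict) -> float:
    # Each ground-truth bug b is worth 1/n_b, where n_b is how many of
    # the 15 models caught it; rare catches earn proportionally more credit.
    total = sum(1 / n for n in models_catching.values())
    earned = sum(1 / models_catching[b] for b in caught)
    return earned / total

counts = {"alt-text": 15, "csp-header": 10, "race-condition": 1}
print(round(rare_recall({"alt-text", "csp-header"}, counts), 3))  # 0.143
print(round(rare_recall({"race-condition"}, counts), 3))          # 0.857
```

Plugged into the F0.5 formula in place of plain recall, that asymmetry is the whole metric: the model that catches only the rare bug scores far higher here despite catching fewer bugs overall.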
Best for teams that: do bug-archaeology on legacy code; want a "second-opinion" model that catches what their primary one misses; build red-team / chaos-engineering tooling; run periodic eval-gap audits to find bugs nobody else has reported yet.
Skip this metric if: you mostly care about catching the obvious bugs reliably (use F0.5 or Recall instead).
| # | Model | Discovery F0.5 | Rarity-weighted recall | Precision |
|---|---|---|---|---|
| 1 | testers.ai ★ tuned for this | 71.7% | 39.8% | 89.6% |
| 2 | gemini-3.1-pro | 70.8% | 39.6% | 88.2% |
| 3 | claude-haiku-4-5 | 70.7% | 42.6% | 84.6% |
| 4 | gpt-5.4 | 70.3% | 50.0% | 78.3% |
| 5 | gpt-5.4-nano | 67.2% | 37.9% | 83.3% |
| 6 | claude-sonnet-4-6 | 66.7% | 66.9% | 66.7% |
| 7 | claude-opus-4-7 | 64.7% | 44.1% | 73.3% |
| 8 | gemini-2.5-pro | 64.1% | 37.1% | 78.4% |
| 9 | gpt-5.4-mini | 62.2% | 28.5% | 88.5% |
| 10 | gemini-3-flash | 61.9% | 30.3% | 83.9% |
| 11 | testers.ai (recall+vision) | 60.0% | 55.3% | 61.3% |
| 12 | testers.ai (groundedness) | 59.7% | 25.5% | 89.7% |
| 13 | gpt-5.3-codex | 56.9% | 22.6% | 91.7% |
| 14 | gemini-3.1-flash-lite | 45.1% | 14.8% | 92.3% |
| 15 | gemma-4-e4b (local) | 28.3% | 8.4% | 70.0% |
What it measures: The fraction of the model's findings that are real defects. High precision means the engineer reading the report can trust each finding without hours of triage; low precision means most of the output is noise. Counts every false positive equally regardless of severity. FP = total false flags, hallu = the subset the second-pass judge classified as outright fabricated (not real bugs at all).
Math: P = TP / (TP + FP). Range 0–100%. A model that emits one finding and is correct scores 100%; a model that emits 100 findings of which 90 are real scores 90%. Indifferent to recall — a model that catches 1 real bug out of 50 still scores 100% if it had no false alarms.
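Recomputed from the raw counts the boards publish (TP comes from the Recall board, FP from this one):

```python
tp, fp = 43, 5                   # testers.ai: 43 TP (Recall board), 5 FP (below)
print(round(tp / (tp + fp), 3))  # 0.896 -> its Precision column
```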
Best for teams that: feed AI output directly to humans (PR comments, customer-facing reports, compliance audits); run AI in “suggest mode” where each suggestion costs reviewer attention; prioritize trust-in-output over coverage.
Skip this metric if: missing a bug is more expensive than chasing a false one (use Recall), or you have any human review at all (F0.5 already handles this trade-off well).
| # | Model | Precision | FP | Hallucinations |
|---|---|---|---|---|
| 1 | gemini-3.1-flash-lite | 92.3% | 2 | 2 |
| 2 | gpt-5.3-codex | 91.7% | 3 | 2 |
| 3 | testers.ai (groundedness) | 89.7% | 4 | 3 |
| 4 | testers.ai ★ tuned for this | 89.6% | 5 | 4 |
| 5 | gpt-5.4-mini | 88.5% | 3 | 0 |
| 6 | gemini-3.1-pro | 88.2% | 4 | 0 |
| 7 | claude-haiku-4-5 | 84.6% | 6 | 6 |
| 8 | gemini-3-flash | 83.9% | 5 | 3 |
| 9 | gpt-5.4-nano | 83.3% | 6 | 0 |
| 10 | gemini-2.5-pro | 78.4% | 8 | 0 |
| 11 | gpt-5.4 | 78.3% | 10 | 9 |
| 12 | claude-opus-4-7 | 73.3% | 16 | 11 |
| 13 | gemma-4-e4b (local) | 70.0% | 6 | 5 |
| 14 | claude-sonnet-4-6 | 66.7% | 20 | 16 |
| 15 | testers.ai (recall+vision) | 61.3% | 29 | 27 |
What it measures: The fraction of seeded ground-truth bugs the model found. High recall means nothing slips through; low recall means real bugs are silently missing from the output. Indifferent to false positives — a model that flags every page element as buggy still gets perfect recall if it caught all the real ones.
Math: R = TP / (TP + FN). Range 0–100%. TP = real bugs caught, FN = real bugs missed. Trivially gameable by emitting every possible finding (so we report it alongside Precision — the two together tell the full story).
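Same shape, from the counts on this board:

```python
tp, fn = 43, 10                  # testers.ai row below
print(round(tp / (tp + fn), 3))  # 0.811 -> 43 of 53 seeded bugs caught
```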
Best for teams that: do pre-launch security audits (one missed XSS >> 100 false alarms); accessibility compliance (ADA / WCAG audits); chaos engineering and pre-mortem testing; any “find everything I might have missed” workflow where a human reviewer is going to triage anyway.
Skip this metric if: false positives have any meaningful cost in your workflow (use F0.5 or Precision).
| # | Model | Recall | TP | FN (missed) |
|---|---|---|---|---|
| 1 | claude-sonnet-4-6 | 93.0% | 40 | 3 |
| 2 | testers.ai (recall+vision) ★ tuned for this | 86.8% | 46 | 7 |
| 3 | gpt-5.4 | 83.7% | 36 | 7 |
| 4 | claude-opus-4-7 | 83.0% | 44 | 9 |
| 5 | testers.ai | 81.1% | 43 | 10 |
| 6 | claude-haiku-4-5 | 76.7% | 33 | 10 |
| 7 | gemini-3.1-pro | 69.8% | 30 | 13 |
| 8 | gpt-5.4-nano | 69.8% | 30 | 13 |
| 9 | gemini-2.5-pro | 67.4% | 29 | 14 |
| 10 | testers.ai (groundedness) | 66.0% | 35 | 18 |
| 11 | gpt-5.3-codex | 62.3% | 33 | 20 |
| 12 | gemini-3-flash | 60.5% | 26 | 17 |
| 13 | gpt-5.4-mini | 53.5% | 23 | 20 |
| 14 | gemini-3.1-flash-lite | 45.3% | 24 | 29 |
| 15 | gemma-4-e4b (local) | 26.4% | 14 | 39 |
What it measures: Of every finding the model emitted, what fraction is grounded — backed by something real on the page — rather than fabricated? Higher is better.
How we compute it: A model’s findings split into TP (matched a seeded ground-truth bug) and FP (didn’t match anything seeded). Each FP is then reviewed by a second-pass judge and classified as either real-FP (a real defect we forgot to seed) or hallucinated (the model invented it). Hallucinations are a strict subset of FP, so they can never exceed the total finding count.
Math: total findings = TP + FP. Of those, the grounded ones are TP + real-FP; the rest are hallucinations. So Groundedness = (TP + real-FP) / (TP + FP) = 1 − hallu / (TP + FP). Range 0–100%. A model that emits 20 findings with 1 hallucination scores 95%; a model that emits 5 findings, all real, scores 100%.
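Recomputed from the testers.ai row below (TP again comes from the Recall board):

```python
tp, fp, real_fp = 43, 5, 1                 # testers.ai; hallucinations = fp - real_fp = 4
groundedness = (tp + real_fp) / (tp + fp)  # equivalently 1 - 4/48
print(round(groundedness, 3))              # 0.917 -> its Groundedness column
```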
Best for teams that: must avoid a false positive at all costs — anyone feeding AI findings to automated downstream systems where there is no human gate: auto-PRs, auto-fix bots, ticket auto-creation, alerting rules. A single hallucinated finding becomes a fake regression, a fake ticket, or a 3 a.m. page, and one fabrication can cascade into production breakage.
The trade-off: optimizing for groundedness comes at the cost of discovering more issues. The strictest mode keeps only findings that pass multiple corroboration filters — fewer hallucinations, but real bugs that only one leaf (a single underlying checker) flagged get dropped too. If you want maximum coverage, use Recall; if you want a balance, use F0.5.
Skip this metric if: a human reviews every finding before action (then F0.5 or Precision is the right metric), or you specifically need to find every real issue (use Recall).
Reading this leaderboard fairly: a model can score 100% groundedness simply by emitting very few findings — if you only flag 3 things and all 3 are real, you're trivially perfect. Groundedness says nothing about coverage: the models tied at 100% below catch between 23 and 30 seeded bugs on the Recall board, while claude-sonnet-4-6 catches 40 but fabricates 16 findings along the way. testers.ai (groundedness) lands at 92.3% with 36 grounded findings (35 seeded bugs plus 1 real unseeded defect). When choosing a model for an automated pipeline, read this board together with the Recall board: groundedness and the absolute number of real bugs caught.
| # | Model | Groundedness | Hallucinations | Real-FP | Total FP |
|---|---|---|---|---|---|
| 1 | gpt-5.4-mini | 100.0% | 0 | 0 | 3 |
| 2 | gemini-3.1-pro | 100.0% | 0 | 0 | 4 |
| 3 | gpt-5.4-nano | 100.0% | 0 | 0 | 6 |
| 4 | gemini-2.5-pro | 100.0% | 0 | 0 | 8 |
| 5 | gpt-5.3-codex | 94.4% | 2 | 1 | 3 |
| 6 | gemini-3.1-flash-lite | 92.3% | 2 | 0 | 2 |
| 7 | testers.ai (groundedness) ★ tuned for this | 92.3% | 3 | 1 | 4 |
| 8 | testers.ai | 91.7% | 4 | 1 | 5 |
| 9 | gemini-3-flash | 90.3% | 3 | 2 | 5 |
| 10 | claude-haiku-4-5 | 84.6% | 6 | 0 | 6 |
| 11 | claude-opus-4-7 | 81.7% | 11 | 5 | 16 |
| 12 | gpt-5.4 | 80.4% | 9 | 1 | 10 |
| 13 | gemma-4-e4b (local) | 75.0% | 5 | 1 | 6 |
| 14 | claude-sonnet-4-6 | 73.3% | 16 | 4 | 20 |
| 15 | testers.ai (recall+vision) | 64.0% | 27 | 2 | 29 |