How 15 AI models compare at finding real defects on web pages — across security, accessibility, privacy, performance, reliability, UX and more.
testers.ai ships five tunable modes — one knob (optimize_for) picks which trade-off you want. Each mode is its own row below; the "★ tuned for this" badge marks the mode specifically tuned for that leaderboard's metric.
Evaluating a bug-finding agent looks deceptively simple — "did it find bugs?" — until you start counting. A single page can have hundreds of latent issues a careful reviewer would catch over a long enough timeframe. Some are stark, some are subtle, and the boundary between “bug” and “weird-but-intentional” rarely sits where you'd want it to.
Concretely, every score we compute on this page sits inside a fog of uncertainty, and the thickest patch is boundary judgment. Is an autoplay hero video a bug? A 5-star review widget that grays out at 4.9? Reasonable people, and reasonable models, disagree. The judge has guidelines, but the line moves.
Net effect: any single number you read here has more uncertainty than the digit count suggests. We mitigate this by reporting multiple metrics with different cost-of-error assumptions, so you can choose the one that lines up with your team's actual risk profile rather than just trusting one.
A security team running a pre-launch audit pays a huge cost for any missed vulnerability and almost nothing for a false alarm — they want maximum recall, hallucinations be damned. A team feeding LLM findings into an auto-PR pipeline pays the opposite cost: a hallucinated bug becomes a real regression, so they need groundedness above all. A general engineering team triaging a backlog wants the best precision-weighted balance — F0.5. There’s no metric right for everyone; the right one is the one whose error costs match yours.
Because of that, we publish six leaderboards below — five specialized metrics plus a single Overall score that averages all five for the case where you have no specific bias. The Overall is the right place to start; the specialized boards are where you go once you know what your team actually pays for.
testers.ai exposes the trade-off explicitly through five tunable modes — the table below shows one row per mode so you can see exactly what each rule-pack gives up to win its target metric.
Every score on this page comes from a held-out cohort of three AI-generated web pages — a search-engine results page, a news article, and a social feed. Each page was produced by an LLM coding tool and then deliberately seeded with a known set of defects across a range of categories: accessibility violations, missing security headers, broken links, console errors, layout overflow, content typos, performance issues, and so on.
The seeded bugs explicitly include the failure modes AI coding tools commonly ship — missing alt text on images, hardcoded credentials in client-side code, unsanitized user input, dead-code links, broken aria-labels, mixed-protocol resources, and the long tail of "looks fine, fails when reviewed" defects that show up when an LLM writes the code. That's the point: this benchmark scores how well bug-finder agents catch the bugs that AI-generated code is actually producing right now.
Each page is fully self-contained (inline CSS, inline SVG, no network access required at scoring time). Bug counts per page are tracked in the eval’s ground_truth.json alongside each page’s artifacts.
Each model receives the same artifact bundle: rendered HTML, browser console log, network transcript (with response bodies), and a screenshot. No live browser, no network access, no model-specific harness tricks — just the four artifacts and a shared system prompt.
An independent LLM-as-judge then matches each model’s findings against the seeded ground-truth bugs. Matched findings count as true positives. Unmatched findings — potential false positives — go through a second-pass classifier that asks: "is this finding pointing at something real on the page that we just didn’t seed, or did the model fabricate this?" That second pass is what powers the Groundedness metric and lets us reward models for finding eval gaps rather than punishing them for it.
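For concreteness, here is a minimal sketch of that two-pass flow. The helper names match_finding and classify_unmatched are illustrative stand-ins for the two judge prompts, not the eval's actual API:

```python
# Illustrative two-pass scoring flow; match_finding and classify_unmatched
# stand in for the LLM-as-judge calls described above (hypothetical names).
def score_findings(findings, ground_truth, match_finding, classify_unmatched):
    tp = real_fp = hallucinated = 0
    unmatched_gt = set(ground_truth)
    for finding in findings:
        bug = match_finding(finding, unmatched_gt)  # pass 1: match vs seeded bugs
        if bug is not None:
            tp += 1
            unmatched_gt.discard(bug)
        elif classify_unmatched(finding):           # pass 2: real but unseeded?
            real_fp += 1                            # credited in Groundedness
        else:
            hallucinated += 1                       # fabricated finding
    return {"TP": tp, "FN": len(unmatched_gt),      # FN = seeded bugs this model missed
            "real_FP": real_fp, "hallucinated": hallucinated}
```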
Why this matters: every metric below has a known ground-truth denominator and a calibrated FP classifier. The numbers are reproducible — same artifacts, same judge prompt, same scoring math — even though the underlying models change.
Determinism: every model is invoked with decoding parameters set as deterministic as the provider allows — temperature=0.1, top_p=0.9 (or provider equivalents; reasoning-tier models that reject these are run with their stricter defaults). Every score on this page is averaged across the three pages in the cohort, not measured on one.
| Benchmark | What it scores | Best for |
|---|---|---|
| Overall | Equal-weighted average of all five metrics below | No specific bias — "give me a single number" |
| F0.5 | Precision-weighted F-score (FPs hurt 2× more than misses) | Engineering teams triaging real backlogs |
| Discovery F0.5 | Same shape, but rare bugs count more | Bug archaeology, finding eval gaps |
| Precision | Of every flagged bug, how many are real? | High-trust output: PR comments, customer-facing reports |
| Recall | Of every real bug, how many were caught? | Pre-launch audits, security sweeps |
| Groundedness | Fraction of findings that are real (not fabricated) | Auto-fix pipelines — avoid FPs at all costs (loses some discovery) |
What it measures: A single composite score combining F0.5, Discovery F0.5, Precision, Recall, and Groundedness. No bias toward any particular error type — just "how does this model do across the whole evaluation?"
Math: Overall = (F0.5 + Discovery + Precision + Recall + Groundedness) / 5. Each component is in [0, 1]; the average is also in [0, 1].
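As a quick sanity check, the top row of the table below recomputes exactly from its five published components:

```python
# testers.ai's five component scores, as published on the boards below.
components = [0.878, 0.717, 0.896, 0.811, 0.917]  # F0.5, Discovery, P, R, Groundedness
overall = sum(components) / len(components)
print(round(overall, 3))  # 0.844 -> the 84.4% in row 1
```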
Best for teams that: want the single safest pick across mixed workloads — an evaluation pass with no specific cost asymmetry, comparison tables for executives, "which model should we default to?" decisions.
Note: the "★ tuned for this" badge appears on the five specialized boards below to mark the testers.ai mode tuned for that metric. It doesn't apply to Overall — there is no dedicated optimize_for=overall mode; it's an unweighted average of the five.
| # | Model | Overall | F0.5 | Discovery | Precision | Recall | Groundedness |
|---|---|---|---|---|---|---|---|
| 1 | testers.ai | 84.4% | 87.8% | 71.7% | 89.6% | 81.1% | 91.7% |
| 2 | gemini-3.1-pro | 82.5% | 83.8% | 70.8% | 88.2% | 69.8% | 100.0% |
| 3 | gpt-5.4-nano | 80.1% | 80.2% | 67.2% | 83.3% | 69.8% | 100.0% |
| 4 | claude-haiku-4-5 | 79.9% | 82.9% | 70.7% | 84.6% | 76.7% | 84.6% |
| 5 | gpt-5.4 | 78.4% | 79.3% | 70.3% | 78.3% | 83.7% | 80.4% |
| 6 | testers.ai (groundedness) | 78.3% | 83.7% | 59.7% | 89.7% | 66.0% | 92.3% |
| 7 | gpt-5.3-codex | 77.8% | 83.8% | 56.9% | 91.7% | 62.3% | 94.4% |
| 8 | gemini-2.5-pro | 77.2% | 75.9% | 64.1% | 78.4% | 67.4% | 100.0% |
| 9 | gpt-5.4-mini | 76.5% | 78.2% | 62.2% | 88.5% | 53.5% | 100.0% |
| 10 | claude-opus-4-7 | 75.6% | 75.1% | 64.7% | 73.3% | 83.0% | 81.7% |
| 11 | gemini-3-flash | 74.9% | 77.8% | 61.9% | 83.9% | 60.5% | 90.3% |
| 12 | claude-sonnet-4-6 | 74.1% | 70.7% | 66.7% | 66.7% | 93.0% | 73.3% |
| 13 | gemini-3.1-flash-lite | 70.3% | 76.4% | 45.1% | 92.3% | 45.3% | 92.3% |
| 14 | testers.ai (recall+vision) | 67.5% | 65.2% | 60.0% | 61.3% | 86.8% | 64.0% |
| 15 | gemma-4-e4b (local) | 50.5% | 52.6% | 28.3% | 70.0% | 26.4% | 75.0% |
What it measures: A balance of precision (the fraction of flagged bugs that are real) and recall (the fraction of real bugs that were caught), weighted so that false positives hurt 2× more than misses. In other words: a model that catches 9 of 10 bugs cleanly beats one that catches all 10 buried under a pile of false alarms. Range 0–100%, higher is better.
Math: Fβ = (1+β²) · P · R / (β²·P + R) with β=0.5 → F0.5 = 1.25 · P · R / (0.25·P + R). Compare with F1 (β=1, equal weight on P and R) and F2 (β=2, recall-weighted).
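A minimal implementation, checked against the testers.ai row on the board below:

```python
def f_beta(p: float, r: float, beta: float = 0.5) -> float:
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

p, r = 0.896, 0.811                 # testers.ai precision / recall below
print(round(f_beta(p, r, 0.5), 3))  # 0.878 -> its F0.5 column
print(round(f_beta(p, r, 1.0), 3))  # 0.851 -> its F1 column
```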
Best for teams that: triage a real bug backlog, where every false alarm eats engineer time; CI gates that fail builds on AI findings; bug-tracker auto-assignment; any workflow where a person reads each finding and a wrong one costs more than a missing one.
Skip this metric if: you're running a security audit (use Recall), or feeding output to an automated pipeline (use Groundedness).
| # | Model | F0.5 | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | testers.ai ★ tuned for this | 87.8% | 89.6% | 81.1% | 85.1% |
| 2 | gemini-3.1-pro | 83.8% | 88.2% | 69.8% | 77.9% |
| 3 | gpt-5.3-codex | 83.8% | 91.7% | 62.3% | 74.2% |
| 4 | testers.ai (groundedness) | 83.7% | 89.7% | 66.0% | 76.1% |
| 5 | claude-haiku-4-5 | 82.9% | 84.6% | 76.7% | 80.5% |
| 6 | gpt-5.4-nano | 80.2% | 83.3% | 69.8% | 75.9% |
| 7 | gpt-5.4 | 79.3% | 78.3% | 83.7% | 80.9% |
| 8 | gpt-5.4-mini | 78.2% | 88.5% | 53.5% | 66.7% |
| 9 | gemini-3-flash | 77.8% | 83.9% | 60.5% | 70.3% |
| 10 | gemini-3.1-flash-lite | 76.4% | 92.3% | 45.3% | 60.8% |
| 11 | gemini-2.5-pro | 75.9% | 78.4% | 67.4% | 72.5% |
| 12 | claude-opus-4-7 | 75.1% | 73.3% | 83.0% | 77.9% |
| 13 | claude-sonnet-4-6 | 70.7% | 66.7% | 93.0% | 77.7% |
| 14 | testers.ai (recall+vision) | 65.2% | 61.3% | 86.8% | 71.9% |
| 15 | gemma-4-e4b (local) | 52.6% | 70.0% | 26.4% | 38.4% |
What it measures: Like F0.5, but recall is rarity-weighted: a bug caught by only one of the 15 models counts 15× more than a bug everyone catches. Rewards finding the long-tail issues your other tools miss; a model that just rediscovers what every model already finds gets a low score here.
Math: Replace plain recall with RareR = Σ_{b ∈ caught} (1/n_b) / Σ_{b ∈ GT} (1/n_b), where n_b is the number of models that caught bug b. Then DiscF0.5 = 1.25 · P · RareR / (0.25·P + RareR).
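A sketch of the rarity weighting on a toy three-bug page (the bug names and catch counts are made up for illustration):

```python
def rare_recall(caught: set, models_catching: dict) -> float:
    # Each ground-truth bug b is worth 1/n_b, where n_b is how many of
    # the 15 models caught it; rare catches earn proportionally more credit.
    total = sum(1 / n for n in models_catching.values())
    earned = sum(1 / models_catching[b] for b in caught)
    return earned / total

counts = {"alt-text": 15, "csp-header": 10, "race-condition": 1}
print(round(rare_recall({"alt-text", "csp-header"}, counts), 3))  # 0.143
print(round(rare_recall({"race-condition"}, counts), 3))          # 0.857
```

Plugged into the F0.5 formula in place of plain recall, that asymmetry is the whole metric: the model that catches only the rare bug scores far higher here despite catching fewer bugs overall.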
Best for teams that: do bug-archaeology on legacy code; want a "second-opinion" model that catches what their primary one misses; build red-team / chaos-engineering tooling; run periodic eval-gap audits to find bugs nobody else has reported yet.
Skip this metric if: you mostly care about catching the obvious bugs reliably (use F0.5 or Recall instead).
| # | Model | Discovery F0.5 | Rarity-weighted recall | Precision |
|---|---|---|---|---|
| 1 | testers.ai ★ tuned for this | 71.7% | 39.8% | 89.6% |
| 2 | gemini-3.1-pro | 70.8% | 39.6% | 88.2% |
| 3 | claude-haiku-4-5 | 70.7% | 42.6% | 84.6% |
| 4 | gpt-5.4 | 70.3% | 50.0% | 78.3% |
| 5 | gpt-5.4-nano | 67.2% | 37.9% | 83.3% |
| 6 | claude-sonnet-4-6 | 66.7% | 66.9% | 66.7% |
| 7 | claude-opus-4-7 | 64.7% | 44.1% | 73.3% |
| 8 | gemini-2.5-pro | 64.1% | 37.1% | 78.4% |
| 9 | gpt-5.4-mini | 62.2% | 28.5% | 88.5% |
| 10 | gemini-3-flash | 61.9% | 30.3% | 83.9% |
| 11 | testers.ai (recall+vision) | 60.0% | 55.3% | 61.3% |
| 12 | testers.ai (groundedness) | 59.7% | 25.5% | 89.7% |
| 13 | gpt-5.3-codex | 56.9% | 22.6% | 91.7% |
| 14 | gemini-3.1-flash-lite | 45.1% | 14.8% | 92.3% |
| 15 | gemma-4-e4b (local) | 28.3% | 8.4% | 70.0% |
What it measures: The fraction of the model's findings that are real defects. High precision means the engineer reading the report can trust each finding without hours of triage; low precision means most of the output is noise. Counts every false positive equally regardless of severity. FP = total false flags, hallu = the subset the second-pass judge classified as outright fabricated (not real bugs at all).
Math: P = TP / (TP + FP). Range 0–100%. A model that emits one finding and is correct scores 100%; a model that emits 100 findings of which 90 are real scores 90%. Indifferent to recall — a model that catches 1 real bug out of 50 still scores 100% if it had no false alarms.
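Recomputed from the raw counts the boards publish (TP comes from the Recall board, FP from this one):

```python
tp, fp = 43, 5                   # testers.ai: 43 TP (Recall board), 5 FP (below)
print(round(tp / (tp + fp), 3))  # 0.896 -> its Precision column
```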
Best for teams that: feed AI output directly to humans (PR comments, customer-facing reports, compliance audits); run AI in “suggest mode” where each suggestion costs reviewer attention; prioritize trust-in-output over coverage.
Skip this metric if: missing a bug is more expensive than chasing a false one (use Recall), or you have any human review at all (F0.5 already handles this trade-off well).
| # | Model | Precision | FP | Hallucinations |
|---|---|---|---|---|
| 1 | gemini-3.1-flash-lite | 92.3% | 2 | 2 |
| 2 | gpt-5.3-codex | 91.7% | 3 | 2 |
| 3 | testers.ai (groundedness) | 89.7% | 4 | 3 |
| 4 | testers.ai ★ tuned for this | 89.6% | 5 | 4 |
| 5 | gpt-5.4-mini | 88.5% | 3 | 0 |
| 6 | gemini-3.1-pro | 88.2% | 4 | 0 |
| 7 | claude-haiku-4-5 | 84.6% | 6 | 6 |
| 8 | gemini-3-flash | 83.9% | 5 | 3 |
| 9 | gpt-5.4-nano | 83.3% | 6 | 0 |
| 10 | gemini-2.5-pro | 78.4% | 8 | 0 |
| 11 | gpt-5.4 | 78.3% | 10 | 9 |
| 12 | claude-opus-4-7 | 73.3% | 16 | 11 |
| 13 | gemma-4-e4b (local) | 70.0% | 6 | 5 |
| 14 | claude-sonnet-4-6 | 66.7% | 20 | 16 |
| 15 | testers.ai (recall+vision) | 61.3% | 29 | 27 |
What it measures: The fraction of seeded ground-truth bugs the model found. High recall means nothing slips through; low recall means real bugs are silently missing from the output. Indifferent to false positives — a model that flags every page element as buggy still gets perfect recall if it caught all the real ones.
Math: R = TP / (TP + FN). Range 0–100%. TP = real bugs caught, FN = real bugs missed. Trivially gameable by emitting every possible finding (so we report it alongside Precision — the two together tell the full story).
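Same shape, from the counts on this board:

```python
tp, fn = 43, 10                  # testers.ai row below
print(round(tp / (tp + fn), 3))  # 0.811 -> 43 of 53 seeded bugs caught
```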
Best for teams that: do pre-launch security audits (one missed XSS >> 100 false alarms); accessibility compliance (ADA / WCAG audits); chaos engineering and pre-mortem testing; any “find everything I might have missed” workflow where a human reviewer is going to triage anyway.
Skip this metric if: false positives have any meaningful cost in your workflow (use F0.5 or Precision).
| # | Model | Recall | TP | FN (missed) |
|---|---|---|---|---|
| 1 | claude-sonnet-4-6 | 93.0% | 40 | 3 |
| 2 | testers.ai (recall+vision) ★ tuned for this | 86.8% | 46 | 7 |
| 3 | gpt-5.4 | 83.7% | 36 | 7 |
| 4 | claude-opus-4-7 | 83.0% | 44 | 9 |
| 5 | testers.ai | 81.1% | 43 | 10 |
| 6 | claude-haiku-4-5 | 76.7% | 33 | 10 |
| 7 | gemini-3.1-pro | 69.8% | 30 | 13 |
| 8 | gpt-5.4-nano | 69.8% | 30 | 13 |
| 9 | gemini-2.5-pro | 67.4% | 29 | 14 |
| 10 | testers.ai (groundedness) | 66.0% | 35 | 18 |
| 11 | gpt-5.3-codex | 62.3% | 33 | 20 |
| 12 | gemini-3-flash | 60.5% | 26 | 17 |
| 13 | gpt-5.4-mini | 53.5% | 23 | 20 |
| 14 | gemini-3.1-flash-lite | 45.3% | 24 | 29 |
| 15 | gemma-4-e4b (local) | 26.4% | 14 | 39 |
What it measures: Of every finding the model emitted, what fraction is grounded — backed by something real on the page — rather than fabricated? Higher is better.
How we compute it: A model’s findings split into TP (matched a seeded ground-truth bug) and FP (didn’t match anything seeded). Each FP is then reviewed by a second-pass judge and classified as either real-FP (a real defect we forgot to seed) or hallucinated (the model invented it). Hallucinations are a strict subset of FP, so they can never exceed the total finding count.
Math: total findings = TP + FP. Of those, the grounded ones are TP + real-FP; the rest are hallucinations. So Groundedness = (TP + real-FP) / (TP + FP) = 1 − hallu / (TP + FP). Range 0–100%. A model that emits 20 findings with 1 hallucination scores 95%; a model that emits 5 findings, all real, scores 100%.
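Recomputed from the testers.ai row below (TP again comes from the Recall board):

```python
tp, fp, real_fp = 43, 5, 1                 # testers.ai; hallucinations = fp - real_fp = 4
groundedness = (tp + real_fp) / (tp + fp)  # equivalently 1 - 4/48
print(round(groundedness, 3))              # 0.917 -> its Groundedness column
```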
Best for teams that: must avoid a false positive at all costs — anyone feeding AI findings to automated downstream systems where there is no human gate: auto-PRs, auto-fix bots, ticket auto-creation, alerting rules. A single hallucinated finding becomes a fake regression, a fake ticket, or a 3 a.m. page, and one fabrication can cascade into production breakage.
The trade-off: optimizing for groundedness comes at the cost of discovering more issues. The strictest mode keeps only findings that pass multiple corroboration filters — fewer hallucinations, but real bugs that only one leaf (a single underlying checker) flagged get dropped too. If you want maximum coverage, use Recall; if you want a balance, use F0.5.
Skip this metric if: a human reviews every finding before action (then F0.5 or Precision is the right metric), or you specifically need to find every real issue (use Recall).
Reading this leaderboard fairly: a model can score 100% groundedness simply by emitting very few findings — if you only flag 3 things and all 3 are real, you're trivially perfect. Groundedness says nothing about coverage: the models tied at 100% below catch between 23 and 30 seeded bugs on the Recall board, while claude-sonnet-4-6 catches 40 but fabricates 16 findings along the way. testers.ai (groundedness) lands at 92.3% with 36 grounded findings (35 seeded bugs plus 1 real unseeded defect). When choosing a model for an automated pipeline, read this board together with the Recall board: groundedness and the absolute number of real bugs caught.
| # | Model | Groundedness | Hallucinations | Real-FP | Total FP |
|---|---|---|---|---|---|
| 1 | gpt-5.4-mini | 100.0% | 0 | 0 | 3 |
| 2 | gemini-3.1-pro | 100.0% | 0 | 0 | 4 |
| 3 | gpt-5.4-nano | 100.0% | 0 | 0 | 6 |
| 4 | gemini-2.5-pro | 100.0% | 0 | 0 | 8 |
| 5 | gpt-5.3-codex | 94.4% | 2 | 1 | 3 |
| 6 | gemini-3.1-flash-lite | 92.3% | 2 | 0 | 2 |
| 7 | testers.ai (groundedness) ★ tuned for this | 92.3% | 3 | 1 | 4 |
| 8 | testers.ai | 91.7% | 4 | 1 | 5 |
| 9 | gemini-3-flash | 90.3% | 3 | 2 | 5 |
| 10 | claude-haiku-4-5 | 84.6% | 6 | 0 | 6 |
| 11 | claude-opus-4-7 | 81.7% | 11 | 5 | 16 |
| 12 | gpt-5.4 | 80.4% | 9 | 1 | 10 |
| 13 | gemma-4-e4b (local) | 75.0% | 5 | 1 | 6 |
| 14 | claude-sonnet-4-6 | 73.3% | 16 | 4 | 20 |
| 15 | testers.ai (recall+vision) | 64.0% | 27 | 2 | 29 |