A confidence floor is the cheapest noise filter you have
Across 1,200 internal review runs, dropping findings below 0.75 confidence cut acknowledged-noise by 38% while removing fewer than 4% of accepted findings. The data, and the caveats.
Every reviewer returns a self-reported confidence between 0 and 1. We default to dropping anything below 0.75. People sometimes ask whether models can self-assess confidence reliably. The honest answer is "well enough for this job" — and the data is more interesting than the philosophy.
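The rule itself is a one-liner. A minimal sketch, assuming findings arrive as dicts with a self-reported "confidence" field (the data shapes and names here are illustrative, not our actual API):

```python
# Default confidence floor: findings below this are dropped before aggregation.
CONFIDENCE_FLOOR = 0.75

def apply_floor(findings, floor=CONFIDENCE_FLOOR):
    """Keep only findings whose self-reported confidence meets the floor."""
    return [f for f in findings if f["confidence"] >= floor]

findings = [
    {"id": "a", "confidence": 0.91},
    {"id": "b", "confidence": 0.62},  # below the floor: dropped
    {"id": "c", "confidence": 0.75},  # exactly at the floor: kept
]
print([f["id"] for f in apply_floor(findings)])  # ['a', 'c']
```

Note the comparison is inclusive at the boundary: a finding at exactly 0.75 survives.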
We replayed 1,200 internal review runs from a 90-day window. Each run had three reviewers and a final aggregated post; each finding had a self-reported confidence and a maintainer reaction (accepted, dismissed, or ignored). "Acknowledged noise" is the count of findings the maintainer dismissed within 24 hours.
0.75 is the knee. Below it, noise drops fast; above it, the curve flattens and we start losing findings maintainers actually wanted. Which is exactly the shape you would hope a calibrated self-assessment produces, and a useful sanity check on the underlying model behaviour.
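Finding the knee is a straightforward sweep: for each candidate floor, measure what fraction of dismissed (noise) and accepted findings it would have removed, then take the highest floor whose accepted-finding loss stays within a budget. A sketch with made-up replay data, assuming the `reaction` field names from the setup above (the helper names are hypothetical):

```python
def sweep_floors(findings, floors):
    """For each candidate floor, report the fraction of dismissed (noise)
    and accepted findings that the floor would have removed."""
    dismissed = [f for f in findings if f["reaction"] == "dismissed"]
    accepted = [f for f in findings if f["reaction"] == "accepted"]
    rows = []
    for floor in floors:
        noise_cut = sum(f["confidence"] < floor for f in dismissed) / len(dismissed)
        accepted_cut = sum(f["confidence"] < floor for f in accepted) / len(accepted)
        rows.append((floor, noise_cut, accepted_cut))
    return rows

def pick_floor(rows, max_accepted_cut=0.05):
    """Highest floor whose accepted-finding loss is within budget.

    Noise removal is monotone in the floor, so the highest admissible
    floor removes the most noise."""
    candidates = [floor for floor, _, a in rows if a <= max_accepted_cut]
    return max(candidates) if candidates else None

# Illustrative replay data, not the real 1,200-run dataset.
replayed = [
    {"confidence": 0.30, "reaction": "dismissed"},
    {"confidence": 0.50, "reaction": "dismissed"},
    {"confidence": 0.80, "reaction": "dismissed"},
    {"confidence": 0.70, "reaction": "accepted"},
    {"confidence": 0.90, "reaction": "accepted"},
    {"confidence": 0.95, "reaction": "accepted"},
]
rows = sweep_floors(replayed, [0.60, 0.75, 0.85])
print(pick_floor(rows))  # 0.6 — 0.75 and 0.85 each drop an accepted finding
```

On the real data the curve has the shape described above: noise removal climbs steeply up to the knee, then accepted-finding loss starts to dominate.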
Self-reported confidence is not magic. It is correlated with finding quality, not equivalent to it. We saw a small population of high-confidence wrong findings — confidently claimed bugs that did not exist — and a smaller population of low-confidence right findings the floor would have dropped. The floor is a triage tool, not a truth oracle.
Confidence calibration also drifts when you swap models. We re-run the floor analysis whenever the default reviewer model changes. If you are using BYOK with a non-default model, the 0.75 default is a reasonable starting point but you should treat it as a knob.
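In practice "treat it as a knob" means the floor is looked up per model rather than hard-coded. A sketch of that shape — the model names, values, and config layout here are all illustrative assumptions, not our actual configuration:

```python
# Hypothetical per-model floors, re-derived whenever a reviewer model changes.
DEFAULT_FLOOR = 0.75
PER_MODEL_FLOORS = {
    "default-reviewer": 0.75,
    "byok-example-model": 0.70,  # a BYOK model that calibrates lower
}

def floor_for(model_name):
    """Floor for a given reviewer model, falling back to the default."""
    return PER_MODEL_FLOORS.get(model_name, DEFAULT_FLOOR)

print(floor_for("byok-example-model"))  # 0.7
print(floor_for("some-unknown-model"))  # 0.75
```

The fallback matters: an unrecognized model gets the default floor rather than no floor at all.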
The point is not that 0.75 is universally correct. The point is that a single scalar threshold, applied uniformly, removes more than a third of perceived noise for negligible engineering cost. Most of the harder filters we tried — embedding similarity, severity gating, post-hoc rerankers — did not justify their complexity once the floor was in place.