A confidence floor is the cheapest noise filter you have
Across 1,200 internal review runs, dropping findings below 0.75 confidence cut acknowledged-noise by 38% while removing fewer than 4% of accepted findings. The data, and the caveats.
Every reviewer returns a self-reported confidence between 0 and 1. We default to dropping anything below 0.75. People sometimes ask whether models can self-assess confidence reliably. The honest answer is "well enough for this job" — and the data is more interesting than the philosophy.
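The rule itself is a one-liner. A minimal sketch, assuming findings arrive as dicts with a self-reported "confidence" field (the data shapes and names here are illustrative, not our actual API):

```python
# Default confidence floor: findings below this are dropped before aggregation.
CONFIDENCE_FLOOR = 0.75

def apply_floor(findings, floor=CONFIDENCE_FLOOR):
    """Keep only findings whose self-reported confidence meets the floor."""
    return [f for f in findings if f["confidence"] >= floor]

findings = [
    {"id": "a", "confidence": 0.91},
    {"id": "b", "confidence": 0.62},  # below the floor: dropped
    {"id": "c", "confidence": 0.75},  # exactly at the floor: kept
]
print([f["id"] for f in apply_floor(findings)])  # ['a', 'c']
```

Note the comparison is inclusive at the boundary: a finding at exactly 0.75 survives.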
We replayed 1,200 internal review runs from a 90-day window. Each run had three reviewers and a final aggregated post; each finding had a self-reported confidence and a maintainer reaction (accepted, dismissed, or ignored). "Acknowledged noise" is the count of findings the maintainer dismissed within 24 hours.
0.75 is the knee. Below it, noise drops fast; above it, the curve flattens and we start losing findings maintainers actually wanted. Which is exactly the shape you would hope a calibrated self-assessment produces, and a useful sanity check on the underlying model behaviour.
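Finding the knee is a straightforward sweep: for each candidate floor, measure what fraction of dismissed (noise) and accepted findings it would have removed, then take the highest floor whose accepted-finding loss stays within a budget. A sketch with made-up replay data, assuming the `reaction` field names from the setup above (the helper names are hypothetical):

```python
def sweep_floors(findings, floors):
    """For each candidate floor, report the fraction of dismissed (noise)
    and accepted findings that the floor would have removed."""
    dismissed = [f for f in findings if f["reaction"] == "dismissed"]
    accepted = [f for f in findings if f["reaction"] == "accepted"]
    rows = []
    for floor in floors:
        noise_cut = sum(f["confidence"] < floor for f in dismissed) / len(dismissed)
        accepted_cut = sum(f["confidence"] < floor for f in accepted) / len(accepted)
        rows.append((floor, noise_cut, accepted_cut))
    return rows

def pick_floor(rows, max_accepted_cut=0.05):
    """Highest floor whose accepted-finding loss is within budget.

    Noise removal is monotone in the floor, so the highest admissible
    floor removes the most noise."""
    candidates = [floor for floor, _, a in rows if a <= max_accepted_cut]
    return max(candidates) if candidates else None

# Illustrative replay data, not the real 1,200-run dataset.
replayed = [
    {"confidence": 0.30, "reaction": "dismissed"},
    {"confidence": 0.50, "reaction": "dismissed"},
    {"confidence": 0.80, "reaction": "dismissed"},
    {"confidence": 0.70, "reaction": "accepted"},
    {"confidence": 0.90, "reaction": "accepted"},
    {"confidence": 0.95, "reaction": "accepted"},
]
rows = sweep_floors(replayed, [0.60, 0.75, 0.85])
print(pick_floor(rows))  # 0.6 — 0.75 and 0.85 each drop an accepted finding
```

On the real data the curve has the shape described above: noise removal climbs steeply up to the knee, then accepted-finding loss starts to dominate.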
Self-reported confidence is not magic. It is correlated with finding quality, not equivalent to it. We saw a small population of high-confidence wrong findings — confidently claimed bugs that did not exist — and a smaller population of low-confidence right findings the floor would have dropped. The floor is a triage tool, not a truth oracle.
Confidence calibration also drifts when you swap models. We re-run the floor analysis whenever the default reviewer model changes. If you are using BYOK with a non-default model, the 0.75 default is a reasonable starting point but you should treat it as a knob.
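In practice "treat it as a knob" means the floor is looked up per model rather than hard-coded. A sketch of that shape — the model names, values, and config layout here are all illustrative assumptions, not our actual configuration:

```python
# Hypothetical per-model floors, re-derived whenever a reviewer model changes.
DEFAULT_FLOOR = 0.75
PER_MODEL_FLOORS = {
    "default-reviewer": 0.75,
    "byok-example-model": 0.70,  # a BYOK model that calibrates lower
}

def floor_for(model_name):
    """Floor for a given reviewer model, falling back to the default."""
    return PER_MODEL_FLOORS.get(model_name, DEFAULT_FLOOR)

print(floor_for("byok-example-model"))  # 0.7
print(floor_for("some-unknown-model"))  # 0.75
```

The fallback matters: an unrecognized model gets the default floor rather than no floor at all.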
The point is not that 0.75 is universally correct. The point is that a single scalar threshold, applied uniformly, removes more than a third of perceived noise for negligible engineering cost. Most of the harder filters we tried — embedding similarity, severity gating, post-hoc rerankers — did not justify their complexity once the floor was in place.