
Why prediction accuracy is not enough for physical systems

Machine learning for physical systems is often reported the same way as machine learning for everything else: a single accuracy or error number on a held-out test set. For engineering problems, that number is a weak proxy for what actually matters.

A model that scores well on average can still be useless — or unsafe — in the situations where an engineer most needs it. The reason is that physical systems carry structure, constraints, and consequences that a scalar error metric does not see.

Downstream use matters

A surrogate model is rarely the end product. It feeds a decision: a design iteration, an optimisation loop, a maintenance schedule, a safety margin. The relevant question is not "how low is the test error" but "does the model change the decision, and is that change correct?"

Two models with identical average error can produce very different downstream behaviour. One might be wrong in a smooth, predictable way that an engineer can correct for. The other might be wrong precisely at the design points that matter, while looking good on average.
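A minimal sketch of this situation, under stated assumptions: the quadratic true_response, the two surrogates model_a and model_b, and the choice of x >= 0.9 as the decision-relevant region are all hypothetical, chosen only so that the two models share the same mean absolute error.

```python
# Illustrative only: true_response, model_a, model_b and the "critical"
# region are hypothetical stand-ins, not taken from a real system.

def true_response(x):
    return x * x

def model_a(x):
    # Wrong everywhere by the same small offset: easy to spot and correct.
    return true_response(x) + 0.02

def model_b(x):
    # Exact over most of the range, wrong only near the upper end.
    return true_response(x) + (0.11 if x >= 0.9 else 0.0)

def mean_abs_error(model, points):
    return sum(abs(model(x) - true_response(x)) for x in points) / len(points)

test_points = [i / 10 for i in range(11)]        # 0.0, 0.1, ..., 1.0
critical = [x for x in test_points if x >= 0.9]  # assumed decision-relevant region

for name, model in [("A", model_a), ("B", model_b)]:
    print(name,
          "overall MAE:", round(mean_abs_error(model, test_points), 3),
          "critical MAE:", round(mean_abs_error(model, critical), 3))
```

Both models report the same overall error on this grid, but model B's error in the assumed critical region is several times larger than model A's. A single averaged number cannot distinguish the two.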

Distribution shift matters

Test-set accuracy assumes the future looks like the past. Physical systems routinely violate this: new geometries, new load cases, new materials, damage, wear, and operating regimes the model never saw. A model evaluated only in-distribution tells you little about how it behaves where you intend to use it.
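One way to make this visible is to report error separately for a held-out set from the training regime and for a deliberately shifted set. The sketch below assumes a hypothetical cubic response and a straight-line surrogate fitted by least squares; the specific ranges (0 to 1 for training, 1.5 to 2 for the shifted regime) are arbitrary choices for illustration.

```python
# Illustrative only: the cubic "true_response" and the straight-line surrogate
# are hypothetical; the point is the evaluation split, not the model.

def true_response(x):
    return x ** 3

def fit_line(xs, ys):
    # Ordinary least squares for y = slope * x + intercept.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def rmse(predict, xs):
    return (sum((predict(x) - true_response(x)) ** 2 for x in xs) / len(xs)) ** 0.5

train_x = [i / 50 for i in range(51)]             # training regime: 0.0 to 1.0
slope, intercept = fit_line(train_x, [true_response(x) for x in train_x])

def surrogate(x):
    return slope * x + intercept

in_dist_x = [(i + 0.5) / 50 for i in range(50)]   # held out, same regime
shifted_x = [1.5 + i / 100 for i in range(51)]    # unseen regime: 1.5 to 2.0

in_dist_rmse = rmse(surrogate, in_dist_x)
shifted_rmse = rmse(surrogate, shifted_x)
print("in-distribution RMSE:", round(in_dist_rmse, 3))
print("shifted RMSE:", round(shifted_rmse, 3))
```

In this toy setup the shifted error is more than an order of magnitude larger than the in-distribution error, even though the held-out points in the training regime suggest a reasonable fit.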

Physical plausibility matters

A prediction can be numerically close and still physically wrong: it may violate conservation, monotonicity, symmetry, or known limiting behaviour. Engineers tend to notice these violations quickly, and they erode trust faster than a slightly higher error would. A model that respects physical structure is often more useful than a marginally more accurate one that does not.
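Checks of this kind can be automated alongside the error metric. The sketch below tests one constraint, monotonicity, on a made-up load sweep; monotonicity_violations is a hypothetical helper, and the numbers are invented so that the model with the lower average error is the one that breaks the constraint.

```python
# Illustrative only: the load sweep, responses and both sets of predictions
# are made-up numbers; monotonicity_violations is a hypothetical helper.

def monotonicity_violations(inputs, predictions, tol=0.0):
    # Return (x_low, x_high) pairs where the prediction drops as the input rises.
    pairs = sorted(zip(inputs, predictions))
    return [(x0, x1)
            for (x0, y0), (x1, y1) in zip(pairs, pairs[1:])
            if y1 < y0 - tol]

def mean_abs_error(predictions, truth):
    return sum(abs(p - t) for p, t in zip(predictions, truth)) / len(truth)

loads = [0, 1, 2, 3, 4, 5]
truth = [0.00, 0.50, 0.80, 0.95, 1.00, 1.01]        # saturating, never decreasing

offset_model = [t + 0.02 for t in truth]            # biased but monotone
close_model = [0.00, 0.50, 0.80, 0.95, 1.01, 1.00]  # closer on average, dips at the end

for name, preds in [("offset", offset_model), ("close", close_model)]:
    print(name,
          "MAE:", round(mean_abs_error(preds, truth), 4),
          "violations:", monotonicity_violations(loads, preds))
```

Here the "close" predictions have the smaller mean absolute error, yet they predict a response that falls as the load rises from 4 to 5, which the assumed physics rules out. The same pattern extends to other constraints, such as conservation residuals or symmetry checks.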

Uncertainty matters

A useful model knows when it does not know. Without a calibrated sense of uncertainty, every prediction is presented with the same false confidence, and the engineer has no signal about when to fall back to simulation or physical testing. Uncertainty is what makes a fast-but-approximate model safe to rely on.
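A simple calibration check is to count how often observations fall inside the model's stated prediction intervals. The sketch below uses synthetic data with a known noise level and assumes Gaussian predictive distributions; the two "models" are hypothetical and differ only in the standard deviation they report.

```python
import random

# Illustrative only: synthetic data with a known noise level (0.1). The two
# "models" differ only in the standard deviation they report.

def interval_coverage(means, stds, observed, z=1.96):
    # Fraction of observations inside mean +/- z * std (Gaussian assumption).
    inside = sum(abs(y - m) <= z * s for m, s, y in zip(means, stds, observed))
    return inside / len(observed)

rng = random.Random(0)
means = [rng.uniform(0.0, 1.0) for _ in range(2000)]
observed = [m + rng.gauss(0.0, 0.1) for m in means]

calibrated_cov = interval_coverage(means, [0.1] * len(means), observed)
overconfident_cov = interval_coverage(means, [0.02] * len(means), observed)

# Nominal target for z = 1.96 is 0.95.
print("calibrated coverage:", round(calibrated_cov, 3))
print("overconfident coverage:", round(overconfident_cov, 3))
```

Both models make identical point predictions, so their accuracy is the same. Only the coverage check reveals that the second one claims far more confidence than its errors justify, which is exactly the case where an engineer would want to fall back to simulation or testing.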

What to evaluate instead

Behaviour under distribution shift, not just in-distribution error.
Error where it matters — at the design points and decisions the model will actually inform.
Physical plausibility and respect for known constraints.
Calibrated uncertainty and graceful failure.
The quality of the decision the model enables, end to end.
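The last item can be made concrete as decision regret: pick a design using the surrogate, then measure how much worse it is than the true optimum. This is a sketch under assumptions, not a prescribed method; true_cost, the two surrogates, and the candidate grid are all hypothetical.

```python
# Illustrative only: true_cost, both surrogates and the candidate grid are
# hypothetical. Regret = true cost of the chosen design minus the true optimum.

def true_cost(x):
    return (x - 0.7) ** 2

def surrogate_offset(x):
    # Uniformly pessimistic: larger average error, but the ranking is preserved.
    return true_cost(x) + 0.05

def surrogate_spike(x):
    # Exact except at one candidate, where it is badly optimistic.
    return true_cost(x) - (0.3 if abs(x - 0.2) < 1e-9 else 0.0)

def mean_abs_error(surrogate, candidates):
    return sum(abs(surrogate(x) - true_cost(x)) for x in candidates) / len(candidates)

def decision_regret(surrogate, candidates):
    chosen = min(candidates, key=surrogate)   # design the surrogate recommends
    best = min(candidates, key=true_cost)     # design that is actually best
    return true_cost(chosen) - true_cost(best)

candidates = [i / 10 for i in range(11)]      # 0.0, 0.1, ..., 1.0

for name, s in [("offset", surrogate_offset), ("spike", surrogate_spike)]:
    print(name,
          "MAE:", round(mean_abs_error(s, candidates), 3),
          "regret:", round(decision_regret(s, candidates), 3))
```

In this toy case the surrogate with the lower mean absolute error leads to the worse design choice, while the uniformly biased one still selects the true optimum.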

None of this means accuracy is irrelevant. It means accuracy is a necessary condition, not a sufficient one. Engineering value depends on the decisions a model enables, not the metrics it improves.
