Ask an LLM to calculate the pressure drop through a 2-inch schedule 40 steel pipe carrying water at 150°F and 50 GPM over 200 feet. You'll get an answer. It will be formatted beautifully, with equations laid out step by step, intermediate values clearly labeled, and a final result presented with apparent precision. It will also, more often than not, contain at least one error that an experienced engineer would catch immediately — a unit conversion done backwards, a Reynolds number calculated with the wrong diameter, a friction factor pulled from the wrong regime, or a fluid property cited at the wrong temperature.
The engineer catches the error because the engineer knows what the answer should look like. For water at these conditions, a pressure drop of roughly 3 to 4 psi over 200 feet of 2-inch pipe at 50 GPM feels right. A pressure drop of 300 psi does not. A pressure drop of 0.03 psi does not. The engineer has calibrated intuition: decades of experience that tell them when a number is in the right ballpark. The LLM has no such intuition. It has pattern-matching, and pattern-matching does not know the difference between plausible and correct.
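The ballpark for this case can be checked with a short Darcy-Weisbach calculation. The sketch below is a rough check, not a design calculation: the fluid properties, the pipe roughness, and the choice of the Swamee-Jain approximation for the friction factor are assumptions introduced here (nominal handbook-style values), and the function names are illustrative.

```python
import math

# Assumed inputs (not from the article): nominal handbook-style values.
PIPE_ID_M = 2.067 * 0.0254             # 2-inch schedule 40 inner diameter, m
LENGTH_M = 200 * 0.3048                # 200 ft in m
FLOW_M3_S = 50 * 3.785411784e-3 / 60   # 50 US gal/min in m^3/s
DENSITY = 980.0                        # kg/m^3, water near 150 F (assumed)
VISCOSITY = 4.3e-4                     # Pa*s, water near 150 F (assumed)
ROUGHNESS_M = 4.5e-5                   # commercial steel roughness (assumed)
PA_PER_PSI = 6894.757


def swamee_jain(reynolds, rel_roughness):
    """Explicit approximation to the Colebrook friction factor (turbulent flow)."""
    return 0.25 / math.log10(rel_roughness / 3.7 + 5.74 / reynolds**0.9) ** 2


def pressure_drop_psi():
    """Return (pressure drop in psi, Reynolds number, Darcy friction factor)."""
    area = math.pi / 4 * PIPE_ID_M**2
    velocity = FLOW_M3_S / area
    reynolds = DENSITY * velocity * PIPE_ID_M / VISCOSITY
    if reynolds < 5000:
        raise ValueError("Swamee-Jain is only intended for turbulent flow")
    friction = swamee_jain(reynolds, ROUGHNESS_M / PIPE_ID_M)
    # Darcy-Weisbach: dP = f * (L / D) * rho * v^2 / 2
    dp_pa = friction * (LENGTH_M / PIPE_ID_M) * DENSITY * velocity**2 / 2
    return dp_pa / PA_PER_PSI, reynolds, friction


dp_psi, reynolds, friction = pressure_drop_psi()
print(f"Re = {reynolds:.3g}, f = {friction:.4f}, dP = {dp_psi:.2f} psi")
```

With these assumed inputs the result lands at a few psi, which is the kind of order-of-magnitude anchor an experienced engineer carries around without computing it.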
How LLMs Get Engineering Wrong
The failure modes are not random. They are systematic, predictable, and rooted in how language models actually work. An LLM trained on millions of solved physics problems learns the shape of a solution — what equations typically appear, what variables are named, what sequence of steps is expected — without learning the physical reasoning that connects them. It is performing a sophisticated form of autocompletion, not physics.
A 2024 study of LLM physics reasoning found that GPT-4 would ignore physical context, such as tensorial order and dimensional constraints, in favor of algebraic pattern-matching. When the researchers removed key premises from physics problems, premises that a physicist would consider essential, the LLM often produced the same answer anyway, because the answer was being generated from the pattern of the problem type, not from the physical reasoning chain.
None of these failures are bugs that will be fixed in the next model release. They are structural consequences of how language models work. An LLM generates the most probable next token given the context. When the most probable next token happens to be the physically correct one, which it often is for well-trodden problems, the output is correct. When the most probable next token diverges from physical reality (at system boundaries, regime transitions, edge cases, elevated temperatures and pressures), the output is wrong with no change in confidence.
Why Engineers Use It Anyway
Despite everything above, engineers use LLMs daily. Not because they trust the outputs — most engineers are deeply skeptical — but because the alternative is worse. Before ChatGPT, an engineer who needed to recall the Dittus-Boelter correlation would open Incropera's textbook, find the right chapter, locate the equation, note the applicability constraints, and transcribe it into their calculation. That took fifteen minutes. Now it takes fifteen seconds — even after accounting for the verification step.
The pattern is consistent across the profession: LLMs as recall accelerators, not reasoning engines. Engineers use them to remember equations they've used before, to draft report sections they'll rewrite, to generate boilerplate code they'll debug, and to brainstorm approaches they'll evaluate. The LLM saves time on the parts of engineering that are memory-intensive but not judgment-intensive. The moment judgment is required — selecting the right correlation, interpreting the result, comparing against an allowable — the engineer takes over.
This is a rational workflow, but it's also an unstable one. It depends entirely on the engineer knowing enough to catch the LLM's errors. A senior engineer with twenty years of experience has the calibrated intuition to spot a pressure drop that's off by an order of magnitude. A junior engineer with two years of experience might not. And as organizations hire faster, as experienced engineers retire, and as project timelines compress, the gap between what the LLM produces and what the engineer can verify is widening.
What "AI Proposes, Physics Validates, Human Decides" Looks Like
The solution is not better LLMs. It is better architecture around LLMs. The fundamental insight is simple: do not let the LLM compute. Let the LLM understand the problem, select the approach, and orchestrate the workflow. Let deterministic engines — symbolic math solvers, unit-tracking libraries, material property databases, engineering standard lookups — perform the actual computation. And let the human review the result before it enters the engineering record.
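The division of labor described above can be sketched in a few lines. In this minimal illustration the LLM's contribution is represented by a JSON "plan" that names a method and its inputs but contains no computed numbers. The registry, the `run_plan` function, the plan format, and the `approve` callback are all hypothetical names chosen for this sketch, and the plan string is hard-coded rather than produced by a real model.

```python
import json


def reynolds_number(density, velocity, diameter, viscosity):
    """Deterministic calculation: Re = rho * v * D / mu (SI units assumed)."""
    return density * velocity * diameter / viscosity


# Hypothetical registry of calculations the orchestrator is allowed to call.
REGISTRY = {"reynolds_number": reynolds_number}


def run_plan(plan_json, approve):
    """Execute an LLM-proposed plan with deterministic code, then ask a human.

    plan_json: JSON text naming a registered method and its inputs.
    approve:   callable that receives the full record and returns True/False.
    """
    plan = json.loads(plan_json)
    method = plan["method"]
    if method not in REGISTRY:
        raise KeyError(f"unknown method: {method}")
    value = REGISTRY[method](**plan["inputs"])
    record = {"method": method, "inputs": plan["inputs"], "value": value}
    # The result is not accepted until the reviewer says so.
    record["approved"] = bool(approve(record))
    return record


# Stand-in for model output: the model proposes, it does not compute.
llm_plan = json.dumps({
    "method": "reynolds_number",
    "inputs": {"density": 980.0, "velocity": 1.457,
               "diameter": 0.0525, "viscosity": 4.3e-4},
})

record = run_plan(llm_plan, approve=lambda rec: rec["value"] > 0)
print(record)
```

The design choice worth noting is that the model never returns a number. If it names a method that is not in the registry, the run fails loudly instead of improvising.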
This is not a theoretical architecture. The components exist today. SymPy performs symbolic mathematics with exact arithmetic — it doesn't hallucinate algebra. Pint and similar dimensional analysis libraries track units through every operation and raise errors on dimensional inconsistency — they don't silently drop conversions. CoolProp provides validated thermophysical properties for over 100 fluids — it doesn't interpolate from training data. The missing piece is not the computation engines. It's the orchestration layer that connects natural language input to deterministic computation to human review.
In this architecture, the LLM's strengths (natural language understanding, broad engineering knowledge, the ability to select appropriate methods) are leveraged while its weaknesses are contained. The LLM decides that Dittus-Boelter is the right correlation. The symbolic engine evaluates it with exact arithmetic and tracked units. The validation layer checks that the Reynolds number actually exceeds 10,000 and that the Prandtl number is within the valid range, commonly cited as roughly 0.6 to 160. The human sees every step, every assumption, every intermediate value with units attached, and decides whether to accept the result.
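A validation layer of this kind can be very small. The sketch below evaluates the Dittus-Boelter correlation, Nu = 0.023 Re^0.8 Pr^n, only after checking its commonly cited applicability limits (Re of at least 10,000, Pr between roughly 0.6 and 160, and L/D of at least about 10 when a length is supplied). The class and function names are hypothetical, and the exact limits should be taken from whichever reference the organization treats as authoritative.

```python
from dataclasses import dataclass


class ApplicabilityError(ValueError):
    """Raised when inputs fall outside a correlation's stated range."""


@dataclass(frozen=True)
class CorrelationResult:
    nusselt: float
    exponent: float
    checks: tuple  # (description, passed) pairs shown to the reviewer


def dittus_boelter(reynolds, prandtl, heating=True, l_over_d=None):
    """Nu = 0.023 * Re^0.8 * Pr^n, with n = 0.4 (heating) or 0.3 (cooling)."""
    checks = [
        ("Re >= 10,000", reynolds >= 1.0e4),
        ("0.6 <= Pr <= 160", 0.6 <= prandtl <= 160.0),
    ]
    if l_over_d is not None:
        checks.append(("L/D >= 10", l_over_d >= 10.0))

    failed = [name for name, passed in checks if not passed]
    if failed:
        # Refuse to compute rather than return a number outside the valid range.
        raise ApplicabilityError("outside valid range: " + "; ".join(failed))

    n = 0.4 if heating else 0.3
    nusselt = 0.023 * reynolds**0.8 * prandtl**n
    return CorrelationResult(nusselt, n, tuple(checks))


result = dittus_boelter(1.0e5, 3.0, heating=True, l_over_d=50.0)
print(f"Nu = {result.nusselt:.1f}, checks = {result.checks}")

try:
    dittus_boelter(2000.0, 3.0)  # laminar flow: the correlation does not apply
except ApplicabilityError as exc:
    print("rejected:", exc)
```

Returning the list of checks alongside the value means the reviewer sees which constraints were tested, not just that the call succeeded.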
The critical design principle is transparency at every stage. The engineer should never see a final answer without the path that produced it. Not because the engineer wants to re-derive the solution — that defeats the purpose — but because the engineer needs to verify the approach, not the arithmetic. "Did the tool use the right correlation? Did it pull properties at the right temperature? Did it apply the right safety factor?" These are engineering judgment calls. The tool should surface them clearly, not bury them in a chat transcript.
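One way to surface those judgment calls is to make them first-class fields of the calculation record instead of leaving them as prose. The sketch below is a hypothetical data structure, not an existing tool's format: the class names, field names, and the sample values (which are illustrative numbers for the pipe-flow example) are all assumptions made here.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    label: str
    value: float
    unit: str
    source: str  # where the number came from: input, property lookup, equation


@dataclass
class CalculationRecord:
    title: str
    correlation: str           # judgment call: which method was selected
    property_temperature: str  # judgment call: where properties were evaluated
    safety_factor: float       # judgment call: margin applied to the result
    steps: list = field(default_factory=list)

    def add(self, label, value, unit, source):
        self.steps.append(Step(label, value, unit, source))

    def render(self):
        """Put the judgment calls first, then the full path with units."""
        lines = [
            self.title,
            "Judgment calls to review:",
            f"  correlation: {self.correlation}",
            f"  properties evaluated at: {self.property_temperature}",
            f"  safety factor: {self.safety_factor}",
            "Path:",
        ]
        for number, step in enumerate(self.steps, start=1):
            lines.append(
                f"  {number}. {step.label} = {step.value:.4g} {step.unit}"
                f"  [{step.source}]"
            )
        return "\n".join(lines)


record = CalculationRecord(
    title="Pressure drop, 2-inch sch 40 steel, 200 ft, 50 GPM water",
    correlation="Darcy-Weisbach with Swamee-Jain friction factor",
    property_temperature="150 F (bulk)",
    safety_factor=1.0,
)
record.add("Reynolds number", 1.74e5, "-", "rho*v*D/mu")
record.add("friction factor", 0.0208, "-", "Swamee-Jain")
record.add("pressure drop", 3.6, "psi", "Darcy-Weisbach")
report = record.render()
print(report)
```

Because the correlation, property temperature, and safety factor are required constructor arguments, a record cannot be created without stating them, which is the structural version of "never show a final answer without the path."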
Skepticism Is the Design Constraint
Engineers are right to be skeptical of AI for engineering calculations. That skepticism should not be treated as a marketing problem to be overcome with better demos. It should be treated as a design constraint to be satisfied with better architecture.
The "(Yet)" in the title of this piece is not about waiting for a smarter model. No amount of parameter scaling will give an LLM the ability to perform reliable dimensional analysis, because dimensional analysis is not a language task — it's a mathematical constraint that must be enforced, not predicted. The "(Yet)" is about architecture. When engineering AI tools are built with the right separation of concerns — LLM for understanding, engines for computation, governance for trust — then AI will be able to do your engineering math. Not because the AI got smarter, but because the system around it got honest about what each component can and cannot do.
Until then, the senior engineer's calibrated intuition remains the last line of defense between a confidently wrong AI output and a shipping product. That's not a workflow. That's a liability. And building the architecture that replaces intuition-as-guardrail with structure-as-guardrail is the most important unsolved problem in engineering AI.