My EuroSys 2026 paper is obsolete

My team and I have a paper appearing at EuroSys 2026: “Lessons Learned from Incorporating Formal Methods in Huawei Cloud Reliability” (paper).
It’s an industrial experience report covering three projects where we applied probabilistic concurrency testing, model checking, and deductive verification to real cloud infrastructure. The results are real: we found and fixed critical bugs. We documented it all: person-month effort metrics, tool comparisons, and practical guidance to help other organizations decide whether and how to invest in formal methods.

I’m proud of this work. And it’s already obsolete.

Not the bugs. Not the methodology. The approach of combining lightweight and heavyweight formal methods at different abstraction levels, depending on components’ criticality and risk, still makes sense. What’s obsolete is the thing we were most proud of: the effort data.


Our paper’s promise to the reader was: “Hey, this is more feasible than you think, and here’s exactly what it did cost us”. We reported person-months, team composition, and expertise levels. We said: a single engineer with formal methods background and intermediate tool proficiency can build a meaningful TLA+ model of a complex distributed transaction protocol in about two person-weeks of modelling work, after a two-month onboarding period to understand the system. We said: a newly hired engineer with no prior Gobra experience can verify a small critical library in 2.5 months. We said: engineers without formal methods training can learn P and produce useful models within a few months.

These numbers were honest. They were precise. They were the result we thought would have the most impact—giving engineering managers a concrete basis for deciding whether to invest in formal methods.

And they are now deeply misleading because, since writing this paper, the entry barrier to formal methods has dropped by an order of magnitude.

Only a sketch. But the trend is real

The work reported in the paper was done between mid-2024 and early 2025. In the twelve months since, LLMs have fundamentally changed what it takes to use formal methods. We are now using LLMs to build TLA+ models of distributed protocols in days; to set up bounded model checking on unfamiliar codebases in hours; to write verification annotations with no prior experience in the tools. Tasks that our paper budgets in person-months are now collapsing into person-days. The bottleneck has shifted.


In our paper, we found that roughly 60% of project time was spent understanding the target system (reading code, attending meetings, asking questions, building mental models), while only 40% went to actual modelling and verification. That 60/40 split was one of our most important findings, because it tells you where the real cost is: domain knowledge transfer, not formal methods expertise.

LLMs now compress both sides. If the old budget was 100 (60 on understanding, 40 on verification), the new budget might look more like 25 (20 on understanding, 5 on verification). Not the same ratio, and a fundamentally different total. But that 20 should never become zero. Understanding what to model, which properties matter, where the interesting corner cases live: that still requires a human who understands the system.

One of the lessons learned we emphasised in the paper is about model validation. We write: “Many safety properties are trivially satisfied by overly restrictive models. A model that deadlocks or omits key system traces may falsely appear to prove safety properties, despite the actual system being vulnerable to failures.” We hit this problem ourselves, multiple times. We initially believed a model verified certain properties, only to discover later that the model had silently excluded the traces that could violate them. We developed a practice of deliberately “breaking” models during validation: disabling safety mechanisms in the model to confirm that violations can then occur under reasonable conditions. If a property holds even when you try to break it, that’s a red flag.
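To make the trap concrete, here is a minimal TLA+ sketch of the failure mode, followed by the break-the-model check. It is a hypothetical toy example, not one of the specs from the paper; every name in it (VacuousModel, Crash, SafeUnderCrash, and so on) is invented for illustration.

```tla
---- MODULE VacuousModel ----
\* Hypothetical toy spec, not from the paper: one process
\* may acquire and release a lock, and the system may crash.
VARIABLES holder, crashed

Init == holder = "none" /\ crashed = FALSE

Acquire == ~crashed /\ holder = "none" /\ holder' = "p1" /\ UNCHANGED crashed
Release == holder = "p1" /\ holder' = "none" /\ UNCHANGED crashed

\* The overly restrictive action: this guard only lets a crash
\* happen while the lock is free, silently excluding exactly the
\* traces the safety property below is meant to check.
Crash == holder = "none" /\ crashed' = TRUE /\ UNCHANGED holder

Next == Acquire \/ Release \/ Crash

\* TLC reports this invariant as verified, but vacuously so:
\* no behaviour ever reaches a state where crashed = TRUE while
\* the lock is held.
SafeUnderCrash == crashed => holder = "none"
====
```

The validation step is exactly the breaking exercise described above: drop the holder = "none" guard from Crash and rerun TLC. If SafeUnderCrash is now violated, the earlier pass was vacuous; the property held because of the guard, not because the system is safe. If it still holds, the model is excluding the interesting traces somewhere else, and that is the red flag.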

This trap gets worse with LLMs. A language model can generate a spec that looks plausible, type-checks, and passes model checking, but captures a simplification so aggressive that the verification is vacuous. The spec looks right. The output says verified. And nobody deeply understands why, because nobody wrote it line by line. The floor has dropped, but the ceiling hasn’t moved.


What survives, then? Not the effort numbers. Those belong to the past. Take the methodology: how to combine tools at different abstraction levels, how to validate models, how to collaborate with development teams, how to frame results for stakeholders. Take the finding that domain knowledge, not formal methods expertise, is the primary bottleneck.

And take this: formal methods just became dramatically more accessible. The fear that it’s too expensive, too specialised, too slow—it’s gone. If you’re building distributed systems, databases, consensus protocols, or security-critical infrastructure, the cost of not verifying your designs has always been high. The cost of doing it just dropped by an order of magnitude.

Our paper might document the last generation of purely human-driven formal methods in large-scale systems. It’s a useful historical record. But if you’re deciding whether to invest in formal methods today, in 2026, the answer is easier than anything we wrote in that paper would suggest.


The paper is available here. It includes bugs found, effort metrics, tool comparisons, and an actionable checklist for teams considering adopting formal methods.
