MAY 22, 2026

The state of provable software.

In Trust and Verify, we called the analysis we run "established mathematics from formal methods." This post is a tour of that world — how software gets analyzed today, what each method can and can't promise, and why "provable software" is finally moving from aerospace into everyday code.

The Archway team

10 min read · A tour

New here? Start with Trust and Verify for what Archway is. This post zooms out to the landscape it sits in.

Two ways to know if code is correct

There have always been two fundamentally different ways to gain confidence in software.

The first is testing: run the code on some inputs, check the outputs. Testing is indispensable, but it has a hard ceiling. As Dijkstra famously put it in his 1972 Turing Award lecture, testing can show the presence of bugs, never their absence. The practical reading: a test tells you about the case you ran — and nothing about the rest.

The second is static analysis: reason about the code without running it, to make statements about all possible executions at once. Developers consume it daily — type checkers, linters, security scanners — but usually as black boxes. Underneath each is a particular technique, and they vary a lot in what they can promise. This is the lineage that leads to "provable software." Let's walk it.

Testing samples the input space; static analysis reasons about the whole of it.

Testing shows the presence of bugs; static analysis aims at their absence.

The toolbox: how code gets analyzed without running it

These methods aren't competitors so much as a toolkit, each with a different trade between what it can prove and what it costs to run.

Type systems

Static analysis tied to the language's type system. A type checker proves, before the program runs, that you never do something nonsensical — like adding a number to a function. In statically-typed languages — TypeScript, Rust, the ML family — the checker is built into the compiler. In dynamically-typed ones like Python, type hints are part of the language but the checker runs as a separate tool (mypy, Pyright). Either way, modern type systems push this surprisingly far. At the high end, the line between "type system" and "proof system" gets blurry: dependent types (which power Lean) and refinement types (Liquid Haskell, F*) let the type system encode rich properties about behavior, not just shape. The everyday limit is that not everything you care about is easily expressed as a type.

Linters and pattern-based analysis

The next layer most teams add on top. Tools like ESLint, Semgrep, CodeQL, and SonarQube scan code against a curated library of patterns — "don't use eval," "this kind of API call needs a null check after," "this string concatenation looks like SQL injection." They don't prove anything; they detect known shapes of trouble. The strength is breadth and speed. The limit is symmetric: they only catch what someone has written a rule for, and a clean run is no guarantee of correctness.

Abstract interpretation

The workhorse behind serious static analyzers, and worth understanding because it's the heart of what makes "reason about all executions" tractable. The idea, from Patrick and Radhia Cousot's 1977 paper, is to run the program in your head using approximations instead of real values. Instead of tracking that x = 7, you track "x is positive," or "x is in this range." By computing over these abstractions, you can prove a property holds for every run — at the cost of imprecision: the abstraction may flag a possible bug where the real program would never actually have one.

The art is choosing an abstraction precise enough to be useful but coarse enough to compute. It's how analyzers reason about infinitely many executions in finite time. Industrial proof points: Astrée has been used to verify that the Airbus A380 flight-control software contains no runtime errors; Infer (Meta) uses a dialect called separation logic — particularly good at reasoning about pointers and heap structure — to run at scale across Facebook's codebases.

Abstract interpretation: reason over approximations to cover every possible run at once.

The cost is precision — it may say "maybe" when the truth is "never."

Model checking

Exhaustively explore the states a system can be in, to prove it can never reach a bad one — a deadlock, a safety violation. Tools like TLA+ and SPIN have been the workhorses, with major wins in distributed systems (most visibly Amazon's use of TLA+ across AWS) and protocol verification. The classic challenge is state-space explosion — the number of states can blow up — which has driven decades of clever techniques to keep it tractable.

Symbolic execution & SMT solvers

Instead of concrete inputs, run the program on symbols, building up a logical formula describing what each path requires. Hand that formula to an SMT solver — Z3 is the canonical example — which can prove whether a condition is satisfiable. This is how a tool can say "here is the exact input that triggers this bug," or prove no such input exists. KLEE is the canonical symbolic-execution engine built on top of an SMT solver, widely used in research and security testing.

Proof assistants & formal proof

At the rigorous end: tools like Lean, Coq/Rocq, Dafny, and F*, where humans (and increasingly AI) write machine-checked proofs that code meets a specification. This is the gold standard — used for aerospace, cryptography (Microsoft's verified miTLS implementation of TLS 1.3 was written in F*), and safety-critical systems. Its historical cost is also why it stayed there: it has demanded PhD-level expertise and, famously, can require many lines of proof per line of code.

The trade nobody escapes

Step back and the whole field is organized around one tension: the more you want to prove, the more it costs — in compute, in expertise, or in restrictions on how you write code.

Linters are cheap and broad, but catch only what someone has written a rule for.
Type systems range from cheap-and-limited (basic types) to powerful-but-effortful (dependent / refinement types).
Abstract interpretation scales to real codebases, but approximates.
Model checking is exhaustive, but fights state-space explosion.
Proof assistants prove almost anything, but have demanded enormous human effort.

Every practical tool is a bet on where to sit on this spectrum.

Every tool that proves something sits somewhere on this curve — linters and pattern-based checks sit outside it, since they detect known shapes rather than prove properties. Positions are illustrative; there is no canonical ranking, and most real systems blend several methods.

The more you want to prove, the more it costs.

Illustrative, not measured — and linters sit off the curve, since they detect known shapes rather than prove properties.

Where the field is moving — and who's pushing it

Two things changed at once. AI started writing a huge share of code, and AI started getting good enough to help with the proofs themselves. The result is a wave of work pulling formal methods out of the aerospace lab and toward everyday software.

A few currents worth knowing, each an instance of the toolbox above:

Hybrid static analysis + AI review. Tools like DeepSource combine static analysis with AI assistance in ordinary pull-request workflows.
Annotation-first toolchains. Tools like LemmaScript (Midspiral) let you write ordinary TypeScript with inline specifications and generate machine-checked proofs in Lean or Dafny. A lighter-weight on-ramp to the annotate-up-front bargain — still requiring the intent to be written down.
AI that writes proofs. A well-funded cohort — Axiom, Harmonic, Logical Intelligence — is training AI to generate machine-checkable proofs, mostly in Lean or Dafny. They're attacking the historical cost of proof assistants by automating the proving itself.
Languages built for local reasoning. And then there's Rust — designed so the compiler statically proves a meaningful slice of correctness (memory safety, no data races) at compile time, without runtime checks. Not full functional verification, but arguably the most successful attempt to bake any provable guarantee into a mainstream language by design. Rust is such a clean case study in the trade-off above that it deserves its own post — that's where we're headed next.

The throughline: everyone is trying to move up the "how much it can prove" axis while pushing the cost down — whether by automating proofs, building smarter languages, or finding new representations that make powerful analysis tractable on real code.

Where Archway fits

Archway doesn't ask you to write in a proof language, and it isn't another probabilistic reviewer. As we described in our first post, our translation engine turns existing code into a categorical representation, and our analysis engine runs the established techniques above — type-level reasoning, abstract interpretation, proofs — on top of it. The wager is that the representation is what's been missing: the thing that makes decades of powerful formal-methods research finally tractable on the messy, dynamic code people actually write.

— The Archway team

FEEDBACK

Got something we should add to the field guide?

Working in formal methods, or running into one of these tools in production? Tell us where we got it wrong — or what we missed.

Send us a note