Hi all! I wanted to share some results and high-level techniques of using Alloy as part of AI-based development, specifically using agents and harnesses. I’ll leave my own conclusions until the end of the post. And please reach out if there are follow-up questions or there’s anything I can do to help you out.
The systems I work on are high-assurance, “always on; never fail” globally distributed systems. The teams I work with have experience using lightweight formal methods, but experience greatly varies across different teams (and some teams have no experience). I’ve led the teams through the adoption of agent-based development, and through that journey we hit challenges that are well-documented by others: The “barbell problem” is real; specification and verification become the bottlenecks in the SDLC.
To solve these challenges we fully automated an approach to lightweight formal methods across all teams, inspired by Jackson’s Dependability Case concept. We use ‘steering docs’ to drive agent interactions, workflow, and manage agent alignment during tasks. Here is an example of the steering doc explaining the high-level approach: devbox/docs/lfm.md at main · ohpauleez/devbox · GitHub (this approach is what we have automated).
Our approach to spec-driven development is very much ‘invariants-first’. We follow a workflow we call DESIRED (the acronym provides something tangible to latch on to for different stakeholders across the company), and an early version is described in our guide: Getting Started with Spec-Driven Development - Spec-Driven Development Workshop (the Github repo also contains other commands, skills, etc. to make working this way a bit easier). We use OpenSpec to manage our specification documents, and very specifically, we use the srs-driven schema which contains specific agent steering instructions within the specification template docs: GitHub - ohpauleez/openspec_srs-driven: A slightly more rigorous OpenSpec schema · GitHub . This schema was tuned in an AutoEvolve / autoresearch loop across a series of different projects, and then further adjusted as we used it internally on production systems. We’re still making adjustments to it, but we’re happy with where it is at the moment.
As part of the specification process, our `spec.md` files become Alloy models, turning specifications into something that can be machine verified. As changes/updates to specs are proposed (through “delta specs”), the modeling ensures we can keep the designs sound. In our process, the agents write all of the specification docs, including the Alloy models. You can see an example of one of these specs here: devbox/openspec/specs/box-registry/spec.md at main · ohpauleez/devbox · GitHub
There are four details worth calling out within the spec files:
-
We “tag” specifications (eg: `[BOX-CLI-REGISTRY]`) and have a separate coverage tool that ensures the different approaches/tests within the verification pyramid cover the requirement/scenario; the tags can also optionally appear in the source code. We call this “Spec Traceability”.
-
Specs are written in EARS format, which is possible to analyze with an SMT solver: GitHub - ohpauleez/spec-check: Requirements analysis and source intent alignment tool · GitHub ; Spec-check also confirms that the intent of the code semantically matches the specification.
-
Specs contain one layer of “evidence” that they’re faithfully implemented as designed in the system
-
We have a separate tool called `evidence.sh` that evaluates all the evidence of a specification (including the Alloy model). This tool is inspired by Emina Torlak’s “evidence” extension to Alloy, but done as a separate tool.
When an agent implements a specification, it’s tightly controlled with another steering doc (basically a “style guide” for engineering high-assurance systems). As part of that implementation, all preconditions, postconditions, invariants, failure modes, and safety claims are documented within docstrings – this helps drive stronger alignment in the implementation as well (eg: the model is more likely to produce code that aligns with the claims in the doc string, and the claims in the docstring come from the specification documents). After the agent implements a specification, it will optionally perform a second pass to add deductive verification (in our case, OpenJML or LemmaScript) to the deterministic core of the system. Implementation also includes the full verification plan (unit tests for contracts, property-based tests for invariants and state machines, concurrency tests where appropriate, deterministic simulation tests, fault-injection paired with property-based tests in a metamorphic testing approach, etc.). The agents also use strong typing systems and static analyzers during this phase before marking the work complete.
We then use a collection of commands, skills, and tools to auto-review the implementation, tests, etc., and build up a manifest of the evidence for the change. Humans evaluate the final evidence – which drives the main question we continually ask ourselves: “Is this sufficient, direct evidence to determine the dependability of the system” and this question drives the continual evolution of the approach.
- - - -
We’ve used the approach described above for greenfield and brownfield projects, across teams of different sizes, with different familiarity with lightweight formal methods.
What has been most surprising is the ability to automate LFM in such a way that it fades into the background – to have relatively junior engineers produce incredible results because the workflow effectively coached them along the way and because they were forced into working ‘invariants-first’. Another surprising result was the way this has shaped debugging – when we find a bug, we use the agent to interrogate the spec and model first, understanding if there is a design gap or just a defect in the software. We use those cases to improve the steering docs. We’re still learning and adapting the tools and techniques, but we’re far enough along to say, “this works well.” There are more pieces that I left out of the post (eg: how we let an agent “see”, develop, and test a fully distributed system), but I can share those too if folks are interested.