A field report of using Alloy with agent-based development

Hi all! I wanted to share some results and high-level techniques of using Alloy as part of AI-based development, specifically using agents and harnesses. I’ll leave my own conclusions until the end of the post. And please reach out if there are follow-up questions or there’s anything I can do to help you out.

The systems I work on are high-assurance, “always on; never fail” globally distributed systems. The teams I work with have experience using lightweight formal methods, but experience greatly varies across different teams (and some teams have no experience). I’ve led the teams through the adoption of agent-based development, and through that journey we hit challenges that are well-documented by others: The “barbell problem” is real; specification and verification become the bottlenecks in the SDLC.

To solve these challenges we fully automated an approach to lightweight formal methods across all teams, inspired by Jackson’s Dependability Case concept. We use ‘steering docs’ to drive agent interactions, workflow, and manage agent alignment during tasks. Here is an example of the steering doc explaining the high-level approach: devbox/docs/lfm.md at main · ohpauleez/devbox · GitHub (this approach is what we have automated).

Our approach to spec-driven development is very much ‘invariants-first’. We follow a workflow we call DESIRED (the acronym provides something tangible to latch on to for different stakeholders across the company), and an early version is described in our guide: Getting Started with Spec-Driven Development - Spec-Driven Development Workshop (the Github repo also contains other commands, skills, etc. to make working this way a bit easier). We use OpenSpec to manage our specification documents, and very specifically, we use the srs-driven schema which contains specific agent steering instructions within the specification template docs: GitHub - ohpauleez/openspec_srs-driven: A slightly more rigorous OpenSpec schema · GitHub . This schema was tuned in an AutoEvolve / autoresearch loop across a series of different projects, and then further adjusted as we used it internally on production systems. We’re still making adjustments to it, but we’re happy with where it is at the moment.

As part of the specification process, our `spec.md` files become Alloy models, turning specifications into something that can be machine verified. As changes/updates to specs are proposed (through “delta specs”), the modeling ensures we can keep the designs sound. In our process, the agents write all of the specification docs, including the Alloy models. You can see an example of one of these specs here: devbox/openspec/specs/box-registry/spec.md at main · ohpauleez/devbox · GitHub

There are four details worth calling out within the spec files:

  • We “tag” specifications (eg: `[BOX-CLI-REGISTRY]`) and have a separate coverage tool that ensures the different approaches/tests within the verification pyramid cover the requirement/scenario; the tags can also optionally appear in the source code. We call this “Spec Traceability”.

  • Specs are written in EARS format, which is possible to analyze with an SMT solver: GitHub - ohpauleez/spec-check: Requirements analysis and source intent alignment tool · GitHub ; Spec-check also confirms that the intent of the code semantically matches the specification.

  • Specs contain one layer of “evidence” that they’re faithfully implemented as designed in the system

  • We have a separate tool called `evidence.sh` that evaluates all the evidence of a specification (including the Alloy model). This tool is inspired by Emina Torlak’s “evidence” extension to Alloy, but done as a separate tool.

When an agent implements a specification, it’s tightly controlled with another steering doc (basically a “style guide” for engineering high-assurance systems). As part of that implementation, all preconditions, postconditions, invariants, failure modes, and safety claims are documented within docstrings – this helps drive stronger alignment in the implementation as well (eg: the model is more likely to produce code that aligns with the claims in the doc string, and the claims in the docstring come from the specification documents). After the agent implements a specification, it will optionally perform a second pass to add deductive verification (in our case, OpenJML or LemmaScript) to the deterministic core of the system. Implementation also includes the full verification plan (unit tests for contracts, property-based tests for invariants and state machines, concurrency tests where appropriate, deterministic simulation tests, fault-injection paired with property-based tests in a metamorphic testing approach, etc.). The agents also use strong typing systems and static analyzers during this phase before marking the work complete.

We then use a collection of commands, skills, and tools to auto-review the implementation, tests, etc., and build up a manifest of the evidence for the change. Humans evaluate the final evidence – which drives the main question we continually ask ourselves: “Is this sufficient, direct evidence to determine the dependability of the system” and this question drives the continual evolution of the approach.

- - - -

We’ve used the approach described above for greenfield and brownfield projects, across teams of different sizes, with different familiarity with lightweight formal methods.
What has been most surprising is the ability to automate LFM in such a way that it fades into the background – to have relatively junior engineers produce incredible results because the workflow effectively coached them along the way and because they were forced into working ‘invariants-first’. Another surprising result was the way this has shaped debugging – when we find a bug, we use the agent to interrogate the spec and model first, understanding if there is a design gap or just a defect in the software. We use those cases to improve the steering docs. We’re still learning and adapting the tools and techniques, but we’re far enough along to say, “this works well.” There are more pieces that I left out of the post (eg: how we let an agent “see”, develop, and test a fully distributed system), but I can share those too if folks are interested.

1 Like

If you have tools or techniques you think I should try with the teams, or even general suggestions, please let me know. If you’ve tried something different and can provide a short field report, that would be appreciated. Also if there is something you want me to try because you’re looking for feedback, I’m more than happy to try and report back!

Do you mix hand-written and vibecoded parts? I’m just curious. I prefer not to, however I use an agent basically to do pair programming, with very small steps, I know and own all my code, my diffs are highly readable. It’s closer to writing code manually and asking an LLM in a chat than to deferring to agents.

My question is how would you downscale the described spec-driven vibecoding approach to this agentic pair-programming workflow?

What would be the first or maybe top-3 things to introduce that would be the most impactful?

The biggest problem I see is that, while models are code, they are a form of documentation as well, so they are separated from codebase and tend to diverge from it with time.

Thanks for the interest and wonderful questions!

We do mix hand-written and agent-produced parts, although purely hand-written is becoming rare. “Hand-adjusted” is still common when a human reviews the final artifacts. I don’t think anything is vibecoded at this point (we’re much more principled than that and our domain requires quite a bit more rigor). Developers still need to review final outputs and take full ownership of systems, but “review fatigue” is real, which is why we rely heavily on the approach to evidence and ‘dependability cases’ (so we can focus the review a bit more on critical pieces). We still keep diffs “human-sized” and well organized, but we are exploring what it might take to process much larger diffs (and in what scenarios that would and wouldn’t work).

I left many details out of my description above, including how we got to this point (and problems we had along the way). We certainly had a point where the teams were “LLM-assisted” in the pair programming sense you’re discussing. The engineers had a natural evolution from LLM-assisted to agent-based (as verification techniques were introduced and automated), and teams learned new techniques to ownership (ie: how do you retain the learning processes that happen when writing code at human-speed, that have a large influence on ownership during maintenance). Our primary interest is in making software easier to own and operate (software spends most of its life in ‘maintenance’ and we’re focused on getting that cost down, not the cost to create software).

Here are a few items that I think would be the most impactful:

  • Specs as “living docs” makes a real difference. Studies have already shown the majority of defects shipped to production happen in design and requirements and this is amplified by LLMs. Seeing the “inputs” and adding rigor there makes everyone in the Org stronger (Product people, other developers, LLMs, etc.) – to some degree, this aligns (in spirit) with Jackson’s Software Concepts (and we organize our specifications by “capability”). At some point in this evolution, seeing the “inputs” becomes necessary, so establishing a strong practice earlier is best.
  • The cost of producing code is effectively 0 – but this also means the cost of enhancing your test suite, integrating models, using differential testing, etc. is also effectively 0. There are too many headlines on “the speed at which we can produce code” which is covering the much more important fact: the depth at which we can have confidence. Every implementation decision can have a microbenchmark to support it, every code change can have a robust test suite far beyond what would normally be cost-appropriate, docstrings and literate-style docs can be automated across large projects – The quality of software can be dramatically higher (lowering maintenance costs) even if the overall throughput remained the same (spoiler: the throughput increases substantially too). So my recommendation is focus on quality first and the rest will follow.
  • Divergence between artifacts isn’t a problem in practice because of how we automated things and the guidelines we set. All “living” pieces (code, tests, docs, specs, models, etc.) are included when making a change. Because we drive all changes through specification, the code doesn’t really get the chance to diverge. We also build “executable models” for differential testing (and we did this before we started with LLMs) – that practice ensures models and code stay aligned. My recommendation here is spend the time to make the “maintenance” tasks automated and part of the workflow – make a set of commands/prompts/scripts for reviewing and aligning models and code.

I hope this help! Let me know if you have any other questions or anything else I can do to help out!

1 Like

Thank you for the elaborate answer. I haven’t read the Daniel Jackson’s book “The Essence of Software”, it seems the time to open it has come.

A question:

literate-style docs can be automated across large projects

What tools do you use and what’s the workflow?