A field report of using Alloy with agent-based development

ohpauleez · June 25, 2026, 3:38pm

Hi all! I wanted to share some results and high-level techniques of using Alloy as part of AI-based development, specifically using agents and harnesses. I’ll leave my own conclusions until the end of the post. And please reach out if there are follow-up questions or there’s anything I can do to help you out.

The systems I work on are high-assurance, “always on; never fail” globally distributed systems. The teams I work with have experience using lightweight formal methods, but experience greatly varies across different teams (and some teams have no experience). I’ve led the teams through the adoption of agent-based development, and through that journey we hit challenges that are well-documented by others: The “barbell problem” is real; specification and verification become the bottlenecks in the SDLC.

To solve these challenges we fully automated an approach to lightweight formal methods across all teams, inspired by Jackson’s Dependability Case concept. We use ‘steering docs’ to drive agent interactions, workflow, and manage agent alignment during tasks. Here is an example of the steering doc explaining the high-level approach: devbox/docs/lfm.md at main · ohpauleez/devbox · GitHub (this approach is what we have automated).

Our approach to spec-driven development is very much ‘invariants-first’. We follow a workflow we call DESIRED (the acronym provides something tangible to latch on to for different stakeholders across the company), and an early version is described in our guide: Getting Started with Spec-Driven Development - Spec-Driven Development Workshop (the GitHub repo also contains other commands, skills, etc. to make working this way a bit easier). We use OpenSpec to manage our specification documents, and very specifically, we use the srs-driven schema, which contains specific agent steering instructions within the specification template docs: GitHub - ohpauleez/openspec_srs-driven: A slightly more rigorous OpenSpec schema · GitHub. This schema was tuned in an AutoEvolve / autoresearch loop across a series of different projects, and then further adjusted as we used it internally on production systems. We’re still making adjustments to it, but we’re happy with where it is at the moment.

As part of the specification process, our `spec.md` files become Alloy models, turning specifications into something that can be machine verified. As changes/updates to specs are proposed (through “delta specs”), the modeling ensures we can keep the designs sound. In our process, the agents write all of the specification docs, including the Alloy models. You can see an example of one of these specs here: devbox/openspec/specs/box-registry/spec.md at main · ohpauleez/devbox · GitHub

There are four details worth calling out within the spec files:

We “tag” specifications (eg: `[BOX-CLI-REGISTRY]`) and have a separate coverage tool that ensures the different approaches/tests within the verification pyramid cover the requirement/scenario; the tags can also optionally appear in the source code. We call this “Spec Traceability”.
Specs are written in EARS format, which can be analyzed with an SMT solver: GitHub - ohpauleez/spec-check: Requirements analysis and source intent alignment tool · GitHub. Spec-check also confirms that the intent of the code semantically matches the specification.
Specs contain one layer of “evidence” that they’re faithfully implemented as designed in the system.
We have a separate tool called `evidence.sh` that evaluates all the evidence of a specification (including the Alloy model). This tool is inspired by Emina Torlak’s “evidence” extension to Alloy, but done as a separate tool.
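To make the “Spec Traceability” idea concrete, here is a minimal sketch of a tag coverage check. It is an illustration only, not the actual coverage tool: the tag grammar is inferred from the single `[BOX-CLI-REGISTRY]` example, and `extract_tags`, `uncovered_tags`, and the `[BOX-CLI-REFRESH]` tag are hypothetical names made up for this example.

```python
import re

# Assumed tag format, modeled on the `[BOX-CLI-REGISTRY]` example:
# upper-case segments joined by hyphens, wrapped in square brackets.
TAG_PATTERN = re.compile(r"\[([A-Z][A-Z0-9]*(?:-[A-Z0-9]+)+)\]")


def extract_tags(text: str) -> set[str]:
    """Return every spec tag that appears in the given text."""
    return set(TAG_PATTERN.findall(text))


def uncovered_tags(spec_text: str, evidence_texts: list[str]) -> set[str]:
    """Return tags declared in the spec that no evidence artifact mentions."""
    declared = extract_tags(spec_text)
    covered: set[str] = set()
    for text in evidence_texts:
        covered |= extract_tags(text)
    return declared - covered


# Hypothetical spec excerpt and test source, for illustration only.
spec = """
### Requirement: Registry lookup [BOX-CLI-REGISTRY]
### Requirement: Registry refresh [BOX-CLI-REFRESH]
"""
tests = ["def test_lookup():  # covers [BOX-CLI-REGISTRY]\n    ..."]

missing = uncovered_tags(spec, tests)
print(sorted(missing))
```

Here the second requirement has no test that mentions its tag, so it is reported as uncovered. A real tool would presumably also distinguish which layer of the verification pyramid each tag was found in.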

When an agent implements a specification, it’s tightly controlled with another steering doc (basically a “style guide” for engineering high-assurance systems). As part of that implementation, all preconditions, postconditions, invariants, failure modes, and safety claims are documented within docstrings – this helps drive stronger alignment in the implementation as well (eg: the model is more likely to produce code that aligns with the claims in the doc string, and the claims in the docstring come from the specification documents). After the agent implements a specification, it will optionally perform a second pass to add deductive verification (in our case, OpenJML or LemmaScript) to the deterministic core of the system. Implementation also includes the full verification plan (unit tests for contracts, property-based tests for invariants and state machines, concurrency tests where appropriate, deterministic simulation tests, fault-injection paired with property-based tests in a metamorphic testing approach, etc.). The agents also use strong typing systems and static analyzers during this phase before marking the work complete.
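As a rough illustration of contracts in docstrings paired with a property-style test, here is a small sketch. The function `reserve` and its contract are hypothetical and not taken from our codebase, and the randomized check uses only the standard library rather than a real property-based testing framework.

```python
import random


def reserve(capacity: int, used: int, requested: int) -> int:
    """Reserve `requested` slots and return the new `used` count.

    Preconditions:  0 <= used <= capacity and requested >= 0.
    Postconditions: result == used + requested when the request fits,
                    otherwise result == used (the request is rejected whole).
    Invariant:      0 <= result <= capacity.
    Failure mode:   ValueError when a precondition is violated.
    """
    if not (0 <= used <= capacity) or requested < 0:
        raise ValueError("precondition violated")
    if used + requested > capacity:
        return used  # reject: never over-commit capacity
    return used + requested


def check_invariant(trials: int = 1_000, seed: int = 0) -> None:
    """Randomized check that the documented invariant holds for valid inputs."""
    rng = random.Random(seed)  # seeded so any failure is reproducible
    for _ in range(trials):
        capacity = rng.randint(0, 100)
        used = rng.randint(0, capacity)
        requested = rng.randint(0, 150)
        result = reserve(capacity, used, requested)
        assert 0 <= result <= capacity
        assert result in (used, used + requested)


check_invariant()
```

The docstring states the claims, and the test checks exactly those claims, which is the alignment the steering doc is trying to encourage.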

We then use a collection of commands, skills, and tools to auto-review the implementation, tests, etc., and build up a manifest of the evidence for the change. Humans evaluate the final evidence, which leads to the main question we continually ask ourselves: “Is this sufficient, direct evidence to determine the dependability of the system?” That question drives the continual evolution of the approach.

- - - -

We’ve used the approach described above for greenfield and brownfield projects, across teams of different sizes, with different familiarity with lightweight formal methods.
What has been most surprising is the ability to automate LFM in such a way that it fades into the background – to have relatively junior engineers produce incredible results because the workflow effectively coached them along the way and because they were forced into working ‘invariants-first’. Another surprising result was the way this has shaped debugging – when we find a bug, we use the agent to interrogate the spec and model first, understanding if there is a design gap or just a defect in the software. We use those cases to improve the steering docs. We’re still learning and adapting the tools and techniques, but we’re far enough along to say, “this works well.” There are more pieces that I left out of the post (eg: how we let an agent “see”, develop, and test a fully distributed system), but I can share those too if folks are interested.

ohpauleez · June 25, 2026, 3:50pm

If you have tools or techniques you think I should try with the teams, or even general suggestions, please let me know. If you’ve tried something different and can provide a short field report, that would be appreciated. Also if there is something you want me to try because you’re looking for feedback, I’m more than happy to try and report back!

Alejandro · June 25, 2026, 7:37pm

Do you mix hand-written and vibecoded parts? I’m just curious. I prefer not to; however, I use an agent basically to do pair programming, with very small steps. I know and own all my code, and my diffs are highly readable. It’s closer to writing code manually and asking an LLM in a chat than to deferring to agents.

My question is how would you downscale the described spec-driven vibecoding approach to this agentic pair-programming workflow?

What would be the first or maybe top-3 things to introduce that would be the most impactful?

The biggest problem I see is that, while models are code, they are a form of documentation as well, so they are separated from codebase and tend to diverge from it with time.

ohpauleez · June 26, 2026, 10:49am

Thanks for the interest and wonderful questions!

We do mix hand-written and agent-produced parts, although purely hand-written is becoming rare. “Hand-adjusted” is still common when a human reviews the final artifacts. I don’t think anything is vibecoded at this point (we’re much more principled than that and our domain requires quite a bit more rigor). Developers still need to review final outputs and take full ownership of systems, but “review fatigue” is real, which is why we rely heavily on the approach to evidence and ‘dependability cases’ (so we can focus the review a bit more on critical pieces). We still keep diffs “human-sized” and well organized, but we are exploring what it might take to process much larger diffs (and in what scenarios that would and wouldn’t work).

I left many details out of my description above, including how we got to this point (and problems we had along the way). We certainly had a point where the teams were “LLM-assisted” in the pair programming sense you’re discussing. The engineers had a natural evolution from LLM-assisted to agent-based (as verification techniques were introduced and automated), and teams learned new approaches to ownership (ie: how do you retain the learning processes that happen when writing code at human speed and that have a large influence on ownership during maintenance). Our primary interest is in making software easier to own and operate (software spends most of its life in ‘maintenance’ and we’re focused on getting that cost down, not the cost to create software).

Here are a few items that I think would be the most impactful:

Specs as “living docs” make a real difference. Studies have already shown that the majority of defects shipped to production originate in design and requirements, and this is amplified by LLMs. Seeing the “inputs” and adding rigor there makes everyone in the Org stronger (Product people, other developers, LLMs, etc.) – to some degree, this aligns (in spirit) with Jackson’s Software Concepts (and we organize our specifications by “capability”). At some point in this evolution, seeing the “inputs” becomes necessary, so establishing a strong practice earlier is best.
The cost of producing code is effectively 0 – but this also means the cost of enhancing your test suite, integrating models, using differential testing, etc. is also effectively 0. There are too many headlines on “the speed at which we can produce code”, which obscures the much more important fact: the depth at which we can have confidence. Every implementation decision can have a microbenchmark to support it, every code change can have a robust test suite far beyond what would normally be cost-appropriate, docstrings and literate-style docs can be automated across large projects. The quality of software can be dramatically higher (lowering maintenance costs) even if the overall throughput remains the same (spoiler: the throughput increases substantially too). So my recommendation is to focus on quality first and the rest will follow.
Divergence between artifacts isn’t a problem in practice because of how we automated things and the guidelines we set. All “living” pieces (code, tests, docs, specs, models, etc.) are included when making a change. Because we drive all changes through specification, the code doesn’t really get the chance to diverge. We also build “executable models” for differential testing (and we did this before we started with LLMs) – that practice ensures models and code stay aligned. My recommendation here is to spend the time to make the “maintenance” tasks automated and part of the workflow: make a set of commands/prompts/scripts for reviewing and aligning models and code.
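For anyone unfamiliar with differential testing against an “executable model”, here is a minimal sketch of the shape it can take. Everything in it is hypothetical and chosen for brevity: `ModelSet` stands in for a simple, obviously correct model, and `SortedListSet` stands in for the optimized implementation under test.

```python
import bisect
import random


class ModelSet:
    """Executable model: the obvious behaviour, trusted by inspection."""

    def __init__(self) -> None:
        self.items: set[int] = set()

    def add(self, x: int) -> bool:
        fresh = x not in self.items
        self.items.add(x)
        return fresh

    def remove(self, x: int) -> bool:
        present = x in self.items
        self.items.discard(x)
        return present

    def snapshot(self) -> list[int]:
        return sorted(self.items)


class SortedListSet:
    """Implementation under test: a sorted list searched with bisect."""

    def __init__(self) -> None:
        self.items: list[int] = []

    def add(self, x: int) -> bool:
        i = bisect.bisect_left(self.items, x)
        if i < len(self.items) and self.items[i] == x:
            return False
        self.items.insert(i, x)
        return True

    def remove(self, x: int) -> bool:
        i = bisect.bisect_left(self.items, x)
        if i < len(self.items) and self.items[i] == x:
            del self.items[i]
            return True
        return False

    def snapshot(self) -> list[int]:
        return list(self.items)


def differential_test(steps: int = 2_000, seed: int = 0) -> int:
    """Apply the same random operations to both and compare after each step."""
    rng = random.Random(seed)
    model, impl = ModelSet(), SortedListSet()
    for step in range(steps):
        op = rng.choice(["add", "remove"])
        x = rng.randint(0, 20)
        expected = getattr(model, op)(x)
        actual = getattr(impl, op)(x)
        assert expected == actual, f"step {step}: {op}({x}) diverged"
        assert model.snapshot() == impl.snapshot(), f"step {step}: state diverged"
    return steps


checked = differential_test()
```

Any change that makes the implementation drift from the model fails at the first diverging step, and the seed makes that step reproducible.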

I hope this helps! Let me know if you have any other questions or anything else I can do to help out!

Alejandro · June 26, 2026, 11:18am

Thank you for the elaborate answer. I haven’t read Daniel Jackson’s book “The Essence of Software”; it seems the time to open it has come.

A question:

literate-style docs can be automated across large projects

What tools do you use for literate-style docs and what’s the workflow?

Alejandro · June 26, 2026, 8:33pm

I’ve started reading it, and I believe, while I haven’t learned the exact definitions yet, I understand what the author is trying to convey. The book is about finding good abstractions, which the author calls “concepts”. I’ll put this word in quotes to emphasize that it’s a defined term from the book.

Here’s a blog post on how the realization of “concepts” could be achieved in practice using state charts:

Basically, state charts are hard for humans to maintain because of boilerplate, but boilerplate is not a problem for agents. And state charts can be mapped to models one to one. So, the hardest part now would be to find good “concepts”.

geert56 · June 28, 2026, 2:05am

Very nice report. You provided so much material at once that it will take me a while to go through it all.
I am a big fan of formal verification (Floyd-Hoare-style), and I have also toyed with using Alloy as a model for LLMs to capture spec intent and help improve the code generation.

Alejandro · June 28, 2026, 9:22am

using Alloy as a model for LLMs to capture spec intent

Frankly, I struggle with reading models unless they are well explained, which means they convey intent mostly to agents.

After reading this blog post I wonder if the workflow could be extended a bit:

What if we use one of the following tools first to model state charts for humans, because these can be used for clicking through the states interactively? This helps a lot for getting the idea, and agents understand them as well. Then these state charts can be manually or semi-manually converted to Alloy models, which is pretty straightforward, and the models can be used for formal verification.

https://sketch.systems/new/

https://sketch.stately.ai

This adds the overhead of maintaining both a state chart and a model, but both are code, so they can be modified in sync, with readable and commented diffs.

Alejandro · June 28, 2026, 9:27am

I’ve asked above about the tools for literate programming. They would be useful for grouping relevant parts of state charts and models. And code, probably.

ohpauleez · June 28, 2026, 12:20pm

Thank you all for the comments! I’ll address some of the questions, but please ask follow-ups if you want more detail. It’s also important to note that all of this is automated. We drive the agent and a custom harness, but the bulk of the work is fully automated through a combination of closed-loop and open-loop feedback. Now, onto the questions:

I haven’t had much luck using Alloy as the only format to specify a system for code generation and we haven’t seen a measurable impact that Alloy (integrated into the spec) improves code generation. What we have seen (and measured) is that integrating Alloy finds defects in the spec, which then improves the overall system (and all the tooling downstream like code generation), and this is especially true as that system is being evolved. At this point, we think Alloy is an essential part to layer into the specification documents (which we usually do after code generation, but you can do it before code generation as well).

Regarding Statecharts, Alloy, and other formalisms:
I think Alloy is relatively easy to learn – I usually can teach someone the basics within a day, even if they’re a junior engineer. We have internal workshop docs we use, but we also point people at the very wonderful book, Practical Alloy (telling them to skip the “Advanced topics” sections until they need them). We also have a collection of Skills for use with agents (alloy for small contexts and alloy-more for larger contexts or complex problems). As an added benefit, these skills work great for learners as a “cheatsheet”: devbox/.opencode/skills/alloy/SKILL.md at main · ohpauleez/devbox · GitHub

The specification (and Alloy model) captures the domain, invariants, state machines, liveness+safety properties/claims, etc. In addition to that, the template we use for our design docs captures the state machines separately (complete with decision tables, invariants, and liveness+safety). It’s important that these diagrams are in a text format such that they can be easily analyzed (and used) by the LLMs, even though this design.md document is mostly for humans understanding the system. We use Mermaid (as the format) so we can get rendered diagrams when viewing the docs on GitHub. You can see an example of one state diagram here: devbox/docs/design.md at main · ohpauleez/devbox · GitHub

Learning how to write statecharts within Alloy is a very valuable skill for the working engineer. With the model in hand, you can get instant visualizations, ask “what if” scenarios with predicates, reproduce production bugs without having to manipulate systems directly, and more.
When your specification documents are enhanced with Alloy, you extend these capabilities to the agents, which makes them approachable (and automated) for all engineers, of all skill levels, across the entire company.

On “Changes” and their format:
A full Change in OpenSpec is made up of a proposal.md document (What and Why), a design.md (How), a spec.md (core specification), and tasks.md (instructions for the agent and an audit log of the implementation). When a change is completed, it’s archived in an ADR-like fashion: devbox/openspec/changes/archive/2026-06-05-devbox-core at main · ohpauleez/devbox · GitHub
The “living docs” (docs/design.md, openspec/specs/**/spec.md) are all updated with each Change and represent the current state of the world. Today, we only enrich the living specifications with Alloy.

For the literate programming tooling: Aside from the specs-with-Alloy, we steer the agents to produce literate comments within the code for complex operations. These are written as line comments now - we’re planning on making a tool like Clojure’s Marginalia (but language agnostic) for producing something a bit easier to read. As of right now, it’s not an immediate need (ie: it’s not a source of missing “evidence”), but it would be a nice-to-have.

State machines, model checking, and deductive verification:
As a general practice, we implement the core of our systems in a completely deterministic fashion, pushing all sources of non-determinism to the edges of our system (and passing those in as parameters/arguments to the core system). That core, deterministic system is usually built as a series of state machines that are composed together (if you’re coming from functional programming, this all seems very natural). When we add deductive verification (OpenJML, LemmaScript, Creusot), it’s only on an “executable model” or this deterministic core (on the state machines). We sometimes use bounded model checking if we have a system/constraints where that works better (JBMC, Kani). This deterministic core is essential, since it enables many of the techniques on our verification pyramid (differential testing, deterministic simulation testing, etc.). We drive the agent towards these implementation decisions using simple steering docs (like java_style.md, typescript_style.md, etc.), but there are other techniques to achieve this (like Midspiral’s lemmafit).
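Here is a minimal sketch of what “deterministic core, non-determinism passed in as arguments” can look like for a single state machine. The lease example, the `step` and `run` names, and the 30-second TTL are all hypothetical and chosen only to keep the example small.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    IDLE = "idle"
    LEASED = "leased"


@dataclass(frozen=True)
class Lease:
    state: State = State.IDLE
    expires_at: float = 0.0


def step(lease: Lease, event: str, now: float, ttl: float = 30.0) -> Lease:
    """Pure transition function: the same inputs always give the same output.

    `now` is supplied by the caller, so the core never reads a clock itself.
    """
    expired = lease.state is State.LEASED and now >= lease.expires_at
    if expired:
        lease = Lease()  # an expired lease behaves as IDLE
    if event == "acquire" and lease.state is State.IDLE:
        return Lease(State.LEASED, now + ttl)
    if event == "renew" and lease.state is State.LEASED:
        return Lease(State.LEASED, now + ttl)
    if event == "release":
        return Lease()
    return lease  # events that do not apply leave the state unchanged


def run(events: list[tuple[str, float]]) -> Lease:
    """Replay a recorded (event, timestamp) trace through the core."""
    lease = Lease()
    for event, now in events:
        lease = step(lease, event, now)
    return lease


# The lease from t=0 is renewed at t=10 (expiring at t=40), so by t=50 it
# has expired and the second "acquire" succeeds.
trace = [("acquire", 0.0), ("renew", 10.0), ("acquire", 50.0)]
final = run(trace)
```

Only the edge of the system would call something like `step(lease, event, time.time())`. Because the core takes the timestamp as data, the same trace can be replayed in a deterministic simulation, compared against a model in a differential test, or handed to a verifier.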

I hope this helps! Thanks for engaging and reach out with any more questions or suggestions!

Alejandro · June 28, 2026, 5:58pm

I’ve learned so much from this thread already, thank you.

Another question:

That core, deterministic system is usually built as a series of state machines that are composed together (if you’re coming from functional programming, this all seems very natural).

I’m familiar with FCIS, which is Functional Core, Imperative Shell.

But usually the functional core in the code I write and read is just computations (pure functions), not actions (functions with effects). Do you structure your computations as state machines as well? Any recommendations on what to read about this?

Or maybe you mean that you structure your imperative shell the same way, creating state machines that reify concepts? This is what I’d like to try, this idea seems very natural now indeed after reading the book.

Alejandro · June 28, 2026, 6:08pm

And thank you for mentioning the tools for deductive verification and bounded model checking directly in code, I’ve never heard of these.

Alejandro · June 29, 2026, 1:58pm

Sorry, I was confused. I’ve reread your messages and the part about functional core as a composition of state machines, and it clicked.

Yeah, reducing expressive power from Turing-complete to state charts everywhere makes a lot of sense, because this helps with managing complexity and we have very practical tools for verification in addition to testing.

I now finally understand what kind of harnesses you’re building, this is cool.
