Hi @DanielJackson,
Thanks for the interest! Hopefully this answers your question, but let me know if I missed the mark.
I’m always happy to chat with anyone here publicly or privately, and if there’s anything I can do to help anyone in this community, please don’t hesitate to reach out.
I’m weirdly connected to many of you (I’m just in MIT’s backyard, I work with a professor from UPorto and make many trips there during the year, I’ve been spending free time in Montpellier and will be making a trip to Toulouse soon, etc.).
I mentioned some high-level details in an older post of mine about evidence and dependability cases. My personal history with Alloy is quite long (including embedding Alloy into the Clojure programming language), but most recently I’ve been integrating Alloy as a tool across the entire software development lifecycle (SDLC) for “always on; never fail” globally distributed systems.
The problem domain
The core system is a SaaS platform upon which other companies build ad-tech solutions. You can think of the system as a multi-tenanted, geo-replicated database performing information retrieval, multi-objective optimization, applied ML, and probabilistic selection. Queries need to complete within a p95 latency of 50ms, with a minimum availability SLA of 99.999% (however, our realized availability is better than that contractual SLA). We have engineered the core data distribution system to be fully deterministic, as well as most parts of the ad decision engine (the component that handles the query load).
We use lightweight formal methods to engineer all these core systems. This is quite common in industry now, as seen in Using lightweight formal methods to validate a key-value storage node in Amazon S3, Emina Torlak’s evolution of the same idea in How we built Cedar: A verification-guided approach, how Equinix develops their internal network / distributed system backbone (as described in this conference talk), and others. Many here (including yourself) were pioneering this idea years earlier.
I’ll walk through a few of our key systems to illustrate our approach to lightweight formal methods and the different ways we use Alloy. As we’re engineering these systems, we make an explicit distinction between Requirements (“What is necessary for this system to be successful”) and Invariants (“What is necessary for this system to be correct”) – this is done to build a habit and culture focused on reasoning about correctness even as the systems get large and the domains get complex.
A common beginning
Most of the system designs start out the same. We do a project kick-off to walk through the known requirements and the initial set of invariants. We review any results from simulations or prototypes, talk through the initial design ideas, and do a quick risk assessment. From there the team works in a collaborative diagramming tool, informally drawing different views of the system, domain, internal structures, algorithms, etc. We draw “call-outs” of invariants in these diagrams.
We start using Alloy once we have a few sketches in hand, treating it mostly like a “design REPL.” We iterate quite quickly through a model, uncovering invariants as we go. Once we think the model is in a decent spot, we ‘turn it around’ and start asking questions of it. We have a habit of making everything a predicate, which lets us quickly compose “scenarios” during this “question asking” phase. The insights gained from the modeling are copied back into the diagramming tool and the models are (sadly) discarded (placed into a “pasture” directory within version control).
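To give a feel for the “everything is a predicate” habit, here’s a deliberately tiny sketch in Python rather than Alloy (the machine names and predicates are invented for illustration): each fact about a state is a small boolean function, a “scenario” is just a conjunction of those functions, and asking a question of the model means searching a bounded state space for a satisfying instance, much like Alloy’s `run` command within a scope.

```python
# A toy "predicates and scenarios" sketch. In Alloy these would be preds
# composed inside a run command; here they are plain boolean functions
# searched over a small, explicitly enumerated state space.
from itertools import product

MACHINES = ["ingest", "index", "decide"]  # hypothetical components

def state_space():
    # Every assignment of up/down to each machine (the bounded scope).
    for ups in product([True, False], repeat=len(MACHINES)):
        yield dict(zip(MACHINES, ups))

# Small, named predicates -- the building blocks.
def decide_up(s):    return s["decide"]
def some_failure(s): return not all(s.values())
def quorum(s):       return sum(s.values()) * 2 > len(MACHINES)

# A scenario is just a composition of predicates.
def scenario_degraded_but_serving(s):
    return decide_up(s) and some_failure(s) and quorum(s)

def run(pred):
    """Return a satisfying instance, or None (cf. Alloy's run)."""
    return next((s for s in state_space() if pred(s)), None)

instance = run(scenario_degraded_but_serving)
```

The payoff of this style, in Alloy proper, is that new questions are cheap: you compose existing predicates instead of rewriting constraints.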
From here, we implement a new executable model for each system component, in our target runtime (for example, we might implement the model in something that runs on the JVM and exposes static methods for functionality). This model is single-threaded, only operates on immutable values, and through a collection of different tools and techniques we show this model satisfies all of the necessary invariants. The model is maintained alongside the production system in source control by all the engineers working on the system.
It’s helpful to think of this model as an evolutionary, ideal prototype. With the initial implementation, the team uncovers problems, unknowns, and risks in realizing the concepts within a working system. As we progress further into the SDLC, the model becomes part of an oracle-like differential testing setup, where property-based tests exercise both the production implementation and the model and ensure the two agree.
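A minimal sketch of that oracle-style differential testing (in Python for brevity; our real models run on the JVM, and the toy counter domain and all names here are invented): the model is a pure function over immutable values, the production stand-in is stateful and optimized however it likes, and randomly generated operations are applied to both with an agreement check after every step.

```python
# Differential testing sketch: model as oracle vs. production implementation.
import random

def model_apply(state: tuple, op) -> tuple:
    # The executable model: a pure function over immutable values.
    kind, key = op
    d = dict(state)
    if kind == "inc":
        d[key] = d.get(key, 0) + 1
    elif kind == "del":
        d.pop(key, None)
    return tuple(sorted(d.items()))

class ProductionStore:
    # Stand-in for the optimized production component under test.
    def __init__(self):
        self._data = {}
    def apply(self, op):
        kind, key = op
        if kind == "inc":
            self._data[key] = self._data.get(key, 0) + 1
        elif kind == "del":
            self._data.pop(key, None)
    def snapshot(self) -> tuple:
        return tuple(sorted(self._data.items()))

def differential_test(seed: int, n_ops: int = 200) -> bool:
    # Property-based loop: random ops, applied to both, must stay in sync.
    rng = random.Random(seed)
    model_state, prod = (), ProductionStore()
    for _ in range(n_ops):
        op = (rng.choice(["inc", "del"]), rng.choice("abc"))
        model_state = model_apply(model_state, op)
        prod.apply(op)
        if model_state != prod.snapshot():  # the agreement check
            return False
    return True
```

In practice a property-based testing library handles the generation and shrinking; the essential shape is just this lock-step comparison against the model.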
Keep in mind, this is at the “component” level, which can be quite small, and we currently have a gap in our approach because there isn’t some higher-level “meta” model – this is why I want the evidence macro in Alloy (I believe that Alloy can and should span the entire SDLC; investments in the model should continue to deliver value throughout the life of the system).
As defects are discovered or radical changes in design are proposed, the team returns to the interactive diagramming tool and Alloy, and the process repeats. Anytime the team encounters a requirement or tweak that touches an invariant, it’s back to the diagrams and Alloy. But sadly, just like before, these models are tossed aside.
What we’re modeling: State machines
Most of our systems are engineered using an approach we internally call “channel-connected machines”: internally, all systems are built as a collection of share-nothing/share-little deterministic state machines, connected by channels that only convey values (or anything with value semantics) between machines. In the large, we integrate these systems in a distributed manner using the same approach, with some additional protocol enforcement, versioning, etc.
We take this approach because deterministic behavior and invariant enforcement (including consistency and versioning guarantees) are easier to reason about, the machines are quick and easy to visualize (statecharts, Parnas tables, etc.), and we have found it dramatically reduces the delivered-defect rate.
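Here’s a minimal sketch of the style (Python stand-in; the two machines and their tiny domain are invented for illustration): each machine is a deterministic step function from (state, input value) to (state, output values), channels are plain FIFO queues that carry only values, and nothing is shared. Because every step is deterministic, replaying the same inputs always yields the same outputs.

```python
# "Channel-connected machines" sketch: pure step functions + value channels.
from collections import deque

def parser(state, msg):
    # Machine 1: splits a line into tokens; its state counts lines seen.
    return state + 1, tuple(msg.split())

def counter(state, tokens):
    # Machine 2: keeps a running token total; emits the new total.
    total = state + len(tokens)
    return total, (total,)

def run_pipeline(lines):
    ch_in, ch_mid = deque(lines), deque()
    out = []
    s1, s2 = 0, 0
    while ch_in:
        s1, toks = parser(s1, ch_in.popleft())
        ch_mid.append(toks)          # only immutable values cross the channel
    while ch_mid:
        s2, emitted = counter(s2, ch_mid.popleft())
        out.extend(emitted)
    return out
```

The determinism is what makes the earlier model-vs-implementation comparison tractable: any interleaving question lives in the channels, not inside the machines.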
When we’re thinking through the construction of a specific component, we use Alloy to model the state machines and their associated invariants. Basically, it’s “Alloy as a CSP modeling tool,” which also lets us include important details about the business and data domain. Alloy traces illustrate how the machines interact in our various “scenarios” (and I’m still committed to making a ShiViz visualizer for these traces). I remain hopeful that more insights from the Dash+ work will migrate back into Alloy.
One of our systems is a “scheduler” which has a very tricky deterministic concurrency invariant – Alloy was critical in helping us understand how that needed to be enforced when being composed into a larger set of “channel-connected machines.”
What we’re modeling: Algorithms
The domain data has different representations across the different systems within the SaaS platform. As a simple example, customers interact with a fully normalized data model, but we denormalize the entire data model on our decision engines (where queries run).
Some of the transformations between representations can be conceptually challenging once versioning and consistency guarantees are added. We’ve used Alloy to model each of these data/domain representations (including versioning information) and how the transformations should happen. For example, we need to ensure the “happens-before” invariant holds when moving from a normalized representation to a denormalized representation (e.g., where maybe only one entity was changed on the normalized data model, but many entities need to change on the denormalized model, while still being versioned correctly for the consistency controls).
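To make the “happens-before” concern concrete, here’s a hypothetical Python sketch (the campaign/placement domain and the version scheme are invented, not our actual schema): a single change to one normalized entity fans out to many denormalized records, and every derived record must carry a version at least as new as the source change, so the consistency controls can order reads correctly.

```python
# Fan-out denormalization with a happens-before check on versions.

def denormalize(campaign, placements):
    # One normalized campaign row joins into one record per placement.
    return [
        {
            "placement": p["id"],
            "budget": campaign["budget"],
            # The derived version must dominate both inputs' versions.
            "version": max(campaign["version"], p["version"]),
        }
        for p in placements
    ]

campaign = {"id": "c1", "budget": 500, "version": 7}   # just updated
placements = [
    {"id": "p1", "version": 3},
    {"id": "p2", "version": 9},
]
records = denormalize(campaign, placements)

# Happens-before holds: no derived record predates the campaign update.
ok = all(r["version"] >= campaign["version"] for r in records)
```

In the Alloy models, this same property is stated as an invariant over the transformation relation rather than checked per record, which is what catches the subtle orderings long before implementation.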
I’m hopeful that insights from Peter Kriens’s PlusCal work are carried back into Alloy. Some of the “PlusCal” pieces are also quite useful when modeling state machines and transitions.
What we’re modeling: Protocols
At the heart of our system is a novel, cloud-native, causal consistency system, which supports partial geo-replication and uncoordinated read transactions. It is discussed in detail in Diana Freitas’ Master’s thesis (UPorto). This protocol is designed as a traditional replicated state machine (if you squint, it’s a variant of Viewstamped Replication) – basically a combination of “state machines” and “algorithms” from above. Correctness across our platform is dependent on the read-transaction design. The protocol also has an interesting (and fully extensible) data/domain model to express Deltas and Values (which was also part of our Alloy models).
Hopefully some of that is interesting or helpful for folks here. Let me know if anyone wants more details!