The diagrams always show the happy path. Request goes in, response comes out, maybe a queue in between. Three boxes and two arrows. The failure modes live in the whitespace.
Distributed systems work at demo time. In production, the third service fails after the first two have succeeded. The debit went through, the credit went through, the write to the audit log dropped on a socket timeout, and both sides of the transfer had moved with nothing recording that they had. The same row was about to be replayed the next morning. Someone noticed the mismatch six weeks later, at the end of the month, in a spreadsheet one of the ops people was maintaining by hand.
The plumbing – retries, timeouts, backoff, sagas, reconciliation, dead letter queues – is what turns a diagram into a system.
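To make the list above concrete, here is a minimal sketch of three of those pieces together: a bounded retry, exponential backoff, and a dead letter queue for whatever still fails. The names (`deliver_with_retry`, `send`, `dead_letters`) and the defaults are assumptions for illustration, not anything from a real system. It also assumes `send` is idempotent, since retrying a write that may already have landed is only safe if a duplicate does no harm.

```python
import time
from typing import Callable, List


def deliver_with_retry(
    send: Callable[[dict], None],
    record: dict,
    dead_letters: List[dict],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call send(record) up to `attempts` times, backing off between tries.

    If every attempt times out, the record is parked in `dead_letters`
    instead of being dropped, so a later reconciliation pass can find it.
    Returns True on success, False if the record was dead-lettered.
    """
    for attempt in range(attempts):
        try:
            send(record)
            return True
        except TimeoutError:
            # Back off only if another attempt is coming: 0.1s, 0.2s, 0.4s, ...
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    dead_letters.append(record)
    return False
```

The point of the dead letter list is the last two lines: a write that could not be delivered still leaves a trace somewhere a machine can read, rather than in a spreadsheet six weeks later.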
The first time I had this problem it was called integration middleware. The second time it was a service mesh. The next time it will have another name. The names have churn. The problem has tenure.
Start from the failure modes. The interface can wait.
The plumbing outlasts the abstractions.
