Error Handling, Retries, and Dead-Letter Queues

Why reliability is a design problem, not an accident

An integration engine that only works when every system is healthy is not really an interface — it is a demo. In production, networks drop, servers reboot during maintenance windows, and downstream applications reject content they do not like. A robust interface is one that keeps moving messages correctly across all of those failures, and — crucially — never silently loses one. This lesson assumes you already understand the v2 acknowledgment paradigm (AA/AE/AR carried in the MSA segment) and that messages are framed with MLLP over a TCP session. The job now is to use those mechanics to build something that survives the real world.

What can fail

Failures arrive in a few recognizable shapes, and a good interface distinguishes between them because each calls for a different response.

The connection is down. There is no MLLP/TCP session to the destination at all — the listener is offline, the port is blocked, or the network path is broken. No message can be sent until the link comes back ¹.
The receiver returns a negative acknowledgment. The session works and the message is delivered, but the MSA reports AE (application error) or AR (application reject). The receiver is telling you it could not accept this message as-is.
A transformation throws an error. Before the message even reaches the wire, the engine’s own mapping or scripting step fails — a missing field, a bad date, an unmappable code.
The downstream rejects the content. The message is syntactically fine and acknowledged, but the receiving application later refuses it on business grounds (an unknown patient identifier, a closed encounter).

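The first case is transient and heals itself once the link returns. The others usually point to a problem with the specific message, one that retrying alone will not fix, although a negative acknowledgment can occasionally reflect a temporary condition on the receiver.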
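The distinction between these failure shapes can be sketched as a small decision function. This is an illustrative sketch, not an engine's actual API: the names ack_code and next_step are hypothetical, the parser assumes the default "|" field separator and carriage-return segment terminators, and treating a missing or unrecognized acknowledgment as retryable is an assumption. A late business-level rejection is not visible at this point, so it is not modeled here.

```python
from typing import Optional


def ack_code(ack_message: str) -> Optional[str]:
    """Return MSA-1 (the acknowledgment code), or None if there is no MSA segment.

    Assumes the default '|' field separator and '\\r' segment terminators.
    """
    for segment in ack_message.split("\r"):
        fields = segment.split("|")
        if fields[0] == "MSA" and len(fields) > 1:
            return fields[1]
    return None


def next_step(connected: bool, ack: Optional[str], transform_failed: bool = False) -> str:
    """Map a delivery outcome to 'done', 'retry', or 'review' (hypothetical labels)."""
    if transform_failed:
        return "review"   # the engine's own mapping failed; resending will not help
    if not connected:
        return "retry"    # no session at all: transient, heals when the link returns
    code = ack_code(ack) if ack is not None else None
    if code == "AA":
        return "done"
    if code in ("AE", "AR"):
        return "review"   # the receiver could not accept this specific message
    return "retry"        # no usable ACK: assumed transient in this sketch


nak = "MSH|^~\\&|EHR|HOSP|ENGINE|HOSP|20240101120000||ACK|ACK0001|P|2.5\rMSA|AE|MSG00001\r"
print(next_step(connected=True, ack=nak))
```

A real interface might retry an AE a limited number of times before sending it for review, since an application error is sometimes temporary.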

Store-and-forward: the durability backbone

The single most important reliability mechanism is store-and-forward queuing. When the engine accepts a message, it first writes that message to a persistent queue — typically backed by a database or durable disk store — and only then attempts delivery. If the destination is unavailable, nothing is lost: the message simply waits in the queue and is delivered when the link recovers. This persistence is what lets an interface ride out a downed system without dropping data, and it is the backbone of reliable delivery ². Because each message is durably recorded before any send is attempted, a crash of the engine itself does not lose in-flight work — on restart, the queue is still there.
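As a minimal sketch of the persist-then-send ordering, the example below uses Python's built-in sqlite3 module as the durable store. The table layout and the function names open_queue, accept, and drain are assumptions made for illustration; a production engine would use its own queue implementation.

```python
import sqlite3
from typing import Callable


def open_queue(path: str) -> sqlite3.Connection:
    """Open (or create) the persistent outbound queue. Schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS outbound ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " control_id TEXT NOT NULL,"
        " payload TEXT NOT NULL,"
        " delivered INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()
    return conn


def accept(conn: sqlite3.Connection, control_id: str, payload: str) -> int:
    """Durably record a message BEFORE any delivery attempt is made."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO outbound (control_id, payload) VALUES (?, ?)",
            (control_id, payload),
        )
    return cur.lastrowid


def drain(conn: sqlite3.Connection, send: Callable[[str], bool]) -> int:
    """Try to deliver pending messages in order; stop at the first failure."""
    rows = conn.execute(
        "SELECT id, payload FROM outbound WHERE delivered = 0 ORDER BY id"
    ).fetchall()
    delivered = 0
    for row_id, payload in rows:
        if not send(payload):
            break  # destination unavailable: the message stays queued
        with conn:
            conn.execute("UPDATE outbound SET delivered = 1 WHERE id = ?", (row_id,))
        delivered += 1
    return delivered
```

Because accept commits before drain is ever called, a file-backed queue opened again after a restart still holds every undelivered row.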

Retries with backoff

When a send fails for a transient reason — the connection was down, or a NAK suggested a temporary condition — the engine retries. The naive approach, retrying immediately and continuously, hammers an already-struggling system and can keep it from recovering. The better pattern is backoff: wait a short interval, then progressively longer intervals between attempts (for example, a few seconds, then a minute, then several minutes). This gives a rebooting server room to come back while still delivering promptly once it does.
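The backoff pattern can be sketched as a delay generator plus a retry loop. The specific values (a 2-second base, a multiplier of 5, a 10-minute cap, six attempts) are illustrative assumptions, not recommendations from any standard, and send stands in for whatever delivery call the engine provides.

```python
import time
from typing import Callable, Iterator


def backoff_delays(base: float = 2.0, factor: float = 5.0, cap: float = 600.0) -> Iterator[float]:
    """Yield progressively longer waits: 2 s, 10 s, 50 s, 250 s, then capped at 600 s."""
    delay = base
    while True:
        yield delay
        delay = min(delay * factor, cap)


def send_with_backoff(
    send: Callable[[], bool],
    max_attempts: int = 6,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call send() until it succeeds or max_attempts is reached, waiting between tries."""
    delays = backoff_delays()
    for attempt in range(1, max_attempts + 1):
        if send():
            return True
        if attempt < max_attempts:
            sleep(next(delays))  # no wait after the final failed attempt
    return False
```

Passing sleep as a parameter is a design choice that keeps the loop testable without real waiting.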

The duplicate risk

Retries introduce a subtle hazard that every reliable interface must confront. Consider a message that was actually delivered and processed by the receiver, but whose acknowledgment was lost on the way back — the TCP session dropped after the receiver committed the data but before the ACK arrived. The sender never saw an ACK, so it retries, and the receiver now processes the message a second time. The result is a duplicate: a duplicate order, a duplicate result, a duplicate admission.

This is the central trade-off of retry logic: guaranteeing at-least-once delivery means you can deliver a message more than once. Retries must therefore be paired with idempotency or de-duplication on the receiving side. The standard handle for this in v2 is the message control id in MSH-10, which uniquely identifies each message; a receiver that records the control ids it has already processed can recognize a retried message and safely ignore the repeat. Reliable delivery is not “retry until it works” — it is “retry safely, so that a retry cannot corrupt the data.”
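A receiver-side de-duplication check keyed on MSH-10 might look like the following sketch. The class name DedupReceiver and the apply callback are hypothetical, the in-memory set is a simplification (a real receiver would need a durable record of processed control ids), and the parser assumes the default "|" field separator.

```python
from typing import Callable, Set


def control_id(message: str) -> str:
    """Return MSH-10 from an HL7 v2 message (assumes '|' as the field separator)."""
    msh = message.split("\r", 1)[0]
    fields = msh.split("|")
    if fields[0] != "MSH" or len(fields) < 10:
        raise ValueError("message does not start with a complete MSH segment")
    # MSH-1 is the separator itself, so MSH-n sits at index n-1 after splitting.
    return fields[9]


class DedupReceiver:
    """Applies each message at most once, however many times it is delivered."""

    def __init__(self, apply: Callable[[str], None]) -> None:
        self._apply = apply
        self._seen: Set[str] = set()  # assumption: in memory only, for illustration

    def receive(self, message: str) -> str:
        cid = control_id(message)
        if cid not in self._seen:
            self._apply(message)
            self._seen.add(cid)
        # A repeat is acknowledged too, so the sender stops retrying.
        return "AA"


applied = []
receiver = DedupReceiver(applied.append)
order = "MSH|^~\\&|LAB|HOSP|ENGINE|HOSP|20240101120000||ORU^R01|MSG00001|P|2.5\rPID|1\r"
receiver.receive(order)
receiver.receive(order)  # retry after a lost ACK
print(len(applied))
```

Acknowledging the repeat positively matters: if the receiver rejected it, the sender would keep retrying a message that has already been processed.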

Dead-letter queues

Some messages never succeed no matter how many times you retry: a transformation that throws on every attempt, or content the downstream will always reject. Leaving these in the main queue is dangerous — a single poison message can stall everything behind it (head-of-line blocking), and silently discarding it violates the no-loss rule. The answer is a dead-letter queue (DLQ): after a message exhausts its retries, the engine moves it out of the main flow and parks it in the DLQ for human investigation. The main queue keeps draining, and the failed message is preserved, visible, and recoverable — a parked-for-review backstop rather than a black hole.
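The parking behavior can be sketched with two in-memory queues. This is a simplified illustration under stated assumptions: the function name drain_with_dlq is hypothetical, send is assumed to fail only for message-specific reasons (a downed connection would be handled by backoff, not dead-lettering), and a real engine would persist both queues.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple


def drain_with_dlq(
    main: Deque[str],
    dead_letters: List[Tuple[str, str]],
    send: Callable[[str], bool],
    max_attempts: int = 3,
) -> int:
    """Deliver messages in order; park any that exhaust their attempts."""
    delivered = 0
    while main:
        message = main.popleft()
        last_error = "no positive acknowledgment"
        for _ in range(max_attempts):
            try:
                if send(message):
                    delivered += 1
                    break
            except Exception as exc:  # e.g. a transformation that throws
                last_error = repr(exc)
        else:
            # Retries exhausted: keep the message and the reason for a human to review.
            dead_letters.append((message, last_error))
    return delivered


def send(message: str) -> bool:
    if message == "poison":
        raise ValueError("unmappable code")
    return True


main = deque(["first", "poison", "second"])
dlq: List[Tuple[str, str]] = []
count = drain_with_dlq(main, dlq, send)
print(count, [m for m, _ in dlq])
```

The message behind the poison one is still delivered, which is the point: the failure is preserved for review without blocking the rest of the flow.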

Monitoring and alerting

All of these mechanisms degrade quietly unless someone is watching. A queue that is backing up, a connection that has been dead for an hour, a stream of repeated NAKs, or a DLQ that is filling are all signals that automation has reached its limit and a human needs to intervene before data falls badly behind. Effective interfaces emit alerts on these conditions so problems are caught in minutes, not discovered days later when a clinician notices missing results.
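The four signals named above can be turned into simple threshold checks. The sketch below is illustrative only: the InterfaceStats fields and every threshold value are assumptions, and real limits depend on the interface's normal volume.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InterfaceStats:
    """A snapshot of one interface's health (hypothetical fields)."""
    queue_depth: int
    seconds_disconnected: float
    consecutive_naks: int
    dlq_depth: int


def alerts(
    stats: InterfaceStats,
    max_queue_depth: int = 500,
    max_seconds_disconnected: float = 300.0,
    max_consecutive_naks: int = 5,
    max_dlq_depth: int = 0,
) -> List[str]:
    """Return one alert string per condition that exceeds its threshold."""
    found = []
    if stats.queue_depth > max_queue_depth:
        found.append(f"queue backing up: {stats.queue_depth} messages pending")
    if stats.seconds_disconnected > max_seconds_disconnected:
        found.append(f"connection down for {stats.seconds_disconnected:.0f} s")
    if stats.consecutive_naks > max_consecutive_naks:
        found.append(f"{stats.consecutive_naks} negative acknowledgments in a row")
    if stats.dlq_depth > max_dlq_depth:
        found.append(f"{stats.dlq_depth} messages in the dead-letter queue")
    return found


print(alerts(InterfaceStats(queue_depth=1200, seconds_disconnected=3600.0,
                            consecutive_naks=0, dlq_depth=2)))
```

Defaulting the DLQ threshold to zero reflects one possible policy, that any parked message deserves attention; some teams may prefer a higher limit.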

Tying it together

Reliability comes down to one promise: never silently lose a message. Persist every message before sending so a failure cannot erase it; retry safely with backoff and de-duplication so transient outages heal without creating duplicates; and escalate whatever cannot be delivered to a monitored dead-letter queue, so a human can act. Each mechanism covers a different failure mode, and together they turn a fragile pipe into a dependable interface.