Lesson 49 of 51 · Reliability and Troubleshooting

Diagnosing Lost and Duplicate Messages

The two complaints

Almost every interface incident an analyst is asked to investigate reduces to one of two reports. The first is “the message never arrived”: a clinician cannot see a result, an order, or an admission that the sending system insists it sent. The second is “we got it twice”: a duplicate appears in the receiving system, often a duplicate order or a doubled chart entry. Both are diagnosed the same way: not by guessing, but by methodically locating the point in the message’s journey where it stopped or was repeated [1]. Troubleshooting is a discipline of narrowing, and the model below is the frame for it.

A three-checkpoint model

Every message that should have moved passes — or fails to pass — three checkpoints.

  1. Sent. The source application actually emitted the message onto the wire.
  2. Received. The engine (and ultimately the destination) accepted the bytes at the transport layer. For v2 over MLLP, this is the framing level: the message was delimited correctly (start block 0x0B, end block 0x1C 0x0D) and the connection delivered it intact.
  3. Processed. The receiving application consumed the message and committed it. This is the application-acknowledgment level — the MSA segment of an application ACK reporting AA (accept) rather than AE/AR (error or reject).

The distinction between checkpoints 2 and 3 is the same MLLP-framing versus MSA-acknowledgment distinction covered earlier: a message can be received on the wire yet still be rejected or dropped by the application. The fault always lies between two adjacent checkpoints, and your job is to find which gap it fell into. Confirm sent, confirm received, confirm processed; the first checkpoint that fails tells you where to look.
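As a rough illustration of the split between checkpoints 2 and 3, the Python sketch below first checks MLLP framing and then reads MSA-1 from an acknowledgment. The function names are hypothetical, and the sketch assumes the default `|` field separator, carriage-return segment terminators, and original-mode acknowledgments (AA/AE/AR); it is not how any particular engine implements this.

```python
from typing import Optional

MLLP_START = b"\x0b"    # start-of-block byte (VT)
MLLP_END = b"\x1c\x0d"  # end-of-block bytes (FS, CR)


def unwrap_mllp(frame: bytes) -> Optional[bytes]:
    """Return the HL7 payload if the frame is correctly delimited, else None."""
    long_enough = len(frame) >= len(MLLP_START) + len(MLLP_END)
    if long_enough and frame.startswith(MLLP_START) and frame.endswith(MLLP_END):
        return frame[len(MLLP_START):-len(MLLP_END)]
    return None


def ack_code(ack_payload: bytes) -> Optional[str]:
    """Return MSA-1 (for example AA, AE, AR) from an ACK, or None if absent."""
    text = ack_payload.decode("latin-1")
    for segment in text.split("\r"):
        fields = segment.split("|")  # assumption: default field separator
        if fields[0] == "MSA" and len(fields) > 1:
            return fields[1]
    return None


def first_failed_checkpoint(sent: bool, framed_ok: bool,
                            msa1: Optional[str]) -> Optional[str]:
    """Name the first checkpoint that failed, or None if all three passed."""
    if not sent:
        return "sent"
    if not framed_ok:
        return "received"
    if msa1 != "AA":
        return "processed"
    return None


# Hypothetical ACK: framed correctly, but the application reported an error.
ack = (b"\x0bMSH|^~\\&|EHR|HOSP|LAB|HOSP|20240101120000||ACK|ACK0001|P|2.5\r"
       b"MSA|AE|MSG00001\r\x1c\x0d")
payload = unwrap_mllp(ack)
code = ack_code(payload) if payload is not None else None
print(first_failed_checkpoint(True, payload is not None, code))  # prints: processed
```

A missing ACK is treated the same as a negative one here, which is a deliberate simplification: in both cases the sender has no proof that the message was processed.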

Tracing a single message

The primary tool is the message control id in MSH-10, which the sending application assigns to identify each message uniquely. Take the control id (or, if the user cannot give it, search by patient identifier and a timestamp window) and follow that one message end to end: the sending system’s outbound log, the engine’s message browser, and the receiving system’s inbound log. Each hop either shows the control id or does not, and the first hop that is missing it is the failed checkpoint.
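The trace can be sketched in a few lines of Python. This is a minimal illustration, assuming each hop's log can be exported as a list of raw messages; `control_id` and `first_missing_hop` are hypothetical helper names, and the hop labels are placeholders for whatever your systems call their logs.

```python
import re
from typing import Iterable, List, Optional, Tuple


def control_id(raw: str) -> Optional[str]:
    """Return MSH-10 from a raw HL7 v2 message, or None if it cannot be read."""
    start = raw.find("MSH")
    if start < 0:
        return None
    msh = re.split(r"[\r\n]", raw[start:], maxsplit=1)[0]
    if len(msh) < 4:
        return None
    sep = msh[3]  # MSH-1 is the field separator character itself
    fields = msh.split(sep)
    # fields[0] is "MSH" and fields[1] is MSH-2, so MSH-n sits at fields[n - 1].
    if len(fields) > 9 and fields[9]:
        return fields[9]
    return None


def first_missing_hop(cid: str,
                      hops: List[Tuple[str, Iterable[str]]]) -> Optional[str]:
    """Walk the hops in order; return the first one that never saw the id."""
    for name, raw_messages in hops:
        seen = {control_id(message) for message in raw_messages}
        if cid not in seen:
            return name
    return None


# Hypothetical data: the message reached the engine but not the receiver.
msg = ("MSH|^~\\&|LAB|HOSP|EHR|HOSP|20240101120000||ORU^R01|MSG00001|P|2.5\r"
       "PID|1||12345\r")
hops = [
    ("sender outbound log", [msg]),
    ("engine message browser", [msg]),
    ("receiver inbound log", []),
]
print(first_missing_hop("MSG00001", hops))  # prints: receiver inbound log
```

One caveat worth checking before relying on this: if the engine rewrites MSH-10 on the outbound side, the id will differ between hops and the search has to use whatever correlation the engine records instead.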

When no single id is in hand, reconcile counts. Pull every message the sender claims to have sent in a time window and every message the receiver logged, then compare the two lists of control ids. A gap (sender shows 1,000, receiver shows 994) both proves a loss and hands you the six control ids to chase individually [1].
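The reconciliation itself is a set comparison. The sketch below assumes both sides can export their control ids for the same time window; `reconcile` is a hypothetical name and the ids are made-up sample data.

```python
from collections import Counter
from typing import Dict, Iterable, List


def reconcile(sent_ids: Iterable[str],
              received_ids: Iterable[str]) -> Dict[str, List[str]]:
    """Compare sender and receiver control ids for one time window."""
    sent, received = Counter(sent_ids), Counter(received_ids)
    return {
        # Sent but never logged by the receiver: candidates for a loss.
        "missing": sorted(set(sent) - set(received)),
        # Logged by the receiver but not in the sender's list.
        "unexpected": sorted(set(received) - set(sent)),
        # Logged by the receiver more than once: candidates for a duplicate.
        "received_more_than_once": sorted(
            cid for cid, count in received.items() if count > 1
        ),
    }


# Made-up example mirroring the text: 1,000 sent, 994 received.
sent = [f"MSG{i:05d}" for i in range(1000)]
lost = {"MSG00017", "MSG00230", "MSG00231", "MSG00512", "MSG00777", "MSG00999"}
received = [cid for cid in sent if cid not in lost]

report = reconcile(sent, received)
print(len(report["missing"]))  # prints: 6
```

Messages near the edges of the window can appear on one side only because of clock skew or in-flight delivery, so it helps to widen the receiver's window slightly before treating an id as lost.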

Lost messages: where they go

A “lost” message has usually not vanished; it stopped somewhere, and the logs say where:

  • Filtered out. A route rule decided the message was not relevant and dropped it by design. The engine log shows it arriving and being filtered — no error, just no forward. This is the easiest to mistake for a true loss.
  • Stuck in a queue. The message reached the engine but the outbound connection to the destination is down, so it sits in the queue waiting to be delivered. The message browser shows it queued, not sent; the destination log shows nothing.
  • Negatively acknowledged. The destination returned AE or AR — it received the message but rejected it (a validation failure, an unknown patient). Unless something reprocesses it, it stays failed. The engine log carries the rejection and often the error reason.
  • Transform error. The message hit an exception during transformation and never produced valid output to send. The engine log shows the error and the raw input.

The discriminator is how far the control id got: absent from the engine entirely points back toward the sender; present but filtered, queued, errored, or rejected each names a different fix.
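That decision can be written down as a small lookup. The status labels below are hypothetical; real engines use their own names, so treat this as a sketch of the reasoning to adapt rather than any product's API.

```python
from typing import Optional

# Hypothetical status labels; map your engine's real ones onto these.
NEXT_STEP = {
    "FILTERED": "Dropped by a route rule by design; review the filter condition.",
    "QUEUED": "Waiting on the outbound connection; check the destination link.",
    "REJECTED": "Destination returned AE or AR; read the error reason, fix, reprocess.",
    "ERROR": "Transformation raised an exception; inspect the raw input and the error.",
}


def diagnose(engine_status: Optional[str]) -> str:
    """Suggest where to look, given the status the engine shows for a control id.

    None means the control id does not appear in the engine at all.
    """
    if engine_status is None:
        return "Never reached the engine; look back toward the sender."
    return NEXT_STEP.get(
        engine_status.upper(),
        "Unrecognized status; read the raw log entry.",
    )


print(diagnose(None))
print(diagnose("queued"))
```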

Duplicate messages: causes

Duplicates almost always come from a retry or replay that lacked de-duplication:

  • Retry after a lost ACK. The destination processed the message but its acknowledgment never made it back, so the sender — following its reliability logic — resent. Without a check on the control id, the receiver commits it twice.
  • File or batch replay. An operator re-dropped a file, or a batch was reprocessed after a partial failure, re-sending messages that had already been delivered.
  • Two interfaces at once. During a migration, an old and a new interface version both run for a window and both deliver the same source message.
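The missing de-duplication check in the first case can be sketched as an idempotent receiver. This is a minimal illustration under stated assumptions: `DedupingReceiver` is a hypothetical class, the in-memory set stands in for what would need to be durable storage, and the key pairs the control id with the sending application because control ids are only unique per sender.

```python
from typing import List, Set, Tuple


class DedupingReceiver:
    """Commit each (sending application, control id) pair at most once."""

    def __init__(self) -> None:
        self._seen: Set[Tuple[str, str]] = set()
        self.committed: List[str] = []

    def handle(self, sending_app: str, control_id: str, raw: str) -> str:
        key = (sending_app, control_id)
        if key not in self._seen:
            self.committed.append(raw)  # stand-in for the real commit
            self._seen.add(key)
        # Acknowledge either way, so a sender that lost the first ACK stops
        # retrying without the message being committed a second time.
        return "AA"


receiver = DedupingReceiver()
raw = "MSH|^~\\&|LAB|HOSP|EHR|HOSP|20240101120000||ORU^R01|MSG00001|P|2.5\r"
receiver.handle("LAB", "MSG00001", raw)   # first delivery
receiver.handle("LAB", "MSG00001", raw)   # retry after a lost ACK
print(len(receiver.committed))  # prints: 1
```

This protects against retries and replays that reuse the original control id. It does not catch duplicates that arrive under a different id.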

Matching the duplicated control ids against the engine’s audit trail usually reveals which of these occurred: identical MSH-10 values usually point to a retry or replay, while distinct values on otherwise identical content usually point to two parallel feeds.
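One way to apply that rule to an export of received messages is sketched below. It assumes that "the same content" can be approximated by comparing every segment after MSH, which is a crude heuristic: two legitimately separate messages with identical bodies would be flagged too. The function names and sample messages are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def body_key(raw: str) -> str:
    """Everything after the MSH segment, used as a rough 'same content' key."""
    segments = [s for s in re.split(r"[\r\n]+", raw.strip()) if s]
    return "\r".join(segments[1:])


def classify_duplicates(
    messages: Iterable[Tuple[str, str]],
) -> List[Tuple[List[str], str]]:
    """Group (control id, raw message) pairs by body and label each duplicate group."""
    by_body: Dict[str, List[str]] = defaultdict(list)
    for cid, raw in messages:
        by_body[body_key(raw)].append(cid)

    findings = []
    for ids in by_body.values():
        if len(ids) < 2:
            continue  # not duplicated
        # A repeated control id suggests a retry or replay; distinct ids on
        # the same body suggest two feeds delivering the same source message.
        repeated = len(set(ids)) < len(ids)
        findings.append((ids, "retry or replay" if repeated else "parallel feeds"))
    return findings


def sample(cid: str, patient: str) -> Tuple[str, str]:
    """Build a made-up message for the example."""
    msh = f"MSH|^~\\&|LAB|HOSP|EHR|HOSP|20240101120000||ORU^R01|{cid}|P|2.5"
    return cid, f"{msh}\rPID|1||{patient}\rOBX|1|NM|GLU||5.4\r"


received = [sample("MSG1", "111"), sample("MSG1", "111"),
            sample("OLD7", "222"), sample("NEW7", "222"),
            sample("MSG9", "333")]
for ids, label in classify_duplicates(received):
    print(ids, label)
```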

Tooling and habits

The engine’s message browser and audit trail are where most of this work happens: search by control id, patient identifier, or timestamp; read the per-message status and any error attached to it [1]. A few habits make the difference between a clean diagnosis and a guess. Always preserve the original raw message before anything is reprocessed, because it is the evidence. Reproduce suspected faults in a test environment rather than on the live feed, so a fix can be confirmed without risking patient data. And record which checkpoint failed, so the next analyst who sees the same symptom starts where you left off.

References

  1. Tim Benson, Grahame Grieve. Principles of Health Interoperability: FHIR, HL7 and SNOMED CT. 4th ed. Springer. 2021.