Lesson 8 of 51 · Message Structure in Depth

Escaping, Encoding, and Character Sets

The Problem: When Data Looks Like Structure

An HL7 v2 message is plain text whose meaning depends entirely on a small set of delimiter characters declared in MSH-2 (^~\&) and the field separator |. The parser does not understand words; it understands positions. It walks the byte stream and splits on each delimiter it sees. This creates a hazard the moment real-world data happens to contain one of those same characters.

Consider an organization named Acme & Sons. The ampersand is the subcomponent separator. If that name is written literally into a field, a naive parser sees a boundary where the author intended a letter:

PID|1||1234^^^Acme & Sons|...

Here Acme becomes one subcomponent and Sons becomes another. The field has silently split. Worse, downstream systems may store only the first piece, truncating the value, or shift every following subcomponent out of position. A single literal character has corrupted the record without raising any error [1].
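The split can be reproduced in a few lines of Python. This is a minimal sketch, assuming the default delimiters; naive_split_field is a hypothetical helper written for this lesson, not part of any HL7 library.

```python
# Hypothetical helper: split one field the way a naive parser would,
# assuming the default delimiters ^ (component) and & (subcomponent).
def naive_split_field(field):
    return [component.split("&") for component in field.split("^")]


parts = naive_split_field("1234^^^Acme & Sons")
print(parts)     # [['1234'], [''], [''], ['Acme ', ' Sons']]
print(parts[3])  # the fourth component now holds two subcomponents
```

The organization name comes back as two separate strings, and nothing in the result indicates that they were ever one value.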

The Escape Mechanism

HL7 v2 solves this with the escape character, the third of the four encoding characters declared in MSH-2, normally a backslash (\). An escape sequence begins with the escape character, contains a code, and ends with the escape character again. When a parser encounters the escape character mid-data, it reads to the next escape character and translates the enclosed code back into the literal it represents. The bytes between two escape characters are never treated as delimiters [1].

The standard delimiter escape sequences map one-to-one onto the encoding characters:

\F\   field separator          |
\S\   component separator      ^
\T\   subcomponent separator   &
\R\   repetition separator     ~
\E\   escape character         \

Applying this to the earlier example, the safe encoding of Acme & Sons replaces the literal & with \T\:

PID|1||1234^^^Acme \T\ Sons|...

Now the parser sees one intact subcomponent. On the receiving side it reverses the substitution and restores Acme & Sons for storage or display. Note that \E\ must be processed carefully: because the escape character introduces every sequence, a literal backslash in data must itself be escaped, or the parser will misread everything that follows.
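The substitution in both directions can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: it hard-codes the default delimiters instead of reading them from MSH-1 and MSH-2, and the names escape_hl7 and unescape_hl7 are hypothetical.

```python
import re

# Assumption: default delimiters | ^ ~ \ & (a real parser reads them from MSH-1/MSH-2).
ESCAPE_FOR = {"|": "\\F\\", "^": "\\S\\", "&": "\\T\\", "~": "\\R\\", "\\": "\\E\\"}
LITERAL_FOR = {"F": "|", "S": "^", "T": "&", "R": "~", "E": "\\"}

# Any escape sequence: escape character, a code without backslashes, escape character.
SEQUENCE = re.compile(r"\\([^\\]*)\\")


def escape_hl7(text):
    """Replace each reserved character in outgoing data with its escape sequence."""
    # Character by character, so backslashes added by one sequence are never escaped again.
    return "".join(ESCAPE_FOR.get(ch, ch) for ch in text)


def unescape_hl7(text):
    """Restore literals from delimiter escapes in one left-to-right pass."""
    # Sequences with other codes (formatting, hex) are returned unchanged.
    return SEQUENCE.sub(lambda m: LITERAL_FOR.get(m.group(1), m.group(0)), text)


print(escape_hl7("Acme & Sons"))        # Acme \T\ Sons
print(unescape_hl7("Acme \\T\\ Sons"))  # Acme & Sons
```

Both functions work in a single pass on purpose. Running chained string replacements instead risks escaping the backslashes that an earlier replacement just inserted, which is the \E\ hazard described above.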

Formatting and Hexadecimal Escapes

Beyond delimiters, the escape mechanism also carries presentation hints and arbitrary characters. Formatting escapes affect how text is rendered rather than what it means. The highlighting pair \H\ and \N\ marks the start and end of highlighted (emphasized) text, with \N\ returning to normal rendering [1]:

OBX|1|TX|...||\H\Abnormal\N\ result follows|...

For characters that cannot be typed directly, the hexadecimal escape \Xdd...\ encodes one or more bytes as pairs of hex digits. For example, \X0D\ represents a carriage return, which could not otherwise appear inside a segment without being mistaken for the segment terminator:

NTE|1||Line one\X0D\Line two
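Decoding a hexadecimal escape can be sketched like this. The function name decode_hex_escapes is hypothetical, and the default of latin-1 for turning the decoded bytes into characters is an assumption; in practice the choice depends on the character set the message declares.

```python
import re

# Any escape sequence, and the shape of a hex code inside one (X plus hex digit pairs).
SEQUENCE = re.compile(r"\\([^\\]*)\\")
HEX_CODE = re.compile(r"X((?:[0-9A-Fa-f]{2})+)")


def decode_hex_escapes(text, encoding="latin-1"):
    """Replace \\Xdd...\\ sequences with the characters their bytes encode."""
    def restore(match):
        hex_code = HEX_CODE.fullmatch(match.group(1))
        if hex_code is None:
            return match.group(0)  # not a hex escape: leave the sequence as it is
        # Assumption: the bytes are interpreted in `encoding`.
        return bytes.fromhex(hex_code.group(1)).decode(encoding)
    return SEQUENCE.sub(restore, text)


comment = decode_hex_escapes("Line one\\X0D\\Line two")
print(repr(comment))  # 'Line one\rLine two'
```

The decoding has to happen after the message has been split into segments and fields. Decoding first would put a raw carriage return back into the stream, where it would be read as a segment terminator.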

Character Sets

Escaping handles reserved characters; character sets handle which characters are available at all. By default, an HL7 v2 message is interpreted using a basic ASCII repertoire, which covers unaccented English text but not names such as Müller, José, or 北京. To carry these, the message declares an alternate character set in MSH-18 [1].

MSH-18 names the encoding the sender used, for example UNICODE UTF-8 or an ISO-IR designation such as ISO IR87 for Japanese:

MSH|^~\&|SEND|HOSP|RECV|CLINIC|20260601||ADT^A01|1|P|2.5.1||||||UNICODE UTF-8
PID|1||5678||Müller^Hans
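A receiver can read MSH-18 before decoding the rest of the message, because the header up to that point is plain ASCII. The sketch below assumes a well-formed MSH segment terminated by a carriage return. The CODECS table is an illustrative subset that maps a few declared names to Python codec names, and declared_codec is a hypothetical helper.

```python
# Illustrative subset only: a few MSH-18 values mapped to Python codec names.
CODECS = {"ASCII": "ascii", "8859/1": "iso-8859-1", "UNICODE UTF-8": "utf-8"}


def declared_codec(raw, default="ascii"):
    """Return the Python codec for the character set named in MSH-18."""
    # The first segment ends at the first carriage return.
    header = raw.split(b"\r", 1)[0].decode("ascii", errors="replace")
    field_sep = header[3]       # MSH-1 is the character right after "MSH"
    repetition_sep = header[5]  # second character of MSH-2
    fields = header.split(field_sep)
    # MSH-1 is the separator itself, so MSH-n sits at index n - 1 after the split.
    msh18 = fields[17] if len(fields) > 17 else ""
    first = msh18.split(repetition_sep)[0].strip()
    if not first:
        return default
    return CODECS[first]  # KeyError for a set this sketch does not know


raw = (
    "MSH|^~\\&|SEND|HOSP|RECV|CLINIC|20260601||ADT^A01|1|P|2.5.1||||||UNICODE UTF-8\r"
    "PID|1||5678||Müller^Hans\r"
).encode("utf-8")

text = raw.decode(declared_codec(raw))
print(text.split("\r")[1])  # PID|1||5678||Müller^Hans
```

Raising on an unknown character set, rather than falling back to a default, is a deliberate choice in this sketch: a guessed decoding is exactly the silent failure the lesson warns about.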

The hard requirement is agreement. The character set is metadata, not magic: the receiver must actually decode the bytes using the set that MSH-18 declares. When sender and receiver disagree (one writes UTF-8 but the other reads a single-byte set such as ISO 8859-1), the result is mojibake, where Müller arrives as MÃ¼ller. A multi-byte character may also be split or truncated if a downstream system counts bytes as if they were characters, cutting a name short or storing an incomplete final glyph [1].
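Both failure modes are easy to reproduce. The following sketch uses only Python's built-in codecs; the two-byte limit stands in for a hypothetical fixed-width column in a downstream system.

```python
name = "Müller"
wire = name.encode("utf-8")  # what a UTF-8 sender puts on the wire

# Mismatch: the receiver reads the same bytes as a single-byte set.
print(wire.decode("latin-1"))  # MÃ¼ller

# Byte counting: 6 characters, but 7 bytes, because ü takes two bytes in UTF-8.
print(len(name), len(wire))    # 6 7

# Truncating at a byte limit can cut through the middle of a character.
clipped = wire[:2]             # b'M\xc3' keeps only the first half of ü
print(clipped.decode("utf-8", errors="replace"))  # M followed by U+FFFD
```

With strict decoding the clipped value raises UnicodeDecodeError instead, which is often the better outcome because the damage becomes visible.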

Why Correctness Matters

Escaping and character-set declaration are not cosmetic. Both protect the same invariant: the literal content a sender intends must arrive unchanged and in the right position. A single unescaped & can split a field and shift every value after it; a single mismatched character set can mangle every accented name in a feed. Because HL7 v2 parsers report no error when structure is merely misread, these failures are silent and propagate into permanent records. Encoding data correctly on the way out — and decoding faithfully on the way in — is what keeps the message’s structure and its data from being confused for one another.

References

  1. HL7 Standards - Section 1d: Version 2 (V2). HL7 International.