Ampersand character in OBX segment causing problems - HL7 formatting - html

There are HTML equivalents for ">" and "<" ("&gt;" and "&lt;") in the OBX-5 field, which is causing the Terser.get(..) method to only fetch the characters up to the ampersand character. The encoding characters in MSH-2 are "^~\&". Is Terser.get(..) failing because there's an encoding character in the OBX-5 field? Is there a way to change these entities to ">" and "<" easily?
Thanks a lot for your help.

Yes, it fails because the ampersand has been declared as the subcomponent separator, and the message you are trying to process is not valid -- it should not contain (unescaped) HTML character entities (&lt; and &gt;).
If you cannot influence how the incoming messages are encoded, you should preprocess the message before giving it to the Terser, replacing the illegal characters. I'm pretty sure HAPI cannot help you there.
In a valid HL7v2 message, the data type used in OBX-5 is determined by OBX-2. OBX-5 should only contain the characters and escape sequences allowed by the declared data type. < and > are among them (as long as they are not declared as separators in MSH-2).
The HL7 standard defines escape sequences for the separator and delimiter characters (e.g. \T\ is the escape sequence for the subcomponent separator).
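If it helps, here is a minimal preprocessing sketch in Java (HAPI's own language). The class and method names are mine, the Terser path is only illustrative, and it assumes the offending content always arrives literally as &lt; and &gt;:

import ca.uhn.hl7v2.HL7Exception;
import ca.uhn.hl7v2.model.Message;
import ca.uhn.hl7v2.parser.PipeParser;
import ca.uhn.hl7v2.util.Terser;

public class Hl7Preprocess {

    // Turn the HTML character entities back into the characters they stand for.
    // '<' and '>' are not HL7 delimiters (MSH-2 is ^~\&), so the cleaned message parses.
    // If a bare '&' can also occur as data, escape it as \T\ instead.
    static String preprocess(String rawMessage) {
        return rawMessage.replace("&lt;", "<").replace("&gt;", ">");
    }

    public static void main(String[] args) throws HL7Exception {
        String raw = "...";  // the ER7 message text received from the interface
        Message msg = new PipeParser().parse(preprocess(raw));
        String obx5 = new Terser(msg).get("/.OBX-5");  // illustrative Terser path
        System.out.println(obx5);
    }
}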

Related

Control characters in JSON string

The JSON specification states that the only control characters that must be escaped are those with codes from U+0000 through U+001F:
7. Strings
The representation of strings is similar to conventions used in the C family of programming languages. A string begins and ends with quotation marks. All Unicode characters may be placed within the quotation marks, except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).
The main idea of escaping is to avoid damaging the output when printing a JSON document or message on a terminal or on paper.
But there are other control characters, like [DEL] from the C0 set and the control characters from the C1 set (U+0080 through U+009F). Shouldn't they also be escaped in JSON strings?
From the JSON specification:
8. String and Character Issues
8.1. Character Encoding
JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32.
In UTF-8, all code points above 127 are encoded in multiple bytes. About half of those bytes are in the C1 control character range. So in order to avoid having those bytes in a UTF-8 encoded JSON string, all of those code points would need to be escaped. This effectively eliminates the use of UTF-8, and the JSON string might as well be encoded in ASCII. As ASCII is a subset of UTF-8, this is not disallowed by the standard. So if you are concerned with putting C1 control characters in the byte stream, just escape them, but requiring every JSON representation to use ASCII would be wildly inefficient in anything but an English environment.
UTF-16 and UTF-32 could not possibly be parsed by something that uses the C1 (or even C0) control characters, so the point is rather moot for those encodings.
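If you nevertheless want to be defensive and escape C1 (and C0) control characters when producing JSON yourself, \uXXXX escapes are always legal. A minimal hand-rolled sketch in Java (the names are mine; a real JSON library already handles the mandatory C0 range for you):

public class JsonControlEscape {

    // Escape a raw string for use inside a JSON string literal.
    // The quote, the backslash and C0 (U+0000-U+001F) must be escaped per the spec;
    // DEL (U+007F) and C1 (U+0080-U+009F) are escaped here purely as a defensive choice.
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"' || c == '\\') {
                out.append('\\').append(c);
            } else if (c <= 0x1F || (c >= 0x7F && c <= 0x9F)) {
                out.append(String.format("\\u%04X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("\"" + escape("a\u0085b\u007Fc") + "\"");  // prints "a\u0085b\u007Fc"
    }
}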

Do ampersands still need to be encoded in URLs in HTML5?

I learned recently (from these questions) that at some point it was advisable to encode ampersands in href parameters. That is to say, instead of writing:
...
One should write:
...
Apparently, the former example shouldn't work, but browser error recovery means it does.
Is this still the case in HTML5?
We're now past the era of draconian XHTML requirements. Was this a requirement of XHTML's strict handling, or is it really still something that I should be aware of as a web developer?
It is true that one of the differences between HTML5 and HTML4, quoted from the W3C Differences Page, is:
The ampersand (&) may be left unescaped in more cases compared to HTML4.
In fact, the HTML5 spec goes to great lengths describing actual algorithms that determine what it means to consume (and interpret) characters.
In particular, in the section on tokenizing character references from Chapter 8 in the HTML5 spec, we see that when you are inside an attribute, and you see an ampersand character that is followed by:
a tab, LF, FF, space, <, &, EOF, or the additional allowed character (a " or ' if the attribute value is quoted or a > if not) ===> then the ampersand is just an ampersand, no worries;
a number sign ===> then the HTML5 tokenizer will go through the many steps to determine if it has a numeric character entity reference or not, but note in this case one is subject to parse errors (do read the spec)
any other character ===> the parser will try to find a named character reference, e.g., something like &notin;.
The last case is the one of interest to you since your example has:
...
You have the character sequence
AMPERSAND
LATIN SMALL LETTER Y
EQUAL SIGN
Now here is the part from the HTML5 spec that is relevant in your case, because y is not a named entity reference:
If no match can be made, then no characters are consumed, and nothing is returned. In this case, if the characters after the U+0026 AMPERSAND character (&) consist of a sequence of one or more alphanumeric ASCII characters followed by a U+003B SEMICOLON character (;), then this is a parse error.
You don't have a semicolon there, so you don't have a parse error.
Now suppose you had, instead,
...
which is different because &eacute; is a named entity reference in HTML. In this case, the following rule kicks in:
If the character reference is being consumed as part of an attribute, and the last character matched is not a ";" (U+003B) character, and the next character is either a "=" (U+003D) character or an alphanumeric ASCII character, then, for historical reasons, all the characters that were matched after the U+0026 AMPERSAND character (&) must be unconsumed, and nothing is returned. However, if this next character is in fact a "=" (U+003D) character, then this is a parse error, because some legacy user agents will misinterpret the markup in those cases.
So there the = makes it an error, because legacy browsers might get confused.
Even though the HTML5 spec seems to go to great lengths to say "well, this ampersand does not begin a character entity reference, so there's no reference here", the fact that you might run into URLs containing named references (e.g., isin, part, sum, sub), which would result in parse errors, means that IMHO you're better off escaping your ampersands. But of course, you only asked whether the restrictions were relaxed in attributes, not what you should do, and it does appear that they have been.
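In practice that recommendation amounts to escaping '&' before you write it into an attribute value. A tiny sketch in Java (the helper name and example URL are mine, not from the question):

public class AttributeEscape {

    // Escape text for use inside a double-quoted HTML attribute value.
    // '&' must be replaced first so the entities produced are not re-escaped.
    static String escapeAttr(String value) {
        return value.replace("&", "&amp;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        // "copy" spells a named character reference followed by '=', which the rule
        // quoted above flags as a parse error when left unescaped inside an attribute.
        System.out.println(escapeAttr("foo.cgi?chapter=1&copy=3"));  // foo.cgi?chapter=1&amp;copy=3
    }
}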
It would be interesting to see what validators can do.

Too many characters in character literal while converting HTML Tag to Entity reference

I have generated an HTML tag through C# code. I am able to render it correctly in the text area. When I googled it, I found this. To render the HTML tags in the text area, we need to convert '<' and '>' into their HTML entity references ('&lt;' and '&gt;'). But when I try to replace them using String.Replace, it throws an error: Too many characters in character literal.
string psHtmlOutput="<html><body><table border='0' cellspacing='3' cellpadding='3'><tr><th> Name </th><th>DomainName</th><th>DomainType</th><th>Defualt</th></tr><tr><td>india.local</td><td>india.local</td><td>Authoritative</td><td>True</td></tr></table></body></html>";
psHtmlOutput.Replace('>','&gt;');
psHtmlOutput.Replace('<','&lt;');
Error: Too many characters in character literal
Please help; how can I proceed?
The String.Replace method has two overloads:
One that operates on Strings.
One that operates on Chars.
In C#, single quotation marks are used to specify Char literals. Because you have used single quotes, the compiler treats the arguments as Char literals for the second overload. However, your second argument is not a valid character literal, because &gt; is more than one character.
So if you actually want to replace the characters with strings, use the overload that takes strings. Note also that String.Replace returns a new string (strings are immutable), so you need to assign the result:
psHtmlOutput = psHtmlOutput.Replace(">", "&gt;");
psHtmlOutput = psHtmlOutput.Replace("<", "&lt;");

Why doesn't JSON data include special characters?

Why doesn't JSON data support special characters?
If JSON data includes special characters such as \r, /, \b, \t, you must transfer them, but why?
JSON supports all Unicode characters in strings. What do you mean by "transferring"?
Those characters need to be escaped because the JSON specification says so. For some characters the reason is simple -- for example, double quotes need to be escaped because a regular double quote ends the String value, so there would be no way to tell the end marker apart from a quote character in the content. For linefeeds the reason was probably to enforce the limitation that no String value spans multiple text lines; and for the other control characters, to avoid "invisible characters". This is similar to the escaping required by XML or CSV; all textual data formats require escaping, or prohibit the use of certain characters.
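For illustration, a JSON serializer applies exactly these escapes for you. A small sketch in Java using Jackson (the choice of library is mine; the thread does not name one):

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonEscapingDemo {
    public static void main(String[] args) throws JsonProcessingException {
        String raw = "He said \"hi\",\nthen left a \\ and a tab\there.";
        // The double quote, backslash and control characters come out escaped:
        System.out.println(new ObjectMapper().writeValueAsString(raw));
        // -> "He said \"hi\",\nthen left a \\ and a tab\there."
    }
}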

What does a ^ sign mean in a URL?

What is the meaning of a ^ sign in a URL?
I needed to crawl some link data from a webpage and I was using a simple handwritten PHP crawler for it. The crawler usually works fine; then I came to a URL like this:
http://www.example.com/example.asp?x7=3^^^^^select%20col1,col2%20from%20table%20where%20recordid%3E=20^^^^^
This URL works fine when typed in a browser but my crawler is not able to retrieve this page. I am getting an "HTTP request failed error".
^ characters should be encoded, see RFC 1738 Uniform Resource Locators (URL):
Other characters are unsafe because gateways and other transport agents are known to sometimes modify such characters. These characters are "{", "}", "|", "\", "^", "~", "[", "]", and "`".
All unsafe characters must always be encoded within a URL.
You could try URL encoding the ^ character.
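The crawler in the question is PHP, but the idea is the same in any language. A small sketch in Java (the URL is the one from the question; everything in it except the carets is already percent-encoded):

public class EncodeCaret {
    public static void main(String[] args) {
        String url = "http://www.example.com/example.asp"
                + "?x7=3^^^^^select%20col1,col2%20from%20table%20where%20recordid%3E=20^^^^^";
        // Replace only the unsafe '^' characters with their hex encoding %5E.
        String safe = url.replace("^", "%5E");
        System.out.println(safe);
        // For a value built from scratch, java.net.URLEncoder.encode(value, StandardCharsets.UTF_8)
        // would percent-encode '^' and every other unsafe character for you.
    }
}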
Based on the context, I'd guess they're a homespun attempt to URL-encode quote-marks.
Caret (^) is not a reserved character in URLs, so it should be acceptable to use as-is. However, if you're having problems, just replace it with its hex encoding %5E.
And yeah, putting raw SQL in the URL is like a big flashing neon sign reading "EXPLOIT ME PLEASE!".
Caret is neither reserved nor "unreserved", which makes it an "unsafe character" in URLs. They should never appear in URLs unencoded. From RFC2396:
2.2. Reserved Characters
Many URI include components consisting of or delimited by, certain special characters. These characters are called "reserved", since their usage within the URI component is limited to their reserved purpose. If the data for a URI component would conflict with the reserved purpose, then the conflicting data must be escaped before forming the URI.
reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","
The "reserved" syntax class above refers to those characters that are allowed within a URI, but which may not be allowed within a particular component of the generic URI syntax; they are used as delimiters of the components described in Section 3.
Characters in the "reserved" set are not reserved in all contexts. The set of characters actually reserved within any given URI component is defined by that component. In general, a character is reserved if the semantics of the URI changes if the character is replaced with its escaped US-ASCII encoding.
2.3. Unreserved Characters
Data characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include upper and lower case letters, decimal digits, and a limited set of punctuation marks and symbols.
unreserved = alphanum | mark
mark = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
Unreserved characters can be escaped without changing the semantics of the URI, but this should not be done unless the URI is being used in a context that does not allow the unescaped character to appear.
2.4. Escape Sequences
Data must be escaped if it does not have a representation using an unreserved character; this includes data that does not correspond to a printable character of the US-ASCII coded character set, or that corresponds to any US-ASCII character that is disallowed, as explained below.
The crawler may be using regular expressions to parse the URL and is therefore falling over because the caret (^) means beginning of line. I'm thinking these URLs are really bad practice since they expose the underlying database structure; whoever wrote this might want to consider serious refactoring!
HTH!