XML to HTML: Character entities encoding

I am doing XML to HTML transformation and I need to convert some character entities. My XML file has Unicode character references like &#xE8; which I need to convert to the corresponding HTML entity &egrave;. Other entities need to be converted correspondingly. Character mapping for each and every entity can be quite difficult as there are many.
I am using XSLT 2.0. My output method is xhtml. And currently I am getting the actual characters (in the above case è) in my HTML code. Need help. My Saxon Processor version is 9.1.0.5.

With normal XSLT processing, Saxon will simply use an XML parser like Xerces (or the version of Xerces the Sun/Oracle JRE comes with), and once the parser has done its work and Saxon operates on its tree model, there is no way to know whether the original input had a literal character like è, a decimal character reference like &#232;, or a hexadecimal one like &#xE8;.
And when serializing the result tree of a transformation you can of course use a character map to map characters to any representation you want, but that will then happen for any è in the result tree, not only for the ones resulting from hexadecimal character references in the input.
If you want to make sure all non-ASCII characters are serialized as character references then you need to use xsl:output encoding="US-ASCII".
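For illustration, a minimal sketch of the character-map approach (the map name here is invented, and you would add one xsl:output-character per entity you care about; the string value is written to the output verbatim, without escaping):

<xsl:character-map name="html-entities">
  <xsl:output-character character="&#xE8;" string="&amp;egrave;"/>
</xsl:character-map>

<xsl:output method="xhtml" encoding="US-ASCII" use-character-maps="html-entities"/>

With encoding="US-ASCII", any non-ASCII character not covered by the map is serialized as a numeric character reference instead.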
Saxon 9.1 also provides the saxon:character-representation output attribute (see http://saxonica.com/documentation9.1/extensions/output-extras/character-representation.html ) to control the format.
But I agree with the comments made: these days, having UTF-8 as the output encoding and simply using literal characters in the serialization of the result tree should not pose any problems.

Related

Does JSON to XML lose me anything?

We have a program that accepts as data XML, JSON, SQL, OData, etc. For the XML we use Saxon and its XPath support and that works fantastic.
For JSON we use the jsonPath library which is not as powerful as XPath 3.1. And jsonPath is a little squirrelly in some corner cases.
So... what if we convert the JSON we get to XML and then use Saxon? Are there limitations to that approach? Are there JSON constructs that won't convert to XML, like anonymous arrays?
The headline question: The json-to-xml() function in XPath 3.1 is lossless, except that by default, characters that are invalid in XML (such as NUL, or unpaired surrogates) are replaced by a SUB character -- you can change this behaviour with the option escape=true.
The losslessness has been achieved at some cost in convenience. For example, JSON property names are not translated to XML element or attribute names, but rather to values of the key attribute.
Lots of different people have come up with lots of different conversions of JSON to XML. As already pointed out, the XPath 3.1 and XSLT 3.0 specs have a lossless, round-tripping conversion with json-to-xml and xml-to-json that can handle any JSON.
There are simpler conversions that handle limited sets of JSON. The main problem is how to represent JSON property names that don't map to XML names: e.g. { "prop 1" : "value" } is represented by json-to-xml as <string key="prop 1">value</string>, while conversions that try to map the property name to an element or attribute name either fail to create well-formed XML (e.g. <prop 1>value</prop 1>) or have to escape the space in the element name (e.g. <prop_1>value</prop_1>, or with some hex representation of the space's Unicode code point inserted).
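To make that concrete, this is roughly what the standard conversion produces for that input (the elements are in the XPath functions namespace; indentation added):

json-to-xml('{ "prop 1" : "value" }')

<map xmlns="http://www.w3.org/2005/xpath-functions">
  <string key="prop 1">value</string>
</map>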
In the end I guess you want to select the property foo in { "foo" : "value" } with the plain path foo, which the simple conversion would give you; in XPath 3.1 you would need ?foo for the XDM map, or fn:string[@key = 'foo'] for the json-to-xml result format.
With { "prop 1" : "value" } the latter more or less stays the same, as fn:string[@key = 'prop 1'], while the ? approach needs to be changed to ?('prop 1') or .('prop 1'). Any conversion that has escaped the space in an element name requires you to change the path to e.g. prop_1.
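Spelled out as complete XPath 3.1 expressions (assuming the usual fn prefix binding to the XPath functions namespace), both of the following return the string "value":

parse-json('{ "prop 1" : "value" }')?('prop 1')
json-to-xml('{ "prop 1" : "value" }')/fn:map/fn:string[@key = 'prop 1']/string()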
There is no ideal way for all kinds of JSON, I think; in the end it depends on the JSON formats you expect and the willingness or time of users to learn a new selection/querying approach.
Of course you can use other JSON-to-XML conversions than json-to-xml and then use XPath 3.1 on whatever XML format they produce. I think that is what the oXygen folks opted for: they had a JSON-to-XML conversion before XPath 3.1 provided one and are mainly sticking with it, so in oXygen you can write "path" expressions against JSON, as under the hood the path is evaluated against an XML conversion of the JSON. I am not sure what effort it takes to indicate which values in the original JSON have been selected by XPath expressions against the XML format; that is probably not easy and straightforward.

Why does JSON encode UTF-16 surrogate pairs instead of Unicode code points directly?

To escape a code point that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".
ECMA-404: The JSON Data Interchange Format
I believe that there is no need to encode this character at all, so it could be represented directly as "𝄞". However, should one wish to encode it, it must, per spec, be encoded as "\uD834\uDD1E", not (as would seem reasonable) as "\u1d11e". Why is this?
One of the key architectural features of JSON is that JSON-encoded objects are valid Javascript literals that can be evaluated using the eval function, for example. Unfortunately, older Javascript implementations only support 16-bit Unicode escape sequences with four hex characters in string literals, so for code points above 0xFFFF there is no portable option other than UTF-16 surrogate pairs in escape sequences. (The \u{...} syntax that allows arbitrary code points was only introduced in ECMAScript 6.)
But as you mentioned, there's no need to use escape sequences if your application supports Unicode JSON text. Simply encode the characters directly in the respective Unicode format.
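A quick sketch of that equivalence in a modern JavaScript engine (ES6 or later for the \u{...} form):

// The surrogate-pair escape and the literal character decode to the same string:
JSON.parse('"\\uD834\\uDD1E"') === "𝄞";  // true
// ES6 string literals can name the code point directly:
"\u{1D11E}" === "\uD834\uDD1E";  // true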

How to allow jackson to treat \uXXXX as plain text?

I use Jackson to parse JSON data. Now I have a problem handling a \uXXXX issue.
The data I got here is like
{"UID":"here_\ud83d\udc3b"}
After I use ObjectMapper.readValue(jsonContent, UserId.class); to convert the JSON to an instance of UserId, the UID property is not literally "here_\ud83d\udc3b". Jackson converts \ud83d\udc3b to two chars holding the corresponding Unicode value.
My question is: is it possible to let Jackson skip this "Unicode transformation" and keep the literal value "\ud83d\udc3b" as it is?
No. JSON parsers are required to handle Unicode escapes to produce underlying Unicode characters.
When writing, on the other hand, some characters may also be encoded using similar Unicode escapes.
So if you need to use escaping, you need to re-encode such values yourself.
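For the writing direction, a minimal sketch assuming Jackson 2.x (the class name is made up; JsonGenerator.Feature.ESCAPE_NON_ASCII is the generator feature that forces \uXXXX escapes for all non-ASCII characters on output):

import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.databind.ObjectMapper;

public class EscapeDemo {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // Emit \uXXXX escapes for every character outside ASCII when writing.
        mapper.getFactory().configure(JsonGenerator.Feature.ESCAPE_NON_ASCII, true);
        // The emoji is written back out as its surrogate-pair escape form.
        System.out.println(mapper.writeValueAsString("here_\ud83d\udc3b"));
    }
}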

How to transform &nbsp; in XSLT

I have the following XSLT:
<span><xsl:text disable-output-escaping="yes"><![CDATA[&nbsp;Some text]]></xsl:text></span>
After transformation I get:
<span>&amp;nbsp;Some text</span>
which is rendered as: &nbsp;Some text
I want to render &nbsp; as a space character. I have also tried changing disable-output-escaping to no, but it didn't help.
thanks for help.
The other two answers are correct, but I decided to take a little broader view to this subject.
What everyone should know about CDATA sections
A CDATA section is just an alternative serialization form of an escaped XML string. This means that a parser produces the same result for <span><![CDATA[ a & b < 2 ]]></span> and <span> a &amp; b &lt; 2 </span>. XML applications work on the parsed data, so an XML application should produce the same output for both example input elements.
Briefly: escaped data and un-escaped data inside a CDATA section mean exactly the same.
In this case
<span><xsl:text disable-output-escaping="yes"><![CDATA[&nbsp;Some text]]></xsl:text></span>
is identical to
<span><xsl:text disable-output-escaping="yes">&amp;nbsp;Some text</xsl:text></span>
Note that the & character has been escaped to &amp; in the latter serialization form.
What everyone should know about disable-output-escaping
disable-output-escaping is a feature that concerns serialization only. In order to maintain the well-formedness of the serialized XML, XSLT processors escape & and < (and possibly other characters) by using entities. Their escaped forms are &amp; and &lt;. Escaped or not, the XML data is the same. The XSLT elements <xsl:value-of> and <xsl:text> can have a disable-output-escaping attribute, but it is generally advised to avoid using this feature. Reasons for this are:
An XSLT processor may produce only a result tree, which is passed on to another process without being serialized between the processes. In such a case disabling output escaping will fail because the XSLT processor is not able to control the serialization of the result tree.
An XSLT processor is not required to support the disable-output-escaping attribute. In such a case the processor must escape the output (or it may raise an error), so again, disabling output escaping will fail.
An XSLT processor must escape characters that cannot be represented as such in the encoding that is used for the document output. Using disable-output-escaping on such characters will result in an error or escaped text, so again, disabling output escaping will fail.
Disabling output escaping can easily lead to malformed or invalid XML, so using it requires great attention or post-processing of the output with non-XML tools.
disable-output-escaping is often misunderstood and misused, and the same result can usually be achieved in more regular ways, e.g. creating new elements as literals or with <xsl:element>, as sketched below.
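For example, instead of trying to emit markup through disable-output-escaping:

<xsl:text disable-output-escaping="yes">&lt;br/&gt;</xsl:text>

you can simply write the element as a literal result element, <br/>, or construct it with <xsl:element name="br"/>.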
In this case
<span><xsl:text disable-output-escaping="yes"><![CDATA[&nbsp;Some text]]></xsl:text></span>
should output
<span>&nbsp;Some text</span>
but the & character got escaped instead, so in this case disabling the output escaping seems to fail.
What everyone should know about using entities
If an XML document contains an entity reference, the entity must be declared; if not, the document is not valid. XML has only 5 pre-defined entities. They are:
&amp; for &
&lt; for <
&gt; for >
&quot; for "
&apos; for '
All other entity references must be defined either in an internal DTD of the document or in an external DTD that the document refers to (directly or indirectly). Therefore blindly adding entity references to an XML document might result in invalid documents. Documents with an (X)HTML DOCTYPE can use several entities (like &nbsp;) because the XHTML DTD refers to DTDs that contain their definitions. The entities are defined in these three DTDs: http://www.w3.org/TR/html4/HTMLlat1.ent , http://www.w3.org/TR/html4/HTMLsymbol.ent and http://www.w3.org/TR/html4/HTMLspecial.ent .
An entity reference does not always get replaced with its replacement text. This can happen, for example, if the parser has no network connection to retrieve the DTD. Also, non-validating parsers are not required to include the replacement text. In such cases the data represented by the entity is "lost". If the entity replacement works, there will be no sign in the parsed data model that the XML serialization had any entity references at all. The data model will be the same whether one uses entities or their replacement values. Briefly: entities are only an alternative way to represent the replacement text of the entity reference.
In this case the replacement text of &nbsp; is &#160; (which is the same as &#xA0; in hexadecimal notation). Instead of trying to output the entity, it will be easier and more robust to just use the solution suggested by @phihag. If you like the readability of the entity, you can follow the solution suggested by @Michael Krelin and define that entity in an internal DTD. After that, you can use it directly within your XSLT code.
Do note that in both cases the XSLT processor will output the literal non-breaking space character and not the entity reference or the &#160; character reference. Creating such references manually with XSLT 1.0 requires the use of the disable-output-escaping feature, which has its own problems, as stated above.
I think you should use &#160;, because the nbsp entity is likely to be not defined. And no CDATA.
One more possibility is to define an nbsp entity for your xsl file:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE xsl:stylesheet [
<!ENTITY nbsp "&#160;">
]>
<xsl:stylesheet version="1.0" …
In CDATA, all values are literal. You want:
<span><xsl:text>&#160;Some text</xsl:text></span>

JSON and escaping characters

I have a string which gets serialized to JSON in Javascript, and then deserialized to Java.
It looks like if the string contains a degree symbol, then I get a problem.
I could use some help in figuring out who to blame:
is it the Spidermonkey 1.8 implementation? (this has a JSON implementation built-in)
is it Google gson?
is it me for not doing something properly?
Here's what happens in JSDB:
js>s='15\u00f8C'
15°C
js>JSON.stringify(s)
"15°C"
I would have expected "15\u00f8C", which leads me to believe that Spidermonkey's JSON implementation isn't doing the right thing... except that the JSON homepage's syntax description (is that the spec?) says that a char can be
any-Unicode-character-except-"-or-\-or-control-character
so maybe it passes the string along as-is without encoding it as \u00f8... in which case I would think the problem is with the gson library.
Can anyone help?
I suppose my workaround is to use either a different JSON library, or manually escape strings myself after calling JSON.stringify() -- but if this is a bug then I'd like to file a bug report.
This is not a bug in either implementation. There is no requirement to escape U+00B0. To quote the RFC:
2.5. Strings
The representation of strings is similar to conventions used in the C family of programming languages. A string begins and ends with quotation marks. All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F).
Any character may be escaped.
Escaping everything inflates the size of the data: all code points can be represented in four or fewer bytes in every Unicode transformation format, whereas escaping them all makes each of them six or twelve bytes.
It is more likely that you have a text transcoding bug somewhere in your code and escaping everything in the ASCII subset masks the problem. It is a requirement of the JSON spec that all data use a Unicode encoding.
hmm, well here's a workaround anyway:
function JSON_stringify(s, emit_unicode)
{
    var json = JSON.stringify(s);
    return emit_unicode ? json : json.replace(/[\u007f-\uffff]/g,
        function(c) {
            return '\\u' + ('0000' + c.charCodeAt(0).toString(16)).slice(-4);
        }
    );
}
test case:
js>s='15\u00f8C 3\u0111';
15°C 3◄
js>JSON_stringify(s, true)
"15°C 3◄"
js>JSON_stringify(s, false)
"15\u00f8C 3\u0111"
This is SUPER late and probably not relevant anymore, but if anyone stumbles upon this answer, I believe I know the cause.
So the JSON-encoded string is perfectly valid with the degree symbol in it, as the other answer mentions. The problem is most likely in the character encoding that you are reading/writing with. Depending on how you are using Gson, you are probably passing it a java.io.Reader instance. Any time you create a Reader from an InputStream, you need to specify the character encoding, or a java.nio.charset.Charset instance (it's usually best to use java.nio.charset.StandardCharsets.UTF_8). If you don't specify a Charset, Java will use your platform default encoding, which on Windows is usually CP-1252.
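A minimal sketch of that fix (MyType and the surrounding class are placeholders; Gson's fromJson(Reader, Class) is the actual API):

import com.google.gson.Gson;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

class JsonReadDemo {
    // Parse JSON from a stream, forcing UTF-8 instead of the platform default encoding.
    static MyType readJson(InputStream in) throws IOException {
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            return new Gson().fromJson(reader, MyType.class);
        }
    }
}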