How to transform &nbsp; in XSLT - html

I have the following XSLT:
<span><xsl:text disable-output-escaping="yes"><![CDATA[&nbsp;Some text]]></xsl:text></span>
After transformation I get:
<span>&amp;nbsp;Some text</span>
which is rendered as: &nbsp;Some text
I want &nbsp; to be rendered as a space character. I have also tried changing disable-output-escaping to no, but it didn't help.
Thanks for your help.

The other two answers are correct, but I decided to take a somewhat broader view of this subject.
What everyone should know about CDATA sections
A CDATA section is just an alternative serialization form of an escaped XML string. This means that a parser produces the same result for <span><![CDATA[ a & b < 2 ]]></span> and <span> a &amp; b &lt; 2 </span>. XML applications work on the parsed data, so an XML application should produce the same output for both example input elements.
Briefly: escaped data and un-escaped data inside a CDATA section mean exactly the same thing.
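This equivalence is easy to verify with any XML parser; here is a quick sketch using Python's standard library (standing in for an XSLT processor's parser), comparing the two serializations from the paragraph above:

```python
import xml.etree.ElementTree as ET

# Both serializations carry the same parsed character data.
cdata_form = ET.fromstring("<span><![CDATA[ a & b < 2 ]]></span>")
escaped_form = ET.fromstring("<span> a &amp; b &lt; 2 </span>")

assert cdata_form.text == escaped_form.text == " a & b < 2 "
print(repr(cdata_form.text))  # ' a & b < 2 '
```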
In this case
<span><xsl:text disable-output-escaping="yes"><![CDATA[&nbsp;Some text]]></xsl:text></span>
is identical to
<span><xsl:text disable-output-escaping="yes">&amp;nbsp;Some text</xsl:text></span>
Note that the & character has been escaped to &amp; in the latter serialization form.
What everyone should know about disable-output-escaping
disable-output-escaping is a feature that concerns serialization only. In order to maintain the well-formedness of the serialized XML, XSLT processors escape & and < (and possibly other characters) using entities. Their escaped forms are &amp; and &lt;. Escaped or not, the XML data is the same. The XSLT elements <xsl:value-of> and <xsl:text> can have a disable-output-escaping attribute, but it is generally advised to avoid this feature. The reasons are:
An XSLT processor may produce only a result tree, which is passed on to another process without being serialized in between. In such a case disabling output escaping will fail, because the XSLT processor is not able to control the serialization of the result tree.
An XSLT processor is not required to support the disable-output-escaping attribute. In such a case the processor must escape the output (or it may raise an error), so again disabling output escaping will fail.
An XSLT processor must escape characters that cannot be represented as such in the encoding used for the document output. Using disable-output-escaping on such characters will result in an error or escaped text, so again disabling output escaping will fail.
Disabling output escaping can easily lead to malformed or invalid XML, so using it requires great care or post-processing of the output with non-XML tools.
disable-output-escaping is often misunderstood and misused, and the same result can usually be achieved in more standard ways, e.g. by creating new elements as literals or with <xsl:element>.
In this case
<span><xsl:text disable-output-escaping="yes"><![CDATA[&nbsp;Some text]]></xsl:text></span>
should output
<span>&nbsp;Some text</span>
but the & character got escaped instead, so in this case disabling the output escaping seems to fail.
What everyone should know about using entities
If an XML document contains an entity reference, the entity must be declared; if not, the document is not valid. XML has only 5 pre-defined entities. They are:
&amp; for &
&lt; for <
&gt; for >
&quot; for "
&apos; for '
All other entity references must be defined either in an internal DTD of the document or in an external DTD that the document refers to (directly or indirectly). Therefore blindly adding entity references to an XML document might result in invalid documents. Documents with an (X)HTML DOCTYPE can use several entities (like &nbsp;) because the XHTML DTD refers to DTDs that contain their definitions. The entities are defined in these three DTDs: http://www.w3.org/TR/html4/HTMLlat1.ent , http://www.w3.org/TR/html4/HTMLsymbol.ent and http://www.w3.org/TR/html4/HTMLspecial.ent .
An entity reference does not always get replaced with its replacement text. This can happen, for example, if the parser has no network connection to retrieve the DTD. Also, non-validating parsers are not required to include the replacement text. In such cases the data represented by the entity is "lost". If the entity replacement works, there will be no sign in the parsed data model that the XML serialization had any entity references at all. The data model is the same whether one uses entities or their replacement values. Briefly: entities are only an alternative way to represent the replacement text of the entity reference.
In this case the replacement text of &nbsp; is &#160; (which is the same as &#xA0; in hexadecimal notation). Instead of trying to output the entity, it is easier and more robust to just use the solution suggested by @phihag. If you like the readability of the entity, you can follow the solution suggested by @Michael Krelin and define that entity in an internal DTD. After that, you can use it directly within your XSLT code.
Do note that in both cases the XSLT processor will output the literal non-breaking space character, not the entity reference or the &#160; character reference. Creating such references manually with XSLT 1.0 requires the disable-output-escaping feature, which has its own problems as stated above.
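The equivalence of the decimal and hexadecimal notations can be checked with any XML parser; here is a small sketch using Python's standard library:

```python
import xml.etree.ElementTree as ET

# &#160; (decimal) and &#xA0; (hexadecimal) both name U+00A0,
# the non-breaking space, so they parse to identical text.
dec = ET.fromstring("<span>&#160;Some text</span>").text
hexa = ET.fromstring("<span>&#xA0;Some text</span>").text

assert dec == hexa == "\xa0Some text"
print(repr(dec))  # '\xa0Some text'
```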

I think you should use &#160;, because the &nbsp; entity is likely not to be defined. And no CDATA.
One more possibility is to define nbsp entity for your xsl file:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE xsl:stylesheet [
<!ENTITY nbsp "&#160;">
]>
<xsl:stylesheet version="1.0" …
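The effect of such an internal DTD can be demonstrated outside XSLT as well. This sketch uses Python's expat-based stdlib parser to show that once nbsp is declared in the internal subset, the reference parses to a plain U+00A0 character:

```python
import xml.etree.ElementTree as ET

# With nbsp declared in the internal DTD subset, &nbsp; is expanded
# by the parser to its replacement text, the U+00A0 character.
doc = '<!DOCTYPE r [<!ENTITY nbsp "&#160;">]><r>&nbsp;Some text</r>'
print(repr(ET.fromstring(doc).text))  # '\xa0Some text'
```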

In CDATA, all values are literal. You want:
<span><xsl:text>&#160;Some text</xsl:text></span>

Related

Does JSON to XML lose me anything?

We have a program that accepts XML, JSON, SQL, OData, etc. as data. For the XML we use Saxon and its XPath support, and that works fantastically.
For JSON we use the jsonPath library, which is not as powerful as XPath 3.1. And jsonPath is a little squirrelly in some corner cases.
So... what if we convert the JSON we get to XML and then use Saxon? Are there limitations to that approach? Are there JSON constructs that won't convert to XML, like anonymous arrays?
The headline question: The json-to-xml() function in XPath 3.1 is lossless, except that by default, characters that are invalid in XML (such as NUL, or unpaired surrogates) are replaced by a SUB character -- you can change this behaviour with the option escape=true.
The losslessness has been achieved at some cost in convenience. For example, JSON property names are not translated to XML element or attribute names, but rather to values of the key attribute.
Lots of different people have come up with lots of different conversions of JSON to XML. As already pointed out, the XPath 3.1 and XSLT 3.0 specs have a lossless, round-tripping conversion with json-to-xml and xml-to-json that can handle any JSON.
There are simpler conversions that handle limited sets of JSON; the main problem is how to represent JSON property names that don't map to XML names. For example, { "prop 1" : "value" } is represented by json-to-xml as <string key="prop 1">value</string>, while conversions trying to map the property name to an element or attribute name either fail to create well-formed XML (e.g. <prop 1>value</prop 1>) or have to escape the space in the element name (e.g. <prop_1>value</prop_1>, or with some hex representation of the Unicode code point of the space inserted).
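To make the key-attribute design concrete, here is a minimal, simplified sketch of that style of mapping in Python (stdlib only). Note this is an illustration, not the real fn:json-to-xml, which additionally puts the elements in the http://www.w3.org/2005/xpath-functions namespace and has escaping options omitted here:

```python
import json
import xml.etree.ElementTree as ET

def json_value_to_xml(value, key=None):
    """Map a parsed JSON value to a json-to-xml style vocabulary:
    element names reflect the value's type, and property names go
    into a "key" attribute rather than into the element name."""
    if isinstance(value, dict):
        elem = ET.Element("map")
        for k, v in value.items():
            elem.append(json_value_to_xml(v, key=k))
    elif isinstance(value, list):
        elem = ET.Element("array")
        for v in value:
            elem.append(json_value_to_xml(v))
    elif isinstance(value, bool):          # must test bool before int
        elem = ET.Element("boolean")
        elem.text = "true" if value else "false"
    elif isinstance(value, (int, float)):
        elem = ET.Element("number")
        elem.text = str(value)
    elif value is None:
        elem = ET.Element("null")
    else:
        elem = ET.Element("string")
        elem.text = value
    if key is not None:
        elem.set("key", key)
    return elem

tree = json_value_to_xml(json.loads('{"prop 1": "value"}'))
print(ET.tostring(tree, encoding="unicode"))
# <map><string key="prop 1">value</string></map>
```

The property name "prop 1" survives unchanged in the key attribute, which is exactly why the mapping stays lossless where element-name-based conversions fail.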
In the end I guess you want to select the property foo in { "foo" : "value" } as foo, which the simple conversion would give you; in XPath 3.1 you would need ?foo for the XDM map or fn:string[@key = 'foo'] for the json-to-xml result format.
With { "prop 1" : "value" } the latter form simply becomes fn:string[@key = 'prop 1'], while the ? approach needs to be changed to ?('prop 1') or .('prop 1'). Any conversion that has escaped the space in an element name requires you to change the path accordingly, e.g. to prop_1.
There is no ideal way for all kinds of JSON, I think; in the end it depends on the JSON formats you expect and the willingness or time of users to learn a new selection/querying approach.
Of course you can use JSON to XML conversions other than json-to-xml and then use XPath 3.1 on any XML format; I think that is what the oXygen folks opted for. They had a JSON to XML conversion before XPath 3.1 provided one and have mainly stuck with it, so in oXygen you can write "path" expressions against JSON, as under the hood the path is evaluated against an XML conversion of the JSON. I am not sure how much effort it takes to indicate which JSON values in the original JSON have been selected by XPath path expressions in the XML format; that is probably not easy or straightforward.

What do data types in HTML5 mean

I didn't get what it really means when someone refers to data types in HTML5.
I googled it, and found http://www.w3.org/TR/html-markup/datatypes.html
It says,
data types (microsyntaxes) that are referenced by attribute
descriptions
But now I'm even more confused about what it means by microsyntaxes.
Wikipedia says:
[...] the syntax of a computer language is the set of rules that defines the combinations of symbols that are considered to be a correctly structured document or fragment in that language. This applies both to programming languages, where the document represents source code, and markup languages, where the document represents data.
So in order for an HTML document to be read and understood by a browser, it should adhere to the syntax of HTML: That is, it should follow the rules that define the language. A microsyntax is essentially a very small syntax, applying to a very specific thing.
A data type is simply a type of data. The HTML specifications refer to various data types (e.g. String, Token, Integer, Date, Set of comma-separated strings, etc) and the document you linked describes exactly what those things are. It does this by defining a set of rules, or a microsyntax.
E.g. the microsyntax which defines a Set of comma-separated strings is:
Zero or more strings that are themselves each zero or more characters, each optionally with leading and/or trailing space characters, and each separated from the next by a single "," (comma) character. Each string itself must not begin or end with any space characters, and each string itself must not contain any "," (comma) characters.
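As an illustration, the quoted microsyntax is small enough to implement in a few lines. A sketch in Python, using HTML's space characters (space, tab, LF, FF, CR) for the stripping step:

```python
# Parse HTML's "set of comma-separated strings" microsyntax:
# split on commas, then strip optional leading/trailing space
# characters (space, tab, LF, FF, CR) from each string.
HTML_SPACE = " \t\n\f\r"

def parse_comma_separated(value):
    return [item.strip(HTML_SPACE) for item in value.split(",")]

print(parse_comma_separated(" alpha , beta ,gamma"))
# ['alpha', 'beta', 'gamma']
```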

XML to HTML: Character entities encoding

I am doing an XML to HTML transformation and I need to convert some character entities. My XML file has Unicode character references like &#x00E8; which I need to convert to the corresponding HTML entity &egrave;. Other entities also need to be converted respectively. Character mapping for each and every entity can be quite difficult, as there are many.
I am using XSLT 2.0. My output method is xhtml. And currently I am getting the actual characters (in the above case è) in my HTML code. Need help. My Saxon processor version is 9.1.0.5.
With normal XSLT processing, Saxon will simply use an XML parser like Xerces (or the version of Xerces the Sun/Oracle JRE ships with), and once the parser has done its work and Saxon operates on its tree model, there is no way to know whether the original input had a literal character like è, a decimal character reference like &#232;, or a hexadecimal one like &#xE8;.
And when serializing the result tree of a transformation you can of course use a character map to map characters to any representation you want, but that will then happen for any è in the result tree, not only for those resulting from hexadecimal character references in the input.
If you want to make sure all non-ASCII characters are serialized as character references then you need to use xsl:output encoding="US-ASCII".
Saxon 9.1 also provides http://saxonica.com/documentation9.1/extensions/output-extras/character-representation.html to control the format.
But I agree with the comments made: these days, using UTF-8 as the output encoding and simply emitting literal characters in the serialization of the result tree should not pose any problems.
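The same principle can be seen in other XML toolchains. Here it is in Python's standard library (not Saxon): after parsing, a literal è and &#232; are indistinguishable, and the chosen output encoding decides whether a character reference is emitted.

```python
import xml.etree.ElementTree as ET

parsed = ET.fromstring("<p>&#232;</p>")
assert parsed.text == "è"  # the reference is gone after parsing

# A US-ASCII output encoding cannot represent è, so the serializer
# falls back to a character reference; UTF-8 keeps the literal char.
print(ET.tostring(parsed, encoding="us-ascii"))  # b'<p>&#232;</p>'
print(ET.tostring(parsed, encoding="utf-8"))     # b'<p>\xc3\xa8</p>'
```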

XSLT 2.0 replace function returns same string when comparing special html characters

I am writing an XSLT to produce HTML for HTML Help Viewer 1.0. Some of the titles contain the &lt; and &gt; sequences. This causes loads of problems with the viewer, as it converts them back to angle brackets. Having read online that using their numeric Unicode versions will work (http://mshcmigrate.helpmvp.com/faq/notes10), I tried to use the replace function to do this, but the result is the same as the input.
The code I am using is:
replace(replace(/*/name, '&lt;', '&#60;'), '&gt;', '&#62;')
The input:
DocumentationTest.GenericClass&lt;T&gt; Namespace
Outputs as:
DocumentationTest.GenericClass&lt;T&gt; Namespace
How can I perform a string replace to get this output?
DocumentationTest.GenericClass&#60;T&#62; Namespace
&lt; and &#60; are two different representations of the same character. You can use replace() (or more simply, translate()) to change one character to another, but you can't use it to control how that character is represented in the serialized output - that's entirely up to the serializer. You can influence it, however, using character maps - see xsl:character-map.
This is correct behaviour on the part of the XML parser itself, which runs before XSLT gets hold of the data. If you must maintain the &lt; in the document, then it should be encoded as &amp;lt;, and a simple find-and-replace on all & symbols, replacing them with &amp;, should do the trick.
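The two answers above can be illustrated outside XSLT too. This Python sketch (stdlib, standing in for the XSLT pipeline) shows why the replace appears to do nothing, and what happens when the replacement string itself contains a reference:

```python
import xml.etree.ElementTree as ET

# By the time any XML-aware tool sees the title, &lt; and &gt; have
# already been parsed into plain "<" and ">" characters.
title = ET.fromstring("<name>GenericClass&lt;T&gt; Namespace</name>").text
assert title == "GenericClass<T> Namespace"

# Replacing "<" with the six-character string "&#60;" does not create
# a character reference: the serializer escapes the "&" again.
elem = ET.Element("name")
elem.text = title.replace("<", "&#60;").replace(">", "&#62;")
print(ET.tostring(elem, encoding="unicode"))
# <name>GenericClass&amp;#60;T&amp;#62; Namespace</name>
```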

Which are the valid control characters in HTML/XHTML forms

I'm trying to create a form validation unit that, in addition to "regular" tests, checks the encoding as well.
According to this article http://www.w3.org/International/questions/qa-forms-utf-8 the allowed control characters in the range 0-31 are CR, LF and TAB; DEL=127 is not allowed.
On the other hand, there are control characters in the range 0x80-0xA0. Different sources disagree on whether they are allowed, and I have also seen that this differs between XHTML, HTML and XML.
Some articles say that FF (form feed) is allowed as well.
Can someone provide a good answer, with sources, on what is allowed and what isn't?
EDIT: Even here http://www.w3.org/International/questions/qa-controls there is some ambiguity. It says:
The C1 range is supported
But its table shows that they are illegal, while the previously shown UTF-8 validation allows them?
I think you're looking at this the wrong way around. The resources you link specify what encoded values are valid in (X)HTML, but it sounds like you want to validate the "response" from a web form — as in, the values of the various form controls, as passed back to your server. In that case, you shouldn't be looking at what's valid in (X)HTML, but what's valid in the application/x-www-form-urlencoded, and possibly also multipart/form-data, MIME types. The HTML 4.01 standards for <FORM> elements clearly states that for application/x-www-form-urlencoded, "Non-alphanumeric characters are replaced by '%HH'":
This is the default content type. Forms submitted with this content type must be encoded as follows:
Control names and values are escaped. Space characters are replaced by '+', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by '%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., `%0D%0A').
The control names/values are listed in the order they appear in the document. The name is separated from the value by '=' and name/value pairs are separated from each other by '&'.
As for what character encoding is contained, (i.e. whether %A0 is a non-breaking space or an error), that's negotiated by the accept-charset attribute on your <FORM> element and the response's (well, really a GET or POST request) Content-Type header.
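For reference, the escaping rules quoted from the HTML 4.01 standard are what stdlib form-encoding helpers implement; e.g. in Python:

```python
from urllib.parse import urlencode

# application/x-www-form-urlencoded: spaces become "+", reserved
# characters become %HH escapes, and line breaks travel as %0D%0A.
form = {"comment": "first line\r\nsecond & third"}
print(urlencode(form))  # comment=first+line%0D%0Asecond+%26+third
```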
Postel's Law: Be conservative in what you do; be liberal in what you accept from others.
If you're generating documents for others to read, you should avoid/escape all control characters, even if they're technically legal. And if you're parsing documents, you should endeavor to accept all control characters even if they're technically illegal.
The Unicode characters in these ranges are valid in HTML 4.01:
0x09..0x0A
0x0D
0x20..0x7E
0x00A0..0xD7FF
0xE000..0x10FFFF
In XHTML 1.0... it's unclear. See http://cmsmcq.com/2007/C1.xml#o127626258
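A validity check against the HTML 4.01 ranges listed above is straightforward to sketch. The ranges are taken verbatim from the list; whether you want to be this strict on input is a policy decision:

```python
# Code point ranges valid in HTML 4.01, as listed above.
VALID_HTML401_RANGES = [
    (0x09, 0x0A),
    (0x0D, 0x0D),
    (0x20, 0x7E),
    (0x00A0, 0xD7FF),
    (0xE000, 0x10FFFF),
]

def is_valid_html_char(ch):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in VALID_HTML401_RANGES)

print(is_valid_html_char("\t"))    # True: HT is allowed
print(is_valid_html_char("\x7f"))  # False: DEL
print(is_valid_html_char("\x85"))  # False: C1 control (NEL)
```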
First of all, any octet is valid. The regular expression mentioned for UTF-8 sequences just omits some of them, as they are rather uncommon in practice to be entered by a user. But that doesn't mean they are invalid; they are just not expected to occur.
The first link you mention does not have anything to do with validating the allowed characters in XHTML... the example on that link simply shows a common/generic pattern for detecting whether or not raw data is UTF-8 encoded.
This is a quote from the second link:
HTML, XHTML and XML 1.0 do not support the C0 range, except for HT (Horizontal Tabulation) U+0009, LF (Line Feed) U+000A, and CR (Carriage Return) U+000D. The C1 range is supported, i.e. you can encode the controls directly or represent them as NCRs (Numeric Character References).
The way I read this is:
Any control character in the C1 range is supported if you encode them (using base64, or Hex representations) or represent them as NCRs.
Only U+0009, U+000A, and U+000D are supported in the C0 range. No other control code in that range can be represented.
If the document is known to be XHTML, then you should just load it and validate it against the schema.
What programming language do you use? At least for Java there exist libraries to check the encoding of a string (or byte array). I guess similar libraries exist for other languages too.
Do I understand your question correctly: you want to check whether the data submitted by a form is valid, and properly encoded?
If so, why do several things at once? It would be a lot easier to separate those checks, and perform them step by step, IMHO.
You want to check that the submitted form data is correctly encoded (in UTF-8, I gather). As Archchancellor Ridcully says, that's easy to check in most languages.
Then, if the encoding is correct, you can check whether it's valid form data.
Then, if the form data is valid, you can check whether the data contains what you expect.
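Sketching the first of those steps (checking that the submitted bytes really are UTF-8) in Python:

```python
def decode_form_bytes(raw):
    """Step 1: verify the submitted bytes are valid UTF-8.
    Returns the decoded text, or None if the encoding check fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None

print(decode_form_bytes(b"caf\xc3\xa9"))  # café
print(decode_form_bytes(b"caf\xe9"))      # None (Latin-1 bytes, not UTF-8)
```

With the encoding verified, the later steps (structural validity, then business rules) can work on the decoded string.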