I have some XML text that I wish to render in an HTML page. This text contains an ampersand, which I want to render in its entity representation: &amp;.
How do I escape this ampersand in the source XML? I tried &amp;, but this is decoded as the actual ampersand character (&), which is invalid in HTML.
So I want to escape it in such a way that it will be rendered as &amp; in the web page that uses the XML output.
When your XML contains &amp;amp;, this will result in the text &amp;.
When you use that in HTML, that will be rendered as &.
As per §2.4 of the XML 1.0 spec, you should be able to use &amp;.
I tried &amp; but this isn't allowed.
Are you sure it isn't a different issue? XML explicitly defines this as the way to escape ampersands.
The & character is itself an escape character in XML, so the solution is to concatenate it and a Unicode decimal equivalent for &, thus ensuring that there are no XML parsing errors. That is, replace the character & with &#38;.
Use CDATA tags:
<![CDATA[
This is some text with ampersands & other funny characters. >>
]]>
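A minimal sketch of what the CDATA approach does, using Python's standard xml.etree.ElementTree. The <note> wrapper element is an assumption added only so that the snippet is a complete document. The parser hands back the raw text, ampersand included, but note that a serializer will typically write it out again with ordinary escapes rather than CDATA.

```python
import xml.etree.ElementTree as ET

# <note> is a hypothetical wrapper element; the CDATA body is the one from the answer.
doc = (
    "<note><![CDATA[\n"
    "This is some text with ampersands & other funny characters. >>\n"
    "]]></note>"
)

root = ET.fromstring(doc)

# Inside CDATA the ampersand needs no escaping; the parser returns it literally.
text = root.text

# Re-serializing does not keep the CDATA section; ElementTree escapes the text instead.
serialized = ET.tostring(root, encoding="unicode")

print(repr(text))
print(serialized)
```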
&amp; should work just fine. Wikipedia has a list of predefined entities in XML.
In my case I had to change it to %26.
I needed to escape & in a URL. So &amp; did not work out for me.
The urlencode function changes & to %26. This way neither XML nor the browser URL mechanism complained about the URL.
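The answer refers to a urlencode function (PHP's, presumably). As a rough Python equivalent, assuming a made-up example.com URL and a hypothetical <link> element, percent-encoding the value first means the resulting URL contains no literal ampersand, so it can be placed in XML without any further escaping:

```python
from urllib.parse import quote, unquote
import xml.etree.ElementTree as ET

value = "Barnes & Noble"

# Percent-encode the value; safe="" makes sure nothing is left unencoded by default.
encoded = quote(value, safe="")

# example.com and the <link> element are placeholders for illustration only.
url = "http://example.com/search?q=" + encoded
doc = f"<link>{url}</link>"

# The document parses cleanly because %26 is plain text as far as XML is concerned.
parsed_url = ET.fromstring(doc).text

print(encoded)
print(unquote(parsed_url.split("=", 1)[1]))
```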
I have tried &amp;, but it didn't work. Based on Wim ten Brink's answer I tried &amp;amp; and it worked.
One of my fellow developers suggested that I use &#x26;, and that worked regardless of how many times it may be rendered.
&amp; is the way to represent an ampersand in most sections of an XML document.
If you want to have XML displayed within HTML, you need to first create properly encoded XML (which involves changing & to &amp;) and then use that to create properly encoded HTML (which involves again changing & to &amp;). That results in:
&amp;amp;
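The two rounds of escaping can be sketched in Python with the standard library. This is only an illustration of the layering, and the <t> element is a placeholder: each escape adds one level, and each parser (XML first, then HTML) removes one.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET
import html

raw = "&"

# First level: what the HTML source must contain to display a literal ampersand.
html_source = escape(raw)          # "&amp;"

# Second level: what the XML source must contain so that parsing yields html_source.
xml_source = escape(html_source)   # "&amp;amp;"

# The XML parser strips one level of escaping...
after_xml = ET.fromstring(f"<t>{xml_source}</t>").text

# ...and the HTML parser (emulated here by html.unescape) strips the other.
after_html = html.unescape(after_xml)

print(xml_source, after_xml, after_html)
```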
For a more thorough explanation of XML encoding, see:
What characters do I need to escape in XML documents?
<xsl:text disable-output-escaping="yes">&amp; </xsl:text> will do the trick.
Consider XML that looks like the following.
<Employees Id="1" Name="ABC">
<Query>
SELECT * FROM EMP WHERE ID=1 AND RES<>'GCF'
</Query>
</Employees>
You cannot use <> directly, as it throws a parsing error. In that case, you can use &lt;&gt; in its place.
<Employees Id="1" Name="ABC">
<Query>
SELECT * FROM EMP WHERE ID=1 AND RES &lt;&gt; 'GCF'
</Query>
</Employees>
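If the XML is generated by a program, the replacement does not have to be done by hand. A small sketch in Python, reusing the element names from the example above and assuming the query arrives as a plain string:

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

query = "SELECT * FROM EMP WHERE ID=1 AND RES <> 'GCF'"

# escape() turns &, < and > into &amp;, &lt; and &gt; (in that order).
escaped_query = escape(query)

doc = f'<Employees Id="1" Name="ABC"><Query>{escaped_query}</Query></Employees>'

# Parsing succeeds and gives back the original SQL text, operators included.
root = ET.fromstring(doc)
round_tripped = root.find("Query").text

print(escaped_query)
print(round_tripped)
```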
14.1 How to use special characters in XML has all the codes.
Related
Why is the src attribute of an image in HTML the following:
https://www.ft.com/__origami/service/image/v2/images/raw/http%3A%2F%2Fprod-upp-image-read.ft.com%2Fafe24c11-a86d-4444-bd64-1c2f4e4e3a54?source=next&fit=scale-down&compression=best&width=210 210w, https://www.ft.com/__origami/service/image/v2/images/raw/http%3A%2F%2Fprod-upp-image-read.ft.com%2Fafe24c11-a86d-4444-bd64-1c2f4e4e3a54?source=next&fit=scale-down&compression=best&width=150 150w
& and = are not encoded, but shouldn't they be &amp; or something? Why is that?
"Encoding" in a URL is the percent-encoding with %20 (or the shorter +) for a space; escaping with & belongs to XML (and its relatives such as SGML, HTML, XHTML, XSLT, …). Since & is used there to escape other characters, it has to be escaped as well, as &amp;. That is used, for example, when a URL is included in an XML file.
Be aware that there are different styles of URL encoding; PHP is a simple playground for that, since it has both urlencode and rawurlencode, along with the matching decoding functions.
The URL within the HTML is encoded properly, with & as the separator between parameters after the first one. You can look this up in the RFCs (e.g. RFC 2396, section 3.3 and following). If you wanted to say "Barnes and Noble", it would be escaped within the text as Barnes &amp; Noble. But in the URL it stands as is. Only in cases like XML processed by XSLT would you need to escape it again.
So for attributes like href and src, think of the content as simply being parsed differently, so that different rules apply.
Is the ampersand the only character that should be encoded in an HTML attribute?
It's well known that this won't pass validation:
Because the ampersand should be &amp;. Here's a direct link to the validation failure.
This guy lists a bunch of characters that should be encoded, but he's wrong. If you encode the first "/" in http:// the href won't work.
In ASP.NET, is there a helper method already built to handle this? Stuff like Server.UrlEncode and HtmlEncode obviously don't work - those are for different purposes.
I can build my own simple extension method (like .ToAttributeView()) which does a simple string replace.
Other than standard URI encoding of the values, & is the only character related to HTML entities that you have to worry about simply because this is the character that begins every HTML entity. Take for example the following URL:
http://query.com/?q=foo&lt=bar&gt=baz
Even though there aren't trailing semicolons, since &lt is the entity for < and &gt is the entity for >, some old browsers would translate this URL to:
http://query.com/?q=foo<=bar>=baz
So you need to specify & as &amp; to prevent this from occurring for links within an HTML parsed document.
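The effect can be reproduced with Python's html module, whose unescape function happens to decode legacy entity names even without a trailing semicolon. That makes it a reasonable stand-in for the lenient old-browser behaviour described here, though it is not a model of any particular browser.

```python
import html

url = "http://query.com/?q=foo&lt=bar&gt=baz"

# A lenient decoder treats &lt and &gt as entities even without semicolons.
mangled = html.unescape(url)

# Escaping the URL before putting it in an attribute protects the ampersands.
attribute_value = html.escape(url, quote=True)

# Decoding the escaped form gives back exactly the URL that was intended.
recovered = html.unescape(attribute_value)

print(mangled)
print(attribute_value)
```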
The purpose of escaping characters is so that they won't be processed as arguments. So you actually don't want to encode the entire url, just the values you are passing via the querystring. For example:
http://example.com/?parameter1=<ENCODED VALUE>&parameter2=<ENCODED VALUE>
The url you showed is actually a perfectly valid url that will pass validation. However, the browser will interpret the & symbols as a break between parameters in the querystring. So your querystring:
?q=whatever&lang=en
Will actually be translated by the recipient as two parameters:
q = "whatever"
lang = "en"
For your url to work you just need to ensure that your values are being encoded:
?q=<ENCODED VALUE>&lang=<ENCODED VALUE>
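As a sketch of encoding only the values, here is how it might look with Python's urllib.parse (the parameter values are made up for illustration). The & inside a value becomes %26, while the & between parameters stays a literal separator:

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical values; the first one contains an ampersand of its own.
params = {"q": "fish & chips", "lang": "en"}

# urlencode percent-encodes each value and joins the pairs with a literal &.
query_string = urlencode(params)

# The recipient splits on & and decodes each value again.
decoded = parse_qs(query_string)

print("?" + query_string)
print(decoded)
```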
Edit: The common problems page from the W3C you linked to is talking about edge cases when URLs are rendered in HTML and the & is followed by text that could be interpreted as an entity reference (&copy for example). Here is a test in jsfiddle showing the URL:
http://jsfiddle.net/YjPHA/1/
In Chrome and Firefox the link works correctly, but IE renders &copy as ©, breaking the link. I have to admit I've never had a problem with this in the wild (it would only affect those entity references which don't require a semicolon, which is a pretty small subset).
To ensure you're safe from this bug, you can HTML encode any of the URLs you render to the page and you should be fine. If you're using ASP.NET, the HttpUtility.HtmlEncode method should work just fine.
You do not need HTML escaping here:
According to the HTML5 spec:
http://www.w3.org/TR/html5/tokenization.html#character-reference-in-attribute-value-state
&lang= should be parsed as an unrecognized character reference, and the value of the attribute should be used as is: http://domain.com/search?q=whatever&lang=en
For reference, I added a question to the HTML5 WG list: http://lists.w3.org/Archives/Public/public-html/2011Sep/0163.html
In HTML attribute values, if you want ", & and a non-breaking space as a result, you should (as an author who is clear about intent) write &quot;, &amp; and &nbsp; in the markup.
For " though, you don't have to use &quot; if you use single quotes to enclose your attribute values.
For HTML text nodes, in addition to the above, if you want < and > as a result, you should use &lt; and &gt;. (I'd even use these in attribute values too.)
For hfnames and hfvalues (and directory names in the path) in URIs, I'd use JavaScript's encodeURIComponent() (on a UTF-8 page when encoding for use on a UTF-8 page).
If I understand the question correctly, I believe this is what you want.
I would like to escape characters in JSP pages. Which is more suitable, escapeXml or escapeHtml?
They're designed for different purposes: HTML has lots of entities that XML doesn't. XML only has 5 escapes:
&lt; represents "<"
&gt; represents ">"
&amp; represents "&"
&apos; represents '
&quot; represents "
HTML, on the other hand, has loads: think of &copy; etc. These HTML codes aren't valid in XML unless you include a definition in the header. The numeric codes (like &#169; for the copyright symbol) are valid in both.
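This difference is easy to check with any XML parser. A small sketch using Python's xml.etree.ElementTree (the <p> element is just a placeholder): the five predefined entities and the numeric reference parse fine, while the HTML-only name is rejected because nothing defines it.

```python
import xml.etree.ElementTree as ET

# The five predefined XML entities are always available.
predefined = ET.fromstring("<p>&lt;&gt;&amp;&apos;&quot;</p>").text

# A numeric character reference works without any declaration.
numeric = ET.fromstring("<p>&#169; 2011</p>").text

# An HTML-only named entity is undefined in plain XML and fails to parse.
try:
    ET.fromstring("<p>&copy; 2011</p>")
    named_entity_accepted = True
except ET.ParseError:
    named_entity_accepted = False

print(predefined, numeric, named_entity_accepted)
```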
There's no such thing as escapeHtml in JSP. You normally use <c:out escapeXml="true"> (it by the way already defaults to true, so you can omit it) or fn:escapeXml() to escape HTML in JSP.
E.g.
<c:out value="Welcome, ${user.name}" />
<input name="foo" value="${fn:escapeXml(param.foo)}" />
It will escape them as XML entities which works perfectly fine in plain HTML as well. They are only literally called XML entities because HTML entities are invalid in XML.
See also:
Java 5 HTML escaping To Prevent XSS
Escaping html in Java
Assuming you're referring to commons StringEscapeUtils, escapeXml only deals with <>"'& while escapeHtml covers a richer set of characters.
Since you are sending HTML back to the consumer I would go with escapeHtml.
escapeXml only supports escaping the five basic XML entities (gt, lt, quot, amp, apos) whereas escapeHtml supports escaping all known HTML 4.0 entities.
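To make the difference concrete, here is a rough Python imitation of the two behaviours. The helper names escape_xml_like and escape_html_like are hypothetical and are not the Commons Lang API; they only mimic the distinction described above, using the HTML 4 entity table that ships with Python.

```python
from html.entities import codepoint2name
from xml.sax.saxutils import escape


def escape_xml_like(text):
    """Escape only the five basic XML entities (hypothetical helper)."""
    return escape(text, {'"': "&quot;", "'": "&apos;"})


def escape_html_like(text):
    """Replace every character that has an HTML 4 entity name (hypothetical helper)."""
    return "".join(
        f"&{codepoint2name[ord(ch)]};" if ord(ch) in codepoint2name else ch
        for ch in text
    )


sample = "\u00a9 A & B <\u00e9>"  # copyright sign and e-acute alongside markup characters

xml_out = escape_xml_like(sample)    # non-ASCII characters are left alone
html_out = escape_html_like(sample)  # they become named entities as well

print(xml_out)
print(html_out)
```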
I am looking for a list of characters and symbols for use in HTML, in PDF or image format. It could be some sort of cheat sheet. Basically, I want a reference list for use in HTML, for example for replacing '&' with '&amp;'. I have found the list at http://www.w3schools.com/tags/ref_entities.asp, but I would appreciate it if anyone could point me to a PDF or image version of the list.
Regards
There is a complete list in the specification but, with the exception of <, &, and " or ', you should be able to use any character directly in UTF-8 (which results in much more readable documents).
Cheat-sheet