How to add non-escaped ampersands to HTML with Nokogiri::XML::Builder

I would like to add things like bullet points "&#8226;" to HTML using the XML Builder in Nokogiri, but everything is being escaped. How do I prevent it from being escaped?
I would like the result to be:
<span>&#8226;</span>
rather than:
<span>&amp;#8226;</span>
I'm just doing this:
xml.span {
  xml.text "&#8226;\ "
}
What am I missing?

If you define
require 'nokogiri'
class Nokogiri::XML::Builder
  # Parse a numeric character reference and insert the resulting node.
  def entity(code)
    doc = Nokogiri::XML("<?xml version='1.0'?><root>&##{code};</root>")
    insert(doc.root.children.first)
  end
end
then this
builder = Nokogiri::XML::Builder.new do |xml|
  xml.span {
    xml.text "I can has "
    xml.entity 8226 # decimal code point of the bullet character
    xml.text " entity?"
  }
end
puts builder.to_xml
yields
<?xml version="1.0"?>
<span>I can has • entity?</span>
PS: this is only a workaround. For a clean solution, please refer to the libxml2 documentation (Nokogiri is built on libxml2). However, even the libxml2 folks admit that handling entities can be quite, err, cumbersome sometimes.

When you're setting the text of an element, you really are setting text, not HTML source. < and & don't have any special meaning in plain text.
So just type a bullet: '•'. Of course, your source code and your XML file must use the same encoding for that to come out right. If your XML file is UTF-8 but your source code isn't, you'd probably have to write "\xe2\x80\xa2" (in double quotes, so that Ruby interprets the escapes), which is the UTF-8 byte sequence for the bullet character as a string literal.
(In general non-ASCII characters in Ruby 1.8 are tricky. The byte-based interfaces don't mesh too well with XML's world of all-text-is-Unicode.)
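As a quick check of the byte sequence mentioned above, here is a minimal sketch in JavaScript (not Ruby). It only assumes a runtime that provides the standard TextEncoder, such as a modern browser or Node.js; the variable names are just for illustration.

```javascript
// The bullet character, written as a Unicode escape (same as typing '•').
const bullet = '\u2022';

// Its Unicode code point in decimal: this is the 8226 in "&#8226;".
const codePoint = bullet.codePointAt(0);

// Its UTF-8 encoding, as lowercase hex bytes.
const utf8Hex = Array.from(new TextEncoder().encode(bullet), (b) =>
  b.toString(16).padStart(2, '0')
);

console.log(codePoint);         // 8226
console.log(utf8Hex.join(' ')); // e2 80 a2
```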

Why do some strings contain "&nbsp;" and some " ", when my input is the same (" ")?

My problem occurs when I try to use some data/strings in a p element.
I start off with data like this:
data: function() {
  return {
    reportText: {
      text1: "This is some subject text",
      text2: "This is the conclusion",
    }
  }
}
I use this data as follows in my (vue-)html:
<p> {{ reportText.text1 }} </p>
<p> {{ reportText.text2 }} </p>
In my browser, when I inspect my elements, I see the following results:
<p>This&nbsp;is&nbsp;some&nbsp;subject&nbsp;text</p>
<p>This is the conclusion</p>
As you can see, there is suddenly a difference: one p element uses &nbsp; and the other uses " ", even though I started off with both strings using only " ". I know &nbsp; and " " technically represent the same thing, but the problem with the &nbsp; string is that it gets treated as one large word instead of multiple separate words. This breaks my layout, and I can't solve it with CSS properties (word-wrap etc.).
Other things I have tried:
Tried sanitizing the strings by using .replace("&nbsp;", " "), but that doesn't do anything. I assume this is because it basically is the same, so there is nothing to really replace. It is also the reason I have to use code formatting on Stack Overflow to show the distinction between &nbsp; and " ".
Logged the data from Vue to see if there is any noticeable difference, but I can't see any. If I log the data/reportText, I again only see strings with " "s.
So I have the following questions:
Why does this happen? I can't find any logical explanation for why it sometimes uses " "s and sometimes &nbsp;s. It seems random, but I am sure I am missing something.
Is there anything else I could try to follow the path my string takes, so I can see where the transformation from " " to &nbsp; happens?
Per the comments, the solution ended up being a simple Unicode character replacement targeting the \u00A0 code point (i.e. replacing Unicode non-breaking spaces with ordinary spaces):
str.replace(/\u00A0/g, ' ')
Explanation:
JavaScript typically lets you write Unicode characters in two ways: you can type the rendered character directly, or you can use a Unicode escape (in JavaScript, a hexadecimal code point prefixed with \u, like \u00A0). It has no concept of an HTML entity (a character sequence between & and ;, like &nbsp;).
The inspector tool in some browsers, however, uses HTML entities and will often display Unicode characters as their corresponding entities where applicable. If you check the same source code in Chrome's inspector vs. Firefox's inspector (as of writing this answer, anyway), you will see that Chrome uses HTML entities while Firefox shows the rendered character. While it's handy to be able to see non-printable Unicode characters in the inspector, Chrome's use of HTML entities is only a convenience feature, not a reflection of the actual contents of your source code.
With that in mind, we can infer that your source code contains Unicode characters in their fully rendered form. Regardless of the form of the character, the fix is identical: you need to target these Unicode space characters explicitly and replace them with ordinary spaces.
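A minimal sketch of that replacement, together with a way to make the invisible characters visible while debugging. The helper names showCodePoints and normalizeSpaces are made up for this example, and the sample string assumes the non-breaking spaces arrived in the data itself.

```javascript
// Sample input: the words are joined by U+00A0 (non-breaking space),
// which looks identical to an ordinary space when logged.
const text = 'This\u00A0is\u00A0some\u00A0subject\u00A0text';

// Hypothetical debugging helper: list each character's code point in hex,
// so U+00A0 ("a0") can be told apart from U+0020 ("20").
function showCodePoints(str) {
  return Array.from(str, (ch) => ch.codePointAt(0).toString(16));
}

// Hypothetical helper: replace every non-breaking space with a plain space.
function normalizeSpaces(str) {
  return str.replace(/\u00A0/g, ' ');
}

const cleaned = normalizeSpaces(text);

console.log(text.includes('\u00A0'));    // true
console.log(cleaned.includes('\u00A0')); // false
console.log(cleaned);                    // This is some subject text
console.log(showCodePoints('a\u00A0b').join(' ')); // 61 a0 62
```

Logging code points like this is one way to find out at which step of the data's path the non-breaking spaces first appear.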

Saving NSXMLDocument escaping HTML for certain NSXMLElements

For my application I have to save an XML document containing a few elements with HTML text.
For example, the result should be:
<gpx>
<wpt>
<elementInHTML>
&lt;p&gt;Sample text.&lt;/p&gt;
</elementInHTML>
etc...
But when I add this HTML element to my NSXMLDocument, the '<' is correctly escaped automatically (to &lt;), but the '>' is not (to &gt;).
In code:
NSXMLElement *newWPT = [NSXMLElement elementWithName:@"wpt"];
NSXMLElement *htmlElement = [NSXMLElement elementWithName:@"elementInHTML"];
htmlElement.stringValue = @"<Sample text>";
[newWPT addChild:htmlElement];
But this results in an XML document like this:
<gpx>
<wpt>
<elementInHTML>
&lt;p>Sample text.&lt;/p>
</elementInHTML>
etc...
And this result is not valid for the device that has to process this XML file.
Does anybody have an idea how to enclose a correctly escaped HTML string in an NSXMLDocument?
The string is correctly escaped for XML; the greater-than sign is a valid character where it appears: http://www.w3.org/TR/REC-xml/#syntax
It seems to be a device-specific implementation problem.
Your easy option is to include your HTML markup in a CDATA section.
...and hope the device's XML parser implementation understands it properly.
(If your HTML markup also includes CDATA sections, you'll have to find/replace ">" with "&gt;", as stated in the link above.)
P.S.: Searching for "NSXMLNode CDATA" in any search engine will lead you to something closer to "copy-paste".
EDIT:
Knowing more now about the content of the string in the original question (see question comments), and depending on the nature of your string, the answers to this other question may also help: Objective-C and Swift URL encoding
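To make the CDATA option concrete, here is a minimal sketch in JavaScript (not Objective-C, and not using NSXMLDocument). The helper wrapInCdata is hypothetical. Note that it handles an embedded "]]>" by splitting it across two CDATA sections, which is a different technique from the find/replace mentioned above.

```javascript
// Hypothetical helper: wrap arbitrary markup in a CDATA section so that an
// XML parser treats it as literal text.
// A CDATA section cannot contain the sequence "]]>", so each occurrence is
// split: the first section ends after "]]" and a new one starts before ">".
function wrapInCdata(markup) {
  return '<![CDATA[' + markup.split(']]>').join(']]]]><![CDATA[>') + ']]>';
}

console.log(wrapInCdata('<p>Sample text.</p>'));
// <![CDATA[<p>Sample text.</p>]]>

console.log(wrapInCdata('a]]>b'));
// <![CDATA[a]]]]><![CDATA[>b]]>
```

Whether the receiving device accepts CDATA at all still depends on its parser, as noted above.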

Is there an easy way to strip HTML from a QString in Qt?

I have a QString with some HTML in it... is there an easy way to strip the HTML from it? I basically want just the actual text content.
<i>Test:</i><img src="blah.png" /><br> A test case
Would become:
Test: A test case
I'm curious to know if Qt has a string function or utility for this.
QString s = "<i>Test:</i><img src=\"blah.png\" /><br> A test case";
s.remove(QRegExp("<[^>]*>"));
// s == "Test: A test case"
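The same pattern as a JavaScript sketch, mainly to show where it breaks down. The function name stripTags is made up for this example. The pattern does not understand HTML, so a ">" inside an attribute value ends the match early, and entities such as &amp; are left undecoded.

```javascript
// Same idea as the QRegExp above: drop anything that looks like a tag.
function stripTags(html) {
  return html.replace(/<[^>]*>/g, '');
}

const ok = stripTags('<i>Test:</i><img src="blah.png" /><br> A test case');
console.log(ok); // Test: A test case

// Limitation: the ">" inside the attribute value is taken as the tag's end,
// so part of the attribute leaks into the result.
const broken = stripTags('<img alt="a > b">text');
console.log(JSON.stringify(broken)); // " b\">text"
```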
If you don't care about performance that much then QTextDocument does a pretty good job of converting HTML to plain text.
QTextDocument doc;
doc.setHtml( htmlString );
return doc.toPlainText();
I know this question is old, but I was looking for a quick and dirty way to handle incorrect HTML. The XML parser wasn't giving good results.
You may try to iterate through the string using the QXmlStreamReader class and extract all the text (if your HTML string is guaranteed to be well-formed XML).
Something like this:
QXmlStreamReader xml(htmlString);
QString textString;
while (!xml.atEnd()) {
    if (xml.readNext() == QXmlStreamReader::Characters) {
        textString += xml.text();
    }
}
but I'm not sure this is 100% valid usage of the QXmlStreamReader API, since I used it quite a long time ago and may have forgotten something.
The fact that some HTML is not quite valid XML makes it harder to get this right.
If it's valid XML (or not too badly formatted), I think QXmlStreamReader + QXmlStreamEntityResolver might not be a bad idea.
Sample code in: https://github.com/ycheng/misccode/blob/master/qt_html_parse/utils.cpp
(this can be a comment, but I still don't have permission to do so)
This answer is for those who read this post later and use Qt 5 or later. Simply escape the HTML characters using the built-in function, as below. Note that this escapes the markup rather than removing it.
QString str = "<h1>some heading </h1>"; // a string containing HTML tags
QString esc = str.toHtmlEscaped(); // esc contains the HTML-escaped string

HTML rendered incorrectly in .NET

I am trying to take the string "<BR>" in VB.NET and convert it to HTML through XSLT. When the HTML comes out, though, it looks like this:
&lt;BR&gt;
I can only assume it goes ahead and tries to render it. Is there any way I can convert those &lt;/&gt; back into the brackets so I get the line break I'm trying for?
Check the XSLT has:
<xsl:output method="html"/>
edit: explanation from comments
By default XSLT outputs XML(1), which means it will escape any significant characters. You can override this in specific instances with the attribute disable-output-escaping="yes" (intro here), but it is much more powerful to change the output method explicitly to HTML, which confers the same benefit globally, along with the following:
For script and style elements, replace any escaped characters (such as &amp; and &gt;) with their actual values (& and >, respectively).
For attributes, replace any occurrences of &gt; with >.
Write empty elements such as <br>, <img>, and <input> without closing tags or slashes.
Write attributes that convey information by their presence as opposed to their value, such as checked and selected, in minimized form.
from a solid IBM article covering the subject, more recent coverage from stylusstudio here
If HTML output is what you desire, HTML output is what you should specify.
(1) There is actually a corner case where output defaults to HTML, but I don't think it's universal and it's unwise to depend on it.
Try wrapping it with <xsl:text disable-output-escaping="yes">&lt;br&gt;</xsl:text>
I don't know about XSLT, but...
One workaround might be to use HttpUtility.HtmlDecode from the System.Web namespace.
using System;
using System.Web;
class Program
{
    static void Main()
    {
        // Decodes the escaped entities back into "<br>".
        Console.WriteLine(HttpUtility.HtmlDecode("&lt;br&gt;"));
        Console.ReadKey();
    }
}
...
Got it! On top of the selected answer, I also did something similar to this on my string:
htmlString = htmlString.Replace("&lt;", "<")
htmlString = htmlString.Replace("&gt;", ">")
I think, though, that in the end, I may just end up using <pre> tags to preserve everything.
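The order of such replacements matters once &amp; is involved as well. A minimal sketch in JavaScript (not VB.NET); decodeBasicEntities is a hypothetical helper that handles only three entities:

```javascript
// Hypothetical helper: turn three common escaped entities back into
// literal characters. "&amp;" is decoded LAST so that text such as
// "&amp;lt;" becomes "&lt;" (one level of decoding) rather than "<".
function decodeBasicEntities(str) {
  return str
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&amp;/g, '&');
}

console.log(decodeBasicEntities('&lt;br&gt;')); // <br>
console.log(decodeBasicEntities('&amp;lt;'));   // &lt;
```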
The string "<br>" is already HTML, so you can just Response.Write("<br>").
But you mention XSLT, so I imagine there is some transform going on. In that case, surely the transform should be inserting it at the correct place as a node. A better question will likely get a better answer.

Actionscript3 E4X XML and CSS: Do I really have to use CDATA?

When working with CSS inside of XML such as
<span class="IwuvAS3"></span>
when parsed in flash, if I don't use CDATA like the following:
<![CDATA[<span class="IwuvAS3"></span>]]>
then the parsed data drops down a line for every "<" character it sees.
When parsing the data into a single-line text field, nothing was shown because it was actually down a line. Soon as I wrap it inside of CDATA it works great. I have played with prettyIndent, and as I understand ignoreWhite is true by default.
Is there a way to parse the data without the use of CDATA and keep the implied line breaks out?
EDIT 1 (10/10/08): Thank you, but I am actually looking for a Function or Method. Escaping each is much more cumbersome than using CDATA. The only reason I don't want to use CDATA is that I was taught to stay clear of it. If ActionScript has a method associated to E4X XML handling that will remove the requirement to wrap my XML in CDATA, I would love to know about it.
EDIT 2 (10/15/08): Thanks Philippe! I never would have thought that HTML formatting in Flash is treated as whitespace. The answer was
textField.condenseWhite = true;
<3AS3
Set the TextField's condenseWhite property to true, so that only <br/> tags will generate line breaks.
You could escape the "<" characters (and &, ", >, ', among others) as entities instead.
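A minimal sketch of that escaping, written in JavaScript rather than ActionScript. The helper name escapeXml is made up here, and it covers only the five predefined XML entities.

```javascript
// Map of the five characters that have predefined XML entities.
const XML_ENTITIES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&apos;',
};

// Hypothetical helper: replace each special character with its entity.
// A single pass avoids re-escaping the "&" of an entity just produced.
function escapeXml(str) {
  return str.replace(/[&<>"']/g, (ch) => XML_ENTITIES[ch]);
}

console.log(escapeXml('<span class="IwuvAS3"></span>'));
// &lt;span class=&quot;IwuvAS3&quot;&gt;&lt;/span&gt;
```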