I'm using libxml2 to parse/read an HTML page. The following code is used to read the value of an attribute:
char *value = (char*)xmlGetProp(node, attr->name);
But xmlGetProp substitutes character entity references when it reads the attribute content. E.g.
<p onload="readId=&quot;blahString&quot;; myFun();"> Event handler in P HTML TAG</p>
In the above case, it returns the following string as "onload" attribute value:
readId="blahString";myFun();
The character entity reference is substituted in the above reading process. Is there any way to read the attribute value keeping the original HTML content using libxml2?
What you call "HTML encoding" is actually called a character entity reference. To answer your question: no, the HTML parser of libxml2 has no option to turn off substitution of character references.
The XML parser keeps character entity references by default, but it can't be used for typical HTML documents.
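If the goal is only to write the attribute back out as HTML, one workaround is to re-escape the decoded value yourself. Below is a minimal sketch in Python rather than libxml2, using `html.escape` from the standard library and assuming the original markup spelled the quotes as `&quot;`. It cannot recover the exact original spelling, since `&#34;` and `&quot;` decode to the same character.

```python
import html

# Value as an HTML parser hands it back, with character references substituted.
decoded = 'readId="blahString"; myFun();'

# Re-escape &, <, > and quotes so the value is safe inside a quoted attribute.
reescaped = html.escape(decoded, quote=True)
print(reescaped)  # readId=&quot;blahString&quot;; myFun();
```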
Is it possible to build an XSD that will treat any tag's contents just as text? I am trying to extract a tag's contents that sometimes contains HTML tags. There is no fixed pattern to the html and is not always present. I just want to extract all the text from within the tags. e.g. <content>this is a new piece of content by <b>Person A</b></content>. I want to extract just "this is a new piece of content by <b>Person A</b>" but the schema generated by SSIS naturally includes these tags. When I just add a simple entry
<xs:element minOccurs="0" name="content" type="xs:string"></xs:element>
I get the following error which is not unexpected.
[XML Source [5]] Error: The XML Source was unable to process the XML
data. The element "content" cannot contain a child element. Content
model is text only.
You're not distinguishing very clearly between the schema you are writing to describe and constrain your data (and, I assume, guide SSIS in various ways) and the executable code you will at some point want to write in order to extract the data you want at a particular moment. There are several things you seem to want or need:
To allow unconstrained XML within an element, you'll want a wildcard; read up on the xsd:any element.
To extract just the text within an element, you'll want the XPath string() function (but note that your example "this is a new piece of content by <b>Person A</b>" is not just the text of content but contains a child element).
To extract a serialized XML representation of the content of the content element (which is what you apparently want, in contrast to what you say you want), you'll want to serialize the contents; there are a variety of ways to do that.
Think of the XSD primarily as describing allowed markup in a valid XML document rather than as a method to define extraction. If you change the type of content to xs:string, you're declaring that markup is not permitted within content, only text, and the validation error you're getting reflects that.
What you want is to select the string value of the content element. If the context for an XPath doesn't automatically convert its results to a string value, you can do so explicitly via the string() XPath function:
string(/path/to/particular/content)
This will return the concatenation of the string values of all of the children of content, omitting the tags as requested.
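As a rough illustration of what the string value looks like, here is a sketch in Python using `xml.etree.ElementTree` (not SSIS). The `root` wrapper element is an assumption made for the example; `itertext()` is used as a stand-in for the XPath `string()` function.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<root><content>this is a new piece of content by "
    "<b>Person A</b></content></root>"
)
content = doc.find("content")

# Like XPath string(): concatenate all descendant text nodes, dropping tags.
string_value = "".join(content.itertext())
print(string_value)  # this is a new piece of content by Person A
```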
Update: Re-reading your question, I see that you actually want to retrieve
"this is a new piece of content by <b>Person A</b>"
(including the b element, not its string value). Here, the wrapping content element clearly has to be described in the XSD as having mixed content (mixed="true"). Extracting this data from an XML document in this form would typically involve selecting a collection of text and element nodes and serializing these back to a single string. I am not familiar enough with SSIS to provide details, but perhaps the reference I mentioned in the comments could help.
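The select-and-serialize step might look like the following sketch in Python with `xml.etree.ElementTree`. The helper name `inner_xml` is hypothetical, and this says nothing about how SSIS itself would do it.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def inner_xml(elem):
    """Serialize the children of elem (text and elements) into one string."""
    # Leading text is re-escaped so the result is still valid markup.
    parts = [escape(elem.text or "")]
    for child in elem:
        # tostring() also emits the text that follows the child (its tail).
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)

content = ET.fromstring(
    "<content>this is a new piece of content by <b>Person A</b></content>"
)
print(inner_xml(content))  # this is a new piece of content by <b>Person A</b>
```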
For my application I have to save an XML document containing a few elements with HTML-text.
The result should look like this:
<gpx>
<wpt>
<elementInHTML>
&lt;p&gt;Sample text.&lt;/p&gt;
</elementInHTML>
etc...
But when I add this HTML element to my NSXMLDocument, the '<' is correctly escaped automatically (to &lt;), but the '>' is not (to &gt;).
In code:
NSXMLElement *newWPT = [NSXMLElement elementWithName:@"wpt"];
NSXMLElement *htmlElement = [NSXMLElement elementWithName:@"elementInHTML"];
htmlElement.stringValue = @"<Sample text>";
[newWPT addChild:htmlElement];
But this results in an XML document like this:
<gpx>
<wpt>
<elementInHTML>
&lt;p>Sample text.&lt;/p>
</elementInHTML>
etc...
And this result is not valid for the device that has to process this XML file.
Does anybody have an idea how to enclose a correctly escaped HTML string in an NSXMLDocument?
The string is correctly escaped for XML; the greater-than sign is a valid character where it is: http://www.w3.org/TR/REC-xml/#syntax
It seems it's a device implementation specific problem.
Your easy option is to include your HTML markup in a CDATA section.
...and hope the XML parser on the device understands it properly.
(If your HTML markup also includes CDATA sections, you'll have to find/replace ">" with "&gt;", as stated in the link above.)
P.S.: Searching for "NSXMLNode CDATA" in any search engine will lead you to something closer to "copy-paste".
EDIT:
Knowing now more about the content of the string in the original question (see question comments) and depending on the nature of your string answers to this other question may also help: Objective-C and Swift URL encoding
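To illustrate the point that an unescaped ">" is legal in XML character data, here is a small check in Python with `xml.etree.ElementTree` (not NSXMLDocument). A conforming parser should return the same text for both spellings, which suggests the problem lies with the device's parser.

```python
import xml.etree.ElementTree as ET

escaped_both = "<elementInHTML>&lt;p&gt;Sample text.&lt;/p&gt;</elementInHTML>"
escaped_lt_only = "<elementInHTML>&lt;p>Sample text.&lt;/p></elementInHTML>"

text_a = ET.fromstring(escaped_both).text
text_b = ET.fromstring(escaped_lt_only).text

# Both documents are well-formed and carry identical character data.
print(text_a == text_b)  # True
print(text_a)            # <p>Sample text.</p>
```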
I have a strange problem:
In the database, I have a literal ampersand lt semicolon:
&lt;div
Whenever it's printed into an HTML textarea tag, the source code of the page shows the &gt; as >.
How do I stop this decoding?
You can't stop entities being decoded in a textarea since the content of a textarea is not (unlike a script or style element) intrinsic CDATA, even though error recovery may sometimes give the impression that it is.
The definition of the textarea element is:
<!ELEMENT TEXTAREA - - (#PCDATA) -- multi-line text field -->
i.e. it contains PCDATA which is described as:
Document text (indicated by the SGML construct "#PCDATA"). Text may contain character references. Recall that these begin with & and end with a semicolon (e.g., Herg&eacute;'s adventures of Tintin contains the character entity reference for the e acute character).
This means that when you type (the invalid HTML of) "start of tag" (<) the browser corrects it to "less than sign" (&lt;), but when you type "start of entity" (&), which is allowed, no error correction takes place.
You need to write what you mean. If you want to include some HTML as data then you must convert any character with special meaning to its respective character reference.
If the data is:
&lt;div
Then the HTML must be:
<textarea>&amp;lt;div</textarea>
You can use the standard functions for converting this (e.g. PHP's htmlspecialchars or Perl's HTML::Entities module).
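For comparison, the same conversion sketched in Python with the standard library's `html.escape` (the helper name `textarea_html` is made up for this example):

```python
import html

def textarea_html(data: str) -> str:
    """Wrap data in a textarea, escaping &, < and > so it displays verbatim."""
    return "<textarea>" + html.escape(data, quote=False) + "</textarea>"

stored = "&lt;div"            # the literal text held in the database
markup = textarea_html(stored)
print(markup)                 # <textarea>&amp;lt;div</textarea>
```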
NB 1: If you were using XHTML[2] (and really using it, it doesn't count if you serve it as text/html) then you could use an explicit CDATA block:
<textarea><![CDATA[&lt;div]]></textarea>
NB 2: Or if browsers implemented HTML 4 correctly
OK, but the question is: why does it decode them anyway? Assuming I've added &amp;lt; and saved the textarea, it will be saved as &lt; but displayed as <. Saving it again will convert it back to < (but it will remain &lt; in the database), and saving again will save it as < in the database. Why does the textarea decode it?
The server sends (to the browser) data encoded as HTML.
The browser sends (to the server) data encoded as application/x-www-form-urlencoded (or multipart/form-data).
Since the browser is not sending the data as HTML, the characters are not represented as HTML entities.
If you take the data received from the client and then put it into an HTML document, then you must encode it as HTML first.
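The round trip can be simulated roughly in Python. This is only a sketch: `html.unescape` stands in for the browser's HTML decoding, `urlencode`/`parse_qs` for the form submission, and the field name `somename` is arbitrary.

```python
import html
from urllib.parse import urlencode, parse_qs

def round_trip(stored: str, encode_first: bool) -> str:
    """Simulate server -> browser (HTML) -> server (form data)."""
    sent = html.escape(stored, quote=False) if encode_first else stored
    shown = html.unescape(sent)              # browser decodes the HTML once
    body = urlencode({"somename": shown})    # browser submits URL-encoded text
    return parse_qs(body)["somename"][0]     # server reads the field back

stored = "&lt;div"
print(round_trip(stored, encode_first=True))   # &lt;div  (unchanged)
print(round_trip(stored, encode_first=False))  # <div     (entity lost)
```

Without the HTML-encoding step, each save strips one level of entities, which matches the behaviour described in the question.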
In PHP, this can be done using htmlentities(). Example below.
<?php
$content = "This string contains the TM symbol: &#8482;";
print "<textarea>". htmlentities($content) ."</textarea>";
?>
Without htmlentities(), the textarea would interpret and display the TM symbol (™) instead of "&#8482;".
http://php.net/manual/en/function.htmlentities.php
You have to be sure that this is rendered to the browser:
<textarea name="somename">&amp;lt;div</textarea>
Essentially, this means that the & in &lt; has to be HTML-encoded to &amp;. How to do it will depend on the technologies you're using.
UPDATE: Think about it like this. If you want to display <div> inside a textarea, you'll have to encode < and > because otherwise <div> would be a normal HTML element to the browser:
<textarea name="somename">&lt;div&gt;</textarea>
Having said this, if you want to display &lt;div&gt; inside a textarea, you'll have to encode & again, because the browser decodes HTML entities when rendering HTML. It has nothing to do with your database.
You can serve your DB-content from a separate page and then place it in the textarea using a Javascript (jQuery) Ajax-call:
request = $.ajax
({
type: "GET",
url: "url-with-the-troubled-content.php",
success: function(data)
{
document.getElementById('id-of-text-area').value = data;
}
});
Explained at
http://www.endtask.net/how-to-prevent-a-textarea-element-from-decoding-html-entities/
I had the same problem and I just made two replacements on the text to show from the database before letting it into the text area:
myString = Replace(myString, "&", "&amp;")
myString = Replace(myString, "<", "&lt;")
Replacement no. 1 tricks the textarea into showing the codes.
Replacement no. 2: without it you cannot show the word "</textarea>" inside the textarea (it would end the textarea tag).
(Asp / vbscript code above, translate to a replace method of your language choice)
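The order of the two replacements matters: the ampersand has to be handled first. A small Python sketch of the same idea (both function names are invented for the example):

```python
def encode_for_textarea(text: str) -> str:
    # Ampersand first, so the "&" written by the second step is left alone.
    return text.replace("&", "&amp;").replace("<", "&lt;")

def wrong_order(text: str) -> str:
    # "<" first: the "&" of each new "&lt;" then gets escaped a second time.
    return text.replace("<", "&lt;").replace("&", "&amp;")

sample = "&lt;div </textarea>"
print(encode_for_textarea(sample))  # &amp;lt;div &lt;/textarea>
print(wrong_order(sample))          # &amp;lt;div &amp;lt;/textarea>
```

With the wrong order, the textarea would show "&lt;/textarea>" instead of the original "</textarea>".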
I found an alternative solution for reading and working with the text in the browser: simply read the element's text() using jQuery. It returns the characters as displayed, and allows me to write from a textarea to a div's innerHTML via html().
With only JS and HTML...
...to answer the actual question, with a bare-minimal example:
<textarea id=myta></textarea>
<script id=mytext type=text/plain>
&trade;
</script>
<script> myta.value = mytext.innerText; </script>
Explanation:
Script tags do not render HTML or entities. Text stored in a script tag remains unaltered; the problem is that the browser will try to execute it as JavaScript. So we use an empty textarea and store the text in a script tag (here, the first one).
To prevent that, we change the MIME type to text/plain instead of its default, which is text/javascript. This will prevent it from running.
Then to populate the textarea, we copy the script tag's content to it (here done in the second script tag).
The only caveats I have found with this are you have to use JavaScript and you cannot include script tags directly in it.
When working with CSS inside of XML such as
<span class="IwuvAS3"></span>
when parsed in flash, if I don't use CDATA like the following:
<![CDATA[<span class="IwuvAS3"></span>]]>
then the parsed data drops down a line for every "<" character it sees.
When parsing the data into a single-line text field, nothing was shown because it was actually down a line. Soon as I wrap it inside of CDATA it works great. I have played with prettyIndent, and as I understand ignoreWhite is true by default.
Is there a way to parse the data without the use of CDATA and keep the implied line breaks out?
EDIT 1 (10/10/08): Thank you, but I am actually looking for a Function or Method. Escaping each is much more cumbersome than using CDATA. The only reason I don't want to use CDATA is that I was taught to stay clear of it. If ActionScript has a method associated to E4X XML handling that will remove the requirement to wrap my XML in CDATA, I would love to know about it.
EDIT 2 (10/15/08): Thanks Philippe! I never would have thought that HTML formatting in Flash is treated as whitespace. The answer was
textField.condenseWhite = true;
<3AS3
Set the TextField's condenseWhite property to true, so that only <br/> tags will generate line breaks.
You could escape the "<" characters (and &, ", >, ', among others) as entities instead.
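If the XML is generated by a script before Flash loads it, the escaping need not be done by hand. A sketch in Python (not ActionScript), using `xml.sax.saxutils.escape` with an extra mapping for the two quote characters:

```python
from xml.sax.saxutils import escape

# escape() handles &, < and > itself; quotes are added via the mapping.
QUOTES = {'"': "&quot;", "'": "&apos;"}

markup = '<span class="IwuvAS3"></span>'
escaped = escape(markup, QUOTES)
print(escaped)  # &lt;span class=&quot;IwuvAS3&quot;&gt;&lt;/span&gt;
```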