Saving NSXMLDocument escaping HTML for certain NSXMLElements - html

For my application I have to save an XML document containing a few elements with HTML-text.
Example as the result should be:
<gpx>
<wpt>
<elementInHTML>
<p>Sample text.</p>
</elementInHTML>
etc...
But when I add this html element to my NSXMLDocument the '<' (to <) is correctly escaped automatically, but the '>' not (to >).
In code:
NSXMLElement *newWPT = [NSXMLElement elementWithName:#"wpt"];
NSXMLElement *htmlElement = [NSXMLElement elementWithName:#"elementInHTML"];
htmlElement.stringValue = #"<Sample text>";
[newWPT addChild:htmlElement];
But this results in an XML document like this:
<gpx>
<wpt>
<elementInHTML>
<p>Sample text.</p>
</elementInHTML>
etc...
And this result is not valid for the device that has to process this xml file.
Anybody an idea how to enclose a correctly escaped html-string into a NSXMLDocument?

&The string is correctly scaped for XML, greater than is a valid character where it is: http://www.w3.org/TR/REC-xml/#syntax
It seems it's a device implementation specific problem.
Your easy option is to include your html markup in a CDATA.
...and hope the device client XML parser implementation understand it properly.
(If your html markup include also CDATA sections you'll have to find/replace ">" with ">", as stated in the link before.)
P.D.: NSXMLNode CDATA in any search engine will lead you to something closer to "copy-paste"
EDIT:
Knowing now more about the content of the string in the original question (see question comments) and depending on the nature of your string answers to this other question may also help: Objective-C and Swift URL encoding

Related

Why do some strings contain " " and some " ", when my input is the same(" ")?

My problem occurs when I try to use some data/strings in a p-element.
I start of with data like this:
data: function() {
return {
reportText: {
text1: "This is some subject text",
text2: "This is the conclusion",
}
}
}
I use this data as follows in my (vue-)html:
<p> {{ reportText.text1 }} </p>
<p> {{ reportText.text2 }} </p>
In my browser, when I inspect my elements I get to see the following results:
<p>This is some subject text</p>
<p>This is the conclusion</p>
As you can see, there is suddenly a difference, one p element uses and the other , even though I started of with both strings only using . I know and technically represent the same thingm, but the problem with the string is that it gets treated as a string with 1 large word instead of multiple separate words. This screws up my layout and I can't solve this by using certain css properties (word-wrap etc.)
Other things I have tried:
Tried sanitizing the strings by using .replace( , ), but that doesn't do anything. I assume this is because it basically is the same, so there is nothing to really replace. Same reason why I have to use blockcode on stackoverflow to make the destinction between and .
Logged the data from vue to see if there is any noticeable difference, but I can't see any. If I log the data/reportText I again only see string with 's
So I have the following questions:
Why does this happen? I can't seem to find any logical explanation why it sometimes uses 's and sometimes uses 's, it seems random, but I am sure I am missing something.
Any other things I could try to follow the path my string takes, so I can see where the transformation from to happens?
Per the comments, the solution devised ended up being a simple unicode character replacement targeting the \u00A0 unicode code point (i.e. replacing unicode non-breaking spaces with ordinary spaces):
str.replace(/[\\u00A0]/g, ' ')
Explanation:
JavaScript typically allows the use of unicode characters in two ways: you can input the rendered character directly, or you can use a unicode code point (i.e. in the case of JavaScript, a hexadecimal code prefixed with \u like \u00A0). It has no concept of an HTML entity (i.e. a character sequence between a & and ; like ).
The inspector tool for some browsers, however, utilizes the HTML concept of the HTML entity and will often display unicode characters using their corresponding HTML entities where applicable. If you check the same source code in Chrome's inspector vs. Firefox's inspector (as of writing this answer, anyway), you will see that Chrome uses HTML entities while Firefox uses the rendered character result. While it's a handy feature to be able to see non-printable unicode characters in the inspector, Chrome's use of HTML entities is only a convenience feature, not a reflection of the actual contents of your source code.
With that in mind, we can infer that your source code contains unicode characters in their fully rendered form. Regardless of the form of your unicode character, the fix is identical: you need to target these unicode space characters explicitly and replace them with ordinary spaces.

Displaying tags in HTML5

In HTML5, if you include <pre> for example on a paragraph text, the result won't display '<pre>' on the paragraph, and it will run the command <pre> on the words after it.
What I have to do display texts including signs like " " or <> on a text, without running the command.
How can I accomplish this?
What you're looking for are known as HTML entities: characters that are reserved, and which automatically get parsed to the the relevant HTML. Using these tags allow you to write out the entities that would usually automatically get parsed as HTML.
For example, attempting to write out the <pre> tag within a parent <pre> tag will normally result in the inner tag being treated as HTML:
<pre><pre>The relevant tags surround this text</pre></pre>
Though using the HTML entities < and > for the left and right bracket respectively parses the entities as HTML, where they get displayed as text:
<pre><pre>The relevant tags surround this text</pre></pre>
A full list of HTML entities can be found here.
Hope this helps! :)
The 'best' way is to replace every < and > element with < and &gt:
But if you want to do it fast, you can use the xmp tag. It's deprecated but is still supported by all browsers
<xmp>
<div>Lorem ipsum</div>
<p>Hello</p>
</xmp>
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/xmp
You need to escape the provided content to display it as-is, without it being interpreted as HTML.
https://github.com/sindresorhus/escape-goat
This involves taking the reserved characters for the given language (e.g. HTML) and converting them to a representation that either uses an escape sequence or only uses unreserved characters.
In this case, HTML prescribes the use of entities to display characters that would otherwise be used by the syntax for tags and attributes within the source code itself.

How to display raw HTML code in PRE or something like it but without escaping it

I'd like to display raw HTML. We all know one has to escape each "<" and ">" like this:
<PRE> this is a test &ltDIV&gt </PRE>
However, I do not want to do this. I'd like a way to keep the HTML code as is (since it is easier to read, (inside the editor) and I might want to copy it and use it again myself as actual HTML code, and do not want to have to change it again or have two versions of the same code, one escaped and one not escaped.
Is there any other environment that is more "raw" than PRE that might allow this? So one does not have to keep editing HTML and changing everything each time they want to show some raw HTML code, maybe in HTML5?
Something like <REALLY_REALLY_VERBATIM> ...... </<REALLY_REALLY_VERBATIM>
The JavaScript solution does not work on Firefox 21, here is a screenshot:
The first solution still does not work on Firefox, here is a screenshot:
You can use the xmp element, see What was the <XMP> tag used for?. It has been in HTML since the beginning and is supported by all browsers. Specifications frown upon it, but HTML5 CR still describes it and requires browsers to support it (though it also tells authors not to use it, but it cannot really prevent you).
Everything inside xmp is taken as such, no markup (tags or character references) is recognized there, except, for apparent reason, the end tag of the element itself, </xmp>.
Otherwise xmp is rendered like pre.
When using “real XHTML”, i.e. XHTML served with an XML media type (which is rare), the special parsing rules do not apply, so xmp is treated like pre. But in “real XHTML”, you can use a CDATA section, which implies similar parsing rules. It has no special formatting, so you would probably want to wrap it inside a pre element:
<pre><![CDATA[
This is a demo, tags like <p> will
appear literally.
]]></pre>
I don’t see how you could combine xmp and CDATA section to achieve so-called polyglot markup
Essentially the original question can be broken down in 2 parts:
Main objective/challenge: embedding(/transporting) a raw formatted code-snippet
(any kind of code) in a web-page's markup (for simple copy/paste/edit due to no
encoding/escaping)
correctly displaying/rendering that code-snippet (possibly edit it) in the
browser
The short (but) ambiguous answer is: you can't, ...but you can (get very close).
(I know, that are 3 contradicting answers, so read on...)
(polyglot)(x)(ht)ml Markup-languages rely on wrapping (almost) everything between begin/opening and end/closing tags/character(sequences).
So, to embed any kind of raw code/snippet inside your markup-language, one will always have to escape/encode every instance (inside that snippet) that resembles the character(-sequence) that would close the wrapping 'container' element in the markup. (During this post I'll refer to this as rule no 1.)
Think of "some "data" here" or <i>..close italics with '</i>'-tag</i>, where it is obvious one should escape/encode (something in) </i and " (or change container's quote-character from " to ').
So, because of rule no 1, you can't 'just' embed 'any' unknown raw code-snippet inside markup.
Because, if one has to escape/encode even one character inside the raw snippet, then that snippet would no longer be the same original 'pure raw code' that anyone can copy/paste/edit in the document's markup without further thought. It would lead to malformed/illegal markup and Mojibake (mainly) because of entities.
Also, should that snippet contain such characters, you'd still need some javascript to 'translate' that character(sequence) from (and to) it's escaped/encoded representation to display the snippet correctly in the 'webpage' (for copy/paste/edit).
That brings us to (some of) the datatypes that markup-languages specify. These datatypes essentially define what are considered 'valid characters' and their meaning (per tag, property, etc.):
PCDATA (Parsed Character DATA): will expand entities and one must
escape <, & (and > depending on markup language/version).
Most tags like body, div, pre, etc, but also textarea (until
HTML5) fall under this type.
So not only do you need to encode all the container's closing character-sequences
inside the snippet, you also have to encode all <, & (,>) characters
(at minimum).
Needless to say, encoding/escaping this many characters falls outside this
objective's scope of embedding a raw snippet in the markup.
'..But a textarea seems to work...', yes, either because of the browsers
error-engine trying to make something out of it, or because HTML5:
RCDATA (Replaceable Character DATA): will not not treat tags inside the
text as markup (but are still governed by rule 1), so one doesn't need to
encode < (>). BUT entities are still expanded, so they and 'ambiguous
ampersands' (&) need special care.
The current HTML5 spec says the textarea is now a RCDATA field and (quote):
The text in raw text and RCDATA elements must not contain any
occurrences of the string "</" (U+003C LESS-THAN SIGN, U+002F SOLIDUS)
followed by characters that case-insensitively match the tag name of
the element followed by one of U+0009 CHARACTER TABULATION (tab),
U+000A LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE RETURN
(CR), U+0020 SPACE, U+003E GREATER-THAN SIGN (>), or U+002F SOLIDUS (/).
Thus no matter what, textarea needs a hefty entity translation handler or
it will eventually Mojibake on entities!
CDATA (Character Data) will not treat tags inside the text as
markup and will not expand entities.
So as long as the raw snippet code does not violate rule 1 (that one can't
have the containers closing character(sequence) inside the snippet), this
requires no other escaping/encoding.
Clearly this boils down to: how can we minimize the number of characters/character-sequences that still need to be encoded in the snippet's raw source and the number of times that character(sequence) might appear in an average snippet; something that is also of importance for the javascript that handles the translation of these characters (if they occur).
So what 'containers' have this CDATA context?
Most value properties of tags are CDATA, so one could (ab)use a hidden input's value property (proof of concept jsfiddle here).
However (conform rule 1) this creates an encoding/escape problem with nested quotes (" and ') in the raw snippet and one needs some javascript to get/translate and set the snippet in another (visible) element (or simply setting it as a text-area's value). Somehow this gave me problems with entities in FF (just like in a textarea). But it doesn't really matter, since the 'price' of having to escape/encode nested quotes is higher then a (HTML5) textarea (quotes are quite common in source code..).
What about trying to (ab)use <![CDATA[<tag>bla & bla</tag>]]>?
As Jukka points out in his extended answer, this would only work in (rare) 'real xhtml'.
I thought of using a script-tag (with or without such a CDATA wrapper inside the script-tag) together with a multi-line comment /* */ that wraps the raw snippet (script-tags can have an id and you can access them by count). But since this obviously introduces a escaping problem with */, ]]> and </script in the raw snippet, this doesn't seem like a solution either.
Please post other viable 'containers' in the comments to this answer.
By the way, encoding or counting the number of - characters and balancing them out inside a comment tag <!-- --> is just insane for this purpose (apart from rule 1).
That leaves us with Jukka K. Korpela's excellent answer: the <xmp> tag seems the best option!
The 'forgotten' <xmp> holds CDATA, is intended for this purpose AND is indeed still in the current HTML 5 spec (and has been at least since HTML3.2); exactly what we need! It's also widely supported, even in IE6 (that is.. until it suffers from the same regression as the scrolling table-body).
Note: as Jukka pointed out, this will not work in true xhtml or polyglot (that will treat it as a pre) and the xmp tag must still adhere to rule no 1. But that's the 'only' rule.
Consider the following markup:
<!-- ATTENTION: replace any occurrence of </xmp with </xmp -->
<xmp id="snippet-container">
<div>
<div>this is an example div & holds an xmp tag:<br />
<xmp>
<html><head> <!-- indentation col 0!! -->
<title>My Title</title>
</head><body>
<p>hello world !!</p>
</body></html>
</xmp> <!-- note this encoded/escaped tag -->
</div>
This line is also part of the snippet
</div>
</xmp>
The above codeblok illustrates a raw piece of markup where <xmp id="snippet-container"> contains an (almost raw) code-snippet (containing div>div>xmp>html-document).
Notice the encoded closing tag in this markup? To comply with rule no 1, this was encoded/escaped).
So embedding/transporting the (sometimes almost) raw code is/seems solved.
What about displaying/rendering the snippet (and that encoded </xmp>)?
The browser will (or it should) render the snippet (the contents inside snippet-container) exactly the way you see it in the codeblock above (with some discrepancy amongst browsers whether or not the snippet starts with a blank line).
That includes the formatting/indentation, entities (like the string &), full tags, comments AND the encoded closing tag </xmp> (just like it was encoded in the markup). And depending on browser(version) one could even try use the property contenteditable="true" to edit this snippet (all that without javascript enabled). Doing something like textarea.value=xmp.innerHTML is also a breeze.
So you can... if the snippet doesn't contain the containers closing character-sequence.
However, should a raw snippet contain the closing character-sequence </xmp (because it is an example of xmp itself or it contains some regex, etc), you must accept that you have to encode/escape that sequence in the raw snippet AND need a javascript handler to translate that encoding to display/render the encoded </xmp> like </xmp> inside a textarea (for editing/posting) or (for example) a pre just to correctly render the snippet's code (or so it seems).
A very rudimentary jsfiddle example of this here. Note that getting/embedding/displaying/retrieving-to-textarea worked perfect even in IE6. But setting the xmp's innerHTML revealed some interesting 'would-be-intelligent' behavior on IE's part. There is a more extensive note and workaround on that in the fiddle.
But now comes the important kicker (another reason why you only get very close):
Just as an over-simplified example, imagine this rabbit-hole:
Intended raw code-snippet:
<!-- remember to translate between </xmp> and </xmp> -->
<xmp>
<p>a paragraph</p>
</xmp>
Well, to comply with rule 1, we 'only' need to encode those </xmp[> \n\r\t\f\/] sequences, right?
So that gives us the following markup (using just a possible encoding):
<xmp id="container">
<!-- remember to translate between </xmp> and </xmp> -->
<xmp>
<p>a paragraph</p>
</xmp>
</xmp>
Hmm.. shalt I get my crystal ball or flip a coin? No, let the computer look at its system-clock and state that a derived number is 'random'. Yes, that should do it..
Using a regex like: xmp.innerHTML.replace(/<(?=\/xmp[> \n\r\t\f\/])/gi, '<');, would translate 'back' to this:
<!-- remember to translate between </xmp> and </xmp> -->
<xmp>
<p>a paragraph</p>
</xmp>
Hmm.. seems this random generator is broken... Houston..?
Should you have missed the joke/problem, read again starting at the 'intended raw code-snippet'.
Wait, I know, we (also) need to encode .... to ....
Ok, rewind to 'intended raw code-snippet' and read again.
Somehow this all begins to smell like the famous hilarious-but-true rexgex-answer on SO, a good read for people fluent in mojibake.
Maybe someone knows a clever algorithm or solution to fix this problem, but I assume that the embedded raw code will get more and more obscure to the point where you'd be better of properly escaping/encoding just your <, & (and >), just like the rest of the world.
Conclusion: (using the xmp tag)
it can be done with known snippets that do not contain the container's closing character-sequence,
we can get very close to the original objective with known snippets that only use 'basic first-level' escaping/encoding so we don't fall in the rabbithole,
but ultimately it seems that one can't do this reliably in a 'production-environment' where people can/should copy/paste/edit 'any unknown' raw snippets while not knowing/understanding the implications/rules/rabbithole (depending on your implementation of handling/translating for rule 1 and the rabbit-hole).
Hope this helps!
PS:
Whilst I would appreciate an upvote if you find this explanation useful, I kind of think Jukka's answer should be the accepted answer (should no better option/answer come along), since he was the one who remembered the xmp tag (that I forgot about over the years and got 'distracted' by the commonly advocated PCDATA elements like pre, textarea, etc.).
This answer originated in explaining why you can't do it (with any unknown raw snippet) and explain some obvious pitfalls that some other (now deleted) answers overlooked when advising a textarea for embedding/transport. I've expanded my existing explanation to also support and further explain Jukka's answer (since all that entity and *CDATA stuff is almost harder than code-pages).
Cheap and cheerful answer:
<textarea>Some raw content</textarea>
The textarea will handle tabs, multiple spaces, newlines, line wrapping all verbatim.
It copies and pastes nicely and its valid HTML all the way. It also allows the user to resize the code box.
You don't need any CSS, JS, escaping, encoding.
You can alter the appearance and behaviour as well.
Here's a monospace font, editing disabled, smaller font, no border:
<textarea
style="width:100%; font-family: Monospace; font-size:10px; border:0;"
rows="30" disabled
>Some raw content</textarea>
This solution is probably not semantically correct. So if you need that, it might be best to choose a more sophisticated answer.
xmp is the way to go, i.e.:
<xmp>
# your code...
</xmp>
echo '<pre>' . htmlspecialchars("<div><b>raw HTML</b></div>") . '</pre>';
I think that's what you're looking for?
In other words, use htmlspecialchars() in PHP
#GitaarLAB and #Jukka elaborate that <xmp> tag is obsolete, but still the best. When I use it like this
<xmp>
<div>Lorem ipsum</div>
<p>Hello</p>
</xmp>
then the first EOL is inserted in the code, and it looks awful.
It can be solved by removing that EOL
<xmp><div>Lorem ipsum</div>
<p>Hello</p>
</xmp>
but then it looks bad in the source. I used to solve it with wrapping <div>, but recently I figured out a nice CSS3 rule, I hope it also helps somebody:
xmp { margin: 5px 0; padding: 0 5px 5px 5px; background: #CCC; }
xmp:before { content: ""; display: block; height: 1em; margin: 0 -5px -2em -5px; }
This looks better.
If you have jQuery enabled you can use an escapeXml function and not have to worry about escaping arrows or special characters.
<pre>
${fn:escapeXml('
<!-- all your code -->
')};
</pre>
<code> tag is the good way because <xmp> and <pre> tags not support line wrapping
echo '<code>' . htmlspecialchars("<div><b>hello world</b></div>") . '</code>';

Regex: Extracting readable (non-code) text and URLs from HTML documents

I am creating an application that will take a URL as input, retrieve the page's html content off the web and extract everything that isn't contained in a tag. In other words, the textual content of the page, as seen by the visitor to that page. That includes 'masking' out everything encapsuled in <script></script>, <style></style> and <!-- -->, since these portions contain text that is not enveloped within a tag (but is best left alone).
I have constructed this regex:
(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>)
It correctly selects all the content that i want to ignore, and only leaves the page's text contents. However, that means that what I want to extract won't show up in the match collection (I am using VB.Net in Visual Studio 2010).
Is there a way to "invert" the matching of a whole document like this, so that I'd get matches on all the text strings that are left out by the matching in the above regex?
So far, what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group. This works, but I was wondering if it was possible to do it all through regex and just end up with matches on the plain text.
This is supposed to work generically, without knowing any specific tags in the html. It's supposed to extract all text. Additionally, I need to preserve the original html so the page retains all its links and scripts - i only need to be able to extract the text so that I can perform searches and replacements within it, without fear of "renaming" any tags, attributes or script variables etc (so I can't just do a "replace with nothing" on all the matches I get, because even though I am then left with what I need, it's a hassle to reinsert that back into the correct places of the fully functional document).
I want to know if this is at all possible using regex (and I know about HTML Agility Pack and XPath, but don't feel like).
Any suggestions?
Update:
Here is the (regex-based) solution I ended up with: http://www.martinwardener.com/regex/, implemented in a demo web application that will show both the active regex strings along with a test engine which lets you run the parsing on any online html page, giving you parse times and extracted results (for link, url and text portions individually - as well as views where all the regex matches are highlighted in place in the complete HTML document).
what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group.
That's what one would normally do. Or even simpler, replace every match of the markup pattern with and empty string and what you've got left is the stuff you're looking for.
It kind of works, but there seems to be a string here and there that gets picked up that shouldn't be.
Well yeah, that's because your expression—and regex in general—is inadequate to parse even valid HTML, let alone the horrors that are out there on the real web. First tip to look at, if you really want to chase this futile approach: attribute values (as well as text content in general) may contain an unescaped > character.
I would like to once again suggest the benefits of HTML Agility Pack.
ETA: since you seem to want it, here's some examples of markup that looks like it'll trip up your expression.
<a href=link></a> - unquoted
<a href= link></a> - unquoted, space at front matched but then required at back
- very common URL char missing in group
- more URL chars missing in group
<a href=lïnk></a> - IRI
<a href
="link"> - newline (or tab)
<div style="background-image: url(link);"> - unquoted
<div style="background-image: url( 'link' );"> - spaced
<div style="background-image: url('link');"> - html escape
<div style="background-image: ur\l('link');"> - css escape
<div style="background-image: url('link\')link');"> - css escape
<div style="background-image: url(\
'link')"> - CSS folding
<div style="background-image: url
('link')"> - newline (or tab)
and that's just completely valid markup that won't match the right link, not any of the possible invalid markup, markup that shouldn't but does match a link, or any of the many problems with your other technique of splitting markup from text. This is the tip of the iceberg.
Regex is not reliable for retrieving textual contents of HTML documents. Regex cannot handle nested tags. Supposing a document doesn't contain any nested tag, regex still requires every tags are properly closed.
If you are using PHP, for simplicity, I strongly recommend you to use DOM (Document Object Model) to parse/extract HTML documents. DOM library usually exists in every programming language.
If you're looking to extract parts of a string not matched by a regex, you could simply replace the parts that are matched with an empty string for the same effect.
Note that the only reason this might work is because the tags you're interested in removing, <script> and <style> tags, cannot be nested.
However, it's not uncommon for one <script> tag to contain code to programmatically append another <script> tag, in which case your regex will fail. It will also fail in the case where any tag isn't properly closed.
You cannot parse HTML with regular expressions.
Parsing HTML with regular expressions leads to sadness.
I know you're just doing it for fun, but there are so many packages out there than actually do the parsing the right way, AND do it reliably, AND have been tested.
Don't go reinventing the wheel, and doing it a way that is all but guaranteed to frustrate you down the road.
OK, so here's how I'm doing it:
Using my original regex (with the added search pattern for the plain text, which happens to be any text that's left over after the tag searches are done):
(?:(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?P<text>[^<>]*)
Then in VB.Net:
Dim regexText As New Regex("(?:(?:<(?<tag>script|style)[\s\S]*?</\k<tag>>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?<text>[^<>]*)", RegexOptions.IgnoreCase)
Dim source As String = File.ReadAllText("html.txt")
Dim evaluator As New MatchEvaluator(AddressOf MatchEvalFunction)
Dim newHtml As String = regexText.Replace(source, evaluator)
The actual replacing of text happens here:
Private Function MatchEvalFunction(ByVal match As Match) As String
Dim plainText As String = match.Groups("text").Value
If plainText IsNot Nothing AndAlso plainText <> "" Then
MatchEvalFunction = match.Value.Replace(plainText, plainText.Replace("Original word", "Replacement word"))
Else
MatchEvalFunction = match.Value
End If
End Function
Voila. newHtml now contains an exact copy of the original, except every occurrence of "Original word" in the page (as it's presented in a browser) is switched with "Replacement word", and all html and script code is preserved untouched. Of course, one could / would put in a more elaborate replacement routine, but this shows the basic principle. This is 12 lines of code, including function declaration and loading of html code etc. I'd be very interested in seeing a parallel solution, done in DOM etc for comparison (yes, I know this approach can be thrown off balance by certain occurrences of some nested tags quirks - in SCRIPT rewriting - but the damage from that will still be very limited, if any (see some of the comments above), and in general this will do the job pretty darn well).
For Your Information,
Instead of Regex, With JQuery , Its possible to extract text alone from a html markup. For that you can use the following pattern.
$("<div/>").html("#elementId").text()
You can refer this JSFIDDLE

Actionscript3 E4X XML and CSS: Do I really have to use CDATA?

When working with CSS inside of XML such as
<span class="IwuvAS3"></span>
when parsed in flash, if I don't use CDATA like the following:
<![CDATA[<span class="IwuvAS3"></span>]]>
then the parsed data drops down a line for every "<" character it sees.
When parsing the data into a single-line text field, nothing was shown because it was actually down a line. Soon as I wrap it inside of CDATA it works great. I have played with prettyIndent, and as I understand ignoreWhite is true by default.
Is there a way to parse the data without the use of CDATA and keep the implied line breaks out?
EDIT 1 (10/10/08): Thank you, but I am actually looking for a Function or Method. Escaping each is much more cumbersome than using CDATA. The only reason I don't want to use CDATA is that I was taught to stay clear of it. If ActionScript has a method associated to E4X XML handling that will remove the requirement to wrap my XML in CDATA, I would love to know about it.
EDIT 1 (10/15/08): Thanks Philippe! I never would have thought that HTML formatting in Flash is treated as whitespace. The answer was
textField.condenseWhite = true;
<3AS3
Set the TextField's condenseWhite property to true - so only < br/> tags will generate linebreaks.
You could escape the "<" characters (and &, ", >, ', among others) as entities instead.