Are there any ready solutions for string(html) parsing with actionscript? - actionscript-3

I need to parse string entered to textarea for 2 reasons:
Remove all html tags except small number that are allowed;
Check if html tags are positioned correctly: <i><b> wow </i></b> is incorrect.
Are there any 'build-in' solutions or libraries for that deal? Using regExp is not ok for this task - it's very complex and generally bad idea to parse html structure with them.

You can try the unescape function.
Usage : clean_String = unescape( input_String );
Or else...
You can try the decode URI component function.
Usage : clean_String = decodeURIComponent( input_String );
In each case clean_String is an empty string that will be updated by the result of the functions. input_String is the string with the HTML entities that is to be decoded or unescaped etc.

Related

string interpreted incorrectly in angularjs

I have a string in golang as follows.
discount = "("+discount+"% off)"
when passed to html via angularjs it is displayed as follows
(10 %o(MISSING)ff)
Any idea why it is happening?
Thanks in advance.
Something in your HTML rendering process is passing the string through go's fmt.Sprintf or similar. Try escaping the % by doubling it:
discount = "("+discount+"%% off)"
See http://play.golang.org/p/S_GEJXSfnD for a live example.
Looks like you need to escape the string. Try to use this module: http://golang.org/pkg/html/

Is there an easy way to strip HTML from a QString in Qt?

I have a QString with some HTML in it... is there an easy way to strip the HTML from it? I basically want just the actual text content.
<i>Test:</i><img src="blah.png" /><br> A test case
Would become:
Test: A test case
I'm curious to know if Qt has a string function or utility for this.
QString s = "<i>Test:</i><img src=\"blah.png\" /><br> A test case";
s.remove(QRegExp("<[^>]*>"));
// s == "Test: A test case"
If you don't care about performance that much then QTextDocument does a pretty good job of converting HTML to plain text.
QTextDocument doc;
doc.setHtml( htmlString );
return doc.toPlainText();
I know this question is old, but I was looking for a quick and dirty way to handle incorrect HTML. The XML parser wasn't giving good results.
You may try to iterate through the string using QXmlStreamReader class and extract all text (if you HTML string is guarantied to be well formed XML).
Something like this:
QXmlStreamReader xml(htmlString);
QString textString;
while (!xml.atEnd()) {
if ( xml.readNext() == QXmlStreamReader::Characters ) {
textString += xml.text();
}
}
but I'm unsure that its 100% valid ussage of QXmlStreamReader API since I've used it quite longe time ago and may forget something.
the situation that some html is not quite validate xml make it worse to work it out correctly.
If it's valid xml (or not too bad formated), I think QXmlStreamReader + QXmlStreamEntityResolver might not be bad idea.
Sample code in: https://github.com/ycheng/misccode/blob/master/qt_html_parse/utils.cpp
(this can be a comment, but I still don't have permission to do so)
this answer is for who read this post later and using Qt5 or later. simply escape the html characters using inbuilt functions as below.
QString str="<h1>some hedding </h1>"; // a string containing html tags.
QString esc=str.toHtmlEscaped(); //esc contains the html escaped srring.

RegEx to extract all HTML tag attributes including inline JavaScript

I found this useful regex code here while looking to parse HTML tag attributes:
(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?
It works great, but it's missing one key element that I need. Some attributes are event triggers that have inline Javascript code in them like this:
onclick="doSomething(this, 'foo', 'bar');return false;"
Or:
onclick='doSomething(this, "foo", "bar");return false;'
I can't figure out how to get the original expression to not count the quotes from the JS (single or double) while it's nested inside the set of quotes that contain the attribute's value.
I SHOULD add that this is not being used to parse an entire HTML document. It's used as an argument in an older "array to select menu" function that I've updated. One of the arguments is a tag that can append extra HTML attributes to the form element.
I've made an improved function and am deprecating the old... but in case somewhere in the code is a call to the old function, I need it to parse these into the new array format. Example:
// Old Function
function create_form_element($array, $type, $selected="", $append_att="") { ... }
// Old Call
create_form_element($array, SELECT, $selected_value, "onchange=\"something(this, '444');\"");
The new version takes an array of attr => value pairs to create extra tags.
create_select($array, $selected_value, array('style' => 'width:250px;', 'onchange' => "doSomething('foo', 'bar')"));
This is merely a backwards compatibility issue where all calls to the OLD function are routed to the new one, but the $append_att argument in the old function needs to be made into an array for the new one, hence my need to use regex to parse small HTML snippets. If there is a better, light-weight way to accomplish this, I'm open to suggestions.
The problem with your regular expression is that it tries to handle both single and double quotes at the same time. It doesn't support attribute values that contain the other quote. This regex will work better:
(\w+)=("[^<>"]*"|'[^<>']*'|\w+)
following regex will work as per HTML syntax specs available here
http://www.w3.org/TR/html-markup/syntax.html
regex patterns
// valid tag names
$tagname = '[0-9a-zA-Z]+';
// valid attribute names
$attr = "[^\s\\x00\"'>/=\pC]+";
// valid unquoted attribute values
$uqval = "[^\s\"'=><`]*";
// valid single-quoted attribute values
$sqval = "[^'\\x00\pC]*";
// valid double-quoted attribute values
$dqval = "[^\"\\x00\pC]*";
// valid attribute-value pairs
$attrval = "(?:\s+$attr\s*=\s*\"$dqval\")|(?:\s+$attr\s*=\s*'$sqval')|(?:\s+$attr\s*=\s*$uqval)|(?:\s+$attr)";
and the final regex query will be
// start tags + all attr formats
$patt[] = "<(?'starttags'$tagname)(?'tagattrs'($attrval)*)\s*(?'voidtags'[/]?)>";
// end tags
$patt[] = "</(?'endtags'$tagname)\s*>"; // end tag
// full regex pcre pattern
$patt = implode("|", $patt);
// search and match
preg_match_all("#$patt#imuUs",$data,$matches);
hope this helps.
Even better would be to use backreferences, in PHP the regular expression would be:
([a-zA-Z_:][-a-zA-Z0-9_:.]+)=(["'])(.*?)\\2
Where \\2 is a reference to (["'])
Also this regular expression will match attributes containing _, - and :, which are allowed according to W3C, however, this expression wont match attributes which values are not contained in quotes.

Regex for unclosed HTML tags

Does someone have a regex to match unclosed HTML tags? For example, the regex would match the <b> and second <i>, but not the first <i> or the first's closing </i> tag:
<i><b>test<i>ing</i>
Is this too complex for regex? Might it require some recursive, programmatic processing?
I'm sure some regex guru can cobble something together that approximates a solution, but it's a bad idea: HTML isn't regular. Consider either a HTML parser that's capable of identifying such problems, or parsing it yourself.
Yes it requires recursive processing, and potentially quite deep (or a fancy loop of course), it is not going to be done with a regex. You could make a regex that handled a few levels deep, but not one that will work on just any html file. This is because the parser would have to remember what tags are open at any given point in the stream, and regex arent good at that.
Use a SAX parser with some counters, or use a stack with pop off/push on to keep your state. Think about how to code this game to see what I mean about html tag depth. http://en.wikipedia.org/wiki/Tower_of_Hanoi
As #Pesto said, HTML isn't regular, you would have to build html grammar rules, and apply them recursively.
If you are looking to fix HTML programatically, I have used a component called html tidy with considerable success. There are builds for it for most languages (COM+, Dotnet, PHP etc...).
If you just need to fix it manually, I'd recommend a good IDE. Visual Studio 2008 does a good job, so does the latest Dreamweaver.
No, that's to complex for a regular expression. Your problem is equivalent to test an arithmetic expression of proper usage of brackets which needs at least an pushdown automaton to success.
In your case you should split the HTML code in opening tags, closing tags and text nodes (e.g with an regular expression). Store the result in a list. Then you can iterate through node list and push every opening tag onto the stack. If you encounter a closing tag in your node list you must check that the topmost stack entry is a opening tag of the same type. Otherwise you found the html syntax error you looked for.
I've got a case where I am dealing with single, self-contained lines. The following regular expression worked for me: <[^/]+$ which matches a "<" and then anything that's not a "/".
You can use RegEx to identify all the html begin/end elements, and then enumerate with a Stack, Push new elements, and Pop the closing tags. Try this in C# -
public static bool ValidateHtmlTags(string html)
{
string expr = "(<([a-zA-Z]+)\\b[^>]*>)|(</([a-zA-Z]+) *>)";
Regex regex = new Regex(expr, RegexOptions.IgnoreCase);
var stack = new Stack<Tuple<string, string>>();
var result = new StringBuilder();
bool valid = true;
foreach (Match match in regex.Matches(html))
{
string element = match.Value;
string beginTag = match.Groups[2].Value;
string endTag = match.Groups[4].Value;
if (beginTag == "")
{
string previousTag = stack.Peek().Item1;
if (previousTag == endTag)
stack.Pop();
else
{
valid = false;
break;
}
}
else if (!element.EndsWith("/>"))
{
// Write more informative message here if desired
string message = string.Format("Char({0})", match.Index);
stack.Push(new Tuple<string, string>(beginTag, message));
}
}
if (stack.Count > 0)
valid = false;
// Alternative return stack.Peek().Item2 for more informative message
return valid;
}
I suggest using Nokogiri:
Nokogiri::HTML::DocumentFragment.parse(html).to_html

Qt Regex matches HTML Tag InnerText

I have a html file with one <pre>...</pre> tag. What regex is necessary to match all content within the pre's?
QString pattern = "<pre>(.*)</pre>";
QRegExp rx(pattern);
rx.setCaseSensitivity(cs);
int pos = 0;
QStringList list;
while ((pos = rx.indexIn(clipBoardData, pos)) != -1) {
list << rx.cap(1);
pos += rx.matchedLength();
}
list.count() is always 0
HTML is not a regular language, you do not use regular expressions to parse it.
Instead, use QXmlSimpleReader to load the XML, then QXmlQuery to find the PRE node and then extract its contents.
DO NOT PARSE HTML USING Regular Expressions!
Instead, use a real HTML parser, such as this one
i did it using substrings:
int begin = clipBoardData.indexOf("<pre");
int end = clipBoardData.indexOf("</body>");
QString result = data.mid(begin, end-begin);
The result includes the <pre's> but i found out thats even better ;)
I have to agree with the others. Drupal 6.x and older are using regex to do a lot of work on the HTML data. It quickly breaks if you create pages of 64Kb or more. So using a DOM or just indexOf() as you've done is a better much faster solution.
Now, for those interested in knowing more about regex, Qt uses the perl implementation. This means you can use the lazy operator. Your regex would become:
(<pre>.*?</pre>)+
to get each one of the <pre> block in your code (although if you have only one, then the question mark and the plus are not required.) Note that no delimiters at the start and end of the regular expression are required here.
QRegExp re("(<pre>.*?</pre>)+", Qt::CaseInsensitive);
re.indexIn(html_input);
QStringList list = re.capturedTexts();
Now list should have one <pre> tag or more.