Qt Regex matches HTML Tag InnerText - html

I have a html file with one <pre>...</pre> tag. What regex is necessary to match all content within the pre's?
QString pattern = "<pre>(.*)</pre>";
QRegExp rx(pattern);
rx.setCaseSensitivity(cs);
int pos = 0;
QStringList list;
while ((pos = rx.indexIn(clipBoardData, pos)) != -1) {
list << rx.cap(1);
pos += rx.matchedLength();
}
list.count() is always 0

HTML is not a regular language, you do not use regular expressions to parse it.
Instead, use QXmlSimpleReader to load the XML, then QXmlQuery to find the PRE node and then extract its contents.

DO NOT PARSE HTML USING Regular Expressions!
Instead, use a real HTML parser, such as this one

i did it using substrings:
int begin = clipBoardData.indexOf("<pre");
int end = clipBoardData.indexOf("</body>");
QString result = data.mid(begin, end-begin);
The result includes the <pre's> but i found out thats even better ;)

I have to agree with the others. Drupal 6.x and older are using regex to do a lot of work on the HTML data. It quickly breaks if you create pages of 64Kb or more. So using a DOM or just indexOf() as you've done is a better much faster solution.
Now, for those interested in knowing more about regex, Qt uses the perl implementation. This means you can use the lazy operator. Your regex would become:
(<pre>.*?</pre>)+
to get each one of the <pre> block in your code (although if you have only one, then the question mark and the plus are not required.) Note that no delimiters at the start and end of the regular expression are required here.
QRegExp re("(<pre>.*?</pre>)+", Qt::CaseInsensitive);
re.indexIn(html_input);
QStringList list = re.capturedTexts();
Now list should have one <pre> tag or more.

Related

Is there an easy way to strip HTML from a QString in Qt?

I have a QString with some HTML in it... is there an easy way to strip the HTML from it? I basically want just the actual text content.
<i>Test:</i><img src="blah.png" /><br> A test case
Would become:
Test: A test case
I'm curious to know if Qt has a string function or utility for this.
QString s = "<i>Test:</i><img src=\"blah.png\" /><br> A test case";
s.remove(QRegExp("<[^>]*>"));
// s == "Test: A test case"
If you don't care about performance that much then QTextDocument does a pretty good job of converting HTML to plain text.
QTextDocument doc;
doc.setHtml( htmlString );
return doc.toPlainText();
I know this question is old, but I was looking for a quick and dirty way to handle incorrect HTML. The XML parser wasn't giving good results.
You may try to iterate through the string using QXmlStreamReader class and extract all text (if you HTML string is guarantied to be well formed XML).
Something like this:
QXmlStreamReader xml(htmlString);
QString textString;
while (!xml.atEnd()) {
if ( xml.readNext() == QXmlStreamReader::Characters ) {
textString += xml.text();
}
}
but I'm unsure that its 100% valid ussage of QXmlStreamReader API since I've used it quite longe time ago and may forget something.
the situation that some html is not quite validate xml make it worse to work it out correctly.
If it's valid xml (or not too bad formated), I think QXmlStreamReader + QXmlStreamEntityResolver might not be bad idea.
Sample code in: https://github.com/ycheng/misccode/blob/master/qt_html_parse/utils.cpp
(this can be a comment, but I still don't have permission to do so)
this answer is for who read this post later and using Qt5 or later. simply escape the html characters using inbuilt functions as below.
QString str="<h1>some hedding </h1>"; // a string containing html tags.
QString esc=str.toHtmlEscaped(); //esc contains the html escaped srring.

Regex for unclosed HTML tags

Does someone have a regex to match unclosed HTML tags? For example, the regex would match the <b> and second <i>, but not the first <i> or the first's closing </i> tag:
<i><b>test<i>ing</i>
Is this too complex for regex? Might it require some recursive, programmatic processing?
I'm sure some regex guru can cobble something together that approximates a solution, but it's a bad idea: HTML isn't regular. Consider either a HTML parser that's capable of identifying such problems, or parsing it yourself.
Yes it requires recursive processing, and potentially quite deep (or a fancy loop of course), it is not going to be done with a regex. You could make a regex that handled a few levels deep, but not one that will work on just any html file. This is because the parser would have to remember what tags are open at any given point in the stream, and regex arent good at that.
Use a SAX parser with some counters, or use a stack with pop off/push on to keep your state. Think about how to code this game to see what I mean about html tag depth. http://en.wikipedia.org/wiki/Tower_of_Hanoi
As #Pesto said, HTML isn't regular, you would have to build html grammar rules, and apply them recursively.
If you are looking to fix HTML programatically, I have used a component called html tidy with considerable success. There are builds for it for most languages (COM+, Dotnet, PHP etc...).
If you just need to fix it manually, I'd recommend a good IDE. Visual Studio 2008 does a good job, so does the latest Dreamweaver.
No, that's to complex for a regular expression. Your problem is equivalent to test an arithmetic expression of proper usage of brackets which needs at least an pushdown automaton to success.
In your case you should split the HTML code in opening tags, closing tags and text nodes (e.g with an regular expression). Store the result in a list. Then you can iterate through node list and push every opening tag onto the stack. If you encounter a closing tag in your node list you must check that the topmost stack entry is a opening tag of the same type. Otherwise you found the html syntax error you looked for.
I've got a case where I am dealing with single, self-contained lines. The following regular expression worked for me: <[^/]+$ which matches a "<" and then anything that's not a "/".
You can use RegEx to identify all the html begin/end elements, and then enumerate with a Stack, Push new elements, and Pop the closing tags. Try this in C# -
public static bool ValidateHtmlTags(string html)
{
string expr = "(<([a-zA-Z]+)\\b[^>]*>)|(</([a-zA-Z]+) *>)";
Regex regex = new Regex(expr, RegexOptions.IgnoreCase);
var stack = new Stack<Tuple<string, string>>();
var result = new StringBuilder();
bool valid = true;
foreach (Match match in regex.Matches(html))
{
string element = match.Value;
string beginTag = match.Groups[2].Value;
string endTag = match.Groups[4].Value;
if (beginTag == "")
{
string previousTag = stack.Peek().Item1;
if (previousTag == endTag)
stack.Pop();
else
{
valid = false;
break;
}
}
else if (!element.EndsWith("/>"))
{
// Write more informative message here if desired
string message = string.Format("Char({0})", match.Index);
stack.Push(new Tuple<string, string>(beginTag, message));
}
}
if (stack.Count > 0)
valid = false;
// Alternative return stack.Peek().Item2 for more informative message
return valid;
}
I suggest using Nokogiri:
Nokogiri::HTML::DocumentFragment.parse(html).to_html

Regex for Encoded HTML

I'd like to create a regex that will match an opening <a> tag containing an href attribute only:
<a href="doesntmatter.com">
It should match the above, but not match when other attributes are added:
<a href="doesntmatter.com" onmouseover="alert('Do something evil with Javascript')">
Normally that would be pretty easy, but the HTML is encoded. So encoding both of the above, I need the regex to match this:
<a href="doesntmatter.com" >
But not match this:
<a href="doesntmatter.com" onmouseover="alert('do something evil with javascript.')" >
Assume all encoded HTML is "valid" (no weird malformed XSS trickery) and assume that we don't need to follow any HTML sanitization best practices. I just need the simplest regex that will match A) above but not B).
Thanks!
The initial regular expression that comes to mind is /<a href=".*?">/; a lazy expression (.*?) can be used to match the string between the quotes. However, as pointed out in the comments, because the regular expression is anchored by a >, it'll match the invalid tag as well, because a match is still made.
In order to get around this problem, you can use atomic grouping. Atomic grouping tells the regular expression engine, "once you have found a match for this group, accept it" -- this will solve the problem of the regex going back and matching the second string after not finding a > a the end of the href. The regular expression with an atomic group would look like:
/<a (?>href=".*?")>/
Which would look like the following when replacing the characters with their HTML entities:
/<a (?>href=".*?")>/
Hey! I had to do a similar thing recently. I recommend decoding the html first then attempt to grab the info you want. Here's my solution in C#:
private string getAnchor(string data)
{
MatchCollection matches;
string pattern = #"<a.*?href=[""'](?<href>.*?)[""'].*?>(?<text>.*?)</a>";
Regex myRegex = new Regex(pattern, RegexOptions.Multiline);
string anchor = "";
matches = myRegex.Matches(data);
foreach (Match match in matches)
{
anchor += match.Groups["href"].Value.Trim() + "," + match.Groups["text"].Value.Trim();
}
return anchor;
}
I hope that helps!
I don't see how matching one is different from the other? You're just looking for exactly what you just wrote, making the portion that is doesntmatter.com the part you capture. I guess matching for anything until " (not "?) can present a problem, but you do it like this in regex:
(?:(?!").)*
It essentially means:
Match the following group 0 or more times
Fail match if the following string is """
Match any character (except new line unless DOTALL is specified)
The complete regular expression would be:
/<a href="(?>(?:[^&]+|(?!").)*)">/s
This is more efficient than using a non-greedy expression.
Credit to Daniel Vandersluis for reminding me of the atomic group! It fits nicely here for the sake of optimization (this pattern can never match if it has to backtrack.)
I also threw in an additional [^&]+ group to avoid repeating the negative look-ahead so many times.
Alternatively, one could use a possessive quantifier, which essentially does the same thing (your regex engine might not support it):
/<a href="(?:[^&]+|(?!").)*+">/s
As you can see it's slightly shorter.

How can I retrieve a collection of values from nested HTML-like elements using RegExp?

I have a problem creating a regular expression for the following task:
Suppose we have HTML-like text of the kind:
<x>...<y>a</y>...<y>b</y>...</x>
I want to get a collection of values inside <y></y> tags located inside a given <x> tag, so the result of the above example would be a collection of two elements ["a","b"].
Additionally, we know that:
<y> tags cannot be enclosed in other <y> tags
... can include any text or other tags.
How can I achieve this with RegExp?
This is a job for an HTML/XML parser. You could do it with regular expressions, but it would be very messy. There are examples in the page I linked to.
I'm taking your word on this:
"y" tags cannot be enclosed in other "y" tags
input looks like: <x>...<y>a</y>...<y>b</y>...</x>
and the fact that everything else is also not nested and correctly formatted. (Disclaimer: If it is not, it's not my fault.)
First, find the contents of any X tags with a loop over the matches of this:
<x[^>]*>(.*?)</x>
Then (in the loop body) find any Y tags within match group 1 of the "outer" match from above:
<y[^>]*>(.*?)</y>
Pseudo-code:
input = "<x>...<y>a</y>...<y>b</y>...</x>"
x_re = "<x[^>]*>(.*?)</x>"
y_re = "<y[^>]*>(.*?)</y>"
for each x_match in input.match_all(x_re)
for each y_match in x_match.group(1).value.match_all(y_re)
print y_match.group(1).value
next y_match
next x_match
Pseudo-output:
a
b
Further clarification in the comments revealed that there is an arbitrary amount of Y elements within any X element. This means there can be no single regex that matches them and extracts their contents.
Short and simple: Use XPath :)
It would help if we knew what language or tool you're using; there's a great deal of variation in syntax, semantics, and capabilities. Here's one way to do it in Java:
String str = "<y>c</y>...<x>...<y>a</y>...<y>b</y>...</x>...<y>d</y>";
String regex = "<y[^>]*+>(?=(?:[^<]++|<(?!/?+x\\b))*+</x>)(.*?)</y>";
Matcher m = Pattern.compile(regex).matcher(str);
while (m.find())
{
System.out.println(m.group(1));
}
Once I've matched a <y>, I use a lookahead to affirm that there's a </x> somewhere up ahead, but there's no <x> between the current position and it. Assuming the pseudo-HTML is reasonably well-formed, that means the current match position is inside an "x" element.
I used possessive quantifiers heavily because they make things like this so much easier, but as you can see, the regex is still a bit of a monster. Aside from Java, the only regex flavors I know of that support possessive quantifiers are PHP and the JGS tools (RegexBuddy/PowerGrep/EditPad Pro). On the other hand, many languages provide a way to get all of the matches at once, but in Java I had to code my own loop for that.
So it is possible to do this job with one regex, but a very complicated one, and both the regex and the enclosing code have to be tailored to the language you're working in.

Regex to Parse Hyperlinks and Descriptions

C#: What is a good Regex to parse hyperlinks and their description?
Please consider case insensitivity, white-space and use of single quotes (instead of double quotes) around the HREF tag.
Please also consider obtaining hyperlinks which have other tags within the <a> tags such as <b> and <i>.
­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­
As long as there are no nested tags (and no line breaks), the following variant works well:
<a\s+href=(?:"([^"]+)"|'([^']+)').*?>(.*?)</a>
As soon as nested tags come into play, regular expressions are unfit for parsing. However, you can still use them by applying more advanced features of modern interpreters (depending on your regex machine). E.g. .NET regular expressions use a stack; I found this:
(?:<a.*?href=[""'](?<url>.*?)[""'].*?>)(?<name>(?><a[^<]*>(?<DEPTH>)|</a>(?<-DEPTH>)|.)+)(?(DEPTH)(?!))(?:</a>)
Source: http://weblogs.asp.net/scottcate/archive/2004/12/13/281955.aspx
See this example from StackOverflow: Regular expression for parsing links from a webpage?
Using The HTML Agility Pack you can parse the html, and extract details using the semantics of the HTML, instead of a broken regex.
I found this but apparently these guys had some problems with it.
Edit: (It works!)
I have now done my own testing and found that it works, I don't know C# so I can't give you a C# answer but I do know PHP and here's the matches array I got back from running it on this:
Text
array(3) { [0]=> string(52) "Text" [1]=> string(15) "pages/index.php" [2]=> string(4) "Text" }
I have a regex that handles most cases, though I believe it does match HTML within a multiline comment.
It's written using the .NET syntax, but should be easily translatable.
Just going to throw this snippet out there now that I have it working..this is a less greedy version of one suggested earlier. The original wouldnt work if the input had multiple hyperlinks. This code below will allow you to loop through all the hyperlinks:
static Regex rHref = new Regex(#"<a.*?href=[""'](?<url>[^""^']+[.]*?)[""'].*?>(?<keywords>[^<]+[.]*?)</a>", RegexOptions.IgnoreCase | RegexOptions.Compiled);
public void ParseHyperlinks(string html)
{
MatchCollection mcHref = rHref.Matches(html);
foreach (Match m in mcHref)
AddKeywordLink(m.Groups["keywords"].Value, m.Groups["url"].Value);
}
Here is a regular expression that will match the balanced tags.
(?:""'[""'].*?>)(?(?>(?)|(?<-DEPTH>)|.)+)(?(DEPTH)(?!))(?:)