RegEx matching for HTML and non-HTML URLs

I'm trying to get all URLs from this text, both absolute and relative, but I can't find the right regular expression. My expression matches more than I would like: it is picking up HTML tags and other information that I do not want.
Attempt
(\w*.)(\\\/){1,}(.*)(?![^"])
Input
<div class=\"loader\">\n <div class=\"loaderImage\"><img src=\"\/c\/Community\/Rating\/img\/loader.gif\" \/><\/div>\n <\/div>\n<\/div>\n<\/div><\/span><\/span>\n
<a title=\"Avengers\" href=\"\/pt\/movie\/Avengers\/57689\" >Avengers<\/a> <\/div>\n
<img title=\"\" alt=\"\" id=\"145793\" src=\"https:\/\/images04-cdn.google.com\/movies\/74932\/74932_02\/previews\/2\/128\/top_1_307x224\/74932_02_01.jpg\" class=\"tlcImageItem img\" width=\"307\" height=\"224\" \/>
pageLink":"\/pt\/videos\/\/updates\/1\/0\/Category\/0","previousPage":"\/pt\/videos\/\/updates\/1\/0\/Category\/0","nextUrl":"\/pt\/videos\/\/updates\/2\/0\/Category\/0","method":"updates","type":"scenes","callbackJs"
<span class=\"value\">4<\/span>\n <\/div>\n <\/div>\n <div class=\"loader\">\n <div class=\"loaderImage\"><img src=\"\/c\/Community\/Rating\/img\/loader.gif\" \/><\/div>\n <\/div>\n<\/div>\n<\/div><\/span><\/span>
Demo

As has been noted in the comments, RegEx may not really be the best tool for this problem. However, if you wish to practice or you really have to, you can match exactly what sits between the "" where your URLs are present. You can bound the match on the left using src, href, or any other fixed components that you may have. You can simply list them in the first group (), separated by |.
RegEx 1 for HTML URLs
This RegEx may not be the right solution, but it might give you an idea of how to approach the problem using RegEx:
(src=|href=)(\\")([a-zA-Z\\\/0-9\.\:_-]+)(")
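It creates four groups, which makes it simpler to update, and the $3 group should hold your desired URLs. You can add any other characters that your URLs might contain to the character class in the third group. Note that the captured text still contains the escaping backslashes from the input.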
RegEx 2 for both HTML and non-HTML URLs
For capturing other non-HTML URLs, you can update it similar to this RegEx:
(src=\\|href=\\|pageLink\x22:|previousPage\x22:|nextUrl\x22:)(")([a-zA-Z\\\/0-9\.\:_-]+)(")
where \x22 stands for ", and you can replace it with a literal " if you prefer. I used \x22 only to make it easier to see the quotes that your target URLs sit between.
The second RegEx also has four groups, where the target group is $3. You can also simplify it or make it DRY, if you wish.
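To make the second RegEx concrete, here is a minimal Python sketch that runs it over a shortened copy of the input from the question. The sample string, the extract_urls helper, and the clean-up step (dropping the trailing backslash and turning \/ into /) are my own assumptions for illustration, not part of the original answer.

```python
import re

# Shortened sample of the escaped input from the question (an assumption:
# only three of the original lines are reproduced here).
text = r'''<img src=\"\/c\/Community\/Rating\/img\/loader.gif\" \/>
<a title=\"Avengers\" href=\"\/pt\/movie\/Avengers\/57689\" >Avengers<\/a>
pageLink":"\/pt\/videos\/\/updates\/1\/0\/Category\/0","nextUrl":"\/pt\/videos\/\/updates\/2\/0\/Category\/0"'''

# Same four groups as RegEx 2, with \x22 written as a literal double quote.
pattern = re.compile(
    r'(src=\\|href=\\|pageLink":|previousPage":|nextUrl":)(")([a-zA-Z\\/0-9.:_-]+)(")'
)

def extract_urls(s):
    """Return group 3 of every match, with the JSON-style escaping removed."""
    urls = []
    for m in pattern.finditer(s):
        raw = m.group(3)
        # For src/href the capture ends with the backslash that escapes the
        # closing quote; drop it, then unescape the slashes.
        urls.append(raw.rstrip("\\").replace("\\/", "/"))
    return urls

for url in extract_urls(text):
    print(url)
```

The title attribute is skipped because it is not listed in the first group, which is the point of bounding the match on the left.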

Related

RegEx to substitute tag names, leaving the content and attributes intact

I would like to replace opening and closing tag, leaving the content of tags and its attribute intact.
Here is what I have:
<div class="QText">Text to be kept</div>
to be replaced with
<span class="QText">Text to be kept</span>
I tried this expression which finds all expressions I want but there seems to be no way to replace found expressions.
<div class="QText">(.*?)</div>
Thanks in advance.
I think @AmitJoki's answer will work well enough in certain circumstances, but if you only want to replace div elements when they have an attribute or a specific set of attributes, then you would want to use a regex replacement with backreferences. How you specify and refer to a backreference, unfortunately, depends upon your chosen editor. Visual Studio has the most unusual and annoying "flavor" of regex I know of, while Dreamweaver has a fairly typical implementation. Both of them, and I imagine whatever editor you're using, can do regex replacement; you just have to know the menu item or keystroke that brings up the dialog.
If memory serves, Dreamweaver has replacement options when you hit Ctrl+F, while in Visual Studio you have to hit Ctrl+H, so try those.
Once you get a "Find" and "Replace" box, you would put something like what you have in your last example above: <div class="QText">(.*?)</div> or perhaps <div class="(QText|RText|SText)">(.*?)</div> into your "Find" box, then put something like <span class="QText">\1</span> or <span class="\1">\2</span> in the "Replacement" box. A few utilities might use $1 to refer to a backreference rather than \1, but you'll have to look up the help or experiment to be sure.
If you are using a language to run this expression, you need to tell us which language.
If you are using a specific editor to run this expression, you need to tell us which editor.
...and never forget the prevailing wisdom on regex and HTML
Just replace div.
var s="<div class='QText'>Text to be kept</div>";
alert(s.replace(/div/g,"span"));
Demo: http://jsfiddle.net/9sgvP/
Mark it as answer if it helps ;)
Posted as requested
If it's going to be literal like that, capture what's to be kept, then replace the rest:
Find: <div( class="QText">.*?</)div>
Replace: <span$1span>
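For anyone running this from a language rather than an editor, here is a small Python sketch of the backreference replacement described in the answers above. Python's re module is my own choice for illustration (the question never says which tool is in use); it happens to use \1 and \2 in the replacement string.

```python
import re

html = '<div class="QText">Text to be kept</div>'

# Group 1 captures the class name, group 2 the tag content; both are put
# back unchanged while the tag name is switched from div to span.
find = r'<div class="(QText|RText|SText)">(.*?)</div>'
replace = r'<span class="\1">\2</span>'

result = re.sub(find, replace, html)
print(result)
```

Only div elements with one of the listed classes are touched, which is what the plain "replace div" approach cannot guarantee.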

regex to ignore duplicate matches

I'm using an application to search this website that I don't have control of right this moment and was wondering if there is a way to ignore duplicate matches using only regex.
Right now I use this to retrieve the image srcs from the page's source code:
<span> <img id="imgProduct.*? src="/(.*?)" alt="
from this
<span> <img id="imgProduct_1" class="SmPrdImg selected"
onclick="(some javascript);" src="the_src_I_want1.jpg" alt="woohee"> </span>
<span> <img id="imgProduct_2" class="SmPrdImg selected"
onclick="(some javascript);" src="the_src_I_want2.jpg" alt="woohee"> </span>
<span> <img id="imgProduct_3" class="SmPrdImg selected"
onclick="(some javascript);" src="the_src_I_want3.jpg" alt="woohee"> </span>
The only problem is that the exact same code listed above is duplicated much further down in the source. Is there a way to ignore or delete the duplicates using only regex?
Your pattern's not very good; it's way too specific to your exact source code as it currently exists. As @Truth commented, if that changes, you'll break your pattern. I'd recommend something more like this:
<img[^>]*src=['"]([^'"]*)['"]
That will match the contents of any src attribute inside any <img> tag, no matter how much your source code changes.
To prevent duplicates with regex, you'll need lookahead, and this is likely to be very slow. I do not recommend using regex for this. This is just to show that you could, if you had to. The pattern you would need is something like this (I tested this using Notepad++'s regex search, which is based on PCRE and more robust than JavaScript's, but I'm reasonably sure that JavaScript's regex parser can handle this).
<img[^>]*src=['"]([^'"]*)['"](?!(?:.|\s)*<img[^>]*src=['"]\1['"])
You'll then get a match for the last instance of every src.
The Breakdown
For illustration, here's how the pattern works:
<img[^>]*src=['"]([^'"]*)['"]
This makes sure that we are inside a <img> tag when src comes up, and then makes sure we match only what is inside the quotes (which can be either single or double quotes; since neither is a legal character in a filename anyway we don't have to worry about mixing quote types or escaped quotes).
(?!
(?:
.
|
\s
)*
<img[^>]*src=['"]\1['"]
)
The (?! starts a negative lookahead: we are requiring that the following pattern cannot be matched after this point.
Then (?:.|\s)* matches any character or any whitespace. This is because JavaScript's . will not match a newline, while \s will. Mostly, I was lazy and didn't want to write out a pattern for any possible line ending, so I just used \s. The *, of course, means we can have any number of these. That means that the following (still part of the negative lookahead) cannot be found anywhere in the rest of the file. The (?: instead of ( means that this parenthetical isn't going to be remembered for backreferences.
That bit is <img[^>]*src=['"]\1['"]. This is very similar to the initial pattern, but instead of capturing the src with ([^'"]*), we're referencing the previously-captured src with \1.
Thus the pattern is saying "match any src in an img that does not have any img with the same src anywhere in the rest of the file," which means you only get the last instance of each src and no duplicates.
If you want to remove all instances of any img whose src appears more than once, I think you're out of luck, by the way. JavaScript does not support lookbehind, and the overwhelming majority of regex engines that do wouldn't allow such a complicated lookbehind anyway.
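As a rough check of the breakdown above, here is a Python sketch of the same negative-lookahead idea. The sample markup is made up, and I have swapped (?:.|\s)* for [\s\S]*, which means the same thing ("any character, including newlines") without the overlapping alternation; treat that swap as my assumption, not the answer's pattern.

```python
import re

# Made-up page source in which two of the srcs are repeated further down.
html = '''<img id="a" src="one.jpg" alt="x">
<img id="b" src="two.jpg" alt="x">
<img id="c" src="one.jpg" alt="x">
<img id="d" src="two.jpg" alt="x">
<img id="e" src="three.jpg" alt="x">'''

# The basic pattern: capture the src value of an <img> tag in group 1.
IMG_SRC = r'''<img[^>]*src=['"]([^'"]*)['"]'''

# Reject the match if the same src (\1) appears in any later <img> tag.
LAST_ONLY = re.compile(IMG_SRC + r'''(?![\s\S]*<img[^>]*src=['"]\1['"])''')

unique = LAST_ONLY.findall(html)
print(unique)
```

Each src is reported once, at its last occurrence, so the result follows the order of the last occurrences rather than the first ones.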
I wouldn't work too hard to make them unique with regex; just do that in PHP after the preg match, using array_unique:
$pattern = '~<span> <img id="imgProduct.*? src="/(.*?)" alt="~is';
$match = preg_match_all($pattern, $html, $matches);
if ($match)
{
    $matches = array_unique($matches[1]);
}
If you are using JavaScript, then you'd need to use another function instead of array_unique, check PHPJS:
http://phpjs.org/functions/array_unique:346
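The same "match first, de-duplicate afterwards" idea can be sketched in Python as well. This is only an illustration under my own assumptions: the unique_srcs name is hypothetical, and it uses the more general <img> pattern from the other answer rather than the asker's original one.

```python
import re

def unique_srcs(html):
    """Collect every img src, then drop repeats while keeping first-seen order."""
    srcs = re.findall(r'''<img[^>]*src=['"]([^'"]*)['"]''', html)
    # dict.fromkeys keeps one key per distinct value, in insertion order.
    return list(dict.fromkeys(srcs))

sample = (
    '<span> <img id="imgProduct_1" src="a.jpg" alt="x"> </span>'
    '<span> <img id="imgProduct_2" src="b.jpg" alt="x"> </span>'
    '<span> <img id="imgProduct_1" src="a.jpg" alt="x"> </span>'
)
print(unique_srcs(sample))
```

Unlike the lookahead version, this keeps the first occurrence of each src and only scans the document once.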

How to prevent link search from spilling across tags?

I have a local web site whose pages contain hyperlinks of various classes and would like to know how to prevent search results from spilling across several tags. (I need to do a batch modification of the address of a particular link type.)
E.g., my page may contain lists of links such as
Best solution:<br>
AAA<br> but see also
BBB<br> and
CCC<br>.
Now when I try to search the site for only the links of class "zzz" using the regex search term
<a href="+[].html" class="zzz">
my results include long strings such as
AAA<br> but see also BBB<br> and <a href="ccc.html" class="zzz">
What has happened is that the search engine (Funduc Search & Replace, if this helps) finds the <a href= of the first link (aaa.html), the matching class of the third link (ccc.html), and includes everything in between.
What expression must I use to ensure only the link of the file with the correct class, and nothing else, appears in the search result?
E.g.,
<a href="ccc.html" class="zzz">
Thanks for your help.
Use a DOM library (preferably one that supports XPath) instead of a regular expression. Regular expressions are poorly suited to dealing with HTML.
The + quantifier (one or more occurrences) is greedy in most regex engines. That is, [a-z]+ means "match a letter from a to z as many times as possible".
The Perl regex engine has a special quantifier, +?, for lazy matching, so [a-z]+? means "match a..z as few times as possible".
More simply, you can exclude " and > from the characters that are allowed to match:
[^">]+
The regex will look like this:
<a href="([^">]+\.html)" class="zzz">
A more precise Perl version:
<a\s+[^>]*?\bhref\s*=\s*"([^"]+?\.html)"\s*class\s*=\s*"zzz"[^>]*>
Here () is a capture group.
I haven't tried this with Funduc Search and Replace for Windows; I hope it works.
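To see the difference between a greedy "any character" match and the negated character class, here is a small Python sketch. The page text is a guess at what the asker's markup looked like (the aaa/bbb file names and the xxx/yyy classes are hypothetical), and Python's re stands in for Funduc's engine, which may behave differently.

```python
import re

# Hypothetical reconstruction of the list of links from the question.
page = (
    'Best solution:<br>\n'
    '<a href="aaa.html" class="xxx">AAA</a><br> but see also\n'
    '<a href="bbb.html" class="yyy">BBB</a><br> and\n'
    '<a href="ccc.html" class="zzz">CCC</a><br>.\n'
)

# Greedy "anything" between href=" and .html: free to run across tags.
greedy = re.compile(r'<a href=".+\.html" class="zzz">', re.DOTALL)

# Negated class: the file name may not contain " or >, so the match
# cannot leave the attribute value it started in.
bounded = re.compile(r'<a href="([^">]+\.html)" class="zzz">')

print(repr(greedy.search(page).group(0)))   # spills from aaa.html to class="zzz"
print(bounded.findall(page))                # only the link with class zzz
```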

Regex: Extracting readable (non-code) text and URLs from HTML documents

I am creating an application that will take a URL as input, retrieve the page's html content off the web and extract everything that isn't contained in a tag. In other words, the textual content of the page, as seen by the visitor to that page. That includes 'masking' out everything encapsuled in <script></script>, <style></style> and <!-- -->, since these portions contain text that is not enveloped within a tag (but is best left alone).
I have constructed this regex:
(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>)
It correctly selects all the content that I want to ignore, and leaves only the page's text contents. However, that means that what I want to extract won't show up in the match collection (I am using VB.Net in Visual Studio 2010).
Is there a way to "invert" the matching of a whole document like this, so that I'd get matches on all the text strings that are left out by the matching in the above regex?
So far, what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group. This works, but I was wondering if it was possible to do it all through regex and just end up with matches on the plain text.
This is supposed to work generically, without knowing any specific tags in the html. It's supposed to extract all text. Additionally, I need to preserve the original html so the page retains all its links and scripts. I only need to be able to extract the text so that I can perform searches and replacements within it, without fear of "renaming" any tags, attributes or script variables etc. (so I can't just do a "replace with nothing" on all the matches I get, because even though I am then left with what I need, it's a hassle to reinsert that back into the correct places of the fully functional document).
I want to know if this is at all possible using regex (and I know about HTML Agility Pack and XPath, but don't feel like using them).
Any suggestions?
Update:
Here is the (regex-based) solution I ended up with: http://www.martinwardener.com/regex/, implemented in a demo web application that will show both the active regex strings along with a test engine which lets you run the parsing on any online html page, giving you parse times and extracted results (for link, url and text portions individually - as well as views where all the regex matches are highlighted in place in the complete HTML document).
what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group.
That's what one would normally do. Or even simpler, replace every match of the markup pattern with an empty string and what you've got left is the stuff you're looking for.
It kind of works, but there seems to be a string here and there that gets picked up that shouldn't be.
Well yeah, that's because your expression—and regex in general—is inadequate to parse even valid HTML, let alone the horrors that are out there on the real web. First tip to look at, if you really want to chase this futile approach: attribute values (as well as text content in general) may contain an unescaped > character.
I would like to once again suggest the benefits of HTML Agility Pack.
ETA: since you seem to want it, here's some examples of markup that looks like it'll trip up your expression.
<a href=link></a> - unquoted
<a href= link></a> - unquoted, space at front matched but then required at back
- very common URL char missing in group
- more URL chars missing in group
<a href=lïnk></a> - IRI
<a href
="link"> - newline (or tab)
<div style="background-image: url(link);"> - unquoted
<div style="background-image: url( 'link' );"> - spaced
<div style="background-image: url('link');"> - html escape
<div style="background-image: ur\l('link');"> - css escape
<div style="background-image: url('link\')link');"> - css escape
<div style="background-image: url(\
'link')"> - CSS folding
<div style="background-image: url
('link')"> - newline (or tab)
and that's just completely valid markup that won't match the right link, not any of the possible invalid markup, markup that shouldn't but does match a link, or any of the many problems with your other technique of splitting markup from text. This is the tip of the iceberg.
Regex is not reliable for retrieving the textual contents of HTML documents. Regex cannot handle nested tags. Even supposing a document doesn't contain any nested tags, regex still requires that every tag be properly closed.
If you are using PHP, for simplicity, I strongly recommend that you use the DOM (Document Object Model) to parse HTML documents and extract data from them. A DOM library is usually available in every programming language.
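As a rough illustration of the parser-based route, here is a sketch using Python's standard html.parser module. Note that this is an event-based parser rather than a full DOM, and the TextExtractor and visible_text names are my own; it is meant to show the idea, not to be a complete solution.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # greater than 0 while inside script/style
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments never arrive here; the parser reports them separately.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def visible_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks

sample = ('<p title="a > b">Hello <b>world</b></p>'
          '<script>var x = "<p>no</p>";</script><!-- hidden -->')
print(visible_text(sample))
```

The sample includes an unescaped > inside an attribute value, one of the cases mentioned above that trips up a tag-matching regex.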
If you're looking to extract parts of a string not matched by a regex, you could simply replace the parts that are matched with an empty string for the same effect.
Note that the only reason this might work is because the tags you're interested in removing, <script> and <style> tags, cannot be nested.
However, it's not uncommon for one <script> tag to contain code to programmatically append another <script> tag, in which case your regex will fail. It will also fail in the case where any tag isn't properly closed.
You cannot parse HTML with regular expressions.
Parsing HTML with regular expressions leads to sadness.
I know you're just doing it for fun, but there are so many packages out there that actually do the parsing the right way, AND do it reliably, AND have been tested.
Don't go reinventing the wheel, and doing it a way that is all but guaranteed to frustrate you down the road.
OK, so here's how I'm doing it:
Using my original regex (with the added search pattern for the plain text, which happens to be any text that's left over after the tag searches are done):
(?:(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?P<text>[^<>]*)
Then in VB.Net:
Dim regexText As New Regex("(?:(?:<(?<tag>script|style)[\s\S]*?</\k<tag>>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?<text>[^<>]*)", RegexOptions.IgnoreCase)
Dim source As String = File.ReadAllText("html.txt")
Dim evaluator As New MatchEvaluator(AddressOf MatchEvalFunction)
Dim newHtml As String = regexText.Replace(source, evaluator)
The actual replacing of text happens here:
Private Function MatchEvalFunction(ByVal match As Match) As String
    Dim plainText As String = match.Groups("text").Value
    If plainText IsNot Nothing AndAlso plainText <> "" Then
        MatchEvalFunction = match.Value.Replace(plainText, plainText.Replace("Original word", "Replacement word"))
    Else
        MatchEvalFunction = match.Value
    End If
End Function
Voila. newHtml now contains an exact copy of the original, except every occurrence of "Original word" in the page (as it's presented in a browser) is switched with "Replacement word", and all html and script code is preserved untouched. Of course, one could / would put in a more elaborate replacement routine, but this shows the basic principle. This is 12 lines of code, including function declaration and loading of html code etc. I'd be very interested in seeing a parallel solution, done in DOM etc for comparison (yes, I know this approach can be thrown off balance by certain occurrences of some nested tags quirks - in SCRIPT rewriting - but the damage from that will still be very limited, if any (see some of the comments above), and in general this will do the job pretty darn well).
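For comparison, the same match-evaluator approach can be sketched in Python. This is my own port under stated assumptions, not the author's code: replace_in_text is a hypothetical name, and the text group uses [^<>]+ instead of [^<>]* so that empty matches are never produced. The caveats raised in the other answers still apply to it.

```python
import re

# Markup alternatives first, then a named group for the leftover text.
MARKUP_OR_TEXT = re.compile(
    r'(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)'   # whole script/style element
    r'|(?:<!--[\s\S]*?-->)'                            # comment
    r'|(?:<[\s\S]*?>)'                                 # any other tag
    r'|(?P<text>[^<>]+)',                              # plain text between tags
    re.IGNORECASE,
)

def replace_in_text(html, old, new):
    """Replace old with new in text portions only; markup is copied verbatim."""
    def evaluator(match):
        text = match.group('text')
        if text is not None:
            return text.replace(old, new)
        return match.group(0)
    return MARKUP_OR_TEXT.sub(evaluator, html)

page = ('<p class="word">word <b>word</b></p>'
        '<script>var word = 1;</script><!-- word -->')
print(replace_in_text(page, 'word', 'term'))
```

In the sample, the class attribute, the script variable, and the comment all contain "word" and are left alone; only the two visible occurrences change.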
For your information:
Instead of regex, it's possible to extract the text alone from HTML markup with jQuery. For that you can use the following pattern, where htmlString holds the markup:
$("<div/>").html(htmlString).text()
You can refer to this JSFiddle.

Variable order regex syntax

Is there a way to indicate that two or more regex phrases can occur in any order? For instance, XML attributes can be written in any order. Say that I have the following XML:
<a href="home.php" class="link" title="Home">Home</a>
<a href="home.php" title="Home" class="link">Home</a>
How would I write a match that checks the class and title and works for both cases? I'm mainly looking for the syntax that allows me to check in any order, not just matching the class and title as I can do that. Is there any way besides just including both combinations and connecting them with a '|'?
Edit: My preference would be to do it in a single regex as I'm building it programmatically and also unit testing it.
No, I believe the best way to do it with a single RE is exactly as you describe. Unfortunately, it'll get very messy when your XML can have 5 different attributes, giving you a large number of different REs to check.
On the other hand, I wouldn't be doing this with an RE at all since they're not meant to be programming languages. What's wrong with the old fashioned approach of using an XML processing library?
If you're required to use an RE, this answer probably won't help much, but I believe in using the right tools for the job.
Have you considered xpath? (where attribute order doesn't matter)
//a[@class and @title]
Will select both <a> nodes as valid matches. The only caveat being that the input must be xhtml (well formed xml).
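As a sketch of the XPath route, here is how it could look with Python's standard xml.etree.ElementTree. That module only implements a subset of XPath and has no "and" operator, so the two conditions are written as chained predicates instead; the nav wrapper element and the third link are made up for the example.

```python
import xml.etree.ElementTree as ET

# Well-formed sample: the two links from the question plus one that lacks a title.
doc = ET.fromstring(
    '<nav>'
    '<a href="home.php" class="link" title="Home">Home</a>'
    '<a href="home.php" title="Home" class="link">Home</a>'
    '<a href="other.php" class="link">Other</a>'
    '</nav>'
)

# Chained predicates: both must hold, and attribute order in the source is irrelevant.
matches = doc.findall(".//a[@class='link'][@title='Home']")
for a in matches:
    print(a.get("href"), a.text)
```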
You can create a lookahead for each of the attributes and plug them into a regex for the whole tag. For example, the regex for the tag could be
<a\b[^<>]*>
If you're using this on XML you'll probably need something more elaborate. By itself, this base regex will match a tag with zero or more attributes. Then you add a lookahead for each of the attributes you want to match:
(?=[^<>]*\s+class="link")
(?=[^<>]*\s+title="Home")
The [^<>]* lets it scan ahead for the attribute, but won't let it look beyond the closing angle bracket. Matching the leading whitespace here in the lookahead serves two purposes: it's more flexible than matching it in the base regex, and it ensures that we're matching a whole attribute name. Combining them we get:
<a\b(?=[^<>]*\s+class="link")(?=[^<>]*\s+title="Home")[^<>]+>[^<>]+</a>
Of course, I've made some simplifying assumptions for the sake of clarity. I didn't allow for whitespace around the equals signs, for single-quotes or no quotes around the attribute values, or for angle brackets in the attribute values (which I hear is legal, but I've never seen it done). Plugging those leaks (if you need to) will make the regex uglier, but won't require changes to the basic structure.
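Here is a short Python sketch that exercises the combined lookahead regex against both attribute orders. The sample tags are assumptions based on the question, and Python's re is just one engine that supports lookahead; the simplifying assumptions listed above still hold.

```python
import re

# One lookahead per required attribute, then the rest of the tag and its content.
ANY_ORDER = re.compile(
    r'<a\b(?=[^<>]*\s+class="link")(?=[^<>]*\s+title="Home")[^<>]+>[^<>]+</a>'
)

samples = [
    '<a href="home.php" class="link" title="Home">Home</a>',   # class first
    '<a href="home.php" title="Home" class="link">Home</a>',   # title first
    '<a href="home.php" class="link">Home</a>',                # title missing
    '<a href="home.php" class="other" title="Home">Home</a>',  # wrong class
]

for tag in samples:
    print(bool(ANY_ORDER.search(tag)), tag)
```

Adding a third required attribute means adding one more lookahead, not enumerating all six orderings.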
You could use named groups to pull the attributes out of the tag. Run the regex and then loop over the groups doing whatever tests that you need.
Something like this (untested, using .net regex syntax with the \w for word characters and \s for whitespace):
<a ((?<key>\w+)\s?=\s?['"](?<value>\w+)['"])+ />
The easiest way would be to write a regex that picks up the <a .... > part, and then write two more regexes to pull out the class and the title. Although you could probably do it with a single regex, it would be very complicated, and probably a lot more error prone.
With a single regex you would need something like
<a[^>]*((class="([^"]*)")|(title="([^"]*)"))?((title="([^"]*)")|(class="([^"]*)"))?[^>]*>
Which is just a first hand guess without checking to see if it's even valid. Much easier to just divide and conquer the problem.
A first ad hoc solution might be to do the following.
((class|title)="[^"]*?" *)+
This is far from perfect because it allows every attribute to occur more than once. I could imagine that this might be solvable with assertions. But if you just want to extract the attributes, this might already be sufficient.
If you want to match a permutation of a set of elements, you could use a combination of backreferences and zero-width negative lookahead.
Say you want to match any one of these six lines:
123-abc-456-def-789-ghi-0AB
123-abc-456-ghi-789-def-0AB
123-def-456-abc-789-ghi-0AB
123-def-456-ghi-789-abc-0AB
123-ghi-456-abc-789-def-0AB
123-ghi-456-def-789-abc-0AB
You can do this with the following regex:
/123-(abc|def|ghi)-456-(?!\1)(abc|def|ghi)-789-(?!\1|\2)(abc|def|ghi)-0AB/
The backreferences (\1, \2) let you refer to your previous matches, and the zero-width negative lookahead ((?!...)) lets you negate a positional match, saying "don't match if the contained pattern matches at this position". Combining the two makes sure that your match is a legitimate permutation of the given elements, with each possibility occurring only once.
So, for example, in ruby:
input = <<LINES
123-abc-456-abc-789-abc-0AB
123-abc-456-abc-789-def-0AB
123-abc-456-abc-789-ghi-0AB
123-abc-456-def-789-abc-0AB
123-abc-456-def-789-def-0AB
123-abc-456-def-789-ghi-0AB
123-abc-456-ghi-789-abc-0AB
123-abc-456-ghi-789-def-0AB
123-abc-456-ghi-789-ghi-0AB
123-def-456-abc-789-abc-0AB
123-def-456-abc-789-def-0AB
123-def-456-abc-789-ghi-0AB
123-def-456-def-789-abc-0AB
123-def-456-def-789-def-0AB
123-def-456-def-789-ghi-0AB
123-def-456-ghi-789-abc-0AB
123-def-456-ghi-789-def-0AB
123-def-456-ghi-789-ghi-0AB
123-ghi-456-abc-789-abc-0AB
123-ghi-456-abc-789-def-0AB
123-ghi-456-abc-789-ghi-0AB
123-ghi-456-def-789-abc-0AB
123-ghi-456-def-789-def-0AB
123-ghi-456-def-789-ghi-0AB
123-ghi-456-ghi-789-abc-0AB
123-ghi-456-ghi-789-def-0AB
123-ghi-456-ghi-789-ghi-0AB
LINES
# outputs only the permutations
puts input.lines.grep(/123-(abc|def|ghi)-456-(?!\1)(abc|def|ghi)-789-(?!\1|\2)(abc|def|ghi)-0AB/)
For a permutation of five elements, it would be:
/1-(abc|def|ghi|jkl|mno)-
2-(?!\1)(abc|def|ghi|jkl|mno)-
3-(?!\1|\2)(abc|def|ghi|jkl|mno)-
4-(?!\1|\2|\3)(abc|def|ghi|jkl|mno)-
5-(?!\1|\2|\3|\4)(abc|def|ghi|jkl|mno)-6/x
For your example, the regex would be
/<a href="home.php" (class="link"|title="Home") (?!\1)(class="link"|title="Home")>Home<\/a>/
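Since the question mentions building the regex programmatically, here is a hedged Python sketch that generates this kind of permutation pattern for any number of elements. The permutation_pattern helper and its separators argument are hypothetical names of mine, and the sketch assumes that no element is a prefix of another (otherwise the (?!\1) guards could reject valid input).

```python
import re
from itertools import product

def permutation_pattern(elements, separators):
    """Build a regex matching each element exactly once, in any order.

    separators holds len(elements) + 1 literal strings: the text before,
    between, and after the elements.
    """
    alternatives = "|".join(re.escape(e) for e in elements)
    parts = [re.escape(separators[0])]
    for i in range(len(elements)):
        if i > 0:
            # Forbid repeating anything captured by groups 1..i.
            earlier = "|".join(f"\\{j}" for j in range(1, i + 1))
            parts.append(f"(?!{earlier})")
        parts.append(f"({alternatives})")
        parts.append(re.escape(separators[i + 1]))
    return re.compile("".join(parts))

elements = ["abc", "def", "ghi"]
pattern = permutation_pattern(elements, ["123-", "-456-", "-789-", "-0AB"])

# All 27 combinations, as in the Ruby example; only true permutations survive.
lines = [f"123-{a}-456-{b}-789-{c}-0AB" for a, b, c in product(elements, repeat=3)]
matched = [line for line in lines if pattern.fullmatch(line)]
print("\n".join(matched))
```

For the XML case, the elements would be the attribute strings and the separators the fixed text around them.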