Scraping HTML with Regex


I can't use any PHP code as the Regex is for a script I purchased (there is just a text box I have to enter the regex into)...
I'm trying to use Regex to scrape contents between the anchors
"<h2>Highlights</h2>" & "</div><div class="FloatClear"></div><div id="SalesMarquee">" within the HTML segment below:
But when I tried this regex, it returns nothing...
<h2\b[^>]*>.*?<\/h2>[( )\t\s]*(.*?)[( )\t\s]*<\/div>
I think it may have something to do with the empty spaces within the HTML source...
Can any Regex gurus give me the magic expression for grabbing everything between any given HTML anchors, like the ones mentioned above (that can also cope with any empty spaces within the HTML source)?
Many thanks
HTML segment
<div id="Highlights">
<h2>Highlights</h2>
<ul>
<li>1234</li>
<li>abc def asdasd asdasd</li>
<li>asdasda as asdasdasdas </li>
<li>asdasd asdasdas asdsad asdasd asa</li>
</ul>
</div>
<div class="FloatClear"></div>
<div id="SalesMarquee">
<div id="SalesMarqueeTemplate" style="display: none;">

In this case, because it's so simple, I think you might be able to pull it off with Regex. Although you could probably construct an example where it fails, it should work in all normal cases. I suppose in this type of code a failure wouldn't exactly mean a security risk.
The reason it's not working is because of the dot you use in the middle of the expression. By default, the dot matches anything EXCEPT newline. To test, I used [\W\w] instead, which does work (stupid hack to really match anything).
The clean way is to switch your regex into single-line mode using the s switch, which makes the dot match newlines as well. How to do that depends on your framework, but usually it's /<regex>/s.
See http://www.regular-expressions.info/dot.html for more info.
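The effect of the s switch can be checked with a small Python sketch. Python is only an assumption here (the original script takes a bare regex in a text box), and re.DOTALL is Python's name for single-line mode. The pattern is a simplified form of the one in the question, and the HTML is a shortened copy of the posted segment.

```python
import re

# Shortened copy of the HTML segment from the question, line breaks kept.
html = """<div id="Highlights">
<h2>Highlights</h2>
<ul>
<li>1234</li>
<li>abc def asdasd asdasd</li>
</ul>
</div>
<div class="FloatClear"></div>
<div id="SalesMarquee">"""

# Lazily capture everything after the <h2> up to the first </div>,
# leaving surrounding whitespace outside the group.
pattern = r"<h2\b[^>]*>.*?</h2>\s*(.*?)\s*</div>"

# Without DOTALL the dot stops at every newline, so nothing matches.
no_flag = re.search(pattern, html)

# With DOTALL (the "s" switch) the dot also matches newlines.
with_flag = re.search(pattern, html, re.DOTALL)

print(no_flag)
print(with_flag.group(1))
```

Engines that accept inline flags usually allow the same switch to be written as (?s) at the start of the pattern, which may help when only a text box is available.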

Don't use regex to scrape HTML.
See here for compelling reasons why.
Use an HTML parser instead - this SO answer suggests using DOMDocument->loadHTML().
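For comparison, here is a minimal sketch of the parser approach. It uses Python's standard html.parser module rather than the PHP DOMDocument mentioned above, so it illustrates the idea and is not the code that answer refers to. The class name HighlightsParser and its attributes are made up for this example.

```python
from html.parser import HTMLParser


class HighlightsParser(HTMLParser):
    """Collects the text of every <li> inside <div id="Highlights">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0   # greater than 0 while inside the Highlights div
        self.in_li = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.div_depth > 0:
                self.div_depth += 1          # nested div inside Highlights
            elif dict(attrs).get("id") == "Highlights":
                self.div_depth = 1           # entering the target div
        elif tag == "li" and self.div_depth > 0:
            self.in_li = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1
        elif tag == "li":
            self.in_li = False

    def handle_data(self, data):
        if self.in_li:
            self.items[-1] += data


html = """<div id="Highlights">
<h2>Highlights</h2>
<ul>
<li>1234</li>
<li>abc def asdasd asdasd</li>
<li>asdasda as asdasdasdas </li>
<li>asdasd asdasdas asdsad asdasd asa</li>
</ul>
</div>
<div class="FloatClear"></div>
<div id="SalesMarquee">"""

parser = HighlightsParser()
parser.feed(html)
parser.close()
items = [item.strip() for item in parser.items]
print(items)
```

Because the parser tracks tags rather than characters, whitespace and line breaks between elements do not affect the result.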

Related

RegEx matching for HTML and non-HTML URLs

I'm trying to get all URLs from this text, both absolute and relative, but I can't find the right regular expression. My expression matches more than I would like: it also picks up HTML tags and other information that I do not want.
Attempt
(\w*.)(\\\/){1,}(.*)(?![^"])
Input
<div class=\"loader\">\n <div class=\"loaderImage\"><img src=\"\/c\/Community\/Rating\/img\/loader.gif\" \/><\/div>\n <\/div>\n<\/div>\n<\/div><\/span><\/span>\n
<a title=\"Avengers\" href=\"\/pt\/movie\/Avengers\/57689\" >Avengers<\/a> <\/div>\n
<img title=\"\" alt=\"\" id=\"145793\" src=\"https:\/\/images04-cdn.google.com\/movies\/74932\/74932_02\/previews\/2\/128\/top_1_307x224\/74932_02_01.jpg\" class=\"tlcImageItem img\" width=\"307\" height=\"224\" \/>
pageLink":"\/pt\/videos\/\/updates\/1\/0\/Category\/0","previousPage":"\/pt\/videos\/\/updates\/1\/0\/Category\/0","nextUrl":"\/pt\/videos\/\/updates\/2\/0\/Category\/0","method":"updates","type":"scenes","callbackJs"
<span class=\"value\">4<\/span>\n <\/div>\n <\/div>\n <div class=\"loader\">\n <div class=\"loaderImage\"><img src=\"\/c\/Community\/Rating\/img\/loader.gif\" \/><\/div>\n <\/div>\n<\/div>\n<\/div><\/span><\/span>
Demo

As has been commented, solving this problem with RegEx may not really be the best idea. However, if you wish to practice or you really have to, you may do an exact match between the "" where your URLs are present. You can bound them from the left using src, href, or any other fixed components that you may have. You can simply use an | and list them in the first group ().
RegEx 1 for HTML URLs
This RegEx may not be the right solution, but it might give you an idea of how to approach this problem using RegEx:
(src=|href=)(\\")([a-zA-Z\\\/0-9\.\:_-]+)(")
It creates four groups so that it is simple to update, and the $3 group should hold your desired URLs. You can add any chars that your URLs might have to the third group.
RegEx 2 for both HTML and non-HTML URLs
For capturing other non-HTML URLs, you can update it similar to this RegEx:
(src=\\|href=\\|pageLink\x22:|previousPage\x22:|nextUrl\x22:)(")([a-zA-Z\\\/0-9\.\:_-]+)(")
where \x22 stands for ", which you can simply replace with a literal quote. I have used \x22 only so that you can see the quotes between which your target URLs are located.
The second RegEx also has four groups, where the target group is $3. You can also simplify or DRY it, if you wish.
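As a rough check, the second RegEx can be run in Python against a shortened copy of the input. Python is an assumption here, since the question does not name a language. Note that for HTML attributes group 3 also picks up the backslash that escapes the closing quote, so this sketch strips it and then unescapes \/ to /. That cleanup step is an addition and not part of the answer above.

```python
import re

# Shortened copy of the escaped input from the question (raw string, so the
# backslashes stay exactly as posted).
text = r'''<div class=\"loaderImage\"><img src=\"\/c\/Community\/Rating\/img\/loader.gif\" \/><\/div>
<a title=\"Avengers\" href=\"\/pt\/movie\/Avengers\/57689\" >Avengers<\/a>
pageLink":"\/pt\/videos\/\/updates\/1\/0\/Category\/0","nextUrl":"\/pt\/videos\/\/updates\/2\/0\/Category\/0"'''

# RegEx 2 from the answer, with \x22 written as a literal double quote.
pattern = re.compile(
    r'(src=\\|href=\\|pageLink":|previousPage":|nextUrl":)(")([a-zA-Z\\/0-9.:_-]+)(")'
)

urls = []
for match in pattern.finditer(text):
    raw = match.group(3)
    # Drop the trailing backslash captured from \" and unescape the slashes.
    urls.append(raw.rstrip("\\").replace("\\/", "/"))

for url in urls:
    print(url)
```

The title and class attributes are skipped because they are not listed in the first group.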

RegEx to substitute tag names, leaving the content and attributes intact

I would like to replace opening and closing tag, leaving the content of tags and its attribute intact.
Here is what I have:
<div class="QText">Text to be kept</div>
to be replaced with
<span class="QText">Text to be kept</span>
I tried this expression which finds all expressions I want but there seems to be no way to replace found expressions.
<div class="QText">(.*?)</div>
Thanks in advance.

I think @AmitJoki's answer will work well enough in certain circumstances, but if you only want to replace div elements when they have an attribute or a specific set of attributes, then you would want to use a regex replacement with backreferences. How you specify and refer to a backreference, unfortunately, depends upon your chosen editor. Visual Studio has the most unique and annoying "flavor" of regex I know of, while Dreamweaver has a fairly typical implementation. Both of them, and I imagine whatever editor you're using, do regex replacement; you just have to know the menu item or keystroke to bring up the dialog.
If memory serves, Dreamweaver has replacement options when you hit Ctrl+F, while in Visual Studio you have to hit Ctrl+H, so try those.
Once you get a "Find" and "Replace" box, you would put something like what you have in your last example above: <div class="QText">(.*?)</div> or perhaps <div class="(QText|RText|SText)">(.*?)</div> into your "Find" box, then put something like <span class="QText">\1</span> or <span class="\1">\2</span> in the "Replacement" box. A few utilities might use $1 to refer to a backreference rather than \1, but you'll have to look up the help or experiment to be sure.
If you are using a language to run this expression, you need to tell us which language.
If you are using a specific editor to run this expression, you need to tell us which editor.
...and never forget the prevailing wisdom on regex and HTML
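To see the backreference replacement in action outside an editor, here is a small Python sketch. Python is an assumption, as the question names no language or editor, and the second div in the sample string is invented to show that non-matching elements are left alone.

```python
import re

html = '<div class="QText">Text to be kept</div> <div class="Other">untouched</div>'

# Group 1 captures the class name, group 2 the element content.
pattern = r'<div class="(QText|RText|SText)">(.*?)</div>'

# \1 and \2 in the replacement refer back to the captured groups.
result = re.sub(pattern, r'<span class="\1">\2</span>', html)
print(result)
```

Python's re module uses \1 in replacement strings; tools that use $1 instead would need the replacement written as <span class="$1">$2</span>.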

Just replace div.
var s="<div class='QText'>Text to be kept</div>";
alert(s.replace(/div/g,"span"));
Demo: http://jsfiddle.net/9sgvP/
Mark it as answer if it helps ;)

Posted as requested
If it's going to be literal like that, capture what's to be kept, then replace the rest,
Find: <div( class="QText">.*?</)div>
Replace: <span$1span>
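The difference between this targeted replacement and replacing every occurrence of "div" shows up when the kept text itself contains that word. The Python sketch below is illustrative only: the sample string is invented, and \1 is Python's spelling of the $1 used above.

```python
import re

# Invented sample in which the text content also contains the word "div".
html = '<div class="QText">Keep this div text</div>'

# Replacing every "div" also rewrites the text content.
naive = html.replace("div", "span")

# Capturing the part to keep and rewriting only the tag names leaves it intact.
targeted = re.sub(r'<div( class="QText">.*?</)div>', r'<span\1span>', html)

print(naive)
print(targeted)
```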

Extracting data from HTML files using regular expressions

I am trying to extract specific data using a regular expression, but I haven't been able to achieve what I want. For example,
in this page
http://mnemonicdictionary.com/wordlist/GREwordlist/startingwith/A
I have to keep only the data which is between,
<div class="row-fluid">
and
<br /> <br /><i class="icon-user"></i>
So I copied the HTML code into Notepad++, enabled "Regular expression" in the Replace dialog, and tried replacing everything that matches
.*<div class="row-fluid">
to delete everything before <div class="row-fluid">
but it is not working at all.
Does anyone know why?
P.S.: I am not using any programming language. I just need to perform this on HTML code pasted into Notepad++, not on an actual HTML file.

I would achieve this in several steps.
Step 1.
Transform the document into one line. Find
\r\n
and replace with nothing. (make sure to select "Extended (\n, \r,..)" option in Replace dialog)
Step 2.
find
<div class="row-fluid">
and replace with
\r\n~<div class="row-fluid">
Make sure that the character "~" is not used anywhere in the document. This character will help us delete the unnecessary lines later.
Step 3.
find
<br /> <br /><i class="icon-user"></i>
and replace with
<br /> <br /><i class="icon-user"></i>\r\n
Step 4.
Delete unnecessary lines. Check "Regular expression".
find
^[^~].+$\r\n
and replace with nothing
Step 5.
Now you have only lines that start with
~<div class="row-fluid">
and ends with
<br /> <br /><i class="icon-user"></i>
All that is left is to delete these tags.
PS. You can try to record a macro, if you need to do the same task several times.
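For readers who can run a script, the same extraction can be sketched in a few lines of Python. This is a swapped-in alternative to the Notepad++ steps and not part of the answer above. The page content is invented sample data standing in for the real word list, and START and END are the two markers from the question.

```python
import re

START = '<div class="row-fluid">'
END = '<br /> <br /><i class="icon-user"></i>'

# Invented sample standing in for the real page source.
page = """<html><body><div class="nav">menu</div>
<div class="row-fluid">
<h2>abase</h2> lower; degrade
<br /> <br /><i class="icon-user"></i> some user
<div class="row-fluid">
<h2>abash</h2> embarrass
<br /> <br /><i class="icon-user"></i> another user
</body></html>"""

# re.escape makes the markers literal; DOTALL lets .*? span line breaks,
# which replaces steps 1 to 3 of the manual procedure.
pattern = re.escape(START) + r"(.*?)" + re.escape(END)
blocks = [block.strip() for block in re.findall(pattern, page, re.DOTALL)]

for block in blocks:
    print(block)
```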

You should consider retrieving using Xpath. Most languages support it.
There's a great firefox plugin that infers the xpath expression when you select a page item called xpather.
There's a hacked version that works for newer firefox versions here
http://jassage.com/xpather-1.4.5b.xpi
To use Xpath with python, consider using http://xmlsoft.org/python.html
Note that XPath may have problems with malformed HTML, so you may also find tidy an interesting option to "clean up" the HTML and get parseable XML.
http://tidy.sourceforge.net/
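A minimal XPath-style sketch follows. It uses Python's standard xml.etree.ElementTree, which supports only a limited XPath subset, instead of the libxml2 bindings linked above, and it assumes the input is already well-formed (for example after running it through tidy). The markup is invented sample data.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample; real pages usually need cleaning first.
xhtml = """<html><body>
<div class="nav">menu</div>
<div class="row-fluid"><h2>abase</h2><p>lower; degrade</p></div>
<div class="row-fluid"><h2>abash</h2><p>embarrass</p></div>
</body></html>"""

root = ET.fromstring(xhtml)

# Select every div whose class attribute is exactly "row-fluid".
rows = root.findall(".//div[@class='row-fluid']")
words = [row.findtext("h2") for row in rows]
print(words)
```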

IMHO doing it with Notepad++ is difficult. According to this, you need to:
remove all line breaks (since regexps execute on each line of text)
perform the regexp on the whole (1-line) HTML
Either you want to learn regexps, or you want to parse the HTML. The solution differs depending on which.
If you want to learn regular expressions, this is (again IMHO) the wrong problem to solve.
If you want to resolve the problem (keep the data between <div> and <i>), then have a look at how to parse HTML/XML. In python you have some great libraries like BeautifulSoup (which can deal with broken html). You can do it with dom parsing or a more interesting solution (and arguably better for your problem) is to use SAX and per-event processing. Since you know that after every <div> you'll get an <i>, you could do a simple stack to push all the content between the two events...
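The per-event idea can be sketched with Python's standard html.parser, which delivers start-tag and text events much like SAX does. This is an illustration under assumptions: the class name EntryCollector and the sample page are invented, and a simple on/off flag stands in for the stack mentioned above.

```python
from html.parser import HTMLParser


class EntryCollector(HTMLParser):
    """Collects text between <div class="row-fluid"> and <i class="icon-user">."""

    def __init__(self):
        super().__init__()
        self.capturing = False
        self.entries = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "row-fluid":
            self.capturing = True        # start event: open a new entry
            self.entries.append([])
        elif tag == "i" and attrs.get("class") == "icon-user":
            self.capturing = False       # stop event: close the entry

    def handle_data(self, data):
        # Keep only non-blank text seen while an entry is open.
        if self.capturing and data.strip():
            self.entries[-1].append(data.strip())


# Invented sample standing in for the real page source.
page = """<div class="nav">menu</div>
<div class="row-fluid"><h2>abase</h2> lower; degrade
<br /> <br /><i class="icon-user"></i> some user</div>
<div class="row-fluid"><h2>abash</h2> embarrass
<br /> <br /><i class="icon-user"></i> another user</div>"""

collector = EntryCollector()
collector.feed(page)
collector.close()
print(collector.entries)
```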

parse footnote in html document

I need to parse a html document that has been generated by saving a word document as html.
I have been using the HTML agility pack quite successfully but in this instance I figured using regex for this one part might be easier (opinions?)
Word generates the following code when it translates one of its footnotes into html
<a href="#_ftn2" name="_ftnref2" title=""><span
class=MsoFootnoteReference><span class=MsoFootnoteReference><span
style='font-size:10.0pt'>[2]</span></span></span></a>
This output is consistent for every footnote with only the href= and name changing as well as the [2] text.
I need to extract the _ftn2 and [2] elements.
So far I have the following regex which will extract the _ftn2 part into the name group
<a href="#(?<name>_ftn\d).*>(<span class=MsoFootNoteReference>)
I'm having a bit of trouble parsing the second bit with all those span tags.
Is it going to be easier to use regex for this or should I continue to use the HAP for this part?
As an aside, does anyone know why Word generates nested identical span tags?
<span class=MsoFootnoteReference>

If the input follows exactly that format then you can get away with a pretty loose regex. You just need to ignore everything except the parts you want to extract and then employ non-greedy expressions to eat up all the garbage between them:
<a href="#(?<name>_ftn\d+).*?(?<number>\[\d+\]).*?<\/a>
You can use a non-greedy .*? to eat up all the extra markup because nothing in there will match your next \[\d+\] pattern. You don't really need the .*?<\/a> bit on the end, that's mostly for symmetry and a bit of extra paranoia.
Something like this is probably one of the few cases where using regular expressions to rip apart HTML makes sense. You could do this sort of thing with an HTML parser but then you'd be in a nightmare of twisty XPath expressions (all of which look alike), DOM manipulations, or SAX events. And you might even get eaten by a grue.
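A quick way to try the loose pattern is the Python sketch below. Python is an assumption (the question uses the HTML Agility Pack, so the real code is presumably .NET), and Python spells named groups (?P<name>...) rather than (?<name>...). The sketch uses \d+ so that multi-digit footnotes are captured in full, and re.DOTALL because Word's output wraps across lines.

```python
import re

# Footnote markup exactly as posted in the question, line breaks included.
html = """<a href="#_ftn2" name="_ftnref2" title=""><span
class=MsoFootnoteReference><span class=MsoFootnoteReference><span
style='font-size:10.0pt'>[2]</span></span></span></a>"""

pattern = re.compile(
    r'<a href="#(?P<name>_ftn\d+).*?(?P<number>\[\d+\]).*?</a>',
    re.DOTALL,  # let .*? cross the line breaks inside the tags
)

match = pattern.search(html)
print(match.group("name"), match.group("number"))
```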

How can I remove an entire HTML tag (and its contents) by its class using a regex?

I am not very good with Regex but I am learning.
I would like to remove an HTML tag by its class name. This is what I have so far:
<div class="footer".*?>(.*?)</div>
The first .*? is there because the tag might contain other attributes, and the second because it might contain other HTML.
What am I doing wrong? I have tried a lot of variations without success.
Update
The DIV can contain multiple lines, and I am using Perl regex.

As other people said, HTML is notoriously tricky to deal with using regexes, and a DOM approach might be better. E.g.:
use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file( 'yourdocument.html' );
for my $node ( $tree->findnodes( '//*[@class="footer"]' ) ) {
$node->replace_with_content; # delete element, but not the children
}
print $tree->as_HTML;

You will also want to allow for other things before class in the div tag
<div[^>]*class="footer"[^>]*>(.*?)</div>
Also, go case-insensitive. You may need to escape things like the quotes, or the slash in the closing tag. What context are you doing this in?
Also note that HTML parsing with regular expressions can be very nasty, depending on the input. A good point is brought up in an answer below - suppose you have a structure like:
<div>
<div class="footer">
<div>Hi!</div>
</div>
</div>
Trying to build a regex for that is a recipe for disaster. Your best bet is to load the document into a DOM, and perform manipulations on that.
Pseudocode that should map closely to XML::DOM:
document = //load document
divs = document.getElementsByTagName("div");
for(div in divs) {
if(div.getAttributes["class"] == "footer") {
parent = div.getParent();
for(child in div.getChildren()) {
// filter attribute types?
parent.insertBefore(child, div); // new node first, reference node second
}
parent.removeChild(div);
}
}
Here is a perl library, HTML::DOM, and another, XML::DOM
.NET has built-in libraries to handle dom parsing.
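As a runnable counterpart to the pseudocode, here is a sketch with Python's standard xml.dom.minidom. It is not the Perl XML::DOM code the answer points to, it assumes well-formed input, and unlike the pseudocode (which keeps the children) it removes the footer together with its contents, as the question asks. The sample markup is the nested structure from above plus an invented <p> element.

```python
from xml.dom import minidom

# Nested structure from the answer, plus one invented element to keep.
xhtml = """<body><div>
<div class="footer">
<div>Hi!</div>
</div>
<p>kept</p>
</div></body>"""

doc = minidom.parseString(xhtml)

# Iterate over a copy so that removals cannot disturb the loop.
for div in list(doc.getElementsByTagName("div")):
    if div.getAttribute("class") == "footer":
        # Removing the node drops the whole subtree, nested divs included.
        div.parentNode.removeChild(div)

result = doc.documentElement.toxml()
print(result)
```

Because the DOM knows where each element ends, the nested div inside the footer causes no trouble here.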

In Perl you need the /s modifier, otherwise the dot won't match a newline.
That said, using a proper HTML or XML parser to remove unwanted parts of a HTML file is much more appropriate.

<div[^>]*class="footer"[^>]*>(.*?)</div>
Worked for me, but needed to use backslashes before special characters
<div[^>]*class=\"footer\"[^>]*>(.*?)<\/div>

Partly depends on the exact regex engine you are using - which language etc. But one possibility is that you need to escape the quotes and/or the forward slash. You might also want to make it case insensitive.
<div class=\"footer\".*?>(.*?)<\/div>
Otherwise please say what language/platform you are using - .NET, java, perl ...

Try this:
<([^\s]+).*?class="footer".*?>([\s\S]*?)</([^\s]+)>
Your biggest problem is going to be nested tags. For example:
<div class="footer"><b></b></div>
The regexp given would match everything through the </b>, leaving the </div> dangling on the end. You will have to either assume that the tag you're looking for has no nested elements, or you will need to use some sort of parser from HTML to DOM and an XPath query to remove an entire sub-tree.
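The dangling closing tag is easy to reproduce. The Python sketch below uses the simpler div-only pattern that appears elsewhere in this thread, plus an invented sample string; Python itself is an assumption, since the question is about Perl, but the matching behaviour shown is the same.

```python
import re

# Invented sample: a footer div that contains a nested div.
html = '<div class="footer"><div>Hi!</div></div><p>after</p>'

pattern = r'<div[^>]*class="footer"[^>]*>(.*?)</div>'

# The lazy .*? stops at the first </div>, which closes the inner div.
match = re.search(pattern, html, re.DOTALL)

# Deleting the match therefore leaves the footer's own </div> behind.
leftover = re.sub(pattern, "", html, flags=re.DOTALL)

print(match.group(0))
print(leftover)
```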

This will be tricky because of the greediness of regular expressions. (Note that my examples may be specific to Perl, but I know that greediness is a general issue with REs.) If the second quantifier is a greedy .* rather than the non-greedy .*?, it will match as much as possible before the last </div>, so if you have the following:
<div class="SomethingElse"><div class="footer"> stuff </div></div>
The expression will match:
<div class="footer"> stuff </div></div>
which is not likely what you want. The non-greedy .*? stops at the first </div> instead, which only works as long as the footer contains no nested div.

Why not <div class="footer".*?</div>? I'm not a regex guru either, but I don't think you need to specify that last bracket for your open div tag.
