Extracting data from HTML files using regular expressions - html

I am trying to extract specific data using a regular expression, but I haven't been able to get the result I want. For example,
in this page
http://mnemonicdictionary.com/wordlist/GREwordlist/startingwith/A
I have to keep only the data which is between,
<div class="row-fluid">
and
<br /> <br /><i class="icon-user"></i>
So I copied the HTML code into Notepad++, enabled "Regular expression" in the Replace dialog, and tried replacing everything that matches
.*<div class="row-fluid">
to delete everything before <div class="row-fluid">
but it is not working at all.
Does anyone know why?
P.S.: I am not using any programming language. I just need to do this on HTML code pasted into Notepad++, not on an actual HTML file.

I would achieve this in several steps.
Step 1.
transform document into one line. find
\r\n
and replace with nothing. (make sure to select "Extended (\n, \r,..)" option in Replace dialog)
Step 2.
find
<div class="row-fluid">
and replace with
\r\n~<div class="row-fluid">
Make sure that the character "~" is not used anywhere in the document. This character will help us delete the unnecessary lines later.
Step 3.
find
<br /> <br /><i class="icon-user"></i>
and replace with
<br /> <br /><i class="icon-user"></i>\r\n
Step 4.
Delete unnecessary lines. Check "Regular expression".
find
^[^~].+$\r\n
and replace with nothing
Step 5.
Now you have only lines that start with
~<div class="row-fluid">
and end with
<br /> <br /><i class="icon-user"></i>
All that is left is to delete these tags.
PS. You can try to record a macro, if you need to do the same task several times.
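If running a small script turns out to be an option after all, the same idea can be sketched in Python. This is only a sketch: the sample markup and the function name extract_blocks are made up for illustration, and it assumes the two markers appear in the page exactly as quoted in the question. It also shows why the original attempt failed: by default "." does not match a line break, so the pattern needs the re.DOTALL flag (Notepad++ has a comparable ". matches newline" checkbox).

```python
import re

# Markers copied from the question; everything else here is illustrative.
START = '<div class="row-fluid">'
END = '<br /> <br /><i class="icon-user"></i>'


def extract_blocks(html):
    """Return the text found between each START/END pair."""
    # re.escape keeps the markers literal; re.DOTALL lets "." cross line breaks,
    # and the non-greedy "(.*?)" stops at the first END after each START.
    pattern = re.escape(START) + r"(.*?)" + re.escape(END)
    return [block.strip() for block in re.findall(pattern, html, flags=re.DOTALL)]


# Hypothetical sample standing in for the real page source.
sample = (
    "<html><body>\n"
    '<div class="row-fluid">\nabate: to lessen\n'
    '<br /> <br /><i class="icon-user"></i> someone\n'
    '<div class="row-fluid">\naberrant: abnormal\n'
    '<br /> <br /><i class="icon-user"></i> someone else\n'
    "</body></html>"
)

print(extract_blocks(sample))  # ['abate: to lessen', 'aberrant: abnormal']
```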

You should consider retrieving the data using XPath. Most languages support it.
There's a great Firefox plugin called XPather that infers the XPath expression when you select a page item.
There's a hacked version that works for newer Firefox versions here
http://jassage.com/xpather-1.4.5b.xpi
To use XPath with Python, consider using http://xmlsoft.org/python.html
Note that XPath may have problems with malformed HTML, so you may also find tidy an interesting option to "clean up" the HTML and get parseable XML.
http://tidy.sourceforge.net/
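As a rough illustration of the XPath idea, here is a sketch that uses Python's standard xml.etree.ElementTree module instead of the libxml2 bindings linked above (a swapped-in library, chosen only so the example runs without extra installs). ElementTree supports just a limited XPath subset, and the markup below is a hypothetical, already well-formed snippet of the kind tidy might produce.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already well-formed markup (for example after running tidy).
xhtml = """<html><body>
<div class="row-fluid">abate: to lessen<br/><i class="icon-user"></i></div>
<div class="other">ignore me</div>
<div class="row-fluid">aberrant: abnormal<br/><i class="icon-user"></i></div>
</body></html>"""

root = ET.fromstring(xhtml)

# Select every <div> whose class attribute is exactly "row-fluid".
rows = root.findall(".//div[@class='row-fluid']")

# div.text is the text that appears before the first child element.
texts = [(div.text or "").strip() for div in rows]
print(texts)  # ['abate: to lessen', 'aberrant: abnormal']
```

ET.fromstring raises a ParseError on malformed input, which is exactly the problem the tidy suggestion is meant to work around.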

IMHO doing it with Notepad++ is difficult. According to this, you need to:
remove all line breaks (since regexps execute on each line of text)
perform the regexp on the whole (1-line) HTML
Either you want to learn regexps, or you want to parse the HTML. Depending on which, the solution differs.
If you want to learn regular expressions, this is (again IMHO) the wrong problem to solve.
If you want to resolve the problem (keep the data between <div> and <i>), then have a look at how to parse HTML/XML. In python you have some great libraries like BeautifulSoup (which can deal with broken html). You can do it with dom parsing or a more interesting solution (and arguably better for your problem) is to use SAX and per-event processing. Since you know that after every <div> you'll get an <i>, you could do a simple stack to push all the content between the two events...
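A minimal sketch of the per-event idea, using Python's built-in html.parser rather than a real SAX library (html.parser is event-based in the same spirit and tolerates sloppy HTML). The class name RowCollector and the sample markup are invented for illustration, and a simple flag plus buffer stands in for the stack mentioned above.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect text between <div class="row-fluid"> and <i class="icon-user">."""

    def __init__(self):
        super().__init__()
        self.collecting = False   # True while we are inside a wanted block
        self.current = []         # text pieces of the block being collected
        self.rows = []            # finished blocks

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "row-fluid":
            # Start event: begin a fresh buffer.
            self.collecting = True
            self.current = []
        elif tag == "i" and attrs.get("class") == "icon-user" and self.collecting:
            # Stop event: store the buffer with whitespace normalised.
            self.rows.append(" ".join("".join(self.current).split()))
            self.collecting = False

    def handle_data(self, data):
        if self.collecting:
            self.current.append(data)


parser = RowCollector()
parser.feed(
    '<p>skip</p><div class="row-fluid">abate:\n to lessen<br /> <br />'
    '<i class="icon-user"></i> by someone</div>'
)
parser.close()
print(parser.rows)  # ['abate: to lessen']
```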

Related

RegEx to substitute tag names, leaving the content and attributes intact

I would like to replace opening and closing tag, leaving the content of tags and its attribute intact.
Here is what I have:
<div class="QText">Text to be kept</div>
to be replaced with
<span class="QText">Text to be kept</span>
I tried this expression which finds all expressions I want but there seems to be no way to replace found expressions.
<div class="QText">(.*?)</div>
Thanks in advance.
I think @AmitJoki's answer will work well enough in certain circumstances, but if you only want to replace div elements when they have an attribute or a specific set of attributes, then you would want to use a regex replacement with backreferences. How you specify and refer to a backreference, unfortunately, depends on your chosen editor. Visual Studio has the most unusual and annoying "flavor" of regex I know of, while Dreamweaver has a fairly typical implementation (both, as well as whatever editor you're using I imagine, do regex replacement; you just have to know the menu item or keystroke to bring up the dialog).
If memory serves, Dreamweaver has replacement options when you hit Ctrl+F, while in Visual Studio you have to hit Ctrl+H, so try those.
Once you get a "Find" and "Replace" box, you would put something like what you have in your last example above: <div class="QText">(.*?)</div> or perhaps <div class="(QText|RText|SText)">(.*?)</div> into your "Find" box, then put something like <span class="QText">\1</span> or <span class="\1">\2</span> in the "Replacement" box. A few utilities might use $1 to refer to a backreference rather than \1, but you'll have to look up the help or experiment to be sure.
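To make the backreference idea concrete, here is a small sketch in Python, where the replacement string uses \1 and \2 just like the "Replacement" box described above. The input string is a made-up example; only the pattern and replacement come from this answer.

```python
import re

# Hypothetical input: one element to convert, one that must stay a div.
html = '<div class="QText">Text to be kept</div> <div class="Other">leave me</div>'

# Group 1 captures the class name, group 2 the element content.
pattern = r'<div class="(QText|RText|SText)">(.*?)</div>'

# \1 and \2 in the replacement refer back to the captured groups.
result = re.sub(pattern, r'<span class="\1">\2</span>', html)
print(result)
# <span class="QText">Text to be kept</span> <div class="Other">leave me</div>
```

Because the pattern names the class values explicitly, divs with any other class are left alone, which is the advantage over a blanket replace of "div".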
If you are using a language to run this expression, you need to tell us which language.
If you are using a specific editor to run this expression, you need to tell us which editor.
...and never forget the prevailing wisdom on regex and HTML
Just replace div.
var s="<div class='QText'>Text to be kept</div>";
alert(s.replace(/div/g,"span"));
Demo: http://jsfiddle.net/9sgvP/
Mark it as answer if it helps ;)
Posted as requested
If it's going to be literal like that, capture what's to be kept, then replace the rest,
Find: <div( class="QText">.*?</)div>
Replace: <span$1span>

Regex extract html source with multiple elements

Before you tell me not to use Regex to parse html, I'm aware of this but my company uses Iconico Data Extractor to extract data from its website, and it allows you to create custom scripts, but it has to be regular expressions in javascript, I am therefore stuck with using RegEx to achieve my goal.
What I need is to take the following example html and extract each line
<b>Item 1</b> Text <br>
<b>Item 2</b> Text <br>
<b>Item 3</b> Text <br>
<p><font color="#000000" face="Arial, Helvetica, sans-serif"><b>Item 4:</b></font></p>
<p><font color="#000000" face="Arial, Helvetica, sans-serif">Detailed Description</font></p>
What I need is to break down each item into an expression that retrieves the whole line complete with tags, exactly as it appears in the HTML. I have tried /<b>*details(.|\s)*?\/a>/gi, which gets me Item 4. But I cannot work out how to get items 1 - 3, as what I require is just the line from <b> to <br>. /<b>*Item 1(.|\s)*?\br>/gi simply does not work, and after hours of playing around with it I'm no further forward. I also need to get rid of the font tags if that's possible. I think it's complicated by the fact that there is a closing </b> in the middle.
Can anyone offer some advice on how to set up the expression? I already know that the general consensus is no to Regex, so no need to go down that route again :)
This is all quite new to me, so I hope I've explained what I'm trying to do.
Thanks in advance
I've used regex to parse HTML before and it worked just fine. I used something like the following. As you can see, there are a lot of ".*?", which means a non-greedy match of any character. Very useful.
What language are you using? You may have to set options to allow parsing of newlines, otherwise it could be treating each line as a separate input.
In Python, add the re.DOTALL option. In PHP, add the s modifier after the closing delimiter (for example /pattern/s).
<b>(.*?)<br>.*?<b>(.*?)<br><b>(.*?)<br><p.*?sans-serif"><b>(.*?)</p>.*?serif">(.*?)</p>
For the purposes of using this with the data extractor, I've done some research on getting data between two keywords and (Item 1:.*?<br>)/gi works brilliantly.
Unfortunately, I've now been told that the tags have to be stripped off from now on, so I need to scratch my head over that one. I'll post a new question if I need help with it.
Thanks so much for responding and trying to help
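For anyone landing here later, this is a rough sketch of both steps (grabbing each item line, then stripping the tags) written in Python rather than in the data extractor's JavaScript, so the syntax will need adapting. The pattern is an assumption based on the sample HTML in the question, not something the tool documents.

```python
import re

# Sample HTML from the question.
html = """<b>Item 1</b> Text <br>
<b>Item 2</b> Text <br>
<b>Item 3</b> Text <br>
<p><font color="#000000" face="Arial, Helvetica, sans-serif"><b>Item 4:</b></font></p>
<p><font color="#000000" face="Arial, Helvetica, sans-serif">Detailed Description</font></p>"""

# The non-greedy ".*?" stops at the first <br>; re.DOTALL lets it cross
# line breaks in case an item is wrapped over several lines.
lines = re.findall(r"<b>Item \d+</b>.*?<br>", html, flags=re.DOTALL)
print(lines[0])  # <b>Item 1</b> Text <br>

# Second step: drop anything that looks like a tag. This is a blunt
# instrument and will break on attribute values that contain ">".
plain = [re.sub(r"<[^>]+>", "", line).strip() for line in lines]
print(plain)  # ['Item 1 Text', 'Item 2 Text', 'Item 3 Text']
```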

selective search & replace of text string in many HTML-Docs

I've got many html-docs that need selective replacement of the <br /> tag in two specific areas in each document (400+).
I wonder how to achieve this goal and need assistance.
In each HTML-document the <br />-tag needs to be replaced only inside the html-tag:
<span property="dc:description" content="xyz1,<br /> xyz2,<br /> xyz3"/>
and also all occurrences of <br /> inside the alt attribute, as in the html-tag
<img src="xyz.jpg" alt="uvw1,<br />uvw2" />
In all other areas of the HTML-Docs the <br />-tag must remain unchanged.
...I gave this some more thought and think the problem described above may be resolved with the aid of a script or a function equipped with start- and stop-signals. This way the script knows at which positions to start looking for the <br />-tag and replace it with a given text-string AND also knows where to stop. Then move on to the next instance in documents that are open in an editor or residing in a given folder.
I am afraid that I am not able to write such a script myself.
Hope someone can provide feedback on how to best accomplish this,
thanks.
OS: Win7-64, Editor: Notepad++
Providing that your HTML files aren't really big, I don't think you need a script for this.
You could just:
Join the files together.
Use a regex replace in Notepad++. For this you need to replace <span([^/]*)<br />(.*)"/> with <span\1NEWTAG\2"/> where NEWTAG is whatever you want to replace the <br /> with. Note that this will only replace the first <br /> it finds each time, so you will need to do this a few times until it finds no more. Therefore if you're replacing with text that contains <br /> itself (which I doubt by the sounds of it), you'll need to modify this a little.
Split the file back up into the originals.
Personally I'd just write a Python script, as it's pretty ace at string manipulation. But I don't know if this is within your scope.
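A sketch of what such a Python script could look like. Several things are assumptions here and should be adjusted: the replacement text (an empty string below), the file encoding, the *.html glob, and the simplification that every double-quoted content="..." or alt="..." attribute is a target, not only the dc:description span. Try it on a copy of the folder first.

```python
import re
from pathlib import Path

REPLACEMENT = ""  # assumption: whatever should stand in for <br />

# Double-quoted content="..." or alt="..." values. The lookbehind avoids
# matching longer attribute names that merely end in "alt" or "content".
ATTR = re.compile(r'(?<![\w-])(content|alt)="([^"]*)"')


def replace_br_in_attributes(html):
    """Replace <br /> only inside content/alt attribute values."""
    def fix(match):
        name, value = match.group(1), match.group(2)
        return '{}="{}"'.format(name, value.replace("<br />", REPLACEMENT))
    return ATTR.sub(fix, html)


def process_folder(folder):
    """Rewrite every .html file in folder in place (assumes UTF-8 files)."""
    for path in Path(folder).glob("*.html"):
        text = path.read_text(encoding="utf-8")
        path.write_text(replace_br_in_attributes(text), encoding="utf-8")


# Hypothetical document fragment based on the examples in the question.
sample = (
    '<p>one<br />two</p>\n'
    '<span property="dc:description" content="xyz1,<br /> xyz2,<br /> xyz3"/>\n'
    '<img src="xyz.jpg" alt="uvw1,<br />uvw2" />'
)
print(replace_br_in_attributes(sample))
```

Unlike the Notepad++ approach, the callback replaces every <br /> inside an attribute in a single pass, so there is no need to repeat the replacement until nothing more is found.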

Scraping HTML with Regex

I can't use any PHP code as the Regex is for a script I purchased (there is just a text box I have to enter the regex into)...
I'm trying to use Regex to scrape contents between the anchors
"<h2>Highlights</h2>" & "</div><div class="FloatClear"></div><div id="SalesMarquee">" within the HTML segment below:
But when I tried this regex, it returns nothing...
<h2\b[^>]*>.*?<\/h2>[( )\t\s]*(.*?)[( )\t\s]*<\/div>
I think it may have something to do with the empty spaces within the HTML source...
Can any Regex gurus give me the magic expression for grabbing everything between any given HTML anchors, like the ones mentioned above (that can also cope with any empty spaces within the HTML source)?
Many thanks
HTML segment
<div id="Highlights">
<h2>Highlights</h2>
<ul>
<li>1234</li>
<li>abc def asdasd asdasd</li>
<li>asdasda as asdasdasdas </li>
<li>asdasd asdasdas asdsad asdasd asa</li>
</ul>
</div>
<div class="FloatClear"></div>
<div id="SalesMarquee">
<div id="SalesMarqueeTemplate" style="display: none;">
In this case, because it's so simple, I think you might be able to pull it off with Regex. Although you could probably construct an example where it will fail, it should work in all normal cases. I suppose in this type of code that wouldn't exactly mean a security risk.
The reason it's not working is because of the dot you use in the middle of the expression. By default, the dot matches anything EXCEPT newline. To test, I used [\W\w] instead, which does work (stupid hack to really match anything).
The clean way is to switch your regex into single-line mode using the s switch. How to do that depends on your framework, but usually it's /<regex>/s.
See http://www.regular-expressions.info/dot.html for more info.
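Here is a quick way to see the difference, sketched in Python, where single-line mode is the re.DOTALL flag. The HTML is a shortened copy of the segment from the question and the pattern is a slightly simplified form of the original one, so treat it as a demonstration rather than a drop-in expression for the purchased script.

```python
import re

# Shortened copy of the HTML segment from the question.
html = """<div id="Highlights">
<h2>Highlights</h2>
<ul>
<li>1234</li>
<li>abc def asdasd asdasd</li>
</ul>
</div>
<div class="FloatClear"></div>
<div id="SalesMarquee">"""

pattern = r"<h2\b[^>]*>.*?</h2>\s*(.*?)\s*</div>"

# Default mode: "." stops at every line break, so the capture cannot span
# the multi-line <ul> block and the search fails.
print(re.search(pattern, html))  # None

# Single-line (DOTALL) mode: "." also matches line breaks.
match = re.search(pattern, html, flags=re.DOTALL)
print(match.group(1))
```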
Don't use regex to scrape HTML.
See here for compelling reasons why.
Use an HTML parser instead - this SO answer suggests using DOMDocument->loadHTML().

Regex: Extracting readable (non-code) text and URLs from HTML documents

I am creating an application that will take a URL as input, retrieve the page's HTML content off the web and extract everything that isn't contained in a tag. In other words, the textual content of the page, as seen by the visitor to that page. That includes 'masking' out everything enclosed in <script></script>, <style></style> and <!-- -->, since these portions contain text that is not enveloped within a tag (but is best left alone).
I have constructed this regex:
(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>)
It correctly selects all the content that I want to ignore, and only leaves the page's text contents. However, that means that what I want to extract won't show up in the match collection (I am using VB.Net in Visual Studio 2010).
Is there a way to "invert" the matching of a whole document like this, so that I'd get matches on all the text strings that are left out by the matching in the above regex?
So far, what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group. This works, but I was wondering if it was possible to do it all through regex and just end up with matches on the plain text.
This is supposed to work generically, without knowing any specific tags in the HTML. It's supposed to extract all text. Additionally, I need to preserve the original HTML so the page retains all its links and scripts. I only need to be able to extract the text so that I can perform searches and replacements within it, without fear of "renaming" any tags, attributes or script variables etc. (so I can't just do a "replace with nothing" on all the matches I get, because even though I am then left with what I need, it's a hassle to reinsert that back into the correct places of the fully functional document).
I want to know if this is at all possible using regex (and I know about HTML Agility Pack and XPath, but don't feel like using them).
Any suggestions?
Update:
Here is the (regex-based) solution I ended up with: http://www.martinwardener.com/regex/, implemented in a demo web application that will show both the active regex strings along with a test engine which lets you run the parsing on any online html page, giving you parse times and extracted results (for link, url and text portions individually - as well as views where all the regex matches are highlighted in place in the complete HTML document).
what I did was to add another alternative at the end, that selects "any sequence that doesn't contain < or >", which then means the leftover text. I named that last bit in a capture group, and when I iterate over the matches, I check for the presence of text in the "text" group.
That's what one would normally do. Or even simpler, replace every match of the markup pattern with an empty string and what you've got left is the stuff you're looking for.
It kind of works, but there seems to be a string here and there that gets picked up that shouldn't be.
Well yeah, that's because your expression—and regex in general—is inadequate to parse even valid HTML, let alone the horrors that are out there on the real web. First tip to look at, if you really want to chase this futile approach: attribute values (as well as text content in general) may contain an unescaped > character.
I would like to once again suggest the benefits of HTML Agility Pack.
ETA: since you seem to want it, here's some examples of markup that looks like it'll trip up your expression.
<a href=link></a> - unquoted
<a href= link></a> - unquoted, space at front matched but then required at back
- very common URL char missing in group
- more URL chars missing in group
<a href=lïnk></a> - IRI
<a href
="link"> - newline (or tab)
<div style="background-image: url(link);"> - unquoted
<div style="background-image: url( 'link' );"> - spaced
<div style="background-image: url('link');"> - html escape
<div style="background-image: ur\l('link');"> - css escape
<div style="background-image: url('link\')link');"> - css escape
<div style="background-image: url(\
'link')"> - CSS folding
<div style="background-image: url
('link')"> - newline (or tab)
and that's just completely valid markup that won't match the right link, not any of the possible invalid markup, markup that shouldn't but does match a link, or any of the many problems with your other technique of splitting markup from text. This is the tip of the iceberg.
Regex is not reliable for retrieving the textual contents of HTML documents. Regex cannot handle nested tags. Even supposing a document doesn't contain any nested tags, regex still requires every tag to be properly closed.
If you are using PHP, for simplicity, I strongly recommend that you use DOM (Document Object Model) to parse/extract HTML documents. A DOM library exists for most programming languages.
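For comparison, here is roughly what the parser route looks like in Python. This sketch uses the standard html.parser module, which is an event-based parser rather than a full DOM, so take it as an illustration of the approach and not as the PHP DOM API. The class name TextExtractor and the sample page are made up.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside <script> or <style>
        self.chunks = []      # non-empty pieces of visible text

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Comments never arrive here; they go to handle_comment instead.
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


extractor = TextExtractor()
extractor.feed(
    "<html><head><style>p {color: red}</style>"
    "<script>var a = '<b>not text</b>';</script></head>"
    "<body><!-- hidden --><p>Hello <a href='x>y'>world</a></p></body></html>"
)
extractor.close()
print(extractor.chunks)  # ['Hello', 'world']
```

Note that the quoted attribute value containing ">" and the tag-like string inside the script are both handled without any special casing, which is where the regex approaches discussed in this thread tend to go wrong.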
If you're looking to extract parts of a string not matched by a regex, you could simply replace the parts that are matched with an empty string for the same effect.
Note that the only reason this might work is because the tags you're interested in removing, <script> and <style> tags, cannot be nested.
However, it's not uncommon for one <script> tag to contain code to programmatically append another <script> tag, in which case your regex will fail. It will also fail in the case where any tag isn't properly closed.
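A short sketch of the replace-with-empty-string idea, in Python, since its re module accepts the (?P<tag>...) syntax of the question's pattern as written. The sample strings are invented. The second call shows one of the failure modes mentioned in this thread: an unescaped ">" inside an attribute value.

```python
import re

# The markup pattern from the question, split over lines for readability.
MARKUP = re.compile(
    r"(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)"
    r"|(?:<!--[\s\S]*?-->)"
    r"|(?:<[\s\S]*?>)",
    re.IGNORECASE,
)

html = (
    "<p>Hello </p><script>var x = 1;</script>"
    "<!-- note --><style>p{}</style><b>world</b>"
)

# Replacing every markup match with "" leaves only the text.
print(MARKUP.sub("", html))  # Hello world

# Failure case: the tag match ends at the ">" inside the attribute value,
# so part of the attribute leaks into the "text".
broken = MARKUP.sub("", '<a title="a > b">link</a>')
print(repr(broken))  # ' b">link'
```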
You cannot parse HTML with regular expressions.
Parsing HTML with regular expressions leads to sadness.
I know you're just doing it for fun, but there are so many packages out there that actually do the parsing the right way, AND do it reliably, AND have been tested.
Don't go reinventing the wheel, and doing it a way that is all but guaranteed to frustrate you down the road.
OK, so here's how I'm doing it:
Using my original regex (with the added search pattern for the plain text, which happens to be any text that's left over after the tag searches are done):
(?:(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?P<text>[^<>]*)
Then in VB.Net:
Dim regexText As New Regex("(?:(?:<(?<tag>script|style)[\s\S]*?</\k<tag>>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>))|(?<text>[^<>]*)", RegexOptions.IgnoreCase)
Dim source As String = File.ReadAllText("html.txt")
Dim evaluator As New MatchEvaluator(AddressOf MatchEvalFunction)
Dim newHtml As String = regexText.Replace(source, evaluator)
The actual replacing of text happens here:
Private Function MatchEvalFunction(ByVal match As Match) As String
    Dim plainText As String = match.Groups("text").Value
    If plainText IsNot Nothing AndAlso plainText <> "" Then
        MatchEvalFunction = match.Value.Replace(plainText, plainText.Replace("Original word", "Replacement word"))
    Else
        MatchEvalFunction = match.Value
    End If
End Function
Voila. newHtml now contains an exact copy of the original, except every occurrence of "Original word" in the page (as it's presented in a browser) is switched with "Replacement word", and all html and script code is preserved untouched. Of course, one could / would put in a more elaborate replacement routine, but this shows the basic principle. This is 12 lines of code, including function declaration and loading of html code etc. I'd be very interested in seeing a parallel solution, done in DOM etc for comparison (yes, I know this approach can be thrown off balance by certain occurrences of some nested tags quirks - in SCRIPT rewriting - but the damage from that will still be very limited, if any (see some of the comments above), and in general this will do the job pretty darn well).
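For readers who don't use VB.Net, the same technique can be sketched in Python. This is a port of the regex approach above, not the DOM-based comparison asked for, and it differs in one small way that is my own choice: the text group uses + instead of * so the pattern never produces empty matches. The function name replace_in_text and the sample HTML are illustrative.

```python
import re

# Same alternation as above; the named group "text" catches the leftovers.
TOKEN = re.compile(
    r"(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)"
    r"|(?:<!--[\s\S]*?-->)"
    r"|(?:<[\s\S]*?>)"
    r"|(?P<text>[^<>]+)",
    re.IGNORECASE,
)


def replace_in_text(html, old, new):
    """Replace old with new in text nodes only; markup is returned unchanged."""
    def evaluator(match):
        text = match.group("text")
        if text is None:
            # A tag, comment, script or style block: keep it as it is.
            return match.group(0)
        return text.replace(old, new)
    return TOKEN.sub(evaluator, html)


# Hypothetical page fragment with the phrase in an attribute, in the
# visible text and inside a script.
html = (
    '<p class="Original word">Original word</p>'
    '<script>var s = "Original word";</script>'
)
print(replace_in_text(html, "Original word", "Replacement word"))
# <p class="Original word">Replacement word</p><script>var s = "Original word";</script>
```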
For Your Information,
Instead of regex, it's possible with jQuery to extract only the text from HTML markup. For that you can use the following pattern.
$("<div/>").html(htmlString).text()  // htmlString: the markup whose text you want
You can refer to this JSFiddle