Regex extract html source with multiple elements - html

Before you tell me not to use regex to parse HTML: I'm aware of that advice, but my company uses Iconico Data Extractor to extract data from its website. It allows you to create custom scripts, but they have to be regular expressions in JavaScript, so I am stuck with using regex to achieve my goal.
What I need is to take the following example HTML and extract each line:
<b>Item 1</b> Text <br>
<b>Item 2</b> Text <br>
<b>Item 3</b> Text <br>
<p><font color="#000000" face="Arial, Helvetica, sans-serif"><b>Item 4:</b></font></p>
<p><font color="#000000" face="Arial, Helvetica, sans-serif">Detailed Description</font></p>
What I need is to break down each item into an expression that retrieves the whole line complete with tags, exactly as it appears in the HTML. I have tried /<b>*details(.|\s)*?\/a>/gi, which gets me Item 4. But I cannot work out how to get items 1 - 3, as what I require is just the line from <b> to <br>. /<b>*Item 1(.|\s)*?\br>/gi simply does not work, and after hours of playing around with it I'm no further forward. I also need to get rid of the font tags, if that's possible. I think it's complicated by the fact that there is a closing </b> in the middle.
Can anyone offer some advice on how to set up the expression? I already know that the general consensus is no to regex, so no need to go down that route again :)
This is all quite new to me, so I hope I've explained what I'm trying to do.
Thanks in advance

I've used regex to parse HTML before and it worked just fine. I used something like the following. As you can see, there are a lot of ".*?" sequences, each of which is a non-greedy match of any character. Very useful.
What language are you using? You may have to set an option that lets "." match newlines; otherwise the engine could be treating each line as a separate input.
In Python, add the re.DOTALL option. In PHP, add the s modifier after the closing delimiter of the pattern.
<b>(.*?)<br>.*?<b>(.*?)<br>.*?<b>(.*?)<br>.*?<p.*?sans-serif"><b>(.*?)</p>.*?serif">(.*?)</p>
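Since the question is limited to JavaScript regular expressions, here is a minimal sketch of the non-greedy idea in that flavor. It assumes the sample HTML is available as a string; the variable names are made up for illustration. [\s\S]*? stands in for "any character, including newlines", and each match stops at the first <br>. One likely reason the /<b>*Item 1(.|\s)*?\br>/gi attempt fails is that \b is a word-boundary assertion in JavaScript, so \br> cannot match inside <br>; the literal <br> needs no escaping.

```javascript
// Sample HTML from the question, held in a string for illustration.
const html = [
  '<b>Item 1</b> Text <br>',
  '<b>Item 2</b> Text <br>',
  '<b>Item 3</b> Text <br>',
  '<p><font color="#000000" face="Arial, Helvetica, sans-serif"><b>Item 4:</b></font></p>',
  '<p><font color="#000000" face="Arial, Helvetica, sans-serif">Detailed Description</font></p>',
].join('\n');

// Every item line that ends in <br>: from "<b>Item N" up to the first "<br>".
// [\s\S]*? is a non-greedy match of any character, newlines included.
const itemLines = html.match(/<b>Item \d+[\s\S]*?<br>/gi);

// A single named item, returned with its tags intact.
const itemOne = html.match(/<b>Item 1[\s\S]*?<br>/i)[0];

console.log(itemLines);
console.log(itemOne);
```

Item 4 has no <br> after it, so this pattern skips it; it would need its own expression that ends at </p> instead.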

For the purposes of using this with the data extractor, I've done some research on getting data between two keywords and (Item 1:.*?<br>)/gi works brilliantly.
Unfortunately, I've now been told that the tags have to be stripped off from now on, so I need to scratch my head over that one. I'll post a new question if I need help with it.
Thanks so much for responding and trying to help
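For the tag stripping mentioned above, one common approach is to delete anything that looks like a tag once the line has been extracted. This is only a rough sketch: stripTags is a made-up helper name, and it assumes no attribute value contains a ">" character.

```javascript
// Hypothetical helper: remove tags from an already extracted fragment.
function stripTags(fragment) {
  return fragment
    .replace(/<[^>]+>/g, '') // drop everything from "<" to the next ">"
    .replace(/\s+/g, ' ')    // collapse runs of whitespace left behind
    .trim();
}

const plainItem = stripTags('<b>Item 1</b> Text <br>');
const plainHeading = stripTags(
  '<p><font color="#000000" face="Arial, Helvetica, sans-serif"><b>Item 4:</b></font></p>'
);

console.log(plainItem);
console.log(plainHeading);
```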

Related

use regex to select words between html tags

Thanks for visiting my question. I'm trying to match sentences between tags, for example:
<h1> Most flavors, except the ones discussed below, have only one
metacharacter that matches both before a word and after a word. <p>
This is because any position between characters can never be both at
the start and at the end of a word. Using only one operator makes
things easier for you.<p>Word boundaries, as described above, are
supported by most regular expression flavors.
I'm trying to get 10 words from each tag.
output:
Most flavors, except the ones discussed below, have only one
This is because any position between characters can never be
Word boundaries, as described above, are supported by most regular
I find it's so tricky. Thanks for your help here!!!
As has already been linked in the comments, one of the most well-known answers of all time on this site is about how using regular expressions to parse HTML is probably not a good idea. For a more detailed and balanced overview of when it is and isn't a good idea to do so, check out this question as well.
But briefly, the answer depends on what you're trying to do. It's likely that you'll be better off finding an HTML/XML-parsing library for whatever language you're using, and extracting the text with that.
I'm a bit confused as to what your task actually is, as your code as shown isn't valid HTML, since <h1> at least requires a closing tag. But if you do need to use regex to do this, you will want to look at word boundaries and interval operators for limiting to 10, and perhaps lookbehind (or just capture groups) to match the tag without returning it.
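If regex really is the only option, the interval-operator idea can be sketched roughly as follows. This assumes the input looks exactly like the sample in the question (tags with no ">" inside them, words separated by whitespace); the variable names are illustrative only.

```javascript
// Sample input from the question, including its unclosed tags.
const html = `<h1> Most flavors, except the ones discussed below, have only one
metacharacter that matches both before a word and after a word. <p>
This is because any position between characters can never be both at
the start and at the end of a word. Using only one operator makes
things easier for you.<p>Word boundaries, as described above, are
supported by most regular expression flavors.`;

// After each tag, capture at most 10 words. The interval {0,9} allows up to
// nine "word plus whitespace" units, followed by one final word.
const firstTen = /<[^>]+>\s*((?:[^\s<]+\s+){0,9}[^\s<]+)/g;

// Group 1 holds the words without the tag; line breaks become single spaces.
const results = [...html.matchAll(firstTen)].map(m => m[1].replace(/\s+/g, ' '));

console.log(results);
```

The capture group plays the role of the lookbehind mentioned above: the tag is matched but only the words are kept.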
But again: if you're trying to parse actual HTML, you'd be better off using an HTML parser to get the tag content, and then getting the first 10 words using string operations. An example in JavaScript, which is a bit of a cheat because the browser gives you the HTML parsing for free, but it makes for an easy example:
// Log the first 10 words of every element inside <body>
for (const tag of document.querySelectorAll('body *')) {
  const words = tag.innerText.trim().split(/\s+/).slice(0, 10);
  console.log(`${tag.tagName}: ${words.join(' ')}`);
}
<h1>This is an h1 tag with a bunch of text in it that is really long</h1>
<p>Here's a p tag with some more text that's really long
<p>Here's a p tag with some more text that's really long
<p>Here's a p tag with some more text that's really long
<p>Here's a p tag with some more text that's really long

Is there a way to automatically add the same HTML tags, over and over, to different pieces of text? Maybe with Microsoft Word or Excel?

I would like to list items in an online shop, and I have certain HTML tags that are the same for every item description.
However, I have over 100 items and I would like to format them all in the same way.
I realize this could be done in CSS, but it's not really possible in my situation for complicated reasons.
Is there a way in Excel or maybe Word, that I can have these HTML tags copied and pasted on the start and end of my descriptions automatically?
Your help would be much appreciated, thank you!
An example of my code :
<p><strong> Designed in Paris.</strong></p>
<p><span style="line-height: 1.6em;"> Scarf from the NEXT collection.</span></p>
<p> This item is stylish yet elegant, and adds a unique tone to your outfit.</p>
<p><span style="line-height: 1.6em;"> Available in several colours.</span></p>
<div align="center">
<hr align="center" size="2" width="100%" /></div>
<p>Material: Silk</p>
<p>Length: ca. 38 cm</p>
<p>Width: ca. 2,5 cm</p>
I recommend using regular expressions.
Step one: Add line numbers to every item description. This is described in the following answer:
Line's number using regex in Notepad++?
Step two: Prepare regex for every line:
Step three: Run regexes for every description (notepad++)
Edit: added samples.
Sample regexes for wrapping first line with paragraph tag:
Find: (^0001.*)
Replace: \1</p>
Find: (^0001)(.*)
Replace: \1<p>\2
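The same wrapping can also be scripted rather than run by hand in an editor. A minimal sketch, assuming each description is plain text with one paragraph per line; wrapLines and the chosen tags are only illustrative, not part of the answer above.

```javascript
// Hypothetical helper: wrap every non-empty line in the given tags.
function wrapLines(text, open, close) {
  // With the m flag, ^ and $ match at each line; .+ skips empty lines.
  return text.replace(/^(.+)$/gm, `${open}$1${close}`);
}

const description = 'Material: Silk\nLength: ca. 38 cm\nWidth: ca. 2,5 cm';
const wrapped = wrapLines(description, '<p>', '</p>');

console.log(wrapped);
```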

RegEx to substitute tag names, leaving the content and attributes intact

I would like to replace the opening and closing tags, leaving the content of the tags and their attributes intact.
Here is what I have:
<div class="QText">Text to be kept</div>
to be replaced with
<span class="QText">Text to be kept</span>
I tried this expression which finds all expressions I want but there seems to be no way to replace found expressions.
<div class="QText">(.*?)</div>
Thanks in advance.
I think @AmitJoki's answer will work well enough in certain circumstances, but if you only want to replace div elements when they have an attribute or a specific set of attributes, then you would want to use a regex replacement with backreferences. How you specify and refer to a backreference, unfortunately, depends upon your chosen editor. Visual Studio has the most unusual and annoying "flavor" of regex I know of, while Dreamweaver has a fairly typical implementation. Both of them, and I imagine whatever editor you're using, do regex replacement; you just have to know the menu item or keystroke that brings up the dialog.
If memory serves, Dreamweaver has replacement options when you hit Ctrl+F, while in Visual Studio you have to hit Ctrl+H, so try those.
Once you get a "Find" and "Replace" box, you would put something like what you have in your last example above: <div class="QText">(.*?)</div> or perhaps <div class="(QText|RText|SText)">(.*?)</div> into your "Find" box, then put something like <span class="QText">\1</span> or <span class="\1">\2</span> in the "Replacement" box. A few utilities might use $1 to refer to a backreference rather than \1, but you'll have to look up the help or experiment to be sure.
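As an illustration of the backreference approach outside an editor, here is a small JavaScript sketch. JavaScript is one of the flavors that writes backreferences in the replacement as $1 and $2. The second div in the sample string is invented to show that elements with other classes, and the word "div" inside text, are left alone.

```javascript
// Sample input: the first div comes from the question, the second is made up.
const html =
  '<div class="QText">Text to be kept</div><div class="Other">a div about divs</div>';

// $1 is the captured class name, $2 the captured content.
const replaced = html.replace(
  /<div class="(QText|RText|SText)">(.*?)<\/div>/g,
  '<span class="$1">$2</span>'
);

console.log(replaced);
```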
If you are using a language to run this expression, you need to tell us which language.
If you are using a specific editor to run this expression, you need to tell us which editor.
...and never forget the prevailing wisdom on regex and HTML
Just replace div.
var s="<div class='QText'>Text to be kept</div>";
alert(s.replace(/div/g,"span"));
Demo: http://jsfiddle.net/9sgvP/
Mark it as answer if it helps ;)
Posted as requested
If it's going to be literal like that, capture what's to be kept, then replace the rest:
Find: <div( class="QText">.*?</)div>
Replace: <span$1span>

Extracting data from HTML files using regular expressions

I am trying to extract specific data using regular expressions, but I haven't been able to achieve what I want. For example,
in this page
http://mnemonicdictionary.com/wordlist/GREwordlist/startingwith/A
I have to keep only the data which is between,
<div class="row-fluid">
and
<br /> <br /><i class="icon-user"></i>
So I copied the HTML code into Notepad++, enabled regular expressions in the Replace dialog, and tried replacing everything that matches
.*<div class="row-fluid">
to delete everything before <div class="row-fluid">
but it is not working at all.
Does anyone know why?
P.S.: I am not using any programming language; I just need to do this on HTML code pasted into Notepad++, not on an actual HTML file.
I would achieve this in several steps.
Step 1.
Transform the document into one line: find
\r\n
and replace with nothing. (make sure to select "Extended (\n, \r,..)" option in Replace dialog)
Step 2.
find
<div class="row-fluid">
and replace with
\r\n~<div class="row-fluid">
Make sure that the character "~" is not used anywhere in the document. This character will help us delete the unnecessary lines later.
Step 3.
find
<br /> <br /><i class="icon-user"></i>
and replace with
<br /> <br /><i class="icon-user"></i>\r\n
Step 4.
Delete unnecessary lines. Check "Regular expression".
find
^[^~].+$\r\n
and replace with nothing
Step 5.
Now you have only lines that start with
~<div class="row-fluid">
and ends with
<br /> <br /><i class="icon-user"></i>
All that is left is to delete these tags.
PS. You can try to record a macro, if you need to do the same task several times.
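If a scripting language turns out to be an option after all, the five steps above collapse into one non-greedy match between the two markers. The following is a sketch only: it assumes the page source has been pasted into a string, and the sample content is invented rather than taken from the real page. It also hints at a likely reason the original .*<div class="row-fluid"> attempt did not work: by default "." stops at line breaks, which is why [\s\S] is used here.

```javascript
// Invented sample input that mimics the structure described in the question.
const page = [
  '<html><body><div class="nav">menu</div>',
  '<div class="row-fluid">',
  '  <h2>abase</h2> lower; humiliate',
  '<br /> <br /><i class="icon-user"></i> someone',
  '<div class="row-fluid">',
  '  <h2>abash</h2> embarrass',
  '<br /> <br /><i class="icon-user"></i> someone else',
  '</body></html>',
].join('\n');

// Capture whatever sits between the opening marker and the closing marker.
// [\s\S]*? matches across line breaks and stops at the first closing marker.
const between =
  /<div class="row-fluid">([\s\S]*?)<br \/> <br \/><i class="icon-user"><\/i>/g;

const entries = [...page.matchAll(between)].map(m => m[1].trim());

console.log(entries);
```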
You should consider retrieving using Xpath. Most languages support it.
There's a great firefox plugin that infers the xpath expression when you select a page item called xpather.
There's a hacked version that works for newer firefox versions here
http://jassage.com/xpather-1.4.5b.xpi
To use Xpath with python, consider using http://xmlsoft.org/python.html
Notice that Xpath may have problem with malformed html, so you may also find tidy an interesting option to "clean up" the html and get a parseable XML.
http://tidy.sourceforge.net/
IMHO doing it with Notepad++ is difficult. According to this, you need to:
remove all line breaks (since regexps execute on each line of text)
perform the regexp on the whole (1-line) HTML
Either you want to learn regexps, or you want to parse the HTML. Depending on which, the solution differs.
If you want to learn regular expressions, this is (again IMHO) the wrong problem to solve.
If you want to resolve the problem (keep the data between <div> and <i>), then have a look at how to parse HTML/XML. In python you have some great libraries like BeautifulSoup (which can deal with broken html). You can do it with dom parsing or a more interesting solution (and arguably better for your problem) is to use SAX and per-event processing. Since you know that after every <div> you'll get an <i>, you could do a simple stack to push all the content between the two events...

parse footnote in html document

I need to parse a html document that has been generated by saving a word document as html.
I have been using the HTML agility pack quite successfully but in this instance I figured using regex for this one part might be easier (opinions?)
Word generates the following code when it translates one of its footnotes into html
<a href="#_ftn2" name="_ftnref2" title=""><span
class=MsoFootnoteReference><span class=MsoFootnoteReference><span
style='font-size:10.0pt'>[2]</span></span></span></a>
This output is consistent for every footnote with only the href= and name changing as well as the [2] text.
I need to extract the _ftn2 and [2] elements.
So far I have the following regex which will extract the _ftn2 part into the name group
<a href="#(?<name>_ftn\d).*>(<span class=MsoFootNoteReference>)
I'm having a bit of trouble parsing the second bit with all those span tags.
Is it going to be easier to use regex for this or should I continue to use the HAP for this part?
As an aside, does anyone know why Word generates nested identical span tags?
<span class=MsoFootnoteReference>
If the input follows exactly that format then you can get away with a pretty loose regex. You just need to ignore everything except the parts you want to extract and then employ non-greedy expressions to eat up all the garbage between them:
<a href="#(?<name>_ftn\d+).*?(?<number>\[\d+\]).*?<\/a>
You can use a non-greedy .*? to eat up all the extra markup because nothing in there will match your next \[\d+\] pattern. You don't really need the .*?<\/a> bit on the end, that's mostly for symmetry and a bit of extra paranoia.
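To see the idea run end to end, here is a sketch in JavaScript, which happens to share the (?<name>...) named-group syntax. The wordFootnote helper is hypothetical and only rebuilds the markup quoted in the question so that two footnotes can be tested. The sketch uses [\s\S]*? rather than .*? because the generated markup spans several lines (in .NET, RegexOptions.Singleline should have the same effect on "."), and \d+ so that footnotes numbered 10 and up are captured in full.

```javascript
// Hypothetical helper: rebuild the Word markup from the question for footnote n.
function wordFootnote(n) {
  return `<a href="#_ftn${n}" name="_ftnref${n}" title=""><span\n` +
    `class=MsoFootnoteReference><span class=MsoFootnoteReference><span\n` +
    `style='font-size:10.0pt'>[${n}]</span></span></span></a>`;
}

const html = `<p>First${wordFootnote(2)} and second${wordFootnote(13)}.</p>`;

// Skip lazily from the href to the bracketed number, then on to the closing </a>.
const footnote =
  /<a href="#(?<name>_ftn\d+)[\s\S]*?(?<number>\[\d+\])[\s\S]*?<\/a>/g;

const found = [...html.matchAll(footnote)].map(m => [m.groups.name, m.groups.number]);

console.log(found);
```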
Something like this is probably one of the few cases where using regular expressions to rip apart HTML makes sense. You could do this sort of thing with an HTML parser, but then you'd be in a nightmare of twisty XPath expressions (all of which look alike), DOM manipulations, or SAX events. And you might even get eaten by a grue.