What am I missing in my RegEx expression? - html

So, regex has been the bane of my existence for some time. I feel that I'm on the cusp of understanding it, but I'm just getting very frustrated. In short:
I'm attempting to scrape data from the following website via PHP:
http://magicseaweed.com/Asbury-Park-Surf-Report/857/
I want to extract the bold wave height at the top of the page (at the moment, it reads 3-5). I understand why this works:
preg_match('/<div class="msw-fct-ccd msw-sr-details span3"> <h3> <span>(.*)
<small>ft<\/small> <\/span> <div class="msw-fct-ccr msw-sr-rating">/', $pageMagic,
$height);
But I don't understand why this will not:
preg_match('/<div class="msw-fct-ccd msw-sr-details span3"> <h3> <span>(/d-/d)|(/d)
<small>ft<\/small> <\/span> <div class="msw-fct-ccr msw-sr-rating">/', $pageMagic,
$height);
In my mind, logically speaking, it should be looking for a digit, a dash, then another digit OR just one digit. I tested out regex in http://gskinner.com/RegExr/ and it picked up 3-5. Thank you in advance!

Your slashes are the wrong way around. It should be:
(\d-\d)|(\d)
Incidentally, you can simplify this to:
\d(-\d)?
...but note that this would change the capture groups. I leave the fix for that as an exercise for you :)
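One more thing that may be worth checking alongside the slash fix: inside the full pattern, a bare | separates everything to its left from everything to its right, so the alternation usually needs a group of its own. The following is only a sketch in Python rather than PHP, and the `html` string is a made-up, simplified stand-in for the page markup, not the real page source:

```python
import re

# Hypothetical, simplified stand-in for the scraped markup (assumption).
html = '<h3> <span>3-5 <small>ft</small> </span>'

# \d(?:-\d)? matches one digit, optionally followed by a dash and a digit.
# The outer parentheses capture the whole height as group 1, so the
# surrounding literal text is not split by an alternation.
pattern = re.compile(r'<span>(\d(?:-\d)?) <small>ft</small>')

match = pattern.search(html)
height = match.group(1) if match else None
print(height)
```

The same pattern also captures a single-digit height such as `4`, because the dash-and-digit part is optional.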

Related

Regex Between HTML Tags - VBA

I have a page full of html data that I am scraping from.
There is one occurrence of a "gross amount" field that I am trying to extract.
<h3 id="cart_trans_detail_ach_grossamount_lbl">Gross Amount</h3>
<p id="cart_trans_detail_ach_grossamount_txt">$76.99 USD</p>
All I want to get from this is $76.99 USD
I have tried using Regex Buddy to put something together, but regex is not my strong suit. Even something simple like this: <p id="cart_trans_detail_ach_grossamount_txt">(.*)</p> matches the whole string and not just what is between the tags.
Any ideas?
First of all, using a regex to parse HTML is not recommended; you should use an HTML/XML parsing library instead. But if you really feel the need to use a regular expression for this, what you are missing is the lazy (non-greedy) modifier ? after the * so that your regex stops at the first </p> it finds.
<p id="cart_trans_detail_ach_grossamount_txt">(.*?)</p>
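To see the difference between the greedy and the lazy version, here is a small Python sketch. The first and last paragraphs in `html` are invented for illustration; only the middle one comes from the question:

```python
import re

# The "a" and "b" paragraphs are hypothetical padding (assumption);
# the gross amount paragraph is the one from the question.
html = ('<p id="a">first</p>'
        '<p id="cart_trans_detail_ach_grossamount_txt">$76.99 USD</p>'
        '<p id="b">last</p>')

prefix = r'<p id="cart_trans_detail_ach_grossamount_txt">'

# Greedy: .* runs on to the LAST </p> in the input.
greedy = re.search(prefix + r'(.*)</p>', html).group(1)

# Lazy: .*? stops at the FIRST </p> after the opening tag.
lazy = re.search(prefix + r'(.*?)</p>', html).group(1)

print(greedy)
print(lazy)
```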
Try this pattern:
(?<=grossamount_txt">\$)(\d*\.?\d*) USD
It works in Python and PHP, and it should also work in Java.
group(1) gives you back only the amount, without anything else.
The first set of parentheses is a positive lookbehind, which checks that the amount is preceded by the string grossamount_txt">$.
The second set of parentheses matches the numeric amount, which may be an integer or a decimal number.
Finally, the last part of the pattern is the literal " USD".
You can test it here
https://www.regex101.com/#python
where you can also find some more detailed explanation.
Here about how lookaround works
http://www.regular-expressions.info/lookaround.html
Hope it helps.
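As a quick check of the lookbehind pattern, this Python sketch runs it against the two lines of HTML from the question. Python requires a fixed-width lookbehind, which this one is:

```python
import re

# The snippet from the question.
html = ('<h3 id="cart_trans_detail_ach_grossamount_lbl">Gross Amount</h3>\n'
        '<p id="cart_trans_detail_ach_grossamount_txt">$76.99 USD</p>')

# The lookbehind asserts that the match starts right after grossamount_txt">$
# without making that text part of the match.
pattern = re.compile(r'(?<=grossamount_txt">\$)(\d*\.?\d*) USD')

match = pattern.search(html)
whole = match.group(0)   # amount plus the trailing " USD"
amount = match.group(1)  # amount only
print(whole, '|', amount)
```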

RegEx to substitute tag names, leaving the content and attributes intact

I would like to replace opening and closing tag, leaving the content of tags and its attribute intact.
Here is what I have:
<div class="QText">Text to be kept</div>
to be replaced with
<span class="QText">Text to be kept</span>
I tried this expression which finds all expressions I want but there seems to be no way to replace found expressions.
<div class="QText">(.*?)</div>
Thanks in advance.
I think @AmitJoki's answer will work well enough in certain circumstances, but if you only want to replace div elements that have an attribute or a specific set of attributes, you will want a regex replacement with backreferences. How you specify and refer to a backreference, unfortunately, depends on your chosen editor. Visual Studio has the most unusual and annoying "flavor" of regex I know of, while Dreamweaver has a fairly typical implementation. Both of them, and I imagine whatever editor you're using, can do regex replacement; you just have to know the menu item or keystroke that brings up the dialog.
If memory serves, Dreamweaver has replacement options when you hit Ctrl+F, while in Visual Studio you have to hit Ctrl+H, so try those.
Once you get a "Find" and "Replace" box, you would put something like what you have in your last example above: <div class="QText">(.*?)</div> or perhaps <div class="(QText|RText|SText)">(.*?)</div> into your "Find" box, then put something like <span class="QText">\1</span> or <span class="\1">\2</span> in the "Replacement" box. A few utilities might use $1 to refer to a backreference rather than \1, but you'll have to look up the help or experiment to be sure.
If you are using a language to run this expression, you need to tell us which language.
If you are using a specific editor to run this expression, you need to tell us which editor.
...and never forget the prevailing wisdom on regex and HTML
Just replace div.
var s="<div class='QText'>Text to be kept</div>";
alert(s.replace(/div/g,"span"));
Demo: http://jsfiddle.net/9sgvP/
Mark it as answer if it helps ;)
Posted as requested
If it's going to be literal like that, capture what's to be kept, then replace the rest:
Find: <div( class="QText">.*?</)div>
Replace: <span$1span>
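If the replacement is run from a language rather than an editor, the backreference approach might look like this in Python, which uses \1 and \2 in the replacement string. The second div in `html` is an invented example of an element that should be left alone:

```python
import re

# The QText div is from the question; the "Other" div is a hypothetical
# extra element (assumption) to show that non-matching divs stay untouched.
html = ('<div class="QText">Text to be kept</div>'
        '<div class="Other">untouched</div>')

# Group 1 captures the class name, group 2 the element content.
pattern = re.compile(r'<div class="(QText|RText|SText)">(.*?)</div>')

# \1 and \2 put the captured class and content back into the new tag.
result = pattern.sub(r'<span class="\1">\2</span>', html)
print(result)
```

This only holds for simple, non-nested markup: a div nested inside a matching div would end the lazy match at the inner </div>.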

Regex extract html source with multiple elements

Before you tell me not to use regex to parse HTML: I'm aware of this, but my company uses Iconico Data Extractor to extract data from its website. It allows you to create custom scripts, but they have to be regular expressions in JavaScript, so I am stuck with using regex to achieve my goal.
What I need is to take the following example html and extract each line
<b>Item 1</b> Text <br>
<b>Item 2</b> Text <br>
<b>Item 3</b> Text <br>
<p><font color="#000000" face="Arial, Helvetica, sans-serif"><b>Item 4:</b></font></p>
<p><font color="#000000" face="Arial, Helvetica, sans-serif">Detailed Description</font></p>
What I need is an expression for each item that retrieves the whole line complete with tags, exactly as it appears in the HTML. I have tried /<b>*details(.|\s)*?\/a>/gi, which gets me Item 4. But I cannot work out how to get items 1 - 3, as what I require is just the line from <b> to <br>. /<b>*Item 1(.|\s)*?\br>/gi simply does not work, and after hours of playing around with it I'm no further forward. I also need to get rid of the font tags, if that's possible. I think it's complicated by the fact that there is a closing </b> in the middle.
Can anyone offer some advice on how to set up the expression? I already know that the general consensus is to say no to regex, so no need to go down that route again :)
This is all quite new to me, so I hope I've explained what I'm trying to do.
Thanks in advance
I've used regex to parse HTML before and it worked just fine. I used something like the following. As you can see, there are a lot of ".*?", which means a non-greedy match of any character. Very useful.
What language are you using? You may have to set an option so that the dot also matches newlines; otherwise it could be treating each line as a separate input.
In Python, add the re.DOTALL option. In PHP, add the s modifier after the closing delimiter.
<b>(.*?)<br>.*?<b>(.*?)<br><b>(.*?)<br><p.*?sans-serif"><b>(.*?)</p>.*?serif">(.*?)</p>
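For the three simple items, a shorter pattern applied repeatedly may be enough. This is a Python sketch rather than the JavaScript the extractor expects, and the `wrapped` string is an invented example of an item that spans two lines:

```python
import re

# Items 1-3 from the question.
html = ('<b>Item 1</b> Text <br>\n'
        '<b>Item 2</b> Text <br>\n'
        '<b>Item 3</b> Text <br>\n')

# One match per item: from <b>Item N</b> lazily up to the next <br>.
item_pattern = r'<b>Item \d+</b>.*?<br>'
lines = re.findall(item_pattern, html, flags=re.DOTALL | re.IGNORECASE)
print(lines)

# Hypothetical item whose text wraps onto a second line (assumption).
wrapped = '<b>Item 1</b> Text\ncontinued <br>'

# Without DOTALL the dot cannot cross the newline, so nothing matches.
without_dotall = re.findall(item_pattern, wrapped)
with_dotall = re.findall(item_pattern, wrapped, flags=re.DOTALL)
print(without_dotall, with_dotall)
```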
For the purposes of using this with the data extractor, I've done some research on getting data between two keywords and (Item 1:.*?<br>)/gi works brilliantly.
Unfortunately, I've now been told that the tags have to be stripped off from now on, so I need to scratch my head over that one. I'll post a new question if I need help with it.
Thanks so much for responding and trying to help

Does a move html entity exist

I'm looking for an html entity code for a move symbol (with left right up down arrows). The same one that appears after cursor: move; is applied in css. Does anyone know if this is possible? I can't find it anywhere.
The closest I found was this: ✥ (&#x2725; or &#10021;).
&#x2725; ✥
&#10021; ✥
&#10149; ➥
&#9735; ☇
&uarr;&darr; ↑↓
&rlarr; ⇄
As mentioned in the other answers, there is no exact match. Here are some that might be close enough for some.
↔ (&harr; or &#8596;) and ↕ (&varr; or &#8597;) are available, however there is no up/down/left/right arrow symbol in the arrow subgroup.
No, there is no such entity, and there is not even a character like that. An interesting way to check whether the symbol you are looking for exists as a character is to visit http://shapecatcher.com/ and draw it. It's not an exact science, of course.
It is generally pointless to look for HTML entity codes. Those codes add nothing to the expressive power of the language: you can use character references &#... instead, or enter the characters directly if you are using UTF-8, as you normally should. The real question, after identifying a character, is whether it is supported in fonts and what to do about this. Whether there is an HTML entity for it is really irrelevant.
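To illustrate that named entities and numeric character references are just different spellings of the same characters, here is a small Python sketch using the standard library's `html.unescape`. The list of references is chosen for illustration only:

```python
import html

# Named, decimal and hexadecimal references for a few arrow-like characters.
references = ['&harr;', '&#8596;', '&#x2194;',
              '&varr;', '&#8597;',
              '&#x2725;', '&#10021;']

# Each reference decodes to a single Unicode character.
decoded = [html.unescape(ref) for ref in references]

for ref, ch in zip(references, decoded):
    print(f'{ref:10} U+{ord(ch):04X} {ch}')
```

Whether the decoded character actually renders still depends on the fonts available to the browser.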
To the spec! Check out table 8.5, "named character references". A quick search of the word "arrow" doesn't turn up exactly what you're looking for but with over 2200 named entities, maybe you can find something that looks "close enough".
I made a pen for this purpose, feel free to use it : Create a Move cursor with html + css
<div id="wrapper">
<div class="cp-drag">
</div>
</div>
// see codepen for css please
I came to the same conclusion, no easy character, but my fix was to create a div with both the up/down arrow and side/side arrow in the same space. A bit finicky, but if you had to have it...
<div style="font-size:200%;display:inline-block;position:relative;top:-11px;left:-15px;">
<div style="display:inline-block;">↔</div>
<div style="display:inline-block;left:-32px;position:relative;">↕</div>
</div>

Scraping HTML with Regex

I can't use any PHP code as the Regex is for a script I purchased (there is just a text box I have to enter the regex into)...
I'm trying to use Regex to scrape contents between the anchors
"<h2>Highlights</h2>" & "</div><div class="FloatClear"></div><div id="SalesMarquee">" within the HTML segment below:
But when I tried this regex, it returns nothing...
<h2\b[^>]*>.*?<\/h2>[( )\t\s]*(.*?)[( )\t\s]*<\/div>
I think it may have something to do with the empty spaces within the HTML source...
Can any Regex gurus give me the magic expression for grabbing everything between any given HTML anchors, like the ones mentioned above (that can also cope with any empty spaces within the HTML source)?
Many thanks
HTML segment
<div id="Highlights">
<h2>Highlights</h2>
<ul>
<li>1234</li>
<li>abc def asdasd asdasd</li>
<li>asdasda as asdasdasdas </li>
<li>asdasd asdasdas asdsad asdasd asa</li>
</ul>
</div>
<div class="FloatClear"></div>
<div id="SalesMarquee">
<div id="SalesMarqueeTemplate" style="display: none;">
In this case, because it's so simple, I think you might be able to pull it off with Regex. Although you could probably construct an example where it will fail, it should work in all normal cases. I suppose that in this type of code a failure wouldn't exactly mean a security risk.
The reason it's not working is because of the dot you use in the middle of the expression. By default, the dot matches anything EXCEPT newline. To test, I used [\W\w] instead, which does work (stupid hack to really match anything).
The clean way is to switch your regex into single-line mode using the s switch. How to do that depends on your framework, but usually it's /<regex>/s.
See http://www.regular-expressions.info/dot.html for more info.
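The effect of single-line mode can be sketched in Python, where the equivalent flag is `re.DOTALL`. The HTML is a shortened copy of the segment from the question, and the odd character class `[( )\t\s]*` is simplified to `\s*` here, which is an assumption about what was intended:

```python
import re

# Shortened copy of the HTML segment from the question.
html = '''<div id="Highlights">
<h2>Highlights</h2>
<ul>
<li>1234</li>
<li>abc def asdasd asdasd</li>
</ul>
</div>
<div class="FloatClear"></div>
<div id="SalesMarquee">'''

regex = r'<h2\b[^>]*>.*?</h2>\s*(.*?)\s*</div>'

# Default mode: the dot stops at newlines, so the group cannot span lines.
no_flag = re.search(regex, html)

# DOTALL (single-line mode): the dot matches newlines as well.
with_flag = re.search(regex, html, flags=re.DOTALL)
content = with_flag.group(1)

print(no_flag)
print(content)
```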
Don't use regex to scrape HTML.
See here for compelling reasons why.
Use an HTML parser instead - this SO answer suggests using DOMDocument->loadHTML().