Regex for html style attribute - html

I'm trying to write a regex that extracts a single style attribute value. The example below should explain my issue.
source: font-size:11pt;font-color:red;text-align:left;
I want to be able to ask for:
font-size and get back 11pt
font-color and get back red
text-align and get back left
Can someone point me in the right direction?
Thanks
Lee

This question reminded me of a Jeff Atwood blog post, Parsing Html The Cthulhu Way. This isn't exactly the same question, but it's the same sentiment. Don't parse CSS with regular expressions! There are plenty of libraries out there that will do this for you.

Logically you'd want:
[exact phrase] + 1 colon + 0 or more white space characters + 0 or more characters up to the first semicolon or closing quote.
I think this will get you headed in the right direction:
font-size[:][\s]*[^;'"]*
Gotchas:
the closing quote might be single or double, and there may be a valid quote within the value (e.g. quoted background image URLs)
this all depends on the styles not being written in shorthand

// Each match is one "property: value" pair; group 1 is the property, group 2 the value.
var regex = new Regex(@"([\w-]+)\s*:\s*([^;]+)");
var match = regex.Match("font-size:11pt;font-color:red;text-align:left;");
while (match.Success)
{
    var key = match.Groups[1].Value;
    var value = match.Groups[2].Value;
    Console.WriteLine("{0} : {1}", key, value);
    match = match.NextMatch();
}
Edit: This is not supposed to be a 'complete' solution. It probably does the job for 80% of cases, and as ever the last 20% would be orders of magnitude more expensive ;-)
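For comparison, here is a minimal Python sketch of the same key/value idea, collecting the pairs into a dictionary so a single property can be looked up by name. The function name parse_style is made up for illustration, and the pattern is the same "80%" one as above: it will misread values that themselves contain a colon or semicolon (a data URL in a background value, for instance).

```python
import re

# Same idea as the C# pattern above: group 1 is the property, group 2 the value.
DECLARATION = re.compile(r"([\w-]+)\s*:\s*([^;]+)")

def parse_style(style: str) -> dict:
    """Return a {property: value} mapping for a simple inline style string.

    parse_style is a hypothetical helper name, not part of any library.
    """
    return {name: value.strip() for name, value in DECLARATION.findall(style)}

styles = parse_style("font-size:11pt;font-color:red;text-align:left;")
print(styles["font-size"])        # 11pt
print(styles.get("font-color"))   # red
print(styles.get("line-height"))  # None, the property is not present
```

Building the dictionary once means each lookup no longer needs its own regex per property.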

Related

HTML regex space

I'm trying to bind a regex to my HTML form for an EU bank account. So far, I was able to piece together that:
pattern="[A-Z]{2}[000000000-999999999]{9}"
will accept something like UK123456789.
But I also want it to accept UK12 2345 6789.
How do I go about accepting a space at exactly those positions?
pattern="[A-Z]{2}[000000000-999999999]{9}"
This only accidentally does what you want. [000000000-999999999] says "this character should be a 0 or a 0 or a 0 or ... a character in the range 0-9 or a 9 or a 9 or ... a 9". The proper form is:
pattern="[A-Z]{2}[0-9]{9}"
or, more concisely:
pattern="[A-Z]{2}\d{9}"
Now that we have something more rational, we can extend that to:
pattern="[A-Z]{2}\d{2}\s?\d{4}\s?\d{4}"
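which allows optional whitespace at the specific locations. (Note that it expects ten digits, as in your spaced example, rather than the nine in your original pattern.)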
If you want to allow just spaces rather than any whitespace character, you could do:
pattern="[A-Z]{2}\d{2} ?\d{4} ?\d{4}"
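To try the spaced pattern outside the browser, a small Python sketch like the following can help. It assumes the ten-digit, 2-4-4 grouping from the pattern above; is_valid_account is a made-up name, and [0-9] is used instead of \d because Python's \d also accepts non-ASCII digits. fullmatch is used because the HTML pattern attribute has to match the whole value.

```python
import re

# Two capital letters, then ten digits grouped 2-4-4,
# with at most one space between the groups (assumed layout).
ACCOUNT = re.compile(r"[A-Z]{2}[0-9]{2} ?[0-9]{4} ?[0-9]{4}")

def is_valid_account(value: str) -> bool:
    """Hypothetical helper: True if the whole value fits the pattern."""
    return ACCOUNT.fullmatch(value) is not None

print(is_valid_account("UK12 2345 6789"))   # True
print(is_valid_account("UK1223456789"))     # True
print(is_valid_account("UK12  2345 6789"))  # False, two spaces in a row
print(is_valid_account("UK12 2345 67890"))  # False, one digit too many
```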
You can allow optional whitespace using \s?, though it'll make your regex a little longer. The regex below accepts the value both with and without whitespace:
\w{2}\d{2}\s?\d{4}\s?\d{4}
But be aware that a European IBAN is longer than what you have posted, though I'm not sure how it is in the UK.
If you don't care where the spaces are, as long as there are 9 digits, you can remove all the spaces before checking:
const str = 'UK12 345 6789';
// strip every space, then require exactly two letters followed by nine digits
const strToCheck = str.replace(/ /g, '');
const isValid = /^[a-zA-Z]{2}\d{9}$/.test(strToCheck);
if (isValid) {
    console.log('Valid');
}

Change CAPS lock to Capitalize in CSS

I have a sentence coming in that is all in caps lock (and can't be changed). That sentence is part of a paragraph. Using CSS only (or a little jQuery if necessary), how can I reliably get the result below across most devices?
Note: I have looked at questions such as this, but they do not cover the multiple sentences factor.
Without change:
THIS IS THE FIRST SENTENCE. And this is the second, as it is from the server.
Desired result:
This is the first sentence. And this is the second...
The CSS I tried was this, but it doesn't work with multiple sentences.
p { text-transform: lowercase;}
p:first-letter {text-transform:capitalize}
Seems like a problem for jQuery. Check this answer for the entire-element capitalization, then you can parse the first sentence by using something like:
var setval = $('#my_paragraph').html();
var dot = setval.indexOf('.');
// toProperCase is the helper from the linked answer, not a built-in
var firstSentence = toProperCase(setval.substring(0, dot));
// start at the dot itself so the period is kept
var theRest = setval.substring(dot);
$('#my_paragraph').html(firstSentence + theRest);
This is only a hotfix. If your output ever changes to something different, containing more than a single dot or other words starting with an uppercase character, this code will not produce the desired result.
Example:
http://jsfiddle.net/Em2bD/
// grab your text
var firstSentenceText = $('p').text();
// extract the first sentence and make it all lowercase
var firstSentence = firstSentenceText.substr(0, firstSentenceText.indexOf('.')).toLowerCase();
// convert first char to uppercase
var result = firstSentenceText.charAt(0).toUpperCase() + firstSentence.substring(1);
// put the text wherever you like and re-add the missing dot.
$('.result').text(result + '.');
One idea is to use a bit of jQuery: find the first period (.) in the paragraph, then make the text before it lowercase (you can wrap it in a span with a class/id and keep the rules in your CSS file). You may have to do a bit of googling.
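The string logic behind these answers (split at the first period, lowercase the first part, capitalize its first letter, leave the rest alone) can be sketched independently of jQuery. The Python below is only an illustration of that logic; sentence_case_first is a made-up name, and like the answers above it assumes the first period ends the first sentence, so abbreviations, acronyms and the pronoun "I" in the first sentence will come out wrong.

```python
def sentence_case_first(text: str) -> str:
    """Lowercase the first sentence, capitalize its first letter, keep the rest."""
    end = text.find(".")
    if end == -1:
        # No period at all: treat the whole text as one sentence.
        end = len(text)
    first, rest = text[:end], text[end:]
    return first[:1].upper() + first[1:].lower() + rest

print(sentence_case_first(
    "THIS IS THE FIRST SENTENCE. And this is the second, as it is from the server."
))
# This is the first sentence. And this is the second, as it is from the server.
```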

Regex to extract text from inside an HTML tag

I know this has been asked at least a thousand times but I can't find a proper regex that will match a name in this string here:
<td><div id="topbarUserName">Donald</div></td>
I want to get the name 'Donald' and the regex that's the closest is >[a-zA-Z0-9]+ but the result is >Donald.
I'm coding in PureBasic (its syntax is similar to that of Basic) and it uses the PCRE library for regular expressions.
Can anyone help?
Josh's pattern will work if you only make use of the numbered group, not the whole match. If you have to use the whole match, use something like (?<=>)(\w+?)(?=<)
Either way, regex is widely known to not be good for parsing HTML.
Explanation:
(?<=) is used to check if something appears before the current item.
\w+? matches any "word" character one or more times, but stops as soon as the rest of the pattern can match. In this situation the ? could have been left out.
(?=) is used to check if something appears after the current item.
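The difference between reading a numbered group and relying on lookarounds can be seen in a short sketch. It uses Python's re module rather than PCRE (the syntax for these two patterns is the same in both), and the variable names are just for illustration.

```python
import re

html = '<td><div id="topbarUserName">Donald</div></td>'

# Capture group: the whole match still includes > and <; group 1 is the name.
with_group = re.search(r">(\w+)<", html)

# Lookarounds: > and < are checked but not consumed, so the whole match is the name.
with_lookaround = re.search(r"(?<=>)\w+(?=<)", html)

print(with_group.group(0))       # >Donald<
print(with_group.group(1))       # Donald
print(with_lookaround.group(0))  # Donald
```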
Try this
It should capture anything that is a letter / number
>([\w]+)<
Also I'm not exactly sure what your project limitations are, but it would be much easier to do something like this
$('#topbarUserName').text();
in jQuery instead of using a regex.
>([a-zA-Z]+) should do the trick. Remember to get the grouping right.
Why not do it with plain old basic string functions?
; find the id, then skip past it and the following "> (14 + 2 characters)
pos.i = FindString(HTMLstring.s, "topbarUserName")
If pos > 0
  a.i = pos + 16
  b.i = FindString(HTMLstring, "<", a)
  If b > 0
    Donald.s = Mid(HTMLstring, a, b - a)
  EndIf
EndIf
Debug Donald

Surrounding text with tag and populating tag

I have several lines of text, in them there is a word or words that are capitalized like this:
Hello HOW ARE YOU good to see you
I am FINE
Is there a tool that can go through the text and surround all those capitalized words with an HTML anchor tag?
And, I guess more difficult: can it also populate the href with a lowercased version of that capitalized text with the space(s) removed?
Any help on one or both questions is appreciated.
It took me a while, but here it is in javascript: http://jsfiddle.net/RdJ4E/4/
I'm sure you will find out how to tune the code. Good luck!
Is this a beginning? Matching all uppercased words is trivial with regex, and with providing the String.replace method with a callback function instead of a string you can do whatever you want with the matched string.
myString.replace(/(\b[A-Z\s]+\b)/g, function(result, match){
    var stripped = encodeURI(result.trim().toLowerCase());
    return ' <a href="' + stripped + '">' + result.trim() + '</a> ';
});
http://jsfiddle.net/mwxnC/2/
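The same callback-based replacement can be sketched in Python with re.sub. This is one possible reading of the question rather than the fiddle's code: it assumes a "capitalized" word means two or more uppercase letters (so a lone "I" is left alone), that the words in a run are separated by plain spaces, and that the href should be the lowercased text with the spaces removed. link_caps is a made-up name.

```python
import re

# A run of all-caps words (two or more letters each) separated by spaces.
CAPS_RUN = re.compile(r"\b[A-Z]{2,}(?: +[A-Z]{2,})*\b")

def link_caps(text: str) -> str:
    """Hypothetical helper: wrap each all-caps run in an anchor tag."""
    def to_anchor(match: re.Match) -> str:
        words = match.group(0)
        href = words.lower().replace(" ", "")  # lowercased, spaces removed
        return f'<a href="{href}">{words}</a>'
    return CAPS_RUN.sub(to_anchor, text)

print(link_caps("Hello HOW ARE YOU good to see you"))
# Hello <a href="howareyou">HOW ARE YOU</a> good to see you
print(link_caps("I am FINE"))
# I am <a href="fine">FINE</a>
```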

Regex for Encoded HTML

I'd like to create a regex that will match an opening <a> tag containing an href attribute only:
<a href="doesntmatter.com">
It should match the above, but not match when other attributes are added:
<a href="doesntmatter.com" onmouseover="alert('Do something evil with Javascript')">
Normally that would be pretty easy, but the HTML is encoded. So encoding both of the above, I need the regex to match this:
&lt;a href=&quot;doesntmatter.com&quot;&gt;
But not match this:
&lt;a href=&quot;doesntmatter.com&quot; onmouseover=&quot;alert('do something evil with javascript.')&quot;&gt;
Assume all encoded HTML is "valid" (no weird malformed XSS trickery) and assume that we don't need to follow any HTML sanitization best practices. I just need the simplest regex that will match A) above but not B).
Thanks!
The initial regular expression that comes to mind is /<a href=".*?">/; a lazy expression (.*?) can be used to match the string between the quotes. However, as pointed out in the comments, because the regular expression is anchored by a >, it'll match the invalid tag as well, because a match is still made.
In order to get around this problem, you can use atomic grouping. Atomic grouping tells the regular expression engine, "once you have found a match for this group, accept it". This stops the regex from going back and matching the second string after not finding a > at the end of the href. The regular expression with an atomic group would look like:
/<a (?>href=".*?")>/
Which would look like the following when replacing the characters with their HTML entities:
/&lt;a (?>href=&quot;.*?&quot;)&gt;/
Hey! I had to do a similar thing recently. I recommend decoding the HTML first and then grabbing the info you want. Here's my solution in C#:
private string getAnchor(string data)
{
    // named groups: "href" is the link target, "text" is the anchor text
    string pattern = @"<a.*?href=[""'](?<href>.*?)[""'].*?>(?<text>.*?)</a>";
    Regex myRegex = new Regex(pattern, RegexOptions.Multiline);
    string anchor = "";
    MatchCollection matches = myRegex.Matches(data);
    foreach (Match match in matches)
    {
        anchor += match.Groups["href"].Value.Trim() + "," + match.Groups["text"].Value.Trim();
    }
    return anchor;
}
I hope that helps!
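The "decode first" approach can be sketched in Python using the standard html.unescape function. This is a simplified variant and not a port of the C# above: it only answers the original yes/no question (is href the only attribute?), it assumes the input string is a single encoded tag with a double-quoted href, and is_href_only_anchor is a made-up name.

```python
import html
import re

# After decoding: an <a> tag whose only attribute is a double-quoted href.
HREF_ONLY = re.compile(r'<a href="[^"]*"\s*>')

def is_href_only_anchor(encoded: str) -> bool:
    """Hypothetical helper: decode the entities, then test the plain tag."""
    decoded = html.unescape(encoded)
    return HREF_ONLY.fullmatch(decoded) is not None

good = "&lt;a href=&quot;doesntmatter.com&quot;&gt;"
bad = "&lt;a href=&quot;doesntmatter.com&quot; onmouseover=&quot;alert(1)&quot;&gt;"

print(is_href_only_anchor(good))  # True
print(is_href_only_anchor(bad))   # False
```

Because [^"]* can never cross a quote, the pattern has no way to stretch over the extra attribute, so no atomic group is needed once the text is decoded.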
I don't see how matching one is different from the other. You're just looking for exactly what you wrote, with doesntmatter.com being the part you capture. I guess matching anything up to &quot; (rather than ") can present a problem, but you do it like this in regex:
(?:(?!&quot;).)*
It essentially means:
Match the following group 0 or more times
Fail the match if the following string is "&quot;"
Match any character (except new line unless DOTALL is specified)
The complete regular expression would be:
/&lt;a href=&quot;(?>(?:[^&]+|(?!&quot;).)*)&quot;&gt;/s
This is more efficient than using a non-greedy expression.
Credit to Daniel Vandersluis for reminding me of the atomic group! It fits nicely here for the sake of optimization (this pattern can never match if it has to backtrack.)
I also threw in an additional [^&]+ group to avoid repeating the negative look-ahead so many times.
Alternatively, one could use a possessive quantifier, which essentially does the same thing (your regex engine might not support it):
/&lt;a href=&quot;(?:[^&]+|(?!&quot;).)*+&quot;&gt;/s
As you can see it's slightly shorter.
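For readers who want to experiment with the lookahead idea on the encoded text, here is a rough Python sketch. It is deliberately simpler than the patterns above: Python's re module only supports atomic groups and possessive quantifiers from version 3.11, so this version drops both and also drops the [^&]+ shortcut (nesting that inside * without an atomic group can backtrack badly). What remains is just the "any character, as long as &quot; does not start here" step. The variable names are made up.

```python
import re

# "Tempered" dot: take one character at a time, but never start on "&quot;".
ENCODED_HREF_ONLY = re.compile(
    r"&lt;a href=&quot;(?:(?!&quot;).)*&quot;&gt;",
    re.DOTALL,
)

good = "&lt;a href=&quot;doesntmatter.com&quot;&gt;"
bad = "&lt;a href=&quot;doesntmatter.com&quot; onmouseover=&quot;alert(1)&quot;&gt;"

print(ENCODED_HREF_ONLY.search(good) is not None)  # True
print(ENCODED_HREF_ONLY.search(bad) is not None)   # False
```

The second string is rejected because the repeated group cannot step over the first closing &quot;, and what follows it is not &gt;.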