Regex for HTML attribute replacement/addition - html

I'm looking for a single line regex which does the following:
Given an HTML tag with the "name" attribute, I want to replace it with my own attribute. If the tag lacks the name attribute, I want to insert my own. The result should look like this:
<IMG name="img1" ...> => <IMG name="myImg1" ...>
<IMG ...> => <IMG name="myImg1" ...>
Can this be done with a single line regex?

The trick is to match every complete "attribute=value" pair, but capture only the ones whose attribute name isn't "name". Then plug in your own "name" attribute along with all the captured ones.
s/<IMG
((?:\s+(?!name\b)\w+="[^"]+")*)
(?:\s+name="[^"]+")?
((?:\s+(?!name\b)\w+="[^"]+")*)
>
/<IMG name="myName"$1$2>/xg;
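
For readers who do not use Perl, the same idea can be sketched in Python. This is only an illustration under the same assumptions as the pattern above: an uppercase IMG tag, double-quoted attribute values, and no valueless attributes. The function name set_img_name is made up for the example.

```python
import re

# Verbose form of the pattern above: capture every attribute except "name".
IMG_TAG = re.compile(
    r"""
    <IMG
    ((?:\s+(?!name\b)\w+="[^"]+")*)   # group 1: attributes before name
    (?:\s+name="[^"]+")?              # existing name attribute, if any (discarded)
    ((?:\s+(?!name\b)\w+="[^"]+")*)   # group 2: attributes after name
    >
    """,
    re.VERBOSE,
)


def set_img_name(html: str, new_name: str) -> str:
    """Replace or insert the name attribute on every matching IMG tag."""
    return IMG_TAG.sub(
        lambda m: f'<IMG name="{new_name}"{m.group(1)}{m.group(2)}>', html
    )


print(set_img_name('<IMG name="img1" src="a.png">', "myImg1"))
print(set_img_name('<IMG src="a.png">', "myImg1"))
```

Both calls should yield the same tag, with the new name attribute placed first and the remaining attributes following in their original order.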

This isn't a perfect solution: the spacing and position within the tag may not be exactly what you want, but it does accomplish the goals. This is a Perl regex, but there's nothing particularly Perl-specific about it.
s/(<IMG)((\s+[^>]*)name="[^"]*")?(.*)/$1$3 name="myID"$4/g

If, like in your example, the name attribute is always the first one inside the IMG tag, then it's very easy. Search for
<(?!/)(\w+)(\s+name="[^"]+")?
and replace with
<\1 name="myImg1"
but I doubt that this is what you really want.
If the name attribute can occur in other positions, it gets more difficult.
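
A minimal Python sketch of the first-position case follows. It keeps the whitespace inside the optional group so that a tag without a name attribute keeps the space before its next attribute. The helper name force_first_name is hypothetical. Note that the pattern applies to every opening tag, not only IMG, which is probably why this is not what you really want.

```python
import re

# Tag name in group 1; an optional leading name="..." is consumed and replaced.
FIRST_NAME = re.compile(r'<(?!/)(\w+)(\s+name="[^"]+")?')


def force_first_name(html: str) -> str:
    """Put name="myImg1" right after the tag name of every opening tag."""
    return FIRST_NAME.sub(r'<\1 name="myImg1"', html)


print(force_first_name('<IMG name="img1" src="a.png">'))
print(force_first_name('<IMG src="a.png">'))
print(force_first_name('<p>text</p>'))  # every opening tag is affected
```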

Related

XPath based on id attribute value that starts with something?

For example if I have multiple anchor elements on a site and the easiest way to get them is via their ID, but the IDs look like this:
lots of html...
hop1
...lots of html...
hop2
...lots of html...
hop3
...lots of html
Is it possible to select the href attributes of all anchor elements whose id contains the "foo_" part? In other words, can I use a wildcard in an attribute's value in XPath?
This XPath expression, which works with all versions of XPath,
//a[starts-with(@id,"foo_")]/@href
will select all a/@href attributes whose a has an id attribute value that starts with "foo_".
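
To make concrete what that expression selects, here is a small Python sketch. Python's standard-library ElementTree implements only a subset of XPath and has no starts-with(), so the prefix test is done in plain Python instead; a full XPath engine could evaluate the expression directly. The sample markup, ids, and hrefs are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the real page's ids and hrefs are not known.
SAMPLE = """
<body>
  <a id="foo_1" href="/hop1">hop1</a>
  <a id="bar_1" href="/other">other</a>
  <a id="foo_2" href="/hop2">hop2</a>
  <a href="/no-id">no id</a>
</body>
"""


def hrefs_with_id_prefix(xml_text: str, prefix: str) -> list:
    """Return href values of <a> elements whose id starts with prefix."""
    root = ET.fromstring(xml_text)
    return [
        a.get("href")
        for a in root.iter("a")
        if a.get("id", "").startswith(prefix)
    ]


print(hrefs_with_id_prefix(SAMPLE, "foo_"))
```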
Yes, you can use the matches function, which is available in XPath 2.0 and later (for example in XSLT 2.0):
Starting with foo_: //a/@id[matches(.,'^foo_\d+')]
Containing foo_: //a/@id[matches(.,'foo_\d+')]
Please specify which language you are asking about.

Extracting content of HTML tag with specific attribute

Using regular expressions, I need to extract the multiline content of a tag that has a specific id value. How can I do this?
This is what I currently have:
<div(.|\n)*?id="${value}"(.|\n)*?>(.|\n)*?<\/div>
The problem with this is this sample:
<div id="1">test</div><div id="2">test</div>
If I want to replace id="2" using this regexp (with ${value} = 2), the whole string would get matched. This is because from the tag opening to closing I match everything until id is found, which is wrong.
How can I do this?
A fairly simple way is to use
Raw: <div(?=\s)[^>]*?\sid="2"[^>]*?>([\S\s]*?)</div>
Delimited: /<div(?=\s)[^>]*?\sid="2"[^>]*?>([\S\s]*?)<\/div>/
Use the variable in place of 2.
The content will be in group 1.
Change (.|\n) to [^>] so it won't match the > that ends the tag. Then it can't match across different divs.
<div\b[^>]*\bid="${value}"[^>]*>.*?<\/div>
Also, instead of using (.|\n)* to match across multiple lines, use the s modifier to the regexp. This makes . match any character, including newlines.
However, using regular expressions to parse HTML is not very robust. You should use a DOM parser.
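
As a rough Python sketch of the [^>] fix combined with the "dot matches newline" flag: the function name div_content is hypothetical, and re.escape is added on the assumption that the id may contain characters that are special in a regex. Like the patterns above, it does not handle nested divs.

```python
import re
from typing import Optional


def div_content(html: str, div_id: str) -> Optional[str]:
    """Return the inner text of the first <div> with the given id, or None."""
    pattern = re.compile(
        r'<div\b[^>]*\bid="' + re.escape(div_id) + r'"[^>]*>(.*?)</div>',
        re.DOTALL,  # equivalent of the s modifier: . also matches newlines
    )
    match = pattern.search(html)
    return match.group(1) if match else None


sample = '<div id="1">test</div><div id="2">te\nst</div>'
print(repr(div_content(sample, "2")))
```

Because [^>]* cannot cross the > that closes an opening tag, the match for id="2" starts at the second div instead of spanning both.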

In Objective-C, how can I get the "url" and "content" between <a> tags using a regular expression?

The <a> tag looks like this:
<a data-hash="9aab8aa3af7dc519c643fdcfd973b040" href="http://www.zhihu.com/people/9aab8aa3af7dc519c643fdcfd973b040" class="member_mention" data-editable="true" data-title="#somebody" data-tip="p$b$9aab8aa3af7dc519c643fdcfd973b040">#somebody</a>
and I want to get the URL
href="http://www.zhihu.com/people/9aab8aa3af7dc519c643fdcfd973b040"
and the #somebody text at the same time.
I have tried this:
href=\"(.*?)\">(.*?)</a>
The result is:
href="http://www.zhihu.com/people/9aab8aa3af7dc519c643fdcfd973b040" class="member_mention" data-editable="true" data-title="#somebody" data-tip="p$b$9aab8aa3af7dc519c643fdcfd973b040">#somebody</a>
Can anyone give me a suggestion?
You also need to match the extra parameters present inside the anchor tag.
"<a\\b[^>]*\\bhref=\"(.*?)\"[^>]*>(.*?)</a>"
or
"<a\\b[^>]*\\bhref=\"([^\"]*)\"[^>]*>(.*?)</a>"
Then get the strings you want from groups 1 and 2. Your regex captures everything after the href attribute because it requires a > immediately after the closing double quote. The first place where that happens is data-tip="p$b$9aab8aa3af7dc519c643fdcfd973b040">, so the first capturing group expands until it reaches that point and captures the intermediate attributes too.
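
The pattern itself is not specific to Objective-C. Here is a quick way to check it in Python, as a sketch only: the tag below is a shortened, made-up version of the one in the question, and the names ANCHOR and extract_link are hypothetical.

```python
import re
from typing import Optional, Tuple

# Same pattern as above, without the string-literal escaping Objective-C needs.
ANCHOR = re.compile(r'<a\b[^>]*\bhref="([^"]*)"[^>]*>(.*?)</a>', re.DOTALL)


def extract_link(html: str) -> Optional[Tuple[str, str]]:
    """Return (url, link text) of the first anchor, or None if there is none."""
    match = ANCHOR.search(html)
    return (match.group(1), match.group(2)) if match else None


tag = (
    '<a data-hash="9aab" href="http://www.zhihu.com/people/9aab" '
    'class="member_mention" data-tip="p$b$9aab">#somebody</a>'
)
print(extract_link(tag))
```

The [^>]* after the href value absorbs the remaining attributes, so group 1 stops at the closing quote of href.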

Select attribute content XPath

I have an XPath
//*[@class]
I would like to make an XPath to select the content inside this attribute.
<li class="tab-off" id="navList0">
So in this case I would like to extract the text "tab-off", is this possible with XPath?
Your original //*[@class] XPath query returns all elements which have a class attribute. What you want is //*[@class]/@class to retrieve the attribute itself.
In case you just want the value and not the attribute name, try string(//*[@class]/@class) instead.
If you are specifically grabbing the data from an <li> tag, you can do this:
//li[@class]
and loop through the result set to find the one whose class attribute is "tab-off". Or
//li[@class='tab-off']
if you're in a position to hard-code the value.
I assume you have already put your file through an XML parser like a DOMParser. This will make it much easier to extract any other values you may need on a specific tag.
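
As an illustration of the element-then-attribute approach, here is a sketch using Python's standard-library ElementTree. It supports the [@class] and [@class='tab-off'] predicates but cannot return attribute nodes with /@class, so the value is read with get() after selecting the elements. The second list item and the text content are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment built around the <li> from the question.
root = ET.fromstring(
    '<ul>'
    '<li class="tab-off" id="navList0">Home</li>'
    '<li id="navList1">About</li>'
    '</ul>'
)

# Elements that have a class attribute, then the attribute value of each.
class_values = [el.get("class") for el in root.findall(".//*[@class]")]

# Elements whose class attribute equals a hard-coded value.
tab_off_ids = [el.get("id") for el in root.findall(".//li[@class='tab-off']")]

print(class_values, tab_off_ids)
```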

RegEx to grab an entire div with a specific ID?

I have some markup which contains the hidden Firebug div.
(long story short, the YUI RTE posts content back that includes the hidden Firebug div if that is active)
So, in my posted content I have the extra div which I will remove server side in PHP:
<div firebugversion="1.5.4" style="display: none;" id="_firebugConsole"></div>
I can't seem to get a handle on the regex I would need to write to match this string, bearing in mind that it won't always be that exact string (the version may change).
All help welcome!
Regex is not the best tool for the job, but you can try:
<div firebugversion=[^>]*></div>
The […] is a character class. Something like [aeiou] matches any one of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches any one character except the lowercase vowels.
The * is the zero-or-more repetition. Thus, [^>]* matches a (possibly empty) sequence of characters other than >.
If you want to target the id specifically, you can try:
<div [^>]*\bid="_firebugConsole"[^>]*></div>
The \b is the word boundary anchor.
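
The question is about PHP, but the id-based pattern can be tried out in any regex engine. Below is a small Python sketch; the function name strip_firebug and the surrounding paragraph tags in the sample are made up. It assumes the div is empty, as in the question.

```python
import re

# Matches an empty div whose attributes include id="_firebugConsole".
FIREBUG_DIV = re.compile(r'<div [^>]*\bid="_firebugConsole"[^>]*></div>')


def strip_firebug(html: str) -> str:
    """Remove the hidden Firebug console div, whatever its version attribute."""
    return FIREBUG_DIV.sub("", html)


posted = (
    '<p>a</p>'
    '<div firebugversion="1.5.4" style="display: none;" id="_firebugConsole"></div>'
    '<p>b</p>'
)
print(strip_firebug(posted))
```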
Match this regex:
<div.*id="_firebugConsole".*?/div>
I would suggest this:
\<div firebugversion="(.+)" style="(.+)" id="(.+)"\>
Then you have three groups:
firebugversion
style
id
This one is a little more complicated, and probably not perfect, but it will:
Match any div containing the attribute firebugversion
Match the firebugversion attribute no matter which order attributes appear in the tag
Match the div even if it contains content or spacing between it and its closing tag (I've seen the Firebug div with an &nbsp; inside it before). Note: it uses lazy matching, so it will only match up to the next closing tag rather than the last one it finds in the document.
<(div)\b([^>]*?)(firebugversion)([^>]*?)>(.*?)</div>
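
A short Python check of this last pattern, offered as a sketch: the sample string is invented, and re.DOTALL is an added assumption so that content spanning several lines is also matched. Nested divs inside the Firebug div would still defeat it.

```python
import re

FIREBUG_ANY = re.compile(
    r'<(div)\b([^>]*?)(firebugversion)([^>]*?)>(.*?)</div>',
    re.DOTALL,
)

# firebugversion is not the first attribute, and the div is not empty.
sample = (
    '<div id="a">keep</div>'
    '<div style="display: none;" firebugversion="1.5.4" id="_firebugConsole">&nbsp;</div>'
)

cleaned = FIREBUG_ANY.sub("", sample)
print(cleaned)
```

Because [^>]*? cannot cross the end of an opening tag, the first div is left alone and only the div carrying firebugversion is removed.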