What does a variable written immediately after "href=" do in html? - html

I can't find a good explanation for this. What is the code after href= doing?
The variable $here is a URL. The variable $result is a directory name.
echo "<h1><a href=$here/$result>$result</a></h1>";
I understand this is html embedded in php. The echo gives it away.
The question is, what is this:
href=$here/$result
I do not recognize this code in html.

Usually href refers to the URL the link points to; in this case it is the URL plus a directory. The href attribute is what makes the <a> tag a link in HTML.

This is not HTML but PHP that writes an HTML string. You can place a variable inside a double-quoted string: PHP parses the string and replaces the variable with its value.
For example:
$example = "world";
echo "Hello, $example!";
// Outputs: "Hello, world!"
Note that single-quoted strings do not behave this way.
In the case of your question, $here and $result will be replaced by their values, as follows:
$here = "LOCATION"; // just for example purposes
$result = "RESULT"; // just for example purposes
echo "<h1><a href=$here/$result>$result</a></h1>";
// Outputs: "<h1><a href=LOCATION/RESULT>RESULT</a></h1>"
It will output a link (<a>) whose href attribute is the address the link points to. If the value contains no spaces, quotes or > symbols, it will work without quotation marks ("), but you can also write it between quotes:
echo "<h1><a href=\"$here/$result\">$result</a></h1>";
// Outputs: "<h1><a href="LOCATION/RESULT">RESULT</a></h1>"
The best way I know to understand the <a> and its href attribute behaviour is to try it in the browser and see it in action by yourself.
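The same substitution can be sketched outside PHP. The Python snippet below is only an illustration: the function name make_link and the example URL are made up, not part of the question. It builds the same markup, quotes the attribute, and escapes the values so that a directory name containing spaces or quotes cannot break the tag.

```python
from html import escape
from urllib.parse import quote


def make_link(base_url: str, directory: str) -> str:
    """Build <h1><a href="base/dir">dir</a></h1> with a quoted, escaped href."""
    # Percent-encode the directory so spaces become %20 inside the URL.
    href = f"{base_url}/{quote(directory)}"
    # Escape both the attribute value and the visible link text.
    return f'<h1><a href="{escape(href, quote=True)}">{escape(directory)}</a></h1>'


print(make_link("https://example.com/files", "my photos"))
```

With the attribute quoted, values that contain spaces are still treated as a single href.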

Related

Freemarker Template return String instead of html element

I am working on Freemarker Template to create one form.
I declared two variables using <#local> which will hold anchor tag and button
<#local rightElement = "<a class='set-right' href='${data.url}'>Forgot your password?</a>">
<#local rightButton = "<input type='button' class='js-show-pass btn-toggle-password btn-link' value='Show'>">
I pass these variables to a macro that creates my form structure. But when I load the page, the variables I pass are printed as strings on the form page. I access them as ${rightElement} and ${rightButton} in my macro, but the anchor tag is printed with double quotes, as a string rather than as an element.
I tried multiple ways to eliminate the double quotes, with no success. Could you please suggest how to declare an anchor tag in a variable and use it as an HTML element and not as a String?
I'm not sure what you mean by "printing my anchor tag with double quotes, making it as string", but my guess is that the HTML that you print gets auto-escaped, and so you literally see the HTML source (with the <-s and all) in the browser, instead of a link or button.
So, first, assign the values with <#local someName>content</#local> syntax, instead of <#local someName="content">:
<#local rightElement><a class='set-right' href='${data.url}'>Forgot your password?</a></#local>
<#local rightButton><input type='button' class='js-show-pass btn-toggle-password btn-link' value='Show'></#local>
This ensures that ${data.url} and the like are properly escaped (assuming you use auto-escaping, and you should). You also no longer have to avoid or escape " or ' inside a string literal, and you can use #if and similar directives when constructing the value.
Then, where you print the HTML value, if you are using the modern version of auto-escaping (try ${.output_format} to see - it should print HTML then, not undefined), you can now just write ${rightElement}, because FreeMarker knows that it's captured HTML, and so needs no more escaping. If you are using the legacy version of auto-escaping (which utilized #escape directive), then you have to write <#noescape>${rightElement}</#noescape>.
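The idea behind "FreeMarker knows that it's captured HTML" can be illustrated with a small Python analogy. This is not FreeMarker's API; SafeHtml and render are hypothetical names used only to show how an auto-escaping renderer might treat plain strings and already-marked-up values differently.

```python
from html import escape


class SafeHtml(str):
    """Marker type: text that is already HTML and must not be escaped again."""


def render(value):
    # Plain strings are escaped; values already marked as HTML pass through.
    return value if isinstance(value, SafeHtml) else escape(str(value))


link_as_string = "<a href='/reset'>Forgot your password?</a>"
link_as_markup = SafeHtml(link_as_string)

print(render(link_as_string))  # shows up as literal source text in a browser
print(render(link_as_markup))  # shows up as a real link
```

The first call mirrors what the question describes (the tag printed as text), the second what the captured-content assignment achieves.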

Removing single and double quote from html attributes with no white spaces on all attributes except href and src

I'm trying to remove single and double quotes from html attributes that are single words with no white spaces. I wrote this regex which does work:
/((type|title|data-toggle|colspan|scope|role|media|name|rel|id|class|rel)\s*(=)\s*)(\"|\')(\S+)(\"|\')/ims
However, instead of listing every attribute whose quotes I want to remove, I would rather list the few attributes to ignore, such as src and href, and remove the quotes on all other attribute names. So I wrote the one below, but for the life of me it doesn't work. It somehow has to detect any attribute name except href and src. I tried all kinds of combinations.
/((?!href|src)(\S)+\s*(=)\s*)(\"|\')(\S+)(\"|\')/i
I've tried this but it doesn't work: it just strips the h and s off the href and src attributes. I know I'm close but I'm missing something. I spent a good 5 hours on this.
working example
$html_code = 'your html code here.';
preg_replace('/((type|title|data-toggle|colspan|scope|role|media|name|rel|id|class|rel)\s*(=)\s*)(\"|\')(\S+)(\"|\')/i', '$1$5', "$html_code");
I modified the smaller RegEx you wrote, resulting in this:
((\S)+\s*(?<!href)(?<!src)(=)\s*)(\"|\')(\S+)(\"|\')
When your version runs, the lookahead fails at the 'h' that starts an 'href', so the engine moves on to the next character. Since 'ref' is neither 'href' nor 'src', the lookahead passes there and the rest of your pattern matches from 'ref' onwards.
With my modifications, any 'href' or 'src' will be initially accepted by the regex. When the lookbehind is reached, it will check for 'href' in the already parsed text and will fail if it is found.
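The lookahead pitfall can be reproduced in a few lines. The Python sketch below is an illustration, not the answerer's code: naive is the question's pattern transcribed for Python's re module, and anchored is one hypothetical alternative that uses \b so a match attempt can only begin at the start of an attribute name.

```python
import re

# The lookahead from the question: it is only tested where a match attempt starts.
naive = re.compile(r"""((?!href|src)\S+\s*=\s*)(["'])(\S+)\2""", re.I)

# Hypothetical fix: \b forces the attempt to start at the beginning of a name,
# so the engine cannot simply skip the first letter of href or src.
anchored = re.compile(r"""(\b(?!(?:href|src)\b)[\w-]+\s*=\s*)(["'])(\S+?)\2""", re.I)

tag = '<a href="x" class="y">'

print(naive.search(tag).group(1))   # the match starts at "ref", not "href"
print(anchored.sub(r"\1\3", tag))   # href keeps its quotes, class loses them
```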
Also, instead of filtering on the href or src attribute names, it would be preferable to filter on values that contain =. Here is a regex that does this (it also presumes that all attributes use double quotes):
// Remove the double quotes around attribute values that contain no space and no `=` character.
$html = preg_replace('/((\S)+\s*(=)\s*)(\")(\S+(?<!=.))(\")/', '$1$5', $html);
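Another way to express "every attribute except a short ignore list" is to match any quoted single-word attribute and decide in a callback. The following Python sketch assumes that approach; ATTR, KEEP_QUOTED and unquote_simple_attributes are illustrative names. Note that it works on raw text, so it would also rewrite attribute-like text outside of tags.

```python
import re

# name = "value" or name = 'value', where the value has no whitespace,
# quotes, =, <, > or backtick (characters that need quoting in HTML).
ATTR = re.compile(r"""([\w-]+)\s*=\s*(["'])([^\s"'=<>`]+)\2""")

# Attributes whose quotes are always kept.
KEEP_QUOTED = {"href", "src"}


def unquote_simple_attributes(html: str) -> str:
    def repl(match):
        name, value = match.group(1), match.group(3)
        if name.lower() in KEEP_QUOTED:
            return match.group(0)  # leave href/src untouched
        return f"{name}={value}"

    return ATTR.sub(repl, html)


sample = """<a class="btn" href="/x" title="two words" id='go'>ok</a>"""
print(unquote_simple_attributes(sample))
```

Values with spaces (title above) never match the pattern, so their quotes are preserved as well.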

Regular expression to find URLs not inside a hyperlink

There are many regexes out there that match a URL. However, I'm trying to match URLs that do not appear anywhere within an <a> hyperlink tag (HREF, inner value, etc.). So NONE of the URLs in these should match:
something
http://www.example2.com
<b>something</b>http://www.example.com/<span>test</span>
Any URL outside of <a></a> should be matched.
One approach I tried was to use a negative lookahead to see if the first <a> tag after the URL was an opening <a> or a closing </a>. If it is a closing </a> then the URL must be inside a hyperlink. I think this idea was okay, but the negative lookahead regex didn't work (or more accurately, the regex wasn't written correctly). Any tips are very appreciated.
I was looking for this answer as well, and because nothing out there really worked the way I wanted it to, this is the regex that I created. Since it's a regex, be aware that this is not a perfect solution.
/(?!<a[^>]*>[^<])(((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[\w]*))?))(?![^<]*<\/a>)/gi
And the whole function to update html is:
function linkifyWithRegex(input) {
  let html = input;
  let regx = /(?!<a[^>]*>[^<])(((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[\w]*))?))(?![^<]*<\/a>)/gi;
  html = html.replace(
    regx,
    function (match) {
      // Wrap each URL found outside an anchor in a link to itself.
      return '<a href="' + match + '">' + match + '</a>';
    }
  );
  return html;
}
You can do it in two steps instead of trying to come up with a single regular expression:
Blend out (replace with nothing) the HTML anchor part (the entire anchor tag: opening tag, content and closing tag).
Match the URL
In Perl it could be:
my $curLine = $_; # Do not change $_ if it is needed for something else.
$curLine =~ s/<a[^<]+<\/a>//g; # Remove every HTML anchor: "<a", "</a>" and everything in between.
if ( $curLine =~ /http:\/\// )
{
    print "Matched a URL outside an HTML anchor: $_\n";
}
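The two-step idea carries over to other languages. Below is a minimal Python sketch under stated assumptions: the URL pattern is a deliberately simple placeholder, and the names ANCHOR, URL and urls_outside_anchors are made up for the example.

```python
import re

# Step 1 pattern: a whole anchor element, from <a ...> to </a>.
ANCHOR = re.compile(r"<a\b[^>]*>.*?</a>", re.I | re.S)
# Step 2 pattern: a simplistic URL matcher (placeholder, not production grade).
URL = re.compile(r"""https?://[^\s<>"']+""")


def urls_outside_anchors(html: str) -> list:
    # Blank out every anchor first, then look for URLs in what is left.
    without_anchors = ANCHOR.sub(" ", html)
    return URL.findall(without_anchors)


sample = 'See <a href="http://in.example/a">http://in.example/b</a> and http://out.example/c <b>ok</b>'
print(urls_outside_anchors(sample))
```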
You can do that using a single regular expression that matches both anchor tags and hyperlinks:
# Note that this is a dummy, you'll need a more sophisticated URL regex
regex = '(<a[^>]+>)|(http://.*)'
Then loop over the results and only process matches where the second sub-pattern was found.
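A sketch of that loop in Python, assuming the first alternative is widened to swallow the entire anchor element (opening tag, content and closing tag) so that URLs in the link text are consumed too. The URL sub-pattern is again only a placeholder.

```python
import re

# Group 1: a complete anchor element. Group 2: a bare URL.
# Because alternatives are tried left to right at each position, any URL
# inside an anchor is consumed as part of group 1 and never reaches group 2.
PATTERN = re.compile(
    r"""(<a\b[^>]*>.*?</a>)|(https?://[^\s<>"']+)""",
    re.I | re.S,
)


def bare_urls(html: str) -> list:
    # Keep only the matches where the second sub-pattern was found.
    return [m.group(2) for m in PATTERN.finditer(html) if m.group(2)]


sample = 'See <a href="http://in.example/a">http://in.example/b</a> and http://out.example/c'
print(bare_urls(sample))
```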
Peter has a great answer: first, remove anchors so that
Some text TeXt and some more text with link http://a.net
is replaced by
Some text and some more text with link http://a.net
THEN run a regexp that finds urls:
http://a.net
Use the DOM to filter out the anchor elements, then do a simple URL regex on the rest.
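As a rough illustration of the parser-based variant, the Python standard library's html.parser can track whether the current text sits inside an anchor. The class name and the simple URL pattern are assumptions made for this sketch.

```python
import re
from html.parser import HTMLParser

URL = re.compile(r"""https?://[^\s<>"']+""")  # placeholder URL pattern


class UrlsOutsideAnchors(HTMLParser):
    """Collect URLs from text nodes that are not inside an <a> element."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth:
            self.anchor_depth -= 1

    def handle_data(self, data):
        # Attribute values (href) never reach handle_data, only text does.
        if self.anchor_depth == 0:
            self.urls.extend(URL.findall(data))


parser = UrlsOutsideAnchors()
parser.feed('See <a href="http://in.example/a">http://in.example/b</a> '
            'and http://out.example/c <b>http://out.example/d</b>')
parser.close()
print(parser.urls)
```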
^.*<(a|A){1,1} -> scan until <a or <A is found
.*(href|HREF){1,1}\= -> scan until href= or HREF=
\x22{1,1}.*\x22 -> accept all characters between two quotes
> -> look for >
.+(</a>|</A>){1,1} -> accept description and end anchor tag
$ -> End of string search
pattern= "^.*<(a|A){1,1}.*(href|HREF){1,1}.*\=.*\x22{0,1}.*\x22{0,1}.*>.+(</a>|</A>){1,1}$"

Regular expression for extracting tag attributes

I'm trying to extract the attributes of an anchor tag (<a>). So far I have this expression:
(?<name>\b\w+\b)\s*=\s*("(?<value>[^"]*)"|'(?<value>[^']*)'|(?<value>[^"'<> \s]+)\s*)+
which works for strings like
<a href="test.html" class="xyz">
and (single quotes)
<a href='test.html' class="xyz">
but not for a string without quotes:
<a href=test.html class=xyz>
How can I modify my regex making it work with attributes without quotes? Or is there a better way to do that?
Update: Thanks for all the good comments and advice so far. There is one thing I didn't mention: I sadly have to patch/modify code not written by me. And there is no time/money to rewrite this stuff from the bottom up.
Update 2021: Radon8472 proposes in the comments the regex https://regex101.com/r/tOF6eA/1 (note regex101.com did not exist when I originally wrote this answer)
<a[^>]*?href=(["\'])?((?:.(?!\1|>))*.?)\1?
Update 2021 bis: Dave proposes in the comments, to take into account an attribute value containing an equal sign, like <img src="test.png?test=val" />, as in this regex101:
(\w+)=["']?((?:.(?!["']?\s+(?:\S+)=|\s*\/?[>"']))+.)["']?
Update (2020), Gyum Fox proposes https://regex101.com/r/U9Yqqg/2 (again, note regex101.com did not exist when I originally wrote this answer)
(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|\s*\/?[>"']))+.)["']?
Applied to:
<a href=test.html class=xyz>
<a href="test.html" class="xyz">
<a href='test.html' class="xyz">
<script type="text/javascript" defer async id="something" onload="alert('hello');"></script>
<img src="test.png">
<img src="a test.png">
<img src=test.png />
<img src=a test.png />
<img src=test.png >
<img src=a test.png >
<img src=test.png alt=crap >
<img src=a test.png alt=crap >
Original answer (2008):
If you have an element like
<name attribute=value attribute="value" attribute='value'>
this regex could be used to find successively each attribute name and value
(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?
Applied on:
<a href=test.html class=xyz>
<a href="test.html" class="xyz">
<a href='test.html' class="xyz">
it would yield:
'href' => 'test.html'
'class' => 'xyz'
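Note: This does not handle single-character attribute values correctly, e.g. for <div id="1"> it captures "1 (with the opening quote) instead of 1, because the value pattern needs at least two characters.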
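To see the 2008 expression in action, here is a short Python check. The regex is the one quoted above; the variable name ATTR and the test tags are just scaffolding. The last call shows the limitation with a one-character value, which comes back with a stray leading quote.

```python
import re

# The expression from the original answer, unchanged.
ATTR = re.compile(r"""(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?""")

print(ATTR.findall("<a href=test.html class=xyz>"))
print(ATTR.findall('<a href="test.html" class="xyz">'))
print(ATTR.findall("""<a href='test.html' class="xyz">"""))

# The value part needs at least two characters, so a one-character value
# is only matched by pulling the opening quote into the capture.
print(ATTR.findall('<div id="1">'))
```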
Edited: Improved regex for getting attributes with no value and values with " ' " inside.
([^\r\n\t\f\v= '"]+)(?:=(["'])?((?:.(?!\2?\s+(?:\S+)=|\2))+.)\2?)?
Applied on:
<script type="text/javascript" defer async id="something" onload="alert('hello');"></script>
it would yield:
'type' => 'text/javascript'
'defer' => ''
'async' => ''
'id' => 'something'
'onload' => 'alert(\'hello\');'
Although the advice not to parse HTML via regexp is valid, here's an expression that does pretty much what you asked:
/
\G # start where the last match left off
(?> # begin non-backtracking expression
.*? # *anything* until...
<[Aa]\b # an anchor tag
)?? # but look ahead to see that the rest of the expression
# does not match.
\s+ # at least one space
( \p{Alpha} # Our first capture, starting with one alpha
\p{Alnum}* # followed by any number of alphanumeric characters
) # end capture #1
(?: \s* = \s* # a group starting with a '=', possibly surrounded by spaces.
(?: (['"]) # capture a single quote character
(.*?) # anything else
\2 # which ever quote character we captured before
| ( [^>\s'"]+ ) # any number of non-( '>', space, quote ) chars
) # end group
)? # attribute value was optional
/msx;
"But wait," you might say. "What about *comments?!?!" Okay, then you can replace the . in the non-backtracking section with: (It also handles CDATA sections.)
(?:[^<]|<[^!]|<![^-\[]|<!\[(?!CDATA)|<!\[CDATA\[.*?\]\]>|<!--(?:[^-]|-[^-])*-->)
Also if you wanted to run a substitution under Perl 5.10 (and I think PCRE), you can put \K right before the attribute name and not have to worry about capturing all the stuff you want to skip over.
Token Mantra response: you should not tweak, modify, harvest, or otherwise produce HTML/XML using regular expressions.
There are too many corner-case conditionals, such as \' and \", which must be accounted for. You are much better off using a proper DOM parser, XML parser, or one of the many other tried and tested tools for this job instead of inventing your own.
I don't really care which one you use, as long as it's recognized and tested, and you use one.
my $foo = Someclass->parse( $xmlstring );
my @links = $foo->getChildrenByTagName("a");
my @srcs = map { $_->getAttribute("src") } @links;
# @srcs now contains an array of src attributes extracted from the page.
You cannot use the same name for multiple captures. Thus you cannot use a quantifier on expressions with named captures.
So either don’t use named captures:
(?:(\b\w+\b)\s*=\s*("[^"]*"|'[^']*'|[^"'<>\s]+)\s+)+
Or don’t use the quantifier on this expression:
(?<name>\b\w+\b)\s*=\s*(?<value>"[^"]*"|'[^']*'|[^"'<>\s]+)
This does also allow attribute values like bar=' baz='quux:
foo="bar=' baz='quux"
Well the drawback will be that you have to strip the leading and trailing quotes afterwards.
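For illustration, the single-named-group variant might look like this in Python, where named groups are written (?P<name>...). The helper parse_attributes is a hypothetical name; it strips the surrounding quotes afterwards, as mentioned above.

```python
import re

# One name group and one value group, applied repeatedly with finditer
# instead of putting a quantifier around named captures.
ATTR = re.compile(
    r"""(?P<name>\b\w+\b)\s*=\s*(?P<value>"[^"]*"|'[^']*'|[^"'<>\s]+)"""
)


def parse_attributes(tag: str) -> dict:
    attributes = {}
    for match in ATTR.finditer(tag):
        value = match.group("value")
        # Quoted values still carry their quotes, so strip them here.
        if value[:1] in ('"', "'"):
            value = value[1:-1]
        attributes[match.group("name")] = value
    return attributes


print(parse_attributes("""<a href=test.html class='xyz' title="a b">"""))
```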
Just to agree with everyone else: don't parse HTML using regexp.
It isn't possible to create an expression that will pick out attributes for even a correct piece of HTML, never mind all the possible malformed variants. Your regexp is already pretty much unreadable even without trying to cope with the invalid lack of quotes; chase further into the horror of real-world HTML and you will drive yourself crazy with an unmaintainable blob of unreliable expressions.
There are existing libraries to either read broken HTML, or correct it into valid XHTML which you can then easily devour with an XML parser. Use them.
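As one example of such a library, Python ships html.parser in its standard library. The sketch below (the class name AnchorAttributes is made up) collects the attributes of every <a> start tag, including unquoted values and attributes without a value.

```python
from html.parser import HTMLParser


class AnchorAttributes(HTMLParser):
    """Record the attributes of each <a> start tag as a dict."""

    def __init__(self):
        super().__init__()
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None when absent.
        if tag == "a":
            self.anchors.append(dict(attrs))


parser = AnchorAttributes()
parser.feed("""<a href=test.html class=xyz>one</a> """
            """<a href='a b.html' class="x y" download>two</a>""")
parser.close()
print(parser.anchors)
```

The parser deals with quoting rules itself, so the calling code never has to distinguish the three attribute styles.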
PHP (PCRE) and Python
Simple attribute extraction (See it working):
((?:(?!\s|=).)*)\s*?=\s*?["']?((?:(?<=")(?:(?<=\\)"|[^"])*|(?<=')(?:(?<=\\)'|[^'])*)|(?:(?!"|')(?:(?!\/>|>|\s).)+))
Or with tag opening / closure verification, tag name retrieval and comment escaping. This expression foresees unquoted / quoted, single / double quotes, escaped quotes inside attributes, spaces around equals signs, different number of attributes, check only for attributes inside tags, and manage different quotes within an attribute value. (See it working):
(?:\<\!\-\-(?:(?!\-\-\>)\r\n?|\n|.)*?-\-\>)|(?:<(\S+)\s+(?=.*>)|(?<=[=\s])\G)(?:((?:(?!\s|=).)*)\s*?=\s*?[\"']?((?:(?<=\")(?:(?<=\\)\"|[^\"])*|(?<=')(?:(?<=\\)'|[^'])*)|(?:(?!\"|')(?:(?!\/>|>|\s).)+))[\"']?\s*)
(Works better with the "gisx" flags.)
Javascript
As JavaScript regular expressions did not support look-behinds when this was written (they were added in ES2018), most features of the previous expressions are not available there. But in case it might fit someone's needs, you could try this version. (See it working).
(\S+)=[\'"]?((?:(?!\/>|>|"|\'|\s).)+)
This is my best RegEx to extract properties in HTML Tag:
# Trim the match inside of the quotes (single or double)
(\S+)\s*=\s*([']|["])\s*([\W\w]*?)\s*\2
# Without trim
(\S+)\s*=\s*([']|["])([\W\w]*?)\2
Pros:
You are able to trim the content inside of quotes.
Match all the special ASCII characters inside of the quotes.
If you have title="You're mine" the RegEx does not break
Cons:
It returns 3 groups; first the property then the quote ("|') and at the end the property inside of the quotes i.e.: <div title="You're"> the result is Group 1: title, Group 2: ", Group 3: You're.
This is the online RegEx example:
https://regex101.com/r/aVz4uG/13
I normally use this RegEx to extract the HTML Tags:
I recommend this if you don't use a tag type like <div, <span, etc.
<[^/]+?(?:\".*?\"|'.*?'|.*?)*?>
For example:
<div title="a>b=c<d" data-type='a>b=c<d'>Hello</div>
<span style="color: >=<red">Nothing</span>
# Returns
# <div title="a>b=c<d" data-type='a>b=c<d'>
# <span style="color: >=<red">
This is the online RegEx example:
https://regex101.com/r/aVz4uG/15
The bug in this RegEx is:
<div[^/]+?(?:\".*?\"|'.*?'|.*?)*?>
In this tag:
<article title="a>b=c<d" data-type='a>b=c<div '>Hello</article>
Returns <div '> but it should not return any match:
Match: <div '>
To "solve" this remove the [^/]+? pattern:
<div(?:\".*?\"|'.*?'|.*?)*?>
The answer #317081 is good, but it does not match properly in these cases:
<div id="a"> # It returns "a instead of a
<div style=""> # It doesn't match, instead of returning an empty value
<div title = "c"> # It doesn't recognize the spaces around the equals sign (=)
This is the improvement:
(\S+)\s*=\s*["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))?[^"']*)["']?
vs
(\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?
Allow spaces around the equals sign:
(\S+)\s*=\s*((?:...
Replace the last + and . with:
|[>"']))?[^"']*)["']?
This is the online RegEx example:
https://regex101.com/r/aVz4uG/8
Tags and attributes in HTML have the form
<tag
attrnovalue
attrnoquote=bli
attrdoublequote="blah 'blah'"
attrsinglequote='bloob "bloob"' >
To match attributes, you need a regex attr that finds one of the four forms. Then you need to make sure that only matches are reported within HTML tags. Assuming you have the correct regex, the total regex would be:
attr(?=(attr)*\s*/?\s*>)
The lookahead ensures that only other attributes and the closing tag follow the attribute. I use the following regular expression for attr:
\s+(\w+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^><"'\s]+)))?
Unimportant groups are made non-capturing. The first matching group $1 gives you the name of the attribute; the value is one of $2, $3 or $4. I use $2$3$4 to extract the value.
The final regex is
\s+(\w+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^><"'\s]+)))?(?=(?:\s+\w+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^><"'\s]+))?)*\s*/?\s*>)
Note: I removed all unnecessary groups in the lookahead and made all remaining groups non capturing.
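The construction can be tried out in Python. In this sketch the final expression is assembled from the attr pattern and its non-capturing copy, exactly as described above; the names ATTR, ATTR_NOCAP, PATTERN and extract are illustrative only.

```python
import re

# attr: one attribute in any of the four forms (groups 1-4 as in the answer).
ATTR = r"""\s+(\w+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^><"'\s]+)))?"""
# The same pattern without capturing groups, for use inside the lookahead.
ATTR_NOCAP = r"""\s+\w+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^><"'\s]+))?"""
# attr(?=(attr)*\s*/?\s*>): only attributes followed by the tag's end match.
PATTERN = re.compile(ATTR + r"(?=(?:" + ATTR_NOCAP + r")*\s*/?\s*>)")


def extract(tag: str) -> list:
    # Exactly one of groups 2-4 is set when the attribute has a value.
    return [
        (m.group(1), m.group(2) or m.group(3) or m.group(4) or "")
        for m in PATTERN.finditer(tag)
    ]


TAG = """<tag
  attrnovalue
  attrnoquote=bli
  attrdoublequote="blah 'blah'"
  attrsinglequote='bloob "bloob"' >"""

print(extract(TAG))
```

Text that merely looks like an attribute but is not followed by a closing > is rejected by the lookahead.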
splattne,
@VonC's solution partly works, but there is an issue if the tag has a mix of unquoted and quoted attributes.
This one works with mixed attributes
$pat_attributes = "(\S+)=(\"|'| |)(.*)(\"|'| |>)";
to test it out
<?php
$pat_attributes = "(\S+)=(\"|'| |)(.*)(\"|'| |>)";
$code = ' <IMG title=09.jpg alt=09.jpg src="http://example.com.jpg?v=185579" border=0 mce_src="example.com.jpg?v=185579"
';
preg_match_all( "#$pat_attributes#isU", $code, $ms);
var_dump( $ms );
$code = '
<a href=test.html class=xyz>
<a href="test.html" class="xyz">
<a href=\'test.html\' class="xyz">
<img src="http://"/> ';
preg_match_all( "#$pat_attributes#isU", $code, $ms);
var_dump( $ms );
$ms would then contain the keys in its 2nd element and the values in its 4th (the 3rd holds the opening quote, if any).
$keys = $ms[1];
$values = $ms[3];
I suggest that you use HTML Tidy to convert the HTML to XHTML, and then use a suitable XPath expression to extract the attributes.
something like this might be helpful
'(\S+)\s*?=\s*([\'"])(.*?|)\2
If you want to be general, you have to look at the precise specification of the a tag, like here. But even with that, if you do your perfect regexp, what if you have malformed html?
I would suggest to go for a library to parse html, depending on the language you work with: e.g. like python's Beautiful Soup.
If you're in .NET, I recommend the HTML Agility Pack, which is very robust even with malformed HTML.
Then you can use XPath.
I'd reconsider the strategy of using only a single regular expression. Sure, it's a nice game to come up with one single regular expression that does it all. But in terms of maintainability you are about to shoot yourself in both feet.
My adaptation to also get the boolean attributes and empty attributes like:
<input autofocus='' disabled />
/(\w+)=["']((?:.(?!["']\s+(?:\S+)=|\s*\/[>"']))+.)["']|(\w+)=["']["']|(\w+)/g
I also needed this and wrote a function for parsing attributes, you can get it from here:
https://gist.github.com/4153580
(Note: It doesn't use regex)
I have created a PHP function that could extract attributes of any HTML tags. It also can handle attributes like disabled that has no value, and also can determine whether the tag is a stand-alone tag (has no closing tag) or not (has a closing tag) by checking the content result:
/*! Based on <https://github.com/mecha-cms/cms/blob/master/system/kernel/converter.php> */
function extract_html_attributes($input) {
    if( ! preg_match('#^(<)([a-z0-9\-._:]+)((\s)+(.*?))?((>)([\s\S]*?)((<)\/\2(>))|(\s)*\/?(>))$#im', $input, $matches)) return false;
    $matches[5] = preg_replace('#(^|(\s)+)([a-z0-9\-]+)(=)(")(")#i', '$1$2$3$4$5<attr:value>$6', $matches[5]);
    $results = array(
        'element' => $matches[2],
        'attributes' => null,
        'content' => isset($matches[8]) && $matches[9] == '</' . $matches[2] . '>' ? $matches[8] : null
    );
    if(preg_match_all('#([a-z0-9\-]+)((=)(")(.*?)("))?(?:(\s)|$)#i', $matches[5], $attrs)) {
        $results['attributes'] = array();
        foreach($attrs[1] as $i => $attr) {
            $results['attributes'][$attr] = isset($attrs[5][$i]) && ! empty($attrs[5][$i]) ? ($attrs[5][$i] != '<attr:value>' ? $attrs[5][$i] : "") : $attr;
        }
    }
    return $results;
}
Test Code
$test = array(
'<div class="foo" id="bar" data-test="1000">',
'<div>',
'<div class="foo" id="bar" data-test="1000">test content</div>',
'<div>test content</div>',
'<div>test content</span>',
'<div>test content',
'<div></div>',
'<div class="foo" id="bar" data-test="1000"/>',
'<div class="foo" id="bar" data-test="1000" />',
'< div class="foo" id="bar" data-test="1000" />',
'<div class id data-test>',
'<id="foo" data-test="1000">',
'<id data-test>',
'<select name="foo" id="bar" empty-value-test="" selected disabled><option value="1">Option 1</option></select>'
);
foreach($test as $t) {
var_dump($t, extract_html_attributes($t));
echo '<hr>';
}
This works for me. It also takes into consideration some edge cases I have encountered.
I am using this Regex for XML parser
(?<=\s)[^><:\s]*=*(?=[>,\s])
Extract the element:
var buttonMatcherRegExp=/<a[\s\S]*?>[\s\S]*?<\/a>/;
htmlStr=string.match( buttonMatcherRegExp )[0]
Then use jQuery to parse and extract the bit you want:
$(htmlStr).attr('style')
have a look at this
Regex & PHP - isolate src attribute from img tag
perhaps you can walk through the DOM and get the desired attributes. It works fine for me, getting attributes from the body-tag

extract title tag from html

I want to extract the contents of the title tag from an HTML string. I have done some searching, but so far I have not been able to find such code in VB/C# or PHP. It should also work with both upper- and lower-case tags, e.g. with both <title></title> and <TITLE></TITLE>. Thank you.
You can use regular expressions for this but it's not completely error-proof. It'll do if you just want something simple though (in PHP):
function get_title($html) {
return preg_match('!<title>(.*?)</title>!i', $html, $matches) ? $matches[1] : '';
}
Sounds like a job for a regular expression. This will depend on the HTML being well-formed, i.e., only finds the title element inside a head element.
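Regex regex = new Regex( ".*<head>.*<title>(.*?)</title>.*</head>.*",
    RegexOptions.IgnoreCase | RegexOptions.Singleline ); // Singleline lets . match newlines
Match match = regex.Match( html );
string title = match.Groups[1].Value; // Groups[0] is the whole match; Groups[1] is the title text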
I don't have my regex cheat sheet in front of me so it may need a little tweaking. Note that there is also no error checking in the case where no title element exists.
If the title tag has any attributes (which is unlikely but can happen), you need to update the expression as follows:
$title = preg_match('!<title[^>]*>(.*?)</title>!i', $url_content, $matches) ? $matches[1] : '';
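For comparison, a minimal Python sketch of the same idea, assuming a regex is acceptable for the job. It is case-insensitive, tolerates attributes on the tag, and lets the title span several lines; get_title mirrors the name used in the PHP answer, the rest is illustrative.

```python
import re

# \b keeps <titlebar> and similar from matching; [^>]* allows attributes;
# re.S lets . match newlines inside the title text.
TITLE = re.compile(r"<title\b[^>]*>(.*?)</title>", re.I | re.S)


def get_title(html: str) -> str:
    match = TITLE.search(html)
    # Collapse runs of whitespace so multi-line titles come back on one line.
    return " ".join(match.group(1).split()) if match else ""


print(get_title('<HTML><HEAD><TITLE lang="en">My\n  Page</TITLE></HEAD></HTML>'))
```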