Regular expression to find URLs not inside a hyperlink - html

There are many regexes out there to match a URL. However, I'm trying to match URLs that do not appear anywhere within an <a> hyperlink tag (HREF, inner value, etc.). So NONE of the URLs in these should match:
something
http://www.example2.com
<b>something</b>http://www.example.com/<span>test</span>
Any URL outside of <a></a> should be matched.
One approach I tried was to use a negative lookahead to see if the first <a> tag after the URL was an opening <a> or a closing </a>. If it is a closing </a> then the URL must be inside a hyperlink. I think this idea was okay, but the negative lookahead regex didn't work (or more accurately, the regex wasn't written correctly). Any tips are very appreciated.

I was looking for this answer as well, and because nothing out there really worked the way I wanted it to, this is the regex that I created. Obviously, since it's a regex, be aware that this is not a perfect solution.
/(?!<a[^>]*>[^<])(((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[\w]*))?))(?![^<]*<\/a>)/gi
And the whole function to update the HTML is:
function linkifyWithRegex(input) {
  let html = input;
  let regx = /(?!<a[^>]*>[^<])(((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[\w]*))?))(?![^<]*<\/a>)/gi;
  html = html.replace(
    regx,
    function (match) {
      // Wrap each bare URL in an anchor that points at the URL itself.
      return '<a href="' + match + '">' + match + '</a>';
    }
  );
  return html;
}

You can do it in two steps instead of trying to come up with a single regular expression:
1. Strip out (replace with nothing) the HTML anchor part (the entire anchor tag: opening tag, content and closing tag).
2. Match the URL.
In Perl it could be:
my $curLine = $_; # Do not change $_ if it is needed for something else.
$curLine =~ s/<a[^<]+<\/a>//g; # Remove every HTML anchor: "<a", "</a>" and everything in between.
if ($curLine =~ /http:\/\//)
{
    print "Matched a URL outside an HTML anchor: $_\n";
}
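The same two-step idea can be sketched in Python. This is only a rough illustration: the function name, the sample string, and the deliberately simple URL pattern are assumptions of mine, and the anchor pattern assumes anchors are not nested and that attribute values contain no ">".

```python
import re
from typing import List

# Step 1 pattern: a whole anchor element, from "<a" through "</a>".
ANCHOR_RE = re.compile(r"<a\b[^>]*>.*?</a>", re.IGNORECASE | re.DOTALL)

# Step 2 pattern: a deliberately simple URL matcher (an assumption for this sketch).
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def urls_outside_anchors(html: str) -> List[str]:
    """Return URLs that do not appear inside any <a>...</a> element."""
    without_anchors = ANCHOR_RE.sub("", html)  # step 1: drop the anchors
    return URL_RE.findall(without_anchors)     # step 2: match what is left


sample = (
    'See <a href="http://inside.example/">http://also-inside.example/</a> '
    "and http://outside.example/page for details."
)
print(urls_outside_anchors(sample))  # ['http://outside.example/page']
```

Unlike the Perl pattern, `.*?` here also tolerates tags nested inside the anchor, such as <b> or <span>.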

You can do that using a single regular expression that matches both anchor tags and hyperlinks:
# Note that this is a dummy, you'll need a more sophisticated URL regex
regex = '(<a[^>]+>)|(http://.*)'
Then loop over the results and only process matches where the second sub-pattern was found.
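As a rough Python sketch of that loop (my own names and a simplified URL pattern, so treat it as an assumption rather than a drop-in solution): here the first alternative swallows the whole anchor element instead of just the opening tag, so URLs in the anchor's text are consumed too, and only matches where the second group fired are kept.

```python
import re
from typing import List

# Group 1 consumes a complete anchor element; group 2 captures a bare URL.
COMBINED_RE = re.compile(
    r"(<a\b[^>]*>.*?</a>)|(https?://[^\s<>\"']+)",
    re.IGNORECASE | re.DOTALL,
)


def bare_urls(html: str) -> List[str]:
    """Keep only the matches where the URL alternative (group 2) was found."""
    return [m.group(2) for m in COMBINED_RE.finditer(html) if m.group(2)]


sample = 'x <a href="http://in.example/">http://in2.example/</a> y http://out.example/z'
print(bare_urls(sample))  # ['http://out.example/z']
```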

Peter has a great answer: first, remove anchors so that
Some text TeXt and some more text with link http://a.net
is replaced by
Some text and some more text with link http://a.net
THEN run a regexp that finds urls:
http://a.net

Use the DOM to filter out the anchor elements, then do a simple URL regex on the rest.
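For example, a minimal Python sketch of this idea could use the standard library's html.parser instead of a full DOM. The class and function names and the simple URL pattern are my own assumptions:

```python
import re
from html.parser import HTMLParser
from typing import List

# Simplified URL pattern, assumed for illustration only.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


class TextOutsideAnchors(HTMLParser):
    """Collect text nodes that are not inside any <a> element."""

    def __init__(self) -> None:
        super().__init__()
        self.anchor_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":  # HTMLParser lowercases tag names
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        if self.anchor_depth == 0:
            self.chunks.append(data)


def urls_outside_anchors(html: str) -> List[str]:
    parser = TextOutsideAnchors()
    parser.feed(html)
    parser.close()
    found: List[str] = []
    for chunk in parser.chunks:
        found.extend(URL_RE.findall(chunk))
    return found


sample = '<p>http://out.example/a <a href="http://in.example/"><b>x</b> http://in2.example/</a></p>'
print(urls_outside_anchors(sample))  # ['http://out.example/a']
```

Because the parser tracks the open anchor, URLs nested inside <b> or <span> within a link are skipped as well.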

^.*<(a|A){1,1} -> scan until <a or <A is found
.*(href|HREF){1,1}\= -> scan until href= or HREF=
\x22{1,1}.*\x22 -> accept all characters between two quotes
> -> look for >
.+(<\/a>|<\/A>){1,1} -> accept description and end anchor tag
$ -> End of string search
pattern= "^.*<(a|A){1,1}.*(href|HREF){1,1}.*\=.*\x22{0,1}.*\x22{0,1}.*>.+(<\/a>|<\/A>){1,1}$"

Related

How to wrap links in <a> with Notepad++ Find/Replace function?

I have a text document with raw links (not wrapped) and I would like to wrap them in HTML anchor tags.
Link example:
http://example.com/images/my-image.jpg
Desired output:
<a href="http://example.com/images/my-image.jpg">http://example.com/images/my-image.jpg</a>
I can FIND the links in Notepad++ using the following RegEx:
[-a-zA-Z0-9@:%_\+.~#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?(\?([-a-zA-Z0-9@:%_\+.~#?&//=]+)|)
However, the REPLACE string I'm trying is not working for some reason:
\1
How can I do this with notepad++?
You need to replace with the backreference to the whole match:
$&
Or
$0
Here, the $0 and $& "insert" the text that is matched by the whole regular expression, not just by some capturing groups.
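For comparison outside Notepad++, Python's re module spells the whole-match backreference as \g<0>. A small sketch, where the URL pattern is a simplified stand-in of mine and not the regex from the question:

```python
import re

# Simplified URL pattern, assumed for illustration only.
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def wrap_links(text: str) -> str:
    # \g<0> inserts the entire match, like $0 or $& in Notepad++.
    return URL_RE.sub(r'<a href="\g<0>">\g<0></a>', text)


print(wrap_links("http://example.com/images/my-image.jpg"))
```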

Removing single and double quote from html attributes with no white spaces on all attributes except href and src

I'm trying to remove single and double quotes from html attributes that are single words with no white spaces. I wrote this regex which does work:
/((type|title|data-toggle|colspan|scope|role|media|name|rel|id|class|rel)\s*(=)\s*)(\"|\')(\S+)(\"|\')/ims
However, instead of specifying all the HTML attributes that I want to remove the quotes on, I'd rather just list the couple of attributes to ignore, like src and href, and remove the quotes on all other attribute names. So I wrote the one below, but for the life of me it doesn't work. It somehow has to detect any attribute name except href and src. I tried all kinds of combinations.
/((?!href|src)(\S)+\s*(=)\s*)(\"|\')(\S+)(\"|\')/i
I've tried this but it doesn't work: it just removes the h and s from the href and src attributes. I know I'm close but I'm missing something. I spent a good 5 hours on this.
working example
$html_code = 'your html code here.';
$html_code = preg_replace('/((type|title|data-toggle|colspan|scope|role|media|name|rel|id|class|rel)\s*(=)\s*)(\"|\')(\S+)(\"|\')/i', '$1$5', $html_code);
I modified the smaller RegEx you wrote, resulting in this:
((\S)+\s*(?<!href)(?<!src)(=)\s*)(\"|\')(\S+)(\"|\')
When your version is parsed, the lookahead will arrive at some 'h' preceding a 'href' in your document and fail, then proceed to the next character. Since 'ref' doesn't match 'href' or 'src', the rest of your pattern will match.
With my modifications, any 'href' or 'src' will be initially accepted by the regex. When the lookbehind is reached, it will check for 'href' in the already parsed text and will fail if it is found.
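To see the difference in practice, here is a small Python check of both patterns. The sample tag and variable names are mine, and Python's re engine is only assumed to behave like PCRE for these particular constructs:

```python
import re

# The question's pattern: negative lookahead placed before the attribute name.
LOOKAHEAD = re.compile(
    r"""((?!href|src)(\S)+\s*(=)\s*)("|')(\S+)("|')""", re.IGNORECASE
)
# The modified pattern: negative lookbehind checked just before the "=".
LOOKBEHIND = re.compile(
    r"""((\S)+\s*(?<!href)(?<!src)(=)\s*)("|')(\S+)("|')""", re.IGNORECASE
)

sample = '<a href="x.html" class="btn">'

# The lookahead fails at "h" but succeeds one character later at "ref=...",
# so the href value still loses its quotes.
print(LOOKAHEAD.sub(r"\1\5", sample))   # <a href=x.html class=btn>

# The lookbehind rejects every match whose "=" directly follows "href" or "src".
print(LOOKBEHIND.sub(r"\1\5", sample))  # <a href="x.html" class=btn>
```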
Also, instead of filtering on the href or src attribute name, it would be preferable to filter on = instead. Here is a regex that does this (it also presumes that all attributes use double quotes):
// Remove the double quotes around attribute values that have no space and no `=` character.
$html = preg_replace('/((\S)+\s*(=)\s*)(\")(\S+(?<!=.))(\")/', '$1$5', $html);

Regex to find content, then backtrack to initial HTML tag

I'm trying to use regex to match a string that starts with a <p> tag and has some specific content. Then, I want to replace everything from that specific paragraph tag to the end of the page.
I've tried using the expression <p.*?some content.*</html>, but it grabs the first <p> tag it sees, then follows through all the way to the end. I want it to only recognize the paragraph tag immediately preceding the content, allowing for other content and tags between the paragraph tag and the content.
How can I get to some specific content with the regex, then backtrack to the first paragraph tag it sees before the content, and then select everything from there to the end?
If it helps, I'm using EditPad Pro's "Search & Replace" function (although this could apply to anything that uses regex).
For simple input use regex
<p[^<]*some content.*<\/html>
but safer would be to use regex
<p(?:[^<]*|<(?!p\b))*some content.*<\/html>
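A quick way to try the safer idea is a Python sketch like the one below. The helper name and sample markup are my own, and I use the slightly simplified form (?:[^<]|<(?!p\b))* of the same tempered pattern to keep backtracking manageable:

```python
import re


def replace_from_paragraph(html: str, content: str, replacement: str) -> str:
    """Replace everything from the <p> that precedes `content` through </html>."""
    pattern = re.compile(
        # After "<p", consume anything that does not open another <p> tag.
        r"<p\b(?:[^<]|<(?!p\b))*" + re.escape(content) + r".*</html>",
        re.DOTALL,
    )
    # A function replacement avoids backslash processing in `replacement`.
    return pattern.sub(lambda _m: replacement, html, count=1)


page = "<html><p>first</p><p>intro <b>some content</b></p><div>tail</div></html>"
print(replace_from_paragraph(page, "some content", "<p>new</p></html>"))
# <html><p>first</p><p>new</p></html>
```

The first paragraph is left alone because the lookahead refuses to step over the second <p> on the way to the content.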
To start, this is Java code, but it can be easily adapted to other regex engines / programming languages, I suppose.
So from what I understand, you want a situation where a given input has a part that starts with <p> and immediately followed by some target content/phrase. You then want to replace everything following the initial <p> tag with some other content?
If that is correct, you could do something like this:
String input; // holds your input text/html
String targetPhrase = "some specific content"; // some target content/phrase
String replacement; // holds the replacement value
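Pattern p = Pattern.compile("<p[^>]*>(" + Pattern.quote(targetPhrase) + ".*)$", Pattern.CASE_INSENSITIVE | Pattern.DOTALL); // DOTALL lets .* run across line breaks
Matcher m = p.matcher(input);
String output = m.replaceFirst(replacement); // replaceFirst returns the modified string; input is unchanged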
Of course, as mentioned in above comments, you really don't want to be using regex for HTML.
Alternatively, if you know that if the <p> tag is just that, with no properties or anything, you could try a substring instead.
So for example, if you're looking for "<p>some specific content", you could try something like:
String input; // your input text/html
String replacement; // the replacement value(s)
int index = input.indexOf("<p>some specific content");
if (index > -1) {
String output = input.substring(0, index);
output += "<p>" + replacement;
// now output holds your modified text/html
}

Regex select all text between tags

What is the best way to select all the text between 2 tags - ex: the text between all the '<pre>' tags on the page.
You can use "<pre>(.*?)</pre>", (replacing pre with whatever text you want) and extract the first group (for more specific instructions specify a language) but this assumes the simplistic notion that you have very simple and valid HTML.
As other commenters have suggested, if you're doing something complex, use an HTML parser.
The closing tag may be on another line, which is why \n needs to be added.
<PRE>(.|\n)*?<\/PRE>
To exclude the delimiting tags:
(?<=<pre>)(.*?)(?=</pre>)
(?<=<pre>) looks for text after <pre>
(?=</pre>) looks for text before </pre>
The result will be the text inside the pre tag.
This is what I would use.
(?<=(<pre>))(\w|\d|\n|[().,\-:;@#$%^&*\[\]"'+–/\/®°⁰!?{}|`~]| )+?(?=(</pre>))
Basically what it does is:
(?<=(<pre>)) The selection has to be preceded by the <pre> tag.
(\w|\d|\n|[().,\-:;@#$%^&*\[\]"'+–/\/®°⁰!?{}|`~]| ) This is just the regular expression I want to apply. In this case, it selects a letter, a digit, a newline character, or one of the special characters listed in the square brackets. The pipe character | simply means "OR".
+? The plus character states to select one or more of the above; order does not matter. The question mark changes the default behavior from 'greedy' to 'ungreedy'.
(?=(</pre>)) The selection has to be followed by the </pre> tag.
Depending on your use case you might need to add some modifiers like (i or m)
i - case-insensitive
m - multi-line search
Here I performed this search in Sublime Text so I did not have to use modifiers in my regex.
JavaScript did not support lookbehind before ES2018
The above example should work fine with languages such as PHP, Perl, Java ...
Older JavaScript engines, however, do not support lookbehind (it was added in ES2018), so there we have to forget about using `(?<=(<pre>))` and look for some kind of workaround. Perhaps simply strip the opening tag's characters from the start of each result, like here:
https://stackoverflow.com/questions/11592033/regex-match-text-between-tags
Also look at the JAVASCRIPT REGEX DOCUMENTATION for non-capturing parentheses
use the below pattern to get content between element. Replace [tag] with the actual element you wish to extract the content from.
<[tag]>(.+?)</[tag]>
Sometimes tags have attributes, like an anchor tag having href; in that case use the pattern below.
<[tag][^>]*>(.+?)</[tag]>
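As an illustration, the attribute-tolerant pattern could be wrapped in a small Python helper. The function name is hypothetical, and the IGNORECASE and DOTALL flags are my additions so that upper-case tags and multi-line content also match:

```python
import re
from typing import List


def tag_contents(html: str, tag: str) -> List[str]:
    """Return the text between each <tag ...> and </tag> pair (no nesting support)."""
    pattern = re.compile(
        r"<{0}\b[^>]*>(.+?)</{0}>".format(re.escape(tag)),
        re.IGNORECASE | re.DOTALL,
    )
    return pattern.findall(html)


print(tag_contents('<a href="/x">one</a> <A>two</A>', "a"))  # ['one', 'two']
```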
This answer supposes support for look around! This allowed me to identify all the text between pairs of opening and closing tags. That is all the text between the '>' and the '<'. It works because look around doesn't consume the characters it matches.
(?<=>)([\w\s]+)(?=<\/)
I tested it in https://regex101.com/ using this HTML fragment.
<table>
<tr><td>Cell 1</td><td>Cell 2</td><td>Cell 3</td></tr>
<tr><td>Cell 4</td><td>Cell 5</td><td>Cell 6</td></tr>
</table>
It's a game of three parts: the look behind, the content, and the look ahead.
(?<=>) # look behind (but don't consume/capture) for a '>'
([\w\s]+) # capture/consume any combination of alpha/numeric/whitespace
(?=<\/) # look ahead (but don't consume/capture) for a '</'
I hope that serves as a starter for 10. Good luck.
This seems to be the simplest regular expression of all that I found
(?:<TAG>)([\s\S]*)(?:<\/TAG>)
Exclude opening tag (?:<TAG>) from the matches
Include any whitespace or non-whitespace characters ([\s\S]*) in the matches
Exclude closing tag (?:<\/TAG>) from the matches
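One thing worth checking with this pattern is greediness: ([\s\S]*) runs to the last closing tag in the input. A short Python comparison (the sample string is mine, with <b> standing in for TAG) shows the greedy form next to a lazy variant:

```python
import re

html = "<b>one</b> and <b>two</b>"

# Greedy: a single match that spans from the first <b> to the last </b>.
greedy = re.findall(r"(?:<b>)([\s\S]*)(?:</b>)", html)
# Lazy: stops at the first closing tag, giving one match per pair.
lazy = re.findall(r"(?:<b>)([\s\S]*?)(?:</b>)", html)

print(greedy)  # ['one</b> and <b>two']
print(lazy)    # ['one', 'two']
```

With only one pair of tags in the input, both forms return the same result.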
You shouldn't be trying to parse HTML with regexes; see this question and how it turned out.
In the simplest terms, HTML is not a regular language, so you can't fully parse it with regular expressions.
Having said that, you can parse subsets of HTML when there are no similar tags nested. So as long as nothing between the opening and closing tags is that same tag, this will work:
preg_match('/<([\w]+)[^>]*>(.*?)<\/\1>/', $subject, $matches); // single quotes: in a double-quoted PHP string "\1" is an octal escape, not a backreference
$matches = array ( [0] => full matched string [1] => tag name [2] => tag content )
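A better idea is to use a parser, like the native DOMDocument, to load your HTML, then select your tag and get its content, which might look something like this:
$obj = new DOMDocument();
$obj->loadHTML($html); // loadHTML() parses a string; load() expects a file path
$nodes = $obj->getElementsByTagName('el'); // returns a DOMNodeList
$value = $nodes->length > 0 ? $nodes->item(0)->nodeValue : null; // nodeValue is a property, not a method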
And since this is a proper parser it will be able to handle nesting tags etc.
Try this....
(?<=\<any_tag\>)(\s*.*\s*)(?=\<\/any_tag\>)
var str = "Lorem ipsum <pre>text 1</pre> Lorem ipsum <pre>text 2</pre>";
str.replace(/<pre>(.*?)<\/pre>/g, function(match, g1) { console.log(g1); });
Adding this since the accepted answer has no JavaScript code.
preg_match_all('/<pre>([^>]*?)<\/pre>/', $content, $matches);
This regex selects everything between the tags, even when the content spans multiple lines.
In Python, setting the DOTALL flag will capture everything, including newlines.
If the DOTALL flag has been specified, this matches any character including a newline. docs.python.org
#example.py using Python 3.7.4
import re
str="""Everything is awesome! <pre>Hello,
World!
</pre>
"""
# Normally (.*) will not capture newlines, but here re.DOTALL is set
pattern = re.compile(r"<pre>(.*)</pre>", re.DOTALL)
matches = pattern.search(str)
print(matches.group(1))
python example.py
Hello,
World!
Capturing text between all opening and closing tags in a document
To capture text between all opening and closing tags in a document, finditer is useful. In the example below, three opening and closing <pre> tags are present in the string.
#example2.py using Python 3.7.4
import re
# str contains three <pre>...</pre> tags
str = """In two different ex-
periments, the authors had subjects chat and solve the <pre>Desert Survival Problem</pre> with a
humorous or non-humorous computer. In both experiments the computer made pre-
programmed comments, but in study 1 subjects were led to believe they were interact-
ing with another person. In the <pre>humor conditions</pre> subjects received a number of funny
comments, for instance: “The mirror is probably too small to be used as a signaling
device to alert rescue teams to your location. Rank it lower. (On the other hand, it
offers <pre>endless opportunity for self-reflection</pre>)”."""
# Normally (.*) will not capture newlines, but here re.DOTALL is set
# The question mark in (.*?) indicates non greedy matching.
pattern = re.compile(r"<pre>(.*?)</pre>", re.DOTALL)
matches = pattern.finditer(str)
for i, match in enumerate(matches):
    print(f"tag {i}: {match.group(1)}")
python example2.py
tag 0: Desert Survival Problem
tag 1: humor conditions
tag 2: endless opportunity for self-reflection
(?<=>)[^<]+
for Notepad++
>([^<]+)
for AutoIt (option Return array of global matches).
or
(?=>([^<]+))
https://regex101.com/r/VtmEmY/1
To select all text between pre tag I prefer
preg_match('#<pre>([\w\W\s]*)</pre>#',$str,$matches);
$matches[0] will have results including <pre> tag
$matches[1] will have all the content inside <pre>.
DomDocument cannot work in situations where the requirement is to get text with tag details within the searched tag as it strips all tags, nodeValue & textContent will only return text without tags & attributes.
test.match(/<pre>(.*?)<\/pre>/g)?.map((a) => a.replace(/<pre>|<\/pre>/g, ""))
This should be the preferred solution, especially if you have multiple pre tags in the context.
You can use Pattern pattern = Pattern.compile( "[^<'tagname'/>]" );
const content = '<p class="title responsive">ABC</p>';
const blog = {content};
const re = /<([^> ]+)([^>]*)>([^<]+)(<\/\1>)/;
const matches = content.match(re);
console.log(matches[3]);
matches[3] is the content text, and this works for any tag name, with or without classes (nested structures are not supported).
How about:
<PRE>(\X*?)<\/PRE>
More complex than PyKing's answer but matches any type of tag (except self-closing) and considers cases where the tag has HTML-like string attributes.
/<TAG_NAME(?:STRING|NOT_CLOSING_TAG_NOT_QUOTE)+>INNER_HTML<\/\1 *>/g
Raw: /<([^\s</>]+)(?:("(?:[^"\\]|\\.)*")|[^>"])+>(.*?)<\/\1 *>/g
group #1 = tag name
group #2 = string attr
group #3 = inner html
JavaScript code testing it:
let TAG_NAME = '([^\\s</>]+)'; // backslash doubled so the RegExp receives \s rather than a plain "s"
let NOT_CLOSING_TAG_NOT_QUOTE = '[^>"]';
let STRING = '("(?:[^"\\\\]|\\\\.)*")';
let NON_SELF_CLOSING_HTML_TAG =
// \1 is a back reference to TAG_NAME
`<${TAG_NAME}(?:${STRING}|${NOT_CLOSING_TAG_NOT_QUOTE})+>(.*?)</\\1 *>`;
let tagRegex = new RegExp(NON_SELF_CLOSING_HTML_TAG, 'g');
let myStr = `Aenean <abc href="/life<><>\\"<?/abc></abc>"><a>life</a></abc> sed consectetur.
<a href="/work">Work Inner HTML</a> quis risus eget <a href="/about">about inner html</a> leo.
interacted with any of the <<<ve text="<></ve>>">abc</ve>`;
let matches = myStr.match(tagRegex);
// Removing 'g' flag to match each tag part in the for loop
tagRegex = new RegExp(NON_SELF_CLOSING_HTML_TAG);
for (let i = 0; i < matches.length; i++) {
  let tagParts = matches[i].match(tagRegex);
  console.log(`Tag #${i} = [${tagParts[0]}]`);
  console.log(`Tag #${i} name: [${tagParts[1]}]`);
  console.log(`Tag #${i} string attr: [${tagParts[2]}]`);
  console.log(`Tag #${i} inner html: [${tagParts[3]}]`);
  console.log('');
}
Output:
Tag #0 = [<abc href="/life<><>\"<?/abc></abc>"><a>life</a></abc>]
Tag #0 name: [abc]
Tag #0 string attr: ["/life<><>\"<?/abc></abc>"]
Tag #0 inner html: [<a>life</a>]
Tag #1 = [<a href="/work">Work Inner HTML</a>]
Tag #1 name: [a]
Tag #1 string attr: ["/work"]
Tag #1 inner html: [Work Inner HTML]
Tag #2 = [<a href="/about">about inner html</a>]
Tag #2 name: [a]
Tag #2 string attr: ["/about"]
Tag #2 inner html: [about inner html]
Tag #3 = [<ve text="<></ve>>">abc</ve>]
Tag #3 name: [ve]
Tag #3 string attr: ["<></ve>>"]
Tag #3 inner html: [abc]
This doesn't work if:
The tag has any descendant tag of the same type.
The tag starts on one line and ends on another. (In my case I remove line breaks from the HTML.)
If you change (.*?)<\/\1 *> to ([\s\S]*?)<\/\1 *> it should match the tag's inner html even if everything is not in the same line. For some reason it didn't work for me on Chrome and Node but worked here with the JavaScript's Regex Engine:
https://www.regextester.com
Regex: <([^\s</>]+)(?:("(?:[^"\\]|\\.)*")|[^>"])+>([\s\S]*?)<\/\1 *>
Test String:
Aenean lacinia <abc href="/life<><><?/a></a>">
<a>life</a></abc> sed consectetur.
Work quis risus eget urna mollis ornare about leo.
interacted with any of the <<<ve text="<></ve>>">abc</ve>
For multiple lines:
<htmltag>(.+)((\s)+(.+))+</htmltag>
I use this solution:
preg_match_all( '/<((?!<)(.|\n))*?\>/si', $content, $new);
var_dump($new);
In Javascript (among others), this is simple. It covers attributes and multiple lines:
/<pre[^>]*>([\s\S]*?)<\/pre>/
<pre>([\r\n\s]*(?!<\w+.*[\/]*>).*[\r\n\s]*|\s*[\r\n\s]*)<code\s+(?:class="(\w+|\w+\s*.+)")>(((?!<\/code>)[\s\S])*)<\/code>[\r\n\s]*((?!<\w+.*[\/]*>).*|\s*)[\r\n\s]*<\/pre>

extract title tag from html

I want to extract the contents of the title tag from an HTML string. I have done some searching, but so far I have not been able to find such code in VB/C# or PHP. It should also work with both upper- and lower-case tags, e.g. with both <title></title> and <TITLE></TITLE>. Thank you.
You can use regular expressions for this but it's not completely error-proof. It'll do if you just want something simple though (in PHP):
function get_title($html) {
return preg_match('!<title>(.*?)</title>!i', $html, $matches) ? $matches[1] : '';
}
Sounds like a job for a regular expression. This will depend on the HTML being well-formed, i.e., only finds the title element inside a head element.
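Regex regex = new Regex( ".*<head>.*<title>(.*)</title>.*</head>.*",
    RegexOptions.IgnoreCase | RegexOptions.Singleline ); // Singleline lets . match newlines
Match match = regex.Match( html );
string title = match.Groups[1].Value; // Groups[0] is the whole match; Groups[1] is the title text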
I don't have my regex cheat sheet in front of me so it may need a little tweaking. Note that there is also no error checking in the case where no title element exists.
If there is any attribute in the title tag (which is unlikely but can happen) you need to update the expression as follows:
$title = preg_match('!<title[^>]*>(.*?)</title>!i', $url_content, $matches) ? $matches[1] : '';
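For readers outside PHP and C#, roughly the same approach in Python might look like this. It is only a sketch with names of my own choosing; it tolerates attributes on the tag, upper- or lower-case tag names, and titles that span several lines:

```python
import re

# [^>]* allows attributes on the tag; DOTALL lets the title span several lines.
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def get_title(html: str) -> str:
    """Return the stripped title text, or an empty string if there is no title."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else ""


print(get_title('<HTML><HEAD><TITLE lang="en">\n My Page </TITLE></HEAD></HTML>'))  # My Page
```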