RegEx valid relative urls in href and src links of html

I have this RegEx and have tested it against the below dataset:
RegEx: /(href|src)\=\"(?!(ht|f)tp|www|:|\/\/)(\/)?/g
Dataset:
href="/hello
href="hello/bob
href="new/hello/bob
href="hello/test.com/hello
href="abc.hello.com/hello <-- I want to exclude this type of url
href="www.google.com/hello
href="https://www.google.com
href="http://google.com
href="ftp://www.google.com
href="://google.com
href="//google.com
Here is a demo link with the above inputs:
https://regex101.com/r/1mCFWL/4
The issue I am having is that the 5th test item, abc.hello.com/hello, also matches the RegEx, and I would like to exclude all URLs which contain a .com before a /.
I have tried to add a lookahead for this but have been unable to get it working.
Can anyone help improve the above RegEx so that it excludes URLs which contain a .com before a /?
EDIT:
A successful match criterion is matching only the first 4 items in the dataset.

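You may add a [^"\/]*\.com or [^"\/]*\.com(?![^\/]) alternative to the negative lookahead:
(?:href|src)="(?!(?:ht|f)tp|www|:|\/\/|[^"\/]*\.com)
Here, [^"\/]* matches zero or more characters other than " and /, so the \.com alternative only fires when .com occurs before the first /. That is why hello/test.com/hello still matches while abc.hello.com/hello is excluded.
If you append (?![^\/]) after com, the alternative additionally requires com to be followed by / or the end of the string.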
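As a quick check, here is a minimal Python sketch that runs the pattern above against the dataset from the question. The names PATTERN, DATASET and is_relative are only illustrative, and the / characters are left unescaped because Python does not need them escaped.

```python
import re

# Pattern from the answer above; "/" needs no escaping in Python.
PATTERN = re.compile(r'(?:href|src)="(?!(?:ht|f)tp|www|:|//|[^"/]*\.com)')

# Dataset from the question, in the same order.
DATASET = [
    'href="/hello',
    'href="hello/bob',
    'href="new/hello/bob',
    'href="hello/test.com/hello',
    'href="abc.hello.com/hello',
    'href="www.google.com/hello',
    'href="https://www.google.com',
    'href="http://google.com',
    'href="ftp://www.google.com',
    'href="://google.com',
    'href="//google.com',
]


def is_relative(attr: str) -> bool:
    """Return True if the attribute text is accepted by the pattern."""
    return PATTERN.search(attr) is not None


results = [is_relative(item) for item in DATASET]
for item, ok in zip(DATASET, results):
    print("match   " if ok else "no match", item)
```

Only the first four items should be reported as matches, which is the success criterion stated in the question.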

Related

How to use regex (regular expressions) in Notepad++ to remove all HTML and JSON code that does not contain a specific string?

Using regular expressions (in Notepad++), I want to find all JSON sections that contain the string foo. Note that the JSON just happens to be embedded within a limited set of HTML source code which is loaded into Notepad++.
I've written the following regex to accomplish this task:
({[^}]*foo[^}]*})
This works as expected on every input I need to handle.
I want to improve my workflow, so instead of just finding all such JSON sections, I want to write a regex to remove all the HTML & JSON that does not match this expression. The result will be only JSON sections that contain foo.
I tried using the Notepad++ regex Replace functionality with this find expression:
(?:({[^}]*?foo[^}]*?})|.)+
and this replace expression:
$1\n\n$2\n\n$3\n\n$4\n\n$5\n\n$6\n\n$7\n\n$8\n\n$9\n\n
This successfully works for the last occurrence of foo within the JSON, but does not find the rest of the occurrences.
How can I improve my code to find all the occurrences?
Here is a simplified minimal example of input and desired output. I hope I haven't simplified it too much for it to be useful:
Simplified input:
<!DOCTYPE html>
<html>
<div dat="{example foo1}"> </div>
<div dat="{example bar}"> </div>
<div dat="{example foo2}"> </div>
</html>
Desired output:
{example foo1}
{example foo2}

You can use
{[^}]*foo[^}]*}|((?s:.))
Replace with (?1:$0\n). Details:
{[^}]*foo[^}]*} - {, zero or more chars other than }, foo, zero or more chars other than } and then a }
| - or
((?s:.)) - Capturing group 1: any one char ((?s:...) is an inline modifier group where . matches all chars including line break chars, same as if you enabled . matches newline option).
The (?1:$0\n) replacement pattern replaces with an empty string if Group 1 was matched, else the replacement is the match text + a newline.
In the Replace dialog, set Search Mode to Regular expression.
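The same idea can be tried outside Notepad++. The sketch below is a Python analogue, not the Notepad++ syntax: Python's re module has no conditional replacement such as (?1:$0\n), so a replacement function plays that role instead. The names PATTERN, keep_foo_sections and SAMPLE are illustrative.

```python
import re

# Either a {...} section containing "foo", or any single character
# (captured in group 1, with DOTALL so "." also matches line breaks).
PATTERN = re.compile(r'\{[^}]*foo[^}]*\}|(.)', re.DOTALL)


def keep_foo_sections(text: str) -> str:
    """Drop everything except {...} sections that contain 'foo'."""
    def repl(match: re.Match) -> str:
        # Group 1 matched: a stray character, replace it with nothing.
        if match.group(1) is not None:
            return ""
        # Otherwise the whole match is a wanted section: keep it plus a newline.
        return match.group(0) + "\n"

    return PATTERN.sub(repl, text)


SAMPLE = '''<!DOCTYPE html>
<html>
<div dat="{example foo1}"> </div>
<div dat="{example bar}"> </div>
<div dat="{example foo2}"> </div>
</html>
'''

print(keep_foo_sections(SAMPLE), end="")
```

On the simplified input from the question this prints the two desired lines, {example foo1} and {example foo2}.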

Update
The comment section was full, so I am suggesting the code here instead.
Let me know if this is close to your intended result:
Find: ({.+?[\n]*foo[ \d]*})|.*?
Replace all: $1
I also added Toto's example.

RegEx matching for HTML and non-HTML URLs

I'm trying to extract all URLs from this text, both absolute and relative, but I can't find the right regular expression. My expression matches more than I would like: it picks up HTML tags and other information that I do not want.
Attempt
(\w*.)(\\\/){1,}(.*)(?![^"])
Input
<div class=\"loader\">\n <div class=\"loaderImage\"><img src=\"\/c\/Community\/Rating\/img\/loader.gif\" \/><\/div>\n <\/div>\n<\/div>\n<\/div><\/span><\/span>\n
<a title=\"Avengers\" href=\"\/pt\/movie\/Avengers\/57689\" >Avengers<\/a> <\/div>\n
<img title=\"\" alt=\"\" id=\"145793\" src=\"https:\/\/images04-cdn.google.com\/movies\/74932\/74932_02\/previews\/2\/128\/top_1_307x224\/74932_02_01.jpg\" class=\"tlcImageItem img\" width=\"307\" height=\"224\" \/>
pageLink":"\/pt\/videos\/\/updates\/1\/0\/Category\/0","previousPage":"\/pt\/videos\/\/updates\/1\/0\/Category\/0","nextUrl":"\/pt\/videos\/\/updates\/2\/0\/Category\/0","method":"updates","type":"scenes","callbackJs"
<span class=\"value\">4<\/span>\n <\/div>\n <\/div>\n <div class=\"loader\">\n <div class=\"loaderImage\"><img src=\"\/c\/Community\/Rating\/img\/loader.gif\" \/><\/div>\n <\/div>\n<\/div>\n<\/div><\/span><\/span>
Demo

As has been commented, RegEx may not be the best tool for this problem. However, if you wish to practice, or you really have to use it, you can do an exact match between the "" where your URLs are located. You can bound them on the left using src, href, or any other fixed components that you may have. Simply list them in the first group (), separated by |.
RegEx 1 for HTML URLs
This RegEx may not be the right solution, but it might give you a perspective on how to approach this problem using RegEx:
(src=|href=)(\\")([a-zA-Z\\\/0-9\.\:_-]+)(")
It creates four groups to make it simpler to update, and the $3 group should hold your desired URLs. You can add any other chars that your URLs might contain to the third group.
RegEx 2 for both HTML and non-HTML URLs
To capture the other, non-HTML URLs, you can extend it similar to this RegEx:
(src=\\|href=\\|pageLink\x22:|previousPage\x22:|nextUrl\x22:)(")([a-zA-Z\\\/0-9\.\:_-]+)(")
where \x22 stands for ", which you can simply replace with a literal ". I have used \x22 only so that you can see the " between which your target URLs are located.
The second RegEx also has four groups, where the target group is $3. You can also simplify or DRY it, if you wish.
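To see what group 3 actually holds, here is a small Python sketch applying the second RegEx to two lines shortened from the question's input. The helper extract_urls and the clean-up step are my own additions: for the HTML attributes, group 3 also swallows the backslash that precedes the closing ", so the sketch strips it and turns \/ back into /.

```python
import re

# Second RegEx from the answer, unchanged (\x22 is a double quote).
PATTERN = re.compile(
    r'(src=\\|href=\\|pageLink\x22:|previousPage\x22:|nextUrl\x22:)'
    r'(")([a-zA-Z\\\/0-9\.\:_-]+)(")'
)


def extract_urls(text: str) -> list:
    """Return group 3 of every match, with the JSON-style escaping removed."""
    urls = []
    for match in PATTERN.finditer(text):
        raw = match.group(3)
        # Un-escape "\/" and drop the backslash left over before the closing quote.
        urls.append(raw.replace('\\/', '/').rstrip('\\'))
    return urls


# Shortened excerpts of the input shown in the question.
HTML_SAMPLE = r'<a title=\"Avengers\" href=\"\/pt\/movie\/Avengers\/57689\" >Avengers<\/a>'
JSON_SAMPLE = (
    r'pageLink":"\/pt\/videos\/\/updates\/1\/0\/Category\/0",'
    r'"nextUrl":"\/pt\/videos\/\/updates\/2\/0\/Category\/0"'
)

print(extract_urls(HTML_SAMPLE))
print(extract_urls(JSON_SAMPLE))
```

The title attribute is skipped because title is not listed in the first group; adding more attribute names there extends the match in the way the answer describes.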

Select only URLs separated by commas with REGEX

My objective is to put all the URLs between "" so I'm trying to select them without the comma , then I will use the regular expression to do a large search/replace.
My current REGEX: "BigImage":\s(\[(.*)\])
I tried this but it doesn't work: "BigImage":\s(\[([^,]+)\])
"BigImage": [http://example.com/1.jpg,http://example.com/2.jpg,http://example.com/3.jpg]
Example: https://regex101.com/r/nE5eV3/30

You can write a regex for your URLs; I don't know whether they always look the same. For your links the regex would look like this:
(https?://(www\.)?[a-zA-Z0-9]*\.[a-zA-Z]{2,4}/[^\.]*\.(jpg|jpeg|png|gif))
This regex will match all of the URLs you posted in your question.
To match the full block:
("BigImage": \[([^,\]]*,?)*\])
If you want to filter the whole string you posted above, you can use this second regex.
If you post a more complete example of your data, we can help you more.
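For the stated goal of putting every URL between "", a search/replace over the URL matches may be enough. The following Python sketch assumes a slightly adjusted form of the URL regex (the optional prefix written as (?:www\.)? and non-capturing groups); URL, quote_urls and LINE are illustrative names.

```python
import re

# URL regex along the lines of the answer; the (?:www\.)? prefix is an assumption.
URL = re.compile(
    r'https?://(?:www\.)?[a-zA-Z0-9]*\.[a-zA-Z]{2,4}/[^.]*\.(?:jpg|jpeg|png|gif)'
)


def quote_urls(line: str) -> str:
    """Wrap every matched URL in double quotes, leaving the commas untouched."""
    return URL.sub(lambda m: '"' + m.group(0) + '"', line)


# Line taken from the question.
LINE = '"BigImage": [http://example.com/1.jpg,http://example.com/2.jpg,http://example.com/3.jpg]'

print(quote_urls(LINE))
```

Because the replacement only touches the matched URLs, the brackets and the comma separators are preserved as they are.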

Regex matching Google Cache url (matching entire href parameter when it contains a word)

Disclaimer: I know that html and regex should not stand together, but this is an exceptional case.
I need to parse Google Search results and extract cache urls. I have this in the page:
<a href="/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:
gsNKb7ku3ewJ:somedata&ei=MyIIUtrZAcPX7AaVzIHwDg&ved=0CB8QIDAC&usg
=AFQjCNGcnWfdzQiTKwyAMmI-M-xzxII5Ag">Cached</a>
I tried simple stuff like: href=[\'"]?([^\'" >]+) but it is not what I need. I want to extract a single parameter (q) from the href. I need to get:
http://webcache.googleusercontent.com/search%3Fq%3Dcache:gsNKb7ku3ewJ:somedata
So I need everything between "url?q=" and the first "&", but only when that content contains the word "webcache".

If your language supports positive look-behinds:
(?<=q=).*?(?=[&"])
Otherwise match group \1 with this expression:
(?:q=)(.*?)(?=[&"])
Explanation:
.*? is the body of our expression. Just match everything, but don't be greedy!
(?<=q=) is a positive look-behind, which says "q=" should come before the match
(?=[&"]) is a positive look ahead, which says "either & or a quote should come after the match"
Because we make it not greedy with the ?, it'll stop at the first quote or ampersand. Otherwise it'd match all of the way to the closing quote.
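A short Python sketch comparing the two variants on the link from the question. It assumes the href sits on a single line (the line breaks in the question are treated as wrapping); HREF, with_lookbehind and with_group are illustrative names.

```python
import re

# The link from the question, joined into one line (an assumption).
HREF = (
    '<a href="/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:'
    'gsNKb7ku3ewJ:somedata&ei=MyIIUtrZAcPX7AaVzIHwDg&ved=0CB8QIDAC&usg'
    '=AFQjCNGcnWfdzQiTKwyAMmI-M-xzxII5Ag">Cached</a>'
)

# Variant 1: lookbehind, the whole match is the value of q.
with_lookbehind = re.search(r'(?<=q=).*?(?=[&"])', HREF).group(0)

# Variant 2: no lookbehind, the value of q is in group 1.
with_group = re.search(r'(?:q=)(.*?)(?=[&"])', HREF).group(1)

print(with_lookbehind)
print(with_group)
```

Both variants stop at the first & because .*? is lazy. The later q%3D inside the value is URL-encoded, so it is not mistaken for another q=.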

Use a look behind before, and a look ahead at the end to assert the surrounding text, and include the keyword in the regex:
(?<=url\?q=)[^&]*webcache[^&]*(?=&)
Using [^&]* ensures that the keyword occurs before an & - within the target string.
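The keyword requirement can be checked with a second link that lacks webcache. In this Python sketch the second link (OTHER) is hypothetical and only there to show the non-matching case; CACHE_Q and cache_url are illustrative names.

```python
import re
from typing import Optional

# Pattern from the answer above, unchanged.
CACHE_Q = re.compile(r'(?<=url\?q=)[^&]*webcache[^&]*(?=&)')


def cache_url(html: str) -> Optional[str]:
    """Return the q parameter if it contains 'webcache', else None."""
    match = CACHE_Q.search(html)
    return match.group(0) if match else None


# Link from the question, shortened and joined into one line.
CACHED = (
    '<a href="/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:'
    'gsNKb7ku3ewJ:somedata&ei=MyIIUtrZAcPX7AaVzIHwDg">Cached</a>'
)
# Hypothetical link without the keyword.
OTHER = '<a href="/url?q=http://example.com/page&ei=abc">Other</a>'

print(cache_url(CACHED))
print(cache_url(OTHER))
```

Note that the trailing (?=&) means a q value that is the last parameter, with no & after it, would not be matched by this pattern.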

Using regex to extract information from a file and need help

I have an html file that has a table of information and I'm trying to extract specific columns. The pattern is like this with alternating "TableDarkRow" and "TableLightRow":
'>817338284254611</A></td><td Class='TableDarkRow' NOWRAP> 01/14/2011</td>
And I'm trying to extract an array of number and date pairs :
817338284254611
01/14/2011
I tried and came up with this:
>([0-9])+</A>(.*)NOWRAP> ?([0-9]{2}\/[0-9]{2}\/[0-9]{4})
But the (.*) is allowing the entire document to be selected between the first and last occurrences.

Replace the .* with .*? for non-greedy matching.
Reference: Watch Out for The Greediness!
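The difference is easy to see on two rows placed on one line. In this Python sketch the second row and the HREF values are made up for illustration, and the + is moved inside the first group, ([0-9]+), so that the whole number is captured rather than only its last digit; that change is mine, not part of the answer.

```python
import re

DATE = r"([0-9]{2}/[0-9]{2}/[0-9]{4})"

# Greedy: .* runs to the last NOWRAP on the line.
GREEDY = re.compile(r">([0-9]+)</A>.*NOWRAP> ?" + DATE)
# Non-greedy: .*? stops at the first NOWRAP after each number.
LAZY = re.compile(r">([0-9]+)</A>.*?NOWRAP> ?" + DATE)

# Two rows on one line; the second row is hypothetical.
ROWS = (
    "<A HREF='a'>817338284254611</A></td><td Class='TableDarkRow' NOWRAP> 01/14/2011</td>"
    "<A HREF='b'>817338284254612</A></td><td Class='TableLightRow' NOWRAP> 01/15/2011</td>"
)

greedy_pairs = GREEDY.findall(ROWS)
lazy_pairs = LAZY.findall(ROWS)

print(greedy_pairs)
print(lazy_pairs)
```

The greedy pattern yields a single pair that joins the first number with the last date, while the non-greedy one yields one pair per row.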

Try this one (untested):
/[0-9\/ ]+/

You can replace .* with [A-Za-z'<>/= \t]+ (the class needs / and = as well, so that it can cover </td><td Class='TableDarkRow' ).
