Select only URLs separated by commas with REGEX - json

My objective is to wrap each URL in double quotes, so I'm trying to select the URLs without the commas; I will then use the regular expression to do a large search/replace.
My current REGEX: "BigImage":\s(\[(.*)\])
I tried this but it doesn't work: "BigImage":\s(\[([^,]+)\])
"BigImage": [http://example.com/1.jpg,http://example.com/2.jpg,http://example.com/3.jpg]
Example: https://regex101.com/r/nE5eV3/30

You can write a regex for your URLs, though I don't know whether they always look the same. For your links the regex would look like this:
(https?://(www)?[a-zA-Z0-9]*\.[a-zA-Z]{2,4}/[^\.]*\.(jpg|jpeg|png|gif))
This regex will match all of the URLs you posted in your question.
Full Blocks:
("BigImage": \[([^,\]]*,?)*\])
If you want to filter the string you posted above, you can use the regex above.
Tested with this site!
If you post a more complete example of your data, we can help you more.
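As a rough illustration of the search/replace the question is aiming for, here is a minimal Python sketch. It assumes Python's re module rather than an editor, that each BigImage list contains no closing bracket inside it, and that the URLs contain no commas and are not already quoted. The name quote_urls is a hypothetical helper, not something from the question.

```python
import re

# Group 1: the key and opening bracket. Group 2: the raw, unquoted list body.
BLOCK = re.compile(r'("BigImage":\s*\[)([^\]]*)\]')

def quote_urls(text):
    """Wrap every comma-separated item of each BigImage list in double quotes."""
    def fix(match):
        # Split the list body on commas and drop empty fragments.
        items = [u.strip() for u in match.group(2).split(",") if u.strip()]
        return match.group(1) + ",".join('"' + u + '"' for u in items) + "]"
    return BLOCK.sub(fix, text)

sample = '"BigImage": [http://example.com/1.jpg,http://example.com/2.jpg,http://example.com/3.jpg]'
result = quote_urls(sample)
print(result)
```

Splitting the captured list body on commas avoids having to describe a URL with a regex at all, which may be more robust if the URLs do not always look the same.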

Related

How to use regex (regular expressions) in Notepad++ to remove all HTML and JSON code that does not contain a specific string?

Using regular expressions (in Notepad++), I want to find all JSON sections that contain the string foo. Note that the JSON just happens to be embedded within a limited set of HTML source code which is loaded into Notepad++.
I've written the following regex to accomplish this task:
({[^}]*foo[^}]*})
This works as expected in all the input that is possible.
I want to improve my workflow, so instead of just finding all such JSON sections, I want to write a regex to remove all the HTML & JSON that does not match this expression. The result will be only JSON sections that contain foo.
I tried using the Notepad++ regex Replace functionality with this find expression:
(?:({[^}]*?foo[^}]*?})|.)+
and this replace expression:
$1\n\n$2\n\n$3\n\n$4\n\n$5\n\n$6\n\n$7\n\n$8\n\n$9\n\n
This successfully works for the last occurrence of foo within the JSON, but does not find the rest of the occurrences.
How can I improve my code to find all the occurrences?
Here is a simplified minimal example of input and desired output. I hope I haven't simplified it too much for it to be useful:
Simplified input:
<!DOCTYPE html>
<html>
<div dat="{example foo1}"> </div>
<div dat="{example bar}"> </div>
<div dat="{example foo2}"> </div>
</html>
Desired output:
{example foo1}
{example foo2}
You can use
{[^}]*foo[^}]*}|((?s:.))
Replace with (?1:$0\n). Details:
{[^}]*foo[^}]*} - {, zero or more chars other than }, foo, zero or more chars other than } and then a }
| - or
((?s:.)) - Capturing group 1: any one char ((?s:...) is an inline modifier group where . matches all chars including line break chars, same as if you enabled . matches newline option).
The (?1:$0\n) replacement pattern replaces with an empty string if Group 1 was matched, else the replacement is the match text + a newline.
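The same idea can be sketched outside Notepad++. Python's re module has no (?1:...) conditional replacement, so this sketch uses a replacement function to make the same decision; keep_foo_blocks is a hypothetical name, and the input is the simplified example from the question.

```python
import re

# Same alternation as above: a {...foo...} block, or any single character (group 1).
PATTERN = re.compile(r"\{[^}]*foo[^}]*\}|(.)", re.DOTALL)

def keep_foo_blocks(text):
    # If group 1 matched, drop that character; otherwise keep the block plus a newline.
    return PATTERN.sub(
        lambda m: "" if m.group(1) is not None else m.group(0) + "\n", text
    )

html = """<!DOCTYPE html>
<html>
<div dat="{example foo1}"> </div>
<div dat="{example bar}"> </div>
<div dat="{example foo2}"> </div>
</html>
"""
output = keep_foo_blocks(html)
print(output)
```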
See the demo and search and replace dialog settings:
Updates
The comment section was full, so I am suggesting some code here.
Let me know if this is close to your intended result:
Find: ({.+?[\n]*foo[ \d]*})|.*?
Replace all: $1
Also added Toto's example

RegEx valid relative urls in href and src links of html

I have this RegEx and have tested it against the below dataset:
RegEx: /(href|src)\=\"(?!(ht|f)tp|www|:|\/\/)(\/)?/g
Dataset:
href="/hello
href="hello/bob
href="new/hello/bob
href="hello/test.com/hello
href="abc.hello.com/hello <-- I want to exclude this type of url
href="www.google.com/hello
href="https://www.google.com
href="http://google.com
href="ftp://www.google.com
href="://google.com
href="//google.com
Here is a demo link with the above inputs:
https://regex101.com/r/1mCFWL/4
The issue I am having is that the 5th test item abc.hello.com/hello also matches the RegEx and I would like to exclude all URLs which contain a .com before a /.
I am trying to do a lookup ahead but have been unable to get this working.
Can anyone help improve the above RegEx to add support to exclude URLs which contain a .com before a /?
EDIT:
A successful match criterion is matching only the first 4 items in the dataset.
You may add [^"\/]*\.com or [^"\/]*\.com(?![^\/]) alternative to the negative lookahead:
(?:href|src)="(?!(?:ht|f)tp|www|:|\/\/|[^"\/]*\.com)
See the regex demo and the Regulex graph:
The (?![^\/]) will require / or end of string if you add that pattern after com.
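To check the extended lookahead against the dataset from the question, a small Python sketch follows. It assumes Python's re module (the question uses JavaScript-style delimiters and the g flag, which are dropped here) and strips the inline "<--" note from the fifth item.

```python
import re

# The pattern from the answer, with the extra [^"\/]*\.com alternative in the lookahead.
PATTERN = re.compile(r'(?:href|src)="(?!(?:ht|f)tp|www|:|\/\/|[^"\/]*\.com)')

dataset = [
    'href="/hello',
    'href="hello/bob',
    'href="new/hello/bob',
    'href="hello/test.com/hello',
    'href="abc.hello.com/hello',
    'href="www.google.com/hello',
    'href="https://www.google.com',
    'href="http://google.com',
    'href="ftp://www.google.com',
    'href="://google.com',
    'href="//google.com',
]

# Keep only the items the pattern accepts as relative URLs.
matched = [item for item in dataset if PATTERN.search(item)]
for item in matched:
    print(item)
```

Only the first four items are kept: hello/test.com/hello still matches because its .com appears after the first /, while abc.hello.com/hello is rejected.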

My input pattern doesn't work

I've created a regex for checking a date format ( 01-01-0000 to 31-12-9999).
I tried an example regex, and it works, so there is something wrong with my regex, but when I try it in a debugger (regexr) it works just fine.
What am I missing?
([0]{1}[1-9]{1}|[1-2]{1}[0-9]{1}|[3]{1}[0-1]{1})(\-)([0]{1}[1-9]{1}|[1]{1}[0-2]{1})(\-)\d{4}
New regex after edit:
(0[1-9]|[12][0-9]|3[01])-(0[1-9]|1[0-2])-\d{4}
I use an html input type text, and put the regex in pattern ="my pattern".
Thanks in advance (:
Edit: Fixed the regex according to Casimir et Hippolyte's comment, and now it works.
Your regex looks OK; at least it matches both of your sample dates (tested on regex101.com).
You can simplify it a little:
No need for [...] around a single char (e.g. change [0] to 0).
No need for capturing groups around a dash (e.g. change (-) to -).
It is strange that you used capturing groups for day and month but not for the year field (I added one in the example below).
So try the following regex:
(0[1-9]|[12][0-9]|3[01])-(0[1-9]|1[0-2])-(\d{4})
It is, however, not clear whether you really need capturing groups.
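For a quick check outside the browser, here is a minimal Python sketch of the simplified regex. It uses re.fullmatch on the assumption that this mirrors how an HTML pattern attribute is applied, which, as far as I know, must match the entire input value. The names is_valid_date and results are hypothetical.

```python
import re

# Day 01-31, month 01-12, four-digit year, separated by dashes.
DATE = re.compile(r"(0[1-9]|[12][0-9]|3[01])-(0[1-9]|1[0-2])-(\d{4})")

def is_valid_date(value):
    # fullmatch requires the whole string to match, not just a part of it.
    return DATE.fullmatch(value) is not None

samples = ["01-01-0000", "31-12-9999", "32-01-2020", "00-01-2020", "15-13-2020", "1-1-2020"]
results = {value: is_valid_date(value) for value in samples}
for value, ok in results.items():
    print(value, ok)

# The capturing groups give access to the individual fields.
day, month, year = DATE.fullmatch("31-12-9999").groups()
```

Note that the pattern only checks the shape of the value, so a date such as 31-02-2020 is still accepted.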

Regex find two characters in order, between others, ignoring punctuation

I'm trying to filter using regex in mySQL.
The field is a text field and I want to find all that match 'MD' or similar ('M.D.', 'M. D.', 'DDS, M.D.' etc.).
I do not want to accept those that contain M and D as a part of another acronym (e.g., 'DMD'). However 'DMD, M.D.' I would want to find.
Apologies if this is a simple task - I read through some regex tutorials and couldn't figure this out! Thanks.
Update:
With help from the suggestions I arrived at the following solution:
(\s|^)M\.?\s*D\.?
which works for all of my cases. The quotes in my questions were to indicate it was a string, they are not a part of the string.
You can use a regex like this:
\b(M\.?\s*D\.?|D\.?\s*D\.?\s*S\.?)
Working demo
If I have understood your requirement:
'([^'.]*[ ,]*M[. ]*D[. ]*)'
This looks for MD preceded by a space, a comma or ', with the letters separated by zero or more dots and spaces, followed by '.
It matches all the contents between the ' marks.
test: https://regex101.com/r/oV2kV8/2
In the end I found this solution works:
(\s|^)M\.?\s*D\.?(\s|$)
This allows for the 'MD' to be at the start or after another credential and to have spaces or periods or nothing between the letters.
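The final pattern can be checked against the examples from the question with a short Python sketch. This assumes Python's re module, so behavior in MySQL's own regex engine may differ; the sample list is taken from the question text.

```python
import re

# MD at the start or after whitespace, optional periods and spaces between the
# letters, and then whitespace or the end of the string.
MD = re.compile(r"(\s|^)M\.?\s*D\.?(\s|$)")

samples = ["MD", "M.D.", "M. D.", "DDS, M.D.", "DMD", "DMD, M.D."]

# Collect the strings in which the pattern is found anywhere.
hits = [s for s in samples if MD.search(s)]
for s in samples:
    print(repr(s), s in hits)
```

Everything except 'DMD' is found, since the M in 'DMD' is neither at the start of the string nor preceded by whitespace.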

How do I check if values between html tags are blank or empty using regular expressions in notepad plus plus

I'm conducting a mass search of files in Notepad++ and I need to determine if there are no values between a set of tags (i.e. <author></author>).
".*?" will search for 0 or more characters (well, most), which is fine. But I'm looking for a set of tags with at least one character between them.
".+?" is similar to the above and does work in notepad++.
I tried the following, which was unsuccessful:
<author>.{0}?</author>
Thank you for any help.
Since you look for something that doesn't exist you don't have to make it that complicated. Simply searching for <author></author> would do the trick, wouldn't it? If you want to include space-characters as "nothing" you could modify it to the following:
<author>\s*?</author>
Output:
<author></author> Match
<author> </author> Match
<author>something</author> No match
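The three cases listed above can be reproduced with a small Python sketch, assuming Python's re module behaves like the Notepad++ engine for this simple pattern.

```python
import re

# Matches an author element that is empty or contains only whitespace.
EMPTY_AUTHOR = re.compile(r"<author>\s*?</author>")

lines = [
    "<author></author>",
    "<author> </author>",
    "<author>something</author>",
]

# Pair each line with whether the pattern is found in it.
results = [(line, EMPTY_AUTHOR.search(line) is not None) for line in lines]
for line, is_empty in results:
    print(line, "Match" if is_empty else "No match")
```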
I don't understand why you are using the "?" operator; ".+" should yield the result you need.