How to use regex (regular expressions) in Notepad++ to remove all HTML and JSON code that does not contain a specific string?

Using regular expressions (in Notepad++), I want to find all JSON sections that contain the string foo. Note that the JSON just happens to be embedded within a limited set of HTML source code which is loaded into Notepad++.
I've written the following regex to accomplish this task:
({[^}]*foo[^}]*})
This works as expected on every input I need to handle.
I want to improve my workflow, so instead of just finding all such JSON sections, I want to write a regex to remove all the HTML & JSON that does not match this expression. The result will be only JSON sections that contain foo.
I tried using the Notepad++ regex Replace functionality with this find expression:
(?:({[^}]*?foo[^}]*?})|.)+
and this replace expression:
$1\n\n$2\n\n$3\n\n$4\n\n$5\n\n$6\n\n$7\n\n$8\n\n$9\n\n
This works for the last occurrence of foo within the JSON, but misses all the other occurrences.
How can I improve my code to find all the occurrences?
Here is a simplified minimal example of input and desired output. I hope I haven't simplified it too much for it to be useful:
Simplified input:
<!DOCTYPE html>
<html>
<div dat="{example foo1}"> </div>
<div dat="{example bar}"> </div>
<div dat="{example foo2}"> </div>
</html>
Desired output:
{example foo1}
{example foo2}

You can use
{[^}]*foo[^}]*}|((?s:.))
Replace with (?1:$0\n). Details:
{[^}]*foo[^}]*} - {, zero or more chars other than }, foo, zero or more chars other than } and then a }
| - or
((?s:.)) - Capturing group 1: any one char ((?s:...) is an inline modifier group where . matches all chars including line break chars, same as if you enabled . matches newline option).
The (?1:$0\n) replacement pattern replaces with an empty string if Group 1 was matched, else the replacement is the match text + a newline.
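For readers working outside Notepad++, the same keep-or-delete idea can be sketched in Python. This is only a minimal sketch and not part of the original answer: the function name keep_foo_sections is hypothetical, and a re.sub callback stands in for the conditional replacement (?1:$0\n), which Python's re module does not offer.

```python
import re

# First alternative: a {...} section that contains "foo".
# Second alternative (group 1): any single character, newlines included.
PATTERN = re.compile(r"\{[^}]*foo[^}]*\}|(.)", re.DOTALL)


def keep_foo_sections(text: str) -> str:
    """Keep only the {...} sections containing foo, one per line."""
    def replace(match):
        if match.group(1) is not None:
            # Group 1 matched: a stray character, so delete it.
            return ""
        # Otherwise the whole match is a wanted section: keep it plus a newline.
        return match.group(0) + "\n"

    return PATTERN.sub(replace, text)


sample = """<!DOCTYPE html>
<html>
<div dat="{example foo1}"> </div>
<div dat="{example bar}"> </div>
<div dat="{example foo2}"> </div>
</html>"""

print(keep_foo_sections(sample), end="")
```

On the simplified input from the question, this prints the two sections containing foo, each on its own line.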

Updates
The comment section was full, so I am suggesting a pattern here instead.
Let me know if this is close to your intended result:
Find: ({.+?[\n]*foo[ \d]*})|.*?
Replace all: $1
Also added Toto's example

Related

regex interprete markdown but ignore HTML

In a string like
Hallo, this is <code>`code`</code> and this `is code again`.
Can I parse it with a regex to analyse it?
In this example the user has just typed the rightmost `. The first "code" has obviously already been surrounded by HTML.
I need a regex that finds the next part marked as code.
There will always be exactly one sequence that is valid markdown AND not already surrounded by the corresponding HTML tags.
How can I get this specific sequence (regardless of whether it is delimited by *, **, ___, ` or anything else)?
So what you want is a regex that only matches the markdown that isn't surrounded by HTML tags, right?
You can use something like this:
/(?:[^<>]|^)(`[^<>].*?`)/
This only matches backtick-delimited text whose opening ` is neither directly preceded nor directly followed by a < or > character; the text itself is captured in group 1. This way, no matter what the HTML tag inside the <...> is, the `code` won't match.
See this Regex101.com
If you want to match every emphasized string that is not tagged with "code" you can use
(?<!<code>)`[\w ]+`
You can test it on regex101.com
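As a quick check of the lookbehind pattern, here is a minimal Python sketch run against the sample string from the question. The constant name NOT_YET_TAGGED is hypothetical; the pattern itself is taken unchanged from the answer above.

```python
import re

# Backtick-delimited text whose opening backtick does not directly follow <code>.
NOT_YET_TAGGED = re.compile(r"(?<!<code>)`[\w ]+`")

text = "Hallo, this is <code>`code`</code> and this `is code again`."
matches = NOT_YET_TAGGED.findall(text)
print(matches)
```

Only the second, not yet tagged, span is reported. Note that [\w ]+ limits the content to word characters and spaces, so a span containing punctuation would be skipped.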

Extracting content of HTML tag with specific attribute

Using regular expressions, I need to extract the multiline content of a tag that has a specific id value. How can I do this?
This is what I currently have:
<div(.|\n)*?id="${value}"(.|\n)*?>(.|\n)*?<\/div>
The problem with this is this sample:
<div id="1">test</div><div id="2">test</div>
If I want to replace id="2" using this regexp (with ${value} = 2), the whole string would get matched. This is because from the tag opening to closing I match everything until id is found, which is wrong.
How can I do this?
A fairly simple way is to use
Raw: <div(?=\s)[^>]*?\sid="2"[^>]*?>([\S\s]*?)</div>
Delimited: /<div(?=\s)[^>]*?\sid="2"[^>]*?>([\S\s]*?)<\/div>/
Use the variable in place of 2.
The content will be in group 1.
Change (.|\n) to [^>] so it won't match the > that ends the tag. Then it can't match across different divs.
<div\b[^>]*\bid="${value}"[^>]*>.*?<\/div>
Also, instead of using (.|\n)* to match across multiple lines, use the s modifier to the regexp. This makes . match any character, including newlines.
However, using regular expressions to parse HTML is not very robust. You should use a DOM parser.
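To illustrate the [^>]-based pattern with a variable id, here is a minimal Python sketch. The function name div_content is hypothetical, and re.escape is added on the assumption that the id value may contain regex metacharacters.

```python
import re


def div_content(html: str, value: str):
    """Return the content of the first <div> whose id equals value, or None."""
    pattern = re.compile(
        r'<div(?=\s)[^>]*?\sid="'
        + re.escape(value)           # insert the id value literally
        + r'"[^>]*?>([\S\s]*?)</div>'  # group 1: content, may span lines
    )
    match = pattern.search(html)
    return match.group(1) if match else None


html = '<div id="1">test</div><div id="2">line one\nline two</div>'
print(div_content(html, "2"))
```

Because [^>] cannot cross the end of the opening tag, the search for id="2" no longer starts inside the first div. The lazy content group still stops at the first </div>, so nested divs are not handled, which is one reason a DOM parser is the more robust choice.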

Select only URLs separated by commas with REGEX

My objective is to wrap every URL in double quotes, so I am trying to select the URLs without the separating commas; I will then use the regular expression in a large search/replace.
My current REGEX: "BigImage":\s(\[(.*)\])
I tried this but it doesn't work: "BigImage":\s(\[([^,]+)\])
"BigImage": [http://example.com/1.jpg,http://example.com/2.jpg,http://example.com/3.jpg]
Example: https://regex101.com/r/nE5eV3/30
You can write a regex for your URLs. I don't know whether they always look the same, but for your links the regex would look like this:
(https?://(www)?[a-zA-Z0-9]*\.[a-zA-Z]{2,4}/[^\.]*\.(jpg|jpeg|png|gif))
This regex matches all of the URLs you posted in your question.
Full Blocks:
("BigImage": \[([^,\]]*,?)*\])
If you want to filter the string you posted above, you can use the regex above.
If you post a more complete example of your data, we can help you more.
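If a script is an option, the quoting itself can be sketched in Python instead of a single search/replace expression. This is a minimal sketch under two assumptions: the URLs never contain commas, and the list has the "BigImage": [...] shape shown in the question. The names BIG_IMAGE and quote_urls are hypothetical.

```python
import re

# Group 1: the key and spacing; group 2: everything between the brackets.
BIG_IMAGE = re.compile(r'("BigImage":\s*)\[([^\]]*)\]')


def quote_urls(text: str) -> str:
    """Wrap each comma-separated URL of a BigImage list in double quotes."""
    def replace(match):
        urls = [u.strip() for u in match.group(2).split(",") if u.strip()]
        quoted = ",".join(f'"{u}"' for u in urls)
        return f"{match.group(1)}[{quoted}]"

    return BIG_IMAGE.sub(replace, text)


line = '"BigImage": [http://example.com/1.jpg,http://example.com/2.jpg,http://example.com/3.jpg]'
print(quote_urls(line))
```

Splitting on the comma inside a callback avoids having to match a variable number of URLs with one pattern.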

Compare two HTML documents ignoring multiple and trailing whitespaces

Is there a tool that treats an HTML document like:
<p b="1" a="0 "> a b
c </p>
(as a C string: "<p> a b\nc </p>") as equal to:
<p a="0 " b="1">a b c</p>
Note how:
multiple whitespace characters in text were converted to a single space
newlines were converted to spaces
leading and trailing whitespace in text was stripped
attributes were put in a standard order
attribute values were unchanged, including trailing whitespace
Why I want that
I am working on the Markdown Test Suite that aims to measure markdown engine compliance and portability.
We have markdown input, expected HTML output, and want to determine if the generated HTML output is equal to the expected one.
The problem is that Markdown is underspecified, so we cannot compare directly the two HTML strings.
The actual test code is here, just modify run-tests.py#dom_normalize if you want to try out your solution.
Things I tried
BeautifulSoup. Orders the attributes, but does not deal well with whitespace?
A function formatter regex modification might work, but I don't see a way to differentiate between the inside of nodes and attributes.
A Python-only solution like this would be ideal.
Looking for a JavaScript function similar to isEqualNode() (which does not work because it ignores nodeValue) plus some headless JS engine. Couldn't find one.
If there is nothing better, I'll just have to write my own output formatter front-end to some HTML parser.
I ended up cooking up a custom HTML renderer that normalizes things based on Python's stdlib HTMLParser.
You can see it at: https://github.com/karlcow/markdown-testsuite/blob/749ed0b812ffcb8b6cc56f93ff94c6fdfb6bd4a2/run-tests.py#L20
Usage and docstring tests at: https://github.com/karlcow/markdown-testsuite/blob/749ed0b812ffcb8b6cc56f93ff94c6fdfb6bd4a2/run-tests.py#L74
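For readers who cannot follow the links, the general idea of normalizing with the stdlib HTMLParser can be sketched as follows. This is not the code from the linked repository: the class Normalizer and the function normalize are hypothetical names, and the sketch makes one loud simplification, namely that whitespace is stripped at the edges of every text node, so it does not preserve spaces around inline tags.

```python
from html import escape
from html.parser import HTMLParser


class Normalizer(HTMLParser):
    """Re-emit HTML with sorted attributes and collapsed text whitespace."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        rendered = ""
        # Sort by attribute name; values are kept untouched, spaces included.
        for name, value in sorted(attrs, key=lambda pair: pair[0]):
            rendered += ' {}="{}"'.format(name, escape(value or "", quote=True))
        self.parts.append("<{}{}>".format(tag, rendered))

    def handle_endtag(self, tag):
        self.parts.append("</{}>".format(tag))

    def handle_data(self, data):
        # split() drops leading/trailing whitespace and collapses inner runs.
        text = " ".join(data.split())
        if text:
            self.parts.append(escape(text, quote=False))


def normalize(html: str) -> str:
    parser = Normalizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(normalize('<p b="1" a="0 "> a  b\nc </p>'))
```

Two documents can then be compared by checking whether their normalized strings are equal.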

Yahoo Pipes and Regex with an html formatting issue

I am struggling to see how to use the regex to add a non-printable carriage return character into an html string.
It's a WordPress requirement: to auto-embed a video, the URL must be on its own line in the HTML.
First I use a regex:
In item.vid_src replace ($) with \\r$1
s is checked.
After that, I use a loop containing a string builder to prefix vid_src to the start of description, thus:
item.vid_src
<br><br>
item.description
assign results to item.description
Before I include the Regex module in the pipe I get this:
http://www.youtube.com/watch?v=THA_5cqAfCQ<br><br><p><h1 class="MsoNormal">Cheetahs on
the edge</h1>
But I need this:
http://www.youtube.com/watch?v=THA_5cqAfCQ
<br><br><p><h1 class="MsoNormal">Cheetahs on the edge</h1>
Adding the regex module I get this:
http://www.youtube.com/watch?v=THA_5cqAfCQ\r<br><br><p><h1 class="MsoNormal">
Cheetahs on the edge</h1>
Clearly it is inserting exactly what I asked for, but that is not what I was expecting: I need the HTML to contain an actual newline at that point. Does anybody have any insight into how to tackle the problem?