Regex to interpret Markdown but ignore HTML

In a string like
Hallo, this is <code>`code`</code> and this `is code again`.
how can I find the Markdown that still has to be converted, using a regex?
In this example the user has just typed the final ` on the far right. The first "code" has obviously already been wrapped in HTML.
I need a regex that finds the next part marked as code.
There will always be exactly one sequence that is valid Markdown AND not already surrounded by the corresponding HTML tags.
How can I get this specific sequence (regardless of whether it is delimited by *, **, ___, ` or anything else)?

So what you want is a regex that only matches the Markdown that isn't surrounded by HTML tags, right?
You can use something like this:
/(?:[^<>]|^)(`[^<>].*?`)/
This only matches text between backticks when the opening backtick is neither directly preceded nor directly followed by a < or > character. This way, no matter which HTML tag sits inside the <...>, the already tagged `code` won't match.
See this Regex101.com
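As a quick check of the pattern above, here is a minimal Python sketch that runs it against the string from the question. The variable names are only illustrative, and Python's `re` module is assumed here in place of whatever regex engine the asker uses.

```python
import re

# Pattern from the answer: a backtick span whose opening backtick is
# neither preceded nor immediately followed by "<" or ">".
UNTAGGED_CODE = re.compile(r"(?:[^<>]|^)(`[^<>].*?`)")

text = "Hallo, this is <code>`code`</code> and this `is code again`."

match = UNTAGGED_CODE.search(text)
# Group 1 holds the span itself, without the character in front of it.
untagged = match.group(1) if match else None
print(untagged)  # `is code again`
```

The span inside <code>...</code> is skipped because its opening backtick follows ">" and its closing backtick is followed by "<".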

If you want to match every backtick-delimited string that is not directly preceded by a <code> tag, you can use
(?<!<code>)`[\w ]+`
You can test it on regex101.com
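The lookbehind variant can be tried the same way. This is a small sketch assuming Python's `re` module (which supports fixed-width lookbehind); note that `[\w ]+` only allows word characters and spaces between the backticks, so spans containing punctuation are not matched.

```python
import re

# Pattern from the answer: a backtick span that is not directly
# preceded by the literal text "<code>".
NOT_AFTER_CODE_TAG = re.compile(r"(?<!<code>)`[\w ]+`")

text = "Hallo, this is <code>`code`</code> and this `is code again`."
spans = NOT_AFTER_CODE_TAG.findall(text)
print(spans)  # ['`is code again`']

# Limitation of [\w ]+: punctuation inside the span prevents a match.
punctuated = NOT_AFTER_CODE_TAG.findall("see `a-b` here")
```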

Related

How to use regex (regular expressions) in Notepad++ to remove all HTML and JSON code that does not contain a specific string?

Using regular expressions (in Notepad++), I want to find all JSON sections that contain the string foo. Note that the JSON just happens to be embedded within a limited set of HTML source code which is loaded into Notepad++.
I've written the following regex to accomplish this task:
({[^}]*foo[^}]*})
This works as expected for all possible inputs.
I want to improve my workflow, so instead of just finding all such JSON sections, I want to write a regex to remove all the HTML & JSON that does not match this expression. The result will be only JSON sections that contain foo.
I tried using the Notepad++ regex Replace functionality with this find expression:
(?:({[^}]*?foo[^}]*?})|.)+
and this replace expression:
$1\n\n$2\n\n$3\n\n$4\n\n$5\n\n$6\n\n$7\n\n$8\n\n$9\n\n
This successfully works for the last occurrence of foo within the JSON, but does not find the rest of the occurrences.
How can I improve my code to find all the occurrences?
Here is a simplified minimal example of input and desired output. I hope I haven't simplified it too much for it to be useful:
Simplified input:
<!DOCTYPE html>
<html>
<div dat="{example foo1}"> </div>
<div dat="{example bar}"> </div>
<div dat="{example foo2}"> </div>
</html>
Desired output:
{example foo1}
{example foo2}
You can use
{[^}]*foo[^}]*}|((?s:.))
Replace with (?1:$0\n). Details:
{[^}]*foo[^}]*} - {, zero or more chars other than }, foo, zero or more chars other than } and then a }
| - or
((?s:.)) - Capturing group 1: any one char ((?s:...) is an inline modifier group where . matches all chars including line break chars, same as if you enabled . matches newline option).
The (?1:$0\n) replacement pattern replaces with an empty string if Group 1 was matched, else the replacement is the match text + a newline.
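The conditional replacement `(?1:$0\n)` is specific to Notepad++ (Boost regex). To illustrate the same idea outside the editor, here is a rough Python sketch: Python's `re` has no conditional replacement syntax, so a replacement function is used instead. The function name `keep_foo_blocks` is hypothetical, and `re.DOTALL` stands in for the `(?s:.)` inline modifier.

```python
import re

# First alternative: a {...} block containing "foo".
# Second alternative (group 1): any single character, newlines included.
PATTERN = re.compile(r"\{[^}]*foo[^}]*\}|(.)", re.DOTALL)

def keep_foo_blocks(text):
    def replace(match):
        # Group 1 matched: a stray character, so drop it.
        if match.group(1) is not None:
            return ""
        # Otherwise a foo block matched: keep it and add a newline.
        return match.group(0) + "\n"
    return PATTERN.sub(replace, text)

SAMPLE = """<!DOCTYPE html>
<html>
<div dat="{example foo1}"> </div>
<div dat="{example bar}"> </div>
<div dat="{example foo2}"> </div>
</html>
"""

result = keep_foo_blocks(SAMPLE)
print(result, end="")
```

With the simplified input from the question, this prints the two lines `{example foo1}` and `{example foo2}`.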
See the demo and search and replace dialog settings:
Updates
The comment section was full, so I am suggesting the code here.
Let me know if this is close to your intended result:
Find: ({.+?[\n]*foo[ \d]*})|.*?
Replace all: $1
Also added Toto's example

Extracting content of HTML tag with specific attribute

Using regular expressions, I need to extract the multiline content of a tag that has a specific id value. How can I do this?
This is what I currently have:
<div(.|\n)*?id="${value}"(.|\n)*?>(.|\n)*?<\/div>
The problem with this is this sample:
<div id="1">test</div><div id="2">test</div>
If I want to replace id="2" using this regexp (with ${value} = 2), the whole string gets matched. This is because everything from the first tag opening onwards is matched until the id is found, which is wrong.
How can I do this?
A fairly simple way is to use
Raw: <div(?=\s)[^>]*?\sid="2"[^>]*?>([\S\s]*?)</div>
Delimited: /<div(?=\s)[^>]*?\sid="2"[^>]*?>([\S\s]*?)<\/div>/
Use the variable in place of 2.
The content will be in group 1.
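A short Python sketch of this approach, using the sample from the question. The helper name `div_content_by_id` is made up for illustration, and `re.escape` is added on the assumption that the id value may contain regex metacharacters.

```python
import re

def div_content_by_id(html, value):
    """Return the content of the first <div> whose id equals value, or None."""
    pattern = re.compile(
        r'<div(?=\s)[^>]*?\sid="' + re.escape(value) + r'"[^>]*?>([\S\s]*?)</div>'
    )
    match = pattern.search(html)
    # Group 1 is the text between the opening tag and the first </div>.
    return match.group(1) if match else None

sample = '<div id="1">first</div><div id="2">second\nline</div>'
second = div_content_by_id(sample, "2")
```

Because `[^>]` cannot cross the `>` that ends a tag, the search for id="2" no longer starts inside the first div. The content still ends at the first `</div>`, so nested divs are not handled.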
Change (.|\n) to [^>] so it won't match the > that ends the tag. Then it can't match across different divs.
<div\b[^>]*\bid="${value}"[^>]*>.*?<\/div>
Also, instead of using (.|\n)* to match across multiple lines, use the s modifier to the regexp. This makes . match any character, including newlines.
However, using regular expressions to parse HTML is not very robust. You should use a DOM parser.
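To give an idea of the parser-based route, here is a sketch built on Python's standard `html.parser` module. It is an event-based parser rather than a full DOM, and the class and function names are hypothetical. It tracks div nesting, which the regex versions cannot do. Text is returned with character references decoded, and self-closing tags such as `<br/>` are not treated specially.

```python
from html.parser import HTMLParser

class DivContentById(HTMLParser):
    """Collects the inner markup of the first <div> whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # > 0 while inside the target div
        self.done = False   # True once the target div has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1  # nested div inside the target
            self.parts.append(self.get_starttag_text())
        elif not self.done and tag == "div" and dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:  # closing tag of the target div itself
                self.done = True
                return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

def div_inner_html(html, target_id):
    parser = DivContentById(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts) if parser.done else None

nested = '<div id="1">a</div><div id="2">b<div>c</div>d</div>'
inner = div_inner_html(nested, "2")
```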

Regex that matches any HTML tag with the content inside

I'd like to use a regex to match the HTML "head" tag and the text inside it so that I can delete them easily. I'm using a find-and-replace tool that supports regex syntax, and it works great for replacing across multiple files at once.
I have tried many patterns, but they always fail.
http://regex101.com/r/aZ6pN5/2
Can anyone help, please?
Replace .* in your regex with [\S\s]*? so that it also matches line breaks. JavaScript has no s (DOTALL) modifier before ES2018.
<head.*?>([\s\S]*?)<\/head>
[\s\S]*? does a non-greedy match of zero or more characters of any kind (whitespace or non-whitespace), line breaks included.
DEMO
OR
To replace the contents of head tag.
(<head\b[^<>]*>)[\s\S]*?(<\/head>)
Replacement string:
$1stringyouwant$2
DEMO
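Both uses of the second pattern, deleting the head block and swapping its content, can be sketched in Python as follows. The function names are illustrative only, and a replacement function is used in place of the `$1...$2` syntax, which belongs to the find-and-replace tool rather than to Python.

```python
import re

# Opening <head ...> tag in group 1, closing </head> in group 2.
# \b keeps the pattern from matching <header>.
HEAD_BLOCK = re.compile(r"(<head\b[^<>]*>)[\s\S]*?(</head>)", re.IGNORECASE)

def remove_head(html):
    """Delete the head element together with its content."""
    return HEAD_BLOCK.sub("", html)

def replace_head_content(html, new_content):
    """Keep the head tags but swap what is between them."""
    return HEAD_BLOCK.sub(lambda m: m.group(1) + new_content + m.group(2), html)

page = "<html><head>\n<title>t</title>\n</head><body><header>h</header></body></html>"
without_head = remove_head(page)
swapped = replace_head_content(page, "stringyouwant")
```

Note that the first pattern, `<head.*?>`, would also start matching at `<header>`, which is one reason to prefer `<head\b[^<>]*>`.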

Regular expression to match html tags

Just wanted to know if this is the right way to write a regular expression for an opening HTML tag <strong>: /<strong[^>]*/i?
What I am trying to do is put a pattern in place for HTML tags and then use it to match against any HTML document.
Thanks in advance!
Close.
It would be like this for the opening tag:
/<strong[^>]*?>/i
Keep in mind that using Regex on HTML which involves tags nested within themselves can get very messy.
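A small Python sketch of the opening-tag pattern. The `\b` after the tag name is an addition that is not in the answer above; it is assumed here so that longer names such as a hypothetical `<stronger>` are not matched.

```python
import re

# Case-insensitive match of an opening <strong ...> tag.
STRONG_OPEN = re.compile(r"<strong\b[^>]*>", re.IGNORECASE)

html = '<p><STRONG class="x">a</STRONG> <strong>b</strong> <stronger>c</stronger></p>'
tags = STRONG_OPEN.findall(html)
print(tags)  # ['<STRONG class="x">', '<strong>']
```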
OK. What I understood is that you want to match any string between the "<" and ">" symbols, for example <codekaro>.
To do so you can use:
^[\<][A-Za-z]*[\>]$
Here, ^ indicates the start of the string,
[\<] matches one occurrence of the < symbol; \ is used as an escape character for the < symbol,
[A-Za-z]* matches zero or more ASCII letters,
[\>] matches one occurrence of the > symbol; \ is used as an escape character for the > symbol,
$ indicates the end of the string.
I encourage you to use this link for a regex tutorial and this link to check the results of a regular expression.
Hope this helps!
Happy learning!
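To see what the anchored pattern accepts and rejects, here is a brief Python sketch. The helper name `is_simple_tag` is made up for illustration.

```python
import re

# Whole string must be "<", zero or more ASCII letters, then ">".
SIMPLE_TAG = re.compile(r"^[\<][A-Za-z]*[\>]$")

def is_simple_tag(text):
    return SIMPLE_TAG.match(text) is not None

checks = {
    "<codekaro>": is_simple_tag("<codekaro>"),      # letters only
    "<>": is_simple_tag("<>"),                      # * also allows zero letters
    "<h1>": is_simple_tag("<h1>"),                  # digit is not in [A-Za-z]
    '<a href="x">': is_simple_tag('<a href="x">'),  # attributes are not allowed
    "text <b>": is_simple_tag("text <b>"),          # ^ requires "<" first
}
```

Only the first two entries are True, so the pattern is limited to bare, attribute-free tags made of letters, and it also accepts the empty tag `<>`.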

Regex Operator in Validating HTML Tags

I am following Regular-Expressions.info and see on their samples page an expression to match against HTML tags, as follows:
<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>
What is the semantic effect of the part \b[^>]*? I get that \b is a word boundary, but given what follows it, what is its purpose?
It matches anything extra (if it exists) up until the next occurrence of a ">" (the end of the opening tag). This would match stuff like class="classname" id="idname". However, it would also match any other character you could think of, such as •·°ÁÓ, which may or may not be what you want. As always, a proper HTML parser is the way to go for parsing HTML.
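A Python sketch may make the role of both pieces concrete. It assumes the pattern begins with `<`, as on the referenced site, and uses `re.IGNORECASE` so that lowercase tag names match. `[^>]*` absorbs the attributes, while `\b` stops the engine from backtracking into a shorter tag name and handing the leftover letters to `[^>]*`.

```python
import re

PAIRED_TAG = re.compile(r"<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>", re.IGNORECASE)
# Same pattern without the word boundary, for comparison only.
NO_BOUNDARY = re.compile(r"<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>", re.IGNORECASE)

m = PAIRED_TAG.search('<div class="classname" id="idname">hello</div>')
name, content = m.group(1), m.group(2)  # 'div', 'hello'

# Opening and closing tags do not belong together here.
mismatched = "<bold>text</b>"
with_boundary = PAIRED_TAG.search(mismatched)
without_boundary = NO_BOUNDARY.search(mismatched)
```

With `\b`, the mismatched string is rejected. Without it, the engine shortens the captured name to "b", lets `[^>]*` consume "old", and then wrongly pairs `<bold>` with `</b>`.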