Extract string from HTML tag [VB.Net] [duplicate] - html

How do I match and replace text using regular expressions in multiline mode?
I know the RegexOptions.Multiline option, but what is the best way to match any character, including newline characters, in C#?
Input:
<tag name="abc">this
is
a
text</tag>
Output:
[tag name="abc"]this
is
a
text
[/tag]
Aahh, I found the actual problem. '&' and ';' in Regex are matching text in a single line, while the same need to be escaped in the Regex to work in cases where there are new lines also.

If you mean there has to be a newline character for the expression to match, then \n will do that for you.
Otherwise, I think you might have misunderstood the Multiline/Singleline flags. RegexOptions.Multiline only changes ^ and $ so that they match at the start and end of every line. If you want your expression to match across several lines, you actually want RegexOptions.Singleline: it makes the dot (.) match every character, including \n, so the entire input string is treated as a single line. Is this what you're after?
Example
Regex rx = new Regex("<tag name=\"(.*?)\">(.*?)</tag>", RegexOptions.Singleline);
String output = rx.Replace("Text <tag name=\"abc\">test\nwith\nnewline</tag> more text...", "[tag name=\"$1\"]$2[/tag]");
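For comparison, the same distinction can be sketched in JavaScript, on the assumption that the m flag stands in for RegexOptions.Multiline and the s flag (dotAll, ES2018) for RegexOptions.Singleline. This is only an illustration; the variable names are made up for the example.

```javascript
// Sample input with newlines inside the tag (same text as the C# example).
const input = 'Text <tag name="abc">test\nwith\nnewline</tag> more text...';
const pattern = '<tag name="(.*?)">(.*?)</tag>';
const replacement = '[tag name="$1"]$2[/tag]';

// "m" only changes ^ and $; the dot still stops at \n, so nothing matches.
const withM = input.replace(new RegExp(pattern, "gm"), replacement);

// "s" (dotAll) lets the dot match \n, so the tag body can span lines.
const withS = input.replace(new RegExp(pattern, "gs"), replacement);

console.log(withM === input); // true: the input is unchanged
console.log(withS);
```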

Here's a regex to match. It requires the RegexOptions.Singleline option, which makes the . match newlines.
<(\w+) name="([^"]*)">(.*?)</\1>
After this regex matches, the first group contains the tag name (here, tag), the second the value of the name attribute, and the third the content between the tags. So the replacement string could look like this:
[$1 name="$2"]$3[/$1]
In C#, this looks like:
newString = Regex.Replace(oldString,
@"<(\w+) name=""([^""]*)"">(.*?)</\1>",
"[$1 name=\"$2\"]$3[/$1]",
RegexOptions.Singleline);
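As a rough cross-check of the pattern, here is the same replacement sketched in JavaScript. The function name convertTags is hypothetical, the s flag stands in for RegexOptions.Singleline, and \1 forces the closing tag to repeat the name of the opening tag.

```javascript
// Hypothetical helper: rewrites <x name="...">...</x> as [x name="..."]...[/x].
function convertTags(text) {
  // Group 1: tag name, group 2: value of the name attribute, group 3: content.
  const re = /<(\w+) name="([^"]*)">(.*?)<\/\1>/gs;
  return text.replace(re, '[$1 name="$2"]$3[/$1]');
}

console.log(convertTags('<tag name="abc">this\nis\na\ntext</tag>'));

// A closing tag with a different name does not satisfy \1, so nothing changes.
console.log(convertTags('<a name="x">y</b>'));
```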


How to use regex (regular expressions) in Notepad++ to remove all HTML and JSON code that does not contain a specific string?

Using regular expressions (in Notepad++), I want to find all JSON sections that contain the string foo. Note that the JSON just happens to be embedded within a limited set of HTML source code which is loaded into Notepad++.
I've written the following regex to accomplish this task:
({[^}]*foo[^}]*})
This works as expected for all possible input.
I want to improve my workflow, so instead of just finding all such JSON sections, I want to write a regex to remove all the HTML & JSON that does not match this expression. The result will be only JSON sections that contain foo.
I tried using the Notepad++ regex Replace functionality with this find expression:
(?:({[^}]*?foo[^}]*?})|.)+
and this replace expression:
$1\n\n$2\n\n$3\n\n$4\n\n$5\n\n$6\n\n$7\n\n$8\n\n$9\n\n
This successfully works for the last occurrence of foo within the JSON, but does not find the rest of the occurrences.
How can I improve my code to find all the occurrences?
Here is a simplified minimal example of input and desired output. I hope I haven't simplified it too much for it to be useful:
Simplified input:
<!DOCTYPE html>
<html>
<div dat="{example foo1}"> </div>
<div dat="{example bar}"> </div>
<div dat="{example foo2}"> </div>
</html>
Desired output:
{example foo1}
{example foo2}
You can use
{[^}]*foo[^}]*}|((?s:.))
Replace with (?1:$0\n). Details:
{[^}]*foo[^}]*} - {, zero or more chars other than }, foo, zero or more chars other than } and then a }
| - or
((?s:.)) - Capturing group 1: any one char ((?s:...) is an inline modifier group where . matches all chars including line break chars, same as if you enabled . matches newline option).
The (?1:$0\n) replacement pattern replaces with an empty string if Group 1 was matched, else the replacement is the match text + a newline.
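The same idea of matching what you want, or else capturing any other single character, can be sketched outside Notepad++. The JavaScript below is an illustration only: it uses a replacement callback in place of the (?1:$0\n) conditional, which JavaScript does not have, and the function name is made up.

```javascript
// Hypothetical helper: keeps only {...} sections that contain "foo".
function keepFooSections(text) {
  // Alternative 1: a {...} section containing foo (no capturing group).
  // Alternative 2: any single character, captured in group 1.
  const re = /\{[^}]*foo[^}]*\}|([\s\S])/g;
  return text.replace(re, (match, other) =>
    // Group 1 set: a stray character, drop it. Otherwise keep the section.
    other !== undefined ? "" : match + "\n"
  );
}

const html = [
  "<!DOCTYPE html>",
  "<html>",
  '<div dat="{example foo1}"> </div>',
  '<div dat="{example bar}"> </div>',
  '<div dat="{example foo2}"> </div>',
  "</html>",
].join("\n");

console.log(keepFooSections(html));
```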
Update:
The comment section was full, so I am suggesting the code here.
Let me know if this comes close to your intended result.
Find: ({.+?[\n]*foo[ \d]*})|.*?
Replace all: $1
Also added Toto's example

RegEx to extract hardcoded strings in HTML [duplicate]

I need some help for extracting hard coded strings out of HTML.
This is an example of markup from the template engine that I use:
[[if:"x";"y"]]
<p>true part</p>
[[:else]]
<p>false part</p>
[[:endif]]
[[each:ARRAY;KEY;VALUE]]
Index :[[KEY]] is :[[VALUE]]
or if VALUE is an array
Index :[[KEY]], FOO is :[[VALUE:FOO]]
[[:endeach]]
{$_TEMPLATE['VARS']}
<p><b>I want this</b> and this, {%'AND **THIS NOT**, THIS IS ALREADY TRANSLATED
SINGLE QUOTE MARK IS ESCAPED BY A BACKSLASH \' '}
LINES</p>
Currently I use the pattern />([^\<\>\n\{\}]+\S*?)+</is, but it does not work reliably.
:[[VAR]], {$_TEMPLATE['VAR']} and control blocks ([[if:"x";"y"]] etc.) should not be extracted. In the case of mixed text (Foo :[[has]] bar), Foo and bar should be extracted separately.
For the attributes, I am using the pattern /(placeholder|title|alt|value)\=\"([^\"\'=\{\}\[\]]*?)\"/, which works without problems.
I hope you can help me.
EDIT: Required output from this example:
true part
false part
Index
is
or if VALUE is an array
Index
, FOO is
I want this
and this

Sublime Text regex to find and replace whitespace between two xml or html tags?

I'm using Sublime Text and I need to come up with a regex that will find the whitespaces between a certain opening and closing tag and replace them with commas.
Example: Replace white space in
<tags>This is an example</tags>
so it becomes
<tags>This,is,an,example</tags>
Thanks!
You just have to use a simple regex like:
\s+
And replace it with a comma.
This will find instances of
<tags>...</tags>
with whitespace between the tags
(<tags>\S+)\s(.+</tags>)
This will replace the first whitespace with a comma
\1,\2
Open Find and Replace [OS X Cmd+Opt+F :: Windows Ctrl+H]
Use the two values above to find and replace and use the 'Replace All' option. Repeat until all the whitespaces are converted to commas.
The best answer is probably a quick script but this will get you there fairly fast without needing to do any coding.
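Such a quick script might look like the following JavaScript sketch. It is not the Sublime Text approach described above: it first matches each <tags>...</tags> block, then replaces whitespace runs only inside that block. The function name is hypothetical.

```javascript
// Hypothetical helper: turns whitespace runs inside <tags>...</tags> into commas.
function commaSeparateTags(text) {
  return text.replace(/<tags>([\s\S]*?)<\/tags>/g, (whole, inner) =>
    // Only the captured inner text is rewritten; text outside is untouched.
    "<tags>" + inner.replace(/\s+/g, ",") + "</tags>"
  );
}

console.log(commaSeparateTags("a b <tags>This is an example</tags> c d"));
// a b <tags>This,is,an,example</tags> c d
```

Because the inner replace runs on every whitespace run at once, a single pass is enough and no repeated Replace All is needed.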
You can replace any one or more whitespace chunks in between two tags using a single regular expression:
(?s)(?:\G(?!\A)|<tags>(?=.*?</tags>))(?:(?!</?tags>).)*?\K\s+
See the regex demo. Details
(?s) - a DOTALL inline modifier, makes . match line breaks
(?:\G(?!\A)|<tags>(?=.*?</tags>)) - either the end of the previous successful match (\G(?!\A)) or (|) <tags> substring that is immediately followed with any zero or more chars, as few as possible and then </tags> (see (?=.*?</tags>))
(?:(?!</?tags>).)*? - any char that does not start a <tags> or </tags> substring, zero or more occurrences but as few as possible
\K - match reset operator
\s+ - one or more whitespaces (NOTE: use \s if each whitespace must be replaced).

regex newline character error

I am trying to make my regex work across multiple lines, and the "m" flag didn't seem to help. My regex works for the first line but not for the following lines.
You can skip the match part and just do it all in one step:
> "the *text* is to be replaced \n by *text*".replace(/\*([\s\S]*?)\*/g, '<i>$1</i>');
"the <i>text</i> is to be replaced \n by <i>text</i>"
. matches any character, but it excludes newlines. [\s\S] matches any character including newlines.
I changed your search regex to \*([\s\S]*?)\*, which non-greedily matches the stuff between the asterisks.
The replacement string is <i>$1</i>. $1 is replaced with the contents of the first capturing group, which is your text.
Also, because it looks like you're trying to convert Markdown to HTML, try using a pre-made JS converter: http://www.showdown.im/
You can use it like this:
var str = "the *text* is to be *replaced \n by* *text*";
alert(str.replace(/\*([\s\S]*?)\*/g, '<i>$1</i>'));
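The non-greedy quantifier matters as soon as a string contains more than one pair of asterisks. A small sketch of the difference, reusing the sample string (the variable names are illustrative):

```javascript
const str = "the *text* is to be *replaced \n by* *text*";

// Lazy (*?): each match stops at the nearest closing asterisk.
const lazy = str.replace(/\*([\s\S]*?)\*/g, "<i>$1</i>");

// Greedy (*): a single match runs from the first asterisk to the last one.
const greedy = str.replace(/\*([\s\S]*)\*/g, "<i>$1</i>");

console.log(lazy);
console.log(greedy);
```

The lazy version wraps each of the three asterisk pairs separately, while the greedy version produces one pair of <i> tags with the inner asterisks left in place.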

Regular expression to remove HTML tags from a string [duplicate]

Is there an expression which will get the value between two HTML tags?
Given this:
<td class="played">0</td>
I am looking for an expression which will return 0, stripping the <td> tags.
You should not attempt to parse HTML with regex. HTML is not a regular language, so any regex you come up with will likely fail on some esoteric edge case. Please refer to the seminal answer to this question for specifics. While mostly formatted as a joke, it makes a very good point.
The following examples are Java, but the regex will be similar -- if not identical -- for other languages.
String target = someString.replaceAll("<[^>]*>", "");
This assumes that your non-HTML text does not contain any < or > and that your input string is correctly structured.
If you know they're a specific tag -- for example you know the text contains only <td> tags, you could do something like this:
String target = someString.replaceAll("(?i)</?td[^>]*>", "");
Edit:
Ωmega brought up a good point in a comment on another post that this would result in multiple results all being squished together if there were multiple tags.
For example, if the input string were <td>Something</td><td>Another Thing</td>, then the above would result in SomethingAnother Thing.
In a situation where multiple tags are expected, we could do something like:
String target = someString.replaceAll("(?i)</?td[^>]*>", " ").replaceAll("\\s+", " ").trim();
This replaces the HTML with a single space, then collapses whitespace, and then trims any on the ends.
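The same three steps carry over to other regex flavors. A JavaScript sketch, with the same caveats about handling HTML with regex (the function name is hypothetical, and the pattern removes any tag rather than only <td>):

```javascript
// Hypothetical helper: strips tags, then normalizes the leftover whitespace.
function stripTags(html) {
  return html
    .replace(/<[^>]*>/g, " ") // each tag becomes a single space
    .replace(/\s+/g, " ")     // collapse runs of whitespace
    .trim();                  // drop leading and trailing space
}

console.log(stripTags("<td>Something</td><td>Another Thing</td>"));
// Something Another Thing
console.log(stripTags('<td class="played">0</td>'));
// 0
```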
A trivial approach would be to replace
<[^>]*>
with nothing. But depending on how ill-structured your input is that may well fail.
You could do it with jsoup http://jsoup.org/
Whitelist whitelist = Whitelist.none();
String cleanStr = Jsoup.clean(yourText, whitelist);