Regular expression to remove HTML links [duplicate] - html

Closed 11 years ago.
Possible Duplicate:
Regular expression for parsing links from a webpage?
RegEx match open tags except XHTML self-contained tags
I need a regular expression to strip HTML <a> tags. Here is a sample:
<a href="xxxx" class="yyy" title="zzz" ...> link </a>
should be converted to
link

I think you're looking for: </?a(|\s+[^>]+)>

Some of the other answers, such as </?a.*?>, would also match valid HTML tags such as <abbr>, <address>, or <applet> and strip them out erroneously. A better regex to match only anchor tags would be
</?a(?:(?= )[^>]*)?>

Here's what I would use:
</?a\b[^>]*>

You're going to have to use this hackish solution iteratively, and it probably won't even work perfectly for complicated HTML:
<a(\s[^>]*)?>.*?(</a>)?
Alternatively, you can try one of the existing HTML sanitizers/parsers out there.
HTML is not a regular language; any regex we give you will not be 'correct'. It's impossible. Even Jon Skeet and Chuck Norris can't do it. Before I lapse into a fit of rage, like @bobince [in]famously once did, I'll just say this:
Use an HTML parser.
(Whatever they're called.)
EDIT:
If you want to 'incorrectly' strip out </a>s that don't have any <a>s as well, do this:
</?[a\s]*[^>]*>

</?a.*?> would work. Replace it with ''
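A quick way to compare the patterns suggested in the answers above is to run each one over a sample that contains both an anchor and an <abbr> tag. The sketch below is in Python; the sample string and the labels used as dictionary keys are made up for illustration, and only the regexes themselves come from the answers.

```python
import re

# Patterns copied from the answers; the dictionary keys are illustrative labels.
PATTERNS = {
    "optional-attrs": r"</?a(|\s+[^>]+)>",
    "lookahead-space": r"</?a(?:(?= )[^>]*)?>",
    "word-boundary": r"</?a\b[^>]*>",
    "lazy-dot": r"</?a.*?>",
    "char-class": r"</?[a\s]*[^>]*>",
}

# Hypothetical sample: a tag that merely starts with "a", plus a real anchor.
SAMPLE = '<abbr>HTML</abbr><a href="xxxx" class="yyy">link</a>'


def strip_with(pattern: str, html: str) -> str:
    """Delete every match of `pattern` from `html`."""
    return re.sub(pattern, "", html)


if __name__ == "__main__":
    for label, pattern in PATTERNS.items():
        print(f"{label:16} -> {strip_with(pattern, SAMPLE)}")
```

On this sample the first three patterns should leave <abbr> alone, while the last two remove it along with the anchor. All of them stop at the first >, so none copes with a > inside an attribute value.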

Related

use regex to select words between html tags

Thanks for visiting my question. I'm trying to match sentences between tags. For example:
<h1> Most flavors, except the ones discussed below, have only one
metacharacter that matches both before a word and after a word. <p>
This is because any position between characters can never be both at
the start and at the end of a word. Using only one operator makes
things easier for you.<p>Word boundaries, as described above, are
supported by most regular expression flavors.
I'm trying to get the first 10 words from each tag.
output:
Most flavors, except the ones discussed below, have only one
This is because any position between characters can never be
Word boundaries, as described above, are supported by most regular
I find it's so tricky. Thanks for your help here!!!
As has already been linked in the comment, one of the most well-known answers of all time on this site is about how using regular expressions to parse HTML is probably not a good idea. For a more detailed and balanced overview of when it is and isn't a good idea to do so, check out this question as well.
But briefly, the answer depends on what you're trying to do. It's likely that you'll be better off finding an HTML/XML-parsing library for whatever language you're using, and extracting the text with that.
I'm a bit confused as to what your task actually is, as your code as shown isn't valid HTML, since <h1> at least requires a closing tag. But if you do need to use regex to do this, you will want to look at word boundaries and interval operators for limiting to 10, and perhaps lookbehind (or just capture groups) to match the tag without returning it.
But again: if you're trying to parse actual HTML, you'd be better off using an HTML parser to get the tag content, and then getting the first 10 words using string operations. An example in JavaScript, which is a bit of a cheat because you get the HTML parsing for free, but it makes for an easy example:
// Print the first 10 words of every element inside <body>
for (const tag of document.querySelectorAll('body *')) {
  const words = tag.innerText.trim().split(/\s+/).slice(0, 10);
  console.log(`${tag.tagName}: ${words.join(' ')}`);
}
<h1>This is an h1 tag with a bunch of text in it that is really long</h1>
<p>Here's a p tag with some more text that's really long
<p>Here's a p tag with some more text that's really long
<p>Here's a p tag with some more text that's really long
<p>Here's a p tag with some more text that's really long
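Outside a browser, the same parser-first idea can be sketched with Python's standard html.parser module. This is only a minimal sketch: it assumes, as in the question's sample, that every start tag begins a new block of text and that closing tags can be ignored, so an inline tag such as <b> would also start a new block. The names FirstWords and first_words are made up for this example.

```python
from html.parser import HTMLParser


class FirstWords(HTMLParser):
    """Collect the text that follows each start tag (assumption: one block per tag)."""

    def __init__(self, limit=10):
        super().__init__()
        self.limit = limit
        self.chunks = []  # one list of text fragments per start tag

    def handle_starttag(self, tag, attrs):
        # Every start tag opens a new block; unclosed <p> tags are fine.
        self.chunks.append([])

    def handle_data(self, data):
        # Text before the first tag is ignored.
        if self.chunks:
            self.chunks[-1].append(data)

    def results(self):
        lines = []
        for fragments in self.chunks:
            words = "".join(fragments).split()
            if words:
                lines.append(" ".join(words[: self.limit]))
        return lines


def first_words(html, limit=10):
    """Return the first `limit` words of the text following each start tag."""
    parser = FirstWords(limit)
    parser.feed(html)
    parser.close()  # flush any text still buffered after the last tag
    return parser.results()
```

Calling first_words on the sample from the question should give one line per tag, each cut off after ten whitespace-separated words.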

How to replace all html tags of one kind with another [duplicate]

This question already has answers here:
Find and replace HTML tags
(3 answers)
Closed 4 years ago.
I need to replace all HTML tags of one kind in a string with another, e.g., replace all <i> tags with <em> tags.
What's the best way to effectively change:
"<p><i>Random stuff here...</i></p>"
to the following?
"<p><em>Random stuff here...</em></p>"
There are millions of such strings, so a solution taking complexity into account would be nice.
You can use gsub with a block:
string = "<p><i>Random stuff here...</i></p>"
string.gsub(/(<\/?)i(>)/) { "#{$1}em#{$2}" }
#=> "<p><em>Random stuff here...</em></p>"
Explanation:
The pattern matches an opening or closing <i> tag and captures the leading < or </ and the trailing >, so the block can rebuild the tag with em in place of i.
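For comparison, the same substitution can be sketched in Python. Compiling the pattern once is a reasonable choice when it is applied to millions of strings. Note that the \b and [^>]* parts are an addition that is not in the Ruby answer: they are meant to keep attributes (as in <i class="x">) while still skipping tags such as <img>. The names I_TAG and i_to_em are illustrative.

```python
import re

# Group 1: "<" or "</". Group 2: everything after the tag name up to and including ">".
# \b prevents the "i" from matching the start of a longer tag name such as "img".
I_TAG = re.compile(r"(</?)i\b([^>]*>)")


def i_to_em(html: str) -> str:
    """Rename <i> and </i> tags to <em> and </em>, keeping any attributes."""
    return I_TAG.sub(r"\g<1>em\g<2>", html)
```

Like the Ruby version, this is a plain text substitution, so it will also rewrite anything that merely looks like an <i> tag inside comments or scripts.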

Regex: Negate capture group with logical or

I'm trying to use regex to filter forbidden HTML tags out of a given string. Yes I know, I'm supposed to use a parser instead but for this specific problem it's faster this way.
The idea is to whitelist every tag which is okay (e.g. <span>, <b>, </br>) and match forbidden ones. So far I came up with the following expression: <\/?(?!(span|b|br)).\>
It works well for single char tags like <a> but stuff like <label> does not work. I'd really appreciate some help, thanks in advance.
This regex will get tags while ignoring the span, br, b opening and closing tags.
It should even ignore those from the white list if they contain attributes.
<\/?(?!(?:span|br|b)(?: [^>]*)?>)[^>\/]*>
/<(?!(\/?span|\/?b|\/?br)).*?>/g
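The pattern in the question fails on <label> because the single . after the lookahead matches exactly one character. A minimal Python sketch of the whitelist idea follows. It is a variant rather than either answer's exact pattern: the optional slash sits inside the lookahead, and \b is added so that allowing b does not also allow longer names such as <body>. ALLOWED, FORBIDDEN_TAG, and the two helper functions are assumptions made for illustration.

```python
import re

# Tag names that are allowed to stay (taken from the question's example).
ALLOWED = ("span", "br", "b")

# Match any tag whose name, after an optional "/", is not a whole allowed name.
FORBIDDEN_TAG = re.compile(
    r"<(?!/?(?:" + "|".join(map(re.escape, ALLOWED)) + r")\b)[^>]*>"
)


def forbidden_tags(html: str) -> list:
    """Return every tag that is not on the whitelist."""
    return FORBIDDEN_TAG.findall(html)


def strip_forbidden(html: str) -> str:
    """Remove forbidden tags but keep the text between them."""
    return FORBIDDEN_TAG.sub("", html)
```

Keeping the slash inside the lookahead matters: with </? placed before the lookahead, the engine can backtrack, skip the slash, and then match a closing tag such as </span> anyway.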

Regex that matches any HTML tag with the content inside

I'd like to use a regex to match the HTML "head" tag and the text inside it so I can delete them easily. I'm using a find-and-replace tool that uses regex syntax, and it works great for replacing across multiple files at once.
I tried many patterns, but none of them worked.
http://regex101.com/r/aZ6pN5/2
Can anyone help, please?
Replace .* in your regex with [\s\S]*? so that it also matches line breaks. In JavaScript the s (dotAll) modifier is only available since ES2018, so [\s\S] is the portable alternative.
<head.*?>([\s\S]*?)<\/head>
[\s\S]*? performs a non-greedy match of zero or more characters of any kind, whitespace or not, so it also spans line breaks.
OR
To replace only the contents of the head tag, capture the opening and closing tags:
(<head\b[^<>]*>)[\s\S]*?(<\/head>)
Replacement string:
$1stringyouwant$2
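Both operations from this answer can be sketched in Python. With re.DOTALL the dot matches line breaks, so .*? can stand in for [\s\S]*?. The sketch uses <head\b so that <header> is not mistaken for <head>; the function names and the sample page in the usage note are illustrative, not taken from the question.

```python
import re

# Whole <head>...</head> block, across line breaks, shortest match first.
HEAD_BLOCK = re.compile(r"<head\b[^<>]*>.*?</head>", re.DOTALL | re.IGNORECASE)

# Same block, but with the opening and closing tags captured for reuse.
HEAD_CONTENT = re.compile(r"(<head\b[^<>]*>).*?(</head>)", re.DOTALL | re.IGNORECASE)


def remove_head(html: str) -> str:
    """Delete the head element together with its content."""
    return HEAD_BLOCK.sub("", html)


def replace_head_content(html: str, new_content: str) -> str:
    """Keep the <head> tags but swap what is between them."""
    # A function replacement avoids backslash escapes in new_content being interpreted.
    return HEAD_CONTENT.sub(lambda m: m.group(1) + new_content + m.group(2), html)
```

For example, replace_head_content(page, "<title>New</title>") would leave the body of a hypothetical page untouched, including any <header> element.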

Regular expression to remove HTML tags from a string [duplicate]

Closed 10 years ago.
Possible Duplicate:
Regular expression to remove HTML tags
Is there an expression which will get the value between two HTML tags?
Given this:
<td class="played">0</td>
I am looking for an expression which will return 0, stripping the <td> tags.
You should not attempt to parse HTML with regex. HTML is not a regular language, so any regex you come up with will likely fail on some esoteric edge case. Please refer to the seminal answer to this question for specifics. While mostly formatted as a joke, it makes a very good point.
The following examples are Java, but the regex will be similar -- if not identical -- for other languages.
String target = someString.replaceAll("<[^>]*>", "");
This assumes that the non-HTML text does not contain any < or > and that the input string is correctly structured.
If you know they're a specific tag -- for example you know the text contains only <td> tags, you could do something like this:
String target = someString.replaceAll("(?i)</?td\\b[^>]*>", "");
Edit:
Ωmega brought up a good point in a comment on another post that this would result in multiple results all being squished together if there were multiple tags.
For example, if the input string were <td>Something</td><td>Another Thing</td>, then the above would result in SomethingAnother Thing.
In a situation where multiple tags are expected, we could do something like:
String target = someString.replaceAll("(?i)</?td\\b[^>]*>", " ").replaceAll("\\s+", " ").trim();
This replaces each tag with a single space, then collapses runs of whitespace, and finally trims the ends.
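The replace, collapse, and trim sequence can be sketched in Python as well. It carries the same caveats as the Java version: it assumes the text itself contains no < or > and that the tags are well formed. TAG and strip_tags are illustrative names.

```python
import re

# Any run of characters between "<" and the next ">" is treated as a tag.
TAG = re.compile(r"<[^>]*>")


def strip_tags(html: str) -> str:
    """Replace each tag with a space, collapse whitespace, and trim the ends."""
    text = TAG.sub(" ", html)
    return re.sub(r"\s+", " ", text).strip()
```

Applied to the question's input, strip_tags('<td class="played">0</td>') should return just the cell value.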
A trivial approach would be to replace
<[^>]*>
with nothing. But depending on how ill-structured your input is that may well fail.
You could do it with jsoup (http://jsoup.org/):
// Note: newer jsoup releases rename Whitelist to Safelist (Safelist.none()).
Whitelist whitelist = Whitelist.none();
String cleanStr = Jsoup.clean(yourText, whitelist);
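For readers not on the JVM, a parser-based alternative can be sketched with Python's standard html.parser module. This is a stand-in for the idea of using a parser, not a port of jsoup; TextCollector and text_only are made-up names.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Keep only text nodes; tags and attributes are dropped by the parser."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def text_only(html: str) -> str:
    """Return the text content of `html` with whitespace normalised."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()  # flush text buffered after the last tag
    return " ".join(" ".join(collector.parts).split())
```

Unlike the plain regex, the parser also decodes character references, so an escaped &lt; in the text comes back as a literal <.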