Remove HTML tags in specific tags in MySQL - mysql

I'd like to make a SQL script to remove, for example, all <strong> and </strong> tags that are inside a title <hX></hX> tag.
I want to replace all occurrences like <h4><strong>Some text</strong></h4> with <h4>Some text</h4>,
but only if in a H tag and without losing content of course.
I tried many things like REGEXP_REPLACE and REGEXP_SUBSTR, but I'm stuck with something like REGEXP_REPLACE(myfield, "<h\\d>.*<strong>.*<\/strong>.*<\/h\\d>", ""), which replaces the entire match.

I use PHP to strip input: preg_replace('#[^A-Za-z0-9]#i', '', $_POST['username']); // removes everything but letters and numbers. It can be modified for specific phrases and characters. I know it isn't SQL, but it is something. Also, in JavaScript you can read an element's textContent (rather than innerHTML, which keeps the markup) to pull only the text out from within tags >Text<
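The pattern in the question replaces everything because it matches the whole heading and the replacement is an empty string. One way around that, sketched here in Python rather than SQL and assuming the field values can be processed in application code, is to match each heading and remove the <strong> tags only from its inner part. The names HEADING, STRONG and strip_strong_in_headings are made up for this illustration.

```python
import re

# Matches one heading element, e.g. <h4 ...> ... </h4>; \2 forces the closing
# tag to have the same level as the opening tag.
HEADING = re.compile(r"(<h([1-6])\b[^>]*>)(.*?)(</h\2>)", re.IGNORECASE | re.DOTALL)
# Matches an opening or closing <strong> tag.
STRONG = re.compile(r"</?strong\b[^>]*>", re.IGNORECASE)


def strip_strong_in_headings(html: str) -> str:
    """Remove <strong> tags that sit inside headings, keeping their text."""
    def clean(match: re.Match) -> str:
        opening, _level, inner, closing = match.groups()
        # Only the inner part of the heading is rewritten.
        return opening + STRONG.sub("", inner) + closing

    return HEADING.sub(clean, html)


sample = "<h4><strong>Some text</strong></h4><p><strong>keep</strong></p>"
print(strip_strong_in_headings(sample))
# -> <h4>Some text</h4><p><strong>keep</strong></p>
```

The <strong> inside the paragraph is left alone because it is not inside a heading match. Like any regex approach to HTML, this sketch does not handle nested or malformed headings.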

Related

How to match text and skip HTML tags using a regular expression?

I have a bunch of records in a QuickBase table that contain a rich text field. In other words, they each contain some paragraphs of text intermingled with HTML tags like <p>, <strong>, etc.
I need to migrate the records to a new table where the corresponding field is a plain text field. For this, I would like to strip out all HTML tags and leave only the text in the field values.
For example, from the below input, I would expect to extract just a small example link to a webpage:
<p>just a small <a href="#">
example</a> link</p><p>to a webpage</p>
As I am trying to get this done quickly and without coding or using an external tool, I am constrained to using Quickbase Pipelines' Text channel tool. The way it works is that I define a regex pattern and it outputs only the bits that match the pattern.
So far I've been able to come up with this regular expression (Python-flavored as QB's backend is written in Python) that correctly does the exact opposite of what I need. I.e. it matches only the HTML tags:
/(<[^>]*>)/
In a sense, I need the negative image of this expression but have not been able to build it myself.
Your help in "negating" the above expression is most appreciated.
Assuming there are no stray < or > characters elsewhere (or that they are entity-encoded), here is an idea using a lookbehind:
(?:(?<=>)|^)[^<]+
See this demo at regex101
(?:(?<=>)|^) is an alternation between either ^ start of the string or looking behind for any >. From there [^<]+ matches one or more characters that are not < (negated character class).
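As a quick check of that pattern, here is a small sketch using Python's re module on the sample from the question. Whether Quickbase Pipelines behaves identically is an assumption; the variable names are only for illustration.

```python
import re

# Text runs that start either at the beginning of the string or right after ">".
TEXT_ONLY = re.compile(r"(?:(?<=>)|^)[^<]+")

sample = '<p>just a small <a href="#">\nexample</a> link</p><p>to a webpage</p>'

pieces = TEXT_ONLY.findall(sample)
print(pieces)
# -> ['just a small ', '\nexample', ' link', 'to a webpage']

# Join the pieces and collapse runs of whitespace (including the newline).
plain = " ".join(" ".join(pieces).split())
print(plain)
# -> just a small example link to a webpage
```

Joining with a space is a choice made for this sample; it would insert a space into a word that is split by an inline tag.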

Regex: interpret markdown but ignore HTML

In a string like
Hallo, this is <code>`code`</code> and this `is code again`.
To analyse it, should I parse it with a regex?
In this example the user just typed the far right ` at the very last. The first "code" has obviously already been surrounded by HTML.
I need a regex to get the next code indicated part.
There will always be one series that is valid markdown AND not already surrounded by the corresponding HTML tags.
How to get this specific series (regardless if it's *, **, ___, ` or whatever)?
So what you want is a regex that only matches the markdown that isn't surrounded by HTML tags, right?
You can use something like this:
/(?:[^<>]|^)(`[^<>].*?`)/
This will only match text inside backticks that isn't placed directly next to a < or > character. This way, no matter what HTML tag is inside the <...>, the `code` won't match.
See this Regex101.com
If you want to match every emphasized string that is not tagged with "code" you can use
(?<!<code>)`[\w ]+`
You can test it on regex101.com
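Both suggested patterns can be tried against the sentence from the question. This is a minimal sketch in Python; the variable names are made up, and other regex flavours may treat lookbehinds differently.

```python
import re

sample = "Hallo, this is <code>`code`</code> and this `is code again`."

# First suggestion: the backtick run must not directly follow "<" or ">".
not_next_to_tag = re.compile(r"(?:[^<>]|^)(`[^<>].*?`)")
# Second suggestion: the opening backtick must not be preceded by "<code>".
not_after_code = re.compile(r"(?<!<code>)`[\w ]+`")

first = not_next_to_tag.search(sample)
print(first.group(1))
# -> `is code again`

print(not_after_code.findall(sample))
# -> ['`is code again`']
```

In the first pattern the wanted text is in capture group 1, because the full match also includes the character in front of the opening backtick.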

Search and replace outer tag in Atom using REGEX

Using Atom, I'm trying to replace the outer tag structure for multiple different texts within a document. I'm also using regex, which I'm not well versed enough in to come up with my own solution.
HTML to be searched: <span class="klass">Any text string</span>
Replace it with: <code>Any text string</code>
My regex: (<?span class="klass">)+[\w]+(<?/span>)
Is there a wildcard to "keep" the [\w] part in the replaced result?
You can use a capture group to capture the text in between the <span> tags during the match, and then use it to build the <code> output you want. Try the following find and replace:
Find:
<span class="klass">(.*?)</span>
Replace:
<code>$1</code>
Here $1 represents the text matched by (.*?), which we captured in the search. One other point: we use .*? when capturing between tags, as opposed to just .*. The former, .*?, is a "lazy" (non-greedy) dot. This tells the engine to stop matching upon hitting the first closing </span> tag. Without this, the match would be greedy and would consume as much as possible, ending only with the last </span> tag it can reach (on the same line, unless the dot is allowed to match newlines).
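The difference between the lazy and greedy versions shows up as soon as a line contains two spans. Here is a small sketch in Python, where the backreference is written \1 instead of Atom's $1; the sample text is invented for the illustration.

```python
import re

text = '<span class="klass">one</span> and <span class="klass">two</span>'

# Lazy: each span is matched and rewritten on its own.
lazy = re.sub(r'<span class="klass">(.*?)</span>', r"<code>\1</code>", text)
print(lazy)
# -> <code>one</code> and <code>two</code>

# Greedy: the capture runs from the first opening tag to the last closing tag.
greedy = re.sub(r'<span class="klass">(.*)</span>', r"<code>\1</code>", text)
print(greedy)
# -> <code>one</span> and <span class="klass">two</code>
```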

Search to exclude html tags in MySQL

I need a search query which will exclude text within HTML tags. For example, I need to search for a word called "spa" in my database. There are HTML tags in the database, so the result will contain <span> tags.
I need the search query to check only the words starting with the word "spa" but not within any HTML tag.
Please help.
Regexes are always really hard to use for HTML, because there are many rules that apply to it.
You should consider using a HTML-Parser instead.
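As a rough illustration of the parser route, the sketch below uses Python's standard html.parser to collect only the text nodes and then searches those for words starting with "spa". It assumes the rows are fetched into application code first; TextCollector, visible_text and words_starting_with are hypothetical helper names, not part of any MySQL feature.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text nodes and ignores tags and attributes."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def visible_text(html: str) -> str:
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


def words_starting_with(prefix: str, html: str) -> list:
    # Search only the text outside of tags, case-insensitively.
    pattern = re.compile(r"\b" + re.escape(prefix) + r"\w*", re.IGNORECASE)
    return pattern.findall(visible_text(html))


row = '<span class="note">A spa day in Spain</span>'
print(words_starting_with("spa", row))
# -> ['spa', 'Spain']
```

The <span> tag itself never reaches the search, so it cannot produce a false hit.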

regex: selecting everything but img tag

I'm trying to select some text using regular expressions leaving all img tags intact.
I've found the following code that selects all img tags:
/<img[^>]+>/g
but actually having a text like:
This is an untagged text.
<p>this is my paragraph text</p>
<img src="http://example.com/image.png" alt=""/>
this is a link
using the code above will select the img tag only
/<img[^>]+>/g #--> using this code will result in:
<img src="http://example.com/image.png" alt=""/>
but I would like to use some regex that select everything but the image like:
/magical regex/g # --> results in:
This is an untagged text.
<p>this is my paragraph text</p>
this is a link
I've also found this code:
/<(?!img)[^>]+>/g
which selects all tags except the img one. But in some cases I will have untagged text or text between tags, so this won't work for my case. :(
Is there any way to do it?
Sorry, but I'm really new to regular expressions; I've been struggling for a few days trying to make it work, but I can't.
Thanks in advance
UPDATE:
OK, so for those thinking I would like to parse it: sorry, I don't want that, I just want to select text.
Another thing: I'm not using any specific language. I'm using Yahoo Pipes, which only provides regex and some string tools to accomplish the job, and it doesn't involve any programming code.
For better understanding, here is how the regex module works in Yahoo Pipes:
http://pipes.yahoo.com/pipes/docs?doc=operators#Regex
UPDATE 2
Fortunately I was able to strip the text near the img tag, but on a step-by-step basis as @Blixt recommended, like:
<(?!img)[^>]+> , replace with "" #-> strips out every tag that is not img
(?s)^[^<]*(.*), replace with $1 #-> removes all the text before the img tag
(?s)^([^>]+>).*, replace with $1 #-> removes all the text after the img tag
The problem with this is that it will only catch the first img tag, and then I would have to catch the others manually by hard-coding it, so I'm still not sure if this is the best solution.
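To see what these three replacements do when chained, here is a sketch that applies them in order with Python's re module. Yahoo Pipes' regex module may differ in details, and isolate_first_img is a name invented for this example.

```python
import re


def isolate_first_img(html: str) -> str:
    # Step 1: strip every tag that is not an img tag.
    step1 = re.sub(r"<(?!img)[^>]+>", "", html)
    # Step 2: drop all text before the first remaining "<".
    step2 = re.sub(r"(?s)^[^<]*(.*)", r"\1", step1)
    # Step 3: keep everything up to the first ">" and drop the rest.
    step3 = re.sub(r"(?s)^([^>]+>).*", r"\1", step2)
    return step3


html = (
    "This is an untagged text.\n"
    "<p>this is my paragraph text</p>\n"
    '<img src="http://example.com/image.png" alt=""/>\n'
    "this is a link"
)
print(isolate_first_img(html))
# -> <img src="http://example.com/image.png" alt=""/>

# A second image is discarded by step 3, which is the limitation described above.
print(isolate_first_img(html + '\n<img src="b.png"/>'))
# -> <img src="http://example.com/image.png" alt=""/>
```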
The regexp you have to find the image tags can be used with a replace to get what you want.
Assuming you are using PHP (preg_replace replaces every match by default, and PCRE has no g modifier):
$htmlWithoutIMG = preg_replace('/<img[^>]+>/', '', $html);
If you are using Javascript:
var htmlWithoutIMG = html.replace(/<img[^>]+>/g, '');
This takes your text, finds the <img> tags and replaces them with nothing, i.e. it deletes them from the text, leaving what you want. The < and > characters do not need escaping in either regex.
Regular expression matches have a single start and length. This means the result you want is impossible in a single match (since you want the result to end at one point, then continue later).
The closest you can get is to use a regular expression that matches everything from start of string up to start of <img> tag, everything between <img> tags and everything from end of <img> tag to end of string. Then you could get all matches from that regular expression (in your example, there would be two matches).
The above answer is assuming you can't modify the result. If you can modify the result, simply replace the <img> tags with the empty string to get your result.
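Both options from this answer can be sketched in Python with the sample from the question: deleting the img tags when the result may be modified, or splitting on them to get the separate pieces of text around each tag. The names IMG_TAG, without_img and segments are only illustrative.

```python
import re

IMG_TAG = re.compile(r"<img[^>]+>")

html = (
    "This is an untagged text.\n"
    "<p>this is my paragraph text</p>\n"
    '<img src="http://example.com/image.png" alt=""/>\n'
    "this is a link"
)

# Option 1: replace every img tag with the empty string.
without_img = IMG_TAG.sub("", html)
print(without_img)

# Option 2: split on the img tags; each piece is one contiguous run of text.
segments = IMG_TAG.split(html)
print(len(segments))
# -> 2
```

Joining the segments gives the same string as the replacement, which is why no single match can produce it: the result is two runs with a gap in between.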