I have a string that should be removed from the content of my WordPress website. I want it removed from the database too, either via phpMyAdmin or through a plugin.
Plugins don't accept wildcards or regex.
The string starts with <li class="dZip"> and ends with Download ZIP</a></li>, and contains alphanumeric and special characters in between. I would like to remove all occurrences.
I have tried <li class="dZip">\b.*Download ZIP</a></li>\b using plugins, with no success.
If you have a new enough MySQL or MariaDB, you can use the function REGEXP_REPLACE().
The regexp would be
<li class="dZip">.*?Download ZIP</a></li>
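Two changes from what you had:
\b is a "word boundary": it matches only between a word character and a non-word character. Right after >, it can match only if a letter, digit, or underscore follows immediately, which is typically not the case here (the next character is usually < or whitespace).
So, I removed them.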
.* would gobble up all the way to the last </li>. If you are expecting multiple li's then use .*? so that it gobbles only the one. The function (either MySQL's REGEXP_REPLACE or PHP's preg_replace) will repeat until finished.
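A quick way to check both changes is to run the patterns outside the database. The sketch below uses Python's re module as a stand-in (an assumption: it is not the engine MySQL or MariaDB uses, although \b, .* and .*? behave the same way for this case), and the sample HTML is made up for illustration.

```python
import re

# Hypothetical post content with two download items (not from the real site).
html = (
    '<ul><li class="dZip"><a href="/a.zip">Download ZIP</a></li>'
    '<li>keep me</li>'
    '<li class="dZip"><a href="/b.zip">Download ZIP</a></li></ul>'
)

lazy = r'<li class="dZip">.*?Download ZIP</a></li>'
greedy = r'<li class="dZip">.*Download ZIP</a></li>'
with_b = r'<li class="dZip">\b.*Download ZIP</a></li>\b'

lazy_result = re.sub(lazy, '', html)      # removes each dZip item separately
greedy_result = re.sub(greedy, '', html)  # also swallows the item in between
with_b_result = re.sub(with_b, '', html)  # \b after > fails here: no match

print(lazy_result)
print(greedy_result)
print(with_b_result == html)
```

In SQL, the lazy pattern would go into something like UPDATE wp_posts SET post_content = REGEXP_REPLACE(post_content, '<pattern>', ''), assuming the default wp_ table prefix.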
While there's a ton of "old" examples on the internet using the now unsupported '[[:<:]]word[[:>:]]' technique, I'm trying to find out how, in MySQL 8.0.30, to do exact word matching against our table when the words have special characters in them.
For example, we have a paragraph of text like:
"Senior software engineer and C++ developer with Unit Test and JavaScript experience. I also have .NET experience!"
We have a table of keywords to match against this and have been using the basic system of:
SELECT
sk.ID
FROM
sit_keyword sk
WHERE
var_text REGEXP CONCAT('\\b',sk.keyword,'\\b')
It works fine 90% of the time, but it completely fails on:
C#, C++, .NET, A+ or "A +" etc. So it's failing to match keywords with special characters in them.
I can't seem to find any recent documentation on how to address this since, as mentioned, nearly all of the examples I can find use the old unsupported techniques. Note I need to match these words (with special characters) anywhere in the source text, so it can be the first or last word, or somewhere in the middle.
Any advice on the best way to do this using REGEXP would be appreciated.
You need to escape special chars in the search phrase and use the construct that I call "adaptive dynamic word boundaries" instead of word boundaries:
var_text REGEXP CONCAT('(?!\\B\\w)',REGEXP_REPLACE(sk.keyword, '([-.^$*+?()\\[\\]{}\\\\|])', '\\$1'),'(?<!\\w\\B)')
The REGEXP_REPLACE(sk.keyword, '([-.^$*+?()\\[\\]{}\\\\|])', '\\$1') call matches the . ^ $ * + - ? ( ) [ ] { } \ | chars and adds a \ before each of them, while (?!\\B\\w) / (?<!\\w\\B) require a word boundary only on a side where the search phrase starts or ends with a word char.
More details on adaptive dynamic word boundaries and demo in my YT video.
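To see the adaptive boundaries at work, here is a minimal sketch in Python, used as a stand-in for MySQL's regex engine (an assumption; the lookarounds involved behave the same way). re.escape plays the role of the REGEXP_REPLACE escaping step, the helper names are hypothetical, and the sample text is the one from the question.

```python
import re

text = ("Senior software engineer and C++ developer with Unit Test and "
        "JavaScript experience. I also have .NET experience!")

def keyword_pattern(keyword: str) -> str:
    # Escape metacharacters, then require a word boundary only on a side
    # where the keyword itself starts or ends with a word character.
    return r'(?!\B\w)' + re.escape(keyword) + r'(?<!\w\B)'

def matches(keyword: str) -> bool:
    return re.search(keyword_pattern(keyword), text) is not None

keywords = ["C++", ".NET", "JavaScript", "Java", "C#"]
found = [k for k in keywords if matches(k)]
print(found)

# For comparison: plain \b on both sides misses a keyword ending in a symbol.
plain_cpp = re.search(r'\b' + re.escape("C++") + r'\b', text)
print(plain_cpp)
```

"Java" is rejected because it is followed by another word character inside "JavaScript", while "C++" and ".NET" are accepted even though they start or end with a symbol.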
Regular expressions treat several characters as metacharacters. These are documented in the manual on regular expression syntax: https://dev.mysql.com/doc/refman/8.0/en/regexp.html#regexp-syntax
If you need a metacharacter to be treated as the literal character, you need to escape it with a backslash.
This gets very complex. If you just want to search for substrings, perhaps you should just use LOCATE():
WHERE LOCATE(sk.keyword, var_text) > 0
This avoids all the trickery with metacharacters. It treats the string of sk.keyword as containing only literal characters.
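One trade-off of a plain substring search is that it has no notion of word boundaries. A small Python sketch of the same idea (Python's in operator is always case-sensitive, whereas MySQL's comparison depends on the collation; the sample text is shortened from the question):

```python
text = "C++ developer with JavaScript and .NET experience"

# Plain substring test, roughly what LOCATE(keyword, text) > 0 checks.
hits = {k: (k in text) for k in ["C++", ".NET", "Java", "Rust"]}
print(hits)
```

Special characters are no longer a problem, but "Java" is also reported because it occurs inside "JavaScript".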
I'm trying to use a regex to detect URLs in all the rows of my table, here's the regex
\b(([\w-]+:\/\/?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|\/)))
However, I invariably get the "repetition-operator operand invalid" error, which, after hours of search on the internet, still remains obscure.
Where have I gone wrong? What can I do to fix this? Alternatively, is there a better way to detect URLs in messages in SQL other than a regex?
Thank you.
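The error comes from the (?:...) group: MySQL's regex syntax (before 8.0) is POSIX-based and has no non-capturing groups, so the ? right after ( is read as a repetition operator with nothing to repeat. Use a plain group (...) instead. For the optional second slash, you can use * to match 0 or more characters. Also, \b in MySQL regex should be replaced with [[:<:]] (since this matches at the beginning of a word).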
Thus, I suggest using
[[:<:]](([a-zA-Z0-9-]+:\/\/*|www[.])[^ ()<>]+(\([a-zA-Z0-9_]+\)|([^ [:punct:]]|\/)))
I am expanding \w to [a-zA-Z0-9_] as it is exactly what \w is. Instead of \s, I am using a literal space. Instead of \d, I am using [0-9]. This is done for readability and better compatibility. If \w, \d and \s work for you, you can use them, but I do not see them among the supported entities in POSIX specs.
Also, instead of a literal space, you could use [:space:] inside a bracket expression; it matches whitespace such as space, tab, newline, and carriage return. Instead of [a-zA-Z] you can use [:alpha:], and instead of [0-9], you can use [:digit:]. Please also check this:
[[:<:]](([[:alpha:][:digit:]-]+:\/\/*|www[.])[^[:space:]()<>]+(\([[:alpha:][:digit:]_]+\)|([^[:space:][:punct:]]|\/)))
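The expanded pattern can be tried out on a few sample messages. The sketch below is an approximation in Python, not MySQL: Python's re has no [[:<:]] or [:punct:], so \b and an explicit punctuation list from string.punctuation stand in for them. The sample messages and the find_url helper are hypothetical.

```python
import re
import string

# Stand-in for [:punct:]: every ASCII punctuation character, escaped.
PUNCT = re.escape(string.punctuation)

url_re = re.compile(
    r'\b(([a-zA-Z0-9-]+://*|www[.])'   # scheme:/ plus optional slashes, or www.
    r'[^ ()<>]+'                        # body: no spaces, parens, angle brackets
    r'(\([a-zA-Z0-9_]+\)|([^ ' + PUNCT + r']|/)))'  # last char: not punctuation
)

def find_url(message):
    # Return the first URL-like substring, or None if there is none.
    m = url_re.search(message)
    return m.group(0) if m else None

print(find_url("see http://example.com/path, thanks"))
print(find_url("visit www.example.org."))
print(find_url("no links here"))
```

The final group is what keeps trailing punctuation such as a comma or a full stop out of the match.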
I'm trying to recognize quoting (citing) of somebody else's sentence in markdown text, which I have in my local copy of the MySQL GHTorrent dataset. So I wrote this query:
select * from github_discussions where body rlike '(.)*(\s){1,}(>)(\s){1,}(.)+';
It matches some unwanted data that, according to https://regex101.com/, this particular regular expression should not match.
Test string:
`Params` is plural -> contain<s>s</s>
Matched on MySQL database, not matched at regex101 dot com.
Obvious example of quoting, but not matched at db:
Yes, I believe so.\r\n\r\n\r\n\r\nK\r\n\r\n> On 19-Jul-2014, at 17:33, Stefan Karpinski <notifications#github.com> wrote:\r\n> \r\n> This is the standard 3-clause BSD license, right?\r\n> \r\n> —\r\n> Reply to this email directly or view it on GitHub.
Moreover, MySQL Workbench didn't show those carriage return and newline symbols until I copy-pasted the text here.
Can I normalize (remove \r and \n) with some update query?
Is the MySQL regex implementation different from POSIX standard regex?
Do you by any chance have a clean solution for recognizing quoting in markdown text?
Thanks!
You've got an awful lot of parens in there. Try this as functionally what you have above:
select * from github_discussions where body rlike '.*[[:blank:]]+>[[:blank:]]+.+'
However, I'm not sure that's really what you want. This would happily match this line:
this is before > and after
which by my understanding is not a quoted string in markdown. Instead I would anchor it at the beginning like this:
select * from github_discussions where body rlike '^[[:blank:]]*>[[:blank:]]+'
That will match a greater-than sign at the beginning of the line, optionally preceded by whitespace. Is that what you are looking for?
I'm not sure if your data has newlines embedded. If so, you may need to look into ways of having your regex identify newlines using the ^ anchoring symbol. As is the well accepted conclusion in regex literature, that is left as an exercise for the student. :-)
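One likely reason for the mismatch with regex101: in a quoted MySQL string with the default SQL mode, an unrecognized escape such as \s loses its backslash, so the engine receives a literal letter s. The Python sketch below (a stand-in for MySQL, with the second sample abridged from the question) compares that effective pattern with a line-anchored one that handles embedded newlines through multi-line mode.

```python
import re

sample_inline = "`Params` is plural -> contain<s>s</s>"
sample_quote = ("Yes, I believe so.\r\n\r\nK\r\n\r\n"
                "> On 19-Jul-2014, at 17:33, Stefan Karpinski wrote:\r\n"
                "> \r\n"
                "> This is the standard 3-clause BSD license, right?")

# What the engine effectively sees once '\s' has lost its backslash:
# the pattern now looks for the letters s>s, as in "<s>s</s>".
as_seen_by_mysql = r'(.)*(s){1,}(>)(s){1,}(.)+'

# Anchored version: '>' at the start of any line, optional leading blanks.
quote_line = re.compile(r'^[ \t]*>[ \t]+', re.MULTILINE)

print(bool(re.search(as_seen_by_mysql, sample_inline)))  # unwanted match
print(bool(re.search(as_seen_by_mysql, sample_quote)))   # real quote missed
print(bool(quote_line.search(sample_inline)))
print(bool(quote_line.search(sample_quote)))
```

This reproduces both observations from the question: the strikethrough sample matches the effective pattern, and the real quotation does not.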
I have multiple <li> in my code, well over 3,000 of them (don't ask!).
They are all either in the format:
<li>Name, Job, Company</li>
or
<li>Job, Company</li>
I need to find the ones that contain a Name (i.e. the ones with two commas ,, as opposed to just one), and remove the names. I was hoping to use Sublime Text's Regex find+replace feature.
Now, I can select all the lines that contain two commas using the following regex:
<li>.*,.*,.*</li>
But how do I now replace those with just the second and third .*s, discarding the first?
Find this:
<li>.*,(.*),(.*)</li>
Replace with:
<li>\1,\2</li>
or
<li>$1,$2</li>
depending on what your editor supports.
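sed -r 's/<li>[^,]*, *([^,]*,[^,]*<\/li>)/<li>\1/'
Use [^,]* rather than .* because .* would also match the comma. Keeping <li> in both the pattern and the replacement preserves the opening tag.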
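The same replacement can be checked on the two line formats from the question. This is a minimal sketch in Python rather than Sublime Text or sed (an assumption; capture groups and negated character classes work the same way in all three).

```python
import re

lines = [
    "<li>Name, Job, Company</li>",
    "<li>Job, Company</li>",
]

# Match only items with exactly two commas; [^,<]* cannot cross a comma
# or run into the next tag, unlike .*
pattern = re.compile(r'<li>[^,<]*,\s*([^,<]*,[^,<]*)</li>')

# Keep the captured "Job, Company" part and drop the name before it.
cleaned = [pattern.sub(r'<li>\1</li>', line) for line in lines]
print(cleaned)
```

Items that already have a single comma do not match the pattern and are left untouched.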
I have a document that was converted from PDF to HTML for use on a company website, where it will be referenced and indexed for search. I'm formatting the converted document to meet my needs, and in doing so I'm trying to clean up some of the junk carried over from the PDF, such as page numbers, headers, and footers. Luckily, all of the lines that need to be removed come in blocks of 4 lines. Unfortunately, the blocks are not exactly the same, so they cannot be removed with a simple literal replace: the lines contain numbers that increment with the page numbers. How can I remove the following example from my HTML file?
Title<br>
10<br>
<hr>
<A name=11></a>Footer<br>
I've tried many different regular expressions, but as my skill in that area is limited I can't find the proper syntax. I'm sure I'm missing something fairly easy, as it would seem all I need is a wildcard replace for the two numbers in the code while the rest is literal.
Any help is appreciated.
The search & replace of Notepad++ is quite odd. I can't find newline characters with a regular expression, although the documentation says:
As of v4.9 the Simple find/replace (control+h) has changed, allowing the use of \r \n and \t in regex mode and the extended mode.
I updated to the latest version, but it just doesn't work. Using the extended mode allows me to find newlines, but then I can't specify wildcards.
However, you can use macros to overcome this problem.
prepare a search that will find a unique passage (like Title<br>\r\n, here you can use the extended mode)
start recording a macro
press F3 to use your search
mark the four lines and delete them
stop recording the macro ... done!
Just replay it and it deletes what you wanted to delete.
If I have understood your request correctly this pattern matches your string:
Title<br>( ?)\n([0-9]+)<br>( ?)\n<hr>( ?)\n<A name=([0-9]+)></a>Footer<br>
I use the Regex Coach to try out complicated regex patterns. Other utilities are available.
edit
As I do not use Notepad++ I cannot be sure that this pattern will work for you. Apologies if that transpires to be the case. (I'm a TextPad man myself, and it does work with that tool).
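If the editor's regex mode cannot match across newlines, a short script is another option. The sketch below uses Python (an assumption, not one of the tools discussed above) with a made-up surrounding document; the \r?\n parts accept both Windows and Unix line endings.

```python
import re

# Hypothetical excerpt: one junk block between two real paragraphs.
html = (
    "<p>Body text before.</p>\n"
    "Title<br>\n"
    "10<br>\n"
    "<hr>\n"
    "<A name=11></a>Footer<br>\n"
    "<p>Body text after.</p>\n"
)

# Everything is literal except the two page numbers (\d+) and line endings.
block = re.compile(
    r'Title<br>[ \t]*\r?\n'
    r'\d+<br>[ \t]*\r?\n'
    r'<hr>[ \t]*\r?\n'
    r'<A name=\d+></a>Footer<br>[ \t]*\r?\n?'
)

cleaned = block.sub('', html)
print(cleaned)
```

Because the numbers are matched with \d+, the same expression removes every such block regardless of the page it came from.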