MySQL Pattern matching to find specific text - mysql

My working database is from a web forum. It has a table containing all posts (i.e. the text a user submitted within a thread). The table has a column called message, which holds the actual content of the post. A post can contain any character, as well as smilies. A smilie is written as a colon, immediately followed by a short description of variable length, and a closing colon, e.g. :clap:. A single post can contain multiple smilies.
I am trying to come up with a way to pull out a list of all smilies within the posts table.
What I have working so far is a query that pulls a list of posts containing at least two colons:
SELECT
thread_id
, post_id
, SUBSTRING_INDEX(SUBSTRING_INDEX(message, ':', 2), ':', -1)
FROM
xf_post
WHERE
ROUND((CHAR_LENGTH(message) - CHAR_LENGTH(REPLACE(message, ':', ""))) / CHAR_LENGTH(':')) > 1
LIMIT 50
This works, but will also return messages where a user for whatever reason included multiple colons, like for instance random : text followed : by more text, or a timestamp: 00:00:12345.
What I'm hoping to achieve is to return all occurrences of alphanumeric characters enclosed between colons, without any spaces. (Yes, this will remove all smilies that are purely numeric, but ¯\_(ツ)_/¯).
I fiddled with REGEXP, and came up with the following: [:][a-zA-Z]+(?=:)[:] which according to regex101 yields exactly what I want.
How can I use this to capture the output and see only the values between the colons, preferably in a way that shows all occurrences of a smilie within a single post?
Thank you.

@SimonlucaLandi helped me at least figure out the way to display the results. My final query:
SELECT
thread_id
, post_id
, REGEXP_SUBSTR(message, ':[a-zA-Z]+:')
FROM
xf_post
WHERE
message REGEXP ':[a-zA-Z]+:'
LIMIT 50
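Each REGEXP_SUBSTR call returns a single match, so the query above shows one smilie per post. One way to list every occurrence is to fetch the messages and apply the same pattern client-side. A minimal Python sketch, assuming the rows have already been fetched as (post_id, message) pairs; the function name and the sample rows are hypothetical:

```python
import re

# Same idea as the MySQL pattern ':[a-zA-Z]+:', with a capture group
# so that only the name between the colons is returned.
SMILIE_RE = re.compile(r":([a-zA-Z]+):")


def extract_smilies(message):
    """Return every smilie name found in one post, in order of appearance."""
    return SMILIE_RE.findall(message)


# Hypothetical sample rows standing in for (post_id, message) from xf_post.
rows = [
    (1, "Well done :clap: that made my day :smile:"),
    (2, "random : text followed : by more text"),
    (3, "a timestamp: 00:00:12345"),
]

smilies_by_post = {post_id: extract_smilies(msg) for post_id, msg in rows}
```

Posts 2 and 3 reproduce the false positives from the question (stray colons and a timestamp); neither contains letters enclosed directly between two colons, so they yield empty lists.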

Related

How do I create a SELECT conditional in MySQL where the conditional is the character length of the LIKE match?

I am working on a search function, where the matches are weighted based on certain conditions. One of the conditions I want to add weight to is matches where the character length of the query string in a LIKE match is longer than 4.
This is roughly what I want the query to look like. %s is meant to represent the actual match found by LIKE, but I don't think it does. I'm wondering if there is a special variable in MySQL that represents the precise character match found by LIKE.
SELECT help.*,
IF(CHAR_LENGTH(%s) > 4, 2, 0) w
FROM help
WHERE (
(title LIKE '%this%' OR title LIKE '%testy%' OR title LIKE '%test%') OR
(content LIKE '%this%' OR content LIKE '%testy%' OR content LIKE '%test%')
) LIMIT 1000
edit: I could in the PHP split the search string array into two arrays based on the character length of the elements, with two separate queries that return different values for 'w', then combine the results, but I'd rather not do that, as it seems to me that would be awkward, messy, and slow.
Check out FULLTEXT as another way to discover rows. It will be faster, but won't address your question.
This probably has the effect you want.
SELECT ....
IF ( (title LIKE '%testy%' OR
content LIKE '%testy%'), 2, 0)
....
Note that the "match" in your LIKEs includes the %, so it is the entire length of the string. I don't think that is what you wanted.
REGEXP "(this|testy|that)" will match either 4 or 5 characters (in this example). It may be possible to do something with REGEXP_REPLACE to replace that with the empty string, then see how much it shrank.
I think the answer to my question is that what I wanted to do isn't possible. There is no special variable in MySQL representing the core character match in a WHERE conditional where LIKE is the operator. The match is the contents of the returned data row.
What I did to reach my objective was to take the original dynamic list of search tokens, iterate through that list, and perform a search on each token, with the SQL tailored to the conditions that matched each token.
As I did this I built an array of the search results, using the id for the database row as the index for the array. This allowed me to perform calculations with the array elements, while avoiding duplicates.
I'm not posting the PHP code because the original question was about the SQL.
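The per-token approach described above could look roughly like the following. This is a Python sketch rather than the original PHP, and it assumes the rows are already in memory as dictionaries instead of being fetched with one query per token; the function name, the sample rows, and the weight of 2 for tokens longer than 4 characters (taken from the IF expression in the question) are illustrative only:

```python
def weighted_search(rows, tokens):
    """Score rows by token matches; tokens longer than 4 characters add weight 2."""
    scores = {}  # keyed by row id, so a row matched by several tokens appears once
    for token in tokens:
        weight = 2 if len(token) > 4 else 0
        needle = token.lower()
        for row in rows:
            # Case-insensitive substring test, similar in spirit to LIKE '%token%'.
            if needle in row["title"].lower() or needle in row["content"].lower():
                scores[row["id"]] = scores.get(row["id"], 0) + weight
    return scores


# Hypothetical rows standing in for the help table.
help_rows = [
    {"id": 1, "title": "This is a testy page", "content": "nothing here"},
    {"id": 2, "title": "Another page", "content": "a short test"},
    {"id": 3, "title": "Unrelated", "content": "no match"},
]

scores = weighted_search(help_rows, ["this", "testy", "test"])
```

Keying the result by row id is what avoids duplicates: row 1 matches all three tokens but ends up as a single entry whose weight comes only from the five-character token.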

Matching content in two different columns-MySQL

Let's say I have a list of person names and a list of social media URL's (that might or might not contain a portion of the person names).
I'm trying to see if the full name is not contained in the list of URL's I have. I don't think a "not like" would work here (because the URL has plenty of other characters to throw back a result), but I can't think of any other way to address this. Any tips? The closest I could find was from this:
Matching partial words in two different columns
But I'm unsure if that applies here.
Just use SELECT * FROM yourtable WHERE url LIKE '%name%'. The % wildcard matches any characters, even whitespace. Then just check whether it returned any rows.
From mysql doc:
% matches any number of characters, even zero characters.
mysql> SELECT 'David!' LIKE 'David_';
-> 1
mysql> SELECT 'David!' LIKE '%D%v%';
-> 1
So let's say these are your url's in your list:
website.com/peterjohnson
website.com/jackmiller
website.com/robertjenkins
Then if you would do:
SELECT * FROM urls WHERE url LIKE '%peter%'
It would return 1 row.
You can also use NOT LIKE so you will get all the rows not containing the name.
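If the names are stored with spaces ("Peter Johnson") while the URLs contain them run together ("peterjohnson"), a plain LIKE on the full name will not match. The sketch below shows one way to do the comparison client-side in Python. It assumes that removing whitespace and lowercasing is enough to line the two up, which may not hold for real data; the function name and the sample names are hypothetical:

```python
def names_missing_from_urls(names, urls):
    """Return the names whose squashed, lowercased form appears in none of the URLs."""
    lowered_urls = [url.lower() for url in urls]
    missing = []
    for name in names:
        # "Peter Johnson" -> "peterjohnson" (assumption: URLs drop the space)
        slug = "".join(name.lower().split())
        if not any(slug in url for url in lowered_urls):
            missing.append(name)
    return missing


urls = [
    "website.com/peterjohnson",
    "website.com/jackmiller",
    "website.com/robertjenkins",
]
names = ["Peter Johnson", "Jack Miller", "Mary Smith"]

missing = names_missing_from_urls(names, urls)
```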

Isolate an email address from a string using MySQL

I am trying to isolate an email address from a block of free field text (column name is TEXT).
There are many different variations of preceding and succeeding characters in the free text field, i.e.:
email me! john@smith.com
e:john@smith.com m:555-555-5555
john@smith.com--personal email
I've tried variations of INSTR() and SUBSTRING_INDEX() to first isolate the "@" (probably the one reliable constant in finding an email...), then extract the characters to the left (up until a space or non-qualifying character like "-" or ":") and do the same thing with the text following the @.
However - everything I've tried so far hasn't filtered out the noise to the level I need.
Obviously 100% accuracy isn't possible but would someone mind taking a crack at how I can structure my select statement?
There is no easy solution to do this within MySQL. However you can do this easily after you have retrieved it using regular expressions.
Here is an example of how to use it in your case: Regex example
If you want it to select all e-mail addresses from one string: Regex Example
In MySQL you can use a regex to select the rows that contain an e-mail address, but that still doesn't extract the matched text from the string. This has to be done outside MySQL.
SELECT * FROM your_table
WHERE your_column RLIKE '[[:alnum:]_]+@[[:alnum:]_]+[.][[:alnum:]_]+'
RLIKE only tests for a match. You can also use REGEXP in the SELECT list, but it only returns 1 or 0 depending on whether it found a match :s
If you do want to extract it in MySQL maybe this other stackoverflow post helps you out. But it seems like a lot of work instead of doing it outside MySQL
In MySQL 8.0 and later (REGEXP_SUBSTR is not available in MySQL 5.x) you can use REGEXP_SUBSTR to isolate just the email from a block of free text. Note the doubled backslashes: inside a MySQL string literal a single backslash is consumed as an escape character before the pattern reaches the regex engine.
SELECT *, REGEXP_SUBSTR(`TEXT`, '([a-zA-Z0-9._%+\\-]+)@([a-zA-Z0-9.-]+)\\.([a-zA-Z]{2,4})') AS Emails FROM `mytable`;
If you want to get just the records with emails and remove duplicates ...
SELECT DISTINCT REGEXP_SUBSTR(`TEXT`, '([a-zA-Z0-9._%+\\-]+)@([a-zA-Z0-9.-]+)\\.([a-zA-Z]{2,4})') AS Emails FROM `mytable` WHERE `TEXT` REGEXP '([a-zA-Z0-9._%+\\-]+)@([a-zA-Z0-9.-]+)\\.([a-zA-Z]{2,4})';
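For the "do it outside MySQL" route suggested earlier in this thread, the extraction could be sketched in Python as follows. It assumes addresses use the usual @ separator and reuses the same deliberately loose pattern, including the 2 to 4 letter limit on the top-level domain, so it will miss some valid addresses; the function name and sample strings are illustrative:

```python
import re

# Loose email pattern: local part, "@", domain, ".", 2-4 letter TLD.
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}")


def extract_emails(text):
    """Return every substring of text that looks like an email address."""
    return EMAIL_RE.findall(text)


# Sample free-text values modelled on the ones in the question.
samples = [
    "email me! john@smith.com",
    "e:john@smith.com m:555-555-5555",
    "john@smith.com--personal email",
]

found = [extract_emails(sample) for sample in samples]
```

Unlike a single REGEXP_SUBSTR call, findall returns all addresses in a field, which helps when one free-text value holds more than one.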

How to Bulk Update post titles wordpress through mySQL

All my post titles have a static word in front of them. I have over 9000 published posts with two different static words in my post titles. I am trying to remove this word from all my posts.
Essentially, I am looking for a way to remove that constant word. I tried using export through wordpress and editing the post titles that way, but the file was 100mb + and simply won't open.
I looked towards SQL/phpmyadmin but I am not very well versed with SQL queries and don't want to mess up my database.
The "Constant" is always the first word of the title. Since I have more than one constant, it would be better to have MySQL find the first space in the string and remove it along with everything before it, essentially removing the first word. I'm assuming we'd use a substring function or something.
This is the title structure
Constant other stuff here
So if there is a query that finds "Constant " (the word plus a space) and replaces it with "" (nothing), I could have it completely removed. It'd be great if it could be a search-and-replace query, so I could reuse it later.
Information on the Database / Table / Column
The table is called wp_posts, the title is in the post_title column, and the update needs to be restricted to rows where the post_type column has the value post.
Any help would be greatly appreciated.
MySQL doesn't have exactly what you're looking for -- yes, there is a REPLACE() string function, but you can't limit it to a single substitution. As such, you might inadvertently replace other occurrences of this constant that could conceivably appear in your title string.
IMHO, the easiest way is to find all titles starting with your constant, and just replace the first one (i.e. at the start of the string):
UPDATE wp_posts
SET post_title =
MID( post_title, LENGTH('Constant ')+1 )
WHERE post_title LIKE 'Constant %'
AND post_type = 'post';
You need the +1 because MySQL string offsets start at 1, not zero.
Personally, I always prefer to run the equivalent SELECT first, just to be certain (too many years of MyISAM without BEGIN WORK):
SELECT post_title,
MID( post_title, LENGTH('Constant ')+1 ) AS replacedTitle
FROM wp_posts
WHERE post_title LIKE 'Constant %'
AND post_type = 'post';
Alternatively, if you're certain that you always want to remove the first word (i.e. up to and including the first space), then the following statement should work:
UPDATE wp_posts
SET post_title =
MID( post_title, POSITION( ' ' IN post_title )+1 )
WHERE post_type = 'post';
Since POSITION() returns zero if no space is found, MID() then starts at offset 1 and returns the whole title, so for titles without a space this statement is a no-op (i.e. non-destructive).
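The behaviour of the POSITION()/MID() combination can be checked on a handful of titles before touching the database. The following Python sketch only mirrors the logic; it is not something WordPress or MySQL runs, and the function name is hypothetical:

```python
def strip_first_word(title):
    """Drop everything up to and including the first space, if there is one."""
    # str.find returns -1 when there is no space, so the slice starts at 0
    # and the title is returned unchanged. This parallels POSITION()
    # returning 0 and MID(title, 1) returning the whole string in MySQL.
    return title[title.find(" ") + 1:]


before = ["Constant other stuff here", "Single"]
after = [strip_first_word(title) for title in before]
```

As with the SQL version, a title that does not start with one of the constants still loses its first word, so the SELECT preview above remains worth running first.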
Search RegEx is a great plugin to be able to search and replace - with grep or plain text - through all post and page content, post titles, post meta, etc. Does not search custom post types.
Back up your DB before making any changes, either with this plugin or direct query in the database.
Docs: http://urbangiraffe.com/plugins/search-regex/

MySQL Regex - Find where there are 10 numbers in a row in a field

I need to find records where there are 10 numbers in a row in the field. e.g. 1234567890, 8884265555 etc. The field will contain text as well so I need to see if any 10-digit strings exist anywhere within the field.
I have got this far...
SELECT * FROM `comments` WHERE detail REGEXP '[0-9]{10}'
But that returns rows where there are 10 numbers anywhere in the field instead of all in a row. I am trying to detect phone numbers. Thanks!
The regular expression [0-9]{10} matches only where ten digits appear in a row (it will also match inside a longer run of digits). So, your issue must be elsewhere.
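The claim is easy to check outside the database. The Python sketch below tests the same pattern against a few hypothetical strings, and adds a stricter variant that rejects runs longer than ten digits. The lookaround syntax in the second pattern is Python's; whether the same pattern works in MySQL depends on the regex engine of the server version in use:

```python
import re

# Ten consecutive digits anywhere in the text (longer runs also match).
RUN_OF_TEN = re.compile(r"[0-9]{10}")

# Exactly ten digits: no digit directly before or after the run.
EXACTLY_TEN = re.compile(r"(?<![0-9])[0-9]{10}(?![0-9])")

phone = RUN_OF_TEN.search("call 8884265555 now")
scattered = RUN_OF_TEN.search("12345 and 67890")
long_run = RUN_OF_TEN.search("id 123456789012")
strict_long_run = EXACTLY_TEN.search("id 123456789012")
strict_phone = EXACTLY_TEN.search("call 8884265555 now")
```

Here scattered is None because the ten digits are split by text, while long_run still matches because a twelve-digit id contains ten digits in a row; only the stricter pattern rejects it.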