Isolate an email address from a string using MySQL - mysql

I am trying to isolate an email address from a block of free field text (column name is TEXT).
There are many different variations of preceding and succeeding characters in the free text field, e.g.:
email me! john@smith.com
e:john@smith.com m:555-555-5555
john@smith.com--personal email
I've tried variations of INSTR() and SUBSTRING_INDEX() to first isolate the "@" (probably the one reliable constant in finding an email...), then extract the characters to its left (up until a space or a non-qualifying character like "-" or ":"), and do the same with the text following the "@".
However, nothing I've tried so far has filtered out the noise to the level I need.
Obviously 100% accuracy isn't possible but would someone mind taking a crack at how I can structure my select statement?

There is no easy way to do this within MySQL. However, you can do it easily with regular expressions after you have retrieved the text.
Here is an example of how to use it in your case: Regex example
If you want to select all e-mail addresses from one string: Regex Example
In MySQL you can use a regex to select the rows that contain an e-mail address, but that does not extract the matched group from the string. The extraction has to be done outside MySQL.
-- POSIX classes are used because \w loses its backslash inside a MySQL string literal
SELECT * FROM table
WHERE column RLIKE '[[:alnum:]_]+@[[:alnum:]_]+[.][[:alnum:]_]+'
RLIKE only tests for a match. You can also use REGEXP in the SELECT list, but it only returns 1 or 0 depending on whether a match was found.
If you do want to extract it in MySQL, maybe this other Stack Overflow post helps you out. But it seems like a lot of work compared with doing it outside MySQL.
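The "do it outside MySQL" route from this answer could look like the following Python sketch. It assumes the TEXT values have already been fetched into a list and that addresses use the usual @ separator. The pattern and the name extract_emails are illustrative, and the pattern is deliberately loose rather than a full address validator.

```python
import re

# Loose pattern: local part, "@", domain labels, then a dot and a 2+ letter TLD.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(text):
    """Return every substring of text that looks like an e-mail address."""
    return EMAIL_RE.findall(text)

# Sample rows modelled on the question; in practice these come from the query result.
rows = [
    "email me! john@smith.com",
    "e:john@smith.com m:555-555-5555",
    "john@smith.com--personal email",
    "no address here",
]

for row in rows:
    print(extract_emails(row))
```

Because findall returns a list, a row with several addresses yields all of them and a row with none yields an empty list.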

As of MySQL 8.0 you can use REGEXP_SUBSTR to isolate just the email from a block of free text. Backslashes must be doubled inside the string literal so that the regex engine sees \. as a literal dot.
SELECT *, REGEXP_SUBSTR(`TEXT`, '([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+)\\.([a-zA-Z]{2,4})') AS Emails FROM `mytable`;
If you want to get just the records with emails and remove duplicates:
SELECT DISTINCT REGEXP_SUBSTR(`TEXT`, '([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+)\\.([a-zA-Z]{2,4})') AS Emails FROM `mytable` WHERE `TEXT` REGEXP '([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+)\\.([a-zA-Z]{2,4})';

Related

Why isn't MySQL REGEXP filtering out these values?

So I'm trying to find what "special characters" have been used in my customer names. I'm updating this query to find them all one by one, but it's still showing all customers with a - despite me trying to exclude that in the query.
Here's the query I'm using:
SELECT * FROM customer WHERE name REGEXP "[^\da-zA-Z\ \.\&\-\(\)\,]+";
This customer (and many others with a dash) are still showing in the query results:
Test-able Software Ltd
What am I missing? Based on that regexp, shouldn't that one be excluded from the query results?
Testing it on https://regex101.com/r/AMOwaj/1 shows there is no match.
Edit - So I want to FIND any which have characters other than the ones in the regex character set. Not exclude any which do have these characters.
Your pattern checks whether the string contains any character that does not belong to the character class, whereas you want to ensure that none of its characters belongs to it.
You can use ^ and $ to check the whole string at once:
SELECT * FROM customer WHERE name REGEXP '^[^\da-zA-Z .&\-(),]+$';
This would probably be simpler expressed with NOT, and without negating the character class:
SELECT * FROM customer WHERE name NOT REGEXP '[\da-zA-Z .&\-(),]';
Note that you don't need to escape all the characters within the character class, except probably for -.
Use [0-9] or [[:digit:]] to match digits irrespective of MySQL version.
Place the hyphen where it cannot form part of a range, for example at the end of the character class.
Fix the expression as follows:
SELECT * FROM customer WHERE name REGEXP "[^0-9a-zA-Z .&(),-]+";
If the entire text should match this pattern, enclose with ^ / $:
SELECT * FROM customer WHERE name REGEXP "^[^0-9a-zA-Z .&(),-]+$";
A - inside a character class implies a range unless it comes first (or immediately after the negating ^).
So use
"[^-0-9a-zA-Z .&(),]"
I removed the + at the end because you don't really care how many; this way it will stop after finding one.
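A likely reason the dash slipped through, sketched in Python rather than MySQL: inside a MySQL string literal an unrecognized escape such as \- or \( simply loses its backslash (unless NO_BACKSLASH_ESCAPES is set), so the regex engine sees &-( as a range that does not include the dash. The first pattern below is a reconstruction of what the engine would then receive; the second is the hyphen-first form. Python's re is not the MySQL engine, so treat this as an illustration of the character-class logic only.

```python
import re

# What the original class looks like once the backslashes are stripped:
# "\d" becomes "d" and "&-(" becomes a range covering & ' ( only.
stripped = re.compile(r"[^da-zA-Z .&-(),]")

# Hyphen placed first (right after ^), so it is a literal member of the class.
fixed = re.compile(r"[^-0-9a-zA-Z .&(),]")

name = "Test-able Software Ltd"

print(stripped.search(name))  # finds the dash, so the row is returned
print(fixed.search(name))     # None: every character is in the allowed set
```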

sphinx search return match and surrounding characters

In Sphinx (using mysql connection), how can I match for a term and also get, let's say 5 characters before and after the match?
For example: a row contains this is a word in a sentence.
When I query: SELECT term FROM table WHERE MATCH('"word*"')
I would want to see is a word in a s returned. Is this possible?
Edit: Using @barryhunter's helpful answer below, now trying to figure out how to fit this into my query:
SELECT field1,field2,SPHINX_SNIPPETS(field3,indexName, "word") as snippet FROM tableName
That's what CALL SNIPPETS is designed for, although its limits are counted in words, not characters.
http://sphinxsearch.com/docs/current/sphinxql-call-snippets.html
CALL SNIPPETS('this is a word in a sentence','index1','word', 2 AS around, 5 AS limit_words);
would get back
... is a word in a ...
Add '' AS chunk_separator if you don't want the ellipsis.
Edit to add: if you want to build the snippet during the search query (not as a separate CALL query), you can use the SNIPPET() function in the initial SELECT:
http://sphinxsearch.com/docs/current.html#sphinxql-select

MYSQL Find entries that contain more than 7 numbers

I need to find entries that contain more than 7 numbers in one of my mysql tables BUT the numbers are separated by letters or anything else.
What I have is this little piece of code I use to find entries like dsc123456789:
select * from crawl where title regexp '[0-9]{7}'
How can I find entries like dsc-123-456_78B9? I tried different things but without success so far.
Thanks
You can use the following solution:
SELECT *
FROM crawl
WHERE title REGEXP '(([^[:digit:]])?[[:digit:]]){8,}';
Why doesn't the original query from this answer work?
-- this query doesn't work!
SELECT *
FROM crawl
WHERE title REGEXP '\d([^\d]?\d){7,}'
Before MySQL 8.0, the regex engine does not support shorthand character classes like \d (digits), so the query never matches as intended. In PHP and other languages the regular expression would look like this:
\d([^\d]?\d){7,}
but in those MySQL versions it isn't valid, so you have to use POSIX character classes instead:
(([^[:digit:]])?[[:digit:]]){8,}
Hint: Make sure you use {8} or {8,} instead of {7} since you want to find all entries with more than 7 numbers / digits.
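For a quick sanity check of the {8,} logic outside the database, the same idea can be written in Python, where \D?\d plays the role of ([^[:digit:]])?[[:digit:]]. This is only a sketch: the sample titles other than the two from the question are made up, and like the SQL pattern it allows at most one non-digit between consecutive digits.

```python
import re

# Eight or more digits, each optionally preceded by a single non-digit.
MORE_THAN_SEVEN = re.compile(r"(?:\D?\d){8,}")

titles = ["dsc123456789", "dsc-123-456_78B9", "abc1234567", "img-12-34"]

for title in titles:
    # Prints the title and whether it contains such a run of digits.
    print(title, bool(MORE_THAN_SEVEN.search(title)))
```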

MySQL UNION query correct handling for 3 or more words

I need your help to solve this problem.
My website has a search field; let's say the user types "Korg X 50".
In my database, the table "products" has a field "name" that holds "X50" and a field "brand" that holds "Korg". Is there a way to use UNION to get the correct record?
And what if the user enters "Korg X-50"?
Thank you very much !
Matteo
Maybe you should use full-text search:
SELECT brand, name, MATCH (brand,name) AGAINST ('Korg X 50') AS score
FROM products WHERE MATCH (brand,name) AGAINST ('Korg X 50')
As far as I understand you don't need UNION but something like
SELECT * FROM table1
WHERE CONCAT(field1, field2) LIKE '%your_string%'
On the client side, strip from your_string every character (space, hyphen, etc.) that can appear in user input but cannot occur in field1 or field2.
So, user input Korg X 50 as well as Korg X-50 becomes KorgX50.
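The client-side clean-up described here might be sketched as follows in Python. The helper names normalize and like_parameter and the decision to keep only ASCII letters and digits are assumptions; the resulting value is meant to be bound as a parameter to the LIKE query above, not spliced into the SQL string.

```python
import re

def normalize(user_input):
    """Drop everything except letters and digits, e.g. spaces and hyphens."""
    return re.sub(r"[^0-9A-Za-z]", "", user_input)

def like_parameter(user_input):
    """Build the value for a LIKE placeholder with wildcards on both sides."""
    return "%" + normalize(user_input) + "%"

for raw in ("Korg X 50", "Korg X-50"):
    print(raw, "->", normalize(raw), like_parameter(raw))
```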
You will need to get some form of searchable text.
Either parse the input into multiple keywords and match each one separately, or append them all together and match against the columns appended in the same way.
You will also need either a regex or a simpler search-and-replace to get rid of spaces and dashes after the append and before the comparison.
In general, allowing users to search for open-ended text strings is more complicated than "what UNION do I use". Ideally you will also handle slight misspellings, capitalization, and keyword order.
You may consider pulling all keywords out of each product record into a separate keyword list associated with that product, then using that list to perform your searches.
If you do not want to parse the user input and prefer to use it as it is, you will need a query like this:
select * from products where concat_ws(' ',brand,name) = 'user_input' -- or
select * from products where concat_ws(' ',brand,name) like '%user_input%'
However, this query won't return a result if the user enters "Korg X-50" and your table contains "Korg" and "X50"; you then need to do something else to achieve this. You may look at http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_soundex, although it won't be a complete solution. Look into text indexing libraries for that, e.g. Lucene.

MySQL Find similar strings

I have an InnoDB database table of 12 character codes which often need to be entered by a user.
Occasionally, a user will enter the code incorrectly (for example typing a lower case L instead of a 1 etc).
I'm trying to write a query that will find similar codes to the one they have entered but using LIKE '%code%' gives me way too many results, many of which contain only one matching character.
Is there a way to perform a more detailed check?
Edit - Case sensitive not required.
Any advice appreciated.
Thanks.
Have a look at soundex. Commonly misspelled strings have the same soundex code, so you can query for:
where soundex(Code) like soundex(UserInput)
Use LIKE without the % wildcard for that:
SELECT `code` FROM table WHERE code LIKE 'user_input'
This will also take trailing spaces into account:
SELECT 'a' = 'a ' returns 1 (under a PAD SPACE collation), whereas SELECT 'a' LIKE 'a ' returns 0.
reference
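If the candidate codes can be loaded into the application, a similarity ranking is another option. The sketch below uses Python's difflib.get_close_matches, which is a swapped-in technique and not something the answers here propose. The sample codes, the 0.8 cutoff and the helper name suggest_codes are all made up for illustration.

```python
import difflib

def suggest_codes(user_input, known_codes, cutoff=0.8):
    """Return up to three known codes that closely resemble user_input."""
    # Compare in upper case because the question says case does not matter.
    by_upper = {code.upper(): code for code in known_codes}
    hits = difflib.get_close_matches(
        user_input.upper(), list(by_upper), n=3, cutoff=cutoff
    )
    return [by_upper[hit] for hit in hits]

codes = ["AB12CD34EF56", "ZZ99YY88XX77", "QW34ER56TY78"]

# A lower-case L typed instead of the digit 1.
print(suggest_codes("ABl2CD34EF56", codes))
```

The cutoff controls how strict the match is: raising it towards 1.0 returns fewer, closer candidates, which addresses the "way too many results" problem with LIKE '%code%'.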