Matching strings without space and punctuation in MySQL - mysql

I'm working on a query that I thought would be straightforward, but I'm running into issues implementing it. I want to match a string stored in a MySQL database while ignoring spaces and punctuation (other creative approaches are more than welcome). At the same time, the query should handle Unicode characters in a diacritic-insensitive fashion (so options like REGEXP are out). Finally, I'm on MySQL 5.5 with the InnoDB engine, so full-text indexing is not supported (but I'm open to upgrading to 5.6/5.7 if that helps).
Consider the scenario in which the string Hello-World from John Doe is stored in the DB. I would like to find it when given the search string HelloWorld or JohnDoe. More generally, the string in the DB can contain brackets, underscores and any other punctuation (not limited to ASCII, though I can compromise on that for now), while the search string can be a combination of words with or without separators in between. The closest I've gotten so far is to daisy-chain the REPLACE function over a list of known punctuation, like below:
SELECT text FROM table WHERE REPLACE(REPLACE(text, '-', ''), ' ', '') LIKE '%JohnDoe%'
My questions are:
Is there a better way instead of using the daisy chain above?
If that's the only solution, how will performance be affected when I chain a hundred or more REPLACE calls?
Thanks in advance for your help.
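If the chain has to stay in SQL, the nested expression does not need to be typed by hand. Below is a minimal sketch in Python (an assumption, since the question names no application language) that builds the expression from a string of characters to strip. `nested_replace` is a hypothetical helper, and the column name is treated as trusted input:

```python
def nested_replace(column: str, chars: str) -> str:
    """Wrap `column` in one REPLACE(...) call per character in `chars`."""
    expr = column
    for ch in chars:
        # Escape backslashes and single quotes for a MySQL string literal
        # (assumes the default SQL mode, where backslash is an escape char).
        literal = ch.replace("\\", "\\\\").replace("'", "''")
        expr = f"REPLACE({expr}, '{literal}', '')"
    return expr


if __name__ == "__main__":
    # Reproduces the expression from the question for '-' and ' '.
    print(nested_replace("text", "- "))
```

This only saves typing. The database still evaluates every REPLACE for every row, and a LIKE with a leading wildcard on a computed expression cannot use an ordinary index.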

I don't know how restrictive your searches must be, but you could try to strip out all non-alphanumeric characters from it, so that you end up with a string like "HelloWorldfromJohnDoe" that you match with instead.
Have a look at this answer: How to remove all non-alpha numeric characters from a string?
You might have to change it around a bit to make it fit your purposes. I changed it from CHAR(32) to CHAR(255) to make sure it could hold the column value, but you might want to look into changing the function altogether to fit your data more precisely.
Then you can do something like this:
SELECT *
FROM testing
WHERE alphanum(test) LIKE CONCAT('%', alphanum('John Doe'), '%')
which should give you a hit.

Method 1
I would add another column to the schema containing a "hashed" version of the name. For example, let's say you have the user:
John Doe The Great
This name hashes to
johndoethegreat
The hash function is coded in such a way that all of the following strings:
John_Doe_THE_great
John Doe The GREAT
John.Doe.The.Great
johnDOE___theGreat
john Doe the great
___john____DOE____THE____great
hash to the same value
johndoethegreat
It's trivial to write such a function. This way you can take the user input, hash it, and then compare it against the hash column in your database.
Names like:
Jon Doe
John Doo
will not be found of course
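A minimal sketch of such a "hash" function, written in Python as an assumption about the application language. `normalize_key` is a hypothetical name. It also strips diacritics by decomposing characters and dropping the combining marks, which covers the diacritic-insensitive requirement from the question:

```python
import unicodedata


def normalize_key(text: str) -> str:
    """Reduce a name to lowercase letters and digits only."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # isalnum() is False for combining marks, spaces and punctuation,
    # so one filter removes all three.
    kept = [ch for ch in decomposed if ch.isalnum()]
    return "".join(kept).casefold()


if __name__ == "__main__":
    for name in ["John_Doe_THE_great", "John.Doe.The.Great", "Jöhn Dóe the great"]:
        print(normalize_key(name))
```

The stored column would hold `normalize_key(name)`, and the search input would be passed through the same function before comparison. For substring searches such as JohnDoe inside Hello-World from John Doe, the normalized input would be used with LIKE '%...%' against the normalized column.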
Method 2
Use the FULLTEXT search feature built into MySQL, sort the results by score, and pick the first non-zero entry:
http://blog.oneiroi.co.uk/mysql/php/mysql-full-text-search-with-percentage-scoring/

I am totally missing the point of your question. You appear to have the string:
Hello-World from John Doe
If you want to find this when the search string is JohnDoe or John Doe, then you only need to substitute spaces:
where replace(text, ' ', '') like concat('%', 'JohnDoe', '%')
If you want a string that contains both "John" and "Doe" in that order, then:
where replace(text, ' ', '') like concat('%', 'John%Doe', '%')
I fail to see why 100 nested replace()s would be needed.

Related

remove character on sql

I have a table
| John
| Robert
| Mary
| James
| Bond
I want to remove the leading 'J' from the names that start with it, using SUBSTR:
| ohn
| Robert
| Mary
| ames
| Bond
This is my SQL code, but it is still not working:
SELECT SUBSTR(name,2) FROM table_name WHERE name LIKE 'J%'
The query you wrote says "show me all the names that needed to be modified", but what I think you want is, "Show me ALL the names, but names that follow a particular pattern I want you to modify first."
The job of modifying the data in the results is the responsibility of the SELECT statement. You've touched on it yourself with your use of SUBSTR. So you want to pull all the names, but only change some of them. Further study into the catalog of MySQL string operations reveals a dizzying array of options. The goal is "SELECT name but if name starts with 'J' then chop that off."
For educational purposes, I encourage you to try to implement this with IF logic, but ultimately that's not necessary.
And while the ultimately powerful regex functions are tempting, there's a simpler option, TRIM. TRIM allows you to say, "chop this string from the front and/or back of this other string."
Since you want all results, there is no WHERE clause anymore, and your query is simply
SELECT TRIM(LEADING 'J' FROM name) FROM table_name
Look at that. FROM meaning two different things. No one ever said SQL was pretty.
If your actual use-case is trickier than simple TRIM can handle, there's a whole bunch more functions to peruse, and ultimately there's regex.
You can check other forms of the TRIM function at the following link:
TRIM Functions

MySQL: Search keyword in a comment string

I've looked through several similar topics on Stack Overflow, but I have not found anything that helps yet. I have this SQL query:
SELECT * FROM twitter_result
WHERE LOWER(TweetComment) LIKE LOWER('%lebron james%')
AND LOWER(TweetComment) LIKE LOWER('%NBA%')
I want to find a TweetComment that contains both "LeBron James" and "NBA". But these two terms need to stand alone: the query should not return a tweet that contains #LeBron James and #NBA (or NBATalk).
For instance, it should return a tweet like this
LeBron James Donates $41 Million To Send 1,100 Kids To College, Becomes 6th Most Charitable Athlete NBA In World
where LeBron James and NBA stand alone (no # characters). I have the LOWER calls there to make the match case-insensitive. Any help is greatly appreciated. Thanks.
Sorry, I forgot to add: I am just using SQL in phpMyAdmin.
Although there are solutions using regular expressions, it is hard to propose one without knowing the database you are using.
Instead, you can remove the tags you don't want before doing the like:
WHERE REPLACE(LOWER(TweetComment), '#lebron james', '') LIKE LOWER('%lebron james%') AND
REPLACE(LOWER(TweetComment), '#nba', '') LIKE LOWER('%NBA%')
If you plan to use a regexp, use:
select * from twitter_result
where -- ignore tweets that contain #lebron james or #nba
TweetComment not regexp '#lebron james|#nba'
-- select only those tweets that contain lebron james AND nba
and TweetComment regexp '[[:<:]]lebron james[[:>:]]'
and TweetComment regexp '[[:<:]]nba[[:>:]]'
All the patterns being searched for have to be stated explicitly, as MySQL by default doesn't support lookarounds.
The above match is case insensitive by default. Use regexp binary if the search needs to be case sensitive. Add more search words as needed.
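If filtering in application code is an option, a regex engine with lookarounds can express "stands alone and is not a hashtag" in one pattern. A minimal sketch in Python (an assumption, as the question only mentions phpMyAdmin); `mentions_standalone` is a hypothetical helper:

```python
import re


def mentions_standalone(tweet: str, phrase: str) -> bool:
    """True if `phrase` occurs as whole words and is not preceded by '#'."""
    # (?<![#\w]) rejects a preceding '#' or word character;
    # (?!\w) rejects a following word character (e.g. NBATalk).
    pattern = r"(?<![#\w])" + re.escape(phrase) + r"(?!\w)"
    return re.search(pattern, tweet, flags=re.IGNORECASE) is not None


if __name__ == "__main__":
    tweet = "LeBron James Donates $41 Million To Send 1,100 Kids To College, NBA In World"
    print(mentions_standalone(tweet, "lebron james") and mentions_standalone(tweet, "nba"))
```

Unlike the NOT REGEXP filter above, this keeps a tweet that contains both #NBA and a standalone NBA, which may or may not be what is wanted.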

How do I assign a variable to each letter of a string in MySQL?

I am trying to figure out a way of doing an "anagram" function as a stored procedure in MySQL. Let's say I have a database containing all the words in the dictionary. I want to enter some letters as a VARCHAR parameter and get back a list of words that are anagrams of those letters.
I guess what I'm sort of saying is, how do I run an SQL command to say "Select all words which are the same length as the parameter AND contain each of the letters in the parameter".
I have explored the string functions available (http://www.hscripts.com/tutorials/mysql/string-function.php). I'm sure these can be used in conjunction in some way but can't quite get the syntax right when it gets complicated.
I am new to SQL, and it just seems like the String functions available are very limited. Any help would be greatly appreciated :)
You don't; it's not a sensible thing to ask a relational database to do.
However, if someone was forcing me at gunpoint to implement anagram finding using a relational database, I would denormalize it like this:
word | sorted
-----|-------
bar | abr
bra | abr
keel | eekl
leek | eekl
Where "sorted" consists of all of the letters in "word", sorted using any rule you like as long as it's a total order. You would use something other than SQL to compute that part.
Then you could find anagrams with something like this:
SELECT w2.word AS anagram
FROM words w1
JOIN words w2 ON w1.sorted=w2.sorted
WHERE w1.word = 'leek'
AND w2.word <> w1.word
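The "sorted" column above has to be computed outside SQL. A minimal sketch in Python (the answer does not name a language, so this is an assumption); `sorted_key` is a hypothetical helper that would be run once per word when loading the table:

```python
def sorted_key(word: str) -> str:
    """Return the letters of `word` in sorted order, lowercased."""
    return "".join(sorted(word.lower()))


if __name__ == "__main__":
    for w in ["bar", "bra", "keel", "leek"]:
        print(w, sorted_key(w))
```

To look up anagrams of arbitrary input letters rather than of an existing word, the same function can be applied to the input and the result compared against the sorted column directly.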
SQL is probably not the right place to do this; you should do it on the front end.
First of all, consider the properties of an anagram: it has the same length as the input letters. You can start by retrieving the dictionary words of that length.
Instead of creating a variable per letter, consider using an array.
Each letter maps to an index (a=0, b=1, c=2, d=3, etc.). Each time you run into that letter, increase the value for that bucket, so for the word "dad" you'll end up with a structure that looks like this:
arr[0]=1, arr[1]=0, arr[2]=0, arr[3]=2, arr[4]=0 and so on...
Now you can just check whether a word's array matches the input's array item by item.
You can also represent that kind of logic in the database: for example, another table that references the dictionary word and stores the array in each row, so that you can retrieve all the rows with the same values.
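The letter-bucket idea can be sketched as follows, in Python as an assumption about the front-end language. `letter_counts` and `is_anagram` are hypothetical names, and only the letters a to z are counted:

```python
from typing import List


def letter_counts(word: str) -> List[int]:
    """Count occurrences of each letter a..z (index 0 is 'a', 3 is 'd')."""
    counts = [0] * 26
    for ch in word.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1
    return counts


def is_anagram(candidate: str, letters: str) -> bool:
    """Two strings are anagrams when every bucket holds the same count."""
    return letter_counts(candidate) == letter_counts(letters)


if __name__ == "__main__":
    print(letter_counts("dad")[:5])
    print(is_anagram("leek", "keel"))
```

Equal counts in every bucket already imply equal length, so the length filter is only a way to fetch fewer candidate words from the database.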

mysql query to match sentence against keywords in a field

I have a mysql table with a list of keywords such as:
id | keywords
---+--------------------------------
1 | apple, oranges, pears
2 | peaches, pineapples, tangerines
I'm trying to figure out how to query this table using an input string of:
John liked to eat apples
Is there a mysql query type that can query a field with a sentence and return results (in my example, record #1)?
One way to do it could be to convert apple, oranges, pears to apple|oranges|pears and use RLIKE (ie regular expression) to match against it.
For example, 'John liked to eat apples' matches the regex 'apple|oranges|pears'.
First, to convert 'apple, oranges, pears' to the regex form, replace all ', ' by '|' using REPLACE. Then use RLIKE to select the keyword entries that match:
SELECT *
FROM keywords_table
WHERE 'John liked to eat apples' RLIKE REPLACE(keywords,', ','|');
However, this does depend on your comma separation being consistent: if there is a row that looks like apples,oranges it won't work, because the REPLACE only replaces a comma followed by a space (as in your example rows).
I also don't think it'll scale up very well.
And, if you have a sentence like 'John liked to eat pineapples', it would match both of the rows above (as it does have 'apple' in it). You could then try to add word boundaries to the regex (i.e. WHERE $sentence RLIKE '[[:<:]](apple|oranges|pears)[[:>:]]'), but this would screw up matching when you have plurals ('apples' wouldn't match '[wordboundary]apple[wordboundary]').
Hopefully this isn't more abstract than what you need, but it may be a good way of doing it.
I haven't tested this but I think it would work. If you can use PHP, you can use str_replace to turn the spaces into Keyword LIKE '%apple%' conditions:
$sentence = "John liked to eat apples";
// Escape $sentence first (or use a prepared statement) if it comes from user input.
// Each space becomes the boundary between two LIKE conditions.
$sqlversion = str_replace(" ", "%' OR Keyword LIKE '%", $sentence);
// Add the opening and closing quotes and wildcards.
$finalsql = "'%" . $sqlversion . "%'";
echo $finalsql;
The above will echo:
'%John%' OR Keyword LIKE '%liked%' OR Keyword LIKE '%to%' OR Keyword LIKE '%eat%' OR Keyword LIKE '%apples%'
Then just combine it with your SQL statement:
$sql = "SELECT *
FROM keywords_table
WHERE Keyword LIKE " . $finalsql;
Storing comma delimited data is... less than ideal.
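If you broke up the string "John liked to eat apples" into individual words, you could use the FIND_IN_SET function (note that it compares whole list items exactly, so the space after each comma in the stored data becomes part of the item):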
WHERE FIND_IN_SET('apple', t.keywords) > 0
The performance wouldn't be great - this operation is better suited to Full Text Search.
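Breaking the sentence into words and building one FIND_IN_SET condition per word could look like the sketch below. It is written in Python as an assumption about the application language; `build_find_in_set_query` is a hypothetical helper, and the `%s` placeholders follow the parameter style used by common Python MySQL drivers such as PyMySQL:

```python
import re
from typing import List, Tuple


def build_find_in_set_query(sentence: str) -> Tuple[str, List[str]]:
    """Return a parameterized query and its parameters, one per word."""
    words = re.findall(r"\w+", sentence.lower())
    if not words:
        raise ValueError("sentence contains no words")
    # One placeholder per word; values are passed separately, not inlined.
    clause = " OR ".join(["FIND_IN_SET(%s, t.keywords) > 0"] * len(words))
    sql = f"SELECT t.* FROM keywords_table AS t WHERE {clause}"
    return sql, words


if __name__ == "__main__":
    sql, params = build_find_in_set_query("John liked to eat apples")
    print(sql)
    print(params)
```

The pair would be passed to the driver's execute call as the query and its parameters. Because the match is exact, apples in the sentence does not match the stored keyword apple; handling plurals would need stemming on the application side.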
I'm not aware of any direct solution to that type of query. But Full Text Search is a possibility. If you have a full-text index on the field of interest then a search with OR between each word in the sentence (although I think the OR operator is implied) would find that record ... but it might also find more than you want too.
I really don't think what you are looking for is completely possible but you can look into Full Text Search or SOUNDEX. SOUNDEX, for example, can do something like:
WHERE SOUNDEX(sentence) = SOUNDEX(keywords);
I have never tried it in this context but you should and let me know how it works out.

Mysql RegExp question selecting from a list of codes

I am trying to match a list of motorcycle models to a series of ebay codes for listing motorcycles in ebay.
So we get a motorcycle model name that will be something like:
XL883C Sportster where the manufacturer is Harley Davidson
I have a list of ebay codes that look like this
MB-100-0 Other
MB-100-1 883
MB-100-2 1000
MB-100-3 1130
MB-100-4 1200
MB-100-5 1340
MB-100-6 1450
MB-100-7 Dyna
MB-100-8 Electra
MB-100-9 FLHR
MB-100-10 FLHT
MB-100-11 FLSTC
MB-100-12 FLSTR
MB-100-13 FXCW
MB-100-14 FXSTB
MB-100-15 Softail
MB-100-16 Sportster
MB-100-17 Touring
MB-100-18 VRSCAW
MB-100-19 VRSCD
MB-100-20 VRSCR
So I want to match the model name against the list above using a regExp pattern.
I have tried the following code:
SELECT modelID FROM tblEbayModelCodes WHERE
LOWER(makeName) = 'harley-davidson' AND fnmodel REGEXP '[883|1000|1130|1200|1340|1450|Dyna|Electra|FLHR|FLHT|FLSTC|FLSTR|FXCW|FXSTB|Softail|Sportster|Touring|VRSCAW|VRSCD|VRSCR].*' LIMIT 1
However, when I run the query, I would expect it to match either MB-100-1 for 883 or MB-100-16 for Sportster, but it returns MB-100-0 for Other.
I am guessing that I have the pattern incorrect, so can anybody suggest what I might need to do to correct this?
Many thanks
Graham
[chars] matches any one of the characters 'c', 'h', 'a', 'r', 's'.
So with such a long list inside the brackets, the pattern matches any value that contains even a single one of those characters, which is why the first row (Other) is returned.
Try this instead:
LOWER(makeName) = 'harley-davidson' AND fnmodel REGEXP '(883|1000|1130|1200|1340|1450|Dyna|Electra|FLHR|FLHT|FLSTC|FLSTR|FXCW|FXSTB|Softail|Sportster|Touring|VRSCAW|VRSCD|VRSCR).*' LIMIT 1
You might also consider not using REGEXP and using FIND_IN_SET instead.
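The difference between the bracket expression and the parenthesized alternation can be checked quickly outside the database. The sketch below uses Python's `re` module, which is an assumption and not MySQL's regex engine, but it treats `[...]` and `(...|...)` the same way for this purpose. The patterns are shortened versions of the ones above:

```python
import re

# Bracket expression: a set of single characters ('8', '3', '|', 'D', 'y', ...).
CHAR_CLASS = r"[883|Dyna|Sportster].*"
# Group with alternation: one of the whole words.
ALTERNATION = r"(883|Dyna|Sportster).*"


def matches(pattern: str, value: str) -> bool:
    """Unanchored search, like REGEXP in the queries above."""
    return re.search(pattern, value) is not None


if __name__ == "__main__":
    for value in ["Other", "883", "Sportster"]:
        print(value, matches(CHAR_CLASS, value), matches(ALTERNATION, value))
```

"Other" matches the bracket version because it contains 't', 'e' and 'r', each of which is in the set, while the alternation version only matches values that contain one of the listed words in full.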
Not really fully tested, but it should be something like this:
REGEXP '^MB-[0-9]+-[0-9]+[[:space:]]+(883|1000|1130|1200|1340|1450|Dyna|Electra|FLHR|FLHT|FLSTC|FLSTR|FXCW|FXSTB|Softail|Sportster|Touring|VRSCAW|VRSCD|VRSCR)$'
In detail:
^MB- Starts with MB-
[0-9]+ One or more digits
- Dash
[0-9]+ One or more digits
[[:space:]]+ One or more white space
(883|1000|...)$ Ends with one of these
Here's the reference for the regexp dialect spoken by MySQL:
http://dev.mysql.com/doc/refman/5.1/en/regexp.html
Answer to comment:
If you want to match the Sportster row, then remove all other conditions. And you may not even need regular expressions:
WHERE fnmodel LIKE '% Sportster'