MySQL dictionary query optimization

I have a dictionary query which I would like to optimize. The query seems to be too slow, as the result page takes quite a while to load. The query is as follows:
// Search term from the query string (empty string if missing)
$var = isset($_GET['q']) ? $_GET['q'] : '';
$varup1 = strtoupper($var);
// NOTE: addslashes() is not a safe substitute for the driver's own escaping or prepared statements
$varup = addslashes($varup1);
$query1 = "select distinct $lang from $dict WHERE
UPPER($lang) LIKE trim('$varup')
or UPPER($lang) LIKE replace('$varup',' ','')
or replace($lang,'ß','ss') LIKE trim('$varup')
or replace($lang,'ss','ß') LIKE trim('$varup')
or replace($lang,'ence','ance') LIKE trim('$varup')
or replace($lang,'ance','ence') LIKE trim('$varup')
or UPPER($lang) like trim(trailing 'LY' from '$varup')
or UPPER($lang) like trim(trailing 'Y' from '$varup')
or UPPER($lang) like trim(trailing 'MENTE' from '$varup')
or UPPER($lang) like trim(trailing 'EMENT' from '$varup')
or UPPER($lang) like trim(trailing 'IN' from '$varup')";
The goal is for a search string to also find alternative spellings of the same word, or the adverb form of an adjective.
The table looks like this: [two table screenshots, not preserved]
For instance, "flawlessly" should also display "flawless". "Fully" should also find "full" and vice versa.
"Feliz" should also find the entries for "Felizmente".
There are around twenty substitutions like the above, which I left out as they do not make the question easier to understand.
The whole code is quite long and I wonder if I can make it smaller without losing functionality. Any ideas?

Where is the FROM clause in the query?
The REPLACE calls could be chained: REPLACE(REPLACE(..., 'a', 'b'), 'c', 'd'). Ditto for the TRIM calls.
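As a rough illustration of the chaining idea, here is a small Python sketch that builds the nested REPLACE(...) expression from a list of substitution pairs. The helper name chain_replace and the column name english are only assumptions for the example. Note that a chained expression applies all substitutions at once, which is not quite the same as OR-ing separate single-substitution tests.

```python
def chain_replace(column, pairs):
    """Wrap `column` in one nested REPLACE() call per (old, new) pair."""
    expr = column
    for old, new in pairs:
        # Double any single quotes so the literals stay valid SQL strings.
        old_sql = old.replace("'", "''")
        new_sql = new.replace("'", "''")
        expr = f"REPLACE({expr}, '{old_sql}', '{new_sql}')"
    return expr


# Hypothetical column name; the pairs come from the question.
expr = chain_replace("english", [("ß", "ss"), ("ence", "ance")])
print(expr)
```

The substitution list can then live in one place in the application instead of being spread across many OR branches.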
As already mentioned, a suitable COLLATION eliminates all need for UPPER() and LOWER(). Avoid the ...general... collations, and you get this for free: ss=ß. Many, but not all, collations also treat ij=ĳ and/or oe=œ and/or Aa=Å (etc.); do you need those, too? Here is a rundown of most situations: http://mysql.rjweb.org/utf8_collations.html
Using a FULLTEXT index will take care of most of the endings you are testing for, thereby obviating most of your code.
You show multiple words in the second column. Is this simply for display? If you need to pick apart the words, then you have other nasty challenges.
This, alone, will speed up the query something like 10-fold:
WHERE english LIKE 'ha%'
AND ... (whatever else you have)
That is, filter on the first 2 letters with something that can use INDEX(english), specifically LIKE 'ha%' for the word hate. Since you seem to be using PHP, there should be no difficulty building this into the query.
Here's another thought on my substring($word, 0, 2)... Instead of specifically using "2", see if floor(strlen($word)/2) will work well enough. So, 'flawlessly' would be tested with LIKE 'flawl%' and run a lot faster than even 10-fold.
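A minimal sketch of that prefix idea, written in Python rather than PHP. The function name like_prefix, the table name dict and the %s placeholder style are assumptions for illustration; the half-length rule and the two-letter minimum are the ones suggested above.

```python
def like_prefix(word, min_len=2):
    """Return a LIKE pattern covering roughly the first half of `word`."""
    n = max(min_len, len(word) // 2)
    prefix = word[:n]
    # Escape LIKE wildcards so user input cannot widen the match.
    for ch in ("\\", "%", "_"):
        prefix = prefix.replace(ch, "\\" + ch)
    return prefix + "%"


# The pattern is passed as a bound parameter, not pasted into the SQL text.
sql = "SELECT DISTINCT english FROM dict WHERE english LIKE %s"
params = (like_prefix("flawlessly"),)
print(sql, params)
```

The remaining, more expensive conditions can then be AND-ed onto this filter, so they only run against rows that survive the indexed prefix test.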
But, another issue. Are you chopping both the word in the table and the word given? Try to avoid chopping the word in the table. To discuss this further, please provide the table entries for 'flaw', 'flaws', 'flawless', 'flawlessly', etc. I can't quite tell if you need to get all the way down to 'flaw', or whether you have separate rows for the various forms.
Beware of some very short words with odd forms. Perhaps you need to add extra entries to avoid making the SQL query too messy. These change the second letter: "LIE" and "LYING". Seems like there is even a common word that changes the first letter.

Related

How do I create a SELECT conditional in MySQL where the conditional is the character length of the LIKE match?

I am working on a search function, where the matches are weighted based on certain conditions. One of the conditions I want to add weight to is matches where the character length of the query string in a LIKE match is longer than 4.
This is roughly what I want the query to look like. %s is meant to represent the actual match found by LIKE, but I don't think it does. I'm wondering if there is a special variable in MySQL that represents the precise character match found by LIKE.
SELECT help.*,
IF(CHAR_LENGTH(%s) > 4, 2, 0) w
FROM help
WHERE (
(title LIKE '%this%' OR title LIKE '%testy%' OR title LIKE '%test%') OR
(content LIKE '%this%' OR content LIKE '%testy%' OR content LIKE '%test%')
) LIMIT 1000
edit: In the PHP, I could split the array of search strings into two arrays based on the character length of the elements, run two separate queries that return different values for 'w', then combine the results. I'd rather not do that, as it seems awkward, messy, and slow.
Check out FULLTEXT as another way to discover rows. It will be faster, but won't address your question.
This probably has the effect you want.
SELECT ....
IF ( (title LIKE '%testy%' OR
content LIKE '%testy%'), 2, 0)
....
Note that the "match" in your LIKEs includes the %, so it is the entire length of the string. I don't think that is what you wanted.
REGEXP "(this|testy|that)" will match either 4 or 5 characters (in this example). It may be possible to do something with REGEXP_REPLACE to replace that with the empty string, then see how much it shrank.
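To make the "see how much it shrank" idea concrete, here is a small Python sketch using the re module instead of MySQL's REGEXP_REPLACE. The function names are made up for the example, and the weighting rule (2 if a matched token is longer than 4 characters, otherwise 0) is taken from the question.

```python
import re


def match_lengths(text, tokens):
    """Return the length of every token occurrence found in `text`."""
    # Longest tokens first, so 'testy' wins over its prefix 'test'.
    ordered = sorted(tokens, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered), re.IGNORECASE)
    return [len(m.group(0)) for m in pattern.finditer(text)]


def shrinkage(text, tokens):
    """Number of characters removed if every token occurrence is deleted."""
    return sum(match_lengths(text, tokens))


def weight(text, tokens):
    """2 if any matched token is longer than 4 characters, else 0."""
    return 2 if any(n > 4 for n in match_lengths(text, tokens)) else 0


print(weight("this is a testy test", ["this", "testy", "test"]))
```

Whether the same arithmetic is practical inside a single SQL statement depends on the server version and on how many tokens there are.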
I think the answer to my question is that what I wanted to do isn't possible. There is no special variable in MySQL representing the core character match in a WHERE conditional where LIKE is the operator. The match is the contents of the returned data row.
What I did to reach my objective was to take the original dynamic list of search tokens, iterate through that list, and perform a search on each token, with the SQL tailored to the conditions that matched each token.
As I did this I built an array of the search results, using the id for the database row as the index for the array. This allowed me to perform calculations with the array elements, while avoiding duplicates.
I'm not posting the PHP code because the original question was about the SQL.
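For readers who want to see the shape of that token-by-token approach, here is a hedged Python sketch, not the original PHP. The rows are in-memory stand-ins for query results, a substring test stands in for LIKE '%token%', and the scoring (1 point per matching token, plus 2 if the token is longer than 4 characters) is an assumption made for the example.

```python
def search(rows, tokens):
    """Score rows token by token, keyed by row id to avoid duplicates."""
    scores = {}
    for token in tokens:
        needle = token.lower()
        # Assumed scoring: 1 per matching token, plus 2 if it is longer than 4.
        points = 1 + (2 if len(token) > 4 else 0)
        for row in rows:
            haystack = (row["title"] + " " + row["content"]).lower()
            if needle in haystack:  # stands in for LIKE '%token%'
                scores[row["id"]] = scores.get(row["id"], 0) + points
    return scores


# Hypothetical sample data.
rows = [
    {"id": 1, "title": "This old thing", "content": "nothing testy here"},
    {"id": 2, "title": "Test page", "content": "a test"},
    {"id": 3, "title": "Unrelated", "content": "nope"},
]
ranked = sorted(search(rows, ["this", "testy", "test"]).items(),
                key=lambda item: item[1], reverse=True)
print(ranked)
```

Keying the dictionary by row id is what keeps a row that matches several tokens from appearing more than once.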

Inefficient LIKE query

I have an online search box that needs to look across many MySQL columns for a match. And it needs to handle a multi-keyword search.
Use cases:
I search for DP/101/R/23 (rego no)
I search for Johnty Winebottom (owner)
I search for Le Mans 1969 (mixed, history related keywords)
I get a lot of special chars, so fulltext doesn't always work. So I'm splitting the keyword input apart on spaces, then looping through and doing LIKE queries.
Simplified query that gets the point across (I've removed many columns):
SELECT   `cars`.`id`,
         `cars`.`car_id`,
         `cars`.`date_of_build`,
…..
FROM     (`cars`)
WHERE    (
                  `chassis_no` LIKE "DP/101/R/23"
         OR       `chassis_no` LIKE "DP/101/R/23 %"
         OR       `chassis_no` LIKE "% DP/101/R/23"
         OR       `chassis_no` LIKE "% DP/101/R/23 %"
         OR       `history` LIKE "DP/101/R/23"
         OR       `history` LIKE "DP/101/R/23 %"
         OR       `history` LIKE "% DP/101/R/23"
         OR       `history` LIKE "% DP/101/R/23 %"
….
In this case (rego no) it's exact so matches the LIKE without spaces on either side.
This works.. but is slow and feels wrong. Is there another way to do this that's more efficient?
EDIT: Using REGEXP appears to work and is actually a little faster:
`chassis_no` REGEXP "([ ]*)DP/101/R/23([ ]*)"
I'm not sure of a better way since fulltext fails on many of the special characters in my data.
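One thing worth checking: because [ ]* also matches zero spaces, the pattern in the edit accepts the keyword anywhere inside a longer string, which is looser than the four LIKE variants. The sketch below uses Python's re module (not MySQL) to compare it with a pattern that requires a space or the string boundary on each side. The helper name token_regex is an assumption, and the escaping rules for MySQL's REGEXP may differ from Python's.

```python
import re


def token_regex(keyword):
    """Match `keyword` only as a whole space-delimited token."""
    return r"(^| )" + re.escape(keyword) + r"( |$)"


loose = r"([ ]*)DP/101/R/23([ ]*)"   # pattern from the edit above
strict = token_regex("DP/101/R/23")

for value in ["DP/101/R/23", "raced DP/101/R/23 in 1969", "XDP/101/R/234"]:
    print(value, bool(re.search(loose, value)), bool(re.search(strict, value)))
```

The strict form covers the exact, leading, trailing and middle cases with a single test per column.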

Performance of LIKE 'xyz%' v/s LIKE '%xyz'

I was wondering how the LIKE operator actually works.
Does it simply start from the first character of the string and try matching the pattern, moving one character to the right at a time? Or does it look at the placement of the %, i.e. if it finds the % to be the first character of the pattern, does it start from the rightmost character and match from there, moving one character to the left on each successful match?
Not that I have any use case in my mind right now, just curious.
edit: made question narrow
If there is an index on the column, putting constant characters at the front of the pattern lets your DBMS use a more efficient seek instead of a full scan. But even in the simplest form, the DBMS has to test characters. If it can tell early on that a value doesn't match, it can discard it and move on to the next one.
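The difference can be pictured with a sorted list standing in for an index. This Python sketch is only an analogy and says nothing about how any particular DBMS is implemented: a known prefix lets a binary search jump to the first candidate and stop at the first non-match, while a known suffix forces a look at every value. The comparisons here are case-sensitive, unlike many database collations.

```python
from bisect import bisect_left


def prefix_matches(sorted_names, prefix):
    """Like LIKE 'prefix%' on an indexed column: seek, then read a short range."""
    start = bisect_left(sorted_names, prefix)
    found = []
    for name in sorted_names[start:]:
        if not name.startswith(prefix):
            break  # sorted order guarantees no later value can match
        found.append(name)
    return found


def suffix_matches(names, suffix):
    """Like LIKE '%suffix': the sort order does not help, so scan everything."""
    return [name for name in names if name.endswith(suffix)]


names = sorted(["Mickey Mouse", "Mickey Rooney", "Minnie Mouse", "Donald Duck"])
print(prefix_matches(names, "Mickey"))
print(suffix_matches(names, "Mouse"))
```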
The LIKE search condition uses wildcards to search for patterns within a string. For example:
WHERE name LIKE 'Mickey%'
will locate all values that begin with 'Mickey', optionally followed by any number of characters. With a case-insensitive, accent-insensitive collation the comparison ignores case and accents, and you can use multiple % wildcards, for example
WHERE name LIKE '%mouse%'
will return all values with 'mouse' (or 'Mouse' or 'mousé') in it.
The % also matches zero characters, meaning that
WHERE name like '%A%'
will return all values that start with 'A', contain 'A', or end with 'A'.
You can use _ (underscore) for any character on a single position:
WHERE name LIKE '_at%'
will give you all values with 'a' as the second letter and 't' as the third. The first letter can be anything. For example: 'Batman'
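The % and _ rules so far can be summed up by translating a LIKE pattern into a regular expression. This is a simplified Python sketch under stated assumptions: it ignores escape characters and collations, always compares case-insensitively, and the function names are invented for the example.

```python
import re


def like_to_regex(pattern):
    """Translate a LIKE pattern (only % and _) into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # any run of characters, including none
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE | re.DOTALL)


def like(value, pattern):
    """True if the whole `value` matches the LIKE `pattern`."""
    return like_to_regex(pattern).fullmatch(value) is not None


print(like("Mickey Mouse", "Mickey%"), like("Batman", "_at%"), like("at", "_at%"))
```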
In T-SQL, if you use [] you can find values in a range.
WHERE name LIKE '[c-f]%'
It will find any value beginning with a letter between c and f, inclusive, that is, any value that starts with c, d, e or f. The [] syntax is T-SQL only. Use [^ ] to find values not in a range.
Finding all values that contain a number:
WHERE name LIKE '%[0-9]%'
returns everything that has a number in it. Example: 'Godfather2'
If you are looking for all values with the 3rd position to be a '-' (dash) use two underscores:
WHERE name LIKE '__-%'
It will return for example: 'Lo-Res'
To find values whose names end in 'xyz', use:
WHERE name LIKE '%xyz'
which returns anything that ends with 'xyz'.
To find a % sign in a name, use brackets (T-SQL):
WHERE name LIKE '%[%]%'
will return, for example: 'Top%Movies'
To search for [, put brackets around it:
WHERE name LIKE '%[[]%'
gives results such as: 'New York [NY]'
The database collation determines both case sensitivity and the sort order for the range of characters. You can optionally use COLLATE to specify the collation used by the LIKE operator.
Usually the main performance bottleneck is I/O. The efficiency of the LIKE operator itself only matters if your whole table fits in memory; otherwise I/O will take most of the time.
AFAIK Oracle can use indexes for prefix matching (LIKE 'abc%'), but these indexes cannot be used for more complex expressions.
Anyway, if you only have this kind of query, you should consider a simple index on the relevant column. (This is probably true for other RDBMSs as well.)
Otherwise the LIKE operator is generally slow, but most RDBMSs have some kind of full-text search solution. I think the main reason for the slowness is that LIKE is too general. Full-text indexes usually have lots of options that tell the database what you really want to search for, and with this additional information the DB can do its job more efficiently.
As a rule of thumb, if you want to search in a text field and you think performance may be an issue, consider your RDBMS's full-text search solution. If the real goal is not text searching but the search is a kind of "design side effect" (for example XML/JSON/statuses stored in a field as text), then you should probably consider a more efficient way to store the data (if there is one).

Complex mysql search query, 'boolean and' search of keyword on mysql string

I need to implement some search parameters in my application and could not come up with a good solution, so I hope you can help. My problem is as follows.
I have a table with the following columns:
id, question
I need to search by keyword, with criteria such as:
If I search for the keyword "heart disease", the returned questions should contain both words, "heart" and "disease".
A sentence like "We have a heartly disease" is returned because "heartly" starts with "heart", but a sentence like "We have a fooheart disease" is not returned, because "foo" comes before "heart", which is not acceptable under the given criteria. Anything following "heart" or "disease" is acceptable.
Those are the criteria I was given. I know my English isn't great and I may not have explained the problem properly, but I do hope for a solution. Thanks!
You probably would be better off with a full text search engine like Lucene, but you can do it in mysql. You would just have to build up the search criteria based on the number of words. For many cases, this would be an incredibly inefficient query.
Something like
select * from terms
where terms like '% heart%' and terms like '% disease%'
should work.
Note that this isn't necessarily the full solution. The following value wouldn't be returned, because there would be no space before diseases or heart.
Diseases are bad.Hearts are very susceptible.
The problem, of course, is that you are going to have to start building up a lot of special cases. To address the comments, and the example I showed, you would have to add in rules like:
select * from terms
where (terms like '% heart%' or terms like 'heart%' or terms like '%.Heart%')
and (terms like '% disease%' or terms like 'disease%' or terms like '%.disease%')
You could also do this with a regular expression. This would handle the cases that you've brought up.
select * from terms
where (terms like 'heart%' or terms REGEXP '[ .]heart')
and (terms like 'disease%' or terms REGEXP '[ .]disease')
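The special cases above all amount to "the keyword must start at a word boundary". As a sketch of that rule outside the database, here is a Python version using the re module's \b anchor. The function name matches_all is an assumption, and MySQL's own word-boundary syntax depends on the server version, so check the manual before porting the pattern.

```python
import re


def matches_all(text, keywords):
    """True if every keyword occurs in `text` at the start of a word."""
    return all(
        re.search(r"\b" + re.escape(keyword), text, re.IGNORECASE)
        for keyword in keywords
    )


keywords = ["heart", "disease"]
for sentence in [
    "We have a heartly disease",
    "We have a fooheart disease",
    "Diseases are bad.Hearts are very susceptible.",
]:
    print(sentence, "->", matches_all(sentence, keywords))
```

Because the boundary test does not care which non-word character precedes the keyword, there is no need to list spaces, periods and string starts separately.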

Regexp MySql- Only strings containing two words

I have a table with rows of strings.
I'd like to search for those strings that consist of only two words.
I tried a few ways with [[:space:]] etc., but MySQL was also returning three- and four-word strings.
Try this:
select * from yourTable WHERE field REGEXP('^[[:alnum:]]+[[:blank:]]+[[:alnum:]]+$');
More details here:
http://dev.mysql.com/doc/refman/5.1/en/regexp.html
^\w+\s\w+$ should do well.
Note: something I have seen a lot lately is that almost nobody uses the ^ and $ anchors.
They are absolutely needed if you want to test whether a string starts or ends with something, or want to match the whole string exactly, word for word, as you do. Unanchored patterns like the one you used (I assume something like \w[:space]\w) match anywhere in the string, which means they are also true whenever the condition holds somewhere within a longer string!
Keep that in mind and Regex will serve you well :)
REGEXP ('^[a-z0-9]+[[:space:]][a-z0-9]+$')
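The effect of anchoring is easy to try out in Python. This is a sketch with the re module, not MySQL, and the names TWO_WORDS and is_two_words are assumptions. fullmatch plays the role of ^...$, while an unanchored search shows why three-word strings slipped through.

```python
import re

# One or more alphanumerics, whitespace, one or more alphanumerics.
TWO_WORDS = re.compile(r"[A-Za-z0-9]+[ \t]+[A-Za-z0-9]+")


def is_two_words(value):
    """True only if the whole string is exactly two words."""
    return TWO_WORDS.fullmatch(value) is not None


def contains_two_words(value):
    """Unanchored version: true if two adjacent words appear anywhere."""
    return TWO_WORDS.search(value) is not None


for value in ["hello world", "one two three", "single"]:
    print(value, is_two_words(value), contains_two_words(value))
```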