Unexpected behaviour in MySQL with Boolean-Mode-Query with quoted hyphenated string - mysql

I have a problem, or rather a problem of understanding, with a hyphenated search string that is quoted.
In my database there is a table with a column 'company'.
One of the entries in that column is: A-Z Electro
The following examples are greatly simplified (the real query is much more complex), but the effect is the same.
When I do the following search, I don't get the row with the above mentioned company:
SELECT i.*
FROM my_table i
WHERE MATCH (i.company) AGAINST ('+\"A-Z\" +Electro*' IN BOOLEAN MODE)
GROUP BY i.uid ORDER BY i.company ASC LIMIT 0, 40;
If I do the following search, I get the row with the above-mentioned company (notice that I only changed the + to a - before "A-Z"):
SELECT i.*
FROM my_table i
WHERE MATCH (i.company) AGAINST ('-\"A-Z\" +Electro*' IN BOOLEAN MODE)
GROUP BY i.uid ORDER BY i.company ASC LIMIT 0, 40;
I also get the row, if I remove the operator completely:
SELECT i.*
FROM my_table i
WHERE MATCH (i.company) AGAINST ('\"A-Z\" +Electro*' IN BOOLEAN MODE)
GROUP BY i.uid ORDER BY i.company ASC LIMIT 0, 40;
Can anyone explain this behaviour to me? I would expect to get the row when searching with a + as well.
I just checked the table index with myisam_ftdump.
Two-character words are indexed properly, as there are entries like
14f2e8 0.7908264 ab
3a164 0.8613265 dv
There is also an entry:
de340 0.6801047 az
I suppose this should be the entry for A-Z - so the search should find this entry, shouldn't it?

The default value of ft_min_word_len is 4. See this link for information on that. In short, your system isn't indexing words of less than 4 characters.
Why is this important? Well:
A-Z is less than 4 characters long
...therefore it's not in the index
...but your first query +"A-Z" states it must be in the index in order for the match to succeed
The other two (match if it's not in the index, match if either this or that is in the index) work because it's not in the index.
The hyphen is a red herring: the real reason is that "A-Z" is three characters long and your FT index ignores it.
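The reasoning above can be reproduced with a rough model of the indexing step. The sketch below is not MySQL's actual parser; it assumes that any character other than a letter, digit or underscore separates words, and that words shorter than a minimum length (4 here, matching the default quoted above) are dropped. The function name `indexed_words` is made up for illustration.

```python
import re

def indexed_words(text, min_word_len=4):
    """Rough model of full-text indexing: split on non-word characters,
    lowercase, and drop words shorter than min_word_len.
    This is a simplification under stated assumptions, not MySQL's parser."""
    words = re.split(r"[^0-9A-Za-z_]+", text.lower())
    return [w for w in words if len(w) >= min_word_len]

# With a minimum length of 4, nothing from "A-Z" reaches the index.
print(indexed_words("A-Z Electro"))
# With a minimum length of 1, the single letters are kept as separate words.
print(indexed_words("A-Z Electro", min_word_len=1))
```

Under this model a required term (+) that never reached the index can never match, while an excluded (-) or optional term that is missing does not prevent a match.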

Related

Mysql fulltext search didn't find exact phrase

I have a table with a column name that has a fulltext index.
The table has many rows, and some of them contain the phrase "aberlour 18" right at the start.
But when I search using the fulltext search with the exact name from the table:
SELECT * FROM `whiskybase_bottles` WHERE MATCH(`name`) AGAINST('Aberlour 18 year old')
the matching row only comes up in 6th place.
How can I improve it to find it first?
I don't want to use a "LIKE" search because fulltext works better for my needs in other cases.
I already decreased the innodb_ft_min_token_size parameter from 3 to 2 to include the age statement, and recreated the index afterwards.
If you are trying to find rows whose name includes "Aberlour 18 year old" but is not necessarily the entire name, you can use the + operator together with a quoted phrase in a boolean full-text search. Without IN BOOLEAN MODE, your query runs as a natural-language search, which scores rows on the individual words instead of requiring the exact phrase.
Your code would look something like this:
SELECT *
FROM `whiskybase_bottles`
WHERE MATCH(`name`) AGAINST('+"Aberlour 18 year old"' IN BOOLEAN MODE)
You can learn more about BOOLEAN MODE and the related operators here:
https://www.w3resource.com/mysql/mysql-full-text-search-functions.php
For a general search with a variable you can do something like this:
SELECT *
FROM `whiskybase_bottles`
WHERE MATCH(`name`) AGAINST(concat('+\"', search_phrase, '\"') IN BOOLEAN MODE)
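When the search phrase comes from application code rather than a SQL variable, the same string can be built before it is bound as a query parameter. The sketch below is one way to do that in Python; `boolean_phrase` is a hypothetical helper, and dropping embedded double quotes is an assumption made here so that user input cannot end the quoted phrase early.

```python
def boolean_phrase(search_phrase):
    """Wrap a phrase as +"..." for use with AGAINST(... IN BOOLEAN MODE)."""
    # Assumption: embedded double quotes are removed and whitespace is
    # collapsed, so the phrase stays a single quoted unit.
    cleaned = " ".join(search_phrase.replace('"', " ").split())
    return '+"' + cleaned + '"'

# The result is meant to be passed as a bound parameter, for example:
#   WHERE MATCH(`name`) AGAINST(%s IN BOOLEAN MODE)
print(boolean_phrase("Aberlour 18 year old"))  # +"Aberlour 18 year old"
```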
If you want to rank the exact phrase at the top, use this code instead:
SELECT *,MATCH (name) AGAINST( 'Aberlour 18 year old' IN BOOLEAN MODE) as score
FROM `whiskybase_bottles`
WHERE MATCH(`name`) AGAINST('Aberlour 18 year old')
ORDER BY name = 'Aberlour 18 year old' DESC, score DESC
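The two-level ordering in that query (exact name first, then relevance) can be illustrated outside the database. The sketch below emulates it on a list of (name, score) pairs; the rows and scores are made up, and comparing names case-insensitively is an assumption that mirrors a typical case-insensitive collation.

```python
def rank_exact_first(rows, phrase):
    """Emulate ORDER BY name = phrase DESC, score DESC on (name, score) pairs."""
    # Assumption: case-insensitive comparison, as with a *_ci collation.
    return sorted(
        rows,
        key=lambda row: (row[0].casefold() == phrase.casefold(), row[1]),
        reverse=True,
    )

# Made-up rows and scores, purely for illustration.
rows = [
    ("Aberlour 18 year old Sherry Cask", 9.1),
    ("Aberlour 18 year old", 7.4),
    ("Aberlour 12 year old", 5.0),
]
ranked = rank_exact_first(rows, "Aberlour 18 year old")
print(ranked[0][0])
```

The exact match sorts first even though another row has a higher score, because the equality test is the leading sort key.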

mysql ngrams indexing example

As I have read in many places, ngram indexing can improve word searches.
This old post says that it could be adapted for MySQL but does not say how: levenshtein alternative
Can anyone give an example of how to use this technique in MySQL?
Can this technique be used to improve the performance of the levenshtein function in MySQL?
My need is to find approximate text (like levenshtein).
In my tests I used the levenshtein() and levenshtein_ratio() functions from:
http://www.artfulsoftware.com/infotree/qrytip.php?id=552
SELECT *, levenshtein_ratio('stacoverflou',words_column) AS ratio
FROM my_table
ORDER BY ratio DESC
This improves performance (assuming the first letter is not misspelled):
SELECT *, levenshtein_ratio('stacoverflou',words_column) AS ratio
FROM my_table
WHERE words_column LIKE 's%'
ORDER BY ratio DESC
Also I found this php library for building ngrams:
https://gist.github.com/Xeoncross/5366393
But I have no idea how to use these ngrams in mysql
I have finally made an algorithm myself:
Generate ngrams algorithm:
1. I built a words table with 3 columns: ngrams (fulltext), word (UNIQUE), lang
2. I used the Bigrams() function to make ngrams for each word
3. I added character padding to each ngram to get past the full-text index minimum word length, e.g. 'abcd' should be 'ab bc cd' but with padding it looks like 'abxx bcxx cdxx'
Search algorithm:
4. I take the words the client typed as correct and use them to search the real table with a MySQL fulltext query using AGAINST('+word_1 +word_2 +word_n' IN BOOLEAN MODE)
5. If the score (ranking) > 0: mission accomplished, ngrams not used (show result to client)
6. If the score (ranking) = 0 (maybe misspelled words), use the ngrams words table to retrieve the correct words
Retrieve correct words from ngrams algorithm:
7. For each word, generate ngrams and perform a fulltext query using AGAINST('abxx bcxx cdxx' IN BOOLEAN MODE) on the words table (which has the ngrams column) and retrieve the correct word. Note that the ngrams have no (+) prefix here
8. Rebuild the search as in step 4
9. If score > 0: mission accomplished -> show results -> END
10. If the score is still 0, make another query, this time without the (+) word prefix and IN NATURAL LANGUAGE MODE -> show results -> END
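The control flow of the search and correction stages can be sketched as a small function. This is only an outline under assumptions: `strict_search`, `correct_word` and `loose_search` are hypothetical callables standing in for the boolean-mode query, the ngram lookup and the natural-language query, and the final loose query is assumed to use the corrected words.

```python
def staged_search(words, strict_search, correct_word, loose_search):
    """Try a strict search, then a spelling-corrected one, then a loose one.

    The three callables are placeholders for the database queries described
    above; each search returns a (possibly empty) list of results.
    """
    results = strict_search(words)        # boolean mode, every word required
    if results:
        return results                    # words were fine, ngrams not needed
    # Look up a corrected spelling per word; keep the word if none is found.
    corrected = [correct_word(w) or w for w in words]
    results = strict_search(corrected)    # same strict query, corrected words
    if results:
        return results
    return loose_search(corrected)        # last resort: no required words
```

For example, with an in-memory stand-in for the table, `staged_search(["stacoverflou"], strict, fix, loose)` falls through the first strict search, corrects the word, and succeeds on the second.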
Step 2 code:
// original from : https://gist.github.com/Xeoncross/5366393
// modified for working also with unicode characters
function Bigrams($word){
    $ngrams = array();
    $len = mb_strlen($word);
    for($i=0;$i+1<$len;$i++){
        $ngram = mb_substr($word, $i, 2);
        // pad each bigram to 4 characters so the full-text index keeps it
        while(mb_strlen($ngram) < 4){
            $ngram .= "x";
        }
        $ngrams[$i]=$ngram;
    }
    return implode(" ",$ngrams);
}
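For readers who want to try the padding scheme outside PHP, here is a minimal Python port of the same idea. The name `bigrams` and its `pad_to` and `pad_char` parameters are choices made for this sketch; the defaults mirror the 4-character 'x' padding used above.

```python
def bigrams(word, pad_to=4, pad_char="x"):
    """Return the space-separated bigrams of word, each padded to pad_to chars."""
    grams = [
        word[i:i + 2].ljust(pad_to, pad_char)  # e.g. "ab" -> "abxx"
        for i in range(len(word) - 1)          # one bigram per adjacent pair
    ]
    return " ".join(grams)

print(bigrams("abcd"))  # abxx bcxx cdxx
```

Python strings are indexed by code point, so this also works for non-ASCII words, much like the mb_* functions in the PHP version.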
Step 4 code:
SELECT my_column,
       MATCH(my_full_text_column)
       AGAINST('+word_1 +word_2 +word_n' IN BOOLEAN MODE) AS score
FROM my_table -- table name assumed; the original snippet had no FROM clause
ORDER BY score DESC
LIMIT 10
Step 7 code:
$word = "stacoverflou"; // Intentionally misspelled
$actual_word_ngrams = Bigrams($word);
// that returns
// stxx taxx acxx coxx ovxx vexx erxx rfxx flxx loxx ouxx
SELECT word,
       MATCH(ngrams)
       AGAINST('$actual_word_ngrams' IN BOOLEAN MODE) AS score
FROM words -- the words table from step 1; the original snippet had no FROM clause
ORDER BY score DESC
LIMIT 1
That returns stackoverflow, which can be used together with the other words (if there are several) for a much more accurate search, as in step 4.
END
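To see why the ngram lookup recovers the intended word, the matching can be approximated by counting shared bigrams. This is a simplification: MySQL's relevance score weighs terms differently, so the sketch below only illustrates the principle. `padded_bigrams` and `best_match` are names invented for this example, and the candidate list is made up.

```python
def padded_bigrams(word):
    """Set of bigrams of word, each padded with 'xx' as in the words table."""
    return {word[i:i + 2] + "xx" for i in range(len(word) - 1)}

def best_match(misspelled, dictionary):
    """Pick the dictionary word that shares the most bigrams with misspelled."""
    query = padded_bigrams(misspelled)
    return max(dictionary, key=lambda w: len(query & padded_bigrams(w)))

# Made-up candidate words, purely for illustration.
print(best_match("stacoverflou", ["stack", "overflow", "stackoverflow"]))
```

Here "stackoverflow" shares 9 bigrams with the misspelled input, against 6 for "overflow" and 3 for "stack".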

ORDER BY in MySql based on LIKE condition

I am having difficulty sorting results based on a field in MySQL. Say, for example, I am searching for the word "Terms": I should get the result 'Terms' first, then 'Terms and' next, then 'Terms and conditions', and so on.
Can anyone please help me fetch the search results according to these requirements efficiently with a MySQL query?
SELECT * FROM your_table WHERE your_column LIKE "Terms%" ORDER BY your_column;
Depending on the storage engine and MySQL version, you can probably use the full-text search capabilities of MySQL. For example:
SELECT *, MATCH (your_column) AGAINST ('Terms' IN BOOLEAN MODE) AS relevance
FROM your_table
WHERE MATCH (your_column) AGAINST ('Terms' IN BOOLEAN MODE)
ORDER BY relevance DESC
You can find more info here: http://dev.mysql.com/doc/refman/5.5/en/fulltext-boolean.html
Or, if you don't want FTS, another possible solution is to order strictly by the difference in string length:
SELECT * FROM your_table WHERE your_column LIKE "Terms%" ORDER BY ABS(LENGTH(your_column) - LENGTH('Terms'));
You are looking for fulltext search. Below is a very simple example:
SELECT id, name, MATCH (name) AGAINST ('>string string*' IN BOOLEAN MODE) AS score
FROM tablename WHERE MATCH (name) AGAINST ('>string string*' IN BOOLEAN MODE)
ORDER BY score DESC
The advantage of this is that you can control the weight of words. This is very basic: you can 'up' some matches or words (or 'down' them).
In my example an exact match ('string') would get a higher score than the string with something attached ('string*'). The following ordering is even one step broader:
'string' > 'string*' > '*string*'
This documentation about full-text search explains a lot. It's a long read, but worth it and complete.
Don't use a fulltext index if you are searching for a prefix string!
With LIKE "Terms%" the optimizer will be able to use an index on your_column, if one exists:
SELECT * FROM your_table
WHERE your_column LIKE "Terms%"
ORDER BY CHAR_LENGTH(your_column),your_column
Note the ORDER BY clause: it first sorts by string length, and only uses alphabetical order to sort strings of equal length.
And please use CHAR_LENGTH and not LENGTH: the former counts the number of characters, whereas the latter counts the number of bytes. With a variable-length encoding such as utf8, this makes a difference.
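The difference between counting characters and counting bytes, and the two-level sort, can be checked quickly outside the database. The sketch below uses Python's own string length and UTF-8 encoding as stand-ins for CHAR_LENGTH and LENGTH; the sample strings are made up, and Python compares by code point rather than by a MySQL collation, so ties may order differently.

```python
word = "Términos"
char_length = len(word)                  # like CHAR_LENGTH: characters
byte_length = len(word.encode("utf-8"))  # like LENGTH with utf8: bytes

def order_like_query(values):
    """Emulate ORDER BY CHAR_LENGTH(col), col on a list of strings."""
    return sorted(values, key=lambda v: (len(v), v))

print(char_length, byte_length)
print(order_like_query(["Terms and conditions", "Terms", "Terms of", "Terms and"]))
```

The accented letter takes two bytes in UTF-8, so the byte count is one higher than the character count.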

Best way to return "champion" that exists in a table in case the given query "championship" is not found

I have a very big table with strings.
Field "words":
- dog
- champion
- cat
- this is a cat
- pool
- champ
- boots
...
In my example, if a select query is looking for the given string "championship", it won't find it because this string is not in the table.
In that case, I want the query to return "champion" from the table, i.e. the longest string in the table that is a prefix of the given word "championship".
The possible match (if found) is the longest one in the table among championship, championshi, championsh, champions, ..., cham, cha, ch, or c.
Question: I want to return the longest string in the table that is a prefix of a given string.
I need high speed. Is there a way to create index and query in order to have fast execution of queries?
Here's one query that will return the specified result:
SELECT t.mycol
FROM mytable t
WHERE 'championship' LIKE CONCAT(t.mycol,'%')
ORDER
BY LENGTH(t.mycol) DESC
LIMIT 1
This query can't do an index range scan; it will have to do a full scan, but it may be able to use an index to satisfy the query.
If you can restrict the search to a finite number of leading letters that need to match to be considered a "hit", you could include another predicate. For example, to match at least 4 characters:
SELECT t.mycol
FROM mytable t
WHERE 'championship' LIKE CONCAT(t.mycol,'%')
AND t.mycol LIKE 'cham%'
ORDER
BY LENGTH(t.mycol) DESC
LIMIT 1
--or--
AND t.mycol >= 'cham'
AND t.mycol < 'chan'
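The bounds of the range form ('cham' and 'chan') can be derived from the prefix in application code. The sketch below assumes a simple code-point ordering and a prefix whose last character is not the highest possible one; real collations may order characters differently, so treat it as an illustration. `prefix_range` is a hypothetical helper.

```python
def prefix_range(prefix):
    """Return (low, high) such that low <= value < high selects the values
    starting with prefix, by bumping the last character by one code point."""
    if not prefix:
        raise ValueError("prefix must not be empty")
    high = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return prefix, high

low, high = prefix_range("cham")
print(low, high)  # cham chan
```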
You are a little vague with 'the longest string in the table that begins the given word "championship".' Would "championing" count as a match?
Perhaps the following will help. If you have an index on words, then the following will return the last word at or before the given word in sort order. It should maximize the initial sequence of matches:
select words
from t
where words <= 'championship'
order by words desc
limit 1;
This isn't exactly what you are asking for, but it might work in practice.
EDIT:
If you are looking for an exact prefix match, then the following should use an index on words effectively and return what you want:
select words
from t
where words in ('championship', 'championshi', 'championsh', 'champions', 'champion',
'champio', 'champi', 'champ', 'cham', 'cha', 'ch', 'c')
order by words desc
limit 1;
It is a bit brute force, but it should have the property of using the index to speed up the query.
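The list of prefixes does not have to be typed by hand. The sketch below generates it and builds a parameterized version of the query; the table and column names are placeholders, and the `%s` placeholder style is an assumption about the database driver (it is the style used by common Python MySQL drivers).

```python
def prefixes(word):
    """All non-empty prefixes of word, longest first."""
    return [word[:n] for n in range(len(word), 0, -1)]

def longest_prefix_query(word, table="t", column="words"):
    """Build the IN-list query and its parameters.

    table and column are placeholders supplied by trusted code, never by users.
    """
    params = prefixes(word)
    marks = ", ".join(["%s"] * len(params))  # one placeholder per prefix
    sql = (f"SELECT {column} FROM {table} "
           f"WHERE {column} IN ({marks}) "
           f"ORDER BY {column} DESC LIMIT 1")
    return sql, params

sql, params = longest_prefix_query("championship")
print(sql)
print(params)
```

Because every candidate is a prefix of the same word, sorting them in descending order puts the longest stored prefix first.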
Have a look at this article:
http://blog.fatalmind.com/2010/09/29/finding-the-best-match-with-a-top-n-query/
It explains the solution from this SO question:
How to use index efficienty in mysql query
The solution pattern looks like this:
select words
from (
select words
from yourtable
where words <= 'championship'
order by words desc
limit 1
) tmp
where 'championship' like concat (words, '%')
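The same two-step pattern (take the greatest stored value not after the search word, then keep it only if it is a prefix) can be tried on an in-memory sorted list. This is a sketch of the pattern rather than of how MySQL executes it; `best_prefix` is a name chosen here, and the word list is the one from the question.

```python
from bisect import bisect_right

def best_prefix(sorted_words, query):
    """Take the greatest stored word <= query; keep it only if it is a prefix."""
    pos = bisect_right(sorted_words, query)
    if pos == 0:
        return None                      # every stored word sorts after query
    candidate = sorted_words[pos - 1]    # inner query: ORDER BY ... DESC LIMIT 1
    return candidate if query.startswith(candidate) else None  # outer LIKE check

words = sorted(["dog", "champion", "cat", "this is a cat", "pool", "champ", "boots"])
print(best_prefix(words, "championship"))
```

One limitation worth knowing: if the table also held a word such as "championshape", that word would be the closest predecessor of "championship" without being a prefix of it, and this single-step version would return nothing even though "champion" qualifies.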

Matching first char in string to digit or non-standard character

I need to allow users to browse a table, with >1 million entries, by the first letter in the title.
I want them to be able to browse by every letter from A-Z, 0-9 in a list together and all other characters together.
Since it's a big database and it is to be displayed on a website, I need it to be efficient. Regex does not use index, so that would be too slow.
Is this possible or will I have to rethink the design?
Thanks in advance
As long as there's an index on the "Title", you should be able to use a SQL like
select *
from myTable
where Title like 'A%'
(or 'B%', 'C%'...)
Create links representing every letter and number. Clicking these links will provide the users with the results from the database that begin with the selected character.
SELECT title FROM table
WHERE LEFT(title,1) = ?Char
ORDER BY title ASC;
Consider paginating these result pages into appropriate chunks. MySQL will let you do this with LIMIT
This command will select the first 100 records from the desired character group:
SELECT title FROM table
WHERE LEFT(title,1) = ?Char
ORDER BY title ASC
LIMIT 0, 100;
This command will select the second 100 records from the desired character group:
SELECT title FROM table
WHERE LEFT(title,1) = ?Char
ORDER BY title ASC
LIMIT 100, 100;
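The two LIMIT clauses above follow a simple rule: the offset is the number of rows on all earlier pages. A small helper, hypothetical and shown only to make the arithmetic explicit, could compute the pair from a 1-based page number:

```python
def page_limit(page, page_size=100):
    """Return (offset, row_count) for LIMIT offset, row_count; pages start at 1."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return (page - 1) * page_size, page_size

print(page_limit(1))  # (0, 100)
print(page_limit(2))  # (100, 100)
```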
Per your comments, if you want to combine characters 0-9 without using regex, you will need to combine several OR conditions:
SELECT title FROM table
WHERE (
LEFT(title,1) = '0'
OR LEFT(title,1) = '1'
...
)
ORDER BY title ASC;
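The three browsing groups asked for (a letter A-Z, the digits together, everything else together) can also be decided in application code. The sketch below is one possible approach, not something the answers above propose: it assumes ASCII letters only, and `browse_bucket` is a made-up name. Its result could, for instance, be stored in an indexed column when a row is written, so that browsing becomes a plain equality lookup.

```python
import string

def browse_bucket(title):
    """Classify a title by its first character: 'A'..'Z', '0-9' or 'other'."""
    first = title[:1].upper()
    if first and first in string.ascii_uppercase:
        return first
    if first and first in string.digits:
        return "0-9"
    return "other"  # empty titles and all remaining characters

print(browse_bucket("apple"), browse_bucket("42 things"), browse_bucket("#hash"))
```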