Sphinx match by first letter - mysql

I need a simple explanation of why my queries fail to return the results I need.
Sphinx 2.0.8-id64-release (r3831)
Here is what I have in sphinx.conf:
SELECT
trackid,
title,
artistname,
SUBSTRING(REPLACE(TRIM(`artist_name`), 'the ', ''),1,3) AS artistname_init
....
sql_field_string = title
sql_field_string = artistname
sql_field_string = artistname_init
Additional settings:
docinfo = extern
charset_type = utf-8
min_prefix_len = 1
enable_star = 1
expand_keywords= 0
charset_table = U+FF10..U+FF19->0..9, 0..9, U+FF41..U+FF5A->a..z, U+FF21..U+FF3A->a..z, A..Z->a..z, a..z
The query works and I can index my data without problems. However, I am failing to get Sphinx to return any sensible results. I am using SphinxQL to query.
Example:
select
artistname, artistname_init from myindex
WHERE MATCH('@artistname_init ^t*')
GROUP BY artistname ORDER BY artistname_init ASC limit 0,10;
returns nothing related to the query.
I've tried everything I could think of, such as:
MATCH('@artistname_init ^t*')
MATCH('@artistname_init[1] t')
MATCH('@artistname_init ^t$')
Can anyone please point out where my mistake is and perhaps give me a query that will work for my case?
My target is to get results that follow this sorting order:
B (Single letter)
B-T (Single letter + non-alphabet sign after)
B as Blue (Single letter + space after)
Baccara (First letter of single word)
Bad Religion (First letter of several words)
The B (not counting "The ")
The B.Y.Z (Single letter + non-alphabet sign after not counting "The ")
The B 2 B (Single letter + space after not counting "The ")
The Boyzz (First letter of single word not counting "The ")
The Blue Boy (First letter of several words not counting "The ")
Or close to it.

There are a lot of moving parts in what you're trying to do, but I can at least answer the title portion of it. Sphinx offers field-level ranking factors to let you customize the WEIGHT() function – it should be much easier to order the matches the way you want, rather than trying to actually filter out entries that matched the query later than the 1st or 2nd word.
Here's an example, which will return all results with a word starting with "b", sorted by how early that word appears:
SELECT id, artistname, WEIGHT()
FROM myindex
WHERE MATCH('(@artistname (b*))')
ORDER BY WEIGHT() DESC
LIMIT 10
OPTION ranker=expr('sum(100 - min_hit_pos)');
If you want to filter out other cases like "Several other words then B", I think I'd suggest doing that in your application. For example, if the fourth result has the keyword in the 3rd word, only return the first 3 results. That, or actually create a new field in Sphinx without the leading "The", and then add a numeric attribute to the index to show that a word was removed (you can use numeric attributes in your ranker expressions).
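To make the ranker expression above a bit more concrete, here is a rough Python sketch of what sum(100 - min_hit_pos) computes for a single field, plus the application-side cut-off. It is only an illustration: the function names and sample artists are made up, and the whitespace tokenization is an assumption (Sphinx tokenizes according to charset_table). min_hit_pos is treated as the 1-based word position of the first matching keyword.

```python
def min_hit_pos(name, prefix):
    """Return the 1-based position of the first word starting with prefix, or None."""
    words = name.lower().split()  # assumption: plain whitespace tokenization
    for pos, word in enumerate(words, start=1):
        if word.startswith(prefix.lower()):
            return pos
    return None


def rank(names, prefix):
    """Score each matching name as 100 - min_hit_pos and sort best first."""
    scored = []
    for name in names:
        pos = min_hit_pos(name, prefix)
        if pos is not None:
            scored.append((100 - pos, name))
    # Stable sort: names with equal scores keep their input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Hypothetical sample data, not taken from the real index.
artists = ["The Blue Boy", "Baccara", "Jim and the Band", "Bad Religion", "Zorro"]
ranked = rank(artists, "b")

# Application-side filter: keep only hits in the first or second word.
early_hits = [name for score, name in ranked if score >= 98]
```

Names where the keyword shows up late ("Jim and the Band") still match but get a lower score, so the application can drop them after the fact.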
As for ranking "B-t" more highly than "Bat", I'm not sure if that's possible without somehow changing Sphinx's concept of alphabetical order. You could try diving into the source code? ;)
One last note. For this particular kind of query, MySQL (I say MySQL because it's the common way of sourcing a Sphinx index) may actually work just as well. If you strip the leading "The", a B-tree index (which MySQL uses) is a perfectly good way of searching if you're sure you only want results where the query matches the beginning of the field. Sphinx's inverted indexes are kind of overkill for that sort of thing.

Related

SQL Server 2008 - fulltext search not stopping on stop words

I've created a stoplist based on the system's list and I set up my fulltext indexes to use it.
If I run the code select unique_index_id, stoplist_id from sys.fulltext_indexes I can see that all my indexes are using the stoplist with ID 5 which is the one I have created.
When I run the text using FTS_PARTIAL, the result comes out correct.
example:
SELECT special_term, display_term
FROM sys.dm_fts_parser
(' "Rua José do Patrocinio nº125, Vila América, Santo André - SP" ', 1046, 5, 0)
The words that I added to the stoplist are shown as noise words. But for some reason, when I run my query, it returns the record containing the stopwords too.
For example:
SELECT *
FROM tbEndereco
WHERE CONTAINS (*, '"rua*" or "jose*"')
This returns the record above, as I would expect, since the word 'rua' should be ignored but 'Jose' would be a match.
But if I searched:
SELECT *
FROM tbEndereco
WHERE CONTAINS (*, '"rua*"')
I would expect no records to be found, since 'rua' is set to be a stopword.
I'm using Brazilian (Portuguese) as the stoplist language.
So the word "Rua" (which means "Street") should be ignored, as I added it to the stoplist. It is recognized as noise by the parser, but when I run my query it returns records that contain "Rua".
My search is an address search, so it should ignore words such as "Street", "Avenue", etc. (in Portuguese, of course; I added them all as well).
This is the query that I'm using to look up the tables.
select DISTINCT(PES.idPessoa)
, PES.Nome
, EN.idEndereco
, EN.idUF
, CID.Nome as Cidade
, EN.Bairro
, EN.Logradouro
, EN.Numero
, EN.Complemento
, EN.CEP
, EN.Lat
, EN.Lng
from tbPessoa PES
INNER JOIN tbAdvogado ADV ON PES.idPessoa = ADV.idPessoa
INNER JOIN tbEndereco EN ON PES.idEmpresa = EN.idEmpresa
LEFT JOIN tbCidade CID ON CID.idCidade = EN.idCidade
where adv.Ativo = 1
and CONTAINS (en.*, '"rua*"')
OR EN.idCidade IN (SELECT idCidade
FROM tbCidade
WHERE CONTAINS (*, '"rua*"'))
OR PES.idPessoa IN (SELECT DISTINCT (ADVC.idPessoa)
FROM tbComarca C
INNER JOIN tbAdvogadoComarca ADVC
ON ADVC.idComarca = C.idComarca
WHERE CONTAINS (Nome, '"rua*"'))
OR PES.idPessoa IN (SELECT OAB.idPessoa
FROM tbAdvogadoOAB OAB
WHERE CONTAINS (NROAB, '"rua*"'))
I tried both FREETEXT and CONTAINS, using something simpler like WHERE CONTAINS (NROAB, 'rua'), but it also returned the records containing "Rua".
I thought my query might have a problem, so I tried a simpler query, and it also matched the stopword "Rua".
SELECT *
FROM tbEndereco
WHERE CONTAINS (*, 'rua')
One thing I noticed is that the words that come from the system stoplist work just fine. For example, if I try the word "do" (which means "of"), it does not return any records.
Example:
SELECT *
FROM tbEndereco
WHERE CONTAINS (*, '"do*"')
I tried running "Start full population" through SSMS on all tables to check whether that was the problem, but nothing changed.
What am I missing here? This is the first time I have worked with full-text indexes, and I may be missing some step in setting them up.
Thank you in advance for your support.
Regards,
Cesar.
You have changed your question so I will change my answer and try to explain it a little better.
According to Stopwords and Stoplists:
A stopword can be a word with meaning in a specific language, or it
can be a token that does not have linguistic meaning. For example, in
the English language, words such as "a," "and," "is," and "the" are
left out of the full-text index since they are known to be useless to
a search.
Although it ignores the inclusion of stopwords, the full-text index
does take into account their position. For example, consider the
phrase, "Instructions are applicable to these Adventure Works Cycles
models". The following table depicts the position of the words in the
phrase:
I am not sure why, but I think it only applies when using a phrase search. For example, if you have a line like this:
Teste anything casa
And you query the fulltext as:
SELECT *
FROM Address
WHERE CONTAINS (*, '"teste rua casa"')
The line:
Teste anything casa
Will be returned. In that case, the fulltext will translate your query as something like this:
"Search for 'teste' near any word near 'casa'"
When you query the fulltext using the "or" operator or only search for one word the rule does not apply. I have tested it several times for about 3 months and I never understood why.
EDIT
If you have the line
"Rua José do Patrocinio nº125"
and you query the full-text index with
WHERE CONTAINS (*, '"RUA" or "Jose*" or "do*"')
it will return the line because it DOES contain at least one of the words you are searching for, not because the words "rua" and "do" are being ignored.

MySQL - FULLTEXT in BOOLEAN mode + Relevance using views field

I have the following table:
CREATE TABLE IF NOT EXISTS `search`
(
`id` BIGINT(16) NOT NULL AUTO_INCREMENT PRIMARY KEY,
`string` TEXT NOT NULL,
`views` BIGINT(16) NOT NULL,
FULLTEXT(string)
) ENGINE=MyISAM;
It has a total of 5,395,939 entries. To perform a search on a term (like 'a'), I use the query:
SELECT * FROM `search` WHERE MATCH(string) AGAINST('+a*' IN BOOLEAN MODE) ORDER BY `views` DESC LIMIT 10
But it's really slow =(. The query above took 15.4423 seconds to perform. Obviously, it's fast without sorting by views, which takes less than 0.002s.
I'm using ft_min_word_len=1 and ft_stopword_file=
Is there any way to use the views as the relevance in the fulltext search, without making it too slow? I want the search term "a b" to match "big apple", for example, but not "ibg apple" (I just need the search prefixes to match).
Thanks
Since no one answered my question, I'm posting my solution (not the one I would expect to see if I was googling, since it isn't so easy to apply as a simple database-design would be, but it's still a solution to this problem).
I couldn't really solve it with any engine or function used by MySQL. Sorry =/.
So, I decided to develop my own software to do it (in C++, but you can apply it in any other language).
If what you are looking for is a method to search for prefixes of words in small strings (the average length of my strings is 15), you can use the following algorithm:
1. Create a trie. Each word of each string is put on the trie.
Each leaf has a list of the ids that match that word.
2. Use a map/dictionary (or an array) to store the information
for each id (map[id] = information).
Searching for a string:
Note: The string will be in the format "word1 word2 word3...". If it has some symbols, like @, #, $, you might consider them as " " (spaces).
Example: "Rafael Perrella"
1. Search for the prefix "Rafael" in the trie. Put all the ids you
get in a set (a Binary-Search Tree that ignores repeated values).
Let's call this set "mainSet".
2. Search for the prefix "Perrella" in the trie. For each result,
put them in a second set (secSet) if and only if they are already
in the mainSet. Then, clear mainSet and do mainSet = secSet.
3. If there are still words left to search, repeat the second step
for all those words.
After these steps, you will have a set with all the results. Make a vector using a pair for the (views, id) and sort the vector in descending order. So, just get the results you want... I've limited to 30 results.
Note: you can sort the words first to remove those with the same prefix (for example, in "Jan Ja Jan Ra" you only need "Jan Ra"). I will not explain about it since the algorithm is pretty obvious.
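For anyone who wants to see the steps above in code, here is a minimal Python sketch (the original was written in C++, so this is a re-creation under assumptions, not the author's program). The class and function names and the sample records are hypothetical. Ids are stored on the node where a word ends, and a prefix lookup collects the ids from the whole subtree below the prefix.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # next character -> TrieNode
        self.ids = set()    # ids of records containing a word that ends here


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, record_id):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.ids.add(record_id)

    def ids_with_prefix(self, prefix):
        """Collect the ids of every word that starts with prefix."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return set()
        found = set()
        stack = [node]
        while stack:  # walk the whole subtree under the prefix
            current = stack.pop()
            found |= current.ids
            stack.extend(current.children.values())
        return found


def search(trie, views, query, limit=30):
    """Intersect the id sets of every query prefix, then sort by views."""
    words = query.lower().split()
    if not words:
        return []
    main_set = trie.ids_with_prefix(words[0])
    for word in words[1:]:
        main_set &= trie.ids_with_prefix(word)  # keep ids present in both sets
    ranked = sorted(((views[i], i) for i in main_set), reverse=True)
    return [record_id for _, record_id in ranked[:limit]]


# Hypothetical data: id -> (string, views)
records = {1: ("big apple", 50), 2: ("ibg apple", 90),
           3: ("apple butter", 70), 4: ("banana", 10)}
trie = Trie()
views = {}
for record_id, (text, view_count) in records.items():
    views[record_id] = view_count
    for word in text.lower().split():
        trie.insert(word, record_id)

result = search(trie, views, "a b")
```

With this data, "a b" matches "apple butter" and "big apple" (most viewed first) but not "ibg apple", which is the prefix behaviour asked for in the question.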
This algorithm may be bad sometimes (for example, if I search for "a b c d e f ... z", I will search the entire trie...). So, I made an improvement.
1. For each "id" in your map, create also a small trie, that will
contain the words of the string (include a trie for each m[id]...
m[id].trie?).
Then, to make a search:
1. Choose the longest word in the search string (it's not guaranteed,
but it is probably the word with the fewest results in the trie...).
2. Apply the step 1 of the old algorithm.
3. Make a vector with the ids in the mainSet.
4. Let's make the final vector. For each id in the vector you've created
in step 3, search in the trie of this id (m[id].trie?) for all words
in the search string. If it includes all words, it's a valid id and
you might include it in the final vector; else, just ignore this id.
5. Repeat step 4 until there are no more ids to verify. After that, just
sort the final vector for <views, id>.
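The improved search can be sketched roughly like this in Python. To keep it short, both tries are replaced by simpler stand-ins, which is an assumption of this sketch and not what the original program does: a sorted list with binary search plays the role of the main trie, and a plain word list per id plays the role of m[id].trie. All names and sample records are hypothetical.

```python
from bisect import bisect_left


def build_index(records):
    """records: {id: (text, views)} -> sorted list of (word, id) pairs."""
    return sorted((word, rid)
                  for rid, (text, _) in records.items()
                  for word in set(text.lower().split()))


def ids_with_prefix(index, prefix):
    """Stand-in for the main trie: range scan over the sorted word list."""
    start = bisect_left(index, (prefix,))
    found = set()
    for word, rid in index[start:]:
        if not word.startswith(prefix):
            break  # words are sorted, so nothing later can match
        found.add(rid)
    return found


def search(records, index, query, limit=30):
    words = query.lower().split()
    if not words:
        return []
    # Step 1: the longest word is probably the most selective one.
    longest = max(words, key=len)
    # Steps 2-3: candidate ids from that single prefix lookup.
    candidates = ids_with_prefix(index, longest)
    # Steps 4-5: verify every query word against each candidate's own words.
    final = []
    for rid in candidates:
        text, views = records[rid]
        record_words = text.lower().split()  # stand-in for m[id].trie
        if all(any(w.startswith(q) for w in record_words) for q in words):
            final.append((views, rid))
    final.sort(reverse=True)
    return [rid for _, rid in final[:limit]]


# Hypothetical data: id -> (string, views)
records = {1: ("big apple", 50), 2: ("ibg apple", 90),
           3: ("apple butter", 70), 4: ("banana", 10)}
index = build_index(records)
result = search(records, index, "app b")
```

The point of the change is that only one lookup touches the big structure; every other query word is checked against the handful of words belonging to each candidate.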
Now, I use the database just as a way to easily store and load my data. All the queries on this table are sent directly to this software. When I add or remove a record, I send the change both to the DB and to the software, so I always keep both updated. It costs me about 30s to load all the data, but then the queries are fast (0.03s for the slowest ones, 0.001s on average; this is on my own notebook, I didn't try it on dedicated hosting, where it might be much faster).

MySQL search within the last 5 characters in a column?

My user table has a column "name" which contains information like this:
Joe Lee
Angela White
I want to search for either first name or last name efficiently. First name is easy, I can do
SELECT * FROM user WHERE name LIKE "ABC%"
But for last name, if I do
SELECT * FROM user WHERE name LIKE "%ABC"
That would be extremely slow.
So I am thinking about counting the characters of the input, for example, "ABC" has 3 characters, and if I can search only the last three characters in name column, that would be great. So I want something like
SELECT * FROM user WHERE substring(name, end-3, end) LIKE "ABC%"
Is there anything in MySQL that can do this?
Thanks so much!
PS. I cannot do fulltext because our search engine doesn't support that.
The reason that
WHERE name LIKE '%ith'
is a slow way to look for 'John Smith' by last name is the same reason that
WHERE Right(name, InStr(name, ' ' )) LIKE 'smi%'
or any other expression on the column is slow. It defeats the use of the index for quick lookup and leaves the MySQL server doing a full table scan or full index scan.
If you were using Oracle (that is, if you worked for a formerly wealthy employer) you could use function indexes. As it is you have to add some extra columns or some other helping data to accelerate your search.
Your smartest move is to split your first and last names into separate columns. Several other people have pointed out good reasons for doing that.
If you can't do that you could try creating an extra column which contains the name string reversed, and create an index on that column. That column will have, for example, 'John Smith' stored as 'htimS nhoJ'. Then you can search as follows.
WHERE nameReversed LIKE CONCAT(REVERSE('ith'),'%')
This search will use the index and be decently fast. I've had good success with it.
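If it helps to see why the reversed column works, here is a small Python illustration. It is not MySQL: a sorted list with binary search stands in for the B-tree index on nameReversed, and the sample names are made up. A suffix search on the original string becomes a prefix search on the reversed one, and a prefix search is a contiguous range in sorted order.

```python
from bisect import bisect_left

# Hypothetical sample data.
names = ["John Smith", "Angela White", "Joe Lee", "Ann Goldsmith"]

# Stand-in for the indexed nameReversed column: (reversed name, original name).
reversed_index = sorted((name[::-1].lower(), name) for name in names)


def names_ending_with(suffix):
    """Find names ending with suffix via a prefix range scan on the reversed names."""
    key = suffix[::-1].lower()  # same idea as REVERSE('ith') in the query above
    start = bisect_left(reversed_index, (key,))
    matches = []
    for reversed_name, name in reversed_index[start:]:
        if not reversed_name.startswith(key):
            break  # past the end of the matching range
        matches.append(name)
    return matches


ith_matches = names_ending_with("ith")
```

The lookup jumps straight to the first candidate and stops at the first non-match, which is the kind of access an index gives you for LIKE 'hti%' and cannot give you for LIKE '%ith'.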
You're close. In MySQL you should be able to use InStr(str, substr), CHAR_LENGTH(str) and Right(str, len) to do the following:
SELECT * FROM user WHERE Right(name, CHAR_LENGTH(name) - InStr(name, ' ')) LIKE 'ABC%'
InStr(name, ' ') returns the position of the space character. Right() takes a number of characters rather than a position, so subtracting that position from CHAR_LENGTH(name) gives the number of characters after the space, and Right() then returns only the last name (basically; problems arise when you have multiple names, multiple spaces etc). LIKE 'ABC%' would then search for a last name starting with ABC.
You cannot use a fixed length of 3, as you suggest, because last names that are longer or shorter than 3 characters would not be matched properly.
However, as Zane said, it's a much better practice to use separate fields.
If it is a MyISAM table, you may use full-text search to do the same.
You can use the REGEXP operator:
SELECT * FROM user WHERE name REGEXP "ABC$"
http://dev.mysql.com/doc/refman/5.1/en/regexp.html

MySQL Sort Alphabetically but Ignore "The"

I have MySQL database that has a table with book data in it. One of the columns in the table is called "title". Some of the titles begin the word "the" and some do not.
Example:
"The Book Title One"
"Book Title Two"
"Book Title Three"
I need to pull these out of the database in alphabetical order, but I need to ignore the "the" in the beginning of the titles that start with it.
Does SQL (specifically MySQL) provide a way to do this in the query?
Use a CASE WHEN to check whether the column value starts with "The" and, if it does, return the title without it. This gives you a new column that you can then use for the sort order:
select title, case when title like 'The %' then trim(substr(title from 4)) else title end as title2 from tablename order by title2;
You can use a CASE statement in the ORDER BY and then use REGEXP or LIKE to match strings that start with words you would like to remove.
In the example below I find all titles that begin with a, an, or the followed by a space, remove everything up to the space, and trim away additional white space (you might have two or more spaces following an instance of "the").
SELECT *
FROM books
ORDER BY
CASE
WHEN title REGEXP '^(A|An|The)[[:space:]]' = 1 THEN
TRIM(SUBSTR(title , INSTR(title ,' ')))
ELSE title
END ;
if you are sure that you will NEVER EVER have a typo (and use lowercase instead of uppercase)
select *
from books b
order by UPPER(LTRIM(Replace(b.Title, 'The', '')))
Otherwise your sorting will list all the uppercase titles first and then all the lowercase ones.
For example, this is ascending order:
Have a Great Day
Wild west
Zorro
aZtec fries are hotter
alfred goes shopping
bart is small
will i am not
adapted from AJP's answer
I've seen some convoluted answers here which I tried but were just wrong (didn't work) or unsafe (replaced every occurrence of 'the'). The solution I believe to be easy, or maybe I'm getting it wrong or not considering edge cases (sincerely, no sarcasm intended).
... ORDER BY SUBSTRING(UPPER(fieldname), IF(fieldname LIKE 'The %', 5, 1))
As stated elsewhere, UPPER just prevents ASCII sorting which orders B before a (note the case difference).
There's no need for a switch-case statement when there is only one condition; IF() will do.
I'm using MySQL 5.6 and it seems like the string functions are 1-indexed, in contrast to say PHP where strings are 0-indexed (this caught me out). I've tested the above on my dataset and it works
select *
from books b
order by LTRIM(Replace(b.Title, 'The', ''))
Please note that this will remove "The" from the title no matter where it appears, so use SUBSTRING to check only the first 3 characters.
Simply:
SELECT Title
FROM book
ORDER BY IF(Title LIKE "The %", substr(Title, 5), Title);
Explanation:
We use the IF function to strip the "The" (if present) from the beginning of the string before returning the string to the ORDER BY clause. For more complex alphabetization rules we could create a user-defined function and place that in the ORDER BY clause instead. Then you would have ...ORDER BY MyFunction(Title).
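For comparison, the same rule written as a sort key in Python, in case the ordering has to happen in application code. This is only a sketch: the function name and the two extra sample titles are made up, and the case-insensitive comparison is an assumption meant to mirror a case-insensitive MySQL collation.

```python
def sort_key(title):
    """Drop one leading "The " (any case) and compare case-insensitively."""
    stripped = title.strip()
    if stripped.lower().startswith("the "):
        stripped = stripped[4:].lstrip()  # skip "The " plus any extra spaces
    return stripped.casefold()


titles = [
    "The Book Title One",
    "Book Title Two",
    "Book Title Three",
    "Theory of Sorting",   # starts with "The" but not with "The "
    "the art of War",
]
ordered = sorted(titles, key=sort_key)
```

"Theory of Sorting" keeps its place under T because only "The" followed by a space is stripped, which is the same reason the query uses LIKE "The %" and not a blanket REPLACE.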

MySQL query - select postcode matches

I need to make a selection based on the first 2 characters of a field, so for example
SELECT * from table WHERE postcode LIKE 'rh%'
But this would select any record that contains those 2 characters at any point in the "postcode" field, right? I need a query that matches only on the first 2 characters. Any pointers?
Thanks
Your query is correct. It searches for postcodes starting with "rh".
In contrast, if you wanted to search for postcodes containing the string "rh" anywhere in the field, you would write:
SELECT * from table WHERE postcode LIKE '%rh%'
Edit:
To answer your comment, you can use either or both % and _ for relatively simple searches. As you have noticed already, % matches any number of characters whereas _ matches a single character.
So, in order to match postcodes starting with "RHx " (where x is any character) your query would be:
SELECT * from table WHERE postcode LIKE 'RH_ %'
(mind the space after _). For more complex search patterns, you need to read about regular expressions.
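To experiment with how % and _ behave outside the database, the patterns can be translated to regular expressions. The Python sketch below is an approximation under assumptions: it treats matching as case-insensitive (as with a case-insensitive collation) and ignores LIKE escape characters; the helper names and sample postcodes are hypothetical.

```python
import re


def like_to_regex(pattern):
    """Translate a LIKE pattern: % -> any run of characters, _ -> exactly one."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))  # everything else matches literally
    return re.compile("".join(parts), re.IGNORECASE | re.DOTALL)


def like(value, pattern):
    """LIKE must match the whole value, hence fullmatch and not search."""
    return like_to_regex(pattern).fullmatch(value) is not None


starts_with = like("RH1 2AB", "rh%")       # 'rh' at the beginning
not_at_start = like("BRH1 2AB", "rh%")     # 'rh' present, but not at the start
anywhere = like("BRH1 2AB", "%rh%")        # 'rh' anywhere
one_char = like("RH1 2AB", "RH_ %")        # RH, one character, then a space
two_chars = like("RH12 2AB", "RH_ %")      # two characters before the space
```

Because the whole value has to match, 'rh%' only accepts values that begin with "rh"; it is the leading % that allows a match further along the string.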
Further reading:
http://dev.mysql.com/doc/refman/5.1/en/pattern-matching.html
http://dev.mysql.com/doc/refman/5.1/en/regexp.html
LIKE '%rh%' will return all rows with 'rh' anywhere
LIKE 'rh%' will return all rows with 'rh' at the beginning
LIKE '%rh' will return all rows with 'rh' at the end.
If you want to compare only the first two characters against 'rh', use the MySQL SUBSTR() function:
http://dev.mysql.com/doc/refman/5.1/en/string-functions.html#function_substr
Dave, your way seems correct to me (and works on my test data). Using a leading % as well will match anywhere in the string which obviously isn't desirable when dealing with postcodes.