SQL Server 2008 - fulltext search not stopping on stop words - sql-server-2008

I've created a stoplist based on the system's list and I set up my fulltext indexes to use it.
If I run the code select unique_index_id, stoplist_id from sys.fulltext_indexes I can see that all my indexes are using the stoplist with ID 5 which is the one I have created.
When I run the text through the full-text parser (sys.dm_fts_parser), the result comes out correct.
example:
SELECT special_term, display_term
FROM sys.dm_fts_parser
(' "Rua José do Patrocinio nº125, Vila América, Santo André - SP" ', 1046, 5, 0)
The words that I added to the stoplist are shown as noise words. But for some reason, when I run my query, it returns the records containing the stopwords too.
For example:
SELECT *
FROM tbEndereco
WHERE CONTAINS (*, '"rua*" or "jose*"')
Returns the record above, as I would expect, since the word 'rua' should be ignored but 'Jose' would be a match.
But if I searched:
SELECT *
FROM tbEndereco
WHERE CONTAINS (*, '"rua*"')
I would expect no records to be found, since 'rua' is set as a stopword.
I'm using Brazilian (Portuguese) as the stoplist language.
So the word "Rua" (which means "Street") should be ignored, since I added it to the stoplist. The parser recognizes it as noise, but when I run my query it returns records that contain "Rua".
My search is an address search, so it should ignore words such as "Street", "Avenue", etc. (in Portuguese, of course; I added them all to the stoplist as well).
This is the query that I'm using to look up the tables.
select DISTINCT(PES.idPessoa)
, PES.Nome
, EN.idEndereco
, EN.idUF
, CID.Nome as Cidade
, EN.Bairro
, EN.Logradouro
, EN.Numero
, EN.Complemento
, EN.CEP
, EN.Lat
, EN.Lng
from tbPessoa PES
INNER JOIN tbAdvogado ADV ON PES.idPessoa = ADV.idPessoa
INNER JOIN tbEndereco EN ON PES.idEmpresa = EN.idEmpresa
LEFT JOIN tbCidade CID ON CID.idCidade = EN.idCidade
where adv.Ativo = 1
and CONTAINS (en.*, '"rua*"')
OR EN.idCidade IN (SELECT idCidade
FROM tbCidade
WHERE CONTAINS (*, '"rua*"'))
OR PES.idPessoa IN (SELECT DISTINCT (ADVC.idPessoa)
FROM tbComarca C
INNER JOIN tbAdvogadoComarca ADVC
ON ADVC.idComarca = C.idComarca
WHERE CONTAINS (Nome, '"rua*"'))
OR PES.idPessoa IN (SELECT OAB.idPessoa
FROM tbAdvogadoOAB OAB
WHERE CONTAINS (NROAB, '"rua*"'))
I tried both FREETEXT and CONTAINS. I also tried something simpler, like WHERE CONTAINS (NROAB, 'rua'), but it also returned the records containing "Rua".
I thought my query might have a problem, so I tried a simpler query, and it also returned rows for the stopword "Rua".
SELECT *
FROM tbEndereco
WHERE CONTAINS (*, 'rua')
One thing I noticed is that the words that come from the system stoplist work just fine. For example, if I try the word "do" (which means "of"), it does not return any records.
Example:
SELECT *
FROM tbEndereco
WHERE CONTAINS (*, '"do*"')
I tried running the "Start full population" command through SSMS on all tables to check whether that was the problem, but nothing changed.
What am I missing here? This is the first time I have worked with full-text indexes, and I may be missing some step in setting them up.
Thank you in advance for your support.
Regards,
Cesar.

You have changed your question so I will change my answer and try to explain it a little better.
According to Stopwords and Stoplists:
A stopword can be a word with meaning in a specific language, or it
can be a token that does not have linguistic meaning. For example, in
the English language, words such as "a," "and," "is," and "the" are
left out of the full-text index since they are known to be useless to
a search.
Although it ignores the inclusion of stopwords, the full-text index
does take into account their position. For example, consider the
phrase, "Instructions are applicable to these Adventure Works Cycles
models". The following table depicts the position of the words in the
phrase:
I am not sure why, but I think this only applies when using a phrase search.
If you have a line like this:
Teste anything casa
And you query the fulltext as:
SELECT *
FROM Address
WHERE CONTAINS (*, '"teste rua casa"')
The line:
Teste anything casa
Will be returned. In that case, the fulltext will translate your query as something like this:
"Search for 'teste' near any word near 'casa'"
When you query the fulltext using the "or" operator or only search for one word the rule does not apply. I have tested it several times for about 3 months and I never understood why.
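The phrase behavior described above can be pictured with a toy model. This is only a sketch in Python, not how SQL Server is implemented: the STOPWORDS set and both function names are made up for illustration, and the model covers the phrase case only. It treats each stopword inside a quoted phrase as a placeholder for exactly one arbitrary word, which is my reading of "the full-text index does take into account their position".

```python
import re

# Hypothetical stoplist for illustration; the real one lives in SQL Server.
STOPWORDS = {"rua", "do"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def phrase_matches(document, phrase):
    """Toy phrase match: a stopword in the phrase stands for any single word."""
    doc = tokenize(document)
    query = tokenize(phrase)
    # Assumption: a phrase made only of stopwords matches nothing.
    if not any(term not in STOPWORDS for term in query):
        return False
    for start in range(len(doc) - len(query) + 1):
        window = doc[start:start + len(query)]
        if all(q in STOPWORDS or q == w for q, w in zip(query, window)):
            return True
    return False


if __name__ == "__main__":
    print(phrase_matches("Teste anything casa", "teste rua casa"))
    print(phrase_matches("Teste anything else casa", "teste rua casa"))
```

In this model the row "Teste anything casa" matches the phrase "teste rua casa", while a row with two words between "teste" and "casa" does not, because the stopword holds exactly one position.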
EDIT
if you have the line
"Rua José do Patrocinio nº125"
and you query the fulltext
WHERE CONTAINS (*, '"RUA" or "Jose*" or "do*"')
it will bring the line because it DOES contain at least one of the words you are searching for, and not because the words "rua" and "do" are being ignored.

Related

How to do fulltext search in multiple columns in MySQL, quickly?

I know this question has been asked several times, but let me explain.
I have a table with 450k records of users (id, first name, last name, address, phone number, etc.).
I want to search users by their first name and/or their last name.
I used these queries :
SELECT * FROM correspondants WHERE nom LIKE 'Renault%' AND prénom LIKE 'r%';
and
SELECT * FROM correspondants WHERE CONCAT(nom, CHAR(32), prénom) LIKE 'Renault r%';
It works well, but it takes too long (1.5 s). This is my problem.
To fix it, I tried MATCH and AGAINST with a full-text index on both columns, 'nom' and 'prénom':
SELECT * FROM correspondants WHERE MATCH(nom, prénom) AGAINST('Renault r');
It's very quick (0.000 s) but the results are wrong: I don't get what I should.
For example, with LIKE function, results are :
88623 RENAULT Rémy
91736 RENAULT Robin
202269 RENAULT Régine
(3 results).
And with MATCH/AGAINST :
327380 RENAULT Luc
1559 RENAULT Marina
17280 RENAULT Anne
(...)
88623 RENAULT Rémy
91736 RENAULT Robin
202269 RENAULT Régine
(...)
436696 SEZNEC-RENAULT Helene
(...)
(115 results !)
What is the best way to do a quick and efficient text search on both columns with an "AND" search? (And what about indexes?)

Fulltext search doesn't do pattern-matching as LIKE string comparisons do. Fulltext search only searches for full words, not fragments like r%.
Also there's a minimum size of word, controlled by the ft_min_word_len configuration variable. To avoid making the fulltext index too large, it doesn't index words smaller than that variable. And therefore short words are ignored when you search, so r is ignored.
There's also no choice in fulltext indexing to search for words in a specific position like at the beginning of a string. So your search for renault may be found in the middle of the string.
To solve these issues, you could do the following:
SELECT * FROM correspondants WHERE MATCH(nom, prénom) AGAINST('Renault')
AND CONCAT(nom, CHAR(32), prénom) LIKE 'Renault r%';
This would use the fulltext index to find a small subset of your 450,000 rows that have the word renault somewhere in the string. Then the second term in the search would be done without help from an index, but only against the subset of rows that match the first term.

That particular query is best done this way:
INDEX(nom, prénom)
WHERE nom = 'Renault' AND prénom LIKE 'R%'
I recommend that you add that index and add code to your application to handle different requests in different ways.
Do not hide an indexed column inside a function call, such as CONCAT(nom, ...). The query will not be able to use the index; instead it will check every row, performing the CONCAT for every row and then doing the LIKE. Very slow.
Except for cases of initials (as above), you should mostly avoid very short names. However, here is another case where you can make it work with extra code:
WHERE nom = 'Lu'
(with the same index). Note that using any flavor of MATCH is likely to be much less efficient.
So, if you are given a full last name, use WHERE nom =. If you are given a prefix, then it might work to use WHERE nom LIKE 'Prefix%' Etc.
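As a rough illustration of "handle different requests in different ways", here is a minimal Python sketch that picks the WHERE clause according to what the user supplied. It runs against an in-memory SQLite database as a stand-in for MySQL. The helper names build_demo_db and search are hypothetical, the column is spelled prenom for simplicity, and COLLATE NOCASE is only there to approximate a case-insensitive MySQL collation. Wildcards typed by the user (% or _) are not escaped here.

```python
import sqlite3


def build_demo_db():
    """Create a small in-memory table with the composite index (nom, prenom)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE correspondants ("
        " id INTEGER PRIMARY KEY,"
        " nom TEXT COLLATE NOCASE,"
        " prenom TEXT COLLATE NOCASE)"
    )
    conn.execute("CREATE INDEX idx_nom_prenom ON correspondants (nom, prenom)")
    conn.executemany(
        "INSERT INTO correspondants VALUES (?, ?, ?)",
        [
            (88623, "RENAULT", "Rémy"),
            (91736, "RENAULT", "Robin"),
            (202269, "RENAULT", "Régine"),
            (327380, "RENAULT", "Luc"),
            (1559, "RENAULT", "Marina"),
            (17280, "RENAULT", "Anne"),
            (436696, "SEZNEC-RENAULT", "Helene"),
        ],
    )
    return conn


def search(conn, nom, prenom_prefix=None, nom_is_prefix=False):
    """Use equality for a full last name and LIKE 'prefix%' for a prefix."""
    if nom_is_prefix:
        sql = "SELECT id FROM correspondants WHERE nom LIKE ?"
        params = [nom + "%"]
    else:
        sql = "SELECT id FROM correspondants WHERE nom = ?"
        params = [nom]
    if prenom_prefix:
        sql += " AND prenom LIKE ?"
        params.append(prenom_prefix + "%")
    sql += " ORDER BY id"
    return [row[0] for row in conn.execute(sql, params)]


if __name__ == "__main__":
    conn = build_demo_db()
    print(search(conn, "Renault", "r"))
```

With the sample rows taken from the question, the full-name search for "Renault" plus the first-name prefix "r" returns the same three ids the LIKE query returned, and the compound name SEZNEC-RENAULT is never considered.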
FULLTEXT is best used for cases where you have full words scattered in longer text, which is not your case since you have nom and prénom split out.
Perhaps you should not use MATCH for anything in this schema.

Sphinx match by first letter

I need simple explanation of why my queries fail to bring the results i need.
Sphinx 2.0.8-id64-release (r3831)
Here is what i have in sphinx.conf:
SELECT
trackid,
title,
artistname,
SUBSTRING(REPLACE(TRIM(`artist_name`), 'the ', ''),1,3) AS artistname_init
....
sql_field_string = title
sql_field_string = artistname
sql_field_string = artistname_init
Additional settings:
docinfo = extern
charset_type = utf-8
min_prefix_len = 1
enable_star = 1
expand_keywords= 0
charset_table = U+FF10..U+FF19->0..9, 0..9, U+FF41..U+FF5A->a..z, U+FF21..U+FF3A->a..z, A..Z->a..z, a..z
Query works. I index my data without problems. However i am failing to make sphinx bring any sensible results. I am using SphinxQL to query.
Example:
select
artistname, artistname_init from myindex
WHERE MATCH('@artistname_init ^t*')
GROUP BY artistname ORDER BY artistname_init ASC limit 0,10;
brings nothing related to the query.
I've tried everything i could think of like:
MATCH('@artistname_init ^t*')
MATCH('@artistname_init[1] t')
MATCH('@artistname_init ^t$')
Can anyone please point where is my mistake and perhaps give me query that will work for my case?
My target is to get results that follow this sorting order:
B (Single letter)
B-T (Single letter + non-alphabet sign after)
B as Blue (Single letter + space after)
Baccara (First letter of single word)
Bad Religion (First letter of several words)
The B (not counting "The ")
The B.Y.Z (Single letter + non-alphabet sign after not counting "The ")
The B 2 B (Single letter + space after not counting "The ")
The Boyzz (First letter of single word not counting "The ")
The Blue Boy (First letter of several words not counting "The ")
Or close to it.

There are a lot of moving parts in what you're trying to do, but I can at least answer the title portion of it. Sphinx offers field-level ranking factors to let you customize the WEIGHT() function; it should be much easier to order the matches the way you want, rather than trying to actually filter out entries that matched the query later than the 1st or 2nd word.
Here's an example, which will return all results with a word starting with "b", sorted by how early that word appears:
SELECT id, artistname, WEIGHT()
FROM myindex
WHERE MATCH('(@artistname (b*))')
ORDER BY WEIGHT() DESC
LIMIT 10
OPTION ranker=expr('sum(100 - min_hit_pos)');
If you want to filter out other cases like "Several other words then B", I think I'd suggest doing that in your application. For example, if the fourth result has the keyword in the 3rd word, only return the first 3 results. That, or actually create a new field in Sphinx without the leading "The", and then add a numeric attribute to the index to show that a word was removed (you can use numeric attributes in your ranker expressions).
As for ranking "B-t" more highly than "Bat", I'm not sure if that's possible without somehow changing Sphinx's concept of alphabetical order. You could try diving into the source code? ;)
One last note. For this particular kind of query, MySQL (I say MySQL because it's the common way of sourcing a Sphinx index) may actually work just as well. If you strip the leading "The", a B-tree index (which MySQL uses) is a perfectly good way of searching if you're sure you only want results where the query matches the beginning of the field. Sphinx's inverted indexes are kind of overkill for that sort of thing.
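To make the "strip the leading The, then match the beginning of the field" idea concrete, here is a small Python sketch. The function names sort_name and artists_starting_with are hypothetical. In a real setup the stripped name would be stored in its own indexed column, so the database could do the prefix match instead of application code. Note that this orders purely by the stripped name and does not reproduce the exact order asked for in the question.

```python
def sort_name(artist):
    """Strip one leading 'The ' (case-insensitive) and lowercase the rest."""
    name = artist.strip()
    if name.lower().startswith("the "):
        name = name[4:].lstrip()
    return name.lower()


def artists_starting_with(artists, letter):
    """Return artists whose stripped name starts with the letter, sorted by it."""
    letter = letter.lower()
    hits = [a for a in artists if sort_name(a).startswith(letter)]
    return sorted(hits, key=sort_name)


if __name__ == "__main__":
    sample = ["The Boyzz", "Baccara", "Abba", "The B", "Bad Religion"]
    for artist in artists_starting_with(sample, "b"):
        print(artist)
```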

How to get records in mysql where first letter equals 'X', when dealing with spanish characters?

As an example:
select *
from recording as r
where left(r.title, 1) = 'p'
This picks all the recordings where the recording title starts with the letter p. I can use other letters...
But the problem is that I'm dealing with a Spanish table that contains recordings like ¿-Por qué? or «Por un amor» or even ¡..Pon la mesa!, etc. And I do want those recordings to be part of my result set.
How do I get the recordings that start with a specific letter only?
Do I need to create a separate column in the table, let's call it 'sort name', and insert the recording title with all of those special characters stripped out? Are there other solutions, such as special functions in MySQL that I'm not aware of?
Thanks

You can use mysql regexp:
SELECT * FROM recording WHERE title REGEXP '^[^a-zA-Z]*[pP]'

You can use the SQL LIKE wildcard '_' this way:
SELECT * FROM recording AS r WHERE r.title LIKE '_P%'
It will accept any character before the P letter. (And only one character)
If you don't want to match words such as 'Ap...', you could use the SQL regex function (it's quite well documented), or check the results afterward in PHP if the case is rare enough.
Edit: OK, so you want to match titles with several special characters before the first letter.
In that case, using a regexp (on the SQL side or the PHP side, depending on your needs) should be the best way.
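For the application-side check mentioned above, a minimal Python sketch could look like the following. The name starts_with_letter is hypothetical. It also shows one caveat of the ASCII-only pattern '^[^a-zA-Z]*[pP]': an accented capital such as É is not in a-zA-Z, so that pattern skips over it as if it were punctuation. The sample title "Épico" is made up to demonstrate this.

```python
import re

# Same pattern as the REGEXP answer; only ASCII letters count as letters.
ASCII_PATTERN = re.compile(r"^[^a-zA-Z]*[pP]")


def starts_with_letter(title, letter):
    """True if the first alphabetic character of the title is the given letter."""
    for ch in title:
        if ch.isalpha():
            return ch.lower() == letter.lower()
    return False


if __name__ == "__main__":
    titles = ["¿-Por qué?", "«Por un amor»", "¡..Pon la mesa!", "Apuesta", "\u00c9pico"]
    for title in titles:
        ascii_hit = ASCII_PATTERN.match(title) is not None
        print(title, starts_with_letter(title, "p"), ascii_hit)
```

The two checks agree on the titles from the question and differ only on the accented example, where the Unicode-aware check treats É as the first letter.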

MySQL - FULLTEXT in BOOLEAN mode + Relevance using views field

I have the following table:
CREATE TABLE IF NOT EXISTS `search`
(
`id` BIGINT(16) NOT NULL AUTO_INCREMENT PRIMARY KEY,
`string` TEXT NOT NULL,
`views` BIGINT(16) NOT NULL,
FULLTEXT(string)
) ENGINE=MyISAM;
It has a total of 5,395,939 entries. To perform a search on a term (like 'a'), I use the query:
SELECT * FROM `search` WHERE MATCH(string) AGAINST('+a*' IN BOOLEAN MODE) ORDER BY `views` DESC LIMIT 10
But it's really slow =(. The query above took 15.4423 seconds to perform. Obviously, it's fast without sorting by views, which takes less than 0.002s.
I'm using ft_min_word_len=1 and ft_stopword_file=
Is there any way to use the views as the relevance in the fulltext search, without making it too slow? I want the search term "a b" match "big apple", for example, but not "ibg apple" (just need the search prefixes to match).
Thanks

Since no one answered my question, I'm posting my solution (not the one I would expect to see if I was googling, since it isn't so easy to apply as a simple database-design would be, but it's still a solution to this problem).
I couldn't really solve it with any engine or function used by MySQL. Sorry =/.
So, I decided to develop my own software to do it (in C++, but you can apply it in any other language).
If what you are looking for is a method to search for prefixes of words in small strings (the average length of my strings is 15), you can use the following algorithm:
1. Create a trie. Each word of each string is put on the trie.
Each leaf has a list of the ids that match that word.
2. Use a map/dictionary (or an array) to store the information
for each id (map[id] = information).
Searching for a string:
Note: The string will be in the format "word1 word2 word3...". If it has some symbols, like @, #, $, you might consider them as " " (spaces).
Example: "Rafael Perrella"
1. Search for the prefix "Rafael" in the trie. Put all the ids you
get in a set (a Binary-Search Tree that ignores repeated values).
Let's call this set "mainSet".
2. Search for the prefix "Perrella" in the trie. For each result,
put them in a second set (secSet) if and only if they are already
in the mainSet. Then, clear mainSet and do mainSet = secSet.
3. If there are still words left to search, repeat the second step
for all those words.
After these steps, you will have a set with all the results. Make a vector using a pair for the (views, id) and sort the vector in descending order. So, just get the results you want... I've limited to 30 results.
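The steps above can be sketched in a few lines of Python. This is an illustration, not the original C++ program. The class names TrieNode and PrefixIndex are made up, ids are stored on the node where a word ends and collected from the whole subtree under the prefix, and ties in views are broken by ascending id, which is an arbitrary choice.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.ids = set()  # ids of records that contain a word ending here


class PrefixIndex:
    def __init__(self):
        self.root = TrieNode()
        self.info = {}  # id -> (text, views)

    def add(self, record_id, text, views):
        """Store the record and put each of its words in the trie."""
        self.info[record_id] = (text, views)
        for word in text.lower().split():
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.ids.add(record_id)

    def _ids_with_prefix(self, prefix):
        """Collect the ids of every word that starts with the prefix."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return set()
        found, stack = set(), [node]
        while stack:
            current = stack.pop()
            found |= current.ids
            stack.extend(current.children.values())
        return found

    def search(self, query, limit=30):
        """Intersect the id sets of all prefixes, then order by views."""
        words = query.lower().split()
        if not words:
            return []
        main_set = self._ids_with_prefix(words[0])
        for word in words[1:]:
            main_set &= self._ids_with_prefix(word)
        ranked = sorted(main_set, key=lambda i: (-self.info[i][1], i))
        return ranked[:limit]


if __name__ == "__main__":
    index = PrefixIndex()
    index.add(1, "big apple", 10)
    index.add(2, "ibg apple", 50)
    index.add(3, "apple bakery", 30)
    print(index.search("a b"))
```

With the example from the question, the search "a b" keeps "big apple" and drops "ibg apple", because no word in the latter starts with "b".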
Note: you can sort the words first to remove those with the same prefix (for example, in "Jan Ja Jan Ra" you only need "Jan Ra"). I will not explain about it since the algorithm is pretty obvious.
This algorithm may be bad sometimes (for example, if I search for "a b c d e f ... z", I will search the entire trie...). So, I made an improvement.
1. For each "id" in your map, create also a small trie, that will
contain the words of the string (include a trie for each m[id]...
m[id].trie?).
Then, to make a search:
1. Choose the longest word in the search string (it's not guaranteed,
but it is probably the word with the fewest results in the trie...).
2. Apply the step 1 of the old algorithm.
3. Make a vector with the ids in the mainSet.
4. Let's make the final vector. For each id in the vector you've created
in step 3, search in the trie of this id (m[id].trie?) for all words
in the search string. If it includes all words, it's a valid id and
you might include it in the final vector; else, just ignore this id.
5. Repeat step 4 until there are no more ids to verify. After that, just
sort the final vector for <views, id>.
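The improved search can be sketched as follows, again in Python and only as an illustration. To keep the block short and self-contained it uses stand-ins, which are my assumptions and not what the answer describes: a sorted word list with bisect replaces the global trie, and a plain list of words per record replaces the small per-id trie. The class name LongestFirstIndex is hypothetical. The control flow follows the steps above: start from the longest search word, then verify every candidate id against all search words.

```python
import bisect


class LongestFirstIndex:
    def __init__(self):
        self.words = []        # sorted distinct words (stand-in for the global trie)
        self.ids_by_word = {}  # word -> set of record ids
        self.records = {}      # id -> (list of words, views)

    def add(self, record_id, text, views):
        words = text.lower().split()
        self.records[record_id] = (words, views)
        for word in words:
            if word not in self.ids_by_word:
                bisect.insort(self.words, word)
                self.ids_by_word[word] = set()
            self.ids_by_word[word].add(record_id)

    def _ids_with_prefix(self, prefix):
        """Words sharing a prefix are contiguous in the sorted list."""
        found = set()
        start = bisect.bisect_left(self.words, prefix)
        for word in self.words[start:]:
            if not word.startswith(prefix):
                break
            found |= self.ids_by_word[word]
        return found

    def search(self, query, limit=30):
        terms = query.lower().split()
        if not terms:
            return []
        # Steps 1-3: candidates come from the longest search word only.
        candidates = self._ids_with_prefix(max(terms, key=len))
        # Steps 4-5: keep ids whose own words cover every search prefix.
        final = [
            rid for rid in candidates
            if all(any(w.startswith(t) for w in self.records[rid][0]) for t in terms)
        ]
        final.sort(key=lambda rid: (-self.records[rid][1], rid))
        return final[:limit]


if __name__ == "__main__":
    index = LongestFirstIndex()
    index.add(1, "big apple", 10)
    index.add(2, "ibg apple", 50)
    index.add(3, "apple bakery", 30)
    print(index.search("app b"))
```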
Now, I use the database just as a way to easily store and load my data. All the queries in this table are directly asked to this software. When I add or remove a record, I send both to the DB and to the software, so I always keep both updated. It costs me about 30s to load all the data, but then the queries are fast (0.03s for the slowest ones, 0.001s in average; using my own notebook, didn't try it in a dedicated hosting, where it might be much faster).

fuzzy matching an address using mysql's match against (if possible using weights for better results ranking)

I have a MyISAM table with a FULLTEXT index, and I'm trying to do:
SELECT
lk.id,
lk.address
FROM
lk
WHERE MATCH(lk.address)
AGAINST('235 regent street, london w1b 2et');
I get results, but only the ones that contain the word "london" or the word "street". I know that words shorter than ft_min_word_len (3 characters or fewer, with the default setting) aren't indexed, so "235", "w1b", and "2et" are ignored, but what about "regent"?
What is the STANDARD way of doing this, fuzzy matching an address?
Thanks

The answer is to use MATCH ... AGAINST('...' IN BOOLEAN MODE), and to add + in front of every word.
Or use the other operators explained in:
http://dev.mysql.com/doc/refman/5.1/en/fulltext-boolean.html
It needs fine tuning depending on your searched text, and how you got it.
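As a small illustration of "add + in front of every word", here is a Python sketch that builds the boolean-mode search string from raw user input. The function name to_boolean_query is hypothetical, and splitting on \w+ is just one simple way to tokenize. The sketch does not decide what to do with words shorter than ft_min_word_len; whether to require those is part of the fine tuning mentioned above.

```python
import re


def to_boolean_query(text):
    """Prefix every word with '+' so boolean mode requires each of them."""
    words = re.findall(r"\w+", text)
    return " ".join("+" + word for word in words)


if __name__ == "__main__":
    print(to_boolean_query("235 regent street, london w1b 2et"))
```

The resulting string would then be passed as the search-string argument of AGAINST(... IN BOOLEAN MODE), ideally as a bound parameter rather than concatenated into the SQL text.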