Common words not showing up in FULLTEXT search results - mysql

I am using full-text search on a website I am building, to order a user's search query by relevance. This works well, with one problem: when a search term appears in more than 50% of the rows in my table, that term is ignored. I am looking for a way to NOT ignore words that appear in more than 50% of the rows of a table.
For example, in my "items" table, it may look something like this:
item_name
---------
poster 1
poster 2
poster 3
poster 4
another item
If a user searches for "poster", the query returns 0 results because it appears too many times in the table. How can I stop this from happening?
I've tried using IN BOOLEAN MODE, but that takes away the functionality I need (which is ordering by relevance).
Here's an example of my SQL:
SELECT item_title
FROM items
WHERE MATCH(item_title, tags, item_category) AGAINST('poster')

You have to recompile MySQL to change this. From the documentation on Fine-Tuning MySQL Full-Text Search:
The 50% threshold for natural language searches is determined by the particular weighting scheme chosen. To disable it, look for the following line in storage/myisam/ftdefs.h:
#define GWS_IN_USE GWS_PROB
Change that line to this:
#define GWS_IN_USE GWS_FREQ
Then recompile MySQL. There is no need to rebuild the indexes in this case.

Related

SQL - Finding rows with unknown, but slightly similar, values?

I am trying to write a query that will return similar rows regarding the "Name" column.
My issue is that within my SQL database, there are rows like the following:
NAME DOB
Doe, John 1990-01-01
Doe, John A 1990-01-01
I would like a query that returns similar, but not exact, duplicates of the "Name" column. Since I do not know exactly which patients this occurs for, I cannot just query for "Doe, John%".
I have written this query using MySQL Workbench:
SELECT
Name, DOB, id, COUNT(*)
FROM
Table
GROUP BY
DOB
HAVING
COUNT(*) > 1 ;
However, this returns a large number of rows whose Name values are not similar at all. Is there any way I can narrow down my results to include only similar (but not exactly duplicate!) Name values? It seems impossible, since I do not know exactly which rows have a similar Name, but I figured I'd ask some experts.
To be clear, this is not a duplicate of the other question posted, since I do not know the content of the two (or more) strings, whereas that poster seemed to know some of the content. Ideally, I would like to have the query limit results to rows with the first 3 or 4 characters being the same in the "Name" column.
But again, I do not know the content of the strings in question. Hope this helps clarify my issue.
What I intend on doing with these results is manually auditing the rest of the information in each of the duplicate rows (over 90 other columns per row may or may not have abstract information in them that must be accurate) and then deleting the unneeded row.
I would just like to get the most concise and accurate list I can to go through, so I don't have to scroll through over 10,000 rows looking for similar names.
For the record, I do know for a fact that the two rows will have identical names up to the middle initial. In the past, someone used a tool that exported names from one database to my SQL database, which included middle initials. Since then, I have imported another list that does not include middle initials. I am looking for the ones that have middle initials from that subset.
This is a very large topic, and the effort depends on what you consider "similar" and how the data is structured. For example, are you going to want to match Doe, Johnathan as well?
Several algorithms exist but they can be extremely resource intensive when matching name alone if you have a large data set. That is why often using other attributes such as DOB, or Email, or Address to first narrow your possible matches then compare names typically works better.
When comparing, you can use several algorithms such as Jaro-Winkler, Levenshtein distance, or n-grams. But you should also consider the "confidence" of a match by looking at the other information, as suggested above.
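As a rough illustration of narrowing by another attribute first and only then comparing names, the sketch below groups records by DOB and computes the Levenshtein distance within each group. The record layout, the function names and the distance threshold of 3 are assumptions made for this example, not something from the question.

```python
from collections import defaultdict
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similar_pairs(records, max_distance=3):
    """Return (name_a, name_b, dob) for non-identical, close names sharing a DOB."""
    by_dob = defaultdict(list)
    for name, dob in records:
        by_dob[dob].append(name)
    pairs = []
    for dob, names in by_dob.items():
        # Only names within the same DOB group are compared with each other.
        for a, b in combinations(names, 2):
            if a != b and levenshtein(a, b) <= max_distance:
                pairs.append((a, b, dob))
    return pairs
```

Comparing only within a DOB group keeps the number of pairwise comparisons small, which is the point of narrowing first.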
The issue with matching addresses is that you have the same fuzzy logic problems, such as 1st vs first. So if going this route, I would actually convert addresses into GPS coordinates using another service, then accept records within X amount of distance.
And the age-old issue with this is matching a husband and wife. I personally know a married couple both named Michael Hatfield. So you could try to bring in the gender of the name, but then Terry, Tracy, etc. can be either....
Bottom line: only go the route of name similarity if you have to, and if you do, look into other solutions such as the services offered by Melissa Data, or SQL Server Data Quality Services as a tool.....
Update per comment about middle initial. If you always know the name will be the same except for the middle initial, then this task can be fairly simple and does not need any complicated algorithm. You could match based on one string + '%' being LIKE the other, then test that the lengths differ by exactly 2 and that the longer string has one more space than the shorter one. Or you could attempt to cleanse/remove the middle initial; this can be a little complicated if the name itself has a space in it, as in Doe, Ann Marie. But you could do it by testing whether the second-to-last character is a space.
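A minimal sketch of both checks described above, written in Python rather than SQL so it can be run on its own. The function names are hypothetical, and it assumes the middle initial is a single character with no trailing period.

```python
def differs_only_by_middle_initial(short_name: str, long_name: str) -> bool:
    """True if long_name is short_name followed by a space and one character."""
    return (
        long_name.startswith(short_name)           # short + '%' LIKE long
        and len(long_name) - len(short_name) == 2  # exactly two extra characters
        and long_name.count(" ") == short_name.count(" ") + 1  # one extra space
    )


def strip_middle_initial(name: str) -> str:
    """Drop a trailing ' X' when the second-to-last character is a space."""
    if len(name) >= 3 and name[-2] == " ":
        return name[:-2]
    return name
```

For example, `strip_middle_initial("Doe, John A")` gives `"Doe, John"`, while `"Doe, Ann Marie"` is left untouched because its second-to-last character is not a space.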

MySQL Fulltext search query matching ALL words still returns partial matches

I'm having the identical issue that this poster had, however the accepted answer didn't resolve my issue. Basically I'm trying to match my "title" column with ALL of the words in a fulltext search query, yet it's still returning partial matches. I recently transferred my MySQL database tables to a new web host, and my fulltext search isn't behaving as it was on my old server. I'm assuming there might be a setting difference, but I can't seem to locate it. Fulltext is enabled, my ft_min_word_len is set at 3, and yet the following MySQL query is still garnering partial matches:
SELECT title, MATCH (title) AGAINST ("more pink") AS relevance
FROM discography
WHERE MATCH (title) AGAINST ("+more +pink" IN BOOLEAN MODE)
ORDER BY relevance DESC
The above code returns the below set, the first 7 titles are:
Under The Pink & More Pink
Under The Pink Tour All Pass
Under The Pink Tour Guest Pass
Under The Pink Tour Aftershow Pass
Under The Pink Tour After Show Pass
Under The Pink
Under The Pink
How can I omit the partial matches? Is there something I'm missing? The results are even worse if I put the SELECT statement in Boolean mode, since that sets the relevance into a binary 1 or 0:
SELECT title, MATCH (title) AGAINST ("+more +pink" IN BOOLEAN MODE) AS relevance
FROM discography
WHERE MATCH (title) AGAINST ("+more +pink" IN BOOLEAN MODE)
ORDER BY relevance DESC
First 7 titles are:
Under The Pink
Under The Pink
Under The Pink
Under The Pink
Under The Pink
Under The Pink
Under The Pink & More Pink
Despite using the + operator, it doesn't seem to be narrowing my results. Any help would be welcome, many thanks in advance.
Well, I feel silly now. My table uses MyISAM, and according to the documentation, "more" is on the stopwords list. Stopwords are ignored by the search, so "+more +pink" effectively matches on "pink" alone. That's why the search is picking up partial matches. Thanks everyone for the help.
EDIT
If anyone is curious how to "go around" a stopwords list on shared hosting when programming your own search engine on your website, I recommend a technique similar to the one I used to get around my "ft_min_word_len" setting.
Create a separate search column that stores a duplicate of all the values in the column or columns you wish to search via Fulltext. Create an include file that stores all the stopwords listed for your database type in an array. Before saving the values into your dedicated search column, loop through each individual word in your column values and check whether it exists in the stopwords array from the include file. If a word is a stopword, add a character onto the end of it (I chose "z").
Then, when a search is triggered, loop the search terms through the same stopwords array. If any search word is in the stopwords array, once again add the same character you chose ("z" in this case) to the end of it. After making these alterations to the search terms, you may search your dedicated search column without fear of your stopwords being ignored. Of course, I don't use my search column for any display purposes, only searching.
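The suffixing step could look roughly like the sketch below. It is written in Python purely for illustration (the post does not say which language was used), the stopword set is a hypothetical, heavily abbreviated stand-in for the server's real list, and `protect_stopwords` is an invented name. The same function has to be applied both to the text saved in the search column and to the user's search terms.

```python
import re

# Hypothetical, abbreviated stopword list; a real one would mirror the server's list.
STOPWORDS = {"more", "the", "and"}
SUFFIX = "z"  # arbitrary marker character, as chosen in the post


def protect_stopwords(text: str) -> str:
    """Append SUFFIX to every stopword so the full-text index no longer skips it."""
    def mark(match):
        word = match.group(0)
        return word + SUFFIX if word.lower() in STOPWORDS else word
    # Rewrite word-like tokens only; punctuation and spacing are left as they are.
    return re.sub(r"[A-Za-z0-9']+", mark, text)
```

One caveat of this approach: a real word that happens to equal a stopword plus the suffix would become indistinguishable from the marked stopword, so the marker character is worth choosing with that in mind.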

Improving a SQL LIKE query performance

We have a large table with product information. Almost all the time we need to find product names that contain specific words, but unfortunately these queries take forever to run.
Example: Find all the products where the name contains the words "steel" and "102" (not necessarily next to each other, so a product like "Ninja steel iron 102 x" is a match, as is "Dragon steel 102 b").
Currently we are doing it like this:
SELECT columns FROM products WHERE name LIKE '%WORD1%' AND name LIKE '%WORD2%'
(The number of LIKE terms is normally 2-4, but in theory it can be 7-8 or more.)
Is there a faster way of doing this?
We are only matching words, so I wonder if that can help somehow (i.e. the products in the example above are matches, but "Samurai swordsteel 102 v" is not a match since "steel" doesn't stand alone).
My own thought is to make a helper table with the words from productnames in and then use that table to get the ids of the matching products.
i.e. a table like: [id, word, productid] so we get for example:
1, samurai, 3
2, swordsteel, 3
3, 102, 3
4, v, 3
I just wonder if there is a built-in way to do this in MySQL, so I don't have to implement my own solution and maintain two tables.
Thanks!
Unfortunately, your patterns start with a wildcard. Hence, MySQL cannot use a standard index for this.
You have two options. First, if the words are really keywords/attributes, then you should have another table, with one row per word.
If that is not the case, you can try a full-text index. Note that MySQL has a setting for the minimum word length and uses a stopword list. You should take these into account before building the index.
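To make the first option concrete, here is a small sketch of a one-row-per-word table and the query that finds products containing all of the requested words. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs on its own; the table name `product_words` and both function names are assumptions for this example.

```python
import sqlite3


def build_demo_db():
    """In-memory stand-in for the products table plus a word helper table."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE product_words (word TEXT, product_id INTEGER,
                                    PRIMARY KEY (word, product_id));
    """)
    names = ["Ninja steel iron 102 x", "Dragon steel 102 b", "Samurai swordsteel 102 v"]
    for name in names:
        product_id = db.execute("INSERT INTO products (name) VALUES (?)", (name,)).lastrowid
        words = {w.lower() for w in name.split()}  # one row per distinct word
        db.executemany("INSERT INTO product_words VALUES (?, ?)",
                       [(w, product_id) for w in words])
    return db


def find_products(db, words):
    """Return names of products containing every word as a whole word."""
    words = sorted({w.lower() for w in words})
    placeholders = ",".join("?" * len(words))
    sql = f"""
        SELECT p.name
        FROM product_words AS pw
        JOIN products AS p ON p.id = pw.product_id
        WHERE pw.word IN ({placeholders})
        GROUP BY p.id
        HAVING COUNT(DISTINCT pw.word) = ?
        ORDER BY p.id
    """
    return [row[0] for row in db.execute(sql, (*words, len(words)))]
```

Because the lookup is an equality match on `word`, the primary key on (word, product_id) can serve it, and "swordsteel" is not returned for a search on "steel".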

MySql Ordered Keyword Search

I have two tables currently:
search_matches:
match_id (int) <-- primary key
parent_id (int) <-- foreign-key
word_id (int) <-- foreign-key (to a table filled with words that are unique and have an id)
pos (int) <-- the position of the word in the block of text it comes from
search_words: (update)
word_id (int) <-- primary key
word (varchar ...) <-- the word
(I'm using innodb, and my host won't upgrade mysql, so fulltext is out)
I'd like my users to be able to search using double quotes ("), so that they can search for an exact phrase such as "foo bar".
I've thought of a few ways of doing this, but the least intensive seems to be adding another column:
next_pos (int)
I could then do
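SELECT foo.parent_id
FROM (SELECT * FROM search_matches
      WHERE word_id = (SELECT word_id FROM search_words WHERE word = 'foo')) AS foo
INNER JOIN (SELECT * FROM search_matches
      WHERE word_id = (SELECT word_id FROM search_words WHERE word = 'bar')) AS bar
ON (
foo.parent_id = bar.parent_id AND
foo.next_pos = bar.pos
)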
It comes at the cost of storing an extra column and an inner join for each word beyond the first, but it's the best option I've come up with so far. (The previous idea used one less column, but needed an addition operation within the ON block, something I thought might be too expensive as my site grows.)
Is this my best option, or is there another out there? I'm still just playing in staging, so now's the time to make changes.
Update 1:
I'm now considering using the keyword table to narrow down my search and then using LIKE on that result instead of multiple joins, as this may be faster still and largely eliminates the need for joins. It just would not be practical to do a LIKE on my entire database.
I really can't understand why you want to do all this manual work. There are tools out there that can simplify it. From what I read, what you want to do is a full-text search. You don't need to build the index yourself.
Have you considered using something like Solr? It works well with any sort of DB as long as you create an index.
I don't see how you are going to make that search with your current set-up. If as you say you have a table that contains only UNIQUE words from a block of text, how would you expect to correlate this listing of unique words to the actual word placement in the full content? For example say the original content looked like this:
some text with foo and also with foo bar
Would your unique word table look like this?
word_id word
--------------
1 some
2 text
3 with
4 foo
5 and
6 also
7 bar
If so, how are you ever going to find foo and bar as adjacent records?
I assume your database also has the full content somewhere, so why not just search in the content using LIKE?
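For reference, the position-based idea from the question can be sketched as follows, assuming search_matches stores one row per word occurrence together with its pos (as the question's schema describes), so adjacency can be tested with pos + 1 and no next_pos column. The sketch uses Python's sqlite3 as a stand-in for MySQL; `build_index` and `phrase_search` are hypothetical helper names, and the tokenizer is a plain whitespace split.

```python
import sqlite3


def build_index(texts):
    """Index {parent_id: text} into search_words / search_matches (in memory)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE search_words (word_id INTEGER PRIMARY KEY, word TEXT UNIQUE);
        CREATE TABLE search_matches (match_id INTEGER PRIMARY KEY,
                                     parent_id INTEGER, word_id INTEGER, pos INTEGER);
    """)
    for parent_id, text in texts.items():
        for pos, word in enumerate(text.lower().split()):
            # Each distinct word is stored once; every occurrence gets its own match row.
            db.execute("INSERT OR IGNORE INTO search_words (word) VALUES (?)", (word,))
            word_id = db.execute("SELECT word_id FROM search_words WHERE word = ?",
                                 (word,)).fetchone()[0]
            db.execute("INSERT INTO search_matches (parent_id, word_id, pos) VALUES (?, ?, ?)",
                       (parent_id, word_id, pos))
    return db


def phrase_search(db, first, second):
    """Return parent_ids where `second` immediately follows `first`."""
    sql = """
        SELECT DISTINCT a.parent_id
        FROM search_matches AS a
        JOIN search_words AS wa ON wa.word_id = a.word_id
        JOIN search_matches AS b
             ON b.parent_id = a.parent_id AND b.pos = a.pos + 1
        JOIN search_words AS wb ON wb.word_id = b.word_id
        WHERE wa.word = ? AND wb.word = ?
        ORDER BY a.parent_id
    """
    return [row[0] for row in db.execute(sql, (first, second))]
```

Whether the addition in the join condition is actually too expensive at scale would need measuring on real data; this sketch only shows that the lookup itself is expressible without the extra column.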

Ranking search keywords

Question is: How to rank keywords that have been used in search queries in my web application based on time and number of search?
A user types his search query in the text box. Via AJAX I need to return some suggestions to the user. These suggestions are based on number of search done for that keyword, and should be sorted by most recently searched.
For example if a user enters the search term as "hang" the suggestions should be in this order: "hangover part 2", "hangover".
How should I design the database to store the search queries? How should I write the SQL query to get the suggestions?
For query suggestions, a good approach is to count the number of occurrences of each search query (it is probably better not to count repeated queries made by the same user). You'll have a file/table/something (query, count) like this:
"britney spears" 12
"lady xyz" 5
"billy joel" 27
"query abcdef" 2
"lady gaga" 39
...
Then you can sort by descending order of occurrence:
"lady gaga" 39
"billy joel" 27
"britney spears" 12
"lady xyz" 5
"query abcdef" 2
...
Then when someone is searching "lady", for example, do a prefix search on all strings from the top of the file/table/something to the bottom. If you only want K suggestions you'll go only until you find the Top-K suggestions.
You could implement this using a simple file, or you can also have a counting query table and do a query similar to:
SELECT query
FROM search_queries
WHERE query LIKE 'prefix%'
ORDER BY query_count DESC
LIMIT K
Two notes:
There are better (and more difficult) ways of doing this. Amazon, for example, has a pretty nice query suggestion.
The provided solution will only suggest queries that start with the user's query. Like:
"lady" => ["lady gaga", "lady xyz"]
Query "lady" won't match "gaga lady". For them to match you will need query indexing, through the Full-Text Search support of your database or an external library such as Lucene.
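A minimal runnable sketch of the counting table and the prefix lookup, using Python's sqlite3 in place of MySQL. The table and column names follow the query above; `build_counts` and `suggest` are hypothetical helpers, and the escaping of LIKE wildcards is an addition not mentioned in the answer.

```python
import sqlite3


def build_counts(counts):
    """Create an in-memory search_queries table from a {query: count} mapping."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE search_queries (query TEXT PRIMARY KEY, query_count INTEGER)")
    db.executemany("INSERT INTO search_queries VALUES (?, ?)", counts.items())
    return db


def suggest(db, prefix, k=5):
    """Return up to k stored queries starting with prefix, most searched first."""
    # Escape LIKE wildcards so the user's input is matched literally.
    escaped = prefix.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    sql = """
        SELECT query FROM search_queries
        WHERE query LIKE ? ESCAPE '\\'
        ORDER BY query_count DESC, query
        LIMIT ?
    """
    return [row[0] for row in db.execute(sql, (escaped + "%", k))]
```

Sorting in the outer query, rather than relying on the order of a subquery, keeps the Top-K result well defined.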
Ideally, you'd sort on something like the following:
order by sum(# of searches / (how long ago that search was performed + 1))
This would have to be modified so that "how long ago" is measured in an appropriate unit of time. For example, if you want searches to count as half after a week, you'd make one week = 1.
This will clearly be inefficient, because calculating how long ago each search was performed for all search results will be time consuming. Thus, you might want to keep a running total for each search and multiply the totals by a certain value each time period. For example, if you want searches to count as half after a week, you would add one to that column for every search. Then, you would have a process that multiplies the search column by .5 every week. Then you just sort on that column.
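The running-total idea can be sketched in a few lines. This is an in-memory illustration only: the class and method names are made up for the example, and in practice the totals would live in a database column with the decay applied by a scheduled job.

```python
class DecayedCounts:
    """Running per-query score that is periodically scaled down."""

    def __init__(self, decay=0.5):
        self.decay = decay  # factor applied once per time period
        self.scores = {}

    def record_search(self, query):
        """Add one to the running total for this query."""
        self.scores[query] = self.scores.get(query, 0.0) + 1.0

    def apply_decay(self):
        """Run once per time period (e.g. weekly) to age all totals."""
        for query in self.scores:
            self.scores[query] *= self.decay

    def top(self, prefix, k=5):
        """Highest-scoring queries starting with prefix."""
        matches = [q for q in self.scores if q.startswith(prefix)]
        return sorted(matches, key=lambda q: (-self.scores[q], q))[:k]
```

With a decay of 0.5, three older searches for "hangover" are worth 1.5 after one period, so two fresh searches for "hangover part 2" rank above them, which matches the ordering asked for in the question.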
Do you need something like autosuggestion? There is a jQuery plugin called autocomplete that looks up similar words as the user types. However, if you want suggestions based on the number of times a keyword has been searched by users, then you need to store the keywords in a separate table and fetch them later for other users.