Improving a SQL LIKE query performance - mysql

We have a large table with product information. Almost all the time we need to find product names that contain specific words, but unfortunately these queries take forever to run.
Example: Find all the products where the name contains the words "steel" and "102" (not necessarily next to each other, so a product like "Ninja steel iron 102 x" is a match, and so is "Dragon steel 102 b").
Currently we are doing it like this:
SELECT columns FROM products WHERE name LIKE '%WORD1%' AND name LIKE '%WORD2%' (the number of LIKE words is normally 2-4, but in theory it can be 7-8 or more).
Is there a faster way of doing this?
We are only matching words, so I wonder if that can help somehow (i.e. the products in the example above are matches, but "Samurai swordsteel 102 v" is not a match since "steel" doesn't stand alone).
My own thought is to make a helper table containing the words from the product names and then use that table to get the ids of the matching products.
i.e. a table like: [id, word, productid] so we get for example:
1, samurai, 3
2, swordsteel, 3
3, 102, 3
4, v, 3
I just wonder if there is a built-in way to do this in MySQL, so I don't have to implement my own solution and maintain two tables.
Thanks!

Unfortunately, you have wildcards at the beginning of the pattern. Hence, MySQL cannot use a standard index for this.
You have two options. First, if the words are really keywords/attributes, then you should have another table, with one row per word.
If that is not the case, you can try a full-text index. Note that MySQL has settings for the minimum word length (ft_min_word_len for MyISAM, innodb_ft_min_token_size for InnoDB) and uses a stopword list. You should take these into account before building the index.
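The "one row per word" table from the first option could look roughly like the sketch below. It is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs on its own, and the table name product_words, its columns, and the helper find_products are assumptions, not anything from the question. The idea is to match each search word exactly against the word table and keep only the products that matched all of them.

```python
import re
import sqlite3

# Hypothetical sample data; table and column names are assumptions.
PRODUCTS = [
    (1, "Ninja steel iron 102 x"),
    (2, "Dragon steel 102 b"),
    (3, "Samurai swordsteel 102 v"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE product_words ("
    " word TEXT NOT NULL,"
    " product_id INTEGER NOT NULL REFERENCES products(id),"
    " PRIMARY KEY (word, product_id))"  # the key doubles as the lookup index
)
conn.executemany("INSERT INTO products VALUES (?, ?)", PRODUCTS)

# Store one row per distinct lower-cased word per product.
for product_id, name in PRODUCTS:
    words = set(re.findall(r"\w+", name.lower()))
    conn.executemany(
        "INSERT INTO product_words VALUES (?, ?)",
        [(w, product_id) for w in words],
    )


def find_products(conn, words):
    """Return names of products that contain every given word as a whole word."""
    wanted = sorted({w.lower() for w in words})
    placeholders = ", ".join("?" for _ in wanted)
    sql = (
        "SELECT p.name FROM products p"
        " JOIN product_words w ON w.product_id = p.id"
        f" WHERE w.word IN ({placeholders})"
        " GROUP BY p.id, p.name"
        " HAVING COUNT(DISTINCT w.word) = ?"  # all words must be present
        " ORDER BY p.id"
    )
    return [row[0] for row in conn.execute(sql, (*wanted, len(wanted)))]


matches = find_products(conn, ["steel", "102"])
print(matches)
```

"Samurai swordsteel 102 v" is not returned because "steel" never appears there as a separate word. With the full-text option, the equivalent MySQL query would be along the lines of WHERE MATCH(name) AGAINST('+steel +102' IN BOOLEAN MODE), which needs a FULLTEXT index on name and only sees words that pass the minimum-length and stopword settings.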

Related

SQL - Finding rows with unknown, but slightly similar, values?

I am trying to write a query that will return similar rows regarding the "Name" column.
My issue is that within my SQL database, there are rows like the following:
NAME DOB
Doe, John 1990-01-01
Doe, John A 1990-01-01
I would like a query that returns similar, but not exact, duplicates of the "Name" column. Since I do not know exactly which patients this occurs for, I cannot just query for "Doe, John%".
I have written this query using MySQL Workbench:
SELECT
Name, DOB, id, COUNT(*)
FROM
Table
GROUP BY
DOB
HAVING
COUNT(*) > 1 ;
However, this returns a large number of rows whose Name values are not similar at all. Is there any way I can narrow down my results to include only similar (but not exactly duplicate!) Name values? It seems impossible, since I do not know exactly which rows have similar Name values, but I figured I'd ask some experts.
To be clear, this is not a duplicate of the other question posted, since I do not know the content of the two (or more) strings, whereas that poster seemed to know some of the content. Ideally, I would like the query to limit results to rows where the first 3 or 4 characters of the "Name" column are the same.
But again, I do not know the content of the strings in question. Hope this helps clarify my issue.
What I intend on doing with these results is manually auditing the rest of the information in each of the duplicate rows (over 90 other columns per row may or may not have abstract information in them that must be accurate) and then deleting the unneeded row.
I would just like to get the most concise and accurate list I can to go through, so I don't have to scroll through over 10,000 rows looking for similar names.
For the record, I do know for a fact that the two rows will have exactly similar names up until the middle initial. In the past, someone used a tool that exported names from one database to my SQL database, which included middle initials. Since then, I have imported another list that does not include middle initials. I am looking for the ones that have middle initials from that subset.
This is a very large topic and effort depends on what you consider as "similar" and what the structure of the data is. For example are you going to want to match Doe, Johnathan as well?
Several algorithms exist but they can be extremely resource intensive when matching name alone if you have a large data set. That is why often using other attributes such as DOB, or Email, or Address to first narrow your possible matches then compare names typically works better.
When comparing you can use several algorithms such as Jaro-Winkler, Levenshtein Distance, ngrams. But you should also consider "confidence" of match by looking at the other information as suggested above.
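As a rough illustration of that advice, the sketch below first groups rows by DOB and only then compares names with Levenshtein distance. It is not a recommendation of a specific threshold: the function names, the (id, name, dob) row layout, and max_distance=2 are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similar_pairs(rows, max_distance=2):
    """Return id pairs whose DOB is equal and whose names are close but not identical."""
    by_dob = defaultdict(list)
    for row_id, name, dob in rows:
        by_dob[dob].append((row_id, name))
    pairs = []
    for group in by_dob.values():
        for (id_a, name_a), (id_b, name_b) in combinations(group, 2):
            if name_a != name_b and levenshtein(name_a, name_b) <= max_distance:
                pairs.append((id_a, id_b))
    return pairs


# Hypothetical sample rows: (id, name, dob)
rows = [
    (1, "Doe, John", "1990-01-01"),
    (2, "Doe, John A", "1990-01-01"),
    (3, "Smith, Jane", "1990-01-01"),
    (4, "Doe, John A", "1985-05-05"),
]
pairs = similar_pairs(rows)
print(pairs)
```

Grouping by DOB first keeps the number of pairwise comparisons small, which matters because comparing every name against every other name grows quadratically with the table size.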
The issue with matching addresses is that you have the same fuzzy-logic problems, for example "1st" vs "first". So if going this route, I would turn the addresses into GPS coordinates using another service and then accept records within X amount of distance.
And the age old issue with this is Matching a husband and wife. I personally know a married couple both named Michael Hatfield. So you could try to bring in gender of name but then Terry, Tracy, etc can be either....
The bottom line is: only go the route of name similarity if you have to, and if you do, look into other solutions such as services by Melissa Data, or SQL Server Data Quality Services as a tool.
Update per the comment about middle initials: if you know the name will always be the same except for the middle initial, then this task is fairly simple and doesn't need a complicated algorithm. You could match on one string + '%' being LIKE the other, then check that the lengths differ by exactly 2 and that the longer string has exactly one more space than the shorter one. Alternatively, you could try cleansing (removing) the middle initial; this gets a little complicated if the first name contains a space, as in "Doe, Ann Marie", but you can handle it by testing whether the second-to-last character is a space.
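A minimal sketch of the middle-initial match, assuming a table called patients with columns id, Name and DOB (the question does not give the real table name). It runs on SQLite through Python's sqlite3 module purely so it is self-contained. Instead of '%' plus separate length and space checks, it uses the pattern ' _' (a space followed by exactly one character), which folds those checks into the LIKE itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, Name TEXT, DOB TEXT)")
conn.executemany("INSERT INTO patients VALUES (?, ?, ?)", [
    # Hypothetical sample rows
    (1, "Doe, John", "1990-01-01"),
    (2, "Doe, John A", "1990-01-01"),
    (3, "Doe, Johnathan", "1990-01-01"),
    (4, "Roe, Ann Marie", "1985-05-05"),
    (5, "Roe, Ann", "1985-05-05"),
])

# Pair each name with a same-DOB name that equals it plus " <one character>".
sql = """
SELECT plain.id, plain.Name, initialed.id, initialed.Name
FROM patients AS plain
JOIN patients AS initialed
  ON initialed.DOB = plain.DOB
 AND initialed.Name LIKE plain.Name || ' _'
ORDER BY plain.id
"""
pairs = conn.execute(sql).fetchall()
print(pairs)
```

In MySQL the concatenation would normally be written CONCAT(plain.Name, ' _'), since || means logical OR there by default. A name that itself contains % or _ would be treated as a wildcard in the pattern, so such values would need escaping.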

Search for specific keyword in MYSQL

I'm fairly new to MySQL.
I want to write a query that searches for specific keywords in a column where the keywords are separated by commas. But when I use the following code, it only returns the rows that contain just that specific keyword, not the rows where it is combined with other keywords.
In Table q16, I'm looking for a way to select rows that have my keyword in the "Area_of_concern" column, no matter if it's combined with other keywords or not:
SELECT *
FROM `q16`
WHERE area_of_concern like '%more education is needed%'
Here's an input example:
q16_id area of concern
1 more education is needed
2 more enforcement, change in strategy
3 change in strategy
4 more education is needed, change in strategy
5 transportation issue, more enforcement, more education is needed
I'm looking to get the rows with the keyword "more education is needed", so I should see rows 1, 4 and 5 in the output.
I think you should create a table where you have one column for keywords and one column for where those keywords are used: a foreign key for the q16 table in your case.
It will work much faster that way.
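A sketch of what that separate keyword table might look like, using the sample rows from the question. The table name q16_keywords and its columns are assumptions, and SQLite (via Python's sqlite3) stands in for MySQL so the example runs as-is.

```python
import sqlite3

# Sample rows from the question: (q16_id, comma-separated keywords)
RESPONSES = [
    (1, "more education is needed"),
    (2, "more enforcement, change in strategy"),
    (3, "change in strategy"),
    (4, "more education is needed, change in strategy"),
    (5, "transportation issue, more enforcement, more education is needed"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE q16 (q16_id INTEGER PRIMARY KEY, area_of_concern TEXT)")
conn.execute(
    "CREATE TABLE q16_keywords ("
    " keyword TEXT NOT NULL,"
    " q16_id INTEGER NOT NULL REFERENCES q16(q16_id),"
    " PRIMARY KEY (keyword, q16_id))"
)
conn.executemany("INSERT INTO q16 VALUES (?, ?)", RESPONSES)

# Split each comma-separated value into one trimmed keyword per row.
for q16_id, concerns in RESPONSES:
    keywords = {k.strip() for k in concerns.split(",") if k.strip()}
    conn.executemany(
        "INSERT INTO q16_keywords VALUES (?, ?)",
        [(k, q16_id) for k in keywords],
    )

# An exact, indexable comparison replaces the LIKE '%...%' scan.
ids = [row[0] for row in conn.execute(
    "SELECT q16_id FROM q16_keywords WHERE keyword = ? ORDER BY q16_id",
    ("more education is needed",),
)]
print(ids)
```

Because the lookup is an equality test on the leading column of the primary key, the database can use the index instead of scanning every row.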
As for your question it is a duplicate of this one here, I believe.
How to search for rows containing a substring?
A quick try: try using double quotes instead of single ones, as in some systems, single quotes don't allow for escapes (special characters) inside them.

MySQL table design, one row or more per user?

Using MySQL I have a table of users, a table of matches (updated with the actual result) and a table called users_picks (at first it's always going to be 10 football matches per gameweek per league because there's only one league as of now, but more leagues will come along eventually, and some of them only have 8 matches per gameweek).
In the users_picks table, should I store each 'pick' (by pick I mean both 'hometeam score' and 'awayteam score') in a different row, or have all 10 picks in one single row? Both would have a FK for user and gameweek. All picks in one row would mean I had columns with appended numbers like this:
Option 1: [pick_id, user_id, league_id, gameweek_id, match1_hometeam_score, match1_awayteam_score, match2_hometeam_score, match2_awayteam_score ... etc]
and that option doesn't quite fill me with joy, and looks a bit stupid. Especially since there's going to be lots of potential NULLs in the db. The second option would mean eventually millions of rows. But would look like this:
Option 2: [pick_id, user_id, league_id, gameweek_id, match_id, hometeam_score, awayteam_score]
What's the best practice? And would it be a PITA to do all sorts of statistics using the second option? eg. Calculating how many matches a user has hit correctly in a specific round, how many alltime correct hits etc.
If I'm not making much sense, I'll try to elaborate on anything. I just want my table design to be good from the start, so I won't have a huge headache in a couple of months.
Thanks in advance.
The second choice is much better than the first. This is called database normalisation and makes querying easier, not harder. I would suggest reading the linked article, and the related descriptions of the various "normal forms", and aiming for a 3rd Normal Form data structure as a minimum.
To see the flaw in your first option, imagine if there were to be included later a new league with 11 matches. Or 400.
You should read up about database normalization.
When you have a 1:n relation, like in your case one team having many matches, you would create two tables. One table "teams" and a second table "matches" where each row includes the ID of the team which played the match.
In the same manner you should also have separate tables for users, picks and leagues.
Option two is better, provided you INDEX your table properly, since (as you indicate) it will grow quite large. The pick_id is the primary key, but also create an INDEX on the user_id field, as likely the most common query will be
SELECT * FROM `users_picks` WHERE `user_id`=?;
to get all the picks for a given user.
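To illustrate that statistics are not painful with one row per pick, here is a small sketch of Option 2 with an index on user_id and a "correct hits per user per gameweek" query. The column layout of matches, the index name and the sample scores are assumptions; SQLite (via Python's sqlite3) is used only to keep the example runnable on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE matches (
    match_id INTEGER PRIMARY KEY,
    hometeam_score INTEGER,  -- actual result, NULL until played
    awayteam_score INTEGER
);
CREATE TABLE users_picks (
    pick_id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    league_id INTEGER NOT NULL,
    gameweek_id INTEGER NOT NULL,
    match_id INTEGER NOT NULL REFERENCES matches(match_id),
    hometeam_score INTEGER NOT NULL,
    awayteam_score INTEGER NOT NULL
);
CREATE INDEX idx_users_picks_user ON users_picks (user_id);
""")

# Hypothetical results and picks for one gameweek.
conn.executemany("INSERT INTO matches VALUES (?, ?, ?)", [
    (1, 2, 1), (2, 0, 0), (3, 1, 3),
])
conn.executemany(
    "INSERT INTO users_picks"
    " (user_id, league_id, gameweek_id, match_id, hometeam_score, awayteam_score)"
    " VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 1, 1, 1, 2, 1), (1, 1, 1, 2, 1, 0), (1, 1, 1, 3, 1, 3),
        (2, 1, 1, 1, 0, 0), (2, 1, 1, 2, 0, 0), (2, 1, 1, 3, 2, 2),
    ],
)

# Count exact-score hits per user and gameweek.
hits = conn.execute("""
SELECT p.user_id, p.gameweek_id, COUNT(*) AS correct
FROM users_picks p
JOIN matches m ON m.match_id = p.match_id
WHERE p.hometeam_score = m.hometeam_score
  AND p.awayteam_score = m.awayteam_score
GROUP BY p.user_id, p.gameweek_id
ORDER BY p.user_id, p.gameweek_id
""").fetchall()
print(hits)
```

Dropping gameweek_id from the SELECT and GROUP BY would give the all-time count per user. With Option 1 the same question would need a comparison per numbered column pair.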

MySql - Get all occurring

I have a MySQL DB.
My main table is all the sentences of a series of 5 books, with indexes for the book, chapter in book, sentence in chapter.
Example - For Harry Potter book five, chapter 1 , sentence 3 I'll have a row like that.
BookID ChapterID SentenceID Text
4 1 3 Deprived of their usual car-washing and lawn-mowing pursuits, the inhabitants of Privet Drive had retreated into the shade of their cool houses, windows thrown wide in the hope of tempting in a nonexistent breeze.
I need to retrieve all the occurrences of a letter or a word.
So if I search for 'e', I'll get the same row 17 times, because 'e' occurs 17 times in this row.
I've simplified the scenario, I have more information to retrieve for each letter.
So far I've been unable to get something useful.
Thank you
I'm 90% certain an SQL query won't return more than one row for a given database row, unless you use a JOIN. But that would be very inefficient for your purposes.
The way this would typically be implemented is using a query like SELECT * FROM books WHERE Text LIKE '%e%', which would return all of the rows that have at least one "e" in the text; then your application would iterate over the rows and count the occurrences.
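The application-side step could be sketched like this. The function name occurrences and the sample rows are made up for the example; in practice the rows would come from a query such as the LIKE one above, selecting BookID, ChapterID, SentenceID and Text.

```python
def occurrences(rows, needle):
    """Yield one (book_id, chapter_id, sentence_id, position) tuple per match."""
    if not needle:
        return  # an empty search string would match everywhere
    for book_id, chapter_id, sentence_id, text in rows:
        start = text.find(needle)
        while start != -1:
            yield (book_id, chapter_id, sentence_id, start)
            # Advance by one character so overlapping matches are also found.
            start = text.find(needle, start + 1)


# Hypothetical rows: (BookID, ChapterID, SentenceID, Text)
rows = [
    (4, 1, 3, "a nonexistent breeze"),
    (4, 1, 4, "Still no rain."),
]
found = list(occurrences(rows, "e"))
print(len(found), found)
```

Each database row is emitted once per occurrence, which is the shape the question asks for, and the position is available for looking up any extra per-letter information. The search here is case-sensitive, which is a choice made for the sketch.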

mysql - extract specific words from text field using full text search

My question is a little similar to "Extract specific words from text field in mysql", but not the same.
I have a text field with words inside. In my language a word can have many different endings. I need to find these endings.
I use MySQL's full-text search, but I would need access to the index data where the field is "cut" into words and the words are counted. I could then search for "test*" and quickly find "test", "tested", "testing". I need the list of all endings that exist in my database; that is my primary goal.
As it is, I can get the records that contain specific "test*" words, but I need not only to locate the occurrence in the field, but to group the results somehow so that I get the list of all the words that, for example, start with "test". I don't need to know which record they are in, just a list, grouped so that "testing" is listed only once rather than 10 times (maybe with a counter of how many times it is found, but that is not necessary).
Is there a way to extract this info from the full-text search index, or should I explode all these fields into words, build an index table full of words, and just do a LIKE 'word%' and group by the different results? I am not sure how to do that in practice either, so please just point me in the right direction.
So to summarize: I have a text field and I need to find out which words inside it start with "test", like "tested", "test", "testing" etc. It doesn't make sense in English, but in my language it does, as the same word has different endings and there are so many of them, sometimes 20. I need to find out which ones are there so I can make a synonyms table ;-)
UPDATE:
Database has columns ID (int), ingredients (text) and recipe (text).
Data in ingredients are cooking ingredients with different endings like:
1 egg
2 eggs
etc.
You can dump all words that are present in an index. And that would also show frequency of each word. E.g. test is used 200 times and testing is used 300 times.
Manual for that: http://dev.mysql.com/doc/refman/5.0/en/myisam-ftdump.html
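If dumping the index is not practical, the "explode into words" alternative from the question could be sketched in application code as below. This is not what myisam_ftdump does; it is a separate, simplified approach, and the function name word_forms and the sample ingredient strings are assumptions.

```python
import re
from collections import Counter


def word_forms(texts, prefix):
    """Count each distinct word that starts with the given prefix."""
    prefix = prefix.lower()
    counts = Counter()
    for text in texts:
        # [^\W\d_]+ matches runs of letters, including non-ASCII ones.
        for word in re.findall(r"[^\W\d_]+", text.lower()):
            if word.startswith(prefix):
                counts[word] += 1
    return counts


# Hypothetical contents of the ingredients column
ingredients = ["1 egg", "2 eggs", "3 eggs, beaten", "eggplant"]
forms = word_forms(ingredients, "egg")
print(forms.most_common())
```

Each word form appears once with its frequency, which is the grouped list needed for building a synonyms table. Storing the same (word, count) pairs in a table would allow the LIKE 'word%' plus GROUP BY variant inside MySQL.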