MySQL Self Join Query Optimization

I have a database of substrings generated from a list of words. I'm performing a comparison to retrieve all words that share substrings with some input word.
'word_substrings' table format and example (for the word 'aback'):
id (primary key), word_id (foreign key), word_substring (char(3))
30 4 "  a"
31 4 " ab"
32 4 "aba"
33 4 "bac"
34 4 "ack"
35 4 "ck "
36 4 "k  "
Where the 'word_id' is the key of the word in a table of words.
I've tried an equivalence:
select distinct t1.word_id
from word_substrings t1, word_substrings t2
where t1.word_substring = t2.word_substring
and t2.word_id = [some word_id]
As well as a table join:
select distinct t1.word_id
from word_substrings as t1
join word_substrings as t2
on t1.word_substring = t2.word_substring
where t2.word_id = [some word_id]
However, both queries take about 10 seconds to return results.
Given that the table of words and table of word_substrings are both liable to change, but the data will be retrieved very regularly, I tried making a view to help improve query times. However, I saw no noticeable change in return times.
My list of words is currently 40k rows and my list of substrings is approximately 400k rows.
Does anyone have any ideas on how to either optimize the query, or to reformat the database to improve return times?
I've contemplated generating a table that has columns that represent every possible substring, and registering each word in the appropriate columns, however I don't quite know how that would work.
I thank you for all your help! If there is any information that I neglected to include, I will be happy to retrieve that data for you.
NOTE: If it is pertinent information, this is for a Django web application.

You need a composite index on (word_id, word_substring). Also declare both columns NOT NULL if you can.
This way, queries that filter only on word_id can use the index, and so can queries that filter on both word_id and word_substring.
Cheers.
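A minimal sketch of the indexing advice, using Python's built-in sqlite3 as a stand-in for MySQL so that it runs anywhere (that swap is an assumption; the CREATE INDEX statements are the part that carries over). The index names are hypothetical, and the second index, which leads with word_substring, is my own assumption rather than part of the answer above: the self join looks up t1 rows by word_substring, so an index starting with that column seems likely to help too. Check with EXPLAIN on the real table.

```python
import sqlite3

def substrings(word):
    """Return the 3-character windows of a word padded with two spaces on each side."""
    padded = "  " + word + "  "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def build_db(words):
    """Create word_substrings for a {word_id: word} mapping and add the indexes."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE word_substrings ("
        " id INTEGER PRIMARY KEY,"
        " word_id INTEGER NOT NULL,"
        " word_substring CHAR(3) NOT NULL)"
    )
    rows = [(wid, s) for wid, w in words.items() for s in substrings(w)]
    conn.executemany(
        "INSERT INTO word_substrings (word_id, word_substring) VALUES (?, ?)", rows
    )
    # Composite index suggested in the answer (index name is hypothetical).
    conn.execute("CREATE INDEX idx_word_sub ON word_substrings (word_id, word_substring)")
    # Assumed extra index for the t1 side of the join, which is looked up by substring.
    conn.execute("CREATE INDEX idx_sub_word ON word_substrings (word_substring, word_id)")
    return conn

def words_sharing_substrings(conn, word_id):
    """Return the sorted ids of all words that share a substring with word_id."""
    cur = conn.execute(
        "SELECT DISTINCT t1.word_id"
        " FROM word_substrings AS t1"
        " JOIN word_substrings AS t2 ON t1.word_substring = t2.word_substring"
        " WHERE t2.word_id = ?"
        " ORDER BY t1.word_id",
        (word_id,),
    )
    return [row[0] for row in cur]

# Hypothetical sample words; only 'aback' with id 4 comes from the question.
conn = build_db({4: "aback", 5: "back", 6: "zzz"})
print(words_sharing_substrings(conn, 4))
```

As in the original query, the input word itself is part of the result, since it trivially shares substrings with itself.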

Related

MySQL GROUP BY slow across three tables with spatial search

I am adding some further First World War records to my astreetnearyou.org site
I have three tables:
people - contains full details of over 1 million people who died
addresses - contains about 700,000 different addresses for about 600,000 of these people
cemeteries - a new table which has records of about 15,000 cemeteries
In terms of relationships, every address has the ID of the person it relates to; every person in the people table has the name of the cemetery they are buried in (as an aside, these can be long varchar values, would it be better to give them unique integer IDs for the join? Answer: I tried it and it shaved about 0.5 secs off the query time)
I want to run a query that essentially says "give me a unique list of all the people who lived or are buried in this map area (bounding box)"
An example query is:
SELECT people.id, people.rank, people.forename, `people`.surname, people.regiment, people.date_of_death, people.cemeteryname, cemeteries.country, cemeteries.link
FROM people
JOIN cemeteries ON people.cemeteryId=cemeteries.id
LEFT JOIN addresses ON addresses.personId=people.id
WHERE MBRContains( GeomFromText( 'LINESTRING(-0.35 51.50,-0.32 51.51)' ), cemeteries.point) OR MBRContains( GeomFromText( 'LINESTRING(-0.35 51.50,-0.32 51.51)' ), addresses.point)
GROUP BY people.id
This returns 276 results but takes about 6 seconds. Without the GROUP BY it's 296 results including the duplicate IDs but takes well under a second. If I remove the LEFT JOIN table and associated WHERE clause (so I only get matches by cemetery, not address) it is also very quick.
I have spatial indexes on both point fields and all the fields that are in the JOIN conditions, plus based on another post on here I've added indexes across the id and point fields in the addresses table, and the cemetery and point fields in the cemeteries table.
I'm no SQL expert so any advice on making this more efficient and thereby quicker would be much appreciated. Also I guess some more table info would probably be of use, but can you tell me what would be helpful and how to produce it?!
ALTER TABLE people ADD INDEX IdCemIdIdx (id, cemeteryId);
If possible, apply it with pt-online-schema-change:
https://www.percona.com/doc/percona-toolkit/LATEST/pt-online-schema-change.html

Selecting values where the first 6 characters match

I'm trying to pull a list of IDs from a table Company where the first 6 characters of the ID are the same. The way our application creates a company ID is it takes the first 3 characters of the company name and the first 3 characters of the city. Because of that, over time we have ended up with company IDs that share the same first 6 characters, followed by a sequential number...
I was thinking of using something with LIKE:
Select companyID, companyName from Company Where
substring(companyID,1,6)+'%' like substring(companyID,1,6)+'%'
Basically I'm trying to get all company IDs where the first 6 characters match. The result set should show just the top company ID (the first one created) and the company name. I'm not expecting a ton of results, so I can then use the IDs returned to find the IDs below it.
I'm thinking it could maybe also be done using HAVING, where the count of IDs with the same first 6 characters is greater than one, i.e. HAVING COUNT(*) > 1?
Not really sure what the syntax would be...
SELECT distinct c1.CompanyID, c1.CompanyName, c2.CompanyID, c2.CompanyName
FROM dbo.Company c1
JOIN dbo.Company c2
ON SUBSTRING(c1.CompanyName,1,6) = SUBSTRING(c2.CompanyName,1,6)
AND c1.CompanyID < c2.CompanyID
order by c1.CompanyName, c2.CompanyName
SELECT c1.CompanyID, c1.CompanyName, c2.CompanyID, c2.CompanyName
FROM dbo.Company c1
INNER JOIN dbo.Company c2
ON SUBSTRING(c1.CompanyName,1,6) + '%' LIKE SUBSTRING(c2.CompanyName,1,6) + '%'
AND c1.CompanyID <> c2.CompanyID
If this is something that you envision doing frequently, I'd add a computed column to the table that has a definition of substring(CompanyName, 1, 6). You can then index it and make this efficient. As it is, it will have to scan all the entries and calculate the substring on the fly. With the computed column, you amortize the substring calculation up front and at least have a chance at an efficient query.
After trying Blam's script, I made a few slight changes and got better results. His script was returning more results than there are rows in the table and it was pretty slow; I think that's because of the company_name column. I got rid of it and wrote it like this:
select distinct c1.cmp_id, count(substring(c2.cmp_id,1,6)) as TotalCount
from company c1
join company c2 on substring(c1.cmp_id,1,6)=substring(c2.cmp_id,1,6)
group by c1.cmp_id
order by c1.cmp_id asc
This still returns all the table records, but at least I can see the total count when the first 6 characters are listed more than once. Also, it ran in only 1 second, so that's a plus. Thanks again for your input guys, always appreciated!
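The HAVING idea raised in the question can be sketched roughly as follows. This is only an illustration: it runs against Python's built-in sqlite3 instead of SQL Server (so it uses substr where SQL Server would use SUBSTRING), the sample rows are made up, and treating MIN(companyID) as "the first one created" assumes the sequential suffix sorts correctly as text.

```python
import sqlite3

def duplicated_prefixes(conn):
    """Return (first companyID, companyName, count) per 6-character prefix used more than once."""
    cur = conn.execute(
        "SELECT c.companyID, c.companyName, g.total"
        " FROM (SELECT MIN(companyID) AS first_id,"
        "              COUNT(*) AS total"
        "       FROM Company"
        "       GROUP BY substr(companyID, 1, 6)"
        "       HAVING COUNT(*) > 1) AS g"
        " JOIN Company AS c ON c.companyID = g.first_id"
        " ORDER BY c.companyID"
    )
    return cur.fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Company (companyID TEXT PRIMARY KEY, companyName TEXT)")
conn.executemany(
    "INSERT INTO Company VALUES (?, ?)",
    [
        ("ACMDAL", "Acme Tools"),          # hypothetical sample rows
        ("ACMDAL1", "Acme Roofing"),
        ("ACMDAL2", "Acme Dallas Foods"),
        ("BOBAUS", "Bob's Diner"),
    ],
)
print(duplicated_prefixes(conn))
```

Unlike the self join, this returns one row per duplicated prefix instead of every row in the table, which is closer to what the question asked for.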

How to convert list field into many-to-many table in MySQL

I have inherited a database in which a person table has a field called authorised_areas. The front end allows the user to choose multiple entries from a pick list (populated with values from the description field of the area table) and then sets the value of authorised_areas to a comma-delimited list. I am migrating this to a MySQL database and while I'm at it, I would like to improve the database integrity by removing the authorised_areas field from the person table and create a many-to-many table person_area which would just hold pairs of person-area keys. There are several hundred person records, so I would like to find a way to do this efficiently using a few MySQL statements, rather than individual insert or update statements.
Just to clarify, the current structure is something like:
person
id  name  authorised_areas
1   Joe   room12, room153, 2nd floor office
2   Anna  room12, room17
area
id  description
1   room12
2   room17
3   room153
4   2nd floor office
...but what I would like is:
person
id  name
1   Joe
2   Anna
area
id  description
1   room12
2   room17
3   room153
4   2nd floor office
person_area
person_id  area_id
1          1
1          3
1          4
2          1
2          2
There is no reference to the area id in the person table (and some text values in the lists are not exactly the same as the description in the area table), so this would need to be done by text or pattern matching. Would I be better off just writing some PHP code to split the strings, find the matches and insert the appropriate values into the many-to-many table?
I'd be surprised if I were the first person to have to do this, but a Google search didn't turn up anything useful (perhaps I didn't use the appropriate search terms?). If anyone could offer some suggestions of a way to do this efficiently, I would very much appreciate it.
While it is possible to do this in SQL, I would suggest that as a one-off job it would probably be quicker to knock up a PHP script (or one in your favorite scripting language) that does it with multiple inserts.
If you must do it in a single statement then have a table of integers (0 to 9, cross join against itself to get as big a range as you need) and join this against your original table, using string functions to get the Xth comma and from that each of the values for each row.
Possible, and I have done it but mainly to show that having a delimited field is not a good idea. It would likely be FAR quicker to knock up a script with multiple inserts.
You could base an insert on something like this SELECT (although this also comes up with a blank line for each person as well as the relevant ones, and will only cope with up to 1000 authorised areas per person)
SELECT z.id, z.name, x.an_authorised_area
FROM person z
LEFT OUTER JOIN (
    SELECT DISTINCT a.id,
        SUBSTRING_INDEX( SUBSTRING_INDEX( authorised_areas, ",", b.ournumbers ), ",", -1 ) AS an_authorised_area
    FROM person a, (
        SELECT hundreds.i * 100 + tens.i * 10 + units.i AS ournumbers
        FROM integers AS hundreds
        CROSS JOIN integers AS tens
        CROSS JOIN integers AS units
    ) b
) x ON z.id = x.id
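For the script route suggested above, here is a minimal sketch in Python rather than PHP, run against an in-memory sqlite3 database as a stand-in for MySQL. The function name and the matching rule (trim whitespace, ignore case) are assumptions of mine; values that still do not match any area description are returned for manual review rather than guessed at.

```python
import sqlite3

def migrate_authorised_areas(conn):
    """Fill person_area from person.authorised_areas; return the values that matched no area."""
    # Map a normalised description (trimmed, lower-cased) to its area id.
    area_ids = {
        desc.strip().lower(): area_id
        for area_id, desc in conn.execute("SELECT id, description FROM area")
    }
    pairs, unmatched = set(), []
    people = conn.execute("SELECT id, authorised_areas FROM person").fetchall()
    for person_id, areas in people:
        for raw in (areas or "").split(","):
            name = raw.strip()
            if not name:
                continue  # skip empty entries such as a trailing comma
            area_id = area_ids.get(name.lower())
            if area_id is None:
                unmatched.append((person_id, name))  # left for manual review
            else:
                pairs.add((person_id, area_id))
    conn.executemany(
        "INSERT INTO person_area (person_id, area_id) VALUES (?, ?)", sorted(pairs)
    )
    conn.commit()
    return unmatched

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, authorised_areas TEXT);
    CREATE TABLE area (id INTEGER PRIMARY KEY, description TEXT);
    CREATE TABLE person_area (
        person_id INTEGER NOT NULL REFERENCES person(id),
        area_id INTEGER NOT NULL REFERENCES area(id),
        PRIMARY KEY (person_id, area_id)
    );
    INSERT INTO person VALUES (1, 'Joe', 'room12, room153, 2nd floor office'),
                              (2, 'Anna', 'room12, room17');
    INSERT INTO area VALUES (1, 'room12'), (2, 'room17'), (3, 'room153'),
                            (4, '2nd floor office');
""")
unmatched = migrate_authorised_areas(conn)
migrated = conn.execute(
    "SELECT person_id, area_id FROM person_area ORDER BY person_id, area_id"
).fetchall()
print(migrated, unmatched)
```

Once the unmatched list is empty (or has been dealt with by hand), the authorised_areas column can be dropped from person.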

MySQL select users on multiple criteria

My team is working on a PHP/MySQL website for a school project. I have a table of users with typical information (ID, first name, last name, etc). I also have a table of questions with sample data like below. For this simplified example, all the answers to the questions are numerical.
Table Questions:
qid | questionText
1 | 'favorite number'
2 | 'gpa'
3 | 'number of years doing ...'
etc.
Users will have the ability to fill out a form to answer any or all of these questions. Note: users are not required to answer all of the questions and the questions themselves are subject to change in the future.
The answer table looks like this:
Table Answers:
uid | qid | value
37 | 1 | 42
37 | 2 | 3.5
38 | 2 | 3.6
etc.
Now, I am working on the search page for the site. I would like the user to select what criteria they want to search on. I have something working, but I'm not sure it is efficient at all or if it will scale (not that these tables will ever be huge - like I said, it is a school project). For example, I might want to list all users whose favorite number is between 100 and 200 and whose GPA is above 2.0. Currently, I have a query builder that works (it creates a valid query that returns accurate results - as far as I can tell). A result of the query builder for this example would look like this:
SELECT u.ID, u.name (etc)
FROM User u
JOIN Answer a1 ON u.ID=a1.uid
JOIN Answer a2 ON u.ID=a2.uid
WHERE 1
AND (a1.qid=1 AND a1.value>100 AND a1.value<200)
AND (a2.qid=2 AND a2.value>2.0)
I add the WHERE 1 so that in the for loops, I can just add " AND (...)". I realize I could drop the '1' and just use implode(and,array) and add the where if array is not empty, but I figured this is equivalent. If not, I can change that easy enough.
As you can see, I add a JOIN for every criterion the searcher asks for. This also allows me to order by a1.value ASC, or a2.value, etc.
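The query builder described above might look something like this sketch. It is written in Python with the built-in sqlite3 module rather than PHP/MySQL, and the function name, the criteria format and the sample rows are all hypothetical. It adds one JOIN per criterion, binds values as parameters instead of concatenating them into the SQL, and only emits WHERE when there is at least one condition.

```python
import sqlite3

ALLOWED_OPS = {"<", "<=", ">", ">=", "="}

def build_search(criteria):
    """Build (sql, params) from a list of (qid, [(operator, value), ...]) criteria."""
    joins, conditions, params = [], [], []
    for i, (qid, tests) in enumerate(criteria, start=1):
        alias = f"a{i}"  # one alias (a1, a2, ...) per criterion
        joins.append(f"JOIN Answer {alias} ON u.ID = {alias}.uid")
        parts = [f"{alias}.qid = ?"]
        params.append(qid)
        for op, value in tests:
            if op not in ALLOWED_OPS:  # operators are whitelisted, never taken verbatim
                raise ValueError(f"unsupported operator: {op!r}")
            parts.append(f"{alias}.value {op} ?")
            params.append(value)
        conditions.append("(" + " AND ".join(parts) + ")")
    sql = "SELECT u.ID, u.name FROM User u " + " ".join(joins)
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql + " ORDER BY u.ID", params

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE User (ID INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Answer (uid INTEGER, qid INTEGER, value REAL, PRIMARY KEY (uid, qid));
    INSERT INTO User VALUES (37, 'Ann'), (38, 'Ben'), (39, 'Cy');
    INSERT INTO Answer VALUES (37, 1, 142), (37, 2, 3.5),
                              (38, 1, 150), (38, 2, 1.9),
                              (39, 2, 3.6);
""")
# Favorite number between 100 and 200, and GPA above 2.0.
sql, params = build_search([(1, [(">", 100), ("<", 200)]), (2, [(">", 2.0)])])
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

Only user 37 satisfies both criteria in this made-up data: user 38 fails the GPA test and user 39 never answered question 1, so the inner join drops them.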
First question:
Is this table organization at least somewhat decent? We figured that since the number of questions is variable, and not every user answers every question, that something like this would be necessary.
Main question:
Is the query way too inefficient? I imagine that it is not ideal to join the same table to itself up to maybe a dozen or two times (if we end up putting that many questions in). I did some searching and found these two posts which seem to kind of touch on what I'm looking for:
Mutiple criteria in 1 query
This uses multiple nested (correct term?) queries in EXISTS
Search for products with multiple criteria
One of the comments by youssef azari mentions using 'query 1' UNION 'query 2'
Would either of these perform better/make more sense for what I'm trying to do?
Bonus question:
I left out above for simplicity's sake, but I actually have 3 tables (for number valued questions, booleans, and text)
The decision to have separate tables was because (as far as I could think of) it would either be that or have one big answers table with 3 value columns of different types, having 2 always empty.
This works with my current query builder - an example query would be
SELECT u.ID,...
FROM User u
JOIN AnswerBool b1 ON u.ID=b1.uid
JOIN AnswerNum n1 ON u.ID=n1.uid
JOIN AnswerText t1 ON u.ID=t1.uid
WHERE 1
AND (b1.qid=1 AND b1.value=true)
AND (n1.qid=16 AND n1.value<999)
AND (t1.qid=23 AND t1.value LIKE '...')
With that in mind, what is the best way to get my results?
One final piece of context:
I mentioned this is for a school project. While this is true, the eventual goal (it is an undergrad senior design project) is to have a department use our site for students creating teams for their senior design. For a rough estimate of size, every semester the department would have somewhere around 200 students use our site to form teams. Obviously, when we're done, the department will (hopefully) check our site for security issues and other stuff they need to worry about (what with FERPA and all). We are trying to take into account all common security practices and scalability concerns, but in the end, our code may be improved by others.
UPDATE
As per nnichols' suggestion, I put in a decent amount of data and ran some tests on different queries. I put around 250 users in the table, and about 2000 answers in each of the 3 tables. I found the links provided very informative
(links removed because I can't hyperlink more than twice yet) Links are in nnichols' response
as well as this one that I found:
http://phpmaster.com/using-explain-to-write-better-mysql-queries/
I tried 3 different types of queries, and in the end, the one I proposed worked the best.
First: using EXISTS
SELECT u.ID,...
FROM User u WHERE 1
AND EXISTS
(SELECT * FROM AnswerNumber
WHERE uid=u.ID AND qid=# AND value>#) -- or any condition on value
AND EXISTS
(SELECT * FROM AnswerNumber
WHERE uid=u.ID AND qid=another # AND some_condition(value))
AND EXISTS
(SELECT * FROM AnswerText
...
I used 10 conditions on each of the 3 answer tables (resulting in 30 EXISTS)
Second: using IN - a very similar approach (maybe even exactly?) which yields the same results
SELECT u.ID,...
FROM User u WHERE 1
AND (u.ID) IN (SELECT uid FROM AnswerNumber WHERE qid=# AND ...)
...
again with 30 subqueries.
The third one I tried was the same as described above (using 30 JOINs)
The results of using EXPLAIN on the first two were as follows: (identical)
The primary query on table u had a type of ALL (bad, though users table is not huge) and rows searched was roughly twice the size of the user table (not sure why). Each other row in the output of EXPLAIN was a dependent query on the relevant answer table, with a type of eq_ref (good) using WHERE and key=PRIMARY KEY and only searching 1 row. Overall not bad.
For the query I suggested (JOINing):
The primary query was actually on whatever table you joined first (in my case AnswerBoolean) with type of ref (better than ALL). The number of rows searched was equal to the number of questions answered by anyone (as in 50 distinct questions have been answered by anyone) (which will be much less than the number of users). For each additional row in EXPLAIN output, it was a SIMPLE query with type eq_ref (good) using WHERE and key=PRIMARY KEY and only searching 1 row. Overall almost the same, but a smaller starting multiplier.
One final advantage to the JOIN method: it was the only one I could figure out how to order by various values (such as n1.value). Since the other two queries were using subqueries, I could not access the value of a specific subquery. Adding the order by clause did change the extra field in the first query to also have 'using temporary' (required, I believe, for order by's) and 'using filesort' (not sure how to avoid that). However, even with those slow-downs, the number of rows is still much less, and the other two (as far as I could get) cannot use order by.
You could answer most of these questions yourself with a suitably large test dataset and the use of EXPLAIN and/or the profiler.
Your INNER JOINs will almost certainly perform better than switching to EXISTS but again this is easy to test with a suitable test dataset and EXPLAIN.

MySQL join count from one table with ids from another

I have two tables that make up a full text index of article content for search purposes. One of the tables is just a primary key associated with a word, whereas the other records the article it occurred in and its location in the document. A single word can conceivably appear many times in the same document with different locations, so the same word id can occur several times in the word_locations table.
Here are the structures:
words:
id bigint
word tinytext
word_location:
id bigint(20)
wordid bigint(20)
location int(11)
article_id int(11)
What I need to write is a query that will find the count of occurrences of each word for any one profile. I need to preserve a zero value for wordids that don't appear at all, so I assume this needs to be a left join. However, whenever I try to add a WHERE clause to limit by article, any wordids that don't appear at all are not included in the result set.
I have tried:
select words.id, COUNT(word_location.wordid) as appears from words left join word_location on words.id = word_location.wordid where article_id = %s GROUP BY words.id
But this query does not return zeros for words that don't appear at all.
How can I modify this left join?
Thanks in advance!
EDIT:
Here is an example data set and the result sets for the different queries.
Example article content:
Bob's Restaurant is one of the finest restaurants in greater
County where you can enjoy the finest Turkish Cuisine.
So the vocabulary table, after being adjusted by the application to exclude stop words, will have rows for Bob, Restaurant, finest, greater, county, enjoy, Turkish, and cuisine. (I'm using this actual article since it's the first in the set, so the ids actually appear starting from integer 1.)
The query provided by @Mark Bannister produces this result set:
wordid - word - occurrences:
128 clifton 0
1 bob's 2
2 restaurant 2
3 one 1
4 finest 3
5 restaurants 2
6 greater 1
9 county 1
12 enjoy 3
13 turkish 6
14 cuisine 1
The result set is correct per se, but id 128 doesn't appear in the document at all and is the only thing in the result set with 0 occurrences. The goal is to have the entire vocabulary returned with the number of occurrences from the document (this is roughly 2500 different words).
My original problematic query from before the edit above actually returned the same result set, but without ANY rows with 0 occurrences at all.
You need to include your article selection in your join condition:
select words.id, COUNT(word_location.wordid) as appears
from words
left join word_location on words.id = word_location.wordid and word_location.article_id = ?
GROUP BY words.id
Including the restriction on article_id in the WHERE clause effectively turns your left join back into an inner join.
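A small runnable illustration of that difference, using Python's built-in sqlite3 in place of MySQL (the left join behaves the same way here). The three words and four locations are made-up sample rows; only the table and column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT);
    CREATE TABLE word_location (
        id INTEGER PRIMARY KEY, wordid INTEGER, location INTEGER, article_id INTEGER
    );
    INSERT INTO words VALUES (1, 'bob'), (2, 'finest'), (3, 'clifton');
    INSERT INTO word_location (wordid, location, article_id) VALUES
        (1, 0, 1), (2, 5, 1), (2, 14, 1), (3, 2, 2);
""")

# Article filter in WHERE: rows with no match in article 1 are discarded after the join.
FILTER_IN_WHERE = """
    SELECT words.id, COUNT(word_location.wordid) AS appears
    FROM words
    LEFT JOIN word_location ON words.id = word_location.wordid
    WHERE word_location.article_id = ?
    GROUP BY words.id
    ORDER BY words.id
"""

# Article filter in ON: unmatched words keep a NULL row, which COUNT(column) counts as 0.
FILTER_IN_ON = """
    SELECT words.id, COUNT(word_location.wordid) AS appears
    FROM words
    LEFT JOIN word_location
           ON words.id = word_location.wordid AND word_location.article_id = ?
    GROUP BY words.id
    ORDER BY words.id
"""

in_where = conn.execute(FILTER_IN_WHERE, (1,)).fetchall()
in_on = conn.execute(FILTER_IN_ON, (1,)).fetchall()
print(in_where)
print(in_on)
```

With the filter in WHERE, word 3 (which only occurs in article 2) disappears; with the filter in ON it stays in the result with a count of 0.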
I would use a subselect instead of a join:
SELECT words.id, (SELECT count(*) FROM word_location WHERE word_location.wordid = words.id AND word_location.article_id = ?) as appears FROM words
Bit of a guess this one, but I think COUNT() is just disregarding your nulls, not COUNTing them and arriving at 0. (NULL + NULL != 0)
Look at the IFNULL() function, you might be able to do something like:
COUNT(IFNULL(word_location.wordid, 0))
(Disclaimer - I am more used to Oracle's NVL(, ) function hence this is a little speculative!)