MySQL is not my specific field of expertise, and since I never needed to handle big tables, my basic knowledge has been enough - until now :)
When I have master/detail MySQL tables, I always do queries like this:
SELECT SUM(detail.qty)
FROM master, detail
WHERE master.id = detail.masterId
AND (interestingField = interestingValue)
But right now I have too many detail rows, so the query is slow. I tried to understand the "INNER" syntax, and I came up with:
SELECT SUM(detail.qty)
FROM master
INNER JOIN detail ON master.id = detail.masterId
WHERE (interestingField = interestingValue)
but the timing is still the same... So:
1) Does it mean that doing an INNER JOIN is THE SAME as checking the master/detail field by hand?
2) Apart from making indexes using the required fields (which I already did), is there a way to make the query faster?
Thank you and sorry if the question seems dumb :(
----- As requested, I am adding the EXPLAIN output to my query
+----+-------------+--------+------+---------------+--------+---------+---------------------------------+------+-------------+
| id | select_type | table  | type | possible_keys | key    | key_len | ref                             | rows | Extra       |
+----+-------------+--------+------+---------------+--------+---------+---------------------------------+------+-------------+
|  1 | SIMPLE      | bill_t | ALL  | PRIMARY,id    | NULL   | NULL    | NULL                            | 4441 | Using where |
|  1 | SIMPLE      | bill_d | ref  | billId,type   | billId | 5       | localtha_bg3667772802.bill_t.id |    2 | Using where |
+----+-------------+--------+------+---------------+--------+---------+---------------------------------+------+-------------+
INNER JOIN won't be any faster than WHERE, since the query optimizer should produce an identical execution plan for both of the queries provided. Different DB engines might produce different execution plans, but such behaviour is typically observed with more complex queries (which is not the case here). I would suggest testing both queries and choosing the one that is faster.
As general advice, I would recommend INNER JOIN, which IMO is much more readable, and which reduces the risk of omitting the join condition from the WHERE clause and ending up with a CROSS JOIN.
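For illustration, here is a contrived sketch of that risk, using the question's tables: with the comma-join form, forgetting the join predicate produces a silent CROSS JOIN instead of an error.
SELECT SUM(detail.qty)
FROM master, detail
-- master.id = detail.masterId forgotten: every master row is now
-- paired with every detail row, and the SUM is silently wrong
WHERE (interestingField = interestingValue)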
For your master/detail query, assuming interestingField is in the detail table, be sure you have a proper composite index:
create index idx1 on detail (interestingField, masterId)
As for INNER JOIN versus WHERE: the implicit join based on the WHERE clause is the old way of writing joins; the more recent style uses an explicit JOIN condition, e.g.:
SELECT SUM(detail.qty)
FROM master
INNER JOIN detail ON master.id = detail.masterId
AND interestingField = interestingValue
Either way, the performance is the same.
Related
I have some tables:
object
person
project
[...] (some more tables)
type
The object table has foreign keys to all other tables.
Now I do a query like:
SELECT * FROM object
LEFT JOIN person ON (object.person_id = person.id)
LEFT JOIN project ON (object.project_id = project.id)
LEFT JOIN [...] (all other joins)
LEFT JOIN type ON (object.type_id = type.id)
WHERE object.customer_id = XXX
ORDER BY object.type_id ASC
LIMIT 25
This works perfectly well and fast, even for big result sets. For example, I have 90000 objects and the query takes about 3 seconds. The result is quite big because the tables have a lot of columns and all of them are fetched. For info: I'm using Symfony with Propel, InnoDB and the "doSelectJoinAll" function.
But if I do a query like this (sorting by type.name):
SELECT * FROM object
LEFT JOIN person ON (object.person_id = person.id)
LEFT JOIN project ON (object.project_id = project.id)
LEFT JOIN [...] (all other joins)
LEFT JOIN type ON (object.type_id = type.id)
WHERE object.customer_id = XXX
ORDER BY type.name ASC
LIMIT 25
The query takes about 200 seconds!
EXPLAIN:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | object | ref | object_FI_2 | object_FI_2 | 4 | const | 164966 | Using where; Using temporary; Using filesort
1 | SIMPLE | person | eq_ref | PRIMARY | PRIMARY | 4 | db.object.person_id | 1
1 | SIMPLE | ... | eq_ref | PRIMARY | PRIMARY | 4 | db.object...._id | 1
1 | SIMPLE | type | eq_ref | PRIMARY | PRIMARY | 4 | db.object.type_id | 1
I saw in the processlist that MySQL creates a temporary table for this kind of sort on a joined table.
Adding an index to type.name didn't improve the performance. There are only about 800 type rows.
I found out that the many joins and the big result are the problem, because if I do a query with just one join, like:
SELECT * FROM object
LEFT JOIN type ON (object.type_id = type.id)
WHERE object.customer_id = XXX
ORDER BY type.name ASC
LIMIT 25
it works as fast as expected.
Is there a way to improve such sorting queries on a big resultset with many joined tables? Or is it just a bad habit to sort on a joined table column and this shouldn't be done anyway?
Thank you
LEFT gets in the way of rearranging the order of the tables. How fast is it without any LEFT? Do you get the same answer?
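If you want to try that experiment, here is a sketch of the same query with the LEFTs dropped (the joins elided in the question are likewise omitted here):
SELECT * FROM object
JOIN person ON (object.person_id = person.id)
JOIN project ON (object.project_id = project.id)
JOIN type ON (object.type_id = type.id)
WHERE object.customer_id = XXX
ORDER BY type.name ASC
LIMIT 25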
LEFT may be a red herring... Here's what the optimizer is likely to be doing:
1. Decide what order to do the tables in. Take into consideration any WHERE filtering and any LEFTs. Because of WHERE object.customer_id = XXX, object is likely to be the best table to start with.
2. Get the rows from object that satisfy the WHERE.
3. Get the columns needed from the other tables (do the JOINs).
4. Sort according to the ORDER BY ** see below
5. Deliver the first 25 rows.
** Let's dig deeper into these two:
WHERE object.customer_id = XXX ORDER BY object.id
WHERE object.customer_id = XXX ORDER BY virtually-anything-else
You have INDEX(customer_id), correct? And the table is InnoDB, correct? Well, each secondary index implicitly includes the PRIMARY KEY, as if you had said INDEX(customer_id, id). The optimal index for the first WHERE + ORDER BY is precisely that. It will locate XXX and scan 25 rows, then stop. You might say that steps 2,4,5 are blended together.
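A minimal sketch of that, using the question's object table (the index name is made up):
CREATE INDEX idx_customer ON object (customer_id);
-- In InnoDB this behaves like INDEX(customer_id, id), so the first case can
-- read 25 index entries in order and stop:
SELECT * FROM object WHERE object.customer_id = XXX ORDER BY object.id LIMIT 25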
The second WHERE just gathers all the stuff through step 4. This could be thousands of rows. Hence it is likely to be a lot slower.
See also article on building optimal indexes.
TL;DR:
I have a query on 2 huge tables. There are no indexes. It is slow. Therefore, I build indexes. It is slower. Why does this make sense? What is the correct way to optimize it?
The background:
I have 2 tables
person, a table containing information about people (id, birthdate)
works_in, a 0-N relation between person and a department; works_in contains id, person_id, department_id.
They are InnoDB tables, and it is sadly not an option to switch to MyISAM as data integrity is a requirement.
Those 2 tables are huge, and don't contain any indexes except a PRIMARY on their respective id.
I'm trying to get the age of the youngest person in each department, and here is the query I've come up with:
SELECT MAX(YEAR(person.birthdate)) as max_year, works_in.department as department
FROM person
INNER JOIN works_in
ON works_in.person_id = person.id
WHERE person.birthdate IS NOT NULL
GROUP BY works_in.department
The query works, but I'm dissatisfied with its performance, as it takes ~17s to run. This is expected, as the data is huge and needs to be written to disk, and there are no indexes on the tables.
EXPLAIN for this query gives
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
|----|-------------|---------|--------|---------------|---------|---------|--------------------------|----------|---------------------------------|
| 1 | SIMPLE | works_in| ALL | NULL | NULL | NULL | NULL | 22496409 | Using temporary; Using filesort |
| 1 | SIMPLE | person | eq_ref | PRIMARY | PRIMARY | 4 | dbtest.works_in.person_id| 1 | Using where |
I built a bunch of indexes for the 2 tables,
/* For works_in */
CREATE INDEX person_id ON works_in(person_id);
CREATE INDEX department_id ON works_in(department_id);
CREATE INDEX department_id_person ON works_in(department_id, person_id);
CREATE INDEX person_department_id ON works_in(person_id, department_id);
/* For person */
CREATE INDEX birthdate ON person(birthdate);
EXPLAIN shows an improvement, or at least that's how I understand it, seeing that it now uses an index and scans fewer rows.
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
|----|-------------|---------|-------|--------------------------------------------------|----------------------|---------|------------------|--------|-------------------------------------------------------|
| 1 | SIMPLE | person | range | PRIMARY,birthdate | birthdate | 4 | NULL | 267818 | Using where; Using index; Using temporary; Using f... |
| 1 | SIMPLE | works_in| ref | person,department_id_person,person_department_id | person_department_id | 4 | dbtest.person.id | 3 | Using index |
However, the execution time of the query has doubled (from ~17s to ~35s).
Why does this make sense, and what is the correct way to optimize this?
EDIT
Using Gordon Linoff's answer (the first suggestion), the execution time is ~9s (half of the initial time). Choosing good indexes does indeed seem to help, but the execution time is still pretty high. Any other idea on how to improve on this?
More information concerning the dataset:
There are about 5'000'000 records in the person table.
Of which only 130'000 have a valid (not NULL) birthdate
I indeed have a department table, which contains about 3'000'000 records (they are actually projects, not departments)
For this query:
SELECT MAX(YEAR(p.birthdate)) as max_year, wi.department as department
FROM person p INNER JOIN
works_in wi
ON wi.person_id = p.id
WHERE p.birthdate IS NOT NULL
GROUP BY wi.department;
The best indexes are: person(birthdate, id) and works_in(person_id, department). These are covering indexes for the query and save the extra cost of reading data pages.
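Spelled out as DDL, a sketch of those two indexes (the names are made up):
CREATE INDEX idx_person_birthdate_id ON person (birthdate, id);
-- the question's schema sketch calls this column department_id; adjust as needed
CREATE INDEX idx_works_in_person_dept ON works_in (person_id, department);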
By the way, unless a lot of persons have NULL birthdates (i.e. there are departments where everyone has a NULL birthdate), the query is basically equivalent to:
SELECT MAX(YEAR(p.birthdate)) as max_year, wi.department as department
FROM person p INNER JOIN
works_in wi
ON wi.person_id = p.id
GROUP BY wi.department;
For this, the best indexes are person(id, birthdate) and works_in(person_id, department).
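Again as a sketch, only the person index changes here, with the primary key leading; works_in (person_id, department) stays the same as above:
CREATE INDEX idx_person_id_birthdate ON person (id, birthdate);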
EDIT:
I cannot think of an easy way to solve the problem. One solution is more powerful hardware.
If you really need this information quickly, then additional work is needed.
One approach is to add a maximum birth date to the departments table, and add triggers. For works_in, you need triggers for update, insert, and delete. For persons, only update (presumably the insert and delete would be handled by works_in). This saves the final group by, which should be a big savings.
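A minimal sketch of the insert trigger for that approach, assuming the department table carries a nullable max_birthdate column (the column and trigger names are hypothetical; the update and delete triggers would follow the same pattern):
CREATE TRIGGER works_in_after_insert
AFTER INSERT ON works_in
FOR EACH ROW
  -- keep the department's cached maximum birth date current for the new person
  UPDATE department d
  JOIN person p ON p.id = NEW.person_id
  SET d.max_birthdate = GREATEST(COALESCE(d.max_birthdate, p.birthdate),
                                 p.birthdate)
  WHERE d.id = NEW.department_id
    AND p.birthdate IS NOT NULL;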
A simpler approach is to add a maximum birth date just to works_in. However, you will still need a final aggregation, and that might be expensive.
Indexing behaves differently on MyISAM and InnoDB tables: InnoDB secondary indexes carry extra lookup overhead, because every secondary-index entry stores the primary key and row lookups go through the clustered index.
Add indexes on the columns that you expect to query the most. The more complex the data relationships grow, especially relationships of a table with itself (such as self-joins), the worse each query's performance gets.
With an index, the engine has to use the index to get matching values, which is fast. Then it has to use the matches to look up the actual rows in the table. If the index doesn't narrow down the number of rows, it can be faster to just look up all the rows in the table.
When to add an index on a SQL table field (MySQL)?
When to use MyISAM and InnoDB?
https://dba.stackexchange.com/questions/1/what-are-the-main-differences-between-innodb-and-myisam
I was trying to optimize a NOT IN clause in MySQL. Somehow I ended up with the following query:
SELECT @i:=(SELECT correct_option_word_id FROM sent_question WHERE msisdn='abc');
SELECT * FROM word WHERE @i IS NULL OR word_id NOT IN (@i);
There is no relationship between sent_question table and word table. And also I cannot place index on correct_option_word_id.
Can somebody please explain, will this method even optimize the query or not?
UPDATE: As mentioned here, the two methods, NOT IN and LEFT JOIN/IS NULL, are almost equally efficient. That's why I don't want to use the LEFT JOIN/IS NULL method.
UPDATE 2:
Explain results for original query:
EXPLAIN SELECT * FROM word WHERE word_id NOT IN (SELECT correct_option_word_id FROM sent_question WHERE msisdn='abc');
+----+--------------------+---------------+------+-------------------------+-------------------------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+---------------+------+-------------------------+-------------------------+---------+-------+------+-------------+
| 1 | PRIMARY | word | ALL | NULL | NULL | NULL | NULL | 10 | Using where |
| 2 | DEPENDENT SUBQUERY | sent_question | ref | fk_question_subscriber1 | fk_question_subscriber1 | 48 | const | 1 | Using where |
+----+--------------------+---------------+------+-------------------------+-------------------------+---------+-------+------+-------------+
You're right in that the NOT IN and LEFT JOIN/IS NULL methods are equally efficient; unfortunately, however, there is no faster option, only slower ones (NOT EXISTS).
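For reference, a sketch of that NOT EXISTS variant (it returns the same rows, but typically gets a slower plan on older MySQL versions):
SELECT w.*
FROM word w
WHERE NOT EXISTS (
    SELECT 1
    FROM sent_question sq
    WHERE sq.msisdn = 'abc'
      AND sq.correct_option_word_id = w.word_id
)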
Here's your query, simplified:
SELECT *
FROM word
WHERE
word_id NOT IN (SELECT correct_option_word_id FROM sent_question WHERE msisdn='abc')
As you know, MySQL will run the subquery first and use the returned result set for the NOT IN clause. Then it will scan through all of the rows in word, checking for each row whether word_id is in the list.
Unfortunately for this case, indexes are inclusive, not exclusive. They don't help with NOT queries. A covering index on word could potentially still be used to avoid accessing the actual table, and provide some IO benefits, but it won't be used in the traditional "lookup" sense. However, since you are returning all columns on the word table, it may not be viable to have such a large index.
The most important index that will be used here is an index on sent_question.msisdn for the subquery. Ensure that you have that index defined. A multi-column "covering" index on (msisdn, correct_option_word_id) would be best.
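As DDL, that suggestion might look like the following (the index name is made up):
CREATE INDEX idx_sq_msisdn_word ON sent_question (msisdn, correct_option_word_id);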
If you share your design, we can probably offer some design solutions for optimization.
I doubt it'll work at all.
Try
SELECT *
FROM word AS w
LEFT JOIN sent_question AS sq
ON w.word_id = sq.correct_option_word_id AND sq.msisdn='abc'
WHERE sq.correct_option_word_id IS NULL
Give this simple query a try
SELECT
sent_question.*,
word.word_id AS foundWord
FROM sent_question
LEFT JOIN word
ON word.word_id = sent_question.correct_option_word_id
WHERE sent_question.msisdn='abc'
-- GROUP BY sent_question.correct_option_word_id -- shouldn't be needed, but included for completeness
HAVING foundWord IS NULL
This is the first time I'm approaching an extremely high-volume situation. This is an ad server based on MySQL. However, the query that is used incorporates a lot of JOINs and is generally just slow. (This is Rails ActiveRecord, btw)
sel = Ads.find(:all, :select => '*', :joins => "JOIN campaigns ON ads.campaign_id = campaigns.id JOIN users ON campaigns.user_id = users.id LEFT JOIN countries ON countries.campaign_id = campaigns.id LEFT JOIN keywords ON keywords.campaign_id = campaigns.id", :conditions => [flashstr + "keywords.word = ? AND ads.format = ? AND campaigns.cenabled = 1 AND (countries.country IS NULL OR countries.country = ?) AND ads.enabled = 1 AND campaigns.dailyenabled = 1 AND users.uenabled = 1", kw, format, viewer['country'][0]], :order => order, :limit => limit)
My questions:
Is there an alternative database like MySQL that has JOIN support but is much faster? (I know there's Postgres; I'm still evaluating it.)
Otherwise, would firing up a MySQL instance, loading a local database into memory and re-loading that every 5 minutes help?
Otherwise, is there any way I could switch this entire operation to Redis or Cassandra, and somehow change the JOIN behavior to match the (non-JOIN-able) nature of NoSQL?
Thank you!
EDIT: here are more details:
Full executed SQL with flattened select (truncated above):
SELECT campaigns.id, campaigns.guid, campaigns.user_id, campaigns.dailylimit, campaigns.impressions, campaigns.cenabled, campaigns.dayspent, campaigns.dailyenabled, campaigns.fr, ads.id, ads.guid, ads.user_id, ads.campaign_id, ads.format, ads.enabled, ads.datafile, ads.data1, ads.data2, ads.originalfilename, ads.aid, ads.impressions, countries.id, countries.guid, countries.campaign_id, countries.country, keywords.id, keywords.campaign_id, keywords.word, keywords.bid FROM ads JOIN campaigns ON ads.campaign_id = campaigns.id JOIN users ON campaigns.user_id = users.id LEFT JOIN countries ON countries.campaign_id = campaigns.id LEFT JOIN keywords ON keywords.campaign_id = campaigns.id WHERE (keywords.word = 'design' AND ads.format = 10 AND campaigns.cenabled = 1 AND (countries.country IS NULL OR countries.country = 82) AND ads.enabled = 1 AND campaigns.dailyenabled = 1 AND users.uenabled = 1 AND ads.datafile != '') ORDER BY keywords.bid DESC LIMIT 1,1
EXPLAIN/execution plan:
+----+-------------+-----------+--------+------------------+-------------+---------+------------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+--------+------------------+-------------+---------+------------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | keywords | ref | campaign_id,word | word | 257 | const | 9 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | ads | ref | campaign_id | campaign_id | 4 | e_development.keywords.campaign_id | 8 | Using where |
| 1 | SIMPLE | campaigns | eq_ref | PRIMARY | PRIMARY | 4 | e_development.keywords.campaign_id | 1 | Using where |
| 1 | SIMPLE | users | eq_ref | PRIMARY | PRIMARY | 4 | e_development.campaigns.user_id | 1 | Using where |
| 1 | SIMPLE | countries | ALL | campaign_id | NULL | NULL | NULL | 4 | Using where |
+----+-------------+-----------+--------+------------------+-------------+---------+------------------------------------+------+----------------------------------------------+
(this is on a development database, which doesn't have nearly as many rows as the production version.)
DEFINED INDICES:
ads -> id (primary, autoinc) + aid (unique) + campaign_id (index) + user_id (index)
campaigns -> id (primary, autoinc) + user_id (index)
countries -> id (primary, autoinc) + campaign_id (index) + country (index) + user_id (index)
keywords -> id (primary, autoinc) + campaign_id (index) + word (index) + user_id (index)
user -> id (primary, autoinc)
Database theory and nominal practice exist to provide a framework for the majority of cases. Not every database usage pattern fits neatly into third normal form, hence the emergence of NoSQL. These databases don't work well in the majority of cases, but they do work great in specific ones. One reason they work well is that they DON'T work like a normal RDBMS. Cassandra does have some facility for 'joining', but I don't remember the exact details. If you want a quick understanding, I'd recommend the Digg developers' blog; there's a nice, simple description.
The problem is that I'll bet you a pickle that joining 4 tables would be slower than MySQL. And the only way to know for sure would be learning a new DBMS, installing it, tuning the install as well as you can tune MySQL, setting up all your data and ... you'll likely find out MySQL does pretty damn well.
Trying to solve the EXACT SAME problem the EXACT SAME way with a different engine won't cut it... you have to THINK like a NoSQL developer, not a RDBMS developer using NoSQL.
But you can think about the problem as Frustrated suggests.
Why do we have Third Normal Form? Ease of update, mainly: I update one row instead of dozens. It also helps constrain data: if I carefully control the addition of countries in the country table, I'll never get a bad one in the campaign table. Beyond that, 3NF doesn't make querying faster, which is why we invented reporting databases, OLAP, cubes, and star schemas.
The key is that reporting needs a different structure than editing/capturing.
As Frustrated said, determine the speed of change in your underlying data. If you're really adding countries every 5 minutes, I'll be stunned. Campaigns? Probably occasional. Ads? A couple of times a day. How long would it take to build a fully flattened table and index it? How many rows does that produce? If that rebuild time is much shorter than your update interval... build it and see. Test the query speed. That's a cheaper experiment than going for a whole new DB.
Have you analysed your execution plan? Have you analysed your indices?
My first guess would be that you need an index on campaigns for user_id, index on countries for campaign_id, on keywords on campaign_id... maybe others. You need to get an execution plan to see what your query is doing.
The other option: How often does the data in this result set change? By the minute? Hour? Day? If it's daily or hourly (well, several hours), it might be better to have a secondary table that contains ALL the columns (or just the columns that aren't likely to change frequently) of this result set and is populated by this query every n hours. Then your app would just query the secondary table (or maybe join with one table that has frequently changing data), it could be faster that way.
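A sketch of that idea, with a hypothetical flat_ads table that a cron job rebuilds every n hours and swaps in atomically, so readers never see a half-built table (the columns are a subset of those in the question's query):
CREATE TABLE flat_ads_new AS
SELECT ads.id, ads.format, ads.datafile, campaigns.user_id,
       keywords.word, keywords.bid, countries.country
FROM ads
JOIN campaigns ON ads.campaign_id = campaigns.id
LEFT JOIN keywords ON keywords.campaign_id = campaigns.id
LEFT JOIN countries ON countries.campaign_id = campaigns.id
WHERE ads.enabled = 1 AND campaigns.cenabled = 1;

-- index to serve the hot query's filter and sort
CREATE INDEX idx_flat_word_bid ON flat_ads_new (word, bid);

RENAME TABLE flat_ads TO flat_ads_old, flat_ads_new TO flat_ads;
DROP TABLE flat_ads_old;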
I have a SQL query that takes a very long time to run on MySQL (it takes several minutes). The query is run against a table that has over 100 million rows, so I'm not surprised it's slow. In theory, though, it should be possible to speed it up as I really only want to get back the rows from the large table (let's call it A) that have a reference in another table, B.
So my query is:
SELECT id FROM A, B where A.ref = B.ref;
(A has over 100 million rows; B has just a few thousand).
I've added INDEXes:
alter table A add index(ref);
alter table B add index(ref);
But it's still very slow (several minutes -- I'd be happy with one minute).
Unfortunately, I'm stuck with MySQL 4.1.22, so I can't use views.
I'd rather not copy all of the relevant rows from A into a separate, smaller table, as the rows that I need will change from time to time. On the other hand, at the moment that's the only solution I can think of.
Any suggestions welcome!
EDIT: Here's the output of running EXPLAIN on my query:
+----+-------------+------------------------+------+------------------------------------------+-------------------------+---------+------------------------------------------------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------------+------+------------------------------------------+-------------------------+---------+------------------------------------------------+-------+-------------+
| 1 | SIMPLE | B | ALL | B_ref,ref | NULL | NULL | NULL | 16718 | Using where |
| 1 | SIMPLE | A | ref | A_REF,ref | A_ref | 4 | DATABASE.B.ref | 5655 | |
+----+-------------+------------------------+------+------------------------------------------+-------------------------+---------+------------------------------------------------+-------+-------------+
(In redacting my original query example, I chose to use "ref" as my column name, which happens to be the same as one of the types, but hopefully that's not too confusing...)
The query optimizer is probably already doing the best that it can, but in the unlikely event that it's reading the giant table (A) first, you can explicitly tell it to read B first using the STRAIGHT_JOIN syntax:
SELECT STRAIGHT_JOIN id FROM B, A where B.ref = A.ref;
From the answers, it seems like you're doing the most efficient thing you can with the SQL. The A table seems to be the big problem, how about splitting it into three individual tables, kind of like a local version of sharding? Alternatively, is it worth denormalising the B table into the A table, assuming B doesn't have too many columns?
Finally, you could just have to buy a faster box to run it on - there's no substitute for horsepower!
Good luck.
SELECT id FROM A JOIN B ON A.ref = B.ref
You may be able to optimize further by using an appropriate type of join e.g. LEFT JOIN
http://en.wikipedia.org/wiki/Join_(SQL)