Recently, I dealt with retrieving a large amount of data which consists of thousands of records from a MySQL database. Since it was my first time to handle such large data set, I didn't think about the efficiency of the SQL statement. And the problem comes.
Here are the tables of the database
(It is just a simple database model of a curriculum system):
course:
+-----------+---------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------+---------------------+------+-----+---------+----------------+
| course_id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| name | varchar(20) | NO | | NULL | |
| lecturer | varchar(20) | NO | | NULL | |
| credit | float | NO | | NULL | |
| week_from | tinyint(3) unsigned | NO | | NULL | |
| week_to | tinyint(3) unsigned | NO | | NULL | |
+-----------+---------------------+------+-----+---------+----------------+
select:
+-----------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------+------------------+------+-----+---------+----------------+
| select_id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| card_no | int(10) unsigned | NO | | NULL | |
| course_id | int(10) unsigned | NO | | NULL | |
| term | varchar(7) | NO | | NULL | |
+-----------+------------------+------+-----+---------+----------------+
When I want to retrieve all the courses that a student has selected (with his card number),
the SQL statement is
SELECT course_id, name, lecturer, credit, week_from, week_to
FROM `course` WHERE course_id IN (
SELECT course_id FROM `select` WHERE card_no=<student's card number>
);
But, it was extremely slow and it didn't return anything for a long time.
So I changed WHERE IN clauses into NATURAL JOIN. Here is the SQL,
SELECT course_id, name, lecturer, credit, week_from, week_to
FROM `select` NATURAL JOIN `course`
WHERE card_no=<student's card number>;
It returns immediately and works fine!
So my question is:
What's the difference between NATURAL JOIN and WHERE IN Clauses?
What makes them perform differently?
(Is that maybe because I doesn't set up any INDEX?)
When shall we use NATURAL JOIN or WHERE IN?
Theoretically the two queries are equivalent. I think it's just poor implementation of the MySQL query optimizer that causes JOIN to be more efficient than WHERE IN. So I always use JOIN.
Have you looked at the output of EXPLAIN for the two queries? Here's what I got for a WHERE IN:
+----+--------------------+-------------------+----------------+-------------------+---------+---------+------------+---------+--------------------------+
| 1 | PRIMARY | t_users | ALL | NULL | NULL | NULL | NULL | 2458304 | Using where |
| 2 | DEPENDENT SUBQUERY | t_user_attributes | index_subquery | PRIMARY,attribute | PRIMARY | 13 | func,const | 7 | Using index; Using where |
+----+--------------------+-------------------+----------------+-------------------+---------+---------+------------+---------+--------------------------+
It's apparently performing the subquery, then going through every row in the main table testing whether it's in -- it doesn't use the index. For the JOIN I get:
+----+-------------+-------------------+--------+---------------------+-----------+---------+---------------------------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------------+--------+---------------------+-----------+---------+---------------------------------------+------+-------------+
| 1 | SIMPLE | t_user_attributes | ref | PRIMARY,attribute | attribute | 1 | const | 15 | Using where |
| 1 | SIMPLE | t_users | eq_ref | username,username_2 | username | 12 | bbodb_test.t_user_attributes.username | 1 | |
+----+-------------+-------------------+--------+---------------------+-----------+---------+---------------------------------------+------+-------------+
Now it uses the index.
Try this:
SELECT course_id, name, lecturer, credit, week_from, week_to
FROM `course` c
WHERE c.course_id IN (
SELECT s.course_id
FROM `select` s
WHERE card_no=<student's card number>
AND c.course_id = s.course_id
);
Notice the addition of the AND clause in the sub-query. This is called a co-related sub-query because it relates the two course_ids, just as the NATURAL JOIN does.
I think Barmar's index explanation is on the mark.
Related
I have a nested query that deletes a row in table terms only if exactly one row in definitions.term_id is found. It works but it takes like 9 seconds on my system. Im looking to optimize the query.
DELETE FROM terms
WHERE id
IN(
SELECT term_id
FROM definitions
WHERE term_id = 1234
GROUP BY term_id
HAVING COUNT(term_id) = 1
)
The database is only about 4000 rows. If I separate the query into 2 independent queries, it takes about 0.1 each
terms
+-------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------+------------------+------+-----+---------+----------------+
| id | int(11) unsigned | NO | PRI | NULL | auto_increment |
| term | varchar(50) | YES | | NULL | |
+-------+------------------+------+-----+---------+----------------+
definitions
+----------------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------------+------------------+------+-----+---------+----------------+
| id | int(11) unsigned | NO | PRI | NULL | auto_increment |
| term_id | int(11) | YES | | NULL | |
| definition | varchar(500) | YES | | NULL | |
| example | varchar(500) | YES | | NULL | |
| submitter_name | varchar(50) | YES | | NULL | |
| approved | int(1) | YES | MUL | 0 | |
| created_at | timestamp | YES | | NULL | |
| updated_at | timestamp | YES | | NULL | |
| votos | int(3) | NO | | NULL | |
+----------------+------------------+------+-----+---------+----------------+
To speed up the process, please consider creating an index on the relevant field:
CREATE INDEX term_id ON terms (term_id)
How about using correlated sub query using exists and try,
DELETE FROM terms t
WHERE id = 1234
AND EXISTS (SELECT 1
FROM definitions d
WHERE d.term_id = t.term_id
GROUP BY term_id
HAVING COUNT(term_id) = 1)
It's often quicker to create a new table retaining only the rows you wish to keep. That said, I'd probably write this as follows, and provide indexes as appropriate.
DELETE
FROM terms t
JOIN
( SELECT term_id
FROM definitions
WHERE term_id = 1234
GROUP
BY term_id
HAVING COUNT(*) = 1
) x
ON x.term_id = t.id
Hehe; this may be a kludgy way to do it:
DELETE ... WHERE id = ( SELECT ... )
but without any LIMIT or other constraints.
I'm depending on getting an error something like "subquery returned more than one row" in order to prevent the DELETE being performed if multiple rows match.
I have three tables that look like this:
People:
+------------+-------------+------+-----+-------------------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+-------------+------+-----+-------------------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| fname | varchar(32) | NO | | NULL | |
| lname | varchar(32) | NO | | NULL | |
| dob | date | NO | | 0000-00-00 | |
| license_no | varchar(24) | NO | | NULL | |
| date_added | timestamp | NO | | CURRENT_TIMESTAMP | |
| status | varchar(8) | NO | | Allow | |
+------------+-------------+------+-----+-------------------+----------------+
Units:
+----------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------+-------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| number | varchar(3) | NO | | NULL | |
| resident | int(11) | NO | MUL | NULL | |
| type | varchar(16) | NO | | NULL | |
+----------+-------------+------+-----+---------+----------------+
Visits:
+----------+-----------+------+-----+---------------------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------+-----------+------+-----+---------------------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| vis_id | int(11) | NO | MUL | NULL | |
| unit | int(11) | NO | MUL | NULL | |
| time_in | timestamp | NO | | CURRENT_TIMESTAMP | |
| time_out | timestamp | NO | | 0000-00-00 00:00:00 | |
+----------+-----------+------+-----+---------------------+----------------+
There are multiple foreign keys linking these tables:
units.resident -> people.id
visits.unit -> units.id
visits.vis_id -> people.id
I am able to run this query to find all residents ie - everyone from people that are referenced by the units.resident foreign key:
SELECT concat(p.lname, ', ', p.fname) as 'Resident', p.dob as 'Birthday',
u.number as 'Unit #'
from people p, units u
where p.id = u.resident
order by u.number
It returns the results I want... However, it'd be useful to do the opposite of this to find all the people who are not residents ie- everyone from people who aren't referenced by the units.resident foreign key.
I've tried many different queries, most notably some inner and left joins, but I'm getting waaaaay too many duplicate entries (from what I've read here, this is normal). The only thing I've found that works is using a group by license_no, because as of now the "residents" don't have this information, like this:
SELECT p.id, concat(p.lname, ', ', p.fname) as 'Visitor',
p.license_no as 'License', u.number from people p
left join units u on u.number <> p.id
group by p.license_no order by p.id;
This works for all but one resident, who's u.number is displayed on ALL results. The residents will soon have license_no entries, and I can't have that one odd entry in the returned results all the time, so this query won't work as a long-term solution.
How can I structure a query without a group by that will return the results I want?
This should work
SELECT
p.id
, P.fname
, P.lname
FROM
people AS p
LEFT JOIN
units AS u
ON
p.id = u.resident
WHERE
u.resident IS NULL
Extra hint.
Table people should be called person.
By u.resident you mean a person. so it should be a person_id there in the unit table...
Better logic helps to write SQL better, if your name convention is clear to use.
Use a NOT EXISTS clause to exclude those people who are residents.
SELECT P.id
,P.fname
,P.lname
,etc...
FROM People P
WHERE NOT EXISTS (SELECT 1 FROM Units U WHERE U.resident = P.id)
I've got 3 tables: model, model_views, and model_views2. In an effort to have one column per row to hold aggregated views, I've done a migration to make the model look something like this, with a new column for the views:
+---------------+---------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------------+---------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| user_id | int(11) | NO | | NULL | |
| [...] | | | | | |
| views | int(20) | YES | | 0 | |
+---------------+---------------+------+-----+---------+----------------+
This is what the columns for model_views and model_views2 look like:
+------------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+------------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| user_id | smallint(5) | NO | MUL | NULL | |
| model_id | smallint(5) | NO | MUL | NULL | |
| time | int(10) unsigned | NO | | NULL | |
| ip_address | varchar(16) | NO | MUL | NULL | |
+------------+------------------+------+-----+---------+----------------+
model_views and model_views2 are gargantuan, both totalling in the tens of millions of rows each. Each row is representative of one view, and this is a terrible mess for performance. So far, I've got this MySQL command to fetch a count of all the rows representing single views in both of these tables, sorted by model_id added up:
SELECT model_id, SUM(c) FROM (
SELECT model_views.model_id, COUNT(*) AS c FROM model_views
GROUP BY model_views.model_id
UNION ALL
SELECT model_views2.model_id, COUNT(*) AS c FROM model_views2
GROUP BY model_views2.model_id)
AS foo GROUP BY model_id
So that I get a nice big table with the following:
+----------+--------+
| model_id | SUM(c) |
+----------+--------+
| 1 | 1451 |
| [...] | |
+----------+--------+
What would be the safest route for pulling off commands from here on in to merge the values of SUM(c) into the column model.views, matched by the model.id to model_ids that I get out of the above SQL query? I want to only fill the rows for models that still exist - There is probably model_views referring to rows in the model table which have been deleted.
You can just use UPDATE with a JOIN on your subquery:
UPDATE model
JOIN (
SELECT model_views.model_id, COUNT(*) AS c
FROM model_views
GROUP BY model_views.model_id
UNION ALL
SELECT model_views2.model_id, COUNT(*) AS c
FROM model_views2
GROUP BY model_views2.model_id) toupdate ON model.id = toupdate.model_id
SET model.views = toupdate.c
I've got (what I thought) was a very simple query on a MySQL database, but using explain shows that the query is using a temporary table. I've tried changing the order of the select and the join but to no avail. I've reduced the tables to their simplest (to see if it was an issue with the complexity of my tables, but I still have the problem). I've tried it with two basic tables, one with a "name" field, and the other with a foreign key reference back to that table:
a:
+-------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| name | varchar(128) | NO | MUL | NULL | |
+-------+--------------+------+-----+---------+----------------+
b:
+-------+---------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------+---------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| a_id | int(11) | NO | MUL | NULL | |
+-------+---------+------+-----+---------+----------------+
And this is my query:
SELECT a.id, a.name FROM a JOIN b ON a.id = b.a_id ORDER BY a.name;
Which I thought was very simple... just a list of all records in a that have records in b, ordered by name. Alas explain says this:
+----+-------------+-------+--------+---------------+---------+---------+--------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+--------+------+----------------------------------------------+
| 1 | SIMPLE | b | index | a_id | a_id | 4 | NULL | 2 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | a | eq_ref | PRIMARY | PRIMARY | 4 | b.a_id | 1 | |
+----+-------------+-------+--------+---------------+---------+---------+--------+------+----------------------------------------------+
It looks like it should be using the key on the b table but for some reason it isn't. I get the feeling I'm missing something basic (or my RDBMS knowledge needs brushing up somewhat). Anyone any ideas why it's using a temporary table for such a simple query?
I have four tables like this:
mysql> describe courses;
+-----------------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------------+-------------+------+-----+---------+----------------+
| course_id | int(11) | NO | PRI | NULL | auto_increment |
| course_name | varchar(75) | YES | | NULL | |
| course_price_id | int(11) | YES | MUL | NULL | |
+-----------------+-------------+------+-----+---------+----------------+
mysql> describe pricegroups;
+-------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+----------------+
| price_id | int(11) | NO | PRI | NULL | auto_increment |
| price_name | varchar(255) | YES | | NULL | |
| price_value | int(11) | YES | | NULL | |
+-------------+--------------+------+-----+---------+----------------+
mysql> describe courseplans;
+------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+--------------+------+-----+---------+----------------+
| plan_id | int(11) | NO | PRI | NULL | auto_increment |
| plan_name | varchar(255) | YES | | NULL | |
| plan_time | int(11) | YES | | NULL | |
+------------+--------------+------+-----+---------+----------------+
mysql> describe course_to_plan;
+-----------+---------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------+---------+------+-----+---------+-------+
| course_id | int(11) | NO | PRI | NULL | |
| plan_id | int(11) | NO | PRI | NULL | |
+-----------+---------+------+-----+---------+-------+
Let me try to explain what I have and what I would like to do...
All my courses (course_id) has different steps (plan_id) wich has a value of 1 or more days (plan_time). A course has one or more steps (course_to_plan)A course is connected to a pricegroup (price_id).
I would like to query my MySQL database and get an output off:
The course_name, the plan_id's it has, and based on the value of price_id together with the value in the plan_time get a result who looks something like this:
+------------+--------------+------------+---------+
| course_name| pricegroup | plan_time | RESULT |
+------------+--------------+------------+---------+
| Math | Expensive | 7 | 3500 |
+------------+--------------+------------+---------+
I hope you understand me...
Is it even possible with the structure I have or should I "rebuild-and-redo-correct" something?
SELECT c.course_name, p.price_name, SUM(cp.plan_time), SUM(cp.plan_time * p.price_value)
FROM courses c
INNER JOIN pricegroups p ON p.price_id = c.course_price_id
INNER JOIN course_to_plan cpl ON cpl.course_id = c.course_id
INNER JOIN courseplans cp ON cp.plan_id = cpl.plan_id
GROUP BY c.course_name, p.price_name
Please note that it seems to me that your implementation might be erroneous. The way you want the data makes me think that you could be happier with a plan having a price, so you don't apply the same price for a plan which is "expensive" AND another plan which is "cheap", which is what you are doing at the moment. But I don't really know, this is intuitive :-)
Thanks for accepting the answer, regards.
Let me see if I understand what you need:
SELECT c.course_name, pg.price_name,
COUNT(cp.plan_time), SUM(pg.price_value * cp.plan_time) AS result
FROM courses c
INNER JOIN pricegroups pg ON c.course_price_id = pg.price_id
INNER JOIN course_to_plan ctp ON c.course_id = ctp.course_id
INNER JOIN courseplans cp ON ctp.plan_id = cp.plan_id
GROUP BY c.couse_name, pg.price_name