I have 15M rows in 3 tables (one table is the original CSV import, the other two are normalized versions of that CSV + some other data).
I need to update just one field from the original CSV table. The update query joining these tables has now been running for 30 hours on my quad-core, 8 GB, SSD box.
Is this normal? Is there a better way to perform this simple update?
Tables: ti (the CSV dump, denormalized, ~13M rows)
i (the primary, normalized table, ~17M rows)
icm (a map of ti.raw_id to i.item_id, ~17M rows)
mysql> explain select * from item AS i, item_catalog_map AS icm, temp_input AS ti WHERE i.id=icm.item_id AND icm.catalog_unique_item_id=ti.productID;
+----+-------------+-------+--------+----------------------+----------------------+---------+------------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+----------------------+----------------------+---------+------------+----------+-------------+
| 1 | SIMPLE | i | ALL | PRIMARY | NULL | NULL | NULL | 13652592 | |
| 1 | SIMPLE | icm | ref | IDX_ACD6184F126F525E | IDX_ACD6184F126F525E | 5 | frugg.i.id | 1 | Using where |
| 1 | SIMPLE | ti | eq_ref | PRIMARY | PRIMARY | 767 | func | 1 | Using where |
+----+-------------+-------+--------+----------------------+----------------------+---------+------------+----------+-------------+
3 rows in set (0.06 sec)
mysql> UPDATE item AS i, item_catalog_map AS icm, temp_input AS ti
-> SET i.name=ti.productName,
-> icm.price=ti.retailPrice,
-> icm.conversion_url=productURL
-> WHERE i.id=icm.item_id AND icm.catalog_unique_item_id=ti.productID;
First of all, if your denormalized data has 13M records, but both of your "normalized" tables have 17M records, then you are not getting much compression out of your normalization.
Second, you are trying to update both normalized tables in one SQL statement. I would think that you should update the mapping table first, then in a second SQL statement update the data table.
Third, explicit JOIN syntax could speed things up. At first glance the query looks like a three-way Cartesian product, but it is really just an old-school comma join with the join conditions in the WHERE clause, and the optimizer should treat it the same; nonetheless, use the JOIN syntax.
UPDATE item_catalog_map AS icm
INNER JOIN temp_input AS ti
    ON icm.catalog_unique_item_id = ti.productID
SET icm.price = ti.retailPrice,
    icm.conversion_url = ti.productURL;
UPDATE item AS i
INNER JOIN item_catalog_map AS icm
    ON i.id = icm.item_id
INNER JOIN temp_input AS ti
    ON icm.catalog_unique_item_id = ti.productID
SET i.name = ti.productName;
Finally, make sure you have these indexes:
CREATE INDEX IDX_CATALOG ON item_catalog_map (catalog_unique_item_id);
CREATE INDEX IDX_RAW_PRODUCT_ID ON temp_input (productID);
CREATE INDEX IDX_MAP_ITEM_ID ON item_catalog_map (item_id);
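With those in place, it may be worth re-running EXPLAIN on the SELECT equivalent of each update to confirm the indexes are actually used (MySQL of this era cannot EXPLAIN an UPDATE directly), for example:
EXPLAIN SELECT icm.price, ti.retailPrice
FROM item_catalog_map AS icm
INNER JOIN temp_input AS ti
    ON icm.catalog_unique_item_id = ti.productID;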
Related
I'm trying to figure out why this query is so slow (it takes about 6 seconds to return a result):
SELECT DISTINCT
c.id
FROM
z1
INNER JOIN
c ON (z1.id = c.id)
INNER JOIN
i ON (c.member_id = i.member_id)
WHERE
c.id NOT IN (... big list of ids which should be excluded)
This is the execution plan:
+----+-------------+-------+--------+-------------------+---------+---------+--------------------+--------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+--------+-------------------+---------+---------+--------------------+--------+----------+--------------------------+
| 1 | SIMPLE | z1 | index | PRIMARY | PRIMARY | 4 | NULL | 318563 | 99.85 | Using where; Using index; Using temporary |
| 1 | SIMPLE | c | eq_ref | PRIMARY,member_id | PRIMARY | 4 | z1.id | 1 | 100.00 | |
| 1 | SIMPLE | i | eq_ref | PRIMARY | PRIMARY | 4 | c.member_id | 1 | 100.00 | Using index |
+----+-------------+-------+--------+-------------------+---------+---------+--------------------+--------+----------+--------------------------+
Is it because MySQL has to read almost the whole first table? Can it be adjusted?
You can try to replace c with a subquery.
SELECT DISTINCT
    c.id
FROM
    z1
INNER JOIN
    (SELECT c.id, c.member_id
     FROM c
     WHERE
        c.id NOT IN (... big list of ids which should be excluded)) c ON (z1.id = c.id)
INNER JOIN
    i ON (c.member_id = i.member_id)
so that only the necessary ids are carried into the joins. Note that the derived table also has to select c.member_id, because the join to i needs it.
It is impossible to say from the information you've provided whether there is a faster way to obtain the same data (we would need to know about the data distributions and which foreign keys are obligatory). However, assuming that this is a hierarchical data set, the plan is probably not optimal: the only predicate that reduces the number of rows is the c.id NOT IN (...).
The first question to ask yourself when optimizing any query is: do I need all the rows? How many rows does this return?
I'm struggling to see any utility in a query which returns a list of 'id' values (implying a set of auto-increment integers).
You can't use an index for a NOT IN (or <>), hence the most efficient solution is probably to start with a full table scan on 'c', which should be the outcome of StanislavL's query.
Since you don't use the values from i and z1, the joins could be replaced with EXISTS, which may help performance.
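A minimal sketch of that rewrite (keeping the elided NOT IN list as-is); since each row of c is then examined at most once, the DISTINCT also becomes unnecessary:
SELECT c.id
FROM c
WHERE c.id NOT IN (... big list of ids which should be excluded)
  AND EXISTS (SELECT 1 FROM z1 WHERE z1.id = c.id)
  AND EXISTS (SELECT 1 FROM i WHERE i.member_id = c.member_id)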
I would consider creating a compound index on c(id, member_id). This way the query should be resolved at index level only, without scanning any table rows.
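For example (the index name here is just a placeholder):
CREATE INDEX idx_c_id_member ON c (id, member_id);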
I want to do a count like this (as an example, not really counting dogs):
SELECT COUNT(*)
FROM dogs AS d INNER JOIN races AS r ON d.race_id = r.race_id
INNER JOIN colors AS c ON c.color_id = r.color_id
WHERE d.deceased = 'N'
I have 130,000 dogs in a MyISAM table. Races has 1,500 records and is an InnoDB table with 9 columns, colors has 83 records and is also InnoDB and has two columns (id, name).
The *_id columns are all primary keys, I have indices on the 'foreign' keys dogs.race_id and races.color_id and I have an index on dogs.deceased. None of the mentioned columns can be NULL.
# mysql --version
mysql Ver 14.12 Distrib 5.0.51a, for debian-linux-gnu (i486) using readline 5.2
Now the thing is: in phpMyAdmin this query takes 1.8 secs (with SQL_NO_CACHE), with a count result of 64,315. Changing COUNT(*) to COUNT(d.dog_id) or COUNT(d.deceased) also makes the query run for 1.8 secs, with the same result.
But when I remove the COUNT() and just do SELECT * or SELECT dog_id, it takes about 0.004 secs to run (and I then count the result with something like mysql_num_rows()).
How can this be? And how can I make the COUNT() work faster?
Edit: Added an EXPLAIN below
EXPLAIN SELECT COUNT(*)
FROM dogs AS d INNER JOIN races AS r ON d.race_id = r.race_id
INNER JOIN colors AS c ON c.color_id = r.color_id
WHERE d.deceased = 'N'
Gives me:
+----+-------------+-------+-------+------------------+----------+---------+----------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+------------------+----------+---------+----------------------+------+-------------+
| 1 | SIMPLE | c | index | color_id | color_id | 4 | NULL | 83 | Using index |
| 1 | SIMPLE | r | ref | PRIMARY,color_id | color_id | 4 | database.c.color_id | 14 | Using index |
| 1 | SIMPLE | d | ref | race_id,deceased | race_id | 4 | database.r.race_id | 123 | Using where |
+----+-------------+-------+-------+------------------+----------+---------+----------------------+------+-------------+
The MySQL optimizer does a full table scan only if it is needed. If a column is not defined as NOT NULL, there can be NULL values in it, and MySQL has to scan the rows to find out. Is your column d.dog_id nullable? Try to run the count on another column which is not nullable; this should give you better performance than COUNT(*).
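For example, counting on race_id, which the question says cannot be NULL:
SELECT COUNT(d.race_id)
FROM dogs AS d INNER JOIN races AS r ON d.race_id = r.race_id
INNER JOIN colors AS c ON c.color_id = r.color_id
WHERE d.deceased = 'N'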
Try to set an index on dogs.deceased and hint it with SELECT COUNT(*) ... USE INDEX (my_index_name).
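In MySQL the hint goes in parentheses right after the table reference; a sketch, assuming the existing index on dogs.deceased is named ix_deceased:
SELECT COUNT(*)
FROM dogs AS d USE INDEX (ix_deceased)
INNER JOIN races AS r ON d.race_id = r.race_id
INNER JOIN colors AS c ON c.color_id = r.color_id
WHERE d.deceased = 'N'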
Create a covering index to make your counting faster. MySQL has no INCLUDE clause, so put every column the query touches on dogs into the index itself:
CREATE INDEX ix_temp ON dogs (race_id, deceased);
I have four tables that I am trying to join and output the result to a new table. My code looks like this:
create table tbl
select a.dte, a.permno, (ret - rf) f0_xs_ret, (xs_ret - (betav*xs_mkt)) f0_resid, mkt_cap last_year_mkt_cap, betav beta_value
from a inner join b using (dte)
inner join c on (year(a.dte) = c.yr and a.permno = c.permno)
inner join d on (a.permno = d.permno and year(a.dte)-1 = year(d.dte));
All of the tables have multiple indexes. For table a, (dte, permno) identifies a unique record; for table b, dte identifies a unique record; for table c, (yr, permno) identifies a unique record; and for table d, (dte, permno) identifies a unique record. The EXPLAIN for the SELECT part of the query is:
+----+-------------+-------+--------+-------------------+---------+---------+----------------------------------+--------+-------------------+
| id | select_type | table | type   | possible_keys     | key     | key_len | ref                              | rows   | Extra             |
+----+-------------+-------+--------+-------------------+---------+---------+----------------------------------+--------+-------------------+
|  1 | SIMPLE      | d     | ALL    | idx1              | NULL    | NULL    | NULL                             | 264129 |                   |
|  1 | SIMPLE      | c     | ref    | idx2              | idx2    | 4       | achernya.d.permno                |     16 |                   |
|  1 | SIMPLE      | b     | ALL    | PRIMARY,idx2      | NULL    | NULL    | NULL                             |  12336 | Using join buffer |
|  1 | SIMPLE      | a     | eq_ref | PRIMARY,idx1,idx2 | PRIMARY | 7       | achernya.b.dte,achernya.d.permno |      1 | Using where       |
+----+-------------+-------+--------+-------------------+---------+---------+----------------------------------+--------+-------------------+
Why does MySQL have to read so many rows to process this? And if I am reading this correctly, it has to read (264129 * 16 * 12336) rows, which should take a good month.
Could someone please explain what's going on here?
MySQL has to read the rows because you're using functions as your join conditions. An index on dte will not help resolve YEAR(dte) in a query. If you want to make this fast, then put the year in its own column to use in joins and move the index to that column, even if that means some denormalization.
As for the other columns in your index that you don't apply functions to: they may not be used if the index wouldn't provide much benefit, or if they aren't the leftmost column of the index and your join condition doesn't use the leftmost prefix of that index.
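A minimal sketch of that denormalization for table a, assuming placeholder names dte_year and idx_a_year_permno (table d would get the same treatment for the year(a.dte)-1 = year(d.dte) condition):
ALTER TABLE a ADD COLUMN dte_year SMALLINT;             -- denormalized copy of YEAR(dte)
UPDATE a SET dte_year = YEAR(dte);                      -- backfill existing rows
CREATE INDEX idx_a_year_permno ON a (dte_year, permno); -- index the plain column
-- the join can then use the index:
--   inner join c on (a.dte_year = c.yr and a.permno = c.permno)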
Sometimes MySQL does not use an index, even if one is available. One circumstance under which this occurs is when the optimizer estimates that using the index would require MySQL to access a very large percentage of the rows in the table. (In this case, a table scan is likely to be much faster because it requires fewer seeks.)
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
I have 2 database tables, Companies (InnoDB engine) and Company_financial_figures (MyISAM engine). Companies has about 300,000 records; Company_financial_figures has 600,000 records. Company_financial_figures also has a flag that is used in the LEFT JOIN condition.
The idea of the query is to select the actual balance for every company (some companies have no balance data, but they must still be selected, so I have to use a LEFT JOIN). It seems to me that it should look up about 300k records in the Company_financial_figures table rather than doing a full table scan of all 600k records, yet the performance of this query is very slow.
The query is something like this:
SELECT DISTINCT comp.id, comp.name, comp.surname, cff.balance
FROM companies comp
LEFT JOIN company_financial_figures cff
    ON (cff.company_id = comp.id AND cff.actual = 1)
+----+-------------+-------------------+-------+---------------+----------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------------+-------+---------------+----------+---------+------+--------+-------------+
| 1 | SIMPLE | companies | index | NULL | comp_i_i | 2 | NULL | 346908 | Using index |
| 1 | SIMPLE | company_finan.. | ALL | NULL | NULL | NULL | NULL | 610364 | |
+----+-------------+-------------------+-------+---------------+----------+---------+------+--------+-------------+
I have an index on the company_id column, but it doesn't help.
Any suggestions?
Why do you need the DISTINCT keyword? DISTINCT always slows down your query, because every record has to be considered explicitly for the DISTINCT comparison. I don't know how MySQL handles this in detail, but in Oracle you can get horrible execution plans if one of your DISTINCT fields is nullable.
But you shouldn't need it; your records are probably distinct anyway, because you select comp.id (I'm guessing?). Wouldn't it be a lot faster if you removed DISTINCT?
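That is, a sketch of the suggested rewrite, assuming actual = 1 marks at most one row per company (otherwise duplicates would reappear):
SELECT comp.id, comp.name, comp.surname, cff.balance
FROM companies comp
LEFT JOIN company_financial_figures cff
    ON (cff.company_id = comp.id AND cff.actual = 1)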
So, I've never understood MySQL's EXPLAIN. I understand the gross concepts: you should have at least one entry in the possible_keys column for it to use an index, and simple queries are better. But what is the difference between ref and eq_ref? What is the best way to optimize queries?
For example, here is my latest query; I'm trying to figure out why it takes forever (it was generated from Django models):
+----+-------------+---------------------+--------+-----------------------------------------------------------+---------------------------------+---------+--------------------------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------------+--------+-----------------------------------------------------------+---------------------------------+---------+--------------------------------------+------+---------------------------------+
| 1 | SIMPLE | T6 | ref | yourock_achiever_achievement_id,yourock_achiever_alias_id | yourock_achiever_alias_id | 4 | const | 244 | Using temporary; Using filesort |
| 1 | SIMPLE | T5 | eq_ref | PRIMARY | PRIMARY | 4 | paul.T6.achievement_id | 1 | Using index |
| 1 | SIMPLE | T4 | ref | yourock_achiever_achievement_id,yourock_achiever_alias_id | yourock_achiever_achievement_id | 4 | paul.T6.achievement_id | 298 | |
| 1 | SIMPLE | yourock_alias | eq_ref | PRIMARY | PRIMARY | 4 | paul.T4.alias_id | 1 | Using index |
| 1 | SIMPLE | yourock_achiever | ref | yourock_achiever_achievement_id,yourock_achiever_alias_id | yourock_achiever_alias_id | 4 | paul.T4.alias_id | 152 | |
| 1 | SIMPLE | yourock_achievement | eq_ref | PRIMARY | PRIMARY | 4 | paul.yourock_achiever.achievement_id | 1 | |
+----+-------------+---------------------+--------+-----------------------------------------------------------+---------------------------------+---------+--------------------------------------+------+---------------------------------+
6 rows in set (0.00 sec)
I had hoped to learn enough about MySQL's EXPLAIN that posting the query wouldn't be needed. Alas, it seems that you can't get enough information from the EXPLAIN output alone and you need the raw SQL. Query:
SELECT `yourock_achievement`.`id`,
`yourock_achievement`.`modified`,
`yourock_achievement`.`created`,
`yourock_achievement`.`string_id`,
`yourock_achievement`.`owner_id`,
`yourock_achievement`.`name`,
`yourock_achievement`.`description`,
`yourock_achievement`.`owner_points`,
`yourock_achievement`.`url`,
`yourock_achievement`.`remote_image`,
`yourock_achievement`.`image`,
`yourock_achievement`.`parent_achievement_id`,
`yourock_achievement`.`slug`,
`yourock_achievement`.`true_points`
FROM `yourock_achievement`
INNER JOIN
`yourock_achiever`
ON `yourock_achievement`.`id` = `yourock_achiever`.`achievement_id`
INNER JOIN
`yourock_alias`
ON `yourock_achiever`.`alias_id` = `yourock_alias`.`id`
INNER JOIN
`yourock_achiever` T4
ON `yourock_alias`.`id` = T4.`alias_id`
INNER JOIN
`yourock_achievement` T5
ON T4.`achievement_id` = T5.`id`
INNER JOIN
`yourock_achiever` T6
ON T5.`id` = T6.`achievement_id`
WHERE
T6.`alias_id` = 6
ORDER BY
`yourock_achievement`.`modified` DESC
Paul:
eq_ref
One row is read from this table for each combination of rows from the previous tables. Other than the system and const types, this is the best possible join type. It is used when all parts of an index are used by the join and the index is a PRIMARY KEY or UNIQUE index.
eq_ref can be used for indexed columns that are compared using the = operator. The comparison value can be a constant or an expression that uses columns from tables that are read before this table. In the following examples, MySQL can use an eq_ref join to process ref_table:
SELECT * FROM ref_table,other_table
WHERE ref_table.key_column=other_table.column;
SELECT * FROM ref_table,other_table
WHERE ref_table.key_column_part1=other_table.column
AND ref_table.key_column_part2=1;
ref
All rows with matching index values are read from this table for each combination of rows from the previous tables. ref is used if the join uses only a leftmost prefix of the key or if the key is not a PRIMARY KEY or UNIQUE index (in other words, if the join cannot select a single row based on the key value). If the key that is used matches only a few rows, this is a good join type.
ref can be used for indexed columns that are compared using the = or <=> operator. In the following examples, MySQL can use a ref join to process ref_table:
SELECT * FROM ref_table WHERE key_column=expr;
SELECT * FROM ref_table,other_table
WHERE ref_table.key_column=other_table.column;
SELECT * FROM ref_table,other_table
WHERE ref_table.key_column_part1=other_table.column
AND ref_table.key_column_part2=1;
These are copied verbatim from the MySQL manual: http://dev.mysql.com/doc/refman/5.0/en/using-explain.html
If you could post the query that is taking forever, I could help pinpoint what is slowing it down. Please also specify what your definition of "forever" is, and provide the SHOW CREATE TABLE xxx; output for these tables so the query can be optimized as much as possible.
What jumps out at me immediately as a possible point of improvement is "Using temporary; Using filesort". This means that a temporary table was created to satisfy the query (not necessarily a bad thing), and that the GROUP BY/ORDER BY you specified could not be resolved from an index, thus resulting in a filesort.
Your query seems to process (244 * 298 * 152) = 11,052,224 records, which, according to Using temporary; Using filesort, all need to be sorted. That can take a long time.
If you post your query here, we probably will be able to optimize it somehow.
Update:
Your query indeed does a number of nested loops, and seems to yield lots of rows which then need to be sorted.
Could you please run the following query:
SELECT COUNT(*)
FROM `yourock_achievement`
INNER JOIN
`yourock_achiever`
ON `yourock_achievement`.`id` = `yourock_achiever`.`achievement_id`
INNER JOIN
`yourock_alias`
ON `yourock_achiever`.`alias_id` = `yourock_alias`.`id`
INNER JOIN
`yourock_achiever` T4
ON `yourock_alias`.`id` = T4.`alias_id`
INNER JOIN
`yourock_achievement` T5
ON T4.`achievement_id` = T5.`id`
INNER JOIN
`yourock_achiever` T6
ON T5.`id` = T6.`achievement_id`
WHERE
T6.`alias_id` = 6