When to add index on joined tables - mysql

I have a MySQL table with 9 million records and no indexes set. I need to join it to another table on a common ID. I'm going to add an index on that ID, but I also have other fields in the SELECT and WHERE clauses.
Should I add an index on every field in the WHERE clause?
What about the fields in the SELECT clause? Should I create one index covering all the fields, or one index per field?
Update - Added tables and query
Here is the query - I need to get the number of sales, the item name, and the item ID per item, filtered by store name and store ID (neither the store name nor the store ID is unique on its own):
SELECT COUNT(*) AS salescount, items.ItemName, CONCAT(items.ID, items.ProductCode) AS itemId
FROM items
JOIN sales ON items.ItemId = sales.ItemId
WHERE items.StoreName = ? AND sales.StoreId = ?
GROUP BY items.ItemId
ORDER BY salescount DESC
LIMIT 10;
Here is the sales table:
+---------+---------------------+------+-----+---------+-------+
| Field   | Type                | Null | Key | Default | Extra |
+---------+---------------------+------+-----+---------+-------+
| StoreId | bigint(20) unsigned | NO   |     | NULL    |       |
| ItemId  | bigint(20) unsigned | NO   |     | NULL    |       |
+---------+---------------------+------+-----+---------+-------+
and the items table:
+-------------+---------------------+------+-----+---------+-------+
| Field       | Type                | Null | Key | Default | Extra |
+-------------+---------------------+------+-----+---------+-------+
| ItemId      | bigint(20) unsigned | NO   | PRI | NULL    |       |
| ProductCode | bigint(20) unsigned | NO   |     | NULL    |       |
| ItemName    | varchar(100)        | NO   |     | NULL    |       |
| StoreName   | varchar(100)        | NO   | PRI | NULL    |       |
+-------------+---------------------+------+-----+---------+-------+

You should index all fields that are searched on: for the leading table, those in the WHERE clause; for the driven table, those in the WHERE and JOIN clauses.
Making the indexes cover all fields used in the query (including the SELECT and ORDER BY clauses) will also help, since no table lookups will be needed.
Just post your query here and I'll probably be able to tell you how to index the tables.
Update:
Your query will return at most 1 row, with 1 as its COUNT(*).
It selects the sale with the given StoreId (the PRIMARY KEY) and joins the items on the sale's ItemId and the given StoreName (that combination is a PRIMARY KEY too).
The join either succeeds (returning 1 row) or fails (returning no rows).
If it succeeds, the COUNT(*) will be 1.
If that is really what you want, then your tables are indexed fine.
However, it seems to me that your table design is a little more complex and you just missed some fields when copying the field definitions.
Update 2:
Create a composite index on sales (storeId, itemId)
Make sure that your PRIMARY KEY on items is defined as (StoreName, ItemId) (in that order).
If the PK is defined as (ItemId, StoreName), then create an index on items (StoreName, ItemId).
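A minimal sketch of that DDL (the index names here are made up):
CREATE INDEX idx_sales_store_item ON sales (StoreId, ItemId);
-- only needed if the items PK turns out to be (ItemId, StoreName):
CREATE INDEX idx_items_store_item ON items (StoreName, ItemId);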

Yes, you really should have indexes, but they should be appropriate for all your queries. Without having a good rummage about in your database, it's difficult to recommend exactly which indexes to configure.
9 million rows is enough that indexes will make a big difference - but not so big that you can't afford to tinker a bit.
A crude solution would be to create indexes on items(StoreName), items(ItemId, StoreName), items(StoreName, ItemId), sales(ItemId), sales(StoreId), sales(ItemId, StoreId) and sales(StoreId, ItemId), then drop the ones that aren't getting used.
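To spot the unused candidates afterwards, the sys schema that ships with MySQL 5.7+ has a view for exactly this; a sketch, assuming performance_schema is enabled and your schema is named shop:
SELECT object_schema, object_name, index_name
FROM sys.schema_unused_indexes
WHERE object_schema = 'shop';
Whatever still shows up there after the workload has run for a while is a candidate to drop.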
C.

Indexing is great -- when used in the correct form. Remember that indexes are not free: every one has to be maintained on each write.
Concentrate your indexes on your primary, shared keys, as well as fields which see heavy and frequent comparisons, such as literal fields and date ranges.
Indexes are great when used correctly, but they aren't a cure-all. Even a properly indexed table can be brought to its knees with a bad query and a flick of the wrist.

Related

Query performance on primary index vs index

I have a table in MySQL and two queries whose performance is quite different. I have extracted the query plans, but I couldn't fully understand the reason behind the performance difference.
The table:
+-------------+----------------------------------------------+
| TableA      |                                              |
+-------------+----------------------------------------------+
| id          | int(10) unsigned NOT NULL AUTO_INCREMENT     |
| userId      | int(10) unsigned DEFAULT NULL                |
| created     | timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP |
| PRIMARY KEY | id                                           |
| KEY userId  | userId                                       |
| KEY created | created                                      |
+-------------+----------------------------------------------+
Keys/indices: the primary key on the id field, a key on the userId field (ASC), and another key on the created field (ASC).
tableA is a very big table; it contains millions of rows.
The user with id 1234 has 1.5M records in this table. I want to fetch their latest 100 rows. To achieve this, I have 2 different queries:
Query 1:
SELECT * FROM tableA USE INDEX (userId)
WHERE userId=1234 ORDER BY created DESC LIMIT 100;
Query 2:
SELECT * FROM tableA
WHERE userId=1234 ORDER BY id DESC LIMIT 100;
Since the id field of tableA is AUTO_INCREMENT, ordering by id preserves the "latest" ordering. These 2 queries return the same result. However, there is a huge performance difference.
Query plans are:
+----------+-------------------------------------------------+-------------------------------+------+---------------------------------------+
| Query No | Operation                                       | Params                        | Rows | Extra                                 |
+----------+-------------------------------------------------+-------------------------------+------+---------------------------------------+
| Query 1  | Sort (using filesort) + unique index scan (ref) | table: tableA; index: userId; | 2.5M | Using index condition; Using filesort |
| Query 2  | Unique index scan (ref)                         | table: tableA; index: userId; | 2.5M | Using where                           |
+----------+-------------------------------------------------+-------------------------------+------+---------------------------------------+
+---------+-------------+
| Query   | Performance |
+---------+-------------+
| Query 1 | 7.5 s       |
| Query 2 | 741 ms      |
+---------+-------------+
I understand that there is a sorting operation in Query 1. In both queries, the index used is userId. But why is no sort needed in Query 2? And what role does the primary key play here?
MySQL 5.7
Edit: There are more columns in the table; I have omitted them from the table definition above.
Since id field of tableA is auto increment, the condition of being latest is preserved.
That is usually a valid statement.
WHERE userId=1234 ORDER BY created DESC LIMIT 100
needs this 'composite' index: (userId, created). With that, it will hit only 100 rows, regardless of the table size or the number of rows for that user.
The same goes for
WHERE userId=1234 ORDER BY id DESC LIMIT 100;
Namely that it needs (userId, id). However, in InnoDB, when you say INDEX(x) it silently tacks on the PRIMARY KEY columns. So you effectively get INDEX(x,id). This is why your plain INDEX(userId) worked well.
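A sketch of the suggested composite index (the index name is illustrative); with it in place, the USE INDEX hint in Query 1 becomes unnecessary:
ALTER TABLE tableA ADD INDEX idx_user_created (userId, created);
SELECT * FROM tableA
WHERE userId = 1234 ORDER BY created DESC LIMIT 100;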
EXPLAIN rarely (if ever) takes into account the LIMIT. This is why 'Rows' is "2.5M" for both queries.
The first query might (or might not) have used INDEX(userId) if you took out the USE INDEX hint. The choice depends on what percentage of the table has userId = 1234. If it is less than about 20%, the index would be used. But it would bounce back and forth between the secondary index and the data -- all 1.5 million times. If more than 20%, it would avoid the bouncing by simply reading all the "millions" of rows, ignoring those that don't apply.
Note: What you had for Q1 will still read at least 1.5M rows, sort them ("Using filesort"), then peel off the desired 100. But with INDEX(userId, created), it can skip the sort and look at only 100 rows.
I cannot explain "Unique index scan" without seeing SHOW CREATE TABLE and the un-annotated EXPLAIN. (EXPLAIN FORMAT=JSON SELECT... might provide more insight.)

Need index for simple query please

Can anyone suggest a good index to make this query run quicker?
SELECT
s.*,
sl.session_id AS session_id,
sl.lesson_id AS lesson_id
FROM
cdu_sessions s
INNER JOIN cdu_sessions_lessons sl ON sl.session_id = s.id
WHERE
(s.sort = '1') AND
(s.enabled = '1') AND
(s.teacher_id IN ('193', '1', '168', '1797', '7622', '19951'))
Explain:
id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra
1 | SIMPLE | s | NULL | ALL | PRIMARY | NULL | NULL | NULL | 2993 | 0.50 | Using where
1 | SIMPLE | sl | NULL | ref | session_id,ix2 | ix2 | 4 | ealteach_main.s.id | 5 | 100.00 | Using index
cdu_sessions looks like this:
------------------------------------------------
id | int(11)
name | varchar(255)
map_location | enum('classroom', 'school'...)
sort | tinyint(1)
sort_titles | tinyint(1)
friend_gender | enum('boy', 'girl'...)
friend_name | varchar(255)
friend_description | varchar(2048)
friend_description_format | varchar(128)
friend_description_audio | varchar(255)
friend_description_audio_fid | int(11)
enabled | tinyint(1)
created | int(11)
teacher_id | int(11)
custom | int(1)
------------------------------------------------
cdu_sessions_lessons contains 3 fields - id, session_id and lesson_id
Thanks!
Without looking at the query plan, row counts, and data distribution of each table, it is hard to predict a good index to make the query run faster.
But I would say that this might help:
create index sessions_teacher_idx on cdu_sessions(teacher_id);
Looking at the WHERE condition, you could use a composite index for table cdu_sessions:
create index idx1 on cdu_sessions(teacher_id, sort, enabled);
and, looking at the JOIN and SELECT, one for table cdu_sessions_lessons:
create index idx2 on cdu_sessions_lessons(session_id, lesson_id);
First, write the query so no type conversions are necessary. All the comparisons in the where clause are to numbers, so use numeric constants:
SELECT s.*,
sl.session_id, -- unnecessary because s.id is in the result set
sl.lesson_id
FROM cdu_sessions s INNER JOIN
cdu_sessions_lessons sl
ON sl.session_id = s.id
WHERE s.sort = 1 AND
s.enabled = 1 AND
s.teacher_id IN (193, 1, 168, 1797, 7622, 19951);
Although it might not be happening in this specific case, mixing types can impede the use of indexes.
I also removed the column aliases (AS session_id, for instance). They were redundant because each alias simply repeated the column name.
For this query, first look at the WHERE clause. All the column references are from one table. These should go in the index, with the equality comparisons first:
create index idx_cdu_sessions_4 on cdu_sessions(sort, enabled, teacher_id, id)
I added id because it is also used in the JOIN.
Formally, id is not needed in the index if it is the primary key. However, I like to be explicit if I want it there.
Next you want an index for the second table. Only two columns are referenced from there, so they can both go in the index. The first column should be the one used in the join:
create index idx_cdu_sessions_lessons_2 on cdu_sessions_lessons(session_id, lesson_id);

Insert with on duplicate key update ignores index

So I've got a table product_supplier that I need to add data to from import_tbl. product_supplier has three columns: product_id, supplier_id and price. import_tbl has the same columns plus some extras. What I can't get working is this: when a specific combination of product_id and supplier_id already exists, only the price should be updated; if that combination does not exist, a new row needs to be added. I tried this query:
INSERT INTO product_supplier (product_id, supplier_id, price)
SELECT i.product_id, i.supplier_id, i.price
FROM import_tbl i
ON DUPLICATE KEY UPDATE
price = i.price
This one works if I add a row with a new product_id, but it totally ignores the supplier_id. So it won't add new rows when a row uses the same product_id with a different supplier_id.
I think this has something to do with indexes. I tried unique indexes on both product_id and supplier_id, and a multiple-column index over product_id and supplier_id together, but when I put EXPLAIN in front of the query it never shows any indexes being used. Please help, thanks!
Table structure of product_supplier
+---------------------+---------+------+-----+---------+----------------+
| Field               | Type    | Null | Key | Default | Extra          |
+---------------------+---------+------+-----+---------+----------------+
| product_supplier_id | int(11) | NO   | PRI | NULL    | auto_increment |
| product_id          | int(11) | NO   | UNI | 0       |                |
| supplier_id         | int(11) | NO   | MUL | 0       |                |
| price               | int(11) | NO   |     | 0       |                |
+---------------------+---------+------+-----+---------+----------------+
It looks like you have a key problem.
ON DUPLICATE KEY UPDATE fires on a duplicate in any PRIMARY KEY or UNIQUE index -- and your table has a lone UNIQUE index on product_id (the UNI flag above). So a row with an existing product_id but a new supplier_id still counts as a duplicate and gets turned into an update instead of an insert. The auto-generated product_supplier_id never causes a conflict, since it isn't included in your INSERT.
If you really want to make this commit as a single action (instead of checking for an existing row and then choosing to either insert or update), you'll need the uniqueness to be defined over the combination of product_id and supplier_id -- for example by making that pair the primary key and dropping the auto-increment field.
If you need to be able to have more than one price per product/supplier combination, then you can't use ON DUPLICATE KEY UPDATE and will need to run multiple queries.
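An alternative sketch that keeps the surrogate key: replace the lone unique index with a composite one. This assumes the existing index carries the default name product_id -- check SHOW INDEX FROM product_supplier first:
ALTER TABLE product_supplier
DROP INDEX product_id,
ADD UNIQUE KEY uq_product_supplier (product_id, supplier_id);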

Optimize MySQL query with large in() clause

There is a simple requirement: query the number of sixth-degree relationships from a Friend table.
The structure of the Friend is like this:
+----------+---------+------+-----+---------+----------------+
| Field    | Type    | Null | Key | Default | Extra          |
+----------+---------+------+-----+---------+----------------+
| id       | int(11) | NO   | PRI | NULL    | auto_increment |
| userId   | int(11) | NO   | MUL | NULL    |                |
| friendId | int(11) | NO   |     | NULL    |                |
+----------+---------+------+-----+---------+----------------+
Assume I want to know the number of sixth-degree relationships of userId 1. I wrote down six queries like this:
SELECT friendId FROM Friend WHERE userId = 1 to get the first-degree friends.
Then I execute
SELECT friendId FROM Friend WHERE userId IN (/*above query result*/)
five more times.
The problem is not as simple as it looks, because I have millions of records in the Friend table.
There is a strong possibility that the sixth-degree relationship count of user 1 runs past six digits, even though he/she only has two friends at the first degree.
The number of items in the IN clause grows exponentially with each degree.
As a result, the six queries take more than one minute to produce the result.
How can I optimize this?
You can use subqueries and see if the MySQL optimizer is clever enough to rewrite them as joins (it usually is).
But really, an RDBMS is unsuitable for this task. Better to look into graph databases. See this question for example.
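For illustration, the nested-subquery form for two degrees would look like the sketch below; extend the nesting to six levels for the full query:
SELECT COUNT(DISTINCT f.friendId)
FROM Friend f
WHERE f.userId IN (SELECT friendId FROM Friend WHERE userId = 1);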
Create a temp table to hold the intermediate results, and JOIN instead of IN:
DROP TEMPORARY TABLE IF EXISTS tmp_friends;
CREATE TEMPORARY TABLE `tmp_friends` (
`id` INT UNSIGNED NOT NULL,
PRIMARY KEY (`id`)
);
INSERT INTO tmp_friends VALUES(<id of the given user>);
# run this 6 times
INSERT IGNORE INTO tmp_friends
SELECT f.userId
FROM tmp_friends t
JOIN Friend f ON f.friendId = t.id;
# note: some MySQL versions refuse to reopen a TEMPORARY table within one
# statement (ER_CANT_REOPEN_TABLE); if you hit that, stage the new ids
# through a second temporary table
SELECT f.*
FROM tmp_friends t
JOIN Friend f ON f.userId = t.id;
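If it's really the amount you're after, a final aggregate over the temp table gives it; a sketch:
SELECT COUNT(*) - 1 AS within_six_degrees -- minus 1 to exclude the seed user
FROM tmp_friends;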

Why is COUNT() query from large table much faster than SUM()

I have a data warehouse with the following tables:
main
about 8 million records
CREATE TABLE `main` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`cid` mediumint(8) unsigned DEFAULT NULL, -- the customer id
`iid` mediumint(8) unsigned DEFAULT NULL, -- the item id
`pid` tinyint(3) unsigned DEFAULT NULL, -- the period id
`qty` double DEFAULT NULL,
`sales` double DEFAULT NULL,
`gm` double DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_pci` (`pid`,`cid`,`iid`) USING HASH,
KEY `idx_pic` (`pid`,`iid`,`cid`) USING HASH
) ENGINE=InnoDB AUTO_INCREMENT=7978349 DEFAULT CHARSET=latin1
period
This table has about 50 records and has the following fields
id
month
year
customer
This has about 23,000 records and the following fields
id
number -- this field is unique
name -- simply a description field
The following query runs very fast (less than 1 second) and returns a count of about 2,000:
select count(*)
from mydb.main m
INNER JOIN mydb.period p ON p.id = m.pid
INNER JOIN mydb.customer c ON c.id = m.cid
WHERE p.year = 2013 AND c.number = 'ABC';
But this query is much slower (more than 45 seconds); it is the same as the previous one but sums instead of counts:
select sum(sales)
from mydb.main m
INNER JOIN mydb.period p ON p.id = m.pid
INNER JOIN mydb.customer c ON c.id = m.cid
WHERE p.year = 2013 AND c.number = 'ABC';
When I EXPLAIN each query, the ONLY difference I see is that for the count() query the 'Extra' field says 'Using index', while for the sum() query that field is NULL.
Explain count() query
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | c | const | PRIMARY,idx_customer | idx_customer | 11 | const | 1 | Using index |
| 1 | SIMPLE | p | ref | PRIMARY,idx_period | idx_period | 4 | const | 6 | Using index |
| 1 | SIMPLE | m | ref | idx_pci,idx_pic | idx_pci | 6 | mydb.p.id,const | 7 | Using index |
Explain sum() query
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | c | const | PRIMARY,idx_customer | idx_customer | 11 | const | 1 | Using index |
| 1 | SIMPLE | p | ref | PRIMARY,idx_period | idx_period | 4 | const | 6 | Using index |
| 1 | SIMPLE | m | ref | idx_pci,idx_pic | idx_pci | 6 | mydb.p.id,const | 7 | NULL |
Why is the count() so much faster than sum()? Shouldn't it be using the index for both?
What can I do to make the sum() go faster?
Thanks in advance!
EDIT
All the tables use the InnoDB engine.
Also, as a side note, if I just do a 'SELECT *' query, it runs very quickly (less than 2 seconds). I would expect the 'SUM()' to take no longer than that, since SELECT * has to retrieve the rows anyway...
SOLVED
This is what I've learned:
Since the sales field is not part of the index, MySQL has to retrieve the records from the hard drive, which can be kind of slow.
I'm not too familiar with this, but it looks like I/O performance can be increased by switching to an SSD (solid-state drive). I'll have to research this more.
For now, I think I'm going to create another layer of summary in order to get the performance I'm looking for.
I redefined my index on the main table to be (pid,cid,iid,sales,gm,qty) and now the sum() queries are running VERY fast!
Thanks everybody!
An index is essentially a sorted list of the key columns.
When you run the count() query, the actual table data can be ignored and just the index used.
When you run the sum(sales) query, each row has to be read from disk to get the sales figure, hence much slower.
Additionally, the index can be read in bulk and processed in memory, while the row fetches randomly thrash the drive, reading rows from all over the disk.
Finally, the index itself may carry summary statistics about the counts (to help with plan generation).
Update
You actually have three indexes on your table:
PRIMARY KEY (`id`),
KEY `idx_pci` (`pid`,`cid`,`iid`) USING HASH,
KEY `idx_pic` (`pid`,`iid`,`cid`) USING HASH
So you only have indexes covering the columns id, pid, cid, and iid. (As an aside, most databases are smart enough to combine indexes, so you could probably consolidate yours somewhat.)
Adding another key like KEY idx_sales (id, sales) could improve this query's performance, but every extra index also adds maintenance cost to updates, which is likely a bad thing.
The simple answer is that count() is only counting rows. This can be satisfied by the index.
The sum() needs to identify each row and then fetch the page in order to get the sales column. This adds a lot of overhead -- about one page load per row.
If you add sales into the index, then it should also go very fast, because it will not have to fetch the original data.
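A sketch of that covering index, modeled on what the asker ended up doing (the index name is illustrative):
ALTER TABLE main ADD INDEX idx_pci_sales (pid, cid, iid, sales);
After that, EXPLAIN should show 'Using index' for the SUM() query as well.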