Optimize MySQL query with large IN() clause

I have a simple requirement: count the six-degrees relationships of a user from a Friend table.
The structure of Friend looks like this:
+----------+---------+------+-----+---------+----------------+
| Field    | Type    | Null | Key | Default | Extra          |
+----------+---------+------+-----+---------+----------------+
| id       | int(11) | NO   | PRI | NULL    | auto_increment |
| userId   | int(11) | NO   | MUL | NULL    |                |
| friendId | int(11) | NO   |     | NULL    |                |
+----------+---------+------+-----+---------+----------------+
Assume I want to know the six-degrees relationship count of userId 1. I would write six queries like this:
SELECT friendId FROM Friend WHERE userId = 1
to get the first-degree friends, then execute
SELECT friendId FROM Friend WHERE userId IN (/*above query result*/)
five more times.
The problem is not as simple as it looks, because I have millions of records in the Friend table.
There is a strong possibility that the six-degrees count of user 1 exceeds six digits even if he/she has only two first-degree friends, since the number of items in the IN clause grows exponentially with each step.
As a result, the six queries take more than a minute to return. How can I optimize this?

You can use subqueries and see if the MySQL optimizer is clever enough to rewrite them as joins (it usually is).
But an RDBMS is really unsuitable for this task; better to look into graph databases. See this question for an example.
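For illustration, here is what the subquery form would look like written out in full. This is only a sketch of the idea from the paragraph above, not a tested solution, and the aliases f1..f6 are my own:

SELECT COUNT(DISTINCT f6.friendId)
FROM Friend f6
WHERE f6.userId IN (
    SELECT f5.friendId FROM Friend f5 WHERE f5.userId IN (
        SELECT f4.friendId FROM Friend f4 WHERE f4.userId IN (
            SELECT f3.friendId FROM Friend f3 WHERE f3.userId IN (
                SELECT f2.friendId FROM Friend f2 WHERE f2.userId IN (
                    SELECT f1.friendId FROM Friend f1 WHERE f1.userId = 1
                )))));

Run EXPLAIN on it to see whether the optimizer turns the IN subqueries into semi-joins; if it does not, the temp-table approach below will serve better.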

Create a temp table to hold the intermediate results, and JOIN instead of IN:
DROP TEMPORARY TABLE IF EXISTS tmp_friends;
CREATE TEMPORARY TABLE `tmp_friends` (
  `id` INT UNSIGNED NOT NULL,
  PRIMARY KEY (`id`)
);

# Seed with the given user:
INSERT INTO tmp_friends VALUES (<id of the given user>);

# Run this six times to expand the set by one degree each time:
INSERT IGNORE INTO tmp_friends
SELECT f.userId
FROM tmp_friends t
JOIN Friend f ON f.friendId = t.id;

# Finally, fetch the accumulated friend rows:
SELECT f.*
FROM tmp_friends t
JOIN Friend f ON f.userId = t.id;
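If you would rather not run the expansion step by hand six times, the same idea can be wrapped in a stored procedure. Below is a minimal sketch; the procedure name, the tmp_next staging table, and the loop are my own additions, and it follows the question's userId -> friendId direction. The staging table also sidesteps MySQL's long-standing restriction that a TEMPORARY table cannot be referred to twice in one statement (ER_CANT_REOPEN_TABLE), which the INSERT ... SELECT above may trip on some versions.

DELIMITER //
CREATE PROCEDURE count_six_degrees(IN start_user INT UNSIGNED)
BEGIN
    DECLARE depth INT DEFAULT 0;

    DROP TEMPORARY TABLE IF EXISTS tmp_friends, tmp_next;
    CREATE TEMPORARY TABLE tmp_friends (id INT UNSIGNED NOT NULL PRIMARY KEY);
    CREATE TEMPORARY TABLE tmp_next    (id INT UNSIGNED NOT NULL PRIMARY KEY);

    INSERT INTO tmp_friends VALUES (start_user);

    WHILE depth < 6 DO
        # Stage the next level in a second table, then merge it back in.
        TRUNCATE TABLE tmp_next;
        INSERT IGNORE INTO tmp_next
        SELECT f.friendId
        FROM tmp_friends t
        JOIN Friend f ON f.userId = t.id;

        INSERT IGNORE INTO tmp_friends SELECT id FROM tmp_next;
        SET depth = depth + 1;
    END WHILE;

    # Exclude the starting user from the count.
    SELECT COUNT(*) - 1 AS reachable_within_six FROM tmp_friends;
END //
DELIMITER ;

CALL count_six_degrees(1);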

Related

MySQL query: efficiently get rows of a many-to-many relationship for items that are often used together

I have a table tag_thread that associates tags with threads, just like here on Stack Overflow, where one thread can have multiple tags and one tag can be used on multiple threads.
Now I would like to give a tag_id as input and get the tags that are often used together with the given tag (i.e. relevant tags).
Example table tag_thread:
| tag_id | thread_id |
|:-------|----------:|
| 1      |         1 |
| 2      |         1 |
| 3      |         1 |
| 1      |         2 |
| 21     |         2 |
| 3      |         2 |
Expected output for the query:
getRelevantTagIdsForTagId(1): [3,2,21]
getRelevantTagIdsForTagId(2): [1,3]
getRelevantTagIdsForTagId(3): [1,2,21]
So the query should search for the given tag_id, take the associated thread_ids, collect the tag_ids of those threads, and order them by how often each tag_id was found.
I already have a working query; however, it is not efficient at all and thus doesn't perform acceptably on larger tables:
select `t2`.`tag_id`
from `tag_thread` as `t1`
inner join `tag_thread` as `t2`
on `t1`.`thread_id` = `t2`.`thread_id`
and t1.tag_id = :tagId
where t2.tag_id <> :tagId2
group by `t2`.`tag_id`
order by count(t2.tag_id) desc
Any idea for an efficient solution? I would be okay with limiting the number of tags that are looked at in the first place, too.
result of SHOW CREATE TABLE tag_thread:
CREATE TABLE `tag_thread` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `tag_id` int(11) NOT NULL,
  `thread_id` int(11) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=38496 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
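Note that this definition has only the auto-increment PRIMARY KEY, so both sides of the self-join above end up scanning the table. As a sketch (the index names are my own), two composite indexes covering both column orders should let the whole query run from indexes:

ALTER TABLE tag_thread
  ADD INDEX idx_tag_thread (tag_id, thread_id),
  ADD INDEX idx_thread_tag (thread_id, tag_id);

The optimizer can then use idx_tag_thread to find the threads for :tagId (the t1 side) and idx_thread_tag to collect the co-occurring tags per thread (the t2 side).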

Single INNER JOIN of two well-indexed tables takes more than a minute to run

I have a query that takes about 90 seconds to run even though the tables should have the right indexes. I don't understand why.
I am using MySQL and the tables are InnoDB.
This is the query:
SELECT count(*)
FROM `following_lists` fl INNER JOIN users u
ON fl.user_uuid = u.user_uuid
WHERE fl.following_query_id = 1000010 AND u.status <= 2
I expect this query to start on the table following_lists, grab about 4K records as per the WHERE condition, join these records to the table users by its primary key, check the value of a field in the users table, and return the count of the resulting records. Why does it take so long? Could it be because the two fields I'm joining the tables by are CHAR(40) and not integers?
These are the tables involved and their indexes:
CREATE TABLE `users` (
  `user_uuid` CHAR(40) NOT NULL,
  `status` TINYINT UNSIGNED NOT NULL,
  ...
  PRIMARY KEY (`user_uuid`),
  ...
)
CREATE TABLE `following_lists` (
  `following_id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
  `following_query_id` INT UNSIGNED NOT NULL,
  `user_uuid` CHAR(40) NOT NULL,
  PRIMARY KEY (`following_id`),
  KEY `query_id` (`following_query_id`),
  KEY `user_uuid` (`user_uuid`)
)
And this is the output of the explain query:
+----+-------------+-------+--------+--------------------+----------+---------+--------------+------+-------------+
| id | select_type | table | type   | possible_keys      | key      | key_len | ref          | rows | Extra       |
+----+-------------+-------+--------+--------------------+----------+---------+--------------+------+-------------+
| 1  | SIMPLE      | fl    | ref    | query_id,user_uuid | query_id | 4       | const        | 3718 |             |
| 1  | SIMPLE      | u     | eq_ref | PRIMARY            | PRIMARY  | 160     | fl.user_uuid | 1    | Using index |
+----+-------------+-------+--------+--------------------+----------+---------+--------------+------+-------------+
Further details:
The table following_lists has about 25k rows, but only 3718 have fl.following_query_id = 1000010.
The table users has about 160k rows, but only 3718 should be selected in the join. Only 40 records meet both conditions fl.following_query_id = 1000010 AND u.status <= 2.
The query is slow even if I remove the condition AND u.status <= 2.
"have the right indexes" -- dead give away.
If you are using MyISAM, don't. Instead, switch to InnoDB.
Do you need following_lists.following_id for anything? Is (following_query_id, user_uuid) unique? If so, make that pair the PRIMARY KEY.
If you can't do the above, change
KEY `query_id` (`following_query_id`)
to
INDEX(following_query_id, user_uuid)
UUIDs are terribly inefficient, especially when unnecessarily declared utf8mb4 or given a CHAR size larger than necessary. Change the columns to CHAR(36) CHARACTER SET ascii. (Notice how the "160" key_len in the EXPLAIN shrinks significantly.)
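Put together, a sketch of those suggestions as DDL (assuming the values really are 36-character UUID text and nothing else depends on the current definitions):

ALTER TABLE following_lists
  DROP INDEX query_id,
  ADD INDEX idx_query_user (following_query_id, user_uuid);

-- Shrink the UUID columns on both sides so the join compares short ASCII strings:
ALTER TABLE users
  MODIFY user_uuid CHAR(36) CHARACTER SET ascii NOT NULL;
ALTER TABLE following_lists
  MODIFY user_uuid CHAR(36) CHARACTER SET ascii NOT NULL;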
More on why UUIDs are bad for performance: http://mysql.rjweb.org/doc.php/uuid
How much RAM do you have? What is the setting for innodb_buffer_pool_size? (Sounds like it is too low.)
More on indexing: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
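For reference, a quick way to inspect and adjust that setting (standard MySQL; the 4 GB figure below is only an example):

-- Current buffer pool size, in bytes:
SELECT @@innodb_buffer_pool_size;
-- Resizable at runtime since MySQL 5.7.5; a common rule of thumb is
-- about 70% of RAM on a dedicated database server:
SET GLOBAL innodb_buffer_pool_size = 4 * 1024 * 1024 * 1024;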

MySQL range query is slow

I have read various links like http://goo.gl/1nr3s2, http://goo.gl/gv4Vlc and other Stack Overflow questions, but none of them helped me with this problem.
The problem involves multiple tables, but EXPLAIN helped me identify that the range condition is the main problem with the query.
First I need to explain that I have this table with this sample data (I omit the ids in every table to simplify things):
+-------+----------+----------------+--------------+---------------+----------------+
| marca | submarca | modelo_inicial | modelo_final | motor         | texto_articulo |
+-------+----------+----------------+--------------+---------------+----------------+
| Buick | Century  | 1993           | 1996         | 4 Cil 2.2 Lts | BE1254AG4      |
| Buick | Century  | 1993           | 1996         | 4 Cil 2.2 Lts | 854G4          |
+-------+----------+----------------+--------------+---------------+----------------+
This table has more than 1.5 million rows. I created a composite index that combines the initial and final model columns, and each of those columns also has its own independent index, as in this structure:
CREATE TABLE `general` (
  `id_general` int(11) NOT NULL AUTO_INCREMENT,
  `id_marca_submarca` int(11) NOT NULL,
  `id_modelo_inicial` int(11) NOT NULL,
  `id_modelo_final` int(11) NOT NULL,
  `id_motor` int(11) NOT NULL,
  `id_articulo` int(11) NOT NULL,
  PRIMARY KEY (`id_general`),
  KEY `fk_general_articulo` (`id_articulo`),
  KEY `modelo_inicial_final` (`id_modelo_inicial`,`id_modelo_final`),
  KEY `indice_motor` (`id_motor`),
  KEY `indice_marca_submarca` (`id_marca_submarca`),
  KEY `indice_modelo_inicial` (`id_modelo_inicial`),
  KEY `indice_modelo_final` (`id_modelo_final`),
  CONSTRAINT `fk_general_articulo` FOREIGN KEY (`id_articulo`) REFERENCES `articulo` (`id_articulo`) ON DELETE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=1191853 DEFAULT CHARSET=utf8
I have another table that contains the different years, with sample data like this:
+-----------+--------------+
| id_modelo | texto_modelo |
+-----------+--------------+
| 76        | 2014         |
| 75        | 2013         |
| ...       | ...          |
| 1         | 1939         |
+-----------+--------------+
I created a query containing a subquery to obtain the specific data I need, but it took a long time. Here are some of the queries I have tried; none of them has worked well for me.
SELECT DISTINCT M.texto_modelo
FROM general G
INNER JOIN parque_vehicular.modelo M ON G.id_modelo_inicial <= M.id_modelo AND G.id_modelo_final >= M.id_modelo
WHERE EXISTS
(
SELECT DISTINCT A.id_articulo
...subquery...
WHERE A.id_articulo = G.id_articulo AND AD.id_distribuidor = 1
)
ORDER BY M.texto_modelo DESC;
This query took many seconds, so I used EXPLAIN to analyze it.
This is another query I tried.
SELECT DISTINCT M.texto_modelo
FROM general G
INNER JOIN parque_vehicular_rigs.modelo M ON M.id_modelo BETWEEN G.id_modelo_inicial AND G.id_modelo_final
WHERE EXISTS
(
SELECT DISTINCT A.id_articulo
...subquery WHERE A.id_articulo = G.id_articulo AND AD.id_distribuidor = 1
)
ORDER BY M.texto_modelo DESC;
Some operations you could try to change the query plan:
OP1: Get rid of all the extra keys and indexes on table general.
OP2: Use SELECT 1 instead of SELECT DISTINCT A.id_articulo in the EXISTS subquery.
Try these operations separately and compare the differences.
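For example, OP2 applied to the second query looks like this (the elided subquery is kept exactly as in the question; only the select list changes):

SELECT DISTINCT M.texto_modelo
FROM general G
INNER JOIN parque_vehicular_rigs.modelo M ON M.id_modelo BETWEEN G.id_modelo_inicial AND G.id_modelo_final
WHERE EXISTS
(
    SELECT 1
    ...subquery...
    WHERE A.id_articulo = G.id_articulo AND AD.id_distribuidor = 1
)
ORDER BY M.texto_modelo DESC;

EXISTS only has to find one matching row, so SELECT 1 spares MySQL from materializing and de-duplicating A.id_articulo values.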

Why is COUNT() query from large table much faster than SUM()

I have a data warehouse with the following tables:
main
about 8 million records
CREATE TABLE `main` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `cid` mediumint(8) unsigned DEFAULT NULL,  -- the customer id
  `iid` mediumint(8) unsigned DEFAULT NULL,  -- the item id
  `pid` tinyint(3) unsigned DEFAULT NULL,    -- the period id
  `qty` double DEFAULT NULL,
  `sales` double DEFAULT NULL,
  `gm` double DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_pci` (`pid`,`cid`,`iid`) USING HASH,
  KEY `idx_pic` (`pid`,`iid`,`cid`) USING HASH
) ENGINE=InnoDB AUTO_INCREMENT=7978349 DEFAULT CHARSET=latin1
period
This table has about 50 records and has the following fields
id
month
year
customer
This has about 23,000 records and the following fields
id
number //This field is unique
name //This is simply a description field
The following query runs very fast (less than 1 second) and returns a count of about 2,000:
select count(*)
from mydb.main m
INNER JOIN mydb.period p ON p.id = m.pid
INNER JOIN mydb.customer c ON c.id = m.cid
WHERE p.year = 2013 AND c.number = 'ABC';
But this query is much slower (more than 45 seconds), even though it is the same as the previous one except that it sums instead of counts:
select sum(sales)
from mydb.main m
INNER JOIN mydb.period p ON p.id = m.pid
INNER JOIN mydb.customer c ON c.id = m.cid
WHERE p.year = 2013 AND c.number = 'ABC';
When I EXPLAIN each query, the ONLY difference I see is that for the count() query the 'Extra' field says 'Using index', while for the sum() query it is NULL.
Explain count() query
| id | select_type | table | type  | possible_keys        | key          | key_len | ref             | rows | Extra       |
| 1  | SIMPLE      | c     | const | PRIMARY,idx_customer | idx_customer | 11      | const           | 1    | Using index |
| 1  | SIMPLE      | p     | ref   | PRIMARY,idx_period   | idx_period   | 4       | const           | 6    | Using index |
| 1  | SIMPLE      | m     | ref   | idx_pci,idx_pic      | idx_pci      | 6       | mydb.p.id,const | 7    | Using index |
Explain sum() query
| id | select_type | table | type  | possible_keys        | key          | key_len | ref             | rows | Extra       |
| 1  | SIMPLE      | c     | const | PRIMARY,idx_customer | idx_customer | 11      | const           | 1    | Using index |
| 1  | SIMPLE      | p     | ref   | PRIMARY,idx_period   | idx_period   | 4       | const           | 6    | Using index |
| 1  | SIMPLE      | m     | ref   | idx_pci,idx_pic      | idx_pci      | 6       | mydb.p.id,const | 7    | NULL        |
Why is the count() so much faster than sum()? Shouldn't it be using the index for both?
What can I do to make the sum() go faster?
Thanks in advance!
EDIT
All the tables show that it is using Engine InnoDB
Also, as a side note, if I just do a 'SELECT *' query, this runs very quickly (less than 2 seconds). I would expect that the 'SUM()' shouldn't take any longer than that since SELECT * has to retrieve the rows anyways...
SOLVED
This is what I've learned:
Since the sales field is not part of the index, each record has to be retrieved from the hard drive (which can be kind of slow).
I'm not too familiar with this, but it looks like I/O performance can be improved by switching to an SSD (solid-state drive). I'll have to research this more.
For now, I think I'm going to create another layer of summary in order to get the performance I'm looking for.
I redefined my index on the main table to be (pid,cid,iid,sales,gm,qty) and now the sum() queries are running VERY fast!
Thanks everybody!
An index is essentially a list of the key columns' values.
When you run the count() query, the actual table data can be ignored and just the index used.
When you run the sum(sales) query, each row has to be read from disk to get the sales figure, hence it is much slower.
Additionally, the index can be read in bulk and processed in memory, while the row fetches randomly thrash the drive trying to read rows scattered across the disk.
Finally, the index itself may carry summaries of the counts (to help with plan generation).
Update
You actually have three indexes on your table:
PRIMARY KEY (`id`),
KEY `idx_pci` (`pid`,`cid`,`iid`) USING HASH,
KEY `idx_pic` (`pid`,`iid`,`cid`) USING HASH
So you only have indexes on the columns id, pid, cid, and iid. (As an aside, most databases are smart enough to combine indexes, so you could probably consolidate yours somewhat.)
If you added another key like KEY idx_sales (id, sales), that could improve this query, but every extra index also adds cost to updates, which is likely a bad trade-off.
The simple answer is that count() is only counting rows. This can be satisfied by the index.
The sum() needs to identify each row and then fetch the page in order to get the sales column. This adds a lot of overhead -- about one page load per row.
If you add sales into the index, then it should also go very fast, because it will not have to fetch the original data.
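As a sketch, the covering-index change the asker settled on could be written like this (the index name is my own; note that InnoDB silently builds a BTREE even where the DDL says USING HASH):

ALTER TABLE main
  DROP INDEX idx_pci,
  ADD INDEX idx_pci_covering (pid, cid, iid, sales, gm, qty);

With sales, gm and qty included, SUM(sales) can be answered from the index alone, and the EXPLAIN's Extra column should read 'Using index' again.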

When to add index on joined tables

I have a mysql table with 9 million records that doesn't have any indices set. I need to join this to another table based on a common ID. I'm going to add an index to this ID, but I also have other fields in the select and where clause.
Should I add an index to all of the fields in the where clause?
What about the fields in the select clause? Should I create one index for all fields, or an index per field?
Update - Added tables and query
Here is the query - I need to get the number of sales, the item name, and the item ID per item, based on the store name and store ID (the store name and ID by themselves are not unique):
SELECT COUNT(*) as salescount, items.itemName,
       CONCAT(items.ID, items.productcode) as itemId
FROM items
JOIN sales ON items.itemId = sales.itemId
WHERE items.StoreName = ? AND sales.storeID = ?
GROUP BY items.ItemId
ORDER BY salescount DESC
LIMIT 10;
Here is the sales table:
+----------------+------------------------------+------+-----+---------+-------+
| Field          | Type                         | Null | Key | Default | Extra |
+----------------+------------------------------+------+-----+---------+-------+
| StoreId        | bigint(20) unsigned          | NO   |     | NULL    |       |
| ItemId         | bigint(20) unsigned          | NO   |     | NULL    |       |
+----------------+------------------------------+------+-----+---------+-------+
and the items table:
+--------------------+------------------------------+------+-----+---------+-------+
| Field              | Type                         | Null | Key | Default | Extra |
+--------------------+------------------------------+------+-----+---------+-------+
| ItemId             | bigint(20) unsigned          | NO   | PRI | NULL    |       |
| ProductCode        | bigint(20) unsigned          | NO   |     | NULL    |       |
| ItemName           | varchar(100)                 | NO   |     | NULL    |       |
| StoreName          | varchar(100)                 | NO   | PRI | NULL    |       |
+--------------------+------------------------------+------+-----+---------+-------+
You should index all fields that are searched on: in the leading table, the fields in the WHERE clause; in the driven table, the fields in the WHERE and JOIN clauses.
Making the indexes cover all fields used in the query (including the SELECT and ORDER BY clauses) will also help, since no table lookups will be needed.
Just post your query here and I'll probably be able to tell you how to index the tables.
Update:
Your query will return at most 1 row, with 1 as its COUNT(*).
This will select the sale with the given StoreID (which is the PRIMARY KEY), and join the items on the sale's itemId and given StoreName (this combination is a PRIMARY KEY too).
This join either succeeds (returning 1 row) or fails (returning no rows).
If it succeeds, the COUNT(*) will be 1.
If it's really what you want, then your table is indexed fine.
However, it seems to me that your table design is a little more complex and you just missed some fields when copying the field definitions.
Update 2:
Create a composite index on sales (storeId, itemId)
Make sure that your PRIMARY KEY on items is defined as (StoreName, ItemId) (in that order).
If the PK is instead defined as (ItemId, StoreName), then create an index on items (StoreName, ItemId).
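A sketch of Update 2 as DDL (using the column names from the question; the index names are my own):

-- Composite index for the sales side of the join:
ALTER TABLE sales ADD INDEX idx_store_item (StoreId, ItemId);

-- Only needed if the items PK is (ItemId, StoreName) rather than (StoreName, ItemId):
ALTER TABLE items ADD INDEX idx_storename_item (StoreName, ItemId);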
Yes, you really should have indexes, but they should be appropriate for all your queries. Without having a good rummage about in your database, it's difficult to recommend exactly which indexes to configure.
9 million rows is enough that indexes will make a big difference -- but not so big that you can't afford to tinker a bit.
A crude approach would be to create indexes on items(StoreName), items(ItemId, StoreName), items(StoreName, ItemId), sales(ItemId), sales(StoreId), sales(ItemId, StoreId) and sales(StoreId, ItemId), then drop the ones that aren't getting used.
C.
Indexing is great -- when used in the correct form. Remember, an index only helps if your queries can actually use it.
Concentrate your indexes on your primary and shared keys, as well as fields that see heavy and frequent comparisons, such as literal fields and date ranges.
Indexes are great when used correctly, but they aren't a cure-all. Even properly indexed tables can be brought to their knees by a bad query and a flick of the wrist.