How to make this complicated query faster? [MySQL]

I have the following query:
SELECT JL.j_id, COUNT(*) as total
FROM j_log JL
WHERE JL.log_time > '20120205164008'
AND JL.j_id IN (
SELECT j_id
FROM j
WHERE checked = '1'
AND expires >= '20120207164008'
) GROUP BY JL.j_id ORDER BY total DESC LIMIT 3
The j table has a big structure: 100 fields and 248,986 rows.
The following keys are present on it:
PRIMARY KEY (`j_id`),
KEY `expires` (`expires`),
KEY `checked` (`checked`),
KEY `checked_2` (`checked`,`expires`)
The j_log table has about 63,000,000 records and the following structure:
CREATE TABLE `j_log` (
`j_id` int(11) NOT NULL DEFAULT '0',
`member_id` int(11) DEFAULT NULL,
`ip` int(10) unsigned NOT NULL DEFAULT '0',
`log_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
KEY `j_id` (`j_id`),
KEY `log_time` (`log_time`),
KEY `ip` (`ip`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
So the query in question wants to get the top 3 most-visited j_id values.
This is the plan:
+----+--------------------+-------+-----------------+-----------------------------------+---------+---------+------+----------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+-------+-----------------+-----------------------------------+---------+---------+------+----------+----------+----------------------------------------------+
| 1 | PRIMARY | JL | index | log_time | j_id | 4 | NULL | 63914602 | 0.36 | Using where; Using temporary; Using filesort |
| 2 | DEPENDENT SUBQUERY | j | unique_subquery | PRIMARY,expires,checked,checked_2 | PRIMARY | 4 | func | 1 | 100.00 | Using where |
+----+--------------------+-------+-----------------+-----------------------------------+---------+---------+------+----------+----------+----------------------------------------------+
Sometimes it can take up to 15 (!) minutes.
Is there any way to make it faster?

SELECT JL.j_id, COUNT(*) as total
FROM j_log JL
INNER JOIN j
ON JL.j_id = j.j_id
AND j.checked = '1'
AND j.expires >= '20120207164008'
WHERE JL.log_time > '20120205164008'
GROUP BY JL.j_id
ORDER BY total DESC
LIMIT 3
Will this be faster?
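Independently of the rewrite, a composite index on j_log may be worth testing, since with the existing single-column keys MySQL has to choose between the j_id and log_time indexes. A hedged sketch (the index name is arbitrary):

ALTER TABLE j_log ADD KEY j_id_log_time (j_id, log_time);

With (j_id, log_time) the per-j_id counts for the date range can be read from the index alone, which should cut down on the temporary table and filesort work, though only an EXPLAIN against your data will confirm it.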

Why do you use a subquery?
Why is checked a string? ('1' instead of just 1)
Why do you compare jl.log_time and j.expires differently (> vs >=)?
How about this query:
SELECT j.j_id, COUNT(jl.j_id) as total
FROM j
LEFT JOIN j_log jl ON (jl.j_id = j.j_id AND jl.log_time > '20120205164008')
WHERE j.checked = '1'
  AND j.expires >= '20120207164008'
GROUP BY j.j_id
ORDER BY total DESC
LIMIT 3
Make sure j_id is the PRIMARY KEY for both tables and put an index on j.expires, j.checked, and jl.log_time.
Also make sure the checked field is optimized. I'm not sure what the possible values can be, but I assume it's a boolean field, so make the field type BIT or use an ENUM.
Edit
Also, you should convert the fields j.expires and jl.log_time to better column types. Looking at the value you use (20120205164008), j.expires appears to be just a varchar now. Convert it into a DATETIME field (but don't just convert the column in place, because you will lose the data).
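If you do migrate, a cautious path is to add a new column, backfill it, verify, and only then drop the old one. A sketch for j.expires, assuming it really is a varchar holding values in the YYYYMMDDHHMMSS format shown (adjust names and format to your schema):

ALTER TABLE j ADD COLUMN expires_dt DATETIME NULL;
UPDATE j SET expires_dt = STR_TO_DATE(expires, '%Y%m%d%H%i%s');
-- verify expires_dt, recreate any indexes that referenced expires, then:
ALTER TABLE j DROP COLUMN expires;
ALTER TABLE j CHANGE expires_dt expires DATETIME NOT NULL;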

Related

SQL Query is slow and not using indexes

SELECT `productTitle`, `orderCnt`, `promPCPriceStr`,
`productImgUrl`, `oriPriceStr`, `detailUrl`,
(SELECT count(id) FROM orders t4
WHERE t4.productId = t1.productId
AND DATE( t4.`date`) > DATE_SUB(CURDATE(), INTERVAL 2 DAY)
) as ordertoday
FROM `products` t1
WHERE `orderCnt` > 0
AND `orderCnt` < 2000
AND `promPCPriceStr` > 0
AND `promPCPriceStr` < 2000
HAVING ordertoday > 5 AND ordertoday < 2000
order by ordertoday desc limit 150
This query takes 18 seconds to finish. When I run the EXPLAIN command on it, it shows that it does not use the index keys!
The tables used:
Products Table
CREATE TABLE `products` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`productId` bigint(20) NOT NULL,
`detailUrl` text CHARACTER SET utf32 NOT NULL,
`belongToDSStore` int(11) NOT NULL,
`promPCPriceStr` float NOT NULL DEFAULT '-1',
`oriPriceStr` float NOT NULL DEFAULT '-1',
`orderCnt` int(11) NOT NULL,
`productTitle` text CHARACTER SET utf32 NOT NULL,
`productImgUrl` text CHARACTER SET utf32 NOT NULL,
`created_date` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
`cat` bigint(20) NOT NULL DEFAULT '-1',
PRIMARY KEY (`id`),
UNIQUE KEY `productId` (`productId`),
KEY `orderCnt` (`orderCnt`),
KEY `cat` (`cat`),
KEY `promPCPriceStr` (`promPCPriceStr`)
) ENGINE=InnoDB AUTO_INCREMENT=37773 DEFAULT CHARSET=latin1
Orders Table
CREATE TABLE `orders` (
`oid` int(11) NOT NULL AUTO_INCREMENT,
`countryCode` varchar(10) NOT NULL,
`date` datetime NOT NULL,
`id` bigint(20) NOT NULL,
`productId` bigint(20) NOT NULL,
PRIMARY KEY (`oid`),
UNIQUE KEY `id` (`id`),
KEY `date` (`date`),
KEY `productId` (`productId`)
) ENGINE=InnoDB AUTO_INCREMENT=9790205 DEFAULT CHARSET=latin1
MySQL won't use an index even if one exists on a column you search, if the values you search for appear on a large subset of the rows.
I did a test with MySQL 5.6. I created table with ~1,000,000 rows, with a column x with random values evenly distributed between 1 and 1000. There's an index on column x.
Depending on my search terms, I see the index is used if I search for a range of values matching a small enough subset of rows, otherwise it decides using the index is too much trouble, and just does a table-scan:
mysql> explain select * from foo where x < 50;
+----+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
| 1 | SIMPLE | foo | range | x | x | 4 | NULL | 102356 | Using index condition |
+----+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
mysql> explain select * from foo where x < 100;
+----+-------------+-------+------+---------------+------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+---------+-------------+
| 1 | SIMPLE | foo | ALL | x | NULL | NULL | NULL | 1046904 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+---------+-------------+
I would infer that your query's search conditions match a pretty large portion of the rows, and MySQL decides the indexes on these columns are not worth using.
WHERE `orderCnt` > 0
AND `orderCnt` < 2000
AND `promPCPriceStr` > 0
AND `promPCPriceStr` < 2000
If you think MySQL is making the wrong choice, you can try to use an index hint to tell MySQL that a table-scan is prohibitively expensive. This will urge it to use the index (if the index is relevant to the search condition).
mysql> explain select * from foo force index (x) where x < 100;
+----+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
| 1 | SIMPLE | foo | range | x | x | 4 | NULL | 216764 | Using index condition |
+----+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
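If you want to test the same idea on the query from the question, the hint goes right after the table name. A sketch (the correlated subquery is omitted for brevity; whether forcing the orderCnt index actually helps is something only your data and EXPLAIN can tell):

SELECT `productTitle`, `orderCnt`, `promPCPriceStr`,
       `productImgUrl`, `oriPriceStr`, `detailUrl`
FROM `products` t1 FORCE INDEX (`orderCnt`)
WHERE `orderCnt` > 0 AND `orderCnt` < 2000
  AND `promPCPriceStr` > 0 AND `promPCPriceStr` < 2000;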
I would write the query this way, without any subquery:
SELECT t.productTitle, t.orderCnt, t.promPCPriceStr,
t.productImgUrl, t.oriPriceStr, t.detailUrl,
COUNT(o.id) AS orderToday
FROM products t
LEFT JOIN orders o ON t.productid = o.productid AND o.date > CURDATE() - INTERVAL 2 DAY
WHERE t.orderCnt > 0 AND t.orderCnt < 2000
AND t.promPCPriceStr > 0 AND t.promPCPriceStr < 2000
GROUP BY t.productid
HAVING ordertoday > 5 AND ordertoday < 2000
ORDER BY ordertoday DESC LIMIT 150
When I EXPLAIN the query, I get this report:
+----+-------------+-------+------+-----------------------------------+-----------+---------+------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+-----------------------------------+-----------+---------+------------------+------+----------------------------------------------+
| 1 | SIMPLE | t | ALL | productId,orderCnt,promPCPriceStr | NULL | NULL | NULL | 9993 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | o | ref | date,productId | productId | 8 | test.t.productId | 1 | Using where |
+----+-------------+-------+------+-----------------------------------+-----------+---------+------------------+------+----------------------------------------------+
It still does a table-scan for products but it joins the relevant matching rows in orders with an index lookup instead of a correlated subquery.
I filled my tables with random data, to make 98,846 product rows and 215,508 orders rows. When I run the query it takes about 0.18 seconds.
Then again, when I run your query with the correlated subquery, it takes 0.06 seconds. I don't know why your query is so slow; you could be running on an underpowered server.
I'm running my test on a Macbook Pro 2017 with an i7 CPU and 16GB of RAM.
In both tables, it is counterproductive to have both an AUTO_INCREMENT PRIMARY KEY and a BIGINT column that is UNIQUE. Get rid of the AI column and promote the other to PK. This may require changing some of your code, since the AI column is gone.
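A sketch of that change, assuming nothing else (foreign keys, application code) still depends on the id/oid columns:

ALTER TABLE products DROP COLUMN id;   -- the old PK goes away with its column
ALTER TABLE products ADD PRIMARY KEY (productId), DROP INDEX productId;

ALTER TABLE orders DROP COLUMN oid;
ALTER TABLE orders ADD PRIMARY KEY (id), DROP INDEX id;

Dropping the AUTO_INCREMENT column removes the old primary key along with it, and the former UNIQUE key becomes redundant once its column is promoted to primary key.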
As for the subquery...
(SELECT count(id) FROM orders t4
WHERE t4.productId = t1.productId
AND DATE( t4.`date`) > DATE_SUB(CURDATE(), INTERVAL 2 DAY)
) as ordertoday
Change COUNT(id) to COUNT(*) unless you need to check id for being NOT NULL (which I doubt).
The date column is hidden in a function call, so no index will be useful. So, change the date test to
AND t4.`date` > CURDATE() - INTERVAL 2 DAY
Then add this composite index. (It will help with Karwin's reformulation, too).
INDEX(productId, date)
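In ALTER form that would be something like:

ALTER TABLE orders ADD INDEX product_date (productId, `date`);

With the date test rewritten as above, this composite index lets MySQL seek directly to the last two days of rows for each productId instead of scanning all of that product's orders.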

Optimize and speed up MySQL query selection

I'm trying to figure out the best way to optimize my current selection query on a MySQL database.
I have 2 MySQL tables with a one-to-many relationship. One is the user table, which contains the unique list of users and has around 22k rows. The other is the linedata table, which contains all the possible coordinates for each user and has around 490k rows.
In this case we can assume the foreign key between the 2 tables is the id value. In the user table the id is also the auto-increment primary key, while in the linedata table it is not the primary key, because we can have multiple rows for the same user.
The CREATE STMT structure
CREATE TABLE `user` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`isActive` tinyint(4) NOT NULL,
`userId` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`name` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`gender` varchar(45) COLLATE utf8_unicode_ci NOT NULL,
`age` int(11) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=21938 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
CREATE TABLE `linedata` (
`id` int(11) NOT NULL,
`userId` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`timestamp` datetime NOT NULL,
`x` float NOT NULL,
`y` float NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
The selection query
SELECT
u.id,
u.isActive,
u.userId,
u.name,
u.gender,
u.age,
GROUP_CONCAT(CONCAT_WS(', ',timestamp,x, y)
ORDER BY timestamp ASC SEPARATOR '; '
) as linedata_0
FROM user u
JOIN linedata l
ON u.id=l.id
WHERE DATEDIFF(l.timestamp, '2018-02-28T20:00:00.000Z') >= 0
AND DATEDIFF(l.timestamp, '2018-11-20T09:20:08.218Z') <= 0
GROUP BY userId;
The EXPLAIN output
+-------+---------------+-----------+-----------+-------------------+-----------+---------------+-----------+-----------+------------------------------------------------------------+
| ID    | SELECT_TYPE   | TABLE     | TYPE      | POSSIBLE_KEYS     | KEY       | KEY_LEN       | REF       | ROWS      | EXTRA                                                      |
+-------+---------------+-----------+-----------+-------------------+-----------+---------------+-----------+-----------+------------------------------------------------------------+
| 1     | SIMPLE        | l         | ALL       | NULL              | NULL      | NULL          | NULL      | 491157    | "Using where; Using temporary; Using filesort"             |
| 1     | SIMPLE        | u         | eq_ref    | PRIMARY           | PRIMARY   | 4             | l.id      | 1         | NULL                                                       |
+-------+---------------+-----------+-----------+-------------------+-----------+---------------+-----------+-----------+------------------------------------------------------------+
The selection query works if, for example, I add another WHERE condition to filter for specific users. Let's say I select just 200 users: then the execution time is around 14 seconds, and around 7 seconds if I select just the first 100 users. But with only the datetime range condition it seems to load without ever finishing. Any suggestions?
UPDATE
After following Rick's suggestions the query benchmark is now around 14 seconds. Below is the EXPLAIN EXTENDED output:
id,select_type,table,type,possible_keys,key,key_len,ref,rows,filtered,Extra
1,PRIMARY,u,index,PRIMARY,PRIMARY,4,NULL,21959,100.00,NULL
1,PRIMARY,l,ref,id_timestamp_index,id_timestamp_index,4,u.id,14,100.00,"Using index condition"
2,"DEPENDENT SUBQUERY",NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,"No tables used"
I have changed some of the table values a bit:
The id in the user table can now be joined with userId in the linedata table, and both are integers. Only the userId value in the user table remains a string, because it is a sort of long string identifier like 0000309ab2912b2fd34350d7e6c079846bb6c5e1f97d3ccb053d15061433e77a_0.
So, just to give a quick example, we will have in the user and linedata tables:
+-------+-----------+-----------+-------------------+--------+---+
| id | isActive | userId | name | gender |age|
+-------+-----------+-----------+-------------------+--------+---+
| 1 | 1 | x4by4d | john | m | 22|
| 2 | 1 | 3ub3ub | bob | m | 50|
+-------+-----------+-----------+-------------------+--------+---+
+-------+-----------+-----------+------+---+
| id | userId |timestamp | x | y |
+-------+-----------+-----------+------+----+
| 1 | 1 | somedate | 30 | 10 |
| 2 | 1 | somedate | 45 | 15 |
| 3 | 1 | somedate | 50 | 20 |
| 4 | 2 | somedate | 20 | 5 |
| 5 | 2 | somedate | 25 | 10 |
+-------+-----------+-----------+------+----+
I have added a compound index made of userId and timestamp values in linedata table.
Maybe, instead of having an auto-increment id as the primary key for the linedata table, I could add a composite primary key made of userId+timestamp? Would that increase the performance, or maybe not?
Let me help you fix several bugs before discussing performance.
First of all, '2018-02-28T20:00:00.000Z' won't work in MySQL. It needs to be '2018-02-28 20:00:00.000' and something needs to be done about the timezone.
Then, don't "hide a column in a function". That is DATEDIFF(l.timestamp ...) cannot use any indexing on timestamp.
So, instead of
WHERE DATEDIFF(l.timestamp, '2018-02-28T20:00:00.000Z') >= 0
AND DATEDIFF(l.timestamp, '2018-11-20T09:20:08.218Z') <= 0
do something like
WHERE l.timestamp >= '2018-02-28 20:00:00.000'
AND l.timestamp < '2018-11-20 09:20:08.218'
I'm confused about the two tables. Both have id and userid, yet you join on id. Perhaps instead of
CREATE TABLE `linedata` (
`id` int(11) NOT NULL,
`userId` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
...
you meant
CREATE TABLE `linedata` (
`id` int(11) NOT NULL AUTO_INCREMENT, -- (the id for `linedata`)
`userId` int NOT NULL, -- to link to the other table
...
PRIMARY KEY(id)
...
Then there could be several linedata rows for each user.
At that point, this
JOIN linedata l ON u.id=l.id
becomes
JOIN linedata l ON u.id=l.userid
Now, for performance: linedata needs INDEX(userid, timestamp) - in that order.
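In DDL terms, something like (the index name is arbitrary):

ALTER TABLE linedata ADD INDEX user_ts (userId, `timestamp`);

userId goes first so that the equality on the user narrows the index range before the timestamp filter is applied.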
Now, think about the output. You are asking for up to 22K rows, with possibly hundreds of "ts,x,y" strung together in one of the columns. What will receive this much data? Will it choke on it?
And GROUP_CONCAT has a default limit of 1024 bytes. That will allow for about 50 points. If a 'user' can be in more than 50 spots within the selected date range, consider increasing group_concat_max_len before running the query.
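For example, in the same session, just before the SELECT (pick a limit that fits your expected output size):

SET SESSION group_concat_max_len = 1000000;

Without it, the concatenated string is silently truncated at the default 1024 bytes.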
To make it work even faster, reformulate it this way:
SELECT u.id, u.isActive, u.userId, u.name, u.gender, u.age,
( SELECT GROUP_CONCAT(CONCAT_WS(', ', l.timestamp, l.x, l.y)
ORDER BY l.timestamp ASC
SEPARATOR '; ')
FROM linedata l
WHERE l.userId = u.id
AND l.timestamp >= '2018-02-28 20:00:00.000'
AND l.timestamp < '2018-11-20 09:20:08.218'
) as linedata_0
FROM user u
HAVING linedata_0 IS NOT NULL;
Another thing. You probably want to be able to look up a user by name; so add INDEX(name)
Oh, what the heck is the VARCHAR(255) for userID?? Ids are normally integers.

Optimizing an InnoDB table and a problematic query

I have a biggish InnoDB table which at this moment contains about 20 million rows with ~20000 new rows inserted every day. They contain messages for different topics.
CREATE TABLE IF NOT EXISTS `Messages` (
`ID` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`TopicID` bigint(20) unsigned NOT NULL,
`DATESTAMP` int(11) DEFAULT NULL,
`TIMESTAMP` int(10) unsigned NOT NULL,
`Message` mediumtext NOT NULL,
`Checksum` varchar(50) DEFAULT NULL,
`Nickname` varchar(80) NOT NULL,
PRIMARY KEY (`ID`),
UNIQUE KEY `TopicID` (`TopicID`,`Checksum`),
KEY `DATESTAMP` (`DATESTAMP`),
KEY `Nickname` (`Nickname`),
KEY `TIMESTAMP` (`TIMESTAMP`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=25195126 ;
NOTE: The Checksum column stores an MD5 checksum which prevents the same message from being inserted twice into the same topic (nickname + timestamp + topicid + last 20 chars of message).
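In SQL terms, a checksum of that shape corresponds roughly to the following (a simplified sketch; the exact expression used by the application may differ):

SELECT MD5(CONCAT(Nickname, `TIMESTAMP`, TopicID, RIGHT(Message, 20)))
FROM Messages
LIMIT 1;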
The site I'm building has a newsfeed in which users can select to view newest messages from different Nicknames from different forums. The query is as follows:
SELECT
Messages.ID AS MessageID,
Messages.Message,
Messages.TIMESTAMP,
Messages.Nickname,
Topics.ID AS TopicID,
Topics.Title AS TopicTitle,
Forums.Title AS ForumTitle
FROM Messages
JOIN FollowedNicknames ON FollowedNicknames.UserID = 'MYUSERID'
JOIN Forums ON Forums.ID = FollowedNicknames.ForumID
JOIN Subforums ON Subforums.ForumID = Forums.ID
JOIN Topics ON Topics.SubforumID = Subforums.ID
WHERE
Messages.Nickname = FollowedNicknames.Nickname AND
Messages.TopicID = Topics.ID AND Messages.DATESTAMP = '2013619'
ORDER BY Messages.TIMESTAMP DESC
The TIMESTAMP column contains a Unix timestamp, and DATESTAMP is simply a date generated from the Unix timestamp for faster access via the '=' operator instead of range scans over Unix timestamps.
The problem is, this query takes about 13 seconds (or more) unbuffered. That is of course unacceptable for the intended usage. Adding the DATESTAMP seemed to speed things up, but not by much.
At this point, I don't really know what should I do. I've read about composite primary keys, but I am still unsure whether they would do any good and how to correctly implement one in this particular case.
I know that using BIGINTs may be a little overkill, but do they affect that much?
EXPLAIN:
+----+-------------+-----------------------+--------+---------------------------------------+------------+---------+-----------------------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------------------+--------+---------------------------------------+------------+---------+-----------------------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | FollowedNicknames | ALL | UserID,ForumID,Nickname | NULL | NULL | NULL | 8 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | Forums | eq_ref | PRIMARY | PRIMARY | 8 | database.FollowedNicknames.ForumiID | 1 | NULL |
| 1 | SIMPLE | Messages | ref | TopicID,DATETIME,Nickname | Nickname | 242 | database.FollowedNicknames.Nickname | 15 | Using where |
| 1 | SIMPLE | Topics | eq_ref | PRIMARY,SubforumID | PRIMARY | 8 | database.Messages.TopicID | 1 | NULL |
| 1 | SIMPLE | Subforums | eq_ref | PRIMARY,ForumID | PRIMARY | 8 | database.Topics.SubforumID | 1 | Using where |
+----+-------------+-----------------------+--------+---------------------------------------+------------+---------+-----------------------------------------------+------+----------------------------------------------+
You shouldn't be JOINing on a VARCHAR column (Nickname); you should use the user ID to join those tables. That is definitely slowing the query down and is probably the biggest issue. It would also be easier to follow if you wrote all of the JOINs explicitly instead of at the end in the WHERE clause like this:
SELECT
Messages.ID AS MessageID,
Messages.Message,
Messages.TIMESTAMP,
Messages.Nickname,
Topics.ID AS TopicID,
Topics.Title AS TopicTitle,
Forums.Title AS ForumTitle
FROM Messages
JOIN FollowedNicknames ON Messages.Nickname = FollowedNicknames.Nickname
AND FollowedNicknames.UserID = 'MYUSERID'
JOIN Forums ON Forums.ID = FollowedNicknames.ForumID
JOIN Subforums ON Subforums.ForumID = Forums.ID
JOIN Topics ON Messages.TopicID = Topics.ID
AND Topics.SubforumID = Subforums.ID
WHERE Messages.DATESTAMP = '2013619'
ORDER BY Messages.TIMESTAMP DESC
Instead of INT as the data type for the DATESTAMP column, I would use DATE. The Checksum column should probably use latin1_general_ci as the collation. I would use INT for the ID columns as long as their values are less than 2,000,000,000 since INT UNSIGNED can store values up to roughly 4,000,000,000. InnoDB is affected by the primary key much more than MyISAM and it could make a noticeable difference.
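A sketch of those type changes (each one rewrites a ~20-million-row table, so test on a copy first; matching columns in the other tables, e.g. Topics.ID, would need the same treatment):

ALTER TABLE Messages
  MODIFY ID int unsigned NOT NULL AUTO_INCREMENT,
  MODIFY TopicID int unsigned NOT NULL,
  MODIFY Checksum varchar(50) CHARACTER SET latin1 COLLATE latin1_general_ci DEFAULT NULL;

-- DATESTAMP holds values like 2013619, so it cannot simply be MODIFYed to DATE;
-- backfill a new DATE column from the Unix timestamp instead:
ALTER TABLE Messages ADD COLUMN DateStampNew DATE NULL;
UPDATE Messages SET DateStampNew = DATE(FROM_UNIXTIME(`TIMESTAMP`));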

Estimate/speedup huge table self-join on mysql

I have a huge table:
CREATE TABLE `messageline` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`hash` bigint(20) DEFAULT NULL,
`quoteLevel` int(11) DEFAULT NULL,
`messageDetails_id` bigint(20) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `FK2F5B707BF7C835B8` (`messageDetails_id`),
KEY `hash_idx` (`hash`),
KEY `quote_level_idx` (`quoteLevel`),
CONSTRAINT `FK2F5B707BF7C835B8` FOREIGN KEY (`messageDetails_id`) REFERENCES `messagedetails` (`id`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB AUTO_INCREMENT=401798068 DEFAULT CHARSET=utf8 COLLATE=utf8_bin
I need to find duplicate lines this way:
create table foundline AS
select ml.messagedetails_id, ml.hash, ml.quotelevel
from messageline ml,
messageline ml1
where ml1.hash = ml.hash
and ml1.messagedetails_id!=ml.messagedetails_id
But this query has been running for more than a day already. That is too long; a few hours would be OK. How can I speed this up? Thanks.
Explain:
+----+-------------+-------+------+---------------+----------+---------+---------------+-----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+----------+---------+---------------+-----------+-------------+
| 1 | SIMPLE | ml | ALL | hash_idx | NULL | NULL | NULL | 401798409 | |
| 1 | SIMPLE | ml1 | ref | hash_idx | hash_idx | 9 | skryb.ml.hash | 1 | Using where |
+----+-------------+-------+------+---------------+----------+---------+---------------+-----------+-------------+
You can find your duplicates like this
SELECT messagedetails_id, COUNT(*) c
FROM messageline ml
GROUP BY messagedetails_id HAVING c > 1;
If it is still too long, add a condition to split the request on an indexed field :
WHERE messagedetails_id < 100000
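For example, one batch of the scan could look like this (the 100000-wide range is arbitrary; repeat with the next range until the whole table is covered):

SELECT messagedetails_id, COUNT(*) c
FROM messageline ml
WHERE messagedetails_id >= 0 AND messagedetails_id < 100000
GROUP BY messagedetails_id HAVING c > 1;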
Is it required to do this solely with SQL? For such a number of records you would be better off breaking this down into 2 steps.
First, run the following query:
CREATE TABLE duplicate_hashes
SELECT * FROM (
SELECT hash, GROUP_CONCAT(id) AS ids, COUNT(*) AS cnt,
COUNT(DISTINCT messagedetails_id) AS cnt_message_details,
GROUP_CONCAT(DISTINCT messagedetails_id) as messagedetails_ids
FROM messageline GROUP BY hash HAVING cnt > 1 ORDER BY NULL
) tmp
WHERE cnt > cnt_message_details
This will give you the duplicate IDs for each hash, and since you have an index on the hash field, the grouping will be relatively fast. By counting distinct messagedetails_id values and comparing, you implicitly fulfill the requirement for different messagedetails_id:
where ml1.hash = ml.hash
and ml1.messagedetails_id!=ml.messagedetails_id
Use a script to check each record of the duplicate_hashes table.
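If you would rather stay in SQL for that last step, the flagged hashes can simply be joined back to the big table; since duplicate_hashes is far smaller than messageline, the join only touches rows already identified as suspicious (a sketch):

SELECT ml.messagedetails_id, ml.hash, ml.quotelevel
FROM duplicate_hashes dh
JOIN messageline ml ON ml.hash = dh.hash;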

Help: Optimize this query in MySQL

These are my tables; the AUTO_INCREMENT value shows the size of each:
tbl_clientes:
CREATE TABLE `tbl_clientes` (
`int_clientes_id_pk` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`str_clientes_documento` varchar(255) DEFAULT NULL,
`str_clientes_nome_original` char(255) DEFAULT NULL,
PRIMARY KEY (`int_clientes_id_pk`),
UNIQUE KEY `str_clientes_documento` (`str_clientes_documento`),
KEY `str_clientes_nome_original` (`str_clientes_nome_original`),
KEY `nome_original_cliente_id` (`str_clientes_nome_original`,`int_clientes_id_pk`),
KEY `cliente_id_nome_original` (`int_clientes_id_pk`,`str_clientes_nome_original`)
) ENGINE=MyISAM AUTO_INCREMENT=2815520 DEFAULT CHARSET=utf8
tbl_clienteEnderecos:
CREATE TABLE `tbl_clienteEnderecos` (
`int_clienteEnderecos_id_pk` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`int_clienteEnderecos_cliente_id_fk` bigint(20) unsigned NOT NULL,
`str_clienteEnderecos_endereco` varchar(255) NOT NULL,
`str_clienteEnderecos_cep` varchar(255) DEFAULT NULL,
`str_clienteEnderecos_numero` varchar(255) DEFAULT NULL,
`str_clienteEnderecos_complemento` varchar(255) DEFAULT NULL,
`str_clienteEnderecos_bairro` varchar(255) DEFAULT NULL,
`str_clienteEnderecos_cidade` varchar(255) DEFAULT NULL,
`str_clienteEnderecos_uf` varchar(2) DEFAULT NULL,
`int_clienteEnderecos_correspondencia` tinyint(1) NOT NULL DEFAULT '0',
`int_clienteEnderecos_tipo` int(11) NOT NULL DEFAULT '1',
PRIMARY KEY (`int_clienteEnderecos_id_pk`),
KEY `int_clienteEnderecos_cliente_id_fk` (`int_clienteEnderecos_cliente_id_fk`),
KEY `str_clienteEnderecos_cidade` (`str_clienteEnderecos_cidade`),
KEY `str_clienteEnderecos_uf` (`str_clienteEnderecos_uf`),
KEY `uf_cidade` (`str_clienteEnderecos_uf`,`str_clienteEnderecos_cidade`)
) ENGINE=MyISAM AUTO_INCREMENT=1542038 DEFAULT CHARSET=utf8
Then I run this query to search; it is fast and uses indexes:
EXPLAIN
SELECT * FROM tbl_clientes LEFT JOIN tbl_clienteEnderecos ON int_clienteEnderecos_cliente_id_fk = int_clientes_id_pk
GROUP BY str_clientes_nome_original, int_clientes_id_pk
ORDER BY str_clientes_nome_original, int_clientes_id_pk
LIMIT 0,20
The result of EXPLAIN is:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+-------+------------------------------------+------------------------------------+---------+---------------------------------------------------+------+-------+
| 1 | SIMPLE | tbl_clientes | index | NULL | nome_original_cliente_id | 774 | NULL | 20 | |
| 1 | SIMPLE | tbl_clienteEnderecos | ref | int_clienteEnderecos_cliente_id_fk | int_clienteEnderecos_cliente_id_fk | 8 | mydb.tbl_clientes.int_clientes_id_pk | 1 | |
+----+-------------+----------------------+-------+------------------------------------+------------------------------------+---------+---------------------------------------------------+------+-------+
All right, but I need to filter by tbl_clienteEnderecos.str_clienteEnderecos_uf. It breaks all the indexes, using a temporary table and a filesort (no index). Here's the query:
EXPLAIN
SELECT * FROM tbl_clientes LEFT JOIN tbl_clienteEnderecos ON int_clienteEnderecos_cliente_id_fk = int_clientes_id_pk
WHERE str_clienteEnderecos_uf = "SP"
GROUP BY str_clientes_nome_original, int_clientes_id_pk
ORDER BY str_clientes_nome_original, int_clientes_id_pk
LIMIT 0,20
Look, this is the output of EXPLAIN:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+--------+----------------------------------------------------------------------+-----------+---------+---------------------------------------------------------------------------+--------+----------------------------------------------+
| 1 | SIMPLE | tbl_clienteEnderecos | ref | int_clienteEnderecos_cliente_id_fk,str_clienteEnderecos_uf,uf_cidade | uf_cidade | 9 | const | 670654 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | tbl_clientes | eq_ref | PRIMARY,cliente_id_nome_original | PRIMARY | 8 | mydb.tbl_clienteEnderecos.int_clienteEnderecos_cliente_id_fk | 1 | |
+----+-------------+----------------------+--------+----------------------------------------------------------------------+-----------+---------+---------------------------------------------------------------------------+--------+----------------------------------------------+
With this Using where; Using temporary; Using filesort it can't be fast. I've tried a lot of things; how can I optimize this query?
Is it time to switch to NoSQL/MongoDB?
MySQL will typically not use an index if it will not help narrow the results down enough. It appears that "SP" occurs in roughly 670654 rows. Since this is about 1/3 of your total rows, it is more efficient to read it in disk order.
You can try adding an index to tbl_clienteEnderecos:
KEY `test` (`str_clienteEnderecos_uf`, `int_clienteEnderecos_cliente_id_fk`)
This might be enough to get it to use the index.
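As a statement, that would be (the key name test is just a placeholder):

ALTER TABLE tbl_clienteEnderecos
  ADD KEY `test` (`str_clienteEnderecos_uf`, `int_clienteEnderecos_cliente_id_fk`);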
What is the difference between these two columns? They look like they should be the same.
int_clienteEnderecos_id_pk
int_clienteEnderecos_cliente_id_fk
Edit
I understand what the names of the columns imply. I was just curious if the two values should be identical. If they are, it would simplify a few things and have them be joined on the primary key of the tables. I am not sure about the specific meaning of the tables involved, so I don't know if there is a 1-1 or 1-0 relationship between them or a one to many relationship.
I suggest trying to retrieve just the primary key of the tables that you want. For instance, instead of select * try:
EXPLAIN
SELECT int_clienteEnderecos_id_pk, int_clientes_id_pk
FROM tbl_clientes
LEFT JOIN tbl_clienteEnderecos ON int_clienteEnderecos_cliente_id_fk = int_clientes_id_pk
WHERE str_clienteEnderecos_uf = "SP"
GROUP BY str_clientes_nome_original, int_clientes_id_pk
ORDER BY str_clientes_nome_original, int_clientes_id_pk
LIMIT 0,20
If this works out the way I hope it will, you will see "Using index" in the Extra column. If you need additional fields returned, you can either make another round trip to fetch them, or add them to your index. Or use a nested query to fetch them based on the results of the query above.
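The second round trip could be as simple as fetching the full rows by the primary keys the narrow query returned (a sketch; the IN list stands for whatever IDs came back):

SELECT *
FROM tbl_clientes
LEFT JOIN tbl_clienteEnderecos ON int_clienteEnderecos_cliente_id_fk = int_clientes_id_pk
WHERE int_clientes_id_pk IN (1, 2, 3)  -- ids from the previous query
ORDER BY str_clientes_nome_original, int_clientes_id_pk;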
Also, why are you grouping by and ordering by the same thing? Are you expecting multiple matches of the foreign key?
I'd suggest giving the following a try; the subquery might use the key better than the join in this context. Take care, though; I couldn't swear on a stack of K & R's that the query is the same as your original.
SELECT *,
(SELECT int_clienteEnderecos_id_pk
FROM tbl_clienteEnderecos
WHERE int_clienteEnderecos_cliente_id_fk = int_clientes_id_pk AND
str_clienteEnderecos_uf = "SP"
LIMIT 1) AS T2
FROM tbl_clientes
GROUP BY str_clientes_nome_original, int_clientes_id_pk
HAVING T2 IS NOT NULL
ORDER BY str_clientes_nome_original, int_clientes_id_pk
LIMIT 0, 20