Selecting distinct values from a join of two large tables

I have an animals table with about 3 million records. The table has, among a few other columns, an id, name, and owner_id column. I have an animal_breeds table with about 2.5 million records. The table only has an animal_id and breed column.
I'm trying to find the distinct breed values that are associated with a specific owner_id, but the query is taking 20 seconds or so. Here's the query:
SELECT DISTINCT `breed`
FROM `animal_breeds`
INNER JOIN `animals` ON `animals`.`id` = `animal_breeds`.`animal_id`
WHERE `animals`.`owner_id` = ? ;
The tables have all appropriate indices. I can't denormalize the table by adding a breed column to the animals table because it is possible for animals to be assigned multiple breeds. I also have this problem with a few other large tables that have one-to-many relationships.
Is there a more performant way to achieve what I'm looking for? It seems like a pretty simple problem but I can't seem to figure out the best way to achieve this other than pre-calculating and caching the results.
Here is the EXPLAIN output from my query. Notice the "Using temporary" in the Extra column:
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 "SIMPLE" "a" NULL "ref" "PRIMARY,animals_animal_id_index" "animals_animal_id_index" "153" "const" 1126303 100.00 "Using index; Using temporary"
1 "SIMPLE" "ab" NULL "ref" "animal_breeds_animal_id_breed_unique,animal_breeds_animal_id_index,animal_breeds_breed_index" "animal_breeds_animal_id_breed_unique" "5" "pedigreeonline.a.id" 1 100.00 "Using index"
And as requested, here are the create table statements (I left off a few unrelated columns and indices from the animals table). I believe the animal_breeds_animal_id_index index on animal_breeds table is redundant because of the unique key on the table, but we can ignore that for now as long as it's not causing the problem :)
CREATE TABLE `animals` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(150) COLLATE utf8_unicode_ci NOT NULL,
`owner_id` varchar(50) COLLATE utf8_unicode_ci DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `animals_animal_id_index` (`owner_id`,`id`),
KEY `animals_name_index` (`name`)
) ENGINE=InnoDB AUTO_INCREMENT=2470843 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
CREATE TABLE `animal_breeds` (
`animal_id` int(10) unsigned DEFAULT NULL,
`breed` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
UNIQUE KEY `animal_breeds_animal_id_breed_unique` (`animal_id`,`breed`),
KEY `animal_breeds_animal_id_index` (`animal_id`),
KEY `animal_breeds_breed_index` (`breed`),
CONSTRAINT `animal_breeds_animal_id_foreign` FOREIGN KEY (`animal_id`) REFERENCES `animals` (`id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
Any help would be appreciated. Thanks!

With knowledge about your data you can try something like this:
SELECT
b.*
FROM
(
SELECT
DISTINCT `breed`
FROM
`animal_breeds`
) AS b
WHERE
EXISTS (
SELECT
*
FROM
animal_breeds AS ab
INNER JOIN animals AS a ON ab.animal_id = a.id
WHERE
b.breed = ab.breed
AND a.owner_id = ?
)
;
The idea is to get a short list of distinct breeds without any filtering (for a small list this is quite fast) and then filter that list with a correlated subquery. Because the list is short, only a few subqueries are executed, and each one only checks for existence, which is much faster than any grouping (DISTINCT is effectively a grouping).
This will only work if your distinct list is quite short.
With randomly generated data based on your answers, the above query gave me the following execution plan:
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 PRIMARY <derived2> ALL 2 100.00
3 SUBQUERY a ref PRIMARY,animals_animal_id_index animals_animal_id_index 153 const 1011 100.00 Using index
3 SUBQUERY ab ref animal_breeds_animal_id_breed_unique,animal_breeds_animal_id_index animal_breeds_animal_id_index 5 test.a.id 2 100.00 Using index
2 DERIVED animal_breeds range animal_breeds_animal_id_breed_unique,animal_breeds_breed_index animal_breeds_breed_index 1022 2 100.00 Using index for group-by
Alternatively, you can try to create WHERE clause like this:
...
WHERE
b.breed IN (
SELECT
ab.breed
FROM
animal_breeds AS ab
INNER JOIN animals AS a ON ab.animal_id = a.id
WHERE
a.owner_id = ?
)

For this query:
SELECT DISTINCT ab.`breed`
FROM `animal_breeds` ab INNER JOIN
`animals` a
ON a.`id` = ab.`animal_id`
WHERE a.`owner_id` = ? ;
You want indexes on animals(owner_id, id) and animal_breeds(animal_id, breed). The order of the columns in the composite index is important.
With the right index, I imagine that this will be very fast.
EDIT:
According to the explain, there are 1,126,303 matches for the values you are using. The time is due to removing duplicates. Given the sizes of the tables, it is surprising that there would be so many matching one value.
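One way to check that reading of the EXPLAIN is to count how many rows a single owner actually matches before and after de-duplication. This is a minimal sketch, assuming the schema from the question; the literal 'some-owner-id' is a placeholder for a real owner_id value.

```sql
-- How many animals belong to this owner?
-- Can be answered from animals_animal_id_index (owner_id, id).
SELECT COUNT(*) AS animals_for_owner
FROM animals
WHERE owner_id = 'some-owner-id';   -- placeholder value

-- How many breed rows feed the DISTINCT, and how many values survive it?
SELECT COUNT(*)                 AS breed_rows,
       COUNT(DISTINCT ab.breed) AS distinct_breeds
FROM animal_breeds AS ab
INNER JOIN animals AS a ON a.id = ab.animal_id
WHERE a.owner_id = 'some-owner-id'; -- placeholder value
```

If breed_rows is in the millions while distinct_breeds is small, most of the time is going into removing duplicates rather than into finding the rows.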

Related

Use index for ORDER BY in "SELECT .. FROM .. WHERE column IN (...) ORDER BY"

Is there any way to make the following query use an index and not use filesort:
SELECT c1 FROM table WHERE c2 IN (val_1, val_2, ..., val_n) ORDER BY c3
I guess the chances are slim, so if it is not possible, is there any way to make the following problem use indexes (or otherwise be fast)?
The table contains comments from users:
CREATE TABLE `comments` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`user_id` int(10) unsigned NOT NULL,
`comment` varchar(180) CHARACTER SET utf8 NOT NULL,
`timestamp` int(11) unsigned NOT NULL)
ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
I want to output the comments of specific users (for example the ones who user_x is following) ordered by timestamp (compare query above).
The only way I can imagine making this query fast is to add a flag column that is set to 1 for, say, the last 15 entries of each user. The query would then fetch at most 15 rows per user, so the maximum number of rows MySQL has to sort is 15*n, where n is the number of users whose comments are selected.
Edit: This is what EXPLAIN outputs:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE comments range idx_comments_user_id_timestamp idx_comments_user_id_timestamp 4 NULL 1113 Using where; Using index; Using filesort
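The flag idea described in the question could be sketched roughly as follows. The column name is_recent, the index name, the example user ids, and the newest-first ordering are all assumptions for illustration. Keeping the flag up to date (in the application or with triggers) is not shown.

```sql
-- Hypothetical flag: 1 for each user's 15 most recent comments, 0 otherwise.
ALTER TABLE comments
  ADD COLUMN is_recent TINYINT(1) UNSIGNED NOT NULL DEFAULT 0,
  ADD INDEX idx_comments_user_recent (user_id, is_recent, `timestamp`);

-- At most 15 * n rows reach the sort, where n is the number of listed users.
SELECT `comment`
FROM comments
WHERE user_id IN (1, 2, 3)      -- example user ids
  AND is_recent = 1
ORDER BY `timestamp` DESC       -- assumed: newest first
LIMIT 15;
```

This does not remove the sort itself. It only bounds how many rows are sorted, which is the trade-off the question describes.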

MySQL : Avoid Temporary/Filesort Caused by GROUP BY Clause

I've got a fairly simple query that seeks to display the number of email addresses that are subscribed along with the number unsubscribed, grouped by client.
The query:
SELECT
client_id,
COUNT(CASE WHEN subscribed = 1 THEN subscribed END) AS subs,
COUNT(CASE WHEN subscribed = 0 THEN subscribed END) AS unsubs
FROM
contacts_emailAddresses
LEFT JOIN contacts ON contacts.id = contacts_emailAddresses.contact_id
GROUP BY
client_id
Schema of relevant tables follows. contacts_emailAddresses is a junction table between contacts (which has the client_id) and emailAddresses (which is not actually used in this query).
CREATE TABLE `contacts` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`firstname` varchar(255) NOT NULL DEFAULT '',
`middlename` varchar(255) NOT NULL DEFAULT '',
`lastname` varchar(255) NOT NULL DEFAULT '',
`gender` varchar(5) DEFAULT NULL,
`client_id` mediumint(10) unsigned DEFAULT NULL,
`datasource` varchar(10) DEFAULT NULL,
`external_id` int(10) unsigned DEFAULT NULL,
`created` timestamp NULL DEFAULT NULL,
`trash` tinyint(1) NOT NULL DEFAULT '0',
`updated` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
KEY `client_id` (`client_id`),
KEY `external_id combo` (`client_id`,`datasource`,`external_id`),
KEY `trash` (`trash`),
KEY `lastname` (`lastname`),
KEY `firstname` (`firstname`),
CONSTRAINT `contacts_ibfk_1` FOREIGN KEY (`client_id`) REFERENCES `clients` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=14742974 DEFAULT CHARSET=utf8 ROW_FORMAT=COMPACT
CREATE TABLE `contacts_emailAddresses` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`contact_id` int(10) unsigned NOT NULL,
`emailAddress_id` int(11) unsigned DEFAULT NULL,
`primary` tinyint(1) unsigned NOT NULL DEFAULT '0',
`subscribed` tinyint(1) unsigned NOT NULL DEFAULT '1',
`modified` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
KEY `contact_id` (`contact_id`),
KEY `subscribed` (`subscribed`),
KEY `combo` (`contact_id`,`emailAddress_id`) USING BTREE,
KEY `emailAddress_id` (`emailAddress_id`) USING BTREE,
CONSTRAINT `contacts_emailAddresses_ibfk_1` FOREIGN KEY (`contact_id`) REFERENCES `contacts` (`id`),
CONSTRAINT `contacts_emailAddresses_ibfk_2` FOREIGN KEY (`emailAddress_id`) REFERENCES `emailAddresses` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=24700918 DEFAULT CHARSET=utf8
Here's the EXPLAIN:
+----+-------------+-------------------------+--------+---------------+---------+---------+-------------------------------------------+----------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------------------+--------+---------------+---------+---------+-------------------------------------------+----------+---------------------------------+
| 1 | SIMPLE | contacts_emailAddresses | ALL | NULL | NULL | NULL | NULL | 10176639 | Using temporary; Using filesort |
| 1 | SIMPLE | contacts | eq_ref | PRIMARY | PRIMARY | 4 | icarus.contacts_emailAddresses.contact_id | 1 | |
+----+-------------+-------------------------+--------+---------------+---------+---------+-------------------------------------------+----------+---------------------------------+
2 rows in set (0.08 sec)
The problem here clearly is the GROUP BY clause, as I can remove the JOIN (and the items that depend on it) and the performance still is terrible (40+ seconds). There are 10m records in contacts_emailAddresses, 12m-some records in contacts, and 10–15 client records for the grouping.
From the doc:
Temporary tables can be created under conditions such as these:
If there is an ORDER BY clause and a different GROUP BY clause, or if the ORDER BY or GROUP BY contains columns from tables other than the first table in the join queue, a temporary table is created.
DISTINCT combined with ORDER BY may require a temporary table.
If you use the SQL_SMALL_RESULT option, MySQL uses an in-memory temporary table, unless the query also contains elements (described later) that require on-disk storage.
I'm obviously not combining the GROUP BY with an ORDER BY, and I have tried multiple things to ensure that the GROUP BY is on a column that should be properly placed in the join queue (including rewriting the query to put contacts in the FROM and instead join to contacts_emailAddresses), all to no avail.
Any suggestions for performance tuning would be much appreciated!

I think the only real shot you have of getting away from a "Using temporary; Using filesort" operation (given the current schema, the current query, and the specified resultset) would be to use correlated subqueries in the SELECT list.
SELECT c.client_id
, (SELECT IFNULL(SUM(es.subscribed=1),0)
FROM contacts_emailAddresses es
JOIN contacts cs
ON cs.id = es.contact_id
WHERE cs.client_id = c.client_id
) AS subs
, (SELECT IFNULL(SUM(eu.subscribed=0),0)
FROM contacts_emailAddresses eu
JOIN contacts cu
ON cu.id = eu.contact_id
WHERE cu.client_id = c.client_id
) AS unsubs
FROM contacts c
GROUP BY c.client_id
This may run quicker than the original query, or it may not. Those correlated subqueries are going to get run for each row returned by the outer query. If that outer query is returning a boatload of rows, that's a whole boatload of subquery executions.
Here's the output from an EXPLAIN:
id select_type table type possible_keys key key_len ref Extra
-- ------------------ ----- ----- ----------------------------------- ---------- ------- ------ ------------------------
1 PRIMARY c index (NULL) client_id 5 (NULL) Using index
3 DEPENDENT SUBQUERY cu ref PRIMARY,client_id,external_id combo client_id 5 func Using where; Using index
3 DEPENDENT SUBQUERY eu ref contact_id,combo contact_id 4 cu.id Using where
2 DEPENDENT SUBQUERY cs ref PRIMARY,client_id,external_id combo client_id 5 func Using where; Using index
2 DEPENDENT SUBQUERY es ref contact_id,combo contact_id 4 cs.id Using where
For optimum performance of this query, we'd really like to see "Using index" in the Extra column of the explain for the eu and es tables. But to get that, we'd need a suitable index, one with a leading column of contact_id and including the subscribed column. For example:
CREATE INDEX cemail_IX2 ON contacts_emailAddresses (contact_id, subscribed);
With the new index available, EXPLAIN output shows MySQL will use the new index:
id select_type table type possible_keys key key_len ref Extra
-- ------------------ ----- ----- ----------------------------------- ---------- ------- ------ ------------------------
1 PRIMARY c index (NULL) client_id 5 (NULL) Using index
3 DEPENDENT SUBQUERY cu ref PRIMARY,client_id,external_id combo client_id 5 func Using where; Using index
3 DEPENDENT SUBQUERY eu ref contact_id,combo,cemail_IX2 cemail_IX2 4 cu.id Using where; Using index
2 DEPENDENT SUBQUERY cs ref PRIMARY,client_id,external_id combo client_id 5 func Using where; Using index
2 DEPENDENT SUBQUERY es ref contact_id,combo,cemail_IX2 cemail_IX2 4 cs.id Using where; Using index
NOTES
This is the kind of problem where introducing a little redundancy can improve performance. (Just like we do in a traditional data warehouse.)
For optimum performance, what we'd really like is to have the client_id column available on the contacts_emailAddresses table, without a need to JOIN to the contacts table.
In the current schema, the foreign key relationship to contacts table gets us the client_id (rather, the JOIN operation in the original query is what gets it for us.) If we could avoid that JOIN operation entirely, we could satisfy the query entirely from a single index, using the index to do the aggregation, and avoiding the overhead of the "Using temporary; Using filesort" and JOIN operations...
With the client_id column available, we'd create a covering index like...
... ON contacts_emailAddresses (client_id, subscribed)
Then, we'd have a blazingly fast query...
SELECT e.client_id
, SUM(e.subscribed=1) AS subs
, SUM(e.subscribed=0) AS unsubs
FROM contacts_emailAddresses e
GROUP BY e.client_id
That would get us a "Using index" in the query plan, and the query plan for this resultset just doesn't get any better than that.
But that would require a change to your schema, so it doesn't really answer your question.
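For reference, the schema change described above might look something like this. It is a sketch only: the index name is made up, the column type is copied from contacts.client_id, and nothing here keeps the redundant column in sync after the initial backfill (that would need application logic or triggers).

```sql
-- Add the redundant column (type matches contacts.client_id).
ALTER TABLE contacts_emailAddresses
  ADD COLUMN client_id MEDIUMINT UNSIGNED NULL;

-- One-time backfill from contacts, using MySQL's multi-table UPDATE.
UPDATE contacts_emailAddresses e
JOIN contacts c ON c.id = e.contact_id
SET e.client_id = c.client_id;

-- Covering index for the GROUP BY query (hypothetical name).
CREATE INDEX cemail_client_subscribed_IX
  ON contacts_emailAddresses (client_id, subscribed);
```

On tables of this size the ALTER and the backfill are themselves heavy operations, so this is something to try on a copy first.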
Without the client_id column, then the best we're likely to do is a query like the one Gordon posted in his answer (though you still need to add the GROUP BY c.client_id to get the specified result.) The index Gordon recommended will be of benefit...
... ON contacts_emailAddresses(contact_id, subscribed)
With that index defined, the standalone index on contact_id is redundant. The new index will be a suitable replacement to support the existing foreign key constraint. (The index on just contact_id could be dropped.)
Another approach would be to do the aggregation on the "big" table first, before doing the JOIN, since it's the driving table for the outer join. Actually, since that foreign key column is defined as NOT NULL, and there's a foreign key, it's not really an "outer" join at all.
SELECT c.client_id
, SUM(s.subs) AS subs
, SUM(s.unsubs) AS unsubs
FROM ( SELECT e.contact_id
, SUM(e.subscribed=1) AS subs
, SUM(e.subscribed=0) AS unsubs
FROM contacts_emailAddresses e
GROUP BY e.contact_id
) s
JOIN contacts c
ON c.id = s.contact_id
GROUP BY c.client_id
Again, we need an index with contact_id as the leading column and including the subscribed column, for best performance. (The plan for s should show "Using index".) Unfortunately, that's still going to materialize a fairly sizable resultset (derived table s) as a temporary MyISAM table, and the MyISAM table isn't going to be indexed.

simple mysql query working slower than nested Select

I am doing a simple Select from a single table.
CREATE TABLE `book` (
`Book_Id` int(10) NOT NULL AUTO_INCREMENT,
`Book_Name` varchar(100) COLLATE utf8_turkish_ci DEFAULT NULL ,
`Book_Active` bit(1) NOT NULL DEFAULT b'1' ,
`Author_Id` int(11) NOT NULL,
PRIMARY KEY (`Book_Id`),
KEY `FK_Author` (`Author_Id`),
CONSTRAINT `FK_Author` FOREIGN KEY (`Author_Id`) REFERENCES `author` (`Author_Id`) ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=5947698 DEFAULT CHARSET=utf8 COLLATE=utf8_turkish_ci ROW_FORMAT=COMPACT
table : book
columns :
Book_Id (INTEGER 10) | Book_Name (VARCHAR 100) | Author_Id (INTEGER 10) | Book_Active (Boolean)
I have Indexes on three columns : Book_Id (PRIMARY key) , Author_Id (FK) , Book_Active .
first query :
SELECT * FROM book WHERE Author_Id = 1 AND Book_Active = 1
EXPLAIN :
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE book ref FK_Author,index_Book_Active FK_Author 4 const 4488510 Using where
second query :
SELECT b.* FROM book b
WHERE Book_Active=1
AND Book_Id IN (SELECT Book_Id FROM book WHERE Author_Id=1)
EXPLAIN :
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY book ref index_Book_Active index_Book_Active 1 const 9369399 Using where
2 DEPENDENT SUBQUERY book unique_subquery PRIMARY,FK_Author PRIMARY 4 func 1 Using where
The data statistics is like this :
16.8 million books
10.5 million Book_Active=true
6.3 million Book_Active = false
And For Author_Id=1
2.4 million Book_Active=false
5000 Book_Active=true
The first query takes 6.7 seconds. The second query takes 0.0002 seconds.
What is the cause of this enormous difference? Is it the right thing to use the nested select query?
edit: added "sql explain"

In the first case: MySQL uses FK_Author index (this gives us ~4.5M rows), then it has to match every row against Book_Active = 1 condition — index cannot be used here.
The second case: InnoDB implicitly adds primary key to every index. Thus, when MySQL executes this part: SELECT book.* FROM book WHERE Book_Active=1 it has Book_Id from the index. Then for the subquery it has to match Book_Id with Author_Id; Author_Id is a constant and is a prefix of the index; implicitly included primary key can be matched against implicit primary key from Book_Active index. In your case it is faster to intersect two indices than to use index to retrieve 4.5M rows and scan them sequentially.
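A different option, not proposed in the answer above, is a composite index covering both equality conditions, so that the first query no longer has to fetch millions of rows for the author and then discard the inactive ones. This is a sketch under that assumption; the index name is hypothetical.

```sql
-- Both WHERE conditions are equalities, so both columns can be used for the lookup.
ALTER TABLE book
  ADD INDEX idx_book_author_active (Author_Id, Book_Active);

-- Re-check the plan for the original, simple query.
EXPLAIN
SELECT * FROM book WHERE Author_Id = 1 AND Book_Active = 1;
```

With the statistics given in the question (about 5000 active books for Author_Id = 1), the rows estimate in the new plan should be in the thousands rather than in the millions if the optimizer picks this index.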

How to improve search performance in MySQL

I have a table that contains two bigint columns: beginNumber, endNumber, defined as UNIQUE. The ID is the Primary Key.
ID | beginNumber | endNumber | Name | Criteria
The second table contains a number. I want to retrieve the record from table1 when the Number from table2 falls between its beginNumber and endNumber. This is the query:
select distinct t1.Name, t1.Country
from t1
where t2.Number
BETWEEN t1.beginIpNum AND t1.endNumber
The query is taking too much time as I have so many records. I don't have much experience with databases, but I read that indexing the table will improve the search so that MySQL does not have to pass through every row searching for my Number, and that this can be done by, for example, having UNIQUE values. I made beginNumber and endNumber in table1 UNIQUE. Is this all I can do? Is there any other way to improve the time? Please provide detailed answers.
EDIT:
table1:
CREATE TABLE `t1` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`beginNumber` bigint(20) DEFAULT NULL,
`endNumber` bigint(20) DEFAULT NULL,
`Name` varchar(255) DEFAULT NULL,
`Criteria` varchar(455) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `beginNumber_UNIQUE` (`beginNumber`),
UNIQUE KEY `endNumber_UNIQUE` (`endNumber`)
) ENGINE=InnoDB AUTO_INCREMENT=327 DEFAULT CHARSET=utf8
table2:
CREATE TABLE `t2` (
`id2` int(11) NOT NULL AUTO_INCREMENT,
`description` varchar(255) DEFAULT NULL,
`Number` bigint(20) DEFAULT NULL,
PRIMARY KEY (`id2`),
UNIQUE KEY `description_UNIQUE` (`description`)
) ENGINE=InnoDB AUTO_INCREMENT=433 DEFAULT CHARSET=utf8
This is a toy example of the tables but it shows the concerned part.

I'd suggest an index on t2.Number like this:
ALTER TABLE t2 ADD INDEX numindex(Number);
Your query won't work as written because it won't know which t2 to use. Try this:
SELECT DISTINCT t1.Name, t1.Criteria
FROM t1
WHERE EXISTS (SELECT * FROM t2 WHERE t2.Number BETWEEN t1.beginNumber AND t1.endNumber);
Without the t2.Number index EXPLAIN gives this query plan:
1 PRIMARY t1 ALL 1 Using where; Using temporary
2 DEPENDENT SUBQUERY t2 ALL 1 Using where
With an index on t2.Number, you get this plan:
PRIMARY t1 ALL 1 Using where; Using temporary
DEPENDENT SUBQUERY t2 index numindex numindex 9 1 Using where; Using index
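The important difference is in the type column: ALL means MySQL scans the whole t2 table for each lookup, whereas with the index it reads only the smaller numindex index, as "Using index" in the Extra column shows.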

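This is a good place to use a B-tree index. B-tree indexes are best when you often sort or use BETWEEN on a column. For InnoDB tables like these, BTREE is already the default index type (HASH is the default only for MEMORY tables), so the USING BTREE clause below is optional.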
CREATE INDEX index_name
ON table_name (column_name)
USING BTREE

MySQL Query Optimization for GPS Tracking system

I have the following query:
SELECT * FROM `alltrackers`
WHERE `deviceid`='FT_99000083401624'
AND `locprovider`!='none'
ORDER BY `id` DESC
This is the show create table:
CREATE TABLE `alltrackers` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`deviceid` varchar(50) NOT NULL,
`gpsdatetime` int(11) NOT NULL,
`locprovider` varchar(30) NOT NULL,
PRIMARY KEY (`id`),
KEY `statename` (`statename`),
KEY `gpsdatetime` (`gpsdatetime`),
KEY `locprovider` (`locprovider`),
KEY `deviceid` (`deviceid`(18))
) ENGINE=MyISAM AUTO_INCREMENT=8665045 DEFAULT CHARSET=utf8;
I've removed the columns which I thought were unnecessary for this question.
This is the EXPLAIN output for this query:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE alltrackers ref locprovider,deviceid deviceid 56 const 156416 Using where; Using filesort
This particular query is showing as taking several seconds in mytop (mtop). I'm a bit confused though, as the same query but with a different "deviceid" doesn't take as long. Although I only need the last row, I've already removed LIMIT 1 as that makes it take even longer. This table currently contains 3 million rows.
It is used for storing the locations from different GPS devices. Each GPS device has a unique device ID. Locations come in and are added to the table. For statistics I'm running the above query to find the time of the last received location from a certain device.
I'm open to advice on ways to further optimize the query or even the tables.
Many thanks in advance.

If you only need the last row, add an index on (deviceid, id, locprovider). It would be even faster with an index on (deviceid, id, locprovider, gpsdatetime):
ALTER TABLE alltrackers
ADD INDEX special_covering_IDX
(deviceid, id, locprovider, gpsdatetime) ;
Then try this out:
SELECT id, locprovider, gpsdatetime
FROM alltrackers
WHERE deviceid = 'FT_99000083401624'
AND locprovider <> 'none'
ORDER BY id DESC
LIMIT 1 ;
