How to optimize SQL query with WHERE IN subquery - mysql

I have two tables in MySQL 5.6 for collecting event data.
When an event occurs it generates data in certain time period.
The parent table named 'event' remembers the last state of event.
The child table named 'event_version' remembers all data versions generated by any event.
The schemas for these tables look like this:
CREATE TABLE `event` (
`id` BIGINT(20) NOT NULL,
`version_id` BIGINT(20), -- refers to last event_version
`version_number` BIGINT(20), -- consecutive numbers increased when new version appears
`first_event_time` TIMESTAMP(6), -- time when a set of event data was generated first time,
-- it is immutable after creation
`event_time` TIMESTAMP(6), -- time when a set of event data changed last time
`other_event_data` VARCHAR(30), -- more other columns
PRIMARY KEY (`id`),
INDEX `event_time` (`event_time`),
INDEX `version_id` (`version_id`),
CONSTRAINT `FK_version_id` FOREIGN KEY (`version_id`) REFERENCES `event_version` (`id`)
);
CREATE TABLE `event_version` (
`id` BIGINT(20) NOT NULL,
`event_id` BIGINT(20), -- refers to event
`version_number` BIGINT(20), -- consecutive numbers increased when new version appears
`event_time` TIMESTAMP(6) NULL DEFAULT NULL, -- time when a set of event data was generated
`other_event_data` VARCHAR(30), -- more other columns
PRIMARY KEY (`id`),
INDEX `event_time` (`event_time`), -- time when a set of event data changed
INDEX `event_id` (`event_id`),
CONSTRAINT `FK_event_id` FOREIGN KEY (`event_id`) REFERENCES `event` (`id`)
);
I want to get all event_version rows for events that had new versions added in a selected time period.
For example: there is an event with event.id=21 that appeared on 2019-04-28 and produced versions at:
2019-04-28 version_number: 1, event_version.event_id=21
2019-04-30 version_number: 2, event_version.event_id=21
2019-05-02 version_number: 3, event_version.event_id=21
2019-05-04 version_number: 4, event_version.event_id=21
I want all of these records to be found when I search for the period from 2019-05-01 to 2019-06-01.
The idea is to find every event_version.event_id created in the selected period, and then all rows from event_version whose event_id is in this list.
To create the list of event ids I have two candidate inner SELECT queries.
The first query:
SELECT DISTINCT event_id FROM event_version WHERE event_time>='2019-05-01' AND event_time<'2019-06-01';
It takes about 10s and returns about 500 000 records.
But I have a second query which uses the parent table and looks like this:
SELECT id FROM event WHERE (first_event_time>='2019-05-01' AND first_event_time<'2019-06-01') OR (first_event_time<'2019-05-01' AND event_time>'2019-05-01');
It takes about 7s and returns the same set of ids.
Then I use these subqueries in my final query:
SELECT * FROM event_version WHERE event_id IN (<one of previous two queries>);
The problem is that when I use the second subquery it takes about 8s to produce the result (about 5 million records).
Creating the same result with the first subquery takes 3 minutes and 15s.
I can't understand why there is such a big difference in execution time even though the subqueries produce the same result list.
I want to use a subquery like the first one because it depends only on event_time, not on additional data from the parent table.
I have more similar tables, and there I can rely only on event_time.
My question: is it possible to optimize the query to produce the expected result using only event_time?

As I understand, you want the following query to be optimized:
SELECT *
FROM event_version
WHERE event_id IN (
SELECT DISTINCT event_id
FROM event_version
WHERE event_time >= '2019-05-01'
AND event_time < '2019-06-01'
)
Things I would try:
Create an index on event_version(event_time, event_id). This should improve the performance of the subquery by avoiding a second lookup to get the event_id. Though the overall performance will probably be similar. The reason is that WHERE IN (<subquery>) tends to be slow (at least in older versions) when the subquery returns a lot of rows.
Try a JOIN with your subquery as derived table:
SELECT *
FROM (
SELECT DISTINCT event_id
FROM event_version
WHERE event_time >= '2019-05-01'
AND event_time < '2019-06-01'
) s
JOIN event_version USING(event_id)
Check whether the index mentioned above is of any help here.
Try an EXISTS subquery:
SELECT v.*
FROM event e
JOIN event_version v ON v.event_id = e.id
WHERE EXISTS (
SELECT *
FROM event_version v1
WHERE v1.event_id = e.id
AND v1.event_time >= '2019-05-01'
AND v1.event_time < '2019-06-01'
)
Here you would need an index on event_version(event_id, event_time). Though the performance might be even worse. I would bet on the derived table join solution.
My guess - why your second query runs faster - is that the optimizer is able to convert the IN condition to a JOIN, because the returned column is the primary key of the event table.
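As a rough sketch, the two indexes mentioned above could be created as follows, and EXPLAIN can then show whether the derived-table join picks them up. The index names are made up for illustration, and the plan you get depends on your MySQL version and data distribution.

```sql
-- Hypothetical index names; the column order follows the suggestions above.
ALTER TABLE event_version
  ADD INDEX idx_ev_time_id (event_time, event_id),  -- lets the DISTINCT subquery read event_id from the index
  ADD INDEX idx_ev_id_time (event_id, event_time);  -- supports the correlated EXISTS lookup

-- Inspect the plan of the derived-table join.
EXPLAIN
SELECT v.*
FROM (
    SELECT DISTINCT event_id
    FROM event_version
    WHERE event_time >= '2019-05-01'
      AND event_time <  '2019-06-01'
) s
JOIN event_version v USING (event_id);
```

If the row for the subquery shows "Using index" in the Extra column, it is being answered from the index without reading the table rows.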

I'm guessing the event_version table is a lot bigger than the event table. The subqueries on their own are cheap: each one scans a table once for a predicate and returns the rows. But when one of them is used inside a subquery, it gets executed for every row the outer query checks. So if event_version has 1M rows, it executes the subquery 1M times. There is probably some smarter logic that keeps it from being this extreme, but the principle stays.
However, I fail to see the point of the third query. If you use it with the first query as the subquery, don't you get exactly the same rows as if you had run the first query as SELECT * FROM event_version? So why the subquery?
Wouldn't this:
SELECT * FROM event_version WHERE event_id IN (insert query 1);
be the same as
SELECT * FROM event_version WHERE event_time>='2019-05-01' AND event_time<'2019-06-01';
?

Related

mysql index selection on large table

I have a couple of tables that look like this:
CREATE TABLE Entities (
id INT NOT NULL AUTO_INCREMENT,
name VARCHAR(45) NOT NULL,
client_id INT NOT NULL,
display_name VARCHAR(45),
PRIMARY KEY (id)
);
CREATE TABLE Statuses (
id INT NOT NULL AUTO_INCREMENT,
name VARCHAR(45) NOT NULL,
PRIMARY KEY (id)
);
CREATE TABLE EventTypes (
id INT NOT NULL AUTO_INCREMENT,
name VARCHAR(45) NOT NULL,
PRIMARY KEY (id)
);
CREATE TABLE Events (
id INT NOT NULL AUTO_INCREMENT,
entity_id INT NOT NULL,
date DATE NOT NULL,
event_type_id INT NOT NULL,
status_id INT NOT NULL
);
Events is large > 100,000,000 rows
Entities, Statuses and EventTypes are small < 300 rows a piece
I have several indexes on Events, but the ones that come into play are
idx_events_date_ent_status_type (date, entity_id, status_id, event_type_id)
idx_events_date_ent_status_type (entity_id, status_id, event_type_id)
idx_events_date_ent_type (date, entity_id, event_type_id)
I have a large complicated query, but I'm getting the same slow query results with a simpler one like the one below (note, in the real queries, I don't use evt.*)
SELECT evt.*, ent.name AS ent_name, s.name AS stat_name, et.name AS type_name
FROM `Events` evt
JOIN `Entities` ent ON evt.entity_id = ent.id
JOIN `EventTypes` et ON evt.event_type_id = et.id
JOIN `Statuses` s ON evt.status_id = s.id
WHERE
evt.date BETWEEN #start_date AND #end_date AND
evt.entity_id IN ( 19 ) AND -- this in clause is built by code
evt.event_type_id = #type_id
For some reason, MySQL keeps choosing the index which doesn't cover Events.date, and the query takes 15 seconds or more and returns a couple thousand rows. If I change the query to:
SELECT evt.*, ent.name AS ent_name, s.name AS stat_name, et.name AS type_name
FROM `Events` evt force index (idx_events_date_ent_status_type)
JOIN `Entities` ent ON evt.entity_id = ent.id
JOIN `EventTypes` et ON evt.event_type_id = et.id
JOIN `Statuses` s ON evt.status_id = s.id
WHERE
evt.date BETWEEN #start_date AND #end_date AND
evt.entity_id IN ( 19 ) AND -- this in clause is built by code
evt.event_type_id = #type_id
The query takes .014 seconds.
Since this query is built by code, I would much rather not force the index, but mostly, I want to know why it chooses one index over the other. Is it because of the joins?
To give some stats, there are ~2500 distinct dates, and ~200 entities in the Events table. So I suppose that might be why it chooses the index with all of the low cardinality columns.
Do you think it would help to add date to the end of idx_events_date_ent_status_type? Since this is a large table, it takes a long time to add indexes.
I tried adding an additional index,
ix_events_ent_date_status_et(entity_id, date, status_id, event_type_id)
and it actually made the queries slower.
I will experiment a bit more, but I feel like I'm not sure how the optimizer makes its decisions.
Additional Info:
I tried removing the join to the Statuses table, and mysql switches to ix_events_date_ent_type, and the query runs in 0.045 sec
I can't wrap my head around why removing a join to a table that is not part of the filter impacts the choice of index.
I would add this index:
ALTER TABLE Events ADD INDEX (event_type_id, entity_id, date);
The order of columns is important. Put all column(s) used in equality conditions first. This is event_type_id in this case.
The optimizer can use multiple columns to optimize equalities, if the columns are left-most and consecutive.
Then the optimizer can use one more column to optimize a range condition. A range condition is anything other than = or IS NULL. So range conditions include >, !=, BETWEEN, IN(), LIKE (with no leading wildcard), IS NOT NULL, and so on.
The condition on entity_id is also an equality condition if the IN() list has one element. MySQL's optimizer can treat a list of one value as an equality condition. But if the list has more than one value, it becomes a range condition. So if the example you showed of IN (19) is typical, then all three columns of the index will be used for filtering.
It's still worth putting date in the index, because it can at least tell the InnoDB storage engine to filter rows before returning them. See https://dev.mysql.com/doc/refman/8.0/en/index-condition-pushdown-optimization.html It's not quite as good as a real index lookup, but it's worthwhile.
I would also suggest creating a smaller table to test with. Doing experiments on a 100 million row table is time-consuming. But you do need a table with a non-trivial amount of data, because if you test on an empty table, the optimizer behaves differently.
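A minimal sketch of that workflow, assuming a scratch table called Events_sample (a made-up name) and placeholder dates and ids: copy a slice of the data, add the candidate index there, and compare plans before touching the big table.

```sql
-- Events_sample is a hypothetical scratch table; CREATE TABLE ... LIKE copies columns and indexes.
CREATE TABLE Events_sample LIKE Events;

-- Copy a slice large enough for the optimizer to behave realistically
-- (the date cutoff is a placeholder).
INSERT INTO Events_sample
SELECT * FROM Events
WHERE date >= '2022-01-01';

-- Refresh index statistics after the bulk load.
ANALYZE TABLE Events_sample;

-- Candidate index from above, with an illustrative name.
ALTER TABLE Events_sample
  ADD INDEX idx_type_ent_date (event_type_id, entity_id, date);

-- The key and key_len columns show which index is chosen and how much of it is used.
EXPLAIN
SELECT evt.id, evt.date, evt.status_id
FROM Events_sample evt
WHERE evt.date BETWEEN '2022-12-01' AND '2022-12-03'
  AND evt.entity_id IN (19)
  AND evt.event_type_id = 3;
```

Re-running the EXPLAIN with a longer IN() list is a quick way to see how the plan changes once entity_id stops being a single-value equality.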
Rearrange your indexes to have columns in this order:
Any column(s) that will be tested with = or IS NULL.
Column(s) tested with IN -- If there is a single value, this will be further optimized to = for you.
One "range" column, such as your date.
Note that nothing after a "range" test will be used by WHERE.
(There are exceptions, but most are not relevant here.)
More discussion: Index Cookbook
Since the tables smell like Data Warehousing, I suggest looking into Summary Tables. In some cases, long queries on Events can be moved to the summary table(s), where they run much faster. Also, this may eliminate the need for some (or maybe even all) secondary indexes.
Since Events is rather large, I suggest using smaller numbers where practical. INT takes 4 bytes. Speed will improve slightly if you shrink those where appropriate.
When you have INDEX(a,b,c), that index will handle cases that need INDEX(a,b) and INDEX(a). Keep the longer one. (Sometimes the Optimizer picks the shorter index 'erroneously'.)
To most effectively use a composite index on multiple values of two different fields, you need to specify the values with joins instead of simple where conditions. So assuming you are selecting dates from 2022-12-01 to 2022-12-03 and entity_id in (1,2,3), do:
select ...
from (select date('2022-12-01') date union all select date('2022-12-02') union all select date('2022-12-03')) dates
join Entities on Entities.id in (1,2,3)
join Events on Events.entity_id=Entities.id and Events.date=dates.date
If you pre-create a dates table with all dates from 0000-01-01 to 9999-12-31, then you can do:
select ...
from dates
join Entities on Entities.id in (1,2,3)
join Events on Events.entity_id=Entities.id and Events.date=dates.date
where dates.date between #start_date and #end_date
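One way to pre-create such a dates table is sketched below. It assumes MySQL 8.0 or later (recursive common table expressions are not available in 5.x) and fills an arbitrary range, 2000-01-01 through 2030-12-31, rather than every representable date. The table and column names match the query above.

```sql
CREATE TABLE dates (
    date DATE NOT NULL,
    PRIMARY KEY (date)
);

-- The default recursion limit is 1000 iterations, fewer than the number of days needed here.
SET SESSION cte_max_recursion_depth = 20000;

-- Generate one row per day with a recursive CTE.
INSERT INTO dates (date)
WITH RECURSIVE seq (d) AS (
    SELECT DATE('2000-01-01')
    UNION ALL
    SELECT d + INTERVAL 1 DAY FROM seq WHERE d < '2030-12-31'
)
SELECT d FROM seq;
```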

Mysql join optimize where clause

There are two tables in MySQL 5.7, and each one has 100,000 records.
Each one contains data like this:
id name
-----------
1 name_1
2 name_2
3 name_3
4 name_4
5 name_5
...
The DDL is:
CREATE TABLE `table_a` (
`id` int(11) NOT NULL,
`name` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
CREATE TABLE `table_b` (
`id` int(11) NOT NULL,
`name` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
Now I execute the following two queries to see whether the latter is faster.
select SQL_NO_CACHE *
from table_a a inner
join table_b b on a.name = b.name
where a.id between 50000 and 50100;
select SQL_NO_CACHE *
from (
select *
from table_a
where id between 50000 and 50100
) a
inner join table_b b on a.name = b.name;
I think that in the former query, it would iterate up to 100,000 * 100,000 times and then filter the result by the where clause; in the latter query, it would first filter table_a to get 100 intermediate rows and then iterate up to 100 * 100,000 times to get the final result. So the latter should be much faster than the former.
But the result is that both queries take about 1.5 seconds. And by using the explain statement, I can't find any substantial differences.
Does MySQL optimize the former query so that it executes like the latter?
For INNER JOIN, ON and WHERE are optimized the same. For LEFT/RIGHT JOIN, the semantics are different, so the optimization is different. (Meanwhile, please use ON for stating the relationship and WHERE for filtering -- it helps humans in understanding the query.)
Both queries can start by fetching 100 rows from a because of a.id between 50000 and 50100, then reach into the other table 100 times. But here it has to do a table scan each time because of the lack of any useful index. So 100 x 100,000 operations. ("Nested Loop Join" or "NLJ")
The solution to the slowness is to add
INDEX(name)
Add it at least to b. Or, if this is really a lookup table for mapping "names" to "ids", then UNIQUE(name). With either index, the work should be down to 100 x 100.
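As a sketch, the index could be added and checked like this. The index name is illustrative. On configurations limited to 767-byte index keys, a utf8mb4 VARCHAR(255) column may need a prefix index such as name(191) instead.

```sql
-- Illustrative index name; UNIQUE(name) would be the alternative if names never repeat.
ALTER TABLE table_b ADD INDEX idx_name (name);

-- Compare the plan before and after adding the index.
EXPLAIN
SELECT *
FROM table_a a
INNER JOIN table_b b ON a.name = b.name
WHERE a.id BETWEEN 50000 AND 50100;
```

With the index in place, the row for b would be expected to show an index lookup on idx_name rather than a full scan (type ALL).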
Another technique for analyzing queries is
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE 'Handler%';
It counts the actual number of rows (data or index) touched. 100,000 (or multiples of such) indicate a full table/index scan(s) in your case.
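Applied to the first query from the question, the technique might look like the sketch below. The Handler counters are status values, so they are read with SHOW SESSION STATUS.

```sql
-- Reset the session's status counters.
FLUSH STATUS;

SELECT SQL_NO_CACHE *
FROM table_a a
INNER JOIN table_b b ON a.name = b.name
WHERE a.id BETWEEN 50000 AND 50100;

-- Handler_read_rnd_next grows with rows read during table scans;
-- Handler_read_key counts index lookups.
SHOW SESSION STATUS LIKE 'Handler_read%';
```

A large Handler_read_rnd_next value before the index is added, and a much smaller one afterwards, would be consistent with the table scan having been replaced by index lookups.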
More: Index Cookbook
Joins are often faster than sub-queries, especially on older MySQL versions, so try joins instead of sub-queries where you can. In this case, however, both queries are equivalent.
Another way to optimize the query would be to use partitions. With partitions, MySQL can go directly to the partitions that match your query, which reduces the time spent on unrelated records.

2 simple SQL queries become slow when merged

We have 2 SQL queries:
Get active campaigns (234 rows, 0.0007 seconds)
SELECT id FROM campaigns WHERE campaigns.is_active = 1
Get today's clicks for a user (17 rows, 0.0772 seconds)
SELECT id, campaign_id FROM clicks WHERE user_id = 1 AND created > '2022-06-23 00:00:00'
Both are fast and return a small number of rows.
Now I combine both to get active campaigns + number of clicks today for a user:
SELECT count(clicks.id),
campaigns.id
FROM campaigns
LEFT JOIN clicks
ON ( clicks.campaign_id = campaigns.id
AND clicks.user_id = 1
AND clicks.created > '2022-06-23 00:00:00')
WHERE campaigns.is_active = 1
GROUP BY campaigns.id
Returns 234 rows in 8 seconds runtime.
Getting the campaign list (234 rows) takes 0.0007.
Getting the clicks list (17 rows) takes 0.0772 seconds.
But to assign the 17 clicks to the 234 campaigns suddenly takes 8 seconds? Why is it so slow? How can I fix it?
If I change from LEFT JOIN to INNER JOIN it takes only 0.09 seconds, but it's not the return I need.
The clicks table has around 21m rows with 50k new rows a day. It has a single index on each of these columns: user_id, campaign_id, created
CREATE TABLE:
CREATE TABLE `clicks` (
`id` int(11) UNSIGNED NOT NULL,
`user_id` int(7) UNSIGNED NOT NULL,
`campaign_id` int(11) UNSIGNED NOT NULL,
`created` datetime NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3;
ALTER TABLE `clicks`
ADD PRIMARY KEY (`id`),
ADD KEY `user_id` (`user_id`),
ADD KEY `campaign_id` (`campaign_id`),
ADD KEY `created` (`created`);
CREATE TABLE `campaigns` (
`id` int(11) UNSIGNED NOT NULL,
`is_active` tinyint(4) NOT NULL DEFAULT 0
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3;
ALTER TABLE `campaigns`
ADD PRIMARY KEY (`id`),
ADD KEY `is_active` (`is_active`);
Ah, the well-known antipattern of a one-column index on every column strikes again.
Try creating a multicolumn covering index on your large clicks table, tuned to your query.
ALTER TABLE clicks
ADD INDEX campaign_created_user (campaign_id, created, user_id);
See how your query performs with that one. If it's OK you're done. If not, drop that index and try this one.
ALTER TABLE clicks
DROP INDEX campaign_created_user,
ADD INDEX user_campaign_created (user_id, campaign_id, created);
Why will one of these help you?
MySQL indexes are BTREE indexes. Simply put, they can be accessed either randomly or in order. So your clicks.user_id = 1 AND clicks.created > '2022-06-23 00:00:00' conditions random-access the index I suggest to the first eligible row, then scan it sequentially. That's a fast way to satisfy a query.
I suggest a second index in case the left join somehow leads with the user_id rather than the campaign_id. EXPLAIN output will let you know what's going on there.
Pro tip: Avoid putting indexes on columns unless you know your queries need them. Putting single-column indexes on every column doesn't substitute for correctly designed multicolumn indexes.
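My assumption is that you're multiplying the number of operations needed by a lot, because each of the 234 rows needs to be checked 17 times against three different conditions.
Try something like the following, though I don't know if it's well suited to your case. Note that filtering on clicks columns in the WHERE clause drops campaigns with no matching clicks, so it behaves like an INNER JOIN: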
SELECT count(clicks.id),
campaigns.id
FROM campaigns
LEFT JOIN clicks ON ( clicks.campaign_id = campaigns.id )
WHERE campaigns.is_active = 1 AND (clicks.user_id = 1 AND clicks.created > '2022-06-23 00:00:00')
GROUP BY campaigns.id
But 8 seconds is a strange processing time for such small numbers; it should be far less.
So if you try this query and still get the same result, there could be a problem with the DB settings of some sort.
Better indexes
Query 1:
The existing INDEX(is_active) is adequate.
Query 2:
INDEX(user_id, created, campaign_id) -- in this order!
and drop the existing INDEX(user_id)
Query 3:
clicks: INDEX(campaign_id, user_id, created) -- note difference
campaigns: INDEX(is_active, id) -- (as above); may not get used; that's OK
and drop the existing INDEX(campaign_id)
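Written out as DDL, these suggestions might look like the sketch below. The new index names are invented for illustration. The dropped indexes use the names from the CREATE TABLE statements in the question (`user_id` and `campaign_id`).

```sql
-- Query 2: replace the single-column user_id index.
ALTER TABLE clicks
  ADD INDEX user_created_campaign (user_id, created, campaign_id),
  DROP INDEX user_id;

-- Query 3: replace the single-column campaign_id index.
ALTER TABLE clicks
  ADD INDEX campaign_user_created (campaign_id, user_id, created),
  DROP INDEX campaign_id;

-- Query 3, campaigns side; the optimizer may ignore it on a table this small.
ALTER TABLE campaigns
  ADD INDEX active_id (is_active, id);
```

On a 21M-row table each ALTER can take a while, so it may be worth trying them on a copy first.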
This reformulation may run faster; I don't know for sure:
SELECT ( SELECT COUNT(*)
FROM clicks AS cl
WHERE cl.campaign_id = cg.id
AND cl.user_id = 1
AND cl.created > '2022-06-23 00:00:00'
) AS ct,
cg.id
FROM campaigns as cg
WHERE cg.is_active = 1
GROUP BY cg.id
50k new rows a day
That begs for building and maintaining a Summary Table.
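A minimal sketch of what such a summary table could look like, assuming one row per user, campaign, and day. The table name clicks_daily, its columns, and the refresh statement are all hypothetical; how and when it is refreshed (a nightly job, or incrementally on insert) is a design choice left open here.

```sql
-- Hypothetical summary table: one row per (user, day, campaign).
CREATE TABLE clicks_daily (
    user_id     INT UNSIGNED NOT NULL,
    campaign_id INT UNSIGNED NOT NULL,
    click_date  DATE NOT NULL,
    click_count INT UNSIGNED NOT NULL,
    PRIMARY KEY (user_id, click_date, campaign_id)
) ENGINE=InnoDB;

-- Refresh the counts for one day; re-running it overwrites the existing rows.
INSERT INTO clicks_daily (user_id, campaign_id, click_date, click_count)
SELECT user_id, campaign_id, DATE(created), COUNT(*)
FROM clicks
WHERE created >= '2022-06-23' AND created < '2022-06-24'
GROUP BY user_id, campaign_id, DATE(created)
ON DUPLICATE KEY UPDATE click_count = VALUES(click_count);

-- The report then reads the small table instead of the 21M-row one.
SELECT cg.id, COALESCE(SUM(cd.click_count), 0) AS clicks_today
FROM campaigns cg
LEFT JOIN clicks_daily cd
       ON cd.campaign_id = cg.id
      AND cd.user_id = 1
      AND cd.click_date = '2022-06-23'
WHERE cg.is_active = 1
GROUP BY cg.id;
```

This version counts whole calendar days, so it only matches the original query when the cutoff in `created > ...` falls on a day boundary.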

MySql table performance optimization

I have a table with the following structure
CREATE TABLE rel_score (
user_id bigint(20) NOT NULL DEFAULT '0',
score_date date NOT NULL,
rel_score decimal(4,2) DEFAULT NULL,
doc_count int(8) NOT NULL,
total_doc_count int(8) NOT NULL,
PRIMARY KEY (user_id,score_date),
KEY SCORE_DT_IDX (score_date)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 PACK_KEYS=1;
The table will store rel_score value for every user in the application for every day since 1st Jan 2000 till date. I estimated the total number records will be over 700 million. I populated the table with 6 months data (~ 30 million rows) and the query response time is about 8 minutes. Here is my query,
select
user_id, max(rel_score) as max_rel_score
from
rel_score
where score_date between '2012-01-01' and '2012-06-30'
group by user_id
order by max_rel_score desc;
I tried optimizing the query using the following techniques,
Partitioning on the score_date column
Adding an index on the score_date column
The query response time improved marginally to a little less than 8 mins.
How can I improve the response time? Is the design of the table appropriate?
Also, I cannot move the old data to an archive, as a user is allowed to query the entire data range.
If you partition your table at the same granularity as score_date, you will not reduce the query response time.
Try creating another attribute that contains only the year of the date, cast to an INTEGER, partition your table on this attribute (you will get 13 partitions), and re-execute your query to see the effect.
Your primary index should do a good job of covering the table. If you didn't have it, I would suggest building an index on rel_score(user_id, score_date, rel_score). For your query, this is a "covering" index, meaning that the index has all the columns in the query, so the engine never has to access the data pages (only the index).
The following version might also make good use of this index (although I much prefer your version of the query):
select u.user_id,
(select max(rel_score)
from rel_score r2
where r2.user_id = u.user_id and
r2.score_date between '2012-01-01' and '2012-06-30'
) as rel_score
from (select distinct user_id
from rel_score
where score_date between '2012-01-01' and '2012-06-30'
) u
order by rel_score desc;
The idea behind this query is to replace the aggregation with a simple index lookup. Aggregation in MySQL is a slow operation -- it works much better in other databases so such tricks shouldn't be necessary.

MySQL: Optimizing COUNT(*) and GROUP BY

I have a simple MyISAM table resembling the following (trimmed for readability -- in reality, there are more columns, all of which are constant width and some of which are nullable):
CREATE TABLE IF NOT EXISTS `history` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`time` int(11) NOT NULL,
`event` int(11) NOT NULL,
`source` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `event` (`event`),
KEY `time` (`time`)
);
Presently the table contains only about 6,000,000 rows (of which currently about 160,000 match the query below), but this is expected to increase. Given a particular event ID and grouped by source, I want to know how many events with that ID were logged during a particular interval of time. The answer to the query might be something along the lines of "Today, event X happened 120 times for source A, 105 times for source B, and 900 times for source C."
The query I concocted does perform this task, but it performs monstrously badly, taking well over a minute to execute when the timespan is set to "all time" and in excess of 30 seconds for as little as a week back:
SELECT COUNT(*) AS count FROM history
WHERE event=2000 AND time >= 0 AND time < 1310563644
GROUP BY source
ORDER BY count DESC
This is not for real-time use, so even if the query takes a second or two that would be fine, but several minutes is not. Explaining the query gives the following, which troubles me for obvious reasons:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE history ref event,time event 4 const 160399 Using where; Using temporary; Using filesort
I've experimented with various multi-column indexes (such as (event, time)), but with no improvement. This seems like such a common use case that I can't imagine there not being a reasonable solution, but my Googling all boils down to versions of the query I already have, with no particular suggestions on how to avoid the temporary table (and even then, why performance is so abysmal).
Any suggestions?
You say you have tried multi-column indexes. Have you also tried single-column indexes, one per column?
UPDATE: Also, the COUNT(*) operation over a GROUP BY clause is probably a lot faster, if the grouped column also has an index on it... Of course, this depends on the number of NULL values that are actually in that column, which are not indexed.
For event, MySQL can execute a UNIQUE SCAN, which is quite fast, whereas for time, a RANGE SCAN will be applied, which is not so fast... If you separate indexes, I'd expect better performance than with multi-column ones.
Also, maybe you could gain something by partitioning your table by some expected values / value ranges:
http://dev.mysql.com/doc/refman/5.5/en/partitioning-overview.html
I suggest trying this multi-column index:
ALTER TABLE `history` ADD INDEX `history_index` (`event` ASC, `time` ASC, `source` ASC);
Then, if it doesn't help, try an index hint on this query (USE INDEX restricts the optimizer to the listed index; FORCE INDEX is the stronger variant):
SELECT COUNT(*) AS count FROM history USE INDEX (history_index)
WHERE event=2000 AND time >= 0 AND time < 1310563644
GROUP BY source
ORDER BY count DESC
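To see whether the hint changes anything, the plan can be inspected with EXPLAIN, as in the sketch below. It uses FORCE INDEX with the history_index name from above and also selects source, since without that column the counts cannot be attributed to a source.

```sql
EXPLAIN
SELECT source, COUNT(*) AS count
FROM history FORCE INDEX (history_index)
WHERE event = 2000 AND time >= 0 AND time < 1310563644
GROUP BY source
ORDER BY count DESC;
```

Because (event, time, source) contains every column the query touches, "Using index" in the Extra column would indicate the query is being answered from the index alone. The temporary table and filesort for the GROUP BY and ORDER BY may still appear.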
If the sources are known, or you want the count for specific sources, you can try something like this:
select count(source= 'A' or NULL) as A,count(source= 'B' or NULL) as B from history;
You can do the ordering in your application code. Also try indexing event and source together.
This should be faster than the original query.