I have a table which currently has about 80 million rows, created as follows:
create table records
(
id int auto_increment primary key,
created int not null,
status int default '0' not null
)
collate = utf8_unicode_ci;
create index created_and_status_idx
on records (created, status);
The created column contains Unix timestamps, and status can be an integer between -10 and 10. The records are evenly distributed with respect to the created date, and around half of them have status 0 or -10.
I have a cron job that selects records with certain statuses that are between 8 and 32 days old, processes them and then deletes them. The query is as follows:
SELECT
records.id
FROM records
WHERE
(records.status = 0 OR records.status = -10)
AND records.created BETWEEN UNIX_TIMESTAMP() - 32 * 86400 AND UNIX_TIMESTAMP() - 8 * 86400
LIMIT 500
The query was fast when the records were at the beginning of the creation interval, but now that the cleanup has reached the records at the end of the interval it takes about 10 seconds to run. EXPLAIN shows that the query uses the index, but it examines about 40 million rows.
My question is whether there is anything I can do to improve the performance of the query and, if so, how exactly.
Thank you.
I think UNION ALL is your best approach:
(SELECT r.id
FROM records r
WHERE r.status = 0 AND
r.created BETWEEN UNIX_TIMESTAMP() - 32 * 86400 AND UNIX_TIMESTAMP() - 8 * 86400
LIMIT 500
) UNION ALL
(SELECT r.id
FROM records r
WHERE r.status = -10 AND
r.created BETWEEN UNIX_TIMESTAMP() - 32 * 86400 AND UNIX_TIMESTAMP() - 8 * 86400
LIMIT 500
)
LIMIT 500;
This can use an index on records(status, created, id).
Note: use UNION instead of UNION ALL if the result could contain duplicate ids.
You are also using LIMIT with no ORDER BY. That is generally discouraged, because which 500 rows come back is then not guaranteed.
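A minimal sketch of the index mentioned above, plus one branch of the UNION ALL with an ORDER BY added. The index name is made up for illustration, and ordering by created and then id is only an assumption about which rows should be processed first. With status fixed to a single value, an index on (status, created, id) is already in that order, so the ORDER BY should not need an extra sort.

```sql
-- Hypothetical index name; column order as suggested above.
CREATE INDEX status_created_id_idx ON records (status, created, id);

-- One branch of the UNION ALL, made deterministic with ORDER BY.
-- The ordering (oldest first) is an assumption, not part of the original query.
SELECT r.id
FROM records r
WHERE r.status = 0
  AND r.created BETWEEN UNIX_TIMESTAMP() - 32 * 86400
                    AND UNIX_TIMESTAMP() - 8 * 86400
ORDER BY r.created, r.id
LIMIT 500;
```

Running EXPLAIN on each branch is a quick way to check that the new index is actually chosen.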
Your index is in the wrong order. You should put the IN column (status) first (you phrased it as an OR), and put the 'range' column (created) last:
INDEX(status, created)
(Don't give me any guff about "cardinality"; we are not looking at individual columns.)
Are there really only 3 columns in the table? Do you need id? If not, get rid of it and change to
PRIMARY KEY(status, created)
Other techniques for walking through large tables efficiently.
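The index reorder suggested above could be applied roughly like this. The new index name is hypothetical, and dropping the old index assumes that no other query depends on the (created, status) order; check that before running it on a table of this size.

```sql
-- Replace the (created, status) index with (status, created) in one ALTER.
-- status_and_created_idx is a made-up name.
ALTER TABLE records
  DROP INDEX created_and_status_idx,
  ADD INDEX status_and_created_idx (status, created);

-- The OR from the question written as IN, which is the form the answer refers to.
EXPLAIN
SELECT id
FROM records
WHERE status IN (0, -10)
  AND created BETWEEN UNIX_TIMESTAMP() - 32 * 86400
                  AND UNIX_TIMESTAMP() - 8 * 86400
LIMIT 500;
```

In the EXPLAIN output, the `rows` estimate should drop well below the 40 million reported in the question if the new index is used as intended.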
Related
I'm using MySQL 5.7.
There is a table called transactions with over 3 million records. The table schema is as follows:
id - INT (autoincrements)
deleted_at (DATETIME, NULL allowed)
record_status (TINYINT, DEFAULT value is 1)
Other columns pertaining to this table...
The record_status column is an integer version of deleted_at: when a record is deleted, record_status is set to 0. There is also an index on this column.
The null based DATETIME query takes 740 ms to execute:
select transactions.id from transactions where transactions.deleted_at is null
The TINYINT based query takes 15.1 s to execute:
select transactions.id from transactions where transactions.record_status = 1
Isn't the check on the TINYINT column (with index) supposed to be faster? Why is this happening?
[EDIT]
Added information about the table's performance
To take the experiment further, all unnecessary columns were removed from the table. Only the following persist.
id - INT (autoincrements)
deleted_at (DATETIME, NULL allowed)
record_status (TINYINT, DEFAULT value is 1)
transaction_time (DATETIME)
Query 1: Takes 2.3ms
select transactions.id from transactions
where transactions.record_status = 1 limit 1000000;
Query 2: Takes 2.1ms
select transactions.id from transactions
where transactions.deleted_at is null limit 1000000;
Query 3: Takes 20 seconds
select transactions.id from transactions
where transactions.record_status = 1
and transaction_time > '2020-04-01' limit 1000;
Query 4: Takes 500ms
select transactions.id from transactions
where transactions.deleted_at is null
and transaction_time > '2020-04-01' limit 1000;
Query 5: 394ms
select transactions.id from transactions
where transaction_time > '2020-04-01' limit 1000000;
I'm unable to figure out why Query 3 is taking this long.
The issue was addressed by adding a composite index.
Both the following now result in fast performance.
Composite key on transaction_time and deleted_at.
Composite key on transaction_time and record_status.
Thanks to @Gordon Linoff and @jarlh. Their suggestions led to this finding.
select transactions.id from transactions
where transactions.record_status = 1
and transaction_time > '2020-04-01' limit 1000;
cannot be optimized without a composite index like this -- in this order:
INDEX(record_status, transaction_time)
(No, two separate single-column indexes will not work as well.)
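As a sketch, the composite index could be added and checked like this. The index name is an assumption; the column order is the one given above (equality column first, range column last).

```sql
-- Hypothetical index name; (record_status, transaction_time) order as above.
ALTER TABLE transactions
  ADD INDEX status_time_idx (record_status, transaction_time);

-- Query 3 from the question; EXPLAIN should now list status_time_idx as the key.
EXPLAIN
SELECT transactions.id
FROM transactions
WHERE transactions.record_status = 1
  AND transaction_time > '2020-04-01'
LIMIT 1000;
```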
I have written a query to select all rows where the value of the column 'gvA' is 0 in the previous row and non-zero in the current row. My issue is that this query takes too long to execute.
My table has 40000 rows and the query takes about 60-65 seconds, which is too much. How can I improve the query for better performance? Following is my query:
SELECT device_no,datetime
FROM (
SELECT
gvA,
(SELECT e2.gvA
FROM tyn_records e2
WHERE e2.tyn_id < e1.tyn_id
ORDER BY tyn_id DESC LIMIT 1) as previous_value,
datetime,
device_no
FROM tyn_records e1
WHERE gvA > 0 AND DATE(datetime) = CURDATE() - INTERVAL 2 DAY
) selected
WHERE selected.previous_value = 0
Following are my tables
Devices:
tyn_records:
I would do two things:
I would rephrase the query a bit, specifically to remove the DATE() function on the left side of the filtering condition.
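select
  device_no,
  datetime
from (
  select
    gva,
    -- lag() needs MySQL 8.0+. It must also see the rows where gva = 0,
    -- so the gva > 0 filter is applied in the outer query, not here.
    lag(gva) over(order by tyn_id) as previous_value,
    datetime,
    device_no
  from tyn_records
  -- half-open range: exactly the day two days ago, index-friendly
  -- (the first row in this range has no previous row, so it gets NULL)
  where datetime >= curdate() - interval 2 day
    and datetime <  curdate() - interval 1 day
) x
where gva > 0
  and previous_value = 0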
With the function on the left side of the predicate removed, you can create an index suitable to optimize the query:
create index ix1 on tyn_records (datetime, gva);
As a side note, the way you compute previous_value may not be deterministic, and could produce different results each time you run the query. This may happen if the column tyn_id is non unique.
I have a big MySQL table that contains values for all kinds of data (all with a different data_id) and a timestamp (Unix timestamp in ms). I am trying to build a (real-time) plotter for all this data and I want to be able to plot any data on the vertical axis against any other data on the horizontal axis. The problem I encounter is how to couple datapoints efficiently based on their timestamps.
The dataset is quite large and the logging frequency is about 10 Hz, and I want one datapoint for every 1-5 minutes. I already managed to make a (kinda) efficient SQL call to get an average value and an average timestamp for every 1 minute:
SELECT AVG(value), AVG(timestamp)
FROM
(
(
SELECT value, timestamp
FROM database
WHERE
data_id = 100 AND
timestamp < ... and timestamp > ...
ORDER BY timestamp DESC
) as data
)
GROUP BY timestamp DIV 60000
ORDER BY timestamp DESC;
However, now I want to be able to plot for example data_id 100 against data_id 200 instead of data_id 100 against time. So how do I couple the values for data_id 100 and 200 for a timestep of about 1 minute for a large dataset?
I already tried the following, but the SQL call took way too long...
SELECT a.timestamp, a.value, b.value
FROM
(
SELECT value, timestamp
FROM daq_test.data_f32
WHERE
data_id = 166 AND
timestamp < 1507720000000 AND
timestamp > 1507334400000
ORDER BY timestamp DESC
) a,
(
SELECT value, timestamp
FROM daq_test.data_f32
WHERE
data_id = 137 AND
timestamp < 1507720000000 AND
timestamp > 1507334400000
ORDER BY timestamp DESC
) b
WHERE a.timestamp DIV 60000 = b.timestamp DIV 60000
ORDER BY a.timestamp DESC;
Well, I have no idea what the point of this query is. But my suggestion is to create an index based on the columns used in the WHERE clause.
So, if you are searching for records by data_id and timestamp, it is a good idea to create a composite index on those two columns.
Also, the most significant slowdown is probably caused by the ORDER BY timestamp.
Can you run EXPLAIN SELECT and edit your question with the output, so I can update the answer to be more accurate?
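A sketch of the composite index suggested above, followed by one possible way to pair the two series. The second part is my own assumption, not something from the answer: if each data_id really is logged at about 10 Hz, a one-minute bucket holds around 600 rows per series, so joining raw rows on the bucket pairs every row of one series with every row of the other. Averaging each series per bucket first and joining the averages avoids that. The index name and the column aliases are made up.

```sql
-- Hypothetical index name; covers the data_id equality and the timestamp range.
CREATE INDEX data_id_timestamp_idx
  ON daq_test.data_f32 (data_id, `timestamp`);

-- Average each series per one-minute bucket, then join one row per bucket.
SELECT a.bucket * 60000 AS bucket_start_ms,
       a.avg_value      AS value_166,
       b.avg_value      AS value_137
FROM (
  SELECT `timestamp` DIV 60000 AS bucket, AVG(`value`) AS avg_value
  FROM daq_test.data_f32
  WHERE data_id = 166
    AND `timestamp` > 1507334400000
    AND `timestamp` < 1507720000000
  GROUP BY bucket
) a
JOIN (
  SELECT `timestamp` DIV 60000 AS bucket, AVG(`value`) AS avg_value
  FROM daq_test.data_f32
  WHERE data_id = 137
    AND `timestamp` > 1507334400000
    AND `timestamp` < 1507720000000
  GROUP BY bucket
) b ON b.bucket = a.bucket
ORDER BY a.bucket DESC;
```

Buckets that have data for only one of the two series are dropped by the inner join, which is usually what a value-against-value plot needs.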
I have a table user_notifications that has 1100000 records. I have to run the query below, but it takes more than 3 minutes to complete. What can I do to improve the fetch time?
SELECT `user_notifications`.`user_id`
FROM `user_notifications`
WHERE `user_notifications`.`notification_template_id` = 175
AND (DATE(sent_at) >= DATE_SUB(CURDATE(), INTERVAL 4 day))
AND `user_notifications`.`user_id` IN (
1203, 1282, 1499, 2244, 2575, 2697, 2828, 2900, 3085, 3989,
5264, 5314, 5368, 5452, 5603, 6133, 6498..
)
The IN list sometimes contains up to 1k user ids.
For optimisation I have indexed the user_id and notification_template_id columns in the user_notifications table.
Big IN() lists are inherently slow. Create a temporary table with an index and put the values from the IN() list into that temporary table instead; then you'll get the power of an indexed join instead of a giant IN() list.
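A minimal sketch of the temporary-table approach. The table name wanted_users is hypothetical, INT is an assumed type for user_id, and only the first three ids from the question are inserted here as sample data.

```sql
-- Hypothetical temporary table; the primary key doubles as the index.
CREATE TEMPORARY TABLE wanted_users (
  user_id INT NOT NULL PRIMARY KEY
);

-- Load the ids that would otherwise go into the IN() list.
INSERT INTO wanted_users (user_id) VALUES (1203), (1282), (1499);

-- Join against the temporary table instead of using IN (...).
SELECT un.user_id
FROM user_notifications un
JOIN wanted_users w ON w.user_id = un.user_id
WHERE un.notification_template_id = 175
  AND DATE(un.sent_at) >= DATE_SUB(CURDATE(), INTERVAL 4 DAY);
```

The temporary table is dropped automatically when the connection closes.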
You seem to be querying for a small date range. How about having an index based on SENT_AT column? Do you know what index the current query is using?
(1) Don't hide columns in functions if you might need to use an index:
AND (DATE(sent_at) >= DATE_SUB(CURDATE(), INTERVAL 4 day))
-->
AND sent_at >= CURDATE() - INTERVAL 4 day
(2) Use a "composite" index for
WHERE `notification_template_id` = 175
AND sent_at >= ...
AND `user_id` IN (...)
The first column should be the one with '='. It is unclear what to put next, so I suggest adding both of these indexes:
INDEX(notification_template_id, user_id, sent_at)
INDEX(notification_template_id, sent_at)
The Optimizer will probably pick between them correctly.
Composite indexes are not the same as indexes on the individual columns.
(3) Yes, you could try putting the IN list in a tmp table, but the cost of doing such might outweigh the benefit. I don't think of 1K values in IN() as being "too many".
(4) My cookbook on building indexes.
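Points (1) and (2) put together might look like the following. The index names are made up, and the IN list is shortened to the first three ids from the question.

```sql
-- Both candidate indexes from (2); names are hypothetical.
ALTER TABLE user_notifications
  ADD INDEX template_user_sent_idx (notification_template_id, user_id, sent_at),
  ADD INDEX template_sent_idx (notification_template_id, sent_at);

-- sent_at is compared directly, without DATE(), so an index on it can be used.
EXPLAIN
SELECT user_id
FROM user_notifications
WHERE notification_template_id = 175
  AND sent_at >= CURDATE() - INTERVAL 4 DAY
  AND user_id IN (1203, 1282, 1499);
```

The `key` column of the EXPLAIN output shows which of the two indexes the optimizer picked; the one that is never chosen can be dropped later.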
I have a MySQL table like this one:
day int(11)
hour int(11)
amount int(11)
Day is an integer with a value that spans from 0 to 365; assume hour is a timestamp and amount is just a simple integer. What I want to do is select the value of the amount field for a certain group of days (for example from 0 to 10), but I only need the last value of amount available for each day, which practically is the row where the hour field has its max value within that day. This doesn't sound too hard, but the solution I came up with is completely inefficient.
Here it is:
SELECT q.day, q.amount
FROM amt_table q
WHERE q.day >= 0 AND q.day <= 4 AND q.hour = (
SELECT MAX(p.hour) FROM amt_table p WHERE p.day = q.day
) GROUP BY day
It takes 5 seconds to execute that query on an 11k-row table, and that is for a span of just 5 days; I may need to select a span of an entire month or year, so this is not a valid solution.
Anybody who can help me find another solution or optimize this one is really appreciated.
EDIT
No indexes are set, but (day, hour, amount) could be a PRIMARY KEY if needed
Use:
SELECT a.day,
a.amount
FROM AMT_TABLE a
JOIN (SELECT t.day,
MAX(t.hour) AS max_hour
FROM AMT_TABLE t
GROUP BY t.day) b ON b.day = a.day
AND b.max_hour = a.hour
WHERE a.day BETWEEN 0 AND 4
I think you're using the GROUP BY day just to get a single amount value per day, but it's not reliable because in MySQL, the values of columns that are not in the GROUP BY are arbitrary, so the value could change. Sadly, MySQL before 8.0 doesn't support analytic functions (ROW_NUMBER, etc.), which is what you'd typically use for cases like these.
Look at indexes on the primary keys first, then add indexes on the columns used to join tables together. Composite indexes (more than one column to an index) are an option too.
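A sketch of both ideas: an index for the join or subquery on (day, hour), and, assuming MySQL 8.0 or later where window functions are available, a ROW_NUMBER version of the query. The index name and the aliases are made up.

```sql
-- Hypothetical index name; lets MAX(hour) per day be read from the index.
ALTER TABLE amt_table ADD INDEX day_hour_idx (day, hour);

-- MySQL 8.0+ only: keep the row with the largest hour for each day.
SELECT day, amount
FROM (
  SELECT day,
         amount,
         ROW_NUMBER() OVER (PARTITION BY day ORDER BY hour DESC) AS rn
  FROM amt_table
  WHERE day BETWEEN 0 AND 4
) ranked
WHERE rn = 1;
```

If two rows of the same day share the same maximum hour, ROW_NUMBER keeps only one of them, and which one is not defined unless another column is added to the ORDER BY.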
I think the problem is the subquery in the WHERE clause. MySQL will first calculate "SELECT MAX(p.hour) FROM amt_table p WHERE p.day = q.day" for the whole table and only afterwards select the days. Not quite efficient :-)