I'm using MySQL 5.7.
There is a table called transactions with over 3 million records. The table schema is as follows:
id - INT (autoincrements)
deleted_at (DATETIME, NULL allowed)
record_status (TINYINT, DEFAULT value is 1)
Other columns pertaining to this table...
The record_status column is an integer equivalent of deleted_at: when a record is deleted, the value is set to 0. An index was also created on this column.
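(A sketch of that single-column index; the index name here is an assumption:)
CREATE INDEX idx_record_status ON transactions (record_status);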
The NULL-based DATETIME query takes 740 ms to execute:
select transactions.id from transactions where transactions.deleted_at is null
The TINYINT-based query takes 15.1 s to execute:
select transactions.id from transactions where transactions.record_status = 1
Isn't the check on the TINYINT column (with index) supposed to be faster? Why is this happening?
[EDIT]
Added more information about the table's query performance.
To take the experiment further, all unnecessary columns were removed from the table. Only the following remain:
id - INT (autoincrements)
deleted_at (DATETIME, NULL allowed)
record_status (TINYINT, DEFAULT value is 1)
transaction_time (DATETIME)
Query 1: Takes 2.3ms
select transactions.id from transactions
where transactions.record_status = 1 limit 1000000;
Query 2: Takes 2.1ms
select transactions.id from transactions
where transactions.deleted_at is null limit 1000000;
Query 3: Takes 20 seconds
select transactions.id from transactions
where transactions.record_status = 1
and transaction_time > '2020-04-01' limit 1000;
Query 4: Takes 500ms
select transactions.id from transactions
where transactions.deleted_at is null
and transaction_time > '2020-04-01' limit 1000;
Query 5: 394ms
select transactions.id from transactions
where transaction_time > '2020-04-01' limit 1000000;
I'm unable to figure out why Query 3 is taking this long.
The issue was addressed by adding a composite index.
Both of the following now result in fast performance:
Composite key on transaction_time and deleted_at.
Composite key on transaction_time and record_status.
Thanks to @Gordon Linoff and @jarlh; their suggestions led to this finding.
select transactions.id from transactions
where transactions.record_status = 1
and transaction_time > '2020-04-01' limit 1000;
cannot be optimized without a composite index like this -- in this order:
INDEX(record_status, transaction_time)
(No, two separate single-column indexes will not work as well.)
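As a sketch, such an index could be added like this (the index name is arbitrary):
ALTER TABLE transactions
  ADD INDEX idx_status_time (record_status, transaction_time);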
Related
My table is defined as follows:
CREATE TABLE `tracking_info` (
`tid` int(25) NOT NULL AUTO_INCREMENT,
`tracking_customer_id` int(11) NOT NULL DEFAULT '0',
`tracking_content` text NOT NULL,
`tracking_type` int(11) NOT NULL DEFAULT '0',
`time_recorded` int(25) NOT NULL DEFAULT '0',
PRIMARY KEY (`tid`),
KEY `time_recorded` (`time_recorded`),
KEY `tracking_idx` (`tracking_customer_id`,`tracking_type`,
`time_recorded`,`tid`)
) ENGINE=MyISAM
The table contains about 150 million records. Here is the query:
SELECT tracking_content, tracking_type, time_recorded
FROM tracking_info
WHERE FROM_UNIXTIME(time_recorded) > DATE_SUB( NOW( ) ,
INTERVAL 90 DAY )
AND tracking_customer_id = 111111
ORDER BY time_recorded DESC
LIMIT 0,10
It takes about a minute to run the query even without ORDER BY. Any thoughts? Thanks in advance!
First, refactor the query so it's sargable.
SELECT tracking_content, tracking_type, time_recorded
FROM tracking_info
WHERE time_recorded > UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 90 DAY))
AND tracking_customer_id = 111111
ORDER BY time_recorded DESC
LIMIT 0,10;
Then add this multi-column index:
ALTER TABLE tracking_info
ADD INDEX cust_time (tracking_customer_id, time_recorded DESC);
Why will this help?
It compares the raw data in a column with a constant, rather than using the FROM_UNIXTIME() function to transform all the data in that column of the table. That makes the query sargable.
The query planner can random-access the suggested index to find the first eligible row, read ten rows sequentially from the index, look up what it needs from the table for each, and then stop.
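To confirm, you can run EXPLAIN on the refactored query (a sketch; with the index above in place, the key column should report cust_time):
EXPLAIN SELECT tracking_content, tracking_type, time_recorded
FROM tracking_info
WHERE time_recorded > UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 90 DAY))
  AND tracking_customer_id = 111111
ORDER BY time_recorded DESC
LIMIT 0,10;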
You can rephrase the query to isolate time_recorded, as in:
SELECT tracking_content, tracking_type, time_recorded
FROM tracking_info
WHERE time_recorded > UNIX_TIMESTAMP(DATE_SUB(NOW(), INTERVAL 90 DAY))
AND tracking_customer_id = 111111
ORDER BY time_recorded DESC
LIMIT 0,10
Then, the following index will probably make the query faster:
create index ix1 on tracking_info (tracking_customer_id, time_recorded);
There are 3 things to do:
Change to InnoDB.
Add INDEX(tracking_customer_id, time_recorded)
Rephrase the filter to time_recorded > UNIX_TIMESTAMP(NOW() - INTERVAL 90 DAY), as sketched below, so it stays sargable.
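A sketch of those three changes (the index name is arbitrary, and the rephrased filter assumes time_recorded keeps holding Unix timestamps in seconds):
ALTER TABLE tracking_info ENGINE=InnoDB;

ALTER TABLE tracking_info
  ADD INDEX cust_time_idx (tracking_customer_id, time_recorded);

SELECT tracking_content, tracking_type, time_recorded
FROM tracking_info
WHERE time_recorded > UNIX_TIMESTAMP(NOW() - INTERVAL 90 DAY)
  AND tracking_customer_id = 111111
ORDER BY time_recorded DESC
LIMIT 0,10;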
Non-critical notes:
int(25) -- the "25" has no meaning. You get a 4-byte signed number regardless.
There are datatypes DATETIME and TIMESTAMP; consider using one of them instead of an INT that represents seconds since sometime. (It would be messy to change, so don't bother.)
When converting to InnoDB, the size on disk will double or triple.
I have the following table structure.
id (INT) index
date (TIMESTAMP) index
companyId (INT) index
This is the problem I am facing
companyId 111: has a total of 100000 rows in a 1 year time period.
companyId 222: has a total of 8000 rows in a 1 year time period.
If companyId 111 has 100 rows between '2020-09-01 00:00:00' AND '2020-09-06 23:59:59' and companyId 222 has 2000 rows in the same date range, companyId 111 is much slower than 222, even though it has fewer rows in the selected date range.
Shouldn't MySQL ignore all the rows outside the date range so the query becomes faster?
This is a query example I am using:
SELECT columns FROM table WHERE date BETWEEN '2020-09-01 00:00:00' AND '2020-09-06 23:59:59' AND companyId = 111;
Thank you
I would suggest a composite index here:
CREATE INDEX idx ON yourTable (companyId, date);
The problem with your premise is that, while you have an index on each column, you don't have any indices completely covering the WHERE clause of your example query. As a result, MySQL might even choose to not use any of your indices. You can also try reversing the order of the index above to compare performance:
CREATE INDEX idx ON yourTable (date, companyId);
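Either way, you can compare the two with EXPLAIN (a sketch using the placeholder names from your question):
EXPLAIN SELECT columns
FROM yourTable
WHERE date BETWEEN '2020-09-01 00:00:00' AND '2020-09-06 23:59:59'
  AND companyId = 111;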
I have a big MySQL table that contains values for all kinds of data (each with a different data_id) and a timestamp (Unix timestamp in ms). I am trying to build a (real-time) plotter for all this data, and I want to be able to plot any data on the vertical axis against any other data on the horizontal axis. The problem I encounter is how to couple datapoints efficiently based on their timestamps.
The dataset is quite large, the logging frequency is about 10 Hz, and I want a datapoint for every 1-5 minutes. I already managed to make a (kinda) efficient SQL call to get an average value and an average timestamp for every minute:
SELECT AVG(value), AVG(timestamp)
FROM
(
SELECT value, timestamp
FROM database
WHERE
data_id = 100 AND
timestamp < ... AND timestamp > ...
ORDER BY timestamp DESC
) AS data
GROUP BY timestamp DIV 60000
ORDER BY timestamp DESC;
However, now I want to be able to plot for example data_id 100 against data_id 200 instead of data_id 100 against time. So how do I couple the values for data_id 100 and 200 for a timestep of about 1 minute for a large dataset?
I already tried the following, but the SQL call took way too long...
SELECT a.timestamp, a.value, b.value
FROM
(
SELECT value, timestamp
FROM daq_test.data_f32
WHERE
data_id = 166 AND
timestamp < 1507720000000 AND
timestamp > 1507334400000
ORDER BY timestamp DESC
) a,
(
SELECT value, timestamp
FROM daq_test.data_f32
WHERE
data_id = 137 AND
timestamp < 1507720000000 AND
timestamp > 1507334400000
ORDER BY timestamp DESC
) b
WHERE a.timestamp DIV 60000 = b.timestamp DIV 60000
ORDER BY a.timestamp DESC;
Well, I have no idea what the point of this query is, but my suggestion is to create an index based on the parameters in your WHERE clause.
So, if you are filtering records by data_id and timestamp, it is a good idea to create a composite index on those two columns, as sketched below.
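(The index name here is arbitrary; the table and column names come from your query.)
CREATE INDEX idx_data_time ON daq_test.data_f32 (data_id, timestamp);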
Also, the most significant slowdown is probably caused by the ORDER BY timestamp.
Can you run EXPLAIN on the SELECT and edit your question with the output, so I can update this answer with more specific advice?
I have a table which currently has about 80 million rows, created as follows:
create table records
(
id int auto_increment primary key,
created int not null,
status int default '0' not null
)
collate = utf8_unicode_ci;
create index created_and_status_idx
on records (created, status);
The created column contains unix timestamps and status can be an integer between -10 and 10. The records are evenly distributed regarding the created date, and around half of them are of status 0 or -10.
I have a cron that selects records that are between 32 and 8 days old, processes them and then deletes them, for certain statuses. The query is as follows:
SELECT
records.id
FROM records
WHERE
(records.status = 0 OR records.status = -10)
AND records.created BETWEEN UNIX_TIMESTAMP() - 32 * 86400 AND UNIX_TIMESTAMP() - 8 * 86400
LIMIT 500
The query was fast when the records were at the beginning of the creation interval, but now that the cleanup has reached the records at the end of the interval it takes about 10 seconds to run. EXPLAIN says the query uses the index, but it examines about 40 million rows.
My question is if there is anything I can do to improve the performance of the query, and if so, how exactly.
Thank you.
I think UNION ALL is your best approach:
(SELECT r.id
FROM records r
WHERE r.status = 0 AND
r.created BETWEEN UNIX_TIMESTAMP() - 32 * 86400 AND UNIX_TIMESTAMP() - 8 * 86400
LIMIT 500
) UNION ALL
(SELECT r.id
FROM records r
WHERE r.status = -10 AND
r.created BETWEEN UNIX_TIMESTAMP() - 32 * 86400 AND UNIX_TIMESTAMP() - 8 * 86400
LIMIT 500
)
LIMIT 500;
This can use an index on records(status, created, id).
Note: use union if records.id could have duplicates.
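A sketch of that index (the name is arbitrary):
CREATE INDEX idx_status_created_id ON records (status, created, id);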
You are also using LIMIT with no ORDER BY. That is generally discouraged.
Your index is in the wrong order. You should put the IN column (status) first (you phrased it as an OR), and put the 'range' column (created) last:
INDEX(status, created)
(Don't give me any guff about "cardinality"; we are not looking at individual columns.)
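A sketch of adding that index (the name is arbitrary):
ALTER TABLE records ADD INDEX idx_status_created (status, created);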
Are there really only 3 columns in the table? Do you need id? If not, get rid of it and change to
PRIMARY KEY(status, created)
There are other techniques for walking through large tables efficiently as well.
Hey guys, I have a quick question regarding SQL performance. I have a really, really large table and it takes forever to run the query below; note that there is a timestamp column.
select name,emails,
count(*) as cnt
from table
where date(timestamp) between '2016-01-20' and '2016-02-3'
and name is not null
group by 1,2;
So my friend suggested to use this query below:
select name,emails,
count(*) as cnt
from table
where timestamp between date_sub(curdate(), interval 14 day)
and date_add(curdate(), interval 1 day)
and name is not null
group by 1,2;
And this takes much less time to run. Why? What's the difference between those two time functions?
And is there another way to run this even faster, like an index? Can someone explain to me how MySQL runs this? Thanks a lot!
Just add an index on the timestamp field and use a query like the one below:
select name,emails,
count(*) as cnt
from table
where `timestamp` between '2016-01-20 00:00:00' and '2016-02-03 23:59:59'
and name is not null
group by 1,2;
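For completeness, the index could be added like this (the index name is arbitrary; `table` is the placeholder table name from your question):
ALTER TABLE `table` ADD INDEX idx_ts (`timestamp`);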
Why? What's the difference between those two time functions?
In the first query you wrap your column in the DATE() function, so MySQL cannot use an index on that column and falls back to a full table scan. In the second (suggested) query the DATE(timestamp) wrapper is gone, so MySQL can check the values from the index instead of scanning the table, which is why it is faster.
MySQL will use the index with my query above as well.