Optimize SQL to fetch 1 day of data - MySQL

I need to fetch the last 24 hours of data, and this query runs frequently.
Since it scans many rows, running it frequently affects database performance.
MySQL's execution strategy picks the index on created_at, which returns approximately 100,000 rows; these rows are then scanned one by one to filter customer_id = 10, and my final result has 20,000 rows.
How can I optimize this query?
explain SELECT *
FROM `order`
WHERE customer_id = 10
and `created_at` >= NOW() - INTERVAL 1 DAY;
id : 1
select_type : SIMPLE
table : order
partitions : NULL
type : range
possible_keys : idx_customer_id, idx_order_created_at
key : idx_order_created_at
key_len : 5
ref : NULL
rows : 103357
filtered : 1.22
Extra : Using index condition; Using where

The first optimization I would do is on the access to the table:
create index ix1 on `order` (customer_id, created_at);
Then, if the query is still slow I would try appending the columns you are selecting to the index. If, for example, you are selecting the columns order_id, amount, and status:
create index ix1 on `order` (customer_id, created_at,
order_id, amount, status);
This second strategy could be beneficial, but you'll need to test it to find out what performance improvement it produces in your particular case.
The big advantage of this second strategy is that it walks the secondary index only, avoiding the walk back to the table's primary clustered index (which can be time-consuming).
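A quick way to verify the covering behavior (a hypothetical check; the column list matches the example above) is to look at the Extra column of EXPLAIN:
EXPLAIN SELECT order_id, amount, status
FROM `order`
WHERE customer_id = 10
  AND created_at >= NOW() - INTERVAL 1 DAY;
-- Once ix1 covers the query, Extra shows "Using index"
-- (no trips back to the clustered PK).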

Instead of two single-column indexes on customer_id and created_at, create a single composite index on (customer_id, created_at). This way the index engine can use BOTH parts of the WHERE clause instead of just one: jump right to the customer ID, then jump directly to the desired date, then return the results. It SHOULD be very fast.
Additional follow-up.
I hear your comment about having multiple indexes, but you can add those columns into the main one, just after the first two, such as
(customer_id, created_at, updated_at, completion_time)
Then your queries could always include some help for the index in the WHERE clause. For example (and I don't know your specific data): a record is created at some given point, and the updated and completion times will always be AFTER that. How long does it take (worst-case scenario) from creation to completion... 2 days, 10 days, 90 days?
WHERE customer_id = ?
  AND created_at >= NOW() - INTERVAL 10 DAY
  AND updated_at >= NOW() - INTERVAL 1 DAY
Again, just an example, but if a person has thousands of orders and a relatively quick turnaround time, you could jump to the most recent ones and then find those updated within the time period. Again, just an option: a single index vs. 3, 4, or more indexes.
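A sketch of the single wide index this answer describes (the index name is made up):
CREATE INDEX ix_customer_activity
  ON `order` (customer_id, created_at, updated_at, completion_time);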

It seems you are dealing with a very fast-growing table; I would consider moving this frequent query to a cold table or a replica.
One more point: did you consider partitioning by customer_id? I don't quite understand the business logic behind querying customer_id = 10, but if it's a multi-tenancy application, try partitioning.
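A minimal sketch of what that could look like, assuming order_id is the AUTO_INCREMENT primary key. Note that MySQL requires the partitioning column to appear in every unique key, so the primary key has to be widened first:
-- Widen the PK so it contains customer_id (assumes no other unique keys).
ALTER TABLE `order`
  ADD INDEX (order_id),        -- keeps AUTO_INCREMENT satisfied
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (customer_id, order_id);
-- Partitioning cannot be combined with other clauses in one ALTER.
ALTER TABLE `order`
  PARTITION BY HASH (customer_id) PARTITIONS 16;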

For this query:
SELECT o.*
FROM `order` o
WHERE o.customer_id = 10 AND
created_at >= NOW() - INTERVAL 1 DAY;
My first inclination would be a composite index on (customer_id, created_at) -- as others have suggested.
But you appear to have a lot of data and many inserts per day. That suggests partitioning plus an index. The appropriate partitioning would be by created_at, probably on a daily basis, along with an index on customer_id.
A typical query would access the two most recent partitions. Because your queries are focused on recent data, this also reduces the memory occupied by the index, which might be an overall benefit.
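An illustrative sketch of that layout (the dates and the widened primary key are assumptions; MySQL requires the partitioning column to be part of the primary key):
ALTER TABLE `order`
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (order_id, created_at);
ALTER TABLE `order` ADD INDEX (customer_id);
ALTER TABLE `order`
  PARTITION BY RANGE COLUMNS (created_at) (
    PARTITION p20210301 VALUES LESS THAN ('2021-03-02'),
    PARTITION p20210302 VALUES LESS THAN ('2021-03-03'),
    PARTITION pmax      VALUES LESS THAN (MAXVALUE)
  );
-- A nightly job would REORGANIZE pmax to add the next day's partition.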

This technique should be better than all the other answers, though perhaps by only a small amount:
Instead of orders being indexed thus:
PRIMARY KEY(order_id) -- AUTO_INCREMENT
INDEX(customer_id, ...) -- created_at, and possibly others
do this to "cluster" the rows together:
PRIMARY KEY(customer_id, order_id)
INDEX (order_id) -- to keep AUTO_INCREMENT happy
Then you can optionally have more indexes starting with customer_id as needed. Or not.
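A hypothetical ALTER that makes this change in one statement (assuming order_id is the AUTO_INCREMENT column):
ALTER TABLE `order`
  ADD INDEX (order_id),        -- keeps AUTO_INCREMENT happy
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (customer_id, order_id);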
Another issue -- what will you do with 20K rows? That is a lot to feed to a client, especially one of the human type. If you then munch on it, can't you write a more complex query that does more of the work and returns fewer rows? That will probably be faster.

Related

MySQL performance for aggregate functions -- 80 million records

I am currently stuck improving the performance of a MySQL query. It takes 30 seconds to execute, and we don't want users waiting that long for the backend response.
My Query:
select count(case_id), sum(net_value), sum(total_time_spent), events from event_log group by events order by count(case_id) desc
Indexes:
Created a composite index on (events, case_id, net_value, total_time_spent).
Time taken: 30 seconds
Number of records in the event_log table: 80 million
Table structure:
CREATE TABLE event_log (
  case_id varchar(100) PRIMARY KEY,
  events varchar(200),
  creation_date timestamp,
  total_time_spent bigint
);
Composite unique key: case_id, events, creation_date.
Infrastructure:
AWS RDS instance type: r5d.2xlarge (8 CPUs, 64 GB RAM)
Tried partitioning the data on the key case_id, but saw no improvement.
Tried upgrading the server size, but no improvement there either.
If you can give us some hints, or something we can try, that would be really helpful.
Build and maintain a Summary Table of events by day (or week) and subtotals of the counts and sums you need.
Then run the query against the summary table, summing up the sums, etc.
That may run 10 times as fast.
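A minimal sketch of such a summary table, with assumed names and a once-a-day refresh (net_value's type is a guess, since that column is not in the posted table definition):
CREATE TABLE event_log_daily (
  day DATE NOT NULL,
  events VARCHAR(200) NOT NULL,
  case_count BIGINT NOT NULL,
  net_value_sum DECIMAL(20,2) NOT NULL,
  time_spent_sum BIGINT NOT NULL,
  PRIMARY KEY (events, day)
);
-- Append yesterday's subtotals (run once per day):
INSERT INTO event_log_daily
SELECT DATE(creation_date), events, COUNT(*),
       SUM(net_value), SUM(total_time_spent)
FROM event_log
WHERE creation_date >= CURRENT_DATE - INTERVAL 1 DAY
  AND creation_date <  CURRENT_DATE
GROUP BY DATE(creation_date), events;
-- The 30-second report becomes a sum over subtotals:
SELECT events, SUM(case_count), SUM(net_value_sum), SUM(time_spent_sum)
FROM event_log_daily
GROUP BY events
ORDER BY SUM(case_count) DESC;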
If practical, normalize case_id and/or events; that may shrink the table size by a significant amount. Consider using a smaller datatype for the total_time_spent; BIGINT consumes 8 bytes.
With a summary table, few, if any, indexes are needed on the original table; the summary table itself is likely to have indexes. I would try to have its PRIMARY KEY start with events.
Be aware that COUNT(x) checks x for being NOT NULL. If this is not necessary, then simply do COUNT(*).

MySQL query run time is better even though its execution plan is bad

I am trying to optimize this MySQL query and, having little experience reading execution plans, I am having a hard time making sense of them.
My question is: can you please help me understand why the execution plan of the new query is worse than that of the original query, even though the new query performs better in prod?
The SQL needed to reproduce this case is here.
I have also kept the relevant table definitions at the end (table bill_range references bill via the foreign key bill_id).
The original query takes 10 seconds to complete in PROD:
select *
from bill_range
where (4050 between low and high )
order by bill_id limit 1;
while the new query (where I force/suggest an index) takes 5 seconds to complete in PROD:
select *
from bill_range
use index ( bill_range_low_high_index)
where (4050 between low and high )
order by bill_id limit 1;
But the execution plan suggests the original query is better (this is the part where my understanding seems to be wrong):
(The EXPLAIN screenshots for the original and new queries are not reproduced here; their key values are summarized below.)
The "type" column for the original query says index, while for the new query it says ALL.
The "key" column is bill_id (perhaps the index on the FK) for the original query, and NULL for the new query.
The "rows" column for the original query is 1, while for the new query it says 9.
So, given all this information, wouldn't it imply that the new query is actually worse than the original query?
And if that is true, why is the new query performing better? Or am I reading the execution plan wrong?
Table definitions
CREATE TABLE bill_range (
  id int(11) NOT NULL AUTO_INCREMENT,
  low varchar(255) NOT NULL,
  high varchar(255) NOT NULL,
  bill_id int(11) NOT NULL,
  PRIMARY KEY (id),
  FOREIGN KEY (bill_id) REFERENCES bill(id)
);
CREATE TABLE bill (
  id int(11) NOT NULL AUTO_INCREMENT,
  label varchar(10),
  PRIMARY KEY (id)
);
create index bill_range_low_high_index on bill_range( low, high);
NOTE: The reason I am providing the definitions of both tables is that the original query decided to use an index based on the foreign key to the bill table.
Your index isn't quite optimal for your query. Let me explain if I may.
MySQL indexes use BTREE data structures. Those work well in indexed-sequential access mode (hence the MyISAM name of MySQL's first storage engine). They favor queries that jump to a particular place in an index and then run through it element by element. The typical example is this, with an index on col:
SELECT whatever FROM tbl WHERE col >= constant AND col <= constant2
That is a rewrite of WHERE col BETWEEN constant AND constant2.
Let's recast your query so this pattern is obvious, and so the columns you want are explicit.
select id, low, high, bill_id
from bill_range
where low <= 4050
and high >= 4050
order by bill_id limit 1;
An index on the high column allows a range scan starting with the first eligible row with high >= 4050. Then, we can go on to make it a compound index, including the bill_id and low columns.
CREATE INDEX high_billid_low ON bill_range (high, bill_id, low);
Because we want the lowest matching bill_id, we put that into the index next, then finally the low value. So the query planner random-accesses the index to the first eligible row by high, then scans until it finds the first index entry that meets the low criterion. And then it's done: that's the desired result. It's already ordered by bill_id, so it can stop; the ORDER BY comes from the index. The query can be satisfied entirely from the index -- a so-called covering index.
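A quick way to confirm that (hypothetical; run after creating high_billid_low):
EXPLAIN SELECT id, low, high, bill_id
FROM bill_range
WHERE low <= 4050 AND high >= 4050
ORDER BY bill_id LIMIT 1;
-- Extra should include "Using index" once the query is satisfied
-- from the index alone.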
As to why your two queries performed differently: In the first, the query planner decided to scan your data in bill_id order looking for the first matching low/high pair. Possibly it decided that actually sorting a result set would likely be more expensive than scanning bill_ids in order. It looks to me like your second query did a table scan. Why that was faster, who knows?
Notice that this index would also work for you.
CREATE INDEX low_billid_high ON bill_range (low DESC, bill_id, high);
In InnoDB the table's PK id is implicitly part of every index, so there's no need to mention it in the compound index.
And, you can still write it the way you first wrote it; the query planner will figure out what you want.
Pro tip: Avoid SELECT * ... the * makes it harder to reason about the columns you need to retrieve.

Optimizing Datetime searches in huge MySQL InnoDB table

I am trying to optimize a big MySQL InnoDB table with 50 million rows in it. It is a kind of log: each row contains some columns with information and a DATETIME column.
These 50 million rows contain only 5-6 dates, so there are only a few distinct dates, but with different hours, minutes, and seconds. Each row has a unique ID (the primary key), and the DATETIME column has an index.
The searches are performed by date only (without hours, minutes, and seconds), e.g.:
select * from table where date(datetime_column) = '2021-03-08'
I've already tried to rewrite the queries without date() function, like:
select * from table where datetime_column >= '2021-03-08' and datetime_column <='2021-03-08 23:59:59'
But it's only slightly faster.
Also, I've created a new table holding the ID (the primary key from the main table) plus year, month, day, hour, minutes, and seconds as TINYINTs (the year is an INT(4)), made a combined index on them, and performed the select from the main table with a join to this new table. But it's still not fast enough, because the index on hours, minutes, and seconds becomes useless when those columns are not mentioned in the WHERE clause.
Also, I've thought about partitioning, but I think it won't help either.
Any ideas on how to solve this?
Change from
PRIMARY KEY(id),
INDEX(datetime)
to
PRIMARY KEY(datetime, id), -- to greatly speed up your range query
INDEX(id) -- sufficient to keep AUTO_INCREMENT happy
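A hypothetical ALTER implementing that change (log_table and datetime_column stand in for the real names from the question):
ALTER TABLE log_table
  ADD INDEX (id),              -- keeps AUTO_INCREMENT happy
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (datetime_column, id);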
Do not use DATE(datetime) = constant; it cannot use any index. Your other attempt can use an index in some situations. I like this way of phrasing it:
WHERE datetime >= '2021-03-08'
AND datetime < '2021-03-08' + INTERVAL 1 DAY
Oh, you say there is more to the WHERE? Let's see it; it may make a big difference! Also, let us know whether the datetime range does most of the filtering or whether the other clause(s) do more.
Many queries look something like
WHERE datetime in some range AND abc=123
That benefits from INDEX(abc, datetime), in that order. Pulling the PK trick above may also be beneficial: PRIMARY KEY(abc, datetime, id), INDEX(id).
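For example (abc and the table name are placeholders carried over from the prose above):
CREATE INDEX idx_abc_datetime ON log_table (abc, datetime_column);
-- or the clustered variant:
-- ALTER TABLE log_table
--   ADD INDEX (id),
--   DROP PRIMARY KEY,
--   ADD PRIMARY KEY (abc, datetime_column, id);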

MySQL: how to speed up an SQL query for getting data

I am using a MySQL database.
I have a table daily_price_history of stock values stored with the following fields. It has over 11 million rows:
id
symbolName
symbolId
volume
high
low
open
datetime
close
So for each stock symbolName there are various daily stock values, and the data is now more than 11 million rows.
The following SQL tries to get the last 100 days of daily data for a set of 1500 symbols:
SELECT `daily_price_history`.`id`,
`daily_price_history`.`symbolId_id`,
`daily_price_history`.`volume`,
`daily_price_history`.`close`
FROM `daily_price_history`
WHERE (`daily_price_history`.`id` IN
         (SELECT U0.`id`
          FROM `daily_price_history` U0
          WHERE (U0.`symbolName` = `daily_price_history`.`symbolName`
                 AND U0.`datetime` >= 1598471533546))
       AND `daily_price_history`.`symbolName` IN ('A', 'AA', ... /* 1500 symbol names */))
I have the table indexed on symbolName and also on datetime.
For getting 130K rows of data (i.e. 1500 × 100 ≈ 150,000) it takes 20 secs.
I also have weekly_price_history and monthly_price_history tables, and when I run similar SQL against them they take less time for the same number (130K) of rows, because they hold less data than the daily table.
weekly_price_history: getting 150K rows takes 3 s; it has 2.5 million rows in total.
monthly_price_history: getting 150K rows takes 1 s; it has 800K rows in total.
So how do I speed things up when the table is large?
As a starter: I don't see the point of the subquery at all. Presumably, your query could filter directly in the WHERE clause:
select id, symbolid_id, volume, close
from daily_price_history
where datetime >= 1598471533546 and symbolname in ('A', 'AA', ...)
Then, you want an index on (datetime, symbolname):
create index idx_daily_price_history
on daily_price_history(datetime, symbolname)
;
The first column of the index matches the predicate on datetime. It is not very likely, however, that the database will be able to use the index to filter symbolname against a large list of values.
An alternative would be to put the list of values in a table, say symbolnames.
create table symbolnames (
symbolname varchar(50) primary key
);
insert into symbolnames values ('A'), ('AA'), ...;
Then you can do:
select p.id, p.symbolid_id, p.volume, p.close
from daily_price_history p
inner join symbolnames s on s.symbolname = p.symbolname
where p.datetime >= 1598471533546
That should allow the database to use the above index. We can take it one step further and add the four columns of the SELECT clause to the index:
create index idx_daily_price_history_2
on daily_price_history(datetime, symbolname, id, symbolid_id, volume, close)
;
When you add INDEX(a,b), remove INDEX(a) as being no longer necessary.
Your dataset and query may be a case for using PARTITIONing.
PRIMARY KEY(symbolname, datetime)
PARTITION BY RANGE(datetime) ...
This will do "partition pruning": datetime >= 1598471533546. Then the PRIMARY KEY will do most of the rest of the work for symbolname in ('A', 'AA', ...).
Aim for about 50 partitions; the exact number does not matter. Too many partitions may hurt performance; too few won't provide effective pruning.
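An illustrative sketch combining those suggestions (column types are assumptions, and datetime appears to be epoch milliseconds, so the partition boundaries below are made-up values):
CREATE TABLE daily_price_history (
  id BIGINT NOT NULL AUTO_INCREMENT,
  symbolName VARCHAR(20) NOT NULL,
  symbolId_id INT,
  volume BIGINT,
  `close` DECIMAL(12,4),
  `datetime` BIGINT NOT NULL,
  PRIMARY KEY (symbolName, `datetime`, id),
  INDEX (id)                   -- keeps AUTO_INCREMENT happy
)
PARTITION BY RANGE (`datetime`) (
  PARTITION p202007 VALUES LESS THAN (1596240000000),
  PARTITION p202008 VALUES LESS THAN (1598918400000),
  PARTITION pmax    VALUES LESS THAN (MAXVALUE)
);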
Yes, get rid of the subquery as GMB suggests.
Meanwhile, it sounds like Django is getting in the way.
Some discussion of partitioning: http://mysql.rjweb.org/doc.php/partitionmaint

MySQL poor performance with a large table

I have a monitoring table which holds monitoring data for some 200+ servers.
Each server adds 3 records of data to the table every minute of every day.
I hold 6 months of data for historical reports for customers, and as you can imagine the table gets pretty large.
My issue currently is that running SELECT queries on this table takes an age.
I understand why; it's the sheer number of rows it works through while performing the SELECT. But I have tried to reduce the result set significantly by adding time lookups...
SELECT * FROM `host_monitoring_data`
WHERE parent_id = 47 AND timestamp > (NOW() - INTERVAL 5 MINUTE);
... but I'm still looking at a long wait before the data is returned to me.
I'm used to working with fairly small tables, and this is by far the biggest I've ever worked with, so I'm not familiar with how to overcome this sort of issue.
Any help at all is vastly appreciated.
My table structure is currently id, parent_id, timestamp, type, U, A, T.
U, A, T are Used/Available/Total, type tells me what kind of measurable we are working with, timestamp is exactly that, parent_id is the id of the parent host to which the data belongs, and id is an auto-incrementing id for the record in question.
When I'm doing lookups, I'm basically trying to get the most recent 20 rows where parent_id = x or whatever, so I just do...
SELECT u,a,t from host_monitoring_data
WHERE parent_id=X AND timestamp > (NOW() - INTERVAL 5 MINUTE)
ORDER BY timestamp DESC LIMIT 20
EDIT 1 - Including the results of EXPLAIN:
EXPLAIN SELECT * FROM `host_monitoring_data`
WHERE parent_id=36 AND timestamp > (NOW() - INTERVAL 5 MINUTE)
ORDER BY timestamp DESC LIMIT 20
id  select_type  table                 type  possible_keys  key   key_len  ref   rows     Extra
1   SIMPLE       host_monitoring_data  ALL   NULL           NULL  NULL     NULL  2865454  Using where; Using filesort
Based on your EXPLAIN report, I see it says "type: ALL" which means it's scanning all the rows (the whole table) for every query.
You need an index to help it scan fewer rows.
Your first condition for parent_id=X is an obvious choice. You should create an index starting with parent_id.
The other condition on timestamp >= ... is probably the best second choice. Your index should include timestamp as the second column.
You can create this index this way:
ALTER TABLE host_monitoring_data ADD INDEX (parent_id, timestamp);
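After adding it, re-running the EXPLAIN should show type: range instead of ALL (a hypothetical re-check; MySQL names an unnamed index after its first column):
EXPLAIN SELECT * FROM host_monitoring_data
WHERE parent_id = 36 AND timestamp > (NOW() - INTERVAL 5 MINUTE)
ORDER BY timestamp DESC LIMIT 20;
-- expect: type: range, key: parent_id, far fewer rows examined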
You might like my presentation How to Design Indexes, Really and a video of me presenting it: https://www.youtube.com/watch?v=ELR7-RdU9XU
P.S.: When you ask questions about query optimization, please run SHOW CREATE TABLE <tablename> and include its output in your question. This shows us your columns, data types, current indexes, and constraints. Don't make us guess! Help us help you!
Three good tips:
EXPLAIN (as others said) will tell you what you are doing and give hints for doing it better.
Avoid using "*"; instead, select only the fields you need.
Use PROCEDURE ANALYSE to learn the most suitable types for your columns (and change them if needed). Note that PROCEDURE ANALYSE was deprecated in MySQL 5.7 and removed in 8.0:
https://dev.mysql.com/doc/refman/5.7/en/procedure-analyse.html
Also, avoid ORDER BY whenever you can.