I am using a MySQL database.
I have a table daily_price_history of stock values with the following fields; it has 11 million+ rows:
id
symbolName
symbolId
volume
high
low
open
datetime
close
So for each stock symbolName there are various daily stock values, and the data is now more than 11 million rows.
The following SQL tries to get the last 100 days of daily data for a set of 1500 symbols:
SELECT `daily_price_history`.`id`,
`daily_price_history`.`symbolId_id`,
`daily_price_history`.`volume`,
`daily_price_history`.`close`
FROM `daily_price_history`
WHERE (`daily_price_history`.`id` IN
(SELECT U0.`id`
FROM `daily_price_history` U0
WHERE (U0.`symbolName` = `daily_price_history`.`symbolName`
AND U0.`datetime` >= 1598471533546))
AND `daily_price_history`.`symbolName` IN ('A', 'AA', ... /* 1500 symbol names */))
I have the table indexed on symbolName and also datetime
For getting ~130K rows of data (i.e. 1500 x 100 ≈ 150,000), it takes 20 secs.
I also have weekly_price_history and monthly_price_history tables; when I run a similar SQL query against them, it takes less time for roughly the same number of rows, because those tables have less data than the daily one.
weekly_price_history: getting 150K rows takes 3s. The total number of rows in it is 2.5 million.
monthly_price_history: getting 150K rows takes 1s. The total number of rows in it is 800K.
So how can I speed this up when the table is this large?
As a starter: I don't see the point of the subquery at all. Presumably, your query could filter directly in the where clause:
select id, symbolid_id, volume, close
from daily_price_history
where datetime >= 1598471533546 and symbolname in ('A', 'AA', ...)
Then, you want an index on (datetime, symbolname):
create index idx_daily_price_history
on daily_price_history(datetime, symbolname)
;
The first column of the index matches the predicate on datetime. It is not very likely, however, that the database will be able to use the index to filter symbolname against such a large list of values.
An alternative would be to put the list of values in a table, say symbolnames.
create table symbolnames (
symbolname varchar(50) primary key
);
insert into symbolnames values ('A'), ('AA'), ...;
Then you can do:
select p.id, p.symbolid_id, p.volume, p.close
from daily_price_history p
inner join symbolnames s on s.symbolname = p.symbolname
where p.datetime >= 1598471533546
That should allow the database to use the above index. We can take it one step further and add the 4 columns of the select clause to the index, so it covers the query:
create index idx_daily_price_history_2
on daily_price_history(datetime, symbolname, id, symbolid_id, volume, close)
;
When you add INDEX(a,b), remove INDEX(a) as being no longer necessary.
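For example, a minimal sketch, assuming the existing single-column index on datetime is simply named idx_datetime (check SHOW CREATE TABLE daily_price_history for the real name):

ALTER TABLE daily_price_history
    DROP INDEX idx_datetime;   -- redundant once an index starting with datetime exists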
Your dataset and query may be a case for using PARTITIONing.
PRIMARY KEY(symbolname, datetime)
PARTITION BY RANGE(datetime) ...
This will do "partition pruning": datetime >= 1598471533546. Then the PRIMARY KEY will do most of the rest of the work for symbolname in ('A', 'AA', ...).
Aim for about 50 partitions; the exact number does not matter. Too many partitions may hurt performance; too few won't provide effective pruning.
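A minimal sketch of that layout, assuming id is the current AUTO_INCREMENT primary key and datetime stores epoch milliseconds in a BIGINT; partition names and boundaries are illustrative only:

ALTER TABLE daily_price_history
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (symbolName, datetime, id),   -- id kept so the key stays unique
    ADD INDEX (id);                                -- keeps AUTO_INCREMENT valid

ALTER TABLE daily_price_history
    PARTITION BY RANGE (datetime) (
        PARTITION p2020h1 VALUES LESS THAN (1593561600000),  -- < 2020-07-01 UTC
        PARTITION p2020h2 VALUES LESS THAN (1609459200000),  -- < 2021-01-01 UTC
        PARTITION pmax    VALUES LESS THAN MAXVALUE
    );

Every unique key must include the partitioning column, which is why the primary key is widened before the PARTITION BY step.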
Yes, get rid of the subquery as GMB suggests.
Meanwhile, it sounds like Django is getting in the way.
Some discussion of partitioning: http://mysql.rjweb.org/doc.php/partitionmaint
Related
I have the following query that runs forever, and I am looking to see if there is any way I can optimise it. It runs on a table that has 1,406,480 rows of data in total; apart from the Filename and Ref_No columns, the ID and End_Date columns have both been indexed.
My Query:
INSERT INTO UniqueIDs
(
SELECT
T1.ID
FROM
master_table T1
LEFT JOIN
master_table T2
ON
(
T1.Ref_No = T2.Ref_No
AND
T1.End_Date = T2.End_Date
AND
T1.Filename = T2.Filename
AND
T1.ID > T2.ID
)
WHERE T2.ID IS NULL
AND
LENGTH(T1.Ref_No) BETWEEN 5 AND 10
)
;
Explain Results: (EXPLAIN output not shown)
The reason for not indexing the Ref_No is that this is a text column and therefore I get a BLOB/TEXT error when I try and index this column.
Would really appreciate if somebody could advise on how I can quicken this query.
Thanks
Thanks to Bill's advice regarding multi-column indexes, I have managed to make some headway. I first ran this code:
CREATE INDEX I_DELETE_DUPS ON master_table(id, End_Date);
I then added a new column to hold the length of Ref_No, but had to change the query Bill mentioned because my version of MySQL is 5.5. So I ran it in 3 steps:
ALTER TABLE master_table
ADD COLUMN Ref_No_length SMALLINT UNSIGNED;
UPDATE master_table SET Ref_No_length = LENGTH(Ref_No);
ALTER TABLE master_table ADD INDEX (Ref_No_length);
The last step was to change the WHERE clause of my insert query to filter on the new length column:
AND t1.Ref_No_length between 5 and 10;
I then ran this query, and within 15 mins I had 280k IDs inserted into my UniqueIDs table. I then changed my insert script to see if I could include more length values, as follows:
AND t1.Ref_No_length IN (5,6,7,8,9,10,13);
This was to also bring in values where the length equals 13. This query took a lot longer, 2 hr 50 mins to be precise, but the additional work of looking for all rows with a length of 13 gave me an extra 700k unique IDs.
I am still looking at ways to optimise the query with the IN clause, but this is already a big improvement over the original query, which kept running for 24 hours. So thank you so much, Bill.
For the JOIN, you should have a multi-column index on (Ref_No, End_Date, Filename).
You can create a prefix index on a TEXT column like this:
ALTER TABLE master_table ADD INDEX (Ref_No(10));
But that won't help you search based on the LENGTH(). Indexing only helps search by value indexed, not by functions on the column.
In MySQL 5.7 or later, you can create a virtual column like this, with an index on the values calculated for the virtual column:
ALTER TABLE master_table
ADD COLUMN Ref_No_length SMALLINT UNSIGNED AS (LENGTH(Ref_No)),
ADD INDEX (Ref_No_length);
Then MySQL will recognize that your condition in your query is the same as the expression for the virtual column, and it will automatically use the index (exception: in my experience, this doesn't work for expressions using JSON functions).
But this is no guarantee that the index will help. If most of the rows match the condition of the length being between 5 and 10, the optimizer will not bother with the index. It may be more work to use the index than to do a table-scan.
the ID and End_Date have both been indexed.
You have PRIMARY KEY(id) and redundantly INDEX(id)? A PK is a unique key.
"have both been indexed" -- INDEX(a), INDEX(b) is not the same as INDEX(a,b) -- they have different uses. Read about "composite" indexes.
That query smells a lot like "group-wise" max done in a very slow way. (Alas, that may have come from the online docs.)
I have compiled the fastest ways to do that task here: http://mysql.rjweb.org/doc.php/groupwise_max (There are multiple versions, based on MySQL version and what issues your code can/cannot tolerate.)
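For reference, a minimal sketch of the GROUP BY form of that task (assuming Ref_No, End_Date and Filename are NOT NULL, it keeps the same smallest-ID-per-group rows as the anti-join above):

INSERT INTO UniqueIDs
SELECT MIN(T1.ID)                         -- smallest ID per (Ref_No, End_Date, Filename) group
FROM master_table AS T1
WHERE LENGTH(T1.Ref_No) BETWEEN 5 AND 10
GROUP BY T1.Ref_No, T1.End_Date, T1.Filename;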
Please provide SHOW CREATE TABLE. One important question: Is id the PRIMARY KEY?
This composite index may be useful:
(Filename, End_Date, Ref_No, -- first, in any order
ID) -- last
This, as others have noted, is unlikely to be helped by any index, hence T1 will need a full-table-scan:
AND LENGTH(T1.Ref_No) BETWEEN 5 AND 10
If Ref_No cannot be bigger than 191 characters, change it to a VARCHAR so that it can be used in an index. Oh, did I ask for SHOW CREATE TABLE? If you can't make it VARCHAR, then my recommended composite index is
INDEX(Filename, End_Date, ID)
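As a rough sketch of that combination (the VARCHAR length and index name are assumptions, as is Filename being an indexable type; changing the column rewrites the table, so test on a copy first):

ALTER TABLE master_table
    MODIFY Ref_No VARCHAR(191);          -- assumes no value is longer than 191 characters
ALTER TABLE master_table
    ADD INDEX idx_dedupe (Filename, End_Date, Ref_No, ID);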
I need to fetch the last 24 hrs of data frequently, and this query runs often.
Since it scans many rows, running it frequently affects database performance.
MySQL's execution strategy picks the index on created_at, which returns approximately 100,000 rows; these rows are then scanned one by one to filter customer_id = 10, and my final result has 20,000 rows.
How can I optimize this query?
explain SELECT *
FROM `order`
WHERE customer_id = 10
and `created_at` >= NOW() - INTERVAL 1 DAY;
id : 1
select_type : SIMPLE
table : order
partitions : NULL
type : range
possible_keys : idx_customer_id, idx_order_created_at
key : idx_order_created_at
key_len : 5
ref : NULL
rows : 103357
filtered : 1.22
Extra : Using index condition; Using where
The first optimization I would do is on the access to the table:
create index ix1 on `order` (customer_id, created_at);
Then, if the query is still slow I would try appending the columns you are selecting to the index. If, for example, you are selecting the columns order_id, amount, and status:
create index ix1 on `order` (customer_id, created_at,
order_id, amount, status);
This second strategy could be beneficial, but you'll need to test it to find out what performance improvement it produces in your particular case.
The big improvement of this second strategy is that the query walks the secondary index only, avoiding the lookup back into the table's primary clustered index (which can be time-consuming).
Instead of two single-column indexes on customer_id and created_at, create a single composite index on (customer_id, created_at). This way the index engine can use BOTH parts of the where clause instead of just hoping to use one. It jumps right to the customer ID, then directly to the date desired, and then returns the results. It SHOULD be very fast.
Additional Follow-up.
I hear your comment about having multiple indexes, but consider folding those columns into the main composite index instead, such as
( customer_id, created_at, updated_at, completion_time )
Then your queries could always give the index some help in the where clause. For example (and I don't know your specific data): a record is created at some given point, and the updated and completion times will always be AFTER that. How long does it take (worst-case scenario) from creation to completion... 2 days, 10 days, 90 days?
where
      customer_id = ?
  AND created_at >= NOW() - INTERVAL 10 DAY
  AND updated_at >= NOW() - INTERVAL 1 DAY
Again, just an example, but if a person has thousands of orders and a relatively quick turn-around time, you could jump to the most recent ones and then find those updated within the time period. It is just an option: a single index vs 3, 4 or more indexes.
It seems you are dealing with a very fast-growing table; I would consider moving this frequent query to a replica or to a cold/summary table.
One more point: did you consider partitioning by customer_id? I do not quite understand the business logic behind querying customer_id = 10, but if it is a multi-tenancy application, try partitioning.
For this query:
SELECT o.*
FROM `order` o
WHERE o.customer_id = 10 AND
created_at >= NOW() - INTERVAL 1 DAY;
My first inclination would be a composite index on (customer_id, created_at) -- as others have suggested.
But, you appear to have a lot of data and many inserts per day. That suggests partitioning plus an index. The appropriate partitioning would be on created_at, probably on a daily basis, along with an index on customer_id.
A typical query would access the two most recent partitions. Because your queries are focused on recent data, this also reduces the memory occupied by the index, which might be an overall benefit.
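A rough sketch of that setup, assuming the current primary key is (order_id) and created_at is a DATETIME column (for a TIMESTAMP column you would use RANGE (UNIX_TIMESTAMP(created_at)) instead); partition names and boundaries are illustrative, and a maintenance job would add new daily partitions and drop old ones:

ALTER TABLE `order`
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (order_id, created_at);   -- the partition column must be part of every unique key

ALTER TABLE `order`
    PARTITION BY RANGE COLUMNS (created_at) (
        PARTITION p20200825 VALUES LESS THAN ('2020-08-26'),
        PARTITION p20200826 VALUES LESS THAN ('2020-08-27'),
        PARTITION pmax      VALUES LESS THAN (MAXVALUE)
    );

ALTER TABLE `order`
    ADD INDEX idx_customer_created (customer_id, created_at);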
This technique should be better than all the other answers, though perhaps by only a small amount:
Instead of orders being indexed thus:
PRIMARY KEY(order_id) -- AUTO_INCREMENT
INDEX(customer_id, ...) -- created_at, and possibly others
do this to "cluster" the rows together:
PRIMARY KEY(customer_id, order_id)
INDEX (order_id) -- to keep AUTO_INCREMENT happy
Then you can optionally have more indexes starting with customer_id as needed. Or not.
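A minimal sketch of that change, assuming order_id is the current AUTO_INCREMENT primary key (this rewrites the whole table, so plan for a maintenance window):

ALTER TABLE `order`
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (customer_id, order_id),
    ADD INDEX (order_id);                 -- keeps AUTO_INCREMENT valid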
Another issue -- What will you do with 20K rows? That is a lot to feed to a client, especially of the human type. If you then munch on it, can't you make a more complex query that does more work, and returns fewer rows? That will probably be faster.
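For illustration only (the amount column is an assumption): if the client only needs daily totals, let the database aggregate instead of shipping 20K rows:

SELECT DATE(created_at) AS order_day,
       COUNT(*)         AS orders,
       SUM(amount)      AS total_amount
FROM `order`
WHERE customer_id = 10
  AND created_at >= NOW() - INTERVAL 1 DAY
GROUP BY DATE(created_at);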
Suppose I have a MySQL table with an indexed field called balance. However, 95% of the rows have balance = 0. So if I were to run:
Select * from mytable where balance > 0.02
the query would take quite a while if the table had 1 million+ rows, as the B-tree index has very few distinct values for balance.
In this situation, without changing the data, how would one optimize the SQL query?
First, your query is likely to be returning a lot of rows. That is going to take time.
If you only need a few, you can add limit:
Select *
from mytable
where balance > 0.02
limit 100;
Second, if you have any particularly large columns, then those could dominate the time for returning rows. If this is an issue, then select only the columns you really need.
Third, an index might help. If very few rows satisfy the where clause then an index on balance should speed the query. However, if a lot of rows match the filter condition, then you are returning a lot of data -- and that can take time.
Also, this assumes that something called mytable is really a table. If it is a view, then all bets are off. You need to optimize the view and not the query.
This is a radical approach, but if this query is critical you could add partitioning to the table on the balance field:
EDIT: For some reason MySQL partitioning is restricted to integer values; maybe this workaround will work:
ALTER TABLE mytable
PARTITION BY RANGE( CEILING(balance) ) (
PARTITION p0 VALUES LESS THAN (1),
PARTITION p1 VALUES LESS THAN MAXVALUE
);
NOTE: This approach will only work if balance is declared as a Decimal type, not a Float type.
I'd like to ask a question about how to improve performance in a big MySQL table using innodb engine:
There's currently a table in my database with around 200 million rows. This table periodically stores the data collected by different sensors. The structure of the table is as follows:
CREATE TABLE sns_value (
value_id int(11) NOT NULL AUTO_INCREMENT,
sensor_id int(11) NOT NULL,
type_id int(11) NOT NULL,
date timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
value int(11) NOT NULL,
PRIMARY KEY (value_id),
KEY idx_sensor_id (sensor_id),
KEY idx_date (date),
KEY idx_type_id (type_id) );
At first, I thought of partitioning the table in months, but due to the steady addition of new sensors it would reach the current size in about a month.
Another solution that I came up with was partitioning the table by sensors. However, due to the limit of 1024 partitions of MySQL that wasn't an option.
I believe that the right solution would be using a table with the same structure for each of the sensors:
sns_value_XXXXX
This way there would be more than 1,000 tables with an estimated size of 30 million rows per year. These tables could, at the same time, be partitioned by month for faster access to the data.
What problems would result from this solution? Is there a more normalized solution?
Editing with additional information
I consider the table to be big in relation to my server:
Cloud 2xCPU and 8GB Memory
LAMP (CentOS 6.5 and MySQL 5.1.73)
Each sensor may have more than one variable types (CO, CO2, etc.).
I mainly have two slow queries:
1) Daily summary for each sensor and type (avg, max, min):
SELECT round(avg(value)) as mean, min(value) as min, max(value) as max, type_id
FROM sns_value
WHERE sensor_id=1 AND date BETWEEN '2014-10-29 00:00:00' AND '2014-10-29 12:00:00'
GROUP BY type_id limit 2000;
This takes more than 5 min.
2) Vertical to Horizontal view and export:
SELECT sns_value.date AS date,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 101)))))) AS one,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 141)))))) AS two,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 151)))))) AS three
FROM sns_value
WHERE sns_value.sensor_id=1 AND sns_value.date BETWEEN '2014-10-28 12:28:29' AND '2014-10-29 12:28:29'
GROUP BY sns_value.sensor_id,sns_value.date LIMIT 4500;
This also takes more than 5 min.
Other considerations
Timestamps may be repeated due to the characteristics of the inserts.
Periodic inserts must coexist with selects.
No updates nor deletes are performed on the table.
Assumptions made for the "one table for each sensor" approach
Tables for each sensor would be much smaller so access would be faster.
Selects will be performed only on one table for each sensor.
Selects mixing data from different sensors are not time-critical.
Update 02/02/2015
We have created a new table for each year of data, which we have also partitioned on a daily basis. Each table has around 250 million rows with 365 partitions. The new index used is the one Ollie suggested (sensor_id, date, type_id, value), but the query still takes between 30 seconds and 2 minutes. We do not use the first query (daily summary), just the second (vertical to horizontal view).
In order to be able to partition the table, the primary index had to be removed.
Are we missing something? Is there a way to improve the performance?
Many thanks!
Edited based on changes to the question
One table per sensor is, with respect, a very bad idea indeed. There are several reasons for that:
MySQL servers on ordinary operating systems have a hard time with thousands of tables. Most OSs can't handle that many simultaneous file accesses at once.
You'll have to create tables each time you add (or delete) sensors.
Queries that involve data from multiple sensors will be slow and convoluted.
My previous version of this answer suggested range partitioning by timestamp. But that won't work with your value_id primary key. However, with the queries you've shown and proper indexing of your table, partitioning probably won't be necessary.
(Avoid the column name date if you can: it's a reserved word and you'll have lots of trouble writing queries. Instead I suggest you use ts, meaning timestamp.)
Beware: int(11) values aren't big enough for your value_id column. You're going to run out of ids. Use bigint(20) for that column.
You've mentioned two queries. Both these queries can be made quite efficient with appropriate compound indexes, even if you keep all your values in a single table. Here's the first one.
SELECT round(avg(value)) as mean, min(value) as min, max(value) as max,
type_id
FROM sns_value
WHERE sensor_id=1
AND date BETWEEN '2014-10-29 00:00:00' AND '2014-10-29 12:00:00'
GROUP BY type_id limit 2000;
For this query, you're first looking up sensor_id using a constant, then you're looking up a range of date values, then you're aggregating by type_id. Finally you're extracting the value column. Therefore, a so-called compound covering index on (sensor_id, date, type_id, value) will be able to satisfy your query directly with an index scan. This should be very fast for you--certainly faster than 5 minutes even with a large table.
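As a sketch against the existing table (keeping the current date column name; the index name simply mirrors the one used in the revised table definition below):

ALTER TABLE sns_value
    ADD INDEX query_opt (sensor_id, `date`, type_id, value);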
In your second query, a similar indexing strategy will work.
SELECT sns_value.date AS date,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 101)))))) AS one,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 141)))))) AS two,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 151)))))) AS three
FROM sns_value
WHERE sns_value.sensor_id=1
AND sns_value.date BETWEEN '2014-10-28 12:28:29' AND '2014-10-29 12:28:29'
GROUP BY sns_value.sensor_id,sns_value.date
LIMIT 4500;
Again, you start with a constant value of sensor_id and then use a date range. You then extract both type_id and value. That means the same four column index I mentioned should work for you.
CREATE TABLE sns_value (
value_id bigint(20) NOT NULL AUTO_INCREMENT,
sensor_id int(11) NOT NULL,
type_id int(11) NOT NULL,
ts timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
value int(11) NOT NULL,
PRIMARY KEY (value_id),
INDEX query_opt (sensor_id, ts, type_id, value)
);
Creating a separate table for a range of sensors would be an idea.
Do not use auto_increment for the primary key if you don't have to. The DB engine usually clusters the data by its primary key.
Use a composite key instead; depending on your use case, the sequence of the columns may differ.
EDIT: I also added the type into the PK. Considering the queries, I would do it like this. The choice of field names is intentional: they should be descriptive, and always watch out for reserved words.
CREATE TABLE snsXX_readings (
sensor_id int(11) NOT NULL,
reading int(11) NOT NULL,
reading_time timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
type_id int(11) NOT NULL,
PRIMARY KEY (reading_time, sensor_id, type_id),
KEY idx_reading_time (reading_time),
KEY idx_type_id (type_id)
);
Also, consider summarizing the readings or grouping them into a single field.
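For example, a sketch of a daily roll-up against the original sns_value table (table and column names here are illustrative, and the refresh schedule is up to you):

CREATE TABLE sns_daily_summary (
    sensor_id    INT NOT NULL,
    type_id      INT NOT NULL,
    reading_date DATE NOT NULL,
    min_value    INT NOT NULL,
    max_value    INT NOT NULL,
    avg_value    INT NOT NULL,
    PRIMARY KEY (sensor_id, type_id, reading_date)
);

-- Run once a day (e.g. just after midnight) to summarize the previous day.
INSERT INTO sns_daily_summary
SELECT sensor_id, type_id, DATE(`date`),
       MIN(value), MAX(value), ROUND(AVG(value))
FROM sns_value
WHERE `date` >= CURRENT_DATE - INTERVAL 1 DAY
  AND `date` <  CURRENT_DATE
GROUP BY sensor_id, type_id, DATE(`date`);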
You can try getting randomized summary data.
I have a similar table: MyISAM engine (smallest table size), 10 million records, and no index on the table because it was useless (tested). Getting a range over all the data gives a result in about 10 s with this query:
SELECT * FROM (
SELECT sensor_id, value, date
FROM sns_value l
WHERE l.sensor_id= 123 AND
(l.date BETWEEN '2013-10-29 12:28:29' AND '2015-10-29 12:28:29')
ORDER BY RAND() LIMIT 2000
) as tmp
ORDER BY tmp.date;
In the first step this query selects the rows between the dates and picks 2k of them in random order; in the second step it sorts that data. Each run of the query returns 2k results drawn from different data.
I have a table 'tbl' something like this:
ID bigint(20) - primary key, autoincrement
field1
field2
field3
That table has 600k+ rows.
Query:
SELECT * from tbl ORDER BY ID LIMIT 600000, 1 takes 1.68 seconds
Query:
SELECT ID, field1 from tbl ORDER BY ID LIMIT 600000, 1 takes 1.69 seconds
Query:
SELECT ID from tbl ORDER BY ID LIMIT 600000, 1 takes 0.16 seconds
Query:
SELECT * from tbl WHERE ID = xxx takes 0.005 seconds
Those queries are tested in phpmyadmin.
The result is that queries 3 and 4 together return the necessary data.
Query 1 does the same job but is much slower...
This doesn't look right to me.
Could anyone give any advice?
P.S. I'm sorry for the formatting; I'm new to this site.
New test:
Q5 : CREATE TEMPORARY TABLE tmptable AS (SELECT ID FROM tbl WHERE ID LIMIT 600030, 30);
SELECT * FROM tbl WHERE ID IN (SELECT ID FROM tmptable); takes 0.38 sec
I still don't understand how it's possible. I recreated all indexes.. what else can I do with that table? Delete and refill it manually? :)
Query 1 looks at the table's primary key index, finds the correct 600,000 ids and their corresponding locations within the table, then goes to the table and fetches everything from those 600k locations.
Query 2 looks at the table's primary key index, finds the correct 600k ids and their corresponding locations within the table, then goes to the table and fetches whichever subset of fields are asked for from those 600k rows.
Query 3 looks at the table's primary key index, finds the correct 600k ids, and returns them. It doesn't need to look at the table at all.
Query 4 looks at the table's primary key index, finds the single entry requested, goes to the table, reads that single entry, and returns it.
Time-wise, let's build backwards:
(Q4) The table index allows lookup of a key (id) in O(log n) time, meaning every time the table doubles in size it only takes one extra step to find the key in the index*. If you have 1 million rows, then, it would only take ~20 steps to find it. A billion rows? 30 steps. The index entry includes data on where in the table to go to find the data for that row, so MySQL jumps to that spot in the table and reads the row. The time reported for this is almost entirely overhead.
(Q3) As I mentioned, the table index is very fast; this query finds the first entry and just traverses the tree until it has the requested number of rows. I'm sure I could calculate the precise number of steps it would take, but as a maximum we'll say 20 steps x 600k rows = 12M steps; since it's traversing a tree it would likely be more like 1M steps, but the precise number is largely irrelevant. The most important thing to realize here is that once MySQL has walked the index to pull the ids it needs, it has everything you asked for. There's no need to go look at the table. The time reported for this one is essentially the time it takes MySQL to walk the index.
(Q2) This begins with the same tree-walking as discussed for query 3, but while pulling the IDs it needs, MySQL also pulls their location within the table files. It then has to go to the table file (probably already cached/mmapped in memory), and for every entry it pulled, seek to the proper place in the table and get the fields requested out of those rows. The time reported for this query is the time it takes to walk the index (as in Q3) plus the time to visit every row specified in the index.
(Q1) This is identical to Q2 when all fields are specified. As the time is essentially identical to Q2, we can see that it doesn't really take measurably more time to pull more fields out of the database, any time there is dwarfed by crawling the index and seeking to the rows.
*: Most databases use an indexing data structure (B-trees for MySQL) that has a log base much higher than 2, meaning that instead of an extra step every time the table doubles, it's more like an extra step every time the table size goes up by a factor of hundreds to thousands. This means that instead of the 20-30 steps I stated in the example, it's more like 2-5.
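Putting those observations together, a common way to exploit them is the "deferred join" pattern: locate the target ID using only the primary key index, then fetch the full row by that ID. A minimal sketch against the table above:

SELECT t.*
FROM tbl AS t
JOIN (SELECT ID FROM tbl ORDER BY ID LIMIT 600000, 1) AS picked
  ON picked.ID = t.ID;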