My columns are:
job_name, job_date, job_details1, job_details2 ...
There are NO Primary key columns
In my table, I expect to have 15-20 distinct jobs. Each job has exactly 2 months of data, so about 60 distinct job_date values per job_name, and within each date there would be about 100,000 records.
Query will always be a SELECT for ONE particular job_name and a range of job_date (followed by several group bys, but that's irrelevant for now). I don't want the query to go through irrelevant job_dates or job_names when queried for a particular job_name and some range of job_date.
So what sort of optimizations can I do to make my select query faster? I'm using MySQL 5.6.17, which has a partitioning limit of 8,192 partitions.
Something like partitioning per job_name and subpartitions for job_date within that? This is the first time I'm dealing with such large data so I'm not sure about these optimizations. Any help or tips will be appreciated.
Thanks
"Query will always be a SELECT for ONE particular job_name and a range of job_date (followed by several group bys, but that's irrelevant for now)." -- Based on that, you need
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
PRIMARY KEY(job_name, job_date, id),
INDEX(id)
ENGINE=InnoDB
Notes:
The combination of InnoDB with PRIMARY KEY(job_name, job_date, ...) clusters the data so that you scan exactly the rows you need, and nothing more.
No partitioning; it won't help.
I am adding an AUTO_INCREMENT and adding it to the PK because a PK must be unique. (And the PK is needed for the clustering.)
INDEX(id) (or some key starting with id) is needed for AUTO_INCREMENT.
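Putting those pieces together, here is a minimal sketch of the suggested table; the table name and the types of the job_details columns are assumptions:
CREATE TABLE job_log (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    job_name VARCHAR(64) NOT NULL,
    job_date DATE NOT NULL,
    job_details1 VARCHAR(255),
    job_details2 VARCHAR(255),
    PRIMARY KEY(job_name, job_date, id),   -- clusters rows by job, then by date
    INDEX(id)                              -- satisfies AUTO_INCREMENT's need for a key starting with id
) ENGINE=InnoDB;

-- A typical query then scans only one job's rows for the requested date range:
SELECT job_date, job_details1
FROM job_log
WHERE job_name = 'some_job'
  AND job_date BETWEEN '2015-01-01' AND '2015-02-28';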
"... followed by group bys ..." That sounds like you are summarizing data for reports? If my suggestions above are not fast enough, let's talk about Summary Tables. You might get another factor of 10 speedup.
I am currently stuck trying to improve the performance of a MySQL query. It takes 30 seconds to execute, and we don't want users waiting that long for the backend response.
My Query:
select count(case_id), sum(net_value), sum(total_time_spent), events from event_log group by events order by count(case_id) desc
Indexes:
Created a composite index on events, case_id, net_value, total_time_spent.
Time taken: 30 seconds
Number of records in the event_log table: 80 million
Table structure:
Create table event_log(
    case_id varchar(100) primary key,
    events varchar(200),
    creation_date timestamp,
    total_time_spent bigint
)
Composite Unique key: case_id, events, creation_date.
Infrastructure:
AWS RDS instance type: r5d.2xlarge (8 CPUs, 64 GB RAM)
Tried partitioning the data on the basis of the key case_id, but could see no improvement.
Tried upgrading the server size, but no improvement there either.
If you can give us some hints, or something that we can try that would be really helpful.
Build and maintain a Summary Table of events by day (or week) and subtotals of the counts and sums you need.
Then run the query against the summary table, summing up the sums, etc.
That may run 10 times as fast.
If practical, normalize case_id and/or events; that may shrink the table size by a significant amount. Consider using a smaller datatype for the total_time_spent; BIGINT consumes 8 bytes.
With a summary table, few, if any, indexes are needed on the main table; it is the summary table that is likely to carry indexes. I would try to have its PRIMARY KEY start with events.
Be aware that COUNT(x) checks x for being NOT NULL. If this is not necessary, then simply do COUNT(*).
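As an illustration only, here is one way such a summary table might look; the table name, the column types (including DECIMAL for net_value), and the daily refresh window are all assumptions:
CREATE TABLE event_log_daily (
    event_date DATE NOT NULL,
    events VARCHAR(200) NOT NULL,
    case_count INT UNSIGNED NOT NULL,
    net_value_sum DECIMAL(18,2) NOT NULL,
    time_spent_sum BIGINT NOT NULL,
    PRIMARY KEY (events, event_date)
);

-- Refresh once per day with yesterday's rows:
INSERT INTO event_log_daily
SELECT DATE(creation_date), events, COUNT(*), SUM(net_value), SUM(total_time_spent)
FROM event_log
WHERE creation_date >= CURDATE() - INTERVAL 1 DAY
  AND creation_date <  CURDATE()
GROUP BY DATE(creation_date), events;

-- The report then reads the much smaller table and re-aggregates:
SELECT events, SUM(case_count), SUM(net_value_sum), SUM(time_spent_sum)
FROM event_log_daily
GROUP BY events
ORDER BY SUM(case_count) DESC;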
Hello
I want to configure a partitioning scheme of monthly partitions with day-by-day subpartitions.
If the total number of subpartitions exceeds 64, the table is not created and I get this error:
'(errno: 168 "Unknown (generic) error from engine")'
Creating fewer than 64 succeeds.
I know that the maximum number of partitions (including subpartitions) that can be created is 8,192. Is there anything I missed?
Below is the log table.
create table detection_log
(
id bigint auto_increment,
detected_time datetime default '1970-01-01' not null,
malware_title varchar(255) null,
malware_category varchar(30) null,
user_name varchar(30) null,
department_path varchar(255) null,
PRIMARY KEY (detected_time, id),
INDEX `detection_log_id_uindex` (id),
INDEX `detection_log_malware_title_index` (malware_title),
INDEX `detection_log_malware_category_index` (malware_category),
INDEX `detection_log_user_name_index` (user_name),
INDEX `detection_log_department_path_index` (department_path)
);
SUBPARTITIONs provide no benefit that I know of.
HASH partitioning either provides no benefit or hurts performance.
So... Explain what you hoped to gain by partitioning; then we can discuss whether any type of partitioning is worth doing. Also, provide the likely SELECTs so we can discuss the optimal INDEXes. If you need a "two-dimensional" index, that might indicate a need for partitioning (but still not subpartitioning).
More
I see PRIMARY KEY(detected_time,id). This provides a very fast way to do
SELECT ...
WHERE detected_time BETWEEN ... AND ...
ORDER BY detected_time, id
In fact, it will probably be faster than if you also partition the table. (As a general rule it is useless to partition on the first part of the PK.)
If you need to do
SELECT ...
WHERE user_name = '...'
AND detected_time BETWEEN ... AND ...
ORDER BY detected_time, id
Then this is optimal:
INDEX(user_name, detected_time, id)
Again, probably faster than any form of partitioning on any column(s).
And
A "point query" (WHERE key = 123) takes a few milliseconds more in a 1-billion-row table compared to a 1000-row table. Rarely is the difference important. The depth of the BTree (perhaps 5 levels vs 2 levels) is the main difference. If you PARTITION the table, you are removing perhaps 1 or 2 levels of the BTree, but replacing them with code to "prune" down to the desired partition. I claim that this tradeoff does not provide a performance benefit.
A "range query" is very nearly the same speed regardless of the table size. This is because the structure is actually a B+Tree, so it is very efficient to fetch the 'next' row.
Hence, the main goal in optimizing queries on a huge table is to take advantage of the characteristics of the B+Tree.
Pagination
SELECT log.detected_time, log.user_name, log.department_path,
log.malware_category, log.malware_title
FROM detection_log as log
JOIN
(
SELECT id
FROM detection_log
WHERE user_name = 'param'
ORDER BY detected_time DESC
LIMIT 25 OFFSET 1000
) as temp ON temp.id = log.id;
The good part: Finding ids, then fetching the data.
The slow part: Using OFFSET.
Have this composite index: INDEX(user_name, detected_time, id) in that order. Make another index for when you use department_path.
Instead of OFFSET, "remember where you left off". A blog specifically about that: http://mysql.rjweb.org/doc.php/pagination
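For illustration, a sketch of the "remember where you left off" approach; @last_time and @last_id are placeholders taken from the last row of the previous page:
SELECT detected_time, user_name, department_path, malware_category, malware_title
FROM detection_log
WHERE user_name = 'param'
  AND (detected_time < @last_time
       OR (detected_time = @last_time AND id < @last_id))
ORDER BY detected_time DESC, id DESC
LIMIT 25;
With INDEX(user_name, detected_time, id), this typically reads only about the 25 rows needed, no matter how deep into the result the user has paged.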
Purging
Deleting after a year is an excellent use of PARTITIONing. Use PARTITION BY RANGE(TO_DAYS(detected_time)) and have either ~55 weekly or 15 monthly partitions. See HTTP://mysql.rjweb.org/doc.php/partitionmaint for details. DROP PARTITION is immensely faster than DELETE. (This partitioning will not speed up SELECT.)
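A hedged sketch of that purge-oriented layout, using monthly partitions; the partition names and dates are placeholders, and further partitions follow the same pattern:
ALTER TABLE detection_log
PARTITION BY RANGE (TO_DAYS(detected_time)) (
    PARTITION p2023_01 VALUES LESS THAN (TO_DAYS('2023-02-01')),
    PARTITION p2023_02 VALUES LESS THAN (TO_DAYS('2023-03-01')),
    PARTITION p2023_03 VALUES LESS THAN (TO_DAYS('2023-04-01')),
    PARTITION pmax     VALUES LESS THAN MAXVALUE
);

-- Purging a month is then a near-instant metadata operation rather than a huge DELETE:
ALTER TABLE detection_log DROP PARTITION p2023_01;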
I need to fetch the last 24 hours of data, and this query runs frequently.
Since it scans many rows, running it frequently affects database performance.
MySQL's execution strategy picks the index on created_at, which returns approximately 100,000 rows; these rows are then scanned one by one to filter customer_id = 10, and my final result has 20,000 rows.
How can I optimize this query?
explain SELECT *
FROM `order`
WHERE customer_id = 10
and `created_at` >= NOW() - INTERVAL 1 DAY;
id : 1
select_type : SIMPLE
table : order
partitions : NULL
type : range
possible_keys : idx_customer_id, idx_order_created_at
key : idx_order_created_at
key_len : 5
ref : NULL
rows : 103357
filtered : 1.22
Extra : Using index condition; Using where
The first optimization I would do is on the access to the table:
create index ix1 on `order` (customer_id, created_at);
Then, if the query is still slow I would try appending the columns you are selecting to the index. If, for example, you are selecting the columns order_id, amount, and status:
create index ix1 on `order` (customer_id, created_at,
order_id, amount, status);
This second strategy could be beneficial, but you'll need to test it to find out what performance improvement it produces in your particular case.
The big improvement of this second strategy is that it walks the secondary index only, avoiding the lookup back to the primary clustered index of the table (which can be time-consuming).
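One way to verify the covering behaviour, assuming the wider index above has been created and the selected columns match it: EXPLAIN should report "Using index" in the Extra column when the query is served from the index alone.
EXPLAIN SELECT order_id, amount, status
FROM `order`
WHERE customer_id = 10
  AND created_at >= NOW() - INTERVAL 1 DAY;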
Instead of two single-column indexes on customer_id and created_at, create a single composite index on (customer_id, created_at). This way the index engine can use BOTH parts of the WHERE clause instead of just one of them: jump right to the customer ID, then jump directly to the date desired, then return the results. It SHOULD be very fast.
Additional Follow-up.
I hear your comment about having multiple indexes, but consider adding those columns into the main composite index, just after the first two, such as
(customer_id, created_at, updated_at, completion_time)
Then your queries could always include some help for the index in the WHERE clause. For example (I don't know your specific data): a record is created at some given point, and the updated and completion times will always be AFTER that. How long does it take (worst-case scenario) from creation to completion... 2 days, 10 days, 90 days?
WHERE
    customer_id = ?
    AND created_at >= NOW() - INTERVAL 10 DAY
    AND updated_at >= NOW() - INTERVAL 1 DAY
Again, just an example, but if a person has thousands of orders and a relatively quick turn-around time, you could jump to the most recent ones and then find those updated within the time period. Again, just an option for a single index vs. 3, 4 or more indexes.
It seems you are dealing with a very quickly growing table; I would consider moving this frequent query to a cold table or a replica.
One more point: did you consider partitioning by customer_id? I don't quite understand the business logic behind querying customer_id = 10, but if it's a multi-tenancy application, try partitioning.
For this query:
SELECT o.*
FROM `order` o
WHERE o.customer_id = 10 AND
created_at >= NOW() - INTERVAL 1 DAY;
My first inclination would be a composite index on (customer_id, created_at) -- as others have suggested.
But, you appear to have a lot of data and many inserts per day. That suggests partitioning plus an index. The appropriate partition would be on created_at, probably on a daily basis, along with an index on customer_id.
A typical query would access the two most recent partitions. Because your queries are focused on recent data, this also reduces the memory occupied by the index, which might be an overall benefit.
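A rough sketch of that layout, hedged: the partition names and dates are placeholders, and note that MySQL requires the partitioning column to appear in every unique key, so a plain PRIMARY KEY(order_id) would first have to become, say, PRIMARY KEY(order_id, created_at). The existing idx_customer_id index would then serve the per-partition lookups.
ALTER TABLE `order`
PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p20230101 VALUES LESS THAN (TO_DAYS('2023-01-02')),
    PARTITION p20230102 VALUES LESS THAN (TO_DAYS('2023-01-03')),
    PARTITION pmax      VALUES LESS THAN MAXVALUE
);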
This technique should be better than all the other answers, though perhaps by only a small amount:
Instead of orders being indexed thus:
PRIMARY KEY(order_id) -- AUTO_INCREMENT
INDEX(customer_id, ...) -- created_at, and possibly others
do this to "cluster" the rows together:
PRIMARY KEY(customer_id, order_id)
INDEX (order_id) -- to keep AUTO_INCREMENT happy
Then you can optionally have more indexes starting with customer_id as needed. Or not.
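A minimal sketch of that clustered layout; the column list beyond the keys is assumed:
CREATE TABLE `order` (
    order_id    INT UNSIGNED NOT NULL AUTO_INCREMENT,
    customer_id INT UNSIGNED NOT NULL,
    created_at  DATETIME NOT NULL,
    -- remaining order columns go here
    PRIMARY KEY (customer_id, order_id),   -- clusters each customer's rows together
    INDEX (order_id),                      -- keeps AUTO_INCREMENT happy
    INDEX (customer_id, created_at)        -- optional: serves the date-range filter directly
) ENGINE=InnoDB;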
Another issue -- What will you do with 20K rows? That is a lot to feed to a client, especially of the human type. If you then munch on it, can't you make a more complex query that does more work, and returns fewer rows? That will probably be faster.
I am creating a table which will store around 100 million rows in MySQL 5.6 using the InnoDB storage engine. This table will have a foreign key that links to another table with around 5 million rows.
Current Table Structure:
`pid`: [Foreign key from another table]
`price`: [decimal(9,2)]
`date`: [date field]
and every pid should have only one record per date
What is the best way to create indexes on this table?
Option #1: Create Primary index on two fields pid and date
Option #2: Add another column id with AUTO_INCREMENT and primary index and create a unique index on column pid and date
Or any other option?
The only SELECT query I will be using on this table is:
SELECT pid,price,date FROM table WHERE pid = 123
Based on what you said (100M; the only query is...; InnoDB; etc):
PRIMARY KEY(pid, date);
and no other indexes
Some notes:
Since it is InnoDB, all the rest of the fields are "clustered" with the PK, so a lookup by pid acts as if price were part of the PK. Also, WHERE pid=123 ORDER BY date would be very efficient.
No need for INDEX(pid, date, price)
Adding an AUTO_INCREMENT gains nothing (except a hint of ordering). If you needed ordering, then an index starting with date might be best.
Extra indexes slow down inserts. Especially UNIQUE ones.
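A sketch under those assumptions; the table name, the parent-table name, and the foreign-key details are placeholders:
CREATE TABLE price_history (
    pid    INT UNSIGNED NOT NULL,
    price  DECIMAL(9,2) NOT NULL,
    `date` DATE NOT NULL,
    PRIMARY KEY (pid, `date`),   -- also enforces "one record per pid per date"
    CONSTRAINT fk_pid FOREIGN KEY (pid) REFERENCES parent_table (id)
) ENGINE=InnoDB;

-- The stated query becomes a single range scan of the clustered index:
SELECT pid, price, `date` FROM price_history WHERE pid = 123;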
Either method is fine. I prefer having synthetic primary keys (that is, the auto-incremented version with the additional unique index). I find that this is useful for several reasons:
You can have a foreign key relationship to the table.
You have an indicator of the order of insertion.
You can change requirements, so if some pids allow two values per day, or only one per week, then the table can support them.
That said, there is additional overhead for such a column. This overhead adds space and a small amount of time when you are accessing the data. You have a pretty large table, so you might want to avoid this additional effort.
I would try an index that attempts to cover the query, in the hope that MySQL has to access only the index in order to get the result set.
ALTER TABLE `table` ADD INDEX `pid_date_price` (`pid` , `date`, `price`);
or
ALTER TABLE `table` ADD INDEX `pid_price_date` (`pid` , `price`, `date`);
Choose the first one if you think you may need to select applying conditions over pid and date in the future, or the second one if you think the conditions will be most probable over pid and price.
This way, the index has all the data the query needs (pid, price and date) and it is indexed on the right column (pid).
By the way, always use EXPLAIN to see whether the query planner will really use the whole index (take a look at the key and key_len outputs).
We have a multitenant application that has a table with 129 fields that can all be used in WHERE and ORDER BY clauses. I have now spent 5 days trying to find the best indexing strategy for us; I gained a lot of knowledge but I still have some questions.
1) When creating an index, should I always make it a composite index with tenant_id in the first place? (All queries have tenant_id = ? in their WHERE clause.)
2) Since all the columns can be used in both the WHERE clause and the ORDER BY clause, should I create an index on them all? (Right now, when I ORDER BY a column that has no index, it takes 6 seconds to execute for a tenant that has about 1,500,000 rows.)
3) Make the PK (tenant_id, ID)? But wouldn't this affect the joins to that table?
Any advice on how to handle this would be much appreciated.
======
The database engine is InnoDB
=======
structure :
ID bigint(20) auto_increment primary
tenant_id int(11)
created_by int(11)
created_on Timestamp
updated_by int(11)
updated_on Timestamp
owner_id int(11)
first_name VARCHAR(60)
last_name VARCHAR(60)
.
.
.
(some 120 other columns that are all searchable)
A few brief answers to the questions. As far as I can see, you are confused about how to use indexes.
Consider creating an index on a column if the following ratio is close to 1 -
Consideration 1 -
(Number of DISTINCT entries in the column) / (Total number of entries in the column) ~= 1
That is, the count of DISTINCT values in that particular column is high (the column is highly selective).
Creating an extra index always adds overhead for the MySQL server, so you MUST NOT make every column an index. There is also a limit on the number of indexes a single table can have: 64 per table.
Now, if your tenant_id is present in all the search queries, you should consider indexing it, either on its own or as the leading column of a composite key,
provided that -
Consideration 2 - the number of UPDATEs is lower than the number of SELECTs on tenant_id.
Consideration 3 - the indexes should be as small as possible in terms of data types. You MUST NOT index a full VARCHAR(64); a prefix index (sketched below) is one way to keep it small.
http://www.mysqlperformanceblog.com/2012/08/16/mysql-indexing-best-practices-webinar-questions-followup/
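Regarding Consideration 3, if a long VARCHAR column really must be indexed, a prefix index keeps the index small; the table and column names and the prefix length of 10 are only illustrative:
ALTER TABLE my_table ADD INDEX idx_last_name (last_name(10));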
Point to Note 1 - Even if you do declare a column as an index, the MySQL optimizer may still not consider it the best plan of query execution. So always use EXPLAIN to know what's going on. http://www.mysqlperformanceblog.com/2009/09/12/3-ways-mysql-uses-indexes/
Point to Note 2 -
You may want to cache your search queries, so remember not to use non-deterministic expressions such as NOW() in your SELECT queries.
Lastly - making the PK (tenant_id, ID) should not affect the joins on your table.
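A hedged sketch of that layout; the table name, the last_name secondary index, and the column types are assumptions:
CREATE TABLE records (
    ID        BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    tenant_id INT NOT NULL,
    last_name VARCHAR(60),
    -- the ~120 other searchable columns go here
    PRIMARY KEY (tenant_id, ID),     -- clusters each tenant's rows together
    INDEX (ID),                      -- keeps AUTO_INCREMENT working and supports joins on ID
    INDEX (tenant_id, last_name)     -- example: supports WHERE tenant_id = ? ORDER BY last_name
) ENGINE=InnoDB;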
And an awesome link to answer all your questions in general - http://www.percona.com/files/presentations/WEBINAR-MySQL-Indexing-Best-Practices.pdf