Hi, I currently have a query which takes 11 seconds to run. I have a report displayed on a website which runs 4 similar queries, each taking about 11 seconds. I don't really want the customer to wait close to a minute for all of these queries to run and display the data.
I am using 4 different AJAX requests to call an API to get the data I need. These all start at once, but the queries are running one after another. If there were a way to get these queries to all run at once (in parallel), so the total load time is only 11 seconds, that would also fix my issue. I don't believe that is possible, though.
Here is the query I am running:
SELECT device_uuid,
day_epoch,
is_repeat
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
WHERE day_epoch >= 1552435200
AND day_epoch < 1553040000
AND venue_id = 46
AND zone_id IN (102,105,108,110,111,113,116,117,118,121,287)
I can't think of any way to speed this query up. Below are pictures of the table indexes and the EXPLAIN output for this query.
I think the above query is using relevant indexes in the where conditions.
If there is anything you can think of to speed this query up please let me know, I have been working on it for 3 days and can't seem to figure out the problem. It would be great to get the query times down to 5(sec) maximum. If I am wrong about the AJAX issue please let me know as this would also fix my issue.
EDIT
I have come across something quite strange which might be causing the issue. When I change the day_epoch range to something smaller (5th - 9th), which returns 130,000 rows, the query time is 0.7 sec; but when I add one more day to that range (5th - 10th) and it returns over 150,000 rows, the query time is 13 sec. I have run loads of different ranges and have come to the conclusion that once the number of rows returned goes over 150,000, it has a huge effect on the query time.
Table Definition -
CREATE TABLE `tracking_daily_stats_zone_unique_device_uuids_per_hour` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`day_epoch` int(10) NOT NULL,
`day_of_week` tinyint(1) NOT NULL COMMENT 'day of week, monday = 1',
`hour` int(2) NOT NULL,
`venue_id` int(5) NOT NULL,
`zone_id` int(5) NOT NULL,
`device_uuid` binary(16) NOT NULL COMMENT 'binary representation of the device_uuid, unique for a single day',
`device_vendor_id` int(5) unsigned NOT NULL DEFAULT '0' COMMENT 'id of the device vendor',
`first_seen` int(10) unsigned NOT NULL DEFAULT '0',
`last_seen` int(10) unsigned NOT NULL DEFAULT '0',
`is_repeat` tinyint(1) NOT NULL COMMENT 'is the device a repeat for this day?',
`prev_last_seen` int(10) NOT NULL DEFAULT '0' COMMENT 'previous last seen ts',
PRIMARY KEY (`id`,`venue_id`) USING BTREE,
KEY `venue_id` (`venue_id`),
KEY `zone_id` (`zone_id`),
KEY `day_of_week` (`day_of_week`),
KEY `day_epoch` (`day_epoch`),
KEY `hour` (`hour`),
KEY `device_uuid` (`device_uuid`),
KEY `is_repeat` (`is_repeat`),
KEY `device_vendor_id` (`device_vendor_id`)
) ENGINE=InnoDB AUTO_INCREMENT=450967720 DEFAULT CHARSET=utf8
/*!50100 PARTITION BY HASH (venue_id)
PARTITIONS 100 */
The straightforward solution is to add this query-specific index to the table:
ALTER TABLE tracking_daily_stats_zone_unique_device_uuids_per_hour
ADD INDEX complex_idx (`venue_id`, `day_epoch`, `zone_id`)
WARNING: this ALTER can take a while on the DB.
And then tell the optimizer to use it when you run the query:
SELECT device_uuid,
day_epoch,
is_repeat
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
USE INDEX (complex_idx)
WHERE day_epoch >= 1552435200
AND day_epoch < 1553040000
AND venue_id = 46
AND zone_id IN (102,105,108,110,111,113,116,117,118,121,287)
It is definitely not universal but should work for this particular query.
UPDATE: When the table is partitioned, you can also gain by selecting the partition explicitly. In our case the table is partitioned on venue_id, so just name the partition:
SELECT device_uuid,
day_epoch,
is_repeat
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
PARTITION (`p46`)
WHERE day_epoch >= 1552435200
AND day_epoch < 1553040000
AND zone_id IN (102,105,108,110,111,113,116,117,118,121,287)
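Here p46 is the string p concatenated with venue_id = 46. (With PARTITION BY HASH and 100 partitions, a row lands in partition MOD(venue_id, 100), so this naming holds directly only for venue_id below 100.)
And another trick if you go this way: you can remove AND venue_id = 46 from the WHERE clause, provided no other venue_id maps to that partition.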
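To check which partition a venue ends up in, here is a minimal sketch of the arithmetic. It assumes the default partition names p0 .. p99 that MySQL assigns for PARTITIONS 100, and the helper name hash_partition_name and the venue 146 are made up for illustration:

```python
def hash_partition_name(venue_id: int, num_partitions: int = 100) -> str:
    """Default partition name for PARTITION BY HASH(venue_id) PARTITIONS n."""
    # HASH partitioning stores a row in partition MOD(expr, num_partitions);
    # partitions declared only by count get the default names p0 .. p(n-1).
    return f"p{venue_id % num_partitions}"


# 46 is the venue from the question; 146 is a hypothetical second venue.
print(hash_partition_name(46))
print(hash_partition_name(146))
```

Both calls name the same partition, so dropping venue_id = 46 from the WHERE clause is only safe if no such second venue exists.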
What happens if you change the order of conditions? Put venue_id = ? first. The order matters.
Now it first checks all rows for:
- day_epoch >= 1552435200
- then, the remaining set for day_epoch < 1553040000
- then, the remaining set for venue_id = 46
- then, the remaining set for zone_id IN (102,105,108,110,111,113,116,117,118,121,287)
When working with heavy queries, you should always try to make the first "selector" the most effective one. You can do that by using a proper index on one column (or a combination of columns) and by making sure that the first selector narrows the set down the most (at least for integers; for strings you need another tactic).
Sometimes a query simply is slow. When you have a lot of data (and/or not enough resources), you just can't do much about that. That's where you need another solution: make a summary table. I doubt you show 150,000 rows x 4 to your visitor. You can aggregate it, e.g. hourly or every few minutes, and select from that much smaller table.
Off-topic: Putting an index on everything only slows you down when inserting/updating/deleting. Index as few columns as possible, just the ones you actually filter on (e.g. those used in a WHERE or GROUP BY).
450M rows is rather large. So, I will discuss a variety of issues that can help.
Shrink data A big table leads to more I/O, which is the main performance killer. ('Small' tables tend to stay cached, and not have an I/O burden.)
Any kind of INT, even INT(2), takes 4 bytes. An "hour" can easily fit in a 1-byte TINYINT. That saves over 1GB in the data, plus a similar amount in INDEX(hour).
If hour and day_of_week can be derived, don't bother having them as separate columns. This will save more space.
Is there some reason to use a 4-byte day_epoch instead of a 3-byte DATE? Or perhaps you do need a 5-byte DATETIME or TIMESTAMP.
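The "over 1GB" figure can be checked with a rough back-of-the-envelope calculation. This sketch assumes about 450M rows (taken from the AUTO_INCREMENT value in the table definition) and ignores per-row and per-page overhead; the helper name bytes_saved is made up:

```python
ROWS = 450_000_000   # approximate, from AUTO_INCREMENT=450967720
INT_BYTES = 4        # any INT takes 4 bytes, whatever the display width
TINYINT_BYTES = 1


def bytes_saved(rows: int, old_width: int, new_width: int) -> int:
    """Raw column bytes saved by narrowing one column."""
    return rows * (old_width - new_width)


saved = bytes_saved(ROWS, INT_BYTES, TINYINT_BYTES)
print(f"hour as TINYINT saves about {saved / 1e9:.2f} GB in the data")
```

The same amount again comes off INDEX(hour), since the secondary index stores the column value too.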
Optimal INDEX (take #1)
If it is always a single venue_id, then this is a good first cut at the optimal index:
INDEX(venue_id, zone_id, day_epoch)
First is the constant, then the IN, then a range. The Optimizer does well with this in many cases. (It is unclear whether the number of items in an IN clause can lead to inefficiencies.)
Better Primary Key (better index)
With AUTO_INCREMENT, there is probably no good reason to include columns after the auto_inc column in the PK. That is, PRIMARY KEY(id, venue_id) is no better than PRIMARY KEY(id).
InnoDB orders the data's BTree according to the PRIMARY KEY. So, if you are fetching several rows and can arrange for them to be adjacent to each other based on the PK, you get extra performance. (cf "Clustered".) So:
PRIMARY KEY(venue_id, zone_id, day_epoch, -- this order, as discussed above;
id) -- to make sure that the entire PK is unique.
INDEX(id) -- to keep AUTO_INCREMENT happy
And, I agree with DROPping any indexes that are not in use, including the one I recommended above. It is rarely useful to index flags (is_repeat).
UUID
Indexing a UUID can be deadly for performance once the table is really big. This is because of the randomness of UUIDs/GUIDs, leading to ever-increasing I/O burden to insert new entries in the index.
Multi-dimensional
Assuming day_epoch is sometimes multiple days, you seem to have 2 or 3 "dimensions":
A date range
A list of zones
A venue.
INDEXes are 1-dimensional. Therein lies the problem. However, PARTITIONing can sometimes help. I discuss this briefly as "case 2" in http://mysql.rjweb.org/doc.php/partitionmaint .
There is no good way to get 3 dimensions, so let's focus on 2.
You should partition on something that is a "range", such as day_epoch or zone_id.
After that, you should decide what to put in the PRIMARY KEY so that you can further take advantage of "clustering".
Plan A: This assumes you are searching for only one venue_id at a time:
PARTITION BY RANGE(day_epoch) -- see note below
PRIMARY KEY(venue_id, zone_id, id)
Plan B: This assumes you sometimes search for venue_id IN (.., .., ...), hence it does not make a good first column for the PK:
Well, I don't have good advice here; so let's go with Plan A.
The RANGE expression must be numeric. Your day_epoch works fine as is. Changing to a DATE, would necessitate BY RANGE(TO_DAYS(...)), which works fine.
You should limit the number of partitions to 50. (The 81 mentioned above is not bad.) The problem is that "lots" of partitions introduces different inefficiencies; "too few" partitions leads to "why bother".
Note that almost always the optimal PK is different for a partitioned table than the equivalent non-partitioned table.
Note that I disagree with partitioning on venue_id since it is so easy to put that column at the start of the PK instead.
Analysis
Assuming you search for a single venue_id and use my suggested partitioning & PK, here's how the SELECT performs:
Filter on the date range. This is likely to limit the activity to a single partition.
Drill into the data's BTree for that one partition to find the one venue_id.
Hopscotch through the data from there, landing on the desired zone_ids.
For each, further filter based on the date.
I have an application that needs to make many concurrent lightweight SQL queries. For example - the unit query is like "For this store, give me a list of sales by category today." Alone this query is very cheap - it runs in a few tens of milliseconds at most.
I need to perform this query at a store level - "For every transaction of this store group (roughly up to 30), give a list of sales by category today." This is obviously implementable as a join on the set of stores in the group - but this is too slow. It slows down proportionally to the number of transactions made (in reality, in proportion to the total number of items bought).
Instead I've implemented it as many concurrent store-level queries (I've varied the batch size to no real avail) and then I merge the results in the application layer. This works reasonably well, especially when combined with PreparedStatements. Unfortunately this is not fast enough. This takes query times from 5-15 seconds to 0.5-1.5 seconds for the majority of the time, but occasionally it will take 3 seconds, which is outside of the acceptable performance range (less than 2 seconds).
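The fan-out-and-merge pattern described above can be sketched as follows. This is only an illustration, not the poster's Java code: fetch_store_sales and FAKE_DB are hypothetical stand-ins for one cheap per-store query, and the numbers are made up.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data standing in for the result of one per-store query;
# a real version would run a prepared statement and return category totals.
FAKE_DB = {
    1: {"food": 10.0, "drinks": 4.0},
    2: {"food": 2.5},
    3: {"drinks": 1.0, "gifts": 7.0},
}


def fetch_store_sales(store_id: int) -> dict:
    """Sales by category for one store (stand-in for the unit query)."""
    return FAKE_DB.get(store_id, {})


def group_sales(store_ids, max_workers: int = 8) -> dict:
    """Run the per-store queries concurrently and merge them by category."""
    totals = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs the calls concurrently and yields results in input order.
        for per_store in pool.map(fetch_store_sales, store_ids):
            totals.update(per_store)  # Counter.update adds values key by key
    return dict(totals)


merged = group_sales([1, 2, 3])
print(merged)
```

With this shape the slowest single query bounds the total time, which matches the observation that a few slow outliers dominate the response time.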
The information is not cacheable as it's unlikely that the same query will be executed within an acceptable caching time frame. Note that queries for the recent past (two weeks or so) perform very quickly - as the DB writes keep that section of the data fresh in the DB/OS cache. It's random reads that are killer.
Do any of you DB wizards have any tips to speed up this query process? I'm very new to SQL and nobody in my office has tried anything like this before. I have benchmarked and timed them very thoroughly, and I am pretty sure it's this spin-off of up to 100 queries (30 * 3 metrics + some simpler queries) simultaneously that is costing me the time. A list of query times looks like [10, 15, 30, 55, 89, 100, 300, ..., 1599], all timed only around the execute() call. For reference I'm using Java as the application language with C3P0 and 500-1000 open DB connections and Amazon Aurora as the DB. I've even tried load-balancing the 100 queries across two read-replicas, but this seems to only nominally improve performance, much to my dismay. I got a small performance boost from TRANSACTION_READ_UNCOMMITTED and SCROLL_INSENSITIVE + READ_ONLY, I think.
Edit: Some table structures and queries (Pardon the name transaction - I don't actually use this name but have changed it for business reasons.)
CREATE TABLE IF NOT EXISTS item (
item_id BIGINT UNSIGNED AUTO_INCREMENT,
item_name VARCHAR(120),
unit_price DECIMAL (10,2),
PRIMARY KEY (item_id)
) ENGINE=INNODB;
CREATE TABLE IF NOT EXISTS transaction_item_list (
transaction_item_id BIGINT UNSIGNED AUTO_INCREMENT,
transaction_id BIGINT UNSIGNED,
item_id BIGINT UNSIGNED,
item_quantity DECIMAL(10,2),
item_sales DECIMAL(10,2),
FOREIGN KEY (item_id)
REFERENCES item (item_id),
FOREIGN KEY (transaction_id)
REFERENCES transaction (transaction_id),
PRIMARY KEY (transaction_item_id)
) ENGINE=INNODB;
CREATE INDEX transaction_id_idx
ON transaction_item_list (transaction_id);
CREATE INDEX item_id_idx
ON transaction_item_list (item_id);
CREATE TABLE IF NOT EXISTS transaction (
transaction_id BIGINT UNSIGNED AUTO_INCREMENT,
native_transaction_id VARCHAR(36) NOT NULL,
store_id BIGINT UNSIGNED NOT NULL,
server_time DATETIME NOT NULL,
business_date DATE NOT NULL,
FOREIGN KEY (store_id)
REFERENCES store (store_id),
PRIMARY KEY (transaction_id)
) ENGINE=INNODB;
# used for insertion
CREATE UNIQUE INDEX store_date_native_transaction_id_idx
ON transaction (store_id, business_date, native_transaction_id);
# used for querying
CREATE UNIQUE INDEX store_date_transaction_id_idx
ON transaction (store_id, business_date, transaction_id);
CREATE INDEX store_id_idx
ON transaction (store_id);
CREATE INDEX date_idx
ON transaction (business_date);
CREATE INDEX server_time_idx
ON transaction (server_time);
SELECT sum(transaction_item_list.item_quantity * item.unit_price) FROM transaction_item_list
JOIN item USING (item_id)
JOIN transaction USING (transaction_id)
WHERE (transaction.store_id, transaction.transaction_date) IN ((?, ?))
GROUP BY category;
The transaction_item_list table has over 700 million rows for one year's worth of data.
Do not use this construct: WHERE (store_id, transaction_date) IN ((?, ?)); it optimizes poorly. Instead, use
WHERE store_id = ?
AND transaction_date = ?
Please qualify each column mentioned in a JOIN with the table name (or alias); it is tedious for the reader (us) to figure out which comes from where.
Indexes needed:
transaction: INDEX(store_id, transaction_date) -- in that order
transaction_item_list: INDEX(transaction_id) -- if not already there
transaction_item_list smells like a many:many mapping (plus an extra column). If it is, see my 7 tips on many:many .
I have been reading lots of great answers to different problems over the time on this site but this is the first time I am posting. So in advance thanks for your help.
Here is my question:
I have a MySQL table that tracks visits to different websites we have. This is the table structure:
create table navigation_base (
uid int(11) NOT NULL,
date datetime not null,
dia date not null,
ip int(4) unsigned not null default 0,
session_id int unsigned not null,
cliente smallint unsigned not null default 0,
campaign mediumint unsigned not null default 0,
trackcookie int unsigned not null,
adgroup int unsigned not null default 0,
PRIMARY KEY (uid)
) ENGINE=MyISAM;
This table has approx. 70 million rows (an average of 110,000 per day).
On that table we have created indexes with following commands:
alter table navigation_base add index dia_cliente_campaign_ip (dia,cliente,campaign,ip);
alter table navigation_base add index dia_cliente_campaign_ip_session (dia,cliente,campaign,ip,session_id);
alter table navigation_base add index dia_cliente_campaign_ip_session_trackcookie (dia,cliente,campaign,ip,session_id,trackcookie);
We then use this table to get visitor statistics grouped by clients, days and campaigns with the following query:
select
dia,
navigation_base.campaign,
navigation_base.cliente,
count(distinct ip) as visitas,
count(ip) as paginas_vistas,
count(distinct session_id) as sesiones,
count(distinct trackcookie) as cookies
from navigation_base where
(dia between '2017-01-01' and '2017-01-31')
group by dia,cliente,campaign order by NULL
Even having those indexes created, the response times for periods of one month are relatively slow; On our server about 3 seconds.
Are there some ways of speeding up these queries?
Thanks in advance.
With this much data, indexing alone may not help much, since there is a lot of similarity in the data. Besides, you have GROUP BY and ORDER BY along with aggregation. All these things combined make optimization very hard. Partitioning is the way forward, because:
Some queries can be greatly optimized in virtue of the fact that data
satisfying a given WHERE clause can be stored only on one or more
partitions, which automatically excludes any remaining partitions from
the search. Because partitions can be altered after a partitioned
table has been created, you can reorganize your data to enhance
frequent queries that may not have been often used when the
partitioning scheme was first set up.
And if this doesn't work for you, it's still possible to
In addition, MySQL 5.7 supports explicit partition selection for
queries. For example, SELECT * FROM t PARTITION (p0,p1) WHERE c < 5
selects only those rows in partitions p0 and p1 that match the WHERE
condition.
ALTER TABLE navigation_base
PARTITION BY RANGE( TO_DAYS(dia)) (
-- VALUES LESS THAN bounds must be strictly increasing
PARTITION p0 VALUES LESS THAN (TO_DAYS('2015-12-31')),
PARTITION p1 VALUES LESS THAN (TO_DAYS('2016-12-31')),
PARTITION p2 VALUES LESS THAN (TO_DAYS('2017-12-31')),
PARTITION p3 VALUES LESS THAN (TO_DAYS('2018-12-31')),
..
PARTITION p10 VALUES LESS THAN MAXVALUE);
Use bigger or smaller partitions as you see fit.
The most important factor to keep in mind is that MySQL will generally use only one index per table in a query, so choose your index wisely.
If you only do COUNT(DISTINCT ...) at the granularity of a day, then build and incrementally maintain a summary table. It would be augmented each night by a query nearly identical to your SELECT, but only fetching yesterday's data.
Then use this Summary Table for the monthly "report".
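A minimal sketch of that nightly summary job, using Python's built-in sqlite3 as a stand-in for MySQL so that it runs anywhere. The table name navigation_daily, the helper summarize_day and the sample rows are assumptions for illustration, not part of the original schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE navigation_base (
    dia TEXT, cliente INTEGER, campaign INTEGER,
    ip INTEGER, session_id INTEGER, trackcookie INTEGER
);
-- Hypothetical summary table: one row per (dia, cliente, campaign).
CREATE TABLE navigation_daily (
    dia TEXT, cliente INTEGER, campaign INTEGER,
    visitas INTEGER, paginas_vistas INTEGER, sesiones INTEGER, cookies INTEGER,
    PRIMARY KEY (dia, cliente, campaign)
);
""")
# Made-up raw rows: (dia, cliente, campaign, ip, session_id, trackcookie)
conn.executemany("INSERT INTO navigation_base VALUES (?,?,?,?,?,?)", [
    ("2017-01-01", 1, 10, 100, 1, 500),
    ("2017-01-01", 1, 10, 100, 1, 500),
    ("2017-01-01", 1, 10, 101, 2, 501),
    ("2017-01-02", 1, 10, 100, 3, 500),
    ("2017-01-02", 2, 20, 102, 4, 502),
])

# Same aggregation as the report query; only the date condition varies.
AGGREGATE = """
SELECT dia, cliente, campaign,
       COUNT(DISTINCT ip), COUNT(ip),
       COUNT(DISTINCT session_id), COUNT(DISTINCT trackcookie)
FROM navigation_base
WHERE dia {cond}
GROUP BY dia, cliente, campaign
"""


def summarize_day(day: str) -> None:
    """Nightly job: aggregate one day of raw rows into the summary table."""
    conn.execute("INSERT INTO navigation_daily " + AGGREGATE.format(cond="= ?"), (day,))


for day in ("2017-01-01", "2017-01-02"):
    summarize_day(day)

month = ("2017-01-01", "2017-01-31")
order = " ORDER BY dia, cliente, campaign"
# The monthly report reads the small summary table ...
report = conn.execute(
    "SELECT * FROM navigation_daily WHERE dia BETWEEN ? AND ?" + order, month).fetchall()
# ... and should match the expensive query against the raw table.
direct = conn.execute(AGGREGATE.format(cond="BETWEEN ? AND ?") + order, month).fetchall()
print(report == direct)
```

This works because the report groups by day, so each day's distinct counts can be computed once and stored; distinct counts across several days could not be derived from such a table.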
More on Summary Tables
Performance problem on update MySql MyISAM big table making column ascending order based on an index on same table
My problem is that the server has only 4 GB of memory.
I have to run an update query like the one in this previously asked question.
Mine is this:
set @orderid = 0;
update images im
set im.orderid = (select @orderid := @orderid + 1)
ORDER BY im.hotel_id, im.idImageType;
I have an ascending index on (im.hotel_id, im.idImageType).
I have an ascending index on im.orderid too.
The table has 21 million records and is a MyISAM table.
The table is this:
CREATE TABLE `images` (
`photo_id` int(11) NOT NULL,
`idImageType` int(11) NOT NULL,
`hotel_id` int(11) NOT NULL,
`room_id` int(11) DEFAULT NULL,
`url_original` varchar(150) COLLATE utf8_unicode_ci NOT NULL,
`url_max300` varchar(150) COLLATE utf8_unicode_ci NOT NULL,
`url_square60` varchar(150) COLLATE utf8_unicode_ci NOT NULL,
`archive` int(11) NOT NULL DEFAULT '0',
`orderid` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`photo_id`),
KEY `idImageType` (`idImageType`),
KEY `hotel_id` (`hotel_id`),
KEY `hotel_id_idImageType` (`hotel_id`,`idImageType`),
KEY `archive` (`archive`),
KEY `room_id` (`room_id`),
KEY `orderid` (`orderid`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
The problem is the performance: it hangs for several minutes!
The server disk gets busy too.
My question is: is there a better way to achieve the same result?
Do I have to partition the table, or do something else, to increase performance?
I cannot modify the server hardware, but I can tune the MySQL server settings.
Best regards
Thanks to everybody. Your answers helped me a lot. I think I have now found a better solution.
This problem involves two critical issues:
efficient pagination on a large table
updating a large table.
To paginate efficiently on the large table, I first tried updating the table beforehand, but the update took 51 minutes and my Java infrastructure (a spring-batch step) timed out.
Now, with your help, I have found two solutions to paginate on a large table, and one solution to update a large table.
To reach this performance the server needs memory. I tried this solution on a development server with 32 GB of memory.
Common solution step
To paginate following a tuple of fields, as I needed, I had made this index:
KEY `hotel_id_idImageType` (`hotel_id`,`idImageType`)
For the new solution we have to change this index by appending the primary key to its tail, i.e. KEY hotel_id_idImageType (hotel_id, idImageType, primary key fields):
drop index hotel_id_idImageType on images;
create index hotelTypePhoto on images (hotel_id, idImageType, photo_id);
This is needed so the query can avoid touching the table and use only the index file ...
Suppose we want the 10 records after record 19000000.
Note: the decimal separator in this answer is the comma.
solution 1
This solution is very practical: it does not need the extra field orderid, and you do not have to do any update before paginating:
select * from images im inner join
(select photo_id from images
order by hotel_id, idImageType, photo_id
limit 19000000,10) k
on im.photo_id = k.photo_id;
Building the derived table k on my 21-million-row table takes only 1,5 sec, because it uses only the three fields in index hotelTypePhoto, so it does not have to access the table file and works only on the index file.
The order is the one originally required (hotel_id, idImageType), because it is a prefix of (hotel_id, idImageType, photo_id).
The join takes almost no time, so the first time a given page is requested it needs only 1,5 sec, which is a good time if you only run it in a batch once every 3 months.
On the production server with 4 GB of memory the same query takes 3,5 sec.
Partitioning the table does not help to improve performance.
If the server has it in cache the time goes down, and I suppose it also goes down if you use a JDBC parameterized statement.
If you have to use it often, it has the advantage that it does not care whether the data changes.
solution 2
This solution needs the extra field orderid; the orderid update has to be done once per batch import, and the data must not change until the next batch import.
Then you can paginate on the table in 0,000 sec.
set @orderid = 0;
update images im inner join (
select photo_id, (@orderid := @orderid + 1) as newOrder
from images order by hotel_id, idImageType, photo_id
) k
on im.photo_id = k.photo_id
set im.orderid = k.newOrder;
Building the derived table k is almost as fast as in the first solution.
The whole update takes only 150,551 sec, much better than 51 minutes!!! (150s vs 3060s)
After this update in the batch you can paginate with:
select * from images im where orderid between 19000000 and 19000010;
or better
select * from images im where orderid >= 19000000 and orderid< 19000010;
This takes 0,000 sec to execute, the first time and every time after.
Edit after Rick comment
Solution 3
This solution avoids both the extra field and the use of offset, but you need to remember the last page read, like in this solution.
This is a fast solution and can work on the online production server with only 4 GB of memory.
Suppose you need to read the ten records after 20000000.
There are two scenarios to consider:
You read from the first record up to 20000000, if you need all of them like me, updating some variables to remember the last page read.
You only have to read the 10 records after 20000000.
In this second scenario you have to do a pre-query to find the start of the page:
select hotel_id, idImageType, photo_id
from images im
order by hotel_id, idImageType, photo_id limit 20000000,1
It gives me:
+----------+-------------+----------+
| hotel_id | idImageType | photo_id |
+----------+-------------+----------+
| 1309878 | 4 | 43259857 |
+----------+-------------+----------+
This takes 6,73 sec.
So you can store these values in variables for later use.
Suppose we name them @hot=1309878, @type=4, @photo=43259857.
Then you can use them in a second query like this:
select * from images im
where
hotel_id>@hot OR (
hotel_id=@hot and (idImageType>@type OR (
idImageType=@type and photo_id>@photo
))
)
order by hotel_id, idImageType, photo_id limit 10;
The first clause, hotel_id>@hot, takes all records after the current value of the first index field, but misses some records. To get them we need the OR clause, which picks up the remaining unread records that have the same value in the first index field.
This takes only 0,10 sec now.
But this query can be rewritten (boolean distributive law):
select * from images im
where
hotel_id>@hot OR (
hotel_id=@hot and
(idImageType>@type or idImageType=@type)
and (idImageType>@type or photo_id>@photo
)
)
order by hotel_id, idImageType, photo_id limit 10;
which becomes:
select * from images im
where
hotel_id>@hot OR (
hotel_id=@hot and
idImageType>=@type
and (idImageType>@type or photo_id>@photo
)
)
order by hotel_id, idImageType, photo_id limit 10;
which becomes:
select * from images im
where
(hotel_id>@hot OR hotel_id=@hot) and
(hotel_id>@hot OR
(idImageType>=@type and (idImageType>@type or photo_id>@photo))
)
order by hotel_id, idImageType, photo_id limit 10;
which becomes:
select * from images im
where
hotel_id>=@hot and
(hotel_id>@hot OR
(idImageType>=@type and (idImageType>@type or photo_id>@photo))
)
order by hotel_id, idImageType, photo_id limit 10;
Is this the same data we would get with limit?
For a quick, non-exhaustive test, run:
select im.* from images im inner join (
select photo_id from images order by hotel_id, idImageType, photo_id limit 20000000,10
) k
on im.photo_id=k.photo_id
order by im.hotel_id, im.idImageType, im.photo_id;
This takes 6,56 sec and the data is the same as that of the query above.
So the test is positive.
In this solution you have to spend 6,73 sec only the first time, when you need to seek to the first page to read (but not if you read everything from the start).
To read every other page you need only 0,10 sec, a very good result.
Thanks to Rick for his hint about a solution based on storing the last page read.
Conclusion
With solution 1 you have no extra field and it takes 3,5 sec for every page.
With solution 2 you have an extra field and need a server with a lot of memory (32 GB tested) for the 150 sec update, but then you read a page in 0,000 sec.
With solution 3 you have no extra field but have to store a pointer to the last page read, and if you do not start reading from the first page you have to spend 6,73 sec on the first page. After that you spend only 0,10 sec on each of the other pages.
Best regards
Edit 3
Solution 3 is exactly what Rick suggested. I'm sorry: in my previous solution 3 I made a mistake, and when I coded the right solution I applied some boolean rules, such as the distributive property, and in the end I got the same solution as Rick!
regards
You can use some of this:
Change the engine to InnoDB; it locks only one row on update, not the whole table.
Create a temp table with photo_id and the correct orderid, and then update your table from this temp table:
update images im, temp tp
set im.orderid = tp.orderid
where im.photo_id = tp.photo_id
This will be the fastest way, and while you fill your temp table you hold no locks on the primary table.
You can drop the indexes before the mass update. Otherwise the indexes are maintained after every single update, and that takes a long time.
KEY `hotel_id` (`hotel_id`),
KEY `hotel_id_idImageType` (`hotel_id`,`idImageType`),
DROP the former; the latter takes care of any need for it. (This won't speed up the original query.)
"The problem is the performance: hang for several minutes!" What is the problem?
Other queries are blocked for several minutes? (InnoDB should help.)
You run this update often and it is annoying? (Why in the world??)
Something else?
This one index is costly while doing the Update:
KEY `orderid` (`orderid`)
DROP it and re-create it. (Don't bother dropping the rest.) Another reason for going with InnoDB is that these operations can be done (in 5.6) without copying the table over. (21M rows == long time if it has to copy the table!)
Why are you building a second Unique index (orderid) in addition to photo_id, which is already Unique? I ask this because there may be another way to solve the real problem that does not involve this time-consuming Update.
I have two more concrete suggestions, but I want to hear your answers first.
Edit Pagination, ordered by hotel_id, idImageType, photo_id:
It is possible to read the records in order by that triple. And even to "paginate" through them.
If you "left off" after ($hid, $type, $pid), here would be the 'next' 20 records:
WHERE hotel_id >= $hid
AND ( hotel_id > $hid
OR idImageType >= $type
AND ( idImageType > $type
OR photo_id > $pid
)
)
ORDER BY hotel_id, idImageType, photo_id
LIMIT 20
and have
INDEX(hotel_id, idImageType, photo_id)
This avoids the need for orderid and its time consuming Update.
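Here is a small runnable sketch of this "remember where you left off" pagination, using Python's built-in sqlite3 as a stand-in for MySQL. The data is made up (5 hotels x 3 image types x 4 photos), and the helpers next_page and offset_page are hypothetical names; the WHERE clause is the one shown above with the AND/OR grouping made explicit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE images (photo_id INTEGER PRIMARY KEY, hotel_id INTEGER, idImageType INTEGER)")
conn.execute("CREATE INDEX hotelTypePhoto ON images (hotel_id, idImageType, photo_id)")

# Made-up data; photo ids are scrambled so they do not follow the sort order.
rows = []
i = 0
for hotel_id in range(1, 6):
    for image_type in range(1, 4):
        for _ in range(4):
            i += 1
            rows.append(((i * 37) % 61, hotel_id, image_type))
conn.executemany("INSERT INTO images VALUES (?,?,?)", rows)

ORDER = "ORDER BY hotel_id, idImageType, photo_id"


def next_page(last, size):
    """Rows strictly after the (hotel_id, idImageType, photo_id) triple `last`."""
    hid, typ, pid = last
    return conn.execute(
        f"""SELECT hotel_id, idImageType, photo_id FROM images
            WHERE hotel_id >= ?
              AND (hotel_id > ?
                   OR (idImageType >= ?
                       AND (idImageType > ? OR photo_id > ?)))
            {ORDER} LIMIT ?""",
        (hid, hid, typ, typ, pid, size)).fetchall()


def offset_page(offset, size):
    """The same page fetched the slow way, with LIMIT/OFFSET, for comparison."""
    return conn.execute(
        f"SELECT hotel_id, idImageType, photo_id FROM images {ORDER} LIMIT ? OFFSET ?",
        (size, offset)).fetchall()


# Walk the whole table page by page, remembering only the last row seen.
walked = offset_page(0, 7)
while True:
    page = next_page(walked[-1], 7)
    if not page:
        break
    walked.extend(page)

all_rows = offset_page(0, 1000)
print(walked == all_rows)
```

Each page only needs the last triple of the previous page, so the cost does not grow with how deep into the table you are.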
It would be simpler to paginate one hotel_id at a time. Would that work?
Edit 2 -- eliminate downtime
Since you are reloading the entire table periodically, do this when you reload:
CREATE TABLE New with the recommended index changes.
Load the data into New. (Be sure to avoid your 51-minute timeout; I don't know what is causing that.)
RENAME TABLE images TO old, New TO images;
DROP TABLE old;
That will avoid blocking the table for the load and for the schema changes. There will be a very short block for the atomic Step #3.
Plan on doing this procedure each time you reload your data.
Another benefit -- After step #2, you can test the New data to see if it looks OK.
I'd like to ask a question about how to improve performance in a big MySQL table using innodb engine:
There's currently a table in my database with around 200 million rows. This table periodically stores the data collected by different sensors. The structure of the table is as follows:
CREATE TABLE sns_value (
value_id int(11) NOT NULL AUTO_INCREMENT,
sensor_id int(11) NOT NULL,
type_id int(11) NOT NULL,
date timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
value int(11) NOT NULL,
PRIMARY KEY (value_id),
KEY idx_sensor_id (sensor_id),
KEY idx_date (date),
KEY idx_type_id (type_id) );
At first, I thought of partitioning the table in months, but due to the steady addition of new sensors it would reach the current size in about a month.
Another solution that I came up with was partitioning the table by sensors. However, due to the limit of 1024 partitions of MySQL that wasn't an option.
I believe that the right solution would be using a table with the same structure for each of the sensors:
sns_value_XXXXX
This way there would be more than 1.000 tables with an estimated size of 30 million rows per year. These tables could, at the same time, be partitioned in months for fastest access to data.
What problems would result from this solution? Is there a more normalized solution?
Editing with additional information
I consider the table to be big in relation to my server:
Cloud 2xCPU and 8GB Memory
LAMP (CentOS 6.5 and MySQL 5.1.73)
Each sensor may have more than one variable types (CO, CO2, etc.).
I mainly have two slow queries:
1) Daily summary for each sensor and type (avg, max, min):
SELECT round(avg(value)) as mean, min(value) as min, max(value) as max, type_id
FROM sns_value
WHERE sensor_id=1 AND date BETWEEN '2014-10-29 00:00:00' AND '2014-10-29 12:00:00'
GROUP BY type_id limit 2000;
This takes more than 5 min.
2) Vertical to Horizontal view and export:
SELECT sns_value.date AS date,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 101)))))) AS one,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 141)))))) AS two,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 151)))))) AS three
FROM sns_value
WHERE sns_value.sensor_id=1 AND sns_value.date BETWEEN '2014-10-28 12:28:29' AND '2014-10-29 12:28:29'
GROUP BY sns_value.sensor_id,sns_value.date LIMIT 4500;
This also takes more than 5 min.
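For readers puzzled by the 1 - abs(sign(type_id - N)) expression in that query: it evaluates to 1 when type_id equals N and to 0 otherwise, so each SUM only picks up the values of one type. A small sketch with made-up readings (the helper names sign and indicator are mine):

```python
def sign(x: int) -> int:
    """-1, 0 or 1, like SQL SIGN()."""
    return (x > 0) - (x < 0)


def indicator(type_id: int, target: int) -> int:
    """1 when type_id == target, else 0; mirrors 1 - ABS(SIGN(type_id - target))."""
    return 1 - abs(sign(type_id - target))


# Made-up readings: (date, type_id, value)
readings = [
    ("2014-10-29 00:00:00", 101, 5),
    ("2014-10-29 00:00:00", 141, 7),
    ("2014-10-29 00:05:00", 151, 9),
    ("2014-10-29 00:05:00", 101, 2),
]

# Pivot from one row per (date, type) to one row per date, as the query does.
pivot = {}
for date, type_id, value in readings:
    row = pivot.setdefault(date, {"one": 0, "two": 0, "three": 0})
    row["one"] += value * indicator(type_id, 101)
    row["two"] += value * indicator(type_id, 141)
    row["three"] += value * indicator(type_id, 151)

print(pivot)
```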
Other considerations
Timestamps may be repeated due to inserts characteristics.
Periodic inserts must coexist with selects.
No updates nor deletes are performed on the table.
Suppositions made to the "one table for each sensor" approach
Tables for each sensor would be much smaller so access would be faster.
Selects will be performed only on one table for each sensor.
Selects mixing data from different sensors are not time-critical.
Update 02/02/2015
We have created a new table for each year of data, which we have also partitioned in a daily basis. Each table has around 250 million rows with 365 partitions. The new index used is as Ollie suggested (sensor_id, date, type_id, value) but the query still takes between 30 seconds and 2 minutes. We do not use the first query (daily summary), just the second (vertical to horizontal view).
In order to be able to partition the table, the primary index had to be removed.
Are we missing something? Is there a way to improve the performance?
Many thanks!
Edited based on changes to the question
One table per sensor is, with respect, a very bad idea indeed. There are several reasons for that:
MySQL servers on ordinary operating systems have a hard time with thousands of tables. Most OSs can't handle that many files being accessed at once.
You'll have to create tables each time you add (or delete) sensors.
Queries that involve data from multiple sensors will be slow and convoluted.
My previous version of this answer suggested range partitioning by timestamp. But that won't work with your value_id primary key. However, with the queries you've shown and proper indexing of your table, partitioning probably won't be necessary.
(Avoid the column name date if you can: it's a MySQL keyword (the name of a data type), and it makes queries confusing to read and write. Instead I suggest you use ts, meaning timestamp.)
Beware: int(11) values aren't big enough for your value_id column. You're going to run out of ids. Use bigint(20) for that column.
You've mentioned two queries. Both these queries can be made quite efficient with appropriate compound indexes, even if you keep all your values in a single table. Here's the first one.
SELECT round(avg(value)) as mean, min(value) as min, max(value) as max,
type_id
FROM sns_value
WHERE sensor_id=1
AND date BETWEEN '2014-10-29 00:00:00' AND '2014-10-29 12:00:00'
GROUP BY type_id limit 2000;
For this query, you're first looking up sensor_id using a constant, then you're looking up a range of date values, then you're aggregating by type_id. Finally you're extracting the value column. Therefore, a so-called compound covering index on (sensor_id, date, type_id, value) will be able to satisfy your query directly with an index scan. This should be very fast for you--certainly faster than 5 minutes even with a large table.
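As a rough sketch of how to add and check that index on the existing table (this assumes the original column name date and borrows the index name query_opt; adding an index to a table this size can take a long time), something like the following could be used. If the index covers the query, the Extra column of the EXPLAIN output should include "Using index".

```sql
-- Sketch only: assumes the table and column names from the question.
-- `date` is backquoted to keep the column name clearly apart from the DATE type.
ALTER TABLE sns_value
  ADD INDEX query_opt (sensor_id, `date`, type_id, value);

-- Check the plan: "key" should show query_opt, and "Extra" should
-- include "Using index" when the index alone can satisfy the query.
EXPLAIN
SELECT ROUND(AVG(value)) AS mean_value,
       MIN(value)        AS min_value,
       MAX(value)        AS max_value,
       type_id
  FROM sns_value
 WHERE sensor_id = 1
   AND `date` BETWEEN '2014-10-29 00:00:00' AND '2014-10-29 12:00:00'
 GROUP BY type_id;
```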
In your second query, a similar indexing strategy will work.
SELECT sns_value.date AS date,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 101)))))) AS one,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 141)))))) AS two,
sum((sns_value.value * (1 - abs(sign((sns_value.type_id - 151)))))) AS three
FROM sns_value
WHERE sns_value.sensor_id=1
AND sns_value.date BETWEEN '2014-10-28 12:28:29' AND '2014-10-29 12:28:29'
GROUP BY sns_value.sensor_id,sns_value.date
LIMIT 4500;
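Again, you start with a constant value of sensor_id and then use a date range. You then extract both type_id and value. That means the same four-column index I mentioned should work for you. Here is the table definition with that index, with the date column renamed to ts and value_id widened to bigint: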
CREATE TABLE sns_value (
value_id bigint(20) NOT NULL AUTO_INCREMENT,
sensor_id int(11) NOT NULL,
type_id int(11) NOT NULL,
ts timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
value int(11) NOT NULL,
PRIMARY KEY (value_id),
INDEX query_opt (sensor_id, ts, type_id, value)
);
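The pivot query can also be written with CASE expressions, which some people find easier to read than the abs(sign(...)) arithmetic. The sketch below assumes the table definition above (with the ts column) and should return the same sums, since 1 - abs(sign(type_id - 101)) is 1 when type_id is 101 and 0 otherwise. It is not claimed to be faster; the index does the heavy lifting either way.

```sql
-- Sketch: same result as the abs(sign(...)) version, assuming the
-- sns_value definition above (ts instead of date, NOT NULL columns).
SELECT ts,
       SUM(CASE WHEN type_id = 101 THEN value ELSE 0 END) AS one,
       SUM(CASE WHEN type_id = 141 THEN value ELSE 0 END) AS two,
       SUM(CASE WHEN type_id = 151 THEN value ELSE 0 END) AS three
  FROM sns_value
 WHERE sensor_id = 1
   AND ts BETWEEN '2014-10-28 12:28:29' AND '2014-10-29 12:28:29'
 GROUP BY sensor_id, ts
 LIMIT 4500;
```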
Creating a separate table for each range of sensors would be one option.
Do not use auto_increment for the primary key if you don't have to. The storage engine usually clusters the data by its primary key (InnoDB does).
Use a composite key instead; depending on your use case, the order of the columns may differ.
EDIT: I also added the type to the PK. Considering the queries, I would do it like this. The field names are chosen intentionally: they should be descriptive, and you should always keep reserved words in mind.
CREATE TABLE snsXX_readings (
sensor_id int(11) NOT NULL,
reading int(11) NOT NULL,
reading_time timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
type_id int(11) NOT NULL,
PRIMARY KEY (reading_time, sensor_id, type_id),
-- Index on the time column (already the leading column of the primary key, so likely redundant)
KEY date_idx (reading_time),
KEY type_idx (type_id)
);
Also, consider summarizing the readings or grouping them into a single field.
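One way to summarize the readings, sketched under assumptions: the table sns_value_hourly and its columns are hypothetical names introduced here, and snsXX_readings is the placeholder table from the definition above. Storing the count and the sum (rather than the average) keeps averages over longer periods exact, because they can be recomputed as SUM(sum_value) / SUM(n).

```sql
-- Hypothetical hourly summary table; all names are illustrative.
CREATE TABLE sns_value_hourly (
  sensor_id  int(11)    NOT NULL,
  type_id    int(11)    NOT NULL,
  hour_start datetime   NOT NULL,
  n          int(11)    NOT NULL,   -- number of readings in the hour
  sum_value  bigint(20) NOT NULL,   -- sum of readings in the hour
  min_value  int(11)    NOT NULL,
  max_value  int(11)    NOT NULL,
  PRIMARY KEY (sensor_id, type_id, hour_start)
);

-- Summarize one day of raw readings; run periodically (for example, daily).
INSERT INTO sns_value_hourly
SELECT sensor_id,
       type_id,
       DATE_FORMAT(reading_time, '%Y-%m-%d %H:00:00') AS hour_start,
       COUNT(*),
       SUM(reading),
       MIN(reading),
       MAX(reading)
  FROM snsXX_readings
 WHERE reading_time >= '2014-10-29 00:00:00'
   AND reading_time <  '2014-10-30 00:00:00'
 GROUP BY sensor_id, type_id, hour_start;
```

Reports such as the daily summary could then read from the much smaller summary table instead of scanning the raw readings.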
You can try fetching a random sample of the data instead of the full result.
I have a similar table: MyISAM engine (smallest table size), 10 million records, and no index, because indexes proved useless in my tests. Querying the whole date range takes about 10 seconds with this query.
SELECT * FROM (
SELECT sensor_id, value, date
FROM sns_value l
WHERE l.sensor_id= 123 AND
(l.date BETWEEN '2013-10-29 12:28:29' AND '2015-10-29 12:28:29')
ORDER BY RAND() LIMIT 2000
) as tmp
ORDER BY tmp.date;
In the first step, this query selects the rows between the two dates, orders them randomly, and keeps the first 2,000. In the second step, it sorts that sample by date. Each run returns 2,000 rows, but a different sample of the data.