I'm trying to optimize a query which is taking way too long but can't seem to figure it out.
CREATE TABLE `syncs` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`status` tinyint(1) NOT NULL,
`auto_retryable_after` timestamp NULL DEFAULT NULL,
`times_auto_retried` tinyint(4) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `UsedDate` (`status`,`auto_retryable_after`)
)
With this query:
SELECT * FROM syncs WHERE status IN ('2','4') and auto_retryable_after <= NOW()
With 500,000 test records this takes roughly 16.5 seconds. I usually have a much larger data set which means it takes multiple minutes. So any help would be appreciated!
Shoveling 500K rows to the client will (1) take network time, and (2) choke the client. What will you do with that flood of data?
When asking about the performance of a query, please provide the output of EXPLAIN SELECT .... That may show that the optimizer shunned the index and simply scanned the table, which would actually be faster here. If, instead, it used the UsedDate index, the query would be quite slow due to bouncing between the index's BTree and the data's BTree 500K times.
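For example:
EXPLAIN SELECT * FROM syncs WHERE status IN ('2','4') AND auto_retryable_after <= NOW();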
If this is a test case flaw (namely that the real data won't need to shovel all 500K), then create a "real" test case and try again.
If you really need all 500K, then re-think the overall flow of data. For example, if you just need a "count" of the number of such rows, MySQL can much more efficiently do the count and deliver 1 row instead of 500K rows.
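For example, if a count is all that's needed, a sketch reusing the question's WHERE clause:
SELECT COUNT(*) FROM syncs WHERE status IN ('2','4') AND auto_retryable_after <= NOW();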
My application uses a MariaDB database which I try to keep isolated, but one particular user goes straight to the database and started complaining today after 6 weeks without incident that one of their queries slowed down from 5 mins (which I thought was bad enough) to over 120 mins.
Since then, over the course of today, it has sometimes been as fast as usual and sometimes slowed down again.
This is their query:
SELECT MAX(last_updated) FROM data_points;
This is the table:
CREATE TABLE data_points (
seriesId INT UNSIGNED NOT NULL,
modifiedDate DATE NOT NULL,
valueDate DATE NOT NULL,
value DOUBLE NOT NULL,
created DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
last_updated DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP()
ON UPDATE CURRENT_TIMESTAMP,
id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
CONSTRAINT pk_data PRIMARY KEY (seriesId, modifiedDate, valueDate),
KEY ix_data_modifieddate (modifiedDate),
KEY ix_data_id (id),
CONSTRAINT fk_data_seriesid FOREIGN KEY (seriesId)
REFERENCES series(id)
) ENGINE=InnoDB
DEFAULT CHARSET=utf8mb4
COLLATE=utf8mb4_unicode_ci
MAX_ROWS=222111000;
and this is the EXPLAIN:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE data_points ALL NULL NULL NULL NULL 224166191
The table has approx 250M rows and is growing relatively fast.
I can coerce the user into doing something more sensible but in the short term I'm keen to understand why the query duration is going crazy today after 6 weeks of calm. I'll accept the first answer that can explain that.
SELECT MAX(last_updated) FROM data_points; is easily optimized:
INDEX(last_updated)
That index will make that MAX essentially instantaneous. And it will avoid pounding on the disk and cache (see below).
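A sketch of the corresponding DDL (the index name is arbitrary):
ALTER TABLE data_points ADD INDEX ix_data_lastupdated (last_updated);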
Two things control the un-indexed speed:
The size of the table, which is "growing relatively fast", and
[This is probably what you are fishing for.] How much of the table is cached when the query is run. This can make a 10x difference in the speed. You can partially test this claim thus:
Restart mysqld; time the query; time it again. The first run had to hit the disk a lot (because of the fresh restart); the second may have found everything in RAM.
Another thing that can mess with the timings: If some other 'big' query is run and it bumps blocks of this table out of cache, then the query will again be slow.
Of relevance: Size of table, value of innodb_buffer_pool_size, and amount of RAM.
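A quick way to check the first two of those from SQL (a sketch; the first query reads the stock information_schema.tables view):
SELECT data_length + index_length AS total_bytes
FROM information_schema.tables
WHERE table_name = 'data_points';
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';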
On an unrelated topic... That PRIMARY KEY (seriesId, modifiedDate, valueDate) seems strange. A PK must be unique. Dates (DATE, DATETIME, etc.) are likely to have multiple entries for the same day/second, so can you be sure of uniqueness? Especially with two dates?
(More)
Please explain the meaning of each of the 4 dates. And ask yourself if they are all needed. (About half the bulk of the table is those dates!)
The table has an AUTO_INCREMENT; is it needed by some other table? If not, then it could either be removed, or be used to ensure that the PK is unique.
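For example, a sketch of the second option (note that this rebuilds the whole table, which is expensive at 250M rows):
ALTER TABLE data_points
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (seriesId, modifiedDate, valueDate, id);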
To better help you, we need to see more of the queries.
I have a large table called "queue". It has 12 million records right now.
CREATE TABLE `queue` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`userid` varchar(64) DEFAULT NULL,
`action` varchar(32) DEFAULT NULL,
`target` varchar(64) DEFAULT NULL,
`name` varchar(64) DEFAULT NULL,
`state` int(11) DEFAULT '0',
`timestamp` int(11) DEFAULT '0',
`errors` int(11) DEFAULT '0',
PRIMARY KEY (`id`),
UNIQUE KEY `idx_unique` (`userid`,`action`,`target`),
KEY `idx_userid` (`userid`),
KEY `idx_state` (`state`)
) ENGINE=InnoDB;
Multiple PHP workers (150) use this table simultaneously.
They select a record, perform a network request using the selected data and then delete the record.
I get mixed execution times from the select and delete queries. Is the delete command locking the table?
What would be the best approach for this scenario?
SELECT record + NETWORK request + DELETE the record
SELECT record + NETWORK request + MARK record as completed + DELETE completed records using a cron from time to time (I don't want an even bigger table).
Note: The queue gets new records every minute but the INSERT query is not the issue here.
Any help is appreciated.
"Don't queue it, just do it". That is, if the tasks are rather fast, it is better to simply perform the action and not queue it. Databases don't make good queuing mechanisms.
DELETE does not lock an InnoDB table. However, you can write a DELETE that is that naughty. Let's see your actual SQL so we can work on improving it.
12M records? That's a huge backlog; what's up?
Shrink the datatypes so that the table is not gigabytes:
Does action have only a small set of possible values? If so, normalize it down to a 1-byte ENUM or TINYINT UNSIGNED (see the sketch after this list).
Ditto for state -- surely it does not need a 4-byte code?
There is no need for INDEX(userid) since there is already an index (UNIQUE) starting with userid.
If state has only a few distinct values, the index won't be used. Let's see your enqueue and dequeue queries so we can discuss how to either get rid of that index or make it 'composite' (and useful).
What's the current value of MAX(id)? Is it threatening to exceed your current limit of about 4 billion for INT UNSIGNED?
How does PHP use the queue? Does it hang onto an item via an InnoDB transaction? That defeats any parallelism! Or does it change state? Show us the code; perhaps the lock & unlock can be made less invasive. It should be possible to run a single autocommitted UPDATE to grab a row and its id, then, later, do an autocommitted DELETE with very little impact (see the sketch after this list).
I do not see a good index for grabbing a pending item. Again, let's see the code.
150 seems like a lot -- have you experimented with fewer? They may be stumbling over each other.
Is the Slowlog turned on (with a low value for long_query_time)? If so, I wonder what is the 'worst' query. In situations like this, the answer may be surprising.
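Pulling several of the points above together, a sketch; the ENUM value list and the state codes (0 = pending, 1 = claimed) are assumptions, since the real values aren't shown:
ALTER TABLE queue
    DROP INDEX idx_userid,                            -- redundant; idx_unique already starts with userid
    MODIFY action ENUM('create','update','delete') DEFAULT NULL,  -- hypothetical value list
    MODIFY state TINYINT UNSIGNED NOT NULL DEFAULT 0; -- 1 byte instead of 4 (assumes no NULLs)
-- Grab one pending row with a single autocommitted UPDATE, so no transaction
-- stays open during the network request:
UPDATE queue SET state = 1, id = LAST_INSERT_ID(id) WHERE state = 0 LIMIT 1;
SELECT LAST_INSERT_ID();  -- id of the row just claimed (check the UPDATE's affected-rows count first)
-- ... perform the network request, then ...
DELETE FROM queue WHERE id = 123;  -- the claimed id from above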
I have two databases:
Database A
CREATE TABLE `jobs` (
`job_id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`in_b` tinyint(1) DEFAULT 0,
PRIMARY KEY (`url_id`),
KEY `idx_inb` (`in_b`)
)
Database B
CREATE TABLE `jobs_copy` (
`job_id` int(11) unsigned NOT NULL,
`created` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`url_id`)
)
Performance Issue
I am performing a process where I get a batch of jobs (100 jobs) from Database A, create a copy in Database B, and then mark them as in_b with:
UPDATE jobs SET in_b=1 WHERE job_id IN (1,2,3.....)
This worked fine. The rows were being transferred fairly quickly until I reached job_id values > 2,000,000. The select query to get a batch of jobs was still quick (4ms), but the update statement was much slower.
Is there a reason for this? I searched the MySQL docs and Stack Overflow to see whether converting the "IN" to an "OR" query would improve things, but the general consensus was that an "IN" query will be faster in most cases.
If anyone has any insight as to why this is happening and how I can avoid this slowdown as I reach 10mil + rows, I would be extremely grateful.
Thanks in advance,
Ash
P.S. I am completing these update/select/insert operations through two RESTful services (one attached to each DB), but this has been constant from job_id 1 through 2 million and beyond.
Your UPDATE query is progressively slowing down because it's having to read many rows from your large table to find the rows it needs to process. It's probably doing a so-called full table scan because there is no suitable index.
Pro tip: when a query starts out running fast, but then gets slower and slower over time, it's a sign that optimization (possibly indexing) is required.
To optimize this query:
UPDATE jobs SET in_b=1 WHERE job_id IN (1,2,3.....)
Create an index on the job_id column, as follows.
CREATE INDEX job_id_index ON jobs(job_id)
This should allow your query to quickly locate the records it needs to update with its IN (2,3,6) search filter.
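On MySQL 5.6 and later, you can also run EXPLAIN on the UPDATE itself to confirm that the new index is used:
EXPLAIN UPDATE jobs SET in_b=1 WHERE job_id IN (1,2,3);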
I have a really simple query to get MIN and MAX values; it looks like this:
SELECT MAX(value_avg)
, MIN(value_avg)
FROM value_data
WHERE value_id = 769
AND time_id BETWEEN 214000 AND 219760;
And here is the schema of the value_data table:
CREATE TABLE `value_data` (
`value_id` int(11) NOT NULL,
`time_id` bigint(20) NOT NULL,
`value_min` float DEFAULT NULL,
`value_avg` float DEFAULT NULL,
`value_max` float DEFAULT NULL,
KEY `idx_vdata_vid` (`value_id`),
KEY `idx_vdata_tid` (`time_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
As you can see, the query and the table are simple and I don't see anything wrong here, but when I execute this query, it takes about 9 seconds to return the data. I also profiled this query, and 99% of the time is spent in "Sending data".
The table is really big, about 2 GB, but is that the problem? I don't think this table is too big; it must be something else...
MySQL can easily handle a database of that size. However, you should be able to improve the performance of this query and probably the table in general. By changing the time_id column to an UNSIGNED INT NOT NULL, you can significantly decrease the size of the data and indexes on that column. Also, the query you mention could benefit from a composite index on (value_id, time_id). With that index, it would be able to use the index for both parts of the query instead of just one as it is now.
Also, please edit your question with an EXPLAIN of the query. It should confirm what I expect about the indexes, but it's always helpful information to have.
Edit:
You don't have a PRIMARY index defined for the table, which definitely isn't helping your situation. If the values of (value_id, time_id) are unique, you should probably make the new composite index I mention above the PRIMARY index for the table.
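A sketch of both suggested changes (the ADD PRIMARY KEY is only safe if (value_id, time_id) really is unique):
ALTER TABLE value_data
    MODIFY time_id INT UNSIGNED NOT NULL,
    ADD PRIMARY KEY (value_id, time_id);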
I have these table structures and, while they work, using EXPLAIN on certain SQL queries gives 'Using temporary; Using filesort' on one of the tables. This might hamper performance once the table is populated with thousands of rows. Below are the table structures and an explanation of the system.
CREATE TABLE IF NOT EXISTS `jobapp` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`fullname` varchar(50) NOT NULL,
`icno` varchar(14) NOT NULL,
`status` tinyint(1) NOT NULL DEFAULT '1',
`timestamp` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `icno` (`icno`)
) ENGINE=MyISAM;
CREATE TABLE IF NOT EXISTS `jobapplied` (
`appid` int(11) NOT NULL,
`jid` int(11) NOT NULL,
`jobstatus` tinyint(1) NOT NULL,
`timestamp` int(10) NOT NULL,
KEY `jid` (`jid`),
KEY `appid` (`appid`)
) ENGINE=MyISAM;
The query I tried, which produces the aforementioned note:
EXPLAIN SELECT japp.id, japp.fullname, japp.icno, japp.status, japped.jid, japped.jobstatus
FROM jobapp AS japp
INNER JOIN jobapplied AS japped ON japp.id = japped.appid
WHERE japped.jid = '85'
AND japped.jobstatus = '2'
AND japp.status = '2'
ORDER BY japp.`timestamp` DESC
This system is for recruiting new staff. Once registration opens, hundreds of applicants will register in a short span of time. They are allowed to select 5 different jobs. Later, at the end of the registration session, the admin will go through each job one by one. I have used a single table (jobapplied) to store 2 items (applicant id, job id) to record who applied for what, and this is the table which causes the aforementioned note. I realize this table has no PRIMARY KEY, but I just can't figure out any other way for the admin to later search for exactly who has applied for which job.
Any advice on how can I optimize the table?
Apart from the missing indexes and primary keys others have mentioned . . .
"This might hamper performance once the table is populated with thousands of rows."
You seem to be assuming that the query optimizer will use the same execution plan on a table with thousands of rows as it will on a table with just a few rows. Optimizers don't work like that.
The only reliable way to tell how a particular vendor's optimizer will execute a query on a table with thousands of rows--which is still a small table, and will probably easily fit in memory--is to (1) load a scratch version of the database with thousands of rows, and (2) EXPLAIN the query you're interested in.
FWIW, the last test I ran like this involved close to a billion rows--about 50 million in each of about 20 tables. The execution plan for that query--which included about 20 left outer joins--was a lot different than it was for the sample data (just a few thousand rows).
You are ordering by jobapp.timestamp, but there is no index on timestamp, so the filesort (and probably the temporary table) will be necessary. Try adding an index on timestamp to jobapp, something like KEY timid (timestamp, id).
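A sketch of that:
ALTER TABLE jobapp ADD KEY timid (`timestamp`, id);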