Query taking long time to execute in AWS RDS - mysql

I am working on some temp tables for practice.
One query is taking far too long, around 550 seconds. The DB is hosted on AWS RDS with 8 CPUs and 16 GB of RAM.
The query below has to be run in a different DB (prod); I am first checking it in the test DB (testDB):
create table test_01 as
select *
from
(
select
person
,age
,dob
,place
from
person
where
person is not null
and age is not null
and dob is not null
and place is not null
limit 1000
) ps_u
left join
employee em_u
on ps_u.age = em_u.em_age
and ps_u.place = em_u.location
order by person
limit 1000
Is the problem with the query or with the resources?
CPU utilization is around 30% and RAM usage is not excessive.
Please let me know any suggestions for optimizing the query.

Check your LEFT JOIN; it may be the cause. A LEFT JOIN returns every row from the left table, so if that table has many rows, it will slow down your query.
You can also break the query into two separate queries and compare execution times while tweaking each one.
Try to return specific columns rather than *.

In case you are limiting the result (with limit 1000) - do you really need order by person? If the result is huge - order by could adversely affect the performance.

You can drop one SELECT level. Also note that the LEFT JOIN brings in every record from the left table, which can take time to process.
CREATE TABLE test_01 AS
(SELECT ps_u.person,
ps_u.age,
ps_u.dob,
ps_u.place
FROM person ps_u
LEFT JOIN employee em_u ON ps_u.age = em_u.em_age
AND ps_u.place = em_u.location
WHERE ps_u.person IS NOT NULL
AND ps_u.age IS NOT NULL
AND ps_u.dob IS NOT NULL
AND ps_u.place IS NOT NULL
ORDER BY ps_u.person
LIMIT 1000)

I solved it by creating an index on the columns:
alter table person
add fulltext index `fulltext`
(
person asc
, age asc
, dob asc
, place asc
)
;
After that, the query took only 3 seconds for 1000 records.

Related

Optimizing Select SQL request with millions of entries

I'm working on a table with around 40,000,000 rows, and I'm trying to extract the first entry for each "subscription_id" (a foreign key from another table). Here is my actual request:
SELECT * FROM billing bill WHERE bill.billing_value not like 'not_ok%'
AND
(SELECT bill2.billing_id
FROM billing bill2
WHERE bill2.subscription_id = bill.subscription_id
ORDER BY bill2.billing_id ASC LIMIT 1
)= bill.billing_id;
This request works correctly when I put a small limit on it, but I cannot seem to process it for the whole database.
Is there a way I could optimise it somehow? Or do things another way?
Table structure and indexes: [screenshots not preserved]
This is an example of the ROW_NUMBER() solution mentioned in the comments above.
select *
from (
select *, row_number() over (partition by subscription_id order by billing_id) as rownum
from billing
where billing_value not like 'not_ok%'
) t
where rownum = 1;
The ROW_NUMBER() function is available in MySQL 8.0, so if you haven't upgraded yet, you must do so to use this function.
Unfortunately, this won't be much of an improvement, because the NOT LIKE causes a table-scan regardless of the pattern you search for.
I believe it requires a virtual column with an index to optimize that condition:
alter table billing
add column ok tinyint(1) as (billing_value not like 'not_ok%'),
add index (ok);
select *
from (
select *, row_number() over (partition by subscription_id order by billing_id) as rownum
from billing
where ok = true
) t
where rownum = 1;
Now it will use the index on the ok virtual column to reduce the set of examined rows.
This still might be a costly query on a 40 million row table, because the derived table subquery creates a large temporary table. If it's not fast enough, you'll have to really reconsider how you store and query this data.
For example, you could add a column first_ok with an index, which is true only on the rows you need to fetch (the first row per subscription_id without 'not_ok' as the billing value). But you must maintain this new column manually, and you risk it being wrong if you don't. This is a denormalized design, but tailored to the query you want to run.
I haven't tried it, because I don't have a MySQL DB at hand, but this query seems much simpler:
select *
from billing
where billing_id in (select min(billing_id)
from billing
group by subscription_id)
and billing_value not like 'not_ok%';
The inner select gets the minimum billing_id for each subscription. The outer query fetches the rest of the billing record.
If performance is an issue, I'd add the billing_id field in the third index, so you get an index with (subscription_id,billing_id). This will help for the inner query.
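Before choosing between the answers above, it may be worth checking that they return the same rows for your data. The ROW_NUMBER() query filters out 'not_ok%' rows first and then takes the first remaining row per subscription, while the MIN(billing_id) query (like the original request) takes the first row per subscription and then drops it if it is 'not_ok%'. A minimal sketch of the difference, using Python's built-in sqlite3 module as a stand-in for MySQL 8.0 (an assumption; SQLite 3.25 or newer is needed for window functions, and the four sample rows are invented for illustration):

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE billing (
    billing_id      INTEGER PRIMARY KEY,
    subscription_id INTEGER NOT NULL,
    billing_value   TEXT NOT NULL
);
INSERT INTO billing VALUES
    (1, 10, 'not_ok_card'),  -- subscription 10: first row is not_ok
    (2, 10, 'paid'),
    (3, 20, 'paid'),         -- subscription 20: first row is fine
    (4, 20, 'not_ok_card');
""")

# Filter first, then number the remaining rows per subscription.
first_ok_row = conn.execute("""
    SELECT billing_id FROM (
        SELECT billing_id,
               ROW_NUMBER() OVER (PARTITION BY subscription_id
                                  ORDER BY billing_id) AS rownum
        FROM billing
        WHERE billing_value NOT LIKE 'not_ok%'
    ) AS t
    WHERE rownum = 1
    ORDER BY billing_id
""").fetchall()

# Take the first row per subscription, then filter it.
first_row_if_ok = conn.execute("""
    SELECT billing_id FROM billing
    WHERE billing_id IN (SELECT MIN(billing_id)
                         FROM billing
                         GROUP BY subscription_id)
      AND billing_value NOT LIKE 'not_ok%'
    ORDER BY billing_id
""").fetchall()

print(first_ok_row)     # first non-not_ok row of every subscription
print(first_row_if_ok)  # first row of a subscription, only if it is not not_ok
```

With this sample data, subscription 10 appears in the first result but not in the second, so which query is correct depends on which of the two meanings is wanted.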

MySQL: Grouped/Ordered/Left Join query very slow

I have a problem with a query which takes far too long (over two seconds just for this simple query).
At first glance it appears to be an indexing issue. All joined fields are indexed, but I cannot find what else I may need to index to speed this up. As soon as I add the fields I need to the query, it gets even slower.
SELECT `jobs`.`job_id` AS `job_id` FROM tabledef_Jobs AS jobs
LEFT JOIN tabledef_JobCatLink AS jobcats ON jobs.job_id = jobcats.job_id
LEFT JOIN tabledef_Applications AS apps ON jobs.job_id = apps.job_id
LEFT JOIN tabledef_Companies AS company ON jobs.company_id = company.company_id
GROUP BY `jobs`.`job_id`
ORDER BY `jobs`.`date_posted` ASC
LIMIT 0 , 50
Table row counts (~): tabledef_Jobs (108k), tabledef_JobCatLink (109k), tabledef_Companies (100), tabledef_Applications (50k)
Here you can see the DESCRIBE output. 'Using temporary' appears to be what is slowing down the query: [screenshot not preserved]
Table indexes: [screenshots not preserved]
Any help would be greatly appreciated
EDIT WITH ANSWER
Final improved query, with thanks to @Steve (marked answer). Ultimately, the final query was reduced from ~22s to ~0.3s:
SELECT `jobs`.`job_id` AS `job_id` FROM
(
SELECT * FROM tabledef_Jobs as jobs ORDER BY `jobs`.`date_posted` ASC LIMIT 0 , 50
) AS jobs
LEFT JOIN tabledef_JobCatLink AS jobcats ON jobs.job_id = jobcats.job_id
LEFT JOIN tabledef_Applications AS apps ON jobs.job_id = apps.job_id
LEFT JOIN tabledef_Companies AS company ON jobs.company_id = company.company_id
GROUP BY `jobs`.`job_id`
ORDER BY `jobs`.`date_posted` ASC
LIMIT 0 , 50
Right, I’ll have a stab at this.
It would appear that the Query Optimiser cannot use an index to fulfil the query upon the tabledef_Jobs table.
You have a LIMIT with an offset, and in combination with your ORDER BY this means the amount of data cannot be limited before joining. The server therefore has to group by job_id (which is a PK and fast), then order that data (temporary table and a filesort) before applying the limit and throwing away the vast majority of this data, and only then join everything else to it.
I would suggest adding a composite index to jobs on "job_id, date_posted".
So firstly optimise the base query:
SELECT * FROM tabledef_Jobs
GROUP BY job_id
ORDER BY date_posted
LIMIT 0,50
Then you can combine the joins and the final structure together to make a more efficient query.
I cannot let it go by without suggesting you rethink your limit offset. This is fine for small initial offsets, but when the offset gets large it can be a major cause of performance issues. Say, for example, you're using this for pagination: what happens if they want the rows starting at offset 3,000? You will use
LIMIT 3000, 50
This will then collect 3,050 rows, manipulate the data, and then throw away the first 3,000.
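One commonly used alternative to large offsets is keyset (seek) pagination, where the next page starts after the last row already seen instead of skipping rows. The answer itself does not propose this, so treat the following as a sketch of a swapped-in technique. It uses Python's built-in sqlite3 as a stand-in for MySQL, and the jobs table, the idx_jobs_posted index, and the generated rows are all hypothetical:

```python
import sqlite3

PAGE = 50  # rows per page

# In-memory SQLite database standing in for MySQL (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (job_id INTEGER PRIMARY KEY, date_posted TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?)",
    [(i, f"2020-01-{1 + i % 28:02d}") for i in range(1, 201)],
)
# Index matching the sort order, so a page can be located by seeking.
conn.execute("CREATE INDEX idx_jobs_posted ON jobs (date_posted, job_id)")


def page_by_offset(offset):
    """Classic LIMIT/OFFSET page: the skipped rows are read and discarded."""
    return conn.execute(
        "SELECT job_id, date_posted FROM jobs "
        "ORDER BY date_posted, job_id LIMIT ? OFFSET ?",
        (PAGE, offset),
    ).fetchall()


def page_after(last_date, last_id):
    """Keyset page: start right after the last (date_posted, job_id) seen."""
    return conn.execute(
        "SELECT job_id, date_posted FROM jobs "
        "WHERE date_posted > ? OR (date_posted = ? AND job_id > ?) "
        "ORDER BY date_posted, job_id LIMIT ?",
        (last_date, last_date, last_id, PAGE),
    ).fetchall()


first_page = page_by_offset(0)
last_id, last_date = first_page[-1]
second_page_keyset = page_after(last_date, last_id)
second_page_offset = page_by_offset(PAGE)
print(second_page_keyset == second_page_offset)
```

The job_id tiebreaker in the ORDER BY matters here: without a unique sort key, rows sharing a date_posted could be skipped or repeated between pages.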
[edit 1 - In response to comments below]
I will expand with some more information that might point you in the right direction. Unfortunately, there isn’t a simple fix that will resolve it; you must understand why this is happening to be able to address it. Simply removing the LIMIT or ORDER BY may not work, and after all you don’t want to remove them: they are part of your query, which means they are there for a purpose.
Optimise the simple base query first; that is usually a lot easier than working with multi-joined datasets.
Despite all the bashing it receives there is nothing wrong with filesort. Sometimes this is the only way to execute the query. Agreed it can be the cause of many performance issues (especially on larger data sets) but that’s not usually the fault of filesort but the underlying query / indexing strategy.
Within MySQL you cannot mix indexes or mix orders of the same index – performing such a task will result in a filesort.
How about, as I suggested, creating an index on date_posted and then using:
SELECT jobs.job_id, jobs.date_posted, jobcats.*, apps.*, company.* FROM
(
-- company_id and date_posted are selected here because the outer query references them
SELECT job_id, company_id, date_posted FROM tabledef_Jobs
ORDER BY date_posted
LIMIT 0,50
) AS jobs
LEFT JOIN tabledef_JobCatLink AS jobcats ON jobs.job_id = jobcats.job_id
LEFT JOIN tabledef_Applications AS apps ON jobs.job_id = apps.job_id
LEFT JOIN tabledef_Companies AS company ON jobs.company_id = company.company_id
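The idea behind this thread's fix (limit the base table first, then join) can be sanity-checked on a small data set. The sketch below uses Python's built-in sqlite3 as a stand-in for MySQL. The simplified jobs and applications tables and their generated rows are hypothetical, and it only checks that both query shapes return the same rows, not how fast they run:

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE jobs (job_id INTEGER PRIMARY KEY, date_posted TEXT NOT NULL);
CREATE TABLE applications (app_id INTEGER PRIMARY KEY, job_id INTEGER NOT NULL);
""")
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?)",
    [(i, f"2021-{1 + i % 12:02d}-{1 + i % 28:02d}") for i in range(1, 301)],
)
conn.executemany(
    "INSERT INTO applications (job_id) VALUES (?)",
    [(1 + (i * 7) % 300,) for i in range(900)],
)

# Join every job to its applications, group, sort, and only then cut to 50 rows.
join_then_limit = conn.execute("""
    SELECT jobs.job_id, COUNT(apps.app_id)
    FROM jobs
    LEFT JOIN applications AS apps ON apps.job_id = jobs.job_id
    GROUP BY jobs.job_id
    ORDER BY jobs.date_posted, jobs.job_id
    LIMIT 50
""").fetchall()

# Pick the 50 jobs first in a derived table, then join only those 50.
limit_then_join = conn.execute("""
    SELECT jobs.job_id, COUNT(apps.app_id)
    FROM (
        SELECT job_id, date_posted FROM jobs
        ORDER BY date_posted, job_id
        LIMIT 50
    ) AS jobs
    LEFT JOIN applications AS apps ON apps.job_id = jobs.job_id
    GROUP BY jobs.job_id
    ORDER BY jobs.date_posted, jobs.job_id
""").fetchall()

print(join_then_limit == limit_then_join)
```

The rewrite is only equivalent when the sort and the limit depend solely on columns of the base table, as they do here; a filter or sort on a joined table would change which 50 rows qualify.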

MySql select statement get most recent row for each client_id

Hi, I have an issue with a MySQL select statement that I can't get my head around.
Table client_directory_data
id int,
verified int,
client_id int,
created timestamp,
description longtext
select * from client_directory_data where verified = 1 order by created desc
but this selects multiple rows for each client_id.
What I need to do is select every client_id that has verified = 1, but only get the most recent row for each client_id. I hope that makes sense.
This is an issue I face all the time. Fortunately there's a nice little trick for doing this:
SELECT
client_id,
SUBSTRING_INDEX(GROUP_CONCAT(id ORDER BY created DESC),",",1) AS `id`
FROM client_directory_data
WHERE verified = 1
GROUP BY client_id
And if you want the whole row you can just join onto it like so:
SELECT
*
FROM (
SELECT
client_id,
SUBSTRING_INDEX(GROUP_CONCAT(id ORDER BY created DESC),",",1) AS `id`
FROM client_directory_data
WHERE verified = 1
GROUP BY client_id
) ids
JOIN client_directory_data USING (id);
Of course, if you're ordering by an indexed field anyway (that you could therefore join on efficiently anyway), it's better to use MAX(id) AS id, although it actually has very little impact on performance. The main reason to use MAX() is really to make the code a little simpler. It also avoids the pitfalls you may encounter if the field contains commas (which you can get around with a different separator for the GROUP_CONCAT) or hitting the max GROUP_CONCAT length (which can be extended with SET group_concat_max_len = xxx; and only causes warnings anyway).
I can see why this would intuitively seem like it would have performance issues; however, it's actually the best-performing method I've found for these queries, especially on large tables.
Here are some benchmarks I've taken from some of the larger tables currently available to me comparing the three methods in this thread.
Query A: (~5,000 records, ~900 results, non-indexed field)
GROUP_CONCAT method: 0.0100 seconds
MAX method: 0.102 seconds
LEFT JOIN method: 0.0082 seconds
Query B : (~300,000 records, ~95,000 results)
GROUP_CONCAT method: 1.8618 seconds
MAX method: 1.7904 seconds
LEFT JOIN method: 6.4649 seconds
Query C : (~300,000 records, ~7 results)
GROUP_CONCAT method: 0.103 seconds
MAX method: 0.0102 seconds
LEFT JOIN method: (I got bored after 4 hours)
Query D : (~500,000 records, ~5,000 different values of the field being grouped)
GROUP_CONCAT method: 0.1355 seconds
MAX method: 0.0429 seconds
LEFT JOIN method: (I got bored after 10 minutes)
That makes sense and is a classic question.
Assuming that the most recent row is the one with highest id, you can use:
SELECT *
FROM client_directory_data c
LEFT JOIN client_directory_data d ON c.client_id = d.client_id AND d.verified = 1 AND d.id > c.id
WHERE d.id IS NULL
AND c.verified = 1;
You can have an explanation of this query pattern here.
Make id the primary key of the table client_directory_data.
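The MAX(id) variant and the LEFT JOIN pattern from the answers above can be compared on a toy table. This is a sketch that uses Python's built-in sqlite3 as a stand-in for MySQL, with four invented rows. Like the LEFT JOIN answer, it assumes the most recent row per client is the one with the highest id:

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE client_directory_data (
    id          INTEGER PRIMARY KEY,
    verified    INTEGER,
    client_id   INTEGER,
    created     TEXT,
    description TEXT
);
INSERT INTO client_directory_data VALUES
    (1, 1, 100, '2020-01-01', 'old verified'),
    (2, 1, 100, '2020-02-01', 'new verified'),
    (3, 0, 100, '2020-03-01', 'newest but unverified'),
    (4, 1, 200, '2020-01-15', 'only row');
""")

# MAX method: find the highest verified id per client, then join back for the row.
max_method = conn.execute("""
    SELECT c.id, c.client_id
    FROM client_directory_data AS c
    JOIN (
        SELECT MAX(id) AS id
        FROM client_directory_data
        WHERE verified = 1
        GROUP BY client_id
    ) AS latest ON latest.id = c.id
    ORDER BY c.id
""").fetchall()

# LEFT JOIN method: keep verified rows for which no newer verified row exists.
anti_join = conn.execute("""
    SELECT c.id, c.client_id
    FROM client_directory_data AS c
    LEFT JOIN client_directory_data AS d
           ON c.client_id = d.client_id AND d.verified = 1 AND d.id > c.id
    WHERE d.id IS NULL
      AND c.verified = 1
    ORDER BY c.id
""").fetchall()

print(max_method)
print(anti_join)
```

Both queries skip the unverified row 3 and return row 2 for client 100 and row 4 for client 200. The benchmark figures quoted earlier come from the answer's author and are not reproduced by this sketch.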

ORDER BY NULL slower than ORDER BY column

According to the MySQL documentation, adding ORDER BY NULL after GROUP BY should "avoid the overhead of sorting the result". If so, why is the following query approx. 5 times slower
SELECT COUNT(*) advert_count, category.name
FROM advert
LEFT JOIN category ON advert.category_id = category.category_id
WHERE (
advert.state_id = 2
)
GROUP BY advert.category_id
ORDER BY NULL
LIMIT 5
than query with ORDER BY advert_count?
SELECT COUNT(*) advert_count, category.name
FROM advert
LEFT JOIN category ON advert.category_id = category.category_id
WHERE (
advert.state_id = 2
)
GROUP BY advert.category_id
ORDER BY advert_count DESC
LIMIT 5
From phpMyAdmin profiling:
1st query:
Sorting for group -
Sorting result 0.000002
Sending data 12.069774
2nd query:
Sorting for group 2.436986
Sorting result 0.000028
Sending data 0.000021
I am confused by this. Could anyone explain to me what is going on here?
ORDER BY NULL is not ordering by anything at all. Every record gets the same position.
So the first query is selecting the first 5 groups found in the data. But the second query is calculating the results for all groups in the data, and finding the top 5 based on the count.
This is borne out by the first two overheads you have shown.
The 3rd overhead is the difference - Sending Data. For a non-database related reason, when you ran it it took a long time to transfer the data. This is likely due to server or network load at that moment in time.

SQL query: Delete all records from the table except latest N?

Is it possible to build a single mysql query (without variables) to remove all records from the table, except latest N (sorted by id desc)?
Something like this, only it doesn't work :)
delete from table order by id ASC limit ((select count(*) from table ) - N)
Thanks.
You cannot delete the records that way, the main issue being that you cannot use a subquery to specify the value of a LIMIT clause.
This works (tested in MySQL 5.0.67):
DELETE FROM `table`
WHERE id NOT IN (
SELECT id
FROM (
SELECT id
FROM `table`
ORDER BY id DESC
LIMIT 42 -- keep this many records
) foo
);
The intermediate subquery is required. Without it we'd run into two errors:
SQL Error (1093): You can't specify target table 'table' for update in FROM clause - MySQL doesn't allow you to refer to the table you are deleting from within a direct subquery.
SQL Error (1235): This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery' - You can't use the LIMIT clause within a direct subquery of a NOT IN operator.
Fortunately, using an intermediate subquery allows us to bypass both of these limitations.
Nicole has pointed out this query can be optimised significantly for certain use cases (such as this one). I recommend reading that answer as well to see if it fits yours.
I know I'm resurrecting quite an old question, but I recently ran into this issue, but needed something that scales to large numbers well. There wasn't any existing performance data, and since this question has had quite a bit of attention, I thought I'd post what I found.
The solutions that actually worked were Alex Barrett's double sub-query/NOT IN method (similar to Bill Karwin's), and Quassnoi's LEFT JOIN method.
Unfortunately both of the above methods create very large intermediate temporary tables and performance degrades quickly as the number of records not being deleted gets large.
What I settled on utilizes Alex Barrett's double sub-query (thanks!) but uses <= instead of NOT IN:
DELETE FROM `test_sandbox`
WHERE id <= (
SELECT id
FROM (
SELECT id
FROM `test_sandbox`
ORDER BY id DESC
LIMIT 1 OFFSET 42 -- keep this many records
) foo
);
It uses OFFSET to get the id of the Nth record and deletes that record and all previous records.
Since ordering is already an assumption of this problem (ORDER BY id DESC), <= is a perfect fit.
It is much faster, since the temporary table generated by the subquery contains just one record instead of N records.
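The behaviour of the <= / OFFSET delete can be checked on a small table. The sketch below runs the same statement through Python's built-in sqlite3, used here as a stand-in for MySQL (an assumption; SQLite does not have MySQL's error 1093 restriction, so this only verifies which rows survive). The helper name delete_all_but_latest is hypothetical:

```python
import sqlite3

KEEP = 42  # number of most recent rows to keep


def delete_all_but_latest(conn, keep):
    """Delete every row whose id is at or below the (keep+1)-th highest id."""
    conn.execute(
        """
        DELETE FROM test_sandbox
        WHERE id <= (
            SELECT id FROM (
                SELECT id FROM test_sandbox
                ORDER BY id DESC
                LIMIT 1 OFFSET ?
            ) AS foo
        )
        """,
        (keep,),
    )


# 100 rows with ids 1..100.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_sandbox (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO test_sandbox VALUES (?)", [(i,) for i in range(1, 101)])
delete_all_but_latest(conn, KEEP)
remaining = [row[0] for row in conn.execute("SELECT id FROM test_sandbox ORDER BY id")]

# Table with fewer rows than KEEP: the subquery yields NULL, so nothing is deleted.
small = sqlite3.connect(":memory:")
small.execute("CREATE TABLE test_sandbox (id INTEGER PRIMARY KEY)")
small.executemany("INSERT INTO test_sandbox VALUES (?)", [(i,) for i in range(1, 11)])
delete_all_but_latest(small, KEEP)
small_remaining = [row[0] for row in small.execute("SELECT id FROM test_sandbox ORDER BY id")]

print(len(remaining), remaining[0], remaining[-1])
print(len(small_remaining))
```

When the table holds no more than KEEP rows, the OFFSET subquery returns no row, the comparison is against NULL, and the DELETE removes nothing.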
Test case
I tested the three working methods and the new method above in two test cases.
Both test cases use 10000 existing rows, while the first test keeps 9000 (deletes the oldest 1000) and the second test keeps 50 (deletes the oldest 9950).
+-----------+------------------------+----------------------+
| | 10000 TOTAL, KEEP 9000 | 10000 TOTAL, KEEP 50 |
+-----------+------------------------+----------------------+
| NOT IN | 3.2542 seconds | 0.1629 seconds |
| NOT IN v2 | 4.5863 seconds | 0.1650 seconds |
| <=,OFFSET | 0.0204 seconds | 0.1076 seconds |
+-----------+------------------------+----------------------+
What's interesting is that the <= method sees better performance across the board, but actually gets better the more you keep, instead of worse.
Unfortunately for all the answers given by other folks, you can't DELETE and SELECT from a given table in the same query.
DELETE FROM mytable WHERE id NOT IN (SELECT MAX(id) FROM mytable);
ERROR 1093 (HY000): You can't specify target table 'mytable' for update
in FROM clause
Nor can MySQL support LIMIT in a subquery. These are limitations of MySQL.
DELETE FROM mytable WHERE id NOT IN
(SELECT id FROM mytable ORDER BY id DESC LIMIT 1);
ERROR 1235 (42000): This version of MySQL doesn't yet support
'LIMIT & IN/ALL/ANY/SOME subquery'
The best answer I can come up with is to do this in two stages:
SELECT id FROM mytable ORDER BY id DESC LIMIT n;
Collect the id's and make them into a comma-separated string:
DELETE FROM mytable WHERE id NOT IN ( ...comma-separated string... );
(Normally interpolating a comma-separated list into an SQL statement introduces some risk of SQL injection, but in this case the values are not coming from an untrusted source; they are known to be integer values from the database itself.)
note: Though this doesn't get the job done in a single query, sometimes a more simple, get-it-done solution is the most effective.
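The two-stage approach above might look like the following in application code. This is a sketch using Python's built-in sqlite3 as a stand-in for a MySQL client library (an assumption). It binds the ids through placeholders rather than building a comma-separated string, which is a small deviation from the answer, and the function name keep_latest_two_stage is hypothetical:

```python
import sqlite3


def keep_latest_two_stage(conn, n):
    """Keep the n highest ids in mytable (n >= 1 assumed); return rows deleted."""
    # Stage 1: collect the ids to keep.
    keep_ids = [
        row[0]
        for row in conn.execute(
            "SELECT id FROM mytable ORDER BY id DESC LIMIT ?", (n,)
        )
    ]
    if not keep_ids:
        # Empty table: nothing to delete, and NOT IN () is not valid in MySQL.
        return 0
    # Stage 2: delete everything else, one bound placeholder per kept id.
    placeholders = ",".join("?" for _ in keep_ids)
    cur = conn.execute(
        f"DELETE FROM mytable WHERE id NOT IN ({placeholders})", keep_ids
    )
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO mytable VALUES (?)", [(i,) for i in range(1, 11)])
deleted = keep_latest_two_stage(conn, 3)
remaining = [row[0] for row in conn.execute("SELECT id FROM mytable ORDER BY id")]
print(deleted, remaining)
```

Because the two statements are separate, rows inserted between them would be deleted as well; running both inside one transaction, or locking the table, is one way to guard against that.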
DELETE i1.*
FROM items i1
LEFT JOIN
(
SELECT id
FROM items ii
ORDER BY
id DESC
LIMIT 20
) i2
ON i1.id = i2.id
WHERE i2.id IS NULL
If your id is incremental then use something like
delete from table where id < (select max(id) from table)-N
To delete all the records except the last N, you may use the query reported below.
It's a single script but with several statements, so it's not actually a single query in the way the original question intended.
Also, you need a variable and a prepared statement built into the script, due to a bug in MySQL.
Hope it may be useful anyway...
nnn is the number of rows to keep and theTable is the table you're working on.
I'm assuming you have an auto-incrementing column named id.
SELECT @ROWS_TO_DELETE := COUNT(*) - nnn FROM `theTable`;
SELECT @ROWS_TO_DELETE := IF(@ROWS_TO_DELETE<0,0,@ROWS_TO_DELETE);
PREPARE STMT FROM "DELETE FROM `theTable` ORDER BY `id` ASC LIMIT ?";
EXECUTE STMT USING @ROWS_TO_DELETE;
The good thing about this approach is performance: I've tested the query on a local DB with about 13,000 records, keeping the last 1,000. It runs in 0.08 seconds.
The script from the accepted answer...
DELETE FROM `table`
WHERE id NOT IN (
SELECT id
FROM (
SELECT id
FROM `table`
ORDER BY id DESC
LIMIT 42 -- keep this many records
) foo
);
Takes 0.55 seconds. About 7 times more.
Test environment: mySQL 5.5.25 on a late 2011 i7 MacBookPro with SSD
DELETE FROM table WHERE ID NOT IN
(SELECT MAX(ID) ID FROM table)
try below query:
DELETE FROM tablename WHERE id < (SELECT * FROM (SELECT (MAX(id)-10) FROM tablename ) AS a)
the inner sub query will return the top 10 value and the outer query will delete all the records except the top 10.
What about :
SELECT * FROM table del
LEFT JOIN table keep
ON del.id < keep.id
GROUP BY del.* HAVING count(*) > N;
It returns rows with more than N rows before.
Could be useful ?
Using id for this task is not an option in many cases. For example - table with twitter statuses. Here is a variant with specified timestamp field.
delete from table
where access_time >=
(
select access_time from
(
select access_time from table
order by access_time limit 150000,1
) foo
)
Just wanted to throw this into the mix for anyone using Microsoft SQL Server instead of MySQL. The keyword 'Limit' isn't supported by MSSQL, so you'll need to use an alternative. This code worked in SQL 2008, and is based on this SO post. https://stackoverflow.com/a/1104447/993856
-- Keep the last 10 most recent passwords for this user.
DECLARE @UserID int; SET @UserID = 1004
DECLARE @ThresholdID int -- Position of 10th password.
SELECT @ThresholdID = UserPasswordHistoryID FROM
(
SELECT ROW_NUMBER()
OVER (ORDER BY UserPasswordHistoryID DESC) AS RowNum, UserPasswordHistoryID
FROM UserPasswordHistory
WHERE UserID = @UserID
) sub
WHERE (RowNum = 10) -- Keep this many records.
DELETE UserPasswordHistory
WHERE (UserID = @UserID)
AND (UserPasswordHistoryID < @ThresholdID)
Admittedly, this is not elegant. If you're able to optimize this for Microsoft SQL, please share your solution. Thanks!
If you need to delete the records based on some other column as well, then here is a solution:
DELETE
FROM articles
WHERE id IN
(SELECT id
FROM
(SELECT id
FROM articles
WHERE user_id = :userId
ORDER BY created_at DESC LIMIT 500, 10000000) abc)
AND user_id = :userId
This should work as well:
DELETE FROM [table]
INNER JOIN (
SELECT [id]
FROM (
SELECT [id]
FROM [table]
ORDER BY [id] DESC
LIMIT N
) AS Temp
) AS Temp2 ON [table].[id] = [Temp2].[id]
DELETE FROM table WHERE id NOT IN (
SELECT id FROM table ORDER BY id, desc LIMIT 0, 10
)
Stumbled across this and thought I'd update.
This is a modification of something that was posted before. I would have commented, but unfortunately don't have 50 reputation...
LOCK Tables TestTable WRITE, TestTable as TestTableRead READ;
DELETE FROM TestTable
WHERE ID <= (
SELECT ID
FROM TestTable as TestTableRead -- (the 'as' declaration is required for some reason)
ORDER BY ID DESC LIMIT 1 OFFSET 42 -- keep this many records
);
UNLOCK TABLES;
The use of 'Where' and 'Offset' circumvents the sub-query.
You also cannot read and write from the same table in the same query, as you may modify entries as they're being used. The Locks allow to circumvent this. This is also safe for parallel access to the database by other processes.
For performance and further explanation see the linked answer.
Tested with mysql Ver 15.1 Distrib 10.5.18-MariaDB
For further details on locks, see here
Why not
DELETE FROM table ORDER BY id DESC LIMIT 1, 123456789
Just delete all but the first row (order is DESC!), using a very, very large number as the second LIMIT argument. See here
Answering this after a long time...Came across the same situation and instead of using the answers mentioned, I came with below -
DELETE FROM table_name order by ID limit 10
This will delete the 1st 10 records and keep the latest records.