Hi, I have an issue with a MySQL SELECT statement that I can't get my head around.
Table client_directory_data
id int,
verified int,
client_id int,
created timestamp,
description longtext
select * from client_directory_data where verified = 1 order by created desc
but this selects multiple rows for each client_id.
What I need is to select every client_id that has a row with verified = 1, but get only the most recent such row for each client_id. I hope that makes sense.
This is an issue I face all the time. Fortunately there's a nice little trick for doing this:
SELECT
client_id,
SUBSTRING_INDEX(GROUP_CONCAT(id ORDER BY created DESC),",",1) AS `id`
FROM client_directory_data
WHERE verified = 1
GROUP BY client_id
And if you want the whole row you can just join onto it like so:
SELECT
*
FROM (
SELECT
client_id,
SUBSTRING_INDEX(GROUP_CONCAT(id ORDER BY created DESC),",",1) AS `id`
FROM client_directory_data
WHERE verified = 1
GROUP BY client_id
) ids
JOIN client_directory_data USING (id);
Of course, if you're ordering by an indexed field anyway (one you could therefore join on efficiently), it's better to use MAX(id) AS id, although it actually has very little impact on performance. The main reason to use MAX() is really to make the code a little simpler. It also avoids the pitfalls you may encounter if the field contains commas (which you can get around with a different separator for the GROUP_CONCAT) or hitting the maximum GROUP_CONCAT length (which can be extended with SET group_concat_max_len = xxx; and only causes warnings anyway).
I can see why this would intuitively seem like it would have performance issues; however, it's actually the best-performing method I've found for these queries, especially on large tables.
Here are some benchmarks I've taken from some of the larger tables currently available to me comparing the three methods in this thread.
Query A: (~5,000 records, ~900 results, non-indexed field)
GROUP_CONCAT method: 0.0100 seconds
MAX method: 0.102 seconds
LEFT JOIN method: 0.0082 seconds
Query B : (~300,000 records, ~95,000 results)
GROUP_CONCAT method: 1.8618 seconds
MAX method: 1.7904 seconds
LEFT JOIN method: 6.4649 seconds
Query C : (~300,000 records, ~7 results)
GROUP_CONCAT method: 0.103 seconds
MAX method: 0.0102 seconds
LEFT JOIN method: (I got bored after 4 hours)
Query D : (~500,000 records, ~5,000 different values of the field being grouped)
GROUP_CONCAT method: 0.1355 seconds
MAX method: 0.0429 seconds
LEFT JOIN method: (I got bored after 10 minutes)
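For comparison, the MAX(id) variant mentioned above can be sketched as follows. This is only an illustration: it runs the SQL through Python's sqlite3 module against an in-memory SQLite database rather than MySQL, the sample rows are invented, and it assumes that a higher id always means a more recent row.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (an assumption made so the sketch is runnable).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE client_directory_data (
    id INTEGER PRIMARY KEY,
    verified INTEGER,
    client_id INTEGER,
    created TEXT,
    description TEXT
);
-- Invented sample rows; id grows together with created.
INSERT INTO client_directory_data VALUES
    (1, 1, 10, '2020-01-01', 'old verified'),
    (2, 1, 10, '2020-02-01', 'new verified'),
    (3, 0, 10, '2020-03-01', 'newest but unverified'),
    (4, 1, 20, '2020-01-15', 'only row'),
    (5, 0, 30, '2020-01-20', 'never verified');
""")

# Inner query: highest verified id per client. Outer join: fetch the whole row for that id.
latest_rows = conn.execute("""
    SELECT d.id, d.client_id, d.description
    FROM (
        SELECT client_id, MAX(id) AS id
        FROM client_directory_data
        WHERE verified = 1
        GROUP BY client_id
    ) ids
    JOIN client_directory_data d ON d.id = ids.id
    ORDER BY d.client_id
""").fetchall()

print(latest_rows)
```

Client 30 has no verified row, so it does not appear at all, and client 10 gets row 2 rather than the newer but unverified row 3.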
That makes sense and is a classic question.
Assuming that the most recent row is the one with highest id, you can use:
SELECT *
FROM client_directory_data c
LEFT JOIN client_directory_data d ON c.client_id = d.client_id AND d.verified = 1 AND d.id > c.id
WHERE d.id IS NULL
AND c.verified = 1;
You can have an explanation of this query pattern here.
Make id the primary key of the table client_directory_data.
I'm building a sample "recent" screen that displays a list of rows from a table whose id column is the primary key.
My query returns the expected result, but on a table with a large amount of data it is slow.
This is the sample query:
SELECT distinct H.id, -- (Primary Key)
H.partnerid as PartnerId,
H.partnername AS partner, H.accountname AS accountName,
H.accountid as AccountNo
FROM myschema.mytransactionstable H
INNER JOIN (
SELECT S.accountid, S.partnerid, S.accountname,
max(S.transdate) AS maxDate
from myschema.mytransactionstable S
group by S.accountid, S.partnerid, S.accountname
) ms ON H.accountid = ms.accountid
AND H.partnerid = ms.partnerid
AND H.accountname =ms.accountname
AND H.transdate = maxDate
WHERE H.accountid = ms.accountid
AND H.partnerid = ms.partnerid
AND H.accountname = ms.accountname
AND H.transdate = maxDate
GROUP BY H.partnerid,H.accountid, H.accountname
ORDER BY H.id DESC
LIMIT 5
In my case, some rows have the same values in the selected columns and differ only in their ids.
Below is a link to an image of the data before the query above is applied, showing all the records that have not yet been filtered.
Sample result query: click here
I only want the 5 most recent rows by id, but the other columns (accountname, accountid, partnerid) can contain repeated values.
The query already returns the correct result, but I want to improve its performance. Any suggestions?
You can try using row_number()
select * from
(
select *,row_number() over(order by transdate desc) as rn
from myschema.mytransactionstable
)A where rn<=5
Don't repeat ON and WHERE clauses. Use ON to say how the tables (or subqueries) are "related"; use WHERE for filtering (that is, which rows to keep). Probably in your case, all the WHERE should be removed.
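A small sketch of why the repeated WHERE is redundant for an INNER JOIN. It uses Python's sqlite3 module with an in-memory SQLite database instead of MySQL, and the table name tx and its rows are made up for illustration; only the shape of the join follows the question.

```python
import sqlite3

# In-memory SQLite database; table name and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tx (id INTEGER PRIMARY KEY, accountid INTEGER, partnerid INTEGER, transdate TEXT);
INSERT INTO tx VALUES
    (1, 100, 1, '2021-01-01'),
    (2, 100, 1, '2021-02-01'),
    (3, 200, 1, '2021-01-10'),
    (4, 200, 2, '2021-01-05');
""")

# The join conditions live in ON: they say how the table relates to the subquery.
JOIN_PART = """
    SELECT h.id
    FROM tx h
    INNER JOIN (
        SELECT accountid, partnerid, MAX(transdate) AS maxdate
        FROM tx
        GROUP BY accountid, partnerid
    ) ms ON h.accountid = ms.accountid
        AND h.partnerid = ms.partnerid
        AND h.transdate = ms.maxdate
"""

# The same conditions repeated in WHERE, as in the question.
REPEATED_WHERE = """
    WHERE h.accountid = ms.accountid
      AND h.partnerid = ms.partnerid
      AND h.transdate = ms.maxdate
"""

ORDER = " ORDER BY h.id DESC"

on_only = conn.execute(JOIN_PART + ORDER).fetchall()
on_and_where = conn.execute(JOIN_PART + REPEATED_WHERE + ORDER).fetchall()

print(on_only)
print(on_only == on_and_where)
```

Both variants return the latest row per (accountid, partnerid) pair, so dropping the WHERE changes nothing except the amount of text the reader has to parse.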
Please provide SHOW CREATE TABLE
This 'composite' index would probably help because of dealing with the subquery and the JOIN:
INDEX(partnerid, accountid, accountname, transdate)
That would also avoid a separate sort for the GROUP BY.
But then the ORDER BY is different, so it cannot avoid a sort.
This might avoid the sort without changing the result set ordering: ORDER BY partnerid, accountid, accountname, transdate DESC
Please provide EXPLAIN SELECT ... and EXPLAIN FORMAT=JSON SELECT ... if you have further questions.
If we cannot get an index to handle the WHERE, GROUP BY, and ORDER BY, the query will generate all the rows before seeing the LIMIT 5. If the index does work, then the outer query will stop after 5 -- potentially a big savings.
I am working on some temp tables for practice.
One query is taking far too long, around 550 seconds. The DB is hosted on AWS RDS with 8 CPUs and 16 GB of RAM.
The query below will eventually have to run in a different DB (prod); I am checking it first in the test DB.
create table test_01 as
select *
from
(
select
person
,age
,dob
,place
from
person
where
person is not null
and age is not null
and dob is not null
and place is not null
limit 1000
) ps_u
left join
employee em_u
on ps_u.age = em_u.em_age
and ps_u.place = em_u.location
order by person
limit 1000
Is the issue with the query or with the resources?
CPU utilization shows 30%, and RAM usage is fine.
Please let me know any suggestions for optimizing the query.
Check your LEFT JOIN; it may be the reason. A LEFT JOIN returns everything from the left table, so if that table has a lot of entries it will slow down your query.
You can also break the query into two separate queries and compare execution times while tweaking each one.
Try to return specific columns rather than *.
Since you are limiting the result (with LIMIT 1000), do you really need ORDER BY person? If the result is huge, ORDER BY can adversely affect performance.
You can drop one SELECT level. Also note that the LEFT JOIN brings in all records from the left table, which can take time to process.
CREATE TABLE test_01 AS
(SELECT ps_u.person,
ps_u.age,
ps_u.dob,
ps_u.place
FROM person ps_u
LEFT JOIN employee em_u ON ps_u.age = em_u.em_age
AND ps_u.place = em_u.location
WHERE ps_u.person IS NOT NULL
AND ps_u.age IS NOT NULL
AND ps_u.dob IS NOT NULL
AND ps_u.place IS NOT NULL
ORDER BY ps_u.person
LIMIT 1000)
I solved it by creating an index on the columns:
alter table person
add fulltext index `fulltext`
(
person asc
, age asc
, dob asc
, place asc
)
;
After that, the query took only 3 seconds for 1000 records.
I have the following query which is taking about 20 seconds on records of 60,000 in the sale table. I understand that the ORDER BY and LIMIT are causing the issue, as when ORDER BY is removed it is returned in 0.10 seconds.
I am unsure how to optimise this query, any ideas?
The explain output is here https://gist.github.com/anonymous/1b92fa64261559de32da
SELECT sale.saleID as id,
node.title as location,
sale.saleTotal as total,
sale.saleStatus as status,
payment.paymentMethod,
field_data_field_band_name.field_band_name_value as band,
invoice.invoiceID,
field_data_field_first_name.field_first_name_value as firstName,
field_data_field_last_name.field_last_name_value as lastName,
sale.created as date
FROM sale
LEFT JOIN payment
ON payment.saleID = sale.saleID
LEFT JOIN field_data_field_location
ON field_data_field_location.entity_id = sale.registerSessionID
LEFT JOIN node
ON node.nid = field_data_field_location.field_location_target_id
LEFT JOIN invoice
ON invoice.saleID = sale.saleID
LEFT JOIN profile
ON profile.uid = sale.clientID
LEFT JOIN field_data_field_band_name
ON field_data_field_band_name.entity_id = profile.pid
LEFT JOIN field_data_field_first_name
ON field_data_field_first_name.entity_id = profile.pid
LEFT JOIN field_data_field_last_name
ON field_data_field_last_name.entity_id = profile.pid
ORDER BY sale.created DESC
LIMIT 0,50
Possibly, you cannot do anything. For instance, when you are measuring performance, are you looking at the time to return the first record or the entire results set? Without the order by, the query can return the first row quite quickly, but you still might need to wait a bit to get all the rows.
Assuming the comparison is valid, the following index might help: sale(created, saleId, clientId, SaleTotal, SaleStatus). This is a covering index for the query, carefully constructed so it can be read in the right order. If this avoids the final sort, then it should speed up the query, even for fetching the first row.
Minimal:
ALTER TABLE `sale` ADD INDEX (`saleID` , `created`);
ALTER TABLE `invoice` ADD INDEX (`saleID`);
Indexes first. Another technique for LIMIT, used for paging for instance, is to seek by value: the last row of one page provides the start value for the search for the next page.
As you use ORDER BY sale.created DESC, you could guess a sufficiently large period:
WHERE sale.created > CURRENT_DATE() - INTERVAL 2 MONTH
An index on created is a must.
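The paging idea (the last row of one page gives the start value for the next) can be sketched like this. It is a hedged illustration only: SQLite via Python's sqlite3 module instead of MySQL, a cut-down sale table with integer timestamps, and a hypothetical helper named fetch_page.

```python
import sqlite3

# In-memory SQLite stand-in; only the two columns needed for paging are modelled.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (saleID INTEGER PRIMARY KEY, created INTEGER)")
conn.executemany(
    "INSERT INTO sale VALUES (?, ?)",
    [(i, 1000 + i) for i in range(1, 11)],  # invented rows: 10 sales, newer ones have larger values
)
conn.execute("CREATE INDEX idx_sale_created ON sale (created)")


def fetch_page(conn, page_size, after=None):
    """Return one page of (saleID, created) rows, newest first.

    `after` is the last row of the previous page, or None for the first page.
    """
    if after is None:
        return conn.execute(
            "SELECT saleID, created FROM sale "
            "ORDER BY created DESC, saleID DESC LIMIT ?",
            (page_size,),
        ).fetchall()
    last_id, last_created = after
    # Seek past the previous page by value instead of using a growing OFFSET.
    # saleID breaks ties between rows with the same created value.
    return conn.execute(
        "SELECT saleID, created FROM sale "
        "WHERE created < ? OR (created = ? AND saleID < ?) "
        "ORDER BY created DESC, saleID DESC LIMIT ?",
        (last_created, last_created, last_id, page_size),
    ).fetchall()


page1 = fetch_page(conn, 3)
page2 = fetch_page(conn, 3, after=page1[-1])
print(page1)
print(page2)
```

With an index on created, each page only has to read rows starting from the seek position, whereas LIMIT offset, count has to step over all skipped rows first.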
I have been stuck on an SQL query for the past few hours. I need to get the latest few elements from four tables, as follows.
The table names are events, contactinfo, video and news.
I need the last 3 results from events and from news, and the last single result from video and from contactinfo.
I tried the following query but, as expected, it didn't work:
SELECT * FROM
((SELECT * FROM EVENTS ORDER BY eventid DESC LIMIT 3)EV) INNER JOIN
((SELECT * FROM NEWS ORDER BY newsid DESC LIMIT 3)NE) INNER JOIN
((SELECT * FROM VIDEOS ORDER BY videoid DESC LIMIT 1)VI) INNER JOIN
((SELECT * FROM CONTACTINFO ORDER BY cid DESC LIMIT 1)AB);
I am not a DB expert; I am a developer and I really don't know much about MySQL.
Any help would be appreciated.
If these tables have the same columns you can do a UNION (instead of your INNER JOIN). If not, I suggest doing 4 queries.
A JOIN suggests that the joined data correlates with each other; if that's not the case, then a JOIN seems like the wrong solution.
If you need the result as a single table, use SELECT and UNION to combine the data, providing the same number of columns and the same data types in each query (CAST columns and provide default values if needed). Otherwise, if you need results with different structures, run 4 queries.
JOINs don't make sense for your task, as the last N rows of one table are unlikely to have corresponding rows within the last N rows of another table.
UPDATE
See example:
SELECT * FROM
(SELECT TOP 5 n.ID, n.Content, n.CreatedOn as CreatedOn, n.UserID as NewsUserID, 1 as SourceType FROM News n ORDER BY n.CreatedOn DESC) t1
UNION ALL
SELECT * FROM
(SELECT TOP 5 e.ID, e.Description as Content, e.CreatedAt as CreatedOn, NULL as NewsUserID, 2 as SourceType FROM Events e ORDER BY e.CreatedAt DESC) t2
ORDER BY SourceType, CreatedOn DESC
Here I decided that I want ID, Content and CreatedOn from every source, plus UserID from the News table. I built 2 queries so that they return the same columns with the same data types. Each query takes only the first 5 rows from its source (TOP 5 is MS SQL syntax; use your database's equivalent, which is LIMIT 5 in MySQL). I also added an extra field, SourceType, that records the type of entity. In the main query I union all the results and order by SourceType first, then by CreatedOn.
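The same UNION ALL pattern with LIMIT instead of TOP might look like the sketch below. It is run against SQLite through Python's sqlite3 module rather than MySQL, and the column names title and headline as well as the sample rows are assumptions, since the question does not show the table structure.

```python
import sqlite3

# In-memory SQLite database; the non-key columns and the rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (eventid INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE news (newsid INTEGER PRIMARY KEY, headline TEXT);
INSERT INTO events VALUES (1, 'e1'), (2, 'e2'), (3, 'e3'), (4, 'e4');
INSERT INTO news VALUES (1, 'n1'), (2, 'n2');
""")

# Each branch is limited on its own inside a subquery, then the branches are
# stacked with UNION ALL. Both branches expose the same three columns.
rows = conn.execute("""
    SELECT * FROM (
        SELECT 'event' AS source_type, eventid AS id, title AS content
        FROM events ORDER BY eventid DESC LIMIT 3
    )
    UNION ALL
    SELECT * FROM (
        SELECT 'news' AS source_type, newsid AS id, headline AS content
        FROM news ORDER BY newsid DESC LIMIT 3
    )
    ORDER BY source_type, id DESC
""").fetchall()

for row in rows:
    print(row)
```

The source_type column lets the application split the combined result back into per-table groups. In MySQL each derived table would also need an alias.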
This is not a logical way to fetch data from four tables in one call, since all the tables are independent.
I think you want to minimise the number of database calls.
To minimise database hits, you could use memcache instead of such a query.
Memcache:
It stores data as key-value pairs; for each key you get back a cached result set.
It is very fast.
Which query will execute faster, and which is the better query?
SELECT
COUNT(*) AS count
FROM
students
WHERE
status = 1
AND
classes_id IN(
SELECT
id
FROM
classes
WHERE
departments_id = 1
);
Or
SELECT
COUNT(*) AS count
FROM
students s
LEFT JOIN
classes c
ON
c.id = s.classes_id
WHERE
status = 1
AND
c.departments_id = 1
I have posted two queries that both output the same result. Now I want to know which one will execute faster and which one is the correct way to write it.
You should always use EXPLAIN to determine how your query will run.
Unfortunately, older MySQL versions (before the subquery optimizations added in 5.6) will execute your subquery as a DEPENDENT SUBQUERY, which means that the subquery will be run for each row in the outer query. You'd think MySQL would be smart enough to detect that the subquery isn't a correlated subquery and would run it just once; alas, those versions are not that smart.
So, MySQL will scan through all of the rows in students, running the subquery for each row, and not utilizing any indexes on the outer query whatsoever.
Writing the query as a JOIN allows MySQL to utilize indexes, and the following query would be the optimum way to write it:
SELECT COUNT(*) AS count
FROM students s
JOIN classes c
ON c.id = s.classes_id
AND c.departments_id = 1
WHERE s.status = 1
This would utilize the following indexes:
students(`status`)
classes(`id`, `departments_id`) : multi-column index
From a design and clarity standpoint I'd avoid inner selects like the first one. It is true that to be 100% sure whether or how each query will be optimized, and which will run 'better', you need to see how the SQL server you're using interprets it and what plan it produces. In MySQL, use EXPLAIN.
However.... Even without seeing this, my money is still on the JOIN-only version... The inner select version has to perform the inner select in its entirety before determining the values to use inside the "IN" clause--I know this to be true when you wrap stuff in functions, and I'm pretty sure it's true when sticking a select in as IN arguments. I also know that that's a good way to totally neutralize any benefit you might have from indexes on the tables inside the inner select.
I'm generally of the opinion that inner selects are only really needed in very rare query situations. Usually, those who use them often are thinking like traditional iterative-flow programmers, not really thinking in relational DB result set terms...
EXPLAIN Both the queries individually
The difference between both queries is of Sub-Queries vs Joins
Joins are usually faster than sub-queries. With a join, the optimizer can build an execution plan up front and predict what data it is going to process, which saves time. Sub-queries, on the other hand, may have to run until all of their data is loaded. Most developers use sub-queries because they find them more readable than JOINs, but where performance matters, a JOIN is the better solution.
The best way to find out is to measure it:
Without index
Query 1: 0.9s
Query 2: 0.9s
With index
Query 1: 0.4s
Query 2: 0.2s
The conclusion is:
If you don't have indexes then it makes no difference which query you use.
The join is faster if you have the right index.
The effect of adding the correct index is greater than the effect of choosing the right query. If performance matters, make sure you have the correct indexes.
Of course, your results may vary depending on the MySQL version and the distribution of data you have.
Here's how I tested it:
1,000,000 students (25% with status 1).
50,000 courses.
10 departments.
Here's the SQL I used to create the test data:
CREATE TABLE students
(id INT PRIMARY KEY AUTO_INCREMENT,
status int NOT NULL,
classes_id int NOT NULL);
CREATE TABLE classes
(id INT PRIMARY KEY AUTO_INCREMENT,
departments_id INT NOT NULL);
CREATE TABLE numbers(id INT PRIMARY KEY AUTO_INCREMENT);
INSERT INTO numbers VALUES (),(),(),(),(),(),(),(),(),();
INSERT INTO numbers
SELECT NULL
FROM numbers AS n1
CROSS JOIN numbers AS n2
CROSS JOIN numbers AS n3
CROSS JOIN numbers AS n4
CROSS JOIN numbers AS n5
CROSS JOIN numbers AS n6;
INSERT INTO classes (departments_id)
SELECT id % 10 FROM numbers WHERE id <= 50000;
INSERT INTO students (status, classes_id)
SELECT id % 4 = 0, id % 50000 + 1 FROM numbers WHERE id <= 1000000;
SELECT COUNT(*) AS count
FROM students
WHERE status = 1
AND classes_id IN (SELECT id FROM classes WHERE departments_id = 1);
SELECT COUNT(*) AS count
FROM students s
LEFT JOIN classes c
ON c.id = s.classes_id
WHERE status = 1
AND c.departments_id = 1;
CREATE INDEX ix_students ON students(status, classes_id);
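The two queries are written differently, so it is worth checking whether they produce the same results:
SELECT
COUNT(*) AS count
FROM
students
WHERE
status = 1
AND
classes_id IN(
SELECT
id
FROM
classes
WHERE
departments_id = 1
);
...will return the number of rows in the students table that have status = 1 and a classes_id that matches a row in the classes table with a departments_id of 1.
SELECT
COUNT(*) AS count
FROM
students s
LEFT JOIN
classes c
ON
c.id = s.classes_id
WHERE
status = 1
AND
c.departments_id = 1
...uses a LEFT JOIN, which on its own keeps every row of the students table and fills the classes columns with NULL where no class matches. However, the condition c.departments_id = 1 in the WHERE clause discards those NULL rows again, so the LEFT JOIN effectively behaves as an INNER JOIN here. Assuming classes.id is unique, the count is the same as in the first query.
To make the intent explicit, change the LEFT JOIN to an INNER JOIN so it is clear that only rows satisfying both conditions are matched. The results would only differ if the departments_id condition were moved from WHERE into ON, in which case the unmatched students would be counted as well.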
Run EXPLAIN SELECT ... on both queries and check which one does what ;)
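Besides EXPLAIN, it can help to compare the counts on a tiny data set. The sketch below does that with SQLite through Python's sqlite3 module rather than MySQL; the rows are invented, and classes.id is assumed to be a primary key as in the test schema earlier in the thread. It also includes a third variant, with the departments_id filter moved into ON, to show when a LEFT JOIN really does return more rows.

```python
import sqlite3

# In-memory SQLite database with a handful of invented rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE classes (id INTEGER PRIMARY KEY, departments_id INTEGER NOT NULL);
CREATE TABLE students (id INTEGER PRIMARY KEY, status INTEGER NOT NULL, classes_id INTEGER NOT NULL);
INSERT INTO classes VALUES (1, 1), (2, 1), (3, 2);
-- Student 5 points at a class that does not exist.
INSERT INTO students VALUES (1, 1, 1), (2, 1, 2), (3, 0, 1), (4, 1, 3), (5, 1, 99);
""")


def count(sql):
    """Run a COUNT(*) query and return the single number it produces."""
    return conn.execute(sql).fetchone()[0]


in_subquery = count("""
    SELECT COUNT(*) FROM students
    WHERE status = 1
      AND classes_id IN (SELECT id FROM classes WHERE departments_id = 1)
""")

# Filter on the right-hand table in WHERE: NULL-extended rows are thrown away again.
left_join_where = count("""
    SELECT COUNT(*) FROM students s
    LEFT JOIN classes c ON c.id = s.classes_id
    WHERE s.status = 1 AND c.departments_id = 1
""")

inner_join = count("""
    SELECT COUNT(*) FROM students s
    JOIN classes c ON c.id = s.classes_id AND c.departments_id = 1
    WHERE s.status = 1
""")

# Filter moved into ON: every status = 1 student is kept, matched or not.
left_join_on = count("""
    SELECT COUNT(*) FROM students s
    LEFT JOIN classes c ON c.id = s.classes_id AND c.departments_id = 1
    WHERE s.status = 1
""")

print(in_subquery, left_join_where, inner_join, left_join_on)
```

On this data the first three variants agree, and only the version with the filter inside ON counts the students whose class is missing or belongs to another department.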