I have two tables:
Invoice(
Id,
Status,
VendorId,
CustomerId,
OrderDate,
InvoiceFor
)
InvoiceItem(
Id,
Status,
InvoiceId,
ProductId,
PackageQty,
PackagePrice
)
Here Invoice.Id = InvoiceItem.InvoiceId (foreign key),
and the Id fields are primary keys (BIGINT).
These tables contain 100,000 (Invoice) and 450,000 (InvoiceItem) rows.
Now I have to write a query that returns the ledger of invoices where InvoiceFor = 55 or 66, within a certain date range.
I also have to return a LastTaken date, which holds the previous date on which that particular customer took (ordered) the product.
The output should be:
OrderDate, InvoiceId, CustomerId, ProductId, LastTaken, PackageQty, PackagePrice
So I wrote the following query:
SELECT a.*, (
SELECT MAX(ivv.orderdate)
FROM invoice AS ivv , invoiceItem AS iiv
WHERE ivv.id=iiv.invoiceid
AND iiv.ProductId=a.ProductId AND ivv.CustomerId=a.CustomerId AND ivv.orderDate<a.orderdate
) AS lastTaken FROM (
SELECT iv.Id, iv.OrderDate, iv.CustomerId, iv.InvoiceFor, ii.ProductId,
ii.PackageQty, ii.PackagePrice
FROM invoice AS iv, invoiceitem AS ii
WHERE iv.id=ii.InvoiceId
AND iv.InvoiceFor IN (55,66)
AND iv.Status=0 AND ii.Status=0
AND OrderDate BETWEEN '2011-01-01' AND '2011-12-31'
ORDER BY iv.orderdate, iv.Id ASC
) AS a
But I always get a timeout. How can I solve this problem?
Create indexes on the OrderDate and InvoiceFor columns. It should be much faster.
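For instance, a single composite index covering both filter columns (a sketch; the index name is my own):

CREATE INDEX idx_invoice_for_date ON Invoice (InvoiceFor, OrderDate);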
Two points about the query itself:
Learn to use proper JOIN syntax. Doing the joins in the WHERE clause is like writing questions in Shakespearean English.
The ORDER BY in the subquery belongs outside, at the highest level.
However, neither of these is killing performance. The problem is the subquery in the SELECT clause: I suspect it is not joining the two tables directly. Try putting iiv.InvoiceId = ivv.Id in, preferably, an ON clause.
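For instance, the correlated subquery from the question could be rewritten with an explicit JOIN (a sketch; it still references the outer alias a, and the logic is otherwise unchanged):

SELECT MAX(ivv.OrderDate)
FROM Invoice AS ivv
INNER JOIN InvoiceItem AS iiv ON iiv.InvoiceId = ivv.Id
WHERE iiv.ProductId = a.ProductId
  AND ivv.CustomerId = a.CustomerId
  AND ivv.OrderDate < a.OrderDate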
If that doesn't work, try an indexing strategy. The following indexes should improve the performance of that subquery:
An index on InvoiceItem(ProductId)
An index on Invoice (CustomerId, OrderDate)
This should allow MySQL to run the subquery from indexes, rather than full table scans, which should be a big performance improvement.
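In SQL, those indexes would be something like this (the names are my own):

CREATE INDEX idx_invoiceitem_product ON InvoiceItem (ProductId);
CREATE INDEX idx_invoice_customer_date ON Invoice (CustomerId, OrderDate);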
Related
I'm making a sample "recent" screen that displays a list, with id set as the primary key.
The query returns the correct results, but against a table with a large amount of data it can be slow.
This is the sample query below:
SELECT DISTINCT H.id, -- primary key
       H.partnerid AS PartnerId,
       H.partnername AS partner, H.accountname AS accountName,
       H.accountid AS AccountNo
FROM myschema.mytransactionstable H
INNER JOIN (
SELECT S.accountid, S.partnerid, S.accountname,
max(S.transdate) AS maxDate
from myschema.mytransactionstable S
group by S.accountid, S.partnerid, S.accountname
) ms ON H.accountid = ms.accountid
AND H.partnerid = ms.partnerid
AND H.accountname =ms.accountname
AND H.transdate = maxDate
WHERE H.accountid = ms.accountid
AND H.partnerid = ms.partnerid
AND H.accountname = ms.accountname
AND H.transdate = maxDate
GROUP BY H.partnerid,H.accountid, H.accountname
ORDER BY H.id DESC
LIMIT 5
In my case there are rows whose selected columns hold similar values and which differ only in their id.
I only want the 5 most recent rows by id, even though the other columns (accountname, accountid, partnerid) can contain similar values.
I already have a query that returns the correct results, but I want to improve its performance. Any suggestions for improving the query?
You can try using row_number() (available from MySQL 8.0):
SELECT *
FROM (
    SELECT *, ROW_NUMBER() OVER (ORDER BY transdate DESC) AS rn
    FROM myschema.mytransactionstable
) A
WHERE rn <= 5
Don't repeat the ON and WHERE clauses. Use ON to say how the tables (or subqueries) are "related"; use WHERE for filtering (that is, which rows to keep). In your case, probably all of the WHERE conditions should be removed, as sketched below.
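A sketch of the query with the duplicated WHERE conditions removed (otherwise it mirrors the original, including its loose GROUP BY):

SELECT DISTINCT H.id,
       H.partnerid AS PartnerId,
       H.partnername AS partner, H.accountname AS accountName,
       H.accountid AS AccountNo
FROM myschema.mytransactionstable H
INNER JOIN (
    SELECT S.accountid, S.partnerid, S.accountname,
           MAX(S.transdate) AS maxDate
    FROM myschema.mytransactionstable S
    GROUP BY S.accountid, S.partnerid, S.accountname
) ms ON H.accountid = ms.accountid
    AND H.partnerid = ms.partnerid
    AND H.accountname = ms.accountname
    AND H.transdate = ms.maxDate
GROUP BY H.partnerid, H.accountid, H.accountname
ORDER BY H.id DESC
LIMIT 5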
Please provide SHOW CREATE TABLE
This 'composite' index would probably help, since it covers both the subquery and the JOIN:
INDEX(partnerid, accountid, accountname, transdate)
That would also avoid a separate sort for the GROUP BY.
But then the ORDER BY is different, so it cannot avoid a sort.
This might avoid the sort without changing the result set (only its row order): ORDER BY partnerid, accountid, accountname, transdate DESC
Please provide EXPLAIN SELECT ... and EXPLAIN FORMAT=JSON SELECT ... if you have further questions.
If we cannot get an index to handle the WHERE, GROUP BY, and ORDER BY, the query will generate all the rows before seeing the LIMIT 5. If the index does work, then the outer query will stop after 5 -- potentially a big savings.
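Concretely, the composite index above could be added like this (assuming the table name from the question; a long VARCHAR accountname may need an index prefix length):

ALTER TABLE myschema.mytransactionstable
    ADD INDEX idx_partner_account_date (partnerid, accountid, accountname, transdate);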
I have a table with user transactions. I need to select the users whose total transactions exceed 100,000 in a single day. Currently I gather all the user ids and execute
SELECT SUM(amt) AS amt FROM users WHERE date = CURDATE() AND user_id = id;
for each id, checking whether amt > 100k or not.
Since it's a large table, this takes a lot of time to execute. Can someone suggest an optimised query?
This will do:
SELECT SUM(amt) AS amt, user_id
FROM users
WHERE date = CURDATE()
GROUP BY user_id
HAVING SUM(amt) > 100000; -- 100,000 = 1 lakh
What about filtering the records first and then applying the sum, like below:
select SUM(amt), user_id from (
SELECT amt, user_id from users where date = CURDATE()
) tmp
group by user_id having sum(amt) > 100000
What datatype is amt? If it's anything but a basic integral type (e.g. int, long, number, etc.) you should consider converting it. Decimal types are faster than they used to be, but integral types are faster still.
Consider adding indexes on the date and user_id fields, if you haven't already.
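For example, a single composite index covering both (a sketch; the name and column order are my choice, assuming you always filter by date):

CREATE INDEX idx_users_date_user ON users (date, user_id);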
You can combine the aggregation and filtering in a single query...
SELECT SUM(amt) AS amt, user_id
FROM users
WHERE date = date(...)
GROUP BY user_id
HAVING amt > 100000
The only optimization that can be done to your query is applying an index (such as a primary key) on the user_id column to speed up filtering.
As for the other answers that suggest applying GROUP BY to pre-filtered records: it won't have any effect, as the WHERE clause is executed first in SQL's logical query-processing phases.
You could use MySQL sub-queries to let MySQL handle all the iterations. For example, you could structure your query like this:
select user_data.user_id, user_data.total_amt from
(
select user_id, sum(amt) as total_amt from users where date = CURDATE() group by user_id
) as user_data
where user_data.total_amt > 100000;
I have a table with 4 columns: name, date, version, and value. There's a composite index on all four, in that order. It has 20M rows: 2,000 names, approx. 1,000 dates per name, approx. 10 versions per date.
I'm trying to get a list that gives, for each name, the highest date, the highest version on that date, and the associated value.
When I do
SELECT name,
MAX(date)
FROM table
GROUP BY name
I get good performance and the database uses the composite index
However, when I join the table to this in order to get the MAX(version) per name, the query takes ages. There must be a way to get the result in about the same order of magnitude of time as the SELECT statement above? It should be possible using the index.
Try this: (I know it needs a few syntax tweaks for MySQL... ask for them and I will find them)
INSERT INTO #TempTable
SELECT name, MAX(Date) as Date
FROM table
Group By name
select table.name, table.date, max(table.version) as version
from table
inner join #TempTable on table.name = #temptable.name and table.date = #temptable.date
group by table.name, table.date
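For reference, a hedged MySQL translation of the sketch above, using CREATE TEMPORARY TABLE (the temp-table name is my own):

CREATE TEMPORARY TABLE tmp_max_dates AS
SELECT name, MAX(date) AS date
FROM `table`
GROUP BY name;

SELECT t.name, t.date, MAX(t.version) AS version
FROM `table` t
INNER JOIN tmp_max_dates m
    ON t.name = m.name AND t.date = m.date
GROUP BY t.name, t.date;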
I have a product order table in MySQL. It's like this:
create table `order`
(productcode int,
quantity tinyint,
order_date timestamp,
blablabla)
Then, to get the rate of rise, I wrote this query:
SELECT thismonth.productcode,
(thismonth.ordercount-lastmonth.ordercount)/lastmonth.ordercount as riserate
FROM ( (SELECT productcode,
sum(quantity) as ordercount
FROM `order`
where date_format(order_date,'%m') = 12
group by productcode) as thismonth,
(SELECT productcode,
sum(quantity) as ordercount
FROM `order`
where date_format(order_date,'%m') = 11
group by productcode) as lastmonth)
WHERE thismonth.productcode = lastmonth.productcode
ORDER BY riserate;
but it runs for about 30 s on my PC (200,000 records, 200 MB including other fields).
Is there any way to increase the query speed? I have already created an index on the productcode field.
I thought the reason for the low performance was the GROUP BY; is there a different way?
Update: I tried your answers, but none of them seemed to work, and I wondered if something was wrong with the indexes (it wasn't me who created them), so I dropped all the indexes and re-created them, and everything went fine -- it only takes 3-4 s now. The difference between my query and yours is not very obvious. But REALLY thanks, you guys, I learned a lot :)
Try adding an index on (ORDER_DATE, PRODUCTCODE) and change the query to eliminate the use of the DATE_FORMAT function, as in:
SELECT thismonth.productcode,
(thismonth.ordercount-lastmonth.ordercount)/lastmonth.ordercount as riserate
FROM ( (SELECT productcode,
sum(quantity) as ordercount
FROM `order`
WHERE ORDER_DATE BETWEEN '01-12-2010' AND '31-12-2010'
GROUP BY PRODUCTCODE) as thismonth,
(SELECT productcode,
sum(quantity) as ordercount
FROM `order`
WHERE ORDER_DATE BETWEEN '01-11-2010' AND '30-11-2010'
group by productcode) as lastmonth)
WHERE thismonth.productcode = lastmonth.productcode
ORDER BY riserate;
Share and enjoy.
Given the sheer amount of data you seem to be working with, optimization may be difficult. I would first look at how you are using the order_date field. It should probably be indexed together with the productcode field. I also don't think DATE_FORMAT is the best way to get the month out of the date -- MONTH(order_date) would almost certainly be faster.
Failing that, if this is a query that is going to be run many times, I would create a new table for the historical data and fill it with the results of your inner queries. Since it's historical data, you won't need to continually recalculate it, so the query will run a lot faster.
@Bob Jarvis' solution might resolve your speed issue. If not, or if you want to try an alternative:
Add an order_month column to store the month of order_date
Update the column for existing rows
Add an index on order_month
Create a BEFORE UPDATE trigger to set the value of order_month on row updates
Create a BEFORE INSERT trigger to set the value of order_month on row inserts
Modify your query accordingly (a sketch of these steps follows the query below)
SELECT
productcode,
(this_month_count - last_month_count) / last_month_count AS riserate
FROM (
SELECT
o.productcode,
SUM(CASE MONTH(o.order_date) WHEN MONTH(m.date_start) THEN o.quantity END) AS last_month_count,
SUM(CASE MONTH(o.order_date) WHEN MONTH(m.date_end) THEN o.quantity END) AS this_month_count
FROM `order` o
INNER JOIN (
SELECT
CAST('2010-11-01' AS date) AS date_start,
CAST('2010-12-31' AS date) AS date_end
) m ON o.order_date BETWEEN m.date_start AND m.date_end
GROUP BY o.productcode
) s
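For reference, here is a minimal sketch of the column/index/trigger steps listed above (the column, index, and trigger names are my own):

ALTER TABLE `order` ADD COLUMN order_month TINYINT;
UPDATE `order` SET order_month = MONTH(order_date);
ALTER TABLE `order` ADD INDEX idx_order_month (order_month);

CREATE TRIGGER order_bi BEFORE INSERT ON `order`
FOR EACH ROW SET NEW.order_month = MONTH(NEW.order_date);

CREATE TRIGGER order_bu BEFORE UPDATE ON `order`
FOR EACH ROW SET NEW.order_month = MONTH(NEW.order_date);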
Consider using datetime instead of timestamp
If your only reason to use TIMESTAMP is to have an automatic default value on insert and update, use DATETIME instead and put NOW() into your inserts and updates, or use triggers. TIMESTAMP gives you additional conversion for time zones, but if you don't have clients connecting to your database from different time zones you are just losing time on conversions. This alone should give you a 15-30% speed-up.
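As a sketch (the sample values are made up; adjust to your schema):

ALTER TABLE `order` MODIFY order_date DATETIME;
INSERT INTO `order` (productcode, quantity, order_date) VALUES (42, 1, NOW());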
This might be one of the rare cases where the optimizer chooses the wrong index
And the productcode index is the wrong one here. Because you are grouping by productcode and filtering on another, not very selective, column, the optimizer may think that using the productcode index will speed things up. But with that index it performs a fairly random scan through index lookups, still touching quite a large number of rows, instead of a faster sequential semi-full scan limited by an order_date index. The optimizer simply doesn't know that the rows on disk can be expected to be ordered mostly by order_date, not by productcode. Of course, to make an order_date index usable you have to rewrite the query so that in every comparison the order_date column stands alone on one side of =, <, > or BETWEEN, with constant values on the other side, as suggested by Bob Jarvis in his answer (+1 to him). So you might want to try his query slightly modified, with corrected date formats (MySQL expects 'YYYY-MM-DD') and a forced use of the order_date index -- assuming you have one; if not, you really should add it with
ALTER TABLE `order` ADD INDEX order_date( order_date );
So the final query should look like:
SELECT thismonth.productcode,
(thismonth.ordercount-lastmonth.ordercount)/lastmonth.ordercount as riserate
FROM ( (SELECT productcode,
sum(quantity) as ordercount
FROM `order` FORCE INDEX( order_date )
WHERE order_date BETWEEN '2010-12-01' AND '2010-12-31'
GROUP BY productcode) as thismonth,
(SELECT productcode,
sum(quantity) as ordercount
FROM `order` FORCE INDEX( order_date )
WHERE order_date BETWEEN '2010-11-01' AND '2010-11-30'
group by productcode) as lastmonth)
WHERE thismonth.productcode = lastmonth.productcode
ORDER BY riserate;
Not using the productcode index should give you some speed-up (a full scan should be faster), and using the order_date index even more, depending on how many rows satisfy the order_date conditions versus all rows in the table.
If I have a table
CREATE TABLE users (
id int(10) unsigned NOT NULL auto_increment,
name varchar(255) NOT NULL,
profession varchar(255) NOT NULL,
employer varchar(255) NOT NULL,
PRIMARY KEY (id)
)
and I want to get all unique values of profession field, what would be faster (or recommended):
SELECT DISTINCT u.profession FROM users u
or
SELECT u.profession FROM users u GROUP BY u.profession
?
They are essentially equivalent to each other (in fact this is how some databases implement DISTINCT under the hood).
If one of them is faster, it's going to be DISTINCT. This is because, although the two are the same, a query optimizer would have to catch the fact that your GROUP BY is not taking advantage of any group members, just their keys. DISTINCT makes this explicit, so you can get away with a slightly dumber optimizer.
When in doubt, test!
If you have an index on profession, these two are synonyms.
If you don't, then use DISTINCT.
GROUP BY in MySQL sorts results (note: this implicit sort, along with the GROUP BY ... DESC syntax below, was removed in MySQL 8.0). You can even do:
SELECT u.profession FROM users u GROUP BY u.profession DESC
and get your professions sorted in DESC order.
DISTINCT creates a temporary table and uses it for storing duplicates. GROUP BY does the same, but sorts the distinct results afterwards.
So
SELECT DISTINCT u.profession FROM users u
is faster, if you don't have an index on profession.
All of the answers above are correct for the case of DISTINCT on a single column vs. GROUP BY on that same single column.
Every DB engine has its own implementation and optimizations, and if you care about the (in most cases very small) difference then you have to test against the specific server AND the specific version! Implementations may change...
BUT, if you select more than one column in the query, then DISTINCT behaves essentially differently! Because in that case it compares ALL columns of all rows, instead of just one column.
So if you have something like:
-- This will NOT return rows unique by [id], but unique by (id, name)
SELECT DISTINCT id, name FROM some_query_with_joins

-- This will select rows unique by [id]
-- (it relies on MySQL's non-strict GROUP BY and fails under ONLY_FULL_GROUP_BY)
SELECT id, name FROM some_query_with_joins GROUP BY id
It is a common mistake to think that the DISTINCT keyword distinguishes rows by the first column you specified, but DISTINCT is a general keyword in this sense.
So be careful not to take the answers above as correct for all cases... you might get confused and get the wrong results while all you wanted was to optimize!
Go for the simplest and shortest where you can -- DISTINCT seems to be what you are looking for here, simply because it gives you EXACTLY the answer you need, and only that!
Well, DISTINCT can be slower than GROUP BY on some occasions in PostgreSQL (I don't know about other DBs).
A tested example:
postgres=# select count(*) from (select distinct i from g) a;
 count
-------
 10001
(1 row)

Time: 1563,109 ms

postgres=# select count(*) from (select i from g group by i) a;
 count
-------
 10001
(1 row)

Time: 594,481 ms
http://www.pgsql.cz/index.php/PostgreSQL_SQL_Tricks_I
so be careful ... :)
GROUP BY is more expensive than DISTINCT, since GROUP BY sorts the result while DISTINCT avoids it (again, in MySQL before 8.0). But if you want GROUP BY to yield the same result as DISTINCT, add ORDER BY NULL:
SELECT DISTINCT u.profession FROM users u
is equal to
SELECT u.profession FROM users u GROUP BY u.profession ORDER BY NULL
It seems that the queries are not exactly the same. At least for MySQL.
Compare:
describe select distinct productname from northwind.products
describe select productname from northwind.products group by productname
The second query additionally shows "Using filesort" in Extra.
In MySQL, GROUP BY uses an extra step: filesort. I realize DISTINCT is faster than GROUP BY here, and that was a surprise.
After heavy testing we came to the conclusion that GROUP BY is faster:

SELECT sql_no_cache
opnamegroep_intern
FROM telwerken
WHERE opnemergroep IN (7,8,9,10,11,12,13) group by opnamegroep_intern

635 total, 0.0944 seconds
(Showing records 0 - 29; 635 total, query took 0.0484 sec)

SELECT sql_no_cache
distinct (opnamegroep_intern)
FROM telwerken
WHERE opnemergroep IN (7,8,9,10,11,12,13)

635 total, 0.2117 seconds (almost 100% slower)
(Showing records 0 - 29; 635 total, query took 0.3468 sec)
(more of a functional note)
There are cases when you have to use GROUP BY, for example if you wanted to get the number of employees per employer:
SELECT u.employer, COUNT(u.id) AS "total employees" FROM users u GROUP BY u.employer
In such a scenario DISTINCT u.employer doesn't work right. Perhaps there is a way, but I just do not know it. (If someone knows how to make such a query with DISTINCT please add a note!)
Here is a simple approach (in T-SQL) which will print the elapsed time of each query:

DECLARE @t1 DATETIME;
DECLARE @t2 DATETIME;
SET @t1 = GETDATE();
SELECT DISTINCT u.profession FROM users u; -- query with DISTINCT
SET @t2 = GETDATE();
PRINT 'Elapsed time (ms): ' + CAST(DATEDIFF(millisecond, @t1, @t2) AS varchar);

SET @t1 = GETDATE();
SELECT u.profession FROM users u GROUP BY u.profession; -- query with GROUP BY
SET @t2 = GETDATE();
PRINT 'Elapsed time (ms): ' + CAST(DATEDIFF(millisecond, @t1, @t2) AS varchar);
OR try SET STATISTICS TIME (Transact-SQL)
SET STATISTICS TIME ON;
SELECT DISTINCT u.profession FROM users u; --Query with DISTINCT
SELECT u.profession FROM users u GROUP BY u.profession; --Query with GROUP BY
SET STATISTICS TIME OFF;
It simply displays the number of milliseconds required to parse, compile, and execute each statement as below:
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 2 ms.
SELECT DISTINCT will always be the same speed as, or faster than, the equivalent GROUP BY. On some systems (e.g. Oracle), GROUP BY might be optimized to perform the same as DISTINCT for most queries. On others (such as SQL Server), DISTINCT can be considerably faster.
This is not a rule.
For each query, try DISTINCT and then GROUP BY separately, compare the time each takes to complete, and use the faster one.
In my project I sometimes use GROUP BY and sometimes DISTINCT.
If you don't have to do any aggregate functions (SUM, AVG, etc., in case you want to add numeric data to the result), use SELECT DISTINCT. I suspect it's faster, but I have nothing to show for it.
In any case, if you're worried about speed, create an index on the column.
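For example (the index name is my own):

CREATE INDEX idx_users_profession ON users (profession);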
If the problem allows it, try EXISTS, since it's optimized to stop as soon as a result is found (and it doesn't buffer any response). So, if you are just trying to normalize data for a WHERE clause like this:
SELECT S.* FROM SOMETHING S WHERE S.ID IN ( SELECT DISTINCT DCR.SOMETHING_ID FROM DIFF_CARDINALITY_RELATIONSHIP DCR ) -- DISTINCT to keep the same cardinality
a faster version would be:
SELECT S.* FROM SOMETHING S WHERE EXISTS ( SELECT 1 FROM DIFF_CARDINALITY_RELATIONSHIP DCR WHERE DCR.SOMETHING_ID = S.ID )
This isn't always possible, but when it is you will see a faster response.
In MySQL I have found that GROUP BY will treat NULL as distinct, while DISTINCT does not.
I took the exact same DISTINCT query, removed the DISTINCT, added the selected fields as the GROUP BY, and got many more rows due to one of the fields being NULL.
So I tend to believe that there is more to DISTINCT in MySQL.