I have a product order table in MySQL. It looks like this:
create table `order`
(productcode int,
quantity tinyint,
order_date timestamp,
blablabla)
Then, to get the rate of rise, I wrote this query:
SELECT thismonth.productcode,
(thismonth.ordercount-lastmonth.ordercount)/lastmonth.ordercount as riserate
FROM ( (SELECT productcode,
sum(quantity) as ordercount
FROM `order`
where date_format(order_date,'%m') = 12
group by productcode) as thismonth,
(SELECT productcode,
sum(quantity) as ordercount
FROM `order`
where date_format(order_date,'%m') = 11
group by productcode) as lastmonth)
WHERE thismonth.productcode = lastmonth.productcode
ORDER BY riserate;
but it takes about 30 s on my PC (200,000 records, 200 MB including other fields).
Is there any way to increase the query speed? I have already created an index on the productcode field.
I suspect the GROUP BY is the reason for the poor performance; is there a different way to write this?
I tried your answers, but none of them seemed to work, and I started to wonder whether something was wrong with the indexes (it wasn't me who created them), so I deleted all the indexes and re-created them, and now everything is fine -- it only takes 3-4 s. The difference between my query and yours is not very big, but really, thanks guys, I learned a lot :)
Try adding an index on (ORDER_DATE, PRODUCTCODE) and changing the query to eliminate the use of the DATE_FORMAT function, as in:
SELECT thismonth.productcode,
(thismonth.ordercount-lastmonth.ordercount)/lastmonth.ordercount as riserate
FROM ( (SELECT productcode,
sum(quantity) as ordercount
FROM `order`
WHERE ORDER_DATE BETWEEN '2010-12-01' AND '2010-12-31'
GROUP BY PRODUCTCODE) as thismonth,
(SELECT productcode,
sum(quantity) as ordercount
FROM `order`
WHERE ORDER_DATE BETWEEN '2010-11-01' AND '2010-11-30'
group by productcode) as lastmonth)
WHERE thismonth.productcode = lastmonth.productcode
ORDER BY riserate;
Share and enjoy.
Given the sheer amount of data you seem to be working with, optimization may be difficult. I would first look at how you are using the order_date field; it should probably be indexed together with the productcode field. I also don't think DATE_FORMAT is the best way to get the month out of the date - MONTH(order_date) would almost certainly be faster.
Failing that, if this is a query that is going to be run many times, I would create a new table for the historical data and fill it with the results of your inner queries. Since it's historical data, you won't need to continually fetch the latest data, and since you won't have to recalculate the historical totals every time you run the query, it will run a lot faster.
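For example, a rough sketch of such a table (the name monthly_order_totals and the exact layout are just assumptions, not something from the question):
CREATE TABLE monthly_order_totals (
  yearmonth   CHAR(7) NOT NULL,  -- e.g. '2010-12'
  productcode INT     NOT NULL,
  ordercount  INT     NOT NULL,
  PRIMARY KEY (yearmonth, productcode)
);

INSERT INTO monthly_order_totals (yearmonth, productcode, ordercount)
SELECT DATE_FORMAT(order_date, '%Y-%m'), productcode, SUM(quantity)
FROM `order`
GROUP BY DATE_FORMAT(order_date, '%Y-%m'), productcode;
The rise-rate query then only has to join two rows per product from this much smaller table.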
@Bob Jarvis' solution might resolve your speed issue. If not, or if you want to try an alternative:
1. Add an update_month column to store the month of order_date
2. Populate the column for existing rows
3. Add an index on update_month
4. Create a BEFORE UPDATE trigger to set the value of update_month on row updates
5. Create a BEFORE INSERT trigger to set the value of update_month on row inserts
6. Modify your query accordingly (steps 1-5 are sketched in SQL after the query below)
SELECT
productcode,
(this_month_count - last_month_count) / last_month_count AS riserate
FROM (
SELECT
o.productcode,
SUM(CASE MONTH(o.order_date) WHEN MONTH(m.date_start) THEN o.quantity END) AS last_month_count,
SUM(CASE MONTH(o.order_date) WHEN MONTH(m.date_end) THEN o.quantity END) AS this_month_count
FROM `order` o
INNER JOIN (
SELECT
CAST('2010-11-01' AS date) AS date_start,
CAST('2010-12-31' AS date) AS date_end
) m ON o.order_date BETWEEN m.date_start AND m.date_end
GROUP BY o.productcode
) s
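The schema changes in steps 1-5 might look like this (a rough sketch; the column, index, and trigger names are assumptions):
ALTER TABLE `order` ADD COLUMN update_month TINYINT;
UPDATE `order` SET update_month = MONTH(order_date);
ALTER TABLE `order` ADD INDEX idx_update_month (update_month);
CREATE TRIGGER order_bi BEFORE INSERT ON `order`
FOR EACH ROW SET NEW.update_month = MONTH(NEW.order_date);
CREATE TRIGGER order_bu BEFORE UPDATE ON `order`
FOR EACH ROW SET NEW.update_month = MONTH(NEW.order_date);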
Consider using DATETIME instead of TIMESTAMP
If your only reason to use TIMESTAMP is to get an automatic default value on insert and update, use DATETIME instead and put NOW() into your inserts and updates, or use triggers. TIMESTAMP adds a time-zone conversion, but if you don't have clients connecting to your database from different time zones you are just losing time on those conversions. This alone should give you a 15-30% speed-up.
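A minimal sketch of that switch, assuming the order_date column from the question and that nothing else depends on TIMESTAMP's auto-update behaviour:
ALTER TABLE `order` MODIFY order_date DATETIME NOT NULL;
-- and supply the value yourself from now on (other columns omitted here):
INSERT INTO `order` (productcode, quantity, order_date)
VALUES (123, 2, NOW());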
This might be one of those rare cases where the optimizer chooses the wrong index
And the productcode index is the wrong one here. Because you are grouping by productcode and the WHERE clause uses another column that is not very selective, the optimizer may think that using the productcode index will speed things up. But with that index it does a very random scan through index lookups, still over quite a large number of rows, instead of a faster sequential semi-full scan without it, using an order_date index to limit the number of rows scanned. The optimizer simply doesn't know that the rows on disk can be expected to be ordered mostly by order_date and not by productcode. Of course, to make an order_date index usable you have to change your query so that in every comparison on order_date the bare column name is on one side of the =, <, > or BETWEEN and constant values are on the other side, as suggested by Bob Jarvis in his answer (+1 to him). So you might want to try his query slightly modified to force the use of the order_date index -- assuming you have one; if not, you really should add it with
ALTER TABLE `order` ADD INDEX order_date( order_date );
So the final query should look like:
SELECT thismonth.productcode,
(thismonth.ordercount-lastmonth.ordercount)/lastmonth.ordercount as riserate
FROM ( (SELECT productcode,
sum(quantity) as ordercount
FROM `order` FORCE INDEX( order_date )
WHERE order_date BETWEEN '2010-12-01' AND '2010-12-31'
GROUP BY productcode) as thismonth,
(SELECT productcode,
sum(quantity) as ordercount
FROM `order` FORCE INDEX( order_date )
WHERE order_date BETWEEN '2010-11-01' AND '2010-11-30'
group by productcode) as lastmonth)
WHERE thismonth.productcode = lastmonth.productcode
ORDER BY riserate;
Not using the productcode index should give you some speed-up (a full scan should be faster), and using the order_date index even more, depending on how many rows satisfy the order_date conditions versus all rows in the table.
So I have this data set (down below) and I'm simply trying to gather all data based on records in field 1 that have a count of more than 30 (meaning a distinct brand that has 30+ record entries) - that's it lol!
I've been trying a lot of different DISTINCT/COUNT-type queries but I'm falling short. Any help is appreciated :)
Data Set
By using GROUP BY and HAVING you can achieve this. To select more columns remember to add them to the GROUP BY clause as well.
SELECT Mens_Brand FROM your_table
WHERE Mens_Brand IN (SELECT Mens_Brand
FROM your_table
GROUP BY Mens_Brand
HAVING COUNT(Mens_Brand)>=30)
You can simply use a window function (requires MySQL 8 or MariaDB 10.2) for this:
select Mens_Brand, Mens_Price, Shoe_Condition, Currency, PK
from (
select Mens_Brand, Mens_Price, Shoe_Condition, Currency, PK, count(1) over (partition by Mens_Brand) brand_count
from your_table
) counted where brand_count >= 30
I have a MySQL table that contains 20,000,000 rows, with columns like (user_id, registered_timestamp, etc.). I have written the query below to get a day-wise count of registered users. The query was taking a long time to execute. Will adding an index to the registered_timestamp column improve the execution time?
select date(registered_timestamp), count(userid) from table group by 1
Consider using this query to get a list of dates and the number of registrations on each date.
SELECT date(registered_timestamp) date, COUNT(*)
FROM table
GROUP BY date(registered_timestamp)
Then an index on table(registered_timestamp) will help a little because it's a covering index.
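If the index doesn't exist yet, it is one statement to create (the index name is just an example):
CREATE INDEX idx_registered_timestamp ON `table` (registered_timestamp);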
If you adapt your query to return dates from a limited range, for example,
SELECT date(registered_timestamp) date, COUNT(*)
FROM table
WHERE registered_timestamp >= CURDATE() - INTERVAL 8 DAY
AND registered_timestamp < CURDATE()
GROUP BY date(registered_timestamp)
the index will help. (This query returns results for the week ending yesterday.) However, the index will not help this query.
SELECT date(registered_timestamp) date, COUNT(*)
FROM table
WHERE DATE(registered_timestamp) >= CURDATE() - INTERVAL 8 DAY /* slow! */
GROUP BY date(registered_timestamp)
because the function on the column makes the query non-sargable.
You probably can address this performance issue with a MySQL generated column. This command:
ALTER TABLE `table`
ADD registered_date DATE
GENERATED ALWAYS AS (DATE(registered_timestamp))
STORED;
Then you can add an index on the generated column
CREATE INDEX regdate ON `table` ( registered_date );
Then you can use that generated (derived) column in your query, and get a lot of help from that index.
SELECT registered_date, COUNT(*)
FROM table
GROUP BY registered_date;
But beware, creating the generated column and its index will take a while.
select date(registered_timestamp), count(userid) from table group by 1
Would benefit from INDEX(registered_timestamp, userid) but only because such an index is "covering". The query will still need to read every row of the index, and do a filesort.
If userid is the PRIMARY KEY, then this would give you the same answers without bothering to check each userid for being NOT NULL.
select date(registered_timestamp), count(*) from table group by 1
And INDEX(registered_timestamp) would be equivalent to the above suggestion. (This is because InnoDB implicitly tacks on the PK.)
If this query is common, then you could build and maintain a "summary table", which collects the count every night for the day's registrations. Then the query would be a much faster fetch from that smaller table.
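A sketch of such a summary table and its nightly load (the names are assumptions; the INSERT could be run from cron or the MySQL event scheduler):
CREATE TABLE registrations_per_day (
  reg_date  DATE NOT NULL PRIMARY KEY,
  reg_count INT  NOT NULL
);

-- Run each night to add yesterday's registrations:
INSERT INTO registrations_per_day (reg_date, reg_count)
SELECT DATE(registered_timestamp), COUNT(*)
FROM `table`
WHERE registered_timestamp >= CURDATE() - INTERVAL 1 DAY
  AND registered_timestamp < CURDATE()
GROUP BY DATE(registered_timestamp);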
I have around 6 million rows in the table, and I am querying it with the query below.
SELECT * FROM FD_CPC_HISTORICAL_DATA
WHERE id IN (SELECT MAX(id)
             FROM FD_CPC_HISTORICAL_DATA
             WHERE fb_ads_account_id = 1462257067274960
               AND created_at BETWEEN '2019-12-13 00:00:00' AND '2019-12-13 23:59:59'
             GROUP BY source_text) \G
I have created an index for fb_ads_account_id, created_at, and source_text; id is the primary key.
My question is: why does this query take around 9 seconds to return the result even though I have created indexes?
Is there any other way to make this query more efficient?
Here is the output of the MySQL EXPLAIN command:
This is your query:
SELECT hd.*
FROM FD_CPC_HISTORICAL_DATA hd
WHERE hd.id IN (SELECT MAX(hd2.id)
FROM FD_CPC_HISTORICAL_DATA hd2
WHERE hd2.fb_ads_account_id = 1462257067274960 AND
hd2.created_at >= '2019-12-13' AND
hd2.created_at < '2019-12-14'
GROUP BY source_text
);
I would recommend writing this as:
SELECT hd.*
FROM FD_CPC_HISTORICAL_DATA hd
WHERE hd.fb_ads_account_id = 1462257067274960 AND
hd.id = (SELECT MAX(hd2.id)
FROM FD_CPC_HISTORICAL_DATA hd2
WHERE hd2.fb_ads_account_id = hd.hd.fb_ads_account_id AND
hd2.source_text = hd.source_tx AND
hd2.created_at >= '2019-12-13' AND
hd2.created_at < '2019-12-14'
);
For this query, you want an index on FD_CPC_HISTORICAL_DATA(fb_ads_account_id, source_text, created_at).
This query can probably be performed without a subquery against the same table, i.e.:
SELECT * FROM FD_CPC_HISTORICAL_DATA
WHERE fb_ads_account_id=1462257067274960
AND created_at BETWEEN '2019-12-13 00:00:00' AND '2019-12-13 23:59:59'
ORDER BY id DESC LIMIT 1
if you want the max ID, or something similar; I am not sure you need the GROUP BY to get the desired result.
I think the index is exactly what you need. The part of the EXPLAIN that confuses me is the (guesstimated?) number of rows from the subquery being so different from the one in the primary query.
To be honest, I'm not very familiar with MySQL, but in MSSQL I would first try dumping the results of the subquery into a temporary table, putting a unique clustered index on it, and then selecting everything from the original table joined to that temporary table on the ID column. (Don't use IN, use JOIN, as there can't be any duplicates in the temporary table.)
This might also show where all the time is being spent.
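In MySQL terms that idea might look roughly like this (a sketch; the temporary table name is an assumption):
CREATE TEMPORARY TABLE tmp_max_ids (
  id BIGINT NOT NULL PRIMARY KEY  -- match the type of FD_CPC_HISTORICAL_DATA.id
);

INSERT INTO tmp_max_ids (id)
SELECT MAX(id)
FROM FD_CPC_HISTORICAL_DATA
WHERE fb_ads_account_id = 1462257067274960
  AND created_at >= '2019-12-13' AND created_at < '2019-12-14'
GROUP BY source_text;

SELECT hd.*
FROM FD_CPC_HISTORICAL_DATA hd
JOIN tmp_max_ids t ON t.id = hd.id;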
My guess is that this is mostly a statistics issue, but I don't really know how to force an update of the statistics on the index in MySQL.
(There is some talk about FLUSH TABLE in https://dzone.com/articles/updating-innodb-table-statistics-manually, but it seems to come with some downsides too; use with care.)
SELECT f.*
FROM
( SELECT source_text, MAX(created_at) AS mx
FROM FD_CPC_HISTORICAL_DATA
WHERE fb_ads_account_id=1462257067274960
AND created_at >= '2019-12-13'
AND created_at < '2019-12-13' + INTERVAL 1 DAY
GROUP BY source_text
) AS x
JOIN FD_CPC_HISTORICAL_DATA AS f
ON f.fb_ads_account_id = 1462257067274960
AND f.source_text = x.source_text
AND f.created_at = x.mx
Then you need this composite index:
INDEX(fb_ads_account_id, source_text, created_at) -- in this order
If this does not quite work because of duplicate entries with the same created_at, then a tweak may be possible.
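One possible tweak (just a sketch, reusing the question's MAX(id) idea): pick the highest id per source_text, so ties on created_at cannot produce extra rows:
SELECT f.*
FROM
( SELECT MAX(id) AS max_id
    FROM FD_CPC_HISTORICAL_DATA
    WHERE fb_ads_account_id = 1462257067274960
      AND created_at >= '2019-12-13'
      AND created_at < '2019-12-13' + INTERVAL 1 DAY
    GROUP BY source_text
) AS x
JOIN FD_CPC_HISTORICAL_DATA AS f ON f.id = x.max_id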
I have a MySQL SUM query that runs on more than 0.6 million records.
What I am currently doing is this:
SELECT SUM(payment)
FROM payment_table
WHERE
payment_date BETWEEN ... AND ...
AND
payment_status = 'paid'
I changed the query to this format to reduce the record set, but it still takes almost the same time.
SELECT SUM(Payments)
FROM (
SELECT payment AS Payments FROM payment_table WHERE
payment_date BETWEEN DATE_FORMAT(NOW(), '2012-2-01') AND DATE_FORMAT(LAST_DAY(DATE_FORMAT(NOW(), '2012-2-01')), '%Y-%m-%d')
AND
payment_status = 'paid'
) AS tmp_table
Is there any way to optimize this SUM query?
EDIT:
This is the result when the query is run with EXPLAIN:
id: 1
select_type: SIMPLE
table: lps
type: index_merge
possible_keys: assigned_user_id,scheduled_payment_date,payment_status,deleted
key: deleted,assigned_user_id,payment_status
key_len: 2,109,303
ref: NULL
rows: 23347
Extra: Using intersect(deleted,assigned_user_id,payment_status); Using where
You should match the data type of the predicate with the column. Because payment_date is a DATE, make the BETWEEN values DATEs as well:
WHERE payment_date BETWEEN
CURDATE() AND LAST_DAY(CURDATE())
Matching types ensures the index will be used.
In contrast, your query uses DATE_FORMAT(), which produces a text value, so in order to perform the comparison MySQL converts the payment_date column to text. That means it can't use the index (the index contains DATE values, not text values), so every single row is converted and compared.
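Put together, the rewritten query might look like this (a sketch, keeping the current-month range from the snippet above):
SELECT SUM(payment)
FROM payment_table
WHERE payment_date BETWEEN CURDATE() AND LAST_DAY(CURDATE())
  AND payment_status = 'paid';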
If you are still having performance problems after making the change above, execute this:
ANALYZE TABLE payment_table;
This checks the distribution of values in the indexed columns, which helps MySQL make the right choice of index.
I have two tables
Invoice(
Id,
Status,
VendorId,
CustomerId,
OrderDate,
InvoiceFor,
)
InvoiceItem(
Id,
Status,
InvoiceId,
ProductId,
PackageQty,
PackagePrice,
)
Here invoice.Id = invoiceItem.InvoiceId (foreign key), and the Id fields are primary keys (BIGINT).
These tables contain 100,000 (Invoice) and 450,000 (InvoiceItem) rows.
Now I have to write a query which will return the ledger of invoices where InvoiceFor = 55 or 66 and the order date is in a certain range.
I also have to return a LastTaken date, which holds the previous date on which that particular customer took the product.
The output should be
OrderDate, InvoiceId, CustomerId, ProductId, LastTaken, PackageQty, PackagePrice
So I wrote the following query:
SELECT a.*, (
SELECT MAX(ivv.orderdate)
FROM invoice AS ivv , invoiceItem AS iiv
WHERE ivv.id=iiv.invoiceid
AND iiv.ProductId=a.ProductId AND ivv.CustomerId=a.CustomerId AND ivv.orderDate<a.orderdate
) AS lastTaken FROM (
SELECT iv.Id, iv.OrderDate, iv.CustomerId, iv.InvoiceFor, ii.ProductId,
ii.PackageQty, ii.PackagePrice
FROM invoice AS iv, invoiceitem AS ii
WHERE iv.id=ii.InvoiceId
AND iv.InvoiceFor IN (55,66)
AND iv.Status=0 AND ii.Status=0
AND OrderDate BETWEEN '2011-01-01' AND '2011-12-31'
ORDER BY iv.orderdate, iv.Id ASC
) AS a
But I always get a timeout. How can I solve this problem?
The EXPLAIN for this query is as follows:
Create indexes on the OrderDate and InvoiceFor columns. It should be much faster.
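For example (the index names are just placeholders):
CREATE INDEX idx_invoice_orderdate ON Invoice (OrderDate);
CREATE INDEX idx_invoice_invoicefor ON Invoice (InvoiceFor);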
Two points about the query itself:
Learn to use proper JOIN syntax. Doing the joins in the WHERE clause is like writing questions in Shakespearean English.
The ORDER BY in the subquery should be outside at the highest level.
However, neither of these is killing performance. The problem is the subquery in the SELECT clause. I think the issue is that the subquery in the SELECT clause is not joining the two tables directly. Try including iiv.InvoiceId = ivv.Id, preferably in an ON clause.
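Applying those two points, plus that join condition, the query might look something like this (a sketch using the table and column names from the question):
SELECT a.*,
       ( SELECT MAX(ivv.OrderDate)
           FROM invoice AS ivv
           JOIN invoiceitem AS iiv ON iiv.InvoiceId = ivv.Id
          WHERE iiv.ProductId = a.ProductId
            AND ivv.CustomerId = a.CustomerId
            AND ivv.OrderDate < a.OrderDate
       ) AS LastTaken
FROM ( SELECT iv.Id, iv.OrderDate, iv.CustomerId, iv.InvoiceFor,
              ii.ProductId, ii.PackageQty, ii.PackagePrice
         FROM invoice AS iv
         JOIN invoiceitem AS ii ON ii.InvoiceId = iv.Id
        WHERE iv.InvoiceFor IN (55, 66)
          AND iv.Status = 0 AND ii.Status = 0
          AND iv.OrderDate BETWEEN '2011-01-01' AND '2011-12-31'
     ) AS a
ORDER BY a.OrderDate, a.Id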
If that doesn't work, try an indexing strategy. The following indexes should improve the performance of that subquery:
An index on InvoiceItem(ProductId)
An index on Invoice (CustomerId, OrderDate)
This should allow MySQL to run the subquery from indexes, rather than full table scans, which should be a big performance improvement.