MySQL self-join optimization while calculating moving averages

I have created a MySQL query to calculate moving averages using multiple self-joins, as shown below. It is consuming a lot of time, and each query processes on the order of 100k rows. Is there any way to optimize it further to reduce the run time?
SELECT a.rownum, a.ma_small_price, b.ma_medium_price
FROM
  (SELECT t3.rownum, AVG(t.last_price) AS ma_small_price
   FROM temp_data t3
   LEFT JOIN temp_data t ON t.rownum BETWEEN IFNULL(t3.rownum, 0) - @psmall AND t3.rownum
   GROUP BY t3.rownum) a
INNER JOIN
  (SELECT t3.rownum, AVG(t.last_price) AS ma_medium_price
   FROM temp_data t3
   LEFT JOIN temp_data t ON t.rownum BETWEEN IFNULL(t3.rownum, 0) - @pmedium AND t3.rownum
   GROUP BY t3.rownum) b ON a.rownum = b.rownum;

OVER ( ... ) is disappointingly slow -- in both MySQL 8.0 and MariaDB 10.x.
I like "exponential moving average" as being easier to compute than "moving average". The following is roughly equivalent to what Nick proposed. This runs faster, but has slightly different results:
SELECT rownum,
       @small := @small + 0.5 * (last_price - @small) AS mae_small_price,
       @med   := @med   + 0.2 * (last_price - @med)   AS mae_med_price
FROM ( SELECT @small := 10, @med := 10 ) AS init
JOIN temp_data
ORDER BY rownum;
The coefficient controls how fast the exponential moving average adapts to changes in the data. It should be greater than 0 and less than 1.
The "10" that I initialized the EPA to was a rough guess of the average -- it biases the first few values but is gradually swamped as more values are folded in.

Since you're running MySQL 8, you should be able to use window functions to get the same result more efficiently. Without seeing sample data it's hard to be 100% certain, but this should be close. Note that to use variables in a window frame, you need a prepared statement:
SET @sql = '
SELECT rownum,
       AVG(last_price) OVER (ORDER BY rownum ROWS BETWEEN ? PRECEDING AND CURRENT ROW) AS ma_small_price,
       AVG(last_price) OVER (ORDER BY rownum ROWS BETWEEN ? PRECEDING AND CURRENT ROW) AS ma_medium_price
FROM temp_data';
PREPARE stmt FROM @sql;
EXECUTE stmt USING @psmall, @pmedium;
DEALLOCATE PREPARE stmt;
Demo on dbfiddle

Calculating time difference of every other row from a table

Note: The data for my question is on SQLFiddle right here where you
can query it.
How the table is created
I have data from a table that I put into a temp table using the logic below; the BETWEEN start and end timestamps are generated dynamically, based on other logic in the stored procedure.
SET @RowNum = 0;
DROP TEMPORARY TABLE IF EXISTS temp;
CREATE TEMPORARY TABLE temp AS
SELECT @RowNum := @RowNum + 1 AS RowNum
     , TimeStr
     , Value
FROM mytable
WHERE TimeStr BETWEEN '2018-01-31 06:15:56' AND '2018-01-31 19:27:09'
  AND iQuality = 3
ORDER BY TimeStr;
This gives me a temp table with a row number that increments by one in TimeStr order, oldest first, so the oldest record is RowNum 1.
Temp Table
The Data
You can get to this temp table data and play with the queries on the SQLFiddle I've created; you'll also see a few things I tried there that don't give me what I need.
Attempt to Clarify Further
I need to get the time for each ON and OFF set based on the TimeStr values in each set, and I can get this using the TIMEDIFF() function.
I'm having a hard time figuring out how to make it give me the result for each ON and OFF pair of records. The records are always in order from oldest to newest, and the row number always starts at 1.
I somehow need to give every two consecutive records (RowNum-wise) a matching CycleNum, starting at 1 and incrementing by one for each ON and OFF cycle or set.
I can use TIMEDIFF(MAX(TimeStr), MIN(TimeStr)) as the duration, but I'm not sure how best to group every two RowNum records in order, as explained, so each set gets an incrementing CycleNum value; a sketch of that grouping follows.
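A sketch of the grouping idea (assuming, since the sample data isn't shown, that RowNum starts at 1 with no gaps and the ON row of each pair sits at the odd row number):
-- Pair rows (1,2), (3,4), ... via CEIL(RowNum / 2), then aggregate each pair.
SELECT CEIL(RowNum / 2)                     AS CycleNum,
       MIN(TimeStr)                         AS StartTime,
       MAX(TimeStr)                         AS EndTime,
       TIMEDIFF(MAX(TimeStr), MIN(TimeStr)) AS Duration
FROM temp
GROUP BY CEIL(RowNum / 2);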
Expected Output
The expected output should look like the screenshot below: all ON and OFF cycles, i.e. every two RowNum values grouped in sequence.
Output Clarification
I need the output to include each ON and OFF cycle's start time, end time, and the duration between the start and stop.
If you can guarantee two things:
That the row numbers are strictly sequential with no gaps.
That the on/off flag is always alternating.
Then you can do this with a relatively simple join. The code looks like:
SELECT (@rn := @rn + 1) AS cycle, t.*, tnext.timestr,
       TIMEDIFF(tnext.timestr, t.timestr)
FROM temp t
JOIN temp tnext
    ON t.rownum = tnext.rownum - 1 AND
       t.value = 1 AND
       tnext.value = 0
CROSS JOIN (SELECT @rn := 0) params;
If these conditions are not true, then more complex logic is needed.
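For instance, if the flag can repeat, one option (a sketch under my own assumptions, not from the original answer) is to pair each ON row with the first OFF row that follows it in time:
-- For every ON row, find the earliest later OFF row; rows pair by time, not RowNum.
SELECT t.TimeStr AS StartTime,
       (SELECT MIN(tnext.TimeStr)
        FROM temp tnext
        WHERE tnext.TimeStr > t.TimeStr
          AND tnext.Value = 0) AS StopTime
FROM temp t
WHERE t.Value = 1;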
Here is a simpler one:
SELECT
t1.TimeStr AS StartTime,
t2.TimeStr AS EndTime,
TIMEDIFF(t2.TimeStr, t1.TimeStr) AS Duration
FROM temp t1
INNER JOIN temp t2 ON t2.RowNum = t1.RowNum + 1
WHERE
t2.Value = 0
AND t1.Value = 1
A quick and dirty way to do it would be this:
SELECT
T1.TimeStr AS StartTime,
(SELECT T2.TimeStr FROM temp AS T2 WHERE T2.RowNum = T1.RowNum+1) AS StopTime,
TIMEDIFF((SELECT T2.TimeStr FROM temp AS T2 WHERE T2.RowNum = T1.RowNum+1),
T1.TimeStr) AS Duration
FROM temp AS T1
WHERE Value = 1;
Seems like there must be better ways to do this. Two subqueries will be slow.
You could do it in two steps:
CREATE TEMPORARY TABLE startstop AS
SELECT
T1.TimeStr AS StartTime,
(SELECT T2.TimeStr FROM temp AS T2 WHERE T2.RowNum = T1.RowNum+1) AS StopTime,
CAST(NULL AS TIME) AS Duration
FROM temp AS T1
WHERE Value = 1;
UPDATE startstop SET Duration = TIMEDIFF(StopTime, StartTime);
However I cannot test this in the Fiddle.

Rails select top n records per group (memory leak)

I have this method using find_by_sql, which returns the 10 latest records for each source:
def latest_results
  Entry.find_by_sql(["
    select x.id, x.created_at, x.updated_at, x.source_id, x.`data`, x.`uuid`, x.source_entry_id
    from
      (select t.*,
              (@num := if(@group = `source_id`, @num + 1, if(@group := `source_id`, 1, 1))) row_number
       from (
         select d.id, d.created_at, d.updated_at, d.source_id, d.`data`, d.`uuid`, d.source_entry_id
         from `streams` a
         JOIN `stream_filters` b on b.stream_id = a.id
         JOIN `filter_results` c on c.filter_id = b.id
         JOIN `entries` d on d.id = c.entry_id
         where a.id = ?
       ) t
       order by `source_id`, created_at desc
      ) as x
    where x.row_number <= 10
    ORDER BY x.created_at DESC
  ", self.id])
end
It works properly in my local environment with a limited number of records.
In production I have a t2.micro with 2 GiB of memory serving the application, and this query runs it out of memory and the app freezes.
Any suggestions on how I can do this better? I want to solve it without increasing the size of the machine.
I had a similar problem once. The solution with MySQL variables seems neat at first, but it is hard to optimize; it appears to be doing a full table scan in your case.
I would recommend fetching the sources you want to display first, and then running a second query with multiple top-10 selects, one per source, all combined with a union.
The union of top-10 selects will have some repetitive statements which you can easily autogenerate with Ruby.
# pseudo code
source_ids = Entry.distinct.pluck(:source_id)
sql = source_ids.map do |id|
  # parentheses let each UNION member keep its own ORDER BY / LIMIT
  "(select * from entries where source_id = #{id.to_i} order by created_at desc limit 10)"
end.join("\nunion all\n")
Entry.find_by_sql(sql)
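If the server is MySQL 8.0+, the per-source numbering can also be done with a window function instead of session variables. A sketch against the entries table alone (the stream joins from the question are omitted here for brevity):
select id, created_at, updated_at, source_id, `data`, `uuid`, source_entry_id
from (
  -- number entries 1..n within each source, newest first
  select d.*,
         row_number() over (partition by d.source_id order by d.created_at desc) as rn
  from `entries` d
) x
where x.rn <= 10
order by x.created_at desc;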

MySQL LIMIT workaround

I need to limit records based on a percentage, but MySQL does not allow that. I need 10 percent of the `User Id`s, i.e. roughly (count(`User Id`) / max(Total_Users_bynow)) * 0.1.
My code is as follows:
select *
from flavia.TableforThe_top_10percent_of_the_user
where `User Id` in (
    select distinct(`User Id`)
    from flavia.TableforThe_top_10percent_of_the_user
    group by `User Id`
    having count(distinct(`User Id`)) <= round((count(`User Id`) / max(Total_Users_bynow)) * 0.1) * count(`User Id`)
);
Kindly help.
Consider splitting your problem into pieces. You can use user variables to get what you need. Quoting from this question's answers:
You don't have to solve every problem in a single query.
So... let's get this done. I'll not put your full query, but some examples:
-- Step 1. Get the total number of rows in your dataset
set @nrows = (select count(*) from (select ...) as a);
-- --------------------------------------^^^^^^^^^^
-- The full original query (or, if possible, a simpler version of it) goes here
-- Step 2. Calculate how many rows you want to retrieve
-- You may use "round()", "ceiling()" or "floor()", whichever fits your needs
set @limrows = round(@nrows * 0.1);
-- Step 3. Run your query:
select ...
limit @limrows;
After checking, I found this post, which says that my above approach won't work. There is, however, an alternative:
-- Step 1. Get the total number of rows in your dataset
set @nrows = (select count(*) from (select ...) as a);
-- --------------------------------------^^^^^^^^^^
-- The full original query (or, if possible, a simpler version of it) goes here
-- Step 2. Calculate how many rows you want to retrieve
-- You may use "round()", "ceiling()" or "floor()", whichever fits your needs
set @limrows = round(@nrows * 0.1);
-- Step 3. (UPDATED) Run your query.
-- You'll need to add a "rownumber" column to make this work.
select *
from (select @rownum := @rownum + 1 as rownumber
           , ... -- The rest of your columns
      from (select @rownum := 0) as init
         , ... -- The rest of your FROM definition
      order by ... -- Be sure to order your data
     ) as a
where rownumber <= @limrows;
Hope this helps. (I think it will work without a quirk this time.)
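Another route, if you'd rather keep a real LIMIT: a prepared statement can bind a user variable as the LIMIT value (the same trick as the window-frame example earlier). A sketch only; "mydataset" and "ordering_col" are hypothetical stand-ins for your query and its sort column:
-- LIMIT cannot reference @limrows directly, but EXECUTE ... USING can bind it.
SET @sql = 'select * from mydataset order by ordering_col limit ?';
PREPARE stmt FROM @sql;
EXECUTE stmt USING @limrows;
DEALLOCATE PREPARE stmt;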

MySQL update join from select very slow

We have a stored procedure that is used to prepare data for a report. Note that the schema isn't as normalized as it should be, but it is what it is and we cannot modify it, hence building the temporary table for the report. MySQL version is 5.1.70.
There is an UPDATE statement that updates the temp table from a join to a SELECT. Note that all columns in this query that should have an index do, including on the temp table (OI):
UPDATE `tmpOrderInquiry` OI
INNER JOIN (
SELECT `SalesOrderNo`,
GROUP_CONCAT(Inv.`InvoiceNo` ORDER BY `InvoiceNo` ASC SEPARATOR '~|~') as `InvoiceNoGRP`,
GROUP_CONCAT(DATE_FORMAT(date(`ShipDate`), ' %c/%d/%y') ORDER BY `ShipDate` ASC SEPARATOR '~|~') as `ShipDateGRP`,
GROUP_CONCAT(DATE_FORMAT(date(`LastPmtDate`), ' %c/%d/%y') ORDER BY `LastPmtDate` ASC SEPARATOR '~|~') as `LastPmtDateGRP`
FROM `InsynchInvoiceHistoryHeader` Inv
GROUP BY Inv.`SalesOrderNo`
) as OrdInv ON OI.`SalesOrderNo` = OrdInv.`SalesOrderNo`
SET OI.`InvoiceNo` = OrdInv.`InvoiceNoGRP`, OI.`ShipDate` = OrdInv.`ShipDateGRP`, OI.`LastPmtDate` = OrdInv.`LastPmtDateGRP`
This query takes approximately 70 seconds on average to complete. The select on its own executes sub-second. After much head banging, on a lark I replaced the above with:
CREATE TEMPORARY TABLE `tempWorking` AS SELECT
`SalesOrderNo`,
GROUP_CONCAT(Inv.`InvoiceNo` ORDER BY `InvoiceNo` ASC SEPARATOR '~|~') as `InvoiceNoGRP`,
GROUP_CONCAT(DATE_FORMAT(date(`ShipDate`), ' %c/%d/%y') ORDER BY `ShipDate` ASC SEPARATOR '~|~') as `ShipDateGRP`,
GROUP_CONCAT(DATE_FORMAT(date(`LastPmtDate`), ' %c/%d/%y') ORDER BY `LastPmtDate` ASC SEPARATOR '~|~') as `LastPmtDateGRP`
FROM `InsynchInvoiceHistoryHeader` Inv
GROUP BY Inv.`SalesOrderNo`;
UPDATE `tmpOrderInquiry` OI
INNER JOIN tempWorking as OrdInv ON OI.`SalesOrderNo` = OrdInv.`SalesOrderNo`
SET OI.`InvoiceNo` = OrdInv.`InvoiceNoGRP`, OI.`ShipDate` = OrdInv.`ShipDateGRP`, OI.`LastPmtDate` = OrdInv.`LastPmtDateGRP`
This query runs in about 2 seconds. Since the select itself is fast, I am at a loss to explain why the first query is so slow. Because the problem was acute, I released this change to production, but I don't like shipping a fix that, all things being equal, I would have expected to make things slower.
Any insight as to why the first update statement is so slow would be appreciated.
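One plausible explanation (my reading, hedged, since the execution plan isn't shown): in MySQL 5.1 a derived table in a multi-table UPDATE is materialized without any indexes, so the join against OrdInv can degenerate into a scan of the whole derived result for every row of the temp table. An explicit temporary table, by contrast, can be indexed on the join key before the UPDATE:
-- Sketch: index the materialized result so the UPDATE join becomes a lookup.
ALTER TABLE tempWorking ADD INDEX idx_salesorderno (SalesOrderNo);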

Select nth percentile from MySQL

I have a simple table of data, and I'd like to select the row that's at about the 40th percentile from the query.
I can do this right now by first querying to find the number of rows and then running another query that sorts and selects the nth row:
select count(*) as `total` from mydata;
which may return something like 93; 93 * 0.4 ≈ 37, so:
select * from mydata order by `field` asc limit 37,1;
Can I combine these two queries into a single query?
This will give you approximately the 40th percentile; it returns the row where about 40% of the rows are less than it. It sorts rows by how far they are from the 40th percentile, since no row may fall exactly on it.
SELECT m1.field, m1.otherfield, count(m2.field)
FROM mydata m1 INNER JOIN mydata m2 ON m2.field<m1.field
GROUP BY
m1.field,m1.otherfield
ORDER BY
ABS(0.4-(count(m2.field)/(select count(*) from mydata)))
LIMIT 1
As an exercise in futility (your current solution would probably be faster and is preferred), if the table is MyISAM (or you can live with the approximation of InnoDB):
SET @row = 0;
SELECT x.*
FROM information_schema.tables
JOIN (
    SELECT @row := @row + 1 AS `row`, mydata.*
    FROM mydata
    ORDER BY field ASC
) x
    ON x.row = round(information_schema.tables.table_rows * 0.4)
WHERE information_schema.tables.table_schema = database()
  AND information_schema.tables.table_name = 'mydata';
There's also this solution, which uses a monster string made by GROUP_CONCAT. I had to up the max on the output like so to get it to work:
SET SESSION group_concat_max_len = 1000000;
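The query itself isn't reproduced above; a sketch of the usual shape of that trick (my reconstruction, reusing the question's field/mydata names):
-- Build one ordered, comma-separated string of values, keep the first 40%
-- of its entries, then take the last entry of that prefix.
SELECT SUBSTRING_INDEX(
           SUBSTRING_INDEX(
               GROUP_CONCAT(field ORDER BY field SEPARATOR ','),
               ',',
               CEIL(COUNT(*) * 0.4)
           ),
           ',', -1
       ) AS percentile_40
FROM mydata;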
MySQL wizards out there: feel free to comment on the relative performance of the methods.