I have a stored procedure that inserts a few records on a daily basis. The same logic is executed for each day, but sequentially. To improve performance, I was thinking of introducing parallelism. Is there a way, or could someone point me to an example, of running some logic in a stored procedure in parallel?
EDIT:
The query I am using in my stored procedure is :
INSERT INTO tmp (time_interval, cnt, dat, txn_id) SELECT DATE_FORMAT(d.timeslice, '%H:%i') as time_interval
, COUNT(m.id) as cnt
, date(d.timeslice) as dat
, "test" as txn_id
FROM ( SELECT min_date + INTERVAL n*60 MINUTE AS timeslice
FROM ( SELECT DATE('2015-05-04') AS min_date
, DATE('2015-05-05') AS max_date) AS m
CROSS
JOIN numbers
WHERE min_date + INTERVAL n*60 MINUTE < max_date
) AS d
LEFT OUTER
JOIN mytable AS m
ON m.timestamp BETWEEN d.timeslice
AND d.timeslice + INTERVAL 60 MINUTE
GROUP
BY d.timeslice;
This query groups the records by hour for each day and inserts them into the tmp table. I want to run this query in parallel for each day instead of sequentially.
Thanks.
Is d a set of DATETIMEs that represent the 24 hours of one day? My gut says it can be simplified a bunch. It can be sped up by adding WHERE n BETWEEN 0 AND 23. Perhaps:
SELECT '2015-05-04' + INTERVAL n*60 MINUTE AS timeslice
FROM numbers
WHERE n BETWEEN 0 AND 23
What is in mytable? In particular, is the 'old' data static or changing? If it is unchanging, why repeatedly recompute it? Compute only for the last hour, store it into a permanent (not tmp) table. No need for parallelism.
If the data is changing, it would be better to avoid
ON m.timestamp BETWEEN d.timeslice
AND d.timeslice + INTERVAL 60 MINUTE
because (I think) it will not optimize well. Let's see the EXPLAIN SELECT....
In that case, use a stored procedure to compute the start and end times and construct (think CONCAT) the ON clause with constants in it.
Back to your question...
There is no way in MySQL, by itself, to get parallelism. You could write separate scripts to do the parallelism, each with its own parameters and connection.
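The "separate scripts, each with its own parameters and connection" idea could look roughly like the sketch below. It is only an outline under assumptions: day_ranges, run_in_parallel and describe_day are hypothetical names, and describe_day is a stand-in for a worker that would open its own MySQL connection and run the INSERT ... SELECT for one day.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta


def day_ranges(first_day, last_day):
    """Return (min_date, max_date) pairs, one per day, matching the
    min_date/max_date constants used in the query."""
    days = (last_day - first_day).days + 1
    return [
        (first_day + timedelta(days=i), first_day + timedelta(days=i + 1))
        for i in range(days)
    ]


def run_in_parallel(ranges, worker, max_workers=4):
    """Call worker(min_date, max_date) for every range on a thread pool.

    Results come back in the same order as `ranges`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda r: worker(*r), ranges))


def describe_day(min_date, max_date):
    # Hypothetical stand-in: a real worker would open its OWN MySQL
    # connection here and run the INSERT ... SELECT for this one day.
    return f"{min_date} -> {max_date}"


ranges = day_ranges(date(2015, 5, 4), date(2015, 5, 6))
for line in run_in_parallel(ranges, describe_day):
    print(line)
```

Each worker needs its own connection because a single MySQL connection executes one statement at a time.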
I have a real time data table with time stamps for different data points
Time_stamp, UID, Parameter1, Parameter2, ....
I have 400 UIDs so each time_stamp is repeated 400 times
I want to write a query that uses this table to check whether the real-time data flow to the SQL database is working as expected: a new timestamp should be available every 5 minutes.
What I usually do is query the DISTINCT values of time_stamp in the table, order them descending, inspect them visually, and copy them to Excel to calculate the difference in minutes between subsequent distinct time_stamp values.
Any difference over 5 minutes means I have a problem. I am trying to figure out how to do something similar in SQL, maybe getting a table that looks like this. I tried to use LEAD and DISTINCT together but could not write the code myself; I'm just getting started with SQL.
Time_stamp, LEAD over last timestamp
Thank you for your help
You can use the lag analytic function as follows:
select t.* from
(select t.*,
lag(Time_stamp) over (order by Time_stamp) as lg_ts
from your_table t) t
where timestampdiff(minute, lg_ts, Time_stamp) > 5
Or you can use not exists to find rows that have no earlier row within the preceding 5 minutes:
select t.*
from your_table t
where not exists
(select 1 from your_table tt
where tt.Time_stamp < t.Time_stamp
and tt.Time_stamp >= t.Time_stamp - interval 5 minute)
and t.Time_stamp <> (select min(tt.Time_stamp) from your_table tt)
lead() or lag() is the right approach (depending on whether you want to see the row at the start or end of the gap).
For the time comparison, I recommend direct comparisons:
select t.*
from (select t.*,
lead(Time_stamp) over (partition by uid order by Time_stamp) as next_time_stamp
from t
) t
where next_time_stamp > time_stamp + interval 5 minute;
Note: exactly 5 minutes seems unlikely. You might want a fudge factor such as:
where next_time_stamp > time_stamp + interval 5*60 + 10 second;
In MySQL, timestampdiff() returns the number of complete units between two values and discards any remainder. So, the difference in minutes between 00:00:00 and 00:00:59 is 0, and the difference between 00:00:00 and 00:05:59 is 5.
So, a difference of "5 minutes" could really be anywhere from exactly 5 minutes to 5 minutes and 59 seconds, and a test for > 5 only fires once the gap reaches 6 full minutes.
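For comparison, here is the same check written outside the database: a rough sketch of the lead() idea with a direct comparison and a fudge factor. The function name find_gaps, the 10-second slack and the sample timestamps are assumptions for illustration only.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, expected=timedelta(minutes=5), slack=timedelta(seconds=10)):
    """Return (start, end) pairs where consecutive distinct timestamps
    are further apart than expected + slack."""
    ordered = sorted(set(timestamps))  # DISTINCT ... ORDER BY
    return [
        (current, nxt)
        for current, nxt in zip(ordered, ordered[1:])  # nxt plays the role of lead()
        if nxt > current + expected + slack
    ]


# Hypothetical sample: one 12-minute gap between 00:05 and 00:17.
sample = [
    datetime(2024, 1, 1, 0, 0),
    datetime(2024, 1, 1, 0, 5),
    datetime(2024, 1, 1, 0, 5),  # repeated timestamp, as with many UIDs
    datetime(2024, 1, 1, 0, 17),
    datetime(2024, 1, 1, 0, 22, 5),
]
for start, end in find_gaps(sample):
    print(start, "->", end)
```

Comparing whole timestamps avoids the truncation to complete minutes described above.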
I want to find a way to sum up all the increments in the value of a column.
We provide delivery services to our customers. A customer can pay as he goes, but if he pays an upfront fee, he gets a better deal. There is a table that holds the customer's balance over time. I want to sum all the increments to the balance. I can't change the way the payment is recorded.
I have already coded a stored procedure that works, but it is kind of slow, so I'm looking for alternatives. I think that a single SQL statement might outperform my stored procedure, which uses loops.
My stored procedure selects the customer's rows in a given date range and inserts the result into a temp table X. It then pops rows from X one at a time, comparing the balance in each row against the previous row to detect an increment. If there is no increment, it pops another row and repeats; if there is an increment, it calculates the difference between that row and the previous one and inserts the result into another temp table Y.
When there are no rows left, the stored procedure performs a SUM over temp table Y, which tells you how much the customer has "refilled" his balance.
This is an example of the table X, and the expected result:
DATE BALANCE
---- -------
2019-02-01 200
2019-02-02 195 //from 200 to 195 there is a decrement, so it doesn't matter
2019-02-03 180
2019-02-04 150
2019-02-05 175 //there is an increment from 150 to 175, it's 25 that must be inserted in the temp table
2019-02-06 140
2019-02-07 180 //there is another increment, from 140 to 180, it's 40
So the resulting temp table Y must be something like this:
REFILL
------
25
40
The expected result is 65. My stored procedure returns this value but, as I said, it is kind of slow (it takes about 22 seconds to process 3900 rows, equivalent to approximately 3 days); I think this is because of the loops. I would like to explore other alternatives. Because of some details that I don't mention here, a single customer can have 1300 rows per day (the example is given in days, but I have rows by the minute). My tables are indexed, I think properly. I can't post my stored procedure, but it works as described (I know that "the devil is in the details"). Any suggestion will be appreciated.
Use a user-defined variable to hold the balance from the previous row, and then subtract it from the current row's balance.
SELECT SUM(refill) AS total_refill
FROM (
SELECT GREATEST(0, balance - @prev_balance) AS refill, @prev_balance := balance
FROM (
SELECT balance
FROM tableX
ORDER BY date) AS t
CROSS JOIN (SELECT @prev_balance := NULL) AS vars
) AS t
There is a quite well-known mechanism to deal with these: Use a variable inside a field.
SELECT @result:=0;
SELECT @lastbalance:=9999999999; -- whatever value is sure to be higher than any real balance
SELECT SUM(increments) AS total FROM (
SELECT
IF(balance>@lastbalance, balance-@lastbalance, 0) AS increments,
@lastbalance:=balance AS ignored
FROM X -- insert real table name here
WHERE
-- insert selector here
ORDER BY
-- insert real chronological sorter here
) AS baseview;
Use lag() in MySQL 8+:
select sum(balance - prev_balance) as refills
from (select t.*, lag(balance) over (order by date) prev_balance
from t
) t
where balance > prev_balance;
In older versions of MySQL this is tricky. If the rows fall on consecutive dates with no gaps, then a simple JOIN works:
select sum(t.balance - tprev.balance) as refills
from t join
t tprev
on tprev.date = t.date - interval 1 day
where t.balance > tprev.balance;
This may not be the case. Then the next best method is variables. But you have to be very careful. MySQL does not declare the order of evaluation of expressions in a SELECT. As the documentation explains:
The order of evaluation for expressions involving user variables is undefined. For example, there is no guarantee that SELECT @a, @a:=@a+1 evaluates @a first and then performs the assignment.
The variables need to be assigned and used in the same expression:
select sum(balance - prev_balance) as refills
from (select t.*,
(case when (@temp_prevb := @prevb) = NULL -- intentionally false
then -1
when (@prevb := balance)
then @temp_prevb
end) as prev_balance
from (select t.* from t order by date) t cross join
(select @prevb := NULL) params
) t
where balance > prev_balance;
And the final method is a correlated subquery:
select sum(balance - prev_balance) as refills
from (select t.*,
(select t2.balance
from t t2
where t2.date < t.date
order by t2.date desc
limit 1
) as prev_balance
from t
) t
where balance > prev_balance;
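All of the queries above implement the same rule: compare each balance with the previous one and add up only the positive differences. As a sanity check of that rule against the question's sample data, here is a minimal sketch in application code (the function name total_refill is hypothetical, and the rows are assumed to be already ordered by date):

```python
def total_refill(balances):
    """Sum only the positive jumps between consecutive balances."""
    return sum(
        current - previous
        for previous, current in zip(balances, balances[1:])  # previous plays the role of lag()
        if current > previous
    )


# Balances from the question's table X, already ordered by date.
balances = [200, 195, 180, 150, 175, 140, 180]
print(total_refill(balances))  # 25 + 40 = 65
```

This is a single pass over the rows, which is also what lag() gives you inside the database.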
I have a number of stores where I would like to sum the energy consumption so far this year compared with the same period last year. My challenge is that in the current year the stores have different date intervals in terms of delivered data. That means that store A may have data between 01.01.2018 and 20.01.2018, and store B may have data between 01.01.2018 and 28.01.2018. I would like to sum the same date intervals current year versus previous year.
Data looks like this
Store Date Sum
A 01.01.2018 12
A 20.01.2018 11
B 01.01.2018 33
B 28.01.2018 32
But there are millions of rows, and I would use these dates as references to get the same sums for the previous year.
This is my (erroneous) try:
SET @curryear = (SELECT YEAR(MAX(start_date)) FROM energy_data);
SET @maxdate_curryear = (SELECT MAX(start_date) FROM energy_data WHERE
YEAR(start_date) = @curryear);
SET @mindate_curryear = (SELECT MIN(start_date) FROM energy_data WHERE
YEAR(start_date) = @curryear);
-- the same date intervals last year
SET @maxdate_prevyear = (@maxdate_curryear - INTERVAL 1 YEAR);
SET @mindate_prevyear = (@mindate_curryear - INTERVAL 1 YEAR);
-- sums current year
CREATE TABLE t_sum_curr AS
SELECT name as name_curr, sum(kwh) as sum_curr, min(start_date) AS
min_date_curr, max(start_date) AS max_date_curr, count(distinct
start_date) AS ant_timer FROM energy_data WHERE agg_type = 'timesnivå'
AND start_date >= @mindate_curryear and start_date <= @maxdate_curryear GROUP BY NAME;
-- also seems fair, the same dates one year ago, figured I should find those first and in the next query use that to sum each stores between those date intervals
CREATE TABLE t_sum_prev AS
SELECT name_curr as name_curr2, (min_date_curr - INTERVAL 1 YEAR) AS
min_date_prev, (max_date_curr - INTERVAL 1 YEAR) as max_date_prev FROM
t_sum_curr;
-- getting into trouble!
CREATE TABLE the_results AS
SELECT name, start_date, sum(kwh) as sum_prev from energy_data where
agg_type = 'timesnivå' and
start_date >= @mindate_prevyear and start_date <=
@maxdate_prevyear group by name having start_date BETWEEN (SELECT
min_date_prev from t_sum_prev) AND
(SELECT max_date_prev from t_sum_prev);
This last query just tells me that my sub query returns more than 1 row and throws an error message.
I assume what you have is a list of energy consumption figures, where bills or readings have been taken at irregular times, so the consumption covers irregular periods.
The basic approach you need to take is to regularise the consumption periods: establish which days each period covers, then break each reading down into as many days as it covers, with the consumption for each day being the daily average for the period.
I'm assuming the consumption periods are entirely sequential (as a bill or reading normally would be), and not overlapping.
Because of the volume of rows involved (you say millions even in its current form), you might not want to leave the data in daily form - it might suffice to regroup them into regular weekly, monthly, or quarterly periods, depending on what level of granularity you require for comparison.
Once you have your regular periods, comparison will be as easy as cake.
If this is part of a report that will be run on an ongoing basis, you'd probably want to implement some logic that calculates a "regularised consumption" incrementally and on a scheduled basis and stores it in a summary table, with appropriate columns and indexes, so that you aren't having to process many millions of historical rows each time the report is run.
Trying to work around the irregular periods (if indeed it can be done) with fancy joins and on-the-fly averages, rather than tackling them head on, will likely lead to very difficult logic, and particularly on a data set of this size, dire performance.
EDIT: from the comments below.
@Alexander, I've knocked together an example of a query. I haven't tested it and I've written it all in a text editor, so excuse any small syntax errors. What I've come up with seems a bit complex (more complex than I imagined when I began), but I'm also a little bit tired, so I'm not sure whether it could be simplified further.
The only point I would make is that the performance of this query (or any such query), because of the nature of what it has to do in traversing date ranges, is likely to be appalling on a table with millions of rows. I stand by my earlier remarks that proper indexing of the source data will be crucial, and that summarising the source data into a larger granularity will massively aid performance (at the expense of a one-off hit to summarise it). Even daily granularity will reduce the number of rows by a factor of 24!
WITH energy_data_ext AS
(
SELECT
ed.name AS store_name
,YEAR(ed.start_date) AS reading_year
,ed.start_date AS reading_date
,ed.kwh AS reading_kwh
FROM
energy_data AS ed
)
,available_stores AS
(
SELECT ede.store_name
FROM energy_data_ext AS ede
GROUP BY ede.store_name
)
,current_reading_yr_per_store AS
(
SELECT
ede.store_name
,MAX(ede.reading_year) AS current_reading_year
FROM
energy_data_ext AS ede
GROUP BY
ede.store_name
)
,latest_reading_ranges_per_year AS
(
SELECT
ede.store_name
,ede.reading_year
,MAX(ede.reading_date) AS latest_reading_date_of_yr
FROM
energy_data_ext AS ede
GROUP BY
ede.store_name
,ede.reading_year
)
,store_reading_ranges AS
(
SELECT
avs.store_name
,lryps.current_reading_year
,lyrr.latest_reading_date_of_yr AS current_year_latest_reading_date
,(lryps.current_reading_year - 1) AS prev_reading_year
,(lyrr.latest_reading_date_of_yr - INTERVAL 1 YEAR) AS prev_year_latest_reading_date
FROM
available_stores AS avs
LEFT JOIN
current_reading_yr_per_store AS lryps
ON (lryps.store_name = avs.store_name)
LEFT JOIN
latest_reading_ranges_per_year AS lyrr
ON (lyrr.store_name = avs.store_name)
AND (lyrr.reading_year = lryps.current_reading_year)
)
-- at this stage, we should have all the calculations we need to
-- establish the range for the latest year, and the range for the year prior to that
,current_year_consumption AS
(
SELECT
avs.store_name
,SUM(cyed.reading_kwh) AS latest_year_kwh
FROM
available_stores AS avs
LEFT JOIN
store_reading_ranges AS srs
ON (srs.store_name = avs.store_name)
LEFT JOIN
energy_data_ext AS cyed
ON (cyed.store_name = avs.store_name)
AND (cyed.reading_year = srs.current_reading_year)
AND (cyed.reading_date <= srs.current_year_latest_reading_date)
GROUP BY
avs.store_name
)
,prev_year_consumption AS
(
SELECT
avs.store_name
,SUM(pyed.reading_kwh) AS prev_year_kwh
FROM
available_stores AS avs
LEFT JOIN
store_reading_ranges AS srs
ON (srs.store_name = avs.store_name)
LEFT JOIN
energy_data_ext AS pyed
ON (pyed.store_name = avs.store_name)
AND (pyed.reading_year = srs.prev_reading_year)
AND (pyed.reading_date <= srs.prev_year_latest_reading_date)
GROUP BY
avs.store_name
)
SELECT
avs.store_name
,srs.current_reading_year
,srs.current_year_latest_reading_date
,lyc.latest_year_kwh
,srs.prev_reading_year
,srs.prev_year_latest_reading_date
,pyc.prev_year_kwh
FROM
available_stores AS avs
LEFT JOIN
store_reading_ranges AS srs
ON (srs.store_name = avs.store_name)
LEFT JOIN
current_year_consumption AS lyc
ON (lyc.store_name = avs.store_name)
LEFT JOIN
prev_year_consumption AS pyc
ON (pyc.store_name = avs.store_name)
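To make the intent of the query easier to check on a small sample, the same per-store logic might be sketched in application code as follows. This is only an illustration under assumptions: same_period_totals is a hypothetical name, the input is a list of (store, date, kwh) tuples, and a latest reading on 29 February is mapped to 28 February of the previous year.

```python
from collections import defaultdict
from datetime import date


def same_period_totals(readings):
    """readings: iterable of (store, reading_date, kwh).

    Returns {store: (current_kwh, previous_kwh)}, where the previous-year
    sum stops at the store's latest current-year date shifted back a year.
    """
    by_store = defaultdict(list)
    for store, day, kwh in readings:
        by_store[store].append((day, kwh))

    totals = {}
    for store, rows in by_store.items():
        latest = max(day for day, _ in rows)
        # Assumption: 29 Feb is mapped to 28 Feb of the previous year.
        try:
            cutoff = latest.replace(year=latest.year - 1)
        except ValueError:
            cutoff = latest.replace(year=latest.year - 1, day=28)
        current = sum(k for d, k in rows if d.year == latest.year)
        previous = sum(k for d, k in rows if d.year == cutoff.year and d <= cutoff)
        totals[store] = (current, previous)
    return totals
```

Each store gets its own cutoff date, which is the point of the per-store CTEs in the query.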
First of all, sorry for the title, but I have no idea how to describe this:
I'm saving sessions in my table and I would like to get the count of sessions per hour to know how many sessions were active over the day. The sessions are specified by two timestamps: start and end.
Hopefully you can help me.
Here we go:
http://sqlfiddle.com/#!2/bfb62/2/0
While I'm still not sure how you'd like to compare the start and end dates, looks like using COUNT, YEAR, MONTH, DAY, and HOUR, you could come up with your desired results.
Possibly something similar to this:
SELECT COUNT(ID), YEAR(Start), HOUR(Start), DAY(Start), MONTH(Start)
FROM Sessions
GROUP BY YEAR(Start), HOUR(Start), DAY(Start), MONTH(Start)
And the SQL Fiddle.
What you want to do is rather hard in MySQL. You can, however, get an approximation without too much difficulty. The following counts up users who start and stop within one day:
select date(start), hour,
sum(case when hours.hour between hour(start) and hour(end) then 1 else 0
end) as GoodEstimate
from sessions s cross join
(select 0 as hour union all
select 1 union all
. . .
select 23
) hours
group by date(start), hour
When a user spans multiple days, the query is harder. Here is one approach, that assumes that there exists a user who starts during every hour:
select thehour, count(*)
from (select distinct date(start), hour(start),
cast(date(start) as datetime) + interval hour(start) hour as thehour
from sessions
) dh left outer join
sessions s
on s.start < thehour + interval 1 hour and
s.end >= thehour
group by thehour
Note: these are untested so might have syntax errors.
OK, this is another problem where the index table comes to the rescue.
An index table is something that everyone should have in their toolkit, preferably in the master database. It is a table with a single id int primary key indexed column containing sequential numbers from 0 to n where n is a number big enough to do what you need, 100,000 is good, 1,000,000 is better. You only need to create this table once but once you do you will find it has all kinds of applications.
For your problem you need to consider each hour and, if I understand your problem you need to count every session that started before the end of the hour and hasn't ended before that hour starts.
Here is the SQL fiddle for the solution.
What it does is use a known sequential number from the indextable (only 0 to 100 for this fiddle - just over 4 days - you can see why you need a big n) to link with your data at the top and bottom of the hour.
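The counting rule described above (a session counts for an hour if it started before the end of the hour and has not ended before the hour starts) can be sketched like this. The function name sessions_per_hour and its parameters are hypothetical; the loop variable n stands in for the sequential number taken from the index table.

```python
from datetime import datetime, timedelta


def sessions_per_hour(sessions, first_hour, hours):
    """Count sessions active in each of `hours` consecutive hourly slots.

    A session (start, end) is active in a slot if it started before the
    slot ends and has not ended before the slot starts.
    """
    counts = {}
    for n in range(hours):  # n plays the role of the index table's id
        slot_start = first_hour + timedelta(hours=n)
        slot_end = slot_start + timedelta(hours=1)
        counts[slot_start] = sum(
            1 for start, end in sessions if start < slot_end and end >= slot_start
        )
    return counts
```

Because every hour in the range is generated from n, hours with no sessions still appear with a count of 0, which the approach based on distinct start hours cannot guarantee.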
I have a MySQL table like this one:
day int(11)
hour int(11)
amount int(11)
Day is an integer with a value that spans from 0 to 365; assume hour is a timestamp and amount is just a simple integer. What I want to do is select the value of the amount field for a certain group of days (for example from 0 to 10), but I only need the last value of amount available for each day, which is practically the row where the hour field has its max value (within that day). This doesn't sound too hard, but the solution I came up with is completely inefficient.
Here it is:
SELECT q.day, q.amount
FROM amt_table q
WHERE q.day >= 0 AND q.day <= 4 AND q.hour = (
SELECT MAX(p.hour) FROM amt_table p WHERE p.day = q.day
) GROUP BY day
It takes 5 seconds to execute that query on an 11k-row table, and that is just for a span of 5 days; I may need to select a span of an entire month or year, so this is not a valid solution.
Any help finding another solution or optimizing this one would be really appreciated.
EDIT
No indexes are set, but (day, hour, amount) could be a PRIMARY KEY if needed
Use:
SELECT a.day,
a.amount
FROM AMT_TABLE a
JOIN (SELECT t.day,
MAX(t.hour) AS max_hour
FROM AMT_TABLE t
GROUP BY t.day) b ON b.day = a.day
AND b.max_hour = a.hour
WHERE a.day BETWEEN 0 AND 4
I think you're using the GROUP BY day just to get a single amount value per day, but it's not reliable because in MySQL, columns not in the GROUP BY are arbitrary -- the value could change. Sadly, MySQL before 8.0 doesn't support analytic functions (ROW_NUMBER, etc.), which is what you'd typically use for cases like these.
Look at indexes on the primary keys first, then add indexes on the columns used to join tables together. Composite indexes (more than one column to an index) are an option too.
I think the problem is the subquery in the where clause. MySQL will first calculate this "SELECT MAX(p.hour) FROM amt_table p WHERE p.day = q.day" for the whole table and only afterwards select the days. Not quite efficient :-)
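The join against the per-day MAX(hour) in the answer above amounts to keeping, for each day, the row with the largest hour. A minimal sketch of that idea in a single pass (last_amount_per_day is a hypothetical name, and rows are assumed to be (day, hour, amount) tuples):

```python
def last_amount_per_day(rows, first_day, last_day):
    """rows: iterable of (day, hour, amount).

    Returns {day: amount} using the row with the largest hour for each
    day in [first_day, last_day], in a single pass over the data.
    """
    latest = {}  # day -> (hour, amount) of the latest row seen so far
    for day, hour, amount in rows:
        if first_day <= day <= last_day:
            if day not in latest or hour > latest[day][0]:
                latest[day] = (hour, amount)
    return {day: amount for day, (hour, amount) in sorted(latest.items())}


# Hypothetical sample rows; day 6 falls outside the requested range.
rows = [(0, 1, 10), (0, 5, 20), (1, 3, 7), (1, 2, 9), (6, 1, 99)]
print(last_amount_per_day(rows, 0, 4))
```

The maximum per day is computed once rather than once per candidate row, which is the difference between the join and the original correlated subquery.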