We are generating cross-sell data for a shop. We want to display something like "customers who looked at this product also looked at these products".
To generate this data, we run the following query daily against session-based product-view data.
INSERT INTO
product_viewed_together
(
product,
product_associate,
viewed
)
SELECT
v.product,
v2.product,
COUNT(*)
FROM
product_view v
INNER JOIN
product_view v2
ON
v2.session = v.session
AND v2.product != v.product
AND DATE_ADD(v2.created, INTERVAL %d DAY) > NOW()
WHERE
DATE_ADD(v.created, INTERVAL %d DAY) > NOW()
GROUP BY
v.product,
v2.product;
The table product_view is joined to itself. As this table is quite big (circa 26 million rows), the result is even bigger, and the query takes a huge amount of time.
I am not sure we chose a layout that fits the problem well. Is there a better way to store and generate this data?
Make the date tests sargable:
DATE_ADD(v.created, INTERVAL %d DAY) > NOW()
-->
v.created > NOW() - INTERVAL %d DAY
Is product_view a VIEW? Or a TABLE? If a table, provide two "covering" indexes:
INDEX(created, session, product) -- (for v)
INDEX(session, created, product) -- (for v2)
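Putting both changes together, the nightly INSERT might look roughly like this (an untested sketch that just reuses the table and column names from the question):
INSERT INTO product_viewed_together (product, product_associate, viewed)
SELECT
    v.product,
    v2.product,
    COUNT(*)                                    -- how often the two products shared a session
FROM product_view v
INNER JOIN product_view v2
        ON  v2.session = v.session
        AND v2.product != v.product
        AND v2.created > NOW() - INTERVAL %d DAY  -- sargable, so the covering indexes can be used
WHERE v.created > NOW() - INTERVAL %d DAY
GROUP BY
    v.product,
    v2.product;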
Perhaps all the counts you get are even? This bug can be fixed in about three ways, each of which will double the speed. I think the optimal one is to change one line in the ON clause:
DATE_ADD(v2.created, INTERVAL %d DAY) > NOW()
-->
v2.created > v.created
I think that will double the speed.
However, the counts may not be exactly correct if you can have two different products with the same created.
Another issue: You will end up with
prod assoc CT
123 234 43
234 123 76 -- same pair, opposite order
My revised test says that 234 came before 123 more often than the other way.
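If what you really want is each unordered pair counted only once, a different option (not one of the suggestions above) is to keep a single canonical order of the pair by comparing the product ids themselves. A sketch, assuming product ids are plain integers:
SELECT
    v.product      AS product,            -- always the smaller id of the pair
    v2.product     AS product_associate,  -- always the larger id
    COUNT(*)       AS viewed
FROM product_view v
INNER JOIN product_view v2
        ON  v2.session = v.session
        AND v2.product > v.product          -- canonical order instead of !=, so each pair is counted once
        AND v2.created > NOW() - INTERVAL %d DAY
WHERE v.created > NOW() - INTERVAL %d DAY
GROUP BY v.product, v2.product;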
Give those things a try. But if you still need more, I have another, more invasive, thought.
Related
I have a number of stores where I would like to sum the energy consumption so far this year compared with the same period last year. My challenge is that in the current year the stores have different date intervals in terms of delivered data. That means that store A may have data between 01.01.2018 and 20.01.2018, and store B may have data between 01.01.2018 and 28.01.2018. I would like to sum the same date intervals current year versus previous year.
Data looks like this
Store Date Sum
A 01.01.2018 12
A 20.01.2018 11
B 01.01.2018 33
B 28.01.2018 32
But there are millions of rows, and I would use these dates as references to get the sums for the same periods in the previous year.
This is my (erroneous) try:
SET @curryear = (SELECT YEAR(MAX(start_date)) FROM energy_data);
SET @maxdate_curryear = (SELECT MAX(start_date) FROM energy_data WHERE
YEAR(start_date) = @curryear);
SET @mindate_curryear = (SELECT MIN(start_date) FROM energy_data WHERE
YEAR(start_date) = @curryear);
-- the same date intervals last year
SET @maxdate_prevyear = (@maxdate_curryear - INTERVAL 1 YEAR);
SET @mindate_prevyear = (@mindate_curryear - INTERVAL 1 YEAR);
-- sums current year
CREATE TABLE t_sum_curr AS
SELECT name as name_curr, sum(kwh) as sum_curr, min(start_date) AS
min_date_curr, max(start_date) AS max_date_curr, count(distinct
start_date) AS ant_timer FROM energy_data WHERE agg_type = 'timesnivå'
AND start_date >= @mindate_curryear and start_date <= @maxdate_curryear GROUP BY NAME;
-- also seems fair, the same dates one year ago, figured I should find those first and in the next query use that to sum each stores between those date intervals
CREATE TABLE t_sum_prev AS
SELECT name_curr as name_curr2, (min_date_curr - INTERVAL 1 YEAR) AS
min_date_prev, (max_date_curr - INTERVAL 1 YEAR) as max_date_prev FROM
t_sum_curr;
-- getting into trouble!
CREATE TABLE the_results AS
SELECT name, start_date, sum(kwh) as sum_prev from energy_data where
agg_type = 'timesnivå' and
start_date >= @mindate_prevyear and start_date <=
@maxdate_prevyear group by name having start_date BETWEEN (SELECT
min_date_prev from t_sum_prev) AND
(SELECT max_date_prev from t_sum_prev);
This last query just tells me that my sub query returns more than 1 row and throws an error message.
I assume what you have is a list of energy consumption figures, where bills or readings have been taken at irregular times, so the consumption covers irregular periods.
The basic approach you need to take is to regularise the consumption periods: establish which days each period covers, then break each reading down into as many days as it covers, with the consumption for each day being the daily average over the period.
I'm assuming the consumption periods are entirely sequential (as a bill or reading normally would be), and not overlapping.
Because of the volume of rows involved (you say millions even in its current form), you might not want to leave the data in daily form - it might suffice to regroup them into regular weekly, monthly, or quarterly periods, depending on what level of granularity you require for comparison.
Once you have your regular periods, comparison will be a piece of cake.
If this is part of a report that will be run on an ongoing basis, you'd probably want to implement some logic that calculates a "regularised consumption" incrementally and on a scheduled basis and stores it in a summary table, with appropriate columns and indexes, so that you aren't having to process many millions of historical rows each time the report is run.
Trying to work around the irregular periods (if indeed it can be done) with fancy joins and on-the-fly averages, rather than tackling them head on, will likely lead to very difficult logic, and particularly on a data set of this size, dire performance.
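To make the "regularise into days" idea concrete, here is a rough sketch. The table and column names (a calendar table cal with one row per date, and a readings table with explicit period_start/period_end bounds) are assumptions for illustration, not the poster's actual schema:
-- Spread each reading evenly over the days its period covers.
CREATE TABLE daily_consumption AS
SELECT r.name,
       cal.d                                                 AS consumption_date,
       r.kwh / (DATEDIFF(r.period_end, r.period_start) + 1)  AS daily_kwh
FROM readings r
JOIN cal
  ON cal.d BETWEEN r.period_start AND r.period_end;
Once a table like daily_consumption exists (ideally maintained incrementally, as described above), the year-over-year comparison reduces to two SUMs over matching date ranges per store.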
EDIT: from the comments below.
@Alexander, I've knocked together an example of a query. I haven't tested it and I've written it all in a text editor, so excuse any small syntax errors. What I've come up with seems a bit complex (more complex than I imagined when I began), but I'm also a little bit tired, so I'm not sure whether it could be simplified further.
The only point I would make is that the performance of this query (or any such query), because of the nature of what it has to do in traversing date ranges, is likely to be appalling on a table with millions of rows. I stand by my earlier remarks, that proper indexing of the source data will be crucial, and summarising the source data into a larger granularity will massively aid performance (at the expense of a one-off hit to summarise it). Even daily granularity will reduce the number of rows by a factor of 24!
WITH energy_data_ext AS
(
SELECT
ed.name AS store_name
,YEAR(ed.start_date) AS reading_year
,ed.start_date AS reading_date
,ed.kwh AS reading_kwh
FROM
energy_data AS ed
)
,available_stores AS
(
SELECT ede.store_name
FROM energy_data_ext AS ede
GROUP BY ede.store_name
)
,current_reading_yr_per_store AS
(
SELECT
ede.store_name
,MAX(ede.reading_year) AS current_reading_year
FROM
energy_data_ext AS ede
GROUP BY
ede.store_name
)
,latest_reading_ranges_per_year AS
(
SELECT
ede.store_name
,ede.reading_year
,MAX(ede.reading_date) AS latest_reading_date_of_yr
FROM
energy_data_ext AS ede
GROUP BY
ede.store_name
,ede.reading_year
)
,store_reading_ranges AS
(
SELECT
avs.store_name
,lryps.current_reading_year
,lyrr.latest_reading_date_of_yr AS current_year_latest_reading_date
,(lryps.current_reading_year - 1) AS prev_reading_year
,(lyrr.latest_reading_date_of_yr - INTERVAL 1 YEAR) AS prev_year_latest_reading_date
FROM
available_stores AS avs
LEFT JOIN
current_reading_yr_per_store AS lryps
ON (lryps.store_name = avs.store_name)
LEFT JOIN
latest_reading_ranges_per_year AS lyrr
ON (lyrr.store_name = avs.store_name)
AND (lyrr.reading_year = lryps.current_reading_year)
)
--at this stage, we should have all the calculations we need to
--establish the range for the latest year, and the range for the year prior to that
,current_year_consumption AS
(
SELECT
avs.store_name
,SUM(cyed.reading_kwh) AS latest_year_kwh
FROM
available_stores AS avs
LEFT JOIN
store_reading_ranges AS srs
ON (srs.store_name = avs.store_name)
LEFT JOIN
energy_data_ext AS cyed
ON (cyed.store_name = avs.store_name)
AND (cyed.reading_year = srs.current_reading_year)
AND (cyed.reading_date <= srs.current_year_latest_reading_date)
GROUP BY
avs.store_name
)
,prev_year_consumption AS
(
SELECT
avs.store_name
,SUM(pyed.reading_kwh) AS prev_year_kwh
FROM
available_stores AS avs
LEFT JOIN
store_reading_ranges AS srs
ON (srs.store_name = avs.store_name)
LEFT JOIN
energy_data_ext AS pyed
ON (pyed.store_name = avs.store_name)
AND (pyed.reading_year = srs.prev_reading_year)
AND (pyed.reading_date <= srs.prev_year_latest_reading_date)
GROUP BY
avs.store_name
)
SELECT
avs.store_name
,srs.current_reading_year
,srs.current_year_latest_reading_date
,lyc.latest_year_kwh
,srs.prev_reading_year
,srs.prev_year_latest_reading_date
,pyc.prev_year_kwh
FROM
available_stores AS avs
LEFT JOIN
store_reading_ranges AS srs
ON (srs.store_name = avs.store_name)
LEFT JOIN
current_year_consumption AS lyc
ON (lyc.store_name = avs.store_name)
LEFT JOIN
prev_year_consumption AS pyc
ON (pyc.store_name = avs.store_name)
I have a table with 3 days of data (about 4000 rows). The 3 sets of data are all from a 30-minute session. I want to have the start and end time of each session.
I currently use this SQL, but it's quite slow (even with only 4000 records). The datetime column is indexed, but I think the index is not properly used because of the conversion from datetime to date.
The table layout is fixed, so I cannot change any part of that. The query takes about 20 seconds to run (and longer every day). Anyone have some good tips to make it faster?
select distinct
date(a.datetime) datetime,
(select max(b.datetime) from bike b where date(b.datetime) = date(a.datetime)),
(select min(c.datetime) from bike c where date(c.datetime) = date(a.datetime))
from bike a
Maybe I'm missing something, but...
Isn't the result returned by the OP query equivalent to the result from this query:
SELECT DATE(a.datetime) AS datetime
, MAX(a.datetime) AS max_datetime
, MIN(a.datetime) AS min_datetime
FROM bike a
GROUP BY DATE(a.datetime)
Alex, warning: this is typed "freehand" so it may have some syntax problems, but it kind of shows what I was trying to convey.
select distinct
date(a.datetime) datetime,
(select max(b.datetime) from bike b where b.datetime between date(a.datetime) and (date(a.datetime) + interval 1 day - interval 1 second)),
(select min(c.datetime) from bike c where c.datetime between date(a.datetime) and (date(a.datetime) + interval 1 day - interval 1 second))
from bike a
Instead of comparing date(b.datetime), it allows comparing the actual b.datetime against a range calculated from a.datetime. Hopefully this helps you out and does not make things murkier.
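One more hedged suggestion: for the range comparison to pay off, there has to be an index on the raw datetime column itself. Assuming the column really is named datetime (and assuming adding an index is acceptable, since it doesn't change the columns of the table), something like:
ALTER TABLE bike ADD INDEX idx_bike_datetime (`datetime`);  -- backticks because datetime is also a keyword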
The problem:
We're getting stock prices and trades from a provider, and to speed things up we cache the trades as they come in (1 trade per second per stock is not a lot). We've got around 2,000 stocks, so technically we're expecting as much as 120,000 trades per minute (2,000 * 60). Now, these prices are realtime, but to avoid paying licensing fees to show this data to the customer we need to show the prices delayed by 15 minutes. (We need the realtime prices internally, which is why we've bought and pay for them - they are NOT cheap!)
I feel like I've tried everything, and I've run into an uncountable number of problems.
Things I've tried:
1:
Run a cron job every 15 seconds with a query that finds, for each stock, which trade was the latest one more than 15 minutes ago, so we have its ID (for joins):
SELECT
MAX(`time`) as `max_time`,
`stock_id`
FROM
`stocks_trades`
WHERE
`time` <= DATE_SUB(NOW(), INTERVAL 15 MINUTE)
AND
`time` > '0000-00-00 00:00:00'
GROUP BY
`stock_id`
This works very fast - 1.8 seconds with ~2,000,000 rows, but the following is very slow:
SELECT
st.id,
st.stock_id
FROM
(
SELECT
MAX(`time`) as `max_time`,
`stock_id`
FROM
`stocks_trades`
WHERE
`time` <= DATE_SUB(NOW(), INTERVAL 15 MINUTE)
AND
`time` > '0000-00-00 00:00:00'
GROUP BY
`stock_id`
) as `tmp`
INNER JOIN
`stocks_trades` as `st`
ON
(tmp.max_time = st.time AND tmp.stock_id = st.stock_id)
GROUP BY
`stock_id`
That takes ~180-200 seconds, which is WAY too slow. There's an index on both time and stock_id (individually).
2:
Switch between InnoDB/MyISAM. I'd think I would need InnoDB (we're inserting A LOT of rows from multiple threads, we don't want to block between each insert) - InnoDB seems faster at inserting, but WAY slower at reading (we require both, obviously).
3:
Optimize tables every day. Still slow.
What I think might help:
Using ints instead of DateTime. Perhaps (since the markets are open from 9-22) keep a custom int time, which would be "seconds since 9 o'clock this morning" and use the same method as above (it seems to make some difference, albeit not a lot)
Use MEMORY instead of InnoDB - probably not the best idea with ~1,800,000 rows per 15 minutes, even though we have plenty of memory
Save price/stockID/time in memory in our application receiving the prices (I don't see how this would be any different than using MEMORY, except my code probably will be worse than MySQL's own code)
Keep deleting trades older than 15 minutes in hopes that it'll speed up the queries
Some magic query that I just haven't thought of that uses the indexes perfectly and does magical things
Give up and kill oneself after spending ~12 hours trying to wrap my head around this and different solutions
Since you are joining against your subquery on two columns (stock_id, time), MySQL ought to be able to make use of a compound index across both of them, while it cannot make use of either of the individual single-column indexes you already have.
ALTER TABLE `stocks_trades` ADD INDEX `idx_stock_id_time` (`stock_id`, `time`)
Assuming you have an auto-incrementing id as the primary key on stocks_trades (call it stock_trade_id), you could select MAX(stock_trade_id) AS last_id in the inner query and then do an inner join on last_id = stock_trade_id, so you will be joining on your PK and have no date comparisons in your main join.
SELECT
st.id,
st.stock_id
FROM
(
SELECT
MAX(`stock_trade_id`) as `last_id`,
`stock_id`
FROM
`stocks_trades`
WHERE
`time` <= DATE_SUB(NOW(), INTERVAL 15 MINUTE)
AND
`time` > '0000-00-00 00:00:00'
GROUP BY
`stock_id`
) as `tmp`
INNER JOIN
`stocks_trades` as `st`
ON
(tmp.last_id = st.stock_trade_id)
GROUP BY
`stock_id`
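As a follow-up (an untested assumption based only on the column names in the query above): a compound index that also includes the id column should let the inner GROUP BY be satisfied from the index alone:
-- Column names taken from the query above; adjust to the real schema.
ALTER TABLE `stocks_trades`
    ADD INDEX `idx_stock_time_id` (`stock_id`, `time`, `stock_trade_id`);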
What happens if you run something like this? Try to change it to include the proper column name for the price if needed:
SELECT st.id, st.stock_id
FROM stocks_trades as st
WHERE time <= DATE_SUB(NOW(), INTERVAL 15 MINUTE)
AND time > DATE_SUB(NOW(), INTERVAL 45 MINUTE)
AND not exists (select 1 from stocks_trades as st2 where st2.time <= DATE_SUB(NOW(), INTERVAL 15 MINUTE) and st2.stock_id = st.stock_id and st2.time > st.time)
hope it helps!
First of all, sorry for that title, but I have no idea how to describe it:
I'm saving sessions in my table and I would like to get the count of sessions per hour to know how many sessions were active over the day. The sessions are specified by two timestamps: start and end.
Hopefully you can help me.
Here we go:
http://sqlfiddle.com/#!2/bfb62/2/0
While I'm still not sure how you'd like to compare the start and end dates, it looks like using COUNT, YEAR, MONTH, DAY, and HOUR you could come up with your desired results.
Possibly something similar to this:
SELECT COUNT(ID), YEAR(Start), HOUR(Start), DAY(Start), MONTH(Start)
FROM Sessions
GROUP BY YEAR(Start), HOUR(Start), DAY(Start), MONTH(Start)
And the SQL Fiddle.
What you want to do is rather hard in MySQL. You can, however, get an approximation without too much difficulty. The following counts up users who start and stop within one day:
select date(start), hour,
sum(case when hours.hour between hour(s.start) and hour(s.end) then 1 else 0
end) as GoodEstimate
from sessions s cross join
(select 0 as hour union all
select 1 union all
. . .
select 23
) hours
group by date(start), hour
When a user spans multiple days, the query is harder. Here is one approach, that assumes that there exists a user who starts during every hour:
select thehour, count(*)
from (select distinct date(start), hour(start),
cast(date(start) as datetime) + interval hour(start) hour as thehour
from sessions
) dh left outer join
sessions s
on s.start <= thehour + interval 1 hour and
s.end >= thehour
group by thehour
Note: these are untested so might have syntax errors.
OK, this is another problem where the index table comes to the rescue.
An index table is something that everyone should have in their toolkit, preferably in the master database. It is a table with a single indexed id INT PRIMARY KEY column containing sequential numbers from 0 to n, where n is big enough for what you need; 100,000 is good, 1,000,000 is better. You only need to create this table once, but once you do, you will find it has all kinds of applications.
For your problem you need to consider each hour and, if I understand it correctly, count every session that started before the end of the hour and hasn't ended before that hour starts.
Here is the SQL fiddle for the solution.
What it does is use a known sequential number from the indextable (only 0 to 100 for this fiddle - just over 4 days - you can see why you need a big n) to link with your data at the top and bottom of the hour.
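In outline, the approach in the fiddle looks something like the sketch below. The names are assumptions for illustration: indextable(id) holds 0..n, sessions has start and end datetime columns, and '2014-01-01 00:00:00' stands in for whatever anchor datetime the data actually starts at:
SELECT '2014-01-01 00:00:00' + INTERVAL i.id HOUR AS hour_start,
       COUNT(s.start)                             AS active_sessions
FROM indextable i
LEFT JOIN sessions s
       ON s.start <  '2014-01-01 00:00:00' + INTERVAL (i.id + 1) HOUR  -- started before the hour ends
      AND s.end   >= '2014-01-01 00:00:00' + INTERVAL  i.id      HOUR  -- hasn't ended before the hour starts
GROUP BY i.id;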
I have a query that looks like (I've tried to strip out non-relevant fields/joins for clarity):
SET @num = -1;
SELECT
*,
CAST(DATE_ADD('2012-04-01', interval @num := @num+1 day) AS DATE) AS date_sequence,
DAYOFWEEK(DATE_ADD('2012-04-01', interval @num+1 day)) AS day_week
FROM batch AS b1
left join (
select
batch.`startedDate` AS batch_startedDate,
epiRun.`runType` AS epiRun_runType
.... other fields selected.........
from batch
left join `epiRun` epiRun ON epiRun.`batchID`= batch.`keyID`
.......other joins........
WHERE batch.`startedDate` >= '2012-04-01' AND batch.`startedDate` <= '2012-04-18'
ORDER BY batch.`startedDate` ASC)
AS b2 ON cast((b2.`batch_startedDate`) AS DATE) = CAST(DATE_ADD('2012-04-01', interval @num+1
day) AS DATE)
WHERE
(DATE_ADD('2012-04-01', interval @num+1 day) <= '2012-04-18')
The nested select query performs as I expect when run by itself. This query has a couple of problems:
- Every field from the batch table is selected, but since this is for an iReport it's not too much of a problem.
- I get the list of dates from 1st April to 18th April, but if I have multiple batches on a day then I only get one displayed; ideally I'd like multiple identical entries in the date column with a unique entry for each batch. It is important that I can see when there are days with no batches.
Example of table I have:
Date Batch
01/04/2012 TS01
02/04/2012 TS03
03/04/2012 null
and what I'd like to generate:
Date Batch
01/04/2012 TS01
01/04/2012 TS02
02/04/2012 TS03
02/04/2012 TS04
03/04/2012 null
I personally would create a separate "dates" table and populate it with all the dates from say 1/1/2000 through 12/31/2050 (or whatever date range will cover all potential queries) and then do a left join from that table to the batch and epiRun tables.
I think that this is a much cleaner way to do what you are looking for and will give you exactly the results you desire.
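A minimal sketch of that idea, with assumed names (dates_lookup for the calendar table and a hypothetical batchName column, since the real batch columns aren't shown in the question):
-- Calendar table, populated once with every date in the range you care about.
CREATE TABLE dates_lookup (d DATE PRIMARY KEY);

-- LEFT JOIN from the calendar to the batches, so days with no batch still show up (as NULL):
SELECT dl.d          AS `Date`,
       b.`batchName` AS Batch              -- hypothetical column name
FROM dates_lookup dl
LEFT JOIN batch b
       ON DATE(b.`startedDate`) = dl.d
WHERE dl.d BETWEEN '2012-04-01' AND '2012-04-18'
ORDER BY dl.d, b.`batchName`;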