Pageviews to sessions without loop - mysql

I have a bit of a challenging SQL problem: Let's say you have a table of pageviews which looks like this:
CREATE TABLE pageviews (
id INT(11) NOT NULL AUTO_INCREMENT,
user_id INT(11) NOT NULL,
timestamp DATETIME NOT NULL,
PRIMARY KEY (id)
)
In this table, you have a very large number of records (>100 million). From this data, you want to generate another table which looks like this:
CREATE TABLE sessions (
id INT(11) NOT NULL AUTO_INCREMENT,
user_id INT(11) NOT NULL,
started_at DATETIME NOT NULL,
ended_at DATETIME NOT NULL,
PRIMARY KEY (id)
)
The rule is that a session is any sequence of an arbitrary number of pageviews which does not contain any gap larger than 30 minutes.
Now I have managed to generate this table using a stored procedure which uses a loop to get the sessions:
DELIMITER |
CREATE PROCEDURE generate_sessions()
BEGIN
TRUNCATE sessions;
INSERT INTO sessions
SELECT NULL, p.user_id, p.timestamp, p.timestamp FROM pageviews p
LEFT JOIN pageviews p2 ON p2.user_id = p.user_id AND p2.timestamp > p.timestamp AND p2.timestamp < DATE_ADD(p.timestamp, INTERVAL 30 MINUTE)
WHERE p2.id IS NULL;
REPEAT
UPDATE sessions s
LEFT JOIN pageviews p ON p.user_id = s.user_id AND p.timestamp < s.started_at AND p.timestamp > DATE_SUB(s.started_at, INTERVAL 30 MINUTE)
SET s.started_at = p.timestamp
WHERE p.id IS NOT NULL;
UNTIL ROW_COUNT() = 0 END REPEAT;
END |
Basically, what the procedure does is to first get the latest pageview of any session, insert it into the table, and then iteratively backtrack until all sessions are complete.
Needless to say, this is incredibly slow. Anybody have a better solution, preferably one that involves only one query?

This is a hard problem in MySQL. You really want window functions for this.
But, there is a way. First, you need to define each session. For this, find the gaps that are greater than half an hour between pageviews. The following query looks backwards, so this is called PrevSessionEnd.
Next, because time is increasing, select the maximum of this value over all pageviews for the same user that occur on or before a given pageview. The result is that every pageview gets a value that is constant over its session. For the first session it will be NULL, for the second it will be the last timestamp of the first session, and so on.
Then group by this value.
select user_id, MIN(timestamp) as started_at, MAX(timestamp) as ended_at
from (select pv.*,
(select MAX(g.PrevSessionEnd)
from (select p.user_id, p.timestamp,
-- previous pageview of the same user, kept only if the gap exceeds 30 minutes
(select case when MAX(p2.timestamp) < p.timestamp - interval 30 minute
then MAX(p2.timestamp) end
from pageviews p2
where p2.user_id = p.user_id and p2.timestamp < p.timestamp
) as PrevSessionEnd
from pageviews p
) g
where g.user_id = pv.user_id and g.timestamp <= pv.timestamp
) as SessionGrouper
from pageviews pv
) s
group by user_id, SessionGrouper
This particular query has not been tested, so it might have syntax errors.
I'm leaving the final insert up to you.
This will, in turn, run faster if you have an index on pageviews(user_id, timestamp). The subqueries can then be resolved using only this index.
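The answer notes that window functions are what you really want here. As a rough sketch, assuming MySQL 8.0 or later (the first version with window functions, which the original answer could not rely on), the same grouping idea can be written with LAG and a running SUM. The aliases flagged, numbered, is_new_session and session_no are made up for illustration, and the query has not been run against the asker's data.

```sql
-- Sketch only: assumes MySQL 8.0+ and the pageviews/sessions tables from the question.
INSERT INTO sessions (user_id, started_at, ended_at)
SELECT user_id, MIN(`timestamp`), MAX(`timestamp`)
FROM (
    SELECT user_id, `timestamp`,
           -- running count of session starts = session number within the user
           SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY `timestamp`, id) AS session_no
    FROM (
        SELECT id, user_id, `timestamp`,
               -- 0 when the previous pageview is at most 30 minutes earlier;
               -- 1 otherwise, including the user's first pageview (LAG is NULL)
               CASE WHEN `timestamp` <= LAG(`timestamp`) OVER w + INTERVAL 30 MINUTE
                    THEN 0 ELSE 1 END AS is_new_session
        FROM pageviews
        WINDOW w AS (PARTITION BY user_id ORDER BY `timestamp`, id)
    ) AS flagged
) AS numbered
GROUP BY user_id, session_no;
```

Each pageview is read a fixed number of times instead of once per loop iteration, and the index on pageviews(user_id, timestamp) mentioned above should still help with the partitioned sort.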

Related

MySql is null vs is not null performance

I have a query where I am basically doing a left outer join and checking if the joined value is null
select count(T1.code)
from ( select code
from asset
where type = 'meter'
and creation_time <= '2022-04-29 00:00:00'
and (deactivation_time > '2022-04-28 00:00:00' or deactivation_time is null )
group by code
) as T1
left join ( select asset_code
from amr_midnight_data
where server_time between '2022-04-28 00:00:00' and '2022-04-29 00:00:00'
group by asset_code
) as T2 on T1.code = T2.asset_code
Where T2.asset_code is null;
This query takes 3 seconds to execute, but if I replace the is null at the end with is not null, it takes less than a second. Why is there a performance difference here, and what alternatives do I have to make my original query faster?
Look at the EXPLAIN. A guess... Changing to IS NOT NULL lets the Optimizer change LEFT JOIN to JOIN, which lets it start with amr_midnight_data which might optimize better.
I think that the LEFT JOIN ( SELECT ... ) .. IS [NOT] NULL can be replaced with
WHERE [NOT] EXISTS ( SELECT 1 FROM amr_midnight_data
WHERE asset_code = T1.code
AND server_time >= '2022-04-28'
AND server_time < '2022-04-28' + INTERVAL 1 DAY )
That query would benefit from INDEX(asset_code, server_time) on amr_midnight_data.
EXISTS is faster than SELECT .. GROUP BY because it can stop as soon as one matching row is found.
asset would probably benefit from INDEX(type, creation_time) or (to make it "covering"):
INDEX(type, creation_time, deactivation_time, code)
If you wish to discuss further, please provide SHOW CREATE TABLE for both tables and EXPLAIN for each SELECT.
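Putting the EXISTS suggestion together with the original query, a minimal sketch of the full rewrite might look like the following. The index names idx_asset_time and idx_asset_cover are hypothetical, and COUNT(DISTINCT ...) stands in for the original GROUP BY code. Note that the half-open range excludes a server_time of exactly '2022-04-29 00:00:00', which the original BETWEEN included.

```sql
-- Sketch only: index names are made up; the table and column names come from the question.
ALTER TABLE amr_midnight_data ADD INDEX idx_asset_time (asset_code, server_time);
ALTER TABLE asset ADD INDEX idx_asset_cover (type, creation_time, deactivation_time, code);

-- Count meters that were active on 2022-04-28 but have no midnight reading for that day.
SELECT COUNT(DISTINCT a.code)
FROM asset AS a
WHERE a.type = 'meter'
  AND a.creation_time <= '2022-04-29 00:00:00'
  AND (a.deactivation_time > '2022-04-28 00:00:00' OR a.deactivation_time IS NULL)
  AND NOT EXISTS (
        SELECT 1
        FROM amr_midnight_data AS d
        WHERE d.asset_code = a.code
          AND d.server_time >= '2022-04-28'
          AND d.server_time <  '2022-04-28' + INTERVAL 1 DAY
      );
```

Comparing the EXPLAIN output of this version with the original should show whether the derived tables are gone.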

How to speed up a very slow MySQL query?

I have a very slow MySQL query which has become basically unusable since the table grew to over 5000 entries. It takes more than 30 seconds, so the server returns an error code and quits.
The query is:
SELECT
id,
user_id,
date
FROM
table
WHERE
id IN (
SELECT
MAX(id)
FROM
table
GROUP BY date
)
AND
company_id = '1'
AND
date > '1473700785'
AND
complete = '1'
AND
name = "random string"
ORDER BY id ASC
Structure:
id - int(11)
user_id - int(10)
company_id - int(11)
date - varchar(20)
complete - varchar(2)
name - varchar(75)
Do you have any idea what could be slowing it? It used to function as expected with a much smaller table size (under 1000 entries).
Apart from replacing the IN with a joined subquery (as below), the best method is indexing, as most people here have suggested.
SELECT t.id, t.user_id, t.date
FROM table t
-- a join against a subquery sometimes runs faster than IN / NOT IN
JOIN (
SELECT MAX(id) AS id
FROM table
GROUP BY date
)
latest ON latest.id = t.id
WHERE t.company_id = '1'
AND t.date > '1473700785'
AND t.complete = '1'
AND t.name = "random string"
ORDER BY t.id ASC
First, you need an index on the date field.
You should also store date as an integer: the column is a varchar, so this expression is a string comparison rather than a numeric one:
date > '1473700785'
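As a hedged sketch of that advice: the statements below assume every existing date value is a plain Unix timestamp that fits in an unsigned integer, which the question does not confirm. The index names are hypothetical, and the composite index on the filter columns is an extra suggestion rather than something from the answer. The backticked `table` is the question's placeholder name.

```sql
-- Sketch only: back up the table and check for non-numeric values before converting.
ALTER TABLE `table` MODIFY `date` INT UNSIGNED NOT NULL;

-- Supports GROUP BY date with MAX(id) in the subquery.
ALTER TABLE `table` ADD INDEX idx_date_id (`date`, id);

-- Supports the equality filters plus the range condition on date.
ALTER TABLE `table` ADD INDEX idx_filter (company_id, complete, name, `date`);
```

After the conversion the comparison can be written without quotes, as date > 1473700785.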
Indexing is good, but I don't see the need for a sub-select. A self-join that keeps only the rows with no newer row for the same date avoids it:
SELECT
t.id,
t.user_id,
t.date
FROM table t
LEFT JOIN table newer ON newer.date = t.date AND newer.id > t.id
WHERE
newer.id IS NULL
AND
t.company_id = '1'
AND
t.date > '1473700785'
AND
t.complete = '1'
AND
t.name = "random string"
ORDER BY t.id ASC

Select statement joining 2 tables, searching by date, and status

OK I think I have messed up somewhere but maybe someone can spot my error, because I have little clue of what I am doing.
I have 2 Tables Players and RegionPlayer (see bottom for structure)
I am trying to find out when none of the players on a region have been seen in a while. Players can be on vacation, which gives them 58 days; otherwise it's only 8 days.
If none of the players on a region have been seen in that time, I want the sql search to return the regionID, as well as the most recent person on that region who was seen.
Now I think that way to do this is to get 2 results from each region, each providing me the most recent player seen who was on vacation, and who was not on vacation.
But while, I thought this would give me that, it doesn't seem to.
SELECT RegionPlayer.Regionid, Players.key, Players.Name, Players.Seen, Players.Vacation
FROM RegionPlayer
JOIN Players
ON Players.Key = RegionPlayer.Playerid
where ( RegionPlayer.Status = 1 )
GROUP BY RegionPlayer.Regionid DESC, Players.Vacation DESC
ORDER BY Players.Seen DESC
Then I am going to need to be able to tell who has not been seen in a while, this should give me that.
Now I know I can link both queries together, but I have no idea how, it has been many years since I last had to put this much effort into sql statements.
Select Players.key FROM Players
WHERE
(( Players.Vacation != 1 ) AND
( Players.Seen <= (NOW() - INTERVAL 8 DAY ) ))
OR
(( Players.Vacation != 0 ) AND
( Players.Seen <= (NOW() - INTERVAL 58 DAY ) ))
Is there a better way of doing this? I sort of remember things like views, stored procedures, and functions; would one or more of them be better?
Table Structure.
Please forgive the names of the tables and some of the structure. This is an example of why deciding things late at night after 1/2 a bottle of wine is a bad idea.
CREATE TABLE IF NOT EXISTS `Players` (
`key` int(11) NOT NULL,
`Name` varchar(255) NOT NULL,
`Vacation` varchar(1) NOT NULL,
`Seen` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`Modified` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
)
CREATE TABLE IF NOT EXISTS `RegionPlayer` (
`Key` int(11) NOT NULL,
`Playerid` int(11) NOT NULL,
`Regionid` int(11) NOT NULL,
`Type` varchar(1) NOT NULL,
`Status` int(1) NOT NULL DEFAULT '1',
`Modified` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`Created` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00'
)
I've put up an SQLFiddle.
This query answers your basic requirement, which seems to be: list all regions that have no active player seen in the last 8 days and no vacationing player seen in the last 58 days, along with the data of the last seen player in that region:
SELECT r.*
FROM (
SELECT rp.Regionid, p.Key, p.Name, p.Vacation, p.Seen
FROM RegionPlayer rp
JOIN Players p ON p.Key = rp.Playerid
WHERE rp.Status = 1
GROUP BY rp.Regionid
ORDER BY p.Seen DESC
) r
WHERE ((r.Vacation != 1) AND (r.Seen <= (NOW()-INTERVAL 8 DAY)))
OR ((r.Vacation != 0) AND (r.Seen <= (NOW()-INTERVAL 58 DAY)));
I inferred from your SQL that only RegionPlayer rows with a Status of 1 should be considered.
On the SQLFiddle I've created a few regions with different combinations, and this query does its job.
As to your first SQL statement: you say it doesn't work as expected, but to me it seems to do what you describe, returning the last seen active player and the last seen vacationing player for each region. The sorting may not make it very readable, but it does do that.
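Under a stricter reading of the requirement (a region counts as abandoned only when every Status = 1 player on it is past their own allowance, 58 days when Vacation is '1' and 8 days otherwise), one possible sketch uses a plain aggregate and does not depend on which row GROUP BY happens to pick. This is an alternative formulation, not the query from the fiddle, and it returns only the last-seen time; joining the result back to Players on Seen would recover the player row.

```sql
-- Sketch only: the 58/8-day mapping from Vacation is an assumption taken from the question's WHERE clause.
SELECT rp.Regionid,
       MAX(p.Seen) AS LastSeen
FROM RegionPlayer AS rp
JOIN Players AS p ON p.`Key` = rp.Playerid
WHERE rp.Status = 1
GROUP BY rp.Regionid
-- keep the region only if no player is still inside their allowance
HAVING SUM(CASE WHEN p.Vacation = '1'
                THEN p.Seen > NOW() - INTERVAL 58 DAY
                ELSE p.Seen > NOW() - INTERVAL 8 DAY
           END) = 0;
```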
Try this
SELECT RegionPlayer.Regionid, m.key, m.Name, m.Seen, m.Vacation
FROM RegionPlayer
JOIN (SELECT * FROM Players
WHERE
(( Players.Vacation != 1 ) AND
( Players.Seen <= (NOW() - INTERVAL 8 DAY ) ))
OR
(( Players.Vacation != 0 ) AND
( Players.Seen <= (NOW() - INTERVAL 58 DAY ) ))) m
ON m.Key = RegionPlayer.Playerid
where ( RegionPlayer.Status = 1 )
GROUP BY RegionPlayer.Regionid DESC, m.Vacation DESC
ORDER BY m.Seen DESC

MYSQL Selecting oldest date record for each unique event

I have the following two tables
CREATE TABLE IF NOT EXISTS `events` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`title` varchar(255) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM;
CREATE TABLE IF NOT EXISTS `events_dates` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`event_id` bigint(20) NOT NULL,
`date` date NOT NULL,
`start_time` time NOT NULL,
`end_time` time NOT NULL,
PRIMARY KEY (`id`),
KEY `event_id` (`event_id`),
KEY `date` (`event_id`)
) ENGINE=MyISAM;
Where the link is event_id
What I want is to retrieve all unique event records with their respective event dates ordered by the smallest date ascending within a certain period
Basically the following query does exactly what I want
SELECT Event.id, Event.title, EventDate.date, EventDate.start_time, EventDate.end_time
FROM
events AS Event
JOIN
events_dates AS EventDate
ON (Event.id = EventDate.event_id AND EventDate.date = (
SELECT MIN(MinEventDate.date) FROM events_dates AS MinEventDate
WHERE MinEventDate.event_id = Event.id AND MinEventDate.date >= CURDATE() # AND `MinEventDate`.`date` < '2013-02-27'
)
)
WHERE
EventDate.date >= CURDATE() # AND `EventDate`.`date` < '2013-02-27'
ORDER BY EventDate.date ASC , EventDate.start_time ASC , EventDate.end_time DESC
LIMIT 20
This query is the result of multiple attempts to improve on the slow time I initially had (1.5 seconds) when I used GROUP BY and other subqueries. It's the fastest one yet, but considering that there are only 1400 event records and 10000 event date records in total, the query still takes 400+ ms to process. I also run a count based on this (for paging purposes), which takes a lot of time as well.
Strangely enough omitting the EventDate condition in the main where clause causes this to be even higher 1s+.
Is there anything I can do to improve this or a different approach at the table structure?
Just to clarify for anyone else: the "#" in MySQL starts a comment that runs to the end of the line, so everything after it is ignored; the query does not actually apply "AND EventDate.date < '2013-02-27'". That said, it appears you want a list of all upcoming events that have not yet happened. I would start with a simple "prequery" that just grabs each event and its earliest date that has not passed yet. Then join that result to the other tables to get the rest of the fields you want:
SELECT
E.ID,
E.Title,
ED2.`date`,
ED2.Start_Time,
ED2.End_Time
FROM
( SELECT
ED.Event_ID,
MIN( ED.`date` ) as MinEventDate
from
events_dates ED
where
ED.`date` >= curdate()
group by
ED.Event_ID ) PreQuery
JOIN events E
ON PreQuery.Event_ID = E.ID
JOIN events_dates ED2
ON PreQuery.Event_ID = ED2.Event_ID
AND PreQuery.MinEventDate = ED2.`date`
ORDER BY
ED2.`date`,
ED2.Start_Time,
ED2.End_Time DESC
LIMIT 20
Your table has a redundant index on event_id: the same column is indexed twice under different names. Naming an index date does not mean that date is the column being indexed. The value(s) in parentheses ( event_id ) are what the index is built on.
So, I would change your create table to...
KEY `date` ( `event_id`, `date`, `start_time` )
Or, to manually create an index.
Create index ByEventAndDate on events_dates ( `event_id`, `date`, `start_time` )
If you are talking about optimization, it is helpful to include execution plans when possible.
By the way, try these (if you have not tried them already):
SELECT
Event.id,
Event.title,
EventDate.date,
EventDate.start_time,
EventDate.end_time
FROM
(select e.id, e.title, min(date) as MinDate
from events_dates as ed
join events as e on e.id = ed.event_id
where date >= CURDATE() and date < '2013-02-27'
group by e.id, e.title) as Event
JOIN events_dates AS EventDate ON Event.id = EventDate.event_id
and Event.MinDate = EventDate.date
ORDER BY EventDate.date ASC , EventDate.start_time ASC , EventDate.end_time DESC
LIMIT 20
;
# assuming events_dates.date always increases with events_dates.id
SELECT
Event.id,
Event.title,
EventDate.date,
EventDate.start_time,
EventDate.end_time
FROM
(select e.id, e.title, min(ed.id) as MinID
from events_dates as ed
join events as e on e.id = ed.event_id
where date >= CURDATE() and date < '2013-02-27'
group by e.id, e.title) as Event
JOIN events_dates AS EventDate ON Event.id = EventDate.event_id
and Event.MinID = EventDate.id
ORDER BY EventDate.date ASC , EventDate.start_time ASC , EventDate.end_time DESC
LIMIT 20

more efficient group by for query with Case

I have the following query building a recordset which is used in a pie-chart as a report.
It's not run particularly often, but when it does it takes several seconds, and I'm wondering if there's any way to make it more efficient.
SELECT
CASE
WHEN (lastStatus IS NULL) THEN 'Unused'
WHEN (attempts > 3 AND callbackAfter IS NULL) THEN 'Max Attempts Reached'
WHEN (callbackAfter IS NOT NULL AND callbackAfter > DATE_ADD(NOW(), INTERVAL 7 DAY)) THEN 'Call Back After 7 Days'
WHEN (callbackAfter IS NOT NULL AND callbackAfter <= DATE_ADD(NOW(), INTERVAL 7 DAY)) THEN 'Call Back Within 7 Days'
WHEN (archived = 0) THEN 'Call Back Within 7 Days'
ELSE 'Spoke To'
END AS statusSummary,
COUNT(leadId) AS total
FROM
CO_Lead
WHERE
groupId = 123
AND
deleted = 0
GROUP BY
statusSummary
ORDER BY
total DESC;
I have an index for (groupId, deleted), but I'm not sure it would help to add any of the other fields into the index (if it would, how do I decide which should go first? callbackAfter because it's used the most?)
The table has about 500,000 rows (but will have 10 times that a year from now.)
The only other thing I could think of was to split it out into 6 queries (with the WHEN clause moved into the WHERE), but that makes it take 3 times as long.
EDIT:
Here's the table definition
CREATE TABLE CO_Lead (
objectId int UNSIGNED NOT NULL AUTO_INCREMENT,
groupId int UNSIGNED NOT NULL,
numberToCall varchar(20) NOT NULL,
firstName varchar(100) NOT NULL,
lastName varchar(100) NOT NULL,
attempts tinyint NOT NULL default 0,
callbackAfter datetime NULL,
lastStatus varchar(30) NULL,
createdDate datetime NOT NULL,
archived bool NOT NULL default 0,
deleted bool NOT NULL default 0,
PRIMARY KEY (
objectId
)
) ENGINE = InnoDB;
ALTER TABLE CO_Lead ADD CONSTRAINT UQIX_CO_Lead UNIQUE INDEX (
objectId
);
ALTER TABLE CO_Lead ADD INDEX (
groupId,
archived,
deleted,
callbackAfter,
attempts
);
ALTER TABLE CO_Lead ADD INDEX (
groupId,
deleted,
createdDate,
lastStatus
);
ALTER TABLE CO_Lead ADD INDEX (
firstName
);
ALTER TABLE CO_Lead ADD INDEX (
lastName
);
ALTER TABLE CO_Lead ADD INDEX (
lastStatus
);
ALTER TABLE CO_Lead ADD INDEX (
createdDate
);
Notes:
1. If leadId cannot be NULL, change COUNT(leadId) to COUNT(*). They are logically equivalent, but most versions of the MySQL optimizer are not clever enough to recognize that.
2. Remove the two redundant callbackAfter IS NOT NULL conditions. If callbackAfter satisfies the second part of either condition, it cannot be NULL anyway.
3. You could benefit from splitting the query into 6 parts and adding appropriate indexes for each one, but depending on whether the conditions in the CASE overlap, the results may or may not be correct.
A possible rewrite (mind the different format and check if this returns the same results, it may not!)
SELECT
cnt1 AS "Unused"
, cnt2 AS "Max Attempts Reached"
, cnt3 AS "Call Back After 7 Days"
, cnt4 AS "Call Back Within 7 Days"
, cnt5 AS "Call Back Within 7 Days"
, cnt6 - (cnt1+cnt2+cnt3+cnt4+cnt5) AS "Spoke To"
FROM
( SELECT
( SELECT COUNT(*) FROM CO_Lead
WHERE groupId = 123 AND deleted = 0
AND lastStatus IS NULL
) AS cnt1
, ( SELECT COUNT(*) FROM CO_Lead
WHERE groupId = 123 AND deleted = 0
AND attempts > 3 AND callbackAfter IS NULL
) AS cnt2
, ( SELECT COUNT(*) FROM CO_Lead
WHERE groupId = 123 AND deleted = 0
AND callbackAfter > DATE_ADD(NOW(), INTERVAL 7 DAY)
) AS cnt3
, ( SELECT COUNT(*) FROM CO_Lead
WHERE groupId = 123 AND deleted = 0
AND callbackAfter <= DATE_ADD(NOW(), INTERVAL 7 DAY)
) AS cnt4
, ( SELECT COUNT(*) FROM CO_Lead
WHERE groupId = 123 AND deleted = 0
AND archived = 0
) AS cnt5
, ( SELECT COUNT(*) FROM CO_Lead
WHERE groupId = 123 AND deleted = 0
) AS cnt6
) AS tmp ;
If it does return correct results, you could add indexes to be used for each one of the subqueries:
For subquery 1: (groupId, deleted, lastStatus)
For subquery 2, 3, 4: (groupId, deleted, callbackAfter, attempts)
For subquery 5: (groupId, deleted, archived)
Another approach would be to keep the query you have (minding only notes 1 and 2 above) and add a wide covering index:
(groupId, deleted, lastStatus, callbackAfter, attempts, archived)
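For illustration, the "keep the query, add a covering index" approach with notes 1 and 2 applied could look roughly like this. The index name idx_lead_status_cover is hypothetical, and COUNT(*) assumes the counted column is never NULL.

```sql
-- Sketch only: the index name is made up; the columns follow the covering index listed above.
ALTER TABLE CO_Lead
    ADD INDEX idx_lead_status_cover
        (groupId, deleted, lastStatus, callbackAfter, attempts, archived);

SELECT
    CASE
        WHEN lastStatus IS NULL THEN 'Unused'
        WHEN attempts > 3 AND callbackAfter IS NULL THEN 'Max Attempts Reached'
        -- the IS NOT NULL checks are dropped: a NULL callbackAfter fails both comparisons
        WHEN callbackAfter >  DATE_ADD(NOW(), INTERVAL 7 DAY) THEN 'Call Back After 7 Days'
        WHEN callbackAfter <= DATE_ADD(NOW(), INTERVAL 7 DAY) THEN 'Call Back Within 7 Days'
        WHEN archived = 0 THEN 'Call Back Within 7 Days'
        ELSE 'Spoke To'
    END AS statusSummary,
    COUNT(*) AS total
FROM CO_Lead
WHERE groupId = 123
  AND deleted = 0
GROUP BY statusSummary
ORDER BY total DESC;
```

Because every column the query touches is in the index, MySQL can in principle answer it from the index alone; EXPLAIN showing "Using index" would confirm that.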
Try removing the index to see whether that improves performance.
Indexes do not necessarily improve performance. If an index matches the WHERE clause, MySQL will usually use it. In this case, that means it reads the index and then has to fetch the data from each page. Those page reads are random rather than sequential, and random reads can reduce performance on a query that has to read all the pages anyway.
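Rather than dropping the index outright, one low-risk way to try this is an index hint. The sketch below assumes the two composite indexes got MySQL's default names groupId and groupId_2 (unnamed indexes are named after their first column); SHOW INDEX reports the actual names.

```sql
-- Sketch only: confirm the real index names first.
SHOW INDEX FROM CO_Lead;

-- Compare the plan and timing with the composite indexes ignored.
EXPLAIN
SELECT COUNT(*)
FROM CO_Lead IGNORE INDEX (groupId, groupId_2)
WHERE groupId = 123
  AND deleted = 0;
```

If the hinted version is not faster, the index is probably not the bottleneck and can stay.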