SELECT MIN(classification) AS classification
     , MIN(start) AS start
     , MAX(next_start) AS `end`
     , SUM(duration) AS seconds
FROM ( SELECT *
            , CASE WHEN (duration < 20*60) THEN CASE WHEN (duration = -1) THEN 'current_session' ELSE 'session' END
                   ELSE 'break'
              END AS classification
            , CASE WHEN (duration > 20*60) THEN ((@sum_grouping := @sum_grouping + 2) - 1)
                   ELSE @sum_grouping
              END AS sum_grouping
       FROM ( SELECT *
                   , CASE WHEN next_start IS NOT NULL THEN TIMESTAMPDIFF(SECOND, start, next_start) ELSE -1 END AS duration
              FROM ( SELECT id, studentId, start
                          , (SELECT MIN(start)
                             FROM attempt AS sub
                             WHERE sub.studentId = main.studentId
                               AND sub.start > main.start
                            ) AS next_start
                     FROM attempt AS main
                     WHERE main.studentId = 605
                     ORDER BY start
                   ) AS t1
            ) AS t2
       CROSS JOIN (SELECT @sum_grouping := 0) AS vars /* initialise the user variable */
       WHERE duration != 0
     ) AS t3
GROUP BY sum_grouping
ORDER BY start DESC, `end` DESC
Explanation and goal
The attempt table records a student's attempt at some activity, during a session. If two attempts are less than 20 minutes apart, we consider those to be the same session. If they are more than 20 minutes apart, we assume they took a break.
My goal with this query is to take all of the attempts and condense them down into a list of sessions and breaks, with the start time of each session, the end time (defined as the start of the subsequent session), and how long the session was. The classification indicates whether it is a session, a break, or the current session.
The above query does all of that, but is too slow. How can I improve the performance?
How the current query works
The innermost queries select an attempt's start time and the subsequent attempt's start time, along with the duration between those values.
Then, @sum_grouping and sum_grouping are used to split the attempts into the sessions and breaks. @sum_grouping is only ever increased when an attempt is more than 20 minutes long (i.e. a break), and it is always increased by 2. However, sum_grouping is set to a value of one less than that for that "break". If an attempt is less than 20 minutes long, then the current @sum_grouping value is used, without modification. As a result, all breaks get distinct odd values, and all sessions (whether of one or more attempts) end up with distinct even values. This allows the GROUP BY portion to correctly separate the attempts into sessions and breaks.
Example:
Attempt type   @sum_grouping   sum_grouping
non-break      0               0
non-break      0               0
break          2               1
break          4               3
non-break      4               4
break          6               5
As you can see, all the breaks will be grouped by sum_grouping separately with distinct odd values and all the non-breaks will be grouped together as sessions with the even values.
The MIN(classification) simply forces 'current_session' to be returned when both 'session' and 'current_session' are present within a grouped row.
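The odd/even bookkeeping can be simulated outside SQL to see why the grouping works; a minimal Python sketch with hypothetical durations (in seconds):

```python
# Hypothetical per-attempt durations; anything over 20 minutes (1200 s) is a break.
durations = [300, 600, 2400, 3600, 120, 5400]

sum_grouping = 0
groups = []  # one (classification, group_key) per row
for d in durations:
    if d > 20 * 60:
        sum_grouping += 2                           # bump by 2 on every break...
        groups.append(('break', sum_grouping - 1))  # ...but tag the break with the odd value
    else:
        groups.append(('session', sum_grouping))    # sessions keep the current even value

# breaks land on distinct odd keys; consecutive sessions share an even key
print(groups)
```

Running this reproduces the example table above: the two leading sessions share key 0, each break gets its own odd key, and the later session gets the even key 4.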
OUTPUT OF SHOW CREATE TABLE attempt
CREATE TABLE `attempt` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `caseId` int(11) NOT NULL DEFAULT '0',
  `eventId` int(11) NOT NULL DEFAULT '0',
  `studentId` int(11) NOT NULL DEFAULT '0',
  `activeUuid` char(36) NOT NULL,
  `start` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `end` timestamp NULL DEFAULT NULL,
  `outcome` float DEFAULT NULL,
  `response` varchar(5000) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `activeUuid` (`activeUuid`),
  KEY `caseId` (`caseId`,`activeUuid`),
  KEY `end` (`end`),
  KEY `start` (`start`),
  KEY `studentId` (`studentId`),
  KEY `attempt_idx_studentid_stat_id` (`studentId`,`start`,`id`),
  KEY `attempt_idx_studentid_stat` (`studentId`,`start`)
) ENGINE=MyISAM AUTO_INCREMENT=298382 DEFAULT CHARSET=latin1
(This is not a proper Answer, but here goes anyway.)
Try not to nest 'derived' tables.
I see a lot of syntax errors.
Move from MyISAM to InnoDB.
INDEX(a, b) handles situations where you need INDEX(a), so DROP the latter.
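One way to drop the nested derived tables entirely (on MySQL 8.0+, which did not exist for the MyISAM-era original) is the window-function form of the gaps-and-islands pattern. This sketch demonstrates it against SQLite, which shares the window syntax, using hypothetical epoch-second start times:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE attempt (id INTEGER PRIMARY KEY, studentId INT, start INT);
INSERT INTO attempt (studentId, start) VALUES
  (605, 0), (605, 300), (605, 3000), (605, 3100);
""")

# A new "island" (session) starts whenever the gap from the previous
# attempt exceeds 20 minutes (1200 s); a running SUM of those flags
# numbers the islands, replacing the @sum_grouping variable.
rows = con.execute("""
SELECT MIN(start), MAX(start), COUNT(*)
FROM (
  SELECT start,
         SUM(new_session) OVER (ORDER BY start) AS island
  FROM (
    SELECT start,
           CASE WHEN start - LAG(start) OVER (ORDER BY start) > 1200
                THEN 1 ELSE 0 END AS new_session
    FROM attempt
    WHERE studentId = 605
  )
)
GROUP BY island
ORDER BY 1
""").fetchall()
print(rows)  # two sessions: (0, 300, 2) and (3000, 3100, 2)
```

On MySQL the same shape works with `TIMESTAMPDIFF(SECOND, ...)` over the real timestamp column; the single pass over `(studentId, start)` also avoids the correlated `MIN(start)` subquery.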
Related
I have a date table, which has a column date (PK). The CREATE script is here:
CREATE TABLE date_table (
date DATE
,year INT(4)
,month INT(2)
,day INT(2)
,month_pad VARCHAR(2)
,day_pad VARCHAR(2)
,month_name VARCHAR(10)
,year_month_index INT(6)
,year_month_hypname VARCHAR(7)
,year_month_name VARCHAR(15)
,week_day_index INT(1)
,day_name VARCHAR(9)
,week INT(2)
,week_interval VARCHAR(13)
,weekend_fl INT(1)
,quarter_num INT(1)
,quarter_num_pad VARCHAR(2)
,quarter_name VARCHAR(2)
,year_quarter_index INT(6)
,year_quarter_name VARCHAR(7)
,PRIMARY KEY (date)
);
Now I would like to select rows from this table with dynamic values, using functions such as LAST_DAY() or DATE_SUB(DATE_FORMAT(SYSDATE(),'%Y-01-01'), INTERVAL X YEAR), etc.
When one of my queries failed and didn't execute in 30 secs, I knew something was fishy, and it looks like the reason is that the index on the primary key column is not used. Here are my results (sorry for using an image instead of copying the queries, but I thought it's concise enough for this purpose, and the queries are short/simple enough):
First of all, it's strange that BETWEEN works differently than using >= and <=. Secondly, it looks like the index is only used for constant values. If you look closely, you can see that on the right side (where >= and <= is used), it shows ~9K rows, which is half of the rows in the table (the table has about ~18k rows, with dates from 2000-01-01 to 2050-12-31).
SYSDATE() returns the time at which it executes. This differs from the behavior for NOW(), which returns a constant time that indicates the time at which the statement began to execute. (Within a stored function or trigger, NOW() returns the time at which the function or triggering statement began to execute.)
-- https://dev.mysql.com/doc/refman/5.7/en/date-and-time-functions.html#function_sysdate
That is, the Optimizer does not see this as a "constant". Otherwise, the Optimizer eagerly evaluates any "constant expressions", then tries to take advantage of knowing the value.
See also the sysdate_is_now option.
Bottom line: Don't use SYSDATE() for normal datetime usage; use NOW() or CURDATE().
Looks like if I use CURRENT_DATE() (or NOW()) instead of SYSDATE(), it works. Both of these queries:
SELECT *
FROM date_table t
WHERE 1 = 1
AND t.ddate >= LAST_DAY(CURRENT_DATE()) AND t.ddate <= LAST_DAY(CURRENT_DATE());
SELECT *
FROM date_table t
WHERE 1 = 1
AND t.ddate >= LAST_DAY(NOW()) AND t.ddate <= LAST_DAY(NOW());
Give the same result.
I will accept my answer as a solution, but I'm still looking for an explanation. I thought it might have something to do with SYSDATE() not being a DATE, but NOW() is also not a DATE...
EDIT: Forgot to add, BETWEEN is also working as I see.
I am recording each page that is viewed by logged-in users in a MySQL table. I would like to calculate how many visits the site has had within a time period (e.g. day, week, month, between 2 dates, etc.) in a similar way to Google Analytics.
Google Analytics defines a visit as user activity separated by at least 30 minutes of inactivity. I have the user ID, URL and date/time of each pageview so I need a query that can calculate a visit defined in this way.
I can easily count the pageviews between 2 dates, but how can I dynamically work out if a pageview from a user is within 30 minutes of another pageview and only count it once?
Here is a small sample of the data:
http://sqlfiddle.com/#!2/56695/2
Many thanks.
First, note that doing this kind of analysis in SQL is not the best idea: it has a very high computational complexity. There are several ways of eliminating that complexity.
Since we're talking about analytics data, or something more akin to the access logs of a typical web server, we could just add a cookie value to it: a simple piece of front-end code creates the cookie with a random id unless it already exists, and sets its expiry to whatever you want your session length to be (30 minutes by default; note that you can change the session length in GA). Now your task is as simple as counting unique ids grouped by user, which is O(N), the favourite complexity of most DBMSes.
Now if you really want to be solving the gaps-and-islands problem, you can just look at classical solutions of the problem, as well as some examples here on SO: SQL Server - Counting Sessions - Gaps and islands
Finally, the 'proper' way of tracking the session id would be generating a random string on every hit and setting it to a certain custom dimension, while having it as a session-level dimension for GA UA. Here's a more detailed explanation.
GA4 is gracious enough to surface the session id more properly, and here is how.
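If you do pull the raw pageview rows out of MySQL, the gap-based visit count is simple to compute in application code; a sketch with hypothetical (user_id, epoch_seconds) tuples:

```python
from collections import defaultdict

def count_visits(pageviews, gap=30 * 60):
    """pageviews: iterable of (user_id, epoch_seconds).  A new visit starts
    whenever a user's pageview is >= `gap` seconds after their previous one."""
    by_user = defaultdict(list)
    for user, ts in pageviews:
        by_user[user].append(ts)
    visits = 0
    for times in by_user.values():
        times.sort()
        visits += 1  # a user's first pageview always opens a visit
        visits += sum(1 for prev, cur in zip(times, times[1:]) if cur - prev >= gap)
    return visits

# user 1: two views 10 s apart (1 visit); user 2: two views 2 h apart (2 visits)
print(count_visits([(1, 1000), (1, 1010), (2, 1000), (2, 8200)]))  # → 3
```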
First, I would also index the uri column and make each column "not nullable":
CREATE TABLE IF NOT EXISTS `uri_history` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`user` int(10) unsigned NOT NULL, /* cannot be NULL */
`timestamp` int(10) unsigned NOT NULL, /* cannot be NULL */
`uri` varchar(255) NOT NULL, /* cannot be NULL */
PRIMARY KEY (`id`),
KEY `user` (`user`),
KEY `timestamp` (`timestamp`),
KEY `uri` (`uri`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
However, I am a bit bewildered by your timestamp column having an int(10) definition and values such as 1389223839. I would expect an integer timestamp to be a value created with the UNIX_TIMESTAMP function call, but 1389223839 would then represent a value of '2014-01-08 18:30:39' for the 'America/New_York' time zone. I would have expected a sample timestamp to be more "contemporary." But I will have to assume that this column is a Unix timestamp value.
Let's say I was interested in gathering statistics for the month of June of this year:
SELECT * FROM uri_history
WHERE DATE(FROM_UNIXTIME(`timestamp`)) between '2022-06-01' and '2022-06-30'
ORDER BY `uri`, `user`, `timestamp`
From this point on I would process the returned rows in sequence recognizing breaks on the uri and user columns. For any returned uri and user combination, it should be very simple to compare the successive timestamp values and see if they differ by at least 30 minutes (i.e. 1800 seconds). In Python this would look like:
current_uri = None
current_user = None
current_timestamp = None
counter = None

# Process each returned row:
for row in returned_rows:
    uri = row['uri']
    user = row['user']
    timestamp = row['timestamp']
    if uri != current_uri:
        # We have a new `uri` column:
        if current_uri:
            # Statistics for previous uri:
            print(f'Visits for uri {current_uri} = {counter}')
        current_uri = uri
        current_user = user
        counter = 1
    elif user != current_user:
        # We have a new user for the current uri:
        current_user = user
        counter += 1
    elif timestamp - current_timestamp >= 1800:
        # New visit is at least 30 minutes after the user's
        # previous visit for this uri:
        counter += 1
    current_timestamp = timestamp

# Output final statistics, if any:
if current_uri:
    print(f'Visits for uri {current_uri} = {counter}')
Am I correct that you want to count how many users visited the site within a 30-minute window (for logged-in users), but only count once per user even if that user viewed more pages in that period? If so, you could filter, then group by 30-minute periods.
First, convert the integer timestamp into a date using FROM_UNIXTIME, get the minute of the visit, how many minutes of the half-hour have passed, and the period start and end:
SELECT DATE_FORMAT(FROM_UNIXTIME(timestamp), '%e %b %Y %H:%i:%s') visit_time,
FROM_UNIXTIME(timestamp) create_at,
MINUTE(FROM_UNIXTIME(timestamp)) create_minute,
MINUTE(FROM_UNIXTIME(timestamp))%30 create_minute_has_past_group,
date_format(FROM_UNIXTIME(timestamp) - interval minute(FROM_UNIXTIME(timestamp))%30 minute, '%H:%i') as period_start,
date_format(FROM_UNIXTIME(timestamp) + interval 30-minute(FROM_UNIXTIME(timestamp))%30 minute, '%H:%i') as period_end
FROM uri_history
After that group by period of start and COUNT DISTINCT user
SELECT date_format(FROM_UNIXTIME(timestamp) - interval minute(FROM_UNIXTIME(timestamp))%30 minute, '%H:%i') as period_start,
date_format(FROM_UNIXTIME(timestamp) + interval 30-minute(FROM_UNIXTIME(timestamp))%30 minute, '%H:%i') as period_end,
COUNT(DISTINCT(user)) count
FROM uri_history
GROUP BY period_start
ORDER BY period_start ASC;
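The minute-modulo arithmetic above can be checked in isolation; a Python sketch of the same 30-minute bucketing, applied to a hypothetical epoch timestamp (UTC assumed):

```python
from datetime import datetime, timedelta, timezone

def half_hour_bucket(ts: int):
    """Mirror the MINUTE(...) % 30 arithmetic in the query above:
    return (period_start, period_end) as 'HH:MM' strings."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    past = dt.minute % 30                    # minutes already elapsed in this bucket
    start = dt - timedelta(minutes=past)
    end = dt + timedelta(minutes=30 - past)
    return start.strftime("%H:%M"), end.strftime("%H:%M")

print(half_hour_bucket(1389223839))  # 2014-01-08 23:30:39 UTC → ('23:30', '00:00')
```

Note that, as in the SQL, this buckets by wall-clock half-hours rather than by 30 minutes of inactivity per user, so it approximates rather than reproduces the GA visit definition.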
I adapted these from this answer.
I have an SQL query that randomly selects 1200 top retweeted tweets, each retweeted at least 50 times and with a tweetDate at least 4 days old, from 40 million records. The query I pasted below works, but it takes 40 minutes, so is there a faster version of that query?
SELECT
originalTweetId, Count(*) as total, tweetContent, tweetDate
FROM
twitter_gokhan2.tweetentities
WHERE
originalTweetId IS NOT NULL
AND originalTweetId <> - 1
AND isRetweet = true
AND (tweetDate < DATE_ADD(CURDATE(), INTERVAL - 4 DAY))
GROUP BY originalTweetId
HAVING total > 50
ORDER BY RAND()
limit 0 , 1200;
---------------------------------------------------------------
Table creation sql is like:
CREATE TABLE `tweetentities` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`tweetId` bigint(20) NOT NULL,
`tweetContent` varchar(360) DEFAULT NULL,
`tweetDate` datetime DEFAULT NULL,
`userId` bigint(20) DEFAULT NULL,
`userName` varchar(100) DEFAULT NULL,
`retweetCount` int(11) DEFAULT '0',
`keyword` varchar(500) DEFAULT NULL,
`isRetweet` bit(1) DEFAULT b'0',
`isCompleted` bit(1) DEFAULT b'0',
`applicationId` int(11) DEFAULT NULL,
`latitudeData` double DEFAULT NULL,
`longitude` double DEFAULT NULL,
`originalTweetId` bigint(20) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index` (`originalTweetId`),
KEY `index3` (`applicationId`),
KEY `index2` (`tweetId`),
KEY `index4` (`userId`),
KEY `index5` (`userName`),
KEY `index6` (`isRetweet`),
KEY `index7` (`tweetDate`),
KEY `index8` (`originalTweetId`),
KEY `index9` (`isCompleted`),
KEY `index10` (`tweetContent`(191))
) ENGINE=InnoDB AUTO_INCREMENT=41501628 DEFAULT CHARSET=utf8mb4$$
You are, of course, summarizing a huge number of records, then randomizing them. This kind of thing is hard to make fast. Going back to the beginning of time makes it worse. Searching on a null condition just trashes it.
If you want this to perform reasonably, you must get rid of the IS NOT NULL selection. Otherwise, it will perform badly.
But let us try to find a reasonable solution. First, let's get the originalTweetId values we need.
SELECT MIN(id) originalId,
MIN(tweetDate) tweetDate,
originalTweetId,
Count(*) as total
FROM twitter_gokhan2.tweetentities
WHERE originalTweetId <> -1
/*AND originalTweetId IS NOT NULL We have to leave this out for perf reasons */
AND isRetweet = true
AND tweetDate < CURDATE() - INTERVAL 4 DAY
AND tweetDate > CURDATE() - INTERVAL 30 DAY /*let's add this, if we can*/
GROUP BY originalTweetId
HAVING total >= 50
This summary query gives us the lowest id number and date in your database for each subject tweet.
To get this to run fast, we need a compound index on (originalTweetId, isRetweet, tweetDate, id). The query will do a range scan of this index on tweetDate, which is about as fast as you can hope for. Debug this query, both for correctness and performance, then move on.
Now do the randomization. Let's do this with the minimum amount of data we can, to avoid sorting some enormous amount of stuff.
SELECT originalId, originalTweetId, tweetDate, total, RAND() AS randomOrder
FROM (
SELECT MIN(id) originalId,
MIN(tweetDate) tweetDate,
originalTweetId,
Count(*) as total
FROM twitter_gokhan2.tweetentities
WHERE originalTweetId <> -1
/*AND originalTweetId IS NOT NULL We have to leave this out for perf reasons */
AND isRetweet = true
AND tweetDate < CURDATE() - INTERVAL 4 DAY
AND tweetDate > CURDATE() - INTERVAL 30 DAY /*let's add this, if we can*/
GROUP BY originalTweetId
HAVING total >= 50
) AS retweets
ORDER BY randomOrder
LIMIT 1200
Great. Now we have a list of 1200 tweet ids and dates in random order. Now let's go get the content.
SELECT a.originalTweetId, a.total, b.tweetContent, a.tweetDate
FROM (
/* that whole query above */
) AS a
JOIN twitter_gokhan2.tweetentities AS b ON (a.originalId = b.id)
ORDER BY a.randomOrder
See how this goes? Use a compound index to do your summary, and do it on the minimum amount of data. Then do the randomizing, then go fetch the extra data you need.
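The summarize-then-sample-then-join pattern can be seen in miniature with SQLite standing in for MySQL (hypothetical toy data; SQLite's RANDOM() replaces RAND()):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tweets (id INTEGER PRIMARY KEY, originalTweetId INT, content TEXT);
INSERT INTO tweets VALUES (1, 10, 'a'), (2, 10, 'b'), (3, 20, 'c');
""")

# Summarize on the narrow columns first, randomize and LIMIT the small
# summary, then join back to the wide table for the content column.
rows = con.execute("""
SELECT a.originalTweetId, a.total, b.content
FROM (
  SELECT MIN(id) AS originalId, originalTweetId, COUNT(*) AS total
  FROM tweets
  GROUP BY originalTweetId
  ORDER BY RANDOM()          -- sort only the summary rows, not the big table
  LIMIT 2
) AS a
JOIN tweets AS b ON b.id = a.originalId
""").fetchall()
```

The expensive ORDER BY now touches one row per tweet group instead of one row per retweet, which is the whole point of the rewrite above.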
You're selecting a huge number of records by selecting every record older than 4 days old....
Since the query takes a huge amount of time, why not simply prepare the results using an independent script which runs repeatedly in the background...
You might be able to make the assumption that if its a retweet, the originalTweetId cannot be null/-1
Just to clarify... did you really mean to query everything OLDER than 4 days???
AND (tweetDate < DATE_ADD(CURDATE(), INTERVAL - 4 DAY))
OR... Did you mean you only wanted to aggregate RECENT TWEETS WITHIN the last 4 days? To me, tweets that happened 2 years ago would be worthless to current events... If that's the case, you might be better off just changing to
AND (tweetDate >= DATE_ADD(CURDATE(), INTERVAL - 4 DAY))
See if this isn't a bit faster than 40 minutes:
Test first without the commented lines, then re-add them to compare performance impact. (especially ORDER BY RAND() is known to be horrible)
SELECT
originalTweetId,
total,
-- tweetContent, -- may slow things somewhat
tweetDate
FROM (
SELECT
originalTweetId,
COUNT(*) AS total,
-- tweetContent, -- may slow things somewhat
MIN(tweetDate) AS tweetDate,
MAX(isRetweet) AS isRetweet
FROM twitter_gokhan2.tweetentities
GROUP BY originalTweetId
) AS t
WHERE originalTweetId > 0
AND isRetweet
AND tweetDate < DATE_ADD(CURDATE(), INTERVAL - 4 DAY)
AND total > 50
-- ORDER BY RAND() -- very likely to slow performance,
-- test with and without...
LIMIT 0, 1200;
PS - originalTweetId should be indexed hopefully
I have a table with three columns:
`id` int(11) NOT NULL auto_increment
`tm` int NOT NULL
`ip` varchar(16) NOT NULL DEFAULT '0.0.0.0'
I want to run a query that will check if the same IP was logged within a minute and then delete all but one of those entries.
For example I have the two rows below.
id=1 tm=1361886629 ip=192.168.0.1
id=2 tm=1361886630 ip=192.168.0.1
I would only like to keep one in the database.
I have read lots of other remove duplicate/partial duplicate entry questions but I'm looking for a way to compare the last two digits of the Unix/epoch time and delete all but one based on that plus the IP.
Any help is much appreciated.
You can use CAST in MySQL to remove the last 2 digits:
SELECT CAST(tm AS CHAR(8))
This will select only the first 8 digits of the timestamp and allow you to find duplicates.
If you only want to know what the last 2 digits are:
SELECT RIGHT(CAST(tm AS CHAR(10)), 2)
This will select only the last two digits of each timestamp.
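The same truncation is easy to verify outside MySQL; a Python sketch that deduplicates the question's sample rows by (truncated timestamp, ip):

```python
# The question's sample rows: (id, tm, ip)
rows = [
    (1, 1361886629, "192.168.0.1"),
    (2, 1361886630, "192.168.0.1"),
    (3, 1361886929, "192.168.0.1"),
]

# str(tm)[:8] mirrors CAST(tm AS CHAR(8)): the first 8 digits of a 10-digit
# epoch value, i.e. a 100-second bucket (a coarse stand-in for "same minute").
seen, keep = set(), []
for rid, tm, ip in rows:
    key = (str(tm)[:8], ip)
    if key not in seen:
        seen.add(key)
        keep.append(rid)

print(keep)  # → [1, 3]
```

Note the caveat: digit truncation gives fixed 100-second buckets, so two hits a second apart can still straddle a bucket boundary; for a strict "within one minute of each other" rule you would compare `tm` differences instead.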
How would you go about selecting records timestamped within a certain amount of time of each other?
Application and sought solution:
I have a table with records of clicks, I am wanting to go through and find the clicks from the same IP that occurred within a certain time period.
e.g.: SELECT ALL ip_address WHERE 5 or more of the same ip_address, occurred/are grouped within/timestamped, within 10 minutes of each other
You can select records like this:
$recorddate = date("Y-m-d");
SELECT * FROM table WHERE date > UNIX_TIMESTAMP('$recorddate');
UNIX_TIMESTAMP function converts date to timestamp. And you can easily use it in your queries.
If you want to grab the records in a 10-minute interval, you can do something like this:
$starttime = "2012-08-30 19:00:00";
$endtime = "2012-08-30 19:10:00";
SELECT * FROM table WHERE date >= UNIX_TIMESTAMP('$starttime') AND date <= UNIX_TIMESTAMP('$endtime') ;
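The epoch bounds can also be computed application-side instead of calling UNIX_TIMESTAMP() in the query; a sketch with a hypothetical `epoch_bounds` helper (assumes the stored integer timestamps are UTC):

```python
from datetime import datetime, timezone

def epoch_bounds(start: str, end: str):
    """Hypothetical helper mirroring UNIX_TIMESTAMP('Y-m-d H:M:S') bounds
    for an integer `date` column."""
    fmt = "%Y-%m-%d %H:%M:%S"
    def to_epoch(s):
        return int(datetime.strptime(s, fmt).replace(tzinfo=timezone.utc).timestamp())
    return to_epoch(start), to_epoch(end)

lo, hi = epoch_bounds("2012-08-30 19:00:00", "2012-08-30 19:10:00")
# rows satisfying lo <= date <= hi fall inside the 10-minute window
```

Passing plain integers also keeps the comparison against the integer column index-friendly, with no per-row function calls.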
Decided not to try for a single query on the raw data.
After discussion with a friend, and then reading about options such as the MEMORY engine and PHP memcache, I decided to go with a regular table that records click counts with a time-to-live timestamp. After that timestamp has passed, a new TTL is assigned and the count is reset.
One caveat: for my application I can't be sure how large the configured time periods will be; if they are long and the memory gets cleared, counts start over.
It isn't a perfect solution if it is run on every user link click, but it should be pretty good at catching click-fraud storms, and do the job.
Some managing PHP/MySQL code ("Drupalized queries"):
$timeLimit = $clickQualityConfigs['edit-submitted-within-x-num-of-same-ip-clicks']." ".$clickQualityConfigs['edit-submitted-time-period-same-ip-ban']; // => 1 days // e.g.
$filterEndTime = strtotime("+".$timeLimit);
$timeLimitUpdate_results = db_query('UPDATE {ip_address_count}
SET ttl_time_stamp = :filterendtime, click_count = :clickcountfirst WHERE ttl_time_stamp < :timenow', array(':filterendtime' => $filterEndTime, ':clickcountfirst' => '0', ':timenow' => time()));
$clickCountUpdate_results = db_query('INSERT INTO {ip_address_count} (ip_address,ttl_time_stamp,click_count)
VALUES (:ipaddress,:timestamp,:clickcountfirst)
ON DUPLICATE KEY UPDATE click_count = click_count + 1', array(':ipaddress' => $ip_address,':timestamp' => $filterEndTime,':clickcountfirst' => '1'));
DB info:
CREATE TABLE `ip_address_count` (
`ip_address` varchar(24) NOT NULL DEFAULT '',
`ttl_time_stamp` int(11) DEFAULT NULL,
`click_count` int(11) DEFAULT NULL,
PRIMARY KEY (`ip_address`)
)