Find pattern based on events and number of occurrences - mysql

I have a MySQL database table that looks like below:
+----------+------------+---------------------+
| name | event | created |
+----------+------------+---------------------+
| Player1 | Logged in | 2023-02-14 10:05:00 |
| Player2 | Logged in | 2023-02-14 10:05:30 |
| Player3 | Logged out | 2023-02-14 10:06:00 |
| Player1 | Logged out | 2023-02-14 10:10:30 |
| Player4 | Logged in | 2023-02-14 10:10:45 |
| Player2 | Logged out | 2023-02-14 10:20:00 |
| Player4 | Logged out | 2023-02-14 10:30:00 |
| Player5 | Logged in | 2023-02-14 10:30:05 |
| Player1 | Logged in | 2023-02-14 10:30:10 |
| Player5 | Logged out | 2023-02-14 10:32:00 |
+----------+------------+---------------------+
What I want to do is figure out which players might be played by the same person.
To do that, I can look at their respective "Logged in" / "Logged out" events and use that as a pattern.
If a player logs out of the game and another player logs in within 30 seconds, and this happens a few times, then I can assume they are being played by the same person. Similarly if a player logs in and another one logs out.
In the example above, we can see that:
(row 4) **Player1** -> **Logged out**
(row 5) **Player4** -> **Logged in**
These events took place less than 30 seconds apart.
And again, the same thing happened here:
(row 7) **Player4** -> **Logged out**
(row 9) **Player1** -> **Logged in**
These events took place less than 30 seconds apart as well.
We can therefore assume that Player1 and Player4 are being played by the same person.
What I want to generate as result is a new table, that allows me to search for a specific player.
In this case, I want to search for "Player1" and it should return a list of all players that logged in/logged out within 30 seconds after Player1, and had at least 2 such occurrences.
For instance, a table like below would be sufficient as results if I make a search for "Player1":
+----------+------------+
| name | occurences |
+----------+------------+
| Player4 | 2 |
+----------+------------+
Any clue on how I can achieve this?

You can write a query that uses a subquery to get every created timestamp for Player1, then uses a CROSS JOIN to compare those timestamps against the events of all other players:
SELECT name, COUNT(*) AS occurances
FROM table1 ta1
CROSS JOIN (SELECT `created` FROM table1 WHERE name = 'Player1') t1
WHERE ta1.`created` BETWEEN t1.`created` - INTERVAL 30 SECOND
AND t1.`created` + INTERVAL 30 SECOND
AND name <> 'Player1'
GROUP BY name
HAVING COUNT(*) > 1
+---------+------------+
| name    | occurances |
+---------+------------+
| Player4 | 2          |
+---------+------------+
fiddle

Something like:
SELECT LEAST(t1.name, t2.name) name1, GREATEST(t1.name, t2.name) name2, COUNT(*)
FROM events t1
JOIN events t2 ON t1.event = 'Logged out'
AND t2.event = 'Logged in'
AND t2.created BETWEEN t1.created AND t1.created + INTERVAL 30 SECOND
WHERE t1.created BETWEEN @some_range_start AND @some_range_end
GROUP BY 1, 2
HAVING COUNT(*) >= @some_limit;
where:
@some_range_start, @some_range_end - date range to be investigated (remove this condition if you want to investigate the whole table);
@some_limit - minimum number of such incidents for a given pair of logins.
DEMO fiddle
Remember: this query won't be fast.
PS. The number of seconds can be parametrized too, of course.
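For anyone who wants to try the self-join idea without a MySQL server, here is a minimal sketch that loads the question's sample rows into an in-memory SQLite database from Python. It is an adaptation, not the query above: SQLite has no INTERVAL arithmetic, so the gap is computed from strftime('%s', ...), the two-argument MIN/MAX stand in for LEAST/GREATEST, and the parameter names window and min_hits are made up for this example. It only counts "logout followed by another player's login" handovers.

```python
import sqlite3

# Sample rows copied from the question.
rows = [
    ("Player1", "Logged in",  "2023-02-14 10:05:00"),
    ("Player2", "Logged in",  "2023-02-14 10:05:30"),
    ("Player3", "Logged out", "2023-02-14 10:06:00"),
    ("Player1", "Logged out", "2023-02-14 10:10:30"),
    ("Player4", "Logged in",  "2023-02-14 10:10:45"),
    ("Player2", "Logged out", "2023-02-14 10:20:00"),
    ("Player4", "Logged out", "2023-02-14 10:30:00"),
    ("Player5", "Logged in",  "2023-02-14 10:30:05"),
    ("Player1", "Logged in",  "2023-02-14 10:30:10"),
    ("Player5", "Logged out", "2023-02-14 10:32:00"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (name TEXT, event TEXT, created TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

# Pair every logout (t1) with logins of *other* players (t2) that
# happen 0..window seconds later, then count handovers per pair.
query = """
SELECT MIN(t1.name, t2.name) AS name1,
       MAX(t1.name, t2.name) AS name2,
       COUNT(*)              AS occurrences
FROM events t1
JOIN events t2
  ON t1.event = 'Logged out'
 AND t2.event = 'Logged in'
 AND t1.name <> t2.name
 AND (CAST(strftime('%s', t2.created) AS INTEGER)
      - CAST(strftime('%s', t1.created) AS INTEGER)) BETWEEN 0 AND :window
GROUP BY name1, name2
HAVING COUNT(*) >= :min_hits
ORDER BY name1, name2
"""
pairs = conn.execute(query, {"window": 30, "min_hits": 2}).fetchall()
print(pairs)
```

On the sample data only the Player1/Player4 pair reaches two handovers; Player4/Player5 has a single one and is filtered out by the HAVING clause.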

Click-through ratio of articles with two different tables using SQL

I would like to calculate the Click-Through Ratio (CTR) of several articles of a website using SQL.
The formula of the CTR is CTR = number clicks / number impressions, i.e. a ratio of how many times an article has been clicked and how many times it has been shown.
I have two tables:
`article_click`: A table with several columns, namely `article_id` (denoting the id of the article), `description` (a brief description of the article), `timestamp` (when it has been clicked), among others. Every time a user clicks an article, a new row is created in the table.
`article_impression`: Similarly, a table with several columns, namely `article_id` (denoting the id of the article), `description` (a brief description of the article), `timestamp` (when it has been shown), among others. Every time an article is shown to a user, a new row is created in the table.
Both tables 1 and 2 look like this:
+------------+-------------+------------------+-----+
| article_id | description | timestamp | ... |
+------------+-------------+------------------+-----+
| 102 | Potatoe | 2021-01-01 13:45 | ... |
| 11 | Lettuce | 2020-02-11 11:00 | ... |
| 34 | Train | 2019-12-12 09:31 | ... |
| 21 | Car | 2011-11-11 08:32 | ... |
| 201 | Train | 2014-02-10 02:12 | ... |
| ... | ... | ... | ... |
+------------+-------------+------------------+-----+
And I would like to create a table such that:
+------------+-----+
| article_id | CTR |
+------------+-----+
| 11 | 0.4 |
| 23 | 0.6 |
| 34 | 0.2 |
| 44 | 0.8 |
| 45 | 0.3 |
| ... | ... |
+------------+-----+
In order to do so, I have tried:
SELECT article_click.article_id, COUNT(article_click.article_id) / COUNT(article_impression.article_id) AS CTR
FROM article_click
INNER JOIN article_impression ON article_click.article_id = article_impression.article_id
GROUP BY article_click.article_id DESC;
But I obtain something like:
+------------+-----+
| article_id | CTR |
+------------+-----+
| 11 | 1.0 |
| 23 | 1.0 |
| 34 | 1.0 |
| 44 | 1.0 |
| 45 | 1.0 |
| ... | ... |
+------------+-----+
Can anyone spot the mistake here? I'm using MySQL as RDBMS.
If the click-through-rate (CTR) is number clicks / number impressions then you'll need to calculate the number of clicks on an article and the number of impressions on an article before joining them to perform the calculation.
You could do this with subqueries or CTEs, but I've opted for the former here.
SELECT c.article_id, c.click_count / i.impression_count AS CTR
FROM (
SELECT article_id, COUNT(*) AS click_count
FROM article_click
GROUP BY article_id) AS c
INNER JOIN (
SELECT article_id, COUNT(*) AS impression_count
FROM article_impression
GROUP BY article_id) AS i
ON c.article_id = i.article_id;
Try it out on SQL Fiddle.
Note that using an INNER JOIN will exclude articles that have impressions but no clicks, so you won't get results where the CTR is 0. If you want those, you can use a LEFT JOIN from impressions to clicks. Since an article cannot be clicked if it has not been shown, we know that a LEFT JOIN from impressions to clicks is sufficient to show all data.
SELECT i.article_id, COALESCE(c.click_count, 0) / i.impression_count AS CTR
FROM (
SELECT article_id, COUNT(*) AS impression_count
FROM article_impression
GROUP BY article_id) AS i
LEFT JOIN (
SELECT article_id, COUNT(*) AS click_count
FROM article_click
GROUP BY article_id) AS c
ON i.article_id = c.article_id;
Note that we have to select the article_id from article_impression, since the article_id from article_click is NULL for articles with no clicks. For the same reason, we have to COALESCE the click_count; otherwise the CTR would come out as NULL instead of 0.
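To see why the original query always returns 1.0 and why aggregating first fixes it, here is a small sketch using Python's sqlite3 with made-up data (article 11 shown 5 times and clicked twice, article 34 shown twice and never clicked). One assumption to be aware of: SQLite does integer division on integers, so the sketch multiplies by 1.0, which MySQL's / operator does not need.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article_impression (article_id INTEGER)")
conn.execute("CREATE TABLE article_click (article_id INTEGER)")
# Hypothetical data, not taken from the question.
conn.executemany("INSERT INTO article_impression VALUES (?)",
                 [(11,)] * 5 + [(34,)] * 2)
conn.executemany("INSERT INTO article_click VALUES (?)", [(11,)] * 2)

# Joining raw rows pairs every click with every impression (2 * 5 = 10
# rows for article 11), so both COUNTs see the same 10 rows.
naive = conn.execute("""
    SELECT c.article_id,
           COUNT(c.article_id) * 1.0 / COUNT(i.article_id) AS ctr
    FROM article_click c
    INNER JOIN article_impression i ON c.article_id = i.article_id
    GROUP BY c.article_id
""").fetchall()

# Aggregate each table first, then join one row per article.
fixed = conn.execute("""
    SELECT i.article_id,
           COALESCE(c.click_count, 0) * 1.0 / i.impression_count AS ctr
    FROM (SELECT article_id, COUNT(*) AS impression_count
          FROM article_impression GROUP BY article_id) AS i
    LEFT JOIN (SELECT article_id, COUNT(*) AS click_count
               FROM article_click GROUP BY article_id) AS c
      ON i.article_id = c.article_id
    ORDER BY i.article_id
""").fetchall()

print(naive)
print(fixed)
```

The naive version reports a ratio of 1.0 for article 11 and drops article 34 entirely; the aggregated version keeps both articles, with article 34 at a CTR of 0.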
Before joining, avoid duplicating rows: compute the count for each table separately, then join the two aggregated queries.
SELECT a.article_id, a.article_click / b.article_impression_click AS ctr
FROM (
SELECT article_id, COUNT(article_id) AS article_click
FROM article_click
GROUP BY article_id) a
INNER JOIN (
SELECT article_id, COUNT(article_id) AS article_impression_click
FROM article_impression
GROUP BY article_id) b
ON a.article_id = b.article_id;
Another option is to stack both tables with UNION ALL and count by type ('S' = shown, 'C' = clicked):
WITH
v_article AS
( SELECT 'S' type, article_impression.article_id AS id FROM article_impression
UNION ALL
SELECT 'C' type, article_click.article_id AS id FROM article_click
)
SELECT
v_article.id,
COUNT(CASE WHEN v_article.type = 'S' THEN 1 END) nb_show,
COUNT(CASE WHEN v_article.type = 'C' THEN 1 END) nb_click,
CASE
WHEN COUNT(CASE WHEN v_article.type = 'S' THEN 1 END) > 0 THEN
ROUND(COUNT(CASE WHEN v_article.type = 'C' THEN 1 END) / COUNT(CASE WHEN v_article.type = 'S' THEN 1 END), 2)
END ratio_click_show
FROM v_article
GROUP BY
v_article.id
;
If you're sure an article can be clicked only if it has previously been shown (nb_show > 0 and nb_show >= nb_click), you can remove the CASE around the ratio calculation.

combine 3 queries to one, SELECT / COUNT / INSERT

I need help to optimize my 3 queries into one.
I have 2 tables, the first has a list of image processing servers I use, so different servers can handle different simultaneous job loads at a time, so I have a field called quota as seen below.
First table name, "img_processing_servers"
| id | server_url | server_key | server_quota |
| 1 | examp.uu.co | X0X1X2XX3X | 5 |
| 2 | examp2.uu.co| X0X1X2YX3X | 3 |
The second table registers if there is a job being performed at this moment on the server
Second table, "img_servers_lock"
| id | lock_server | timestamp |
| 1 | 1 | 2020-04-30 12:08:09 |
| 2 | 1 | 2020-04-30 12:08:09 |
| 3 | 1 | 2020-04-30 12:08:09 |
| 4 | 2 | 2020-04-30 12:08:09 |
| 5 | 2 | 2020-04-30 12:08:09 |
| 6 | 2 | 2020-04-30 12:08:09 |
Basically what I want to achieve is that my image servers don't go past the max quota and crash, so the 3 queries I would like to combine are:
Select at least one available server that hasn't reached its quota and then insert a lock record for it.
SELECT * FROM `img_processing_servers` WHERE
SELECT COUNT(timestamp) FROM `img_servers_lock` WHERE `lock_server` = id
! if the count is < than quota, go ahead and register use
INSERT INTO `img_servers_lock`(`lock_server`, `timestamp`) VALUES (id_of_available_server, now())
How would I go about creating this single query?
My goal is to keep my image servers safe from overload.
Join the two tables and put that into an INSERT query.
INSERT INTO img_servers_lock(lock_server, timestamp)
SELECT s.id, NOW()
FROM img_processing_servers s
LEFT JOIN img_servers_lock l ON l.lock_server = s.id
GROUP BY s.id
HAVING IFNULL(COUNT(l.id), 0) < s.server_quota
ORDER BY s.server_quota - IFNULL(COUNT(l.id), 0) DESC
LIMIT 1
The ORDER BY clause makes it select the server with the most available quota.
OK, I ran into one small bug: s.server_quota had to be added to the GROUP BY for it to work in the HAVING clause:
INSERT INTO img_servers_lock(lock_server, timestamp)
SELECT s.id, NOW()
FROM img_processing_servers s
LEFT JOIN img_servers_lock l ON l.lock_server = s.id
GROUP BY s.id, s.server_quota
HAVING IFNULL(COUNT(l.id), 0) < s.server_quota
ORDER BY s.server_quota - IFNULL(COUNT(l.id), 0) DESC
LIMIT 1
Thanks again Barmar!
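If it helps, here is a small sketch of the INSERT ... SELECT behaviour using Python's sqlite3 and the two tables from the question. Assumptions: datetime('now') replaces MySQL's NOW(), the server_key column is left out, and try_lock is a helper name invented for this example. It only shows the single-session logic; it does not address two sessions running the statement at the same moment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE img_processing_servers (
    id INTEGER PRIMARY KEY, server_url TEXT, server_quota INTEGER);
CREATE TABLE img_servers_lock (
    id INTEGER PRIMARY KEY, lock_server INTEGER, timestamp TEXT);
INSERT INTO img_processing_servers VALUES
    (1, 'examp.uu.co', 5), (2, 'examp2.uu.co', 3);
INSERT INTO img_servers_lock (lock_server, timestamp) VALUES
    (1, '2020-04-30 12:08:09'), (1, '2020-04-30 12:08:09'),
    (1, '2020-04-30 12:08:09'), (2, '2020-04-30 12:08:09'),
    (2, '2020-04-30 12:08:09'), (2, '2020-04-30 12:08:09');
""")

# Pick the server with the most free slots and register one lock for it.
# Nothing is inserted when every server has reached its quota.
INSERT_SQL = """
INSERT INTO img_servers_lock (lock_server, timestamp)
SELECT s.id, datetime('now')
FROM img_processing_servers s
LEFT JOIN img_servers_lock l ON l.lock_server = s.id
GROUP BY s.id, s.server_quota
HAVING COUNT(l.id) < s.server_quota
ORDER BY s.server_quota - COUNT(l.id) DESC
LIMIT 1
"""

def try_lock(conn):
    """Return True if a lock row was inserted, False if all servers are full."""
    count_sql = "SELECT COUNT(*) FROM img_servers_lock"
    before = conn.execute(count_sql).fetchone()[0]
    conn.execute(INSERT_SQL)
    return conn.execute(count_sql).fetchone()[0] > before

results = [try_lock(conn) for _ in range(3)]
locks = conn.execute("""
    SELECT lock_server, COUNT(*) FROM img_servers_lock
    GROUP BY lock_server ORDER BY lock_server
""").fetchall()
print(results, locks)
```

With the question's data, server 2 is already at its quota of 3, so the first two calls take the two remaining slots on server 1 and the third call inserts nothing.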

Query to get the count of logins by a user within a set time interval from previous login

I want to get a count of how many times a user logs in within, let's say, 5 hours from the previous login.
So something like new_login - old_login < 5 hours.
The login table would have user_id and time_accessed.
This query is to get the count of user logins within a day. I can't figure out how to compare the different times within the same column within the same statement:
SELECT user_id, date(time_accessed), count(user_id) AS login_within_5_hour_period
FROM login
GROUP BY user_id, date(time_accessed)
ORDER BY time_accessed;
Sample input
+---------+---------------------+
| user_id | time_accessed |
+---------+---------------------+
| 1 | 2020-02-19 09:00:00 |
| 1 | 2020-02-19 12:00:00 |
| 1 | 2020-02-19 13:00:00 |
| 1 | 2020-02-19 19:00:00 |
+---------+---------------------+
Sample output
+---------+---------------------+----------------------------+
| user_id | date(time_accessed) | login_within_5_hour_period |
+---------+---------------------+----------------------------+
| 1 | 2020-02-19 | 3 |
| 1 | 2020-02-19 | 1 |
+---------+---------------------+----------------------------+
In order to compare different times, you need to join the table with itself.
The following query will find the number of logins by the user within 5 hours, excluding the current login. If you want to include the current login in the count, change this l1.time_accessed > l2.time_accessed to l1.time_accessed >= l2.time_accessed.
SELECT l1.user_id, l1.time_accessed, COUNT(l2.user_id) AS login_within_5_hour_period
FROM logins l1
LEFT JOIN logins l2
ON l1.user_id = l2.user_id
AND l1.time_accessed > l2.time_accessed
AND TIME_TO_SEC(TIMEDIFF(l1.time_accessed, l2.time_accessed)) / 3600 <= 5
GROUP BY l1.user_id, l1.time_accessed;
This second query will return a single result, showing the number of logins by the user within 5 hours of the time specified.
SELECT l1.user_id, l1.time_accessed, COUNT(l2.user_id) AS login_within_5_hour_period
FROM logins l1
LEFT JOIN logins l2
ON l1.user_id = l2.user_id
AND l1.time_accessed > l2.time_accessed
AND TIME_TO_SEC(TIMEDIFF(l1.time_accessed, l2.time_accessed)) / 3600 <= 5
WHERE l1.time_accessed = '2020-02-19 19:00:00'
GROUP BY l1.user_id, l1.time_accessed;
Working example: https://www.db-fiddle.com/f/g7jDYqoKn38iQTFuPjej9m/1
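The same self-join can be tried locally with Python's sqlite3 on the question's sample input. This is a sketch under one assumption: SQLite has no TIMEDIFF/TIME_TO_SEC, so the gap is computed in seconds with strftime('%s', ...). As in the first query above, it counts, for each login, how many earlier logins by the same user fall within the previous 5 hours.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logins (user_id INTEGER, time_accessed TEXT)")
conn.executemany("INSERT INTO logins VALUES (?, ?)", [
    (1, "2020-02-19 09:00:00"),
    (1, "2020-02-19 12:00:00"),
    (1, "2020-02-19 13:00:00"),
    (1, "2020-02-19 19:00:00"),
])

# l1 is the "current" login, l2 ranges over strictly earlier logins of
# the same user that are at most 5 hours (18000 seconds) older.
counts = conn.execute("""
    SELECT l1.user_id, l1.time_accessed,
           COUNT(l2.user_id) AS earlier_logins_within_5h
    FROM logins l1
    LEFT JOIN logins l2
      ON l1.user_id = l2.user_id
     AND l1.time_accessed > l2.time_accessed
     AND (CAST(strftime('%s', l1.time_accessed) AS INTEGER)
          - CAST(strftime('%s', l2.time_accessed) AS INTEGER)) <= 5 * 3600
    GROUP BY l1.user_id, l1.time_accessed
    ORDER BY l1.time_accessed
""").fetchall()

for row in counts:
    print(row)
```

The 19:00 login gets a count of 0 because the previous login (13:00) is 6 hours earlier, while the 13:00 login sees both the 09:00 and 12:00 logins.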

In mysql: how can I select the most recently added row when selecting by MAX if two values are equal (application is a games high score table)

I am trying to construct a highscore table from entries in a table with the layout
id(int) | username(varchar) | score(int) | modified (timestamp)
selecting the highest scores per day for each user is working well using the following:
SELECT id, username, MAX( score ) AS hiscore
FROM entries WHERE DATE( modified ) = CURDATE( )
Where I am stuck is that in some cases players may achieve the same score multiple times in the same day. In that case I need to make sure the earliest one is always selected, because when two scores match, the player who reached that score first wins.
if my table contains the following:
id | username | score | modified
________|___________________|____________|_____________________
1 | userA | 22 | 2014-01-22 08:00:14
2 | userB | 22 | 2014-01-22 12:26:06
3 | userA | 22 | 2014-01-22 16:13:22
4 | userB | 15 | 2014-01-22 18:49:01
The returned winning table in this case should be:
id | username | score | modified
________|___________________|____________|_____________________
1 | userA | 22 | 2014-01-22 08:00:14
2 | userB | 22 | 2014-01-22 12:26:06
I tried to achieve this by adding ORDER BY modified desc to the query, but it always returns the later score. I tried ORDER BY modified asc as well, but I got the same result
This is the classic greatest-n-per-group problem, which has been answered frequently on StackOverflow. Here's a solution for your case:
SELECT e.*
FROM entries e
JOIN (
SELECT DATE(modified) AS modified_date, MAX(score) AS score
FROM entries
GROUP BY modified_date
) t ON DATE(e.modified) = t.modified_date AND e.score = t.score
WHERE DATE(e.modified) = CURDATE()
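The question's expected output keeps one row per user: that user's best score of the day, at the earliest time it was reached. Here is a sketch of one way to get exactly those rows, written against SQLite from Python so it can be run as-is. Assumptions: the day is passed in as a parameter (named day here) instead of using CURDATE(), and the grouping is done per username, which goes beyond the per-date grouping in the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE entries (
    id INTEGER PRIMARY KEY, username TEXT, score INTEGER, modified TEXT)""")
conn.executemany("INSERT INTO entries VALUES (?, ?, ?, ?)", [
    (1, "userA", 22, "2014-01-22 08:00:14"),
    (2, "userB", 22, "2014-01-22 12:26:06"),
    (3, "userA", 22, "2014-01-22 16:13:22"),
    (4, "userB", 15, "2014-01-22 18:49:01"),
])

query = """
SELECT e.id, e.username, e.score, e.modified
FROM entries e
JOIN (
    -- Step 1, best score per user for the given day
    SELECT username, MAX(score) AS score
    FROM entries
    WHERE DATE(modified) = :day
    GROUP BY username
) best ON best.username = e.username AND best.score = e.score
WHERE DATE(e.modified) = :day
  -- Step 2, among rows with that score keep the earliest one
  AND e.modified = (
      SELECT MIN(e2.modified)
      FROM entries e2
      WHERE e2.username = e.username
        AND e2.score = e.score
        AND DATE(e2.modified) = :day
  )
ORDER BY e.score DESC, e.modified ASC
"""
winners = conn.execute(query, {"day": "2014-01-22"}).fetchall()
for row in winners:
    print(row)
```

Ordering by score descending and then by modified ascending puts the earliest player to reach the top score first, which matches the tie-break rule described in the question.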
I think this would work for you and is the simplest way:
SELECT username, MAX(score), MIN(modified)
FROM entries
GROUP BY username
This returns this in your case:
"userB";22;"2014-01-22 12:26:06"
"userA";22;"2014-01-22 08:00:14"
However, if what you actually want is the most recent row (in which case your example output would be wrong), you need this:
SELECT username, MAX(score), MAX(modified)
FROM entries
GROUP BY username
Which returns:
"userB";22;"2014-01-22 18:49:01"
"userA";22;"2014-01-22 16:13:22"

Groupwise maximum

I have a table from which I am trying to retrieve the latest position for each security:
The Table:
My query to create the table: SELECT id, security, buy_date FROM positions WHERE client_id = 4
+-------+----------+------------+
| id | security | buy_date |
+-------+----------+------------+
| 26 | PCS | 2012-02-08 |
| 27 | PCS | 2013-01-19 |
| 28 | RDN | 2012-04-17 |
| 29 | RDN | 2012-05-19 |
| 30 | RDN | 2012-08-18 |
| 31 | RDN | 2012-09-19 |
| 32 | HK | 2012-09-25 |
| 33 | HK | 2012-11-13 |
| 34 | HK | 2013-01-19 |
| 35 | SGI | 2013-01-17 |
| 36 | SGI | 2013-02-16 |
| 18084 | KERX | 2013-02-20 |
| 18249 | KERX | 0000-00-00 |
+-------+----------+------------+
I have been messing with versions of queries based on this page, but I cannot seem to get the result I'm looking for.
Here is what I've been trying:
SELECT t1.id, t1.security, t1.buy_date
FROM positions t1
WHERE buy_date = (SELECT MAX(t2.buy_date)
FROM positions t2
WHERE t1.security = t2.security)
But this just returns me:
+-------+----------+------------+
| id | security | buy_date |
+-------+----------+------------+
| 27 | PCS | 2013-01-19 |
+-------+----------+------------+
I'm trying to get the maximum/latest buy date for each security, so the results would have one row for each security with the most recent buy date. Any help is greatly appreciated.
EDIT: The position's id must be returned with the max buy date.
You can use this query. In my tests on a larger data set it ran in about 75% less time than the subquery versions, which were slower.
SELECT p1.id,
p1.security,
p1.buy_date
FROM positions p1
left join
positions p2
on p1.security = p2.security
and p1.buy_date < p2.buy_date
where
p2.id is null;
SQL-Fiddle link
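To check the anti-join against the question's data, here is a minimal sketch using Python's sqlite3. The dates are stored as ISO-formatted text, which compares correctly as strings (an assumption of this sketch; MySQL would use a DATE column). A row survives only when no other row for the same security has a later buy_date.

```python
import sqlite3

# Rows copied from the question (client_id = 4 only).
rows = [
    (26, "PCS", "2012-02-08"), (27, "PCS", "2013-01-19"),
    (28, "RDN", "2012-04-17"), (29, "RDN", "2012-05-19"),
    (30, "RDN", "2012-08-18"), (31, "RDN", "2012-09-19"),
    (32, "HK", "2012-09-25"), (33, "HK", "2012-11-13"),
    (34, "HK", "2013-01-19"), (35, "SGI", "2013-01-17"),
    (36, "SGI", "2013-02-16"), (18084, "KERX", "2013-02-20"),
    (18249, "KERX", "0000-00-00"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE positions (id INTEGER, security TEXT, buy_date TEXT)")
conn.executemany("INSERT INTO positions VALUES (?, ?, ?)", rows)

# p2 matches any later position in the same security; if none exists
# (p2.id IS NULL), p1 is the latest position for that security.
latest = conn.execute("""
    SELECT p1.id, p1.security, p1.buy_date
    FROM positions p1
    LEFT JOIN positions p2
      ON p1.security = p2.security
     AND p1.buy_date < p2.buy_date
    WHERE p2.id IS NULL
    ORDER BY p1.id
""").fetchall()

for row in latest:
    print(row)
```

Note that if two positions in the same security share the latest buy_date, both rows are returned, since neither has a strictly later partner.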
You can use a subquery to get the result:
SELECT p1.id,
p1.security,
p1.buy_date
FROM positions p1
inner join
(
SELECT MAX(buy_date) MaxDate, security
FROM positions
group by security
) p2
on p1.buy_date = p2.MaxDate
and p1.security = p2.security
See SQL Fiddle with Demo
Or you can use the following with a WHERE clause:
SELECT t1.id, t1.security, t1.buy_date
FROM positions t1
WHERE buy_date = (SELECT MAX(t2.buy_date)
FROM positions t2
WHERE t1.security = t2.security
group by t2.security)
See SQL Fiddle with Demo
This is done with a simple group by. You want to group by the securities and get the max of buy_date. The SQL:
SELECT security, max(buy_date)
from positions
group by security
Note, this is faster than bluefeet's answer but does not display the ID.
The answer by @bluefeet has two more ways to get the results you want - and the first will probably be more efficient than your query.
What I don't understand is why you say that your query doesn't work. It seems pretty fine and returns the expected result. Tested at SQL-Fiddle
SELECT t1.id, t1.security, t1.buy_date
FROM positions t1
WHERE buy_date = ( SELECT MAX(t2.buy_date)
FROM positions t2
WHERE t1.security = t2.security ) ;
If the problem appears when you add the client_id = 4 condition, then it's because you added it in only one WHERE clause, while you have to add it in both:
SELECT t1.id, t1.security, t1.buy_date
FROM positions t1
WHERE client_id = 4
AND buy_date = ( SELECT MAX(t2.buy_date)
FROM positions t2
WHERE client_id = 4
AND t1.security = t2.security ) ;
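Here is a small sketch of what goes wrong when the client filter is applied only to the outer query, using Python's sqlite3. The data is a subset of the question's rows plus one hypothetical row (id 40, client 5) that I made up to trigger the problem: another client holds a later PCS position, so the unfiltered subquery returns a date that client 4 never bought on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE positions (
    id INTEGER, client_id INTEGER, security TEXT, buy_date TEXT)""")
conn.executemany("INSERT INTO positions VALUES (?, ?, ?, ?)", [
    (26, 4, "PCS", "2012-02-08"),
    (27, 4, "PCS", "2013-01-19"),
    (30, 4, "RDN", "2012-08-18"),
    (31, 4, "RDN", "2012-09-19"),
    (40, 5, "PCS", "2014-01-01"),  # hypothetical row for another client
])

# Filter only in the outer query: MAX() still looks at all clients.
one_sided = conn.execute("""
    SELECT t1.id, t1.security, t1.buy_date
    FROM positions t1
    WHERE t1.client_id = 4
      AND t1.buy_date = (SELECT MAX(t2.buy_date)
                         FROM positions t2
                         WHERE t2.security = t1.security)
    ORDER BY t1.id
""").fetchall()

# Filter in both places: MAX() is taken over client 4's rows only.
both_sides = conn.execute("""
    SELECT t1.id, t1.security, t1.buy_date
    FROM positions t1
    WHERE t1.client_id = 4
      AND t1.buy_date = (SELECT MAX(t2.buy_date)
                         FROM positions t2
                         WHERE t2.client_id = 4
                           AND t2.security = t1.security)
    ORDER BY t1.id
""").fetchall()

print(one_sided)
print(both_sides)
```

With the filter on one side only, PCS disappears from the result for client 4; with the filter on both sides, each of the client's securities gets its latest row.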
SELECT security, MAX(buy_date) FROM positions GROUP BY security;
is all you need to get the max buy date for each security (when you say out loud what you want from a query and you include the phrase "for each x", you probably want a GROUP BY on x).
When you use a group by, all columns in your select must either be columns that have been grouped by or aggregates, so if, for example, you wanted to include id, you'd probably have to use a subquery similar to what you had before, since there doesn't seem to be any aggregate you can reasonably use on the ids, and another group by would give you too many rows.