SQL performance of a large number of sum()s - mysql

Within my J2EE web application, I need to generate a bar chart representing the percentage of users in the system with specific alerts. (EDIT: I forgot to mention that the graph only deals with alerts associated with the first situation of each user, thus the min(date).)
A simplified (but structurally similar) version of my database schema is as follows :
users { id, name }
situations { id, user_id, date }
alerts { id, situation_id, alertA, alertB }
where users to situations are 1-n, and situations to alerts are 1-1.
I've omitted datatypes but the alerts (alertA and B) are booleans. In my actual case, there are many such alerts (30-ish).
So far, this is what I have come up with :
select sum(alerts.alertA), sum(alerts.alertB)
from alerts, (
select id, min(date)
from situations
group by user_id) as situations
where situations.id = alerts.situation_id;
and then divide these sums by
select count(users.id) from users;
This seems far from ideal.
Your recommendations/advice as to how to improve this query would be most appreciated (or maybe I need to rethink my database schema)...
Thanks,
Anthony
PS. I was also thinking of using a trigger to refresh a chart specific table whenever the alerts table is updated but I guess that's a subject for a different query (if it turns out to be problematic).

First, think about your schema again. You will have a lot of different alerts and you probably don't want to add a separate column for every one of them.
Consider changing your alerts table to something like { id, situation_id, type, value } where type would be (A,B,C,....) and value would be your boolean.
Your task to calculate the percentages would then split up into:
(1) Count the total number of users:
SELECT COUNT(id) AS total FROM users
(2) Find the "first" situation for each user:
SELECT situations.id, situations.user_id
-- selects the minimum date for every user_id
FROM (SELECT user_id, MIN(date) AS min_date
FROM situations
GROUP BY user_id) AS first_situation
-- gets the situations.id for user with minimum date
JOIN situations ON
first_situation.user_id = situations.user_id AND
first_situation.min_date = situations.date
-- limits number of situations per user to 1 (possible min_date duplicates)
GROUP BY situations.user_id
(3) Count users for whom an alert is set in at least one of the situations in the subquery:
SELECT
alerts.type,
COUNT(situations.user_id)
FROM ( ... situations.user_id, situations.id ... ) AS situations
JOIN alerts ON
situations.id = alerts.situation_id
WHERE
alerts.value = 1
GROUP BY
alerts.type
Put those three steps together to get something like:
SELECT
alerts.type,
COUNT(situations.user_id)/users.total
FROM (SELECT situations.id, situations.user_id
FROM (SELECT user_id, MIN(date) AS min_date
FROM situations
GROUP BY user_id) AS first_situation
JOIN situations ON
first_situation.user_id = situations.user_id AND
first_situation.min_date = situations.date
GROUP BY situations.user_id
) AS situations
JOIN alerts ON
situations.id = alerts.situation_id
JOIN (SELECT COUNT(id) AS total FROM users) AS users
WHERE
alerts.value = 1
GROUP BY
alerts.type
All queries written from my head without testing. Even if they don't work exactly like that, you should still get the idea!
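The combined query can be tried out on a handful of rows. Below is a minimal sketch, assuming the normalized alerts layout { id, situation_id, type, value } suggested above. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption, since the question targets MySQL), and the sample rows and the function name alert_percentages are made up for illustration. It counts distinct users rather than grouping by user_id, so duplicate minimum dates do not count a user twice.

```python
import sqlite3

# SQLite stands in for MySQL here (assumption); all sample rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users      (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE situations (id INTEGER PRIMARY KEY, user_id INTEGER, date TEXT);
    CREATE TABLE alerts     (id INTEGER PRIMARY KEY, situation_id INTEGER,
                             type TEXT, value INTEGER);
    INSERT INTO users VALUES (1, 'ann'), (2, 'bob'), (3, 'cy'), (4, 'dee');
    INSERT INTO situations VALUES
        (1, 1, '2020-01-01'), (2, 1, '2020-02-01'),
        (3, 2, '2020-01-05'), (4, 3, '2020-01-07');
    INSERT INTO alerts (situation_id, type, value) VALUES
        (1, 'A', 1), (1, 'B', 0),
        (2, 'A', 0), (2, 'B', 1),   -- not a first situation, so ignored
        (3, 'A', 1), (3, 'B', 1),
        (4, 'A', 0), (4, 'B', 0);
""")


def alert_percentages(conn):
    """Map each alert type to the share of ALL users whose first situation has it set."""
    rows = conn.execute("""
        SELECT a.type, COUNT(DISTINCT s.user_id) * 1.0 / u.total
        FROM (SELECT user_id, MIN(date) AS min_date
              FROM situations
              GROUP BY user_id) AS f
        JOIN situations AS s
          ON s.user_id = f.user_id AND s.date = f.min_date
        JOIN alerts AS a ON a.situation_id = s.id
        CROSS JOIN (SELECT COUNT(*) AS total FROM users) AS u
        WHERE a.value = 1
        GROUP BY a.type, u.total
    """).fetchall()
    return dict(rows)


result = alert_percentages(conn)
print(result)
```

User 4 has no situations at all but still counts in the denominator, which matches dividing by the total number of users as the question describes.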

Related

Query for average response time in mysql

I have a table with columns:
id , conversation_id , session_id , user_id , message , created_at
Every time a user starts a conversation with an employee, a new session starts (with a different session number). All messages between employees and users are stored in this table. The created_at column is a timestamp. I need to filter sessions by employee number and calculate the average response time between the first message a user sends and the first message sent back by a specific employee, for every session, disregarding outlying data where either the customer or the employee did not reply (only one user in the session).
I know this is complicated, but please help!
In this example, in the user_id column, 4 is the employee (keep in mind there are other employees). Every time a new conversation starts, the session_id changes. I have to go through each session for a specific employee, take the timestamps of the first message sent by the customer and by the employee, take the difference, sum all the differences and then take an average, while making sure that the session actually contains two users (filtering out outlying data).
So far, I've come up with this:
SELECT * FROM messages
WHERE session_id IN (
SELECT session_id FROM messages
WHERE user_id =4 )
GROUP BY session_id, user_id
to get the first message from each customer and employee (gives something like this)
So in this specific example, I would omit line 41040, as its session contains only 1 person (column 3, id 1028) and is considered outlying data.
I'm actually appalled by some of the comments... StackOverflow is meant to be a community for helping others. Why bother even taking up comment space if you're gonna complain about my punctuation or give a vague, useless answer?
Anyway, I figured it out.
Basically, I joined the same table multiple times but only queried the necessary data. In the first join, I queried the messages table for the employee's messages and grouped them by session number. In the second join, I followed the same procedure but only extracted the messages from the user. Joining them on the session id automatically omits any session where either a user or an employee is not present. In my case the GROUP BY returned the first row of each group, which is what I wanted since I was looking for the first message in the session (note that MySQL does not guarantee which row it returns for non-aggregated columns, so MIN(created_at) would be the safer choice). I then took the average of the difference between the message timestamps for the user and the employee. In this specific situation, the number 4 is the employee number. Here is what the query looks like. Also, the HAVING AVG_RESP > 0 was necessary in this situation to remove outlying data created when tests are performed:
SELECT AVG(AVG_RESP)
FROM(
SELECT TIME_TO_SEC(TIMEDIFF(t.created_at, u.created_at )) AS AVG_RESP
FROM (
SELECT * FROM messages
WHERE session_id IN (
SELECT session_id FROM messages
WHERE user_id = 4) AND user_id = 4
GROUP BY session_id
) AS t
JOIN(
SELECT * FROM messages
WHERE session_id IN (
SELECT session_id FROM messages
WHERE user_id = 4) AND user_id != 4
GROUP BY session_id
) as u
ON t.session_id = u.session_id
GROUP BY t.session_id
HAVING AVG_RESP > 0
) as ar
Hopefully this helps someone in the future, unlike the people who leave ridiculous, useless comments.
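For comparison, the same idea can be written with MIN(created_at) so that it does not depend on which row GROUP BY happens to return. This is only a sketch under stated assumptions: Python's built-in sqlite3 stands in for MySQL, created_at is stored as integer seconds (in MySQL one would use something like TIMESTAMPDIFF(SECOND, ...) on real timestamps), and the sample rows and the function name average_response_seconds are invented for illustration.

```python
import sqlite3

# SQLite stands in for MySQL (assumption); created_at is integer seconds here.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE messages (id INTEGER PRIMARY KEY, conversation_id INTEGER,
                           session_id INTEGER, user_id INTEGER,
                           message TEXT, created_at INTEGER)
""")
conn.executemany(
    "INSERT INTO messages (conversation_id, session_id, user_id, message, created_at)"
    " VALUES (1, ?, ?, 'hi', ?)",
    [
        (1, 10, 100), (1, 4, 160), (1, 10, 170),  # employee 4 replies after 60 s
        (2, 11, 200), (2, 4, 230), (2, 4, 300),   # employee 4 replies after 30 s
        (3, 12, 400),                             # customer only: skipped
        (4, 4, 500),                              # employee only: skipped
        (5, 13, 600), (5, 5, 650),                # handled by employee 5
    ],
)


def average_response_seconds(conn, employee_id):
    """Average gap between the first non-employee message and the employee's first reply."""
    row = conn.execute("""
        SELECT AVG(e.first_reply - c.first_msg)
        FROM (SELECT session_id, MIN(created_at) AS first_reply
              FROM messages WHERE user_id = ?
              GROUP BY session_id) AS e
        JOIN (SELECT session_id, MIN(created_at) AS first_msg
              FROM messages WHERE user_id <> ?
              GROUP BY session_id) AS c
          ON c.session_id = e.session_id
        WHERE e.first_reply > c.first_msg   -- same role as HAVING AVG_RESP > 0
    """, (employee_id, employee_id)).fetchone()
    return row[0]


avg_for_4 = average_response_seconds(conn, 4)
avg_for_5 = average_response_seconds(conn, 5)
print(avg_for_4, avg_for_5)
```

The inner join drops sessions 3 and 4 because one side is missing, which is the same filtering the answer relies on.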

SQL Query sorting rows by duplicate name keeping lowest in result

I've got a table with 11 columns and I want to create a query that removes the rows with duplicate names in the Full Name column but keeps the row with the lowest value in the Result column. Currently I have this.
SELECT
MIN(sql363686.Results2014.Result),
sql363686.Results2014.Temp,
sql363686.Results2014.Full Name,
sql363686.Results2014.Province,
sql363686.Results2014.BirthDate,
sql363686.Results2014.Position,
sql363686.Results2014.Location,
sql363686.Results2014.Date
FROM
sql363686.Results2014
WHERE
sql363686.Results2014.Event = '50m Freestyle'
AND sql363686.Results2014.Gender = 'M'
AND sql363686.Results2014.Agegroup = 'Junior'
GROUP BY
sql363686.Results2014.Full Name
ORDER BY
sql363686.Results2014.Result ASC ;
At first glance it seems to work fine and I get all the correct values, but I seem to be getting a different (wrong) value in the Position column than what I have in my database table. All other values seem to be right. Any ideas on what I'm doing wrong?
I'm currently using dbVisualizer connected to a MySQL database. Also, my knowledge and experience with SQL is the bare minimum.
Use group by and a join:
select r.*
from sql363686.Results2014 r join
(select fullname, min(result) as minresult
from sql363686.Results2014 r
group by fullname
) rr
on rr.fullname = r.fullname and rr.minresult = r.result;
You have fallen into the trap of the nonstandard MySQL extension to GROUP BY.
(I'm not going to work with all those fully qualified column names; it's unnecessary and verbose.)
I think you're looking for each swimmer's best time in a particular event, and you're trying to pull that from a so-called denormalized table. It looks like your table has these columns.
Result
Temp
FullName
Province
BirthDate
Position
Location
Date
Event
Gender
Agegroup
So, the first step is to locate the best time in each event for each swimmer. To do this we need to make a couple of assumptions.
A person is uniquely identified by FullName, BirthDate, and Gender.
An event is uniquely identified by Event, Gender, Agegroup.
This subquery will get the best time for each swimmer in each event.
SELECT MIN(Result) BestResult,
FullName,BirthDate, Gender,
Event, Agegroup
FROM Results2014
GROUP BY FullName,BirthDate, Gender, Event, Agegroup
This gets you a virtual table with each person's fastest result in each event (using the definitions of person and event mentioned earlier).
Now the challenge is to go find out the circumstances of each person's best time. Those circumstances include Temp, Province, Position, Location, Date. We'll do that with a JOIN between the original table and our virtual table, like this
SELECT resu.Event,
resu.Gender,
resu.Agegroup,
resu.Result,
resu.Temp,
resu.FullName,
resu.Province,
resu.BirthDate,
resu.Position,
resu.Location,
resu.Date
FROM Results2014 resu
JOIN (
SELECT MIN(Result) BestResult,
FullName,BirthDate, Gender,
Event, Agegroup
FROM Results2014
GROUP BY FullName,BirthDate, Gender, Event, Agegroup
) best
ON resu.Result = best.BestResult
AND resu.FullName = best.FullName
AND resu.BirthDate = best.BirthDate
AND resu.Gender = best.Gender
AND resu.Event = best.Event
AND resu.Agegroup = best.Agegroup
ORDER BY resu.Agegroup, resu.Gender, resu.Event, resu.FullName, resu.BirthDate
Do you see how this works? You need an aggregate query that pulls the best times. Then you need to use the column values in that aggregate query in the ON clause to go get the details of the best times from the detail table.
If you want to report on just one event you can include an appropriate WHERE clause right before ORDER BY as follows.
WHERE resu.Event = '50m Freestyle'
AND resu.Gender = 'M'
AND resu.Agegroup = 'Junior'
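To see why the join returns the right Position where the plain GROUP BY did not, here is a small sketch run against made-up rows. It assumes Python's built-in sqlite3 as a stand-in for MySQL and a trimmed-down Results2014 table (only the columns needed for the demonstration); the swimmer names and times are invented.

```python
import sqlite3

# SQLite stands in for MySQL (assumption); rows and the reduced column set are made up.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Results2014 (FullName TEXT, BirthDate TEXT, Gender TEXT,
                              Event TEXT, Agegroup TEXT,
                              Result REAL, Position INTEGER)
""")
conn.executemany(
    "INSERT INTO Results2014 VALUES (?, ?, ?, '50m Freestyle', 'Junior', ?, ?)",
    [
        ("Ann Lee", "2000-01-01", "F", 30.5, 3),
        ("Ann Lee", "2000-01-01", "F", 29.8, 1),   # Ann's best swim
        ("Bo Kim",  "2001-05-05", "M", 31.0, 2),   # Bo's best swim
        ("Bo Kim",  "2001-05-05", "M", 31.4, 5),
    ],
)

# Aggregate first, then join back to the detail rows to pick up Position.
best_rows = conn.execute("""
    SELECT resu.FullName, resu.Result, resu.Position
    FROM Results2014 resu
    JOIN (SELECT MIN(Result) AS BestResult,
                 FullName, BirthDate, Gender, Event, Agegroup
          FROM Results2014
          GROUP BY FullName, BirthDate, Gender, Event, Agegroup) best
      ON resu.Result    = best.BestResult
     AND resu.FullName  = best.FullName
     AND resu.BirthDate = best.BirthDate
     AND resu.Gender    = best.Gender
     AND resu.Event     = best.Event
     AND resu.Agegroup  = best.Agegroup
    ORDER BY resu.FullName
""").fetchall()
print(best_rows)
```

Each returned Position comes from the same row as the minimum Result, because the detail columns are read from the joined row rather than from an arbitrary row of the group.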

Find the one-time users in a system- sql table

I have a table with 6 columns- Date, time, action, user_id, channel, and time_and_date.
Action refers to open or close, when a user starts or end watching a tv channel.
My tasks are as following
to get an overview of the data:
- find the one-time users (who used the service only once or in only one day and
never came back) for each channel, each genre, each community
Another table provides the user_id and genre (news, sport, ...).
How can I find the one time users for those requirements?
You can try something like
SELECT users.id FROM first_table LEFT JOIN users ON first_table.user_id=users.id GROUP BY users.id HAVING COUNT(users.id)=1
You can join your genre table after for selecting over genre channels...
To get the one-time users:
select user_id ,min(channel) as channel, min(genre) as genre, min(community) as community
from action_table
group by user_id
having min(date) = max(date);
Note the having clause. This guarantees that a user has only one date (but not necessarily one record).
This returns one value for each of the three dimensions -- for a one-time user they are the same. For someone who visits multiple times in one day, it chooses one value.
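The HAVING MIN(date) = MAX(date) condition can be checked on a few rows. The following is a sketch only, assuming Python's built-in sqlite3 in place of MySQL, dates stored as ISO text, and an action_table that holds just the columns listed in the question; the rows are made up.

```python
import sqlite3

# SQLite stands in for MySQL (assumption); sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE action_table (date TEXT, time TEXT, action TEXT,
                               user_id INTEGER, channel TEXT)
""")
conn.executemany(
    "INSERT INTO action_table VALUES (?, ?, ?, ?, ?)",
    [
        ("2015-03-01", "10:00", "open",  1, "news"),
        ("2015-03-01", "10:30", "close", 1, "news"),   # user 1: one day, two records
        ("2015-03-01", "09:00", "open",  2, "sport"),
        ("2015-03-02", "09:00", "open",  2, "sport"),  # user 2 came back
        ("2015-03-04", "20:00", "open",  3, "news"),   # user 3: a single record
    ],
)

# A user is "one-time" when their earliest and latest activity dates coincide.
one_time_users = [
    row[0]
    for row in conn.execute("""
        SELECT user_id
        FROM action_table
        GROUP BY user_id
        HAVING MIN(date) = MAX(date)
        ORDER BY user_id
    """)
]
print(one_time_users)
```

User 1 qualifies despite having two records, since both fall on the same day; grouping by channel and user_id instead would give the per-channel variant.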
Sounds something like this:
select user_id, count(*)
from action_table
where action = 'open'
group by user_id
having count(*) = 1
order by user_id

SQL get polls that specified user is winning

Hello all and thanks in advance
I have the tables accounts, votes and contests.
A vote consists of an author ID, a winner ID, and a contest ID, so as to stop people voting twice.
I'd like to show, for any given account, how many times they've won a contest, how many times they've come second and how many times they've come third.
What's the fastest (in execution time) way to do this? (I'm using MySQL.)
After using MySQL for a long time I'm coming to the conclusion that virtually any use of GROUP BY is really bad for performance, so here's a solution with a couple of temporary tables.
CREATE TEMPORARY TABLE VoteCounts (
accountid INT,
contestid INT,
votecount INT DEFAULT 0
);
INSERT INTO VoteCounts (accountid, contestid)
SELECT DISTINCT v2.accountid, v2.contestid
FROM votes v1 JOIN votes v2 USING (contestid)
WHERE v1.accountid = ?; -- the given account
Make sure you have an index on votes(accountid, contestid).
Now you have a table of every contest that your given user was in, with all the other accounts who were in the same contests.
UPDATE Votes AS v JOIN VoteCounts AS vc USING (accountid, contestid)
SET vc.votecount = vc.votecount+1;
Now you have the count of votes for each account in each contest.
CREATE TEMPORARY TABLE Placings (
accountid INT,
contestid INT,
placing INT
);
SET @prevcontest := 0;
SET @placing := 0;
INSERT INTO Placings (accountid, placing, contestid)
SELECT accountid,
IF(contestid=@prevcontest, @placing:=@placing+1, @placing:=1) AS placing,
@prevcontest:=contestid AS contestid
FROM VoteCounts
ORDER BY contestid, votecount DESC;
Now you have a table with each account paired with their respective placing in each contest. It's easy to get the count for a given placing:
SELECT accountid, COUNT(*) AS count_first_place
FROM Placings
WHERE accountid = ? AND placing = 1;
And you can use a MySQL trick to do all three in one query. A boolean expression always returns an integer value 0 or 1 in MySQL, so you can use SUM() to count up the 1's.
SELECT accountid,
SUM(placing=1) AS count_first_place,
SUM(placing=2) AS count_second_place,
SUM(placing=3) AS count_third_place
FROM Placings
WHERE accountid = ?; -- the given account
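The SUM(placing = n) counting trick can be illustrated in isolation. This sketch assumes Python's built-in sqlite3 in place of MySQL (SQLite also evaluates a comparison to 0 or 1) and a Placings table that has already been filled; the rows and the function name podium_counts are made up.

```python
import sqlite3

# SQLite stands in for MySQL (assumption); the Placings rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Placings (accountid INT, contestid INT, placing INT)")
conn.executemany(
    "INSERT INTO Placings VALUES (?, ?, ?)",
    [
        (7, 1, 1), (7, 2, 2), (7, 3, 1), (7, 4, 3), (7, 5, 4),
        (8, 1, 2), (8, 2, 1),
    ],
)


def podium_counts(conn, account_id):
    """Return (first, second, third) place counts for one account."""
    return conn.execute("""
        SELECT SUM(placing = 1), SUM(placing = 2), SUM(placing = 3)
        FROM Placings
        WHERE accountid = ?
    """, (account_id,)).fetchone()


counts_7 = podium_counts(conn, 7)
counts_8 = podium_counts(conn, 8)
print(counts_7, counts_8)
```

Each comparison contributes 1 when true and 0 when false, so a single pass over the account's rows yields all three counts.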
Re your comment:
Yes, it's a complex task no matter what to go from the normalized data you have to the results you want. You want it aggregated (summed), ranked, and aggregated (counted) again. That's a heap of work! :-)
Also, a single query is not always the fastest way to do a given task. It's a common misconception among programmers that shorter code is implicitly faster code.
Note I have not tested this so your mileage may vary.
Re your question about the UPDATE:
It's a tricky way of getting the COUNT() of votes per account without using GROUP BY. I've added table aliases v and vc so it may be more clear now. In the votes table, there are N rows for a given account/contest. In the VoteCounts table, there's one row per account/contest. When I join, the UPDATE is evaluated against the N rows, so if I add 1 for each of those N rows, I get the count of N stored in VoteCounts in the row corresponding to each respective account/contest.
If I'm interpreting things correctly, to stop people voting twice I think you only need a unique index on the votes table by author (account?) ID and contest ID. It won't prevent people from having multiple accounts and voting twice, but it will prevent anyone from casting a vote in a contest twice from the same account. To prevent fraud (sock puppet accounts) you'd need to examine voting patterns and detect when an account votes for another account more often than is statistically likely. Unless you have a lot of contests, that might actually be hard.

Showing all duplicates, side by side, in MySQL

I have a table like so:
Table eventlog
user | user_group | event_date | event_dur.
---- ---------- --------- ----------
xyz 1 2009-1-1 3.5
xyz 2 2009-1-1 4.5
abc 2 2009-1-2 5
abc 1 2009-1-2 5
Notice that in the above sample data, the only reliable things are the date and the user. Through an oversight that is 90% my fault, I have managed to allow users to duplicate their daily entries. In some instances the duplicates were intended to be updates to their duration, in others they were an attempt to change the user_group they were working with that day, and in other cases both.
Fortunately, I have a fairly strong idea (since this is an update to an older system) of which records are correct. (Basically, this all happened as an attempt to seamlessly merge the old DB with the new DB).
Unfortunately, I have to more or less do this by hand, or risk losing data that only exists on one side and not the other....
Long story short, I'm trying to figure out the right MySQL query to return all records that have more than one entry for a user on any given date. I have been struggling with GROUP BY and HAVING, but the best I can get is a list of one of the two duplicates, per duplicate, which would be great if I knew for sure it was the wrong one.
Here is the closest I've come:
SELECT *
FROM eventlog
GROUP BY event_date, user
HAVING COUNT(user) > 1
ORDER BY event_date, user
Any help with this would be extremely useful. If need be, I have the list of users/date for each set of duplicates, so I can go by hand and remove all 400 of them, but I'd much rather see them all at once.
Thanks!
Would this work?
SELECT event_date, user
FROM eventlog
GROUP BY event_date, user
HAVING COUNT(*) > 1
ORDER BY event_date, user
What's throwing me off is the COUNT(user) clause you have.
You can list all the field values of the duplicates with GROUP_CONCAT function, but you still get one row for each set.
I think this would work (untested)
SELECT *
FROM eventlog e1
WHERE 1 <
(
SELECT COUNT(*)
FROM eventlog e2
WHERE e1.event_date = e2.event_date
AND e1.user = e2.user
)
-- AND [maybe an additional constraint to find the bad duplicate]
ORDER BY event_date, user;
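Another way to list every row of every duplicate set is to join the table back to the grouped (user, event_date) pairs. The sketch below assumes Python's built-in sqlite3 as a stand-in for MySQL; it reuses the sample rows from the question plus one invented non-duplicate row for contrast.

```python
import sqlite3

# SQLite stands in for MySQL (assumption); the 'def' row is invented for contrast.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE eventlog (user TEXT, user_group INTEGER,
                           event_date TEXT, event_dur REAL)
""")
conn.executemany(
    "INSERT INTO eventlog VALUES (?, ?, ?, ?)",
    [
        ("xyz", 1, "2009-01-01", 3.5),
        ("xyz", 2, "2009-01-01", 4.5),
        ("abc", 2, "2009-01-02", 5),
        ("abc", 1, "2009-01-02", 5),
        ("def", 1, "2009-01-03", 2),   # single entry, must not be listed
    ],
)

# Find the duplicated (user, event_date) pairs, then pull every matching row.
duplicates = conn.execute("""
    SELECT e.user, e.user_group, e.event_date, e.event_dur
    FROM eventlog e
    JOIN (SELECT user, event_date
          FROM eventlog
          GROUP BY user, event_date
          HAVING COUNT(*) > 1) d
      ON d.user = e.user AND d.event_date = e.event_date
    ORDER BY e.event_date, e.user, e.user_group
""").fetchall()
for row in duplicates:
    print(row)
```

Because the ORDER BY keeps rows for the same user and date adjacent, the duplicates can be compared by eye before deciding which one to delete.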