SQL: Previous Column empty when setting AVG() - mysql

OK, I am a bit of a newbie when it comes to SQL, so I apologize if this is self-evident.
I am trying to find out 3 things from a database (this table is a log of every message sent):
Total Reply Time
Total # of Replies that were Under 10 Mins
Average Reply Time
Here is my SQL:
SELECT
*, SUM(case when tmp.reply_time <= 10 then 1 else 0 end) as under_10_mins,
COUNT(tmp.reply_time) AS total_replies
FROM
(SELECT
TIMESTAMPDIFF(MINUTE, `date`, reply_date) as reply_time
FROM
tme_email_staff_reply sr
JOIN
tme_user u
ON
u.id = sr.staff_id
JOIN
tme_email_message m
ON
m.id = sr.message_id
WHERE
`reply_date` >= '2017-04-01 00:00:00'
AND
`reply_date` < '2017-04-27 00:00:00'
)
AS tmp
Which outputs:
| reply_time | under_10_mins | total_replies |
| 106 | 165 | 375 |
Now, when I add in:
SELECT
*, SUM(case when tmp.reply_time <= 10 then 1 else 0 end) as under_10_mins,
COUNT(tmp.reply_time) AS total_replies
FROM
(SELECT
TIMESTAMPDIFF(MINUTE, `date`, reply_date) as reply_time,
(AVG(TIMESTAMPDIFF(SECOND, `date`, reply_date))/60) AS average_reply_time
FROM
tme_email_staff_reply sr
JOIN
tme_user u
ON
u.id = sr.staff_id
JOIN
tme_email_message m
ON
m.id = sr.message_id
WHERE
`reply_date` >= '2017-04-01 00:00:00'
AND
`reply_date` < '2017-04-27 00:00:00'
)
AS tmp
my response is:
| reply_time | average_reply_time |under_10_mins | total_replies |
| 106 | 149.08626667 | 0 | 1 |
As you can see, the under_10_mins and total_replies fields have changed.
Schema for tables linked:
tme_email_staff_reply:
id | staff_id | message_id | reply_date |
1 | 234,221,001 | 15fg16d5dgw2 | 2017-04-01 09:34:16 |
tme_user
id | username | password | email | dob | gender |
// data omitted
tme_email_message
id | thread_id | From | To | subject | message | message_id
// data omitted
Can anyone tell me why this is so? and how to fix it?

Why this is so?
Let's see AVG:
AVG([DISTINCT] expr)
Returns the average value of expr. The DISTINCT option can be used to return the average of the distinct values of expr.
If there are no matching rows, AVG() returns NULL.
And doc in 13.19.1 Aggregate (GROUP BY) Function Descriptions also said:
If you use a group function in a statement containing no GROUP BY clause, it is equivalent to grouping on all rows. For more information, see Section 13.19.3, “MySQL Handling of GROUP BY”.
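This means that, because your subquery uses AVG() without a GROUP BY, it averages over all the rows and returns a single row. The outer query then sees only that one row, which is why total_replies becomes 1 and under_10_mins becomes 0 (that row's reply_time is 106).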
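To see the collapse concretely, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for MySQL. That substitution is an assumption made so the example runs anywhere; SQLite, like MySQL before 5.7.5, tolerates a bare column next to an aggregate. The table name replies and its four values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE replies (reply_time INTEGER)")  # hypothetical table
conn.executemany("INSERT INTO replies VALUES (?)", [(5,), (8,), (20,), (45,)])

# AVG() inside the subquery: with no GROUP BY, the subquery yields ONE row,
# so the outer COUNT() only ever sees a single row.
collapsed = conn.execute("""
    SELECT COUNT(tmp.reply_time)
    FROM (SELECT reply_time, AVG(reply_time) AS average_reply_time
          FROM replies) AS tmp
""").fetchone()[0]

# AVG() moved to the outer query: the subquery keeps one row per reply.
total, under_10, average = conn.execute("""
    SELECT COUNT(tmp.reply_time),
           SUM(CASE WHEN tmp.reply_time <= 10 THEN 1 ELSE 0 END),
           AVG(tmp.reply_time)
    FROM (SELECT reply_time FROM replies) AS tmp
""").fetchone()

print(collapsed, total, under_10, average)
```

With the aggregate in the subquery the outer count is 1; with it in the outer query all four rows are counted.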
How to fix it?
I think you should move avg from subquery to outer query:
SELECT
SUM(case when tmp.reply_time <= 10 then 1 else 0 end) as under_10_mins,
COUNT(tmp.reply_time) AS total_replies,
AVG(average_reply_time) AS average_reply_time
FROM
(SELECT
TIMESTAMPDIFF(MINUTE, `date`, reply_date) as reply_time,
(TIMESTAMPDIFF(SECOND, `date`, reply_date))/60 AS average_reply_time
FROM
tme_email_staff_reply sr
JOIN
tme_user u
ON
u.id = sr.staff_id
JOIN
tme_email_message m
ON
m.id = sr.message_id
WHERE
`reply_date` >= '2017-04-01 00:00:00'
AND
`reply_date` < '2017-04-27 00:00:00'
)
AS tmp

The issue arises because your nested query refers to nonaggregated columns not named in a GROUP BY clause, on a MySQL version older than 5.7.5. See the documentation, and notice that the server is free to choose any value from each group.
MySQL < 5.7.5 allows this syntax but has special behaviour (your case):
MySQL extends the standard SQL use of GROUP BY so that the select list can refer to nonaggregated columns not named in the GROUP BY clause. You can use this feature to get better performance by avoiding unnecessary column sorting and grouping. However, this is useful primarily when all values in each nonaggregated column not named in the GROUP BY are the same for each group. The server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate. Furthermore, the selection of values from each group cannot be influenced by adding an ORDER BY clause. Result set sorting occurs after values have been chosen, and ORDER BY does not affect which values within each group the server chooses.
MySQL >= 5.7.5 allows this syntax but checks for functional dependence:
MySQL 5.7.5 and up implements detection of functional dependence. If the ONLY_FULL_GROUP_BY SQL mode is enabled (which it is by default), MySQL rejects queries for which the select list, HAVING condition, or ORDER BY list refer to nonaggregated columns that are neither named in the GROUP BY clause nor are functionally dependent on them.

speed up most recent query

I am trying to get the 3 successful (success =1) recent records and then see their average response time.
I have manipulated the results so that the average response is always 2ms.
I have 20,000 records in this table right now, but I plan on having 1-2 million. It takes 40 seconds with just 20,000 records, so I need to optimize this query.
Here is the fiddle: http://sqlfiddle.com/#!9/dc91eb/1/0
The fiddle contains my indices too, so I am open to adding more indices if needed.
SELECT proxy,
Avg(a.responsems) AS avgResponseMs,
COUNT(*) as Count
FROM proxylog a
WHERE
a.success = 1
AND ( (SELECT Count(0)
FROM proxylog b
WHERE ( ( b.success = a.success )
AND ( b.proxy = a.proxy )
AND ( b.datetime >= a.datetime ) )) <= 3 )
GROUP BY proxy
ORDER BY avgResponseMs
Here is the result of EXPLAIN
+----+--------------------+-------+-------+----------------+-------+---------+---------------------+-------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------+-------+----------------+-------+---------+---------------------+-------+----------------------------------------------+
| 1 | PRIMARY | a | index | NULL | proxy | 61 | NULL | 19110 | Using where; Using temporary; Using filesort |
+----+--------------------+-------+-------+----------------+-------+---------+---------------------+-------+----------------------------------------------+
| 2 | DEPENDENT SUBQUERY | b | ref | proxy,datetime | proxy | 52 | wwwim_iroom.a.proxy | 24 | Using where; Using index |
+----+--------------------+-------+-------+----------------+-------+---------+---------------------+-------+----------------------------------------------+
Before you suggest window functions: I am using MariaDB 10.1.21, which is roughly MySQL 5.6 AFAIK.
An index on (success, proxy, datetime, responsems) should help. success, proxy and datetime are the columns shared between both queries. datetime should come after the other two, because it is used to filter a range whereas the other two filter on a point. responsems comes last as this is the column the calculation is done on. That way the needed values can be taken directly from the index.
And please edit the question to include the DDL and DML in the question itself. The fiddle might be down some day, which would make the question useless for future readers.
I was able to mimic row_number and follow @Gordon Linoff's answer
SELECT pl.proxy, Avg(pl.responsems) AS avgResponseMs, COUNT(*) as Count
FROM (
SELECT
@row_number:=CASE
WHEN @g = proxy
THEN @row_number + 1
ELSE 1
END AS RN,
@g:=proxy g,
pl.*
FROM proxyLog pl,
(SELECT @g:=0,@row_number:=0) as t
WHERE pl.success = 1
ORDER BY proxy,datetime DESC
) pl
WHERE RN <= 3
GROUP BY proxy
ORDER BY avgResponseMs
From your comment back to my question, I think I know what your problem is.
If you have a proxy that has 900 requests, your first is still counting 900 (at or greater). Second is counting 899, Third, 898 and so on. That is what is killing your performance. Now add that to having millions of records will choke the crud out of your query.
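The arithmetic behind that can be sketched in a couple of lines. This assumes the inner COUNT examines exactly the rows at or after each outer row's datetime for the same proxy; the helper name correlated_row_checks is hypothetical.

```python
def correlated_row_checks(rows_per_proxy: int) -> int:
    """Rows examined by the inner COUNT for one proxy: n + (n - 1) + ... + 1."""
    return rows_per_proxy * (rows_per_proxy + 1) // 2


# One proxy with 900 successful requests: 900 + 899 + ... + 1 row checks.
print(correlated_row_checks(900))
```

The work grows roughly with the square of the rows per proxy, which is why the query degrades so quickly as the table grows.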
What you may want to do is have a max date applied to the first one you are querying against where it makes reasonable sense. If you have proxy requests such as times are (and all are success values)
8:00:00
8:00:18
8:00:57
9:02:12
9:15:27
Do you really care about the success time between 8:00:57 and 9:02 and 9:15? If a computer is getting pounded with activity in one hour vs light activity in another, is that really a fair assessment of success times?
What you MAY want is to have some cutoff time (at your discretion), such as within 3 minutes. What if someone does not even resume work going through a proxy for some time? Is that really it? Again, your discretion.
AND ( a.datetime <= b.datetime and b.datetime < date_add( a.datetime, interval 5 minute )) )) <= 3 )
And the <= 3 is not giving you what I THINK you expect. Again, your innermost COUNT(*) is counting all records >= a.datetime, so it would not be until you were at the end of a given batch of proxy times that you would get these counts.
So are you looking for the HISTORICAL average times, or only the most recent 3 time cycles for a given proxy? What you are requesting and what you are querying may be two completely different things.
You may want to edit your original post to clarify. I will stop here until I hear back, and can then possibly offer additional assistance.
I would advise you to try writing the query using window functions:
SELECT pl.proxy, Avg(pl.responsems) AS avgResponseMs, COUNT(*) as Count
FROM (SELECT pl.*,
ROW_NUMBER() OVER (PARTITION BY pl.proxy ORDER BY datetime DESC) as seqnum
FROM proxylog pl
WHERE pl.success = 1
) pl
WHERE seqnum <= 3
GROUP BY proxy
ORDER BY avgResponseMs;
For this, you want an index on proxylog(success, proxy, datetime, responsems).
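A small runnable sketch of the window-function approach, using Python's built-in sqlite3 as a stand-in (an assumption: it needs SQLite 3.25 or newer for ROW_NUMBER(), and it is not the asker's MariaDB 10.1). The sample rows and the index name idx_recent are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE proxylog (
        proxy TEXT, datetime TEXT, success INTEGER, responsems INTEGER
    )
""")
# The index suggested above: equality columns first, then the range/sort
# column, then the column being averaged.
conn.execute(
    "CREATE INDEX idx_recent ON proxylog (success, proxy, datetime, responsems)"
)
conn.executemany("INSERT INTO proxylog VALUES (?, ?, ?, ?)", [
    ("a", "2017-01-01 08:00:00", 1, 100),  # older than the latest three: ignored
    ("a", "2017-01-01 08:01:00", 1, 2),
    ("a", "2017-01-01 08:02:00", 0, 999),  # failure: filtered out
    ("a", "2017-01-01 08:03:00", 1, 4),
    ("a", "2017-01-01 08:04:00", 1, 6),
    ("b", "2017-01-01 09:00:00", 1, 10),
    ("b", "2017-01-01 09:01:00", 1, 20),
])

# Number the successful rows per proxy, newest first, and keep the top three.
result = conn.execute("""
    SELECT pl.proxy, AVG(pl.responsems) AS avgResponseMs, COUNT(*) AS cnt
    FROM (SELECT p.*,
                 ROW_NUMBER() OVER (PARTITION BY p.proxy
                                    ORDER BY p.datetime DESC) AS seqnum
          FROM proxylog p
          WHERE p.success = 1) pl
    WHERE pl.seqnum <= 3
    GROUP BY pl.proxy
    ORDER BY avgResponseMs
""").fetchall()
print(result)
```

Proxy a averages its three newest successes (6, 4 and 2 ms), while proxy b has only two successful rows and averages those.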
In older versions, I would replace your version of the subquery with:
SELECT pl.proxy, Avg(pl.responsems) AS avgResponseMs, COUNT(*) as Count
FROM proxylog pl
WHERE pl.success = 1 AND
pl.datetime >= ANY (SELECT pl2.datetime
FROM proxylog pl2
WHERE pl2.success = pl.success AND
pl2.proxy = pl.proxy
ORDER BY pl2.datetime DESC
LIMIT 1 OFFSET 2
)
GROUP BY proxy
ORDER BY avgResponseMs;
The index you want for this is the same as above.

MySQL COUNT(DISTINCT) giving wrong values with GROUP BY

I have a table that contains custom user analytics data. I was able to pull the number of unique users with a query:
SELECT COUNT(DISTINCT(user_id)) AS 'unique_users'
FROM `events`
WHERE client_id = 123
And this will return 16728
This table also has a column of type DATETIME that I would like to group the counts by. However, if I add a GROUP BY to the end of it, everything groups properly it seems except the totals don't match. My new query is this:
SELECT COUNT(DISTINCT(user_id)) AS 'unique_users', DATE(server_stamp) AS 'date'
FROM `events`
WHERE client_id = 123
GROUP BY DATE(server_stamp)
Now I get the following values:
|-----------------------------|
| unique_users | date |
|---------------|-------------|
| 2650 | 2019-08-26 |
| 3486 | 2019-08-27 |
| 3475 | 2019-08-28 |
| 3631 | 2019-08-29 |
| 3492 | 2019-08-30 |
|-----------------------------|
Totaling 16734. I tried using a subquery to get the distinct users and then count and group in the main query, but no luck there. Any help with this would be greatly appreciated. Let me know if there is further information that would help with the diagnosis.
A user who is connected with events on multiple days (e.g. a session that starts before midnight and ends afterwards) will be counted once for each of those days in the new query. This is because the first query performs the DISTINCT over all rows at once, while the second only removes duplicates inside each group. Identical values in different groups stay untouched.
So if you combine a DISTINCT in the select clause with a GROUP BY, the GROUP BY is applied before the DISTINCT. Thus, without further restrictions, you cannot assume that the COUNT(DISTINCT user_id) of the first query equals the sum of the COUNT(DISTINCT user_id) values over all groups.
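A minimal sketch of the discrepancy, using Python's built-in sqlite3 as a stand-in for MySQL (an assumption for runnability). The four events are invented; user 1 has a session that crosses midnight.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, server_stamp TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [
    (1, "2019-08-26 23:58:00"),
    (1, "2019-08-27 00:03:00"),  # same user, session crosses midnight
    (2, "2019-08-26 10:00:00"),
    (3, "2019-08-27 12:00:00"),
])

# DISTINCT over the whole table: each user is counted once.
overall = conn.execute(
    "SELECT COUNT(DISTINCT user_id) FROM events"
).fetchone()[0]

# DISTINCT inside each day's group: user 1 is counted on both days.
per_day = conn.execute("""
    SELECT DATE(server_stamp) AS day, COUNT(DISTINCT user_id)
    FROM events
    GROUP BY DATE(server_stamp)
    ORDER BY day
""").fetchall()
summed = sum(n for _, n in per_day)

print(overall, per_day, summed)
```

The per-day counts add up to one more than the overall count, because the midnight-crossing user appears in two groups.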
Xandor is absolutely correct. If a user logged on on 2 different days, there is no way your 2nd query can remove the duplicate. If you need the data grouped by date, you can try the query below:
SELECT COUNT(user_id) AS 'unique_users', MIN_DATE AS 'date'
FROM (SELECT user_id, MIN(DATE(server_stamp)) MIN_DATE -- Might be MAX
FROM `events`
WHERE client_id = 123
GROUP BY user_id) X
GROUP BY MIN_DATE;
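The idea of attributing each user to the first day they appear can be checked with a short sketch. It uses Python's built-in sqlite3 in place of MySQL (an assumption for runnability), and the sample events are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, server_stamp TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [
    (1, "2019-08-26 23:58:00"),
    (1, "2019-08-27 00:03:00"),  # user 1 is active on two days
    (2, "2019-08-26 10:00:00"),
    (3, "2019-08-27 12:00:00"),
])

# Inner query: one row per user with the first day they were seen.
# Outer query: group those rows by that day, so no user is counted twice.
first_seen = conn.execute("""
    SELECT x.min_date, COUNT(x.user_id)
    FROM (SELECT user_id, MIN(DATE(server_stamp)) AS min_date
          FROM events
          GROUP BY user_id) x
    GROUP BY x.min_date
    ORDER BY x.min_date
""").fetchall()
total = sum(n for _, n in first_seen)

print(first_seen, total)
```

Because every user lands in exactly one group, the daily counts now add up to the overall number of distinct users.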

table or column not found in mysql join with selects and cases

So I have this bit of mysql that I'm trying to work out. My goal is to insert the count of a grouping into the primary records to tell me how many of each status is within the related table for the record, so the result might look like this:
| id | name | count1 | count2 |
------------------------------------
| 1 | primary 1 | 5 | 3 |
| 1 | primary 2 | 2 | 7 |
select * from primaryTable
left join (
select
case
when relationTable.relation_status_id = 1
then count(*)
END as count1,
case
when relationTable.relation_status_id = 2
then count(*)
END as count2
) relationTable
on relationTable.primary_id = primaryTable.id
I tried using a subquery to do it, which worked, but requires a select per count, which I'm trying to avoid.
Adding a group by to the subquery resulted in an error that more than one row was being returned.
In the subquery, rather than aggregate COUNT()s inside CASE, you may more easily use SUM() to add up the result of a boolean comparison (0 or 1) to return a result resembling a count.
SELECT
primaryTable.*,
count1,
count2
FROM
primaryTable
JOIN (
SELECT
primary_id,
-- Sum the results of a boolean comparison
SUM(relation_status_id = 1) AS count1,
SUM(relation_status_id = 2) AS count2
FROM relationTable
-- Group in the subquery
GROUP BY primary_id
-- Join the subquery to the main table by primary_id
) counts ON primaryTable.id = counts.primary_id
Note that because MySQL treats the booleans the same as 0 or 1, the comparison relation_status_id = 1 returns 1 or 0. The syntax above isn't supported in every RDBMS. To be more portable, you would need to use a CASE inside SUM() to explicitly return an integer 1 or 0.
SUM(CASE WHEN relation_status_id = 1 THEN 1 ELSE 0 END) AS count1,
SUM(CASE WHEN relation_status_id = 2 THEN 1 ELSE 0 END) AS count2
Your original attempt has some syntax problems. Chiefly, it has no FROM clause, which is causing MySQL to think it should be treated as a scalar value and then complain that it returns more than one row.
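A runnable sketch of the conditional-aggregation pattern, with Python's built-in sqlite3 standing in for MySQL (an assumption for runnability). The table contents are invented, and the join uses primaryTable.id as in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE primaryTable (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE relationTable (primary_id INTEGER, relation_status_id INTEGER);
    INSERT INTO primaryTable VALUES (1, 'primary 1'), (2, 'primary 2');
    INSERT INTO relationTable VALUES
        (1, 1), (1, 1), (1, 2),
        (2, 2), (2, 2), (2, 3);
""")

# One grouped subquery produces every status count in a single pass;
# the portable CASE form is used instead of summing a boolean.
rows = conn.execute("""
    SELECT p.id, p.name, c.count1, c.count2
    FROM primaryTable p
    LEFT JOIN (
        SELECT primary_id,
               SUM(CASE WHEN relation_status_id = 1 THEN 1 ELSE 0 END) AS count1,
               SUM(CASE WHEN relation_status_id = 2 THEN 1 ELSE 0 END) AS count2
        FROM relationTable
        GROUP BY primary_id
    ) c ON c.primary_id = p.id
    ORDER BY p.id
""").fetchall()
print(rows)
```

A LEFT JOIN is used here so that primary rows with no related records would still be listed, with NULL counts.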

MySQL: Transfer Data Based on a Column Without Also Transferring That Column

My table stores revision data for my CMS entries. Each entry has an ID and a revision date, and there are multiple revisions:
Table: old_revisions
+----------+---------------+-----------------------------------------+
| entry_id | revision_date | entry_data |
+----------+---------------+-----------------------------------------+
| 1 | 1302150011 | I like pie. |
| 1 | 1302148411 | I like pie and cookies. |
| 1 | 1302149885 | I like pie and cookies and cake. |
| 2 | 1288917372 | Kittens are cute. |
| 2 | 1288918782 | Kittens are cute but puppies are cuter. |
| 3 | 1288056095 | Han shot first. |
+----------+---------------+-----------------------------------------+
I want to transfer some of this data to another table:
Table: new_revisions
+--------------+----------------+
| new_entry_id | new_entry_data |
+--------------+----------------+
| | |
+--------------+----------------+
I want to transfer entry_id and entry_data to new_entry_id and new_entry_data. But I only want to transfer the most recent version of each entry.
I got as far as this query:
INSERT INTO new_revisions (
new_entry_id,
new_entry_data
)
SELECT
entry_id,
entry_data,
MAX(revision_date)
FROM old_revisions
GROUP BY entry_id
But I think the problem is that I'm trying to insert 3 columns of data into 2 columns.
How do I transfer the data based on the revision date without transferring the revision date as well?
You can use the following query:
insert into new_revisions (new_entry_id, new_entry_data)
select o1.entry_id, o1.entry_data
from old_revisions o1
inner join
(
select max(revision_date) maxDate, entry_id
from old_revisions
group by entry_id
) o2
on o1.entry_id = o2.entry_id
and o1.revision_date = o2.maxDate
See SQL Fiddle with Demo. This query gets the max(revision_date) for each entry_id and then joins back to your table on both the entry_id and the max date to get the rows to be inserted.
Please note that the subquery returns only the entry_id and the date. This is because we want to apply the GROUP BY to every item in the select list that is not inside an aggregate function. MySQL uses an extension to the GROUP BY clause that allows columns in the select list to be left out of both the GROUP BY and any aggregate, but this could cause unexpected results. Including only the columns needed by the aggregate and the GROUP BY ensures that the result is the value you want. (See MySQL Extensions to GROUP BY.)
From the MySQL Docs:
MySQL extends the use of GROUP BY so that the select list can refer to nonaggregated columns not named in the GROUP BY clause. ... You can use this feature to get better performance by avoiding unnecessary column sorting and grouping. However, this is useful primarily when all values in each nonaggregated column not named in the GROUP BY are the same for each group. The server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate. Furthermore, the selection of values from each group cannot be influenced by adding an ORDER BY clause. Sorting of the result set occurs after values have been chosen, and ORDER BY does not affect which values the server chooses.
If you want to insert only the latest revision of each entry, you first need to find it:
select entry_id, max(revision_date) as maxDate
from old_revisions
group by entry_id;
Then use this as a subquery to filter the data you need:
insert into new_revisions (new_entry_id, new_entry_data)
select entry_id, entry_data
from old_revisions as o
inner join (
select entry_id, max(revision_date) as maxDate
from old_revisions
group by entry_id
) as a on o.entry_id = a.entry_id and o.revision_date = a.maxDate
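Both answers use the same join-back pattern, which can be checked against the sample data from the question. This sketch runs it through Python's built-in sqlite3 rather than MySQL (an assumption for runnability); the column types are guessed from the sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE old_revisions (
        entry_id INTEGER, revision_date INTEGER, entry_data TEXT
    );
    CREATE TABLE new_revisions (new_entry_id INTEGER, new_entry_data TEXT);
""")
conn.executemany("INSERT INTO old_revisions VALUES (?, ?, ?)", [
    (1, 1302150011, "I like pie."),
    (1, 1302148411, "I like pie and cookies."),
    (1, 1302149885, "I like pie and cookies and cake."),
    (2, 1288917372, "Kittens are cute."),
    (2, 1288918782, "Kittens are cute but puppies are cuter."),
    (3, 1288056095, "Han shot first."),
])

# Find the newest revision_date per entry, then join back to fetch the
# matching row. The date is used for filtering only and is never inserted.
conn.execute("""
    INSERT INTO new_revisions (new_entry_id, new_entry_data)
    SELECT o1.entry_id, o1.entry_data
    FROM old_revisions o1
    INNER JOIN (SELECT entry_id, MAX(revision_date) AS maxDate
                FROM old_revisions
                GROUP BY entry_id) o2
            ON o1.entry_id = o2.entry_id
           AND o1.revision_date = o2.maxDate
""")

transferred = conn.execute("""
    SELECT new_entry_id, new_entry_data
    FROM new_revisions
    ORDER BY new_entry_id
""").fetchall()
print(transferred)
```

Note that if two revisions of one entry shared the same maximum revision_date, both would be inserted; the sample data has no such ties.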

Why is this MySQL query slow?

I have the following query, all relevant columns are indexed correctly. MySQL version 5.0.8. The query takes forever:
SELECT COUNT(*) FROM `members` `t` WHERE t.member_type NOT IN (1,2)
AND ( SELECT end_date FROM subscriptions s
WHERE s.sub_auth_id = t.member_auth_id AND s.sub_status = 'Completed'
AND s.sub_pkg_id > 0 ORDER BY s.id DESC LIMIT 1 ) < curdate( )
EXPLAIN output:
----+--------------------+-------+-------+-----------------------+---------+---------+------+------+-------------
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
----+--------------------+-------+-------+-----------------------+---------+---------+------+------+-------------
1 | PRIMARY | t | ALL | membership_type | NULL | NULL | NULL | 9610 | Using where
----+--------------------+-------+-------+-----------------------+---------+---------+------+------+-------------
2 | DEPENDENT SUBQUERY | s | index | subscription_auth_id, | PRIMARY | 4 | NULL | 1 | Using where
| | | | subscription_pkg_id, | | | | |
| | | | subscription_status | | | | |
----+--------------------+-------+-------+-----------------------+---------+---------+------+------+-------------
Why?
Your subselect refers to values in the parent query. This is known as a correlated (dependent) subquery, and such a query has to be executed once for every row in the parent query, which often leads to poor performance. It is often faster to rewrite the query as a JOIN, for example like this
(Note: without a sample schema to test with, it is impossible to say in advance if this will be faster and still correct, you might need to adjust it a little):
SELECT COUNT(*) FROM members t
LEFT JOIN (
SELECT sub_auth_id AS member_id, MAX(id) AS sid FROM subscriptions
WHERE sub_status = 'Completed'
AND sub_pkg_id > 0
GROUP BY sub_auth_id
) sub ON sub.member_id = t.member_auth_id
LEFT JOIN (
SELECT id AS subid, end_date FROM subscriptions
WHERE sub_status = 'Completed'
AND sub_pkg_id > 0
) sdate ON sub.sid = sdate.subid
WHERE t.member_type NOT IN (1,2)
AND sdate.end_date < curdate( )
The logic here is:
For each member, find his latest subscription.
For each latest subscription, find its end date.
Join these member-latest_sub_date pairs to the members list.
Filter the list.
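The steps above can be sketched and compared with the original correlated form. This uses Python's built-in sqlite3 in place of MySQL (an assumption for runnability), invented sample rows, and a fixed TODAY value standing in for CURDATE() so the result is repeatable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE members (member_auth_id INTEGER, member_type INTEGER);
    CREATE TABLE subscriptions (
        id INTEGER PRIMARY KEY, sub_auth_id INTEGER,
        sub_status TEXT, sub_pkg_id INTEGER, end_date TEXT
    );
    INSERT INTO members VALUES (10, 3), (11, 3), (12, 1), (13, 3);
    INSERT INTO subscriptions VALUES
        (1, 10, 'Completed', 5, '2010-01-01'),
        (2, 10, 'Completed', 5, '2030-01-01'),
        (3, 11, 'Completed', 5, '2010-06-01'),
        (4, 12, 'Completed', 5, '2010-06-01'),
        (5, 11, 'Pending',   5, '2030-01-01');
""")
TODAY = "2020-01-01"  # hypothetical fixed date replacing CURDATE()

# Original shape: the subquery is re-evaluated for every member row.
correlated = conn.execute("""
    SELECT COUNT(*) FROM members t
    WHERE t.member_type NOT IN (1, 2)
      AND (SELECT s.end_date FROM subscriptions s
           WHERE s.sub_auth_id = t.member_auth_id
             AND s.sub_status = 'Completed' AND s.sub_pkg_id > 0
           ORDER BY s.id DESC LIMIT 1) < ?
""", (TODAY,)).fetchone()[0]

# Join shape: latest subscription id per member is computed once, then
# joined back to subscriptions to read that subscription's end date.
joined = conn.execute("""
    SELECT COUNT(*) FROM members t
    JOIN (SELECT sub_auth_id, MAX(id) AS sid FROM subscriptions
          WHERE sub_status = 'Completed' AND sub_pkg_id > 0
          GROUP BY sub_auth_id) latest
      ON latest.sub_auth_id = t.member_auth_id
    JOIN subscriptions sdate ON sdate.id = latest.sid
    WHERE t.member_type NOT IN (1, 2) AND sdate.end_date < ?
""", (TODAY,)).fetchone()[0]

print(correlated, joined)
```

Both forms count only member 11 here: member 10's latest completed subscription is still running, member 12 is excluded by type, and member 13 has no subscriptions.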
Your query is slow because as written you are considering 9,610 rows and therefore performing 9,610 SELECT subqueries in your WHERE clause. You really should rewrite your query to JOIN the members and subscriptions tables first, to which your WHERE conditions could still apply.
EDIT: Try this.
SELECT COUNT(*)
FROM `members` `t`
JOIN subscriptions s ON (s.sub_auth_id = t.member_auth_id)
WHERE t.member_type NOT IN (1,2)
AND s.sub_status = 'Completed'
AND s.sub_pkg_id > 0
AND end_date < curdate()
ORDER BY s.id DESC LIMIT 1
Caveat: I'm not a MySQL expert (though I'm pretty good in a different SQL flavour, VFP), but I believe you will save some time if:
You count just one field, let's say memberid, instead of *.
Your comparison NOT IN (1,2) is replaced with > 2 (provided that is valid).
The ORDER BY in your subselect is unnecessary, I think. You're trying to get the last completed subscription?
The < curdate() should be inside your subselect's WHERE.
(SELECT end_date FROM subscriptions s
WHERE s.end_date < curdate() and s.sub_auth_id = t.member_auth_id AND
s.sub_status = 'Completed' AND s.sub_pkg_id > 0 ORDER BY s.id DESC LIMIT 1 )
Tune your subselect so as to trim down the set as quickly as possible. The first conditional should be the one least likely to occur.
I ended up doing it like this:
select count(*) from members t
JOIN subscriptions s ON s.sub_auth_id = t.member_auth_id
WHERE t.membership_type > 2 AND s.sub_status = 'Completed' AND s.sub_pkg_id > 0
AND s.sub_end_date < curdate( )
AND s.id = (SELECT MAX(ss.id) FROM subscriptions ss WHERE ss.sub_auth_id = t.member_auth_id)
I believe that the problem is due to a bug that won't be fixed until MySQL 6.