NULL value count in group by - mysql

For simplicity, I will use a small table (the real table is bigger) to demonstrate the issue:
I have the following table test:
id | nbr
----+-----
1 | 0
2 |
3 |
4 | 1
5 | 1
(5 rows)
id and nbr are both numeric values
The following query
select nbr, count(nbr) from test group by nbr;
outputs:
nbr | count
-----+-------
| 0
1 | 2
0 | 1
(3 rows)
whereas the query:
select nbr, count(*) from test group by nbr;
outputs:
nbr | count
-----+------
| 2
1 | 2
0 | 1
(3 rows)
I find it hard to explain the difference between count(nbr) and count(*) regarding null values.
Can someone explain this to me like I'm five? Thanks.

It's pretty simple:
count(<expression>) counts the number of values. Like most aggregate functions, it removes null values before doing the actual aggregation.
count(*) is a special case that counts the number of rows (regardless of any null).
count (no matter if * or <expression>) never returns null (unlike most other aggregate functions). In case no rows are aggregated, the result is 0.
Now, you have done a group by on a nullable column. group by puts null values into the same group. That means the group for nbr null has two rows. If you now apply count(nbr), the null values are removed before aggregation, giving you 0 as the result.
If you did count(id) instead, there would be no null value to remove, giving you 2.
This is standard SQL behavior and honored by pretty much every database.
One of the common use-cases is to emulate the filter clause in databases that don't support it natively: http://modern-sql.com/feature/filter#conforming-alternatives
The exceptions (aggregate functions that don't remove null prior to aggregation) are functions like json_arrayagg, json_objectagg, array_agg and the like.
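The behavior described above can be reproduced with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made so the example runs anywhere; COUNT handles NULL the same way in both), and it loads the five rows of the test table from the question.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here (assumption, see lead-in).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id INTEGER, nbr INTEGER)")
conn.executemany(
    "INSERT INTO test (id, nbr) VALUES (?, ?)",
    [(1, 0), (2, None), (3, None), (4, 1), (5, 1)],
)

# count(nbr): NULLs are removed before counting, so the NULL group yields 0.
count_nbr = conn.execute(
    "SELECT nbr, count(nbr) FROM test GROUP BY nbr ORDER BY nbr"
).fetchall()

# count(*): counts rows, so the NULL group yields 2.
count_star = conn.execute(
    "SELECT nbr, count(*) FROM test GROUP BY nbr ORDER BY nbr"
).fetchall()

# count(id): id is never NULL, so nothing is removed and the NULL group yields 2.
count_id = conn.execute(
    "SELECT nbr, count(id) FROM test GROUP BY nbr ORDER BY nbr"
).fetchall()

print(count_nbr)
print(count_star)
print(count_id)
```

Python shows SQL NULL as None. The ORDER BY is only there to make the row order predictable; it does not affect the counts.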

MySQL explains it in the documentation of function COUNT():
COUNT(expr)
Returns a count of the number of non-NULL values of expr in the rows retrieved by a SELECT statement.
COUNT(*) is somewhat different in that it returns a count of the number of rows retrieved, whether or not they contain NULL values.
PostgreSQL also explains it in the documentation:
Most aggregate functions ignore null inputs, so that rows in which one or more of the expression(s) yield null are discarded. This can be assumed to be true, unless otherwise specified, for all built-in aggregates.
For example, count(*) yields the total number of input rows; count(f1) yields the number of input rows in which f1 is non-null, since count ignores nulls; and count(distinct f1) yields the number of distinct non-null values of f1.

count(*) counts the number of rows in each group defined by the group by columns, independently of whether the group by column contains null or non-null values.
count(nbr) counts the number of rows in each group where nbr is not null.

Count with null values:
SELECT nbr, COUNT(*) FROM mytables WHERE nbr IS NULL GROUP BY nbr
UNION
SELECT nbr, COUNT(nbr) FROM mytables WHERE nbr IS NOT NULL GROUP BY nbr

Related

Mysql - How do I avoid group by but still with concat and group concat I would need to combine multiple columns and rows results

I have something like in table
mysql> select uuid , short-uuid FROM sampleUUID WHERE identifier ="test123";
+--------------------------------------+-------------+
| uuid | short-uuid |
+--------------------------------------+-------------+
| 11d52ebd-1404-115d-903e-8033863ee848 | 8033863ee848 |
| 22b6f783-aeaf-1195-97ef-a6d8c47261b1 | 8033863ee848 |
| 33c51085-ccd8-1119-ac37-332510a16e1b | 332510a16e1b |
+--------------------------------------+-------------+
I need a result like the following (everything grouped into a single row and a single value, with uuids that share the same short-uuid grouped together):
| uuidDetails
+----------------------------------------------------------------------------------------------------------------+-------------+
| 11d52ebd-1404-115d-903e-8033863ee848,22b6f783-aeaf-1195-97ef-a6d8c47261b1|8033863ee848&&33c51085-ccd8-1119-ac37-332510a16e1b| 332510a16e1b |
+----------------------------------------------------------------------------------------------------------------+-------------+
(Basically, grouping uuid and short-uuid from multiple rows and columns into a single row.)
I know this can be achieved with select GROUP_CONCAT(uuid)FROM sampleUUID WHERE identifier ="test123" group by short-uuid;
but I don't want to use group by here because that gives multiple rows, and I need everything in one row.
I have tried the query below but failed to get the results in a single row:
select ANY_VALUE(CONCAT_WS( '||',CONCAT_WS('|',GROUP_CONCAT(uuid) SEPARATOR ','),short-uuid)) )as uuidDetails from sampleUUID
where identifier ="test123";
This produced the result below, where short-uuid is not appended properly. Only one short-uuid is appended; the first two uuids should be grouped with their shared short-uuid, and the third uuid with the other short-uuid.
| uuidDetails
+----------------------------------------------------------------------------------------------------------------+-------------+
| 11d52ebd-1404-115d-903e-8033863ee848,22b6f783-aeaf-1195-97ef-a6d8c47261b1,33c51085-ccd8-1119-ac37-332510a16e1b| 332510a16e1b |
+----------------------------------------------------------------------------------------------------------------+-------------+
which is not what I expected.
Any help here will be appreciated. Thank you.
Use nested queries.
SELECT GROUP_CONCAT(result ORDER BY result SEPARATOR '&&') AS uuidDetails
  FROM (
         SELECT CONCAT(GROUP_CONCAT(uuid ORDER BY uuid SEPARATOR ','), '|', `short-uuid`) AS result
           FROM sampleUUID
          WHERE identifier = 'test123'
          GROUP BY `short-uuid`
       ) AS x
NOTE: Even if there is no requirement for ordering of the UUID values, we can use ORDER BY inside the GROUP_CONCAT aggregates to make the result more deterministic, so that the query returns the same one of the possible results for the same data, e.g. aa,bb|1&&cc|3 rather than bb,aa|1&&cc|3 or cc|3&&aa,bb|1 or cc|3&&bb,aa|1.
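To make the two levels of aggregation easier to follow, here is a rough sketch that mirrors them in plain Python rather than running the MySQL query. The rows are the three sample rows from the question; the variable names are made up for illustration, and Python's string sorting may differ from a MySQL collation on other data.

```python
from collections import defaultdict

# (uuid, short-uuid) pairs copied from the question's sample output.
rows = [
    ("11d52ebd-1404-115d-903e-8033863ee848", "8033863ee848"),
    ("22b6f783-aeaf-1195-97ef-a6d8c47261b1", "8033863ee848"),
    ("33c51085-ccd8-1119-ac37-332510a16e1b", "332510a16e1b"),
]

# Inner query: GROUP BY short-uuid, then
# CONCAT(GROUP_CONCAT(uuid ORDER BY uuid SEPARATOR ','), '|', short-uuid).
groups = defaultdict(list)
for uuid, short in rows:
    groups[short].append(uuid)
inner = [",".join(sorted(uuids)) + "|" + short for short, uuids in groups.items()]

# Outer query: GROUP_CONCAT(result ORDER BY result SEPARATOR '&&').
uuid_details = "&&".join(sorted(inner))
print(uuid_details)
```

The inner step produces one string per short-uuid and the outer step joins those strings, which is why the SQL needs a derived table: one aggregate cannot be nested directly inside another in the same SELECT.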

MySQL how to find percentage field population base on states in column

I did some research but nothing seemed to help my individual case
I have a table (with 40 columns and million rows)
FirstName| LastName | State |...100+ other columns|
aaa | bbb | CA
ccc | ddd | NY
abc | null | CA
null | ggg | AL
...150 million rows
I need a very long query to return something like below
State | field | # of state population | # of rows in state | % state population
__________________________________________________________________________
AL | firstName | 0 | 1 | 0%
| lastName | 1 | | 100%
__________________________________________________________________________
CA | firstName | 2 | 2 | 100%
| lastname | 1 | | 50%
__________________________________________________________________________
NY | firstName | 1 | 1 | 100%
| lastname | 1 | | 100%
This is for internal use only, so format/order doesn't really matter as long as I can get the numbers needed.
Note that the % is calculated as (# of non-null values where state = AL, CA, etc. / # of total records where state = AL, CA, etc.)
and not (# of non-null values / # of all rows).
I'm new to SQL and I have no idea what to do.
Here's one possible query pattern. To get the percentage, we divide the third expression in the select list by the fourth expression. (To do that in a single SELECT, we have to repeat those expressions, separated by the division operator.)
SELECT c.state
     , f.field
     , CASE f.field
         WHEN 'firstname' THEN c.cnt_fn_nn
         WHEN 'lastname'  THEN c.cnt_ln_nn
         WHEN 'somecol'   THEN c.cnt_sc_nn
         ELSE NULL
       END AS `# populated`
     , c.cnt_tot AS `# rows`
     , CASE f.field
         WHEN 'firstname' THEN c.cnt_fn_nn
         WHEN 'lastname'  THEN c.cnt_ln_nn
         WHEN 'somecol'   THEN c.cnt_sc_nn
         ELSE NULL
       END
       / c.cnt_tot * 100.0 AS `pct`
  FROM ( SELECT 'firstname' AS `field`
         UNION ALL SELECT 'lastname'
         UNION ALL SELECT 'somecol'
       ) f
 CROSS
  JOIN ( SELECT a.state
              , SUM(1) AS `cnt_tot`
              , SUM(a.first_name IS NOT NULL) AS `cnt_fn_nn`
              , SUM(a.last_name IS NOT NULL) AS `cnt_ln_nn`
              , SUM(a.somecol IS NOT NULL) AS `cnt_sc_nn`
           FROM atable a
          GROUP BY a.state
       ) c
 ORDER
    BY c.state
     , f.field
NOTES:
The inline view (derived table) f returns the field values we want to display in the second column.
The inline view (derived table) c gets us the "counts" by state.
The GROUP BY clause collapses all of the rows that have the same value for state, so we get one row back for each distinct value of state.
The expression a.first_name IS NOT NULL is evaluated for each row, as a boolean, and returns either FALSE (0) or TRUE (1) for each row. We can use the SUM() aggregate function to total up the ones and zeros. That gets us a count of the number of rows that have a non-null value for first_name.
SUM(1) will get us the same thing as COUNT(*) would. That's the total number of rows for each state.
We do a CROSS JOIN operation on the two inline views to produce a Cartesian (cross) product. Every row from f is matched to every row from c. If we have 50 states, 50 rows from c, and three rows from f, that gets us a total of 50*3 rows.
The CASE expression (the simple form, CASE f.field WHEN ...) is a SQL way of saying...
if field = 'firstname' then return cnt_fn_nn
elsif field = 'lastname' then return cnt_ln_nn
...
We use this to "pick out" the count we want to return on each row: on the row where field is firstname, we return the count of non-null first names.
This can be extended: copy the lines with somecol, replace somecol with a valid column name in atable, and assign a unique column alias (AS foo) to each new expression in the SELECT list of the inline view.
What this query doesn't get us is any state values that don't appear in atable. To get zero counts, we would need another source for state.
And this query doesn't suppress repeated values for state, or for # rows in state. Those values are returned on each row, even though they are duplicated.
To get those suppressed would be icky, but we could do it. To get it on the "first line" for each state we need to know which field is going to be first. We could do a conditional test in the first and fourth expressions in the SELECT list of the outer query. Get the query working first, before you start mucking with it. (The suppression of repeated values would better be handled in the client that processes the result. But if the specification is to return that exact resultset, it can be done like this... I cringe...)
SELECT IF(f.field='firstname',c.state,'') AS `state`
, f.field
, CASE f.field
WHEN 'firstname' THEN c.cnt_fn_nn
WHEN 'lastname' THEN c.cnt_ln_nn
WHEN 'somecol' THEN c.cnt_sc_nn
ELSE NULL
END AS `# populated`
, IF(f.field='firstname',c.cnt_tot,NULL) AS `# rows`
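The main query pattern from this answer (a list of field names cross-joined with per-state counts) can be tried on a tiny data set. The sketch below is an approximation under a few assumptions: it uses Python's built-in sqlite3 instead of MySQL, the columns are named first_name and last_name as in the answer, the somecol branch is left out, and the percentage is written as 100.0 * count / total because SQLite's / performs integer division on integers, unlike MySQL's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE atable (first_name TEXT, last_name TEXT, state TEXT)")
# The four sample rows from the question.
conn.executemany(
    "INSERT INTO atable VALUES (?, ?, ?)",
    [("aaa", "bbb", "CA"), ("ccc", "ddd", "NY"), ("abc", None, "CA"), (None, "ggg", "AL")],
)

query = """
SELECT c.state, f.field,
       CASE f.field WHEN 'firstname' THEN c.cnt_fn_nn
                    WHEN 'lastname'  THEN c.cnt_ln_nn END AS populated,
       c.cnt_tot AS total,
       100.0 * CASE f.field WHEN 'firstname' THEN c.cnt_fn_nn
                            WHEN 'lastname'  THEN c.cnt_ln_nn END / c.cnt_tot AS pct
FROM (SELECT 'firstname' AS field UNION ALL SELECT 'lastname') f
CROSS JOIN (SELECT a.state,
                   COUNT(*) AS cnt_tot,
                   SUM(a.first_name IS NOT NULL) AS cnt_fn_nn,
                   SUM(a.last_name IS NOT NULL) AS cnt_ln_nn
            FROM atable a
            GROUP BY a.state) c
ORDER BY c.state, f.field
"""
report = conn.execute(query).fetchall()
for row in report:
    print(row)
```

Each tuple is (state, field, populated, total, pct). With three states and two fields the cross join returns six rows, matching the layout requested in the question.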

MySQL - What is the difference between SUM and COUNT?

In MySQL - What is the difference between using SUM or COUNT?
SELECT SUM(USER_NAME = 'JoeBlow')
SELECT COUNT(USER_NAME = 'JoeBlow')
To answer the OP's question more directly and literally, consider what would happen if you were totalling integers in your column instead of strings.
+----+------+
| id | vote |
+----+------+
| 1 | 1 |
| 2 | -1 |
| 3 | 1 |
| 4 | -1 |
| 5 | 1 |
+----+------+
COUNT = 5 votes
SUM = 1 vote
(-2 + 3 = 1)
Sum is doing the mathematical sum, whereas count simply counts any value as 1 regardless of what data type.
It is a big difference because the result is not the same.
The first query returns the number of times the condition is true, because true is 1 and false is 0.
The second query returns the complete record count (excluding any rows where USER_NAME itself is NULL, since the comparison then yields NULL), because count() does not care about the content inside it, as long as the content is NOT NULL. Both 1 (true) and 0 (false) are still values, so both get counted.
To get the correct return value from the second query, you would have to make the result of the condition NULL (instead of 0) so that it is not counted. Like this:
SELECT COUNT(case when USER_NAME = 'JoeBlow' then 'no matter what' else NULL end)
from your_table
Or simply remove the else part from the case statement which automatically makes the else part null.
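The three variants can be compared side by side. This is a small sketch using Python's built-in sqlite3 as a stand-in for MySQL (an assumption; both treat a comparison as 1, 0 or NULL here). The table name your_table comes from the snippet above, and the four sample rows, including one NULL user name, are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (USER_NAME TEXT)")
# Hypothetical sample data: two matches, one non-match, one NULL.
conn.executemany(
    "INSERT INTO your_table (USER_NAME) VALUES (?)",
    [("JoeBlow",), ("JaneDoe",), ("JoeBlow",), (None,)],
)

sum_cond, count_cond, count_case = conn.execute(
    """
    SELECT SUM(USER_NAME = 'JoeBlow'),                        -- adds up 1s and 0s
           COUNT(USER_NAME = 'JoeBlow'),                      -- counts every non-NULL result
           COUNT(CASE WHEN USER_NAME = 'JoeBlow' THEN 1 END)  -- non-matches become NULL
    FROM your_table
    """
).fetchone()

print(sum_cond, count_cond, count_case)
```

SUM and the CASE-based COUNT agree on the number of matching rows, while COUNT of the bare comparison counts every row whose comparison result is not NULL, which here is three of the four rows.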
I guess COUNT() returns the number of rows in a column whereas SUM() returns the sum for the column
select count(field) from table
is slower than
select sum(1) from table
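Consider using the second option, but note that the two are equivalent only when field contains no NULLs: count(field) skips NULL values, while sum(1) counts every row.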

GROUP BY error - SQL

This is the SQL query I have written. It works until right before the group by statement but once I add that part, I get this error:
'reading_datetime' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get
My query:
Select A.bill_account, hour(A.reading_datetime), A.reading_value
from (
Select cast(cast(bill_account as double) as int)bill_account, reading_datetime, cast(reading_value as double)reading_value, `interval`
from amerendataorc
WHERE cast(cast(`interval` as double)as int) = 3600 AND reading_datetime between '2015-03-15 00:00:00' and '2016-03-14 23:59:59'
) A
GROUP BY A.bill_account
HAVING (COUNT(A.bill_account) >= 8000) and (COUNT(A.bill_account) < 9500)
Not sure exactly how the group by is messing up the query.
Take the sum of the reading hour and the reading value:
Select A.bill_account, sum(hour(A.reading_datetime)), sum(A.reading_value)
from (
Select cast(cast(bill_account as double) as int)bill_account, reading_datetime, cast(reading_value as double)reading_value, `interval`
from amerendataorc
WHERE cast(cast(`interval` as double)as int) = 3600 AND reading_datetime between '2015-03-15 00:00:00' and '2016-03-14 23:59:59'
) A
GROUP BY A.bill_account
HAVING (COUNT(A.bill_account) >= 8000) and (COUNT(A.bill_account) < 9500)
---- explanation ------------
mysql> SELECT * FROM tt where user="user1";
+----------+-------+
| duration | user |
+----------+-------+
| 00:06:00 | user1 |
| 00:02:00 | user1 |
+----------+-------+
2 rows in set (0.00 sec)
mysql> SELECT * FROM tt where user="user1" group by user;
+----------+-------+
| duration | user |
+----------+-------+
| 00:06:00 | user1 |
+----------+-------+
1 row in set (0.00 sec)
Once you add group by, the query returns only one summary row per value of the grouped column; in the example above it happens to return the first value.
Otherwise you can ask for aggregate values such as sum, max, etc.
SQL is trying to avoid an issue whereby you have multiple hour(A.reading_datetime) per A.Bill_Account. Grouping by Bill_account will give you a list of unique Bill_accounts. Then it has multiple hour(A.reading_datetime) per Bill_account and needs you to help it choose how to select one.
You need to group by each value that occurs or use aggregate functions on non-group by fields. If you group by reading_datetime and reading_value as well SQL will list all unique combinations of the three fields in the group by.
The error message suggests using first(); max(), min(), sum(), etc. are all aggregate functions that will help you get one value per Bill_account.
You will need to do this for reading_value as well.
Standard SQL doesn't permit queries for which the select list refers to nonaggregated columns that are not named in the GROUP BY clause.
Therefore you have to add those columns to the GROUP BY clause, or you have to aggregate the columns in the SELECT clause, in your case:
Select A.bill_account, sum(hour(A.reading_datetime)), sum(A.reading_value)
But you have to evaluate if it is adequate for your data to sum those columns in that way, and if it isn't, add the columns as GROUP BY criteria.
Any field that is not included in the Group By Clause will require an aggregate function like SUM, COUNT, MIN or MAX to be included in the Selected fields.
http://www.w3schools.com/sql/sql_groupby.asp
To correct the issue you will need to use the following group by clause
GROUP BY A.bill_account, A.reading_datetime, A.reading_value
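As a rough illustration of the "aggregate every non-grouped column" option, the sketch below groups a tiny readings table by account. Everything in it is an assumption made for the example: the table name readings, the three sample rows, the HAVING threshold of 2, and the use of Python's built-in sqlite3 rather than the asker's engine. Whether SUM, MIN or another aggregate is appropriate depends on what the report is meant to show.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (bill_account INTEGER, reading_datetime TEXT, reading_value REAL)"
)
# Hypothetical sample data: account 1 has two readings, account 2 has one.
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        (1, "2015-03-15 00:00:00", 1.5),
        (1, "2015-03-15 01:00:00", 2.5),
        (2, "2015-03-15 00:00:00", 4.0),
    ],
)

# Every selected column is either in GROUP BY or wrapped in an aggregate.
per_account = conn.execute(
    """
    SELECT bill_account,
           COUNT(*)              AS n_readings,
           SUM(reading_value)    AS total_value,
           MIN(reading_datetime) AS first_reading
    FROM readings
    GROUP BY bill_account
    HAVING COUNT(*) >= 2
    ORDER BY bill_account
    """
).fetchall()

print(per_account)
```

Account 2 is dropped by the HAVING clause because it has only one reading, which is the same kind of filter the original query applies with its 8000 and 9500 bounds.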

conditional CASE WHEN statement

I have a table with the below structure:
CustID | Number 1 | Number 2 | Number 3 | Number 4
1 | 072454584 | | 017726593 |
2 | |0125456852| | 0125785448
I'm trying to write a query that selects the first number that is available, so for customer ID 2 it would return only number 2, and if there were a record with only number 4 present it would ignore 1, 2 and 3. I've tried a case when statement but I can't seem to work out the logic.
In case you have NULL values in those columns then use COALESCE:
SELECT CUSTID, COALESCE(number1, number2, number3, number4)
You can use COALESCE which returns the first non-null value:
SELECT COALESCE([Number 1],[Number 2],[Number 3], [Number 4]) AS FirstNonNullNum
FROM dbo.Table1
WHERE CustID = @paramID
However, your model seems to be suboptimal. If you have columns Number 1 - Number N, you should normalize it and use a separate table instead of columns. That makes all queries simpler and far more efficient. It's also much more maintainable and less error-prone if you plan to add more columns.
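The COALESCE approach from the answers above can be checked with a short sketch. It uses Python's built-in sqlite3 as a stand-in (an assumption), a hypothetical table named customers with columns number1 to number4, and the two rows from the question stored with NULL for the blank cells. A third, invented row stores empty strings instead of NULLs to show that COALESCE alone does not skip them; wrapping each column in NULLIF(col, '') is one way to handle that case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "custid INTEGER, number1 TEXT, number2 TEXT, number3 TEXT, number4 TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?)",
    [
        (1, "072454584", None, "017726593", None),
        (2, None, "0125456852", None, "0125785448"),
        (3, "", "", "", "0999"),  # hypothetical row: empty strings, not NULLs
    ],
)

# COALESCE returns the first argument that is not NULL.
first_non_null = conn.execute(
    "SELECT custid, COALESCE(number1, number2, number3, number4) "
    "FROM customers ORDER BY custid"
).fetchall()

# NULLIF(col, '') turns empty strings into NULL so COALESCE skips them too.
first_non_empty = conn.execute(
    """
    SELECT custid,
           COALESCE(NULLIF(number1, ''), NULLIF(number2, ''),
                    NULLIF(number3, ''), NULLIF(number4, ''))
    FROM customers
    ORDER BY custid
    """
).fetchall()

print(first_non_null)
print(first_non_empty)
```

For customer 3 the plain COALESCE returns the empty string from number1, while the NULLIF variant falls through to number4.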