Using a sum of values as a condition (SQL query) - mysql

I have a table that looks roughly like this
Year Species Count
1979 A 0
1980 A 10
1981 A 4
1982 A 3
1979 B 0
1980 B 1
1981 B 2
1982 B 3
1979 C 9
1980 C 14
1981 C 2
1982 C 1
What I want is to return all Year, Species, Count rows for those species that have a total count (summed over all years) of 10 or more. For example, for a total count of 20 or more, I would want it to return just
1979 C 9
1980 C 14
1981 C 2
1982 C 1
I played around with HAVING but haven't really gotten anything useful (total SQL beginner).

In MySQL, you can do this using aggregation and a join:
select t.*
from mytable t join
(select species, sum(`count`) as total
from mytable
group by species
) s
on t.species = s.species
where s.total >= 10;

This is the easiest. You already have the counts. Group on species and filter the table on the results of the subquery. You can get the same functionality with an EXISTS or a join.
SELECT
[YEAR]
,SPECIES
,[COUNT]
FROM TABLE
WHERE SPECIES IN (
SELECT SPECIES
FROM TABLE
GROUP BY SPECIES
HAVING SUM([COUNT]) >= 20
)
Adding some additional explanation for BootstrapBill
GROUP BY "makes multiple sets", one for each unique value of the GROUP BY column. That allows the aggregate function SUM() to act on only one set of the GROUP BY values at a time. HAVING is sort of like a WHERE clause for the GROUP BY statement that allows you to apply a predicate. The only fields allowed to be returned by a GROUP BY are the grouped column itself and the results of any aggregate function(s), so you need to join back to or filter the original set to get the other columns you are targeting in the query.
And I apologize, I did not see where the OP stated this was for MySQL. The core concept is the same so I am leaving the answer. [] are MS SQL syntax for escaping the keywords COUNT and YEAR.

You'll want to use GROUP BY with the SUM() aggregate function and HAVING clause (similar to WHERE, but for groups instead of rows), combined with a self-join:
SELECT t1.`Year`, t1.`Species`, t1.`Count`
FROM mytable t1 INNER JOIN (
SELECT `Species`, SUM(`Count`)
FROM mytable
GROUP BY `Species`
HAVING SUM(`Count`) >= 20
) t2
ON t1.`Species` = t2.`Species`
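The GROUP BY / HAVING / join-back pattern can be tried end to end with a small script. This is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table name `sightings` and column name `cnt` are made up for the example (the data is the sample from the question).

```python
import sqlite3

# SQLite stands in for MySQL; table/column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sightings (year INTEGER, species TEXT, cnt INTEGER)")
rows = [
    (1979, "A", 0), (1980, "A", 10), (1981, "A", 4), (1982, "A", 3),
    (1979, "B", 0), (1980, "B", 1), (1981, "B", 2), (1982, "B", 3),
    (1979, "C", 9), (1980, "C", 14), (1981, "C", 2), (1982, "C", 1),
]
conn.executemany("INSERT INTO sightings VALUES (?, ?, ?)", rows)


def rows_for_species_with_total(conn, minimum):
    """Return every row whose species has a summed count >= minimum."""
    sql = """
        SELECT s.year, s.species, s.cnt
        FROM sightings AS s
        JOIN (
            -- one row per species that passes the total-count filter
            SELECT species
            FROM sightings
            GROUP BY species
            HAVING SUM(cnt) >= ?
        ) AS big ON big.species = s.species
        ORDER BY s.species, s.year
    """
    return conn.execute(sql, (minimum,)).fetchall()


result = rows_for_species_with_total(conn, 20)
print(result)
```

With a threshold of 20 only species C (total 26) survives; with 10, species A (total 17) is returned as well.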

Related

How do I COUNT rows of a GROUP BY query where a condition matches?

This is my persons table:
neighborhood birthyear
a 1958
a 1959
b 1970
c 1980
I'd like to get the COUNT of people in an age group within every neighborhood. For example, if I wanted to get everyone under the age of 18, I would get:
neighborhood count
a 0
b 0
c 0
If I wanted to get everyone over 50, I'd get
neighborhood count
a 2
b 0
c 0
I tried
SELECT neighborhood, COUNT(*)
FROM persons
WHERE YEAR(NOW()) - persons.birthyear < 18
GROUP BY neighborhood;
but this gives me 0 rows, when instead I want 3 rows with distinct neighborhoods and 0 count for each. How would I accomplish this?
You can use conditional aggregation:
SELECT neighborhood, SUM(YEAR(NOW()) - p.birthyear < 18) as under_18,
SUM(YEAR(NOW()) - p.birthyear BETWEEN 34 AND 42) as age_34_42
FROM persons p
GROUP BY neighborhood;
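To see why conditional aggregation keeps the zero-count neighborhoods, here is a small sketch. Assumptions: Python's sqlite3 stands in for MySQL, the reference year is passed in as a parameter (2015 is an arbitrary choice) instead of calling YEAR(NOW()), and a CASE expression is used in place of MySQL's SUM(condition) shorthand so the query is portable.

```python
import sqlite3

# SQLite stands in for MySQL; data is the persons table from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE persons (neighborhood TEXT, birthyear INTEGER)")
conn.executemany(
    "INSERT INTO persons VALUES (?, ?)",
    [("a", 1958), ("a", 1959), ("b", 1970), ("c", 1980)],
)


def count_by_age(conn, reference_year, min_age, max_age):
    """Count people per neighborhood whose age falls in [min_age, max_age].

    No WHERE clause is used, so every neighborhood keeps its group and
    non-matching rows simply contribute 0 to the sum.
    """
    sql = """
        SELECT neighborhood,
               SUM(CASE WHEN ? - birthyear BETWEEN ? AND ? THEN 1 ELSE 0 END) AS cnt
        FROM persons
        GROUP BY neighborhood
        ORDER BY neighborhood
    """
    return conn.execute(sql, (reference_year, min_age, max_age)).fetchall()


under_18 = count_by_age(conn, 2015, 0, 17)
over_50 = count_by_age(conn, 2015, 51, 200)
print(under_18, over_50)
```

The filter moves from WHERE (which removes rows before grouping) into the aggregate, which is why groups with no matches still show up with a count of 0.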
I think that if the count is 0, the row doesn't appear.
Your code seems correct to me. If you try it on the example with age 50, it should give you one row with the expected line (neighborhood: a, count: 2).
I would recommend using a sub query:
SELECT
count(*) AS `group-by-count-greater-than-ten`
FROM
(
SELECT
columnFoo,
count(*) cnt
FROM barTable
WHERE columnBaz = "barbaz"
GROUP BY columnFoo
)
AS subQuery
WHERE cnt > 10
In the above, the result set returned by the subquery is used by the main query like any other table.
The main query sees cnt as an ordinary column rather than a computed field, so it does not have to reference the count() function.
Inside the subquery, however, the alias cnt cannot be used in a WHERE clause (that throws an error); a filter on the aggregate there has to go in a HAVING clause, where referencing count(*) directly is the safe choice.
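A quick way to convince yourself that the outer query can filter on cnt like a normal column is the sketch below. It assumes Python's sqlite3 as a stand-in for MySQL, and the rows in barTable are invented purely for illustration.

```python
import sqlite3

# SQLite stands in for MySQL; the sample rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE barTable (columnFoo TEXT, columnBaz TEXT)")
sample = [("x", "barbaz")] * 3 + [("y", "barbaz")] + [("z", "other")] * 5
conn.executemany("INSERT INTO barTable VALUES (?, ?)", sample)


def groups_larger_than(conn, threshold):
    """Count how many columnFoo groups have more than `threshold` rows."""
    sql = """
        SELECT COUNT(*)
        FROM (
            SELECT columnFoo, COUNT(*) AS cnt
            FROM barTable
            WHERE columnBaz = 'barbaz'
            GROUP BY columnFoo
        ) AS subQuery
        WHERE cnt > ?   -- cnt is a plain column of the derived table here
    """
    return conn.execute(sql, (threshold,)).fetchone()[0]


print(groups_larger_than(conn, 2))
```

Only group x (3 rows) exceeds a threshold of 2; group z never enters the subquery because of the columnBaz filter.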
In your case using a subquery would look something like this.
SELECT
neighborhood,
age,
count(*) as cnt
FROM
(
SELECT
*,
(YEAR(NOW()) - birthyear) as age
FROM PERSONS
) as WithAge
WHERE age < 18
GROUP BY neighborhood, age

Finding missing data in a sequence in MySQL

Is there an efficient way to find missing data not just in one sequence, but many sequences?
This is probably unavoidably O(N**2), so efficient here is defined as relatively few queries using MySQL
Let's say I have a table of temporary employees and their starting and ending months.
employees | start_month | end_month
------------------------------------
Jane 2017-05 2017-07
Bob 2017-10 2017-12
And there is a related table of monthly payments to those employees
employee | paid_month
---------------------
Jane 2017-05
Jane 2017-07
Bob 2017-11
Bob 2017-12
Now, it's clear that we're missing a month for Jane (2017-06) and one for Bob too (2017-10).
Is there a way to somehow find the gaps in their payment record, without lots of trips back and forth?
In the case where there's just one sequence to check, some people generate a temporary table of valid values, and then LEFT JOIN to find the gaps. But here we have different sequences for each employee.
One possibility is that we could do an aggregate query to find the COUNT() of paid_months for each employee, and then check it versus the expected delta of months. Unfortunately the data here is a bit dirty so we actually have payment dates that could be before or after that employee start or end date. But we're verifying that the official sequence definitely has payments.
Form a Cartesian product of employees and months, then left join the actual data to that. The missing data is revealed wherever the Cartesian product has no matching payment.
You need a list of every month. This might come from a "calendar table" you already have, or it might be possible using a subquery (if every month is represented in the source data).
e.g.
select
m.paid_month, e.employee
from (select distinct paid_month from payments) m
cross join (select employee from employees) e
left join payments p on m.paid_month = p.paid_month and e.employee = p.employee
where p.employee is null
The subquery m can be substituted by the calendar table or some other technique for generating a series of months. e.g.
select
DATE_FORMAT(m1, '%Y-%m')
from (
select
'2017-01-01'+ INTERVAL m MONTH as m1
from (
select @rownum:=@rownum+1 as m
from (select 1 union select 2 union select 3 union select 4) t1
cross join (select 1 union select 2 union select 3 union select 4) t2
## cross join (select 1 union select 2 union select 3 union select 4) t3
## cross join (select 1 union select 2 union select 3 union select 4) t4
cross join(select @rownum:=-1) t0
) d1
) d2
where m1 < '2018-01-01'
order by m1
The subquery e could contain other logic (e.g. to determine which employees are still currently employed, or that are "temporary employees")
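As a rough sketch of that idea, the join between employees and months can be limited to each employee's own start/end range, so only months inside the official sequence are checked. Assumptions: Python's sqlite3 stands in for MySQL, the `calendar` table is a hypothetical helper filled from Python, the employee name column is called `employee` in both tables, and months are stored as 'YYYY-MM' strings (which compare correctly as text).

```python
import sqlite3

# SQLite stands in for MySQL; `calendar` is a hypothetical helper table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (employee TEXT, start_month TEXT, end_month TEXT);
    CREATE TABLE payments (employee TEXT, paid_month TEXT);
    CREATE TABLE calendar (month TEXT);
    INSERT INTO employees VALUES ('Jane', '2017-05', '2017-07');
    INSERT INTO employees VALUES ('Bob', '2017-10', '2017-12');
    INSERT INTO payments VALUES ('Jane', '2017-05');
    INSERT INTO payments VALUES ('Jane', '2017-07');
    INSERT INTO payments VALUES ('Bob', '2017-11');
    INSERT INTO payments VALUES ('Bob', '2017-12');
""")
conn.executemany(
    "INSERT INTO calendar VALUES (?)",
    [(f"2017-{m:02d}",) for m in range(1, 13)],
)

# Expected months per employee, minus the months that have a payment.
missing = conn.execute("""
    SELECT e.employee, c.month
    FROM employees AS e
    JOIN calendar AS c
      ON c.month BETWEEN e.start_month AND e.end_month
    LEFT JOIN payments AS p
      ON p.employee = e.employee
     AND p.paid_month = c.month
    WHERE p.employee IS NULL
    ORDER BY e.employee, c.month
""").fetchall()
print(missing)
```

This runs as a single query regardless of how many employees there are, and payments outside the start/end range are ignored because they never match an expected month.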
First we need to generate all the months between start_date and end_date for each employee, then do a left outer join with the payments table on employee and paid month, keeping only the months with no match (payment employee name is null). Note that this query uses Oracle syntax (CONNECT BY, ADD_MONTHS), not MySQL.
select e.employee, e.yearmonth as missing_paid_month from (
with t as (
select e.employee, to_date(e.start_date, 'YYYY-MM') as start_date, to_date(e.end_date, 'YYYY-MM') as end_date from employees e
)
select distinct t.employee,
to_char(add_months(trunc(start_date,'MM'),level - 1),'YYYY-MM') yearmonth
from t
connect by trunc(end_date,'mm') >= add_months(trunc(start_date,'mm'),level - 1)
order by t.employee, yearmonth
) e
left outer join payments p
on p.paid_month = e.yearmonth
and p.employee = e.employee
where p.employee is null
output
EMPLOYEE MISSING_PAID_MONTH
Bob 2017-10
Jane 2017-06
SQL Fiddle http://sqlfiddle.com/#!4/2b2857/35

Comparing two SQL queries

I've got a MySQL database of all NCAA basketball tournament results. I'm looking at the "haves" and "have nots" of college hoops, and looking for who drops in and out of the "haves" list by examining NCAA tournament bids over time.
I've got a query that counts the number of NCAA appearances by each team for two sets of years. I want to compare the results for the two sets - seeing who dropped out and who dropped in from one year to the next.
For example, which teams made 6 of 10 NCAA tournaments between 1985-94, which made 6 between 1986-95, and what are the differences in the two lists. Here's what I have:
Select t1.Team AS "1994 Teams",t2.Team AS "1995 Teams"
FROM
(SELECT Count(DISTINCT TABLE_NAME.`Year`) AS 'Totals', TABLE_NAME.Team, TABLE_NAME.Current_Conference
FROM TABLE_NAME
WHERE TABLE_NAME.`Year` BETWEEN 1985 AND 1994
GROUP BY TABLE_NAME.Team HAVING Totals >= 6
ORDER BY TABLE_NAME.Team) AS t1,
(SELECT Count(DISTINCT TABLE_NAME.`Year`) AS 'Totals', TABLE_NAME.Team, TABLE_NAME.Current_Conference
FROM TABLE_NAME
WHERE TABLE_NAME.`Year` BETWEEN 1986 AND 1995
GROUP BY TABLE_NAME.Team HAVING Totals >= 6
ORDER BY TABLE_NAME.Team) AS t2
WHERE t1.Team = t2.Team
This returns (in this case) 32 records - all the teams that were in 6 of 10 NCAA tournaments in both 1985-94 and 1986-95. I'm trying to find the teams that are in one set and not the other.
One way of doing this is using the subquery in the WHERE part:
SELECT t1.Team
FROM (
SELECT COUNT(DISTINCT TABLE_NAME.`Year`) AS 'Totals',
TABLE_NAME.Team,
TABLE_NAME.Current_Conference
FROM TABLE_NAME
WHERE TABLE_NAME.`Year` BETWEEN 1985 AND 1994
GROUP BY TABLE_NAME.Team
HAVING Totals >= 6
ORDER BY TABLE_NAME.Team
) AS t1
WHERE t1.team NOT IN (
SELECT TABLE_NAME.Team
FROM TABLE_NAME
WHERE TABLE_NAME.`Year` BETWEEN 1986 AND 1995
GROUP BY TABLE_NAME.Team
HAVING COUNT(DISTINCT TABLE_NAME.`Year`) >= 6 )
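The same NOT IN pattern can be run in both directions to get "dropped out" and "dropped in". The sketch below is only illustrative: Python's sqlite3 stands in for MySQL, the table name `bids` and the three teams are invented, and the helper name `qualifying_only_in` is hypothetical.

```python
import sqlite3

# SQLite stands in for MySQL; table name and team data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bids (team TEXT, year INTEGER)")
appearances = (
    [("Alpha", y) for y in range(1985, 1991)]    # 1985-1990
    + [("Beta", y) for y in range(1990, 1996)]   # 1990-1995
    + [("Gamma", y) for y in range(1985, 1996)]  # 1985-1995
)
conn.executemany("INSERT INTO bids VALUES (?, ?)", appearances)


def qualifying_only_in(conn, window_a, window_b, min_bids=6):
    """Teams with >= min_bids distinct years in window_a but not in window_b."""
    sql = """
        SELECT a.team
        FROM (
            SELECT team
            FROM bids
            WHERE year BETWEEN ? AND ?
            GROUP BY team
            HAVING COUNT(DISTINCT year) >= ?
        ) AS a
        WHERE a.team NOT IN (
            SELECT team
            FROM bids
            WHERE year BETWEEN ? AND ?
            GROUP BY team
            HAVING COUNT(DISTINCT year) >= ?
        )
        ORDER BY a.team
    """
    params = (*window_a, min_bids, *window_b, min_bids)
    return [row[0] for row in conn.execute(sql, params)]


dropped_out = qualifying_only_in(conn, (1985, 1994), (1986, 1995))
dropped_in = qualifying_only_in(conn, (1986, 1995), (1985, 1994))
print(dropped_out, dropped_in)
```

Swapping the two windows is all it takes to flip between the two lists; a team that qualifies in both windows (Gamma here) appears in neither.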

SQL count using multiple tables

Table 1: mappingtable (this contains the tags mapping with sentence)
id tag_id sentence_id
1 10 30
2 11 40
Table 2 reports
sentence_id DATE property (sentences may repeat)
30 timestamp1 property1
30 timestamp2 property2
40 timestamp3 property1
I am trying to get the tag ids and count of tags grouped by time.
I tried this query
SELECT DISTINCT(tag_id),COUNT(tag_id) AS cnt, MONTH(DATE) AS mnt
FROM mappingtable
INNER JOIN reports
ON mappingtable .sentence_id=reports.sentence_id AND reports.property= 'property1' GROUP BY tag_id,mnt ORDER BY cnt DESC;
However if the sentence repeats in the reports table (as is usually the case) the count of tags is coming wrong.
EDIT
Tried the query suggested below:
SELECT M.tag_id, COUNT(M.tag_id) AS cnt, MONTH(R.DATE) AS mnt FROM mappingtable M INNER JOIN reports R ON M.sentence_id = R.sentence_id AND R.property = 'property1' GROUP BY M.tag_id, MONTH(R.DATE) ORDER BY COUNT(M.tag_id) DESC;
Even this query is giving additional counts because of repeating sentence ids.
What I need is the unique sentences for property property1 grouped by month and then the tags counts of those sentences.
tag_id cnt mnt
60865 145 11
60869 99 11
60994 74 11
61163 74 11
Something like this:
SELECT
M.tag_id,
COUNT(M.tag_id) AS cnt,
MONTH(R.DATE) AS mnt
FROM mappingtable M
INNER JOIN reports R
ON M.sentence_id = R.sentence_id
AND R.property = 'property1'
GROUP BY M.tag_id,
MONTH(R.DATE)
ORDER BY COUNT(M.tag_id) DESC;
The inner join takes the records common to both tables. I believe that's why you are getting a wrong count of tags. Even if a sentence has two properties, there would be just one occurrence in the join.
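If the goal is to count each sentence only once per tag and month, one option is COUNT(DISTINCT sentence_id) instead of COUNT(tag_id). This is a sketch under assumptions: it guesses that "unique sentences" is what the asker wants, uses Python's sqlite3 as a stand-in for MySQL (so strftime('%m', ...) replaces MONTH()), names the date column `report_date`, and uses made-up timestamps.

```python
import sqlite3

# SQLite stands in for MySQL; dates and the report_date name are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mappingtable (id INTEGER, tag_id INTEGER, sentence_id INTEGER);
    CREATE TABLE reports (sentence_id INTEGER, report_date TEXT, property TEXT);
    INSERT INTO mappingtable VALUES (1, 10, 30);
    INSERT INTO mappingtable VALUES (2, 11, 40);
    INSERT INTO reports VALUES (30, '2015-11-01', 'property1');
    INSERT INTO reports VALUES (30, '2015-11-15', 'property1');
    INSERT INTO reports VALUES (30, '2015-11-20', 'property2');
    INSERT INTO reports VALUES (40, '2015-11-03', 'property1');
""")

QUERY = """
    SELECT m.tag_id,
           {count_expr} AS cnt,
           CAST(strftime('%m', r.report_date) AS INTEGER) AS mnt
    FROM mappingtable AS m
    JOIN reports AS r
      ON r.sentence_id = m.sentence_id
     AND r.property = 'property1'
    GROUP BY m.tag_id, strftime('%m', r.report_date)
    ORDER BY cnt DESC, m.tag_id
"""

# Counting joined rows inflates tag 10, because sentence 30 is reported twice.
naive = conn.execute(QUERY.format(count_expr="COUNT(m.tag_id)")).fetchall()
# Counting distinct sentences collapses the repeats.
deduped = conn.execute(
    QUERY.format(count_expr="COUNT(DISTINCT m.sentence_id)")
).fetchall()
print(naive, deduped)
```

The only difference between the two runs is the counting expression, which makes the effect of the repeated sentence rows easy to see.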

mysql moving average of N rows

I have a simple MySQL table like below, used to compute MPG for a car.
+-------------+-------+---------+
| DATE | MILES | GALLONS |
+-------------+-------+---------+
| JAN 25 1993 | 20.0 | 3.00 |
| FEB 07 1993 | 55.2 | 7.22 |
| MAR 11 1993 | 44.1 | 6.28 |
+-------------+-------+---------+
I can easily compute the Miles Per Gallon (MPG) for the car using a select statement, but because the MPG varies widely from fillup to fillup (i.e. you don't fill the exact same amount of gas each time), I would like to compute a 'MOVING AVERAGE' as well. So for any row the MPG is MILES/GALLONS for that row, and the MOVINGMPG is the SUM(MILES)/SUM(GALLONS) for the last N rows. If fewer than N rows exist by that point, just SUM(MILES)/SUM(GALLONS) up to that point.
Is there a single SELECT statement that will fetch the rows with MPG and MOVINGMPG by substituting N into the select statement?
Yes, it's possible to return the specified resultset with a single SQL statement.
Unfortunately, MySQL (before version 8.0) does not support analytic functions, which would make for a fairly simple statement. Even without that syntax, it is possible to emulate some analytic functions using MySQL user variables.
One of the ways to achieve the specified result set (with a single SQL statement) is to use a JOIN operation, using a unique ascending integer value (rownum, derived by and assigned within the query) to each row.
For example:
SELECT q.rownum AS rownum
, q.date AS latest_date
, q.miles/q.gallons AS latest_mpg
, COUNT(1) AS cnt_rows
, MIN(r.date) AS earliest_date
, SUM(r.miles) AS rtot_miles
, SUM(r.gallons) AS rtot_gallons
, SUM(r.miles)/SUM(r.gallons) AS rtot_mpg
FROM ( SELECT @s_rownum := @s_rownum + 1 AS rownum
, s.date
, s.miles
, s.gallons
FROM mytable s
JOIN (SELECT @s_rownum := 0) c
ORDER BY s.date
) q
JOIN ( SELECT @t_rownum := @t_rownum + 1 AS rownum
, t.date
, t.miles
, t.gallons
FROM mytable t
JOIN (SELECT @t_rownum := 0) d
ORDER BY t.date
) r
ON r.rownum <= q.rownum
AND r.rownum > q.rownum - 2
GROUP BY q.rownum
Your desired value of "n" to specify how many rows to include in each rollup row is specified in the predicate just before the GROUP BY clause. In this example, up to "2" rows in each running total row.
If you specify a value of 1, you will get (basically) the original table returned.
To eliminate any "incomplete" running total rows (consisting of fewer than "n" rows), that value of "n" would need to be specified again, adding:
HAVING COUNT(1) >= 2
sqlfiddle demo: http://sqlfiddle.com/#!2/52420/2
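For comparison, on MySQL 8.0 or later (which has window functions) the last-N-rows average can be written with a window frame instead of a self-join. The sketch below is an assumption-laden illustration, not the method used above: Python's sqlite3 (SQLite 3.25+) stands in for MySQL, the table is called `fillups`, and dates are stored as ISO strings so they sort correctly.

```python
import sqlite3

# SQLite (3.25+) stands in for MySQL 8.0+; table name is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fillups (date TEXT, miles REAL, gallons REAL)")
conn.executemany(
    "INSERT INTO fillups VALUES (?, ?, ?)",
    [
        ("1993-01-25", 20.0, 3.00),
        ("1993-02-07", 55.2, 7.22),
        ("1993-03-11", 44.1, 6.28),
    ],
)


def moving_mpg(conn, n):
    """Return (date, mpg, moving_mpg) where moving_mpg covers the last n rows."""
    preceding = int(n) - 1  # validated integer, safe to place in the SQL text
    if preceding < 0:
        raise ValueError("n must be at least 1")
    sql = f"""
        SELECT date,
               miles / gallons AS mpg,
               SUM(miles) OVER w / SUM(gallons) OVER w AS moving_mpg
        FROM fillups
        WINDOW w AS (
            ORDER BY date
            ROWS BETWEEN {preceding} PRECEDING AND CURRENT ROW
        )
        ORDER BY date
    """
    return conn.execute(sql).fetchall()


for row in moving_mpg(conn, 2):
    print(row)
```

Rows near the start that have fewer than n predecessors simply use what is available, which matches the "up to that point" behaviour asked for in the question.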
Followup:
Q: I'm trying to understand your SQL statement. Does your solution do a select of twenty rows for each row in the db? In other words, if I have 1000 rows will your statement perform 20000 selects? (I'm worried about performance)...
A: You are right to be concerned with performance.
To answer your question, no, this does not perform 20,000 selects for 1,000 rows.
The performance hit comes from the two (essentially identical) inline views (aliased as q and r). What MySQL does with these (basically) is create temporary MyISAM tables (MySQL calls them "derived tables"), which are basically copies of mytable, with an extra column, each row assigned a unique integer value from 1 to the number of rows.
Once the two "derived" tables are created and populated, MySQL runs the outer query, using those two "derived" tables as a row source. Each row from q, is matched with up to n rows from r, to calculate the "running total" miles and gallons.
For better performance, you could use a column already in the table, rather than having the query assign unique integer values. For example, if the date column is unique, then you could calculate "running total" over a certain period of days.
SELECT q.date AS latest_date
, SUM(q.miles)/SUM(q.gallons) AS latest_mpg
, COUNT(1) AS cnt_rows
, MIN(r.date) AS earliest_date
, SUM(r.miles) AS rtot_miles
, SUM(r.gallons) AS rtot_gallons
, SUM(r.miles)/SUM(r.gallons) AS rtot_mpg
FROM mytable q
JOIN mytable r
ON r.date <= q.date
AND r.date > q.date + INTERVAL -30 DAY
GROUP BY q.date
(For performance, you would want an appropriate index defined with date as a leading column in the index.)
For the first query, any predicates included (in the inline view definition queries) to reduce the number of rows returned (for example, return only date values in the past year) would reduce the number of rows to be processed, and would also likely improve performance.
Again, to your question about running 20,000 selects for 1,000 rows... a nested loops operation is another way to get the same result set. For a large number of rows, this can exhibit slower performance. (On the other hand, this approach can be fairly efficient when only a few rows are being returned.)
SELECT q.date AS latest_date
, q.miles/q.gallons AS latest_mpg
, ( SELECT SUM(r.miles)/SUM(r.gallons)
FROM mytable r
WHERE r.date <= q.date
AND r.date >= q.date + INTERVAL -90 DAY
) AS rtot_mpg
FROM mytable q
ORDER BY q.date
Something like this should work:
SELECT Date, Miles, Gallons, Miles/Gallons as MilesPerGallon,
@Miles:=@Miles+Miles overallMiles,
@Gallons:=@Gallons+Gallons overallGallons,
@RunningTotal:=@Miles/@Gallons runningTotal
FROM YourTable
JOIN (SELECT @Miles:= 0) t
JOIN (SELECT @Gallons:= 0) s
Which produces the following:
DATE MILES GALLONS MILESPERGALLON RUNNINGTOTAL
January, 25 1993 20 3 6.666667 6.666666666667
February, 07 1993 55.2 7.22 7.645429 7.358121330724
March, 11 1993 44.1 6.28 7.022293 7.230303030303
--EDIT--
In response to the comment, you can add another Row Number to limit your results to the last N rows:
SELECT *
FROM (
SELECT Date, Miles, Gallons, Miles/Gallons as MilesPerGallon,
@Miles:=@Miles+Miles overallmiles,
@Gallons:=@Gallons+Gallons overallGallons,
@RunningTotal:=@Miles/@Gallons runningTotal,
@RowNumber:=@RowNumber+1 rowNumber
FROM (SELECT * FROM YourTable ORDER BY Date DESC) u
JOIN (SELECT @Miles:= 0) t
JOIN (SELECT @Gallons:= 0) s
JOIN (SELECT @RowNumber:= 0) r
) t
WHERE rowNumber <= 3
Just change your ORDER BY clause accordingly.