MYSQL SELECT * doesn't work with GROUP BY and HAVING [duplicate] - mysql

SELECT * FROM or_mail
GROUP BY campaign_id
HAVING date_time = MAX(date_time);
SELECT campaign_id, date_time FROM or_mail
GROUP BY campaign_id
HAVING date_time = MAX(date_time);
The 1st query returns 13 records. The 2nd returns 35.
Why are records missing from the first query!? Why should what I'm selecting matter at all?

This is your query:
SELECT campaign_id, date_time
FROM or_mail
GROUP BY campaign_id
HAVING date_time = MAX(date_time);
You are aggregating by campaign_id. That means that the results will have one row per campaign_id. What date_time goes on the row? Well, an arbitrary value from one of the matching rows. Just one value, an arbitrary one. The same is true of the having clause. In other words, the query does not do what you expect it to do.
Whether you know it or not, you are using a group by extension that is particular to MySQL (you can read the documentation here). The documentation specifically warns against using the extension this way. (There would be no problem if date_time were the same on all rows with the same campaign_id, but that is not the case.)
The following is the query that you actually want:
SELECT om.campaign_id, om.date_time
FROM or_mail om
WHERE not exists (select 1
from or_mail om2
where om2.campaign_id = om.campaign_id and
om2.date_time > om.date_time
);
What this says is: Return results from all rows in or_mail with the property that there is no larger date_time with the same campaign_id.
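To see the NOT EXISTS pattern return exactly one latest row per campaign, here is a minimal sketch using Python's built-in sqlite3 module rather than MySQL. The sample rows and the id column are made up for illustration; only or_mail, campaign_id and date_time come from the question.

```python
import sqlite3

# In-memory stand-in for the or_mail table (sample data is hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE or_mail ("
    "id INTEGER PRIMARY KEY, campaign_id INTEGER, date_time TEXT)"
)
conn.executemany(
    "INSERT INTO or_mail (campaign_id, date_time) VALUES (?, ?)",
    [
        (1, "2014-01-01 09:00:00"),
        (1, "2014-01-03 12:00:00"),
        (2, "2014-02-10 08:30:00"),
        (2, "2014-02-01 17:45:00"),
        (3, "2014-03-05 10:00:00"),
    ],
)

# Keep a row only if no other row of the same campaign has a later date_time.
latest_rows = conn.execute(
    """
    SELECT om.campaign_id, om.date_time
    FROM or_mail om
    WHERE NOT EXISTS (
        SELECT 1
        FROM or_mail om2
        WHERE om2.campaign_id = om.campaign_id
          AND om2.date_time > om.date_time
    )
    ORDER BY om.campaign_id
    """
).fetchall()

print(latest_rows)
```

Note that both sides of the comparison are qualified (om2.date_time and om.date_time). An unqualified date_time inside the subquery would resolve to the inner table and the condition could never be true.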

It shouldn't... Did you wait a while before running second query? If so, then a bunch of records could have been created.

Related

Optimizing a simple query with 70mil rows to fit into Tableau

Noobie to SQL. I have a simple query here that is 70 million rows, and my work laptop will not handle the capacity when I import it into Tableau. Usually 20 million rows and less seem to work fine. Here's my problem.
Table name: Table1
Fields: UniqueID, State, Date, claim_type
Query:
SELECT uniqueID, states, claim_type, date
FROM table1
WHERE date >= '11-09-2021'
This gives me what I want, BUT, I can limit the query significantly if I count the number of uniqueIDs that have been used in 3 or more different states. I use this query to do that.
SELECT unique_id, count(distinct states), claim_type, date
FROM table1
WHERE date >= '11-09-2021'
GROUP BY Unique_id, claim_type, date
HAVING COUNT(DISTINCT states) > 3
The only issue is, when I put this query into Tableau it only displays the FIRST state a unique_id showed up in, and the first date it showed up. A unique_id shows up in multiple states over multiple dates, so when I use this count aggregation it's only giving me the first result and not the whole picture.
Any ideas here? I am totally lost and spent a whole business day trying to fix this
Expected output would be something like
uniqueID | state | claim type | Date
123 Ohio C 01-01-2021
123 Nebraska I 02-08-2021
123 Georgia D 03-08-2021
If your table is only of those four columns, and your queries are based on date ranges, your index must exist to help optimize that. If 70 mil records exist, how far back does that go... Years? If your data since 2021-09-11 is only say... 30k records, that should be all you are blowing through for your results.
I would ensure you have the index based on (and in this order)
(date, uniqueId, claim_type, states). Also, you mentioned you wanted a count of 3 OR MORE; your query's > 3 will result in 4 or more unless you change it to count(distinct states) >= 3.
Then, to get the entries you care about, you need
SELECT date, uniqueID, claim_type
FROM table1
WHERE date >= '2021-09-11'
group by date, uniqueID, claim_type
having count( distinct states ) >= 3
This would give just the 3-part qualifier for date/id/claim that HAD them. Then you would use THIS result set to get the other entries via
select distinct
t1.date, t1.uniqueID, t1.claim_type, t1.states
from
( SELECT date, uniqueID, claim_type
FROM table1
WHERE date >= '2021-09-11'
group by date, uniqueID, claim_type
having count( distinct states ) >= 3 ) PQ
JOIN Table1 t1
on PQ.date = t1.date
and PQ.UniqueID = t1.UniqueID
and PQ.Claim_Type = t1.Claim_Type
The "PQ" (preQuery) gets the qualified records. Then it joins back to the original table and grabs all records that qualified from the unique date/id/claim_type and returns all the states.
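As a rough check of the pre-query-then-join-back idea, here is a small sketch run against SQLite through Python's sqlite3 module (not MySQL, and not Tableau). The rows are invented; the column names follow the queries in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 (uniqueID INTEGER, states TEXT, claim_type TEXT, date TEXT)"
)
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?)",
    [
        (123, "Ohio", "C", "2021-10-01"),
        (123, "Nebraska", "C", "2021-10-01"),
        (123, "Georgia", "C", "2021-10-01"),
        (456, "Ohio", "C", "2021-10-01"),   # only two states: should not qualify
        (456, "Texas", "C", "2021-10-01"),
        (123, "Ohio", "C", "2021-08-01"),   # before the cutoff date
    ],
)

# PQ finds the qualifying date/id/claim_type groups; the join brings back
# every state belonging to those groups.
qualified = conn.execute(
    """
    SELECT DISTINCT t1.date, t1.uniqueID, t1.claim_type, t1.states
    FROM (
        SELECT date, uniqueID, claim_type
        FROM table1
        WHERE date >= '2021-09-11'
        GROUP BY date, uniqueID, claim_type
        HAVING COUNT(DISTINCT states) >= 3
    ) PQ
    JOIN table1 t1
      ON PQ.date = t1.date
     AND PQ.uniqueID = t1.uniqueID
     AND PQ.claim_type = t1.claim_type
    ORDER BY t1.states
    """
).fetchall()

print(qualified)
```

The grouped subquery only decides which groups qualify; the detail rows (one per state) come from the join, which is why nothing is collapsed to a "first" state.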
Yes, you are grouping rows, so you 'lose' information in the grouped result.
You won't get 70m records with your grouped query.
Why don't you split your imports in smaller chunks? Like limit the rows to chunks of, say 15m:
1st:
SELECT uniqueID, states, claim_type, date FROM table1 WHERE date >= '11-09-2021' LIMIT 15000000;
2nd:
SELECT uniqueID, states, claim_type, date FROM table1 WHERE date >= '11-09-2021' LIMIT 15000000 OFFSET 15000000;
3rd:
SELECT uniqueID, states, claim_type, date FROM table1 WHERE date >= '11-09-2021' LIMIT 15000000 OFFSET 30000000;
and so on..
I know it's not a perfect or very handy solution, but maybe it gets you to the desired outcome.
See this link for information about LIMIT and OFFSET
https://www.bitdegree.org/learn/mysql-limit-offset
It is wise in the long run to use the DATE datatype. That requires dates to look like '2021-09-11', not '11-09-2021'. That will let > correctly compare dates that are in two different years.
If your data is coming from some source that formats it as '11-09-2021', use STR_TO_DATE() to convert it as it goes in; you can reconstruct that format on output via DATE_FORMAT().
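To illustrate why the string format matters, here is a small Python sketch (standing in for the MySQL conversion) that compares DD-MM-YYYY text against ISO dates. The sample dates are made up, and the DD-MM-YYYY reading of the original format is an assumption.

```python
from datetime import datetime

# Dates as they might arrive from the source, assumed to be DD-MM-YYYY.
raw_dates = ["11-09-2021", "02-01-2022", "25-12-2020"]

# Sorting (or comparing) the raw text goes character by character,
# so the day-of-month dominates and the years end up out of order.
as_text_sorted = sorted(raw_dates)

# Parse once, then store in ISO form (YYYY-MM-DD), which sorts chronologically.
iso_dates = [datetime.strptime(d, "%d-%m-%Y").date().isoformat() for d in raw_dates]
iso_sorted = sorted(iso_dates)

# A 2022 date wrongly fails a ">= cutoff" test when both are DD-MM-YYYY text.
text_comparison = "02-01-2022" >= "11-09-2021"
iso_comparison = "2022-01-02" >= "2021-09-11"

print(as_text_sorted)
print(iso_sorted)
print(text_comparison, iso_comparison)
```

In MySQL the same parsing step would be done with STR_TO_DATE() and a matching format string.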
Once you have done that, we can talk about optimizing
SELECT unique_id, count(distinct states), claim_type, date
FROM table1
WHERE date >= '2021-09-11'
GROUP BY Unique_id, claim_type, date
HAVING COUNT(DISTINCT states) > 3
Tentatively I recommend this composite index to speed up the query:
INDEX(Unique_id, claim_type, date, states)
That will also help with your other query.
(I am assuming the ambiguous '11-09-2021' is DD-MM-YYYY.)

MySQL: Undesired result with max function on a timestamp

I use a Mantis Bug Database (that uses MySQL) and I want to query which bugs had a change in their severity within the last 2 weeks, however only the last severity change of the bug should be indicated.
The problem is that I get multiple entries per bugID (which is the primary key), which is not my desired result since I want only the latest change per bug. This means that I am somehow using the max function and the group by clause incorrectly.
Here you can see my query:
SELECT `bug_id`,
max(date_format(from_unixtime(`mantis_bug_history_table`.`date_modified`),'%Y-%m-%d %h:%i:%s')) AS `Severity_changed`,
`mantis_bug_history_table`.`old_value`,
`mantis_bug_history_table`.`new_value`
from `prepared_bug_list`
join `mantis_bug_history_table` on `prepared_bug_list`.`bug_id` = `mantis_bug_history_table`.`bug_id`
where (`mantis_bug_history_table`.`field_name` like 'severity')
group by `bug_id`,`old_value`,`new_value`
having (`Severity_changed` >= (now() - interval 2 week))
order by `bug_id` ASC
For the bug with the id 8 for example I get three entries with this query. The bug with the id 8 had indeed three severity changes within the last 2 weeks but I only want to get the latest severity change.
What could be the problem with my query?
max() is an aggregation function and it does not appear to be suitable for what you are trying to do.
I have a feeling that what you are trying to do is get the latest entry out of all the applicable rows for each bug_id in mantis_bug_history_table. If that is true, then I would rewrite the query as follows: I would write a sub-query getLatest and join it with mantis_bug_history_table.
Updated answer
Caution: I don't have access to the actual DB tables so this query may have bugs
select
`mantis_bug_history_table`.`bug_id`
, `mantis_bug_history_table`.`date_modified`
, `mantis_bug_history_table`.`old_value`
, `mantis_bug_history_table`.`new_value`
from
(
select
(
select
`id` -- primary key of the history row, not the bug_id
from
`mantis_bug_history_table`
where
`date_modified` > unix_timestamp() - 14*24*3600 -- two weeks
and `field_name` like 'severity'
and `bug_id` = `prepared_bug_list`.`bug_id`
order by
`date_modified` desc
limit 1
) as `last_history_id`
from
`prepared_bug_list`
) as `getLatest`
inner join `mantis_bug_history_table`
on `mantis_bug_history_table`.`id` = `getLatest`.`last_history_id`
order by `mantis_bug_history_table`.`bug_id` ASC
I finally have a solution! A friend of mine helped me, and one part of the solution was to include the primary key of the mantis bug history table, which is not the bug_id but the column id, a consecutive number.
Another part of the solution was the subquery in the where clause:
select `prepared_bug_list`.`bug_id` AS `bug_id`,
`mantis_bug_history_table`.`old_value` AS `old_value`,
`mantis_bug_history_table`.`new_value` AS `new_value`,
`mantis_bug_history_table`.`type` AS `type`,
date_format(from_unixtime(`mantis_bug_history_table`.`date_modified`),'%Y-%m-%d %H:%i:%s') AS `date_modified`
FROM `prepared_bug_list`
JOIN mantis_import.mantis_bug_history_table
ON `prepared_bug_list`.`bug_id` = mantis_bug_history_table.bug_id
where (mantis_bug_history_table.id = -- id is the id of each history entry, not to be confused with bug_id
(select `mantis_bug_history_table`.`id` from `mantis_bug_history_table`
where ((`mantis_bug_history_table`.`field_name` = 'severity')
and (`mantis_bug_history_table`.`bug_id` = `prepared_bug_list`.`bug_id`))
order by `mantis_bug_history_table`.`date_modified` desc limit 1)
and `date_modified` > unix_timestamp() - 14*24*3600 )
order by `prepared_bug_list`.`bug_id`,`mantis_bug_history_table`.`date_modified` desc
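For anyone who wants to try the "pick the latest history id in a correlated subquery" idea without a Mantis database, here is a minimal sketch against SQLite via Python's sqlite3. The rows are invented, the table layout is reduced to the columns used above, and a fixed now value replaces unix_timestamp() so the result is repeatable.

```python
import sqlite3

DAY = 24 * 3600
now = 1_700_000_000          # fixed "current time" (assumption, for repeatability)
cutoff = now - 14 * DAY      # two weeks back

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prepared_bug_list (bug_id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE mantis_bug_history_table ("
    "id INTEGER PRIMARY KEY, bug_id INTEGER, field_name TEXT, "
    "old_value TEXT, new_value TEXT, date_modified INTEGER)"
)
conn.executemany("INSERT INTO prepared_bug_list VALUES (?)", [(8,), (9,), (10,)])
conn.executemany(
    "INSERT INTO mantis_bug_history_table "
    "(bug_id, field_name, old_value, new_value, date_modified) VALUES (?, ?, ?, ?, ?)",
    [
        (8, "severity", "minor", "major", now - 10 * DAY),
        (8, "severity", "major", "block", now - 5 * DAY),
        (8, "severity", "block", "crash", now - 1 * DAY),
        (8, "status", "new", "assigned", now - 3600),      # not a severity change
        (9, "severity", "minor", "major", now - 30 * DAY),  # older than two weeks
        (10, "severity", "tweak", "minor", now - 2 * DAY),
    ],
)

# For each bug, keep only the history row whose id is the newest severity change.
latest_changes = conn.execute(
    """
    SELECT p.bug_id, h.old_value, h.new_value
    FROM prepared_bug_list p
    JOIN mantis_bug_history_table h ON p.bug_id = h.bug_id
    WHERE h.id = (
        SELECT h2.id
        FROM mantis_bug_history_table h2
        WHERE h2.field_name = 'severity'
          AND h2.bug_id = p.bug_id
        ORDER BY h2.date_modified DESC
        LIMIT 1
    )
      AND h.date_modified > ?
    ORDER BY p.bug_id
    """,
    (cutoff,),
).fetchall()

print(latest_changes)
```

Bug 8 has three severity changes inside the window but appears once, with its most recent change; bug 9 drops out because its only change is older than two weeks.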

Access Query for TopN by Group

I've reviewed quite a few sites (e.g. Allen Browne's) for creating a query that produces the top 5 (or N) values by group. I think I am getting hung up on the creation of a subquery because I'm referencing a previous query, not a table.
I have a query started which counts by month the number of PIs (qryPICountbyMonth). Currently the below gives a data mismatch expression error:
SELECT qryPI.EventMonth, qryPI.PI_Issue, Count(qryPI.PI_Issue) AS
CountOfPI_Issue
FROM qryPI
GROUP BY qryPI.EventMonth, qryPI.PI_Issue
HAVING (((Count(qryPI.PI_Issue)) In (Select Top 5 [PI_Issue] From [qryPI]
Where [EventMonth]=[qryPI].[EventMonth] Order By [PI_Issue] Desc)))
ORDER BY qryPI.EventMonth DESC , Count(qryPI.PI_Issue) DESC;
It is built off a separate query, qryPI:
SELECT tblPI.EventDate, Format([EventDate],'yyyy-mm',1,1) AS EventMonth, tblPI.PI_Issue
FROM tblPI
WHERE (((tblPI.EventDate) >= #4/1/2016# And (tblPI.EventDate) <= #5/31/2016#))
GROUP BY tblPI.EventDate, Format([EventDate],'yyyy-mm',1,1), tblPI.PI_Issue;
I'm hoping to have it generate the top 5 counts of PI_Issue by EventMonth. If I haven't provided enough info let me know.
The problem (or at least a problem) is with [EventMonth]=[qryPI].[EventMonth]. Both your primary source and your lookup are called qryPI. You have to alias at least one of them.
You can't do this:
HAVING (((Count(qryPI.PI_Issue)) In (Select Top 5 [PI_Issue] From [qryPI]
count(field) will return an integer, not the set of values you're counting
I thought you could specify TopN in an Access query (it's in the properties), but you have to specify an order by clause, so it knows how to determine the TOP.
Have you tried:
SELECT top 5
tblPI.EventDate, Format([EventDate],'yyyy-mm',1,1) AS EventMonth, tblPI.PI_Issue
FROM tblPI
WHERE (((tblPI.EventDate) >= #4/1/2016# And (tblPI.EventDate) <= #5/31/2016#))
GROUP BY tblPI.EventDate, Format([EventDate],'yyyy-mm',1,1), tblPI.PI_Issue
order by PI_Issue
Also, I'm not sure why you're using GROUP BY in your inner query, as you're not returning any aggregate functions. Do you just need DISTINCT instead?
Try:
SELECT distinct top 5
tblPI.EventDate, Format([EventDate],'yyyy-mm',1,1) AS EventMonth, tblPI.PI_Issue
FROM tblPI
WHERE (((tblPI.EventDate) >= #4/1/2016# And (tblPI.EventDate) <= #5/31/2016#))
order by PI_Issue
Actually, if I understand what you want, you need that GROUP BY instead of DISTINCT, but you also need to return the COUNT(*):
SELECT
Year([eventDate]) AS yr,
Month([eventDate]) AS mo,
tblPI.PI_issue,
Min(tblPI.eventDate) AS MinOfeventDate,
Max(tblPI.eventDate) AS MaxOfeventDate,
Count(tblPI.PI_issue) AS CountOfPI_issue
FROM tblPI
WHERE
(((tblPI.EventDate)>=#4/1/2016# And
(tblPI.EventDate)<#6/1/2016#))
GROUP BY
Year([eventDate]),
Month([eventDate]),
tblPI.PI_issue;
Then you want to apply TOP 5 to CountOfPI_issue in an outer query (assuming the query above is saved as qry_inner):
SELECT TOP 5 * FROM qry_inner
ORDER BY CountOfPI_issue DESC
Except that TOP 5 applies to all the query results, not the results grouped by yy/mm, which is what I'm assuming you want, so try this:
SELECT TOP 5
qry_inner.yr,
qry_inner.mo,
qry_inner.CountOfPI_issue,
qry_inner.PI_issue,
qry_inner.MinOfeventDate,
qry_inner.MaxOfeventDate
FROM qry_inner
ORDER BY qry_inner.CountOfPI_issue DESC;
As far as I know, Access doesn't allow you to select the top number of rows within a group, so you'll need to limit your outer query results to one month, then apply the TOP function.
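The "restrict to one month, then take the top rows" approach can be sketched as a loop over months. This is only an illustration in Python with SQLite, not Access: strftime stands in for Format(), LIMIT stands in for TOP (and, unlike TOP, does not return ties), TOP_N is set to 2 to keep the sample small, and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblPI (EventDate TEXT, PI_Issue TEXT)")
conn.executemany(
    "INSERT INTO tblPI VALUES (?, ?)",
    [
        ("2016-04-02", "Fall"), ("2016-04-05", "Fall"), ("2016-04-09", "Fall"),
        ("2016-04-03", "Medication"), ("2016-04-20", "Medication"),
        ("2016-04-11", "Labeling"),
        ("2016-05-01", "Labeling"), ("2016-05-07", "Labeling"),
        ("2016-05-15", "Fall"),
    ],
)

TOP_N = 2  # the question asks for 5; 2 keeps the example readable

# Step 1: list the months present in the data.
months = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT strftime('%Y-%m', EventDate) FROM tblPI ORDER BY 1"
    )
]

# Step 2: for each month, count issues and keep the N most frequent.
top_by_month = {}
for month in months:
    top_by_month[month] = conn.execute(
        """
        SELECT PI_Issue, COUNT(*) AS cnt
        FROM tblPI
        WHERE strftime('%Y-%m', EventDate) = ?
        GROUP BY PI_Issue
        ORDER BY cnt DESC, PI_Issue
        LIMIT ?
        """,
        (month, TOP_N),
    ).fetchall()

print(top_by_month)
```

The ordering is on the count, not on the issue name, which is the part the original HAVING ... IN (SELECT TOP 5 [PI_Issue] ...) attempt got mixed up.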

mysql order by count - ordering by value

I am attempting to count number of records in a database by grouping them. This works fine, but when I try to order by, it orders the count by a different method than wanted. Example Result:
Question - Answer - Count
Q1 - A1 - 1
Q2 - A2 - 11
Q3 - A3 - 2
Result wanted: I want 11 after 2-9, not before. The query is simply:
SELECT Question, Answer, count(*) as `Count` GROUP BY Question, Answer ORDER BY Question, Answer
A further example of the sort is that the mysql sorts like, 1,11,118,12,2,3 where I am expecting the increasing value like 1,2,3,11,12,118
try this query
SELECT Question, Answer, count(*) as `Count`
FROM table_name
GROUP BY Question, Answer
ORDER BY count(*) ASC
You have put in your query
ORDER BY Question, Answer
If you want 11 to come after 2, then surely you want
ORDER BY Count
The issue appears to be that I was ordering by character value instead of integer value. I had to cast the answer as an integer, and then it ordered properly. Here is the query that works:
SELECT Question, Answer, count(*) as `Count` FROM table_name GROUP BY Question, Answer ORDER BY Question, CAST( answer AS SIGNED INTEGER )
Found the answer here:
Sorting varchar field numerically in MySQL
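The difference between character ordering and integer ordering is easy to reproduce outside MySQL. A small Python sketch, using the values from the question stored as text:

```python
# Values stored as text, as in a varchar column.
answers = ["1", "11", "118", "12", "2", "3"]

# Text comparison goes character by character, so "11" sorts before "2".
as_text = sorted(answers)

# Converting to integers first gives the numeric order the asker expected;
# this plays the role of CAST(answer AS SIGNED INTEGER) in the ORDER BY.
as_numbers = sorted(answers, key=int)

print(as_text)
print(as_numbers)
```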

pulling the following information with a single select [duplicate]

I have the following information in my mysql table
Eventually I would like to make a select that counts the entries for a page_id, but only for a specific date, meaning I would like to have the following output:
84 - 7 - 09/23/2013
85 - 4 - 09/23/2013
84 - 1 - 09/24/2013
Can it be done in a single select?
select page_id, count(*), date
from table_name
group by page_id, date
SELECT page_id, count(page_id), date
FROM table_name
GROUP BY page_id, date
You need to SELECT the 3 fields. Then you need to COUNT one of those (page_id) since you need to count how many repetitions you got. And the last step, for the query to run (and also to make sense) is to GROUP BY the other 2 fields.
Hope you get a clearer idea on how to query the table.
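A quick way to convince yourself that grouping by both columns gives one count per page per day is to run it on a throwaway table. The sketch below uses SQLite through Python's sqlite3; the table name page_hits is hypothetical, the rows are built to match the expected output in the question, and dates are stored in ISO form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_hits (page_id INTEGER, date TEXT)")

# 7 hits for page 84 and 4 for page 85 on the 23rd, 1 hit for page 84 on the 24th.
hits = (
    [(84, "2013-09-23")] * 7
    + [(85, "2013-09-23")] * 4
    + [(84, "2013-09-24")]
)
conn.executemany("INSERT INTO page_hits VALUES (?, ?)", hits)

# One output row per (page_id, date) pair, with the number of rows in that pair.
counts = conn.execute(
    """
    SELECT page_id, COUNT(*), date
    FROM page_hits
    GROUP BY page_id, date
    ORDER BY date, page_id
    """
).fetchall()

print(counts)
```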