MYSQL Getting all values for a GROUP BY


The table db has 3 columns (thing1, thing2, datetime). What I want to do is pull all the records for every thing1 that has more than one distinct thing2 entry.
SELECT thing1,thing2 FROM db WHERE datetime >= DATE_SUB(NOW(), INTERVAL 1 HOUR) GROUP BY thing1 HAVING COUNT(DISTINCT(thing2)) > 1;
This gets me almost what I need, but of course the GROUP BY collapses the result to one row per thing1, and I need all the thing1, thing2 entries.
Any suggestions would be greatly appreciated.

I think you should use GROUP BY this way:
SELECT thing1,thing2
FROM db WHERE datetime >= DATE_SUB(NOW(), INTERVAL 1 HOUR)
GROUP BY thing1, thing2 HAVING COUNT(*) > 1;

Shamelessly copying Matt S' original answer as a starting point to provide an alternative...
SELECT db.thing1, db.thing2
FROM db
INNER JOIN (
SELECT thing1, MIN(`datetime`) As `datetime`
FROM db
WHERE `datetime` >= DATE_SUB(NOW(), INTERVAL 1 HOUR)
GROUP BY thing1
HAVING COUNT(DISTINCT thing2) > 1
) AS subQ ON db.thing1 = subQ.thing1 AND db.`datetime` >= subQ.`datetime`
;
MySQL is very finicky, performance-wise, when it comes to subqueries in WHERE clauses; this JOIN alternative may perform faster than such a query.
It may also perform faster than in its current form with the MIN removed from the subquery (and from the join condition) and a redundant datetime condition supplied in the outer WHERE instead.
Which is best will depend on data, hardware, configuration, etc...
Sidenote: I would caution against using keywords such as datetime as field (or table) names; they tend to bite their user when least expected, and at the very least they should always be escaped with ` as in the example.

If I'm understanding what you're looking for, you'll want to use your current query as a sub-query:
SELECT thing1, thing2 FROM db WHERE thing1 IN (
SELECT thing1 FROM db
WHERE datetime >= DATE_SUB(NOW(), INTERVAL 1 HOUR)
GROUP BY thing1
HAVING COUNT(DISTINCT thing2) > 1
);
The subquery is already getting the thing1s you want, so this lets you get the original rows back from the table, limited to just those thing1s.
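To see what the IN (subquery) form returns, here is a minimal sketch run against made-up rows. It uses Python's built-in sqlite3 module as a stand-in for MySQL and drops the datetime filter so the result is deterministic; the sample data is purely an assumption for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL table "db".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE db (thing1 TEXT, thing2 TEXT)")

# Hypothetical sample rows: "a" and "c" have two distinct thing2 values,
# "b" only ever has one.
conn.executemany(
    "INSERT INTO db VALUES (?, ?)",
    [("a", "x"), ("a", "y"), ("a", "x"),
     ("b", "z"), ("b", "z"),
     ("c", "p"), ("c", "q")],
)

# The inner query picks the qualifying thing1 values; the outer query
# returns every original row for those values.
rows = conn.execute(
    """
    SELECT thing1, thing2 FROM db WHERE thing1 IN (
        SELECT thing1 FROM db
        GROUP BY thing1
        HAVING COUNT(DISTINCT thing2) > 1
    )
    ORDER BY thing1, thing2
    """
).fetchall()

for row in rows:
    print(row)
```

All three "a" rows and both "c" rows come back, while the "b" rows are dropped because they only ever pair with a single thing2.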

Related

more efficient inner join query

Is it possible to make this query more efficient?
SELECT DISTINCT(static.template.name)
FROM probedata.probe
INNER JOIN static.template ON probedata.probe.template_fk = static.template.pk
WHERE creation_time >= DATE_SUB(NOW(), INTERVAL 6 MONTH)
Thanks.

First, I'm going to rewrite it using table aliases, so I can read it:
SELECT DISTINCT(t.name)
FROM probedata.probe p INNER JOIN
static.template t
ON p.template_fk = t.pk
WHERE creation_time >= DATE_SUB(NOW(), INTERVAL 6 MONTH);
Let me make two assumptions:
name is unique in static.template
creation_time comes from probe
The first assumption is particularly useful. You can rewrite the query as:
SELECT t.name
FROM static.template t
WHERE EXISTS (SELECT 1
FROM probedata.probe p
WHERE p.template_fk = t.pk AND
p.creation_time >= DATE_SUB(NOW(), INTERVAL 6 MONTH)
);
The second assumption only affects the indexing. For this query, you want an index on probe(template_fk, creation_time).
If template has wide records, then an index on template(pk, name) might also prove useful.
This will change the execution plan to be a scan of the template table with a fast look up using the index into the probe table. There will be no additional processing to remove duplicates.
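As a quick sanity check that the EXISTS form returns the same names as the DISTINCT join when name is unique, here is a small sketch using Python's sqlite3 module in place of MySQL. The schema prefixes are dropped, the sample rows and the fixed cutoff date are assumptions, and this says nothing about how MySQL will actually plan either query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE template (pk INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE probe (
        id INTEGER PRIMARY KEY,
        template_fk INTEGER,
        creation_time TEXT
    );
    -- The composite index suggested in the answer.
    CREATE INDEX probe_fk_time ON probe (template_fk, creation_time);

    -- Hypothetical sample data.
    INSERT INTO template VALUES (1, 'alpha'), (2, 'beta'), (3, 'gamma');
    INSERT INTO probe (template_fk, creation_time) VALUES
        (1, '2024-05-01'), (1, '2024-06-01'),
        (2, '2023-01-01'),
        (3, '2024-04-15');
    """
)

# Fixed cutoff instead of DATE_SUB(NOW(), INTERVAL 6 MONTH), so the
# result does not depend on when the script runs.
cutoff = "2024-01-01"

join_names = conn.execute(
    """
    SELECT DISTINCT t.name
    FROM probe p INNER JOIN template t ON p.template_fk = t.pk
    WHERE p.creation_time >= ?
    ORDER BY t.name
    """,
    (cutoff,),
).fetchall()

exists_names = conn.execute(
    """
    SELECT t.name
    FROM template t
    WHERE EXISTS (SELECT 1
                  FROM probe p
                  WHERE p.template_fk = t.pk AND p.creation_time >= ?)
    ORDER BY t.name
    """,
    (cutoff,),
).fetchall()

print(join_names)
print(exists_names)
```

Both queries list alpha and gamma; alpha appears once even though it has two recent probes, which is the duplicate removal the EXISTS form gets for free.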

Could help:
If you use this statement in a script, assign the result of DATE_SUB(NOW(), INTERVAL 6 MONTH) to a variable before the SELECT statement and use that variable in the WHERE condition (so the function that calculates the last X months executes just once).
Instead of DISTINCT, see whether there is an improvement using just the column in the SELECT clause (no DISTINCT) and adding GROUP BY static.template.name.

DISTINCT ON query w/ ORDER BY max value of a column

I've been tasked with converting a Rails app from MySQL to Postgres asap and ran into a small issue.
The active record query:
current_user.profile_visits.limit(6).order("created_at DESC").where("created_at > ? AND visitor_id <> ?", 2.months.ago, current_user.id).distinct
Produces the SQL:
SELECT visitor_id, MAX(created_at) as created_at, distinct on (visitor_id) *
FROM "profile_visits"
WHERE "profile_visits"."social_user_id" = 21
AND (created_at > '2015-02-01 17:17:01.826897' AND visitor_id <> 21)
ORDER BY created_at DESC, id DESC
LIMIT 6
I'm pretty confident when working with MySQL but I'm honestly new to Postgres. I think this query is failing for multiple reasons.
I believe the distinct on needs to be first.
I don't know how to order by the results of max function
Can I even use the max function like this?
The high level goal of this query is to return the 6 most recent profile views of a user. Any pointers on how to fix this ActiveRecord query (or its resulting SQL) would be greatly appreciated.

"The high level goal of this query is to return the 6 most recent profile views of a user."
That would be simple. You don't need max() nor DISTINCT for this:
SELECT *
FROM profile_visits
WHERE social_user_id = 21
AND created_at > (now() - interval '2 months')
AND visitor_id <> 21 -- ??
ORDER BY created_at DESC NULLS LAST, id DESC NULLS LAST
LIMIT 6;
I suspect your question is incomplete. If you want:
the 6 latest visitors with their latest visit to the page
then you need a subquery. You cannot get this sort order in one query level, neither with DISTINCT ON, nor with window functions:
SELECT *
FROM (
SELECT DISTINCT ON (visitor_id) *
FROM profile_visits
WHERE social_user_id = 21
AND created_at > (now() - interval '2 months')
AND visitor_id <> 21 -- ??
ORDER BY visitor_id, created_at DESC NULLS LAST, id DESC NULLS LAST
) sub
ORDER BY created_at DESC NULLS LAST, id DESC NULLS LAST
LIMIT 6;
The subquery sub gets the latest visit per user (but not older than two months and not for a certain visitor, 21). ORDER BY must have the same leading columns as DISTINCT ON.
You need the outer query to get the 6 latest visitors then.
Consider the sequence of events:
Best way to get result count before LIMIT was applied
Why NULLS LAST? To be sure, you did not provide the table definition.
PostgreSQL sort by datetime asc, null first?

MySQL join date columns with 1-month lag and performance issues

Note: I found this similar question but it does not address my issue, so I do not believe this is a duplicate.
I have two simple MySQL tables (created with the MyISAM engine), Table1 and Table2.
Both of the tables have 3 columns, a date-type column, an integer ID column, and a float value column. Both tables have about 3 million records and are very straightforward.
The contents of the tables look like this (with Date and Id as primary keys):
Date Id Var1
2012-1-27 1 0.1
2012-1-27 2 0.5
2012-2-28 1 0.6
2012-2-28 2 0.7
(assume Var1 becomes Var2 for the second table).
Note that for each (year, month, ID) triplet, there will only be a single entry. But the actual day of the month that appears is not necessarily the final day, nor is it the final weekday, nor is it the final business day, etc... It's just some day of the month. This day is important as an observation day in other tables, but the day-of-month itself doesn't matter between Table1 and Table2.
Because of this, I cannot rely on Date + INTERVAL 1 MONTH to produce the matching day-of-month for the date it should match to that is one month ahead.
I'm looking to join the two tables on Date and Id, but where the values from the second table (Var2) come from one month ahead of Var1.
This sort of code will accomplish it, but I am noticing a significant performance degradation with this, explained below.
-- This is exceptionally slow for me
SELECT b.Date,
b.Id,
a.Var1,
b.Var2
FROM Table1 a
JOIN Table2 b
ON a.Id = b.Id
AND YEAR(a.Date + INTERVAL 1 MONTH) = YEAR(b.Date)
AND MONTH(a.Date + INTERVAL 1 MONTH) = MONTH(b.Date)
-- This returns quickly, but if I use it as a sub-query
-- then the parent query is very slow.
SELECT Date + INTERVAL 1 MONTH as FutureDate,
Id,
Var1
FROM Table1
-- That is, the above is fast, but this is super slow:
select b.Date,
b.Id,
a.Var1,
b.Var2
FROM (SELECT Date + INTERVAL 1 MONTH as FutureDate,
Id,
Var1
FROM Table1) a
JOIN Table2 b
ON YEAR(a.FutureDate) = YEAR(b.Date)
AND MONTH(a.FutureDate) = MONTH(b.Date)
AND a.Id = b.Id
I've tried re-ordering the JOIN criteria, thinking maybe that matching on Id first in the code would change the query execution plan, but it seems to make no difference.
When I say "super slow", I mean that option #1 from the code above doesn't return the results for all 3 million records even if I wait for over an hour. Option #2 returns in less than 10 minutes, but then option number three takes longer than 1 hour again.
I don't understand why the introduction of the date lag makes it take so long.
How can I
profile the queries to understand why it takes a long time?
write a better query for joining tables based on a 1-month date lag (where day-of-month that results from the 1-month lag may cause mismatches).

Here is an alternative approach:
SELECT b.Date, b.Id,
(select a.var1
from Table1 a
where a.id = b.id and a.date < b.date
order by a.date desc
limit 1
) as var1,
b.Var2
FROM Table2 b;
Be sure the primary index is set up with id first and then date on Table1. Otherwise, create another index Table1(id, date).
Note that this assumes that the preceding date is for the preceding month.

Here's another alternative way to go about this:
SELECT thismonth.Date,
thismonth.Id,
thismonth.Var1 AS Var1_thismonth,
lastmonth.Var1 AS Var1_lastmonth
FROM Table2 AS thismonth
JOIN
(SELECT id, Var1,
DATE(DATE_FORMAT(Date,'%Y-%m-01')) as MonthStart
FROM Table2
) AS lastmonth
ON ( thismonth.id = lastmonth.id
AND thismonth.Date >= lastmonth.MonthStart + INTERVAL 1 MONTH
AND thismonth.Date < lastmonth.MonthStart + INTERVAL 2 MONTH
)
To get this to perform ideally, I think you're going to need a compound covering index on (id, Date, Var1).
It works by generating a derived table containing Id,MonthStart,Var1 and then joining the original table to it by a sequence of range scans. Hence the compound covering index.

The other answers gave very useful tips, but ultimately, without making significant modifications to the index structure of my data (which is not feasible at the moment), those methods would not work faster (in any meaningful sense) than what I had already tried in the question.
Ollie Jones gave me the idea to use date formatting, and coupling that with the TIMESTAMPDIFF function seems to make it passably fast, though I still welcome any comments explaining why the use of YEAR, MONTH, DATE_FORMAT, and TIMESTAMPDIFF have such wildly different performance properties.
SELECT b.Date,
b.Id,
b.Var2,
a.Date,
a.Id,
a.Var1
FROM Table1 a
JOIN Table2 b
ON a.Id = b.Id
AND (TIMESTAMPDIFF(MONTH,
DATE_FORMAT(a.Date, '%Y-%m-01'),
DATE_FORMAT(b.Date, '%Y-%m-01')) = 1)
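Comparing the first-of-month dates with TIMESTAMPDIFF(MONTH, ...) = 1 amounts to saying that the two dates' month indexes (year * 12 + month) differ by exactly one. Here is a rough sketch of that idea using Python's sqlite3 module; SQLite has no TIMESTAMPDIFF, so strftime is swapped in, and the sample rows are invented. It illustrates the matching logic only, not MySQL performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Table1 (Date TEXT, Id INTEGER, Var1 REAL);
    CREATE TABLE Table2 (Date TEXT, Id INTEGER, Var2 REAL);

    -- Hypothetical rows; the day of month differs between the tables,
    -- and the last pair crosses a year boundary.
    INSERT INTO Table1 VALUES
        ('2012-01-31', 1, 0.1), ('2012-02-28', 1, 0.6), ('2012-12-15', 1, 0.9);
    INSERT INTO Table2 VALUES
        ('2012-02-10', 1, 1.1), ('2012-03-05', 1, 1.2), ('2013-01-20', 1, 1.3);
    """
)

# Month index = year * 12 + month. A difference of exactly 1 means b.Date
# falls in the calendar month right after a.Date, whatever the day is.
rows = conn.execute(
    """
    SELECT b.Date, b.Id, a.Var1, b.Var2
    FROM Table1 a
    JOIN Table2 b
      ON a.Id = b.Id
     AND (CAST(strftime('%Y', b.Date) AS INTEGER) * 12
          + CAST(strftime('%m', b.Date) AS INTEGER))
       - (CAST(strftime('%Y', a.Date) AS INTEGER) * 12
          + CAST(strftime('%m', a.Date) AS INTEGER)) = 1
    ORDER BY b.Date
    """
).fetchall()

for row in rows:
    print(row)
```

Each Table1 row pairs with the Table2 row from the following calendar month, including the December to January case.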

MySQL query predominant non-numeric value

I'm looking for a function to return the most predominant non numeric value from a table.
My database table records readings from a weather station. Many of these are numeric, but wind direction is recorded as one of 16 text values - N,NNE,NE,ENE,E... etc in a varchar field. Records are added every 15 minutes so 96 rows represent a day's weather.
I'm trying to compute the predominant wind direction for the day. Manually you would add together the number of Ns, NNEs, NEs etc and see which there are most of.
Has MySQL got a neat way of doing this?
Thanks

It's difficult to answer your question without seeing your schema, but this should help you.
Assuming the wind directions are stored in the same column as the numeric values you want to ignore, you can use REGEXP to ignore the numeric values, like this:
select generic_string, count(*)
from your_table
where day = '2014-01-01'
and generic_string not regexp '^[0-9]*$'
group by generic_string
order by count(*) desc
limit 1
If wind direction is the only thing stored in the column then it's a little simpler:
select wind_direction, count(*)
from your_table
where day = '2014-01-01'
group by wind_direction
order by count(*) desc
limit 1
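For what it's worth, here is a small sketch of the group / count / order / limit pattern run on a handful of invented readings, using Python's sqlite3 module rather than MySQL. The table and column names follow the placeholder query above; the data is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (day TEXT, wind_direction TEXT)")

# Hypothetical readings: N is the most frequent direction on 2014-01-01,
# while the rows for 2014-01-02 should be ignored by the WHERE clause.
readings = [
    ("2014-01-01", "N"), ("2014-01-01", "NNE"), ("2014-01-01", "N"),
    ("2014-01-01", "SW"), ("2014-01-01", "N"), ("2014-01-01", "SW"),
    ("2014-01-02", "E"), ("2014-01-02", "E"),
    ("2014-01-02", "E"), ("2014-01-02", "E"),
]
conn.executemany("INSERT INTO your_table VALUES (?, ?)", readings)

# Count rows per direction for one day and keep only the top one.
leader = conn.execute(
    """
    SELECT wind_direction, COUNT(*)
    FROM your_table
    WHERE day = '2014-01-01'
    GROUP BY wind_direction
    ORDER BY COUNT(*) DESC
    LIMIT 1
    """
).fetchone()

print(leader)
```

Note that if two directions tie for first place, LIMIT 1 returns just one of them, and which one is not guaranteed unless you add a tie-breaker to the ORDER BY.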
You can do this for multiple days using sub-queries. For example (assuming you don't have any data in the future) this query will give you the most common wind direction for each day from the start of the previous month onward:
select this_month.day,
(
select winddir
from weatherdatanum
where thedate >= this_month.day
and thedate < this_month.day + interval 1 day
group by winddir
order by count(*) desc
limit 1
) as daily_leader
from
(
select distinct date(thedate) as day
from weatherdatanum
where thedate >= concat(left(current_date(),7),'-01') - interval 1 month
) this_month

The following query should return you a list of wind directions along with counts sorted by most occurrences:
SELECT wind_dir, COUNT(wind_dir) AS count FROM `mytable` GROUP BY wind_dir ORDER BY COUNT(wind_dir) DESC
Hope that helps

mysql query: how to get the number of yes/no votes per day

I have to create a mysql query to get a voting distribution of each day exceeding a particular date, something like this...
date yes_votes no_votes
------------------------------------------
2010-01-07 21 22
2010-01-07 2 0
My table is like this..
post_votes
--------------------------
id(longint)
date(timestamp)
flag(tinyint) // this stores the yes/no votes 1-yes, 2-no
I am stuck at this....
SELECT COUNT(*) AS count, DATE(date) FROM post_votes WHERE date > '2010-07-01' GROUP BY DATE(date)
this gives the total number of votes per day, but not the distribution that I want.

SELECT COUNT(*) AS count
, DATE(date)
, SUM(flag = 1) AS yes_votes
, SUM(flag = 2) AS no_votes
FROM post_votes
WHERE date > '2010-07-01'
GROUP BY DATE(date)
This is a trick that works in MySQL, as flag=1 will either be True or False. But True = 1 and False = 0 in MySQL so you can add the 1s and 0s using the SUM() function.
Other solutions with IF or CASE would be better for clarity or if there is any chance you want to move the database to another RDBMS.
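A minimal sketch of the portable CASE variant, run with Python's sqlite3 module instead of MySQL. The sample votes are made up, and the result column vote_day is a hypothetical name chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE post_votes (id INTEGER PRIMARY KEY, date TEXT, flag INTEGER)"
)

# Hypothetical votes: flag 1 = yes, flag 2 = no, as in the question.
votes = [
    ("2010-07-02 09:00:00", 1),
    ("2010-07-02 10:30:00", 2),
    ("2010-07-02 11:00:00", 1),
    ("2010-07-03 08:15:00", 2),
]
conn.executemany("INSERT INTO post_votes (date, flag) VALUES (?, ?)", votes)

# Each CASE yields 1 for a matching vote and 0 otherwise, so SUM counts
# the matching rows per day without relying on booleans being integers.
rows = conn.execute(
    """
    SELECT DATE(date) AS vote_day,
           SUM(CASE WHEN flag = 1 THEN 1 ELSE 0 END) AS yes_votes,
           SUM(CASE WHEN flag = 2 THEN 1 ELSE 0 END) AS no_votes
    FROM post_votes
    WHERE date > '2010-07-01'
    GROUP BY DATE(date)
    ORDER BY vote_day
    """
).fetchall()

for row in rows:
    print(row)
```

A day with no yes votes still shows up with a 0 in that column, because the ELSE 0 branch keeps the SUM from being NULL.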
Comments not related to the question:
It's a bad habit to use reserved words like date or count for naming fields or tables.
It's also not good to use "date" when you actually store a timestamp. Names should reflect use.
For table names it's recommended to use singular (post_vote) and not plural - although many use plural, it gets confusing in the end. Plural is good for some fields or calculated fields, like your yes_votes and no_votes where we have a count.

Sum it:
select date(date) as date,
sum(case when flag = 1 then 1 else 0 end) as yes,
sum(case when flag = 2 then 1 else 0 end) as no
from post_votes
where date > '2010-07-01'
group by date(date)

you are almost at the solution :)
i would recommend the use of an IF condition in a SUM method like so:
SELECT SUM(IF(flag = 1,1,0)) AS yes_count,
SUM(IF(flag = 2,1,0)) AS no_count,
DATE(date)
FROM post_votes
WHERE date > '2010-07-01'
GROUP BY DATE(date)
this will add 1 to each sum only if the flag marks a yes (1) or a no (2) vote

SELECT DATE(date) as dt,
sum(if(flag=1,1,0)) as yes,
sum(if(flag=2,1,0)) as no
FROM post_votes WHERE date > '2010-07-01'
GROUP BY dt

I had that problem too. The best solution I can think of is to split the "flag" into two fields, like:
upvote(tinyint)
downvote(tinyint)
Then you are able to count them very easily and without mysql-voodoo:
SELECT
SUM(upvote) AS up,
SUM(downvote) AS down,
DATE(`date`) AS Created_at
FROM post_votes
WHERE `date` > '2010-07-01'
GROUP BY Created_at
Btw.: You should not name a column date, because it's a MySQL-Keyword.
