I have a table t1 with 5 columns and 80,000 rows:
+----+----------+--------+----------+-------------+
| id | category | groupe | subject  | description |
+----+----------+--------+----------+-------------+
| 1  | categ1   | group1 | subject1 | desc1       |
| 2  | categ1   | group2 | subject2 | desc2       |
| 3  | categ1   | group2 | subject5 | desc3       |
| 4  | categ2   | group1 | subject5 | desc4       |
| 5  | categ2   | group3 | subject1 | desc5       |
| 6  | categ2   | group3 | subject2 | desc6       |
| 7  | categ3   | group1 | subject1 | desc7       |
| 8  | categ3   | group1 | subject4 | desc8       |
+----+----------+--------+----------+-------------+
I need to extract the rows whose category value occurs at least 30 times AND whose groupe value occurs at least 30 times AND whose subject value occurs at least 30 times.
This means that if "categ3" appears at least 30 times, I need the rows with categ3,
and likewise for groupe and subject.
But when I use the query below, the final result can contain fewer than 30 categ3 rows, because the groupe or subject filter removes some of the ids that have categ3.
You can see an example on db<>fiddle; with a threshold of 10 occurrences, the correct query has to return 118 rows.
select
*
from
t1
where
category in (
SELECT
category
FROM
t1
GROUP BY
category
HAVING
COUNT(category) >= 30
)
and
groupe in (
SELECT
groupe
FROM
t1
GROUP BY
groupe
HAVING
COUNT(groupe) >= 30
)
and
subject in (
SELECT
subject
FROM
t1
GROUP BY
subject
HAVING
COUNT(subject) >= 30
)
This query returns the intersection of the ids whose category, groupe and subject values each occur at least 30 times in the table, but taking the intersection reduces the row count...
This means the count of certain category values in the result can drop below 30.
To sum up, I need at least 30 occurrences within the intersection result itself.
I think I need a recursive filter that repeats until the number of input rows equals the number of output rows, but I don't know how to do that. Any ideas?
Thanks 😊
Add some DISTINCTs while grouping on the 3 columns.
select *
from dataset t
where t.category in (SELECT distinct category FROM dataset GROUP BY category, groupe, subject HAVING COUNT(*) >= 30)
and t.groupe in (SELECT distinct groupe FROM dataset GROUP BY category, groupe, subject HAVING COUNT(*) >= 30)
and t.subject in (SELECT distinct subject FROM dataset GROUP BY category, groupe, subject HAVING COUNT(*) >= 30)
A test on db<>fiddle here
For reference, this query will only select the rows whose (category, groupe, subject) tuple occurs 30 times or more.
It will naturally return fewer rows than the query above.
SELECT *
FROM dataset
WHERE (category, groupe, subject) IN (
SELECT category, groupe, subject
FROM dataset
GROUP BY category, groupe, subject
HAVING COUNT(*) >= 30
)
Pro tip: This is a case where describing your requirement takes a lot of thought. As you think about it, think of SQL as a processor of sets of rows. It is always worthwhile to describe the requirement as carefully as you can, especially when it is as tricky as this one. Often it's helpful to describe the problem domain, rather than just talking about columns and values.
I guess you need the sets of rows meeting your three different criteria (more than x duplicates). You can use a set of id values for those rows because they are apparently a primary key (unique).
Here's one set of IDs
SELECT id FROM dataset WHERE category IN (
SELECT category FROM dataset GROUP BY category HAVING COUNT(*) >= 5)
I believe you need all the rows lying in the intersection of those three sets. That is, you want any rows having all three items recurring frequently. You can get that with
id IN set1 AND id IN set2 AND id IN set3
If you need the union of those sets you can use this instead. This gives you the rows with any of the three items recurring frequently.
id IN set1 OR id IN set2 OR id IN set3
So here's the query.
SELECT *
FROM dataset
WHERE id IN (
SELECT id FROM dataset WHERE category IN (
SELECT category FROM dataset GROUP BY category HAVING COUNT(*) >= 5))
AND id IN (
SELECT id FROM dataset WHERE groupe IN (
SELECT groupe FROM dataset GROUP BY groupe HAVING COUNT(*) >= 5))
AND id IN (
SELECT id FROM dataset WHERE subject IN (
SELECT subject FROM dataset GROUP BY subject HAVING COUNT(*) >= 5))
I used 5 for the repeat threshold. You can use another number.
If you want your result set to contain only those rows with at least ten items in the result set, rather than in the dataset, you would use this query.
select d.*
from dataset d
join (
select count(*), groupe, category, subject
from dataset
group by groupe, category, subject
having count(*) >= 10
) e ON d.groupe=e.groupe AND d.category = e.category AND d.subject = e.subject
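The "repeat until the input rows equal the output rows" idea from the question can be sketched as a loop that keeps deleting under-represented rows from a working copy. This is only a minimal sketch, not a tested MySQL solution: it uses Python's sqlite3 module with SQLite standing in for MySQL, the working table `kept` and the function name are made up for illustration, and the sample rows are invented. In MySQL a DELETE cannot normally reference its own target table in a subquery, so there the subqueries would have to be wrapped in derived tables or materialized in a temporary table.

```python
import sqlite3


def filter_until_stable(conn, threshold):
    """Return the ids of t1 that survive repeated filtering.

    Each pass deletes every row whose category, groupe or subject occurs
    fewer than `threshold` times among the rows still kept. The loop stops
    when a pass removes nothing (input row count equals output row count).
    """
    cur = conn.cursor()
    # Work on a copy named `kept` (hypothetical name) so t1 stays untouched.
    cur.execute("DROP TABLE IF EXISTS kept")
    cur.execute("CREATE TABLE kept AS SELECT * FROM t1")
    while True:
        before = cur.execute("SELECT COUNT(*) FROM kept").fetchone()[0]
        cur.execute(
            """
            DELETE FROM kept
            WHERE category IN (SELECT category FROM kept
                               GROUP BY category HAVING COUNT(*) < :n)
               OR groupe IN (SELECT groupe FROM kept
                             GROUP BY groupe HAVING COUNT(*) < :n)
               OR subject IN (SELECT subject FROM kept
                              GROUP BY subject HAVING COUNT(*) < :n)
            """,
            {"n": threshold},
        )
        after = cur.execute("SELECT COUNT(*) FROM kept").fetchone()[0]
        if after == before:
            break
    return [row[0] for row in cur.execute("SELECT id FROM kept ORDER BY id")]


def demo():
    # Invented sample data, threshold 2 instead of 30 to keep it small.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE t1 (id INTEGER PRIMARY KEY, category TEXT, "
        "groupe TEXT, subject TEXT, description TEXT)"
    )
    conn.executemany(
        "INSERT INTO t1 VALUES (?, ?, ?, ?, ?)",
        [
            (1, "A", "g1", "s1", "desc1"),
            (2, "A", "g1", "s1", "desc2"),
            (3, "B", "g1", "s1", "desc3"),
            (4, "B", "g2", "s1", "desc4"),  # g2 occurs only once
        ],
    )
    return filter_until_stable(conn, 2)


print(demo())  # [1, 2]
```

In this sample a single-pass filter would keep row 3, because category B occurs twice in the full table. Once row 4 is removed for its rare groupe, B occurs only once among the remaining rows, so the second pass removes row 3 as well.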
Related
I have two MySQL tables A and B both with this schema
ID | entity_id | asset | asset_type
0  | 12345     | x     | 1
.. | ......... | ..... | ..........
I would like to get an aggregated top 10/50/whatever entity_ids with the largest row count difference between the two tables. I think I could do this manually by just getting the highest row count by entity_id like so
select count(*), entity_id
from A
group by entity_id
order by count(*) desc;
and then manually comparing with the same query for table B, but I'm wondering if there's a way to do this in just one query that compares the row counts for each distinct entity_id and aggregates the differences. A few notes:
There is an index on entity_id for both tables
Table B will always have an equivalent or greater number of rows for each entity_id
Sample output
entity_id | difference
12345     | 100
3232      | 75
5992      | 40
and so on for top 10/50
Aggregate in each table and join the results to get the difference:
SELECT a.entity_id, b.counter - a.counter diff
FROM (SELECT entity_id, COUNT(*) counter FROM A GROUP BY entity_id) a
INNER JOIN (SELECT entity_id, COUNT(*) counter FROM B GROUP BY entity_id) b
ON a.entity_id = b.entity_id
ORDER BY diff DESC LIMIT 10
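One thing to keep in mind with the INNER JOIN above: an entity_id that has rows in B but none in A drops out of the result, even though its difference may be among the largest. A possible variant, sketched here with Python's sqlite3 module (SQLite standing in for MySQL, invented sample rows, hypothetical function name), starts from B and uses LEFT JOIN with COALESCE so a missing count in A is treated as 0.

```python
import sqlite3


def top_count_differences(conn, limit):
    """Return (entity_id, difference) pairs, largest difference first."""
    sql = """
        SELECT cb.entity_id,
               cb.counter - COALESCE(ca.counter, 0) AS difference
        FROM (SELECT entity_id, COUNT(*) AS counter
              FROM B GROUP BY entity_id) AS cb
        LEFT JOIN (SELECT entity_id, COUNT(*) AS counter
                   FROM A GROUP BY entity_id) AS ca
               ON ca.entity_id = cb.entity_id
        ORDER BY difference DESC, cb.entity_id
        LIMIT ?
    """
    return conn.execute(sql, (limit,)).fetchall()


def demo_counts():
    # Invented data: entity 3 exists only in table B.
    conn = sqlite3.connect(":memory:")
    for name in ("A", "B"):
        conn.execute(f"CREATE TABLE {name} (ID INTEGER PRIMARY KEY, entity_id INTEGER)")
    conn.executemany("INSERT INTO A (entity_id) VALUES (?)", [(1,), (1,), (2,)])
    conn.executemany(
        "INSERT INTO B (entity_id) VALUES (?)",
        [(1,)] * 5 + [(2,)] + [(3,)] * 4,
    )
    return top_count_differences(conn, 2)


print(demo_counts())  # [(3, 4), (1, 3)]
```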
I have a MySQL table with the entries of my driver's logbook. The table has two columns: start_place and end_place. Sometimes end_place is equal to start_place (I think that sounds logical).
Now I want to select the entries of the table that occur as the tuple (x,y), but not as (y,x).
Example:
id | start_place | end_place
-----------------------------------
0 | New York | San Francisco
-----------------------------------
1 | San Francisco | New York
The row with the id 1 is a duplicate of id 0 in reversed order and should not be part of the result.
Does anyone have an idea? I tried several times with subselects or WHERE conditions like (x,y) != (y,x), but that doesn't work.
This can be done with least and greatest functions with a group by.
select least(start_place,end_place), greatest(start_place,end_place)
from tbl
group by least(start_place,end_place), greatest(start_place,end_place)
having count(*) = 1
To retrieve such rows with other columns, use
select *
from tbl
where (least(start_place,end_place), greatest(start_place,end_place))
in (select least(start_place,end_place), greatest(start_place,end_place)
from tbl
group by least(start_place,end_place), greatest(start_place,end_place)
having count(*) = 1
)
Use LEAST, GREATEST and DISTINCT to get distinct pairs:
select distinct
least(start_place, end_place) as place1,
greatest(start_place, end_place) as place2
from mytable;
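The two answers above return different things for a round trip, which a small run makes visible. This is a sketch using Python's sqlite3 module rather than MySQL: SQLite has no LEAST/GREATEST, so its two-argument scalar MIN() and MAX() stand in for them, and the Boston/Chicago row and the function name are invented for illustration.

```python
import sqlite3


def demo_pairs():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tbl (id INTEGER PRIMARY KEY, start_place TEXT, end_place TEXT)"
    )
    conn.executemany(
        "INSERT INTO tbl VALUES (?, ?, ?)",
        [
            (0, "New York", "San Francisco"),
            (1, "San Francisco", "New York"),  # reverse of id 0
            (2, "Boston", "Chicago"),          # invented one-way trip
        ],
    )
    # HAVING COUNT(*) = 1: keeps only pairs that occur in one direction,
    # so both rows of a round trip disappear.
    one_way_only = conn.execute(
        """
        SELECT MIN(start_place, end_place), MAX(start_place, end_place)
        FROM tbl
        GROUP BY MIN(start_place, end_place), MAX(start_place, end_place)
        HAVING COUNT(*) = 1
        ORDER BY 1, 2
        """
    ).fetchall()
    # DISTINCT: keeps one representative per unordered pair.
    distinct_pairs = conn.execute(
        """
        SELECT DISTINCT MIN(start_place, end_place) AS place1,
                        MAX(start_place, end_place) AS place2
        FROM tbl
        ORDER BY place1, place2
        """
    ).fetchall()
    return one_way_only, distinct_pairs


one_way_only, distinct_pairs = demo_pairs()
print(one_way_only)    # [('Boston', 'Chicago')]
print(distinct_pairs)  # [('Boston', 'Chicago'), ('New York', 'San Francisco')]
```

Which of the two fits depends on whether a reversed duplicate should remove the pair entirely or merely collapse it to a single row.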
The below table contains an id and a Year and Groups
GroupingTable
id | Year | Groups
1 | 2000 | A
2 | 2001 | B
3 | 2001 | A
Now I want to select the row with the greatest Year for each value of the Groups column.
SELECT
id,
Year,
Groups
FROM
GroupingTable
GROUP BY
`Groups`
ORDER BY Year DESC
Below is what I am expecting, although the query above doesn't work as expected:
id | Year | Groups
2 | 2001 | B
3 | 2001 | A
You need to learn how to use aggregate functions.
SELECT
MAX(Year) AS Year,
`Groups`
FROM
GroupingTable
GROUP BY
`Groups`
ORDER BY Year DESC
When using GROUP BY, only the column(s) you group by are unambiguous, because they have the same value on every row of the group.
Other columns return a value arbitrarily from one of the rows in the group. Actually, this is the behavior of MySQL (only when the ONLY_FULL_GROUP_BY SQL mode is disabled; it has been enabled by default since MySQL 5.7.5) and of SQLite. Because of the ambiguity, it's an illegal query in standard SQL and in most other SQL implementations.
For more on this, see my answer to Reason for Column is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause
Your query misuses the heinously confusing nonstandard extension to GROUP BY that's built in to MySQL. Read this and weep. https://dev.mysql.com/doc/refman/5.7/en/group-by-handling.html
If all you want is the year it's a snap.
SELECT MAX(Year) Year, `Groups`
FROM GroupingTable
GROUP BY `Groups`
If you want the id of the row in question, you have to do a bunch of monkey business to retrieve the column id from the above query.
SELECT a.*
FROM GroupingTable a
JOIN (
SELECT MAX(Year) Year, `Groups`
FROM GroupingTable
GROUP BY `Groups`
) b ON a.`Groups` = b.`Groups` AND a.Year = b.Year
You have to do this because the GROUP BY query yields a summary result set, and you have to join that back to the detail result set to retrieve the ID.
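To check the join-back approach against the sample rows from the question, here is a small sketch using Python's sqlite3 module (SQLite standing in for MySQL; it accepts MySQL-style backticks, which are used here because GROUPS is a reserved word in MySQL 8.0). The subquery column alias `max_year` and the function name are illustrative only.

```python
import sqlite3


def latest_per_group():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE GroupingTable (id INTEGER PRIMARY KEY, Year INTEGER, `Groups` TEXT)"
    )
    conn.executemany(
        "INSERT INTO GroupingTable VALUES (?, ?, ?)",
        [(1, 2000, "A"), (2, 2001, "B"), (3, 2001, "A")],
    )
    # Summarize first (max Year per group), then join back to the detail rows
    # to recover the id that belongs to each maximum.
    return conn.execute(
        """
        SELECT a.id, a.Year, a.`Groups`
        FROM GroupingTable AS a
        JOIN (
            SELECT MAX(Year) AS max_year, `Groups`
            FROM GroupingTable
            GROUP BY `Groups`
        ) AS b ON a.`Groups` = b.`Groups` AND a.Year = b.max_year
        ORDER BY a.id
        """
    ).fetchall()


print(latest_per_group())  # [(2, 2001, 'B'), (3, 2001, 'A')]
```

If two rows in the same group share the maximum Year, the join returns both of them.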
I have a table of records (let's call them TV shows) with an air_date field.
I have another table of advertisements that are related by a show_id field.
I am trying to get the average number of advertisements per show for each date (with a where clause specifying the shows).
I currently have this:
SELECT
`air_date`,
(SELECT COUNT(*) FROM `commercial` WHERE `show_id` = `show`.`id`) AS `num_commercials`
FROM `show`
WHERE ...
This gives me a result like so:
air_date | num_commercials
2015-6-30 | 6
2015-6-30 | 3
2015-6-30 | 8
2015-6-30 | 2
2015-6-31 | 9
2015-6-31 | 4
When I add a GROUP BY, it only gives me one of the records, but I want the average for each air_date.
I'm not sure I fully understand what you want, but does this do it?
SELECT `air_date`,
AVG((SELECT COUNT(*) FROM `commercial` WHERE `show_id` = `show`.`id`)) AS `num_commercials`
FROM `show`
WHERE .....
GROUP BY `air_date`
(Note: the double parentheses in the AVG call are required, because the inner pair encloses the scalar subquery.)
You can use a sub-query to select count of commercials by air_date/show, then use an outer query to select the average commercials count per air_date.
Something like this should work:
select air_date, avg(num_commercials)
from
(
select `show`.air_date as air_date,
`show`.id as show_id,
count(*) as num_commercials
from `show`
inner join commercial on commercial.show_id = `show`.id
where ...
group by `show`.air_date, `show`.id
) sub
group by air_date
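A caveat with the inner join above: a show that has no commercials produces no joined row, so it is left out of the average instead of counting as 0. If such shows should count, a LEFT JOIN with COUNT(commercial.id) is one option. The sketch below uses Python's sqlite3 module (SQLite standing in for MySQL) with invented rows, assumes `commercial` has an `id` column, and leaves out the question's elided WHERE clause.

```python
import sqlite3


def average_commercials_per_date():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE `show` (id INTEGER PRIMARY KEY, air_date TEXT)")
    conn.execute("CREATE TABLE commercial (id INTEGER PRIMARY KEY, show_id INTEGER)")
    conn.executemany(
        "INSERT INTO `show` VALUES (?, ?)",
        [(1, "2015-06-29"), (2, "2015-06-29"), (3, "2015-06-30"), (4, "2015-06-30")],
    )
    # Show 1 has 6 commercials, show 2 has 2, show 3 has none, show 4 has 4.
    conn.executemany(
        "INSERT INTO commercial (show_id) VALUES (?)",
        [(1,)] * 6 + [(2,)] * 2 + [(4,)] * 4,
    )
    return conn.execute(
        """
        SELECT air_date, AVG(num_commercials) AS avg_commercials
        FROM (
            -- COUNT(c.id) ignores the NULL produced by the LEFT JOIN,
            -- so a show without commercials gets 0 rather than 1.
            SELECT s.air_date AS air_date, s.id AS show_id,
                   COUNT(c.id) AS num_commercials
            FROM `show` AS s
            LEFT JOIN commercial AS c ON c.show_id = s.id
            GROUP BY s.air_date, s.id
        ) AS sub
        GROUP BY air_date
        ORDER BY air_date
        """
    ).fetchall()


print(average_commercials_per_date())
# [('2015-06-29', 4.0), ('2015-06-30', 2.0)]
```

With an INNER JOIN and COUNT(*), the second date would average 4.0 here, because show 3 would be ignored.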
In my table, I have the following columns:
CRMID | user | ticket_id | description | date | hour
What I am trying to do is select all the rows from the table, but when two (or more) rows have the same ticket_id, I want only the newest one to appear in the results, that is, the row with the newest date and hour.
The problem is that I need to handle two cases: if the values in the date column are the same, I have to compare the hour column; otherwise it's simple, because I only compare the date column.
SELECT
n.*
FROM
table n RIGHT JOIN (
SELECT
MAX(date) AS max_date,
(SELECT MAX(hour) AS hour WHERE date = max_date) AS hour,
user,
ticket_id
FROM
table
GROUP BY
user,
ticket_id
) m ON n.user = m.user AND n.ticket_id = m.ticket_id
You may want to combine your date and hour columns, then perform the comparison:
SELECT foo.*
FROM foo
JOIN (SELECT ticket_id, MAX(ADDTIME(`date`,`hour`)) as mostrecent
FROM foo
GROUP BY ticket_id) AS bar
ON bar.ticket_id = foo.ticket_id
and bar.mostrecent = ADDTIME(foo.`date`,foo.`hour`);
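The same "combine, take the maximum per ticket, join back" pattern can be tried out on a few rows. This sketch uses Python's sqlite3 module, where SQLite stands in for MySQL: there is no ADDTIME in SQLite, so the code concatenates an ISO date string and a time string instead, which sorts correctly as text. It assumes `date` is stored as YYYY-MM-DD and `hour` as HH:MM:SS, and the sample rows and function name are invented.

```python
import sqlite3


def newest_row_per_ticket():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE foo (CRMID INTEGER PRIMARY KEY, user TEXT, ticket_id INTEGER, "
        "description TEXT, date TEXT, hour TEXT)"
    )
    conn.executemany(
        "INSERT INTO foo VALUES (?, ?, ?, ?, ?, ?)",
        [
            (1, "ann", 10, "first", "2020-01-01", "09:00:00"),
            (2, "ann", 10, "same day, later hour", "2020-01-01", "15:00:00"),
            (3, "bob", 11, "later day", "2020-01-02", "08:00:00"),
            (4, "bob", 11, "earlier day, late hour", "2020-01-01", "23:00:00"),
        ],
    )
    rows = conn.execute(
        """
        SELECT foo.CRMID
        FROM foo
        JOIN (SELECT ticket_id, MAX(date || ' ' || hour) AS mostrecent
              FROM foo
              GROUP BY ticket_id) AS bar
          ON bar.ticket_id = foo.ticket_id
         AND bar.mostrecent = foo.date || ' ' || foo.hour
        ORDER BY foo.CRMID
        """
    ).fetchall()
    return [row[0] for row in rows]


print(newest_row_per_ticket())  # [2, 3]
```

Combining the two columns into one sortable value removes the need for separate "same date" and "different date" cases.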