GROUP BY does not remove duplicates - mysql

I have a watchlist system that I've coded, in the overview of the users' watchlist, they would see a list of records, however the list shows duplicates when in the database it only shows the exact, correct number.
I've tried GROUP BY watch.watch_id, GROUP BY rec.record_id, none of any types of group I've tried seems to remove duplicates. I'm not sure what I'm doing wrong.
SELECT watch.watch_date,
rec.street_number,
rec.street_name,
rec.city,
rec.state,
rec.country,
usr.username
FROM
(
watchlist watch
LEFT OUTER JOIN records rec ON rec.record_id = watch.record_id
LEFT OUTER JOIN members usr ON rec.user_id = usr.user_id
)
WHERE watch.user_id = 1
GROUP BY watch.watch_id
LIMIT 0, 25
The watchlist table looks like this:
+----------+---------+-----------+------------+
| watch_id | user_id | record_id | watch_date |
+----------+---------+-----------+------------+
| 13 | 1 | 22 | 1314038274 |
| 14 | 1 | 25 | 1314038995 |
+----------+---------+-----------+------------+

GROUP BY does not "remove duplicates". GROUP BY allows for aggregation. If all you want is to combine duplicated rows, use SELECT DISTINCT.
If you need to combine rows that are duplicate in some columns, use GROUP BY but you need to to specify what to do with the other columns. You can either omit them (by not listing them in the SELECT clause) or aggregate them (using functions like SUM, MIN, and AVG). For example:
SELECT watch.watch_id, COUNT(rec.street_number), MAX(watch.watch_date)
... GROUP by watch.watch_id
EDIT
The OP asked for some clarification.
Consider the "view" -- all the data put together by the FROMs and JOINs and the WHEREs -- call that V. There are two things you might want to do.
First, you might have completely duplicate rows that you wish to combine:
a b c
- - -
1 2 3
1 2 3
3 4 5
Then simply use DISTINCT
SELECT DISTINCT * FROM V;
a b c
- - -
1 2 3
3 4 5
Or, you might have partially duplicate rows that you wish to combine:
a b c
- - -
1 2 3
1 2 6
3 4 5
Those first two rows are "the same" in some sense, but clearly different in another sense (in particular, they would not be combined by SELECT DISTINCT). You have to decide how to combine them. You could discard column c as unimportant:
SELECT DISTINCT a,b FROM V;
a b
- -
1 2
3 4
Or you could perform some kind of aggregation on them. You could add them up:
SELECT a,b, SUM(c) "tot" FROM V GROUP BY a,b;
a b tot
- - ---
1 2 9
3 4 5
You could add pick the smallest value:
SELECT a,b, MIN(c) "first" FROM V GROUP BY a,b;
a b first
- - -----
1 2 3
3 4 5
Or you could take the mean (AVG), the standard deviation (STD), and any of a bunch of other functions that take a bunch of values for c and combine them into one.
What isn't really an option is just doing nothing. If you just list the ungrouped columns, the DBMS will either throw an error (Oracle does that -- the right choice, imo) or pick one value more or less at random (MySQL). But as Dr. Peart said, "When you choose not to decide, you still have made a choice."

While SELECT DISTINCT may indeed work in your case, it's important to note why what you have is not working.
You're selecting fields that are outside of the GROUP BY. Although MySQL allows this, the exact rows it returns for the non-GROUP BY fields is undefined.
If you wanted to do this with a GROUP BY try something more like the following:
SELECT watch.watch_date,
rec.street_number,
rec.street_name,
rec.city,
rec.state,
rec.country,
usr.username
FROM
(
watchlist watch
LEFT OUTER JOIN est8_records rec ON rec.record_id = watch.record_id
LEFT OUTER JOIN est8_members usr ON rec.user_id = usr.user_id
)
WHERE watch.watch_id IN (
SELECT watch_id FROM watch WHERE user_id = 1
GROUP BY watch.watch_id)
LIMIT 0, 25

I Would never recommend using SELECT DISTINCT, it's really slow on big datasets.
Try using things like EXISTS.

You are grouping by watch.watch_id and you have two results, which have different watch IDs, so naturally they would not be grouped.
Also, from the results displayed they have different records. That looks like a perfectly valid expected results. If you are trying to only select distinct values, then you don't want ot GROUP, but you want to select by distinct values.
SELECT DISTINCT()...

If you say your watchlist table is unique, then one (or both) of the other tables either (a) has duplicates, or (b) is not unique by the key you are using.
To suppress duplicates in your results, either use DISTINCT as #Laykes says, or try
GROUP BY watch.watch_date,
rec.street_number,
rec.street_name,
rec.city,
rec.state,
rec.country,
usr.username
It sort of sounds like you expect all 3 tables to be unique by their keys, though. If that is the case, you are simply masking some other problem with your SQL by trying to retrieve distinct values.

Related

SQL Validate a column with the same column

I have the following situation. I have a table with all info of article. I will like to compare the same column with it self. because I have multiple type of article. Single product and Master product. the only way that I have to differences it, is by SKU. for example.
ID | SKU
1 | 11111
2 | 11112
3 | 11113
4 | 11113-5
5 | 11113-8
6 | 11114
7 | 11115
8 | 11115-1-W
9 | 11115-2
10 | 11116
I only want to list or / and count only the sku that are full unique. follow th example the sku that are unique and no have variant are (ID = 1, 2, 6 and 10) I will want to create a query where if 11113 are again on the column not cout it. so in total I will be 4 unique sku and not "6 (on total)". Please let me know. if this are possible.
Assuming the length of master SKUs are 5 characters, try this:
select a.*
from mytable a
left join mytable b on b.sku like concat(a.sku, '%')
where length(a.sku) = 5
and b.sku is null
This query joins master SKUs to child ones, but filters out successful joins - leaving only solitary master SKUs.
You can do this by grouping and counting the unique rows.
First, we will need to take your table and add a new column, MasterSKU. This will be the first five characters of the SKU column. Once we have the MasterSKU, we can then GROUP BY it. This will bundle together all of the rows having the same MasterSKU. Once we are grouping we get access to aggregate functions like COUNT(). We will use that function to count the number of rows for each MasterSKU. Then, we will filter out any rows that have a COUNT() over 1. That will leave you with only the unique rows remaining.
Take that unique list and LEFT JOIN it back into your original table to grab the IDs.
SELECT ID, A.MasterSKU
FROM (
SELECT
MasterSKU = SUBSTRING(SKU,1,5),
MasterSKUCount = COUNT(*)
FROM MyTable
GROUP BY SUBSTRING(SKU,1,5)
HAVING COUNT(*) = 1
) AS A
LEFT JOIN (
SELECT
ID,
MasterSKU = SUBSTRING(SKU,1,5)
FROM MyTable
) AS B
ON A.MasterSKU = B.MasterSKU
Now one thing I noticed from you example. The original SKU column really looks like three columns in one. We have multiple values being joined with hypens.
11115-1-W
There may be a reason for it, but most likely this violates first normal form and will make the database hard to query. It's part of the reason why such a complicated query is needed. If the SKU column really represents multiple things then we may want to consider breaking it out into MasterSKU, Version, and Color or whatever each hyphen represents.

ORDER BY and GROUP BY those results in a single query

I am trying to query a dataset from a single table, which contains quiz answers/entries from multiple users. I want to pull out the highest scoring entry from each individual user.
My data looks like the following:
ID TP_ID quiz_id name num_questions correct incorrect percent created_at
1 10154312970149546 1 Joe 3 2 1 67 2015-09-20 22:47:10
2 10154312970149546 1 Joe 3 3 0 100 2015-09-21 20:15:20
3 125564674465289 1 Test User 3 1 2 33 2015-09-23 08:07:18
4 10153627558393996 1 Bob 3 3 0 100 2015-09-23 11:27:02
My query looks like the following:
SELECT * FROM `entries`
WHERE `TP_ID` IN('10153627558393996', '10154312970149546')
GROUP BY `TP_ID`
ORDER BY `correct` DESC
In my mind, what that should do is get the two users from the IN clause, order them by the number of correct answers and then group them together, so I should be left with the 2 highest scores from those two users.
In reality it's giving me two results, but the one from Joe gives me the lower of the two values (2), with Bob first with a score of 3. Swapping to ASC ordering keeps the scores the same but places Joe first.
So, how could I achieve what I need?
You're after the groupwise maximum, which can be obtained by joining the grouped results back to the table:
SELECT * FROM entries NATURAL JOIN (
SELECT TP_ID, MAX(correct) correct
FROM entries
WHERE TP_ID IN ('10153627558393996', '10154312970149546')
GROUP BY TP_ID
) t
Of course, if a user has multiple records with the maximal score, it will return all of them; should you only want some subset, you'll need to express the logic for determining which.
MySql is quite lax when it comes to group-by-clauses - but as a rule of thumb you should try to follow the rule that other DBMSs enforce:
In a group-by-query each column should either be part of the group-by-clause or contain a column-function.
For your query I would suggest:
SELECT `TP_ID`,`name`,max(`correct`) FROM `entries`
WHERE `TP_ID` IN('10153627558393996', '10154312970149546')
GROUP BY `TP_ID`,`name`
Since your table seems quite denormalized the group by name-par could be omitted, but it might be necessary in other cases.
ORDER BY is only used to specify in which order the results are returned but does nothing about what results are returned - so you need to apply the max()-function to get the highest number of right answers.

Mixing HAVING with CASE OR Analytic functions in MySQL (PartitionQualify(?

I have a SELECT query that returns some fields like this:
Date | Campaign_Name | Type | Count_People
Oct | Cats | 1 | 500
Oct | Cats | 2 | 50
Oct | Dogs | 1 | 80
Oct | Dogs | 2 | 50
The query uses aggregation and I only want to include results where when Type = 1 then ensure that the corresponding Count_People is greater than 99.
Using the example table, I'd like to have two rows returned: Cats. Where Dogs is type 1 it's excluded because it's below 100, in this case where Dogs = 2 should be excluded also.
Put another way, if type = 1 is less than 100 then remove all records of the corresponding campaign name.
I started out trying this:
HAVING CASE WHEN type = 1 THEN COUNT(DISTINCT Count_People) > 99 END
I used Teradata earlier int he year and remember working on a query that used an analytic function "Qualify PartitionBy". I suspect something along those lines is what I need? I need to base the exclusion on aggregation before the query is run?
How would I do this in MySQL? Am I making sense?
Now that I understand the question, I think your best bet will be a subquery to determine which date/campaign combinations of a type=1 have a count_people greater than 99.
SELECT
<table>.date,
<table>.campaign_name,
<table>.type,
count(distinct count_people) as count_people
FROM
(
SELECT
date,
campaign_name
FROM
<table>
WHERE type=1
HAVING count(distinct count_people) > 99
GROUP BY 1,2
) type1
LEFT OUTER JOIN <table> ON
type1.campaign_name = <table>.campaign_name AND
type1.date = <table>.date
WHERE <table>.type IN (1,2)
GROUP BY 1,2,3
The subquery here only returns campaign/date combinations when both the type=1 AND it has greater than 99 count_people. It uses a LEFT JOIN back to the to insure that only those campaign/date combinations make it into the result set.
The WHERE on the main query keeps the results to only Types 1 and 2, which you stated was already a filter in place (though not mentioned in the question, it was stated in a comment to a previous answer).
Based on your comments to answer by #JNevill I think you will have no option but to use subselects to pre-filter the record set you are dealing with, as working with HAVING is going to limit you only to the current record being evaluated - there is no way to compare against previous or subsequent records in the set in this manner.
So have a look at something like this:
SELECT
full_data.date AS date,
full_data.campaign_name AS campaign_name,
full_data.type AS type,
COUNT(full_data.people) AS people_count
FROM
(
SELECT
date,
campaign_name,
type,
COUNT(people) AS people_count
FROM table
WHERE type IN (1,2)
GROUP BY date, campaign_name, type
) AS full_data
LEFT JOIN
(
SELECT
date,
campaign_name,
COUNT(people) AS people_count
FROM table
WHERE type = 1
GROUP BY date, campaign_name
HAVING people_count < 100
) AS filter
ON
full_data.date = filter.date
AND full_data.campaign_name = filter.campaign_name
WHERE
filter.date IS NULL
AND filter.campaign_name IS NULL
The first subselect is basically your current query without any attempt at using HAVING to filter out results. The second subselect is used to find all date/campaign name combos which have people_count > 100 and use those as a filter for against the full data set.

Is it possible to write a query to compare rows to other rows in same table?

I have a table with the following structure. I need to return all rows where the district of the record immediately preceding and immediately following the row are different than the district for that row. Is this possible? I was thinking of a join on the table itself but not sure how to do it.
id | zip_code | district
__________________________
20063 10169 12
20064 10169 9
20065 10169 12
Assuming that "preceding" and "following" are in the sense of the ID column, you can do:
select *
from zip_codes z1
inner join zip_codes z2 on z1.id=z2.id + 1
inner join zip_codes z3 on z1.id=z3.id - 1
where z1.district <> z2.district and z1.district <> z3.district
This will automatically filter out the first and last rows, because of the inner joins, if you need those to count, change it to left outer join.
Also, this checks if it's different from both. To find if it's different from either (as is implied in the comment), change the and in the where clause to an or. But note, that then, all three rows in your example fit that criteria, even if there are long rows of twelves above and below these rows.

SQL GROUP BY - Multiple results in one column?

I am trying to perform a SELECT query using a GROUP BY clause, however I also need to access data from multiple rows and somehow concatenate it into a single column.
Here's what I have so far:
SELECT
COUNT(v.id) AS quantity,
vt.name AS name,
vt.cost AS cost,
vt.postage_cost AS postage_cost
FROM vouchers v
INNER JOIN voucher_types vt
ON v.type_id = vt.id
WHERE
v.order_id = 1 AND
v.sold = 1
GROUP BY vt.id
Which gives me the first four columns I need in the following format.
quantity | name | cost | postage_cost
2 X 5 1
2 Y 6 1
However, I would also like a fifth column to be displayed, showing all of the codes associated with each line of the order like this:
code
ABCD, EFGH
IJKL, MNOP
Where the comma separated values are pulled from the voucher table.
Is this possible?
Any advice would be appreciated.
Thanks
This is what GROUP_CONCAT does.
Assuming the column is called code you would just add ,GROUP_CONCAT(v.code) As Codes to your select list.