I saw
WITH tblTemp as
(
SELECT ROW_NUMBER() Over(PARTITION BY Name,Department ORDER BY Name) As RowNumber, *
FROM <table_name>
)
DELETE FROM tblTemp where RowNumber > 1
This query for deleting duplicate rows, but I can't understand the Query. Could you please explain clearly?
The most instructive thing for you to do would be to select from the CTE:
SELECT *
FROM tblTemp
ORDER BY Department, Name;
Off the top of my head, you might see a result set looking like this:
Name | Department | RowNumber
Jon Skeet | Software | 1
Gordon Linoff | Database | 1
Gordon Linoff | Database | 2
The (Gordon Linoff, Database) record appears in duplicate, so there is a row number value in the second record which is greater than 1. Your delete logic would remove this duplicate, but would not affect the (Jon Skeet, Software) record, which has no duplicate.
The common table expression is locating duplicate rows, then the delete query that follows removes all but one row if there are duplicates.
SELECT
ROW_NUMBER() Over(PARTITION BY Name,Department ORDER BY Name) As RowNumber
,*
FROM <table_name>
In the over() clause the partition and order by defines where numbering starts at 1, then within each partition each extra row gets a larger row number. Hence when [RowNumber] = 1 you have a unique set of rows for name and department.
see: Over() and row_number()
Demonstration
CREATE TABLE mytable(
name VARCHAR(6) NOT NULL
,department VARCHAR(11) NOT NULL
);
INSERT INTO mytable(name,department) VALUES ('fred','sales');
INSERT INTO mytable(name,department) VALUES ('fred','sales');
INSERT INTO mytable(name,department) VALUES ('barney','admin');
INSERT INTO mytable(name,department) VALUES ('barney','admin');
INSERT INTO mytable(name,department) VALUES ('wilma','engineering');
WITH tblTemp as
(
SELECT ROW_NUMBER() Over(PARTITION BY Name,Department ORDER BY Name) As RowNumber, *
FROM mytable
)
DELETE FROM tblTemp where RowNumber > 1
;
select
*
from mytable
;
+---+--------+-------------+
| | name | department |
+---+--------+-------------+
| 1 | fred | sales |
| 2 | barney | admin |
| 3 | wilma | engineering |
+---+--------+-------------+
Related
I have bit strange requirement in mysql.
I should select all records from table where last 6 characters are not unique.
for example if I have table:
I should select row 1 and 3 since last 6 letters of this values are not unique.
Do you have any idea how to implement this?
Thank you for help.
I uses a JOIN against a subquery where I count the occurences of each unique combo of n (2 in my example) last chars
SELECT t.*
FROM t
JOIN (SELECT RIGHT(value, 2) r, COUNT(RIGHT(value, 2)) rc
FROM t
GROUP BY r) c ON c.r = RIGHT(value, 2) AND c.rc > 1
Something like that should work:
SELECT `mytable`.*
FROM (SELECT RIGHT(`value`, 6) AS `ending` FROM `mytable` GROUP BY `ending` HAVING COUNT(*) > 1) `grouped`
INNER JOIN `mytable` ON `grouped`.`ending` = RIGHT(`value`, 6)
but it is not fast. This requires a full table scan. Maybe you should rethink your problem.
EDITED: I had a wrong understanding of the question previously and I don't really want to change anything from my initial answer. But if my previous answer is not acceptable in some environment and it might mislead people, I have to correct it anyhow.
SELECT GROUP_CONCAT(id),RIGHT(VALUE,6)
FROM table1
GROUP BY RIGHT(VALUE,6) HAVING COUNT(RIGHT(VALUE,6)) > 1;
Since this question already have good answers, I made my query in a slightly different way. And I've tested with sql_mode=ONLY_FULL_GROUP_BY. ;)
This is what you need: a subquery to get the duplicated right(value,6) and the main query yo get the rows according that condition.
SELECT t.* FROM t WHERE RIGHT(`value`,6) IN (
SELECT RIGHT(`value`,6)
FROM t
GROUP BY RIGHT(`value`,6) HAVING COUNT(*) > 1);
UPDATE
This is the solution to avoid the mysql error in the case you have sql_mode=only_full_group_by
SELECT t.* FROM t WHERE RIGHT(`value`,6) IN (
SELECT DISTINCT right_value FROM (
SELECT RIGHT(`value`,6) AS right_value,
COUNT(*) AS TOT
FROM t
GROUP BY RIGHT(`value`,6) HAVING COUNT(*) > 1) t2
)
Fiddle here
Might be a fast code, as there is no counting involved.
Live test: https://www.db-fiddle.com/f/dBdH9tZd4W6Eac1TCRXZ8U/0
select *
from tbl outr
where not exists
(
select 1 / 0 -- just a proof that this is not evaluated. won't cause division by zero
from tbl inr
where
inr.id <> outr.id
and right(inr.value, 6) = right(outr.value, 6)
)
Output:
| id | value |
| --- | --------------- |
| 2 | aaaaaaaaaaaaaa |
| 4 | aaaaaaaaaaaaaaB |
| 5 | Hello |
The logic is to test other rows that is not equal to the same id of the outer row. If those other rows has same right 6 characters as the outer row, then don't show that outer row.
UPDATE
I misunderstood the OP's intent. It's the reversed. Anyway, just reverse the logic. Use EXISTS instead of NOT EXISTS
Live test: https://www.db-fiddle.com/f/dBdH9tZd4W6Eac1TCRXZ8U/3
select *
from tbl outr
where exists
(
select 1 / 0 -- just a proof that this is not evaluated. won't cause division by zero
from tbl inr
where
inr.id <> outr.id
and right(inr.value, 6) = right(outr.value, 6)
)
Output:
| id | value |
| --- | ----------- |
| 1 | abcdePuzzle |
| 3 | abcPuzzle |
UPDATE
Tested the query. The performance of my answer (correlated EXISTS approach) is not optimal. Just keeping my answer, so others will know what approach to avoid :)
GhostGambler's answer is faster than correlated EXISTS approach. For 5 million rows, his answer takes 2.762 seconds only:
explain analyze
SELECT
tbl.*
FROM
(
SELECT
RIGHT(value, 6) AS ending
FROM
tbl
GROUP BY
ending
HAVING
COUNT(*) > 1
) grouped
JOIN tbl ON grouped.ending = RIGHT(value, 6)
My answer (correlated EXISTS) takes 4.08 seconds:
explain analyze
select *
from tbl outr
where exists
(
select 1 / 0 -- just a proof that this is not evaluated. won't cause division by zero
from tbl inr
where
inr.id <> outr.id
and right(inr.value, 6) = right(outr.value, 6)
)
Straightforward query is the fastest, no join, just plain IN query. 2.722 seconds. It has practically the same performance as JOIN approach since they have the same execution plan. This is kiks73's answer. I just don't know why he made his second answer unnecessarily complicated.
So it's just a matter of taste, or choosing which code is more readable select from in vs select from join
explain analyze
SELECT *
FROM tbl
where right(value, 6) in
(
SELECT
RIGHT(value, 6) AS ending
FROM
tbl
GROUP BY
ending
HAVING
COUNT(*) > 1
)
Result:
Test data used:
CREATE TABLE tbl (
id INTEGER primary key,
value VARCHAR(20)
);
INSERT INTO tbl
(id, value)
VALUES
('1', 'abcdePuzzle'),
('2', 'aaaaaaaaaaaaaa'),
('3', 'abcPuzzle'),
('4', 'aaaaaaaaaaaaaaB'),
('5', 'Hello');
insert into tbl(id, value)
select x.y, 'Puzzle'
from generate_series(6, 5000000) as x(y);
create index ix_tbl__right on tbl(right(value, 6));
Performances without the index, and with index on tbl(right(value, 6)):
JOIN approach:
Without index: 3.805 seconds
With index: 2.762 seconds
IN approach:
Without index: 3.719 seconds
With index: 2.722 seconds
Just a bit neater code (if using MySQL 8.0). Can't guarantee the performance though
Live test: https://www.db-fiddle.com/f/dBdH9tZd4W6Eac1TCRXZ8U/1
select x.*
from
(
select
*,
count(*) over(partition by right(value, 6)) as unique_count
from tbl
) as x
where x.unique_count = 1
Output:
| id | value | unique_count |
| --- | --------------- | ------------ |
| 2 | aaaaaaaaaaaaaa | 1 |
| 4 | aaaaaaaaaaaaaaB | 1 |
| 5 | Hello | 1 |
UPDATE
I misunderstood OP's intent. It's the reversed. Just change the count:
select x.*
from
(
select
*,
count(*) over(partition by right(value, 6)) as unique_count
from tbl
) as x
where x.unique_count > 1
Output:
| id | value | unique_count |
| --- | ----------- | ------------ |
| 1 | abcdePuzzle | 2 |
| 3 | abcPuzzle | 2 |
user
user_id | name
1 | John
2 | Matt
food
food_id | food_name
1 | A
2 | B
fav_food
user_id(-> user) | food_id(->food)
2 | 1
My insertion query:
INSERT INTO `fav_food`(user_id,food_id) VALUES(?,(SELECT id from `food` where food_name=?))
When the SELECT id from food where food_name=? subquery returns null insertion fails with an error which should be.
My question is, How can I ignore the insertion only when the subquery returns null or no rows? Thanks
With EXISTS:
INSERT INTO fav_food (user_id, food_id)
SELECT ?, (SELECT food_id FROM food WHERE food_name = ?)
WHERE EXISTS (SELECT 1 FROM food WHERE food_name = ?)
See the demo
You can try this:
INSERT INTO `fav_food`(user_id,food_id) VALUES(?,(SELECT id from `food` where food_name=? AND food_name IS NOT NULL))
From what I've tried before, you can do that by using an if statement in a stored procedure, the code might look something like the following:
if exists (SELECT id from `food` where food_name=?) then
INSERT INTO `fav_food`(user_id,food_id) VALUES(?,(SELECT id from `food` where food_name=?))
end if;
is this possible in mysql queries? Current data table is:
id - fruit - name
1 - Apple - George
2 - Banana - George
3 - Orange - Jake
4 - Berries - Angela
In the name column, i would like to sort it so there is no consecutive name on my select query.
My desires output would be, no consecutive george in name column.
id - fruit - name
1 - Apple - George
3 - Orange - Jake
2 - Banana - George
4 - Berries - Angela
Thanks in advance.
In MySQL 8+, you can do:
order by row_number() over (partition by name order by id)
In earlier versions, you can do this using variables.
Another idea...
DROP TABLE IF EXISTS my_table;
CREATE TABLE my_table
(id SERIAL PRIMARY KEY
,name VARCHAR(12) NOT NULL
);
INSERT INTO my_table VALUES
(1,'George'),
(2,'George'),
(3,'Jake'),
(4,'Angela');
SELECT x.*
FROM my_table x
JOIN my_table y
ON y.name = x.name
AND y.id <= x.id
GROUP
BY x.id
ORDER
BY COUNT(*)
, id;
+----+--------+
| id | name |
+----+--------+
| 1 | George |
| 3 | Jake |
| 4 | Angela |
| 2 | George |
+----+--------+
Following solution would work for all the MySQL versions, especially version < 8.0
In a Derived table, first sort your actual table, using name and id.
Then, determine the row number for a particular row, within all the rows having same name value.
Now, use this result-set and sort it by the row number values. So, all the rows having row number = 1 will come first (for all the different name value(s)) and so on. Hence, consecutive name rows wont appear.
You can try the following using User-defined Session Variables:
SELECT dt2.id,
dt2.fruit,
dt2.name
FROM (SELECT #row_no := IF(#name_var = dt1.name, #row_no + 1, 1) AS row_num,
dt1.id,
dt1.fruit,
#name_var := dt1.name AS name
FROM (SELECT id,
fruit,
name
FROM your_table_name
ORDER BY name,
id) AS dt1
CROSS JOIN (SELECT #row_no := 0,
#name_var := '') AS user_init_vars) AS dt2
ORDER BY dt2.row_num,
dt2.id
DB Fiddle DEMO
Here is my algorithm:
count each name's frequency
order by frequency descending and name
cut into partitions as large as the maximum frequency
number rows within each partition
order by row number and partition number
An example: Names A, B, C, D, E
step 1 and 2
------------
AAAAABBCCDDEE
step 3 and 4
------------
12345
AAAAA
BBCCD
DEE
step 5
------
ABDABEACEACAD
The query:
with counted as
(
select id, fruit, name, count(*) over (partition by name) as cnt
from mytable
)
select id, fruit, name
from counted
order by
(row_number() over (order by cnt desc, name) - 1) % max(cnt) over (),
row_number() over (order by cnt desc, name);
Common table expression (WITH clauses) and window functions (aggregation OVER) are available as of MySQL 8 or MariaDB 10.2. Before that you can retreat to subqueries, which will make the same query quite long and hard to read, though. I suppose you could also use variables instead, somehow.
DB fiddle demo: https://www.db-fiddle.com/f/8amYX6iRu8AsnYXJYz15DF/1
I have the following query:
select column_name, count(column_name)
from table
group by column_name
having count(column_name) > 1;
What would be the difference if I replaced all calls to count(column_name) to count(*)?
This question was inspired by How do I find duplicate values in a table in Oracle?.
To clarify the accepted answer (and maybe my question), replacing count(column_name) with count(*) would return an extra row in the result that contains a null and the count of null values in the column.
count(*) counts NULLs and count(column) does not
[edit] added this code so that people can run it
create table #bla(id int,id2 int)
insert #bla values(null,null)
insert #bla values(1,null)
insert #bla values(null,1)
insert #bla values(1,null)
insert #bla values(null,1)
insert #bla values(1,null)
insert #bla values(null,null)
select count(*),count(id),count(id2)
from #bla
results
7 3 2
Another minor difference, between using * and a specific column, is that in the column case you can add the keyword DISTINCT, and restrict the count to distinct values:
select column_a, count(distinct column_b)
from table
group by column_a
having count(distinct column_b) > 1;
A further and perhaps subtle difference is that in some database implementations the count(*) is computed by looking at the indexes on the table in question rather than the actual data rows. Since no specific column is specified, there is no need to bother with the actual rows and their values (as there would be if you counted a specific column). Allowing the database to use the index data can be significantly faster than making it count "real" rows.
The explanation in the docs, helps to explain this:
COUNT(*) returns the number of items in a group, including NULL values and duplicates.
COUNT(expression) evaluates expression for each row in a group and returns the number of nonnull values.
So count(*) includes nulls, the other method doesn't.
We can use the Stack Exchange Data Explorer to illustrate the difference with a simple query. The Users table in Stack Overflow's database has columns that are often left blank, like the user's Website URL.
-- count(column_name) vs. count(*)
-- Illustrates the difference between counting a column
-- that can hold null values, a 'not null' column, and count(*)
select count(WebsiteUrl), count(Id), count(*) from Users
If you run the query above in the Data Explorer, you'll see that the count is the same for count(Id) and count(*)because the Id column doesn't allow null values. The WebsiteUrl count is much lower, though, because that column allows null.
The COUNT(*) sentence indicates SQL Server to return all the rows from a table, including NULLs.
COUNT(column_name) just retrieves the rows having a non-null value on the rows.
Please see following code for test executions SQL Server 2008:
-- Variable table
DECLARE #Table TABLE
(
CustomerId int NULL
, Name nvarchar(50) NULL
)
-- Insert some records for tests
INSERT INTO #Table VALUES( NULL, 'Pedro')
INSERT INTO #Table VALUES( 1, 'Juan')
INSERT INTO #Table VALUES( 2, 'Pablo')
INSERT INTO #Table VALUES( 3, 'Marcelo')
INSERT INTO #Table VALUES( NULL, 'Leonardo')
INSERT INTO #Table VALUES( 4, 'Ignacio')
-- Get all the collumns by indicating *
SELECT COUNT(*) AS 'AllRowsCount'
FROM #Table
-- Get only content columns ( exluce NULLs )
SELECT COUNT(CustomerId) AS 'OnlyNotNullCounts'
FROM #Table
COUNT(*) – Returns the total number of records in a table (Including NULL valued records).
COUNT(Column Name) – Returns the total number of Non-NULL records. It means that, it ignores counting NULL valued records in that particular column.
Basically the COUNT(*) function return all the rows from a table whereas COUNT(COLUMN_NAME) does not; that is it excludes null values which everyone here have also answered here.
But the most interesting part is to make queries and database optimized it is better to use COUNT(*) unless doing multiple counts or a complex query rather than COUNT(COLUMN_NAME). Otherwise, it will really lower your DB performance while dealing with a huge number of data.
Further elaborating upon the answer given by #SQLMeance and #Brannon making use of GROUP BY clause which has been mentioned by OP but not present in answer by #SQLMenace
CREATE TABLE table1 (
id INT
);
INSERT INTO table1 VALUES
(1),
(2),
(NULL),
(2),
(NULL),
(3),
(1),
(4),
(NULL),
(2);
SELECT * FROM table1;
+------+
| id |
+------+
| 1 |
| 2 |
| NULL |
| 2 |
| NULL |
| 3 |
| 1 |
| 4 |
| NULL |
| 2 |
+------+
10 rows in set (0.00 sec)
SELECT id, COUNT(*) FROM table1 GROUP BY id;
+------+----------+
| id | COUNT(*) |
+------+----------+
| 1 | 2 |
| 2 | 3 |
| NULL | 3 |
| 3 | 1 |
| 4 | 1 |
+------+----------+
5 rows in set (0.00 sec)
Here, COUNT(*) counts the number of occurrences of each type of id including NULL
SELECT id, COUNT(id) FROM table1 GROUP BY id;
+------+-----------+
| id | COUNT(id) |
+------+-----------+
| 1 | 2 |
| 2 | 3 |
| NULL | 0 |
| 3 | 1 |
| 4 | 1 |
+------+-----------+
5 rows in set (0.00 sec)
Here, COUNT(id) counts the number of occurrences of each type of id but does not count the number of occurrences of NULL
SELECT id, COUNT(DISTINCT id) FROM table1 GROUP BY id;
+------+--------------------+
| id | COUNT(DISTINCT id) |
+------+--------------------+
| NULL | 0 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
+------+--------------------+
5 rows in set (0.00 sec)
Here, COUNT(DISTINCT id) counts the number of occurrences of each type of id only once (does not count duplicates) and also does not count the number of occurrences of NULL
It is best to use
Count(1) in place of column name or *
to count the number of rows in a table, it is faster than any format because it never go to check the column name into table exists or not
There is no difference if one column is fix in your table, if you want to use more than one column than you have to specify that how much columns you required to count......
Thanks,
As mentioned in the previous answers, Count(*) counts even the NULL columns, whereas count(Columnname) counts only if the column has values.
It's always best practice to avoid * (Select *, count *, …)
I'd like to construct a single query (or as few as possible) to group a data set. So given a number of buckets, I'd like to return results based on a specific column.
So given a column called score which is a double which contains:
90.00
91.00
94.00
96.00
98.00
99.00
I'd like to be able to use a GROUP BY clause with a function like:
SELECT MIN(score), MAX(score), SUM(score) FROM table GROUP BY BUCKETS(score, 3)
Ideally this would return 3 rows (grouping the results into 3 buckets with as close to equal count in each group as is possible):
90.00, 91.00, 181.00
94.00, 96.00, 190.00
98.00, 99.00, 197.00
Is there some function that would do this? I'd like to avoid returning all the rows and figuring out the bucket segments myself.
Dave
create table test (
id int not null auto_increment primary key,
val decimal(4,2)
) engine = myisam;
insert into test (val) values
(90.00),
(91.00),
(94.00),
(96.00),
(98.00),
(99.00);
select min(val) as lower,max(val) as higher,sum(val) as total from (
select id,val,#row:=#row+1 as row
from test,(select #row:=0) as r order by id
) as t
group by ceil(row/2)
+-------+--------+--------+
| lower | higher | total |
+-------+--------+--------+
| 90.00 | 91.00 | 181.00 |
| 94.00 | 96.00 | 190.00 |
| 98.00 | 99.00 | 197.00 |
+-------+--------+--------+
3 rows in set (0.00 sec)
Unluckily mysql doesn't have analytical function like rownum(), so you have to use some variable to emulate it. Once you do it, you can simply use ceil() function in order to group every tot rows as you like. Hope that it helps despite my english.
set #r = (select count(*) from test);
select min(val) as lower,max(val) as higher,sum(val) as total from (
select id,val,#row:=#row+1 as row
from test,(select #row:=0) as r order by id
) as t
group by ceil(row/ceil(#r/3))
or, with a single query
select min(val) as lower,max(val) as higher,sum(val) as total from (
select id,val,#row:=#row+1 as row,tot
from test,(select count(*) as tot from test) as t2,(select #row:=0) as r order by id
) as t
group by ceil(row/ceil(tot/3))