I have a query like this:
DELETE FROM doublon WHERE id in
( Select id from `doublon` where `id` not in
( Select id
From `doublon`
group by etablissement_id,amenities_id
having Count(etablissement_id) > 1 and Count(amenities_id) > 1
union
Select id
From `doublon`
group by etablissement_id,amenities_id
having Count(etablissement_id) = 1 and Count(amenities_id) = 1
)
)
My table 'doublon' is structured like this:
id
etablissement_id
amenities_id
The table structure looks like this:
http://hpics.li/bbb5eda
I have 2 million rows and the query is too slow: it runs for many hours.
Does anybody know how to optimize this query so that it executes faster?
SqlFiddle
Your query is not correct in the first place. But keep reading; by the end of the answer I may have discovered the reason you need such a strange query.
Let's discuss the last subquery:
Select id
From `doublon`
group by etablissement_id,amenities_id
having Count(etablissement_id) = 1 and Count(amenities_id) = 1
You can use a column in the SELECT clause of a query that has GROUP BY only if at least one of the following happens:
it is present in the GROUP BY clause too;
it is used as an argument of an aggregate function;
the value of that column is functionally dependent on the values of the columns that are present in the GROUP BY clause; for example, when the GROUP BY clause contains a column that has a UNIQUE index (or all the columns of a multi-column UNIQUE index of the table).
The column id doesn't fit any of the cases above (see note 1). This makes the query illegal according to the SQL specification.
MySQL, however, accepts it (when the ONLY_FULL_GROUP_BY SQL mode is disabled) and does its best to produce a result set for it, but the documentation warns:
... the server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate, which is probably not what you want.
The HAVING clause contains Count(etablissement_id) and Count(amenities_id). When etablissement_id and amenities_id are both not-NULL then these two expressions have the same value and that is the same as COUNT(*) (the number of rows in the group). And it is always greater than 0 (a group cannot contain 0 rows).
For the groups generated when etablissement_id or amenities_id is NULL, the corresponding COUNT() returns 0. This also applies when both are NULL at the same time.
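The difference between COUNT(column) and COUNT(*) on groups that contain NULLs can be checked with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption; the counting rules for NULL are the same in both), and the sample rows are made up for illustration.

```python
import sqlite3


def count_per_group():
    """Return (etablissement_id, amenities_id, COUNT(*),
    COUNT(etablissement_id), COUNT(amenities_id)) for every group."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE doublon ("
        "id INTEGER PRIMARY KEY, etablissement_id INTEGER, amenities_id INTEGER)"
    )
    # Made-up rows: one duplicated pair, one unique pair, two groups with NULLs.
    conn.executemany(
        "INSERT INTO doublon (etablissement_id, amenities_id) VALUES (?, ?)",
        [(1, 10), (1, 10), (2, 20), (None, 30), (None, None)],
    )
    rows = conn.execute(
        """
        SELECT etablissement_id, amenities_id,
               COUNT(*), COUNT(etablissement_id), COUNT(amenities_id)
        FROM doublon
        GROUP BY etablissement_id, amenities_id
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in count_per_group():
        print(row)
```

For the groups without NULLs the three counts agree; for the groups that contain a NULL, COUNT(*) still counts the row while COUNT() of the NULL column is 0.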
Using this information, this query returns the ids of rows whose combination (etablissement_id, amenities_id) is unique in the table (the groups contain only one row) and both fields are not NULL.
The other GROUP BY query (that is UNION-ed with this one) returns indeterminate values from the groups of rows whose combination (etablissement_id, amenities_id) is not unique in the table (and both fields are not NULL), as explained in the fragment quoted from the documentation.
It seems the UNION picks one (random) id from each group of (etablissement_id, amenities_id) where both etablissement_id and amenities_id are not-NULL. The outer SELECT intends to ignore the ids chosen by the UNION and provide to DELETE the rest of them.
(I think the intermediate SELECT is not even needed, you could use its WHERE clause in the DELETE query).
The only reason I can imagine for running this query is that table doublon is the correspondence table of a many-to-many relationship that was created without a UNIQUE index on (etablissement_id, amenities_id) (the FOREIGN KEY columns imported from the related tables).
If this is your intention then there are simpler ways to achieve this goal.
I would create a duplicate of the doublon table with the correct structure, then use an INSERT ... SELECT query with DISTINCT to copy the needed values from the old table. Then I would swap the tables and remove the old one.
The queries:
# Create the new table
CREATE TABLE `doublon_fixed` LIKE `doublon`;
# Add the needed UNIQUE INDEX
ALTER TABLE `doublon_fixed`
ADD UNIQUE INDEX `etablissement_amenities`(`etablissement_id`, `amenities_id`);
# Copy the needed values
INSERT INTO `doublon_fixed` (`etablissement_id`, `amenities_id`)
SELECT DISTINCT `etablissement_id`, `amenities_id`
FROM `doublon`;
# Swap the tables
RENAME TABLE `doublon` TO `doublon_old`, `doublon_fixed` TO `doublon`;
# Remove the old table
DROP TABLE `doublon_old`;
The RENAME TABLE statement performs the renames atomically, from left to right, which is useful to avoid downtime.
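The copy-and-swap steps can be tried on a toy table before running them on 2 million rows. The following is only a sketch under stated assumptions: it uses Python's sqlite3 module instead of MySQL, so CREATE TABLE ... LIKE and the multi-table RENAME TABLE are replaced by SQLite equivalents, and the helper names make_sample and rebuild_without_duplicates are hypothetical.

```python
import sqlite3


def make_sample():
    """Create an in-memory doublon table that contains duplicated pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE doublon ("
        "id INTEGER PRIMARY KEY, etablissement_id INTEGER, amenities_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO doublon (etablissement_id, amenities_id) VALUES (?, ?)",
        [(1, 10), (1, 10), (1, 10), (2, 20), (2, 21)],
    )
    conn.commit()
    return conn


def rebuild_without_duplicates(conn):
    """Copy the distinct pairs into a new table, then swap it in."""
    conn.executescript(
        """
        -- SQLite has no CREATE TABLE ... LIKE, so the structure is spelled out.
        CREATE TABLE doublon_fixed (
            id INTEGER PRIMARY KEY,
            etablissement_id INTEGER,
            amenities_id INTEGER,
            UNIQUE (etablissement_id, amenities_id)
        );
        INSERT INTO doublon_fixed (etablissement_id, amenities_id)
            SELECT DISTINCT etablissement_id, amenities_id FROM doublon;
        -- SQLite renames one table per statement.
        ALTER TABLE doublon RENAME TO doublon_old;
        ALTER TABLE doublon_fixed RENAME TO doublon;
        DROP TABLE doublon_old;
        """
    )


if __name__ == "__main__":
    conn = make_sample()
    rebuild_without_duplicates(conn)
    print(conn.execute("SELECT * FROM doublon ORDER BY id").fetchall())
```

Note that the rows in the rebuilt table receive new id values, because only the two foreign key columns are copied. If other tables reference doublon.id, this approach needs more care.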
Notes:
1. If the id column is functionally dependent on the (etablissement_id, amenities_id) pair, then all the groups produced by the UNION-ed queries contain a single row. The first SELECT won't produce any result and the second SELECT will return the entire table.
If I am not wrong, this should work:
DELETE FROM doublon
WHERE id IN (SELECT id
FROM doublon
WHERE id NOT IN (SELECT id
FROM doublon
GROUP BY etablissement_id,
amenities_id
HAVING Count(etablissement_id) >= 1
AND Count(amenities_id) >= 1))
An example from a book about MySQL:
SELECT vendor_id, vendor_name, vendor_state
FROM vendors
WHERE NOT EXISTS
(SELECT *
FROM invoices
WHERE vendor_id = vendors.vendor_id)
"In this example, the correlated subquery selects all invoices that have the same vendor_id value as the current vendor in the outer query. Because the subquery doesn't actually return a result set, it doesn't matter what columns are included in the SELECT clause. As a result it's customary to just code an asterisk."
The invoices table has about 10 separate columns, which look like this: http://prntscr.com/h3106k
I don't fully understand the asterisk part. Since there are 10 separate columns in this table, isn't it possible that some columns will be empty (or not empty), and that we could check for that? Or is there no use in checking individual columns, so that it only makes sense to check the table as a whole (and nothing other than the asterisk is needed here)?
EXISTS only tests whether the subquery returns at least one row satisfying the condition (WHERE …=…); the values in that row are never inspected. So it is not important which columns are listed, and for a vendor without invoices there is no row to check at all.
An alternative would be the following query, which may be easier to understand:
SELECT vendor_id, vendor_name, vendor_state
FROM vendors
WHERE
(
SELECT COUNT(vendor_id)
FROM invoices
WHERE vendor_id = vendors.vendor_id
) = 0
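To see that the select list of the subquery really makes no difference, the NOT EXISTS query can be run with several select lists. This is a sketch with Python's sqlite3 module standing in for MySQL; the sample vendors, the invoices columns other than vendor_id, and the function name are made up for illustration.

```python
import sqlite3


def vendors_without_invoices(select_list):
    """Run the NOT EXISTS query with the given subquery select list."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE vendors (
            vendor_id INTEGER PRIMARY KEY, vendor_name TEXT, vendor_state TEXT);
        CREATE TABLE invoices (
            invoice_id INTEGER PRIMARY KEY, vendor_id INTEGER, payment_date TEXT);
        INSERT INTO vendors VALUES
            (1, 'Acme', 'CA'), (2, 'Globex', 'NY'), (3, 'Initech', 'TX');
        -- Vendor 2 has no invoices; vendor 3 has one whose payment_date is NULL.
        INSERT INTO invoices VALUES
            (100, 1, '2017-10-01'), (101, 1, NULL), (102, 3, NULL);
        """
    )
    # select_list is interpolated only because it is a fixed literal in this demo.
    rows = conn.execute(
        f"""
        SELECT vendor_id, vendor_name, vendor_state
        FROM vendors
        WHERE NOT EXISTS
            (SELECT {select_list}
             FROM invoices
             WHERE invoices.vendor_id = vendors.vendor_id)
        ORDER BY vendor_id
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for select_list in ("*", "1", "NULL", "payment_date"):
        print(select_list, vendors_without_invoices(select_list))
```

Even selecting payment_date, which is NULL for the only invoice of vendor 3, does not bring vendor 3 into the result: the row exists, and that is all EXISTS looks at.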
I want to remove duplicates based on the combination of listings.product_id and listings.channel_listing_id
This simple query returns 400.000 rows (the ids of the rows I want to keep):
SELECT id
FROM `listings`
WHERE is_verified = 0
GROUP BY product_id, channel_listing_id
While this variation returns 1.600.000 rows, which is every record in the table, not only those with is_verified = 0:
SELECT *
FROM (
SELECT id
FROM `listings`
WHERE is_verified = 0
GROUP BY product_id, channel_listing_id
) AS keepem
I'd expect them to return the same number of rows.
What's the reason for this? How can I avoid it (in order to use the subselect in the where condition of the DELETE statement)?
EDIT: I found that doing a SELECT DISTINCT in the outer SELECT "fixes" it (it returns 400.000 records as it should). I'm still not sure if I should trust this subquery, for there is no DISTINCT in the DELETE statement.
EDIT 2: Seems to be just a bug in the way phpMyAdmin reports the total count of the rows.
Your query as it stands is ambiguous. Suppose you have two listings with the same product_id and channel_listing_id. Then which id is supposed to be returned? The first, the second? Or both, ignoring the GROUP request?
What if there is more than one id with different product and channel ids?
Try removing the ambiguity by selecting MAX(id) AS id and adding DISTINCT.
Are there any foreign keys to worry about? If not, you could pour the original table into a copy, empty the original and copy back into it only the non-duplicates. Messier, but you only do SELECTs or DELETEs guaranteed to succeed, and you also get to keep a backup.
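A sketch of the MAX(id) suggestion, feeding the unambiguous subquery into the DELETE. It runs on SQLite through Python's sqlite3 module, which is an assumption: MySQL generally refuses a DELETE whose subquery reads the table being deleted from unless that subquery is wrapped in a derived table. The sample rows, the helper names, and the decision to touch only rows with is_verified = 0 are also assumptions.

```python
import sqlite3

# One id per (product_id, channel_listing_id) group, chosen deterministically.
KEEP_IDS = """
    SELECT MAX(id)
    FROM listings
    WHERE is_verified = 0
    GROUP BY product_id, channel_listing_id
"""


def make_listings():
    """Create a small in-memory listings table with duplicated pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE listings (id INTEGER PRIMARY KEY, product_id INTEGER, "
        "channel_listing_id INTEGER, is_verified INTEGER)"
    )
    conn.executemany(
        "INSERT INTO listings VALUES (?, ?, ?, ?)",
        [(1, 1, 1, 0), (2, 1, 1, 0), (3, 1, 2, 0),
         (4, 2, 1, 0), (5, 2, 1, 0), (6, 2, 1, 1)],
    )
    return conn


def ids_to_keep(conn):
    """Return the ids the grouped subquery selects, in ascending order."""
    return sorted(row[0] for row in conn.execute(KEEP_IDS))


def delete_duplicates(conn):
    """Delete unverified rows that are not the kept row of their group."""
    cur = conn.execute(
        f"DELETE FROM listings WHERE is_verified = 0 AND id NOT IN ({KEEP_IDS})"
    )
    return cur.rowcount


if __name__ == "__main__":
    conn = make_listings()
    print(ids_to_keep(conn), delete_duplicates(conn))
```

Because MAX(id) is an aggregate, every group yields exactly one well-defined id, so the number of kept rows equals the number of groups regardless of how a client tool reports the count.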
Assign aliases in order to avoid field reference ambiguity:
SELECT
keepem.*
FROM
(
SELECT
innerStat.id
FROM
`listings` AS innerStat
WHERE
innerStat.is_verified = 0
GROUP BY
innerStat.product_id,
innerStat.channel_listing_id
) AS keepem
I want to get the distinct values of a particular column; however, duplicates are not handled properly if more than 3 columns are selected.
The query is:
SELECT DISTINCT
ShoppingSessionId, userid
FROM
dbo.tbl_ShoppingCart
GROUP BY
ShoppingSessionId, userid
HAVING
userid = 7
This query produces the correct result, but if we add another column then the result is wrong.
Please help: I want ShoppingSessionId to stay distinct even when I select all the columns from the table and keep the filter on userid.
How can I do that?
The DISTINCT keyword applies to the entire row, never to a column.
At present DISTINCT is not needed at all, because your script already makes sure that ShoppingSessionId is distinct: it specifies the column in GROUP BY and filters on the other grouping column (userid).
When you add a third column to GROUP BY and it results in duplicated ShoppingSessionId values, it means that some ShoppingSessionId values are associated with many different values of the added column.
If you want ShoppingSessionId to remain distinct after including that third column, you should decide which values of the added column should be left in the output and which should be discarded. This is called aggregating. You could apply the MAX() function to that column, or MIN() or any other suitable aggregate function. Note that the column should not be included in GROUP BY in this case.
Here's an illustration of what I'm talking about:
SELECT
ShoppingSessionId,
userid,
MAX(YourThirdColumn) AS YourThirdColumn
FROM dbo.tbl_ShoppingCart
GROUP BY
ShoppingSessionId,
userid
HAVING userid = 7
There's one more note on your query. The HAVING clause is typically used for filtering on aggregated columns. If your filter does not involve aggregated columns, you'll be better off using the WHERE clause instead:
SELECT
ShoppingSessionId,
userid,
MAX(YourThirdColumn) AS YourThirdColumn
FROM dbo.tbl_ShoppingCart
WHERE userid = 7
GROUP BY
ShoppingSessionId,
userid
Although both queries would produce identical results, their efficiency would be different, because the first query would have to pull all rows, group/aggregate them, then discard all rows except userid = 7, but the second one would discard rows first and only then group/aggregate the remaining, which is much more efficient.
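That the HAVING and WHERE variants return the same rows can be confirmed on a toy table. This sketch uses Python's sqlite3 module rather than SQL Server (so the dbo. schema prefix is dropped), and ProductId is a hypothetical stand-in for YourThirdColumn. It only checks the results, not the efficiency claim, which depends on the query plan the server chooses.

```python
import sqlite3

HAVING_QUERY = """
    SELECT ShoppingSessionId, userid, MAX(ProductId) AS ProductId
    FROM tbl_ShoppingCart
    GROUP BY ShoppingSessionId, userid
    HAVING userid = 7
    ORDER BY ShoppingSessionId
"""

WHERE_QUERY = """
    SELECT ShoppingSessionId, userid, MAX(ProductId) AS ProductId
    FROM tbl_ShoppingCart
    WHERE userid = 7
    GROUP BY ShoppingSessionId, userid
    ORDER BY ShoppingSessionId
"""


def run_both():
    """Return the rows produced by the HAVING and the WHERE variant."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tbl_ShoppingCart ("
        "ShoppingSessionId TEXT, userid INTEGER, ProductId INTEGER)"
    )
    # Made-up rows: session s1 has two products, session s3 belongs to another user.
    conn.executemany(
        "INSERT INTO tbl_ShoppingCart VALUES (?, ?, ?)",
        [("s1", 7, 100), ("s1", 7, 101), ("s2", 7, 200), ("s3", 8, 300)],
    )
    having_rows = conn.execute(HAVING_QUERY).fetchall()
    where_rows = conn.execute(WHERE_QUERY).fetchall()
    conn.close()
    return having_rows, where_rows


if __name__ == "__main__":
    print(run_both())
```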
You could go even further and exclude the userid column from GROUP BY and pull its value with an aggregate function:
SELECT
ShoppingSessionId,
MAX(userid) AS userid,
MAX(YourThirdColumn) AS YourThirdColumn
FROM dbo.tbl_ShoppingCart
WHERE userid = 7
GROUP BY
ShoppingSessionId
Since all userid values in your output are supposed to contain 7 (because that's in your filter), you can just pick the maximum value for every ShoppingSessionId, knowing that it'll always be 7.
How can I SELECT the last row in a MySQL table?
I'm INSERTing data and I need to retrieve a column value from the previous row.
There's an auto_increment in the table.
Yes, there's an auto_increment in there
If you want the last of all the rows in the table, then this is finally the time where MAX(id) is the right answer! Kind of:
SELECT fields FROM table ORDER BY id DESC LIMIT 1;
Keep in mind that tables in relational databases are just sets of rows. And sets in mathematics are unordered collections. There is no first or last row; no previous row or next row.
You'll have to sort your set of unordered rows by some field first, and then you are free to iterate through the resultset in the order you defined.
Since you have an auto incrementing field, I assume you want that to be the sorting field. In that case, you may want to do the following:
SELECT *
FROM your_table
ORDER BY your_auto_increment_field DESC
LIMIT 1;
See how we're first sorting the set of unordered rows by the your_auto_increment_field (or whatever you have it called) in descending order. Then we limit the resultset to just the first row with LIMIT 1.
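A quick way to try the "sort descending, take one row" idea is the sketch below. It uses Python's sqlite3 module as a stand-in for MySQL, with a hypothetical events table, and compares the result with a lookup by MAX(id) to show that both pick the same row.

```python
import sqlite3


def make_events(payloads):
    """Create an in-memory table with an auto-incrementing id."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)"
    )
    conn.executemany(
        "INSERT INTO events (payload) VALUES (?)", [(p,) for p in payloads]
    )
    return conn


def last_row_order_by(conn):
    """Sort by id in descending order and keep only the first row."""
    return conn.execute(
        "SELECT id, payload FROM events ORDER BY id DESC LIMIT 1"
    ).fetchone()


def last_row_max(conn):
    """Look up the row whose id equals the largest id in the table."""
    return conn.execute(
        "SELECT id, payload FROM events WHERE id = (SELECT MAX(id) FROM events)"
    ).fetchone()


if __name__ == "__main__":
    conn = make_events(["first", "second", "third"])
    print(last_row_order_by(conn), last_row_max(conn))
```

On an empty table both functions return None, so the caller has to handle the "no previous row" case.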
You can combine two queries suggested by @spacepille into single query that looks like this:
SELECT * FROM `table_name` WHERE id=(SELECT MAX(id) FROM `table_name`);
It should work blazing fast, but on InnoDB tables it's a fraction of a millisecond slower than ORDER BY + LIMIT.
On tables with many rows, two queries are probably faster:
SELECT @last_id := MAX(id) FROM table;
SELECT * FROM table WHERE id = @last_id;
Almost every database table has an auto_increment column (generally id).
If you want the last of all the rows in the table:
SELECT columns FROM table ORDER BY id DESC LIMIT 1;
OR
You can combine two queries into single query that looks like this:
SELECT columns FROM table WHERE id=(SELECT MAX(id) FROM table);
To keep it simple, use PDO::lastInsertId:
http://php.net/manual/en/pdo.lastinsertid.php
Many answers here say the same (order by your auto increment), which is OK, provided you have an autoincremented column that is indexed.
On a side note, if you have such a field and it is the primary key, there is no performance penalty for using order by versus select max(id). The primary key is how data is ordered in the database files (for InnoDB at least), the RDBMS knows where that data ends, and it can optimize order by id + limit 1 to be the same as reading the max(id).
Now the road less traveled is when you don't have an autoincremented primary key. Maybe the primary key is a natural key, which is a composite of 3 fields...
Not all is lost, though. From a programming language you can first get the number of rows with
SELECT Count(*) - 1 AS rowcount FROM <yourTable>;
and then use the obtained number in the LIMIT clause
SELECT * FROM <yourTable>
LIMIT <number_from_rowcount>, 1
Unfortunately, MySQL does not allow a subquery or a user variable in the LIMIT clause.
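The two-step approach (count first, then use the number as the offset) might look like this from a programming language. This is a sketch with Python's sqlite3 module; the orderbook table and its three-column natural key are hypothetical. The ORDER BY on the key is an addition of this sketch: without it the database is free to return the rows in any order, so "the last row" would not be well defined.

```python
import sqlite3


def make_orderbook():
    """Create a table whose primary key is a composite natural key."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orderbook (market TEXT, side TEXT, price REAL, "
        "PRIMARY KEY (market, side, price))"
    )
    conn.executemany(
        "INSERT INTO orderbook VALUES (?, ?, ?)",
        [("ETH", "sell", 11.0), ("BTC", "buy", 100.0), ("ETH", "buy", 10.0)],
    )
    return conn


def last_row_by_offset(conn):
    """Fetch the row count, then use count - 1 as the OFFSET."""
    (count,) = conn.execute("SELECT COUNT(*) FROM orderbook").fetchone()
    if count == 0:
        return None
    return conn.execute(
        "SELECT market, side, price FROM orderbook "
        "ORDER BY market, side, price LIMIT 1 OFFSET ?",
        (count - 1,),
    ).fetchone()


if __name__ == "__main__":
    print(last_row_by_offset(make_orderbook()))
```

If other sessions insert or delete rows between the two statements, the offset can be stale, so both statements should run in the same transaction where that matters.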
If you want the most recently added one, add a timestamp and select ordered in reverse order by highest timestamp, limit 1. If you want to go by ID, sort by ID. If you want to use the one you JUST added, use mysql_insert_id.
You can use an OFFSET in a LIMIT command:
SELECT * FROM aTable LIMIT 1 OFFSET 99
If your table has 100 rows, this returns the last row without relying on a primary key (although without an ORDER BY the order of the rows is not guaranteed).
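Without an ID, in one query (note that MySQL does not accept a subquery in LIMIT or OFFSET, so this form only works on database engines that allow expressions there):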
SELECT * FROM table_name LIMIT 1 OFFSET (SELECT COUNT(*) - 1 FROM table_name)
SELECT * FROM adds where id=(select max(id) from adds);
This query fetches the last record in your table.
Does an SQL Server "join" preserve any kind of row order consistently (i.e. that of the left table or that of the right table)?
Pseudocode:
create table #p (personid bigint);
foreach (id in personid_list)
insert into #p (personid) values (id)
select users.id from users inner join #p on users.personid = #p.personid
Suppose I have a list of IDs that correspond to person entries. Each of those IDs may correspond to zero or more user accounts (since each person can have multiple accounts).
To quickly select columns from the users table, I populate a temp table with person ids, then inner join it with the users table.
I'm looking for an efficient way to ensure that the order of the results in the join matches the order of the ids as they were inserted into the temp table, so that the user list that's returned is in the same order as the person list as it was entered.
I've considered the following alternatives:
using "#p inner join users", in case the left table's order is preserved
using "#p left join users where id is not null", in case a left join preserves order and the inner join doesn't
using "create table (rownum int, personid bigint)", inserting an incrementing row number as the temp table is populated, so the results can be ordered by rownum in the join
using an SQL Server equivalent of the "order by order of [tablename]" clause available in DB2
I'm currently using option 3, and it works... but I hate the idea of using an order by clause for something that's already ordered. I just don't know if the temp table preserves the order in which the rows were inserted or how the join operates and what order the results come out in.
EDIT:
Assuming I go with option 3, so there IS a field to order on... is there any form of the join that will help SQL Server to do the least amount of work in maintaining the order. I mean, is it smart enough, for example, to look at what table's fields are in the order by clause and work off that table first while doing the join, so that the result set's order roughly or completely coincides with that table's order, just in case it's already in the desired order?
SQL sets are never ordered unless you explicitly order them with an order by clause.
Do this:
create table #p (personid bigint);
insert into #p (personid) values (id)
select id from users
ORDER BY <something like users.name>;
select * from #p
ORDER BY <something like users.name>;
Note that while you can insert in order, that doesn't mean the subsequent select will be ordered, because SQL sets are never ordered unless you explicitly order them with an order by clause.
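Since only ORDER BY guarantees an order, the usual way to get the users back in input order is to store a position next to each person id and sort on it. The sketch below illustrates that idea with Python's sqlite3 module instead of SQL Server; the temp table is called p rather than #p, and the users columns and the function name are assumptions.

```python
import sqlite3


def make_users():
    """Create a users table in which a person can own several accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, personid INTEGER)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)", [(1, 10), (2, 20), (3, 10), (4, 30)]
    )
    return conn


def users_in_input_order(conn, person_ids):
    """Return (user id, person id) rows ordered like person_ids."""
    conn.execute(
        "CREATE TEMP TABLE p (rownum INTEGER PRIMARY KEY, personid INTEGER)"
    )
    # enumerate() records the position of each id in the input list.
    conn.executemany(
        "INSERT INTO p (rownum, personid) VALUES (?, ?)",
        list(enumerate(person_ids)),
    )
    rows = conn.execute(
        """
        SELECT users.id, users.personid
        FROM p
        INNER JOIN users ON users.personid = p.personid
        ORDER BY p.rownum, users.id
        """
    ).fetchall()
    conn.execute("DROP TABLE p")
    return rows


if __name__ == "__main__":
    print(users_in_input_order(make_users(), [30, 10, 20]))
```

The second sort key (users.id) is there only to make the order of several accounts of the same person deterministic.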
You write:
To quickly select columns from the users table, I populate a temp table with person ids, then inner join it with the users table.
Note that in most cases, it'll be faster to just select directly from users, using an in list:
select * from users where users.id in (1, 2, 3, 6, 9, ... );
You're probably prematurely "optimizing" something that doesn't need optimizing. RDBMSes are (usually) written to be efficient, and will probably do little extra work sorting something that's already sorted by chance. Concentrate on functionality until you have a demonstrated need to optimize. (I say this as someone who has been spending the last several months almost solely optimizing SQL on very large (~ half billion row OLTP) datasets, because most of the time, that's true.)