I have a query like this:
DELETE FROM doublon WHERE id in
( Select id from `doublon` where `id` not in
( Select id
From `doublon`
group by etablissement_id,amenities_id
having Count(etablissement_id) > 1 and Count(amenities_id) > 1
union
Select id
From `doublon`
group by etablissement_id,amenities_id
having Count(etablissement_id) = 1 and Count(amenities_id) = 1
)
)
My table 'doublon' is structured like this:
id
etablissement_id
amenities_id
The table structure looks like this:
http://hpics.li/bbb5eda
I have 2 million rows and the query is too slow; it takes many hours.
Does anybody know how to optimize this query so it executes faster?
SqlFiddle
Your query is not correct in the first place. But keep reading; it's possible that by the end of the answer I have discovered the reason you need such a strange query.
Let's discuss the last subquery:
Select id
From `doublon`
group by etablissement_id,amenities_id
having Count(etablissement_id) = 1 and Count(amenities_id) = 1
You can use a column in the SELECT clause of a query that has GROUP BY only if at least one of the following happens:
it is present in the GROUP BY clause too;
it is used as an argument of an aggregate function;
the value of that column is functionally dependent on the values of the columns that are present in the GROUP BY clause; for example, if a column that has a UNIQUE index is present (or all the columns that are present in a UNIQUE index of the table).
The column id doesn't fit any of the cases above [1]. This makes the query illegal according to the SQL specification.
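For contrast, a version of that subquery that would be legal uses id only inside an aggregate function (just an illustration, not a fix for the problem at hand):
SELECT MIN(id) AS some_id, COUNT(*) AS group_size
FROM `doublon`
GROUP BY etablissement_id, amenities_id;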
MySQL, however, accepts the original subquery and struggles to produce a result set for it, but the documentation says:
... the server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate, which is probably not what you want.
The HAVING clause contains Count(etablissement_id) and Count(amenities_id). When etablissement_id and amenities_id are both non-NULL, these two expressions have the same value, and that value is the same as COUNT(*) (the number of rows in the group). It is always greater than 0 (a group cannot contain 0 rows).
For the groups generated when etablissement_id or amenities_id is NULL, the corresponding COUNT() returns 0. This also applies when both are NULL at the same time.
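To see this COUNT() behaviour in action, here is a quick illustration against the same table (a sketch only):
-- COUNT(*) counts the rows of the group; COUNT(column) counts only non-NULL values
SELECT etablissement_id, amenities_id,
       COUNT(*)                AS rows_in_group,
       COUNT(etablissement_id) AS non_null_etablissement,
       COUNT(amenities_id)     AS non_null_amenities
FROM `doublon`
GROUP BY etablissement_id, amenities_id;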
Using this information, the last subquery returns the ids of rows whose combination (etablissement_id, amenities_id) is unique in the table (the groups contain only one row) and both fields are non-NULL.
The other GROUP BY query (that is UNION-ed with this one) returns indeterminate values from the groups of rows whose combination (etablissement_id, amenities_id) is not unique in the table (and both fields are not NULL), as explained in the fragment quoted from the documentation.
It seems the UNION picks one (random) id from each group of (etablissement_id, amenities_id) where both etablissement_id and amenities_id are non-NULL. The outer SELECT intends to ignore the ids chosen by the UNION and hand the rest of them to DELETE.
(I think the intermediate SELECT is not even needed; you could use its WHERE clause directly in the DELETE query.)
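A sketch of that simplification, keeping the original subqueries as they are; note that MySQL may reject a DELETE whose subquery reads the same table (error 1093), in which case wrapping the UNION in a derived table, as below, is the usual workaround:
DELETE FROM doublon
WHERE id NOT IN (
    SELECT id FROM (
        SELECT id
        FROM doublon
        GROUP BY etablissement_id, amenities_id
        HAVING COUNT(etablissement_id) > 1 AND COUNT(amenities_id) > 1
        UNION
        SELECT id
        FROM doublon
        GROUP BY etablissement_id, amenities_id
        HAVING COUNT(etablissement_id) = 1 AND COUNT(amenities_id) = 1
    ) AS keep_ids
);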
The only reason I can imagine you need to run this query is that table doublon is the correspondence table of a many-to-many relationship that was created without a UNIQUE index on (etablissement_id, amenities_id) (the FOREIGN KEY columns imported from the related tables).
If this is your intention then there are simpler ways to achieve this goal.
I would create a duplicate of the doublon table with the correct structure, then use an INSERT ... SELECT query with DISTINCT to copy the needed values from the old table. Then I would swap the tables and remove the old one.
The queries:
# Create the new table
CREATE TABLE `doublon_fixed` LIKE `doublon`;
# Add the needed UNIQUE INDEX
ALTER TABLE `doublon_fixed`
ADD UNIQUE INDEX `etablissement_amenities`(`etablissement_id`, `amenities_id`);
# Copy the needed values
INSERT INTO `doublon_fixed` (`etablissement_id`, `amenities_id`)
SELECT DISTINCT `etablissement_id`, `amenities_id`
FROM `doublon`;
# Swap the tables
RENAME TABLE `doublon` TO `doublon_old`, `doublon_fixed` TO `doublon`;
# Remove the old table
DROP TABLE `doublon_old`;
The RENAME query performs the renames atomically, from left to right. It is useful to avoid downtime.
Notes:
[1] If the id column is functionally dependent on the (etablissement_id, amenities_id) pair then all the groups produced by the UNION-ed queries contain a single row. The first SELECT won't produce any result and the second SELECT will return the entire table.
If I am not wrong, this should work:
DELETE FROM doublon
WHERE id IN (SELECT id
FROM doublon
WHERE id NOT IN (SELECT id
FROM doublon
GROUP BY etablissement_id,
amenities_id
HAVING Count(etablissement_id) >= 1
AND Count(amenities_id) >= 1))
Ok, so I've got a MySQL database with several tables. One of the tables (table A) has the items of most interest to me.
It has a column called type and a column called entity_id. The primary key is something called registration_id, which is more or less irrelevant to me currently.
Ultimately, I want to gather all items of a particular type, but which have a unique entity_id. The only problem with this is that entity_id in table A is NOT a unique key. It is possible to have multiple registration_ids per entity_id.
Now, there's another table (table B) which has only a list of unique entity_ids (that is, entity_id is the primary key on that table); however, there's no information on the type in that table.
So with these two tables, what is the best way to get the data I want?
I was thinking of some sort of way (DISTINCT) I could use on the first table alone, or possibly a join of some sort (I'm still relatively new to the concept of joins) between table A and table B, combining the entity_id from table B with the type from table A.
What's the most efficient database operation for this right now? And should I (eventually, not right now, as I simply do not have the time, sadly) change the database structure for greater efficiency?
If anyone needs any additional information or graphics, let me know.
If I understand correctly, you can use either GROUP BY
SELECT entity_id
FROM table1
WHERE type = ?
GROUP BY entity_id
or DISTINCT
SELECT DISTINCT entity_id
FROM table1
WHERE type = ?
Here is the SQLFiddle demo.
Table joins are a costly operation. If you are dealing with large datasets, the time it takes to execute a join is non-negligible.
The following SQL statement will grab all entity_ids and group them by type, so each (type, entity_id) combination appears only once in the result set:
SELECT type, entity_id FROM TableA GROUP BY type, entity_id;
I think this is what you are looking for. Try this to get the types that have only one (unique) entity_id:
SELECT type, COUNT(entity_id)
FROM table1
GROUP BY type
HAVING COUNT(entity_id)=1
Here is the SQL Fiddle
I have this data in a column
Steffi | ND Baumecker | Cassy
I would like to write a query to find out if any of the above exist in another column.
Example of the other column (Artist being the column name):
Artist
Steffi
Derrick Carter
Ben Klock
Craig Richards
I don't think a LIKE will work here, so I'm wondering what query I can use to return the artist name from the 'Artist' column when a match is made - so in the above example 'Steffi' would be returned.
Would I also need to remove the spaces before and after the | in the first column?
Thanks!
If I understand your problem properly: you want to filter rows using values from one column, searching for these values in another column?
SELECT a.first_name, a.last_name, a.nickname
FROM artist AS a
WHERE a.related_nickname IN (
SELECT sa.nickname
FROM artist AS sa
WHERE sa.popularity > 30
)
MySQL documentation: http://dev.mysql.com/doc/refman/5.1/en/any-in-some-subqueries.html
It seems that you are trying to achieve a complicated task and I'd advise you to try a couple of things.
Subqueries are useful but can make your queries much slower, so using two queries might speed things up. The first query would pick the values that will be used for filtering, and the second query would search the rows.
If you filter by string, consider using indexes on your table: http://dev.mysql.com/doc/refman/5.1/en/create-index.html
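For example, something like this sketch (the artist and nickname names come from the illustration above, not necessarily from your schema):
CREATE INDEX idx_artist_nickname ON artist (nickname);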
I have the following query left joining 2 tables:
explain
select
n.* from npi n left join npi_taxonomy nt on n.NPI=nt.NPI_CODE
where
n.Provider_First_Name like '%s%' and
n.Provider_Last_Name like '%b%' and
n.Provider_Business_Practice_Location_Address_State_Name = 'SC' and
n.Provider_Business_Practice_Location_Address_City_Name = 'charleston' and
n.Provider_Business_Practice_Location_Address_Postal_Code in (29001,29003,29010,29016,29018,29020,29030,29032,29033,29038,29039,29040,29041,29042,29044,29045,29046,29047,29048,29051,29052,29053,29056,29059,29061,29062,29069,29071,29072,29073,29078,29079,29080,29081,29082,29102,29104,29107,29111,29112,29113,29114,29115,29116,29117,29118,29123,29125,29128,29133,29135,29137,29142,29143,29146,29147,29148,29150,29151,29152,29153,29154,29160,29161,29162,29163,29164,29168,29169,29170,29171,29172,29201,29202,29203,29204,29205,29206,29207,29208,29209,29210,29212,29214,29215,29216,29217,29218,29219,29220,29221,29222,29223,29224,29225,29226,29227,29228,29229,29230,29240,29250,29260,29290,29292,29401,29402,29403,29404,29405,29406,29407,29409) and
n.Entity_Type_Code = 1 and
nt.Healthcare_Provider_Taxonomy_Code in ('101Y00000X')
limit 0,10;
I have added a multi-column index:
npi_fname_lname_state_city_zip_entity on the table npi, which indexes the columns in the following order:
NPI,
Provider_First_Name,
Provider_Last_Name,
Provider_Business_Practice_Location_Address_State_Name,
Provider_Business_Practice_Location_Address_City_Name,
Provider_Business_Practice_Location_Address_Postal_Code,
Entity_Type_Code
However, when I do an EXPLAIN on the query, it shows me that it uses the primary index (NPI). Also, it says rows examined = 1.
What's worse: the query takes roughly 120 seconds to execute. How do I optimize this?
I would really appreciate some help regarding this.
The reason your multi-column index doesn't help is that you are filtering with a wildcard like '%s%'.
Indexes can only be used when filtering on a leftmost prefix of the index, which means that 1) you cannot do a "contains" search, and 2) if the leftmost column of the multi-column index cannot be used, the other columns in the index cannot be used either.
You should switch the order of the columns in the index to
Provider_Business_Practice_Location_Address_State_Name,
Provider_Business_Practice_Location_Address_City_Name,
Provider_Business_Practice_Location_Address_Postal_Code,
Entity_Type_Code
That way MySQL will only scan the rows that match the criteria on those columns (SC, charleston, etc.).
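In DDL terms, that reordered index might look like this (a sketch; the index name is just a placeholder):
ALTER TABLE npi
  ADD INDEX npi_state_city_zip_entity (
    Provider_Business_Practice_Location_Address_State_Name,
    Provider_Business_Practice_Location_Address_City_Name,
    Provider_Business_Practice_Location_Address_Postal_Code,
    Entity_Type_Code
  );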
Alternatively, look into full-text indexes.
The MySQL 5.4 documentation, on Optimizing Queries with EXPLAIN, says this about these Extra remarks:
Using index
The column information is retrieved from the table using only information in the index tree without having to do an additional seek to read the actual row. This strategy can be used when the query uses only columns that are part of a single index.
[...]
Using index condition
Tables are read by accessing index tuples and testing them first to determine whether to read full table rows. In this way, index information is used to defer (“push down”) reading full table rows unless it is necessary.
Am I missing something, or do these two mean the same thing (i.e. "didn't read the row, index was enough")?
An example explains it best:
SELECT Year, Make -- possibly more fields and/or from extra tables
FROM myUsedCarInventory
WHERE Make = 'Toyota' AND Year > '2006'
Assuming the available indexes are:
CarId
VIN
Make
Make and Year
This query would EXPLAIN with 'Using index' because it doesn't need to "hit" the myUsedCarInventory table itself at all, since the "Make and Year" index "covers" its needs with regard to the elements of the WHERE clause (and the SELECT list) that pertain to that table.
Now, imagine we keep the query the same, except for the addition of a condition on the color:
...
WHERE Make = 'Toyota' AND Year > '2006' AND Color = 'Red'
This query would likely EXPLAIN with 'Using index condition' (the 'likely' here is for the case where Toyota + year is not estimated to be selective enough, and the optimizer decides to just scan the table). This means that MySQL would FIRST use the index to resolve Make + Year, and it would then have to look up the corresponding row in the table as well, but only for the rows that satisfy the Make + Year conditions. That's what is sometimes referred to as "push-down optimization".
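For reference, here is how you might reproduce both cases (a sketch only; the table is hypothetical and the exact Extra values depend on your MySQL version, statistics, and optimizer decisions):
-- Hypothetical table and indexes matching the example above
CREATE TABLE myUsedCarInventory (
  CarId INT PRIMARY KEY,
  VIN   VARCHAR(17),
  Make  VARCHAR(32),
  Year  SMALLINT,
  Color VARCHAR(16),
  KEY idx_vin (VIN),
  KEY idx_make (Make),
  KEY idx_make_year (Make, Year)
);

-- Covered entirely by the (Make, Year) index: expected to show "Using index"
EXPLAIN SELECT Year, Make
FROM myUsedCarInventory
WHERE Make = 'Toyota' AND Year > 2006;

-- Color is not in the index, so the matching rows must be read from the table:
-- per the explanation above, expected to show "Using index condition"
EXPLAIN SELECT Year, Make
FROM myUsedCarInventory
WHERE Make = 'Toyota' AND Year > 2006 AND Color = 'Red';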
The difference is that "Using index" doesn't need a lookup from the index to the table, while "Using index condition" sometimes does. I'll try to illustrate this with an example. Say you have this table:
id, name, location
With an index on
name, id
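In DDL form, that setup might look like this (a sketch; column types are placeholders, and the table name is kept as `table` to match the queries below):
CREATE TABLE `table` (
  id INT PRIMARY KEY,
  name VARCHAR(64),
  location VARCHAR(64),
  KEY idx_name_id (name, id)
);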
Then this query doesn't need the table for anything; it can retrieve all its information "Using index":
select id, name from `table` where name = 'Piskvor'
But this query needs a table lookup for all rows where name equals 'Piskvor', because it can't retrieve location from the index:
select id from `table` where name = 'Piskvor' and location = 'North Pole'
The query can still use the index to limit the results to the small set of rows with a particular name, but it has to look at those rows in the table to check whether the location matches too.