Removing all duplicates except one - optimized queries - mysql

I have tried the following two queries:
delete from app where not exists
(select a2.app_package, max(a2.id) from (select * from app) as a2
where a2.app_package = app.app_package having max(a2.id) = app.id);
AND
DELETE FROM app
USING app,
(select app_package, max(id) as ID from app
group by app_package
) as A
where A.ID > app.ID AND
A.app_package = app.app_package;
and am really stuck as to which one would execute faster.
SQLFiddles:
http://sqlfiddle.com/#!2/46498/1
http://sqlfiddle.com/#!2/142593/1
Both execution plans are the same:
ID SELECT_TYPE TABLE TYPE POSSIBLE_KEYS KEY KEY_LEN REF ROWS FILTERED EXTRA
1 SIMPLE app ALL 7 100
Are there further optimizations that could be made?

The execution plan you are showing is not that of the DELETE query but that of the SELECT * FROM app query, which simply does a full table scan (as expected, since you aren't filtering on anything).
To see the real execution plan, you need to run EXPLAIN on the delete statements instead (apparently not possible in sqlfiddle).
I took the liberty of assuming that you have an index on app_package. If you don't, you should definitely add one.
The first example (simply replace DELETE FROM with SELECT * FROM) shows that you are doing full table scans (bad) and using a DEPENDENT SUBQUERY, which will be run for almost every record in the outer table (also bad).
1 PRIMARY app ALL 7 Using where
2 DEPENDENT SUBQUERY <derived3> ALL 7 Using where
3 DERIVED app ALL 7
To see that of the second one, you will have to translate the delete into a SELECT statement, something like this
SELECT * FROM app, (
SELECT app_package, MAX( id ) AS ID
FROM app
GROUP BY app_package
) AS A
WHERE A.ID > app.ID
AND A.app_package = app.app_package
which gives
1 PRIMARY <derived2> ALL 4
1 PRIMARY app ref 1 Using where
2 DERIVED app index 7
As you can see, this one isn't using dependent subqueries and avoids full table scans of app. This will definitely run faster as the amount of data in the table grows.
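To check that a "keep the highest id per app_package" delete leaves the rows you expect, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for MySQL. The sample rows and the index name idx_app_package are made up for illustration. SQLite has no multi-table DELETE ... USING, so the sketch uses a portable NOT IN form built on the same grouped MAX(id) subquery. Note that MySQL would reject this exact statement unless the subquery is wrapped in a derived table, as the first query in the question does with (select * from app).

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL `app` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE app (id INTEGER PRIMARY KEY, app_package TEXT)")
conn.execute("CREATE INDEX idx_app_package ON app (app_package)")  # hypothetical index name
conn.executemany(
    "INSERT INTO app (id, app_package) VALUES (?, ?)",
    [(1, "com.a"), (2, "com.a"), (3, "com.b"), (4, "com.a"),
     (5, "com.c"), (6, "com.b"), (7, "com.c")],
)

# Keep only the row with the highest id for each app_package.
# The grouped subquery plays the role of the derived table A in the answer.
conn.execute(
    """
    DELETE FROM app
    WHERE id NOT IN (SELECT MAX(id) FROM app GROUP BY app_package)
    """
)
conn.commit()

remaining = conn.execute(
    "SELECT id, app_package FROM app ORDER BY id"
).fetchall()
print(remaining)
```

Running it on a copy of the data first is a cheap way to confirm the result before touching the real table.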

Related

Laravel Join tables and group by sum query too slow

I am using the Laravel query builder to get the desired results from the database. The following query is working perfectly but takes too much time to return results. Can you please help me with this?
select
`amz_ads_sp_campaigns`.*,
SUM(attributedUnitsOrdered7d) as order7d,
SUM(attributedUnitsOrdered30d) as order30d,
SUM(attributedSales7d) as sale7d,
SUM(attributedSales30d) as sale30d,
SUM(impressions) as impressions,
SUM(clicks) as clicks,
SUM(cost) as cost,
SUM(attributedConversions7d) as attributedConversions7d,
SUM(attributedConversions30d) as attributedConversions30d
from
`amz_ads_sp_product_targetings`
inner join `amz_ads_sp_report_product_targetings` on `amz_ads_sp_product_targetings`.`campaignId` = `amz_ads_sp_report_product_targetings`.`campaignId`
inner join `amz_ads_sp_campaigns` on `amz_ads_sp_report_product_targetings`.`campaignId` = `amz_ads_sp_campaigns`.`campaignId`
where
(
`amz_ads_sp_product_targetings`.`user_id` = ?
and `amz_ads_sp_product_targetings`.`profileId` = ?
)
group by
`amz_ads_sp_product_targetings`.`campaignId`
Result of Explain SQL
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | amz_ads_sp_report_product_targetings | ALL | campaignId | NULL | NULL | NULL | 50061 | Using temporary; Using filesort
1 | SIMPLE | amz_ads_sp_campaigns | ref | campaignId | campaignId | 8 | pr-amz-ppc.amz_ads_sp_report_product_targetings.ca... | 1 |
1 | SIMPLE | amz_ads_sp_product_targetings | ref | campaignId | campaignId | 8 | pr-amz-ppc.amz_ads_sp_report_product_targetings.ca... | 33 | Using where
Your query could benefit from several indices to cover the WHERE clause as well as the join conditions:
CREATE INDEX idx1 ON amz_ads_sp_product_targetings (
user_id, profileId, campaignId);
CREATE INDEX idx2 ON amz_ads_sp_report_product_targetings (
campaignId);
CREATE INDEX idx3 ON amz_ads_sp_campaigns (campaignId);
The first index idx1 covers the entire WHERE clause, which might let MySQL throw away many records on the initial scan of the amz_ads_sp_product_targetings table. It also includes the campaignId column, which is needed for the first join. The second and third indices cover the join columns of each respective table. This might let MySQL do a more rapid lookup during the join process.
Note that selecting amz_ads_sp_campaigns.* is not valid with this GROUP BY unless the campaignId of that table is its primary key. Also, there isn't much else we can do to speed up the query, as SUM, by its nature, requires touching every record in order to come up with the result sum.
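To see whether a composite index like idx1 can serve the WHERE clause, you can inspect the plan. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL, so the plan wording differs from MySQL's EXPLAIN, and the column types are assumptions. It only checks that the planner picks idx1 for the user_id/profileId filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column types are assumed; only the columns relevant to idx1 are modelled.
conn.execute(
    "CREATE TABLE amz_ads_sp_product_targetings ("
    "id INTEGER PRIMARY KEY, user_id INTEGER, profileId INTEGER, campaignId INTEGER)"
)
conn.execute(
    "CREATE INDEX idx1 ON amz_ads_sp_product_targetings "
    "(user_id, profileId, campaignId)"
)

# Ask SQLite how it would run the filter that the WHERE clause applies.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT campaignId FROM amz_ads_sp_product_targetings "
    "WHERE user_id = ? AND profileId = ?",
    (1, 2),
).fetchall()

# The human-readable description is the fourth column of each plan row.
plan_details = [row[3] for row in plan]
print(plan_details)
```

On MySQL, the equivalent check is to run EXPLAIN on the full query and look at the key column for each table.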

MySQL: Efficiency of views containing GROUP BY

The fact that I haven't been able to come up (or research) a solution to this question means that I'm either too stupid to read the docs or it is in fact a complicated problem.
In a rather big database I often need a query like this:
SELECT ... WHERE condition GROUP BY something;
This takes a fraction of a second to complete. So I put this in a VIEW:
CREATE VIEW view_x AS SELECT ... GROUP BY something;
And when I then do
SELECT * FROM view_x WHERE condition;
it takes more than a minute to complete. Now it's easy to see why: In the plain SELECT, the DB engine first selects a few hundred results from millions of records and then does the aggregating and grouping only on the matching records. When using the view, it seems to first evaluate the entire dataset, aggregating and grouping everything, and then returns only the records meeting the condition and throwing away the expensively calculated rest.
Is there a more intelligent VIEW solution, or do I have to use the full SELECT each time?
Thanks.
EDIT: Here's the original SQL code for the view:
CREATE VIEW v_status1 AS SELECT
FROM_UNIXTIME(J.ts_start) AS job_start,
J.id AS job_id, J.carrier, J.n_wafers,
count(W.id) AS n
FROM job AS J
JOIN wafer AS W ON J.id=W.job_id
GROUP BY J.carrier, J.n_wafers, W.status_id;
table job: 100k records, table wafer: 2M records.
Comparison is between these queries:
SELECT * FROM v_status1 WHERE carrier LIKE 'W96L00%'; -- very slow
versus the identical SELECT in the VIEW definition with the WHERE clause before the GROUP BY clause.
Some additional information: The query yields 9 records. Using the view it takes 19 seconds to execute. Using the direct query, it takes 0.000 seconds according to MySQL Workbench.
When I replace the WHERE clause in the direct query by a HAVING clause with the same condition at the end of the query, I end up at the same execution time as the query using the view.
Yes, I forgot some columns in the GROUP BY part. Put them in, doesn't make much of a difference.
Minimal example (5 seconds execution time):
CREATE VIEW v_status2 AS SELECT
job_id,
status_id,
count(id) AS n
FROM wafer
GROUP BY job_id, status_id;
yields 2 records given some job_id
Well, I did the obvious and asked MySQL to EXPLAIN. The output is below. My interpretation is what I suspected all along: MySQL first builds a temporary table, doing all the hard work of aggregating and grouping, and then selects only the rows matching the selection criteria. In other words, MySQL is not intelligent enough to first analyze the view to find where it can efficiently cull the original dataset and only work on the remaining records.
BTW, this has nothing to do with joins and indexes. You can see the effect with any sufficiently large two-column table.
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 952929 | Using where
2 | DERIVED | WS | index | PRIMARY | ix_waferstatus_text | 123 | NULL | 9 | Using index; Using temporary; Using filesort
2 | DERIVED | W | ref | ix_wafer_job_id,wafer_ibfk_2 | wafer_ibfk_2 | 5 | jobwatch.WS.id | 105881 | Using where
2 | DERIVED | J | eq_ref | PRIMARY,job_ibkf_2 | PRIMARY | 4 | jobwatch.W.job_id | 1 | Using where
2 | DERIVED | T | eq_ref | PRIMARY | PRIMARY | 4 | jobwatch.J.tool_id | 1 |
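The <derived2> row above is consistent with the view being materialized in full before the outer WHERE is applied. One way to keep the convenience of a single definition while still filtering before the GROUP BY is to wrap the direct SELECT in a small helper. This is a rough sketch using Python's built-in sqlite3 module as a stand-in for MySQL. The helper name status_counts and the sample rows are hypothetical. It shows only that both forms return the same rows, not the timing difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wafer (id INTEGER PRIMARY KEY, job_id INTEGER, status_id INTEGER)"
)
conn.executemany(
    "INSERT INTO wafer (job_id, status_id) VALUES (?, ?)",
    [(1, 10), (1, 10), (1, 20), (2, 10), (2, 30), (3, 20)],
)
# Same shape as the minimal view from the question.
conn.execute(
    "CREATE VIEW v_status2 AS "
    "SELECT job_id, status_id, COUNT(id) AS n "
    "FROM wafer GROUP BY job_id, status_id"
)


def status_counts(connection, job_id):
    """Hypothetical helper: apply the filter first, then group."""
    return connection.execute(
        "SELECT job_id, status_id, COUNT(id) AS n FROM wafer "
        "WHERE job_id = ? GROUP BY job_id, status_id ORDER BY status_id",
        (job_id,),
    ).fetchall()


# Filtering the view: the condition is written after the grouping.
via_view = conn.execute(
    "SELECT * FROM v_status2 WHERE job_id = ? ORDER BY status_id", (1,)
).fetchall()
# Filtering inside the query: the condition is written before the grouping.
direct = status_counts(conn, 1)
print(via_view, direct)
```

The helper keeps the SQL in one place, which is most of what the view was meant to provide.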

optimization of mysql explain output

I have just run MySQL EXPLAIN on a query and was surprised to see that it has to examine more than 250,000 records to sort the result, even though I have an index for the WHERE clause. I agree with what EXPLAIN reports, since that many new rows have been added, but how do I sort out this issue? The table structure is below.
tableA is a forum table where users can post content:
id userid created title
1 3 12232 xyz
2 etc...............
my mysql query is
explain SELECT * from tableA where userid='2' order by created desc limit 3
output of this explain query is
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | tableA | ref | userid | userid | 4 | const | 275216 | Using where; Using temporary; Using filesort
My worry is how to reduce this to 3-4 rows, as I am interested in displaying only three results, but MySQL is examining 275216 records before returning the output. Those 275216 records were created after userid 2 posted in the forum. How can I tell MySQL to look only at the specific data so that it searches a very small set of rows? I want MySQL to examine at most 20-30 rows to serve 3 rows.
Try this:
SELECT * from tableA FORCE INDEX (userid) where userid=2 order by created desc limit 3
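The EXPLAIN in the question already shows userid as the chosen key, so the remaining cost is probably the sort on created. A different option, not part of the answer above, is a composite index on (userid, created), which lets the engine read the newest rows for one user in index order. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL. The index names and column types are assumptions, and SQLite's plan wording differs from MySQL's. It compares the plan with a single-column index against the plan with the composite one.

```python
import sqlite3


def plan_for(index_sql):
    """Return SQLite's plan descriptions for the forum query, given one index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tableA (id INTEGER PRIMARY KEY, userid INTEGER, "
        "created INTEGER, title TEXT)"
    )
    conn.execute(index_sql)
    rows = conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT * FROM tableA WHERE userid = 2 ORDER BY created DESC LIMIT 3"
    ).fetchall()
    conn.close()
    # The human-readable description is the fourth column of each plan row.
    return [row[3] for row in rows]


# Index on userid only: the lookup is indexed, the ORDER BY needs a sort step.
single = plan_for("CREATE INDEX idx_userid ON tableA (userid)")
# Index on (userid, created): rows for one user are already ordered by created.
composite = plan_for(
    "CREATE INDEX idx_userid_created ON tableA (userid, created)"
)
print(single)
print(composite)
```

On MySQL, the thing to look for after adding such an index is whether "Using filesort" disappears from the Extra column.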

Query Optimization for Friends Feed - MySQL

I'm having some weird trouble with a friends feed query. Here is the background:
I have 3 tables
checkin - around 13m records
users - around 250k records
friends - around 1.5m records
The checkin table lists activities performed by users. (There are numerous indexes; among them are indexes on user_id, on created_at, and on (user_id, created_at).)
The users table is just the basic user information. There is an index on user_id.
The friends table has a user_id, target_id and is_approved. There is an index on the (user_id, is_approved) fields.
In my query, I am trying to pull down just a basic friends feed of any users - so I have been doing this:
SELECT checkin_id, created_at
FROM checkin
WHERE (user_id IN (SELECT friend_id from friends where user_id = 1 and is_approved = 1) OR user_id = 1)
ORDER by created_at DESC
LIMIT 0, 15
The goal of the query is just to pull the checkin_id and created_at for all of a user's friends plus the user's own activity. It's a pretty simple query, and when a user's friends have tons of recent activity, it is very quick. Here is the EXPLAIN:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | PRIMARY | checkin | index | user_id,user_id_2 | created_at | 8 | NULL | 15 | Using where
2 | DEPENDENT SUBQUERY | friends | eq_ref | user_id,friend_id,is_approved,friend_looku... | PRIMARY | 8 | const,func | 1 | Using where
As an explanation, user_id is a simple index on user_id, while user_id_2 is an index on user_id and created_at. On the friends table, friend_lookup is the index on user_id and is_approved.
This is a very simple query and gets completed in: Showing rows 0 - 14 (15 total, Query took 0.0073 sec).
However, when a user's friends' activity is not very recent and there isn't a lot of data, the same query takes around 5-7 seconds. It has the same EXPLAIN as the previous query but takes longer.
The number of friends doesn't seem to have an effect; the query just seems to speed up with more recent activity.
Are there any tips for optimizing these queries so that they run at the same speed regardless of activity?
Server Setup
This is a dedicated MySQL server running 16GB of RAM. It is running Ubuntu 10.10 and the version of MySQL is 5.1.49
UPDATE
Most people have suggested removing the IN piece and moving it into an INNER JOIN:
SELECT c.checkin_id, c.created_at
FROM checkin c
INNER JOIN friends f ON c.user_id = f.friend_id
WHERE f.user_id =1
AND f.is_approved =1
ORDER BY c.created_at DESC
LIMIT 0 , 15
This query is 10x worse - as reported in the EXPLAIN:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | f | ref | PRIMARY,user_id,friend_id,is_approved,friend_looku... | friend_lookup | 5 | const,const | 938 | Using temporary; Using filesort
1 | SIMPLE | c | ref | user_id,user_id_2 | user_id | 4 | untappd_prod.f.friend_id | 71 | Using where
The goal of this query is to get all the friends' activity, and yours, in the same query (instead of having to create two queries, merge the results together and sort by created_at). I also can't remove the index on user_id, as it's an important piece of another query.
The interesting part is that when I run this query on a user account that doesn't have a lot of activity, I get this EXPLAIN:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | f | index_merge | PRIMARY,user_id,friend_id,is_approved,friend_looku... | user_id,friend_lookup | 4,5 | NULL | 11 | Using intersect(user_id,friend_lookup); Using wher...
1 | SIMPLE | c | ref | user_id,user_id_2 | user_id | 4 | untappd_prod.f.friend_id | 71 | Using where
Any advice?
You have a few things going on here.
In the EXPLAIN output, the index the optimizer actually chose is the one under "key", not the ones listed under possible_keys. Here that is created_at, so the query walks checkins in date order and filters by user, which is why it has to scan more records when the matching data is not recent.
On the checkin table, only the (user_id, created_at) and created_at indexes are necessary. You don't need a separate index on user_id: the optimizer can use (user_id, created_at), since user_id is its leading column.
Try this:
1. Use a join between friends and checkin and remove the IN clause, so that friends becomes the driving table. You should see it first in the execution path of your EXPLAIN plan.
2. With step 1 done, make sure that checkin is using the (user_id, created_at) index in the execution path.
3. Write another query for the OR condition, where user_id from the checkin table is 1. I think the two result sets should be mutually exclusive, so that should be fine; otherwise you would not have needed the OR condition after the IN clause in the first place.
4. Remove the standalone user_id index, as you already have the (user_id, created_at) index.
Your goal is for the right index to appear under key, not just under possible_keys.
This should take care of older, non-recent checkins as well as recent ones.
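The "join plus a separate query for the user's own checkins" idea can be written as a single statement with UNION ALL. Here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for MySQL. The sample rows and the derived-table alias feed are made up. It assumes a user is never listed as their own friend, so the two branches do not overlap. It checks only that the split form returns the same rows as the original IN ... OR query, not how fast it runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE checkin (checkin_id INTEGER PRIMARY KEY, user_id INTEGER, created_at INTEGER);
    CREATE INDEX user_id_2 ON checkin (user_id, created_at);
    CREATE TABLE friends (user_id INTEGER, friend_id INTEGER, is_approved INTEGER);
    INSERT INTO checkin (checkin_id, user_id, created_at) VALUES
        (1, 1, 100), (2, 2, 105), (3, 3, 110), (4, 4, 120), (5, 2, 130), (6, 1, 140);
    INSERT INTO friends (user_id, friend_id, is_approved) VALUES
        (1, 2, 1), (1, 3, 0), (1, 4, 1);
    """
)

# Original form from the question: IN subquery combined with OR.
original = conn.execute(
    """
    SELECT checkin_id, created_at FROM checkin
    WHERE (user_id IN (SELECT friend_id FROM friends
                       WHERE user_id = 1 AND is_approved = 1)
           OR user_id = 1)
    ORDER BY created_at DESC LIMIT 15
    """
).fetchall()

# Split form: friends drive the join, the user's own rows come from a second branch.
split = conn.execute(
    """
    SELECT checkin_id, created_at FROM (
        SELECT c.checkin_id AS checkin_id, c.created_at AS created_at
        FROM friends f
        JOIN checkin c ON c.user_id = f.friend_id
        WHERE f.user_id = 1 AND f.is_approved = 1
        UNION ALL
        SELECT checkin_id, created_at FROM checkin WHERE user_id = 1
    ) AS feed
    ORDER BY created_at DESC LIMIT 15
    """
).fetchall()
print(split)
```

Whether this is faster on MySQL 5.1 depends on the plan it produces, so it is worth running EXPLAIN on both branches separately.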
My first suggestion is to remove the dependent subquery and turn it into a join. I've found that MySQL is not good at processing these types of queries. Try this:
SELECT c.checkin_id, c.created_at
FROM checkin c
INNER JOIN friends f
ON c.user_id = f.friend_id
WHERE f.user_id = 1
AND f.is_approved = 1
ORDER by c.created_at DESC
LIMIT 0, 15
My second suggestion, since you have a dedicated server, is to use the InnoDB storage engine for all your tables. Make sure that you tweak default InnoDB settings, especially for innodb_buffer_pool_size: http://www.mysqlperformanceblog.com/2007/11/03/choosing-innodb_buffer_pool_size/

MySQL join query performance issue

I am running the query below:
SELECT packages.id, packages.title, subcat.id, packages.weight
FROM packages ,provider, packagestosubcat,
packagestocity, subcat, usertosubcat,
usertocity, usertoprovider
WHERE packages.endDate >'2011-03-11 06:00:00' AND
usertosubcat.userid = 1 AND
usertocity.userid = 1 AND
packages.providerid = provider.id AND
packages.id = packagestosubcat.packageid AND
packages.id = packagestocity.packageid AND
packagestosubcat.subcatid = subcat.id AND
usertosubcat.subcatid = packagestosubcat.subcatid AND
usertocity.cityid = packagestocity.cityid AND
(
provider.providertype = 'reg' OR
(
usertoprovider.userid = 1 AND
provider.providertype != 'reg' AND
usertoprovider.providerid = provider.ID
)
)
GROUP BY packages.title
ORDER BY subcat.id, packages.weight DESC
When I run EXPLAIN, everything seems to look OK except for the scan on the usertoprovider table, which doesn't seem to be using the table's keys:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | usertocity | ref | user,city | user | 4 | const | 4 | Using temporary; Using filesort
1 | SIMPLE | packagestocity | ref | city,packageid | city | 4 | usertocity.cityid | 419 |
1 | SIMPLE | packages | eq_ref | PRIMARY,enddate | PRIMARY | 4 | packagestocity.packageid | 1 | Using where
1 | SIMPLE | provider | eq_ref | PRIMARY,providertype | PRIMARY | 4 | packages.providerid | 1 | Using where
1 | SIMPLE | packagestosubcat | ref | subcatid,packageid | packageid | 4 | packages.id | 1 | Using where
1 | SIMPLE | subcat | eq_ref | PRIMARY | PRIMARY | 4 | packagestosubcat.subcatid | 1 |
1 | SIMPLE | usertosubcat | ref | userid,subcatid | subcatid | 4 | const | 12 | Using where
1 | SIMPLE | usertoprovider | ALL | userid,providerid | NULL | NULL | NULL | 3735 | Using where
As you can see in the above query, the condition itself is:
provider.providertype = 'reg' OR
(
usertoprovider.userid = 1 AND
provider.providertype != 'reg' AND
usertoprovider.providerid = provider.ID
)
Both tables, provider and usertoprovider, are indexed. provider has indexes on providerid and providertype, while usertoprovider has indexes on userid and providerid.
The cardinality of the keys is:
provider.id=47, provider.type=1, usertoprovider.userid=1245, usertoprovider.providerid=6
So it's quite obvious that the indexes are not used.
Furthermore, to test it out, I went ahead and:
1. Duplicated the usertoprovider table
2. Inserted all the provider values that have providertype='reg' into the cloned table
3. Simplified the condition to (usertoprovider.userid = 1 AND usertoprovider.providerid = provider.ID)
The query execution time changed from 8.1317 sec. to 0.0387 sec.
Still, provider values that have providertype='reg' are valid for all users, and I would like to avoid inserting these values into the usertoprovider table for every user, since this data is redundant.
Can someone please explain why MySQL still runs a full scan and doesn't use the keys? What can be done to avoid it?
It seems that provider.providertype != 'reg' is redundant (always true) unless provider.providertype is nullable and you want the query to fail on NULL.
And shouldn't != be <> instead to be standard SQL, although MySQL may allow !=?
On cost of table scans
A full table scan is not necessarily more expensive than walking an index, because walking an index still requires multiple page accesses. In many database engines, if your table is small enough to fit inside a few pages, and the number of rows is small enough, it will be cheaper to do a table scan. Database engines make this type of decision based on the data and index statistics of the table.
This case
However, in your case, it might also be because of the other leg in your OR clause: provider.providertype = 'reg'. If providertype is "reg", then this query joins in ALL the rows of usertoprovider (most likely not what you want) since it is a multi-table cross join.
The database engine is correct in determining that you'll likely need all the rows in usertoprovider anyway (unless none of the providertype values is "reg", which the engine may also know!).
The query hides this fact because you are grouping on the (MASSIVE!) result set later on and just returning the package ID, so you won't see how many usertoprovider rows have been returned. But it will run very slowly. Get rid of the GROUP BY clause to find out how many rows you are actually forcing the database engine to work on!!!
The reason you see a massive speed improvement if you fill out the usertoprovider table is because then every row participates in a join, and there is no full cross join happening in the case of "reg". Before, if you have 1,000 rows in usertoprovider, every row with type="reg" expands the result set 1,000 times. Now, that row joins with only one row in usertoprovider, and the result set is not expanded.
If you really want to pass anything with providertype='reg', even when it is not in your many-to-many mapping table, then the easiest way may be to use a sub-query:
1. Remove usertoprovider from your FROM clause
2. Use the following condition instead:
provider.providertype='reg' OR EXISTS (SELECT * FROM usertoprovider WHERE userid=1 AND providerid = provider.ID)
Another method is to use an OUTER JOIN on the usertoprovider -- any row with "reg" which is not in the table will come back with one row of NULL instead of expanding the result set.
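The row expansion described above is easy to reproduce on a toy dataset. The following sketch uses Python's built-in sqlite3 module as a stand-in for MySQL, with made-up rows and only the two tables involved. It contrasts the comma join with the OR condition against the EXISTS form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE provider (id INTEGER PRIMARY KEY, providertype TEXT);
    CREATE TABLE usertoprovider (userid INTEGER, providerid INTEGER);
    INSERT INTO provider (id, providertype) VALUES (1, 'reg'), (2, 'vip'), (3, 'vip');
    INSERT INTO usertoprovider (userid, providerid) VALUES (1, 2), (2, 2), (2, 3), (3, 3);
    """
)

# Comma join with OR: the 'reg' provider matches every usertoprovider row.
cross_join_rows = conn.execute(
    """
    SELECT provider.id
    FROM provider, usertoprovider
    WHERE provider.providertype = 'reg'
       OR (usertoprovider.userid = 1
           AND provider.providertype != 'reg'
           AND usertoprovider.providerid = provider.id)
    ORDER BY provider.id
    """
).fetchall()

# EXISTS form: usertoprovider is only probed, so each provider appears once.
exists_rows = conn.execute(
    """
    SELECT provider.id
    FROM provider
    WHERE provider.providertype = 'reg'
       OR EXISTS (SELECT * FROM usertoprovider
                  WHERE userid = 1 AND providerid = provider.id)
    ORDER BY provider.id
    """
).fetchall()
print(cross_join_rows, exists_rows)
```

With four rows in usertoprovider, the 'reg' provider shows up four times in the first result and once in the second, which is the effect the GROUP BY was hiding.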
Hmm, I know that MySQL does funny things with grouping. In any other RDBMS, your query won't even be executed. What does that even mean,
SELECT packages.id
[...]
GROUP BY packages.title
ORDER BY subcat.id, packages.weight DESC
You want to group by title. In standard SQL syntax, this means you can only select title and aggregate functions of the other columns. MySQL magically tries to execute (and probably guess) what you may have meant. So what would you expect to be selected as packages.id? The first matching package ID for every title? Or the last? And what would the ORDER BY clause mean with respect to the grouping? How can you order by columns that are not part of the result set (because only packages.title really is)?
There are two possibilities, as far as I can see:
1. You're on the right track with your query. In that case, remove the ORDER BY clause, because I don't think it will affect your result, but it may severely slow down your query.
2. You have a SQL problem, not a performance problem.
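For the GROUP BY ambiguity discussed above, one way to make the intent explicit is to say with aggregates which id and weight each title should carry. This is only a sketch of that idea, using Python's built-in sqlite3 module as a stand-in for MySQL. The sample rows and the choice of MIN(id) and MAX(weight) are assumptions, since the question does not say which values are wanted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE packages (id INTEGER PRIMARY KEY, title TEXT, weight INTEGER);
    INSERT INTO packages (id, title, weight) VALUES
        (1, 'Spa day', 5), (2, 'Spa day', 9), (3, 'City tour', 7);
    """
)

# Every selected column is either grouped or aggregated, so the result
# does not depend on which row the engine happens to pick per group.
rows = conn.execute(
    """
    SELECT title, MIN(id) AS first_id, MAX(weight) AS max_weight
    FROM packages
    GROUP BY title
    ORDER BY max_weight DESC
    """
).fetchall()
print(rows)
```

Ordering by an aggregate that appears in the select list also answers the question of what the ORDER BY means after grouping.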