Improve performance for query to delete duplicates

My hosting company recently gave me this entry from the slow-query log. The rows examined seem excessive and might be helping to slow down the server. A test in phpMyAdmin resulted in duration of 0.9468 seconds.
The Check_in table ordinarily contains 10,000 to 17,000 rows. It also has one index: Num, unique = yes, cardinality = 10852, collation = A.
I would like to improve this query. The first five conditions following WHERE contain the fields to check to throw out duplicates.
# User@Host: fxxxxx_member[fxxxxx_member] @ localhost []
# Query_time: 5 Lock_time: 0 Rows_sent: 0 Rows_examined: 701321
use fxxxxx_flifo;
SET timestamp=1364277847;
DELETE FROM Check_in USING Check_in,
Check_in as vtable WHERE
( Check_in.empNum = vtable.empNum )
AND ( Check_in.depCity = vtable.depCity )
AND ( Check_in.travelerName = vtable.travelerName )
AND ( Check_in.depTime = vtable.depTime )
AND ( Check_in.fltNum = vtable.fltNum )
AND ( Check_in.Num > vtable.Num )
AND ( Check_in.accomp = 'NO' )
AND Check_in.depTime >= TIMESTAMPADD ( MINUTE, 3, NOW() )
AND Check_in.depTime < TIMESTAMPADD ( HOUR, 26, NOW() );
Edit:
empNum int (6)
lastName varchar (30)
travelerName varchar (40) (99.9% = 'All')
depTime datetime
fltNum varchar (6)
depCity varchar (4)
23 fields total (including one blob, holding 25K images)
Edit:
ADD INDEX deleteQuery (empNum, lastName, travelerName, depTime, fltNum, depCity, Num)
Is this a matter of creating an index? If so, what type and what fields?
The last 3 conditions limit the number of rows, by asking if accomplished and within time period. Could they be better positioned (earlier) in the query? Is the 5th AND ... necessary?
Open to all ideas. Thanks for looking.

It's hard to know exactly how to help without seeing the table definition.
Don't delete the self-join (the same table mentioned twice) because this query is clearing out duplicates (check_in.Num > vtable.Num).
Do you have an index on depTime? If not, add one.
You may also want to add a compound index on
(empNum,depCity,travelerName,depTime,fltNum)
to optimize the self-join. You probably have to muck about a bit to figure out what works.
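As a starting point for that experimentation, the two suggestions above could be written roughly as follows. This is only a sketch: the index names idx_depTime and idx_dup_check are made up for illustration, the column names are taken from the question, and whether the optimizer actually picks either index has to be checked with EXPLAIN on the real data. The SELECT form of the self-join is used for EXPLAIN because older MySQL versions cannot EXPLAIN a DELETE.

```sql
-- Hypothetical index names; column names come from the question.
ALTER TABLE Check_in
  ADD INDEX idx_depTime (depTime),
  ADD INDEX idx_dup_check (empNum, depCity, travelerName, depTime, fltNum);

-- Inspect the plan using a SELECT with the same join and filters as the DELETE.
EXPLAIN
SELECT Check_in.Num
FROM Check_in
JOIN Check_in AS vtable
  ON  Check_in.empNum       = vtable.empNum
  AND Check_in.depCity      = vtable.depCity
  AND Check_in.travelerName = vtable.travelerName
  AND Check_in.depTime      = vtable.depTime
  AND Check_in.fltNum       = vtable.fltNum
  AND Check_in.Num          > vtable.Num
WHERE Check_in.accomp = 'NO'
  AND Check_in.depTime >= TIMESTAMPADD(MINUTE, 3, NOW())
  AND Check_in.depTime <  TIMESTAMPADD(HOUR, 26, NOW());
```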

If your objective is to delete duplicates, then the solution is to avoid having duplicates in the first place: define a unique index across the fields that you deem to collectively define a duplicate (but you won't be able to create the index while you have duplicates in the database).
The index you need for this query is on (depTime, empNum, depCity, travelerName, fltNum, Num, accomp), in that order. The depTime field has to come first for it to optimize the 2 accesses on the table. Once you've removed the duplicates, make the index unique.
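A rough sketch of those two steps is shown below. The index names are hypothetical. Note one assumption of this sketch: the unique index covers only the five columns that the question uses to define a duplicate, because an index that also contains Num (already unique) would be unique trivially and would not block new duplicates. It also assumes that no duplicates remain anywhere in the table, not just in the time window the DELETE touches.

```sql
-- Step 1: non-unique index for the delete query, depTime first (hypothetical name).
ALTER TABLE Check_in
  ADD INDEX idx_dedupe (depTime, empNum, depCity, travelerName, fltNum, Num, accomp);

-- Step 2: once the duplicates are gone, enforce uniqueness going forward.
-- This statement fails with a duplicate-key error if any duplicates remain.
ALTER TABLE Check_in
  ADD UNIQUE INDEX uq_check_in (empNum, depCity, travelerName, depTime, fltNum);
```

With the unique index in place, inserts of a repeated check-in are rejected (or can be skipped with INSERT IGNORE), so the periodic delete should no longer be needed.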
Leaving that aside for now, you've got a whole load of performance problems.
1) you appear to be offering some sort of commercial service - so why are you waiting for your ISP to tell you that your site is running like a dog?
2) while your indexes should be designed to prevent duplicates, there are many cases where other indexes will help with performance - but in order to understand what those are you need to look at all the queries running against your data.
3) the blob should probably be in a separate table
Could they be better positioned (earlier) in the query?
Order of predicates at the same level in the query hierarchy has no impact on performance.
is the 5th AND necessary?
If you mean 'AND ( Check_in.Num > vtable.Num )', then yes - without that it will remove all the rows that are duplicated - i.e. it won't leave one row behind.

The purpose of indexes is to speed up searches and filters... the index is (in layman terms) a sorted table that pin-points each row of the data (which may be itself unsorted).
So, if you want to speed up your delete query, it helps if the database knows where the data is. As rules of thumb, you will need to add indexes to the following fields:
Every primary or foreign key
Every date on which you perform frequent searches / filters
Every numeric field on which you perform frequent searches / filters
I avoid indexes on text fields, since they are quite expensive (in terms of space), but if you need to perform frequent searches on text fields, you should also index them.

SQL gets slow on a simple query with ORDER BY

I have a problem with MySQL ORDER BY: it slows down my query and I really don't know why. My query was a little more complex, so I simplified it to a light query with no joins, but it still runs really slowly.
Query:
SELECT
W.`oid`
FROM
`z_web_dok` AS W
WHERE
W.`sent_eRacun` = 1 AND W.`status` IN(8, 9) AND W.`Drzava` = 'BiH'
ORDER BY W.`oid` ASC
LIMIT 0, 10
The table has 946,566 rows and takes up 500 MB. The fields I am selecting on are all indexed, as follows:
oid - INT PRIMARY KEY AUTOINCREMENT
status - INT INDEXED
sent_eRacun - TINYINT INDEXED
Drzava - VARCHAR(3) INDEXED
[Screenshots from the original post, not reproduced: the EXPLAIN output, the query's execution time with ORDER BY, and its execution time after removing ORDER BY.]
I have also tried sorting by a DATETIME field, which is also indexed, but the query is just as slow as when ordering by the primary key. This started today; until now it was always fast and light.
What can cause something like this?

The kind of query you use here calls for a composite covering index. This one should handle your query very well.
CREATE INDEX someName ON z_web_dok (Drzava, sent_eRacun, status, oid);
Why does this work? You're looking for equality matches on the first three columns, and sorting on the fourth column. The query planner will use this index to satisfy the entire query. It can random-access the index to find the first row matching your query, then scan through the index in order to get the rows it needs.
Pro tip: Indexes on single columns are generally harmful to performance unless they happen to match the requirements of particular queries in your application, or are used for primary or foreign keys. You generally choose your indexes to match your most active, or your slowest, queries.
Edit: You asked whether it's better to create specific indexes for each query in your application. The answer is yes.

There may be an even faster way. (Or it may not be any faster.)
The IN(8, 9) gets in the way of easily handling the WHERE..ORDER BY..LIMIT completely efficiently. The possible solution is to treat that as OR, then convert to UNION and do some tricks with the LIMIT, especially if you might also be using OFFSET.
( SELECT ... WHERE .. = 8 AND ... ORDER BY oid LIMIT 10 )
UNION ALL
( SELECT ... WHERE .. = 9 AND ... ORDER BY oid LIMIT 10 )
ORDER BY oid LIMIT 10
This will allow the covering index described by OJones to be fully used in each of the subqueries. Furthermore, each will provide up to 10 rows without any temp table or filesort. Then the outer part will sort up to 20 rows and deliver the 'correct' 10.
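Filled in with the table and columns from the question, the pattern above might look like the following. Treat it as a sketch: it assumes the composite index (Drzava, sent_eRacun, status, oid) suggested in the other answer exists, and whether it actually beats the single IN(8, 9) query has to be measured.

```sql
-- Each branch has only equality tests, so the index can return
-- rows already ordered by oid and stop after 10.
(SELECT W.`oid`
   FROM `z_web_dok` AS W
  WHERE W.`Drzava` = 'BiH' AND W.`sent_eRacun` = 1 AND W.`status` = 8
  ORDER BY W.`oid` ASC
  LIMIT 10)
UNION ALL
(SELECT W.`oid`
   FROM `z_web_dok` AS W
  WHERE W.`Drzava` = 'BiH' AND W.`sent_eRacun` = 1 AND W.`status` = 9
  ORDER BY W.`oid` ASC
  LIMIT 10)
-- The outer part sorts at most 20 rows and keeps the first 10.
ORDER BY `oid` ASC
LIMIT 10;
```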
For OFFSET, see http://mysql.rjweb.org/doc.php/index_cookbook_mysql#or

Using index with IN clause and ordering by primary key

I am having a problem with the following task using MySQL. I have a table Records(id,enterprise, department, status). Where id is the primary key, and enterprise and department are foreign keys, and status is an integer value (0-CREATED, 1 - APPROVED, 2 - REJECTED).
Now, usually the application need to filter something for a concrete enterprise and department and status:
SELECT * FROM Records WHERE status = 0 AND enterprise = 11 AND department = 21
ORDER BY id desc LIMIT 0,10;
The order by is required, since I have to provide the user with the most recent records. For this query I have created an index (enterprise, department, status), and everything works fine. However, for some privileged users the status should be omitted:
SELECT * FROM Records WHERE enterprise = 11 AND department = 21
ORDER BY id desc LIMIT 0,10;
This obviously breaks the index - it's still good for filtering, but not for sorting. So, what should I do? I don't want create a separate index (enterprise, department), so what if I modify the query like this:
SELECT * FROM Records WHERE enterprise = 11 AND department = 21
AND status IN (0,1,2)
ORDER BY id desc LIMIT 0,10;
MySQL definitely does use the index now, since it's provided with values of status, but how quick will the sorting by primary key be? Will it take the recent 10 values for each status available, and then merge them, or will it first merge the ids for each status together, and only after that take the first ten (this way it's gonna be much slower I guess).

All of the queries will benefit from one composite index:
INDEX(enterprise, department, status, id)
enterprise and department can be swapped, but keep the rest of the columns in that order.
The first query will use that index for both the WHERE and the ORDER BY, thereby be able to find the 10 rows without scanning the table or doing a sort.
The second query is missing status, so my index is less than perfect. This would be better:
INDEX(enterprise, department, id)
At that point, it works like above. (Note: If the table is InnoDB, then this 3-column index is identical to your 2-column INDEX(enterprise, department) -- the PK is silently included.)
The third query gets dicier because of the IN. Still, my 4-column index will be nearly the best. It will use the first 3 columns, but not be able to do the ORDER BY id, so it won't use id. And it won't be able to consume the LIMIT. Hence the EXPLAIN will say Using temporary and/or Using filesort. Don't worry, performance should still be nice.
My second index is not as good for the third query.
See my Index Cookbook.
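For reference, the two indexes discussed above could be created and checked roughly like this. The index names are made up for illustration, and the EXPLAIN output depends on the MySQL version and the data, so treat the comments as what to look for rather than a guaranteed result.

```sql
-- Hypothetical index names; columns are from the question's Records table.
ALTER TABLE Records
  ADD INDEX idx_ent_dep_status_id (enterprise, department, status, id),
  ADD INDEX idx_ent_dep_id (enterprise, department, id);

-- First query: look for the 4-column index and no "Using filesort".
EXPLAIN SELECT * FROM Records
 WHERE status = 0 AND enterprise = 11 AND department = 21
 ORDER BY id DESC LIMIT 0,10;

-- Third query: the IN list may bring back "Using filesort".
EXPLAIN SELECT * FROM Records
 WHERE enterprise = 11 AND department = 21 AND status IN (0,1,2)
 ORDER BY id DESC LIMIT 0,10;
```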
"How quick will sorting by id be"? That depends on three things.
Whether the sort can be avoided (see above);
How many rows in the query without the LIMIT;
Whether you are selecting TEXT columns.
I was careful to say whether the INDEX is used all the way through the ORDER BY, in which case there is no sort, and the LIMIT is folded in. Otherwise, all the rows (after filtering) are written to a temp table, sorted, then 10 rows are peeled off.
The "temp table" I just mentioned is necessary for various complex queries, such as those with subqueries, GROUP BY, ORDER BY. (As I have already hinted, sometimes the temp table can be avoided.) Anyway, the temp table comes in 2 flavors: MEMORY and MyISAM. MEMORY is favorable because it is faster. However, TEXT (and several other things) prevent its use.
If MEMORY is used then Using filesort is a misnomer -- the sort is really an in-memory sort, hence quite fast. For 10 rows (or even 100) the time taken is insignificant.

Instructing MySQL to apply WHERE clause to rows returned by previous WHERE clause

I have the following query:
SELECT dt_stamp
FROM claim_notes
WHERE type_id = 0
AND dt_stamp >= :dt_stamp
AND DATE( dt_stamp ) = :date
AND user_id = :user_id
AND note LIKE :click_to_call
ORDER BY dt_stamp
LIMIT 1
The claim_notes table has about half a million rows, so this query runs very slowly since it has to search against the unindexed note column (which I can't do anything about). I know that when the type_id, dt_stamp, and user_id conditions are applied, I'll be searching against about 60 rows instead of half a million. But MySQL doesn't seem to apply these in order. What I'd like to do is to see if there's a way to tell MySQL to only apply the note LIKE :click_to_call condition to the rows that meet the former conditions so that it's not searching all rows with this condition.
What I've come up with is this:
SELECT dt_stamp
FROM (
SELECT *
FROM claim_notes
WHERE type_id = 0
AND dt_stamp >= :dt_stamp
AND DATE( dt_stamp ) = :date
AND user_id = :user_id
) AS filtered
WHERE note LIKE :click_to_call
ORDER BY dt_stamp
LIMIT 1
This works and is extremely fast. I'm just wondering if this is the right way to do this, or if there is a more official way to handle it.

It shouldn't be necessary to do this. The MySQL optimizer can handle it if you have multiple terms in your WHERE clause separated by AND. Basically, it knows how to do "apply all the conditions you can using indexes, then apply unindexed expressions only to the remaining rows."
But choosing the right index is important. A multi-column index is better for a series of AND terms than individual indexes. MySQL can apply index intersection, but that's much less effective than finding the same rows with a single index.
A few logical rules apply to creating multi-column indexes:
Conditions on unique columns are preferred over conditions on non-unique columns.
Equality conditions (=) are preferred over ranges (>=, IN, BETWEEN, !=, etc.).
After the first column in the index used for a range condition, subsequent columns won't use an index.
Most of the time, searching the result of a function on a column (e.g. DATE(dt_stamp)) won't use an index. It'd be better in that case to store a DATE data type and use = instead of >=.
If the condition matches > 20% of the table, MySQL probably will decide to skip the index and do a table-scan anyway.
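Applied to the query in the question, those rules suggest something like the sketch below. The index name and the :day_start / :day_end placeholders are hypothetical: the idea is that the application computes midnight of the wanted day and midnight of the next day, so that DATE(dt_stamp) = :date can be replaced by a plain range on the indexed column. Equality columns come first in the index and the range column comes last.

```sql
-- Hypothetical index name; equality columns first, range column last.
ALTER TABLE claim_notes
  ADD INDEX idx_type_user_stamp (type_id, user_id, dt_stamp);

SELECT dt_stamp
FROM claim_notes
WHERE type_id = 0
  AND user_id = :user_id
  AND dt_stamp >= :dt_stamp
  -- Replaces DATE(dt_stamp) = :date with a range the index can use.
  AND dt_stamp >= :day_start   -- e.g. '2013-03-26 00:00:00'
  AND dt_stamp <  :day_end     -- e.g. '2013-03-27 00:00:00'
  -- The unindexed LIKE is evaluated only on rows that survive the index lookup.
  AND note LIKE :click_to_call
ORDER BY dt_stamp
LIMIT 1;
```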
Here are some webinars by myself and my colleagues at Percona to help explain index design:
Tools and Techniques for Index Design
MySQL Indexing: Best Practices
Advanced MySQL Query Tuning
Really Large Queries: Advanced Optimization Techniques
You can get the slides for these webinars for free, and view the recording for free, but the recording requires registration.

Don't go for the derived table solution as it is not performant. I'm surprised that, with = and >= conditions available, MySQL is evaluating the LIKE first.
Anyway, I'd say you could try adding some indexes on those fields and see what happens:
ALTER TABLE claim_notes ADD INDEX(type_id, user_id);
ALTER TABLE claim_notes ADD INDEX(dt_stamp);
The latter index won't actually improve the search on the indexes but rather the sorting of the results.
Of course, having an EXPLAIN of the query would help.

Faster way of retrieving aggregate data from large table?

I have a table that grows by tens of millions of rows each day. The rows in the table contain hourly information about page view traffic.
The indices on the table are on url and datetime.
I want to aggregate the information by day, rather than hourly. How should I do this? This is a query that exemplifies what I am trying to do:
SELECT url, sum(pageviews), sum(int_views), sum(ext_views)
FROM news
WHERE datetime >= "2012-08-29 00:00:00" AND datetime <= "2012-08-29 23:00:00"
GROUP BY url
ORDER BY pageviews DESC
LIMIT 10;
The above query never finishes, though. There are millions of rows in the table. Is there a more efficient way that I can get this aggregate data?

Tens of millions of rows per day is quite a lot.
Assuming:
only 10 million new records per day;
your table contains only the columns that you mention in your question;
url is of type TEXT with a mean (Punycode) length of ~77 characters;
pageviews is of type INT;
int_views is of type INT;
ext_views is of type INT; and
datetime is of type DATETIME
then each day's data will occupy around 9.9 × 10^8 bytes, which is almost 1GiB/day. In reality it may be considerably more, because the above assumptions were quite conservative.
MySQL's maximum table size is determined, amongst other things, by the underlying filesystem on which its data files reside. If you're using the MyISAM engine (as suggested by your comment beneath) without partitioning on Windows or Linux, then a limit of a few GiB is not uncommon; which implies the table will reach its capacity well within a working week!
As @Gordon Linoff mentioned, you should partition your table; however, each table has a limit of 1024 partitions. With 1 partition/day (which would be eminently sensible in your case), you will be limited to storing under 3 years of data in a single table before the partitions start getting reused.
I would therefore advise that you keep each year's data in its own table, each partitioned by day. Furthermore, as @Ben explained, a composite index on (datetime, url) would help (I actually propose creating a date column from DATE(datetime) and indexing that, because it will enable MySQL to prune the partitions when performing your query); and, if row-level locking and transactional integrity are not important to you (for a table of this sort, they may not be), using MyISAM may not be daft:
CREATE TABLE news_2012 (
INDEX (date, url(100))
)
Engine = MyISAM
PARTITION BY HASH(TO_DAYS(date)) PARTITIONS 366
SELECT *, DATE(datetime) AS date FROM news WHERE YEAR(datetime) = 2012;
CREATE TRIGGER news_2012_insert BEFORE INSERT ON news_2012 FOR EACH ROW
SET NEW.date = DATE(NEW.datetime);
CREATE TRIGGER news_2012_update BEFORE UPDATE ON news_2012 FOR EACH ROW
SET NEW.date = DATE(NEW.datetime);
If you choose to use MyISAM, you can not only archive completed years (using myisampack) but can also replace your original table with a MERGE one comprising the UNION of all of your underlying year tables (an alternative that would also work in InnoDB would be to create a VIEW, but it would only be useful for SELECT statements as UNION views are neither updatable nor insertable):
DROP TABLE news;
CREATE TABLE news (
date DATE,
INDEX (date, url(100))
)
Engine = MERGE
INSERT_METHOD = FIRST
UNION = (news_2012, news_2011, ...)
SELECT * FROM news_2012 WHERE FALSE;
You can then run your above query (along with any other) on this merge table:
SELECT url, SUM(pageviews), SUM(int_views), SUM(ext_views)
FROM news
WHERE date = '2012-08-29'
GROUP BY url
ORDER BY SUM(pageviews) DESC
LIMIT 10;

A few points:
As datetime is the only predicate that you're filtering on, you should probably have an index with datetime as the first column.
You're ordering by pageviews. I would have assumed that you want to order by sum(pageviews).
You're querying 23 hours of data not 24. You probably want to use an explicit less than, <, from midnight the next day to avoid missing anything.
SELECT url, sum(pageviews), sum(int_views), sum(ext_views)
FROM news
WHERE datetime >= '2012-08-29 00:00:00'
AND datetime < '2012-08-30 00:00:00'
GROUP BY url
ORDER BY sum(pageviews) DESC
LIMIT 10;
You could index this on (datetime, url, pageviews, int_views, ext_views), but I think that would be overkill; so, if the index isn't too big, (datetime, url) seems like a good way to go. The only way to be certain is to test it and decide if any performance improvements in querying are worth the extra time taken in index maintenance.
As Gordon's just mentioned in the comments you may need to look into partitioning. This enables you to query a smaller "table" that is part of the larger one. If all your queries are based at the day level it sounds like you might need to create a new one each day.

How to prevent MySQL selecting one index when a better one is available?

I have a table with 30,000 rows (and growing), which I join with another table. On some pages, I need to run some 100+ of those queries, and things get slow. If I EXPLAIN the query, I notice that one table uses a primary key and is fast, but the other table uses one of its indexes, which is not the best one. Here's an overview:
SIMPLE | acc_entries | ref | ledger,date,type,status,status_ledger_date_type | type | 1 | const | 15359 | Using where
This is a sample query:
SELECT SUM(usd) AS total FROM acc_entries
LEFT JOIN acc_ledgers ON acc_entries.ledger = acc_ledgers.id
WHERE acc_entries.status = 1 AND
acc_ledgers.account = 3004 AND
date >= '2011-01-01' AND
date <= '2011-08-30' AND
type = 'credit'
As you can see, I am using in my WHERE the fields status, ledger (which is the field that joins with acc_ledgers.account), date and type. All of these fields have indexes. However, there is also a specific index that is used for all of them, in that same order. It is called status_ledger_date_type, and as you can see it is one of the indexes that MySQL considers using. However, in the end MySQL opts to use type as an index. This has some 15,000 possible rows (half of the table), whereas the combined index only covers a fraction of this. So my question is: why does MySQL select this index when a better one is available, and how can I prevent this?

You can try using index hints to force the use of your desired index.
MySql docs on Index Hints
The Battle Between Force Index and the Query Optimizer
7 ways to convince MySQL to use the right index
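As a minimal sketch, a hint on the query from the question could look like this. The index name status_ledger_date_type is the one shown in the EXPLAIN output above; FORCE INDEX tells the optimizer to treat a table scan as very expensive and to choose among the listed indexes, so it is worth re-running EXPLAIN to confirm the plan actually improves.

```sql
SELECT SUM(usd) AS total
FROM acc_entries FORCE INDEX (status_ledger_date_type)
LEFT JOIN acc_ledgers ON acc_entries.ledger = acc_ledgers.id
WHERE acc_entries.status = 1
  AND acc_ledgers.account = 3004
  AND date >= '2011-01-01'
  AND date <= '2011-08-30'
  AND type = 'credit';
```

USE INDEX (status_ledger_date_type) is the softer variant: it restricts the candidate indexes but still allows a table scan.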

Actually, you want your index based on your smallest granularity. The Ledger from your Acc_Entries table will join to your ACC_Ledgers table on ITS primary index of ID, so the Acc_Ledgers is not really utilizing the Ledger portion for the WHERE clause. Your index should match the WHERE clause of your common queries as closely as possible. In this case, I would have an index on
(Account, Status, Type, Date)
The reason for putting Account first is the smaller result set. You could have 5,000 entries. Of those, perhaps 300 belong to the one account, so you've already eliminated a huge amount of data to go through. Then comes Status: of the 300, you could have 100 at status 1, 100 at status 2, and 100 at status 3, so you've now reduced the set even more, and so on for the remaining criteria of Type and Date.
Your query otherwise is completely fine... just a personal style in writing: I try to write my queries with the WHERE conditions in the same sequence as the index too, so I would have the Account clause first, then Status, Type and Date... but again, that's a personal style in writing queries.
