Efficiency of query using NOT IN()? - mysql

I have a query that runs on my server:
DELETE FROM pairing WHERE id NOT IN (SELECT f.id FROM info f)
It takes two different tables, pairing and info and says to DELETE all entries from pairing whenever the id of that pairing is not in info.
I've run into an issue on the server where this is beginning to take too long to execute, and I believe it has to do with the efficiency (or lack of constraints in the SELECT statement).
However, I took a look at the MySQL slow_log and the number of compared entries is actually LOWER than it should be. From my understanding, this should be O(mn) time where m is the number of entries in pairing and n is the number of entries in info. The number of entries in pairing is 26,868 and in info is 34,976.
This should add up to 939,735,168 comparisons. But the slow_log is saying there are only 543,916,401: almost half the amount.
I was wondering if someone could please explain to me how the efficiency of this specific query works. I realize the fact that it's performing quicker than I think it should is a blessing in this case, but I still need to understand where the optimization comes from so that I can further improve upon it.

I haven't used the slow query log much (at all) but isn't it possible that the difference can just be chalked up to simple... can't think of the word. Basically, 939,735,168 is the theoretical worst case scenario where the query literally checks every single row except the one it needs to first. Realistically, with a roughly even distribution (and no use of indexing), a check of row in pairing will on average compare to half the rows in info.
It looks like your real world performance is only 15% off (worse) than what would be expected from the "average comparisons".
Edit: Actually, "worse than expected" should be expected when you have rows in pairing that are not in info, as they will skew the number of comparisons.
...which is still not great. If you have id indexed in both tables, something like this should work a lot faster.
DELETE pairing
FROM pairing LEFT JOIN info ON pairing.id = info.id
WHERE info.id IS NULL
;
This should take advantage of an index on id to make the comparisons needed something like O(NlogM).

Related

Will records order change between two identical query in mysql without order by

The problem is I need to do pagination.I want to use order by and limit.But my colleague told me mysql will return records in the same order,and since this job doesn't care in which order the records are shown,so we don't need order by.
So I want to ask if what he said is correct? Of course assuming that no records are updated or inserted between the two queries.
You don't show your query here, so I'm going to assume that it's something like the following (where ID is the primary key of the table):
select *
from TABLE
where ID >= :x:
limit 100
If this is the case, then with MySQL you will probably get rows in the same order every time. This is because the only predicate in the query involves the primary key, which is a clustered index for MySQL, so is usually the most efficient way to retrieve.
However, probably may not be good enough for you, and if your actual query is any more complex than this one, probably no longer applies. Even though you may think that nothing changes between queries (ie, no rows inserted or deleted), so you'll get the same optimization plan, that is not true.
For one thing, the block cache will have changed between queries, which may cause the optimizer to choose a different query plan. Or maybe not. But I wouldn't take the word of anyone other than one of the MySQL maintainers that it won't.
Bottom line: use an order by on whatever column(s) you're using to paginate. And if you're paginating by the primary key, that might actually improve your performance.
The key point here is that database engines need to handle potentially large datasets and need to care (a lot!) about performance. MySQL is never going to waste any resource (CPU cycles, memory, whatever) doing an operation that doesn't serve any purpose. Sorting result sets that aren't required to be sorted is a pretty good example of this.
When issuing a given query MySQL will try hard to return the requested data as quick as possible. When you insert a bunch of rows and then run a simple SELECT * FROM my_table query you'll often see that rows come back in the same order than they were inserted. That makes sense because the obvious way to store the rows is to append them as inserted and the obvious way to read them back is from start to end. However, this simplistic scenario won't apply everywhere, every time:
Physical storage changes. You won't just be appending new rows at the end forever. You'll eventually update values, delete rows. At some point, freed disk space will be reused.
Most real-life queries aren't as simple as SELECT * FROM my_table. Query optimizer will try to leverage indices, which can have a different order. Or it may decide that the fastest way to gather the required information is to perform internal sorts (that's typical for GROUP BY queries).
You mention paging. Indeed, I can think of some ways to create a paginator that doesn't require sorted results. For instance, you can assign page numbers in advance and keep them in a hash map or dictionary: items within a page may appear in random locations but paging will be consistent. This is of course pretty suboptimal, it's hard to code and requieres constant updating as data mutates. ORDER BY is basically the easiest way. What you can't do is just base your paginator in the assumption that SQL data sets are ordered sets because they aren't; neither in theory nor in practice.
As an anecdote, I once used a major framework that implemented pagination using the ORDER BY and LIMIT clauses. (I won't say the same because it isn't relevant to the question... well, dammit, it was CakePHP/2). It worked fine when sorting by ID. But it also allowed users to sort by arbitrary columns, which were often not unique, and I once found an item that was being shown in two different pages because the framework was naively sorting by a single non-unique column and that row made its way into both ORDER BY type LIMIT 10 and ORDER BY type LIMIT 10, 10 because both sortings complied with the requested condition.

Mysql Performance: Which of the query will take more time?

I have two tables:
1. user table with around 10 million data
columns: token_type, cust_id(Primary)
2. pm_tmp table with 200k data
columns: id(Primary | AutoIncrement), user_id
user_id is foreign key for cust_id
1st Approach/Query:
update user set token_type='PRIME'
where cust_id in (select user_id from pm_tmp where id between 1 AND 60000);
2nd Approach/Query: Here we will run below query for different cust_id individually for 60000 records:
update user set token_type='PRIME' where cust_id='1111110';
Theoretically time will be less for the first query as it involves less number of commits and in turn less number of index rebuilds. But, I would recommend to go with the second option since its more controlled and will appear to be less in time and you can event think about executing 2 seperate sets parellelly.
Note: The first query will need sufficient memory provisioned for mysql buffers to get it executed quickly. Second query being set of independent single transaction queries, they will need comparatively less memory and hence will appear faster if executed on limited memory environments.
Well, you may rewrite the first query this way too.
update user u, pm_tmp p set u.token_type='PRIME' where u.cust_id=p.id and p.in <60000;
Some versions of MySQL have trouble optimizing in. I would recommend:
update user u join
pm_tmp pt
on u.cust_id = pt.user_id and pt.id between 1 AND 60000
set u.token_type = 'PRIME' ;
(Note: This assumes that cust_id is not repeated in pm_temp. If that is possible, you will want a select distinct subquery.)
Your second version would normally be considerably slower, because it requires executing thousands of queries instead of one. One consideration might be the update. Perhaps the logging and locking get more complicated as the number of updates increases. I don't actually know enough about MySQL internals to know if this would have a significant impact on performance.
IN ( SELECT ... ) is poorly optimized. (I can't provide specifics because both UPDATE and IN have been better optimized in some recent version(s) of MySQL.) Suffice it to say "avoid IN ( SELECT ... )".
Your first sentence should say "rows" instead of "columns".
Back to the rest of the question. 60K is too big of a chunk. I recommend only 1000. Aside from that, Gordon's Answer is probably the best.
But... You did not use OFFSET; Do not be tempted to use it; it will kill performance as you go farther and farther into the table.
Another thing. COMMIT after each chunk. Else you build up a huge undo log; this adds to the cost. (And is a reason why 1K is possibly faster than 60K.)
But wait! Why are you updating a huge table? That is usually a sign of bad schema design. Please explain the data flow.
Perhaps you have computed which items to flag as 'prime'? Well, you could keep that list around and do JOINs in the SELECTs to discover prime-ness when reading. This completely eliminates the UPDATE in question. Sure, the JOIN costs something, but not much.

Whether or not SQL query (SELECT) continues or stops reading data from table when find the value

Greeting,
My question; Whether or no sql query (SELECT) continues or stops reading data (records) from table when find the value that I was looking for?
referance: "In order to return data for this query, mysql must start at the beginning of the disk data file, read in enough of the record to know where the category field data starts (because long_text is variable length), read this value, see if it satisfies the where condition (and so decide whether to add to the return record set), then figure out where the next record set is, then repeat."
link for referance: http://www.verynoisy.com/sql-indexing-dummies/#how_the_database_finds_records_normally
In general you don't know and you don't care, but you have to adapt when queries take too long to execute. When you do something like
select a,b,c from mytable where a=3 and b=5
then the database engine has a couple of options to optimize. When all these options fail, then it will do a "full table scan" - which means, it will have to examine the entire table to see which rows are eligible. When you have indices on e.g. column a then the database engine can optimize the search because it can pre-select rows where a has value 3. So, in general, make sure that you have indices for the columns that are most searched. (Perversely, some database engines get confused when you have too many indices and will fall back to a full table scan because they've lost their way...)
As to whether or not the scanning stops: In general, the database engine has to examine all data in the table (hopefully aided by indices) and won't stop after having found just one hit. If you want just the first hit, use a limit 1 clause to make sure that your result set has only one outcome. But then again, if you have a sort by clause, the database engine cannot stop after the first hit, there might be next ones that should get priority given the sorting.
Summarizing, how the db engine does its scan depends on how smart it is, what indices are available etc.. If your select queries take too long then consider re-organizing your indices, writing your select statements differently, or rebuilding the table.
The RDBMS reading data from disk is something you cannot know, you should not care and you must not rely on.
The issue is too broad to get a precise answer. The engine reads data from storage in blocks, a block can contain records that are not needed by the query at hand. If all the columns needed by the query is available in an index, the RDBMS won't even read the data file, it will only use the index. The data it needs could already be cached in memory (because it was read during the execution of a previous query). The underlying OS and the storage media also keep their own caches.
On a busy system, all these factors could lead to very different storage access patterns while running the same query several times on a couple of minutes apart.
Yes it scans the entire file. Unless you put something like
select * from user where id=100 limit 1
This of course will still search entire rows if id 100 is the last record.
If id is a primary key it will automatically be indexed and searching would be optimized
I'm sorry... I thought the table.
I will change question and I will explain it in the following image;
I understand that in CASE 1 all columns must be read with each iteration.
My question is: If it's the same in the CASE 2 or columns that are not selected in the query are excluded from reading in each iteration.
Also, are the both queries are the some in performance perspective?
Clarify:
CASE: 1 In first CASE select print all data
CASE: 2 In second CASE select print columns first_name and last_name
Whether in CASE 2 mysql server (SQL query) reads only columns first_name, last_name or read the entire table to get that data(rows)=(first_name, last_name)?
An interest of me how the server reads table row in CASE 1 and CASE 2?

Does the order of conditions make a performance difference in MySQL?

Suppose I have a MySQL query like this, the table PEOPLE has about 2 million rows:
SELECT * FROM `PEOPLE` WHERE `SEX`=1 AND `AGE`=28;
The first condition will return 1 million rows, and the second condition may return 20,000 rows. From the local website, most developers said that it will cause a better affect to change the order of them. And they also said that It will cause a 2 million + 1 million + *10,000* I/O time if change the order, while original query above will cause a 2 million + 20,000 + *10,000* I/O time. It sounds make sense.
As we all know that MySQL has an internal query optimizer for such work. Does the order needs pay particular attention for optimal performance? I was totally confused.
PS: I noticed that there are some similar question asked already, but they are two or tree years ago, it seems better to ask again.
Thanks all noticed this question. This is a explain about why i ask again:
Before I ask this question, I run EXPLAIN for a couple of times. The answer is the order doesn't matter. But the Interviewer told me the order will make a difference performance, I want make it sure if there is something i missing.
You should first understand a fundamental thing: in theory, a relational database does not have indices.
A purely theoretical relational database engine would indeed scan all records, check the criterion on the sex and age columns and only return the relevant rows.
However, indices are a common layer added by SQL database engines to filter rows faster. In this case, you should have indices for both of these columns.
What is more, these same database engines perform analysis on these indices (if any) to determine the best possible course of action to retrieve the relevant rows faster. In particular, one criterion in index metadata is cardinality: for a given value of the indexed column, how many rows match on average? The higher the number of rows, the lower the cardinality. Therefore, the higher the cardinality the better.
Therefore, an SQL engine's query optimizer will certainly select to cut through the result set by looking up the age index first, and only then the index of sex. And it may even choose not to use the index on sex at all if it determines that it can be faster by just looking up the sex column value for each row resulting from the first filter. Which is likely here, since the cardinality of the sex column is ridiculously low.
Have a look here for an introduction to the relational model.

Big tables and analysis in MySql

For my startup, I track everything myself rather than rely on google analytics. This is nice because I can actually have ips and user ids and everything.
This worked well until my tracking table rose about 2 million rows. The table is called acts, and records:
ip
url
note
account_id
...where available.
Now, trying to do something like this:
SELECT COUNT(distinct ip)
FROM acts
JOIN users ON(users.ip = acts.ip)
WHERE acts.url LIKE '%some_marketing_page%';
Basically never finishes. I switched to this:
SELECT COUNT(distinct ip)
FROM acts
JOIN users ON(users.ip = acts.ip)
WHERE acts.note = 'some_marketing_page';
But it is still very slow, despite having an index on note.
I am obviously not pro at mysql. My question is:
How do companies with lots of data track things like funnel conversion rates? Is it possible to do in mysql and I am just missing some knowledge? If not, what books / blogs can I read about how sites do this?
While getting towards 'respectable', 2 Millions rows is still a relatively small size for a table. (And therefore a faster performance is typically possible)
As you found out, the front-ended wildcard are particularly inefficient and we'll have to find a solution for this if that use case is common for your application.
It could just be that you do not have the right set of indexes. Before I proceed, however, I wish to stress that while indexes will typically improve the DBMS performance with SELECT statements of all kinds, it systematically has a negative effect on the performance of "CUD" operations (i.e. with the SQL CREATE/INSERT, UPDATE, DELETE verbs, i.e. the queries which write to the database rather than just read to it). In some cases the negative impact of indexes on "write" queries can be very significant.
My reason for particularly stressing the ambivalent nature of indexes is that it appears that your application does a fair amount of data collection as a normal part of its operation, and you will need to watch for possible degradation as the INSERTs queries get to be slowed down. A possible alternative is to perform the data collection into a relatively small table/database, with no or very few indexes, and to regularly import the data from this input database to a database where the actual data mining takes place. (After they are imported, the rows may be deleted from the "input database", keeping it small and fast for its INSERT function.)
Another concern/question is about the width of a row in the cast table (the number of columns and the sum of the widths of these columns). Bad performance could be tied to the fact that rows are too wide, resulting in too few rows in the leaf nodes of the table, and hence a deeper-than-needed tree structure.
Back to the indexes...
in view of the few queries in the question, it appears that you could benefit from an ip + note index (an index made at least with these two keys in this order). A full analysis of the index situation, and frankly a possible review of the database schema cannot be done here (not enough info for one...) but the general process for doing so is to make the list of the most common use case and to see which database indexes could help with these cases. One can gather insight into how particular queries are handled, initially or after index(es) are added, with mySQL command EXPLAIN.
Normalization OR demormalization (or indeed a combination of both!), is often a viable idea for improving performance during mining operations as well.
Why the JOIN? If we can assume that no IP makes it into acts without an associated record in users then you don't need the join:
SELECT COUNT(distinct ip) FROM acts
WHERE acts.url LIKE '%some_marketing_page%';
If you really do need the JOIN it might pay to first select the distinct IPs from acts, then JOIN those results to users (you'll have to look at the execution plan and experiment to see if this is faster).
Secondly, that LIKE with a leading wild card is going to cause a full table scan of acts and also necessitate some expensive text searching. You have three choices to improve this:
Decompose the url into component parts before you store it so that the search matches a column value exactly.
Require the search term to appear at the beginning of the of the url field, not in the middle.
Investigate a full text search engine that will index the url field in such a way that even an internal LIKE search can be performed against indexes.
Finally, in the case of searching on acts.notes, if an index on notes doesn't provide sufficient search improvement, I'd consider calculating and storing an integer hash on notes and searching for that.
Try running 'EXPLAIN PLAN' on your query and look to see if there are any table scans.
Should this be a LEFT JOIN?
Maybe this site can help.