Which SQL will be faster and why? - MySQL

Suppose I have a student table containing id, class, and school, with 1000 records.
There are 3 schools and 12 classes.
Which of these 2 queries would be faster (if there is a difference)?
Query 1:
SELECT * FROM student WHERE school = 2 and class = 5;
Query 2:
SELECT * FROM student WHERE class = 5 and school = 2;
Note: I just changed the places of the 2 conditions in WHERE.
Then which will be faster, and is the following true?
-> probable number of records in query 1 is 333
-> probable number of records in query 2 is 80.

It seriously doesn't matter one little bit. 1000 records is a truly tiny database table and, if there's a difference at all, you need to upgrade from such a brain-dead DBMS.
A decent DBMS would have already collected the stats from tables (or the DBA would have done it as part of periodic tuning) and the order of the where clauses would be irrelevant.
The execution engine would choose the one which reduced the cardinality (ie, reduced the candidate group of rows) the fastest. That means that (assuming classes and schools are roughly equally distributed) the class = 5 filter would happen first, no matter the order in the select statement.
Explaining the cardinality issue in a little more depth, for a roughly evenly distributed spread of those 1000 records, there would be 333 for each school and 83 for each class.
What a DBMS would do is filter first on whatever gives you the smallest result set. So it would tend to prefer the class filter, which would immediately drop the candidate list of rows to about 83. Then it's a simple matter of tossing out those with a school other than 2.
In both cases, you end up with the same eventual row set but the initial filter is often faster since it can use an index to only select desired rows. The second filter, on the other hand, most likely goes through those rows in a less efficient manner so the quicker you can reduce the number of rows, the better.
If you really want to know, you need to measure rather than guess. That's one of the primary responsibilities of a DBA, tuning the database for optimal execution of queries.
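As a quick sketch of that measurement (using the table from the question), you can compare the plans MySQL actually builds:

-- Both EXPLAIN outputs should show the same access plan,
-- confirming the optimizer ignores the order of the conditions.
EXPLAIN SELECT * FROM student WHERE school = 2 AND class = 5;
EXPLAIN SELECT * FROM student WHERE class = 5 AND school = 2;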

These 2 queries are strictly the same :)

A hypothetical, to teach a DB concept: how your DB uses cardinality to optimize your queries.
So, it's basically true that they are identical, but I will mention one thought hinting at the "why" which will actually introduce a good RDBMS concept.
Let's just say hypothetically that your RDBMS used the WHERE clauses strictly in the order you specified them.
In that case, the optimal query would be the one in which the column with maximum cardinality was specified first. What that means is that specifying class=5 first would be faster, as it more quickly eliminates rows from consideration: if a row's "class" column does not contain 5 (which is statistically more likely than its "school" column not containing 2), then the "school" column never needs to be evaluated.
Coming back to reality, however, you should know that almost all modern relational database management systems do what is called "building a query plan" and "compiling the query". This involves, among other things, evaluating the cardinality of columns specified in the WHERE clause (and what indexes are available, etc). So essentially, it is probably true to say they are identical, and the number of results will be, too.
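To give the planner something to work with, here is a sketch along those lines (the index name is made up for illustration):

-- A composite index lets the engine seek on the higher-cardinality
-- column first; ANALYZE TABLE refreshes the statistics the query
-- plan is compiled from.
CREATE INDEX idx_class_school ON student (class, school);
ANALYZE TABLE student;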

The number of rows affected will not change simply because you reorder the conditions in the WHERE clause of the SQL statement.
The execution time will also not be affected, since the SQL server will look for a matching index first.

The first query executes faster than the second because the WHERE clause filters on school first, so it is easier to get the class details later.

Related

Is there any limit to the number of records while processing a SQL query?

Say I have two tables:
table A with 6000 records (i.e. T(A) = 6000)
table B with 400 000 records (i.e. T(B) = 400000)
for some reason, I decided that for my final query I would need to join A with B twice, but I decided to do this (presumably) very inefficiently via the Cartesian product. So I would do A * B * B, i.e. T(A) * T(B) * T(B), which is suddenly a quadrillion records being processed internally (only to be stripped down to dozens via selection and projection, for example).
While maybe ineffective, would an average server handle this? If so, is there any limit, even theoretically? What if the tables were magnitudes bigger?
Your question is hypothetical, and may lead to opinion-based answers, but I'll give it a shot.
You say that from your Cartesian product, you intend to return only a dozen or so records. If those records can be found through indexes, an "average" server should be absolutely fine - it doesn't matter how many entries are in the phone book; as long as your search is by last name, your search is fast. If you're searching 2 phone books for 2 last names, still fine.
If the 12 rows you need have to be found by simple comparison of non-indexed fields, it's probably fine - the largest table is only 400K rows, and that should be pretty fast. If you're searching for a street name in the phone book, the size of the phone book matters, but modern hardware should be OK. Better to put an index on the column, though.
If you have to find the 12 rows by doing some kind of calculation on fields, it will likely be a problem. If you have to convert every last name in the phone book to an integer and multiply it by the day of the month to find the 12 rows you seek, the server has to do a quadrillion calculations, and that is likely going to be slow.
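To make the phone-book analogy concrete, here is a sketch (the phonebook table and its columns are invented for illustration):

-- Fast: an index on last_name can satisfy this lookup directly.
SELECT * FROM phonebook WHERE last_name = 'Smith';
-- Slow: wrapping the column in a calculation means the server must
-- evaluate the expression for every candidate row instead of seeking.
SELECT * FROM phonebook WHERE CRC32(last_name) * DAY(CURDATE()) = 12345;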
You are confusing the logic model of processing with what actually happens inside the database.
Projection and selection and Cartesian products are concepts from relational algebra. That explains what SQL does. It does not explain how databases do that.
In particular, databases have lots of algorithms that support joining and aggregating tables. Databases also have auxiliary data structures -- in particular, indexes and partitions -- that allow further optimization.
If you have no join conditions or filtering or aggregation, then the database does need to generate the Cartesian product -- and that can be quite expensive.
In general, though, databases do not generate the Cartesian product. If they did, databases would not be very useful.
Is there a limit on the size of data or processing? Practical limits are more common than hard limits in the databases themselves. In general, available memory and disk space limit the size of the data that can be processed -- but the limit is typically much, much higher than your example.
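As a sketch of the difference (the join columns are invented): the first query forces the full product, while the second gives the engine join conditions it can satisfy with join algorithms and indexes:

-- Cartesian product: every row of A paired with every row of B, twice.
SELECT * FROM A, B b1, B b2;
-- Same intent with join conditions: the engine can use nested-loop or
-- hash joins plus indexes instead of materializing T(A)*T(B)*T(B) rows.
SELECT *
FROM A
JOIN B b1 ON b1.a_id = A.id
JOIN B b2 ON b2.a_id = A.id;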

Understanding MySQL explain, `rows`-wise

I'm trying to figure out how should I take into account the rows column of MySQL explain's output. Here's what MySQL documentation says about it:
The rows column indicates the number of rows MySQL believes it must
examine to execute the query.
So here are my questions:
Regardless of its exactness, is this the number of records that are going to be examined after the indices are used or before?
Is it correct that I should prioritize optimizing the tables with high rows values?
Is it true that the total number of records MySQL will examine is the product of the rows column values?
What are the strategies to reduce the rows?
The point of indexes is that the DBMS will look there first, and then use the gathered information to look up matching rows. So - yes, rows indicates how many rows will be examined after indexes were used (if they are present and applicable, of course). This question suggests you are confused about what indexes are. They're not magic; they are just real data structures. Their entire purpose is to reduce the count of data rows used to perform a query.
Arguable. This question can't be answered "yes" or "no", because different tables may have different row definitions - and the operations applied to them will also differ. Imagine that you have 100,000 rows from the first table and 10,000 rows from the second, but for the first table you're selecting just a plain value, while for the second you're computing something like a standard deviation. That is: not only the count of rows matters, but also what you are doing with them.
You may think of it as multiplication, yes. But the thing is - that's not exactly what will happen, and not an exact count, of course. There is also the filtered field, which estimates the percentage of rows that survive the applied conditions (like those in the WHERE clause). In general, though, you may estimate the end result as a power of 10, i.e. if you have 123,456,789 rows in the first line and 111,222 in the second, you may treat it as a selection of around 1E8 x 1E5 rows.
The techniques are quite standard, and they are all about optimizing your query. The first step is to take a look at how MySQL optimizes certain parts of a query. Not all queries can be optimized - and in general this is too broad a question, because some solutions may touch the entire database and/or application structure. But understanding how to use indexes properly, what can (and cannot) be indexed, and how to create an effective index will be enough.
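As a minimal sketch of reading those estimates (the tables and columns here are invented):

-- rows is the optimizer's per-table estimate of rows examined;
-- filtered estimates the percentage that survive the conditions.
EXPLAIN SELECT o.*, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE c.country = 'US';
-- Multiplying the rows values across the output lines gives the
-- rough upper bound on examined rows discussed above.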

MySQL IN vs <> performance

I have a table where I have a status field which can have values like 1,2,3,4,5. I need to select all the rows from the table with status != 1. I have the following 2 options:
NOTE that the table has INDEX over status field.
SELECT ... FROM my_tbl WHERE status <> 1;
or
SELECT ... FROM my_tbl WHERE status IN(2,3,4,5);
Which of the above is a better choice? (my_tbl is expected to grow very big).
You can run your own tests to find out, because it will vary depending on the underlying tables.
More than that, please don't worry about "fastest" without having first done some sort of measurement that it matters.
Rather than worrying about fastest, think about which way is clearest.
In databases especially, think about which way is going to protect you from data errors.
It doesn't matter how fast your program is if it's buggy or gives incorrect answers.
How many rows have the value 1? If fewer than ~20%, then status != 1 matches most of the table, and you will get a table scan regardless of how you formulate the WHERE (IN, <>, BETWEEN). That's assuming you have INDEX(status).
But indexing ENUMs, flags, and other things with poor cardinality is rarely useful.
An IN clause with 50K items causes memory problems (or at least used to), but not performance problems. They are sorted, and a binary search is used.
Rule of Thumb: The cost of evaluation of expressions (IN, <>, functions, etc) is mostly irrelevant in performance. The main cost is fetching the rows, especially if they need to be fetched from disk.
An INDEX may assist in minimizing the number of rows fetched.
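A quick sketch to check which plan each form gets (against the my_tbl from the question):

-- With INDEX(status) and low-cardinality values, expect the same plan
-- (likely a full scan, type ALL) for both formulations.
EXPLAIN SELECT * FROM my_tbl WHERE status <> 1;
EXPLAIN SELECT * FROM my_tbl WHERE status IN (2,3,4,5);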
You can use BENCHMARK() to test it yourself.
http://sqlfiddle.com/#!2/d41d8/29606/2
The first one is faster, which makes sense since it only has to compare 1 number instead of 4.
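Note that BENCHMARK() times only scalar expression evaluation, not row fetching, so a sketch like this isolates the comparison cost alone:

-- Evaluate each predicate 10 million times against a constant;
-- any difference reflects pure expression-evaluation cost.
SELECT BENCHMARK(10000000, 5 <> 1);
SELECT BENCHMARK(10000000, 5 IN (2,3,4,5));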

Does the order of conditions make a performance difference in MySQL?

Suppose I have a MySQL query like this, the table PEOPLE has about 2 million rows:
SELECT * FROM `PEOPLE` WHERE `SEX`=1 AND `AGE`=28;
The first condition will return 1 million rows, and the second condition may return 20,000 rows. On a local website, most developers said that swapping their order would give better performance. They also said that the original query above would cost 2 million + 1 million + 10,000 I/O operations, while the reordered query would cost only 2 million + 20,000 + 10,000. It sounds like it makes sense.
As we all know, MySQL has an internal query optimizer for such work. Does the order need particular attention for optimal performance? I was totally confused.
PS: I noticed that some similar questions have been asked already, but they are two or three years old, so it seems better to ask again.
Thanks to all who noticed this question. Here is an explanation of why I am asking again:
Before I asked this question, I ran EXPLAIN a couple of times. The answer is that the order doesn't matter. But the interviewer told me the order would make a performance difference, and I want to make sure there isn't something I'm missing.
You should first understand a fundamental thing: in theory, a relational database does not have indices.
A purely theoretical relational database engine would indeed scan all records, check the criterion on the sex and age columns and only return the relevant rows.
However, indices are a common layer added by SQL database engines to filter rows faster. In this case, you should have indices for both of these columns.
What is more, these same database engines perform analysis on these indices (if any) to determine the best possible course of action to retrieve the relevant rows faster. In particular, one criterion in index metadata is cardinality: for a given value of the indexed column, how many rows match on average? The higher the number of rows, the lower the cardinality. Therefore, the higher the cardinality the better.
Therefore, an SQL engine's query optimizer will certainly choose to cut through the result set by looking up the age index first, and only then the index on sex. It may even choose not to use the index on sex at all, if it determines that it would be faster to just look up the sex column value for each row resulting from the first filter - which is likely here, since the cardinality of the sex column is ridiculously low.
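As a sketch, assuming an index on AGE (added here for illustration) and none on SEX, both orderings should produce the same plan:

-- The low-cardinality SEX column is deliberately left unindexed.
CREATE INDEX idx_age ON PEOPLE (AGE);
EXPLAIN SELECT * FROM PEOPLE WHERE SEX = 1 AND AGE = 28;
EXPLAIN SELECT * FROM PEOPLE WHERE AGE = 28 AND SEX = 1;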
Have a look here for an introduction to the relational model.

Composite Primary and Cardinality

I have some questions on Composite Primary Keys and the cardinality of the columns. I searched the web, but did not find any definitive answer, so I am trying again. The questions are:
Context: Large (50M - 500M rows) OLAP Prep tables, not NOSQL, not Columnar. MySQL and DB2
1) Does the order of keys in a PK matter?
2) If the cardinality of the columns varies heavily, which should be used first. For example, if I have CLIENT/CAMPAIGN/PROGRAM where CLIENT is highly cardinal, CAMPAIGN is moderate, PROGRAM is almost like a bitmap index, what order is the best?
3) What order is the best for Join, if there is a Where clause and when there is no Where Clause (for views)
Thanks in advance.
You've got "MySQL and DB2". This answer is for DB2, MySQL has none of this.
Yes, of course that is logical, but the optimiser takes much more than just that into account.
Generally, the order of the columns in the WHERE clause (join) do not (and should not) matter.
However, there are two items related to the order of predicates which may be the reason for your question.
What does matter, is the order of the columns in the index, against which the WHERE clause is processed. Yes, there it is best to specify the columns in the order of highest cardinality to lowest. That allows the optimiser to target a smaller range of rows.
And along those lines, do not bother implementing indices on single-column, low-cardinality columns (they are useless). If the index is correct, then it will be used more often.
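Using the question's columns, a sketch of that ordering advice (the table and index names are invented):

-- Highest-cardinality column first narrows the scanned index range fastest.
CREATE INDEX ix_client_campaign_program
    ON olap_prep (CLIENT, CAMPAIGN, PROGRAM);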
The order of tables being joined (not columns in the join) matters very much, it is probably the most important consideration. In fact Join Transitive Closure is automatic, and the optimiser evaluates all possible join orders, and chooses what it thinks is the best, based on Statistics (which is why UPDATE STATS is so important).
Regardless of the number of rows in the tables, if you are joining 100 rows from table_A on a bad index with 1,000,000 rows in table_B on a good index, you want the order A:B, not B:A. If you are getting less than the max IOPS, you may want to do something about it.
The correct sequence of steps is, no surprise:
check that the index is correct as per (1). Do not just add another index, correct the ones you have.
check that update stats is being executed regularly
always try the default operation of the optimiser first. Set stats on and measure I/Os. Use representative sets of values (that the user will use in production).
check the showplan, to ensure that the code is correct. Of course that will also identify the join order chosen.
if the performance is not good enough, and you believe that the join order chosen by the optimiser for those sets of values is sub-optimal, SET JTC OFF (syntax depends on your version of DB2), then specify the order that you want in the WHERE clause. Measure I/Os. Use representative sets.
form an opinion. Choose whichever is better performance overall. Never tune for single queries.
1) Does the order of keys in a PK matter?
Yes, it changes the order of the records in the index that is used to police the PRIMARY KEY.
2) If the cardinality of the columns varies heavily, which should be used first. For example, if I have CLIENT/CAMPAIGN/PROGRAM where CLIENT is highly cardinal, CAMPAIGN is moderate, PROGRAM is almost like a bitmap index, what order is the best?
For SELECT queries, this totally depends on the queries you are going to run. If you are searching on all three columns at once, the order is not important; if you are searching on only one or two columns, those should be the leading columns in the index.
For inserts, it is better to make the leading column match the order in which the records are inserted.
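A sketch of that insert-order point, assuming rows arrive grouped by CLIENT (the table name is invented):

-- If new rows arrive in CLIENT order, leading with CLIENT keeps inserts
-- appending near the end of the index instead of splitting pages randomly.
ALTER TABLE olap_prep ADD PRIMARY KEY (CLIENT, CAMPAIGN, PROGRAM);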
3) What order is the best for Join, if there is a Where clause and when there is no Where Clause (for views)
Again, this depends on the WHERE clause.