I'm trying to figure out how I should interpret the rows column of MySQL's EXPLAIN output. Here's what the MySQL documentation says about it:
The rows column indicates the number of rows MySQL believes it must
examine to execute the query.
So here are my questions:
Regardless of its exactness, is this the number of records that are going to be examined after the indices are used or before?
Is it correct that I should prioritize optimizing tables with a high rows value?
Is it true that the total number of records MySQL will examine is the product of the rows column values?
What are the strategies to reduce the rows?
The point of indexes is that the DBMS will look there first, and then use the gathered information to look up the matching rows. So yes, rows indicates how many rows will be examined after indexes are used (if they are present and applicable, of course). The question suggests some confusion about what indexes are. They're not magic; they are just a real data structure. Their whole purpose is to reduce the number of data rows needed to perform a query.
Arguable. This question can't be answered with a plain "yes" or "no", because different tables may have different row definitions, and the operations applied to them may differ too. Imagine that you have 100,000 rows from the first table and 10,000 rows from the second. For the first table you're selecting just a plain value, while for the second you're computing something like a standard deviation. That is: not only the row count matters, but also what you are doing with the rows.
You may think of it as a multiplication, yes. But that is not exactly what will happen, and the count is of course not exact either. There is also the filtered field, which estimates the percentage of examined rows that remain after the applied conditions (such as those in the WHERE clause). In general, though, you can estimate the end result by order of magnitude, i.e. if you have 123,456,789 rows in the first line and 111,222 in the second, you may treat it as a "selection of around 1E8 x 1E5" rows.
The techniques are quite standard, and they are all about optimizing your query. The first step is to look at how MySQL optimizes certain parts of a query. Not all queries can be optimized, and in general this is too broad a question, because some solutions may touch the entire database and/or application structure. But understanding how to use indexes properly, what can (and cannot) be indexed, and how to create an effective index will be enough.
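To make the multiplication estimate from the third answer concrete, here is a minimal sketch in Python. It assumes you have copied the rows and filtered values out of an EXPLAIN result by hand; the function name and the sample plan are made up for illustration, and the result is only as good as the optimizer's estimates.

```python
from math import prod

def estimated_row_combinations(explain_rows):
    """Multiply the per-table estimates taken from EXPLAIN output.

    explain_rows is a list of (rows, filtered_percent) pairs, one pair
    per line of the EXPLAIN result. Both values are optimizer estimates,
    so the product is an order-of-magnitude guide, not an exact count.
    """
    return prod(rows * filtered / 100.0 for rows, filtered in explain_rows)

# Hypothetical plan: two joined tables, nothing filtered out (filtered = 100).
plan = [(123_456_789, 100.0), (111_222, 100.0)]
estimate = estimated_row_combinations(plan)
print(f"{estimate:.1e}")  # 1.4e+13, i.e. roughly 1E8 x 1E5
```

A lower filtered value on any line shrinks the product accordingly, which is why a selective WHERE condition on the first table tends to pay off most.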
The problem is that I need to do pagination. I want to use ORDER BY and LIMIT, but my colleague told me MySQL will return records in the same order anyway, and since this job doesn't care in which order the records are shown, we don't need ORDER BY.
So I want to ask whether what he said is correct, assuming of course that no records are updated or inserted between the two queries.
You don't show your query here, so I'm going to assume that it's something like the following (where ID is the primary key of the table):
select *
from TABLE
where ID >= :x:
limit 100
If this is the case, then with MySQL you will probably get rows in the same order every time. This is because the only predicate in the query involves the primary key, which is the clustered index in MySQL (InnoDB), so it is usually the most efficient way to retrieve the rows.
However, "probably" may not be good enough for you, and if your actual query is any more complex than this one, "probably" no longer applies. Even though you may think that nothing changes between queries (i.e., no rows inserted or deleted), so you'll get the same optimization plan, that is not guaranteed.
For one thing, the block cache will have changed between queries, which may cause the optimizer to choose a different query plan. Or maybe not. But I wouldn't take the word of anyone other than one of the MySQL maintainers that it won't.
Bottom line: use an order by on whatever column(s) you're using to paginate. And if you're paginating by the primary key, that might actually improve your performance.
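As a rough illustration of paginating by the primary key with an explicit ORDER BY, here is a small Python sketch. It uses the standard library's sqlite3 module as a stand-in for MySQL so it can run anywhere; the items table and the fetch_page helper are hypothetical names, not anything from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO items (id, name) VALUES (?, ?)",
    [(i, f"item-{i}") for i in range(1, 251)],
)

def fetch_page(conn, after_id, page_size=100):
    """Return the next page of rows whose id is greater than after_id."""
    return conn.execute(
        "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, page_size),
    ).fetchall()

# Walk through all pages, remembering the last id seen on each page.
pages = []
last_id = 0
while True:
    page = fetch_page(conn, last_id)
    if not page:
        break
    pages.append(page)
    last_id = page[-1][0]

print([len(p) for p in pages])  # [100, 100, 50]
```

Because every page is defined by "id greater than the last one I saw, in id order", the result does not depend on whatever order the engine happens to read rows in.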
The key point here is that database engines need to handle potentially large datasets and need to care (a lot!) about performance. MySQL is never going to waste any resource (CPU cycles, memory, whatever) doing an operation that doesn't serve any purpose. Sorting result sets that aren't required to be sorted is a pretty good example of this.
When issuing a given query, MySQL will try hard to return the requested data as quickly as possible. When you insert a bunch of rows and then run a simple SELECT * FROM my_table query, you'll often see that rows come back in the same order in which they were inserted. That makes sense because the obvious way to store the rows is to append them as inserted and the obvious way to read them back is from start to end. However, this simplistic scenario won't apply everywhere, every time:
Physical storage changes. You won't just be appending new rows at the end forever. You'll eventually update values, delete rows. At some point, freed disk space will be reused.
Most real-life queries aren't as simple as SELECT * FROM my_table. The query optimizer will try to leverage indices, which can have a different order. Or it may decide that the fastest way to gather the required information is to perform internal sorts (that's typical for GROUP BY queries).
You mention paging. Indeed, I can think of some ways to create a paginator that doesn't require sorted results. For instance, you can assign page numbers in advance and keep them in a hash map or dictionary: items within a page may appear in random locations but paging will be consistent. This is of course pretty suboptimal: it's hard to code and requires constant updating as data mutates. ORDER BY is basically the easiest way. What you can't do is base your paginator on the assumption that SQL data sets are ordered sets, because they aren't; neither in theory nor in practice.
As an anecdote, I once used a major framework that implemented pagination using the ORDER BY and LIMIT clauses. (I won't say the name because it isn't relevant to the question... well, dammit, it was CakePHP/2). It worked fine when sorting by ID. But it also allowed users to sort by arbitrary columns, which were often not unique, and I once found an item that was shown on two different pages because the framework was naively sorting by a single non-unique column: that row made its way into both ORDER BY type LIMIT 10 and ORDER BY type LIMIT 10, 10, because both orderings complied with the requested condition.
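The usual fix for the duplicate-row problem in that anecdote is to add a unique column as a tie-breaker to the ORDER BY, so that the ordering is fully determined. A minimal sketch, again using sqlite3 in place of MySQL, with a made-up items table and page_ids helper:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, type TEXT)")
# Only two distinct type values, so sorting by type alone leaves many ties.
conn.executemany(
    "INSERT INTO items (id, type) VALUES (?, ?)",
    [(i, "a" if i % 2 else "b") for i in range(1, 31)],
)

def page_ids(conn, offset, limit=10):
    """Return the ids on one page, ordered by type with id as tie-breaker."""
    rows = conn.execute(
        "SELECT id FROM items ORDER BY type, id LIMIT ? OFFSET ?",
        (limit, offset),
    ).fetchall()
    return [r[0] for r in rows]

seen = [i for offset in (0, 10, 20) for i in page_ids(conn, offset)]
print(len(seen), len(set(seen)))  # 30 30: every row appears exactly once
```

With ORDER BY type alone, any order of the tied rows is a valid answer, so two separate page queries are free to disagree; adding id removes that freedom.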
Do tables with many columns take more time than tables with fewer columns during a SELECT or UPDATE query? (The row count is the same and I will update/select the same number of columns in both cases.)
example: I have a database to store user details and to store their last active time-stamp. In my website, I only need to show active users and their names.
Say one table named userinfo has the following columns: (id,f_name,l_name,email,mobile,verified_status). Is it a good idea to store the last active time in the same table as well? Or is it better to make a separate table (say, user_active) to store the last activity timestamp?
The reason I am asking: if I make two tables, the userinfo table will only be accessed during new signups (to INSERT a new user row), and I will use the user_active table (the table with fewer columns) to UPDATE the timestamp and SELECT active users frequently.
But the cost I have to pay for creating two tables is data duplication as user_active table columns will be (id, f_name, timestamp).
The answer to your question is that, to a close approximation, having more columns in a table does not really take more time than having fewer columns for accessing a single row. This may seem counter-intuitive, but you need to understand how data is stored in databases.
Rows of a table are stored on data pages. The cost of a query is highly dependent on the number of pages that need to be read and written during the course of the query. Parsing the row from the data page is usually not a significant performance issue.
Now, wider rows do have a very slight performance disadvantage, because more data would (presumably) be returned to the user. This is a very minor consideration for rows that fit on a single page.
On a more complicated query, wider rows have a larger performance disadvantage, because more data pages need to be read and written for a given number of rows. For a single row, though, one page is being read and written -- assuming you have an index to find that row (which seems very likely in this case).
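A back-of-the-envelope calculation may help here. The sketch below assumes InnoDB's default 16 KB page size and ignores per-page and per-row overhead, so the numbers are only rough; the row sizes and the pages_needed helper are invented for illustration.

```python
import math

PAGE_SIZE = 16 * 1024  # InnoDB's default page size; overhead is ignored here

def pages_needed(row_count, row_bytes, page_size=PAGE_SIZE):
    """Estimate how many data pages hold row_count rows of row_bytes each."""
    rows_per_page = max(1, page_size // row_bytes)
    return math.ceil(row_count / rows_per_page)

# Scanning 100,000 rows: wider rows mean proportionally more pages.
narrow = pages_needed(100_000, 64)
wide = pages_needed(100_000, 512)
print(narrow, wide)  # 391 3125

# Fetching a single row via an index: one data page either way.
print(pages_needed(1, 64), pages_needed(1, 512))  # 1 1
```

That is the distinction the answer draws: row width matters when many rows are read, and hardly at all for a single-row lookup.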
As for the rest of your question: the structure of your second table is not correct. You would not (normally) include f_name in two tables; that is data redundancy and causes all sorts of other problems. There is a legitimate question whether you should store a table of all activity and use that table for display purposes, but that is not the question you are asking.
Finally, for the data volumes you are talking about, having a few extra columns would make no noticeable difference on any reasonable transaction volume. Use one table if you have one attribute per entity and no compelling reason to do otherwise.
When returning and parsing a single row, the number of columns is unlikely to make a noticeable difference. However, searching and scanning tables with smaller rows is faster than tables with larger rows.
When searching using an index, MySQL walks a B-tree, which behaves much like a binary search (the cost grows logarithmically with the number of rows), so it would take significantly larger rows (and many of them) before any speed penalty is noticeable.
Scanning is a different matter. When scanning, it's reading through all of the data for all of the rows, so there's a 1-to-1 performance penalty for larger rows. Yet, with proper indexes, you shouldn't be doing much scanning.
However, in this case, keep the date together with the user info because they'll be queried together and there's a 1-to-1 relationship, and a table with larger rows is still going to be faster than a join.
Only denormalize for optimization when performance becomes an actual problem and you can't resolve it any other way (adding an index, improving hardware, etc.).
Suppose I have a MySQL query like this, where the table PEOPLE has about 2 million rows:
SELECT * FROM `PEOPLE` WHERE `SEX`=1 AND `AGE`=28;
The first condition will match 1 million rows, and the second condition may match 20,000 rows. On a local website, most developers said that swapping the order of the conditions would have a better effect. They also said that the original query above costs 2 million + 1 million + *10,000* I/O operations, while the swapped order costs 2 million + 20,000 + *10,000*. That sounds like it makes sense.
As we all know, MySQL has an internal query optimizer for this kind of work. Does the order of the conditions need particular attention for optimal performance? I am totally confused.
PS: I noticed that some similar questions have been asked already, but they are two or three years old, so it seemed better to ask again.
Thanks to everyone who noticed this question. Here is an explanation of why I am asking again:
Before I asked this question, I ran EXPLAIN a couple of times, and the answer was that the order doesn't matter. But the interviewer told me the order will make a difference in performance, so I want to make sure I am not missing something.
You should first understand a fundamental thing: in theory, a relational database does not have indices.
A purely theoretical relational database engine would indeed scan all records, check the criterion on the sex and age columns and only return the relevant rows.
However, indices are a common layer added by SQL database engines to filter rows faster. In this case, you should have indices for both of these columns.
What is more, these same database engines perform analysis on these indices (if any) to determine the best possible course of action to retrieve the relevant rows faster. In particular, one criterion in index metadata is cardinality: for a given value of the indexed column, how many rows match on average? The higher the number of rows, the lower the cardinality. Therefore, the higher the cardinality the better.
Therefore, an SQL engine's query optimizer will most likely choose to narrow down the result set by looking up the age index first, and only then the index on sex. And it may even choose not to use the index on sex at all if it determines that it is faster to just check the sex column value for each row resulting from the first filter. That is likely here, since the cardinality of the sex column is ridiculously low.
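As a toy model of that reasoning (not MySQL's actual cost model, which relies on index statistics and more), the sketch below ranks conditions by the fraction of rows they are expected to match, using the numbers from the question. The function name is hypothetical.

```python
def most_selective_first(total_rows, matches_by_condition):
    """Sort conditions so the one matching the smallest fraction comes first.

    matches_by_condition maps a condition (as text) to the estimated number
    of rows it matches. This is a simplification of what an optimizer does.
    """
    return sorted(
        matches_by_condition,
        key=lambda cond: matches_by_condition[cond] / total_rows,
    )

# The conditions are listed in the order written in the query.
order = most_selective_first(
    2_000_000,
    {"SEX = 1": 1_000_000, "AGE = 28": 20_000},
)
print(order)  # ['AGE = 28', 'SEX = 1']
```

The input order plays no part in the result, which mirrors the point of the answer: the optimizer decides, not the order in which you typed the conditions.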
Have a look here for an introduction to the relational model.
Suppose I have a student table containing id, class and school_id, with 1000 records.
There are 3 schools and 12 classes.
Which of these 2 queries would be faster (if there is a difference)?
Query 1:
SELECT * FROM student WHERE school = 2 and class = 5;
Query 2:
SELECT * FROM student WHERE class = 5 and school = 2;
Note: I just changed the places of the 2 conditions in WHERE.
So which one will be faster, and is the following true?
->probable number of records in query1 is 333
->probable number of records in query2 is 80.
It seriously doesn't matter one little bit. 1000 records is a truly tiny database table and, if there's a difference at all, you need to upgrade from such a brain-dead DBMS.
A decent DBMS would have already collected the stats from tables (or the DBA would have done it as part of periodic tuning) and the order of the where clauses would be irrelevant.
The execution engine would choose the one which reduced the cardinality (ie, reduced the candidate group of rows) the fastest. That means that (assuming classes and schools are roughly equally distributed) the class = 5 filter would happen first, no matter the order in the select statement.
Explaining the cardinality issue in a little more depth, for a roughly evenly distributed spread of those 1000 records, there would be 333 for each school and 83 for each class.
What a DBMS would do would be to filter first on what gives you the smallest result set. So it would tend to prefer using the class filter. That would immediately drop the candidate list of rows to about 83. Then, it's a simple matter of tossing out those which have a school other than 2.
In both cases, you end up with the same eventual row set but the initial filter is often faster since it can use an index to only select desired rows. The second filter, on the other hand, most likely goes through those rows in a less efficient manner so the quicker you can reduce the number of rows, the better.
If you really want to know, you need to measure rather than guess. That's one of the primary responsibilities of a DBA, tuning the database for optimal execution of queries.
These 2 queries are strictly the same :)
hypothetical; to teach a DB concept
"How your DB uses cardinality to optimize your queries"
So, it's basically true that they are identical, but I will mention one thought hinting at the "why" which will actually introduce a good RDBMS concept.
Let's just say hypothetically that your RDBMS used the WHERE clauses strictly in the order you specified them.
In that case, the optimal query would be the one in which the column with maximum cardinality was specified first. What that means is that specifying class=5 first would be faster, as it more quickly eliminates rows from consideration: if the row's "class" column does not contain 5 (which is statistically more likely than its "school" column not containing 2), then it doesn't even need to evaluate the "school" column.
Coming back to reality, however, you should know that almost all modern relational database management systems do what is called "building a query plan" and "compiling the query". This involves, among other things, evaluating the cardinality of columns specified in the WHERE clause (and what indexes are available, etc). So essentially, it is probably true to say they are identical, and the number of results will be, too.
The number of rows affected will not change simply because you reorder the conditions in the WHERE clause of the SQL statement.
The execution time will also not be affected since the sql-server will look for a matching index first.
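If you want to convince yourself that reordering the conditions cannot change the result, it is easy to check. The sketch below uses sqlite3 instead of MySQL and fills a hypothetical student table with an artificial, evenly spread distribution of 12 classes and 3 schools; it only demonstrates the equality of the results, not anything about timing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student (id INTEGER PRIMARY KEY, class INTEGER, school INTEGER)"
)
# 1000 rows; class cycles through 1..12, school changes every 12 rows.
conn.executemany(
    "INSERT INTO student (id, class, school) VALUES (?, ?, ?)",
    [(i, i % 12 + 1, (i // 12) % 3 + 1) for i in range(1000)],
)

q1 = "SELECT id FROM student WHERE school = 2 AND class = 5 ORDER BY id"
q2 = "SELECT id FROM student WHERE class = 5 AND school = 2 ORDER BY id"

rows1 = conn.execute(q1).fetchall()
rows2 = conn.execute(q2).fetchall()
print(rows1 == rows2)  # True: AND is commutative, the row set is identical
```

For timing differences you would still have to measure on your own data and engine, as suggested above, but the row set itself is not up for debate.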
First query executes faster than 2nd query because in where clause it filters school first so it is easier to get the class details later
Recently I was asked to develop an app, which basically is going to use 1 main single table in the whole database for the operations.
It has to have around 20 columns with various types - decimals, int, varchar, date, float. At some point the table will have thousands of rows (3-5k).
The app must have the ability to SELECT records by combining each of the columns criteria - e.g. BETWEEN dates, greater than something, smaller than something, equal to something etc. Basically combining a lot of where clauses in order to see the desired result.
So my question is, since I know how to combine the wheres and make the app, what is the best approach? I mean is MySQL good enough not to slow down when I have 3k records and make a SELECT query with 15 WHERE clauses? I've never worked with a database larger than 1k records, so I'm not sure if I should use MySQL for this. Also I'm going to use PHP as a server language if that matters at all.
You are talking about conditions in ONE where clause.
3000 rows is very small for a relational database. These typically grow far larger (like 3 million+ or even much more).
I am concerned that you have 20 columns in one table. This sounds like a normalization problem.
With a well-defined structure for your database, including appropriate indexes, 3k records is nothing, even with 15 conditions. Even without indexes, it is doubtful that with so few records, you will see any performance hit.
I would however plan for the future and perhaps look at your queries and see if there is any table optimisation you can do at this stage, to save pain in the future. Who knows, 3k records today, 30m next year.
3000 Records in a database is nothing. You won't have any performance issues even with your 15 WHERE.
MySQL and PHP will do the job just fine.
I'd be more concerned about your large number of columns. Maybe you should take a look at this article to make sure you respect the database normal forms.
Good luck for your project.
I don't think querying a single table of 3-5K rows is going to be particularly intensive. MySQL should be able to cope with something like this easily enough. You could add lots of indexes to speed up your selects if this is the "choke point", but this will slow down inserts, edits, etc. Also, if you are querying lots of different rows, this probably isn't a good idea.
Since the number of rows is very small, I guess it should not cause any performance issue. Still, you can look at using the OR operator carefully, and also at indexes on the columns in the WHERE clause.
Indices, indices, indices!
If you need to check a lot of different columns, try to flatten your logic. In any case, make sure you have set an appropriate index on the checked columns. Not one index per column, but one index over all the columns that are used regularly.
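Here is a small sketch of what "one index over several columns" looks like in practice. It uses sqlite3 and its EXPLAIN QUERY PLAN statement as a stand-in, so the plan text differs from MySQL's EXPLAIN output; the records table, its columns and the index name are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    "id INTEGER PRIMARY KEY, status TEXT, created TEXT, amount REAL)"
)
# One composite index covering the columns that are filtered together.
conn.execute(
    "CREATE INDEX idx_records_status_created ON records (status, created)"
)

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM records "
    "WHERE status = ? AND created BETWEEN ? AND ?",
    ("open", "2024-01-01", "2024-12-31"),
).fetchall()

# The fourth column of each plan row is the human-readable description.
plan_text = " ".join(row[3] for row in plan)
print(plan_text)
```

The printed plan should mention idx_records_status_created, showing that both conditions are served by the single composite index. In MySQL you would check the key column of EXPLAIN in the same spirit.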