MySQL IN vs <> performance

I have a table with a status field that can have the values 1, 2, 3, 4, or 5. I need to select all the rows from the table with status != 1. I have the following 2 options:
Note that the table has an INDEX on the status field.
SELECT ... FROM my_tbl WHERE status <> 1;
or
SELECT ... FROM my_tbl WHERE status IN(2,3,4,5);
Which of the above is a better choice? (my_tbl is expected to grow very big).

You can run your own tests to find out, because it will vary depending on the underlying tables.
More than that, please don't worry about "fastest" without first having measured that it actually matters.
Rather than worrying about fastest, think about which way is clearest.
In databases especially, think about which way is going to protect you from data errors.
It doesn't matter how fast your program is if it's buggy or gives incorrect answers.

How many rows have the value "1"? If fewer than ~20%, then more than 80% of the rows match status <> 1, and you will get a table scan regardless of how you formulate the WHERE (IN, <>, BETWEEN). That's assuming you have INDEX(status).
But indexing ENUMs, flags, and other low-cardinality columns is rarely useful.
An IN clause with 50K items causes memory problems (or at least it used to), but not performance problems. The values are sorted, and a binary search is used.
Rule of Thumb: The cost of evaluation of expressions (IN, <>, functions, etc) is mostly irrelevant in performance. The main cost is fetching the rows, especially if they need to be fetched from disk.
An INDEX may assist in minimizing the number of rows fetched.
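To see which plan you actually get, EXPLAIN is the quickest check (a sketch using the table and index from the question, with SELECT * standing in for the column list):
EXPLAIN SELECT * FROM my_tbl WHERE status <> 1;
EXPLAIN SELECT * FROM my_tbl WHERE status IN (2,3,4,5);
In the output, type = ALL means a full table scan, while type = range means the index on status is being used. If both statements show the same plan, they will perform the same.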

You can use BENCHMARK() to test it yourself.
http://sqlfiddle.com/#!2/d41d8/29606/2
The first one is faster, which makes sense since it only has to compare 1 number instead of 4 numbers.
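If you'd rather micro-benchmark just the expression evaluation (not row fetching) on your own server, BENCHMARK() can be used like this; the repetition count is arbitrary:
SELECT BENCHMARK(10000000, 2 <> 1);
SELECT BENCHMARK(10000000, 2 IN (2,3,4,5));
Keep in mind this measures only expression evaluation; as noted above, fetching rows dominates the real cost of a query.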

Related

Will record order change between two identical queries in MySQL without ORDER BY

The problem is that I need to do pagination. I want to use ORDER BY and LIMIT, but my colleague told me MySQL will return records in the same order, and since this job doesn't care in which order the records are shown, we don't need ORDER BY.
So I want to ask: is what he said correct? Of course, assume that no records are updated or inserted between the two queries.
You don't show your query here, so I'm going to assume that it's something like the following (where ID is the primary key of the table):
select *
from TABLE
where ID >= :x:
limit 100
If this is the case, then with MySQL you will probably get rows in the same order every time. This is because the only predicate in the query involves the primary key, which is a clustered index for MySQL, so it is usually the most efficient way to retrieve the rows.
However, probably may not be good enough for you, and if your actual query is any more complex than this one, probably no longer applies. Even though you may think that nothing changes between queries (i.e., no rows inserted or deleted), and so you'll get the same optimization plan, that is not true.
For one thing, the block cache will have changed between queries, which may cause the optimizer to choose a different query plan. Or maybe not. But I wouldn't take the word of anyone other than one of the MySQL maintainers that it won't.
Bottom line: use an order by on whatever column(s) you're using to paginate. And if you're paginating by the primary key, that might actually improve your performance.
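A minimal sketch of that keyset approach, reusing the placeholder style from above (:last_id: is the highest ID from the previous page):
select *
from TABLE
where ID > :last_id:
order by ID
limit 100
Because the predicate and the sort both use the clustered primary key, the ORDER BY adds essentially no cost, and the row order is now guaranteed rather than accidental.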
The key point here is that database engines need to handle potentially large datasets and need to care (a lot!) about performance. MySQL is never going to waste any resource (CPU cycles, memory, whatever) doing an operation that doesn't serve any purpose. Sorting result sets that aren't required to be sorted is a pretty good example of this.
When issuing a given query, MySQL will try hard to return the requested data as quickly as possible. When you insert a bunch of rows and then run a simple SELECT * FROM my_table query, you'll often see that rows come back in the same order as they were inserted. That makes sense because the obvious way to store the rows is to append them as inserted, and the obvious way to read them back is from start to end. However, this simplistic scenario won't apply everywhere, every time:
Physical storage changes. You won't just be appending new rows at the end forever. You'll eventually update values, delete rows. At some point, freed disk space will be reused.
Most real-life queries aren't as simple as SELECT * FROM my_table. The query optimizer will try to leverage indices, which can have a different order. Or it may decide that the fastest way to gather the required information is to perform internal sorts (that's typical for GROUP BY queries).
You mention paging. Indeed, I can think of some ways to create a paginator that doesn't require sorted results. For instance, you can assign page numbers in advance and keep them in a hash map or dictionary: items within a page may appear in random locations, but paging will be consistent. This is of course pretty suboptimal: it's hard to code and requires constant updating as the data mutates. ORDER BY is basically the easiest way. What you can't do is base your paginator on the assumption that SQL data sets are ordered sets, because they aren't; neither in theory nor in practice.
As an anecdote, I once used a major framework that implemented pagination using the ORDER BY and LIMIT clauses. (I won't say the name because it isn't relevant to the question... well, dammit, it was CakePHP/2.) It worked fine when sorting by ID. But it also allowed users to sort by arbitrary columns, which were often not unique, and I once found an item that was being shown in two different pages: because the framework was naively sorting by a single non-unique column, that row made its way into both ORDER BY type LIMIT 10 and ORDER BY type LIMIT 10, 10, since both sortings complied with the requested condition.
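For what it's worth, the standard fix for that bug is to add a unique column as a tie-breaker so the ordering is total (a sketch with hypothetical names, where id is unique):
SELECT * FROM items ORDER BY type, id LIMIT 10, 10;
With id breaking ties among equal type values, no row can legitimately appear on two different pages.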

SELECT TableName.Col1 VS SELECT Col1

This might be a weird question, but I didn't know how to research it. When doing the following query:
SELECT Foo.col1, Foo.col2, Foo.col3
FROM Foo
INNER JOIN Bar ON Foo.ID = Bar.BID
I tend to use TableName.Column instead of just col1, col2, col3
Is there any performance difference? Is it faster to specify the table name for each column?
My guess would be that it is faster, since it would take some time to look up the column name and disambiguate it.
If anyone knows a link where I could read up on this, I would be grateful. I didn't even know how to title this question better, since I'm not sure how to search for it.
First of all: this should not matter. The time to look up the columns is such a minuscule fraction of the total processing time of a typical query that this might be the wrong spot to look for additional performance.
Second: Tablename.Colname is faster than Colname alone, as it eliminates the need to search the referenced tables (and table-like structures like views and subqueries) for a fitting column. Again: the difference is within the statistical noise.
Third: Using Tablename.Colname is a good idea, but for other reasons: if you use Colname only, and one of the tables in your query gets a new column with the same name, you end up with the oh-so-well-known "ambiguous column name" error. Typical candidates for such columns are "comment", "lastchanged", and friends. If you qualify your column references, this maintainability problem simply disappears - your query will work as always, ignoring the new fields.
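A quick illustration of that failure mode, assuming (hypothetically) that both Foo and Bar later gain a comment column:
SELECT comment FROM Foo INNER JOIN Bar ON Foo.ID = Bar.BID; -- error: column 'comment' in field list is ambiguous
SELECT Foo.comment FROM Foo INNER JOIN Bar ON Foo.ID = Bar.BID; -- still unambiguous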
If it's faster, the difference is surely negligible, like a few microseconds per query. All the data about the tables mentioned in the query has to be loaded into memory, so it doesn't save any disk access. It's done during query parsing, not during data processing. Even if you run the query thousands of times, it might not make up for the time spent typing those extra characters, and certainly not the time we've spent discussing it. :)
But it makes the queries longer, so there's slightly more time spent in communications. If you're sending the query over a network, that will probably negate any time saved during parsing. You can reduce this by using short table aliases, though:
SELECT t.col1, t.col2
FROM ReallyLongTableName t
As a general rule, when worrying about database performance you only need to concern yourself with aspects whose time depends on the number of rows in the tables. Anything that's the same regardless of the amount of data will fall into the noise, unless you're dealing with extremely tiny tables (in which case, why are you bothering with a database -- use a flat file).

Which SQL will be faster, and why?

Suppose I have a student table containing id, class, and school_id, with 1000 records.
There are 3 schools and 12 classes.
Which of these 2 queries would be faster (if there is a difference)?
Query 1:
SELECT * FROM student WHERE school = 2 and class = 5;
Query 2:
SELECT * FROM student WHERE class = 5 and school = 2;
Note: I just changed the places of the 2 conditions in WHERE.
Then which will be faster? And is the following true?
-> probable number of records matched by the first condition in query 1: 333
-> probable number of records matched by the first condition in query 2: 80
It seriously doesn't matter one little bit. 1000 records is a truly tiny database table and, if there's a difference at all, you need to upgrade from such a brain-dead DBMS.
A decent DBMS would have already collected the stats from tables (or the DBA would have done it as part of periodic tuning) and the order of the where clauses would be irrelevant.
The execution engine would choose the one which reduced the cardinality (i.e., reduced the candidate group of rows) the fastest. That means that (assuming classes and schools are roughly equally distributed) the class = 5 filter would happen first, no matter the order in the SELECT statement.
Explaining the cardinality issue in a little more depth, for a roughly evenly distributed spread of those 1000 records, there would be 333 for each school and 83 for each class.
What a DBMS would do is filter first on whatever gives you the smallest result set. So it would tend to prefer using the class filter. That would immediately drop the candidate list of rows to about 83. Then it's a simple matter of tossing out those which have a school other than 2.
In both cases, you end up with the same eventual row set but the initial filter is often faster since it can use an index to only select desired rows. The second filter, on the other hand, most likely goes through those rows in a less efficient manner so the quicker you can reduce the number of rows, the better.
If you really want to know, you need to measure rather than guess. That's one of the primary responsibilities of a DBA, tuning the database for optimal execution of queries.
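Incidentally, if you want to make the optimizer's job trivial here, a single composite index covers both predicates at once, whichever order they appear in the WHERE clause (a sketch using the column names from the queries above):
CREATE INDEX idx_class_school ON student (class, school);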
These 2 queries are strictly the same :)
Hypothetical, to teach a DB concept:
"How your DB uses cardinality to optimize your queries"
So, it's basically true that they are identical, but I will mention one thought hinting at the "why" which will actually introduce a good RDBMS concept.
Let's just say hypothetically that your RDBMS used the WHERE clauses strictly in the order you specified them.
In that case, the optimal query would be the one in which the column with maximum cardinality was specified first. What that means is that specifying class = 5 first would be faster, as it more quickly eliminates rows from consideration: if a row's "class" column does not contain 5 (which is statistically more likely than its "school" column not containing 2), then the engine doesn't even need to evaluate the "school" column.
Coming back to reality, however, you should know that almost all modern relational database management systems do what is called "building a query plan" and "compiling the query". This involves, among other things, evaluating the cardinality of columns specified in the WHERE clause (and what indexes are available, etc). So essentially, it is probably true to say they are identical, and the number of results will be, too.
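You can verify this yourself by comparing the plans (a sketch using the queries from the question):
EXPLAIN SELECT * FROM student WHERE school = 2 AND class = 5;
EXPLAIN SELECT * FROM student WHERE class = 5 AND school = 2;
Identical key, type, and rows values in the output mean the optimizer has normalized both statements to the same plan.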
The number of rows returned will not change simply because you reorder the conditions in the WHERE clause of the SQL statement.
The execution time will also not be affected, since the SQL server will look for a matching index first.
The first query executes faster than the second because its WHERE clause filters on school first, so it is easier to get the class details later.

MySQL Improving speed of order by statements

I've got a table in a MySQL db with about 25000 records. Each record has about 200 fields, many of which are TEXT. There's nothing I can do about the structure - this is a migration from an old flat-file db which has 16 years of records, and many fields are "note" type free-text entries.
Users can be viewing any number of fields, and order by any single field, and any number of qualifiers. There's a big slowdown in the sort, which is generally taking several seconds, sometimes as much as 7-10 seconds.
An example statement might look like this:
select a, b, c from table where b=1 and c=2 or a=0 order by a desc limit 25
There's never a star-select, and there's always a limit, so I don't think the statement itself can really be optimized much.
I'm aware that indexes can help speed this up, but since there's no way of knowing which fields are going to be sorted on, I'd have to index all 200 columns - what I've read about this doesn't seem to be consistent. I understand there'd be a slowdown when inserting or updating records, but assuming that's acceptable, is it advisable to add an index to each column?
I've read about sort_buffer_size but it seems like everything I read conflicts with the last thing I read - is it advisable to increase this value, or any of the other similar values (read_buffer_size, etc)?
Also, the primary identifier is a crazy pattern they came up with in the nineties. This is the PK and so should be indexed by virtue of being the PK (right?). The records are (and have been) submitted to the state, and to their clients, and I can't change the format. This column needs to sort based on the logic that's in place, which involves a stored procedure with string concatenation and substring matching. This particular sort is especially slow, and doesn't seem to cache, even though this one field is indexed, so I wonder if there's anything I can do to speed up the sorting on this particular field (which is the default order by).
TYIA.
I'd have to index all 200 columns
That's not really a good idea. Because of the way MySQL uses indexes, most of them would probably never be used while still generating quite a large overhead (see chapter 7.3 in the link below for details). What you could do, however, is try to identify which columns appear most often in WHERE clauses, and index those.
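As a sketch, if the b and c columns from the example statement turned out to be the most common filters, a composite index would be the first thing to try (my_tbl stands in for your table name; note that the OR a=0 part may still prevent the optimizer from using it):
ALTER TABLE my_tbl ADD INDEX idx_b_c (b, c);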
In the long run however, you will probably need to find a way, to rework your data structure into something more manageable, because as it is now, it has the smell of 'spreadsheet turned into database' which is not a nice smell.
I've read about sort_buffer_size but it seems like everything I read conflicts with the last thing I read - is it advisable to increase this value, or any of the other similar values (read_buffer_size, etc)?
In general the answer is yes. However, the actual details depend on your hardware, OS, and what storage engine you use. See chapter 7.11 (especially 7.11.4) in the link below.
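A low-risk way to experiment is at the session level before touching the server-wide default (the value here is arbitrary):
SET SESSION sort_buffer_size = 4 * 1024 * 1024; -- 4 MB, affects only this connection
Rerun the slow ORDER BY query with the new value and compare timings before committing the change to the server configuration.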
Also, the primary identifier is a crazy pattern they came up with in the nineties. [...] I wonder if there's anything I can do to speed up the sorting on this particular field (which is the default order by).
Perhaps you could add a primarySortOrder column to your table, into which you could store numeric values that map to the PK order (precalculated from the stored procedure you're using).
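A sketch of that approach (table, column, and index names are made up):
ALTER TABLE my_tbl ADD COLUMN primarySortOrder INT;
CREATE INDEX idx_primarySortOrder ON my_tbl (primarySortOrder);
SELECT a, b, c FROM my_tbl ORDER BY primarySortOrder LIMIT 25;
You would populate primarySortOrder from the existing stored procedure's logic and keep it current on insert/update (e.g. with a trigger); the default sort then becomes a cheap index traversal instead of a per-row procedure call.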
And the link you've been waiting for: Chapter 7 of the MySQL manual: Optimization
Add an index to all the columns that have a large number of distinct values, say 100 or even 1000 or more. Tune this number as you go.

MySQL performance comparison

Which is more efficient, and by how much?
type 1:
insert into table_name (column1, column2, ...)
select column1, column2, ... from another_table where columnX in (value_list)
type 2:
insert into table_name (column1, column2, ...)
values (column1_0, column2_0, ...), (column1_1, column2_1, ...)
The first version looks short, and the second may become extremely long when value_list contains, say, 500 or even more values.
But I have no idea whose performance will be better, though it intuitively feels like the first should be more efficient.
The first is cleaner, especially if your data is already in MySQL (which I'm assuming you're saying?). You would save some time in network overhead sending data and in parsing time, and you'd have to worry less about hitting whatever query size limit your client has.
However, in general, I would expect the performance to be similar as the number of rows grows larger, especially on a well-indexed table. Most of the time for inserts with large queries is spent doing things like building indexes, and both of those queries, absent turning indexes off, would have to do that.
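For very large batches, the classic way to defer that index-building cost looks like the following sketch (note that DISABLE KEYS only affects non-unique indexes on MyISAM tables, and that the value list here is just a stand-in for your value_list):
ALTER TABLE table_name DISABLE KEYS;
INSERT INTO table_name (column1, column2) SELECT column1, column2 FROM another_table WHERE columnX IN (1, 2, 3);
ALTER TABLE table_name ENABLE KEYS;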
I agree with Todd, the first query is cleaner and will be faster to send to the MySQL server and faster to compile. And it's probably true that as the number of inserted records increases, the speed differential will drop.
But the first form has substantial other benefits to consider:
It's far easier to maintain: you only have to add or modify a field every now and then.
You avoid the expense of querying another_table and processing the results to concatenate the second query (a hidden cost of that approach).
If you need to run this update more than once, the first query can be cached in the MySQL server along with its compiled form and query plan. This makes subsequent invocations of the query run a bit faster.