Index optimised for specific lookup - sql-server-2008

I have a table (which could potentially grow large, to millions of rows) on which I regularly perform the query SELECT * FROM table WHERE somefield = 20, and I would like this query to run fast. At any time, I expect this query to return at most 10 rows out of possibly millions, for this specific value 20 (no guarantees for any other values). What would be the proper way to index this? Is it sufficient to just place an index on somefield and make sure statistics are roughly up to date? Or are there any other tricks I could try to optimise this?

The ideal index for this query in isolation would be an index with key column somefield and included columns of all other columns in the table (either by making the index clustered or an NCI with the INCLUDE option).
This would allow the rows to be located directly with an index seek and avoid the need for bookmark lookups.
But the maintenance overhead of an NCI with all those included columns would affect data modification operations, and you might prefer a CI defined on different key columns to benefit other queries, or to avoid fragmentation, anyway.
So for that reason you may well prefer to define an NCI on somefield alone and live with the 10 bookmark lookups. It is a balancing act.
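As a hedged sketch of the two alternatives (dbo.MyTable and col1, col2, col3 are placeholder names standing in for the table and its remaining columns, which the question doesn't give):

-- Option 1: covering NCI; fastest reads, highest maintenance overhead
CREATE NONCLUSTERED INDEX IX_MyTable_somefield_covering
ON dbo.MyTable (somefield)
INCLUDE (col1, col2, col3);

-- Option 2: narrow NCI; accepts the ~10 bookmark lookups per query
CREATE NONCLUSTERED INDEX IX_MyTable_somefield
ON dbo.MyTable (somefield);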
Edit: Actually, if you are only interested in optimising the query WHERE somefield = 20, then you could just create a filtered index on that value. I would then likely include all columns in that index definition.
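A minimal sketch of such a filtered index (again with placeholder names); filtered indexes are available from SQL Server 2008 onwards:

CREATE NONCLUSTERED INDEX IX_MyTable_somefield_20
ON dbo.MyTable (somefield)
INCLUDE (col1, col2, col3)
WHERE somefield = 20;  -- the index stores only the rows matching this value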

Is it sufficient to just place an index on somefield, and make sure statistics are roughly up to date?
Yes, quite simply. Make sure somefield is the right type (i.e. INT). If somefield needs to contain text there's more you can do, but otherwise a normal index will be fine.
You can get small increases (and I do mean small) by not using SELECT * if you don't need every field returned (you won't need somefield, presumably, as you already know what it is).

Yes, you would want to add an index on somefield.
If you are not doing other queries, then you may want to make it a clustered index, but without context, it's hard to say conclusively.

Related

MySQL Index is NULL but there are available Keys

I have the following problem when running a MySQL query:
The query is very slow, and when I use EXPLAIN the key is NULL even though possible_keys are available and the order is correct. I also tried adding independent indexes for each column, but the key was still NULL.
You can see the table, indexes, and MySQL EXPLAIN here: https://snag.gy/vcChl6.jpg
The optimizer likely has just decided that there is no reason to use the index.
Since you are using SELECT *, if it used the index then it would have to take the primary key from the index and go back to look up all the remaining data from the clustered index. That is referred to as a double lookup, and is generally bad for performance. As there are so few records in this table, the optimizer likely decided that it can easily do a full table scan instead and get your result faster.
In short, this is expected behavior.
If you want to SELECT just some columns, add them to the t1 index and then SELECT only the columns you need with that given WHERE clause. It should use the index then. As your table grows in size, it may also start using the index once it estimates that the double lookup is cheaper than the full table scan.
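A hedged sketch of that covering-index idea (the table name mytable and the column name are placeholders; only the composite index t1 on (id_project, id_lang) comes from the question):

ALTER TABLE mytable DROP INDEX t1;
ALTER TABLE mytable ADD INDEX t1 (id_project, id_lang, name);  -- now covers the query below

SELECT name FROM mytable WHERE id_project = 1 AND id_lang = 2;  -- no clustered-index lookup needed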
A guess: Most rows are of that 'project' and that 'lang'.
The Optimizer does not understand that fact, so it takes the index that is obviously the best:
(id_project, id_lang)
This one would be equally good: (id_lang, id_project).
No fair... The EXPLAIN mentions indexes named id_project and id_lang (not useful), but the list of indexes shows a composite index t1(id_project, id_lang) (useful).
Then, as Willem suggests, it has to bounce between the index and the table. Normally (that is, when it has adequate statistics), the Optimizer will say "Oh, more than ~20% of the table is being referenced; let's ignore any index."
Things you can do:
Get rid of that index.
Change * to a list of just the columns you need (see the sketch after this list). In particular, if you avoid the 3 TEXT columns, two optimizations kick in. Alternatively, any that will never be longer than 255 characters can be changed to VARCHAR(255).
Use some other filtering, ordering, limiting, etc. If this is a web application, do you really want to get ~534 rows?
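To illustrate trimming the select list, a hedged sketch (mytable, id, and title are placeholder names; only id_project and id_lang come from the question):

SELECT id, title   -- skip the TEXT columns
FROM mytable
WHERE id_project = 1 AND id_lang = 2
ORDER BY id
LIMIT 50;          -- cap the result instead of fetching ~534 rows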

MySQL not using index?

I have a table with columns like word, A_, E_, U_, etc. These X_ columns are TINYINTs holding how many times the specific letter occurs in the word (to later help optimize the wildcard search query).
There are 252k rows in total. If I search with WHERE U_ > 0 I get 60k rows. But if I run EXPLAIN on that SELECT, it says there are 225k rows to go through and no index is possible. Why? The column was added as an index. Why doesn't it say there are 60k rows to go through and that the possible key is U_?
Listing the indexes on the table (it is also strange that the others are grouped under the A_ index):
In comparison, if I run the query WHERE id > 250000 I get 2983 results, and EXPLAIN of that SELECT says there are 2982 rows and that the key to be used is PRIMARY.
BTW, if I GROUP BY U_ I get this (but it probably doesn't matter much, because as I already said the query returns 60k results):
EDIT:
If I create a column U (VARCHAR(1)) and do the update U = 'U' WHERE U_ > 0, then the select WHERE U = 'U' also returns 60k rows (obviously), but EXPLAIN gives this:
Still not so good (120k rows, not 60k), but at least better than the 225k rows in the previous case. This solution is a bit more piggy than the first one, but maybe a bit more efficient.
My experience is that MySQL chooses to do a tablescan, even if there is an index on the column you're searching, if your query would select more than approximately 25% of the rows in the table.
The reason for this is that using a secondary index in InnoDB is a bit more work than using a primary index.
Look up the value in the secondary index, like your index on u_.
Read the index entry, and find the corresponding primary key value(s) of the rows where that value of u_ is stored.
Look up the row(s) by primary key.
It's actually at least double the work to look up by secondary key. This isn't a problem if you ultimately match a small minority of rows of the table, and there are definitely cases where a secondary index is really important for your query. So don't be reluctant to use secondary indexes.
But if your query matches too many rows, and that becomes a big portion of the table, then it would be less work to just scan the table start-to-finish.
By analogy, why doesn't the index at the back of a book contain the word "the"? Because the entry would naturally list every single page in the book, and it would be a waste for you to refer to the index and then use it to guide you to each page in the main part of the book. You would have been better off just reading the book.
MySQL does not have any officially documented threshold for choosing a tablescan over an indexed search. The 25% figure is only my experience (actually sometimes it seems closer to 21%, but I don't know the code well enough to understand exactly how the threshold is calculated).
I've seen cases where the proportion of rows matched was very close to whatever threshold is in the implementation, and the behavior of the optimizer can actually flip-flop from one query to the next, resulting in highly variable performance.
If this case applies to you, you can use an index hint to make MySQL's optimizer pretend that a tablescan is prohibitively expensive, and it should prefer an index to a tablescan. This is done with the FORCE INDEX hint.
SELECT * FROM words FORCE INDEX(U_) WHERE U_ > 0
I still try to use index hints conservatively. They aren't necessary except in rare cases, and using an index hint means your query must include the index name. This makes it hard to change indexes without breaking your application code.
You're asking about the backend query optimizer. In particular you're asking: "how does it choose an access path? Why index here but tablescan there?"
Let's think about that optimizer. What is it optimizing? Elapsed time, in expectation. It has a model for how long sequential reads and random reads take, and for query selectivity, that is, expected number of rows returned by a query. From several alternative access paths it chooses the one that appears to require the least elapsed time.
Your id > 250000 query had a few things going for it:
good selectivity, so less than 1% of rows will appear in the result set
id is the Primary Key, so all columns are immediately available upon navigating to the right place in the btree
This caused the optimizer to compute an expected elapsed time for the indexed access path much smaller than expected time for tablescan.
On the other hand, your u_ > 0 query has very poor selectivity, dragging nearly a quarter of the rows into the result set. Additionally, the index is not a covering index for your * demand of copying all column values into the result set. So the optimizer predicts it will have to read a quarter of the index blocks, and then essentially all of the data row blocks that they point to. Compared to tablescan, we'd have to read more blocks from disk, and they would be random reads instead of sequential reads. Both of those argue against using the index, so tablescan was selected because it was cheapest. Also, remember that often multiple rows will fit within a single disk block, or within a single read request.
We would call it a pessimizer if it always chose the indexed access path, even in cases where indexed disk I/O would take longer.
summary advice
Use an index on a single column when your queries have good selectivity, returning much less than 1% of a relation's rows. Use a covering index when your queries have poor selectivity and you're willing to make a space vs. time tradeoff.
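As a hedged sketch of that space-for-time tradeoff (the words table and the U_ and word columns come from the question; the index name is made up):

ALTER TABLE words ADD INDEX idx_u_word (U_, word);  -- covering index for the query below

SELECT word FROM words WHERE U_ > 0;  -- answered from the index alone, no row lookups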

Should I avoid ORDER BY in queries for large tables?

In our application, we have a page that displays a set of data to the user (a part of it, actually). It also allows the user to order it by a custom field. So in the end it all comes down to a query like this:
SELECT name, info, description FROM mytable
WHERE active = 1 -- Some filtering by indexed column
ORDER BY name LIMIT 0,50; -- Just a part of it
And this worked just fine as long as the size of the table was relatively small (it was used only locally in our department). But now we have to scale this application. Let's assume the table has about a million records (we expect that to happen soon). What will happen with the ordering? Do I understand correctly that, in order to do this query, MySQL will have to sort a million records each time and return a part of it? This seems like a very resource-heavy operation.
My idea is simply to turn off that feature and not let users select their custom ordering (maybe just filtering), so that the order would be a natural one (by id in descending order; I believe indexing can handle that).
Or is there a way to make this query work much faster with ordering?
UPDATE:
Here is what I read from the official MySQL developer page.
In some cases, MySQL cannot use indexes to resolve the ORDER BY, although it still uses indexes to find the rows that match the WHERE clause. These cases include the following:
....
The key used to fetch the rows is not the same as the one used in the ORDER BY:
SELECT * FROM t1 WHERE key2=constant ORDER BY key1;
So yes, it does seem like MySQL will have a problem with such a query. So what do I do: not use an ORDER BY part at all?
The 'problem' here seems to be that you have two requirements (in the example):
active = 1
order by name LIMIT 0, 50
The former you can easily solve by adding an index on the active field.
The latter you can improve by adding an index on name.
Since you do both in the same query, you'll need to combine this into an index that lets you resolve the active value quickly and then from there on fetches the first 50 names.
As such, I'd guess that something like this will help you out:
CREATE INDEX idx_test ON myTable (active, name)
(in theory, as always, try before you buy!)
Keep in mind though that there is no such thing as a free lunch; you'll need to consider that adding an index also comes with downsides:
the index will make your INSERT/UPDATE/DELETE statements (slightly) slower; usually the effect is negligible, but only testing will show
the index will require extra space in the database; think of it as an additional (hidden) special table sitting next to your actual data. The index will only hold the fields required plus the PK of the originating table, which usually is a lot less data than the entire table, but for 'millions of rows' it can add up.
if your query selects one or more fields that are not part of the index, then the system will have to fetch the matching PK fields from the index first and then go look for the other fields in the actual table by means of the PK. This probably is still (a lot) faster than not having the index, but keep it in mind when doing something like SELECT * FROM ...: do you really need all the fields?
In the example you use active and name, but from the text I get that these might be 'dynamic', in which case you'd have to foresee all kinds of combinations. From a practical point of view this might not be feasible, as each index comes with the downsides above, and each additional index adds those downsides to the list again (they are cumulative).
PS: I use PK for simplicity but in MSSQL it's actually the fields of the clustered index, which USUALLY is the same thing. I'm guessing MySQL works similarly.
EXPLAIN your query and check whether it uses filesort.
If the ORDER BY cannot use an index, or if the MySQL optimizer prefers to avoid the existing index(es) for sorting, it falls back to filesort.
Now, if you're getting filesort, then you should preferably either avoid ORDER BY or create appropriate index(es).
If the data is small enough, the sort is done in memory; otherwise it goes to disk.
So you may also try changing the sort_buffer_size variable.
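A minimal sketch of that check, reusing the query from the question:

EXPLAIN SELECT name, info, description
FROM mytable
WHERE active = 1
ORDER BY name
LIMIT 0,50;

Look for "Using filesort" in the Extra column of the output; if it appears, the sort is not being resolved by an index.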
There are always tradeoffs. One way to improve the performance of an ORDER BY query is to set the sort buffer size and then run the query:
SET sort_buffer_size = 100000;
If this size is increased further, the performance will start decreasing.

Indexing on a column with few fixed values whose matches constitute less than 25% of total rows

I have a field table_name in a table which can have only 20 different values. The table has a few tens of thousands of rows in total. If I do a query like this:
SELECT * FROM table WHERE table_name = 'adasd';
at most 25% of the total rows are returned; mostly I get only 10% of the total records. Is there any scope to index the field table_name here? I hear that for indexes to work well, the values in that field need to be unique or close to it. In my case, it's not at all close to unique. But I also heard that if the number of returned rows is small compared to the total number of rows, it makes a good case for indexing.
How should I go about this?
No, they don't have to be unique to get a benefit from using indexes; however, take some time to think about what the DBMS does when processing a query:
Full table scan - a sequential read through the data (i.e. very few seek operations)
Index lookup - a few seeks on the index to find the start of the selected data, then a sequential read (few seeks) to identify rows in the underlying table, then LOTS AND LOTS of seeks to fetch the rows from the table
Seeks are expensive.
(there is a secondary effect of full table scans in that they are more prone to flushing hot data out of the cache - but you should address the primary concern first).
In this case, it's unlikely that the DBMS would use the index if it were present, and even if it did, it would probably be slower than a full table scan. As a (very) rough rule of thumb, you're only going to get a benefit from an index if a predicate identifies less than around 5% of the rows (but it will vary depending on the relative size of the index and the data).
i.e. don't bother adding an index on this field alone.
I think you may benefit from spending some time thinking about why you need to run queries which return so many rows?
Revised Answer
I just learned that creating an index does not mean that MySQL will use it. Keeping that in mind, I will re-phrase my answer:
You should create an index on that column if general practice (or your own) suggests you do so. MySQL uses heuristics, which include looking at the available indexes and their respective cardinality, to determine the best index to use, or whether to use an index at all.
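A small sketch of how to inspect what those heuristics see (standard MySQL statements; mytable is a placeholder name):

SHOW INDEX FROM mytable;   -- the Cardinality column feeds the optimizer's choice
ANALYZE TABLE mytable;     -- refresh the statistics if they look stale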
Interesting reading about this topic here.

What are the biggest benefits of using INDEXES in mysql?

I know I need to have a primary key set, and to set anything that should be unique as a unique key, but what is an INDEX and how do I use them?
What are the benefits? Pros & Cons? I notice I can either use them or not, when should I?
Short answer:
Indexes speed up SELECTs and slow down INSERTs.
Usually it's better to have indexes, because they speed up SELECTs more than they slow down INSERTs.
On an UPDATE, an index can speed things way up if an indexed field is used in the WHERE clause, and slow things down if you update one of the indexed fields.
How do you know when to use an index
Add EXPLAIN in front of your SELECT statement.
Like so:
EXPLAIN SELECT * FROM table1
WHERE unindexedfield1 > unindexedfield2
ORDER BY unindexedfield3
This will show you how much work MySQL will have to do on each of the unindexed fields.
Using that info you can decide if it is worthwhile to add indexes or not.
EXPLAIN can also tell you if it is better to drop an index:
EXPLAIN SELECT * FROM table1
WHERE indexedfield1 > indexedfield2
ORDER BY indexedfield3
If very few rows are selected, or if MySQL decides to ignore the index (it does that from time to time), then you might as well drop the index, because it is slowing down your inserts without speeding up your selects.
Then again it might also be that your select statement is not clever enough.
(Sorry for the complexity in the answer, I was trying to keep it simple, but failed).
Link:
MySQL indexes - what are the best practices?
Pros:
Faster lookups. This is all about reducing the number of disk I/Os. Instead of scanning the entire table for the results, you reduce the number of disk I/Os (page fetches) by using index structures such as B-trees or hash indexes to get to your data faster.
Cons:
Slower writes (potentially). Not only do you have to write your data to your tables, you also have to write to your indexes. This may cause the system to restructure the index (hash index, B-tree, etc.), which can be very computationally expensive.
Takes up more disk space, naturally. You are storing more data.
The easiest way to think about an index is to think about a dictionary. It has words and it has definitions corresponding to those words. The dictionary has an index on "word" because when you go to a dictionary you want to look up a word quickly, then get its definition. A dictionary usually contains just one index - an index by word.
A database is analogous. When you have a bunch of data in the database, you will have certain ways that you want to get it out. Let's say you have a User table and you often look up a user by the FirstName column. Since this is an operation that you are doing often in your application, you should consider using an index on this column. That will create a structure in the database that is sorted, if you will, by that column, so that looking up something by first name is like looking up a word in a dictionary. If you didn't have this index you might need to look at ALL rows before you determine which ones have a specific FirstName. By adding an index, you have made this fast.
So why not put an index on all columns and make them all fast? Like everything, there is a trade off. Every time you insert a row into the table User, the database will need to perform its magic and sort everything on your indexed column. This can be expensive.
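As an illustrative sketch (the User table and FirstName column come from the example above; the index name and the value 'Alice' are made up):

CREATE INDEX idx_user_firstname ON User (FirstName);

SELECT * FROM User WHERE FirstName = 'Alice';  -- can now seek the index instead of scanning all rows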
You don't have to have a primary key. Indexes (of any type) are used to speed up queries and, at least with the InnoDB engine, enforce foreign key constraints. Whether you use a unique or plain (non-unique) index depends on whether you want to allow duplicate values in the key.
This is a general database concept, you might use external resources to read about it, like http://beginner-sql-tutorial.com/sql-index.htm or http://en.wikipedia.org/wiki/Index_(database)
An index allows MySQL to find data quicker. You use them on columns that you'll be using in WHERE clauses. For example, if you have a column named score and want to find everything with WHERE score > 5, by default MySQL will need to scan through the WHOLE table to find those scores. With a BTREE index, however, finding the rows that meet that condition happens a LOT faster.
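A minimal sketch of that case (the table name games is a placeholder; the score column is from the example):

CREATE INDEX idx_score ON games (score);  -- a BTREE index, the default type
SELECT * FROM games WHERE score > 5;      -- now a range scan over the index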
Indices have a price: disk and memory space. If it's a very big table, your index will grow rather large.
Think of it this way: what are the biggest benefits of having an index in a book? It's much the same thing. You have a slightly larger book, yet you're able to quickly look things up. When you create an index on a column, you're saying you want to be able to reference it in a where clause to look it up quickly.