I have a MySQL table where an indexed INT column is going to be 0 for 90% of the rows. If I change those rows to use NULL instead of 0, will they be left out of the index, making the index about 90% smaller?
http://dev.mysql.com/doc/refman/5.0/en/is-null-optimization.html
MySQL can perform the same optimization on col_name IS NULL that it can use for col_name = constant_value. For example, MySQL can use indexes and ranges to search for NULL with IS NULL.
It looks like it does index the NULLs too.
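If you want to verify this on your own data, EXPLAIN makes it visible. Here is a minimal sketch; the table and column names are made up for illustration:

-- Hypothetical table: an indexed INT column that is mostly NULL.
CREATE TABLE t (
  id  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  val INT NULL,
  KEY idx_val (val)
) ENGINE=InnoDB;

-- Both predicates show the same ref access on idx_val in EXPLAIN,
-- i.e. the NULL rows are stored in the index and searchable with IS NULL.
EXPLAIN SELECT id FROM t WHERE val = 0;
EXPLAIN SELECT id FROM t WHERE val IS NULL;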
Be careful when you run this because MySQL will LOCK the table for WRITES during the index creation. Building the index can take a while on large tables even if the column is empty (all nulls).
Reference.
Allowing a column to be null will add a byte to the storage requirements of the column. This will lead to an increased index size, which is probably not good. That said, if a lot of your queries are changed to use "IS NULL" or "IS NOT NULL", they might be faster overall than doing value comparisons.
My gut would tell me not null, but there's one answer: test!
No, it will continue to include them, but don't make too many assumptions about what the consequences are in either case. A lot depends on the range of other values (google for "cardinality").
MSSQL has a new index type called a "filtered index" for this type of situation (i.e. includes records in the index based on a filter). dBASE-type systems used to have a similar capability, and it was pretty handy.
Each index has a cardinality, meaning how many distinct values are indexed. AFAIK it is not accurate to say the index repeats the same value once for every row; rather, the index points a repeated value at the clustered-index (primary key) entries of the many rows that hold it (the rows having a NULL value for this field). Keeping that reference to the clustered index means each row with a NULL value in the indexed field still costs as much space as the PK (which is why experts recommend keeping the PK reasonably small, especially if you have a composite PK).
Related
I have the following problem when running a mysql query:
The query is very slow, and when I use EXPLAIN the key is NULL even though possible_keys are available and their order is correct. I also tried adding independent indexes on each column, but the key was still NULL.
You can see table, index and mysql explain here: https://snag.gy/vcChl6.jpg
The optimizer likely has just decided that there is no reason to use the index.
Since you are using SELECT *, that means that if it used the index, it would have to take the primary key from the index and then go back and look up all the necessary data from the clustered index. That is referred to as a double lookup, and is generally bad for performance. As there are so few records in this table, the optimizer likely decided that it can easily do a full table scan instead and get your result faster.
In short, this is expected behavior.
If you want to SELECT just some columns, add them to the t1 index and then just SELECT only the columns you need, with that given WHERE clause. It should use the index then. As your table grows in size, it may start using the index as well, once it estimates that the double lookup is cheaper than the full table scan.
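As a hedged sketch of that suggestion, assuming the table is t1 as in the screenshot; the index name and the extra date_created column are placeholders:

-- Extend the composite index so it also carries the columns the query reads;
-- the query can then be answered from the index alone (no double lookup).
ALTER TABLE t1 ADD INDEX idx_cover (id_project, id_lang, date_created);

SELECT id_project, id_lang, date_created
FROM   t1
WHERE  id_project = 1 AND id_lang = 2;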
A guess: Most rows are of that 'project' and that 'lang'.
The Optimizer does not understand that fact, so it takes the index that is obviously the best:
(id_project, id_lang)
This one would be equally good: (id_lang, id_project).
No fair... The EXPLAIN mentions indexes named id_project and id_lang (not useful), but the list of indexes shows a composite index t1(id_project, id_lang) (useful).
Then, as Willem suggests, it has to bounce between the index and the table. Normally (that is, when it has adequate statistics), the Optimizer will say "Oh, more than ~20% of the table is being referenced; let's ignore any index."
Things you can do:
Get rid of that index.
Change * to a list of just the columns you need (see the sketch after this list). In particular, if you avoid the 3 TEXT columns, two optimizations kick in. Alternatively, any of them that will never be longer than 255 characters can be changed to VARCHAR(255).
Use some other filtering, ordering, limiting, etc. If this is a web application, do you really want to get ~534 rows?
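A rough sketch of the first two suggestions; the single-column index name and the title column are assumptions, since the schema can only be guessed from the screenshot:

-- Drop a redundant single-column index that the composite index already covers.
ALTER TABLE t1 DROP INDEX id_project;

-- Name the columns instead of SELECT * so the TEXT columns stay out of the result.
SELECT id_project, id_lang, title
FROM   t1
WHERE  id_project = 1 AND id_lang = 2
LIMIT  10;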
What are the differences between PRIMARY, UNIQUE, INDEX and FULLTEXT when creating MySQL tables?
How would I use them?
Differences
KEY or INDEX refers to a normal non-unique index. Non-distinct values for the index are allowed, so the index may contain rows with identical values in all columns of the index. These indexes don't enforce any constraints on your data, so they are used only for access - for quickly reaching certain ranges of records without scanning all records.
UNIQUE refers to an index where all rows of the index must be unique. That is, a row may not have identical non-NULL values for all columns in this index as another row. As well as being used to quickly reach certain record ranges, UNIQUE indexes can be used to enforce constraints on data, because the database system does not allow the distinct-values rule to be broken when inserting or updating data.
Your database system may allow a UNIQUE index to be applied to columns which allow NULL values, in which case two rows are allowed to be identical if they both contain a NULL value (the rationale here is that NULL is considered not equal to itself). Depending on your application, however, you may find this undesirable: if you wish to prevent this, you should disallow NULL values in the relevant columns.
PRIMARY acts exactly like a UNIQUE index, except that it is always named 'PRIMARY', and there may be only one on a table (and there should always be one; though some database systems don't enforce this). A PRIMARY index is intended as a primary means to uniquely identify any row in the table, so unlike UNIQUE it should not be used on any columns which allow NULL values. Your PRIMARY index should be on the smallest number of columns that are sufficient to uniquely identify a row. Often, this is just one column containing a unique auto-incremented number, but if there is anything else that can uniquely identify a row, such as "countrycode" in a list of countries, you can use that instead.
Some database systems (such as MySQL's InnoDB) will internally store a table's actual records within the PRIMARY KEY's B-tree index.
FULLTEXT indexes are different from all of the above, and their behaviour differs significantly between database systems. FULLTEXT indexes are only useful for full text searches done with the MATCH() / AGAINST() clause, unlike the above three - which are typically implemented internally using b-trees (allowing for selecting, sorting or ranges starting from the leftmost column) or hash tables (allowing for selection starting from the leftmost column).
Where the other index types are general-purpose, a FULLTEXT index is specialised, in that it serves a narrow purpose: it's only used for a "full text search" feature.
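To make the four kinds concrete, here is a minimal table that declares one of each; the names are invented for illustration, and FULLTEXT on InnoDB needs MySQL 5.6.4+:

CREATE TABLE articles (
  id     INT UNSIGNED NOT NULL AUTO_INCREMENT,
  slug   VARCHAR(100) NOT NULL,
  author INT UNSIGNED NOT NULL,
  body   TEXT,
  PRIMARY KEY (id),            -- PRIMARY: unique, NOT NULL, only one per table
  UNIQUE KEY uq_slug (slug),   -- UNIQUE: enforces distinct slugs
  KEY idx_author (author),     -- KEY/INDEX: plain non-unique lookup index
  FULLTEXT KEY ft_body (body)  -- FULLTEXT: only usable via MATCH ... AGAINST
) ENGINE=InnoDB;

-- A FULLTEXT index is queried with MATCH()/AGAINST(), not with = or LIKE:
SELECT id FROM articles WHERE MATCH(body) AGAINST('indexes');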
Similarities
All of these indexes may have more than one column in them.
With the exception of FULLTEXT, the column order is significant: for the index to be useful in a query, the query must use columns from the index starting from the left - it can't use just the second, third or fourth part of an index, unless it is also using the previous columns in the index to match static values. (For a FULLTEXT index to be useful to a query, the query must use all columns of the index.)
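For example, with a hypothetical composite index on (last_name, first_name):

-- Given: KEY idx_name (last_name, first_name) on table people

-- Can use the index (the leftmost column is constrained):
SELECT * FROM people WHERE last_name = 'Smith';
SELECT * FROM people WHERE last_name = 'Smith' AND first_name = 'Anna';

-- Cannot use the index effectively (the leftmost column is skipped):
SELECT * FROM people WHERE first_name = 'Anna';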
All of these are kinds of indices.
primary: must be unique, is an index, is (likely) the physical index, can be only one per table.
unique: as it says. You can't have more than one row with a tuple of this value. Note that since a unique key can be over more than one column, this doesn't necessarily mean that each individual column in the index is unique, but that each combination of values across these columns is unique.
index: if it's not primary or unique, it doesn't constrain values inserted into the table, but it does allow them to be looked up more efficiently.
fulltext: a more specialized form of indexing that allows full text search. Think of it as (essentially) creating an "index" for each "word" in the specified column.
I feel like this has been well covered, maybe except for the following:
A simple KEY / INDEX (also called a SECONDARY INDEX) does increase performance if selectivity is sufficient. On this matter, the usual recommendation is that if the result set matched through the index exceeds 20% of the total number of records in the parent table, the index will be ineffective. In practice every architecture differs, but the idea is still correct.
Secondary indexes (and this is very specific to MySQL) should not be seen as completely separate and different objects from the primary key. In fact, the two should be used jointly and, once this is known, it gives the MySQL DBA an additional tool: in MySQL, secondary indexes embed the primary key. This leads to significant performance improvements, specifically when cleverly building implicit covering indexes such as described there.
If you feel like your data should be UNIQUE, use a unique index. You may think it's optional (for instance, working it out at application level) and that a normal index will do, but it actually represents a guarantee for Mysql that each row is unique, which incidentally provides a performance benefit.
You can only use FULLTEXT (otherwise called a SEARCH INDEX) with the InnoDB (in MySQL 5.6.4 and up) and MyISAM engines
You can only use FULLTEXT on CHAR, VARCHAR and TEXT column types
FULLTEXT index involves a LOT more than just creating an index. There's a bunch of system tables created, a completely separate caching system and some specific rules and optimizations applied. See http://dev.mysql.com/doc/refman/5.7/en/fulltext-restrictions.html and http://dev.mysql.com/doc/refman/5.7/en/innodb-fulltext-index.html
Does the performance of a MySQL index depend on the column's datatype?
If it does, what are the best datatypes to use in such cases?
Also, is it important to keep an index on a column which has few duplicate values, or will it give the same performance even when applied to a column that has an enormous number of duplicate values?
Thanks in advance!
Indexes aside, the best datatypes are the smallest ones to meet the domain requirements.
Column datatypes do affect the performance of indexes on those columns.
Smaller is better.
Indexes on integer columns work best (smaller, faster).
In a particular instance, the optimiser may choose not to use an index if it is not selective enough (i.e. the column has many repeated values, and the value searched for, as determined from statistics, would match 'too many' rows).
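A small illustration of the "smallest datatype that fits" point; the table and its value encoding are invented:

-- A TINYINT status (1 byte) indexes far more compactly than a VARCHAR(20)
-- (up to 21 bytes per entry, plus character-set comparison work).
CREATE TABLE orders (
  id     BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  status TINYINT UNSIGNED NOT NULL,   -- e.g. 0 = new, 1 = paid, 2 = shipped
  KEY idx_status (status)
) ENGINE=InnoDB;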
How MySQL Uses Indexes
Data Type Storage Requirements
From the link #nani1216 posted:
The size of the column or columns you're indexing is relevant because the less data the database server has to search through or index, the faster it'll be and the less storage you'll use on disk [or in memory].
Yes the performance varies with the use of datatypes.
Indexing on integer datatype gives you more performance than indexing on char or varchar datatypes.
Have a look at What makes a good MySQL index?
As others pointed out, a smaller data type is better, for a simple reason - you can have more records loaded in memory if each record is smaller.
In InnoDB the primary key is appended to every secondary index, so a small primary key is extremely important for InnoDB tables.
As for the "selectivity" of index, the optimizer may choose not to use the index at all and do a table scan if the value match more than 50% of the records of the index (this is because, if the engine is InnoDB, the index is secondary and it's not covering index, it has to hit the disc once to read the index, and second time to read the actual row data, so it's better to read the table data directly. So you don't get much performance from an index on a column that contains largely the same values (but if small amount of the values are for example 0 and you are querying exactly for those rows, a index may be good!)
If a column is of type int, does it still need to be indexed to make select query run faster?
SELECT *
FROM MyTable
WHERE intCol = 100;
Probably yes, unless
The table has very few rows (<1000)
The int column has poor selectivity
a large proportion of the table is being returned (say > 1%)
In which case, a table scan might make more sense anyway, and the optimiser may choose to do one. In most cases having an index anyway is not very harmful, but you should definitely try it and see (in your lab, on production-grade hardware with a production-like data set)
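A minimal way to run that test, reusing the table from the question (the index name is arbitrary):

-- Add the index, then let EXPLAIN show whether the optimiser actually picks it.
ALTER TABLE MyTable ADD INDEX idx_intCol (intCol);

EXPLAIN SELECT * FROM MyTable WHERE intCol = 100;
-- key = idx_intCol          -> the index is used
-- key = NULL and type = ALL -> the optimiser preferred a full table scan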
Yes, an index on any column can make the query perform faster regardless of data type. The data itself is what matters -- no point in using an index if there are only two values currently in the system.
Also be aware that:
the presence of an index doesn't ensure it will be used -- table statistics need to be current, but it really depends on the query.
MySQL also generally uses only one index per table in a SELECT, and has a limited amount of space for indexes (the limit depends on the storage engine).
Yes, it does. It does not matter which data type a column has: if you don't create an index, MySQL has to scan the whole table any time you search for a value in that column.
Yes; that is the whole point of indexes. You create them via the primary key or by adding an index manually. The type of the column does not by itself determine query speed.
Yes, it has to be indexed. It doesn't make sense to index tinyint columns if they are used as boolean, because this index won't be selective enough.
I need to add indexes to my table (columns) and stumbled across this post:
How many database indexes is too many?
Quote:
“Having said that, you can clearly add a lot of pointless indexes to a table that won't do anything. Adding B-Tree indexes to a column with 2 distinct values will be pointless since it doesn't add anything in terms of looking the data up. The more unique the values in a column, the more it will benefit from an index.”
Is an Index really pointless if there are only two distinct values? Given a table as follows (MySQL Database, InnoDB)
Id (BIGINT)
fullname (VARCHAR)
address (VARCHAR)
status (VARCHAR)
Further conditions:
The Database contains 300 Million records
Status can only be “enabled” and “disabled”
150 Million records have status = enabled and 150 Million records have status = disabled
My understanding is, without having an index on status, a select with where status=’enabled’ would result in a full tablescan with 300 Million Records to process?
How efficient is the lookup when I use a BTREE index on status?
Should I index this column or not?
What alternatives (maybe any other indexes) does MySQL InnoDB provide to efficiently look records up by the "where status="enabled" clause in the given example with a very low cardinality/selectivity of the values?
The index that you describe is pretty much pointless. An index is best used when you need to select a small number of rows in comparison to the total rows.
The reason for this is related to how a database accesses a table. Tables can be accessed either by a full table scan, where each block is read and processed in turn, or by a rowid or key lookup, where the database has a key/rowid and reads exactly the row it requires.
In the case where you use a where clause based on the primary key or another unique index, eg. where id = 1, the database can use the index to get an exact reference to where the row's data is stored. This is clearly more efficient than doing a full table scan and processing every block.
Now back to your example: with a where clause of where status = 'enabled', the index will return 150M rows and the database will have to read each row in turn using separate small reads, whereas accessing the table with a full table scan allows the database to make use of more efficient, larger reads.
There is a point at which it is better to just do a full table scan rather than use the index. With mysql you can use FORCE INDEX (idx_name) as part of your query to allow comparisons between each table access method.
Reference:
http://dev.mysql.com/doc/refman/5.5/en/how-to-avoid-table-scan.html
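A sketch of that comparison; the table name and the idx_status index name are assumptions:

-- Let the optimiser choose its own plan (at 50% selectivity, likely a table scan):
EXPLAIN SELECT * FROM mytable WHERE status = 'enabled';

-- Force the index and compare the plan (and the real timings) against the above:
EXPLAIN SELECT * FROM mytable FORCE INDEX (idx_status) WHERE status = 'enabled';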
I'm sorry to say that I do not agree with Mike. Adding an index is meant to limit the number of full-record searches MySQL has to do, thereby limiting IO, which is usually the bottleneck.
This indexing is not free; you pay for it on inserts/updates when the index has to be updated, and in the search itself, as it now needs to load the index file (a full index for 300M records is probably not in memory). So it might well be that you get extra IO instead of limiting it.
I do agree with the statement that a binary variable is best stored as one, i.e. a bool or tinyint, as that decreases the length of a row and can thereby limit disk IO; comparisons on numbers are also faster.
If you need speed and you seldom use the disabled records, you may wish to have 2 tables, one for enabled and one for disabled records and move the records when the status changes. As it increases complexity and risk this would be my very last choice of course. Definitely do the move in 1 transaction if you happen to go for it.
It just popped into my head that you can check whether an index is actually used by using the EXPLAIN statement. That should show you how MySQL is optimizing the query. I don't really know how MySQL optimizes queries, but from PostgreSQL I do know that you should explain a query on a database that is approximately the same (in size and data) as the real one. So if you have a copy of the database, create an index on the table and see whether it's actually used. As I said, I doubt it, but I most definitely don't know everything :)
If the data is distributed roughly 50:50, a query like where status="enabled" can at best avoid scanning half of the table.
Whether having an index on such a column helps depends entirely on the distribution of the data: if 90% of the entries have status enabled and 10% disabled, then a query for where status="disabled" scans only 10% of the table.
So having an index on such columns depends on the distribution of the data.
#a'r's answer is correct; however, it needs to be pointed out that the usefulness of an index is determined not only by its cardinality but also by the distribution of the data and the queries run on the database.
In the OP's case, with 150M records having status='enabled' and 150M having status='disabled', the index is unnecessary and a waste of resources.
In case of 299M records having status='enabled' and 1M having status='disabled', the index is useful (and will be used) in queries of type SELECT ... where status='disabled'.
Queries of type SELECT ... where status='enabled' will still run with a full table scan.
You will hardly ever need all 150 million records at once, so I guess "status" will always be used in conjunction with other columns. Perhaps it would make more sense to use a compound index like (status, fullname).
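A sketch of that idea; the table name, index name, and the extra predicate are assumptions:

-- status alone filters poorly, but as the leading column of a compound index
-- it lets the second column narrow the search much further.
ALTER TABLE mytable ADD INDEX idx_status_fullname (status, fullname);

SELECT id, fullname
FROM   mytable
WHERE  status = 'enabled' AND fullname LIKE 'Smith%';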
Jan, you should definitely index that column. I'm not sure of the context of the quote, but everything you said above is correct. Without an index on that column, you are most certainly doing a table scan on 300M rows, which is about the worst you can do for that data.
Jan, as asked, where your query involves simply "where status=enabled" without some other limiting factor, an index on that column apparently won't help (glad the SO community showed me what's up). If, however, there is a limiting factor such as "limit 10", an index may help. Also, remember that indexes are also used in GROUP BY and ORDER BY optimizations. If you are doing "select count(*), status from table group by status", an index would be helpful.
You should also consider converting status to a tinyint where 0 would represent disabled and 1 would be enabled. You're wasting tons of space storing that string vs. a tinyint which only requires 1 byte per row!
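One possible way to do that conversion, as a sketch (table name assumed; on 300M rows each step will take a long time and should be planned carefully):

-- 1. Add the compact column.
ALTER TABLE mytable ADD COLUMN status_flag TINYINT UNSIGNED NOT NULL DEFAULT 0;

-- 2. Backfill it from the old VARCHAR column.
UPDATE mytable SET status_flag = IF(status = 'enabled', 1, 0);

-- 3. Drop the old column and rename the new one into its place.
ALTER TABLE mytable
  DROP COLUMN status,
  CHANGE COLUMN status_flag status TINYINT UNSIGNED NOT NULL;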
I have a similar column in my MySQL database. Approximately 4 million rows, with the distribution of 90% 1 and 10% 0.
I've just discovered today that my queries (where column = 1) actually run significantly faster WITHOUT the index.
Foolishly I deleted the index. I say foolishly, because I now suspect the queries (where column = 0) may have still benefited from it. So, instead I should explicitly tell MySQL to ignore the index when I'm searching for 1, and to use it when I'm searching for 0. Maybe.
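A sketch of steering the optimiser per query instead of dropping the index; the table, column, and index names are placeholders:

-- Searching for the common value (90% of rows): skip the index, a scan is cheaper.
SELECT * FROM mytable IGNORE INDEX (idx_col) WHERE col = 1;

-- Searching for the rare value (10% of rows): the index is worth using.
SELECT * FROM mytable FORCE INDEX (idx_col) WHERE col = 0;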