Do categorical fields need indexing? (MySQL or MongoDB) - mysql

For a table(let's say 'food'), there is a column 'type' with potential value [1,2,3,4] that specifies the type of that entry (e.g. fruit). As I expect selection like
SELECT name FROM food WHERE type = 3 ;
would be called most often, I wonder would an index be recommended in this case. Since there is only a few values possible for that field I wonder if the index would be useful.(Similarly for MongoDB?)

An index on such a field is likely not to be useful in MySQL. Actually, such an index could make most queries worse.
There is a case where an index will always be faster. This is a query that only uses columns in the index, such as:
select count(type)
from food
where type = 3;
This is faster because reading the index should be faster than reading the table, because the data is smaller (presumably, you could include all columns in the index).
In other cases, MySQL uses an index for a table when it is available.
The question you are asking is about the "selectivity" of an index. Consider your query:
SELECT name
FROM food
WHERE type = 3 ;
If all rows have type = 3, then you have to read all the matching records anyway (to get the value of name). If there is one record per page, then the index is probably helping you, because it reduces the number of page reads. A more realistic situation would have a page contain 100 records. Then, if 25% of the records have the same type, a typical page would have 25 such records on it. Basically, every page still has to be read. The question is whether the pages are read sequentially (a "full table scan") or through the index.
There is a difference between these two ways of reading a table. In the full table scan, the pages are read sequentially and once a page is read, it is not accessed again. In the index read, the pages are read randomly, one record at a time, and a page could be read multiple times. In the extreme case, the pages don't fit in the page cache and the same page is flushed to disk and read again and again for each record on the page. Highly inefficient.
You can make this query more efficient by having an index on type, name.
So, the answer to your question is to be careful about indexes, especially with large tables. When you do have an index on a categorical column, make it a composite index so your queries can be satisfied only using the index and not having to go back to the data pages.

Having the index is unlikely to help, but you should test it with your queries and your data. If the column has few distinct values queries will return a sizeable portion of the table's rows, and reading the index is equivalent to a full table scan. In fact, a full table scan might even be faster than reading the index.
If the type of the row is used in other queries it may help to have the type as a part of a multi column index.

Related

Indexing mysql table for selects

I'm looking to add some mysql indexes to a database table. Most of the queries are for selects. Would it be best to create a separate index for each column, or add an index for province, province and number, and name. Does it even make sense to index province since there are only about a dozen options?
select * from employees where province = 'ab' and number = 'v45g';
select * from employees where province = 'ab';
If the usage changed to more inserts should I remove all the indexes except for the number?
An index is a data structure that maps the values of a column into a fast searchable tree. This tree contains the index of rows which the DB can use to find rows fast. One thing to know, some DB engines read plus or minus a bunch of rows to take advantage of disk read ahead. So you may actually read 50 or 100 rows per index read, and not just one. Hence, if you access 30% of a table through an index, you may wind up reading all table data multiple times.
Rule of thumb:
- index the more unique values, a tree with 2 branches and half of your table on either side is not too useful for narrowing down a search
- use as few index as possible
- use real world examples numbers as much as possible. Performance can change dynamically based on data or the whim of the DB engine, so it's very important to try and track how fast your queries are running consistently (ie: log this in case a query ever gets slow). But from this data you can add indexes without being blind
Okay, so there are multiple kinds of index, single and multiple column. You want multiple indexes when it makes sense for indexes to access each other, multiple columns typically when you are refining with a where clause. Think of the first as good when you want joins, or you have "or" conditions. The second is better when you have and conditions and successively filter rows.
In your case name does not make sense since like does not use index. city and number do make sense, probably as a multi-column index. Province could help as well as the last index.
So an index with these columns would likely help:
(number,city,province)
Or try as well just:
(number,city)
You should index fields that are searched upon and have high selectivity / cardinality. Indexes make writes slower.
Other thing is that indexes can be added and dropped at any time so maybe you should let this for a later review of the database and optimization of querys.
That being said one index that you can be sure to add is in the column that holds the name of a person. That's almost always used in searching.
According to MySQL documentation found here:
You can create multiple column indexes and the first column mentioned in the index declaration uses index when searched alone but not the others.
Documentation also says that if you create a hash of the columns and save in another column and index the hashed column the search could be faster then multiple indexes.
SELECT * FROM tbl_name
WHERE hash_col=MD5(CONCAT(val1,val2))
AND col1=val1 AND col2=val2;
You could use an unique index on province.

MySQL performance with index clairification

Say I have a mysql table with an index on the column 'name':
I do this query:
select * from name_table where name = 'John';
Say there are 5 results that are returned from a table with 100 rows.
Say I now insert 1 million new rows, non that have a name John, so there are still only 5 Johns in the table. Will the select statement be as fast as previously, so will inserting all these rows have an impact on the read speed of an indexed table?
Indexes have their own "tables", and when the MySQL engine determines that the lookup references an indexed column, the lookup happens on this table. It isn't really a table per-se, but the gist checks out.
That said, it will be nanoseconds slower, but not something you should concern yourself with.
More importantly, concern youself with indexing pertinent data, and column order, as these have MUCH more of an impact on database performance.
To learn more about what is happening behind the scenes, query the EXPLAIN:
EXPLAIN select * from name_table where name = 'John';
Note: In addition to the column orders listed in the link, it is a good (nay, great) idea to have variable length columns (VARCHAR) after their fixed-length counterparts (CHAR) as, durring the lookup, the engine has to either look at the row, read the column lengths, then skip forward for the lookup (mind you, this is only for non-indexed columns), or read the table declairation and know it always has to look at the column with the offset X. It is more complicated behind the scenes, but if you can shift all fixed-length columns to the front, you will thank yourself. Basically:
Indexed columns.
Everything Fixed-Length in order according to the link.
Everything Variable-Length in order according to the link.
Yes, it will be just as fast.
(In addition to the excellent points made Mike's answer...) there's an important point we should make regarding indexes (B-tree indexes in particular):
The entries in the index are stored "in order".
The index is also organized in a way that allows the database to very quickly identify the blocks in the index that contain the entries it's looking for (or the block that would contain entries, if no matching entries are there.)
What this means is that the database doesn't need to look at every entry in the index. Given a predicate like the one in your question:
WHERE name = 'John'
with an index with a leading column of name, the database can eliminate vast swaths of blocks that don't need to be checked.
Blocks near the beginning of the index contain entries 'Adrian' thru 'Anna', a little later in the index, a block contains entries for Caleb thru Carl, further long in the index James thru Jane, etc.
Because of the way the index is organized, the database effectively "knows" that the entries we're looking for cannot be in any of those blocks (because the index is in order, there's no way the value John could appear in those blocks we mentioned). So none of those blocks needs to be checked. (The database figures out in just a very small number of operations, that 98% of the blocks in the index can be eliminated from consideration.
High cardinality = good performance
The take away from this is that indexes are most effective on columns that have high cardinality. That is, there are a large number of distinct values in the column, and those values are unique or nearly unique.
This should clear up the answer to the question you were asking. You can add brazilians of rows to the table. If only five of those rows have a value of
John in the name column, when you do a query
WHERE name = `John`
it will be just as fast. The database will be able to locate the entries your looking for nearly as fast as it can when you had a thousand rows in the table.
(As the index grows larger, it does add "levels" to the index, to traverse down to the leaf nodes... so, it gets ever so slightly slower because of a tiny few more operations. Where performance really starts to bog down is when the InnoDB buffer cache is too small, and we have to wait for the (glacially slow in comparison) disk io operations to fetch blocks into memory.
Low cardinality = poor performance
Indexes on columns with low cardinality are much less effective. For example, a column that has two possible values, with an even distribution of values across the rows in the table (about half of the rows have one value, and the other half have the other value.) In this case, the database can't eliminate 98% of the blocks, or 90% of the blocks. The database has to slog through half the blocks in the index, and then (usually) perform a lookup to the pages in the underlying table to get the other values for the row.
But with gazillions of rows with a column gender, with two values 'M' and 'F', an index with gender as a leading column will not be effective in satisfying a query
WHERE gender = 'M'
... because we're effectively telling the database to retrieve half the rows in the table, and it's likely those rows are going to be evenly distributed in the table. So nearly every page in the table is going to contain at least one row we need, the database is going to opt to do a full table scan (to look at every row in every block in the table) to locate the rows, rather than using an index.
So, in terms of performance for looking up rows in the table using an index... the size of the table isn't really an issue. The real issue is the cardinality of the values in the index, and how many distinct values we're looking for, and how many rows need to be returned.

Is there a benefit to index mysql column if I always have different value in every row?

Question is for rows like timestamp, where always different value stored in every row.
I'm already search through stackoverflow and read about indexes, but I don't understand profit if no one value equals to another. So, index cardinality will be equal to number of rows. What the profit?
This kind of column would actually be an excellent candidate for an index, preferably a unique one.
Tables are unsorted sets of data, so without any knowledge about the table, the database will have to go over the entire table sequentially to find the rows you're looking for (O(n) complexity, where n is the number of rows).
An index is, essentially a tree that stores values in a sorted way, which allows the database to intelligently find the rows you're looking for (O(log n)). In addition, making the index unique tell the database there can be only one row per timestamp value, so once a single row is retrieved the database can stop searching for more.
The performance benefit for such an index, assuming you search for rows according to timestamps, should be significant.
An index is a map between key values and retrieval pointers. The DBMS uses an index during a query if a strategy that uses the index appears to be optimal.
If the index never gets used, then it is useless.
Indexes can speed up lookups based on a single keyed value, or based on a range of key values (depending on the index type), or by allowing index only retrieval in cases where only the key is needed for the query. Speed ups can be as low as two for one or as high as a hundred for one, depending on the size of the table and various other factors.
If your timestamp field is never used in the WHERE clause or the ON clause of a query, the chances are you are better off with no index. The art of choosing indexes well goes a lot deeper than this, but this is a start.

Indexing on column with few fixed values but values constitue to less than 25% of total rows

I have a field table_name in a table which can have only 20 different values. The total records in the table is about few tens of thousands of rows. If I do a query like this:
SELECT * FROM table WHERE table_name = 'adasd';
at most the returned records are 25% of the total rows. Mostly I get only 10% of the total records. Is there a scope to index the field table_name here? I hear that for indexes to work well it requires the values in that field to be unique or close to it. In my case, its not at all close to unique. But I also heard that if the returned rows are less in number compared to total number of rows, it makes a good case for indexing.
How should I go about this?
No they don't have to be unique to get a benefit from using indexes, however take some time to think about what the DBMS does when processing a query:
Full table scan - a sequential read through the data (i.e. very few seek operations)
Index lookup - a few seeks on the index to find the start of the selected data, then a sequential read (few seeks) to identify rows in the underlying table, then LOTS AND LOTS of seeks to fetch the rows from the table
Seeks are expensive.
(there is a secondary effect of full table scans in that they are more prone to flushing hot data out of the cache - but you should address the primary concern first).
In this case, it's unlikely that the DBMS would use the index if it were present, and even if it did, it would probably be slower than a full table scan. As a (very) rough rule of thumb, you're only going to get a benefit from an index if a predicate identifies less than around 5% of the rows (but it will vary depending on the relative size of the index and the data).
i.e. don't bother adding an index on this field alone.
I think you may benefit from spending some time thinking about why you need to run queries which return so many rows?
Revised Answer
I just learned that creating an index does not mean that MySQL will use it. Keeping that in mind, I will re-phrase my answer:
You should create an index on that column if (general or your own) practices suggest you to do so. MySQL will use heuristics; which include looking at the available indexes and their respective cardinality, to determine the best index to use or not to use an index at all.
Interesting reading about this topic here.

MySQL: low cardinality/selectivity columns = how to index?

I need to add indexes to my table (columns) and stumbled across this post:
How many database indexes is too many?
Quote:
“Having said that, you can clearly add a lot of pointless indexes to a table that won't do anything. Adding B-Tree indexes to a column with 2 distinct values will be pointless since it doesn't add anything in terms of looking the data up. The more unique the values in a column, the more it will benefit from an index.”
Is an Index really pointless if there are only two distinct values? Given a table as follows (MySQL Database, InnoDB)
Id (BIGINT)
fullname (VARCHAR)
address (VARCHAR)
status (VARCHAR)
Further conditions:
The Database contains 300 Million records
Status can only be “enabled” and “disabled”
150 Million records have status= enabled and 150 Million records have
stauts= disabled
My understanding is, without having an index on status, a select with where status=’enabled’ would result in a full tablescan with 300 Million Records to process?
How efficient is the lookup when I use a BTREE index on status?
Should I index this column or not?
What alternatives (maybe any other indexes) does MySQL InnoDB provide to efficiently look records up by the "where status="enabled" clause in the given example with a very low cardinality/selectivity of the values?
The index that you describe is pretty much pointless. An index is best used when you need to select a small number of rows in comparison to the total rows.
The reason for this is related to how a database accesses a table. Tables can be assessed either by a full table scan, where each block is read and processed in turn. Or by a rowid or key lookup, where the database has a key/rowid and reads the exact row it requires.
In the case where you use a where clause based on the primary key or another unique index, eg. where id = 1, the database can use the index to get an exact reference to where the row's data is stored. This is clearly more efficient than doing a full table scan and processing every block.
Now back to your example, you have a where clause of where status = 'enabled', the index will return 150m rows and the database will have to read each row in turn using separate small reads. Whereas accessing the table with a full table scan allows the database to make use of more efficient larger reads.
There is a point at which it is better to just do a full table scan rather than use the index. With mysql you can use FORCE INDEX (idx_name) as part of your query to allow comparisons between each table access method.
Reference:
http://dev.mysql.com/doc/refman/5.5/en/how-to-avoid-table-scan.html
I'm sorry to say that I do not agree with Mike. Adding an index is meant to limit the amount of full records searches for MySQL, thereby limiting IO which usually is the bottleneck.
This indexing is not free; you pay for it on inserts/updates when the index has to be updated and in the search itself, as it now needs to load the index file (full text index for 300M records is probably not in memory). So it might well be that you get extra IO in stead of limitting it.
I do agree with the statement that a binary variable is best stored as one, a bool or tinyint, as that decreases the length of a row and can thereby limit disk IO, also comparisons on numbers are faster.
If you need speed and you seldom use the disabled records, you may wish to have 2 tables, one for enabled and one for disabled records and move the records when the status changes. As it increases complexity and risk this would be my very last choice of course. Definitely do the move in 1 transaction if you happen to go for it.
It just popped into my head that you can check wether an index is actually used by using the explain statement. That should show you how MySQL is optimizing the query. I don't really know hoe MySQL optimizes queries, but from postgresql I do know that you should explain a query on a database approximately the same (in size and data) as the real database. So if you have a copy on the database, create an index on the table and see wether it's actually used. As I said, I doubt it, but I most definitely don't know everything:)
If the data is distributed like 50:50 then query like where status="enabled" will avoid half scanning of the table.
Having index on such tables is completely depends on distribution of data, i,e : if entries having status enabled is 90% and other is 10%. and for query where status="disabled" it scans only 10% of the table.
so having index on such columns depends on distribution of data.
#a'r answer is correct, however it needs to be pointed out that the usefulness of an index is given not only by its cardinality but also by the distribution of data and the queries run on the database.
In OP's case, with 150M records having status='enabled' and 150M having status='disabled', the index is unnecessary and a waste of resource.
In case of 299M records having status='enabled' and 1M having status='disabled', the index is useful (and will be used) in queries of type SELECT ... where status='disabled'.
Queries of type SELECT ... where status='enabled' will still run with a full table scan.
You will hardly need all 150 mln records at once, so I guess "status" will always be used in conjunction with other columns. Perhaps it'd make more sense to use a compound index like (status, fullname)
Jan, you should definitely index that column. I'm not sure of the context of the quote, but everything you said above is correct. Without an index on that column, you are most certainly doing a table scan on 300M rows, which is about the worst you can do for that data.
Jan, as asked, where your query involves simply "where status=enabled" without some other limiting factor, an index on that column apparently won't help (glad to SO community showed me what's up). If however, there is a limiting factor, such as "limit 10" an index may help. Also, remember that indexes are also used in group by and order by optimizations. If you are doing "select count(*),status from table group by status", an index would be helpful.
You should also consider converting status to a tinyint where 0 would represent disabled and 1 would be enabled. You're wasting tons of space storing that string vs. a tinyint which only requires 1 byte per row!
I have a similar column in my MySQL database. Approximately 4 million rows, with the distribution of 90% 1 and 10% 0.
I've just discovered today that my queries (where column = 1) actually run significantly faster WITHOUT the index.
Foolishly I deleted the index. I say foolishly, because I now suspect the queries (where column = 0) may have still benefited from it. So, instead I should explicitly tell MySQL to ignore the index when I'm searching for 1, and to use it when I'm searching for 0. Maybe.