MySQL: hash index vs. table join - mysql

I'm have pretty big MySQL table(more than 10 millions of rows, innoDB engine), the table has a field that indicate a row's category(varchar(40)), the categories are less than 10.
Now I have two choices:
keep the field and make a hash index on it.
make the field into another category table, and link them with a category_id
Which one has a better performance and why with these two operations:
Query for all categories(I know a seperated table could be faster, but does it really faster a lot? even compare to hash index?)
Query for all rows that in a specified category(I assume hash index should be faster, but not sure, cause someone told me MySQL opitimizer will make table join with small table much faster)
EDIT : I almost never add new categories here.

You can define an index on your category column, and it will make some queries for a specific category much faster (assuming the category you search for doesn't occur in a majority of rows). An index on a varchar works well in this way.
The reason you might create a lookup table for the category name is that if you want to change a category name, you can do that by changing one row in the category lookup table, instead of potentially many thousands of rows in the main table.
By the way, your use of the phrase "hash index" is misplaced. InnoDB does not support hash indexes, only B-tree indexes and fulltext indexes.

Considering that for any DB it is faster to check a number (integer) than a string. I believe that the fastest result will be received if you create a X-REF table as you mentioned which converts the strings into a number which is the ID of the big table records, and have this field set as an index.
As stated, you will gain performance by assisting your DB to compare 10M numbers instead of 10M strings.
Also, as Bill Karwin suggests, this will allow you to change/add categories in the most flexible way.
Last, if you don't expect the total number of categories to grow above, say, 2000, you may even make the index field of the big table to be just a two-bytes integer.

Related

Indexing mysql table for selects

I'm looking to add some mysql indexes to a database table. Most of the queries are for selects. Would it be best to create a separate index for each column, or add an index for province, province and number, and name. Does it even make sense to index province since there are only about a dozen options?
select * from employees where province = 'ab' and number = 'v45g';
select * from employees where province = 'ab';
If the usage changed to more inserts should I remove all the indexes except for the number?
An index is a data structure that maps the values of a column into a fast searchable tree. This tree contains the index of rows which the DB can use to find rows fast. One thing to know, some DB engines read plus or minus a bunch of rows to take advantage of disk read ahead. So you may actually read 50 or 100 rows per index read, and not just one. Hence, if you access 30% of a table through an index, you may wind up reading all table data multiple times.
Rule of thumb:
- index the more unique values, a tree with 2 branches and half of your table on either side is not too useful for narrowing down a search
- use as few index as possible
- use real world examples numbers as much as possible. Performance can change dynamically based on data or the whim of the DB engine, so it's very important to try and track how fast your queries are running consistently (ie: log this in case a query ever gets slow). But from this data you can add indexes without being blind
Okay, so there are multiple kinds of index, single and multiple column. You want multiple indexes when it makes sense for indexes to access each other, multiple columns typically when you are refining with a where clause. Think of the first as good when you want joins, or you have "or" conditions. The second is better when you have and conditions and successively filter rows.
In your case name does not make sense since like does not use index. city and number do make sense, probably as a multi-column index. Province could help as well as the last index.
So an index with these columns would likely help:
(number,city,province)
Or try as well just:
(number,city)
You should index fields that are searched upon and have high selectivity / cardinality. Indexes make writes slower.
Other thing is that indexes can be added and dropped at any time so maybe you should let this for a later review of the database and optimization of querys.
That being said one index that you can be sure to add is in the column that holds the name of a person. That's almost always used in searching.
According to MySQL documentation found here:
You can create multiple column indexes and the first column mentioned in the index declaration uses index when searched alone but not the others.
Documentation also says that if you create a hash of the columns and save in another column and index the hashed column the search could be faster then multiple indexes.
SELECT * FROM tbl_name
WHERE hash_col=MD5(CONCAT(val1,val2))
AND col1=val1 AND col2=val2;
You could use an unique index on province.

MySQL performance with index clairification

Say I have a mysql table with an index on the column 'name':
I do this query:
select * from name_table where name = 'John';
Say there are 5 results that are returned from a table with 100 rows.
Say I now insert 1 million new rows, non that have a name John, so there are still only 5 Johns in the table. Will the select statement be as fast as previously, so will inserting all these rows have an impact on the read speed of an indexed table?
Indexes have their own "tables", and when the MySQL engine determines that the lookup references an indexed column, the lookup happens on this table. It isn't really a table per-se, but the gist checks out.
That said, it will be nanoseconds slower, but not something you should concern yourself with.
More importantly, concern youself with indexing pertinent data, and column order, as these have MUCH more of an impact on database performance.
To learn more about what is happening behind the scenes, query the EXPLAIN:
EXPLAIN select * from name_table where name = 'John';
Note: In addition to the column orders listed in the link, it is a good (nay, great) idea to have variable length columns (VARCHAR) after their fixed-length counterparts (CHAR) as, durring the lookup, the engine has to either look at the row, read the column lengths, then skip forward for the lookup (mind you, this is only for non-indexed columns), or read the table declairation and know it always has to look at the column with the offset X. It is more complicated behind the scenes, but if you can shift all fixed-length columns to the front, you will thank yourself. Basically:
Indexed columns.
Everything Fixed-Length in order according to the link.
Everything Variable-Length in order according to the link.
Yes, it will be just as fast.
(In addition to the excellent points made Mike's answer...) there's an important point we should make regarding indexes (B-tree indexes in particular):
The entries in the index are stored "in order".
The index is also organized in a way that allows the database to very quickly identify the blocks in the index that contain the entries it's looking for (or the block that would contain entries, if no matching entries are there.)
What this means is that the database doesn't need to look at every entry in the index. Given a predicate like the one in your question:
WHERE name = 'John'
with an index with a leading column of name, the database can eliminate vast swaths of blocks that don't need to be checked.
Blocks near the beginning of the index contain entries 'Adrian' thru 'Anna', a little later in the index, a block contains entries for Caleb thru Carl, further long in the index James thru Jane, etc.
Because of the way the index is organized, the database effectively "knows" that the entries we're looking for cannot be in any of those blocks (because the index is in order, there's no way the value John could appear in those blocks we mentioned). So none of those blocks needs to be checked. (The database figures out in just a very small number of operations, that 98% of the blocks in the index can be eliminated from consideration.
High cardinality = good performance
The take away from this is that indexes are most effective on columns that have high cardinality. That is, there are a large number of distinct values in the column, and those values are unique or nearly unique.
This should clear up the answer to the question you were asking. You can add brazilians of rows to the table. If only five of those rows have a value of
John in the name column, when you do a query
WHERE name = `John`
it will be just as fast. The database will be able to locate the entries your looking for nearly as fast as it can when you had a thousand rows in the table.
(As the index grows larger, it does add "levels" to the index, to traverse down to the leaf nodes... so, it gets ever so slightly slower because of a tiny few more operations. Where performance really starts to bog down is when the InnoDB buffer cache is too small, and we have to wait for the (glacially slow in comparison) disk io operations to fetch blocks into memory.
Low cardinality = poor performance
Indexes on columns with low cardinality are much less effective. For example, a column that has two possible values, with an even distribution of values across the rows in the table (about half of the rows have one value, and the other half have the other value.) In this case, the database can't eliminate 98% of the blocks, or 90% of the blocks. The database has to slog through half the blocks in the index, and then (usually) perform a lookup to the pages in the underlying table to get the other values for the row.
But with gazillions of rows with a column gender, with two values 'M' and 'F', an index with gender as a leading column will not be effective in satisfying a query
WHERE gender = 'M'
... because we're effectively telling the database to retrieve half the rows in the table, and it's likely those rows are going to be evenly distributed in the table. So nearly every page in the table is going to contain at least one row we need, the database is going to opt to do a full table scan (to look at every row in every block in the table) to locate the rows, rather than using an index.
So, in terms of performance for looking up rows in the table using an index... the size of the table isn't really an issue. The real issue is the cardinality of the values in the index, and how many distinct values we're looking for, and how many rows need to be returned.

Short, single-field indexes or enormous covering indexes in MySQL

I am trying to understand exactly what is and is not useful in a multiple-field index. I have read this existing question (and many more) plus other sites/resources (MySQL Performance Blog, Percona slideshares, etc.) but I'm not totally confident that what I've found on the subject is current and accurate. So please bear with me while I repeat some of what I think I know.
By indexing wisely, I can not only reduce how long it takes to match my query condition(s), but also reduce how long it takes to fetch the fields I want in my query result.
The index is just a sorted, duplicated subset of the full data, paired with pointers (MyISAM) or PKs (InnoDB), that I can search more efficiently than the full table.
Given the above, using an index to match my condition(s) really happens in the same way as fetching my desired result, except I created this special-purpose table (the index) that gets me an intermediate result set really quickly; and with this intermediate result set I can retrieve my final desired result set much more efficiently than by performing a full table scan.
Furthermore, if the index covers all the fields in my query (not just the conditions), instead of an intermediate result set, the index will give me everything I need without having to fetch any rows from the complete table.
InnoDB tables are clustered on the PK, so rows with consecutive PKs are likely to be stored in the same block (given many rows per block), and I can grab a range of rows with consecutive PKs fairly efficiently.
MyISAM tables are not clustered; there is some hidden internal row ordering that has no fixed relation to the PK (or any index), so any time I want to grab a set of rows, I may have to retrieve a different block for every single row - even if these rows have consecutive PKs.
Assuming the above is at least generally accurate, here's my puzzle. I have a slowly changing dimension table defined with the following columns (more or less) and using MyISAM:
dim_owner_ID INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
person_ID INT UNSIGNED NOT NULL,
raw_name VARCHAR(92) NOT NULL,
first VARCHAR(30),
middle VARCHAR(50),
last VARCHAR(30),
suffix CHAR(3),
flag CHAR(1)
Each "owner" is a unique instance of a particular individual with a particular name, so if Sue Smith changes her name to Sue Brown, that results in two rows that are the same except for the last field and the surrogate key. My understanding is that the only way to enforce this constraint internally is to do:
UNIQUE INDEX uq_owner_complete (person_ID, raw_name, first, middle, last, suffix, flag)
And that's basically going to duplicate the entire table (except for the surrogate key).
I also need to index a few other fields for quick joins and searches. While there will be some writes, and disk space is neither free nor infinite, read performance is absolutely the #1 priority here. These smaller indexes should serve very well to cover the conditions of the queries that will be run against the table, but in almost every case, the entire row needs to be selected.
With that in mind:
Is there any reasonable middle ground between sticking with short, single-field indexes (prefix where possible) and expanding every index to cover the entire table?
How would the latter be any different from storing the entire dataset five times on disk, but sorted differently each time?
Is there any benefit to adding the PK/surrogate ID to each of the smaller indexes in the hope that the query optimizer will be able to work some sort of index merge magic?
If this were an InnoDB index, the PK would already be there, but since it's MyISAM it's got pointers to the full rows instead. So if I'm understanding things correctly, there's no point (no pun intended) to adding the PK to any other index, unless doing so would allow the retrieval of the desired result set directly from the index. Which is not likely here.
I understand if it seems like I'm trying too hard to optimize, and maybe I am, but the tasks I need to perform using this database take weeks at a time, so every little bit helps.
You have to understand one concept. An index (either InnoDB or MyiSAM, ether Primary or secondary) is a data structure that's called "B+ tree".
Each node in the B+ tree is a couple (k, v), where k is a key, v is a value. If you build index on last_name your keys will be "Smith", "Johnson", "Kuzminsky" etc.
Value in the index is some data. If the index is the secondary index then the data is a primary key values.
So if you build index on last_name each node will be a couple (last_name, id) e.g. ("Smith", 5).
Primary index is an index where k is primary key and data is all other fields.
Bearing in mind the above let me comment some points:
By indexing wisely, I can not only reduce how long it takes to match my query condition(s), but also reduce how long it takes to fetch the fields I want in my query result.
Not exactly. If your secondary index is good you can quickly find v based on you query condition. E.g. you can quickly find PK by last name.
The index is just a sorted, duplicated subset of the full data, paired with pointers (MyISAM) or PKs (InnoDB), that I can search more efficiently than the full table.
Index is B+tree where each node is a couple of indexed field(s) value(s) and PK.
Given the above, using an index to match my condition(s) really happens in the same way as fetching my desired result, except I created this special-purpose table (the index) that gets me an intermediate result set really quickly; and with this intermediate result set I can retrieve my final desired result set much more efficiently than by performing a full table scan.
Not exactly. If there were no index you'd have to scan whole table and choose only records where last_name = "Smith". But you have index (last_name, PK), so having key "Smith" you can quickly find all PK where last_name = "Smith". And then you can quickly find your full result (because you need not only the last name, but the first name too). So you're right, queries like SELECT * FROM table WHERE last_name = "Smith" are executed in two steps:
Find all PK
By PK find full record.
Furthermore, if the index covers all the fields in my query (not just the conditions), instead of an intermediate result set, the index will give me everything I need without having to fetch any rows from the complete table.
Exactly. If your index is actually (last_name, first_name, id) and your query is SELECT first_name WHERE last_name = "Smith" you don't do the second step. You have the first name in the secondary index, so you don't have to go to the Primary index.
InnoDB tables are clustered on the PK, so rows with consecutive PKs are likely to be stored in the same block (given many rows per block), and I can grab a range of rows with consecutive PKs fairly efficiently.
Right. Two neighbor PK values will most likely be in the same page. Well, except cases when one PK is the last value in a page and next PK value is stored in the next page.
Basically, this is why B+ tree structure was invented. It's not only efficient for search but also efficient in sequential access. And until recently we had rotating hard drives.
MyISAM tables are not clustered; there is some hidden internal row ordering that has no fixed relation to the PK (or any index), so any time I want to grab a set of rows, I may have to retrieve a different block for every single row - even if these rows have consecutive PKs.
Right. If you insert new records to MyISAM table the records will be added to the end of MYD file regardless the PK order.
Primary index of MyISAM table will be B+tree with pointers to records in the MYD file.
Now about your particular problem. I don't see any reason to define UNIQUE INDEX uq_owner_complete.
Is there any reasonable middle ground between sticking with short, single-field indexes (prefix where possible) and expanding every index to cover the entire table?
The best is to have the secondary index on all columns that are used in the WHERE clause, except low selective fields (like sex). The most selective fields must go first in the index. For example (last_name, eye_color) is good. (eye_color, last_name) is bad.
If the covering index allows to avoid additional PK lookup, that's excellent. But if not that's acceptable too.
How would the latter be any different from storing the entire dataset five times on disk, but sorted differently each time?
Yes.
Is there any benefit to adding the PK/surrogate ID to each of the smaller indexes in the hope that the query optimizer will be able to work some sort of index merge magic?
PK is already a part of the index.( Remember, it's stored as data.) So, it makes no sense to explicitly add PK fields to the secondary index. I think (but not sure) that MyISAM secondary indexes store the PK values too (and Primary indexes do store the pointers).
To summarize:
Make your PK shorter as possible (surrogate PK works great)
Add as many indexes as you need until writes performance becomes unacceptable for you.

mySQL (and MSSQL), using both indexed and non-indexed columns in where clause

The database I use is currently mySQL but maybe later MSSQL.
My questing is about how mySQL and msSQL takes care about indexed and nonindexed columns.
Lets say I have a simple table like this:
*table_ID -Auto increase. just a ID, indexed.
*table_user_ID -every user has a unique ID indexed
*table_somOtherID -some data..
*....
Lets say that I have A LOT!! of rows in this table, But the number of rows that every user add to this table is very small (10-100)
And I want to find one o a few specific rows in this table. a row or rows from a specific User(indexed column).
If I use the following WHERE clause:
..... WHERE table_user_ID= 'someID' AND table_someOtherID='anotherValue'.
Will the database first search for the indexed columns, and then search for the "anotherValue" inside of those rows, or how does the database handle this?
I guess the database will increase a lot if I have to index every column in all tables..
But what do you think, is it enough to index those columns that will decrease the number of rows to just ten maybe hundred?
Database optimizers generally work on a cost basis on indexes by looking at all the possible indexes to use based on the query. In your specific case it will see 2 columns - table_user_ID with an index and someOtherID without an index. If you really only have 10-100 rows per userID then the cost of this index will be very low and it will be used. This is because the cardinality is high and the DB can only read the few rows it needs and not touch the other rows for every other user its not interested in. However, if the cost to use the index is very high (very few unique userIDs and many entries per user) it might actually be more efficient to not use the index and scan the whole table to prevent random seeking action as it jumps around the table grabbing rows based on the index.
Once it picks the index then the DB just grabs the rows that match that index (10 to 100 in your case) and try to match them against your other criteria searching for rows where someOtherID='anotherValue'
But the number of rows that every user add to this table is very small (10-100)
You only need to index the user_id. It should give you good performance regardless of your query, as long as it includes the user_id in the filter. Until you have identified other use cases, it will pretty much work as you state
Will the database first search for the indexed columns, and then search for the "anotherValue" inside of those rows, or how does the database handle this?
Yes (in layman terms that is close).
In regards to SQL Server:
The ordering of the indexes are important depending on how you query and how the indexes are structured. If you create an index on the columns
-table_user_id
-table_someotherID
The index is ordered by the table_user_id first. Example:
1-2
1-5
1-6
2-3
2-5
2-6
For the first record on the index, 1 being the table user id, and 2 being some other value.
If you run a query with a where on table_user_id = blah, it will be very fast to use this index, since the table_user_id are indexed in order.
But if you run a query that only uses table_someotherID in the WHERE clause, it might not even use this index, as instead of doing a quick seek in the index for the matching value, it will do a rough scan of the index (which is less efficient than a seek).
Also SQL Server has a INCLUDE feature that associate the columns you want in the SELECT clause to the index you create on the WHERE or JOIN columns.
So to answer your question, it all depends on how you create the indexes and how you query them. You're right not to think about indexing every column, as indexes take up storage and performance hit when you do inserts and updates on the table.

MySQL: low cardinality/selectivity columns = how to index?

I need to add indexes to my table (columns) and stumbled across this post:
How many database indexes is too many?
Quote:
“Having said that, you can clearly add a lot of pointless indexes to a table that won't do anything. Adding B-Tree indexes to a column with 2 distinct values will be pointless since it doesn't add anything in terms of looking the data up. The more unique the values in a column, the more it will benefit from an index.”
Is an Index really pointless if there are only two distinct values? Given a table as follows (MySQL Database, InnoDB)
Id (BIGINT)
fullname (VARCHAR)
address (VARCHAR)
status (VARCHAR)
Further conditions:
The Database contains 300 Million records
Status can only be “enabled” and “disabled”
150 Million records have status= enabled and 150 Million records have
stauts= disabled
My understanding is, without having an index on status, a select with where status=’enabled’ would result in a full tablescan with 300 Million Records to process?
How efficient is the lookup when I use a BTREE index on status?
Should I index this column or not?
What alternatives (maybe any other indexes) does MySQL InnoDB provide to efficiently look records up by the "where status="enabled" clause in the given example with a very low cardinality/selectivity of the values?
The index that you describe is pretty much pointless. An index is best used when you need to select a small number of rows in comparison to the total rows.
The reason for this is related to how a database accesses a table. Tables can be assessed either by a full table scan, where each block is read and processed in turn. Or by a rowid or key lookup, where the database has a key/rowid and reads the exact row it requires.
In the case where you use a where clause based on the primary key or another unique index, eg. where id = 1, the database can use the index to get an exact reference to where the row's data is stored. This is clearly more efficient than doing a full table scan and processing every block.
Now back to your example, you have a where clause of where status = 'enabled', the index will return 150m rows and the database will have to read each row in turn using separate small reads. Whereas accessing the table with a full table scan allows the database to make use of more efficient larger reads.
There is a point at which it is better to just do a full table scan rather than use the index. With mysql you can use FORCE INDEX (idx_name) as part of your query to allow comparisons between each table access method.
Reference:
http://dev.mysql.com/doc/refman/5.5/en/how-to-avoid-table-scan.html
I'm sorry to say that I do not agree with Mike. Adding an index is meant to limit the amount of full records searches for MySQL, thereby limiting IO which usually is the bottleneck.
This indexing is not free; you pay for it on inserts/updates when the index has to be updated and in the search itself, as it now needs to load the index file (full text index for 300M records is probably not in memory). So it might well be that you get extra IO in stead of limitting it.
I do agree with the statement that a binary variable is best stored as one, a bool or tinyint, as that decreases the length of a row and can thereby limit disk IO, also comparisons on numbers are faster.
If you need speed and you seldom use the disabled records, you may wish to have 2 tables, one for enabled and one for disabled records and move the records when the status changes. As it increases complexity and risk this would be my very last choice of course. Definitely do the move in 1 transaction if you happen to go for it.
It just popped into my head that you can check wether an index is actually used by using the explain statement. That should show you how MySQL is optimizing the query. I don't really know hoe MySQL optimizes queries, but from postgresql I do know that you should explain a query on a database approximately the same (in size and data) as the real database. So if you have a copy on the database, create an index on the table and see wether it's actually used. As I said, I doubt it, but I most definitely don't know everything:)
If the data is distributed like 50:50 then query like where status="enabled" will avoid half scanning of the table.
Having index on such tables is completely depends on distribution of data, i,e : if entries having status enabled is 90% and other is 10%. and for query where status="disabled" it scans only 10% of the table.
so having index on such columns depends on distribution of data.
#a'r answer is correct, however it needs to be pointed out that the usefulness of an index is given not only by its cardinality but also by the distribution of data and the queries run on the database.
In OP's case, with 150M records having status='enabled' and 150M having status='disabled', the index is unnecessary and a waste of resource.
In case of 299M records having status='enabled' and 1M having status='disabled', the index is useful (and will be used) in queries of type SELECT ... where status='disabled'.
Queries of type SELECT ... where status='enabled' will still run with a full table scan.
You will hardly need all 150 mln records at once, so I guess "status" will always be used in conjunction with other columns. Perhaps it'd make more sense to use a compound index like (status, fullname)
Jan, you should definitely index that column. I'm not sure of the context of the quote, but everything you said above is correct. Without an index on that column, you are most certainly doing a table scan on 300M rows, which is about the worst you can do for that data.
Jan, as asked, where your query involves simply "where status=enabled" without some other limiting factor, an index on that column apparently won't help (glad to SO community showed me what's up). If however, there is a limiting factor, such as "limit 10" an index may help. Also, remember that indexes are also used in group by and order by optimizations. If you are doing "select count(*),status from table group by status", an index would be helpful.
You should also consider converting status to a tinyint where 0 would represent disabled and 1 would be enabled. You're wasting tons of space storing that string vs. a tinyint which only requires 1 byte per row!
I have a similar column in my MySQL database. Approximately 4 million rows, with the distribution of 90% 1 and 10% 0.
I've just discovered today that my queries (where column = 1) actually run significantly faster WITHOUT the index.
Foolishly I deleted the index. I say foolishly, because I now suspect the queries (where column = 0) may have still benefited from it. So, instead I should explicitly tell MySQL to ignore the index when I'm searching for 1, and to use it when I'm searching for 0. Maybe.