MySQL indexing needed?

I am storing the match odds of sports matches/events in a MySQL table.
The structure is as follows:
odds_id
match_id
outcome_id
bookmaker_id
odds
date
I can uniquely identify each row by match_id, outcome_id and bookmaker_id, so do I actually need the odds_id? For purposes of order and logic it seems as if I should have one, but the odds_id is meaningless to the user. The table will store a high volume of rows, and although the odds_id is set to auto-increment, the sequence will have gaps if any odds are deleted.
Will the presence of this index have a noticeable impact on performance, or will it be negligible and therefore OK to leave in place?
Thank you in advance.
Alan.

If you don't use this table in any joins, and you usually look rows up by the combination of match, outcome and bookmaker, you can drop the odds_id index and add a composite index on those 3 columns instead.
If you need to join this table with something else by using the odds_id, you should index it separately.
About the auto-increment: MySQL requires an AUTO_INCREMENT column to be indexed, and it is usually the primary key, which is already an index. So if you keep the AI column, you keep an index on it whether you want one or not.
The impact of an index is almost never negligible, so it is better to get it right.
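A minimal sketch of the "keep odds_id, but also enforce the natural key" option. It assumes Python's built-in sqlite3 module as a stand-in for MySQL, and the column types are guesses, since the question only lists column names. The helper add_odds is hypothetical.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the MySQL table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE odds (
        odds_id      INTEGER PRIMARY KEY AUTOINCREMENT,
        match_id     INTEGER NOT NULL,
        outcome_id   INTEGER NOT NULL,
        bookmaker_id INTEGER NOT NULL,
        odds         REAL    NOT NULL,
        date         TEXT    NOT NULL,
        UNIQUE (match_id, outcome_id, bookmaker_id)
    )
""")

def add_odds(match_id, outcome_id, bookmaker_id, odds, date):
    """Insert one row; return False if the (match, outcome, bookmaker) triple already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO odds (match_id, outcome_id, bookmaker_id, odds, date) "
                "VALUES (?, ?, ?, ?, ?)",
                (match_id, outcome_id, bookmaker_id, odds, date),
            )
        return True
    except sqlite3.IntegrityError:
        # The unique index on the three columns rejected a duplicate triple.
        return False

first = add_odds(1, 1, 7, 2.5, "2024-01-01")
duplicate = add_odds(1, 1, 7, 2.7, "2024-01-02")
row_count = conn.execute("SELECT COUNT(*) FROM odds").fetchone()[0]
print(first, duplicate, row_count)
```

The unique index does double duty here: it enforces the rule that identifies a row and it serves lookups by match_id. Whether the extra odds_id column is worth keeping still depends on whether anything joins on it.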

Related

Is indexing link tables smart?

So let's say for example I have 2 tables: Users > Items
Users can have favorite Items, and an Item can have multiple users that see it as a favorite, so I'll be using a linking table.
Now my linking table would contain something like:
id (int 11 AI)
user_id (int 11)
item_id (int 11)
Now would it be necessary / useful to put an index on user_id and item_id, since this table will contain a lot of records over time?
I'm not 100% sure when to use indexes. My idea of when to use them (might be completely incorrect though) is that when you have a big database and need to search/filter on a column, you index it. If this is incorrect I'm sorry, it's just what I've always been told.
Basically, yes, that's how it goes.
In this case, I'd say that an index on the user_id column would be useful, because you will display to the user a list of their favorites, right?
An index on the item_id might be less useful, because I doubt you're going to display a list of users that have favorited a specific item. Although you might care about the count ("100 users like this item"), so you might add that index after all. Or you might de-normalize and keep the count in the items table. That would give a better performance, although you'll need to write extra code to maintain that number.
Last but not least - in a link table, you can do away with the id column. Just add the primary key index on both columns (user_id and item_id in that order). This will make sure that you cannot enter duplicate rows, and since user_id is the first column in the index, you'll be able to use it in search queries. No need anymore to add a separate index on just the user_id column.
However this also depends on the code you're using. If you're using some kind of framework (ORM?) that REQUIRES an id column for every table, then this trick is useless.
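As a rough illustration of the link table without an id column, here is a sketch that again assumes sqlite3 in place of MySQL. The table name favorites and the index name idx_favorites_item are made up for the example.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; table and index names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE favorites (
        user_id INTEGER NOT NULL,
        item_id INTEGER NOT NULL,
        PRIMARY KEY (user_id, item_id)
    );
    -- Optional second index for "how many users like this item" queries.
    CREATE INDEX idx_favorites_item ON favorites (item_id);
""")
conn.executemany(
    "INSERT INTO favorites (user_id, item_id) VALUES (?, ?)",
    [(1, 10), (1, 11), (2, 10)],
)

# Favorites of one user: user_id is the leading column of the primary key.
user_favorites = [
    row[0]
    for row in conn.execute(
        "SELECT item_id FROM favorites WHERE user_id = ? ORDER BY item_id", (1,)
    )
]

# Number of users who favorited one item.
item_count = conn.execute(
    "SELECT COUNT(*) FROM favorites WHERE item_id = ?", (10,)
).fetchone()[0]

# The composite primary key refuses a second copy of the same pair.
try:
    conn.execute("INSERT INTO favorites (user_id, item_id) VALUES (1, 10)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(user_favorites, item_count, duplicate_rejected)
```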
As requested by the author, here's a quick intro on what indexes are.
Suppose you have a DB table which is just a bunch of rows in no particular order. Let's say we have a table people with the columns name, surname, age.
Now, when you want to find the age for John Smith you probably make a query like this:
select age from people where name='John' and surname='Smith'
When you do this, the DB engine can do only one thing - it has to go through ALL the rows and look for the ones that match. If there's 100,000 rows, it will be slow.
Now there's a faster way of doing this. Think about a phonebook (the classical paper edition). On its thousand yellow pages there are phone numbers for hundreds of people. Yet you can find the number you seek very quickly even if you're a human being. That's because the numbers are sorted alphabetically by name and surname. You open a random page and you can immediately see whether the number you're looking for is before or after the page you opened. Repeat a couple of times and you've found it.
This kind of searching is called a "binary search". Your DB engine could do this too, if the records were sorted by name and surname. So this is what a Primary Key is - it tells the DB to store the records not in some random order, but sorted by some columns. When a new record comes, it can quickly find its rightful place and push it in there, thus keeping the table forever sorted.
There are a few things to note here already.
First, you can make it sort by one or more columns, but, just like in a phonebook, the order is important. If you sort by name first and then by surname, then that's the order the records will be in. So you'll be able to quickly find all the records where name='John' or name='John' and surname='Smith', but it won't help you at all if you need to find just surname='Smith'. Just like in a phonebook.
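The phonebook point can be tried out with a plain sorted list. This is only a toy model in Python: the rows are invented, and a real engine uses a tree structure rather than a list.

```python
from bisect import bisect_left, bisect_right

# A toy "table" kept sorted by (name, surname), like the phonebook in the text.
people = sorted([
    ("Alice", "Jones", 41),
    ("John", "Brown", 52),
    ("John", "Smith", 30),
    ("Mary", "Smith", 27),
    ("Zoe", "Adams", 35),
])
names = [row[0] for row in people]  # the leading sort column

def find_by_name(name):
    """Binary search works because rows are sorted by name first."""
    lo = bisect_left(names, name)
    hi = bisect_right(names, name)
    return people[lo:hi]

def find_by_surname(surname):
    """Surnames are scattered through the list, so every row must be checked."""
    return [row for row in people if row[1] == surname]

johns = find_by_name("John")
smiths = find_by_surname("Smith")
print(johns)
print(smiths)
```

Both calls return the right rows, but only the first can skip most of the list. That is the sense in which a (name, surname) ordering does not help a surname-only search.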
Second, pushing a record somewhere in the middle is also somewhat slow. Not criminally so, but still. Appending a record at the end is faster. Therefore people tend to use auto_increment columns for their Primary Keys, because then every new row will be placed at the end.
Third, in most DBs the Primary Key is used not only to search quickly, but also to uniquely identify the row. This means that the DB will not be happy if there are two rows that have equal values for the Primary Key columns. In that case, it cannot determine which has to go first and which last, and the key is also no longer unique. Another reason to use auto_increment. Note that if the PK index has multiple columns in it, then their combination must be unique, while every column individually may be non-unique. In our case that means that there can be many Johns and many Smiths, but only one John Smith.
But we still have a problem. What if we want to quickly find rows both by just the name, and just the surname? A PK index can only do one of those things, not both at the same time.
This is where other non-PK indexes come into play. You can add as many of those as you want to the table. In our case, we could create another index to hold just the surname column.
When we do so, the DB creates another hidden table (OK, not true, but you can think of it this way) which is a copy of the original table, but only with the surname column and a special link back to the rows in the original table. This hidden index table is sorted by the surname column. So when you now need to find a row by specifying just the surname, the DB engine can look it up in the hidden index table, and then follow the links back to the original rows and get the data from them. Much faster.
These non-PK indexes also typically come in a few flavors. There's the standard "index" which places no restrictions at all - you can have duplicate values in the columns, nulls, etc. There's a "unique" index, which enforces that all the values in the index need to be unique; and then there are sometimes speciality indexes like FullText, Spatial, etc. Indexes also tend to have some technical options, but you'll have to read the documentation of your DB for those.
One last important thing to note is - indexes make it fast to find things in a table, but they come at a cost. Modifications to the table (insert, update, delete) become slower, because the indexes need to be updated as well. Keep that in mind and only add them where necessary.
Except for Primary Keys. ALWAYS add Primary Keys. That's an order! :)
In short, yes.
Imagine how well joins would work if, each time you needed to match a primary key value to a foreign key in another table, the DBMS had to search the entire table for the matching keys.

What are the merits of using numeric row IDs in MySQL?

I'm new to SQL, and thinking about my datasets relationally instead of hierarchically is a big shift for me. I'm hoping to get some insight on the performance (both in terms of storage space and processing speed) versus design complexity of using numeric row IDs as a primary key instead of string values which are more meaningful.
Specifically, this is my situation. I have one table ("parent") with a few hundred rows, for which one column is a string identifier (10-20 characters) which would seem to be a natural choice for the table's primary key. I have a second table ("child") with hundreds of thousands (or possibly millions or more) of rows, where each row refers to a row in the parent table (so I could create a foreign key constraint on the child table). (Actually, I have several tables of both types with a complex set of references among them, but I think this gets the point across.)
So I need a column in the child table that gives an identifier to rows in the parent table. Naively, it seems like creating the column as something like VARCHAR(20) to refer to the "natural" identifier in the first table would lead to a huge performance hit, both in terms of storage space and query time, and therefore I should include a numeric (probably auto_increment) id column in the parent table and use this as the reference in the child. But, as the data that I'm loading into MySQL don't already have such numeric ids, it means increasing the complexity of my code and more opportunities for bugs. To make matters worse, since I'm doing exploratory data analysis, I may want to muck around with the values in the parent table without doing anything to the child table, so I'd have to be careful not to accidentally break the relationship by deleting rows and losing my numeric id (I'd probably solve this by storing the ids in a third table or something silly like that.)
So my question is, are there optimizations I might not be aware of that mean a column with hundreds of thousands or millions of rows that repeats just a few hundred string values over and over is less wasteful than it first appears? I don't mind a modest compromise of efficiency in favor of simplicity, as this is for data analysis rather than production, but I'm worried I'll code myself into a corner where everything I want to do takes a huge amount of time to run.
Thanks in advance.
I wouldn't be concerned about space considerations primarily. An integer key would typically occupy four bytes. The varchar will occupy between 1 and 21 bytes, depending on the length of the string. So, if most are just a few characters, a varchar(20) key will occupy more space than an integer key. But not an extraordinary amount more.
Both, by the way, can take advantage of indexes. So speed of access is not particularly different (of course, longer/variable length keys will have marginal effects on index performance).
There are better reasons to use an auto-incremented primary key.
You know which values were most recently inserted.
If duplicates appear (which shouldn't happen for a primary key of course), it is easy to determine which to remove.
If you decide to change the "name" of one of the entries, you don't have to update all the tables that refer to it.
You don't have to worry about leading spaces, trailing spaces, and other character oddities.
You do pay for the additional functionality with four more bytes in a record devoted to something that may not seem useful. However, such efficiencies are premature and probably not worth the effort.
Gordon is right (which is no surprise).
Here are the considerations for you not to worry about, in my view.
When you're dealing with dozens of megarows or less, storage space is basically free. Don't worry about the difference between INT and VARCHAR(20), and don't worry about the disk space cost of adding an extra column or two. It just doesn't matter when you can buy decent terabyte drives for about US$100.
INTs and VARCHARs can both be indexed quite efficiently. You won't see much difference in time performance.
Here's what you should worry about.
There is one significant pitfall in index performance, that you might hit with character indexes. You want the columns upon which you create indexes to be declared NOT NULL, and you never want to do a query that says
WHERE colm IS NULL /* slow! */
or
WHERE colm IS NOT NULL /* slow! */
This kind of thing defeats indexing. In a similar vein, your performance will suffer bigtime if you apply functions to columns in search. For example, don't do this, because it too defeats indexing.
WHERE SUBSTR(colm,1,3) = 'abc' /* slow! */
One more question to ask yourself. Will you uniquely identify the rows in your subsidiary tables, and if so, how? Do they have some sort of natural compound primary key? For example, you could have these columns in a "child" table.
parent      varchar(20)  pk, fk to parent table
birthorder  int          pk
name        varchar(20)
Then, you could have rows like...
parent  birthorder  name
homer   1           bart
homer   2           lisa
homer   3           maggie
But, if you tried to insert a fourth row here like this
homer 1 badbart
you'd get a primary key collision because (homer,1) is occupied. It's probably a good idea to work out how you'll manage primary keys for your subsidiary tables.
Character strings containing numbers sort funny. For example, '2' comes after '101'. You need to be on the lookout for this.
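The sorting quirk is easy to reproduce outside the database. This short Python sketch uses made-up values and only illustrates plain character-by-character comparison; the exact ordering in MySQL also depends on the column's collation.

```python
labels = ["2", "101", "15"]

# Compared as text, strings are ordered character by character,
# so "101" sorts before "15", and both sort before "2".
as_text = sorted(labels)

# Compared as numbers, the order is the one people usually expect.
as_numbers = sorted(labels, key=int)

print(as_text)     # ['101', '15', '2']
print(as_numbers)  # ['2', '15', '101']
```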
The main benefit you get from numeric values is that they are easier to 'index'. Indexing is a process that MySQL uses to make it easier to find a value.
Typically, if you want to find a value in a group, you have to loop through the group looking for your value. That is slow and has a worst case of O(n). If instead your data was in a nice, searchable format, like a binary search tree, it could be found in O(log n), which is much faster.
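To put rough numbers on that difference, here is a small Python sketch that counts how many elements each approach inspects. A binary search over a sorted list stands in for the search tree, and the one-million-element list is an arbitrary choice.

```python
def linear_steps(sorted_values, target):
    """Count elements inspected when looping from the start."""
    steps = 0
    for value in sorted_values:
        steps += 1
        if value == target:
            break
    return steps

def binary_steps(sorted_values, target):
    """Count elements inspected when halving the search range each time."""
    lo, hi, steps = 0, len(sorted_values) - 1, 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_values[mid] == target:
            break
        if sorted_values[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return steps

values = list(range(1_000_000))
linear = linear_steps(values, 999_999)   # worst case: the last element
binary = binary_steps(values, 999_999)
print(linear, binary)
```

For a million values the loop inspects every element in the worst case, while the halving search needs at most 20 inspections, since 2 to the power of 20 is just over a million.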
Indexing is the process MySQL uses to prepare data to be searched: it generates search trees and other clever structures that make finding data quick. It makes many searches much faster. However, to do it, it has to compare the value you are searching for to various 'key' values to determine whether your value is greater than or less than the key.
This comparison can be done on non-numeric values. However, comparing non-numeric values is generally slower. If you want to be able to quickly look up data, your best bet is to have an integer 'key' that you use.
Numeric row IDs have many advantages over a string-based ID.
Most of them are mentioned in other answers.
1. One of them is indexing. Primary keys are indexed by default in a relational database, and a numeric key is generally more efficient to index.
2. Numeric fields are stored much more efficiently.
3. Joins are much faster with numeric keys.
4. A row ID could be a foreign key. Numeric IDs are compact to store, making them efficient.
5. I think using auto-increment on the primary key has its advantages too.
-Thanks
_san

Short, single-field indexes or enormous covering indexes in MySQL

I am trying to understand exactly what is and is not useful in a multiple-field index. I have read this existing question (and many more) plus other sites/resources (MySQL Performance Blog, Percona slideshares, etc.) but I'm not totally confident that what I've found on the subject is current and accurate. So please bear with me while I repeat some of what I think I know.
By indexing wisely, I can not only reduce how long it takes to match my query condition(s), but also reduce how long it takes to fetch the fields I want in my query result.
The index is just a sorted, duplicated subset of the full data, paired with pointers (MyISAM) or PKs (InnoDB), that I can search more efficiently than the full table.
Given the above, using an index to match my condition(s) really happens in the same way as fetching my desired result, except I created this special-purpose table (the index) that gets me an intermediate result set really quickly; and with this intermediate result set I can retrieve my final desired result set much more efficiently than by performing a full table scan.
Furthermore, if the index covers all the fields in my query (not just the conditions), instead of an intermediate result set, the index will give me everything I need without having to fetch any rows from the complete table.
InnoDB tables are clustered on the PK, so rows with consecutive PKs are likely to be stored in the same block (given many rows per block), and I can grab a range of rows with consecutive PKs fairly efficiently.
MyISAM tables are not clustered; there is some hidden internal row ordering that has no fixed relation to the PK (or any index), so any time I want to grab a set of rows, I may have to retrieve a different block for every single row - even if these rows have consecutive PKs.
Assuming the above is at least generally accurate, here's my puzzle. I have a slowly changing dimension table defined with the following columns (more or less) and using MyISAM:
dim_owner_ID INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
person_ID INT UNSIGNED NOT NULL,
raw_name VARCHAR(92) NOT NULL,
first VARCHAR(30),
middle VARCHAR(50),
last VARCHAR(30),
suffix CHAR(3),
flag CHAR(1)
Each "owner" is a unique instance of a particular individual with a particular name, so if Sue Smith changes her name to Sue Brown, that results in two rows that are the same except for the last field and the surrogate key. My understanding is that the only way to enforce this constraint internally is to do:
UNIQUE INDEX uq_owner_complete (person_ID, raw_name, first, middle, last, suffix, flag)
And that's basically going to duplicate the entire table (except for the surrogate key).
I also need to index a few other fields for quick joins and searches. While there will be some writes, and disk space is neither free nor infinite, read performance is absolutely the #1 priority here. These smaller indexes should serve very well to cover the conditions of the queries that will be run against the table, but in almost every case, the entire row needs to be selected.
With that in mind:
Is there any reasonable middle ground between sticking with short, single-field indexes (prefix where possible) and expanding every index to cover the entire table?
How would the latter be any different from storing the entire dataset five times on disk, but sorted differently each time?
Is there any benefit to adding the PK/surrogate ID to each of the smaller indexes in the hope that the query optimizer will be able to work some sort of index merge magic?
If this were an InnoDB index, the PK would already be there, but since it's MyISAM it's got pointers to the full rows instead. So if I'm understanding things correctly, there's no point (no pun intended) to adding the PK to any other index, unless doing so would allow the retrieval of the desired result set directly from the index. Which is not likely here.
I understand if it seems like I'm trying too hard to optimize, and maybe I am, but the tasks I need to perform using this database take weeks at a time, so every little bit helps.
You have to understand one concept. An index (either InnoDB or MyISAM, either primary or secondary) is a data structure called a "B+ tree".
Each node in the B+ tree is a pair (k, v), where k is a key and v is a value. If you build an index on last_name, your keys will be "Smith", "Johnson", "Kuzminsky", etc.
The value in the index is some data. If the index is a secondary index, then the data is the primary key value.
So if you build an index on last_name, each node will be a pair (last_name, id), e.g. ("Smith", 5).
The primary index is an index where k is the primary key and the data is all the other fields.
Bearing in mind the above, let me comment on some points:
By indexing wisely, I can not only reduce how long it takes to match my query condition(s), but also reduce how long it takes to fetch the fields I want in my query result.
Not exactly. If your secondary index is good, you can quickly find v based on your query condition. E.g. you can quickly find the PK by last name.
The index is just a sorted, duplicated subset of the full data, paired with pointers (MyISAM) or PKs (InnoDB), that I can search more efficiently than the full table.
The index is a B+ tree where each node is a pair of the indexed field value(s) and the PK.
Given the above, using an index to match my condition(s) really happens in the same way as fetching my desired result, except I created this special-purpose table (the index) that gets me an intermediate result set really quickly; and with this intermediate result set I can retrieve my final desired result set much more efficiently than by performing a full table scan.
Not exactly. If there were no index, you'd have to scan the whole table and choose only the records where last_name = "Smith". But you have the index (last_name, PK), so with the key "Smith" you can quickly find all PKs where last_name = "Smith". And then you can quickly find your full result (because you need not only the last name, but the first name too). So you're right, queries like SELECT * FROM table WHERE last_name = "Smith" are executed in two steps:
1. Find all matching PKs.
2. By PK, find the full record.
Furthermore, if the index covers all the fields in my query (not just the conditions), instead of an intermediate result set, the index will give me everything I need without having to fetch any rows from the complete table.
Exactly. If your index is actually (last_name, first_name, id) and your query is SELECT first_name FROM table WHERE last_name = "Smith", you don't do the second step. You have the first name in the secondary index, so you don't have to go to the primary index.
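The two-step lookup and the covering-index shortcut can be modelled with ordinary dictionaries. This is a Python sketch only: the rows are invented, and dictionaries are hash tables rather than B+ trees, so it shows which structure is consulted but says nothing about ordering or range scans.

```python
# Primary index: PK -> full row (assumed toy data).
primary = {
    1: {"first_name": "John", "last_name": "Smith", "age": 30},
    2: {"first_name": "Mary", "last_name": "Smith", "age": 27},
    3: {"first_name": "Ann", "last_name": "Kuzminsky", "age": 44},
}

# Secondary index on last_name: key -> list of PKs.
by_last_name = {}
# Covering index (last_name, first_name, id): key -> list of (first_name, PK).
covering = {}
for pk, row in primary.items():
    by_last_name.setdefault(row["last_name"], []).append(pk)
    covering.setdefault(row["last_name"], []).append((row["first_name"], pk))

def select_all(last_name):
    """SELECT * ...: step 1 finds the PKs, step 2 fetches each full row."""
    pks = by_last_name.get(last_name, [])
    return [primary[pk] for pk in pks]

def select_first_names(last_name):
    """SELECT first_name ...: answered from the covering index alone."""
    return [first for first, _pk in covering.get(last_name, [])]

full_rows = select_all("Smith")
first_names = select_first_names("Smith")
print(full_rows)
print(first_names)
```

Note that select_first_names never touches the primary dictionary, which is the whole benefit of a covering index.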
InnoDB tables are clustered on the PK, so rows with consecutive PKs are likely to be stored in the same block (given many rows per block), and I can grab a range of rows with consecutive PKs fairly efficiently.
Right. Two neighboring PK values will most likely be in the same page, except in cases where one PK is the last value in a page and the next PK value is stored in the next page.
Basically, this is why B+ tree structure was invented. It's not only efficient for search but also efficient in sequential access. And until recently we had rotating hard drives.
MyISAM tables are not clustered; there is some hidden internal row ordering that has no fixed relation to the PK (or any index), so any time I want to grab a set of rows, I may have to retrieve a different block for every single row - even if these rows have consecutive PKs.
Right. If you insert new records into a MyISAM table, the records will be added to the end of the MYD file regardless of the PK order.
The primary index of a MyISAM table will be a B+ tree with pointers to records in the MYD file.
Now about your particular problem. I don't see any reason to define UNIQUE INDEX uq_owner_complete.
Is there any reasonable middle ground between sticking with short, single-field indexes (prefix where possible) and expanding every index to cover the entire table?
The best approach is to have a secondary index on all columns that are used in the WHERE clause, except low-selectivity fields (like sex). The most selective fields must go first in the index. For example, (last_name, eye_color) is good; (eye_color, last_name) is bad.
If a covering index lets you avoid the additional PK lookup, that's excellent. But if not, that's acceptable too.
How would the latter be any different from storing the entire dataset five times on disk, but sorted differently each time?
Yes.
Is there any benefit to adding the PK/surrogate ID to each of the smaller indexes in the hope that the query optimizer will be able to work some sort of index merge magic?
The PK is already a part of the index. (Remember, it's stored as data.) So it makes no sense to explicitly add PK fields to the secondary index. I think (but am not sure) that MyISAM secondary indexes store the PK values too (and primary indexes do store the pointers).
To summarize:
Make your PK as short as possible (a surrogate PK works great).
Add as many indexes as you need until write performance becomes unacceptable for you.

When should my indexes have the active column?

I have several tables and I'm wondering if my composite index is helpful or not. I am using MySQL 5+ but I guess this would apply to any database (or not?).
Anyway, say I have the following table:
username       active
-----------------------------------
Moe.Howard     1
Larry.Fine     0
Shemp.Howard   1
So I normally select like:
select * from users where username = 'shemp.howard' and active = 1;
The active=1 condition is used in many of our tables. Normally, my index would be on the username column, but I'm thinking of adding the active flag as well (to the same index).
My logic is that as the query engine is scanning through the index, it would be scanning against an index like:
moe.howard,1
shemp.howard,1
larry.fine,0
and find Shemp before it hits the inactive users (Larry).
Now, our active columns are usually TINYINTS and Unsigned. But I'm concerned the index might be backward!
larry.fine,0
moe.howard,1
shemp.howard,1
How should I best handle this and make sure my indexes are correct? Should I not add the active column to the same index as username? Or should I create a separate index for the active and make it descending?
Thanks.
If you combine those two fields in a composite index with the active flag as the second part of the key, then the index order will only depend on that value when (iff) the name field for two or more rows is identical (which seems unlikely in this situation based on the assumption that one would want user names in a system to be unique). The first key in the composite index will define the order of the keys whenever they are different. In other words, if the user name is unique, then adding the active flag as the second segment of a composite index will not change the order of the index.
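A quick way to see this ordering rule is to sort the example rows as (username, active) pairs. This is a Python sketch using the lowercase names from the question; tuple comparison is used as a stand-in for composite index ordering.

```python
# Index entries in the order the question first imagines them.
index_entries = [("moe.howard", 1), ("shemp.howard", 1), ("larry.fine", 0)]

# Tuples compare on the first element, and only fall back to the second on ties,
# which is the same rule a composite (username, active) index follows.
ordered = sorted(index_entries)

# Order of the usernames alone, ignoring the flag.
order_without_flag = sorted(name for name, _active in index_entries)
same_order = [name for name, _active in ordered] == order_without_flag

print(ordered)
print(same_order)
```

With unique usernames the flag never gets a chance to influence the order, so neither of the two layouts pictured in the question is "backward".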
Also, note that for the example query, the database won't "scan" the index to find the value. Rather it will seek to the first matching entry, which in the example given consists of a single match. The "scan" would happen if multiple entries pass the WHERE clause.
Having said that, unless there are lots of cases where you have duplicate names, my initial reaction would be to not create the composite key. If the names are "generally" unique, then you would not be buying a lot of savings with the composite key. On the other hand, if there are generally quite a few duplicate names with differing active flag values, it could help. At that point, you may need to just test.
Really, we can only second-guess what the query optimiser will try to do. However, it is commonly recommended that if a condition matches more than roughly 20% of the rows, a full table scan is preferable to index access. This would mean it is very likely that even if you index active, the index won't actually be used, assuming you have many more active than non-active users.
MySQL can only use the index in order, so if you create a composite index of username,active that is entirely pointless as you're not going to have multiple users with the same username.
You really need to analyse your query requirements, and then you can design an indexing plan to suit them. Profile each query and don't try to over-optimize everything, as this can have a negative result.
An index should be added only if the values you expect it to help you filter in/out are representative, statistically speaking.
What does that mean?
If, say, the filter in your WHERE clause on the column you're indexing helps you narrow the result down to 20% of the rows, you should add an index on it. This percentage depends on your particular case and should be tried out, but that's the idea.
In your case, just by the name, you would have 100% of exclusion. Adding an index on the active column would be then useless because it wouldn't help reducing the final recordset (except if you have possibly n times the same name but only one active?)
The situation would be different if you decided to filter ONLY active users, not caring about the name.

MySQL index question

I've been reading about indexes in MySQL recently, and some of the principles are quite straightforward but one concept is still bugging me: basically, if in a hypothetical table with, let's say, 10 columns, we have two single-column indexes (for column01 and column02 respectively), plus a primary key column (some other column), then are they going to be used in a simple SELECT query like this one or not:
SELECT * FROM table WHERE column01 = 'aaa' AND column02 = 'bbb'
Looking at it, my first instinct is telling me that the first index is going to retrieve a set of rows (or primary keys in InnoDB, if I got the idea right) that satisfy the first condition, and the second index will get another set. And the final result set will be just the intersection of these two. In the books that I've been going through I cannot find anything about this particular scenario. Of course, for this particular query one index on both columns seems like the best option, but I am struggling with understanding the real process behind this whole thing if I try to use two indexes that I described above.
It's generally only going to use a single index per table. You need to create a composite index of multiple columns if you want it to be able to index off of each column you are testing. You may want to read the manual to find out how MySQL uses each type of index, and how to order your composite indexes correctly to get the best utilization of them.
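The two strategies discussed here, intersecting the results of two single-column indexes versus one lookup in a composite index, can be compared in a toy model. This Python sketch uses invented rows and dictionaries in place of real index structures; it does not claim to show which plan MySQL will pick for a given query.

```python
# Toy table: PK -> (column01, column02). The data is made up.
rows = {
    1: ("aaa", "bbb"),
    2: ("aaa", "ccc"),
    3: ("ddd", "bbb"),
    4: ("aaa", "bbb"),
}

idx_col1, idx_col2, idx_both = {}, {}, {}
for pk, (c1, c2) in rows.items():
    idx_col1.setdefault(c1, set()).add(pk)        # single-column index on column01
    idx_col2.setdefault(c2, set()).add(pk)        # single-column index on column02
    idx_both.setdefault((c1, c2), set()).add(pk)  # composite index on both columns

# Strategy 1: collect PKs from each single-column index, then intersect them.
via_intersection = idx_col1["aaa"] & idx_col2["bbb"]

# Strategy 2: one lookup in the composite index.
via_composite = idx_both[("aaa", "bbb")]

print(sorted(via_intersection), sorted(via_composite))
```

Both strategies give the same rows. The composite lookup gets there without building the two intermediate sets, which is why it is usually suggested for this kind of query.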
It's actually the most common question about indexing at all: is it better to have one index with all columns or one individual index for every column?
http://use-the-index-luke.com/sql/where-clause/searching-for-ranges/index-combine-performance