Correct indexing when using OR operator - mysql

I have a query like this:
SELECT fields FROM table
WHERE field1='something' OR field2='something'
OR field3='something' OR field4='something'
What would be the correct way to index such a table for this query?
A query like this takes an entire second to run! I have one index with all four of those fields in it, so I'd think MySQL would do something like this:
Go through each row in the index thinking this:
Is field1 something? How about field2? field3? field4? Ok, nope, go to the next row.

You misunderstand how indexes work.
Think of a telephone book (the equivalent of a two-column index on last name first, first name last). If I ask you to find all people in the telephone book whose last name is "Smith," you can benefit from the fact that the names are ordered that way; you can assume that the Smiths are organized together. But if I ask you to find all the people whose first name is "John" you get no benefit from the index. Johns can have any last name, and so they are scattered throughout the book and you end up having to search the hard way, from cover to cover.
Now if I ask you to find all people whose last name is "Smith" OR whose first name is "John", you can find the Smiths easily as before, but that doesn't help you at all to find the Johns. They're still scattered throughout the book and you have to search for them the hard way.
It's the same with multi-column indexes in SQL. The index is sorted by the first column, then sorted by the second column in cases of ties in the first column, then sorted by the third column in cases of ties in both the first two columns, etc. It is not sorted by all columns simultaneously. So your multi-column index doesn't help to make your search terms more efficient, except for the left-most column in the index.
Back to your original question.
What would be the correct way to index such a table for this query?
Create a separate, single-column index on each column. One of these indexes will be a better choice than the others, based on MySQL's estimation of how many I/O operations the index will incur if it is used.
Modern versions of MySQL also have some smarts about index merging, so the query may use more than one index on a given table and then merge the results. Otherwise MySQL tends to be limited to using one index per table in a given query.
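As a rough sketch of the single-column approach (the table name my_table, the column types, and the index names are placeholders, not taken from the question), the indexes could be created and checked like this:

```sql
-- Hypothetical table; only field1..field4 come from the question.
CREATE TABLE my_table (
  id     INT AUTO_INCREMENT PRIMARY KEY,
  field1 VARCHAR(50),
  field2 VARCHAR(50),
  field3 VARCHAR(50),
  field4 VARCHAR(50)
);

-- One single-column index per searched column.
CREATE INDEX idx_field1 ON my_table (field1);
CREATE INDEX idx_field2 ON my_table (field2);
CREATE INDEX idx_field3 ON my_table (field3);
CREATE INDEX idx_field4 ON my_table (field4);

-- Ask the optimizer how it plans to run the OR query.
EXPLAIN
SELECT * FROM my_table
WHERE field1 = 'something' OR field2 = 'something'
   OR field3 = 'something' OR field4 = 'something';
```

If the optimizer chooses an index merge, the type column of the EXPLAIN output reads index_merge and the Extra column names the indexes being combined. Whether it does so depends on the data and the MySQL version, so check it against your own table.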
Another trick that a lot of people use successfully is to do a separate query for each of your indexed columns (which should use the respective index) and then UNION the results.
SELECT fields FROM table WHERE field1='something'
UNION
SELECT fields FROM table WHERE field2='something'
UNION
SELECT fields FROM table WHERE field3='something'
UNION
SELECT fields FROM table WHERE field4='something'
One final observation: if you find yourself searching for the same 'something' across four fields, you should reconsider if all four fields are actually the same thing, and you're guilty of designing a table that violates First Normal form with repeating groups. If so, perhaps field1 through field4 belong in a single column in a child table. Then it becomes a lot easier to index and query:
SELECT fields FROM table INNER JOIN child_table ON table.pk = child_table.fk
WHERE child_table.field = 'something'
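A minimal sketch of what such a child table might look like, assuming each value appears at most once per parent row (the column types and the index name are assumptions; only child_table, fk and field come from the query above):

```sql
-- Hypothetical schema: one row per value that used to live in field1..field4.
CREATE TABLE child_table (
  fk    INT NOT NULL,           -- points at the parent table's pk
  field VARCHAR(50) NOT NULL,   -- the former field1..field4 value
  PRIMARY KEY (fk, field),      -- assumes a value is not repeated per parent
  KEY idx_field (field)         -- serves WHERE child_table.field = '...'
);
```

With this layout a single index on field covers the search that previously needed four separate indexes.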

In addition to previous comment:
Some RDBMSs like MySQL/PostgreSQL can use an index merge if the optimizer thinks it's a good idea.
So you can create a separate index for each field, or create some composite indexes like (field1, field2) and (field3, field4). Finally, you should try several different solutions and choose the one with the best EXPLAIN plan.

Related

Which indexes to create on a table that will contain millions of rows and multiple filters

I have a table with millions of rows. Users can select any combination of filters on multiple columns. For ex:
Year
Month
Product
HSCode
Chapter
Country
Port
Unit
Importer/exporter name
Type
I am planning to make the Year filter mandatory, so that an index on Year is always used to improve query performance.
Since any combination of these filters can be used (single, multiple, all), what kind of indexes should I have on the table? The table is going to be really huge, and it is important to maintain read performance on these queries.
Discover what queries are typically used.
Make a dozen or so 2-column indexes based on typical queries.
When making composite indexes:
Have the column(s) tested with = first.
When a column is tested with a range (IN, LIKE, BETWEEN, etc), subsequent columns in the index may go unused.
LIKE 'no-wild-cards' and IN ('one option') are the same as =.
LIKE 'blah%' is a "range" test.
LIKE '%blah' cannot use an index.
Have an extra column for all "text" searches; toss all the "text" columns into it. (Optionally remove punctuation.) Then use FULLTEXT and MATCH.
Year and Month can be problematic; let's see some concrete examples.
See also EAV
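To illustrate the "= columns first, range column last" rule, here is a small sketch. The table name shipments, the column types, and the sample values are assumptions; only the filter names Year, Country and Month come from the question.

```sql
-- Hypothetical table holding three of the question's filter columns.
CREATE TABLE shipments (
  id      INT AUTO_INCREMENT PRIMARY KEY,
  `year`  SMALLINT NOT NULL,
  `month` TINYINT  NOT NULL,
  country VARCHAR(50) NOT NULL
);

-- Equality-tested columns first, the range-tested column last.
CREATE INDEX idx_year_country_month ON shipments (`year`, country, `month`);

-- All three parts of the index can be used here:
-- two equality tests, then one range at the end.
EXPLAIN
SELECT * FROM shipments
WHERE `year` = 2020
  AND country = 'India'
  AND `month` BETWEEN 1 AND 6;
```

Had the index been declared as (`year`, `month`, country), the range on month would come before country, and the country part of the index would likely go unused for this query.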
I'd recommend using a search engine like Apache Solr for the task you describe.
The problem with conventional MySQL indexes is that they have a fixed number of columns, the columns are in a fixed order, and searches must make use of the columns starting from the first one.
Compare with looking up a name in a telephone book. You can look up a person by their last name, because the last name is the first column in the index. But if you need to search for someone by first name only, it's not the first column in the index, and the sort order of the book does not help.
So to optimize searching by any columns in your search criteria, you'd need N-factorial indexes for N columns.
A search engine like Apache Solr, by contrast, doesn't use the same kind of indexing. You can search the Solr index with any subset of attributes.

Indexing mysql table for selects

I'm looking to add some MySQL indexes to a database table. Most of the queries are selects. Would it be best to create a separate index for each column, or add an index for province, province and number, and name? Does it even make sense to index province since there are only about a dozen options?
select * from employees where province = 'ab' and number = 'v45g';
select * from employees where province = 'ab';
If the usage changed to more inserts should I remove all the indexes except for the number?
An index is a data structure that maps the values of a column into a fast searchable tree. This tree contains the index of rows which the DB can use to find rows fast. One thing to know, some DB engines read plus or minus a bunch of rows to take advantage of disk read ahead. So you may actually read 50 or 100 rows per index read, and not just one. Hence, if you access 30% of a table through an index, you may wind up reading all table data multiple times.
Rule of thumb:
- index the columns with more unique values; a tree with 2 branches and half of your table on either side is not too useful for narrowing down a search
- use as few indexes as possible
- use real-world example numbers as much as possible. Performance can change dynamically based on the data or the whim of the DB engine, so it's very important to track how fast your queries are running consistently (i.e. log this in case a query ever gets slow). From this data you can add indexes without working blind
Okay, so there are two kinds of index: single-column and multi-column. Several single-column indexes make sense when the indexes are used independently of each other; a multi-column index typically makes sense when you are refining with a WHERE clause. Think of the first as good when you want joins, or when you have OR conditions. The second is better when you have AND conditions and successively filter rows.
In your case name does not make sense, since a LIKE with a leading wildcard cannot use an index. city and number do make sense, probably as a multi-column index. Province could help as well, as the last column of the index.
So an index with these columns would likely help:
(number,city,province)
Or try as well just:
(number,city)
You should index fields that are searched upon and have high selectivity / cardinality. Indexes make writes slower.
Another thing is that indexes can be added and dropped at any time, so maybe you should leave this for a later review of the database and optimization of the queries.
That being said one index that you can be sure to add is in the column that holds the name of a person. That's almost always used in searching.
According to MySQL documentation found here:
You can create multi-column indexes. The first column mentioned in the index declaration can use the index when searched alone, but the other columns cannot.
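Applied to the two queries in the question, that leftmost-column rule could look like the sketch below. The column types and the index name are assumptions; the table and column names come from the question.

```sql
-- Hypothetical column types for the question's employees table.
CREATE TABLE employees (
  id       INT AUTO_INCREMENT PRIMARY KEY,
  name     VARCHAR(100),
  province CHAR(2),
  number   VARCHAR(20)
);

-- One composite index with province as the leftmost column.
CREATE INDEX idx_province_number ON employees (province, number);

-- Both of the question's queries can use idx_province_number:
EXPLAIN SELECT * FROM employees WHERE province = 'ab' AND number = 'v45g';
EXPLAIN SELECT * FROM employees WHERE province = 'ab';

-- A search on number alone skips the leftmost column,
-- so it generally cannot do an efficient lookup in this index:
EXPLAIN SELECT * FROM employees WHERE number = 'v45g';
```

If searches on number alone matter, a separate index on number (or reversing the column order) would be the thing to test.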
The documentation also says that if you store a hash of the columns in another column and index that hash column, the search could be faster than with a "wide" index on many columns.
SELECT * FROM tbl_name
WHERE hash_col=MD5(CONCAT(val1,val2))
AND col1=val1 AND col2=val2;
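A minimal sketch of how such a hash column might be set up, assuming the application keeps hash_col up to date on every write (the column types, index name, and sample values are assumptions):

```sql
-- Hypothetical table; tbl_name, col1, col2 and hash_col come from the query above.
CREATE TABLE tbl_name (
  col1     VARCHAR(255),
  col2     VARCHAR(255),
  hash_col CHAR(32),            -- MD5 output is 32 hex characters
  KEY idx_hash (hash_col)
);

INSERT INTO tbl_name (col1, col2) VALUES ('alpha', 'beta');

-- Fill in the hash for existing rows; new rows need the same treatment.
UPDATE tbl_name SET hash_col = MD5(CONCAT(col1, col2));

-- Look up by the short indexed hash, then re-check the real columns.
SELECT * FROM tbl_name
WHERE hash_col = MD5(CONCAT('alpha', 'beta'))
  AND col1 = 'alpha' AND col2 = 'beta';
```

The extra col1 and col2 conditions matter because different inputs can produce the same hash; the hash narrows the search and the plain comparisons confirm the match.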
You could use a unique index on province.

Is indexing link tables smart?

So let's say for example I have 2 tables: Users > Items
Users can have favorite Items, and an Item can have multiple users that see it as a favorite, so I'll be using a linking table.
Now my linking table would contain something like:
id (int 11 AI)
user_id (int 11)
item_id (int 11)
Now, would it be necessary / useful to put an index on user_id and item_id, since this table will contain a lot of records over time?
I'm not 100% sure when to use indexes. My idea of when to use them (might be completely incorrect though) is when you have a big database and need to search/filter on a column, then you index it. If this is incorrect I'm sorry, it's just what I've always been told.
Basically, yes, that's how it goes.
In this case, I'd say that an index on the user_id column would be useful, because you will display to the user a list of their favorites, right?
An index on the item_id might be less useful, because I doubt you're going to display a list of users that have favorited a specific item. Although you might care about the count ("100 users like this item"), so you might add that index after all. Or you might de-normalize and keep the count in the items table. That would give a better performance, although you'll need to write extra code to maintain that number.
Last but not least - in a link table, you can do away with the id column. Just add the primary key index on both columns (user_id and item_id in that order). This will make sure that you cannot enter duplicate rows, and since user_id is the first column in the index, you'll be able to use it in search queries. No need anymore to add a separate index on just the user_id column.
However this also depends on the code you're using. If you're using some kind of framework (ORM?) that REQUIRES an id column for every table, then this trick is useless.
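A sketch of the link table without a surrogate id, under the assumption that no framework requires one (the table name user_favorites and the index name are hypothetical):

```sql
-- Hypothetical link table: the pair itself is the primary key.
CREATE TABLE user_favorites (
  user_id INT NOT NULL,
  item_id INT NOT NULL,
  PRIMARY KEY (user_id, item_id),  -- blocks duplicates, serves lookups by user_id
  KEY idx_item (item_id)           -- optional: serves "who favorited this item"
);

-- Uses the primary key, since user_id is its leftmost column.
SELECT item_id FROM user_favorites WHERE user_id = 42;

-- Uses idx_item, for example for a "100 users like this item" count.
SELECT COUNT(*) FROM user_favorites WHERE item_id = 7;
```

The idx_item index can be left out at first and added later if the per-item queries turn out to be needed.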
As requested by the author, here's a quick intro on what indexes are.
Suppose you have a DB table which is just a bunch of rows in no particular order. Let's say we have a table people with the columns name, surname, age.
Now, when you want to find the age for John Smith you probably make a query like this:
select age from people where name='John' and surname='Smith'
When you do this, the DB engine can do only one thing - it has to go through ALL the rows and look for the ones that match. If there's 100,000 rows, it will be slow.
Now there's a faster way of doing this. Think about a phonebook (the classical paper edition). On its thousand yellow pages there are phone numbers for hundreds of people. Yet you can find the number you seek very quickly even if you're a human being. That's because the numbers are sorted alphabetically by name and surname. You open a random page and you can immediately see whether the number you're looking for is before or after the page you opened. Repeat a couple of times and you've found it.
This kind of searching is called a "binary search". Your DB engine could do this too, if the records were sorted by name and surname. So this is what a Primary Key is - it tells the DB to store the records not in some random order, but sorted by some columns. When a new record comes, it can quickly find its rightful place and push it in there, thus keeping the table forever sorted.
There are a few things to note here already.
First, you can make it sort by one or more columns, but, just like in a phonebook, the order is important. If you sort by name first and then by surname, then that's the order the records will be in. So you'll be able to quickly find all the records where name='John' or name='John' and surname='Smith', but it won't help you at all if you need to find just surname='Smith'. Just like in a phonebook.
Second, pushing a record somewhere in the middle is also somewhat slow. Not criminally so, but still. Appending a record at the end is faster. Therefore people tend to use auto_increment columns for their Primary Keys, because then every new row will be placed at the end.
Third, in most DBs the Primary Key is used not only to search quickly, but also to uniquely identify the row. This means that the DB will not be happy if there are two rows that have equal values for the Primary Key columns. In that case, it cannot determine which has to go first and which last, and the key is also not unique. Another reason to use auto_increment. Note that if the PK index has multiple columns in it, then their combination must be unique; each column individually may be non-unique. In our case that means that there can be many Johns and many Smiths, but only one John Smith.
But we still have a problem. What if we want to quickly find rows both by just the name, and just the surname? A PK index can only do one of those things, not both at the same time.
This is where other non-PK indexes come into play. You can add as many of those as you want to the table. In our case, we could create another index to hold just the surname column.
When we do so, the DB creates another hidden table (OK, not true, but you can think of it this way) which is a copy of the original table, but only with the surname column and a special link back to the rows in the original table. This hidden index table is sorted by the surname column. So when you now need to find a row by specifying just the surname, the DB engine can look it up in the hidden index table, and then follow the links back to the original rows and get the data from them. Much faster.
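In SQL terms, the people example might be sketched like this (the column types and the index name are assumptions):

```sql
-- The example table, kept sorted by (name, surname) via its primary key.
CREATE TABLE people (
  name    VARCHAR(50) NOT NULL,
  surname VARCHAR(50) NOT NULL,
  age     INT,
  PRIMARY KEY (name, surname)
);

-- The extra "hidden table": sorted by surname, linking back to the rows.
CREATE INDEX idx_surname ON people (surname);

-- Can use the primary key (name is its leftmost column).
SELECT age FROM people WHERE name = 'John' AND surname = 'Smith';

-- Can use idx_surname instead of scanning every row.
SELECT age FROM people WHERE surname = 'Smith';
```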
These non-PK indexes also typically come in a few flavors. There's the standard "index" which places no restrictions at all - you can have duplicate values in the columns, nulls, etc. There's a "unique" index, which enforces that all the values in the index need to be unique; and then there are sometimes speciality indexes like FullText, Spatial, etc. Indexes also tend to have some technical options, but you'll have to read the documentation of your DB for those.
One last important thing to note is - indexes make it fast to find things in a table, but they come at a cost. Modifications to the table (insert, update, delete) become slower, because the indexes need to be updated as well. Keep that in mind and only add them where necessary.
Except for Primary Keys. ALWAYS add Primary Keys. That's an order! :)
In short, yes.
Imagine how well joins would work if, each time you needed to match a primary key value to a foreign key in another table, the DBMS had to search the entire table for the matching keys.

MySQL Optimization When Not All Columns Are Indexed

Say I have a table with 3 columns and thousands of records like this:
id # primary key
name # indexed
gender # not indexed
And I want to find "All males named Alex", i.e., a specific name and specific gender.
Is the naive way (select * from people where name='alex' and gender=2) good enough here? Or is there a more optimal way, like a sub-query on name?
Assuming that you don't have thousands of records matching the name, with only a few actually being males, the index on name is enough. Generally you should not index fields with low cardinality (only 2 possible values means that you are going to match about 50% of the rows, which does not justify using an index).
The only useful exception I can think of is if you are selecting name and gender only: if you put both of them in the index, you can perform an index-covered query, which is faster than selecting rows by index and then retrieving the data from the table.
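A sketch of that index-covered case, with assumed column types and a hypothetical index name:

```sql
-- Hypothetical definition of the question's three-column table.
CREATE TABLE people (
  id     INT AUTO_INCREMENT PRIMARY KEY,
  name   VARCHAR(100),
  gender TINYINT
);

-- Both selected columns live in the index.
CREATE INDEX idx_name_gender ON people (name, gender);

-- Only name and gender are selected, so the index alone can answer this.
EXPLAIN
SELECT name, gender FROM people
WHERE name = 'alex' AND gender = 2;
```

When the query is covered, the Extra column of the EXPLAIN output shows "Using index". With select * the engine still has to visit the table rows for any columns that are not in the index.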
If creating an index is not an option, or you have a large volume of data in the table (or even if there is an index, but you still want to quicken the pace) it can often have a big impact to reorder the table according to the data you are grouping together.
I have a query at work for getting KPIs together for my division and even though everything was nicely indexed, the data that was being pulled was still searching through a couple of gigs of table. This means a LOT of disc accessing while the query aggregates all the correct rows together. I reordered the table using alter table tableName order by column1, column2; and the query went from taking around 15 seconds to returning data in under 3. So the physical gathering of the data can be a significant influence - even if the tables are indexed and the DB knows exactly where to get it. Arranging the data so it is easier for the database to get to everything it needs will improve performance.
A better way is to have a composite index.
i.e.
CREATE INDEX <some name for the index> ON <table name> (name, gender)
Then the WHERE clause can use it for both the name and the gender.

MySQL index question

I've been reading about indexes in MySQL recently, and some of the principles are quite straightforward but one concept is still bugging me: basically, if in a hypothetical table with, let's say, 10 columns, we have two single-column indexes (for column01 and column02 respectively), plus a primary key column (some other column), then are they going to be used in a simple SELECT query like this one or not:
SELECT * FROM table WHERE column01 = 'aaa' AND column02 = 'bbb'
Looking at it, my first instinct is telling me that the first index is going to retrieve a set of rows (or primary keys in InnoDB, if I got the idea right) that satisfy the first condition, and the second index will get another set. And the final result set will be just the intersection of these two. In the books that I've been going through I cannot find anything about this particular scenario. Of course, for this particular query one index on both columns seems like the best option, but I am struggling with understanding the real process behind this whole thing if I try to use two indexes that I described above.
It's typically only going to use a single index (unless the optimizer chooses an index merge). You need to create a composite index of multiple columns if you want it to be able to index off of each column you are testing. You may want to read the manual to find out how MySQL uses each type of index, and how to order your composite indexes correctly to get the best utilization of them.
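For the query in the question, the composite index might be sketched as follows. The table name my_table stands in for the question's table, and the column types and index name are assumptions.

```sql
-- Hypothetical stand-in for the question's 10-column table (only 3 shown).
CREATE TABLE my_table (
  id       INT AUTO_INCREMENT PRIMARY KEY,
  column01 VARCHAR(50),
  column02 VARCHAR(50)
);

-- One index covering both equality conditions.
CREATE INDEX idx_col01_col02 ON my_table (column01, column02);

-- Check which index the optimizer actually picks.
EXPLAIN
SELECT * FROM my_table
WHERE column01 = 'aaa' AND column02 = 'bbb';
```

Comparing this EXPLAIN output with the one you get from two single-column indexes is the most direct way to see what MySQL does on your own data.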
It's actually the most common question about indexing at all: is it better to have one index with all columns or one individual index for every column?
http://use-the-index-luke.com/sql/where-clause/searching-for-ranges/index-combine-performance