MySQL Optimization When Not All Columns Are Indexed - mysql

Say I have a table with 3 columns and thousands of records like this:
id # primary key
name # indexed
gender # not indexed
And I want to find "All males named Alex", i.e., a specific name and specific gender.
Is the naive way (select * from people where name='alex' and gender=2) good enough here? Or is there a more optimal way, like a sub-query on name?

Assuming you don't have thousands of records matching the name, with only a few of them actually male, the index on name is enough. Generally you should not index fields with low cardinality (with only 2 possible values you are going to match about 50% of the rows, which does not justify using an index).
The only useful exception I can think of is when you select name and gender only: if you put both of them in the index, you can perform an index-covered query, which is faster than selecting rows by index and then retrieving the data from the table.

If creating an index is not an option, or you have a large volume of data in the table (or even if there is an index, but you still want to quicken the pace) it can often have a big impact to reorder the table according to the data you are grouping together.
I have a query at work for getting KPIs together for my division, and even though everything was nicely indexed, the query was still searching through a couple of gigs of table data. This means a LOT of disk access while the query aggregates all the correct rows together. I reordered the table using alter table tableName order by column1, column2; and the query went from taking around 15 seconds to returning data in under 3. So the physical location of the data can be a significant influence, even if the tables are indexed and the DB knows exactly where to get it. Arranging the data so it is easier for the database to get to everything it needs will improve performance.

A better way is to have a composite index.
i.e.
CREATE INDEX <some name for the index> ON <table name> (name, gender)
Then the WHERE clause can use it for both the name and the gender.
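As a rough sketch of the composite index described above, assuming the table is called people (as in the question's query) and using a hypothetical index name idx_name_gender:

```sql
-- Assumed table name: people (taken from the question's query).
-- idx_name_gender is a hypothetical index name.
CREATE INDEX idx_name_gender ON people (name, gender);

-- Both conditions can be resolved through the one composite index.
EXPLAIN SELECT * FROM people WHERE name = 'alex' AND gender = 2;

-- Selecting only indexed columns lets MySQL answer from the index alone.
EXPLAIN SELECT name, gender FROM people WHERE name = 'alex' AND gender = 2;
```

In the EXPLAIN output, the key column shows which index the optimizer picked. For the second query the Extra column would typically show "Using index" when the index covers the query.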

Related

MySQL: hash index vs. table join

I have a pretty big MySQL table (more than 10 million rows, InnoDB engine). The table has a field that indicates a row's category (varchar(40)), and there are fewer than 10 categories.
Now I have two choices:
keep the field and make a hash index on it.
make the field into another category table, and link them with a category_id
Which one performs better, and why, for these two operations:
Query for all categories (I know a separate table could be faster, but is it really a lot faster, even compared to a hash index?)
Query for all rows in a specified category (I assume the hash index should be faster, but I'm not sure, because someone told me the MySQL optimizer makes a join with a small table much faster)
EDIT : I almost never add new categories here.
You can define an index on your category column, and it will make some queries for a specific category much faster (assuming the category you search for doesn't occur in a majority of rows). An index on a varchar works well in this way.
The reason you might create a lookup table for the category name is that if you want to change a category name, you can do that by changing one row in the category lookup table, instead of potentially many thousands of rows in the main table.
By the way, your use of the phrase "hash index" is misplaced. InnoDB does not support hash indexes, only B-tree indexes and fulltext indexes.
Consider that for any DB it is faster to compare a number (integer) than a string. I believe you will get the fastest result if you create a cross-reference (X-REF) table, as you mentioned, that maps each category string to a numeric ID, store that ID in the big table's records, and index that field.
As stated, you gain performance by letting your DB compare 10M numbers instead of 10M strings.
Also, as Bill Karwin suggests, this will allow you to change/add categories in the most flexible way.
Last, if you don't expect the total number of categories to grow above, say, 2000, you can even make the indexed field of the big table just a two-byte integer.
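A minimal sketch of that lookup-table layout. The original post gives no schema, so the table, column, and index names below (category, big_table, idx_category) are all hypothetical:

```sql
-- Hypothetical names throughout; adapt to the real schema.
CREATE TABLE category (
  category_id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,  -- two-byte integer
  name        VARCHAR(40) NOT NULL UNIQUE
) ENGINE=InnoDB;

CREATE TABLE big_table (
  id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  category_id SMALLINT UNSIGNED NOT NULL,
  -- ... other columns ...
  INDEX idx_category (category_id),
  FOREIGN KEY (category_id) REFERENCES category (category_id)
) ENGINE=InnoDB;

-- Query for all categories: reads only the small lookup table.
SELECT name FROM category;

-- Query for all rows in one category: join through the indexed integer column.
SELECT b.*
FROM big_table AS b
JOIN category AS c ON c.category_id = b.category_id
WHERE c.name = 'some category';
```

Renaming a category then touches a single row in category rather than every matching row in big_table.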

Indexing mysql table for selects

I'm looking to add some mysql indexes to a database table. Most of the queries are for selects. Would it be best to create a separate index for each column, or add an index for province, province and number, and name. Does it even make sense to index province since there are only about a dozen options?
select * from employees where province = 'ab' and number = 'v45g';
select * from employees where province = 'ab';
If the usage changed to more inserts should I remove all the indexes except for the number?
An index is a data structure that maps the values of a column into a fast searchable tree. This tree contains the index of rows which the DB can use to find rows fast. One thing to know, some DB engines read plus or minus a bunch of rows to take advantage of disk read ahead. So you may actually read 50 or 100 rows per index read, and not just one. Hence, if you access 30% of a table through an index, you may wind up reading all table data multiple times.
Rule of thumb:
- index the columns with more unique values; a tree with 2 branches and half of your table on either side is not too useful for narrowing down a search
- use as few indexes as possible
- use real-world example numbers as much as possible. Performance can change dynamically based on the data or the whim of the DB engine, so it's very important to track how fast your queries are running consistently (i.e. log this in case a query ever gets slow). From this data you can add indexes without working blind
Okay, so there are multiple kinds of index, single and multiple column. You want multiple indexes when it makes sense for indexes to access each other, multiple columns typically when you are refining with a where clause. Think of the first as good when you want joins, or you have "or" conditions. The second is better when you have and conditions and successively filter rows.
In your case name does not make sense since like does not use index. city and number do make sense, probably as a multi-column index. Province could help as well as the last index.
So an index with these columns would likely help:
(number,city,province)
Or try as well just:
(number,city)
You should index fields that are searched upon and have high selectivity / cardinality. Indexes make writes slower.
Another thing is that indexes can be added and dropped at any time, so maybe you should leave this for a later review of the database and optimization of queries.
That being said one index that you can be sure to add is in the column that holds the name of a person. That's almost always used in searching.
According to MySQL documentation found here:
You can create multiple-column indexes. A search on the first column named in the index declaration can use the index when that column is searched alone, but a search on any of the other columns alone cannot.
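To illustrate that leftmost-column rule with the employees table from the question (idx_province_number is a hypothetical index name, and this is only a sketch of the general behaviour):

```sql
-- idx_province_number is a hypothetical index name.
CREATE INDEX idx_province_number ON employees (province, number);

-- Can use the index: both columns, starting from the leftmost one.
EXPLAIN SELECT * FROM employees WHERE province = 'ab' AND number = 'v45g';

-- Can use the index: the leftmost column on its own.
EXPLAIN SELECT * FROM employees WHERE province = 'ab';

-- Generally cannot seek in the index: the leftmost column is missing.
EXPLAIN SELECT * FROM employees WHERE number = 'v45g';
```

Comparing the key column of the three EXPLAIN results is a quick way to check what the optimizer actually does on a given server version.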
The documentation also says that if you store a hash of the columns in another column and index the hashed column, the search could be faster than with a wide multiple-column index.
SELECT * FROM tbl_name
WHERE hash_col=MD5(CONCAT(val1,val2))
AND col1=val1 AND col2=val2;
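One way the hash column for the query above might be set up, as a sketch only. The names tbl_name, hash_col, col1 and col2 follow the quoted query, while the index name idx_hash_col and the CHAR(32) type are assumptions:

```sql
-- MD5() returns a 32-character hex string, hence CHAR(32).
-- idx_hash_col is a hypothetical index name.
ALTER TABLE tbl_name
  ADD COLUMN hash_col CHAR(32),
  ADD INDEX idx_hash_col (hash_col);

-- Fill the hash for existing rows.
-- Note: CONCAT() returns NULL if either column is NULL.
UPDATE tbl_name
SET hash_col = MD5(CONCAT(col1, col2));
```

The application (or a trigger) would have to keep hash_col in sync whenever col1 or col2 changes.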
You could use a unique index on province.

How 'and' and 'or' work in SQL

Imagine I have a database for a large website which has a table called 'users' that has a large number of records. When I execute a query such as SELECT * FROM users WHERE username='John' my understanding is that (ignoring caching etc.) the database would navigate the index and find the user(s) named John. Imagine this query returns 1 million results and I am only interested in users called John who are 25 years old, so I perform another query: SELECT * FROM users WHERE username='John' AND age=25
How does this work? Does it loop through all the users named John and find only those whose age matches 25, or is there a better way of doing it? I assume this is database and storage engine specific, so we can assume I am using MySQL with InnoDB.
The answer is -- you're not supposed to ask this question. In a declarative language like SQL you describe the result desired and the processing engine determines the optimal way to produce the result. It may take different paths to get to the result depending on seemingly minor differences in the request, or the method used may change from version to version of the product, or even based on some factor completely unrelated to the product (available memory or disk space, for instance).
That said, the following is true of most SQL databases in most cases:
The database will use only one index in evaluating a WHERE clause.
If more than one index could be used to evaluate the WHERE clause the database will use statistics about the cardinality (distribution of values) in each index to select the "best" one.
If there is an index built from more than one column, and the head column(s) of that index are present in the filter conditions of the WHERE clause, that index can possibly be used to filter by multiple columns in a single index.
So, in your example, most databases would use indexes on either age or name to do the first-level filtering, then scan the resulting records to do the second level of filtering. The only exception would be if you had a compound index on (name, age) or (age, name) in which case only an index scan would be needed to find the records.
Assuming you have indexes on both columns, it generally examines the statistics of the data itself to choose an option that reduces the cardinality of the result set as quickly as possible.
For example, if 20% of people are aged 25 but only 3% are called John, it will get the Johns first then strip out those who are not aged 25.
If you have a composite key made up of both columns, then that should be even faster, since there's no "stripping" involved at all.
Bottom line, it comes down to the DB engine understanding the makeup of the data and choosing the best execution plan based on that. That's why it's often good to re-calculate statistics periodically, as the data may change.
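In MySQL, recalculating and inspecting those statistics can be sketched like this, using the users table from the question:

```sql
-- Re-sample the index statistics the optimizer relies on.
ANALYZE TABLE users;

-- The Cardinality column shows the estimated number of distinct values per index.
SHOW INDEX FROM users;
```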
If you have a query like this:
SELECT *
FROM users
WHERE username = 'John' AND age = 25;
Then the optimal index is users(username, age) or users(age, username). With this index, the matching records can be found just by looking them up in the index.
As for what happens if you only have an index on username: it would typically look up the rows with "John" in the username column, then fetch those records from the data pages and continue the filtering based on the data on the pages.
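The two situations can be compared with EXPLAIN. This is a sketch in which both index names are hypothetical:

```sql
-- Case 1: index on username only (hypothetical name idx_username).
CREATE INDEX idx_username ON users (username);
EXPLAIN SELECT * FROM users WHERE username = 'John' AND age = 25;

-- Case 2: composite index covering both filter columns (hypothetical name).
CREATE INDEX idx_username_age ON users (username, age);
EXPLAIN SELECT * FROM users WHERE username = 'John' AND age = 25;
```

The rows estimate in the EXPLAIN output gives a feel for how many index entries each plan expects to examine before the age filter is satisfied.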

Correct indexing when using OR operator

I have a query like this:
SELECT fields FROM table
WHERE field1='something' OR field2='something'
OR field3='something' OR field4='something'
What would be the correct way to index such a table for this query?
A query like this takes an entire second to run! I have 1 index with all 4 of those fields in it, so I'd think MySQL would do something like this:
Go through each row in the index thinking this:
Is field1 something? How about field2? field3? field4? Ok, nope, go to the next row.
You misunderstand how indexes work.
Think of a telephone book (the equivalent of a two-column index on last name first, first name last). If I ask you to find all people in the telephone book whose last name is "Smith," you can benefit from the fact that the names are ordered that way; you can assume that the Smiths are organized together. But if I ask you to find all the people whose first name is "John" you get no benefit from the index. Johns can have any last name, and so they are scattered throughout the book and you end up having to search the hard way, from cover to cover.
Now if I ask you to find all people whose last name is "Smith" OR whose first name is "John", you can find the Smiths easily as before, but that doesn't help you at all to find the Johns. They're still scattered throughout the book and you have to search for them the hard way.
It's the same with multi-column indexes in SQL. The index is sorted by the first column, then sorted by the second column in cases of ties in the first column, then sorted by the third column in cases of ties in both the first two columns, etc. It is not sorted by all columns simultaneously. So your multi-column index doesn't help to make your search terms more efficient, except for the left-most column in the index.
Back to your original question.
What would be the correct way to index such a table for this query?
Create a separate, single-column index on each column. One of these indexes will be a better choice than the others, based on MySQL's estimation of how many I/O operations the index will incur if it is used.
Modern versions of MySQL also have some smarts about index merging, so the query may use more than one index in a given table, and then try to merge the results. Otherwise MySQL tends to be limited to use one index per table in a given query.
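A sketch of the single-column indexes, with my_table standing in for the question's table (table itself is a reserved word in MySQL) and hypothetical index names:

```sql
-- my_table and the idx_* names are placeholders, not from the original post.
CREATE INDEX idx_field1 ON my_table (field1);
CREATE INDEX idx_field2 ON my_table (field2);
CREATE INDEX idx_field3 ON my_table (field3);
CREATE INDEX idx_field4 ON my_table (field4);

-- Check whether the optimizer picks one index or merges several.
EXPLAIN SELECT field1, field2, field3, field4
FROM my_table
WHERE field1 = 'something' OR field2 = 'something'
   OR field3 = 'something' OR field4 = 'something';
```

If index merging is chosen, the type column of the EXPLAIN output shows index_merge and the key column lists the indexes involved.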
Another trick that a lot of people use successfully is to do a separate query for each of your indexed columns (which should use the respective index) and then UNION the results.
SELECT fields FROM table WHERE field1='something'
UNION
SELECT fields FROM table WHERE field2='something'
UNION
SELECT fields FROM table WHERE field3='something'
UNION
SELECT fields FROM table WHERE field4='something'
One final observation: if you find yourself searching for the same 'something' across four fields, you should reconsider if all four fields are actually the same thing, and you're guilty of designing a table that violates First Normal form with repeating groups. If so, perhaps field1 through field4 belong in a single column in a child table. Then it becomes a lot easier to index and query:
SELECT fields from table INNER JOIN child_table ON table.pk = child_table.fk
WHERE child_table.field = 'something'
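A possible shape for that child table, as a sketch. The column names fk and field follow the query above, while the data types and the index name idx_field are assumptions:

```sql
-- Data types and idx_field are assumptions; fk would reference the parent table's pk.
CREATE TABLE child_table (
  fk    INT NOT NULL,
  field VARCHAR(100) NOT NULL,
  PRIMARY KEY (fk, field),   -- one row per (parent row, value) pair
  INDEX idx_field (field)    -- supports WHERE child_table.field = '...'
);
```

A single index on field then serves the search that previously needed four separate OR conditions.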
In addition to the previous answer:
Some RDBMSs like MySQL and PostgreSQL can use an index merge if the optimizer thinks it's a good idea.
So you can create a separate index for each field, or create composite indexes like (field1, field2) and (field3, field4). Finally, you should try several different solutions and choose the one with the best EXPLAIN plan.

mySQL (and MSSQL), using both indexed and non-indexed columns in where clause

The database I use is currently MySQL, but it may later be MSSQL.
My question is about how MySQL and MSSQL handle indexed and non-indexed columns.
Let's say I have a simple table like this:
*table_ID - auto-increment, just an ID, indexed.
*table_user_ID - every user has a unique ID, indexed.
*table_someOtherID - some data...
*....
Let's say that I have A LOT of rows in this table, but the number of rows that every user adds to this table is very small (10-100).
And I want to find one or a few specific rows in this table: a row or rows from a specific user (indexed column).
If I use the following WHERE clause:
..... WHERE table_user_ID= 'someID' AND table_someOtherID='anotherValue'.
Will the database first search for the indexed columns, and then search for the "anotherValue" inside of those rows, or how does the database handle this?
I guess the database will grow a lot if I have to index every column in all tables.
But what do you think: is it enough to index the columns that will reduce the number of rows to just ten or maybe a hundred?
Database optimizers generally work on a cost basis, looking at all the possible indexes they could use for the query. In your specific case the optimizer will see 2 columns: table_user_ID with an index and someOtherID without an index. If you really only have 10-100 rows per userID then the cost of this index will be very low and it will be used. This is because the cardinality is high and the DB can read only the few rows it needs, without touching the rows of every other user it's not interested in. However, if the cost of using the index is very high (very few unique userIDs and many entries per user), it might actually be more efficient to skip the index and scan the whole table, to avoid random seeks as it jumps around the table grabbing rows based on the index.
Once it picks the index, the DB just grabs the rows that match that index (10 to 100 in your case) and tries to match them against your other criteria, searching for rows where someOtherID='anotherValue'.
But the number of rows that every user adds to this table is very small (10-100)
You only need to index the user_id. It should give you good performance regardless of your query, as long as it includes the user_id in the filter. Until you have identified other use cases, it will pretty much work as you state.
Will the database first search for the indexed columns, and then search for the "anotherValue" inside of those rows, or how does the database handle this?
Yes (in layman terms that is close).
In regards to SQL Server:
The order of the columns in an index is important, depending on how you query and how the indexes are structured. If you create an index on the columns
-table_user_id
-table_someotherID
The index is ordered by the table_user_id first. Example:
1-2
1-5
1-6
2-3
2-5
2-6
In the first entry of the index, 1 is the table_user_id and 2 is the table_someotherID value.
If you run a query with a where on table_user_id = blah, it will be very fast to use this index, since the table_user_id are indexed in order.
But if you run a query that only uses table_someotherID in the WHERE clause, it might not even use this index, because instead of doing a quick seek in the index for the matching value, it would have to scan the whole index (which is less efficient than a seek).
Also, SQL Server has an INCLUDE feature that attaches the columns you want in the SELECT clause to the index you create on the WHERE or JOIN columns.
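A sketch of such an index in T-SQL, using the column names from the question. The table name my_table and the index name ix_user_id are hypothetical:

```sql
-- SQL Server (T-SQL). my_table and ix_user_id are placeholder names.
-- The key column is used for seeking; the included column is stored
-- at the leaf level so it can be returned without a lookup.
CREATE NONCLUSTERED INDEX ix_user_id
ON my_table (table_user_ID)
INCLUDE (table_someOtherID);
```

Included columns are not part of the index key, so they do not affect the sort order of the index.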
So to answer your question, it all depends on how you create the indexes and how you query them. You're right not to index every column, as indexes take up storage and carry a performance hit when you do inserts and updates on the table.