Imagine I have a database for a large website which has a table called 'users' that has a large number of records. When I execute a query such as SELECT * FROM users WHERE username='John' my understanding is that (ignoring caching etc.) the database would navigate the index and find the user(s) named John. Imagine this query returns 1 million results and I am only interested in users called John who are 25 years old, so I perform another query: SELECT * FROM users WHERE username='John' AND age=25
How does this work? Does it loop through all the users named John and keep only those whose age is 25, or is there a better way of doing it? I assume this is database and storage engine specific, so assume I am using MySQL with InnoDB.
The answer is -- you're not supposed to ask this question. In a declarative language like SQL you describe the result desired and the processing engine determines the optimal way to produce the result. It may take different paths to get to the result depending on seemingly minor differences in the request, or the method used may change from version to version of the product, or even based on some factor completely unrelated to the product (available memory or disk space, for instance).
That said, the following is true of most SQL databases in most cases:
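The database will usually use only one index per table when evaluating a WHERE clause. (MySQL can sometimes combine several indexes through its index merge optimization, but a single index is the common case.)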
If more than one index could be used to evaluate the WHERE clause the database will use statistics about the cardinality (distribution of values) in each index to select the "best" one.
If there is an index built from more than one column, and the head column(s) of that index are present in the filter conditions of the WHERE clause, that index can possibly be used to filter by multiple columns in a single index.
So, in your example, most databases would use an index on either age or username to do the first-level filtering, then scan the resulting records to do the second level of filtering. The only exception would be if you had a compound index on (username, age) or (age, username), in which case only an index scan would be needed to find the records.
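The difference between the two cases can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: SQLite stands in for MySQL/InnoDB, the index names and the email column are made up, and MySQL's own EXPLAIN output looks different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT, age INTEGER, email TEXT)"
)
conn.executemany(
    "INSERT INTO users (username, age, email) VALUES (?, ?, ?)",
    [("John" if i % 3 == 0 else f"user{i}", 20 + i % 30, f"u{i}@example.com")
     for i in range(3000)],
)

QUERY = "SELECT * FROM users WHERE username = 'John' AND age = 25"


def plan(sql):
    """Return the optimizer's plan for sql as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


# Single-column index only: the engine narrows by username, then checks age row by row.
conn.execute("CREATE INDEX idx_username ON users (username)")
plan_single = plan(QUERY)

# Compound index added: both conditions can be resolved inside one index.
conn.execute("CREATE INDEX idx_username_age ON users (username, age)")
plan_compound = plan(QUERY)

print(plan_single)
print(plan_compound)
```

With only the single-column index available, the plan names idx_username; once the compound index exists, the optimizer switches to it.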
Assuming you have indexes on both columns, it generally examines the statistics of the data itself to choose an option that reduces the cardinality of the result set as quickly as possible.
For example, if 20% of people are aged 25 but only 3% are called John, it will get the Johns first then strip out those who are not aged 25.
If you have a composite key made up of both columns, then that should be even faster, since there's no "stripping" involved at all.
Bottom line, it comes down to the DB engine understanding the makeup of the data and choosing the best execution plan based on that. That's why it's often good to re-calculate statistics periodically, as the data may change.
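As a rough illustration of what "statistics" means here, the sketch below uses Python's sqlite3 module. It is an assumption-laden stand-in: SQLite's ANALYZE and its sqlite_stat1 table play the role that ANALYZE TABLE and the index statistics play in MySQL, and the table contents are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT, age INTEGER)")
conn.execute("CREATE INDEX idx_username ON users (username)")
conn.execute("CREATE INDEX idx_age ON users (age)")
# 500 distinct usernames but only 5 distinct ages.
conn.executemany(
    "INSERT INTO users (username, age) VALUES (?, ?)",
    [(f"user{i % 500}", 20 + i % 5) for i in range(5000)],
)

# Recompute optimizer statistics (MySQL's counterpart is ANALYZE TABLE users).
conn.execute("ANALYZE")

# sqlite_stat1 stores, per index, the row count followed by the average
# number of rows per distinct value.
stats = dict(conn.execute("SELECT idx, stat FROM sqlite_stat1 WHERE tbl = 'users'"))
for index_name, stat in sorted(stats.items()):
    print(index_name, stat)
```

The second number in each stat string is much larger for idx_age than for idx_username, which is the kind of information the optimizer uses to prefer the more selective index.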
If you have a query like this:
SELECT *
FROM users
WHERE username = 'John' AND age = 25;
Then the optimal index is users(username, age) or users(age, username). With this index, the matching records can be found just by looking them up in the index.
If you only have an index on username, the database would typically look up the rows with "John" in the username column. It would then fetch those records from the data pages and apply the remaining filter (age) to the data on the pages.
I'm looking to add some MySQL indexes to a database table. Most of the queries are selects. Would it be best to create a separate index for each column, or to add an index for province, one for province and number, and one for name? Does it even make sense to index province, since there are only about a dozen options?
select * from employees where province = 'ab' and number = 'v45g';
select * from employees where province = 'ab';
If the usage changed to more inserts should I remove all the indexes except for the number?
An index is a data structure that maps the values of a column into a fast searchable tree. This tree contains the index of rows which the DB can use to find rows fast. One thing to know, some DB engines read plus or minus a bunch of rows to take advantage of disk read ahead. So you may actually read 50 or 100 rows per index read, and not just one. Hence, if you access 30% of a table through an index, you may wind up reading all table data multiple times.
Rule of thumb:
- index the columns with more unique values; a tree with 2 branches and half of your table on either side is not much use for narrowing down a search
- use as few indexes as possible
- use real-world numbers as much as possible. Performance can change with the data or at the whim of the DB engine, so it's very important to track how fast your queries run over time (i.e. log this in case a query ever gets slow). With that data you can add indexes without working blind
Okay, so there are two kinds of index: single-column and multi-column. Several single-column indexes make sense when the columns are used independently, typically in joins or in "or" conditions. A multi-column index is better when you have "and" conditions in a where clause and filter rows successively.
In your case an index on name does not make sense, since a LIKE search with a leading wildcard cannot use an index. city and number do make sense, probably as a multi-column index. Province could help as well, as the last column of the index.
So an index with these columns would likely help:
(number,city,province)
Or try as well just:
(number,city)
You should index fields that are searched upon and have high selectivity / cardinality. Indexes make writes slower.
Another thing is that indexes can be added and dropped at any time, so you could leave this for a later review and optimization of the database and its queries.
That being said one index that you can be sure to add is in the column that holds the name of a person. That's almost always used in searching.
According to the MySQL documentation:
You can create multiple-column indexes. A search on the first column named in the index declaration can use the index on its own, but a search on only the other columns cannot.
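The leftmost-column rule can be sketched with Python's sqlite3 module. Treat this as an illustration under assumptions: SQLite stands in for MySQL, the employees columns are taken from the question, and the index name is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, province TEXT, number TEXT, name TEXT)"
)
conn.execute("CREATE INDEX idx_province_number ON employees (province, number)")


def plan(sql):
    """Return the optimizer's plan for sql as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


# First column alone: the index can be used.
plan_first_col = plan("SELECT * FROM employees WHERE province = 'ab'")
# Second column alone: the index does not help, so the table is scanned.
plan_second_col = plan("SELECT * FROM employees WHERE number = 'v45g'")
# Both columns: the index is used for both conditions.
plan_both = plan("SELECT * FROM employees WHERE province = 'ab' AND number = 'v45g'")

for label, text in [("province", plan_first_col),
                    ("number", plan_second_col),
                    ("both", plan_both)]:
    print(f"{label}: {text}")
```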
The documentation also says that if you store a hash of the columns in another column and index that hash column, the search could be faster than with a wide index over many columns.
SELECT * FROM tbl_name
WHERE hash_col=MD5(CONCAT(val1,val2))
AND col1=val1 AND col2=val2;
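A minimal sketch of the hash-column idea, using Python's hashlib and sqlite3 as stand-ins for MySQL's MD5() and CONCAT(). The sample values and the helper name pair_hash are assumptions for illustration. It also shows why the query above repeats the col1 and col2 conditions: different value pairs can produce the same hash.

```python
import hashlib
import sqlite3


def pair_hash(val1, val2):
    """MD5 of the concatenated values, mirroring MD5(CONCAT(val1, val2))."""
    return hashlib.md5((val1 + val2).encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl_name (col1 TEXT, col2 TEXT, hash_col TEXT)")
conn.execute("CREATE INDEX idx_hash ON tbl_name (hash_col)")

rows = [("ab", "v45g"), ("abv", "45g"), ("bc", "x10")]
conn.executemany(
    "INSERT INTO tbl_name VALUES (?, ?, ?)",
    [(a, b, pair_hash(a, b)) for a, b in rows],
)

# ("ab", "v45g") and ("abv", "45g") concatenate to the same string, so they
# share a hash; filtering on the hash alone returns both rows.
hash_only = conn.execute(
    "SELECT COUNT(*) FROM tbl_name WHERE hash_col = ?",
    (pair_hash("ab", "v45g"),),
).fetchone()[0]

# Adding the original column conditions removes the false match.
matches = conn.execute(
    "SELECT col1, col2 FROM tbl_name WHERE hash_col = ? AND col1 = ? AND col2 = ?",
    (pair_hash("ab", "v45g"), "ab", "v45g"),
).fetchall()

print(hash_only, matches)
```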
You could use a unique index on province.
Greetings,
My question: does a SQL SELECT query continue reading records from the table after it finds the value I was looking for, or does it stop?
Reference: "In order to return data for this query, mysql must start at the beginning of the disk data file, read in enough of the record to know where the category field data starts (because long_text is variable length), read this value, see if it satisfies the where condition (and so decide whether to add to the return record set), then figure out where the next record set is, then repeat."
Link for reference: http://www.verynoisy.com/sql-indexing-dummies/#how_the_database_finds_records_normally
In general you don't know and you don't care, but you have to adapt when queries take too long to execute. When you do something like
select a,b,c from mytable where a=3 and b=5
then the database engine has a couple of options to optimize. When all these options fail, then it will do a "full table scan" - which means, it will have to examine the entire table to see which rows are eligible. When you have indices on e.g. column a then the database engine can optimize the search because it can pre-select rows where a has value 3. So, in general, make sure that you have indices for the columns that are most searched. (Perversely, some database engines get confused when you have too many indices and will fall back to a full table scan because they've lost their way...)
As to whether or not the scanning stops: in general, the database engine has to examine all candidate rows in the table (hopefully aided by indices) and won't stop after finding just one hit. If you want just the first hit, use a LIMIT 1 clause so that the result set has only one row. But then again, if you have an ORDER BY clause, the database engine cannot stop after the first hit, because later rows might sort ahead of it.
Summarizing, how the db engine does its scan depends on how smart it is, what indices are available etc.. If your select queries take too long then consider re-organizing your indices, writing your select statements differently, or rebuilding the table.
The RDBMS reading data from disk is something you cannot know, you should not care and you must not rely on.
The issue is too broad to get a precise answer. The engine reads data from storage in blocks, and a block can contain records that are not needed by the query at hand. If all the columns needed by the query are available in an index, the RDBMS won't even read the data file; it will only use the index. The data it needs could already be cached in memory (because it was read during the execution of a previous query). The underlying OS and the storage media also keep their own caches.
On a busy system, all these factors could lead to very different storage access patterns when the same query is run several times a couple of minutes apart.
Yes, it scans the entire file, unless you put something like
select * from user where id=100 limit 1
This of course will still search all the rows if id 100 is the last record.
If id is a primary key it will automatically be indexed and the search will be optimized.
I'm sorry... I thought the table.
I will change the question and explain it with the following image:
I understand that in CASE 1 all columns must be read with each iteration.
My question is whether the same happens in CASE 2, or whether columns that are not selected in the query are excluded from reading in each iteration.
Also, are both queries the same from a performance perspective?
Clarify:
CASE 1: the select prints all data
CASE 2: the select prints only the columns first_name and last_name
In CASE 2, does the MySQL server read only the columns first_name and last_name, or does it read the entire row to get that data (first_name, last_name)?
I am interested in how the server reads a table row in CASE 1 and in CASE 2.
Say I have a table with 3 columns and thousands of records like this:
id # primary key
name # indexed
gender # not indexed
And I want to find "All males named Alex", i.e., a specific name and specific gender.
Is the naive way (select * from people where name='alex' and gender=2) good enough here? Or is there a more optimal way, like a sub-query on name?
Assuming that you don't have thousands of records matching the name with only a few of them actually being male, the index on name is enough. Generally you should not index fields with low cardinality (only 2 possible values means that you are going to match about 50% of the rows, which does not justify using an index).
The only useful exception I can think of is when you are selecting name and gender only: if you put both of them in the index, you can perform an index-covered query, which is faster than selecting rows by index and then retrieving the data from the table.
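An index-covered query can be sketched with Python's sqlite3 module. This is an illustration under assumptions: SQLite stands in for MySQL, the index name is made up, and a hypothetical bio column is added so that SELECT * needs data that is not in the index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# bio is a hypothetical extra column that is not part of the index.
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, gender INTEGER, bio TEXT)"
)
conn.execute("CREATE INDEX idx_name_gender ON people (name, gender)")


def plan(sql):
    """Return the optimizer's plan for sql as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


# Only indexed columns are selected: the index alone can answer the query.
plan_covered = plan(
    "SELECT name, gender FROM people WHERE name = 'alex' AND gender = 2"
)
# All columns are selected: the index finds the rows, then the table is read for bio.
plan_not_covered = plan(
    "SELECT * FROM people WHERE name = 'alex' AND gender = 2"
)

print(plan_covered)
print(plan_not_covered)
```

SQLite reports the first plan as using a covering index; the second still uses the index but has to visit the table rows as well.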
If creating an index is not an option, or you have a large volume of data in the table (or even if there is an index, but you still want to quicken the pace) it can often have a big impact to reorder the table according to the data you are grouping together.
I have a query at work for getting KPIs together for my division and even though everything was nicely indexed, the data that was being pulled was still searching through a couple of gigs of table. This means a LOT of disc accessing while the query aggregates all the correct rows together. I reordered the table using alter table tableName order by column1, column2; and the query went from taking around 15 seconds to returning data in under 3. So the physical gathering of the data can be a significant influence - even if the tables are indexed and the DB knows exactly where to get it. Arranging the data so it is easier for the database to get to everything it needs will improve performance.
A better way is to have a composite index.
i.e.
CREATE INDEX <some name for the index> ON <table name> (name, gender)
Then the WHERE clause can use it for both the name and the gender.
I'm creating an ecommerce web application using PHP and MySQL (MyISAM). I want to know how to speed up my queries.
I have a products table with over a million records with the following columns: id (int, primary) catid (int) usrid (int) title (int) description (int) status (enum) date (datetime)
Recently I split this one table into multiple tables based on the product categories (catid), thinking that it might reduce the load on the server.
Now I need to fetch results from these tables combined, with the following sets of conditions: 1. results matching a usrid and status (to fetch a user's products) 2. results matching status and title or description (e.g. for product search)
Currently I have to use UNION to fetch results from all these tables combined, which is slowing down the performance, and I can't apply a LIMIT to the combined result set either. I thought of creating an index on all these columns to speed up the searching, but this might slow down the INSERTs and UPDATEs. I'm also beginning to think that splitting the table was not a good idea in the first place.
I would like to know the best approach to optimize the data retrieval in such a situation. I'm open to new database schema proposals as well.
To start: load test and turn on the MySQL slow query log.
Some other suggestions:
If staying with separate tables per category use UNION ALL instead of UNION. Reason being UNION implies distinctness, which makes the database engine do extra work to dedupe the rows unnecessarily.
Indices do add a write penalty, but what you describe probably has a read-write ratio of at least 10 to 1 and probably more like 1000 to 1 or higher. So index. For the two queries you describe, I would probably create three indices (you'll need to study explain plans to determine what column order is better).
usrid and status
status and title
status and description (is this an indexable field?)
Another note on indices, creating a covering index, that is one that has all your columns, can also be a useful solution if one of your frequent access patterns is retrieval by primary key.
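The UNION ALL suggestion above can be sketched with Python's sqlite3 module. The assumptions are loud ones: SQLite stands in for MySQL, the per-category table names products_cat1 and products_cat2 are hypothetical, and the rows are invented. The sketch also shows one way to apply a LIMIT to the combined result, by wrapping the UNION ALL in an aliased subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two hypothetical per-category tables standing in for the split products table.
for table in ("products_cat1", "products_cat2"):
    conn.execute(
        f"CREATE TABLE {table} "
        "(id INTEGER PRIMARY KEY, usrid INTEGER, status TEXT, title TEXT)"
    )
conn.executemany(
    "INSERT INTO products_cat1 (usrid, status, title) VALUES (?, ?, ?)",
    [(7, "active", "lamp"), (7, "active", "desk"), (8, "active", "chair")],
)
conn.executemany(
    "INSERT INTO products_cat2 (usrid, status, title) VALUES (?, ?, ?)",
    [(7, "active", "lamp"), (7, "sold", "sofa")],
)

BRANCHES = """
    SELECT usrid, status, title FROM products_cat1 WHERE usrid = 7 AND status = 'active'
    {op}
    SELECT usrid, status, title FROM products_cat2 WHERE usrid = 7 AND status = 'active'
"""

# UNION removes duplicate rows, which costs the engine extra work.
union_rows = conn.execute(BRANCHES.format(op="UNION")).fetchall()
# UNION ALL simply concatenates the branches.
union_all_rows = conn.execute(BRANCHES.format(op="UNION ALL")).fetchall()

# LIMIT applied to the combined result through an aliased subquery.
limited = conn.execute(
    "SELECT * FROM (" + BRANCHES.format(op="UNION ALL") + ") AS combined "
    "ORDER BY title LIMIT 2"
).fetchall()

print(len(union_rows), len(union_all_rows), limited)
```

Note that UNION ALL keeps the duplicate "lamp" row, so it is only a drop-in replacement when the branches cannot overlap or duplicates are acceptable.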
Have you considered using memcached? It caches the resultset from database queries on the server and returns them if they are requested by multiple users. If it doesn't find a cache resultset, only then will it query the database. It should alleviate the load on the database significantly.
http://memcached.org/
I need to add indexes to my table (columns) and stumbled across this post:
How many database indexes is too many?
Quote:
“Having said that, you can clearly add a lot of pointless indexes to a table that won't do anything. Adding B-Tree indexes to a column with 2 distinct values will be pointless since it doesn't add anything in terms of looking the data up. The more unique the values in a column, the more it will benefit from an index.”
Is an Index really pointless if there are only two distinct values? Given a table as follows (MySQL Database, InnoDB)
Id (BIGINT)
fullname (VARCHAR)
address (VARCHAR)
status (VARCHAR)
Further conditions:
The Database contains 300 Million records
Status can only be “enabled” and “disabled”
150 Million records have status = enabled and 150 Million records have status = disabled
My understanding is, without having an index on status, a select with where status='enabled' would result in a full table scan with 300 Million records to process?
How efficient is the lookup when I use a BTREE index on status?
Should I index this column or not?
What alternatives (maybe any other indexes) does MySQL InnoDB provide to efficiently look records up by the where status='enabled' clause in the given example with a very low cardinality/selectivity of the values?
The index that you describe is pretty much pointless. An index is best used when you need to select a small number of rows in comparison to the total rows.
The reason for this is related to how a database accesses a table. A table can be accessed either by a full table scan, where each block is read and processed in turn, or by a rowid or key lookup, where the database has a key/rowid and reads the exact row it requires.
In the case where you use a where clause based on the primary key or another unique index, eg. where id = 1, the database can use the index to get an exact reference to where the row's data is stored. This is clearly more efficient than doing a full table scan and processing every block.
Now back to your example, you have a where clause of where status = 'enabled', the index will return 150m rows and the database will have to read each row in turn using separate small reads. Whereas accessing the table with a full table scan allows the database to make use of more efficient larger reads.
There is a point at which it is better to just do a full table scan rather than use the index. With mysql you can use FORCE INDEX (idx_name) as part of your query to allow comparisons between each table access method.
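The idea of comparing both access methods on the same query can be sketched with Python's sqlite3 module. This is a stand-in, not the MySQL syntax: SQLite's INDEXED BY and NOT INDEXED clauses are used here in place of MySQL's FORCE INDEX and IGNORE INDEX hints, and the table name, index name and row counts are made up.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, fullname TEXT, status TEXT)")
conn.execute("CREATE INDEX idx_status ON records (status)")
conn.executemany(
    "INSERT INTO records (fullname, status) VALUES (?, ?)",
    [(f"person {i}", "enabled" if i % 2 == 0 else "disabled") for i in range(20000)],
)

# COUNT(fullname) forces the engine to read the table row, not just the index.
VIA_INDEX = "SELECT COUNT(fullname) FROM records INDEXED BY idx_status WHERE status = 'enabled'"
VIA_SCAN = "SELECT COUNT(fullname) FROM records NOT INDEXED WHERE status = 'enabled'"


def plan(sql):
    """Return the optimizer's plan for sql as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


def timed(sql):
    """Run sql once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = conn.execute(sql).fetchone()[0]
    return result, time.perf_counter() - start


plan_index, plan_scan = plan(VIA_INDEX), plan(VIA_SCAN)
count_index, seconds_index = timed(VIA_INDEX)
count_scan, seconds_scan = timed(VIA_SCAN)

print(plan_index, count_index, f"{seconds_index:.4f}s")
print(plan_scan, count_scan, f"{seconds_scan:.4f}s")
```

Timings on a small in-memory table say little about a 300M-row table on disk, so the comparison is only meaningful when repeated against realistic data.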
Reference:
http://dev.mysql.com/doc/refman/5.5/en/how-to-avoid-table-scan.html
I'm sorry to say that I do not agree with Mike. Adding an index is meant to limit the amount of full records searches for MySQL, thereby limiting IO which usually is the bottleneck.
This indexing is not free; you pay for it on inserts/updates, when the index has to be updated, and in the search itself, as it now needs to load the index file (the full index for 300M records is probably not in memory). So it might well be that you get extra IO instead of limiting it.
I do agree with the statement that a binary variable is best stored as one, a bool or tinyint, as that decreases the length of a row and can thereby limit disk IO, also comparisons on numbers are faster.
If you need speed and you seldom use the disabled records, you may wish to have 2 tables, one for enabled and one for disabled records and move the records when the status changes. As it increases complexity and risk this would be my very last choice of course. Definitely do the move in 1 transaction if you happen to go for it.
It just popped into my head that you can check whether an index is actually used by using the explain statement. That should show you how MySQL is optimizing the query. I don't really know how MySQL optimizes queries, but from PostgreSQL I do know that you should explain a query on a database approximately the same (in size and data) as the real database. So if you have a copy of the database, create an index on the table and see whether it's actually used. As I said, I doubt it, but I most definitely don't know everything :)
If the data is distributed 50:50, then a query like where status="enabled" will avoid scanning half of the table.
Whether an index on such a column is useful depends entirely on the distribution of the data. For example, if 90% of the entries have status enabled and 10% have status disabled, a query with where status="disabled" scans only 10% of the table.
So whether to index such columns depends on the distribution of the data.
@a'r's answer is correct; however, it needs to be pointed out that the usefulness of an index is given not only by its cardinality but also by the distribution of data and the queries run on the database.
In OP's case, with 150M records having status='enabled' and 150M having status='disabled', the index is unnecessary and a waste of resource.
In case of 299M records having status='enabled' and 1M having status='disabled', the index is useful (and will be used) in queries of type SELECT ... where status='disabled'.
Queries of type SELECT ... where status='enabled' will still run with a full table scan.
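The three cases above come down to what fraction of the table each condition selects. A small worked calculation in Python, using only the row counts quoted in this thread (the helper name matching_fraction is made up for illustration):

```python
def matching_fraction(matching_rows, total_rows):
    """Fraction of the table a condition selects (lower means more selective)."""
    return matching_rows / total_rows


TOTAL_ROWS = 300_000_000
cases = {
    "50:50 split, status='enabled'": matching_fraction(150_000_000, TOTAL_ROWS),
    "skewed, status='disabled'": matching_fraction(1_000_000, TOTAL_ROWS),
    "skewed, status='enabled'": matching_fraction(299_000_000, TOTAL_ROWS),
}

for label, fraction in cases.items():
    print(f"{label}: {fraction:.2%} of rows")
```

Only the second case selects a small enough slice of the table for an index lookup to beat a full scan.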
You will hardly need all 150 mln records at once, so I guess "status" will always be used in conjunction with other columns. Perhaps it'd make more sense to use a compound index like (status, fullname)
Jan, you should definitely index that column. I'm not sure of the context of the quote, but everything you said above is correct. Without an index on that column, you are most certainly doing a table scan on 300M rows, which is about the worst you can do for that data.
Jan, as asked, where your query involves simply "where status=enabled" without some other limiting factor, an index on that column apparently won't help (glad the SO community showed me what's up). If, however, there is a limiting factor, such as "limit 10", an index may help. Also, remember that indexes are also used in group by and order by optimizations. If you are doing "select count(*),status from table group by status", an index would be helpful.
You should also consider converting status to a tinyint where 0 would represent disabled and 1 would be enabled. You're wasting tons of space storing that string vs. a tinyint which only requires 1 byte per row!
I have a similar column in my MySQL database. Approximately 4 million rows, with the distribution of 90% 1 and 10% 0.
I've just discovered today that my queries (where column = 1) actually run significantly faster WITHOUT the index.
Foolishly I deleted the index. I say foolishly, because I now suspect the queries (where column = 0) may have still benefited from it. So, instead I should explicitly tell MySQL to ignore the index when I'm searching for 1, and to use it when I'm searching for 0. Maybe.