I have a table containing 250 million records of people who live in the US, along with their state, county and settlement. A simplified version looks like:
I have put a combined index on surname, region, subregion and place. The following queries take exactly the same time to execute:
SELECT SQL_NO_CACHE surname, place, count(*) as cnt FROM `ustest` group by place, surname;
SELECT SQL_NO_CACHE surname, region, count(*) as cnt FROM `ustest` group by region, surname;
I was under the impression that the first query would not use the index, as I thought that to use an index you had to query on all the columns from left to right.
Can anyone explain how MySQL uses indexes on multiple columns in such instances?
It's hard to tell the specifics of your queries' execution plans without seeing the EXPLAIN output.
But two things jump out:
Both queries must take all rows in the table into account (you don't have a WHERE clause).
Both queries could be satisfied by scanning your compound index, since surname is its lead column and every column the queries touch is in it. Because you're counting items, a tight, not loose, index scan is necessary. (You can read about those in the MySQL manual.)
So it's possible that they both have the same execution plan.
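If you want to confirm that, a quick check is to compare the two EXPLAIN outputs (a minimal sketch; it assumes the ustest table and compound index described above):

EXPLAIN SELECT surname, place, COUNT(*) AS cnt FROM ustest GROUP BY place, surname;
EXPLAIN SELECT surname, region, COUNT(*) AS cnt FROM ustest GROUP BY region, surname;
-- If both plans show the same compound index under "key" and "Using index" in Extra,
-- they are executing the same way (an index-only scan).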
I'm doing some problem sets in my database management course and I can't figure this specific problem out.
We have the following relation:
Emp (id, name, age, sal, ...)
And the following query:
SELECT id
FROM Emp
WHERE age > (select max(sal) from Emp);
We are then supposed to choose the index(es) that would best help the query optimizer. My answer would be to just use Emp(age), but the solution to the question is
Emp(age)
&
Emp(sal)
How come there are two indexes? I can't seem to wrap my head around why you would need more than the age attribute.
Of course, you realize that the query is nonsensical, comparing age to sal (which is presumably a salary). That said, two indexes are appropriate for:
SELECT e.id
FROM Emp e
WHERE e.age > (select max(e2.sal) from Emp e2);
I added table aliases to emphasize that the query is referring to the Emp table twice.
To get the maximum sal from the table, you want an index on emp(sal). The maximum is a simple index lookup operation.
Then you want to compare this to age. Well, for a comparison on age, you want an index on emp(age). This is an entirely separate reference to emp that has no reference to sal, so you cannot put the two columns in a single index.
The index on age may not be necessary. The query may be returning lots of rows -- and queries that return lots of rows don't generally benefit from a secondary index. The one case where it can benefit from the index is if age is the clustered index (that is, typically the first column in the primary key). However, I wouldn't recommend such an indexing structure.
You need both indexes to get optimal performance; a sketch of creating them follows these two points.
1) The subquery (select max(sal) from Emp) will benefit from an index on Emp(sal), because retrieving the max from a tree index is much quicker.
2) The outer query has to filter on age, so it also benefits from a tree index on Emp(age).
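A minimal sketch of creating the two indexes (the index names are mine, not part of the problem set):

CREATE INDEX idx_emp_sal ON Emp (sal);  -- MAX(sal) becomes a lookup at one end of this index
CREATE INDEX idx_emp_age ON Emp (age);  -- the outer WHERE age > ... becomes a range scan on this index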
I have a table holding a domain and an id.
The query is:
select distinct domain
from user
where id = '1'
Using the index idx_domain_id is faster than using idx_id_domain.
If the order of execution is
(FROM clause, WHERE clause, GROUP BY clause, HAVING clause, SELECT clause, ORDER BY clause)
then I would expect the query to be faster when the index leads with the WHERE column rather than the SELECT column.
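For reference, a sketch of the two index definitions being compared (column order inferred from the index names):

ALTER TABLE user ADD INDEX idx_domain_id (domain, id);
ALTER TABLE user ADD INDEX idx_id_domain (id, domain);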
From 15:00 to 17:00, this video shows the same query I am working on:
https://serversforhackers.com/laravel-perf/mysql-indexing-three
The table has 4.6 million rows.
(Timing screenshots: one using idx_domain_id, and one after changing the column order.)
This is your query:
select distinct domain
from user
where id = '1';
You are observing that user(domain, id) is faster than user(id, domain).
Why might this be the case? First, this could simply be an artifact of how you are doing the timing. If your table is really small (i.e., the data fits on a single data page), then indexes are generally not very useful for improving performance.
Second, if you are only running the queries once, then the first time you run the query, you might have a "cold cache". The second time, the data is already stored in memory, so it runs faster.
Other issues can come up as well. You don't specify what the timings are. Small differences can be due to noise and might be meaningless.
You don't provide enough information to give a more definitive explanation. That would include the following (a sketch of collecting it appears after this list):
Repeated timings run on cold caches.
Size information on the table and the number of matching rows.
Layout information, particularly the type of id.
Explain plans for the two queries.
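A hedged sketch of how that information could be gathered (table and column names taken from the question):

EXPLAIN SELECT DISTINCT domain FROM user WHERE id = '1';  -- execution plan
SELECT COUNT(*) FROM user;                                -- table size
SELECT COUNT(*) FROM user WHERE id = '1';                 -- number of matching rows
SHOW CREATE TABLE user;                                   -- layout, including the type of id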
select distinct domain
from user
where id = '1'
Since id is the PRIMARY KEY, there is at most one row involved. Hence, the keyword DISTINCT is useless.
And the most useful index is what you already have, PRIMARY KEY(id). It will drill down the BTree to find id='1' and deliver the value of domain that is sitting right there.
On the other hand, consider
select distinct domain
from user
where something_else = '1'
Now, the obvious index is INDEX(something_else, domain). This is optimal for the WHERE clause, and it is "covering" (meaning that all the columns needed by the query exist in the index). Swapping the columns in the index would be slower. Meanwhile, since there could be multiple rows, DISTINCT means something. However, DISTINCT may not be the logical thing to use, as explained below.
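A sketch of that covering index, assuming the user table from the question (something_else stands in for whatever non-unique column you filter on):

ALTER TABLE user ADD INDEX idx_se_domain (something_else, domain);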
Concerning your title question (order of columns): The = columns in the WHERE clause should come first. (More details in the link below.)
DISTINCT means to gather all the rows, then de-duplicate them. Why go to that much effort when this gives the same answer:
select domain
from user
where something_else = '1'
LIMIT 1
This hits only one row, not all the 1s.
Read my Indexing Cookbook.
(And, yes, Gordon has a lot of good points.)
I am having a problem with the following task in MySQL. I have a table Records(id, enterprise, department, status), where id is the primary key, enterprise and department are foreign keys, and status is an integer value (0 = CREATED, 1 = APPROVED, 2 = REJECTED).
Now, the application usually needs to filter on a specific enterprise, department and status:
SELECT * FROM Records WHERE status = 0 AND enterprise = 11 AND department = 21
ORDER BY id desc LIMIT 0,10;
The order by is required, since I have to provide the user with the most recent records. For this query I have created an index (enterprise, department, status), and everything works fine. However, for some privileged users the status should be omitted:
SELECT * FROM Records WHERE enterprise = 11 AND department = 21
ORDER BY id desc LIMIT 0,10;
This obviously breaks the index - it's still good for filtering, but not for sorting. So, what should I do? I don't want to create a separate index (enterprise, department), so what if I modify the query like this:
SELECT * FROM Records WHERE enterprise = 11 AND department = 21
AND status IN (0,1,2)
ORDER BY id desc LIMIT 0,10;
MySQL definitely does use the index now, since it's provided with values for status, but how quick will the sorting by primary key be? Will it take the 10 most recent values for each available status and then merge them, or will it first merge the ids for all the statuses together and only after that take the first ten (in which case I guess it's going to be much slower)?
All of the queries will benefit from one composite index:
INDEX(enterprise, department, status, id)
enterprise and department can be swapped, but keep the rest of the columns in that order.
The first query will use that index for both the WHERE and the ORDER BY, and will thereby be able to find the 10 rows without scanning the table or doing a sort.
The second query is missing status, so my index is less than perfect. This would be better:
INDEX(enterprise, department, id)
At that point, it works like above. (Note: If the table is InnoDB, then this 3-column index is identical to your 2-column INDEX(enterprise, department) -- the PK is silently included.)
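A sketch of adding both suggested indexes to the Records table (the index names are mine):

ALTER TABLE Records
    ADD INDEX idx_ent_dept_status_id (enterprise, department, status, id),
    ADD INDEX idx_ent_dept_id        (enterprise, department, id);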
The third query gets dicier because of the IN. Still, my 4-column index will be nearly the best. It will use the first 3 columns, but it will not be able to do the ORDER BY id, so it won't use id. And it won't be able to consume the LIMIT. Hence the EXPLAIN will say Using temporary and/or Using filesort. Don't worry, performance should still be nice.
My second index is not as good for the third query.
See my Index Cookbook.
"How quick will sorting by id be"? That depends on two things.
Whether the sort can be avoided (see above);
How many rows in the query without the LIMIT;
Whether you are selecting TEXT columns.
I was careful to say whether the INDEX is used all the way through the ORDER BY, in which case there is no sort, and the LIMIT is folded in. Otherwise, all the rows (after filtering) are written to a temp table, sorted, then 10 rows are peeled off.
The "temp table" I just mentioned is necessary for various complex queries, such as those with subqueries, GROUP BY, ORDER BY. (As I have already hinted, sometimes the temp table can be avoided.) Anyway, the temp table comes in 2 flavors: MEMORY and MyISAM. MEMORY is favorable because it is faster. However, TEXT (and several other things) prevent its use.
If MEMORY is used then Using filesort is a misnomer -- the sort is really an in-memory sort, hence quite fast. For 10 rows (or even 100) the time taken is insignificant.
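If you want to see which situation you are in, a hedged check is to run EXPLAIN on the third query from the question and look at the Extra column:

EXPLAIN SELECT * FROM Records
WHERE enterprise = 11 AND department = 21 AND status IN (0,1,2)
ORDER BY id DESC LIMIT 0,10;
-- "Using temporary" and/or "Using filesort" in Extra means the sort was not avoided.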
I have a table with 2 million rows. I have two indexes: (status, gender) and (birthday).
I find it strange that this query takes 3.6 seconds or more:
QUERY N° 1
SELECT COUNT(*) FROM ts_user_core
WHERE birthday BETWEEN '1980-01-01' AND '1985-01-01'
AND status='ok' AND gender='female';
The same goes for this one:
QUERY N° 2
SELECT COUNT(*) FROM ts_user_core
WHERE status='ok' AND gender='female'
AND birthday between '1980-01-01' AND '1985-01-01';
Whereas this query takes 0.140 seconds:
QUERY N° 3
select count(*) from ts_user_core where (birthday between '1990-01-01' and '2000-01-01');
This query also takes only 0.2 seconds:
QUERY N° 4
select count(*) from ts_user_core where status='ok' and gender='female'
I expected the first query to be much faster. How is this behavior possible? I can't afford that much time for this query.
I know that I can add a new index with 3 columns, but is there a way to have a faster query without adding an index for every where clause?
Thanks for your advice
is there a way to optimize the query without adding an index for every possible where clause?
Yes, somewhat. But it takes an understanding of how INDEXes work.
Let's look at all the SELECTs you have presented so far.
To build the optimal index for a SELECT, start with all the = constant items in the WHERE clause. Put those columns into an index in any order. That gives us INDEX(status, gender, ...) or INDEX(gender, status, ...), but nothing deciding between them (yet).
Then add on one 'range' column or all of the ORDER BY columns. In your first couple of SELECTs, that would be birthday. Now we have INDEX(status, gender, birthday) or INDEX(gender, status, birthday). Either of these is 'best' for the first two SELECTs.
Those indexes work quite well for #4: select count(*) from ts_user_core where status='ok' and gender='female', too. So no extra index needed for it.
Now, let's work on #3: select count(*) from ts_user_core where (birthday between '1990-01-01' and '2000-01-01');
It cannot use the indexes we have so far.
INDEX(birthday) is essentially the only choice.
Now, suppose we also had ... WHERE status='foo'; (without gender). That would force us to pick INDEX(status, gender, birthday) instead of the other variant.
Result: 2 good indexes handle all 5 selects (a sketch of creating them follows):
INDEX(status, gender, birthday)
INDEX(birthday)
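A sketch of creating them (the index names are mine):

ALTER TABLE ts_user_core
    ADD INDEX idx_status_gender_bday (status, gender, birthday),
    ADD INDEX idx_bday (birthday);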
Suggestion: If you end up with more than 5 INDEXes or an index with more than 5 columns in it, it is probably wise to shorten some indexes. Here is where things get really fuzzy. If you would like to present me with a dozen 'realistic' indexes, I'll walk you through it.
Notes on other comments:
For timing, run each query twice and take the second time -- to avoid caching effects. (Your 3.6 vs 0.140 smells like caching of the index.)
For timing, turn off the Query cache or use SQL_NO_CACHE (see the sketch after these notes).
The optimizer rarely uses two indexes in a single query.
Show us the EXPLAIN plan; we can help you read it.
The extra time taken to pick among multiple INDEXes is usually worth it.
If you have INDEX(a,b,c), you don't need INDEX(a,b).
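For example, a hedged way to time the first query while bypassing the query cache (applies to MySQL versions that still have the query cache):

SELECT SQL_NO_CACHE COUNT(*) FROM ts_user_core
WHERE status='ok' AND gender='female'
  AND birthday BETWEEN '1980-01-01' AND '1985-01-01';
-- Run it twice and take the second timing, to separate index caching from raw query speed.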
In the first case, you have two candidate indexes, and while the MySQL optimizer reads your query, it must work out which plan is better.
Because you have two indexes, the optimizer spends more time deciding which plan is better, since it has to consider more possible execution plans.
In the second case, MySQL positions itself at the first index page containing status 'ok' and reads successive pages until gender changes, which is faster than the first case.
Try creating one index with all three columns from the WHERE clause.
It's more than likely that MySQL is terminating your index usage after it performs a range scan over your date range.
Run the following queries in the mysql client to see how it's using your indices:
EXPLAIN EXTENDED
SELECT COUNT(*) FROM ts_user_core
WHERE birthday BETWEEN '1980-01-01' AND '1985-01-01'
AND status='ok' AND gender='female';
SHOW INDEX IN ts_user_core;
I'm guessing that your index or primary key has birthday before status and/or gender in the index, causing a range scan. MySQL will terminate all further index usage after it performs a range scan.
If that's the case, you can then re-arrange the columns in your index to move status and gender before birthday or create a new index specifically for this query with status and gender before birthday.
Before you re-arrange an existing index, however, make sure that no other queries your system runs depend on the current ordering.
The difference between N° 1 and N° 2 is down to the stored data being cached. If you look at the execution plans you will find they are exactly the same.
select count(*) from ts_user_core where (birthday between '1990-01-01' and '2000-01-01');
With an index on birthday, this will not need to look at the table data (and similarly for status and gender). But MySQL generally uses only one index per table, so for a query using both predicates it will select the more selective index (shown in EXPLAIN) to resolve one predicate, then fetch the corresponding table rows (an expensive operation) to resolve the other predicate.
If you add an index with all 3 columns, you will have a covering index for the compound query. Alternatively, use the primary key (you didn't tell us the structure of the table, so I'll assume "id") and...
SELECT COUNT(*)
FROM ts_user_core bday
INNER JOIN ts_user_core stamf
ON bday.id=stamf.id
WHERE bday.birthday BETWEEN '1980-01-01' AND '1985-01-01'
AND stamf.status='ok' AND stamf.gender='female';
Side note:
status='ok' AND gender='female'
Columns which have a small set of possible values and/or skewed data (such that some values are MUCH more frequent than others) tend not to work well as indexes, although the stats here suggest that might not be an issue.
Say I have these four tables:
BRANCH (BRANCH_ID, CITY_ID, OWNER_ID, SPECIALTY_ID, INAUGURATION_DATE)
CITY (CITY_ID, NAME)
OWNER (OWNER_ID, NAME)
SPECIALTY (SPECIALTY_ID, NAME)
I have a PrimeFaces datatable where I will show all branches using pagination of 50 (LIMIT X, 50). Today BRANCH has like 10000 rows. I'll join BRANCH with the other 3 tables because I want to show their names.
I want to fetch the results with the following default sort:
ORDER BY INAUGURATION_DATE ASC, C.NAME ASC, O.NAME ASC, S.NAME ASC
Now, the user can choose to click on the header of any of these columns in my datatable, and I will query the database again making the sort he asked for the highest-priority one. For instance, if he chose to order first by specialty name, descending, I'll do:
ORDER BY S.NAME DESC, INAUGURATION_DATE ASC, C.NAME ASC, O.NAME ASC
Now my question: how can I query the database with this dynamic sort, always using the 4 columns, efficiently? A lot of users can be viewing this datatable on my site at the same time (like 1000 users), so using the ORDER BY in the SQL is very slow. I'm doing the ordering in Java, but then I cannot do the pagination correctly. How can I do this efficiently in SQL? Is creating indexes for these columns enough?
Thanks
10,000 rows is quite small, so MySQL should be able to handle that very fast. Assuming you have proper indexes on the CITY, OWNER, and SPECIALTY tables (which will be the case if you declare primary keys), this query should return quickly. Also be sure to use LIMIT 50 in your query.
However, if the number of rows becomes large (like a million or much more; time the query to find out where it begins to slow down), then individual indexes on CITY_ID, OWNER_ID, SPECIALTY_ID, or INAUGURATION_DATE will not help. To take advantage of an index for the sort, assuming you are just doing a join and there are no WHERE clauses, the index would need to contain all the sort columns in the order you wish to sort. So you would need quite a few indexes to cover all the cases.
If performance becomes an issue, you may want to consider whether the application really needs all those options. Perhaps you could offer the user a change of sort on just one column at a time; in that case individual indexes will help. Also, when the number of rows gets large, the performance bottleneck may not be the sorting but rather how you are performing the pagination. I like the approach in https://stackoverflow.com/a/19609938/4350148.
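As a baseline, here is a sketch of the paginated query with the default sort (table and column names from the question; LIMIT 0, 50 fetches the first page):

SELECT b.BRANCH_ID, b.INAUGURATION_DATE,
       c.NAME AS city, o.NAME AS owner, s.NAME AS specialty
FROM BRANCH b
JOIN CITY c      ON c.CITY_ID = b.CITY_ID
JOIN OWNER o     ON o.OWNER_ID = b.OWNER_ID
JOIN SPECIALTY s ON s.SPECIALTY_ID = b.SPECIALTY_ID
ORDER BY b.INAUGURATION_DATE ASC, c.NAME ASC, o.NAME ASC, s.NAME ASC
LIMIT 0, 50;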
One last point: MySQL caches queries by default, so if the tables are not changing then the queries should return without even having to redo the sorting.