I have two different queries, but both run against the same table and have the same WHERE clause, so they select the same rows.
Query 1:
SELECT HOUR(timestamp), COUNT(*) as hits
FROM hits_table
WHERE timestamp >= CURDATE()
GROUP BY HOUR(timestamp)
Query 2:
SELECT country, COUNT(*) as hits
FROM hits_table
WHERE timestamp >= CURDATE()
GROUP BY country
How can I make this more efficient?
If this table is indexed correctly, it honestly doesn't matter how big the entire table is because you're only looking at today's rows.
If the table is indexed incorrectly the performance of these queries will be terrible no matter what you do.
Your WHERE timestamp >= CURDATE() clause means you need to have an index on the timestamp column. In one of your queries the GROUP BY country shows that a compound covering index on (timestamp, country) will be a great help.
So, a single compound index (timestamp, country) will satisfy both the queries in your question.
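As a minimal sketch of that index, assuming the table and column names from the question (the index name idx_ts_country is made up for illustration), the DDL and a quick check with EXPLAIN could look like this:

```sql
-- Hypothetical index name; table and column names come from the question.
ALTER TABLE hits_table
  ADD INDEX idx_ts_country (`timestamp`, country);

-- Ask the optimizer how it plans to run each query.
EXPLAIN
SELECT HOUR(`timestamp`), COUNT(*) AS hits
FROM hits_table
WHERE `timestamp` >= CURDATE()
GROUP BY HOUR(`timestamp`);

EXPLAIN
SELECT country, COUNT(*) AS hits
FROM hits_table
WHERE `timestamp` >= CURDATE()
GROUP BY country;
```

If the index is used as a covering index, the Extra column of the EXPLAIN output should include "Using index".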
Let's explain how that works. To look for today's records (or indeed any records starting and ending with particular timestamp values) and group them by country, and count them, MySQL can satisfy the query by doing these steps:
1. Random-access the index to the first record that matches the timestamp. O(log n).
2. Grab the first country value from the index.
3. Scan to the next country value in the index and count. O(n).
4. Repeat step three until the end of the timestamp range.
This index scan operation is about as fast as a team of ace developers (the MySQL team) can get it to be with a decade of hard work. (You may not be able to outdo them on a Saturday afternoon.) MySQL satisfies the whole query with a small subset of the index, so it doesn't really matter how big the table behind it is.
If you run one of these queries right after the other, it's possible that MySQL will still have some or all the index data blocks in a RAM cache, so it might not have to re-fetch them from disk. That will help even more.
Do you see how your example queries lead with timestamp? The most important WHERE criterion chooses a timestamp range. That's why the compound index I suggested has timestamp as its first column. If you don't have any queries that lead with country your simple index on that column probably is useless.
You asked whether you really need compound covering indexes. You probably should read about how they work and make that decision for yourself.
There's obviously a tradeoff in choosing indexes. Each index slows the process of INSERT and UPDATE a little, and can speed up queries a lot. Only you can sort out the tradeoffs for your particular application.
Since both queries have different GROUP BY clauses they are inherently different and cannot be combined. Assuming there already is an index present on the timestamp field there is no straightforward way to make this more efficient.
If the dataset is huge (10 million or more rows) you might get a little extra efficiency out of an extra combined index on (country, timestamp), but that's unlikely to be measurable, and the lack of it will usually be mitigated by MySQL's own in-memory buffering if these two queries are executed directly after one another.
Related
I am using MySQL for my database. I have 200,000 records with 30 columns. I created a composite index on 6 columns (txn_date, v_name, transaction_status, sid, p_num, txn_num). When I run EXPLAIN on the following query, which has those 6 columns in the WHERE clause, the output shows the index being used only up to a certain txn_date; beyond that it falls back to filtering with the WHERE condition.
SELECT * FROM transactions
WHERE txn_date between '2021-01-10' and '2021-01-19'
and v_name ='Vo'
AND transaction_status = 'failed'
AND sid = '566'
AND txn_num = 100
AND p_num = 5;
In the above query, when the txn_date range runs from 10 Jan to 18 Jan, the index is used; when the range extends beyond that, the WHERE condition is used instead. Please help me use the index effectively so that it is always used.
End with the date; start with columns tested with '='.
The columns of an index will be used from the left, but won't be used past the range test, so your index was no better than a 1-column index with just the date. Given that, the Optimizer probably saw that more than about 20% of the table would need to be used (based on the date range), and punted. That is, it decided that it would probably be faster to simply scan the table.
This discussion applies to any size of table.
FORCE INDEX will force it to use the index, but so what? The Optimizer is pretty good at deciding that a small date range can effectively use the index, but a large range cannot. If you add a FORCE, it may help some of the time but hurt badly in other cases.
Having all the = tests first in the index obviates much of the discussion about how many days are in the date range.
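A sketch of the reordered index, assuming the column names used in the query above (the index name idx_eq_then_date is made up for illustration): the five columns tested with = come first, and the range column txn_date comes last.

```sql
-- Hypothetical index name; columns follow the WHERE clause of the query above.
ALTER TABLE transactions
  ADD INDEX idx_eq_then_date
    (v_name, transaction_status, sid, txn_num, p_num, txn_date);

-- Re-check the plan for both a short and a long date range.
EXPLAIN
SELECT * FROM transactions
WHERE txn_date BETWEEN '2021-01-10' AND '2021-01-19'
  AND v_name = 'Vo'
  AND transaction_status = 'failed'
  AND sid = '566'
  AND txn_num = 100
  AND p_num = 5;
```

The key and key_len columns of the EXPLAIN output indicate which index was chosen and how much of it is being used.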
More on index building: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
I have a table that holds a domain and an id.
The query is:
select distinct domain
from user
where id = '1'
Using the index with the column order idx_domain_id is faster than using idx_id_domain.
If the order of execution is
(FROM clause, WHERE clause, GROUP BY clause, HAVING clause, SELECT clause, ORDER BY clause)
then the query should be faster when the index leads with the WHERE column rather than the SELECT column.
The video below shows the same kind of query I am working on, from 15:00 to 17:00:
https://serversforhackers.com/laravel-perf/mysql-indexing-three
The table has 4.6 million rows.
time using idx_domain_id
time after change the order
This is your query:
select distinct first_name
from user
where id = '1';
You are observing that user(first_name, id) is faster than user(id, first_name).
Why might this be the case? First, this could simply be an artifact of how you are doing the timing. If your table is really small (i.e. the data fits on a single data page), then indexes are generally not very useful for improving performance.
Second, if you are only running the queries once, then the first time you run the query, you might have a "cold cache". The second time, the data is already stored in memory, so it runs faster.
Other issues can come up as well. You don't specify what the timings are. Small differences can be due to noise and might be meaningless.
You don't provide enough information to give a more definitive explanation. That would include:
Repeated timings run on cold caches.
Size information on the table and the number of matching rows.
Layout information, particularly the type of id.
Explain plans for the two queries.
select distinct domain
from user
where id = '1'
Since id is the PRIMARY KEY, there is at most one row involved. Hence, the keyword DISTINCT is useless.
And the most useful index is what you already have, PRIMARY KEY(id). It will drill down the BTree to find id='1' and deliver the value of domain that is sitting right there.
On the other hand, consider
select distinct domain
from user
where something_else = '1'
Now, the obvious index is INDEX(something_else, domain). This is optimal for the WHERE clause, and it is "covering" (meaning that all the columns needed by the query exist in the index). Swapping the columns in the index will be slower. Meanwhile, since there could be multiple rows, DISTINCT means something. However, it is not the logical thing to use.
Concerning your title question (order of columns): The = columns in the WHERE clause should come first. (More details in the link below.)
DISTINCT means to gather all the rows, then de-duplicate them. Why go to that much effort when this gives the same answer:
select domain
from user
where something_else = '1'
LIMIT 1
This hits only one row, not all the 1s.
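To make the index suggestion concrete, here is a minimal sketch. The column something_else is the placeholder used above, and the index name is made up for illustration.

```sql
-- Hypothetical index name; something_else is the placeholder column from the text.
-- The WHERE column comes first and domain follows, which makes the index covering.
ALTER TABLE `user`
  ADD INDEX idx_something_else_domain (something_else, domain);

-- Check that the query can be answered from the index.
EXPLAIN
SELECT domain
FROM `user`
WHERE something_else = '1'
LIMIT 1;
```

When the index covers the query, the Extra column of EXPLAIN should include "Using index".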
Read my Indexing Cookbook.
(And, yes, Gordon has a lot of good points.)
I have a table that contains information about time, so it has columns like year, month, day, hour and so on.
The table holds data across many years and is quite big, so I decided to partition it and started learning about MySQL partitioning, but I got stuck on a few questions.
I would really appreciate it if someone could help me understand how partitions and indexes work together.
If I partition on the year column and also have an index on the same column, how will the partition and the index work together? How will performance compare with having only an index on the year column and no partitioning?
Example SQL: Select month, day, hour ... from time_table where year = '2017';
If the table is partitioned on the year column and a query filters on the month column, which is indexed, how will the index on month and the partition on year affect SELECT performance?
Example SQL: Select year, month, day .... from time_table where month = '05';
Partitioning splits a table up into, shall we say, "sub-tables". Each sub-table is essentially a table, with data and index(es).
When SELECTing from the table, the first thing done is to decide which partition(s) may contain the desired data. This is "partition pruning" and uses the "partition key" (which is apparently to be year). Then the select is applied to the subtables that are relevant, using whatever index is appropriate. In that case it is a waste to have INDEX(year, ...), since you are already pruned down to the year.
Your sample select cannot do partition pruning since you did not specify year in the WHERE clause. Hence, it will look in all partitions, and will be slower than if you did not partition the table.
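Pruning can be observed directly with EXPLAIN. The following is only a sketch: the table time_table_demo, its column types, and the partition names are assumptions made for illustration, not the asker's real schema.

```sql
-- Hypothetical demo table; column types and partition boundaries are assumed.
CREATE TABLE time_table_demo (
  `year`  SMALLINT NOT NULL,
  `month` TINYINT  NOT NULL,
  `day`   TINYINT  NOT NULL,
  `hour`  TINYINT  NOT NULL,
  INDEX idx_month (`month`)
)
PARTITION BY RANGE (`year`) (
  PARTITION p2016 VALUES LESS THAN (2017),
  PARTITION p2017 VALUES LESS THAN (2018),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- Filters on the partition key, so pruning is possible.
EXPLAIN SELECT `month`, `day`, `hour`
FROM time_table_demo
WHERE `year` = 2017;

-- No filter on the partition key, so every partition has to be searched.
EXPLAIN SELECT `year`, `month`, `day`
FROM time_table_demo
WHERE `month` = 5;
```

In MySQL 5.7 and later the EXPLAIN output has a partitions column listing the partitions that will be read; older versions need EXPLAIN PARTITIONS to show it.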
Don't use partitioning unless you expect at least a million rows. (That would be a lot of years.)
Don't use partitioning unless you have a use case where it will help you. (Apparently not your case.)
Don't have columns for the parts of a datetime, when it is so easy to compute the parts: YEAR(date), MONTH(date), etc.
Don't index columns with low cardinality; the Optimizer will end up scanning the entire table anyway -- because it is faster. (eg: month='05')
If you would like to back up a step and explain what you are trying to accomplish, perhaps we can discuss another approach.
I have a big database with about 3 million records, each containing a timestamp.
Now I want to select one record per month and it works using this query:
SELECT timestamp, id, gas_used, kwh_used1, kwh_used2 FROM energy
GROUP BY MONTH(timestamp) ORDER BY timestamp ASC
It works but it is very slow.
I have indexes on id and on timestamp.
What can I do to make this query fast?
GROUP BY MONTH(timestamp) forces the engine to look at each record individually (a sequential scan), which is very slow when you have 3 million records.
A common solution is to add an indexed column holding just the criterion you want to select on. However, I strongly suspect that you will actually want to select on year-month if your database is not reset every year.
To avoid data corruption issues, it may be best to create an insert trigger that automatically fills that field. That way this extra column doesn't interfere with your business logic.
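One way this could look is sketched below. The column name ts_month, the index name, and the trigger name are all made up for illustration, and the 'YYYY-MM' format is an assumption about what you want to group on.

```sql
-- Hypothetical column, index, and trigger names.
ALTER TABLE energy
  ADD COLUMN ts_month CHAR(7) NULL,
  ADD INDEX idx_ts_month (ts_month);

-- Backfill existing rows once with a 'YYYY-MM' value.
UPDATE energy
SET ts_month = DATE_FORMAT(`timestamp`, '%Y-%m');

-- Fill the column automatically for every new row.
CREATE TRIGGER energy_set_ts_month
BEFORE INSERT ON energy
FOR EACH ROW
SET NEW.ts_month = DATE_FORMAT(NEW.`timestamp`, '%Y-%m');
```

This trigger only covers inserts; if the timestamp of an existing row can change, a matching BEFORE UPDATE trigger would be needed as well.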
It is not good practice to SELECT columns that don't appear in the GROUP BY clause, unless they are wrapped in an aggregate function such as MIN(), MAX(), SUM() etc.
In your query this applies to columns:
id, gas_used, kwh_used1, kwh_used2
You will not get the "earliest" (by timestamp) row for each month in this case.
More:
https://dev.mysql.com/doc/refman/5.7/en/group-by-handling.html
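If the goal is the earliest row of each month, one possible rewrite (a sketch, not the only way to do it) is to compute the minimum timestamp per year and month in a derived table and join back to the full rows:

```sql
-- Earliest row per calendar month, using the columns from the question.
SELECT e.`timestamp`, e.id, e.gas_used, e.kwh_used1, e.kwh_used2
FROM energy AS e
JOIN (
  -- One earliest timestamp per year-month.
  SELECT MIN(`timestamp`) AS first_ts
  FROM energy
  GROUP BY YEAR(`timestamp`), MONTH(`timestamp`)
) AS m
  ON e.`timestamp` = m.first_ts
ORDER BY e.`timestamp` ASC;
```

If several rows share the same earliest timestamp in a month, all of them are returned.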
I have a table with 2 million rows. I have two indexes: (status, gender) and (birthday).
I find it strange that this query takes 3.6 seconds or more:
QUERY N° 1
SELECT COUNT(*) FROM ts_user_core
WHERE birthday BETWEEN '1980-01-01' AND '1985-01-01'
AND status='ok' AND gender='female';
same for this:
QUERY N° 2
SELECT COUNT(*) FROM ts_user_core
WHERE status='ok' AND gender='female'
AND birthday between '1980-01-01' AND '1985-01-01';
While this query is taking 0.140 seconds
QUERY N° 3
select count(*) from ts_user_core where (birthday between '1990-01-01' and '2000-01-01');
Also this query takes 0.2 seconds
QUERY N° 4
select count(*) from ts_user_core where status='ok' and gender='female'
I expected the first query to be much faster. How is this behavior possible? I can't afford this query taking so long.
Here the result of:
I know that I can add a new index with 3 columns, but is there a way to have a faster query without adding an index for every where clause?
Thanks for your advice
is there a way to optimize the query without adding an index for every possible where clause?
Yes, somewhat. But it takes an understanding of how INDEXes work.
Let's look at all the SELECTs you have presented so far.
To build the optimal index for a SELECT, start with all the = constant items in the WHERE clause. Put those columns into an index in any order. That gives us INDEX(status, gender, ...) or INDEX(gender, status, ...), but nothing deciding between them (yet).
Then add on one range or all the ORDER BY. In your first couple of SELECTs, that would be birthday. Now we have INDEX(status, gender, birthday) or INDEX(gender, status, birthday). Either of these is 'best' for the first two SELECTs.
Those indexes work quite well for #4: select count(*) from ts_user_core where status='ok' and gender='female', too. So no extra index needed for it.
Now, let's work on #3: select count(*) from ts_user_core where (birthday between '1990-01-01' and '2000-01-01');
It cannot use the indexes we have so far.
INDEX(birthday) is essentially the only choice.
Now, suppose we also had ... WHERE status='foo'; (without gender). That would force us to pick INDEX(status, gender, birthday) instead of the variant of it.
Result: 2 good indexes to handle all 5 selects:
INDEX(status, gender, birthday)
INDEX(birthday)
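As a sketch of how to get there from the two indexes mentioned in the question (the new index name is made up for illustration, and the names of the existing indexes are not known, so they are looked up rather than assumed):

```sql
-- Hypothetical index name; columns as recommended above.
ALTER TABLE ts_user_core
  ADD INDEX idx_status_gender_birthday (status, gender, birthday);

-- List the indexes on the table to find the name of the old (status, gender)
-- index, which the new three-column index makes redundant.
SHOW INDEX FROM ts_user_core;
```

The existing INDEX(birthday) stays as it is.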
Suggestion: If you end up with more than 5 INDEXes or an index with more than 5 columns in it, it is probably wise to shorten some indexes. Here is where things get really fuzzy. If you would like to present me with a dozen 'realistic' indexes, I'll walk you through it.
Notes on other comments:
For timing, run each query twice and take the second time -- to avoid caching effects. (Your 3.6 vs 0.140 smells like caching of the index.)
For timing, turn off the Query cache or use SQL_NO_CACHE.
The optimizer rarely uses two indexes in a single query.
Show us the EXPLAIN plan; we can help you read it.
The extra time taken to pick among multiple INDEXes is usually worth it.
If you have INDEX(a,b,c), you don't need INDEX(a,b).
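A sketch of the timing advice, using the first query from the question. Note that the query cache was removed in MySQL 8.0, where SQL_NO_CACHE is accepted but has no effect.

```sql
-- Run this twice and compare the second timing, so the index and data
-- blocks are already in memory but the query cache is bypassed.
SELECT SQL_NO_CACHE COUNT(*)
FROM ts_user_core
WHERE birthday BETWEEN '1980-01-01' AND '1985-01-01'
  AND status = 'ok'
  AND gender = 'female';
```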
In the first case you have two indexes, and when the MySQL optimizer reads your query it must work out which plan is best.
Because there are two indexes, the optimizer spends more time deciding, since there are more possible execution plans to consider.
In the second case, MySQL positions itself at the first index page that contains status 'ok' and reads pages until gender changes to 'male', which is faster than the first case.
Try creating one index with all three columns from the WHERE clause.
It's more than likely that MySQL stops using your index after it performs a range scan over your date range.
Run the following queries in the mysql client to see how it's using your indices:
EXPLAIN EXTENDED
SELECT COUNT(*) FROM ts_user_core
WHERE birthday BETWEEN '1980-01-01' AND '1985-01-01'
AND status='ok' AND gender='female';
SHOW INDEX IN ts_user_core;
I'm guessing that your index or primary key has birthday before status and/or gender, causing a range scan. MySQL will not use any further columns of the index once it performs a range scan.
If that's the case, you can then re-arrange the columns in your index to move status and gender before birthday or create a new index specifically for this query with status and gender before birthday.
Before you re-arrange an existing index, however, make sure that no other queries your system runs depend on the current ordering.
The difference between no1 and no2 is down to the stored data being cached. If you had looked at the execution plans you would find they were exactly the same.
select count(*) from ts_user_core where (birthday between '1990-01-01' and '2000-01-01');
With an index on birthday, this query will not look at the table data (and the same holds for the query on status and gender). But MySQL will normally use only one index per table, so for a query using both predicates it will select the more specific index (shown in EXPLAIN) to resolve one predicate, then fetch the corresponding table rows (an expensive operation) to resolve the second predicate.
If you add an index with all 3 columns then you will have a covering index for the compound query. Alternatively, add the primary key (you didn't tell us the structure of the table, I'll assume "id") and...
SELECT COUNT(*)
FROM ts_user_core bday
INNER JOIN ts_user_core stamf
ON bday.id=stamf.id
WHERE bday.birthday BETWEEN '1980-01-01' AND '1985-01-01'
AND stamf.status='ok' AND stamf.gender='female';
Side note:
status='ok' AND gender='female'
Columns which have a small set of possible values and/or skewed data (such that some values are MUCH more frequent than others) tend not to work well as indexes, although the stats here suggest that might not be an issue.