MySQL: Optimizing query for records within date range - mysql

I have a table (logs) that has the following columns (there are others, but these are the important ones):
id (PK, int)
Timestamp (datetime) (index)
Duration (int)
Basically this is a record for an event that starts at a time and ends at a time. This table currently has a few hundred thousand rows in it. I expect it to grow to millions. For the purpose of speeding up queries, I have added another column and precomputed values:
EndTime (datetime) (index)
To calculate EndTime I have added the number of seconds in Duration to the Timestamp field.
Now what I want to do is run a query where the result counts the number of rows whose start (Timestamp) and end (EndTime) times fall on either side of a certain point in time, that is, events in progress at that moment. I then want to run this query for every second of a large timespan (such as a year). I would also like to count the number of rows that start at that particular point in time, and the number that end at it.
I have created the following query:
SELECT
    `dates`.`date`,
    COUNT(*) AS `total`,
    SUM(IF(`dates`.`date` = `logs`.`Timestamp`, 1, 0)) AS `new`,
    SUM(IF(`dates`.`date` = `logs`.`EndTime`, 1, 0)) AS `dropped`
FROM
    `logs`,
    (SELECT
         DATE_ADD("2010-04-13 09:45:00", INTERVAL `number` SECOND) AS `date`
     FROM numbers LIMIT 120) AS dates
WHERE dates.`date` BETWEEN `logs`.`Timestamp` AND `logs`.`EndTime`
GROUP BY `dates`.`date`;
Note that the numbers table is strictly for easily enumerating a date range. It is a table with one column, number, and contains the values 1, 2, 3, 4, 5, etc...
This gives me exactly what I am looking for... a table with 4 columns:
date
total (the number of rows that start at or before and end at or after the current point in time)
new (rows that start at this point in time)
dropped (rows that end at this point in time)
The trouble is, this query can take a significant amount of time to execute. To go through 120 seconds (as shown in the query), it takes about 10 seconds. I suspect that this is about as fast as I am going to get it, but I thought I would ask here if anyone had any ideas for improving the performance of this query.
Any suggestions would be most helpful. Thank you for your time.
Edit: I have indexes on Timestamp and EndTime.
The output of EXPLAIN on my query:
"id";"select_type";"table";"type";"possible_keys";"key";"key_len";"ref";"rows";"Extra"
"1";"PRIMARY";"<derived2>";"ALL";NULL;NULL;NULL;NULL;"120";"Using temporary; Using filesort"
"1";"PRIMARY";"logs";"ALL";"Timestamp,EndTime";NULL;NULL;NULL;"296159";"Range checked for each record (index map: 0x6)"
"2";"DERIVED";"numbers";"index";NULL;"PRIMARY";"4";NULL;"35546940";"Using index"
When I run analyze on my logs table, it says status OK.

Note in the EXPLAIN output that the join type for the logs table is "ALL" and the key is NULL, which means a full table scan is scheduled. The "Range checked for each record" message means that MySQL found no good index to use up front, but will check, for each row from the preceding table (here, dates), whether it can use a range or index_merge access method on logs. The index map 0x6 (binary 110) refers to the second and third indices on the table, likely those on Timestamp and EndTime. I take this to mean that once dates has been created, MySQL can perform a ranged lookup on logs for each date rather than a full table scan, though the per-row check is itself not very fast. If you only have indices on Timestamp and EndTime separately, try adding a composite index on both, which might let MySQL check both bounds from a single index:
CREATE INDEX `start_end` ON `logs` (`Timestamp`, `EndTime`);
I believe (though could easily be wrong) that other items in the query plan either aren't really a concern or can't be eliminated. The filesort, as an example of the latter, is likely due to the GROUP BY. In other words, this is likely the extent of what you can do with this particular query, though radically different queries or approaches that address table storage format are still possibly more efficient.
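As one illustration of the "radically different approaches" mentioned above (this is not from the answer, just a sketch under assumptions): instead of joining every second against every log row, the start and end times can each be sorted once, after which the three counts for any second come from binary searches. The sketch below is plain Python rather than SQL, treats timestamps as integer seconds, and the function name `counts_per_second` is hypothetical. Unlike the original query, it also reports seconds where no row is active.

```python
from bisect import bisect_left, bisect_right


def counts_per_second(intervals, first, last):
    """Return {t: (total, new, dropped)} for every second first <= t <= last.

    intervals: list of (start, end) pairs in integer seconds, with end >= start.
    total   = rows with start <= t <= end (same as BETWEEN, bounds inclusive)
    new     = rows with start == t
    dropped = rows with end == t
    """
    starts = sorted(s for s, _ in intervals)
    ends = sorted(e for _, e in intervals)
    result = {}
    for t in range(first, last + 1):
        started = bisect_right(starts, t)   # rows with start <= t
        finished = bisect_left(ends, t)     # rows with end < t
        new = started - bisect_left(starts, t)
        dropped = bisect_right(ends, t) - finished
        # Every finished row has also started, so the difference is the
        # number of rows still in progress at t.
        result[t] = (started - finished, new, dropped)
    return result


rows = [(0, 3), (2, 2), (2, 5), (6, 7)]
table = counts_per_second(rows, 0, 7)
for second, (total, new, dropped) in table.items():
    print(second, total, new, dropped)
```

Sorting costs O(n log n) once, and each second then costs O(log n), so a year of seconds no longer multiplies against the full table. Whether something similar can be expressed efficiently inside MySQL depends on the version and schema, so treat this as a starting point for an application-side or batch computation.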

You could look at MERGE tables to speed up the processing. Because the data is split across several underlying tables, each index is smaller, which can make lookups faster. Also, if you have multiple processors, queries against the individual underlying tables can run in parallel on separate connections, which can improve performance.

Related

fetch time is taking time

I am running the simple query below. Execution time is 1 sec, but fetch time is 30 sec. It contains 100 000 records in total.
SELECT id, referrer, timestamp
FROM masterstats_innodb
WHERE video = 1869 AND timestamp between '2011-10-01' and '2021-01-21';
An index has been created on the video and timestamp columns, and the table is even range-partitioned on timestamp. Can anything be done to fetch the result faster?
Please provide SHOW CREATE TABLE.
Plan A: INDEX(video, timestamp)
Plan B - slightly better because of being "covering":
INDEX(video, timestamp, referrer, id)
PARTITIONing will not help the performance of this query any more than indexing.
You say "it" contains 100K rows -- are you referring to the table, or just the number of rows returned? If the table, then the index will help. If the result set, then you are constrained by having to send so many rows. What will the client do with 100K rows? Can the server condense the data (e.g. summarize it in some way)?

Optimizing MySQL table for selecting many rows in date range

I have an InnoDB table in MySQL where I have to select and sum a lot of data in date ranges. It seems I can't get to a point where it runs fast enough for the use case.
The table is as follows:
user_id: int
date: date
amount: int
The table has several hundred million rows.
A date range can return up to 10 million rows.
Amount is 1-10
I have a composite index on all three columns in the order: user_id, date, amount.
The query I use for selecting is:
SELECT
SUM(amount)
FROM table
WHERE user_id = ?
AND request_date <= ?
AND request_date >= ?
I hardcode the dates into the query.
Anything else I can do to speed up this query? I should be able to do the query about 20 times a second.
It's running on DI with 8gb RAM and 4 CPUs (not dedicated).
Update
The output of EXPLAIN is:
select_type: SIMPLE
type: range
possible_keys: composite
key: composite
key_len: 7
ref: null
rows: 14994440
Extra: Using where; Using index
I've used various techniques in the past to do similar stuff.
You should consider partitioning your table. That involves creating a column that contains a partition identifier, which could be a date or a year-month.
I've had some performance increase by splitting the date and time portion. The advantage is that you can then quickly grab all data from a date by looking at the date field, without even considering the time portion.
If you know what kind of data you'll be requesting, and you can allow for some delays, you can pre-calculate. It looks like you're working with log data, so I assume that query results for anything older than today will never change. You should exploit that, for example by having a separate table with aggregated data. If you only need to calculate "today", things will be much faster. Or, if you can accept that the numbers are a bit old, you can just pre-calculate periodically.
The table that I'm talking about could be something like:
CREATE TABLE aggregated_requests AS
SELECT user_id, request_date, SUM(amount) AS amount
FROM table
GROUP BY user_id, request_date
After that, rewrite your query above like this, and it'll be extremely fast:
SELECT SUM(amount)
FROM aggregated_requests
WHERE user_id = ?
AND request_date <= ?
AND request_date >= ?
Plan A: INDEX(user_id, request_date, amount) -- optimal for the WHERE, also "covering". OK, you have that; so, on to plan B:
Plan B (even better): Build and maintain a Summary table of, say, daily subtotals. Then query that table instead. More: http://mysql.rjweb.org/doc.php/summarytables
Partitioning is unlikely to help more than a good index (as in Plan A).
More on B
If you need up-to-the-minute totals, there are multiple approaches to achieve it using summary tables without waiting until the next day.
IODKU (INSERT ... ON DUPLICATE KEY UPDATE) against the summary table at the same time (possibly in a Trigger) that you insert the row data. This keeps the summary table up to date, but with non-trivial overhead.
Hybrid. Reach into the summary table for whole days, then total up 'today' from the raw data and add it on.
Summarize by hour instead of by day. This either gives you only hourly resolution, or you can combine with the "hybrid" to make that run faster.
(My blog gives those 3, plus 3 more.)
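The "Hybrid" option above can be sketched as follows. This is a minimal illustration only, using Python's built-in sqlite3 module as a stand-in for MySQL so it can run anywhere. The table names `requests` and `daily_totals`, the fixed `TODAY` value, and the helper `hybrid_sum` are all assumptions made for the example, not part of the original schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE requests (user_id INTEGER, request_date TEXT, amount INTEGER);
CREATE TABLE daily_totals (
    user_id INTEGER,
    request_date TEXT,
    amount INTEGER,
    PRIMARY KEY (user_id, request_date)
);
""")

raw = [
    (1, "2024-01-01", 3), (1, "2024-01-01", 4),
    (1, "2024-01-02", 5), (2, "2024-01-02", 9),
    (1, "2024-01-03", 2), (1, "2024-01-03", 6),
]
conn.executemany("INSERT INTO requests VALUES (?, ?, ?)", raw)

TODAY = "2024-01-03"  # assumed fixed here; a real job would use the current date

# Periodic job: summarize only completed days, whose totals no longer change.
conn.execute(
    """INSERT INTO daily_totals
       SELECT user_id, request_date, SUM(amount)
       FROM requests
       WHERE request_date < ?
       GROUP BY user_id, request_date""",
    (TODAY,),
)


def hybrid_sum(user_id, date_from, date_to):
    """Whole days come from the summary table, today's rows from the raw table."""
    row = conn.execute(
        """SELECT
             (SELECT COALESCE(SUM(amount), 0) FROM daily_totals
               WHERE user_id = :u AND request_date >= :a
                 AND request_date <= :b AND request_date < :today)
           + (SELECT COALESCE(SUM(amount), 0) FROM requests
               WHERE user_id = :u AND request_date >= :a
                 AND request_date <= :b AND request_date >= :today)""",
        {"u": user_id, "a": date_from, "b": date_to, "today": TODAY},
    ).fetchone()
    return row[0]


print(hybrid_sum(1, "2024-01-01", "2024-01-03"))
```

The summary part touches at most one row per user per day, so only the raw rows for the current day are scanned in full.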
Other
"Amount is 1-10" -- I hope you are using a 1-byte TINYINT, not a 4-byte INT. That's 300MB of difference. Perhaps user_id could be smaller than INT.

mysql slow query when results are less than limit

I have a table with 550.000 records.
SELECT * FROM logs WHERE user = 'user1' ORDER BY date DESC LIMIT 0, 25
This query takes 0.0171 sec. Without LIMIT, there are 3537 results.
SELECT * FROM logs WHERE user = 'user2' ORDER BY date DESC LIMIT 0, 25
This query takes 3.0868 sec. Without LIMIT, there are 13 results.
The table keys are:
PRIMARY KEY (`id`),
KEY `date` (`date`)
When using "LIMIT 0,25", if there are fewer than 25 matching records, the query slows down. How can I solve this problem?
Using limit 25 allows the query to stop as soon as it has found 25 rows.
If you have 3537 matching rows out of 550.000, it will, on average, assuming equal distribution, have found 25 rows after examining 550.000/3537*25 rows = 3887 rows in a list that is ordered by date (the index on date) or a list that is not ordered at all.
If you have 13 matching rows out of 550.000, limit 25 will have to examine all 550.000 rows (about 141 times as many), so we expect roughly 0.0171 sec * 141 = 2.4 sec. There are obviously other factors that determine runtime too, but the order of magnitude fits.
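The back-of-the-envelope estimate above can be reproduced in a few lines of Python. The variable names are made up for this sketch, and it relies on the same assumption as the text: matching rows are spread evenly through the table.

```python
total_rows = 550_000
matches_user1 = 3537
limit = 25
time_user1 = 0.0171  # seconds, measured for the 'user1' query

# Rows examined, on average, before 25 matches for user1 have been seen.
expected_scanned = total_rows / matches_user1 * limit

# user2 has fewer than 25 matches, so the whole table is examined.
ratio = total_rows / expected_scanned

# Scale the measured time by the extra work.
predicted_time_user2 = time_user1 * ratio

print(round(expected_scanned), round(ratio), round(predicted_time_user2, 1))
```

The prediction of about 2.4 seconds is in the same range as the measured 3.0868 seconds, which is all this kind of estimate is meant to show.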
There is an additional effect. Unfortunately the index by date does not contain the value for user, so MySQL has to look up that value in the original table, by jumping back and forth in that table (because the data itself is ordered by the primary key). This is slower than reading the unordered table directly.
So actually, not using an index at all could be faster than using an index, if you have a lot of rows to read. You can force MySQL to not use it by using e.g. FROM logs IGNORE INDEX (date), but this will have the effect that it now has to read the whole table in absolutely every case: the last row could be the newest and thus has to be in the result set, because you ordered by date. So it might slow down your first query: reading the full 550.000 rows fast can be slower than reading 3887 rows slowly by jumping back and forth. (MySQL doesn't know this beforehand either, so it had to make a choice, which for your second query was obviously the wrong one.)
So how to get faster results?
Have an index that is ordered by user. Then the query for 'user2' can stop after 13 rows, because it knows there are no more rows. And this will now be faster than the query for 'user1', that has to look through 3537 rows and then order them afterwards by date.
The best index for your query would therefore be a composite index on (user, date), because MySQL then knows when to stop looking for further rows AND the list is already ordered the way you want it (and it should beat your 0.0171s in all cases).
Indexes require some resources too (e.g. hdd space and time to update the index when you update your table), so adding the perfect index for every single query might be counterproductive sometimes for the system as a whole.
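To see the effect of a (user, date) index without a MySQL server at hand, here is a small sketch using Python's built-in sqlite3 module. SQLite is only a stand-in: its planner and its EXPLAIN QUERY PLAN output differ from MySQL's EXPLAIN, and the index name `idx_user_date` and the sample data are invented for the example. In MySQL the equivalent index would be something like `ALTER TABLE logs ADD INDEX (user, date)`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE logs (id INTEGER PRIMARY KEY, user TEXT, date TEXT, msg TEXT)"
)
# Composite index: the equality column first, then the ORDER BY column.
conn.execute("CREATE INDEX idx_user_date ON logs (user, date)")

rows = [("user1", f"2024-01-{d:02d}", "x") for d in range(1, 29)]
rows += [("user2", "2024-01-05", "y"), ("user2", "2024-01-20", "z")]
conn.executemany("INSERT INTO logs (user, date, msg) VALUES (?, ?, ?)", rows)

query = "SELECT * FROM logs WHERE user = ? ORDER BY date DESC LIMIT 25"

# The last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
plan = " | ".join(
    row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query, ("user2",))
)
print(plan)

user1_rows = conn.execute(query, ("user1",)).fetchall()
user2_dates = [r[2] for r in conn.execute(query, ("user2",))]
print(len(user1_rows), user2_dates)
```

With this index the engine can jump straight to the entries for one user and read them in date order, so a user with only a handful of rows no longer forces a scan of the whole table.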

Two different queries on the same table with the same WHERE clause

I have two different queries. But they are both on the same table and both have the same WHERE clause. So they are selecting the same rows.
Query 1:
SELECT HOUR(timestamp), COUNT(*) as hits
FROM hits_table
WHERE timestamp >= CURDATE()
GROUP BY HOUR(timestamp)
Query 2:
SELECT country, COUNT(*) as hits
FROM hits_table
WHERE timestamp >= CURDATE()
GROUP BY country
How can I make this more efficient?
If this table is indexed correctly, it honestly doesn't matter how big the entire table is because you're only looking at today's rows.
If the table is indexed incorrectly the performance of these queries will be terrible no matter what you do.
Your WHERE timestamp >= CURDATE() clause means you need to have an index on the timestamp column. In one of your queries the GROUP BY country shows that a compound covering index on (timestamp, country) will be a great help.
So, a single compound index (timestamp, country) will satisfy both the queries in your question.
Let's explain how that works. To look for today's records (or indeed any records starting and ending with particular timestamp values) and group them by country, and count them, MySQL can satisfy the query by doing these steps:
1. random-access the index to the first record that matches the timestamp. O(log n).
2. grab the first country value from the index.
3. scan to the next country value in the index and count. O(n).
4. repeat step three until the end of the timestamp range.
This index scan operation is about as fast as a team of ace developers (the MySQL team) can get it to be with a decade of hard work. (You may not be able to outdo them on a Saturday afternoon.) MySQL satisfies the whole query with a small subset of the index, so it doesn't really matter how big the table behind it is.
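The idea behind those steps can be mimicked in plain Python. This is a simplified model, not how MySQL is implemented: a sorted list of (timestamp, country) tuples plays the role of the compound index, a binary search finds the start of the range, and a single forward scan feeds both groupings. The function name `hits_since` and the sample data are invented for the sketch.

```python
from bisect import bisect_left
from collections import Counter

# Stand-in for a (timestamp, country) index: entries kept in sorted order.
index = sorted([
    ("2024-05-01 22:15:00", "US"),
    ("2024-05-01 23:59:59", "DE"),
    ("2024-05-02 00:00:00", "DE"),
    ("2024-05-02 08:30:00", "US"),
    ("2024-05-02 08:45:10", "DE"),
    ("2024-05-02 13:05:00", "FR"),
])


def hits_since(index, since):
    """Count hits per country and per hour for entries with timestamp >= since."""
    # O(log n) seek: the 1-tuple (since,) sorts before any (since, country).
    first = bisect_left(index, (since,))
    by_country = Counter()
    by_hour = Counter()
    # O(k) scan over only the k entries in the range; the rest is never read.
    for timestamp, country in index[first:]:
        by_country[country] += 1
        by_hour[int(timestamp[11:13])] += 1
    return by_country, by_hour


by_country, by_hour = hits_since(index, "2024-05-02 00:00:00")
print(dict(by_country), dict(by_hour))
```

Both results come from the index entries alone, which is what "covering" means here: the base table rows are never visited.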
If you run one of these queries right after the other, it's possible that MySQL will still have some or all the index data blocks in a RAM cache, so it might not have to re-fetch them from disk. That will help even more.
Do you see how your example queries lead with timestamp? The most important WHERE criterion chooses a timestamp range. That's why the compound index I suggested has timestamp as its first column. If you don't have any queries that lead with country your simple index on that column probably is useless.
You asked whether you really need compound covering indexes. You probably should read about how they work and make that decision for yourself.
There's obviously a tradeoff in choosing indexes. Each index slows the process of INSERT and UPDATE a little, and can speed up queries a lot. Only you can sort out the tradeoffs for your particular application.
Since both queries have different GROUP BY clauses they are inherently different and cannot be combined. Assuming there already is an index present on the timestamp field there is no straightforward way to make this more efficient.
If the dataset is huge (10 million or more rows) you might get a little extra efficiency out of making an extra combined index on country, timestamp, but that's unlikely to be measurable, and the lack of it will usually be mitigated by MySQL's own in-memory buffering if these 2 queries are executed directly one after another.

Why changing a simple query parameter causes serious changes in MySQL query plan (explain)?

I'm experiencing a strange situation when examining my query in MySQL using the "explain" command. I've got a table which has three non-unique single-column indexes, on the columns "Period", "X", and "Y". All three of these columns have the same integer datatype. Then I examine the following commands:
EXPLAIN SELECT * FROM MyTable WHERE Period = 201208 AND X >= 0 AND Y <= 454;
EXPLAIN SELECT * FROM MyTable WHERE Period = 201304 AND X >= 0 AND Y <= 454;
The first one shows "Using index condition; Using where", but strangely the second only shows "Using where", so it seems like changing one parameter somehow eliminates indexes from the query execution.
The table has about 65000 rows in total, about 5000 per Period value (so it's balanced). The first query returns about 2000 rows, the second about 500. Also, the latter period value (201304) is not physically the "last" in the table, and the former value is not the first either; there are many rows with period values less than and greater than these two specific values.
My original table is quite complex with lots of columns, so I cannot paste it into here. But the only indexes are this three, and the query is the same as I used during testing, so I hope it should not matter too much.
Could someone give me a tip on what can cause this, and whether I need to take care of something I don't know about? Thank you.