Optimizing MySQL table for selecting many rows in date range - mysql

I have an InnoDB table in MySQL where I have to select and sum a lot of data in date ranges. It seems I can't get to a point where it runs fast enough for the use case.
The table is as follows:
user_id: int
date: date
amount: int
The table has several hundred million rows.
A date range can return up to 10 million rows.
Amount is 1-10
I have a composite index on all three columns in the order: user_id, date, amount.
The query I use for selecting is:
SELECT
SUM(amount)
FROM table
WHERE user_id = ?
AND request_date <= ?
AND request_date >= ?
I hardcode the dates into the query.
Anything else I can do to speed up this query? I should be able to do the query about 20 times a second.
It's running on DI with 8gb RAM and 4 CPUs (not dedicated).
Update
The output of EXPLAIN is:
select_type: SIMPLE
type: range
possible_keys: composite
key: composite
key_len: 7
ref: null
rows: 14994440
Extra: Using where; Using index

I've used various techniques in the past to do similar stuff.
You should consider partitioning your table. That involves creating a column that contains a partition identifier, which could be a date or a year-month value.
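For example, monthly range partitioning could look roughly like this (a sketch only; the table name, column types, and partition boundaries are made up to match the question):

-- A minimal sketch of monthly range partitioning.
-- No unique key is defined here, so the partitioning column restriction does not apply.
CREATE TABLE requests (
    user_id      INT NOT NULL,
    request_date DATE NOT NULL,
    amount       TINYINT UNSIGNED NOT NULL,   -- values 1-10 fit in TINYINT
    KEY composite (user_id, request_date, amount)
)
PARTITION BY RANGE (TO_DAYS(request_date)) (
    PARTITION p2020_01 VALUES LESS THAN (TO_DAYS('2020-02-01')),
    PARTITION p2020_02 VALUES LESS THAN (TO_DAYS('2020-03-01')),
    PARTITION pmax     VALUES LESS THAN MAXVALUE
);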
I've had some performance increase from splitting the date and time portions into separate columns. The advantage is that you can then quickly grab all data for a given date by looking at the date column alone, without even considering the time portion.
If you know what kind of data you'll be requesting, and you can tolerate some delay, you can pre-calculate. It looks like you're working with log data, so I assume that query results for anything older than today will never change. You should exploit that, for example by having a separate table with aggregated data. If you only need to calculate "today" live, things will be much faster. Or, if you can accept numbers that are a bit stale, you can simply pre-calculate periodically.
The table that I'm talking about could be something like:
CREATE TABLE aggregated_requests AS
SELECT user_id, request_date, SUM(amount) AS amount
FROM table
GROUP BY user_id, request_date
After that, rewrite your query above like this, and it'll be extremely fast:
SELECT SUM(amount)
FROM aggregated_requests
WHERE user_id = ?
AND request_date <= ?
AND request_date >= ?
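To keep that aggregate current, you could append each finished day on a schedule. A minimal sketch, assuming the raw table and column names from the examples above and that yesterday's data no longer changes:

-- Run once per day (e.g. from cron) to fold yesterday's rows into the aggregate.
INSERT INTO aggregated_requests (user_id, request_date, amount)
SELECT user_id, request_date, SUM(amount)
FROM `table`
WHERE request_date = CURDATE() - INTERVAL 1 DAY
GROUP BY user_id, request_date;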

Plan A: INDEX(user_id, request_date, amount) -- optimal for the WHERE, also "covering". OK, you have that; so, on to plan B:
Plan B (even better): Build and maintain a Summary table of, say, daily subtotals. Then query that table instead. More: http://mysql.rjweb.org/doc.php/summarytables
Partitioning is unlikely to help more than a good index (as in Plan A).
More on B
If you need up-to-the-minute totals, there are multiple approaches to achieve it using summary tables without waiting until the next day.
IODKU (INSERT ... ON DUPLICATE KEY UPDATE) against the summary table at the same time that you insert the raw row (possibly in a Trigger). This keeps the summary table up to date, but with non-trivial overhead. See the sketch after this list.
Hybrid. Reach into the summary table for whole days, then total up 'today' from the raw data and add it on.
Summarize by hour instead of by day. This either gives you only hourly resolution, or you can combine with the "hybrid" to make that run faster.
(My blog gives those 3, plus 3 more.)
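A sketch of the summary-table, IODKU, and hybrid ideas; the daily_totals name, its columns, and the use of `table` for the raw table are assumptions based on the question:

-- Daily summary table (illustrative names).
CREATE TABLE daily_totals (
    user_id      INT NOT NULL,
    request_date DATE NOT NULL,
    total_amount BIGINT NOT NULL,
    PRIMARY KEY (user_id, request_date)
);

-- IODKU: run alongside each insert into the raw table (or from a trigger)
-- so the current day's subtotal stays up to date.
INSERT INTO daily_totals (user_id, request_date, total_amount)
VALUES (?, CURDATE(), ?)
ON DUPLICATE KEY UPDATE total_amount = total_amount + VALUES(total_amount);

-- Hybrid: whole days from the summary table, plus today's raw rows
-- (assumes the requested range runs up to today).
SELECT
      ( SELECT COALESCE(SUM(total_amount), 0)
          FROM daily_totals
         WHERE user_id = ? AND request_date >= ? AND request_date < CURDATE() )
    + ( SELECT COALESCE(SUM(amount), 0)
          FROM `table`
         WHERE user_id = ? AND request_date = CURDATE() ) AS grand_total;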
Other
"Amount is 1-10" -- I hope you are using a 1-byte TINYINT, not a 4-byte INT. That's 300MB of difference. Perhaps user_id could be smaller than INT.

Related

fetch time is taking time

I am running the simple query below; execution time is 1 sec, but fetch time is 30 sec. It contains 100,000 records in total.
SELECT id, referrer, timestamp
FROM masterstats_innodb
WHERE video = 1869 AND timestamp between '2011-10-01' and '2021-01-21';
An index is created on the video and timestamp columns, and a range partition has even been created on the timestamp column. Can anything be done to fetch the result faster?
Please provide SHOW CREATE TABLE.
Plan A: INDEX(video, timestamp)
Plan B - slightly better because of being "covering":
INDEX(video, timestamp, referrer, id)
PARTITIONing will not help the performance of this query any more than indexing.
You say "it" contains 100K rows -- are you referring to the table? Or the just the number of rows returned. If 'table', then the index will help. If the 'resultset', then you are constrained by having to send so many rows. What will the client do with 100K rows?? Can the server condense the data (eg summarize it in some way)?

MySQL - Group By date/time functions on a large table

I have a bunch of financial stock data in a MySQL table. The data is stored in a 1-min tick per row format (OHLC). From that data I'd like to create 30min/hourly/daily aggregates. The problem is that the table is enormous, and grouping by date functions on the timestamp column yields horrible performance.
Ex: The following query produces the right result but ends up taking too long.
SELECT market, max(timestamp) AS TS
FROM tbl_data
GROUP BY market, DATE(timestamp), HOUR(timestamp)
ORDER BY market, TS ASC
The table has a primary key on the (market, timestamp) columns, and I have also added an additional index on the timestamp column. However, that is not of much help, as the use of date/hour functions means a table scan regardless.
How can I improve the performance? Perhaps I should consider a different database than MySQL that provides specialized date/time indexes? If so, what would be a good option?
One thing to note is that it would suffice if I could get the LAST row of each hour/day/timeframe. The database has tens of millions of rows.
MySQL version: 5.7
Thanks in advance for the help.
Edit: Here is what Explain shows on a smaller DB of the exact same format:

SQL Performance of grouping by DATE(TIMESTAMP) vs separate columns for DATE and TIME

I'm facing a problem displaying data from a MySQL database.
I have a table with all user requests in the format:
| Time (TIMESTAMP, indexed) | some other params |
I want to show this data on my website as a table with the number of requests for each day.
The query is quite simple:
SELECT DATE(Time) as D, COUNT(*) as S FROM Stats GROUP BY D ORDER BY D DESC
But when looking into EXPLAIN this drives me mad:
Using index; **Using temporary; Using filesort**
The MySQL docs say that it creates a temporary table for this query on the hard drive.
How fast would it be with 1,000,000 records? And how fast with 100,000,000?
Is there any way to put an INDEX on the result of a function?
Maybe I should create separate columns for DATE and TIME and then group by the DATE column?
What are other good ways of dealing with such problem? Caching? Another DB engine?
If you have an index on your Time column this operation is going to perform tolerably well. I'm guessing you do have that index, because your EXPLAIN output says it's using an index.
Why does this work well? Because MySQL can access this index in order -- it can scan the index -- to satisfy your query.
Don't be confused by Using temporary; Using filesort. This simply means MySQL needs to create and return a virtual table with a row for each day. That's pretty small and almost surely fits in memory. filesort doesn't necessarily mean the file has spilled to a temp file on disk; it just means MySQL has to sort the virtual table. It has to sort it to get the last day first.
By the way, if you can restrict the date range of the query, you'll get predictable performance on this query even when your application has been in use for years. Try something like this:
SELECT DATE(Time) as D, COUNT(*) as S
FROM Stats
WHERE Time >= CURDATE() - INTERVAL 30 DAY
GROUP BY D ORDER BY D DESC
First: a GROUP BY means sorting, and that is an expensive operation. The data in the index is sorted, but even then the database needs to group the dates. So I feel that indexing by DATE may help, as it will improve the speed of the query at the cost of refreshing another index on every insert. Please test it; I am not 100% sure.
Other alternatives are:
Using a table partitioned by month.
Using materialized views.
Updating a counter with every visit (see the sketch below).
Precalculating and storing everything up to yesterday, then refreshing only today's visits with something like WHERE DATE(Time) = CURDATE(). This way the server has to sort a much smaller amount of data.
It depends on how often users visit your page and when you need this data. Do not optimize prematurely if you do not need to.
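A sketch of the per-visit counter idea; the daily_hits table and its columns are made up, while Stats and Time come from the question:

-- Small per-day counter table.
CREATE TABLE daily_hits (
    visit_date DATE NOT NULL PRIMARY KEY,
    hits       INT UNSIGNED NOT NULL
);

-- Run alongside each insert into Stats.
INSERT INTO daily_hits (visit_date, hits)
VALUES (CURDATE(), 1)
ON DUPLICATE KEY UPDATE hits = hits + 1;

-- The report then reads the tiny counter table instead of scanning Stats.
SELECT visit_date AS D, hits AS S
FROM daily_hits
ORDER BY visit_date DESC;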

How to prevent MySQL selecting one index when a better one is available?

I have a table with 30,000 rows (and growing), which I join with another table. On some pages, I need to run some 100+ of those queries, and things get slow. If I EXPLAIN the query, I notice that one table uses a primary key and is fast, but the other table uses one of its indexes, which is not the best one. Here's an overview:
SIMPLE | acc_entries | ref | ledger,date,type,status,status_ledger_date_type | type | 1 | const | 15359 | Using where
This is a sample query:
SELECT SUM(usd) AS total FROM acc_entries
LEFT JOIN acc_ledgers ON acc_entries.ledger = acc_ledgers.id
WHERE acc_entries.status = 1 AND
acc_ledgers.account = 3004 AND
date >= '2011-01-01' AND
date <= '2011-08-30' AND
type = 'credit'
As you can see, I am using in my WHERE clause the fields status, ledger (which is the field that joins with acc_ledgers.account), date and type. All of these fields have indexes. However, there is also a combined index that covers all of them, in that same order. It is called status_ledger_date_type, and as you can see it is one of the indexes that MySQL considers using. However, in the end MySQL opts to use type as the index. That index matches some 15,000 possible rows (half of the table), whereas the combined index would match only a fraction of that. So my question is: why does MySQL select this index when a better one is available, and how can I prevent this?
You can try using index hints to force the use of your desired index.
MySql docs on Index Hints
The Battle Between Force Index and the Query Optimizer
7 ways to convince MySQL to use the right index
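For example, a sketch using FORCE INDEX on the query from the question (the index name is taken from the EXPLAIN output above):

-- Force the combined index instead of the narrower type index.
SELECT SUM(usd) AS total
FROM acc_entries FORCE INDEX (status_ledger_date_type)
LEFT JOIN acc_ledgers ON acc_entries.ledger = acc_ledgers.id
WHERE acc_entries.status = 1
  AND acc_ledgers.account = 3004
  AND date >= '2011-01-01'
  AND date <= '2011-08-30'
  AND type = 'credit';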
Actually, you want your index based on the smaller granularity. The Ledger from your Acc_Entries table will join to your Acc_Ledgers table on ITS primary key of ID, so Acc_Ledgers is not really utilizing the Ledger portion for the WHERE clause. Your index should match the WHERE clause of your common queries as closely as possible. In this case, I would have an index on
(Account, Status, Type, Date)
The reason for Account being first is the smaller result set. You could have 5,000 entries; of those, 300 entries for the one account, so you've already eliminated a huge amount of data to go through. Then the Status: of the 300, you could have 100 at status 1, 100 at status 2 and 100 at status 3, so you've now reduced the set even more, and so on with the other criteria of type and date.
Your query is otherwise completely fine. Just as a personal style of writing, I try to write my queries with the WHERE conditions matching the index as closely as possible, in the same sequence, so I would put the Account clause first, then Status, Type and Date... but again, that's a personal style of writing queries.

MySQL: Optimizing query for records within date range

I have a table (logs) that has the following columns (there are others, but these are the important ones):
id (PK, int)
Timestamp (datetime) (index)
Duration (int)
Basically this is a record for an event that starts at a time and ends at a time. This table currently has a few hundred thousand rows in it. I expect it to grow to millions. For the purpose of speeding up queries, I have added another column and precomputed values:
EndTime (datetime) (index)
To calculate EndTime I have added the number of seconds in Duration to the Timestamp field.
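For reference, the backfill could look something like this (a sketch; the table and column names follow the question):

-- Populate the precomputed EndTime column from Timestamp + Duration (in seconds).
-- New rows would set EndTime on insert (in the application or a trigger).
UPDATE logs
SET EndTime = DATE_ADD(`Timestamp`, INTERVAL Duration SECOND);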
Now what I want to do is run a query that counts the number of rows whose start (Timestamp) and end (EndTime) times span a certain point in time. I then want to run this query for every second over a large timespan (such as a year). I would also like to count the number of rows that start at a particular point in time, and the number that end at it.
I have created the following query:
SELECT
`dates`.`date`,
COUNT(*) AS `total`,
SUM(IF(`dates`.`date`=`logs`.`Timestamp`, 1, 0)) AS `new`,
SUM(IF(`dates`.`date`=`logs`.`EndTime`, 1, 0)) AS `dropped`
FROM
`logs`,
(SELECT
DATE_ADD("2010-04-13 09:45:00", INTERVAL `number` SECOND) AS `date`
FROM numbers LIMIT 120) AS dates
WHERE dates.`date` BETWEEN `logs`.`Timestamp` AND `logs`.`EndTime`
GROUP BY `dates`.`date`;
Note that the numbers table is strictly for easily enumerating a date range. It is a table with one column, number, and contains the values 1, 2, 3, 4, 5, etc...
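For completeness, a numbers table like that can be built once up front with a small cross join (a sketch; the digits helper table is made up):

-- Helper table of consecutive integers used to enumerate seconds.
CREATE TABLE numbers (
    number INT UNSIGNED NOT NULL PRIMARY KEY
);

CREATE TABLE digits (d INT NOT NULL);
INSERT INTO digits VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);

-- 0..99999: more than enough to cover a full day's worth of seconds.
INSERT INTO numbers (number)
SELECT d1.d + d2.d*10 + d3.d*100 + d4.d*1000 + d5.d*10000
FROM digits d1, digits d2, digits d3, digits d4, digits d5;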
This gives me exactly what I am looking for... a table with 4 columns:
date
total (the total rows whose start and end span the current point in time)
new (rows that start at this point in time)
dropped (rows that end at this point in time)
The trouble is, this query can take a significant amount of time to execute. To go through 120 seconds (as shown in the query), it takes about 10 seconds. I suspect that this is about as fast as I am going to get it, but I thought I would ask here if anyone had any ideas for improving the performance of this query.
Any suggestions would be most helpful. Thank you for your time.
Edit: I have indexes on Timestamp and EndTime.
The output of EXPLAIN on my query:
"id";"select_type";"table";"type";"possible_keys";"key";"key_len";"ref";"rows";"Extra"
"1";"PRIMARY";"<derived2>";"ALL";NULL;NULL;NULL;NULL;"120";"Using temporary; Using filesort"
"1";"PRIMARY";"logs";"ALL";"Timestamp,EndTime";NULL;NULL;NULL;"296159";"Range checked for each record (index map: 0x6)"
"2";"DERIVED";"numbers";"index";NULL;"PRIMARY";"4";NULL;"35546940";"Using index"
When I run analyze on my logs table, it says status OK.
Note in the EXPLAIN output that the join type for the logs table is "ALL" and the key is NULL, which means a full table scan is scheduled. The "Range checked for each record" message means that MySQL uses the range access method on logs after examining column values from somewhere else in the result. I take this to mean that once dates has been created, MySQL can perform a ranged join on logs using the second and third indices (likely those on Timestamp and EndTime) rather than performing a full table scan. If you only have indices on Timestamp and EndTime separately, try adding an index on both, which might result in a more efficient join type (e.g. index_merge rather than range):
CREATE INDEX `start_end` ON `logs` (`Timestamp`, `EndTime`);
I believe (though could easily be wrong) that other items in the query plan either aren't really a concern or can't be eliminated. The filesort, as an example of the latter, is likely due to the GROUP BY. In other words, this is likely the extent of what you can do with this particular query, though radically different queries or approaches that address table storage format are still possibly more efficient.
You could look at MERGE tables to speed up the processing. With merge tables, since the data is split across smaller tables, the indexes are smaller, resulting in faster fetching. Also, if you have multiple processors, the searches can happen in parallel, increasing performance.
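A rough sketch of what a MERGE setup could look like (names are illustrative; note that MERGE only works over identical MyISAM tables, so it would not apply to an InnoDB logs table without converting the storage engine):

-- Monthly shards with identical definitions.
CREATE TABLE logs_2010_03 (
    id          INT NOT NULL,
    `Timestamp` DATETIME,
    Duration    INT,
    EndTime     DATETIME,
    KEY (`Timestamp`),
    KEY (EndTime)
) ENGINE=MyISAM;

CREATE TABLE logs_2010_04 LIKE logs_2010_03;

-- The MERGE table presents the shards as one logical table.
CREATE TABLE logs_all (
    id          INT NOT NULL,
    `Timestamp` DATETIME,
    Duration    INT,
    EndTime     DATETIME,
    KEY (`Timestamp`),
    KEY (EndTime)
) ENGINE=MERGE UNION=(logs_2010_03, logs_2010_04) INSERT_METHOD=LAST;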