I have a bunch of financial stock data in a MySQL table. The data is stored in a 1min tick per row format (OHLC). From that data I'd like to create 30min/hourly/daily aggregates. The problem is that the table is enormous, and grouping by date functions on the timestamp column yields horrible performance.
Ex: The following query produces the right result but ends up taking too long.
SELECT market, max(timestamp) AS TS
FROM tbl_data
GROUP BY market, DATE(timestamp), HOUR(timestamp)
ORDER BY market, TS ASC
The table has a primary index on the (market, timestamp) columns. And I have also added an additional index on the timestamp column. However, that is not of much help as the usage of date/hour functions means a table scan regardless.
How can I improve the performance? Perhaps I should consider a different database than MySQL that provides specialized date/time indexes? If so, what would be a good option?
One thing to note is that it would suffice if I could get the LAST row of each hour/day/timeframe. The database has tens of millions of rows.
MySQL version: 5.7
Thanks in advance for the help.
Edit: Here is what Explain shows on a smaller DB of the exact same format:
I have an InnoDB table in MySQL where I have to select and sum a lot of data in date ranges. It seems I can't get to a point where it runs fast enough for the use case.
The table is as follows:
user_id: int
date: date
amount: int
The table has several hundred million rows.
A date range can return up to 10 million rows.
Amount is 1-10
I have a composite index on all three columns in the order: user_id, date, amount.
The query I use for selecting is:
SELECT
SUM(amount)
FROM table
WHERE user_id = ?
AND request_date <= ?
AND request_date >= ?
I hardcode the dates into the query.
Anything else I can do to speed up this query? I should be able to do the query about 20 times a second.
It's running on DI with 8gb RAM and 4 CPUs (not dedicated).
Update
The output of EXPLAIN is:
select_type: SIMPLE
type: range
possible_keys: composite
key: composite
key_len: 7
ref: null
rows: 14994440
Extra: Using where; Using index
I've used various techniques in the past to do similar stuff.
You should consider partitioning your table. That involves creating a column that contains a partition identifier, which could be a date or a year-month.
I've had some performance increase by splitting the date and time portion. The advantage is that you can then quickly grab all data from a date by looking at the date field, without even considering the time portion.
If you know what kind of data you'll be requesting, and you can allow for some delays, you can pre-calculate. It looks like you're working with log data, so I assume that query results for anything older than today will never change. You should exploit that, for example by having a separate table with aggregated data. If you only need to calculate "today", things will be much faster. Or, if you can accept numbers that are a bit stale, you can just pre-calculate periodically.
The table that I'm talking about could be something like:
CREATE TABLE aggregated_requests AS
SELECT user_id, request_date, SUM(amount) AS amount
FROM table
GROUP BY user_id, request_date
After that, rewrite your query above like this, and it'll be extremely fast:
SELECT SUM(amount)
FROM aggregated_requests
WHERE user_id = ?
AND request_date <= ?
AND request_date >= ?
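A minimal sketch of the pre-aggregation idea above, using Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere. The raw table name `requests`, the index name and the sample rows are made up for illustration; the point is only that the summary table holds one row per user per day and gives the same total as the raw data.

```python
import sqlite3

# SQLite stands in for MySQL; the `requests` table and its rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE requests (user_id INTEGER, request_date TEXT, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO requests VALUES (?, ?, ?)",
    [
        (1, "2020-01-01", 3),
        (1, "2020-01-01", 5),
        (1, "2020-01-02", 2),
        (2, "2020-01-01", 7),
    ],
)

# One row per (user_id, request_date) holding that day's subtotal.
conn.execute(
    """
    CREATE TABLE aggregated_requests AS
    SELECT user_id, request_date, SUM(amount) AS amount
    FROM requests
    GROUP BY user_id, request_date
    """
)
conn.execute(
    "CREATE INDEX idx_agg ON aggregated_requests (user_id, request_date)"
)


def total(table, user_id, start, end):
    """Sum `amount` for one user over an inclusive date range."""
    # The table name is interpolated only because it is a fixed, trusted
    # name in this sketch; the values themselves are bound as parameters.
    sql = (
        f"SELECT COALESCE(SUM(amount), 0) FROM {table} "
        "WHERE user_id = ? AND request_date >= ? AND request_date <= ?"
    )
    return conn.execute(sql, (user_id, start, end)).fetchone()[0]


raw_total = total("requests", 1, "2020-01-01", "2020-01-02")
summary_total = total("aggregated_requests", 1, "2020-01-01", "2020-01-02")
summary_rows = conn.execute(
    "SELECT COUNT(*) FROM aggregated_requests"
).fetchone()[0]
```

Both totals agree, but the summary query only has to read one row per day instead of every raw row in the range.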
Plan A: INDEX(user_id, request_date, amount) -- optimal for the WHERE, also "covering". OK, you have that; so, on to plan B:
Plan B (even better): Build and maintain a Summary table of, say, daily subtotals. Then query that table instead. More: http://mysql.rjweb.org/doc.php/summarytables
Partitioning is unlikely to help more than a good index (as in Plan A).
More on B
If you need up-to-the-minute totals, there are multiple approaches to achieve it using summary tables without waiting until the next day.
IODKU (INSERT ... ON DUPLICATE KEY UPDATE) against the summary table at the same time (possibly in a Trigger) that you insert the row data. This keeps the summary table up to date, but with non-trivial overhead.
Hybrid. Reach into the summary table for whole days, then total up 'today' from the raw data and add it on.
Summarize by hour instead of by day. This either gives you only hourly resolution, or you can combine with the "hybrid" to make that run faster.
(My blog gives those 3, plus 3 more.)
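The "hybrid" approach could be sketched roughly as follows. This is an illustration only: it uses Python's sqlite3 module in place of MySQL, and the table names `requests` and `daily_summary`, the fixed `TODAY` value and the sample rows are all assumptions, not something from the question.

```python
import sqlite3

# SQLite stands in for MySQL; all table names and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE requests (user_id INTEGER, request_date TEXT, amount INTEGER);
    CREATE TABLE daily_summary (
        user_id INTEGER,
        request_date TEXT,
        amount INTEGER,
        PRIMARY KEY (user_id, request_date)
    );
    """
)

TODAY = "2020-01-03"  # fixed so the example is deterministic

conn.executemany(
    "INSERT INTO requests VALUES (?, ?, ?)",
    [
        (1, "2020-01-01", 4),
        (1, "2020-01-02", 6),
        (1, "2020-01-03", 1),
        (1, "2020-01-03", 2),
    ],
)

# Periodic job: summarize completed days only, so today is never double counted.
conn.execute(
    """
    INSERT INTO daily_summary
    SELECT user_id, request_date, SUM(amount)
    FROM requests
    WHERE request_date < ?
    GROUP BY user_id, request_date
    """,
    (TODAY,),
)


def hybrid_total(user_id, start, end):
    """Whole days come from the summary table, today from the raw rows."""
    past = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM daily_summary "
        "WHERE user_id = ? AND request_date >= ? AND request_date <= ?",
        (user_id, start, end),
    ).fetchone()[0]
    today_part = 0
    if start <= TODAY <= end:  # ISO dates compare correctly as strings
        today_part = conn.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM requests "
            "WHERE user_id = ? AND request_date = ?",
            (user_id, TODAY),
        ).fetchone()[0]
    return past + today_part


including_today = hybrid_total(1, "2020-01-01", "2020-01-03")
past_only = hybrid_total(1, "2020-01-01", "2020-01-02")
```

The raw table is touched only for the current day, so the expensive part of the scan stays small no matter how wide the requested range is.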
Other
"Amount is 1-10" -- I hope you are using a 1-byte TINYINT, not a 4-byte INT. That's 300MB of difference. Perhaps user_id could be smaller than INT.
I have a big database with about 3 million records with records containing a time stamp.
Now I want to select one record per month and it works using this query:
SELECT timestamp, id, gas_used, kwh_used1, kwh_used2 FROM energy
GROUP BY MONTH(timestamp) ORDER BY timestamp ASC
It works but it is very slow.
I have indexes on id and on timestamp.
What can I do to make this query fast?
GROUP BY MONTH(timestamp) forces the engine to look at each record individually (a sequential scan), which is obviously very slow when you have 3M records.
A common solution is to add an indexed column holding just the criterion you want to select on. However, I strongly suspect that you will actually want to select on year-month, unless your db is reset every year.
To avoid data corruption issues, it may be best to create an insert trigger that automatically fills that field. That way this extra column doesn't interfere with your business logic.
It is not a good practice to SELECT columns that don't appear in GROUP BY statement, unless they are handled with aggregating function such as MIN(), MAX(), SUM() etc.
In your query this applies to columns:
id, gas_used, kwh_used1, kwh_used2
You will not get the "earliest" (by timestamp) row for each month in this case.
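One way to really get the earliest row of each month is to compute MIN(timestamp) per month in a subquery and join back to the table. The sketch below runs on SQLite through Python's sqlite3 module, so it uses `strftime('%Y-%m', ...)` for the year-month key; in MySQL, `DATE_FORMAT(timestamp, '%Y-%m')` would play that role. The sample rows are invented and only one of the measurement columns is included.

```python
import sqlite3

# SQLite stands in for MySQL; the sample rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE energy (id INTEGER PRIMARY KEY, timestamp TEXT, gas_used REAL)"
)
conn.executemany(
    "INSERT INTO energy (timestamp, gas_used) VALUES (?, ?)",
    [
        ("2014-01-15 08:00:00", 11.0),
        ("2014-01-02 09:30:00", 10.0),
        ("2014-02-20 07:00:00", 21.0),
        ("2014-02-03 12:00:00", 20.0),
    ],
)

# Inner query: earliest timestamp of every year-month.
# Outer query: fetch the complete row that carries that timestamp.
rows = conn.execute(
    """
    SELECT e.timestamp, e.id, e.gas_used
    FROM energy AS e
    JOIN (
        SELECT MIN(timestamp) AS first_ts
        FROM energy
        GROUP BY strftime('%Y-%m', timestamp)
    ) AS firsts ON e.timestamp = firsts.first_ts
    ORDER BY e.timestamp
    """
).fetchall()
```

If two rows share the same earliest timestamp in a month, both are returned, so a tie-breaker (for example the smallest id) would be needed where that can happen.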
More:
https://dev.mysql.com/doc/refman/5.7/en/group-by-handling.html
I'm not a database specialist, therefore I'm coming here for a little help.
I have plenty of measured data and I want to help myself with data manipulation. Here is my situation:
There are about 10 stations, measuring every day. Every day, each one produces about 3000 rows (with about 15 columns) of data. Data has to be downloaded once a day from every station to the centralized server. That means about 30 000 rows inserted into the database every day. (Daily counts can vary.)
Now, I already have data from the past few years, so for every station I have a few million rows. There are also about 20 "dead" stations that don't work anymore, but there is data from them covering a few years.
Sum this all up and we get 50+ million rows, produced by 30 stations, with about 30 000 rows inserted every day. Looking ahead, let's assume 100 million rows in the database.
My question is obvious - how would you suggest to store this data?
Measured values(columns) are only numbers (int, or double + datetime) - no text, or fulltext search, basically the only index I need is DATETIME.
Data will not be updated, nor deleted. I just need a fast select of a range of data (eg. from 1.1.2010 to 3.2.2010)
So as I wrote, I want to use MySQL because that's the database I know best. I've read, that it should easily handle this amount of data, but still, I appreciate any suggestion for this very situation.
Again:
10 stations, 3000 rows per day each => about 30 000 inserts per day
about 40-50 million rows yet to be inserted from binary files
DB is going to grow (100+ millions of rows)
The only thing I need is to SELECT data as fast as possible.
As far as I know, MySQL should handle this amount of data. I also know that my only index will be date and time in DATETIME type (should be faster than others, am I right?)
The thing I can't decide is whether to create one huge table with 50+ million rows (with a station id), or a separate table for every station. Basically, I don't need to perform any JOIN on these stations. If I need to do time coincidence, I can just select the same range of time on the stations. Are there any advantages or disadvantages to these approaches?
Can anyone confirm/decline my thoughts? Do you think, that there is a better solution? I appreciate any help or discussion.
MySQL should be able to handle this pretty well. Instead of indexing just your DATETIME column, I suggest you create two compound indexes, as follows:
(datetime, station)
(station, datetime)
Having both these indexes in place will help accelerate queries that choose date ranges and group by stations or vice versa. The first index will also serve the purpose that just indexing datetime will serve.
You have not told us what your typical query is. Nor have you told us whether you plan to age out old data. Your data is an obvious candidate for range partitioning (http://dev.mysql.com/doc/refman/5.6/en/partitioning-range.html) but we'd need more information to help you design a workable partitioning criterion.
Edit after reading your comments.
A couple of things to keep in mind as you build up this system.
First, don't bother with partitions for now.
Second, I would get everything working with a single table. Don't split stuff by station or year. Get yourself the fastest disk storage system you can afford and a lot of RAM for your MySQL server and you should be fine.
Third, take some downtime once in a while to do OPTIMIZE TABLE; this will make sure your indexes are good.
Fourth, don't use SELECT * unless you know you need all the columns in the table. Why? Because
SELECT datetime, station, temp, dewpoint
FROM table
WHERE datetime >= DATE(NOW() - INTERVAL 60 DAY)
ORDER BY station, datetime
can be directly satisfied from sequential access to a compound covering index on
(station, datetime, temp, dewpoint)
whereas
SELECT *
FROM table
WHERE datetime >= DATE(NOW() - INTERVAL 60 DAY)
ORDER BY station, datetime
needs to random-access your table. You should read up on compound covering indexes.
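To see the covering-index effect without a MySQL server, here is a small sketch using SQLite through Python's sqlite3 module. SQLite's planner is not MySQL's, so treat this purely as an illustration of the principle; the table name `readings`, the index name `idx_cover` and the extra `pressure` column are assumptions added for the example.

```python
import sqlite3

# SQLite stands in for MySQL; `readings`, `idx_cover` and `pressure` are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings "
    "(datetime TEXT, station INTEGER, temp REAL, dewpoint REAL, pressure REAL)"
)
conn.execute(
    "CREATE INDEX idx_cover ON readings (station, datetime, temp, dewpoint)"
)


def plan(sql):
    """Return the planner's description of how it would run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column is the detail text


TAIL = " FROM readings WHERE datetime >= '2010-01-01' ORDER BY station, datetime"

# Every selected column lives in the index, so the table itself is never read.
narrow_plan = plan("SELECT datetime, station, temp, dewpoint" + TAIL)

# SELECT * also needs `pressure`, which the index does not contain.
star_plan = plan("SELECT *" + TAIL)
```

In MySQL the analogous signal is "Using index" in the Extra column of EXPLAIN, which shows up for the narrow query but not for SELECT *.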
Fifth, avoid the use of functions with column names in your WHERE clauses. Don't say
WHERE YEAR(datetime) >= 2003
or anything like that. MySQL can't use indexes for that kind of query. Instead say
WHERE datetime >= '2003-01-01'
to allow the indexes to be exploited.
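The same point can be checked with a query plan. The sketch below uses SQLite through Python's sqlite3 module as a stand-in, with `strftime('%Y', ...)` playing the role of MySQL's `YEAR()`; the table name `readings` and the index name `idx_datetime` are made up. The planner details differ from MySQL's, but the principle (a function wrapped around the column hides it from the index) is the same.

```python
import sqlite3

# SQLite stands in for MySQL; `readings` and `idx_datetime` are hypothetical names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (datetime TEXT, station INTEGER, temp REAL)")
conn.execute("CREATE INDEX idx_datetime ON readings (datetime)")


def plan(sql):
    """Return the planner's description of how it would run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column is the detail text


# Function applied to the column: the index on `datetime` cannot be used.
wrapped_plan = plan(
    "SELECT temp FROM readings WHERE strftime('%Y', datetime) >= '2003'"
)

# Bare column compared with a constant: a range search on the index.
bare_plan = plan("SELECT temp FROM readings WHERE datetime >= '2003-01-01'")
```

Printing the two plans shows a full scan for the wrapped predicate and an index search for the bare one.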
I'm facing a problem of displaying data from MySQL database.
I have a table with all user requests in the format:
| TIMESTAMP Time / +INDEX | Some other params |
I want to show this data on my website as a table with number of requests in each day.
The query is quite simple:
SELECT DATE(Time) as D, COUNT(*) as S FROM Stats GROUP BY D ORDER BY D DESC
But when looking into EXPLAIN this drives me mad:
Using index; **Using temporary; Using filesort**
From MySQL docs it says that it creates temporary table for this query on hard drive.
How fast it would be with 1.000.000 records? And how fast with 100.000.000?
Is there any way to put INDEX on result of function?
Maybe I should create separate columns for DATE and TIME and then group by the DATE column?
What are other good ways of dealing with such problem? Caching? Another DB engine?
If you have an index on your Time column this operation is going to perform tolerably well. I'm guessing you do have that index, because your EXPLAIN output says it's using an index.
Why does this work well? Because MySQL can access this index in order -- it can scan the index -- to satisfy your query.
Don't be confused by Using temporary; Using filesort. This simply means MySQL needs to create and return a virtual table with a row for each day. That's pretty small and almost surely fits in memory. filesort doesn't necessarily mean the file has spilled to a temp file on disk; it just means MySQL has to sort the virtual table. It has to sort it to get the last day first.
By the way, if you can restrict the date range of the query you'll get predictable performance on this query even when your application has been in use for years. Try something like this:
SELECT DATE(Time) as D, COUNT(*) as S
FROM Stats
WHERE Time >= CURDATE() - INTERVAL 30 DAY
GROUP BY D ORDER BY D DESC
First: a GROUP BY means sorting, and that is an expensive operation. The data in the index is sorted, but even so the database needs to group the dates. So I feel that indexing by DATE may help: it would speed up the query at the cost of maintaining another index on every insert. Please test it; I am not 100% sure.
Other alternatives are:
Using a partitioned table by month.
Using a materialized view.
Updating a counter with every visit.
Precalculating and storing yesterday's data. Just refresh your daily visits with a WHERE DAY(timestamp) = TODAY. This way the server will have to sort a smaller amount of data.
It depends on how often users visit your page and when you need this data. Do not optimize prematurely if you do not need to.
I have MySQL 5.6.12 Community Server.
I am trying to partition my MySQL InnoDB table, which contains 5M (and always growing) rows of history data. It is getting slower and slower and I figured partitioning would solve it.
I have columns.
stationID int(4)
valueNumberID int(5)
logTime timestamp
value double
(stationID,valueNumberID,logTime) is my PRIMARY key.
I have 50 different stationID's. From each station comes history data and I need to store it for 5 years. There are only 2-5 different valueNumberID's from each stationID but hundreds of value changes per day. Each query in the system uses stationID,valueNumberID and logTime in that order. In most cases the queries are limited to current year.
I would like to partition by stationID, with each stationID having its own partition so that queries scan a smaller physical table, and then further reduce the size by subpartitioning by logTime. I do not know how to create a separate partition for each of the 50 stationIDs and subpartition them by timestamp.
Thank you for your replies. SELECT queries are getting slower. To me it seems they slow down linearly as the table grows. The issue must be with the GROUP BY clause. This is an example query:
SELECT DATE_FORMAT(logTime,"%Y%m%d%H%i%s") AS 'logTime', SUM(Value)
FROM His
WHERE stationID=23 AND valueNumberID=4
AND logTime > '2013-01-01 00:00:00' AND logTime < '2013-11-14 00:00:00'
GROUP BY DATE_FORMAT(logTime,"%Y%m")
ORDER BY logTime LIMIT 0,120;
The objective of queries like this is to give AVG, MAX, MIN or SUM at hour, day, week or month intervals. The result of the query is tightly bound to how the results are presented to the user in various ways (graph, Excel file), and it would take a long time to change that if I changed the queries. So I was looking for an easy way out with partitioning.
An estimated 1.2-1.4M rows per month come into this table.
Thank you