MySQL partitioning with int and timestamp

I have MySQL 5.6.12 Community Server.
I am trying to partition my MySQL InnoDB table, which contains 5M rows of history data (and is always growing). Queries are getting slower and slower, and I figured partitioning would solve it.
I have these columns:
stationID int(4)
valueNumberID int(5)
logTime timestamp
value double
(stationID, valueNumberID, logTime) is my PRIMARY KEY.
I have 50 different stationIDs. From each station comes history data, and I need to store it for 5 years. There are only 2-5 different valueNumberIDs per stationID, but hundreds of value changes per day. Every query in the system uses stationID, valueNumberID, and logTime, in that order. In most cases the queries are limited to the current year.
I would like to partition by stationID, with each stationID having its own partition so that queries scan a smaller physical table, and further reduce the scanned size by subpartitioning on logTime. I do not know how to create a partition for each of the 50 stationIDs and then subpartition them by the timestamp.
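For reference, a hedged sketch of the kind of DDL involved (it assumes the table is named His, as in the example query below). With a TIMESTAMP column, MySQL 5.6 only accepts UNIX_TIMESTAMP(logTime) as a partitioning expression, and HASH subpartitions are not pruned for date ranges, so yearly RANGE partitions on logTime, combined with the existing primary key that already leads with stationID, are usually more practical than one LIST partition per station:
ALTER TABLE His
PARTITION BY RANGE (UNIX_TIMESTAMP(logTime)) (
    PARTITION p2013 VALUES LESS THAN (UNIX_TIMESTAMP('2014-01-01 00:00:00')),
    PARTITION p2014 VALUES LESS THAN (UNIX_TIMESTAMP('2015-01-01 00:00:00')),
    PARTITION p2015 VALUES LESS THAN (UNIX_TIMESTAMP('2016-01-01 00:00:00')),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);
This works because logTime is part of the primary key (MySQL requires the partitioning column to appear in every unique key), and queries limited to the current year are then pruned to a single partition.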
Thank you for your replies. SELECT queries are getting slower; to me it seems they slow down linearly as the table grows. The issue must be with the GROUP BY. This is an example query:
SELECT DATE_FORMAT(logTime, "%Y%m%d%H%i%s") AS 'logTime', SUM(Value)
FROM His
WHERE stationID = 23 AND valueNumberID = 4
AND logTime > '2013-01-01 00:00:00' AND logTime < '2013-11-14 00:00:00'
GROUP BY DATE_FORMAT(logTime, "%Y%m")
ORDER BY logTime
LIMIT 0, 120;
The objective of queries like this is to give AVG, MAX, MIN, or SUM over hour, day, week, or month intervals. The result of the query is tied tightly to how the results are presented to the user (graph, Excel file), and it would take a long time to change that presentation if I changed the queries. So I was looking for an easy way out with partitioning.
An estimated 1.2-1.4M rows per month are added to this table.
Thank you

Related

Optimizing MySQL table for selecting many rows in date range

I have an InnoDB table in MySQL where I have to select and sum a lot of data in date ranges. It seems I can't get to a point where it runs fast enough for the use case.
The table is as follows:
user_id: int
request_date: date
amount: int
The table has several hundred million rows.
A date range can return up to 10 million rows.
Amount is 1-10
I have a composite index on all three columns in the order: user_id, request_date, amount.
The query I use for selecting is:
SELECT
SUM(amount)
FROM table
WHERE user_id = ?
AND request_date <= ?
AND request_date >= ?
I hardcode the dates into the query.
Anything else I can do to speed up this query? I should be able to do the query about 20 times a second.
It's running on DI with 8 GB RAM and 4 CPUs (not dedicated).
Update
The output of EXPLAIN is:
select_type: SIMPLE
type: range
possible_keys: composite
key: composite
key_len: 7
ref: null
rows: 14994440
Extra: Using where; Using index
I've used various techniques in the past to do similar stuff.
You should consider partitioning your table. That involves creating a column that contains a partition identifier, which could be a date or a year-month value.
I've had some performance increase by splitting the date and time portion. The advantage is that you can then quickly grab all data from a date by looking at the date field, without even considering the time portion.
If you know what kind of data you'll be requesting, and you can allow for some delay, you can pre-calculate. It looks like you're working with log data, so I assume that query results for anything older than today will never change. You should exploit that, for example by having a separate table with aggregated data. If you only need to calculate "today", things will be much faster. Or, if you can accept that the numbers are a bit old, you can just pre-calculate periodically.
The table that I'm talking about could be something like:
CREATE TABLE aggregated_requests AS
SELECT user_id, request_date, SUM(amount) AS amount
FROM table
GROUP BY user_id, request_date;
After that, rewrite your query above like this, and it'll be extremely fast:
SELECT SUM(amount)
FROM aggregated_requests
WHERE user_id = ?
AND request_date <= ?
AND request_date >= ?
Plan A: INDEX(user_id, request_date, amount) -- optimal for the WHERE, also "covering". OK, you have that; so, on to plan B:
Plan B (even better): Build and maintain a Summary table of, say, daily subtotals. Then query that table instead. More: http://mysql.rjweb.org/doc.php/summarytables
Partitioning is unlikely to help more than a good index (as in Plan A).
More on B
If you need up-to-the-minute totals, there are multiple approaches to achieve it using summary tables without waiting until the next day.
IODKU (INSERT ... ON DUPLICATE KEY UPDATE) against the summary table at the same time (possibly in a trigger) that you insert the row data; see the sketch after this list. This keeps the summary table up to date, but with non-trivial overhead.
Hybrid. Reach into the summary table for whole days, then total up 'today' from the raw data and add it on.
Summarize by hour instead of by day. This either gives you only hourly resolution, or you can combine with the "hybrid" to make that run faster.
(My blog gives those 3, plus 3 more.)
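A hedged sketch of the IODKU option above; the daily_totals table and its column names are assumptions, not part of the original answer:
CREATE TABLE daily_totals (
    user_id      INT    NOT NULL,
    request_date DATE   NOT NULL,
    total_amount BIGINT NOT NULL,
    PRIMARY KEY (user_id, request_date)
);
-- Run alongside each insert into the raw table (or from a trigger):
INSERT INTO daily_totals (user_id, request_date, total_amount)
VALUES (?, ?, ?)
ON DUPLICATE KEY UPDATE total_amount = total_amount + VALUES(total_amount);
Whole days are then read from daily_totals, and only today's rows need to be totalled from the raw data (the hybrid approach).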
Other
"Amount is 1-10" -- I hope you are using a 1-byte TINYINT, not a 4-byte INT. That's 300MB of difference. Perhaps user_id could be smaller than INT.

MySQL - Group By date/time functions on a large table

I have a bunch of financial stock data in a MySQL table. The data is stored in a 1-minute tick per row format (OHLC). From that data I'd like to create 30min/hourly/daily aggregates. The problem is that the table is enormous, and grouping by date functions on the timestamp column yields horrible performance.
Ex: The following query produces the right result but ends up taking too long.
SELECT market, max(timestamp) AS TS
FROM tbl_data
GROUP BY market, DATE(timestamp), HOUR(timestamp)
ORDER BY market, TS ASC
The table has a primary index on the (market, timestamp) columns. And I have also added an additional index on the timestamp column. However, that is not of much help as the usage of date/hour functions means a table scan regardless.
How can I improve the performance? Perhaps I should consider a different database than MySQL that provides specialized date/time indexes? If so, what would be a good option?
One thing to note is that it would suffice if I could get the LAST row of each hour/day/timeframe. The database has tens of millions of rows.
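One hedged way to get only the last full row per market and hour is to group first and then join back on the (market, timestamp) primary key; the one-week date range below is an assumption added to keep the scan bounded:
SELECT t.*
FROM tbl_data AS t
JOIN (
    SELECT market, MAX(timestamp) AS ts
    FROM tbl_data
    WHERE timestamp >= '2017-01-01' AND timestamp < '2017-01-08'
    GROUP BY market, DATE(timestamp), HOUR(timestamp)
) AS last_tick
    ON last_tick.market = t.market
   AND last_tick.ts = t.timestamp;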
MySQL version: 5.7
Thanks in advance for the help.
Edit: EXPLAIN output from a smaller DB of the exact same format (not included here).

MySQL database design for bigdata

I'm not a database specialist, therefore I'm coming here for a little help.
I have plenty of measured data and I want to help myself with data manipulation. Here is my situation:
There are about 10 stations, measuring every day. Every day, each one produces about 3,000 rows (with about 15 columns) of data. Data have to be downloaded once a day from every station to the centralized server. That means about 30,000 rows inserted into the database every day (daily counts vary).
I already have data from a few past years, so for every station I have a few million rows. There are also about 20 "dead" stations that don't work anymore, but their data from those years is kept.
Summing this all up gives 50+ million rows, produced by 30 stations, with about 30,000 rows inserted every day. Looking ahead, let's assume 100 million rows in the database.
My question is obvious - how would you suggest to store this data?
Measured values (columns) are only numbers (int or double, plus a datetime); there is no text or fulltext search, so basically the only index I need is on the DATETIME column.
Data will not be updated or deleted. I just need a fast select of a range of data (e.g. from 1.1.2010 to 3.2.2010).
So as I wrote, I want to use MySQL because that's the database I know best. I've read that it should easily handle this amount of data, but still, I appreciate any suggestions for this particular situation.
Again:
10 stations, 3,000 rows per day each => about 30,000 inserts per day
about 40-50 million rows yet to be inserted from binary files
the DB is going to grow (100+ million rows)
The only thing I need is to SELECT data as fast as possible.
As far as I know, MySQL should handle this amount of data. I also know that my only index will be the date and time in a DATETIME column (it should be faster than other types, am I right?).
The thing I can't decide is whether to create one huge table with 50+ million rows (with a station id column), or a separate table for every station. Basically, I don't need to perform any JOIN across stations; if I need to correlate them in time, I can just select the same time range for each station. Are there any advantages or disadvantages to either approach?
Can anyone confirm or refute my thoughts? Do you think there is a better solution? I appreciate any help or discussion.
MySQL should be able to handle this pretty well. Instead of indexing just your DATETIME column, I suggest you create two compound indexes, as follows:
(datetime, station)
(station, datetime)
Having both these indexes in place will help accelerate queries that choose date ranges and group by stations or vice versa. The first index will also serve the purpose that just indexing datetime will serve.
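For illustration, a hedged sketch of the corresponding DDL (the table name measurements and the index names are assumptions):
ALTER TABLE measurements
    ADD INDEX idx_datetime_station (datetime, station),
    ADD INDEX idx_station_datetime (station, datetime);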
You have not told us what your typical query is. Nor have you told us whether you plan to age out old data. Your data is an obvious candidate for range partitioning (http://dev.mysql.com/doc/refman/5.6/en/partitioning-range.html) but we'd need more information to help you design a workable partitioning criterion.
Edit after reading your comments.
A couple of things to keep in mind as you build up this system.
First, don't bother with partitions for now.
Second, I would get everything working with a single table. Don't split stuff by station or year. Get yourself the fastest disk storage system you can afford and a lot of RAM for your MySQL server and you should be fine.
Third, take some downtime once in a while to do OPTIMIZE TABLE; this will make sure your indexes are good.
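For example (measurements is an assumed table name; on InnoDB, OPTIMIZE TABLE is carried out as a table rebuild plus ANALYZE):
OPTIMIZE TABLE measurements;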
Fourth, don't use SELECT * unless you know you need all the columns in the table. Why? Because
SELECT datetime, station, temp, dewpoint
FROM table
WHERE datetime >= DATE(NOW() - INTERVAL 60 DAY)
ORDER BY station, datetime
can be directly satisfied from sequential access to a compound covering index on
(station, datetime, temp, dewpoint)
whereas
SELECT *
FROM table
WHERE datetime >= DATE(NOW() - INTERVAL 60 DAY)
ORDER BY station, datetime
needs to random-access your table. You should read up on compound covering indexes.
Fifth, avoid the use of functions with column names in your WHERE clauses. Don't say
WHERE YEAR(datetime) >= 2003
or anything like that. MySQL can't use indexes for that kind of query. Instead say
WHERE datetime >= '2003-01-01'
to allow the indexes to be exploited.

SQL Performance of grouping by DATE(TIMESTAMP) vs separate columns for DATE and TIME

I'm facing a problem of displaying data from MySQL database.
I have a table with all user requests in this format:
| Time (TIMESTAMP, indexed) | some other params |
I want to show this data on my website as a table with number of requests in each day.
The query is quite simple:
SELECT DATE(Time) as D, COUNT(*) as S FROM Stats GROUP BY D ORDER BY D DESC
But when looking into EXPLAIN this drives me mad:
Using index; Using temporary; Using filesort
The MySQL docs say that a temporary table is created for this query on the hard drive.
How fast would it be with 1,000,000 records? And with 100,000,000?
Is there any way to put an INDEX on the result of a function?
Maybe I should create separate columns for DATE and TIME and then group by the DATE column?
What are other good ways of dealing with such a problem? Caching? Another DB engine?
If you have an index on your Time column this operation is going to perform tolerably well. I'm guessing you do have that index, because your EXPLAIN output says it's using an index.
Why does this work well? Because MySQL can access this index in order -- it can scan the index -- to satisfy your query.
Don't be confused by Using temporary; Using filesort. This simply means MySQL needs to create and return a virtual table with a row for each day. That's pretty small and almost surely fits in memory. filesort doesn't necessarily mean the file has spilled to a temp file on disk; it just means MySQL has to sort the virtual table. It has to sort it to get the last day first.
By the way, if you can restrict the date range of the query, you'll get predictable performance on this query even when your application has been in use for years. Try something like this:
SELECT DATE(Time) as D, COUNT(*) as S
FROM Stats
WHERE Time >= CURDATE() - INTERVAL 30 DAY
GROUP BY D ORDER BY D DESC
First: a GROUP BY means sorting, and that is an expensive operation. The data in the index is sorted, but even so the database needs to group the dates. So I feel that indexing by DATE may help, as it will improve the speed of the query at the cost of refreshing another index on every insert. Please test it; I am not 100% sure.
Other alternatives are:
Partitioning the table by month.
Using materialized views.
Updating a counter with every visit (sketched below).
Precalculating and storing data through yesterday, then refreshing only today's count with a WHERE DATE(Time) = CURDATE() filter. This way the server has to sort a much smaller amount of data.
It depends on how often users visit your page and how fresh the data needs to be. Do not optimize prematurely if you do not need to.
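A hedged sketch of the counter option mentioned above (the daily_stats table and its columns are assumptions; Stats and its Time column come from the question):
CREATE TABLE daily_stats (
    stat_date DATE NOT NULL PRIMARY KEY,
    hits      INT UNSIGNED NOT NULL DEFAULT 0
);
-- On every request (or from a periodic batch over Stats):
INSERT INTO daily_stats (stat_date, hits)
VALUES (CURDATE(), 1)
ON DUPLICATE KEY UPDATE hits = hits + 1;
The page then reads from daily_stats instead of grouping over the whole Stats table.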

Faster way of retrieving aggregate data from large table?

I have a table that grows by tens of millions of rows each day. The rows in the table contain hourly information about page view traffic.
The indices on the table are on url and datetime.
I want to aggregate the information by day, rather than hourly. How should I do this? This is a query that exemplifies what I am trying to do:
SELECT url, sum(pageviews), sum(int_views), sum(ext_views)
FROM news
WHERE datetime >= "2012-08-29 00:00:00" AND datetime <= "2012-08-29 23:00:00"
GROUP BY url
ORDER BY pageviews DESC
LIMIT 10;
The above query never finishes, though. There are millions of rows in the table. Is there a more efficient way that I can get this aggregate data?
Tens of millions of rows per day is quite a lot.
Assuming:
only 10 million new records per day;
your table contains only the columns that you mention in your question;
url is of type TEXT with a mean (Punycode) length of ~77 characters;
pageviews is of type INT;
int_views is of type INT;
ext_views is of type INT; and
datetime is of type DATETIME
then each day's data will occupy around 9.9 × 10⁸ bytes (roughly 77 + 2 bytes for the url, plus 3 × 4 bytes for the INT columns, plus 8 bytes for the DATETIME ≈ 99 bytes per row, times 10⁷ rows), which is almost 1 GiB/day. In reality it may be considerably more, because the above assumptions were quite conservative.
MySQL's maximum table size is determined, amongst other things, by the underlying filesystem on which its data files reside. If you're using the MyISAM engine (as suggested by your comment beneath) without partitioning on Windows or Linux, then a limit of a few GiB is not uncommon, which implies the table will reach its capacity well within a working week!
As @Gordon Linoff mentioned, you should partition your table; however, each table has a limit of 1024 partitions. With 1 partition/day (which would be eminently sensible in your case), you will be limited to storing under 3 years of data in a single table before the partitions start getting reused.
I would therefore advise that you keep each year's data in its own table, each partitioned by day. Furthermore, as @Ben explained, a composite index on (datetime, url) would help (I actually propose creating a date column from DATE(datetime) and indexing that, because it will enable MySQL to prune the partitions when performing your query); and, if row-level locking and transactional integrity are not important to you (for a table of this sort, they may not be), using MyISAM may not be daft:
CREATE TABLE news_2012 (
INDEX (date, url(100))
)
Engine = MyISAM
PARTITION BY HASH(TO_DAYS(date)) PARTITIONS 366
SELECT *, DATE(datetime) AS date FROM news WHERE YEAR(datetime) = 2012;
CREATE TRIGGER news_2012_insert BEFORE INSERT ON news_2012 FOR EACH ROW
SET NEW.date = DATE(NEW.datetime);
CREATE TRIGGER news_2012_update BEFORE UPDATE ON news_2012 FOR EACH ROW
SET NEW.date = DATE(NEW.datetime);
If you choose to use MyISAM, you can not only archive completed years (using myisampack) but can also replace your original table with a MERGE one comprising the UNION of all of your underlying year tables (an alternative that would also work in InnoDB would be to create a VIEW, but it would only be useful for SELECT statements as UNION views are neither updatable nor insertable):
DROP TABLE news;
CREATE TABLE news (
date DATE,
INDEX (date, url(100))
)
Engine = MERGE
INSERT_METHOD = FIRST
UNION = (news_2012, news_2011, ...)
SELECT * FROM news_2012 WHERE FALSE;
You can then run your above query (along with any other) on this merge table:
SELECT url, SUM(pageviews), SUM(int_views), SUM(ext_views)
FROM news
WHERE date = '2012-08-29'
GROUP BY url
ORDER BY SUM(pageviews) DESC
LIMIT 10;
A few points:
As datetime is the only predicate that you're filtering on, you should probably have an index with datetime as the first column.
You're ordering by pageviews. I would have assumed that you want to order by sum(pageviews).
You're querying 23 hours of data, not 24. You probably want to use an explicit less-than, <, against midnight of the next day to avoid missing anything.
SELECT url, sum(pageviews), sum(int_views), sum(ext_views)
FROM news
WHERE datetime >= '2012-08-29 00:00:00'
AND datetime < '2012-08-30 00:00:00'
GROUP BY url
ORDER BY sum(pageviews) DESC
LIMIT 10;
You could index this on (datetime, url, pageviews, int_views, ext_views), but I think that would be overkill; so, if the index isn't too big, (datetime, url) seems like a good way to go. The only way to be certain is to test it and decide whether any performance improvement in querying is worth the extra time taken in index maintenance.
As Gordon just mentioned in the comments, you may need to look into partitioning. This enables you to query a smaller "table" that is part of the larger one. If all your queries are based at the day level, it sounds like you might need to create a new partition each day.
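A hedged sketch of day-level partitioning on this table (it assumes any primary or unique key on news includes datetime, which MySQL requires of the partitioning column; the boundaries shown are examples, and a new partition has to be added ahead of each day):
ALTER TABLE news
PARTITION BY RANGE (TO_DAYS(datetime)) (
    PARTITION p20120829 VALUES LESS THAN (TO_DAYS('2012-08-30')),
    PARTITION p20120830 VALUES LESS THAN (TO_DAYS('2012-08-31')),
    PARTITION pmax      VALUES LESS THAN MAXVALUE
);
With pruning, a single-day query like the one above should then only touch that day's partition.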