MySQL database design for big data - mysql

I'm not a database specialist, so I'm coming here for a little help.
I have plenty of measured data and I want to help myself with data manipulation. Here is my situation:
There are approx. 10 stations, measuring every day. Every day, each one produces approx. 3,000 rows (with approx. 15 columns) of data. The data have to be downloaded once a day from every station to a centralized server. That means approx. 30,000 rows inserted into the database every day. (Daily counts vary.)
I also already have data from a few past years, so for every station I have a few million rows. There are also approx. 20 "dead" stations - they don't work anymore, but there are a few years of data from them.
Summing this all up, we get 50+ million rows produced by 30 stations, with approx. 30,000 rows inserted every day. Looking ahead, let's assume 100 million rows in the database.
My question is obvious - how would you suggest to store this data?
The measured values (columns) are only numbers (int or double, plus a datetime) - no text or fulltext search; basically, the only index I need is on the DATETIME column.
Data will not be updated or deleted. I just need a fast select of a range of data (e.g. from 1.1.2010 to 3.2.2010).
So, as I wrote, I want to use MySQL because that's the database I know best. I've read that it should easily handle this amount of data, but still, I'd appreciate any suggestions for this particular situation.
Again:
10 stations, 3,000 rows per day each => approx. 30,000 inserts per day
approx. 40-50 million rows yet to be inserted from binary files
the DB is going to grow (100+ million rows)
the only thing I need is to SELECT data as fast as possible
As far as I know, MySQL should handle this amount of data. I also know that my only index will be date and time, of type DATETIME (that should be faster than other types, am I right?).
The thing I can't decide is whether to create one huge table with 50+ million rows (with a station id column), or to create a separate table for every station. Basically, I don't need to perform any JOIN on these stations. If I need to check time coincidence, I can just select the same time range on each station. Are there any advantages/disadvantages to these approaches?
Can anyone confirm or refute my thoughts? Do you think there is a better solution? I appreciate any help or discussion.

MySQL should be able to handle this pretty well. Instead of indexing just your DATETIME column, I suggest you create two compound indexes, as follows:
(datetime, station)
(station, datetime)
Having both of these indexes in place will help accelerate queries that choose date ranges and group by station, or vice versa. The first index will also serve any purpose that indexing datetime alone would have served.
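For concreteness, here is what those two indexes look like as DDL. This is only a sketch: the question gives no table name, so measurements below is a placeholder.
-- Placeholder table name; the two compound indexes suggested above.
ALTER TABLE measurements
  ADD INDEX idx_datetime_station (datetime, station),
  ADD INDEX idx_station_datetime (station, datetime);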
You have not told us what your typical query is. Nor have you told us whether you plan to age out old data. Your data is an obvious candidate for range partitioning (http://dev.mysql.com/doc/refman/5.6/en/partitioning-range.html) but we'd need more information to help you design a workable partitioning criterion.
Edit after reading your comments.
A couple of things to keep in mind as you build up this system.
First, don't bother with partitions for now.
Second, I would get everything working with a single table. Don't split stuff by station or year. Get yourself the fastest disk storage system you can afford and a lot of RAM for your MySQL server and you should be fine.
Third, take some downtime once in a while to do OPTIMIZE TABLE; this will make sure your indexes are good.
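For example, using the placeholder table name from above:
-- For InnoDB this rebuilds the table (recreate + analyze), refreshing
-- index statistics.
OPTIMIZE TABLE measurements;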
Fourth, don't use SELECT * unless you know you need all the columns in the table. Why? Because
SELECT datetime, station, temp, dewpoint
FROM measurements
WHERE datetime >= DATE(NOW() - INTERVAL 60 DAY)
ORDER BY station, datetime
can be directly satisfied from sequential access to a compound covering index on
(station, datetime, temp, dewpoint)
whereas
SELECT *
FROM measurements
WHERE datetime >= DATE(NOW() - INTERVAL 60 DAY)
ORDER BY station, datetime
needs to random-access your table, with a base-table lookup for every matching row. You should read up on compound covering indexes.
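As DDL, the covering index for the first query would look like this (placeholder table name again):
-- All four columns the query touches are in the index, so MySQL can answer
-- it from the index alone, never reading the table rows.
ALTER TABLE measurements
  ADD INDEX idx_covering (station, datetime, temp, dewpoint);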
Fifth, avoid wrapping column names in functions in your WHERE clauses. Don't say
WHERE YEAR(datetime) >= 2003
or anything like that. MySQL can't use indexes for that kind of query. Instead say
WHERE datetime >= '2003-01-01'
to allow the indexes to be exploited.
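Along the same lines, if you want exactly one year rather than an open-ended range, bound the range on both sides instead of using YEAR(). A sketch on the placeholder table:
SELECT station, datetime, temp
FROM measurements
WHERE datetime >= '2003-01-01'
  AND datetime < '2004-01-01'  -- half-open range keeps the index usable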

Related

fetch time is taking time

I am running the simple query below; execution time is 1 sec but fetch time is 30 sec. It contains 100,000 records in total:
SELECT id, referrer, timestamp
FROM masterstats_innodb
WHERE video = 1869 AND timestamp between '2011-10-01' and '2021-01-21';
An index is created on the video and timestamp columns, and a range partition has even been created on the timestamp column. Can anything be done to fetch the result faster?
Please provide SHOW CREATE TABLE.
Plan A: INDEX(video, timestamp)
Plan B - slightly better because of being "covering":
INDEX(video, timestamp, referrer, id)
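As DDL, Plan B would look like the following; whether it actually helps depends on the real schema, which is why SHOW CREATE TABLE was requested:
-- Covering index: every column the query reads is in the index, so the
-- matching rows come straight out of the index without table lookups.
ALTER TABLE masterstats_innodb
  ADD INDEX idx_video_ts_cover (video, timestamp, referrer, id);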
PARTITIONing will not help the performance of this query any more than indexing.
You say "it" contains 100K rows -- are you referring to the table? Or the just the number of rows returned. If 'table', then the index will help. If the 'resultset', then you are constrained by having to send so many rows. What will the client do with 100K rows?? Can the server condense the data (eg summarize it in some way)?

MySQL - Group By date/time functions on a large table

I have a bunch of financial stock data in a MySQL table. The data is stored in a one-minute-tick-per-row format (OHLC). From that data I'd like to create 30-minute/hourly/daily aggregates. The problem is that the table is enormous, and grouping by date functions on the timestamp column yields horrible performance.
Ex: The following query produces the right result but takes too long:
SELECT market, max(timestamp) AS TS
FROM tbl_data
GROUP BY market, DATE(timestamp), HOUR(timestamp)
ORDER BY market, TS ASC
The table has a primary key on the (market, timestamp) columns. I have also added an additional index on the timestamp column. However, that is not of much help, as the use of date/hour functions means a table scan regardless.
How can I improve the performance? Perhaps I should consider a database other than MySQL that provides specialized date/time indexes? If so, what would be a good option?
One thing to note: it would suffice if I could get the LAST row of each hour/day/timeframe. The database has tens of millions of rows.
MySQL version: 5.7
Thanks in advance for the help.
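One direction worth sketching, not taken from any answer in the thread: MySQL 5.7 supports indexed generated columns, which would let the hour bucket be materialized once and grouped on as a plain column. Table and column names below are the ones from the question; treat this as an untested sketch.
-- Sketch: store the hour bucket as a generated column and index it.
ALTER TABLE tbl_data
  ADD COLUMN ts_hour CHAR(13)
    GENERATED ALWAYS AS (DATE_FORMAT(timestamp, '%Y-%m-%d %H')) STORED,
  ADD INDEX idx_market_hour (market, ts_hour);
-- The aggregate can then group on the real column instead of functions:
SELECT market, ts_hour, MAX(timestamp) AS TS
FROM tbl_data
GROUP BY market, ts_hour
ORDER BY market, TS;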
Edit: Here is what Explain shows on a smaller DB of the exact same format:

mysql optimization - copy or serialized old rows

Suppose I have a simple table with these columns:
| id | user_id | order_id |
About 1,000,000 rows are inserted into this table per month, and clearly the relation between user_id and order_id is 1 to M.
The records from the last month are needed for accounting purposes; the others are just for showing order histories to the users. To archive records older than the past month, I have two options in mind:
First, create a similar table and copy the old records into it each month. It will then get bigger and bigger every month, in line with the growth of orders.
Second, create a table like this:
| id | user_id | order_ids |
and each month, for each row to be inserted into this table, if the user_id already exists, just update order_ids by appending the new order_id to the end.
With this solution, the number of rows in the table grows with the user growth rate instead.
Suppose that for each solution we have an index on user_id.
Now the question is: which one is better optimized for SELECTing all order_ids per user under server load?
The first one has many more records than the second, but with the second one, some programming-language code is needed to split order_ids apart.
The first choice is the better of the two you have shown. With respect, I should say your second choice is a terrible idea.
MySQL (like all SQL DBMSs) is excellent at handling very large numbers of rows of uniformly laid out (that is, normalized) data.
But, your best choice is to do nothing except create appropriate indexes to make it easy to look up order history by date or by user. Leave all your data in this table and optimize lookup instead.
Until this table contains at least fifty million rows (at least four years' worth of data), the time you spend reprogramming your system to allow it to be split into a current and an archive version will be far more costly than just keeping it together.
If you want help figuring out which indexes you need, you should ask another question showing your queries. It's not clear from this question how you look up orders by date.
In a 1:many relationship, don't make an extra table. Instead have the user_id be a column in the Orders table. Furthermore, this is likely to help performance:
PRIMARY KEY(user_id, order_id),
INDEX(order_id)
Is a "month" a calendar month? Or "30 days ago until now"?
If it is a calendar month, consider PARTITION BY RANGE(TO_DAYS(datetime)) and have an ever-increasing list of monthly partitions. However, do not create future months in advance; create them just before they are needed. More details: http://mysql.rjweb.org/doc.php/partitionmaint
Note: This would require adding datetime to the end of the PK.
At 4 years' worth of data (48 partitions), it will be time to rethink things. (I recommend not going much beyond that number of partitions.)
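A minimal sketch of that monthly layout; the datetime column and the concrete month boundaries are assumptions, since the question doesn't show them:
-- Monthly RANGE partitions; note datetime at the end of the PK, as described
-- above. Each new month's partition is added just before it is needed.
CREATE TABLE orders (
  user_id  INT NOT NULL,
  order_id INT NOT NULL,
  datetime DATETIME NOT NULL,
  PRIMARY KEY (user_id, order_id, datetime),
  INDEX (order_id)
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(datetime)) (
  PARTITION p2017_11 VALUES LESS THAN (TO_DAYS('2017-12-01')),
  PARTITION p2017_12 VALUES LESS THAN (TO_DAYS('2018-01-01'))
  -- further monthly partitions appended over time
);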
Read about "transportable tablespaces". This may become part of your "archiving" process.
Use InnoDB.
With that partitioning, either of these becomes reasonably efficient:
WHERE user_id = 123
AND datetime > CURDATE() - INTERVAL 30 DAY
WHERE user_id = 123
AND datetime >= '2017-11-01' -- or whichever start-of-month you need
Each of the above will hit at most one non-empty partition more than the number of months desired.
If you want to discuss this more, please provide SHOW CREATE TABLE (in any variation), plus some of the important SELECTs.

Advice on optimizing Table Structure

My table has 5 columns: Date, Name, X, Y, Z
For each Date there are ~10 Names. For each Name, there are around 1 million - 5 million records.
Queries always look like this:
SELECT SUM(Z)
FROM TABLENAME
WHERE Name = 'somename'
  AND Date BETWEEN 'somedate' AND 'otherdate'
GROUP BY X, Y
For this small number of Names I simply partitioned the table on Date, accepting the extra expense of scanning a large number of records in each date partition that we don't actually need (since a query is ALWAYS for 1 Name, but over multiple days - 30 at most).
Now I need to scale this to ~600 Names, and it's obvious that continuing with the current model isn't feasible.
My first instinct was to SUBPARTITION by Name within each Date partition. This should solve the problem of scale, since MySQL would narrow the work down to a single subpartition per date partition for whichever dates are in the BETWEEN clause.
Is this a good solution? Is there something better I could do? Creating this sort of structure in an automated manner would be a pain, since I have to load data every day, possibly with new Names. I'd have to inspect the CSV files I'm loading, create subpartitions in advance for each Name, and then start LOAD DATA INFILE.
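One caveat worth noting: MySQL only allows HASH or KEY subpartitions, so "one subpartition per Name" can only be approximated by hashing Name into a fixed number of buckets. A sketch, with the table name and bucket count invented:
-- Equality on Name prunes to one hash bucket inside each date partition.
CREATE TABLE stats (
  `Date` DATE NOT NULL,
  `Name` VARCHAR(32) NOT NULL,
  X INT, Y INT, Z DOUBLE
)
PARTITION BY RANGE (TO_DAYS(`Date`))
SUBPARTITION BY KEY (`Name`)
SUBPARTITIONS 64 (
  PARTITION p2016_01 VALUES LESS THAN (TO_DAYS('2016-02-01')),
  PARTITION p2016_02 VALUES LESS THAN (TO_DAYS('2016-03-01'))
  -- further date partitions created as data arrives
);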

mysql partitioning with int and timestamp

I have MySQL 5.6.12 Community Server.
I am trying to partition my MySQL InnoDB table, which contains 5M (and ever-growing) rows of history data. It is getting slower and slower and I figured partitioning would solve it.
I have these columns:
stationID int(4)
valueNumberID int(5)
logTime timestamp
value double
(stationID, valueNumberID, logTime) is my PRIMARY KEY.
I have 50 different stationIDs. From each station comes history data, and I need to store it for 5 years. There are only 2-5 different valueNumberIDs per stationID, but hundreds of value changes per day. Every query in the system uses stationID, valueNumberID, and logTime, in that order. In most cases the queries are limited to the current year.
I would like to partition by stationID, with each stationID having its own partition so that queries scan a smaller physical table, and further reduce the size scanned by subpartitioning on logTime. I do not know how to create a separate partition for each of the 50 stationIDs and create subpartitions for them using the timestamp.
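A sketch of roughly that layout follows. One caveat (my note, not from the thread): MySQL restricts subpartitions to HASH or KEY, so logTime can only be hashed into buckets here, which does not prune date ranges the way true time subpartitions would.
-- One LIST partition per stationID; logTime hash buckets within each.
CREATE TABLE His (
  stationID     INT NOT NULL,
  valueNumberID INT NOT NULL,
  logTime       TIMESTAMP NOT NULL,
  `value`       DOUBLE,
  PRIMARY KEY (stationID, valueNumberID, logTime)
) ENGINE=InnoDB
PARTITION BY LIST (stationID)
SUBPARTITION BY HASH (UNIX_TIMESTAMP(logTime))
SUBPARTITIONS 12 (
  PARTITION pStation1 VALUES IN (1),
  PARTITION pStation2 VALUES IN (2)
  -- ... one VALUES IN (...) clause per stationID ...
);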
Thank you for your replies. SELECT queries are getting slower. To me it seems they slow down linearly with the speed at which the table grows. The issue must be with the GROUP BY. This is an example query:
SELECT DATE_FORMAT(logTime, '%Y%m%d%H%i%s') AS 'logTime', SUM(value)
FROM His
WHERE stationID = 23
  AND valueNumberID = 4
  AND logTime > '2013-01-01 00:00:00'
  AND logTime < '2013-11-14 00:00:00'
GROUP BY DATE_FORMAT(logTime, '%Y%m')
ORDER BY logTime
LIMIT 0, 120;
The objective of queries like this is to give AVG, MAX, MIN, or SUM over hour, day, week, or month intervals. The result of the query is bound tightly to how the results are presented to the user in various ways (graph, Excel file), and it would take a long time to change that if I changed the queries. So I was looking for an easy way out with partitioning.
An estimated 1.2-1.4M rows per month are added to this table.
Thank you