Suppose I have a simple table with these columns:
| id | user_id | order_id |
About 1,000,000 rows are inserted into this table per month, and as is clear, the relation between user_id and order_id is 1 to M.
The records from the last month are needed for accounting purposes; the others are just for showing order histories to the users. To archive the records older than the last month, I have two options in mind:
First, create a similar table and each month copy the old records into it, so it will get bigger and bigger each month according to the growth of orders.
Second, create a table like the one below:
| id | user_id | order_ids |
and each month, for each row to be inserted into this table, if the user_id already exists, just update order_ids and append the new order_id to the end of order_ids.
In this solution, the number of rows in the table will grow according to the user growth rate.
Suppose that for each solution we have an index on user_id.
Now the question is: which one is more optimized for SELECTing all order_ids per user when the server is under load?
The first one has many more records than the second one, but in the second one some application code is needed to split order_ids.
The first choice is the better of the two you have shown. With respect, I should say your second choice is a terrible idea.
MySQL (like all SQL DBMSs) is excellent at handling very large numbers of rows of uniformly laid-out (that is, normalized) data.
But, your best choice is to do nothing except create appropriate indexes to make it easy to look up order history by date or by user. Leave all your data in this table and optimize lookup instead.
Until this table contains at least fifty million rows (at least four years' worth of data), the time you spend reprogramming your system to allow it to be split into a current and an archive version will be far more costly than just keeping it together.
If you want help figuring out which indexes you need, you should ask another question showing your queries. It's not clear from this question how you look up orders by date.
In a 1:many relationship, don't make an extra table. Instead have the user_id be a column in the Orders table. Furthermore, this is likely to help performance:
PRIMARY KEY(user_id, order_id),
INDEX(order_id)
Is a "month" a calendar month? Or "30 days ago until now"?
If it is a calendar month, consider PARTITION BY RANGE(TO_DAYS(datetime)) and have an ever-increasing list of monthly partitions. However, do not create future months in advance; create them just before they are needed. More details: http://mysql.rjweb.org/doc.php/partitionmaint
Note: This would require adding datetime to the end of the PK.
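As a rough illustration, here is a sketch of what such a table could look like (the column types, the extra columns, and the partition names are assumptions for the example, not taken from your schema):

CREATE TABLE Orders (
    order_id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
    user_id    INT UNSIGNED NOT NULL,
    `datetime` DATETIME     NOT NULL,
    -- ... other order columns ...
    PRIMARY KEY (user_id, order_id, `datetime`),  -- datetime appended for partitioning
    INDEX (order_id)                              -- keeps order_id lookups (and AUTO_INCREMENT) valid
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(`datetime`)) (
    PARTITION p201711 VALUES LESS THAN (TO_DAYS('2017-12-01')),
    PARTITION p201712 VALUES LESS THAN (TO_DAYS('2018-01-01'))
    -- add the next month's partition shortly before it is needed
);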
At 4 years' worth of data (48 partitions), it will be time to rethink things. (I recommend not going much beyond that number of partitions.)
Read about "transportable tablespaces". This may become part of your "archiving" process.
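If you go that route, a rough sketch of how the archiving step might look (requires MySQL 5.6+; orders_archive_2017_11 is a hypothetical empty table with the same structure as Orders but no partitioning):

-- move the oldest month's rows out of the partitioned table
ALTER TABLE Orders EXCHANGE PARTITION p201711 WITH TABLE orders_archive_2017_11;

-- then make that table's tablespace transportable
FLUSH TABLES orders_archive_2017_11 FOR EXPORT;
-- ... copy the table's .ibd and .cfg files to the archive server here ...
UNLOCK TABLES;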
Use InnoDB.
With that partitioning, either of these becomes reasonably efficient:
WHERE user_id = 123
AND datetime > CURDATE() - INTERVAL 30 DAY
WHERE user_id = 123
AND datetime >= '2017-11-01' -- or whichever start-of-month you need
Each of the above will hit at most one non-empty partition more than the number of months desired.
If you want to discuss this more, please provide SHOW CREATE TABLE (in any variation), plus some of the important SELECTs.
I want to store daily fund data for approximately 2000 funds over 20 years or more. At first I figured I would just create one giant table with one column per fund and one row per date. I ran into trouble trying to create this table and also realised that a table like that would have a lot of NULL values (almost half the values would be NULL).
Is there a more efficient way of structuring the table or database for quickly finding and fetching the data for a specific fund over hundreds (or thousands) of days?
The alternative way I've thought of doing this is with three columns (date, fund_id, fund_value). This however does not seem optimal to me since both the date and fund_id would be duplicated many times over. Having a few million data points just for the date (instead of a few thousand) seems wasteful.
Which is the better option? Or is there a better way to accomplish this?
Having the three columns you mention is fine. fund_value is the price of fund_id on fund_date, so fund_id and fund_date would be the PK of this table.

I don't understand what you mean by "having a few million data points just for the date...". If you have 20k funds, a particular date will appear in at most 20k rows -- one for each fund. This is not needless duplication; it is necessary to uniquely identify the value of a particular fund on a particular date.

If you added, say, fund_name to the table, that would be needless duplication. We assume the fund name will not change from day to day, so unchanging (static) data about each fund would be contained in a separate table. The fund_id field of this table would then be a FK reference to the static table.
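A minimal sketch of that layout, with a hypothetical fund table holding the static data (names and types here are illustrative, not prescriptive):

CREATE TABLE fund (
    fund_id   INT UNSIGNED NOT NULL,
    fund_name VARCHAR(100) NOT NULL,
    -- ... other static, unchanging attributes of a fund ...
    PRIMARY KEY (fund_id)
);

CREATE TABLE fund_value_history (
    fund_id    INT UNSIGNED  NOT NULL,
    fund_date  DATE          NOT NULL,
    fund_value DECIMAL(12,4) NOT NULL,
    PRIMARY KEY (fund_id, fund_date),               -- one row per fund per date
    FOREIGN KEY (fund_id) REFERENCES fund (fund_id)
);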
To query the value of the funds on a particular date:
select fund_date as ValueDate, fund_id, fund_value
from fund_value_history
where fund_date = #aDate
and fund_id = #aFund -- to limit to a particular fund
To show the dates a fund increased in value from one day to the next:
select h1.fund_date, h2.fund_value as PreviousValue,
       h1.fund_value as PresentValue
from fund_value_history h1
join fund_value_history h2
  on h2.fund_id = h1.fund_id
 and h2.fund_date = (
       select max( fund_date )
       from fund_value_history
       where fund_id = h1.fund_id
         and fund_date < h1.fund_date )
where h2.fund_value < h1.fund_value
  and h1.fund_id = #aFund;
This would be a sizable result set, but you could modify the WHERE clause to show, for example, all funds whose value on a particular date was greater than on the previous day, or the values of all funds (or a particular fund) on a particular date and the previous day, or any number of interesting results.
You could then join to the static table to add fund name or any other descriptive data.
The three column approach you considered is the correct one. There would be no wasted space due to missing values, and you can add and remove funds at any time.
Have a search for "database normalisation", which is the discipline that covers this sort of design decision.
Edit: I should add that you're free to include other metrics in that table, of course. Since historical data is effectively static you can also store "change since previous day" as well, which is redundant strictly speaking, but may help to optimise certain queries such as "show me all the funds that decreased in value on this day".
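For example, a sketch of the kind of query such a redundant column makes cheap, assuming a hypothetical day_change column that stores the change since the previous day:

-- all funds that decreased in value on a given date
select fund_id, fund_value, day_change
from fund_value_history
where fund_date = '2012-07-30'
  and day_change < 0;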
I'm not a database specialist, so I'm coming here for a little help.
I have plenty of measured data and I want to help myself with data manipulation. Here is my situation:
There are about 10 stations, measuring every day. Every day, each one produces about 3,000 rows (with about 15 columns) of data. The data have to be downloaded once a day from every station to a centralized server. That means about 30,000 rows inserted into the database every day. (Daily counts vary.)
Now, I already have data from a few past years, so for every station I have a few million rows. There are also about 20 "dead" stations - they don't work anymore, but there are data from a few years.
Sum this all up and we get 50+ million rows, produced by 30 stations, with about 30,000 rows inserted every day. Looking ahead, let's assume 100 million rows in the database.
My question is obvious - how would you suggest to store this data?
Measured values (columns) are only numbers (int or double, plus a datetime) - no text or fulltext search; basically the only index I need is on the DATETIME.
Data will not be updated or deleted. I just need a fast select of a range of data (e.g. from 1.1.2010 to 3.2.2010).
As I wrote, I want to use MySQL because that's the database I know best. I've read that it should easily handle this amount of data, but still, I'd appreciate any suggestions for this particular situation.
Again:
10 stations, 3,000 rows per day each => about 30,000 inserts per day
about 40-50 million rows yet to be inserted from binary files
the DB is going to grow (100+ million rows)
The only thing I need is to SELECT data as fast as possible.
As far as I know, MySQL should handle this amount of data. I also know that my only index will be the date and time in a DATETIME column (it should be faster than other types, am I right?).
The thing I can't decide is whether to create one huge table with 50+ million rows (with a station id), or to create a table for every station separately. Basically, I don't need to perform any JOIN across stations. If I need to check time coincidence, I can just select the same time range for each station. Are there any advantages/disadvantages to these approaches?
Can anyone confirm or refute my thoughts? Do you think there is a better solution? I appreciate any help or discussion.
MySQL should be able to handle this pretty well. Instead of indexing just your DATETIME column, I suggest you create two compound indexes, as follows:
(datetime, station)
(station, datetime)
Having both these indexes in place will help accelerate queries that choose date ranges and group by stations or vice versa. The first index will also serve the purpose that just indexing datetime will serve.
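A sketch of creating those two indexes, assuming a hypothetical table name of measurements (the column names are the ones from the suggestion above):

ALTER TABLE measurements
    ADD INDEX idx_datetime_station (`datetime`, station),
    ADD INDEX idx_station_datetime (station, `datetime`);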
You have not told us what your typical query is. Nor have you told us whether you plan to age out old data. Your data is an obvious candidate for range partitioning (http://dev.mysql.com/doc/refman/5.6/en/partitioning-range.html) but we'd need more information to help you design a workable partitioning criterion.
Edit after reading your comments.
A couple of things to keep in mind as you build up this system.
First, don't bother with partitions for now.
Second, I would get everything working with a single table. Don't split stuff by station or year. Get yourself the fastest disk storage system you can afford and a lot of RAM for your MySQL server and you should be fine.
Third, take some downtime once in a while to do OPTIMIZE TABLE; this will make sure your indexes are good.
Fourth, don't use SELECT * unless you know you need all the columns in the table. Why? Because
SELECT datetime, station, temp, dewpoint
FROM table
WHERE datetime >= DATE(NOW() - INTERVAL 60 DAY)
ORDER BY station, datetime
can be directly satisfied from sequential access to a compound covering index on
(station, datetime, temp, dewpoint)
whereas
SELECT *
FROM table
WHERE datetime >= DATE(NOW() - INTERVAL 60 DAY)
ORDER BY station, datetime
needs to random-access your table. You should read up on compound covering indexes.
Fifth, avoid the use of functions with column names in your WHERE clauses. Don't say
WHERE YEAR(datetime) >= 2003
or anything like that. MySQL can't use indexes for that kind of query. Instead say
WHERE datetime >= '2003-01-01'
to allow the indexes to be exploited.
So I have an interesting question that I am not sure is considered a 'hack' or not. I looked through some questions but did not find a duplicate so here it is. Basically, I need to know if this is unreliable or considered bad practice.
I have a very simple table with a unique auto incrementing id and a created_at timestamp.
(simplified version of my problem to clarify the concept in question)
+-----------+--------------------+
| id |created_at |
+-----------+--------------------+
| 1 |2012-12-11 20:35:19 |
| 2 |2012-12-12 20:35:19 |
| 3 |2012-12-13 20:35:19 |
| 4 |2012-12-14 20:35:19 |
+-----------+--------------------+
Both of these columns are added dynamically so it can be said that a new 'insert' will ALWAYS have a greater id and ALWAYS have a greater date.
OBJECTIVE -
very simply grab the results ordered by created_at in descending order
SOLUTION ONE - A query that orders by date in descending order
SELECT * FROM tablename
ORDER BY created_at DESC
SOLUTION TWO - A query that orders by ID in descending order
SELECT * FROM tablename
ORDER BY id DESC
Is solution two considered bad practice, or is it the proper way of doing things? Any explanation of your reasoning would be very helpful, as I am trying to understand the concept, not just get an answer. Thanks in advance.
In typical practice you can almost always assume that an autoincrement id can be sorted to give you the records in creation order (either direction). However, you should note that this is not considered portable in terms of your data. You might move your data to another system where the keys are recreated, but the created_at data is the same.
There is actually a pretty good StackOverflow discussion of this issue.
The basic summary is the first solution, ordering by created_at, is considered best practice. Be sure, however, to properly index the created_at field to give the best performance.
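For example, a minimal sketch of adding that index (using the placeholder table name from the question):

ALTER TABLE tablename ADD INDEX idx_created_at (created_at);

With that index in place, ORDER BY created_at DESC can typically be satisfied by reading the index in order rather than sorting the rows.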
You shouldn't rely on ID for anything other than that it uniquely identifies a row. It's an arbitrary number that only happens to correspond to the order in which the records were created.
Say you have this table
ID creation_date
1 2010-10-25
2 2010-10-26
3 2012-03-05
In this case, sorting on ID instead of creation_date works.
Now in the future you realize, oh, whoops, you have to change the creation date of record ID #2 to 2010-09-17. Your sorts using ID now report the records in the same order:
1 2010-10-25
2 2010-09-17
3 2012-03-05
even though with the new date they should be:
2 2010-09-17
1 2010-10-25
3 2012-03-05
Short version: Use data columns for the purpose that they were created. Don't rely on side effects of the data.
There are a couple of differences between the two options.
The first is that they can give different results.
The value of created_at might be affected by the time being adjusted on the server but the id column will be unaffected. If the time is adjusted backwards (either manually or automatically by time synchronization software) you could get records that were inserted later but with timestamps that are before records that were inserted earlier. In this case you will get a different order depending on which column you order by. Which order you consider to be "correct" is up to you.
The second is performance. It is likely to be faster to ORDER BY your clustered index.
How the Clustered Index Speeds Up Queries
Accessing a row through the clustered index is fast because the row data is on the same page where the index search leads.
By default the clustered key is the primary key, which in your case is presumably the id column. You will probably find that ORDER BY id is slightly faster than ORDER BY created_at.
Primary keys, especially surrogate ones, do not usually represent any meaningful data; their function is simply to make records uniquely identifiable. Since the dates in this case do represent meaningful data with meaning outside that primary function, I'd say sorting by date is the more logical approach here.
Ordering by id orders by insertion order.
If you have use cases where insertion can be delayed, for example a batch process, then you must order by created_at to order by time.
Both are acceptable if they meet your needs.
I am creating a table that stores data, actually a counter, for products for each week.
Example:
id = 1
productId = 195
DateTime = 01/07/2012
Counter = 0
My question to you is about database storage space, query flexibility and performance.
Instead of the DateTime column, I thought about using a SmallInt 'WeekNumber' column.
I will decide on the Date that the weeks start (base date). Let's say 10/10/2012.
For each product and for each week, there will be a row that represents the total of something that I count on a daily basis (ie. Pageviews for a specific product page).
From what I've read:
Date column is 4 bytes
SmallInt is 2 bytes
I want to save as much space as possible, but I want to be able to query the database base on range of dates (august 2012 to September 2013), specific week in a specific year, etc.
Is this approach to the schema good, or will I find myself having problems with poor SQL performance, query flexibility, indexes, etc.?
Consider the sacrifice and complication you're going to make in order to save one byte....
In order to use the smallint you're going to pass every call to the data through a function to get its "week number" starting from your own arbitrary date.... This is neither more performant nor clearer.
The query is, likewise, not as flexible, because each one will need to compare based on your magic "starting date" rather than just a date compare/group. Your queries will likely not be SARGable and will probably be slower.
EDIT: From your comments you have a hard limit of 50GB.... that's a lot of space for an aggregation DB like you're discussing. You're inviting undue stress and loss of sustainability by complicating this.
According to MySQL, the DATE type is only 3 bytes compared to the 2 bytes for the SMALLINT
http://dev.mysql.com/doc/refman/5.0/en/storage-requirements.html
You are, therefore, going to save one byte per row (you say 2,000 rows per week)... so let's say 2 KB per week, about 104 KB per year.
If this table has no child tables (no foreign keys referencing it), to conserve space, you might consider omitting the surrogate primary key (id), and instead use a composite key (productId, date_) as the primary key. (From what you describe, it sounds as if you are going to want the combination of those columns to be UNIQUE, and both of those columns to be NOT NULL.)
If what you want to store is a "week" identifier rather than a DATE, there's no problem on the database side of things, as long as your queries aren't wrapping that column in an expression to get a DATE value to use in predicates. That is, for performance, your predicates are going to need to be on the bare "week identifier" column, e.g.
WHERE t.product_id = 195 AND t.week_id >= 27 AND t.week_id < 40
Predicates like that on the bare column will be sargable (that is, allow an index to be used). You do NOT want to be wrapping that week_id column in an expression to return a DATE and then putting a WHERE clause on that expression. (Having expressions on the literal side of the comparison is not a problem... you just don't want them on the "table" side.)
That's really going to be the determining factor of whether you can use a week_id in place of a DATE column.
Using a "period id" in place of a DATE is fairly straightforward to implement for periods that are whole months. (It's also straightforward for "days", but is really of less benefit there.) Implementing this approach for "week" periods is more complicated, because of the handling you need for a week that is split between two years.
Consider, for example, that the last two days of this year (2012) are on Sunday and Monday, but Tuesday thru Saturday of that same week are in 2013. You'd need to decide whether that's two separate weeks, or whether that's the same week.
But the 1-byte savings (of SMALLINT vs DATE) isn't the real benefit. What the "week_id" column gets you (as I see it) is a single id value that identifies a week. Consider the date values '2012-07-30', '2012-07-31', and '2012-08-01': they all really represent the same week. So you have multiple values for the week, such that a UNIQUE constraint on (product_id, date) doesn't really GUARANTEE (on the database side) that you don't have more than one row for the same week. (That's not an insurmountable problem, of course; you can specify that you only store a Sunday (or Monday) date value.)
In summary,
To conserve space, I would first drop that surrogate id column, and make the combination of the product_id and the DATE be the primary key.
Then I would ONLY consider changing that DATE into a SMALLINT, if I could GUARANTEE that all queries would be referencing that bare SMALLINT column, and NOT referencing an expression that converts the SMALLINT column back into DATE.
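A sketch of that first recommendation applied to the table in the question (the table name product_weekly_counts, the date_ column name, and the unsigned types are assumptions for illustration):

CREATE TABLE product_weekly_counts (
    productId INT UNSIGNED NOT NULL,
    date_     DATE         NOT NULL,   -- e.g. the date the week starts on
    counter   INT UNSIGNED NOT NULL DEFAULT 0,
    PRIMARY KEY (productId, date_)     -- replaces the surrogate id
);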
I have an order table that contains dates and amounts for each order. This table is big - it contains more than 1,000,000 records - and growing.
We need to create a set of queries to calculate certain milestones. Is there a way in MySQL to figure out on which date we reached an aggregate milestone of x amount?
For example, we crossed 1M in sales on '2011-01-01'.
Currently we scan the entire table and then use logic in PHP to figure out the date, but it would be great if this could be done in MySQL without reading so many records at one time.
There may be more elegant approaches, but what you can do is maintain a row in another table which contains current_sales and the date it occurred. Every time you have a sale, increment the value and store the sale date. If the expected milestones (1 million, 2 million, etc.) are known in advance, you can store them away when they occur (in the same or a different table).
I think using Gunner's logic with a trigger would be a good option, as it reduces the effort of maintaining the row; after that you can also send a mail notification through the trigger to report the milestone status.
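As a rough sketch of that combined idea (every name here - orders, amount, order_date, sales_running_total, sales_milestones - is an assumption, since the real schema isn't shown; sales_running_total is seeded with a single row at zero, and sales_milestones is pre-filled with the expected milestone amounts and a NULL reached_on):

DELIMITER //
CREATE TRIGGER trg_orders_after_insert
AFTER INSERT ON orders
FOR EACH ROW
BEGIN
    -- keep the single running-total row up to date
    UPDATE sales_running_total
       SET current_sales  = current_sales + NEW.amount,
           last_sale_date = NEW.order_date;

    -- stamp any pre-defined milestones this sale pushed us past
    UPDATE sales_milestones m
    CROSS JOIN sales_running_total t
       SET m.reached_on = NEW.order_date
     WHERE m.reached_on IS NULL
       AND t.current_sales >= m.milestone_amount;
END//
DELIMITER ;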