I am creating a table that stores data (actually a counter) for products, for each week.
Example:
id = 1
productId = 195
DateTime = 01/07/2012
Counter = 0
My question to you is about database storage space, query flexibility and performance.
Instead of the DateTime column, I thought about using a SmallInt 'WeekNumber' column.
I will decide on the Date that the weeks start (base date). Let's say 10/10/2012.
For each product and for each week, there will be a row that represents the total of something that I count on a daily basis (ie. Pageviews for a specific product page).
From what I've read:
Date column is 4 bytes
SmallInt is 2 bytes
I want to save as much space as possible, but I also want to be able to query the database based on a range of dates (August 2012 to September 2013), a specific week in a specific year, etc.
Is this approach to the schema good, or will I find myself having problems with poor SQL performance, query flexibility, indexes, etc.?
Consider the sacrifice and complication you're going to make in order to save one byte...
In order to use the SMALLINT, you're going to pass every call to the data through a function to get its "week number" counted from your own arbitrary base date. This is neither more performant nor clearer.
The queries are, likewise, not as flexible, because each one will need to compare based on your magic "starting date" rather than just a plain date comparison or grouping. Your queries will likely not be SARGable and will probably be slower.
EDIT: From your comments you have a hard limit of 50 GB... that's a lot of space for an aggregation DB like you're discussing. You're inviting undue stress and loss of maintainability by complicating this.
According to the MySQL documentation, the DATE type is only 3 bytes, compared to 2 bytes for SMALLINT:
http://dev.mysql.com/doc/refman/5.0/en/storage-requirements.html
You are, therefore, going to save one byte per row (you say 2000 rows per week)... so let's say 2 KB per week, or about 104 KB per year.
If this table has no child tables (no foreign keys referencing it), then to conserve space you might consider omitting the surrogate primary key (id) and instead using a composite key (productId, date_) as the primary key. (From what you describe, it sounds as if you are going to want the combination of those columns to be UNIQUE, and both of those columns to be NOT NULL.)
If what you want to store is a "week" identifier rather than a DATE, there's no problem on the database side of things, as long as your queries aren't wrapping that column in an expression to get a DATE value to use in predicates. That is, for performance, your predicates are going to need to be on the bare "week identifier" column, e.g.
WHERE t.product_id = 195 AND t.week_id >= 27 AND t.week_id < 40
Predicates like that on the bare column will be sargable (that is, they allow an index to be used). You do NOT want to wrap that week_id column in an expression that returns a DATE and put the WHERE clause on that expression. (Having expressions on the literal side of the comparison is not a problem... you just don't want them on the "table" side.)
That's really going to be the determining factor of whether you can use a week_id in place of a DATE column.
Using a "period id" in place of a DATE is fairly straightforward to implement for periods that are whole months. (It's also straightforward for "days", but is really of less benefit there.) Implementing this approach for "week" periods is more complicated, because of the handling you need for a week that is split between two years.
Consider, for example, that the last two days of this year (2012) are on Sunday and Monday, but Tuesday thru Saturday of that same week are in 2013. You'd need to decide whether that's two separate weeks, or whether that's the same week.
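For illustration, here is a minimal sketch of how a week identifier could be derived in MySQL from an arbitrary base date like the one in the question (taken here as '2012-10-10'; the table and column names are only illustrative). Weeks counted this way never split across calendar years, which sidesteps the problem just described:

SELECT FLOOR(DATEDIFF('2013-01-01', '2012-10-10') / 7) AS week_id;  -- 83 days after the base date, so week 11

-- storing the derived week_id at insert time (hypothetical names)
INSERT INTO product_week_counter (product_id, week_id, counter)
VALUES (195, FLOOR(DATEDIFF(CURDATE(), '2012-10-10') / 7), 0);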
But the 1-byte savings (of SMALLINT vs DATE) isn't the real benefit. What the "week_id" column gets you (as I see it) is a single id value that identifies a week. Consider the date values '2012-07-30', '2012-07-31', and '2012-08-01': they all really represent the same week. So you have multiple values for the week, such that a UNIQUE constraint on (product_id, date) doesn't really GUARANTEE (on the database side) that you don't have more than one row for the same week. (That's not an insurmountable problem, of course; you can specify that you only store a Sunday (or Monday) date value.)
In summary,
To conserve space, I would first drop that surrogate id column, and make the combination of the product_id and the DATE be the primary key.
Then I would ONLY consider changing that DATE into a SMALLINT, if I could GUARANTEE that all queries would be referencing that bare SMALLINT column, and NOT referencing an expression that converts the SMALLINT column back into DATE.
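To make that summary concrete, here is a rough sketch of the first recommendation (the table name and column types are assumptions, not from the original schema):

-- composite primary key instead of a surrogate id; both columns NOT NULL
CREATE TABLE product_week_counter (
  product_id INT  NOT NULL,
  date_      DATE NOT NULL,  -- or week_id SMALLINT NOT NULL, subject to the caveat above
  counter    INT  NOT NULL DEFAULT 0,
  PRIMARY KEY (product_id, date_)
);

-- sargable range query on the bare column (August 2012 through September 2013)
SELECT product_id, date_, counter
FROM product_week_counter
WHERE product_id = 195
  AND date_ >= '2012-08-01'
  AND date_ <  '2013-10-01';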
Related
Suppose I have a simple table with these columns:
| id | user_id | order_id |
About 1,000,000 rows are inserted into this table per month, and as you can see, the relation between user_id and order_id is 1 to M.
The records from the last month are needed for accounting purposes; the others are just for showing order histories to the users. To archive records older than the last month, I have two options in mind:
First, create a similar table and copy the old records into it each month, so it will get bigger and bigger each month according to the growth of orders.
Second, create a table like the one below:
| id | user_id | order_ids |
and each month, for each row to be inserted into this table, if the user_id already exists, just update order_ids by appending the new order_id to the end.
In this solution, the number of rows in the table will grow according to the user growth rate.
Suppose that for each solution we have an index on user_id.
Now the question is: which one is more optimized for selecting all order_ids per user under load on the server?
The first one has many more records than the second, but the second requires some programming-language work to split order_ids.
The first choice is the better choice from among the two you have shown. With respect, I should say your second choice is a terrible idea.
MySQL (like all SQL DBMS systems) is excellent at handling very large numbers of rows of uniformly laid-out (that is, normalized) data.
But, your best choice is to do nothing except create appropriate indexes to make it easy to look up order history by date or by user. Leave all your data in this table and optimize lookup instead.
Until this table contains at least fifty million rows (at least four years' worth of data), the time you spend reprogramming your system to allow it to be split into a current and an archive version will be far more costly than just keeping it together.
If you want help figuring out which indexes you need, you should ask another question showing your queries. It's not clear from this question how you look up orders by date.
In a 1:many relationship, don't make an extra table. Instead have the user_id be a column in the Orders table. Furthermore, this is likely to help performance:
PRIMARY KEY(user_id, order_id),
INDEX(order_id)
Is a "month" a calendar month? Or "30 days ago until now"?
If it is a calendar month, consider PARTITION BY RANGE(TO_DAYS(datetime)) and have an ever-increasing list of monthly partitions. However, do not create future months in advance; create them just before they are needed. More details: http://mysql.rjweb.org/doc.php/partitionmaint
Note: This would require adding datetime to the end of the PK.
At 4 years' worth of data (48 partitions), it will be time to rethink things. (I recommend not going much beyond that number of partitions.)
Read about "transportable tablespaces". This may become part of your "archiving" process.
Use InnoDB.
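Here is a rough sketch of how those pieces might fit together (the table name, column types, and partition boundaries are assumptions for illustration):

CREATE TABLE orders (
  user_id  INT      NOT NULL,
  order_id BIGINT   NOT NULL,
  datetime DATETIME NOT NULL,
  -- other order columns go here
  PRIMARY KEY (user_id, order_id, datetime),  -- datetime added to the end of the PK
  INDEX (order_id)
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(datetime)) (
  PARTITION p2017_11 VALUES LESS THAN (TO_DAYS('2017-12-01')),
  PARTITION p2017_12 VALUES LESS THAN (TO_DAYS('2018-01-01'))
  -- add the next month's partition just before it is needed
);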
With that partitioning, either of these becomes reasonably efficient:
WHERE user_id = 123
AND datetime > CURDATE() - INTERVAL 30 DAY
WHERE user_id = 123
AND datetime >= '2017-11-01' -- or whichever start-of-month you need
Each of the above will hit at most one non-empty partition more than the number of months desired.
If you want to discuss this more, please provide SHOW CREATE TABLE (in any variation), plus some of the important SELECTs.
I want to store daily fund data for approximately 2000 funds over 20 years or more. At first I figured I would just create one giant table with one column per fund and one row per date. I ran into trouble trying to create this table, and I also realised that a table like that would have a lot of NULL values (almost half the values would be NULL).
Is there a more efficient way of structuring the table or database for quickly finding and fetching the data for a specific fund over hundreds (or thousands) of days?
The alternative way I've thought of doing this is with three columns (date, fund_id, fund_value). This however does not seem optimal to me since both the date and fund_id would be duplicated many times over. Having a few million data points just for the date (instead of a few thousand) seems wasteful.
Which is the better option? Or is there a better way to accomplish this?
Having the three columns you mention is fine. fund_value is the price of fund_id on fund_date, so fund_id and fund_date would be the PK of this table.
I don't understand what you mean by "having a few million data points just for the date...". If you have 2,000 funds, a particular date will appear in at most 2,000 rows, one for each fund. This is not needless duplication; it is necessary to uniquely identify the value of a particular fund on a particular date. If you added, say, fund_name to the table, that would be needless duplication. We assume the fund name will not change from day to day.
Unchanging (static) data about each fund would be contained in a separate table. The fund_id field of this table would then be a FK reference to that static table.
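For concreteness, a sketch of the two tables described above (the names and types are assumed, chosen to match the queries below):

-- static, per-fund data lives in its own table
CREATE TABLE fund (
  fund_id   INT          NOT NULL,
  fund_name VARCHAR(100) NOT NULL,
  PRIMARY KEY (fund_id)
);

-- one row per fund per date; the composite PK prevents duplicate entries
CREATE TABLE fund_value_history (
  fund_id    INT           NOT NULL,
  fund_date  DATE          NOT NULL,
  fund_value DECIMAL(15,4) NOT NULL,
  PRIMARY KEY (fund_id, fund_date),
  FOREIGN KEY (fund_id) REFERENCES fund (fund_id)
);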
To query the value of the funds on a particular date:
select fund_date as ValueDate, fund_id, fund_value
from fund_value_history
where fund_date = #aDate
and fund_id = #aFund -- to limit to a particular fund
To show the dates a fund increased in value from one day to the next:
select h1.fund_date, h2.fund_value as PreviousValue,
       h1.fund_value as PresentValue
from fund_value_history h1
join fund_value_history h2
  on h2.fund_id = h1.fund_id
 and h2.fund_date = (
       select max( fund_date )
       from fund_value_history
       where fund_id = h1.fund_id
         and fund_date < h1.fund_date )
where h2.fund_value < h1.fund_value
  and h1.fund_id = #aFund;
This would be a sizable result set, but you could modify the WHERE clause to show, for example, all funds whose values on a particular date were greater than on the previous day, or the values of all funds (or a particular fund) on a particular date and the previous day, or any number of interesting results.
You could then join to the static table to add fund name or any other descriptive data.
The three column approach you considered is the correct one. There would be no wasted space due to missing values, and you can add and remove funds at any time.
Have a search for "database normalisation", which is the discipline that covers this sort of design decision.
Edit: I should add that you're free to include other metrics in that table, of course. Since historical data is effectively static you can also store "change since previous day" as well, which is redundant strictly speaking, but may help to optimise certain queries such as "show me all the funds that decreased in value on this day".
So I've been dreading asking this question - mostly because I'm terrible at logic in Excel, and transferring logic statements to SQL is such a struggle for me, but I'm going to try to make this as clear as possible.
I have two tables. One table is historic_events and the other is future_events. Based on future_events, I have another table, confidence_interval, that calculates a z-score telling me, based on how many future_events will occur, how many historic_events data points I will need to calculate a reliable average. Each record in historic_events has a unique key called event_id. Each record in confidence_interval has a field called service_id that is unique. The service_id field also exists in historic_events, and the tables can be joined on that field.
So, with all that being said, based on the count of future events by service_id, my confidence_interval table calculates the z-score. I then need to select records from the historic_events table for each service_id that satisfy the following parameters:
SELECT EVENT_ID
FROM historic_events
WHERE END_DATE is within two calendar years of today's date
AND count of `EVENT_ID` is >= `confidence_interval.Z_SCORE`
If those parameters are not met, then I want to widen the date range to within three years.
If those parameters are still not met, I want to widen the date range to within four years, and then again to five years. If there still aren't enough data points after five years, oh well, we'll settle for what we have. We do not want to look at data points that are older than five years.
I want my end result to be a table that has a list of the EVENT_ID values, and I would re-run the SQL query for each service_id.
I hope this makes sense - I can figure out the SELECT and FROM, but I'm totally getting stuck on the WHERE.
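For what it's worth, here is one rough way that widening-window logic could be sketched in MySQL-flavoured SQL (assuming that's the target). It is only a sketch: it assumes the column names above, that confidence_interval has one row per SERVICE_ID, and that a @service_id variable is set before each run, since the query is re-run per service_id:

SET @service_id = 42;  -- hypothetical service id, set per run

SELECT h.EVENT_ID
FROM historic_events AS h
JOIN (
    -- narrowest window (2..5 years) whose event count meets the z-score;
    -- if none qualifies, fall back to the full 5 years
    SELECT COALESCE(MIN(w.yrs), 5) AS yrs
    FROM (SELECT 2 AS yrs UNION ALL SELECT 3
          UNION ALL SELECT 4 UNION ALL SELECT 5) AS w
    JOIN confidence_interval AS ci
      ON ci.SERVICE_ID = @service_id
    WHERE (SELECT COUNT(*)
           FROM historic_events AS h2
           WHERE h2.SERVICE_ID = @service_id
             AND h2.END_DATE >= CURDATE() - INTERVAL w.yrs YEAR) >= ci.Z_SCORE
) AS pick
WHERE h.SERVICE_ID = @service_id
  AND h.END_DATE >= CURDATE() - INTERVAL pick.yrs YEAR;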
I have a pricing history table with half a billion records. It is formatted like this:
Id, sku, vendor, price, datetime
What I want to do is get average price of all products by vendor for a certain date range. Most products are updated once every 3 days, but it varies.
So, this is the query I want to run:
SELECT
avg(price)
FROM table
WHERE
vendor='acme'
AND datetime > '12-15-2014'
AND datetime < '12-18-2014'
GROUP BY sku
This 3-day range is broad enough that I will be sure to get at least one price sample, but some SKUs may have been sampled more than once, hence the GROUP BY to try to get only one instance of each SKU.
The problem is, this query runs and runs and doesn't seem to finish (more than 15 minutes). There are around 500k unique SKUs.
Any ideas?
edit: corrected asin to sku
For this query to be optimized by MySQL, you need to create a composite index:
(vendor, datetime, asin)
IN THIS PARTICULAR ORDER (it matters).
It is also worth trying to create another one:
(vendor, datetime, asin, price)
since it may perform better (it is a so-called "covering index").
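For reference, creating those indexes might look like this (assuming the table is named pricing_history, as in the other answer, and using sku per the question's edit rather than asin):

-- equality column (vendor) first, then the range column (datetime)
CREATE INDEX idx_vendor_datetime_sku
    ON pricing_history (vendor, datetime, sku);

-- covering variant: the query can be answered from the index alone
CREATE INDEX idx_vendor_datetime_sku_price
    ON pricing_history (vendor, datetime, sku, price);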
Indexes with a different column order, like (datetime, vendor) (which is suggested in another answer), are useless, since datetime is used in a range comparison.
A few notes:
The index will be helpful only if the vendor='acme' AND datetime > '12-15-2014' AND datetime < '12-18-2014' filter condition covers a small part of the whole table (say, less than 10%).
MySQL does not support mm-dd-yyyy literals (at least it's not documented; see the references), so I assume it must be yyyy-mm-dd instead.
Your comparison does not cover the first second of December 15th, 2014, so you probably wanted datetime >= '2014-12-15' instead.
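Putting those notes together, the query would presumably look something like this (the table name is assumed, since FROM table is not valid as written because TABLE is a reserved word; still grouping per sku as in the question):

SELECT sku, AVG(price)
FROM pricing_history
WHERE vendor = 'acme'
  AND datetime >= '2014-12-15'
  AND datetime <  '2014-12-18'
GROUP BY sku;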
References:
http://dev.mysql.com/doc/refman/5.6/en/range-optimization.html
http://dev.mysql.com/doc/refman/5.6/en/date-and-time-literals.html
You need an index to support your query. Suggest you create an index on vendor and datetime like so:
CREATE INDEX pricing_history_date_vendor ON pricing_history (datetime, vendor);
Also, I assume you wanted to group by sku rather than undefined column asin.
Not to mention your non-standard SQL date format MM-dd-yyyy as pointed out by others in comments (should be yyyy-MM-dd).
I have two different queries, but they are both on the same table and have the same WHERE clause, so they are selecting the same rows.
Query 1:
SELECT HOUR(timestamp), COUNT(*) as hits
FROM hits_table
WHERE timestamp >= CURDATE()
GROUP BY HOUR(timestamp)
Query 2:
SELECT country, COUNT(*) as hits
FROM hits_table
WHERE timestamp >= CURDATE()
GROUP BY country
How can I make this more efficient?
If this table is indexed correctly, it honestly doesn't matter how big the entire table is because you're only looking at today's rows.
If the table is indexed incorrectly the performance of these queries will be terrible no matter what you do.
Your WHERE timestamp >= CURDATE() clause means you need to have an index on the timestamp column. In one of your queries the GROUP BY country shows that a compound covering index on (timestamp, country) will be a great help.
So, a single compound index (timestamp, country) will satisfy both the queries in your question.
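Concretely, that compound index could be created like this (the index name is arbitrary):

CREATE INDEX idx_hits_timestamp_country
    ON hits_table (timestamp, country);

Because both timestamp and country are in the index, both queries can be satisfied from the index alone, as the steps below describe.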
Let's explain how that works. To look for today's records (or indeed any records starting and ending with particular timestamp values) and group them by country, and count them, MySQL can satisfy the query by doing these steps:
1. Random-access the index to the first record that matches the timestamp. O(log n).
2. Grab the first country value from the index.
3. Scan to the next country value in the index and count. O(n).
4. Repeat step three until the end of the timestamp range.
This index scan operation is about as fast as a team of ace developers (the MySQL team) can get it to be with a decade of hard work. (You may not be able to outdo them on a Saturday afternoon.) MySQL satisfies the whole query with a small subset of the index, so it doesn't really matter how big the table behind it is.
If you run one of these queries right after the other, it's possible that MySQL will still have some or all the index data blocks in a RAM cache, so it might not have to re-fetch them from disk. That will help even more.
Do you see how your example queries lead with timestamp? The most important WHERE criterion chooses a timestamp range. That's why the compound index I suggested has timestamp as its first column. If you don't have any queries that lead with country your simple index on that column probably is useless.
You asked whether you really need compound covering indexes. You probably should read about how they work and make that decision for yourself.
There's obviously a tradeoff in choosing indexes. Each index slows the process of INSERT and UPDATE a little, and can speed up queries a lot. Only you can sort out the tradeoffs for your particular application.
Since both queries have different GROUP BY clauses they are inherently different and cannot be combined. Assuming there already is an index present on the timestamp field there is no straightforward way to make this more efficient.
If the dataset is huge (10 million or more rows) you might get a little extra efficiency out of making an extra combined index on (country, timestamp), but that's unlikely to be measurable, and the lack of it will usually be mitigated by MySQL's own in-memory buffering if these two queries are executed directly after one another.