I've got a table which keeps track of article views. It has the following columns:
id, article_id, day, month, year, views_count.
Let's say I want to track daily views for every article. If I have 1,000 user-written articles, the number of rows per year would be:
365 (1 year) * 1,000 => 365,000
That is not too bad. But let's say the number of articles grows to 1M and three years pass. The number of rows would then be:
365 * 3 * 1,000,000 => 1,095,000,000
Obviously this table will keep growing over time, and quite fast. What problems will this cause? Or should I not worry, since RDBMSs commonly handle situations like this?
I plan on using the views data in our reports, broken down by month or even by year. Should I worry about 1B+ rows in a table?
The question to ask yourself (or your stakeholders) is: do you really need 1-day resolution on older data?
Have a look into how products like MRTG, via RRD, do their logging. The theory is you don't store all the data at maximum resolution indefinitely, but regularly aggregate them into larger and larger summaries.
That allows you to have 1-second resolution for perhaps the last 5 minutes, then 5-minute averages for the last hour, then hourly averages for a day, daily averages for a month, and so on.
So, for example, if you have a bunch of records like this for a single article:
year | month | day | count | type
-----+-------+-----+-------+------
2011 | 12 | 1 | 5 | day
2011 | 12 | 2 | 7 | day
2011 | 12 | 3 | 10 | day
2011 | 12 | 4 | 50 | day
You would then, at regular intervals, create one or more new records that summarise these data. In this example that is just the total count for the month:
year | month | day | count | type
-----+-------+-----+-------+------
2011 | 12 | 0 | 72 | month
Or the average per day:
year | month | day | count | type
-----+-------+-----+-------+------
2011 | 12 | 0 | 2.3 | month
Of course, you may need some flag to indicate the "summarised" status of the data. In this case I've used a 'type' column to distinguish the "raw" records from the processed ones, which lets you purge the day records as required.
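-- Roll the daily rows up into one 'month' row per article and month.
-- day is set to 0 to mark a summary row, as in the example tables above.
INSERT INTO statistics (article_id, year, month, day, count, type)
SELECT article_id, year, month, 0, SUM(count), 'month'
FROM statistics
WHERE type = 'day'
GROUP BY article_id, year, month;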
(I haven't tested that query, it's just an example)
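For what it's worth, here is a minimal runnable sketch of that roll-up. It uses Python's built-in sqlite3 module with an in-memory database purely as a stand-in for your RDBMS. The statistics layout follows the example above; the roll_up_month helper, the day = 0 convention for summary rows, and the extra January row are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE statistics (
        article_id INTEGER NOT NULL,
        year       INTEGER NOT NULL,
        month      INTEGER NOT NULL,
        day        INTEGER NOT NULL,  -- 0 marks a monthly summary row (assumption)
        count      INTEGER NOT NULL,
        type       TEXT    NOT NULL   -- 'day' or 'month'
    )
""")

# The four December rows from the example, plus one row of a month still in progress.
conn.executemany(
    "INSERT INTO statistics VALUES (?, ?, ?, ?, ?, 'day')",
    [(1, 2011, 12, 1, 5), (1, 2011, 12, 2, 7), (1, 2011, 12, 3, 10),
     (1, 2011, 12, 4, 50), (1, 2012, 1, 1, 3)],
)


def roll_up_month(conn, year, month):
    """Replace the daily rows of one finished month with a monthly total per article."""
    with conn:  # one transaction: the summary insert and the purge commit together
        conn.execute(
            """
            INSERT INTO statistics (article_id, year, month, day, count, type)
            SELECT article_id, year, month, 0, SUM(count), 'month'
            FROM statistics
            WHERE type = 'day' AND year = ? AND month = ?
            GROUP BY article_id, year, month
            """,
            (year, month),
        )
        conn.execute(
            "DELETE FROM statistics WHERE type = 'day' AND year = ? AND month = ?",
            (year, month),
        )


roll_up_month(conn, 2011, 12)
rows = conn.execute(
    "SELECT year, month, day, count, type FROM statistics ORDER BY year, month, day"
).fetchall()
print(rows)
```

Doing the insert and the delete in one transaction means a failure halfway through cannot leave you with both the day rows and the month row (which would double count), or with neither.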
The answer is "it depends", but yes, it will probably be a lot to deal with.
However, this is generally a "cross that bridge when you come to it" problem. It's a good idea to think about what you could do if this becomes a problem in the future, but it's probably too early to implement any of it until it's necessary.
My suggestion, if it ever comes to that, is to not keep the individual records for longer than X months (where you adjust X according to your needs). Instead, you'd store the aggregated data that you currently feed into your reports. You'd run, say, a daily script that looks at your records, grabs any that are over X months old, creates a "daily_stats" object of some sort from them, and then deletes the originals (or, better yet, archives them somewhere).
This ensures that only X months' worth of data is ever in the db, but you still have quick access to an aggregated form of the stats for long-timeline reports.
It's not something you need to worry about if you can put some practices in place:
Partition the table; this should make archiving easier to do
Determine how much data you need at present
Determine how much data you can archive
Ensure that the table is well designed, for example in its data types and indexes
Schedule when you will archive partitions that meet the aging requirements
Schedule index checks (and other table checks)
If you have a DBA in your team, then you can discuss it with him/her, and I'm sure they'll be glad to assist.
Also, as is done in many data warehouses (and as @Taryn's post, which I agree with, suggests), store aggregated data as well. I'm suggesting this quickly, based on the data you keep in the table involved. If records may be edited or updated, that makes it even clearer that you will have to set restrictions, such as how much raw data to keep (this is the data that can still be modified), and have procedures and jobs in place to ensure that the aggregated data is checked and updated daily, and can be checked and updated manually when changes are made. This way, data integrity is maintained. Discuss with your DBA what other approaches you can take.
By the way, in case you didn't already know: aggregated data are normally needed for weekly or monthly reports, and for many other interval-based reports. Make your aggregation as granular as you need, but not so fine that it becomes tedious or excessive.
I have a table of "outcomes" of scores for a few hundred people on specific days. E.g.
Person | Date     | Score
-------+----------+------
1      | 1/1/2021 | 10
2      | 1/1/2021 | 15
1      | 2/2/2022 | 20
3      | 2/2/2022 | 17
I will need to repeatedly compare each player's average score for a specific date range, e.g. get each player's average score between 1/1/2021 and 12/31/2021.
I know that I could query their average using the AVG(score) aggregate function, like SELECT Person, AVG(Score) FROM outcomes WHERE date < ? GROUP BY Person;
However, since I have hundreds of players with possibly hundreds of outcomes each, I am worried that repeatedly running this query will produce a lot of row reads. I am considering creating an "averages" table or view with an entry for each player on each unique date, where the Score is the average of that player's outcomes before that date.
Something like:
Person | EndDate  | AVG(Score)
-------+----------+-----------
1      | 1/2/2021 | 10
2      | 1/2/2021 | 15
3      | 1/2/2021 | 0
1      | 2/3/2022 | 15
2      | 2/3/2022 | 15
3      | 2/3/2022 | 17
I realize that this is essentially at least doubling the amount of storage required, because each outcome will also have the associated "average" entry.
How is this kind of problem often addressed in practice? At what point does creating an "averages" table make sense? When is using the AVG(x) function most appropriate? Should I just add an "average" column to my "outcomes" table?
I was able to implement my query using the aggregate AVG(x) function, but I am concerned about the number of row reads that my database quickly started requiring.
What you are describing is a form of denormalization: storing the result of an aggregation instead of running the query every time you need it.
When to implement this? When running the query cannot be done fast enough to meet your performance goals.
Be cautious about adopting denormalization too soon. It comes with a cost.
The risk is that if your underlying data changes, but your denormalized copy is not updated, then the stored averages will be outdated. You have to decide whether it's acceptable to query outdated aggregate results from the denormalized table, and how often you want to update those stored results. There isn't one answer to this — it's up to your project requirements and your judgment.
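To make the trade-off concrete, here is a small sketch using Python's built-in sqlite3 module as a stand-in database. The outcomes table follows the question; the cached_averages table, the live_averages and refresh_cache helpers, the ISO date strings, and the extra mid-2021 score are all assumptions made for illustration. It shows the stored copy going stale as soon as a new outcome arrives.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE outcomes (person INTEGER, date TEXT, score REAL);
    CREATE INDEX outcomes_by_date ON outcomes (date, person);
    -- Hypothetical denormalized copy: one stored average per person.
    CREATE TABLE cached_averages (person INTEGER PRIMARY KEY, avg_score REAL);
""")
conn.executemany(
    "INSERT INTO outcomes VALUES (?, ?, ?)",
    [(1, "2021-01-01", 10), (2, "2021-01-01", 15),
     (1, "2022-02-02", 20), (3, "2022-02-02", 17)],
)


def live_averages(conn, start, end):
    """Compute the averages from the raw rows every time (always current)."""
    cur = conn.execute(
        "SELECT person, AVG(score) FROM outcomes "
        "WHERE date BETWEEN ? AND ? GROUP BY person",
        (start, end),
    )
    return dict(cur.fetchall())


def refresh_cache(conn, start, end):
    """Rebuild the stored averages; they are only as fresh as the last call."""
    with conn:
        conn.execute("DELETE FROM cached_averages")
        conn.execute(
            "INSERT INTO cached_averages (person, avg_score) "
            "SELECT person, AVG(score) FROM outcomes "
            "WHERE date BETWEEN ? AND ? GROUP BY person",
            (start, end),
        )


refresh_cache(conn, "2021-01-01", "2021-12-31")

# A new outcome arrives after the cache was built.
conn.execute("INSERT INTO outcomes VALUES (1, '2021-06-01', 20)")

cached = dict(conn.execute("SELECT person, avg_score FROM cached_averages").fetchall())
live = live_averages(conn, "2021-01-01", "2021-12-31")
print(cached)  # person 1 still shows the old average
print(live)    # person 1 reflects the new outcome
```

Whether that gap matters, and how often refresh_cache would need to run, is exactly the judgment call described above.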
I want to store a large amount of cryptocurrency data in a database, and then show nice JavaScript price graphs with historical prices on a web page.
The problem is that I am not sure which database design is best for this. I was thinking about a MySQL DB, but maybe NoSQL databases are better in this case; I don't know.
What I need:
I need to track at least 100 cryptocurrencies with historical and current prices and other stock information like volume, etc.
I am going to insert new data every 10 minutes for each crypto: (6 records/hour * 24 h * 365 days) * 100 cryptos = 5,256,000 new records per year.
I need to query various time ranges for each coin to draw graph on webpage.
My idea:
I came up with this solution, but I need to know whether it is OK or whether I am completely wrong and naive.
In this case I would have two tables: a parent table storing all the necessary info about coins, and a child table holding all the prices. The child table would have to contain a huge amount of data, which worries me.
My table structure example:
tbl_coin_detail:
id | Tick_name | Name     | Algorithm | Icon
1  | BTC       | Bitcoin  | SHA256    | path/to/img
2  | ETH       | Ethereum | Ethash    | path/to/img
...
tbl_prices:
id | price_USD | price_EUR | datetime            | Volume_Day_BTC     | FK_coin
1  | 6537.2    | 5632.28   | 2018-07-01 15:00:00 | 62121.7348556964   | 1
2  | 466.89    | 401.51    | 2018-07-01 15:01:00 | 156373.79481106618 | 2
...
Another idea is to make a separate price table for each coin. That would mean 100 tables with all the historical and current prices and stock info, instead of one huge table.
I am really not sure which is better. Keeping all prices in one table makes querying simple, but I guess it could be a huge performance bottleneck. Separate tables would be worse for querying, because I would need to write a query for each table, but they might help with performance.
Can you point me in the right direction? Which is better here, an SQL DB or NoSQL?
Thank you in advance.
MySQL recommendations...
You have Volume_Day_BTC, yet you say "6 records/hour". Is the record daily or more fine-grained?
The volume of data is not that great, but it will be beneficial to shrink the datatypes before you get started.
id is unnecessary; use PRIMARY KEY(coin, datetime) instead.
Think carefully about the datatype for prices and volumes. At one extreme is space (hence, somewhat, speed); at the other, precision.
DOUBLE -- 8 bytes, about 16 significant digits, large range
DECIMAL(17, 11) -- 8 bytes, limited to $1M and 11 decimal places (not enough?)
DECIMAL(26, 13) -- 12 bytes, maybe big enough?
etc.
Would it be OK to summarize data older than, say, one month to save space? Hourly or daily avg/hi/low, etc. This would also be very useful for speeding up fetching data for graphing.
In particular, I recommend keeping a Summary table by coin+day with volume, price, etc. Consider using FLOAT (4 bytes, 7 significant digits, sufficient range) as more than good enough for graphing.
So, I am recommending 3 tables:
Coins -- 100 rows with meta info about the currencies.
Prices -- 5M rows/year of details -- unless trimmed (400MB/year)
Summary -- 36500 rows/year for graphing ranges longer than, say, a week. (4MB/yr)
It may be worth it to have an hourly summary table for shorter-range graphs. There is no need to go with weekly or monthly summaries; they can be derived from the daily with sufficient efficiency.
Use InnoDB.
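A rough sketch of that three-table layout, with the daily summary built from the detail rows. It runs against SQLite through Python's built-in sqlite3 module only so that it is self-contained; SQLite has no storage engines and the sketch simply uses REAL, so the InnoDB and DECIMAL/FLOAT advice above is not reproduced here. The column names (coin_id, ts, avg_usd and so on), the summarize_day helper, and the two extra price samples are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE coins (
        coin_id   INTEGER PRIMARY KEY,
        tick_name TEXT NOT NULL,
        name      TEXT NOT NULL
    );
    -- No surrogate id: the natural key (coin, timestamp) is the primary key.
    CREATE TABLE prices (
        coin_id   INTEGER NOT NULL REFERENCES coins(coin_id),
        ts        TEXT    NOT NULL,
        price_usd REAL    NOT NULL,
        PRIMARY KEY (coin_id, ts)
    );
    -- One row per coin per day, used for longer-range graphs.
    CREATE TABLE summary (
        coin_id  INTEGER NOT NULL REFERENCES coins(coin_id),
        day      TEXT    NOT NULL,
        avg_usd  REAL,
        high_usd REAL,
        low_usd  REAL,
        PRIMARY KEY (coin_id, day)
    );
""")
conn.execute("INSERT INTO coins VALUES (1, 'BTC', 'Bitcoin')")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?)",
    [(1, "2018-07-01 15:00:00", 6537.2),
     (1, "2018-07-01 15:10:00", 6540.0),
     (1, "2018-07-01 15:20:00", 6530.0)],
)


def summarize_day(conn, day):
    """(Re)build the summary rows for one calendar day from the detail rows."""
    with conn:
        conn.execute(
            """
            INSERT OR REPLACE INTO summary (coin_id, day, avg_usd, high_usd, low_usd)
            SELECT coin_id, date(ts), AVG(price_usd), MAX(price_usd), MIN(price_usd)
            FROM prices
            WHERE date(ts) = ?
            GROUP BY coin_id, date(ts)
            """,
            (day,),
        )


summarize_day(conn, "2018-07-01")

# A month-long graph for one coin reads at most 31 summary rows, not thousands of prices.
points = conn.execute(
    "SELECT day, avg_usd, high_usd, low_usd FROM summary "
    "WHERE coin_id = ? AND day BETWEEN ? AND ? ORDER BY day",
    (1, "2018-07-01", "2018-07-31"),
).fetchall()
print(points)
```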
Summary tables
To be honest, that's far from 'huge'. We aren't talking billions of records here, so any properly indexed DB will do just fine.
I want to use a calculated field in Access; the tricky part for me is applying it across dates. I have a database with multiple records for the same day, but with different times. Let's take this one for example:
Date | Report | Onblock Time
-----|--------|-------------
27/5 | 5:45   | 8:52
27/5 | 9:35   | 10:57
27/5 | 11:52  | 12:59
So, what I want to do is add 45 minutes to the first time that shows (in this case 5:45) and add 30 minutes to the last one (in this case 12:59). Once those two things are done, I want to calculate the difference between them.
I've tried [(Onblock Time + 0:30) - (Report - 0:45)] in the expression generator, and it seems to work. The problem is doing this for a table that has thousands of records, with 4-6 a day. Is there any sort of automated loop, like a for each or anything like that?
Thanks in advance,
Jonathan
If I understood you right, you need a query that returns, for each day, the number of minutes between the minimum ReportTime + 0:45 and the maximum OnblockTime + 0:30. If so, the SQL for the query should look like this:
SELECT ReportDate
,DateDiff("n", DateAdd("n", 45, Min([ReportTime])), DateAdd("n", 30, Max([OnblockTime]))) AS Diff
FROM TimeTable
GROUP BY ReportDate;
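To sanity-check what that query should return for the sample rows in the question, here is the same per-day calculation written out in plain Python. This is only a cross-check under the interpretation above, not something to run inside Access; the daily_diff_minutes and parse helpers are names made up for this sketch.

```python
from datetime import datetime, timedelta

# (date, report time, onblock time) as in the question's sample table
rows = [
    ("27/5", "5:45", "8:52"),
    ("27/5", "9:35", "10:57"),
    ("27/5", "11:52", "12:59"),
]


def parse(t):
    """Parse an H:MM time string into a datetime on an arbitrary fixed date."""
    return datetime.strptime(t, "%H:%M")


def daily_diff_minutes(rows):
    """Per day: minutes from (earliest report + 45 min) to (latest onblock + 30 min)."""
    firsts, lasts = {}, {}
    for day, report, onblock in rows:
        r, o = parse(report), parse(onblock)
        firsts[day] = min(firsts.get(day, r), r)
        lasts[day] = max(lasts.get(day, o), o)
    result = {}
    for day in firsts:
        start = firsts[day] + timedelta(minutes=45)
        end = lasts[day] + timedelta(minutes=30)
        result[day] = int((end - start).total_seconds() // 60)
    return result


diffs = daily_diff_minutes(rows)
print(diffs)
```

For 27/5 that is 13:29 minus 6:30, i.e. 6 hours 59 minutes, so the Diff column should show 419 for that day.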
I am trying to create a table in my MySQL database for storing click data for my posts on a daily basis. What I have come up with is something like this:
ID | post_id | click_type | created_date
1 1 page_click 2015-12-11 18:13:13
2 2 page_click 2015-12-13 11:16:34
3 3 page_click 2015-12-13 13:24:01
4 1 page_click 2015-12-15 15:31:10
With this kind of storage I can get how many clicks post number 1 got in December 2015, and I can even get how many clicks a given post got on 15 December between 1 and 11 pm. However, let's say I am getting 2,000 clicks per day: that creates 2,000 rows per day, which means 60,000 per month and 720,000 per year.
Another approach that comes to mind is the following, which stores one row per post per day; if there is more than one click that day, it increments the count:
ID | post_id | click_type | created_date | count
1 1 page_click 2015-12-11 13
2 2 page_click 2015-12-11 26
3 3 page_click 2015-12-11 152
4 1 page_click 2015-12-12 14
5 2 page_click 2015-12-12 123
6 3 page_click 2015-12-12 163
With this approach, if every page is clicked at least once a day (which creates the row), it will generate 1,000 rows each day (let's say I have 1,000 posts), so 30,000 per month and 360,000 per year.
I am looking for advice on how to store these statistics so that I can get daily click statistics. I have some concerns about performance (of course it's nothing for the big data guys :D, but excuse my lack of experience). Do you think it will be OK if there are over 1 million rows in that table after 2-3 years? And which approach do you think will be more effective for me?
720,000 records per year is not necessarily a lot of data. One option may be not to worry about it. Something to consider may be how long the click data matters. If after a year you don't really care anymore then you can have an historical data cleanup protocol that removes data that is older than you care about.
If you are worried about storing large amounts of data and you don't want to erase history, then you can consider pre-calculating your summarized statistics and storing them instead of your transaction detail.
The issue with this is that you have to know in advance what the smallest resolution of time will be that you will continue to care about. Also, if your motivation is saving space then you have to be careful that your summary data doesn't end up taking more space than the original transactions. This can easily happen if you store summarized data at multiple resolutions, as you might in a data warehouse arrangement.
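If you do go for one pre-summarized row per post per day, the usual pattern is "create the row if it is missing, then increment it". Below is a small sketch of that using Python's built-in sqlite3 module as a stand-in for MySQL (where INSERT ... ON DUPLICATE KEY UPDATE does the same job in one statement). The daily_clicks table, the click_count column name, and the record_click helper are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE daily_clicks (
        post_id      INTEGER NOT NULL,
        click_type   TEXT    NOT NULL,
        created_date TEXT    NOT NULL,            -- day only, e.g. '2015-12-11'
        click_count  INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (post_id, click_type, created_date)
    )
""")


def record_click(conn, post_id, click_type, day):
    """Count one click: create the day's row if needed, then increment it."""
    key = (post_id, click_type, day)
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO daily_clicks (post_id, click_type, created_date) "
            "VALUES (?, ?, ?)",
            key,
        )
        conn.execute(
            "UPDATE daily_clicks SET click_count = click_count + 1 "
            "WHERE post_id = ? AND click_type = ? AND created_date = ?",
            key,
        )


for _ in range(3):
    record_click(conn, 1, "page_click", "2015-12-11")
record_click(conn, 1, "page_click", "2015-12-15")
record_click(conn, 2, "page_click", "2015-12-13")

row_count = conn.execute("SELECT COUNT(*) FROM daily_clicks").fetchone()[0]
december_post_1 = conn.execute(
    "SELECT SUM(click_count) FROM daily_clicks "
    "WHERE post_id = 1 AND created_date BETWEEN '2015-12-01' AND '2015-12-31'"
).fetchone()[0]
print(row_count, december_post_1)
```

Five clicks end up as three rows here, and monthly totals are still a simple SUM. What you give up is the time of day of each click, which is the "smallest resolution" decision mentioned above.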
This seems like a good application for rrdtool (http://oss.oetiker.ch/rrdtool/). There you can specify several resolutions for different time intervals, e.g.:
average 5 min for 1 day
average 30 min for 1 week
average 2 hours for 1 month
average 1 day for 1 Year
etc. This is also often used for graphs. It is usually used with RRD files, but it can also be backed by MySQL via rrdgraph_libdbi.
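The consolidation idea behind those resolutions can be illustrated in a few lines of plain Python. This is not rrdtool's API, just a sketch of the averaging step; the consolidate function and the sample values are made up for illustration.

```python
from statistics import mean


def consolidate(samples, factor):
    """Average each run of `factor` consecutive samples into one coarser sample.

    A trailing partial run is averaged over the samples it actually contains.
    """
    return [mean(samples[i:i + factor]) for i in range(0, len(samples), factor)]


# Twelve 5-minute samples (one hour of data).
five_min = [10, 12, 14, 16, 18, 20, 30, 30, 30, 30, 30, 30]

# 6 x 5 minutes = one 30-minute average per bucket.
thirty_min = consolidate(five_min, 6)
print(thirty_min)
```

Each coarser level keeps far fewer points than the one below it, which is why the total storage stays bounded no matter how long you keep collecting.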
I have to model the following scenario in a relational database.
Imagine you have a number (say 10.000) of persons.
Imagine each of those persons may, or may not, offer a given service inside a timespan of a given day. Let's call these services "Answer phone", "Answer email", and "Answer SMS".
I have 48 timespans a day (00:00 - 00:30, 00:30 - 01:00, 01:00 - 01:30, etc.)
I have to schedule 7 week days (1 to 7)
Services can overlap one another.
I'm currently thinking about a structure like this:
id | user_id | day | t00 | t05 | t10 | [... more timespans ...] | service_type
x 001 1 1 1 0 ... 'answer_phone'
y 001 1 1 1 1 ... 'answer_email'
z 002 1 0 0 1 ... 'answer_phone'
And so on. About the t* columns:
every t* column is a boolean value
t00 means "service is ON from 00:00 to 00:29"
t05 means "service is ON from 00:30 to 00:59"
t10 means "service is ON from 01:00 to 01:29"
and so on. So, in rows "x" and "y" I've modeled that user 001 will answer the phone between 00:00 and 00:59, while answering emails from 00:00 to 01:29, on Monday.
After thinking about it for a while, this approach seems straightforward enough, but I fear it will suffer from performance and disk space issues when dealing with thousands of users.
In fact, for 10k users I would have (10k * how_many_services * 7 days) rows, which means 210.000 records. Not that many, but the number of users may grow, or new services may be added.
Can you suggest a better approach?
This is a terrible design. It's not normalized at all.
I would imagine there's a 1:many relationship between a user and their activity schedules. I'd model it that way.
If you don't know what normal forms are and why they're important, you shouldn't be doing relational modeling. Get someone who understands it to help you.
I would have separate tables for TIMES, SERVICES, USERS and ACTIONS.
TIMES would contain just the time splits (including a textual description of the time period)
SERVICES would have the service types such as answer_phone, answer_email etc. This allows for easy future expansion.
USERS would have any info on the users of the system, such as userID, name, department, whatever.
The ACTIONS table would be for linking all the above tables together using foreign keys.
An entry in the actions table would have its own primary key, user_FK, time_FK, service_FK.
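A minimal sketch of that layout, using Python's built-in sqlite3 module so it can be run as-is. The table names follow the description above; the exact column names, the day column on ACTIONS (added because the question schedules per weekday), and the UNIQUE constraint are assumptions made for this example. The sample rows reproduce rows x, y and z from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked to
conn.executescript("""
    CREATE TABLE users    (user_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE services (service_id INTEGER PRIMARY KEY, service_type TEXT NOT NULL UNIQUE);
    CREATE TABLE times    (time_id INTEGER PRIMARY KEY, description TEXT NOT NULL);
    CREATE TABLE actions (
        action_id  INTEGER PRIMARY KEY,
        user_id    INTEGER NOT NULL REFERENCES users(user_id),
        day        INTEGER NOT NULL CHECK (day BETWEEN 1 AND 7),  -- assumption: weekday
        time_id    INTEGER NOT NULL REFERENCES times(time_id),
        service_id INTEGER NOT NULL REFERENCES services(service_id),
        UNIQUE (user_id, day, time_id, service_id)
    );
""")

# 48 half-hour slots, described by their start time: 00:00, 00:30, 01:00, ...
slots = [(i, f"{i // 2:02d}:{30 * (i % 2):02d}") for i in range(48)]
conn.executemany("INSERT INTO times VALUES (?, ?)", slots)
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "001"), (2, "002")])
conn.executemany(
    "INSERT INTO services VALUES (?, ?)",
    [(1, "answer_phone"), (2, "answer_email"), (3, "answer_sms")],
)

# One row per (user, day, slot, service) that is switched ON; OFF slots store nothing.
conn.executemany(
    "INSERT INTO actions (user_id, day, time_id, service_id) VALUES (?, ?, ?, ?)",
    [(1, 1, 0, 1), (1, 1, 1, 1),                 # row x: user 001, phone, 00:00-00:59
     (1, 1, 0, 2), (1, 1, 1, 2), (1, 1, 2, 2),   # row y: user 001, email, 00:00-01:29
     (2, 1, 2, 1)],                              # row z: user 002, phone, 01:00-01:29
)

# Who answers the phone on day 1 in the slot starting at 01:00?
on_phone = conn.execute("""
    SELECT u.name
    FROM actions a
    JOIN users u    ON u.user_id = a.user_id
    JOIN services s ON s.service_id = a.service_id
    JOIN times t    ON t.time_id = a.time_id
    WHERE a.day = 1 AND s.service_type = 'answer_phone' AND t.description = '01:00'
""").fetchall()
print(on_phone)
```

Adding a new service or changing the slot length becomes a matter of inserting rows rather than altering columns.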