Database table structure for storing statistics data - mysql

I am trying to create a table in my MySQL database for storing click data for my posts on a daily basis. What I came up with is something like this:
ID | post_id | click_type | created_date
1  | 1       | page_click | 2015-12-11 18:13:13
2  | 2       | page_click | 2015-12-13 11:16:34
3  | 3       | page_click | 2015-12-13 13:24:01
4  | 1       | page_click | 2015-12-15 15:31:10
With this kind of storage I can get how many clicks post number 1 received in December 2015, and I can even get how many clicks a given post received on 15 December between 1 and 11 pm. However, let's say I am getting 2,000 clicks per day: that means 2,000 new rows per day, which is 60,000 per month and 720,000 per year.
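For example (a minimal sketch; the table name post_clicks is an assumption, since the question doesn't name the table), the December 2015 count for post 1 would be something like:

-- Assumed table name: post_clicks. Clicks for post 1 in December 2015,
-- written as a range predicate so an index on (post_id, created_date) can be used.
SELECT COUNT(*) AS clicks
FROM post_clicks
WHERE post_id = 1
  AND click_type = 'page_click'
  AND created_date >= '2015-12-01'
  AND created_date <  '2016-01-01';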
Another approach that comes to mind is the following, which stores one row per post per day and increments a count if there is more than one click on that day:
ID | post_id | click_type | created_date | count
1  | 1       | page_click | 2015-12-11   | 13
2  | 2       | page_click | 2015-12-11   | 26
3  | 3       | page_click | 2015-12-11   | 152
4  | 1       | page_click | 2015-12-12   | 14
5  | 2       | page_click | 2015-12-12   | 123
6  | 3       | page_click | 2015-12-12   | 163
In this approach, if every post is clicked at least once per day (which creates its row), it will generate 1,000 rows each day (let's say I have 1,000 posts), which is 30,000 per month and 360,000 per year.
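For the second approach, the count column is usually maintained with an upsert on every click. A minimal sketch, assuming a table named post_clicks_daily with a unique key on (post_id, click_type, created_date):

-- Assumed table and key names; one row per post per day, incremented on each extra click.
INSERT INTO post_clicks_daily (post_id, click_type, created_date, `count`)
VALUES (1, 'page_click', CURDATE(), 1)
ON DUPLICATE KEY UPDATE `count` = `count` + 1;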
I am looking for advice on how to store these statistics so that I can still get daily click statistics. I have some concerns about performance (of course it's nothing for big data folks :D but sorry for my lack of experience). Do you think it will be OK if there are over 1 million rows in that table after 2-3 years? And which of the two do you think will be more effective for me?

720,000 records per year is not necessarily a lot of data. One option may be to simply not worry about it. Something to consider is how long the click data matters: if after a year you don't really care anymore, you can have a historical data cleanup process that removes data older than you care about.
If you are worried about storing large amounts of data and you don't want to erase history, then you can consider pre-calculating your summarized statistics and storing them instead of your transaction detail.
The issue with this is that you have to know in advance the smallest resolution of time you will continue to care about. Also, if your motivation is saving space, you have to be careful that your summary data doesn't end up taking more space than the original transactions; this can easily happen if you store summarized data at multiple resolutions, as you might in a data warehouse arrangement.
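As a rough sketch of that pre-calculation (the table names post_clicks and post_clicks_summary are assumptions, not from the question), a nightly job could fold yesterday's raw rows into a daily summary and then purge or archive them:

-- Roll up yesterday's raw click rows into one summary row per post/type/day.
-- Afterwards the summarized raw rows can be deleted or archived.
INSERT INTO post_clicks_summary (post_id, click_type, click_date, clicks)
SELECT post_id, click_type, DATE(created_date), COUNT(*)
FROM post_clicks
WHERE created_date >= CURDATE() - INTERVAL 1 DAY
  AND created_date <  CURDATE()
GROUP BY post_id, click_type, DATE(created_date);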

This seems like a good application for rrdtool (http://oss.oetiker.ch/rrdtool/). Here you can specify several resolutions for different time intervals, e.g.:
average 5 min for 1 day
average 30 min for 1 week
average 2 hours for 1 month
average 1 day for 1 year
etc. This is also often used for graphs. Usually it is used with rrd files, but it can also be based on MySQL with rrdgraph_libdbi.


Does MySQL compress query responses?

I am currently trying to optimise some DB queries that get run a lot. The queries are run as a SELECT against a view, and this view does a lot of joins. I thought I might be able to speed things up by caching the results of the view into a table and selecting from the table instead of the view.
Let's say I have 2 tables
People:
PersonId | Name
1        | Anne
2        | Brian
3        | Charlie
4        | Doug
CustomerPeople:
CustomerId | PersonId
1          | 1
1          | 2
1          | 3
1          | 4
2          | 1
2          | 2
and I have a view that joins the two tables to give a list of people, by name, belonging to the customer:
CustomerId | PersonName
1          | Anne
1          | Brian
1          | Charlie
1          | Doug
2          | Anne
2          | Brian
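The view definition isn't shown in the question; something along these lines would produce that output (the view name here is an assumption):

-- Assumed definition of the view being queried.
CREATE VIEW CustomerNamedPeopleView AS
SELECT cp.CustomerId, p.Name AS PersonName
FROM CustomerPeople cp
JOIN People p ON p.PersonId = cp.PersonId;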
When I query the view and look at the Duration/Fetch, it is 0.10 sec / 4.00 sec.
I decide to cache the view data into a table and create a new table:
CustomerNamedPeople:
CustomerId | PersonName
1          | Anne
1          | Brian
1          | Charlie
1          | Doug
2          | Anne
2          | Brian
This contains the exact same data; however, now when I query the table and look at the Duration/Fetch, it is 0.05 sec / 6.00 sec.
My understanding is that Duration is the time it takes the MySQL engine to run the query, and Fetch is the time it takes the data to be returned to the client (over the network). Unsurprisingly the Duration was faster, taking only 50% of the time, which makes sense since there is no longer a join occurring; however, the Fetch took 150% of the time and is slower.
My question here is: Does MySQL do some sort of response stream compression, since it knows that Anne and Brian are repeated, it can send them only once and have the client "decompress" the data?
The reason I ask is that I am doing something similar but with 1,000,000 rows returned. The data in the two responses is identical, but the view Fetch takes 20 seconds and the table Fetch takes 60 seconds. Most of the PersonNames are repeated more than once, so I am wondering if perhaps there is some sort of compression occurring in the response. Should I not expect MySQL to take the same time to fetch two sets of identical data?

DB design little or too much data

I'm currently working on a little project that uses MySQL; however, I'm struggling with the database design. I've come up with 2 designs: one stores more data and is actually the way I want it to be, but it makes it really hard to work with the data. The other is, I think, more basic and simplifies a lot of things, but it stores less data.
Design 1
Example data items table:
id | description | time_created
1  | Car         | 2021-04-17 17:30:00
2  | Bike        | 2021-04-17 17:30:00
Example data user_items table:
id | user_id | item_id | time_achieved
1  | 1       | 1       | 2021-04-17 17:30:04
2  | 1       | 1       | 2021-04-17 17:30:03
3  | 1       | 1       | 2021-04-17 17:30:17
4  | 1       | 1       | 2021-04-17 17:30:22
5  | 1       | 1       | 2021-04-17 17:30:34
6  | 1       | 2       | 2021-04-17 17:30:42
7  | 1       | 2       | 2021-04-17 17:30:54
Design 2
Example data items table:
id | description | time_created
1  | Car         | 2021-04-17 17:30:00
2  | Bike        | 2021-04-17 17:30:00
Example data user_items table:
id | user_id | item_id | count
1  | 1       | 1       | 5
2  | 1       | 2       | 2
Basically we have items that can be anything; they include a description to specify what they actually are. A user can collect items (a lot of them). These are stored in the user_items table, which contains FK columns user_id and item_id referencing the users and items tables. The users table is left out for simplicity.
As you can see, design 1 stores a lot more rows in the user_items table; this allows us to add more information (time_achieved and more) per item that a user achieved. However, this results in more rows and probably a harder time querying. Design 2, on the other hand, simply adds a count column to determine how many items the user has, but this is very limiting because we cannot add more data (achieved time, etc.) per user_item.
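(For reference, design 2's count can always be derived from design 1's rows with a simple aggregate, so keeping the detail rows doesn't lose that ability; a minimal sketch using the column names above:)

-- Per-user, per-item counts (and latest achievement time) derived from design 1.
SELECT user_id, item_id, COUNT(*) AS item_count, MAX(time_achieved) AS last_achieved
FROM user_items
GROUP BY user_id, item_id;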
I'm not sure if design 1 is the right and only design for what we want to achieve. Basically we really want to store additional metadata per user_item, but I just don't know if this is the right design since it quickly fills up the database. Does anyone have a suggestion/idea for an alternative design which stores less data than design 1 but still allows adding more info per user_item?
Thanks in advance.
Does anyone have a suggestion/idea for an alternative design which stores less data than design 1 but still allows adding more info per user_item?
Design 1 should work.
The following design will also work; it fills up quickly, but it is more efficient: put id, item_id, Item_des, Item_qty, user_id, username, time_created all in one table.
Some of the values will be repeated.

Correct DB design to store huge amount of stock cryptocurrencies data in DB

I want to store a large amount of cryptocurrency data in a DB. Then I want to show nice JavaScript price graphs with historical prices on a webpage.
The problem is that I am not sure what database design is best for this. I was thinking about a MySQL DB, but maybe a NoSQL DB is better in this case; I don't know.
What I need:
I need to track at least 100 cryptocurrencies with historical and current prices and other stock information like volume, etc.
I am going to insert new data every 10 minutes for each crypto ((6 records/hour * 24 h * 365 days) * 100 coins = 5,256,000 new records per year).
I need to query various time ranges for each coin to draw a graph on the webpage.
My idea:
I came up with this solution, but I need to know if it is OK or if I am completely wrong and naive.
In this case I would have 2 tables: a parent table where I would store all necessary info about the coins, and a child table that would hold all the prices. But this child table would have to contain a huge amount of data, which is worrying me.
My table structure example:
tbl_coin_detail:
id | Tick_name | Name     | Algorithm | Icon
1  | BTC       | Bitcoin  | SHA256    | path/to/img
2  | ETH       | Ethereum | Ethash    | path/to/img
...

tbl_prices:
id | price_USD | price_EUR | datetime            | Volume_Day_BTC     | FK_coin
1  | 6537.2    | 5632.28   | 2018-07-01 15:00:00 | 62121.7348556964   | 1
2  | 466.89    | 401.51    | 2018-07-01 15:01:00 | 156373.79481106618 | 2
...
Another idea is to make a separate table for each coin's prices; that would mean 100 tables with all historical and current prices and stock info instead of one huge table.
I am really not sure what is better here. All prices in one table are good for simple querying, but I guess that can become a huge performance bottleneck; separate tables per coin will be worse for querying, because I will need to write a query for each table, but it could help with performance.
Can you point me in the right direction on how to solve this? SQL DB or NoSQL, which is better?
Thank you in advance.
MySQL recommendations...
You have Volume_Day_BTC, yet you say "6 records/hour" -- is the record daily or more fine-grained?
The volume of data is not that great, but it will be beneficial to shrink the datatypes before you get started.
id is unnecessary; use PRIMARY KEY(coin, datetime) instead.
Think carefully about the datatype for prices and volumes. At one extreme is space (hence, somewhat, speed); at the other, precision.
DOUBLE -- 8 bytes, about 16 significant digits, large range
DECIMAL(17, 11) -- 8 bytes, limited to $1M and 11 decimal places (not enough?)
DECIMAL(26, 13) -- 12 bytes, maybe big enough?
etc.
Would it be OK to summarize data over, say, one month to save space? Hourly or daily avg/hi/low, etc. This would be very useful for speeding up fetching data for graphing.
In particular, I recommend keeping a Summary table by coin+day with volume, price, etc. Consider using FLOAT (4 bytes, 7 significant digits, sufficient range) as more than good enough for graphing.
So, I am recommending 3 tables:
Coins -- 100 rows with meta info about the currencies.
Prices -- 5M rows/year of details -- unless trimmed (400MB/year)
Summary -- 36500 rows/year for graphing range more than, say, a week. (4MB/yr)
It may be worth it to have an hourly summary table for shorter-range graphs. There is no need to go with weekly or monthly summaries; they can be derived from the daily with sufficient efficiency.
Use InnoDB.
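A rough DDL sketch of that three-table layout (column names beyond those in the question, and the exact datatypes, are assumptions picked from the options above, not a definitive design):

-- Sketch only; adjust precision and column names to your needs.
CREATE TABLE Coins (
    coin      TINYINT UNSIGNED NOT NULL AUTO_INCREMENT,  -- ~100 coins fit in 1 byte
    tick_name VARCHAR(10)  NOT NULL,
    name      VARCHAR(40)  NOT NULL,
    algorithm VARCHAR(20)  NOT NULL,
    icon      VARCHAR(255) NOT NULL,
    PRIMARY KEY (coin)
) ENGINE=InnoDB;

CREATE TABLE Prices (                       -- detail rows, every 10 minutes
    coin       TINYINT UNSIGNED NOT NULL,
    `datetime` DATETIME NOT NULL,
    price_usd  DECIMAL(17,11) NOT NULL,
    price_eur  DECIMAL(17,11) NOT NULL,
    volume     DOUBLE NOT NULL,
    PRIMARY KEY (coin, `datetime`)          -- no surrogate id, per the advice above
) ENGINE=InnoDB;

CREATE TABLE Summary (                      -- one row per coin per day, for graphing
    coin    TINYINT UNSIGNED NOT NULL,
    `day`   DATE NOT NULL,
    avg_usd FLOAT NOT NULL,
    hi_usd  FLOAT NOT NULL,
    lo_usd  FLOAT NOT NULL,
    volume  FLOAT NOT NULL,
    PRIMARY KEY (coin, `day`)
) ENGINE=InnoDB;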
To be honest, that's far from 'huge'. We aren't talking billions of records here, so any properly indexed DB will do just fine.

Best way to select n-th rows based on data in a field for mySQL table

The final result of this will be used for a graphing application where sometimes we would not want the detailed granularity of data at the level it is stored in the table. This may be hard to phrase in a single question so I will give an example:
Example table:
DateTime AddressID Amount
1/1/2015 10:00:00 1 10
1/1/2015 10:00:00 2 8
1/1/2015 10:01:00 1 7
1/1/2015 10:01:00 2 12
1/1/2015 10:02:00 1 21
1/1/2015 10:02:00 2 15
etc...
Note: The times will always have 00 for the seconds - if that helps.
Note: The entries may NOT always have an entry for every minute, but they generally should, so it is possible some times might be skipped. But there will always be an entry for both AddressIDs (1 & 2) every time without fail.
I need to return the above 3 fields, in a period of time requested (for example past 24 hours), but only for certain increments of time FOR EACH OF THE ADDRESS ID's. For example, records for every 5 minutes, or every 10 minutes.
So in the case of 5 minutes it would return:
DateTime AddressID Amount
1/1/2015 10:00:00 1 10
1/1/2015 10:00:00 2 8
1/1/2015 10:05:00 1 11
1/1/2015 10:05:00 2 17
1/1/2015 10:10:00 1 28
1/1/2015 10:10:00 2 5
etc...
Performance is very important. I hope I explained that well enough for someone to get the idea of what I need and I thank you in advance for your suggestions.
EDIT: For clarification, the 5 minutes in the above example should be the minimum time BETWEEN each row. So if, on the rare chance, there was a missing time entry for 10:05:00, it should not simply select the 10:10:00 row; it should select the 10:06:00 record, and the next row selected would then be 10:11:00, etc.
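(For reference, a minimal sketch of the simple fixed-bucket interpretation, assuming the table is named readings; it picks the earliest row per AddressID within each 5-minute bucket of the last 24 hours and does not enforce the stricter minimum-gap rule from the EDIT above:)

-- Assumed table name: readings. Earliest row per AddressID per 5-minute bucket.
SELECT r.`DateTime`, r.AddressID, r.Amount
FROM readings r
JOIN (
    SELECT AddressID, MIN(`DateTime`) AS bucket_start
    FROM readings
    WHERE `DateTime` >= NOW() - INTERVAL 24 HOUR
    GROUP BY AddressID, UNIX_TIMESTAMP(`DateTime`) DIV 300   -- 300 s = 5-minute buckets
) b ON b.AddressID = r.AddressID
   AND r.`DateTime` = b.bucket_start
ORDER BY r.`DateTime`, r.AddressID;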

Should I worry about 1B+ rows in a table?

I've got a table which keeps track of article views. It has the following columns:
id, article_id, day, month, year, views_count.
Let's say I want to keep track of daily views, each day, for every article. If I have 1,000 user-written articles, the number of rows would compute to:
365 (1 year) * 1,000 => 365,000
Which is not too bad. But let's say the number of articles grows to 1M, and time passes by to 3 years. The number of rows would then compute to:
365 * 3 * 1,000,000 => 1,095,000,000
Obviously, over time, this table will keep growing, and quite fast. What problems will this cause? Or should I not worry, since RDBMSs commonly handle situations like this?
I plan on using the views data in our reports. Either break it down to months or even years. Should I worry about 1B+ rows in a table?
The question to ask yourself (or your stakeholders) is: do you really need 1-day resolution on older data?
Have a look into how products like MRTG, via RRD, do their logging. The theory is you don't store all the data at maximum resolution indefinitely, but regularly aggregate them into larger and larger summaries.
That allows you to have 1-second resolution for perhaps the last 5 minutes, then 5-minute averages for the last hour, then hourly for a day, daily for a month, and so on.
So, for example, if you have a bunch of records like this for a single article:
year | month | day | count | type
-----+-------+-----+-------+------
2011 | 12 | 1 | 5 | day
2011 | 12 | 2 | 7 | day
2011 | 12 | 3 | 10 | day
2011 | 12 | 4 | 50 | day
You would then, at regular intervals, create new record(s) that summarise these data; in this example, just the total count for the month:
year | month | day | count | type
-----+-------+-----+-------+------
2011 | 12 | 0 | 72 | month
Or the average per day:
year | month | day | count | type
-----+-------+-----+-------+------
2011 | 12 | 0 | 2.3 | month
Of course you may need some flag to indicate the "summarised" status of the data; in this case I've used a 'type' column to distinguish the "raw" records from the processed records, allowing you to purge the day records as required.
INSERT INTO statistics (article_id, year, month, day, count, type)
SELECT article_id, year, month, 0, SUM(count), 'month'  -- day = 0 marks a monthly roll-up, as in the rows above
FROM statistics
WHERE type = 'day'
GROUP BY article_id, year, month
(I haven't tested that query, it's just an example)
The answer is "it depends", but yes, it will probably be a lot to deal with.
However, this is generally a "cross that bridge when you need to" problem. It's a good idea to think about what you could do if this becomes a problem for you in the future, but it's probably too early to actually implement any suggestions until they're necessary.
My suggestion, if it ever comes to that, is to not keep the individual records for longer than X months (where you adjust X according to your needs). Instead, you'd store the aggregated data that you currently feed into your reports. What you'd do is run, say, a daily script that looks at your records, grabs any that are over X months old, creates a "daily_stats" record of some sort, and then deletes the originals (or better yet, archives them somewhere).
This will ensure that only X months' worth of data are ever in the DB, but you still have quick access to an aggregated form of the stats for long-timeline reports.
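A minimal sketch of that kind of scheduled job, with hypothetical names (article_views for the per-day rows, monthly_stats for the roll-up) and X set to 12 months:

-- Roll daily rows older than 12 months into a monthly summary table,
-- then delete (or archive) the originals. All names here are hypothetical.
INSERT INTO monthly_stats (article_id, year, month, views_count)
SELECT article_id, year, month, SUM(views_count)
FROM article_views
WHERE (year * 12 + month) < (YEAR(CURDATE()) * 12 + MONTH(CURDATE()) - 12)
GROUP BY article_id, year, month;

DELETE FROM article_views
WHERE (year * 12 + month) < (YEAR(CURDATE()) * 12 + MONTH(CURDATE()) - 12);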
It's not something you need to worry about if you can put some practices in place.
Partition the table; this should make archiving easier to do (see the sketch after this list)
Determine how much data you need at present
Determine how much data you can archive
Ensure that the table has the right build, perhaps in terms of data types and indexes
Schedule for a time when you will archive partitions that meet the aging requirements
Schedule for index checking (and other table checks)
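As a sketch of the partitioning point above (assuming the table is keyed by (article_id, year, month, day) rather than a surrogate id, since MySQL requires the partitioning columns to be part of every unique key):

-- Hypothetical layout, RANGE-partitioned by year so old partitions can be
-- archived or dropped cheaply as they age out.
CREATE TABLE article_views (
    article_id  INT UNSIGNED      NOT NULL,
    year        SMALLINT UNSIGNED NOT NULL,
    month       TINYINT UNSIGNED  NOT NULL,
    day         TINYINT UNSIGNED  NOT NULL,
    views_count INT UNSIGNED      NOT NULL DEFAULT 0,
    PRIMARY KEY (article_id, year, month, day)
) ENGINE=InnoDB
PARTITION BY RANGE (year) (
    PARTITION p2011 VALUES LESS THAN (2012),
    PARTITION p2012 VALUES LESS THAN (2013),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);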
If you have a DBA in your team, then you can discuss it with him/her, and I'm sure they'll be glad to assist.
Also, as is done in many data warehouses (and I just saw @Taryn's post, which I agree with), store aggregated data as well. This quickly suggests itself based on the data you keep in the involved table. If you have to deal with possible editing/updating of records, then it highlights even more the fact that you will have to set restrictions such as how much data to keep (which means that data is what can be modified), and have procedures and jobs in place to ensure that the aggregated data is checked/updated daily and can be updated/checked manually when any changes are made. This way, data integrity is maintained. Discuss with your DBA what other approaches you can take...
By the way, in case you didn't already know: aggregated data are normally needed for weekly or monthly reports, and many other reports based upon an interval. Granularize your aggregation as needed, but not so much that it becomes too tedious or seemingly excessive.