Is this a good way to keep track of daily views? - mysql

I have a views table that keeps track of daily views. We use this table to show daily and monthly line charts. The table looks like:
id, post_id, day, month, year, count.
Which means for every post, there is 365 views (in a year). This means if I have 1,000 posts. I would have 365,000 entries in the views table. We have several posts and counting. Sometimes 10 posts a day.
I've put an index on post_id, day, month and year.
Am worried that this may lead to db performance issues as the table grows exponentially? Should I be concerned? Or should I be fine?

I think you are making things more difficult for yourself than you need to. Why don't you just have a table defined like this:
create table daily_views
( post_id int not null
, view_date date not null
, count int not null
, primary key (post_id, view_date)
, foreign key (post_id) references post(post_id)
)
Unless you have a child table that references the daily_views table there is no particular advantage to having an auto-increment ID on daily_views. In fact, you'd just be wasting space for an index that is less useful than the natural key, which is post_id and view_date.
There is no advantage to splitting day, month and year into separate columns. If you store the date as a single field it is more efficient and you can easily aggregate across any date range, not just on days of the month, months and years.
By using this table format you will optimize the space used and the access to records which will mitigate any concern you have about performace and scalability. In terms of the number of rows you are generating, I don't think you need to worry about that. Lots and lots of databases have tables with many millions of rows. You just want to make sure each row is as compact as possible.

Related

How to effectively save last 7 days statistics in SQL database? [duplicate]

I have to collect statisctics by days, weeks, months and years of user activity for a site. I am the DB design stage and I wanted to do this stage properly since it will make my coding life easier.
What I have to do is just simply increment the values in the fields by 1 in the DB each time an activity happens. So then I can pull up the date by each day, each week, each month and year. How should my DB be structured? Apologies if this is a simple question for most. It would also be great if this structure could be extendable so that it can be broken down by other categories.
The bit am having trouble with is each month is made up of more days and these days change each calender year.
Thanks all for any help or direction.
Other info: Linux Machine, making use of PHP and MySQL
Instead of updating counts per day, week etc. just INSERT a row into a table each time an activity happens like this:
insert into activities (activity_date, activity_info)
values (CURRENT_TIMESTAMP, 'whatever');
Now your reports are very simple like:
select count(*) from activities
where activity_date between '2008-01-01' and '2008-01-07';
or
select YEARWEEK(`activity_date`) as theweek, count(*)
group by theweek
You may just add records into the table and SELECT them using aggregate functions.
If for some reason you need to keep aggregated statistics, you may use:
CREATE TABLE aggregates (type VARCHAR(20), part VARCHAR(10) NOT NULL PRIMARY KEY, activity INT)
INSERT INTO aggregates (type, part, activity)
VALUES ('year', SUBSTRING(SYSDATE(), 1, 4), 1)
ON DUPLICATE KEY UPDATE activity = activity + 1
INSERT INTO aggregates (type, part, activity)
VALUES ('month', SUBSTRING(SYSDATE(), 1, 7), 1)
ON DUPLICATE KEY UPDATE activity = activity + 1
INSERT INTO aggregates (type, part, activity)
VALUES ('day', SUBSTRING(SYSDATE(), 1, 10), 1)
ON DUPLICATE KEY UPDATE activity = activity + 1
This will automatically update existing rows and insert non-existing when needed.
table of events : id, activity id, datetime, userid.
table of users : id, username etc
table of activities : id, activity name, etc
Just enter a new row into events when an event happens. Then you can analyse the events but manipulating time, date, user, activity etc.
To start with, you would probably imagine a single table, as this would be the most normalized form. The table would simply have an entry for each hit you receive, with each row containing the date/time of that hit.
Now, this way, in order to get statistics for each hour, day, week etc, the queries are simple but your database will have to do some pretty heavy query work. In particular, queries that do sums, counts or averages will need to fetch all the relevant rows.
You could get around this by precalculating the required counts in a second table, and making sure you sync that table to the first regularly. Problem is, you will be responsible for keeping that cache in sync yourself.
This would probably involve making a row for each hour. It will still be a lot quicker to do a query for a day, or a month, if you are only fetching a maximum of 24 rows per day.
Your other suggestion was to aggregate it from the start, never storing every single hit as a row. You would probably do that, as before, with a row for each hour. Every hit would increment the relevant hours' row by one. You would only have the data in one location, and it would already be pretty well summarised.
The reason I suggest by hour instead of by day, is that this still gives you the option to support multiple time zones. If your granularity is only to the day, you don't have that option.
Tony Andrews' answer is the simplest, however a snowflake structure is sometimes used in data warehouse applications: a table that counts all the activities, another for activities per day, another for activities per month, and a third for activities per year.
With this kind of structure, the activity between any two dates can be computed very efficiently.
https://en.wikipedia.org/wiki/Snowflake_schema
Use a star schema design. (or perhaps a snowflake design).
Star-Schema Design
You will end up doing an insert into a fact table for each new activity. See Tony's suggestion.
You will need at least two dimension tables, one for users and one for time frames. There will probably be dimensions for activity type, and maybe even for location. It depends on what you want to do with the data.
Your question relates to the time frames dimension table. Let's call it "Almanac". Choose a granularity. Let's say the day. The almanac will have one row per day. The primary key can be the date. Your fact table should include this primary key as a foreign key, in order to make joins easier. (It doesn't matter whether or not you declare it as a foreign key. That only affects referential integrity during your update process.)
Include columns in the Almanac for each reporting period you can think of. Week, Month, Quarter, Year, etc. You can even include reporting periods that relate to your company's own calendar.
Here's an article comparing ER and DM. I'm unusual in that I like both methods, choosing the appropriate method for the appropriate task.
http://www.dbmsmag.com/9510d05.html
Your question relates to the time frames dimension table. Let's call it "Almanac". Choose a granularity. Let's say the day. The almanac will have one row per day. The primary key can be the date. Your fact table should include this primary key as a foreign key, in order to make joins easier. (It doesn't matter whether or not you declare it as a foreign key. That only affects referential integrity during your update process.)

How to reduce table size for date data type?

I was saving data for a table, as follows.
create table A
(
RecordDate date not null, -- 3 bytes
SomeNumber int not null, -- 4 bytes
)
For each single calendar day, about 20 thousands records for table A are created, with the same RecordDate, but different SomeNumber. That makes the RecordDate redundant for many records, I felt that there is some room to reduce the table size. Is there a way I can cut table A's size without losing the date information? Thanks.
As per my comment, I don't think it is worth the effort in your case, even with 20,000 transactions per day. I suspect it's not worthwhile for millions of transactions per day. A 'Date' field uses 3 bytes of data, which is already one of the smallest datatypes in your database.
However, I think it's an interesting question.
Bansi's suggestion is a very space efficient solution (have a date table with min and max order number) but without any table keys you could (would) break your data, and it would be very hard to fix or even know it's broken. You would also massively increase your query time, as it would have to make two calculations on every record to find dates within a range.
My solution would be a dimensional modeling style date table, with a date key and the date. Your transaction table would store a date key for you to join to. For this to be of any benefit the date key would have to be a 'smallint' type (2 bytes). You would save 1 byte per row minus a relatively insignificant amount for the date table itself.
Note that using 'smallint' restricts your number of dates to 65,535 different days, or ~180 years if the days are consecutive.

Optimizing a large MySQL table - Partitioning?

My columns are:
job_name, job_date, job_details1, job_details2 ...
There are NO Primary key columns
In my table, I expect to have 15-20 distinct jobs. Each job with exactly 2 months of data so 60 distinct job_date per job_name. And within each date there would be 100,000 records.
Query will always be a SELECT for ONE particular job_name and a range of job_date (followed by several group bys, but that's irrelevant for now). I don't want the query to go through irrelevant job_dates or job_names when queried for a particular job_name and some range of job_date.
So what sort of optimizations can I do to make my select query faster? I'm using MySQL5.6.17, which has a partitioning limit of 8096 partitions.
Something like partitioning per job_name and subpartitions for job_date within that? This is the first time I'm dealing with such large data so I'm not sure about these optimizations. Any help or tips will be appreciated.
Thanks
"Query will always be a SELECT for ONE particular job_name and a range of job_date (followed by several group bys, but that's irrelevant for now)." -- Based on that, you need
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
PRIMARY KEY(job_name, job_date, id),
INDEX(id)
ENGINE=InnoDB
Notes:
The combination of InnoDB with `PK(job_name, job_date, ...) clusters the data so that you scan exactly the rows you need, and nothing more.
No partitioning; it won't help.
I am adding an AUTO_INCREMENT and adding it to the PK because a PK must be unique. (And the PK is needed for the clustering.)
INDEX(id) (or some key starting with id) is needed for AUTO_INCREMENT.
"... followed by group bys ..." That sounds like you are summarizing data for reports? If my suggestions above are not fast enough, let's talk about Summary Tables. You might get another factor of 10 speedup.

What is the best way to "roll up" aggregate data in MySql?

I have a large table containing hourly statistical data broken down across a number of dimensions. It's now large enough that I need to start aggregating the data to make queries faster. The table looks something like:
customer INT
campaign INT
start_time TIMESTAMP
end_time TIMESTAMP
time_period ENUM(hour, day, week)
clicks INT
I was thinking that I could, for example, insert a row into the table where campaign is null, and the clicks value would be the sum of all clicks for that customer and time period. Similarly, I could set the time period to "day" and this would be the sum of all of the hours in that day.
I'm sure this is a fairly common thing to do, so I'm wondering what the best way to achieve this in MySql? I'm assuming an INSERT INTO combined with a SELECT statement (like with a materialized view) - however since new data is constantly being added to this table, how do I avoid re-calculating aggregate data that I've previously calculated?
I done something similar and here is the problems I have deal with:
You can use round(start_time/86400)*86400 in "group by" part to get summary of all entries from same day. (For week is almost the same)
The SQL will look like:
insert into the_table
( select
customer,
NULL,
round(start_time/86400)*86400,
round(start_time/86400)*86400 + 86400,
'day',
sum(clicks)
from the_table
where time_period = 'hour' and start_time between <A> and <B>
group by customer, round(start_time/86400)*86400 ) as tbl;
delete from the_table
where time_period = 'hour' and start_time between <A> and <B>;
If you going to insert summary from same table to itself - you will use temp (Which mean you copy part of data from the table aside, than it dropped - for each transaction). So you must be very careful with the indexes and size of data returned by inner select.
When you constantly inserting and deleting rows - you will get fragmentation issues sooner or later. It will slow you down dramatically. The solutions is to use partitioning & to drop old partitions from time to time. Or you can run "optimize table" statement, but it will stop you work for relatively long time (may be minutes).
To avoid mess with duplicate data - you may want to clone the table for each time aggregation period (hour_table, day_table, ...)
If you're trying to make the table smaller, you'll be deleting the detailed rows after you make the summary row, right? Transactions are your friend. Start one, compute the rollup, insert the rollup, delete the detailed rows, end the transaction.
If you happen to add more rows for an older time period (who does that??), you can run the rollup again - it will combine your previous rollup entry with your extra data into a new, more powerful, rollup entry.

Database structure for holding statistics by day, week, month, year

I have to collect statisctics by days, weeks, months and years of user activity for a site. I am the DB design stage and I wanted to do this stage properly since it will make my coding life easier.
What I have to do is just simply increment the values in the fields by 1 in the DB each time an activity happens. So then I can pull up the date by each day, each week, each month and year. How should my DB be structured? Apologies if this is a simple question for most. It would also be great if this structure could be extendable so that it can be broken down by other categories.
The bit am having trouble with is each month is made up of more days and these days change each calender year.
Thanks all for any help or direction.
Other info: Linux Machine, making use of PHP and MySQL
Instead of updating counts per day, week etc. just INSERT a row into a table each time an activity happens like this:
insert into activities (activity_date, activity_info)
values (CURRENT_TIMESTAMP, 'whatever');
Now your reports are very simple like:
select count(*) from activities
where activity_date between '2008-01-01' and '2008-01-07';
or
select YEARWEEK(`activity_date`) as theweek, count(*)
group by theweek
You may just add records into the table and SELECT them using aggregate functions.
If for some reason you need to keep aggregated statistics, you may use:
CREATE TABLE aggregates (type VARCHAR(20), part VARCHAR(10) NOT NULL PRIMARY KEY, activity INT)
INSERT INTO aggregates (type, part, activity)
VALUES ('year', SUBSTRING(SYSDATE(), 1, 4), 1)
ON DUPLICATE KEY UPDATE activity = activity + 1
INSERT INTO aggregates (type, part, activity)
VALUES ('month', SUBSTRING(SYSDATE(), 1, 7), 1)
ON DUPLICATE KEY UPDATE activity = activity + 1
INSERT INTO aggregates (type, part, activity)
VALUES ('day', SUBSTRING(SYSDATE(), 1, 10), 1)
ON DUPLICATE KEY UPDATE activity = activity + 1
This will automatically update existing rows and insert non-existing when needed.
table of events : id, activity id, datetime, userid.
table of users : id, username etc
table of activities : id, activity name, etc
Just enter a new row into events when an event happens. Then you can analyse the events but manipulating time, date, user, activity etc.
To start with, you would probably imagine a single table, as this would be the most normalized form. The table would simply have an entry for each hit you receive, with each row containing the date/time of that hit.
Now, this way, in order to get statistics for each hour, day, week etc, the queries are simple but your database will have to do some pretty heavy query work. In particular, queries that do sums, counts or averages will need to fetch all the relevant rows.
You could get around this by precalculating the required counts in a second table, and making sure you sync that table to the first regularly. Problem is, you will be responsible for keeping that cache in sync yourself.
This would probably involve making a row for each hour. It will still be a lot quicker to do a query for a day, or a month, if you are only fetching a maximum of 24 rows per day.
Your other suggestion was to aggregate it from the start, never storing every single hit as a row. You would probably do that, as before, with a row for each hour. Every hit would increment the relevant hours' row by one. You would only have the data in one location, and it would already be pretty well summarised.
The reason I suggest by hour instead of by day, is that this still gives you the option to support multiple time zones. If your granularity is only to the day, you don't have that option.
Tony Andrews' answer is the simplest, however a snowflake structure is sometimes used in data warehouse applications: a table that counts all the activities, another for activities per day, another for activities per month, and a third for activities per year.
With this kind of structure, the activity between any two dates can be computed very efficiently.
https://en.wikipedia.org/wiki/Snowflake_schema
Use a star schema design. (or perhaps a snowflake design).
Star-Schema Design
You will end up doing an insert into a fact table for each new activity. See Tony's suggestion.
You will need at least two dimension tables, one for users and one for time frames. There will probably be dimensions for activity type, and maybe even for location. It depends on what you want to do with the data.
Your question relates to the time frames dimension table. Let's call it "Almanac". Choose a granularity. Let's say the day. The almanac will have one row per day. The primary key can be the date. Your fact table should include this primary key as a foreign key, in order to make joins easier. (It doesn't matter whether or not you declare it as a foreign key. That only affects referential integrity during your update process.)
Include columns in the Almanac for each reporting period you can think of. Week, Month, Quarter, Year, etc. You can even include reporting periods that relate to your company's own calendar.
Here's an article comparing ER and DM. I'm unusual in that I like both methods, choosing the appropriate method for the appropriate task.
http://www.dbmsmag.com/9510d05.html
Your question relates to the time frames dimension table. Let's call it "Almanac". Choose a granularity. Let's say the day. The almanac will have one row per day. The primary key can be the date. Your fact table should include this primary key as a foreign key, in order to make joins easier. (It doesn't matter whether or not you declare it as a foreign key. That only affects referential integrity during your update process.)