I have a large table containing hourly statistical data broken down across a number of dimensions. It's now large enough that I need to start aggregating the data to make queries faster. The table looks something like:
customer INT
campaign INT
start_time TIMESTAMP
end_time TIMESTAMP
time_period ENUM('hour', 'day', 'week')
clicks INT
I was thinking that I could, for example, insert a row into the table where campaign is null, and the clicks value would be the sum of all clicks for that customer and time period. Similarly, I could set the time period to "day" and this would be the sum of all of the hours in that day.
I'm sure this is a fairly common thing to do, so I'm wondering what the best way to achieve this is in MySQL. I'm assuming an INSERT INTO combined with a SELECT statement (like a materialized view) - however, since new data is constantly being added to this table, how do I avoid re-calculating aggregate data that I've previously calculated?
I have done something similar, and here are the problems I had to deal with:
You can group by FLOOR(UNIX_TIMESTAMP(start_time)/86400) to collapse all entries from the same day into one summary row. (Weeks work almost the same way, with 604800 seconds.)
The SQL will look like:
insert into the_table (customer, campaign, start_time, end_time, time_period, clicks)
select
    customer,
    NULL,
    from_unixtime(floor(unix_timestamp(start_time) / 86400) * 86400),
    from_unixtime(floor(unix_timestamp(start_time) / 86400) * 86400 + 86400),
    'day',
    sum(clicks)
from the_table
where time_period = 'hour' and start_time between <A> and <B>
group by customer, floor(unix_timestamp(start_time) / 86400);
delete from the_table
where time_period = 'hour' and start_time between <A> and <B>;
If you insert a summary from a table back into the same table, MySQL will use a temporary table (meaning part of the data is copied aside and then dropped, for each statement), so you must be very careful with the indexes and with the amount of data returned by the inner select.
When you are constantly inserting and deleting rows, you will run into fragmentation issues sooner or later, and that will slow you down dramatically. The solution is to use partitioning and drop old partitions from time to time. Alternatively, you can run an OPTIMIZE TABLE statement, but it will block your work for a relatively long time (possibly minutes).
To avoid a mess with duplicate data, you may want to keep a separate table for each aggregation period (hour_table, day_table, ...), as sketched below.
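A rough sketch of that layout (the names simply mirror the original schema; hour_table and week_table would look the same):
create table day_table (
  customer   INT,
  campaign   INT,
  start_time TIMESTAMP,
  end_time   TIMESTAMP,
  clicks     INT
);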
If you're trying to make the table smaller, you'll be deleting the detailed rows after you make the summary row, right? Transactions are your friend. Start one, compute the rollup, insert the rollup, delete the detailed rows, end the transaction.
If you happen to add more rows for an older time period (who does that??), you can run the rollup again - it will combine your previous rollup entry with your extra data into a new, more powerful, rollup entry.
In my database, a table has a column with an integer that needs to increment every day, counting the days that have passed from a date.
Is there any way I can do this?
I know Auto Increment exists, but I don't know if it fits for this occasion.
I found a solution using mysql events, but now I'm having trouble with the syntax.
phpMyAdmin gives me a form to complete.
https://imgur.com/Lhru1ZJ
I'm having trouble because I don't know what information to put into it.
The best way to do this is to compute the elapsed days in a query, not to update the table every day.
For example, suppose you have a table with columns id and start_date.
This query gives you the elapsed days.
SELECT id,
       DATEDIFF(CURDATE(), start_date) AS elapsed
FROM tbl;
Doing it this way is better than changing the table every day, for several reasons.
It always works even if the event doesn't fire for some reason.
Updating an entire table can get more and more expensive as the table grows.
The computational cost of computing the elapsed days is far less than the computational cost of updating every day.
If you happen to use incremental backups, updating the whole table defeats that.
It's a good practice to avoid representing the same information multiple times in the database.
You can also wrap the query in a VIEW (a generated column won't work for this, because generated column expressions can't use non-deterministic functions such as CURDATE()).
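For example, a view over the same query (the view name is just illustrative) always reports the current value:
CREATE VIEW tbl_with_elapsed AS
SELECT id,
       start_date,
       DATEDIFF(CURDATE(), start_date) AS elapsed
FROM tbl;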
You should have a look at the event scheduler in MySQL; you could use it to run a job that increments your values once a day.
MySQL Event Scheduler
The following example creates a recurring event that updates a row in a table.
First, create a new table named counter.
CREATE TABLE counter (
id INT PRIMARY KEY AUTO_INCREMENT,
counts VARCHAR(255) NOT NULL,
created_at DATETIME NOT NULL
);
Second, create an event using the CREATE EVENT statement:
CREATE EVENT IF NOT EXISTS count_event
ON SCHEDULE EVERY interval [STARTS timestamp [+ INTERVAL interval]] [ENDS timestamp [+ INTERVAL interval]]
DO
UPDATE counter SET counts = counts + YOUR_COUNT WHERE id = YOUR_ID;
Replace interval, the timestamps, YOUR_COUNT and YOUR_ID with real values.
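A filled-in sketch (the event name, start time, and id = 1 are placeholders; it assumes the event scheduler is enabled, e.g. with SET GLOBAL event_scheduler = ON):
CREATE EVENT IF NOT EXISTS daily_count_event
ON SCHEDULE EVERY 1 DAY
STARTS '2021-01-01 00:00:00'
DO
  UPDATE counter SET counts = counts + 1 WHERE id = 1;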
I have a table named data_table.
This table already has about 10 million records. Previously I used to check whether a combination of itemID, FromDate and ToDate existed before inserting data. To make things easier I created a unique index on itemID+FromDate+ToDate.
This table now has three indexes altogether: ID (PK), itemID, and UniqueIndex.
Problem
The first time I try to generate a report for, say, itemID = 2630 and the date range 2018-01-01 to 2021-01-01, it takes around 60 seconds.
The second time, with the same parameters, it takes around 1 second.
I then deleted all data for this item (2630) and reinserted many random rows for this itemID within the selected date range.
Now if I run the report a third time, it still takes around 1 second.
I thought the query results were cached the first time, which is why the second run was very fast.
But in the third step I removed all the data and reinserted different records, and the report was still as fast as the second time. Why is report generation so slow the first time for a particular item? Can anybody help me overcome this issue?
The table engine I use is InnoDB and my MySQL version is 5.7.33.
Update
This is my query
SELECT
*
FROM
DataTable AS D
WHERE ItemID = :ItemID
AND IsJoined = '0'
AND (
(
:paramFromDate < ToDate
AND ToDate < :paramToDate
)
OR (
:paramFromDate < FromDate
AND FromDate < :paramToDate
)
OR (
FromDate < :paramFromDate
AND :paramToDate < FromDate
)
)
ORDER BY FromDate DESC
UPDATE
Restarting MySQL makes the query slow again, and subsequent queries are fast until I restart MySQL again.
Thanks
Caching
The main cache for InnoDB is the "buffer_pool". It caches blocks (16KB each) each of which contains several consecutive rows of data or rows of an index. All operations (read or write) of rows work with those blocks.
After starting or restarting MySQL, the cache is empty. Hence, everything needs to be fetched from disk, hence 'slow'.
After reading the data once (and getting the relevant blocks pulled into the cache), a second query will find them cached and be 'fast'.
Inserted rows require the relevant block(s) to be in the cache, so for some time after doing an INSERT, a SELECT of those rows will be 'fast'.
Better INDEX
As for optimizing that query, have
INDEX(ItemID, IsJoined, FromDate) -- (in this order)
The first two columns help with part of the WHERE.
The OR in the rest of the WHERE prevents any useful optimization involving the two date columns.
However, the Optimizer may be able to avoid a sort (for the ORDER BY) if it chooses to use the FromDate that I added to the index.
If you are checking for overlapping date ranges, see if this meets your needs:
AND fromDate <= :toParam
AND :fromParam <= toDate
If that works for you, then another part of the WHERE is handled by my index. (But it will not be possible to also have the other part handled.) (Also, I don't know whether you need < or <=.)
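Putting the two suggestions together, a sketch might look like this (the index name is arbitrary, and the <= comparisons assume ranges touching at an endpoint count as overlapping):
ALTER TABLE DataTable
  ADD INDEX idx_item_joined_from (ItemID, IsJoined, FromDate);
SELECT *
FROM DataTable
WHERE ItemID = :ItemID
  AND IsJoined = '0'
  AND FromDate <= :paramToDate   -- the range starts before the report window ends
  AND :paramFromDate <= ToDate   -- and ends after the report window starts
ORDER BY FromDate DESC;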
I have to collect statistics by day, week, month and year of user activity for a site. I am at the DB design stage and I want to get this stage right, since it will make my coding life easier.
What I have to do is simply increment the values in the fields by 1 in the DB each time an activity happens. Then I can pull up the data by each day, week, month and year. How should my DB be structured? Apologies if this is a simple question for most. It would also be great if this structure could be extendable so that it can be broken down by other categories.
The bit I am having trouble with is that each month is made up of a different number of days, and these change with each calendar year.
Thanks all for any help or direction.
Other info: Linux Machine, making use of PHP and MySQL
Instead of updating counts per day, week etc. just INSERT a row into a table each time an activity happens like this:
insert into activities (activity_date, activity_info)
values (CURRENT_TIMESTAMP, 'whatever');
Now your reports are very simple like:
select count(*) from activities
where activity_date between '2008-01-01' and '2008-01-07';
or
select YEARWEEK(`activity_date`) as theweek, count(*)
from activities
group by theweek;
You may just add records into the table and SELECT them using aggregate functions.
If for some reason you need to keep aggregated statistics, you may use:
CREATE TABLE aggregates (type VARCHAR(20), part VARCHAR(10) NOT NULL PRIMARY KEY, activity INT)
INSERT INTO aggregates (type, part, activity)
VALUES ('year', SUBSTRING(SYSDATE(), 1, 4), 1)
ON DUPLICATE KEY UPDATE activity = activity + 1
INSERT INTO aggregates (type, part, activity)
VALUES ('month', SUBSTRING(SYSDATE(), 1, 7), 1)
ON DUPLICATE KEY UPDATE activity = activity + 1
INSERT INTO aggregates (type, part, activity)
VALUES ('day', SUBSTRING(SYSDATE(), 1, 10), 1)
ON DUPLICATE KEY UPDATE activity = activity + 1
This will automatically update existing rows and insert non-existing when needed.
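Reading the aggregates back is then a simple lookup, for example:
SELECT part, activity
FROM aggregates
WHERE type = 'day'
ORDER BY part;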
table of events : id, activity id, datetime, userid.
table of users : id, username etc
table of activities : id, activity name, etc
Just enter a new row into events when an event happens. Then you can analyse the events by manipulating time, date, user, activity, etc.
To start with, you would probably imagine a single table, as this would be the most normalized form. The table would simply have an entry for each hit you receive, with each row containing the date/time of that hit.
Now, this way, in order to get statistics for each hour, day, week etc, the queries are simple but your database will have to do some pretty heavy query work. In particular, queries that do sums, counts or averages will need to fetch all the relevant rows.
You could get around this by precalculating the required counts in a second table, and making sure you sync that table to the first regularly. Problem is, you will be responsible for keeping that cache in sync yourself.
This would probably involve making a row for each hour. It will still be a lot quicker to do a query for a day, or a month, if you are only fetching a maximum of 24 rows per day.
Your other suggestion was to aggregate it from the start, never storing every single hit as a row. You would probably do that, as before, with a row for each hour. Every hit would increment the relevant hour's row by one. You would only have the data in one location, and it would already be pretty well summarised.
The reason I suggest by hour instead of by day, is that this still gives you the option to support multiple time zones. If your granularity is only to the day, you don't have that option.
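A minimal sketch of that hourly bucket (the table and column names are made up for illustration):
CREATE TABLE hourly_hits (
  bucket_start DATETIME PRIMARY KEY,
  hits         INT NOT NULL
);
-- run for every hit: creates the hour's row if needed, otherwise increments it
INSERT INTO hourly_hits (bucket_start, hits)
VALUES (DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00'), 1)
ON DUPLICATE KEY UPDATE hits = hits + 1;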
Tony Andrews' answer is the simplest, however a snowflake structure is sometimes used in data warehouse applications: a table that counts all the activities, another for activities per day, another for activities per month, and a third for activities per year.
With this kind of structure, the activity between any two dates can be computed very efficiently.
https://en.wikipedia.org/wiki/Snowflake_schema
Use a star schema design. (or perhaps a snowflake design).
Star-Schema Design
You will end up doing an insert into a fact table for each new activity. See Tony's suggestion.
You will need at least two dimension tables, one for users and one for time frames. There will probably be dimensions for activity type, and maybe even for location. It depends on what you want to do with the data.
Your question relates to the time frames dimension table. Let's call it "Almanac". Choose a granularity. Let's say the day. The almanac will have one row per day. The primary key can be the date. Your fact table should include this primary key as a foreign key, in order to make joins easier. (It doesn't matter whether or not you declare it as a foreign key. That only affects referential integrity during your update process.)
Include columns in the Almanac for each reporting period you can think of. Week, Month, Quarter, Year, etc. You can even include reporting periods that relate to your company's own calendar.
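As a rough sketch, such a dimension table might look like this (the column names are illustrative; you would pre-fill it with one row per day well into the future):
CREATE TABLE almanac (
  day_date     DATE PRIMARY KEY,
  week_of_year TINYINT,
  month_num    TINYINT,
  quarter_num  TINYINT,
  year_num     SMALLINT
);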
Here's an article comparing ER and DM. I'm unusual in that I like both methods, choosing the appropriate method for the appropriate task.
http://www.dbmsmag.com/9510d05.html
I have a database with ~200 tables that I want to audit to ensure the tables won't grow too large.
I know I can easily get an idea of a lot of the table attributes I want (size in MB, rows, row length, data length, etc) with:
SHOW TABLE STATUS FROM myDatabaseName;
But it's missing one key piece of information I'm after: how many rows are added to each table in a given time period?
My records each contain a datestamp column in matching formats, if it helps.
Edit: Essentially, I want something like:
SELECT COUNT(*)
FROM *
WHERE datestamp BETWEEN [begindate] AND [enddate]
GROUP BY tablename
The following should work to get the number of rows entered into a given table for a given time period:
select count(*) from [tablename] where datestamp between [begindate] and [enddate]
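There is no single statement that covers every table at once, but you can generate the per-table queries from information_schema; a sketch (it assumes every table has a datestamp column, and the dates are placeholders):
SELECT CONCAT(
  'SELECT ''', table_name, ''' AS tablename, COUNT(*) AS added FROM ', table_name,
  ' WHERE datestamp BETWEEN ''2021-01-01'' AND ''2021-01-31'';'
) AS stmt
FROM information_schema.tables
WHERE table_schema = 'myDatabaseName';
Run the generated statements (by hand or from a script) to get the per-table counts for the period.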
After a bit of research, it looks like this isn't possible in MySQL, since it would require massive table reads (after all, the number of rows can differ between users).
Instead, I grabbed the transaction logs for all the jobs that write into the tables and I'll parse them. A bit hacky, but it works.