I archive last year's table and create a new table at the beginning of each year. I'd like to find a way to have one multi-year table so I don't have to manually change anything each year. The columns are: row (unique), date (primary), col1, col2, col3, and so on. Users will type data (every column) into a form. I wanted to have a 'years' column populated from year(date), with a composite primary key (row, years); in other words I need the primary key to be (row, year(date)), so each year we could start again at row 1. I even looked at and tried insert/update triggers, but I don't think that's the answer. What do you think? Is this too vague?
I did something similar in one of my Firebird databases (you didn't say which DB you are using). First of all, you should have a primary key which is generated automatically - this will make life simpler. Secondly, you need one generator per year - or a table which has one row per year.
In this model, you have a table with two fields: year (primary key) and 'curvalue'. Every year you insert a new tuple such as (2012,0) or (2013, 0). You query this table to find the current value of the 'curvalue' field for the given year, then increment the curvalue field and use the incremented value in the tuple which you are inserting.
Using generators guarantees that you will always receive a unique number; using a table as I wrote above could cause problems if one client reads the current value and a second client reads the same value before the first has written its increment back.
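If the database happens to be MySQL, the same per-year counter idea can be made safe against that race by doing the increment in a single UPDATE and reading the result back with LAST_INSERT_ID(). A minimal sketch (table and column names are my own):

CREATE TABLE year_counters (
  `year`   INT NOT NULL PRIMARY KEY,
  curvalue INT NOT NULL DEFAULT 0
);

-- once per year:
INSERT INTO year_counters (`year`, curvalue) VALUES (2013, 0);

-- atomically take the next row number for that year:
UPDATE year_counters SET curvalue = LAST_INSERT_ID(curvalue + 1) WHERE `year` = 2013;
SELECT LAST_INSERT_ID();  -- use this as the new row value for 2013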
I have to collect statistics by days, weeks, months and years of user activity for a site. I am at the DB design stage and I wanted to do this stage properly since it will make my coding life easier.
What I have to do is simply increment the values in the fields by 1 in the DB each time an activity happens, so that I can then pull up the data for each day, each week, each month and each year. How should my DB be structured? Apologies if this is a simple question for most. It would also be great if this structure could be extended so that it can be broken down by other categories.
The bit I am having trouble with is that each month is made up of a different number of days, and these days change each calendar year.
Thanks all for any help or direction.
Other info: Linux Machine, making use of PHP and MySQL
Instead of updating counts per day, week etc. just INSERT a row into a table each time an activity happens like this:
insert into activities (activity_date, activity_info)
values (CURRENT_TIMESTAMP, 'whatever');
Now your reports are very simple like:
select count(*) from activities
where activity_date between '2008-01-01' and '2008-01-07';
or
select YEARWEEK(`activity_date`) as theweek, count(*)
from activities
group by theweek
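For reference, the activities table behind those statements could be as simple as this (a sketch; the column types are my own choice):

CREATE TABLE activities (
  activity_date DATETIME NOT NULL,
  activity_info VARCHAR(255),
  INDEX (activity_date)
);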
You may just add records into the table and SELECT them using aggregate functions.
If for some reason you need to keep aggregated statistics, you may use:
CREATE TABLE aggregates (type VARCHAR(20), part VARCHAR(10) NOT NULL PRIMARY KEY, activity INT);

INSERT INTO aggregates (type, part, activity)
VALUES ('year', SUBSTRING(SYSDATE(), 1, 4), 1)
ON DUPLICATE KEY UPDATE activity = activity + 1;

INSERT INTO aggregates (type, part, activity)
VALUES ('month', SUBSTRING(SYSDATE(), 1, 7), 1)
ON DUPLICATE KEY UPDATE activity = activity + 1;

INSERT INTO aggregates (type, part, activity)
VALUES ('day', SUBSTRING(SYSDATE(), 1, 10), 1)
ON DUPLICATE KEY UPDATE activity = activity + 1;
This will automatically update existing rows and insert non-existing when needed.
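Reading the aggregates back is then a single indexed lookup per period; for example, a day-by-day report for January 2008 might look like this (a sketch):

SELECT part, activity
FROM aggregates
WHERE type = 'day'
  AND part BETWEEN '2008-01-01' AND '2008-01-31';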
table of events : id, activity id, datetime, userid.
table of users : id, username etc
table of activities : id, activity name, etc
Just enter a new row into events when an event happens. Then you can analyse the events by manipulating time, date, user, activity, etc.
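As a rough MySQL sketch of those three tables (names and types are my own assumptions):

CREATE TABLE users      (id INT AUTO_INCREMENT PRIMARY KEY, username VARCHAR(50));
CREATE TABLE activities (id INT AUTO_INCREMENT PRIMARY KEY, activity_name VARCHAR(50));
CREATE TABLE events (
  id          INT AUTO_INCREMENT PRIMARY KEY,
  activity_id INT NOT NULL,
  event_time  DATETIME NOT NULL,
  user_id     INT NOT NULL
);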
To start with, you would probably imagine a single table, as this would be the most normalized form. The table would simply have an entry for each hit you receive, with each row containing the date/time of that hit.
Now, this way, in order to get statistics for each hour, day, week etc, the queries are simple but your database will have to do some pretty heavy query work. In particular, queries that do sums, counts or averages will need to fetch all the relevant rows.
You could get around this by precalculating the required counts in a second table, and making sure you sync that table to the first regularly. Problem is, you will be responsible for keeping that cache in sync yourself.
This would probably involve making a row for each hour. It will still be a lot quicker to do a query for a day, or a month, if you are only fetching a maximum of 24 rows per day.
Your other suggestion was to aggregate it from the start, never storing every single hit as a row. You would probably do that, as before, with a row for each hour. Every hit would increment the relevant hour's row by one. You would only have the data in one location, and it would already be pretty well summarised.
The reason I suggest by hour instead of by day, is that this still gives you the option to support multiple time zones. If your granularity is only to the day, you don't have that option.
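A minimal sketch of that per-hour aggregation in MySQL (table and column names are assumptions):

CREATE TABLE hits_per_hour (
  hour_start DATETIME NOT NULL PRIMARY KEY,
  hit_count  INT NOT NULL DEFAULT 0
);

-- on every hit:
INSERT INTO hits_per_hour (hour_start, hit_count)
VALUES (DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00'), 1)
ON DUPLICATE KEY UPDATE hit_count = hit_count + 1;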
Tony Andrews' answer is the simplest; however, a snowflake structure is sometimes used in data warehouse applications: a table that counts all the activities, another for activities per day, another for activities per month, and another for activities per year.
With this kind of structure, the activity between any two dates can be computed very efficiently.
https://en.wikipedia.org/wiki/Snowflake_schema
Use a star schema design (or perhaps a snowflake design).
Star-Schema Design
You will end up doing an insert into a fact table for each new activity. See Tony's suggestion.
You will need at least two dimension tables, one for users and one for time frames. There will probably be dimensions for activity type, and maybe even for location. It depends on what you want to do with the data.
Your question relates to the time frames dimension table. Let's call it "Almanac". Choose a granularity. Let's say the day. The almanac will have one row per day. The primary key can be the date. Your fact table should include this primary key as a foreign key, in order to make joins easier. (It doesn't matter whether or not you declare it as a foreign key. That only affects referential integrity during your update process.)
Include columns in the Almanac for each reporting period you can think of. Week, Month, Quarter, Year, etc. You can even include reporting periods that relate to your company's own calendar.
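A minimal sketch of such an Almanac table and a report that uses it (the fact table name and its columns are my own assumptions):

CREATE TABLE almanac (
  day_date DATE PRIMARY KEY,
  week     INT,
  month    INT,
  quarter  INT,
  year     INT
);

SELECT a.year, a.quarter, COUNT(*) AS activity_count
FROM activity_fact f
JOIN almanac a ON a.day_date = DATE(f.activity_date)
GROUP BY a.year, a.quarter;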
Here's an article comparing ER and DM. I'm unusual in that I like both methods, choosing the appropriate method for the appropriate task.
http://www.dbmsmag.com/9510d05.html
As an example, let's say I want to save the lines of code I've written on specific days of different months in a MySQL database using Hibernate. On the 5th of January, I write 100 lines of code. I have not previously written any lines of code on this day, so it simply inserts the data.
A year passes and it's again the 5th of January. I write 50 lines of code this day, and I try to insert it into the database. The ID of day 5 and month 1 already exists, so it can't insert the data. In this case I would like it to simply add the lines of code to the already existing row, so it now says:
DAY | MONTH | LOC
  5 |     1 | 150
What I'm currently doing is that every time I want to insert new data, I'm first making a SELECT query to check if the row already exists. If it does I create an UPDATE statement, otherwise I just create an INSERT statement. I'm working with a lot of data and have noticed that the operations involving my database are starting to take a long time. Therefore, I was wondering if there is a better, more efficient way to do this? Help is much appreciated. Thank you.
You seem to be looking for the insert ... on duplicate key update syntax. This would look like:
insert into mytable(day, month, loc)
values(:day, :month, :loc)
on duplicate key update loc = loc + values(loc)
:day, :month and :loc represent the parameters that you give for insert.
For this to work, you need a unique constraint on (day, month) (or make that tuple of columns the primary key of your table).
When an insert occurs that would violate the unique constraint, MySQL goes to the on duplicate key clause, where we add the new loc value to the one already there in the existing row.
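The unique constraint mentioned above could be declared like this (a sketch, assuming the table really is called mytable):

ALTER TABLE mytable ADD UNIQUE KEY uq_day_month (`day`, `month`);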
I was wondering if there is any way to solve this.
So my row has a column of type date which increments by 1 day daily (until the end of the respective month). At the beginning of a new month a new row has to be generated and the update will start again until the end of that month, and so on.
Here's a way to think about the problem in MySQL's dialect of SQL.
First, you need a function that changes a datestamp into a value that's unique for each month. That is LAST_DAY(datestamp). It generates date values like 2017-09-30 from arbitrary date or datetime inputs.
Next, you can exploit MySQL's INSERT ... ON DUPLICATE KEY UPDATE capability. You will create a table months with, say, these columns
month_ending DATETIME
category VARCHAR(20)
sum_of_input INT
Then you make month_ending, category into a unique compound index.
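In DDL, that table and the unique index might look like this (a sketch following the columns listed above):

CREATE TABLE months (
  month_ending DATETIME NOT NULL,
  category     VARCHAR(20) NOT NULL,
  sum_of_input INT NOT NULL DEFAULT 0,
  UNIQUE KEY uq_month_category (month_ending, category)
);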
Then you do something like this
INSERT INTO months
       (month_ending, category, sum_of_input)
VALUES (LAST_DAY(?date), ?category, ?value)
ON DUPLICATE KEY UPDATE
       sum_of_input = sum_of_input + VALUES(sum_of_input);
However, this has the hallmarks of a big, hard-to-debug pain in the neck. It may make more sense to use features inside your ETL system to do this summarizing work.
I need to set up a table that will have two auto increment fields. One field will be a standard primary key for each record added. The other field will be used to link multiple records together.
Here is an example.
field 1 | field 2
      1 |       1
      2 |       1
      3 |       1
      4 |       2
      5 |       2
      6 |       3
Notice that field 1 is a standard auto increment. Field 2 increments slightly differently: records 1, 2 and 3 were made at the same time, records 4 and 5 were made at the same time, and record 6 was made individually.
Would it be best to read the last entry for field 2 and then increment it by one in my php program? Just looking for the best solution.
You should have two separate tables.
ItemsToBeInserted
id, batch_id, field, field, field
BatchesOfInserts
id, created_time, field, field, field
You would then create a batch record, and add the insert id for that batch to all of the items that are going to be part of the batch.
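In MySQL that flow could look roughly like this (the column names are my own assumptions):

INSERT INTO BatchesOfInserts (created_time) VALUES (NOW());
SET @batch_id = LAST_INSERT_ID();

INSERT INTO ItemsToBeInserted (batch_id, field) VALUES (@batch_id, 'first item');
INSERT INTO ItemsToBeInserted (batch_id, field) VALUES (@batch_id, 'second item');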
You get bonus points if you add a batch_hash field to the batches table and then check that each batch is unique so that you don't accidentally submit the same batch twice.
If you are looking for a more awful way to do it that only uses one table, you could do something like:
$batch = ...; // result of: SELECT MAX(batch_id) + 1 AS new_batch_id FROM myTable
and add that id to all of the inserted records. I wouldn't recommend that though. You will run into trouble down the line.
MySQL only offers one auto-increment column per table. You can't define two, nor does it make sense to do that.
Your question doesn't say what logic you want to use to control the incrementing of the second field you've called auto-increment. Presumably your PHP program will drive that logic.
Don't use PHP to query the largest ID number, then increment it and use it. If you do your system is vulnerable to race conditions. That is, if more than one instance of your PHP program tries that simultaneously, they will occasionally get the same number by mistake.
The Oracle DBMS has an object called a sequence which gives back guaranteed-unique numbers. But you're using MySQL. You can obtain unique numbers with a programming pattern like the following.
First create a table for the sequence. It has an auto-increment field and nothing else.
CREATE TABLE sequence (
  `sequence_id` INT NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`sequence_id`)
);
Then when you need a unique number in your program, issue these three queries one after the other:
INSERT INTO sequence () VALUES ();
DELETE FROM sequence WHERE sequence_id < LAST_INSERT_ID();
SELECT LAST_INSERT_ID() AS sequence;
The third query is guaranteed to return a unique sequence number. This guarantee holds even if you have dozens of different client programs connected to your database. That's the beauty of AUTO_INCREMENT.
The second query (DELETE) keeps the table from getting big and wasting space. We don't care about any rows in the table except for the most recent one.
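If, as in the question above, the same number is meant to be shared by several rows inserted together, you can capture it once and reuse it; a sketch (mytable and its columns are hypothetical):

INSERT INTO sequence () VALUES ();
DELETE FROM sequence WHERE sequence_id < LAST_INSERT_ID();
SET @field2 = LAST_INSERT_ID();

INSERT INTO mytable (field2, other_column) VALUES (@field2, 'row 1 of the group');
INSERT INTO mytable (field2, other_column) VALUES (@field2, 'row 2 of the group');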