In our web app (Java/JDBC), we use MySQL and we need it to store individual payment bills. Each bill must have a unique bill number. Currently, when storing a new bill, the bill number of the new bill is computed via the following SELECT statement:
SELECT COUNT(*) FROM bills;
My question: how do we ensure that no two bills get the same bill number?
AUTO_INCREMENT is the "right" way to achieve your goal.
COUNT(*) won't work if a row is ever deleted — the count drops back, so a later bill will be given a number that is already in use (and two concurrent inserts can read the same count).
Here is a way to make your attempt work:
START TRANSACTION;
SELECT MAX(bill_num)+1 FROM Bills FOR UPDATE;
...
INSERT ... INTO Bills;
COMMIT;
The combination of using a transaction and FOR UPDATE prevents another connection from interfering.
You can do a three-phase insert of a new bill record:
Create a new record, using AUTO_INCREMENT for its id (primary key).
Read the record back to get its id.
With this id you are able to calculate your bill number.
Update the record with the bill number.
But keep in mind to always set a unique constraint on the bill number column! A sketch of these steps follows below.
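A minimal SQL sketch of those steps (the column names id, amount and bill_num, and the bill number format, are assumptions for illustration; from JDBC you would typically use Statement.getGeneratedKeys() instead of a separate SELECT):
-- 1. Insert the record; MySQL assigns the AUTO_INCREMENT primary key.
INSERT INTO bills (amount) VALUES (100.00);
-- 2. Read the generated id back.
SELECT LAST_INSERT_ID();
-- 3. Derive the bill number from the id (format is made up) and update the row.
UPDATE bills
SET bill_num = CONCAT('B-', LPAD(LAST_INSERT_ID(), 8, '0'))
WHERE id = LAST_INSERT_ID();
A UNIQUE constraint on bill_num then guarantees the database rejects any accidental duplicate.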
If this is not an option, you have to develop your own sequence generator to get unique numbers. From MySQL 5 on, you can implement this as a stored procedure/function in the database.
Many suggestions here are spot on.
AUTO_INCREMENT works. However, if you plan to use it for public-facing data, I recommend against it because it's simple to guess the next bill number; just increment bill number by 1. Potentially a security issue.
COUNT(*) doesn't work. When you have a lot of records, performance will suffer, and if there are deletions, then COUNT(*) may actually go backwards and you'll get the same bill number.
Without knowing the requirements of the bill number, I would go with the UUID route that P.Salmon recommended. That way, not only can no one guess the next bill number, it is also quite easy to do. Inserting into the bills table would be something like:
INSERT INTO bills VALUES (UUID(), ...)
or if you have AUTO_INCREMENT field, you may want to do something like
START TRANSACTION;
INSERT INTO bills VALUES (NULL, UUID(), ...);
SELECT LAST_INSERT_ID(); -- To get the last inserted id
COMMIT;
If a UUID string doesn't work for you, you may want to use an alternative scheme, like appending your customer's number and/or generating numbers the way credit card / gift card numbers are generated.
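Purely as an illustration of such a scheme (this assumes a hypothetical bills table with id INT AUTO_INCREMENT PRIMARY KEY, customer_id INT and bill_number VARCHAR(20); a check digit could be appended as well):
INSERT INTO bills (customer_id) VALUES (42);
UPDATE bills
SET bill_number = CONCAT(LPAD(customer_id, 6, '0'), LPAD(id, 10, '0'))
WHERE id = LAST_INSERT_ID();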
If bill number is not sensitive information, then the simplest way is to just use AUTO_INCREMENT as your bill number.
If you want to create a unique number for a bill without using auto-increment etc., one way is to derive the bill id from a timestamp using the Java function System.currentTimeMillis(); it will always be unique unless your bills are generated at a rate of more than one per millisecond ;)
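If you would rather generate a comparable millisecond value on the database side (an alternative sketch, not part of the original suggestion; it assumes MySQL 5.6+ where NOW(3) and fractional UNIX_TIMESTAMP are available):
-- Milliseconds since the Unix epoch, generated by MySQL itself.
SELECT CAST(UNIX_TIMESTAMP(NOW(3)) * 1000 AS UNSIGNED) AS bill_id;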
SELECT COUNT(*) can compute the same bill number for two bills. If you want to prevent that, set a unique index on the bill number column.
COUNT(arg) itself is a MySQL function. A row is counted only if the argument in parentheses (a column or the entire row) is not NULL; otherwise the row is not counted. For details, see the "Evaluate_join_record and column is empty" section.
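A small illustration of that NULL behaviour (the table and column are made up for the example):
CREATE TABLE demo (amount INT);
INSERT INTO demo VALUES (10), (NULL), (30);
SELECT COUNT(*) FROM demo;       -- 3: every row is counted
SELECT COUNT(amount) FROM demo;  -- 2: the row with a NULL amount is not counted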
I have to collect statistics by day, week, month and year of user activity for a site. I am at the DB design stage and I want to do this stage properly since it will make my coding life easier.
What I have to do is simply increment the values in the fields by 1 in the DB each time an activity happens. Then I can pull up the data by each day, each week, each month and year. How should my DB be structured? Apologies if this is a simple question for most. It would also be great if this structure could be extendable so that it can be broken down by other categories.
The bit I am having trouble with is that each month is made up of a different number of days, and these change each calendar year.
Thanks all for any help or direction.
Other info: Linux Machine, making use of PHP and MySQL
Instead of updating counts per day, week etc. just INSERT a row into a table each time an activity happens like this:
insert into activities (activity_date, activity_info)
values (CURRENT_TIMESTAMP, 'whatever');
Now your reports are very simple like:
select count(*) from activities
where activity_date between '2008-01-01' and '2008-01-07';
or
select YEARWEEK(`activity_date`) as theweek, count(*)
from activities
group by theweek
You may just add records into the table and SELECT them using aggregate functions.
If for some reason you need to keep aggregated statistics, you may use:
CREATE TABLE aggregates (type VARCHAR(20), part VARCHAR(10) NOT NULL PRIMARY KEY, activity INT);
INSERT INTO aggregates (type, part, activity)
VALUES ('year', SUBSTRING(SYSDATE(), 1, 4), 1)
ON DUPLICATE KEY UPDATE activity = activity + 1;
INSERT INTO aggregates (type, part, activity)
VALUES ('month', SUBSTRING(SYSDATE(), 1, 7), 1)
ON DUPLICATE KEY UPDATE activity = activity + 1;
INSERT INTO aggregates (type, part, activity)
VALUES ('day', SUBSTRING(SYSDATE(), 1, 10), 1)
ON DUPLICATE KEY UPDATE activity = activity + 1;
This will automatically update existing rows and insert non-existing when needed.
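Reading the aggregates back is then just a matter of filtering on type and part, for example (using the table above):
SELECT part, activity FROM aggregates WHERE type = 'day' AND part LIKE '2008-01-%';  -- daily counts for January 2008
SELECT activity FROM aggregates WHERE type = 'year' AND part = '2008';               -- total for 2008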
table of events : id, activity id, datetime, userid.
table of users : id, username etc
table of activities : id, activity name, etc
Just enter a new row into events when an event happens. Then you can analyse the events by manipulating time, date, user, activity, etc. (a sketch of these tables is below).
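A minimal sketch of that layout (column names and types are assumptions, adjust to your needs):
CREATE TABLE users (id INT AUTO_INCREMENT PRIMARY KEY, username VARCHAR(50));
CREATE TABLE activities (id INT AUTO_INCREMENT PRIMARY KEY, activity_name VARCHAR(50));
CREATE TABLE events (
  id INT AUTO_INCREMENT PRIMARY KEY,
  activity_id INT,
  event_time DATETIME,
  user_id INT
);
-- Reports then group by DATE(event_time), YEARWEEK(event_time), MONTH(event_time), etc.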
To start with, you would probably imagine a single table, as this would be the most normalized form. The table would simply have an entry for each hit you receive, with each row containing the date/time of that hit.
Now, this way, in order to get statistics for each hour, day, week etc, the queries are simple but your database will have to do some pretty heavy query work. In particular, queries that do sums, counts or averages will need to fetch all the relevant rows.
You could get around this by precalculating the required counts in a second table, and making sure you sync that table to the first regularly. Problem is, you will be responsible for keeping that cache in sync yourself.
This would probably involve making a row for each hour. It will still be a lot quicker to do a query for a day, or a month, if you are only fetching a maximum of 24 rows per day.
Your other suggestion was to aggregate it from the start, never storing every single hit as a row. You would probably do that, as before, with a row for each hour. Every hit would increment the relevant hour's row by one. You would only have the data in one location, and it would already be pretty well summarised.
The reason I suggest by hour instead of by day, is that this still gives you the option to support multiple time zones. If your granularity is only to the day, you don't have that option.
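A minimal sketch of the per-hour aggregate approach (the table and column names are assumptions):
CREATE TABLE hourly_hits (
  hour_start DATETIME PRIMARY KEY,
  hits INT NOT NULL DEFAULT 0
);
-- On every hit, bump the row for the current hour, creating it if it does not exist yet.
INSERT INTO hourly_hits (hour_start, hits)
VALUES (DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00'), 1)
ON DUPLICATE KEY UPDATE hits = hits + 1;
-- A day is then a sum over at most 24 rows.
SELECT SUM(hits) FROM hourly_hits
WHERE hour_start >= '2008-01-01 00:00:00' AND hour_start < '2008-01-02 00:00:00';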
Tony Andrews' answer is the simplest, however a snowflake structure is sometimes used in data warehouse applications: a table that counts all the activities, another for activities per day, another for activities per month, and a third for activities per year.
With this kind of structure, the activity between any two dates can be computed very efficiently.
https://en.wikipedia.org/wiki/Snowflake_schema
Use a star schema design (or perhaps a snowflake design).
Star-Schema Design
You will end up doing an insert into a fact table for each new activity. See Tony's suggestion.
You will need at least two dimension tables, one for users and one for time frames. There will probably be dimensions for activity type, and maybe even for location. It depends on what you want to do with the data.
Your question relates to the time frames dimension table. Let's call it "Almanac". Choose a granularity. Let's say the day. The almanac will have one row per day. The primary key can be the date. Your fact table should include this primary key as a foreign key, in order to make joins easier. (It doesn't matter whether or not you declare it as a foreign key. That only affects referential integrity during your update process.)
Include columns in the Almanac for each reporting period you can think of. Week, Month, Quarter, Year, etc. You can even include reporting periods that relate to your company's own calendar.
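A minimal sketch of such an almanac table and its join to a fact table (the names, period columns and report are just examples):
CREATE TABLE almanac (
  cal_date    DATE PRIMARY KEY,
  cal_week    INT,      -- e.g. YEARWEEK(cal_date)
  cal_month   CHAR(7),  -- e.g. '2008-01'
  cal_quarter CHAR(6),  -- e.g. '2008Q1'
  cal_year    INT
);
CREATE TABLE activity_fact (
  user_id  INT,
  cal_date DATE,        -- references almanac.cal_date
  hits     INT
);
-- Monthly report: join the fact table to the almanac and group by the period column.
SELECT a.cal_month, SUM(f.hits) AS total_hits
FROM activity_fact f
JOIN almanac a ON a.cal_date = f.cal_date
GROUP BY a.cal_month;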
Here's an article comparing ER and DM. I'm unusual in that I like both methods, choosing the appropriate method for the appropriate task.
http://www.dbmsmag.com/9510d05.html
I have a table 'user_plays_track' that keeps track of how many times a user has 'played' a track.
I use the following query to either insert a new track a user has played, or update the number of times an existing track has been played:
INSERT INTO user_plays_track
(user_id, track_id) VALUES (x, y)
ON DUPLICATE KEY UPDATE play_count = play_count + 1
Here is the structure of my table:
user_id track_id play_count
1 5 2
4 2 1
3 5 7
From this information, I can infer things such as the total number of times a track has been played, or the total number of plays an artist has had, by finding the sum of all the track counts.
With a thousand or so records this would soon become messy and the semantics unclear. What I wish to do is use triggers to produce what could be described as a cache.
For example, when a record is updated or inserted into 'user_plays_track', the 'tracks' table will increment its play_count column, indicating the total number of plays from all users for that track.
track_id artist_id track_name play_count
2 1 Hey 1
5 1 Test 9
Furthering this, another trigger should be applied to infer new knowledge, such as the total number of artist plays. This would again be triggered when a track's play count changes; it would find the artist_id the track belongs to and update the 'artist' table accordingly.
artist_id artist_name play_count
1 Bob 10
How would I go about implementing the relevant triggers to provide incrementing totals when a user 'plays' a track?
The more you want to calculate at query time, the more you want views, calculated columns and stored or user routines. The more you want to calculate at normalized base update time, the more you want cascades and triggers. The more you want to calculate at some other (scheduled or ad hoc) time, the more you use snapshots aka materialized views and updated denormalized bases. You can combine these. Any time the database is accessed it can be enabled by and restricted by stored routines or other api.
Until you can show that they are inadequate, views and calculated columns are the simplest.
The whole idea of a DBMS is to store a representation of your application state as the database (which normalization reduces the redundancy of) and then you query and let the DBMS implement and optimize calculation of the answer. You haven't presented a reason for not doing that in the most straightforward way possible.
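As a concrete illustration of the view-based route (a minimal sketch; the base tables follow the question, the view names are made up):
CREATE VIEW track_play_counts AS
SELECT track_id, SUM(play_count) AS play_count
FROM user_plays_track
GROUP BY track_id;

CREATE VIEW artist_play_counts AS
SELECT t.artist_id, SUM(u.play_count) AS play_count
FROM user_plays_track u
JOIN tracks t ON t.track_id = u.track_id
GROUP BY t.artist_id;
-- The totals are always in sync with user_plays_track, with no triggers to maintain.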
I have a large table containing hourly statistical data broken down across a number of dimensions. It's now large enough that I need to start aggregating the data to make queries faster. The table looks something like:
customer INT
campaign INT
start_time TIMESTAMP
end_time TIMESTAMP
time_period ENUM('hour', 'day', 'week')
clicks INT
I was thinking that I could, for example, insert a row into the table where campaign is null, and the clicks value would be the sum of all clicks for that customer and time period. Similarly, I could set the time period to "day" and this would be the sum of all of the hours in that day.
I'm sure this is a fairly common thing to do, so I'm wondering what the best way to achieve this in MySql? I'm assuming an INSERT INTO combined with a SELECT statement (like with a materialized view) - however since new data is constantly being added to this table, how do I avoid re-calculating aggregate data that I've previously calculated?
I have done something similar, and here are the problems I had to deal with:
You can use FLOOR(UNIX_TIMESTAMP(start_time)/86400)*86400 in the GROUP BY part to get a summary of all entries from the same day (a week works almost the same way).
The SQL will look like:
insert into the_table
  select
    customer,
    NULL,
    from_unixtime(floor(unix_timestamp(start_time)/86400)*86400),
    from_unixtime(floor(unix_timestamp(start_time)/86400)*86400 + 86400),
    'day',
    sum(clicks)
  from the_table
  where time_period = 'hour' and start_time between <A> and <B>
  group by customer, floor(unix_timestamp(start_time)/86400);

delete from the_table
where time_period = 'hour' and start_time between <A> and <B>;
If you are going to insert a summary from the same table into itself, MySQL will use a temporary table (which means part of the data is copied aside and then dropped, for each such statement). So you must be very careful with the indexes and the size of the data returned by the inner select.
When you are constantly inserting and deleting rows, you will get fragmentation issues sooner or later, and they will slow you down dramatically. The solution is to use partitioning and drop old partitions from time to time. Alternatively, you can run an OPTIMIZE TABLE statement, but it will block your work for a relatively long time (possibly minutes).
To avoid a mess with duplicate data, you may want to clone the table for each aggregation period (hour_table, day_table, ...).
If you're trying to make the table smaller, you'll be deleting the detailed rows after you make the summary row, right? Transactions are your friend. Start one, compute the rollup, insert the rollup, delete the detailed rows, end the transaction.
If you happen to add more rows for an older time period (who does that??), you can run the rollup again - it will combine your previous rollup entry with your extra data into a new, more powerful, rollup entry.