I am designing a system that needs to store different types of lottery data (results + tickets).
Currently focusing on US Mega Millions and Singapore Pool Toto. They both have a similar format.
Mega Millions: Five different numbers from 1 to 56 and one number from 1 to 46.
Toto: 6 numbers from 1 to 45
I need to come up with an elegant database design to store the user tickets and corresponding results.
I thought of two ways to go about it.
Just store the six numbers in six columns.
OR
Create another table (many-to-many) which has ball_number and ticket_id
I need to store the ball-numbers for the results as well.
For Toto, if your numbers match 4 or more winning numbers, you win a prize.
For Mega millions there is a similar process.
I'm looking for the pros and cons or possibly a better solution?
I have done a lot of research and paper work, but I am still confused which way to go about it.
Two tables
tickets
ball_number
ticket_id
player
player_id
ticket_id
// optional
results
ball_number
lottery_id
With two tables you could use a query like:
select ticket_id, count(ball_number) as hits
from tickets
where ball_number in (wn1, wn2, ...) -- wn: winning number
group by ticket_id
having hits = x
Of course you could take winning numbers from lottery results table (or store them in the balls_table under special ticket numbers).
Also preparing statistics would be easier. With
select ball_number, count(ticket_id) as picks
from tickets
group by ball_number
you could easily see which numbers are picked most often.
You might also use some field like lottery number to be able to narrow down the queries as most of them would concern just one lottery.
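The hit-counting query above can be tried out end to end. This is only a sketch: it uses Python's built-in sqlite3 module in place of MySQL, and the lottery_id column, the sample tickets and the helper name tickets_with_hits are assumptions made for illustration.

```python
import sqlite3

# In-memory database standing in for MySQL; one row per picked number.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tickets (ticket_id INTEGER, lottery_id INTEGER, ball_number INTEGER)"
)

# Hypothetical sample tickets for lottery 1.
picks = {
    1: [3, 8, 21, 24, 27, 44],
    2: [3, 8, 21, 24, 30, 41],
    3: [1, 2, 5, 9, 10, 11],
}
conn.executemany(
    "INSERT INTO tickets VALUES (?, 1, ?)",
    [(ticket_id, n) for ticket_id, nums in picks.items() for n in nums],
)


def tickets_with_hits(conn, lottery_id, winning, min_hits):
    """Return (ticket_id, hits) for tickets matching at least min_hits numbers."""
    placeholders = ",".join("?" * len(winning))
    sql = f"""
        SELECT ticket_id, COUNT(ball_number) AS hits
        FROM tickets
        WHERE lottery_id = ? AND ball_number IN ({placeholders})
        GROUP BY ticket_id
        HAVING COUNT(ball_number) >= ?
        ORDER BY ticket_id
    """
    return conn.execute(sql, [lottery_id, *winning, min_hits]).fetchall()


winners = tickets_with_hits(conn, 1, [3, 8, 21, 24, 27, 44], 4)
print(winners)
```

With the sample data, ticket 1 matches all six numbers and ticket 2 matches four, so both are returned; ticket 3 matches none and is filtered out by the HAVING clause.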
One table
Using one table with a column for each number might make the queries much more complex, especially since, as I believe, the numbers are sorted and there are prizes for hitting all but one (or two) numbers. Then you might have to compare 1, 2, 3, ... with 2, 3, 4, ..., which is not as short or straightforward as the queries above.
One column
Storing all entries in a string in just one column violates all normalization practices, forces you to split the column for most of the queries and takes away all optimization carried out by the database. Also storing numbers requires less disk space than storing text.
Since this is a once a day thing, I think I'd store the data in an easy to edit, maintain, visualize way. Your many-many approach would work. Mainly, I'd want it easy to find users that chose a particular ball_number.
users
id
name
drawings
id
type # Mega Millions or Singapore (maybe subclass Drawing)
drawing_on
winning_picks
drawing_id
ball_number
ticket
drawing_id
user_id
correct_count
picks
id
ticket_id
ball_number
Once you get the numbers in, find all user_ids that pick a particular number in a drawing
Get the drawing by date
drawing = Drawing.find_by_drawing_on(drawing_date)
Get the users by ball_number and drawing.
picked_1 = User.picked(1,drawing)
picked_2 = User.picked(2,drawing)
picked_3 = User.picked(3,drawing)
This is a scope on User
class User < ActiveRecord::Base
  def self.picked(ball_number, drawing)
    joins(:tickets => :picks).where(:picks => {:ball_number => ball_number}, :tickets => {:drawing_id => drawing.id})
  end
end
Then do quick array intersections to get the user_ids that got 3, 4, 5 or 6 picks correct. You'd loop through the winning numbers to get the combinations.
For example if the winning numbers were 3,8,21,24,27,44
some_3_correct_winner_ids = picked_3 & picked_8 & picked_21 # Array intersection
For each winner - update the ticket with correct count.
I may potentially store winners separately, but with an index on correct_count, and not too much data in tickets, this would probably be ok for now.
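The intersection step can be sketched outside Rails as well. The following is a rough Python illustration, not the code from this answer: the picked dictionary is a hypothetical stand-in for the results of User.picked(n, drawing), and it assumes one ticket per user per drawing.

```python
from itertools import combinations

# Hypothetical stand-in for User.picked(n, drawing): winning number -> user ids.
picked = {
    3: {1, 2, 5},
    8: {1, 2},
    21: {1, 2, 7},
    24: {1},
    27: {1, 9},
    44: {1},
}
winning = [3, 8, 21, 24, 27, 44]


def users_with_at_least(picked, winning, k):
    """Users who picked every number of at least one k-sized subset of winning."""
    result = set()
    for combo in combinations(winning, k):
        # Intersect the user sets for each number in this combination.
        result |= set.intersection(*(picked.get(n, set()) for n in combo))
    return result


print(users_with_at_least(picked, winning, 3))
print(users_with_at_least(picked, winning, 4))
```

Note that this gives "at least k correct"; to get "exactly k", subtract the set returned for k + 1. With 6 winning numbers there are only 20 combinations of 3, so the loop stays cheap.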
I would just concatenate them using a convention and store them in one column.
Something like '10~20~30~40~50~!60'
~ separates numbers
! indicates special number ( powerball, etc)
Have a sql table valued function split the result if you really need to have it in columns.
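If the splitting is done in application code rather than in a SQL function, it could look roughly like this. This is a Python sketch of the convention described above; the function name parse_ticket is made up for illustration.

```python
def parse_ticket(encoded):
    """Split '10~20~30~40~50~!60' into (regular numbers, special number)."""
    numbers = []
    special = None
    for part in encoded.split("~"):
        if part.startswith("!"):
            # '!' marks the special number (powerball, etc.).
            special = int(part[1:])
        else:
            numbers.append(int(part))
    return numbers, special


numbers, special = parse_ticket("10~20~30~40~50~!60")
print(numbers, special)
```

Keep in mind that every query that needs individual numbers has to go through a step like this, which is the cost the other answers point out.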
Firstly, let me say that I'm an Oracle person, not a MySQL person.
Secondly, I'd usually say to go for a normalised design, but I'm tempted here to think of a very unconventional alternative which I'll float out here for comment.
How about you denormalised it to the extent of using one column for all the number choices?
ticket_id integer
nums bit(56)
special_number integer
It would be a pretty compact representation, and you could perhaps use bit-wise operations to find the winners or potential winners.
No idea if this is workable ... open for comments.
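To make the bit-wise idea concrete, here is a small Python sketch in which integers play the role of the bit(56) column. The helper names are assumptions; on the MySQL side, something like BIT_COUNT(nums & winning_nums) could do the same job, though I have not benchmarked it.

```python
def to_mask(numbers):
    """Encode ball numbers as a bitmask: ball n sets bit n - 1."""
    mask = 0
    for n in numbers:
        mask |= 1 << (n - 1)
    return mask


def count_hits(ticket_mask, winning_mask):
    """Count the numbers two masks have in common (popcount of the AND)."""
    return bin(ticket_mask & winning_mask).count("1")


ticket = to_mask([3, 8, 21, 24, 30, 41])
winning = to_mask([3, 8, 21, 24, 27, 44])
hits = count_hits(ticket, winning)
print(hits)
```

The special number stays in its own column and is compared separately, as in the layout above.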
Related
I'm working on the Product Catalog module of an Invoicing application.
When the user creates a new invoice the product name field should be an autocomplete field which shows the most recently used products from the product catalog.
How can I store this "usage recency/frequency" in the database?
I'm thinking about adding a new field recency which would be increased by 1 every time the product was used, and decreased by 1/(count of all products) when another product is used. Then I would use this recency field for ordering, but it doesn't seem to me the best solution.
Can you tell me what the best practice is for this kind of problem?
Solution for the recency calculation:
Create a new column in the products table, named last_used_on for example. Its data type should be TIMESTAMP (the MySQL representation for the Unix-time).
Advantages:
Timestamps contain both date and time parts.
It makes possible VERY precise calculations and comparisons in regard to dates and times.
It lets you format the saved values in the date-time format of your choice.
You can convert from any date-time format into a timestamp.
In regard to your autocomplete fields, it allows you to filter the products list as you wish. For example, to display all products used since [date-time]. Or to fetch all products used between [date-time-1] and [date-time-2]. Or get the products used only on Mondays, at 1:37:12 PM, in the last two years, two months and three days (that's how flexible timestamps are).
Resources:
Unix-Time
The DATE, DATETIME, and TIMESTAMP Types
How should unix timestamps be stored in int columns?
How to convert human date to unix timestamp in Mysql?
Solution for the usage rate calculation:
Well, actually, you are not speaking about a frequency calculation, but about a rate - even though one can argue that frequency is a rate, too.
Frequency implies using the time as the reference unit and it's measured in Hertz (Hz = [1/second]). For example, let's say you want to query how many times a product was used in the last year.
A rate, on the other hand, is a comparison, a relation between two related units. Like for example the exchange rate USD/EUR - they are both currencies. If the comparison takes place between two terms of the same type, then the result is a number without measurement units: a percentage. Like: 50 apples / 273 apples = 0.1832 = 18.32%
That said, I suppose you tried to calculate the usage rate: the number of usages of a product in relation with the number of usages of all products. Like, for a product: usage rate of the product = 17 usages of the product / 112 total usages = 0.1517... = 15.17%. And in the autocomplete you'd want to display the products with a usage rate bigger than a given percentage (like 9% for example).
This is easy to implement. In the products table add a column usages of type int or bigint and simply increment its value each time a product is used. And then, when you want to fetch the most used products, just apply a filter like in this sql statement:
SELECT
id,
name,
(usages*100) / (SELECT sum(usages) as total_usages FROM products) as usage_rate
FROM products
GROUP BY id
HAVING usage_rate > 9
ORDER BY usage_rate DESC;
In the end, recency, frequency and rate are three different things.
Good luck.
To allow for future flexibility, I'd suggest the following additional (*) table to store the entire history of product usage by all users:
Name: product_usage
Columns:
id - internal surrogate auto-incrementing primary key
product_id (int) - foreign key to product identifier
user_id (int) - foreign key to user identifier
timestamp (datetime) - date/time the product was used
This would allow the query to be fine tuned as necessary. E.g. you may decide to only order by past usage for the logged in user. Or perhaps total usage within a particular timeframe would be more relevant. Such a table may also have a dual purpose of auditing - e.g. to report on the most popular or unpopular products amongst all users.
(*) assuming something similar doesn't already exist in your database schema
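As a rough illustration of how such a history table could feed the autocomplete, here is a sketch using Python's sqlite3 module instead of MySQL. The sample rows and the helper name recent_products are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE product_usage (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        product_id INTEGER NOT NULL,
        user_id INTEGER NOT NULL,
        timestamp TEXT NOT NULL
    )
    """
)
# Hypothetical usage history.
conn.executemany(
    "INSERT INTO product_usage (product_id, user_id, timestamp) VALUES (?, ?, ?)",
    [
        (10, 1, "2024-01-01 09:00:00"),
        (20, 1, "2024-01-05 09:00:00"),
        (10, 2, "2024-01-06 09:00:00"),
        (30, 1, "2024-01-03 09:00:00"),
        (20, 1, "2024-01-02 09:00:00"),
    ],
)


def recent_products(conn, user_id, limit=10):
    """Products used by one user, most recently used first, with a use count."""
    return conn.execute(
        """
        SELECT product_id, MAX(timestamp) AS last_used, COUNT(*) AS uses
        FROM product_usage
        WHERE user_id = ?
        GROUP BY product_id
        ORDER BY last_used DESC
        LIMIT ?
        """,
        (user_id, limit),
    ).fetchall()


for row in recent_products(conn, user_id=1):
    print(row)
```

Dropping the WHERE clause gives the same ranking across all users, and ordering by uses instead of last_used switches from recency to frequency.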
Your problem is related to many other web-scale search applications, such as showing spell corrections, related searches, or "trending" topics. You recognized correctly that both recency and frequency are important criteria in determining "popular" suggestions. In practice, it is desirable to compromise between the two: recency alone will suffer from random fluctuations, but you also don't want to use only frequency, since some products might have been purchased a lot in the past while their popularity is declining (or they might have gone out of stock or been replaced by successor models).
A very simple but effective implementation that is typically used in these scenarios is exponential smoothing. First of all, most of the time it suffices to update popularities at fixed intervals (say, once each day). Set a decay parameter α (say, 0.95) that tells you how much yesterday's orders count compared to today's. Similarly, orders from two days ago will be worth α·α ≈ 0.9 times as much as today's, and so on. To choose this parameter, note that the weight decays to one half after log(0.5)/log(α) days (about 14 days for α = 0.95).
The implementation only requires a single additional field per product, orders_decayed. Then, all you have to do is to update this value each night with the total daily orders:
orders_decayed = α * orders_decayed + (1-α) * orders_today.
You can sort your applicable suggestions according to this value.
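The nightly update and the half-life estimate can be written down in a few lines. This is a minimal Python sketch of the formula above; the function names and the sample numbers are only for illustration.

```python
import math

ALPHA = 0.95  # decay parameter; 0.95 is the example value used above


def update_decayed(orders_decayed, orders_today, alpha=ALPHA):
    """One nightly step: alpha * orders_decayed + (1 - alpha) * orders_today."""
    return alpha * orders_decayed + (1 - alpha) * orders_today


def half_life_days(alpha=ALPHA):
    """Days until an old order's weight drops to one half."""
    return math.log(0.5) / math.log(alpha)


# A product that sold 20 units on day one and nothing afterwards.
value = update_decayed(0.0, 20)
for _ in range(14):
    value = update_decayed(value, 0)

print(round(half_life_days(), 2))
print(value)
```

After 14 quiet days the stored value is a little under half of what it was after the first update, which matches the half-life of about 13.5 days for α = 0.95.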
To have an individual user experience, you should not rely on a field in the product table, but rather on the history of the user.
The occurrences of the product in past invoices created by the user would be a good starting point. The advantage is that you don't need to add fields or tables for this functionality. You simply rely on data that is already present anyway.
Since it is an auto-complete field, maybe past usage is not really relevant. Display n search results as the user types. If you feel that results are better if you include recency in the calculation of the order, go with it.
Now, the implementation may differ depending on how and when products should be displayed, and on whether it has to be a user-specific usage frequency or an application-wide (overall) one. But in both cases, I would suggest having a history table, which you can later use for other analysis.
You could design your history table with at least the columns below:
Id | ProductId | LastUsed (timestamp) | UserId
Now you can create a view which queries this table for a specific time range (something like product frequency over the last week, last month or last year) and gives you the most sold products for that range.
The same can be used for a user-specific frequency by adding a condition to filter by UserId.
I'm thinking about adding a new field recency which would be increased
by 1 every time the product was used, and decreased by 1/(count of all
products), when an other product is used. Then use this recency field
for ordering, but it doesn't seem to me the best solution.
Yes, it is not good practice to add a column for this and update it every time. Imagine this product is highly anticipated and people love to buy it. If 1000 or more people request this product at the same time, every request updates the same row. To maintain consistency under concurrency, the database has to lock that row for each update, which is definitely going to hurt your database and application performance. Instead, you can simply insert a new row.
The other possible solution is, you could use your existing invoice table as it will definitely have all product and user specific information and create a view to get frequently used product as I mentioned above.
Please note that this is another option to achieve what you are expecting, but I would personally recommend having a history table instead.
The scenario
When the user creates a new invoice the product name field should be an autocomplete field which shows the most recently used products from the product catalogue.
your suggested solution
How can I store this "usage recency/frequency" in the database?
If it is a web application, don't store it in a Database in your server. Each user has different choices.
Store it in the user's browser as a cookie or in localStorage, because it will improve the user experience.
If you still want to store it in a MySQL table, do the following:
Create a column recency, as described in the question.
Each time the item is used, increase the count by 1, as described in the question.
Don't decrease it when other items get used.
To get the most used item, run this query:
SELECT * FROM products WHERE recency = (SELECT MAX(recency) FROM products);
Side note
Go for the database only if you want to show the most used products independently of the user.
As you aren't certain which measure to choose, and it's rather a user-experience problem, I advise you to offer a number of measures and let the user choose the one he/she prefers. For example, the set of available measures could include most popular product last week, last month, last 3 months, last year, and overall total. For the sake of performance I'd prefer to store those statistics in a separate table which is refreshed by a scheduled job running every 3 hours, for example.
We're working on a web application (Ruby/Rails + Backbone, jQuery, JavaScript) where a user can manage a booklist and drag and drop books to rearrange their order within the list, which has to be persisted.
We have books and a custom collection of books called booklist, for which we have two tables: book and booklist. Since a book could belong to multiple booklists, and a booklist consists of multiple books, they have an m x n relationship, and we have an additional table to store the mapping. Let's say we use this for all purposes. Now when the user wants to re-order the books in her bookshelf, we'd need to store that order.
I can totally see why storing ids in a column is evil, no doubts about it. But what if we have the tables normalized, and for all other cases we'd go through the standard operations?
There are quite a few approaches on storing an additional order column. But still it seems like bad design to store the ids of the books in a booklist in a comma separated list in the booklist table, even assuming that integrity is maintained.
We'd never run into this...
SELECT * FROM users WHERE... OH F#$%CK -
Yes it's bad, you can't order, count, sum (etc) or even do a simple report without depending
on a top level language.
because we'd simply be selecting books based on the booklist id using the join table like the standard approach. (In any case, we're only getting the books as an array as part of the backbone booklist model)
So what if we retrieve the booklist and the books for the booklist, and do the sorting programmatically on the client side (in this case JavaScript), based on the CSV column?
It appears to be a simple solution because:
Every time the user reorders a book, we simply store all the ids in this one column freshly again. (A user will have at the most 20 to 30 books in a booklist).
We could of course simply ignore invalid ids, i.e. books that have been deleted after the booklist had been created.
What are the disadvantages of this approach, which seems to be simpler than maintaining the sort order and updating other columns every time an order is changed, or using a float or weightage, etc.?
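For reference, the sorting step described here amounts to something like the following. It is sketched in Python for brevity, although in this setup it would run as JavaScript on the client; the function name order_books and the dictionary shape of a book are assumptions.

```python
def order_books(books, order_csv):
    """Sort books by the id order stored in a CSV string.

    Ids in the CSV that no longer match a book are ignored; books missing
    from the CSV are appended at the end in their original order.
    """
    by_id = {book["id"]: book for book in books}
    ordered = []
    seen = set()
    for token in order_csv.split(","):
        token = token.strip()
        if not token.isdigit():
            continue  # skip empty or malformed entries
        book_id = int(token)
        if book_id in by_id and book_id not in seen:
            ordered.append(by_id[book_id])
            seen.add(book_id)
    ordered.extend(book for book in books if book["id"] not in seen)
    return ordered


books = [{"id": 1, "title": "A"}, {"id": 2, "title": "B"}, {"id": 3, "title": "C"}]
print([b["id"] for b in order_books(books, "3,99,1")])
```

Here id 99 stands for a book that was deleted after the order was saved, and book 2 was added to the list without the CSV being updated.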
As far as I know, it really violates the rules of an RDBMS, which causes many difficulties when applying a JOIN.
Hope it will help you.
I've been developing an application and I've run into a situation where I would like to take a snapshot of the current data.
For example, in this application, users will have varying stats and be able to enter matches. How they place in the matches depends on their stats. When the matches are determined the application will pull all of the user's current stats and determine their points to see who wins.
Now after a match is over I want users to be able to view past matches and the problem arises when I want to display what the participants points were at the time of the match. I would think it would be acceptable to store an array structured like so:
array(
array(username, points),
array(username, points),
etc.
)
Now normalizing the data may be the best practice normally but in this situation:
There can be anywhere between 2 and 25 participants in a match.
The data will never be updated, only read.
I would think having it in an array structure in the database will save me time from having to construct an array in my back-end code.
EDIT: The data is not permanent. Match records will be deleted 7 days after the match has ended.
Can anyone tell me if this solution will cause any problems?
EDIT
I would be saving the data after serializing the array so in my database I would have a table called 'matches' and it would have a column called 'results'.
The rows for this column would contain serialized arrays. So if the array looked as such:
$array["a"] = "Foo";
$array["b"] = "Bar";
$array["c"] = "Baz";
$array["d"] = "Wom";
Then the row in the database would look like this:
a:4:{s:1:"a";s:3:"Foo";s:1:"b";s:3:"Bar";s:1:"c";s:3:"Baz";s:1:"d";s:3:"Wom";}
This solution wouldn't pose any problems in the short term - but say you wanted to eventually add in functionality to show all of the games a user has played in, or their highest scoring games... having this data in an inaccessible-from-sql array would not allow you to have those features.
I'm thinking a table like this would be perfect:
CREATE TABLE game_scores(
id int AUTO_INCREMENT NOT NULL PRIMARY KEY,
game_id int,
user_id int,
final_score int,
KEY(game_id),KEY(user_id)
)
At the end of every game, you'd simply insert a row for every user that was playing that round with their corresponding score and the game id. Later, you'd be able to select all of the scores for a certain game:
SELECT * FROM game_scores WHERE game_id=?
... or show all scores by a certain user:
SELECT * FROM game_scores WHERE user_id=?
etc. Have fun with it!
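To show what the normalized table buys you, here is a small runnable sketch. It uses Python's sqlite3 module rather than MySQL, and the helper names record_match and best_game, as well as the sample scores, are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE game_scores (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        game_id INTEGER,
        user_id INTEGER,
        final_score INTEGER
    )
    """
)
conn.execute("CREATE INDEX idx_game ON game_scores (game_id)")
conn.execute("CREATE INDEX idx_user ON game_scores (user_id)")


def record_match(conn, game_id, scores):
    """Insert one row per participant; scores maps user_id -> final score."""
    conn.executemany(
        "INSERT INTO game_scores (game_id, user_id, final_score) VALUES (?, ?, ?)",
        [(game_id, user_id, score) for user_id, score in scores.items()],
    )


def best_game(conn, user_id):
    """Return (game_id, final_score) of the user's highest-scoring game."""
    return conn.execute(
        "SELECT game_id, final_score FROM game_scores "
        "WHERE user_id = ? ORDER BY final_score DESC LIMIT 1",
        (user_id,),
    ).fetchone()


record_match(conn, 1, {101: 50, 102: 70})
record_match(conn, 2, {101: 90, 103: 40})
print(best_game(conn, 101))
```

A "highest scoring game" lookup like this is a one-line query here, whereas with serialized arrays every match row would have to be loaded and unserialized first.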
If you're really committed to the use cases you've outlined in the question along with the qualification in your comment to Sean Johnson, then I don't see any problems with your approach.
I still might qualify that by suggesting that you normalize the data if you think there's a chance you'll want to be able to mine historical information, but dumping an array into the database as a long lived (relatively speaking) sort of cache might make sense. In other words, store it in both formats, but the main line of the use case you've outlined would just hit the array format, but you'd still have the data in a queryable form if you ever wanted it.
Using MySQL, I have a table of users, a table of matches (updated with the actual result) and a table called users_picks (at first it's always going to be 10 football matches per gameweek per league because there's only one league as of now, but more leagues will come along eventually, and some of them only have 8 matches per gameweek).
In the users_picks table, should I store each 'pick' (by pick I mean both 'hometeam score' and 'awayteam score') in a different row, or have all 10 picks in one single row? Both with a FK for user and gameweek. All picks in one row would mean I had columns with appended numbers like this:
Option 1: [pick_id, user_id, league_id, gameweek_id, match1_hometeam_score, match1_awayteam_score, match2_hometeam_score, match2_awayteam_score ... etc]
and that option doesn't quite fill me with joy, and looks a bit stupid. Especially since there's going to be lots of potential NULLs in the db. The second option would mean eventually millions of rows. But would look like this:
Option 2: [pick_id, user_id, league_id, gameweek_id, match_id, hometeam_score, awayteam_score]
What's the best practice? And would it be a PITA to do all sorts of statistics using the second option? E.g. calculating how many matches a user has hit correctly in a specific round, how many all-time correct hits, etc.
If I'm not making much sense, I'll try to elaborate. I just want my table design to be good from the start, so I won't have a huge headache in a couple of months.
Thanks in advance.
The second choice is much better than the first. This is called database normalisation and makes querying easier, not harder. I would suggest reading the linked article, and the related descriptions of the various "normal forms", and aiming for a 3rd Normal Form data structure as a minimum.
To see the flaw in your first option, imagine if there were to be included later a new league with 11 matches. Or 400.
You should read up about database normalization.
When you have a 1:n relation, like in your case one team having many matches, you would create two tables. One table "teams" and a second table "matches" where each row includes the ID of the team which played the match.
In the same manner you should also have separate tables for users, picks and leagues.
Option two is better, provided you INDEX your table properly, since (as you indicate) it will grow quite large. The pick_id is the primary key, but also create an INDEX on the user_id field, as likely the most common query will be
SELECT * FROM `users_picks` WHERE `user_id`=?;
to get all the picks for a given user.
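As for the statistics asked about in the question, option two keeps them fairly short. The sketch below uses Python's sqlite3 module in place of MySQL; the layout of the matches table and the sample rows are assumptions, since the question does not show them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE matches (
        match_id INTEGER PRIMARY KEY,
        gameweek_id INTEGER,
        hometeam_score INTEGER,
        awayteam_score INTEGER
    );
    CREATE TABLE users_picks (
        pick_id INTEGER PRIMARY KEY,
        user_id INTEGER,
        league_id INTEGER,
        gameweek_id INTEGER,
        match_id INTEGER,
        hometeam_score INTEGER,
        awayteam_score INTEGER
    );
    CREATE INDEX idx_picks_user ON users_picks (user_id);

    -- Hypothetical results and picks for user 7.
    INSERT INTO matches VALUES (1, 1, 2, 1), (2, 1, 0, 0), (3, 2, 3, 2);
    INSERT INTO users_picks VALUES
        (1, 7, 1, 1, 1, 2, 1),
        (2, 7, 1, 1, 2, 1, 0),
        (3, 7, 1, 2, 3, 3, 2);
    """
)


def correct_hits(conn, user_id, gameweek_id=None):
    """Count picks whose score matches the result; all-time if no gameweek."""
    sql = """
        SELECT COUNT(*)
        FROM users_picks p
        JOIN matches m ON m.match_id = p.match_id
        WHERE p.user_id = ?
          AND p.hometeam_score = m.hometeam_score
          AND p.awayteam_score = m.awayteam_score
    """
    params = [user_id]
    if gameweek_id is not None:
        sql += " AND p.gameweek_id = ?"
        params.append(gameweek_id)
    return conn.execute(sql, params).fetchone()[0]


print(correct_hits(conn, 7, gameweek_id=1))
print(correct_hits(conn, 7))
```

With option one, the same count would need one comparison per match column, and the query would have to change whenever a league has a different number of matches.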
I am trying to come up with a database design to hold the "Top 10" results for some calculations that are being done. Basically, when all is said and done, there will be 3 "Top 10" categories, which I am fine with all being separate tables. However, I need to be able to go back later and pull historical data about what was in the Top 10 at certain times, hence the need for a database. A flat file would work, but this has the potential to hold years' worth of data.
Now, it's been a while since I have done anything serious with a database, other than something that had a couple of simple tables, so I am having some issues thinking through this design. If someone could help me with the design of it, I know enough MySQL to get the rest done.
So, in essence, I need to store: A group of 10 names, a % of the total points each name had, the rank they held in the Top 10 and a time associated with that Top 10 (So I can later query for that time)
I would think I need a table for the Top 10 with 11 columns: one for the ID and 10 for the foreign keys to the 'Names' table, which holds every name ever used with a PK, Name, %, and Rank. This seems clunky to me; does anyone else have a suggestion?
edit:The 'Top 10' is associated with a specific set of data for 5-minute intervals, and each interval is completely independent from the previous or future intervals.
I don't recommend your solution, because then if you want to ask the database "How often has Joe been in the top 10," you have to write 10 queries of the form
SELECT Date FROM Top10 WHERE FirstPlace = 'joe'
SELECT Date FROM Top10 WHERE SecondPlace = 'joe'
...
Instead, how about a Rankings table, with fields:
id
Date
Person
Rank
Then if you want the Top 10 list for a certain date, the query is
SELECT * FROM Rankings WHERE Date = ...
and if you want to know someone's historical ranking, the query is
SELECT * FROM Rankings WHERE Person = ...
and if you want to know all the historical leaders, the query is
SELECT * FROM Rankings WHERE Rank = 1
The downside to this is that you might accidentally make two different people 8th place, and your database would allow the anomaly. But I have good news for you -- people might actually tie for 8th place, so you might actually want that to be possible!
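A quick way to try this layout is sketched below with Python's sqlite3 module. The column names snapshot_time and position are substitutions for Date and Rank (as far as I know, RANK became a reserved word in MySQL 8.0, so a different name avoids quoting), and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE rankings (
        id INTEGER PRIMARY KEY,
        snapshot_time TEXT,
        person TEXT,
        position INTEGER
    )
    """
)
# Hypothetical snapshots taken every 5 minutes (only two places shown).
conn.executemany(
    "INSERT INTO rankings (snapshot_time, person, position) VALUES (?, ?, ?)",
    [
        ("2024-01-01 12:00", "joe", 1),
        ("2024-01-01 12:00", "ann", 2),
        ("2024-01-01 12:05", "ann", 1),
        ("2024-01-01 12:05", "joe", 2),
        ("2024-01-01 12:10", "ann", 1),
    ],
)


def times_in_top(conn, person, n=10):
    """How many snapshots had this person at position n or better."""
    return conn.execute(
        "SELECT COUNT(*) FROM rankings WHERE person = ? AND position <= ?",
        (person, n),
    ).fetchone()[0]


print(times_in_top(conn, "joe"))
print(times_in_top(conn, "ann", n=1))
```

The "how often has Joe been in the top 10" question becomes a single query, instead of one query per place column.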
I assume that your "Top 10" is snapshot data taken at a certain time, and that your business logic takes it every 5 minutes, so the time is the parent entity for the table design:
top_10_history
th_id - the primary key
th_time - the time point when taking the snapshot data of "Top 10"
top_10_detail
td_th_id - the FK to top_10_history
td_name_id - the FK to name
td_percentage - the "%"
td_rank - the rank
If the sequence of "Top 10" could be calculated from columns in "top_10_detail", you don't need a column to keep the sequence of it. Otherwise, you need a column to persist the sequence for it.
If you need more complicated queries such as "the top 10 at 12:00 AM in the last 30 days", using individual columns for "day", "hour", and "minute" would be a better idea for performance (with suitable indexes).