I'm creating a database to store events. There will be a lot of them, and each will have an associated time that is precise to the second. As an example, something like this:
Event
-----
Timestamp
ActionType (FK)
Source (FK)
Target (FK)
Actions, Sources, and Targets are all in 6NF. I'd like to keep the Event table normalized, but all of the approaches I could think of have problems. To be clear about my expectations for the data, the vast majority (99.9%) of events will be unique with just the above four fields (so I can use the whole row as a PK), but the few exceptions can't be ignored.
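For concreteness, a rough sketch of what I have in mind (PostgreSQL syntax; table and column names are only illustrative stand-ins for the 6NF lookup tables):

CREATE TABLE action_type (action_type_id integer PRIMARY KEY, name text NOT NULL);
CREATE TABLE source      (source_id      integer PRIMARY KEY, name text NOT NULL);
CREATE TABLE target      (target_id      integer PRIMARY KEY, name text NOT NULL);

CREATE TABLE event (
  ts             timestamp NOT NULL,                        -- precise to the second
  action_type_id integer   NOT NULL REFERENCES action_type,
  source_id      integer   NOT NULL REFERENCES source,
  target_id      integer   NOT NULL REFERENCES target,
  PRIMARY KEY (ts, action_type_id, source_id, target_id)    -- the whole row as the key
);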
Use a Surrogate Key: If I use a four-byte integer this is possible, but it seems like just inflating the table for no reason. Additionally I'm concerned about using the database over a long period of time and exhausting the key space.
Add a Count Column to Event: Since I expect small counts I could use a smaller datatype and this would have a smaller effect on database size, but it would require upserts or pooling the data outside the database before insertion. Either of those would add complexity and influence my choice of database software (I was thinking of going with Postgres, which does upserts, but not gladly). A sketch of the upsert approach follows this list.
Break Events into small groups: For example, all events in the same second could be part of a Bundle which could have a surrogate key for the group and another for each event inside it. This adds another layer of abstraction and size to the database. It would be a good idea if otherwise-duplicate events become common, but otherwise seems like overkill.
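For the count-column option, modern PostgreSQL (9.5 and later) supports upserts directly via ON CONFLICT; a rough sketch, assuming the hypothetical table layout above plus an added count column:

ALTER TABLE event ADD COLUMN event_count smallint NOT NULL DEFAULT 1;

INSERT INTO event (ts, action_type_id, source_id, target_id)
VALUES ('2013-02-02 08:00:01', 1, 1, 2)              -- placeholder FK values
ON CONFLICT (ts, action_type_id, source_id, target_id)
DO UPDATE SET event_count = event.event_count + 1;    -- bump the count on a duplicate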
While all of these are doable, they feel like a poor fit for my data. I was thinking of just doing a typical snowflake schema and not enforcing a uniqueness constraint on the main Event table, but after reading PerformanceDBA answers like this one I thought maybe there was a better way.
So, what is the right way to keep time-series data with a small number of repeated events normalized?
Edit: Clarification - the sources for the data are logs, mostly flat files but some in various databases. One goal of this database is to unify them. None of the sources have time resolution more precise than to the second. The data will be used for questions like "How many different Sources executed Action on Target over Interval?" where Interval will not be less than an hour.
The simplest answers seem to be to
store the timestamp with greater precision, or
store the timestamp to the second and retry (with a slightly later timestamp) if the INSERT fails because of a duplicate key; a rough sketch of the retry follows below.
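A sketch of the retry idea in PL/pgSQL, assuming the hypothetical event table sketched in the question (in practice this loop would usually live in application code):

DO $$
DECLARE
  v_ts timestamp := timestamp '2013-02-02 08:00:01';
BEGIN
  LOOP
    BEGIN
      INSERT INTO event (ts, action_type_id, source_id, target_id)
      VALUES (v_ts, 1, 1, 2);                  -- placeholder FK values
      EXIT;                                    -- inserted successfully
    EXCEPTION WHEN unique_violation THEN
      v_ts := v_ts + interval '1 second';      -- duplicate key: nudge the timestamp and retry
    END;
  END LOOP;
END $$;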
None of the three ideas you mention have anything to do with normalization. These are decisions about what to store; at the conceptual level, you normalize after you decide what to store. What the row means (so, what each column means) is significant; these meanings make up the table's predicate. The predicate lets you derive new true facts from older true facts.
Using an integer as a surrogate key, you're unlikely to exhaust the key space. But you still have to declare the natural key, so a surrogate in this case doesn't do anything useful for you.
Adding a "count" colummn makes sense if it makes sense to count things; otherwise it doesn't. Look at these two examples.
Timestamp            ActionType  Source  Target
--
2013-02-02 08:00:01  Wibble      SysA    SysB
2013-02-02 08:00:02  Wibble      SysA    SysB

Timestamp            ActionType  Source  Target  Count
--
2013-02-02 08:00:01  Wibble      SysA    SysB    2
What's the difference in meaning here? The meaning of "Timestamp" is particularly important. Normalization is based on semantics; what you need to do depends on what the data means, not on what the columns are named.
Breaking events into small groups might make sense (like adding a "count" column might make sense) if groups of events have meaning in your system.
I'm storing the information for Magic the Gathering cards in a database; my question is how to store information that is not always present.
E.g., creatures have a power/toughness, while other cards don't. Should I store it for every card and leave it empty for the ones that don't have it? Or create another table to store this information, like:
cardID power resist
How would this affect query times?
From a normalization point of view, power and resistance are functionally dependent on the cardID, and should thus be in the same table as all the other things that depend on the cardID.
If I correctly remember how Magic works, you would have one big table with several columns for all the features of a card: power, resistance, cost (which I would model as a VARCHAR), picture.
Then, since a 1-N (or M-N, I don't remember) relationship exists between effects and cards, you will have an extra table with the effects, which uses the cardID as a foreign key (or maybe a junction table, if it's M-N).
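A rough sketch of that layout (MySQL-style syntax; column names and types are guesses, not from the question):

CREATE TABLE card (
  cardID    INT AUTO_INCREMENT PRIMARY KEY,
  name      VARCHAR(100) NOT NULL,
  cost      VARCHAR(20),              -- modeled as a string, e.g. '2RR'
  power     INT NULL,                 -- NULL for non-creature cards
  resist    INT NULL,                 -- NULL for non-creature cards
  picture   VARCHAR(255)
);

CREATE TABLE effect (
  effectID    INT AUTO_INCREMENT PRIMARY KEY,
  cardID      INT NOT NULL,           -- 1-N; use a junction table instead if it turns out to be M-N
  description TEXT NOT NULL,
  FOREIGN KEY (cardID) REFERENCES card (cardID)
);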
Anyway, I count only 4 fields + ID for the cards, so it isn't much you have to worry about; it's quite little, rather. Even if you wanted to model the cost as multiple INTs, the number of elements is limited (5-6?). A table with 10, or even 20, columns is not a problem for a DBMS.
Maybe you're going to have a worse time concerning the effects (if you're planning to implement them).
I am tackling a problem in class to design a MySQL representation of a website that stores a list of events associated with a person. So, for this table/tables, it would have 2 columns, one of which is the person's name and the other is the event. However, a person will generally have anywhere from 30-1000 events, so this table, which we plan to have for our entire undergraduate class of 6000 students, will have millions of entries. Is there a better way to store this in MySQL that will take less space, but will still be able to retrieve individual events and the list of people that attended it just as easily as if it was a table of two columns?
Yes, there is a technique called many-to-many: it essentially breaks your one table into three, which makes sense when you consider that there are indeed exactly three entities being modeled (a good sanity check):
Person
Event
A Person's association with an Event
You model this as three tables, with the first two having essentially two columns each: one with a unique index (called the "primary key"), and the second being a semantic name (person name, event name). Note that you can also add any number of columns to these tables and the data is still stored only once per person or event (most likely your first move will be to add a date column to the event table).
The third table is the interesting one: it contains only 2 columns, each numeric, both of which are references to the other tables (each row is simply (person_id, event_id)). We term these "foreign keys".
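A minimal sketch of the three tables (MySQL syntax; the table and column names here are hypothetical):

CREATE TABLE person (
  person_id INT AUTO_INCREMENT PRIMARY KEY,
  name      VARCHAR(100) NOT NULL
);

CREATE TABLE event (
  event_id   INT AUTO_INCREMENT PRIMARY KEY,
  event_name VARCHAR(100) NOT NULL
);

CREATE TABLE attendance (
  person_id INT NOT NULL,
  event_id  INT NOT NULL,
  PRIMARY KEY (person_id, event_id),
  FOREIGN KEY (person_id) REFERENCES person (person_id),
  FOREIGN KEY (event_id)  REFERENCES event (event_id)
);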
This structure means a few things:
No matter how many events someone goes to, that someone is only represented once.
Same with events, no matter how many attendees.
The attendance is a "first-class" entity, and can grow to include its own attributes (e.g. "role").
This structure is called many-to-many because each person may attend many events, and each event may have many attendees.
The quintessential feature of the design is that no single piece of domain knowledge is repeated; only "keys" are repeated as necessary to model the real-world domain. (E.g., in your first example, accounting for a name change would require an unknown quantity of updates, and might lead to data anomalies, avoidance of which is a primary concern of database normalization.)
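For example, here is how the two lookups you mention would look against the hypothetical tables sketched above:

-- All events a given person attended
SELECT e.event_name
FROM attendance a
JOIN event e ON e.event_id = a.event_id
WHERE a.person_id = 42;

-- All people who attended a given event
SELECT p.name
FROM attendance a
JOIN person p ON p.person_id = a.person_id
WHERE a.event_id = 7;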
Don't worry about "space". This isn't the 1970s and we're not going to run out of columns on punch cards to store data. You should be concerned with expressing your requirements in the proper, most normalized data structure. With proper indexing there shouldn't be a problem, not with this volume of data.
Remember indexes need to be defined on anything you will include as part of a WHERE clause, and sometimes you may need to add additional indexes for large lists fetched with ORDER BY and LIMIT.
Whenever possible or practical use an integer identifier instead of a string. These are stored as a small number of bytes, typically 4, compared with a variable length string which is typically at least the length of the string in bytes plus 1.
A properly normalized database will use numerical identifiers for things anyway, so this kind of thing isn't a huge concern. The only time you go against this, or deliberately de-normalize your data, is when you have a legitimate performance problem that cannot be easily solved using some other method.
As always, test your schema by generating large amounts of dummy data and see how it performs. Since you have a good idea of the requirements in advance, do some testing at those levels, and then, to be on the safe side, try 2x, 5x and 10x the data to see how much flexibility your design has. It's okay to have performance limitations so long as you know at what kind of scale you'll experience them.
Relational databases such as MySQL were designed specifically to handle this sort of problem. Handling millions of entries is not a problem. Complex queries may take a couple of seconds but will still perform remarkably well.
It is best design to store one event per row. The way you are going about it sounds like the best way. Good luck.
I have read many strong statements here and elsewhere on the subject of storing arrays in MySQL. The rules of normalization seem to suggest it's a bad idea, and searching within the stored array fosters inelegant code. HOWEVER, for the application I am working on it seems like a reasonable solution to store an array in a field. I'm sure that is what everyone wrongly thinks in this position, but I can't figure out a better way. Here is the setup:
I have a series of tables that store registered students, courses they can take and their performance on each course. All are "normalized" to avoid duplication and errors. I want to be able to generate a "myCourses" section so after login the student sees courses they are eligible for and courses they have taken but are free to review. The approach that comes to mind is two arrays; my_eligible_courses and my_completed_courses. On registration, the student is given a set of courses for which they are eligible. This could be stored as rows where there are multiple occurrences of studentid, one for each course they can take:
student1 course 1
student1 course 2
student1 course n
The table could then be queried for all of student 1's eligible courses and displayed as a list when the student logs in.
Alternatively, studentid could be a primary key and in a column "eligible_courses" there would be an array (course 1, course 2, course n).
There is a table for student performance, to record every course taken and metrics associated with student performance. It will be queried to report on student performance, quality of course etc but this table will grow quite large. I'm having a hard time believing that the most efficient way to generate a list of my_completed_courses is to query this table by studentid every time they login just to give them a list of completed courses.
One other complication is that the set of courses a student is eligible for is variable and expanding as new courses are developed, which to me seems to suggest that generating a set of new columns for each new course is a bad idea (for example, new course_name, pretest_score, posttest_score, time_to_complete, ...). Also, a table for each new course seems like a complicated solution for the relatively mundane endpoint of generating a simple set of lists.
So to restate the question, is it better to store "inelegant" arrayed list of eligible and completed courses in a registered student table or dynamically generate these lists?
I'm guessing this is still too vague but any discussion of db design that gives an example of an inelegant array vs a restructured schema would be appreciated.
You should feel confident that if you have indexes on your tables for the appropriate columns, querying for my_completed_courses will be pretty snappy.
When your table grows to the point that you notice slowdown, you can configure your MySQL server with appropriate memory allocation settings so that it can keep more data cached in memory. Or you could look into that now.
In response to the edit you made about adding new courses: Don't add a new column for each course. Don't add a new table for each course. Create a table for courses, and add rows for each course.
You should then be able to join your tables together on indexed columns to generate the list of data you need.
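For instance, a sketch of such a join, assuming hypothetical course and student_course tables with indexed id columns:

SELECT c.course_name
FROM student_course sc
JOIN course c ON c.course_id = sc.course_id
WHERE sc.student_id = 123        -- the logged-in student
  AND sc.completed = 1;          -- or 0 for "eligible but not yet completed"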
This is a bad idea for two obvious reasons:
The DBMS can't enforce proper referential* (and possibly domain) integrity, and relying on application-level integrity is almost always a bad idea.
While the database will be able to answer the query: "based on given student, give me courses", you won't be able to (efficiently) go in the opposite direction, should you ever need to.
* What's to stop a buggy application from storing a non-existent ID in the array? Or deleting a course that is still referenced by students? Even if your application is careful about course deletion, there is no way to do it efficiently - you'll need a full table scan to examine all arrays.
Why are you even trying this? A link (aka. junction) table would solve these problems, for a moderate cost of some additional storage space.
If you are really concerned about storage space, you could even switch the DBMS and use one that supports leading-edge index compression (such as Oracle).
I'm having a hard time believing that the most efficient way to generate a list of my_completed_courses is to query this table by studentid every time they login just to give them a list of completed courses.
Databases are very good at querying humongous amounts of data. In this case, if you use the clustering properly, the DBMS will be able to get this data in very few I/O operations, meaning very fast. Did you perform any actual benchmarks? Have you measured any actual performance problem?
Also, a table for each new course seems like a complicated solution for the relatively mundane endpoint of generating a simple set of lists.
Generating a new table may be justified in case it will have different columns. But, that doesn't sound like what you are trying to do.
It seems to me that you simply need a STUDENT_COURSE table (one row per student per course they are eligible for), with a COMPLETED flag and the performance fields, plus a constraint like:
CHECK (
(COMPLETED = 0 AND (performance fields) IS NULL)
OR (COMPLETED = 1 AND (performance fields) IS NOT NULL)
)
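A concrete sketch of what that table might look like (the performance columns are hypothetical; note that MySQL only enforces CHECK constraints from version 8.0.16 onward):

CREATE TABLE STUDENT_COURSE (
  STUDENT_ID INT NOT NULL,
  COURSE_ID  INT NOT NULL,
  COMPLETED  TINYINT NOT NULL DEFAULT 0,
  SCORE      INT NULL,                    -- hypothetical performance field
  TIME_SPENT INT NULL,                    -- hypothetical performance field
  PRIMARY KEY (STUDENT_ID, COURSE_ID),
  CHECK (
    (COMPLETED = 0 AND SCORE IS NULL AND TIME_SPENT IS NULL)
    OR (COMPLETED = 1 AND SCORE IS NOT NULL AND TIME_SPENT IS NOT NULL)
  )
);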
When a student enrolls into course, insert a row in STUDENT_COURSE, set COMPLETED to 0 and leave performance fields NULL.
When the student completed the course, set COMPLETED to 1 and fill the performance fields.
(BTW, you could even omit COMPLETED altogether and just rely on testing the performance fields for NULL.)
InnoDB tables are clustered, which means that rows in STUDENT_COURSE belonging to the same student are stored physically close together, so getting the courses of a given student is extremely fast.
If you need to go in the opposite direction (get students of a given course), add an index on the same fields but in the opposite order: {COURSE_ID, STUDENT_ID}. You might even consider a covering index in this case.
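For example (the index name is made up):

CREATE INDEX IX_COURSE_STUDENT ON STUDENT_COURSE (COURSE_ID, STUDENT_ID);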
Since we are talking about a small number of rows, leaving COMPLETED unindexed is just fine. If you are really concerned about that, you can even do something like:
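A sketch of one possibility, assuming the STUDENT_COURSE layout above (the exact shape here is an assumption):

CREATE TABLE COMPLETED_STUDENT_COURSE (
  STUDENT_ID INT NOT NULL,
  COURSE_ID  INT NOT NULL,
  PRIMARY KEY (STUDENT_ID, COURSE_ID),
  FOREIGN KEY (STUDENT_ID, COURSE_ID)
    REFERENCES STUDENT_COURSE (STUDENT_ID, COURSE_ID)
);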
The COMPLETED_STUDENT_COURSE is a B-Tree only for completed courses (and essentially a subset of STUDENT_COURSE which is a B-Tree for all enrolled courses).
Here are a few thoughts that I believe may assist you in making a good decision.
Generally, it is a rule to use correctly normalised tables. But there can be exceptions to this; perhaps your project is one of them.
Most of the time, new developers tend to focus on getting the data into a DB. They get stuck when it comes to retrieving it for a specific purpose. So given both cases of arrays vs. relational tables, ask yourself whether either method serves your purpose. For example, if you wanted to list the courses of student X, your array method is just fine. This is because you can retrieve it by the primary key, like a student ID. But if you wanted to know how many students are on course A, the array method would be a horrible way to go.
Then again, the above point would depend on your data volume as well. For example, if you only have about a hundred students, you'll probably not notice a difference in performance. But if you're looking at several thousand records and you have a big list of courses for students, the array approach is not the way to go.
Benchmark. This is the best way for you to find out your answer. You can use MySQL's EXPLAIN or just time it using your program that executes the queries. Try each method with your standard volume of data and see which one works best. For example, in the recent past, MySQL was boasting about the strength of its ISAM engine. Then I had to work on a large application that involved millions of records, and I noticed that each time a new record came in, indexes had to be rebuilt. So we had to bend the rules. Likewise, you'd better do your tests with the correct volumes of data and make a better decision.
But do not take this example as a rule. Rather, go by the standards of normalisation and only bend the rules for exceptions.
I have 2 scenarios for a MySQL DB and I'm not sure which to choose, and I've run into the same dilemma for a few tables.
I'm making a web application only accessed by members. Each member has their own deals, expenses, and say "listings". The criteria for the records is the same across users, but each user can have completely different amounts of records.
My 2 scenarios are whether I should have one table for deals, one table for listings, and one table for expenses, with a field in each that links to the primary key for a particular user; or whether it is better to have a separate deal table, expense table, and listing table for each user (using a combined name like "user"+deals or "user"+exp). Deals can be shared across 1 or 2 users, but expenses and listings are completely independent. I am going to have a master deal table to hold all the info for each deal, plus a user deal table (or tables) that links a user's primary key to a deal's primary key.
So, separate tables or one table? If there are thousands of users with hundreds of deals/expenses/listings..I just don't want the queries to be extremely slow after a lot of deals or expenses have built up...No user will ever need to view anything from other users...strictly just their data.
Also, I'm familiar with how a database works and stores data, but I'm not 100% clear. I just want it to work quickly, so my other question is (although it may be stupid): when a user submits a new deal or expense, is it inserted at the beginning or the end of the table? Or is it irrelevant, because a query will search everything in the table either way before returning information?
Always use one table to store one kind of entity.
Or more specifically, what you're talking about is a nasty, complicated optimisation that works in an incredibly small subset of cases which almost certainly isn't yours.
You want to use just one table for one kind of entity. Index it appropriately, and try to get rid of old records when you don't need them any more.
Also, a lot of people's idea of "big data" isn't actually particularly big. Databases normally need little optimisation while their data still fits in RAM, which on a modern system means, say, 32 GB.
Regarding your second question:
In MySQL (with InnoDB) the order of the records on disk is defined by your PRIMARY KEY, meaning a record does not get inserted at the end or the beginning, but rather wherever it belongs based on the primary key.
In other DBMSs you have the option to use CLUSTERED KEYS in order to use a key other than the PRIMARY to order the records on disk, but this is not supported in MySQL to my knowledge.
Regarding your first question:
I found myself in this position a couple of times, and recently I keep coming back to one blog post (the last of a series; the conclusion is at the bottom):
http://weblogs.asp.net/manavi/archive/2011/01/03/inheritance-mapping-strategies-with-entity-framework-code-first-ctp5-part-3-table-per-concrete-type-tpc-and-choosing-strategy-guidelines.aspx
I quote:
Before we get into this discussion, I want to emphasize that there is no single "best strategy fits all scenarios". As you saw, each of the approaches has its own advantages and drawbacks. Here are some rules of thumb to identify the best strategy in a particular scenario:

If you don't require polymorphic associations or queries, lean toward TPC; in other words, if you never or rarely query for BillingDetails and you have no class that has an association to the BillingDetail base class. I recommend TPC (Table per Concrete Type) (only) for the top level of your class hierarchy, where polymorphism isn't usually required, and when modification of the base class in the future is unlikely.

If you do require polymorphic associations or queries, and subclasses declare relatively few properties (particularly if the main difference between subclasses is in their behavior), lean toward TPH (Table per Hierarchy). Your goal is to minimize the number of nullable columns and to convince yourself (and your DBA) that a denormalized schema won't create problems in the long run.

If you do require polymorphic associations or queries, and subclasses declare many properties (subclasses differ mainly by the data they hold), lean toward TPT (Table per Type). Or, depending on the width and depth of your inheritance hierarchy and the possible cost of joins versus unions, use TPC.

By default, choose TPH only for simple problems. For more complex cases (or when you're overruled by a data modeler insisting on the importance of nullability constraints and normalization), you should consider the TPT strategy. But at that point, ask yourself whether it may not be better to remodel inheritance as delegation in the object model (delegation is a way of making composition as powerful for reuse as inheritance). Complex inheritance is often best avoided for all sorts of reasons unrelated to persistence or ORM. EF acts as a buffer between the domain and relational models, but that doesn't mean you can ignore persistence concerns when designing your classes.
Lately I've been rethinking a database design I made a couple of months ago. The main reason is that last night I read the database schema of vBulletin and saw that they use many, MANY, tables.
The current "idea" I'm using for my schema, for instance my log table, is to keep everything in one table by differencing the type of Log with an integer:
id  type  type_id  action  message
1   1     305      2       'Explanation for user Ban'
2   2     1045     1       'Reason for deletion of Article'
Where type 1 = user, type 2 = article, type_id = the ID of the user, article or whatever, and action 2 = ban, action 1 = deletion.
Should I change the design to two tables logBans, logSomething and so on? or is it better to keep the method I'm currently using?
The issue here is subtyping. There are three basic approaches to dealing with subtypes.
Put each record type into a completely separate table;
Put a record in a parent table and then a record in a subtype table; and
Put all the records in one table, having nullable columns for the "optional" data (ie things that don't apply to that type).
Each strategy has its merits.
For example, (3) is particularly applicable if there is little to no difference between different subtypes. In your case, do different log records have extra columns if they're of a particular type? If they don't, or there are few cases when they do, putting them all in one table makes perfect sense.
(2) is commonly used for a Party table. This is a common model in CRMs that involves a parent Party object which has subtypes for Person and Organization (Organization may also have subtypes like Company, Association, etc.). Person and Organization have different properties (e.g. salutation, given names, date of birth, etc. for Person), so it makes sense to split this up rather than using nullable columns.
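A rough sketch of that parent/subtype layout (MySQL syntax; the table and column names here are illustrative, not from the question):

CREATE TABLE Party (
  party_id   INT AUTO_INCREMENT PRIMARY KEY,
  party_type CHAR(1) NOT NULL              -- e.g. 'P' = Person, 'O' = Organization
);

CREATE TABLE Person (
  party_id      INT PRIMARY KEY,           -- same key as the parent Party row
  salutation    VARCHAR(20),
  given_names   VARCHAR(100),
  date_of_birth DATE,
  FOREIGN KEY (party_id) REFERENCES Party (party_id)
);

CREATE TABLE Organization (
  party_id INT PRIMARY KEY,
  name     VARCHAR(200),
  FOREIGN KEY (party_id) REFERENCES Party (party_id)
);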
(2) is potentially more space efficient (although the overhead of NULL columns in modern DBMSs is very low). The bigger issue is that (2) might be more confusing to developers. You will get a situation where someone needs to store an extra field somewhere and will whack it in a column that's empty for that type simply because it's easier doing that than getting approval for the DBAs to add a column (no, I'm not kidding).
(1) is probably the least frequently used scheme of the 3 in my experience.
Lastly, scalability has to be considered, and this is probably the best case for (1). At a certain point JOINs don't scale effectively and you'll need to use some kind of partitioning scheme to cut down your table sizes. (1) is one method of doing that (but a crude method).
I wouldn't worry too much about that though. You'll typically need to get to hundreds of millions or billions of records before that becomes an issue (unless your records are really really large, in which case it'll happen sooner).
It depends. If you're going to have 1500000000 entries of type 1 and 1000 entries of type 2 and you'll be doing a LOT of queries on type 2, separate the tables. If not, it's more convenient to keep only one table.
Keep in mind scalability:
How many entries of each type will I have in 1 year?
How many requests on this table will I be doing ?
Can you, at some point, clear this log? Can you move it to another table (like archive entries older than X months) ?
The one drawback I see right now is that you cannot enforce foreign key integrity on your type_id since it points to many different tables.
I want to add a small tip. It's a little off topic, and quite basic, but it's a lot clearer to use enum instead of tinyint for status flags, e.g.:
enum('user','article')
If there are only two statuses, tinyint is a little more memory efficient, but less clear. Another disadvantage of enum is that you put a part of the business logic in the data tier: when you need to add or remove statuses, you have to alter the DB. Otherwise enum is much clearer, and I prefer it.
I would keep things as specific as possible - in this case I would create two tables.
Each table has a specific purpose so I cannot see why you would combine them.
I wouldn't do what vBulletin does. The problem with older apps like vBulletin is that while they might have started as lean machines, over time they collect a lot of entropy and end up being bloated. Since there are plugins, third-party tools, and developers who've worked on the old code, breaking it is a tough choice.
That's why there is not much refactoring going on there. Don't make them your programming model. Look around, find out what works best, and use that. A lot of tables sounds like a bad thing to me, not a good one.