How do you store records that don't contain any fact? For example, let's say a shop wants to count how many people have entered the store (and that they take information on every person who goes into the shop). In the warehouse, I guess there would be a dimension table "Person" with various attributes, but what would the fact table look like? Would it contain only foreign keys?
As you described it, that would be just a fact table. Actually, there is a name for this -- a factless fact table: a fact table without any measures.
It is quite common for recording events. Essentially anything that records who, what, where, when and why ends up as a measure-less fact table. If you add "how much?", then that goes into a measure.
You can think of it as the fact table containing an implicit count column, with the number of people entering, which is always "1" if you store data at the individual-person level; that leaves a fact table containing only FKs to the dimensions.
This of course only enables analysis of the number of people entering, filtered by the various dimensions, but that seems like a realistic use case to me. I think you're on the right track.
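A minimal sketch of what such a factless fact table might look like (the dimension tables and all names here are just illustrative assumptions, not part of the original question):

-- Hypothetical factless fact table: one row per "person entered store" event,
-- containing only foreign keys to the dimensions and no measures.
CREATE TABLE fact_store_entry (
    person_id INT NOT NULL,  -- FK to dim_person
    store_id  INT NOT NULL,  -- FK to dim_store
    date_id   INT NOT NULL,  -- FK to dim_date
    FOREIGN KEY (person_id) REFERENCES dim_person (person_id),
    FOREIGN KEY (store_id)  REFERENCES dim_store (store_id),
    FOREIGN KEY (date_id)   REFERENCES dim_date (date_id)
);

-- "How many people entered store 42 on a given day?" becomes a simple COUNT(*):
SELECT COUNT(*)
FROM fact_store_entry
WHERE store_id = 42 AND date_id = 20240101;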
We have a 90GB MySQL database with some very big tables (more than 100M rows). We know this is not the best DB engine but this is not something we can change at this point.
Planning for a serious refactoring (performance and standardization), we are thinking on several approaches on how to restructure our tables.
The data flow / storage is currently done in this way:
We have one table called articles, one connection table called article_authors and one table authors
One single author can have 1..n firstnames, 1..n lastnames, 1..n emails
Every author has a unique parent (unique_author), except if that author is the parent
The possible data query scenarios are as follows:
Get the author firstname, lastname and email for a given article
Get the unique authors.id for an author called John Smith
Get all articles from the author called John Smith
The current DB schema looks like this:
EDIT: The main problem with this structure is that we always duplicate similar given_names and last_names.
We are now hesitating between two different structures:
A large number of tables; the data is split and connected through IDs, with no duplicates in the main tables (articles and authors). We are not sure how this will impact performance, as we would need several joins to retrieve the data, for example:
The data is split among a reasonable number of tables, with duplicate entries in the table article_authors (author firstname, lastname and email alternatives), in order to reduce the number of tables and the application code complexity. One author could have 10 alternatives, so we would have 10 entries for the same author in the article_authors table:
The current schema is probably the best. The middle table is a many-to-many mapping table, correct? That can be made more efficient by following the tips here: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#many_to_many_mapping_table
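For reference, a sketch of such a mapping table following the tips from that page (the column names are assumed from the question; this is not necessarily your exact schema):

-- Many-to-many mapping table: no surrogate id, composite primary key,
-- plus the reverse index so lookups are efficient in both directions.
CREATE TABLE article_authors (
    article_id INT UNSIGNED NOT NULL,
    author_id  INT UNSIGNED NOT NULL,
    PRIMARY KEY (article_id, author_id),   -- "authors of a given article"
    KEY (author_id, article_id)            -- "articles of a given author"
) ENGINE=InnoDB;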
Rewrite #1 smells like "over-normalization". A big waste.
Rewrite #2 has some merit. Let's talk about phone_number instead of last_name because it is rather common for a person to have multiple phone_numbers (home, work, mobile, fax), but unlikely to have multiple names. (Well, OK, there are pseudonyms for some authors).
It is not practical to put a bunch of phone numbers in a cell; it is much better to have a separate table of phone numbers linked back to whoever they belong to. This would be 1:many. (Ignore the case of two people sharing the same phone number -- due to sharing a house, or due to working at the same company. Let the number show up twice.)
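A rough sketch of that 1:many phone_numbers table (table and column names are made up for illustration):

-- One row per phone number, linked back to the person it belongs to.
CREATE TABLE phone_numbers (
    person_id  INT UNSIGNED NOT NULL,   -- FK to the person/author table
    phone_type VARCHAR(10) NOT NULL,    -- 'home', 'work', 'mobile', 'fax'
    phone      VARCHAR(20) NOT NULL,
    KEY (person_id)
) ENGINE=InnoDB;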
I don't see why you want to split firstname and lastname. What is the "firstname" of "J. K. Rowling"? I suggest that it is not useful to split names into first and last.
A single author would have a unique "id". MEDIUMINT UNSIGNED AUTO_INCREMENT is good for such. "J. K. Rowling" and "JK Rowling" can both link to the same id.
More
I think it is very important to have a unique id for each author. The id can be then used for linking to books, etc.
You have pointed out that it is challenging to map different spellings into a single id. I think this should be essentially a separate task with separate table(s). And it is this task that you are asking about.
That is, split the database, and split the tasks in your mind, into:
one set of tables containing stuff to help deduce the correct author_id from the inconsistent information provided from the outside.
one set of tables where author_id is known to be unique.
(It does not matter whether this is one versus two DATABASEs, in the MySQL sense.)
The mental split helps you focus on the two different tasks, plus it avoids some schema constraints and confusion. None of your proposed schemas makes the clean split I am proposing.
Your main question seems to be about the first set of tables -- how to turn strings of text ("JK Rawling") into a specific id. At this point, the question is first about algorithms, and only secondarily about the schema.
That is, the tables should be designed to support the algorithm, not to drive it. Furthermore, when a new provider comes along with some strange new text format, you may need to modify the schema - possibly adding a special table for that provider's data. So, don't worry about making the perfect schema this early in the game; plan on running ALTER TABLE and CREATE TABLE next month or even next year.
If a provider is consistent in spelling, then a table with (provider_id, full_author_name, author_id) is probably a good first cut. But that does not handle variations of spelling, new authors, and new providers. We are getting into gray areas where human intervention will quickly be needed. Even worse is the issue of two authors with the same name.
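A possible first cut at that lookup table, as a sketch (the table name and types are my assumptions; the canonical authors table is presumed to exist separately):

-- Maps the exact author string a given provider sends to the canonical author_id.
CREATE TABLE author_aliases (
    provider_id      INT UNSIGNED NOT NULL,
    full_author_name VARCHAR(255) NOT NULL,
    author_id        MEDIUMINT UNSIGNED NOT NULL,   -- id in the canonical authors table
    PRIMARY KEY (provider_id, full_author_name)
) ENGINE=InnoDB;

-- For a provider that spells consistently, resolving a name is a point lookup:
SELECT author_id
FROM author_aliases
WHERE provider_id = 7 AND full_author_name = 'JK Rawling';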
So, design the algorithm with the assumption that simple data is easily and efficiently available from a database. From that, the schema design will somewhat easily flow.
Another tip here... Some degree of "brute force" is OK for the hard-to-match cases. Most of the time, you can easily map name strings to author_id very efficiently.
It may be easier to fetch a hundred rows from a table, then massage them in your algorithm in your app code. (SQL is rather clumsy for algorithms.)
If you want to reduce size, you could also think about splitting email addresses into two parts: 'jkrowling@' + 'gmail.com'. You could have a table where you store common email domains, but seeing that over-normalization is a concern...
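A rough sketch of that idea, assuming a shared table of common domains (all names here are illustrative, and this is only worthwhile if the space saving actually matters):

CREATE TABLE email_domains (
    domain_id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    domain    VARCHAR(255) NOT NULL UNIQUE     -- e.g. 'gmail.com'
) ENGINE=InnoDB;

CREATE TABLE author_emails (
    author_id  MEDIUMINT UNSIGNED NOT NULL,
    local_part VARCHAR(64) NOT NULL,            -- e.g. 'jkrowling'
    domain_id  SMALLINT UNSIGNED NOT NULL,
    KEY (author_id),
    FOREIGN KEY (domain_id) REFERENCES email_domains (domain_id)
) ENGINE=InnoDB;

-- Reassembling the address:
SELECT CONCAT(e.local_part, '@', d.domain) AS email
FROM author_emails e
JOIN email_domains d USING (domain_id)
WHERE e.author_id = 123;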
I want to implement a vote system for several different entities/tables (e.g. articles, blog posts, users).
What is the best/most efficient approach?
Create a table votes to store all the votes of all entities?
votes
vote_id
user_id
type (articles, blogposts or users)
Create a table votes for each entity? votes_articles, votes_blogposts, votes_users
What I see is:
The first option will result in a bigger table, and there's an additional field which I need to include in my queries. It is a more generic table that can easily be extended for more entities if needed, and everything is kind of centralised. (I can use a generic function to retrieve/insert/update the table.)
The second option will result in smaller tables; faster to query, but not necessarily easier to maintain.
The second method has many advantages. Presumably, the votes are actually on entities, so you also have an id in each table pointing to the article, blogpost, or whatever that is being voted on. In a standard SQL database, you would like to have foreign key references to other tables, and the one-table-per-entity approach provides that capability.
You could modify the first approach to do this. However, that would require a separate column for each possible entity. And, then, you lose the easy flexibility of adding new entities.
When is the first approach advantageous? First, when maintaining valid foreign key references is not important. And second, when you often want to bring together votes as votes: how many times did a user vote today, regardless of what s/he voted on? How many votes do user A and user B have in common, regardless of what they voted on? You get the idea. If votes starts to behave like its own entity, then it deserves its own table.
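As a sketch, the single generic votes table from the first option might look like this (I'm assuming each vote also carries the id of the thing voted on; all names and types are illustrative):

CREATE TABLE votes (
    vote_id     INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    user_id     INT UNSIGNED NOT NULL,
    entity_type ENUM('article', 'blogpost', 'user') NOT NULL,
    entity_id   INT UNSIGNED NOT NULL,    -- cannot be a real foreign key: the target table varies
    created_at  DATETIME NOT NULL,
    KEY (user_id, created_at)
) ENGINE=InnoDB;

-- "How many times did user 42 vote today, regardless of what on?"
SELECT COUNT(*)
FROM votes
WHERE user_id = 42 AND created_at >= CURDATE();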
I happen to think that your very question highlights a major weakness in SQL and relational databases. This is an example of wanting different entities to "inherit" features from a class (to borrow terminology from the OO world). Wouldn't it be nice if you could just specify that a new entity inherits properties from another entity (such as "Votable")? Oh, never mind, that's not the real world of popular databases. At least not today.
EDIT:
If you care about performance, don't go with the modified first approach -- that is, a separate column for each possible entity. Normally, primary keys are 4-byte integers. These (in most databases at least) will occupy four bytes, regardless of whether the column has a NULL value. So, one table with three entity columns is (to a very rough approximation) three times the size of three tables specialized for each entity. Such wasted space only slows down the query processing.
If you are only going to have two or three entities, maybe this isn't that big a deal. But once you get to more than you can count on one hand, it really is a waste of space, memory, and processing power.
I have multiple content types, but they all share some similarities. I'm wondering when it is a problem to use the same table for a different content type? Is it ever a problem? If so, why?
Here's an example: I have five kinds of content, and they all have a title. So, can't I just use a 'title' table for all five content types?
Extending that example: a title is technically a name. People and places have names. Would it be bad to put all of my content titles, people names, and place names in a "name" table? Why separate into place_name, person_name, content_title?
I have different kinds of content. In the database, they seem very similar, but the application uses the content in different ways, producing different outputs. Do I need a new table for each content type because it has a different result with different kinds of dependencies, or should I just allow null values?
I wouldn't do that.
If there are multiple columns that are the same among multiple tables, you should indeed normalize these to 1 table.
An example of that would be several types of users, which all require different columns, but all share some characteristics (e.g. name, address, phone number, email address).
These could be normalized to 1 table, which is then referenced by all other tables through a foreign key. (See http://en.wikipedia.org/wiki/Database_normalization )
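A rough sketch of that idea, with table names and the type-specific columns made up purely for illustration:

-- Shared columns live once in a common table...
CREATE TABLE persons (
    person_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name      VARCHAR(100) NOT NULL,
    address   VARCHAR(255),
    phone     VARCHAR(30),
    email     VARCHAR(255)
) ENGINE=InnoDB;

-- ...and each specialised user type references it through a foreign key.
CREATE TABLE customers (
    customer_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    person_id     INT UNSIGNED NOT NULL,
    loyalty_level TINYINT UNSIGNED,          -- column specific to this user type
    FOREIGN KEY (person_id) REFERENCES persons (person_id)
) ENGINE=InnoDB;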
Your example only shows 1 common column, which is not worth normalizing. It would even reduce performance when fetching your data, because you would need to join 2 tables to get everything, and one of them (the one with the titles) contains a lot of data you don't need, straining the server more.
While normalization is a very good practice to avoid redundancy and ensure consistency, it can sometimes be bad for performance. For example, for a person table with columns like name, address and dob, it is not very good performance-wise to have a picture in the same table. A picture can easily be about 1MB, while the remaining columns may not take more than 1K. Imagine how many blocks of data need to be read, even if you only want to list the name and address of people living in a certain city, if you are keeping everything in the same table.
If there is a variation in the size of the contents, and you might have to retrieve only certain types of content in the same query, the performance gain from storing them in separate tables will easily outweigh the normalization concerns.
To typify data in this way, it's best to use a table (e.g., name) and a sub-table (e.g., name_type), and then use an FK constraint. Use an FK constraint because InnoDB does not support column constraints, and the MyISAM engine is not suited for this (it is much less robust and feature-rich, and should really only be used for performance).
This kind of normalization is fine, but it should be done with a free-format column type, like VARCHAR(40), rather than with ENUM. Use triggers to restrict the input so that it matches the types you want to support.
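A very rough sketch of those two suggestions combined, with all names assumed for illustration (the trigger shows the VARCHAR-plus-restriction idea as an alternative to ENUM; SIGNAL requires MySQL 5.5+):

CREATE TABLE name_type (
    name_type VARCHAR(40) NOT NULL PRIMARY KEY   -- e.g. 'content_title', 'person_name', 'place_name'
) ENGINE=InnoDB;

CREATE TABLE name (
    name_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name      VARCHAR(255) NOT NULL,
    name_type VARCHAR(40)  NOT NULL,
    FOREIGN KEY (name_type) REFERENCES name_type (name_type)
) ENGINE=InnoDB;

-- The ENUM-replacement idea: keep the column a plain VARCHAR(40) and let a
-- trigger restrict the accepted values (redundant with the FK above; shown
-- only to illustrate the technique).
DELIMITER //
CREATE TRIGGER name_before_insert BEFORE INSERT ON name
FOR EACH ROW
BEGIN
    IF NEW.name_type NOT IN ('content_title', 'person_name', 'place_name') THEN
        SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = 'Unsupported name_type';
    END IF;
END//
DELIMITER ;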
Say I have a table like this:
create table users (
    user_id int not null auto_increment primary key,
    username varchar(50),
    joined_at datetime,
    bio text,
    favorite_color varchar(50),
    favorite_band varchar(100),
    ....
);
Say that over time, more and more columns -- like favorite_animal, favorite_city, etc. -- get added to this table.
Eventually, there are like 20 or more columns.
At this point, I'm feeling like I want to move columns to a separate
user_profiles table so I can do select * from users without
returning a large number of usually irrelevant columns (like
favorite_color). And when I do need to query by favorite_color, I can just do
something like this:
select * from users inner join user_profiles using (user_id)
where user_profiles.favorite_color = 'red';
Is moving columns off the main table into an "auxiliary" table a good
idea?
Or is it better to keep all the columns in the users table, and always
be explicit about the columns I want to return? E.g.
select user_id, username, last_logged_in_at, etc. etc. from users;
What performance considerations are involved here?
Don't use an auxiliary table if it's going to contain a collection of miscellaneous fields with no conceptual cohesion.
Do use a separate table if you can come up with a good conceptual grouping of a number of fields e.g. an Address table.
Of course, your application has its own performance and normalisation needs, and you should only apply this advice with proper respect to your own situation.
I would say that the best option is to have properly normalized tables, and also to only ask for the columns you need.
A user profile table might not be a bad idea, if it is structured well to provide data integrity and simple enhancement/modification later. Only you can truly know your requirements.
One thing that no one else has mentioned is that it is often a good idea to have an auxiliary table if the row size of the main table would get too large. Read about the row size limits of your specific database in its documentation. There are often performance benefits to having tables that are less wide, and to moving the fields you don't use as often off to a separate table. If you choose to create an auxiliary table with a one-to-one relationship, make sure to set up the PK/FK relationship to maintain data integrity, and set a unique index or constraint on the FK field to maintain the one-to-one relationship.
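A minimal sketch of such a one-to-one auxiliary table, reusing the column names from the question (making the child's primary key double as the foreign key enforces at most one profile row per user):

CREATE TABLE user_profiles (
    user_id        INT NOT NULL PRIMARY KEY,   -- same value as users.user_id
    bio            TEXT,
    favorite_color VARCHAR(50),
    favorite_band  VARCHAR(100),
    FOREIGN KEY (user_id) REFERENCES users (user_id)
);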
And to go along with everyone else, I cannot stress strongly enough how bad it is to ever use select * in production queries. You save a few seconds of development time, but you create a performance problem and make the application less maintainable (yes, less: you should not indiscriminately return things you don't want to show in the application but do need in the database; with select * you will break insert statements that use selects, and show users things you don't want them to see).
Try not to get into the habit of using SELECT * FROM ... If your application becomes large, and you query the users table for different things in different parts of your application, then when you do add favorite_animal you are more likely to break some spot that uses SELECT *. Or at the least, that spot now retrieves unused fields that slow it down.
Select the data you need specifically. It self-documents to the next person exactly what you're trying to do with that code.
Don't de-normalize unless you have good reason to.
Adding a new favorite column every other day, every time a user has a new favorite, is a maintenance headache at best. I would seriously consider creating a table to hold the favorites values in your case. I'm pretty sure I wouldn't just keep adding a new column all the time.
The general guideline that applies to this (called normalization) is that tables are grouped by distinct entities/objects/concepts, and that each column (field) in a table should describe some aspect of that entity.
In your example, it seems that favorite_color describes (or belongs to) the user. Sometimes it is a good idea to move data to a second table: when it becomes clear that the data actually describes a second entity. For example: you start your database collecting user_id, name, email, and zip_code. Then at some point in time, the CEO decides he would also like to collect the street_address. At this point a new entity has been formed, and you could conceptually view your data as two tables:
user: user_id, name, email
address: street_address, city, state, zip, user_id (as a foreign key)
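A minimal sketch of those two conceptual tables in SQL (the column types are assumptions):

CREATE TABLE user (
    user_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name    VARCHAR(100) NOT NULL,
    email   VARCHAR(255) NOT NULL
);

CREATE TABLE address (
    address_id     INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    street_address VARCHAR(255),
    city           VARCHAR(100),
    state          VARCHAR(100),
    zip            VARCHAR(20),
    user_id        INT UNSIGNED NOT NULL,
    FOREIGN KEY (user_id) REFERENCES user (user_id)
);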
So, to sum it up: the real challenge is to decide what data describes the main entity of the table, and what, if any, other entity exists.
Here is a great example of normalization that helped me understand it better
Unless there is some other reason (e.g. the normal forms for databases), you should not do it. You don't save any space, as the data must still be stored; instead you waste more, as you need another index to access it.
It is always better (though may require more maintenance if schemas change) to fetch only the columns you need.
This will result in lower memory usage by both MySQL and your client application, and reduced query times as the amount of data transferred is reduced. You'll see a benefit whether this is over a network or not.
Here's a rule of thumb: if adding a column to an existing table would require making it nullable (after data has been migrated etc) then instead create a new table with all NOT NULL columns (with a foreign key reference to the original table, of course).
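For example (a hypothetical illustration using the users table from the question): rather than adding a nullable last_logged_in_at column to users, put it in its own all-NOT-NULL table:

CREATE TABLE user_logins (
    user_id           INT NOT NULL PRIMARY KEY,
    last_logged_in_at DATETIME NOT NULL,
    FOREIGN KEY (user_id) REFERENCES users (user_id)
);

-- Users who have never logged in simply have no row here, instead of a NULL.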
You should not rely on using SELECT * for a variety of reasons (google it).
How many fields are possible in one table?
Is maintaining 150 fields in one table a good way to go,
OR
should I maintain relationships with other tables?
Thanks,
Bharanikumar
In the vast majority of cases having 150 columns in a single table is symptomatic of a badly denormalized database.
You might want to read this and re-evaluate your db design.
To put it in your terms, go with "maintain relationship with other tables"
http://dev.mysql.com/doc/refman/5.0/en/column-count-limit.html
If you have a business need to have 150 columns, then it's a "good way". I've never seen such a business need, but that doesn't mean one doesn't exist. I have seen very wide tables used in OLAP-type cases, so if that's what you're doing, there's a good chance you're on the right track. If you're using this table for more OLTP functionality, then you're probably going down the wrong road. Perhaps if you provided a bit more info about what you're trying to accomplish, we could provide some advice (instead of "do that" or "do it a different way").
150 bit-type fields might be OK, but you also have to consider the maximum record length your database will allow you to store. With varchar fields, most databases will let you create a table that would, in theory, violate that maximum if all the fields were filled to their maximum length. However, it won't let you actually add records which are too long. This is the kind of trap that can go along fine for years until someone puts just one character too many into a potential insert, and then it blows up; it generally takes a long time to find and fix such a problem. It is best to avoid ever designing a table where the total length of the columns is bigger than the maximum record length in bytes.
Less wide tables also tend to be faster to query.
Additionally, 150 columns is usually a sign that you really need to look at the design and see if a related table would be better. For instance, if you have phone1, phone2, phone3, then you need a related phone table.
If you genuinely need all 150 columns, consider which are likely to be queried together most often. Put those in the parent table. Then add the less often queried columns (or columns related only to a particular function) to the related table. There is no reason not to have a 1-1 relationship between tables; just use the id from the parent table as the PK in the child table, as well as the FK to the parent table.
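A sketch of both suggestions, with all table and column names made up for illustration: a related phones table instead of phone1/phone2/phone3, and a 1-1 child table whose PK doubles as the FK to the parent.

CREATE TABLE contacts (
    contact_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name       VARCHAR(100) NOT NULL
    -- ... the most frequently queried columns stay here ...
);

-- Related table instead of phone1/phone2/phone3 columns:
CREATE TABLE contact_phones (
    contact_id INT UNSIGNED NOT NULL,
    phone      VARCHAR(30) NOT NULL,
    FOREIGN KEY (contact_id) REFERENCES contacts (contact_id)
);

-- 1-1 child table for the less often queried columns:
CREATE TABLE contact_details (
    contact_id INT UNSIGNED NOT NULL PRIMARY KEY,   -- PK doubles as the FK to the parent
    notes      TEXT,
    FOREIGN KEY (contact_id) REFERENCES contacts (contact_id)
);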