Let's say I have a table which represents Users: id, name. The table is huge, about 100 million rows.
Users also have a property, let's say city of birth. This is an optional field, so only a small share of users (let's say 5%) have provided it. So I also have a table of cities: id, name. The relation is one-to-many: a user can have only one city, and a city can be the birthplace of many users.
The question is: how to connect them?
a) Adding a city_id column to the users table (doomed to hold 95 million NULLs for the users who don't have the property).
b) Creating a third, junction table user_city: user_id, city_id (in order to avoid the huge number of NULLs in a).
Also, please keep in mind that the application needs to run
select user.name ... where city_id=xxx
so the city_id column needs to be indexed in either case.
Because any non-alien user has only one birth city (unless he was born in a taxi), it seems silly and wasteful to have a table of birth city indexed by User ID. I would put birth city right in the user table where (as I claim) it belongs, notwithstanding that most city fields will be NULL.
But, forgetting my mere opinion, this is the classic time vs. space problem, the space consideration being the millions of extraneous, useless NULLs; and the extra time being the millions of extraneous, useless SELECTs into the city table.
What does your solution to that problem tell you?
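To make the two options concrete, here is a minimal sketch of both layouts side by side. It uses Python's sqlite3 with an in-memory database purely as a stand-in for whatever DBMS you run; the table names users_a and users_b, the index names, and the sample rows are all made up for illustration. It only shows that both layouts answer the "names by city_id" query through an index on city_id; it says nothing about which one is faster at 100 million rows.

```python
import sqlite3

def build_demo():
    """Create both layouts in an in-memory SQLite database (illustrative only)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

        -- Option a: nullable city_id column on the user table itself.
        CREATE TABLE users_a (
            id      INTEGER PRIMARY KEY,
            name    TEXT NOT NULL,
            city_id INTEGER REFERENCES cities(id)   -- NULL for most users
        );
        CREATE INDEX idx_users_a_city ON users_a(city_id);

        -- Option b: user table without the column, plus a junction table
        -- that only has rows for users who provided a city.
        CREATE TABLE users_b (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE user_city (
            user_id INTEGER PRIMARY KEY REFERENCES users_b(id),  -- one city per user
            city_id INTEGER NOT NULL REFERENCES cities(id)
        );
        CREATE INDEX idx_user_city_city ON user_city(city_id);
    """)
    people = [(1, "ann", 1), (2, "bob", None), (3, "cy", 1), (4, "di", None)]
    db.executemany("INSERT INTO cities VALUES (?, ?)", [(1, "Paris"), (2, "Oslo")])
    db.executemany("INSERT INTO users_a VALUES (?, ?, ?)", people)
    db.executemany("INSERT INTO users_b VALUES (?, ?)", [(i, n) for i, n, _ in people])
    db.executemany("INSERT INTO user_city VALUES (?, ?)",
                   [(i, c) for i, _, c in people if c is not None])
    return db

def names_option_a(db, city_id):
    """Option a: a single-table lookup on the indexed column."""
    rows = db.execute(
        "SELECT name FROM users_a WHERE city_id = ? ORDER BY name", (city_id,))
    return [r[0] for r in rows]

def names_option_b(db, city_id):
    """Option b: index lookup on the junction table, then a join for the name."""
    rows = db.execute("""
        SELECT u.name
        FROM user_city uc
        JOIN users_b u ON u.id = uc.user_id
        WHERE uc.city_id = ?
        ORDER BY u.name""", (city_id,))
    return [r[0] for r in rows]
```

Calling names_option_a(build_demo(), 1) and names_option_b(build_demo(), 1) gives the same list of names; option b simply pays for one extra join.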
I have a users table which stores the details of two types of users, namely students and teachers. There are 10 fields, like username, password, etc., common to both students and teachers. There are no 1-to-n relations for any of the data here.
For students, I have to store twenty different 1-to-1 fields such as weight, DOB, admission number, parent, phone number, etc.
For teachers, I have to store a separate set of twenty 1-to-1 fields such as email id, affiliation number, etc., which are not related to students in any way.
What is the best database structure I can use in this scenario from below? If there are better options please provide that too.
1) One table with 50 columns, where 20 columns will be NULL for students and 20 columns will be NULL for teachers.
2) One table with 30 columns, where the first 10 columns store the common data and the next 20 columns store student details for a student and teacher details for a teacher.
3) Two tables: one with 10 columns to store user details, and another with 20 columns to store student details for a student and teacher details for a teacher.
4) Three tables: one with 10 columns to store user details, another with 20 columns to store student details, and yet another with 20 columns to store teacher details.
Single Table Inheritance and Class Table Inheritance are both fine. In fact Fowler has recommended STI for agile. And if you use a good ORM like Hibernate, the difference is trivial. If you use PostgreSQL your nulls won't take up any extra space either.
That being said, you should further normalize your tables (parents' phone numbers should be in a different table, for example). See https://dba.stackexchange.com/questions/12991/ready-to-use-database-models-example/23831#23831 for some help.
You have to remember the principles of relational design. All the columns should be dependent on the key fields and only on the key fields.
It's better to go with choice 4, three tables:
1) A base person table for the details that teachers and students both have.
2) A teacher table for details that pertain only to teachers. This relates to the base person table through a foreign key (as does table 3).
3) A student table for details that pertain only to students.
There are no extra empty columns, and it is very flexible in the kinds of queries you will be able to run (including some you are not anticipating yet).
The first thing I thought of was a pig's ear relationship, a link entity so that you could have ID, teacherID, studentID to show which teachers teach which students, but then I realised this isn't what you asked for, so...
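As a rough sketch of that three-table layout (class table inheritance), here it is in Python's sqlite3, standing in for your real database. The column names (password_hash, admission_no, weight_kg, affiliation_no) and the sample rows are only illustrative picks from the kinds of fields mentioned in the question, not a full schema.

```python
import sqlite3

def build_school_db():
    """Base table plus one detail table per user type, keyed by the user's id."""
    db = sqlite3.connect(":memory:")
    db.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce FKs
    db.executescript("""
        -- Columns common to students and teachers.
        CREATE TABLE users (
            id            INTEGER PRIMARY KEY,
            username      TEXT NOT NULL UNIQUE,
            password_hash TEXT NOT NULL
        );
        -- Student-only columns; the primary key doubles as the foreign key,
        -- which keeps the relationship strictly 1-to-1.
        CREATE TABLE students (
            user_id      INTEGER PRIMARY KEY REFERENCES users(id),
            admission_no TEXT NOT NULL,
            weight_kg    REAL
        );
        -- Teacher-only columns, same pattern.
        CREATE TABLE teachers (
            user_id        INTEGER PRIMARY KEY REFERENCES users(id),
            email          TEXT NOT NULL,
            affiliation_no TEXT
        );
    """)
    db.execute("INSERT INTO users VALUES (1, 'sam', 'x')")
    db.execute("INSERT INTO users VALUES (2, 'tess', 'y')")
    db.execute("INSERT INTO students VALUES (1, 'A-17', 52.5)")
    db.execute("INSERT INTO teachers VALUES (2, 'tess@example.org', 'T-9')")
    return db

def student_details(db, username):
    """Return (username, admission_no, weight_kg), or None if not a student."""
    return db.execute("""
        SELECT u.username, s.admission_no, s.weight_kg
        FROM users u
        JOIN students s ON s.user_id = u.id
        WHERE u.username = ?""", (username,)).fetchone()

def teacher_details(db, username):
    """Return (username, email, affiliation_no), or None if not a teacher."""
    return db.execute("""
        SELECT u.username, t.email, t.affiliation_no
        FROM users u
        JOIN teachers t ON t.user_id = u.id
        WHERE u.username = ?""", (username,)).fetchone()
```

A user's type is implied by which detail table has a row for them, so no NULL-padded columns are needed.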
Why not just have a single boolean, true if teacher, false if not?
Look up these two tags: single-table-inheritance class-table-inheritance
These correspond to well known techniques that are like option 1 and option 4. There are situations where one or the other of these is best. The tag wikis (info) and the questions grouped under the tags will give you some additional help.
I'm currently working on a survey creation/administration web application with PHP/MySQL. I have gone through several revisions of the database tables, and I once again find that I may need to rethink the storage of a certain type of answer.
Right now, I have a table that looks like this:
survey_answers
id PK
eid
sesid
intvalue Nullable
charvalue Nullable
id = unique value assigned to each row
eid = Survey question that this answer is in reply to
sesid = The survey 'session' (information about the time and date of a survey take) id
intvalue = The value of the answer if it is a numerical value
charvalue = the value of the answer if it is a textual representation
This allowed me to continue using MySQL's mathematical functions to speed up processing.
I have however found a new challenge: storing questions that have multiple responses.
An example would be:
Which of the following do you enjoy eating? (choose all that apply)
Girl Scout Cookies
Bacon
Corn
Whale Fat
Now, when I want to store the result, I'm not sure of the best way to handle it.
Currently, I have a table just for multiple choice options that looks like this:
survey_element_options
id PK
eid
value
id = unique value associated with each row
eid = question/element that this option is associated with
value = textual value of that option
With this setup, I then store my returned multiple-selection answers in 'survey_answers' as strings of comma-separated ids of the element_options rows that were selected in the survey (i.e. something like "4,6,7,9"). I'm wondering whether that is indeed the best solution, or whether it would be more practical to create a new table that holds each chosen option and references back to a given answer row, which in turn references back to the element and ultimately the survey.
EDIT
For anyone interested, here is the approach I ended up taking (in phpMyAdmin's Relations View):
And a rudimentary query to gather the counts for a multiple select question would look like this:
SELECT e.question AS question, eo.value AS value, COUNT(eo.value) AS count
FROM survey_elements e
JOIN survey_element_options eo ON eo.eid = e.id
JOIN survey_answer_options ao ON ao.oid = eo.id
WHERE e.id = 19
GROUP BY e.question, eo.value
This really depends on a lot of things.
Generally, storing lists of comma-separated values in a database is bad, especially if you plan to do anything remotely intelligent with that data, such as any kind of advanced reporting on the answers.
The best relational way to store this is to also define the answers in a second table and then link them to the user's response to a question in a third table (with multiple entries per user-question, or possibly per user-survey-question if the user could take multiple surveys with the same question on them).
This can get slightly complex. One possible scenario, as a simple example:
Example tables:
Users (Username, UserID)
Questions (qID, QuestionsText)
Answers (AnswerText [in this case example could be reusable, but this does cause an extra layer of complexity as well], aID)
Question_Answers ([Available answers for this question, multiple entries per question] qaID, qID, aID),
UserQuestionAnswers (qaID, uID)
Note: Meant as an example, not a recommendation
Convert the primary key to a non-unique index and store all answers to the same question under the same id.
For example.
id | eid | sesid | intval | charval
3  | 45  | 30    | 2      |
3  | 45  | 30    | 4      |
You can still add another column for regular unique PK if needed.
Keep things simple. No need for relation here.
It's a horses-for-courses thing, really.
You can store the answers as a comma-separated string (but then what happens when one of your answers contains a literal comma?).
You can store as a one-to-many table, such as:
survey_element_answers
id PK
survey_answers_id FK
intvalue Nullable
charvalue Nullable
And then loop over that table. If you pick one answer, it creates one row in this table. If you pick two answers, it creates two rows, and so on. You would then remove intvalue and charvalue from the survey_answers table.
Another choice, since you're already storing the element options in their own table, is to create a many-to-many table, such as:
survey_element_answers
id PK
survey_answers_id FK
survey_element_options_id FK
Again, one row per option selected.
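Here is a small sketch of that many-to-many layout, again using Python's sqlite3 as a stand-in for MySQL. The table and column names follow the ones above; the sample question id (19), session ids, and rows are invented for the example. The point is that counting selections per option becomes an ordinary join plus GROUP BY, with no string splitting.

```python
import sqlite3

def build_survey_db():
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE survey_element_options (
            id INTEGER PRIMARY KEY, eid INTEGER NOT NULL, value TEXT NOT NULL);
        CREATE TABLE survey_answers (
            id INTEGER PRIMARY KEY, eid INTEGER NOT NULL, sesid INTEGER NOT NULL);
        -- One row per option ticked in a given answer.
        CREATE TABLE survey_element_answers (
            id INTEGER PRIMARY KEY,
            survey_answers_id INTEGER NOT NULL REFERENCES survey_answers(id),
            survey_element_options_id INTEGER NOT NULL
                REFERENCES survey_element_options(id)
        );
    """)
    db.executemany("INSERT INTO survey_element_options VALUES (?, ?, ?)", [
        (1, 19, "Girl Scout Cookies"), (2, 19, "Bacon"),
        (3, 19, "Corn"), (4, 19, "Whale Fat")])
    # Two sessions answered question 19.
    db.executemany("INSERT INTO survey_answers VALUES (?, ?, ?)",
                   [(1, 19, 30), (2, 19, 31)])
    # Session 30 ticked cookies and bacon; session 31 ticked bacon only.
    db.executemany(
        "INSERT INTO survey_element_answers "
        "(survey_answers_id, survey_element_options_id) VALUES (?, ?)",
        [(1, 1), (1, 2), (2, 2)])
    return db

def option_counts(db, eid):
    """Return [(option text, times selected)], including options nobody picked."""
    rows = db.execute("""
        SELECT eo.value, COUNT(sea.id)
        FROM survey_element_options eo
        LEFT JOIN survey_element_answers sea
               ON sea.survey_element_options_id = eo.id
        WHERE eo.eid = ?
        GROUP BY eo.id, eo.value
        ORDER BY eo.id""", (eid,))
    return rows.fetchall()
```

The LEFT JOIN is a choice made here so that unselected options show up with a count of 0; a plain JOIN would drop them.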
Another option yet again is to store a bitmask value. This will remove the need for a many-to-many table.
survey_element_options
id PK
eid FK
value Text
optionnumber unique for each eid
optionbitmask 2 ^ (optionnumber - 1)
optionnumber should be unique for each eid and increment starting with one, so the first option gets bitmask 1, the second 2, the third 4, and so on. This imposes a limit of 63 options if you are using a signed bigint, or 31 options if you are using a signed int.
And then in your survey_answers
id PK
eid
sesid
answerbitmask bigint
answerbitmask is calculated by adding together the optionbitmask values of every option the user selected. For example, if 7 were stored in answerbitmask, that would mean the user selected the first three options (1 + 2 + 4).
Joins can be done by:
WHERE survey_answers.answerbitmask & survey_element_options.optionbitmask > 0
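A quick sketch of the bitmask idea, run through Python's sqlite3 as a stand-in for MySQL (both support the & operator). It assumes option number n gets the bit 2^(n-1), which is the numbering under which a stored 7 means "the first three options"; the helper names and sample rows are made up for the example.

```python
import sqlite3

def option_bitmask(option_number):
    """Bit for the n-th option, counting from 1: 1, 2, 4, 8, ..."""
    return 1 << (option_number - 1)

def build_bitmask_db():
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE survey_element_options (
            id INTEGER PRIMARY KEY,
            eid INTEGER NOT NULL,
            value TEXT NOT NULL,
            optionnumber INTEGER NOT NULL,
            optionbitmask INTEGER NOT NULL,
            UNIQUE (eid, optionnumber)
        );
        CREATE TABLE survey_answers (
            id INTEGER PRIMARY KEY,
            eid INTEGER NOT NULL,
            sesid INTEGER NOT NULL,
            answerbitmask INTEGER NOT NULL
        );
    """)
    foods = ["Girl Scout Cookies", "Bacon", "Corn", "Whale Fat"]
    db.executemany(
        "INSERT INTO survey_element_options "
        "(eid, value, optionnumber, optionbitmask) VALUES (19, ?, ?, ?)",
        [(food, n, option_bitmask(n)) for n, food in enumerate(foods, start=1)])
    # Session 30 ticked options 1, 2 and 3: 1 + 2 + 4 = 7.
    mask = option_bitmask(1) | option_bitmask(2) | option_bitmask(3)
    db.execute("INSERT INTO survey_answers VALUES (1, 19, 30, ?)", (mask,))
    return db

def selected_options(db, answer_id):
    """Join an answer to its options by testing each option's bit."""
    rows = db.execute("""
        SELECT o.value
        FROM survey_answers a
        JOIN survey_element_options o
          ON o.eid = a.eid
         AND (a.answerbitmask & o.optionbitmask) > 0
        WHERE a.id = ?
        ORDER BY o.optionnumber""", (answer_id,))
    return [r[0] for r in rows]
```

Note that a join condition like this cannot use an ordinary index on answerbitmask, so it trades the extra table for a scan over the candidate rows.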
So yeah, there are a few options to consider.
If you don't use the id as a foreign key in another query, or if you can query results using the sesid, try a many to one relationship.
Otherwise I'd store multiple choice answers as a serialized array, such as JSON or through php's serialize() function.
I'm working on an application where each user has to fill out an extensive profile for themselves.
The first part of the user profile consists of about 25 or so fields of general information
The next section of the user profile is where they evaluate themselves on a set list of criteria, e.g. "Rate how good you are at cooking", and then tick a radio button from one to five. There is also a check box that they can tick if they are 'extra interested' in the activity/subject they rated themselves on.
There are about 40 of these that they rate themselves on.
So my question is: how should I store this information? Should there be columns in my users table for every field and item? This would be nearly 70 fields.
Or should I set up a table for user_profile and one for user_self_evaluation, put the columns for each in there, and have a one-to-one relationship with the users?
Use separate tables. That way, when you update only the self-evaluation, you do not need to touch the user_profile table. The idea is to keep the frequently updated fields in one table and the rarely updated ones in another. If the table becomes large and the username/password are in a separate table, lookups by userid / username won't be affected by the volume of update queries, nor will you bring the whole site down if you alter the self_evaluation table.
But if you are planning to add new evaluations, I'd suggest a different design:
user_profile table with the 25 profile fields
self_evaluations table, with id and name, and any meta information about the question; with 1 record per evaluation
user_profile_evaluation with userid, evaluationid, score, extra - with one record for each evaluation of the user.
This way your schema will be much more flexible and you won't need to alter the table in order to add another evaluation.
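A minimal sketch of that flexible design, using Python's sqlite3 as a stand-in for your actual database. The table and column names follow the list above; the composite primary key, the CHECK on score, and the helper function names are assumptions added for the example, and the user_profile table is left out to keep it short.

```python
import sqlite3

def build_profile_db():
    db = sqlite3.connect(":memory:")
    db.executescript("""
        -- One row per evaluation question.
        CREATE TABLE self_evaluations (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL UNIQUE
        );
        -- One row per (user, evaluation) pair.
        CREATE TABLE user_profile_evaluation (
            userid       INTEGER NOT NULL,
            evaluationid INTEGER NOT NULL REFERENCES self_evaluations(id),
            score        INTEGER NOT NULL CHECK (score BETWEEN 1 AND 5),
            extra        INTEGER NOT NULL DEFAULT 0,  -- 1 = 'extra interested'
            PRIMARY KEY (userid, evaluationid)
        );
    """)
    return db

def add_evaluation(db, name):
    """Adding a new question is an INSERT, not an ALTER TABLE."""
    cur = db.execute("INSERT INTO self_evaluations (name) VALUES (?)", (name,))
    return cur.lastrowid

def rate(db, userid, evaluationid, score, extra=False):
    """Store or overwrite a user's rating.

    INSERT OR REPLACE is SQLite syntax; MySQL would use
    INSERT ... ON DUPLICATE KEY UPDATE instead.
    """
    db.execute(
        "INSERT OR REPLACE INTO user_profile_evaluation VALUES (?, ?, ?, ?)",
        (userid, evaluationid, score, int(extra)))

def scores_for(db, userid):
    """Return [(evaluation name, score, extra)] for one user."""
    rows = db.execute("""
        SELECT e.name, upe.score, upe.extra
        FROM user_profile_evaluation upe
        JOIN self_evaluations e ON e.id = upe.evaluationid
        WHERE upe.userid = ?
        ORDER BY e.name""", (userid,))
    return rows.fetchall()
```

The composite primary key means a user has at most one score per evaluation, so re-rating replaces the old row rather than piling up duplicates.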
or should I setup a table for user_profile, and user_self_evaluation,
and have the columns for each in there and have a one-one relationship
with the users?
Yes, this is how you should do it if you know you won't expand the table in the future. The other option (a single table with nearly 70 columns) is a poor choice.
If you think you will expand the evaluations in the future, then you can do it like this:
user_self_evaluation table
user_id | evaluation_type | evaluation_value
1 | cooking | 5
1 | singing | 3
2 | cooking | 2
2 | singing | 5
Make the combined columns (user_id, evaluation_type) a unique key or the primary key, so that each user has at most one value per evaluation type.
I think the latter design is better; a table with 70 columns is really ugly and only gets worse as you try to manage it.
When every question is multiple choice you can also add numbers in one field for each answer.
Let's say you've got four questions with 4 choices:
You could save them as 1433 in one column called answers (the answer to the first question is 1, to the second 4, to the third 3, and to the fourth 3).
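For what it's worth, the packing and unpacking can be sketched in a few lines of Python. The function names are made up for the example, and it assumes every question has at most 9 choices so that each answer fits in a single digit.

```python
def pack_answers(choices):
    """Pack one single-digit choice per question into a string such as '1433'."""
    if any(not 1 <= choice <= 9 for choice in choices):
        raise ValueError("each choice must be a single digit from 1 to 9")
    return "".join(str(choice) for choice in choices)

def unpack_answers(packed):
    """Inverse of pack_answers: position i holds the answer to question i + 1."""
    return [int(digit) for digit in packed]
```

The catch is that the database can no longer filter or aggregate on an individual answer without string functions, which is the same drawback as any packed column.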
Just giving you some choices here.
But if I had to choose between a one-to-one relationship and a single table, I would choose the one-to-one relationship because it's easier to manage later on.
I am sure this is a basic question, but I am new to SQL. For my user profile I want to display location = "Hollywood, CA - USA" if a user lives in Hollywood. So I assume the user table will have one column like current_city holding an ID, say 1232, which is a FK to the city table where city_name for this PK = Hollywood. I would then join the state table and the country table to find the names CA and USA, as the city lookup table will only store the IDs (like CA = 21 and USA = 345).
Is this the best way to design the tables? Or should I add two columns, city_id and city_name, to the user table, and also add country_id, country_name, state_id, state_name to the city table? This way I save on trips to other parent tables just to fetch the names for the IDs.
This is only a sample use case, but I have lots of lookup ID tables, so I will apply the same principle to all tables once I know how to do it best. My requirements are scalability and performance, so whatever works best for these is what I would like.
The first way you described is almost always better.
Having both the city_id and city_name (or any pair of that kind) in the users table is not best practice since it may cause data discrepancies - a wrong update may result in a city_id that does not match the city_name and then the system behavior becomes unexpected.
As said, your first suggestion would be the common and usually the best way to do this. If table keys are designed properly so all select statements can use them efficiently this would also give the best performance.
For example, having just the city_name in the users table would make it a little quicker to find and show the city for one user, but when trying to run other queries - like finding all users in city X, that would make much less sense.
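To illustrate the normalized version, here is a small sketch using Python's sqlite3 as a stand-in for your DBMS. The IDs (1232, 21, 345) are the ones from the question; the table layout, column names, and the sample user are assumptions made up for the example. The display string is assembled in application code so the query itself stays portable.

```python
import sqlite3

def build_location_db():
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE countries (id INTEGER PRIMARY KEY, code TEXT NOT NULL);
        CREATE TABLE states (
            id INTEGER PRIMARY KEY,
            code TEXT NOT NULL,
            country_id INTEGER NOT NULL REFERENCES countries(id)
        );
        CREATE TABLE cities (
            id INTEGER PRIMARY KEY,
            city_name TEXT NOT NULL,
            state_id INTEGER NOT NULL REFERENCES states(id)
        );
        -- The user row stores only the city id; names live in one place each.
        CREATE TABLE users (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            current_city INTEGER REFERENCES cities(id)
        );
    """)
    db.execute("INSERT INTO countries VALUES (345, 'USA')")
    db.execute("INSERT INTO states VALUES (21, 'CA', 345)")
    db.execute("INSERT INTO cities VALUES (1232, 'Hollywood', 21)")
    db.execute("INSERT INTO users VALUES (1, 'pat', 1232)")
    db.execute("INSERT INTO users VALUES (2, 'lee', NULL)")
    return db

def location_for(db, user_id):
    """Return e.g. 'Hollywood, CA - USA', or None if the user has no city."""
    row = db.execute("""
        SELECT c.city_name, s.code, co.code
        FROM users u
        JOIN cities c     ON c.id = u.current_city
        JOIN states s     ON s.id = c.state_id
        JOIN countries co ON co.id = s.country_id
        WHERE u.id = ?""", (user_id,)).fetchone()
    if row is None:
        return None
    city, state, country = row
    return f"{city}, {state} - {country}"
```

All three joins go through primary keys, so the "trips to parent tables" are cheap lookups, and renaming a city is a single-row update.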
You can find a nice series of articles for beginners about DB normalization here:
http://databases.about.com/od/specificproducts/a/2nf.htm. This article has an example which is very much like what you are trying to achieve, and the related articles will probably help you design many other tables in your DB.
Good luck!
OK, I have a database with a table for storing classified posts; each post belongs to a different city. For the purpose of this example we will call this table posts. This table has columns:
id (INT, +AI),
cityid (TEXT),
postcat (TEXT),
user (TEXT),
datatime (DATETIME),
title (TEXT),
desc (TEXT),
location (TEXT)
an example of that data would be:
'12039',
'fayetteville-nc',
'user#gmail.com',
'December 28th, 2010 - 11:55 PM',
'post title',
'post description',
'spring lake'
id is auto-incremented; cityid is in text format (this is where I think I will be losing performance once the database is large)...
Originally I planned on having a different table for each city and now since a user has to have the option of searching and posting through multiple cities, I think I need them all in one table. Everything was perfect when I had one city per table, where I could:
SELECT *
FROM `posts`
WHERE MATCH (`title`, `desc`, `location`)
AGAINST ('searchtext' IN BOOLEAN MODE)
AND `postcat` LIKE 'searchcatagory'
But then I ran into problems when trying to search multiple cities at one time, or when listing all of a user's posts for them to delete or edit.
So it looks like I have to have one table with all the posts, and also match another FULLTEXT field: cityid. I am guessing I need full-text because if a user chooses an entire state and my cityid is "fayetteville-nc", I would need to match cityid against "-nc". This is only an assumption, and I would love another way. This database could easily reach over a million rows within 6 months, and a full-text search against 4 columns is probably going to be slow.
My question is, is there a better way to do this more efficiently? The database has nothing in it now, except for some test posts made by me. So I can completely redesign the table structure if necessary. I am open to any and all suggestions, even if it is just a more efficient way to perform my query.
Yes, one table for all posts sounds sensible. It would also be normal design for the posts table to have a city_id, referring to the id in a city table. Each city would also have a state_id, referring to the id in a state table, and similarly each state would have a country_id referring to the id in a country table. So you could write:
SELECT $columns
FROM posts JOIN city ON city.id = posts.city_id
WHERE city.tag = 'fayetteville-nc'
Once you've brought the cities into a separate table, it might make more sense to resolve the city to its city_id up front. This happens fairly naturally if the city is chosen from a dropdown, for instance. But if you're entering free text into a search field, you may want to do it differently.
You can also search for all posts in a given state (or set of states) as:
SELECT $columns
FROM posts
JOIN city ON city.id = posts.city_id
JOIN state ON state.id = city.state_id
WHERE state.tag = 'NC'
If you're going to get fancier or go international, you may need a more flexible way of arranging locations into a hierarchy (e.g. you may want city districts, counties, multinational regions, or intranational regions such as Midwest, East Coast, etc.), but keep it simple for now :)
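As a runnable sketch of the city and state lookups above, here is a cut-down version in Python's sqlite3, standing in for MySQL. The posts table is reduced to three columns, the full-text part is left out entirely, and the index name and sample rows are invented for the example. It shows that a state-wide search becomes an integer join rather than a text match on "-nc".

```python
import sqlite3

def build_posts_db():
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE state (id INTEGER PRIMARY KEY, tag TEXT NOT NULL UNIQUE);
        CREATE TABLE city (
            id INTEGER PRIMARY KEY,
            tag TEXT NOT NULL UNIQUE,
            state_id INTEGER NOT NULL REFERENCES state(id)
        );
        CREATE TABLE posts (
            id INTEGER PRIMARY KEY,
            city_id INTEGER NOT NULL REFERENCES city(id),
            title TEXT NOT NULL
        );
        -- Integer index replaces matching on the text cityid.
        CREATE INDEX idx_posts_city ON posts(city_id);
    """)
    db.executemany("INSERT INTO state VALUES (?, ?)", [(1, "NC"), (2, "SC")])
    db.executemany("INSERT INTO city VALUES (?, ?, ?)", [
        (1, "fayetteville-nc", 1), (2, "raleigh-nc", 1), (3, "columbia-sc", 2)])
    db.executemany("INSERT INTO posts VALUES (?, ?, ?)", [
        (1, 1, "couch for sale"), (2, 2, "bike wanted"), (3, 3, "free kittens")])
    return db

def titles_in_city(db, city_tag):
    rows = db.execute("""
        SELECT posts.title
        FROM posts
        JOIN city ON city.id = posts.city_id
        WHERE city.tag = ?
        ORDER BY posts.id""", (city_tag,))
    return [r[0] for r in rows]

def titles_in_state(db, state_tag):
    rows = db.execute("""
        SELECT posts.title
        FROM posts
        JOIN city  ON city.id = posts.city_id
        JOIN state ON state.id = city.state_id
        WHERE state.tag = ?
        ORDER BY posts.id""", (state_tag,))
    return [r[0] for r in rows]
```

In the real table, the FULLTEXT match on title, desc, and location would simply be added as another condition in the WHERE clause.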