I am sure this is a basic question, but I am new to SQL. For my user profile I want to display location = "Hollywood, CA - USA" if a user lives in Hollywood. So I assume the user table will have one column such as current_city holding an ID, say 1232, which is a FK to the city table, where city_name for this PK is Hollywood. I would then join the state table and the country table to find the names CA and USA, since the city lookup table only stores their IDs (like CA = 21 and USA = 345).
Is this the best way to design the tables? Or should I add two columns, city_id and city_name, to the user table, and also add country_id, country_name, state_id, state_name to the city table? That way I save on trips to the parent tables just to fetch the names for the IDs.
This is only a sample use case, but I have lots of lookup ID tables, so I will apply the same principle to all of them once I know how to do it best. My requirements are scalability and performance, so whatever works best for these is what I would like.
The first way you described is almost always better.
Having both the city_id and city_name (or any pair of that kind) in the users table is not best practice since it may cause data discrepancies - a wrong update may result in a city_id that does not match the city_name and then the system behavior becomes unexpected.
As said, your first suggestion would be the common and usually the best way to do this. If table keys are designed properly so all select statements can use them efficiently this would also give the best performance.
For example, having just the city_name in the users table would make it a little quicker to find and show the city for one user, but when trying to run other queries - like finding all users in city X, that would make much less sense.
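A minimal sketch of the first (normalized) design, run through Python's sqlite3 module only so that it is self-contained. The table layout and every name other than current_city are assumptions for illustration, not a prescribed schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE country (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE state   (id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                          country_id INTEGER NOT NULL REFERENCES country(id));
    CREATE TABLE city    (id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                          state_id INTEGER NOT NULL REFERENCES state(id));
    CREATE TABLE users   (id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                          current_city INTEGER REFERENCES city(id));
    -- Index the FK so "all users in city X" does not scan the users table.
    CREATE INDEX idx_users_current_city ON users(current_city);

    INSERT INTO country VALUES (345, 'USA');
    INSERT INTO state   VALUES (21, 'CA', 345);
    INSERT INTO city    VALUES (1232, 'Hollywood', 21);
    INSERT INTO users   VALUES (1, 'alice', 1232);
""")

def location_for(user_id):
    """Build the display string by joining the lookup tables on their keys."""
    row = conn.execute("""
        SELECT city.name || ', ' || state.name || ' - ' || country.name
        FROM users
        JOIN city    ON city.id    = users.current_city
        JOIN state   ON state.id   = city.state_id
        JOIN country ON country.id = state.country_id
        WHERE users.id = ?
    """, (user_id,)).fetchone()
    return row[0] if row else None

print(location_for(1))  # Hollywood, CA - USA
```

Every join here is a primary-key lookup, which is the case the answer has in mind when it says properly designed keys keep this fast.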
You can find a nice series of articles for beginners about DB normalization here:
http://databases.about.com/od/specificproducts/a/2nf.htm. This article has an example which is very much like what you are trying to achieve, and the related articles will probably help you design many other tables in your DB.
Good luck!
I am trying to implement a database that has multi-valued attributes and to create a filter-based search. For example, I want my people_table to contain id, name, address, hobbies, interests (hobbies and interests are multi-valued). The user will be able to check many attributes, and SQL will return only those people who have all of them.
I did some research and found some ways to implement this, but I can't decide which one is best.
The first one is to have one table with the basic info of people (id, name, address), two more for the multi-valued attributes, and one more which contains only the keys of the other tables (I understand how to create these tables; I don't know yet how to implement the search).
The second one is to have one table with the basic info and then one table for each attribute. So I will have 20 or more tables (football, paint, golf, music, hiking etc.) which contain only the ids of the people. Then, when the user checks the hobbies and the activities, I get the desired results using JOINs (I am not sure about the complexity, so I don't know how fast it will be if the user checks many boxes).
The last one is an implementation that I didn't find on the internet (and I know there is a reason :) ), but in my mind it is the easiest to implement and the fastest in terms of complexity: use only one table which has the basic info as normal and also all the attributes as boolean columns. So if I have 1000 people in my table there are only going to be 1000 loops, which I imagine will be fast enough with AND conditions.
So my question is: can I use the third implementation, or is there a big disadvantage that I don't see? And which of the first two ways do you suggest?
That is a typical n to m relation. It works like this
persons table
------------
id
name
address
interests table
---------------
id
name
person_interests table
----------------------
person_id
interest_id
person_interests contains a record for each interest of a person. To get the interests of a person do:
select i.name
from interests i
join person_interests pi on pi.interest_id = i.id
join persons p on pi.person_id = p.id
where p.name = 'peter'
You could also create tables for hobbies. To get the hobbies, do the same in a separate query. To get both in one query you can do something like this (note that it returns one row per combination of interest and hobby):
select p.id, p.name,
i.name as interest,
h.name as hobby
from persons p
left join person_interests pi on pi.person_id = p.id
left join interests i on pi.interest_id = i.id
left join person_hobbies ph on ph.person_id = p.id
left join hobbies h on ph.hobby_id = h.id
where p.name = 'peter'
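The question also asks how to return only the people who have all of the checked attributes. One common way to do that on top of the tables above is to filter on the checked names and count the matches per person. This is a sketch under assumptions: it runs on SQLite through Python's sqlite3 module, the sample rows are made up, and people_with_all is a hypothetical helper name:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE persons   (id INTEGER PRIMARY KEY, name TEXT, address TEXT);
    CREATE TABLE interests (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE person_interests (
        person_id   INTEGER REFERENCES persons(id),
        interest_id INTEGER REFERENCES interests(id),
        PRIMARY KEY (person_id, interest_id)
    );
    INSERT INTO persons   VALUES (1, 'peter', 'a'), (2, 'mary', 'b'), (3, 'john', 'c');
    INSERT INTO interests VALUES (1, 'football'), (2, 'golf'), (3, 'music');
    INSERT INTO person_interests VALUES (1, 1), (1, 2), (1, 3), (2, 1), (2, 3), (3, 2);
""")

def people_with_all(interest_names):
    """Return names of people who have every one of the given interests."""
    wanted = sorted(set(interest_names))
    if not wanted:
        return []
    placeholders = ", ".join("?" for _ in wanted)
    sql = f"""
        SELECT p.name
        FROM persons p
        JOIN person_interests pi ON pi.person_id = p.id
        JOIN interests i ON i.id = pi.interest_id
        WHERE i.name IN ({placeholders})
        GROUP BY p.id, p.name
        -- keep only people matched by as many rows as interests were checked
        HAVING COUNT(DISTINCT i.id) = ?
        ORDER BY p.name
    """
    return [row[0] for row in conn.execute(sql, (*wanted, len(wanted)))]

print(people_with_all(["football", "music"]))  # ['mary', 'peter']
```

The query stays the same shape however many boxes the user checks; only the IN list and the expected count change, which is what the boolean-columns design cannot offer without a schema change per new hobby.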
The basic way to deal with this is with a many-to-many join table. Each user can have many hobbies. Each hobby can have many users. That's basic stuff you can find information about anywhere, and @juergend already covered that.
The harder part is tracking different information about various hobbies and interests. Like if their hobby is "baseball" you might want to track what position they play, but if their hobby is "travel" you might want to track their favorite countries. Doing this with typical SQL relationships will lead to a rapid proliferation of tables and columns.
A hybrid approach is to use the new JSON data type to store some unstructured data. To expand on @juergend's example, you might add a field to Person_Interests which can store some of those details about that person's interest.
create table Person_Interests (
InterestID integer references Interests(ID),
PersonID integer references Persons(ID),
Details JSON
);
And now you could add that Person 45 has Interest 12 (travel), their favorite country is Djibouti, and they've been to 45 countries.
insert into person_interests
(InterestID, PersonID, Details)
values
(12, 45, '{"favorite_country": "Djibouti", "countries_visited": 45}');
And you can use JSON search functions to find, for example, everyone whose favorite country is Djibouti.
select p.id, p.name
from person_interests pi
join persons p on p.id = pi.personid
where pi.details->>'$.favorite_country' = 'Djibouti'
The advantage here is flexibility: interests and their attributes aren't limited by your database schema.
The disadvantage is performance. The JSON data type isn't the most efficient, and indexing a JSON column in MySQL is complicated. Good indexing is critical to good SQL performance. So as you figure out common patterns you might want to turn commonly used attributes into real columns in real tables.
The other option would be to use table inheritance. This is a feature of Postgres, not MySQL, and I'd recommend considering switching. Postgres also has better and more mature JSON support and JSON columns are easier to index.
With table inheritance, rather than having to write a completely new table for every different interest, you can make specific tables which inherit from a more generic one.
create table person_interests_travel (
FavoriteCountry text,
CountriesVisited text[]
) inherits(person_interests);
This still has InterestID, PersonID, and Details, but it's added some specific columns for tracking their favorite country and countries they've visited.
Note the text[] type. PostgreSQL also supports arrays, so you can store real lists without having to create another join table. You can also do this in MySQL with a JSON field, but arrays offer type constraints that JSON does not.
I have a snowflake diagram with:
Fact:
id_movie
id_user
rating
Dim Users:
id_user
...
Dim Movies:
id_movie
...
In my ERD, I also have a table Category, that has a many to many relationship with the movies like this:
Dim_Category:
id_category
...
Map_Category_Movie:
id_movie
id_category
relevance
I am trying to find an efficient way to model this in a snowflake/star schema. My issues:
I could just add these two tables into the snowflake diagram, but this would feel wrong as I usually only use tables that are aggregates of the subtables on the outer fringes of this diagram.
I could create another fact table for the relevance, but as I ultimately want to report on how relevance correlates with users' movie-rating behaviour, I'd need to use both fact tables, which to me is an incorrect approach.
Any guidance here?
There is a good chance that you have already answered this yourself, and welcome to hell.
First, a quotation from http://www.information-management.com/ that may interest you:
The snowflake structure will reduce batch updates to dimensions. Though always said to be slower than a star, some tests have revealed no difference in performance between flattened and snowflaked dimensions. In fact in some cases, the snowflake provides superior performance, such as when a wide dimension (i.e., customer) is segmented into a snowflake.
So using a bridge table is not going to cause a significant loss of performance. I prefer a snowflake in a good percentage of cases, because it is sometimes really easier to manage your data mart, and the hardware and data size often allow it.
My friendly advice is to create a bridge table (movie_ID, category_ID, relevance) and move on.
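A rough sketch of how the bridge table can sit next to the existing fact table, using the table names from the question. It runs on SQLite through Python's sqlite3 module purely for illustration; the sample rows, the fact table name fact_rating, and the relevance-weighted average are assumptions, not something the original setup prescribes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_users    (id_user INTEGER PRIMARY KEY);
    CREATE TABLE dim_movies   (id_movie INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE dim_category (id_category INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_rating (
        id_movie INTEGER REFERENCES dim_movies(id_movie),
        id_user  INTEGER REFERENCES dim_users(id_user),
        rating   REAL,
        PRIMARY KEY (id_movie, id_user)
    );
    -- Bridge table: one row per (movie, category) pair, carrying the relevance.
    CREATE TABLE map_category_movie (
        id_movie    INTEGER REFERENCES dim_movies(id_movie),
        id_category INTEGER REFERENCES dim_category(id_category),
        relevance   REAL,
        PRIMARY KEY (id_movie, id_category)
    );

    INSERT INTO dim_users    VALUES (10);
    INSERT INTO dim_movies   VALUES (1, 'Movie A'), (2, 'Movie B');
    INSERT INTO dim_category VALUES (1, 'comedy'), (2, 'drama');
    INSERT INTO fact_rating  VALUES (1, 10, 4.0), (2, 10, 2.0);
    INSERT INTO map_category_movie VALUES (1, 1, 1.0), (1, 2, 0.5), (2, 2, 1.0);
""")

def weighted_rating_by_category(id_user):
    """Relevance-weighted average rating per category for one user."""
    rows = conn.execute("""
        SELECT c.name, SUM(f.rating * m.relevance) / SUM(m.relevance)
        FROM fact_rating f
        JOIN map_category_movie m ON m.id_movie = f.id_movie
        JOIN dim_category c       ON c.id_category = m.id_category
        WHERE f.id_user = ?
        GROUP BY c.name
    """, (id_user,))
    return dict(rows.fetchall())

print(weighted_rating_by_category(10))
```

One thing to watch with any bridge: the join repeats each fact row once per category, so plain SUM(rating) or COUNT(*) across categories double-counts. Grouping by category, or weighting by relevance as above, avoids that.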
If you have a fixed and small list of categories, create a table with predefined categories:
dim_movies
----------
movies_id
category1_relevance
category2_relevance
category3_relevance
Up to ten is perhaps OK, especially if you work for the company you're creating the DWH for rather than just consulting (so you can administer it).
Once, we tried to create a masterpiece of a data warehouse, with a case similar to yours. The payment deal was based on performance (data was over 2TB per fact table), so we decided to take a shot at a star schema.
We created a dimension like the one I described above, and every time the number of distinct categories grew, the ETL added a new field to the table.
The ETL process also had to dynamically recreate the cube.
It took a lot of pain, but performance was, as I remember, 13% better than the snowflake.
Also, during my most exhausting project, where I believe a 10-year-old kid would have designed the DB better, we had to connect exactly 5 categories per item. Each category pointed to one of 20+ possible tables. They could be joined ONLY through their software, based on some rules. It was some kind of 1..5:many relationship (which doesn't exist!?!)
pk code_conto cat1 cat2 cat3 cat4 cat5
----------------------------------------------------------
1 123 17 NULL 5467 12 NULL
2 124 67 1098 NULL 1423 AK12
3 123 NULL NULL NULL 13 23
Code was like this:
If (code_conto == 123)
{
Category1_join_set = 'SELECT cat_id, cat_name FROM cat_customers'; //NOTE THIS
Category2_join_set = 'SELECT cat_id, cat_name FROM cat_products';
Category3_join_set = 'SELECT cat_id, cat_name FROM cat_city';
...
...
}
If (code_conto == 124)
{
Category1_join_set = 'SELECT cat_id, cat_name FROM cat_products'; //AND THIS
Category2_join_set = 'SELECT cat_id, cat_name FROM cat_origin'; //ON SAME FIELD
Category3_join_set = 'SELECT cat_id, cat_name FROM cat_blabla'; //DIFFERENT JOIN TABLE
...
...
}
All hard-coded. So we hard-coded our queries, repeating WHEN in a CASE statement over 100 times. Guess what? The ERP provider 'improved' his software and created a mapping table that held the 'C' if statements, keyed by code_conto.
It took us more than 3 weeks to deliver a good and secure ETL job (with SQL and external tools).
I didn't write all this for nothing. I wanted to convince you and others that using a bridge table for many-to-many relationships is probably the best practice in 97% of cases.
However, there are five possible design solutions for an M:M relationship:
Array or series (I don't want to even try it)
Bridge table
Groupings
Fixed levels
Dynamically created fixed levels
Hope I didn't confuse you.
Let's say I have a table which represents Users: id, name. The table is huge, about 100 million rows.
Users also have a property, let's say city of birth. This is an optional field, so only a small part of the users (let's say 5%) have provided it. So I also have a table with cities: id, name. The relation is 1 to many: a user can have only one city, and a city can be the birthplace of many users.
The question is: how to connect them?
a) Adding a column city_id to the users table (doomed to hold 95 million NULLs for users who don't have the property).
b) Creating a third, junction table user_city: user_id, city_id (to avoid the huge number of NULLs in a)).
Also, please, keep in mind that the application needs to
select user.name ... where city_id=xxx
So the city_id column needs to be indexed in any case
Because any non-alien user has only one birth city (unless he was born in a taxi), it seems silly and wasteful to have a table of birth city indexed by User ID. I would put birth city right in the user table where (as I claim) it belongs, notwithstanding that most city fields will be NULL.
But, forgetting my mere opinion, this is the classic time vs. space problem, the space consideration being the millions of extraneous, useless NULLs; and the extra time being the millions of extraneous, useless SELECTs into the city table.
What does your solution to that problem tell you?
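To make the two options concrete, here is a small side-by-side sketch. It uses SQLite through Python's sqlite3 module only so it runs as-is; the index names, sample rows, and helper functions are made up for illustration, and it says nothing about how a particular engine stores NULLs:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

    -- Option a): nullable foreign key directly on the users table.
    CREATE TABLE users (
        id      INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        city_id INTEGER REFERENCES cities(id)
    );
    CREATE INDEX idx_users_city ON users(city_id);

    -- Option b): separate table; the primary key on user_id keeps it
    -- at most one city per user.
    CREATE TABLE user_city (
        user_id INTEGER PRIMARY KEY REFERENCES users(id),
        city_id INTEGER NOT NULL REFERENCES cities(id)
    );
    CREATE INDEX idx_user_city_city ON user_city(city_id);

    INSERT INTO cities VALUES (1, 'Oslo'), (2, 'Lima');
    INSERT INTO users  VALUES (1, 'ann', 1), (2, 'bob', NULL),
                              (3, 'cid', 1), (4, 'dee', 2);
    INSERT INTO user_city SELECT id, city_id FROM users WHERE city_id IS NOT NULL;
""")

def names_option_a(city_id):
    """Option a): one indexed lookup on the users table."""
    sql = "SELECT name FROM users WHERE city_id = ? ORDER BY name"
    return [r[0] for r in conn.execute(sql, (city_id,))]

def names_option_b(city_id):
    """Option b): indexed lookup on user_city, then a join back to users."""
    sql = """
        SELECT u.name
        FROM user_city uc
        JOIN users u ON u.id = uc.user_id
        WHERE uc.city_id = ?
        ORDER BY u.name
    """
    return [r[0] for r in conn.execute(sql, (city_id,))]

print(names_option_a(1), names_option_b(1))
```

Both return the same rows; option b) trades the NULL-filled column for an extra table and a join on every lookup, which is the time vs. space trade-off described above.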
I'm currently working on a social networking site. (Yeah, I know, there's a whole bunch of them. I'm not trying to make Facebook all over)
I was wondering if anyone could tell me if my way of thinking is way off, or if it is the way it is actually done.
I want a user to be able to have friends. And, for that, I'm thinking that I should have one usertable like so:
USER
uId
userName
email
etc..
This should probably be a 1:N relationship, so I'm thinking that a table "contacts" should hold a list of users and their friends like so:
CONTACTS
uId (From USER)
FriendId (From USER table)
Friendship type ENUM[Active, Inactive, Pending]
Would it be an effective solution to sort this table on uId, so that a query result would look something similar to this:
uID | friendId
1 | 2
1 | 6
1 | 97
75 | 1
75 | 34
etc
Or are there any different solutions to this?
If you are simply looking to select a specific user's set of friends, the query will be straightforward and you won't have to worry about performance.
For example, if you want to return the ids of UID 8's friends, you can do something like:
SELECT FriendId FROM Friends WHERE UID=8;
In your case, since the UID column is not unique, make sure to have an index on this column to allow quick lookups.
You might also want to think about what other data you will need about the user's friends. For example, it's probably not useful to grab just the FriendIds; you probably want names etc. So your query will likely look more like:
Select FriendId, Users.name FROM Friends JOIN Users ON Users.uid=Friends.FriendId WHERE Friends.UID=8;
Again, having the proper columns indexed is key for optimized lookups, especially once your table size gets big.
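Putting the pieces together as a runnable sketch (SQLite via Python's sqlite3 module, chosen only for self-containment; the status column, sample rows, and friends_of helper are illustrative assumptions). A composite primary key on (uid, friendid) both prevents duplicate friendships and gives the lookup by uid an index to use:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (uid INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE friends (
        uid      INTEGER NOT NULL REFERENCES users(uid),
        friendid INTEGER NOT NULL REFERENCES users(uid),
        status   TEXT NOT NULL
                 CHECK (status IN ('Active', 'Inactive', 'Pending')),
        -- uid is the leading column, so WHERE uid = ? can use this key.
        PRIMARY KEY (uid, friendid)
    );

    INSERT INTO users VALUES (1, 'ann'), (2, 'bob'), (6, 'cid'), (8, 'dee');
    INSERT INTO friends VALUES (8, 1, 'Active'), (8, 6, 'Pending'), (1, 2, 'Active');
""")

def friends_of(uid):
    """Return (friendid, name) pairs for one user, ordered by friendid."""
    sql = """
        SELECT f.friendid, u.name
        FROM friends f
        JOIN users u ON u.uid = f.friendid
        WHERE f.uid = ?
        ORDER BY f.friendid
    """
    return conn.execute(sql, (uid,)).fetchall()

print(friends_of(8))  # [(1, 'ann'), (6, 'cid')]
```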
Also, since adding friends is likely very uncommon compared to the number of lookup queries you do, be sure to choose a database engine that provides the fastest lookup speed. In this case MyISAM is probably your best bet. MyISAM uses table-level locking for inserts (i.e. slower inserts), but the lookups are quick.
Good luck!
I think the best way is without doubt creating a table like you proposed. This will allow you to better manage the friends, run queries for friends on this table, and so on.
Ok, I have a database with a table for storing classified posts; each post belongs to a city. For the purpose of this example we will call this table posts. This table has columns:
id (INT, +AI),
cityid (TEXT),
postcat (TEXT),
user (TEXT),
datatime (DATETIME),
title (TEXT),
desc (TEXT),
location (TEXT)
an example of that data would be:
'12039',
'fayetteville-nc',
'user@gmail.com',
'December 28th, 2010 - 11:55 PM',
'post title',
'post description',
'spring lake'
id is auto-incremented, cityid is in text format (this is where I think I will be losing performance once the database is large)...
Originally I planned on having a different table for each city and now since a user has to have the option of searching and posting through multiple cities, I think I need them all in one table. Everything was perfect when I had one city per table, where I could:
SELECT *
FROM `posts`
WHERE MATCH (`title`, `desc`, `location`)
AGAINST ('searchtext' IN BOOLEAN MODE)
AND `postcat` LIKE 'searchcatagory'
But then I ran into problems when trying to search multiple cities at one time, or listing all of a user's posts for them to delete or edit.
So it looks like I have to have one table with all the posts, and also match another FULLTEXT field: cityid. I am guessing I need full-text because if a user chooses an entire state and my cityid is "fayetteville-nc", I would need to match cityid against "-nc". This is only an assumption, and I would love another way. This database could easily reach over a million rows within 6 months, and a full-text search against 4 columns is probably going to be slow.
My question is, is there a better way to do this more efficiently? The database has nothing in it now, except for some test posts made by me. So I can completely redesign the table structure if necessary. I am open to any and all suggestions, even if it is just a more efficient way to perform my query.
Yes, one table for all posts sounds sensible. It would also be normal design for the posts table to have a city_id, referring to the id in a city table. Each city would also have a state_id, referring to the id in a state table, and similarly each state would have a country_id referring to the id in a country table. So you could write:
SELECT $columns
FROM posts JOIN city ON city.id = posts.city_id
WHERE city.tag = 'fayetteville-nc'
Once you've brought the cities into a separate table, it might make more sense for you to do the city-to-city_id resolving up front. This fairly naturally happens if you have a city chosen from a dropdown, for instance. But if you're entering free text into a search field, you may want to do it differently.
You can also search for all posts in a given state (or set of states) as:
SELECT $columns
FROM posts
JOIN city ON city.id = posts.city_id
JOIN state ON state.id = city.state_id
WHERE state.tag = 'NC'
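Here is the state-level search as a small runnable sketch, so the "-nc" text match is replaced by integer joins. It uses SQLite through Python's sqlite3 module for convenience; the trimmed-down posts columns, sample rows, and posts_in_states helper are assumptions for illustration. The full-text MATCH on title, desc, and location would stay as an extra condition in the WHERE clause:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE state (id INTEGER PRIMARY KEY, tag TEXT UNIQUE NOT NULL);
    CREATE TABLE city  (id INTEGER PRIMARY KEY, tag TEXT UNIQUE NOT NULL,
                        state_id INTEGER NOT NULL REFERENCES state(id));
    CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT NOT NULL,
                        city_id INTEGER NOT NULL REFERENCES city(id));
    -- Index the foreign keys that the joins filter on.
    CREATE INDEX idx_posts_city ON posts(city_id);
    CREATE INDEX idx_city_state ON city(state_id);

    INSERT INTO state VALUES (1, 'NC'), (2, 'SC');
    INSERT INTO city  VALUES (1, 'fayetteville-nc', 1), (2, 'raleigh-nc', 1),
                             (3, 'columbia-sc', 2);
    INSERT INTO posts VALUES (1, 'bike', 1), (2, 'sofa', 2), (3, 'car', 3);
""")

def posts_in_states(state_tags):
    """Return post titles for every city in the given states."""
    tags = sorted(set(state_tags))
    if not tags:
        return []
    placeholders = ", ".join("?" for _ in tags)
    sql = f"""
        SELECT posts.title
        FROM posts
        JOIN city  ON city.id  = posts.city_id
        JOIN state ON state.id = city.state_id
        WHERE state.tag IN ({placeholders})
        ORDER BY posts.id
    """
    return [row[0] for row in conn.execute(sql, tags)]

print(posts_in_states(["NC"]))  # ['bike', 'sofa']
```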
If you're going to go more fancy or international, you may need a more flexible way of arranging locations into a hierarchy (e.g. you may want city districts, counties, multinational regions, intranational regions (Midwest, East Coast etc)) but stay easy for now :)