I have a pretty basic question on which is the preferred way of storing data in my database.
I have a table called "users" with each user getting a username and user_id. Now, I want to make a table called "comments" for users to comment on news.
Is it better to have a column in comments called "username" that stores the logged-in user's name, or a column called "user_id"? If I use user_id, my SQL statement would need another select statement: "(SELECT username FROM users WHERE users.id = comments.user_id) as username". It seems like performance would be better if I just stored the username.
I thought I read to avoid duplicate data in a database though.
Which is better?
Thanks
Typically, you use ID fields to link tables together. The reason (in your situation) is that you might allow a person to change their username, and you don't want to have to update every place where that name is stored.
Therefore, put the user_id in your comments table and pull the username out on a join, as you've shown.
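To make the rename argument concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for MySQL. The schema is an assumption based on the question (the comment_id and body columns are made up for illustration). It shows that a username change touches one row in users and every comment picks it up through the join.

```python
import sqlite3

# Hypothetical minimal schema; SQLite stands in for MySQL here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        user_id  INTEGER PRIMARY KEY,
        username TEXT NOT NULL UNIQUE
    );
    CREATE TABLE comments (
        comment_id INTEGER PRIMARY KEY,
        user_id    INTEGER NOT NULL REFERENCES users(user_id),
        body       TEXT NOT NULL
    );
    INSERT INTO users (user_id, username) VALUES (1, 'alice');
    INSERT INTO comments (user_id, body) VALUES (1, 'First!'), (1, 'Nice article');
""")


def comments_with_usernames(conn):
    """Return (username, body) pairs by joining comments to users."""
    return conn.execute(
        """
        SELECT u.username, c.body
        FROM comments AS c
        INNER JOIN users AS u ON u.user_id = c.user_id
        ORDER BY c.comment_id
        """
    ).fetchall()


before = comments_with_usernames(conn)

# Renaming the user updates exactly one row; no comment rows are touched.
conn.execute("UPDATE users SET username = 'alice_2' WHERE user_id = 1")
after = comments_with_usernames(conn)
```

Had the comments table stored the username itself, the rename would have needed a second UPDATE against comments, and any place you forgot would silently keep the old name.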
If user_id is the primary key, then you should use user_id instead of username. If you want to use username instead of user_id, why do you have a user_id in the first place?
If the database could grow large, store the user_id in the comments table; it carries less overhead. Usernames can also be changed more easily this way.
Data should be stored in (at least) third normalized form, so you should use the user_id as the primary key in the users table, and as a foreign key in the comments table and use this to get the details:
SELECT comments.*, users.username
FROM comments, users
WHERE users.user_id = comments.user_id;
If you are getting the comments for a particular article, you could do it like this:
SELECT comments.*, users.username
FROM comments, users
WHERE users.user_id = comments.user_id
AND comments.article_id = '$current_article_id';
Storing the user id (an integer) will mean faster JOINs later. Unless you plan on having people dig through the database by hand, there's really no reason to use the username.
I'm pretty sure storing the user id in the comments table is sufficient. If you're returning rows from the comments table, just use the JOIN statement.
Cheers
Which is going to be a unique identifier? The user_id, I'd bet, or you can't have two "John Smith"s in your system.
And if volume is much of a concern, text matching the username field is going to be more expensive than linking to the users table in your query in the long term.
Numeric values are cheaper to join and index than an alphanumeric id. Use a number to uniquely identify a row. Another benefit is that the PK doesn't need to change if the user needs to change their username. The last benefit is that this is the design used by most modern web frameworks, such as Django and Rails.
Let's say I have a table posts that contains the user_id of the user who posted and the post_id of the post. I have another table, comments, that stores only the post each comment belongs to, in child_of_post.
The problem is here: I need to select only from comments but at the same time get the user_id of the post that the comment belongs to.
So should I use a join like:
SELECT user_id FROM comments INNER JOIN posts ON child_of_post = post_id
Reading this confused me even more. I don't really know how to explain it, but in general, if I need to use the same value, like an id, should I save that value in every table that needs it? Or should I save it in only one table and use joins to retrieve it?
Is using a join better than adding one more column to a table ?
Is using a join better than adding one more column to a table ?
In general : Yes.
Your database design looks good. As a general principle, avoid duplicating data across tables. This is inefficient in terms of storage, and also can quickly turn into a maintenance nightmare when it comes to modifying data, which ultimately threatens the integrity of your data.
Instead of duplicating data, the usual approach is to store a reference to the table row where the original data is stored ; this is called a foreign key, and it offers various functionalities that help maintain data integrity (prevent inserts of orphan records in the child table, delete child records when the parent is deleted, ...).
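As a rough illustration of those two integrity features, the sketch below uses Python's sqlite3 module (an assumption; the question does not name a database). The table and column names follow the question, the body column is made up, and ON DELETE CASCADE is one possible choice of delete behavior, not the only one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is enabled per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE posts (
        post_id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL
    );
    CREATE TABLE comments (
        comment_id    INTEGER PRIMARY KEY,
        child_of_post INTEGER NOT NULL
            REFERENCES posts(post_id) ON DELETE CASCADE,
        body          TEXT NOT NULL
    );
    INSERT INTO posts (post_id, user_id) VALUES (10, 1);
    INSERT INTO comments (child_of_post, body) VALUES (10, 'hello');
""")

# 1) An orphan comment (pointing at a post that does not exist) is rejected.
orphan_rejected = False
try:
    conn.execute(
        "INSERT INTO comments (child_of_post, body) VALUES (?, ?)",
        (999, "orphan"),
    )
except sqlite3.IntegrityError:
    orphan_rejected = True

# 2) Deleting the parent post removes its comments through the cascade.
conn.execute("DELETE FROM posts WHERE post_id = ?", (10,))
remaining_comments = conn.execute("SELECT COUNT(*) FROM comments").fetchone()[0]
```

In MySQL the same constraints need a storage engine that enforces foreign keys, such as InnoDB.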
In your use case, you indeed would need to JOIN to recover the user that created the original post, like :
SELECT p.user_id, c.*
FROM comments c
INNER JOIN posts p ON c.child_of_post = p.post_id
Assuming that post_id is the primary key of table posts, such a JOIN, with an equality condition referencing the primary key of another table, is very efficient, especially if you create an index on the referencing column comments.child_of_post.
PS : it is good practice to give aliases to table names and use them to qualify the columns in the query ; it avoids subtle bugs caused by column name clashes (when both tables have columns with the same name), and makes the query easier to read.
I have three tables, namely profile, academic and payment, and all three have the same two columns: username and status.
My problem is how to select the usernames that have status=1 in all of the tables.
Typically it works like this:
SELECT * FROM profile
INNER JOIN academic ON profile.username=academic.username
INNER JOIN payment ON profile.username=payment.username
WHERE profile.status=1 AND academic.status=1 AND payment.status=1
As a note, having username as a key is usually a bad thing, often a very bad one: if someone is able to change their name, you need to update N other tables. You may forget to update one or more of those tables, and then someone who later registers with the former name "inherits" that data.
It's also typically very inefficient to use a string INDEX key when a user_id integer value would suffice.
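The "inherits" scenario is easy to reproduce. Below is a small sketch with Python's sqlite3 module; the amount column and the sample names are invented for illustration, and only two of the three tables are shown to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE profile (username TEXT PRIMARY KEY, status INTEGER NOT NULL);
    CREATE TABLE payment (username TEXT NOT NULL, amount INTEGER NOT NULL);
    INSERT INTO profile (username, status) VALUES ('bob', 1);
    INSERT INTO payment (username, amount) VALUES ('bob', 50);
""")

# The user renames himself, but only profile is updated; payment is forgotten.
conn.execute("UPDATE profile SET username = 'robert' WHERE username = 'bob'")

# Later, a different person registers the now-free name.
conn.execute("INSERT INTO profile (username, status) VALUES ('bob', 1)")

# The join on username hands the old payment row to the new 'bob'.
inherited = conn.execute(
    """
    SELECT p.username, pay.amount
    FROM profile AS p
    INNER JOIN payment AS pay ON pay.username = p.username
    """
).fetchall()
```

With an integer user_id as the key, the rename would not have touched the link between the two tables at all.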
I have 3 tables, each consisting of a column called username. On the registration part, I need to check that the requested username is new and unique.
I need that single SQL that will tell me if that user exists in any of these tables, before I proceed. I tried:
SELECT tbl1.username, tbl2.username, tbl3.username
FROM tbl1,tbl2,tbl3
WHERE tbl1.username = {$username}
OR tbl2.username = {$username}
OR tbl3.username ={$username}
Is that the way to go?
select 1
from (
select username as username from tbl1
union all
select username from tbl2
union all
select username from tbl3
) a
where username = 'someuser'
In the event you honestly just want to know if a user exists:
The quickest approach is an existence query:
select
NOT EXISTS (select username from a where username = {$username}) AND
NOT EXISTS (select username from b where username = {$username}) AND
NOT EXISTS (select username from c where username = {$username});
If your username column is marked as Unique in each table, this should be the most efficient query you will be able to make to perform this operation, and this will outperform a normalized username table in terms of memory usage and, well, virtually any other query that cares about username and another column, as there are no excessive joins. If you've ever been called on to speed up an organization's database, I can assure you that over-normalization is a nightmare. In regards to the advice you've received on normalization in this thread, be wary. It's great for limiting space, or limiting the number of places you have to update data, but you have to weigh that against the maintenance and speed overhead. Take the advice given to you on this page with a grain of salt.
Get used to running a query analyzer on your queries, if for no other reason than to get in the habit of learning the ramifications of choices when writing queries -- at least until you get your sea legs.
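For what it's worth, here is a sketch of the existence query run from application code, using Python's sqlite3 module in place of MySQL (an assumption). The table names a, b and c follow the query above. The main difference is that the username is passed as a bound parameter instead of being interpolated into the SQL string, which avoids quoting problems and SQL injection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (username TEXT UNIQUE);
    CREATE TABLE b (username TEXT UNIQUE);
    CREATE TABLE c (username TEXT UNIQUE);
    INSERT INTO b (username) VALUES ('taken');
""")


def username_is_free(conn, username):
    """Return True if the username appears in none of the three tables."""
    row = conn.execute(
        """
        SELECT
            NOT EXISTS (SELECT 1 FROM a WHERE username = :name) AND
            NOT EXISTS (SELECT 1 FROM b WHERE username = :name) AND
            NOT EXISTS (SELECT 1 FROM c WHERE username = :name)
        """,
        {"name": username},
    ).fetchone()
    return bool(row[0])


taken_is_free = username_is_free(conn, "taken")
fresh_is_free = username_is_free(conn, "fresh")
```

Keep in mind that this only answers the question at the moment it runs; another connection can still insert the same name right afterwards.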
In the event you want to insert a user later:
If you are doing this for the purpose of eventually adding the user to the database, here is a better approach, and it's worth it to learn it. Attempt to insert the value immediately. Check afterwards to see if it was successful. This way there is no room for some other database call to insert a record in between the time you've checked and the time you inserted into the database. For instance, in MySQL you might do this:
INSERT INTO {$table} (`username`, ... )
SELECT {$username} as `username`, ... FROM DUAL
WHERE
NOT EXISTS (select username from a where username = {$username}) AND
NOT EXISTS (select username from b where username = {$username}) AND
NOT EXISTS (select username from c where username = {$username});
All database APIs I've seen, as well as all SQL implementations, will provide you a way to discover how many rows were inserted. If it's 1, then the username didn't exist and the insertion was successful. In this case, I don't know your dialect, so I've chosen MySQL, which provides a DUAL table specifically for returning results that aren't bound to a table. Honestly, though, there are many ways to skin this cat, whether you put it in a transaction or a stored procedure, or strictly limit the processes and procedures that can access these tables.
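Here is a sketch of the same insert-then-check idea using Python's sqlite3 module, which is an assumption on my part (SQLite lets you write SELECT ... WHERE without a FROM clause, so no DUAL table is needed). The helper name try_register is made up; the row count comes from the cursor's rowcount attribute.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (username TEXT UNIQUE);
    CREATE TABLE b (username TEXT UNIQUE);
    CREATE TABLE c (username TEXT UNIQUE);
    INSERT INTO c (username) VALUES ('taken');
""")


def try_register(conn, username):
    """Insert into table a only if no table has the name; True on success."""
    cur = conn.execute(
        """
        INSERT INTO a (username)
        SELECT :name
        WHERE
            NOT EXISTS (SELECT 1 FROM a WHERE username = :name) AND
            NOT EXISTS (SELECT 1 FROM b WHERE username = :name) AND
            NOT EXISTS (SELECT 1 FROM c WHERE username = :name)
        """,
        {"name": username},
    )
    conn.commit()
    # rowcount is 1 if the row was inserted, 0 if the name already existed.
    return cur.rowcount == 1


first_attempt = try_register(conn, "newbie")
second_attempt = try_register(conn, "newbie")
blocked_attempt = try_register(conn, "taken")
```

The check and the insert happen in one statement, so the application never acts on a stale "name is free" answer. How well this holds up under concurrent writers still depends on the database's isolation level and on the unique constraints, so treat it as a sketch and not a guarantee.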
Update -- How to handle users who don't complete the sign up process
As @RedFilter points out, if registration is done in multiple steps (reserving a username, filling out details, perhaps answering an email confirmation), then you will want to at least add a column to flag this user (with a timestamp, not a boolean) so that you can periodically remove users after some time period. That said, I recommend creating a ToBePurged table and adding new users to that, along with a timestamp. When the confirmation comes through, you remove the user from this table. Periodically you check this table for all entries older than some delta off your current time and simply delete them from whichever table they were originally added to. My philosophy behind this is to define more clearly the responsibility of the table and to keep the number of records you are working with very lean. We certainly don't want to over-engineer our solutions, but if you get into the habit of good architectural practices, these designs will flow out as naturally as their less efficient counterparts.
No. Two processes could run your test at the same time and both would report no user and then both could insert the same user.
It sounds like you need a single table to hold ALL the users with a unique index to prevent duplicates. This master table could link to 'sub-tables' using a user ID, not user name.
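A minimal sketch of that layout, using Python's sqlite3 module as a stand-in for MySQL: one users table with a unique index on username, and a sub-table linked by user ID. The profile columns and the register helper are hypothetical. The database itself rejects the duplicate, so two processes racing each other cannot both succeed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        user_id  INTEGER PRIMARY KEY,
        username TEXT NOT NULL
    );
    CREATE UNIQUE INDEX idx_users_username ON users (username);

    -- Example sub-table: linked by user_id, never by username.
    CREATE TABLE profile (
        user_id INTEGER PRIMARY KEY REFERENCES users(user_id),
        status  INTEGER NOT NULL DEFAULT 0
    );
""")


def register(conn, username):
    """Return the new user_id, or None if the username is already taken."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            cur = conn.execute(
                "INSERT INTO users (username) VALUES (?)", (username,)
            )
            conn.execute(
                "INSERT INTO profile (user_id) VALUES (?)", (cur.lastrowid,)
            )
            return cur.lastrowid
    except sqlite3.IntegrityError:
        # The unique index refused the duplicate username.
        return None


first_id = register(conn, "carol")
duplicate_id = register(conn, "carol")
user_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```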
Given the collation stuff, you could do this instead, if you don't want to deal with the collation mismatch:
select sum(usercount) as usercount
from (
select count(*) as usercount from tbl1 where username = 'someuser'
union all
select count(*) as usercount from tbl2 where username = 'someuser'
union all
select count(*) as usercount from tbl3 where username = 'someuser'
) as usercounts
If you get 0, there isn't a user with that username; if you get something higher, there is.
Note: Depending on how you do the insert, you could in theory get more than one user with the same username due to race conditions (see other comments about normalisation and unique keys).
1- You need to normalize your tables
See: http://databases.about.com/od/specificproducts/a/normalization.htm
2- Don't use implicit SQL '89 joins.
Kick the habit and use explicit joins
SELECT a.field1, b.field2, c.field3
FROM a
INNER JOIN b ON (a.id = b.a_id) -- JOIN criteria go here
INNER JOIN c ON (b.id = c.b_id) -- and here, nice and explicit.
WHERE ... -- filter criteria go here.
With your current set up RedFilter's answer should work fine. I thought it would be worth noting that you shouldn't have redundant or dispersed data in your database to begin with though.
You should have one and only one place to store any specific data - so in your case, instead of having a username in 3 different tables, you should have one table with username and a primary key identifier for those usernames. Your other 3 tables should then foreign-key reference the username table. You'll be able to construct much simpler and more efficient queries with this layout. You're opening a can of worms by replicating data in various locations.
For storing friend relationships in social networks, is it better to have another table with columns relationship_id, user1_id, user2_id, time_created, pending, or should the confirmed friends' user_ids be serialized/imploded into a single long string and stored alongside the other user details like user_id, name, dateofbirth, address, with a limit of something like 5000 friends, similar to Facebook?
Are there any better methods? The first method will create a huge table! The second one has one column with really long string...
On the profile page of each user, all his friends need to be retrieved from the database to show something like 30 friends, similar to Facebook, so I think the first method of using a separate table will cause a huge number of database queries?
The most proper way to do this would be to have the table of Members (obviously), and a second table of Friend relationships.
You should never ever store foreign keys in a string like that. What's the point? You can't join on them, sort on them, group on them, or any other things that justify having a relational database in the first place.
If we assume that the Member table looks like this:
MemberID int Primary Key
Name varchar(100) Not null
--etc
Then your Friendship table should look like this:
Member1ID int Foreign Key -> Member.MemberID
Member2ID int Foreign Key -> Member.MemberID
Created datetime Not Null
--etc
Then, you can join the tables together to pull a list of friends
SELECT m.*
FROM Member m
INNER JOIN Friendship f ON f.Member2ID = m.MemberID
WHERE f.Member1ID = @MemberID
(This is specifically SQL Server syntax, but I think it's pretty close to MySQL. @MemberID is a parameter.)
This is always going to be faster than splitting a string and making 30 extra SQL queries to pull the relevant data.
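To show that the friend list really is one query, here is a runnable sketch with Python's sqlite3 module (an assumption; the answer above targets SQL Server). The composite primary key, the sample rows and the friends_of helper are made up for illustration, and friendships are stored in one direction only (Member1ID lists Member2ID as a friend), which is a simplification.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE Member (
        MemberID INTEGER PRIMARY KEY,
        Name     TEXT NOT NULL
    );
    CREATE TABLE Friendship (
        Member1ID INTEGER NOT NULL REFERENCES Member(MemberID),
        Member2ID INTEGER NOT NULL REFERENCES Member(MemberID),
        Created   TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
        PRIMARY KEY (Member1ID, Member2ID)
    );
    INSERT INTO Member (MemberID, Name)
        VALUES (1, 'Ann'), (2, 'Ben'), (3, 'Cy'), (4, 'Dee');
    INSERT INTO Friendship (Member1ID, Member2ID)
        VALUES (1, 2), (1, 3), (4, 1);
""")


def friends_of(conn, member_id, limit=30):
    """Return up to `limit` friend names for one member in a single query."""
    rows = conn.execute(
        """
        SELECT m.Name
        FROM Friendship AS f
        INNER JOIN Member AS m ON m.MemberID = f.Member2ID
        WHERE f.Member1ID = ?
        ORDER BY m.Name
        LIMIT ?
        """,
        (member_id, limit),
    )
    return [name for (name,) in rows]


ann_friends = friends_of(conn, 1)
dee_friends = friends_of(conn, 4)
```

The composite primary key starts with Member1ID, so the lookup by member can use that index without scanning the whole Friendship table.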
Separate table as in method 1.
Method 2 is bad because you would have to unserialize it each time and won't be able to do JOINs on it; plus UPDATEs will be a nightmare if a user changes his name, email or other properties.
Sure, the table will be huge, but you can index it on Member1_id, set the foreign key back to your user table, have static row sizes, and maybe even limit the number of friends a single user can have. I think it won't be an issue with MySQL if you do it right, even if you hit a few million rows in your relationship table.
There are two tables. One is users info "users", one is comments info "comments".
I need to create new field "comments" in users table, that contains number of comments of that user. Table "comments" has "user" field with user's id of that comment.
What is optimal way to count number of comments by every user so far?
With PHP you would write a script that selects every user, then counts the number of his comments and updates the "comments" field. It is not hard for me, but boring.
Is it possible to do it without php, only in MySQL?
UPDATE users SET CommentCount = (SELECT COUNT(*) FROM comments WHERE AuthorUserId = users.id)
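The backfill can be tried out end to end with Python's sqlite3 module standing in for MySQL (an assumption). The column names CommentCount and AuthorUserId follow this answer; the name and body columns and the sample rows are made up. Users without comments end up with 0, because COUNT(*) over an empty set is 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id           INTEGER PRIMARY KEY,
        name         TEXT NOT NULL,
        CommentCount INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE comments (
        id           INTEGER PRIMARY KEY,
        AuthorUserId INTEGER NOT NULL REFERENCES users(id),
        body         TEXT NOT NULL
    );
    INSERT INTO users (id, name) VALUES (1, 'ann'), (2, 'ben'), (3, 'cy');
    INSERT INTO comments (AuthorUserId, body)
        VALUES (1, 'a'), (1, 'b'), (2, 'c');
""")

# One statement recomputes the counter for every user via a correlated subquery.
conn.execute(
    """
    UPDATE users
    SET CommentCount = (
        SELECT COUNT(*) FROM comments
        WHERE comments.AuthorUserId = users.id
    )
    """
)

counts = dict(conn.execute("SELECT name, CommentCount FROM users"))
```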
Why do you want to store it there anyway?
Why not just compute it with a combined query?
select users.name, count(comments.id) as comment_count
from users
join comments on users.id=comments.user_id
group by users.id
If you want to do it your way then include
update users set comment=comment+1 where id=$user_id
into the script where you store the comment.
And
update users set comment=comment-1 where id=$user_id
into the place where a user can delete his comment. Otherwise your data might be out of sync when a user adds new comments and you haven't run the script yet.
Yes, it is possible.
This is called table joining.
You don't add another field to the users table, but to the resulting table.
SELECT users.*, count(comments.id) as num_comments
FROM users,comments
WHERE comments.cid=users.id
GROUP BY users.id
Such a query is what relational databases were invented for. Do not revert it to the plain text file state. There are many reasons to do it that way.
http://en.wikipedia.org/wiki/Database_normalization <-- good text to read
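One detail worth checking with the join-and-count queries in this thread: an inner join drops users who have no comments at all. The sketch below uses Python's sqlite3 module as a stand-in for MySQL, and it assumes the comments table links to users through a user_id column. It compares the inner join with a LEFT JOIN, which keeps those users with a count of 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE comments (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id)
    );
    INSERT INTO users (id, name) VALUES (1, 'ann'), (2, 'ben');
    INSERT INTO comments (user_id) VALUES (1), (1);
""")

# Inner join: 'ben' has no comments, so he does not appear at all.
inner_counts = conn.execute(
    """
    SELECT u.name, COUNT(c.id) AS num_comments
    FROM users AS u
    INNER JOIN comments AS c ON c.user_id = u.id
    GROUP BY u.id
    ORDER BY u.id
    """
).fetchall()

# Left join: every user is kept; COUNT(c.id) skips NULLs and yields 0.
left_counts = conn.execute(
    """
    SELECT u.name, COUNT(c.id) AS num_comments
    FROM users AS u
    LEFT JOIN comments AS c ON c.user_id = u.id
    GROUP BY u.id
    ORDER BY u.id
    """
).fetchall()
```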