Database Structure - two tables or one table?


I have one database table dealing with users login totals and another table dealing with individual login sessions. Should I keep these tables separate or should I go ahead and merge them?
users_logins
    users_id
    successful_logins (total)
    last_online
users_logins_sessions
    users_id
    session_id
    ip_address
    user_agent
    last_activity (time-stamp)

You could drop users_logins, assuming last_online and last_activity hold the same value.
You would, however, have to query the users_logins_sessions table to get the total of successful logins for a given user.
SELECT COUNT(users_id) FROM users_logins_sessions WHERE users_id = ?

This really depends on you. I am assuming that sessions are cleared at some point: in my applications sessions typically expire and a new one is created. You don't say how you manage that in users_logins_sessions, so it could work either way.
Merge the tables if your session table never deletes entries; leave them as they are if sessions expire or are deleted at intervals.
I am also assuming that users_id is used somewhere else if you keep them separate.

If you only have the users_logins_sessions table, you can easily query for successful_logins and last_online.
SELECT COUNT(1) AS successful_logins
FROM users_logins_sessions
WHERE users_id = <user_id>;
SELECT MAX(last_activity) AS last_online
FROM users_logins_sessions
WHERE users_id = <user_id>;
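The two lookups above can also be folded into a single query. Here is a minimal sketch, using Python's built-in sqlite3 as a stand-in for MySQL; the sample rows and the helper name login_summary are made up for illustration, and it only gives the right total if session rows are never deleted.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL schema from the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE users_logins_sessions (
        users_id      INTEGER NOT NULL,
        session_id    TEXT PRIMARY KEY,
        ip_address    TEXT,
        user_agent    TEXT,
        last_activity TEXT  -- ISO-8601 text, so MAX() sorts chronologically
    )
    """
)
# Hypothetical sample data.
conn.executemany(
    "INSERT INTO users_logins_sessions VALUES (?, ?, ?, ?, ?)",
    [
        (1, "s1", "10.0.0.1", "agent-a", "2024-01-01 09:00:00"),
        (1, "s2", "10.0.0.2", "agent-b", "2024-01-03 18:30:00"),
        (2, "s3", "10.0.0.3", "agent-a", "2024-01-02 12:00:00"),
    ],
)


def login_summary(conn, users_id):
    """Return (successful_logins, last_online) for one user in a single query."""
    return conn.execute(
        """
        SELECT COUNT(*)           AS successful_logins,
               MAX(last_activity) AS last_online
        FROM users_logins_sessions
        WHERE users_id = ?
        """,
        (users_id,),
    ).fetchone()


print(login_summary(conn, 1))
```

A user with no sessions comes back as a count of 0 and a NULL last_online, so the caller does not need a separate existence check.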

Related

Liked Posts Design Specifics

So I've found through researching myself that the best way I can design a structure for liking posts is by having a database like the following. Let's say like Reddit, a post can be upvoted, downvoted, or not voted on at all.
The database would then have three columns, [username,post,liked].
Liked could be some kind of boolean, 1 indicating liked, and 0 indicating disliked.
Then to find a post like amount, I would do SELECT COUNT(*) FROM likes WHERE post=12341 AND liked=1 for example, then do the same for liked=0(disliked), and do the addition server side along with controversy percentage.
So I have a few concerns. First, what would be the appropriate way to find out if a user liked a post? Would I try to select the liked boolean value and either retrieve it or catch an error? Or would I first check if the record exists, and then do another select to find out the value? What if I want to check if a user liked multiple posts at once?
Secondly, would this table not need a primary key? Because no row will have the same post and username, should I use a compound primary key?

For performance you will want to alter your database plans:
User Likes Post table
Fields:
Liked should be a boolean, you are right. You can transform this to -1/+1 in your code. You will cache the numeric totals elsewhere.
Username should be UserID. You want only numeric values in this table for speed.
Post should be PostID for the same reason.
You also want a numeric primary key because they're easier to search against, and to perform sub-selects with.
And create a unique index on (UserID, PostID), because this table is mainly an index built for speed.
So did a user vote on a post?
select id
from user_likes_post
where userID = 123 and postID = 456;
Did the user like the post?
select id
from user_likes_post
where userID = 123 and postID = 456 and liked = true;
You don't need to worry about errors, you'll either get results or you won't, so you might as well go straight to the value you're after:
select liked from user_likes_post where userID=123 and postID=456
Get all the posts they liked:
select postID
from user_likes_post
where userID = 123 and liked = true;
Post Score table
PostID
TotalLikes
TotalDislikes
Score
This second table will be dumped and refreshed every n minutes by calculating on the first table. This second table is your cached aggregate score that you'll actually load for all users visiting that post. Adjust the frequency of this repeat dump-and-repopulate schedule however you see fit. For a small hobby or student project, just do it every 30 seconds or 2 minutes; bigger sites, every 10 or 15 minutes. For an even bigger site like reddit, you'd want to make the schema more complex to allow busier parts of the site to have faster refresh.
// this is not exact code, just an outline
totalLikes =
select count(*)
from user_likes_post
where postID=123 and liked=true
totalDislikes =
select count(*)
from user_likes_post
where postID=123 and liked=false
totalVotes = totalLikes + totalDislikes
// guard against dividing by zero for a post nobody has voted on yet
score = totalVotes > 0 ? totalLikes / totalVotes : 0;
(You can simulate an update by involving the user's localStorage -- client-side Javascript showing a bump-up or down on the posts that user has voted on.)
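The dump-and-repopulate step for the Post Score table could look roughly like this. It is only a sketch: SQLite (via Python's sqlite3) stands in for MySQL, the table name post_score and the function refresh_post_scores are my own names, and the sample votes are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE user_likes_post (
        id     INTEGER PRIMARY KEY,
        userID INTEGER NOT NULL,
        postID INTEGER NOT NULL,
        liked  INTEGER NOT NULL CHECK (liked IN (0, 1)),
        UNIQUE (userID, postID)
    );
    -- Hypothetical name for the cached "Post Score" table.
    CREATE TABLE post_score (
        postID        INTEGER PRIMARY KEY,
        totalLikes    INTEGER NOT NULL,
        totalDislikes INTEGER NOT NULL,
        score         REAL NOT NULL
    );
    """
)
conn.executemany(
    "INSERT INTO user_likes_post (userID, postID, liked) VALUES (?, ?, ?)",
    [(1, 123, 1), (2, 123, 1), (3, 123, 0), (4, 123, 1), (1, 456, 0)],
)


def refresh_post_scores(conn):
    """Dump and repopulate post_score from user_likes_post in one transaction."""
    with conn:  # commits on success, rolls back if anything raises
        conn.execute("DELETE FROM post_score")
        conn.execute(
            """
            INSERT INTO post_score (postID, totalLikes, totalDislikes, score)
            SELECT postID,
                   SUM(liked),
                   SUM(1 - liked),
                   1.0 * SUM(liked) / COUNT(*)  -- every group has at least one vote
            FROM user_likes_post
            GROUP BY postID
            """
        )


refresh_post_scores(conn)
rows = conn.execute("SELECT * FROM post_score ORDER BY postID").fetchall()
print(rows)
```

Because the aggregation is grouped by postID, posts with no votes simply get no row in the cache, so there is no division by zero to guard against here. A scheduler (cron or similar) would call the refresh at whatever interval suits the site.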

Given your suggested 3-column table and the selects you suggest, be sure to have
PRIMARY KEY(username, post) -- helps with "did user like a post"
INDEX(post, liked) -- for that COUNT
When checking whether a user liked a post, either do a LEFT JOIN, so that you get one of three things (1 = liked, 0 = disliked, NULL = not voted), or use EXISTS( SELECT .. ).
Tables need PKs.
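For the "multiple posts at once" part of the question, the LEFT JOIN idea extends naturally to a list of posts. The following is a sketch only, run against SQLite through Python's sqlite3 rather than MySQL; the posts table, the index name and the helper votes_for are assumptions of mine, not something from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE posts (post INTEGER PRIMARY KEY);  -- assumed to exist already
    CREATE TABLE likes (
        username TEXT    NOT NULL,
        post     INTEGER NOT NULL,
        liked    INTEGER NOT NULL,                  -- 1 = liked, 0 = disliked
        PRIMARY KEY (username, post)
    );
    CREATE INDEX likes_post_liked ON likes (post, liked);
    INSERT INTO posts VALUES (1), (2), (3);
    INSERT INTO likes VALUES ('alice', 1, 1), ('alice', 2, 0), ('bob', 3, 1);
    """
)


def votes_for(conn, username, post_ids):
    """Map each requested post to 1 (liked), 0 (disliked) or None (no vote)."""
    placeholders = ", ".join("?" for _ in post_ids)
    rows = conn.execute(
        f"""
        SELECT p.post, l.liked
        FROM posts AS p
        LEFT JOIN likes AS l
               ON l.post = p.post AND l.username = ?
        WHERE p.post IN ({placeholders})
        ORDER BY p.post
        """,
        (username, *post_ids),
    ).fetchall()
    return dict(rows)


alice_votes = votes_for(conn, "alice", [1, 2, 3])
print(alice_votes)
```

The username filter sits in the ON clause rather than the WHERE clause; moving it to WHERE would discard the posts the user has not voted on and lose the NULL case.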

I agree with Rick James that the likes table should be uniquely indexed by the (username, post) pair.
I also advise you to allow a bit of redundancy and keep a like_counter in the posts table. It will significantly reduce the load on regular queries.
Increase or decrease the counter right after successfully adding the like/dislike record.
All in all:
To get posts with likes: a plain select of posts, with no need for joins or aggregate sub-queries.
To like/dislike: (1) insert into likes, and on success (2) update posts.like_counter. The unique index prevents duplicates.
To find out whether a user has already liked a post: select from likes by the username + post pair. The index makes this fast.
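The insert-then-bump sequence is safest when both statements share one transaction. A rough sketch, with SQLite (Python's sqlite3) standing in for MySQL and covering only the "like" case; the function name like_post and the column layout are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE posts (
        post         INTEGER PRIMARY KEY,
        like_counter INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE likes (
        username TEXT    NOT NULL,
        post     INTEGER NOT NULL REFERENCES posts (post),
        PRIMARY KEY (username, post)   -- the unique index that blocks repeats
    );
    INSERT INTO posts (post) VALUES (1);
    """
)


def like_post(conn, username, post):
    """Record a like and bump the counter atomically; return False on a repeat."""
    try:
        with conn:  # both statements commit together or not at all
            conn.execute(
                "INSERT INTO likes (username, post) VALUES (?, ?)",
                (username, post),
            )
            conn.execute(
                "UPDATE posts SET like_counter = like_counter + 1 WHERE post = ?",
                (post,),
            )
        return True
    except sqlite3.IntegrityError:
        # The (username, post) key already exists, so nothing was changed.
        return False


print(like_post(conn, "alice", 1), like_post(conn, "alice", 1))
```

If the insert hits the unique key, the update never runs, so the counter cannot drift ahead of the rows in likes.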

My initial thought was that a boolean is not rich enough to express the possible reactions to a post, so instead of a boolean you would need an enum with the states Liked, Disliked, and a third, default state of Un-reacted.
On reflection, however, a boolean is enough, because you do not need to record the Un-reacted state: a lack of reaction simply means there is no entry in the table.
What would be the appropriate way to find out if a user liked a post?
SELECT Liked
FROM Likes
WHERE Likes.PostId = 1234
AND Likes.UserName = 'UniqueUserName';
If the post was not interacted with by the user, there would be no results. Otherwise, 1 if liked and 0 if disliked.
What if I want to check if a user liked multiple posts at once?
I think for that you need to store a timestamp too. You can then use that timestamp to see if there are multiple liked posts within a short duration.
You could employ k-means clustering to figure if there are any "cluster" of likes. The complete explanation is too big to add here.
Would this table not need a primary key?
Of course it would. But Like is a weak entity depending upon the Post. So it would require the PK of Post, which is the field post (I assume). Combined with username we would have the PK because (post, username) would be unique for user's reaction.

SQL LEFT JOIN on two possible columns

We are adding a table to our database schema. It has a relationship to an already existing table, which we added a foreign key for. Mind you, I didn't create this schema nor do I have permission to change much. The application has been running for a while and they are hesitant to change much.
USER_ACTIVITY_T (preexisting table - only relevant columns shown)
activity_id (pk)
username
machineid (fk - recently added)
MACHINE_T (new table)
machineid (pk - auto increment)
machinename (unique)
From the point where I added the machine table, it collects machine data; allowing users to see what machines were involved during the activity. This is useful but it only shows data from the point that it was implemented. A lead asked me to attempt to fill preexisting records by referring to the username associated with the machine. We understand that this is not 100% accurate but... yeah. Our idea was to add username to MACHINE_T and use as a way to populate the machinename in reports retroactively (which assumes that the user has only used one machine and never changed their username).
So, the new MACHINE_T table would look like:
MACHINE_T (new table)
machineid (pk - auto increment)
machinename (unique)
username
Right now, our current SQL is:
SELECT * FROM `USER_ACTIVITY_T` LEFT JOIN `MACHINE_T`
ON MACHINE_T.machineid=USER_ACTIVITY_T.machineid
Anyone have any suggestions on how to join on the username if USER_ACTIVITY_T.machineid is null but has a matching username? I'm sorry. This is an odd request that I may spend far too much time over-analyzing. Thank you for any help. I'm almost tempted to just say it can be reasonably done.

You want to take the join on machineid when that column is not null, and the join on username when it is null.
You don't want repeated rows, however, so a UNION of the two joins as they stand would cause problems: the LEFT JOIN already returns the rows whose machineid is null.
Select only the not-null entries in the first query and only the null entries in the second, then union them.
So:
SELECT *
FROM `USER_ACTIVITY_T`
LEFT JOIN `MACHINE_T`
ON MACHINE_T.machineid = USER_ACTIVITY_T.machineid
WHERE USER_ACTIVITY_T.machineid IS NOT NULL
UNION ALL
SELECT *
FROM `USER_ACTIVITY_T`
JOIN `MACHINE_T`
ON MACHINE_T.username = USER_ACTIVITY_T.username
WHERE USER_ACTIVITY_T.machineid IS NULL
This way you are basically using one query for the not-null entries and one for the null entries, and UNIONing them.
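An alternative that avoids the UNION is a single LEFT JOIN whose ON clause falls back to the username only when machineid is null. This is a sketch under the question's own assumption (one machine per user), run on SQLite via Python's sqlite3 instead of MySQL, with invented sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE MACHINE_T (
        machineid   INTEGER PRIMARY KEY AUTOINCREMENT,
        machinename TEXT UNIQUE,
        username    TEXT
    );
    CREATE TABLE USER_ACTIVITY_T (
        activity_id INTEGER PRIMARY KEY,
        username    TEXT,
        machineid   INTEGER REFERENCES MACHINE_T (machineid)
    );
    INSERT INTO MACHINE_T (machineid, machinename, username) VALUES
        (1, 'pc-alice', 'alice'),
        (2, 'pc-bob',   'bob');
    INSERT INTO USER_ACTIVITY_T (activity_id, username, machineid) VALUES
        (10, 'alice', NULL),  -- old row: falls back to the username match
        (11, 'alice', 2),     -- new row: the foreign key wins
        (12, 'carol', NULL);  -- old row with no known machine
    """
)

rows = conn.execute(
    """
    SELECT a.activity_id, a.username, m.machinename
    FROM USER_ACTIVITY_T AS a
    LEFT JOIN MACHINE_T AS m
           ON m.machineid = a.machineid
           OR (a.machineid IS NULL AND m.username = a.username)
    ORDER BY a.activity_id
    """
).fetchall()
print(rows)
```

Unlike the UNION version, this keeps activities that match neither way (machinename comes back NULL). If a username is ever attached to more than one machine, the fallback rows will repeat once per machine, and the OR in the join condition may be harder for the optimizer to index than two separate queries.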

And, I just discovered the UNION operator which will help me solve this. However, I am open to other solutions.

What is the proper way to store friendship associations in a mysql DB

I want to create a table where my users can associate a friendship between one another. At the same time, this table will work in conjunction with what I would like to be a one-to-many relation between various other tables I am attempting to work up.
Right now I am thinking of something like this
member_id, friend_id, active, date
member_id would be the column of the user making the call, friend_id would be the column of the friend they are attempting to tie to, active would be a toggle of sorts 0 = pending, 1 = active, date would just be a logged date of the last activity on that particular row.
Now my confusion is this: when I query, I would typically query for member_id and then base the rest of the query off the associated friend_ids to display data to the right people. With that logic in mind, I think I would need 2 rows per request: one where the requester is the member_id and the requested friend is the friend_id, and one that is the opposite, so I could query the same way every time. In essence it is double dipping: for every action requested against this table I need to make 2 like actions to make it work.
Which in all does not make sense to me as far as optimization goes. So in all my question is what is the proper way to handle data for relations like this? Or am I actually thinking sanely about this being an approach to handling it?

If a friendship is always mutual, then you can choose between data redundancy (i.e. both directions having a row) for the sake of simpler queries, or learn to live with slightly more complex queries. I'd personally avoid data redundancy unless there is a compelling reason otherwise - you're not just wasting space and performance, but you'll need to be careful when enforcing it - a simple CHECK is incapable of referencing other rows and depending on your DBMS a trigger may be limited in what it can do with a mutating table.
An easy way to ensure only one row per friendship is to always insert the lower value in member_id and the higher value in friend_id (add a constraint CHECK (member_id < friend_id) to enforce it). Then, when you query, you'll have to search in both directions - for example, finding all friends of a given person (identified by person_id) would look something like this:
SELECT *
FROM
person
WHERE
id <> :person_id
AND (
id IN (
SELECT friend_id
FROM friendship
WHERE member_id = :person_id
)
OR
id IN (
SELECT member_id
FROM friendship
WHERE friend_id = :person_id
)
)
BTW, in this scheme, you'd probably want to rename member_id and friend_id to, say, friend1_id and friend2_id...
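To make the one-row-per-friendship scheme concrete, here is a small sketch using Python's sqlite3 in place of MySQL. The helpers add_friendship and friends_of are hypothetical names; the ordering of the pair happens in application code and the CHECK constraint backs it up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE friendship (
        member_id INTEGER NOT NULL,
        friend_id INTEGER NOT NULL,
        PRIMARY KEY (member_id, friend_id),
        CHECK (member_id < friend_id)  -- one canonical row per pair
    )
    """
)


def add_friendship(conn, a, b):
    """Store the pair with the lower id first, whichever way round it arrives."""
    low, high = sorted((a, b))
    with conn:
        conn.execute(
            "INSERT INTO friendship (member_id, friend_id) VALUES (?, ?)",
            (low, high),
        )


def friends_of(conn, person_id):
    """Look in both columns, since the person may sit on either side."""
    rows = conn.execute(
        """
        SELECT friend_id FROM friendship WHERE member_id = ?
        UNION
        SELECT member_id FROM friendship WHERE friend_id = ?
        ORDER BY 1
        """,
        (person_id, person_id),
    ).fetchall()
    return [row[0] for row in rows]


add_friendship(conn, 3, 1)  # stored as (1, 3)
add_friendship(conn, 1, 2)
add_friendship(conn, 2, 3)
print(friends_of(conn, 1), friends_of(conn, 3))
```

Adding the same pair again in either order fails on the primary key, which is what prevents the duplicate-row problem the question worries about.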

Two ways to look at it:
WHERE ((friend_id = x AND member_id = y) OR (friend_id = y AND member_id = x))
would allow you to query by simply stating one side of the relationship. If both sides are added, this method would still work without causing duplicate rows to be returned.
Conversely, adding both sides of the relationship, so that your queries consist of
WHERE friend_id = x AND member_id = y
not only makes queries easier to write, but also easier to plan (meaning better DB performance).
My vote is for the latter option.

Beautiful - there's no problem with your table as-is.
ALSO:
I'm not sure if this cardinality is "one to many", or "many to many":
http://en.wikipedia.org/wiki/Cardinality_%28data_modeling%29
Q: If I were to query, I would typically query for member_id then base the rest of the query off of associated friend_id's to display data accordingly to the right people
A: Frankly, I don't see any problem querying "member to friend", or "friend to member" (or any other combinations - e.g. friends who share friends). Again, it looks good.

Introduce a helper table like:
users
user_id, name, ...
friendship
user_id, friend_id, ....
select u.name as user, u2.name as friend from users u
inner join friendship f on f.user_id = u.user_id
inner join users u2 on u2.user_id = f.friend_id
I think this is pretty similar to what you have, just putting a query as an example.

How to check if a given data exists in multiple tables (all of which has the same column)?

I have 3 tables, each consisting of a column called username. On the registration part, I need to check that the requested username is new and unique.
I need that single SQL that will tell me if that user exists in any of these tables, before I proceed. I tried:
SELECT tbl1.username, tbl2.username, tbl3.username
FROM tbl1,tbl2,tbl3
WHERE tbl1.username = {$username}
OR tbl2.username = {$username}
OR tbl3.username ={$username}
Is that the way to go?

select 1
from (
select username as username from tbl1
union all
select username from tbl2
union all
select username from tbl3
) a
where username = 'someuser'

In the event you honestly just want to know if a user exists:
The quickest approach is an existence query:
select
NOT EXISTS (select username from a where username = {$username}) AND
NOT EXISTS (select username from b where username = {$username}) AND
NOT EXISTS (select username from c where username = {$username});
If your username column is marked as Unique in each table, this should be the most efficient query you will be able to make to perform this operation, and this will outperform a normalized username table in terms of memory usage and, well, virtually any other query that cares about username and another column, as there are no excessive joins. If you've ever been called on to speed up an organization's database, I can assure you that over-normalization is a nightmare. In regards to the advice you've received on normalization in this thread, be wary. It's great for limiting space, or limiting the number of places you have to update data, but you have to weigh that against the maintenance and speed overhead. Take the advice given to you on this page with a grain of salt.
Get used to running a query analyzer on your queries, if for no other reason than to get in the habit of learning the ramifications of choices when writing queries -- at least until you get your sea legs.
In the event you want to insert a user later:
If you are doing this for the purpose of eventually adding the user to the database, here is a better approach, and it's worth it to learn it. Attempt to insert the value immediately. Check afterwards to see if it was successful. This way there is no room for some other database call to insert a record in between the time you've checked and the time you inserted into the database. For instance, in MySQL you might do this:
INSERT INTO {$table} (`username`, ... )
SELECT {$username} as `username`, ... FROM DUAL
WHERE
NOT EXISTS (select username from a where username = {$username}) AND
NOT EXISTS (select username from b where username = {$username}) AND
NOT EXISTS (select username from c where username = {$username});
All database APIs I've seen, as well as all SQL implementations, will provide you a way to discover how many rows were inserted. If it's 1, then the username didn't exist and the insertion was successful. In this case, I don't know your dialect, and so I've chosen MySQL, which provides a DUAL table specifically for returning results that aren't bound to a table, but honestly, there are many ways to skin this cat, whether you put it in a transaction or a stored procedure, or strictly limit the process and procedure that can access these tables.
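As a rough illustration of "insert first, then check the row count", here is the same idea run on SQLite through Python's sqlite3 (SQLite needs no DUAL table, so the SELECT has no FROM clause). The function name claim_username is made up, and bound parameters replace the {$username} interpolation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE a (username TEXT UNIQUE);
    CREATE TABLE b (username TEXT UNIQUE);
    CREATE TABLE c (username TEXT UNIQUE);
    INSERT INTO b VALUES ('taken');
    """
)


def claim_username(conn, username):
    """Insert into table a unless any of the three tables already has the name."""
    with conn:
        cur = conn.execute(
            """
            INSERT INTO a (username)
            SELECT :name
            WHERE NOT EXISTS (SELECT 1 FROM a WHERE username = :name)
              AND NOT EXISTS (SELECT 1 FROM b WHERE username = :name)
              AND NOT EXISTS (SELECT 1 FROM c WHERE username = :name)
            """,
            {"name": username},
        )
    # One row inserted means the name was free; zero means it was taken.
    return cur.rowcount == 1


print(claim_username(conn, "newuser"), claim_username(conn, "taken"))
```

Binding the value as a parameter also sidesteps the quoting and injection problems of pasting the username straight into the SQL string.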
Update -- How to handle users who don't complete the sign up process
As @RedFilter points out, if registration is done in multiple steps -- reserving a username, filling out details, perhaps answering an email confirmation -- then you will want to at least add a column to flag this user (with a timestamp, not a boolean) so that you can periodically remove users after some time period, though I recommend creating a ToBePurged table and adding new users to that, along with a timestamp. When the confirmation comes through, you remove the user from this table. Periodically you will check this table for all entries prior to some delta off your current time and simply delete them from whichever table they were originally added to. My philosophy behind this is to define more clearly the responsibility of the table and to keep the number of records you are working with very lean. We certainly don't want to over-engineer our solutions, but if you get into the habit of good architectural practices, these designs will flow out as naturally as their less efficient counterparts.

No. Two processes could run your test at the same time and both would report no user and then both could insert the same user.
It sounds like you need a single table to hold ALL the users with a unique index to prevent duplicates. This master table could link to 'sub-tables' using a user ID, not user name.
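With a single users table, the unique index itself can act as the check: attempt the insert and treat a constraint violation as "name already taken". A minimal sketch with Python's sqlite3 standing in for MySQL; the register helper and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE users (
        user_id  INTEGER PRIMARY KEY,
        username TEXT NOT NULL UNIQUE  -- the database enforces uniqueness
    )
    """
)


def register(conn, username):
    """Try to claim a username; return the new user_id, or None if it is taken."""
    try:
        with conn:
            cur = conn.execute(
                "INSERT INTO users (username) VALUES (?)", (username,)
            )
        return cur.lastrowid
    except sqlite3.IntegrityError:
        return None


first = register(conn, "someuser")
second = register(conn, "someuser")
print(first, second)
```

Because the check and the insert are one statement guarded by the index, two concurrent registrations cannot both succeed. The sub-tables would then reference user_id rather than repeating the username.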

Given the collation stuff, you could do this instead, if you don't want to deal with the collation mismatch:
select sum(usercount) as usercount
from (
select count(*) as usercount from tbl1 where username = 'someuser'
union all
select count(*) as usercount from tbl2 where username = 'someuser'
union all
select count(*) as usercount from tbl3 where username = 'someuser'
) as usercounts
If you get 0, there isn't a user with that username, if you get something higher, there is.
Note: Depending on how you do the insert, you could in theory get more than one user with the same username due to race conditions (see other comments about normalisation and unique keys).

1- You need to normalize your tables
See: http://databases.about.com/od/specificproducts/a/normalization.htm
2- Don't use implicit SQL '89 joins.
Kick the habit and use explicit joins
SELECT a.field1, b.field2, c.field3
FROM a
INNER JOIN b ON (a.id = b.a_id) -- JOIN criteria go here
INNER JOIN c ON (b.id = c.b_id) -- and here, nice and explicit.
WHERE ... -- filter criteria go here.

With your current set up RedFilter's answer should work fine. I thought it would be worth noting that you shouldn't have redundant or dispersed data in your database to begin with though.
You should have one and only one place to store any specific data - so in your case, instead of having a username in 3 different tables, you should have one table with username and a primary key identifier for those usernames. Your other 3 tables should then foreign-key reference the username table. You'll be able to construct much simpler and more efficient queries with this layout. You're opening a can of worms by replicating data in various locations.

Best method for storing data in mysql?

I have a pretty basic question on which is the preferred way of storing data in my database.
I have a table called "users" with each user getting a username and user_id. Now, I want to make a table called "comments" for users to comment on news.
Is it better to have a column in comments called "username" and storing the logged in user's name, or have a column called "user_id". If I use user_id I would have to make my sql statement have another select statement. "(SELECT username FROM users WHERE users.id = comments.user_id) as username". It seems like performance would be better just storing the username.
I thought I read to avoid duplicate data in a database though.
Which is better?
Thanks

Typically, you use ID fields to link tables together. The reason (in your situation) is that you might allow people to change their username, and you don't want to have to update every place it is stored.
Therefore, put the user_id in your comments table and pull the username out on a join, as you've shown.
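A small sketch of why this pays off, using Python's sqlite3 in place of MySQL: comments store only user_id, the username comes from a join, and a rename touches exactly one row. The comment_id and body columns and the helper comments_for_article are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (
        user_id  INTEGER PRIMARY KEY,
        username TEXT NOT NULL UNIQUE
    );
    CREATE TABLE comments (
        comment_id INTEGER PRIMARY KEY,
        user_id    INTEGER NOT NULL REFERENCES users (user_id),
        article_id INTEGER NOT NULL,
        body       TEXT NOT NULL
    );
    INSERT INTO users VALUES (1, 'old_name');
    INSERT INTO comments VALUES (1, 1, 7, 'first'), (2, 1, 7, 'second');
    """
)


def comments_for_article(conn, article_id):
    """Fetch each comment with its author's current username via a join."""
    return conn.execute(
        """
        SELECT c.body, u.username
        FROM comments AS c
        INNER JOIN users AS u ON u.user_id = c.user_id
        WHERE c.article_id = ?
        ORDER BY c.comment_id
        """,
        (article_id,),
    ).fetchall()


# Renaming the user updates one row; no comment rows are touched.
with conn:
    conn.execute("UPDATE users SET username = 'new_name' WHERE user_id = 1")

renamed = comments_for_article(conn, 7)
print(renamed)
```

Every existing comment picks up the new name on the next read, with no risk of stale copies.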

If the user_id is the primary key then you should use user_id instead of username, if you want to use username instead of user_id then why do you have a user_id in the first place?

If there's the potential of creating a large enough database, store the user_id in the comments table. Less overhead. Also consider that usernames may be modified more easily this way.

Data should be stored in (at least) third normal form, so you should use the user_id as the primary key in the users table, and as a foreign key in the comments table and use this to get the details:
SELECT comments.*, users.username
FROM comments, users
WHERE users.user_id = comments.user_id;
If you are getting the comments based on an article, you could do this like this:
SELECT comments.*, users.username
FROM comments, users
WHERE users.user_id = comments.user_id
AND comments.article_id = '$current_article_id';

Storing the userid (integer) will mean faster JOINs later. Unless you plan on having people dig through the database by hand, there's really no reason to use the username

I'm pretty sure storing the user id in the comments table is sufficient. If you're returning rows from the comments table, just use the JOIN statement.
Cheers

Which is going to be a unique identifier? The user_id, I'd bet, or you can't have two "John Smith"s in your system.
And if volume is much of a concern, text matching the username field is going to be more expensive than linking to the users table in your query in the long term.

Numeric values are cheaper to join and index than an alphanumeric id. Use a number to uniquely identify a row. Another benefit is that the PK doesn't need to change if users need to change their username. The last benefit is that this is the design used by most modern web frameworks, such as Django and Rails.
