Database design for a chat system - mysql

I know there are a lot of posts out there discussing DB design for a chat system, but they don't explain anything about the scalability of the design, so here is my question.
I want to design a DB for a real-time chat between 2 or more users. Let's take 2 users first; here is what I came up with.
Table 1:
name: User
fields: id, name
Table 2:
name: Chat Room
fields: id, user1, user2
Table 3:
name: Message
fields: Chat_room_id, user_id, message
Now, with Facebook in mind: it has around 2 billion monthly active users. Let's say 1 billion of them chat and each of those users sends 100 messages.
That makes 100 billion entries in the Message table, so the question is:
"Will MySQL or Postgres be able to handle this many entries and show a particular chat room's messages in real time?" If not, what is the best practice to follow? I know it also depends on the server the RDBMS is installed on, but I still want to know the optimum architecture.
PS: I am using Django as the backend and AngularJS for asynchronous behavior.

100 billion rows in a single table will never work online. At that scale, systems apply not only every available partitioning scheme to reduce table sizes but also strategies that separate active data from passive data. Setting those high-level matters aside, the answer is:
Postgres is indeed effective at working with big data.
and yet:
Postgres has no strategy effective enough to compensate for poor design.
Look at your example: the chat_room table lists two users in separate columns. What for? The messages table already has user_id referencing users.id, and it has chat_room.id as well, so you already know which users were in a given chat room. If your idea was to pre-aggregate which users participated in a chat room (over time or at all), make it a single array column, such as (chat_room.id int, users_id bigint[]); if you also want join and leave times, add corresponding attributes. Active/passive separation can be implemented by keeping archived chat rooms in a different relation from the active ones. Incidentally, the aggregation of who participated in a chat room can be performed at archiving time.
The above is not a set of instructions, just an illustration. There is no single best practice for a database schema. First make a clear plan of what your chat will do, then design the schema, try it, and improve it, repeating until everything works. If you are concerned about how it will behave with 100 billion rows, fill it up and check.
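To make the schema discussion concrete, here is a minimal sketch. It swaps the array column suggested above for a membership table (one row per participant), which is only one possible alternative. It uses SQLite through Python's sqlite3 module so that it runs standalone; the names chat_room_member, body, message_room_idx and latest_messages are assumptions for illustration, not part of the original design, and MySQL or Postgres DDL would differ in detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE chat_room (id INTEGER PRIMARY KEY);
-- Hypothetical membership table: one row per participant,
-- so a room can hold two or more users.
CREATE TABLE chat_room_member (
    chat_room_id INTEGER NOT NULL REFERENCES chat_room(id),
    user_id      INTEGER NOT NULL REFERENCES users(id),
    PRIMARY KEY (chat_room_id, user_id)
);
CREATE TABLE message (
    id           INTEGER PRIMARY KEY,
    chat_room_id INTEGER NOT NULL REFERENCES chat_room(id),
    user_id      INTEGER NOT NULL REFERENCES users(id),
    body         TEXT NOT NULL
);
-- Index so that "latest messages in room X" reads a narrow index range.
CREATE INDEX message_room_idx ON message (chat_room_id, id);
""")

conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)",
                 [(1, "ann"), (2, "bob")])
conn.execute("INSERT INTO chat_room (id) VALUES (10)")
conn.executemany("INSERT INTO chat_room_member VALUES (?, ?)",
                 [(10, 1), (10, 2)])
conn.executemany(
    "INSERT INTO message (chat_room_id, user_id, body) VALUES (?, ?, ?)",
    [(10, 1, "hi"), (10, 2, "hello"), (10, 1, "bye")],
)


def latest_messages(room_id, limit=2):
    """Return the newest messages of one room, newest first."""
    return conn.execute(
        "SELECT user_id, body FROM message "
        "WHERE chat_room_id = ? ORDER BY id DESC LIMIT ?",
        (room_id, limit),
    ).fetchall()


print(latest_messages(10))
```

Whether this holds up at very large row counts is exactly what the "fill it up and check" advice is for; the sketch only shows the shape of the tables and of the per-room query.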

Related

Best modeling tips in DynamoDB in User/Friend Tables

In mysql a User and Friend Table will somewhat look like these
User Table
id
name
phone
status [enabled/disabled]
===============================================
Friend Table
user_id
another_user_id
status [if friend or not]
===============================================
But in DynamoDB I am torn between these two approaches:
Approach 1.
User Table
id
name
phone
friends -> attributes
OR
Approach 2.
User Table
id
name
phone
===============================================
Friend Table
user_id
another_userid
===============================================
I'm currently using Approach 2. My question is: what is the best way to model the tables in terms of cost, latency, and performance?
PS: I emailed their support about this but have not received a reply yet, so I'm hoping someone here has already gone through these problems.
I hope I have stated the question clearly enough to be understood.
EDITED:
#chen
Q: Do you often query a user's friend list?
A: Yes, I will query the friend list of every user of my software when that user logs in.
Q: Do you need to know quickly how many friends a user has?
A: No, as long as I can get who the user's friends are, it's all good.
Q: How many friends do you think a user will have?
A: unlimited.
Q: How many users will the system have?
A: unlimited too.
Thanks for taking the time.
David, you are running into a typical NoSQL problem.
When designing a relational database, you model the data as it fits the world, and also try to break the data into tables.
In DynamoDB (and other NoSQL) the real model is derived from the questions needing answers.
Do you often query a user's friend list?
Do you need to know quickly how many friends a user has?
How many friends do you think a user will have?
How many users will the system have?
These questions will help you decide between approach #1 and #2.
If you comment with answers to these questions, I will be able to give you my thoughts on the model.
Regardless, if you really want to drop SQL, you might want to look at graph databases.
If you must use DynamoDB, then just keep the references in the same table (Approach 1). Have you already decided which DB to use? I ask for a couple of reasons:
Some other NoSQL DBs have a vibrant community and great documentation.
A graph database seems to suit your problem best, but you know your system's big picture better than I do.
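As a way to picture how "the model is derived from the questions", here is a rough sketch of Approach 2 laid out for the one query the asker named: fetch a user's friend list at login. It is plain Python standing in for a table with partition key user_id and sort key another_user_id; it is not the AWS SDK, and writing each friendship in both directions is an assumption made here for illustration. One fact worth keeping in mind when weighing Approach 1: a DynamoDB item is limited to 400 KB, so an unbounded friends attribute on the user item would eventually hit that limit.

```python
# Plain-Python stand-in for a table keyed by
# (partition key = user_id, sort key = another_user_id).
friend_table = {}  # user_id -> {another_user_id: item}


def add_friendship(user_id, another_user_id):
    """Store one item per direction so each user's list is one partition."""
    for a, b in ((user_id, another_user_id), (another_user_id, user_id)):
        friend_table.setdefault(a, {})[b] = {
            "user_id": a,
            "another_user_id": b,
        }


def friends_of(user_id):
    """Equivalent of a query on the partition key only."""
    return sorted(friend_table.get(user_id, {}))


add_friendship("u1", "u2")
add_friendship("u1", "u3")
print(friends_of("u1"))
```

The point of the layout is that the login query touches a single partition key, regardless of how many users the system has.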

Proper way to model user groups

So I have this application that I'm drawing up, and I've started to think about my users. My initial thought was to create a table for each group type. I've been thinking this over, though, and I'm not sure that this is the best way.
Example:
// Users
Users [id, name, email, age, etc]
// User Groups
Player [id, years playing, etc]
Ref [id, certified, etc]
Manufacturer Rep [id, years employed, etc]
So everyone would be making an account, but each user would have a different group. They can also be in multiple different groups. Each group has its own list of different columns. So what is the best way to do this? Let's say I have 5 groups. Do I need 8 tables + a relational table connecting each one to the user table?
I just want to be sure that this is the best way to organize it before I build it.
Edit:
A player would have columns regarding the gear that they use to play, the teams they've played with, events they've gone to.
A ref would have info regarding the certifications they have and the events they've reffed.
Manufacturer reps would have info regarding their position within the company they rep.
A parent would have information regarding how long they've been involved with the sport, perhaps relations with the users they are parent of.
Just as an example.
Edit 2:
Player Table
id
user id
started date
stopped date
rank
Ref Table
id
user id
started date
stopped date
is certified
certified by
verified
Photographer / Videographer / News Reporter Table
id
user id
started date
stopped date
worked under name
website / channel link
about
verified
Tournament / Big Game Rep Table
id
user id
started date
stopped date
position
tourney id
verified
Store / Field / Manufacturer Rep Table
id
user id
started date
stopped date
position
store / field / man. id
verified
This is what I planned out so far. I'm still new to this so I could be doing it completely wrong. And it's only five groups. It was more until I condensed it some.
I find it odd to have so many entities that differ from each other, but I will set that aside and get to the question.
It depends on the group criteria you need. In the case you described, where each group has its own columns and information, I think your design is a good one, especially if you need the information in a readable form in the database. If you need all groups in a single table, you will have to store the group-specific information in some kind of object (a blob, an XML string, or any other form), but then you lose the ability to filter on those criteria using the database.
In a relational database I would do it using the design you described.
The design of your tables greatly depends on the requirements of your software.
E.g. your description of users led me in a wrong direction, I was at first thinking about a "normal" user of a software. Basically name, login-information and stuff like that. This I would never split over different tables as it really makes tasks like login, session handling, ... really complicated.
Another point which surprised me, was that you want to store the equipment in columns of those user's tables. Usually the relationship between a person and his equipment is not 1 to 1 and in most cases the amount of different equipment varies. Thus you usually have a relationship between users and their equipment (1:n). Thus you would design an equipment table and there refer to the owner's user id.
But once you have an idea of which data you have in your application and which relationships exist between them, designing the tables is rather straightforward.
The good news is that your data model and database design will develop over time. Try to start with a basic model that covers the majority of your use cases, then slowly add more use cases and aspects.
As long as you are in the planning and early implementation phase, it is rather easy to change your database design.
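The two structural points above (one table per group keyed by user id, and equipment as a 1:n table rather than columns) could look roughly like this. It is a sketch only, using SQLite through Python's sqlite3 so that it runs standalone; the table names referee and equipment, the column owner_user_id, and the sample data are assumptions for illustration, and only two of the groups are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Login-related data stays in one table.
CREATE TABLE users (
    id    INTEGER PRIMARY KEY,
    name  TEXT NOT NULL,
    email TEXT NOT NULL
);
-- One table per group; a user may have a row in several of them.
CREATE TABLE player (
    id           INTEGER PRIMARY KEY,
    user_id      INTEGER NOT NULL REFERENCES users(id),
    started_date TEXT
);
CREATE TABLE referee (
    id           INTEGER PRIMARY KEY,
    user_id      INTEGER NOT NULL REFERENCES users(id),
    is_certified INTEGER NOT NULL DEFAULT 0
);
-- 1:n relationship: each piece of equipment points at its owner.
CREATE TABLE equipment (
    id            INTEGER PRIMARY KEY,
    owner_user_id INTEGER NOT NULL REFERENCES users(id),
    description   TEXT NOT NULL
);
""")

conn.execute("INSERT INTO users VALUES (1, 'Sam', 'sam@example.com')")
conn.execute("INSERT INTO player (user_id, started_date) VALUES (1, '2015-04-01')")
conn.execute("INSERT INTO referee (user_id, is_certified) VALUES (1, 1)")
conn.executemany(
    "INSERT INTO equipment (owner_user_id, description) VALUES (1, ?)",
    [("mask",), ("marker",)],
)

# All equipment owned by user 1, regardless of how many items there are.
equipment = [row[0] for row in conn.execute(
    "SELECT description FROM equipment WHERE owner_user_id = 1 ORDER BY id")]

# Is user 1 both a player and a referee?
player_and_referee = conn.execute(
    "SELECT COUNT(*) FROM player p JOIN referee r ON r.user_id = p.user_id "
    "WHERE p.user_id = 1").fetchone()[0]

print(equipment, player_and_referee)
```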

Using Redis as a Key/Value store for activity stream

I am in the process of creating a simple activity stream for my app.
The current technology layer and logic is as follows:
All data relating to an activity is stored in MySQL, and an array of all activity IDs is kept in Redis for every user.
User performs action and activity is directly stored in an 'activities' table in MYSQL and a unique 'activity_id' is returned.
An array of this user's 'followers' is retrieved from the database and for each follower I push this new activity_id into their list in Redis.
When a user views their stream I retrieve the array of activity id's from redis based on their userid. I then perform a simple MYSQL WHERE IN($ids) query to get the actual activity data for all these activity id's.
I believe this kind of setup should be quite scalable, as the queries will always be very simple IN queries. However, it presents several problems.
Removing a follower - If a user stops following someone, we need to remove all activity IDs that belong to that person from the user's Redis list. This requires looping through all IDs in the Redis list and removing the ones that correspond to the unfollowed user. This strikes me as quite inelegant; is there a better way of managing this?
Archiving - I would like to keep the Redis lists to a maximum length of, say, 1000 activity IDs, as well as frequently prune old data from the MySQL activities table to prevent it from growing to an unmanageable size. Obviously this can be achieved by removing old IDs from the user's stream list when we add a new one. However, I am unsure how to go about archiving this data so that users can view very old activity data should they choose to. What would be the best way to do this? Or am I simply better off enforcing this limit completely and preventing users from viewing very old activity data?
To summarise: what I would really like to know is if my current setup/logic is a good/bad idea. Do I need a total rethink? If so what are your recommended models? If you feel all is okay, how should I go about addressing the two issues above? I realise this question is quite broad and all answers will be opinion based, but that is exactly what I am looking for. Well formed opinions.
Many thanks in advance.
Point 1 doesn't seem difficult to do without looping, assuming the per-user lists are available as a table (named Redis here):
DELETE Redis FROM Redis
JOIN activities ON Redis.activity_id = activities.id
  AND activities.user_id = 2  -- the user being unfollowed
  AND Redis.user_id = 1       -- the follower whose list is cleaned
;
Point 2: I'm not really sure about archiving. You could create an archive table for every period and periodically move old activities from the main table into it. That said, a single properly normalized activity table ought to be able to get pretty big. (Make sure any "large" activity stores its data in a separate table; the main activity table should be "narrow", since it is expected to have a lot of entries.)
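The set-based delete in point 1 can be tried out as follows. This is a sketch under the same assumption as the answer, namely that the per-user stream lists are mirrored in a SQL table; the table name stream and the function unfollow are hypothetical. It uses SQLite through Python's sqlite3 so that it runs standalone, and because SQLite has no multi-table DELETE ... JOIN, the join is written as a subquery instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE activities (
    id      INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL          -- author of the activity
);
-- Hypothetical SQL mirror of the per-user Redis lists.
CREATE TABLE stream (
    user_id     INTEGER NOT NULL,     -- owner of the stream (the follower)
    activity_id INTEGER NOT NULL REFERENCES activities(id),
    PRIMARY KEY (user_id, activity_id)
);
""")
conn.executemany("INSERT INTO activities (id, user_id) VALUES (?, ?)",
                 [(100, 2), (101, 3), (102, 2)])
conn.executemany("INSERT INTO stream VALUES (?, ?)",
                 [(1, 100), (1, 101), (1, 102)])


def unfollow(follower_id, followed_id):
    """Remove every activity authored by followed_id from follower_id's stream."""
    with conn:  # commit on success, roll back on error
        conn.execute(
            "DELETE FROM stream WHERE user_id = ? AND activity_id IN "
            "(SELECT id FROM activities WHERE user_id = ?)",
            (follower_id, followed_id),
        )


unfollow(1, 2)
remaining = [row[0] for row in conn.execute(
    "SELECT activity_id FROM stream WHERE user_id = 1 ORDER BY activity_id")]
print(remaining)
```

One statement replaces the loop over the list, at the cost of keeping the stream in SQL rather than only in Redis.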

How to use MYSQL to track user likes

For websites like Digg, how can you use MySQL to track when someone likes an article?
It seems simple enough to just keep track of the total number of likes. The part I don't understand is how to
1. keep users from voting on something more than once and
2. allow users to click on their profile to see the stories they have liked.
Would you have a column in the table containing the story info to which you just add comma-separated user names? You could keep track of who has liked a story, but the data would get huge, especially for websites like Digg that have 100,000 users or more. And how would you allow the user to see all the stories they have liked?
Thank you.
You would need a row for each like. Don't use comma-separated lists.
how to 1. keep users from voting on something more than once
Create a unique index on articleid, userid.
And how would you allow the user to see all the stories they have liked?
SELECT articleid FROM likes WHERE userid = 42
but the data would get huge
Yes, it could get huge. Most websites will easily be able to cope with just a single database. Very large websites will need to use a cluster to store data on several machines. The data needs to be partitioned so that the application knows on which server to find the data.
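The row-per-like design with a unique index can be sketched as below. It uses SQLite through Python's sqlite3 so that it runs standalone; the index name likes_article_user and the helper like are assumptions for illustration, while the table and column names follow the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE likes (
    articleid INTEGER NOT NULL,
    userid    INTEGER NOT NULL
);
-- The unique index is what stops a second like by the same user.
CREATE UNIQUE INDEX likes_article_user ON likes (articleid, userid);
""")


def like(articleid, userid):
    """Record a like; return False if this user already liked the article."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO likes (articleid, userid) VALUES (?, ?)",
                (articleid, userid),
            )
        return True
    except sqlite3.IntegrityError:
        # Raised by the unique index on (articleid, userid).
        return False


first = like(7, 42)
second = like(7, 42)   # duplicate, rejected by the database
like(9, 42)

liked = [row[0] for row in conn.execute(
    "SELECT articleid FROM likes WHERE userid = 42 ORDER BY articleid")]
print(first, second, liked)
```

Letting the database reject the duplicate avoids a separate "has this user already voted?" lookup, which could race with a concurrent insert.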
Social networks these days are modeled like a graph data structure,
where every entity (people, photos, videos, status updates, comments, etc.) is a node of the graph and likes and unlikes are connections between two nodes.
Ideally you would have a table for likes where you simply add a row for each like,
storing who liked, what was liked, and any other info in columns.
Complex social networks do more than just this.
You can store the likes in a separate table called story_likes with two columns: story_id and user_id.
1) Put a constraint in the database that the combination of these should be unique. That way your user can like a story only once.
2) You can pull the stories that the user likes from this table and pull other story details using the story id you have. 100,000 rows is not that big for a MYSQL database.
You can also allow your users to dislike a story by having a column for state=ENUM('LIKED', 'DISLIKED').

Database structure - most common queries span 3-4 tables. Should I reduce tables?

I am creating a new DB in MySQL for an application and wondered if anyone could provide some advice on the following set up. I'll try and simplify things as best as I can.
This DB is designed to store alerts which are related to specific items created by a user. In turn there is the need to store notes related to the items and/or alerts. At first I considered the following structure...
USERS table - to store basic app user info (e.g. user_id, name, email) - this is the only bit I'm fairly certain does not need to be changed
ITEMS table: contains info on particular item (4 fields or so). Contains user_id to indicate which user created/owns this item
ALERTS table: contains info on the alert, item_id to indicate which item the alert is related to, contains user_id to indicate which user created alert
NOTES table: contains note info, user_id of note owner, item_id if associated with an item, alert_id if associated with alert
Relationships:
An item does not always have an alert associated with it
An item or alert does not always have a note associated with it
An alert is always associated with an item. More than one alert can be associated with the same item.
A note is always associated with an item or alert. More than one note can be associated with the same item or alert.
Once first created item info is unlikely to be updated by a user.
For argument's sake, let's say that each user will create an average of 10 items and each item will have an average of 2 alerts associated with it. There will be an average of 2 notes per item/alert.
Very common queries that will be run:
1) Return all items created by a particular user with any associated alerts and notes. Given a user_id this query would span 3 tables
2) Checking each day for alerts that need to be sent to a user's email address. WHERE alert date==today, return user's email address, item name and any associated notes. This would require a query spanning 4 tables which is why I'm wondering if I need to take a different approach...
Option 1) one table to cover items, alerts and notes. user_id owner for each row. Every time you add a note to an item or alert you are repeating the alert and/or item info. Seems a bit wasteful but item and alert info won't be large.
Option 2) I don't foresee the need to query notes (famous last words?) so how about serializing note data so multiple notes are stored in one row in either the item or alert table (or just a combined alert/item table)
Option 3) Anything else you can think of? I'm asking this question as each option I've considered doesn't feel quite right.
I appreciate this is currently a small project and so performance shouldn't be of great concern and I should just go with the 4 tables. It's more that my common queries will end up being relatively complex that makes me think I need to re-evaluate the structure.
I would say that the common wisdom is to normalize to start and denormalize only when performance data suggest that it's necessary.
Make sure that your tables are indexed properly, with foreign key relationships for JOINs.
If you think you'll end up with a lot of data, this might be a good time to think about a partitioning strategy. Partitioning your fast-growing tables by time would be a good first step.
Four tables is not complex. I commonly write report queries that hit 15 or more tables in a database structure that has hundreds of tables (most with millions of records) and I wouldn't even say our dbs are anything more than medium sized (a typical db in our system might have around 200 gigs of data, so not large at all as databases go). Because they are properly indexed, they still run fast unless I am doing very complex calculations. Normalize, don't even consider denormalizing until you are an experienced database designer who knows better than to worry about the number of tables.
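To show that the four-table daily alert query stays readable, here is a sketch of it. It uses SQLite through Python's sqlite3 so that it runs standalone; the column names alert_date and body, the index names, the function alerts_due, and the sample rows are assumptions for illustration, and only notes attached directly to the alert are joined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT, email TEXT);
CREATE TABLE items (
    item_id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(user_id),
    name    TEXT
);
CREATE TABLE alerts (
    alert_id   INTEGER PRIMARY KEY,
    item_id    INTEGER NOT NULL REFERENCES items(item_id),
    user_id    INTEGER NOT NULL REFERENCES users(user_id),
    alert_date TEXT NOT NULL
);
CREATE TABLE notes (
    note_id  INTEGER PRIMARY KEY,
    user_id  INTEGER NOT NULL REFERENCES users(user_id),
    item_id  INTEGER REFERENCES items(item_id),
    alert_id INTEGER REFERENCES alerts(alert_id),
    body     TEXT
);
-- Indexes on the columns used to filter and join.
CREATE INDEX alerts_date_idx ON alerts (alert_date);
CREATE INDEX notes_alert_idx ON notes (alert_id);
""")
conn.execute("INSERT INTO users VALUES (1, 'Ann', 'ann@example.com')")
conn.execute("INSERT INTO items VALUES (1, 1, 'Passport')")
conn.executemany("INSERT INTO alerts VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, "2024-05-01"), (2, 1, 1, "2024-06-01")])
conn.execute("INSERT INTO notes VALUES (1, 1, NULL, 1, 'renew online')")


def alerts_due(day):
    """Email, item name and note text for every alert due on the given day."""
    return conn.execute("""
        SELECT u.email, i.name, n.body
        FROM alerts a
        JOIN items i ON i.item_id = a.item_id
        JOIN users u ON u.user_id = a.user_id
        LEFT JOIN notes n ON n.alert_id = a.alert_id  -- alerts may have no note
        WHERE a.alert_date = ?
        ORDER BY a.alert_id, n.note_id
    """, (day,)).fetchall()


print(alerts_due("2024-05-01"))
```

The LEFT JOIN keeps alerts that have no note, which matches the stated rule that an alert does not always have a note associated with it.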