How did Facebook or Twitter implement their subscribe system - mysql

I'm working on an SNS-like mobile app, where users upload content and can see updates from their subscribed topics or friends on their homepage.
I store user content in MySQL, and I build each user's homepage by first querying who and what the user has subscribed to, and then querying the content table with a 'where userid IN (....) or topic IN (....)' clause.
I suspect this will become quite slow as the content table grows or when a user subscribes to a large number of users or topics. Our newly released app is already gaining thousands of new users each week, and the rate is increasing. Scalability has to be a concern for us right now.
So I wonder how Facebook or Twitter handle this subscribing problem with their amazing number of users. Do they handle a list for each user? I tried to search, but all I got is how to interact with Facebook or Twitter rather than how they actually implement this feature.
I noticed that on Facebook you see only new updates in your feed rather than history. That means subscribing to a new user won't dump lots of outdated content into your feed, as it would with my current method.
How does Facebook design its database, and how does it dispatch new content to subscribed users?
My backend is currently PHP+MySQL, and I don't mind introducing other backend technologies such as Redis or JMS and stuff if that's the way it should be done.

Sounds like you guys are still at a pretty early stage. There are any number of ways to solve this, depending on how many DAUs you think you'll hit in the near term, how much money you have to spend on hardware, how much time you have to build it, etc.
You can try an interim table that queues up newly introduced items along with metadata describing what each one entails (which topic, the list of friend user_ids, etc.). Then use a queue-consumer system like RabbitMQ or Gearman to manage consumption of this growing list and to decide which worker should process each item. Build the queue-consumer program in Scala or on a Java stack (for example, a Maven-built application running on Tomcat), something that can persist as a long-running process. If you really want to stick with PHP, build a PHP REST API that lives in php5-fpm's memory, is managed by the FastCGI process manager, sits behind a proxy like nginx, and is triggered by curl calls at an appropriate interval from a cron-executed script.
[EDIT] - It's probably better not to use a DB as the queueing system. Use a cache server like Redis instead; it outperforms a DB in many ways, and it can persist to disk (look up RDB and AOF). It's not very fault tolerant, though: if a job fails all of a sudden, you might lose the job record. Most likely you won't care about these crash edge cases. Also look up php-resque!
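A minimal sketch of the queue-consumer idea above, in Python with only the standard library. It is not how Facebook or Twitter actually implement their feeds; it only illustrates the general "fan out on write" pattern. An in-process queue.Queue stands in for RabbitMQ/Gearman/Redis, plain dictionaries stand in for the subscription and feed stores, and every name here (followers, feeds, publish, fan_out_worker, read_feed) is hypothetical.

```python
import queue
import threading
from collections import defaultdict, deque

# Hypothetical in-memory stand-ins for the real stores.
followers = defaultdict(set)                     # author_id -> subscriber ids
feeds = defaultdict(lambda: deque(maxlen=100))   # user_id -> newest-first post ids
jobs = queue.Queue()                             # stand-in for the message broker


def subscribe(user_id, author_id):
    followers[author_id].add(user_id)


def publish(author_id, post_id):
    """Producer: enqueue the new item plus its metadata (the recipient list)."""
    recipients = tuple(followers[author_id])     # snapshot taken at publish time
    jobs.put((post_id, recipients))


def fan_out_worker():
    """Consumer: copy each new post id into every recipient's feed."""
    while True:
        job = jobs.get()
        try:
            if job is None:                      # sentinel: stop the worker
                return
            post_id, recipients = job
            for user_id in recipients:
                feeds[user_id].appendleft(post_id)
        finally:
            jobs.task_done()


def read_feed(user_id, limit=20):
    """Reading a homepage is one lookup, with no IN (...) query."""
    return list(feeds[user_id])[:limit]


worker = threading.Thread(target=fan_out_worker, daemon=True)
worker.start()

subscribe(user_id=1, author_id=42)
publish(author_id=42, post_id=1001)
publish(author_id=42, post_id=1002)
subscribe(user_id=2, author_id=42)   # subscribes later
publish(author_id=42, post_id=1003)

jobs.put(None)                       # ask the worker to stop once the queue drains
worker.join()
```

Because the recipient list is captured when an item is published, a user who subscribes later only receives items published after that point, which matches the "updates rather than history" behaviour described in the question.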
To prep for the SNS to go out efficiently, I'm assuming you're already de-normalizing the tables. I'd imagine a "user_topic" table that maps each topic to the users who subscribed to it. Create another table, "notification_meta", describing where users prefer to receive notifications (SMS/push/email/in-app notification) and the metadata needed to push to those channels (mobile client approval keys for APNS/GCM, email addresses, user auth-tokens). Use JSON blobs for those two fields in notification_meta, so each user has a single row. This saves I/O hits on the DB.
Use user_id as your primary key for "notification_meta" and user_id + topic_id as PK for "user_topic". DO NOT add an auto-increment "id" field for either, it's pretty useless in this use case (takes up space, CPU, index memory, etc). If both fields are in the PK, queries on user_topic will be all from memory, and the only disk hit is on "notification_meta" during the JOIN.
So if a user subscribes to 2 topics, there'll be two entries in "user_topic", and each user will always have a single row in "notification_meta"
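A rough sketch of the two tables described above, using Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere. The column names channels and targets, the sample values, and the helper subscribers_for are assumptions made for illustration; only the table names and key choices come from the answer.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")   # SQLite stands in for MySQL here
conn.executescript("""
CREATE TABLE user_topic (
    user_id  INTEGER NOT NULL,
    topic_id INTEGER NOT NULL,
    PRIMARY KEY (user_id, topic_id)   -- composite key, no auto-increment id
);
CREATE TABLE notification_meta (
    user_id  INTEGER PRIMARY KEY,     -- exactly one row per user
    channels TEXT NOT NULL,           -- JSON blob: preferred channels (assumed name)
    targets  TEXT NOT NULL            -- JSON blob: tokens, addresses (assumed name)
);
""")

# One user subscribed to two topics: two user_topic rows, one notification_meta row.
conn.executemany("INSERT INTO user_topic VALUES (?, ?)", [(7, 1), (7, 2)])
conn.execute(
    "INSERT INTO notification_meta VALUES (?, ?, ?)",
    (7, json.dumps(["push", "email"]), json.dumps({"email": "user7@example.com"})),
)
conn.commit()


def subscribers_for(topic_id):
    """Return (user_id, channels, targets) for everyone subscribed to a topic."""
    rows = conn.execute(
        """SELECT ut.user_id, nm.channels, nm.targets
           FROM user_topic AS ut
           JOIN notification_meta AS nm ON nm.user_id = ut.user_id
           WHERE ut.topic_id = ?
           ORDER BY ut.user_id""",
        (topic_id,),
    ).fetchall()
    return [(uid, json.loads(ch), json.loads(tg)) for uid, ch, tg in rows]
```

One thing to check on a real dataset: a lookup by topic_id alone does not use the leading column of a (user_id, topic_id) key, so a secondary index on topic_id may be worth adding. That is a suggestion on top of the answer, not part of it.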
There are more ways to scale, like dynamically creating a new table for each new topic, sharding to different MySQL instances based on user_id, partitioning, etc. There's N-ways to scale, especially in MySQL. Good luck!

Related

Is it a good practice to store auth session in the database?

I created a login system that in addition to being used on a website, will also be used in mobile applications.
Since I want to keep users logged in on their phones until they choose to log out, I did not use PHP's built-in session authentication.
So I thought it would be better to store the login sessions in the database, for each user request, to verify if the authentication token is still valid.
But I don't know if this is a good practice, since every time the user refreshes the screen in the browser, or the application sends any request to the system, it will run one query to verify that the login is still active and then another query to fetch what the user requested.
My concern is whether this will become too slow for a system that could have between 900 million and 1.5 billion users, since the database will have to handle many more requests and verification queries in addition to the normal queries requested by the user.
Below is the current structure of my database. I would also like tips if my structure is very wrong.
Yes, it's a good practice to store session information in an application's main transactional database. A great many web applications work this way at large scale.
If you have the skills to do so, you might consider setting things up so session information is stored in a separate database that's not dependent on data in your transactional database. This separate database needs just one table:
login_token PK
key PK
value
The login_token is the value of the session cookie, a large hard-to-guess random value your web app sends to each logged-in user's browser. For example, if my user id were 10054 the session table might contain these rows for me.
2EwZzPJdigVlrwtkFC5qoe97YE0EBddJ user_id 10054
2EwZzPJdigVlrwtkFC5qoe97YE0EBddJ user_name ojones
Why use this key/value design? It is easily ported to a high-performance key/value storage system like Redis. It's simple. And, to log me off and kill my session all you need is
DELETE FROM session WHERE login_token = '2EwZzPJdigVlrwtkFC5qoe97YE0EBddJ'
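A minimal sketch of that key/value session table, using Python's built-in sqlite3 and secrets modules as stand-ins for the real stack. The helper names (start_session, load_session, end_session) are hypothetical; the table and column names follow the answer. The key column is quoted because KEY is a reserved word in some databases, including MySQL.

```python
import secrets
import sqlite3

conn = sqlite3.connect(":memory:")   # SQLite stands in for the session database
conn.execute("""
CREATE TABLE session (
    login_token TEXT NOT NULL,
    "key"       TEXT NOT NULL,
    value       TEXT,
    PRIMARY KEY (login_token, "key")
)""")


def start_session(user_id, user_name):
    """Create a hard-to-guess token and store one row per session attribute."""
    token = secrets.token_urlsafe(24)
    conn.executemany(
        'INSERT INTO session (login_token, "key", value) VALUES (?, ?, ?)',
        [(token, "user_id", str(user_id)), (token, "user_name", user_name)],
    )
    conn.commit()
    return token


def load_session(token):
    """Return the session as a dict; an empty dict means 'not logged in'."""
    rows = conn.execute(
        'SELECT "key", value FROM session WHERE login_token = ?', (token,)
    ).fetchall()
    return dict(rows)


def end_session(token):
    """Log the user off by deleting every row for the token."""
    conn.execute("DELETE FROM session WHERE login_token = ?", (token,))
    conn.commit()
```

Each request would call load_session with the token from the cookie or the mobile client; that is the one extra indexed lookup per request the question is worried about.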
(You asked for feedback on your table design. Here is mine: Use INT or BIGINT values for primary keys in tables you expect to become large. VARCHAR values are a poor choice for primary keys because index lookup and row insertion are substantially slower. CHAR(n) values are a slightly better choice, but still slower than integers. The session table only covers presently logged in users.)
And, I'll repeat my comment. Don't waste too much time today on designing your new system so it can run at the scale of Twitter or Facebook (~ 10**9 users). At this stage of your project, you cannot know where your performance bottlenecks will lie when you run at that scale. And it will take you a decade, at the very least, to get that many users. By then you'll have hundreds of developers working on your system. If you hire them wisely, most of them will be smarter than you.
How do I know these things? Experience, wasted time, and systems that did not scale up even when I designed them to do that.

If my users are stored in another database, should I duplicate them in my service that uses SQL database?

If my users are stored in some other database, but I am building posts in my SQL database, should I create another table users?
If I did, I would be duplicating all of my users and would have to make sure this stays in sync with the other database, but on the other hand, my posts tables could save space by referring to fk instead of full id string each time.
What is the recommendation? Create another table users or just pass in the user ids to query?
If you have a service that stores and provides information about users then other services that need that information should communicate with the User service to get it. That is, presumably, the reason the User service exists in the first place.
Depending on the volatility of the users list and requirements for changes there to be respected in the Posts service you might consider some short-term caching in the Posts service, but I certainly wouldn't persist another copy of the user list there.
There are 3 obvious solutions.
The simplest, cleanest and fastest is to use foreign keys and joins between your "posts" database and your "users" database. In this case, when you show a list of posts, you can get both the post and user data in a single query, and there's no need to keep things up to date.
The next option is to store a copy of the user data alongside your posts. This leads to entertaining failure modes - data in the user database may get out of sync. However, this is a fairly common strategy when using 3rd party authentication systems (e.g. logging on with your Google/Facebook/Github/Stack Exchange credentials). The way to make this work is to minimize the amount of data you duplicate, and have it be safe if it's out of date. For instance, a user's display name is probably okay; current bank account balance is probably not.
The final option is to store the primary key for users in your posts database, and to retrieve the user data at run time. This is less likely to lead to bugs with data getting out of sync, but it can cause performance problems - retrieving user details for 1000 posts one by one is obviously much slower than retrieving everything through a joined query.
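The performance problem with the final option can often be softened by batching. The sketch below, in Python, collects the distinct user IDs from a page of posts and fetches them in one call instead of one call per post. USER_SERVICE, fetch_users and attach_authors are hypothetical names, and the dictionary merely stands in for the separate user database or service; it assumes that service offers some kind of multi-ID lookup.

```python
# Hypothetical stand-in for the separate user database/service.
USER_SERVICE = {1: {"name": "alice"}, 2: {"name": "bob"}}
round_trips = 0


def fetch_users(user_ids):
    """One round trip that returns details for many user ids at once."""
    global round_trips
    round_trips += 1
    return {uid: USER_SERVICE[uid] for uid in user_ids if uid in USER_SERVICE}


def attach_authors(posts):
    """Enrich posts with author data using a single batched lookup."""
    wanted = sorted({post["user_id"] for post in posts})   # de-duplicate ids
    users = fetch_users(wanted)
    # Unknown users become None, so stale or missing data cannot break the page.
    return [{**post, "author": users.get(post["user_id"])} for post in posts]


posts = [
    {"id": 10, "user_id": 1, "title": "first"},
    {"id": 11, "user_id": 2, "title": "second"},
    {"id": 12, "user_id": 1, "title": "third"},
    {"id": 13, "user_id": 3, "title": "orphan"},
]
enriched = attach_authors(posts)
```

Four posts by three distinct authors cost one lookup here rather than four, and the posts store only the user's primary key.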
The choice then is "do I have a service which combines post and user data and my UI retrieves everything from that service, or do I let the UI retrieve posts, and then users for each post". That's mostly down to the application usage, and whether you can use asynchronous calls to retrieve user information. If at all possible (assuming you're building a web application), the simplest option might be to return the posts and user IDs and use Ajax requests to retrieve the user data as needed.
The CQRS approach (common to microservice architectures) provides some structure for this.

Moving a resource stored in database tables in two steps using a 'reservation'

I need to architect a database and a service. I have a resource that I need to deliver to users, and the delivery takes some time or requires the user to do some more work.
These are the tables I store information into.
Table - Description
_______________________
R - to store resources
RESERVE - to reserve requested resources
HACK - to track requests that couldn't have been made with my client application (statistics)
FAIL - to track requests that can't be resolved, where the user isn't at fault (statistics)
SUCCESS - to track successful deliveries (statistics)
The first step, when a user requests a resource:
IF (condition1 is true - the user has the right to request the resource) THEN
    IF (I've successfully RESERVE-d the resource and committed the transaction) THEN
        nothing more to do
    ELSE
        save request into FAIL
ELSE
    save request into HACK
Then the second step:
IF (condition2 is true - the user has done his job and requests the reserved resource) THEN
    IF (the resource was delivered successfully) THEN
        save request into SUCCESS
    ELSE
        save request into FAIL
        depending on application logic, move the resource from RESERVE back to R or not
ELSE
    save request into HACK, contact the user,
    if this really is a hacker, move the resource from RESERVE back to R
This is how I plan to implement the system. I've put the transactions into stored procedures, but the main application logic, where I decide which procedure to call, lives in the application/service layer.
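To make that split concrete, here is a minimal sketch of the first step only, in Python with the built-in sqlite3 module standing in for the real database. The function reserve_resource plays the role of the service layer, and the BEGIN ... COMMIT block plays the role of one stored procedure. The column layouts, the allowed flag (standing in for condition1) and the example IP address are all assumptions made for illustration.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the
# transaction boundaries below are explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
CREATE TABLE r       (resource_id INTEGER PRIMARY KEY);
CREATE TABLE reserve (resource_id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL);
CREATE TABLE fail    (user_id INTEGER, reason TEXT);
CREATE TABLE hack    (user_id INTEGER, ip TEXT);
INSERT INTO r (resource_id) VALUES (1);
""")


def reserve_resource(user_id, allowed, ip="203.0.113.9"):
    """Step one: decide in the service layer, move the row in one transaction."""
    if not allowed:                       # condition1 failed
        conn.execute("INSERT INTO hack VALUES (?, ?)", (user_id, ip))
        return "HACK"
    try:
        conn.execute("BEGIN IMMEDIATE")   # take the write lock up front
        row = conn.execute("SELECT resource_id FROM r LIMIT 1").fetchone()
        if row is None:
            raise LookupError("no free resource")
        conn.execute("DELETE FROM r WHERE resource_id = ?", (row[0],))
        conn.execute("INSERT INTO reserve VALUES (?, ?)", (row[0], user_id))
        conn.execute("COMMIT")
        return "RESERVED"
    except (sqlite3.Error, LookupError) as exc:
        if conn.in_transaction:
            conn.execute("ROLLBACK")      # the resource stays in R
        conn.execute("INSERT INTO fail VALUES (?, ?)", (user_id, str(exc)))
        return "FAIL"


results = [
    reserve_resource(10, allowed=True),   # gets the only resource
    reserve_resource(11, allowed=True),   # nothing left to reserve
    reserve_resource(12, allowed=False),  # not entitled to ask
]
```

The second step would follow the same shape: one transaction that deletes the row from RESERVE and writes to SUCCESS, or moves it back to R on failure.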
Am I on the right track? Is this division of code between the DB and the service layer normal? I'd really value the opinions of experienced people.
Clarifications in response to RecentCoin's questions:
1. The difference between the HACK and FAIL tables is that I store more information in the HACK table, such as the user's IP and XFF. I'm not going to penalize every user who appears in that table. There can be two reasons a user (request) is tracked as a hack. The first is that I have a bug (most likely in the client app), and the record will help me fix it. The second is that someone is making requests manually and trying to bypass the rules. If he tries 'harder', I'll be able to take some precautions.
2. The separation of the reserve and the success tables has these reasons.
2.1. I use the reserve table in some transactions and queries without using the success table, so I can lock them separately.
2.2. The data stored in success will not slow down my queries while I'm querying the reserve table.
2.3. The success table is a kind of log for statistics, which I can delete or move to another database for future analysis.
2.4. I delete rows from reserve after I move them to the success table, so I can roughly estimate the maximum row count in that table, because I have a maximum number of reservations per user.
Points 2.3 and 2.4 could also be achieved by keeping everything in one table.
So are reasons 2.1 and 2.2 good enough to keep the data separate?
3. The resource being "delivered successfully" means that the admin and the service have successfully done everything they could; if they couldn't, the reservation fails.
4 and 6. The restrictions and rights are simple, like city and country restrictions. The users are 'flat' and don't have any roles or hierarchy.
5. I have some tables to store users and their information. I don't have LDAP or AD.
You're going in the right direction, but there are some other things that need to be more clearly thought out.
1. You're going to have to define what constitutes a "hack" vs a "fail". Especially with new systems, users get confused and it's pretty easy for them to make honest mistakes. This seems like something you want to penalize them for in some fashion, so I'd be extremely careful with this.
2. You will want to consider having "reserve" and "success" be equivalent. Why store the same record twice? You should have a really compelling reason to do that.
3. You will need to define "delivered successfully", since that could be anything from an entry in a calendar to getting more pens and Post-it notes.
4. You will want to define your resources as well as which user(s) have rights to them. For example, you may have a conference room that only managers are allowed to book, but you might want to include the managers' administrative assistants in that list since they would be booking the room for the manager(s).
5. Do you have a database of users? LDAP or Active Directory, or will you need to create all of that yourself? If you do have LDAP or AD, can you use something like SAML?
6. You are going to want to consider how you want to assign those rights. Will they be group based, where group membership confers the rights to reserve, request, or use a given thing? For example, you may only want architects printing to the large format printer.

What database/technology to use for a notification system on a node.js site?

I'm looking to implement notifications within my node.js application. I currently use MySQL for relational data (users, submissions, comments, etc). I use MongoDB for page views only.
To build a notification system, does it make more sense (from a performance standpoint) to use MongoDB or MySQL?
Also, what's the convention for showing new notifications to users? At first, I was thinking that I'd have a notification icon that the user clicks to trigger an ajax call fetching all new notifications for that user, but I want to show the user that the icon is actually worth clicking (either with a different color or with a bubble showing the number of new notifications, like Google Plus does).
I could do it when the user logs in, but that would mean the user would only see new notifications after logging out and back in (because the count would be saved in their session). Should I poll for updates? I'm not sure that's the recommended method, as it seems like overkill just to show a single digit (or more, depending on the number of notifications).
If you're using node then you can 'push' notifications to a connected user via websockets. The linked document is an example of one well known websocket engine that has good performance and good documentation. That way your application can send notifications to any user, or sets of users, or everyone based on simple queries that you setup.
Data storage is a different question. Generally MySQL does have poor performance at very high scale, and Mongo does generally have a quicker read query response, but it depends on what data structure you wish to use. If your data is in a simple key-value structure with no real need for relational data, then perhaps using a memory store such as Redis would be the most suitable.
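Whichever store you pick, the badge itself only needs a cheap count of unread rows for one user. Here is a minimal sketch in Python, with the built-in sqlite3 module standing in for the real store. The table layout and the helper names (notify, unread_count, open_notifications) are assumptions for illustration, not something taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # stand-in for MySQL, MongoDB or Redis
conn.executescript("""
CREATE TABLE notifications (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id INTEGER NOT NULL,
    body    TEXT    NOT NULL,
    is_read INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX idx_notifications_unread ON notifications (user_id, is_read);
""")


def notify(user_id, body):
    """Record a notification; a websocket push could be triggered here too."""
    conn.execute(
        "INSERT INTO notifications (user_id, body) VALUES (?, ?)", (user_id, body)
    )
    conn.commit()


def unread_count(user_id):
    """The number shown in the bubble next to the icon."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM notifications WHERE user_id = ? AND is_read = 0",
        (user_id,),
    ).fetchone()
    return count


def open_notifications(user_id):
    """Return unread notifications oldest first, then mark them as read."""
    rows = conn.execute(
        "SELECT body FROM notifications WHERE user_id = ? AND is_read = 0 ORDER BY id",
        (user_id,),
    ).fetchall()
    conn.execute("UPDATE notifications SET is_read = 1 WHERE user_id = ?", (user_id,))
    conn.commit()
    return [body for (body,) in rows]


notify(1, "alice commented on your submission")
notify(1, "bob replied to your comment")
notify(2, "your submission was approved")
```

With a push channel in place, the server would send the new count whenever notify runs, so the client does not need to poll for it.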
This answer has more information on your question too if you want to follow up and investigate more.

Multi-room chat logging in Rails/MySQL app

I'm coding a browser game application in Rails. It's chat-based, and I'm currently using MySQL for the databases. However, I'm running into a problem when it comes to chat logging for the games.
The application goals dictate that there will be multiple rooms at any given time in which people are playing the chat-based game. Conversation will be pretty much constant, and there are a number of other actions, such as private messages and game actions, which must be logged as well. Players who join the game after other players must be able to see the chat log from before they joined, and games must be available to review.
My first thought was to create, on game start, a table named after the game identifier, and store everything there. Then when someone joins the game, we could just parse it back to them. Once the game had been over for a certain time, the script would take the table content, parse it into an XML object, and store this in a database for game review, deleting the table to keep things running lean.
I created a model called Message, with a matching table that has the same columns I want to store in the game tables: id, timestamp, sender, target (for PMs and actions), type of message and content. Then I set the initializer for the Message object to set the table name to 'game_#{game_id}'. Rails, however, is throwing tantrums: I get an undefined method has_key? error when I try to initialize the object in Rails. It occurs to me based on this that the method I'm using may be a bit un-Rails-ian, and that it possibly defeats the purpose of working in Rails to pass up the excellent object/db management features it has.
I've considered other alternatives, such as temporarily keeping all the messages in the main Messages table and just querying them by game ID, but I'm unsure if a MySQL table is up to the task of speedily serving up this data while accepting constant writes, especially in the event that we get a dozen or more games going at once averaging a message or two per second. It was suggested to me that a noSQL solution like a MongoDB capped collection for the temporary storage would be an excellent option from a performance standpoint, but that would still waste all that ActiveRecord goodness that Rails offers.
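For reference, the single-table alternative can be sketched as follows. This uses Python's built-in sqlite3 module rather than Rails and MySQL, purely so the example is self-contained; the column names mirror the ones listed above, while the index and the helpers log_message and backlog are assumptions for illustration. Whether one table keeps up with the expected write rate would still need measuring on the real stack.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # SQLite stands in for MySQL
conn.executescript("""
CREATE TABLE messages (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    game_id INTEGER NOT NULL,
    sent_at TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    sender  TEXT    NOT NULL,
    target  TEXT,                    -- set for PMs and actions, NULL otherwise
    kind    TEXT    NOT NULL,        -- e.g. 'chat', 'pm', 'action'
    content TEXT    NOT NULL
);
-- One composite index serves "all messages of game X, in order".
CREATE INDEX idx_messages_game ON messages (game_id, id);
""")


def log_message(game_id, sender, content, kind="chat", target=None):
    """Append one message and return its id."""
    cur = conn.execute(
        "INSERT INTO messages (game_id, sender, target, kind, content) "
        "VALUES (?, ?, ?, ?, ?)",
        (game_id, sender, target, kind, content),
    )
    conn.commit()
    return cur.lastrowid


def backlog(game_id, after_id=0):
    """Messages of one game, oldest first, optionally only those after a given id."""
    return conn.execute(
        "SELECT id, sender, kind, content FROM messages "
        "WHERE game_id = ? AND id > ? ORDER BY id",
        (game_id, after_id),
    ).fetchall()


first = log_message(1, "ann", "hello")
log_message(2, "zed", "other room")
log_message(1, "bob", "hi ann")
log_message(1, "ann", "rolls a die", kind="action", target="bob")
```

A late joiner would call backlog(game_id) once, then keep asking for backlog(game_id, after_id=last_seen_id); archiving a finished game becomes a query by game_id instead of a per-game table.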
Is there a reliable and relatively fast way to store and fetch the logged messages quickly while the game is ongoing, and then keep them in some low-overhead form for review? Would any of the above ideas be workable, or is there a whole separate option I've overlooked?