How can I reliably de-duplicate requests in message queues?

I have a scraper-crawler architecture in mind that would crawl "related" blogs based on their posts' metadata (tags, word occurrence, etc.). Making post-fetch requests is expensive; making blog-index requests is "cheap".
To continuously crawl and update blogs, I start from a "seed" blog (chosen either manually or by ranking) and then "pulse" it offline by going through internal database items: ranking the top-level related blogs, then ranking their related blogs, performing staleness checks, performing completeness checks, and so on.
This is what I'm going to call a "source". A source creates a long list of blogs to be checked (cutting off at, let's say, a thousand, then ranking those to grab the top 100); this list can be de-duplicated within the sourcing process.
The sources submit this list of blogs to an indexing queue. Consumer-workers take from this queue and index those blogs, placing the results (of the first pagination) on another queue. Consumer-workers on that queue check the results (post IDs) against the database, fire off another pagination request if not all post IDs were found, and fire post-fetch requests for the posts that weren't found.
Those requests end up on yet another queue, from which a big cluster of workers performs the rate-limited/expensive post-fetch requests.
That's all with just one source. Let's say I want two sources that check and "crawl" across this web of blogs: how can I de-duplicate across every part of this process?
Source A and source B both stumble across the same blog and both place it onto the blog-index queue. This is a data race: the workers that index the two copies will both see missing post IDs (if the blog has any) and both fire the same duplicate post-fetch requests onto the queue.
What I want is a distributed task management system that steps through tasks (creating subtasks as it goes) and weeds out duplicate tasks.
In a message broker I can technically de-duplicate, for example by using message IDs in RabbitMQ, but that doesn't solve the bigger race condition (source B fires a blog-index request that A fired earlier, and A's blog-index has already produced a bunch of post-fetch requests that are still queued).
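For concreteness, the enqueue-time guard I could bolt on today looks roughly like this (a sketch in Python, assuming a shared Redis instance; the key and queue names are made up). It stops two sources from enqueueing the same blog, but not the deeper race above:

import redis

r = redis.Redis(host="localhost", port=6379)

def try_claim(kind, task_id, ttl_seconds=3600):
    # SET NX EX is atomic, so two sources racing on the same blog
    # can't both win. The TTL lets the claim expire so the blog can
    # be re-crawled later.
    key = "claim:%s:%s" % (kind, task_id)
    return bool(r.set(key, "1", nx=True, ex=ttl_seconds))

def enqueue_blog_index(blog_id, publish):
    # publish() is a stand-in for whatever pushes onto the broker queue.
    # Only the first source to claim a blog actually enqueues it.
    if try_claim("blog-index", blog_id):
        publish("blog-index-queue", blog_id)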
TL;DR: Sequential Task Management between different message broker queues.

Related

Why does the read-after-write method not produce consistent results?

I'm watching this video about Akka.NET, and the speaker says read-after-write does not produce consistent results because the order of events is not predictable at the network level. The architecture the presenter describes in this video is as follows:
One Load Balancer
Multiple web servers. Load balancer determines which server to hit.
One database server (SQL Server).
I'm confused as to why consistent results are not achieved with a single database. If a lock were taken before data is written, wouldn't that give you back consistent results?
So I'm going to guess you're talking about the scenario Aaron describes about 10 minutes into this video. Here's the scenario:
User is clicking things on a site and we're firing off asynchronous requests to record the clicks.
The non-obvious part of the scenario he's describing is that we're not waiting for previous requests to finish before sending more requests to capture the user's clicks (imagine a single-page app where clicks don't cause a full page refresh from the server). We want to capture all the clicks.
We have some logic on the server that says, "If the user clicks these 3 things in a row, do some cool reaction..."
We check our condition on the web server ("Has the user clicked these 3 things in a row?") by writing the click event we just got to our DB, then reading back to see whether they've generated the stream of 3 clicks that triggers our cool reaction.
Here's the problem: each request to record a click could be going to a different web server and we're not waiting on the previous one to finish before we send more requests to record clicks. So we have no guarantee that the request to write the first event has completed before we write the second, or the third, etc.
For example, the first request could be delayed (or even fail!) because of a faulty network, so the second request could reach our SQL Server first! As such, when that request goes to read the stream of events, it may not be aware that a request was sent (but hasn't completed) to record the first event.
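Here's a toy version of that race (my own sketch, not from the talk; a random sleep stands in for the unpredictable network):

import random
import threading
import time

recorded = []                     # stands in for the events table
lock = threading.Lock()

def record_click(seq):
    time.sleep(random.uniform(0, 0.05))   # simulated network delay
    with lock:
        recorded.append(seq)

# Fire requests 1, 2, 3 without waiting for each to finish,
# just like the single-page app in the scenario.
threads = [threading.Thread(target=record_click, args=(i,)) for i in (1, 2, 3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(recorded)   # frequently NOT [1, 2, 3]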
I think the point he's trying to make is that in the face of multiple clients (in this example, web servers) writing to a database concurrently, you can't count on, "I sent that first so it will be recorded first". This holds true whether you're using DataStax Enterprise, Cassandra, SQL Server, Oracle, or whatever. Hope that helps!

How do database transactions happen in MMORPGs?

I've built an MMORPG that uses a MySQL database to store player-related data when the user logs off.
We built in an auto-save timer so that the data of every logged-in user is saved to the database every 3 hours.
In doing so, we noticed a fatal flaw...
Because all our database transactions are sent to a single DB thread, that thread can become backlogged with requests, which produces a login/saving issue. When this happens, players are unable to log in, as the login process requires the DB thread to confirm credentials; meanwhile, all save requests are queued at the back of the DB thread's schedule, deepening the backlog...
The only solution I can think of is to introduce multiple threads and have 3-4 threads interacting with the database.
However, this opens up a new issue: since DB requests are spread across multiple threads, one thread can receive a save request for a player while another DB thread receives a save request for the same player.
For example....
PlayerA logs in to the game.
3 hours pass and the autosave happens; PlayerA's data will now be saved.
PlayerA kills a monster and gains experience.
PlayerA logs off, which adds a save request to a DB thread.
Now we have two different save requests queued in the database. Assuming they are assigned to two different DB threads, the user's data could be saved in the wrong order... For example, the thread handling PlayerA's logout save might run first, and the autosave for PlayerA might run after it on a separate thread... This would cause loss of data (in this case, experience).
How do other MMORPGs handle something like this?
You need a database connection pool if you're not using one already and make sure you're not locking more data than you need. If you are saving how much gold a player has, you don't need to lock the table holding the credentials.
Keeping the order of events in a multi-threaded scenario is not a trivial problem, I suggest using a message queue, a single producer per player and a single consumer per player. This link shows 2 strategies to keep the order.
A queue is actually important for other reasons too. If a save request fails, it remains in the queue to be retried later. When dealing with players' money and items, you probably want this.
Your autosave is deterministic, meaning you know exactly when the last one occurred and when the next one will occur. I would use that somehow, along with the previously suggested idea of adding a timestamp. Actually, it might be better to make the updates represent only the increments/decrements along with a user timestamp, and calculate the experience on request (maybe caching it then).
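For what it's worth, the timestamp idea can be enforced in the UPDATE itself, so a stale save is simply rejected (a sketch; the table and column names are invented):

import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE player (id INTEGER PRIMARY KEY, xp INTEGER, saved_at REAL)")
conn.execute("INSERT INTO player VALUES (1, 0, 0)")

def save_player(player_id, xp, snapshot_time):
    # The WHERE clause rejects any snapshot older than what is stored,
    # so a delayed autosave can never clobber a newer logout save.
    cur = conn.execute(
        "UPDATE player SET xp = ?, saved_at = ? WHERE id = ? AND saved_at < ?",
        (xp, snapshot_time, player_id, snapshot_time))
    return cur.rowcount == 1

t_autosave = time.time()       # snapshot taken when the autosave fired
t_logout = t_autosave + 1      # later snapshot taken at logout

print(save_player(1, 150, t_logout))    # True: the logout save lands first
print(save_player(1, 100, t_autosave))  # False: the stale autosave is rejected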
To avoid this problem in all cases you must not allow users to continue doing stuff before their last database transaction has been successfully committed. Of course that means that the DB has to be very fast -- if it can't keep the request queue below a couple of seconds worth of transactions at most, you simply have to make it faster. More RAM cache, SSDs, the usual MySQL optimization dance. Adding extra logic in the form of triggers etc. isn't going to help in the long run, especially because they can become really complicated in the case of inventories and the like.
If on average the system is fast enough but struggling in peaks, like when everybody logs in during lunch break, adding something like Redis as a fast cache might help. You'd load the data into Redis when a user logs on (or when they first need a certain piece of data), remove it when they log off or when it expires, and write changes back to the relational DB as fast as it can keep up.

Moving a resource stored in database tables in two steps, using a 'reservation'

I need to architect a database and service. I have resources that I need to deliver to users, and the delivery takes some time or requires the user to do some more work.
These are the tables I store information into.
Table    - Description
_______________________
R        - stores the resources
RESERVE  - reserves requested resources
HACK     - tracks requests that couldn't have been made with my client application (statistics)
FAIL     - tracks requests that can't be resolved through no fault of the user (statistics)
SUCCESS  - tracks successful deliveries (statistics)
The first step, when a user requests a resource:
IF (condition1 is true - the user has the right to request the resource) THEN
    IF (I've successfully RESERVE-d the resource and committed the transaction) THEN
        nothing more to do
    ELSE
        save the request into FAIL
ELSE
    save the request into HACK
Then the second step:
IF (condition2 is true - the user has done his job and requests the reserved resource) THEN
    IF (the resource was delivered successfully) THEN
        save the request into SUCCESS
    ELSE
        save the request into FAIL
        depending on application logic, move the resource from RESERVE back to R or not
ELSE
    save the request into HACK and contact the user;
    if this really is a hacker, move the resource from RESERVE back to R
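In code, I picture the two steps roughly like this (a Python/SQLite sketch just to make the flow concrete; in the real system these transactions live in stored procedures, and I've guessed at the columns):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE R       (resource_id INTEGER PRIMARY KEY, stock INTEGER NOT NULL);
CREATE TABLE RESERVE (resource_id INTEGER, user_id INTEGER);
CREATE TABLE HACK    (user_id INTEGER, detail TEXT);
CREATE TABLE FAIL    (user_id INTEGER, reason TEXT);
CREATE TABLE SUCCESS (user_id INTEGER, resource_id INTEGER);
""")

def log(table, user_id, note):
    # HACK and FAIL are plain statistics tables: (user_id, text).
    conn.execute("INSERT INTO %s VALUES (?, ?)" % table, (user_id, note))
    conn.commit()

def step1_reserve(user_id, resource_id, has_right):
    if not has_right:                       # condition1 failed
        log("HACK", user_id, "no right to request")
        return False
    try:
        with conn:  # one transaction: commits on success, rolls back on error
            cur = conn.execute(
                "UPDATE R SET stock = stock - 1 "
                "WHERE resource_id = ? AND stock > 0", (resource_id,))
            if cur.rowcount == 0:
                raise RuntimeError("nothing left to reserve")
            conn.execute("INSERT INTO RESERVE VALUES (?, ?)",
                         (resource_id, user_id))
        return True
    except RuntimeError:
        log("FAIL", user_id, "could not reserve")
        return False

def step2_deliver(user_id, resource_id, did_job):
    if not did_job:                         # condition2 failed
        log("HACK", user_id, "requested delivery without doing the job")
        return
    with conn:  # deliver: drop the reservation and record the success
        conn.execute("DELETE FROM RESERVE WHERE resource_id = ? AND user_id = ?",
                     (resource_id, user_id))
        conn.execute("INSERT INTO SUCCESS VALUES (?, ?)", (user_id, resource_id))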
That is how I plan to implement the system. I've put the transactions into stored procedures, but the main application logic, where I decide which procedure to call, lives in the application/service layer.
Am I on the right track? Is such a division of code between the DB and service layers normal? Your experienced opinions are very important.
Clarifying and answering RecentCoin's questions:
1. The difference between the HACK and FAIL tables is that I store more information in the HACK table, like the user's IP and XFF header. I'm not going to penalize every user that appears in that table. There are two reasons a request can be tracked as a hack. The first is that I have a bug (mainly in the client app), and this will help me fix it. The second is that someone is making requests manually, trying to bypass the rules; if he tries 'harder', I'll be able to take some precautions.
2. The separation of the reserve and success tables has these reasons:
2.1. I use the reserve table in some transactions and queries without using the success table, so I can lock them separately.
2.2. The data stored in success will not slow down my queries while I'm querying the reserve table.
2.3. The success table is a kind of log for statistics, which I can delete or move to another database for future analysis.
2.4. I delete rows from reserve after I move them to success, so I can estimate the approximate maximum row count in that table, because I have a max limit on reservations per user.
Points 2.3 and 2.4 could also be achieved by keeping everything in one table.
So are reasons 2.1 and 2.2 good enough to keep the data separate?
3. The resource being "delivered successfully" means that the admin and the service have successfully done everything they could; if they couldn't, the reservation fails.
4 and 6. The restrictions and rights are simple; they are like city and country restrictions. The users are 'flat' and don't have any roles or hierarchy.
5. I have some tables to store users and their information. I don't have LDAP or AD.
You're going in the right direction, but there are some other things that need to be more clearly thought out.
1. You're going to have to define what constitutes a "hack" vs a "fail". Especially with new systems, users get confused, and it's pretty easy for them to make honest mistakes. This seems like something you want to penalize them for in some fashion, so I'd be extremely careful with it.
2. You will want to consider making "reserve" and "success" one and the same. Why store the same record twice? You should have a really compelling reason to do that.
3. You will need to define "delivered successfully", since that could be anything from an entry in a calendar to getting more pens and Post-it notes.
4. You will want to define your resources, as well as which user(s) have rights to them. For example, you may have a conference room that only managers are allowed to book, but you might want to include the managers' administrative assistants in that list, since they would be booking the room for the manager(s).
5. Do you have a database of users? LDAP or Active Directory, or will you need to create all of that yourself? If you do have LDAP or AD, can you use something like SAML?
6. You are going to want to consider how you want to assign those rights. Will they be group-based, where group membership confers the right to reserve, request, or use a given thing? For example, you may only want architects printing to the large-format printer.

How did Facebook or Twitter implement their subscribe system

I'm working on an SNS-like mobile app project, where users upload their content and can see updates from their subscribed topics or friends on their homepage.
I store user content in MySQL and build each user's homepage by first querying who and what the user subscribes to, then querying the content table with a 'WHERE userid IN (....) OR topic IN (....)' clause.
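In Python terms (my backend is PHP, but the shape is the same; table and column names are simplified), what I do today is roughly:

# cursor is assumed to be a PyMySQL-style cursor (%s placeholders); the
# sketch assumes the user subscribes to at least one user and one topic.
def homepage_feed(cursor, user_id):
    # First: who and what does this user subscribe to?
    cursor.execute("SELECT followee_id FROM user_follow WHERE follower_id = %s",
                   (user_id,))
    user_ids = [row[0] for row in cursor.fetchall()]
    cursor.execute("SELECT topic_id FROM topic_follow WHERE user_id = %s",
                   (user_id,))
    topic_ids = [row[0] for row in cursor.fetchall()]
    # Then: one big IN (...) query. This is the part I expect to get slow
    # as the content table grows or the subscription lists get long.
    sql = ("SELECT * FROM content WHERE userid IN ({}) OR topic IN ({}) "
           "ORDER BY created_at DESC LIMIT 50").format(
               ",".join(["%s"] * len(user_ids)),
               ",".join(["%s"] * len(topic_ids)))
    cursor.execute(sql, user_ids + topic_ids)
    return cursor.fetchall()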
I suspect this will become quite slow when the content table piles up or when a user subscribes to tons of users or topics. Our newly released app is already gaining thousands of new users each week, and more over time, so scalability must be a concern for us right now.
So I wonder how Facebook or Twitter handle this subscription problem with their amazing numbers of users. Do they keep a list for each user? I tried to search, but all I found was how to interact with Facebook or Twitter, rather than how they actually implement this feature.
I've noticed that on Facebook you see only updates, rather than history, in your feed. That means subscribing to a new user won't dump lots of outdated content into your feed, as it would with my current method.
How does Facebook design their database, and how do they dispatch new content to subscribed users?
My backend is currently PHP+MySQL, and I don't mind introducing other backend technologies such as Redis or JMS and stuff if that's the way it should be done.
Sounds like you guys are still at a pretty early stage. There are N ways to solve this, all depending on which level of DAUs you think you'll hit in the near term, how much money you have to spend on hardware, how much time you have to build it, etc.
You can try an interim table that queues up newly introduced items along with metadata on what they entail (which topic, the friend user_id list, etc.). Then use a queue-consumer system like RabbitMQ/Gearman to manage the consumption of this growing list and figure out who should process each item. Build the queue-consumer program in Scala or on a J2EE stack like Maven/Tomcat, something that can persist. If you really want to stick with PHP, build a PHP REST API that can live in php5-fpm's memory, managed by the FastCGI process manager, called via a proxy like nginx, and initiated by curl calls at an appropriate interval from a cron-executed script.
[EDIT] - It's probably better not to use a DB for a queueing system; use a cache server like Redis instead. It outperforms a DB in many ways, and it can persist to disk (look up RDB and AOF). It's not very fault-tolerant: if a job fails all of a sudden, you might lose the job record, but most likely you won't care about these crash edge cases. Also look up php-resque!
To prep the notifications to go out efficiently, I'm assuming you're already de-normalizing the tables. I'd imagine a "user_topic" table mapping each topic to the users who subscribed to it. Create another table, "notification_metadata", describing where users prefer to receive notifications (SMS/push/email/in-app notification) and the metadata needed to push to those channels (mobile client approval keys for APNS/GCM, email addresses, user auth tokens). Use JSON blobs for the two fields in notification_metadata so that each user has a single row. This saves I/O hits on the DB.
Use user_id as the primary key for "notification_metadata" and user_id + topic_id as the PK for "user_topic". DO NOT add an auto-increment "id" column to either; it's useless in this use case (it takes up space, CPU, index memory, etc.). With both fields in the PK, queries on user_topic will be served entirely from memory, and the only disk hit is on "notification_metadata" during the JOIN.
So if a user subscribes to 2 topics, there will be two entries in "user_topic", and each user will always have a single row in "notification_metadata".
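Spelled out (SQLite syntax for brevity; the column names are my guesses at what the two JSON blobs would hold):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_topic (
    user_id   INTEGER NOT NULL,
    topic_id  INTEGER NOT NULL,
    PRIMARY KEY (user_id, topic_id)   -- composite PK, no auto-increment id
);
CREATE TABLE notification_metadata (
    user_id   INTEGER PRIMARY KEY,    -- exactly one row per user
    channels  TEXT NOT NULL,          -- JSON blob: preferred channels
    push_keys TEXT NOT NULL           -- JSON blob: APNS/GCM keys, emails, tokens
);
""")

# Fan-out for a new item in topic 42: who subscribed, and how do we reach them?
rows = conn.execute("""
    SELECT ut.user_id, nm.channels, nm.push_keys
    FROM user_topic ut
    JOIN notification_metadata nm ON nm.user_id = ut.user_id
    WHERE ut.topic_id = ?
""", (42,)).fetchall()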
There are more ways to scale, like dynamically creating a new table for each new topic, sharding to different MySQL instances based on user_id, partitioning, etc. There are N ways to scale, especially in MySQL. Good luck!

Implementing message priority in AMQP

I'm intending to use AMQP to allow a distributed collection of machines to report to a central location asynchronously. The idea is to drop messages into the queue and allow the central logging entity to process the queue in a decoupled fashion; the 'process' is simply to create or update a row in a database table.
A problem that I'm anticipating is the effect of network jitter in the message queuing process - what happens if an update accidentally gets in front of an insert because the time between the two messages being issued is less than the network jitter?
Reading the AMQP spec, it seems that I could just apply a higher priority to inserts so they skip the queue and get processed first. But presumably this only applies if a queue actually exists at the broker to be skipped. Is there a way to impose a buffer or delay at the broker to absorb this jitter and allow priority to be enacted before the messages are passed on to the consumer(s)?
Or do I have to go down the route of a resequencer as ActiveMQ suggests?
The lack of ordering between multiple publishers has nothing to do with network jitter; it's a completely natural thing in distributed applications. Messages from the same publisher will always be ordered. If you really need causal ordering of actions performed by different nodes, then either a resequencer or a global sequence-numbering scheme are your only options. Note that you cannot use sender timestamps for this, which is what everyone seems to try first.
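To make the resequencer option concrete, here's a minimal sketch (my own, not from any library; it assumes a global sequence number is already stamped on each message):

import heapq

class Resequencer:
    # Buffers out-of-order messages and releases them in sequence order.
    # Assumes every message carries a number from one global counter,
    # not a sender timestamp.
    def __init__(self):
        self.next_seq = 0
        self.buffer = []

    def accept(self, seq, payload):
        heapq.heappush(self.buffer, (seq, payload))
        released = []
        # Release the longest in-order run now available.
        while self.buffer and self.buffer[0][0] == self.next_seq:
            released.append(heapq.heappop(self.buffer)[1])
            self.next_seq += 1
        return released

rs = Resequencer()
print(rs.accept(1, "update"))   # []: still waiting for seq 0
print(rs.accept(0, "insert"))   # ['insert', 'update']: the insert comes out first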