Scheduling periodic tasks to minimize clustering - language-agnostic

A large number of clients want to check in with a server about once an hour, and don't care exactly when. The server wants to talk to all of the clients, but doesn't want to get overloaded if too many clients check in at the same time.
How should the clients schedule their checkins to keep an even load on the server?
If there's been discussion or writing on this topic before (there probably has, but I don't know what to look for), a link to it may be as good as a direct answer.
Edit: I'm interested in theory, tips, and tricks. For instance, would introducing random jitter or drift into each of the clients' check-in schedule help or hurt?

If the clients chose their time within the hour using a reasonable random number generation, that ought to keep the load evenly distributed on average. Random numbers can have clustering, however. If the clients have to register/deregister with the server, the server could simply assign each a time slot to check in and then ensure an even distribution, but without some sort of coordination I don't think there's any real way for the clients to guarantee an even load.

Related

how to implement saved-searches scenario

what is saved-search?
Save is the mechanism users don't find their desired results in advanced search and just push "Save My Search Criteria bottom" and we save the search criteria and when corresponding data post to website we will inform the user "hey user, the item(s) you were looking for exists now come and visit it".
Saved Searches is useful for sites with complex search options, or sites where users may want to revisit or share dynamic sets of search results.
we have advanced search and don't need to implement new search, what we require is a good performance scenario to achieve saved-search mechanism.
we have a website that users post about 120,000 posts per day into the website and we are going to implement SAVED SEARCH scenario(something like this what https://www.gumtree.com/ do), it means users using advanced search but they don't find their desired content and just want to save the search criteria and if there will be any results in the website we inform them with notification.
We are using Elastic search and Mysql in our Website.We still, haven't implement anything and just thinking about it to find good solution which can handle high rate of date, in other hand **the problem is the scale of work, because we have a lot of posts per day and also we guess users use this feature a lot, So we are looking for good scenario which could handle this scale of work easy with high performance.
suggested solutions but not the best
one quick solution is we save the saved-searches in saved-search-index in Elastic then run a cronjob that for all saved-searches items get results from posts-index- Elastic and if there is any result push a record into the RabbitMq to notify the equivalent user.
on user post an item into the website we check it with exists saved-searches in saved-search-index in Elastic and if matched we put a record into the RabbitMq,( the main problem of this method is it could be matched with a huge number of saved-searches in every post inserted into the website).
My big concern is about scale and performance, I'll appreciate sharing your experiences and ideas about this problem with me.
My estimation about the scale
Expire date of saved-search is three month
at least 200,000 Saved-search Per day
So we have 9,000,000 active Records
I'll appreciate if you share your mind with me
*just FYI**
- we also have RabbitMQ for our queue jobs
- our ES servers are good enough with 64GB RAM
Cron job - No. Continual job - yes.
Why? As things scale, or as activity spikes, cron jobs become problematical. If the cron job for 09:00 runs too long, it will compete for resources with the 10:00 instance; this can cascade into a disaster.
At the other side, if a cron job finishes 'early', then the activity oscillates between "busy" (the cron job is doing stuff) and "not busy" (cron has finished, and not time for next invocation).
So, instead, I suggest a job that continually runs through all the "stored queries", doing them one at a time. When it finishes the list, is simply starts over. This completely eliminates my complaints about cron, and provides an automatic "elasticity" to handle busy/not-busy times -- the scan will slow down or speed up accordingly.
When the job finishes, the list, it starts over on the list. That is, it runs 'forever'. (You could use a simple cron job as a 'keep-alive' monitor that restarts it if it crashes.)
OK, "one job" re-searching "one at a time" is probably not best. But I disagree with using a queuing mechanism. Instead, I would have a small number of processes, each acting on some chunk of the stored queries. There are many ways: grab-and-lock; gimme a hundred to work on; modulo N; etc. Each has pros and cons.
Because you are already using Elasticsearch and you have confirmed that you are creating something like Google Alerts, the most straightforward solution would be Elasticsearch Percolator.
From the official documentation, Percolator is useful when:
You run a price alerting platform which allows price-savvy customers to specify a rule like "I am interested in buying a specific electronic gadget and I want to be notified if the price of gadget falls below $X from any vendor within the next month". In this case you can scrape vendor prices, push them into Elasticsearch and use its reverse-search (Percolator) capability to match price movements against customer queries and eventually push the alerts out to the customer once matches are found.
I can't say much when it comes to performance, because you did not provide any example of your queries but mostly because my findings are inconsistent.
According to this post (https://www.elastic.co/blog/elasticsearch-queries-or-term-queries-are-really-fast), Elasticsearch queries should be capable of reaching 30,000 queries/second.
However, this unanswered question (Elasticsearch percolate performance) reported a painfully slow 200 queries/second on a 16 CPU server.
With no additional information I can only guess that the cause is configuration problems, so I think you'll have to try a bunch of different configurations to get the best possible performance.
Good luck!
This answer was written without a true understanding of the implications of a "saved search". I leave it here as discussion of a related problem, but not as a "saved search" solution. -- Rick James
If you are saving only the "query", I don't see a problem. I will assume you are saving both the query and the "resultset"...
One "saved search" per second? 2.4M rows? Simply rerun the search when needed. The system should be able to handle that small a load.
Since the data is changing, the resultset will become outdated soon? How soon? That is, saving the resultset needs to be purged rather quickly. Surely the data is not so static that you can wait a month. Maybe an hour?
Actually saving the resultset and being able to replay it involves (1) complexity in your code, (2) overhead in caching, I/O, etc, etc.
What is the average number of times that the user will look at the same search? Because of the overhead I just mentioned, I suspect the average number of times needs to be more than 2 to justify the overhead.
Bottomline... This smells like "premature optimization". I recommend
Build the site without saving resultsets.
Stress test it to see when it will break.
Work on optimizing the slow parts.
As for RabbitMQ -- "Don't queue it, just do it". The cost of queuing and dequeuing is (1) increased latency for the user and (2) increased overhead on system. The benefit (at your medium scale) is minimal.
If you do hit scaling problems, consider
Move clients off to another server -- away from the database. This will give you some scaling, but not 2x. To go farther...
Use replication: One Master + many readonly Slaves -- and do the queries on the Slaves. This gives you virtually unlimited scaling in the database.
Have multiple web servers -- virtually unlimited scaling in this part.
I don't understand why you want to use saved-search... First: you should optimize service, so as to use as little as possible the saved-search.
Have you done anything with the ES server? (What can you afford), so:
Have you optimized elasticearch server? By default, it uses 1GB of RAM. The best solution is to give him half the machine RAM, but no more than 16GB (if I'm remember. Check doc)
How powerful is the ES machine? He likes core instead of MHZ.
How many ES nodes do you have? You can always add another machine to get the results faster.
In my case (ES 2.4), the server slows down after a few days, so I restart it once a day.
And next:
Why do you want to fire up tasks every half hour? If you already use cron, fire then every minute, and you indicate that the query is running. With the other the post you have a better solution and an explanation.
Why do you separate the result from the query?
Remember to standardize the query to change the order of the parameters, not to force a new query.
Why do you want to use MySQL to store results? The better document-type database, like Elasticsearch xD.
I propose you:
Optimize ES structure - choose right tokenisers for fields.
Use asynchronous loading of results - eg WebSocket + Node.js

What's the most efficient architecture for this system? (push or pull)

All s/w is Windows based, coded in Delphi.
Some guys submit some data, which I send by TCP to a database server running MySql.
Some other guys add a pass/fail to their data and update the database.
And a third group are just looking at reports.
Now, the first group can see a history of what they submitted. When the second group adds pass/fail, I would like to update their history. My options seem to be
blindly refresh the history regularly (in Delphi, I display on a DB grid so I would close then open the query), but this seems inefficient.
ask the database server regularly if anything changed in the last X minutes.
never poll the database server, instead letting it inform the user's app when something changes.
1 seems inefficient. 2 seems better. 3 reduces TCP traffic, but that isn't much. Anyway, just a few bytes for each 2. However, it has the disadvantage that both sides are now both TCP client and server.
Similarly, if a member of the third group is viewing a report and a member of either of the first two groups updates data, I wish to reflect this in the report. What it the best way to do this?
I guess there are two things to consider. Most importantly, reduce network traffic and, less important, make my code simpler.
I am sure this is a very common pattern, but I am new to this kind of thing, so would welcome advice. Thanks in advance.
[Update] Close voters, I have googled & can't find an answer. I am hoping for the beneft of your experience. Can you help me reword this to be acceptable? or maybe give a UTL which will help me? Thanks
Short answer: use notifications (option 3).
Long answer: this is a use case for some middle layer which propagates changes using a message-oriented middleware. This decouples the messaging logic from database metadata (triggers / stored procedures), can use peer-to-peer and publish/subscribe communication patterns, and more.
I have blogged a two-part article about this at
Firebird Database Events and Message-oriented Middleware (part 1)
Firebird Database Events and Message-oriented Middleware (part 2)
The article is about Firebird but the suggested solutions can be applied to any application / database.
In your scenarios, clients can also use the middleware message broker send messages to the system even if the database or the Delphi part is down. The messages will be queued in the broker until the other parts of the system are back online. This is an advantage if there are many clients and update installations or maintenance windows are required.
Similarly, if a member of the third group is viewing a report and a
member of either of the first two groups updates data, I wish to
reflect this in the report. What it the best way to do this?
If this is a real requirement (reports are usually a immutable 'snapshot' of data, but maybe you mean a view which needs to be updated while beeing watched, similar to a stock ticker) but it is easy to implement - a client just needs to 'subscribe' to an information channel which announces relevant data changes. This can be solved very flexible and resource-saving with existing message broker features like message selectors and destination wildcards. (Note that I am the author of some Delphi and Free Pascal client libraries for open source message brokers.)
Related questions:
Client-Server database application: how to notify clients that data was changed?
How to communicate within this system?
Each of your proposed solutions are all viable in certain situations.
I've been writing software for a long time and comments below relate to personal experience which dates way back to 1981. I have no doubt others will have alternative opinions which will also answer your questions.
Please allow me to justify the positives and negatives of each approach, and the parameters around each comment.
"blindly refresh the history regularly (in Delphi, I display on a DB grid so I would close then open the query), but this seems inefficient."
Yes, this is inefficient
Is often the quickest and simplest thing to do.
Seems like the best short-term temporary solution which gives maximum value for minimal effort.
Good for "exploratory coding" helping derive a better software design.
Should be a good basis to refine / explore alternatives.
It's very important for programmers to strive to document and/or share with team members who could be affected by your changes their team when a tech debt-inducing fix has been checked-in.
If not intended as production quality code, this is acceptable.
If usability is poor, then consider more efficient solutions, like what you've described below.
"ask the database server regularly if anything changed in the last X minutes."
You are talking about a "pull" or "polling" model. Consider the following API options for this model:
What's changed since the last time I called you? (client to provide time to avoid service having to store and retrieve seesion state)
If nothing has changed, server can provide a time when the client should poll again. A system under excessive load is then able to back-off clients, i.e if a server application has an awareness of such conditions, then it is therefore better able to control the polling rate of compliant clients, by instructing them to wait for a longer period before retrying.
After considering that, ask "Is the API as simple as it can possibly be?"
"never poll the database server, instead letting it inform the user's app when something changes."
This is the "push" model you're talking about- publishing changes, ready for subscribers to act upon.
Consider what impact this has on clients waiting for a push - timeout scenarios, number of clients, etc, System resource consumption, etc.
Consider that the "pusher" has to become aware of all consuming applications. If using industry standard messaging queueing systems (RabbitMQ, MS MQ, MQ Series, etc, all naturally supporting Publish/Subscribe JMS topics or equivalent then this problem is abstracted away, but also added some complexity to your application)
consider the scenarios where clients suddenly become unavailable, hypothesize failure modes and test the robustness of you system so you have confidence that it is able to recover properly from failure and consistently remain stable.
So, what do you think the right approach is now?

Database structure for storing Bank-like accounts and transactions

We're in the process of adding a bank-like sub-system to our own shop.
We already have customers, so each will be given a sort of account and transactions of some kind will be possible (adding to the account or subtracting from it).
So we at least need the account entity, the transaction one and operations will then have to recalculate overall balances.
How would you structure your database to handle this?
Is there any standard bank system have to use that I could mock?
By the way, we're on mysql but will also look at some nosql solution for performance boost.
I don't imagine you would need NoSQL for any speed boost, since it's unlikely to need much/any parallelism and not sure how non-schema-oriented you might need to be. Except when you start getting into complex business requirements for analysis across many million customers and hundreds of millions of transactions, like profitability, and even then that's kind of a data warehousing-style problem anyway which you probably wouldn't run on your transactional schema in the first place if it had gotten that large.
In relational designs, I would tend to avoid any design which requires balance-recalculation because then you end up with balance-repair programs etc. With proper indexing and a simple enough design, you can do a simple SUM on the transactions (positive and negative) to get a balance. With good consistent sign conventions on the transactions (no ambiguity on whether to add or subtract - always add the values) and appropriate constraints (with limited number of types of transactions, you can specify with constraints that all deposits are positive and all withdrawals are negative) you can let the database ensure there are no anomalies like negative deposits.
Even if you want to cache the balance in some way, you could still rely on such a simple mechanism augmented with a trigger on the transaction table to update an account summary table.
I'm not a big fan of putting any of this in a middle layer outside of the database. Your basic accounting should be fairly simple that it can be handled within the database engine at speed so that anyone or any part of the application executing a query is going to get the same answer without any client-code logic getting involved. And so the database ensures integrity at a level slightly above referential integrity (accounts with non-zero balance might not be allowed to be closed, balances might not be allowed to go negative etc) using a combination of constraints, triggers and stored procedures, in increasing order of complexity as required. I'm not talking about all your business logic, just prohibiting low-level situations you feel the database should never get into due to bad client programming or a failure to do things in the right order or call things with the right parameters.
In real banking (i.e. COBOL apps) typically the database schema (usually non-relational and non-normalized - a lot of these things predate SQL) you see a lot of things like 12 monthly buckets of past balances which are updated and shifted when the account rolls over. Some of the databases these systems use are kind of hierarchical. And this is where the code is really important, because everything gets done in code. Again, it's kind of old-fashioned and subject to all kinds of problems (i.e. probably a lot like what NatWest is going through) and NoSQL is a trend back towards this code-is-king way of looking at things. I just tend to think after a long time working with these things - I don't like having systems with balances cached and I don't like systems where you really don't have point-in-time accountability - i.e. you ignore transactions after a certain date and you can see EXACTLY what things looked like on a certain date/time.
I'm sure someone has "standard" patterns of bank-like database design, but I'm not aware of them despite having built several accounting-like systems over the years - accounts and transactions are just not that complex and once you get beyond that concept, everything gets highly customized.
For instance, in some cases, you might recognize earnings on contracts on some kind of schedule according to GAAP and contracts which are paid over time. In banking you have a lot of interest-related things with different interest rates for cost of funds etc. Everything just gets unique once you start mixing the business needs in with just the basics of the accounting of ins and outs of money.
You don't say whether or not you have a middle tier in your app, between the UI and the database. If you do, you have a choice as to where you'll mark transactions and recalculate balances. If this database is wholly owned by the one application, you can move the calculations to the middle tier and just use the database for persistence.
Spring is a framework that has a nice annotation-based way to declare transactions. It's based on POJOs; an alternative to EJBs. It's a three legged stool of dependency injection, aspect-oriented programming, and great libraries. Perhaps it can help you with both structuring and implementing your app.
If you do have a middle tier, and it's written in an object-oriented language, I'd recommend having a look at Martin Fowler's "Analysis Patterns". It's been around for a long time, but the chapter on financial systems is as good today as it was when it was first written.

DB design and optimization considerations for a social application

The usual case. I have a simple app that will allow people to upload photos and follow other people. As a result, every user will have something like a "wall" or an "activity feed" where he or she sees the latest photos uploaded from his/her friends (people he or she follows).
Most of the functionalities are easy to implement. However, when it comes to this history activity feed, things can easily turn into a mess because of pure performance reasons.
I have come to the following dilemma here:
i can easily design the activity feed as a normalized part of the database, which will save me writing cycles, but will enormously increase the complexity when selecting those results for each user (for each photo uploaded within a certain time period, select a certain number, whose uploaders I am following / for each person I follow, select his photos )
An optimization option could be the introduction of a series of threshold constraints which, for instance would allow me to order the people I follow on the basis of the date of their last upload, even exclude some, to save cycles, and for each user, select only the 5 (for example) last uploaded photos.
The second approach is to introduce a completely denormalized schema for the activity feed, in which every row represents a notification for one of my followers. This means that every time I upload a photo, the DB will put n rows in this "drop bucket", n meaning the number of people I follow, i.e. lots of writing cycles. If I have such a table, though, I could easily apply some optimization techniques such as clever indexing, as well as pruning entries older than a certain period of time (queue).
Yet, a third approach that comes to mind, is even a less denormalized schema where the server side application will take some part of the complexity off the DB. I saw that some social apps such as friendfeed, heavily rely on the storage of serialized objects such as JSON objects in the DB.
I am definitely still mastering the skill of scalable DB design, so I am sure that there are many things I've missed, or still to learn. I would highly appreciate it if someone could give me at least a light in the right direction.
If your application is successful, then it's a good bet that you'll have more reads than writes - I only upload a photo once (write), but each of my friends reads it whenever they refresh their feed. Therefore you should optimize for fast reads, not fast writes, which points in the direction of a denormalized schema.
The problem here is that the amount of data you create could quickly get out of hand if you have a large number of users. Very large tables are hard on the db to query, so again there's a potential performance issue. (There's also the question of having enough storage, but that's much more easily solved).
If, as you suggest, you can delete rows after a certain amount of time, then this could be a good solution. You can reduce that amount of time (up to a point) as you grow and run into performance issues.
Regarding storing serialized objects, it's a good option if these objects are immutable (you won't change them after writing) and you don't need to index them or query on them. Note that if you denormalize your data, it probably means that you have a single table for the activity feed. In that case I see little gain in storing blobs.
If you're going the serialized objects way, consider using some NoSQL solution, such as CouchDB - they're better optimized for handling that kind of data, so in principle you should get better performance for the same hardware setup.
Note that I'm not suggesting that you move all your data to NoSQL - only for that part where it's a better solution.
Finally, a word of caution, spoken from experience: building an application that can scale is hard and takes time better spent elsewhere. You should spend your times worrying about how to get millions of users to your app before you worry about how you're going to serve those millions - the first is the more difficult problem. When you get to the point that you're hugely successful, you can re-architect and rebuild your application.
There are many options you can take
Add more hardware, Memory, CPU -- Enter cloud hosting
Hows 24GB of memory sound? Most of your importantly accessed DB information can fit just in memory.
Choose a host with expandable SSDs.
Use an events based system in your application to write the "history" of all users. So it will be like so: id, user_id, event_name, date, event_parameters' -- an example would be: 1, 8, CHANGED_PROFILE_PICTURE, 26-03-2011 12:34, <id of picture> and most important of all, this table will be in memory. No longer need to worry about write performance. After the records go past i.e. 3 days they can be purged into another table (in non-memory) and included into the query results, if the user chooses to go back that far. By having all this in one table you remove having to do multiple queries and SELECTs to build up this information.
Consider using INNODB for the history/feeds table.
Good Resources to read
Exploring the software behind Facebook, the world’s largest site
Digg: 4000% Performance Increase by Sorting in PHP Rather than MySQL
Caching & Performance: Lessons from Facebook
I would probably start with using a normalized schema so that you can write quickly and compactly. Then use non transactional (no locking) reads to pull the information back out making sure to use a cursor so that you can process the results as they're coming back as opposed to waiting for the entire result set. Since it doesn't sound like the information has any particular critical implications you don't really need to worry about a lock of the concerns that would normally push you away from transactional reads.
These kind of problems are why currently NOSql solutions used these days. What I did in my previos projecs is really simple. I don't keep user->wall user->history which contains purely feed'ids in memory stores(my favorite is redis). so in every insert I do 1 insert operation on database and (n*read optimization) insert operation in memory store. I design memory store to optimize my reads. if I want to filter user history (or wall) for videos I put a push feedid to a list like user::{userid}::wall::videos.
Well ofcourse you can purely build the system in memstores aswell but its nice to have 2 systems doing what they are doing the best.
edit :
checkout these applications to get an idea:
http://retwis.antirez.com/
http://twissandra.com/
I'm reading more and more about NoSQL solutions and people suggesting them, however no one ever mentions drawbacks of such choice.
Most obvious for me is lack of transactions - imagine if you lost a few records every now and then (there are cases reporting this happens often).
But, what I'm surprised with is that no one mentions MySQL being used as NoSQL - here's a link for some reading.
In the end, no matter what solution you choose (relational database or NoSQL storage), they scale in similar manner - by sharding data across network (naturally, there are more choices but this is the most obvious one). Since NoSQL does less work (no SQL layer so CPU cycles aren't wasted on interpreting SQL), it's faster, but it can hit the roof too.
As Elad already pointed out - building an app that's scalable from the get go is a painful process. It's better that you spend time focusing on making it popular and then scale it out.

How To Interpret Siege and or Apache Bench Results

We have a MySQL driven site that will occasionally get 100K users in the space of 48 hours, all logging into the site and making purchases.
We are attempting to simulate this kind of load using tools like Apache Bench and Siege.
While the key metric seems to me number of concurrent users, and we've got our report results, we still feel like we're in the dark.
What I want to ask is: What kinds of things should we be testing to anticipate this kind of traffic?
50 concurrent users 1000 Times? 500 concurrent users 10 times?
We're looking at DB errors, apache timeouts, and response times. What else should we be looking at?
This is a vague question and I know there is no "right" answer, we're just looking for some general thoughts on how to determine what our infrastructure can realistically handle.
Thanks in advance!
Simultaneous users is certainly one of the key factors - especially as that applies to DB connection pools, etc. But you will also want to verify that the page rate (pages/sec) of your tests is also in the range you expect. If the the think-time in your testcases is off by much, you can accidentally simulate a much higher (or lower) page rate than your real-world traffic. Think time is the amount of time the user spends between page requests - reading the page, filling out a form, etc.
Depending on what other information you have on hand, this might help you calculate the number of simultaneous users to simulate:
Virtual User Calculators
The complete page load time seen by the end-user is usually the most important metric to evaluate system performance. You'll also want to look for failure rates on all transactions. You should also be on the lookout for transactions that never complete. Some testing tools do not report these very well, allowing simulated users to hang indefinitely when the server doesn't respond...and not reporting this condition. Look for tools that report the number of users waiting on a given page or transaction and the average amount of time those users are waiting.
As for the server-side metrics to look for, what other technologies is your app built on? You'll want to look at different things for a .NET app vs. a PHP app.
Lastly, we have found it very valuable to look at how the system responds to increasing load, rather than looking at just a single level of load. This article goes into more detail.
Ideally you are going to want to model your usage to the user, but creating simulated concurrent sessions for 100k users is usually not easily accomplished.
The best source would be to check out your logs for the busiest hour and try and figure out a way to model that load level.
The database is usually a critical piece of infrastructure, so I would look at recording the number and length of lock waits as well as the number and duration of db statements.
Another key item to look at is disk queue lengths.
Mostly the process is to look for slow responses either in across the whole site or for specific pages and then hone in on the cause.
The biggest problem for load testing is that is quite hard to test your network and if you have (as most public sites do) a limited bandwidth through your ISP, that may create a performance issue that is not reflected in the load tests.