Stress test cases for web application - stress-testing

What other stress test cases are there besides finding the maximum number of users that can log into the web application before performance degrades and it eventually crashes?

This question is hard to answer thoroughly, since it's too broad.
That said, many stress tests depend on the type and execution flow of your workload. There's an entire subject (taught as a graduate course) dedicated to queueing theory and resource optimization. Most of it can be summarized as follows:
if you have a resource (be it a GPU, CPU, memory bank, mechanical or solid-state disk, etc.), it can serve a number of users/requests per second and takes an X amount of time to complete one unit of work. Make sure you don't exceed its limits.
Some systems can also be studied with a probabilistic approach (Little's Law is one of the most fundamental results in these cases).
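As a quick illustration of how Little's Law feeds into a stress test: it says L = λW, i.e. the average number of requests in the system equals the arrival rate times the average time each request spends in the system. A minimal sketch (the numbers are made up):

```python
# Little's Law: L = lambda * W
#   L = average number of requests concurrently in the system
#   lambda (arrival_rate) = requests arriving per second
#   W (avg_time_in_system) = seconds a request spends queued + served

def concurrent_requests(arrival_rate, avg_time_in_system):
    """Average concurrency implied by Little's Law."""
    return arrival_rate * avg_time_in_system

# Example: 200 req/s arriving, each taking 0.5 s end to end
# => on average 100 requests are in flight at once.
load = concurrent_requests(arrival_rate=200, avg_time_in_system=0.5)
print(load)  # 100.0
```

If the server can only hold, say, 80 concurrent requests, the law tells you the queue will grow without bound at that arrival rate, long before you observe the crash empirically.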

There are a lot of reasons for load/performance testing, many of which may not be important to your project goals. For example:
- What is the performance of a system at a given load? (load test)
- How many users can the system handle while still meeting a specific set of performance goals? (load test)
- How does the performance of a system change over time under a certain load? (soak test)
- When will the system crash under increasing load? (stress test)
- How does the system respond to hardware or environment failures? (stress test)
I've got a post on some common motivations for performance testing that may be helpful.

You should also check out your web analytics data and see what people are actually doing.
It's not enough to simply simulate X number of users logging in. Find the scenarios that represent the most common user activities (anywhere between 2 and 20 scenarios).
Also, make sure you're not just hitting your cache on reads. Add some randomness / diversity to the requests.
I've seen stress tests where all the users were requesting the same data, which won't give you real-world results.
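One cheap way to add that diversity is to randomise the parameters of each simulated request so that consecutive virtual users don't hit the same cached entry. A sketch, assuming a hypothetical `/products/<id>` endpoint and a made-up pool of test accounts:

```python
import random

random.seed(7)  # fixed seed so the run is reproducible

USER_IDS = range(1, 10_001)     # hypothetical pool of test accounts
PRODUCT_IDS = range(1, 50_001)  # hypothetical product catalogue

def next_request():
    """Build one randomised request spec for the load generator."""
    return {
        "user_id": random.choice(USER_IDS),
        "path": f"/products/{random.choice(PRODUCT_IDS)}",
    }

# Generate a batch and check we aren't hammering a single cached URL.
batch = [next_request() for _ in range(1000)]
distinct_paths = {r["path"] for r in batch}
assert len(distinct_paths) > 900  # requests are spread across the catalogue
```

The same idea applies to any parameter the cache keys on: user IDs, search terms, date ranges, and so on.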

Related

Reinforcement learning target data

I've got a question regarding reinforcement learning. Let's say we have a robot that is able to adapt to changing environments, similar to this paper 1. When there is a change in the environment [light dimming], the robot's performance drops and it needs to explore its new environment by collecting data and running the Q-algorithm again to update its policy so it can "adapt". Collecting the new data and updating the policy takes about 4-5 hours. I was wondering: if I have an army of these robots in the same room, undergoing the same environmental changes, can the data collection be sped up so that the policy can be updated more quickly - in under an hour or so - allowing the robots' performance to recover?
I believe you are talking about scaling learning horizontally as in training multiple agents in parallel.
A3C is one algorithm that does this by training multiple agents in parallel and independently of each other. Each agent has its own environment, which allows it to gain a different experience from the rest of the agents, ultimately increasing the breadth of your agents' collective experience. Each agent then asynchronously updates a shared network, and you use this network to drive your main agent.
You mentioned that you wanted to use the same environment for all parallel agents. I can think of this in two ways:
If you are talking about a shared environment among agents, then this could possibly speed things up, but you are unlikely to gain much in terms of performance. You are also very likely to face issues with episode completion - if multiple agents are taking steps simultaneously, then your transitions will be a mess, to say the least. The complexity cost is high and the benefit is negligible.
If you are talking about cloning the same environment for each agent then you end up both gaining speed and a broader experience which translates to performance. This is probably the sane thing to do.
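A minimal sketch of that second option: each worker gets its own clone of the environment and collects transitions independently, so wall-clock collection time drops roughly linearly with the number of workers. The toy `Env` here is a stand-in for the real robot environment, not anything from the paper:

```python
import random
from concurrent.futures import ThreadPoolExecutor

class Env:
    """Toy stand-in for the robot's environment (a random walk)."""
    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.state = 0

    def step(self, action):
        reward = self.rng.random()
        self.state += action
        return self.state, reward

def collect(seed, n_steps=100):
    """One worker: clone the environment and gather transitions."""
    env = Env(seed)  # each worker has its OWN clone
    transitions, state = [], 0
    for _ in range(n_steps):
        action = env.rng.choice([-1, 1])
        next_state, reward = env.step(action)
        transitions.append((state, action, reward, next_state))
        state = next_state
    return transitions

# Four parallel workers -> 4x the experience per unit of wall-clock time,
# and different seeds give each worker a different trajectory.
with ThreadPoolExecutor(max_workers=4) as pool:
    batches = list(pool.map(collect, range(4)))
total = sum(len(b) for b in batches)
print(total)  # 400 transitions ready for a shared policy update
```

In the real setting each "worker" is a physical robot, and the batches are merged into one replay buffer (or pushed as asynchronous gradient updates, A3C-style) before the policy update.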

How to understand a role of a queue in a distributed system?

I am trying to understand the use case of a queue in a distributed system.
Also, how does it scale, and how does it avoid being a single point of failure in the system?
Any direct answer or a reference to a document is appreciated.
Use case:
I understand that a queue is a messaging system, and that it decouples the systems that communicate with each other. But is that the only point of using a queue?
Scalability:
How does a queue scale for high volumes of data, both reads and writes?
Reliability:
How does a queue avoid becoming a single point of failure in the system? Does it replicate, similar to data storage?
My question is not specific to any particular queue server like Kafka or JMS; it's about queues in general.
A queue is a mental concept; the implementation decides all three of your points.
A1: No, it is not the only role -- messaging seems to be the main one, but distributed-system signalling is another, by no means any less important. Hoare's seminal CSP paper is a flagship in this field. Recent decades have produced many more options and "smart behaviours" to work with when designing a distributed system's signalling / messaging infrastructure.
A2: Scaling envelopes depend a lot on the implementation. It seems obvious that broker-less queues can work far faster than a centralised, broker-based infrastructure. Transport classes and transport links add latency and performance degradation as data-flow volumes grow. BLOB handling is another performance cliff, as the inefficiencies accumulate down the distributed processing chain. Even zero-copy, (almost) zero-latency smart-queue implementations remain victims of operating-system and similar resource limitations.
A3: Oh sure it is, if left on its own, a SPOF. However, theoretical cybernetics comes to the rescue here: we can create reliable systems while still using error-prone components. (M + N)-failure-resilient schemes are thus achievable; however, budget + creativity + design discipline form the ceiling any such project has to live within.
my take:
I would be careful with the term "decouple" - if service A calls an API on service B, there is coupling, since there is a contract between the services; this is true even if the communication happens over a queue, a file, or a fax. The key with queues is that the communication between services is asynchronous, which means their runtimes are decoupled - from a practical point of view, either system may go down without affecting the other.
Queues can scale for large volumes of data by partitioning. From the client's point of view there is one queue, but in reality there are many queues/shards, and the number of shards determines how much data can be supported. Of course, sharding a queue is not "free" - you lose the global ordering of events, which may need to be addressed in your application.
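A sketch of how that partitioning typically works: the producer hashes a message key to pick a shard, which preserves per-key ordering while spreading load across shards (the shard count and the in-memory lists are illustrative stand-ins for physical queues):

```python
NUM_SHARDS = 4
# Each inner list stands in for one physical queue/shard.
shards = [[] for _ in range(NUM_SHARDS)]

def publish(key: str, message: str) -> int:
    """Route a message to a shard by hashing its key."""
    shard_id = hash(key) % NUM_SHARDS  # same key -> same shard -> per-key order
    shards[shard_id].append(message)
    return shard_id

# Messages for one user always land on the same shard, in order...
a = publish("user-42", "created")
b = publish("user-42", "updated")
print(a == b)  # True
# ...but messages for different keys may land on different shards,
# so there is no global ordering across the whole queue.
```

This is the trade-off the answer mentions: you keep ordering only within a partition key, and consumers scale out by each owning a subset of the shards.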
A good queue-based solution is reliable thanks to replication/consensus/etc., depending on the set of desired properties. Queues are not very different from databases in this regard.
To give you more direction to dig into:
there is an interesting family of queue features around delivery guarantees: deliver-exactly-once, deliver-at-most-once, etc.
may I recommend Enterprise Integration Patterns - https://www.enterpriseintegrationpatterns.com/patterns/messaging/Messaging.html - this is a good "system design" level of information
queues may participate in distributed transactions, e.g. you could delete a record from a database and write it into a queue, and that will either be committed or rolled back as a whole - another interesting topic to explore

ELI5: How etcd really works and what is consensus algorithm

I am having a hard time grasping what etcd (in CoreOS) really does, because all this "distributed key-value storage" business seems intangible to me. Reading further into etcd, it delves into the Raft consensus algorithm, and then it becomes really confusing to understand.
Let's put it this way: what happens if a cluster system doesn't have etcd?
Thanks for your time and effort!
As someone with no CoreOS experience building a distributed system using etcd, I think I can shed some light on this.
The idea with etcd is to provide some very basic primitives that are applicable to building a wide variety of distributed systems. The reason for this is that distributed systems are fundamentally hard. Most programmers don't really grok the difficulties, simply because there are orders of magnitude more opportunities to learn about single-system programs; this has only started to shift in the last 5 years, since cloud computing made distributed systems cheap to build and experiment with. Even so, there's a lot to learn.
One of the biggest problems in distributed systems is consensus: in other words, guaranteeing that all nodes in a system agree on a particular value. Now, if hardware and networks were 100% reliable then it would be easy, but of course that is impossible. Designing an algorithm that provides meaningful guarantees around consensus is a very difficult problem, and one that a lot of smart people have put a lot of time into. Paxos was the previous state-of-the-art algorithm, but it was very difficult to understand. Raft is an attempt to provide similar guarantees while being much more approachable to the average programmer. Even so, as you have discovered, it is non-trivial to understand its operational details and applications.
In terms of what etcd is specifically used for in CoreOS I can't tell you. But what I can say with certainty is that any data which needs to be shared and agreed upon by all machines in a cluster should be stored in etcd. Conversely, anything that a node (or subset of nodes) can handle on its own should emphatically not be stored in etcd (because it incurs the overhead of communicating and storing it on all nodes).
With etcd it's possible to have a large number of identical machines automatically coordinate, elect a leader, and guarantee an identical history of data in its key-value store such that:
No etcd node will ever return data which is not agreed upon by the majority of nodes.
For a cluster of size x, any number of machines > x/2 can continue operating and accepting writes even if the others die or lose connectivity.
Any machines that lose connectivity (e.g. due to a netsplit) are guaranteed to continue to return correct historical data, even though they will fail to accept writes.
The key-value store itself is quite simple and nothing particularly interesting, but these properties allow one to construct distributed systems that resist individual component failure and can provide reasonable guarantees of correctness.
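The majority guarantees above are just arithmetic over the cluster size. A tiny sketch of the quorum math for a cluster of x nodes:

```python
def quorum(cluster_size: int) -> int:
    """Smallest majority: more than half the nodes."""
    return cluster_size // 2 + 1

def tolerated_failures(cluster_size: int) -> int:
    """How many nodes can die while the rest keep accepting writes."""
    return cluster_size - quorum(cluster_size)

# Typical etcd clusters use odd sizes: adding one node to an odd cluster
# raises the quorum without tolerating any additional failures.
for x in (3, 4, 5):
    print(x, quorum(x), tolerated_failures(x))
# 3 -> quorum 2, tolerates 1 failure
# 4 -> quorum 3, tolerates 1 failure (no better than 3 nodes)
# 5 -> quorum 3, tolerates 2 failures
```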
etcd is a reliable system for cluster-wide coordination and state management. It is built on top of Raft.
Raft gives etcd a total ordering of events across a system of distributed etcd nodes. This has many advantages and disadvantages:
Advantages include:
any node may be treated like a master
minimal downtime (a client can try another node if one isn't responding)
avoids split-brain scenarios
a reliable way to build distributed locks for cluster-wide coordination
users of etcd can build distributed systems without ad-hoc, buggy, homegrown solutions
For example: You would use etcd to coordinate an automated election of a new Postgres master so that there remains only one master in the cluster.
Disadvantages include:
for safety reasons, it requires a majority of the cluster to commit writes - usually to disk - before replying to a client
requires more network chatter than a single master system
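The leader-election use mentioned above (picking a single Postgres master) boils down to an atomic compare-and-set on a well-known key. A toy in-memory stand-in for what a client would do against a consistent store like etcd (a real client would also attach a lease/TTL so a dead leader's key expires):

```python
class ToyStore:
    """In-memory stand-in for a consistent key-value store like etcd."""
    def __init__(self):
        self.data = {}

    def compare_and_set(self, key, expected, new):
        """Atomically set key to new, but only if it currently equals expected."""
        if self.data.get(key) == expected:
            self.data[key] = new
            return True
        return False

def try_become_leader(store, node_id):
    # Succeeds only if no leader is currently recorded under the key.
    return store.compare_and_set("postgres/leader", None, node_id)

store = ToyStore()
print(try_become_leader(store, "node-a"))  # True: node-a wins the election
print(try_become_leader(store, "node-b"))  # False: there is already a leader
```

The point of running this against etcd rather than a local dict is exactly the Raft guarantee: every node in the cluster agrees on which compare-and-set won.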

Comparing speeds of different tasks in different languages

If I wanted to test out the speeds it takes for certain tasks to be done, would it matter what language I did the test in? We can consider this to be any job a programmer might want to perform. Simple jobs such as sorting, or more complicated jobs which involve the signing and verification of files.
We all know that certain languages run faster than others, which means the timings will depend on the language and on how its compilers / runtimes are optimised. But these will obviously all be different.
So is it best to rely on a language with less abstraction, such as C, or is it OK to test jobs and tasks in higher-level languages and trust that they are implemented well enough not to worry about possible inefficiencies? I hope my question is clear.
It doesn't matter which real-world language you use for your tests... if you design the tests correctly. There is always the overhead of the language, but there is also the overhead of the operating system, the thread scheduler, I/O, RAM, or even the speed of electricity depending on the current temperature, etc.
But to compare anything, you don't want to measure how many nanoseconds it took to do one assignment statement. Instead, you want to measure how many hours it takes to do billions of assignment statements; then all the mentioned overheads are negligible.
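That advice translates directly into how you structure a micro-benchmark: time a large batch of operations and divide, rather than timing a single call. A minimal sketch using only the standard library (the workload and iteration count are arbitrary):

```python
import time

def bench(fn, iterations=100_000):
    """Time `iterations` calls of fn and return seconds per call."""
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / iterations

# Timing one call would mostly measure clock resolution and scheduler noise;
# amortised over many calls, those fixed overheads become negligible.
per_call = bench(lambda: sorted([3, 1, 2]))
print(per_call > 0)  # True
```

The same loop structure ports to any language under test, which keeps the comparison apples-to-apples.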

"Work stealing" vs. "Work shrugging"?

Why is it that I can find lots of information on "work stealing" and nothing on "work shrugging" as a dynamic load-balancing strategy?
By "work-shrugging" I mean pushing surplus work away from busy processors onto less loaded neighbours, rather than have idle processors pulling work from busy neighbours ("work-stealing").
I think the general scalability should be the same for both strategies. However, I believe it is much more efficient, in terms of latency & power consumption, to wake an idle processor when there is definitely work for it to do, rather than have all idle processors periodically polling all their neighbours for possible work.
Anyway, a quick Google search didn't turn up anything under the heading of "Work Shrugging" or similar, so any pointers to prior art and the jargon for this strategy would be welcome.
Clarification
I actually envisage the work submitting processor (which may or may not be the target processor) being responsible for looking around the immediate locality of the preferred target processor (based on data/code locality) to decide if a near neighbour should be given the new work instead because they don't have as much work to do.
I don't think the decision logic would require much more than an atomic read of the immediate (typically 2 to 4) neighbours' estimated queue lengths. I do not think this is any more coupling than is implied by the thieves polling & stealing from their neighbours. (I am assuming "lock-free, wait-free" queues in both strategies.)
Resolution
It seems that what I meant (but only partially described!) as the "Work Shrugging" strategy falls in the domain of "normal" upfront scheduling strategies that happen to be smart about processor, cache & memory loyalty, and scalable.
I find plenty of references searching on these terms and several of them look pretty solid. I will post a reference when I identify one that best matches (or demolishes!) the logic I had in mind with my definition of "Work Shrugging".
Load balancing is not free; it has a cost of a context switch (to the kernel), finding the idle processors, and choosing work to reassign. Especially in a machine where tasks switch all the time, dozens of times per second, this cost adds up.
So what's the difference? Work-shrugging means you further burden over-provisioned resources (busy processors) with the overhead of load-balancing. Why interrupt a busy processor with administrivia when there's a processor next door with nothing to do? Work stealing, on the other hand, lets the idle processors run the load balancer while busy processors get on with their work. Work-stealing saves time.
Example
Consider: Processor A has two tasks assigned to it. They take time a1 and a2, respectively. Processor B, nearby (the distance of a cache bounce, perhaps), is idle. The processors are identical in all respects. We assume the code for each task and the kernel is in the i-cache of both processors (no added page fault on load balancing).
A context switch of any kind (including load-balancing) takes time c.
No Load Balancing
The time to complete the tasks will be a1 + a2 + c. Processor A will do all the work, and incur one context switch between the two tasks.
Work-Stealing
Assume B steals a2, incurring the context switch time itself. The work will be done in max(a1, a2 + c) time. Suppose processor A begins working on a1; while it does that, processor B will steal a2 and avoid any interruption in the processing of a1. All the overhead on B is free cycles.
If a2 + c ≤ a1 - that is, if a2 was the shorter task by a sufficient margin - you have effectively hidden the cost of a context switch in this scenario; the total time is simply a1.
Work-Shrugging
Assume B completes a2, as above, but A incurs the cost of moving it ("shrugging" the work). The work in this case will be done in max(a1, a2) + c time; the context switch is now always in addition to the total time, instead of being hidden. Processor B's idle cycles have been wasted, here; instead, a busy processor A has burned time shrugging work to B.
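Plugging numbers into the three scenarios above makes the difference concrete (the times are arbitrary units chosen for illustration):

```python
def no_balancing(a1, a2, c):
    return a1 + a2 + c       # A does everything, one switch between tasks

def work_stealing(a1, a2, c):
    return max(a1, a2 + c)   # idle B pays the switch cost in free cycles

def work_shrugging(a1, a2, c):
    return max(a1, a2) + c   # busy A pays the switch cost up front

a1, a2, c = 10, 6, 2
print(no_balancing(a1, a2, c))    # 18
print(work_stealing(a1, a2, c))   # 10 (the switch is hidden: a2 + c <= a1)
print(work_shrugging(a1, a2, c))  # 12
```

With a2 + c ≤ a1, work-stealing completely hides the context switch behind a1, while work-shrugging always adds c on top of the critical path.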
I think the problem with this idea is that it makes the threads with actual work to do waste their time constantly looking for idle processors. Of course there are ways to make that faster, like have a queue of idle processors, but then that queue becomes a concurrency bottleneck. So it's just better to have the threads with nothing better to do sit around and look for jobs.
The basic advantage of 'work stealing' algorithms is that the overhead of moving work around drops to 0 when everyone is busy. So there's only overhead when some processor would otherwise have been idle, and that overhead cost is mostly paid by the idle processor with only a very small bus-synchronization related cost to the busy processor.
Work stealing, as I understand it, is designed for highly-parallel systems, to avoid having a single location (single thread, or single memory region) responsible for sharing out the work. In order to avoid this bottleneck, I think it does introduce inefficiencies in simple cases.
If your application is not so parallel that a single point of work distribution causes scalability problems, then I would expect you could get better performance by managing it explicitly as you suggest.
No idea what you might google for though, I'm afraid.
Some issues... if a thread is busy, wouldn't you want it spending its time processing real work instead of speculatively looking for idle threads to offload onto?
How does your thread decide when it has so much work that it should stop doing that work to look for a friend that will help?
How do you know that the other threads don't have just as much work and you won't be able to find a suitable thread to offload onto?
Work stealing seems more elegant, because it solves the same problem (contention) in a way that guarantees that the threads doing the load balancing are only doing load balancing while they would otherwise have been idle.
It's my gut feeling that what you've described will not only be much less efficient in the long run, but will also require lots of per-system tweaking to get acceptable results.
Though in your edit you suggest that you want the submitting processor to handle this, not the worker threads as you suggested earlier and in some of the comments here. If the submitting processor is searching for the lowest queue length, you're potentially adding latency to the submit, which isn't really desirable.
But more importantly it's a supplementary technique to work-stealing, not a mutually exclusive technique. You've potentially alleviated some of the contention that work-stealing was invented to control, but you still have a number of things to tweak before you'll get good results, these tweaks won't be the same for every system, and you still risk running into situations where work-stealing would help you.
I think your edited suggestion, with the submission thread doing "smart" work distribution, is potentially a premature optimization versus work-stealing. Are your idle threads slamming the bus so hard that your non-idle threads can't get any work done? That is when it's time to optimize work-stealing.
So, by contrast with "Work Stealing", what is really meant here by "Work Shrugging" is a normal upfront work-scheduling strategy that is smart about processor, cache & memory loyalty, and scalable.
Searching on combinations of the terms / jargon above yields many substantial references to follow up. Some address the added complication of machine virtualisation, which wasn't in fact a concern of the questioner, but the general strategies are still relevant.