Simulating load with Wireshark packet capture (not HTTP) - stress testing

We have a production environment that is fairly heavily stressed, to the point that some connection requests are dropped. We've pushed the connection backlog up to 100 but are still seeing some dropped connections (100 is well in excess of what we think the configuration should ever produce).
What I would like to do is take a large Wireshark capture and then set up one or more machines that can play it back, so we can replicate the situation in a non-production environment. Right now we cannot reproduce this problem at all, and I was thinking that if we could use multiple machines in our lab to test with real production data, we might be able to "replay" the packet capture.
The problem is that I don't have time to write a playback tool myself, so I'm hoping someone knows of an existing tool that handles playback and rewrites the bits and pieces of the packets that would have to change (port numbers, etc.).
Wireshark isn't a requirement; it's just the first tool that comes to mind, since it's what we have on the switch the machines are connected to.
Anyone know of anything that would allow network traffic simulation from a packet capture?

You didn't mention what kind of connections/traffic, so I'll assume HTTP for now.
The advantage of approaching this problem with a packet capture tool is that you don't need to understand the traffic pattern because it will EXACTLY duplicate the incoming network traffic that was recorded. The downside is that it will EXACTLY duplicate the incoming network traffic :( You've already grokked the fact that some of this stuff probably needs to be different - but figuring out what is what at the packet layer, and changing it, is going to be very difficult (depending on the type of traffic you need to model). The more complex the workload, the more difficult it will be to duplicate it. If it is a semi-sophisticated web app, you're facing a difficult challenge. What you need is a load testing tool.
If the load is primarily web traffic (HTTP), then you have lots of options. I'll offer our Load Tester LITE product, which is free and can generate massive amounts of load (despite the name) for relatively simple workloads.
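For a sense of scale, even a small script can push a backlog of 100 past its limit. A minimal load-generation sketch in Python — the target URL, worker count, and per-worker request count are placeholders, not values from the question above:

```python
# Minimal HTTP load-generation sketch; all constants are hypothetical.
import concurrent.futures
import time
import urllib.request

TARGET = "http://test-env.example.com/"   # placeholder lab endpoint
WORKERS = 50                              # concurrent simulated clients
REQUESTS_PER_WORKER = 200

def worker(_):
    ok = errors = 0
    for _ in range(REQUESTS_PER_WORKER):
        try:
            with urllib.request.urlopen(TARGET, timeout=5) as resp:
                resp.read()
            ok += 1
        except OSError:
            errors += 1   # refused/reset connections often indicate backlog overflow
    return ok, errors

start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=WORKERS) as pool:
    results = list(pool.map(worker, range(WORKERS)))
elapsed = time.time() - start

ok = sum(r[0] for r in results)
errors = sum(r[1] for r in results)
print(f"{ok} ok, {errors} errors in {elapsed:.1f}s ({(ok + errors) / elapsed:.0f} req/s)")
```

Watching the error count climb as you raise WORKERS gives you a rough picture of where the backlog saturates.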

Related

Volume or frequency limitations of SQL Server Database Mail

I've created a nightly sync between two database applications for a small construction company and set up simple notifications using Database Mail to let a few people know whether the load was successful. Now that they've seen this notification working, I've been asked to provide status updates to their clients as employees make changes to work orders throughout the day.
I've done some research and understand DB Mail is not designed for this type of feature, but I'm thinking the frequency will be small enough not to be a problem. I'm estimating 50-200 emails per day.
I couldn't find anything on the actual limitations of DB Mail, and I'm wondering if anyone has tried something similar in the past, or if I could be pushed in the right direction to send these emails using best practice.
If we're talking hundreds here, you can definitely go ahead. Take a peek at the Database Mail MSDN page. The current design (i.e. anything post-SQL 2000) was built specifically for large, high-performance enterprise implementations. Built on top of Service Broker (SQL Server's message-queuing bus), it offers both asynchronous processing and scalability, with process isolation, clustering, and failover. One caveat is increased transaction log pressure, as messages, unlike in some other implementations, are ACID-protected by SQL Server, which in turn gives you full recoverability of the queues in case of failure.
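For reference, queueing a message is a single call to sp_send_dbmail, which returns as soon as the message is queued; Service Broker delivers it asynchronously. A sketch of calling it from a Python sync job via pyodbc — the DSN, profile name, and recipient are placeholders:

```python
# Sketch: queueing a status email through Database Mail from a sync job.
# The DSN, profile name, and recipient below are placeholders.
import pyodbc

conn = pyodbc.connect("DSN=ConstructionDB", autocommit=True)
conn.execute(
    """
    EXEC msdb.dbo.sp_send_dbmail
        @profile_name = ?,
        @recipients   = ?,
        @subject      = ?,
        @body         = ?
    """,
    "SyncNotifications",            # a Database Mail profile you have configured
    "client@example.com",
    "Work order updated",
    "An employee changed the work order status today.",
)
conn.close()
```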
If you're wondering what Service Broker can handle before migrating to a dedicated solution, there's a great MySpace case study. The most interesting fragment:
“We didn’t want to start down the road of using Service Broker unless we could demonstrate that it could handle the levels of messages that we needed to support our millions of users across 440 database servers,” says Stelzmuller. “When we went to the lab we brought our own workloads to ensure the quality of the testing. We needed to see if Service Broker could handle loads of 4,000 messages per second. Our testing found it could handle more than 18,000 messages a second. We were delighted that we could build our solution using Service Broker, rather than creating a custom solution on our own.”

Measuring delay caused by using a remote server from a different country

For logistical reasons, I've started using a database server that is physically located far from the main Apache server (not in the same farm, no internal network, not even in the same country).
I understand this isn't recommended but, as I said, logistical reasons.
The question is: how can I measure how problematic this really is for my specific architecture? Is there a way to break a query down into network time versus actual processing time? Also, if you had to estimate, how much of a delay in page loads would you expect such a distant database server to cause?
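One way to get that breakdown, sketched in Python under the assumption of a MySQL backend (host, credentials, and the sample query are placeholders): time a no-op query such as SELECT 1, which is almost pure network round-trip, and subtract it from the timing of a real query.

```python
# Sketch: separating network round-trip from server-side query time.
# Host, credentials, and the sample query are placeholders.
import time
import pymysql

conn = pymysql.connect(host="db.far-away.example.com",
                       user="app", password="secret", database="app")

def timed(cursor, sql):
    start = time.perf_counter()
    cursor.execute(sql)
    cursor.fetchall()
    return (time.perf_counter() - start) * 1000  # milliseconds

with conn.cursor() as cur:
    rtt = min(timed(cur, "SELECT 1") for _ in range(10))   # ~pure network round-trip
    total = timed(cur, "SELECT * FROM orders LIMIT 100")   # a real query (placeholder)

print(f"round-trip ~ {rtt:.1f} ms, query total = {total:.1f} ms, "
      f"server-side ~ {total - rtt:.1f} ms")
# A page that runs N sequential queries pays roughly N x round-trip in
# latency alone: at 100 ms RTT, 20 queries already cost ~2 s before any work.
```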

Implementing dynamically updating upvote/downvote

How do you implement a dynamically updating vote count similar to Quora's, where whenever a user upvotes an answer, the change is automatically reflected for everyone viewing that page?
I am looking for an answer that addresses the following:
Do we have to keep polling for upvote counts for every answer? If yes, how do we manage the server load arising from so many users polling for upvotes?
Or should we use websockets/push notifications, and how scalable are these?
How should the upvote/downvote counts be stored (database/in-memory) to support this, and how is the number of reads/writes controlled? My backend database is MySQL.
The answer I am looking for may not be exactly how Quora does it, but rather how this can be done using available open-source technologies.
It's not the back-end system details you need to worry about, but the front end. Keeping a connection open all the time is impractical at any real scale; instead you want the opposite - to serve and close connections from the back end as fast as you can.
WebSockets are a sexy technology but, in the real world, there are issues with proxies; and if you are developing something that should work on a variety of screens (desktop, tablet, mobile), that may become a concern for you. Even good old long polling might not work through some firewalls and proxies.
Here is the good news: I think
"keep polling for upvote counts for every answer"
is a perfectly good solution in this case. Consider the following:
your use case does not need truly real-time updates; there is little harm in seeing the counter update a bit later
for very popular topics you would want to squash multiple up-votes/down-votes into one anyway
most topics will see no up-vote/down-vote traffic at all for days or weeks, so keeping a connection open, waiting for an event that never comes, is a waste
most users who just come to read a topic will never up-vote/down-vote, so your read/write ratio for topic stats will be heavily skewed toward reads
network latency varies hugely across clients; you will see horrible transfer rates even for a 100-byte HTTP response, and while a sluggish client fetches its response byte by byte, your precious server connection and, more importantly, a thread on the back-end server stay busy
Here is what I'd start with:
have browsers periodically poll for a new topic stat, after the main page loads
keep your MySQL and keep the counters there; every time there is an up/down vote, update the DB
put Memcached in front of the DB as a write-through cache, i.e. every time there is an up/down vote, update the cache, then update the DB. Set an explicit expiry time of 10-15 minutes for each counter; every time a counter is updated, its expiry is extended automatically (see the write-path sketch below)
design these polling HTTP calls to be cacheable by HTTP proxies; set the Expires/Cache-Control headers to 60 seconds (see the endpoint sketch after this list)
put a reverse proxy (Varnish, nginx) in front of your front-end servers and have it cache those polling calls. This takes care of the second-level cache and helps free up back-end server threads more quickly; see the network latency concern above
set up your reverse proxy to talk to the memcached servers directly, without making a call to the back-end server; yes, you can do that with both Varnish and nginx
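A minimal sketch of that polling endpoint - Flask and pymemcache are assumptions here, any stack works the same way; the Cache-Control header is what lets the proxies absorb most of the polls:

```python
# Sketch of the cacheable polling endpoint; framework and client library
# (Flask, pymemcache) are assumptions, as is the key scheme.
from flask import Flask, jsonify
from pymemcache.client.base import Client

app = Flask(__name__)
cache = Client(("127.0.0.1", 11211))

def load_count_from_mysql(topic_id):
    # placeholder for the real DB read, e.g. SELECT votes FROM topics WHERE id = ...
    return 0

@app.route("/topics/<int:topic_id>/votes")
def votes(topic_id):
    count = cache.get(f"votes:{topic_id}")
    if count is None:                                       # cache miss: fall back to DB
        count = load_count_from_mysql(topic_id)
        cache.set(f"votes:{topic_id}", str(count), expire=900)  # 15 min expiry
    resp = jsonify(votes=int(count))
    resp.headers["Cache-Control"] = "public, max-age=60"    # proxies cache it for 60 s
    return resp
```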
there is no fancy schema for storing such data; it's a simple inc()/dec() operation in memcached, which is safe from a race-condition point of view. It's an equally safe, atomic operation in MySQL: UPDATE table SET field = field + 1 WHERE [...]
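The write path then looks something like this sketch - table and column names are made up, and pymysql/pymemcache stand in for whatever client libraries you use:

```python
# Write-path sketch for one up/down vote (schema names are placeholders).
import pymysql
from pymemcache.client.base import Client

cache = Client(("127.0.0.1", 11211))
db = pymysql.connect(host="127.0.0.1", user="app", password="secret",
                     database="app", autocommit=True)

def vote(topic_id: int, delta: int) -> None:
    key = f"votes:{topic_id}"
    # incr/decr execute atomically inside memcached, so concurrent votes can't
    # clobber each other; if the key has expired, the next read repopulates it.
    if delta >= 0:
        cache.incr(key, delta)
    else:
        cache.decr(key, -delta)
    # The DB side is equally race-safe: the increment happens inside MySQL.
    with db.cursor() as cur:
        cur.execute("UPDATE topics SET votes = votes + %s WHERE id = %s",
                    (delta, topic_id))
```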
Aggressive multi-level caching covers your read path: in Memcached and in all the HTTP caches along the way; note that these HTTP poll requests will be cached at the edges as well.
To take care of the long tail of unpopular topics, make the HTTP TTL for such responses inversely proportional to popularity.
A read request will only rarely reach the front-end server: when the HTTP cache has expired and memcached doesn't have the counter either. If that is still a problem, add memcached servers and increase the memcached expiry time across the board.
Once you're done with that, all the reads are taken care of. The only problem you might still have, depending on scale, is a high rate of writes, i.e. the flow of up/down votes. This is where a single MySQL instance might start to lag. Fear not - proceed along the well-beaten path of sharding your instances, or add a NoSQL store just for the counters.
Do not use any messaging system unless absolutely necessary or you want an excuse to play with it.
WebSockets, Server-Sent Events (I think that's what you meant by 'push notifications') and AJAX long polling share the same drawback - they keep the underlying TCP connection open for a long time.
So the question is how many open TCP connections can a server handle.
Basically, it depends on the OS, the number of file descriptors (a config parameter), and available memory (each open connection reserves read/write buffers).
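For example, on a Unix-like OS you can check and raise the per-process descriptor limit before load-testing (a sketch; the hard limit itself is OS configuration):

```python
# Sketch: inspecting and raising the per-process file-descriptor limit on a
# Unix-like OS; each open TCP connection consumes one descriptor.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")

# A process may raise its soft limit up to the hard limit without privileges;
# raising the hard limit itself requires root or an OS-level config change.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```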
We once tested the possibility of keeping 1 million WebSocket connections open on a single server (Windows 7 x64 with 16 GB of RAM, JVM 1.7 with 8 GB of heap, using an Undertow beta to serve web requests).
Surprisingly, the hardest part was generating the load on the server :)
It managed to hold 1M, but then again the server wasn't doing anything useful: it just received requests, went through the protocol upgrade, and kept those connections open.
There was also some number of lost connections, for whatever reason; we didn't investigate. In production you would also have to ping the server and handle reconnection.
Apart from that, WebSockets seem like overkill here, and SSE still isn't widely adopted.
So I would go with good old AJAX polling, but optimize it as much as possible.
It works everywhere, is simple to implement and tweak, has no reliance on an external system (I've had bad experiences with that several times), and leaves room for optimization.
For instance, you could group updates for all open articles into a single request per browser, or adjust the update interval according to how popular an article is.
After all, it doesn't seem like you need real-time notifications here.
Sounds like you might be able to use a messaging system like Kafka, RabbitMQ, or ActiveMQ. Your front end would send votes to a message channel and receive them with a listener, and you could have a server-side piece persist the votes to the DB periodically.
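A sketch of that shape with kafka-python - the topic name, batch threshold, and the flush_to_mysql helper are all made up for illustration:

```python
# Sketch: web tier produces vote events; a separate consumer folds them
# together and persists batches to MySQL. Topic name and batch size are
# hypothetical; flush_to_mysql is a placeholder.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda v: json.dumps(v).encode())

def on_vote(answer_id: int, delta: int) -> None:
    # called by the web tier whenever someone votes; +1 upvote, -1 downvote
    producer.send("votes", {"answer_id": answer_id, "delta": delta})

def flush_to_mysql(counts):
    # placeholder: one atomic UPDATE ... SET votes = votes + n per answer id
    pass

def run_consumer():
    # separate process: aggregate vote events, flush to the DB in batches
    consumer = KafkaConsumer("votes", bootstrap_servers="localhost:9092",
                             value_deserializer=json.loads)
    pending = {}
    for msg in consumer:
        vote = msg.value
        pending[vote["answer_id"]] = pending.get(vote["answer_id"], 0) + vote["delta"]
        if len(pending) >= 100:    # arbitrary batch size
            flush_to_mysql(pending)
            pending.clear()
```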
You could also accomplish this by polling your database and incrementing/decrementing a number related to a post via a stored proc... there are a bunch of options here, and it depends on how much concurrency you may be facing.

Reducing Ruby on Rails traffic to the database

The networking team has flagged our Ruby on Rails application as one of the top producers of network traffic on our network, specifically packet traffic between the app server and the database server (MySQL).
What are the recommended best practices to reduce traffic between a Rails app and the database? Persistent database connections?
Is it an actual problem, or do they ding the top 3 db consumers no matter what? Check your logs or have them supply you with a log of queries that they think are problematic.
Beyond that, check whether you're doing bad things like making model calls from your views in loops. Your logs should tell you what's going on here; if you see each partial paired with a query every time it's rendered, that's a big sign that the logic should be pulled back into the models and controllers.
Fire up Wireshark or another network scanner and look for the biggest packets, or for small packets that come too frequently, to identify the specific troublesome queries.
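If it's easier to work from a one-off capture than to watch live traffic, something along these lines ranks connections by volume (scapy and MySQL's default port 3306 are assumptions; the capture filename is a placeholder):

```python
# Sketch: ranking app-to-DB conversations in a capture by bytes transferred.
# Assumes MySQL on port 3306; the capture filename is a placeholder.
from collections import Counter
from scapy.all import rdpcap, TCP

totals = Counter()
for pkt in rdpcap("app-to-db.pcap"):
    if TCP in pkt and 3306 in (pkt[TCP].sport, pkt[TCP].dport):
        # key on the client-side port, i.e. one entry per connection
        client = pkt[TCP].dport if pkt[TCP].sport == 3306 else pkt[TCP].sport
        totals[client] += len(pkt)

for port, nbytes in totals.most_common(10):
    print(f"client port {port}: {nbytes} bytes")
```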
Then, before even considering caching, check if that query can really be cached or if it just pulls too much data you are not using.
At this point there are too many different possible causes, each with its own recommended practices.

Scaling up from 1 Web Server + 1 DB Server

We are a Web 2.0 company that built a hosted content management solution from the ground up using LAMP. In short, people log into our backend to manage their website content and then use our API to extract that content. This API gets plugged into templates that can be hosted anywhere on the interwebs.
Scaling for us has progressed as follows:
Shared hosting (1and1)
Dedicated single server hosting (Rackspace)
1 Web Server, 1 DB Server (Rackspace)
1 Backend Web Server, 1 API Web Server, 1 DB Server
Memcache, caching, caching, caching.
The question is, what's next for us? Every time one of our sites is dugg or mentioned on a popular website, our API server gets crushed with too many connections. And every time our DB server gets overrun with queries, our web server's requests back up.
This is obviously the 'next problem' for any company like ours, and I was wondering if you could point me in some directions.
I am currently attracted to virtualization solutions (like EC2) but need some pointers on what to consider.
What/where/how to scale is dependent on what your issues are. Since you've been hit a few times, and you know it's the API server, you need to identify what's actually causing the issue.
Is it DB lookup times?
A volume of requests that the web server just can't handle, even though they're short-lived?
Do API requests take too long to process (independent of DB lookups, e.g. does the code itself take a while to run)?
Once you identify WHAT the problem is, you should have a pretty clear picture of what you need to do. If it's just volume of requests, and it's the API server, you just need more web servers (and code changes to allow horizontal scaling) or a beefier web server. If it's API requests taking too long, you're looking at code optimizations. There's never a 1-shot fix when it comes to scalability.
The most common scaling issues have to do with slow (2-3 second) execution of the actual code for each request, which in turn leads to more web servers, which leads to more database interactions (for cross-server sessions, etc.), which leads to database performance issues. The answer is high-performance, server-independent code with memcache (I actually prefer a wrapper around memcache so the application doesn't know or care where it gets the data from, just that it gets it; the translation layer handles the DB/memcache lookups as well as populating memcache).
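That wrapper idea is a classic cache-aside layer; a sketch in Python, where pymemcache, the key scheme, and the TTL are assumptions:

```python
# Sketch of the translation-layer idea: callers ask the wrapper, and the
# wrapper decides between memcache and the DB. Library and TTL are assumptions.
import json
from pymemcache.client.base import Client

cache = Client(("127.0.0.1", 11211))

def get_data(key, loader, ttl=300):
    """loader is any callable that fetches the value from the DB on a miss."""
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)                   # served from memcache
    value = loader()                                # cache miss: hit the database once
    cache.set(key, json.dumps(value), expire=ttl)   # repopulate for later readers
    return value

# Usage: the application never knows (or cares) which store answered.
# user = get_data(f"user:{user_id}", lambda: load_user_from_db(user_id))
```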
It really depends on whether your bottleneck is reads or writes. Scaling writes is much harder than scaling reads.
It also depends on how much data you have in the database.
If your database is small but cannot cope with the read load, you can deploy enough RAM that it fits entirely in memory. If it still cannot cope, you can add read replicas, possibly on the same boxes as your web servers; this will give you good read scalability - the number of slaves one MySQL master can feed is quite high and depends chiefly on the write workload.
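App-level read/write splitting for that setup can be as simple as this sketch (hostnames and credentials are placeholders):

```python
# Sketch of read/write splitting for one master plus read replicas.
# Hostnames and credentials are placeholders.
import random
import pymysql

MASTER = "db-master.example.com"
REPLICAS = ["db-replica1.example.com", "db-replica2.example.com"]
CREDENTIALS = dict(user="app", password="secret", database="app")

def connect(for_write):
    # writes must hit the master; reads are spread across the replicas
    host = MASTER if for_write else random.choice(REPLICAS)
    return pymysql.connect(host=host, autocommit=True, **CREDENTIALS)

# Reads tolerate a little replication lag; anything read-after-write
# sensitive should use connect(for_write=True) and go to the master.
```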
If you need to scale writes, that's a totally different game. To do that you'll need to split your data out, either horizontally (partitioning/sharding) or vertically (functional partitioning, etc.), so that you can spread the workload over several write servers which do not need to do each other's work.
I'm not sure what EC2 can do for you; it essentially offers slow, high-latency machines with non-persistent disks and low I/O performance, backed by a more-or-less nonexistent SLA. I guess it might be useful in your case, since you can provision machines relatively quickly - provided you're just using them as read replicas and you don't have too much data (remember: non-persistent disks and sucky I/O).
What level of scaling are you looking for? Is this a stop-gap solution, e.g. scaling vertically? If it's a more strategic scaling project, does your current architecture support scaling horizontally?