How to Load Test a Web Service From a Text File of URLs - stress-testing

I am working on load testing a web service where request are of the form:
GET http://host/my/app/some-data
Where some-data is a string that serves as input to the logic behind the service. Now I have 1 million urls with random values for some-data, and now I want to try to simulate load with those 1M urls.
GET http://host/my/app/some-data_1
GET http://host/my/app/some-data_2
...
GET http://host/my/app/some-data_1e6
I dont know how to do that and have not made any substantial progress towards the goal. How do I do this?

That's a pretty trivial task for good load testing software, so I'm curious why you haven't had any success. You didn't mention what level of load you are trying to test - if you are looking for a very high level of concurrency, then that may rule out many of the tools.
So: I'd start by researching some load testing tools that can handle the level of concurrency you need and are within your budget.

Related

Is using a WebServer for executing methods remotely a good architectural decision?

A very small explanation of what i am talking about is presented below.
In my organization a strange kind of effort for load distributing is
implemented, i will attempt to explain it below.
A JBoss server runs in a machine with some IP xxx.xxx.xxx.xxx
There is an app which has many heavy time & resource consuming work
to do (heavy in terms of I/O operation ex- Large file uploads &
downloads - in Gigabytes)
Methods written in a Rest WebApp accessible by the URLs & parameters
to methods passed as URL params, set to the type POST, return value
of the methods as JSON output.
The App in question has a framework written just to make post calls
to the aforementioned WebApp wrapped nicely so that the calls can be
made without the programmer knowing what happens in the background.
This framework has externalized parameters which takes in the IP of
the machines running the WebApps and configure a list of such
machines available & route the method calls to the one least busy
whenever a method call to the framework is made.
Everything looks good but i suspect wrapping things in a http websever & doing processing there & sending the outputs in json may slow down things & collecting the logs in case of failures maybe tough.
Questions
I want to know the views of other programmers on this & whether this is a nice approach or not.
Also do any existing commercial applications follow something similar while trying to distribute the load?
In my opinion, the efficiency of load balancing lies in the tricky process of choosing the right node that will process your request.
A usual approach can be to monitor the CPU usage of the nodes that have to take the load and send the load to the one least busy. This process should be both accurate and efficient.
In your case, wrapping up of request and transfer of Json data should be a thing of least concern as they seem to be required activities and light ones too. Focus should be on the load balancing activity.
Answer to comment
If I understood correctly, http methodserver are there to process the requests from the central app. Load balancing will not be done by them. There has to be a central load balancing mechanism/tool that has to decide which methodserver will take which request.
All the methodservers (slaves) should be identical. That reduces maintainability effort as a fix can be done on one node and can be propagated to other nodes. This is how it is done in my organization.
It may be a requirement if a similar operation is done repeatedly, application server may reduce some duplicity here.

real time data web page

I am new to programming and am working on pushing real time data from a PLC to a web page either by deploying HTML 5 on the WAGO or a Modbus driver wrapper. I honestly have tried to research but don't know where to start. it will be a closed private network with little to no influence from the outside web. I am simply looking to display a single piece of live information for proof of concept. basically I'm trying to custom design a Groov program.
You might want to look into using OPC. Kepware & SoftwareToolbox are just 2 of many vendors that offer tools to help you get your data the way you want it.
There is an existing tool to do what you want, but I am under the impression you have to write one from scratch. The existing tool is http://www.softwaretoolbox.com/cogentdatahub/ if you are interested in looking at it for ideas.
I've been able to interface with PLC using python and modbusTCP and an Raspberry pi as the webserver. Python is a quick and easy to learn language. Websockets are the HTML5 component best used for realtime data.
simple connect code (after you install everything):
from pymodbus.client.sync import ModbusTcpClient as ModbusClient
from time import sleep
client = ModbusClient('ip_address_of_modbus_IO')
if(client.connect()):
print(client.read_discrete_inputs(200,1).bits[0])
client.write_coil(0,True)
sleep(100)
client.write_coil(2,True)
found here:
http://simplyautomationized.blogspot.com/2013/09/home-automation-project-2-rpi-light.html
Can create a websocket broadcast server using an example here:
http://simplyautomationized.blogspot.com/2015/09/raspberry-pi-create-websocket-api.html
Fortunately you can not push data to a browser.
The Internet would become an even greater mess if you could.
To solve this, have your webpage contain a timer, written in JavaScript.
Every say 1 sec. it does an AJAX request (e.g. use jQuery implementation) to the server, which then delivers (almost) realtime data.
The webpage then displays that in some DOM element, e.g. an empty DIV.
So it's the browser polling your server.
#BlueDog
The data is "almost" realtime because sampling once a second gives a delay of at least one second. In the ideal case, as soon as data changes, it would be pushed to the browser. Unfortunately the browser has no way of knowing that anything changed, so the best it can do is frequently "ask" for updates (polling).
How much the delay is depends on your poll frequency. If it's once per second one has to add the delays for transmission of the page request and the reply of the server. The transmission time depends on your network (which may be the Internet with all uncertainty involved). If the backbones involved have enough capacity I expect overall delay to be between 1 and 1.5 seconds. With a dedicated network and even more frequent polling, I expect that 0.5 seconds should be possible. These are however estimated averages. If I request a page over the Internet and my provider (again) has a problem, it may be hours before I receive what I want. Also things like virus scanners and OS updates may spoil your game.
So, practically: with a good broadband connection, a stable browser and the right process priorities it should be possible to get below 1 second overall delay (incl. poll time interval) for 95% of the time. Be prepared to reboot the client every few days. Most browsers leak memory and most OS'es do so too.

Mvvmcross - Best way to retrieve updates

I'm not after any code in particular but I want to know what is the most efficient way to build a function that will constantly check for updates for things such as messages e.g. Have a chat conversation window and I want live updates such as Facebook.
Currently I have implemented it by putting a while loop in my core code that checks if the view is currently visible run a Task every 5 seconds to get new messages. This works but I don't believe its the most efficient way to do it and I need to consider battery life. *Note I do change visibility when the view goes away e.g. on iOS i do
public override ViewDidDissapper {
Model.SetVisible(false)
}
Has anyone implemented some sort of polling on a cross platform app?
There are many different possible solutions here - which one you prefer depends a lot on your requirements in terms of latency, reliability, efficiency, etc - and it depends on how much you can change server side.
If your server is fixed as a normal http server, then frequent polling may be your best route forwards, although you could choose to modify the 5 seconds occasionally when you think updates aren't likely.
One step up from this is that you could try long polling http requests within your server.
Another step beyond that are using Socket (TCP, UDP or websocket) communications to provide "real time" messaging.
And in parallel to these things, you could also consider using PUSH notifications both within your app and in the background.
Overall, this is a big topic - I'd recommend reading up about PushSharp from #Redth and about SignalR from Microsoft - #gshackles has some blog posts about using this in Xamarin. Also, services like AzureMobileServices, UrbanAirship, Buddy, Parse, etc may help

Is it possible to do asynchronous / parallel database query in a Django application?

I have web pages that take 10 - 20 database queries in order to get all the required data.
Normally after a query is sent out, the Django thread/process is blocked waiting for the results to come back, then it'd resume execution until it reaches the next query.
Is there's any way to issue all queries asynchronously so that they can be processed by the database server(s) in parallel?
I'm using MySQL but would like to hear about solutions for other databases too. For example I heard that Postgresql has an async client library - how would I use that in this case?
This very recent blog entry seems to imply that it's not built in to either the django or rails frameworks. I think it covers the issue well and is quite worth a read along with the comments.
http://www.eflorenzano.com/blog/post/how-do-we-kick-our-synchronous-addiction/ (broken link)
I think I remember Cal Henderson mentioning this deficiency somewhere in his excellent speech http://www.youtube.com/watch?v=i6Fr65PFqfk
My naive guess is you might be able to hack something with separate python libraries but you would lose a lot of the ORM/template lazy evaluation stuff django gives to the point you might as well be using another stack. Then again if you are only optimizing a few views in a large django project it might be fine.
I had a similar problem and I solved it with javascript/ajax
Just load the template with basic markup and then do severl ajax requsts to execute the queries and load the data. You can even show loading animation. User will have a web 2.0 feel instead of just gloomy page loading. Ofcourse, this means several more HTTP requests per page, but it's up to you to decide.
Here is how my example looks: http://artiox.lv/en/search?query=test&where_to_search=all (broken link)
Try Celery, there's a bit of a overhead of having to run a ampq server, but it might do what you want. Not sure about concurrency of the DB tho. Also, if you want speed for your DB, I'd recommend MongoDB (but you'll need django-nonrel for that).

How do web spiders differ from Wget's spider?

The next sentence caught my eye in Wget's manual
wget --spider --force-html -i bookmarks.html
This feature needs much more work for Wget to get close to the functionality of real web spiders.
I find the following lines of code relevant for the spider option in wget.
src/ftp.c
780: /* If we're in spider mode, don't really retrieve anything. The
784: if (opt.spider)
889: if (!(cmd & (DO_LIST | DO_RETR)) || (opt.spider && !(cmd & DO_LIST)))
1227: if (!opt.spider)
1239: if (!opt.spider)
1268: else if (!opt.spider)
1827: if (opt.htmlify && !opt.spider)
src/http.c
64:#include "spider.h"
2405: /* Skip preliminary HEAD request if we're not in spider mode AND
2407: if (!opt.spider
2428: if (opt.spider && !got_head)
2456: /* Default document type is empty. However, if spider mode is
2570: * spider mode. */
2571: else if (opt.spider)
2661: if (opt.spider)
src/res.c
543: int saved_sp_val = opt.spider;
548: opt.spider = false;
551: opt.spider = saved_sp_val;
src/spider.c
1:/* Keep track of visited URLs in spider mode.
37:#include "spider.h"
49:spider_cleanup (void)
src/spider.h
1:/* Declarations for spider.c
src/recur.c
52:#include "spider.h"
279: if (opt.spider)
366: || opt.spider /* opt.recursive is implicitely true */
370: (otherwise unneeded because of --spider or rejected by -R)
375: (opt.spider ? "--spider" :
378: (opt.delete_after || opt.spider
440: if (opt.spider)
src/options.h
62: bool spider; /* Is Wget in spider mode? */
src/init.c
238: { "spider", &opt.spider, cmd_boolean },
src/main.c
56:#include "spider.h"
238: { "spider", 0, OPT_BOOLEAN, "spider", -1 },
435: --spider don't download anything.\n"),
1045: if (opt.recursive && opt.spider)
I would like to see the differences in code, not abstractly. I love code examples.
How do web spiders differ from Wget's spider in code?
A real spider is a lot of work
Writing a spider for the whole WWW is quite a task --- you have to take care about many "little details" such as:
Each spider computer should receive data from a few thousand servers in parallel in order to make efficient use of the connection bandwidth. (asynchronous socket i/o).
You need several computers that spider in parallel in order to cover the vast amount of information on the WWW (clustering; partitioning the work)
You need to be polite to the spidered web sites:
Respect the robots.txt files.
Don't fetch a lot of information too quickly: this overloads the servers.
Don't fetch files that you really don't need (e.g. iso disk images; tgz packages for software download...).
You have to deal with cookies/session ids: Many sites attach unique session ids to URLs to identify client sessions. Each time you arrive at the site, you get a new session id and a new virtual world of pages (with the same content). Because of such problems, early search engines ignored dynamic content. Modern search engines have learned what the problems are and how to deal with them.
You have to detect and ignore troublesome data: connections that provide a seemingly infinite amount of data or connections that are too slow to finish.
Besides following links, you may want to parse sitemaps to get URLs of pages.
You may want to evaluate which information is important for you and changes frequently to be refreshed more frequently than other pages. Note: A spider for the whole WWW receives a lot of data --- you pay for that bandwidth. You may want to use HTTP HEAD requests to guess whether a page has changed or not.
Besides receiving, you want to process the information and store it. Google builds indices that list for each word the pages that contain it. You may need separate storage computers and an infrastructure to connect them. Traditional relational data bases don't keep up with the data volume and performance requirements of storing/indexing the whole WWW.
This is a lot of work. But if your target is more modest than reading the whole WWW, you may skip some of the parts. If you just want to download a copy of a wiki etc. you get down to the specs of wget.
Note: If you don't believe that it's so much work, you may want to read up on how Google re-invented most of the computing wheels (on top of the basic Linux kernel) to build good spiders. Even if you cut a lot of corners, it's a lot of work.
Let me add a few more technical remarks on three points
Parallel connections / asynchronous socket communication
You could run several spider programs in parallel processes or threads. But you need about 5000-10000 parallel connections in order to make good use of your network connection. And this amount of parallel processes/threads produces too much overhead.
A better solution is asynchronous input/output: process about 1000 parallel connections in one single thread by opening the sockets in non-blocking mode and use epoll or select to process just those connections that have received data. Since Linux kernel 2.4, Linux has excellent support for scalability (I also recommend that you study memory-mapped files) continuously improved in later versions.
Note: Using asynchronous i/o helps much more than using a "fast language": It's better to write an epoll-driven process for 1000 connections written in Perl than to run 1000 processes written in C. If you do it right, you can saturate a 100Mb connection with processes written in perl.
From the original answer:
The down side of this approach is that you will have to implement the HTTP specification yourself in an asynchronous form (I am not aware of a re-usable library that does this). It's much easier to do this with the simpler HTTP/1.0 protocol than the modern HTTP/1.1 protocol. You probably would not benefit from the advantages of HTTP/1.1 for normal browsers anyhow, so this may be a good place to save some development costs.
Edit five years later:
Today, there is a lot of free/open source technology available to help you with this work. I personally like the asynchronous http implementation of node.js --- it saves you all the work mentioned in the above original paragraph. Of course, today there are also a lot of modules readily available for the other components that you need in your spider. Note, however, that the quality of third-party modules may vary considerably. You have to check out whatever you use. [Aging info:] Recently, I wrote a spider using node.js and I found the reliability of npm modules for HTML processing for link and data extraction insufficient. For this job, I "outsourced" this processing to a process written in another programming language. But things are changing quickly and by the time you read this comment, this problem may already a thing of the past...
Partitioning the work over several servers
One computer can't keep up with spidering the whole WWW. You need to distribute your work over several servers and exchange information between them. I suggest to assign certain "ranges of domain names" to each server: keep a central data base of domain names with a reference to a spider computer.
Extract URLs from received web pages in batches: sort them according to their domain names; remove duplicates and send them to the responsible spider computer. On that computer, keep an index of URLs that already are fetched and fetch the remaining URLs.
If you keep a queue of URLs waiting to be fetched on each spider computer, you will have no performance bottlenecks. But it's quite a lot of programming to implement this.
Read the standards
I mentioned several standards (HTTP/1.x, Robots.txt, Cookies). Take your time to read them and implement them. If you just follow examples of sites that you know, you will make mistakes (forget parts of the standard that are not relevant to your samples) and cause trouble for those sites that use these additional features.
It's a pain to read the HTTP/1.1 standard document. But all the little details got added to it because somebody really needs that little detail and now uses it.
I am not sure exactly what the original author of the comment was referring to, but I can guess that wget is slow as a spider, since it appears to only use a single thread of execution (at least by what you have shown).
"Real" spiders such as heritrix use a lot of parallelism and tricks to optimize their crawling speed, while simultaneously being nice to the website they are crawling. This typically means limiting hits to one site at a rate of 1 per second (or so), and crawling multiple websites at the same time.
Again this is all just a guess based on what I know of spiders in general, and what you posted here.
Unfortunately, many of the more well-known 'real' web spiders are closed-source, and indeed closed-binary. However there are a number of basic techniques wget is missing:
Parallelism; you're never going to be able to keep up with the entire web without retrieving multiple pages at a time
Prioritization; some pages are more important to spider than others
Rate limiting; you'll be banned quickly if you keep pulling down pages as quickly as you can
Saving to something other than a local filesystem; the Web is big enough that it's not going to fit in a single directory tree
Rechecking pages periodically without restarting the entire process; in practice, with a real spider you'll want to recheck 'important' pages for updates frequently, while less interesting pages can go for months.
There are also various other inputs that can be used such as sitemaps and the like. Point is, wget isn't designed to spider the entire web, and it's not really a thing that can be captured in a small code sample, as it's a problem of the whole overall technique being used, rather than any single small subroutine being wrong for the task.
I'm not going to go into details of how to spider the internet, I think that wget comment is regarding to spidering one website which is still a serious challenge.
As a spider you need to figure out when to stop, not go into recursive crawls just because the URL changed like date=1/1/1900 to 1/2/1900 and so
Even bigger challenge to sort out URL Rewrite (I have no clue what so ever how google or any other handles this). It's pretty big challenge to crawl enough but not too much. And how one can automatically recognise URL Rewrite with some random parameters and random changes in the content?
You need to parse Flash / Javascript at least up to some level
You need to consider some crazy HTTP issues like base tag. Even parsing the HTML is not easy, considering most of the websites are not XHTML and browsers are so flexible in the syntax.
I don't know how much of these implemented or considered in wget but you might want to take a look at httrack to understand the challenges of this task.
I'd love to give you some code examples but this is big tasks and a decent spider will be about 5000 loc without 3rd party libraries.
+ Some of them already explained by #yaakov-belch so I'm not going to type them again