Testing k6 groups with separate throughput - stress-testing

I am targeting 10x load for an API. The API has 6 endpoints that should be under test, but each endpoint has its own current throughput, which should be multiplied by 10.
Right now I have put all the endpoints in one script file, but it doesn't make sense to drive the same throughput for every endpoint. I want to run k6 and have it stop automatically once the target throughput has been reached for a specific group.
Example:
api/GetUser > current 1k RPM > target 10k RPM
api/GetManyUsers > current 500 RPM > target 5k RPM
The main problem is that when I put each endpoint in a separate group inside a single script, k6 iterates over both groups/endpoints with the same iteration count and the same virtual users, which drives both endpoints to 10x of the same baseline, which is not what I need.
One more thing: I already tried splitting the endpoints into separate scripts, but that is difficult to manage and makes monitoring harder, because all 6 endpoints have to run in parallel.

What you need can currently be approximated roughly with the __ITER and/or __VU execution context variables. Have a single default function that has something like this:
if (__ITER % 3 == 0) {
    CallGetManyUsers(); // 33% of iterations
} else {
    CallGetUser(); // 66% of iterations
}
In the very near future we plan to also add a more elegant way of supporting multi-scenario tests in a single script: https://github.com/loadimpact/k6/pull/1007

Related

host.json: meaning of batchSize

Does it make sense to set batchSize = 1 if I would like to process files one at a time?
I tried batchSize = 1000 and batchSize = 1 and they seem to have the same effect.
{
    "version": "2.0",
    "functionTimeout": "00:15:00",
    "aggregator": {
        "batchSize": 1,
        "flushTimeout": "00:00:30"
    }
}
Edit: added the following to the app settings:
WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT = 1
The function is still triggered simultaneously (using a blob trigger) when two more files are uploaded.
From https://github.com/Azure/azure-functions-host/wiki/Configuration-Settings
WEBSITE_MAX_DYNAMIC_APPLICATION_SCALE_OUT = 1
Set a maximum number of instances that a function app can scale to. This limit is not yet fully supported - it does work to limit your scale out, but there are some cases where it might not be completely foolproof. We're working on improving this.
I think I can close this issue. There is no easy way to enforce one-message-at-a-time processing across multiple function app instances.
I think you misunderstand the meaning of batchSize under aggregator. That batchSize is the maximum number of requests to aggregate; the aggregator configures how the runtime aggregates data about function executions over a period of time (you can check here).
From your description, what you are looking for is similar to the Azure Queue trigger batchSize. It sets the number of queue messages that the Functions runtime retrieves simultaneously and processes in parallel. If you want to avoid parallel execution for messages received on one queue, you can set batchSize to 1, i.e. one message at a time.
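As a sketch, for the v2 runtime the queue settings live under "extensions" in host.json (double-check the exact property names against the queue-trigger documentation for your runtime version):

{
    "version": "2.0",
    "extensions": {
        "queues": {
            "batchSize": 1,
            "newBatchThreshold": 0
        }
    }
}

Setting newBatchThreshold to 0 as well should keep the runtime from fetching a new batch while a message is still being processed, so each instance handles one message at a time. Note that this only limits concurrency per instance; combined with the scale-out limit above, it approximates one-message-at-a-time behaviour overall.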

Monte Carlo with rates, system simulation with CUDA C++

So I am trying to simulate a 1-D physical model called TASEP.
I wrote code to simulate this system in C++, but I definitely need a performance boost.
The model is very simple (C++ code below): an array of 1's and 0's, where 1 represents a particle and 0 means no particle, i.e. empty. A particle moves one element to the right, at rate 1, if that element is empty. A particle at the last location disappears at rate beta (say 0.3). Finally, if the first location is empty, a particle appears there at rate alpha.
The single-threaded version is easy: I just pick an element at random and act with probability 1, alpha, or beta, as described above. But this can take a lot of time.
So I tried to do a similar thing with many threads, using the GPU, and that raised a lot of questions:
Is using the GPU and CUDA at all good idea for such a thing?
How many threads should I have? I can have a thread for each site ( 10E+6 ), should I?
How do I synchronize the access to memory between different threads? I used atomic operations so far.
What is the right way to generate random data? If I use a million threads is it ok to have a random generator for each?
How do I take care of the rates?
I am very new to CUDA. I managed to run code from the CUDA samples and some tutorials. Although I have some code for the above (it still gives strange results), I am not posting it here, because I think the questions are more general.
So here is the single-threaded C++ version:
int Tasep()
{
    const int L = 750000;
    // rates
    int alpha = 330;
    int beta = 300;
    int ProbabilityNormalizer = 1000;
    bool system[L];
    int pos = 0;
    InitArray(system); // init to 0's and 1's

    /* Loop */
    for (int j = 0; j < 10 * L * L; j++)
    {
        unsigned long randomNumber = xorshf96();
        pos = (randomNumber % L); // Pick a random location in the array
        if (pos == 0 && system[0] == 0) // First site and empty
            system[0] = (alpha > (xorshf96() % ProbabilityNormalizer)); // Insert a particle with chance alpha
        else if (pos == L - 1) // Last site
            system[L - 1] = system[L - 1] && (beta < (xorshf96() % ProbabilityNormalizer)); // Remove the particle, if present, with chance beta
        else if (system[pos] && !system[pos + 1]) // If the current site holds a particle and the next one is empty - jump right
        {
            system[pos] = false;
            system[pos + 1] = true;
        }
        if ((j % 1000) == 0) // Just do some logging
            Log(system, j);
    }
    getchar();
    return 0;
}
I would be truly grateful to anyone willing to help and give advice.
I think your goal is to perform what is called a Monte Carlo simulation, but I have not fully understood your main objective (i.e. whether you want a frequency, an average quantity, etc.).
Question 01
Since you asked about random data: I believe you can have multiple random seeds (perhaps one for each thread). I would advise you to generate the seeds on the GPU using any pseudo-random generator (you can even use the same one as on the CPU), store the seeds in GPU global memory, and launch as many threads as you can using dynamic parallelism.
So yes, CUDA is a suitable approach, but keep in mind the balance between the time you will need to learn it and the time your current code takes to produce a result.
If you will use this knowledge in the future, learning CUDA may be worth it; it is also worth it if you can scale your code across many GPUs, it is taking too much time on the CPU, and you will need to solve this problem very often. It looks like you are close, but if it is a simple one-time result, I would advise you to let the CPU solve it, because from my experience you will probably spend more time learning CUDA than the CPU will take to solve it (IMHO).
Question 02
The number of threads is a very common question for beginners. The answer is very dependent on your project, but taking your code as a guide, I would use as many as I can, giving every thread a different seed.
My suggestion is to use registers as what you call "sites" (be aware that there are strong limitations) and then run multiple loops to evaluate your particles, in the same way a car tire works over a bad road (the data sits in shared memory), so your L is limited to 255 per loop (avoid register spills at all cost to your project; fewer registers means more warps per block).
To create the perturbation, I would load vectors into shared memory: one for alpha (short), one for beta (short) (I assume different distributions), one for "particle present or not in the next site" (char), and another two to combine with threadID, blockID, and some current-time info as a pseudo-random source (to help you pick the initial alpha, beta, and occupancy). That way you can reuse these rates for every thread in the block, and since the data does not change (only the reading position changes) you have to synchronize only once after reading; you can also randomly pick the perturbation position and reuse the data. The initial values can be loaded from global memory and "refreshed" after a specific number of loops to hide the loading latency.
In short, you will reuse the same data in shared memory multiple times, but the values selected by each thread change at every iteration due to the pseudo-random value. Taking into account that you are dealing with large numbers and that you can load different data into every block, the pseudo-random algorithm should be good enough. You can even use results stored on the GPU from previous runs as a random source, flip one variable and do some bit operations, so you can use every bit as a particle.
Question 03
For your specific project I would strongly recommend avoiding thread cooperation and making the threads completely independent. That said, you can use shuffle operations within the same warp at low cost.
Question 04
It is hard to generate truly random data, but what you should mostly worry about is how long your generator's period is (any pseudo-random generator has a period and then repeats). I would suggest using a single generator that can work in parallel with your kernel and feed your kernels (you can use dynamic parallelism). In your case, since you only need some randomness, you should not worry a lot about consistency. I gave an example of using pseudo-random data in the previous question; that may help. Keep in mind that there is no true random generator, but there are alternatives, such as internet bits, for example.
Question 05
Already explained in Question 03, but keep in mind that you do not need a full sequence of values, only a part large enough to use in multiple forms: give your kernel enough time to process, and then you can refresh your sequence. If you guarantee not to feed blocks with the same sequence, it will be very hard to fall into patterns.
Hope I have helped. I have been working with CUDA for a bit more than a year; I started just like you, and I still improve my code every week; now it is almost good enough. Now I see how well it fits my statistical challenge: clustering random things.
Good luck!

JMeter: Capturing Throughput in Command Line Interface Mode

In JMeter v2.13, is there a way to capture Throughput via non-GUI/command-line mode?
I have the jmeter.properties file configured to output via the Summariser and I'm also outputting another [more detailed] .csv results file.
call ..\..\binaries\apache-jmeter-2.13\bin\jmeter -n -t "API Performance.jmx" -l "performanceDetailedResults.csv"
The performanceDetailedResults.csv file provides:
timeStamp
elapsed time
responseCode
responseMessage
threadName
success
failureMessage
bytes sent
grpThreads
allThreads
Latency
However, no amount of tweaking the .properties file or the test itself seems to provide Throughput results like the ones I get via the GUI Summary Report's Save Table Data button.
All articles, postings, and blogs seem to indicate that it isn't possible without manual manipulation in a spreadsheet. But I'm hoping someone out there has figured out a way to do this with no, or minimal, manual work, as the client doesn't want to have to calculate the Throughput value by hand each time.
Throughput is calculated by JMeter Listeners, so it isn't something you can enable via the properties files. The same applies to other calculated metrics such as:
Average response time
50, 90, 95, and 99 percentiles
Standard Deviation
Basically, throughput is calculated simply by dividing the total number of requests by the elapsed time.
Throughput is calculated as requests/unit of time. The time is calculated from the start of the first sample to the end of the last sample. This includes any intervals between samples, as it is supposed to represent the load on the server.
The formula is: Throughput = (number of requests) / (total time)
Hopefully it won't be too hard for you.
References:
Glossary #1
Glossary #2
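If you want the number without opening the GUI, you can also apply that formula to the CSV yourself. Below is a rough C# sketch (my own illustration, not JMeter functionality); it assumes the results file was written with a header row, that timeStamp is in epoch milliseconds and elapsed is in milliseconds, and that the fields involved contain no embedded commas:

using System;
using System.Globalization;
using System.IO;
using System.Linq;

class ThroughputFromCsv
{
    static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.WriteLine("usage: ThroughputFromCsv <results.csv>");
            return;
        }

        var lines = File.ReadLines(args[0]).ToList();
        var header = lines[0].Split(',');
        int tsCol = Array.IndexOf(header, "timeStamp"); // start time of each sample (epoch ms)
        int elCol = Array.IndexOf(header, "elapsed");   // duration of each sample (ms)

        long firstStart = long.MaxValue, lastEnd = long.MinValue, count = 0;
        foreach (var line in lines.Skip(1))
        {
            var cols = line.Split(',');
            long ts = long.Parse(cols[tsCol], CultureInfo.InvariantCulture);
            long el = long.Parse(cols[elCol], CultureInfo.InvariantCulture);
            firstStart = Math.Min(firstStart, ts);
            lastEnd = Math.Max(lastEnd, ts + el);
            count++;
        }

        // Throughput = (number of requests) / (end of last sample - start of first sample)
        double seconds = (lastEnd - firstStart) / 1000.0;
        Console.WriteLine($"Samples: {count}, Duration: {seconds:F1} s, Throughput: {count / seconds:F2} requests/s");
    }
}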
Did you take a look at JMeter-Plugins?
This tool can generate an aggregate report from the command line.

Kafka SparkStreaming configuration specify offsets/message list size

I am fairly new to both Kafka and Spark and am trying to write a job (either streaming or batch). I would like to read a predefined number of messages (say x) from Kafka, process the collection through workers, and only then start working on the next set of x messages. Basically, each message in Kafka is 10 KB and I want to put 2 GB worth of messages into a single S3 file.
So is there any way of specifying the number of messages that the receiver fetches?
I have read that I can specify 'from offset' while creating DStream, but this use case is somewhat different. I need to be able to specify both 'from offset' and 'to offset'.
There's no way to set the ending offset as an initial parameter (as you can for the starting offset), but
you can use createDirectStream (the fourth overloaded version in the listing), which gives you the ability to get the offsets of the current micro-batch using HasOffsetRanges (which gives you back OffsetRange).
That means you'll have to compare the values you get from OffsetRange with your ending offset in every micro-batch in order to see where you are and when to stop consuming from Kafka.
I guess you also need to take into account that each partition has its own sequential offsets. It would probably be easiest to go slightly over 2 GB, by as much as it takes to finish the current micro-batch (which could be a couple of kB, depending on the density of your messages), in order to avoid splitting the last batch into a consumed and an unconsumed part, which might require you to fiddle with the offsets Spark keeps in order to track what has been consumed and what hasn't.
Hope this helps.

C# Data structure Algorithm

I recently interviewed at one of the top software companies. I was completely stuck on one question the interviewer asked me:
Q. I have a machine with 512 MB / 1 GB of RAM and I have to sort a file (XML, or any format) of 4 GB. How would I proceed? What data structure and which sorting algorithm would I use, and how?
Do you think it is achievable? If yes, can you please explain?
Thanks in advance!
The answer the interviewer probably wants is how you would efficiently sort a data set that exceeds system memory. The following section is taken from Wikipedia:
Memory usage patterns and index sorting
When the size of the array to be sorted approaches or exceeds the available primary memory, so that (much slower) disk or swap space must be employed, the memory usage pattern of a sorting algorithm becomes important, and an algorithm that might have been fairly efficient when the array fit easily in RAM may become impractical. In this scenario, the total number of comparisons becomes (relatively) less important, and the number of times sections of memory must be copied or swapped to and from the disk can dominate the performance characteristics of an algorithm. Thus, the number of passes and the localization of comparisons can be more important than the raw number of comparisons, since comparisons of nearby elements to one another happen at system bus speed (or, with caching, even at CPU speed), which, compared to disk speed, is virtually instantaneous.
For example, the popular recursive quicksort algorithm provides quite reasonable performance with adequate RAM, but due to the recursive way that it copies portions of the array it becomes much less practical when the array does not fit in RAM, because it may cause a number of slow copy or move operations to and from disk. In that scenario, another algorithm may be preferable even if it requires more total comparisons.
One way to work around this problem, which works well when complex records (such as in a relational database) are being sorted by a relatively small key field, is to create an index into the array and then sort the index, rather than the entire array. (A sorted version of the entire array can then be produced with one pass, reading from the index, but often even that is unnecessary, as having the sorted index is adequate.) Because the index is much smaller than the entire array, it may fit easily in memory where the entire array would not, effectively eliminating the disk-swapping problem. This procedure is sometimes called "tag sort".[5]
Another technique for overcoming the memory-size problem is to combine two algorithms in a way that takes advantage of the strength of each to improve overall performance. For instance, the array might be subdivided into chunks of a size that will fit easily in RAM (say, a few thousand elements), the chunks sorted using an efficient algorithm (such as quicksort or heapsort), and the results merged as per mergesort. This is less efficient than just doing mergesort in the first place, but it requires less physical RAM (to be practical) than a full quicksort on the whole array.
Techniques can also be combined. For sorting very large sets of data that vastly exceed system memory, even the index may need to be sorted using an algorithm or combination of algorithms designed to perform reasonably with virtual memory, i.e., to reduce the amount of swapping required.
Use Divide and Conquer.
Here's the pseudocode:
function sortFile(file)
    if fileTooBigForMemory(file)
        pair<firstHalfOfFile, secondHalfOfFile> = breakIntoTwoHalves(file)
        sortFile(firstHalfOfFile)
        sortFile(secondHalfOfFile)
        MergeTwoHalvesInOrder(firstHalfOfFile, secondHalfOfFile)
    else
        sortCharactersInFile(file)
    endif
end
Two well-known algorithms that fall into the divide-and-conquer category are merge sort and quicksort, so you could use either of them for the implementation.
As for the data structure, a char array containing the characters of the file could do. If you want to be more object-oriented, wrap it in a class called File:
class File {
    private char[] characters;
    // methods to access and mutate 'characters'
}
There is a nice post on Guido van Rossum's blog that has a relevant suggestion. Be aware that the code is in Python.
Split your file into chunks that fit into memory.
Sort each chunk using quicksort and save it to a separate file.
Then merge the result files and you get your result.
I would use a multiway merge. There is an excellent book called Managing Gigabytes that shows several different ways of doing it. They also go into sort based inversion for files that are larger than physical memory. Look around page 240 for a pretty detailed algorithm on sorting through chunks on disk.
The post above is correct in that you split the file and sort each portion.
Say you have a 4 GB file and only want to load a maximum of 512 MB. That means you need to split the file into at least 8 chunks. If you are not sure how much extra overhead your sort is going to use, double that number to 16 chunks to be safe.
The 16 files are then sorted one at a time so that each is in a guaranteed order. So now you have chunks 0-15 as sorted files.
Now you open 16 file handles to those files and read one entry at a time, writing the lowest one to the final output. Since you know each of the files is already sorted, taking the lowest from each means you are writing them in the correct order to the final output (see the sketch below).
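A minimal C# sketch of that split/sort/merge approach for a file of text lines (the chunk size, record format, and comparison are placeholders you would adapt to the real data):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class ExternalSort
{
    const int LinesPerChunk = 1000000; // tune so one chunk fits comfortably in RAM

    public static void Sort(string inputPath, string outputPath)
    {
        // Pass 1: read a chunk, sort it in memory, write it to its own temp file.
        var chunkFiles = new List<string>();
        using (var reader = new StreamReader(inputPath))
        {
            var chunk = new List<string>(LinesPerChunk);
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                chunk.Add(line);
                if (chunk.Count == LinesPerChunk)
                    chunkFiles.Add(WriteSortedChunk(chunk));
            }
            if (chunk.Count > 0)
                chunkFiles.Add(WriteSortedChunk(chunk));
        }

        // Pass 2: k-way merge - keep one reader per chunk open and always emit the smallest head line.
        var readers = chunkFiles.Select(f => new StreamReader(f)).ToList();
        var heads = readers.Select(r => r.ReadLine()).ToList();
        using (var writer = new StreamWriter(outputPath))
        {
            while (true)
            {
                int min = -1;
                for (int i = 0; i < heads.Count; i++)
                    if (heads[i] != null && (min < 0 || string.CompareOrdinal(heads[i], heads[min]) < 0))
                        min = i;
                if (min < 0)
                    break; // every chunk is exhausted
                writer.WriteLine(heads[min]);
                heads[min] = readers[min].ReadLine();
            }
        }
        readers.ForEach(r => r.Dispose());
        chunkFiles.ForEach(File.Delete);
    }

    static string WriteSortedChunk(List<string> chunk)
    {
        chunk.Sort(StringComparer.Ordinal);
        string path = Path.GetTempFileName();
        File.WriteAllLines(path, chunk);
        chunk.Clear();
        return path;
    }
}

With 16 chunks a linear scan over the chunk heads is fine; with many more chunks you would replace it with a priority queue keyed on the head line.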
I have used such a system in C# for sorting large collections of spam words from emails. The original system required all of them to be loaded into RAM in order to sort them and build a dictionary of spam counts. Once the file grew over 2 GB, the in-memory structures required 6+ GB of RAM and sorting took over 24 hours due to paging and virtual memory. The new system, using the chunking described above, sorted the entire file in under 40 minutes. That was an impressive speedup for such a simple change.
I played with various load options (1/4 of system memory per chunk, etc.). It turned out that for our situation the best option was about 1/10 of system memory. Then Windows had enough memory left over for decent file I/O buffering to offset the increased file traffic, and the machine stayed very responsive to other processes running on it.
And yes, I frequently like to ask these types of questions in interviews as well, just to see whether people can think outside the box. What do you do when you can't just call .Sort() on a list?
Just simulate virtual memory: overload the array index operator [].
Find a quicksort implementation that sorts an array in C++ or C#, then overload the indexer operator [] so that it reads from and writes to a file. That way you can plug in existing sort algorithms; you only change what happens behind the scenes on those [] accesses.
Here's one example of simulating virtual memory in C#:
source: http://msdn.microsoft.com/en-us/library/aa288465(VS.71).aspx
// indexer.cs
// arguments: indexer.txt
using System;
using System.IO;

// Class to provide access to a large file
// as if it were a byte array.
public class FileByteArray
{
    Stream stream; // Holds the underlying stream
                   // used to access the file.

    // Create a new FileByteArray encapsulating a particular file.
    public FileByteArray(string fileName)
    {
        stream = new FileStream(fileName, FileMode.Open);
    }

    // Close the stream. This should be the last thing done
    // when you are finished.
    public void Close()
    {
        stream.Close();
        stream = null;
    }

    // Indexer to provide read/write access to the file.
    public byte this[long index] // long is a 64-bit integer
    {
        // Read one byte at offset index and return it.
        get
        {
            byte[] buffer = new byte[1];
            stream.Seek(index, SeekOrigin.Begin);
            stream.Read(buffer, 0, 1);
            return buffer[0];
        }
        // Write one byte at offset index.
        set
        {
            byte[] buffer = new byte[1] { value };
            stream.Seek(index, SeekOrigin.Begin);
            stream.Write(buffer, 0, 1);
        }
    }

    // Get the total length of the file.
    public long Length
    {
        get
        {
            return stream.Seek(0, SeekOrigin.End);
        }
    }
}

// Demonstrate the FileByteArray class.
// Reverses the bytes in a file.
public class Reverse
{
    public static void Main(String[] args)
    {
        // Check for arguments.
        if (args.Length == 0)
        {
            Console.WriteLine("indexer <filename>");
            return;
        }

        FileByteArray file = new FileByteArray(args[0]);
        long len = file.Length;

        // Swap bytes in the file to reverse it.
        for (long i = 0; i < len / 2; ++i)
        {
            byte t;
            // Note that indexing the "file" variable invokes the
            // indexer on the FileByteArray class, which reads
            // and writes the bytes in the file.
            t = file[i];
            file[i] = file[len - i - 1];
            file[len - i - 1] = t;
        }
        file.Close();
    }
}
Use the above code to roll your own array class, then plug in any array sorting algorithm.
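For completeness, a quicksort written against that indexer might look like the sketch below. This is my own illustrative code, not part of the MSDN sample, and reading and writing one byte per access is very slow, so a real implementation would buffer pages in memory:

using System;

public static class IndexerSort
{
    // In-place quicksort over a byte container exposed through an indexer,
    // e.g. the FileByteArray above: IndexerSort.Sort(file, 0, file.Length - 1).
    public static void Sort(FileByteArray data, long lo, long hi)
    {
        while (lo < hi)
        {
            long p = Partition(data, lo, hi);
            // Recurse into the smaller half to keep the call stack shallow.
            if (p - lo < hi - p)
            {
                Sort(data, lo, p);
                lo = p + 1;
            }
            else
            {
                Sort(data, p + 1, hi);
                hi = p;
            }
        }
    }

    // Hoare partition: everything <= pivot ends up left of the returned index.
    static long Partition(FileByteArray data, long lo, long hi)
    {
        byte pivot = data[lo + (hi - lo) / 2];
        long i = lo - 1, j = hi + 1;
        while (true)
        {
            do { i++; } while (data[i] < pivot);
            do { j--; } while (data[j] > pivot);
            if (i >= j) return j;
            byte tmp = data[i];
            data[i] = data[j];
            data[j] = tmp;
        }
    }
}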