Let's say I have a random function
${__Random(0,5)}
which I have used in multiple JSON requests, in a thread group whose loop count is set to Forever, as:
{
Master=
{
..
${__Random(0,10)}
}
{
..
${__Random(0,10)}
}
{
..
${__Random(0,10)}
}
{
..
${__Random(0,10)}
}
{
..
${__Random(0,10)}
}
}
Although the thread group is set to loop forever, will the threads keep running (re-evaluating the random function, possibly producing duplicate values), or will they stop after 10 iterations, since the maximum is 10?
My expectation is that they keep running, with duplicates allowed, and that every value stays within the specified range (0,10).
Please suggest/clarify. TIA.
As per the Reducing resource requirements chapter of the JMeter Best Practices:
"If your test needs large amounts of data - particularly if it needs to be randomised - create the test data in a file that can be read with CSV Dataset. This avoids wasting resources at run-time."
So a better idea would be to pre-generate the request bodies somewhere in the setUp Thread Group and write them to a .csv file. Then, in the main Thread Group, you can read the generated values via a CSV Data Set Config.
When I start my Node app, I store "static" MySQL data, like game quests and game monsters, which won't be modified, in global objects. I'm not sure if it is more efficient to do it this way or to retrieve the data each time I need it. Sample code:
global.monsters;

doConn.query('SELECT * FROM monsters', function(error, results) {
    if (error) {
        throw error;
    }
    console.log('[MYSQL] Loaded monsters');
    global.monsters = results;
});
There's an important concept in writing efficient code called a loop invariant. It refers to anything that remains the same in every iteration of a loop.
Example:
for (let i = 0; i < 100; i++) {
    m = 42;
    // other statements...
}
m is assigned a fixed value 100 times. Why do it 100 times? Why not assign it once, either before or after the loop, and save 99% of that work?
m = 42;
for (let i = 0; i < 100; i++) {
    // other statements...
}
Some compilers can factor this code out of the loop during compilation. But maybe not Node.js. Even if it could, the code is clearer to the reader when loop-invariant statements are written outside the loop. Otherwise the reader wastes some of their attention trying to figure out whether there's some reason the statement is inside the loop.
The example of m = 42 is very simple, but more complex code can also be loop-invariant - like querying data out of a database, which is what your question asks about.
There are exceptions to every rule. For example, some of the data about monsters could change frequently, even while players are playing, so your game might need to query repeatedly to be sure it has the latest data at all times.
But in general, if you can identify queries that are just as correct if you query them once at the start of the program, it's better to do that than to query them repeatedly.
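A minimal sketch of that rule applied to the question, with a hypothetical loadMonsters() standing in for the MySQL query:

```javascript
// Hypothetical async stand-in for 'SELECT * FROM monsters'.
async function loadMonsters() {
  return [{ id: 1, name: 'slime' }, { id: 2, name: 'goblin' }];
}

async function gameLoop(ticks) {
  // Loop-invariant: the monster table doesn't change during the run,
  // so query it once before the loop instead of once per iteration.
  const monsters = await loadMonsters();

  let updates = 0;
  for (let tick = 0; tick < ticks; tick++) {
    // Use the cached list inside the loop; no repeated query.
    updates += monsters.length;
  }
  return updates;
}
```

Here gameLoop(100) touches the "database" exactly once, no matter how many iterations run.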
I need to update or create data in a MySQL table from a large array (a few thousand objects) with sequelize.
When I run the following code, it uses up almost all the CPU of my DB server (vserver, 2 GB RAM / 2 CPUs) and clogs my app for a few minutes until it's done.
Is there a better way to do this with sequelize? Can this be done in the background somehow, or as a bulk operation, so it doesn't affect my app's performance?
data.forEach(function(item) {
    var query = {
        'itemId': item.id,
        'networkId': item.networkId
    };
    db.model.findOne({
        where: query
    }).then(function(storedItem) {
        try {
            if (!!storedItem) {
                storedItem.update(item);
            } else {
                db.model.create(item);
            }
        } catch (e) {
            console.log(e);
        }
    });
});
The first line of your sample code, data.forEach()..., makes a whole mess of calls to your function(item){}. The code in that function, in turn, fires off a whole mess of asynchronously completing operations.
Try using the async package (https://caolan.github.io/async/docs.htm) and doing this:
const async = require('async');
...
async.mapSeries(data, function(item, callback) {...
It should allow each iteration of your function (which runs once per item in your data array) to complete before starting the next one. Paradoxically enough, doing them one at a time will probably make them finish faster. It will certainly avoid soaking up your resources.
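The same one-at-a-time behavior can also be sketched without the async package, using async/await; upsertItem below is a hypothetical stand-in for the findOne/update/create round trip:

```javascript
// Hypothetical stand-in for the per-item findOne/update/create round trip.
// The log array records ordering so the sequencing is observable.
async function upsertItem(item, log) {
  log.push('start ' + item.id);
  await Promise.resolve(); // simulates the database round trip
  log.push('end ' + item.id);
}

// Process the array strictly one item at a time, like async.mapSeries.
async function processSequentially(data) {
  const log = [];
  for (const item of data) {
    await upsertItem(item, log); // wait before starting the next item
  }
  return log;
}
```

Each await finishes before the next item starts, so the database only ever sees one in-flight operation at a time, which is what keeps resource usage down.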
Weeks later I found the actual reason for this (and unfortunately using async didn't really help after all). It was as simple as it was stupid: I didn't have a MySQL index on itemId, so every iteration scanned the whole table, which caused the high CPU load (obviously).
I am trying to process JSON files in a bucket and write the results into a bucket:
DataflowPipelineOptions options = PipelineOptionsFactory.create()
.as(DataflowPipelineOptions.class);
options.setRunner(BlockingDataflowPipelineRunner.class);
options.setProject("the-project");
options.setStagingLocation("gs://some-bucket/temp/");
Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.from("gs://some-bucket/2016/04/28/*/*.json"))
.apply(ParDo.named("SanitizeJson").of(new DoFn<String, String>() {
@Override
public void processElement(ProcessContext c) {
try {
JsonFactory factory = JacksonFactory.getDefaultInstance();
String json = c.element();
SomeClass e = factory.fromString(json, SomeClass.class);
// manipulate the object a bit...
c.output(factory.toString(e));
} catch (Exception err) {
LOG.error("Failed to process element: " + c.element(), err);
}
}
}))
.apply(TextIO.Write.to("gs://some-bucket/output/"));
p.run();
I have around 50,000 files under the path gs://some-bucket/2016/04/28/ (in sub-directories).
My question is: does it make sense that this takes more than an hour to complete? Doing something similar on a Spark cluster on Amazon takes about 15-20 minutes. I suspect I might be doing something inefficiently.
EDIT:
In my Spark job I aggregate all the results in a DataFrame and only then write the output, all at once. I noticed that my pipeline here writes each file separately; I assume that is why it's taking much longer. Is there a way to change this behavior?
Your jobs are hitting a couple of performance issues in Dataflow, caused by the fact that it is more optimized for executing work in larger increments, while your job is processing lots of very small files. As a result, some aspects of the job's execution end up dominated by per-file overhead. Here are some details and suggestions.
The job is limited by writing output rather than by reading input (though reading input is also a significant part). You can significantly cut that overhead by specifying withNumShards on your TextIO.Write, depending on how many output files you want; e.g. 100 could be a reasonable value. By default you get an unspecified number of files, which in this case, given the current behavior of the Dataflow optimizer, matches the number of input files. Usually that is a good idea because it allows us to avoid materializing the intermediate data, but in this case it isn't, because the input files are so small and the per-file overhead dominates.
I recommend setting maxNumWorkers to a value like 12 - currently the second job is autoscaling to an excessively large number of workers. This is caused by Dataflow's autoscaling currently being geared toward jobs that process data in larger increments - it doesn't take per-file overhead into account and so doesn't behave well in your case.
The second job is also hitting a bug that prevents it from finalizing the written output. We're investigating; however, setting maxNumWorkers should also make it complete successfully.
In short:
set maxNumWorkers=12
set TextIO.Write.to("...").withNumShards(100)
and it should run much better.
I have been reading Clean Code by Robert C. Martin and have a basic (but fundamental) question about functions and program structure.
The book emphasizes that functions should:
be brief (like 10 lines, or less)
do one, and only one, thing
I am a little unclear on how to apply this in practice. For example, I am developing a program to:
load a baseline text file
parse baseline text file
load a test text file
parse test text file
compare parsed test with parsed baseline
aggregate results
I have tried two approaches, but neither seem to meet Martin's criteria:
APPROACH 1
Set up a main function that centrally commands the other functions in the workflow. But then main() can end up being very long (violating #1) and is obviously doing many things (violating #2). Something like this:
main()
{
// manage steps, one at a time, from start to finish
baseFile = loadFile("baseline.txt");
parsedBaseline = parseFile(baseFile);
testFile = loadFile("test.txt");
parsedTest = parseFile(testFile);
comparisonResults = compareFiles(parsedBaseline, parsedTest);
aggregateResults(comparisonResults);
}
APPROACH 2
Use main to trigger a function "cascade". But each function calls a dependency, so it still seems like each is doing more than one thing (violating #2?). For example, calling the aggregation function internally triggers the results comparison. The flow also seems backwards, since it starts with the end goal and calls dependencies as it goes. Something like this:
main()
{
// trigger end result, and let functions internally manage
aggregateResults("baseline.txt", "test.txt");
}
aggregateResults(baseFile, testFile)
{
comparisonResults = compareFiles(baseFile, testFile);
// aggregate results here
return theAggregatedResult;
}
compareFiles(baseFile, testFile)
{
parsedBase = parseFile(baseFile);
parsedTest = parseFile(testFile);
// compare parsed files here
return theFileComparison;
}
parseFile(filename)
{
loadFile(filename);
// parse the file here
return theParsedFile;
}
loadFile(filename)
{
//load the file here
return theLoadedFile;
}
Obviously functions need to call one another, so what is the right way to structure a program to meet Martin's criteria?
I think you are interpreting rule 2 wrongly by not taking context into account. The main() function does only one thing, and that is everything, i.e. running the whole program. Say you have a convert_abc_file_xyz_file(source_filename, target_filename); this function should do only the one thing its name (and arguments) imply: converting a file of format abc into one of format xyz. Of course, at a lower level there are many things to be done to achieve this: reading the source file (read_abc_file(…)), converting the data from format abc into format xyz (convert_abc_to_xyz(…)), and then writing the converted data into a new file (write_xyz_file(…)).
The second approach is wrong because it makes it impossible to write functions that do only one thing: every function ends up doing all the other things in the "cascaded" calls. With the first approach it is possible to test or reuse single functions, e.g. just call read_abc_file() to read a file. If that function called convert_abc_to_xyz(), which in turn called write_xyz_file(), that would no longer be possible.
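As a minimal sketch of the first approach (in JavaScript, with made-up file contents and comparison logic), each function does one thing and is testable on its own, while main() does the one thing of sequencing the workflow:

```javascript
// Each function does one thing and can be tested in isolation.
// loadFile is a hypothetical stand-in; a real program would read from disk.
function loadFile(filename) {
  const files = { 'baseline.txt': 'a b c', 'test.txt': 'a b d' };
  return files[filename];
}

function parseFile(contents) {
  return contents.split(' '); // "parsing" here is just tokenizing
}

function compareFiles(parsedBase, parsedTest) {
  // Report the positions where the two token lists differ.
  return parsedBase
    .map((token, i) => (token === parsedTest[i] ? null : i))
    .filter((i) => i !== null);
}

function aggregateResults(differences) {
  return { differenceCount: differences.length };
}

function main() {
  // main() only sequences the steps; each step does its one thing.
  const parsedBaseline = parseFile(loadFile('baseline.txt'));
  const parsedTest = parseFile(loadFile('test.txt'));
  const comparisonResults = compareFiles(parsedBaseline, parsedTest);
  return aggregateResults(comparisonResults);
}
```

With the hard-coded "files" above, main() returns { differenceCount: 1 }, since the two token lists differ only at the third position. Any single step, such as compareFiles, can be called and tested on its own without triggering the others.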
I set up my Rails application twice: one working with MongoDB (Mongoid as mapper), the other with MySQL and ActiveRecord. Then I wrote a rake task which inserts some test data into both databases (100,000 entries).
I measured how long each database takes with the Ruby Benchmark module. I did some testing with 100 and 10,000 entries, where MongoDB was always faster than MySQL (by about 1/3). The weird thing is that inserting the 100,000 entries takes about 3 times longer in MongoDB than in MySQL. I have no idea why MongoDB behaves this way. The only thing I know is that the CPU time is much lower than the total time. Is it possible that MongoDB starts some sort of garbage collection while it's inserting the data? At the beginning it's fast, but the more data MongoDB inserts, the slower it gets... any idea on this?
To get at the read performance of the two databases, I thought about measuring the time between the database receiving a search query and responding with the result. As I need precise measurements, I don't want to include the time Rails spends processing my query on the way from the controller to the database.
How do I do the measurement directly at the database and not in the Rails controller? Is there any gem/tool which would help me?
Thanks in advance!
EDIT: Updated my question according to my current situation
If your base goal is to measure database performance at the DB level, I recommend getting familiar with the benchRun method in MongoDB.
To do the type of thing you want, you can start with the example on the linked page; here is a variant with explanations:
// skipped dropping the table and reinitializing as I'm assuming you have your test dataset
// your database is called test and collection is foo in this code
ops = [
// this sets up an array of operations benchRun will run
{
// possible operations include find (added in 2.1), findOne, update, insert, delete, etc.
op : "find" ,
// your db.collection
ns : "test.foo" ,
// different operations have different query options - this matches based on _id
// using a random value between 0 and 100 each time
query : { _id : { "#RAND_INT" : [ 0 , 100 ] } }
}
]
for ( x = 1; x<=128; x*=2){
// actual call to benchRun, each time using different number of threads
res = benchRun( { parallel : x , // number of threads to run in parallel
seconds : 5 , // duration of run; can be fractional seconds
ops : ops // array of operations to run (see above)
} )
// res is a json object returned, easiest way to see everything in it:
printjson( res )
print( "threads: " + x + "\t queries/sec: " + res.query )
}
If you put this in a file called testing.js you can run it from mongo shell like this:
> load("testing.js")
{
"note" : "values per second",
"errCount" : NumberLong(0),
"trapped" : "error: not implemented",
"queryLatencyAverageMs" : 69.3567923734754,
"insert" : 0,
"query" : 12839.4,
"update" : 0,
"delete" : 0,
"getmore" : 0,
"command" : 128.4
}
threads: 1 queries/sec: 12839.4
and so on.
I found the reason why MongoDB is getting slower while inserting many documents.
Many-to-many relations are not recommended for over 10,000 documents when using MRI, due to the garbage collector taking over 90% of the run time when calling #build or #create. This is due to the large array appending occurring in these operations.
http://mongoid.org/performance.html
Now I would like to know how to measure the query performance of each database. My main concerns are measuring the query time and the throughput. This measurement should be made directly at the database, so that nothing can distort the result.