I'm working on a rewrite of a popular add-on module (NDOUtils) for an even more popular application (Nagios). This module adds functionality to Nagios by making the objects/statuses/histories of each object available in a database.
The currently available version takes the data from Nagios (via some registered callbacks/function pointers) and sends it over a socket, where a separate process listens and queues the data up. Another process then pops the data from the queue and builds MySQL queries for insertion.
Although it works, and has worked for quite some time, we encounter problems on larger systems (15k+ objects defined in the Nagios configuration). We decided to start over and rewrite the module to handle the database calls directly (via MySQL C API prepared statements).
This works beautifully for the status data. One problem we face is that on startup, we need to get the object definitions into the database. Since the definitions can change each time the process starts, we truncate the appropriate tables and recreate each object. This works fine for most systems...
But for large systems, this process can take upwards of several minutes - and since it blocks, that delay is unacceptable, especially on critical monitoring setups.
So, to get this rewrite underway, I kept things simple. To begin with, I looped over each object definition, built a simple query, and inserted it. Once every object of one type had been inserted, I looped back over the objects to insert their relationships (for example, each host definition likely has a contact or contactgroup associated with it; those relationships need to be identified). This was the easiest to read, but extremely slow on a system with 15k hosts and 25k services - extremely slow as in 3 minutes.
Of course we can do better than that. I rewrote the major functions (hosts and services) to only need to loop over the object list twice each, and instead of sending an individual query per object or relationship, we build a bulk insert query. The code for this looks something like this:
#define MAX_OBJECT_INSERT 50

/* this is large because the reality is that the contact/host/service object
   queries are several thousand characters before any concatenation happens */
#define MAX_SQL_BUFFER ((MAX_OBJECT_INSERT * 150) + 8000)

#define MAX_SQL_BINDINGS 400

/* config_type, mysql_connection, ndo_tmp_str_len, MAX_BIND_BUFFER, etc.
   are defined elsewhere in the module and omitted here for brevity */

MYSQL_STMT * ndo_stmt = NULL;
MYSQL_BIND ndo_bind[MAX_SQL_BINDINGS];
int ndo_bind_i = 0;

int ndo_max_object_insert_count = 20;

int ndo_write_hosts()
{
    host * tmp = host_list;
    int host_object_id[MAX_OBJECT_INSERT] = { 0 };
    int i = 0;

    char query[MAX_SQL_BUFFER] = { 0 };

    char * query_base = "INSERT INTO nagios_hosts (instance_id, config_type, host_object_id, name) VALUES ";
    size_t query_base_len = strlen(query_base);
    size_t query_len = query_base_len;

    char * query_values = "(1,?,?,?),";
    size_t query_values_len = strlen(query_values);

    char * query_on_update = " ON DUPLICATE KEY UPDATE instance_id = VALUES(instance_id), config_type = VALUES(config_type), host_object_id = VALUES(host_object_id), name = VALUES(name)";
    size_t query_on_update_len = strlen(query_on_update);

    /* lock the tables */
    mysql_query(mysql_connection, "LOCK TABLES nagios_logentries WRITE, nagios_objects WRITE, nagios_hosts WRITE");

    strcpy(query, query_base);

    /* reset mysql bindings */
    memset(ndo_bind, 0, sizeof(ndo_bind));
    ndo_bind_i = 0;

    while (tmp != NULL) {

        /* concat the query_values to the current query */
        strcpy(query + query_len, query_values);
        query_len += query_values_len;

        /* retrieve this object's object_id from `nagios_objects` */
        host_object_id[i] = ndo_get_object_id_name1(TRUE, NDO_OBJECTTYPE_HOST, tmp->name);

        /* bind the parameters for this row's three placeholders */
        ndo_bind[ndo_bind_i].buffer_type = MYSQL_TYPE_LONG;
        ndo_bind[ndo_bind_i].buffer = &(config_type);
        ndo_bind_i++;

        ndo_bind[ndo_bind_i].buffer_type = MYSQL_TYPE_LONG;
        ndo_bind[ndo_bind_i].buffer = &(host_object_id[i]);
        ndo_bind_i++;

        ndo_bind[ndo_bind_i].buffer_type = MYSQL_TYPE_STRING;
        ndo_bind[ndo_bind_i].buffer_length = MAX_BIND_BUFFER;
        ndo_bind[ndo_bind_i].buffer = tmp->name;
        ndo_tmp_str_len[ndo_bind_i] = strlen(tmp->name);
        ndo_bind[ndo_bind_i].length = &(ndo_tmp_str_len[ndo_bind_i]);
        ndo_bind_i++;

        i++;

        /* we need to finish the query and execute */
        if (i >= ndo_max_object_insert_count || tmp->next == NULL) {

            /* overwrite the trailing comma with the ON DUPLICATE KEY clause */
            memcpy(query + query_len - 1, query_on_update, query_on_update_len);

            mysql_stmt_prepare(ndo_stmt, query, query_len + query_on_update_len);
            mysql_stmt_bind_param(ndo_stmt, ndo_bind);
            mysql_stmt_execute(ndo_stmt);

            /* remove everything after the base query */
            memset(query + query_base_len, 0, MAX_SQL_BUFFER - query_base_len);
            query_len = query_base_len;
            ndo_bind_i = 0;
            i = 0;
        }

        tmp = tmp->next;
    }

    mysql_query(mysql_connection, "UNLOCK TABLES");

    return 0;
}
This has been edited for brevity, but it should give at least a basic understanding of what's happening here. In reality, there is real error checking after each MySQL call.
Regardless, even with ndo_max_object_insert_count set to a high number (50, 100, etc.), this still takes about 50 seconds for 15k hosts and 25k services.
I'm at my wits' end trying to make this faster, so if anyone sees a glaring problem that I'm not noticing, or has any advice on how to make this style of string manipulation/bulk insert more performant, I'm all ears.
Update 1
Since posting this, I've updated the loop so it no longer continuously rewrites the query string, and so it no longer re-prepares the statement and re-binds the parameters on every batch. The query is now only built on the first pass through the loop, and once more for the final, smaller batch (whose size is the number of hosts % the max object insert count). This has actually shaved a few seconds off, but nothing substantial.
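In outline, the pattern I'm describing looks something like this (an illustrative sketch only, not the module's actual code; BATCH_SIZE, PARAMS_PER_ROW, and the load_next_batch() helper are made-up names, and all error checking is omitted):

#include <mysql.h>
#include <string.h>

#define BATCH_SIZE      50
#define PARAMS_PER_ROW  3

/* hypothetical helper: refreshes the buffers that `bind` points at with the
   next BATCH_SIZE rows' values; returns the number of rows it filled */
extern int load_next_batch(MYSQL_BIND *bind);

void insert_in_batches(MYSQL_STMT *stmt, const char *batch_query)
{
    MYSQL_BIND bind[BATCH_SIZE * PARAMS_PER_ROW];
    memset(bind, 0, sizeof(bind));

    /* prepare the full-sized statement and bind its parameters exactly once;
       every slot of `bind` points at a stable buffer that is simply
       overwritten for each batch */
    mysql_stmt_prepare(stmt, batch_query, strlen(batch_query));
    /* ... fill in buffer_type / buffer / length for each slot of bind ... */
    mysql_stmt_bind_param(stmt, bind);

    /* per batch, only the bound buffers change before re-executing */
    while (load_next_batch(bind) == BATCH_SIZE) {
        mysql_stmt_execute(stmt);
    }

    /* the final partial batch (host count % BATCH_SIZE rows) gets its own,
       shorter statement prepared once, as described in the update above */
}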
Your code at first glance does not appear to have any issues that would cause a performance problem like this. With this amount of data I would expect the code to run in a few seconds given normal hardware/OS behavior. I would recommend examining two possible pain points:
How fast are you generating the data to insert? (Replace the insertion part of the code with a NOOP.)
If you determine in step 1 that the data is being generated quickly enough, the problem is with the write performance of the database.
Regardless, it is very likely that you will have to troubleshoot at the database server level - run SHOW PROCESSLIST to start, then SHOW ENGINE INNODB STATUS if you are using InnoDB, and if all else fails, grab stack traces of the mysqld process with gdb.
The likely culprit is something horribly wrong with the I/O subsystem of the server, or perhaps some form of synchronous replication is enabled, but it is hard to know for sure without some server-level diagnostics.
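For step 1, a sketch along these lines would let you time the object loop with the MySQL calls compiled out (illustrative only, assuming POSIX clock_gettime(); the NDO_NOOP_TEST flag and the timed_write_hosts() wrapper are made-up names):

#include <stdio.h>
#include <time.h>

#define NDO_NOOP_TEST 1   /* 1 = skip the mysql_stmt_* calls, 0 = real inserts */

static double elapsed_seconds(struct timespec start, struct timespec stop)
{
    return (double)(stop.tv_sec - start.tv_sec)
         + (double)(stop.tv_nsec - start.tv_nsec) / 1e9;
}

void timed_write_hosts(void)
{
    struct timespec start, stop;

    clock_gettime(CLOCK_MONOTONIC, &start);

#if NDO_NOOP_TEST
    /* walk host_list and build the query/bindings exactly as before,
       but skip mysql_stmt_prepare/bind/execute entirely */
#else
    /* ndo_write_hosts();  -- the real insert path */
#endif

    clock_gettime(CLOCK_MONOTONIC, &stop);
    fprintf(stderr, "host pass took %.3f s\n", elapsed_seconds(start, stop));
}

If the NOOP pass finishes in a couple of seconds, the time is going to the database round trips; if it is still tens of seconds, the data generation itself is the bottleneck.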
Related
When I start my Node app, I store "static" MySQL data that won't be modified, like game quests and game monsters, in global objects. I'm not sure if it is more efficient to do it this way or to retrieve the data each time I need it. Sample code:
global.monsters = null;

doConn.query('SELECT * FROM monsters', function (error, results) {
    if (error) {
        throw error;
    }
    console.log('[MYSQL] Loaded monsters');
    global.monsters = results;
});
There's an important concept of code efficiency called loop-invariant. It refers to anything that remains the same in every iteration of a loop.
Example:
for (let i = 1; i <= 100; i++) {
    m = 42;
    // other statements...
}
m is assigned a fixed value 100 times. Why do it 100 times? Why not assign it once, either before or after the loop, and save 99% of that work?
m = 42;
for (let i = 1; i <= 100; i++) {
    // other statements...
}
Some compilers can factor this code out during compilation, but maybe not Node.js. Even if it could be factored out automatically, the code is clearer to the reader if you write loop-invariant statements outside the loop. Otherwise the reader will waste some of their attention trying to figure out whether there is some reason the statement is inside the loop.
The example of m = 42 is very simple, but there can be more complex code that is still loop-invariant - like querying data out of a database, which is what you are asking about.
There are exceptions to every rule. For example, some of the data about monsters could change frequently, even while players are playing, so your game might need to query repeatedly to make sure it has the latest data at all times.
But in general, if you can identify queries that are just as correct if you query them once at the start of the program, it's better to do that than to query them repeatedly.
I describe the outcome of a strategy with numerous rows. Each row contains a symbol (describing an asset), a timestamp (think of a backtest), and a price + weight.
Before a strategy runs I delete all previous results for this particular strategy (I have many strategies). I then loop over all symbols and all times.
# delete all previous data written by this strategy
StrategyRow.objects.filter(strategy=strategy).delete()

for symbol in symbols.keys():
    s = symbols[symbol]
    for t in portfolio.prices.index:
        p = prices[symbol][t]
        w = weights[symbol][t]
        row = StrategyRow.objects.create(strategy=strategy, symbol=s, time=t)
        if not math.isnan(p):
            row.price = p
        if not math.isnan(w):
            row.weight = w
        row.save()
This works but is very, very slow. Is there a chance to achieve the same thing with write_frame from pandas? Or maybe using faster raw SQL?
I don't think the raw SQL route should be the first thing you try (more on that in a bit).
The slowness is most likely from calling row.save() on so many objects; that operation is known to be slow.
I'd look into StrategyRow.objects.bulk_create() first: https://docs.djangoproject.com/en/1.7/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create
The difference is that you pass it a list of StrategyRow instances instead of calling .save() on each one. It's pretty straightforward: bundle up a number of rows, then create them in batches - try 10, 20, 100, etc. at a time. Your database configuration can also help you find the optimum batch size (e.g. http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html#sysvar_max_allowed_packet).
Back to your idea of raw SQL: it would make a difference if, for example, the Python code that creates the StrategyRow instances is slow (e.g. StrategyRow.objects.create()), but I still believe the key is to batch-insert the rows instead of running N queries.
I'm working with a database that holds lots of urls (tens of thousands). I'm attempting to multi-thread a resolver, that simply tries to resolve a given domain. On success, it compares the result to what's currently in the database. If it's different, the result is updated. If it fails, it's also updated.
Naturally, this will produce an inordinate volume of database calls. To clarify some of my confusion about the best way to achieve some form of asynchronous load distribution, I have the following questions (being fairly new to Perl still).
What is the best option for distributing the workload? Why?
How should I gather the urls to resolve prior to spawning?
Creating a hash of domains with the data to be compared seems to make the most sense to me. Then split it up, fire up the children, and have the children return the changes to be made to the parent.
How should returning data to the parent be handled in a clean manner?
I've been playing with a more pythonic method (given that I have more experience in Python), but have yet to make it work due to a lack of blocking for some reason. Aside from that issue, threading isn't the best option, simply due to (a lack of) CPU time for each thread (plus, I've been crucified more than once in the Perl channel for using threads :P and for good reason).
Below is more or less pseudo-code that I've been playing with for my threads (which should be taken more as a supplement to my explanation of what I'm trying to accomplish than anything else).
# Create children...
for (my $i = 0; $i < $threads_to_spawn; $i++)
{
    threads->create(\&worker);
}
The parent then sits in a loop, monitoring a shared array of domains. It locks the array and re-populates it if it becomes empty.
Your code is the start of a persistent worker model.
use threads;
use Thread::Queue 1.03 qw( );

use constant NUM_WORKERS => 5;

sub work {
    my ($dbh, $job) = @_;
    ...
}

{
    my $q = Thread::Queue->new();

    for (1..NUM_WORKERS) {
        async {
            my $dbh = ...;
            while (my $job = $q->dequeue()) {
                work($dbh, $job);
            }
        };
    }

    for my $job (...) {
        $q->enqueue($job);
    }

    $q->end();

    $_->join() for threads->list();
}
Performance tips:
Tweak the number of workers for your system and workload.
Grouping small jobs into larger jobs can improve speed by reducing overhead.
I set up my Rails application twice: once working with MongoDB (Mongoid as the mapper) and once with MySQL and ActiveRecord. Then I wrote a rake task which inserts some test data into both databases (100,000 entries).
I measured how long each database takes with the Ruby Benchmark module. I did some testing with 100 and 10,000 entries, where MongoDB was always faster than MySQL (by about a third). The weird thing is that it takes about 3 times longer for MongoDB to insert the 100,000 entries than for MySQL. I have no idea why MongoDB behaves this way. The only thing I know is that the CPU time is much lower than the total time. Is it possible that MongoDB starts some sort of garbage collection while it's inserting the data? At the beginning it's fast, but the more data MongoDB inserts, the slower it gets... any idea on this?
To get some idea of the read performance of the two databases, I thought about measuring the time from when the database receives a search query to when it responds with the result. As I need precise measurements, I don't want to include the time Rails spends processing my query on the way from the controller to the database.
How do I do the measurement directly at the database rather than in the Rails controller? Is there any gem / tool which would help me?
Thanks in advance!
EDIT: Updated my question according to my current situation
If your base goal is to measure database performance at the DB level, I would recommend getting familiar with the benchRun method in MongoDB.
To do the type of thing you want to do, you can start with the example on the linked page; here is a variant with explanations:
// skipped dropping the table and reinitializing as I'm assuming you have your test dataset
// your database is called test and collection is foo in this code
ops = [
    // this sets up an array of operations benchRun will run
    {
        // possible operations include find (added in 2.1), findOne, update, insert, delete, etc.
        op : "find",
        // your db.collection
        ns : "test.foo",
        // different operations have different query options - this matches based on _id,
        // using a random value between 0 and 100 each time
        query : { _id : { "#RAND_INT" : [ 0, 100 ] } }
    }
]

for ( x = 1; x <= 128; x *= 2 ) {
    // actual call to benchRun, each time using a different number of threads
    res = benchRun( { parallel : x,  // number of threads to run in parallel
                      seconds : 5,   // duration of run; can be fractional seconds
                      ops : ops      // array of operations to run (see above)
                    } )
    // res is a json object returned; easiest way to see everything in it:
    printjson( res )
    print( "threads: " + x + "\t queries/sec: " + res.query )
}
If you put this in a file called testing.js you can run it from mongo shell like this:
> load("testing.js")
{
"note" : "values per second",
"errCount" : NumberLong(0),
"trapped" : "error: not implemented",
"queryLatencyAverageMs" : 69.3567923734754,
"insert" : 0,
"query" : 12839.4,
"update" : 0,
"delete" : 0,
"getmore" : 0,
"command" : 128.4
}
threads: 1 queries/sec: 12839.4
and so on.
I found the reason why MongoDB gets slower while inserting many documents.
Many to many relations are not recommended for over 10,000 documents when using MRI due to the garbage collector taking over 90% of the run time when calling #build or #create. This is due to the large array appending occurring in these operations.
http://mongoid.org/performance.html
Now I would like to know how to measure the query performance of each database. My main concerns are measuring the query time and the throughput. This measurement should be made directly at the database, so that nothing can distort the result.
I'm doing a little research on a possible application of EWS in our existing project, which is written with heavy use of MAPI, and I found out something disturbing about the performance of the LoadPropertiesForItems() method.
Consider this scenario:
we have 10000 (ten thousand) messages in the Inbox folder
we want to get approximately 30 properties of every message to see if they satisfy our conditions for further processing
messages are retrieved from the server in batches of 100
So, code looks like this:
ItemView itemsView = new ItemView(100);
PropertySet properties = new PropertySet();
properties.Add(EmailMessageSchema.From);
/*
add all necessary properties...
*/
properties.Add(EmailMessageSchema.Sensitivity);

FindItemsResults<Item> findResults;
List<EmailMessage> list = new List<EmailMessage>();

do
{
    findResults = folder.FindItems(itemsView);
    _service.LoadPropertiesForItems(findResults, properties);

    foreach (Item it in findResults)
    {
        // ... do something with every item
    }

    if (findResults.NextPageOffset.HasValue)
    {
        itemsView.Offset = findResults.NextPageOffset.Value;
    }
} while (findResults.MoreAvailable);
And the problem is that every increment of the itemsView.Offset property makes LoadPropertiesForItems take longer to execute. For the first couple of iterations it is not very noticeable, but by around the 30th pass through the loop the call time increases from under 1 second to 8 or more seconds. And memory allocation hits physical limits, causing an out-of-memory exception.
I'm pretty sure that my problems are "offset related", because I changed the code a little to this:
itemsView = new ItemView(100, offset, OffsetBasePoint.Beginning);

// ...rest of loop

if (findResults.NextPageOffset.HasValue)
{
    offset = findResults.NextPageOffset.Value;
}
and I manipulated the offset variable (declared outside of the loop) so that it started at 4500, and then in debug mode, after the first iteration, I changed its value to 100. In line with my suspicions, the first call to LoadPropertiesForItems took very, very long to execute, and the second call (with offset = 100) was very quick.
Can anyone confirm this, and maybe propose some solution for it?
Of course I can do my work without using an offset but why should I? :)
Changing the offset is expensive because the server has to iterate through the items from the beginning; it isn't possible to have an ordinal index for messages because new messages can be inserted into the view in any order (think of a view sorted by name or subject).
Paging through all the items once is the best approach.