I set up my Rails application twice: one version working with MongoDB (Mongoid as mapper) and the other with MySQL and ActiveRecord. Then I wrote a rake task that inserts some test data into both databases (100,000 entries).
I measured how long each database takes with the Ruby Benchmark module. I did some testing with 100 and 10,000 entries, where MongoDB was always faster than MySQL (by about 1/3). The weird thing is that it takes about 3 times longer with MongoDB to insert the 100,000 entries than with MySQL, and I have no idea why MongoDB behaves this way. The only thing I know is that the CPU time is much lower than the total time. Is it possible that MongoDB starts some sort of garbage collection while it's inserting the data? At the beginning it's fast, but the more data MongoDB inserts, the slower it gets. Any idea on this?
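For reference, here is a minimal sketch of how such a rake task might time the inserts with Benchmark; the MongoEntry and SqlEntry model names are hypothetical stand-ins for the two mapped models, not taken from the question:

require 'benchmark'

# hypothetical models: MongoEntry (Mongoid) and SqlEntry (ActiveRecord)
Benchmark.bm(14) do |x|
  x.report("mongoid:") do
    100_000.times { |i| MongoEntry.create!(title: "entry #{i}") }
  end
  x.report("activerecord:") do
    100_000.times { |i| SqlEntry.create!(title: "entry #{i}") }
  end
end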
To get an idea of the read performance of the two databases, I thought about measuring the time between the database receiving a search query and responding with the result. Since I need precise measurements, I don't want to include the time Rails spends processing the query on its way from the controller to the database.
How do I take the measurement directly at the database rather than in the Rails controller? Is there any gem or tool that would help me?
Thanks in advance!
EDIT: Updated my question to reflect my current situation
If your goal is to measure database performance at the DB level, I would recommend getting familiar with MongoDB's benchRun method.
To do the type of thing you want to do, you can start from the example on the linked page; here is a variant with explanations:
// skipped dropping the collection and re-initializing, as I'm assuming you have your test dataset
// your database is called test and your collection is foo in this code
ops = [
    // this sets up an array of operations benchRun will run
    {
        // possible operations include find (added in 2.1), findOne, update, insert, delete, etc.
        op : "find" ,
        // your db.collection
        ns : "test.foo" ,
        // different operations have different query options - this matches based on _id
        // using a random value between 0 and 100 each time
        query : { _id : { "#RAND_INT" : [ 0 , 100 ] } }
    }
]

for ( x = 1; x <= 128; x *= 2 ) {
    // actual call to benchRun, each time using a different number of threads
    res = benchRun( { parallel : x ,  // number of threads to run in parallel
                      seconds : 5 ,   // duration of run; can be fractional seconds
                      ops : ops       // array of operations to run (see above)
    } )
    // res is a JSON object; easiest way to see everything in it:
    printjson( res )
    print( "threads: " + x + "\t queries/sec: " + res.query )
}
If you put this in a file called testing.js, you can run it from the mongo shell like this:
> load("testing.js")
{
    "note" : "values per second",
    "errCount" : NumberLong(0),
    "trapped" : "error: not implemented",
    "queryLatencyAverageMs" : 69.3567923734754,
    "insert" : 0,
    "query" : 12839.4,
    "update" : 0,
    "delete" : 0,
    "getmore" : 0,
    "command" : 128.4
}
threads: 1 queries/sec: 12839.4
and so on.
I found the reason why MongoDB gets slower while inserting many documents:
Many-to-many relations are not recommended for over 10,000 documents when using MRI due to the garbage collector taking over 90% of the run time when calling #build or #create. This is due to the large array appending occurring in these operations.
http://mongoid.org/performance.html
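To make the quoted warning concrete, here is a rough sketch of the kind of pattern it refers to; the Group/User models and their has_and_belongs_to_many relation are hypothetical, not taken from the question:

# hypothetical Mongoid models with a many-to-many relation:
#   class Group; include Mongoid::Document; has_and_belongs_to_many :users; end
#   class User;  include Mongoid::Document; has_and_belongs_to_many :groups; end
group = Group.create!(name: "test")

100_000.times do |i|
  # every #create through the relation appends to a growing in-memory array,
  # which is what drives the GC overhead described above under MRI
  group.users.create!(name: "user #{i}")
end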
Now I would like to know how to measure the query performance of each database. My main concerns are measuring the query time and the throughput. This measurement should be made directly at the database, so that nothing can distort the result.
I'm working on a rewrite of a popular add-on module (NDOUtils) for an even more popular application (Nagios). This module adds functionality to Nagios by making the objects/statuses/histories of each object available in a database.
The currently available version takes the data from Nagios (via some registered callbacks/function pointers) and sends it over a socket, where an additional process listens and queues the data up. Finally, another process pops the data from the queue and builds MySQL queries for insertion.
Although it works, and has worked for quite some time, we encounter problems on larger systems (15k+ objects defined in the Nagios configuration). We decided to start over and rewrite the module to handle the database calls directly (via MySQL C API prepared statements).
This works beautifully for the status data. One problem we face is that on startup, we need to get the object definitions into the database. Since the definitions can change each time the process starts, we truncate the appropriate tables and recreate each object. This works fine for most systems...
But for large systems this process can take upwards of several minutes, and since it is a blocking process, several minutes is unacceptable, especially on critical monitoring setups.
So, to get this rewrite underway, I kept things simple. To begin with, I looped over each object definition, built a simple query, and inserted it. Once each object of that type was inserted, I looped back over the objects to insert all of their relationships (for example, each host definition likely has a contact or contactgroup associated with it, and those relationships need to be identified). This was the easiest to read, but extremely slow on a system with 15k hosts and 25k services. Extremely slow as in 3 minutes.
Of course we can do better than that. I rewrote the major functions (hosts and services) to only need to loop over the object list twice each, and instead of sending an individual query per object or relationship, we build a bulk insert query. The code for this looks something like this:
#define MAX_OBJECT_INSERT 50
/* this is large because the reality is that the contact/host/services object queries
are several thousand characters before any concatenation happens */
#define MAX_SQL_BUFFER ((MAX_OBJECT_INSERT * 150) + 8000)
#define MAX_SQL_BINDINGS 400
MYSQL_STMT * ndo_stmt = NULL;
MYSQL_BIND ndo_bind[MAX_SQL_BINDINGS];
int ndo_bind_i = 0;
int ndo_max_object_insert_count = 20;
int ndo_write_hosts()
{
    host * tmp = host_list;
    int host_object_id[MAX_OBJECT_INSERT] = { 0 };
    int i = 0;
    char query[MAX_SQL_BUFFER] = { 0 };

    char * query_base = "INSERT INTO nagios_hosts (instance_id, config_type, host_object_id, name) VALUES ";
    size_t query_base_len = strlen(query_base);
    size_t query_len = query_base_len;

    char * query_values = "(1,?,?,?),";
    size_t query_values_len = strlen(query_values);

    char * query_on_update = " ON DUPLICATE KEY UPDATE instance_id = VALUES(instance_id), config_type = VALUES(config_type), host_object_id = VALUES(host_object_id), name = VALUES(name)";
    size_t query_on_update_len = strlen(query_on_update);

    /* lock the tables */
    mysql_query(mysql_connection, "LOCK TABLES nagios_logentries WRITE, nagios_objects WRITE, nagios_hosts WRITE");

    strcpy(query, query_base);

    /* reset mysql bindings */
    memset(ndo_bind, 0, sizeof(ndo_bind));
    ndo_bind_i = 0;

    while (tmp != NULL) {

        /* concat the query_values to the current query */
        strcpy(query + query_len, query_values);
        query_len += query_values_len;

        /* retrieve this object's object_id from `nagios_objects` */
        host_object_id[i] = ndo_get_object_id_name1(TRUE, NDO_OBJECTTYPE_HOST, tmp->name);

        ndo_bind[ndo_bind_i].buffer_type = MYSQL_TYPE_LONG;
        ndo_bind[ndo_bind_i].buffer = &(config_type);
        ndo_bind_i++;

        ndo_bind[ndo_bind_i].buffer_type = MYSQL_TYPE_LONG;
        ndo_bind[ndo_bind_i].buffer = &(host_object_id[i]);
        ndo_bind_i++;

        ndo_bind[ndo_bind_i].buffer_type = MYSQL_TYPE_STRING;
        ndo_bind[ndo_bind_i].buffer_length = MAX_BIND_BUFFER;
        ndo_bind[ndo_bind_i].buffer = tmp->name;
        ndo_tmp_str_len[ndo_bind_i] = strlen(tmp->name);
        ndo_bind[ndo_bind_i].length = &(ndo_tmp_str_len[ndo_bind_i]);
        ndo_bind_i++;

        i++;

        /* we need to finish the query and execute */
        if (i >= ndo_max_object_insert_count || tmp->next == NULL) {

            memcpy(query + query_len - 1, query_on_update, query_on_update_len);

            mysql_stmt_prepare(ndo_stmt, query, query_len + query_on_update_len);
            mysql_stmt_bind_param(ndo_stmt, ndo_bind);
            mysql_stmt_execute(ndo_stmt);

            /* remove everything after the base query */
            memset(query + query_base_len, 0, MAX_SQL_BUFFER - query_base_len);
            query_len = query_base_len;
            ndo_bind_i = 0;
            i = 0;
        }

        tmp = tmp->next;
    }

    mysql_query(mysql_connection, "UNLOCK TABLES");
}
This has been edited for brevity and to give at least a basic understanding of what's happening here. In reality, there is real error checking after each MySQL call.
Regardless, even with ndo_max_object_insert_count set to a high number (50, 100, etc.), this still takes about 50 seconds for 15k hosts and 25k services.
I'm at my wits' end trying to make this faster, so if anyone sees some glaring problem that I'm not noticing, or has any advice on how to make this style of string manipulation/bulk insert more performant, I'm all ears.
Update 1
Since posting this, I've updated the loop so that it no longer continuously rewrites the string, and no longer re-prepares the statement and re-binds the parameters on every batch. Now it only builds the query on the first pass and again for the final batch (depending on the result of number of hosts % max object inserts). This has shaved a few seconds off, but nothing substantial.
Your code at first glance does not appear to have any issues that would cause a performance problem like this. With this amount of data I would expect the code to run in a few seconds given normal hardware/OS behavior. I would recommend examining two possible pain points:
How fast are you generating the data to insert? (Replace the insertion part of the code with a NOOP.)
If you determine in step 1 that the data is being generated quickly enough, the problem is with the write performance of the database.
Regardless, it is very likely that you have to troubleshoot at the database server level: run SHOW PROCESSLIST to start, then SHOW ENGINE INNODB STATUS if you are using InnoDB, and if all else fails, grab stack-trace snapshots of the mysqld process with gdb.
The likely culprit is something horribly wrong with the I/O subsystem of the server, or perhaps some form of synchronous replication is enabled, but it is hard to know for sure without some server-level diagnostics.
I currently need to sync data to a remote site via a Redis slave. The data lives in MySQL. To do this, I've devised a sync script similar to this:
MyTable
  .select("id, first_name, status")
  .find_each do |user|
    STDOUT.write(gen_redis_proto("SET", "users:#{user.id}", user.to_json))
  end
This works perfectly. I pipe this to redis-cli --pipe (as per https://www.redis.io/topics/mass-insert) and it inserts into the local master and syncs to the remote slave.
Unfortunately I have several thousand rows, making this sync quite large. I'd like to only sync rows that have changed; however, there's no "last_modified" or similar value available in the table.
The above code runs in a loop with a sleep between runs, so I can store the previous result set and make a comparison, but I can't work out an efficient way to do this. I'm thinking of something similar to the pseudocode below:
lines = [
  "{\"id\":123,\"first_name\":\"Jimmy\",\"status\":1}",
  "{\"id\":456,\"first_name\":\"John\",\"status\":2}",
  "{\"id\":789,\"first_name\":\"James\",\"status\":2}"
]

previous_lines = [
  "{\"id\":123,\"first_name\":\"Jimmy\",\"status\":2}",
  "{\"id\":456,\"first_name\":\"John\",\"status\":3}",
  "{\"id\":789,\"first_name\":\"James\",\"status\":2}"
]

varied_lines = diff(lines, previous_lines) # returns something like [0,1]

varied_lines.each do |line|
  this_line = line.to_a
  STDOUT.write(gen_redis_proto("SET", "users:#{this_line.id}", line))
end
I suspect that too much data manipulation or comparison will come with a performance overhead, and I'm also unsure of the best way to diff this data to get the changed rows.
While you don't strictly need it, the rickshaw gem lets you add hashing to strings easily. I'd probably store and compare hash values, which would take up less space. Example:
require 'json'
require 'rickshaw'

# returns the lines from `current` whose SHA1 differs from the corresponding
# line in `previous` (i.e. the rows that changed since the last run)
def diff(current, previous)
  current.select.with_index do |line, idx|
    previous_line = previous[idx]
    previous_line.nil? || line.to_sha1 != previous_line.to_sha1
  end
end

lines = [
  "{\"id\":123,\"first_name\":\"Jimmy\",\"status\":1}",
  "{\"id\":456,\"first_name\":\"John\",\"status\":2}",
  "{\"id\":789,\"first_name\":\"James\",\"status\":2}"
]

previous_lines = [
  "{\"id\":123,\"first_name\":\"Jimmy\",\"status\":2}",
  "{\"id\":456,\"first_name\":\"John\",\"status\":3}",
  "{\"id\":789,\"first_name\":\"James\",\"status\":2}"
]

varied_lines = diff(lines, previous_lines) # the first two rows changed

varied_lines.each do |line|
  id = JSON.parse(line)["id"]
  STDOUT.write(gen_redis_proto("SET", "users:#{id}", line))
end
I have a large amount of data in my database, and sometimes the server stops responding when a query takes longer than the server response time. So, is there any way to reduce the load on the MySQL server with Redis, and how do I implement it the right way?
Redis supports a range of datatypes, and you might wonder what a NoSQL key-value store has to do with datatypes. Well, these datatypes help developers store data in a meaningful way and can make data retrieval faster.
Connect with Redis in PHP
1) Download or clone the Predis library from GitHub.
2) Require the Predis autoloader and register it, then wrap the client in a try/catch block. The connection settings for connecting to Redis on a local server are different from those for a remote server.
require "predis/autoload.php";
PredisAutoloader::register();
try {
$redis = new PredisClient();
// This connection is for a remote server
/*
$redis = new PredisClient(array(
"scheme" => "tcp",
"host" => "153.202.124.2",
"port" => 6379
));
*/
}
catch (Exception $e) {
die($e->getMessage());
}
Now that we have successfully connected to the Redis server, let’s start using Redis.
Datatypes of Redis
Here are some of the datatypes supported by Redis:
String: Similar to Strings in PHP.
List: Similar to a single-dimensional array in PHP. You can push, pop, shift and unshift; the elements are kept in insertion order (FIFO, first in, first out).
Hash: A map between string fields and string values. Hashes are the perfect data type to represent objects (e.g. a user with a number of fields like name, surname, and so forth).
Set: Similar to a List, except that it has no order and each element may appear only once.
Sorted Set: Similar to a Redis Set, with the difference that each member is associated with a score, used to order the set from the smallest score to the largest.
Others are bitmaps and hyperloglogs, but they will not be discussed in this article, as they are pretty dense.
Getter and Setter in PHP Redis (Predis)
In Redis, the most important commands are SET, GET and EXISTS. These commands are used to store, check, and retrieve data from a Redis server. Just like the commands, the Predis client can be used to perform Redis operations via methods with the same names as the commands. For example:
// sets message to contain "Hello world"
$redis->set('message', 'Hello world');
// gets the value of message
$value = $redis->get('message');
// Hello world
print($value);
echo ($redis->exists('message')) ? "Oui" : "please populate the message key";
INCR and DECR are commands used to increase or decrease a value.
$redis->set("counter", 0);
$redis->incr("counter"); // 1
$redis->incr("counter"); // 2
$redis->decr("counter"); // 1
$redis->set("counter", 0);
$redis->incrby("counter", 15); // 15
$redis->incrby("counter", 5); // 20
$redis->decrby("counter", 10); // 10
Working with Lists
There are a few basic Redis commands for working with lists and they are:
LPUSH: adds an element to the beginning of a list
RPUSH: adds an element to the end of a list
LPOP: removes the first element from a list and returns it
RPOP: removes the last element from a list and returns it
LLEN: gets the length of a list
LRANGE: gets a range of elements from a list
For example:
$redis->rpush("languages", "french"); // [french]
$redis->rpush("languages", "arabic"); // [french, arabic]
$redis->lpush("languages", "english"); // [english, french, arabic]
$redis->lpush("languages", "swedish"); // [swedish, english, french, arabic]
$redis->lpop("languages"); // [english, french, arabic]
$redis->rpop("languages"); // [english, french]
$redis->llen("languages"); // 2
$redis->lrange("languages", 0, -1); // returns all elements
$redis->lrange("languages", 0, 1); // [english, french]
How to Retrieve Data from Redis before Falling Back to MySQL
You need to treat the Redis database as primary and the MySQL database as secondary. That means you fetch data from Redis first, and if the data is not found there, you get it from MySQL; if MySQL returns data, you update Redis with it so that next time you can retrieve it from Redis. A basic snapshot is shown below.
// Connect with the Redis database
$data = get_data_redis($query_param);

if (empty($data))
{
    // connect with MySQL
    $data = get_data_mysql($query_param);

    if (!empty($data))
    {
        // update Redis with this data
        update_data_redis($data, $query_param);
    }
}
How to Manage data in MySQL and Redis
When managing data, you have to update the MySQL database first and then update the Redis database.
// insert data into MySQL
$inserted = insert_data_mysql($data);
if ($inserted)
{
    insert_data_redis($data);
}

// update data in MySQL
$updated = update_data_mysql($data, $query);
if ($updated)
{
    update_data_redis($data, $query);
}

// delete data in MySQL
$deleted = delete_data_mysql($query);
if ($deleted)
{
    delete_data_redis($query);
}
Redis can be used as a caching layer over MySQL queries.
Redis is an in-memory database, which means it keeps the data in memory, where it can be accessed faster than querying the data from MySQL.
One sample use case would be:
Suppose you are creating a game listing site and you have multiple game categories like car games, bike games, kids' games, etc. To find the games for each category, you have to query the SQL database to get the list of games for your listing page. This is a scenario in which you can use Redis as a caching layer and cache the SQL response in Redis for X hours.
Exact steps:
First, GET from Redis.
If found, return it.
If not found in Redis, run the MySQL query and, before returning, save the response in the Redis cache for next time.
This will offload a lot of queries from MySQL to the in-memory Redis DB.
if (data in redis) {
    step 1: return data
} else {
    step 1: query MySQL
    step 2: save in redis
    step 3: return data
}
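For illustration only, here is roughly what that pseudocode looks like in Ruby, assuming the redis gem and a hypothetical fetch_games_from_mysql helper that runs the slow SQL query:

require 'json'
require 'redis'

redis = Redis.new

def cached_games(redis, category)
  key = "games:#{category}"

  cached = redis.get(key)
  return JSON.parse(cached) unless cached.nil?  # found in Redis: return it

  games = fetch_games_from_mysql(category)      # miss: run the slow SQL query (hypothetical helper)
  redis.setex(key, 3600, games.to_json)         # save in Redis for X hours (here: 1 hour)
  games                                         # return data
end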
Some points to consider before choosing the queries to save in redis are:
Only static queries should be chosen, meaning those whose data is not user-specific.
Choose the slow static queries to further improve MySQL performance.
Hope it will help.
I describe the outcome of a strategy with numerous rows. Each row contains a symbol (describing an asset), a timestamp (think of a backtest), and a price plus a weight.
Before a strategy runs I delete all previous results from this particular strategy (I have many strategies). I then loop over all symbols and all times.
# delete all previous data written by this strategy
StrategyRow.objects.filter(strategy=strategy).delete()

for symbol in symbols.keys():
    s = symbols[symbol]
    for t in portfolio.prices.index:
        p = prices[symbol][t]
        w = weights[symbol][t]
        row = StrategyRow.objects.create(strategy=strategy, symbol=s, time=t)
        if not math.isnan(p):
            row.price = p
        if not math.isnan(w):
            row.weight = w
        row.save()
This works but is very, very slow. Is there a chance to achieve the same thing with write_frame from pandas? Or maybe using faster raw SQL?
I don't think the first thing you should try is the raw SQL route (more on that in a bit).
I think the slowness comes from calling row.save() on many objects; that operation is known to be slow.
I'd look into StrategyRow.objects.bulk_create() first: https://docs.djangoproject.com/en/1.7/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create
The difference is that you pass it a list of your StrategyRow instances instead of calling .save() on each one. It's pretty straightforward: bundle up a few rows, then create them in batches; maybe try 10, 20, 100, etc. at a time. Your database configuration can also help you find the optimum batch size (e.g. http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html#sysvar_max_allowed_packet).
Back to your idea of raw SQL: that would make a difference if, for example, the Python code that creates the StrategyRow instances is slow (e.g. StrategyRow.objects.create()), but I still believe the key is to batch-insert them instead of running N queries.
I have a Grails application that does a rather huge createCriteria query pulling from many tables. I noticed that the performance is pretty terrible and have pinpointed it to the Object manipulation I do afterwards, rather than the createCriteria itself. My query successfully gets all of the original objects I wanted, but it is performing a new query for each element when I am manipulating the objects. Here is a simplified version of my controller code:
def hosts = Host.createCriteria().list(max: maxRows, offset: rowOffset) {
    // Lots of if statements for filters, etc.
}

def results = hosts?.collect{ [ cell: [
    it.hostname,
    it.type,
    it.status.toString(),
    it.env.toString(),
    it.supporter.person.toString()
    ...
]]}
I have many more fields, including calls to methods that perform their own queries to find related objects. So my question is: How can I incorporate joins into the original query so that I am not performing tons of extra queries for each individual row? Currently querying for ~700 rows takes 2 minutes, which is way too long. Any advice would be great! Thanks!
One benefit you get from using criteria is that you can easily fetch associations eagerly. As a result, you would not face the well-known N+1 problem when referencing associations.
You have not mentioned the logic in your criteria, but for ~700 rows I would definitely go for something like this:
def hosts = Host.createCriteria().list(max: maxRows, offset: rowOffset) {
    ...
    // associations are eagerly fetched if a DSL like below
    // is used in Criteria query
    supporter {
        person {
        }
    }
    someOtherAssoc {
        // Involve logic if required
        // eq('someOtherProperty', someOtherValue)
    }
}
If you feel that tailoring a criteria query is cumbersome, you can very well fall back to HQL and use join fetch to fetch associations eagerly.
I hope this reduces the turnaround time to less than 5 seconds for ~700 records.