I have a Node API that accepts arrays of up to 1000 email addresses. These are then sent to the database server in a single INSERT INTO statement, i.e. 1000 rows with about 8 columns each.
The API can currently only handle 1 of those requests per second.
I have tried to scale my node server (up and out).
I have also tried to scale up my DB.
Nothing seems to push the API beyond 1 req/sec.
The strange thing is that the API server and DB server usage stay well below 50%.
Is there some other possible bottleneck that might impact the performance?
The part of the code that takes most of the time is definitely the INSERT INTO statement, but why does it remain that slow even when I scale the DB? And why is the usage % so low?
If a lot of messages are produced on Kafka and I try to consume and process them (using the Python confluent_kafka library), the database (a MySQL DB) quickly gets hit with a lot of queries.
I want to slow down the consuming speed based on the load on the DB. I was thinking of using time.sleep() in the consumer loop, so that I can sleep for longer when the DB is under load. I will be fetching the number of seconds to sleep from a Redis key, and will similarly change the value of that key when there is a certain amount of load on the DB, e.g. set it to 30s if the DB load is above 80%, or something like that.
I am stuck on how I can calculate the load on the DB.
Also, if there is another way of controlling the consuming speed, please tell me.
As long as you are careful not to stop consuming for too long (by default, the maximum time between poll() invocations is 5 minutes), you can slow down your consumption in any way you'd like.
You can also pause and resume partitions as needed, instead of slowing down consumption by introducing a sleep.
A different approach to organically limiting the load on your database could be to limit the number of concurrent connections your DB client can make, and to adjust it to keep a constant, controlled load.
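For illustration, here is a minimal sketch of the pause/resume approach in Python, reading the backoff from a Redis key as the question suggests (the topic name, group id, key name and the print() stub are placeholders, not values from the question):

import time

import redis
from confluent_kafka import Consumer

r = redis.Redis(host="localhost", port=6379)
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "db-writer",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["user-events"])  # placeholder topic name

while True:
    # Seconds to back off, as published by whatever monitors the DB load.
    backoff = float(r.get("db_backoff_seconds") or 0)

    if backoff > 0:
        # Pause all assigned partitions so poll() keeps the session alive
        # without fetching new records, then resume after the backoff.
        assignment = consumer.assignment()
        consumer.pause(assignment)
        deadline = time.time() + backoff
        while time.time() < deadline:
            consumer.poll(1.0)  # heartbeats and rebalances are still handled
        consumer.resume(assignment)

    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    # Hand off to whatever writes to MySQL; shown here as a stub.
    print(msg.value())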
I am trying to get a build working, and one of the stages intermittently fails with the following error:
Could not execute broadcast in 300 secs. You can increase the timeout for broadcasts via spark.sql.broadcastTimeout or disable broadcast join by setting spark.sql.autoBroadcastJoinThreshold to -1
How should I deal with this error?
First, let's talk a bit about what that error means.
From the official Spark documentation (http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables):
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
From my experience, the broadcast timeout usually occurs when one of the input datasets is partitioned poorly. Instead of disabling the broadcast, I recommend that you look at the partitions of your datasets and ensure that they are partitioned correctly.
The rule of thumb I use is to take the size of your dataset in MB, divide by 100, and set the number of partitions to that number. Since the HDFS block size is 128 MB, we want to split the files into chunks of roughly that size, but since they don't split perfectly we divide by a slightly smaller number to get more partitions.
The main thing is to keep very small datasets (roughly < 128 MB) in a single partition, as the network overhead is otherwise too large! Hope this helps.
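If it helps, here is a rough PySpark sketch of both config knobs from the error message plus the partitioning rule of thumb above (the input path, size estimate and partition count are placeholders you would derive from your own data):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("broadcast-timeout-example")
    # Either raise the timeout (in seconds)...
    .config("spark.sql.broadcastTimeout", "600")
    # ...or disable automatic broadcast joins entirely:
    # .config("spark.sql.autoBroadcastJoinThreshold", "-1")
    .getOrCreate()
)

df = spark.read.parquet("hdfs:///data/events")  # placeholder path

# Rule of thumb from above: roughly one partition per ~100 MB of input.
estimated_size_mb = 12000  # placeholder; estimate from the files on disk
num_partitions = max(1, estimated_size_mb // 100)
df = df.repartition(num_partitions)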
If I use MySQLdb or JDBC to issue the SQL select * from users against MySQL, and the table has, say, 1 billion records, how many rows would be returned by MySQL in one chunk/packet? I mean, MySQL won't transfer the rows one by one, nor transfer all of the data in one go, right? So what is the default chunk/packet size for one network transfer to the client, please?
If I use a server-side cursor, then I should set the fetch size bigger than the default chunk size for better performance, right?
The implementation notes of MySQL's JDBC driver point out that, by default, the whole result set is retrieved and stored in memory. So if there are 1 billion records, they will all be retrieved, and the limiting factor will probably be the memory of your machine.
So, to sum up, the size of the ResultSet retrieved depends on the JDBC implementation. For example, Oracle's JDBC driver will only retrieve 10 rows at a time and store them in memory.
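On the MySQLdb side mentioned in the question, the closest equivalent of a server-side cursor is SSCursor, which streams rows from the server instead of buffering the whole result set in client memory. A minimal sketch, with placeholder connection details:

import MySQLdb
import MySQLdb.cursors

# SSCursor is an unbuffered (server-side) cursor: rows are streamed from the
# server as you iterate instead of being loaded into client memory all at once.
conn = MySQLdb.connect(
    host="localhost", user="app", passwd="secret", db="mydb",  # placeholders
    cursorclass=MySQLdb.cursors.SSCursor,
)

cur = conn.cursor()
cur.execute("SELECT * FROM users")
for row in cur:   # fetched incrementally over the wire
    print(row)    # stand-in for your per-row processing
cur.close()
conn.close()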
We have nightly load jobs that write several hundred thousand records to a MySQL reporting database running in Amazon RDS.
The load jobs are taking several hours to complete, but I am having a hard time figuring out where the bottleneck is.
The instance is currently running with General Purpose (SSD) storage. By looking at the cloudwatch metrics, it appears I am averaging less than 50 IOPS for the last week. However, Network Receive Throughput is less than 0.2 MB/sec.
Is there any way to tell from this data whether I am being bottlenecked by network latency (we are currently loading the data from a remote server; this will change eventually) or by write IOPS?
If IOPS is the bottleneck, I can easily upgrade to Provisioned IOPS. But if network latency is the issue, I will need to redesign our load jobs to load raw data from EC2 instances instead of our remote servers, which will take some time to implement.
Any advice is appreciated.
UPDATE:
More info about my instance: I am using an m3.xlarge instance, provisioned with 500 GB of storage. The load jobs are done with the ETL tool from Pentaho. They pull from multiple (remote) source databases and insert into the RDS instance using multiple threads.
You aren't using up much CPU. Your memory is very low. An instance with more memory should be a good win.
You're only doing 50-150 IOPS. That's low; you should get 3,000 in a burst on standard SSD-level storage. However, if your database is small, that is probably hurting you (you get 3 IOPS per GB, so if you are on a 50 GB or smaller database, consider paying for provisioned IOPS).
You might also try Aurora; it speaks MySQL, and supposedly has great performance.
If you can spread out your writes, the spikes will be smaller.
A very quick test is to buy provisioned IOPS, but be careful as you may get fewer than you do currently during a burst.
Another quick means to determine your bottleneck is to profile your job execution application with a profiler that understands your database driver. If you're using Java, JProfiler will show the characteristics of your job and its use of the database.
A third is to configure your database driver to print statistics about the database workload. This might inform you that you are issuing far more queries than you would expect.
Your most likely culprit accessing the database remotely is actually round-trip latency. The impact is easy to overlook or underestimate.
If the remote database has, for example, a 75 millisecond round-trip time, you can't possibly execute more than 1000 (milliseconds/sec) / 75 (milliseconds/round trip) = 13.3 queries per second if you're using a single connection. There's no getting around the laws of physics.
The spikes suggest inefficiency in the loading process, where it gathers for a while, then loads for a while, then gathers for a while, then loads for a while.
Separate but related: if you don't have the MySQL client/server compression protocol enabled on the client side, find out how to enable it. (The server always supports compression, but the client has to request it during the initial connection handshake.) This won't fix the core problem, but it should improve the situation somewhat, since less data to physically transfer means less time wasted in transit.
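As a rough illustration only: from a Python MySQLdb client, requesting the compression protocol looks something like this (connection details are placeholders; the mysql CLI equivalent is the --compress flag, and Connector/J uses useCompression=true in the JDBC URL):

import MySQLdb

# Ask for the client/server compression protocol during the handshake.
conn = MySQLdb.connect(
    host="reporting-db.example.com",               # placeholder host
    user="etl", passwd="secret", db="reporting",   # placeholder credentials
    compress=True,
)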
I'm not an RDS expert, and I don't know if my own particular case can shed any light for you, but I hope it gives you some kind of insight.
I have a db.t1.micro with 200 GB provisioned (which gives me a 600 IOPS baseline) on General Purpose SSD storage.
The heaviest workload is when I aggregate thousands of records from a pool of around 2.5 million rows, drawn from a 10-million-row table and another table of 8 million rows. I do this every day. This is what I average (it is steady performance, unlike yours, where I see a pattern of spikes):
Write/ReadIOPS: +600 IOPS
NetworkTrafficReceived/Transmit throughput: < 3,000 Bytes/sec (my queries are relatively short)
Database connections: 15 (workers aggregating on parallel)
Queue depth: 7.5 counts
Read/Write Throughput: 10MB per second
The whole aggregation task takes around 3 hours.
Also check the "10 tips to improve the performance of your app in AWS" slideshare from AWS Summit 2014.
I don't know what else to say since I'm not an expert! Good luck!
In my case it was the number of records. I was writing only 30 records per minute and had write IOPS of roughly the same, 20 to 30. But this was eating into the CPU, which drained the CPU credits quite steeply. So I took all the data in that table, moved it to another "historic" table, and cleared all the data from the original.
CPU dropped back down to normal levels, but write IOPS stayed about the same, which was fine. The problem: indexes. I think that because so many records needed to be indexed when inserting, it took far more CPU to do that indexing with that number of rows, even though the only index I had was the primary key.
Moral of my story: the problem is not always where you think it lies. Although write IOPS had increased, that was not the root cause; it was the CPU being used to do the index work on insert that caused the CPU credits to fall.
Not even X-Ray on the Lambda could catch the increased query time. That is when I started to look at the DB directly.
Your queue depth graph shows > 2, which indicates that the IOPS are under-provisioned. (If the queue depth were < 2, the IOPS would be over-provisioned.)
I think you have used the default AUTOCOMMIT = 1 (autocommit mode). That performs a log flush to disk for every insert and exhausts the IOPS.
So it is better (for performance tuning) to set AUTOCOMMIT = 0 before bulk inserts of data in MySQL, so that the insert queries look like this:
set AUTOCOMMIT = 0;
START TRANSACTION;
-- first 10000 recs
INSERT INTO SomeTable (column1, column2) VALUES (vala1,valb1),(vala2,valb2) ... (vala10000,valb10000);
COMMIT;
-- next 10000 recs
START TRANSACTION;
INSERT INTO SomeTable (column1, column2) VALUES (vala10001,valb10001),(vala10002,valb10002) ... (vala20000,valb20000);
COMMIT;
-- next 10000 recs
.
.
.
set AUTOCOMMIT = 1;
Using the above approach on a t2.micro, I inserted 300,000 records in 15 minutes using PHP.
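For anyone doing the same from Python rather than PHP, a rough sketch of the equivalent pattern with MySQLdb (the table, columns, batch size and the load_rows() helper are placeholders for this example):

import MySQLdb

conn = MySQLdb.connect(host="localhost", user="app", passwd="secret", db="mydb")
conn.autocommit(False)  # same effect as SET AUTOCOMMIT = 0
cur = conn.cursor()

BATCH = 10000
rows = load_rows()  # hypothetical helper returning a list of (column1, column2) tuples

for i in range(0, len(rows), BATCH):
    cur.executemany(
        "INSERT INTO SomeTable (column1, column2) VALUES (%s, %s)",
        rows[i:i + BATCH],
    )
    conn.commit()  # one log flush per batch instead of per row

conn.close()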
We've got a constant stream of simple updates to a single MySQL table (storing user activity information). Let's say we group these into batch updates each second.
I want a ballpark idea of when MySQL on a typical 4-core, 8 GB box will start having trouble keeping up with the updates coming in each second. E.g. how many rows of updates can I make at 1 batch per second?
This is a thought exercise to decide whether I should get going with MySQL in the early days of our application's release (to simplify development), or whether MySQL is likely to fall over soon enough to make it not worth even venturing down that path.
The only way you can get a decent figure is through benchmarking your specific use case. There are just too many variables and there is no way around that.
It shouldn't take too long either: if you knock up a bash script or a small demo app and hammer it with JMeter, that can give you a good idea.
I used JMeter when trying to benchmark a similar use case. The difference was that I was looking at write throughput for a number of INSERTs. The most useful thing that came up while I was experimenting was the innodb_flush_log_at_trx_commit parameter. If you are using InnoDB and don't need ACID compliance for your use case, try changing it to 0. This makes a huge difference to INSERT throughput and will likely do the same for your UPDATE use case. Note, though, that with this setting changes only get flushed to disk once per second, so if your server gets a power cut or similar, you could lose a second's worth of data.
On my Quad Core 8GB Machine for my use case:
innodb_flush_log_at_trx_commit=1 resulted in 80 INSERTS per second
innodb_flush_log_at_trx_commit=0 resulted in 2000 INSERTS per second
These figures will probably bear no relevance to your use case - which is why you need to benchmark it yourself.
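If you want to flip that setting at runtime rather than in my.cnf, a quick sketch (assuming an account allowed to set global variables; the connection details are placeholders):

import MySQLdb

conn = MySQLdb.connect(host="localhost", user="root", passwd="secret")  # placeholders
cur = conn.cursor()
# 0 = flush the InnoDB log roughly once per second instead of on every commit.
# Trade-off: up to a second's worth of committed data can be lost on a crash.
cur.execute("SET GLOBAL innodb_flush_log_at_trx_commit = 0")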
A lot of it depends on the quality of the code which you use to push to the DB.
If you write your batch to insert a single value per INSERT request, i.e.
INSERT INTO table (field) VALUES (value_1);
INSERT INTO table (field) VALUES (value_2);
...
INSERT INTO table (field) VALUES (value_n);
your performance will crash and burn.
If you insert multiple values using a single INSERT, i.e.
INSERT INTO table (field) values (value_1),(value_2)...(value_n);
you'll find that you can easily insert many records per second.
As an example, I wrote a quick app which needed to add the details of a request for an LDAP account to a holding DB. Inserting one field at a time (i.e., LDAP_field, LDAP_value), execution of the whole script took tens of seconds. When I concatenated the values into a single INSERT request, execution time went down to about 2 seconds from start to finish, including the overhead of starting and committing a transaction.
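To make the difference concrete, a small Python sketch of building one multi-row INSERT instead of one statement per value (the holding table and column names are made up for the example):

import MySQLdb

conn = MySQLdb.connect(host="localhost", user="app", passwd="secret", db="mydb")
cur = conn.cursor()
values = ["value_%d" % i for i in range(1000)]  # placeholder data

# Slow: one statement (and one round trip) per value.
# for v in values:
#     cur.execute("INSERT INTO holding (field) VALUES (%s)", (v,))

# Fast: a single statement carrying every value.
placeholders = ",".join(["(%s)"] * len(values))
cur.execute("INSERT INTO holding (field) VALUES " + placeholders, values)
conn.commit()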
Hope this helps
It's not easy to give a general answer to this question. The numbers you ask for depend heavily not only on the hardware of your database server and on MySQL itself, but also on the server/client configuration, the network and, equally importantly, on your database/table design.
Generally speaking, with an out-of-the-box MySQL setup on a state-of-the-art server and update statements using unique keys, I don't have issues below 200 update statements per second if I fire them from localhost; at least that's what I get on my six-year-old WinXP test environment. A fresh installation on a new system will scale way beyond this. If you are thinking much bigger, a single server isn't the way to go. MySQL can be tweaked and scaled out in several ways, which is why many companies rely heavily on it.
Just some basics:
If the fields you want to update have huge index files, the update statements are a lot slower, since each statement has to write not only the data but also the index information.
If your update statement cannot use an index, it might take longer for the server to find the rows it has to update.
Slow memory and/or slow hard disks might also slow down overall server performance.
A slow network connection slows down communication between client and server.
There are whole books written about this, so I'll stop here and point you towards some further reading, if you're interested!