Sequelize / MySQL using 100% CPU when creating or updating data

I need to update or create data in a MySQL table from a large array (a few thousand objects) with Sequelize.
When I run the following code, it uses up almost all the CPU of my DB server (vserver, 2 GB RAM / 2 CPUs) and clogs my app for a few minutes until it's done.
Is there a better way to do this with Sequelize? Can this be done in the background somehow, or as a bulk operation, so it doesn't affect my app's performance?
data.forEach(function(item) {
  var query = {
    'itemId': item.id,
    'networkId': item.networkId
  };
  db.model.findOne({
    where: query
  }).then(function(storedItem) {
    try {
      if(!!storedItem) {
        storedItem.update(item);
      } else {
        db.model.create(item);
      }
    } catch(e) {
      console.log(e);
    }
  });
});

The first line of your sample code, data.forEach(...), makes a whole mess of calls to your function(item){}. The code in that function fires off, in turn, a whole mess of asynchronously completing operations, all at roughly the same time.
Try using the async package (https://caolan.github.io/async/docs.htm) and doing this:
const async = require('async');
...
async.mapSeries(data, function(item){...
It should allow each iteration of your function (which runs once per item in your data array) to complete before starting the next one. Paradoxically enough, doing them one at a time will probably make them finish faster. It will certainly avoid soaking up your resources.
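The same "one at a time" idea can also be sketched with plain async/await instead of the async package. This is a minimal illustration, not the code from the answer: processInSeries and fakeUpsert are hypothetical names, and fakeUpsert only stands in for the findOne/update/create logic from the question.

```javascript
// Process items strictly one after another: the next item starts
// only after the promise for the previous one has settled.
async function processInSeries(items, worker) {
  const results = [];
  for (const item of items) {
    results.push(await worker(item));
  }
  return results;
}

// Hypothetical stand-in for the findOne/update/create logic.
// It tracks how many calls are in flight at the same time.
let running = 0;
let maxRunning = 0;
function fakeUpsert(item) {
  running += 1;
  maxRunning = Math.max(maxRunning, running);
  return new Promise(resolve => setTimeout(() => {
    running -= 1;
    resolve(item.id);
  }, 5));
}

const done = processInSeries([{ id: 1 }, { id: 2 }, { id: 3 }], fakeUpsert);
done.then(ids => console.log(ids, 'max concurrent:', maxRunning));
```

For this to work with real Sequelize calls, the worker has to return the promise of the whole find-then-update-or-create chain, which the forEach version in the question does not do.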

Weeks later I found the actual reason for this (and unfortunately using async didn't really help after all). It was as simple as it was stupid: I didn't have a MySQL index on itemId, so every iteration scanned the whole table, which caused the high CPU load.
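For reference, the kind of index that was missing could be created with a statement like the one built below. This is only a sketch: the table name items and the generated index name are assumptions (the post does not give them), and buildCreateIndexSql is a hypothetical helper. A composite index covering both columns of the where clause should let MySQL resolve each findOne without a full table scan.

```javascript
// Quote a MySQL identifier, doubling any embedded backticks.
function quoteIdentifier(name) {
  return '`' + String(name).replace(/`/g, '``') + '`';
}

// Hypothetical helper: builds a CREATE INDEX statement for the given columns.
function buildCreateIndexSql(table, columns) {
  const indexName = 'idx_' + table + '_' + columns.join('_');
  const columnList = columns.map(quoteIdentifier).join(', ');
  return 'CREATE INDEX ' + quoteIdentifier(indexName) +
    ' ON ' + quoteIdentifier(table) + ' (' + columnList + ')';
}

// 'items' is an assumed table name; the columns come from the question.
const sql = buildCreateIndexSql('items', ['itemId', 'networkId']);
console.log(sql);
```

The statement would be run once against the database, for example from a MySQL client. Sequelize also accepts an indexes option in the model definition, which may be the more natural place for it in a Sequelize project.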

Related

How to make dynamic number of mysql queries in nodejs?

I'm struggling a bit with how to send several SQL update queries to the MySQL server from NodeJS. I need some kind of synchronous execution.
Situation:
I have an array of objects (let's say "formValues").
Some of these objects might have a corresponding row in a database table, others don't.
I need to do a MySQL query for each object in the array to know for which objects a new row needs to be created in MySQL and which ones only need to be updated.
At the end I need to update / create the rows in the MySQL table.
I need one callback for the whole process.
This is more or less a general question on how to solve situations like this with NodeJS. As I understand it, MySQL queries are executed asynchronously.
How can I loop through the array to build one list of entries that need to be updated and another of entries that need to be created in the MySQL table?
Is there any "synchronous" MySQL query?
Regards
Jens
You can probably really benefit from switching to an async/await MySQL client. Chances are that the client you are using already has support for this.
But even if yours doesn't, and you can't switch, by far the easiest option is still to write a helper function that converts your callback-based mysql client to a promise-based one.
Effectively the approach becomes:
for(const formValue of formValues) {
await mysql.query('....');
}
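The helper mentioned above could look roughly like this, using Node's built-in util.promisify. It is a sketch under assumptions: fakeClient is a hypothetical stand-in for a callback-style driver whose query method takes an error-first callback, and runAll is an illustrative name.

```javascript
const util = require('util');

// Hypothetical callback-style client standing in for a real MySQL driver.
const fakeClient = {
  query(sql, callback) {
    // Error-first callback, invoked asynchronously like a real driver would
    setImmediate(() => callback(null, [{ sql }]));
  }
};

// Wrap the callback API once; bind keeps `this` pointing at the client.
const query = util.promisify(fakeClient.query.bind(fakeClient));

async function runAll(statements) {
  const results = [];
  for (const sql of statements) {
    // Each query finishes before the next one is sent
    results.push(await query(sql));
  }
  return results;
}

const pending = runAll(['SELECT 1', 'SELECT 2']);
pending.then(rows => console.log(rows.length));
```

Because runAll returns a single promise, its resolution plays the role of the "one callback for the whole process" asked for in the question.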
Doing this without async/await and with callbacks is significantly harder.
I imagine one approach might be something like:
function doQueriesForFormValue(formValues, cb) {
  const queries = [];
  for (const formValue of formValues) {
    queries.push('...');
  }
  runQueries(queries, cb);
}

function runQueries(queries, cb) {
  // Nothing to run: report success right away
  if (queries.length === 0) {
    return cb(null);
  }
  // Grab the first query
  const currentQuery = queries[0];
  mysql.query(currentQuery, (err, res) => {
    if (err) {
      // Stop at the first failure (Node convention: the error comes first)
      return cb(err);
    }
    if (queries.length > 1) {
      // Do the next query
      runQueries(queries.slice(1), cb);
    } else {
      // Call the final callback
      cb(null);
    }
  });
}
I wrote an article with some effective patterns for working with MySQL in Node.js. Perhaps it's helpful
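To build the two lists the question asks about, one option is to fetch the existing keys with a single SELECT ... WHERE id IN (...) query and split the array in memory. The sketch below shows only the splitting step; the id field and the splitByExistence name are assumptions for illustration, and the list of existing ids is hard-coded where a real query result would go.

```javascript
// Split form values into rows to update and rows to create,
// given the ids that a single lookup query reported as existing.
function splitByExistence(formValues, existingIds) {
  const known = new Set(existingIds);
  const toUpdate = [];
  const toCreate = [];
  for (const value of formValues) {
    (known.has(value.id) ? toUpdate : toCreate).push(value);
  }
  return { toUpdate, toCreate };
}

// Example data; in practice existingIds would come from the database.
const formValues = [
  { id: 1, name: 'a' },
  { id: 2, name: 'b' },
  { id: 3, name: 'c' }
];
const plan = splitByExistence(formValues, [2]);
console.log(plan.toUpdate.length, plan.toCreate.length);
```

This replaces one lookup query per object with one lookup for the whole array, leaving only the actual writes to run afterwards.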

AWS Lambda - MySQL caching

I have a Lambda that uses RDS. I wanted to improve it by using Lambda connection caching. I found several articles and implemented it on my side to the best of my knowledge, but now I am not sure whether this is the right way to go.
I have a Lambda (running Node 8) which has several files loaded with require. I will start from the main function and follow the path until I reach the MySQL initializer. Everything is kept super simple, showing only the flow of the code that runs MySQL:
Main Lambda:
const jobLoader = require('./Helpers/JobLoader');
exports.handler = async (event, context) => {
const emarsysPayload = event.Records[0];
let validationSchema;
const body = jobLoader.loadJob('JobName');
...
return;
...//
Job Code:
const MySQLQueryBuilder = require('../Helpers/MySqlQueryBuilder');
exports.runJob = async (params) => {
const data = await MySQLQueryBuilder.getBasicUserData(userId);
MySQLBuilder:
const mySqlConnector = require('../Storage/MySqlConnector');
class MySqlQueryBuilder {
async getBasicUserData (id) {
let query = `
SELECT * from sometable WHERE id= ${id}
`;
return mySqlConnector.runQuery(query);
}
}
And Finally the connector itself:
const mySqlConnector = require('promise-mysql');
const pool = mySqlConnector.createPool({
host: process.env.MY_SQL_HOST,
user: process.env.MY_SQL_USER,
password: process.env.MY_SQL_PASSWORD,
database: process.env.MY_SQL_DATABASE,
port: 3306
});
exports.runQuery = async query => {
  const con = await pool.getConnection();
  try {
    // Wait for the result before handing the connection back to the pool
    return await con.query(query);
  } finally {
    con.release();
  }
};
I know that measuring performance will show the actual results, but today is Friday, and I will not be able to run this on Lambda until late next week... And really, it would be an awesome start to the weekend knowing whether I am heading in the right direction... or not.
Thanks for the input.
First thing would be to understand how require works in NodeJS. I do recommend you go through this article if you're interested in knowing more about it.
Now, once you have required your connection, you have it for good and it won't be required again. This matches what you're looking for as you don't want to overwhelm your database by creating a new connection every time.
But, there is a problem...
Lambda Cold Starts
Whenever you invoke a Lambda function for the first time, it will spin up a container with your function inside it and keep it alive for approximately 5 mins. It's very likely (although not guaranteed) that you will hit the same container every time as long as you are making 1 request at a time. But what happens if you have 2 requests at the same time? Then another container will be spun up in parallel with the previous, already warmed up container. You have just created another connection on your database and now you have 2 containers. Now, guess what happens if you have 3 concurrent requests? Yes! One more container, which equals one more DB connection.
As long as there are new requests to your Lambda functions, by default, they will scale out to meet demand (you can configure it in the console to limit the execution to as many concurrent executions as you want - respecting your Account limits)
You cannot safely make sure you have a fixed amount of connections to your Database by simply requiring your code upon a Function's invocation. The good thing is that this is not your fault. This is just how Lambda functions behave.
...one other approach is
to cache the data you want in a real caching system, like ElastiCache, for example. You could then have one Lambda function triggered by a CloudWatch Event that runs at a fixed interval. This function would query your DB and store the results in your external cache. This way you make sure your DB connection is only opened by one Lambda at a time, because it follows the CloudWatch Event, which runs only once per trigger.
EDIT: after the OP sent a link in the comments section, I decided to add some more information to clarify what the mentioned article is saying.
From the article:
"Simple. You ARE able to store variables outside the scope of our
handler function. This means that you are able to create your DB
connection pool outside of the handler function, which can then be
shared with each future invocation of that function. This allows for
pooling to occur."
And this is exactly what you're doing. And this works! But the problem is if you have N connections (Lambda Requests) at the same time. If you don't set any limits, by default, up to 1000 Lambda functions can be spun up concurrently. Now, if you then make another 1000 requests simultaneously in the next 5 minutes, it's very likely you won't be opening any new connections, because they have already been opened on previous invocations and the containers are still alive.
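The container reuse described here can be sketched without any AWS or MySQL dependency. In this simplified illustration createPool is a hypothetical stand-in for mySqlConnector.createPool(...), and handler is synchronous only to keep the example short; a real handler would be async.

```javascript
// Module scope: this state survives between invocations that land
// in the same (warm) container, but every new container starts fresh.
let pool = null;
let poolsCreated = 0;

// Hypothetical factory standing in for mySqlConnector.createPool(...)
function createPool() {
  poolsCreated += 1;
  return { id: poolsCreated };
}

// Create the pool on first use, then keep returning the same one.
function getPool() {
  if (pool === null) {
    pool = createPool();
  }
  return pool;
}

// Simplified handler: reuses the pool instead of opening a new one
function handler(event) {
  return getPool().id;
}

handler({});
handler({});
console.log(poolsCreated);
```

Each concurrently running container executes this module code separately, so the total number of pools is bounded by the number of containers, not by this code. If the connection count matters, the pool's connectionLimit option (available in the mysql driver's createPool) and the function's concurrency limit are the two settings to look at.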
Adding to the answer above by Thales Minussi but for a Python Lambda. I am using PyMySQL and to create a connection pool I added the connection code above the handler in a Lambda that fetches data. Once I did this, I was not getting any new data that was added to the DB after an instance of the Lambda was executed. I found bugs reported here and here that are related to this issue.
The solution that worked for me was to add a conn.commit() after the SELECT query execution in the Lambda.
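According to the PyMySQL documentation, conn.commit() is supposed to commit any changes, but a SELECT does not make changes to the DB, so at first I was not sure why this works. A likely explanation is that PyMySQL leaves autocommit off by default, so the reused connection stays inside one transaction and, under InnoDB's default REPEATABLE READ isolation, keeps reading from the same snapshot until commit() ends that transaction.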

Lifetime of a Web SQL transaction

What's the lifetime of a Web SQL transaction, or, if it's dynamic, what does it depend on?
From my experience opening a new transaction takes a considerable amount of time, so I was trying to keep the transaction open for the longest time possible.
I also wanted to keep the code clean, so I was trying to separate the JS into abstract functions and passing a transaction as a parameter - something I'm sure is not good practice but sometimes greatly improves performance when it works.
As an example:
db.transaction(function (tx) {
// First question: how many tx.executeSql
// calls are allowed within one transaction?
tx.executeSql('[some query]');
tx.executeSql('[some other query]', [], function (tx, results) {
// Do something with results
});
// Second question: passing the transaction
// works some times, but not others. Is this
// allowed by the spec, good practice, and/or
// limited by any external factors?
otherFunction(tx, 'some parameter');
});
function otherFunction(tx, param) {
tx.executeSql('[some query]');
}
Also, any suggestions on techniques for speedy access to the Web SQL database would be welcome as well.

Using Other Data Sources for cubism.js

I like the user experience of cubism, and would like to use this on top of a backend we have.
I've read the API docs and some of the code, but most of this seems to be abstracted away. How exactly could I begin to use other data sources?
I have a data store of about 6k individual machines with 5 minute precision on around 100 or so stats.
I would like to query some web app with a specific identifier for that machine and then render a dashboard similar to cubism via querying a specific mongo data store.
Writing the webapp or the querying to mongo isn't the issue.
The issue is more in line with the fact that cubism seems to require querying whatever data store you use for each individual data point (say you have 100 stats across a window of a week...expensive).
Is there another way I could leverage this tool to look at data that gets loaded using something similar to the code below?
var data = [];
// concat returns a new array, so the result has to be assigned back
d3.json("/initial", function(json) { data = data.concat(json); });
d3.json("/update", function(json) { data.push(json); });
Cubism takes care of initialization and update for you: the initial request is the full visible window (start to stop, typically 1,440 data points), while subsequent requests are only for a few most recent metrics (7 data points).
Take a look at context.metric for how to implement a new data source. The simplest possible implementation is like this:
var foo = context.metric(function(start, stop, step, callback) {
d3.json("/data", function(data) {
if (!data) return callback(new Error("unable to load data"));
callback(null, data);
});
});
You would extend this to change the "/data" URL as appropriate, passing in the start, stop and step times, and whatever else you want to use to identify a metric. For example, both Cube and Graphite use a metric expression as an additional query parameter.
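As a rough illustration of that extension, the URL could be built from the arguments Cubism passes to the request function. This sketch assumes start and stop arrive as Date objects and step as a number of milliseconds; the query parameter names (id, start, stop, step) and the buildMetricUrl helper are hypothetical and depend entirely on your backend.

```javascript
// Build the request URL for one metric. The parameter names are
// assumptions about a hypothetical backend, not part of Cubism.
function buildMetricUrl(baseUrl, machineId, start, stop, step) {
  const params = [
    ['id', machineId],
    ['start', start.toISOString()],
    ['stop', stop.toISOString()],
    ['step', step]
  ];
  const query = params
    .map(([key, value]) => encodeURIComponent(key) + '=' + encodeURIComponent(value))
    .join('&');
  return baseUrl + '?' + query;
}

const url = buildMetricUrl(
  '/data', 'machine-42',
  new Date(Date.UTC(2020, 0, 1, 0, 0, 0)),
  new Date(Date.UTC(2020, 0, 1, 1, 0, 0)),
  300000 // 5 minutes in milliseconds
);
console.log(url);
```

Inside the context.metric callback, the resulting URL would take the place of the fixed "/data" string, so each request covers only the window Cubism asks for.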

Parallel deserialization of Json from a database

This is the scenario: in a separate task I read from a data reader that represents a single-column result set of strings, each one a JSON document. In that task I add each JSON string to a BlockingCollection that wraps a ConcurrentQueue. At the same time, in the main thread, I TryTake (dequeue) a JSON string from the collection and then yield return it deserialized.
Reading from the database and deserializing run at approximately the same speed, so a large BlockingCollection should not cause too much memory consumption.
When the reading from the database is done, the task completes and I then deserialize all the remaining JSON strings.
Questions/thoughts:
1) Does the TryTake lock so that no adding can be done?
2) Don't do it. Just do it in serial and yield return.
using (var q = new BlockingCollection<string>())
{
    Task task = null;
    try
    {
        task = new Task(() =>
        {
            foreach (var json in sourceData)
                q.Add(json);
        });
        task.Start();
        while (!task.IsCompleted)
        {
            string json;
            if (q.TryTake(out json))
                yield return Deserialize<T>(json);
        }
        Task.WaitAll(task);
    }
    finally
    {
        if (task != null)
        {
            task.Dispose();
        }
        q.CompleteAdding();
    }
    foreach (var e in q.GetConsumingEnumerable())
        yield return Deserialize<T>(e);
}
Question 1
Does the TryTake lock so that no adding can be done
There will be a very brief period during which an add cannot be performed, but this time will be negligible. From http://msdn.microsoft.com/en-us/library/dd997305.aspx
Some of the concurrent collection types use lightweight
synchronization mechanisms such as SpinLock, SpinWait, SemaphoreSlim,
and CountdownEvent, which are new in the .NET Framework 4. These
synchronization types typically use busy spinning for brief periods
before they put the thread into a true Wait state. When wait times are
expected to be very short, spinning is far less computationally
expensive than waiting, which involves an expensive kernel transition.
For collection classes that use spinning, this efficiency means that
multiple threads can add and remove items at a very high rate. For
more information about spinning vs. blocking, see SpinLock and
SpinWait.
The ConcurrentQueue and ConcurrentStack classes do not use locks
at all. Instead, they rely on Interlocked operations to achieve
thread-safety.
Question 2:
Don't do it. Just do it in serial and yield return.
This seems like the way to go. As with any optimisation work, do what is simplest and then measure! If there is a bottleneck here, consider optimising, but at least you'll know whether your 'optimisations' are actually helping by virtue of having metrics to compare against.