How to optimize a sequential insert on MySQL

I'm trying to implement an HTTP event streaming server using MySQL where Users are able to append an event to a stream (a MySQL table) and also define the expected sequence number of the event.
The logic is fairly simple:
Open a transaction
Get the next sequence number to insert into the table
Verify that the next sequence number matches the expected one (if supplied)
Insert into the database
Here's my code:
public async append(
  data: any = {},
  expectedSeq?: number
): Promise<void> {
  let published_at = $date.create();
  try {
    await $mysql.transaction(async trx => {
      let max = await trx(this.table)
        .max({
          seq: "seq",
        })
        .first();
      if (!max) {
        throw $error.InternalError(`unexpected mysql response`);
      }
      let next = (max.seq || 0) + 1;
      if (expectedSeq && expectedSeq !== next) {
        throw $error.ExpectationFailed(
          `expected seq does not match current seq`
        );
      }
      await trx(this.table).insert({
        published_at,
        seq: next,
        data: $json.stringify(data),
      });
    });
  } catch (err) {
    if (err.code === "ER_DUP_ENTRY") {
      return this.append(data, expectedSeq);
    }
    throw err;
  }
}
My problem is that this is extremely slow, since there are race conditions between parallel requests appending to the same stream: on my laptop, inserts/second on one stream went from ~1k to ~75.
Any pointers/suggestions on how to optimize this logic?
CONCLUSION
After considering the comments, I decided to go with auto increment and reset the auto_increment only if there's an error. It yields around the same writes/sec when expectedSeq is supplied, but a much higher rate if ordering is not required.
Here's the solution:
public async append(data: any = {}, expectedSeq?: number): Promise<Event> {
  if (!$validator.validate(data, this.schema)) {
    throw $error.ValidationFailed("validation failed for event data");
  }
  let published_at = $date.create();
  try {
    let seq = await $mysql.transaction(async _trx => {
      let result = (await _trx(this.table).insert({
        published_at,
        data: $json.stringify(data),
      })).shift();
      if (!result) {
        throw $error.InternalError(`unexpected mysql response`);
      }
      if (expectedSeq && expectedSeq !== result) {
        throw $error.ExpectationFailed(
          `expected seq ${expectedSeq} but got ${result}`
        );
      }
      return result;
    });
    return eventFactory(this.topic, seq, published_at, data);
  } catch (err) {
    await $mysql.raw(`ALTER TABLE ${this.table} auto_increment = ${this.seqStart}`);
    throw err;
  }
}

Why does the web page need to provide the sequence number? That is just a recipe for messy code, perhaps even messier than what you sketched out. Simply let the auto_increment value be returned to the User.
INSERT ...;
SELECT LAST_INSERT_ID(); -- session-specific, so no need for transaction.
Return that value to user.
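For illustration, here is a minimal sketch of that approach, reusing the knex-style $mysql wrapper and helpers from the question (the wrapper, the $date/$json/$error helpers, and calling the wrapper outside a transaction are all assumptions carried over from the original code). On MySQL the wrapper's insert() resolves to an array containing the auto-increment id, which is the same value LAST_INSERT_ID() would return for that connection:
public async append(data: any = {}): Promise<number> {
  let published_at = $date.create();
  // one insert, no MAX(seq) read and no expected-sequence check
  let [seq] = await $mysql(this.table).insert({
    published_at,
    data: $json.stringify(data),
  });
  if (!seq) {
    throw $error.InternalError(`unexpected mysql response`);
  }
  // seq is the auto_increment value assigned by MySQL; return it to the user
  return seq;
}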

Why not use Apache Kafka? It does all of this natively. With the easy answer out of the way: optimization is always tricky with partial information, but I think you've given us one hint that enables a suggestion. You said that without the ordering requirement it performs much faster, which means that getting the max value is what is taking so long. That tells me a few things: first, this value is not the clustered index (which is good news); second, you probably do not have sufficient index support (also good news, since it's fixable by creating an index on this column, sorted descending). This sounds like a table with millions or billions of rows, and this particular column has no guaranteed order; without the right indexing you could be doing a table scan between inserts just to get the max value.
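If that diagnosis matches your table, here is a sketch of the index suggestion above, using the $mysql.raw helper already present in the question's code (the events table name is a stand-in for this.table; MySQL only stores a truly descending index on 8.0+, but even an ascending index on seq turns MAX(seq) into a single index read):
// hypothetical table name; in the question's class this would be this.table
await $mysql.raw(`ALTER TABLE events ADD INDEX idx_events_seq (seq)`);
// with the index in place, MAX(seq) is read from the end of the index
// instead of scanning the whole table between inserts
let row = await $mysql.raw(`SELECT MAX(seq) AS seq FROM events`);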

Why not use a GUID for your primary key instead of an auto-incremented integer? Then your client could generate the key itself and would always know the value it is about to insert.
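A rough sketch of that idea, assuming Node 14.17+ for crypto.randomUUID, a CHAR(36) id column, and the same $mysql/$date/$json helpers as in the question (all of those specifics are assumptions, not something this answer prescribes):
import { randomUUID } from "crypto";

let data = { example: true };
// the key is generated client-side, so the insert never has to read MAX(seq)
// or wait for an auto_increment value to come back
let id = randomUUID(); // e.g. "3b241101-e2bb-4255-8caf-4136c566a962"
await $mysql("events").insert({
  id,                          // assumed CHAR(36) primary key column
  published_at: $date.create(),
  data: $json.stringify(data),
});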

Batch inserts versus singleton inserts
Your latency/performance problem is due to a batch size of 1, as each send to the database requires a round trip to the RDBMS. Rather than inserting one row at a time, with a commit and verification after each row, rewrite your code to issue batches of 100 or 1000 rows, inserting n rows and verifying per batch rather than per row. If a batch insert fails, you can retry it one row at a time, as sketched below.
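A rough sketch of that shape using the knex-style wrapper from the original question (the wrapper, the row shape, and the fallback strategy are illustrative assumptions, not a drop-in replacement):
// insert n rows per statement instead of one at a time
async function appendBatch(
  table: string,
  rows: Array<{ published_at: any; data: string }>
) {
  try {
    // knex-style insert accepts an array of rows and sends one multi-row
    // statement, so the round trip and commit cost is paid once per batch
    await $mysql(table).insert(rows);
  } catch (err) {
    // per the advice above: if the batch fails, retry one row at a time
    for (let row of rows) {
      await $mysql(table).insert(row);
    }
  }
}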

Related

Sequelize insert query with transaction using multithreading in Node.js

I am facing an issue where I am getting frequent requests and my code gets stuck on an insert query.
The code first checks whether a record exists and, based on the result, decides whether to insert or update. The issue is that when the first request comes in, it executes the create query inside a transaction, and that transaction commits 3-4 seconds later. When a second request comes in while the first is still executing, its existence check returns null, so the code goes into the create branch again when it should have gone into the update branch.
let data = await Model.findOne({
  where: { code: variantCode }
});
if (!data) {
  variant = await Model.create(body, {
    transaction
  });
} else {
  await data.update(body);
}
I have already tried upsert, but that doesn't work.
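One way to close that window, sketched here for Sequelize v6 (Model, body and variantCode are taken from the question; the transaction/lock options are standard Sequelize, but treat the exact shape as illustrative), is to do the existence check and the write inside one transaction with a row lock:
await sequelize.transaction(async (transaction) => {
  // SELECT ... FOR UPDATE: a concurrent request blocks on this lock until the
  // first transaction commits, so it then sees the row instead of null
  const data = await Model.findOne({
    where: { code: variantCode },
    transaction,
    lock: transaction.LOCK.UPDATE,
  });
  if (!data) {
    await Model.create(body, { transaction });
  } else {
    await data.update(body, { transaction });
  }
});
For a fully race-free version you would also want a unique index on code and to handle the duplicate-key error (or use Model.findOrCreate, which leans on that unique constraint internally).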

Node.js level (API key) based rate limit

I have spent the last hour looking for a solution to rate-limit my API.
I want to limit a path, /users for example, but most rate limiters apply a single limit for everyone. I want to use API keys that can be generated by a user. People can generate a free API key with, say, 1000 requests per day; then, if they pay some money, they can get 5000 requests per day.
I would like to store these API keys in a MySQL database.
Does anyone have a solution for this?
One way to structure your project would be:
A user_keys table that includes the API key, the user, the time of creation, and the number of uses so far.
When a user tries to generate a key, check that one doesn't exist yet, and add it to the DB.
When a request arrives, check if the key exists; if it does, do the following:
1: if it has been 24 hours since the last reset, set the number of uses to 0
2: increment the use count
If you find the API key and its count is already at 1k, the user has reached their limit.
This is a basic implementation and isn't very efficient; you'll want to cache the API keys in memory, either in a plain hashmap in Node.js or using memcached/redis. But get it working first before trying to optimize it.
EDIT: some code examples
//overly simple in-memory cache
const apiKeys = {}
//one day's worth in milliseconds, used later on
const oneDayTime = 1000 * 60 * 60 * 24
//function to generate new API keys
function generateKey(user) {
  if (apiKeys[user]) {
    throw Error("user already has a key")
  }
  let key = makeRandomHash(); // just some function that creates a random string like "H#4/&DA23#$X/"
  //share the object so it can be reached by either key or user
  //terrible idea, but when you save this in mysql you can just do a normal search query
  apiKeys[user] = {
    key: key,
    user: user,
    checked: Date.now(),
    uses: 0
  }
  apiKeys[key] = apiKeys[user]
}
// a function that does all the key verification for us
function isValid(key) {
  //check if the key even exists first
  if (!apiKeys[key]) throw Error("invalid key")
  //if it's been a whole day since it was last checked, reset its uses
  if (Date.now() - apiKeys[key].checked >= oneDayTime) {
    apiKeys[key].uses = 0
    apiKeys[key].checked = Date.now()
  }
  //check if the user limit cap is reached
  if (apiKeys[key].uses >= 1000) throw Error("user daily quota reached");
  //increment the user's count and exit the function without errors
  apiKeys[key].uses++;
}
//express middleware function
function limiter(req, res, next) {
  try {
    // get the API key; it can be anywhere: in the JSON body, a header, or even a GET query param
    let key = req.body["api_key"]
    // if the key is not valid, isValid will throw
    isValid(key)
    // pass on to the next function if there were no errors
    next()
  } catch (e) {
    res.status(429).send(e.message)
  }
}
This is an overly simplified implementation, but I hope it gets the idea across.
The main thing you will want to change is how the API keys are saved and retrieved; a MySQL-backed sketch follows.
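For the MySQL-backed version, the check-and-count can be pushed into one atomic UPDATE so two concurrent requests can't both slip under the limit. A sketch assuming the mysql2/promise driver and a user_keys table shaped roughly like the one described above (the column names here are illustrative):
const mysql = require("mysql2/promise");
const pool = mysql.createPool({ host: "localhost", user: "app", database: "api" });

// one statement both checks and increments, so there is no separate
// "read the count, then write it back" window for concurrent requests
async function isValid(key) {
  const [result] = await pool.execute(
    `UPDATE user_keys
        SET uses = uses + 1
      WHERE api_key = ?
        AND uses < daily_limit
        AND last_reset > NOW() - INTERVAL 1 DAY`,
    [key]
  );
  // 0 affected rows means: unknown key, quota exhausted, or window expired
  // (a separate statement or cron would reset uses/last_reset for a new window)
  return result.affectedRows === 1;
}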

Why a variable defined in a Node module is shared between two users

I run a node API server with pm2 cluster mode that will communicate with a mysql DB server.
In module x.js I have a code like this:
let insertMappingQuery = ``;
...
...
const constructInsertMappingQuery = () => {
  insertMappingQuery += `
    INSERT IGNORE INTO messages_mapping (message_id, contact_id)
    VALUES (` + message_id + `, ` + contact_id + `);`;
}
When a user sends a message, a function calls module x and the code above is executed for his message (let's say message_id = 1):
INSERT IGNORE INTO messages_mapping (message_id, contact_id)
VALUES (1, some_id);
Then another user sends a message and the code is executed for, let's say, message_id = 2; however, the query will look like this:
INSERT IGNORE INTO messages_mapping (message_id, contact_id)
VALUES (1, some_id);
INSERT IGNORE INTO messages_mapping (message_id, contact_id)
VALUES (2, some_id);
So basically when user two sends a message, this query will contain what user one already executed. So user one will have his record inserted twice.
This doesn't happen all the time but it happens a lot (I would say 30% to 50%) and I couldn't find any pattern when this happens.
Users don't have to do it at the same time, there might be some time difference (minutes or even hours).
Could this be related to the variable not being cleared from memory, or a memory leak of some kind?
I don't understand how two different users will share a variable.
Remember that require caches modules and all subsequent require calls are handed the same exports object, so write something that exports a function or class, so that you can safely call/instantiate things without variables getting shared.
For example:
const db = require(`your/db/connector`);
const Mapper = {
  addToMessageMapping: async function(messageId, contactId) {
    const query = `
      INSERT IGNORE INTO messages_mapping (message_id, contact_id)
      VALUES (${messageId}, ${contactId});
    `;
    ...
    return db.run(query);
  },
  ...
}
module.exports = Mapper;
And of course this could have been a class, too, or it could even have been that function directly - the only thing that changes is how you make it run that non-conflicting-with-any-other-call function.
Now, consumers of this code simply trust that the following is without side effects:
const mapper = require('./mapper.js');
const express, app, etc, whatever = ...
....
app.post(`/api/v1/mappings/message/:msgId`, (req, res, next) => {
  const post = getPOSTparamsTheUsualWay();
  mapper.addToMessageMapping(req.params.msgId, post.contactId)
    .then(() => next())
    .catch(error => next(error));
}, ..., moreMiddleware, ... , (req, res) => {
  res.render(`blah.html`, {...});
});
Also note that template strings exist specifically to avoid composing strings by concatenating them with +; the whole point is that they can take ${...} inside them and template in whatever is in those curly braces (variables, function calls, any JS really).
(The second power they have is that you can prefix-tag them with a function name and that function will run as part of the templating action, but not a lot of folks need this on a daily basis. ${...} templating, though? Every day, thousands of times.)
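For illustration, here is the interpolated form of the query from the question plus a trivial tag function; this is plain standard JavaScript, and as the next paragraph points out, for SQL you would still want placeholders rather than interpolation:
const messageId = 42, contactId = 7;

// ${...} interpolation instead of + concatenation
const query = `INSERT IGNORE INTO messages_mapping (message_id, contact_id)
  VALUES (${messageId}, ${contactId});`;

// a tag function receives the literal chunks and the interpolated values
function placeholders(strings, ...values) {
  console.log(values);            // [ 42, 7 ]
  return strings.raw.join("?");   // "VALUES (?, ?);"
}
const templateAsPlaceholders = placeholders`VALUES (${messageId}, ${contactId});`;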
And of course on a last note: it looks like you're creating raw SQL, which is always a bad idea. Use prepared statements for whatever database library you're using: it supports them, and means any user-input is made safe. Right now, someone could post to your API with a message id that's ); DROP TABLE messages_mapping; -- and done: your table's gone. Fun times.
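A sketch of the parameterized version, assuming the mysql2 driver (the post doesn't say which MySQL library is in use, so the connection setup and execute call are illustrative):
const mysql = require("mysql2/promise");
const pool = mysql.createPool({ host: "localhost", user: "app", database: "chat" });

async function addToMessageMapping(messageId, contactId) {
  // ? placeholders travel separately from the SQL text, so a message id like
  // "); DROP TABLE messages_mapping; --" is treated as data, never as SQL
  return pool.execute(
    "INSERT IGNORE INTO messages_mapping (message_id, contact_id) VALUES (?, ?)",
    [messageId, contactId]
  );
}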
Apparently I didn't know that requiring a module caches it and reuses it, so module-level variables in that module persist between calls too.
So the best solution here is to avoid using module-level (global) variables and restructure the code. However, if you need a quick fix you can use:
delete require.cache[require.resolve('replaceWithModulePathHere')]
Example:
let somefuncThatNeedsModuleX = () => {
  delete require.cache[require.resolve('./x')];
  const x = require('./x');
}

Updating Large Volume of data quickly

I have made an application in Node.js that calls an endpoint every minute and gets a JSON array of about 100000 elements. I need to upsert these elements into my database, such that if an element doesn't exist I insert it with the "Point" column set to 0.
So far I have a cron job and a simple upsert query, but it's so slow:
var q = async.queue(function (data, done) {
  db.query('INSERT INTO stat (`user`, `user2`, `point`) VALUES ' + data.values +
    ' ON DUPLICATE KEY UPDATE point = point + 10', function (err, result) {
    if (err) throw err;
    done();
  });
}, 100000);
//Cron job here: every 1 minute execute the lines below
var values = '';
for (var v = 0; v < stats.length; v++) {
  values = '("JACK","' + stats[v] + '", 0)';
  q.push({ values: values });
}
How can I do such a task in a very short amount of time? Is using MySQL the wrong decision? I'm open to any other architecture or solution. Note that I have to do this every minute.
I fixed this problem by using a bulk upsert (from the documentation)! I managed to upsert over 24k rows in less than 3 seconds. Basically, build the query first, then run it:
INSERT INTO table (a,b,c) VALUES (1,2,3),(4,5,6)
ON DUPLICATE KEY UPDATE c=VALUES(a)+VALUES(b);
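For completeness, a sketch of building and running the same kind of bulk statement from Node with driver-side escaping instead of string concatenation (assuming the mysql2 driver; the classic mysql driver's query(sql, values) form expands nested arrays the same way):
const mysql = require("mysql2/promise");
const pool = mysql.createPool({ host: "localhost", user: "app", database: "stats" });

async function bulkUpsert(stats) {
  // nested arrays expand to (a,b,c),(d,e,f),... so the whole minute's worth
  // of rows goes to the server in a single statement
  const rows = stats.map((s) => ["JACK", s, 0]);
  return pool.query(
    "INSERT INTO stat (`user`, `user2`, `point`) VALUES ? " +
      "ON DUPLICATE KEY UPDATE point = point + 10",
    [rows]
  );
}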

Linq to SQL: Queries don't look at pending changes

Follow up to this question. I have the following code:
string[] names = new[] { "Bob", "bob", "BoB" };
using (MyDataContext dataContext = new MyDataContext())
{
    foreach (var name in names)
    {
        string s = name;
        if (dataContext.Users.SingleOrDefault(u => u.Name.ToUpper() == s.ToUpper()) == null)
            dataContext.Users.InsertOnSubmit(new User { Name = name });
    }
    dataContext.SubmitChanges();
}
...and it inserts all three names ("Bob", "bob" and "BoB"). If this were LINQ to Objects, it wouldn't.
Can I make it look at the pending changes as well as what's already in the table?
I don't think that would be possible in general. Imagine you made a query like this:
dataContext.Users.InsertOnSubmit(new User { GroupId = 1 });
var groups = dataContext.Groups.Where(grp => grp.Users.Any());
The database knows nothing about the new user (yet) because the insert hasn't been committed, so the generated SQL query might not return the Group with Id = 1. The only way the DataContext could take the not-yet-submitted insert into account in cases like this would be to fetch the whole Groups table (and possibly more tables, if they are affected by the query) and perform the query on the client, which is of course undesirable. I guess the L2S designers decided it would be counterintuitive if some queries took not-yet-committed inserts into account while others didn't, so they chose to never take them into account.
Why don't you use something like
foreach (var name in names.Distinct(StringComparer.InvariantCultureIgnoreCase))
to filter out duplicate names before hitting the database?
Why don't you try something like this:
foreach (var name in names)
{
    string s = name;
    if (dataContext.Users.SingleOrDefault(u => u.Name.ToUpper() == s.ToUpper()) == null)
    {
        dataContext.Users.InsertOnSubmit(new User { Name = name });
        break;
    }
}
I'm sorry, I don't know LINQ to SQL all that well.
But when I look at the code, it seems you are telling it to insert all the records at once (similar to a transaction) using SubmitChanges, while you are checking for their existence in the DB before any of the records have been inserted.
EDIT: Try putting SubmitChanges inside the loop and the code should run as you expect.
You can query the appropriate ChangeSet collection, such as
if (
    dataContext.Users.
        Union(dataContext.GetChangeSet().Inserts).
        Except(dataContext.GetChangeSet().Deletes).
        SingleOrDefault(u => u.Name.ToUpper() == s.ToUpper()) == null)
This will create a union of the values in the Users table and the pending Inserts, and will exclude pending deletes.
Of course, you might want to create a changeSet variable to prevent multiple calls to the GetChangeSet function, and you may need to appropriately cast the object in the collection to the appropriate type. In the Inserts and Deletes collections, you may want to filter it with something like
...GetChangeSet().Inserts.Where(o => o.GetType() == typeof(User)).OfType<User>()...