Updating a large volume of data quickly - MySQL

I have made an application in Node.js that calls an endpoint every minute and gets a JSON array with about 100,000 elements. I need to upsert these elements into my database such that, if an element doesn't exist, I insert it with the "Point" column set to 0.
So far I have a cron job and a simple upsert query, but it's very slow:
var q = async.queue(function (data, done) {
    db.query('INSERT INTO stat (`user`, `user2`, `point`) VALUES ' + data.values +
             ' ON DUPLICATE KEY UPDATE point = point + 10', function (err, result) {
        if (err) throw err;
        done(); // tell the queue this task is finished
    });
}, 100000);
// Cron job: every minute, execute the lines below.
var values = '';
for (var v = 0; v < stats.length; v++) {
    values = '("JACK","' + stats[v] + '", 0)';
    q.push({ values: values });
}
How can I do such a task in a very short amount of time? Is using MySQL the wrong decision? I'm open to any other architecture or solution. Note that I have to do this every minute.

I fixed this problem by using a bulk upsert (from the documentation)! I managed to upsert over 24k rows in less than 3 seconds. Basically, I built the query first and then ran it:
INSERT INTO table (a,b,c) VALUES (1,2,3),(4,5,6)
ON DUPLICATE KEY UPDATE c=VALUES(a)+VALUES(b);
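For illustration, here is roughly how that batched statement can be assembled with the node mysql driver (a minimal sketch, assuming the stat table from the question and a connection named db; the driver expands a nested array into the multi-row VALUES list):

var mysql = require('mysql');
var db = mysql.createConnection({ host: 'localhost', user: 'root', database: 'mydb' });

// One [user, user2, point] triple per row; VALUES ? expands to (...),(...),...
function bulkUpsert(stats, done) {
    var rows = stats.map(function (s) { return ['JACK', s, 0]; });
    db.query(
        'INSERT INTO stat (`user`, `user2`, `point`) VALUES ? ' +
        'ON DUPLICATE KEY UPDATE point = point + 10',
        [rows],
        done
    );
}

One multi-row statement per minute keeps the round trips (and index maintenance) to a single pass, which is where the speedup comes from.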

Related

Sequelize insert query with transaction using multi-threading in Node.js

I am facing an issue where I am getting frequent requests and my code gets stuck on the insert query.
The code first checks whether the record exists and, based on the result, decides whether to insert or update. The issue is that when the first request comes in, it executes the create query inside a transaction, and that transaction commits 3-4 seconds later. When a second request arrives while the first is still executing, the existence check returns null, so my code goes into the create branch again when it should have gone into the update branch.
let data = await Model.findOne({
    where: { code: variantCode }
});
if (!data) {
    variant = await Model.create(body, {
        transaction
    });
} else {
    await data.update(body);
}
I have already tried upsert, but that doesn't work.
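One hedged sketch of a fix (not from the thread): do the check and the write inside one transaction and take a row lock, so the second request blocks until the first commits. Names such as variantCode and body follow the question; a unique index on code is still needed, because FOR UPDATE cannot lock a row that does not exist yet.

const variant = await sequelize.transaction(async (t) => {
    const existing = await Model.findOne({
        where: { code: variantCode },
        transaction: t,
        lock: t.LOCK.UPDATE, // SELECT ... FOR UPDATE
    });
    if (!existing) {
        // With a unique index on `code`, the losing request gets a
        // duplicate-key error to catch instead of creating a second row.
        return Model.create(body, { transaction: t });
    }
    return existing.update(body, { transaction: t });
});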

How to optimize a sequential insert on MySQL

I'm trying to implement an HTTP event streaming server using MySQL where Users are able to append an event to a stream (a MySQL table) and also define the expected sequence number of the event.
The logic is somewhat simple:
Open a transaction
Get the next sequence number to insert in the table
Verify that the next sequence number matches the expected one (if supplied)
Insert into the database
Here's my code:
public async append(
    data: any = {},
    expectedSeq?: number
): Promise<void> {
    let published_at = $date.create();
    try {
        await $mysql.transaction(async trx => {
            let max = await trx(this.table)
                .max({
                    seq: "seq",
                })
                .first();
            if (!max) {
                throw $error.InternalError(`unexpected mysql response`);
            }
            let next = (max.seq || 0) + 1;
            if (expectedSeq && expectedSeq !== next) {
                throw $error.ExpectationFailed(
                    `expected seq does not match current seq`
                );
            }
            await trx(this.table).insert({
                published_at,
                seq: next,
                data: $json.stringify(data),
            });
        });
    } catch (err) {
        if (err.code === "ER_DUP_ENTRY") {
            return this.append(data, expectedSeq);
        }
        throw err;
    }
}
My problem is that this is extremely slow, since there are race conditions between parallel requests appending to the same stream; on my laptop, inserts/second on one stream went from ~1k to ~75.
Any pointers/suggestions on how to optimize this logic?
CONCLUSION
After considering the comments, I decided to go with auto-increment and to reset auto_increment only if there's an error. It yields around the same writes/sec with expectedSeq, but a much higher rate if ordering is not required.
Here's the solution:
public async append(data: any = {}, expectedSeq?: number): Promise<Event> {
    if (!$validator.validate(data, this.schema)) {
        throw $error.ValidationFailed("validation failed for event data");
    }
    let published_at = $date.create();
    try {
        let seq = await $mysql.transaction(async _trx => {
            let result = (await _trx(this.table).insert({
                published_at,
                data: $json.stringify(data),
            })).shift();
            if (!result) {
                throw $error.InternalError(`unexpected mysql response`);
            }
            if (expectedSeq && expectedSeq !== result) {
                throw $error.ExpectationFailed(
                    `expected seq ${expectedSeq} but got ${result}`
                );
            }
            return result;
        });
        return eventFactory(this.topic, seq, published_at, data);
    } catch (err) {
        await $mysql.raw(`ALTER TABLE ${this.table} auto_increment = ${this.seqStart}`);
        throw err;
    }
}
Why does the web page need to provide the sequence number? That is just a recipe for messy code, perhaps even messier than what you sketched out. Simply let the auto_increment value be returned to the User.
INSERT ...;
SELECT LAST_INSERT_ID(); -- session-specific, so no need for transaction.
Return that value to the user.
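With the Node mysql driver, for example, that value comes back on the insert result itself (a sketch; the table and column names are assumed):

connection.query(
    'INSERT INTO events (published_at, data) VALUES (?, ?)',
    [new Date(), JSON.stringify(data)],
    function (err, result) {
        if (err) throw err;
        // LAST_INSERT_ID() for this connection; no extra round trip needed.
        console.log('assigned seq:', result.insertId);
    }
);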
Why not use Apache Kafka? It does all of this natively. With the easy answer out of the way: optimization is always tricky with partial information, but I think you've given us one hint that enables a suggestion. You said that without the order clause it performs much faster, which means that getting the max value is what is taking so long. That tells me a few things: first, this value is not the clustered index (which is good news); second, you probably do not have sufficient index support (also good news, since it's fixable by creating an index on this column, sorted descending). This sounds like a table with millions or billions of rows, and this particular column has no guaranteed order; without the right indexing, you could be doing a table scan between inserts just to get the max value.
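If that diagnosis fits, the fix is a one-time schema change (a sketch with assumed table/column names; once seq has a secondary index, MAX(seq) is read from the end of the index instead of scanning the table):

db.query('ALTER TABLE events ADD INDEX idx_seq (seq)', function (err) {
    if (err) throw err;
    // SELECT MAX(seq) FROM events now resolves from the index tail.
});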
Why not use a GUID for your primary key instead of an auto-incremented integer? Then your client could generate the key itself and would be able to insert it safely every time.
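A hedged sketch of that idea (crypto.randomUUID() ships with Node 14.17+; the table name is assumed):

const { randomUUID } = require('crypto');

// The client mints the key itself, so no MAX()/auto_increment coordination
// is needed before the insert, and a retry can reuse the same id.
const id = randomUUID();
db.query(
    'INSERT INTO events (id, published_at, data) VALUES (?, ?, ?)',
    [id, new Date(), JSON.stringify(data)],
    function (err) { if (err) throw err; }
);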
Batch inserts versus singleton inserts
Your latency/performance problem is due to a batch size of 1, as each send to the database requires multiple round trips to the RDBMS. Rather than inserting one row at a time, with a commit and verification after each row, rewrite your code to issue batches of 100 or 1000 rows, inserting n rows and verifying per batch rather than per row. If a batch insert fails, you can retry it one row at a time, as in the sketch below.
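A minimal sketch of that pattern (assumed table/column names, node mysql driver, which expands the nested array into a multi-row VALUES list):

// Insert rows in chunks of batchSize, verifying once per batch.
function insertBatched(db, rows, batchSize, done) {
    var i = 0;
    (function next(err) {
        if (err || i >= rows.length) return done(err);
        var batch = rows.slice(i, i + batchSize);
        i += batchSize;
        // On error, the caller can re-run the failed batch row by row
        // to isolate the offending row.
        db.query('INSERT INTO events (seq, data) VALUES ?', [batch], next);
    })();
}

// allRows: array of [seq, data] row arrays.
insertBatched(db, allRows, 1000, function (err) { if (err) throw err; });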

SQL Deadlock with Python Data Insert

I'm currently trying to build a database interface with Python to store stock data. The data takes the form of a list of tuples, each element consisting of date, ticker symbol, open, high, low, close, and volume. date represents a UNIX timestamp and has to be unique in combination with the ticker symbol in the database. Below is an example of a typically processed output (company_stock):
[(1489780560, 'NYSE:F', 12.5, 12.505, 12.49, 12.495, 567726),
(1489780620, 'NYSE:F', 12.495, 12.5, 12.48, 12.48, 832487),
(1489780680, 'NYSE:F', 12.485, 12.49, 12.47, 12.475, 649818),
(1489780740, 'NYSE:F', 12.475, 12.48, 12.47, 12.47, 700579),
(1489780800, 'NYSE:F', 12.47, 12.48, 12.47, 12.48, 567798)]
I'm using the pymysql package to insert this list into a local MySQL database (version 5.5). While the code runs through and the values get inserted, the database will crash - or rather stop - after reaching about ~250k rows. Below is the export part of the stock data processing function, which gets called about once every 20 seconds and inserts about 400 values.
# SQL Export
def tosql(company_stock, ticker, interval, amount_period, period):
    try:
        conn = pymysql.connect(host="localhost", user="root",
                               passwd="pw", db="db", charset="utf8",
                               autocommit=True,
                               cursorclass=pymysql.cursors.DictCursor)
        cur = conn.cursor()

        # To temp table
        query = "INSERT INTO stockdata_import "
        query += "(date, tickersymbol, open, high, low, close, volume) "
        query += "VALUES (%s, %s, %s, %s, %s, %s, %s)"
        cur.executemany(query, company_stock)

        # Duplicate check with temp table and existing database storage
        query = "INSERT INTO stockdata (date, tickersymbol, open, high, low, close, volume) "
        query += "SELECT i.date, i.tickersymbol, i.open, i.high, i.low, "
        query += "i.close, i.volume FROM stockdata_import i "
        query += "WHERE NOT EXISTS(SELECT dv.date, dv.tickersymbol FROM "
        query += "stockdata dv WHERE dv.date = i.date "
        query += "AND dv.tickersymbol = i.tickersymbol)"
        cur.execute(query)

        print(": ".join([datetime.now().strftime("%d.%m.%Y %H:%M:%S"),
                         "Data stored in Vault. Ticker", str(ticker),
                         "Interval", str(interval),
                         "Last", str(amount_period), str(period)]))
    finally:
        # Clear temp import table and close connection
        query = "DELETE FROM stockdata_import"
        cur.execute(query)
        cur.close()
        conn.close()
I suspect that the check for already existing values takes too long as the database grows, and that the process eventually breaks down due to table locks (?) while checking for the uniqueness of the date/ticker combination. Since I expect this database to grow rather fast (about 1 million rows per week), it seems that a different solution is required to ensure that there is only one row per date/ticker pair. This is the SQL CREATE statement for the import table (the real table it gets compared against looks the same):
CREATE TABLE stockdata_import (
    id_stock_imp BIGINT(12) NOT NULL AUTO_INCREMENT,
    date INT(10),
    tickersymbol VARCHAR(16),
    open FLOAT(12,4),
    high FLOAT(12,4),
    low FLOAT(12,4),
    close FLOAT(12,4),
    volume INT(12),
    crawled_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id_stock_imp)
);
I have already looked into setting a constraint on the date/tickersymbol pair and handling the resulting exceptions in Python, but my research so far suggests that this would be even slower, plus I am not even sure whether it would work with the bulk insert of the pymysql cursor function executemany(query, data).
Context information:
The SQL export shown above is the final part of a python script handling the stock data response. This script, in turn, gets called by another script which is timed by a crontab to run at a specific time each day.
Once the crontab starts the control script, this will call the subscript about 500 times with a sleep time of about 20-25 seconds between each run.
The error which I see in the logs is: ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction
Questions:
How can I optimize the query or alter the storage table to ensure uniqueness for a given date/ticker combination?
Is this even the problem or do I fail to see some other problem here?
Any further advice is also welcome.
If you would like to ensure the uniqueness of your data, just add a unique index on the relevant date and ticker fields. A unique index prevents duplicate values from being inserted, so there is no need to check for the existence of data before insertion.
Since you do not want to insert duplicate data, just use INSERT IGNORE instead of a plain INSERT to suppress duplicate-key errors. Based on the number of affected rows, you can still detect and log duplicate insertions.
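The idea is driver-agnostic; here is a hedged sketch in the Node mysql style used elsewhere on this page (the pymysql version is analogous, and the index name is assumed):

// One-time schema change: the index itself enforces the date/ticker pair.
// db.query('ALTER TABLE stockdata ADD UNIQUE KEY uq_date_ticker (date, tickersymbol)');

db.query(
    'INSERT IGNORE INTO stockdata ' +
    '(date, tickersymbol, open, high, low, close, volume) VALUES ?',
    [companyStock], // array of [date, ticker, open, high, low, close, volume] rows
    function (err, result) {
        if (err) throw err;
        var skipped = companyStock.length - result.affectedRows;
        if (skipped > 0) console.log(skipped + ' duplicate rows ignored');
    }
);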

Sqljocky, transactioned query with multiple sets of parameters

I'm using sqljocky to insert data into a MySQL database. I need to truncate a table first, then insert multiple rows into it. I would do this in a single transaction, but it seems that sqljocky doesn't support this at all right now (or maybe I'm just quite new to Dart and sqljocky).
The solution I found is the following, but I was wondering if there's a better one.
// Start transaction
pool.query('START TRANSACTION').then((r) {
  // Truncate table
  pool.query('TRUNCATE myTable').then((r) {
    // Prepare statement to insert new data
    pool.prepare('REPLACE INTO myTable (Id, Name, Description) VALUES (?,?,?)').then((query) {
      // Execute query inserting multiple rows
      query.executeMulti(myArrayValues).then((results) {
        // Other stuff here
        pool.query('COMMIT').then((r) {
          ...
To be honest, I'm still wondering if this code really executes a transactioned query!
Here's the same code rewritten with transaction support:
// Start transaction
pool.startTransaction().then((trans) {
  // Delete all from table
  trans.query('DELETE FROM myTable WHERE 1=1').then((r) {
    // Prepare statement
    trans.prepare('REPLACE INTO myTable (Id, Name, Description) VALUES (?,?,?)').then((query) {
      // Execute query inserting multiple rows
      query.executeMulti(myArrayValues).then((results) {
        // Stuff here
        // Commit
        trans.commit().then((r) { ...
Use ConnectionPool.startTransaction(). From its documentation:
You must use this method rather than `query('start transaction')`, otherwise subsequent queries may get executed on other connections which are not in the transaction.
Haven't tried it myself yet.

node-mysql: REPLACE into statement incrementing a current table value

I am using the MySQL driver for Node.js ("node-mysql") but struggling with the syntax to do a REPLACE INTO and increment a counter. Here's what I have so far; note that the count field is the one where I want to increment the current table value by 1:
connection.query('REPLACE INTO links SET ?', { symbol: statName, timestamp: now, count: count + 1 }, function(err, result) {
Can anyone advise on how to do this?
I found the solution in using INSERT with ON DUPLICATE KEY UPDATE:
connection.query(
    'INSERT INTO links SET symbol = ?, timestamp = ?, title = ?, url = ? ' +
    'ON DUPLICATE KEY UPDATE count = count + 1',
    [statName, now, interaction.data.links.title[y], interaction.data.links.url[y]],
    function (err, results) {
        if (err) console.log(err);
    }
);
Please keep in mind that REPLACE is similar to INSERT, not UPDATE.
Under the hood, REPLACE is mechanically a DELETE followed by an INSERT. It says so in the MySQL Documentation as follows:
REPLACE works exactly like INSERT, except that if an old row in the table has the same value as a new row for a PRIMARY KEY or a UNIQUE index, the old row is deleted before the new row is inserted. See Section 13.2.5, “INSERT Syntax”.
REPLACE is a MySQL extension to the SQL standard. It either inserts, or deletes and inserts. For another MySQL extension to standard SQL—that either inserts or updates—see Section 13.2.5.3, “INSERT ... ON DUPLICATE KEY UPDATE Syntax”.
If you want to increment the count column, use INSERT INTO ... ON DUPLICATE KEY UPDATE.
Please make sure the JavaScript engine can work with INSERT INTO ... ON DUPLICATE KEY UPDATE. If not, you must capture the current count as newcount, increment newcount, and feed the number into the REPLACE:
connection.query('REPLACE INTO links SET ?', { symbol: statName, timestamp: now, count: newcount }, function(err, result) {
Give it a try!