Bulk MySQL inserts in Go are 2x slower than in PHP

I've been testing Go in hopes of using it for a new site, and wanted to make sure it was as fast as or faster than PHP. So I ran a basic test doing bulk inserts in Go and PHP, because I'll need bulk inserts.
My tests used transactions, prepared statements, the same machine, the exact same table definition, no index other than the PK, and the same logic in the function.
Results:
100k inserts in PHP (mysqli): 4.42 seconds
100k inserts in Go (Go-MySQL-Driver): 9.2 seconds
The Go MySQL driver I'm using is the most popular one, Go-MySQL-Driver, found here: https://github.com/go-sql-driver/mysql
I'm wondering if anyone can tell me whether my Go code is not set up right, or if this is just how Go is.
The functions add a bit of variability to a few of the row variables just so every row isn't the same.
Go Function:
func fill_table(w http.ResponseWriter, r *http.Request, result_string *string, num_entries_to_add int) {
    defer recover_show_error(result_string)
    db := getDBConn()
    defer db.Close()

    var int_a int = 9
    var int_b int = 4
    var int_01 int = 1
    var int_02 int = 1451628000 // Date Entered (2016-1-1, 1am)
    var int_03 int = 11
    var int_04 int = 0
    var int_05 int = 0
    var float_01 float32 = 90.0 // Value
    var float_02 float32 = 0
    var float_03 float32 = 0
    var text_01 string = ""
    var text_02 string = ""
    var text_03 string = ""

    start_time := time.Now()

    tx, err := db.Begin()
    if err != nil {
        panic(err)
    }

    stmt, err := tx.Prepare("INSERT INTO " + TABLE_NAME +
        "(`int_a`,`int_b`,`int_01`,`int_02`,`int_03`,`int_04`,`int_05`,`float_01`,`float_02`,`float_03`,`text_01`,`text_02`,`text_03`) " +
        "VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?)")
    if err != nil {
        panic(err)
    }
    defer stmt.Close()

    var flip int = 0
    for i := 0; i < num_entries_to_add; i++ {
        flip = ((int)(i / 500)) % 2
        if flip == 0 {
            float_01 += .1 // add to Value
        } else {
            float_01 -= .1 // sub from Value
        }
        int_02 += 1 // add a second to date.

        _, err = stmt.Exec(int_a, int_b, int_01, int_02, int_03, int_04, int_05, float_01, float_02, float_03, text_01, text_02, text_03)
        if err != nil {
            panic(err)
        }
    }

    err = tx.Commit()
    if err != nil {
        panic(err)
    }

    elapsed := time.Since(start_time)
    *result_string += fmt.Sprintf("Fill Table Time = %s</br>\n", elapsed)
}
PHP Function:
function FillTable($num_entries_to_add) {
    $mysqli = new mysqli("localhost", $GLOBALS['db_username'], $GLOBALS['db_userpass'], $GLOBALS['database_name']);
    if ($mysqli->connect_errno == 0) {
        $int_a = 9;
        $int_b = 4;
        $int_01 = 1;
        $int_02 = 1451628000; // Date Entered (2016-1-1, 1am)
        $int_03 = 11;
        $int_04 = 0;
        $int_05 = 0;
        $float_01 = 90.0; // Value
        $float_02 = 0;
        $float_03 = 0;
        $text_01 = "";
        $text_02 = "";
        $text_03 = "";

        $mysqli->autocommit(FALSE); // This starts transaction mode. It ends when you call $mysqli->commit();

        $sql = "INSERT INTO " . $GLOBALS['table_name'] .
            "(`int_a`,`int_b`,`int_01`,`int_02`,`int_03`,`int_04`,`int_05`,`float_01`,`float_02`,`float_03`,`text_01`,`text_02`,`text_03`) " .
            "VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?)";

        $start_time = microtime(true);

        if ($stmt = $mysqli->prepare($sql)) {
            $stmt->bind_param('iiiiiiidddsss', $int_a, $int_b, $int_01, $int_02, $int_03, $int_04, $int_05, $float_01, $float_02, $float_03, $text_01, $text_02, $text_03);

            $flip = 0;
            for ($i = 1; $i <= $num_entries_to_add; $i++) {
                $flip = ((int)($i / 500)) % 2;
                if ($flip == 0) {
                    $float_01 += .1; // add Value
                } else {
                    $float_01 -= .1; // sub Value
                }
                $int_02 += 1; // add a second to date.

                $stmt->execute(); // Executes the prepared insert
            }
            $mysqli->commit(); // Transaction mode ends now
            $stmt->close();    // Close statement
        }

        $execute_time = microtime(true) - $start_time;
        echo $GLOBALS['html_newline'] . $GLOBALS['html_newline'] .
            'FillDataEntryTable Speed: ' . $execute_time . ' sec' . $GLOBALS['html_newline'] . $GLOBALS['html_newline'];

        $thread_id = $mysqli->thread_id; // Get MySQL thread ID
        $mysqli->kill($thread_id);       // Kill MySQL Server connection
        $mysqli->close();                // Close MySQL Server connection
    }
}

In my testing to find which language I want to use for my new website, I experimented with PHP, Go, and Java. I don't have much experience with any of these languages, so anything I say here may be corrected by someone in the future.
My main test was batch inserts into the MySQL database, because I'll need them for an app.
I wanted to move away from PHP because it's an old, non-compiled scripting language that is slower at many things than Go and Java, and its syntax is awkward for many things. However, PHP's mysqli is actually 2x faster than Go for large "transactions", unless you awkwardly spawn many goroutines to divide the work up, roughly as sketched below.
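For reference, here is roughly what I mean by dividing the work up across goroutines. This is only a sketch, not the exact code I benchmarked: the function name is made up, it reuses the TABLE_NAME constant from my question, the column list is shortened to three, and error handling is reduced to panics like in the question.

import (
    "database/sql"
    "sync"
)

// fillTableParallel splits numEntries inserts across `workers` goroutines,
// each running its own transaction and prepared statement.
// Any remainder from numEntries/workers is ignored in this sketch.
func fillTableParallel(db *sql.DB, numEntries, workers int) {
    var wg sync.WaitGroup
    chunk := numEntries / workers
    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            tx, err := db.Begin()
            if err != nil {
                panic(err)
            }
            stmt, err := tx.Prepare("INSERT INTO " + TABLE_NAME + " (`int_a`,`int_b`,`int_01`) VALUES (?,?,?)")
            if err != nil {
                panic(err)
            }
            for i := 0; i < chunk; i++ {
                if _, err := stmt.Exec(9, 4, i); err != nil {
                    panic(err)
                }
            }
            stmt.Close()
            if err := tx.Commit(); err != nil {
                panic(err)
            }
        }()
    }
    wg.Wait()
}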
During my testing and research I found out a few things.
The PHP mysqli "transactions" API is probably using some kind of batch operation under the hood to get a "transaction" done, because mysqli has no separate batch functions and yet its transactions are quicker than single inserts. In most other languages, though, transactions don't auto-batch anything and don't improve execution time at all. They are just a mechanism to roll back everything in the transaction if something goes wrong, as in the sketch after this paragraph. What does improve execution time in other languages is batching.
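To illustrate the rollback point, here is a minimal Go sketch; the table and column names are placeholders, not my real schema.

// insertAllOrNothing uses a transaction purely as a rollback mechanism:
// if any insert fails, every row inserted so far in the transaction is undone.
func insertAllOrNothing(db *sql.DB, values []int) error {
    tx, err := db.Begin()
    if err != nil {
        return err
    }
    for _, v := range values {
        if _, err := tx.Exec("INSERT INTO my_table (`int_01`) VALUES (?)", v); err != nil {
            tx.Rollback() // undo everything inserted in this transaction
            return err
        }
    }
    return tx.Commit()
}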
But one of the big problems with the Go MySQL interface right now appears to be the lack of real support for batch operations. The closest I got was to jerry-rig my own batch operation, as pointed out by this post (golang - mysql Insert multiple data at once?). Doing this, I was able to get the execution time in Go from 9.2 s down to 3.9 s without spawning other goroutines. But since there's no real support for it, the batch operation only returns a single result set for the first operation of the batch. That's worthless to me because I need the auto-increment IDs for my inserted rows. There were other problems with this setup too that I won't go into. A rough sketch of the approach follows.
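Roughly, the jerry-rigged batch looks like the sketch below. Again, this is not my exact code: the helper name is made up, the column list is cut down to three, and it assumes the "strings" and "database/sql" packages are imported.

// insertBatch builds one INSERT with many (?,?,?) groups so the driver still
// handles escaping, then executes everything in a single round trip.
func insertBatch(db *sql.DB, rows [][3]interface{}) error {
    placeholders := make([]string, 0, len(rows))
    args := make([]interface{}, 0, len(rows)*3)
    for _, r := range rows {
        placeholders = append(placeholders, "(?,?,?)")
        args = append(args, r[0], r[1], r[2])
    }
    query := "INSERT INTO " + TABLE_NAME + " (`int_a`,`int_b`,`int_01`) VALUES " + strings.Join(placeholders, ",")
    res, err := db.Exec(query, args...)
    if err != nil {
        return err
    }
    // The limitation mentioned above: only the first auto-increment ID of the
    // whole batch is reported back, not one ID per inserted row.
    _, err = res.LastInsertId()
    return err
}

In practice the rows also have to be chunked so a single statement stays under MySQL's packet and placeholder limits.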
So lastly I tried Java on a Tomcat server. The Tomcat/Java installation is a bit more involved than Go's, but programming in Java was much easier and more natural. JDBC is an excellent driver with full support for easy batch operations with prepared statements; it did the 100k inserts in only 1 second. It's the clear winner in my book. Plus, Java's syntax is much more natural than Go's, IMO.

Related

UPDATE query returns "Error: Commands out of sync; you can't run this command now"

I get this annoying error when I try to run an UPDATE query in C code.
I can't figure out why I get it; I've searched and mainly found solutions for SELECT and for web code.
Here is my simplified code loop:
MYSQL *conn;
MYSQL_RES *res;
CHAR sql[2048], sqltmp[200], stmp[200];
INT Cellindex = -1;
CELL celltab[MAXCELLS]; // array of struct Cell's with data

memcpy((CHAR *) celltab, argv, noOfCells * sizeof (CELL)); // from in-parameter argv

for (Cellindex = 0; Cellindex < MAXSAMPLES; Cellindex++) {
    sprintf(sql, "UPDATE celltab SET name=\'%s\'", (strlen(name) ? (name) : "\0\n"));
    sprintf(sqltmp, ", sect = %d", (celltab[Cellindex].sect ? (celltab[Cellindex].sect) : 0));
    strcat(sql, sqltmp);
    sprintf(sqltmp, ", ptyp = %d", (celltab[Cellindex].ptyp ? (celltab[Cellindex].ptyp) : 0));
    strcat(sql, sqltmp);
    sprintf(sqltmp, " WHERE celltab_id = %d", (Cellindex + 1)); // no ; needed when running the query from C code
    strcat(sql, sqltmp);

    if (conn != NULL) { // initiated earlier in code
        RowsAffected = mysql_query(conn, sql); // run the query
        WT("%s", mysql_error(conn));           // print any errors
        mysql_fetch_row(res);                  // fetch any result
        mysql_free_result(res);                // drop it and go for the next query
    }
}
This error means that you have outstanding results. Maybe some earlier statement returned more rows than you processed. Before you can proceed, you must process all the results. An UPDATE doesn't return a result set at all, so the problem is probably somewhere else in your code.
What I do in my code is this:
while (mysql_more_results(m_connection))
    mysql_next_result(m_connection);

MYSQL_RES *result = mysql_store_result(m_connection);
mysql_free_result(result);
But of course, you may need to check whether this is what you want for your particular use case, or whether you have to do something with the results instead of discarding them.

Go: trying to improve insert speed with MySQL

Hi, I need to upload an enormous amount of small text entries into MySQL.
Unfortunately there is no bulk-insert operation available to me here, so I am trying to use goroutines to parallelize the transactions.
The problem is that all this concurrency and racing stuff drives me a bit crazy, and I am not sure whether what I've come up with is any good.
Simplified, the code looks like this: the huge file is scanned line by line, and lines are appended to a slice; when the slice reaches 1000 lines, a transaction is fired off.
sem := make(chan int, 10) // transactions pool
sem2 := make(chan int)    // auxiliary blocking semaphore

for scanner.Scan() {
    line := scanner.Text()
    lines = append(lines, line)
    if len(lines) > 1000 {
        sem <- 1 // keep max 10 transactions
        go func(mylines ...lineT) {
            // I use variadic args to avoid issues with pointers;
            // I want to pass the data by value.
            <-sem2             // all lines of the slice are copied, release the lock
            gopher(mylines...) // gopher does the transaction by iterating over
                               // every line. And here I may use a slice, I think.
            <-sem              // after the transaction is done, release the lock
        }(lines...)
        sem2 <- 1 // this is to ensure that the slice is reset only after the
                  // values are copied to the func; otherwise lines could be
                  // nil before the goroutine fired.
        lines = nil // reset slice
    }
}
How can I solve this better?
I know I could do a bulk import via the MySQL utilities, but that is not possible here. I also can't just build an INSERT with many values like VALUES ("1", "2"), ("3", "4") by string concatenation, because the values aren't properly escaped and I just get errors.
The approach below looks a tad weird, but not as weird as my first approach:
func gopher2(lines []lineT) {
    q := "INSERT INTO main_data(text) VALUES "
    var aur []string
    var inter []interface{}
    for _, l := range lines {
        aur = append(aur, "(?)")
        inter = append(inter, l)
    }
    q = q + strings.Join(aur, ", ")
    if _, err := db.Exec(q, inter...); err != nil {
        log.Println(err)
    }
}
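One possible way to drive gopher2 in fixed-size chunks is sketched below. This is only a sketch: it assumes lineT is a string type, that scanner and db are set up as in the question, that "sync" is imported, and the flush helper is made up.

sem := make(chan struct{}, 10) // cap concurrent transactions at 10
var wg sync.WaitGroup
var lines []lineT

flush := func(chunk []lineT) {
    sem <- struct{}{}
    wg.Add(1)
    go func() {
        defer wg.Done()
        defer func() { <-sem }()
        gopher2(chunk)
    }()
}

for scanner.Scan() {
    lines = append(lines, lineT(scanner.Text()))
    if len(lines) >= 1000 {
        flush(lines)
        lines = nil // append allocates a new backing array afterwards, so the
                    // goroutine's chunk is untouched and no sem2 handshake is needed
    }
}
if len(lines) > 0 {
    flush(lines) // flush the final partial chunk
}
wg.Wait()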

Big database - doctrine query slow even with index

I'm building an app with Symfony 4 + Doctrine, where people can upload big CSV files and those records then get stored in a database. Before inserting, I'm checking that the entry doesn't already exist...
On a sample CSV file with only 1000 records, it takes 16 seconds without an index and 8 seconds with an index (MacBook, 3 GHz, 16 GB memory). My intuition tells me this is quite slow and should take under 1 second, especially with the index.
The index is set on the email column.
My code:
$ssList = $this->em->getRepository(EmailList::class)->findOneBy(["id" => 1]);

foreach ($csv as $record) {
    $subscriber_exists = $this->em->getRepository(Subscriber::class)
        ->findOneByEmail($record['email']);
    if ($subscriber_exists === NULL) {
        $subscriber = (new Subscriber())
            ->setEmail($record['email'])
            ->setFirstname($record['first_name'])
            ->addEmailList($ssList)
        ;
        $this->em->persist($subscriber);
        $this->em->flush();
    }
}
My Question:
How can I speed up this process?
Use LOAD DATA INFILE.
LOAD DATA INFILE has IGNORE and REPLACE options for handling duplicates if you put a UNIQUE KEY or PRIMARY KEY on your email column.
Look at the MySQL settings for making the import faster.
As Cid said, move the flush() outside of the loop, or put a batch counter inside the loop and flush only at certain intervals:
$batchSize = 1000;
$i = 1;

foreach ($csv as $record) {
    $subscriber_exists = $this->em->getRepository(Subscriber::class)
        ->findOneByEmail($record['email']);
    if ($subscriber_exists === NULL) {
        $subscriber = (new Subscriber())
            ->setEmail($record['email'])
            ->setFirstname($record['first_name'])
            ->addEmailList($ssList)
        ;
        $this->em->persist($subscriber);
        if (($i % $batchSize) === 0) {
            $this->em->flush();
        }
        $i++;
    }
}
$this->em->flush();
Or if that's still slow, you could grab the Connection $this->em->getConnection() and use DBAL as stated here: https://www.doctrine-project.org/projects/doctrine-dbal/en/2.8/reference/data-retrieval-and-manipulation.html#insert

ffmpeg azure function consumption plan low CPU availability for high volume requests

I am running an Azure queue function on a consumption plan; my function starts an FFmpeg process and accordingly is very CPU intensive. When I run the function with fewer than 100 items in the queue at once it works perfectly: Azure scales up, gives me plenty of servers, and all of the tasks complete very quickly. My problem is that once I start doing more than 300 or 400 items at once, it starts fine, but after a while the CPU slowly goes from 80% utilisation to only around 10% utilisation, and my functions can't finish in time with only 10% CPU. This can be seen in the image shown below.
Does anyone know why the CPU usage goes lower the more instances my function creates? Thanks in advance, Cuan.
Edit: the function is set to run only one item at a time per instance, but the problem persists when set to 2 or 3 concurrent processes per instance in host.json.
Edit: the CPU drops become noticeable at 15-20 servers and start causing failures at around 60. After that the CPU bottoms out at an average of 8-10%, with individual instances reaching 0-3%, and the server count seems to increase without limit (which would be more helpful if I got some CPU with the servers).
Thanks again, Cuan.
I've also added the function code to the bottom of this post in case it helps.
using System.Net;
using System;
using System.Diagnostics;
using System.ComponentModel;

public static void Run(string myQueueItem, TraceWriter log)
{
    log.Info($"C# Queue trigger function processed a request: {myQueueItem}");

    //Basic Parameters
    string ffmpegFile = @"D:\home\site\wwwroot\CommonResources\ffmpeg.exe";
    string outputpath = @"D:\home\site\wwwroot\queue-ffmpeg-test\output\";
    string reloutputpath = "output/";
    string relinputpath = "input/";
    string outputfile = "video2.mp4";
    string dir = @"D:\home\site\wwwroot\queue-ffmpeg-test\";

    //Special Parameters
    string videoFile = "1 minute basic.mp4";
    string sub = "1 minute sub.ass";

    //guid tmp files
    // Guid g1 = Guid.NewGuid();
    // Guid g2 = Guid.NewGuid();
    // string f1 = g1 + ".mp4";
    // string f2 = g2 + ".ass";
    string f1 = videoFile;
    string f2 = sub;

    //guid output - we will now do this at the caller level
    string g3 = myQueueItem;
    string outputGuid = g3 + ".mp4";

    //get input files
    //argument
    string tmp = subArg(f1, f2, outputGuid);
    //String.Format("-i \"" + @"input/tmp.mp4" + "\" -vf \"ass = '" + sub + "'\" \"" + reloutputpath + outputfile + "\" -y");
    log.Info("ffmpeg argument is: " + tmp);

    //startprocess parameters
    Process process = new Process();
    process.StartInfo.FileName = ffmpegFile;
    process.StartInfo.Arguments = tmp;
    process.StartInfo.UseShellExecute = false;
    process.StartInfo.RedirectStandardOutput = true;
    process.StartInfo.RedirectStandardError = true;
    process.StartInfo.WorkingDirectory = dir;

    //output handler
    process.OutputDataReceived += new DataReceivedEventHandler(
        (s, e) =>
        {
            log.Info("O: " + e.Data);
        }
    );
    process.ErrorDataReceived += new DataReceivedEventHandler(
        (s, e) =>
        {
            log.Info("E: " + e.Data);
        }
    );

    //start process
    process.Start();
    log.Info("process started");
    process.BeginOutputReadLine();
    process.BeginErrorReadLine();
    process.WaitForExit();
}

public static void getFile(string link, string fileName, string dir, string relInputPath)
{
    using (var client = new WebClient())
    {
        client.DownloadFile(link, dir + relInputPath + fileName);
    }
}

public static string subArg(string input1, string input2, string output1)
{
    return String.Format("-i \"" + @"input/" + input1 + "\" -vf \"ass = '" + @"input/" + input2 + "'\" \"" + @"output/" + output1 + "\" -y");
}
When you use the D:\home directory you are writing to storage that is shared across the function app's instances, which means each instance is continually trying to write to the same spot while the functions run, and that causes the massive I/O block. Writing to D:\local instead, and then sending the finished file somewhere else, solves that issue: rather than each instance constantly writing to a shared location, they write only when completed, and to a location designed to handle high throughput.
The easiest way I could find to manage the input and output after writing to D:\local was to hook the function up to an Azure storage container and handle the ins and outs that way. Doing so made the average CPU stay at 90-100% for upwards of 70 concurrent instances.

MySQL C Connector

I have a C process that is rapidly writing to a MySQL database, roughly 10 times per second. This process uses the MySQL C Connector.
After about 2 minutes of running, the process hangs and the system monitor shows "futex_wait_queue_me". Also, "Can't initialized threads: error 11" is printed to the console, I assume by the C connector library (since I do not print this). Following that, connections to MySQL fail with "MySQL server has gone away".
What could be causing this? I am only writing from one thread.
FYI, I am using the library as shown below. The mutex lock and unlock are there for the future, as I will be multithreading the logging. The logging events in the actual app will be much less frequent, but I am trying to stress it as much as possible in this particular test.
//pseudocode:
while(1)
    mutexlock
    connect();
    mysql_query();
    disconnect();
    sleep(100ms);
    mutexunlock

A better solution, maybe not the best:

connect();
while(1)
    mutexlock
    if error on mysql_query()
        disconnect();
        connect();
    sleep(100ms);
    mutexunlock
//connect/disconnect functions
int DBConnector::connect()
{
    if(DBConnector::m_isConnected) return 0; // already connected...

    if(!mutexInitialized)
    {
        pthread_mutex_init(&DBLock, 0);
    }

    if(mysql_library_init(0, NULL, NULL))
    {
        LoggingUtil::logError("DBConnector.DB_connect [DB library init error] " + string(mysql_error(&DBConnector::m_SQLHandle)));
        DBConnector::m_isConnected = false;
        return -1;
    }

    if((mysql_init(&m_SQLHandle)) == NULL)
    {
        LoggingUtil::logError("DBConnector.DB_connect [DB mysql init error] " + string(mysql_error(&DBConnector::m_SQLHandle)));
        DBConnector::m_isConnected = false;
        return -1;
    }

    if((mysql_real_connect(&DBConnector::m_SQLHandle, host.c_str(), user.c_str(), pw.c_str(), db.c_str(), port, socket.c_str(), client_flags)) == NULL)
    {
        LoggingUtil::logError("DBConnector.DB_connect [DB Connect error] " + string(mysql_error(&DBConnector::m_SQLHandle)));
        DBConnector::m_isConnected = false;
        return -1;
    }

    DBConnector::m_isConnected = true;
    return 0;
}

int DBConnector::disconnect()
{
    DBConnector::m_isConnected = false;
    mysql_close(&DBConnector::m_SQLHandle);
    mysql_library_end();
    return 0;
}
Try not to call
mysql_library_init(0, NULL, NULL);
and
mysql_library_end();
on each connection attempt.
Also, your second idea of not reconnecting on every MySQL access is much better, since establishing a connection always costs some time and resources, for nothing in your case.
After a query has failed, you don't need to reconnect to the database.