MySQL sequential inserts are slow while threaded inserts are fast - why?

I found that sequentially inserting data into my database is very slow compared to a multi-threaded solution where both insert the same number of rows.
Inserting 50000 rows took ~4 mins in my sequential approach and only ~10 seconds with the parallel version.
I use the https://github.com/go-sql-driver/mysql driver.
For the database, I just took the most recent version of XAMPP for Windows and used the MySQL database it ships with, in its standard config.
sequential version:
for i := 0; i < 50000; i++ {
    _, err2 := db.Exec("insert into testtable (num, text1, text2) values (?, ?, ?)", i, "txt1", "txt2")
    if err2 != nil {
        fmt.Println(err2)
    }
}
Parallel version:
for i := 0; i < 50; i++ {
    wg.Add(1)
    go func() {
        defer wg.Done()
        for j := 0; j < 1000; j++ {
            _, err2 := db.Exec("insert into testtable (num, text1, text2) values (?, ?, ?)", 1, "txt1", "txt2")
            if err2 != nil {
                fmt.Println(err2)
            }
        }
    }()
}
Why is the first one so slow compared to the second version?
Any ideas? Am I maybe using the wrong function for inserting data?

There is a lot of overhead in running an INSERT:
Communication between client and server.
Parse the INSERT
Open the table, etc.
Get next AUTO_INCREMENT value.
Check for conflicts, deadlocks, etc.
Commit the transaction.
And all of that is done on a single CPU, waiting for I/O if necessary.
You have 50 threads; they ran 24 times as fast.
But you can do about 10 times better than that: batch the rows into a single INSERT, 100 rows at a time. This eliminates much of the overhead, especially the commit. (Going past 100-1000 rows per INSERT gets into diminishing returns and other overheads, so stop there.)
Meanwhile, don't use more threads than, say, twice the number of CPU cores you have. Otherwise, they will just stumble over each other. That may be why 50 threads was only 24 times as fast.
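To make the batching concrete, here is a rough sketch that reuses the testtable schema from the question; the helper function name is mine, and the batch size of 100 is just the starting point suggested above:
// batchInsert is a sketch of the multi-row INSERT approach: one Exec carries
// 100 rows instead of one. It assumes the testtable(num, text1, text2) schema
// from the question and needs "database/sql", "fmt" and "strings" imported.
func batchInsert(db *sql.DB) {
    const total, batch = 50000, 100
    // "(?, ?, ?),(?, ?, ?),..." repeated batch times, with the trailing comma removed
    placeholders := strings.TrimSuffix(strings.Repeat("(?, ?, ?),", batch), ",")
    query := "insert into testtable (num, text1, text2) values " + placeholders

    args := make([]interface{}, 0, batch*3)
    for i := 0; i < total; i += batch {
        args = args[:0]
        for j := i; j < i+batch; j++ {
            args = append(args, j, "txt1", "txt2")
        }
        if _, err := db.Exec(query, args...); err != nil {
            fmt.Println(err)
        }
    }
}
Each Exec now carries 100 rows, so the per-statement overhead (round trip, parse, commit) is paid 500 times instead of 50,000 times.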

Related

How to merge hundreds of CSVs, each with millions of rows, using minimal RAM?

I have terabytes of sensor data stored across separate CSV files, each with the columns timestamp and value. I need to merge these CSV files so that there is one column for the timestamp (calculated as the average of the timestamps from all input files for each row) and each sensor has its own column, where the column name comes from the file name. There are more than 500 sensors. Which tech stack should I look into? I can't fit all the data in RAM at once.
Example:
sensor1.csv
timestamp value
10000 1.9
10010 2.2
... (millions of rows)
sensor2.csv
timestamp value
10004 3.5
10012 4.3
... (500 more files)
Result should look like this (timestamp in this file is the average of all the timestamps from all 500+ input files, names of sensors, e.g. sensor1, sensor2 and etc., come from filenames):
merged_file.csv
timestamp sensor1 sensor2 ... sensor500
10002 1.9 3.5 2.1
10011 2.2 4.3 3.5
After the merge, I would like to store the data in a database, e.g. InfluxDB, for future analysis and model training. What tools would be best to perform the merge and analysis operations on this data?
I see your post as two distinct questions: 1) How to merge 500 CSVs? 2) whatever comes next, some DB?
This is a solution for the first question. I'm using Python; there are languages/runtimes that could do this faster, but I think Python will give you a good first start at the problem, and I expect it will be more accessible and easier for you to use.
Also, my solution is predicated on the assumption that all 500 CSVs have identical row counts.
My solution opens all 500 CSVs at once, creates an outer loop over a set number of rows, and an inner loop over each CSV:
The inner loop reads the timestamp and value for a row in each CSV, averaging the 500 timestamps into a single column, and accumulating the 500 distinct values in their own columns, and all that goes into a final merged row with 501 columns.
The outer loop repeats that process for as many rows as there are across all 500 CSVs.
I generated some sample data, 500 CSVs each with 1_000_000 rows, for 6.5G of CSVs. I ran the following script on my M1 Macbook Air. It completed in 8.3 minutes and peaked at 34.6M of RAM and produced a final CSV that is about 2G on disk.
import csv
import glob

# Fill this in based on your own knowledge, or, based on the output of 'analyze_stats.py'
NUM_ROWS = 1_000_000

sensor_filenames = sorted(glob.glob('sensor*.csv'))
# Sort: trim leading 6 chars, 'sensor', and trailing 4 chars, '.csv', leaving just the number in the middle
sensor_filenames = sorted(sensor_filenames, key=lambda x: int(x[6:-4]))

# Get handles to all files, and create input CSV readers
sensor_readers = []
for sensor_fname in sensor_filenames:
    f = open(sensor_fname, newline='')
    sensor_readers.append(csv.reader(f))

# Create output CSV writer
f = open('merged_sensors.csv', 'w', newline='')
writer = csv.writer(f)

# Discard all sensor headers
for reader in sensor_readers:
    next(reader)

# Build up output header and write
output_header = ['timestamp']
for sensor_fname in sensor_filenames:
    sensor_name = sensor_fname[:-4]  # trim off '.csv'
    output_header.append(sensor_name)
writer.writerow(output_header)

row_count = 0
while row_count < NUM_ROWS:
    row_count += 1
    values = []
    timestamps = []
    for reader in sensor_readers:
        row = next(reader)
        ts, val = row
        timestamps.append(int(ts))
        values.append(val)
    if row_count % 1000 == 0:
        print(f'Merged {row_count} rows')
    avg_ts = int(sum(timestamps)/len(timestamps))
    writer.writerow([avg_ts] + values)
I haven't profiled this, but I believe the only real allocations of memory that add up are going to be:
the 500 file handles and CSV readers (which is small) for the entirety of the process
each row from the input CSVs in the inner loop
the final merged row in the outer loop
At the top of the script I mention analyze_stats.py. Even if this were my data, I'd be very patient and break down the entire process into multiple steps, each of which I could verify, and I would ultimately arrive at the correct, final CSV. This is especially true for me trying to help you, because I don't control the data or "know" it like you do, so I'm going to offer up this bigger process:
Read all the CSVs and record some stats: headers, column counts, and especially row counts.
Analyze those stats for "conformance", making sure no CSV deviates from your idea of what it should be, and especially get confirmation that all 500 CSVs have the same number of columns and rows.
Use the proven row count as input into the merge process.
There are ways to write the merge script so it doesn't have to know "the row count" up front, but it's more code, it's slightly more confusing, and it won't help you if there ever is a problem... you probably don't want to find out on row 2 million that there was a problem "somewhere"; I know I hate it when that happens.
If you're new to Python or the CSV readers/writers, I recommend you read these scripts first.
get_sensor_stats.py: reads all your sensor data and records the header, the minimum and maximum number of columns seen, and the row count for each CSV; it writes all those stats out to a CSV (a rough sketch of this pass appears after the script list below)
analyze_stats.py: reads in the stats CSV and checks to make sure header and column counts meet pre-defined values; it also keeps a tally of all the row counts for each file, and if there are files with different row counts it will let you know
Also, here's the script I used to generate my 6G of sample data:
gen_sensor_data.py: is my attempt to meaningfully represent your problem space, both in size and complexity (which is very easy, thankfully 🙂)
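Those helper scripts aren't reproduced in this answer, but to make the stats-gathering pass concrete, here is a rough sketch of the same idea (written in Go to match the port below; the output format and the decision to print to stdout are my assumptions, not the original script's behaviour):
// Sketch of a stats-gathering pass over sensor*.csv: for each file, record the
// header, the minimum/maximum column count seen, and the row count, and print
// one CSV line of stats per file so they can be checked before the real merge.
package main

import (
    "encoding/csv"
    "fmt"
    "io"
    "os"
    "path/filepath"
    "strings"
)

func main() {
    filenames, err := filepath.Glob("sensor*.csv")
    if err != nil {
        panic(err)
    }
    fmt.Println("filename,header,min_cols,max_cols,row_count")
    for _, fname := range filenames {
        f, err := os.Open(fname)
        if err != nil {
            panic(err)
        }
        r := csv.NewReader(f)
        r.FieldsPerRecord = -1 // don't enforce a fixed column count; we want to report it
        header, err := r.Read()
        if err != nil {
            panic(err)
        }
        minCols, maxCols, rows := len(header), len(header), 0
        for {
            rec, err := r.Read()
            if err == io.EOF {
                break
            }
            if err != nil {
                panic(err)
            }
            rows++
            if len(rec) < minCols {
                minCols = len(rec)
            }
            if len(rec) > maxCols {
                maxCols = len(rec)
            }
        }
        f.Close()
        fmt.Printf("%s,%s,%d,%d,%d\n", fname, strings.Join(header, " "), minCols, maxCols, rows)
    }
}
Redirect its output to a file and run whatever conformance checks you want on it before the real merge.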
Same big idea as my Python solution, but in Go for a 2.5 minute speed-up and using 1/3 the RAM.
package main

import (
    "encoding/csv"
    "fmt"
    "io"
    "os"
    "path/filepath"
    "regexp"
    "sort"
    "strconv"
)

// Fill this in based on your own knowledge, or, based on the output of 'analyze_stats.py'
const NUM_ROWS = 1_000_000

// Sorting sensor filenames by a custom key, the sensor number
type BySensorNum []string

func (sn BySensorNum) Len() int      { return len(sn) }
func (sn BySensorNum) Swap(i, j int) { sn[i], sn[j] = sn[j], sn[i] }
func (sn BySensorNum) Less(i, j int) bool {
    re := regexp.MustCompile(`.*sensor(\d+).csv`)
    // Tease out sensor num from file "A"
    fnameA := sn[i]
    matchA := re.FindStringSubmatch(fnameA)
    numA := matchA[1]
    intA, errA := strconv.Atoi(numA)
    if errA != nil {
panic(fmt.Errorf("Could not parse number \"%s\" from file-name \"%s\" as int: %v\n", errA, numA, fnameA))
    }
    // Tease out sensor num from file "B"
    fnameB := sn[j]
    matchB := re.FindStringSubmatch(fnameB)
    numB := matchB[1]
    intB, errB := strconv.Atoi(numB)
    if errB != nil {
        panic(fmt.Errorf("%v: Could not parse number \"%s\" from file-name \"%s\" as int\n", errB, numB, fnameB))
    }
    // Compare sensor nums numerically
    return intA < intB
}

func main() {
    // filenames := []string{"../sensor1.csv", "../sensor2.csv", "../sensor3.csv", "../sensor4.csv", "../sensor5.csv", "../sensor6.csv", "../sensor7.csv", "../sensor8.csv", "../sensor9.csv", "../sensor10.csv"}
    filenames, err := filepath.Glob("sensor*.csv")
    if err != nil {
        panic(err) // only expect error if Glob pattern is bad itself (malformed)
    }
    fileCount := len(filenames)
    sort.Sort(BySensorNum(filenames))

    // Create output CSV writer
    outFname := "merged_sensors.csv"
    f, err := os.Create(outFname)
    if err != nil {
        panic(fmt.Errorf("Could not create \"%s\" for writing: %v", outFname, err))
    }
    defer f.Close()
    w := csv.NewWriter(f)

    // Get handles to all files, and create input CSV readers
    readers := make([]*csv.Reader, fileCount)
    for i, fname := range filenames {
        f, err := os.Open(fname)
        if err != nil {
            panic(fmt.Errorf("Could not open \"%s\": %v", fname, err))
        }
        defer f.Close()
        readers[i] = csv.NewReader(f)
    }

    // Discard all sensor headers
    for _, r := range readers {
        r.Read()
    }

    // With everything created or opened, start writing...

    // Build up output header and write
    header := make([]string, fileCount+1)
    header[0] = "timestamp"
    re := regexp.MustCompile(`.*(sensor\d+)\.csv`)
    for i, fname := range filenames {
        sensorName := re.FindStringSubmatch(fname)[1]
        header[i+1] = sensorName
    }
    w.Write(header)

    // "Shell" of final record with fileCount+1 columns, create once and use over-and-over again
    finalRecord := make([]string, fileCount+1)
    for i := 1; i <= NUM_ROWS; i++ {
        var tsSum int
        for j, r := range readers {
            record, err := r.Read()
            if err == io.EOF {
                break
            }
            if err != nil {
                panic(fmt.Errorf("Could not read record for row %d of file \"%s\": %v", i, filenames[j], err))
            }
            timestamp, err := strconv.Atoi(record[0])
            if err != nil {
                panic(fmt.Errorf("Could not parse timestamp \"%s\" as int in record `%v`, row %d, of file \"%s\": %v", record[0], record, i, filenames[j], err))
            }
            tsSum += timestamp
            finalRecord[j+1] = record[1]
        }
        // Add average timestamp to first cell/column
        finalRecord[0] = fmt.Sprintf("%.1f", float32(tsSum)/float32(fileCount))
        w.Write(finalRecord)
    }
    w.Flush()
}

go mysql LAST_INSERT_ID() returns 0

I have this MySQL database where I need to add records with a go program and need to retrieve the id of the last added record, to add the id to another table.
When I run INSERT INTO table1 values("test",1); SELECT LAST_INSERT_ID(); in MySQL Workbench, it returns the last id, which is auto-incremented, with no issues.
If I run my Go code, however, it always prints 0. The code:
_, err := db_client.DBClient.Query("insert into table1 values(?,?)", name, 1)

var id string
err = db_client.DBClient.QueryRow("SELECT LAST_INSERT_ID()").Scan(&id)
if err != nil {
    panic(err.Error())
}
fmt.Println("id: ", id)
I tried this variation to try to narrow down the problem scope further: err = db_client.DBClient.QueryRow("SELECT id from table1 where name=\"pleasejustwork\";").Scan(&id), which works perfectly fine; Go returns the actual id.
Why is it not working with LAST_INSERT_ID()?
I'm a newbie in Go, so please do not go hard on me if I'm making stupid Go mistakes that lead to this error :D
Thank you in advance.
The MySQL protocol returns the LAST_INSERT_ID() value in its response to an INSERT statement, and the Go driver exposes that returned value, so you don't need the extra round trip to get it. These ID values are usually unsigned 64-bit integers.
Try something like this.
res, err := db_client.DBClient.Exec("insert into table1 values(?,?)", name, 1)
if err != nil {
    panic(err.Error())
}
id, err := res.LastInsertId()
if err != nil {
    panic(err.Error())
}
fmt.Println("id: ", id)
I confess I'm not sure why your code didn't work. Whenever you successfully issue a single-row INSERT statement, the next statement on the same database connection always has access to a useful LAST_INSERT_ID() value. This is true whether or not you use explicit transactions.
But if your INSERT is not successful, you must treat the last insert ID value as unpredictable. (That's a technical term for "garbage", trash, rubbish, basura, etc.)
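If you do want the explicit SELECT LAST_INSERT_ID() round trip, keep the "same database connection" point in mind: database/sql hands out pooled connections, so a standalone follow-up query may run on a different connection than the INSERT. A minimal sketch (not the asker's code; it reuses the table1 insert from the question and needs "context" imported) that reserves a single connection for both statements:
// Sketch: run the INSERT and SELECT LAST_INSERT_ID() on the same pooled
// connection by reserving one with Conn(). Error handling is trimmed to panics
// to match the question's style.
ctx := context.Background()
conn, err := db_client.DBClient.Conn(ctx)
if err != nil {
    panic(err.Error())
}
defer conn.Close() // returns the connection to the pool

if _, err := conn.ExecContext(ctx, "insert into table1 values(?,?)", name, 1); err != nil {
    panic(err.Error())
}

var id uint64
if err := conn.QueryRowContext(ctx, "SELECT LAST_INSERT_ID()").Scan(&id); err != nil {
    panic(err.Error())
}
fmt.Println("id: ", id)
That said, res.LastInsertId() as shown above is the simpler option.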

How to avoid race conditions in GORM

I am developing a system to enable patient registration with an incremental queue number. I am using Go, GORM, and MySQL.
An issue happens when more than one patient registers at the same time: they tend to get the same queue number, which should not happen.
I attempted using transactions and hooks to achieve that, but I still got duplicate queue numbers. I have not found any resource about how to lock the database while a transaction is happening.
func (r repository) CreatePatient(pat *model.Patient) error {
    tx := r.db.Begin()
    defer func() {
        if r := recover(); r != nil {
            tx.Rollback()
        }
    }()

    err := tx.Error
    if err != nil {
        return err
    }

    // 1. get latest queue number and assign it to patient object
    var queueNum int64
    err = tx.Model(&model.Patient{}).Where("registration_id", pat.RegistrationID).Select("queue_number").Order("created_at desc").First(&queueNum).Error
    if err != nil && err != gorm.ErrRecordNotFound {
        tx.Rollback()
        return err
    }
    pat.QueueNumber = queueNum + 1

    // 2. write patient data into the db
    err = tx.Create(pat).Error
    if err != nil {
        tx.Rollback()
        return err
    }

    return tx.Commit().Error
}
As stated by #O. Jones, transactions don't save you here because you're extracting the largest value of a column, incrementing it outside the db and then saving that new value. From the database's point of view the updated value has no dependence on the queried value.
You could try doing the update in a single query, which would make the dependence obvious:
UPDATE patient AS p
JOIN (
SELECT max(queue_number) AS queue_number FROM patient WHERE registration_id = ?
) maxp
SET p.queue_number = maxp.queue_number + 1
WHERE id = ?
In gorm you can't run a complex update like this, so you'll need to make use of Exec.
I'm not 100% certain the above will work because I'm less familiar with MySQL transaction isolation guarantees.
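For example, sticking with GORM, something like this (a sketch; pat.RegistrationID and pat.ID are assumptions about where the two parameters come from):
// Sketch: run the single-statement update through GORM's Exec so the
// increment happens entirely inside MySQL.
err := r.db.Exec(`
    UPDATE patient AS p
    JOIN (
        SELECT max(queue_number) AS queue_number FROM patient WHERE registration_id = ?
    ) maxp
    SET p.queue_number = maxp.queue_number + 1
    WHERE p.id = ?`,
    pat.RegistrationID, pat.ID).Error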
A cleaner way
Overall, it'd be cleaner to keep a table of queues (by registration_id) with a counter that you update atomically:
Start a transaction, then
SELECT queue_number FROM queues WHERE registration_id = ? FOR UPDATE;
Increment the queue number in your app code, then
UPDATE queues SET queue_number = ? WHERE registration_id = ?;
Now you can use the incremented queue number in your patient creation/update before the transaction commits; a sketch of that flow follows.
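A rough sketch, assuming a queues table that already has one row per registration_id (the table itself, its column names, and the use of GORM's Raw/Exec inside the question's Begin/Commit pattern are my assumptions):
// Sketch of the queue-counter approach. Assumes a `queues` table seeded with
// one row per registration_id; table and column names are illustrative.
func (r repository) CreatePatient(pat *model.Patient) error {
    tx := r.db.Begin()
    if tx.Error != nil {
        return tx.Error
    }
    defer func() {
        if p := recover(); p != nil {
            tx.Rollback()
        }
    }()

    // Lock this registration's counter row; concurrent registrations block
    // here until this transaction commits or rolls back.
    var queueNum int64
    row := tx.Raw("SELECT queue_number FROM queues WHERE registration_id = ? FOR UPDATE", pat.RegistrationID).Row()
    if err := row.Scan(&queueNum); err != nil {
        tx.Rollback()
        return err
    }

    // Increment in app code and write it back while still holding the lock.
    queueNum++
    if err := tx.Exec("UPDATE queues SET queue_number = ? WHERE registration_id = ?", queueNum, pat.RegistrationID).Error; err != nil {
        tx.Rollback()
        return err
    }

    pat.QueueNumber = queueNum
    if err := tx.Create(pat).Error; err != nil {
        tx.Rollback()
        return err
    }
    return tx.Commit().Error
}
Because the SELECT ... FOR UPDATE holds a row lock until commit, two concurrent registrations for the same registration_id can no longer read the same counter value.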

Limit max prepared statement count

The problem
I wrote an application which synchronizes data from BigQuery into a MySQL database. I try to insert roughly 10-20k rows in batches (up to 10 items per batch) every 3 hours. For some reason I receive the following error when it tries to upsert these rows into MySQL:
Can't create more than max_prepared_stmt_count statements:
Error 1461: Can't create more than max_prepared_stmt_count statements
(current value: 2000)
My "relevant code"
// ProcessProjectSkuCost receives the given sku cost entries and sends them in batches to upsertProjectSkuCosts()
func ProcessProjectSkuCost(done <-chan bigquery.SkuCost) {
    var skuCosts []bigquery.SkuCost
    var rowsAffected int64
    for skuCostRow := range done {
        skuCosts = append(skuCosts, skuCostRow)
        if len(skuCosts) == 10 {
            rowsAffected += upsertProjectSkuCosts(skuCosts)
            skuCosts = []bigquery.SkuCost{}
        }
    }
    if len(skuCosts) > 0 {
        rowsAffected += upsertProjectSkuCosts(skuCosts)
    }
    log.Infof("Completed upserting project sku costs. Affected rows: '%d'", rowsAffected)
}

// upsertProjectSkuCosts inserts or updates ProjectSkuCosts into SQL in batches
func upsertProjectSkuCosts(skuCosts []bigquery.SkuCost) int64 {
    // properties are table fields
    tableFields := []string{"project_name", "sku_id", "sku_description", "usage_start_time", "usage_end_time",
        "cost", "currency", "usage_amount", "usage_unit", "usage_amount_in_pricing_units", "usage_pricing_unit",
        "invoice_month"}
    tableFieldString := fmt.Sprintf("(%s)", strings.Join(tableFields, ","))

    // placeholderstring for all to be inserted values
    placeholderString := createPlaceholderString(tableFields)
    valuePlaceholderString := ""
    values := []interface{}{}
    for _, row := range skuCosts {
        valuePlaceholderString += fmt.Sprintf("(%s),", placeholderString)
        values = append(values, row.ProjectName, row.SkuID, row.SkuDescription, row.UsageStartTime,
            row.UsageEndTime, row.Cost, row.Currency, row.UsageAmount, row.UsageUnit,
            row.UsageAmountInPricingUnits, row.UsagePricingUnit, row.InvoiceMonth)
    }
    valuePlaceholderString = strings.TrimSuffix(valuePlaceholderString, ",")

    // put together SQL string
    sqlString := fmt.Sprintf(`INSERT INTO
        project_sku_cost %s VALUES %s ON DUPLICATE KEY UPDATE invoice_month=invoice_month`, tableFieldString, valuePlaceholderString)
    sqlString = strings.TrimSpace(sqlString)

    stmt, err := db.Prepare(sqlString)
    if err != nil {
        log.Warn("Error while preparing SQL statement to upsert project sku costs. ", err)
        return 0
    }

    // execute query
    res, err := stmt.Exec(values...)
    if err != nil {
        log.Warn("Error while executing statement to upsert project sku costs. ", err)
        return 0
    }

    rowsAffected, err := res.RowsAffected()
    if err != nil {
        log.Warn("Error while trying to access affected rows ", err)
        return 0
    }
    return rowsAffected
}

// createPlaceholderString creates a string which will be used for prepare statement (output looks like "(?,?,?)")
func createPlaceholderString(tableFields []string) string {
    placeHolderString := ""
    for range tableFields {
        placeHolderString += "?,"
    }
    placeHolderString = strings.TrimSuffix(placeHolderString, ",")
    return placeHolderString
}
My question:
Why do I hit the max_prepared_stmt_count when I immediately execute the prepared statement (see function upsertProjectSkuCosts)?
I could only imagine it's some sort of concurrency which creates tons of prepared statements in the meantime between preparing and executing all these statements. On the other hand I don't understand why there would be so much concurrency as the channel in the ProcessProjectSkuCost is a buffered channel with a size of 20.
You need to close the statement inside upsertProjectSkuCosts() (or re-use it - see the end of this post).
When you call db.Prepare(), a connection is taken from the internal connection pool (or a new connection is created, if there aren't any free connections). The statement is then prepared on that connection (if that connection isn't free when stmt.Exec() is called, the statement is then also prepared on another connection).
So this creates a statement inside your database for that connection. This statement will not magically disappear - having multiple prepared statements on a connection is perfectly valid. Go could, in principle, notice that stmt goes out of scope and needs some sort of cleanup, but it doesn't do that for you (just like it doesn't close files for you and things like that). So you'll need to do it yourself using stmt.Close(). When you call stmt.Close(), the driver will send a command to the database server, telling it the statement is no longer needed.
The easiest way to do this is by adding defer stmt.Close() after the err check following db.Prepare().
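Applied to upsertProjectSkuCosts above, the change is just one line (shown here in context):
stmt, err := db.Prepare(sqlString)
if err != nil {
    log.Warn("Error while preparing SQL statement to upsert project sku costs. ", err)
    return 0
}
// Close the server-side prepared statement when this function returns, so each
// call no longer leaves one behind on the connection.
defer stmt.Close()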
What you can also do is prepare the statement once and make it available to upsertProjectSkuCosts (either by passing the stmt into upsertProjectSkuCosts or by making upsertProjectSkuCosts a method on a struct, so the struct can hold the stmt). If you do this, you should not call stmt.Close() - because you aren't creating new statements anymore, you are re-using an existing statement.
Also see Should we also close DB's .Prepare() in Golang? and https://groups.google.com/forum/#!topic/golang-nuts/ISh22XXze-s

Inserting Rows in MySQL in Go very slow?

So I have been rewriting an old PHP system in Go, looking for some performance gains, but I'm not getting any. The problem seems to be in the inserts I'm doing into MySQL.
Where PHP does some processing of a CSV file, does some hashing and inserts around 10k rows into MySQL, it takes 40 seconds (unoptimized code).
Go, on the other hand, stripped of any processing and just inserting the same 10k (empty) rows, takes 110 seconds.
Both tests are run on the same machine and I'm using the go-mysql-driver.
Now for some Go code:
This is extremely dumbed-down code and it still takes almost 2 minutes, compared to PHP which does it in less than half that.
db := GetDbCon()
defer db.Close()

stmt, _ := db.Prepare("INSERT INTO ticket ( event_id, entry_id, column_headers, column_data, hash, salt ) VALUES ( ?, ?, ?, ?, ?, ? )")

for i := 0; i < 10000; i++ {
    //CreateTicket(columns, line, storedEvent)
    StoreTicket(models.Ticket{int64(0), storedEvent.Id, int64(i),
        "", "", "", "", int64(0), int64(0)}, *stmt)
}

//Extra functions
func StoreTicket(ticket models.Ticket, stmt sql.Stmt) {
    stmt.Exec(ticket.EventId, ticket.EntryId, ticket.ColumnHeaders, ticket.ColumnData, ticket.Hash, ticket.Salt)
}

func GetDbCon() sql.DB {
    db, _ := sql.Open("mysql", "bla:bla#/bla")
    return *db
}
Profiler result
So is it my code, the go-mysql-driver, or is this normal and PHP is just really fast at inserting records?
==EDIT==
As per requested, I have recorded both PHP and Go runs with tcpdump:
The files:
Go Tcpdump
Go Textdump
PHP Tcpdump
PHP Textdump
I have a hard time reaching any conclusions comparing the two logs; both seem to be sending the same size packets back and forth. But with Go (~110s) MySQL seems to take almost twice as long to process the requests as with PHP (~44s), and Go also seems to wait slightly longer before sending a new request (the difference is minimal, though).
It's an old question but still - better late than never; you're in for a treat:
put all your data into a bytes.Buffer as tab-separated, newline terminated and unquoted lines (if the text causes problems, it has to be escaped first). NULL has to be encoded as \N.
Use http://godoc.org/github.com/go-sql-driver/mysql#RegisterReaderHandler and register a function returning that buffer under "instream". Next, call LOAD DATA LOCAL INFILE "Reader::instream" INTO TABLE ... - that's a very fast way to pump data into MySQL (I measured about 19 MB/sec with Go from a file piped from stdin compared to 18 MB/sec for the MySQL command line client when uploading data from stdin).
As far as I know, that very driver is the only way to LOAD DATA LOCAL INFILE without the need of a file.
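A rough sketch of that approach, reusing the ticket table from the question (the DSN is a placeholder, escaping of real text values is omitted, and the MySQL server must have local_infile enabled):
// Sketch: stream an in-memory buffer into MySQL via LOAD DATA LOCAL INFILE
// using go-sql-driver/mysql's RegisterReaderHandler.
package main

import (
    "bytes"
    "database/sql"
    "fmt"
    "io"

    "github.com/go-sql-driver/mysql"
)

func main() {
    // Tab-separated, newline-terminated, unquoted lines; NULL would be \N.
    // Dummy values stand in for event_id, entry_id and the text/hash columns.
    var buf bytes.Buffer
    for i := 0; i < 10000; i++ {
        fmt.Fprintf(&buf, "%d\t%d\t%s\t%s\t%s\t%s\n", 0, i, "", "", "", "")
    }

    // Register the buffer under the name "instream" so the driver can read from it.
    mysql.RegisterReaderHandler("instream", func() io.Reader { return &buf })

    db, err := sql.Open("mysql", "user:password@/dbname") // placeholder DSN
    if err != nil {
        panic(err)
    }
    defer db.Close()

    _, err = db.Exec(`LOAD DATA LOCAL INFILE 'Reader::instream' INTO TABLE ticket
        (event_id, entry_id, column_headers, column_data, hash, salt)`)
    if err != nil {
        panic(err)
    }
}
The driver streams the buffer to the server as if it were a local file, so the 10k rows arrive in one statement instead of 10k separate inserts.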
I notice you're not using a transaction; if you're using a vanilla MySQL 5.x with InnoDB, this will be a huge performance drag as it will auto-commit on every insert.
func GetDbCon() sql.DB {
    db, _ := sql.Open("mysql", "bla:bla#/bla")
    return *db
}

func PrepareTx(db *sql.DB, qry string) (tx *sql.Tx, s *sql.Stmt, e error) {
    if tx, e = db.Begin(); e != nil {
        return
    }
    if s, e = tx.Prepare(qry); e != nil {
        tx.Rollback()
    }
    return
}

db := GetDbCon()
defer db.Close()

qry := "INSERT INTO ticket ( event_id, entry_id, column_headers, column_data, hash, salt ) VALUES ( ?, ?, ?, ?, ?, ? )"

tx, stmt, e := PrepareTx(&db, qry)
if e != nil {
    panic(e)
}
defer tx.Rollback()

for i := 0; i < 10000; i++ {
    ticket := models.Ticket{int64(0), storedEvent.Id, int64(i), "", "", "", "", int64(0), int64(0)}
    stmt.Exec(ticket.EventId, ticket.EntryId, ticket.ColumnHeaders, ticket.ColumnData, ticket.Hash, ticket.Salt)

    // To avoid huge transactions
    if i%1000 == 0 {
        if e := tx.Commit(); e != nil {
            panic(e)
        } else {
            // can only commit once per transaction
            tx, stmt, e = PrepareTx(&db, qry)
            if e != nil {
                panic(e)
            }
        }
    }
}

// Handle left overs - should also check it isn't already committed
if e := tx.Commit(); e != nil {
    panic(e)
}