I have an issue on Ubuntu Linux with Go 1.4.2 which I am not sure how to sort out:
package main

import (
	"encoding/csv"
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	dir, _ := filepath.Abs(filepath.Dir(os.Args[0]))
	outputFile, outputError := os.OpenFile(dir+"/out1.csv",
		os.O_WRONLY|os.O_CREATE, 0666)
	if outputError != nil {
		fmt.Printf("An error occurred with file creation\n")
		return
	}
	defer outputFile.Close()

	writer := csv.NewWriter(outputFile)
	results := getResults()
	for _, result := range results {
		writer.Write([]string{result.Item, result.Price, result.Shipping})
	}
	writer.Flush()
}
When results has 1,000+ records my PC freezes for seconds, and when it's say 20k records, it freezes for minutes.
How do I solve such an issue in a proper way?
I thought to flush it every N records and add time.Sleep, but that looks awkward…
Related
I have terabytes of sensor data stored across separate CSV files, each in the format timestamp, value. I need to merge these CSV files into one with a single timestamp column (calculated as the average of the timestamps for each row) and one column per sensor, where the column name comes from the file name. There are more than 500 sensors. What tech stack should I look into? I can't fit all the data in RAM at once.
Example:
sensor1.csv
timestamp value
10000 1.9
10010 2.2
... (millions of rows)
sensor2.csv
timestamp value
10004 3.5
10012 4.3
... (500 more files)
Result should look like this (timestamp in this file is the average of all the timestamps from all 500+ input files, names of sensors, e.g. sensor1, sensor2 and etc., come from filenames):
merged_file.csv
timestamp sensor1 sensor2 ... sensor500
10002 1.9 3.5 2.1
10011 2.2 4.3 3.5
After merge, I would like to store data in a database, e.g. InfluxDB, for future analysis and model trainings. What tools would be best to perform merge and analysis operations on this data?
I see your post as two distinct questions: 1) How to merge 500 CSVs? 2) whatever comes next, some DB?
This is a solution for the first question. I'm using Python; there are languages/runtimes that could do this faster, but I think Python will give you a good first start at the problem, and I expect Python will be more accessible and easier for you to use.
Also, my solution assumes that all 500 CSVs have identical row counts.
My solution opens all 500 CSVs at once, creates an outer loop over a set number of rows, and an inner loop over each CSV:
The inner loop reads the timestamp and value for a row in each CSV, averaging the 500 timestamps into a single column and accumulating the 500 distinct values in their own columns; all of that goes into a final merged row with 501 columns.
The outer loop repeats that process for as many rows as there are across all 500 CSVs.
I generated some sample data, 500 CSVs each with 1_000_000 rows, for 6.5G of CSVs. I ran the following script on my M1 Macbook Air. It completed in 8.3 minutes and peaked at 34.6M of RAM and produced a final CSV that is about 2G on disk.
import csv
import glob

# Fill this in based on your own knowledge, or, based on the output of 'analyze_stats.py'
NUM_ROWS = 1_000_000

sensor_filenames = sorted(glob.glob('sensor*.csv'))

# Sort: trim leading 6 chars, 'sensor', and trailing 4 chars, '.csv', leaving just the number in the middle
sensor_filenames = sorted(sensor_filenames, key=lambda x: int(x[6:-4]))

# Get handles to all files, and create input CSV readers
sensor_readers = []
for sensor_fname in sensor_filenames:
    f = open(sensor_fname, newline='')
    sensor_readers.append(csv.reader(f))

# Create output CSV writer
f = open('merged_sensors.csv', 'w', newline='')
writer = csv.writer(f)

# Discard all sensor headers
for reader in sensor_readers:
    next(reader)

# Build up output header and write
output_header = ['timestamp']
for sensor_fname in sensor_filenames:
    sensor_name = sensor_fname[:-4]  # trim off '.csv'
    output_header.append(sensor_name)
writer.writerow(output_header)

row_count = 0
while row_count < NUM_ROWS:
    row_count += 1
    values = []
    timestamps = []
    for reader in sensor_readers:
        row = next(reader)
        ts, val = row
        timestamps.append(int(ts))
        values.append(val)
    if row_count % 1000 == 0:
        print(f'Merged {row_count} rows')
    avg_ts = int(sum(timestamps) / len(timestamps))
    writer.writerow([avg_ts] + values)
I haven't profiled this, but I believe the only real allocations of memory that add up are going to be:
the 500 file handles and CSV readers (which is small) for the entirety of the process
each row from the input CSVs in the inner loop
the final merged row in the outer loop
At the top of the script I mention analyze_stats.py. Even if this were my data, I'd be very patient and break down the entire process into multiple steps, each which I could verify, and I would ultimately arrive at the correct, final CSV. This is especially true for me trying to help you, because I don't control the data or "know" it like you, so I'm going to offer up this bigger process:
Read all the CSVs and record some stats: headers, column counts, and especially row counts.
Analyze those stats for "conformance", making sure no CSVs deviates from your idea of what it should be, and especially get confirmation that all 500 CSVs have the same number of columns and rows.
Use the proven row count as input into the merge process.
There are ways to write the merge script so it doesn't have to know "the row count" up front, but it's more code, it's slightly more confusing, and it won't help you if there ever is a problem... you probably don't want to find out on row 2 million that there was a problem "somewhere". I know I hate it when that happens.
If you're new to Python or the CSV readers/writers I recommend you read these scripts first.
get_sensor_stats.py: reads all your sensor data and records the header, the minimum and maximum number of columns seen, and the row count for each CSV; it writes all those stats out to a CSV
analyze_stats.py: reads in the stats CSV and checks to make sure header and column counts meet pre-defined values; it also keeps a tally of all the row counts for each file, and if there are files with different row counts it will let you know
Also, here's the script I used to generate my 6G of sample data:
gen_sensor_data.py: is my attempt to meaningfully represent your problem space, both in size and complexity (which is very easy, thankfully 🙂)
Same big idea as my Python solution, but in Go for a 2.5 minute speed-up and using 1/3 the RAM.
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"regexp"
	"sort"
	"strconv"
)

// Fill this in based on your own knowledge, or, based on the output of 'analyze_stats.py'
const NUM_ROWS = 1_000_000

// Sorting sensor filenames by a custom key, the sensor number
type BySensorNum []string

func (sn BySensorNum) Len() int      { return len(sn) }
func (sn BySensorNum) Swap(i, j int) { sn[i], sn[j] = sn[j], sn[i] }
func (sn BySensorNum) Less(i, j int) bool {
	re := regexp.MustCompile(`.*sensor(\d+)\.csv`)

	// Tease out sensor num from file "A"
	fnameA := sn[i]
	matchA := re.FindStringSubmatch(fnameA)
	numA := matchA[1]
	intA, errA := strconv.Atoi(numA)
	if errA != nil {
		panic(fmt.Errorf("Could not parse number \"%s\" from file-name \"%s\" as int: %v", numA, fnameA, errA))
	}

	// Tease out sensor num from file "B"
	fnameB := sn[j]
	matchB := re.FindStringSubmatch(fnameB)
	numB := matchB[1]
	intB, errB := strconv.Atoi(numB)
	if errB != nil {
		panic(fmt.Errorf("Could not parse number \"%s\" from file-name \"%s\" as int: %v", numB, fnameB, errB))
	}

	// Compare sensor nums numerically
	return intA < intB
}
func main() {
	// filenames := []string{"../sensor1.csv", "../sensor2.csv", "../sensor3.csv", "../sensor4.csv", "../sensor5.csv", "../sensor6.csv", "../sensor7.csv", "../sensor8.csv", "../sensor9.csv", "../sensor10.csv"}
	filenames, err := filepath.Glob("sensor*.csv")
	if err != nil {
		panic(err) // only expect error if Glob pattern is bad itself (malformed)
	}
	fileCount := len(filenames)
	sort.Sort(BySensorNum(filenames))

	// Create output CSV writer
	outFname := "merged_sensors.csv"
	f, err := os.Create(outFname)
	if err != nil {
		panic(fmt.Errorf("Could not create \"%s\" for writing: %v", outFname, err))
	}
	defer f.Close()
	w := csv.NewWriter(f)

	// Get handles to all files, and create input CSV readers
	readers := make([]*csv.Reader, fileCount)
	for i, fname := range filenames {
		f, err := os.Open(fname)
		if err != nil {
			panic(fmt.Errorf("Could not open \"%s\": %v", fname, err))
		}
		defer f.Close()
		readers[i] = csv.NewReader(f)
	}

	// Discard all sensor headers
	for _, r := range readers {
		r.Read()
	}

	// With everything created or opened, start writing...

	// Build up output header and write
	header := make([]string, fileCount+1)
	header[0] = "timestamp"
	re := regexp.MustCompile(`.*(sensor\d+)\.csv`)
	for i, fname := range filenames {
		sensorName := re.FindStringSubmatch(fname)[1]
		header[i+1] = sensorName
	}
	w.Write(header)

	// "Shell" of final record with fileCount+1 columns, create once and use over-and-over again
	finalRecord := make([]string, fileCount+1)

	for i := 1; i <= NUM_ROWS; i++ {
		var tsSum int
		for j, r := range readers {
			record, err := r.Read()
			if err == io.EOF {
				break
			}
			if err != nil {
				panic(fmt.Errorf("Could not read record for row %d of file \"%s\": %v", i, filenames[j], err))
			}
			timestamp, err := strconv.Atoi(record[0])
			if err != nil {
				panic(fmt.Errorf("Could not parse timestamp \"%s\" as int in record `%v`, row %d, of file \"%s\": %v", record[0], record, i, filenames[j], err))
			}
			tsSum += timestamp
			finalRecord[j+1] = record[1]
		}
		// Add average timestamp to first cell/column
		finalRecord[0] = fmt.Sprintf("%.1f", float32(tsSum)/float32(fileCount))
		w.Write(finalRecord)
	}
	w.Flush()
}
I am trying to filter a .csv file to only include 2 columns that will be specified by the user. My current code can only filter the .csv file to one column (and when I write to a .csv file, the results are in a row instead of a column). Any ideas on how to filter the two columns and write the results in a single column in Go?
In addition, is there any way I can write the data as a column instead of a row?
import (
	"encoding/csv"
	"fmt"
	"io"
	"log"
	"os"
)

func main() {
	file, err := os.Open("sample.csv")
	checkError(err)
	reader := csv.NewReader(file)
	_, err = reader.Read() // Skips header
	checkError(err)

	results := make([]string, 0)
	for {
		row, err := reader.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		//fmt.Println(row[columnNum])
		results = append(results, row[columnNum])
	}
	fmt.Print(results)

	// File creation
	f, err := os.Create("results.csv")
	checkError(err)
	defer f.Close()

	w := csv.NewWriter(f)
	err = w.Write(results)
	checkError(err)
	w.Flush()
}
...is there any way I can write the data as a column instead of a row?
You're only writing a single record. You call w.Write once, which writes a single record (one row) to the CSV. If you write once per consumed row, you'll get multiple rows. Notice that you're calling Read many times and Write only once.
Any ideas on how to filter the two columns and write the results in a single column on Go?
To get two columns, you just need to access both - I see one access to each row right now (row[columnNum]), so you'll just need a second. Combined with my previous point, I think the main problem is that you're missing that CSVs are two dimensional but you're only storing a single dimensional array.
I'm not sure what you mean by "write to a single column" - maybe you want to double the length of the CSV, or maybe you want to somehow merge two columns?
In either case, I'd suggest restructuring your code to avoid building up the intermediate results array, and instead write directly after reading. This is more performant and maps more directly from the old format to the new one.
file, err := os.Open("sample.csv")
checkError(err)
reader := csv.NewReader(file)
_, err = reader.Read() // Skips header
checkError(err)

// Create the file and writer immediately
f, err := os.Create("results.csv")
checkError(err)
defer f.Close()
w := csv.NewWriter(f)

for {
	row, err := reader.Read()
	if err == io.EOF {
		break
	}
	if err != nil {
		log.Fatal(err)
	}

	// here - each row is written to the new CSV writer
	err = w.Write([]string{row[columnNum]})
	// an example of writing two columns to each row
	// err = w.Write([]string{row[columnNum1], row[columnNum2]})
	// an example of merging two columns
	// err = w.Write([]string{fmt.Sprintf("%s - %s", row[columnNum1], row[columnNum2])})
	checkError(err)
}
w.Flush()
The problem
I wrote an application which synchronizes data from BigQuery into a MySQL database. I try to insert roughly 10-20k rows in batches (up to 10 items each batch) every 3 hours. For some reason I receive the following error when it tries to upsert these rows into MySQL:
Error 1461: Can't create more than max_prepared_stmt_count statements
(current value: 2000)
My "relevant code"
// ProcessProjectSkuCost receives the given sku cost entries and sends them in batches to upsertProjectSkuCosts()
func ProcessProjectSkuCost(done <-chan bigquery.SkuCost) {
	var skuCosts []bigquery.SkuCost
	var rowsAffected int64
	for skuCostRow := range done {
		skuCosts = append(skuCosts, skuCostRow)
		if len(skuCosts) == 10 {
			rowsAffected += upsertProjectSkuCosts(skuCosts)
			skuCosts = []bigquery.SkuCost{}
		}
	}
	if len(skuCosts) > 0 {
		rowsAffected += upsertProjectSkuCosts(skuCosts)
	}
	log.Infof("Completed upserting project sku costs. Affected rows: '%d'", rowsAffected)
}

// upsertProjectSkuCosts inserts or updates ProjectSkuCosts into SQL in batches
func upsertProjectSkuCosts(skuCosts []bigquery.SkuCost) int64 {
	// properties are table fields
	tableFields := []string{"project_name", "sku_id", "sku_description", "usage_start_time", "usage_end_time",
		"cost", "currency", "usage_amount", "usage_unit", "usage_amount_in_pricing_units", "usage_pricing_unit",
		"invoice_month"}
	tableFieldString := fmt.Sprintf("(%s)", strings.Join(tableFields, ","))

	// placeholder string for all to-be-inserted values
	placeholderString := createPlaceholderString(tableFields)
	valuePlaceholderString := ""
	values := []interface{}{}
	for _, row := range skuCosts {
		valuePlaceholderString += fmt.Sprintf("(%s),", placeholderString)
		values = append(values, row.ProjectName, row.SkuID, row.SkuDescription, row.UsageStartTime,
			row.UsageEndTime, row.Cost, row.Currency, row.UsageAmount, row.UsageUnit,
			row.UsageAmountInPricingUnits, row.UsagePricingUnit, row.InvoiceMonth)
	}
	valuePlaceholderString = strings.TrimSuffix(valuePlaceholderString, ",")

	// put together SQL string
	sqlString := fmt.Sprintf(`INSERT INTO
		project_sku_cost %s VALUES %s ON DUPLICATE KEY UPDATE invoice_month=invoice_month`, tableFieldString, valuePlaceholderString)
	sqlString = strings.TrimSpace(sqlString)

	stmt, err := db.Prepare(sqlString)
	if err != nil {
		log.Warn("Error while preparing SQL statement to upsert project sku costs. ", err)
		return 0
	}

	// execute query
	res, err := stmt.Exec(values...)
	if err != nil {
		log.Warn("Error while executing statement to upsert project sku costs. ", err)
		return 0
	}
	rowsAffected, err := res.RowsAffected()
	if err != nil {
		log.Warn("Error while trying to access affected rows ", err)
		return 0
	}
	return rowsAffected
}

// createPlaceholderString creates a string which will be used for a prepared statement (output looks like "(?,?,?)")
func createPlaceholderString(tableFields []string) string {
	placeHolderString := ""
	for range tableFields {
		placeHolderString += "?,"
	}
	placeHolderString = strings.TrimSuffix(placeHolderString, ",")
	return placeHolderString
}
My question:
Why do I hit the max_prepared_stmt_count when I immediately execute the prepared statement (see function upsertProjectSkuCosts)?
I could only imagine it's some sort of concurrency which creates tons of prepared statements in the meantime between preparing and executing all these statements. On the other hand I don't understand why there would be so much concurrency as the channel in the ProcessProjectSkuCost is a buffered channel with a size of 20.
You need to close the statement inside upsertProjectSkuCosts() (or re-use it - see the end of this post).
When you call db.Prepare(), a connection is taken from the internal connection pool (or a new connection is created, if there aren't any free connections). The statement is then prepared on that connection (if that connection isn't free when stmt.Exec() is called, the statement is then also prepared on another connection).
So this creates a statement inside your database for that connection. This statement will not magically disappear - having multiple prepared statements in a connection is perfectly valid. Golang could see that stmt goes out of scope, see it requires some sort of cleanup and then do that cleanup, but Golang doesn't (just like it doesn't close files for you and things like that). So you'll need to do that yourself using stmt.Close(). When you call stmt.Close(), the driver will send a command to the database server, telling it the statement is no longer needed.
The easiest way to do this is by adding defer stmt.Close() after the err check following db.Prepare().
What you can also do, is prepare the statement once and make that available for upsertProjectSkuCosts (either by passing the stmt into upsertProjectSkuCosts or by making upsertProjectSkuCosts a func of a struct, so the struct can have a property for the stmt). If you do this, you should not call stmt.Close() - because you aren't creating new statements anymore, you are re-using an existing statement.
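As a concrete sketch of the first option, applied to the question's own upsertProjectSkuCosts (same variable names as above):

```go
stmt, err := db.Prepare(sqlString)
if err != nil {
	log.Warn("Error while preparing SQL statement to upsert project sku costs. ", err)
	return 0
}
// Close the server-side statement when this function returns; without this,
// every call leaks one of MySQL's max_prepared_stmt_count (2000) slots.
defer stmt.Close()
```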
Also see Should we also close DB's .Prepare() in Golang? and https://groups.google.com/forum/#!topic/golang-nuts/ISh22XXze-s
I have a problem with getting database table list (SHOW TABLES) in Go.
I use these packages
database/sql
gopkg.in/gorp.v1
github.com/ziutek/mymysql/godrv
and connect to MySQL with this code:
db, err := sql.Open(
	"mymysql",
	"tcp:127.0.0.1:3306*test/root/root")
if err != nil {
	panic(err)
}
dbmap := &DbMap{Conn: &gorp.DbMap{Db: db}}
And I use this code to get list of tables
result, _ := dbmap.Exec("SHOW TABLES")
But result is empty!
I use the classic go-sql-driver/mysql:
db, _ := sql.Open("mysql", "root:qwerty@/dbname")
res, _ := db.Query("SHOW TABLES")

var table string
for res.Next() {
	res.Scan(&table)
	fmt.Println(table)
}
PS don't ignore errors! This is only an example
I tried this code and it works successfully. I create a list of strings and use a Select query to get the list of database tables.
tables := []string{}
dbmap.Select(&tables, "SHOW TABLES")
fmt.Println(tables)
I connect to the database in the init function of a controller, like:
db, err = sql.Open("mysql", "user:pass@tcp(<ip>:3306)/<db>")
if err != nil {
	log.Fatal(err)
}

err = db.Ping()
if err != nil {
	log.Fatal(err)
}
Then I prepare some statements (db.Prepare) and finally execute them somewhere in the code, without creating new db connections or anything weird. Just letting Go handle the connection pool.
But as you can see in the image, I'm getting a lot of connections and aborted connections which make the server run slow and even crash.
Why is this happening? Also, I have around 2000 simultaneous online users, which results in about 20 queries per second. I don't think that's much, but just in case it matters.
EDIT:
Here's how I run the prepared statements. I have 2 selects, 1 update and 1 insert. Selects are being run like:
err = getStatement.QueryRow(apiKey).Scan(&keyId)
if err != nil {
	res, _ := json.Marshal(RespError{"Statement Error"})
	w.Write(res)
	return
}
Inserts and updates:
insertStatement.Exec(a,b,c)
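One thing worth checking (an assumption on my part, not something established in this thread) is whether the pool is bounded at all; database/sql opens as many connections as concurrent demand asks for unless you cap it. A sketch, with purely illustrative limits:

```go
db, err = sql.Open("mysql", "user:pass@tcp(<ip>:3306)/<db>")
if err != nil {
	log.Fatal(err)
}
// Illustrative limits - tune for your server. Bounding the pool prevents
// connection storms under load, and recycling connections before MySQL's
// wait_timeout keeps the server from aborting idle ones.
db.SetMaxOpenConns(50)
db.SetMaxIdleConns(25)
db.SetConnMaxLifetime(5 * time.Minute)
```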