Undefined behaviour while loading a large CSV concurrently using Goroutines

I am trying to load a big CSV file in Go using goroutines. The dimensions of the CSV are (254882, 100). But when I parse the CSV with my goroutines and store it in a 2D list, I get fewer than 254882 rows, and the number varies between runs. I feel it is happening because of the goroutines, but I can't pinpoint the reason. Can anyone please help me? I am also new to Go. Here is my code below
func loadCSV(csvFile string) (*[][]float64, error) {
startTime := time.Now()
var dataset [][]float64
f, err := os.Open(csvFile)
if err != nil {
return &dataset, err
}
r := csv.NewReader(bufio.NewReader(f))
counter := 0
var wg sync.WaitGroup
for {
record, err := r.Read()
if err == io.EOF {
break
}
if counter != 0 {
wg.Add(1)
go func(r []string, dataset *[][]float64) {
var temp []float64
for _, each := range record {
f, err := strconv.ParseFloat(each, 64)
if err == nil {
temp = append(temp, f)
}
}
*dataset = append(*dataset, temp)
wg.Done()
}(record, &dataset)
}
counter++
}
wg.Wait()
duration := time.Now().Sub(startTime)
log.Printf("Loaded %d rows in %v seconds", counter, duration)
return &dataset, nil
}
And my main function looks like the following
func main() {
// runtime.GOMAXPROCS(4)
dataset, err := loadCSV("AvgW2V_train.csv")
if err != nil {
panic(err)
}
fmt.Println(len(*dataset))
}
If anyone needs to download the CSV too, then click the link below (485 MB)
https://drive.google.com/file/d/1G4Nw6JyeC-i0R1exWp5BtRtGM1Fwyelm/view?usp=sharing

Go Data Race Detector
Your results are undefined because you have data races.
~/gopath/src$ go run -race racer.go
==================
WARNING: DATA RACE
Write at 0x00c00008a060 by goroutine 6:
runtime.mapassign_faststr()
/home/peter/go/src/runtime/map_faststr.go:202 +0x0
main.main.func2()
/home/peter/gopath/src/racer.go:16 +0x6a
Previous write at 0x00c00008a060 by goroutine 5:
runtime.mapassign_faststr()
/home/peter/go/src/runtime/map_faststr.go:202 +0x0
main.main.func1()
/home/peter/gopath/src/racer.go:11 +0x6a
Goroutine 6 (running) created at:
main.main()
/home/peter/gopath/src/racer.go:14 +0x88
Goroutine 5 (running) created at:
main.main()
/home/peter/gopath/src/racer.go:9 +0x5b
==================
fatal error: concurrent map writes
==================
WARNING: DATA RACE
Write at 0x00c00009a088 by goroutine 6:
main.main.func2()
/home/peter/gopath/src/racer.go:16 +0x7f
Previous write at 0x00c00009a088 by goroutine 5:
main.main.func1()
/home/peter/gopath/src/racer.go:11 +0x7f
Goroutine 6 (running) created at:
main.main()
/home/peter/gopath/src/racer.go:14 +0x88
Goroutine 5 (running) created at:
main.main()
/home/peter/gopath/src/racer.go:9 +0x5b
==================
goroutine 34 [running]:
runtime.throw(0x49e156, 0x15)
/home/peter/go/src/runtime/panic.go:608 +0x72 fp=0xc000094718 sp=0xc0000946e8 pc=0x44b342
runtime.mapassign_faststr(0x48ace0, 0xc00008a060, 0x49c9c3, 0x8, 0xc00009a088)
/home/peter/go/src/runtime/map_faststr.go:211 +0x46c fp=0xc000094790 sp=0xc000094718 pc=0x43598c
main.main.func1(0x49c9c3, 0x8)
/home/peter/gopath/src/racer.go:11 +0x6b fp=0xc0000947d0 sp=0xc000094790 pc=0x47ac6b
runtime.goexit()
/home/peter/go/src/runtime/asm_amd64.s:1340 +0x1 fp=0xc0000947d8 sp=0xc0000947d0 pc=0x473061
created by main.main
/home/peter/gopath/src/racer.go:9 +0x5c
goroutine 1 [sleep]:
time.Sleep(0x5f5e100)
/home/peter/go/src/runtime/time.go:105 +0x14a
main.main()
/home/peter/gopath/src/racer.go:19 +0x96
goroutine 35 [runnable]:
main.main.func2(0x49c9c3, 0x8)
/home/peter/gopath/src/racer.go:16 +0x6b
created by main.main
/home/peter/gopath/src/racer.go:14 +0x89
exit status 2
~/gopath/src$
racer.go:
package main
import (
"bufio"
"encoding/csv"
"fmt"
"io"
"log"
"os"
"strconv"
"sync"
"time"
)
func loadCSV(csvFile string) (*[][]float64, error) {
startTime := time.Now()
var dataset [][]float64
f, err := os.Open(csvFile)
if err != nil {
return &dataset, err
}
r := csv.NewReader(bufio.NewReader(f))
counter := 0
var wg sync.WaitGroup
for {
record, err := r.Read()
if err == io.EOF {
break
}
if counter != 0 {
wg.Add(1)
go func(r []string, dataset *[][]float64) {
var temp []float64
for _, each := range record {
f, err := strconv.ParseFloat(each, 64)
if err == nil {
temp = append(temp, f)
}
}
*dataset = append(*dataset, temp)
wg.Done()
}(record, &dataset)
}
counter++
}
wg.Wait()
duration := time.Now().Sub(startTime)
log.Printf("Loaded %d rows in %v seconds", counter, duration)
return &dataset, nil
}
func main() {
// runtime.GOMAXPROCS(4)
dataset, err := loadCSV("/home/peter/AvgW2V_train.csv")
if err != nil {
panic(err)
}
fmt.Println(len(*dataset))
}

There is no need to return *[][]float64; a slice is already a small header that refers to its backing array, so the extra level of indirection buys nothing.
I have made some minor modifications to your program.
dataset is visible to the new goroutine, since it is declared in an enclosing block.
record is visible too, but because the record variable changes on every iteration, it has to be passed to the new goroutine as an argument.
There is no need to pass dataset, as it does not change, and that is exactly what we want so that we can append temp to it.
The race condition happens when multiple goroutines try to append to the same variable, i.e., multiple goroutines write to the same variable at the same time.
So we need to make sure that only one goroutine can append at any instant, and we use a lock to make the appends sequential.
package main
import (
"bufio"
"encoding/csv"
"fmt"
"os"
"strconv"
"sync"
)
func loadCSV(csvFile string) [][]float64 {
var dataset [][]float64
f, _ := os.Open(csvFile)
r := csv.NewReader(f)
var wg sync.WaitGroup
l := new(sync.Mutex) // lock
for record, err := r.Read(); err == nil; record, err = r.Read() {
wg.Add(1)
go func(record []string) {
defer wg.Done()
var temp []float64
for _, each := range record {
if f, err := strconv.ParseFloat(each, 64); err == nil {
temp = append(temp, f)
}
}
l.Lock() // lock before writing
dataset = append(dataset, temp) // write
l.Unlock() // unlock
}(record)
}
wg.Wait()
return dataset
}
func main() {
dataset := loadCSV("train.csv")
fmt.Println(len(dataset))
}
Some errors are not handled here to keep the example minimal, but you should handle them.
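Note that appending under a lock does not preserve the input row order, because goroutines finish in an arbitrary order. If order matters, here is a minimal sketch of an alternative (an assumption: the whole file is read into memory first with ReadAll, which should be fine for the ~485 MB file in the question). Each goroutine writes its parsed row into its own pre-allocated slot, and writing to distinct slice elements from different goroutines is not a data race, so no lock is needed:

package main

import (
    "encoding/csv"
    "fmt"
    "log"
    "os"
    "strconv"
    "sync"
)

// loadCSVOrdered parses rows concurrently but keeps the original row order by
// giving every goroutine its own slot in a pre-allocated dataset slice.
func loadCSVOrdered(csvFile string) ([][]float64, error) {
    f, err := os.Open(csvFile)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    records, err := csv.NewReader(f).ReadAll()
    if err != nil {
        return nil, err
    }
    if len(records) > 0 {
        records = records[1:] // skip the header row, as in the original code
    }

    dataset := make([][]float64, len(records)) // one slot per row
    var wg sync.WaitGroup
    for i, record := range records {
        wg.Add(1)
        go func(i int, record []string) {
            defer wg.Done()
            row := make([]float64, 0, len(record))
            for _, each := range record {
                if v, err := strconv.ParseFloat(each, 64); err == nil {
                    row = append(row, v)
                }
            }
            dataset[i] = row // each goroutine writes a distinct index: no data race
        }(i, record)
    }
    wg.Wait()
    return dataset, nil
}

func main() {
    dataset, err := loadCSVOrdered("AvgW2V_train.csv")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(len(dataset))
}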

Related

Summarizing contents of csv

Context
I'm working on creating a little program that can summarize the contents of an absolute mess of a bill, which is in csv form.
The bill has three columns I'm interested in:
Event type. Here, I'm only interested in the rows where this column reads CHARGE
The cost. Self-explanatory.
Resource name, containing Server and cluster names. The format is servername.clustername.
The idea is to select the rows that are labeled as charge, split them up first by cluster and then by server name, and sum up the total costs for each.
I can't help but feel like this should be easy, but I've been scratching my head on this for a while now and just can't seem to figure it out. At this point I ought to state that I am fairly new to programming and entirely new to Go.
Here's what I have so far:
package main
import (
"encoding/csv"
"log"
"os"
"sort"
"strconv"
"strings"
)
func main() {
rows := readBill("bill-2018-April.csv")
rows = calculateSummary(rows)
writeSummary("bill-2018-April-output", rows)
}
func readBill(name string) [][]string {
f, err := os.Open(name)
if err != nil {
log.Fatalf("Cannot open '%s': %s\n", name, err.Error())
}
defer f.Close()
r := csv.NewReader(f)
rows, err := r.ReadAll()
if err != nil {
log.Fatalln("Cannot read CSV data:", err.Error())
}
return rows
}
type charges struct {
impactType string
cost float64
resName string
}
func createCharges(rows [][]string) []charges {
cs := []charges{} // named cs so the slice does not shadow the charges type
for _, r := range rows {
var c charges
c.impactType = r[10] // r is already a single row, so index it once
c.cost, _ = strconv.ParseFloat(r[15], 64) // cost must be parsed, not assigned as a string
c.resName = r[20]
cs = append(cs, c)
}
return cs
}
So, as far as I can tell, I should now have isolated the columns I am interested in (i.e. columns 10, 15 and 20). Is what I have so far even correct?
How would I go about singling out the rows reading "CHARGE" and slicing everything up by cluster and server?
Summing things up shouldn't be too tricky, but for whatever reason, this is really stumping me.
Just use two maps to store the sums per server and per cluster. And since you're not interested in the whole CSV but only some rows, reading everything is kind of wasteful. Just skip the rows you don't care about:
package main
import (
"encoding/csv"
"fmt"
"io"
"log"
"strconv"
"strings"
)
func main() {
b := `
,,,,,,,,,,CHARGE,,,,,100.00,,,,,s1.c1
,,,,,,,,,,IGNORE,,,,,,,,,,
,,,,,,,,,,CHARGE,,,,,200.00,,,,,s2.c1
,,,,,,,,,,CHARGE,,,,,300.00,,,,,s3.c2
`
r := csv.NewReader(strings.NewReader(b))
byServer := make(map[string]float64)
byCluster := make(map[string]float64)
for i := 0; ; i++ {
row, err := r.Read()
if err == io.EOF {
break
}
if err != nil {
log.Fatal(err)
}
if row[10] != "CHARGE" {
continue
}
cost, err := strconv.ParseFloat(row[15], 64)
if err != nil {
log.Fatalf("row %d: malformed cost: %v", i, err)
}
xs := strings.SplitN(row[20], ".", 2)
if len(xs) != 2 {
log.Fatalf("row %d: malformed resource name", i)
}
server, cluster := xs[0], xs[1]
byServer[server] += cost
byCluster[cluster] += cost
}
fmt.Printf("byServer: %+v\n", byServer)
fmt.Printf("byCluster: %+v\n", byCluster)
}
// Output:
// byServer: map[s2:200 s3:300 s1:100]
// byCluster: map[c1:300 c2:300]
Try it on the playground: https://play.golang.org/p/1e9mJf4LyYE
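The question's main also calls a writeSummary function that is not shown. Below is a minimal hedged sketch of one possible implementation that writes both maps produced above out as CSV; the signature, the output file name, and the three-column layout (kind, name, total) are all assumptions, and it needs "os" added to the imports:

// writeSummary dumps the per-server and per-cluster totals to <name>.csv.
// The layout and naming here are assumptions for illustration only.
func writeSummary(name string, byServer, byCluster map[string]float64) error {
    f, err := os.Create(name + ".csv")
    if err != nil {
        return err
    }
    defer f.Close()

    w := csv.NewWriter(f)
    for server, total := range byServer {
        if err := w.Write([]string{"server", server, strconv.FormatFloat(total, 'f', 2, 64)}); err != nil {
            return err
        }
    }
    for cluster, total := range byCluster {
        if err := w.Write([]string{"cluster", cluster, strconv.FormatFloat(total, 'f', 2, 64)}); err != nil {
            return err
        }
    }
    w.Flush()
    return w.Error()
}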

Concurrently write multiple csv files from one, splitting on a partition column in Golang

My objective is to read one or multiple csv files that share a common format, and write to separate files based on a partition column in the csv data. Assume that the last column is the partition, that the data is unsorted, and that a given partition can be found in multiple files. Example of one file:
fsdio,abc,def,2017,11,06,01
1sdf9,abc,def,2017,11,06,04
22df9,abc,def,2017,11,06,03
1d243,abc,def,2017,11,06,02
If this approach smells like the dreaded XY Problem, I'm happy to adjust.
What I've tried so far:
Read in the data set and iterate over each line.
If the partition has not been seen yet, spin off a new worker goroutine (this will contain a file/csv writer).
Send the line into a chan []string.
As each worker is a file writer, it should only receive lines for exactly one partition over its input channel.
This obviously doesn't work (yet), as I'm not aware of how to send a line to the correct worker based on the partition value seen on a given line.
I've given each worker an id string for each partition value, but I am not sure how to select which worker to send to: whether I should create a separate chan []string for each worker and send to that channel with a select, or whether a struct should hold each worker with some sort of pool and routing functionality.
TL;DR: I'm lost as to how to conditionally send data to a given goroutine or channel based on some categorical string value, where the number of unique values can be arbitrary but likely does not exceed 24 partition values.
I will caveat by stating I've noticed questions like this do get down-voted, so if you feel this is counter-constructive or incomplete enough to down-vote, please comment with why so I can avoid repeating the offense.
Thanks for any help in advance!
Playground
Snippet:
package main
import (
"encoding/csv"
"fmt"
"log"
"strings"
"time"
)
func main() {
// CSV
r := csv.NewReader(csvFile1)
lines, err := r.ReadAll()
if err != nil {
log.Fatalf("error reading all lines: %v", err)
}
// CHANNELS
lineChan := make(chan []string)
// TRACKER
var seenPartitions []string
for _, line := range lines {
hour := line[6]
if !stringInSlice(hour, seenPartitions) {
seenPartitions = append(seenPartitions, hour)
go worker(hour, lineChan)
}
// How to send to the correct worker/channel?
lineChan <- line
}
close(lineChan)
}
func worker(id string, lineChan <-chan []string) {
for j := range lineChan {
fmt.Println("worker", id, "started job", j)
// Write to a new file here and wait for input over the channel
time.Sleep(time.Second)
fmt.Println("worker", id, "finished job", j)
}
}
func stringInSlice(str string, list []string) bool {
for _, v := range list {
if v == str {
return true
}
}
return false
}
// DUMMY
var csvFile1 = strings.NewReader(`
12fy3,abc,def,2017,11,06,04
fsdio,abc,def,2017,11,06,01
11213,abc,def,2017,11,06,02
1sdf9,abc,def,2017,11,06,01
2123r,abc,def,2017,11,06,03
1v2t3,abc,def,2017,11,06,01
1r2r3,abc,def,2017,11,06,02
g1253,abc,def,2017,11,06,02
d1e23,abc,def,2017,11,06,02
a1d23,abc,def,2017,11,06,02
12jj3,abc,def,2017,11,06,03
t1r23,abc,def,2017,11,06,03
22123,abc,def,2017,11,06,03
14d23,abc,def,2017,11,06,04
1d243,abc,def,2017,11,06,01
1da23,abc,def,2017,11,06,04
a1523,abc,def,2017,11,06,01
12453,abc,def,2017,11,06,04`)
Synchronous version first, with no concurrent Go magic (see the concurrent version below).
package main
import (
"encoding/csv"
"fmt"
"io"
"log"
"strings"
)
func main() {
// CSV
r := csv.NewReader(csvFile1)
partitions := make(map[string][][]string)
for {
rec, err := r.Read()
if err != nil {
if err == io.EOF {
err = nil
save_partitions(partitions)
return
}
log.Fatal(err)
}
process(rec, partitions)
}
}
// prints only
func save_partitions(partitions map[string][][]string) {
for part, recs := range partitions {
fmt.Println(part)
for _, rec := range recs {
fmt.Println(rec)
}
}
}
// this can also write/append directly to a file
func process(rec []string, partitions map[string][][]string) {
l := len(rec)
part := rec[l-1]
if p, ok := partitions[part]; ok {
partitions[part] = append(p, rec)
} else {
partitions[part] = [][]string{rec}
}
}
// DUMMY
var csvFile1 = strings.NewReader(`
fsdio,abc,def,2017,11,06,01
1sdf9,abc,def,2017,11,06,01
1d243,abc,def,2017,11,06,01
1v2t3,abc,def,2017,11,06,01
a1523,abc,def,2017,11,06,01
1r2r3,abc,def,2017,11,06,02
11213,abc,def,2017,11,06,02
g1253,abc,def,2017,11,06,02
d1e23,abc,def,2017,11,06,02
a1d23,abc,def,2017,11,06,02
12jj3,abc,def,2017,11,06,03
t1r23,abc,def,2017,11,06,03
2123r,abc,def,2017,11,06,03
22123,abc,def,2017,11,06,03
14d23,abc,def,2017,11,06,04
1da23,abc,def,2017,11,06,04
12fy3,abc,def,2017,11,06,04
12453,abc,def,2017,11,06,04`)
https://play.golang.org/p/--iqZGzxCF
And the concurrent version:
package main
import (
"encoding/csv"
"fmt"
"io"
"log"
"strings"
"sync"
)
var (
// list of channels to communicate with workers
// workers map is accessed synchronously, so no mutex is required
workers = make(map[string]chan []string)
// wg is to make sure all workers done before exiting main
wg = sync.WaitGroup{}
// mu used only for sequential printing, not relevant for program logic
mu = sync.Mutex{}
)
func main() {
// wait for all workers to finish up before exit
defer wg.Wait()
r := csv.NewReader(csvFile1)
for {
rec, err := r.Read()
if err != nil {
if err == io.EOF {
savePartitions()
return
}
log.Fatal(err) // sorry for the panic
}
process(rec)
}
}
func process(rec []string) {
l := len(rec)
part := rec[l-1]
if c, ok := workers[part]; ok {
// send rec to worker
c <- rec
} else {
// if no worker for the partition
// make a chan
nc := make(chan []string)
workers[part] = nc
// start worker with this chan
go worker(nc)
// send rec to worker via chan
nc <- rec
}
}
func worker(c chan []string) {
// wg.Done signals to main worker completion
wg.Add(1)
defer wg.Done()
part := [][]string{}
for {
// wait for a rec or close(chan)
rec, ok := <-c
if ok {
// save the rec
// instead of accumulation in memory
// this can be saved to file directly
part = append(part, rec)
} else {
// channel closed on EOF
// dump partition
// the lock ensures sequential printing;
// it is not required when writing to independent files
mu.Lock()
for _, p := range part {
fmt.Printf("%+v\n", p)
}
mu.Unlock()
return
}
}
}
// simply signals to workers to stop
func savePartitions() {
for _, c := range workers {
// signal to all workers to exit
close(c)
}
}
// DUMMY
var csvFile1 = strings.NewReader(`
fsdio,abc,def,2017,11,06,01
1sdf9,abc,def,2017,11,06,01
1d243,abc,def,2017,11,06,01
1v2t3,abc,def,2017,11,06,01
a1523,abc,def,2017,11,06,01
1r2r3,abc,def,2017,11,06,02
11213,abc,def,2017,11,06,02
g1253,abc,def,2017,11,06,02
d1e23,abc,def,2017,11,06,02
a1d23,abc,def,2017,11,06,02
12jj3,abc,def,2017,11,06,03
t1r23,abc,def,2017,11,06,03
2123r,abc,def,2017,11,06,03
22123,abc,def,2017,11,06,03
14d23,abc,def,2017,11,06,04
1da23,abc,def,2017,11,06,04
12fy3,abc,def,2017,11,06,04
12453,abc,def,2017,11,06,04`)
https://play.golang.org/p/oBTPosy0yT
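Since the stated objective is separate files rather than printed output, here is a minimal sketch of a worker variant that writes its partition straight to its own CSV file as records arrive. The "partition_<value>.csv" naming is an assumption, "os" has to be added to the imports, and the rest of the program stays as above (process would start it with go fileWorker(part, nc) instead of go worker(nc); savePartitions and the package-level wg are unchanged):

// fileWorker drains its channel into its own CSV file and exits when the
// channel is closed by savePartitions on EOF.
func fileWorker(part string, c chan []string) {
    wg.Add(1) // registered the same way as the in-memory worker above
    defer wg.Done()
    f, err := os.Create("partition_" + part + ".csv") // hypothetical naming scheme
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    w := csv.NewWriter(f)
    defer w.Flush()
    for rec := range c { // receives until the channel is closed
        if err := w.Write(rec); err != nil {
            log.Fatal(err)
        }
    }
}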
Have fun!

SQL result to JSON as fast as possible

I'm trying to transform the Go built-in sql result to JSON. I'm using goroutines for that, but I ran into problems.
The base problem:
There is a really big database with around 200k users, and I have to serve them through TCP sockets in a microservice-based system. Getting the users from the database takes 20 ms, but transforming that data to JSON takes 10 seconds with the current solution. This is why I want to use goroutines.
Solution with Goroutines:
func getJSON(rows *sql.Rows, cnf configure.Config) ([]byte, error) {
log := logan.Log{
Cnf: cnf,
}
cols, _ := rows.Columns()
defer rows.Close()
done := make(chan struct{})
go func() {
defer close(done)
for result := range resultChannel {
results = append(
results,
result,
)
}
}()
wg.Add(1)
go func() {
for rows.Next() {
wg.Add(1)
go handleSQLRow(cols, rows)
}
wg.Done()
}()
go func() {
wg.Wait()
defer close(resultChannel)
}()
<-done
s, err := json.Marshal(results)
results = []resultContainer{}
if err != nil {
log.Context(1).Error(err)
}
rows.Close()
return s, nil
}
func handleSQLRow(cols []string, rows *sql.Rows) {
defer wg.Done()
result := make(map[string]string, len(cols))
fmt.Println("asd -> " + strconv.Itoa(counter))
counter++
rawResult := make([][]byte, len(cols))
dest := make([]interface{}, len(cols))
for i := range rawResult {
dest[i] = &rawResult[i]
}
rows.Scan(dest...) // GET PANIC
for i, raw := range rawResult {
if raw == nil {
result[cols[i]] = ""
} else {
fmt.Println(string(raw))
result[cols[i]] = string(raw)
}
}
resultChannel <- result
}
This solution gives me a panic with the following message:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x45974c]
goroutine 408 [running]:
panic(0x7ca140, 0xc420010150)
/usr/lib/golang/src/runtime/panic.go:500 +0x1a1
database/sql.convertAssign(0x793960, 0xc420529210, 0x7a5240, 0x0, 0x0, 0x0)
/usr/lib/golang/src/database/sql/convert.go:88 +0x1ef1
database/sql.(*Rows).Scan(0xc4203e4060, 0xc42021fb00, 0x44, 0x44, 0x44, 0x44)
/usr/lib/golang/src/database/sql/sql.go:1850 +0xc2
github.com/PumpkinSeed/zerodb/operations.handleSQLRow(0xc420402000, 0x44, 0x44, 0xc4203e4060)
/home/loow/gopath/src/github.com/PumpkinSeed/zerodb/operations/operations.go:290 +0x19c
created by github.com/PumpkinSeed/zerodb/operations.getJSON.func2
/home/loow/gopath/src/github.com/PumpkinSeed/zerodb/operations/operations.go:258 +0x91
exit status 2
The current solution, which works but takes too much time:
func getJSON(rows *sql.Rows, cnf configure.Config) ([]byte, error) {
log := logan.Log{
Cnf: cnf,
}
var results []resultContainer
cols, _ := rows.Columns()
rawResult := make([][]byte, len(cols))
dest := make([]interface{}, len(cols))
for i := range rawResult {
dest[i] = &rawResult[i]
}
defer rows.Close()
for rows.Next() {
result := make(map[string]string, len(cols))
rows.Scan(dest...)
for i, raw := range rawResult {
if raw == nil {
result[cols[i]] = ""
} else {
result[cols[i]] = string(raw)
}
}
results = append(results, result)
}
s, err := json.Marshal(results)
if err != nil {
log.Context(1).Error(err)
}
rows.Close()
return s, nil
}
Question:
Why does the goroutine solution give me an error? It is not an obvious panic, because the first ~200 goroutines run properly.
UPDATE
Performance test for the original working solution:
INFO[0020] setup taken -> 3.149124658s file=operations.go func=operations.getJSON line=260 service="Database manager" ts="2017-04-02 19:45:27.132881211 +0100 BST"
INFO[0025] toJSON taken -> 5.317647046s file=operations.go func=operations.getJSON line=263 service="Database manager" ts="2017-04-02 19:45:32.450551417 +0100 BST"
The SQL-to-map step takes about 3 seconds and the map-to-JSON step about 5 seconds.
Goroutines won't improve performance on CPU-bound operations like JSON marshaling. What you need is a more efficient JSON marshaler. There are several available, although I haven't used any. A simple Google search for 'faster JSON marshaling' will turn up many results; a popular one is ffjson. I suggest starting there.
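If pulling in a third-party marshaler is not an option, one standard-library-only alternative is to stream the output instead of building one big []byte. The sketch below is based on the question's scanning code and rests on two assumptions: the caller can write to an io.Writer (for example the TCP connection), and newline-delimited JSON objects are acceptable instead of a single JSON array. It also keeps all Next/Scan calls on one goroutine, which avoids calling Scan on the same *sql.Rows from many goroutines while Next advances (the likely source of the panic above). It needs "database/sql", "encoding/json" and "io" among its imports:

// streamJSON scans rows one at a time and encodes each one directly to w,
// so the whole result set is never held in memory as maps.
func streamJSON(rows *sql.Rows, w io.Writer) error {
    defer rows.Close()
    cols, err := rows.Columns()
    if err != nil {
        return err
    }

    enc := json.NewEncoder(w)
    rawResult := make([][]byte, len(cols))
    dest := make([]interface{}, len(cols))
    for i := range rawResult {
        dest[i] = &rawResult[i]
    }

    for rows.Next() {
        if err := rows.Scan(dest...); err != nil {
            return err
        }
        result := make(map[string]string, len(cols))
        for i, raw := range rawResult {
            result[cols[i]] = string(raw)
        }
        if err := enc.Encode(result); err != nil { // one JSON object per line
            return err
        }
    }
    return rows.Err()
}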

Using Golang to read csv, reorder columns then write result to a new csv with Concurrency

Here's my starting point.
It is a Golang script to read in a csv with 3 columns, re-order the columns and write the result to a new csv file.
package main
import (
"fmt"
"encoding/csv"
"io"
"os"
"math/rand"
"time"
)
func main(){
start_time := time.Now()
// Loading csv file
rFile, err := os.Open("data/small.csv") //3 columns
if err != nil {
fmt.Println("Error:", err)
return
}
defer rFile.Close()
// Creating csv reader
reader := csv.NewReader(rFile)
lines, err := reader.ReadAll()
if err == io.EOF {
fmt.Println("Error:", err)
return
}
// Creating csv writer
wFile, err := os.Create("data/result.csv")
if err != nil {
fmt.Println("Error:",err)
return
}
defer wFile.Close()
writer := csv.NewWriter(wFile)
// Read data, randomize columns and write new lines to results.csv
rand.Seed(int64(time.Now().Nanosecond()))
var col_index []int
for i,line :=range lines{
if i == 0 {
//randomize column index based on the number of columns recorded in the 1st line
col_index = rand.Perm(len(line))
}
writer.Write([]string{line[col_index[0]], line[col_index[1]], line[col_index[2]]}) //3 columns
writer.Flush()
}
//print report
fmt.Println("No. of lines: ",len(lines))
fmt.Println("Time taken: ", time.Since(start_time))
}
Question:
Is my code idiomatic for Golang?
How can I add concurrency to this code?
Your code is OK. There is not much of a case for concurrency here. But you can at least reduce memory consumption by reordering on the fly. Just use Read() instead of ReadAll() to avoid allocating a slice for the whole input file.
for line, err := reader.Read(); err == nil; line, err = reader.Read(){
if err = writer.Write([]string{line[col_index[0]], line[col_index[1]], line[col_index[2]]}); err != nil {
fmt.Println("Error:", err)
break
}
writer.Flush()
}
Move the col_index initialisation outside the write loop:
if len(lines) > 0 {
//randomize column index based on the number of columns recorded in the 1st line
col_index := rand.Perm(len(lines[0]))
newLine := make([]string, len(col_index))
for _, line :=range lines[1:] {
for from, to := range col_index {
newLine[to] = line[from]
}
writer.Write(newLine)
writer.Flush()
}
}
To use concurrency, you must not use reader.ReadAll. Instead, start a goroutine that calls reader.Read and writes each line to a channel, which replaces the lines slice. The main goroutine reads from the channel and does the shuffling and the writing, as in the sketch below.
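A minimal sketch of that structure, reusing the file names from the original script (seeding of the permutation is omitted for brevity):

package main

import (
    "encoding/csv"
    "io"
    "log"
    "math/rand"
    "os"
)

func main() {
    rFile, err := os.Open("data/small.csv")
    if err != nil {
        log.Fatal(err)
    }
    defer rFile.Close()

    wFile, err := os.Create("data/result.csv")
    if err != nil {
        log.Fatal(err)
    }
    defer wFile.Close()
    writer := csv.NewWriter(wFile)
    defer writer.Flush()

    lines := make(chan []string, 100) // replaces the in-memory lines slice
    go func() {
        defer close(lines)
        r := csv.NewReader(rFile)
        for {
            line, err := r.Read()
            if err == io.EOF {
                return
            }
            if err != nil {
                log.Fatal(err)
            }
            lines <- line
        }
    }()

    var colIndex []int
    var newLine []string
    for line := range lines {
        if colIndex == nil {
            colIndex = rand.Perm(len(line)) // permutation from the first line's column count
            newLine = make([]string, len(colIndex))
        }
        for from, to := range colIndex {
            newLine[to] = line[from]
        }
        if err := writer.Write(newLine); err != nil {
            log.Fatal(err)
        }
    }
}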

Efficient read and write CSV in Go

The Go code below reads in a 10,000-record CSV (of timestamps and float values), runs some operations on the data, and then writes the original values to another CSV along with an additional column for the score. However, it is terribly slow (i.e. hours, though most of that is calculateStuff()), and I'm curious whether there are any inefficiencies in the CSV reading/writing I can take care of.
package main
import (
"encoding/csv"
"log"
"os"
"strconv"
)
func ReadCSV(filepath string) ([][]string, error) {
csvfile, err := os.Open(filepath)
if err != nil {
return nil, err
}
defer csvfile.Close()
reader := csv.NewReader(csvfile)
fields, err := reader.ReadAll()
return fields, err
}
func main() {
// load data csv
records, err := ReadCSV("./path/to/datafile.csv")
if err != nil {
log.Fatal(err)
}
// write results to a new csv
outfile, err := os.Create("./where/to/write/resultsfile.csv")
if err != nil {
log.Fatal("Unable to open output")
}
defer outfile.Close()
writer := csv.NewWriter(outfile)
for i, record := range records {
time := record[0]
value := record[1]
// skip header row
if i == 0 {
writer.Write([]string{time, value, "score"})
continue
}
// get float values
floatValue, err := strconv.ParseFloat(value, 64)
if err != nil {
log.Fatal("Record: %v, Error: %v", floatValue, err)
}
// calculate scores; THIS EXTERNAL METHOD CANNOT BE CHANGED
score := calculateStuff(floatValue)
valueString := strconv.FormatFloat(floatValue, 'f', 8, 64)
scoreString := strconv.FormatFloat(score, 'f', 8, 64)
//fmt.Printf("Result: %v\n", []string{time, valueString, scoreString})
writer.Write([]string{time, valueString, scoreString})
}
writer.Flush()
}
I'm looking for help making this CSV read/write template code as fast as possible. For the scope of this question we need not worry about the calculateStuff method.
You're loading the whole file into memory first and then processing it; that can be slow with a big file.
You need to loop, call .Read, and process one line at a time.
func processCSV(rc io.Reader) (ch chan []string) {
ch = make(chan []string, 10)
go func() {
r := csv.NewReader(rc)
if _, err := r.Read(); err != nil { //read header
log.Fatal(err)
}
defer close(ch)
for {
rec, err := r.Read()
if err != nil {
if err == io.EOF {
break
}
log.Fatal(err)
}
ch <- rec
}
}()
return
}
playground
//note it's roughly based on DaveC's comment.
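A short usage sketch for the channel returned by processCSV (the path is the same placeholder used elsewhere in this question):

csvIn, err := os.Open("./path/to/datafile.csv")
if err != nil {
    log.Fatal(err)
}
defer csvIn.Close()

for rec := range processCSV(csvIn) {
    _ = rec // parse the value, call calculateStuff, and write the output row here
}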
This is essentially Dave C's answer from the comments section:
package main
import (
"encoding/csv"
"io"
"log"
"os"
"strconv"
)
func main() {
// setup reader
csvIn, err := os.Open("./path/to/datafile.csv")
if err != nil {
log.Fatal(err)
}
r := csv.NewReader(csvIn)
// setup writer
csvOut, err := os.Create("./where/to/write/resultsfile.csv")
if err != nil {
log.Fatal("Unable to open output")
}
w := csv.NewWriter(csvOut)
defer csvOut.Close()
// handle header
rec, err := r.Read()
if err != nil {
log.Fatal(err)
}
rec = append(rec, "score")
if err = w.Write(rec); err != nil {
log.Fatal(err)
}
for {
rec, err = r.Read()
if err != nil {
if err == io.EOF {
break
}
log.Fatal(err)
}
// get float value
value := rec[1]
floatValue, err := strconv.ParseFloat(value, 64)
if err != nil {
log.Fatal("Record, error: %v, %v", value, err)
}
// calculate scores; THIS EXTERNAL METHOD CANNOT BE CHANGED
score := calculateStuff(floatValue)
scoreString := strconv.FormatFloat(score, 'f', 8, 64)
rec = append(rec, scoreString)
if err = w.Write(rec); err != nil {
log.Fatal(err)
}
w.Flush()
}
}
Note of course the logic is all jammed into main(), better would be to split it into several functions, but that's beyond the scope of this question.
encoding/csv is indeed very slow on big files, as it performs a lot of allocations. Since your format is so simple I recommend using strings.Split instead which is much faster.
If even that is not fast enough you can consider implementing the parsing yourself using strings.IndexByte which is implemented in assembly: http://golang.org/src/strings/strings_decl.go?s=274:310#L1
Having said that, you should also reconsider using ReadAll if the file is larger than your memory.
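For illustration, a minimal sketch of the strings.Split approach; it is only valid for simple CSVs like this one, with no quoted fields or embedded commas, and it reuses the placeholder path from above:

package main

import (
    "bufio"
    "fmt"
    "log"
    "os"
    "strings"
)

func main() {
    f, err := os.Open("./path/to/datafile.csv")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    sc := bufio.NewScanner(f) // reads line by line without loading the whole file
    for sc.Scan() {
        fields := strings.Split(sc.Text(), ",")
        if len(fields) < 2 {
            continue // skip blank or short lines
        }
        fmt.Println(fields[0], fields[1]) // timestamp, value
    }
    if err := sc.Err(); err != nil {
        log.Fatal(err)
    }
}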