Efficient read and write CSV in Go

The Go code below reads in a 10,000-record CSV (of timestamps and float values), runs some operations on the data, and then writes the original values to another CSV along with an additional column for score. However, it is terribly slow (hours, though most of that is calculateStuff()), and I'm curious whether there are any inefficiencies in the CSV reading/writing I can take care of.
package main

import (
    "encoding/csv"
    "log"
    "os"
    "strconv"
)

func ReadCSV(filepath string) ([][]string, error) {
    csvfile, err := os.Open(filepath)
    if err != nil {
        return nil, err
    }
    defer csvfile.Close()
    reader := csv.NewReader(csvfile)
    fields, err := reader.ReadAll()
    return fields, err
}

func main() {
    // load data csv
    records, err := ReadCSV("./path/to/datafile.csv")
    if err != nil {
        log.Fatal(err)
    }
    // write results to a new csv
    outfile, err := os.Create("./where/to/write/resultsfile.csv")
    if err != nil {
        log.Fatal("Unable to open output")
    }
    defer outfile.Close()
    writer := csv.NewWriter(outfile)

    for i, record := range records {
        time := record[0]
        value := record[1]
        // skip header row
        if i == 0 {
            writer.Write([]string{time, value, "score"})
            continue
        }
        // get float values
        floatValue, err := strconv.ParseFloat(value, 64)
        if err != nil {
            log.Fatalf("Record: %v, Error: %v", value, err)
        }
        // calculate scores; THIS EXTERNAL METHOD CANNOT BE CHANGED
        score := calculateStuff(floatValue)
        valueString := strconv.FormatFloat(floatValue, 'f', 8, 64)
        scoreString := strconv.FormatFloat(score, 'f', 8, 64)
        //fmt.Printf("Result: %v\n", []string{time, valueString, scoreString})
        writer.Write([]string{time, valueString, scoreString})
    }
    writer.Flush()
}
I'm looking for help making this CSV read/write template code as fast as possible. For the scope of this question we need not worry about the calculateStuff method.

You're loading the whole file into memory first and then processing it; that can be slow with a big file.
Instead, loop and call .Read, processing one record at a time:
// requires the "encoding/csv", "io", and "log" imports
func processCSV(rc io.Reader) (ch chan []string) {
    ch = make(chan []string, 10)
    go func() {
        r := csv.NewReader(rc)
        if _, err := r.Read(); err != nil { // read and discard the header
            log.Fatal(err)
        }
        defer close(ch)
        for {
            rec, err := r.Read()
            if err != nil {
                if err == io.EOF {
                    break
                }
                log.Fatal(err)
            }
            ch <- rec
        }
    }()
    return ch
}
Note: this is roughly based on Dave C's comment.
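For completeness, here is a hedged sketch of how the channel from processCSV might be consumed (imports omitted; the paths and calculateStuff are carried over from the question, and the header line written here is an assumption, since processCSV discards the file's own header):

func main() {
    csvIn, err := os.Open("./path/to/datafile.csv")
    if err != nil {
        log.Fatal(err)
    }
    defer csvIn.Close()

    csvOut, err := os.Create("./where/to/write/resultsfile.csv")
    if err != nil {
        log.Fatal(err)
    }
    defer csvOut.Close()

    w := csv.NewWriter(csvOut)
    // processCSV drops the original header, so write our own;
    // these column names are assumptions, adjust to your file.
    w.Write([]string{"time", "value", "score"})

    for rec := range processCSV(csvIn) {
        floatValue, err := strconv.ParseFloat(rec[1], 64)
        if err != nil {
            log.Fatalf("Record: %v, Error: %v", rec, err)
        }
        score := calculateStuff(floatValue) // the external method from the question
        w.Write(append(rec, strconv.FormatFloat(score, 'f', 8, 64)))
    }
    w.Flush()
    if err := w.Error(); err != nil {
        log.Fatal(err)
    }
}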

This is essentially Dave C's answer from the comments section:
package main

import (
    "encoding/csv"
    "io"
    "log"
    "os"
    "strconv"
)

func main() {
    // setup reader
    csvIn, err := os.Open("./path/to/datafile.csv")
    if err != nil {
        log.Fatal(err)
    }
    r := csv.NewReader(csvIn)

    // setup writer
    csvOut, err := os.Create("./where/to/write/resultsfile.csv")
    if err != nil {
        log.Fatal("Unable to open output")
    }
    w := csv.NewWriter(csvOut)
    defer csvOut.Close()

    // handle header
    rec, err := r.Read()
    if err != nil {
        log.Fatal(err)
    }
    rec = append(rec, "score")
    if err = w.Write(rec); err != nil {
        log.Fatal(err)
    }

    for {
        rec, err = r.Read()
        if err != nil {
            if err == io.EOF {
                break
            }
            log.Fatal(err)
        }
        // get float value
        value := rec[1]
        floatValue, err := strconv.ParseFloat(value, 64)
        if err != nil {
            log.Fatalf("Record, error: %v, %v", value, err)
        }
        // calculate scores; THIS EXTERNAL METHOD CANNOT BE CHANGED
        score := calculateStuff(floatValue)
        scoreString := strconv.FormatFloat(score, 'f', 8, 64)
        rec = append(rec, scoreString)
        if err = w.Write(rec); err != nil {
            log.Fatal(err)
        }
    }
    // flush once after the loop rather than per record
    w.Flush()
    if err := w.Error(); err != nil {
        log.Fatal(err)
    }
}
Note of course that the logic is all jammed into main(); it would be better to split it into several functions, but that's beyond the scope of this question.
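Purely as an illustration (not part of the original answer), the per-record work could be pulled out into a helper; processRecord is a hypothetical name:

// processRecord parses the value column, scores it, and returns the record
// with the score column appended. (Hypothetical helper, for illustration.)
func processRecord(rec []string) ([]string, error) {
    floatValue, err := strconv.ParseFloat(rec[1], 64)
    if err != nil {
        return nil, err
    }
    score := calculateStuff(floatValue)
    return append(rec, strconv.FormatFloat(score, 'f', 8, 64)), nil
}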

encoding/csv is indeed very slow on big files, as it performs a lot of allocations. Since your format is so simple, I recommend using strings.Split instead, which is much faster.
If even that is not fast enough, you can consider implementing the parsing yourself using strings.IndexByte, which is implemented in assembly: http://golang.org/src/strings/strings_decl.go?s=274:310#L1
Having said that, you should also reconsider using ReadAll if the file is larger than your memory.
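As a rough sketch of the strings.Split approach (my sketch, assuming no quoted fields or embedded commas, which encoding/csv would otherwise handle for you):

package main

import (
    "bufio"
    "fmt"
    "log"
    "os"
    "strings"
)

func main() {
    f, err := os.Open("./path/to/datafile.csv")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        // Naive split: correct only if no field contains a quoted comma.
        fields := strings.Split(scanner.Text(), ",")
        fmt.Println(fields)
    }
    if err := scanner.Err(); err != nil {
        log.Fatal(err)
    }
}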

Related

Compare CSV and conclusion of differences

I started writing a program to compare two CSV files. After reading the documentation, I found reflect.DeepEqual, but I can't figure out how to print the differences from the second file, since that function only returns true/false:
package main

import (
    "encoding/csv"
    "fmt"
    "os"
    "reflect"
)

func main() {
    file, err := os.Open("sms_in_max.csv")
    if err != nil {
        fmt.Println(err)
    }
    reader := csv.NewReader(file)
    records, _ := reader.ReadAll()
    fmt.Println(records)

    file2, err := os.Open("sms_out.csv")
    if err != nil {
        fmt.Println(err)
    }
    reader2 := csv.NewReader(file2)
    records2, _ := reader2.ReadAll()
    fmt.Println(records2)

    allrs := reflect.DeepEqual(records, records2)
    fmt.Println(allrs)
}
The csv ReadAll() function returns a slice of rows, where each row is a slice of columns.
We can loop over the rows and, within a row, loop over the columns and compare each column value.
Here is the code that prints all lines that have differences, along with their line numbers.
package main

import (
    "encoding/csv"
    "fmt"
    "os"
)

func main() {
    file, err := os.Open("sms_in_max.csv")
    if err != nil {
        fmt.Println(err)
    }
    reader := csv.NewReader(file)
    records, _ := reader.ReadAll()
    fmt.Println(records)

    file2, err := os.Open("sms_out.csv")
    if err != nil {
        fmt.Println(err)
    }
    reader2 := csv.NewReader(file2)
    records2, _ := reader2.ReadAll()
    fmt.Println(records2)

    // allrs := reflect.DeepEqual(records, records2)
    // fmt.Println(allrs)

    // Prints lines at which there is a difference.
    // Note: this assumes both files have the same number of rows and columns.
    for i := range records {
        diff := false
        for j := range records[i] {
            if records[i][j] != records2[i][j] {
                diff = true
                break
            }
        }
        if diff {
            fmt.Printf("Line %d: %v, %v\n", i+1, records[i], records2[i])
        }
    }
}
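One caveat, as an aside: the loop above will panic with an index-out-of-range error if the second file has fewer rows or shorter rows. If the files can differ in shape, a length guard along these lines (my addition, not part of the original answer) avoids that:

if len(records) != len(records2) {
    fmt.Printf("Files differ in length: %d vs %d rows\n", len(records), len(records2))
}
for i := range records {
    if i >= len(records2) {
        fmt.Printf("Line %d only in first file: %v\n", i+1, records[i])
        continue
    }
    // ... compare columns as above, also checking len(records[i])
    // against len(records2[i]) before indexing.
}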

Converting csv data ([]string) to float64 with strconv then summing the data

I am trying to sum data from a CSV file I created from a previous function. Here is a snippet of the file:
datetime,open,high,low,close,volume
2020-11-09 00:00,69.58,137.45,69.00,100.00,273517274.00
2020-11-10 00:00,104.65,128.80,101.75,107.00,141284399.00
2020-11-11 00:00,109.00,114.45,96.76,98.42,96648953.00
2020-11-12 00:00,95.98,106.60,89.15,90.00,149794913.00
[For context: this is historical price data for Rolls-Royce Holdings stock price from Yahoo finance. I plan to use up to 200 rows].
The problem I am facing is converting the []string data from the CSV to float64. The ParseFloat() function is trying to convert my headings and obviously can't, hence the 'invalid syntax'. Here is the error output:
Error converting string: strconv.ParseFloat: parsing "open": invalid syntaxError converting string: strconv.ParseFloat: parsing "high": invalid syntaxError converting string: strconv.ParseFloat: parsing "low": invalid syntaxError converting string: strconv.ParseFloat: parsing "close": invalid syntaxError converting string: strconv.ParseFloat: parsing "volume": invalid syntax&{ 0 0 0 0 0}
My code is below for reference:
package main

import (
    "encoding/csv"
    "fmt"
    "log"
    "os"
    "strconv"
)

const file = "./rr.csv"

// Data struct is the data from the csv file
type Data struct {
    datetime string
    open     float64
    high     float64
    low      float64
    close    float64
    volume   float64
}

func readAmounts(r []string) (d *Data, err error) {
    d = new(Data)
    open := r[1]
    d.open, err = strconv.ParseFloat(open, 64)
    if err != nil {
        fmt.Printf("Error converting string: %v", err)
    }
    high := r[2]
    d.high, err = strconv.ParseFloat(high, 64)
    if err != nil {
        fmt.Printf("Error converting string: %v", err)
    }
    low := r[3]
    d.low, err = strconv.ParseFloat(low, 64)
    if err != nil {
        fmt.Printf("Error converting string: %v", err)
    }
    close := r[4]
    d.close, err = strconv.ParseFloat(close, 64)
    if err != nil {
        fmt.Printf("Error converting string: %v", err)
    }
    volume := r[5]
    d.volume, err = strconv.ParseFloat(volume, 64)
    if err != nil {
        fmt.Printf("Error converting string: %v", err)
    }
    return d, nil
}

func main() {
    csvFile, err := os.Open(file)
    if err != nil {
        log.Fatal(err)
    }
    r := csv.NewReader(csvFile)
    lines, err := r.Read() // note: Read returns only one record (here, the header)
    if err != nil {
        log.Fatal(err)
    }
    data, err := readAmounts(lines)
    if err != nil {
        fmt.Printf("Error reading file: %v", err)
    }
    fmt.Println(data)
}
I am just printing the data to see if the ParseFloat() works and then I'll create a function to sum the columns.
So, what I'm asking is: how do I ignore the header line and read through only the numerical lines?
FYI: I've read other answers (e.g. FieldsPerRecord), but they haven't worked for my specific problem, as I am trying to sum the whole columns afterwards.
You can load the whole file into memory and then process it row by row:
package main

import (
    "bytes"
    "encoding/csv"
    "fmt"
    "io/ioutil"
    "strconv"
)

const file = "./data.csv"

// Data struct is the data from the csv file
type Data struct {
    datetime string
    open     float64
    high     float64
    low      float64
    close    float64
    volume   float64
}

func main() {
    f, err := ioutil.ReadFile(file)
    if err != nil {
        panic(err)
    }
    rawData, err := readCsv(f, ',')
    if err != nil {
        panic(err)
    }
    // Slicing with [1:] skips the header row.
    amounts, err := readAmounts(rawData[1:])
    if err != nil {
        panic(err)
    }
    fmt.Printf("%+v\n", amounts)
}

func readAmounts(r [][]string) ([]Data, error) {
    var d []Data = make([]Data, len(r))
    var err error
    for i, row := range r {
        d[i].datetime = row[0]
        d[i].open, err = strconv.ParseFloat(row[1], 64)
        if err != nil {
            fmt.Printf("Error converting string: %v", err)
        }
        d[i].high, err = strconv.ParseFloat(row[2], 64)
        if err != nil {
            fmt.Printf("Error converting string: %v", err)
        }
        d[i].low, err = strconv.ParseFloat(row[3], 64)
        if err != nil {
            fmt.Printf("Error converting string: %v", err)
        }
        d[i].close, err = strconv.ParseFloat(row[4], 64)
        if err != nil {
            fmt.Printf("Error converting string: %v", err)
        }
        d[i].volume, err = strconv.ParseFloat(row[5], 64)
        if err != nil {
            fmt.Printf("Error converting string: %v", err)
        }
    }
    return d, nil
}

func readCsv(data []byte, separator rune) ([][]string, error) {
    csvReader := csv.NewReader(bytes.NewReader(data))
    csvReader.Comma = separator
    lines, err := csvReader.ReadAll()
    if err != nil {
        return nil, err
    }
    return lines, nil
}
Example of output:
[{datetime:2020-11-09 00:00 open:69.58 high:137.45 low:69 close:100 volume:2.73517274e+08} {datetime:2020-11-10 00:00 open:104.65 high:128.8 low:101.75 close:107 volume:1.41284399e+08} {datetime:2020-11-11 00:00 open:109 high:114.45 low:96.76 close:98.42 volume:9.6648953e+07} {datetime:2020-11-12 00:00 open:95.98 high:106.6 low:89.15 close:90 volume:1.49794913e+08}]
NOTE: for more example code working with the CSV library, see the following repository: https://github.com/alessiosavi/GoSFTPtoS3
I've commented the program so that it's easy to understand. The basic idea is to ignore the header. Also, since you're indexing into the fields of the record, it's better to put a check on the number of fields present in the record (FieldsPerRecord).
package main

import (
    "encoding/csv"
    "errors"
    "fmt"
    "io"
    "log"
    "os"
    "strconv"
)

// file stores the filepath
const file = "./rr.csv"

// Data stores metadata
type Data struct {
    datetime string
    open     float64
    high     float64
    low      float64
    close    float64
    volume   float64
}

// s2f converts string to float64
func s2f(str string) (float64, error) {
    f, err := strconv.ParseFloat(str, 64)
    if err != nil {
        return 0, fmt.Errorf("error converting string %q to float: %v", str, err)
    }
    return f, nil
}

// ReadAmounts processes the fields from the record and stores them in Data
func ReadAmounts(r []string) (*Data, error) {
    var (
        dt     = r[0]
        open   = r[1]
        high   = r[2]
        low    = r[3]
        close  = r[4]
        volume = r[5]
        d      = new(Data)
        err    error
    )
    d.datetime = dt
    d.open, err = s2f(open)
    if err != nil {
        return nil, err
    }
    d.high, err = s2f(high)
    if err != nil {
        return nil, err
    }
    d.low, err = s2f(low)
    if err != nil {
        return nil, err
    }
    d.close, err = s2f(close)
    if err != nil {
        return nil, err
    }
    d.volume, err = s2f(volume)
    if err != nil {
        return nil, err
    }
    return d, nil
}

func main() {
    // Open the file
    file, err := os.Open(file)
    if err != nil {
        log.Fatalln(err)
    }
    // CSV Reader
    r := csv.NewReader(file)
    // Set options for the reader
    {
        r.Comma = ','             // Delimiter
        r.TrimLeadingSpace = true // Trim the leading spaces
        r.FieldsPerRecord = 0     // Rows should have same number of columns as header
        r.ReuseRecord = true      // Reuse the same backing array (efficient)
    }
    // Alternatively, r.ReadAll() could also be used, and slicing with [1:]
    // ignores the header as well.
    // Ignore header
    _, _ = r.Read()
    for {
        // Read records one by one
        record, err := r.Read()
        if err != nil {
            // Exit out. Done!
            if errors.Is(err, io.EOF) {
                break
            }
            // Log and continue
            log.Printf("Error reading record: %v\n", err)
            continue
        }
        // Process
        data, err := ReadAmounts(record)
        if err != nil {
            // Log and continue
            fmt.Printf("Error reading record: %v\n", err)
            continue
        }
        // Print the filled Data struct
        fmt.Printf("Record: %+v\n", *data)
    }
}
Output:
Record: {datetime:2020-11-09 00:00 open:69.58 high:137.45 low:69 close:100 volume:2.73517274e+08}
Record: {datetime:2020-11-10 00:00 open:104.65 high:128.8 low:101.75 close:107 volume:1.41284399e+08}
Record: {datetime:2020-11-11 00:00 open:109 high:114.45 low:96.76 close:98.42 volume:9.6648953e+07}
Record: {datetime:2020-11-12 00:00 open:95.98 high:106.6 low:89.15 close:90 volume:1.49794913e+08}
Some different options:
Skip parsing the first line. This assumes every file starts with a header.
Skip lines that have parsing errors. The easiest method, but hard to debug when things go wrong.
If the first line has parsing errors, skip it, because it is probably a header row.
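A minimal sketch of that third option, reusing the reader r and the ReadAmounts helper from the answer above (my sketch, not from the original answers):

first, err := r.Read()
if err != nil {
    log.Fatalln(err)
}
if d, err := ReadAmounts(first); err == nil {
    // The first line parsed cleanly, so it was data rather than a header.
    fmt.Printf("Record: %+v\n", *d)
}
// If parsing failed, assume the first line was a header and simply fall
// through to the main read loop.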
On a side note, you should handle errors properly in your code, which you are not currently doing.

How to compare JSON with varying order?

I'm attempting to implement testing with golden files; however, the JSON my function generates varies in order but maintains the same values. I've implemented the comparison method used here:
How to compare two JSON requests?
But it's order dependent. And as stated here by brad:
JSON objects are unordered, just like Go maps. If
you're depending on the order that a specific implementation serializes your JSON
objects in, you have a bug.
I've written some sample code that simulated my predicament:
package main

import (
    "bufio"
    "encoding/json"
    "fmt"
    "io/ioutil"
    "math/rand"
    "os"
    "reflect"
    "time"
)

type example struct {
    Name     string
    Earnings float64
}

func main() {
    slice := GetSlice()
    gfile, err := ioutil.ReadFile("testdata/example.golden")
    if err != nil {
        fmt.Println(err)
        fmt.Println("Failed reading golden file")
    }
    testJSON, err := json.Marshal(slice)
    if err != nil {
        fmt.Println(err)
        fmt.Println("Error marshalling slice")
    }
    equal, err := JSONBytesEqual(gfile, testJSON)
    if err != nil {
        fmt.Println(err)
        fmt.Println("Error comparing JSON")
    }
    if !equal {
        fmt.Println("Results don't match JSON")
    } else {
        fmt.Println("Success!")
    }
}

func GetSlice() []example {
    t := []example{
        {"Penny", 50.0},
        {"Sheldon", 70.0},
        {"Raj", 20.0},
        {"Bernadette", 200.0},
        {"Amy", 250.0},
        {"Howard", 1.0},
    }
    rand.Seed(time.Now().UnixNano())
    rand.Shuffle(len(t), func(i, j int) { t[i], t[j] = t[j], t[i] })
    return t
}

func JSONBytesEqual(a, b []byte) (bool, error) {
    var j, j2 interface{}
    if err := json.Unmarshal(a, &j); err != nil {
        return false, err
    }
    if err := json.Unmarshal(b, &j2); err != nil {
        return false, err
    }
    return reflect.DeepEqual(j2, j), nil
}

func WriteTestSliceToFile(arr []example, filename string) {
    file, err := os.OpenFile(filename, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
    if err != nil {
        fmt.Printf("failed creating file: %s", err)
    }
    datawriter := bufio.NewWriter(file)
    marshalledStruct, err := json.Marshal(arr)
    if err != nil {
        fmt.Println("Error marshalling json")
        fmt.Println(err)
    }
    _, err = datawriter.Write(marshalledStruct)
    if err != nil {
        fmt.Println("Error writing to file")
        fmt.Println(err)
    }
    datawriter.Flush()
    file.Close()
}
JSON arrays are ordered. The json.Marshal function preserves order when encoding a slice to a JSON array.
JSON objects are not ordered. The json.Marshal function writes object members in sorted key order as described in the documentation.
The bradfitz comment about JSON object ordering is not relevant to this question:
The application in the question is working with a JSON array, not a JSON object.
The package was updated to write object fields in sorted key order a couple of years after Brad's comment.
To compare slices while ignoring order, sort the two slices before comparing. This can be done before encoding to JSON or after decoding from JSON.
sort.Slice(slice, func(i, j int) bool {
    if slice[i].Name != slice[j].Name {
        return slice[i].Name < slice[j].Name
    }
    return slice[i].Earnings < slice[j].Earnings
})
For unit testing, you could use assert.JSONEq from Testify. If you need to do it programmatically, you could follow the code of the JSONEq function:
https://github.com/stretchr/testify/blob/master/assert/assertions.go#L1551
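For instance, a minimal test might look like this (assuming the "testing" and "github.com/stretchr/testify/assert" imports; note that JSONEq ignores key order inside objects but still treats arrays as ordered, so the sorting above is still needed for shuffled slices):

func TestGolden(t *testing.T) {
    expected := `[{"Name":"Amy","Earnings":250},{"Name":"Penny","Earnings":50}]`
    actual := `[{"Earnings":250,"Name":"Amy"},{"Earnings":50,"Name":"Penny"}]`
    assert.JSONEq(t, expected, actual) // passes: key order within objects is irrelevant
}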

Changing the last character of a file

I want to continuously write JSON objects to a file. To be able to read it back, I need to wrap them into an array. I don't want to read the whole file just to append, so what I'm doing now is:
comma := []byte(", ")

file, err := os.OpenFile(erp.TransactionsPath, os.O_WRONLY|os.O_APPEND|os.O_CREATE, 0666)
if err != nil {
    return err
}

transaction, err := json.Marshal(t)
if err != nil {
    return err
}

transaction = append(transaction, comma...)
file.Write(transaction)
But with this implementation I will need to add the enclosing [] brackets by hand (or via some script) before reading. How can I add an object before the closing bracket on each write?
You don't need to wrap the JSON objects in an array; you can just write them as-is. You may use json.Encoder to write them to the file, and json.Decoder to read them back. Encoder.Encode() and Decoder.Decode() encode and decode individual JSON values from a stream.
To prove it works, see this simple example:
const src = `{"id":"1"}{"id":"2"}{"id":"3"}`

dec := json.NewDecoder(strings.NewReader(src))
for {
    var m map[string]interface{}
    if err := dec.Decode(&m); err != nil {
        if err == io.EOF {
            break
        }
        panic(err)
    }
    fmt.Println("Read:", m)
}
It outputs (try it on the Go Playground):
Read: map[id:1]
Read: map[id:2]
Read: map[id:3]
When writing to / reading from a file, pass the os.File to json.NewEncoder() and json.NewDecoder().
Here's a complete demo which creates a temporary file, uses json.Encoder to write JSON objects into it, then reads them back with json.Decoder:
objs := []map[string]interface{}{
    {"id": "1"},
    {"id": "2"},
    {"id": "3"},
}

file, err := ioutil.TempFile("", "test.json")
if err != nil {
    panic(err)
}

// Writing to file:
enc := json.NewEncoder(file)
for _, obj := range objs {
    if err := enc.Encode(obj); err != nil {
        panic(err)
    }
}

// Debug: print file's content
fmt.Println("File content:")
if data, err := ioutil.ReadFile(file.Name()); err != nil {
    panic(err)
} else {
    fmt.Println(string(data))
}

// Reading from file:
if _, err := file.Seek(0, io.SeekStart); err != nil {
    panic(err)
}
dec := json.NewDecoder(file)
for {
    var obj map[string]interface{}
    if err := dec.Decode(&obj); err != nil {
        if err == io.EOF {
            break
        }
        panic(err)
    }
    fmt.Println("Read:", obj)
}
It outputs (try it on the Go Playground):
File content:
{"id":"1"}
{"id":"2"}
{"id":"3"}
Read: map[id:1]
Read: map[id:2]
Read: map[id:3]
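Applied back to the original question, a hedged sketch (erp.TransactionsPath and t are taken from the question's snippet): open the file with O_APPEND and encode one object per write. Each Encode call writes a single newline-terminated JSON value, so no brackets or commas are needed.

file, err := os.OpenFile(erp.TransactionsPath, os.O_WRONLY|os.O_APPEND|os.O_CREATE, 0666)
if err != nil {
    return err
}
defer file.Close()

// Encode appends t as one JSON value followed by a newline.
if err := json.NewEncoder(file).Encode(t); err != nil {
    return err
}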

Go JSON with simplejson

Trying to use the JSON lib from "github.com/bitly/go-simplejson"
url = "http://api.stackoverflow.com/1.1/tags?pagesize=100&page=1"
res, err := http.Get(url)
body, err := ioutil.ReadAll(res.Body)
fmt.Printf("%s\n", string(body)) //WORKS
js, err := simplejson.NewJson(body)
total,_ := js.Get("total").String()
fmt.Printf("Total:%s"+total )
But it seems it doesn't work!?
I'm trying to access the total and tag fields.
You have a few mistakes:
If you check the JSON response, you'll notice that the total field is not a string; that's why you should use the MustInt() method, not String(), when accessing the field.
The Printf() method invocation was wrong: you should pass a format string first, then arguments matching its placeholders.
By the way, I strongly recommend checking err != nil everywhere; that'll help you a lot.
Here is the working example:
package main

import (
    "fmt"
    "io/ioutil"
    "log"
    "net/http"

    "github.com/bitly/go-simplejson"
)

func main() {
    url := "http://api.stackoverflow.com/1.1/tags?pagesize=100&page=1"
    res, err := http.Get(url)
    if err != nil {
        log.Fatalln(err)
    }
    defer res.Body.Close()
    body, err := ioutil.ReadAll(res.Body)
    if err != nil {
        log.Fatalln(err)
    }
    // fmt.Printf("%s\n", string(body))
    js, err := simplejson.NewJson(body)
    if err != nil {
        log.Fatalln(err)
    }
    total := js.Get("total").MustInt()
    fmt.Printf("Total: %d\n", total)
}