How to read large CSV files - csv

What is the best way to read large CSV files? At the moment I am reading one record at a time rather than using ReadAll().
reader := csv.NewReader(csvFile)
reader.FieldsPerRecord = -1 // allow a variable number of fields per record
for {
	// read just one record at a time
	record, err := reader.Read()
	if err == io.EOF {
		break
	} else if err != nil {
		checkErr(err)
		return
	}
	// ... send record to the database service ...
}
Is there a better way to do this to save memory?
I am writing each record/row to a database by sending an array over GRPC to a separate service.

Yes, there is one option that can reduce allocations.
You can let the reader reuse the slice it returns on each call to Read.
To do so, set reader.ReuseRecord = true.
But be careful: the returned slice is only valid until the next call to Read, so copy it if you need to keep a record for longer.

Related

Stream data from API

I am trying to pull data on mails coming into an API from an email testing tool mailhog.
If I use a call to get a list of emails e.g
GET /api/v1/messages
I can load this data into a struct with no issues and print out values I need.
However if I use a different endpoint that is essentially a stream of new emails coming in, I have different behavior. When I run my go application I get no output whatsoever.
Do I need to do like a while loop to constantly listen to the endpoint to get the output?
My end goal is to pull some information from emails as they come in and then pass them into a different function.
Here is me trying to access the streaming endpoint
https://github.com/mailhog/MailHog/blob/master/docs/APIv1.md
res, err := http.Get("http://localhost:8084/api/v1/events")
if err != nil {
panic(err.Error())
}
body, err := ioutil.ReadAll(res.Body)
if err != nil {
panic(err.Error())
}
var data Email
json.Unmarshal(body, &data)
fmt.Printf("Email: %v\n", data)
If I do a curl request at the mailhog service with the same endpoint, I do get output as mails come in. However, I can't seem to figure out why I am getting no output via my Go app. The app does stay running; I just don't get any output.
I am new to Go so apologies if this is a really simple question
From ioutil.ReadAll documentation:
ReadAll reads from r until an error or EOF and returns the data it read.
When you use it to read the body of a regular endpoint, it works because the body has a defined end: the server typically uses the Content-Length header to say how many bytes the response body has, and once the client has read that many bytes, it knows it has read the whole body and can stop.
Your "streaming" endpoint doesn't send Content-Length, though, because the body has an unknown size: the server writes events as they come and keeps the connection open, so ReadAll never returns. In this case you usually read line by line, where each line represents an event. bufio.Scanner does exactly that:
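res, err := http.Get("http://localhost:8084/api/v1/events")
if err != nil {
	panic(err.Error())
}
defer res.Body.Close()
scanner := bufio.NewScanner(res.Body)
for scanner.Scan() {
	event := scanner.Bytes()
	var data Email
	if err := json.Unmarshal(event, &data); err != nil {
		continue // skip lines that are not a JSON-encoded Email
	}
	fmt.Printf("Email: %v\n", data)
}
// Scan returns false on EOF or on error; Err reports which one it was.
if err := scanner.Err(); err != nil {
	panic(err.Error())
}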
curl shows the output you expect because it writes the response body to the terminal as the bytes arrive instead of waiting for the body to end. It may be helpful to add the response curl gets to the question.

Approach to send a large JSON payload to a web service

Consider a small Go application that reads a large JSON file (2 GB+), unmarshals the JSON data into a struct, and POSTs the JSON data to a web service endpoint.
The web service receiving the payload changed its functionality, and now has a limit of 25MB per payload. What would be the best approach to overcome this issue using Go? I've thought of the following, however I'm not sure it is the best approach:
Creating a function to split the large JSON file into multiple smaller ones (up to 20MB), and then iterate over the files sending multiple smaller requests.
Similar function to the one being used to currently send the entire JSON payload:
func sendDataToService(data StructData) {
	payload, err := json.Marshal(data)
	if err != nil {
		log.Println("ERROR:", err)
		return
	}
	request, err := http.NewRequest("POST", endpoint, bytes.NewBuffer(payload))
	if err != nil {
		log.Println("ERROR:", err)
		return
	}
	client := &http.Client{}
	response, err := client.Do(request)
	log.Println("INFORMATIONAL:", request)
	if err != nil {
		log.Println("ERROR:", err)
		return // response is nil here, so do not touch response.Body
	}
	defer response.Body.Close()
}
You can break the input into chunks and send each piece individually:
dec := json.NewDecoder(inputStream)
tok, err := dec.Token() // consume the opening '[' of the array
if err != nil {
	return err
}
if tok == json.Delim('[') {
	// More reports whether another element follows, so an empty array works too.
	for dec.More() {
		var obj json.RawMessage
		if err := dec.Decode(&obj); err != nil {
			return err
		}
		// Here, obj contains one element of the array. You can send this
		// to the server.
	}
}
As the server-side can process data progressively, I assume that the large JSON object can be split into smaller pieces. From this point, I can propose several options.
Use HTTP requests
Pros: Pretty simple to implement on the client-side.
Cons: Making hundreds of HTTP requests might be slow. You will also need to handle timeouts - this is additional complexity.
Use WebSocket messages
If the receiving side supports WebSockets, a step-by-step flow will look like this:
Split the input data into smaller pieces.
Connect to the WebSocket server.
Start sending messages with the smaller pieces till the end of the file.
Close connection to the server.
This solution might be more performant as you won't need to connect and disconnect from the server each time you send a message, as you'd do with HTTP.
However, both solutions suppose that you need to assemble all pieces on the server-side. For example, you would probably need to send along with the data a correlation ID to let the server know what file you are sending right now and a specific end-of-file message to let the server know when the file ends. In the case of the WebSocket server, you could assume that the entire file is sent during a single connection session if it is relevant.

Convert CSV contents to Go interface to upload them to a Google Sheet

I am trying to send the contents of a CSV file to a Google Sheet, via their very ill-documented API for Go.
The BatchUpdate takes an interface, so this would work:
values := [][]interface{}{{"one cell", "another cell"}, {"one cell in new row", "another cell in new row"}}
The problem comes when I want to send the contents from a CSV. I have done this:
func prepare(filename string) [][]interface{} {
	file, err := os.Open(filename)
	if err != nil {
		fmt.Println("Error", err)
		return nil
	}
	defer file.Close()
	reader := csv.NewReader(file)
	record, err := reader.ReadAll()
	if err != nil {
		fmt.Println("Error", err)
		return nil
	}
	all := [][]interface{}{}
	for _, value := range record {
		all = append(all, []interface{}{value})
	}
	return all
}
And I believe this should give me the interface ready to be inserted in the sheet. However, when I do this later on:
rb := &sheets.BatchUpdateValuesRequest{
ValueInputOption: "USER_ENTERED",
}
rb.Data = append(rb.Data, &sheets.ValueRange{
Range: rangeData,
Values: values, // This is the interface that I returned earlier on
})
_, err = sheetsService.Spreadsheets.Values.BatchUpdate(spreadsheetId, rb).Context(ctx).Do()
it gives me a googleapi: Error 400: Invalid data[0]: Invalid values[0][0]
So I understand that I am trying to pass the CSV fields in an incorrect format. I know that when I do this in Python, I need to pass it as tuples for the data to be accepted. What's the correct way to pass the data here in Go?
I just found out what my problem was. When appending, I was doing this:
all = append(all, []interface{}{value})
And I should have done this:
all = append(all, []interface{}{value[0], value[1]})
This is because value is a []string in which every item corresponds to a cell, and my CSV specifically had two cells per row. Passing value itself put a whole []string into a single cell, which is the wrong format.

Correct way to import numeric csv data in go

I want to read a file in CSV format containing only numeric values (with decimals) and store it in a matrix so I can perform operations on it. The file looks like this:
1.5, 2.3, 4.4
1.1, 5.3, 2.4
...
It may have thousands of lines and more than 3 columns.
I solved this using the Go csv library. This creates a [][]string, and afterwards I use a for loop to parse the matrix into [][]float64.
func readCSV(filepath string) [][]float64 {
	csvfile, err := os.Open(filepath)
	if err != nil {
		return nil
	}
	defer csvfile.Close()
	reader := csv.NewReader(csvfile)
	reader.TrimLeadingSpace = true // fields in the file have a space after the comma
	stringMatrix, err := reader.ReadAll()
	if err != nil {
		return nil
	}
	matrix := make([][]float64, len(stringMatrix))
	// Parse string matrix into float64
	for i := range stringMatrix {
		matrix[i] = make([]float64, len(stringMatrix[i]))
		for y := range stringMatrix[i] {
			matrix[i][y], err = strconv.ParseFloat(stringMatrix[i][y], 64)
			if err != nil {
				return nil
			}
		}
	}
	return matrix
}
I was wondering if this is a correct and efficient way of doing it or if there is a better way, like using reader.Read() instead and parsing each line while it's being read. I don't know, but it feels like I'm doing a lot of duplicate work.
It all depends on how you want to use the data. Your code isn't efficient in terms of memory because you read the entire CSV content into memory (stringMatrix) and then create another variable to hold the data converted to float64 (matrix). So if your CSV file is 1 GB in size, your program would use at least 1 GB of RAM for stringMatrix plus more for matrix.
You can optimize the code by either:
Reading the reader line by line and appending the data to matrix; you don't need to have the entire stringMatrix in memory at once;
Reading the reader line by line and processing that data line by line. Maybe you don't need to have matrix in memory as well, maybe you can process the data as you read it and never have everything in memory at once. It depends on the rest of your program, on how it needs to use the CSV data.
With the second method, your program can use a small, roughly constant amount of RAM instead of gigabytes, provided you don't need to return the entire CSV data from that function.

Wrapping json member fields to object

My objective is to add fields to json on user request.
Everything is great, but when displaying the fields with
fmt.Printf("%s: %s\n", content.Date, content.Description)
an error occurs:
invalid character '{' after top-level value
And that is because after adding new fields the file looks like this:
{"Date":"2017-03-20 10:46:48","Description":"new"}
{"Date":"2017-03-20 10:46:51","Description":"new .go"}
The biggest problem is with writing to the file:
reminder := &Name{dateString[:19], text} // text - input string
newReminder, _ := json.Marshal(reminder) // reminder is already a pointer
I don't really know how to do this properly.
My question is how should I wrap all member fields into one object?
And what is the best way to iterate through member fields?
The code is available here: https://play.golang.org/p/NunV_B6sud
You should store the reminders in an array inside the JSON file, as mentioned by @Gerben Jacobs. Then, every time you want to add a new reminder, read the full contents of rem.json, append the new reminder in Go, truncate the file, and write the new slice back to the file. Here's a quick implementation: https://play.golang.org/p/UKR91maQF2.
If you have lots of reminders and the process of reading, decoding, encoding, and writing the whole content becomes a pain, you could open the file, truncate only the final ] from the file contents, and then write only , + new reminder + ].
After some research, people in the go-nuts group helped me and suggested that I use a streaming JSON parser that decodes items individually.
So I needed to change my reminder listing function:
func listReminders() error {
	f, err := os.Open("rem.json")
	if err != nil {
		return err
	}
	defer f.Close()
	dec := json.NewDecoder(f)
	for {
		var content Name
		// Decode reads the next JSON value from the stream.
		switch err := dec.Decode(&content); err {
		case nil:
			fmt.Printf("%#v\n", content)
		case io.EOF:
			return nil
		default:
			return err
		}
	}
}
Now everything works the way I wanted.