CSV parser in Go breaks due to trailing space

We are trying to parse a CSV file using Go's encoding/csv package. This particular CSV is a bit peculiar: each row has a trailing space. When decoding rows with quoted fields, the package fails because after a closing quote it expects a newline, separator, or quote; the trailing space is not expected.
How would you handle this case? Do you know of another parser that we could use?
Edit:
f, err := os.Open("file.csv")
// err etc..
csvr := csv.NewReader(f)
csvr.Comma = csvDelimiter
for {
    rowAsSlice, err := csvr.Read()
    // Handle row and errors etc.
}
Edit 2:
CSV example, mind the trailing space!
"RECORD_TYPE","COMPANY_SHORTNAME"
"HDR","COMPANY_EXAMPLE"

One possible solution is to wrap the source file reader in a custom reader whose Read(...) method silently trims trailing whitespace from what the underlying reader actually reads. The csv.Reader could use that type directly.
For example (Go Playground):
package main

import (
    "encoding/csv"
    "fmt"
    "io"
    "os"
    "regexp"
)

// TrimReader wraps an io.Reader and strips trailing whitespace
// from each line that passes through it.
type TrimReader struct{ io.Reader }

var trailingws = regexp.MustCompile(` +\r?\n`)

func (tr TrimReader) Read(bs []byte) (int, error) {
    // Perform the requested read on the underlying reader.
    n, err := tr.Reader.Read(bs)
    if err != nil {
        return n, err
    }
    // Remove trailing whitespace from each line.
    lines := string(bs[:n])
    trimmed := []byte(trailingws.ReplaceAllString(lines, "\n"))
    copy(bs, trimmed)
    return len(trimmed), nil
}

func main() {
    file, err := os.Open("myfile.csv")
    if err != nil {
        panic(err) // TODO: handle err properly
    }
    csvr := csv.NewReader(TrimReader{file})
    for {
        record, err := csvr.Read()
        if err == io.EOF {
            break
        }
        fmt.Printf("LINE: record=%#v, err=%v\n", record, err)
    }
    // LINE: record=[]string{"RECORD_TYPE", "COMPANY_SHORTNAME"}, err=<nil>
    // LINE: record=[]string{"HDR", "COMPANY_EXAMPLE"}, err=<nil>
}
Note that, as commenter #svsd points out, there is a subtle bug here: trailing whitespace can still make it through if the line terminator isn't read until a subsequent call. You can work around this by buffering or, perhaps best, simply preprocess these CSV files to remove the trailing whitespace before attempting to parse them.
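If preprocessing is an option, a line-buffered variant sidesteps that bug entirely, because each line is only trimmed after its terminator has been read. Below is a minimal sketch using bufio.Scanner; the file name and the in-memory buffer are illustrative assumptions:

package main

import (
    "bufio"
    "bytes"
    "encoding/csv"
    "fmt"
    "log"
    "os"
    "strings"
)

func main() {
    f, err := os.Open("file.csv") // hypothetical input file
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // Trim each complete line, so trailing whitespace can never be
    // split across two Read calls.
    var buf bytes.Buffer
    sc := bufio.NewScanner(f)
    for sc.Scan() {
        buf.WriteString(strings.TrimRight(sc.Text(), " \t") + "\n")
    }
    if err := sc.Err(); err != nil {
        log.Fatal(err)
    }

    records, err := csv.NewReader(&buf).ReadAll()
    if err != nil {
        log.Fatal(err)
    }
    for _, rec := range records {
        fmt.Printf("%#v\n", rec)
    }
}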

Related

In golang json.Unmarshal() works in playground/copy pasted JSON but not in actual code

I am writing a program in Go that interfaces with a modified version of the barefoot map-matching library, which returns results in JSON via netcat.
In my actual code, json.Unmarshal only parses the response into the zero value of the struct. But if I print the JSON to the console (see code snippet below) and copy-paste it into the Go Playground, it behaves as expected.
I am wondering if this is an encoding issue that is bypassed when I copy-paste from the console.
How do I get my code to process the string as it is received from barefoot, the same way as when it is copy-pasted from the console?
Here is the relevant code snippet (the structs are identical to the playground example):
body := io_func(conn, cmd)
var obvs []Json_out
json.Unmarshal([]byte(body), &obvs)
fmt.Println(body)
fmt.Println(obvs)
and io_func() if relevant (the response is two lines, with a message on the first and a json string on the second)
func io_func(conn net.Conn, cmd string) string {
    fmt.Fprintf(conn, cmd+"\n")
    r := bufio.NewReader(conn)
    header, _ := r.ReadString('\n')
    if header == "SUCCESS\n" {
        resp, _ := r.ReadString('\n')
        return resp
    }
    return ""
}
Following Cerise Limón's advice to properly handle error messages, I determined that the osm_id value in the JSON was being parsed by json.Unmarshal as a number when the string came from io_func(), although it wasn't doing so when the string was passed in manually in the playground example. Although I don't understand why this is so, proper error handling would have caught it.
I altered the barefoot code to return the osm_id explicitly in inverted commas since, although it is only ever composed of digits, I only use it as a string. It now works as expected. Equally, I could have changed the type in the struct and converted in Go as needed.
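On that last point, json.Number is one way to accept an unquoted numeric osm_id while still treating it as a string. A minimal sketch; the struct and field names are assumptions, since the original Json_out definition isn't shown:

package main

import (
    "encoding/json"
    "fmt"
    "log"
)

// Observation is a hypothetical stand-in for the real Json_out struct.
type Observation struct {
    // json.Number accepts a JSON number and exposes it as a string.
    OsmID json.Number `json:"osm_id"`
}

func main() {
    body := `[{"osm_id": 123456789}]` // unquoted number, as barefoot sent it
    var obvs []Observation
    if err := json.Unmarshal([]byte(body), &obvs); err != nil {
        log.Fatal(err)
    }
    fmt.Println(obvs[0].OsmID.String()) // prints 123456789
}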
The io_func function creates and discards a bufio.Reader, along with any data the reader may have buffered. If the application calls io_func more than once, it may therefore discard data already read from the network. Fix this by creating a single bufio.Reader outside the function and passing that single reader to each invocation of io_func.
Always check and handle errors. The error returned from any of these functions may point you in the right direction for a fix.
func io_func(r *bufio.Reader, conn net.Conn, cmd string) (string, error) {
    fmt.Fprintf(conn, cmd+"\n")
    header, err := r.ReadString('\n')
    if err != nil {
        return "", err
    }
    if header == "SUCCESS\n" {
        return r.ReadString('\n')
    }
    return "", nil
}
...
r := bufio.NewReader(conn)
body, err := io_func(r, conn, cmd)
if err != nil {
    // handle error
}
var obvs []Json_out
err = json.Unmarshal([]byte(body), &obvs)
if err != nil {
    // handle error
}
fmt.Println(body)
fmt.Println(obvs)

// read next
body, err = io_func(r, conn, cmd)
if err != nil {
    // handle error
}
The application uses newline to terminate the JSON body, but newline is valid whitespace in JSON. If the peer includes a newline in the JSON, then the application will read a partial message.
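One robust option is to parse directly from the stream with json.Decoder, which consumes exactly one complete JSON document regardless of any newlines inside it. A minimal sketch under the same SUCCESS-header protocol; the simulated stream contents are illustrative:

package main

import (
    "bufio"
    "encoding/json"
    "fmt"
    "log"
    "strings"
)

func main() {
    // Simulated network stream: a status line followed by a JSON
    // document that happens to contain newlines.
    stream := "SUCCESS\n[\n  {\"osm_id\": \"42\"}\n]\n"
    r := bufio.NewReader(strings.NewReader(stream))

    header, err := r.ReadString('\n')
    if err != nil || header != "SUCCESS\n" {
        log.Fatal("unexpected header")
    }

    // Decode reads one complete JSON value, even across newlines.
    var obvs []map[string]interface{}
    if err := json.NewDecoder(r).Decode(&obvs); err != nil {
        log.Fatal(err)
    }
    fmt.Println(obvs)
}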

Unmarshalling JSON with non-printable ASCII characters

Using Go, how can I unmarshal a JSON string that contains unprintable ASCII characters?
For Example
testJsonString := "{\"test_one\" : \"123\x10456\x0B789\v123\a456\"}"
var dat map[string]interface{}
err := json.Unmarshal([]byte(testJsonString), &dat)
if err != nil {
    panic(err)
}
Yields:
panic: invalid character '\x10' in string literal
goroutine 1 [running]:
main.main()
/tmp/sandbox903140350/main.go:14 +0x180
https://play.golang.org/p/mFGWzndDK8V
Unfortunately I do not have control over the source data, so I need a way to ignore or strip out the unprintable characters.
Similarly, another data issue I'm encountering is a few C escape sequences, like \0 and \a. If I replace the string listed above with the string below, the program fails as well. Essentially it fails on any C escape sequence: https://en.wikipedia.org/wiki/Escape_sequences_in_C
testJsonString := "{\"test_one\" : \"123456789\\a123456\"}"
will error out with
panic: invalid character 'a' in string escape code
goroutine 1 [running]:
main.main()
/tmp/sandbox322770276/main.go:12 +0x100
This string also cannot be unmarshaled, and the problem cannot be caught by rune-value or Unicode checks, since Go treats the sequence as a backslash followed by the character 'a', both of which are legal.
Is there a good way to handle these edge cases?
According to the JSON spec (RFC 8259), control characters in strings must be escaped (for example, converted to valid Unicode escapes such as \u0010).
So here's a converter that turns non-printable characters into their URI-escaped forms, which are plain printable text; the result can then be fed into Unmarshal.
If this isn't exactly the behaviour you need, modify the converter to remove the characters (with continue), replace them with a question-mark rune, or whatever suits.
BTW, the second problem with \\a does not "print out as expected" for me. Please give a better example that actually shows the problem you are experiencing
package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/url"
    "unicode"
)

// safety rewrites d so that every non-printable rune is replaced by
// its URI-escaped form, and backslashes are dropped entirely.
func safety(d string) []byte {
    var buffer bytes.Buffer
    for _, c := range d {
        s := string(c)
        if c == '\\' { // drop backslashes
            continue
        }
        if unicode.IsPrint(c) {
            buffer.WriteString(s)
        } else {
            buffer.WriteString(url.QueryEscape(s))
        }
    }
    return buffer.Bytes()
}

func main() {
    testJsonString := "{\"test_one\" : \"123\x10456\x0B789\v123\a456\"}"
    var dat map[string]interface{}
    err := json.Unmarshal(safety(testJsonString), &dat)
    if err != nil {
        panic(err)
    }
    fmt.Printf("%v", dat)
}

CSV Encoding Broken when Downloading from S3

I'm trying to download a CSV file from S3 using golang's SDK, but it comes out wrongly encoded and is interpreted as a single slice.
input := &s3.GetObjectInput{
    Bucket:                  aws.String(bucket),
    Key:                     aws.String(key),
    ResponseContentType:     aws.String("text/csv"),
    ResponseContentEncoding: aws.String("utf-8"),
}
object, err := s3.New(s).GetObject(input)
if err != nil {
    var obj s3.GetObjectOutput
    return &obj, err
}
defer object.Body.Close()
lines, err := csv.NewReader(object.Body).ReadAll()
if err != nil {
    log.Fatal(err)
}
log.Printf("%q", lines[0])
// returns ["\ufeffH1" "H2\r" "field1" "field2\r" "field1" "field2\r00602"]
I'm guessing this is incorrect character encoding, but I'm not clear which encoding it is. When I put the file, I specify csv.
I would have expected to see [][]string:
[
[],
[]
]
Any advice?
Approach 2
buffer := new(bytes.Buffer)
buffer.ReadFrom(object.Body)
str := buffer.String()
lines, err := csv.NewReader(strings.NewReader(str)).ReadAll()
if err != nil {
    log.Fatal(err)
}
log.Printf("length: %v", len(lines))
// still one line
Approach 3
My new approach is going to be manually removing the problematic byte sequences. This is pretty terrible; the godocs on this need work.
This gets closer, but now I have to split on newlines and then again on commas.
Edit
When I print out the bytes, it looks like:
"\ufeffH1,H2\r,field1,field2\r"
I have tried using the following encodings:
utf-8, iso-8859-1, iso-8859-1:utf-8
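The printed bytes point to two problems rather than a character-encoding issue: a UTF-8 byte-order mark (\ufeff) at the start of the file, and bare \r line terminators, which encoding/csv does not treat as record separators, so everything parses as one line. Below is a minimal sketch of normalising both before parsing, assuming that diagnosis is correct (normalizeCSV is a hypothetical helper):

package main

import (
    "bytes"
    "encoding/csv"
    "fmt"
    "io"
    "log"
    "strings"
)

// normalizeCSV strips a leading UTF-8 BOM and rewrites bare \r
// line terminators to \n so encoding/csv can split the records.
func normalizeCSV(r io.Reader) (io.Reader, error) {
    raw, err := io.ReadAll(r)
    if err != nil {
        return nil, err
    }
    raw = bytes.TrimPrefix(raw, []byte("\ufeff"))
    raw = bytes.ReplaceAll(raw, []byte("\r\n"), []byte("\n"))
    raw = bytes.ReplaceAll(raw, []byte("\r"), []byte("\n"))
    return bytes.NewReader(raw), nil
}

func main() {
    // Simulated S3 body matching the bytes shown above.
    body := strings.NewReader("\ufeffH1,H2\rfield1,field2\r")
    clean, err := normalizeCSV(body)
    if err != nil {
        log.Fatal(err)
    }
    lines, err := csv.NewReader(clean).ReadAll()
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("length: %v, lines: %q\n", len(lines), lines)
    // length: 2, lines: [["H1" "H2"] ["field1" "field2"]]
}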

Parsing nested JSON objects in a CSV file with golang

I'm trying to parse a CSV file which contains a JSON object in the last column.
Here is an example with two rows from the input CSV file:
'id','value','createddate','attributes'
524256,CAFE,2018-04-06 16:41:01,{"Att1Numeric": 6, "Att2String": "abc"}
524257,BEBE,2018-04-06 17:00:00,{}
I tried using the parser from the csv package:
func processFileAsCSV(f *multipart.Part) (int, error) {
    reader := csv.NewReader(f)
    reader.LazyQuotes = true
    reader.Comma = ','
    lineCount := 0
    for {
        line, err := reader.Read()
        if err == io.EOF {
            break
        } else if err != nil {
            fmt.Println("Error:", err)
            return 0, err
        }
        if lineCount%100000 == 0 {
            fmt.Println(lineCount)
        }
        lineCount++
        fmt.Println(lineCount, line)
        processLine(line) // do something with the line
    }
    fmt.Println("done!", lineCount)
    return lineCount, nil
}
But I got an error:
Error: line 2, column 0: wrong number of fields in line,
probably because the parser is unaware of the JSON scope that starts with { and splits on the commas inside it.
Should I be writing my own CSV parser, or is there a library that can handle this?
Your CSV input doesn't follow normal CSV convention: the last field (the JSON) is unquoted, so the commas inside it are taken as field separators.
I think the best approach would be to pre-process your input, either in your Go program or in an external script.
If your CSV input is predictable (as indicated in your question), it should be easy to properly quote the last element, using a simple strings.Split call, for instance, before passing it to the CSV parser, as sketched below.
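A minimal sketch of that pre-processing, assuming exactly three plain columns before the JSON column (the helper name quoteLastField is mine):

package main

import (
    "bufio"
    "bytes"
    "encoding/csv"
    "fmt"
    "log"
    "strings"
)

// quoteLastField wraps the fourth column in double quotes, doubling
// any quotes inside it, so encoding/csv can parse the row.
func quoteLastField(line string) string {
    parts := strings.SplitN(line, ",", 4)
    if len(parts) < 4 {
        return line
    }
    last := strings.ReplaceAll(parts[3], `"`, `""`)
    return strings.Join(parts[:3], ",") + `,"` + last + `"`
}

func main() {
    input := "524256,CAFE,2018-04-06 16:41:01,{\"Att1Numeric\": 6, \"Att2String\": \"abc\"}\n" +
        "524257,BEBE,2018-04-06 17:00:00,{}\n"

    var buf bytes.Buffer
    sc := bufio.NewScanner(strings.NewReader(input))
    for sc.Scan() {
        buf.WriteString(quoteLastField(sc.Text()) + "\n")
    }
    if err := sc.Err(); err != nil {
        log.Fatal(err)
    }

    records, err := csv.NewReader(&buf).ReadAll()
    if err != nil {
        log.Fatal(err)
    }
    for _, rec := range records {
        fmt.Printf("%q\n", rec)
    }
}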

Why does json.Encoder add an extra line?

json.Encoder seems to behave slightly differently than json.Marshal. Specifically, it adds a newline at the end of the encoded value. Any idea why that is? It looks like a bug to me.
package main

import (
    "bytes"
    "encoding/json"
    "fmt"
)

func main() {
    v := "hello"
    buf := bytes.NewBuffer(nil)
    json.NewEncoder(buf).Encode(v)
    b, _ := json.Marshal(&v)
    fmt.Printf("%q, %q", buf.Bytes(), b)
}
This outputs
"\"hello\"\n", "\"hello\""
Try it in the Playground
Because Encoder.Encode explicitly adds a newline character. Here's the source code for that func, and the documentation (the comment above the function) states that it adds a newline:
https://golang.org/src/encoding/json/stream.go?s=4272:4319
// Encode writes the JSON encoding of v to the stream,
// followed by a newline character.
//
// See the documentation for Marshal for details about the
// conversion of Go values to JSON.
func (enc *Encoder) Encode(v interface{}) error {
    if enc.err != nil {
        return enc.err
    }
    e := newEncodeState()
    err := e.marshal(v)
    if err != nil {
        return err
    }
    // Terminate each value with a newline.
    // This makes the output look a little nicer
    // when debugging, and some kind of space
    // is required if the encoded value was a number,
    // so that the reader knows there aren't more
    // digits coming.
    e.WriteByte('\n')
    if _, err = enc.w.Write(e.Bytes()); err != nil {
        enc.err = err
    }
    encodeStatePool.Put(e)
    return err
}
Now, why did the Go developers do it, other than to make "the output look a little nicer"? One answer:
Streaming
The Go json Encoder is optimized for streaming (e.g. MB/GB/PB of JSON data). When streaming, you typically need a way to delimit where one value in the stream ends; in the case of Encoder.Encode(), that delimiter is the \n newline character. Sure, you can certainly write to a buffer, but you can also write to an io.Writer and stream each encoded value as it is produced.
This is opposed to json.Marshal, which is generally discouraged if your input comes from an untrusted source of unknown size (e.g. an ajax POST to your web service: what if someone posts a 100MB JSON file?). And json.Marshal produces a single complete JSON value; you wouldn't expect to concatenate a few hundred Marshal results together. You'd use Encoder.Encode() for that, building a large set and writing to a buffer, stream, file, io.Writer, etc.
Whenever I'm in doubt whether something is a bug, I look up the source; that's one of the advantages of Go: its source and compiler are just pure Go. Within [n]vim I use \gb to open the source definition in a browser with my .vimrc settings.
You can erase the newline by seeking backward in the stream:
f, _ := os.OpenFile(fname, ...)
encoder := json.NewEncoder(f)
encoder.Encode(v)
f.Seek(-1, io.SeekCurrent) // step back over the trailing newline
f.WriteString("other data ...")
They should let users control this strange behaviour, for example via:
a build option to disable it
Encoder.SetEOF(eof string)
Encoder.SetIndent(prefix, indent, eof string)
The Encoder writes a stream of documents. The extra whitespace terminates a JSON document in the stream.
A terminator is required for stream readers. Consider a stream containing these JSON documents: 1, 2, 3. Without the extra whitespace, the data on the wire is the sequence of bytes 123. This is a single JSON document with the number 123, not three documents.
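To see the framing at work, here is a small sketch: three values written with Encoder.Encode can be read back individually by json.Decoder precisely because each one is newline-terminated.

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "io"
    "log"
)

func main() {
    var buf bytes.Buffer
    enc := json.NewEncoder(&buf)
    for _, v := range []int{1, 2, 3} {
        enc.Encode(v) // writes "1\n2\n3\n", not "123"
    }

    // Without the newline terminators, the bytes on the wire would
    // be "123": one JSON document, not three.
    dec := json.NewDecoder(&buf)
    for {
        var n int
        if err := dec.Decode(&n); err == io.EOF {
            break
        } else if err != nil {
            log.Fatal(err)
        }
        fmt.Println(n) // 1, then 2, then 3
    }
}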