CSV Encoding Broken when Downloading from S3

I'm trying to download a CSV file from S3 using the AWS SDK for Go, but it comes out wrongly encoded and the whole file is interpreted as a single record (one slice).
input := &s3.GetObjectInput{
    Bucket:                  aws.String(bucket),
    Key:                     aws.String(key),
    ResponseContentType:     aws.String("text/csv"),
    ResponseContentEncoding: aws.String("utf-8"),
}
object, err := s3.New(s).GetObject(input)
if err != nil {
    var obj s3.GetObjectOutput
    return &obj, err
}
defer object.Body.Close()
lines, err := csv.NewReader(object.Body).ReadAll()
if err != nil {
    log.Fatal(err)
}
log.Printf("%q", lines[0])
// returns ["\ufeffH1" "H2\r" "field1" "field2\r" "field1" "field2\r00602"]
I'm guessing this is an incorrect character encoding; the problem is that I'm not clear which encoding it is. When I put the file, I specify csv as the content type.
I would have expected to see [][]string:
[
    [],
    []
]
Any advice?
Approach 2
buffer := new(bytes.Buffer)
buffer.ReadFrom(object.Body)
str := buffer.String()
lines, err := csv.NewReader(strings.NewReader(str)).ReadAll()
if err != nil {
    log.Fatal(err)
}
log.Printf("length: %v", len(lines))
// still one line
Approach 3
My new approach is going to be manually removing the problematic byte sequences. This is pretty terrible; the godocs on this need work.
This is closer, but now I have to split on newlines and then again on commas (see the sketch after the edit below).
Edit
When I print out the bytes it looks like:
"\ufeffH1,H2\r,field1,field2\r
I have tried using the following encodings:
utf-8, iso-8859-1, iso-8859-1:utf-8
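For reference, a minimal sketch of that manual cleanup, assuming the real culprits are a UTF-8 BOM plus bare \r line endings (which Go's csv reader does not treat as record separators); it reuses object.Body from the snippet above:

// Sketch only: strip the UTF-8 BOM and normalize bare CR line endings,
// assuming that is what the data actually contains.
raw, err := io.ReadAll(object.Body)
if err != nil {
    log.Fatal(err)
}
raw = bytes.TrimPrefix(raw, []byte("\ufeff"))           // drop the BOM
raw = bytes.ReplaceAll(raw, []byte("\r"), []byte("\n")) // CR -> LF
lines, err := csv.NewReader(bytes.NewReader(raw)).ReadAll()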

Related

How to append consecutively to a JSON file in Go?

I wonder how I can write consecutively to the same file in Go. Do I have to use os.File's WriteAt()?
The JSON is basically just an array filled with structs:
[
    {
        "Id": "2817293",
        "Data": "XXXXXXX"
    },
    {
        "Id": "2817438",
        "Data": "XXXXXXX"
    }
    ...
]
I want to write data to it consecutively, i.e. append to that JSON array more than once before closing the file.
The data I want to write to the file is a slice of said structs:
dataToWrite := []struct{
    Id   string
    Data string
}{}
What is the proper way to write consecutively to a JSON array in Go?
My current approach creates multiple slices in the JSON file and thus is not what I want. The write process (which sits inside a for loop) looks like this:
...
// Read the current state of the file.
// (Note: data has zero length here, so this Read reads nothing.)
data := []byte{}
f.Read(data)
// Unmarshal the current state into a slice.
curr := []Result{}
json.Unmarshal(data, &curr)
// Append the new data to that slice.
curr = append(curr, *initArr...)
JSON, _ := JSONMarshal(curr)
// Empty the data container.
initArr = &[]Result{}
// Write
_, err := f.Write(JSON)
if err != nil {
    log.Fatal(err)
}
...
Write the opening [ to the file. Create an encoder on the file. Loop over slices and the elements of each slice. Write a comma if it's not the first slice element. Encode each slice element with the encoder. Write the closing ].
_, err := f.WriteString("[")
if err != nil {
    log.Fatal(err)
}
e := json.NewEncoder(f)
first := true
for i := 0; i < 10; i++ {
    // Create dummy slice data for this iteration.
    dataToWrite := []struct {
        Id   string
        Data string
    }{
        {fmt.Sprintf("id%d.1", i), fmt.Sprintf("data%d.1", i)},
        {fmt.Sprintf("id%d.2", i), fmt.Sprintf("data%d.2", i)},
    }
    // Encode each slice element to the file.
    for _, v := range dataToWrite {
        // Write a comma separator if it's not the first element.
        if !first {
            _, err := f.WriteString(",\n")
            if err != nil {
                log.Fatal(err)
            }
        }
        first = false
        err := e.Encode(v)
        if err != nil {
            log.Fatal(err)
        }
    }
}
_, err = f.WriteString("]")
if err != nil {
    log.Fatal(err)
}
https://go.dev/play/p/Z-T1nxRIaqL
If it's reasonable to hold all of the slice elements in memory, then simplify the code by encoding all of the data in a single batch:
type Item struct {
    Id   string
    Data string
}

// Collect all items to write in this slice.
var result []Item
for i := 0; i < 10; i++ {
    // Generate the slice for this iteration.
    dataToWrite := []Item{
        {fmt.Sprintf("id%d.1", i), fmt.Sprintf("data%d.1", i)},
        {fmt.Sprintf("id%d.2", i), fmt.Sprintf("data%d.2", i)},
    }
    // Append the slice generated in this iteration to the result.
    result = append(result, dataToWrite...)
}
// Write the result to the file.
err := json.NewEncoder(f).Encode(result)
if err != nil {
    log.Fatal(err)
}
https://go.dev/play/p/01xmVZg7ePc
If you don't care about the existing file you can just use Encoder.Encode on the whole slice as #redblue mentioned.
If you have an existing file you want to append to, the simplest way is to do what you've shown in your edit: Unmarshal or Decoder.Decode the whole file into a slice of structs, append the new struct to the slice, and re-encode the whole lot using Marshal or Encoder.Encode.
If you have a large amount of data, you may want to consider using JSON Lines to avoid the trailing , and ] issue, and write one JSON object per line. Or you could use regular JSON, seek back from the end of the file so you're writing over the final ], then write a ,, the new JSON-encoded struct, and finally a ] to make the file a valid JSON array again.
So which approach you take depends a bit on your use case and the data size.
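As a rough sketch of the seek-back approach (assuming the file already contains a non-empty JSON array ending in exactly ], and reusing the Item type from above; the filename is illustrative):

f, err := os.OpenFile("data.json", os.O_RDWR, 0644)
if err != nil {
    log.Fatal(err)
}
defer f.Close()
// Position the write cursor on the final ']' (one byte back from the end).
if _, err := f.Seek(-1, io.SeekEnd); err != nil {
    log.Fatal(err)
}
b, err := json.Marshal(Item{Id: "2817500", Data: "XXXXXXX"})
if err != nil {
    log.Fatal(err)
}
// Overwrite the ']' with ",<new object>]" so the file stays a valid array.
if _, err := f.Write(append(append([]byte(","), b...), ']')); err != nil {
    log.Fatal(err)
}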
NOTICE
This answer is a solution or workaround if you care about the content of an existing file!
That means it allows you to append to an existing JSON file created by your API.
Obviously this only works for arrays of the same struct type.
The actual working JSON format:
[
    object,
    ...
    object
]
When writing to the file, do NOT write [ and ].
Just append your serialized JSON object to the file, followed by a ,.
actual file content:
object,
...
object,
Finally, when reading the file, trim the trailing comma, then prepend [ and append ].
This way you can write to the file from multiple sources and still have valid JSON,
and you can load the file and have valid input for your JSON processor.
We write our log files like this and provide valid JSON via REST calls, which is then processed (for example by a JavaScript grid).
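A small sketch of this log-style scheme (the filename and the Item type are illustrative): each write appends the serialized object plus a comma, and the reader trims the final comma and wraps the content in brackets to form valid JSON.

func appendItem(f *os.File, v Item) error {
    b, err := json.Marshal(v)
    if err != nil {
        return err
    }
    // Append "<object>," with no surrounding brackets.
    _, err = f.Write(append(b, ','))
    return err
}

func readItems(path string) ([]Item, error) {
    raw, err := os.ReadFile(path)
    if err != nil {
        return nil, err
    }
    // Trim the trailing comma, then prepend [ and append ].
    raw = bytes.TrimRight(raw, ",\n")
    var items []Item
    err = json.Unmarshal(append(append([]byte("["), raw...), ']'), &items)
    return items, err
}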

Json Marshalling straight to stdout

I am trying to learn Golang, and while doing that I wrote the code below (part of a bigger self-learning project) and had it reviewed by strangers. One of the comments was: "you could have marshalled this straight to stdout, instead of marshalling to heap, then converting to string and then streaming it to stdout".
I have gone through the documentation of the encoding/json and io packages but am not able to piece together the required change.
Any pointers or help would be great.
// Marshal the struct with proper tab indent so it can be readable
b, err := json.MarshalIndent(res, "", " ")
if err != nil {
log.Fatal(errors.Wrap(err, "error marshaling response data"))
}
// Print the output to the stdout
fmt.Fprint(os.Stdout, string(b))
EDIT
I just found the code sample below in the documentation:
var out bytes.Buffer
json.Indent(&out, b, "=", "\t")
out.WriteTo(os.Stdout)
But again it writes to the heap before writing to stdout, though it does remove the step of converting to string.
Create and use a json.Encoder directed to os.Stdout. json.NewEncoder() accepts any io.Writer as its destination.
res := map[string]interface{}{
    "one": 1,
    "two": "twotwo",
}
if err := json.NewEncoder(os.Stdout).Encode(res); err != nil {
    panic(err)
}
This will output (directly to Stdout):
{"one":1,"two":"twotwo"}
If you want to set indentation, use its Encoder.SetIndent() method:
enc := json.NewEncoder(os.Stdout)
enc.SetIndent("", " ")
if err := enc.Encode(res); err != nil {
    panic(err)
}
This will output:
{
 "one": 1,
 "two": "twotwo"
}
Try the examples on the Go Playground.

Unmarshalling JSON with Duplicate Fields

I'm still learning the go language, but I've been trying to find some practical things to work on to get a better handle on it. Currently, I'm trying to build a simple program that goes to a youtube channel and returns some information by taking the public JSON and unmarshalling it.
Thus far I've tried making a completely custom struct that only has a few fields in it, but that doesn't seem to pull in any values. I've also tried using tools like https://mholt.github.io/json-to-go/ and getting the "real" struct that way. The issue with that method is there are numerous duplicates and I don't know enough to really assess how to tackle that.
This is an example JSON (I apologize for its size) https://pastebin.com/6u0b39tU
This is the struct that I get from the above tool: https://pastebin.com/3ZCu96st
The basic pattern of code I've tried is:
jsonFile, err := os.Open("test.json")
if err != nil {
    fmt.Println("Couldn't open file", err)
}
defer jsonFile.Close()
bytes, _ := ioutil.ReadAll(jsonFile)
var channel Autogenerated
json.Unmarshal(bytes, &Autogenerated)
if err != nil {
    fmt.Println("Failed to Unmarshal", err)
}
fmt.Println(channel.Fieldname)
Any feedback on the correct approach for how to handle something like this would be great. I get the feeling I'm just completely missing something.
In your code, you are not unmarshaling into the channel variable. Furthermore, you can optimize your code to not use ReadAll. Also, don't forget to check for errors (all errors).
Here is an improvement to your code.
jsonFile, err := os.Open("test.json")
if err != nil {
    log.Fatalf("could not open file: %v", err)
}
defer jsonFile.Close()
var channel Autogenerated
if err := json.NewDecoder(jsonFile).Decode(&channel); err != nil {
    log.Fatalf("failed to parse json: %v", err)
}
fmt.Println(channel.Fieldname)
Notice how a reference to channel is passed to Decode.

CSV parser in Go breaks due to trailing space

We are trying to parse a CSV file using Go's encoding/csv package. This particular CSV is a bit peculiar: each row has a trailing space. When trying to decode this CSV with quoted fields, the package breaks, since after a closing quote it expects a newline, separator, or another quote; the trailing space is not expected.
How would you handle this case? Do you know of another parser that we could use?
Edit:
f, err := os.Open("file.go")
// err etc..
csvr := csv.NewReader(f)
csvr.Comma = csvDelimiter
for {
    rowAsSlice, err := csvr.Read()
    // Handle row and errors etc.
}
Edit 2:
CSV example, mind the trailing space!
"RECORD_TYPE","COMPANY_SHORTNAME"
"HDR","COMPANY_EXAMPLE"
One possible solution is to wrap the source file reader in a custom reader whose Read(...) method silently trims trailing whitespace from what the underlying reader actually reads. The csv.Reader could use that type directly.
For example (Go Playground):
type TrimReader struct{ io.Reader }

var trailingws = regexp.MustCompile(` +\r?\n`)

func (tr TrimReader) Read(bs []byte) (int, error) {
    // Perform the requested read on the given reader.
    n, err := tr.Reader.Read(bs)
    if err != nil {
        return n, err
    }
    // Remove trailing whitespace from each line.
    lines := string(bs[:n])
    trimmed := []byte(trailingws.ReplaceAllString(lines, "\n"))
    copy(bs, trimmed)
    return len(trimmed), nil
}
func main() {
    file, err := os.Open("myfile.csv")
    // TODO: handle err...
    csvr := csv.NewReader(TrimReader{file})
    for {
        record, err := csvr.Read()
        if err == io.EOF {
            break
        }
        fmt.Printf("LINE: record=%#v, err=%v\n", record, err)
    }
    // LINE: record=[]string{"RECORD_TYPE", "COMPANY_SHORTNAME"}, err=<nil>
    // LINE: record=[]string{"HDR", "COMPANY_EXAMPLE"}, err=<nil>
}
Note that, as commenter #svsd points out, there is a subtle bug here wherein trailing whitespace can still make it through if the line terminator isn't read until the subsequent call. You can work around this by buffering or, perhaps best, simply preprocess these CSV files to remove the trailing whitespace before attempting to parse them.
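If you take the preprocessing route, here is a minimal sketch (the filename is illustrative) that trims every line in memory before parsing, sidestepping the chunk-boundary issue entirely:

// Read the whole file, trim trailing spaces from each line, then parse.
data, err := os.ReadFile("myfile.csv")
if err != nil {
    log.Fatal(err)
}
clean := trailingws.ReplaceAll(data, []byte("\n")) // reuses the regexp above
records, err := csv.NewReader(bytes.NewReader(clean)).ReadAll()
if err != nil {
    log.Fatal(err)
}
fmt.Println(records)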

How do I make raw unicode encoded content readable?

I used net/http to request a web API and the server returned a JSON response. When I print the response body, it is displayed as raw ASCII content. I tried using bufio.ScanRunes to parse the content but failed.
I also tried writing a simple server that returns a unicode string, and that worked well.
Here is the core code:
func (c ClientInfo) Request(method string, url string, form url.Values) string {
    req, _ := http.NewRequest(method, url, strings.NewReader(c.Encode(form)))
    req.Header = c.Header
    req.AddCookie(&c.Cookie)
    resp, err := http.DefaultClient.Do(req)
    // Check the error before touching resp; a failed request leaves resp nil.
    if err != nil {
        fmt.Println(err)
    }
    defer resp.Body.Close()
    scanner := bufio.NewScanner(resp.Body)
    scanner.Split(bufio.ScanRunes)
    var buf bytes.Buffer
    for scanner.Scan() {
        buf.WriteString(scanner.Text())
    }
    rv := buf.String()
    fmt.Println(rv)
    return rv
}
Here is the example output:
{"forum":{"id":"3251718","name":"\u5408\u80a5\u5de5\u4e1a\u5927\u5b66\u5ba3\u57ce\u6821\u533a","first_class":"\u9ad8\u7b49\u9662\u6821","second_class":"\u5b89\u5fbd\u9662\u6821","is_like":"0","user_level":"1","level_id":"1","level_name":"\u7d20\u672a\u8c0b\u9762","cur_score":"0","levelup_score":"5","member_num":"80329","is_exists":"1","thread_num":"108762","post_num":"3445881","good_classify":[{"class_id":"0","class_name":"\u5168\u90e8"},{"class_id":"1","class_name":"\u516c\u544a\u7c7b"},{"class_id":"2","class_name":"\u5427\u53cb\u4e13\u533a"},{"class_id":"4","class_name":"\u6d3b\u52a8\u4e13\u533a"},{"class_id":"6","class_name":"\u793e\u56e2\u73ed\u7ea7"},{"class_id":"5","class_name":"\u8d44\u6e90\u5171\u4eab"},{"class_id":"8","class_name":"\u6e29\u99a8\u751f\u6d3b\u7c7b"},{"class_id":"7","class_name":"\u54a8\u8be2\u65b0\u95fb\u7c7b"},{"class_id":"3","class_name":"\u98ce\u91c7\u5c55\u793a\u533a"}],"managers":[{"id":"793092593","name":"yi\u62b9\u660e\u5a9a\u7684\u5fe7\u4f24"},
...
That is just the standard way to escape any Unicode character.
Unmarshal it to see the unquoted text (the json package will unquote it):
func main() {
    var i interface{}
    err := json.Unmarshal([]byte(src), &i)
    fmt.Println(err, i)
}

const src = `{"forum":{"id":"3251718","name":"\u5408\u80a5\u5de5\u4e1a\u5927\u5b66\u5ba3\u57ce\u6821\u533a","first_class":"\u9ad8\u7b49\u9662\u6821","second_class":"\u5b89\u5fbd\u9662\u6821","is_like":"0","user_level":"1","level_id":"1","level_name":"\u7d20\u672a\u8c0b\u9762","cur_score":"0","levelup_score":"5","member_num":"80329","is_exists":"1","thread_num":"108762","post_num":"3445881","good_classify":[{"class_id":"0","class_name":"\u5168\u90e8"},{"class_id":"1","class_name":"\u516c\u544a\u7c7b"},{"class_id":"2","class_name":"\u5427\u53cb\u4e13\u533a"},{"class_id":"4","class_name":"\u6d3b\u52a8\u4e13\u533a"},{"class_id":"6","class_name":"\u793e\u56e2\u73ed\u7ea7"},{"class_id":"5","class_name":"\u8d44\u6e90\u5171\u4eab"},{"class_id":"8","class_name":"\u6e29\u99a8\u751f\u6d3b\u7c7b"},{"class_id":"7","class_name":"\u54a8\u8be2\u65b0\u95fb\u7c7b"},{"class_id":"3","class_name":"\u98ce\u91c7\u5c55\u793a\u533a"}]}}`
Output (trimmed) (try it on the Go Playground):
<nil> map[forum:map[levelup_score:5 is_exists:1 post_num:3445881 good_classify:[map[class_id:0 class_name:全部] map[class_id:1 class_name:公告类] map[class_id:2 class_name:吧友专区] map[class_id:4 class_name:活动专区] map[class_id:6 class_name:社团班级] map[class_id:5 class_name:资源共享] map[class_id:8 class_name:温馨生活类] map[class_name:咨询新闻类 class_id:7] map[class_id:3 class_name:风采展示区]] id:3251718 is_like:0 cur_score:0
If you just want to unquote a fragment, you may use strconv.Unquote():
fmt.Println(strconv.Unquote(`"\u7d20\u672a\u8c0b"`))
Output (try it on the Go Playground):
素未谋 <nil>
Note that strconv.Unquote() expects a string that is in quotes, that's why I used a raw string literal, so I could add quotes, and also so that the compiler itself will not interpret / unquote the Unicode escapes.
See related question: How to convert escape characters in HTML tags?