Parsing a CSV file which includes a header block of "comments"

I'm trying to parse a CSV file hosted in a remote location, but annoyingly the file contains some readme comments at the top, in the following format:
######
# some readme text,
# some more, comments
######
01-02-03,123,foo,http://example.com
04-05-06,789,baz,http://another.com
I'm attempting to use the following code to extract the URLs within the data, but it throws a "wrong number of fields" error due to the comments at the top; presumably it's trying to parse them as CSV content.
package main

import (
    "encoding/csv"
    "net/http"
)

type myData struct {
    URL string `json:"url"`
}

func doWork() ([]myData, error) {
    rurl := "https://example.com/some.csv"
    out := make([]myData, 0)
    resp, err := http.Get(rurl)
    if err != nil {
        return []myData{}, err
    }
    defer resp.Body.Close()
    reader := csv.NewReader(resp.Body)
    reader.Comma = ','
    data, err := reader.ReadAll()
    if err != nil {
        return []myData{}, err
    }
    for _, row := range data {
        out = append(out, myData{URL: row[3]}) // the URL is the fourth (last) field
    }
    return out, nil
}

func main() {
    data, err := doWork()
    if err != nil {
        panic(err)
    }
    // do something with data
    _ = data
}
Is there a way to skip over the first N lines of the remote file, or to have it ignore lines which start with a '#'?

Oh, actually I just realised I can add this:
reader.Comment = '#' // ignore lines beginning with '#'
Which works perfectly with my current code, but I appreciate the other suggestions.
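For completeness, a minimal self-contained sketch of that fix, runnable against the sample data from the question (the URL column index is an assumption based on that sample):

package main

import (
    "encoding/csv"
    "fmt"
    "strings"
)

func main() {
    sample := `######
# some readme text,
# some more, comments
######
01-02-03,123,foo,http://example.com
04-05-06,789,baz,http://another.com`
    reader := csv.NewReader(strings.NewReader(sample))
    reader.Comment = '#' // lines beginning with '#' are ignored entirely
    rows, err := reader.ReadAll()
    if err != nil {
        panic(err)
    }
    for _, row := range rows {
        fmt.Println(row[3]) // the URL is the last of the four fields
    }
}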

Well, the approach is simple: do not try to interpret the lines starting with '#' as part of the CSV stream; instead, consider the whole stream of data as a concatenation of two streams: the header and the actual CSV payload.
The easiest approach is probably to exploit the fact that a bufio.Reader can read lines from its underlying stream while itself being an io.Reader, so you can have a csv.Reader read from it instead of the source stream.
So you could roll with something like this (an untested sketch):
import (
    "bufio"
    "encoding/csv"
    "io"
    "strings"
)

func parse(r io.Reader) ([]myData, error) {
    br := bufio.NewReader(r)
    var line string
    for {
        s, err := br.ReadString('\n')
        if err != nil {
            return nil, err
        }
        if len(s) == 0 || s[0] != '#' {
            line = s
            break
        }
    }
    // At this point the line variable contains the 1st line of the CSV stream.
    // Let's create a "multi reader" which reads first from that line
    // and then from the rest of the CSV stream.
    cr := csv.NewReader(io.MultiReader(strings.NewReader(line), br))
    cr.Comma = ','
    data, err := cr.ReadAll()
    if err != nil {
        return nil, err
    }
    out := make([]myData, 0, len(data))
    for _, row := range data {
        out = append(out, myData{URL: row[3]}) // the URL is the fourth (last) field
    }
    return out, nil
}
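A hypothetical caller, wiring parse up to the HTTP response from the original question (this assumes the myData type from the question and "net/http" imported):

func doWork() ([]myData, error) {
    resp, err := http.Get("https://example.com/some.csv")
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    return parse(resp.Body)
}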

Related

How to filter elements of a [][]string slice in Golang?

First of all, I'm new here and I'm trying to learn Golang. I would like to read my CSV file (which has 3 fields per record: type, maker, model), create a new one, and after a filter operation write the new (filtered) data to the created CSV file. Here is my code so you can understand me more clearly.
package main

import (
    "encoding/csv"
    "fmt"
    "os"
)

func main() {
    // opening my csv file which is vehicles.csv
    recordFile, err := os.Open("vehicles.csv")
    if err != nil {
        fmt.Println("An error encountered ::", err)
        return
    }
    defer recordFile.Close()
    // reading it
    reader := csv.NewReader(recordFile)
    vehicles, err := reader.ReadAll()
    if err != nil {
        fmt.Println("An error encountered ::", err)
        return
    }
    // creating a new csv file
    newRecordFile, err := os.Create("newCsvFile.csv")
    if err != nil {
        fmt.Println("An error encountered ::", err)
        return
    }
    defer newRecordFile.Close()
    // writing vehicles.csv into the new csv
    writer := csv.NewWriter(newRecordFile)
    err = writer.WriteAll(vehicles)
    if err != nil {
        fmt.Println("An error encountered ::", err)
    }
}
After I build it, it works this way: it reads all the data and writes it to the newly created CSV file. But the problem is that I want to filter duplicates out of the CSV I read (vehicles). I'm trying to write another function (outside of main) to filter duplicates, but I can't, because the type of vehicles is [][]string; I searched the internet for filtering duplicates, but everything I found covered int or string types. What I want to do is create a function and call it before the WriteAll operation, so WriteAll writes the correct (duplicate-filtered) data into the new CSV file. Help me please!
I appreciate any answer.
Happy coding!
This depends on how you define "uniqueness", but in general there are a few parts to this problem.
What is unique?
1. All fields must be equal
2. Only some fields must be equal
3. Normalize some or all fields before comparing
You have a few approaches for applying your uniqueness check, including:
1. You can use a map, keyed by the "pieces" of uniqueness; this requires O(N) state
2. You can sort the records and compare with the prior record as you iterate; this requires O(1) state but is more complicated
You have two approaches for filtering and outputting:
1. You can build a new slice based on the old one using a loop and write it all at once; this requires O(N) space
2. You can write the records out to the file as you go, if you don't need to sort; this requires O(1) space
I think a reasonably simple and performant approach would be to pick (1) from the first list, (1) from the second, and (2) from the third, which together would look like:
package main

import (
    "encoding/csv"
    "errors"
    "io"
    "log"
    "os"
)

func main() {
    input, err := os.Open("vehicles.csv")
    if err != nil {
        log.Fatalf("opening input file: %s", err)
    }
    output, err := os.Create("vehicles_filtered.csv")
    if err != nil {
        log.Fatalf("creating output file: %s", err)
    }
    defer func() {
        // Ensure the file is closed at the end of the program
        if err := output.Close(); err != nil {
            log.Fatalf("finalizing output file: %s", err)
        }
    }()
    reader := csv.NewReader(input)
    writer := csv.NewWriter(output)
    seen := make(map[[3]string]bool)
    for {
        // Read in one record
        record, err := reader.Read()
        if errors.Is(err, io.EOF) {
            break
        }
        if err != nil {
            log.Fatalf("reading record: %s", err)
        }
        if len(record) != 3 {
            log.Printf("bad record %q", record)
            continue
        }
        // Check if the record has been seen before, skipping if so
        key := [3]string{record[0], record[1], record[2]}
        if seen[key] {
            continue
        }
        seen[key] = true
        // Write the record
        if err := writer.Write(record); err != nil {
            log.Fatalf("writing record %d: %s", len(seen), err)
        }
    }
    // Flush any buffered records to the underlying file
    writer.Flush()
    if err := writer.Error(); err != nil {
        log.Fatalf("flushing output: %s", err)
    }
}
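For contrast, a minimal sketch of the sort-then-compare alternative (option 2 of the second list above), assuming the records are already in memory as a [][]string; the helper name dedupeSorted is mine:

import (
    "sort"
    "strings"
)

// dedupeSorted sorts the records, then drops adjacent duplicates,
// keeping O(1) dedup state at the cost of reordering the input.
func dedupeSorted(records [][]string) [][]string {
    key := func(rec []string) string { return strings.Join(rec, "\x00") }
    sort.Slice(records, func(i, j int) bool {
        return key(records[i]) < key(records[j])
    })
    out := records[:0] // filter in place, no extra allocation
    for _, rec := range records {
        if len(out) > 0 && key(rec) == key(out[len(out)-1]) {
            continue // duplicate of the previous (sorted) record
        }
        out = append(out, rec)
    }
    return out
}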

Chrome native messaging host in golang fails when JSON size is more than 65500 characters

I am trying to write a native messaging host for Chrome in Go. For this purpose, I tried using the chrome-go as well as the chrome-native-messaging packages. Both presented the same problem, as explained below.
Here is the code. I have added the relevant parts from the chrome-go package to the main file instead of importing it, for easier understanding.
The following code actually works when I send a JSON message to it, like {"content": "Apple Mango"}. However, it stops working once the length of the JSON goes over approximately 65500 characters, give or take a hundred. There is no error output either.
package main

import (
    "encoding/binary"
    "encoding/json"
    "fmt"
    "io"
    "os"
)

var byteOrder binary.ByteOrder = binary.LittleEndian

func Receive(reader io.Reader) ([]byte, error) {
    // Read message length in native byte order
    var length uint32
    if err := binary.Read(reader, byteOrder, &length); err != nil {
        return nil, err
    }
    // Return if no message
    if length == 0 {
        return nil, nil
    }
    // Read message body
    received := make([]byte, length)
    if n, err := reader.Read(received); err != nil || n != len(received) {
        return nil, err
    }
    return received, nil
}

type response struct {
    Content string `json:"content"`
}

func main() {
    msg, err := Receive(os.Stdin)
    if err != nil {
        panic(err)
    }
    var res response
    err = json.Unmarshal(msg, &res)
    if err != nil {
        panic(err)
    }
    fmt.Println(res.Content)
}
For those interested in testing, I have set up a repository with instructions. Run the following
git clone --depth=1 https://tesseract-index#bitbucket.org/tesseract-index/chrome-native-messaging-test-riz.git && cd chrome-native-messaging-test-riz
./json2msg.js < test-working.json | go run main.go
./json2msg.js < test-not-working.json | go run main.go
You will see that test-not-working.json gives no output, although it differs from test-working.json by only a few hundred characters.
What is the issue here?
There is a limit on the capacity of a pipe buffer, and it varies across systems. Mac OS X, for example, uses a capacity of 16384 bytes by default.
You can use this bash script to check your buffer capacity:
M=0; while printf A; do >&2 printf "\r$((++M)) B"; done | sleep 999
So it is not related to Go: I changed your code to read from a file and unmarshal, and it worked:
func main() {
    reader, err := os.Open("test-not-working.json")
    if err != nil {
        panic(err)
    }
    var res response
    decoder := json.NewDecoder(reader)
    err = decoder.Decode(&res)
    if err != nil {
        panic(err)
    }
    fmt.Println(res.Content)
}
This is because the pipe buffer of your OS is limited to 65536 bytes, so a single os.Stdin.Read(...) call can return at most 65536 bytes.
You can fix your code with this simple replacement:
n, err := io.ReadFull(reader, received)
And here is your error:
msg, err := Receive(os.Stdin)
if err != nil {
    panic(err)
}
You compared err with nil, but you never compared msg with nil. Since only 65532 (65536 - 4) of the expected bytes were read, reader.Read(...) returned a short count with a nil error, and Receive(...) returned nil, nil.
To fix this, your function Receive(...) ought not return nil, nil in that case.
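Putting both fixes together, a minimal sketch of a corrected Receive (same signature as the question's, with io.ReadFull so a short pipe read is retried until the whole message arrives):

func Receive(reader io.Reader) ([]byte, error) {
    // Read message length in native byte order
    var length uint32
    if err := binary.Read(reader, byteOrder, &length); err != nil {
        return nil, err
    }
    if length == 0 {
        return []byte{}, nil // empty message, but not a nil slice
    }
    // io.ReadFull keeps reading until the buffer is full (or it errors),
    // so a 64 KiB pipe buffer no longer truncates the message.
    received := make([]byte, length)
    if _, err := io.ReadFull(reader, received); err != nil {
        return nil, err
    }
    return received, nil
}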

How to output results to CSV of a concurrent web scraper in Go?

I'm new to Go and am trying to take advantage of its concurrency to build a basic scraper that extracts the title, meta description, and meta keywords from URLs.
I am able to print the results to the terminal with the concurrency, but can't figure out how to write the output to CSV. I've tried many variations I could think of with my limited knowledge of Go, and many end up breaking the concurrency, so I'm losing my mind a bit.
My code and URL input file are below. Thanks in advance for any tips!
// file name: metascraper.go
package main

import (
    // import standard libraries
    "encoding/csv"
    "fmt"
    "io"
    "log"
    "os"
    "time"
    // import third party libraries
    "github.com/PuerkitoBio/goquery"
)

func csvParsing() {
    file, err := os.Open("data/sample.csv")
    checkError("Cannot open file ", err)
    if err != nil {
        // err is printable
        // elements passed are separated by space automatically
        fmt.Println("Error:", err)
        return
    }
    // automatically call Close() at the end of current method
    defer file.Close()
    //
    reader := csv.NewReader(file)
    // options are available at:
    // http://golang.org/src/pkg/encoding/csv/reader.go?s=3213:3671#L94
    reader.Comma = ';'
    lineCount := 0
    fileWrite, err := os.Create("data/result.csv")
    checkError("Cannot create file", err)
    defer fileWrite.Close()
    writer := csv.NewWriter(fileWrite)
    defer writer.Flush()
    for {
        // read just one record
        record, err := reader.Read()
        // end-of-file is fitted into err
        if err == io.EOF {
            break
        } else if err != nil {
            fmt.Println("Error:", err)
            return
        }
        go func(url string) {
            // fmt.Println(msg)
            doc, err := goquery.NewDocument(url)
            if err != nil {
                checkError("No URL", err)
            }
            metaDescription := make(chan string, 1)
            pageTitle := make(chan string, 1)
            go func() {
                // time.Sleep(time.Second * 2)
                // use CSS selector found with the browser inspector
                // for each, use index and item
                pageTitle <- doc.Find("title").Contents().Text()
                doc.Find("meta").Each(func(index int, item *goquery.Selection) {
                    if item.AttrOr("name", "") == "description" {
                        metaDescription <- item.AttrOr("content", "")
                    }
                })
            }()
            select {
            case res := <-metaDescription:
                resTitle := <-pageTitle
                fmt.Println(res)
                fmt.Println(resTitle)
                // Have been trying to output to CSV here but it's not working
                // writer.Write([]string{url, resTitle, res})
                // err := writer.WriteString(`res`)
                // checkError("Cannot write to file", err)
            case <-time.After(time.Second * 2):
                fmt.Println("timeout 2")
            }
        }(record[0])
        fmt.Println()
        lineCount++
    }
}

func main() {
    csvParsing()
    // Code is to make sure there is a pause before the program finishes so we can see output
    var input string
    fmt.Scanln(&input)
}

func checkError(message string, err error) {
    if err != nil {
        log.Fatal(message, err)
    }
}
The data/sample.csv input file with URLs:
http://jonathanmh.com
http://keshavmalani.com
http://google.com
http://bing.com
http://facebook.com
In the code you supplied, you had commented out the following code:
// Have been trying to output to CSV here but it's not working
err = writer.Write([]string{url, resTitle, res})
checkError("Cannot write to file", err)
This code is correct, but you have one other issue.
Earlier in the function, you have the following code:
fileWrite, err := os.Create("data/result.csv")
checkError("Cannot create file", err)
defer fileWrite.Close()
This causes fileWrite to be closed as soon as your csvParsing() func exits.
Because the defer has already closed fileWrite by the time your concurrent goroutines run, they can no longer write to it.
Solution:
You'll need to close fileWrite inside your concurrent func, or otherwise ensure the goroutines have finished writing before the file is closed, as sketched below.
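One way to arrange that (a sketch, not the only solution): let the goroutines send their results over a channel and have a single goroutine own the csv.Writer, flushing only after a sync.WaitGroup confirms every fetch has finished. The result type and the scrape stub are stand-ins for the goquery logic in the question:

import (
    "encoding/csv"
    "os"
    "sync"
)

type result struct {
    url, title, description string
}

// scrape is a placeholder for the goquery logic from the question.
func scrape(url string) (title, description string) {
    return "title of " + url, "description of " + url
}

func writeResults(urls []string) error {
    fileWrite, err := os.Create("data/result.csv")
    if err != nil {
        return err
    }
    defer fileWrite.Close()
    writer := csv.NewWriter(fileWrite)

    results := make(chan result)
    var wg sync.WaitGroup
    for _, u := range urls {
        wg.Add(1)
        go func(url string) {
            defer wg.Done()
            title, desc := scrape(url)
            results <- result{url, title, desc}
        }(u)
    }
    // Close the channel once every scraper goroutine is done.
    go func() {
        wg.Wait()
        close(results)
    }()
    // Single writer: no concurrent access to the csv.Writer, and the
    // deferred Close only runs after everything has been written.
    for r := range results {
        if err := writer.Write([]string{r.url, r.title, r.description}); err != nil {
            return err
        }
    }
    writer.Flush()
    return writer.Error()
}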

Using Golang to read csv, reorder columns then write result to a new csv with Concurrency

Here's my starting point.
It is a Golang script that reads in a CSV with 3 columns, re-orders the columns, and writes the result to a new CSV file.
package main

import (
    "encoding/csv"
    "fmt"
    "math/rand"
    "os"
    "time"
)

func main() {
    start_time := time.Now()
    // Loading csv file
    rFile, err := os.Open("data/small.csv") // 3 columns
    if err != nil {
        fmt.Println("Error:", err)
        return
    }
    defer rFile.Close()
    // Creating csv reader
    reader := csv.NewReader(rFile)
    lines, err := reader.ReadAll()
    if err != nil {
        fmt.Println("Error:", err)
        return
    }
    // Creating csv writer
    wFile, err := os.Create("data/result.csv")
    if err != nil {
        fmt.Println("Error:", err)
        return
    }
    defer wFile.Close()
    writer := csv.NewWriter(wFile)
    // Read data, randomize columns and write new lines to results.csv
    rand.Seed(int64(time.Now().Nanosecond()))
    var col_index []int
    for i, line := range lines {
        if i == 0 {
            // randomize column index based on the number of columns recorded in the 1st line
            col_index = rand.Perm(len(line))
        }
        writer.Write([]string{line[col_index[0]], line[col_index[1]], line[col_index[2]]}) // 3 columns
        writer.Flush()
    }
    // print report
    fmt.Println("No. of lines: ", len(lines))
    fmt.Println("Time taken: ", time.Since(start_time))
}
Question:
Is my code idiomatic for Golang?
How can I add concurrency to this code?
Your code is OK. There is not much call for concurrency here. But you can at least reduce memory consumption by reordering on the fly. Just use Read() instead of ReadAll() to avoid allocating a slice for the whole input file.
for line, err := reader.Read(); err == nil; line, err = reader.Read() {
    if err = writer.Write([]string{line[col_index[0]], line[col_index[1]], line[col_index[2]]}); err != nil {
        fmt.Println("Error:", err)
        break
    }
    writer.Flush()
}
Move the col_index initialisation outside the write loop:
if len(lines) > 0 {
    // randomize column index based on the number of columns recorded in the 1st line
    col_index := rand.Perm(len(lines[0]))
    newLine := make([]string, len(col_index))
    for _, line := range lines[1:] {
        for from, to := range col_index {
            newLine[to] = line[from]
        }
        writer.Write(newLine)
        writer.Flush()
    }
}
To use concurrency, you must not use reader.ReadAll. Instead, start a goroutine that calls reader.Read and writes each record to a channel, replacing the lines slice. The main goroutine reads from the channel, shuffles the columns, and writes the result, as sketched below.
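A minimal sketch of that pipeline (untested; the column shuffle is taken from the question, and error handling is kept to a bare minimum):

import (
    "encoding/csv"
    "io"
    "math/rand"
)

func shuffleCSV(r io.Reader, w io.Writer) error {
    reader := csv.NewReader(r)
    writer := csv.NewWriter(w)
    records := make(chan []string)
    errc := make(chan error, 1) // buffered so the reader goroutine never blocks on it
    // Reader goroutine: feed records into the channel one at a time.
    go func() {
        defer close(records)
        for {
            line, err := reader.Read()
            if err == io.EOF {
                errc <- nil
                return
            }
            if err != nil {
                errc <- err
                return
            }
            records <- line
        }
    }()
    // Main goroutine: shuffle columns on the fly and write.
    var colIndex []int
    for line := range records {
        if colIndex == nil {
            // randomize column order based on the first record
            colIndex = rand.Perm(len(line))
        }
        newLine := make([]string, len(line))
        for from, to := range colIndex {
            newLine[to] = line[from]
        }
        if err := writer.Write(newLine); err != nil {
            return err
        }
    }
    writer.Flush()
    if err := writer.Error(); err != nil {
        return err
    }
    return <-errc
}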

Golang json query from bitcoin api returns invalid character

Something tells me I'm not understanding JSON correctly. I'm trying to grab some data off http://api.bitcoincharts.com/v1/trades.csv?symbol=rockUSD, but my Unmarshal doesn't seem able to read the data. I'm a fresh beginner to Go (and JSON as well), and I'm wondering how I can get past that invalid character error.
My error:
invalid character ',' after top-level value
panic: invalid character ',' after top-level value
My code:
package main

import (
    "encoding/json"
    "fmt"
    "io/ioutil"
    "net/http"
)

type Prices struct {
    Data string
}

func main() {
    url := "http://api.bitcoincharts.com/v1/trades.csv?symbol=rockUSD"
    httpresp, err := http.Get(url)
    if err != nil {
        fmt.Println(err)
        panic(err)
    }
    defer httpresp.Body.Close()
    htmldata, err := ioutil.ReadAll(httpresp.Body)
    if err != nil {
        fmt.Println(err)
        panic(err)
    }
    var jsonData []Prices
    err = json.Unmarshal(htmldata, &jsonData)
    if err != nil {
        fmt.Println(err)
        panic(err)
    }
    fmt.Println(jsonData)
}
That is NOT JSON data at all (the endpoint returns CSV), so you'd have to write a custom parser.
Example (this needs "bufio", "io", "strconv", and "strings" imported; the elided parts are the rest of main from the question):
// ... inside main, after the http.Get call:
data := readData(httpresp.Body)
fmt.Println(data)
// ...

func readData(r io.Reader) (out [][3]float64) {
    br := bufio.NewScanner(r)
    for br.Scan() {
        parts := strings.Split(br.Text(), ",")
        if len(parts) != 3 {
            continue
        }
        var fparts [3]float64
        for i, p := range parts {
            // bad idea to ignore errors, but it's left as an exercise for the reader
            fparts[i], _ = strconv.ParseFloat(p, 64)
        }
        out = append(out, fparts)
    }
    return
}