Writing data from BigQuery to CSV is slow

I wrote code that behaves strangely and runs slowly, and I can't understand why.
What I'm trying to do is download data from BigQuery (using a query as input) to a CSV file, then create a URL for that CSV so people can download it as a report.
I'm trying to optimize the CSV writing step, as it takes some time and shows some weird behavior.
The code iterates over the BigQuery results and passes each row to a channel for parsing and writing with Go's encoding/csv package.
These are the relevant parts, with some debugging added:
func (s *Service) generateReportWorker(ctx context.Context, query, reportName string) error {
it, err := s.bigqueryClient.Read(ctx, query)
if err != nil {
return err
}
filename := generateReportFilename(reportName)
gcsObj := s.gcsClient.Bucket(s.config.GcsBucket).Object(filename)
wc := gcsObj.NewWriter(ctx)
wc.ContentType = "text/csv"
wc.ContentDisposition = "attachment"
csvWriter := csv.NewWriter(wc)
var doneCount uint64
go backgroundTimer(ctx, it.TotalRows, &doneCount)
rowJobs := make(chan []bigquery.Value, it.TotalRows)
workers := 10
wg := sync.WaitGroup{}
wg.Add(workers)
// start the worker pool
for i := 0; i < workers; i++ {
go func(c context.Context, num int) {
defer wg.Done()
for row := range rowJobs {
records := make([]string, len(row))
for j, r := range row {
records[j] = fmt.Sprintf("%v", r)
}
s.mu.Lock()
start := time.Now()
if err := csvWriter.Write(records); err != nil {
log.Errorf("Error writing row: %v", err)
}
if time.Since(start) > time.Second {
fmt.Printf("worker %d took %v\n", num, time.Since(start))
}
s.mu.Unlock()
atomic.AddUint64(&doneCount, 1)
}
}(ctx, i)
}
// read results from bigquery and add to the pool
for {
var row []bigquery.Value
if err := it.Next(&row); err != nil {
if err == iterator.Done || err == context.DeadlineExceeded {
break
}
log.Errorf("Error loading next row from BQ: %v", err)
}
rowJobs <- row
}
fmt.Println("***done loop!***")
close(rowJobs)
wg.Wait()
csvWriter.Flush()
wc.Close()
url := fmt.Sprintf("%s/%s/%s", s.config.BaseURL, s.config.GcsBucket, filename)
/// ....
}
func backgroundTimer(ctx context.Context, total uint64, done *uint64) {
ticker := time.NewTicker(10 * time.Second)
go func() {
for {
select {
case <-ctx.Done():
ticker.Stop()
return
case <-ticker.C:
fmt.Printf("progress (%d,%d)\n", atomic.LoadUint64(done), total)
}
}
}()
}
The BigQuery Read function:
func (c *Client) Read(ctx context.Context, query string) (*bigquery.RowIterator, error) {
job, err := c.bigqueryClient.Query(query).Run(ctx)
if err != nil {
return nil, err
}
it, err := job.Read(ctx)
if err != nil {
return nil, err
}
return it, nil
}
I ran this code with a query that returns about 400,000 rows. The query itself takes around 10 seconds, but the whole process takes around 2 minutes.
The output:
progress (112346,392565)
progress (123631,392565)
***done loop!***
progress (123631,392565)
progress (123631,392565)
progress (123631,392565)
progress (123631,392565)
progress (123631,392565)
progress (123631,392565)
progress (123631,392565)
worker 3 took 1m16.728143875s
progress (247525,392565)
progress (247525,392565)
progress (247525,392565)
progress (247525,392565)
progress (247525,392565)
progress (247525,392565)
progress (247525,392565)
worker 3 took 1m13.525662666s
progress (370737,392565)
progress (370737,392565)
progress (370737,392565)
progress (370737,392565)
progress (370737,392565)
progress (370737,392565)
progress (370737,392565)
progress (370737,392565)
worker 4 took 1m17.576536375s
progress (392565,392565)
You can see that writing the first 112,346 rows was fast. Then, for some reason, worker 3 took 1m16s (!!!) to write a single row, which made the other workers wait for the mutex to be released. This happened two more times, so the whole process took more than 2 minutes to finish.
I'm not sure what's going on or how I can debug this further. Why do I get these stalls in the execution?

As suggested by @serge-v, you can write all the records to a local file and then transfer the file as a whole to GCS. To shorten the transfer, you can split the file into multiple chunks and upload them with gsutil -m cp -j, where
gsutil is the command-line tool for accessing Cloud Storage
-m performs a parallel (multi-threaded/multi-processing) copy
cp copies files
-j applies gzip transport encoding to uploads of files whose extension matches the list given to the flag (-J applies it to every upload). This saves network bandwidth while leaving the data uncompressed in Cloud Storage.
To run this command from your Go program, you can refer to this GitHub link.
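A minimal sketch of the local-file approach, assuming the rows are already available as [][]interface{} (a stand-in for []bigquery.Value). The helper names writeRows, renderCSV and writeLocalCSV are hypothetical. A single goroutine formats and writes the rows, so no mutex is needed around the csv.Writer, and the upload to GCS (not shown) would start only once the file is complete.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"os"
	"strings"
)

// writeRows formats every row and writes it to w from a single goroutine.
// csv.Writer buffers internally, so it must be flushed at the end.
func writeRows(w io.Writer, rows [][]interface{}) error {
	cw := csv.NewWriter(w)
	for _, row := range rows {
		record := make([]string, len(row))
		for i, v := range row {
			record[i] = fmt.Sprint(v)
		}
		if err := cw.Write(record); err != nil {
			return err
		}
	}
	cw.Flush()
	return cw.Error()
}

// renderCSV returns the CSV text for rows; handy for checking the output.
func renderCSV(rows [][]interface{}) (string, error) {
	var sb strings.Builder
	err := writeRows(&sb, rows)
	return sb.String(), err
}

// writeLocalCSV writes rows to a temporary file and returns its path.
// The caller uploads the finished file and removes it afterwards.
func writeLocalCSV(rows [][]interface{}) (string, error) {
	f, err := os.CreateTemp("", "report-*.csv")
	if err != nil {
		return "", err
	}
	if err := writeRows(f, rows); err != nil {
		f.Close()
		return "", err
	}
	return f.Name(), f.Close()
}

func main() {
	rows := [][]interface{}{{1, "a,b"}, {true, 2.5}}
	path, err := writeLocalCSV(rows)
	if err != nil {
		fmt.Fprintln(os.Stderr, "ERROR:", err)
		return
	}
	defer os.Remove(path)
	fmt.Println("wrote", path)
}
```

With the file on disk, the upload becomes one sequential copy (or a gsutil call) instead of many small writes interleaved with row formatting.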
You could also profile your Go program. Profiling shows where the program spends its time, which helps you locate the stall.
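One possible way to look at stalls like these is an execution trace from the standard runtime/trace package. A CPU profile only shows time spent on the CPU, whereas a trace also shows goroutines that are blocked on a mutex or on I/O. The sketch below uses hypothetical helper names (withTrace, traceDemo) and a trivial sleep as the traced workload. In the real program the workload would be the report generation.

```go
package main

import (
	"fmt"
	"os"
	"runtime/trace"
	"time"
)

// withTrace runs work while recording an execution trace to path.
func withTrace(path string, work func()) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	if err := trace.Start(f); err != nil {
		return err
	}
	// Deferred calls run in reverse order: the trace is stopped (and fully
	// written) before the file is closed.
	defer trace.Stop()
	work()
	return nil
}

// traceDemo traces a trivial workload into a temporary file and returns
// the size of the resulting trace in bytes.
func traceDemo() (int64, error) {
	f, err := os.CreateTemp("", "trace-*.out")
	if err != nil {
		return 0, err
	}
	path := f.Name()
	f.Close()
	defer os.Remove(path)

	if err := withTrace(path, func() { time.Sleep(10 * time.Millisecond) }); err != nil {
		return 0, err
	}
	info, err := os.Stat(path)
	if err != nil {
		return 0, err
	}
	return info.Size(), nil
}

func main() {
	size, err := traceDemo()
	if err != nil {
		fmt.Fprintln(os.Stderr, "ERROR:", err)
		return
	}
	fmt.Println("trace bytes:", size)
}
```

A trace file written this way can be opened with go tool trace to see which goroutine held the lock during a stall and what it was waiting on.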
Since you are reading a large number of rows from BigQuery, you can try the BigQuery Storage API. It provides faster access to BigQuery-managed storage than bulk data export. Using the BigQuery Storage API instead of the row iterator used in your Go program can make the process faster.
For further reference, you can also look into the query optimization techniques documented for BigQuery.

Related

libvirt-go DomainEventLifecycleRegister "could not initialize domain event timer"

I installed libvirt-dev, then compiled and ran this code on an Ubuntu box:
package main
import (
"fmt"
"github.com/libvirt/libvirt-go"
)
func main() {
conn, _ := libvirt.NewConnect("qemu:///system")
defer conn.Close()
cb := func(c *libvirt.Connect, d *libvirt.Domain, event *libvirt.DomainEventLifecycle) {
fmt.Printf("Event %d\n", event.Event)
}
_, err := conn.DomainEventLifecycleRegister(nil, cb)
if err != nil {
panic(fmt.Sprintf("cannot register libvirt domain event: %s", err))
}
}
And got: cannot register libvirt domain event: virError(Code=1, Domain=0, Message='internal error: could not initialize domain event timer')
I'm using libvirt-go, while DigitalOcean's go-libvirt LifecycleEvents works just fine...
Any ideas?
You've not registered any event loop implementation.
The easy way is to call EventRegisterDefaultImpl before opening a libvirt connection, and then spawn a goroutine that runs EventRunDefaultImpl in an infinite loop.
The harder way is to provide your own custom event loop implementation using EventRegisterImpl.

Visually align TSV using tabs

I have a text file with fields separated by some number of consecutive tabs (so that the fields are all visually aligned). I'd like to add a lot of new fields to it from another (not aligned, pure TSV) file, while keeping everything aligned. Many values contain spaces, so only tabs (with an assumed width of 8) can be used for alignment, because I want to be able to parse the file later by splitting each line on any number of consecutive tabs. This means that I can't use tools like column or tsv-pretty, as they use spaces for alignment. Is there a tool or a short script I can use to achieve what I want?
Example:
File 1:
AA BB CCC
AAAA BBB CCC
AA BBBB CC
File 2:
DD EE FF
DDDD EE FFFF
DD EEEE FF
Result:
AA BB CCC
AAAA BBB CCC
AA BBBB CC
DD EE FF
DDDD EE FFFF
DD EEEE FF
Visual alignment is for human consumption. Don't save the file in that format; rather, when you need to view the file, use column to format it for you.
First you need to get rid of the extra tabs in your first file and combine the files:
$ cat <(tr -s '\t' <file1) file2 > file12
The resulting file12 has its columns separated by a single tab. Now you can use column -ts$'\t' file12 whenever you want to view the file, and it will align the columns for you.
This assumes you don't have missing fields.
I asked this question in the hope that there's an existing tool or a simple awk/perl one-liner that can do what I want. It looks like there isn't, so I wrote a simple tool in Go that worked for my input. It doesn't handle a lot of things that a good TSV parser should (like escaping), but maybe it'll still be useful for someone else:
package main
import (
"bufio"
"fmt"
"math"
"os"
"strings"
)
const tabWidth = 8
func tsvAlign(filenames []string) (err error) {
var lines [][]string
for _, filename := range filenames {
file, err := os.Open(filename)
if err != nil {
return err
}
defer file.Close()
scanner := bufio.NewScanner(file)
for scanner.Scan() {
lines = append(lines, strings.FieldsFunc(scanner.Text(), func(c rune) bool { return c == '\t' }))
}
}
maxFieldWidths := make([]int, len(lines[0])-1)
for i := 0; i < len(lines[0])-1; i++ {
for _, line := range lines {
if len(line[i]) > maxFieldWidths[i] {
maxFieldWidths[i] = len(line[i])
}
}
}
for _, line := range lines {
for i, field := range line[:len(line)-1] {
padding := int(math.Ceil(float64(maxFieldWidths[i]+tabWidth-maxFieldWidths[i]%tabWidth)/tabWidth - float64(len(field))/tabWidth))
fmt.Print(field, strings.Repeat("\t", padding))
}
fmt.Println(line[len(line)-1])
}
return err
}
func main() {
if len(os.Args) < 2 {
fmt.Fprintln(os.Stderr, "ERROR: No arguments provided")
return
}
err := tsvAlign(os.Args[1:])
if err != nil {
fmt.Fprintln(os.Stderr, "ERROR: ", err)
}
}
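The padding computation in the tool above can also be written in integer arithmetic. The sketch below is one way to do it, assuming a tab width of 8 and fields measured in bytes (as in the tool). The names tabsAfter and alignRows are hypothetical. Each column ends at the first tab stop past its widest field, and a field of length n already sits in tab cell n/8, so the number of tabs is the difference between the two.

```go
package main

import (
	"fmt"
	"strings"
)

const tabWidth = 8

// tabsAfter returns how many tabs must follow a field of length fieldLen so
// that the next column starts at the first tab stop past the widest field
// (maxLen) of the same column.
func tabsAfter(fieldLen, maxLen int) int {
	return maxLen/tabWidth + 1 - fieldLen/tabWidth
}

// alignRows pads every field except the last in each row with tabs.
func alignRows(rows [][]string) []string {
	// Find the widest field in each column.
	var maxLens []int
	for _, row := range rows {
		for i, f := range row {
			for len(maxLens) <= i {
				maxLens = append(maxLens, 0)
			}
			if len(f) > maxLens[i] {
				maxLens[i] = len(f)
			}
		}
	}
	out := make([]string, 0, len(rows))
	for _, row := range rows {
		var sb strings.Builder
		for i, f := range row {
			sb.WriteString(f)
			if i < len(row)-1 {
				sb.WriteString(strings.Repeat("\t", tabsAfter(len(f), maxLens[i])))
			}
		}
		out = append(out, sb.String())
	}
	return out
}

func main() {
	rows := [][]string{
		{"AA", "BB", "CCC"},
		{"AAAAAAAAA", "B", "C"},
	}
	for _, line := range alignRows(rows) {
		fmt.Println(line)
	}
}
```

Because every field is followed by at least one tab, the output can still be parsed by splitting on runs of tabs.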

JSON Marshal uint or int as integer

I'm looking for information about JSON marshaling in Go. I'll explain the situation first.
I'm developing an app for an IoT device. The app sends JSON inside an MQTT packet to our broker. Since the device uses a SIM for its data connection, I need to keep the packet size to a minimum.
Right now, the JSON has this structure:
{
"d": 1524036831,
"p": "important message"
}
The field d is a timestamp and p is the payload.
When the app sends this JSON it is 40 bytes. But if d is 1000, for example, the JSON is 34 bytes. So the marshaler converts the uint32 field d to the ASCII representation of the number and sends that string.
What I want is to send this field as a true int or uint. That is, 1524036831 is an int32, 4 bytes, the same as 1000. With this change I could save a few bytes per packet, and the number could grow up to 32 bits without the packet growing.
I read the docs for json.Marshal and did not find anything about this.
I found a "solution"; I guess it is not pretty, but it does the job. I'd like other opinions.
Ugly solution (in my opinion):
package main
import (
"encoding/binary"
"encoding/json"
"fmt"
)
type test struct {
Data uint32 `json:"d"`
Payload string `json:"p"`
}
type testB struct {
Data []byte `json:"d"`
Payload string `json:"p"`
}
func main() {
fmt.Println("TEST with uint32")
d := []test{test{Data: 5, Payload: "Important Message"}, test{Data: 10, Payload: "Important Message"}, test{Data: 1000, Payload: "Important Message"}, test{Data: 1524036831, Payload: "Important Message"}}
for _, i := range d {
j, _ := json.Marshal(i)
fmt.Println(string(j))
fmt.Println("All:", len(j))
fmt.Println("-----------")
}
fmt.Println("\nTEST with []Byte")
d1 := []testB{testB{Data: make([]byte, 4), Payload: "Important Message"}, testB{Data: make([]byte, 4), Payload: "Important Message"}, testB{Data: make([]byte, 4), Payload: "Important Message"}, testB{Data: make([]byte, 4), Payload: "Important Message"}}
binary.BigEndian.PutUint32(d1[0].Data, 5)
binary.BigEndian.PutUint32(d1[1].Data, 20)
binary.BigEndian.PutUint32(d1[2].Data, 1000)
binary.BigEndian.PutUint32(d1[3].Data, 1524036831)
for _, i := range d1 {
j, _ := json.Marshal(i)
fmt.Println(string(j))
fmt.Println(len(j))
fmt.Println("-----------")
}
}
To reiterate my comment: JSON is a text format, and text formats are not designed to produce small messages. In particular, JSON has no representation for numbers other than decimal strings.
Encoding numbers in a base larger than 10 will reduce the message size for large enough numbers.
You can reduce the size of the message your "ugly" code produces by removing leading zero bytes and encoding with base64.RawStdEncoding (which omits the padding characters). Doing this pays off for numbers >= 1e6.
If you put all this in a custom type, it becomes much nicer to use:
package main
import (
"bytes"
"encoding/base64"
"encoding/binary"
"encoding/json"
"fmt"
)
type IntB64 uint32
func (n IntB64) MarshalJSON() ([]byte, error) {
b := make([]byte, 4)
binary.BigEndian.PutUint32(b, uint32(n))
b = bytes.TrimLeft(b, "\x00")
// None of the characters in the base64 alphabet need escaping, so we don't
// have to call json.Marshal here.
l := base64.RawStdEncoding.EncodedLen(len(b)) + 2
j := make([]byte, l)
base64.RawStdEncoding.Encode(j[1:], b)
j[0], j[l-1] = '"', '"'
return j, nil
}
func main() {
enc(1) // "AQ"
enc(1000) // "A+g"
enc(1e6 - 1) // "D0I/"
enc(1e6) // "D0JA"
enc(1524036831) // "Wtb03w"
}
func enc(n int64) {
b, _ := json.Marshal(IntB64(n))
fmt.Printf("%10d %s\n", n, string(b))
}
Updated playground: https://play.golang.org/p/7Z03VE9roqN
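The receiving side needs the matching decoder. The sketch below is not part of the original answer. It assumes the same IntB64 type and encoding (big-endian bytes, leading zero bytes stripped, unpadded standard base64) and adds a hypothetical UnmarshalJSON method plus a roundTrip helper for checking it.

```go
package main

import (
	"bytes"
	"encoding/base64"
	"encoding/binary"
	"encoding/json"
	"fmt"
)

type IntB64 uint32

// MarshalJSON encodes n as a base64 string of its big-endian bytes with
// leading zero bytes removed.
func (n IntB64) MarshalJSON() ([]byte, error) {
	var buf [4]byte
	binary.BigEndian.PutUint32(buf[:], uint32(n))
	b := bytes.TrimLeft(buf[:], "\x00")
	return json.Marshal(base64.RawStdEncoding.EncodeToString(b))
}

// UnmarshalJSON reverses MarshalJSON: it decodes the base64 string and
// rebuilds the integer from at most four big-endian bytes.
func (n *IntB64) UnmarshalJSON(data []byte) error {
	var s string
	if err := json.Unmarshal(data, &s); err != nil {
		return err
	}
	b, err := base64.RawStdEncoding.DecodeString(s)
	if err != nil {
		return err
	}
	if len(b) > 4 {
		return fmt.Errorf("IntB64: %d bytes do not fit in a uint32", len(b))
	}
	var v uint32
	for _, c := range b {
		v = v<<8 | uint32(c)
	}
	*n = IntB64(v)
	return nil
}

// roundTrip marshals n and unmarshals the result again.
func roundTrip(n IntB64) (IntB64, error) {
	j, err := json.Marshal(n)
	if err != nil {
		return 0, err
	}
	var out IntB64
	err = json.Unmarshal(j, &out)
	return out, err
}

func main() {
	for _, n := range []IntB64{0, 1, 1000, 1e6, 1524036831} {
		got, err := roundTrip(n)
		fmt.Println(n, got, err)
	}
}
```

Zero encodes to an empty string here, which the decoder maps back to 0.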

Early ending of code

My program exits before all of its tasks are done. What do I have to change in my code so that it finishes all tasks before ending?
package main
import (
"fmt"
"math/rand"
"time"
)
// run x tasks at random intervals
// - a task is a goroutine that runs for 2 seconds.
// - a task runs concurrently to other task
// - the interval between task is between 0 and 2 seconds
func main() {
// set x to the number of tasks
x := 4
// random numbers generation initialization
random := rand.New(rand.NewSource(1234))
for num := 0; num < x; num++ {
// sleep for a random amount of milliseconds before starting a new task
duration := time.Millisecond * time.Duration(random.Intn(2000))
time.Sleep(duration)
// run a task
go func() {
// this is the work, expressed by sleeping for 2 seconds
time.Sleep(2 * time.Second)
fmt.Println("task done")
}()
}
}
Yes, as @Laney mentions, this can be done using either WaitGroups or channels. See the code below.
WaitGroups:
package main
import (
"fmt"
"math/rand"
"sync"
"time"
)
// run x tasks at random intervals
// - a task is a goroutine that runs for 2 seconds.
// - a task runs concurrently to other task
// - the interval between task is between 0 and 2 seconds
func main() {
// set x to the number of tasks
x := 4
var wg sync.WaitGroup
// random numbers generation initialization
random := rand.New(rand.NewSource(1234))
for num := 0; num < x; num++ {
// sleep for a random amount of milliseconds before starting a new task
duration := time.Millisecond * time.Duration(random.Intn(2000))
time.Sleep(duration)
// register one more task with the WaitGroup before starting it
wg.Add(1)
// run a task
go func() {
// this is the work, expressed by sleeping for 2 seconds
time.Sleep(2 * time.Second)
fmt.Println("task done")
wg.Done()
}()
}
wg.Wait()
fmt.Println("All tasks done")
}
Output:
task done
task done
task done
task done
All tasks done
On playground : https://play.golang.org/p/V-olyX9Qm8
Using channels:
package main
import (
"fmt"
"math/rand"
"time"
)
// run x tasks at random intervals
// - a task is a goroutine that runs for 2 seconds.
// - a task runs concurrently to other task
// - the interval between task is between 0 and 2 seconds
func main() {
//Channel to indicate completion of a task, can be helpful in sending a result value also
results := make(chan int)
// set x to the number of tasks
x := 4
t := 0 //task tracker
// random numbers generation initialization
random := rand.New(rand.NewSource(1234))
for num := 0; num < x; num++ {
// sleep for a random amount of milliseconds before starting a new task
duration := time.Millisecond * time.Duration(random.Intn(2000))
time.Sleep(duration)
//
// run a task
go func() {
// this is the work, expressed by sleeping for 2 seconds
time.Sleep(2 * time.Second)
fmt.Println("task done")
results <- 1 //may be something possibly relevant to the task
}()
}
//Iterate over the channel till the number of tasks
for result := range results {
fmt.Println("Got result", result)
t++
if t == x {
close(results)
}
}
fmt.Println("All tasks done")
}
Output:
task done
task done
Got result 1
Got result 1
task done
Got result 1
task done
Got result 1
All tasks done
Playground : https://play.golang.org/p/yAFdDj5nhb
In Go, as in most languages, the process exits when the entry-point main() function returns.
Because you're spawning a number of goroutines, main returns before they are all done, so the process exits without finishing them.
As others have suggested, you want to block your main() function until all the goroutines are done. A couple of the most common ways to do that are using sync.WaitGroup or channels (see Go by Example).
There are many options. For example, you can use channels or sync.WaitGroup.
The program ends when the main goroutine ends.
You may use:
WaitGroup - a very convenient way to wait until all tasks are done
channels - a read from a channel blocks until new data arrives or the channel is closed
naïve sleep - good only for example purposes

Receiving binary data from stdin, sending to channel in Go

I have the following test Go code, which is designed to read a binary file through stdin and send the data to a channel (where it would then be processed further). In the version given here, it only reads the first two values from stdin, but that's enough to show the problem.
package main
import (
"fmt"
"io"
"os"
)
func input(dc chan []byte) {
data := make([]byte, 2)
var err error
var n int
for err != io.EOF {
n, err = os.Stdin.Read(data)
if n > 0 {
dc <- data[0:n]
}
}
}
func main() {
dc := make(chan []byte, 1)
go input(dc)
fmt.Println(<-dc)
}
To test it, I first build it using go build, and then send data to it using the command:
./inputtest < data.bin
The data I am using currently to test is just random binary data created using the openssl command.
The problem I am having is that it misses the first values from stdin and only gives the second and later values. I think this has to do with the channel, as the same program with the channel removed produces the correct data. Has anyone come across this before? For example, I get the following output when running this command:
./inputtest < data.bin
[36 181]
Whereas I should be getting:
./inputtest < data.bin
[72 218]
(The binary data is the same in both instances.)
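You're overwriting your buffer on every read, and the channel is buffered. The slice you send shares its backing array with data, so the reader can refill the buffer before the receiver has looked at the slice it was sent. You'll lose data whenever there's space in the channel.
Try something like this (not tested, written on a tablet, etc.):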
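package main
import (
"fmt"
"os"
)
// input reads stdin in 2-byte chunks and sends each chunk on dc.
// It closes dc when reading stops and returns the read error (io.EOF at end of input).
func input(dc chan<- []byte) error {
defer close(dc)
for {
// allocate a fresh buffer per read so queued chunks are never overwritten
data := make([]byte, 2)
n, err := os.Stdin.Read(data)
if n > 0 {
dc <- data[0:n]
}
if err != nil {
return err
}
}
}
func main() {
dc := make(chan []byte, 1)
go input(dc)
fmt.Println(<-dc)
}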
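The same idea can be checked without stdin. The sketch below is an illustration under stated assumptions: it reads from any io.Reader (a bytes.Reader stands in for the binary file), and the names produce and collect are hypothetical. Each read gets its own buffer, so a chunk waiting in the buffered channel is never overwritten by the next read.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// produce reads r in chunks of up to size bytes and sends each chunk on dc.
// A fresh buffer is allocated for every read, so chunks already queued in
// the channel keep their contents.
func produce(r io.Reader, size int, dc chan<- []byte) error {
	defer close(dc)
	for {
		buf := make([]byte, size)
		n, err := r.Read(buf)
		if n > 0 {
			dc <- buf[:n]
		}
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}

// collect runs produce in a goroutine and gathers every chunk it sends.
func collect(data []byte, size int) [][]byte {
	dc := make(chan []byte, 1)
	go produce(bytes.NewReader(data), size, dc)
	var chunks [][]byte
	for chunk := range dc {
		chunks = append(chunks, chunk)
	}
	return chunks
}

func main() {
	// The four bytes from the question: the first chunk must be [72 218].
	fmt.Println(collect([]byte{72, 218, 36, 181}, 2)) // prints [[72 218] [36 181]]
}
```

Allocating per read costs some garbage. If that matters, copying the bytes into a new slice before sending, or using an unbuffered channel and finishing with each chunk before the next receive, are alternative designs.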