Visually align TSV using tabs - csv

I have a text file whose fields are separated by some number of consecutive tabs (so that the fields are all visually aligned). I'd like to add a lot of new fields to it from another (unaligned, pure TSV) file while keeping everything aligned. Many values contain spaces, so only tabs (with an assumed width of 8) can be used for alignment, because I want to be able to parse the file later by splitting each line on any run of consecutive tabs. This means that I can't use tools like column or tsv-pretty, as they use spaces for alignment. Is there a tool or a short script I can use to achieve what I want?
Example:
File 1:
AA BB CCC
AAAA BBB CCC
AA BBBB CC
File 2:
DD EE FF
DDDD EE FFFF
DD EEEE FF
Result:
AA BB CCC
AAAA BBB CCC
AA BBBB CC
DD EE FF
DDDD EE FFFF
DD EEEE FF

Visual alignment is for human consumption. Don't save the file in that format; instead, when you need to view the file, use column to format it for you.
First you need to squeeze the extra tabs out of your first file and combine the files:
$ cat <(tr -s '\t' <file1) file2 > file12
The result, file12, has exactly one tab between fields. Now you can run column -ts$'\t' file12 whenever you want to view the file, and it will align the columns for you.
This assumes you don't have missing (empty) fields, since squeezing the tabs would merge them away.

I asked this question in hope that there's an existing tool or a simple awk/perl one-liner that can do what I want. Looks like there isn't, so I wrote a simple tool in Go that worked for my input. It doesn't handle a lot of things that a good tsv parser should (like escaping) but maybe it'll still be useful for someone else:
package main
import (
"bufio"
"fmt"
"math"
"os"
"strings"
)
const tabWidth = 8
func tsvAlign(filenames []string) (err error) {
var lines [][]string
for _, filename := range filenames {
file, err := os.Open(filename)
if err != nil {
return err
}
defer file.Close()
scanner := bufio.NewScanner(file)
for scanner.Scan() {
lines = append(lines, strings.FieldsFunc(scanner.Text(), func(c rune) bool { return c == '\t' }))
}
}
maxFieldWidths := make([]int, len(lines[0])-1, len(lines[0])-1)
for i := 0; i < len(lines[0])-1; i++ {
for _, line := range lines {
if len(line[i]) > maxFieldWidths[i] {
maxFieldWidths[i] = len(line[i])
}
}
}
for _, line := range lines {
for i, field := range line[:len(line)-1] {
// Tabs needed to reach the first tab stop past the widest value in this column.
padding := int(math.Ceil(float64(maxFieldWidths[i]+tabWidth-maxFieldWidths[i]%tabWidth)/tabWidth - float64(len(field))/tabWidth))
fmt.Print(field, strings.Repeat("\t", padding))
}
fmt.Println(line[len(line)-1])
}
return err
}
func main() {
if len(os.Args) < 2 {
fmt.Fprintln(os.Stderr, "ERROR: No arguments provided")
return
}
err := tsvAlign(os.Args[1:])
if err != nil {
fmt.Fprintln(os.Stderr, "ERROR: ", err)
}
}
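The padding computation in the program above can also be done in integer arithmetic. This is only a minimal sketch: it assumes the rows are already split into fields and that a field's display width equals its rune count (which does not hold for wide or combining characters). The names tabsNeeded and alignRows are made up for this illustration.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

const tabWidth = 8

// tabsNeeded returns how many tabs must follow a field of width fieldWidth
// so that the next field starts at the first tab stop past maxWidth,
// the widest value in the same column.
func tabsNeeded(fieldWidth, maxWidth int) int {
	columnStops := maxWidth/tabWidth + 1 // tab stops the whole column occupies
	usedStops := fieldWidth / tabWidth   // tab stops the field already passed
	return columnStops - usedStops
}

// alignRows joins the fields of each row with enough tabs to line the
// columns up. The last field of a row gets no trailing tabs.
func alignRows(rows [][]string) []string {
	var widths []int
	for _, row := range rows {
		for i, f := range row {
			if i == len(widths) {
				widths = append(widths, 0)
			}
			if w := utf8.RuneCountInString(f); w > widths[i] {
				widths[i] = w
			}
		}
	}
	out := make([]string, 0, len(rows))
	for _, row := range rows {
		var sb strings.Builder
		for i, f := range row {
			sb.WriteString(f)
			if i < len(row)-1 {
				n := tabsNeeded(utf8.RuneCountInString(f), widths[i])
				sb.WriteString(strings.Repeat("\t", n))
			}
		}
		out = append(out, sb.String())
	}
	return out
}

func main() {
	rows := [][]string{
		{"AA", "BB", "CCC"},
		{"AAAAAAAAA", "BBB", "CCC"},
	}
	for _, line := range alignRows(rows) {
		fmt.Println(line)
	}
}
```

Because only tabs are emitted between fields, the output can still be parsed by splitting each line on runs of tabs, as the question requires.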

Related

Writing data from bigquery to csv is slow

I wrote code that behaves strangely and runs slowly, and I can't understand why.
What I'm trying to do is download data from BigQuery (using a query as input) to a CSV file, then create a URL for this CSV so people can download it as a report.
I'm trying to optimize the process of writing the CSV, as it takes some time and shows some weird behavior.
The code iterates over the BigQuery results and passes each result to a channel for parsing/writing with Go's encoding/csv package.
These are the relevant parts, with some debugging:
func (s *Service) generateReportWorker(ctx context.Context, query, reportName string) error {
it, err := s.bigqueryClient.Read(ctx, query)
if err != nil {
return err
}
filename := generateReportFilename(reportName)
gcsObj := s.gcsClient.Bucket(s.config.GcsBucket).Object(filename)
wc := gcsObj.NewWriter(ctx)
wc.ContentType = "text/csv"
wc.ContentDisposition = "attachment"
csvWriter := csv.NewWriter(wc)
var doneCount uint64
go backgroundTimer(ctx, it.TotalRows, &doneCount)
rowJobs := make(chan []bigquery.Value, it.TotalRows)
workers := 10
wg := sync.WaitGroup{}
wg.Add(workers)
// start workers pool
for i := 0; i < workers; i++ {
go func(c context.Context, num int) {
defer wg.Done()
for row := range rowJobs {
records := make([]string, len(row))
for j, r := range row {
records[j] = fmt.Sprintf("%v", r)
}
s.mu.Lock()
start := time.Now()
if err := csvWriter.Write(records); err != nil {
log.Errorf("Error writing row: %v", err)
}
if time.Since(start) > time.Second {
fmt.Printf("worker %d took %v\n", num, time.Since(start))
}
s.mu.Unlock()
atomic.AddUint64(&doneCount, 1)
}
}(ctx, i)
}
// read results from bigquery and add to the pool
for {
var row []bigquery.Value
if err := it.Next(&row); err != nil {
if err == iterator.Done || err == context.DeadlineExceeded {
break
}
log.Errorf("Error loading next row from BQ: %v", err)
}
rowJobs <- row
}
fmt.Println("***done loop!***")
close(rowJobs)
wg.Wait()
csvWriter.Flush()
wc.Close()
url := fmt.Sprintf("%s/%s/%s", s.config.BaseURL, s.config.GcsBucket, filename)
/// ....
}
func backgroundTimer(ctx context.Context, total uint64, done *uint64) {
ticker := time.NewTicker(10 * time.Second)
go func() {
for {
select {
case <-ctx.Done():
ticker.Stop()
return
case <-ticker.C:
fmt.Printf("progress (%d,%d)\n", atomic.LoadUint64(done), total)
}
}
}()
}
bigquery Read func
func (c *Client) Read(ctx context.Context, query string) (*bigquery.RowIterator, error) {
job, err := c.bigqueryClient.Query(query).Run(ctx)
if err != nil {
return nil, err
}
it, err := job.Read(ctx)
if err != nil {
return nil, err
}
return it, nil
}
I run this code with a query that returns about 400,000 rows. The query itself takes around 10 seconds, but the whole process takes around 2 minutes.
The output:
progress (112346,392565)
progress (123631,392565)
***done loop!***
progress (123631,392565)
progress (123631,392565)
progress (123631,392565)
progress (123631,392565)
progress (123631,392565)
progress (123631,392565)
progress (123631,392565)
worker 3 took 1m16.728143875s
progress (247525,392565)
progress (247525,392565)
progress (247525,392565)
progress (247525,392565)
progress (247525,392565)
progress (247525,392565)
progress (247525,392565)
worker 3 took 1m13.525662666s
progress (370737,392565)
progress (370737,392565)
progress (370737,392565)
progress (370737,392565)
progress (370737,392565)
progress (370737,392565)
progress (370737,392565)
progress (370737,392565)
worker 4 took 1m17.576536375s
progress (392565,392565)
You can see that writing the first 112,346 rows was fast; then, for some reason, worker 3 took 1m16s (!!!) to write a single row, which made the other workers wait for the mutex to be released. This happened two more times, which caused the whole process to take more than 2 minutes to finish.
I'm not sure what's going on or how I can debug this further. Why do I get these stalls in the execution?
As suggested by @serge-v, you can write all the records to a local file and then transfer the file as a whole to GCS. To shorten the transfer you can split the file into multiple chunks and upload them with gsutil -m cp -j, where:
gsutil is the command-line tool for accessing Cloud Storage
-m performs a parallel (multi-threaded/multi-processing) copy
cp copies files
-j <ext,...> applies gzip transport encoding to uploads of files whose extension is in the given list (-J does the same for every file). This saves network bandwidth while leaving the data uncompressed in Cloud Storage.
To run this command from your Go program you can refer to this GitHub link.
You could also try profiling your Go program. Profiling shows where the program spends its time, which helps you locate the stall.
Since you are reading hundreds of thousands of rows from BigQuery, you can try using the BigQuery Storage API. It provides faster access to BigQuery-managed storage than bulk data export. Using the BigQuery Storage API rather than the iterators in your Go program can make the process faster.
For further reference you can also look into the query optimization techniques provided by BigQuery.

Time JSON marshals to 0 time

I have the following code which primarily marshals and un-marshals a time struct. Here is the code
package main
import (
"fmt"
"time"
"encoding/json"
)
type check struct{
A time.Time `json:"a"`
}
func main(){
ds := check{A:time.Now().Truncate(0)}
fmt.Println(ds)
dd, _ := json.Marshal(ds)
d2 := check {}
json.Unmarshal(dd, d2)
fmt.Println(d2)
}
here is the output it produces
{2019-05-20 15:20:16.247914 +0530 IST}
{0001-01-01 00:00:00 +0000 UTC}
The first line is the original time and the second line is the time after the unmarshalling. Why do we have this loss of information with JSON conversions? How to prevent this?
Thanks.
Go vet tells you exactly what the problem is:
./prog.go:18:16: call of Unmarshal passes non-pointer as second argument
Also, never ignore errors! The least you can do is print them:
ds := check{A: time.Now().Truncate(0)}
fmt.Println(ds)
dd, err := json.Marshal(ds)
fmt.Println(err)
d2 := check{}
err = json.Unmarshal(dd, d2)
fmt.Println(err)
fmt.Println(d2)
This will output (try it on the Go Playground):
{2009-11-10 23:00:00 +0000 UTC}
<nil>
json: Unmarshal(non-pointer main.check)
{0001-01-01 00:00:00 +0000 UTC}
You have to pass a pointer to json.Unmarshal() for it to be able to unmarshal into (change) your value:
err = json.Unmarshal(dd, &d2)
With this change output will be (try it on the Go Playground):
{2009-11-10 23:00:00 +0000 UTC}
<nil>
<nil>
{2009-11-10 23:00:00 +0000 UTC}

JSON Marshal uint or int as integer

I'm looking for information about the json marshal with Go. I'll explain the situation first.
I'm developing an app for an IoT device. The app sends JSON inside an MQTT packet to our broker. Since the device uses a SIM for its data connection, I need to keep the packet as small as possible.
Right now, the JSON has this structure:
{
"d": 1524036831,
"p": "important message"
}
The field d is a timestamp and p is the payload.
When the app sends this JSON it is 40 bytes long. But if d is 1000, for example, the JSON is 34 bytes. So the marshaller converts the uint32 field d to the ASCII representation of the number and sends that string.
What I want is to send this field as a true int or uint. That is, 1524036831 is an int32, 4 bytes, the same as 1000. With this change I could shave some bytes off the packet, and the number could grow up to 32 bits without the packet growing.
I read the docs for json.Marshal and did not find anything about this.
I found a "solution", but I guess it is not pretty, although it does the job. I'd like other opinions.
Ugly solution (for me):
package main
import (
"encoding/binary"
"encoding/json"
"fmt"
)
type test struct {
Data uint32 `json:"d"`
Payload string `json:"p"`
}
type testB struct {
Data []byte `json:"d"`
Payload string `json:"p"`
}
func main() {
fmt.Println("TEST with uint32")
d := []test{test{Data: 5, Payload: "Important Message"}, test{Data: 10, Payload: "Important Message"}, test{Data: 1000, Payload: "Important Message"}, test{Data: 1524036831, Payload: "Important Message"}}
for _, i := range d {
j, _ := json.Marshal(i)
fmt.Println(string(j))
fmt.Println("All:", len(j))
fmt.Println("-----------")
}
fmt.Println("\nTEST with []Byte")
d1 := []testB{testB{Data: make([]byte, 4), Payload: "Important Message"}, testB{Data: make([]byte, 4), Payload: "Important Message"}, testB{Data: make([]byte, 4), Payload: "Important Message"}, testB{Data: make([]byte, 4), Payload: "Important Message"}}
binary.BigEndian.PutUint32(d1[0].Data, 5)
binary.BigEndian.PutUint32(d1[1].Data, 20)
binary.BigEndian.PutUint32(d1[2].Data, 1000)
binary.BigEndian.PutUint32(d1[3].Data, 1524036831)
for _, i := range d1 {
j, _ := json.Marshal(i)
fmt.Println(string(j))
fmt.Println(len(j))
fmt.Println("-----------")
}
}
To reiterate my comment: JSON is a text format, and text formats are not designed to produce small messages. In particular, JSON has no representation for numbers other than decimal strings.
Encoding numbers in a base larger than 10 will reduce the message size for large enough numbers.
You can reduce the size of the messages your "ugly" code produces by removing leading zero bytes and encoding with base64.RawStdEncoding (which omits the padding characters). Doing this pays off for numbers >= 1e6.
If you put this all in a custom type it becomes much nicer to use:
package main
import (
"bytes"
"encoding/base64"
"encoding/binary"
"encoding/json"
"fmt"
)
type IntB64 uint32
func (n IntB64) MarshalJSON() ([]byte, error) {
b := make([]byte, 4)
binary.BigEndian.PutUint32(b, uint32(n))
b = bytes.TrimLeft(b, "\x00") // drop leading zero bytes
// All characters in the base64 alphabet need not be escaped, so we don't
// have to call json.Marshal here.
l := base64.RawStdEncoding.EncodedLen(len(b)) + 2
j := make([]byte, l)
base64.RawStdEncoding.Encode(j[1:], b)
j[0], j[l-1] = '"', '"'
return j, nil
}
func main() {
enc(1) // "AQ"
enc(1000) // "A+g"
enc(1e6 - 1) // "D0I/"
enc(1e6) // "D0JA"
enc(1524036831) // "Wtb03w"
}
func enc(n int64) {
b, _ := json.Marshal(IntB64(n))
fmt.Printf("%10d %s\n", n, string(b))
}
Updated playground: https://play.golang.org/p/7Z03VE9roqN

Compress JSON file by eliminating whitespace

I am working with a large JSON file (~100,000 lines) and need to compress it to make a program run faster. I wish to delete all the horizontal tabs, line breaks, etc. to minimize the size of the file.
For example if a line was originally:
"name_id": "Richard Feynman",
"occupation": "Professional Bongos Player"
it should be compressed to:
"name_id":"Richard Feynman","occupation":"Professional Bongos Player"
I have scoured the Internet (forgive me if it is a simple answer, I am a beginner) and can't seem to find a command for the terminal that will help me do this. Any help would be much appreciated
Looks like you're looking for a JSON minifier.
There are some around, both online and standalone.
Try googling these terms + your favorite language, I'm sure you'll find something that suits your needs.
There are other tools that modify your JSON to make it smaller, but you'll end up with a different JSON, I guess. Haven't tried those.
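As one concrete example of such a minifier, Go's standard library has json.Compact, which strips insignificant whitespace without touching whitespace inside strings. A minimal sketch, assuming the input is a complete, valid JSON document that fits in memory; minify is a made-up helper name:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"os"
)

// minify returns src with all insignificant whitespace removed.
// It fails if src is not valid JSON.
func minify(src []byte) ([]byte, error) {
	var dst bytes.Buffer
	if err := json.Compact(&dst, src); err != nil {
		return nil, err
	}
	return dst.Bytes(), nil
}

func main() {
	// Read the whole document from stdin and write the result to stdout.
	src, err := io.ReadAll(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, "ERROR:", err)
		os.Exit(1)
	}
	out, err := minify(src)
	if err != nil {
		fmt.Fprintln(os.Stderr, "ERROR:", err)
		os.Exit(1)
	}
	fmt.Println(string(out))
}
```

Built as a binary named, say, minify, it could be run from the terminal as ./minify < big.json > small.json.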
Using GNU awk for RT:
$ awk 'BEGIN{RS="\""} NR%2{gsub(/[[:space:]]/,"")} {ORS=RT;print} END{printf "\n"}' file
"name_id":"Richard Feynman","occupation":"Professional Bongos Player"
The following flex(1) program will do the work. It performs a lexical analysis of the JSON source and eliminates comments and spaces between tokens, while preserving the spaces inside strings. It also recognizes unquoted identifiers and quotes them.
To compile it, just do
make json
Use it with the following command:
json [ file ... ]
If you don't specify a file, the program will read from stdin.
Here's the source:
%{
/* json-min. JSON minimizer.
* Author: Luis Colorado <lc@luiscoloradosistemas.com>
* Date: Wed Aug 13 07:35:23 EEST 2014
* Disclaimer: This program is GPL, as of GPL version 3, you
* may have received a copy of that document, or you can
* instead look at http://www.gnu.org/licenses/gpl.txt to read
* it. There's no warranty, nor assumed nor implicit on the
* use of this program, you receive it `as is' so whatever you
* do with it is only your responsibility. Luis Colorado
* won't assume any responsibility of the use or misuse of
* this program. You are warned.
*/
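#include <errno.h>  /* errno */
#include <stdio.h>  /* fprintf, fopen, fclose */
#include <stdlib.h> /* exit, EXIT_FAILURE */
#include <string.h> /* strerror */
%}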
dec ([1-9][0-9]*)
oct (0[0-7]*)
hex (0[xX][0-9a-fA-F]*)
doub ({dec}"."([0-9]*)?|{dec}?"."[0-9]+)
strd (\"([^\"]|\\.)*\")
t "true"
f "false"
n "null"
com1 "//".*
com2b "/*"
endc "*/"
ident ([a-zA-Z_][a-zA-Z0-9_]*)
%x INCOMMENT
%option noyywrap
%%
{dec} |
{oct} |
{hex} |
{doub} |
{strd} |
{t} |
{f} |
{n} |
"{" |
":" |
";" |
"}" |
"[" |
"]" |
"," ECHO;
[\ \t\n] |
{com1} ;
{com2b} BEGIN(INCOMMENT);
<INCOMMENT>.|\n ;
<INCOMMENT>{endc} BEGIN(INITIAL);
{ident} { fprintf(stderr, "WARNING:"
"unquoted identifier %s "
"in source. Quoting.\n",
yytext);
printf("\"%s\"", yytext);
}
. { fprintf(stderr,
"WARNING: unknown symbol %s "
"in source, copied to output\n",
yytext);
ECHO;
}
%%
void process(const char *fn);
int main(int argc, const char **argv)
{
int i;
if (argc > 1) for (i = 1; i < argc; i++)
process(argv[i]);
else process(NULL); /* <-- stdin */
} /* main */
void process(const char *fn)
{
FILE *f = stdin;
if (fn) {
f = fopen(fn, "r");
if (!f) {
fprintf(stderr,
"ERROR:fopen:%s:%s(errno=%d)\n",
fn, strerror(errno), errno);
exit(EXIT_FAILURE);
} /* if */
} /* if */
yyin = f;
yylex();
if (fn) /* only close if we opened, don't close stdin. */
fclose(f);
printf("\n");
}
I have just written it, so it has had little testing. Use it with care (keep a backup of your original file). It doesn't overwrite the original file; it just writes to stdout, so you won't lose your data by using it.
BR,
Luis

Receiving binary data from stdin, sending to channel in Go

I have the following test Go code, which is designed to read a binary file through stdin and send the data it reads to a channel (where it would then be processed further). In the version I've given here, it only reads the first two values from stdin, which is enough to show the problem.
package main
import (
"fmt"
"io"
"os"
)
func input(dc chan []byte) {
data := make([]byte, 2)
var err error
var n int
for err != io.EOF {
n, err = os.Stdin.Read(data)
if n > 0 {
dc <- data[0:n]
}
}
}
func main() {
dc := make(chan []byte, 1)
go input(dc)
fmt.Println(<-dc)
}
To test it, I first build it using go build, and then send data to it using the command-
./inputtest < data.bin
The data I am using currently to test is just random binary data created using the openssl command.
The problem I am having is that it misses the first values from Stdin, and only gives the second and greater values. I think this is to do with the channel, as the same script with the channel removed produces the correct data. Has anyone come across this before? For example, I get the following output when running this command-
./inputtest < data.bin
[36 181]
Whereas I should be getting-
./inputtest < data.bin
[72 218]
(The binary data is the same in both instances.)
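You're reusing the same buffer for every read, and the channel is buffered, so the goroutine can read again, overwriting the bytes of the slice it just sent, before the receiver has looked at them. You lose data every time there's space in the channel.
Try something like this, which allocates a fresh buffer for each read (not tested, written on a tablet, etc.):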
import "os"
func input(dc chan []byte) error {
defer close(dc)
for {
data := make([]byte, 2)
n, err := os.Stdin.Read(data)
if n > 0 {
dc <- data[0:n]
}
if err != nil {
return err
}
}
return nil
}