Gnuplotting the sorted merge of two CSV files

I am trying to merge and sort two CSV files, skipping the first 8 rows.
To sort one of the files by the 36th column I use:
awk '(NR>8 ){print; }' Hight_5x5.csv | sort -nk36
and to merge the two files:
cat Hight_5x5.csv <(tail +8 Hight_5x5_b.csv)
The sort command does not work.
I would like to use both actions in one command and send the result to gnuplot's plot command. I have tried this line:
awk '(NR>8 ){print; }' (cat Hight_5x5.csv <(tail +8 Hight_5x5_b.csv)) | sort -nk36
and it does merge the two files, but it does not sort by column 36, so I assume it will not work in the gnuplot plot command either:
plot "<awk '(NR>8 ){print; }' (cat Hight_5x5.csv <(tail +8 Hight_5x5_b.csv)) | sort -nk36"
The problem may be the format of the two files: the values are comma-separated and double-quoted. For example, ...,"0.041","3.5","40","false","1000","1.3","20","5","5","-20","2","100000000","0.8",....
This link has the two CSV files.
Regards

$ awk 'FNR>8' file1 file2 | sort -k36n
should do it; you should be able to pipe that to gnuplot as well.
I don't understand your comment; sort will sort. Perhaps you don't have 36 fields, or your separator is not whitespace, in which case you have to specify it.
Here is an example with dummy data with comma-separated fields:
$ awk 'FNR>3' <(seq 20 | paste - - -d,) <(seq 10 | shuf | paste - - -d,) | sort -t, -k2n
5,1
2,7
7,8
9,10
11,12
13,14
15,16
17,18
19,20
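For the files in the question every field is additionally wrapped in double quotes, so a numeric sort on column 36 would compare "3.5" rather than 3.5 and treat it as 0. A sketch that strips the quotes first (file names taken from the question; untested against the actual data):
$ awk 'FNR>8' Hight_5x5.csv Hight_5x5_b.csv | tr -d '"' | sort -t, -k36,36n
And, assuming the same pipeline is handed to gnuplot (the field separator must also be declared there):
set datafile separator ","
plot "< awk 'FNR>8' Hight_5x5.csv Hight_5x5_b.csv | tr -d '\"' | sort -t, -k36,36n"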

Related

Get difference between two csv files based on column using bash

I have two csv files, a.csv and b.csv; both come with no headers and each value in a row is separated by \t.
a.csv
1 apple
2 banana
3 orange
4 pear
b.csv
apple 0.89
banana 0.57
cherry 0.34
I want to subtract these two files and get the difference between the second column in a.csv and the first column in b.csv (something like a.csv[1] - b.csv[0]), which would give me another file c.csv that looks like:
orange
pear
Instead of using Python or other programming languages, I want to use a bash command to complete this task, and I found that awk would be helpful, but I'm not sure how to write the correct command. Here is another similar question, but its second answer uses awk '{print $2,$6-$13}' to get the numeric difference between values rather than the difference in occurrence.
Thanks, and I appreciate any help.
You can easily do this with Steve's answer from the link you are referring to, with a bit of tweaking. I'm not sure the other answer, using paste, will solve this problem.
Create a hash-map from the second file b.csv and compare it against the 2nd column of a.csv:
awk -v FS="\t" 'BEGIN { OFS = FS } FNR == NR { unique[$1]; next } !($2 in unique) { print $2 }' b.csv a.csv
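For the sample files, this should print:
orange
pear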
To redirect the output to a new file, append > c.csv at the end of the previous command.
Set the field separators (input and output) to \t since you are reading a tab-delimited file.
FNR == NR { action; next } { action } f1 f2 is a general construct you find in many awk commands that work on more than one file. The block guarded by FNR == NR runs while the first file argument is being read, and the next block within {..} runs on the second file argument.
The part unique[$1]; next builds a hash-map unique whose keys are the values of the first column of b.csv; it runs for every line of that file.
After that file is completely processed, on the next file a.csv we test !($2 in unique), which selects those lines whose $2 is not among the keys of the unique hash-map built from the first file.
On those lines, print only the second column: { print $2 }
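As a minimal illustration of the two-file construct (dummy files f1 and f2, names made up for this example):
$ printf 'x\ny\n' > f1; printf 'y\nz\n' > f2
$ awk 'FNR == NR { seen[$1]; next } !($1 in seen)' f1 f2
z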
Assuming your real data is sorted on the columns you care about like your sample data is:
$ comm -23 <(cut -f2 a.tsv) <(cut -f1 b.tsv)
orange
pear
This uses comm to print out the entries in the first file that aren't in the second one, after using cut to get just the columns you care about.
If not already sorted:
comm -23 <(cut -f2 a.tsv | sort) <(cut -f1 b.tsv | sort)
If you want to use Miller (https://github.com/johnkerl/miller), a clean and easy tool, the command could be
mlr --nidx --fs "\t" join --ul --np -j join -l 2 -r 1 -f 01.txt then cut -f 2 02.txt
It gives you
orange
pear
It's a join that does not emit paired records (--np) and emits unpaired records from the left file (--ul).

How to use Bash to create arrays with values from the same line of many files?

I have a number of files (in the same folder) all with the same number of lines:
a.txt
20
3
10
15
15
b.txt
19
4
5
8
8
c.txt
2
4
9
21
5
Using Bash, I'd like to create an array of arrays that contain the value of each line in every file. So, line 1 from a.txt, b.txt, and c.txt. The same for lines 2 to 5, so that in the end it looks like:
[
[20, 19, 2],
[3, 4, 4],
...
[15, 8, 5]
]
Note: I messed up the formatting and wording. I've changed this now.
I'm actually using jq to get these lists in the first place, as they're originally specific values within a JSON file I download every X minutes. I used jq to get the values I needed into different files as I thought that would get me further, but now I'm not sure that was the way to go. If it helps, here is the original JSON file I download and start with.
I've looked at various questions that somewhat deal with this:
Creating an array from a text file in Bash
Bash Script to create a JSON file
JQ create json array using bash
Among others. But none of these deal with taking the value of the same line from various files. I don't know Bash well enough to do this and any help is greatly appreciated.
Here’s one approach:
$ jq -c -n '[$a,$b,$c] | transpose' --slurpfile a a.txt --slurpfile b b.txt --slurpfile c c.txt
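With the sample files a.txt, b.txt, and c.txt above, this should print:
[[20,19,2],[3,4,4],[10,5,9],[15,8,21],[15,8,5]]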
Generalization to an arbitrary number of files
In the following, we'll assume that the files to be processed can be specified by *.txt in the current directory:
jq -n -c '
[reduce inputs as $i ({}; .[input_filename] += [$i]) | .[]]
| transpose' *.txt
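Here inputs streams every value from the *.txt files in order, the reduce gathers them into one array per input_filename, and transpose zips those arrays together; for the three sample files it should print the same [[20,19,2],[3,4,4],[10,5,9],[15,8,21],[15,8,5]].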
Use paste to join the files, then read the input as raw text, splitting on the tabs inserted by paste:
$ paste a.txt b.txt c.txt | jq -Rc 'split("\t") | map(tonumber)'
[20,19,2]
[3,4,4]
[10,5,9]
[15,8,21]
[15,8,5]
If you want to gather the entire result into a single array, pipe it into another instance of jq in slurp mode. (There's probably a way to do it with a single invocation of jq, but this seems simpler.)
$ paste a.txt b.txt c.txt | jq -R 'split("\t") | map(tonumber)' | jq -sc
[[20,19,2],[3,4,4],[10,5,9],[15,8,21],[15,8,5]]
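For what it's worth, here is one way to do it in a single invocation; a sketch using inputs under the same assumptions:
$ paste a.txt b.txt c.txt | jq -Rnc '[inputs | split("\t") | map(tonumber)]'
[[20,19,2],[3,4,4],[10,5,9],[15,8,21],[15,8,5]]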
I could not come up with a simple way, but here's one that does it.
1. Join the files and create a CSV-like file
If your machine has join, you can create joined records from two files (like the JOIN command in SQL).
To do this, make sure your files are sorted.
The easiest way, I think, is just to number each line. This works like a primary key in SQL.
$ nl a.txt > a.txt.nl
$ nl b.txt > b.txt.nl
$ nl c.txt > c.txt.nl
Now you can join the sorted files into one. Note that join can join only two files at once, which is why I piped the output to the next join.
$ join a.txt.nl b.txt.nl | join - c.txt.nl > conc.txt
now conc.txt is:
1 20 19 2
2 3 4 4
3 10 5 9
4 15 8 21
5 15 8 5
2. Create JSON from the CSV-like file
This part seems a little complicated.
jq -Rsn '
[inputs
| . / "\n"
| (.[] | select((. | length) > 0) | . / " ") as $input
| [$input[1], $input[2], $input[3] ] ]
' <conc.txt
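Note that this emits the values as strings (e.g. ["20","19","2"]). If numbers are needed, a map(tonumber) can be appended; a sketch along the same lines:
$ jq -Rsnc '[inputs / "\n" | .[] | select(length > 0) | (. / " ")[1:4] | map(tonumber)]' < conc.txt
[[20,19,2],[3,4,4],[10,5,9],[15,8,21],[15,8,5]]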
Actually, I do not know the detailed syntax or usage of jq, but it seems to be doing this:
split the input file by \n
split a given line by spaces, then select the valid data
put the split records in the appropriate locations by their index
I used this question as a reference:
https://stackoverflow.com/a/44781106/10675437

linux command-line update csv file inline using value from another column that is json

I have a large CSV file that contains several columns. One of the columns is a JSON string. I am trying to extract a specific value from the column that contains the JSON and add that value to the row as its own column.
I've tinkered around a little with sed and awk to try to do this, but really I'm just spinning my wheels.
I'm also trying to do this as an inline file edit. The CSV is tab-delimited.
The value I'm trying to put in its own column is the value for destinationIDUsage
Sample row (highly trimmed down for readability here):
2017-03-22 00:00:01 %key%94e901fd3ceef351a0ad770e0be91d38 10 3.0.0 [{"MC_LIVEREPEATER":false},{"environment":"details"},{"feature":"pushPublishUsage","destinationIDUsage":876543}] false
The end result for the row should have 876543 as a value in its own column, like this:
2017-03-22 00:00:01 %key%94e901fd3ceef351a0ad770e0be91d38 10 3.0.0 [{"MC_LIVEREPEATER":false},{"environment":"details"},{"feature":"pushPublishUsage","destinationIDUsage":876543}] 876543 false
Any help is greatly appreciated.
Something like this seems to do the job.
$ echo "$a"
2017-03-22 00:00:01 %key%94e901fd3ceef351a0ad770e0be91d38 10 3.0.0 [{MC_LIVEREPEATER:false},{environment:details},{feature:pushPublishUsage,destinationIDUsage:876543}] false
$ echo "$a" |awk '{for (i=1;i<=NF;i++) {if ($i~/destinationIDU/) {match($i,/(.*)(destinationIDUsage:)(.*)(})/,f);extra=f[3]}}}{prev=NF;$(NF+1)=$prev;$(NF-1)=extra}1'
2017-03-22 00:00:01 %key%94e901fd3ceef351a0ad770e0be91d38 10 3.0.0 [{MC_LIVEREPEATER:false},{environment:details},{feature:pushPublishUsage,destinationIDUsage:876543}] 876543 false
It's possible, though, that the awk experts in here will propose something different and maybe better.
With GNU awk for the 3rd arg to match():
$ awk 'BEGIN{FS=OFS="\t"} {match($6,/"destinationIDUsage":([0-9]+)/,a); $NF=a[1] OFS $NF}1' file
2017-03-22 00:00:01 %key%94e901fd3ceef351a0ad770e0be91d38 10 3.0.0 [{"MC_LIVEREPEATER":false},{"environment":"details"},{"feature":"pushPublishUsage","destinationIDUsage":876543}] 876543 false
Add -i inplace for "inplace" editing or just do awk 'script' file > tmp && mv tmp file like you can with any UNIX tool.
Here is a solution using jq
If the file filter.jq contains
split("\n")[] # split string into lines
| select(length>0) # eliminate blanks
| split("\t") # split data rows by tabs
| (.[5] | fromjson | add) as $f # expand json
| .[:-1] + [$f.destinationIDUsage] + .[-1:] # add destinationIDUsage column
| #tsv # convert to tab-separated
and the file data contains the sample data, then the command
jq -M -R -s -r -f filter.jq data
will produce the output with the additional column
2017-03-22 00:00:01 %key%94e901fd3ceef351a0ad770e0be91d38 10 3.0.0 [{"MC_LIVEREPEATER":false},{"environment":"details"},{"feature":"pushPublishUsage","destinationIDUsage":876543}] 876543 false
To edit the file inplace you can make use of a tool like sponge as described in this answer:
Manipulate JSON with jq
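A sketch of that in-place edit with sponge (from moreutils), assuming the filter.jq and data files above:
$ jq -M -R -s -r -f filter.jq data | sponge data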

How can I compare columns from two csv files by key by Linux command line?

I have two CSV files:
hogehoge.csv
1,aaa,bbb
2,ccc,ddd
3,eee,fff
4,ggg,hhh
5,iii,jjj
6,kkk,lll
7,mmm,nnn
8,ooo,ppp
9,qqq,rrr
10,sss,ttt
hogehoge2.csv
1,aaa,bb
2,ccc,ddd
3,eee,fff
4,ggg,hhh
5,iii,jjj
7,mmm,nnn
8,ooo,ppp
9,qqq,rrr
10,sss,ttt
I want to get a result like this from the command line (diff/cut/awk):
6,kkk,lll
There is also a difference on the 1st line (bbb vs bb), but I want to ignore that one.
As the question is stated, you simply want to compare two files line-by-line. comm may be a good choice:
comm -3 hogehoge.csv hogehoge2.csv
If you want to ignore the first line of each file:
comm -3 <(tail -n +2 hogehoge.csv) <(tail -n +2 hogehoge2.csv)
which will print exactly the output you specified. Note: comm -3 prints the lines that differ between the files, and the lines unique to the second file are indented with a tab. (comm expects lexicographically sorted input; with this data GNU comm may warn that 10,sss,ttt is out of order, but it still produces the expected result.) To remove the tabs:
comm -3 <(tail -n +2 hogehoge.csv) <(tail -n +2 hogehoge2.csv) | sed $'s/\t*//'
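To illustrate the indentation with dummy input:
$ comm -3 <(printf 'a\nb\n') <(printf 'b\nc\n')
a
	c
Here a is unique to the first file (column 1, no indent) and c is unique to the second file (column 2, indented by one tab).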