How to use Bash to create arrays with values from the same line of many files?

I have a number of files (in the same folder) all with the same number of lines:
a.txt
20
3
10
15
15
b.txt
19
4
5
8
8
c.txt
2
4
9
21
5
Using Bash, I'd like to create an array of arrays that contains the value from each line of every file. So, line 1 from a.txt, b.txt, and c.txt; the same for lines 2 to 5, so that in the end it looks like:
[
[20, 19, 2],
[3, 4, 4],
...
[15, 8, 5]
]
I'm actually using jq to get these lists in the first place, as they're originally specific values within a JSON file I download every X minutes. I used jq to get the values I needed into different files as I thought that would get me further, but now I'm not sure that was the way to go. If it helps, here is the original JSON file I download and start with.
I've looked at various questions that somewhat deal with this:
Creating an array from a text file in Bash
Bash Script to create a JSON file
JQ create json array using bash
Among others. But none of these deal with taking the value of the same line from various files. I don't know Bash well enough to do this and any help is greatly appreciated.

Here’s one approach:
$ jq -c -n '[$a,$b,$c] | transpose' --slurpfile a a.txt --slurpfile b b.txt --slurpfile c c.txt
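With the sample files above, this should print:
[[20,19,2],[3,4,4],[10,5,9],[15,8,21],[15,8,5]]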
Generalization to an arbitrary number of files
In the following, we'll assume that the files to be processed can be specified by *.txt in the current directory:
jq -n -c '
[reduce inputs as $i ({}; .[input_filename] += [$i]) | .[]]
| transpose' *.txt

Use paste to join the files, then read the input as raw text, splitting on the tabs inserted by paste:
$ paste a.txt b.txt c.txt | jq -Rc 'split("\t") | map(tonumber)'
[20,19,2]
[3,4,4]
[10,5,9]
[15,8,21]
[15,8,5]
If you want to gather the entire result into a single array, pipe it into another instance of jq in slurp mode. (There's probably a way to do it with a single invocation of jq, but this seems simpler.)
$ paste a.txt b.txt c.txt | jq -R 'split("\t") | map(tonumber)' | jq -sc
[[20,19,2],[3,4,4],[10,5,9],[15,8,21],[15,8,5]]
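For what it's worth, a single invocation does seem possible by letting jq read the raw lines itself via inputs; a sketch along the same lines:
$ paste a.txt b.txt c.txt | jq -R -c -n '[inputs | split("\t") | map(tonumber)]'
which should collect everything into one array without the second jq.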

I could not come up with a simple way, but here is one that does the job.
1. Join files and create CSV-like file
If your machine has join, you can create joined records from two files (like a JOIN in SQL).
To do this, make sure your files are sorted.
The easiest way, I think, is to number each line; the number works like a primary key in SQL.
$ nl a.txt > a.txt.nl
$ nl b.txt > b.txt.nl
$ nl c.txt > c.txt.nl
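If there are more than a few files, a small loop does the same thing (just a convenience sketch):
$ for f in a.txt b.txt c.txt; do nl "$f" > "$f.nl"; done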
Now you can join the sorted files into one. Note that join can join only two files at once, which is why I piped the output into a second join.
$ join a.txt.nl b.txt.nl | join - c.txt.nl > conc.txt
Now conc.txt is:
1 20 19 2
2 3 4 4
3 10 5 9
4 15 8 21
5 15 8 5
2. Create JSON from the CSV-like file
It seems a little complicated.
jq -Rsn '
[inputs
| . / "\n"
| (.[] | select((. | length) > 0) | . / " ") as $input
| [$input[1], $input[2], $input[3]] | map(tonumber) ]
' <conc.txt
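With the sample conc.txt above, this should produce the desired nested array; jq pretty-prints by default, and adding -c would give it on a single line:
[[20,19,2],[3,4,4],[10,5,9],[15,8,21],[15,8,5]]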
I do not know the detailed syntax or usage of jq well, but it seems to be doing the following:
split the input file by \n
split each line by spaces, then select the non-empty entries
pick the split fields by index (skipping the line number) and convert them to numbers with tonumber
I used this question as a reference:
https://stackoverflow.com/a/44781106/10675437

Related

Get difference between two csv files based on column using bash

I have two csv files, a.csv and b.csv; both of them come with no headers and each value in a row is separated by \t.
a.csv:
1 apple
2 banana
3 orange
4 pear
b.csv:
apple 0.89
banana 0.57
cherry 0.34
I want to subtract these two files and get the difference between the second column in a.csv and the first column in b.csv, something like a.csv[1] - b.csv[0], which would give me another file c.csv that looks like
orange
pear
Instead of using python or another programming language, I want to use a bash command to complete this task. I found out that awk would be helpful, but I'm not sure how to write the correct command. Here is another similar question, but the second answer uses awk '{print $2,$6-$13}' to get the difference between values instead of occurrences.
Thanks, and I appreciate any help.
You can easily do this with Steve's answer from the link you are referring to, with a bit of a tweak. I'm not sure the other answer using paste will solve this problem.
Create a hash-map from b.csv and compare it against the 2nd column in a.csv:
awk -v FS="\t" 'BEGIN { OFS = FS } FNR == NR { unique[$1]; next } !($2 in unique) { print $2 }' b.csv a.csv
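With the sample files, this should print:
orange
pear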
To redirect the output to a new file, append > c.csv at the end of the previous command.
Set the field separators (input and output) to \t since you are reading a tab-delimited file.
The FNR == NR { action; } { action } f1 f2 construct is a general pattern you find in many awk commands when you need to act on more than one file. The block guarded by FNR == NR is executed while the first file argument is being read, and the following {..} block runs on the second file argument.
The part unique[$1]; next builds a hash-map unique whose keys are the values in the first column of b.csv; it runs for every line of that file.
After that file is completely processed, on the next file a.csv we do !($2 in unique), which selects those lines whose $2 is not among the keys of the unique hash-map built from the first file.
On those lines, we print only the second column: { print $2 }
Assuming your real data is sorted on the columns you care about like your sample data is:
$ comm -23 <(cut -f2 a.tsv) <(cut -f1 b.tsv)
orange
pear
This uses comm to print out the entries in the first file that aren't in the second one, after using cut to get just the columns you care about.
If not already sorted:
comm -23 <(cut -f2 a.tsv | sort) <(cut -f1 b.tsv | sort)
If you want to use Miller (https://github.com/johnkerl/miller), a clean and easy tool, the command could be
mlr --nidx --fs "\t" join --ul --np -j join -l 2 -r 1 -f 01.txt then cut -f 2 02.txt
(here 01.txt plays the role of a.csv and 02.txt of b.csv).
It gives you
orange
pear
It's a join that does not emit paired records and emits only the unpaired records from the left file.

Gnuplotting the sorted merge of two CSV files

I am trying to merge and sort two CSV files, skipping the first 8 rows.
To sort one of the files by the 36th column I use:
awk '(NR>8 ){print; }' Hight_5x5.csv | sort -nk36
and to merge the two files:
cat Hight_5x5.csv <(tail +8 Hight_5x5_b.csv)
The sort command does not work.
I would like to use both actions in one command and send the result to gnuplot's plot command. I have tried this line:
awk '(NR>8 ){print; }' (cat Hight_5x5.csv <(tail +8 Hight_5x5_b.csv)) | sort -nk36
and it does merge the two files, but it does not sort by column 36, so I assume the gnuplot plot command will not work either.
plot "<awk '(NR>8 ){print; }' (cat Hight_5x5.csv <(tail +8 Hight_5x5_b.csv)) | sort -nk36"
The problem is the format of the two files. The fields are separated by ",". For example, ...,"0.041","3.5","40","false","1000","1.3","20","5","5","-20","2","100000000","0.8",....
This link has the two CSV files.
Regards
$ awk 'FNR>8' file1 file2 | sort -k36n
should do; I guess you should be able to pipe to gnuplot as well.
I don't understand your comment; sort will sort. Perhaps you don't have 36 fields, or your separator is not whitespace, in which case you have to specify it.
Here is an example with dummy data with comma-separated fields:
$ awk 'FNR>3' <(seq 20 | paste - - -d,) <(seq 10 | shuf | paste - - -d,) | sort -t, -k2n
5,1
2,7
7,8
9,10
11,12
13,14
15,16
17,18
19,20
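Putting that together for the files from the question, something along these lines might work (a sketch, untested against the actual data; the quoted numeric fields may still keep sort -n from comparing the values as numbers):
$ awk 'FNR>8' Hight_5x5.csv Hight_5x5_b.csv | sort -t, -k36n
and, fed straight to gnuplot:
plot "< awk 'FNR>8' Hight_5x5.csv Hight_5x5_b.csv | sort -t, -k36n"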

Convert multiple JSON files to a single .CSV file

I am trying to convert 2000 JSON files of the same dimensionality to .csv and merge them into a single .csv file. What would be the best place to look? Please assist.
I had a bunch of .json files with the same problem.
My solution was to use a bash script to loop over all of the files and use jq on each to convert it to an individual csv file. Something like
i=1
for eachFile in /path/to/json/*.json; do
  jq -r '.[] | {column1: .path.to.data, column2: .path.to.data} | [.[] | tostring] | @csv' "$eachFile" > extract-$i.csv
  echo "converted $i of many json files..."
  ((i=i+1))
done
Then you can cat and >> all of those into a single .csv file with a similar loop. Something like
for eachFile in /path/to/csv/*.csv; do
  cat "$eachFile" >> concatenate.csv
done
If you are crafty enough, you can combine those into a single script... edit: in fact, it's just a matter of switching the first script to >> and a single output file name, so jq -r '.[] | {column1: .path.to.data, column2: .path.to.data} | [.[] | tostring] | @csv' "$eachFile" >> output.csv
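For what it's worth, jq also accepts several input files in one call, so the per-file loop may not be needed at all; a sketch using the same placeholder paths as above (and assuming each file holds a top-level array):
jq -r '.[] | {column1: .path.to.data, column2: .path.to.data} | [.[] | tostring] | @csv' /path/to/json/*.json > output.csv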
There's this great program from Withdata which unfortunately costs a bit of money, but there is a 30-day free trial if you just need a quick fix. It's called DataFileConverter, and there is a guide on their website on how to convert json files into .csv. If you're looking for a free program, try this repository: https://github.com/evidens/json2csv. It's written in Python but can still be used by following the directions.
Using pathlib and pandas
from pathlib import Path
import pandas as pd

path = "/path/to/files/root/directory/"
# collect all gzipped JSON files under the root directory
files = list(Path(path).rglob("*json.gz"))
# read each file into a DataFrame, concatenate them all, and write a single CSV
pd.concat((pd.read_json(file, compression="gzip") for file in files), ignore_index=True).to_csv(f"{path}/final.csv")

linux command-line update csv file inline using value from another column that is json

I have a large csv file that contains several columns. One of the columns is a json string. I am trying to extract a specific value from the column that contains the json and add that value to the row as its own column.
I've tinkered around a little with sed and awk to try to do this, but really I'm just spinning my wheels.
I'm also trying to do this as an inline file edit. The csv is tab delimited.
The value I'm trying to put in its own column is the value for destinationIDUsage
Sample row (highly trimmed down for readability here):
2017-03-22 00:00:01 %key%94e901fd3ceef351a0ad770e0be91d38 10 3.0.0 [{"MC_LIVEREPEATER":false},{"environment":"details"},{"feature":"pushPublishUsage","destinationIDUsage":876543}] false
End result for the row should now have 876543 as a value in its own column as such:
2017-03-22 00:00:01 %key%94e901fd3ceef351a0ad770e0be91d38 10 3.0.0 [{"MC_LIVEREPEATER":false},{"environment":"details"},{"feature":"pushPublishUsage","destinationIDUsage":876543}] 876543 false
Any help is greatly appreciated.
Something like this seems to do the job.
$ echo "$a"
2017-03-22 00:00:01 %key%94e901fd3ceef351a0ad770e0be91d38 10 3.0.0 [{MC_LIVEREPEATER:false},{environment:details},{feature:pushPublishUsage,destinationIDUsage:876543}] false
$ echo "$a" |awk '{for (i=1;i<=NF;i++) {if ($i~/destinationIDU/) {match($i,/(.*)(destinationIDUsage:)(.*)(})/,f);extra=f[3]}}}{prev=NF;$(NF+1)=$prev;$(NF-1)=extra}1'
2017-03-22 00:00:01 %key%94e901fd3ceef351a0ad770e0be91d38 10 3.0.0 [{MC_LIVEREPEATER:false},{environment:details},{feature:pushPublishUsage,destinationIDUsage:876543}] 876543 false
It is possible, though, that the awk experts in here will propose something different and maybe better.
With GNU awk for the 3rd arg to match():
$ awk 'BEGIN{FS=OFS="\t"} {match($6,/"destinationIDUsage":([0-9]+)/,a); $NF=a[1] OFS $NF}1' file
2017-03-22 00:00:01 %key%94e901fd3ceef351a0ad770e0be91d38 10 3.0.0 [{"MC_LIVEREPEATER":false},{"environment":"details"},{"feature":"pushPublishUsage","destinationIDUsage":876543}] 876543 false
Add -i inplace for "inplace" editing or just do awk 'script' file > tmp && mv tmp file like you can with any UNIX tool.
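For example, the in-place variant would look something like this (a sketch; -i inplace requires GNU awk 4.1 or later):
gawk -i inplace 'BEGIN{FS=OFS="\t"} {match($6,/"destinationIDUsage":([0-9]+)/,a); $NF=a[1] OFS $NF}1' file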
Here is a solution using jq
If the file filter.jq contains
split("\n")[] # split string into lines
| select(length>0) # eliminate blanks
| split("\t") # split data rows by tabs
| (.[5] | fromjson | add) as $f # expand json
| .[:-1] + [$f.destinationIDUsage] + .[-1:] # add destinationIDUsage column
| @tsv # convert to tab-separated
and data contains the sample data, then the command
jq -M -R -s -r -f filter.jq data
will produce the output with the additional column
2017-03-22 00:00:01 %key%94e901fd3ceef351a0ad770e0be91d38 10 3.0.0 [{"MC_LIVEREPEATER":false},{"environment":"details"},{"feature":"pushPublishUsage","destinationIDUsage":876543}] 876543 false
To edit the file in place, you can make use of a tool like sponge, as described in this answer:
Manipulate JSON with jq
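In practice that would look something like this (a sketch, assuming sponge from moreutils is installed):
jq -M -R -s -r -f filter.jq data | sponge data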