Retrieve columns based on headers - csv

I have a .csv file with a million rows, and I want to retrieve columns based on their headers. For example:
headers: 300A 300B 300C 301 302 303 303A 303B 304 (the file has 9 columns and a million rows)
I'm unable to come up with a command that retrieves all the columns starting from 300B up to (but not including) 304.
Expected output (I need to retrieve the columns along with their headers):
300B 300C 301 302 303 303A 303B
I tried with basic awk and grep, but they give output based only on the .csv column number; I'm unable to retrieve columns based on the header.

Using Miller, you can run
mlr --csv cut -f 300B,300C,301,302,303,303A,303B input.csv >output.csv
Before testing on millions of rows, you could start with a few rows:
mlr --csv head then cut -f 300B,300C,301,302,303,303A,303B input.csv >output.csv
If you need to extract a range programmatically, you can use standard Linux utilities. For example, running
<input.csv head -n 1 | grep -o -P '(?<=300A,).*(?=,304)'
you get 300B,300C,301,302,303,303A,303B.
Create a bash script and use that command's output as an input variable, as in the sketch below.
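A minimal sketch of such a script, assuming GNU grep (for -P) and the exact header names from the question:
#!/usr/bin/env bash
# grab the header names between 300A and 304 from the first line
fields="$(head -n 1 input.csv | grep -oP '(?<=300A,).*(?=,304)')"
# feed the comma-separated list to Miller's cut
mlr --csv cut -f "$fields" input.csv > output.csv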

Get difference between two csv files based on column using bash

I have two CSV files, a.csv and b.csv; both come with no headers, and the values in a row are separated by \t.
a.csv:
1 apple
2 banana
3 orange
4 pear
b.csv:
apple 0.89
banana 0.57
cherry 0.34
I want to subtract these two files and get the difference between the second column in a.csv and the first column in b.csv, something like a.csv[1] - b.csv[0], which would give me another file c.csv that looks like:
orange
pear
Instead of using Python or other programming languages, I want to use bash commands to complete this task, and I found that awk would be helpful but am not sure how to write the correct command. There is another similar question, but its second answer uses awk '{print $2,$6-$13}' to get the difference between values rather than between occurrences.
Thanks, and I appreciate any help.
You can easily do this with Steve's answer from the link you are referring to, with a bit of tweaking. I'm not sure the other answer using paste will solve this problem.
Create a hash map from b.csv and compare it against the 2nd column in a.csv:
awk -v FS="\t" 'BEGIN { OFS = FS } FNR == NR { unique[$1]; next } !($2 in unique) { print $2 }' b.csv a.csv
To redirect the output to a new file, append > c.csv at the end of the previous command.
Set the field separators (input and output) to \t since you are reading a tab-delimited file.
The FNR == NR { action } { action } file1 file2 pattern is a general construct you find in many awk commands when you need to act on more than one file. The block right after FNR == NR is executed while the first file argument is being read, and, because of the next, the following block within {..} runs only on the second file argument.
The part unique[$1]; next builds a hash map unique whose keys are the values in the first column of b.csv; it runs for every line of that file.
Once b.csv has been completely processed, on the next file a.csv we do !($2 in unique), which selects those lines whose $2 is not among the keys of the unique hash map built from the first file.
On those lines, only the second column is printed: { print $2 }.
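For readability, the same command can be spread over several lines with comments (identical in behavior):
awk -v FS="\t" '
    BEGIN { OFS = FS }                # keep tab as the output separator too
    FNR == NR { unique[$1]; next }    # first file (b.csv): remember column 1 as keys
    !($2 in unique) { print $2 }      # second file (a.csv): print column 2 if unseen
' b.csv a.csv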
Assuming your real data is sorted on the columns you care about like your sample data is:
$ comm -23 <(cut -f2 a.tsv) <(cut -f1 b.tsv)
orange
pear
This uses comm to print out the entries in the first file that aren't in the second one, after using cut to get just the columns you care about.
If not already sorted:
comm -23 <(cut -f2 a.tsv | sort) <(cut -f1 b.tsv | sort)
If you want to use Miller (https://github.com/johnkerl/miller), a clean and easy tool, the command could be
mlr --nidx --fs "\t" join --ul --np -j join -l 2 -r 1 -f 01.txt then cut -f 2 02.txt
It gives you
orange
pear
It's a join that does not emit paired records and emits only the unpaired records from the left file.
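Assuming 01.txt stands for a.csv (the left file, given with -f) and 02.txt for b.csv, the same command with the question's file names would be
mlr --nidx --fs "\t" join --ul --np -j join -l 2 -r 1 -f a.csv then cut -f 2 b.csv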

Gnuplotting the sorted merge of two CSV files

I am trying to merge and sort two CSV files, skipping the first 8 rows.
To sort one of the files by the 36th column I use:
awk '(NR>8 ){print; }' Hight_5x5.csv | sort -nk36
and to merge the two files:
cat Hight_5x5.csv <(tail +8 Hight_5x5_b.csv)
The sort command does not work.
I would like to use both actions in one command and send the result to gnuplot's plot command. I have tried this line:
awk '(NR>8 ){print; }' (cat Hight_5x5.csv <(tail +8 Hight_5x5_b.csv)) | sort -nk36
and it does merge the two files, but it does not sort by column 36, so I assume it will not work in the gnuplot plot command either.
plot "<awk '(NR>8 ){print; }' (cat Hight_5x5.csv <(tail +8 Hight_5x5_b.csv)) | sort -nk36"
The problem is the format of the two files. The data fields are separated by "," and quoted, for example: ...,"0.041","3.5","40","false","1000","1.3","20","5","5","-20","2","100000000","0.8",...
This link has the two CSV files.
Regards
$ awk 'FNR>8' file1 file2 | sort -k36n
should do the job; I guess you should be able to pipe it to gnuplot as well.
I don't understand your comment; sort will sort. Perhaps you don't have 36 fields, or your separator is not whitespace, in which case you have to specify it.
Here is an example with dummy data with comma-separated fields:
$ awk 'FNR>3' <(seq 20 | paste - - -d,) <(seq 10 | shuf | paste - - -d,) | sort -t, -k2n
5,1
2,7
7,8
9,10
11,12
13,14
15,16
17,18
19,20
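Since your real data is comma-separated and the values are quoted, a sketch adapted to your files (untested; tr strips the quotes so the numeric sort works, and the column choice in the using clause is only a guess) could be
awk 'FNR>8' Hight_5x5.csv Hight_5x5_b.csv | tr -d '"' | sort -t, -k36n > merged_sorted.csv
and from gnuplot something like
set datafile separator ","
plot "< awk 'FNR>8' Hight_5x5.csv Hight_5x5_b.csv | tr -d '\"' | sort -t, -k36n" using 1:36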

Shell - read html into variable and filter sequence

I need to read a webpage with tables into a variable and filter out the number from one cell.
The HTML is like:
<tr><th>Totals:</th><td> 99999.9</td>
I need to get that 99999.9 number.
I tried:
value=$(curl -s -m 10 http://$host | egrep -o "Totals:</th><td> [0-9]\{5\}" | cut -d'> ' -f 2)
Another valid option would be to at least check whether the page was generated at all, i.e. read the HTML into a variable and check whether it really contains HTML (maybe by its length).
Any clue what is wrong with the curl command combined with the cut command?
Thank you.
You should use a proper HTML parser for that. If you really want to do it with bash (which is error-prone and can cause you a lot of headaches if the HTML gets more complex), you can do it the following way:
# html="$(curl -s -m 10 http://$host)"
html="<tr><th>Totals:</th><td> 99999.9</td>"
# remove all whitespaces
# it is not guaranteed that your cell value will be on the same line with Totals:
html_cl="$(echo "$html" | tr -d ' \t\n\r\f')"
# strip .*Totals:</th><td> before the desired cell value
# strip </td>.* after the value
value="${html_cl##*Totals:</th><td>}"
value="${value%%</td>*}"
echo "$value"
Gives you the result:
99999.9
NOTE: If you have multiple Totals cells with the same tags, this will extract only the last one from your string.
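For your other point (checking whether the page was generated at all), a simple length check on the variable could look like this; the 100-character threshold is only a placeholder to tune for your page:
html="$(curl -s -m 10 "http://$host")"
# bail out if the response is suspiciously short
if [ "${#html}" -lt 100 ]; then
    echo "page looks empty or was not generated" >&2
    exit 1
fi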

Split JSON into multiple files

I have a JSON file exported from MongoDB which looks like:
{"_id":"99919","city":"THORNE BAY"}
{"_id":"99921","city":"CRAIG"}
{"_id":"99922","city":"HYDABURG"}
{"_id":"99923","city":"HYDER"}
There are about 30000 lines, and I want to split each line into its own .json file. (I'm trying to transfer my data onto a Couchbase cluster.)
I tried doing this:
cat cities.json | jq -c -M '.' | \
while read line; do echo $line > .chunks/cities_$(date +%s%N).json; done
but I found that it seems to drop loads of lines; running this command only gave me 50-odd files when I was expecting 30000-odd!
Is there a logical way to make this not drop any data, using anything suitable?
Assuming you don't care about the exact filenames, if you want to split the input into multiple files, just use split:
jq -c . < cities.json | split -l 1 --additional-suffix=.json - .chunks/cities_
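To check that nothing was dropped, you can compare the number of input lines with the number of files produced; the two counts should match:
wc -l < cities.json
ls .chunks | wc -l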
In general, to split any text file into separate files, one per line, using any awk on any UNIX system, simply run:
awk '{close(f); f=".chunks/cities_"NR".json"; print > f}' cities.json
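If you would rather name each file after its _id (assuming every line has "_id" as its first key, exactly as in your sample), a variation on the same awk idea could be
awk -F'"' '{close(f); f=".chunks/cities_" $4 ".json"; print > f}' cities.json
Here -F'"' splits each line on the double quotes, so $4 is the value of _id.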