matching patterns and create new files - csv

I have a csv file named file1.csv:
something;AD;sss;Andorra;nothing;type_1;sss
something222;AD;sss222;Andorra;nothing222;type_2;aaa
thing;NL;thing3;Netherlands;thing;type_2;bb
etc;US;etc;United States;etc;type_2;nothing
I want to create a separate file for each country. I run greps like this:
grep -e "\;AD\;.*\;Andorra\;" file1.csv > fileAD.csv
grep -e "\;NL\;.*\;Netherlands\;" file1.csv > fileNL.csv
grep -e "\;US\;.*\;United\sStates\;" file1.csv > fileUS.csv
This works, but I have all the countries in the world, and I don't want to write one of these lines for every country. Is there another solution? Any help is really appreciated.
Edit: I updated my question. I also have a column containing type_1 or type_2. Once the files for each country have been created, I need to create additional files for every country: one with just the type_1 rows and one with just the type_2 rows.
For example, for Andorra, I need the files:
fileAD.csv :
something;AD;sss;Andorra;nothing;type_1;sss
something222;AD;sss222;Andorra;nothing222;type_2;aaa
fileADtype_1.csv:
something;AD;sss;Andorra;nothing;type_1;sss
fileADtype_2.csv:
something222;AD;sss222;Andorra;nothing222;type_2;aaa
I think it is OK to match only on the column with the abbreviation, but I wanted to check both columns, the one with "AD" and the one with the full name "Andorra", to be on the safe side.

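I would go for a one-liner with a single awk invocation and no temporary files. The parentheses around the file name keep the concatenation unambiguous across awk implementations:
awk -F ';' '{print >> ("file" $2 ".csv")}' file1.csv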

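As a one-liner with a shell loop around awk (sort -u is needed because uniq alone only drops adjacent duplicates, and $2 == code is an exact comparison rather than a regex match):
for code in $(awk -F';' '{print $2}' file1.csv | sort -u); do awk -F';' -v code="$code" '$2 == code' file1.csv > "file${code}.csv"; done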
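The edit to the question also asks for per-type files, which the two commands above do not produce. A minimal sketch of one way to get both sets in a single pass, assuming the country code is always field 2 and the type is always field 6 (as in the sample rows) and using the output names from the question:

```shell
# Work in a scratch directory; the sample rows are copied from the question.
cd "$(mktemp -d)" || exit 1
cat > file1.csv <<'EOF'
something;AD;sss;Andorra;nothing;type_1;sss
something222;AD;sss222;Andorra;nothing222;type_2;aaa
thing;NL;thing3;Netherlands;thing;type_2;bb
etc;US;etc;United States;etc;type_2;nothing
EOF

awk -F';' '{
    country = "file" $2 ".csv"        # e.g. fileAD.csv
    typed   = "file" $2 $6 ".csv"     # e.g. fileADtype_1.csv
    print >> country; close(country)  # append, then close to limit open files
    print >> typed;   close(typed)
}' file1.csv

ls
```

Because the files are opened in append mode, remove any previous output before re-running. Closing after every line is slow on large inputs, but it avoids the limit on simultaneously open files that some awk implementations impose.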


How to format a TXT file into a structured CSV file in bash?

I wanted to get some information about the CPU temperatures on my Linux server (openSUSE Leap 15.2), so I wrote a script that collects data every 20 seconds and writes it to a text file. I have already removed all the garbage data I don't need (like "CPU Temp").
Now I have a file like this:
47
1400
75
3800
The first two lines are one reading of the CPU temperature in C and the fan speed in RPM, respectively. The next two lines are another reading of the same measurements.
In the end I want this structure:
47,1400
75,3800
My question is: can a Bash script do this for me? I tried a few things with sed and awk, but nothing worked perfectly. I also want a CSV file so that I can make a graph, but I think converting the text file into a CSV file shouldn't be a problem.
You could use paste
paste -d, - - < file.txt
With pr
pr -ta2s, file.txt
with ed
ed -s file.txt <<-'EOF'
g/./s/$/,/\
;.+1j
,p
Q
EOF
You can use awk:
awk 'NR%2{printf "%s,",$0;next;}1' file.txt > file.csv
Another awk:
$ awk -v OFS=, '{printf "%s%s",$0,(NR%2?OFS:ORS)}' file
Output:
47,1400
75,3800
Explained:
$ awk -v OFS=, '{ # set output field delimiter to a comma
printf "%s%s", # using printf to control newline in output
$0, # output line
(NR%2?OFS:ORS) # and either a comma or a newline
}' file
Since you asked if a bash script can do this, here's a solution in pure bash. ;o]
c=0
while read -r line; do
if (( c++ % 2 )); then
echo "$line"
else printf "%s," "$line"
fi
done < file
Take a look at 'paste'. With -s it joins every line of its input into one single line, as in the example below; to pair lines two at a time instead, use the paste -d, - - form shown earlier.
echo "${DATA}"
Name
SANISGA01CI
5WWR031
P59CSADB01
CPDEV02
echo "${DATA}"|paste -sd ',' -
Name,SANISGA01CI,5WWR031,P59CSADB01,CPDEV02
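To make the difference between the two paste invocations concrete, here is a small sketch that runs both on the four readings from the question. The variable names are only illustrative.

```shell
# Sample readings from the question, one value per line.
readings=$(printf '%s\n' 47 1400 75 3800)

# -s serialises the whole input into a single comma-separated line.
one_line=$(printf '%s\n' "$readings" | paste -sd, -)

# Two "-" operands read standard input alternately, pairing consecutive lines.
pairs=$(printf '%s\n' "$readings" | paste -d, - -)

printf '%s\n' "$one_line"
printf '%s\n' "$pairs"
```

Only the second form gives one temperature,fan-speed record per line, which is the layout the question asks for.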

AWK using file to remove csv rows

I have the following csv:
old.csv
irrelevant,irrelevant,Abc#gmail.com,irrelevant
irrelevant,irrelevant,zyx#gmail.com,irrelevant
irrelevant,irrelevant,yZ#yahoo.com,irrelevant
irrelevant,irrelevant,that#email.com,irrelevant
irrelevant,irrelevant,this#email.com,irrelevant
irrelevant,irrelevant,def#gmail.com,irrelevant
irrelevant,irrelevant,anoTher#mydomain.com,irrelevant
that I need to remove the rows containing emails from this file:
remove.txt
abc#gmail.com
yz#yahoo.com
this#email.com
another#mydomain.com
And I need the output to be this:
new.csv
irrelevant,irrelevant,zyx#gmail.com,irrelevant
irrelevant,irrelevant,that#email.com,irrelevant
irrelevant,irrelevant,def#gmail.com,irrelevant
I've tried this, but it doesn't work. Can anyone help?
awk -F, 'BEGIN{IGNORECASE = 1};NR==FNR{remove[$1]++;next}!($1 in remove)' remove.txt old.csv > new.csv
With grep:
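grep -v -i -F -f remove.txt old.csv > new.csv
Here,
-f - Obtain patterns from FILE, one per line.
-F - Treat each pattern as a fixed string rather than a regular expression (otherwise the . in an address matches any character)
-i - Ignore case
-v - Invert the matching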
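With awk (gawk only, because of IGNORECASE; a row is skipped as soon as its third field matches any pattern, and printed otherwise):
awk -F, 'BEGIN{IGNORECASE=1} NR==FNR{a[$1];next} {for(var in a) if($3 ~ var) next; print}' remove.txt old.csv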
Another awk:
awk -F, 'NR==FNR{a[tolower($1)]++;next} !(tolower($3) in a){print}' remove.txt old.csv
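Your original command won't work because IGNORECASE=1 only affects regex and string comparisons such as if (x ~ /ab/). It has no effect on array subscripts, so a membership test of the form (index in array) stays case-sensitive.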
IGNORECASE is gawk-specific, you may not be using gawk.
You're testing the wrong field.
Incrementing the array element does nothing useful.
Try this:
awk -F, 'NR==FNR{remove[tolower($1)];next}!(tolower($3) in remove)' remove.txt old.csv > new.csv
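As a quick check, the following sketch runs that command against the sample data from the question in a scratch directory. The file contents are copied from the question; nothing else is assumed.

```shell
cd "$(mktemp -d)" || exit 1
cat > old.csv <<'EOF'
irrelevant,irrelevant,Abc#gmail.com,irrelevant
irrelevant,irrelevant,zyx#gmail.com,irrelevant
irrelevant,irrelevant,yZ#yahoo.com,irrelevant
irrelevant,irrelevant,that#email.com,irrelevant
irrelevant,irrelevant,this#email.com,irrelevant
irrelevant,irrelevant,def#gmail.com,irrelevant
irrelevant,irrelevant,anoTher#mydomain.com,irrelevant
EOF
cat > remove.txt <<'EOF'
abc#gmail.com
yz#yahoo.com
this#email.com
another#mydomain.com
EOF

# First file (NR==FNR): store each address, lowercased, as an array key.
# Second file: print only rows whose lowercased third field is not a key.
awk -F, 'NR==FNR { remove[tolower($1)]; next }
         !(tolower($3) in remove)' remove.txt old.csv > new.csv

cat new.csv
```

Lowercasing both sides makes the lookup case-insensitive without relying on gawk. Note that the NR==FNR idiom assumes remove.txt is not empty.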

Find Values in CSV that only Appear Once

I have a csv file with thousands of lines in it. I'd like to be able to find values that only appear once in this file.
For instance
dog
dog
cat
dog
bird
I'd like to get as my result:
cat
bird
I tried using the following awk command but it returned one of each value in the file:
awk -F"," '{print $1}' test.csv|sort|uniq
Returns:
dog
cat
bird
Thank you for your help!
Just with awk:
awk -F, '{count[$1]++} END {for (key in count) if (count[key] == 1) print key}' test.csv
Close. Try:
awk -F"," '{print $1}' test.csv |sort | uniq -c | awk '{if ($1 == 1) print $2}'
The -c flag on uniq prefixes each value with its count. The final awk then keeps the entries whose count (the first field) is 1 and prints the value itself (the second field, $2).
The only caveat is that this returns bird before cat, because the input was sorted first. You could pipe the result once more through sort -r to reverse the order. That would match the expected output you posted, but it is still not the original order of the file.
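If the original order matters, one option is to let awk read the file twice instead of sorting. A minimal sketch, assuming the value of interest is the first comma-separated field and that the file is small enough to read twice; once.txt is just an illustrative output name:

```shell
cd "$(mktemp -d)" || exit 1
printf '%s\n' dog dog cat dog bird > test.csv

# Pass 1 (NR==FNR): count how often each first field occurs.
# Pass 2: print the values seen exactly once, in input order.
awk -F, 'NR==FNR { count[$1]++; next }
         count[$1] == 1 { print $1 }' test.csv test.csv > once.txt

cat once.txt
```

Passing the file name twice is what triggers the second pass, so this does not work on a pipe.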
Cutting to first field, then sorting and displaying only uniques:
cut -d ',' -f 1 test.csv | sort | uniq -u
That is, adding -u to the uniq at the end of your command would have worked. This version just uses cut instead of awk.
If Perl is an option, this code is similar to @glenn jackman's:
perl -F, -lane '$c{$F[0]}++; END{for $k (sort keys %c){print $k if $c{$k} == 1}}' test.csv
These command-line options are used:
-n loop around each line of the input file
-l removes newlines before processing, and adds them back in afterwards
-a autosplit mode: split input lines into the @F array. Defaults to splitting on whitespace.
-e execute the perl code
-F autosplit modifier, in this case splits on ,
@F is the array of words in each line, indexed starting with $F[0]

grep html file from wget

I use wget to download number of papers matching a given query in scholar.google.com
and I obtain a file which shows all the content of the page.
I want to retrieve the last number in the following part of the file
"Results 1 - 10 of about 8,890."
I tried:
cat /dir/file | tr -d "," | grep -o -E -- 'about ([^"]+) \w+'
but it outputs:
about <b>8890</b>. (<b>0.12</b> sec) </font></td></tr></table></form> <div class
whereas I just want the 8890 (without the comma, which tr -d "," takes care of).
Any suggestions on how to improve it?
Thank you in advance!
Grep pulls out the right line - use sed after that to chop away what you don't want.
cat /dir/file | tr -d "," | grep -o -E -- 'about ([^"]+) \w+' |sed -e 's/.*about <b>//' -e 's/<.b>.*//'
If the html tags (<b> and </b>) are present in your file, you'll have to modify your regex to take care of them too. To get just the fragment you're interested in use a lookbehind assertion. Here's something that should work:
cat /dir/file | tr -d "," | grep -oP -- '(?<=about <b>)[^/<> ]+'
Try something like: sed -n 's#.*about <b>\([0-9]*\)</b>.*#\1#p' instead of grep.
-n means don't print input lines by default; the p flag on the s command means print the line only if a substitution was made.
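A short sketch of that sed approach on a single line. The sample line is an assumption, reconstructed from the fragment of output quoted in the question rather than taken from a real results page:

```shell
# Hypothetical input line, modelled on the output shown in the question.
line='Results <b>1</b> - <b>10</b> of about <b>8,890</b>. (<b>0.12</b> sec)'

# Drop the thousands separator, then keep only the digits inside the
# <b>...</b> pair that follows "about".
count=$(printf '%s\n' "$line" | tr -d ',' | sed -n 's#.*about <b>\([0-9]*\)</b>.*#\1#p')

printf '%s\n' "$count"
```

Using # as the s delimiter avoids having to escape the / in </b>.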

parse a csv file that contains commas in the fields with awk

I have to use awk to print out 4 different columns in a CSV file. The problem is that the strings are in $x,xxx.xx format, so when I run the regular awk command
awk -F, '{print $1}' testfile.csv
my output ends up looking like
307.00
$132.34
30.23
What am I doing wrong?
"$141,818.88","$52,831,578.53","$52,788,069.53"
this is roughly the input. The file I have to parse is 90,000 rows and about 40 columns
This is how the input is laid out or at least the parts of it that I have to deal with. Sorry if I made you think this wasn't what I was talking about.
If the input is "$307.00","$132.34","$30.23"
I want the output to be:
$307.00
$132.34
$30.23
Oddly enough I had to tackle this problem some time ago and I kept the code around to do it. You almost had it, but you need to get a bit tricky with your field separator(s).
awk -F'","|^"|"$' '{print $2}' testfile.csv
Input
# cat testfile.csv
"$141,818.88","$52,831,578.53","$52,788,069.53"
"$2,558.20","$482,619.11","$9,687,142.69"
"$786.48","$8,568,159.41","$159,180,818.00"
Output
# awk -F'","|^"|"$' '{print $2}' testfile.csv
$141,818.88
$2,558.20
$786.48
You'll note that the "first" field is actually $2 because of the field separator ^". Small price to pay for a short 1-liner if you ask me.
I think what you're saying is that you want to split the input into CSV fields while not getting tripped up by the commas inside the double quotes. If so...
First, use "," as the field separator, like this:
awk -F'","' '{print $1}'
But then you'll still end up with a stray double-quote at the beginning of $1 (and at the end of the last field). Handle that by stripping quotes out with gsub, like this:
awk -F'","' '{x=$1; gsub("\"","",x); print x}'
Result:
echo '"abc,def","ghi,xyz"' | awk -F'","' '{x=$1; gsub("\"","",x); print x}'
abc,def
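The same idea extends to any column: strip the outer quotes once, then split on the "," sequence. A minimal sketch, assuming every field is double-quoted and no field contains an escaped quote; the sample row is taken from the question:

```shell
line='"$141,818.88","$52,831,578.53","$52,788,069.53"'

second=$(printf '%s\n' "$line" | awk '{
    sub(/^"/, ""); sub(/"$/, "")   # drop the leading and trailing quote
    n = split($0, f, /","/)        # split on the quote-comma-quote between fields
    print f[2]                     # any index from 1 to n works here
}')

printf '%s\n' "$second"
```

This keeps the commas inside each amount intact, because only the three-character "," sequence is treated as a separator.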
In order to let awk handle quoted fields that contain the field separator, you can use a small script I wrote called csvquote. It temporarily replaces the offending commas with nonprinting characters, and then you restore them at the end of your pipeline. Like this:
csvquote testfile.csv | awk -F, '{print $1}' | csvquote -u
This would also work with any other UNIX text processing program like cut:
csvquote testfile.csv | cut -d, -f1 | csvquote -u
You can get the csvquote code here: https://github.com/dbro/csvquote
The data file:
$ cat data.txt
"$307.00","$132.34","$30.23"
The AWK script:
$ cat csv.awk
BEGIN { RS = "," }
{ gsub("\"", "", $1);
print $1 }
The execution:
$ awk -f csv.awk data.txt
$307.00
$132.34
$30.23