awk delete field if other column match - csv

I have a CSV file that looks like this:
col1,col2,col3,col4
1,2,a,01/01
2,3,b,01/02
3,4,5,c,01/03
2,5,6,c,01/03
The last 2 rows have been appended to the file, but they have an extra column (the third column). I want to delete the third column from those last 2 rows (i.e. where column 4 == "c" and column 5 == "01/03").
The output I want removes the third column from the last 2 rows so that every row has only 4 columns:
col1,col2,col3,col4
1,2,a,01/01
2,3,b,01/02
3,4,c,01/03
2,5,c,01/03
If it can be done in vim, that would be good too.

Here's a slightly different approach that avoids having to type the list of columns to be included:
awk -F, 'BEGIN {OFS=FS} NF==5 {for(i=3;i<=NF;i++){$i=$(i+1)}; NF--} 1'
The solution with an explicit listing of columns can also be written more compactly as follows:
awk -F, 'BEGIN {OFS=FS} NF == 5 {print $1, $2, $4, $5; next} 1'
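Either variant produces the expected output when run against the sample saved as file.csv (file name assumed here for illustration):
$ awk -F, 'BEGIN {OFS=FS} NF == 5 {print $1, $2, $4, $5; next} 1' file.csv
col1,col2,col3,col4
1,2,a,01/01
2,3,b,01/02
3,4,c,01/03
2,5,c,01/03
(The NF-- trick in the first variant works in GNU awk; decrementing NF is not guaranteed by POSIX, so other awks may behave differently.)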

This should do it
awk -F, 'BEGIN {OFS=","} {if (NF == 5) {print $1, $2, $4, $5} else {print}}' filename

$ awk 'BEGIN{FS=OFS=","} {print $1,$2,$(NF-1),$NF}' file
col1,col2,col3,col4
1,2,a,01/01
2,3,b,01/02
3,4,c,01/03
2,5,c,01/03

Related

Subtract fixed number of days from date column using awk and add it to new column

Let's assume that we have a file with the values as seen below:
% head test.csv
20220601,A,B,1
20220530,A,B,1
And we want to add two new columns, one with the date minus 1 day and one with minus 7 days, resulting in the following:
% head new_test.csv
20220601,A,B,20220525,20220531,1
20220530,A,B,20220523,20220529,1
The awk that was used to produce the above is:
% awk 'BEGIN{FS=OFS=","} { a="date -d \"$(date -d \""$1"\") -7 days\" +'%Y%m%d'"; a | getline st ; close(a) ;b="date -d \"$(date -d \""$1"\") -1 days\" +'%Y%m%d'"; b | getline cb ; close(b) ;print $1","$2","$3","st","cb","$4}' test.csv > new_test.csv
But after applying the above to a large file with more than 100K lines, it runs for 20 minutes. Is there any way to optimize the awk?
One GNU awk approach:
awk '
BEGIN { FS = OFS = ","
        secs_in_day = 60 * 60 * 24
}
{ dt  = mktime( substr($1,1,4) " " substr($1,5,2) " " substr($1,7,2) " 12 0 0" )
  dt1 = strftime("%Y%m%d", dt - secs_in_day)
  dt7 = strftime("%Y%m%d", dt - (secs_in_day * 7))
  print $1, $2, $3, dt7, dt1, $4
}
' test.csv
This generates:
20220601,A,B,20220525,20220531,1
20220530,A,B,20220523,20220529,1
NOTES:
requires GNU awk for the mktime() and strftime() functions; see GNU awk time functions for more details
other flavors of awk may have similar functions, ymmv
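For a quick check of the date math in isolation (again GNU awk; anchoring at 12:00:00 sidesteps DST edge cases):
% echo 20220601 | gawk '{dt = mktime(substr($1,1,4) " " substr($1,5,2) " " substr($1,7,2) " 12 0 0"); print strftime("%Y%m%d", dt - 7*24*60*60)}'
20220525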
You can try using function calls; it is faster than building and running the date command inline in the main block as the original does.
awk -F, '
function cmd1(date,    a, st) {
    # date minus 1 day, via GNU date
    a = "date -d \"$(date -d \"" date "\") -1 days\" +%Y%m%d"
    a | getline st
    close(a)    # close the pipe, otherwise one stays open per distinct command
    return st
}
function cmd2(date,    b, cm) {
    # date minus 7 days, via GNU date
    b = "date -d \"$(date -d \"" date "\") -7 days\" +%Y%m%d"
    b | getline cm
    close(b)
    return cm
}
{
    $5 = cmd1($1)
    $6 = cmd2($1)
    # -7 day column before the -1 day column, to match the requested output
    print $1 "," $2 "," $3 "," $6 "," $5 "," $4
}' OFS=, test > newFileTest
I executed this against a file with 20000 records and it finished in seconds, compared to the original awk, which took around 5 minutes.
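If the input contains many repeated dates, another option (a sketch only, not benchmarked here; GNU date assumed) is to cache one result per distinct date and offset, so date is spawned at most twice per unique date rather than twice per line:
awk -F, -v OFS=, '
# cache one date-shift result per (date, offset) pair
function shift_date(d, n,    cmd, out, key) {
    key = d SUBSEP n
    if (key in cache)
        return cache[key]
    # builds e.g. date -d "2022-06-01 -7 days" +%Y%m%d
    cmd = "date -d \"" substr(d,1,4) "-" substr(d,5,2) "-" substr(d,7,2) " " n " days\" +%Y%m%d"
    cmd | getline out
    close(cmd)
    cache[key] = out
    return out
}
{ print $1, $2, $3, shift_date($1, -7), shift_date($1, -1), $4 }
' test.csv > new_test.csv
That said, the mktime()/strftime() answer above avoids spawning external processes entirely, so it is likely still the fastest approach.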

Get count based on value bash

I have data in this format in a file:
{"field1":249449,"field2":116895,"field3":1,"field4":"apple","field5":42,"field6":"2019-07-01T00:00:10","metadata":"","frontend":""}
{"field1":249448,"field2":116895,"field3":1,"field4":"apple","field5":42,"field6":"2019-07-01T00:00:10","metadata":"","frontend":""}
{"field1":249447,"field2":116895,"field3":1,"field4":"apple","field5":42,"field6":"2019-07-01T00:00:10","metadata":"","frontend":""}
{"field1":249443,"field2":116895,"field3":1,"field4":"apple","field5":42,"field6":"2019-07-01T00:00:10","metadata":"","frontend":""}
{"field1":249449,"field2":116895,"field3":1,"field4":"apple","field5":42,"field6":"2019-07-01T00:00:10","metadata":"","frontend":""}
Here, each entry represents a row. I want to have a count of the rows with respect to the value in field one, like:
249449 : 2
249448 : 1
249447 : 1
249443 : 1
How can I get that?
With awk:
$ awk -F'[,:]' -v OFS=' : ' '{a[$2]++} END{for(k in a) print k, a[k]}' file
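Against the sample data this yields the four counts below; for (k in a) does not guarantee any order, so pipe through sort -rn if you want them ordered by value:
$ awk -F'[,:]' -v OFS=' : ' '{a[$2]++} END{for(k in a) print k, a[k]}' file | sort -rn
249449 : 2
249448 : 1
249447 : 1
249443 : 1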
You can use the jq command line tool to interpret JSON data. uniq -c counts the number of occurrences.
% jq .field1 < $INPUTFILE | sort | uniq -c
1 249443
1 249447
1 249448
2 249449
(tested with jq 1.5-1-a5b5cbe on linux xubuntu 18.04 with zsh)
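If you want the exact value : count layout from the question, one way is to reorder uniq -c's output with a small awk step:
% jq .field1 < $INPUTFILE | sort | uniq -c | awk '{print $2 " : " $1}'
249443 : 1
249447 : 1
249448 : 1
249449 : 2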
Here's an efficient jq-only solution:
reduce inputs.field1 as $x ({}; .[$x|tostring] += 1)
| to_entries[]
| "\(.key) : \(.value)"
Invocation: jq -nrf program.jq input.json
(Note in particular the -n option.)
Of course if an object-representation of the counts is satisfactory, then
one could simply write:
jq -n 'reduce inputs.field1 as $x ({}; .[$x|tostring] += 1)' input.json
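With the sample input that prints something like:
{
  "249449": 2,
  "249448": 1,
  "249447": 1,
  "249443": 1
}
(jq keeps the keys in insertion order here, i.e. the order in which each field1 value is first seen.)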
Using datamash and some shell utils: change the non-data delimiters to squeezed tabs, count field 3 (it'd be field 2, but there's a leading tab), reverse, then pretty-print as per the OP's spec:
tr -s '{":,}' '\t' < file | datamash -sg 3 count 3 | tac | xargs printf '%s : %s\n'
Output:
249449 : 2
249448 : 1
249447 : 1
249443 : 1

python csv print first 10 rows only

I am working with a large CSV file with a lot of rows and columns. I need only the first 5 columns but only if the value for column 1 of each row is 1. (Column 1 can only have value 0 or 1).
So far I can print out the first 5 columns but can't filter to only show when column 1 is equal to 1. My .awk file looks like:
BEGIN {FS = ","}
NR!=1 {print $1", " $2", " $3", "$4", "$5}
I have tried things like $1>1 but with no luck; the output is always every row, regardless of whether the first column of each row is a 0 or 1.
Modifying your awk a bit:
BEGIN {FS = ","; OFS = ", "}
$1 == 1 {print $1, $2, $3, $4, $5; n++}
n == 10 {exit}
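Assuming the script is saved as, say, filter.awk (name chosen here for illustration), run it with -f:
awk -f filter.awk input.csv
The n == 10 {exit} rule stops reading as soon as ten matching rows have been printed, which covers the "first 10 rows only" part of the title.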

Convert MySql column with multiple values to proper table

I have a dataset in the form of a CSV file that is sent to me on a regular basis. I want to import this data into my MySQL database and turn it into a proper set of tables. The problem I am having is that one of the fields is used to store multiple values. For example, the field stores email addresses; it may have one email address, or it may have two, three, four, etc. The field contents would look something like this: "user1#domain.com,user2#domain.com,user3#domain.com".
I need to be able to take the undetermined number of values from each field and add them into a separate table so that they look like this:
user1#domain.com
user2#domain.com
user3#domain.com
I am not sure how I can do this. Thank you for the help.
Probably the simplest way is a brute force approach of inserting the first email, then the second, and so on:
insert into newtable(email)
select substring_index(substring_index(emails, ',', 1), ',', -1)
from emails
where (length(replace(emails, ',', ',,')) - length(emails)) >= 1;
insert into newtable(email)
select substring_index(substring_index(emails, ',', 2), ',', -1)
from emails
where (length(replace(emails, ',', ',,')) - length(emails)) >= 2;
insert into newtable(email)
select substring_index(substring_index(emails, ',', 3), ',', -1)
from emails
where (length(replace(emails, ',', ',,')) - length(emails)) >= 3;
And so on.
That is, extract the nth element from the list and insert that into the table. The where clause counts the number of commas in the list, which is a proxy for the length of the list. For example, with emails = 'user1#domain.com,user2#domain.com,user3#domain.com', the inner substring_index(emails, ',', 2) keeps the first two addresses, the outer substring_index(..., ',', -1) then keeps only the second one, and the replace() expression counts 2 commas, so the row qualifies for the second insert.
You need to repeat this up to the maximum number of emails in the list.
Instead of importing the CSV file directly and then trying to fix the problems in it, I found the best way to attack this was to first pass the CSV through AWK.
AWK outputs three separate CSV files that follow the normal forms; I then import those tables and all is well. (A rough illustration of the split follows the script below.)
info="`ncftpget -V -c -u myuser -p mypassword ftp://fake.com/data_map.csv`"

echo "$info" | \
awk -F, -v OFS="," 'NR > 1 {
    split($6, keyvalue, ";")
    for (var in keyvalue) {
        gsub(/.*:/, "", keyvalue[var])
        print $1, keyvalue[var]
    }
}' > ~/sqlrw/table1.csv

echo "$info" | \
awk -F, -v OFS="," 'NR > 1 {
    split($6, keyvalue, ";")
    for (var in keyvalue) {
        gsub(/:/, ",", keyvalue[var])
        print keyvalue[var]
    }
}' > ~/sqlrw/table2.csv

sort -u ~/sqlrw/table2.csv -o ~/sqlrw/table2.csv

echo "$info" | \
awk -F, -v OFS="," 'NR > 1 {
    print $1, $2, $3, $4, $5, $7, $8
}' > ~/sqlrw/table3.csv
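As a rough illustration only (the real column layout isn't shown here), a hypothetical data row such as
7,a,b,c,d,email:user1#domain.com;email:user2#domain.com,e,f
would be split so that table1.csv gets
7,user1#domain.com
7,user2#domain.com
table2.csv (after the sort -u) gets
email,user1#domain.com
email,user2#domain.com
and table3.csv gets the row minus its sixth field:
7,a,b,c,d,e,f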
Maybe a simple PHP script would/should do the trick:
<?php
// rough sketch: assumes $file holds the multi-valued email field,
// e.g. "user1#domain.com,user2#domain.com,user3#domain.com"
$file = file_get_contents("my_file.csv");
$tmp = explode(",", $file); // the addresses in the field are comma-separated
for ($i = 0; $i < count($tmp); $i++)
{
    $field = $tmp[$i];
    $q = "INSERT INTO my_table (emails) VALUES ('$field')";
    // or use $i as an id if you don't have an autoincrement
    $q = "INSERT INTO my_table (id, emails) VALUES ($i, '$field')";
    // execute query (ideally with a prepared statement) ....
}
?>
Hope this helps even if it's not pure SQL .....

How to specify *one* tab as field separator in AWK?

By default, AWK treats white-space field separators, such as tab when using FS = "\t", as either one or many, i.e. a run of them counts as a single separator. Therefore, if you want to read in a tab-separated file with null values in some columns (other than the last), it skips over them. For example:
1 "\t" 2 "\t" "" "\t" 4 "\t" 5
$3 would refer to 4, not the null "" even though there are clearly two tabs.
What should I do so that I can specify the field separator to be one tab only, so that $4 would refer to 4 and not 5?
echo '1 "\t" 2 "\t" "" "\t" 4 "\t" 5' | awk -F"\t" '{print "$3="$3 , "$4="$4}'
Output:
$3=" "" " $4=" 4 "
So you can remove the double quotes in your original string, and get:
echo '1\t2\t\t4\t5' | awk -F"\t" '{print "$3="$3 , "$4="$4}'
Output:
$3= $4=4
You're right, the default FS is white space, with the caveat that a space and a tab character next to each other qualify as one FS instance. So to use just "\t" as your FS, you can pass it as a command-line argument as above, or you can include an explicit reset of FS, usually done in a BEGIN block, like:
echo '1 "\t" 2 "\t" "" "\t" 4 "\t" 5' | awk 'BEGIN{FS="\t"}{print "$3="$3 , "$4="$4}'
IHTH