How to create separate variables in a .csv file derived from values in text files? - csv

I would be grateful for any advice you could provide on how to do the following from the UNIX command line. Essentially, I have a text file for each of my subjects, which looks like the following (simulated data).
2.97 3.61 -1.88
-0.38 2.33 -0.22
0.76 -0.71 -0.97
The subject ID is contained in the text file's name (e.g. '100012_var.txt').
I would like to write a .csv file where, for each subject, all of the values appear in a single row, each under its own variable heading. For instance:
ID Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9
100012 2.97 3.61 -1.88 -0.38 2.33 -0.22 0.76 -0.71 -0.97
100013 -1.21 1.79 -0.88 -0.91 2.01 2.88 0.32 -1.15 2.70
I would also like to ensure this is consistent across all subjects, i.e. the first value in row 1 is always coded Var1.
I would really appreciate any suggestions!

Using awk:
$ awk -v RS="" -v OFS="\t" '       # use the whole file as one record *
NR==1 {                            # first record: build the header
    printf "ID" OFS
    for(i=1;i<=NF;i++)
        printf "Var%d%s",i,(i<NF?OFS:ORS)
}
{
    split(FILENAME,f,"_")          # split the filename on _ to get the subject ID
    $1=$1                          # rebuild the record so OFS (tab) is used
    print f[1],$0                  # print the ID followed by the values
}' 100012_var.txt 100013_var.txt   # the subject files
Output:
ID Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9
100012 2.97 3.61 -1.88 -0.38 2.33 -0.22 0.76 -0.71 -0.97
100013 -1.21 1.79 -0.88 -0.91 2.01 2.88 0.32 -1.15 2.70
* -v RS="" enables paragraph mode: records are separated by blank lines, so each whole file is read as a single record, and newlines act as additional field separators.
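A quick way to see what paragraph mode does (a minimal sketch; the two input lines below form one record, so NF counts all four values):
$ printf '2.97 3.61\n-0.38 2.33\n' | awk -v RS="" '{ print NF }'
4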

Using Miller (https://github.com/johnkerl/miller) and Perl:
mlr --n2x --ifs ' ' --repifs put '$file=FILENAME' then reorder -f file input.tsv | \
perl -p -e 's/^\r\n$//g' | \
mlr --n2c --ifs ' ' --repifs uniq -a then cut -f 2 then cat -n then reshape -s n,2 \
then rename 1,ID then rename -r '([0-9]+),VAR\1'
you will get (it's a CSV):
ID,VAR2,VAR3,VAR4,VAR5,VAR6,VAR7,VAR8,VAR9,VAR10
input.tsv,2.97,3.61,-1.88,-0.38,2.33,-0.22,0.76,-0.71,-0.97
Then you can run a for loop over all the files, as sketched below.
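An untested sketch of that loop, assuming the per-subject files all match *_var.txt; note that each run emits its own header line, so you may want to strip the repeated headers when concatenating:
for f in *_var.txt; do
    mlr --n2x --ifs ' ' --repifs put '$file=FILENAME' then reorder -f file "$f" | \
    perl -p -e 's/^\r\n$//g' | \
    mlr --n2c --ifs ' ' --repifs uniq -a then cut -f 2 then cat -n then reshape -s n,2 \
        then rename 1,ID then rename -r '([0-9]+),VAR\1'
done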

Subtract fixed number of days from date column using awk and add it to new column

Let's assume that we have a file with the values as seen below:
% head test.csv
20220601,A,B,1
20220530,A,B,1
And we want to add two new columns, one with the date minus 1 day and one with the date minus 7 days, resulting in the following:
% head new_test.csv
20220601,A,B,20220525,20220531,1
20220530,A,B,20220523,20220529,1
The awk that was used to produce the above is:
% awk 'BEGIN{FS=OFS=","} {
    a = "date -d \"$(date -d \""$1"\") -7 days\" +'%Y%m%d'"; a | getline st; close(a)
    b = "date -d \"$(date -d \""$1"\") -1 days\" +'%Y%m%d'"; b | getline cb; close(b)
    print $1","$2","$3","st","cb","$4
}' test.csv > new_test.csv
But after applying the above to a large file with more than 100K lines, it runs for 20 minutes. Is there any way to optimize the awk?
One GNU awk approach:
awk '
BEGIN { FS = OFS = ","
        secs_in_day = 60 * 60 * 24
}
{   dt  = mktime( substr($1,1,4) " " substr($1,5,2) " " substr($1,7,2) " 12 0 0" )
    dt1 = strftime("%Y%m%d", dt - secs_in_day)
    dt7 = strftime("%Y%m%d", dt - (secs_in_day * 7))
    print $1, $2, $3, dt7, dt1, $4
}
' test.csv
This generates:
20220601,A,B,20220525,20220531,1
20220530,A,B,20220523,20220529,1
NOTES:
requires GNU awk for the mktime() and strftime() functions; see GNU awk time functions for more details
other flavors of awk may have similar functions; your mileage may vary
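If you are not sure which flavor you have, a quick check (GNU awk identifies itself; other awks print something different or reject the flag):
$ awk --version 2>/dev/null | head -n 1    # GNU awk prints "GNU Awk <version> ..."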
You can try using function calls; it is faster than calling the command inline.
awk -F, '
function cmd1(date,    a, st) {
    a = "date -d \"$(date -d \"" date "\") -1 days\" +%Y%m%d"
    a | getline st
    close(a)              # close before returning; a close() placed after return never runs
    return st
}
function cmd2(date,    b, cm) {
    b = "date -d \"$(date -d \"" date "\") -7 days\" +%Y%m%d"
    b | getline cm
    close(b)
    return cm
}
{
    $5 = cmd1($1)
    $6 = cmd2($1)
    print $1, $2, $3, $5, $6, $4
}' OFS=, test.csv > new_test.csv
I executed this against a file with 20,000 records and it finished in seconds, compared to around 5 minutes for the original awk.
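Even the function version still spawns date twice per input line, so a further optimization, just an untested sketch (assumes GNU date): memoize the results in an awk array, so each unique (date, offset) pair forks date only once, which helps a lot when dates repeat across lines:
awk -F, -v OFS=, '
function shifted(date, days,    cmd, out) {    # cmd and out are locals
    if ((date SUBSEP days) in cache)           # reuse a previously computed result
        return cache[date, days]
    cmd = "date -d \"" date " " days " days\" +%Y%m%d"
    cmd | getline out
    close(cmd)
    return cache[date, days] = out
}
{ print $1, $2, $3, shifted($1, -7), shifted($1, -1), $4 }
' test.csv > new_test.csv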

How to use grep command inside Tcl script

How do I run a simple grep command in a Tcl script and get its output?
grep B file1 > temp # shell grep command that needs to be executed inside a Tcl command
file1 looks like this:
1 2 3 6 180.00 B
1 2 3 6 F
2 3 6 23 50.00 B
2 3 6 23 F
These do not work:
exec grep B file.txt > temp
child process exited abnormally
exec "grep B pes_test.com > temp1"
couldn't execute "grep -e B ./pes_test.com > temp1": no such file or directory
exec /bin/sh -c {grep -e B ; true} < pes_test.com > tmp1
works, but gives no output.
exec throws an error when the process exits with a non-zero status. See the exec man page and the Tcl wiki.
try {
    set result [exec grep $pattern $file]
} on error {e} {
    # typically, the pattern was not found
    set result ""
}
Ref: try man page
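If you actually want the matches written to a file, as in the original grep B file1 > temp, a minimal Tcl 8.6 sketch: exec understands > when it is passed as a bare word rather than inside a quoted string, and a trap handler can ignore grep's exit status of 1 (no matches):
try {
    exec grep B file1 > temp
} trap CHILDSTATUS {} {
    # grep exits 1 when no lines match; temp is simply left empty
}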

Complex CSV parsing with Linux commands

I have a CSV log file that records the properties HA;HB;HC;HD;HE. The following file records 6 entries (separated by the above header).
I would like to extract the 3rd property (HC) of every entry.
HA;HB;HC;HD;HE
a1;b1;14;d;e
HA;HB;HC;HD;HE
a2;b2;28;d;e
HA;HB;HC;HD;HE
a31;b31;44;d;e
a32;b32;07;d;e
HA;HB;HC;HD;HE
a4;b4;0;d;e
HA;HB;HC;HD;HE
a51;b51;32;d;e
a52;b52;0;d;e
a53;b53;5;d;e
HA;HB;HC;HD;HE
a6;b6;10;d;e
Whenever an entry records n lines, I want to extract the sum of the n HC values.
The expected output for the above file:
14
28
51
0
37
10
I know I can write a program for this, but is there an easy way to get this with a combination of awk and/or sed commands?
I haven't tested this; try it and let me know if it works.
awk -F';' '
$3 == "HC" {
    if (NR > 1) {
        print sum
        sum = 0
    }
    next
}
{ sum += $3 }
END { print sum }' file
awk solution:
$ awk -F';' '$3=="HC" && p{
    print sum          # print the current total
    sum=p=0            # reinitialize sum and p
    next
}
$3!="HC"{
    sum=sum+($3+0)     # make sure $3 is converted to a number, and sum it up
    p=1                # set p to 1
}
END{print sum}         # print the last sum
' input.txt
output:
14
28
51
0
37
10
one-liner:
$ awk -F";" '$3=="HC" && p{print sum;sum=p=0;next} $3!="HC"{sum=sum+($3+0);p=1} END{print sum}' input.txt
awk -F';' '/^H.*/{if(f)print s;s=0;f=$3=="HC"}f{s+=$3}END{if(f)print s}' infile
For the given input:
$ cat infile
HA;HB;HC;HD;HE
a1;b1;14;d;e
HA;HB;HC;HD;HE
a2;b2;28;d;e
HA;HB;HC;HD;HE
a31;b31;44;d;e
a32;b32;07;d;e
HA;HB;HC;HD;HE
a4;b4;0;d;e
HA;HB;HC;HD;HE
a51;b51;32;d;e
a52;b52;0;d;e
a53;b53;5;d;e
HA;HB;HC;HD;HE
a6;b6;10;d;e
$ awk -F';' '/^H.*/{if(f)print s; s=0; f=$3=="HC"}f{s+=$3}END{if(f)print s}' infile
14
28
51
0
37
10
It takes a little more care in some cases, for example:
$ cat infile2
HA;HB;HC;HD;HE
a1;b1;14;d;e
HA;HB;HC;HD;HE
a2;b2;28;d;e
HA;HB;HC;HD;HE
a31;b31;44;d;e
a32;b32;07;d;e
HA;HB;HC;HD;HE
a4;b4;0;d;e
HA;HB;HD;HD;HE <---- say HC is not found here
a51;b51;32;d;e
a52;b52;0;d;e
a53;b53;5;d;e
HA;HB;HC;HD;HE
a6;b6;10;d;e
# find only HC in 3rd column
$ awk -F';' '/^H.*/{if(f)print s; s=0; f=$3=="HC"}f{s+=$3}END{if(f)print s}' infile2
14
28
51
0
10
# Find HD in 3rd column
$ awk -F';' '/^H.*/{if(f)print s; s=0; f=$3=="HD"}f{s+=$3}END{if(f)print s}' infile2
37
eval "true || $(cat data.csv | cut -d ";" -f3 \
    | sed -e s/"HC"/"0; expr 0"/g \
    | tr '\n' '#' \
    | sed -e s/"##"/""/g \
    | sed -e s/"#"/" + "/g)"
Explanation:
Get contents of the file using cat
Take only the third column using cut delimiter of ;
Replace HC lines with 0; expr 0 values to start building eval-worthy bash expressions to eventually yield expr 0 + 14;
Replace \n newlines temporarily with # to circumvent possible BSD sed limitations
Replace double ## with single # to avoid blank lines turning into spaces and causing expr to bomb out.
Replace # with + to add the numbers together.
Execute the command, but with a true || 0; expr ... to avoid a guaranteed syntax error on the first line.
Which creates this:
true || 0; expr 0 + 14 + 0; expr 0 + 28 + 0; expr 0 + 44 + 07 + 0; expr 0 + 0 + 0; expr 0 + 32 + 0 + 5 + 0; expr 0 + 10
The output looks like this:
14
28
51
0
37
10
This was tested on Bash 3.2 and MacOS El Capitan.
Could you please try the following and let me know if it helps you?
awk -F";" '
/^H/ && $3!="HC"{
    flag=""
    next
}
/^H/ && $3=="HC"{
    if(NR>1){
        printf("%d\n",sum)
    }
    sum=0
    flag=1
    next
}
flag{
    sum+=$3
}
END{
    printf("%d\n",sum)
}
' Input_file
Output will be as follows.
14
28
51
0
37
10
$ awk -F';' '$3=="HC"{if (NR>1) print s; s=0; next} {s+=$3} END{print s}' file
14
28
51
0
37
10

AWK wrong math on first line only

This is the input file input.awk (DOS line endings):
06-13-2014,08:43:11
RLS007817
RRC001021
yes,71.61673,0,150,37,1
no,11,156,1.35,306.418
4,3,-1,2.5165,20,-1.4204
-4,0,11,0,0,0
1.00E-001,0.2,3.00E-001,0.6786031,0.5,6.37E-002
110,40,30,222,200,-539
120,50,35,215,220,-547
130,60,40,207,240,-553
140,70,45,196,260,-560
150,80,50,184,280,-566
160,90,55,170,300,-573
170,100,60,157,320,-578
180,110,65,141,340,-582
190,120,70,126,360,-586
200,130,75,110,380,-590
This is what I basically need:
Ignore the first 8 lines (OK)
Pick and split the numbers on lines 6,7 & 8 (OK)
Do AWK math on columns (Error only in first line?)
BASH code
#!/bin/bash
myfile="input.awk"
vzeros=$(sed '6q;d' $myfile)
vshift=$(sed '7q;d' $myfile)
vcalib=$(sed '8q;d' $myfile)
IFS=','
read -a avz <<< "${vzeros}"
read -a avs <<< "${vshift}"
read -a avc <<< "${vcalib}"
z1=${avz[0]};s1=${avs[0]};c1=${avc[0]}
z2=${avz[1]};s2=${avs[1]};c2=${avc[1]}
z3=${avz[2]};s3=${avs[2]};c3=${avc[2]}
z4=${avz[4]};s4=${avs[4]};c4=${avc[4]}
#The single variables will be passed to awk
awk -v z1="$z1" -v c1="$c1" -v s1="$s1" -v z2="$z2" -v c2="$c2" -v s2="$s2" -v z3="$z3" -v c3="$c3" -v s3="$s3" -v z4="$z4" -v c4="$c4" -v s4="$s4" 'NR>8 { FS = "," ;
nc1 = c1 * ( $1 - z1 - s1 );
nc2 = c2 * ( $2 - z2 - s2 );
nc3 = c3 * ( $3 - z3 - s3 );
nc4 = c4 * ( $5 - z4 - s4 );
print nc1,nc2,nc3,nc4 }' $myfile > test.plot
This is the result on the file test.plot
11 -0.6 -3 -10
12 9.4 7.5 100
13 11.4 9 110
14 13.4 10.5 120
15 15.4 12 130
16 17.4 13.5 140
17 19.4 15 150
18 21.4 16.5 160
19 23.4 18 170
20 25.4 19.5 180
This is the weird part... Only in the first line, and only after the first column, is everything wrong. And I have no idea why.
This is the expected result file:
11 7.4 6 90
12 9.4 7.5 100
13 11.4 9 110
14 13.4 10.5 120
15 15.4 12 130
16 17.4 13.5 140
17 19.4 15 150
18 21.4 16.5 160
19 23.4 18 170
20 25.4 19.5 180
I've printed the correction factors captured from lines 6,7 & 8 and everything is fine. All math is fine, except on the first line, after the first column.
OS: Slackware 13.37.
AWK: GNU Awk 3.1.6 Copyright (C) 1989, 1991-2007 Free Software Foundation.
I agree with @jeanrjc.
I copied your file and script to my machine and reduced it to processing the first 2 lines of your data.
With your code as is, I duplicate your results, i.e.
#dbg $0=110,40,30,222,200,-539
#dbg c2=0.2 $2= z2=3 s2=0
11 -0.6 -3 -10
#dbg $0=120,50,35,215,220,-547
#dbg c2=0.2 $2= z2=3 s2=0
12 -0.6 -3 -10
With FS=","; commented out and -F, added to the option list, the output is what you are looking for.
#dbg $0=110,40,30,222,200,-539
#dbg c2=0.2 $2=40 z2=3 s2=0
11 7.4 6 90
#dbg $0=120,50,35,215,220,-547
#dbg c2=0.2 $2=50 z2=3 s2=0
12 9.4 7.5 100
So make sure you have removed the FS=","; from the block of code and that you are using -F,. In any case, I would say that resetting FS="," for each line that is processed is not useful.
If that still doesn't solve it, try the corrected code on a machine with a newer version of awk.
It would take a small magazine article to completely illustrate what happens while reading through the first 8 records (when FS is still the default whitespace). On the first row that meets your rule NR>8, FS is still whitespace when the fields are parsed; then FS is set to ",", but that first row is not rescanned.
IHTH!
Your sample is too complex to reproduce easily, but I guess you should try:
awk -F"," 'NR>8{...
instead of
awk 'NR>8 { FS = "," ;
You can also try with BEGIN:
awk 'BEGIN{FS=","}NR>8{...
I eventually tested your script, and you should change the position of the FS parameter, as I told you:
awk -v z1="$z1" -v c1="$c1" -v s1="$s1" -v z2="$z2" \
-v c2="$c2" -v s2="$s2" -v z3="$z3" -v c3="$c3" \
-v s3="$s3" -v z4="$z4" -v c4="$c4" -v s4="$s4" -F"," 'NR>8 {
nc1 = c1 * ( $1 - z1 - s1 );
nc2 = c2 * ( $2 - z2 - s2 );
nc3 = c3 * ( $3 - z3 - s3 );
nc4 = c4 * ( $5 - z4 - s4 );
print nc1,nc2,nc3,nc4 }' $myfile
11 7.4 6 90
12 9.4 7.5 100
13 11.4 9 110
14 13.4 10.5 120
15 15.4 12 130
16 17.4 13.5 140
17 19.4 15 150
18 21.4 16.5 160
19 23.4 18 170
20 25.4 19.5 180
0 -0.6 -3 -10
Why did you have a problem?
Because awk parses the line before executing the block, so if you tell it to change something related to parsing, the change takes effect from the next line.
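A two-line demonstration of that ordering, a sketch you can paste into any shell:
$ echo 'a,b' | awk '{ FS = ","; print $1 }'    # fields were already split before FS changed
a,b
$ echo 'a,b' | awk -F, '{ print $1 }'          # FS is in effect before the first line is read
a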
HTH

Read file, parse by tabs and insert into mysql database with Bash Script

I am writing a bash script to read a file line by line, parse it, and insert it into a MySQL database.
The script is the following:
#!/bin/bash
echo "Start!"
while IFS=' ' read -ra ADDR;
do
for line in $(cat filename)
do
regex='(\d\d)-(\d\d)-(\d\d)\s(\d\d:\d\d:\d\d)'
if [[$line=~$regex]]
then
$line='20$3-$2-$1 $4';
fi
echo "insert into table (time, total, caracas, anzoategui) values('$line', '$line', '$line', $
done | mysql -uuser -ppassword database;
done < filename
The file 'filename' with data is something like this:
15/08/13 09:34:38 17528 5240 399 89 460 159 1107 33240
15/08/13 09:42:57 17528 5240 399 89 460 159 1107 33240
15/08/13 10:20:03 17492 5217 394 89 459 159 1101 33245
15/08/13 11:20:02 17521 5210 402 90 462 158 1112 33249
15/08/13 12:20:04 17540 5209 396 90 459 160 1105 33258
And it fails when I run it.
Use the LOAD DATA statement. Do any transformations you have to on your file first, then
LOAD DATA LOCAL INFILE 'filename' INTO TABLE tablename (time, total, caracas, anzoategui);
Tab-separated fields are the default.
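You can even let MySQL rewrite the date during the load; an untested sketch (tablename is a placeholder, and the column list assumes the first five tab-separated fields are the ones you want, discarding the rest):
LOAD DATA LOCAL INFILE 'filename'
INTO TABLE tablename
FIELDS TERMINATED BY '\t'
(@d, @t, total, caracas, anzoategui)
SET time = STR_TO_DATE(CONCAT(@d, ' ', @t), '%d/%m/%y %H:%i:%s');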
I would recommend creating your SQL file using awk and then sourcing it from mysql. Once the output looks OK to you, redirect it into a new file (say insert.sql) and source it from the mysql command line. Something like this:
$ cat file
15/08/13 09:34:38 17528 5240 399 89 460 159 1107 33240
15/08/13 09:42:57 17528 5240 399 89 460 159 1107 33240
15/08/13 10:20:03 17492 5217 394 89 459 159 1101 33245
15/08/13 11:20:02 17521 5210 402 90 462 158 1112 33249
15/08/13 12:20:04 17540 5209 396 90 459 160 1105 33258
$ awk -F'[/ ]+' -v q="'" '{print "insert into table (time, total, caracas, anzoategui) values ("q"20"$3"-"$2"-"$1" "$4q","q$5q","q$6q","q$7q");"}' file
insert into table (time, total, caracas, anzoategui) values ('2013-08-15 09:34:38','17528','5240','399');
insert into table (time, total, caracas, anzoategui) values ('2013-08-15 09:42:57','17528','5240','399');
insert into table (time, total, caracas, anzoategui) values ('2013-08-15 10:20:03','17492','5217','394');
insert into table (time, total, caracas, anzoategui) values ('2013-08-15 11:20:02','17521','5210','402');
insert into table (time, total, caracas, anzoategui) values ('2013-08-15 12:20:04','17540','5209','396');
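Then, to load it (a usage sketch; the awk program is abbreviated here, and the credentials are the question's placeholders):
$ awk -F'[/ ]+' -v q="'" '{ ... }' file > insert.sql   # the same awk program as above
$ mysql -uuser -ppassword database < insert.sql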
Bash uses POSIX character classes. So instead of \d you could use [0-9] or [[:digit:]] to match a digit.
Also, there must be spaces in between the brackets and operator in your regex test. So [[$line=~$regex]] would be fixed by doing [[ $line =~ $regex ]].
In order to access the substrings captured by the parenthesized groups, you must use the builtin array variable BASH_REMATCH. So, to access the first capture, you'd reference ${BASH_REMATCH[1]}, the second ${BASH_REMATCH[2]}, and so on.
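Putting those fixes together, an untested sketch of the loop (the table name and column mapping are kept from the question and are assumptions; the date format follows the sample data, dd/mm/yy):
#!/bin/bash
re='^([0-9]{2})/([0-9]{2})/([0-9]{2})$'           # dd/mm/yy
while read -r d t total caracas anzoategui _; do  # default IFS splits on tabs and spaces
    if [[ $d =~ $re ]]; then                      # note the spaces around =~
        ts="20${BASH_REMATCH[3]}-${BASH_REMATCH[2]}-${BASH_REMATCH[1]} $t"
        echo "insert into table (time, total, caracas, anzoategui) values ('$ts', '$total', '$caracas', '$anzoategui');"
    fi
done < filename | mysql -uuser -ppassword database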