AWK wrong math on first line only - csv

This is the input file input.awk (DOS-type line endings):
06-13-2014,08:43:11
RLS007817
RRC001021
yes,71.61673,0,150,37,1
no,11,156,1.35,306.418
4,3,-1,2.5165,20,-1.4204
-4,0,11,0,0,0
1.00E-001,0.2,3.00E-001,0.6786031,0.5,6.37E-002
110,40,30,222,200,-539
120,50,35,215,220,-547
130,60,40,207,240,-553
140,70,45,196,260,-560
150,80,50,184,280,-566
160,90,55,170,300,-573
170,100,60,157,320,-578
180,110,65,141,340,-582
190,120,70,126,360,-586
200,130,75,110,380,-590
This is what I basically need:
Ignore the first 8 lines (OK)
Pick and split the numbers on lines 6, 7 & 8 (OK)
Do AWK math on the columns (error only in the first line?)
Bash code:
#!/bin/bash
myfile="input.awk"
vzeros=$(sed '6q;d' $myfile)
vshift=$(sed '7q;d' $myfile)
vcalib=$(sed '8q;d' $myfile)
IFS=','
read -a avz <<< "${vzeros}"
read -a avs <<< "${vshift}"
read -a avc <<< "${vcalib}"
z1=${avz[0]};s1=${avs[0]};c1=${avc[0]}
z2=${avz[1]};s2=${avs[1]};c2=${avc[1]}
z3=${avz[2]};s3=${avs[2]};c3=${avc[2]}
z4=${avz[4]};s4=${avs[4]};c4=${avc[4]}
#The single variables will be passed to awk
awk -v z1="$z1" -v c1="$c1" -v s1="$s1" -v z2="$z2" -v c2="$c2" -v s2="$s2" -v z3="$z3" -v c3="$c3" -v s3="$s3" -v z4="$z4" -v c4="$c4" -v s4="$s4" 'NR>8 { FS = "," ;
nc1 = c1 * ( $1 - z1 - s1 );
nc2 = c2 * ( $2 - z2 - s2 );
nc3 = c3 * ( $3 - z3 - s3 );
nc4 = c4 * ( $5 - z4 - s4 );
print nc1,nc2,nc3,nc4 }' $myfile > test.plot
This is the result in the file test.plot:
11 -0.6 -3 -10
12 9.4 7.5 100
13 11.4 9 110
14 13.4 10.5 120
15 15.4 12 130
16 17.4 13.5 140
17 19.4 15 150
18 21.4 16.5 160
19 23.4 18 170
20 25.4 19.5 180
This is the weird part: only in the first line, everything after the first column is wrong, and I have no idea why.
This is the expected result file:
11 7.4 6 90
12 9.4 7.5 100
13 11.4 9 110
14 13.4 10.5 120
15 15.4 12 130
16 17.4 13.5 140
17 19.4 15 150
18 21.4 16.5 160
19 23.4 18 170
20 25.4 19.5 180
I've printed the correction factors captured from lines 6, 7 & 8 and everything is fine. All the math is fine, except on the first line, after the first column.
OS: Slackware 13.37.
AWK: GNU Awk 3.1.6 Copyright (C) 1989, 1991-2007 Free Software Foundation.

I agree with @jeanrjc.
I copied your file and script to my machine and reduced it to processing the first 2 lines of your data.
With your code as is, I duplicate your results, i.e.
#dbg $0=110,40,30,222,200,-539
#dbg c2=0.2 $2= z2=3 s2=0
11 -0.6 -3 -10
#dbg $0=120,50,35,215,220,-547
#dbg c2=0.2 $2= z2=3 s2=0
12 -0.6 -3 -10
With FS=","; commented out, and -F, added in the option list the output is what you are looking for.
#dbg $0=110,40,30,222,200,-539
#dbg c2=0.2 $2=40 z2=3 s2=0
11 7.4 6 90
#dbg $0=120,50,35,215,220,-547
#dbg c2=0.2 $2=50 z2=3 s2=0
12 9.4 7.5 100
So make sure you have removed the FS=","; from the block of code and that you are using -F,. In any case, resetting FS="," for every line that is processed is not useful.
If that still doesn't solve it, try the corrected code on a machine with a newer version of awk.
It would take a small magazine article to fully illustrate what is happening: while reading the first 8 records, FS is the default [[:space:]]; on the first row that meets your rule NR>8, the fields are parsed while FS is still [[:space:]], then FS is set to ",", but that first row is not rescanned. (The first column still comes out right because the whole line, coerced to a number, yields its leading 110.)
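A tiny made-up example shows the timing:
$ printf 'a,b\nc,d\n' | awk '{ FS = ","; print $1 }'
a,b
c
The first record was already split on whitespace by the time FS was assigned, so $1 is the whole line; only the second record gets split on the comma.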
IHTH!

Your sample is too complex to reproduce easily, but I suggest you try:
awk -F"," 'NR>8{...
instead of
awk 'NR>8 { FS = "," ;
You can also try with BEGIN:
awk 'BEGIN{FS=","}NR>8{...
I eventually tested your script, and you should change the position of the FS parameter, as I told you:
awk -v z1="$z1" -v c1="$c1" -v s1="$s1" -v z2="$z2" \
-v c2="$c2" -v s2="$s2" -v z3="$z3" -v c3="$c3" \
-v s3="$s3" -v z4="$z4" -v c4="$c4" -v s4="$s4" -F"," 'NR>8 {
nc1 = c1 * ( $1 - z1 - s1 );
nc2 = c2 * ( $2 - z2 - s2 );
nc3 = c3 * ( $3 - z3 - s3 );
nc4 = c4 * ( $5 - z4 - s4 );
print nc1,nc2,nc3,nc4 }' $myfile
11 7.4 6 90
12 9.4 7.5 100
13 11.4 9 110
14 13.4 10.5 120
15 15.4 12 130
16 17.4 13.5 140
17 19.4 15 150
18 21.4 16.5 160
19 23.4 18 170
20 25.4 19.5 180
0 -0.6 -3 -10
Why did you have a problem?
Because awk parses the line before executing the block, so if you tell it to change something related to parsing, the change only takes effect from the next line. (The stray 0 -0.6 -3 -10 row at the end presumably comes from a trailing empty line in the DOS-format file: with every field empty, each formula reduces to its constant part.)
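If you really must set FS inside the main block, you can force awk to re-split the current record by reassigning $0 to itself, e.g. (made-up input):
$ printf 'a,b\nc,d\n' | awk '{ FS = ","; $0 = $0; print $1 }'
a
c
But -F or a BEGIN block is cleaner.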
HTH

Related

Which post-hoc test after Welch ANOVA?

I'm doing the statistical evaluation for my master's thesis. The Levene test was significant, so I did the Welch ANOVA, which was also significant. Now I tried the Games-Howell post-hoc test, but it didn't work.
Can anybody help me by sending the exact functions I have to run in R to do the Games-Howell post-hoc test and to get a kind of compact letter display showing which treatments are not significantly different from each other? I also wanted to ask if I did the Welch ANOVA the right way (you can find the R output below).
Here is the output of what I have done so far for the statistical evaluation:
'data.frame': 30 obs. of 3 variables:
$ Dauer: Factor w/ 6 levels "0","2","4","6",..: 1 2 3 4 5 6 1 2 3 4 ...
$ WH : Factor w/ 5 levels "r1","r2","r3",..: 1 1 1 1 1 1 2 2 2 2 ...
$ TSO2 : num 107 86 98 97 88 95 93 96 96 99 ...
> leveneTest(TSO2~Dauer, data=TSO2R)
Levene's Test for Homogeneity of Variance (center = median)
      Df F value  Pr(>F)
group  5  3.3491 0.01956 *
      24
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> oneway.test(TSO2 ~Dauer, data=TSO2R, var.equal = FALSE) ### Welch ANOVA
        One-way analysis of means (not assuming equal variances)
data:  TSO2 and Dauer
F = 5.7466, num df = 5.000, denom df = 10.685, p-value = 0.00807
Thank you very much!
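A hedged sketch of one way to run Games-Howell in R, using the rstatix package (an assumption on my part; any package implementing the test would do), with TSO2R as above:
library(rstatix)  # assumes install.packages("rstatix") has been run
games_howell_test(TSO2R, TSO2 ~ Dauer)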

How to create separate variables in a .csv file derived from values in text files?

I would be grateful for any advice you could provide on how to run the following in the UNIX command line. Essentially, I have text files for each of my subjects, which look like the following (simulated data).
2.97 3.61 -1.88
-0.38 2.33 -0.22
0.76 -0.71 -0.97
The subject ID is contained in the text file's name (e.g. '100012_var.txt').
I would like to write a .csv file where each value (for each subject) in a row appears under a new variable heading. For instance:
ID Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9
100012 2.97 3.61 -1.88 -0.38 2.33 -0.22 0.76 -0.71 -0.97
100013 -1.21 1.79 -0.88 -0.91 2.01 2.88 0.32 -1.15 2.70
I would also like to ensure this is consistent across all subjects, i.e. value 1 in row 1 is always coded VAR 1.
I would really appreciate any suggestions!
Using awk:
$ awk -v RS="" -v OFS="\t" ' # using whole file as a record *
NR==1 { # first record, build the header
printf "ID" OFS
for(i=1;i<=NF;i++)
printf "Var%d%s",i,(i<NF?OFS:ORS)
}
{
split(FILENAME,f,"_") # split filename by _ to get the number
$1=$1 # rebuild the record to use tabs (OFS)
print f[1],$0 # print number part and the values
}' 100012_var.txt 100013_var.txt # them files
Output:
ID Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9
100012 2.97 3.61 -1.88 -0.38 2.33 -0.22 0.76 -0.71 -0.97
100013 -1.21 1.79 -0.88 -0.91 2.01 2.88 0.32 -1.15 2.70
* -v RS="" explained here.
Using Miller (https://github.com/johnkerl/miller) and perl
mlr --n2x --ifs ' ' --repifs put '$file=FILENAME' then reorder -f file input.tsv | \
perl -p -e 's/^\r\n$//g' | \
mlr --n2c --ifs ' ' --repifs uniq -a then cut -f 2 then cat -n then reshape -s n,2 \
then rename 1,ID then rename -r '([0-9]+),VAR\1'
you will get (this is a CSV):
ID,VAR2,VAR3,VAR4,VAR5,VAR6,VAR7,VAR8,VAR9,VAR10
input.tsv,2.97,3.61,-1.88,-0.38,2.33,-0.22,0.76,-0.71,-0.97
Then you can do a for loop for all files.

Padding preceding blanks on a variable for Money or Disk Space lineup

OK, basically I am echoing a line to a comma-delimited CSV.
This is the output I am getting:
Computer1 Fri 08/04/2017 13:20 110 917 340 907
Computer2 Fri 08/04/2017 13:21 110 917 435 852
Computer3 Fri 08/04/2017 12:39 180 92 916
Computer4 Fri 08/04/2017 12:35 232 353 720
I want:
Computer1 Fri 08/04/2017 13:20 110 917 340 907
Computer2 Fri 08/04/2017 13:21 110 917 435 852
Computer3 Fri 08/04/2017 12:39      180 92 916
Computer4 Fri 08/04/2017 12:35     232 353 720
I want to lead with a comma for every 3rd right-justified character, so the values line up correctly.
I am getting size of folders to calculate current folder size, then again weekly to determine growth.
The part I am struggling with is this:
for /f "tokens=1-2 delims= " %%a in ('C:\du64.EXE -v -q -nobanner C:\Temp^|find "Size:"') do SET DISKSIZE=%%b
ECHO. "%DISKSIZE%" **
(This will give a value containing commas. ex 12,345,678,910)
ECHO. %COMPUTERNAME%,%DATE%,%TIME:~0,5%,%DISKSIZE%,%PROCESSOR_ARCHITECTURE%>> "C:\DUOutput.CSV"
...set "DISKSIZE= %%b"
echo %disksize:~-15%
No idea why you're getting 92 in your data, nor what "lead with a comma for every 3rd right-justified character" means.
See set /? | more at the prompt for documentation. I've no idea how many spaces I put before %%b; as long as there are at least 15 of them, it should be OK.
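For what it's worth, here is a self-contained sketch of the pad-then-trim idea with made-up sizes (the variable names are mine, not from the original script):
@echo off
setlocal
rem made-up sizes, as du64 might report them
set "SIZE1=110,917,340,907"
set "SIZE2=180,92,916"
rem pad with 15 leading spaces, then keep only the last 15 characters
set "PAD=               "
set "SIZE1=%PAD%%SIZE1%"
set "SIZE2=%PAD%%SIZE2%"
echo [%SIZE1:~-15%]
echo [%SIZE2:~-15%]
endlocal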

standard unambiguous format [R] MySQL imported data

OK, to set the scene, I have written a function to import multiple tables from MySQL (using RODBC) and run randomForest() on them.
This function is run on multiple databases (as separate instances).
In one particular database, and one particular table, the "error in as.POSIXlt.character(x, tz,.....): character string not in a standard unambiguous format" error is thrown. The function runs on around 150 tables across two databases without any issues except this one table.
Here is a head() print from the table:
MQLTime bar5 bar4 bar3 bar2 bar1 pat1 baXRC
1 2014-11-05 23:35:00 184 24 8 24 67 147 Flat
2 2014-11-05 23:57:00 203 184 204 67 51 147 Flat
3 2014-11-06 00:40:00 179 309 49 189 75 19 Flat
4 2014-11-06 00:46:00 28 192 60 49 152 147 Flat
5 2014-11-06 01:20:00 309 48 9 11 24 19 Flat
6 2014-11-06 01:31:00 24 177 64 152 188 19 Flat
And here is the function:
GenerateRF <- function(db, countstable, RFcutoff) {
'load required libraries'
library(RODBC)
library(randomForest)
library(caret)
library(ff)
library(stringi)
'connection and data preparation'
connection <- odbcConnect ('TTODBC', uid='root', pwd='password', case="nochange")
'import count table and check if RF is allowed to be built'
query.str <- paste0 ('select * from ', db, '.', countstable, ' order by RowCount asc')
row.counts <- sqlQuery (connection, query.str)
'Operate only on tables that have >= RFcutoff'
for (i in 1:nrow (row.counts)) {
table.name <- as.character (row.counts[i,1])
col.count <- as.numeric (row.counts[i,2])
row.count <- as.numeric (row.counts[i,3])
if (row.count >= RFcutoff) {
'Delete old RFs and DFs for input pattern'
if (file.exists (paste0 (table.name, '_RF.Rdata'))) {
file.remove (paste0 (table.name, '_RF.Rdata'))
}
if (file.exists (paste0 (table.name, '_DF.Rdata'))) {
file.remove (paste0 (table.name, '_DF.Rdata'))
}
'import and clean data'
query.str2 <- paste0 ('select * from ', db, '.', table.name, ' order by mqltime asc')
raw.data <- sqlQuery(connection, query.str2)
'partition data into training/test sets'
set.seed(489)
index <- createDataPartition(raw.data$baXRC, p=0.66, list=FALSE, times=1)
data.train <- raw.data [index,]
data.test <- raw.data [-index,]
'find optimal trees to grow (without outcome and dates)'
data.mtry <- as.data.frame (tuneRF (data.train [, c(-1,-col.count)], data.train$baXRC, ntreetry=100,
stepFactor=.5, improve=0.01, trace=TRUE, plot=TRUE, dobest=FALSE))
best.mtry <- data.mtry [which (data.mtry[,2] == min (data.mtry[,2])), 1]
'compress df'
data.ff <- as.ffdf (data.train)
'run RF. Originally set to 1000 trees but M1 dataset is too large for laptop. Maybe train at the lab?'
data.rf <- randomForest (baXRC~., data=data.ff[,-1], mtry=best.mtry, ntree=500, keep.forest=TRUE,
importance=TRUE, proximity=FALSE)
'generate and print variable importance plot'
varImpPlot (data.rf, main = table.name)
'predict on test data'
data.test.pred <- as.data.frame( predict (data.rf, data.test, type="prob"))
'get dates and name date column'
data.test.dates <- data.frame (data.test[,1])
colnames (data.test.dates) <- 'MQLTime'
'attach dates to prediction df'
data.test.res <- cbind (data.test.dates, data.test.pred)
'force date coercion to attempt negating unambiguous format error '
data.test.res$MQLTime <- format(data.test.res$MQLTime, format = "%Y-%m-%d %H:%M:%S")
'delete row names, coerce to dataframe, generate row table name and export outcomes to MySQL'
rownames (data.test.res)<-NULL
data.test.res <- as.data.frame (data.test.res)
root.table <- stri_sub(table.name, 0, -5)
sqlUpdate (connection, data.test.res, tablename = paste0(db, '.', root.table, '_outcome'), index = "MQLTime")
'save RF and test df/s for future use; save latest version of row_counts to MQL4 folder'
save (data.rf, file = paste0 ("C:/Users/user/Documents/RF_test2/", table.name, '_RF.Rdata'))
save (data.test, file = paste0 ("C:/Users/user/Documents/RF_test2/", table.name, '_DF.Rdata'))
write.table (row.counts, paste0("C:/Users/user/AppData/Roaming/MetaQuotes/Terminal/71FA4710ABEFC21F77A62A104A956F23/MQL4/Files/", db, "_m1_rowcounts.csv"), sep = ",", col.names = F,
row.names = F, quote = F)
'end of conditional block'
}
'end of for loop'
}
'close all connection to MySQL'
odbcCloseAll()
'clear workspace'
rm(list=ls())
'end of function'
}
At this line:
data.test.res$MQLTime <- format(data.test.res$MQLTime, format = "%Y-%m-%d %H:%M:%S")
I have tried coercing MQLTime using various functions including: as.character(), as.POSIXct(), as.POSIXlt(), as.Date(), format(), as.character(as.Date())
and have also tried:
"%y" vs "%Y" and "%OS" vs "%S"
All variants seem to have no effect on the error and the function is still able to run on all other tables. I have checked the table manually (which contains almost 1500 rows) and also in MySQL looking for NULL dates or dates like "0000-00-00 00:00:00".
Also, if I run the function line by line in the R terminal, this offending table is processed without any problems, which just confuses the hell out of me.
I've exhausted all the functions/solutions I can think of (and also all those I could find through Dr. Google) so I am pleading for help here.
I should probably mention that the MQLTime column is stored as varchar() in MySQL. This was done to try and get around issues with type conversions between R and MySQL
SHOW VARIABLES LIKE "%version%";
innodb_version, 5.6.19
protocol_version, 10
slave_type_conversions,
version, 5.6.19
version_comment, MySQL Community Server (GPL)
version_compile_machine, x86
version_compile_os, Win32
> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: i386-w64-mingw32/i386 (32-bit)
Edit: str() output of the data as imported from MySQL, showing MQLTime is already in POSIXct format:
> str(raw.data)
'data.frame': 1472 obs. of 8 variables:
$ MQLTime: POSIXct, format: "2014-11-05 23:35:00" "2014-11-05 23:57:00" "2014-11-06 00:40:00" "2014-11-06 00:46:00" ...
$ bar5 : int 184 203 179 28 309 24 156 48 309 437 ...
$ bar4 : int 24 184 309 192 48 177 48 68 60 71 ...
$ bar3 : int 8 204 49 60 9 64 68 27 192 147 ...
$ bar2 : int 24 67 189 49 11 152 27 56 437 67 ...
$ bar1 : int 67 51 75 152 24 188 56 147 71 0 ...
$ pat1 : int 147 147 19 147 19 19 147 19 147 19 ...
$ baXRC : Factor w/ 3 levels "Down","Flat",..: 2 2 2 2 2 2 2 2 2 3 ...
So I have tried declaring stringsAsFactors = FALSE in the data frame operations and this had no effect.
Interestingly, if the offending table is removed from processing through an additional conditional statement in the first 'if' block, the function stops on the table immediately preceding the blocked table.
If both the original and the new offending tables are removed from processing, then the function stops on the table immediately prior to them. I have never seen this sort of behavior before and it really has me stumped.
I watched system resources during the function and they never seem to max out.
Could this be a problem with the 'for' loop and not necessarily date formats?
There appears to be some egg on my face. The table following the table where the function was stopping had a row with value '0000-00-00 00:00:00'. I added another statement in my MySQL function to remove these rows when pre-processing the tables. Thanks to those that had a look at this.
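For reference, the cleanup amounts to something like this (table name hypothetical):
DELETE FROM mydb.offending_table
WHERE MQLTime = '0000-00-00 00:00:00';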

Filtering CSV files by multiple columns, sort them and create 2 new files

I have been searching for how to do the following for a couple of hours and could not find it. I apologize if I am repeating something.
I have 22 CSV files with 14 columns and 17,392 lines each. I am using awk to filter the original files with the following commands:
First, I need to get the lines that have values in column 14 smaller than 0.05:
awk -F '\t' '$14 < 0.05 { print $0 }' file1 > file2
Next, I need to get the lines with values in column 10 smaller than -1 or higher than 1, into separate files:
awk -F '\t' '$10 < -1 { print $0 }' file2 > file3
awk -F '\t' '$10 > 1 { print $0 }' file2 > file4
My last step is to get the lines that have values in column 7 OR 8 higher than 1 (e.g. on 7 could be 0 if on 8 it is 1):
awk -F '\t' '$7<=1 {print $0}' file3 > file5
awk -F '\t' '$8>=1 {print $0}' file4 > file6
My problem is that I create several intermediate files. I would need just two files at the end: file3 and 4, where columns 7 or 8 have values equal to or greater than 1. How can I make an awk command that does this all at once?
Thank you.
Your question is ambiguous, so there are many possible answers. However, you can combine conditions in awk and you can write to separate files in a single pass, so you might mean:
awk -F '\t' '$14 < 0.05 && $10 < -1 && $7 > 1 { print > "file5" }
$14 < 0.05 && $10 > +1 && $8 > 1 { print > "file6" }' file1
This command should give you the same output in file5 and file6 as you got from your original sequence of operations (but it only makes one pass over the data, not many). (Strictly, it produces the same answer if you change your $7<=1 to $7>1 to agree with your description of wanting column 7 or 8 higher than 1, though that contradicts your example 'on 7 could be 0 if on 8 it is 1'.)
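For completeness, if you want the combined command to keep your original comparisons exactly as written ($7 <= 1 and $8 >= 1) rather than the description, that would be:
awk -F '\t' '$14 < 0.05 && $10 < -1 && $7 <= 1 { print > "file5" }
             $14 < 0.05 && $10 >  1 && $8 >= 1 { print > "file6" }' file1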
Given an input file:
1 2 3 4 5 6 7 8 9 -10 11 12 13 -14
1 2 3 4 5 6 7 8 9 10 11 12 13 -14
1 2 3 4 5 6 7 8 9 10 11 12 13 14
The output in file5 is:
1 2 3 4 5 6 7 8 9 -10 11 12 13 -14
and the output in file6 is:
1 2 3 4 5 6 7 8 9 10 11 12 13 -14
If you need to combine the conditions differently, then you need to clarify your question.
You could try:
awk -F'\t' '($14 < 0.05) && ($10 < -1) && ($7 <= 1) {print}' file1 > file3