Read file, parse by tabs and insert into mysql database with Bash Script - mysql

I am writing a bash script to read a file line by line, parse each line, and insert it into a MySQL database.
The script is the following:
#!/bin/bash
echo "Start!"
while IFS=' ' read -ra ADDR;
do
  for line in $(cat filename)
  do
    regex='(\d\d)-(\d\d)-(\d\d)\s(\d\d:\d\d:\d\d)'
    if [[$line=~$regex]]
    then
      $line='20$3-$2-$1 $4';
    fi
    echo "insert into table (time, total, caracas, anzoategui) values('$line', '$line', '$line', $
  done | mysql -uuser -ppassword database;
done < filename
The file 'filename' contains data like this:
15/08/13 09:34:38 17528 5240 399 89 460 159 1107 33240
15/08/13 09:42:57 17528 5240 399 89 460 159 1107 33240
15/08/13 10:20:03 17492 5217 394 89 459 159 1101 33245
15/08/13 11:20:02 17521 5210 402 90 462 158 1112 33249
15/08/13 12:20:04 17540 5209 396 90 459 160 1105 33258
And it's throwing this:

Use the LOAD DATA statement. Do any transformations you have to on your file first, then
LOAD DATA LOCAL INFILE 'filename' INTO TABLE table (time, total, caracas, anzoategui)
Tab-separated fields are the default.
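If the date transformation is the only blocker, it can even be folded into the LOAD DATA statement itself via user variables and a SET clause. A minimal sketch, assuming tab-separated fields, the question's column names, a hypothetical table name, and that the server permits LOCAL INFILE:
#!/bin/bash
# Hypothetical table/credential names; the date and time fields are captured
# into user variables and rebuilt as 'YYYY-MM-DD HH:MM:SS' by STR_TO_DATE.
# The trailing five numeric fields are discarded into dummy variables.
mysql --local-infile=1 -uuser -ppassword database <<'SQL'
LOAD DATA LOCAL INFILE 'filename' INTO TABLE mytable
  FIELDS TERMINATED BY '\t'
  (@d, @t, total, caracas, anzoategui, @x1, @x2, @x3, @x4, @x5)
  SET time = STR_TO_DATE(CONCAT(@d, ' ', @t), '%d/%m/%y %H:%i:%s');
SQL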

I would recommend creating your SQL file using awk and then sourcing it from mysql. Once the output looks OK to you, redirect it into a new file (say, insert.sql) and source it from the mysql command line. Something like this:
$ cat file
15/08/13 09:34:38 17528 5240 399 89 460 159 1107 33240
15/08/13 09:42:57 17528 5240 399 89 460 159 1107 33240
15/08/13 10:20:03 17492 5217 394 89 459 159 1101 33245
15/08/13 11:20:02 17521 5210 402 90 462 158 1112 33249
15/08/13 12:20:04 17540 5209 396 90 459 160 1105 33258
$ awk -F'[/ ]+' -v q="'" '{print "insert into table (time, total, caracas, anzoategui) values ("q"20"$3"-"$2"-"$1" "$4q","q$5q","q$6q","q$7q");"}' file
insert into table (time, total, caracas, anzoategui) values ('2013-08-15 09:34:38','17528','5240','399');
insert into table (time, total, caracas, anzoategui) values ('2013-08-15 09:42:57','17528','5240','399');
insert into table (time, total, caracas, anzoategui) values ('2013-08-15 10:20:03','17492','5217','394');
insert into table (time, total, caracas, anzoategui) values ('2013-08-15 11:20:02','17521','5210','402');
insert into table (time, total, caracas, anzoategui) values ('2013-08-15 12:20:04','17540','5209','396');
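Once the generated statements look right, redirect them into insert.sql and feed that file to mysql (credentials as in the question):
$ awk -F'[/ ]+' -v q="'" '{print "insert into table (time, total, caracas, anzoategui) values ("q"20"$3"-"$2"-"$1" "$4q","q$5q","q$6q","q$7q");"}' file > insert.sql
$ mysql -uuser -ppassword database < insert.sql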

Bash uses POSIX character classes. So instead of \d you could use [0-9] or [[:digit:]] to match a digit.
Also, there must be spaces between the brackets and the operands in your regex test. So [[$line=~$regex]] would be fixed as [[ $line =~ $regex ]].
To access the groups captured by the parentheses, use the builtin array variable BASH_REMATCH. The first match is ${BASH_REMATCH[1]}, the second ${BASH_REMATCH[2]}, and so on.
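Putting those three fixes together, a minimal sketch of the corrected loop (assumptions: fields are tab-separated, the dates use slashes as in the sample data rather than the dashes in the original regex, and the table name and credentials are the question's placeholders):
#!/bin/bash
# [0-9] replaces \d, spaces surround =~, and BASH_REMATCH holds the groups.
regex='([0-9]{2})/([0-9]{2})/([0-9]{2}) ([0-9]{2}:[0-9]{2}:[0-9]{2})'
while IFS=$'\t' read -r d t total caracas anzoategui rest; do
  if [[ "$d $t" =~ $regex ]]; then
    # rebuild '15/08/13 09:34:38' as '2013-08-15 09:34:38'
    ts="20${BASH_REMATCH[3]}-${BASH_REMATCH[2]}-${BASH_REMATCH[1]} ${BASH_REMATCH[4]}"
    echo "insert into mytable (time, total, caracas, anzoategui) values ('$ts', '$total', '$caracas', '$anzoategui');"
  fi
done < filename | mysql -uuser -ppassword database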

Related

How to create separate variables in a .csv file derived from values in text files?

I would be grateful for any advice on how to accomplish the following from the UNIX command line. Essentially, I have text files for each of my subjects, which look like the following (simulated data).
2.97 3.61 -1.88
-0.38 2.33 -0.22
0.76 -0.71 -0.97
The subject ID is contained in the text file's name (e.g. '100012_var.txt').
I would like to write a .csv file where each value (for each subject) in a row appears under a new variable heading. For instance:
ID Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9
100012 2.97 3.61 -1.88 -0.38 2.33 -0.22 0.76 -0.71 -0.97
100013 -1.21 1.79 -0.88 -0.91 2.01 2.88 0.32 -1.15 2.70
I would also like to ensure this is consistent across all subjects, i.e. value 1 in row 1 is always coded VAR 1.
I would really appreciate any suggestions!
Using awk:
$ awk -v RS="" -v OFS="\t" ' # using whole file as a record *
NR==1 { # first record, build the header
printf "ID" OFS
for(i=1;i<=NF;i++)
printf "Var%d%s",i,(i<NF?OFS:ORS)
}
{
split(FILENAME,f,"_") # split filename by _ to get the number
$1=$1 # rebuild the record to use tabs (OFS)
print f[1],$0 # print number part and the values
}' 100012_var.txt 100013_var.txt # them files
Output:
ID Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9
100012 2.97 3.61 -1.88 -0.38 2.33 -0.22 0.76 -0.71 -0.97
100013 -1.21 1.79 -0.88 -0.91 2.01 2.88 0.32 -1.15 2.70
* -v RS="" puts awk in paragraph mode: records are separated by blank lines, so each of these files is read as a single record, and newlines act as additional field separators.
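If you want actual comma-separated output, the same program works with OFS set to a comma, and a shell glob picks up every subject file; a sketch:
$ awk -v RS="" -v OFS="," 'NR==1{printf "ID" OFS; for(i=1;i<=NF;i++)printf "Var%d%s",i,(i<NF?OFS:ORS)} {split(FILENAME,f,"_"); $1=$1; print f[1],$0}' *_var.txt > subjects.csv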
Using Miller (https://github.com/johnkerl/miller) and perl
mlr --n2x --ifs ' ' --repifs put '$file=FILENAME' then reorder -f file input.tsv | \
perl -p -e 's/^\r\n$//g' | \
mlr --n2c --ifs ' ' --repifs uniq -a then cut -f 2 then cat -n then reshape -s n,2 \
then rename 1,ID then rename -r '([0-9]+),VAR\1'
you will get (it's a CSV):
ID,VAR2,VAR3,VAR4,VAR5,VAR6,VAR7,VAR8,VAR9,VAR10
input.tsv,2.97,3.61,-1.88,-0.38,2.33,-0.22,0.76,-0.71,-0.97
Then you can wrap it in a for loop over all files, as sketched below.
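A sketch of that loop, simply rerunning the pipeline above on each subject file (names assumed to match *_var.txt; every iteration emits its own CSV header, which you may want to filter out after the first):
for f in *_var.txt; do
  mlr --n2x --ifs ' ' --repifs put '$file=FILENAME' then reorder -f file "$f" | \
    perl -p -e 's/^\r\n$//g' | \
    mlr --n2c --ifs ' ' --repifs uniq -a then cut -f 2 then cat -n then reshape -s n,2 \
      then rename 1,ID then rename -r '([0-9]+),VAR\1'
done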

Padding preceding blanks on a variable for Money or Disk Space lineup

OK, basically I am echoing a line to a comma-delimited CSV.
This is the output:
Computer1 Fri 08/04/2017 13:20 110 917 340 907
Computer2 Fri 08/04/2017 13:21 110 917 435 852
Computer3 Fri 08/04/2017 12:39 180 92 916
Computer4 Fri 08/04/2017 12:35 232 353 720
I want:
Computer1 Fri 08/04/2017 13:20 110 917 340 907
Computer2 Fri 08/04/2017 13:21 110 917 435 852
Computer3 Fri 08/04/2017 12:39      180 92 916
Computer4 Fri 08/04/2017 12:35     232 353 720
I want to lead with a comma for every 3rd right-justified character, so the values line up correctly.
I am getting the size of folders to calculate the current folder size, then again weekly to determine growth.
The part I am struggling with is this:
for /f "tokens=1-2 delims= " %%a in ('C:\du64.EXE -v -q -nobanner C:\Temp^|find "Size:"') do SET DISKSIZE=%%b
ECHO. "%DISKSIZE%" **
(This will give a value containing commas, e.g. 12,345,678,910.)
ECHO. %COMPUTERNAME%,%DATE%,%TIME:~0,5%,%DISKSIZE%,%PROCESSOR_ARCHITECTURE%>> "C:\DUOutput.CSV"
...set "DISKSIZE=               %%b"
echo %disksize:~-15%
No idea why you're getting 92 in your data, nor what "lead with a comma for every 3rd right-justified character" means.
See set /?|more from the prompt for documentation. I've no idea how many spaces I put before %%b; as long as the padding makes the string at least 15 characters long, it should be OK.

How to get SELECT to output escaped "\0" character

When I run query #1:
SELECT * FROM $TABLE INTO OUTFILE $FILE CHARACTER SET utf8 FIELDS ESCAPED BY '\\\\' TERMINATED BY '|';
There is this one field value that is outputted as:
blah\0
I am trying to get identical output without using INTO OUTFILE.
query #2 with Perl code:
$query = $db->prepare("SELECT * FROM $TABLE");
$query->execute;
open(FILE_OUT, ">$FILE");
However, the same column above is outputted as
blah
So the \0 character (ASCII NUL) is outputted differently. How do I modify query #2 to output blah\0?
The modification could be in either MySQL query or Perl. The column's Collation is utf8_general_ci. I've tried using CONVERT($column using utf8) but that still displayed blah, and I'm not sure how I would apply it to every column of every table when outputting.
Update:
This Perl code worked in escaping the NUL character.
$row =~ s/\0/\\0/g;
But there are many other MySQL special characters; is there a way to escape them all at once instead of handling them one by one?
The quote method of the database handle should be able to quote any special characters:
my $sth = $dbh->prepare("select '\0'");
$sth->execute;
while (my $row = $sth->fetch) {
my ($col) = @$row;
print "[$col] is quoted as ", $dbh->quote($col), "\n";
}
On my version of MySQL this prints:
[] is quoted as '\0'
If I pipe that through hexdump -C we get
00000000 5b 00 5d 20 69 73 20 71 75 6f 74 65 64 20 61 73 |[.] is quoted as|
00000010 20 27 5c 30 27 0a | '\0'.|
00000016
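If shelling out from Perl is an option, the mysql command-line client may also get you close: as far as I recall, its tab-separated batch (-B) output escapes NUL, tab, newline, and backslash (the --raw option disables this). A hedged sketch, with placeholder credentials and table name:
$ mysql -B -uuser -ppassword -e 'SELECT * FROM mytable' database > outfile.txt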

standard unambiguous format [R] MySQL imported data

OK, to set the scene, I have written a function to import multiple tables from MySQL (using RODBC) and run randomForest() on them.
This function is run on multiple databases (as separate instances).
In one particular database, and one particular table, the "error in as.POSIXlt.character(x, tz,.....): character string not in a standard unambiguous format" error is thrown. The function runs on around 150 tables across two databases without any issues except this one table.
Here is a head() print from the table:
MQLTime bar5 bar4 bar3 bar2 bar1 pat1 baXRC
1 2014-11-05 23:35:00 184 24 8 24 67 147 Flat
2 2014-11-05 23:57:00 203 184 204 67 51 147 Flat
3 2014-11-06 00:40:00 179 309 49 189 75 19 Flat
4 2014-11-06 00:46:00 28 192 60 49 152 147 Flat
5 2014-11-06 01:20:00 309 48 9 11 24 19 Flat
6 2014-11-06 01:31:00 24 177 64 152 188 19 Flat
And here is the function:
GenerateRF <- function(db, countstable, RFcutoff) {
  # load required libraries
  library(RODBC)
  library(randomForest)
  library(caret)
  library(ff)
  library(stringi)

  # connection and data preparation
  connection <- odbcConnect('TTODBC', uid='root', pwd='password', case="nochange")

  # import count table and check if RF is allowed to be built
  query.str <- paste0('select * from ', db, '.', countstable, ' order by RowCount asc')
  row.counts <- sqlQuery(connection, query.str)

  # operate only on tables that have >= RFcutoff
  for (i in 1:nrow(row.counts)) {
    table.name <- as.character(row.counts[i, 1])
    col.count <- as.numeric(row.counts[i, 2])
    row.count <- as.numeric(row.counts[i, 3])
    if (row.count >= 20) {

      # delete old RFs and DFs for input pattern
      if (file.exists(paste0(table.name, '_RF.Rdata'))) {
        file.remove(paste0(table.name, '_RF.Rdata'))
      }
      if (file.exists(paste0(table.name, '_DF.Rdata'))) {
        file.remove(paste0(table.name, '_DF.Rdata'))
      }

      # import and clean data
      query.str2 <- paste0('select * from ', db, '.', table.name, ' order by mqltime asc')
      raw.data <- sqlQuery(connection, query.str2)

      # partition data into training/test sets
      set.seed(489)
      index <- createDataPartition(raw.data$baXRC, p=0.66, list=FALSE, times=1)
      data.train <- raw.data[index, ]
      data.test <- raw.data[-index, ]

      # find optimal trees to grow (without outcome and dates)
      data.mtry <- as.data.frame(tuneRF(data.train[, c(-1, -col.count)], data.train$baXRC, ntreetry=100,
                                        stepFactor=.5, improve=0.01, trace=TRUE, plot=TRUE, dobest=FALSE))
      best.mtry <- data.mtry[which(data.mtry[, 2] == min(data.mtry[, 2])), 1]

      # compress df
      data.ff <- as.ffdf(data.train)

      # run RF. Originally set to 1000 trees but M1 dataset is too large for laptop. Maybe train at the lab?
      data.rf <- randomForest(baXRC ~ ., data=data.ff[, -1], mtry=best.mtry, ntree=500, keep.forest=TRUE,
                              importance=TRUE, proximity=FALSE)

      # generate and print variable importance plot
      varImpPlot(data.rf, main = table.name)

      # predict on test data
      data.test.pred <- as.data.frame(predict(data.rf, data.test, type="prob"))

      # get dates and name date column
      data.test.dates <- data.frame(data.test[, 1])
      colnames(data.test.dates) <- 'MQLTime'

      # attach dates to prediction df
      data.test.res <- cbind(data.test.dates, data.test.pred)

      # force date coercion to attempt negating unambiguous format error
      data.test.res$MQLTime <- format(data.test.res$MQLTime, format = "%Y-%m-%d %H:%M:%S")

      # delete row names, coerce to dataframe, generate root table name and export outcomes to MySQL
      rownames(data.test.res) <- NULL
      data.test.res <- as.data.frame(data.test.res)
      root.table <- stri_sub(table.name, 0, -5)
      sqlUpdate(connection, data.test.res, tablename = paste0(db, '.', root.table, '_outcome'), index = "MQLTime")

      # save RF and test df/s for future use; save latest version of row_counts to MQL4 folder
      save(data.rf, file = paste0("C:/Users/user/Documents/RF_test2/", table.name, '_RF.Rdata'))
      save(data.test, file = paste0("C:/Users/user/Documents/RF_test2/", table.name, '_DF.Rdata'))
      write.table(row.counts, paste0("C:/Users/user/AppData/Roaming/MetaQuotes/Terminal/71FA4710ABEFC21F77A62A104A956F23/MQL4/Files/", db, "_m1_rowcounts.csv"), sep = ",", col.names = F,
                  row.names = F, quote = F)

    } # end of conditional block
  } # end of for loop

  # close all connections to MySQL
  odbcCloseAll()

  # clear workspace
  rm(list=ls())
} # end of function
At this line:
data.test.res$MQLTime <- format(data.test.res$MQLTime, format = "%Y-%m-%d %H:%M:%S")
I have tried coercing MQLTime using various functions including: as.character(), as.POSIXct(), as.POSIXlt(), as.Date(), format(), as.character(as.Date())
and have also tried:
"%y" vs "%Y" and "%OS" vs "%S"
All variants seem to have no effect on the error and the function is still able to run on all other tables. I have checked the table manually (which contains almost 1500 rows) and also in MySQL looking for NULL dates or dates like "0000-00-00 00:00:00".
Also, if I run the function line by line in the R terminal, this offending table is processed without any problems, which just confuses the hell out of me.
I've exhausted all the functions/solutions I can think of (and also all those I could find through Dr. Google) so I am pleading for help here.
I should probably mention that the MQLTime column is stored as varchar() in MySQL. This was done to try to get around issues with type conversions between R and MySQL.
SHOW VARIABLES LIKE "%version%";
innodb_version, 5.6.19
protocol_version, 10
slave_type_conversions,
version, 5.6.19
version_comment, MySQL Community Server (GPL)
version_compile_machine, x86
version_compile_os, Win32
> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: i386-w64-mingw32/i386 (32-bit)
Edit: str() output on the data as imported from MySQL, showing MQLTime is already in POSIXct format:
> str(raw.data)
'data.frame': 1472 obs. of 8 variables:
$ MQLTime: POSIXct, format: "2014-11-05 23:35:00" "2014-11-05 23:57:00" "2014-11-06 00:40:00" "2014-11-06 00:46:00" ...
$ bar5 : int 184 203 179 28 309 24 156 48 309 437 ...
$ bar4 : int 24 184 309 192 48 177 48 68 60 71 ...
$ bar3 : int 8 204 49 60 9 64 68 27 192 147 ...
$ bar2 : int 24 67 189 49 11 152 27 56 437 67 ...
$ bar1 : int 67 51 75 152 24 188 56 147 71 0 ...
$ pat1 : int 147 147 19 147 19 19 147 19 147 19 ...
$ baXRC : Factor w/ 3 levels "Down","Flat",..: 2 2 2 2 2 2 2 2 2 3 ...
So I have tried declaring stringsAsFactors = FALSE in the dataframe operations and this had no effect.
Interestingly, if the offending table is removed from processing through an additional conditional statement in the first 'if' block, the function stops on the table immediately preceding the blocked table.
If both the original and the new offending tables are removed from processing, then the function stops on the table immediately prior to them. I have never seen this sort of behavior before and it really has me stumped.
I watched system resources during the function and they never seem to max out.
Could this be a problem with the 'for' loop and not necessarily date formats?
There appears to be some egg on my face. The table following the table where the function was stopping had a row with value '0000-00-00 00:00:00'. I added another statement in my MySQL function to remove these rows when pre-processing the tables. Thanks to those that had a look at this.
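For reference, a sketch of the kind of pre-processing statement that removes such rows (table and credential names are hypothetical; since MQLTime is stored as varchar, a plain string comparison works):
$ mysql -uroot -ppassword -e "DELETE FROM mydb.mytable WHERE MQLTime = '0000-00-00 00:00:00';"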

Using MySQL SELECT WHERE IN with multibyte characters

I have a table of all defined Unicode characters (the character column) and their associated Unicode code points (the id column). I have the following query:
SELECT id FROM unicode WHERE `character` IN ('A', 'B', 'C')
While this query should return only 3 rows (id = 65, 66, 67), it instead returns 129 rows including the following IDs:
65 66 67 97 98 99 129 141 143 144 157 160 193 205 207 208 221 224 257
269 271 272 285 288 321 333 335 336 349 352 449 461 463 464 477 480
2049 2061 2063 2064 2077 2080 4161 4173 4175 4176 4189 4192 4929 4941
4943 4944 4957 4960 5057 5069 5071 5072 5085 5088 5121 5133 5135 5136
5149 5152 5953 5965 5967 5968 5984 6145 6157 6160 6176 8257 8269 8271
8272 8285 8288 9025 9037 9039 9040 9053 9056 9153 9165 9167 9168 9181
9184 9217 9229 9231 9232 9245 9248 10049 10061 10063 10064 10077 10080
10241 10253 10255 10256 10269 10272 12353 12365 12367 12368 12381
12384 13121 13133 13135 13136 13149 13152 13249 13261 13263 13264
13277 13280
I'm sure this must have something to do with multi-byte characters but I'm not sure how to fix it. Any ideas what's going on here?
String equality and order is governed by a collation. By default the collation used is determined from the column, but you can set the collation per-query with the COLLATE clause. For example, if your columns are declared with charset utf8 you could use utf8_bin to use a binary collation that considers A and à different:
SELECT id FROM unicode WHERE `character` COLLATE utf8_bin IN ('A', 'B', 'C')
Alternatively you could use the BINARY operator to convert character into a "binary string" which forces the use of a binary comparison, which is almost but not quite the same as binary collation:
SELECT id FROM unicode WHERE BINARY `character` IN ('A', 'B', 'C')
Update: I thought that the following should be equivalent, but it's not because a column has lower "coercibility" than the constants. The binary string constants would be converted into non-binary and then compared.
SELECT id FROM unicode WHERE `character` IN (_binary'A', _binary'B', _binary'C')
You can try:
SELECT id FROM unicode WHERE `character` IN (_utf8'A',_utf8'B',_utf8'C')
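If binary comparison is what you always want for this column, another option is to change the column's collation once instead of per query; a sketch, with the column type assumed for illustration:
$ mysql -uuser -ppassword -e "ALTER TABLE unicode MODIFY \`character\` CHAR(1) CHARACTER SET utf8 COLLATE utf8_bin;" database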