Using MySQL SELECT WHERE IN with multibyte characters - mysql

I have a table of all defined Unicode characters (the character column) and their associated Unicode points (the id column). I have the following query:
SELECT id FROM unicode WHERE `character` IN ('A', 'B', 'C')
While this query should return only 3 rows (id = 65, 66, 67), it instead returns 129 rows including the following IDs:
65 66 67 97 98 99 129 141 143 144 157 160 193 205 207 208 221 224 257
269 271 272 285 288 321 333 335 336 349 352 449 461 463 464 477 480
2049 2061 2063 2064 2077 2080 4161 4173 4175 4176 4189 4192 4929 4941
4943 4944 4957 4960 5057 5069 5071 5072 5085 5088 5121 5133 5135 5136
5149 5152 5953 5965 5967 5968 5984 6145 6157 6160 6176 8257 8269 8271
8272 8285 8288 9025 9037 9039 9040 9053 9056 9153 9165 9167 9168 9181
9184 9217 9229 9231 9232 9245 9248 10049 10061 10063 10064 10077 10080
10241 10253 10255 10256 10269 10272 12353 12365 12367 12368 12381
12384 13121 13133 13135 13136 13149 13152 13249 13261 13263 13264
13277 13280
I'm sure this must have something to do with multi-byte characters but I'm not sure how to fix it. Any ideas what's going on here?

String equality and order is governed by a collation. By default the collation used is determined from the column, but you can set the collation per-query with the COLLATE clause. For example, if your columns are declared with charset utf8 you could use utf8_bin to use a binary collation that considers A and à different:
SELECT id FROM unicode WHERE `character` COLLATE utf8_bin IN ('A', 'B', 'C')
Alternatively you could use the BINARY operator to convert character into a "binary string" which forces the use of a binary comparison, which is almost but not quite the same as binary collation:
SELECT id FROM unicode WHERE BINARY `character` IN ('A', 'B', 'C')
Update: I thought that the following should be equivalent, but it's not because a column has lower "coercibility" than the constants. The binary string constants would be converted into non-binary and then compared.
SELECT id FROM unicode WHERE `character` IN (_binary'A', _binary'B', _binary'C')

You can try:
SELECT id FROM unicode WHERE 'character' IN (_utf8'A',_utf8'B',_utf8'C')

Related

MySQL type conversion while ignoring nonnumeric characters

How do I convert from varchar(500) to float in MySQL while ignoring non-numeric characters? My code currently looks like this
select distinct(customer_id), cast(val_amt as float) from(
select prsn_real_gid, vital_nam, vital_performed_date,
case when val_amt ~'^[0-9]+' then val_amt else null end as val_amt from my_table);
but I get the following error message
[Amazon](500310) Invalid operation: Invalid digit, Value 'X', Pos 2, Type: Double
Details:
-----------------------------------------------
error: Invalid digit, Value 'X', Pos 2, Type: Double
code: 1207
context: 1.XYXY04
query: 4147
location: :0
process: query0_118_4147 [pid=0]
-----------------------------------------------;
1 statement failed.
My data for example looks like this:
customer_id | val_amt
111 | 23.45
112 | 21
113 | x
114 | /
115 |
116 | 23/24
As you can see I have decimals, integers, letters, symbols and nulls. I want to ignore everything that's not a decimal or integer.
You can use regexp:
select customer_id, val_amt + 0
from my_table
where val_amt regexp '^[0-9]+[.]?[0-9]*';

Writing multivariate objective function, with matrix data entities

S1=[20 32 44 56 68 80 92 104 116 128 140 152 164 176 188 200];
P=[16.82 26.93 37.01 47.1 57.21 67.32 77.41 87.5 97.54 107.7 117.8 127.9 138 148 158.2 168.3];
X = [0.119 0.191 0.262 0.334 0.405 0.477 0.548 0.620 0.691 0.763 0.835 0.906 0.978 1.049 1.120 1.192];
S = [2.3734 3.6058 5.0256 6.6854 8.6413 10.978 13.897 17.396 21.971 28.040 36.475 49.065 69.736 110.20 224.69 2779.1];
objective=#(x)((1250*x(3)*S(a)-(S(a)+x(2))*(P(a)+x(1)))/(1250*(S(a)+x(2))*(P(a)+x(1)))-x(5))^2+((x(2)*(P(a)^2+x(1)*P(a)))/(1250*x(4)*X(a)*x(3)-P(a)^2-x(1)*P(a))-S(a))^2+(74000/3*((X(a)*x(3)*S(a))/S1(a)*(S(a)+x(2)))-P(a))^2
%x0 = [Kp Ks mu.m Yp mu.d]
x0=[7.347705469 14.88611028 1.19747242 16.65696429 6.01E-03];
x=fminunc(objective,x0);
disp(x)
The code above is used for optimisizing the objective function, so that all the unknown values of the parameters can be found. As you may have seen, the objective function consists of 4 variables (S1, S, P, X), each having 16 data entities. My question is: how to create an objective function, so that all the data entities are utilised?
The final objective function has to be the sum of the objective function (shown above) with a=1:16. Any ideas?
Make the following changes to your code:
Replace, e.g. all S(a) variables with S to use the whole vector. Do the same for each of your four variables.
Convert all 'scalar' operations in your objective function to 'elementwise' ones, i.e. replace ^, * and / with .^, .* and ./. This produces 16 values, one for each index from 1 to 16 (i.e. what was previously referred to by a).
wrap the resulting expression into a sum() function to sum the 16 results into a final value
Use your optimiser as normal.
Resulting code:
S1 = [20 32 44 56 68 80 92 104 116 128 140 152 164 176 188 200];
P = [16.82 26.93 37.01 47.1 57.21 67.32 77.41 87.5 97.54 107.7 117.8 127.9 138 148 158.2 168.3];
X = [0.119 0.191 0.262 0.334 0.405 0.477 0.548 0.620 0.691 0.763 0.835 0.906 0.978 1.049 1.120 1.192];
S = [2.3734 3.6058 5.0256 6.6854 8.6413 10.978 13.897 17.396 21.971 28.040 36.475 49.065 69.736 110.20 224.69 2779.1];
objective = #(x) sum( ((1250.*x(3).*S-(S+x(2)).*(P+x(1)))./(1250.*(S+x(2)).*(P+x(1)))-x(5)).^2+((x(2).*(P.^2+x(1).*P))./(1250.*x(4).*X.*x(3)-P.^2-x(1).*P)-S).^2+(74000./3.*((X.*x(3).*S)./S1.*(S+x(2)))-P).^2 );
%x0 = [Kp Ks mu.m Yp mu.d]
x0 = [7.347705469 14.88611028 1.19747242 16.65696429 6.01E-03];
x = fminunc(objective,x0);
disp(x)
Note that you can make this code a lot clearer to read for humans; I just made the "direct" changes that illustrate conversion from your scalar expression to the desired vectorised one.

standard unambiguos format [R] MySQL imported data

OK, to set the scene, I have written a function to import multiple tables from MySQL (using RODBC) and run randomForest() on them.
This function is run on multiple databases (as separate instances).
In one particular database, and one particular table, the "error in as.POSIXlt.character(x, tz,.....): character string not in a standard unambiguous format" error is thrown. The function runs on around 150 tables across two databases without any issues except this one table.
Here is a head() print from the table:
MQLTime bar5 bar4 bar3 bar2 bar1 pat1 baXRC
1 2014-11-05 23:35:00 184 24 8 24 67 147 Flat
2 2014-11-05 23:57:00 203 184 204 67 51 147 Flat
3 2014-11-06 00:40:00 179 309 49 189 75 19 Flat
4 2014-11-06 00:46:00 28 192 60 49 152 147 Flat
5 2014-11-06 01:20:00 309 48 9 11 24 19 Flat
6 2014-11-06 01:31:00 24 177 64 152 188 19 Flat
And here is the function:
GenerateRF <- function(db, countstable, RFcutoff) {
'load required libraries'
library(RODBC)
library(randomForest)
library(caret)
library(ff)
library(stringi)
'connection and data preparation'
connection <- odbcConnect ('TTODBC', uid='root', pwd='password', case="nochange")
'import count table and check if RF is allowed to be built'
query.str <- paste0 ('select * from ', db, '.', countstable, ' order by RowCount asc')
row.counts <- sqlQuery (connection, query.str)
'Operate only on tables that have >= RFcutoff'
for (i in 1:nrow (row.counts)) {
table.name <- as.character (row.counts[i,1])
col.count <- as.numeric (row.counts[i,2])
row.count <- as.numeric (row.counts[i,3])
if (row.count >= 20) {
'Delete old RFs and DFs for input pattern'
if (file.exists (paste0 (table.name, '_RF.Rdata'))) {
file.remove (paste0 (table.name, '_RF.Rdata'))
}
if (file.exists (paste0 (table.name, '_DF.Rdata'))) {
file.remove (paste0 (table.name, '_DF.Rdata'))
}
'import and clean data'
query.str2 <- paste0 ('select * from ', db, '.', table.name, ' order by mqltime asc')
raw.data <- sqlQuery(connection, query.str2)
'partition data into training/test sets'
set.seed(489)
index <- createDataPartition(raw.data$baXRC, p=0.66, list=FALSE, times=1)
data.train <- raw.data [index,]
data.test <- raw.data [-index,]
'find optimal trees to grow (without outcome and dates)
data.mtry <- as.data.frame (tuneRF (data.train [, c(-1,-col.count)], data.train$baXRC, ntreetry=100,
stepFactor=.5, improve=0.01, trace=TRUE, plot=TRUE, dobest=FALSE))
best.mtry <- data.mtry [which (data.mtry[,2] == min (data.mtry[,2])), 1]
'compress df'
data.ff <- as.ffdf (data.train)
'run RF. Originally set to 1000 trees but M1 dataset is to large for laptop. Maybe train at the lab?'
data.rf <- randomForest (baXRC~., data=data.ff[,-1], mtry=best.mtry, ntree=500, keep.forest=TRUE,
importance=TRUE, proximity=FALSE)
'generate and print variable importance plot'
varImpPlot (data.rf, main = table.name)
'predict on test data'
data.test.pred <- as.data.frame( predict (data.rf, data.test, type="prob"))
'get dates and name date column'
data.test.dates <- data.frame (data.test[,1])
colnames (data.test.dates) <- 'MQLTime'
'attach dates to prediction df'
data.test.res <- cbind (data.test.dates, data.test.pred)
'force date coercion to attempt negating unambiguous format error '
data.test.res$MQLTime <- format(data.test.res$MQLTime, format = "%Y-%m-%d %H:%M:%S")
'delete row names, coerce to dataframe, generate row table name and export outcomes to MySQL'
rownames (data.test.res)<-NULL
data.test.res <- as.data.frame (data.test.res)
root.table <- stri_sub(table.name, 0, -5)
sqlUpdate (connection, data.test.res, tablename = paste0(db, '.', root.table, '_outcome'), index = "MQLTime")
'save RF and test df/s for future use; save latest version of row_counts to MQL4 folder'
save (data.rf, file = paste0 ("C:/Users/user/Documents/RF_test2/", table.name, '_RF.Rdata'))
save (data.test, file = paste0 ("C:/Users/user/Documents/RF_test2/", table.name, '_DF.Rdata'))
write.table (row.counts, paste0("C:/Users/user/AppData/Roaming/MetaQuotes/Terminal/71FA4710ABEFC21F77A62A104A956F23/MQL4/Files/", db, "_m1_rowcounts.csv"), sep = ",", col.names = F,
row.names = F, quote = F)
'end of conditional block'
}
'end of for loop'
}
'close all connection to MySQL'
odbcCloseAll()
'clear workspace'
rm(list=ls())
'end of function'
}
At this line:
data.test.res$MQLTime <- format(data.test.res$MQLTime, format = "%Y-%m-%d %H:%M:%S")
I have tried coercing MQLTime using various functions including: as.character(), as.POSIXct(), as.POSIXlt(), as.Date(), format(), as.character(as.Date())
and have also tried:
"%y" vs "%Y" and "%OS" vs "%S"
All variants seem to have no effect on the error and the function is still able to run on all other tables. I have checked the table manually (which contains almost 1500 rows) and also in MySQL looking for NULL dates or dates like "0000-00-00 00:00:00".
Also, if I run the function line by line in R terminal, this offending table is processed without any problems which just confuses the hell out me.
I've exhausted all the functions/solutions I can think of (and also all those I could find through Dr. Google) so I am pleading for help here.
I should probably mention that the MQLTime column is stored as varchar() in MySQL. This was done to try and get around issues with type conversions between R and MySQL
SHOW VARIABLES LIKE "%version%";
innodb_version, 5.6.19
protocol_version, 10
slave_type_conversions,
version, 5.6.19
version_comment, MySQL Community Server (GPL)
version_compile_machine, x86
version_compile_os, Win32
> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: i386-w64-mingw32/i386 (32-bit)
Edit: Str() output on the data as imported from MySQl showing MQLTime is already in POSIXct format:
> str(raw.data)
'data.frame': 1472 obs. of 8 variables:
$ MQLTime: POSIXct, format: "2014-11-05 23:35:00" "2014-11-05 23:57:00" "2014-11-06 00:40:00" "2014-11-06 00:46:00" ...
$ bar5 : int 184 203 179 28 309 24 156 48 309 437 ...
$ bar4 : int 24 184 309 192 48 177 48 68 60 71 ...
$ bar3 : int 8 204 49 60 9 64 68 27 192 147 ...
$ bar2 : int 24 67 189 49 11 152 27 56 437 67 ...
$ bar1 : int 67 51 75 152 24 188 56 147 71 0 ...
$ pat1 : int 147 147 19 147 19 19 147 19 147 19 ...
$ baXRC : Factor w/ 3 levels "Down","Flat",..: 2 2 2 2 2 2 2 2 2 3 ...
So I have tried declaring stringsAsfactors = FALSE in the dataframe operations and this had no effect.
Interestingly, if the offending table is removed from processing through an additional conditional statement in the first 'if' block, the function stops on the table immediately preceeding the blocked table.
If both the original and the new offending tables are removed from processing, then the function stops on the table immediately prior to them. I have never seen this sort of behavior before and it really has me stumped.
I watched system resources during the function and they never seem to max out.
Could this be a problem with the 'for' loop and not necessarily date formats?
There appears to be some egg on my face. The table following the table where the function was stopping had a row with value '0000-00-00 00:00:00'. I added another statement in my MySQL function to remove these rows when pre-processing the tables. Thanks to those that had a look at this.

Read file, parse by tabs and insert into mysql database with Bash Script

I am doing a bash script to read a file line by line, parse and insert it in mysql database.
The script is the following:
#!/bin/bash
echo "Start!"
while IFS=' ' read -ra ADDR;
do
for line in $(cat filename)
do
regex='(\d\d)-(\d\d)-(\d\d)\s(\d\d:\d\d:\d\d)'
if [[$line=~$regex]]
then
$line='20$3-$2-$1 $4';
fi
echo "insert into table (time, total, caracas, anzoategui) values('$line', '$line', '$line', $
done | mysql -uuser -ppassword database;
done < filename
The file 'filename' with data is something like this:
15/08/13 09:34:38 17528 5240 399 89 460 159 1107 33240
15/08/13 09:42:57 17528 5240 399 89 460 159 1107 33240
15/08/13 10:20:03 17492 5217 394 89 459 159 1101 33245
15/08/13 11:20:02 17521 5210 402 90 462 158 1112 33249
15/08/13 12:20:04 17540 5209 396 90 459 160 1105 33258
And its dropping this:
Use the LOAD DATA statement. Do any transformations you have to on your file first, then
LOAD DATA LOCAL INFILE 'filename' INTO table (time, total, caracas, anzoategui)
Tab separated fields is the default.
I would recommend creating your sql file using awk and then sourcing it from mysql. Re-direct the output once it looks ok to you into a new file (say insert.sql) and source it from mysql command line. Something like this:
$ cat file
15/08/13 09:34:38 17528 5240 399 89 460 159 1107 33240
15/08/13 09:42:57 17528 5240 399 89 460 159 1107 33240
15/08/13 10:20:03 17492 5217 394 89 459 159 1101 33245
15/08/13 11:20:02 17521 5210 402 90 462 158 1112 33249
15/08/13 12:20:04 17540 5209 396 90 459 160 1105 33258
$ awk -F'[/ ]+' -v q="'" '{print "insert into table (time, total, caracas, anzoategui) values ("q"20"$3"-"$2"-"$1" "$4q","q$5q","q$6q","q$7q");"}' file
insert into table (time, total, caracas, anzoategui) values ('2013-08-15 09:34:38','17528','5240','399');
insert into table (time, total, caracas, anzoategui) values ('2013-08-15 09:42:57','17528','5240','399');
insert into table (time, total, caracas, anzoategui) values ('2013-08-15 10:20:03','17492','5217','394');
insert into table (time, total, caracas, anzoategui) values ('2013-08-15 11:20:02','17521','5210','402');
insert into table (time, total, caracas, anzoategui) values ('2013-08-15 12:20:04','17540','5209','396');
Bash uses POSIX character classes. So instead of \d you could use [0-9] or [[:digit:]] to match a digit.
Also, there must be spaces in between the brackets and operator in your regex test. So [[$line=~$regex]] would be fixed by doing [[ $line =~ $regex ]].
In order to access the regular expressions enclosed in parentheses, you must access the builtin array variable BASH_REMATCH. So, to access the first match, you'd reference ${BASH_REMATCH[1]}, the second ${BASH_REMATCH[2]}, and so on.

MySQL order by string with numbers

I have strings such as M1 M3 M4 M14 M30 M40 etc (really any int 2-3 digits after a letter)
When I do " ORDER BY name " this returns:
M1, M14, M3, M30, M4, M40
When I want:
M1, M3, M4, M14, M30, M40
Its treating the whole thing as a string but I want to treat it as string + int
Any ideas?
You could use SUBSTR and CAST AS UNSIGNED/SIGNED within ORDER BY:
SELECT * FROM table_name ORDER BY
SUBSTR(col_name FROM 1 FOR 1),
CAST(SUBSTR(col_name FROM 2) AS UNSIGNED)
If there can be multiple characters at the beginning of the string, for example like 'M10', 'MTR10', 'ABCD50', 'JL8', etc..., you basically have to get the substring of the name from the first position of a number.
Unfortunately MySQL does not support that kind of REGEXP operation (only a boolean value is returned, not the actual match).
You can use this solution to emulate it:
SELECT name
FROM tbl
ORDER BY CASE WHEN ASCII(SUBSTRING(name,1)) BETWEEN 48 AND 57 THEN
CAST(name AS UNSIGNED)
WHEN ASCII(SUBSTRING(name,2)) BETWEEN 48 AND 57 THEN
SUBSTRING(name,1,1)
WHEN ASCII(SUBSTRING(name,3)) BETWEEN 48 AND 57 THEN
SUBSTRING(name,1,2)
WHEN ASCII(SUBSTRING(name,4)) BETWEEN 48 AND 57 THEN
SUBSTRING(name,1,3)
WHEN ASCII(SUBSTRING(name,5)) BETWEEN 48 AND 57 THEN
SUBSTRING(name,1,4)
WHEN ASCII(SUBSTRING(name,6)) BETWEEN 48 AND 57 THEN
SUBSTRING(name,1,5)
WHEN ASCII(SUBSTRING(name,7)) BETWEEN 48 AND 57 THEN
SUBSTRING(name,1,6)
WHEN ASCII(SUBSTRING(name,8)) BETWEEN 48 AND 57 THEN
SUBSTRING(name,1,7)
END,
CASE WHEN ASCII(SUBSTRING(name,1)) BETWEEN 48 AND 57 THEN
CAST(SUBSTRING(name,1) AS UNSIGNED)
WHEN ASCII(SUBSTRING(name,2)) BETWEEN 48 AND 57 THEN
CAST(SUBSTRING(name,2) AS UNSIGNED)
WHEN ASCII(SUBSTRING(name,3)) BETWEEN 48 AND 57 THEN
CAST(SUBSTRING(name,3) AS UNSIGNED)
WHEN ASCII(SUBSTRING(name,4)) BETWEEN 48 AND 57 THEN
CAST(SUBSTRING(name,4) AS UNSIGNED)
WHEN ASCII(SUBSTRING(name,5)) BETWEEN 48 AND 57 THEN
CAST(SUBSTRING(name,5) AS UNSIGNED)
WHEN ASCII(SUBSTRING(name,6)) BETWEEN 48 AND 57 THEN
CAST(SUBSTRING(name,6) AS UNSIGNED)
WHEN ASCII(SUBSTRING(name,7)) BETWEEN 48 AND 57 THEN
CAST(SUBSTRING(name,7) AS UNSIGNED)
WHEN ASCII(SUBSTRING(name,8)) BETWEEN 48 AND 57 THEN
CAST(SUBSTRING(name,8) AS UNSIGNED)
END
This will order by the character part of the string first, then the extracted number part of the string as long as there are <=7 characters at the beginning of the string. If you need more, you can just chain additional WHENs to the CASE statement.
I couldn't get this working for my issue which was sorting MLS Numbers like below:
V12345
V1000000
V92832
The problem was V1000000 wasn't being valued higher than the rest even though it's bigger.
Using this solved my problem:
ORDER BY CAST(SUBSTR(col_name FROM 2) AS UNSIGNED) DESC
Just removed the SUBSTR(col_name FROM 1 FOR 1)
You can use:
order by name,SUBSTRING(name,1,LENGTH(name)-1)
It split number and letters as separately.
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(
SUBSTRING_INDEX(SUBSTRING_INDEX(SUBSTRING_INDEX(col,'1', 1), '2', 1), '3', 1), '4', 1), '5', 1), '6', 1)
, '7', 1), '8', 1), '9', 1), '0', 1) as new_col
FROM table group by new_col;
Try remove the character with SUBSTR. Then use ABS to get the absolute value from field:
SELECT * FROM table ORDER BY ABS(SUBSTR(field,1));
Another method i used in my project is:
SELECT * FROM table_name ORDER BY LENGTH(col_name) DESC, col_name DESC LIMIT 1