This question already has answers here:
Insert into MySQL from R
(3 answers)
Closed 3 years ago.
I am trying to load a dataframe from r to sql and am having trouble getting the NAs to load to their equivalent NULL in sql. They are coming up as blank cells instead. Example data:
data.frame(name = c('Sara', 'Matt', 'Kyle', 'Steve', 'Maggie', NA, 'Alex', 'Morgan'),
student_id = c(123,124,125,126,127,128,129,130),
score = c(78, 83, 91, NA, 88, 92, NA, 77))
Table schema: student_score with columns name (varchar), student_id (int), and score (int)
R code:
load = "Insert into schema.student_score (name, student_id, score) values"
data = list()
for (i in seq(nrow(df))) {
info = paste0("('", df$name[i], "','",
df$student_id[i], "','",
df$score[i], "')")
data[[i]] = info
}
rows = do.call(rbind, data)
values = paste(rows[,1], collapse = ',')
send = paste0(load, values)
dbSendQuery(conn, send)
and when they are loaded in sql it comes out
name student_id score
Sara 123 78
Matt 124 83
Kyle 125 91
Steve 126
Maggie 127 88
128 92
Alex 129
Morgan 130 77
I want the blank values to be replaced by NULL
In your code NA gets translated as "NA". You need to replace all 'NA' in send to NULL. Simply add below code at the end -
send <- gsub("'NA'", "NULL", send)
send
"Insert into schema.student_score (name, student_id, score) values('Sara','123','78'),('Matt','124','83'),('Kyle','125','91'),('Steve','126',NULL),('Maggie','127','88'),(NULL,'128','92'),('Alex','129',NULL),('Morgan','130','77')"
Related
I am going to execute SQL command on my multiplayer game, I don't wanna any game stop or something like that because I am going to lose players so everything must go smooth.
Its my command :
UPDATE `players`
SET `mana` = `mana` + 500 WHERE `vocation` != "121;122;123;124;125;126"
AND `id` IN (SELECT `player_id` FROM `player_storage` WHERE `key` = 25128 AND `value` = 15 GROUP BY `player_id`);
I want to update every vocation that is not 121 OR 122 OR 123 OR 124 OR 125 OR 126
You can use NOT IN to filter values that are not in a list.
WHERE vocation NOT IN (121, 122, 123, 124, 125, 126)
Use BETWEEN
....Where location
not between 120
and
127
If you have the list as a string of values with delimiter ; you can use FIND_IN_SET():
WHERE FIND_IN_SET(`vocation`, REPLACE('121;122;123;124;125;126', ';', ',')) = 0
Instead of having one column for each group of values, I made one column named "data" and used HTML like this:
<dt>Phone:</dt><dd>0 23 16/3 82 73 42 23</dd>
<dt>Phone:</dt><dd>0 21 61/81 26 73 13 22</dd>
<dt>Fax:</dt><dd>03 27/3 87 42 37 32</dd>
<dt>Website:</dt><dd>www.example.com</dd>
Now, I recognized, that wasn't very clever and I made a column for each value. My new columns names are "phone", "phone2", "fax" and "website".
I need an SQL code for e.g. selecting all between the delimiters <dt>Phone:</dt><dd> and </dd> and the delimiters itself, insert this string in the column "phone" and delete this string in the "data" column.
But I need to select the first string <dt>Phone:</dt><dd>0 23 16/3 82 73 42 23</dd> not the second <dt>Phone:</dt><dd>0 21 61/81 26 73 13 22</dd>.
Can anybody give me a hint how to do that?
For selecting data between <dt>Phone:</dt><dd> and </dd> you can use SUBSTRING_INDEX.
Like this
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(data, '</dd>', 1), '<dt>Phone:</dt><dd>', -1) as phone,
SUBSTRING_INDEX(SUBSTRING_INDEX(data, '<dt>Phone:</dt><dd>', -1), '</dd>', 1) as phone2,
SUBSTRING_INDEX(SUBSTRING_INDEX(data, '<dt>Fax:</dt><dd>', -1), '</dd>', 1) as fax,
SUBSTRING_INDEX(SUBSTRING_INDEX(data, '<dt>Website:</dt><dd>', -1), '</dd>', 1) as website
from data_col;
Update:
<dt>Phone:</dt> is not always in top.
in case if there is no specified order of data in "data" column, try this one:
SELECT
IF (temp.f_phone > 0, SUBSTR(data, temp.f_phone + LENGTH('<dt>Phone:</dt><dd>'), f_phone_end - temp.f_phone - LENGTH('<dt>Phone:</dt><dd>')), null) as PHONE_1,
IF (temp.s_phone > 0, SUBSTR(data, temp.s_phone + LENGTH('<dt>Phone:</dt><dd>'), s_phone_end - temp.s_phone - LENGTH('<dt>Phone:</dt><dd>')), null) as PHONE_2
from data_col dc
JOIN (
SELECT id, #f_phone:= LOCATE('<dt>Phone:</dt><dd>', data) as f_phone,
LOCATE('</dd>', data, #f_phone+1) f_phone_end,
#s_phone := LOCATE('<dt>Phone:</dt><dd>', data, #f_phone+1) as s_phone,
LOCATE('</dd>', data, #s_phone+1) as s_phone_end
from data_col) temp ON temp.id = dc.id;
First find the starting position of each possible element (e.g. "phone" "phone2") and a position of closing tag <\dd>. And than use SUBSTR from starting position of element + length of delimiter, with length = end_position - start_position - delimiter_length
OK, to set the scene, I have written a function to import multiple tables from MySQL (using RODBC) and run randomForest() on them.
This function is run on multiple databases (as separate instances).
In one particular database, and one particular table, the "error in as.POSIXlt.character(x, tz,.....): character string not in a standard unambiguous format" error is thrown. The function runs on around 150 tables across two databases without any issues except this one table.
Here is a head() print from the table:
MQLTime bar5 bar4 bar3 bar2 bar1 pat1 baXRC
1 2014-11-05 23:35:00 184 24 8 24 67 147 Flat
2 2014-11-05 23:57:00 203 184 204 67 51 147 Flat
3 2014-11-06 00:40:00 179 309 49 189 75 19 Flat
4 2014-11-06 00:46:00 28 192 60 49 152 147 Flat
5 2014-11-06 01:20:00 309 48 9 11 24 19 Flat
6 2014-11-06 01:31:00 24 177 64 152 188 19 Flat
And here is the function:
GenerateRF <- function(db, countstable, RFcutoff) {
'load required libraries'
library(RODBC)
library(randomForest)
library(caret)
library(ff)
library(stringi)
'connection and data preparation'
connection <- odbcConnect ('TTODBC', uid='root', pwd='password', case="nochange")
'import count table and check if RF is allowed to be built'
query.str <- paste0 ('select * from ', db, '.', countstable, ' order by RowCount asc')
row.counts <- sqlQuery (connection, query.str)
'Operate only on tables that have >= RFcutoff'
for (i in 1:nrow (row.counts)) {
table.name <- as.character (row.counts[i,1])
col.count <- as.numeric (row.counts[i,2])
row.count <- as.numeric (row.counts[i,3])
if (row.count >= 20) {
'Delete old RFs and DFs for input pattern'
if (file.exists (paste0 (table.name, '_RF.Rdata'))) {
file.remove (paste0 (table.name, '_RF.Rdata'))
}
if (file.exists (paste0 (table.name, '_DF.Rdata'))) {
file.remove (paste0 (table.name, '_DF.Rdata'))
}
'import and clean data'
query.str2 <- paste0 ('select * from ', db, '.', table.name, ' order by mqltime asc')
raw.data <- sqlQuery(connection, query.str2)
'partition data into training/test sets'
set.seed(489)
index <- createDataPartition(raw.data$baXRC, p=0.66, list=FALSE, times=1)
data.train <- raw.data [index,]
data.test <- raw.data [-index,]
'find optimal trees to grow (without outcome and dates)
data.mtry <- as.data.frame (tuneRF (data.train [, c(-1,-col.count)], data.train$baXRC, ntreetry=100,
stepFactor=.5, improve=0.01, trace=TRUE, plot=TRUE, dobest=FALSE))
best.mtry <- data.mtry [which (data.mtry[,2] == min (data.mtry[,2])), 1]
'compress df'
data.ff <- as.ffdf (data.train)
'run RF. Originally set to 1000 trees but M1 dataset is to large for laptop. Maybe train at the lab?'
data.rf <- randomForest (baXRC~., data=data.ff[,-1], mtry=best.mtry, ntree=500, keep.forest=TRUE,
importance=TRUE, proximity=FALSE)
'generate and print variable importance plot'
varImpPlot (data.rf, main = table.name)
'predict on test data'
data.test.pred <- as.data.frame( predict (data.rf, data.test, type="prob"))
'get dates and name date column'
data.test.dates <- data.frame (data.test[,1])
colnames (data.test.dates) <- 'MQLTime'
'attach dates to prediction df'
data.test.res <- cbind (data.test.dates, data.test.pred)
'force date coercion to attempt negating unambiguous format error '
data.test.res$MQLTime <- format(data.test.res$MQLTime, format = "%Y-%m-%d %H:%M:%S")
'delete row names, coerce to dataframe, generate row table name and export outcomes to MySQL'
rownames (data.test.res)<-NULL
data.test.res <- as.data.frame (data.test.res)
root.table <- stri_sub(table.name, 0, -5)
sqlUpdate (connection, data.test.res, tablename = paste0(db, '.', root.table, '_outcome'), index = "MQLTime")
'save RF and test df/s for future use; save latest version of row_counts to MQL4 folder'
save (data.rf, file = paste0 ("C:/Users/user/Documents/RF_test2/", table.name, '_RF.Rdata'))
save (data.test, file = paste0 ("C:/Users/user/Documents/RF_test2/", table.name, '_DF.Rdata'))
write.table (row.counts, paste0("C:/Users/user/AppData/Roaming/MetaQuotes/Terminal/71FA4710ABEFC21F77A62A104A956F23/MQL4/Files/", db, "_m1_rowcounts.csv"), sep = ",", col.names = F,
row.names = F, quote = F)
'end of conditional block'
}
'end of for loop'
}
'close all connection to MySQL'
odbcCloseAll()
'clear workspace'
rm(list=ls())
'end of function'
}
At this line:
data.test.res$MQLTime <- format(data.test.res$MQLTime, format = "%Y-%m-%d %H:%M:%S")
I have tried coercing MQLTime using various functions including: as.character(), as.POSIXct(), as.POSIXlt(), as.Date(), format(), as.character(as.Date())
and have also tried:
"%y" vs "%Y" and "%OS" vs "%S"
All variants seem to have no effect on the error and the function is still able to run on all other tables. I have checked the table manually (which contains almost 1500 rows) and also in MySQL looking for NULL dates or dates like "0000-00-00 00:00:00".
Also, if I run the function line by line in R terminal, this offending table is processed without any problems which just confuses the hell out me.
I've exhausted all the functions/solutions I can think of (and also all those I could find through Dr. Google) so I am pleading for help here.
I should probably mention that the MQLTime column is stored as varchar() in MySQL. This was done to try and get around issues with type conversions between R and MySQL
SHOW VARIABLES LIKE "%version%";
innodb_version, 5.6.19
protocol_version, 10
slave_type_conversions,
version, 5.6.19
version_comment, MySQL Community Server (GPL)
version_compile_machine, x86
version_compile_os, Win32
> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: i386-w64-mingw32/i386 (32-bit)
Edit: Str() output on the data as imported from MySQl showing MQLTime is already in POSIXct format:
> str(raw.data)
'data.frame': 1472 obs. of 8 variables:
$ MQLTime: POSIXct, format: "2014-11-05 23:35:00" "2014-11-05 23:57:00" "2014-11-06 00:40:00" "2014-11-06 00:46:00" ...
$ bar5 : int 184 203 179 28 309 24 156 48 309 437 ...
$ bar4 : int 24 184 309 192 48 177 48 68 60 71 ...
$ bar3 : int 8 204 49 60 9 64 68 27 192 147 ...
$ bar2 : int 24 67 189 49 11 152 27 56 437 67 ...
$ bar1 : int 67 51 75 152 24 188 56 147 71 0 ...
$ pat1 : int 147 147 19 147 19 19 147 19 147 19 ...
$ baXRC : Factor w/ 3 levels "Down","Flat",..: 2 2 2 2 2 2 2 2 2 3 ...
So I have tried declaring stringsAsfactors = FALSE in the dataframe operations and this had no effect.
Interestingly, if the offending table is removed from processing through an additional conditional statement in the first 'if' block, the function stops on the table immediately preceeding the blocked table.
If both the original and the new offending tables are removed from processing, then the function stops on the table immediately prior to them. I have never seen this sort of behavior before and it really has me stumped.
I watched system resources during the function and they never seem to max out.
Could this be a problem with the 'for' loop and not necessarily date formats?
There appears to be some egg on my face. The table following the table where the function was stopping had a row with value '0000-00-00 00:00:00'. I added another statement in my MySQL function to remove these rows when pre-processing the tables. Thanks to those that had a look at this.
I want to sort the following data items in the order they are presented below (numbers 1-12):
1
2
3
4
5
6
7
8
9
10
11
12
However, my query - using order by xxxxx asc sorts by the first digit above all else:
1
10
11
12
2
3
4
5
6
7
8
9
Any tricks to make it sort more properly?
Further, in the interest of full disclosure, this could be a mix of letters and numbers (although right now it is not), e.g.:
A1
534G
G46A
100B
100A
100JE
etc....
Thanks!
update: people asking for query
select * from table order by name asc
People use different tricks to do this. I Googled and find out some results each follow different tricks. Have a look at them:
Alpha Numeric Sorting in MySQL
Natural Sorting in MySQL
Sorting of numeric values mixed with alphanumeric values
mySQL natural sort
Natural Sort in MySQL
Edit:
I have just added the code of each link for future visitors.
Alpha Numeric Sorting in MySQL
Given input
1A 1a 10A 9B 21C 1C 1D
Expected output
1A 1C 1D 1a 9B 10A 21C
Query
Bin Way
===================================
SELECT
tbl_column,
BIN(tbl_column) AS binray_not_needed_column
FROM db_table
ORDER BY binray_not_needed_column ASC , tbl_column ASC
-----------------------
Cast Way
===================================
SELECT
tbl_column,
CAST(tbl_column as SIGNED) AS casted_column
FROM db_table
ORDER BY casted_column ASC , tbl_column ASC
Natural Sorting in MySQL
Given input
Table: sorting_test
-------------------------- -------------
| alphanumeric VARCHAR(75) | integer INT |
-------------------------- -------------
| test1 | 1 |
| test12 | 2 |
| test13 | 3 |
| test2 | 4 |
| test3 | 5 |
-------------------------- -------------
Expected Output
-------------------------- -------------
| alphanumeric VARCHAR(75) | integer INT |
-------------------------- -------------
| test1 | 1 |
| test2 | 4 |
| test3 | 5 |
| test12 | 2 |
| test13 | 3 |
-------------------------- -------------
Query
SELECT alphanumeric, integer
FROM sorting_test
ORDER BY LENGTH(alphanumeric), alphanumeric
Sorting of numeric values mixed with alphanumeric values
Given input
2a, 12, 5b, 5a, 10, 11, 1, 4b
Expected Output
1, 2a, 4b, 5a, 5b, 10, 11, 12
Query
SELECT version
FROM version_sorting
ORDER BY CAST(version AS UNSIGNED), version;
Just do this:
SELECT * FROM table ORDER BY column `name`+0 ASC
Appending the +0 will mean that:
0,
10,
11,
2,
3,
4
becomes :
0,
2,
3,
4,
10,
11
I hate this, but this will work
order by lpad(name, 10, 0) <-- assuming maximum string length is 10
<-- you can adjust to a bigger length if you want to
I know this post is closed but I think my way could help some people. So there it is :
My dataset is very similar but is a bit more complex. It has numbers, alphanumeric data :
1
2
Chair
3
0
4
5
-
Table
10
13
19
Windows
99
102
Dog
I would like to have the '-' symbol at first, then the numbers, then the text.
So I go like this :
SELECT name, (name = '-') boolDash, (name = '0') boolZero, (name+0 > 0) boolNum
FROM table
ORDER BY boolDash DESC, boolZero DESC, boolNum DESC, (name+0), name
The result should be something :
-
0
1
2
3
4
5
10
13
99
102
Chair
Dog
Table
Windows
The whole idea is doing some simple check into the SELECT and sorting with the result.
This works for type of data:
Data1,
Data2, Data3 ......,Data21. Means "Data" String is common in all rows.
For ORDER BY ASC it will sort perfectly, For ORDER BY DESC not suitable.
SELECT * FROM table_name ORDER BY LENGTH(column_name), column_name ASC;
I had some good results with
SELECT alphanumeric, integer FROM sorting_test ORDER BY CAST(alphanumeric AS UNSIGNED), alphanumeric ASC
This type of question has been asked previously.
The type of sorting you are talking about is called "Natural Sorting".
The data on which you want to do sort is alphanumeric.
It would be better to create a new column for sorting.
For further help check
natural-sort-in-mysql
If you need to sort an alpha-numeric column that does not have any standard format whatsoever
SELECT * FROM table ORDER BY (name = '0') DESC, (name+0 > 0) DESC, name+0 ASC, name ASC
You can adapt this solution to include support for non-alphanumeric characters if desired using additional logic.
This should sort alphanumeric field like:
1/ Number only, order by 1,2,3,4,5,6,7,8,9,10,11 etc...
2/ Then field with text like: 1foo, 2bar, aaa11aa, aaa22aa, b5452 etc...
SELECT MyField
FROM MyTable
order by
IF( MyField REGEXP '^-?[0-9]+$' = 0,
9999999999 ,
CAST(MyField AS DECIMAL)
), MyField
The query check if the data is a number, if not put it to 9999999999 , then order first on this column, then order on data with text
Good luck!
Instead of trying to write some function and slow down the SELECT query, I thought of another way of doing this...
Create an extra field in your database that holds the result from the following Class and when you insert a new row, run the field value that will be naturally sorted through this class and save its result in the extra field. Then instead of sorting by your original field, sort by the extra field.
String nsFieldVal = new NaturalSortString(getFieldValue(), 4).toString()
The above means:
- Create a NaturalSortString for the String returned from getFieldValue()
- Allow up to 4 bytes to store each character or number (4 bytes = ffff = 65535)
| field(32) | nsfield(161) |
a1 300610001
String sortString = new NaturalSortString(getString(), 4).toString()
import StringUtils;
/**
* Creates a string that allows natural sorting in a SQL database
* eg, 0 1 1a 2 3 3a 10 100 a a1 a1a1 b
*/
public class NaturalSortString {
private String inStr;
private int byteSize;
private StringBuilder out = new StringBuilder();
/**
* A byte stores the hex value (0 to f) of a letter or number.
* Since a letter is two bytes, the minimum byteSize is 2.
*
* 2 bytes = 00 - ff (max number is 255)
* 3 bytes = 000 - fff (max number is 4095)
* 4 bytes = 0000 - ffff (max number is 65535)
*
* For example:
* dog123 = 64,6F,67,7B and thus byteSize >= 2.
* dog280 = 64,6F,67,118 and thus byteSize >= 3.
*
* For example:
* The String, "There are 1000000 spots on a dalmatian" would require a byteSize that can
* store the number '1000000' which in hex is 'f4240' and thus the byteSize must be at least 5
*
* The dbColumn size to store the NaturalSortString is calculated as:
* > originalStringColumnSize x byteSize + 1
* The extra '1' is a marker for String type - Letter, Number, Symbol
* Thus, if the originalStringColumn is varchar(32) and the byteSize is 5:
* > NaturalSortStringColumnSize = 32 x 5 + 1 = varchar(161)
*
* The byteSize must be the same for all NaturalSortStrings created in the same table.
* If you need to change the byteSize (for instance, to accommodate larger numbers), you will
* need to recalculate the NaturalSortString for each existing row using the new byteSize.
*
* #param str String to create a natural sort string from
* #param byteSize Per character storage byte size (minimum 2)
* #throws Exception See the error description thrown
*/
public NaturalSortString(String str, int byteSize) throws Exception {
if (str == null || str.isEmpty()) return;
this.inStr = str;
this.byteSize = Math.max(2, byteSize); // minimum of 2 bytes to hold a character
setStringType();
iterateString();
}
private void setStringType() {
char firstchar = inStr.toLowerCase().subSequence(0, 1).charAt(0);
if (Character.isLetter(firstchar)) // letters third
out.append(3);
else if (Character.isDigit(firstchar)) // numbers second
out.append(2);
else // non-alphanumeric first
out.append(1);
}
private void iterateString() throws Exception {
StringBuilder n = new StringBuilder();
for (char c : inStr.toLowerCase().toCharArray()) { // lowercase for CASE INSENSITIVE sorting
if (Character.isDigit(c)) {
// group numbers
n.append(c);
continue;
}
if (n.length() > 0) {
addInteger(n.toString());
n = new StringBuilder();
}
addCharacter(c);
}
if (n.length() > 0) {
addInteger(n.toString());
}
}
private void addInteger(String s) throws Exception {
int i = Integer.parseInt(s);
if (i >= (Math.pow(16, byteSize)))
throw new Exception("naturalsort_bytesize_exceeded");
out.append(StringUtils.padLeft(Integer.toHexString(i), byteSize));
}
private void addCharacter(char c) {
//TODO: Add rest of accented characters
if (c >= 224 && c <= 229) // set accented a to a
c = 'a';
else if (c >= 232 && c <= 235) // set accented e to e
c = 'e';
else if (c >= 236 && c <= 239) // set accented i to i
c = 'i';
else if (c >= 242 && c <= 246) // set accented o to o
c = 'o';
else if (c >= 249 && c <= 252) // set accented u to u
c = 'u';
else if (c >= 253 && c <= 255) // set accented y to y
c = 'y';
out.append(StringUtils.padLeft(Integer.toHexString(c), byteSize));
}
#Override
public String toString() {
return out.toString();
}
}
For completeness, below is the StringUtils.padLeft method:
public static String padLeft(String s, int n) {
if (n - s.length() == 0) return s;
return String.format("%0" + (n - s.length()) + "d%s", 0, s);
}
The result should come out like the following
-1
-a
0
1
1.0
1.01
1.1.1
1a
1b
9
10
10a
10ab
11
12
12abcd
100
a
a1a1
a1a2
a-1
a-2
áviacion
b
c1
c2
c12
c100
d
d1.1.1
e
MySQL ORDER BY Sorting alphanumeric on correct order
example:
SELECT `alphanumericCol` FROM `tableName` ORDER BY
SUBSTR(`alphanumericCol` FROM 1 FOR 1),
LPAD(lower(`alphanumericCol`), 10,0) ASC
output:
1
2
11
21
100
101
102
104
S-104A
S-105
S-107
S-111
This is from tutorials point
SELECT * FROM yourTableName ORDER BY
SUBSTR(yourColumnName FROM 1 FOR 2),
CAST(SUBSTR(yourColumnName FROM 2) AS UNSIGNED);
it is slightly different from another answer of this thread
For reference, this is the original link
https://www.tutorialspoint.com/mysql-order-by-string-with-numbers
Another point regarding UNSIGNED is written here
https://electrictoolbox.com/mysql-order-string-as-int/
While this has REGEX too
https://www.sitepoint.com/community/t/how-to-sort-text-with-numbers-with-sql/346088/9
SELECT length(actual_project_name),actual_project_name,
SUBSTRING_INDEX(actual_project_name,'-',1) as aaaaaa,
SUBSTRING_INDEX(actual_project_name, '-', -1) as actual_project_number,
concat(SUBSTRING_INDEX(actual_project_name,'-',1),SUBSTRING_INDEX(actual_project_name, '-', -1)) as a
FROM ctts.test22
order by
SUBSTRING_INDEX(actual_project_name,'-',1) asc,cast(SUBSTRING_INDEX(actual_project_name, '-', -1) as unsigned) asc
This is a simple example.
SELECT HEX(some_col) h
FROM some_table
ORDER BY h
order by len(xxxxx),xxxxx
Eg:
SELECT * from customer order by len(xxxxx),xxxxx
Try this For ORDER BY DESC
SELECT * FROM testdata ORDER BY LENGHT(name) DESC, name DESC
SELECT
s.id, s.name, LENGTH(s.name) len, ASCII(s.name) ASCCCI
FROM table_name s
ORDER BY ASCCCI,len,NAME ASC;
Assuming varchar field containing number, decimal, alphanumeric and string, for example :
Let's suppose Column Name is "RandomValues" and Table name is "SortingTest"
A1
120
2.23
3
0
2
Apple
Zebra
Banana
23
86.Akjf9
Abtuo332
66.9
22
ABC
SELECT * FROM SortingTest order by IF( RandomValues REGEXP '^-?[0-9,.]+$' = 0,
9999999999 ,
CAST(RandomValues AS DECIMAL)
), RandomValues
Above query will do sorting on number & decimal values first and after that all alphanumeric values got sorted.
This will always put the values starting with a number first:
ORDER BY my_column REGEXP '^[0-9]' DESC, length(my_column + 0), my_column ";
Works as follows:
Step1 - Is first char a digit? 1 if true, 0 if false, so order by this DESC
Step2 - How many digits is the number? Order by this ASC
Step3 - Order by the field itself
Input:
('100'),
('1'),
('10'),
('0'),
('2'),
('2a'),
('12sdfa'),
('12 sdfa'),
('Bar nah');
Output:
0
1
2
2a
10
12 sdfa
12sdfa
100
Bar nah
Really problematic for my scenario...
select * from table order by lpad(column, 20, 0)
My column is a varchar, but has numeric input (1, 2, 3...) , mixed numeric (1A, 1B, 1C) and too string data (INT, SHIP)
I have a text file that looks like this:
gene1 gene2 gene3
a d c
b e d
c f g
d g
h
i
(Each column is a human gene, and each contains a variable number of proteins (strings, shown as letters here) that can bind to those genes).
What I want to do is count how many columns each string is represented in, output that number and all the column headers, like this:
a 1 gene1
b 1 gene1
c 2 gene1 gene3
d 3 gene1 gene2 gene3
e 1 gene2
f 1 gene2
g 2 gene2 gene3
h 1 gene2
i 1 gene2
I have been trying to figure out how to do this in Perl and R, but without success so far. Thanks for any help.
This solution seems like a bit of a hack, but it gives the desired output. It relies on using both plyr and reshape packages, though I'm sure you could find base R alternatives. The trick is that function melt lets us flatten the data out into a long format, which allows for easy(ish) manipulation from that point forward.
library(reshape)
library(plyr)
#Recreate your data
dat <- data.frame(gene1 = c(letters[1:4], NA, NA),
gene2 = letters[4:9],
gene3 = c("c", "d", "g", NA, NA, NA)
)
#Melt the data. You'll need to update this if you have more columns
dat.m <- melt(dat, measure.vars = 1:3)
#Tabulate counts
counts <- as.data.frame(table(dat.m$value))
#I'm not sure what to call this column since it's a smooshing of column names
otherColumn <- ddply(dat.m, "value", function(x) paste(x$variable, collapse = " "))
#Merge the two together. You could fix the column names above, or just deal with it here
merge(counts, otherColumn, by.x = "Var1", by.y = "value")
Gives:
> merge(counts, otherColumn, by.x = "Var1", by.y = "value")
Var1 Freq V1
1 a 1 gene1
2 b 1 gene1
3 c 2 gene1 gene3
4 d 3 gene1 gene2 gene3
....
In perl, assuming the proteins in each column don't have duplicates that need to be removed. (If they do, a hash of hashes should be used instead.)
use strict;
use warnings;
my $header = <>;
my %column_genes;
while ($header =~ /(\S+)/g) {
$column_genes{$-[1]} = "$1";
}
my %proteins;
while (my $line = <>) {
while ($line =~ /(\S+)/g) {
if (exists $column_genes{$-[1]}) {
push #{ $proteins{$1} }, $column_genes{$-[1]};
}
else {
warn "line $. column $-[1] unexpected protein $1 ignored\n";
}
}
}
for my $protein (sort keys %proteins) {
print join("\t",
$protein,
scalar #{ $proteins{$protein} },
join(' ', sort #{ $proteins{$protein} } )
), "\n";
}
Reads from stdin, writes to stdout.
A one liner (or rather 3 liner)
ddply(na.omit(melt(dat, m = 1:3)), .(value), summarize,
len = length(variable),
var = paste(variable, collapse = " "))
If it's not a lot of columns, you can do something like this in sql. You basically flatten out the data into a 2 column derived table of protein/gene and then summarize it as needed.
;with cte as (
select gene1 as protein, 'gene1' as gene
union select gene2 as protein, 'gene2' as gene
union select gene3 as protein, 'gene3' as gene
)
select protein, count(*) as cnt, group_concat(gene) as gene
from cte
group by protein
In mysql, like so:
select protein, count(*), group_concat(gene order by gene separator ' ') from gene_protein group by protein;
assuming data like:
create table gene_protein (gene varchar(255) not null, protein varchar(255) not null);
insert into gene_protein values ('gene1','a'),('gene1','b'),('gene1','c'),('gene1','d');
insert into gene_protein values ('gene2','d'),('gene2','e'),('gene2','f'),('gene2','g'),('gene2','h'),('gene2','i');
insert into gene_protein values ('gene3','c'),('gene3','d'),('gene3','g');