R fetch data from MySQL db in a loop

I successfully fetch data from my MySQL db using R:
library(RMySQL)
mydb = dbConnect(MySQL(), user='user', password='pass', dbname='fib', host='myhost')
rs = dbSendQuery(mydb, 'SELECT distinct(DATE(date)) as date, open,close FROM stocksng WHERE symbol = "FIB7F";')
data <- fetch(rs, n=-1)
dbHasCompleted(rs)
So now I have a list object:
> print(typeof(data))
[1] "list"
Each element is a tuple(?) like date (chars), open (long), close (long).
OK, now my problem: I want to get a vector of the percentage difference between each close (x) and the next day's open (x+1), through to the end, BUT I can't access the items properly!
Example: (open / close * 100) - 100
I tried:
for (item in data){
  print(item[2])
}
and all possible combinations like:
for (item in data){
  print(item[][2])
}
but I cannot access the right element. Can anyone help?

You have a bigger problem than this in your MySQL query, because you did not specify an ORDER BY clause. Consider using the following query:
SELECT DISTINCT
DATE(date) AS date,
open,
close
FROM stocksng
WHERE
symbol = "FIB7F"
ORDER BY
date
Here we order the result set by date, so that it makes sense to speak of the current and next open or close. Now, with a proper query in place, if you want the percentage difference between the current close and the next day's open you could try:
require(dplyr)
(lead(open, 1) / close*100) - 100
Or using base R:
(open[2:(length(open)+1)] / close*100) - 100
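For concreteness, here is a minimal sketch applying both snippets to the fetched data frame (assuming data holds the ordered result set with columns open and close, as above):
library(dplyr)
# dplyr: lead() shifts open up by one, so row x pairs close(x) with open(x+1)
pct <- (lead(data$open, 1) / data$close * 100) - 100
# base R: indexing one past the end yields an NA pad for the final day
pct_base <- (data$open[2:(nrow(data) + 1)] / data$close * 100) - 100
The last element is NA in both cases, since the final close has no next-day open.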

naive version:
for (row in 1:(nrow(data) - 1)){ # stop one row short: the last day has no next-day open
  date <- unname(data[row, "date"])
  open <- unname(data[row + 1, "open"]) # next day's open
  close <- unname(data[row, "close"])
  var <- abs((close / open * 100) - 100)
  print(var)
}

Related

Single Quote R for SQL Query

Hello, I have the following data that I want to paste into an SQL query through an R connection.
UKWinnersID<-c("1W167X6", "QM6VY8", "ZDNZX0", "8J49D8", "RGNSW9",
"BH7D3P1", "W31S84", "NTHDJ4", "H3UA1", "AH9N7",
"DF52B68", "K65C2", "VGT2Q0", "93LR6", "SJAJ0",
"WQBH47", "CP8PW9", "5H2TD5", "TFLKV4", "X42J1" )
The query / code in R is as follows:
UKSQL6<-data.frame(sqlQuery(myConn, paste("SELECT TOP 10000 [AxiomaDate]
,[RiskModelID] ,[AxiomaID],[Factor1],[Factor2],[Factor3],[Factor4],[Factor5]
,[Factor6],[Factor7],[Factor8],[Factor9],[Factor10],[Factor11],[Factor12]
,[Factor13],[Factor14],[Factor15] FROM [PortfolioAnalytics].[Data_Axioma].[SecurityExposures]
Where AxiomaDate IN (
SELECT MAX(AxiomaDate)
FROM [PortfolioAnalytics].[Data_Axioma].[FactorReturns]
GROUP BY MONTH(AxiomaDate), YEAR(AxiomaDate))
AND RiskModelID = 8
AND AxiomaID IN(",paste(UKWinnersID, collapse = ","),")")))
I am pasting the UKWinnersID in the last line of the code above, but the format of the UKWinnersID needs to be ('1W167X6', 'QM6VY8', 'ZDNZX0', etc.) with single quotes, which I just can't get to work.
Consider running a parameterized query using the RODBCext package (an extension of RODBC), assuming this is the API being used. Parameterized queries do more than insulate you from SQL injection: they separate data from code and avoid the messy quote enclosure, string interpolation, and concatenation, for cleaner, more maintainable code.
Below replaces your TOP 10000 with TOP 500 for each of the 20 ids:
library(RODBC)
library(RODBCext)
conn <- odbcConnect("DBName", uid="user", pwd="password")
ids_df <- data.frame(UKWinnersID = c("1W167X6", "QM6VY8", "ZDNZX0", "8J49D8", "RGNSW9",
"BH7D3P1", "W31S84", "NTHDJ4", "H3UA1", "AH9N7",
"DF52B68", "K65C2", "VGT2Q0", "93LR6", "SJAJ0",
"WQBH47", "CP8PW9", "5H2TD5", "TFLKV4", "X42J1"))
# SQL STATEMENT (NO DATA)
query <- "SELECT TOP 500 [AxiomaDate], [RiskModelID], [AxiomaID], [Factor1],[Factor2]
, [Factor3], [Factor4], [Factor5], [Factor6], [Factor7], [Factor8]
, [Factor9], [Factor10], [Factor11], [Factor12]
, [Factor13], [Factor14], [Factor15]
FROM [PortfolioAnalytics].[Data_Axioma].[SecurityExposures]
WHERE AxiomaDate IN (
SELECT MAX(AxiomaDate)
FROM [PortfolioAnalytics].[Data_Axioma].[FactorReturns]
GROUP BY MONTH(AxiomaDate), YEAR(AxiomaDate)
)
AND RiskModelID = 8
AND AxiomaID = ?"
# PASS DATAFRAME VALUES TO BIND TO QUERY PARAMETERS
UKSQL6 <- sqlExecute(conn, query, ids_df, fetch=TRUE)
odbcClose(conn)
Alternatively, if you really need to use the IN() clause:
# SQL STATEMENT (NO DATA)
query <- paste("SELECT TOP 10000
...same as above...
AND AxiomaID IN (", paste(rep("?", nrow(ids_df)), collapse=", "), ")")
# TRANSPOSE DATA FRAME FOR COLUMN EQUAL TO ? PLACEHOLDERS
UKSQL6 <- sqlExecute(conn, query, t(ids_df), fetch=TRUE)
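If you nonetheless stick with string concatenation, the quoting the question literally asks about is a one-liner (a sketch, and not the approach recommended above):
# wraps each id in single quotes: ('1W167X6', 'QM6VY8', ..., 'X42J1')
id_list <- paste0("('", paste(UKWinnersID, collapse = "', '"), "')")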

How to stop large float numbers from being output in exponential form (scientific notation) in R?

I'm working with ID numbers such as: "868130240945684480", which are stored into a MySQL database.
Whenever I make a query with RMySQL, the output is as follows: "8.681302e+17", which R then reads as "868130200000000000" - a totally different ID.
Is there any way to avoid this?
Here is the query I make to retrieve my data:
require(RMySQL)
con <- dbConnect(MySQL(), user='xxx', password='xxx', dbname='xxx', host='xxx')
rs <- dbSendQuery(con, "select sprinklr_twitter.Tweet_Id from sprinklr_twitter
inner join twitter
on sprinklr_twitter.Tweet_id=twitter.Tweet_id")
Tweet_id<-fetch(rs, n=-1)
Tweet_id<-as.data.frame(Tweet_id)
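One workaround worth noting (a suggestion, not from the thread): since 868130240945684480 exceeds the 2^53 integer precision of a double, have MySQL return the ID as a string so R never parses it as a number:
require(RMySQL)
con <- dbConnect(MySQL(), user='xxx', password='xxx', dbname='xxx', host='xxx')
# CAST(... AS CHAR) makes the driver hand the ID over as character
rs <- dbSendQuery(con, "select CAST(sprinklr_twitter.Tweet_Id AS CHAR) as Tweet_Id
from sprinklr_twitter
inner join twitter
on sprinklr_twitter.Tweet_id=twitter.Tweet_id")
Tweet_id <- fetch(rs, n=-1) # Tweet_Id now arrives with all digits intact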

transforming / updating / typecasting list values from int & num to strings

I'm retrieving data from a MySQL database using an R script and then writing it to a CSV, but I'm having an issue where two of the columns of data that I want to write out as strings are being written out as integers and numbers (in this case, in scientific notation).
I would like to have these written out as string values instead, but I'm not finding this is a straightforward task, in spite of doing a fair bit of googling and experimentation.
The relevant code:
conn <- dbConnect(MySQL(), host = "127.0.0.1", user="REDACTED", password="REDACTED", dbname="REDACTED", port=8906)
type_data <- dbGetQuery(conn, paste("SELECT * FROM ", arg, " WHERE 1 LIMIT 10", sep=""))
# Problem: "Subscribed" and "TimeUpdated" are coming through as numbers instead of strings
write.csv(type_data, paste("./",arg,".csv", sep=""), row.names=F)
dbDisconnect(conn)
Desired results:
"Id","EntityId","EntityType","CommunicationType","Subscribed","TimeUpdated"
"0002INKRyUolIrjG5DbUa0lDqUjxt","4374484","PERSON","MFS","1","1385297883000000000"
"0004WaXpmvbOh3WG3hd6kQtPINibv","8361929","PERSON","MFS","1","1437798832740631885"
"0005l1fy1TJiFhyiEK2IXRCxfqee5","4197014","PERSON","SURVEYS_AND_POLLS","0","1146917239000000000"
"0008Qb2ra1PoSLgbumc2wmDfvexx8","4155704","PERSON","MFS","1","1345053223000000000"
"000C1IKgHrFaqmlHlKGGhigGyoaw4","4515071","PERSON","PARTNER","1","1215098959000000000"
"000Czw8Gv5w3eNoOmOFVTKLIuc2ti","4372360","PERSON","MFS","1","1384952236000000000"
"000DOsk9xlYKvs11PzZFRgmOpYfiA","4347384","PERSON","SURVEYS_AND_POLLS","1","1177513307000000000"
"000IQ4TKYHAbb334zFYdWVCZZfMYo","4470083","PERSON","PARTNER","1","1446945757133940400"
"000LbifV4rUa2MhxFlVZ52PSek5kG","499194","PERSON","SURVEYS_AND_POLLS","0","1097867573000000000"
Actual results:
"Id","EntityId","EntityType","CommunicationType","Subscribed","TimeUpdated"
"0002INKRyUolIrjG5DbUa0lDqUjxt","4374484","PERSON","MFS",1,1.385297883e+18
"0004WaXpmvbOh3WG3hd6kQtPINibv","8361929","PERSON","MFS",1,1437798832740631808
"0005l1fy1TJiFhyiEK2IXRCxfqee5","4197014","PERSON","SURVEYS_AND_POLLS",0,1.146917239e+18
"0008Qb2ra1PoSLgbumc2wmDfvexx8","4155704","PERSON","MFS",1,1.345053223e+18
"000C1IKgHrFaqmlHlKGGhigGyoaw4","4515071","PERSON","PARTNER",1,1.215098959e+18
"000Czw8Gv5w3eNoOmOFVTKLIuc2ti","4372360","PERSON","MFS",1,1.384952236e+18
"000DOsk9xlYKvs11PzZFRgmOpYfiA","4347384","PERSON","SURVEYS_AND_POLLS",1,1.177513307e+18
"000IQ4TKYHAbb334zFYdWVCZZfMYo","4470083","PERSON","PARTNER",1,1446945757133940480
"000LbifV4rUa2MhxFlVZ52PSek5kG","499194","PERSON","SURVEYS_AND_POLLS",0,1.097867573e+18
"000OWvUHdmjeL34XzuVLmHQBple7X","4176205","PERSON","MFS",1,1.143985154e+18
Assistance would be most appreciated!
Thanks to @Bernhard for the help with this - here's a working solution.
options(scipen = 999) # so that TimeUpdated isn't written in scientific notation
conn <- dbConnect(MySQL(), host = "127.0.0.1", user="REDACTED", password="REDACTED", dbname="REDACTED", port=8906)
type_data <- dbGetQuery(conn, paste("SELECT * FROM ", arg, " WHERE 1", sep=""))
# convert the subscribed and timeupdated columns to strings
type_data$Subscribed <- as.character(type_data$Subscribed)
type_data$TimeUpdated <- as.character(type_data$TimeUpdated)
write.csv(type_data, paste(args[[1]], "/", arg, ".csv", sep=""), row.names=F)
dbDisconnect(conn)
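One caveat (an observation, not part of the accepted fix): scipen and as.character() only change the formatting. A value like 1437798832740631885 already exceeds double precision by the time R sees it, which is why the actual output above ends in ...808 instead of ...885. An alternative sketch is to cast the fragile columns to strings inside the query, so they arrive as character:
# column names taken from the desired output above
type_data <- dbGetQuery(conn, paste0(
  "SELECT Id, EntityId, EntityType, CommunicationType,
          CAST(Subscribed AS CHAR) AS Subscribed,
          CAST(TimeUpdated AS CHAR) AS TimeUpdated
   FROM ", arg))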

Fetching data in parallel from mysql using R doParallel or foreach

I am trying to fetch data in parallel from a MySQL database using R. The following code fetches the data one by one and works fine, but I want to speed up the process by sending multiple queries and saving the results into different variables. Later I will merge the time series inside the variables.
library(RMySQL)
con <- dbConnect(MySQL(), user='external', password='xxxxxxx', dbname='GMT_Minute_Data', host='xx.xx.xxx.xxx')
sqlData <- dbSendQuery(con, "select TradeTime, Open, High, Low, Close from ad where tradetime between '2014-01-01' and '2015-10-20'")
data1 <- dbFetch(sqlData, n=-1)
sqlData <- dbSendQuery(con, "select TradeTime, Open, High, Low, Close from ty where tradetime between '2014-01-01' and '2015-10-20'")
data2 <- dbFetch(sqlData, n=-1)
sqlData <- dbSendQuery(con, "select TradeTime, Open, High, Low, Close from ax where tradetime between '2014-01-01' and '2015-10-20'")
data3 <- dbFetch(sqlData, n=-1)
connections <- dbListConnections(MySQL())
for(i in connections) {dbDisconnect(i)}
I have tried to fetch data in parallel using following code:
library(foreach)
library(doParallel)
library(RMySQL)
fetchData <- function(nInst, inst1, inst2, inst3, inst4, inst5, startDate, endDate, con1){
  inst <- NULL
  sqlData <- NULL
  if (nInst == 1)
    inst <- inst1
  else if (nInst == 2)
    inst <- inst2
  else if (nInst == 3)
    inst <- inst3
  else if (nInst == 4)
    inst <- inst4
  else if (nInst == 5)
    inst <- inst5
  sqlData <- dbSendQuery(con1, paste0('select TradeTime, Open, High, Low, Close from ', inst, ' where tradetime between \'', startDate, '\' and \'', endDate, '\''))
  data1 <- dbFetch(sqlData, n=-1)
  print(head(data1))
  data1
}
cluster = makeCluster(5, type = "SOCK")
registerDoParallel(cluster)
mydb <- NULL
clusterEvalQ(cluster, {
mydb <- dbConnect(MySQL(), user='external', password='xxxxxx', dbname='GMT_Minute_Data', host='xx.xx.xxx.xxx')
NULL
})
allDataList<-foreach(n =1:2, .verbose=TRUE, .packages=('RMySQL')) %dopar% {
fetchData(n, inst1, inst2, inst3, inst4, inst5, startDate, endDate, mydb)
}
stopCluster(cluster)
on.exit(dbDisconnect(mydb))
Sometimes the code only fetches data for the first instrument, but not for the rest of the instruments.
Please assist if someone knows the solution.
Thanks,
I think the problem is that foreach is auto-exporting the mydb variable to the workers, thus defeating the purpose of initializing it with clusterEvalQ. Database connections can't be serialized and sent to other machines properly, which is why it's useful to initialize them manually with clusterEvalQ. The foreach .verbose=TRUE option lets you verify that mydb is not auto-exported. If it says that it is auto-exported, you need to prevent it.
In your example, you can prevent mydb from being auto-exported by simply removing the mydb <- NULL statement, but I suggest that you use the foreach .noexport='mydb' option to be certain that it's never auto-exported. Here's a stripped-down example that does that:
library(doParallel)
fetchData <- function(ignore) {
mydb
}
cluster <- makeCluster(5, type = "SOCK")
registerDoParallel(cluster)
clusterEvalQ(cluster, {
mydb <- sample(100, 1) # different value for each worker
NULL
})
r <- foreach(n=1:2, .noexport='mydb', .verbose=TRUE) %dopar% {
fetchData(n)
}
In this case, foreach analyzes the fetchData function and notices that it's using a variable named mydb. Thus, if mydb is defined on the master, it will auto-export it unless you tell it not to. That's why I suggest using .noexport='mydb' even if it's not defined in the local environment. It makes doubly sure that your function doesn't use a corrupt database connection.
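Putting the pieces together with the question's setup, a stripped-down corrected version might look like this (connection details and table names as in the question; treat it as a sketch):
library(doParallel)
library(RMySQL)
cluster <- makeCluster(5, type = "SOCK")
registerDoParallel(cluster)
# one private connection per worker; mydb is never created on the master
clusterEvalQ(cluster, {
  library(RMySQL)
  mydb <- dbConnect(MySQL(), user='external', password='xxxxxx', dbname='GMT_Minute_Data', host='xx.xx.xxx.xxx')
  NULL
})
instruments <- c("ad", "ty", "ax")
allDataList <- foreach(inst = instruments, .noexport = 'mydb', .packages = 'RMySQL') %dopar% {
  rs <- dbSendQuery(mydb, paste0("select TradeTime, Open, High, Low, Close from ", inst,
                                 " where tradetime between '2014-01-01' and '2015-10-20'"))
  dbFetch(rs, n = -1)
}
clusterEvalQ(cluster, { dbDisconnect(mydb); NULL }) # close worker connections
stopCluster(cluster)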

RMySQL dbWriteTable adding columns to table (dynamically?)

I just started using the R package called RMySQL in order to get around some memory limitations on my computer. I am trying to take a matrix with 100 columns in R (called data.df), then make a new table in an SQL database that has "100 choose 2" (= 4950) columns, where each column is a linear combination of two columns from the initial matrix. So far I have something like this:
countnumber <- 1
con <- dbConnect(MySQL(), user = "root", password = "password", dbname = "myDB")
temp <- as.data.frame(data.df[,1] - data.df[,2])
colnames(temp) <- paste(pairs[[countnumber]][1], pairs[[countnumber]][2], sep = "")
dbWriteTable(con, "spreadtable", temp, row.names=T, overwrite = T)
for (i in 1:(n-1)){
  for (j in (i+1):n){
    if (!((i==1) && (j==2))){ # this part excludes the first iteration already taken care of
      temp <- as.data.frame(data.df[,i] - data.df[,j])
      colnames(temp) <- "hola"
      dbWriteTable(con, "spreadtable", value = temp, append = TRUE, overwrite = FALSE, row.names = FALSE)
      countnumber <- countnumber + 1
    }
  }
}
I've also tried toying around with the "field.types" argument of RMySQL::dbWriteTable(), which was suggested at RMySQL dbWriteTable with field.types. Sadly it hasn't helped me out too much.
Questions:
Is making your own sql database a valid solution to the memory-bound nature of R, even if it has 4950 columns?
Is the dbWriteTable() the proper function to be using here?
Assuming the answer is "yes" to both of the previous questions...why isn't this working?
Thanks for any help.
[EDIT]: code with error output:
names <- as.data.frame(index)
names <- t(names)
#dim(names) is 1 409
con <- dbConnect(MySQL(), user = "root", password = "password", dbname = "taylordatabase")
dbGetQuery(con, dbBuildTableDefinition(MySQL(), name="spreadtable", obj=names, row.names = F))
#I would prefer these to be double types with 8 decimal spaces instead of text
#dim(temp) is 1 409
temp <- as.data.frame(data.df[,1] - (ratios[countnumber]*data.df[,2]))
temp <- t(temp)
temp <- as.data.frame(temp)
dbWriteTable(con, name = "spreadtable", temp, append = T)
The table is created successfully in the database (I will change the variable types later), but the dbWriteTable() line produces the error:
Error in mysqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not run statement: Unknown column 'row_names' in 'field list')
[1] FALSE
Warning message:
In mysqlWriteTable(conn, name, value, ...) : could not load data into table
If I make a slight change, I get a different error message:
dbWriteTable(con, name = "spreadtable", temp, append = T, row.names = F)
and
Error in mysqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not run statement: Unknown column 'X2011_01_03' in 'field list')
[1] FALSE
Warning message:
In mysqlWriteTable(conn, name, value, ...) : could not load data into table
I just want to use "names" as a bunch of column labels (they were initially dates), and "temp" to be the actual data.
Having a table with 4950 columns is OK in principle; the problem is which columns you need.
If you always SELECT *, you will eventually exhaust your system memory (given a table that wide).
Why not give us the error message if you have encountered any problems?
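For what it's worth, a minimal working sketch of the whole write (a suggestion with a toy data.df, not from the thread): MySQL caps a table at 4096 columns, so 4950 spread columns will not fit anyway, and a long format - one row per (pair, observation) - sidesteps both that limit and the column-name mismatches behind the errors above:
library(RMySQL)
con <- dbConnect(MySQL(), user = "root", password = "password", dbname = "myDB")
data.df <- matrix(rnorm(1000), ncol = 100) # hypothetical 100-column input
pairs <- combn(ncol(data.df), 2) # all 4950 column pairs
spread_long <- do.call(rbind, lapply(seq_len(ncol(pairs)), function(k) {
  i <- pairs[1, k]; j <- pairs[2, k]
  data.frame(pair   = paste0("c", i, "_c", j), # stable, valid label per pair
             obs    = seq_len(nrow(data.df)),
             spread = data.df[, i] - data.df[, j])
}))
# one write, consistent names, no row-name column
dbWriteTable(con, "spreadtable", spread_long, row.names = FALSE, overwrite = TRUE)
dbDisconnect(con)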