Blob to Dataframe in R - mysql

I have my Rstudio connected to a MySQL database. The table I'm importing has a MySQL JSON column type: https://dev.mysql.com/doc/refman/5.7/en/json.html
When I import it into R, it becomes a BLOb. You can see the table, as its imported, here:
'data.frame': 15 obs. of 5 variables:
$ id :integer64 1 2 3 4 5 6 7 8 ...
$ user_id : chr
$ survey_id:integer64 3 10 10 10 10 3 10 10 ...
$ p_id : chr "22zdae" "0" "0" "0" ...
$ data : blob [1:15] ..$ : raw 7b 22 45 78 ...
When I go to extract information from the blob I use the following code:
for(row in 1:NROW(data)){
print(row)
tryCatch({
if(is_empty(data$data[[row]])==TRUE){
x<-NA
} else {
x <- rawToChar(data$data[[row]])
}
survey_data <- rbind(survey_data,x)
}, error=function(e){cat("ERROR :",conditionMessage(e), "\n")}
)}
Every row is transformed into only partially what in the database. For example:
Status": "Never married", "Liberal_Conserv": "Very Liberal",
"Political_Party": "Republican", "Kids_18yo_Number": ""}
This row has 251 variables in the database, not 4.
How can I accurately transform a blob into workable data?

Related

Undefined columns selected using panelvar package

Have anyone used panel var in R?
Currently I'm using the package panelvar of R. And I'm getting this error :
Error in `[.data.frame`(data, , c(colnames(data)[panel_identifier], required_vars)) :
undefined columns selected
And my syntax currently is:
model1<-pvargmm(
dependent_vars = c("Change.."),
lags = 2,
exog_vars = c("Price"),
transformation = "fd",
data = base1,
panel_identifier = c("id", "t"),
steps = c("twostep"),
system_instruments = FALSE,
max_instr_dependent_vars = 99,
min_instr_dependent_vars = 2L,
collapse = FALSE)
I don't know why my panel_identifier is not working, it's pretty similar to the example given by panelvar package, however, it doesn't work, I want to appoint that base1 is on data.frame format. any ideas? Also, my data is structured like this:
head(base1)
id t country DDMMYY month month_text day Date_txt year Price Open
1 1 1296 China 1-4-2020 4 Apr 1 Apr 01 2020 12588.24 12614.82
2 1 1295 China 31-3-2020 3 Mar 31 Mar 31 2020 12614.82 12597.61
High Low Vol. Change..
1 12775.83 12570.32 NA -0.0021
2 12737.28 12583.05 NA 0.0014
thanks in advance !
Check the documentation of the package and the SSRN paper. For me it helped to ensure all entered formats are identical (you can check this with str(base1) command). For example they write:
library(panelvar)
data("Dahlberg")
ex1_dahlberg_data <-
pvargmm(dependent_vars = .......
When I look at it I get
>str(Dahlberg)
'data.frame': 2385 obs. of 5 variables:
$ id : Factor w/ 265 levels "114","115","120",..: 1 1 1 1 1 1 1 1 1 2 ...
$ year : Factor w/ 9 levels "1979","1980",..: 1 2 3 4 5 6 7 8 9 1 ...
$ expenditures: num 0.023 0.0266 0.0273 0.0289 0.0226 ...
$ revenues : num 0.0182 0.0209 0.0211 0.0234 0.018 ...
$ grants : num 0.00544 0.00573 0.00566 0.00589 0.00559 ...
For example the input data must be a data.frame (in my case it had additional type specifications like tibble or data.table). I resolved it by casting as.data.frame() on it.

Transform list cell in data frame into rows

I'm sorry for no code to replicate, I can provide a picture only. See it below please.
A data frame with Facebook insights data prepared from JSON consists a column "values" with list values. For the next manipulation I need to have only one value in the column. So the row 3 on picture should be transformed into two (with list content or value directly):
post_story_adds_by_action_type_unique lifetime list(like = 38)
post_story_adds_by_action_type_unique lifetime list(share = 11)
If there are 3 or more values in data frame list cell, it should make 3 or more single value rows.
Do you know how to do it?
I use this code to get the json and data frame:
i <- fromJSON(post.request.url)
i <- as.data.frame(i$insights$data)
Edit:
There will be no deeper nesting, just this one level.
The list is not needed in the result, I need just the values and their names.
Let's assume you're starting with something that looks like this:
mydf <- data.frame(a = c("A", "B", "C", "D"), period = "lifetime")
mydf$values <- list(list(value = 42), list(value = 5),
list(value = list(like = 38, share = 11)),
list(value = list(like = 38, share = 13)))
str(mydf)
## 'data.frame': 4 obs. of 3 variables:
## $ a : Factor w/ 4 levels "A","B","C","D": 1 2 3 4
## $ period: Factor w/ 1 level "lifetime": 1 1 1 1
## $ values:List of 4
## ..$ :List of 1
## .. ..$ value: num 42
## ..$ :List of 1
## .. ..$ value: num 5
## ..$ :List of 1
## .. ..$ value:List of 2
## .. .. ..$ like : num 38
## .. .. ..$ share: num 11
## ..$ :List of 1
## .. ..$ value:List of 2
## .. .. ..$ like : num 38
## .. .. ..$ share: num 13
## NULL
Instead of retaining lists in your output, I would suggest flattening out the data, perhaps using a function like this:
myFun <- function(indt, col) {
if (!is.data.table(indt)) indt <- as.data.table(indt)
other_names <- setdiff(names(indt), col)
list_col <- indt[[col]]
rep_out <- sapply(list_col, function(x) length(unlist(x, use.names = FALSE)))
flat <- {
if (is.null(names(list_col))) names(list_col) <- seq_along(list_col)
setDT(tstrsplit(names(unlist(list_col)), ".", fixed = TRUE))[
, val := unlist(list_col, use.names = FALSE)][]
}
cbind(indt[rep(1:nrow(indt), rep_out)][, (col) := NULL], flat)
}
Here's what it does with the "mydf" I shared:
myFun(mydf, "values")
## a period V1 V2 V3 val
## 1: A lifetime 1 value NA 42
## 2: B lifetime 2 value NA 5
## 3: C lifetime 3 value like 38
## 4: C lifetime 3 value share 11
## 5: D lifetime 4 value like 38
## 6: D lifetime 4 value share 13

"NA" in JSON file translates to NA logical

I have json files with data for countries. One of the files has the following data:
"[{\"count\":1,\"subject\":{\"name\":\"Namibia\",\"alpha2\":\"NA\"}}]"
I have the following code convert the json into a data.frame using the jsonlite package:
df = as.data.frame(fromJSON(jsonfile), flatten=TRUE))
I was expecting a data.frame with numbers and strings:
count subject.name subject.alpha2
1 Namibia "NA"
Instead, the NA alpha2 code is being automatically converted into NA logical, and this is what I get:
str(df)
$ count : int 1
$ subject.name : chr "Namibia"
$ subject.alpha2: logi NA
I want alpha2 to be a string, not logical. How do I fix this?
That particular implementation of fromJSON (and there are three different packages with that name for a function) has a simplifyVector argument which appears to prevent the corecion:
require(jsonlite)
> as.data.frame( fromJSON(test, simplifyVector=FALSE ) )
count subject.name subject.alpha2
1 1 Namibia NA
> str( as.data.frame( fromJSON(test, simplifyVector=FALSE ) ) )
'data.frame': 1 obs. of 3 variables:
$ count : int 1
$ subject.name : Factor w/ 1 level "Namibia": 1
$ subject.alpha2: Factor w/ 1 level "NA": 1
> str( as.data.frame( fromJSON(test, simplifyVector=FALSE ) ,stringsAsFactors=FALSE) )
'data.frame': 1 obs. of 3 variables:
$ count : int 1
$ subject.name : chr "Namibia"
$ subject.alpha2: chr "NA"
I tried seeing if that option worked well with the flatten argument, but was disappointed:
> str( fromJSON(test, simplifyVector=FALSE, flatten=TRUE) )
List of 1
$ :List of 2
..$ count : int 1
..$ subject:List of 2
.. ..$ name : chr "Namibia"
.. ..$ alpha2: chr "NA"
The accepted answer did not solve my use case.
However, rjson::fromJSON does this naturally, and to my surprise, 10 times faster on my data.

standard unambiguos format [R] MySQL imported data

OK, to set the scene, I have written a function to import multiple tables from MySQL (using RODBC) and run randomForest() on them.
This function is run on multiple databases (as separate instances).
In one particular database, and one particular table, the "error in as.POSIXlt.character(x, tz,.....): character string not in a standard unambiguous format" error is thrown. The function runs on around 150 tables across two databases without any issues except this one table.
Here is a head() print from the table:
MQLTime bar5 bar4 bar3 bar2 bar1 pat1 baXRC
1 2014-11-05 23:35:00 184 24 8 24 67 147 Flat
2 2014-11-05 23:57:00 203 184 204 67 51 147 Flat
3 2014-11-06 00:40:00 179 309 49 189 75 19 Flat
4 2014-11-06 00:46:00 28 192 60 49 152 147 Flat
5 2014-11-06 01:20:00 309 48 9 11 24 19 Flat
6 2014-11-06 01:31:00 24 177 64 152 188 19 Flat
And here is the function:
GenerateRF <- function(db, countstable, RFcutoff) {
'load required libraries'
library(RODBC)
library(randomForest)
library(caret)
library(ff)
library(stringi)
'connection and data preparation'
connection <- odbcConnect ('TTODBC', uid='root', pwd='password', case="nochange")
'import count table and check if RF is allowed to be built'
query.str <- paste0 ('select * from ', db, '.', countstable, ' order by RowCount asc')
row.counts <- sqlQuery (connection, query.str)
'Operate only on tables that have >= RFcutoff'
for (i in 1:nrow (row.counts)) {
table.name <- as.character (row.counts[i,1])
col.count <- as.numeric (row.counts[i,2])
row.count <- as.numeric (row.counts[i,3])
if (row.count >= 20) {
'Delete old RFs and DFs for input pattern'
if (file.exists (paste0 (table.name, '_RF.Rdata'))) {
file.remove (paste0 (table.name, '_RF.Rdata'))
}
if (file.exists (paste0 (table.name, '_DF.Rdata'))) {
file.remove (paste0 (table.name, '_DF.Rdata'))
}
'import and clean data'
query.str2 <- paste0 ('select * from ', db, '.', table.name, ' order by mqltime asc')
raw.data <- sqlQuery(connection, query.str2)
'partition data into training/test sets'
set.seed(489)
index <- createDataPartition(raw.data$baXRC, p=0.66, list=FALSE, times=1)
data.train <- raw.data [index,]
data.test <- raw.data [-index,]
'find optimal trees to grow (without outcome and dates)
data.mtry <- as.data.frame (tuneRF (data.train [, c(-1,-col.count)], data.train$baXRC, ntreetry=100,
stepFactor=.5, improve=0.01, trace=TRUE, plot=TRUE, dobest=FALSE))
best.mtry <- data.mtry [which (data.mtry[,2] == min (data.mtry[,2])), 1]
'compress df'
data.ff <- as.ffdf (data.train)
'run RF. Originally set to 1000 trees but M1 dataset is to large for laptop. Maybe train at the lab?'
data.rf <- randomForest (baXRC~., data=data.ff[,-1], mtry=best.mtry, ntree=500, keep.forest=TRUE,
importance=TRUE, proximity=FALSE)
'generate and print variable importance plot'
varImpPlot (data.rf, main = table.name)
'predict on test data'
data.test.pred <- as.data.frame( predict (data.rf, data.test, type="prob"))
'get dates and name date column'
data.test.dates <- data.frame (data.test[,1])
colnames (data.test.dates) <- 'MQLTime'
'attach dates to prediction df'
data.test.res <- cbind (data.test.dates, data.test.pred)
'force date coercion to attempt negating unambiguous format error '
data.test.res$MQLTime <- format(data.test.res$MQLTime, format = "%Y-%m-%d %H:%M:%S")
'delete row names, coerce to dataframe, generate row table name and export outcomes to MySQL'
rownames (data.test.res)<-NULL
data.test.res <- as.data.frame (data.test.res)
root.table <- stri_sub(table.name, 0, -5)
sqlUpdate (connection, data.test.res, tablename = paste0(db, '.', root.table, '_outcome'), index = "MQLTime")
'save RF and test df/s for future use; save latest version of row_counts to MQL4 folder'
save (data.rf, file = paste0 ("C:/Users/user/Documents/RF_test2/", table.name, '_RF.Rdata'))
save (data.test, file = paste0 ("C:/Users/user/Documents/RF_test2/", table.name, '_DF.Rdata'))
write.table (row.counts, paste0("C:/Users/user/AppData/Roaming/MetaQuotes/Terminal/71FA4710ABEFC21F77A62A104A956F23/MQL4/Files/", db, "_m1_rowcounts.csv"), sep = ",", col.names = F,
row.names = F, quote = F)
'end of conditional block'
}
'end of for loop'
}
'close all connection to MySQL'
odbcCloseAll()
'clear workspace'
rm(list=ls())
'end of function'
}
At this line:
data.test.res$MQLTime <- format(data.test.res$MQLTime, format = "%Y-%m-%d %H:%M:%S")
I have tried coercing MQLTime using various functions including: as.character(), as.POSIXct(), as.POSIXlt(), as.Date(), format(), as.character(as.Date())
and have also tried:
"%y" vs "%Y" and "%OS" vs "%S"
All variants seem to have no effect on the error and the function is still able to run on all other tables. I have checked the table manually (which contains almost 1500 rows) and also in MySQL looking for NULL dates or dates like "0000-00-00 00:00:00".
Also, if I run the function line by line in R terminal, this offending table is processed without any problems which just confuses the hell out me.
I've exhausted all the functions/solutions I can think of (and also all those I could find through Dr. Google) so I am pleading for help here.
I should probably mention that the MQLTime column is stored as varchar() in MySQL. This was done to try and get around issues with type conversions between R and MySQL
SHOW VARIABLES LIKE "%version%";
innodb_version, 5.6.19
protocol_version, 10
slave_type_conversions,
version, 5.6.19
version_comment, MySQL Community Server (GPL)
version_compile_machine, x86
version_compile_os, Win32
> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: i386-w64-mingw32/i386 (32-bit)
Edit: Str() output on the data as imported from MySQl showing MQLTime is already in POSIXct format:
> str(raw.data)
'data.frame': 1472 obs. of 8 variables:
$ MQLTime: POSIXct, format: "2014-11-05 23:35:00" "2014-11-05 23:57:00" "2014-11-06 00:40:00" "2014-11-06 00:46:00" ...
$ bar5 : int 184 203 179 28 309 24 156 48 309 437 ...
$ bar4 : int 24 184 309 192 48 177 48 68 60 71 ...
$ bar3 : int 8 204 49 60 9 64 68 27 192 147 ...
$ bar2 : int 24 67 189 49 11 152 27 56 437 67 ...
$ bar1 : int 67 51 75 152 24 188 56 147 71 0 ...
$ pat1 : int 147 147 19 147 19 19 147 19 147 19 ...
$ baXRC : Factor w/ 3 levels "Down","Flat",..: 2 2 2 2 2 2 2 2 2 3 ...
So I have tried declaring stringsAsfactors = FALSE in the dataframe operations and this had no effect.
Interestingly, if the offending table is removed from processing through an additional conditional statement in the first 'if' block, the function stops on the table immediately preceeding the blocked table.
If both the original and the new offending tables are removed from processing, then the function stops on the table immediately prior to them. I have never seen this sort of behavior before and it really has me stumped.
I watched system resources during the function and they never seem to max out.
Could this be a problem with the 'for' loop and not necessarily date formats?
There appears to be some egg on my face. The table following the table where the function was stopping had a row with value '0000-00-00 00:00:00'. I added another statement in my MySQL function to remove these rows when pre-processing the tables. Thanks to those that had a look at this.

How do I write a json array from R that has a sequence of lat and long?

How do I write a json array from R that has a sequence of lat and long?
I would like to write:
[[[1,2],[3,4],[5,6]]]
the best I can do is:
toJSON(matrix(1:6, ncol = 2, byrow = T))
#"[ [ 1, 2 ],\n[ 3, 4 ],\n[ 5, 6 ] ]"
How can I wrap the thing in another array (the json kind)?
This is important to me so I can write files into a geojson format as a LineString.
I usually use fromJSON to get the target object :
ll <- fromJSON('[[[1,2],[3,4],[5,6]]]')
str(ll)
List of 1
$ :List of 3
..$ : num [1:2] 1 2
..$ : num [1:2] 3 4
..$ : num [1:2] 5 6
So we should create , a list of unnamed list, each containing 2 elements:
xx <- list(setNames(split(1:6,rep(1:3,each=2)),NULL))
identical(toJSON(xx),'[[[1,2],[3,4],[5,6]]]')
[1] TRUE
If you have a matrix
m1 <- matrix(1:6, ncol=2, byrow=T)
may be this helps:
library(rjson)
paste0("[",toJSON(setNames(split(m1, row(m1)),NULL)),"]")
#[1] "[[[1,2],[3,4],[5,6]]]"