Fitting an R data frame into a MySQL table

I have some hundreds of CSV files that I want to insert into a MySQL database with an R script.
library(RMySQL)
library(caroline)
...
...
con <- dbConnect(MySQL(), user="root", password="mypass", dbname="Data", host="localhost")
fields <- dbListFields(con, db.table.name)
dbWriteTable2(con,db.table.name,data,row.names=FALSE,fill.null=TRUE)
dbDisconnect(con)
...
...
R answers me with:
Error in mysqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not run statement: Unknown column 'id' in 'field list')
But there is no id field in either of the two tables: not in my data frame and not in the MySQL table.
So what is R trying to tell me?
Here is my database schema, and here is the str() of my data:
str(data)
'data.frame': 306 obs. of 22 variables:
$ Division : chr "B1" "B1" "B1" "B1" ...
$ Eventdate: Date, format: "2000-08-12" "2000-08-12" "2000-08-12" "2000-08-12" ...
$ HomeTeam : chr "Beveren" "Mechelen" "Louvieroise" "Mouscron" ...
$ AwayTeam : chr "Charleroi" "Genk" "Lokeren" "Germinal" ...
$ FTHG : int 1 0 0 3 2 1 6 3 0 2 ...
$ FTAG : int 2 0 0 0 1 3 2 4 0 1 ...
$ FTR : chr "A" "D" "D" "H" ...
$ HTHG : int 1 0 0 1 1 0 3 1 0 1 ...
$ HTAG : int 1 0 0 0 0 1 0 2 0 0 ...
$ HTR : chr "D" "D" "D" "H" ...
$ GBH : num 2.2 2 2.3 1.9 1.41 2.7 1.8 2.6 4 1.5 ...
$ GBD : num 3.5 3.6 3.3 3.3 4.1 3.3 3.4 3.2 3 3.5 ...
$ GBA : num 2.5 2.65 2.45 3.5 4.7 2.1 3.5 2.2 1.75 5.1 ...
$ IWH : num 2 2.1 2.35 1.8 1.75 2.3 1.6 2.6 3.8 1.4 ...
$ IWD : num 3 3 3 3 3.1 3 3.2 3 3.2 3.6 ...
$ IWA : num 2.9 2.8 2.35 3.5 3.6 2.4 4.2 2.2 1.65 5.5 ...
$ SBH : num 2.05 2.3 2.45 1.85 1.6 2.6 1.6 2.75 4.1 1.53 ...
$ SBD : num 3.4 3.4 3.4 3.4 3.5 3.2 3.5 3.25 3.45 3.5 ...
$ SBA : num 2.95 2.55 2.4 3.5 4.5 2.3 4.5 2.2 1.7 5 ...
$ WHH : num 2 2.4 2.4 -1 1.61 2.4 1.61 2.6 3.8 1.5 ...
$ WHD : num 3.4 3.3 3.3 -1 3.6 3.3 3.75 3.3 3.5 3.6 ...
$ WHA : num 2.9 2.4 2.4 -1 4.2 2.4 4 2.25 1.7 5.2 ...

Based on some testing of the caroline package, dbWriteTable2 is only useful for tables that have an id column. From ?dbWriteTable2, its return value is: "If successful, the ids of the newly added database records (invisible)". If the table does not have an id column whose values it can return, dbWriteTable2 seems to fail, which produces exactly the "Unknown column 'id'" error you are seeing.
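Two possible workarounds, sketched in R below (untested against your schema; dbSendQuery() and dbWriteTable() are standard RMySQL/DBI calls, but the ALTER TABLE statement and the assumption that your data frame's columns match the table's are mine):

# Option 1: give the table an auto-increment primary key named 'id',
# so dbWriteTable2() has ids to return
dbSendQuery(con, paste("ALTER TABLE", db.table.name,
                       "ADD COLUMN id INT AUTO_INCREMENT PRIMARY KEY"))

# Option 2: bypass caroline and append with plain dbWriteTable(),
# which does not require an id column (but also does not fill missing
# columns the way fill.null=TRUE does)
dbWriteTable(con, name = db.table.name, value = data,
             row.names = FALSE, append = TRUE)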

Related

statsmodels OLS gives parameters despite perfect multicollinearity

Assume the following df:
      ib  c  d1  d2
0   1.14  1   1   0
1   1.00  1   1   0
2   0.71  1   1   0
3   0.60  1   1   0
4   0.66  1   1   0
5   1.00  1   1   0
6   1.26  1   1   0
7   1.29  1   1   0
8   1.52  1   1   0
9   1.31  1   1   0
10  0.89  1   0   1
d1 and d2 are perfectly collinear. Now I estimate the following regression model:
import statsmodels.api as sm
reg = sm.OLS(df['ib'], df[['c', 'd1', 'd2']]).fit().summary()
reg
This gives me the following output:
<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results
==============================================================================
Dep. Variable:                     ib   R-squared:                       0.087
Model:                            OLS   Adj. R-squared:                 -0.028
Method:                 Least Squares   F-statistic:                    0.7590
Date:                Thu, 17 Nov 2022   Prob (F-statistic):              0.409
Time:                        12:19:34   Log-Likelihood:                -1.5470
No. Observations:                  10   AIC:                             7.094
Df Residuals:                       8   BIC:                             7.699
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
c              0.7767      0.111      7.000      0.000       0.521       1.033
d1             0.2433      0.127      1.923      0.091      -0.048       0.535
d2             0.5333      0.213      2.499      0.037       0.041       1.026
==============================================================================
Omnibus:                        0.257   Durbin-Watson:                   0.760
Prob(Omnibus):                  0.879   Jarque-Bera (JB):                0.404
Skew:                           0.043   Prob(JB):                        0.817
Kurtosis:                       2.019   Cond. No.                     8.91e+15
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 2.34e-31. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
"""
However, including c, d1 and d2 represents the well-known dummy variable trap, which, from my understanding, should make it impossible to estimate the model. Why is this not the case here?
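A likely explanation, based on statsmodels' documented defaults rather than on anything stated in this thread: because c = d1 + d2, the design matrix X is singular and X'X has no ordinary inverse, so the textbook closed-form estimator cannot be computed. OLS.fit() defaults to method="pinv", i.e. it solves the least-squares problem with the Moore-Penrose pseudoinverse,

beta_hat = pinv(X) . y   (the minimum-norm least-squares solution)

which is defined even when X is singular. That is why finite coefficients are reported at all, and why the only symptoms of the dummy variable trap are the enormous condition number (8.91e+15) and note [2] about the smallest eigenvalue.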

Blob to Dataframe in R

I have my RStudio connected to a MySQL database. The table I'm importing has a MySQL JSON column type: https://dev.mysql.com/doc/refman/5.7/en/json.html
When I import it into R, it becomes a blob. You can see the table, as it's imported, here:
'data.frame': 15 obs. of 5 variables:
$ id :integer64 1 2 3 4 5 6 7 8 ...
$ user_id : chr
$ survey_id:integer64 3 10 10 10 10 3 10 10 ...
$ p_id : chr "22zdae" "0" "0" "0" ...
$ data : blob [1:15] ..$ : raw 7b 22 45 78 ...
When I go to extract information from the blob I use the following code:
# assumes purrr is attached (for is_empty) and survey_data exists before the loop
for (row in 1:NROW(data)) {
  print(row)
  tryCatch({
    if (is_empty(data$data[[row]])) {
      x <- NA
    } else {
      x <- rawToChar(data$data[[row]])
    }
    survey_data <- rbind(survey_data, x)
  }, error = function(e) { cat("ERROR :", conditionMessage(e), "\n") })
}
Every row is transformed into only part of what is in the database. For example:
Status": "Never married", "Liberal_Conserv": "Very Liberal",
"Political_Party": "Republican", "Kids_18yo_Number": ""}
This row has 251 variables in the database, not 4.
How can I accurately transform a blob into workable data?
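A minimal sketch of one way to do this, assuming each blob holds a complete, flat UTF-8 JSON object (jsonlite and dplyr are assumptions of this sketch, not packages mentioned in the question): decode each blob with rawToChar(), parse the JSON, and bind the parsed records into a data frame so that all fields survive, not just the tail end of the string.

library(jsonlite)
library(dplyr)

# raw bytes -> JSON string -> named list of fields; empty blobs become NULL
parsed <- lapply(data$data, function(b) {
  if (length(b) == 0) return(NULL)
  fromJSON(rawToChar(b))
})

# one row per blob; fields missing in a given row are filled with NA,
# and NULL entries (empty blobs) are dropped automatically
survey_data <- bind_rows(parsed)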

Round off to 0.5 SQL

I'm new to SQL.
How do I round off so that:
1.01 -- 1.24 -> 1
1.25 -- 1.49 -> 1.5
1.51 -- 1.74 -> 1.5
1.75 -- 1.99 -> 2
Thanks for your help, much appreciated.
You can just do:
select floor(val * 2 + 0.5) / 2
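Why this works: doubling the value turns half-units into whole units, floor(x + 0.5) then rounds to the nearest integer, and dividing by 2 scales back. A quick sanity check of the same formula, run here in R with sample values covering the ranges from the question:

vals <- c(1.01, 1.24, 1.25, 1.49, 1.51, 1.74, 1.75, 1.99)
floor(vals * 2 + 0.5) / 2
# [1] 1.0 1.0 1.5 1.5 1.5 1.5 2.0 2.0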

How to load a CSV file with multiple symbols containing OHLCV data using getSymbols, with dates in YYYYMMDD format

I am relatively new to R and to this site. I am trying to read in a CSV file that has multiple symbols with OHLCV data and dates in string YYYYMMDD format.
Data format example
I have tried:
data <- read.csv(file="DFM.csv", sep=",", dec=".", header=TRUE, col.names = c("Symbols", "Date", "Open", "High", "Low", "Close", "Volume"), stringsAsFactors = FALSE)
> class(data)
[1] "data.frame"
> head(data)
Symbols Date Open High Low Close Volume
1 DIB 20160630 5.03 5.12 5.03 5.11 6171340
2 DIB 20160629 5.10 5.11 5.02 5.02 5241741
3 DIB 20160628 5.05 5.11 5.02 5.07 5258839
4 DIB 20160627 5.01 5.11 5.01 5.03 5038589
5 DIB 20160626 4.94 5.04 4.90 5.02 10593471
6 DIB 20160623 5.14 5.14 5.09 5.12 3069970
as.Date(data$Date, format="%Y%m%d") # didn't work: Date was read in as integer, not character
Somehow I need to load it into getSymbols() so I can use chart_Series() to plot the charts. Can anyone help?
Using your example data, this is one possible solution: import the file, convert the Date column, split the data by Symbol, and arrange it in a way that lets you chart the individual objects (stocks) in a straightforward way.
First and last 3 lines of original file data (allStocks):
> both(allStocks)
Symbol Date Open High Low Close Volme
1 DIB 20160630 5.03 5.12 5.03 5.11 6171340
2 DIB 20160629 5.10 5.11 5.02 5.02 5241741
3 DIB 20160628 5.05 5.11 5.02 5.07 5258839
Symbol Date Open High Low Close Volme
16 CBD 20160627 5.6 5.6 5.6 5.6 0
17 CBD 20160626 5.6 5.6 5.6 5.6 0
18 CBD 20160623 5.6 5.6 5.6 5.6 0
Let's start by converting the Date column:
allStocks$Date <- as.Date(as.character(allStocks$Date), format="%Y%m%d")
Next, split allStocks by Symbol, which gives you a list where each element represents an individual stock, named after its Symbol:
allStocks <- split(allStocks,allStocks$Symbol)
Next, get rid of the Symbol column to prepare for an xts object:
allStocks <- lapply(allStocks, function(x) as.xts(x[,3:7],order.by=x[,2]))
and finally convert the list into individual xts objects, each representing one stock and named after its Symbol:
list2env(allStocks,envir=.GlobalEnv)
You should now have 3 nicely formatted objects in your global environment, ready to be charted.
For example, the str() and the first and last lines of stock DIB:
> str(DIB)
An ‘xts’ object on 2016-06-23/2016-06-30 containing:
Data: num [1:6, 1:5] 5.14 4.94 5.01 5.05 5.1 5.03 5.14 5.04 5.11 5.11 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:5] "Open" "High" "Low" "Close" ...
Indexed by objects of class: [Date] TZ: UTC
xts Attributes:
NULL
> both(DIB)
Open High Low Close Volme
2016-06-23 5.14 5.14 5.09 5.12 3069970
2016-06-26 4.94 5.04 4.90 5.02 10593471
2016-06-27 5.01 5.11 5.01 5.03 5038539
Open High Low Close Volme
2016-06-28 5.05 5.11 5.02 5.07 5258839
2016-06-29 5.10 5.11 5.02 5.02 5241741
2016-06-30 5.03 5.12 5.03 5.11 6171340
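From here you don't need getSymbols() at all: chart_Series() accepts an OHLCV xts object directly. A minimal sketch (assuming quantmod is installed; DIB and CBD are the objects created by list2env() above):

library(quantmod)  # provides chart_Series()

chart_Series(DIB)                  # candlestick chart of one stock
chart_Series(CBD, name = "CBD")    # name= sets the chart title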

AWK wrong math on first line only

This is the input file input.awk (DOS line endings):
06-13-2014,08:43:11
RLS007817
RRC001021
yes,71.61673,0,150,37,1
no,11,156,1.35,306.418
4,3,-1,2.5165,20,-1.4204
-4,0,11,0,0,0
1.00E-001,0.2,3.00E-001,0.6786031,0.5,6.37E-002
110,40,30,222,200,-539
120,50,35,215,220,-547
130,60,40,207,240,-553
140,70,45,196,260,-560
150,80,50,184,280,-566
160,90,55,170,300,-573
170,100,60,157,320,-578
180,110,65,141,340,-582
190,120,70,126,360,-586
200,130,75,110,380,-590
This is what I basically need:
Ignore the first 8 lines (OK)
Pick and split the numbers on lines 6,7 & 8 (OK)
Do AWK math on columns (Error only in first line?)
BASH code
#!/bin/bash
myfile="input.awk"
vzeros=$(sed '6q;d' $myfile)
vshift=$(sed '7q;d' $myfile)
vcalib=$(sed '8q;d' $myfile)
IFS=','
read -a avz <<< "${vzeros}"
read -a avs <<< "${vshift}"
read -a avc <<< "${vcalib}"
z1=${avz[0]};s1=${avs[0]};c1=${avc[0]}
z2=${avz[1]};s2=${avs[1]};c2=${avc[1]}
z3=${avz[2]};s3=${avs[2]};c3=${avc[2]}
z4=${avz[4]};s4=${avs[4]};c4=${avc[4]}
#The single variables will be passed to awk
awk -v z1="$z1" -v c1="$c1" -v s1="$s1" -v z2="$z2" -v c2="$c2" -v s2="$s2" -v z3="$z3" -v c3="$c3" -v s3="$s3" -v z4="$z4" -v c4="$c4" -v s4="$s4" 'NR>8 { FS = "," ;
nc1 = c1 * ( $1 - z1 - s1 );
nc2 = c2 * ( $2 - z2 - s2 );
nc3 = c3 * ( $3 - z3 - s3 );
nc4 = c4 * ( $5 - z4 - s4 );
print nc1,nc2,nc3,nc4 }' $myfile > test.plot
This is the result in the file test.plot:
11 -0.6 -3 -10
12 9.4 7.5 100
13 11.4 9 110
14 13.4 10.5 120
15 15.4 12 130
16 17.4 13.5 140
17 19.4 15 150
18 21.4 16.5 160
19 23.4 18 170
20 25.4 19.5 180
This is the weird part: it's only in the first line, after the first column, that everything is wrong. And I have no idea why.
This is the expected result file:
11 7.4 6 90
12 9.4 7.5 100
13 11.4 9 110
14 13.4 10.5 120
15 15.4 12 130
16 17.4 13.5 140
17 19.4 15 150
18 21.4 16.5 160
19 23.4 18 170
20 25.4 19.5 180
I've printed the correction factors captured from lines 6,7 & 8 and everything is fine. All math is fine, except on the first line, after the first column.
OS: Slackware 13.37.
AWK: GNU Awk 3.1.6 Copyright (C) 1989, 1991-2007 Free Software Foundation.
I agree with @jeanrjc.
I copied your file and script to my machine and reduced it to processing the first 2 lines of your data.
With your code as-is, I duplicate your results, i.e.:
#dbg $0=110,40,30,222,200,-539
#dbg c2=0.2 $2= z2=3 s2=0
11 -0.6 -3 -10
#dbg $0=120,50,35,215,220,-547
#dbg c2=0.2 $2= z2=3 s2=0
12 -0.6 -3 -10
With FS=","; commented out, and -F, added in the option list the output is what you are looking for.
#dbg $0=110,40,30,222,200,-539
#dbg c2=0.2 $2=40 z2=3 s2=0
11 7.4 6 90
#dbg $0=120,50,35,215,220,-547
#dbg c2=0.2 $2=50 z2=3 s2=0
12 9.4 7.5 100
So make sure you have removed FS=","; from the block of code and are using -F, instead. In any case, resetting FS="," for every line that is processed is not useful.
If that still doesn't solve it, try the corrected code on a machine with a newer version of awk.
It would take a small magazine article to completely illustrate what is happening, but in short: while reading the first 8 records, FS is the default [[:space:]]. The first row that meets your rule NR>8 is also parsed while FS is still [[:space:]]; only after its fields are parsed is FS set to ",", and that first row is never rescanned.
IHTH!
Your sample is too complex to reproduce easily, but I guess you should try:
awk -F"," 'NR>8{...
instead of
awk 'NR>8 { FS = "," ;
You can also try with BEGIN:
awk 'BEGIN{FS=","}NR>8{...
I eventually tested your script, and, as I said, you should change where the FS parameter is set:
awk -v z1="$z1" -v c1="$c1" -v s1="$s1" -v z2="$z2" \
-v c2="$c2" -v s2="$s2" -v z3="$z3" -v c3="$c3" \
-v s3="$s3" -v z4="$z4" -v c4="$c4" -v s4="$s4" -F"," 'NR>8 {
nc1 = c1 * ( $1 - z1 - s1 );
nc2 = c2 * ( $2 - z2 - s2 );
nc3 = c3 * ( $3 - z3 - s3 );
nc4 = c4 * ( $5 - z4 - s4 );
print nc1,nc2,nc3,nc4 }' $myfile
11 7.4 6 90
12 9.4 7.5 100
13 11.4 9 110
14 13.4 10.5 120
15 15.4 12 130
16 17.4 13.5 140
17 19.4 15 150
18 21.4 16.5 160
19 23.4 18 170
20 25.4 19.5 180
0 -0.6 -3 -10
Why did you have a problem?
Because awk parses a line before executing the block, so if you tell it to change something related to parsing, the change only takes effect from the next line.
HTH