I have a table in MySQL that contains user interactions with a web page. I need to extract the rows for the users where the date of the interaction is earlier than a certain benchmark date, and that benchmark date is different for each user (I extract that date from a different database).
My approach was to set a JSON variable in which each key is a user and each value is that user's benchmark date, and to use it in the query to extract the intended fields.
Example in R:
# MainDF contains the user and the benchmark date from a different database
json_str <- mapply(function(uid, bench_date) {
  paste0('{', '"', uid, '"', ':', '"', bench_date, '"', '}')
}, MainDF[, 'uid'],
   MainDF[, 'date'])
json_str <- paste0("'", '[', paste0(json_str, collapse = ','), ']', "'")
temp_var <- paste('set @test=', json_str)
The intention was for temp_var to look like:
set @test= '{"0001":"2010-05-05",
"0012":"2015-05-05",
"0101":"2018-07-20"}'
but it actually looks like:
set @test= '{\"0001\":\"2010-05-05\",
\"0012\":\"2015-05-05\",
\"0101\":\"2018-07-20\"}'
Then I create the main query:
main_Q <- "select user_id, date
from interaction
where 1=1
and json_contains(json_keys(#test), concat('\"',user_id,'\"')) = 1
and date <= json_unquote(json_extract(#test,
concat('$.','\"',user_id, '\"')
)
)
"
For the execution, first set the temporary variable and then execute the main query:
dbSendQuery(connection, temp_var)
resp <- dbSendQuery(connection, main_Q)
target_df <- fetch(resp, n = -1)
dbClearResult(resp)
When I test a fraction of it in a SQL IDE, it does work. However, in R it doesn't return anything.
I think the issue is that R escapes the double quotes in temp_var, so SQL ends up reading
set @test= '{\"0001\":\"2010-05-05\",
\"0012\":\"2015-05-05\",
\"0101\":\"2018-07-20\"}'
which won't work.
For example, if I execute:
set @test= '{"0001":"2010-05-05",
"0012":"2015-05-05",
"0101":"2018-07-20"}'
select json_keys(@test)
it will return an array with the keys, but that is not the case with
set @test= '{\"0001\":\"2010-05-05\",
\"0012\":\"2015-05-05\",
\"0101\":\"2018-07-20\"}'
select json_keys(@test)
I am not sure how to solve the issue, but I need double quotes to specify the JSON. Is there any other approach that I should try or a way to make this work?
First, I think it is generally better to use a well-known library/package for converting to/from JSON, for several reasons. This gives you a string that you should be able to place just about anywhere:
json_str <- jsonlite::toJSON(setNames(as.list(MainDF$date), MainDF$uid), auto_unbox=TRUE)
json_str
# {"0001":"2010-05-05","0012":"2015-05-05","0101":"2018-07-20"}
And while printing the object on the R console shows the escaped double-quotes,
as.character(json_str)
# [1] "{\"0001\":\"2010-05-05\",\"0012\":\"2015-05-05\",\"0101\":\"2018-07-20\"}"
that is merely R's representation (shows all strings within double-quotes, and therefore needs to escape any double-quotes within the string).
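You can confirm that with cat(), which prints the string itself without R's escaping:
cat(as.character(json_str), '\n')
# {"0001":"2010-05-05","0012":"2015-05-05","0101":"2018-07-20"}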
Adding it into some script should be straightforward:
cat(paste('set @test=', sQuote(json_str)), '\n')
# set @test= '{"0001":"2010-05-05","0012":"2015-05-05","0101":"2018-07-20"}'
(One caveat: sQuote() honours options(useFancyQuotes), so set options(useFancyQuotes = FALSE) first if your session would otherwise emit directional quotes.)
I'm assuming that having each on its own row is not critical. If it is, and indentation is important, perhaps this is more your style:
spaces <- strrep(' ', 2 + nchar('set @test = '))
cat(paste0('set @test = ', sQuote(gsub(",", paste0(",\n", spaces), json_str))), '\n')
# set @test = '{"0001":"2010-05-05",
#               "0012":"2015-05-05",
#               "0101":"2018-07-20"}'
Data:
MainDF <- read.csv(stringsAsFactors=FALSE, colClasses='character', text='
uid,date
0001,2010-05-05
0012,2015-05-05
0101,2018-07-20')
Excuse me for not being more specific in the title, but I don't know how to explain this without an example.
I have a .html file that looks like this:
<TR><TD>log p-value:</TD><TD>-2.797e+02</TD></TR>
<TR><TD>Information Content per bp:</TD><TD>1.736</TD></TR>
<TR><TD>Number of Target Sequences with motif</TD><TD>894.0</TD></TR>
<TR><TD>Percentage of Target Sequences with motif</TD><TD>47.58%</TD></TR>
<TR><TD>Number of Background Sequences with motif</TD><TD>10864.6</TD></TR>
<TR><TD>Percentage of Background Sequences with motif</TD><TD>22.81%</TD></TR>
<TR><TD>Average Position of motif in Targets</TD><TD>402.4 +/- 261.2bp</TD></TR>
<TR><TD>Average Position of motif in Background</TD><TD>400.6 +/- 246.8bp</TD></TR>
<TR><TD>Strand Bias (log2 ratio + to - strand density)</TD><TD>-0.0</TD></TR>
<TR><TD>Multiplicity (# of sites on avg that occur together)</TD><TD>1.48</TD></TR>
I read it in:
html = readLines("file.html")
I am interested in whatever is between </TD><TD> and </TD></TR>. When I run the following, I get the result I want:
mypattern = '<TR><TD>log p-value:</TD><TD>([^<]*)</TD></TR>'
gsub(mypattern,'\\1',grep(mypattern,html,value=TRUE))
[1] "-2.797e+02"
It works well for almost all lines I want to match, but when I do the same thing for the last two lines, it does not extract anything.
mypattern = '<TR><TD>Strand Bias (log2 ratio + to - strand density)</TD><TD>([^<]*)</TD></TR>'
gsub(mypattern,'\\1',grep(mypattern,html,value=TRUE))
character(0)
mypattern = '<TR><TD>Multiplicity (# of sites on avg that occur together)</TD><TD>([^<]*)</TD></TR>'
gsub(mypattern,'\\1',grep(mypattern,html,value=TRUE))
character(0)
Why is this happening?
Thank you for your help.
If your data structure is really like this, you have an HTML/XML file with keys and values, so it is probably easier to parse it as such:
library(xml2)
xd <- read_xml("file.html", as_html = TRUE)
key_values <- xml_text(xml_find_all(xd, "//td"))
is_key <- as.logical(seq_along(key_values) %% 2)
setNames(key_values[!is_key], key_values[is_key])
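For the sample rows above this yields a named character vector, along the lines of (first two entries shown):
#                 log p-value: Information Content per bp:
#                 "-2.797e+02"                      "1.736"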
First, I'll say that I would actually solve this problem like this:
gsub(".+>([^<]+)</TD></TR>", "\\1", html)
#> [1] "-2.797e+02" "1.736" "894.0"
#> [4] "47.58%" "10864.6" "22.81%"
#> [7] "402.4 +/- 261.2bp" "400.6 +/- 246.8bp" "-0.0"
#> [10] "1.48"
But, to answer the question of why your way didn't work, we need to check out the help file for R regular expressions (help("regex")):
Any metacharacter with special meaning may be quoted by preceding it with a backslash. The metacharacters in extended regular expressions are . \ | ( ) [ { ^ $ * + ? ...
The patterns that you had trouble with included parentheses, which you needed to escape (note the double backslash, since backslashes themselves need to be escaped):
mypattern = '<TR><TD>Multiplicity \\(# of sites on avg that occur together\\)</TD><TD>([^<]*)</TD></TR>'
gsub(mypattern,'\\1',grep(mypattern,html,value=TRUE))
# [1] "1.48"
Hello, I have the following data that I want to paste into an SQL query through an R connection.
UKWinnersID<-c("1W167X6", "QM6VY8", "ZDNZX0", "8J49D8", "RGNSW9",
"BH7D3P1", "W31S84", "NTHDJ4", "H3UA1", "AH9N7",
"DF52B68", "K65C2", "VGT2Q0", "93LR6", "SJAJ0",
"WQBH47", "CP8PW9", "5H2TD5", "TFLKV4", "X42J1" )
The query / code in R is as follows:
UKSQL6<-data.frame(sqlQuery(myConn, paste("SELECT TOP 10000 [AxiomaDate]
,[RiskModelID] ,[AxiomaID],[Factor1],[Factor2],[Factor3],[Factor4],[Factor5]
,[Factor6],[Factor7],[Factor8],[Factor9],[Factor10],[Factor11],[Factor12]
,[Factor13],[Factor14],[Factor15] FROM [PortfolioAnalytics].[Data_Axioma].[SecurityExposures]
Where AxiomaDate IN (
SELECT MAX(AxiomaDate)
FROM [PortfolioAnalytics].[Data_Axioma].[FactorReturns]
GROUP BY MONTH(AxiomaDate), YEAR(AxiomaDate))
AND RiskModelID = 8
AND AxiomaID IN(",paste(UKWinnersID, collapse = ","),")")))
I am pasting UKWinnersID into the last line of the code above, but the format of UKWinnersID needs to be ('1W167X6', 'QM6VY8', 'ZDNZX0', ... etc.) with single quotes, which I just can't get to work.
Consider running a parameterized query using the RODBCext package (an extension of RODBC), assuming this is the API being used. Parameterized queries do more than insulate from SQL injection: they abstract data from code and avoid the messy quote enclosure and string interpolation and concatenation, for cleaner, maintainable code.
Below replaces your TOP 10000 with TOP 500 for each of the 20 ids:
library(RODBC)
library(RODBCext)
conn <- odbcConnect("DBName", uid="user", pwd="password")
ids_df <- data.frame(UKWinnersID = c("1W167X6", "QM6VY8", "ZDNZX0", "8J49D8", "RGNSW9",
"BH7D3P1", "W31S84", "NTHDJ4", "H3UA1", "AH9N7",
"DF52B68", "K65C2", "VGT2Q0", "93LR6", "SJAJ0",
"WQBH47", "CP8PW9", "5H2TD5", "TFLKV4", "X42J1"))
# SQL STATEMENT (NO DATA)
query <- "SELECT TOP 500 [AxiomaDate], [RiskModelID], [AxiomaID], [Factor1],[Factor2]
, [Factor3], [Factor4], [Factor5], [Factor6], [Factor7], [Factor8]
, [Factor9], [Factor10], [Factor11], [Factor12]
, [Factor13], [Factor14], [Factor15]
FROM [PortfolioAnalytics].[Data_Axioma].[SecurityExposures]
WHERE AxiomaDate IN (
SELECT MAX(AxiomaDate)
FROM [PortfolioAnalytics].[Data_Axioma].[FactorReturns]
GROUP BY MONTH(AxiomaDate), YEAR(AxiomaDate)
)
AND RiskModelID = 8
AND AxiomaID = ?"
# PASS DATAFRAME VALUES TO BIND TO QUERY PARAMETERS
UKSQL6 <- sqlExecute(conn, query, ids_df, fetch=TRUE)
odbcClose(conn)
Alternatively, if you really need to use the IN() clause:
# SQL STATEMENT (NO DATA)
query <- paste("SELECT TOP 10000
...same as above...
AND AxiomaID IN (", paste(rep("?", nrow(ids_df)), collapse=", "), ")")
# TRANSPOSE DATA FRAME FOR COLUMN EQUAL TO ? PLACEHOLDERS
UKSQL6 <- sqlExecute(conn, query, t(ids_df), fetch=TRUE)
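For completeness, the single-quoted IN() list the question asked about can also be built directly, though the parameterized form above avoids the quoting problem altogether:
paste0("('", paste(UKWinnersID, collapse = "', '"), "')")
# "('1W167X6', 'QM6VY8', ..., 'X42J1')"   (abbreviated)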
I am using the RMySQL package to write (append) data to an existing table.
I am using R, version 3.3.2.
My code looks like this:
library(RMySQL)
df_final <- some_data
m<-dbDriver("MySQL")
mydb <- dbConnect(m, user='odvjet12_mislav',
password='my_pass',
host='91.234.46.219',
dbname='odvjet12_fina_pn')
dbWriteTable(mydb, value = df_final, name = "fina_pn", append = TRUE, row.names = FALSE)
This code worked fine for some time, but in the last ten days it has always returned an error:
Error in .local(conn, statement, ...) :
could not run statement: The used command is not allowed with this MySQL version
I don't understand how it is possible for the code to work for some time and then start returning an error.
I kindly ask for feedback on this issue.
Best,
Mislav Šagovac
You could also use dbGetQuery from the RMySQL package and iterate over the rows, which was my solution when I reached a similar error for a dataframe I wanted to write to a MySQL DB:
library(RMySQL)
mydb <- dbConnect(MySQL(), user='user', password='password', dbname='databasename', host='hostname')
# insert one row per iteration; values are pasted in unquoted, which works
# for numeric columns (character columns would need quoting/escaping)
for (i in 1:nrow(df)) {
  dbGetQuery(mydb, paste0("INSERT INTO MYTABLE (COL1,COL2) VALUES(", df$col1[i], ",", df$col2[i], ")"))
}
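As an aside, an assumption about the likely cause rather than a confirmed diagnosis: RMySQL's dbWriteTable() relies on MySQL's LOAD DATA LOCAL INFILE, and this exact error is what the server reports when it stops allowing that command (for example after a server upgrade or configuration change). If the server still permits local infile, enabling the client flag may restore the old behaviour:
# assumption: the server-side local_infile setting must also allow this
mydb <- dbConnect(MySQL(), user='odvjet12_mislav', password='my_pass',
                  host='91.234.46.219', dbname='odvjet12_fina_pn',
                  client.flag = CLIENT_LOCAL_FILES)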
I'm trying to use PyAlgoTrade's event profiler.
However, I don't want to use data from Yahoo! Finance; I want to use my own, but I can't figure out how to parse in the CSV, which is in the format:
Timestamp Low Open Close High BTC_vol USD_vol [8] [9]
2013-11-23 00 800 860 847.666666 886.876543 853.833333 6195.334452 5248330 0
2013-11-24 00 745 847.5 815.01 860 831.255 10785.94131 8680720 0
The complete CSV is here
I want to do something like:
def main(plot):
instruments = ["AA", "AES", "AIG"]
feed = yahoofinance.build_feed(instruments, 2008, 2009, ".")
Then replace yahoofinance.build_feed(instruments, 2008, 2009, ".") with my CSV
I tried:
import csv
with open('FinexBTCDaily.csv', 'rb') as csvfile:
    data = csv.reader(csvfile)

def main(plot):
    feed = data
But it throws an attribute error. Any ideas how to do this?
I suggest creating your own RowParser and Feed, which is much easier than it sounds; have a look here: yahoofeed
This also allows you to work with intraday data and to clean up the data if needed, like your timestamp.
Another possibility, of course, would be to parse your file and save it, so it looks like a yahoo feed. In your case, you would have to adapt the columns and the Timestamp.
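A minimal sketch of the first option (FinexRowParser is a hypothetical name; it assumes the question's file is comma-separated with the header row shown above):
import datetime
from pyalgotrade import bar
from pyalgotrade.barfeed import csvfeed

class FinexRowParser(csvfeed.RowParser):
    def getFieldNames(self):
        return None  # None: take the field names from the CSV's first row

    def getDelimiter(self):
        return ","

    def parseBar(self, csvRowDict):
        # "2013-11-23 00" -> datetime; build a bar from the named columns
        dateTime = datetime.datetime.strptime(csvRowDict["Timestamp"], "%Y-%m-%d %H")
        return bar.BasicBar(dateTime,
                            float(csvRowDict["Open"]),
                            float(csvRowDict["High"]),
                            float(csvRowDict["Low"]),
                            float(csvRowDict["Close"]),
                            float(csvRowDict["BTC_vol"]),
                            None,  # no Adj Close column in this file
                            bar.Frequency.DAY)

feed = csvfeed.BarFeed(bar.Frequency.DAY)
feed.addBarsFromCSV("BTC", "FinexBTCDaily.csv", FinexRowParser())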
Step A: follow PyAlgoTrade doc on GenericBarFeed class
On this link see the addBarsFromCSV() in CSV section of the BarFeed class in v0.16
On this link see the addBarsFromCSV() in CSV section of the BarFeed class in v0.17
Note
- The CSV file must have the column names in the first row.
- It is ok if the Adj Close column is empty.
- When working with multiple instruments:
--- If all the instruments loaded are in the same timezone, then the timezone parameter may not be specified.
--- If any of the instruments loaded are in different timezones, then the timezone parameter should be set.
addBarsFromCSV( instrument, path, timezone = None )
Loads bars for a given instrument from a CSV formatted file. The instrument gets registered in the bar feed.
Parameters:
(string) instrument – Instrument identifier.
(string) path – The path to the CSV file.
(pytz) timezone – The timezone to use to localize bars. Check pyalgotrade.marketsession.
Next:
A BarFeed loads bars from CSV files that have the following format:
Date Time, Open, High, Low, Close, Volume, Adj Close
2013-01-01 13:59:00,13.51001,13.56,13.51,13.56789,273.88014126,13.51001
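For example, a sketch assuming the question's data has already been reshaped into that layout and saved as BTC_daily.csv (a hypothetical file name):
from pyalgotrade import bar
from pyalgotrade.barfeed import csvfeed

# GenericBarFeed expects the documented "Date Time, Open, High, Low, Close, Volume, Adj Close" layout
feed = csvfeed.GenericBarFeed(bar.Frequency.DAY)
feed.addBarsFromCSV("BTC", "BTC_daily.csv")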
Step B: implement a documented CSV-file pre-formatting
Your CSV data will need a bit of sanitising before it can be used in PyAlgoTrade methods; however, it is doable, and you can create an easy transformer either by hand or with the powerful numpy.genfromtxt() lambda-based converters facilities.
This sample code is intended for illustration purposes, to show immediately the power of converters for your own transformations, as CSV structures differ.
import datetime
import numpy
import matplotlib.dates as mPlotDATEs   # assumed: date2num() comes from matplotlib.dates

with open(getCsvFileNAME( ... ), "r") as aFH:
    numpy.genfromtxt(aFH,
                     skip_header=1,      # Ref. pyalgotrade
                     delimiter=",",
                     # 2011.08.30,12:00,1791.20,1792.60,1787.60,1789.60,835
                     # 2011.08.30,13:00,1789.70,1794.30,1788.70,1792.60,550
                     # 2011.08.30,14:00,1792.70,1816.70,1790.20,1812.10,1222
                     # 2011.08.30,15:00,1812.20,1831.50,1811.90,1824.70,2373
                     # 2011.08.30,16:00,1824.80,1828.10,1813.70,1817.90,2215
                     converters={
                         # "YYYY.MM.DD" -> matplotlib float date
                         0: lambda aString: mPlotDATEs.date2num(datetime.datetime.strptime(aString, "%Y.%m.%d")),
                         # "HH:MM" -> fraction of a day, e.g. "15:00" -> (15*60 + 0) / 60. / 24.
                         1: lambda aString: (int(aString[0:2]) * 60 + int(aString[3:])) / 60. / 24.,
                     })
You can use pyalgotrade.barfeed's addBarsFromSequence() to feed in data from a CSV row by row / bar by bar. Basically, you create a bar from each row, pass OHLCV as init parameters, and pass extra columns with additional data in a dictionary. You can try something like this (with the required imports):
import numpy as np
import pandas as pd
from pyalgotrade.bar import BasicBar, Frequency
from pyalgotrade.barfeed import yahoofeed

data = pd.DataFrame(index=pd.date_range(start='2021-11-01', end='2021-11-05'),
                    columns=['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume',
                             'ExtraCol1', 'ExtraCol3', 'ExtraCol4', 'ExtraCol5'],
                    data=np.random.rand(5, 10))
# BasicBar validates OHLC consistency, so make High/Low the row-wise extremes
data['High'] = data[['Open', 'High', 'Low', 'Close']].max(axis=1)
data['Low'] = data[['Open', 'High', 'Low', 'Close']].min(axis=1)

feed = yahoofeed.Feed()
feed.addBarsFromSequence('instrumentID', data.index.map(lambda i:
    BasicBar(
        i,
        data.loc[i, 'Open'],
        data.loc[i, 'High'],
        data.loc[i, 'Low'],
        data.loc[i, 'Close'],
        data.loc[i, 'Volume'],
        data.loc[i, 'Adj Close'],
        Frequency.DAY,
        data.loc[i, 'ExtraCol1':].to_dict())
    ).values)
The input data frame was created with random values to make this example easier to reproduce (with High/Low adjusted so each bar is internally consistent), but the part where the bars are added to the feed should work the same for data frames read from CSVs, given that valid column names are used.