doParallel() and MySQL in R: database not receiving data

I am using RMySQL to send data from R to a MySQL database. The problem is that the database does not receive any data. I am using doParallel() since I am running over 4500 iterations. Could it be because I try to send the data to the database inside the pullSpread() function?
library(RMySQL)
library(doParallel)
library(stringr)
library(foreach)
detectCores() # ANSWER = 4
cl <- makeCluster(4, type = "SOCK") # also used PSOCK & FORK but got the same problem
registerDoParallel(cl)
# Now use foreach() and %dopar% to pull data...
# the apply(t(stock1), 2, pullSpread) works but is not "parallelized"
# I have also used clusterApply() but was unsuccessful
system.time(
  foreach(a = t(stock1)) %dopar% pullSpread(a)
)
When I look in my working directory, all the files are copied successfully into .csv files as they should be, but when I check MySQL Workbench, or even query the tables from R, they do not exist...
Here is the stock1 character vector and the pullSpread() function used...
# This list contains more than 4500 iterations.. so I am only posting a few
stock1 <- c(
  "SGMS.O","SGNL.O","SGNT.O",
  "SGOC.O","SGRP.O", ...)
Important Dates needed for function:
Friday <- Sys.Date() - 10
# Get previous 5 days
Thursday <- Friday - 1
Wednesday <- Thursday - 1
Tuesday <- Wednesday - 1
Monday <- Tuesday - 1
# Make them readable for NetFonds
Friday <- format(Friday, "%Y%m%d")
Thursday <- format(Thursday, "%Y%m%d")
Wednesday <- format(Wednesday, "%Y%m%d")
Tuesday <- format(Tuesday, "%Y%m%d")
Monday <- format(Monday, "%Y%m%d")
Here is the pullSpread() function:
pullSpread <- function(stock1) {
  AAPL_FRI <- read.delim(header = TRUE, stringsAsFactors = FALSE,
                         paste(sep = "",
                               "http://www.netfonds.no/quotes/posdump.php?date=",
                               Friday, "&paper=", stock1, "&csv_format=txt"))
  tryit <- try(AAPL_FRI[, c(1:7)])
  if (inherits(tryit, "try-error")) {
    rm(AAPL_FRI)
  } else {
    AAPL_THURS <- read.delim(header = TRUE, stringsAsFactors = FALSE,
                             paste(sep = "",
                                   "http://www.netfonds.no/quotes/posdump.php?date=",
                                   Thursday, "&paper=", stock1, "&csv_format=txt"))
    AAPL_WED <- read.delim(header = TRUE, stringsAsFactors = FALSE,
                           paste(sep = "",
                                 "http://www.netfonds.no/quotes/posdump.php?date=",
                                 Wednesday, "&paper=", stock1, "&csv_format=txt"))
    AAPL_TUES <- read.delim(header = TRUE, stringsAsFactors = FALSE,
                            paste(sep = "",
                                  "http://www.netfonds.no/quotes/posdump.php?date=",
                                  Tuesday, "&paper=", stock1, "&csv_format=txt"))
    AAPL_MON <- read.delim(header = TRUE, stringsAsFactors = FALSE,
                           paste(sep = "",
                                 "http://www.netfonds.no/quotes/posdump.php?date=",
                                 Monday, "&paper=", stock1, "&csv_format=txt"))
    SERIES <- rbind(AAPL_MON, AAPL_TUES, AAPL_WED, AAPL_THURS, AAPL_FRI)
    # Write .csv file
    write.csv(SERIES, paste(sep = "", stock1, "_", Friday, ".csv"), row.names = FALSE)
    dbWriteTable(con2, str_sub(stock1, start = 1L, end = -3L),
                 paste0("~/Desktop/R/", stock1, "_", Friday, ".csv"), append = TRUE)
  }
}

Retrieve last Friday using something like this:
Friday <- Sys.Date()
while(weekdays(Friday) != "Friday")
{
Friday <- Friday - 1
}
As a matter of good practice, when retrieving data from the internet, separate the act of downloading it from processing it. That way, when the processing fails, you don't waste time and bandwidth redownloading things.
lastWeek <- format(Friday - 0:4, "%Y%m%d")
stockDatePairs <- expand.grid(Stock = stock1, Date = lastWeek)
urls <- with(
  stockDatePairs,
  paste0(
    "http://www.netfonds.no/quotes/posdump.php?date=",
    Date,
    "&paper=",
    Stock,
    "&csv_format=txt"
  )
)
for(url in urls)
{
  # or whatever file name you want
  download.file(url, paste0("data from ", make.names(url), ".txt"))
}
Make sure that you know which directory those files are being saved to. (Either provide an absolute path or set your working directory.)
Now try reading and rbinding those files.
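For example, a minimal sketch (assuming the files downloaded above sit in the working directory, follow the "data from ..." naming used earlier, and share the same tab-delimited NetFonds layout):
files <- list.files(pattern = "^data from .*\\.txt$")
series_list <- lapply(files, read.delim, header = TRUE, stringsAsFactors = FALSE)
SERIES <- do.call(rbind, series_list) # you may prefer to rbind per stock, as in pullSpread()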
If that works, then you can try doing things in parallel.
Also note that many online data services will limit the rate that you can download, unless you are paying for the service. So parallel downloads may just mean that you hit the limit quicker.
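If you do hit such a limit, a simple throttle on the sequential loop is usually enough. A sketch (the one-second pause is an assumption; tune it to the provider's documented limit):
for(url in urls)
{
  download.file(url, paste0("data from ", make.names(url), ".txt"))
  Sys.sleep(1) # pause between requests to stay under the rate limit
}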

Related

Troubleshooting a Loop in R

I have this loop in R that is scraping Reddit comments from an API incrementally on an hourly basis (e.g. all comments containing a certain keyword between now and 1 hour ago, 1 hour ago and 2 hours ago, 2 hours ago and 3 hours ago, etc.):
library(jsonlite)
part1 = "https://api.pushshift.io/reddit/search/comment/?q=trump&after="
part2 = "h&before="
part3 = "h&size=500"
results = list()
for (i in 1:50000) {
  tryCatch({
    url_i <- paste0(part1, i + 1, part2, i, part3)
    r_i <- data.frame(fromJSON(url_i))
    results[[i]] <- r_i
    myvec_i <- sapply(results, NROW)
    print(c(i, sum(myvec_i)))
  }, error = function(e){})
}
final = do.call(rbind.data.frame, results)
saveRDS(final, "final.RDS")
In this loop, I added a line of code that prints which iteration the loop is currently on, and the cumulative number of results that the loop has scraped as of that iteration. I also added a tryCatch() that, in the worst-case scenario, forces this loop to skip an iteration which produces an error - however, I was not anticipating that happening very often.
However, I am noticing that this loop is producing errors and skipping iterations far more often than I expected.
My guess is that between certain times, the API might simply not have recorded any comments, thus adding no new results to the list. For example (left column is the iteration number, right column the cumulative results):
  iteration_number cumulative_results
1            13432               5673
2            13433               5673
3            13434               5673
But in my case - can someone please help me understand why this loop is producing so many errors, resulting in so many skipped iterations?
Thank you!
Your issue is almost certainly caused by rate limits imposed on your requests by the Pushshift API. When doing scraping tasks like this, the server may track how many requests a client has made within a certain time interval and choose to return an error code (HTTP 429) instead of the requested data. This is called rate limiting and is a way for web sites to limit abuse, charge customers for usage, or both.
Per this discussion about Pushshift on Reddit, it does look like Pushshift imposes a rate limit of 120 requests per minute (Also: See their /meta endpoint).
I was able to confirm that a script like yours will run into rate limiting by changing this line and re-running your script:
}, error = function(e){})
to:
}, error = function(e){ message(e) })
After a while, I got output like:
HTTP error 429
The solution is to slow yourself down in order to stay within this limit. A straightforward way to do this is to add a call to Sys.sleep(1) inside your for loop, where 1 is the number of seconds to pause execution.
I modified your script as follows:
library(jsonlite)
part1 = "https://api.pushshift.io/reddit/search/comment/?q=trump&after="
part2 = "h&before="
part3 = "h&size=500"
results = list()
for (i in 1:50000) {
  tryCatch({
    Sys.sleep(1) # Changed. Adjust the value as needed.
    url_i <- paste0(part1, i + 1, part2, i, part3)
    r_i <- data.frame(fromJSON(url_i))
    results[[i]] <- r_i
    myvec_i <- sapply(results, NROW)
    print(c(i, sum(myvec_i)))
  }, error = function(e){
    message(e) # Changed. Prints to the console on error.
  })
}
final = do.call(rbind.data.frame, results)
saveRDS(final, "final.RDS")
Note that you may have to try a number larger than 1, and your script will take noticeably longer to run: with Sys.sleep(1) alone, 50,000 iterations spend roughly 14 hours just sleeping.
As the error seems to come randomly, depending on API availability, you could retry, and set a maxattempt number:
library(jsonlite)
part1 = "https://api.pushshift.io/reddit/search/comment/?q=trump&after="
part2 = "h&before="
part3 = "h&size=500"
results = list()
maxattempt <- 3
for (i in 1:1000) {
  attempt <- 1
  r_i <- NULL
  while (is.null(r_i) && attempt <= maxattempt) {
    if (attempt > 1) { print(paste("retry: i =", i)) }
    attempt <- attempt + 1
    url_i <- paste0(part1, i + 1, part2, i, part3)
    try({
      r_i <- data.frame(fromJSON(url_i))
      results[[i]] <- r_i
      myvec_i <- sapply(results, NROW)
      print(c(i, sum(myvec_i)))
    })
  }
}
final = do.call(rbind.data.frame, results)
saveRDS(final, "final.RDS")
Result:
[1] 1 250
[1] 2 498
[1] 3 748
[1] 4 997
[1] 5 1247
[1] 6 1497
[1] 7 1747
[1] 8 1997
[1] 9 2247
[1] 10 2497
...
[1] 416 101527
[1] 417 101776
Error in open.connection(con, "rb") :
Timeout was reached: [api.pushshift.io] Operation timed out after 21678399 milliseconds with 0 out of 0 bytes received
[1] "retry: i = 418"
[1] 418 102026
[1] 419 102276
...
In the example above, a retry occurred for i = 418.
Note that for i in 1:500, the results already take 250 MB, meaning you could expect around 25 GB for i in 1:50000: do you have enough RAM?
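If not, one workaround is to flush the accumulated results to disk periodically and recombine the chunk files afterwards. A sketch (the chunk size and file names are assumptions, and the fetch itself is elided):
results <- list()
chunk_size <- 1000 # tune to your available RAM
for (i in 1:50000) {
  # ... fetch r_i and store it in results, as in the loops above ...
  if (i %% chunk_size == 0 && length(results) > 0) {
    saveRDS(do.call(rbind.data.frame, results), sprintf("chunk_%05d.RDS", i))
    results <- list() # drop the in-memory copy before the next chunk
  }
}
# afterwards:
# final <- do.call(rbind.data.frame,
#                  lapply(list.files(pattern = "^chunk_.*\\.RDS$"), readRDS))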

Running timeseries graphing function in Rmd producing cluttered x-axis labels (not present in test code)

I have a folder of xx .csv timeseries that I want to graph and knit into a clean HTML document. I have ggplot code that produces the plot I want from a single timeseries .csv. However, when I put the bones of that ggplot code in a function inside a for loop to run each of the timeseries .csv files through it, I get plots with quite different formatting.
Plot generated with my test ggplot code:
Plot generated with function and for loop:
Changes I'm trying to make to the ugly Rmd plot:
Nicely space the x-axis tick marks to whole mins (i.e. "11:14:00", "11:15:00")
Connect the data points (solved by replacing geom_line() with geom_path())
Example Rmd code below. Please note that the graphs it produces still have nice formatting; I'm not sure how to reproduce this problem short of posting a 500-row dataframe. I also don't know how to post my Rmd code without SO applying the formatting commands in this post, so I put 3 quote characters around my header formatting and at the end of the code to disable it.
Edits and Updates
I am getting a persistent error: geom_path: Each group consists of only one observation. Do you need to adjust the group aesthetic?
As suggested by the commenters, I tried removing plot() and calling createChlDiffPlot() directly, and replacing plot() with print(). Both produce the same ugly plots as before.
Replaced geom_line() with geom_path(). The points are now connected! x-axis cluttering is still there.
The Time variable is reading in as hms num.
Many thanks for any help on this!
```
---
title: "Chl Filtration"
output:
  flexdashboard::flex_dashboard:
    theme: yeti
    orientation: rows
editor_options:
  chunk_output_type: console
---
```{r setup}
library(flexdashboard)
library(dplyr)
library(ggplot2)
library(hms)
library(ggthemes)
library(readr)
library(data.table)
library(magrittr) # needed for the %<>% pipe used below
#### Example Data
df1 <- data.frame(Time = as_hms(c("11:22:33","11:22:34","11:22:35","11:22:38","11:23:00","11:23:01","11:23:02")),
                  Chl_ug_L_Up = c(0.2,0.1,0.25,-0.2,-0.3,-0.15,0.1),
                  Chl_ug_L_Down = c(0.5,0.4,0.3,0.2,0.1,0,-0.1))
df2 <- data.frame(Time = as_hms(c("08:02:33","08:02:34","08:02:35","08:02:40","08:02:42","08:02:43","08:02:49")),
                  Chl_ug_L_Up = c(-0.2,-0.1,-0.25,0.2,0.3,0.15,-0.1),
                  Chl_ug_L_Down = c(-0.1,0,0.1,0.2,0.3,0.4,0.1))
data_directory = "./" # data folder in R project folder in the real deal
output_directory = "./" # output graph directory in R project folder
write_csv(df1, file.path(data_directory, "SO_example_df1.csv"))
write_csv(df2, file.path(data_directory, "SO_example_df2.csv"))
#### Function to create graphs
createChlDiffPlot = function(aTimeSeriesFile, aFileName, aGraphOutputDirectory, aType)
{
  aFile_Mod = aTimeSeriesFile %<>%
    select(Time, Chl_ug_L_Up, Chl_ug_L_Down) %>%
    mutate(Chl_diff = Chl_ug_L_Up - Chl_ug_L_Down)
  one_plot = ggplot(data = aFile_Mod, aes(x = Time, y = Chl_diff)) + # tried adding 'group = 1' in aes to connect points
    geom_path(size = 1, color = "green") +
    geom_point(color = "green") +
    theme_gdocs() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          legend.title = element_blank()) +
    labs(x = "", y = "Chl Difference", title = paste0(aFileName, " - ", "Filtration"))
  one_graph_name = paste0(gsub(".csv", "", aFileName), "_", aType, ".pdf")
  ggsave(one_graph_name, one_plot, dpi = 600, width = 7, height = 5, units = "in", device = "pdf", path = aGraphOutputDirectory)
  return(one_plot)
}
"``` ### remove the quotes when running example
Plots - After Velocity Adjustment
=====================================" ### remove quotes when running example
```{r, fig.width=13.5, fig.height=5}
all_files_Filtration = list.files(data_directory, pattern = ".csv")
# Loop to plot function
for(file in 1:length(all_files_Filtration))
{
  file_name = all_files_Filtration[file]
  one_file = fread(file.path(data_directory, file_name))
  # plot the time series again
  plot(createChlDiffPlot(one_file, file_name, output_directory, "Velocity_Paired"))
}
"``` #remove quotes when running example
```
I finally figured it out.
1) Replacing geom_line() with geom_path() connected the data points when rendered in Rmd.
2) df1$Time was formatted as a difftime object. When I looked at the dataframe in the global environment, it showed Time : hms num 11:11:09 .... This made me think my format was ok, but when I ran class(df1$Time) I got [1] "hms" "difftime". With a quick google I found out difftime objects are not quite the same as hms, and my original Time was generated by subtracting times. I added a conversion into my mutate() call:
select(Time, Chl_ug_L_Up, Chl_ug_L_Down) %>%
  mutate(Chl_diff = Chl_ug_L_Up - Chl_ug_L_Down,
         Time = as_hms(Time)) # convert difftime object to hms
ggplot2, I think, has some auto-formatting for hms variables, which is why the difftime variable was producing ugly, crowded x-axes.
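A minimal illustration of the distinction (made-up times, not from the original data):
library(hms)
t1 <- as.POSIXct("2021-01-01 11:23:02", tz = "UTC")
t2 <- as.POSIXct("2021-01-01 11:22:33", tz = "UTC")
d <- t1 - t2     # subtracting times yields a bare difftime
class(d)         # "difftime"
class(as_hms(d)) # "hms" "difftime" - hms is a subclass with nicer formatting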

R: fetch data from MySQL db in a loop

I successfully fetch data from my MySQL db using R:
library(RMySQL)
mydb = dbConnect(MySQL(), user='user', password='pass', dbname='fib', host='myhost')
rs = dbSendQuery(mydb, 'SELECT distinct(DATE(date)) as date, open,close FROM stocksng WHERE symbol = "FIB7F";')
data <- fetch(rs, n=-1)
dbHasCompleted(rs)
So now I have an object, a list:
> print (typeof(data))
[1] "list"
Each element is a tuple(?) like date (chars), open (long), close (long).
OK, well, now my problem: I want to get a vector of the percentage difference between close (x) and the next day's open (x+1) until the end, BUT I can't access the items properly!
Example: (open/close*100)-100
I try:
for (item in data){
  print(item[2])
}
and all possible combinations like:
for (item in data){
  print(item[][2])
}
but cannot access the right element! Could anyone help?
You have a bigger problem than this in your MySQL query, because you did not specify an ORDER BY clause. Consider using the following query:
SELECT DISTINCT
DATE(date) AS date,
open,
close
FROM stocksng
WHERE
symbol = "FIB7F"
ORDER BY
date
Here we order the result set by date, so that it makes sense to speak of the current and next open or close. Now, with a proper query in place, if you wanted to get the percentage difference between the current close and the next day's open you could try:
require(dplyr)
(lead(open, 1) / close*100) - 100
Or using base R:
(open[2:(length(open)+1)] / close*100) - 100
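Applied to the data frame fetched above, a sketch of the dplyr version in context (column names assumed to match the query):
library(dplyr)
data %>%
  arrange(date) %>%
  mutate(pct_diff = (lead(open) / close * 100) - 100)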
Naive version:
for (row in 1:nrow(data)){
  date <- unname(data[row, "date"])
  open <- unname(data[row + 1, "open"])
  close <- unname(data[row, "close"])
  var <- abs((close/open*100) - 100)
  print(var)
}

Reading a huge JSON file in R: issues

I am trying to read a very huge JSON file using R, and I am using the rjson library with this command: json_data <- fromJSON(paste(readLines("myfile.json"), collapse=""))
The problem is that I am getting this error message
Error in paste(readLines("myfile.json"), collapse = "") :
could not allocate memory (2383 Mb) in C function 'R_AllocStringBuffer'
Can anyone help me with this issue?
Well, I am just sharing my experience of reading JSON files.
I tried to read 52.8 MB, 19.7 MB, 1.3 GB, 93.9 MB, and 158.5 MB JSON files; it cost me 30 minutes and finally auto-restarted my R session. After that I tried to apply parallel computing and would have liked to see the progress, but it failed.
https://github.com/hadley/plyr/issues/265
Then I tried adding the parameter pagesize = 10000; it worked and was more efficient than ever. We only need to read the file once, and can later save it in RData/Rda/Rds format with saveRDS.
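For example, a sketch of that read-once-then-cache pattern (file names are placeholders; stream_in() assumes line-delimited JSON, as in the session below):
library(jsonlite)
dat <- stream_in(file("big_file.json"), pagesize = 10000) # slow: do this once
saveRDS(dat, "big_file.rds")                              # cache as .rds
dat <- readRDS("big_file.rds")                            # fast reload in later sessions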
> suppressPackageStartupMessages(library('BBmisc'))
> suppressAll(library('jsonlite'))
> suppressAll(library('plyr'))
> suppressAll(library('dplyr'))
> suppressAll(library('stringr'))
> suppressAll(library('doParallel'))
>
> registerDoParallel(cores=16)
>
> ## https://www.kaggle.com/c/yelp-recsys-2013/forums/t/4465/reading-json-files-with-r-how-to
> ## https://class.coursera.org/dsscapstone-005/forum/thread?thread_id=12
> fnames <- c('business','checkin','review','tip','user')
> jfile <- paste0(getwd(),'/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_',fnames,'.json')
> dat <- llply(as.list(jfile), function(x) stream_in(file(x),pagesize = 10000),.parallel=TRUE)
> dat
list()
> jfile
[1] "/home/ryoeng/Coursera-Data-Science-Capstone/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_business.json"
[2] "/home/ryoeng/Coursera-Data-Science-Capstone/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_checkin.json"
[3] "/home/ryoeng/Coursera-Data-Science-Capstone/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_review.json"
[4] "/home/ryoeng/Coursera-Data-Science-Capstone/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_tip.json"
[5] "/home/ryoeng/Coursera-Data-Science-Capstone/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_user.json"
> dat <- llply(as.list(jfile), function(x) stream_in(file(x),pagesize = 10000),.progress='=')
opening file input connection.
Imported 61184 records. Simplifying into dataframe...
closing file input connection.
opening file input connection.
Imported 45166 records. Simplifying into dataframe...
closing file input connection.
opening file input connection.
Found 470000 records...
I got the same problem while working with huge datasets in R. I used the jsonlite package to read JSON in R, with the following code:
library(jsonlite)
get_tweets <- stream_in(file("tweets.json"),pagesize = 10000)
Here tweets.json is my file name (including the location where it exists), and pagesize represents how many lines it reads in one iteration. Hope it helps.
For some reason the above solutions all caused R to terminate or worse.
This solution worked for me, with the same data set:
library(jsonlite)
file_name <- 'C:/Users/Downloads/yelp_dataset/yelp_dataset~/dataset/business.JSON'
business<-jsonlite::stream_in(textConnection(readLines(file_name, n=100000)),verbose=F)
Took about 15 minutes

How do I feed in my own data into PyAlgoTrade?

I'm trying to use PyAlgoTrade's event profiler.
However, I don't want to use data from Yahoo! Finance; I want to use my own, but I can't figure out how to feed in the CSV. It is in the format:
Timestamp Low Open Close High BTC_vol USD_vol [8] [9]
2013-11-23 00 800 860 847.666666 886.876543 853.833333 6195.334452 5248330 0
2013-11-24 00 745 847.5 815.01 860 831.255 10785.94131 8680720 0
The complete CSV is here
I want to do something like:
def main(plot):
    instruments = ["AA", "AES", "AIG"]
    feed = yahoofinance.build_feed(instruments, 2008, 2009, ".")
Then replace yahoofinance.build_feed(instruments, 2008, 2009, ".") with my CSV
I tried:
import csv
with open('FinexBTCDaily.csv', 'rb') as csvfile:
    data = csv.reader(csvfile)

def main(plot):
    feed = data
But it throws an attribute error. Any ideas how to do this?
I suggest creating your own RowParser and Feed, which is much easier than it sounds; have a look here: yahoofeed
This also allows you to work with intraday data and clean up the data if needed, like your timestamp.
Another possibility, of course, would be to parse your file and save it so that it looks like a Yahoo feed. In your case, you would have to adapt the columns and the Timestamp.
Step A: follow PyAlgoTrade doc on GenericBarFeed class
On this link see the addBarsFromCSV() in CSV section of the BarFeed class in v0.16
On this link see the addBarsFromCSV() in CSV section of the BarFeed class in v0.17
Note
- The CSV file must have the column names in the first row.
- It is ok if the Adj Close column is empty.
- When working with multiple instruments:
--- If all the instruments loaded are in the same timezone, then the timezone parameter may not be specified.
--- If any of the instruments loaded are in different timezones, then the timezone parameter should be set.
addBarsFromCSV( instrument, path, timezone = None )
Loads bars for a given instrument from a CSV formatted file. The instrument gets registered in the bar feed.
Parameters:
(string) instrument – Instrument identifier.
(string) path – The path to the CSV file.
(pytz) timezone – The timezone to use to localize bars. Check pyalgotrade.marketsession.
Next:
A BarFeed loads bars from CSV files that have the following format:
Date Time, Open, High, Low, Close, Volume, Adj Close
2013-01-01 13:59:00,13.51001,13.56,13.51,13.56789,273.88014126,13.51001
Step B: implement a documented CSV-file pre-formatting
Your CSV data will need a bit of sanitising before it will be able to be used in PyAlgoTrade methods; however, it is doable, and you can create an easy transformer either by hand or with the use of the powerful numpy.genfromtxt() lambda-based converters facilities.
This sample code is intended for illustration purposes, so you can immediately see the power of converters for your own transformations, as CSV structures differ.
with open( getCsvFileNAME( ... ), "r" ) as aFH:
numpy.genfromtxt( aFH,
skip_header = 1, # Ref. pyalgotrade
delimiter = ",",
# v v v v v v
# 2011.08.30,12:00,1791.20,1792.60,1787.60,1789.60,835
# 2011.08.30,13:00,1789.70,1794.30,1788.70,1792.60,550
# 2011.08.30,14:00,1792.70,1816.70,1790.20,1812.10,1222
# 2011.08.30,15:00,1812.20,1831.50,1811.90,1824.70,2373
# 2011.08.30,16:00,1824.80,1828.10,1813.70,1817.90,2215
converters = { 0: lambda aString: mPlotDATEs.date2num( datetime.datetime.strptime( aString, "%Y.%m.%d" ) ), #_______________________________________asFloat ( 1.0, +++ )
1: lambda aString: ( ( int( aString[0:2] ) * 60 + int( aString[3:] ) ) / 60. / 24. ) # ( 15*60 + 00 ) / 60. / 24.__asFloat < 0.0, 1.0 )
# HH: :MM HH MM
}
)
You can use pyalgotrade.barfeed.addBarsFromSequence with list comprehension to feed in data from CSV row by row/bar by bar. Basically you create a bar from each row, pass OHLCV as init parameters and extra columns with additional data in a dictionary. You can try something like this (with all the required imports):
data = pd.DataFrame(index=pd.date_range(start='2021-11-01', end='2021-11-05'),
                    columns=['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume',
                             'ExtraCol1', 'ExtraCol3', 'ExtraCol4', 'ExtraCol5'],
                    data=np.random.rand(5, 10))
feed = yahoofeed.Feed()
feed.addBarsFromSequence('instrumentID', data.index.map(lambda i:
    BasicBar(
        i,
        data.loc[i, 'Open'],
        data.loc[i, 'High'],
        data.loc[i, 'Low'],
        data.loc[i, 'Close'],
        data.loc[i, 'Volume'],
        data.loc[i, 'Adj Close'],
        Frequency.DAY,
        data.loc[i, 'ExtraCol1':].to_dict())
    ).values)
The input data frame was created with random values to make this example easier to reproduce, but the part where the bars are added to the feed should work the same for data frames from CSVs given that the valid column names are used.