Troubleshooting a Loop in R

I have this loop in R that is scraping Reddit comments from an API incrementally on an hourly basis (e.g. all comments containing a certain keyword between now and 1 hour ago, 1 hour ago and 2 hours ago, 2 hours ago and 3 hours ago, etc.):
library(jsonlite)
part1 = "https://api.pushshift.io/reddit/search/comment/?q=trump&after="
part2 = "h&before="
part3 = "h&size=500"
results = list()
for (i in 1:50000)
{tryCatch({
{
url_i<- paste0(part1, i+1, part2, i, part3)
r_i <- data.frame(fromJSON(url_i))
results[[i]] <- r_i
myvec_i <- sapply(results, NROW)
print(c(i, sum(myvec_i)))
}
}, error = function(e){})
}
final = do.call(rbind.data.frame, results)
saveRDS(final, "final.RDS")
In this loop, I added a line of code that prints which iteration the loop is currently on, together with the cumulative number of results the loop has scraped as of that iteration. I also wrapped the body in tryCatch() so that, in the worst-case scenario, the loop simply skips an iteration that produces an error; I was not anticipating that to happen very often.
However, I am noticing that this loop produces errors, and therefore skips iterations, far more often than I had expected. My guess is that between certain times the API might simply have recorded no comments containing the keyword, so those iterations add no new results to the list and the cumulative count stalls, e.g. (iteration number on the left, cumulative results on the right):
iteration_number cumulative_results
1 13432 5673
2 13433 5673
3 13434 5673
But can someone please help me understand why this loop is producing so many errors, resulting in so many skipped iterations?
Thank you!

Your issue is almost certainly caused by rate limits the Pushshift API imposes on your requests. When doing scraping tasks like this one, the server may track how many requests a client has made within a certain time interval and choose to return an error code (HTTP 429) instead of the requested data. This is called rate limiting and is a way for websites to limit abuse, charge customers for usage, or both.
Per this discussion about Pushshift on Reddit, it does look like Pushshift imposes a rate limit of 120 requests per minute (see also their /meta endpoint).
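You can query that /meta endpoint from R to check the advertised limit. A minimal sketch with jsonlite; the field name below is an assumption about the response shape, so inspect the parsed object to find the real one:
library(jsonlite)
meta <- fromJSON("https://api.pushshift.io/meta")
str(meta) # inspect the available fields
meta$server_ratelimit_per_minute # assumed field name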
I was able to confirm that a script like yours will run into rate limiting by changing this line and re-running your script:
}, error = function(e){})
to:
}, error = function(e){ message(e) })
After a while, I got output like:
HTTP error 429
The solution is to slow yourself down in order to stay within this limit. A straightforward way to do this is to add a call to Sys.sleep(1) inside your for loop, where 1 is the number of seconds to pause execution.
I modified your script as follows:
library(jsonlite)
part1 = "https://api.pushshift.io/reddit/search/comment/?q=trump&after="
part2 = "h&before="
part3 = "h&size=500"
results = list()
for (i in 1:50000)
{tryCatch({
{
Sys.sleep(1) # Changed. Change the value as needed.
url_i<- paste0(part1, i+1, part2, i, part3)
r_i <- data.frame(fromJSON(url_i))
results[[i]] <- r_i
myvec_i <- sapply(results, NROW)
print(c(i, sum(myvec_i)))
}
}, error = function(e){
message(e) # Changed. Prints to the console on error.
})
}
final = do.call(rbind.data.frame, results)
saveRDS(final, "final.RDS")
Note that you may have to use a number larger than 1, and your script will of course take longer to run.
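Since the documented limit is 120 requests per minute, the theoretical floor is one request every 0.5 seconds; Sys.sleep(1) simply leaves headroom. If you would rather pause only when a request actually fails, here is a sketch of the same fetch using httr's RETRY() helper, which backs off exponentially on error responses such as HTTP 429 (httr is not part of the original script, so treat this as an alternative rather than a drop-in):
library(httr)
library(jsonlite)
resp <- RETRY("GET", url_i, times = 5, pause_base = 1, pause_cap = 60)
stop_for_status(resp) # raise an R error if every attempt failed
r_i <- data.frame(fromJSON(content(resp, as = "text", encoding = "UTF-8")))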

Since the error seems to occur randomly, depending on API availability, you could retry failed requests, setting a maximum number of attempts:
library(jsonlite)
part1 = "https://api.pushshift.io/reddit/search/comment/?q=trump&after="
part2 = "h&before="
part3 = "h&size=500"
results = list()
maxattempt <- 3
for (i in 1:1000)
{
attempt <- 1
r_i <- NULL
while( is.null(r_i) && attempt <= maxattempt ) {
if (attempt>1) {print(paste("retry: i =",i))}
attempt <- attempt + 1
url_i<- paste0(part1, i+1, part2, i, part3)
try({
r_i <- data.frame(fromJSON(url_i))
results[[i]] <- r_i
myvec_i <- sapply(results, NROW)
print(c(i, sum(myvec_i))) })
}
}
final = do.call(rbind.data.frame, results)
saveRDS(final, "final.RDS")
Result:
[1] 1 250
[1] 2 498
[1] 3 748
[1] 4 997
[1] 5 1247
[1] 6 1497
[1] 7 1747
[1] 8 1997
[1] 9 2247
[1] 10 2497
...
[1] 416 101527
[1] 417 101776
Error in open.connection(con, "rb") :
Timeout was reached: [api.pushshift.io] Operation timed out after 21678399 milliseconds with 0 out of 0 bytes received
[1] "retry: i = 418"
[1] 418 102026
[1] 419 102276
...
In the example above, a retry occurred for i = 418.
Note that for i in 1:500 the results already take about 250 MB, meaning you could expect roughly 25 GB for i in 1:50000: do you have enough RAM?
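If memory does become the bottleneck, one mitigation is to flush the accumulated results to disk periodically instead of holding everything in RAM. A rough sketch of what could go inside the loop; the chunk size of 1000 and the file names are arbitrary choices:
if (i %% 1000 == 0) {
  saveRDS(do.call(rbind.data.frame, results), sprintf("chunk_%05d.RDS", i))
  results <- list() # start fresh; earlier slots refill as NULL, which rbind ignores
}
# Afterwards, combine the chunk files:
# final <- do.call(rbind, lapply(list.files(pattern = "^chunk_.*\\.RDS$"), readRDS))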

Related

Generator usage for log analysing

I have working Python code to analyze logs. Logs are at least 10 MB in size and can sometimes reach 250-300 MB, depending on failures and retries.
I use a generator that yields the big file in configurable chunks, normally 1 or 2 MB at a time, so I analyze logs in roughly 1 MB chunks when verifying tests.
My problem is that using the generator brings up some edge cases. In log analysis I check for the subsequent appearance of certain patterns, like the following; only if all 4 lists are seen do I keep them for the next verification part of the code. These 4 patterns can each be seen in the logs once or twice, not more.
listA
listB
listC
listD
If these all occur subsequently, then I keep them all to evaluate in the next step; otherwise I ignore them.
However, there is now a small change in what can happen: some patterns (lists, as I use the regex findall method to find the patterns) can land in the next chunk, so completing the check requires looking ahead. In the following there are 3 matching cases: chunks 3-4, 5-6, and 7-8 each create the condition to take into account.
---- chunk 1 -----
listA
listB
----- chunk 2 -----
nothing
----- chunk 3 -----
listA
listB
----- chunk 4 -----
listC
listD
----- chunk 5 -----
listA
----- chunk 6 -----
listB
listC
listD
---- chunk 7 ------
listA
listB
listC
----- chunk 8 ------
listD
---------------------
It does not usually happen like this; some of the patterns (B, C, D) are mostly seen together in the logs, while listA can appear 20, at most 30, rows earlier than the rest. But any scenario like the above can happen.
Please advise a good approach; I'm not sure what to use. I know the next() function can be used to fetch the next chunk. In that case, should I use a check like
if any([listA, listB, listC, listD]):
and then determine which of the patterns were not seen, keep those in a notSeen list, and check them one by one in the next chunk, like the following?
next_chunk = next(gen_func(chunksize))
isListA = re.findall(pattern, next_chunk)
Or maybe I am completely missing an easier approach for this little project; please let me know your thoughts, as you may have come across such a situation before.
What I have so far: I used next_chunk = next(gen_func(chunksize)) and added the necessary if statements underneath to check only the one next log piece, because I arrange the log chunks with the generator accordingly. I am sharing only part of the code, as the rest is confidential:
import re, os

def __init__(self, logfile):
    self.read = self.ReadLog(logfile)
    self.search = self.SearchData(logfile)
    self.file = os.path.abspath(logfile)
    self.data = self.read.read_in_chunks
    r_search_combined, scan_result, r_complete, r_failed = [], [], [], []

def test123(self, r_reason: str, cyc: int, b_r):
    ''' Usage : verify_log.py
        --log ${LOGS_FOLDER}/log --reason r_low 1 <True | False>'''
    ret = False
    r_count = 2*int(cyc) if b_r.lower() == "true" else int(cyc)
    r_search_combined, scan_result, r_complete, r_failed = [], [], [], []
    result_pattern = self.search.r_scan_pattern()

    def check_patterns(chunk):
        search_cached = re.findall(self.search.r_search_cached, chunk)
        search_full = re.findall(self.search.r_search_full, chunk)
        scan_complete = re.findall(self.search.r_scan_complete, chunk)
        scan_result = re.findall(result_pattern, chunk)
        r_complete = re.findall(self.search.r_auth_complete, chunk)
        return search_cached, search_full, scan_complete, scan_result, r_complete

    with open(self.file) as rf:
        for idx, piece in enumerate(self.data(rf), start=1):
            is_failed = re.findall(self.search.r_failure, piece)
            if is_failed:
                print(f'general failure received : {is_failed}')
                r_failed.extend(is_failed)
            is_r_search_cached, is_r_search_full, is_scan_complete, is_scan, is_r_complete = check_patterns(piece)
            if (is_r_search_cached or is_r_search_full) and all([is_scan_complete, is_scan, is_r_complete]):
                if is_r_search_cached:
                    r_search_combined.extend(is_r_search_cached)
                if is_r_search_full:
                    r_search_combined.extend(is_r_search_full)
                scan_result.extend(is_scan)
                r_complete.extend(is_r_complete)
            elif (is_r_search_cached or is_r_search_full) and not any([is_scan, is_r_complete]):
                next_piece = next(self.data(rf))
                _, _, _, is_scan_next, is_r_complete_next = check_patterns(next_piece)
                if (is_r_search_cached or is_r_search_full) and all([is_scan_next, is_r_complete_next]):
                    r_search_combined.extend(is_r_search_cached)
                    r_search_combined.extend(is_r_search_full)
                    scan_result.extend(is_scan_next)
                    r_complete.extend(is_r_complete_next)
            elif (is_r_search_cached or is_r_search_full) and is_scan and not is_r_complete:
                next_piece = next(self.data(rf))
                _, _, _, _, is_r_complete_next = check_patterns(next_piece)
                if (is_r_search_cached or is_r_search_full) and all([is_scan, is_r_complete_next]):
                    r_search_combined.extend(is_r_search_cached)
                    r_search_combined.extend(is_r_search_full)
                    scan_result.extend(is_scan)
                    r_complete.extend(is_r_complete_next)

rOpenGov/mpg: looping through VIN numbers returns an error vs. single use?

I'm trying to loop over the feVehicle() function in the mpg package, courtesy of:
https://github.com/rOpenGov/mpg
I've been feeding the function multiple VIN IDs, even giving it 5 seconds of rest between loop iterations just in case, but I keep getting an HTTP error, even though the function works fine on its own. Any ideas what it might be? Below is the code:
#using a loop
vin = c("19UUA86209A000532", "19UUA86239A021598", "19UUA8F20CA037748", "19UUA8F21CA008002", "19UUA8F21CA017878")
for (i in vin) {
library(mpg)
print(i)
print(substr(i, 13, 17))
q = substr(i, 13, 17)
z = feVehicle(q)
Sys.sleep(5)
z = t(unlist(z))
}
or
#using lapply to see a difference
lapply(vin, feVehicle)
both throw the following error:
[1] "19UUA86209A000532"
[1] "00532"
failed to load HTTP resource
Error in t.default(unlist(z)) : argument is not a matrix
> lapply(vin, feVehicle)
failed to load HTTP resource
failed to load HTTP resource
failed to load HTTP resource
failed to load HTTP resource
failed to load HTTP resource
But when I run it on just one ID at a time, it works fine:
mpg::feVehicle(00532)
Vehicle data:
value
atvType Diesel
barrels08 16.616739130434784
barrelsA08 0.0
c240Dscr NULL
c240bDscr NULL
charge120 0.0
charge240 0.0
charge240b 0.0
city08 21
city08U 0.0
cityA08 0
cityA08U 0.0
city
It's because in your single example you passed a number, but in the loop you used a character string:
#using a loop
vin = c("19UUA86209A000532", "19UUA86239A021598", "19UUA8F20CA037748", "19UUA8F21CA008002", "19UUA8F21CA017878")
for (i in vin) {
library(mpg)
print(i)
print(substr(i, 13, 17))
q = substr(i, 13, 17)
z = feVehicle(as.numeric(q))
Sys.sleep(5)
z = t(unlist(z))
}
[1] "19UUA86209A000532"
[1] "00532"
[1] "19UUA86239A021598"
[1] "21598"
[1] "19UUA8F20CA037748"
[1] "37748"
[1] "19UUA8F21CA008002"
[1] "08002"
[1] "19UUA8F21CA017878"
[1] "17878"

Using the LDAvis package in R to create a gist file of the result

I'm using LDAvis for topic modeling and trying to use the as.gist option to create a gist. When serVis executes, there is a timeout in curl::curl_fetch_memory after about 10 seconds. If I immediately execute serVis again, I get a different error, 'Problems parsing JSON', and from then on, whenever serVis is run, that same error recurs.
If I start all over with a fresh workspace, the same behavior occurs: the first time serVis is run, curl::curl_fetch_memory times out after about 10 seconds, and subsequent executions return 'Problems parsing JSON'.
If I don't use the as.gist option it works fine, but of course it doesn't create a gist.
Very rarely, it works and a gist is created. If I change parameters to reduce the size of the JSON object, it usually works, which makes me think the problem may be related to size.
I have explored the various RCurlOptions timeout settings. Currently they are set as:
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem",
package = "RCurl"),
connecttimeout = 300, timeout = 3000,
followlocation = TRUE, dns.cache.timeout = 300))
Below is a console listing with debug set on curl::curl_fetch_memory.
> json <- createJSON(phi = cases$phi,
+ theta = cases$theta,
+ doc.len .... [TRUNCATED]
> serVis(json, open.browser = TRUE, as.gist = TRUE, description = 'APM Community')
debugging in: curl::curl_fetch_memory(url, handle = handle)
debug: {
output <- .Call(R_curl_fetch_memory, url, handle)
res <- handle_response_data(handle)
res$content <- output
res
}
Browse[2]> output <- .Call(R_curl_fetch_memory, url, handle)
Error: Timeout was reached
Browse[2]> output <- .Call(R_curl_fetch_memory, url, handle)
Browse[2]> rawToChar(output)
[1] "{\"message\":\"Problems parsing JSON\",\"documentation_url\":\"https://developer.github.com/v3\"}"
Browse[2]>
.
.
exiting from: curl::curl_fetch_memory(url, handle = handle)
Error: Problems parsing JSON
Any hints on how to debug this problem?

doParallel() and MySQL in R: Database not receiving data

I am using RMySQL() to send data from R to a MySQL database. The problem is that the database does not receive any data. I am using doParallel() since I am running over 4500 iterations. Could it be because I try to send the data to the database inside the pullSpread() function?
library(RMySQL)
library(doParallel)
library(stringr)
library(foreach)
makeCluster(detectCores()) # ANSWER = 4
cl <- makeCluster(4, type="SOCK") # also used PSOCK & FORK but received the same problem
registerDoParallel(cl)
# Now use foreach() and %dopar% to pull data...
# the apply(t(stock1), 2, pullSpread) works but not "parallelized"
# I have also used clusterApply() but is unsuccessful
system.time(
foreach(a=t(stock1)) %dopar% pullSpread(a)
)
When I look in my working directory, all the files are copied successfully into .csv files as they should be, but when I check MySQL Workbench, or even query the tables from R, they do not exist...
Here is the stock1 character vector and the pullSpread() function used...
# This list contains more than 4500 iterations.. so I am only posting a few
stock1<-c(
"SGMS.O","SGNL.O","SGNT.O",
"SGOC.O","SGRP.O", ...)
Important Dates needed for function:
Friday <- Sys.Date()-10
# Get Previous 5 days
Thursday <- Friday - 1
Wednesday <- Thursday -1
Tuesday <- Wednesday -1
Monday <- Tuesday -1
#Make Them readable for NetFonds
Friday <- format(Friday, "%Y%m%d")
Thursday<- format(Thursday, "%Y%m%d")
Wednesday<- format(Wednesday, "%Y%m%d")
Tuesday<- format(Tuesday, "%Y%m%d")
Monday<-format(Monday, "%Y%m%d")
Here is the pullSpread() function:
pullSpread = function (stock1){
AAPL_FRI <- read.delim(header=TRUE, stringsAsFactors=FALSE,
paste(sep="",
"http://www.netfonds.no/quotes/posdump.php?date=",
Friday,"&paper=",stock1,"&csv_format=txt"))
tryit <- try(AAPL_FRI[,c(1:7)])
if(inherits(tryit, "try-error")){
rm(AAPL_FRI)
} else {
AAPL_THURS <- read.delim(header=TRUE, stringsAsFactors=FALSE,
paste(sep="",
"http://www.netfonds.no/quotes/posdump.php?date=",
Thursday,"&paper=",stock1,"&csv_format=txt"))
AAPL_WED <- read.delim(header=TRUE, stringsAsFactors=FALSE,
paste(sep="",
"http://www.netfonds.no/quotes/posdump.php?date=",
Wednesday,"&paper=",stock1,"&csv_format=txt"))
AAPL_TUES <- read.delim(header=TRUE, stringsAsFactors=FALSE,
paste(sep="",
"http://www.netfonds.no/quotes/posdump.php?date=",
Tuesday,"&paper=",stock1,"&csv_format=txt"))
AAPL_MON <- read.delim(header=TRUE, stringsAsFactors=FALSE,
paste(sep="",
"http://www.netfonds.no/quotes/posdump.php?date=",
Monday,"&paper=",stock1,"&csv_format=txt"))
SERIES <- rbind(AAPL_MON,AAPL_TUES,AAPL_WED,AAPL_THURS,AAPL_FRI)
#Write .CSV File
write.csv(SERIES,paste(sep="",stock1,"_",Friday,".csv"), row.names=FALSE)
dbWriteTable(con2,paste0( "",str_sub(stock1, start = 1L, end = -3L),""),paste0(
"~/Desktop/R/",stock1,"_",Friday,".csv"), append=T)
}
}
Retrieve last Friday using something like this:
Friday <- Sys.Date()
while(weekdays(Friday) != "Friday")
{
Friday <- Friday - 1
}
As a matter of good practice when retrieving data from the internet, separate the act of downloading it from the act of processing it. That way, when the processing fails, you don't waste time and bandwidth redownloading things.
lastWeek <- format(Friday - 0:4, "%Y%m%d")
stockDatePairs <- expand.grid(Stock = stock1, Date = lastWeek)
urls <- with(
stockDatePairs,
paste0(
"http://www.netfonds.no/quotes/posdump.php?date=",
Date,
"&paper=",
Stock,
"&csv_format=txt"
)
)
for(url in urls)
{
# or whatever file name you want
download.file(url, paste0("data from ", make.names(url), ".txt"))
}
Make sure that you know which directory those files are being saved to. (Either provide an absolute path or set your working directory.)
Now try reading and rbinding those files.
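For example, something along these lines (the pattern matches the file names produced by the download.file() call above; read.delim mirrors your pullSpread() code, and files/allData are illustrative names):
files <- list.files(pattern = "^data from .*\\.txt$")
allData <- do.call(rbind, lapply(files, read.delim, stringsAsFactors = FALSE))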
If that works, then you can try doing things in parallel.
Also note that many online data services will limit the rate at which you can download unless you are paying for the service, so parallel downloads may just mean that you hit the limit quicker.

Custom function to create an index of results

I'm trying to create a function which creates an index (starting at 100) and then adjusts this index according to the results of investments. So, in a nutshell, if the first investment gives a profit of 5%, the index will stand at 105; if the second result is -7%, the index stands at 97.65. (In this question, when I use the word "index", I'm not referring to the index function of the zoo package.)
Besides creating this index, my goal is also to create a function which can be applied to various subsets of my complete data set (i.e. with the use of sapply and its friends).
Here’s the function which I have so far (data at end of this question):
CalculateIndex <- function(x){
totalAccount <- accountValueStart
if(x$TradeResult.Currency == head(x$TradeResult.Currency., n = 1)){
indexedValues <- 100 + ( 100 *((((x$Size.Units. * x$EntryPrice) / totalAccount) * x$TradeResult.Percent.) / 100))
# Update the accountvalue
totalAccount <- totalAccount + x$TradeResult.Currency.
}
else{ # the value is not the first
indexedValues <- c(indexedValues,
indexedValues[-1] + (indexedValues[-1] *(((x$Size.Units. * x$EntryPrice) / totalAccount) * x$TradeResult.Percent.) / 100)
)
# Update the accountvalue
totalAccount <- totalAccount + x$TradeResult.Currency.
}
return(indexedValues)
}
In words the function does (read: is intended to do) the following:
If the value is the first, use 100 as the starting point for the index. If the value is not the first, use the previously calculated index value as the starting point for calculating the new index value. Besides this, the function also takes the weight of the individual result (relative to the totalAccount value) into account.
The problem:
Using this CalculateIndex function on the theData data frame gives the following incorrect output:
> CalculateIndex(theData)
[1] 99.97901 99.94180 99.65632 101.88689 100.89309 98.92878 102.02911 100.49159 98.52955 102.02243 98.43655 100.76502 99.34869 100.76401 101.18014 99.75136 97.90130
[18] 100.39935 99.81311 101.34961
Warning message:
In if (x$TradeResult.Currency == head(x$TradeResult.Currency., n = 1)) { :
the condition has length > 1 and only the first element will be used
Edit:
Wow, I already got a downvote, even though I thought my question was already too long. Sorry, I thought the problem lay inside my loop, so I didn't want to bore you with details that I expected would only attract fewer answers. A misjudgement on my part.
The problem with the above output from CalculateIndex is that the results are wildly different from Excel's. Even though this could result from rounding errors (as Joris mentions below), I doubt it. In comparison, the R results differ from the Excel results quite a bit:
R output          Excel-calculated values
99,9790085700 99,97900857
99,9418035700 99,92081189
99,6563228600 99,57713687
101,8868850000 101,4639947
100,8930864300 102,3570786
98,9287771400 101,2858564
102,0291071400 103,3149664
100,4915864300 103,806556
98,5295542900 102,3361186
102,0224285700 104,3585552
98,4365550000 102,795089
100,7650171400 103,5601228
99,3486857100 102,9087897
100,7640057100 103,6728077
101,1801400000 104,8529634
99,7513600000 104,6043164
97,9013000000 102,5055298
100,3993485700 102,9048999
99,8131085700 102,7179995
101,3496071400 104,0676555
I think it would be fair to say that the difference in output isn't the result of R versus Excel problems, but more an error in my function. So, let's focus on the function.
The manual calculation of the function
The function uses several variables:
Size.Units.: the number of units bought at the EntryPrice.
EntryPrice: the price at which the stocks are bought.
TradeResult.Percent.: the percentage gain or loss resulting from the investment.
TradeResult.Currency.: the currency value ($) of the gain or loss resulting from the investment.
These variables are used in the following section of the function:
100 + ( 100 *((((x$Size.Units. * x$EntryPrice) / totalAccount) * x$TradeResult.Percent.) / 100))
and
indexedValues[-1] + (indexedValues[-1] *(((x$Size.Units. * x$EntryPrice) / totalAccount) * x$TradeResult.Percent.) / 100)
Both formulas are essentially the same, with the difference that the first starts at 100, while the second uses the previous value to calculate the new indexed value.
The formula can be broken down in different steps:
First, x$Size.Units. * x$EntryPrice determines the total position taken, in the sense that buying 100 shares at a price of 48.98 gives a position of $4898.
The resulting total position is then divided by the total account size (i.e. totalAccount). This is needed to correct for the impact of one position relative to the complete portfolio. For example, if our 100 shares bought at 48.98 drop 10 percent, the calculated index (i.e. the CalculateIndex function) doesn't have to drop 10%, because of course not all the money in totalAccount is invested in one stock. So, by dividing the total position by totalAccount we get a ratio which tells us how much money is invested. For example, a position of 4898 dollars (on a total account of 14000) results in a total account loss of 3.49% if the stock drops 10% (i.e. 4898 / 14000 = 0.349857, and 0.349857 * 10% = 3.49857%).
This ratio (of invested amount versus total amount) is then multiplied in the formula by x$TradeResult.Percent., to get the percentage impact on the total account (see the calculation example in the previous paragraph).
As a last step, the percentage gain or loss on the total account is applied to the index value (which starts at 100). In this case, the first investment in 100 shares bought at 48.98 dollars lets the index drop from its starting point of 100 to 99.97901, reflecting the losing trade's impact on the total account.
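As a quick numeric check of that first row (values from theData below; the variable names are just illustrative), this reproduces the first value R prints:
position <- 100 * 48.98    # Size.Units. * EntryPrice = 4898
weight <- position / 14000 # share of the total account: ~0.349857
impact <- weight * -0.06   # weighted TradeResult.Percent., in index points
100 + impact               # 99.97900857, the first index value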
End of Edit
Stripping the function down and then adding back one part of the formula at a time, so as to uncover the error, I came to the following step where the error seems to reside:
CalculateIndex <- function(x){
totalAccount <- accountValueStart
if(x$TradeResult.Currency == head(x$TradeResult.Currency., n = 1)){
indexedValues <- totalAccount
# Update the accountvalue
totalAccount <- totalAccount + x$TradeResult.Currency.
}
else{ # the value is not the first
indexedValues <- c(indexedValues, totalAccount)
# Update the accountvalue
totalAccount <- totalAccount + x$TradeResult.Currency.
}
return(indexedValues)
}
> CalculateIndex(theData)
[1] 14000
Warning message:
In if (x$TradeResult.Currency == head(x$TradeResult.Currency., n = 1)) { :
the condition has length > 1 and only the first element will be used
So, it seems that if I just use the totalAccount variable, the function doesn't get updated correctly. This suggests there is some error in the basics of the if-else statement, because it only outputs the first value.
If I remove the else statement from the function, I do get values for each of the rows in theData; however, these are then wrongly calculated. So it seems to me that there is some error in how this function updates the totalAccount variable. I don't see where I made an error, so any suggestion would be highly appreciated. What am I doing wrong?
The Data
Here’s what my data looks like:
> theData
Size.Units. EntryPrice TradeResult.Percent. TradeResult.Currency.
1 100 48.98 -0.06 -3
11 100 32.59 -0.25 -8
12 100 32.51 -1.48 -48
2 100 49.01 5.39 264
13 100 32.99 3.79 125
14 100 34.24 -4.38 -150
3 100 51.65 5.50 284
4 100 48.81 1.41 69
15 100 35.74 -5.76 -206
5 100 49.50 5.72 283
6 100 46.67 -4.69 -219
16 100 33.68 3.18 107
7 100 44.48 -2.05 -91
17 100 32.61 3.28 107
8 100 45.39 3.64 165
9 100 47.04 -0.74 -35
10 100 47.39 -6.20 -294
18 100 33.68 1.66 56
19 100 33.12 -0.79 -26
20 100 32.86 5.75 189
theData <- structure(list(X = c(1L, 11L, 12L, 2L, 13L, 14L, 3L, 4L, 15L,
5L, 6L, 16L, 7L, 17L, 8L, 9L, 10L, 18L, 19L, 20L), Size.Units. = c(100L,
100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L,
100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L), EntryPrice = c(48.98,
32.59, 32.51, 49.01, 32.99, 34.24, 51.65, 48.81, 35.74, 49.5,
46.67, 33.68, 44.48, 32.61, 45.39, 47.04, 47.39, 33.68, 33.12,
32.86), TradeResult.Percent. = c(-0.06, -0.25, -1.48, 5.39, 3.79,
-4.38, 5.5, 1.41, -5.76, 5.72, -4.69, 3.18, -2.05, 3.28, 3.64,
-0.74, -6.2, 1.66, -0.79, 5.75), TradeResult.Currency. = c(-3L,
-8L, -48L, 264L, 125L, -150L, 284L, 69L, -206L, 283L, -219L,
107L, -91L, 107L, 165L, -35L, -294L, 56L, -26L, 189L)), .Names = c("X",
"Size.Units.", "EntryPrice", "TradeResult.Percent.", "TradeResult.Currency."
), class = "data.frame", row.names = c(NA, -20L))
# Set the starting account value
accountValueStart <- 14000
Your code looks very strange, and it seems you have a lot of misconceptions about R that come from another programming language. Gavin and Gillespie already pointed out why you get the warning. Let me add some tips for far more optimal coding:
[-1] does NOT mean "drop the last one". It means "keep everything but the first value", which also explains why you get erroneous results (see the short demo after this list).
calculate common things in the beginning, to unclutter your code.
head(x$TradeResult.Currency., n = 1) is the same as x$TradeResult.Currency.[1].
Keep an eye on your vectors. Most of the mistakes in your code come from forgetting you're working with vectors.
If you need a value to be the first in a vector, put that OUTSIDE of any loop you'd use, never add an if-clause in the function.
predefine your vectors/matrices as much as possible; that goes a lot faster and gives fewer memory headaches when working with big data.
vectorization, vectorization, vectorization. Did I mention vectorization?
Learn the use of debug(), debugonce() and browser() to check what your function is doing. Many of your problems could have been solved by checking the objects when manipulated within the function.
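To illustrate the first point about negative indexing:
x <- c(10, 20, 30)
x[-1]         # 20 30: drops the FIRST element, not the last
x[-length(x)] # 10 20: this is how you drop the last element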
This said and taken into account, your function becomes :
CalculateIndex <- function(x,accountValueStart){
# predefine your vector
indexedValues <- vector("numeric",nrow(x))
# get your totalAccount calculated FAST. This is a VECTOR!!!
totalAccount <- cumsum(c(accountValueStart,x$TradeResult.Currency.))
#adjust length:
totalAccount <- totalAccount[-(nrow(x)+1)]
# only once this calculation. This is a VECTOR!!!!
totRatio <- 1+(((x$Size.Units. * x$EntryPrice)/totalAccount) *
x$TradeResult.Percent.)/100
# and now the calculations
indexedValues[1] <- 100 * totRatio[1]
for(i in 2:nrow(x)){
indexedValues[i] <- indexedValues[i-1]*totRatio[i]
}
return(indexedValues)
}
and returns
> CalculateIndex(theData,14000)
[1] 99.97901 99.92081 99.57714 101.46399 102.35708 101.28586 103.31497 103.80656 102.33612 104.35856 102.79509 103.56012
[13] 102.90879 103.67281 104.85296 104.60432 102.50553 102.90490 102.71800 104.06766
So now you do:
invisible(replicate(10,print("I will never forget about vectorization any more!")))
The warning message is coming from this line:
if(x$TradeResult.Currency == head(x$TradeResult.Currency., n = 1)){
It is easy to see why; x$TradeResult.Currency is a vector and thus the comparison with head(x$TradeResult.Currency., n = 1) yields a vector of logicals. (By the way, why not x$TradeResult.Currency[1] instead of the head() call?). if() requires a single logical not a vector of logicals, and that is what the warning is about. ifelse() is useful if you want to do one of two things depending upon a condition that gives a vector of logicals.
In effect, what you are doing is only entering the if() part of the statement and it gets executed once only, because the first element of x$TradeResult.Currency == head(x$TradeResult.Currency., n = 1) is TRUE and R ignores the others.
> if(c(TRUE, FALSE)) {
+ print("Hi")
+ } else {
+ print("Bye")
+ }
[1] "Hi"
Warning message:
In if (c(TRUE, FALSE)) { :
the condition has length > 1 and only the first element will be used
> ifelse(c(TRUE, FALSE), print("Hi"), print("Bye"))
[1] "Hi"
[1] "Bye"
[1] "Hi" "Bye"
As to solving your real problem:
CalculateIndex2 <- function(x, value, start = 100) {
rowSeq <- seq_len(NROW(x))
totalAc <- cumsum(c(value, x$TradeResult.Currency.))[rowSeq]
idx <- numeric(length = nrow(x))
interm <- (((x$Size.Units. * x$EntryPrice) / totalAc) *
x$TradeResult.Percent.) / 100
for(i in rowSeq) {
idx[i] <- start + (start * interm[i])
start <- idx[i]
}
idx
}
which when used on theData gives:
> CalculateIndex2(theData, 14000)
[1] 99.97901 99.92081 99.57714 101.46399 102.35708 101.28586 103.31497
[8] 103.80656 102.33612 104.35856 102.79509 103.56012 102.90879 103.67281
[15] 104.85296 104.60432 102.50553 102.90490 102.71800 104.06766
What you want is a recursive function (IIRC); the current index is some function of the previous index. These are hard to solve in a vectorised way in R, hence the loop.
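That said, this particular recurrence is a running product (each index value is the previous one times a per-row ratio), so it happens to be expressible with cumprod(). A sketch, as a hypothetical CalculateIndex3() reusing the quantities from CalculateIndex2():
CalculateIndex3 <- function(x, value, start = 100) {
  totalAc <- cumsum(c(value, x$TradeResult.Currency.))[seq_len(nrow(x))]
  ratio <- 1 + ((x$Size.Units. * x$EntryPrice) / totalAc) *
    x$TradeResult.Percent. / 100
  start * cumprod(ratio)
}
CalculateIndex3(theData, 14000) # should match the loop versions above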
I'm still slightly confused as to what exactly you want to do, but hopefully the following will be helpful.
Your R script gives the same answer as your Excel function for the first value. You see a difference because R doesn't print out all the digits.
> tmp = CalculateIndex(theData)
Warning message:
In if (x$TradeResult.Currency == head(x$TradeResult.Currency., n = 1)) { :
the condition has length > 1 and only the first element will be used
> print(tmp, digits=10)
[1] 99.97900857 99.94180357 99.65632286 101.88688500 100.89308643
<snip>
The reason for the warning message is that x$TradeResult.Currency is a vector that is being compared to a single number.
That warning message is also where your bug lives. In your if statement you never execute the else part since, as the warning message states, only the first element of x$TradeResult.Currency is being used, and that element makes the condition TRUE.