Unexpected output of emmeans averaged accross variables - lme4

I transformed a variable (e.g. leaf_area) using a simple square transformation and then fitted to the following model containing an interaction:
fit <- lmer(leaf_area^2 ~genotype*soil_type + date_measurement + light + (1|repetition) + (1|y_position) + (1|x_position), data = dataset)
To obtain the emmeans averaged accross genotypes and soil type for each measurement date, I further use the following command:
fit.emm <- emmeans(fit, ~ genotype*soil_type + date_measurement, type = "response")
The emmeans are, nevertheless, averaged for the variable date_measurement.
As represented in the following example, emmeans are averages of genotypes x, y and z in the soil MT and in the date of measurement 27.4, but the measurement dates actually occured on 21, 23, 28, 30 and 35 das.
genotype soil_type date_measurement emmean SE df lower.CL upper.CL
x MT 27.4 0.190 0.0174 126.0 0.155 0.224
y MT 27.4 0.220 0.0147 74.1 0.191 0.250
z MT 27.4 0.210 0.0157 108.6 0.179 0.241
When I fit the model without interaction between genotype and soil type and run the emmeans, the results are still averaged for the measurement dates.
fit <- lmer(leaf_area^2 ~genotype + soil_type + date_measurement + light + (1|repetition) + (1|y_position) + (1|x_position), data = dataset)
fit.emm <- emmeans(fit, ~ genotype + soil_type + date_measurement, type = "response")
My question is: how can I obtain the emmeans averaged accross genotype and soil but separated for each date of measurement?
Class of variables:
date_measurement, light, x_position, y_position: numeric
genotype and soil_type: factor
Thank you in advance.

When you have a numerical predictor in the model, the default is to obtain predictions at the average value of that covariate. If you want the covariates treated like factors, you have to say so:
fit.emm <- emmeans(fit, ~ genotype*soil_type + date_measurement,
cov.reduce = FALSE)
In addition, emmeans cannot auto-detect your square transformation. You can fix it up by doing
fit.emm <- update(fit.emm, tran = make.tran("power", 2),
type = "response")
Then I think you will want to subsequently obtain marginal means by averaging over date_measurement at least -- i.e.,
fit.emm2 <- emmeans(fit.emm, ~ genotype*soil_type)
It will retain the transformation and type = "response" setting.

Related

Multiple regression model

I have multiple regression model which looks like this:
wine.lm <- lm(Alc.vol. ~ pc2+pc1* factor(pc_parametr$color),
data = pc_parametr)
I have to extract the coefficients and translate them into one regression equation for each color (white,red)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.6803846 0.2511799 50.4832872 1.104478e-26
pc2 -3.7524814 3.5850681 -1.0466974 3.052541e-01
pc1 -9.3332435 7.5111530 -1.2425847 2.255503e-01
factor(pc_parameter$color)white 0.4778615 0.2926116 1.6330914 1.149828e-01
pc1:factor(pc_parameter$color)white 6.7281697 8.0999853 0.8306397 4.140399e-01
I was trying to do it manually but I am confused.
Y = 12.68 + -3.75 * pc2 + -9.33 * pc1 + 0.48 * factor(pc_parametr$color)white + 6.73 * pc1:factor(pc_parametr$color)white + e"
Is there a code for the calculation for different colors or manually how is the correct way
Here is correct equation notation:
Red color is the reference level and is always included in intercept. In the plot the colors move the regression curve along the y-axis.

Julia and StatsModels: print results without header line (StatsModels.DataFrameRegressionModel{...})

I run a linear regression with Julia, GLM, and StatsModels and print the results, which I include directly in the research report. This printout includes a header line with the object type, which is a distraction in the report. For example, this code:
using GLM, StatsModels, DataFrames
df = DataFrames.DataFrame(a = rand(10), b = rand(10))
f = fit(LinearModel, #formula(a ~ b), df)
println(f)
prints:
StatsModels.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
Formula: a ~ 1 + b
Coefficients:
Estimate Std.Error t value Pr(>|t|)
(Intercept) 0.238502 0.224529 1.06223 0.3191
b 0.333053 0.366105 0.909721 0.3896
I can avoid the first line by casting the object into a string and splitting at newlines:
f2 = split(string(f), "\n")
for i in 2:length(f2)
println(f2[i])
end
and then I get:
Formula: a ~ 1 + b
Coefficients:
Estimate Std.Error t value Pr(>|t|)
(Intercept) 0.238502 0.224529 1.06223 0.3191
b 0.333053 0.366105 0.909721 0.3896
But this is ugly and prone to errors. In the GLM documentation of methods applied to a fit object, I found no methods or arguments for this. Does anyone have a cleaner way?
Following discussion in the comments. If you only need to get summary of coefficients of your model write:
julia> coeftable(f)
Estimate Std.Error t value Pr(>|t|)
(Intercept) 0.337666 0.205716 1.64142 0.1393
b -0.0887478 0.378739 -0.234324 0.8206

Calculate cutoff and sensitivity for specific values of specificity?

After calculating several regression models, I want to calculate sensitivity-values and the cut-off for pre-specified values of specificity (i.e, 0.99, 0.90, 0.85 and so on) to find the best model. I have created code to calculate sensitivity and specificity for given values of the cut-off (from 0.1 to 0.9), but now I want to use specific values of specificity (i.e., calculate the corresponding cut-off value and sensitivity-values), and here I'm stuck.
Suppose I have the following regression model (using the example dataset 'mtcars'):
data(mtcars)
model <- glm(formula= vs ~ wt + disp, data=mtcars, family=binomial)
Here is the code I've used for the calculation of sens and spec for given values of the cut-off:
predvalues <- model$fitted.values
getMisclass <- function(cutoff, p, labels){
d <- cut(predvalues, breaks=c(-Inf, cutoff, Inf), labels=c("0", "1"))
print(confusionMatrix(d, mtcars$vs, positive="1"))
cat("cutoff", cutoff, ":\n")
t <- table(d, mtcars$vs)
print(round(sum(t[c(1,4)])/sum(t), 2))
}
cutoffs <- seq(.1,.9,by=.1)
sapply(cutoffs, getMisclass, p=predval, labels=mtcars$vs)
Can someone help me how to rewrite this code for the calculation of sensitivity and cut-off scores given a range of specificity-values? Is it possible?
The values for the cutoff should be
cutoffs <- c(0.99, 0.90, 0.85, 0.80, 0.75)
Thanks a lot!
This is closely related to how ROC curves are calculated: if those are calculated with fine granularity you essentially get a sensitivity and specificity for "every" threshold value. So, what you could do is simply calculate the sensitivities, specificities and corresponding threshold as if you would want to obtain a ROC curve...
library(pROC)
myRoc <- roc(predictor = predvalues, response = mtcars$vs)
plot(myRoc)
myRoc$specificities
print(with(myRoc, data.frame(specificities, sensitivities, thresholds)))
# specificities sensitivities thresholds
# 1 0.00000000 1.00000000 -Inf
# 2 0.05555556 1.00000000 0.002462809
# 3 0.11111111 1.00000000 0.003577104
# 4 0.16666667 1.00000000 0.004656164
# 5 0.22222222 1.00000000 0.005191974
# 6 0.27777778 1.00000000 0.006171197
# [...]
...and then lookup the corresponding sensitivities and thresholds for whichever specificities you are interested in, e.g. as:
cutoffs <- c(0.99, 0.90, 0.85, 0.80, 0.75)
myData <- with(myRoc, data.frame(specificities, sensitivities, thresholds))
library(plyr)
print(laply(cutoffs, function(cutoff) myData$sensitivities[which.min(abs(myData$specificities-cutoff))]))
# [1] 0.7857143 0.8571429 0.8571429 0.9285714 0.9285714

Shapefile with overlapping polygons: calculate average values

I have a very big polygon shapefile with hundreds of features, often overlapping each other. Each of these features has a value stored in the attribute table. I simply need to calculate the average values in the areas where they overlap.
I can imagine that this task requires several intricate steps: I was wondering if there is a straightforward methodology.
I’m open to every kind of suggestion, I can use ArcMap, QGis, arcpy scripts, PostGis, GDAL… I just need ideas. Thanks!
You should use the Union tool from ArcGIS. It will create new polygons where the polygons overlap. In order to keep the attributes from both polygons, add your polygon shapefile twice as input and use ALL as join_attributes parameter.This creates also polygons intersecting with themselves, you can select and delete them easily as they have the same FIDs. Then just add a new field to the attribute table and calculate it based on the two original value fields from the input polygons.
This can be done in a script or directly with the toolbox's tools.
After few attempts, I found a solution by rasterising all the features singularly and then performing cell statistics in order to calculate the average.
See below the script I wrote, please do not hesitate to comment and improve it!
Thanks!
#This script processes a shapefile of snow persistence (area of interest: Afghanistan).
#the input shapefile represents a month of snow cover and contains several features.
#each feature represents a particular day and a particular snow persistence (low,medium,high,nodata)
#these features are polygons multiparts, often overlapping.
#a feature of a particular day can overlap a feature of another one, but features of the same day and with
#different snow persistence can not overlap each other.
#(potentially, each shapefile contains 31*4 feature).
#the script takes the features singularly and exports each feature in a temporary shapefile
#which contains only one feature.
#Then, each feature is converted to raster, and after
#a logical conditional expression gives a value to the pixel according the intensity (high=3,medium=2,low=1,nodata=skipped).
#Finally, all these rasters are summed and divided by the number of days, in order to
#calculate an average value.
#The result is a raster with the average snow persistence in a particular month.
#This output raster ranges from 0 (no snow) to 3 (persistent snow for the whole month)
#and values outside this range should be considered as small errors in pixel overlapping.
#This script needs a particular folder structure. The folder C:\TEMP\Afgh_snow_cover contains 3 subfolders
#input, temp and outputs. The script takes care automatically of the cleaning of temporary data
import arcpy, numpy, os
from arcpy.sa import *
from arcpy import env
#function for finding unique values of a field in a FC
def unique_values_in_table(table, field):
data = arcpy.da.TableToNumPyArray(table, [field])
return numpy.unique(data[field])
#check extensions
try:
if arcpy.CheckExtension("Spatial") == "Available":
arcpy.CheckOutExtension("Spatial")
else:
# Raise a custom exception
#
raise LicenseError
except LicenseError:
print "spatial Analyst license is unavailable"
except:
print arcpy.GetMessages(2)
finally:
# Check in the 3D Analyst extension
#
arcpy.CheckInExtension("Spatial")
# parameters and environment
temp_folder = r"C:\TEMP\Afgh_snow_cover\temp_rasters"
output_folder = r"C:\TEMP\Afgh_snow_cover\output_rasters"
env.workspace = temp_folder
unique_field = "FID"
field_Date = "DATE"
field_Type = "Type"
cellSize = 0.02
fc = r"C:\TEMP\Afgh_snow_cover\input_shapefiles\snow_cover_Dec2007.shp"
stat_output_name = fc[-11:-4] + ".tif"
#print stat_output_name
arcpy.env.extent = "MAXOF"
#find all the uniquesID of the FC
uniqueIDs = unique_values_in_table(fc, "FID")
#make layer for selecting
arcpy.MakeFeatureLayer_management (fc, "lyr")
#uniqueIDs = uniqueIDs[-5:]
totFeatures = len(uniqueIDs)
#for each feature, get the date and the type of snow persistence(type can be high, medium, low and nodata)
for i in uniqueIDs:
SC = arcpy.SearchCursor(fc)
for row in SC:
if row.getValue(unique_field) == i:
datestring = row.getValue(field_Date)
typestring = row.getValue(field_Type)
month = str(datestring.month)
day = str(datestring.day)
year = str(datestring.year)
#format month and year string
if len(month) == 1:
month = '0' + month
if len(day) == 1:
day = '0' + day
#convert snow persistence to numerical value
if typestring == 'high':
typestring2 = 3
if typestring == 'medium':
typestring2 = 2
if typestring == 'low':
typestring2 = 1
if typestring == 'nodata':
typestring2 = 0
#skip the NoData features, and repeat the following for each feature (a feature is a day and a persistence value)
if typestring2 > 0:
#create expression for selecting the feature
expression = ' "FID" = ' + str(i) + ' '
#select the feature
arcpy.SelectLayerByAttribute_management("lyr", "NEW_SELECTION", expression)
#create
#outFeatureClass = os.path.join(temp_folder, ("M_Y_" + str(i)))
#create faeture class name, writing the snow persistence value at the end of the name
outFeatureClass = "Afg_" + str(year) + str(month) + str(day) + "_" + str(typestring2) + '.shp'
#export the feature
arcpy.FeatureClassToFeatureClass_conversion("lyr", temp_folder, outFeatureClass)
print "exported FID " + str(i) + " \ " + str(totFeatures)
#create name of the raster and convert the newly created feature to raster
outRaster = outFeatureClass[4:-4] + ".tif"
arcpy.FeatureToRaster_conversion(outFeatureClass, field_Type, outRaster, cellSize)
#remove the temporary fc
arcpy.Delete_management(outFeatureClass)
del SC, row
#now many rasters are created, representing the snow persistence types of each day.
#list all the rasters created
rasterList = arcpy.ListRasters("*", "All")
print rasterList
#now the rasters have values 1 and 0. the following loop will
#perform CON expressions in order to assign the value of snow persistence
for i in rasterList:
print i + ":"
inRaster = Raster(i)
#set the value of snow persistence, stored in the raster name
value_to_set = i[-5]
inTrueRaster = int(value_to_set)
inFalseConstant = 0
whereClause = "Value > 0"
# Check out the ArcGIS Spatial Analyst extension license
arcpy.CheckOutExtension("Spatial")
print 'Executing CON expression and deleting input'
# Execute Con , in order to assign to each pixel the value of snow persistence
print str(inTrueRaster)
try:
outCon = Con(inRaster, inTrueRaster, inFalseConstant, whereClause)
except:
print 'CON expression failed (probably empty raster!)'
nameoutput = i[:-4] + "_c.tif"
outCon.save(nameoutput)
#delete the temp rasters with values 0 and 1
arcpy.Delete_management(i)
#list the raster with values of snow persistence
rasterList = arcpy.ListRasters("*_c.tif", "All")
#sum the rasters
print "Caclulating SUM"
outCellStats = CellStatistics(rasterList, "SUM", "DATA")
#calculate the number of days (num of rasters/3)
print "Calculating day ratio"
num_of_rasters = len(rasterList)
print 'Num of rasters : ' + str(num_of_rasters)
num_of_days = num_of_rasters / 3
print 'Num of days : ' + str(num_of_days)
#in order to store decimal values, multiplicate the raster by 1000 before dividing
outCellStats = outCellStats * 1000 / num_of_days
#save the output raster
print "saving output " + stat_output_name
stat_output_name = os.path.join(output_folder,stat_output_name)
outCellStats.save(stat_output_name)
#delete the remaining temporary rasters
print "deleting CON rasters"
for i in rasterList:
print "deleting " + i
arcpy.Delete_management(i)
arcpy.Delete_management("lyr")
Could you rasterize your polygons into multiple layers, each pixel could contain your attribute value. Then merge the layers by averaging the attribute values?

Custom function to create an index of results

I’m trying to create a function which creates an index (starting at 100) and then adjust this index according to the results of investments. So, in a nutshell, if the first investment gives an profit of 5%, then the index will stand 105, if the second result is -7%, then the index stands at 97.65. In this question when I use the word "index", I'm not referring to the index function of the zoo package.
Besides creating this index, my goal is also to create an function which can be applied to various subsets of my complete data set (i.e. with the use of sapply and it's friends).
Here’s the function which I have so far (data at end of this question):
CalculateIndex <- function(x){
totalAccount <- accountValueStart
if(x$TradeResult.Currency == head(x$TradeResult.Currency., n = 1)){
indexedValues <- 100 + ( 100 *((((x$Size.Units. * x$EntryPrice) / totalAccount) * x$TradeResult.Percent.) / 100))
# Update the accountvalue
totalAccount <- totalAccount + x$TradeResult.Currency.
}
else{ # the value is not the first
indexedValues <- c(indexedValues,
indexedValues[-1] + (indexedValues[-1] *(((x$Size.Units. * x$EntryPrice) / totalAccount) * x$TradeResult.Percent.) / 100)
)
# Update the accountvalue
totalAccount <- totalAccount + x$TradeResult.Currency.
}
return(indexedValues)
}
In words the function does (read: is intended to do) the following:
If the value is the first, use 100 as an starting point for the index. If the value is not the first, use the previous calculated index value as the starting point for calculating the new index value. Besides this, the function also takes the weight of the individual result (compared with the totalAccount value) into account.
The problem:
Using this CalculateIndex function on the theData data frame gives the following incorrect output:
> CalculateIndex(theData)
[1] 99.97901 99.94180 99.65632 101.88689 100.89309 98.92878 102.02911 100.49159 98.52955 102.02243 98.43655 100.76502 99.34869 100.76401 101.18014 99.75136 97.90130
[18] 100.39935 99.81311 101.34961
Warning message:
In if (x$TradeResult.Currency == head(x$TradeResult.Currency., n = 1)) { :
the condition has length > 1 and only the first element will be used
Edit:
Wow, I already got an vote down, though I thought my question was already too long. Sorry, I thought/think the problem lay inside my loop, so I didn't want to bore you with the details, which I thought would only give less answers. Sorry, misjudgement on my part.
The problem is, with the above output from CalculateIndex, that the results are wildly different from Excel. Even though this could be resulting from rounding errors (as Joris mentions below), I doubt it. In comparison with the Excel results, the R results differ quite some:
R output Excel calculate values
99,9790085700 99,97900857
99,9418035700 99,92081189
99,6563228600 99,57713687
101,8868850000 101,4639947
100,8930864300 102,3570786
98,9287771400 101,2858564
102,0291071400 103,3149664
100,4915864300 103,806556
98,5295542900 102,3361186
102,0224285700 104,3585552
98,4365550000 102,795089
100,7650171400 103,5601228
99,3486857100 102,9087897
100,7640057100 103,6728077
101,1801400000 104,8529634
99,7513600000 104,6043164
97,9013000000 102,5055298
100,3993485700 102,9048999
99,8131085700 102,7179995
101,3496071400 104,0676555
I think it would be fair to say that the difference in output isn't the result of R versus Excel problems, but more an error in my function. So, let's focus on the function.
The manual calculation of the function
The function uses different variables:
Size.Units.; this is the number of units which are bought at the EntryPrice.
EntryPrice: the price at which the stocks are bought,
TradeResult.Percent.: the percentage gain or loss resulting from the investment,
TradeResult.Currency.: the currency value ($) of the gain or loss resulting from the investment,
These variables are used in the following section of the function:
100 + ( 100 *((((x$Size.Units. * x$EntryPrice) / totalAccount) * x$TradeResult.Percent.) / 100))
and
indexedValues[-1] + (indexedValues[-1] *(((x$Size.Units. * x$EntryPrice) / totalAccount) * x$TradeResult.Percent.) / 100)
Both of the formula's are essentially the same, with the difference that the the first starts at 100, and the second uses the previous value to calculate the new indexed value.
The formula can be broken down in different steps:
First, x$Size.Units. * x$EntryPrice determines the total position that was taken, in the sense that buying 100 shares at an price of 48.98 gives an position of $4898.
The resulting total position is then divided by the total account size (i.e. totalAccount). This is needed to correct the impact of one position relative to the complete portfolio. For example, if our 100 shares bought at 48.98 drop 10 percent, the calculated index (i.e. the CalculateIndex function) doesn't have to drop 10%, because off course not all the money in totalAccount is invested in one stock. So, by dividing the total position by the totalAccount we get an ratio which tells us how much money is invested. For example, the position with the size of 4898 dollar (on a total account of 14000) results in a total account loss of 3.49% if the stock drops 10%. (i.e. 4898 / 14000 = 0.349857. 0.349857 * 10% = 3.49857%)
This ratio (of invested amount versus total amount) is then in the formula multiplied with x$TradeResult.Percent., so to get the percentage impact on the total account (see calculation example in the previous paragraph).
As an last step, the percentage loss on the total account is applied to the index value (which starts at 100). In this case, the first investment in 100 shares bought at 48.89 dollar let's the index drop from it starting point at 100 to 99.97901, reflecting the losing trade's impact on the total account.
End of Edit
Stripping the function clean and then adding a part of the formula at a time, so to uncover the error, I came to the following step where the error seems to reside:
CalculateIndex <- function(x){
totalAccount <- accountValueStart
if(x$TradeResult.Currency == head(x$TradeResult.Currency., n = 1)){
indexedValues <- totalAccount
# Update the accountvalue
totalAccount <- totalAccount + x$TradeResult.Currency.
}
else{ # the value is not the first
indexedValues <- c(indexedValues, totalAccount)
# Update the accountvalue
totalAccount <- totalAccount + x$TradeResult.Currency.
}
return(indexedValues)
}
> CalculateIndex(theData)
[1] 14000
Warning message:
In if (x$TradeResult.Currency == head(x$TradeResult.Currency., n = 1)) { :
the condition has length > 1 and only the first element will be used
So, it seems that if I just use the totalAccount variable, the function doesn’t get updated correctly. This seems to suggest there is some error with the basics of the if else statement, because it only outputs the first value.
If I remove the else statement from the function, I do get values for each of the rows in theData. However, these are then wrongly calculated. So, it seems to me that there is some error in how this function updates the totalAccount variable. I don’t see where I made an error, so any suggestion would be highly appreciated. What am I doing wrong?
The Data
Here’s what my data looks like:
> theData
Size.Units. EntryPrice TradeResult.Percent. TradeResult.Currency.
1 100 48.98 -0.06 -3
11 100 32.59 -0.25 -8
12 100 32.51 -1.48 -48
2 100 49.01 5.39 264
13 100 32.99 3.79 125
14 100 34.24 -4.38 -150
3 100 51.65 5.50 284
4 100 48.81 1.41 69
15 100 35.74 -5.76 -206
5 100 49.50 5.72 283
6 100 46.67 -4.69 -219
16 100 33.68 3.18 107
7 100 44.48 -2.05 -91
17 100 32.61 3.28 107
8 100 45.39 3.64 165
9 100 47.04 -0.74 -35
10 100 47.39 -6.20 -294
18 100 33.68 1.66 56
19 100 33.12 -0.79 -26
20 100 32.86 5.75 189
theData <- structure(list(X = c(1L, 11L, 12L, 2L, 13L, 14L, 3L, 4L, 15L,
5L, 6L, 16L, 7L, 17L, 8L, 9L, 10L, 18L, 19L, 20L), Size.Units. = c(100L,
100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L,
100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L), EntryPrice = c(48.98,
32.59, 32.51, 49.01, 32.99, 34.24, 51.65, 48.81, 35.74, 49.5,
46.67, 33.68, 44.48, 32.61, 45.39, 47.04, 47.39, 33.68, 33.12,
32.86), TradeResult.Percent. = c(-0.06, -0.25, -1.48, 5.39, 3.79,
-4.38, 5.5, 1.41, -5.76, 5.72, -4.69, 3.18, -2.05, 3.28, 3.64,
-0.74, -6.2, 1.66, -0.79, 5.75), TradeResult.Currency. = c(-3L,
-8L, -48L, 264L, 125L, -150L, 284L, 69L, -206L, 283L, -219L,
107L, -91L, 107L, 165L, -35L, -294L, 56L, -26L, 189L)), .Names = c("X",
"Size.Units.", "EntryPrice", "TradeResult.Percent.", "TradeResult.Currency."
), class = "data.frame", row.names = c(NA, -20L))
# Set the account start # 14000
> accountValueStart <- 14000
Your code looks very strange, and it seems you have a lot of misconceptions about R that come from another programming language. Gavin and Gillespie pointed out already why you get the warniong. Let me add some tips for far more optimal coding:
[-1] does NOT mean: drop the last one. It means "keep everything but the first value", which also explains why you get erroneous results.
calculate common things in the beginning, to unclutter your code.
head(x$TradeResult.Currency., n = 1) is the same as x$TradeResult.Currency.[1].
Keep an eye on your vectors. Most of the mistakes in your code come from forgetting you're working with vectors.
If you need a value to be the first in a vector, put that OUTSIDE of any loop you'd use, never add an if-clause in the function.
predefine your vectors/matrices as much as possible, that goes a lot faster and gives less memory headaches when working with big data.
vectorization, vectorization, vectorization. Did I mention vectorization?
Learn the use of debug(), debugonce() and browser() to check what your function is doing. Many of your problems could have been solved by checking the objects when manipulated within the function.
This said and taken into account, your function becomes :
CalculateIndex <- function(x,accountValueStart){
# predifine your vector
indexedValues <- vector("numeric",nrow(x))
# get your totalAccount calculated FAST. This is a VECTOR!!!
totalAccount <- cumsum(c(accountValueStart,x$TradeResult.Currency.))
#adjust length:
totalAccount <- totalAccount[-(nrow(x)+1)]
# only once this calculation. This is a VECTOR!!!!
totRatio <- 1+(((x$Size.Units. * x$EntryPrice)/totalAccount) *
x$TradeResult.Percent.)/100
# and now the calculations
indexedValues[1] <- 100 * totRatio[1]
for(i in 2:nrow(x)){
indexedValues[i] <- indexedValues[i-1]*totRatio[i]
}
return(indexedValues)
}
and returns
> CalculateIndex(theData,14000)
[1] 99.97901 99.92081 99.57714 101.46399 102.35708 101.28586 103.31497
103.80656 102.33612 104.35856 102.79509 103.56012
[13] 102.90879 103.67281 104.85296 104.60432 102.50553 102.90490 102.71800
104.06766
So now you do:
invisible(replicate(10,print("I will never forget about vectorization any more!")))
The warning message is coming from this line:
if(x$TradeResult.Currency == head(x$TradeResult.Currency., n = 1)){
It is easy to see why; x$TradeResult.Currency is a vector and thus the comparison with head(x$TradeResult.Currency., n = 1) yields a vector of logicals. (By the way, why not x$TradeResult.Currency[1] instead of the head() call?). if() requires a single logical not a vector of logicals, and that is what the warning is about. ifelse() is useful if you want to do one of two things depending upon a condition that gives a vector of logicals.
In effect, what you are doing is only entering the if() part of the statement and it gets executed once only, because the first element of x$TradeResult.Currency == head(x$TradeResult.Currency., n = 1) is TRUE and R ignores the others.
> if(c(TRUE, FALSE)) {
+ print("Hi")
+ } else {
+ print("Bye")
+ }
[1] "Hi"
Warning message:
In if (c(TRUE, FALSE)) { :
the condition has length > 1 and only the first element will be used
> ifelse(c(TRUE, FALSE), print("Hi"), print("Bye"))
[1] "Hi"
[1] "Bye"
[1] "Hi" "Bye"
As to solving your real problem:
CalculateIndex2 <- function(x, value, start = 100) {
rowSeq <- seq_len(NROW(x))
totalAc <- cumsum(c(value, x$TradeResult.Currency.))[rowSeq]
idx <- numeric(length = nrow(x))
interm <- (((x$Size.Units. * x$EntryPrice) / totalAc) *
x$TradeResult.Percent.) / 100
for(i in rowSeq) {
idx[i] <- start + (start * interm[i])
start <- idx[i]
}
idx
}
which when used on theData gives:
> CalculateIndex2(theData, 14000)
[1] 99.97901 99.92081 99.57714 101.46399 102.35708 101.28586 103.31497
[8] 103.80656 102.33612 104.35856 102.79509 103.56012 102.90879 103.67281
[15] 104.85296 104.60432 102.50553 102.90490 102.71800 104.06766
What you want is a recursive function (IIRC); the current index is some function of the previous index. These are hard to solve in a vectorised way in R, hence the loop.
I'm still slightly confused as to what exactly you want to do, but hopefully the following will be helpful.
Your R script gives the same answers as your Excel function for the first value. You see a difference because R doesn't print out all digits.
> tmp = CalculateIndex(thedata)
Warning message:
In if (x$TradeResult.Currency == head(x$TradeResult.Currency., n = 1)) { :
the condition has length > 1 and only the first element will be used
> print(tmp, digits=10)
[1] 99.97900857 99.94180357 99.65632286 101.88688500 100.89308643
<snip>
The reason for the warning message is because x$TradeResult.Currency is a vector that is being compared to a single number.
That warning message is also where your bug lives. In your if statement, you never execute the else part, since only the value of x$TradeResult.Currency is being used. As the warning message states, only the first element of x$TradeResult.Currency is being used.