I have data in the form of the real and imaginary parts of complex numbers, and I want to fit them with a complex function. Specifically, the data come from electrochemical impedance spectroscopy (EIS) experiments, and the function comes from an equivalent circuit.
I am using Octave 7.2.0 on a Windows 10 computer. I need to use the leasqr algorithm from the optim package. leasqr uses Levenberg-Marquardt nonlinear regression, which is typical in EIS data fitting.
Regarding the data, xdata is the frequency (converted to angular frequency in the code below) and ydata is ReZ + j*ImZ.
If I try to fit the complex data with the complex fitting function, I get the following error:
error: weighted residuals are not real
error: called from
__lm_svd__ at line 147 column 20
leasqr at line 662 column 25
Code_for_StackOverflow at line 47 column 73
I tried to fit the real part of the data with the real part of the fitting function, and the imaginary parts of the data, with the imaginary part of the function. The fits are successfully performed, but I have two sets of fitted parameters, while I need only one set.
Here is the code I wrote.
clear -a;
clf;
clc;
pkg load optim;
pkg load symbolic;
Linear_freq = [1051.432, 394.2871, 112.6535, 39.42871, 11.59668, 3.458659, 1.065641, 0.3258571, 0.1000221];
ReZ = [84.10412, 102.0962, 178.8031, 283.0663, 366.7088, 431.3653, 514.4105, 650.5853, 895.9588];
MinusImZ = [27.84804, 59.56786, 116.5972, 123.2293, 102.6806, 117.4836, 178.1147, 306.256, 551.2337];
Z = [88.5946744, 118.2030626, 213.4606653, 308.7264008, 380.8131426, 447.0776424, 544.3739605, 719.0646495, 1051.950932];
MinusPhase = [18.32042302, 30.26135402, 33.1083029, 23.52528583, 15.64255593, 15.23515301, 19.09841797, 25.2082044, 31.60167787];
ImZ = -MinusImZ;
Angular_freq = 2*pi*Linear_freq;
xdata = Angular_freq;
ydata = ReZ + j*ImZ;
Fitting_Function = @(xdata, p) (p(1) + ((p(2) + (1./(p(3)*(j*xdata).^0.5))).^-1 + (1./(p(4)*(j*xdata).^p(5))).^-1).^-1);
p = [80, 300, 6.63E-3, 5E-5, 0.8]; # True parameters values, taken with a dedicated software: 76, 283, 1.63E-3, 1.5E-5, 0.876
options.fract_prec = [0.0005, 0.0005, 0.0005, 0.0005, 0.0005].';
niter=400;
tol=1E-12;
dFdp="dfdp";
dp=1E-9*ones(size(p));
wt = abs(sqrt(ydata).^-1);
#[Fitted_Parameters_ReZ pfit_real cvg_real iter_real corp_real covp_real covr_real stdresid_real z_real r2_real] = leasqr(xdata, ReZ, p, Fitting_Function_ReZ, tol, niter, wt, dp, dFdp, options);
#[Fitted_Parameters_ImZ pfit_imag cvg_imag iter_imag corp_imag covp_imag covr_imag stdresid_imag z_imag r2_imag] = leasqr(xdata, ImZ, p, Fitting_Function_ImZ, tol, niter, wt, dp, dFdp, options);
[Fitted_Parameters pfit cvg iter corp covp covr stdresid z r2] = leasqr(xdata, ydata, p, Fitting_Function, tol, niter, wt, dp, dFdp, options);
#########################################################################
# Calculate the fitted functions, with the fitted parameters array
#########################################################################
Fitted_Function_Real = real(pfit_real(1) + ((pfit_real(2) + (1./(pfit_real(3)*(j*xdata).^0.5))).^-1 + (1./(pfit_real(4)*(j*xdata).^pfit_real(5))).^-1).^-1);
Fitted_Function_Imag = imag(pfit_imag(1) + ((pfit_imag(2) + (1./(pfit_imag(3)*(j*xdata).^0.5))).^-1 + (1./(pfit_imag(4)*(j*xdata).^pfit_imag(5))).^-1).^-1);
Fitted_Function = Fitted_Function_Real + j.*Fitted_Function_Imag;
Fitted_Function_Mod = abs(Fitted_Function);
Fitted_Function_Phase = (-(angle(Fitted_Function))*(180./pi));
################################################################################
# Calculate the residuals, from https://iopscience.iop.org/article/10.1149/1.2044210
# An optimum fit is obtained when the residuals are spread randomly around the log omega axis.
# When the residuals show a systematic deviation from the horizontal axis, e.g., by forming
# a "trace" around, above, or below the log omega axis, the complex nonlinear least squares (CNLS) fit is not adequate.
################################################################################
Residuals_Real = (ReZ-Fitted_Function_Real)./Fitted_Function_Mod;
Residuals_Imag = (ImZ-Fitted_Function_Imag)./Fitted_Function_Mod;
################################################################################
# Calculate the reduced chi-squared value with the fitted parameters array (NOVA manual, page 452)
################################################################################
chi_squared_ReZ = sum(((ReZ-Fitted_Function_Real).^2)./Z.^2)
chi_squared_ImZ = sum(((ImZ-Fitted_Function_Imag).^2)./Z.^2)
Pseudo_chi_squared = sum((((ReZ-Fitted_Function_Real).^2)+((ImZ-Fitted_Function_Imag).^2))./Z.^2)
disp('The values of the parameters after the fit of the real function are '), disp(pfit_real);
disp('The values of the parameters after the fit of the imaginary function are '), disp(pfit_imag);
disp("R^2, the coefficient of multiple determination, intercept form (not suitable for non-real residuals) is "), disp(r2_real), disp(r2_imag);
###################################################
## PLOT Data and the Function
###################################################
#Set plot parameters
set(0, "defaultlinelinewidth", 1);
set(0, "defaulttextfontname", "Verdana");
set(0, "defaulttextfontsize", 20);
set(0, "DefaultAxesFontName", "Verdana");
set(0, 'DefaultAxesFontSize', 12);
figure(1);
## Nyquist plot (Argand diagram)
subplot(1,2,1, "align");
plot((ReZ), (MinusImZ), "o", "markersize", 2, (Fitted_Function_Real), -(Fitted_Function_Imag), "-k");
axis ("square");
grid on;
daspect([1 1 2]);
title ('Nyquist Plot - Argand Diagram');
xlabel ('Z'' / \Omega' , 'interpreter', 'tex');
ylabel ('-Z'''' / \Omega', 'interpreter', 'tex');
## Bode Modulus
subplot (2, 2, 2);
loglog((Linear_freq), (Z), "o", "markersize", 2, (Linear_freq), (Fitted_Function_Mod), "-k");
grid on;
title ('Bode Plot - Modulus');
xlabel ('\nu (Hz)' , 'interpreter', 'tex');
ylabel ('|Z| / \Omega', 'interpreter', 'tex');
## Bode Phase
subplot (2, 2, 4);
semilogx((Linear_freq), (MinusPhase), "o", "markersize", 2, (Linear_freq), (Fitted_Function_Phase), "-k");
set(gca,'YTick',0:10:90);
grid on;
title ('Bode Plot - Phase');
xlabel ('\nu (Hz)' , 'interpreter', 'tex');
ylabel ('-\theta (°)', 'interpreter', 'tex');
figure(2)
## Bode Z'
subplot (2, 1, 1);
semilogx((Linear_freq), (ReZ), "o", "markersize", 2, (Linear_freq), (Fitted_Function_Real), "-k");
grid on;
title ('Bode Plot Z''');
xlabel ('\nu (Hz)' , 'interpreter', 'tex');
ylabel ('Z'' / \Omega', 'interpreter', 'tex');
## Bode -Z''
subplot (2, 1, 2);
#subplot (2, 2, 4);
semilogx((Linear_freq), (MinusImZ), "o", "markersize", 2, (Linear_freq), -(Fitted_Function_Imag), "-k");
grid on;
title ('Bode Plot -Z''''');
xlabel ('\nu (Hz)' , 'interpreter', 'tex');
ylabel ('-Z'''' / \Omega', 'interpreter', 'tex');
figure(3)
## Residuals Real
subplot (2, 1, 1);
semilogx((Angular_freq), (Residuals_Real), "-o", "markersize", 2);
grid on;
title ('Residuals Real');
xlabel ('\omega (Hz)' , 'interpreter', 'tex');
ylabel ('\Delta_{re} / \Omega', 'interpreter', 'tex');
## Residuals Imaginary
subplot (2, 1, 2);
#subplot (2, 2, 4);
semilogx((Angular_freq), (Residuals_Imag), "-o", "markersize", 2);
grid on;
title ('Residuals Imaginary');
xlabel ('\omega (Hz)' , 'interpreter', 'tex');
ylabel ('\Delta_{im} / \Omega', 'interpreter', 'tex');
Octave should be able to handle complex numbers. What am I doing wrong?
I was thinking of fitting the real part of the data with the real part of the fitting function, and then using the Kramers-Kronig relations to get the imaginary part of the fitted function, but I would like to avoid this method if possible.
Any help would be greatly appreciated, thanks in advance.
Plotting your data as a complex impedance diagram reveals a rather common shape that can be modelled with many different equivalent circuits:
Reference: https://fr.scribd.com/doc/71923015/The-Phasance-Concept
You chose model no. 2, probably on physical grounds. That choice is not discussed here.
Also on physical grounds and/or by graphical inspection, you correctly assumed that one phasance is of the Warburg kind (Phi = -pi/4; nu = -1/2).
The problem is to fit an equation with five adjustable parameters. This is a difficult nonlinear regression problem for a complex equation. The usual method is an iterative process starting from guessed values of the five parameters.
The guessed values must not be far from the unknown correct values. Approximate values can be found by graphical inspection of the impedance diagram, but poor guesses are a frequent cause of convergence failure in the iterative process.
A more reliable method combines linear regression with respect to most of the parameters and nonlinear regression with respect to only a few.
In the present case, it is shown below that the nonlinear regression can be reduced to a single parameter, while the other parameters can be handled by simple linear regression. This is a big simplification.
Software for mixed linear and nonlinear regression (for cases involving several phasors) was developed in the years 1980-1990. Unfortunately, I have no access to it at present.
Nevertheless, in the present case of a single phasor we don't need a sledgehammer to crack a nut. The Newton-Raphson method is sufficient. Graphical inspection gives a rough approximation of nu between -0.7 and -0.8. The chosen initial value is nu = -0.75, giving the following first run:
Since all calculations are carried out in complex numbers, the resulting values are complex instead of real as expected. They are denoted ZR1, ZR2, ZP1, ZP2 to distinguish them from the real R1, R2, P1, P2. This happens because the value of nu isn't optimal yet.
The closer nu gets to its final value, the more the imaginary parts vanish. After a few runs of the Newton-Raphson process the imaginary parts become negligible. The final result is shown below.
Publications:
"Contribution à l'interprétation de certaines mesures d'impédances". 2-ième Forum sur les Impédances Electrochimiques, 28-29 octobre 1987.
"Calcul de réseaux électriques équivalents à partir de mesures d'impédances". 3-ième Forum sur les Impédances Electrochimiques, 24 novembre 1988.
"Synthèse de circuits électriques équivalents à partir de mesures d'impédances complexes". 5-ième Forum sur les Impédances Electrochimiques, 28 novembre 1991.
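Separately from the method described in this answer, the "weighted residuals are not real" error in the question arises because the least-squares routine is handed complex residuals. A common workaround is to stack the real and imaginary parts into one real vector, so that a single parameter set is fitted to both parts at once. The sketch below illustrates the idea in R with nls rather than Octave's leasqr. The circuit (one resistor in series with a parallel RC element), the synthetic data, and the names impedance and model_stacked are assumptions made for illustration only, not the five-parameter model from the question.

```r
# Simplified, hypothetical circuit: Z(w) = R1 + R2 / (1 + 1i * w * tau)
impedance <- function(w, R1, R2, tau) R1 + R2 / (1 + 1i * w * tau)

# Synthetic "measurements" with known parameters plus a little noise
set.seed(1)
w      <- 2 * pi * 10^seq(-1, 3, length.out = 20)   # angular frequency
z_true <- impedance(w, R1 = 80, R2 = 300, tau = 0.05)
z_obs  <- z_true + complex(real = rnorm(20, sd = 0.5),
                           imaginary = rnorm(20, sd = 0.5))

# Stack real and imaginary parts so the optimiser only ever sees real numbers
y_stacked <- c(Re(z_obs), Im(z_obs))
model_stacked <- function(R1, R2, tau) {
  z <- impedance(w, R1, R2, tau)
  c(Re(z), Im(z))
}

# One fit, one parameter set, constrained by both parts of the data
fit <- nls(y_stacked ~ model_stacked(R1, R2, tau),
           start = list(R1 = 70, R2 = 250, tau = 0.03))
fitted_params <- coef(fit)
print(fitted_params)
```

The same stacking could presumably be used with leasqr by letting the model function return the real parts followed by the imaginary parts and passing the data in the same layout; that has not been tested here, and the weights would need the same doubled length.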
I transformed a variable (e.g. leaf_area) with a simple square transformation and then fitted the following model, which contains an interaction:
fit <- lmer(leaf_area^2 ~genotype*soil_type + date_measurement + light + (1|repetition) + (1|y_position) + (1|x_position), data = dataset)
To obtain the emmeans averaged across genotypes and soil types for each measurement date, I then use the following command:
fit.emm <- emmeans(fit, ~ genotype*soil_type + date_measurement, type = "response")
The emmeans are, however, computed at the average of the variable date_measurement.
As shown in the following example, the emmeans for genotypes x, y and z in soil MT are reported at a measurement date of 27.4, although the measurements actually took place at 21, 23, 28, 30 and 35 das.
genotype soil_type date_measurement emmean SE df lower.CL upper.CL
x MT 27.4 0.190 0.0174 126.0 0.155 0.224
y MT 27.4 0.220 0.0147 74.1 0.191 0.250
z MT 27.4 0.210 0.0157 108.6 0.179 0.241
When I fit the model without interaction between genotype and soil type and run the emmeans, the results are still averaged for the measurement dates.
fit <- lmer(leaf_area^2 ~genotype + soil_type + date_measurement + light + (1|repetition) + (1|y_position) + (1|x_position), data = dataset)
fit.emm <- emmeans(fit, ~ genotype + soil_type + date_measurement, type = "response")
My question is: how can I obtain the emmeans averaged across genotype and soil but separated for each date of measurement?
Class of variables:
date_measurement, light, x_position, y_position: numeric
genotype and soil_type: factor
Thank you in advance.
When you have a numerical predictor in the model, the default is to obtain predictions at the average value of that covariate. If you want the covariates treated like factors, you have to say so:
fit.emm <- emmeans(fit, ~ genotype*soil_type + date_measurement,
cov.reduce = FALSE)
In addition, emmeans cannot auto-detect your square transformation. You can fix it up by doing
fit.emm <- update(fit.emm, tran = make.tran("power", 2),
type = "response")
Then I think you will want to subsequently obtain marginal means by averaging over date_measurement at least -- i.e.,
fit.emm2 <- emmeans(fit.emm, ~ genotype*soil_type)
It will retain the transformation and type = "response" setting.
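To see what "predictions at the average value of that covariate" means, here is a base-R analogue that uses lm and predict instead of lmer and emmeans. The toy data frame and its column values are invented for illustration; only the five measurement dates are taken from the question. Their mean is 27.4, which is the value that appeared in the date_measurement column of the question's output.

```r
# Toy data: two genotypes measured on five dates (values are illustrative)
toy <- expand.grid(genotype = factor(c("x", "y")),
                   date_measurement = c(21, 23, 28, 30, 35))
toy$leaf_area_sq <- 0.01 * toy$date_measurement +
  ifelse(toy$genotype == "y", 0.05, 0)

fit_toy <- lm(leaf_area_sq ~ genotype + date_measurement, data = toy)

# Default-like behaviour: one prediction per genotype at the mean date
at_mean <- data.frame(genotype = factor(c("x", "y")),
                      date_measurement = mean(toy$date_measurement))
pred_at_mean <- predict(fit_toy, newdata = at_mean)

# cov.reduce = FALSE-like behaviour: one prediction per genotype per date
at_each <- expand.grid(genotype = factor(c("x", "y")),
                       date_measurement = unique(toy$date_measurement))
at_each$pred <- predict(fit_toy, newdata = at_each)

# Back-transform from the squared scale by taking the square root
at_each$leaf_area <- sqrt(at_each$pred)

print(mean(toy$date_measurement))
print(at_each)
```

The prediction grid grows from one row per genotype to one row per genotype and date, which is the layout the question asks for.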
I have a simple issue after running a regression with panel data using plm with a dataset that resembles the one below:
dataset <- data.frame(id = rep(c(1,2,3,4,5), 2),
time = rep(c(0,1), each = 5),
group = rep(c(0,1,0,0,1), 2),
Y = runif(10,0,1))
model <-plm(Y ~ time*group, method = 'fd', effect = 'twoways', data = dataset,
index = c('id', 'time'))
summary(model)
stargazer(model)
As you can see, both the model summary and the table displayed by stargazer say that my number of observations is 10. However, is it not more correct to say that N = 5, since taking first differences removes the time dimension?
You are right about the number of observations. However, your code does not do what you want it to do (estimate a first-differenced model).
If you want a first differenced model, switch the argument method to model (and delete argument effect because it does not make sense for a first differenced model):
model <-plm(Y ~ time*group, model = 'fd', data = dataset,
index = c('id', 'time'))
summary(model)
## Oneway (individual) effect First-Difference Model
##
## Call:
## plm(formula = Y ~ time * group, data = dataset, model = "fd",
## index = c("id", "time"))
##
## Balanced Panel: n = 5, T = 2, N = 10
## Observations used in estimation: 5
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -0.3067240 -0.0012185 0.0012185 0.1367080 0.1700160
## [...]
In the summary output, you can see the number of observations in your original data (N=10) and the number of observations used in the FD model (5).
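The drop from 10 to 5 observations can be reproduced by hand. The following base-R sketch (no plm involved; the seed and object names are arbitrary choices) builds a panel with the same layout as in the question and differences Y within each id.

```r
# Same layout as the question's data: 5 ids observed at time 0 and 1
set.seed(42)
panel <- data.frame(id   = rep(1:5, 2),
                    time = rep(c(0, 1), each = 5),
                    Y    = runif(10))

# Sort by id and time, then difference Y within each id
panel <- panel[order(panel$id, panel$time), ]
dY <- tapply(panel$Y, panel$id, diff)

n_original    <- nrow(panel)   # 10 rows in the levels data
n_differenced <- length(dY)    # 5 rows after differencing (T - 1 = 1 per id)
print(c(n_original, n_differenced))
```

Each id contributes T - 1 differenced rows, so with T = 2 the estimation sample has one row per individual.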
I get sums of squares and mean squares about 10x higher when I use anova on lmerTest::lmer objects compared to lme4::lmer objects. See the R log below. Note the warning that, when I attach the lmerTest package, the stats::sigma function overrides the lme4::sigma function; I suspect that this leads to the discrepancy. In addition, the anova report now says that it is a Type III anova rather than the expected Type I. Is this a bug in the lmerTest package, or is there something about the use of the Kenward-Roger approximation for degrees of freedom that changes the calculation of Sum Sq and Mean Sq and the anova type that I don't understand?
I would append the test file, but it is confidential clinical trial information. If necessary I can see if I can cobble up a test case.
Thanks in advance for any advice you all can provide about this.
> library(lme4)
Loading required package: Matrix
Attaching package: ‘lme4’
The following object is masked from ‘package:stats’:
sigma
> test100 <- lmer(log(value) ~ prepost * lowhi + (1|CID/LotNo/rep),
REML = F, data = GSIRlong, subset = !is.na(value))
> library(lmerTest)
Attaching package: ‘lmerTest’
The following object is masked from ‘package:lme4’:
lmer
The following object is masked from ‘package:stats’:
step
Warning message:
replacing previous import ‘lme4::sigma’ by ‘stats::sigma’ when loading
‘lmerTest’
> test200 <- lmer(log(value) ~ prepost * lowhi + (1|CID/LotNo/rep),
REML = F, data = GSIRlong, subset = !is.na(value))
> anova(test100)
Analysis of Variance Table
Df Sum Sq Mean Sq F value
prepost 1 3.956 3.956 18.4825
lowhi 1 130.647 130.647 610.3836
prepost:lowhi 1 0.038 0.038 0.1758
> anova(test200, ddf = 'Ken')
Analysis of Variance Table of type III with Kenward-Roger
approximation for degrees of freedom
Sum Sq Mean Sq NumDF DenDF F.value Pr(>F)
prepost 37.15 37.15 1 308.04 18.68 2.094e-05 ***
lowhi 1207.97 1207.97 1 376.43 607.33 < 2.2e-16 ***
prepost:lowhi 0.35 0.35 1 376.43 0.17 0.676
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Update: Thanks, Ben. I did a little code archeology on lmerTest to see if I could dope out an explanation for the above anomalies. First, it turns out that lmerTest::lmer just submits the model to lme4::lmer and then relabels the result as an "mermodLmerTest" object. The only effect of this is to invoke versions of summary() and anova() from the lmerTest package rather than the usual defaults from base and stats. (These lmerTest functions are compiled, and I have not yet gone farther to look at the C++ code.) lmerTest::summary just adds three columns to the base::summary result, giving df, t value, and Pr. Note that lmerTest::anova, by default, computes a type III anova rather than a type I as in stats::anova. (Explanation of my second question above.) Not a great choice if one's model includes interactions. One can request a type I, II, or III anova using the type = 1/2/3 option.
However, there are other surprises when using the lmerTest versions of summary and anova, as shown in the R console output below. I used lmerTest's included sleepstudy data so that this code should be replicable.
a. Note that "sleepstudy" has 180 records (with 3 variables)
b. The summaries of fm1 and fm1a are identical except for the added Fixed effects columns. But note that in the lmerTest::summary the ddfs for the intercept and Days are 1371 and 1281 respectively; odd given that there are only 180 records in "sleepstudy."
c. Just as in my original model above, the lme4 and lmerTest versions of anova give very different values of Sum Sq and Mean Sq (30031 and 446.65 respectively).
d. Interestingly, the lmerTest versions of anova using the Satterthwaite and Kenward-Roger estimates of the DenDF are wildly different (5794080 and 28 respectively). The K-R value seems more reasonable.
Given the above issues, I am reluctant at this point to depend on lmerTest to give reliable p-values. Based on Doug Bates's posting (https://stat.ethz.ch/pipermail/r-help/2006-May/094765.html), I am now using (and recommending) the method from the posting by Dan Mirman (http://mindingthebrain.blogspot.ch/2014/02/three-ways-to-get-parameter-specific-p.html) in the final bit of code below to estimate a naive t-test p-value (assuming essentially infinite degrees of freedom) and a Kenward-Roger estimate of df (using the R package 'pbkrtest', the same package used by lmerTest). I couldn't find R code to compute the Satterthwaite estimates. The naive t-test p-value is clearly anti-conservative, but the KR estimate is thought to be pretty good. If the two give similar estimates of p, then I think that one can feel comfortable with a p-value in the range of [naive t-test, KR estimate].
> library(lme4); library(lmerTest); library(pbkrtest);
dim(sleepstudy)
[1] 180 3
>
> fm1 <- lme4::lmer(Reaction ~ Days + (Days|Subject), sleepstudy)
> fm1a <- lmerTest::lmer(Reaction ~ Days + (Days|Subject), sleepstudy)
>
> summary(fm1)
Linear mixed model fit by REML ['lmerMod']
Formula: Reaction ~ Days + (Days | Subject)
Data: sleepstudy
REML criterion at convergence: 1743.6
Scaled residuals:
Min 1Q Median 3Q Max
-3.9536 -0.4634 0.0231 0.4634 5.1793
Random effects:
Groups Name Variance Std.Dev. Corr
Subject (Intercept) 612.09 24.740
Days 35.07 5.922 0.07
Residual 654.94 25.592
Number of obs: 180, groups: Subject, 18
Fixed effects:
Estimate Std. Error t value
(Intercept) 251.405 6.825 36.84
Days 10.467 1.546 6.77
Correlation of Fixed Effects:
(Intr)
Days -0.138
> summary(fm1a)
Linear mixed model fit by REML t-tests use Satterthwaite approximations to
degrees of freedom [lmerMod]
Formula: Reaction ~ Days + (Days | Subject)
Data: sleepstudy
REML criterion at convergence: 1743.6
Scaled residuals:
Min 1Q Median 3Q Max
-3.9536 -0.4634 0.0231 0.4634 5.1793
Random effects:
Groups Name Variance Std.Dev. Corr
Subject (Intercept) 612.09 24.740
Days 35.07 5.922 0.07
Residual 654.94 25.592
Number of obs: 180, groups: Subject, 18
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 251.405 6.825 1371.100 302.06 <2e-16 ***
Days 10.467 1.546 1281.700 55.52 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr)
Days -0.138
Warning message:
In deviance.merMod(object, ...) :
deviance() is deprecated for REML fits; use REMLcrit for the REML
criterion or deviance(.,REML=FALSE) for deviance calculated at the REML fit
>
> anova(fm1)
Analysis of Variance Table
Df Sum Sq Mean Sq F value
Days 1 30031 30031 45.853
> anova(fm1a, ddf = 'Sat', type = 1)
Analysis of Variance Table of type I with Satterthwaite
approximation for degrees of freedom
Sum Sq Mean Sq NumDF DenDF F.value Pr(>F)
Days 446.65 446.65 1 5794080 45.853 1.275e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Warning message:
In deviance.merMod(object, ...) :
deviance() is deprecated for REML fits; use REMLcrit for the REML criterion or deviance(.,REML=FALSE) for deviance calculated at the REML fit
> anova(fm1a, ddf = 'Ken', type = 1)
Analysis of Variance Table of type I with Kenward-Roger
approximation for degrees of freedom
Sum Sq Mean Sq NumDF DenDF F.value Pr(>F)
Days 446.65 446.65 1 27.997 45.853 2.359e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Warning message:
In deviance.merMod(object, ...) :
deviance() is deprecated for REML fits; use REMLcrit for the REML criterion or deviance(.,REML=FALSE) for deviance calculated at the REML fit
>
> # t.test
> coefs <- data.frame(coef(summary(fm1)))
> coefs$p.z <- 2 * (1 - pnorm(abs(coefs$t.value)))
> coefs
Estimate Std..Error t.value p.z
(Intercept) 251.40510 6.824556 36.838311 0.000000e+00
Days 10.46729 1.545789 6.771485 1.274669e-11
>
> # Kenward-Rogers
> df.KR <- get_ddf_Lb(fm1, fixef(fm1))
> df.KR
[1] 25.89366
> coefs$p.KR <- 2 * (1 - pt(abs(coefs$t.value), df.KR))
> coefs
Estimate Std..Error t.value p.z p.KR
(Intercept) 251.40510 6.824556 36.838311 0.000000e+00 0.0000e+00
Days 10.46729 1.545789 6.771485 1.274669e-11 3.5447e-07
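The two p-values for Days can be re-derived from the printed numbers alone, without refitting the model. The sketch below copies the t value, the Kenward-Roger df, the Sum Sq and the residual variance from the console output above; the variable names are arbitrary. It also shows the lower.tail = FALSE form, which avoids the cancellation in 1 - pnorm(...) for very small p-values. As a side observation, in this particular output the lme4 Sum Sq divided by the residual variance reproduces the printed F value, which suggests that the lme4 table is expressed on the residual-variance scale.

```r
t_days <- 6.771485      # t value for Days, copied from the output above
df_kr  <- 25.89366      # Kenward-Roger df, copied from the output above

p_naive <- 2 * (1 - pnorm(abs(t_days)))      # normal approximation
p_kr    <- 2 * (1 - pt(abs(t_days), df_kr))  # t distribution with KR df

# Numerically more stable equivalents for tiny tail probabilities
p_naive_stable <- 2 * pnorm(abs(t_days), lower.tail = FALSE)
p_kr_stable    <- 2 * pt(abs(t_days), df_kr, lower.tail = FALSE)

# lme4 anova: Sum Sq (30031) over the residual variance (654.94)
f_from_table <- 30031 / 654.94

print(c(p_naive = p_naive, p_kr = p_kr, f_from_table = f_from_table))
```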
I’m trying to create a function which creates an index (starting at 100) and then adjusts this index according to the results of investments. So, in a nutshell, if the first investment gives a profit of 5%, then the index will stand at 105; if the second result is -7%, then the index stands at 97.65. In this question, when I use the word "index", I'm not referring to the index function of the zoo package.
Besides creating this index, my goal is also to create a function which can be applied to various subsets of my complete data set (i.e. with the use of sapply and its friends).
Here’s the function which I have so far (data at end of this question):
CalculateIndex <- function(x){
totalAccount <- accountValueStart
if(x$TradeResult.Currency == head(x$TradeResult.Currency., n = 1)){
indexedValues <- 100 + ( 100 *((((x$Size.Units. * x$EntryPrice) / totalAccount) * x$TradeResult.Percent.) / 100))
# Update the accountvalue
totalAccount <- totalAccount + x$TradeResult.Currency.
}
else{ # the value is not the first
indexedValues <- c(indexedValues,
indexedValues[-1] + (indexedValues[-1] *(((x$Size.Units. * x$EntryPrice) / totalAccount) * x$TradeResult.Percent.) / 100)
)
# Update the accountvalue
totalAccount <- totalAccount + x$TradeResult.Currency.
}
return(indexedValues)
}
In words the function does (read: is intended to do) the following:
If the value is the first, use 100 as a starting point for the index. If the value is not the first, use the previously calculated index value as the starting point for calculating the new index value. Besides this, the function also takes the weight of the individual result (compared with the totalAccount value) into account.
The problem:
Using this CalculateIndex function on the theData data frame gives the following incorrect output:
> CalculateIndex(theData)
[1] 99.97901 99.94180 99.65632 101.88689 100.89309 98.92878 102.02911 100.49159 98.52955 102.02243 98.43655 100.76502 99.34869 100.76401 101.18014 99.75136 97.90130
[18] 100.39935 99.81311 101.34961
Warning message:
In if (x$TradeResult.Currency == head(x$TradeResult.Currency., n = 1)) { :
the condition has length > 1 and only the first element will be used
Edit:
Wow, I already got a down vote, though I thought my question was already too long. Sorry, I thought (and still think) the problem lies inside my loop, so I didn't want to bore you with the details, which I thought would only lead to fewer answers. Sorry, a misjudgement on my part.
The problem with the above output from CalculateIndex is that the results are wildly different from Excel. Even though this could be the result of rounding errors (as Joris mentions below), I doubt it. Compared with the Excel results, the R results differ considerably:
R output Excel calculate values
99,9790085700 99,97900857
99,9418035700 99,92081189
99,6563228600 99,57713687
101,8868850000 101,4639947
100,8930864300 102,3570786
98,9287771400 101,2858564
102,0291071400 103,3149664
100,4915864300 103,806556
98,5295542900 102,3361186
102,0224285700 104,3585552
98,4365550000 102,795089
100,7650171400 103,5601228
99,3486857100 102,9087897
100,7640057100 103,6728077
101,1801400000 104,8529634
99,7513600000 104,6043164
97,9013000000 102,5055298
100,3993485700 102,9048999
99,8131085700 102,7179995
101,3496071400 104,0676555
I think it would be fair to say that the difference in output isn't the result of R versus Excel problems, but more an error in my function. So, let's focus on the function.
The manual calculation of the function
The function uses different variables:
Size.Units.: the number of units bought at the EntryPrice,
EntryPrice: the price at which the stocks are bought,
TradeResult.Percent.: the percentage gain or loss resulting from the investment,
TradeResult.Currency.: the currency value ($) of the gain or loss resulting from the investment.
These variables are used in the following section of the function:
100 + ( 100 *((((x$Size.Units. * x$EntryPrice) / totalAccount) * x$TradeResult.Percent.) / 100))
and
indexedValues[-1] + (indexedValues[-1] *(((x$Size.Units. * x$EntryPrice) / totalAccount) * x$TradeResult.Percent.) / 100)
Both formulas are essentially the same, with the difference that the first starts at 100, and the second uses the previous value to calculate the new indexed value.
The formula can be broken down into steps:
First, x$Size.Units. * x$EntryPrice determines the total position that was taken: buying 100 shares at a price of 48.98 gives a position of $4898.
The resulting total position is then divided by the total account size (i.e. totalAccount). This is needed to scale the impact of one position relative to the complete portfolio. For example, if our 100 shares bought at 48.98 drop 10 percent, the calculated index (i.e. the output of the CalculateIndex function) doesn't have to drop 10%, because of course not all the money in totalAccount is invested in one stock. So, by dividing the total position by totalAccount we get a ratio which tells us how much of the account is invested. For example, a position of 4898 dollars (on a total account of 14000) results in a total account loss of 3.49857% if the stock drops 10% (i.e. 4898 / 14000 = 0.349857, and 0.349857 * 10% = 3.49857%).
This ratio (of invested amount versus total amount) is then multiplied by x$TradeResult.Percent. to get the percentage impact on the total account (see the calculation example in the previous paragraph).
As a last step, the percentage loss on the total account is applied to the index value (which starts at 100). In this case, the first investment in 100 shares bought at 48.98 dollars lets the index drop from its starting point of 100 to 99.97901, reflecting the losing trade's impact on the total account.
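The arithmetic in these steps can be checked directly. The short sketch below uses only numbers quoted in the text (100 shares at 48.98, an account of 14000, a hypothetical 10% drop, and the first trade result of -0.06%); the variable names are chosen for illustration.

```r
account_start <- 14000
position      <- 100 * 48.98               # 100 shares at 48.98 -> 4898
weight        <- position / account_start  # share of the account invested

# Hypothetical 10% drop in the stock, scaled to the whole account (in percent)
account_loss_pct <- weight * 10

# First row of theData: a trade result of -0.06 percent
index_first <- 100 + 100 * (weight * -0.06) / 100

print(c(weight = weight, account_loss_pct = account_loss_pct,
        index_first = index_first), digits = 10)
```

The first index value agrees with both the R and the Excel column, so any disagreement between the two has to come from how later rows are chained.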
End of Edit
Stripping the function clean and then adding a part of the formula at a time, so to uncover the error, I came to the following step where the error seems to reside:
CalculateIndex <- function(x){
totalAccount <- accountValueStart
if(x$TradeResult.Currency == head(x$TradeResult.Currency., n = 1)){
indexedValues <- totalAccount
# Update the accountvalue
totalAccount <- totalAccount + x$TradeResult.Currency.
}
else{ # the value is not the first
indexedValues <- c(indexedValues, totalAccount)
# Update the accountvalue
totalAccount <- totalAccount + x$TradeResult.Currency.
}
return(indexedValues)
}
> CalculateIndex(theData)
[1] 14000
Warning message:
In if (x$TradeResult.Currency == head(x$TradeResult.Currency., n = 1)) { :
the condition has length > 1 and only the first element will be used
So, it seems that if I just use the totalAccount variable, the function doesn’t get updated correctly. This seems to suggest there is some error with the basics of the if else statement, because it only outputs the first value.
If I remove the else statement from the function, I do get values for each of the rows in theData. However, these are then wrongly calculated. So, it seems to me that there is some error in how this function updates the totalAccount variable. I don’t see where I made an error, so any suggestion would be highly appreciated. What am I doing wrong?
The Data
Here’s what my data looks like:
> theData
Size.Units. EntryPrice TradeResult.Percent. TradeResult.Currency.
1 100 48.98 -0.06 -3
11 100 32.59 -0.25 -8
12 100 32.51 -1.48 -48
2 100 49.01 5.39 264
13 100 32.99 3.79 125
14 100 34.24 -4.38 -150
3 100 51.65 5.50 284
4 100 48.81 1.41 69
15 100 35.74 -5.76 -206
5 100 49.50 5.72 283
6 100 46.67 -4.69 -219
16 100 33.68 3.18 107
7 100 44.48 -2.05 -91
17 100 32.61 3.28 107
8 100 45.39 3.64 165
9 100 47.04 -0.74 -35
10 100 47.39 -6.20 -294
18 100 33.68 1.66 56
19 100 33.12 -0.79 -26
20 100 32.86 5.75 189
theData <- structure(list(X = c(1L, 11L, 12L, 2L, 13L, 14L, 3L, 4L, 15L,
5L, 6L, 16L, 7L, 17L, 8L, 9L, 10L, 18L, 19L, 20L), Size.Units. = c(100L,
100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L,
100L, 100L, 100L, 100L, 100L, 100L, 100L, 100L), EntryPrice = c(48.98,
32.59, 32.51, 49.01, 32.99, 34.24, 51.65, 48.81, 35.74, 49.5,
46.67, 33.68, 44.48, 32.61, 45.39, 47.04, 47.39, 33.68, 33.12,
32.86), TradeResult.Percent. = c(-0.06, -0.25, -1.48, 5.39, 3.79,
-4.38, 5.5, 1.41, -5.76, 5.72, -4.69, 3.18, -2.05, 3.28, 3.64,
-0.74, -6.2, 1.66, -0.79, 5.75), TradeResult.Currency. = c(-3L,
-8L, -48L, 264L, 125L, -150L, 284L, 69L, -206L, 283L, -219L,
107L, -91L, 107L, 165L, -35L, -294L, 56L, -26L, 189L)), .Names = c("X",
"Size.Units.", "EntryPrice", "TradeResult.Percent.", "TradeResult.Currency."
), class = "data.frame", row.names = c(NA, -20L))
# Set the account start # 14000
> accountValueStart <- 14000
Your code looks very strange, and it seems you have a lot of misconceptions about R that come from another programming language. Gavin and Gillespie have already pointed out why you get the warning. Let me add some tips for far better coding:
[-1] does NOT mean: drop the last one. It means "keep everything but the first value", which also explains why you get erroneous results.
calculate common things in the beginning, to unclutter your code.
head(x$TradeResult.Currency., n = 1) is the same as x$TradeResult.Currency.[1].
Keep an eye on your vectors. Most of the mistakes in your code come from forgetting you're working with vectors.
If you need a value to be the first in a vector, put that OUTSIDE of any loop you'd use, never add an if-clause in the function.
predefine your vectors/matrices as much as possible, that goes a lot faster and gives less memory headaches when working with big data.
vectorization, vectorization, vectorization. Did I mention vectorization?
Learn the use of debug(), debugonce() and browser() to check what your function is doing. Many of your problems could have been solved by checking the objects when manipulated within the function.
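A few of the tips above can be illustrated on a small vector: what a negative index does, how to get the last element or drop it, and how to preallocate a result and set its first element outside the loop. The vector v and the other names are made up for this illustration.

```r
v <- c(10, 20, 30, 40)

all_but_first <- v[-1]           # negative index DROPS element 1
last_value    <- v[length(v)]    # last element
last_value2   <- tail(v, 1)      # last element, alternative
all_but_last  <- head(v, -1)     # everything except the last element

# Preallocate, set the first value outside the loop, then fill the rest
running <- numeric(length(v))
running[1] <- v[1]
for (i in seq_along(v)[-1]) {
  running[i] <- running[i - 1] + v[i]
}

# The loop above builds a cumulative sum, so cumsum(v) is the vectorised form
print(all_but_first)
print(all_but_last)
print(identical(running, cumsum(v)))
```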
This said and taken into account, your function becomes :
CalculateIndex <- function(x,accountValueStart){
# predefine your vector
indexedValues <- vector("numeric",nrow(x))
# get your totalAccount calculated FAST. This is a VECTOR!!!
totalAccount <- cumsum(c(accountValueStart,x$TradeResult.Currency.))
#adjust length:
totalAccount <- totalAccount[-(nrow(x)+1)]
# only once this calculation. This is a VECTOR!!!!
totRatio <- 1+(((x$Size.Units. * x$EntryPrice)/totalAccount) *
x$TradeResult.Percent.)/100
# and now the calculations
indexedValues[1] <- 100 * totRatio[1]
for(i in 2:nrow(x)){
indexedValues[i] <- indexedValues[i-1]*totRatio[i]
}
return(indexedValues)
}
and returns
> CalculateIndex(theData,14000)
 [1]  99.97901  99.92081  99.57714 101.46399 102.35708 101.28586 103.31497
 [8] 103.80656 102.33612 104.35856 102.79509 103.56012 102.90879 103.67281
[15] 104.85296 104.60432 102.50553 102.90490 102.71800 104.06766
So now you do:
invisible(replicate(10,print("I will never forget about vectorization any more!")))
The warning message is coming from this line:
if(x$TradeResult.Currency == head(x$TradeResult.Currency., n = 1)){
It is easy to see why; x$TradeResult.Currency is a vector and thus the comparison with head(x$TradeResult.Currency., n = 1) yields a vector of logicals. (By the way, why not x$TradeResult.Currency[1] instead of the head() call?). if() requires a single logical not a vector of logicals, and that is what the warning is about. ifelse() is useful if you want to do one of two things depending upon a condition that gives a vector of logicals.
In effect, what you are doing is only entering the if() part of the statement and it gets executed once only, because the first element of x$TradeResult.Currency == head(x$TradeResult.Currency., n = 1) is TRUE and R ignores the others.
> if(c(TRUE, FALSE)) {
+ print("Hi")
+ } else {
+ print("Bye")
+ }
[1] "Hi"
Warning message:
In if (c(TRUE, FALSE)) { :
the condition has length > 1 and only the first element will be used
> ifelse(c(TRUE, FALSE), print("Hi"), print("Bye"))
[1] "Hi"
[1] "Bye"
[1] "Hi" "Bye"
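Note that in R 4.2.0 and later, an if() condition of length greater than one is an error rather than a warning, so the first transcript above would stop instead of printing "Hi". The sketch below shows the usual ways to avoid the problem: reduce the logical vector to a single value with any() or all(), or use ifelse() when an elementwise result is wanted. The vector trade_result holds three values from the question's data; the other names are illustrative.

```r
trade_result <- c(-3, -8, 264)

# Elementwise comparison: one logical per element, not a single TRUE/FALSE
is_first_value <- trade_result == trade_result[1]

# if() needs exactly one logical, so reduce the vector explicitly
if (any(trade_result > 0)) {
  msg <- "at least one winning trade"
} else {
  msg <- "no winning trades"
}

# ifelse() works elementwise and returns a vector of the same length
labels <- ifelse(trade_result > 0, "win", "loss")

print(is_first_value)
print(msg)
print(labels)
```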
As to solving your real problem:
CalculateIndex2 <- function(x, value, start = 100) {
rowSeq <- seq_len(NROW(x))
totalAc <- cumsum(c(value, x$TradeResult.Currency.))[rowSeq]
idx <- numeric(length = nrow(x))
interm <- (((x$Size.Units. * x$EntryPrice) / totalAc) *
x$TradeResult.Percent.) / 100
for(i in rowSeq) {
idx[i] <- start + (start * interm[i])
start <- idx[i]
}
idx
}
which when used on theData gives:
> CalculateIndex2(theData, 14000)
[1] 99.97901 99.92081 99.57714 101.46399 102.35708 101.28586 103.31497
[8] 103.80656 102.33612 104.35856 102.79509 103.56012 102.90879 103.67281
[15] 104.85296 104.60432 102.50553 102.90490 102.71800 104.06766
What you want is a recursive function (IIRC); the current index is some function of the previous index. These are hard to solve in a vectorised way in R, hence the loop.
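For this particular recurrence a loop-free form does seem possible, because each index value is the previous one multiplied by a factor, and repeated multiplication is what cumprod() computes. The sketch below is an alternative under that assumption, not the method of either answer; calculate_index_vec and its argument names are hypothetical, and the columns are copied from theData in the question.

```r
# Columns copied from theData in the question
entry_price <- c(48.98, 32.59, 32.51, 49.01, 32.99, 34.24, 51.65, 48.81, 35.74,
                 49.5, 46.67, 33.68, 44.48, 32.61, 45.39, 47.04, 47.39, 33.68,
                 33.12, 32.86)
result_pct  <- c(-0.06, -0.25, -1.48, 5.39, 3.79, -4.38, 5.5, 1.41, -5.76,
                 5.72, -4.69, 3.18, -2.05, 3.28, 3.64, -0.74, -6.2, 1.66,
                 -0.79, 5.75)
result_cur  <- c(-3, -8, -48, 264, 125, -150, 284, 69, -206, 283, -219, 107,
                 -91, 107, 165, -35, -294, 56, -26, 189)
size_units  <- rep(100, 20)

calculate_index_vec <- function(size, price, pct, cur, account_start,
                                start = 100) {
  # Account value BEFORE each trade: start value plus all earlier results
  account_before <- account_start + c(0, cumsum(cur)[-length(cur)])
  # Growth factor contributed by each trade, weighted by position size
  growth <- 1 + ((size * price) / account_before) * pct / 100
  # Chain the factors: index[i] = index[i - 1] * growth[i]
  start * cumprod(growth)
}

index_vec <- calculate_index_vec(size_units, entry_price, result_pct,
                                 result_cur, account_start = 14000)

# The simple example from the question: +5% then -7%, starting at 100
simple_index <- 100 * cumprod(1 + c(5, -7) / 100)

print(round(index_vec, 5))
print(simple_index)
```

The printed values should agree with the loop-based results shown in the answers.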
I'm still slightly confused as to what exactly you want to do, but hopefully the following will be helpful.
Your R script gives the same answers as your Excel function for the first value. You see a difference because R doesn't print out all digits.
> tmp = CalculateIndex(theData)
Warning message:
In if (x$TradeResult.Currency == head(x$TradeResult.Currency., n = 1)) { :
the condition has length > 1 and only the first element will be used
> print(tmp, digits=10)
[1] 99.97900857 99.94180357 99.65632286 101.88688500 100.89308643
<snip>
The warning message appears because x$TradeResult.Currency is a vector that is being compared to a single number, so the comparison yields a vector of logicals.
That warning message is also where your bug lives. In your if statement, you never execute the else part: as the warning message states, only the first element of the comparison is used, and that element is TRUE.