Assume the following df:
ib c d1 d2
0 1.14 1 1 0
1 1.0 1 1 0
2 0.71 1 1 0
3 0.6 1 1 0
4 0.66 1 1 0
5 1.0 1 1 0
6 1.26 1 1 0
7 1.29 1 1 0
8 1.52 1 1 0
9 1.31 1 1 0
10 0.89 1 0 1
d1 and d2 are perfectly collinear. Now I estimate the following regression model:
import statsmodels.api as sm
reg = sm.OLS(df['ib'], df[['c', 'd1', 'd2']]).fit().summary()
reg
This gives me the following output:
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: ib R-squared: 0.087
Model: OLS Adj. R-squared: -0.028
Method: Least Squares F-statistic: 0.7590
Date: Thu, 17 Nov 2022 Prob (F-statistic): 0.409
Time: 12:19:34 Log-Likelihood: -1.5470
No. Observations: 10 AIC: 7.094
Df Residuals: 8 BIC: 7.699
Df Model: 1
Covariance Type: nonrobust
===============================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------
c 0.7767 0.111 7.000 0.000 0.521 1.033
d1 0.2433 0.127 1.923 0.091 -0.048 0.535
d2 0.5333 0.213 2.499 0.037 0.041 1.026
==============================================================================
Omnibus: 0.257 Durbin-Watson: 0.760
Prob(Omnibus): 0.879 Jarque-Bera (JB): 0.404
Skew: 0.043 Prob(JB): 0.817
Kurtosis: 2.019 Cond. No. 8.91e+15
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 2.34e-31. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
"""
However, including c, d1 and d2 represents the well-known dummy variable trap, which, from my understanding, should make it impossible to estimate the model. Why is this not the case here?
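To make the dependence explicit: every row above satisfies

$$d_{1i} + d_{2i} = 1 = c_i,$$

so the constant column of the design matrix $X = [c, d_1, d_2]$ equals $d_1 + d_2$, $X$ has rank 2 rather than 3, and $X^\top X$ is singular. This is exactly what note [2] and the condition number of 8.91e+15 in the output are flagging.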
I was playing around with a simple linear model when I noticed that, in the ANOVA table, the ratio MSreg/MSres does not exactly correspond to the F value. Indeed, the two values are very similar but not the same.
Here is my script:
#quick view of the dataset
> head(my_data)
Diameter Height
1 0.325 0.080
2 0.320 0.100
3 0.280 0.110
4 0.125 0.040
5 0.400 0.135
6 0.335 0.100
#setting up the lm()
> ls1 <- lm(Diameter~Height, data=my_data)
> anova(ls1)
Analysis of Variance Table
Response: Diameter
Df Sum Sq Mean Sq F value Pr(>F)
Height 1 0.82415 0.82415 602.63 < 2.2e-16 ***
Residuals 98 0.13402 0.00137
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Here 0.82415/0.00137 = 601.5693, which is not the F value in the table. Is there a particular reason for that?
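One thing worth checking is rounding: the table prints Sum Sq and Mean Sq rounded for display, while the F value is computed from the unrounded numbers. A quick check at full precision, assuming the ls1 fit above:
# Redo the ratio from the stored (unrounded) anova table
a <- anova(ls1)
a["Height", "Mean Sq"] / a["Residuals", "Mean Sq"]
The result should agree with the printed F value (602.63 after rounding), unlike the hand computation from the rounded 0.00137.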
Has anyone used panel VAR in R?
Currently I'm using the panelvar package in R, and I'm getting this error:
Error in `[.data.frame`(data, , c(colnames(data)[panel_identifier], required_vars)) :
undefined columns selected
My syntax currently is:
model1<-pvargmm(
dependent_vars = c("Change.."),
lags = 2,
exog_vars = c("Price"),
transformation = "fd",
data = base1,
panel_identifier = c("id", "t"),
steps = c("twostep"),
system_instruments = FALSE,
max_instr_dependent_vars = 99,
min_instr_dependent_vars = 2L,
collapse = FALSE)
I don't know why my panel_identifier is not working. It's pretty similar to the example given by the panelvar package, yet it doesn't work. I want to point out that base1 is in data.frame format. Any ideas? Also, my data is structured like this:
head(base1)
id t country DDMMYY month month_text day Date_txt year Price Open
1 1 1296 China 1-4-2020 4 Apr 1 Apr 01 2020 12588.24 12614.82
2 1 1295 China 31-3-2020 3 Mar 31 Mar 31 2020 12614.82 12597.61
High Low Vol. Change..
1 12775.83 12570.32 NA -0.0021
2 12737.28 12583.05 NA 0.0014
Thanks in advance!
Check the documentation of the package and the SSRN paper. For me it helped to ensure all input formats are identical (you can check this with the str(base1) command). For example, they write:
library(panelvar)
data("Dahlberg")
ex1_dahlberg_data <-
pvargmm(dependent_vars = .......
When I look at it, I get:
> str(Dahlberg)
'data.frame': 2385 obs. of 5 variables:
$ id : Factor w/ 265 levels "114","115","120",..: 1 1 1 1 1 1 1 1 1 2 ...
$ year : Factor w/ 9 levels "1979","1980",..: 1 2 3 4 5 6 7 8 9 1 ...
$ expenditures: num 0.023 0.0266 0.0273 0.0289 0.0226 ...
$ revenues : num 0.0182 0.0209 0.0211 0.0234 0.018 ...
$ grants : num 0.00544 0.00573 0.00566 0.00589 0.00559 ...
For example, the input data must be a plain data.frame (in my case it had additional classes like tibble or data.table). I resolved it by calling as.data.frame() on it.
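In other words, something along these lines (a sketch; the column names are taken from your head(base1) output) before calling pvargmm:
# Drop extra classes such as tbl_df or data.table
base1 <- as.data.frame(base1)
# Verify that the panel identifiers exist and have sensible types
str(base1[, c("id", "t")])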
I am trying to calculate the Hamming weight of a vector in Matlab.
function Hamming_weight (vet_dec)
Ham_Weight = sum(dec2bin(vet_dec) == '1')
endfunction
The vector is:
Hamming_weight ([208 15 217 252 128 35 50 252 209 120 97 140 235 220 32 251])
However, this gives the following result, which is not what I want:
Ham_Weight =
10 10 9 9 9 5 5 7
I would be very grateful if you could help me.
You are summing over the wrong dimension!
sum(dec2bin(vet_dec) == '1',2).'
ans =
3 4 5 6 1 3 3 6 4 4 3 3 6 5 1 7
dec2bin(vet_dec) creates a matrix like this:
11010000
00001111
11011001
11111100
10000000
00100011
00110010
11111100
11010001
01111000
01100001
10001100
11101011
11011100
00100000
11111011
As you can see, you're interested in the sum of each row, not each column. Use the second input argument of sum, as in sum(x, 2), which specifies the dimension you want to sum along.
Note that this approach is horribly slow, as you can see from this question.
EDIT
For this to be a valid and meaningful MATLAB function, you must change your function definition a bit:
function ham_weight = hamming_weight(vector) % Return the variable ham_weight
ham_weight = sum(dec2bin(vector) == '1', 2).'; % Don't transpose if
% you want a column vector
end % endfunction is not a MATLAB command.
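With that definition, hamming_weight([208 15 217 252 128 35 50 252 209 120 97 140 235 220 32 251]) returns the row vector of weights shown above: 3 4 5 6 1 3 3 6 4 4 3 3 6 5 1 7.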
I have a data frame with results for certain instruments, and I want to create a new column that contains the total of each row. Because I have different numbers of instruments each time I run an analysis on new data, I need a function to dynamically calculate the new column with the Row Total.
To simplify my problem, here’s what my data frame looks like:
Type Value
1 A 10
2 A 15
3 A 20
4 A 25
5 B 30
6 B 40
7 B 50
8 B 60
9 B 70
10 B 80
11 B 90
My goal is to achieve the following:
A B Total
1 10 30 40
2 15 40 55
3 20 50 70
4 25 60 85
5 70 70
6 80 80
7 90 90
I’ve tried various methods, but this approach holds the most promise:
myList <- list(a = c(10, 15, 20, 25), b = c(30, 40, 50, 60, 70, 80, 90))
tmpDF <- data.frame(sapply(myList, '[', 1:max(sapply(myList, length))))
> tmpDF
a b
1 10 30
2 15 40
3 20 50
4 25 60
5 NA 70
6 NA 80
7 NA 90
totalSum <- rowSums(tmpDF)
totalSum <- data.frame(totalSum)
tmpDF <- cbind(tmpDF, totalSum)
> tmpDF
a b totalSum
1 10 30 40
2 15 40 55
3 20 50 70
4 25 60 85
5 NA 70 NA
6 NA 80 NA
7 NA 90 NA
Even though this approach did succeed in combining two vectors of different lengths, the rowSums function gives the wrong values in this example. Besides that, my original data isn't in list format, so I can't apply such a 'solution'.
I think I’m overcomplicating this problem, so I was wondering how I can:
1. Subset data from a data frame on the basis of ‘Type’,
2. Insert these individual subsets of different lengths into a new data frame,
3. Add a ‘Total’ column to this data frame which is the correct sum of the individual subsets.
An added complication is that this needs to be done in a function or in an otherwise dynamic way, so that I don’t need to manually subset the dozens of ‘Types’ (A, B, C, and so on) in my data frame.
Here’s what I have so far, which doesn’t work, but illustrates the lines I’m thinking along:
TotalDf <- function(x){
  tmpNumberOfTypes <- c(levels(x$Type))
  for (i in tmpNumberOfTypes){
    subSetofData <- subset(x, Type = i, select = Value)
    if (i == 1) {
      totalDf <- subSetofData
    } else {
      totalDf <- cbind(totalDf, subSetofData)
    }
  }
  return(totalDf)
}
Thanks in advance for any thoughts or ideas on this,
Regards,
EDIT:
Thanks to the comment of Joris (see below) I was set in the right direction; however, when trying to translate his solution to my data frame, I ran into additional problems. His proposed answer works and gives me the following (correct) sums of the values of A and B:
> tmp78 <- tapply(DF$value,DF$id,sum)
> tmp78
1 2 3 4 5 6
6 8 10 12 9 10
> data.frame(tmp78)
tmp78
1 6
2 8
3 10
4 12
5 9
6 10
However, when I try this solution on my data frame, it doesn’t work:
> subSetOfData <- copyOfTradesList[c(1:3,11:13),c(1,10)]
> subSetOfData
Instrument AccountValue
1 JPM 6997
2 JPM 7261
3 JPM 7545
11 KFT 6992
12 KFT 6944
13 KFT 7069
> unlist(sapply(rle(subSetOfData$Instrument)$lengths,function(x) 1:x))
Error in rle(subSetOfData$Instrument) : 'x' must be an atomic vector
> subSetOfData$InstrumentNumeric <- as.numeric(subSetOfData$Instrument)
> unlist(sapply(rle(subSetOfData$InstrumentNumeric)$lengths,function(x) 1:x))
[,1] [,2]
[1,] 1 1
[2,] 2 2
[3,] 3 3
> subSetOfData$id <- unlist(sapply(rle(subSetOfData$InstrumentNumeric)$lengths,function(x) 1:x))
Error in `$<-.data.frame`(`*tmp*`, "id", value = c(1L, 2L, 3L, 1L, 2L, :
replacement has 3 rows, data has 6
I have the disturbing feeling that I’m going around in circles…
Two thoughts:
1) You could use na.rm=TRUE in rowSums.
2) How do you know which value has to go with which? You might add some indexing.
E.g.:
DF <- data.frame(
type=c(rep("A",4),rep("B",6)),
value = 1:10,
stringsAsFactors=F
)
DF$id <- unlist(lapply(rle(DF$type)$lengths,function(x) 1:x))
Now this allows you to easily tapply the sum over the original data frame:
tapply(DF$value,DF$id,sum)
And, more importantly, to get your data frame into the correct form:
> DF
type value id
1 A 1 1
2 A 2 2
3 A 3 3
4 A 4 4
5 B 5 1
6 B 6 2
7 B 7 3
8 B 8 4
9 B 9 5
10 B 10 6
> library(reshape)
> cast(DF,id~type)
id A B
1 1 1 5
2 2 2 6
3 3 3 7
4 4 4 8
5 5 NA 9
6 6 NA 10
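Putting the two thoughts together, a sketch (reusing DF and the cast result from above):
# Reshape to wide, then total each row while skipping the padding NAs
wide <- cast(DF, id ~ type)
wide$Total <- rowSums(wide[, c("A", "B")], na.rm = TRUE)
wide
The resulting totals (6, 8, 10, 12, 9, 10) match the tapply sums shown in the EDIT above.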
TV <- data.frame(Type = c("A","A","A","A","B","B","B","B","B","B","B")
, Value = c(10,15,20,25,30,40,50,60,70,80,90)
, stringsAsFactors = FALSE)
# Added Type C for testing
# TV <- data.frame(Type = c("A","A","A","A","B","B","B","B","B","B","B", "C", "C", "C")
# , Value = c(10,15,20,25,30,40,50,60,70,80,90, 100, 150, 130)
# , stringsAsFactors = FALSE)
# Number of observations per Type
lnType <- with(TV, tapply(Value, Type, length))
lnType <- as.integer(lnType)
lnType
# Within-group index 1..n for each Type, so rows of different Types line up
id <- unlist(mapply(FUN = rep_len, length.out = lnType, x = list(1:max(lnType))))
(TV <- cbind(id, TV))
require(reshape2)
tvWide <- dcast(TV, id ~ Type)
# Alternatively
# tvWide <- reshape(data = TV, direction = "wide", timevar = "Type", ids = c(id, Type))
tvWide <- subset(tvWide, select = -id)
# If you want something neat without the <NA>
# for (i in 1:ncol(tvWide)) {
#   for (j in 1:nrow(tvWide)) {
#     if (is.na(tvWide[j, i])) {
#       tvWide[j, i] <- 0
#     }
#   }
# }
tvWide
transform(tvWide, rowSum=rowSums(tvWide, na.rm = TRUE))
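With the example TV data this reproduces the Total column from the question: 40, 55, 70, 85 for the first four rows, then 70, 80, 90 for the rows where only B has a value; na.rm = TRUE is what keeps those last three rows from collapsing to NA.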