MANOVA or Multiple Regression

We have several independent variables (some continuous with more than 5 levels, some binary, and some quasi-interval, i.e. categorical with 5 levels). We also have 5 dependent variables that share a common construct. Is it useful to conduct a MANOVA with the continuous/quasi-interval variables as covariates and the binary ones as factors, or to perform 5 separate multiple regression analyses?
Thank you

In general, it is inadvisable to perform multiple (univariate) analyses when you can replace them with one (multivariate) analysis. Neglecting this inflates the Type I error rate unless a suitable correction is applied, and it can also mean overlooking an effect that arises from the interaction between variables analysed separately.
With this in mind, you would probably be better off performing a single MANOVA. Covariates, however, should be determined by your experimental design rather than by the type of variable (nominal, categorical, or continuous).
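The inflation of Type I error from separate univariate analyses is easy to see in a small simulation. This is a numpy/scipy sketch, not anything from the question: the group size of 30, the 5 dependent variables, and α = 0.05 are all illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, n_dvs, alpha, reps = 30, 5, 0.05, 2000

familywise_hits = 0
for _ in range(reps):
    # Two groups of 30, five independent DVs, no true effect anywhere.
    a = rng.normal(size=(n, n_dvs))
    b = rng.normal(size=(n, n_dvs))
    pvals = [stats.ttest_ind(a[:, j], b[:, j]).pvalue for j in range(n_dvs)]
    if min(pvals) < alpha:  # at least one DV looks "significant" by chance
        familywise_hits += 1

fwer = familywise_hits / reps
print(f"observed familywise error rate: {fwer:.3f}")
print(f"theoretical for independent tests: {1 - (1 - alpha) ** n_dvs:.3f}")
```

Even with no real effects, roughly one experiment in four reports at least one significant univariate result, which is why a single multivariate test (or a correction) is preferable.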

Related

Simulating a matrix of variables with predefined correlation structure

For a simulation study I am working on, we are trying to test an algorithm that aims to identify specific culprit factors that predict a binary outcome of interest from a large mixture of possible exposures that are mostly unrelated to the outcome. To test this algorithm, I am trying to simulate the following data:
A binary dependent variable
A set of, say, 1000 variables, most binary and some continuous, that are not associated with the outcome (that is, are completely independent from the binary dependent variable, but that can still be correlated with one another).
A group of 10 or so binary variables that will be associated with the dependent variable. I will determine a priori the magnitude of their correlation with the binary dependent variable, as well as their frequency in the data.
Generating a random set of binary variables is easy. But is there a way of doing this while ensuring that none of these variables are correlated with the dependent outcome?
Thank you!
"But is there a way of doing this while ensuring that none of these variables are correlated with the dependent outcome?"
With statistical sampling you can't ensure anything, you can only adjust the acceptable risk. Finding an acceptable level of risk may be harder than many people think.
Spurious correlations are a very real phenomenon. Real independent observations will often contain correlations, and if you want to actually test your algorithm to see how it will perform in reality then your tests should produce such phenomena in a manner similar to the real world—you should be generating independent candidate factors and allowing spurious correlations to occur.
If you are performing ~1000 independent tests of candidate factors at a risk level of α = 0.05, you can expect around 50 truly unassociated factors to appear significant by chance. To avoid this, you need to tighten the per-test threshold using something along the lines of a Bonferroni correction: with 1000 simultaneous tests, each individual test is held to α/1000 = 0.00005. Recall that statistical discriminating power depends on the standard error, which shrinks only with the square root of the sample size, so the stricter threshold raises the critical value each test statistic must clear and therefore increases the sample size needed to retain power.
So in summary I'd say that you shouldn't attempt to ensure a lack of correlation; it will occur in the real world. You can mitigate the risk of non-predictive factors being included due to spurious correlation by generating large amounts of data, but in practice some non-predictors will leak through unless you can obtain enough of it. I'd therefore suggest that your testing address the rate at which this happens as a function of the number of candidate factors and the sample size.
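The spurious-correlation point can be illustrated with a small simulation. This is a numpy/scipy sketch under illustrative assumptions (a sample of 500, exposures with 50% frequency, a plain 2x2 chi-square test per factor); it is not the asker's algorithm.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, n_vars, alpha = 500, 1000, 0.05

y = rng.integers(0, 2, size=n)            # binary outcome
X = rng.integers(0, 2, size=(n, n_vars))  # 1000 independent binary exposures

# Test every candidate factor against the outcome with a 2x2 chi-square test.
pvals = np.empty(n_vars)
for j in range(n_vars):
    table = np.histogram2d(X[:, j], y, bins=2)[0]
    pvals[j] = stats.chi2_contingency(table, correction=False)[1]

print("spurious hits at alpha = 0.05:", int((pvals < alpha).sum()))
print("hits after Bonferroni (alpha / 1000):", int((pvals < alpha / n_vars).sum()))
```

Every predictor here is independent of the outcome by construction, yet dozens pass an uncorrected 0.05 threshold; the Bonferroni-adjusted threshold removes essentially all of them.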

Multiple Regression Assumption

If I am running a multiple linear regression model with six independent variables against a dependent variable,
do the assumptions of multiple regression need to be satisfied,
or does this only apply if we are using the least squares method?
Yes, the assumptions should be satisfied in either case. In practice it can be difficult to fulfil all of them, but the more assumptions are satisfied, the more reliable the results become.
Here is a helpful link regarding the same: https://statistics.laerd.com/spss-tutorials/multiple-regression-using-spss-statistics.php
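As a rough illustration of checking two of the usual assumptions on a six-predictor least-squares fit, here is a numpy/scipy sketch; the simulated data, the coefficient values, and the particular diagnostics (Shapiro-Wilk for residual normality, a crude trend check for heteroscedasticity) are all illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 6))                        # six predictors
beta = np.array([1.5, -2.0, 0.5, 0.0, 1.0, -0.5])  # true coefficients
y = X @ beta + rng.normal(size=n)                  # well-behaved errors

# Ordinary least-squares fit with an intercept.
Xd = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(Xd, y, rcond=None)
fitted = Xd @ coef
resid = y - fitted

# Normality of residuals, and a crude check for heteroscedasticity:
# |residual| should not trend with the fitted values.
_, p_norm = stats.shapiro(resid)
r_het, p_het = stats.pearsonr(np.abs(resid), fitted)
print(f"Shapiro-Wilk p = {p_norm:.3f} (small p suggests non-normal residuals)")
print(f"|resid| vs fitted: r = {r_het:.3f}, p = {p_het:.3f}")
```

With real data you would also inspect residual plots and multicollinearity rather than rely on a couple of test statistics.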

Can you merge two principal components?

I am doing a regression on the big 5 personality traits and how birth order affects those traits. First I am trying to build 5 variables, based on surveys, that capture those traits. I have thought about creating dummies for each question in a category (trait) and then taking the average, but some of the questions are highly correlated, so the weights would be wrong.
I have run a principal components analysis, which gives me four components with an eigenvalue over one. The problem is that none of them accounts for over 40% of the variance.
Is there some way that I can merge the four into one variable? It is the dependent variable, so there can only be one.
Otherwise do you have another idea of how the index can be constructed?
It doesn't really make sense to merge your principal components as they are by definition orthogonal/uncorrelated.
I can't advise on how an index could be constructed as I think this requires subject matter expertise, but you might want to consider using a multivariate technique that allows for multiple response variables. See this answer for a possible approach (assuming your response variables are ordinal).
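The first point, that principal components are uncorrelated by construction, can be checked directly. A small numpy sketch, where the simulated survey items and the choice of four components are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_items = 300, 8
latent = rng.normal(size=(n, 2))
loadings = rng.normal(size=(2, n_items))
X = latent @ loadings + rng.normal(scale=0.5, size=(n, n_items))  # correlated items

# PCA via SVD of the centred data; keep the first four components.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:4].T

# Off-diagonal correlations between component scores are zero (up to rounding).
corr = np.corrcoef(scores, rowvar=False)
off_diag = corr[~np.eye(4, dtype=bool)]
print("max |correlation| between components:", np.abs(off_diag).max())
```

Since the components share no correlation at all, averaging or otherwise "merging" them just mixes unrelated dimensions into one number with no clear meaning.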

Logic or lookup table: Best practices

Suppose you have a function/method that uses two metrics to return a value, essentially a 2D matrix of possible values. Is it better to use logic (nested if/switch statements) to choose the right value, or just build that matrix (as an Array/Hash/Dictionary/whatever), so that returning the value becomes simply a matter of performing a lookup?
My gut feeling says that for an M×N matrix, relatively small values of both M and N (say ≤ 3) would be OK to use logic, but for larger values it would be more efficient to just build the matrix.
What are general best practices for this? What about for an N-dimensional matrix?
The decision depends on multiple factors, including:
Which option makes the code more readable and hence easier to maintain
Which option performs faster, especially if the lookup happens squillions of times
How often do the values in the matrix change? If the answer is "often", then it is probably better to externalise the values out of the code and store them in a matrix that can be edited simply.
Not only how big is the matrix but how sparse is it?
My rule of thumb is that about nine conditions is the limit for an if .. else ladder or a switch. So for a 2D cell you can reasonably hard-code the up, down, and diagonal neighbours. If you go to three dimensions you have 27 cases, which is too many, but it's fine if you're restricted to the six cube faces.
Once you've got a lot of conditions, start coding via look-up tables.
But there's no real answer. For example Windows message loops need to deal with a lot of different messages, and you can't sensibly encode the handling code in look-up tables.
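A minimal Python sketch of the two approaches for a 2×2 case; the queue names and key values are made up purely for illustration:

```python
# Hypothetical example: choose a job queue from two metrics, size and priority.

# Version 1: nested logic.
def queue_for_logic(size, priority):
    if size == "small":
        return "fast" if priority == "high" else "batch"
    elif size == "large":
        return "dedicated" if priority == "high" else "overnight"
    return "default"

# Version 2: the 2D matrix as a dict keyed by (size, priority) tuples.
QUEUES = {
    ("small", "high"): "fast",
    ("small", "low"): "batch",
    ("large", "high"): "dedicated",
    ("large", "low"): "overnight",
}

def queue_for_table(size, priority):
    return QUEUES.get((size, priority), "default")

# Both versions agree, including on the fallback case.
for args in [("small", "high"), ("small", "low"),
             ("large", "high"), ("large", "low"), ("huge", "low")]:
    assert queue_for_logic(*args) == queue_for_table(*args)
```

The table version scales to more rows, columns, or dimensions by adding entries rather than branches, and the dict literal could easily be moved out to a config file if the values change often.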

Calculating 4th power differences

I am using Modelica for solving a system of equations for heat transfer problems, and one of them is radiation which is written as
Ta^4-Tb^4
Can someone say if it is computationally faster solving a system with the equation written as:
(Ta-Tb)(Ta+Tb)(Ta^2+Tb^2)
?
There cannot be a definitive answer to this question. This is because the Modelica specification is used to formally define the problem statement but it says nothing about how tools solve such equations. Furthermore, since most Modelica tools do symbolic manipulation anyway, it is hard to predict what steps they might take with such an equation. For example, a tool may very well transform this into a Horner polynomial on its own (without your manual intervention).
If you are going to solve for the temperatures in such an equation as a non-linear system, be careful about negative temperature solutions. You should investigate the "start" attribute to specify initial (positive) guesses when these temperatures are iteration variables in non-linear problems.
I would say that there are two reasons why splitting it into (Ta-Tb)(Ta+Tb)(Ta^2+Tb^2) is slower, not faster.
(Ta^2+Tb^2) requires 2 multiplications and an addition, which means that (Ta-Tb)(Ta+Tb)(Ta^2+Tb^2) requires 4 multiplications and 3 additions/subtractions. On the other hand, I would guess that Ta^4-Tb^4 is evaluated as ((Ta^2)^2 - (Tb^2)^2), which is 4 multiplications and 1 subtraction.
A Modelica tool, like a more generic compiler, probably knows very well how to optimise these very simple expressions. This means it is generally safer, in terms of computation time, to use simple patterns that will be easily caught and translated into efficient machine code.
I might of course be wrong, but I cannot see any reason why (Ta-Tb)(Ta+Tb)(Ta^2+Tb^2) would be faster. Hope it helps.
Oscar
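For what it's worth, the algebraic equivalence of the forms is easy to check numerically, and timeit gives a rough feel for their relative cost in plain Python. None of this says anything about what a compiled Modelica tool will actually generate; the temperature values are illustrative.

```python
import timeit

Ta, Tb = 350.0, 300.0  # illustrative temperatures in kelvin

direct   = lambda: Ta**4 - Tb**4
squares  = lambda: (Ta*Ta)*(Ta*Ta) - (Tb*Tb)*(Tb*Tb)
factored = lambda: (Ta - Tb)*(Ta + Tb)*(Ta*Ta + Tb*Tb)

# All three forms are algebraically identical (and exact here, since these
# intermediate values are exactly representable as doubles).
assert direct() == squares() == factored()

for name, f in [("Ta**4 - Tb**4", direct),
                ("((Ta^2)^2 - (Tb^2)^2)", squares),
                ("factored", factored)]:
    print(f"{name:22s} {timeit.timeit(f, number=100_000):.4f} s")
```

The timings mostly measure interpreter overhead, which is exactly the caveat above: what matters is the operation count after the tool's own symbolic and compiler optimisations, not the source-level form.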