If I am running a multiple linear regression model with six independent variables against a dependent variable,
do the assumptions of multiple regression need to be satisfied?
Or does this only apply if we are using the least squares method?
Yes, the assumptions should be satisfied in either case. In practice it can be difficult to fulfill all of them, but the more assumptions are satisfied, the more reliable the results become.
Here is a helpful link regarding the same: https://statistics.laerd.com/spss-tutorials/multiple-regression-using-spss-statistics.php
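To make that concrete, here is a minimal sketch (using statsmodels with simulated data; the variable names and numbers are made up for illustration) of fitting a six-predictor model and running a couple of the usual assumption checks:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

# Simulated data standing in for a real dataset: 6 predictors, 1 response
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = X @ np.arange(1.0, 7.0) + rng.normal(size=200)

model = sm.OLS(y, sm.add_constant(X)).fit()

# Independence of residuals (values near 2 suggest little autocorrelation)
print(durbin_watson(model.resid))
# Homoscedasticity: Breusch-Pagan test (small p-value suggests heteroscedasticity)
print(het_breuschpagan(model.resid, model.model.exog))
```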
For a simulation study I am working on, we are trying to test an algorithm that aims to identify specific culprit factors that predict a binary outcome of interest from a large mixture of possible exposures that are mostly unrelated to the outcome. To test this algorithm, I am trying to simulate the following data:
A binary dependent variable
A set of, say, 1000 variables, mostly binary and some continuous, that are not associated with the outcome (that is, completely independent of the binary dependent variable, but possibly still correlated with one another).
A group of 10 or so binary variables which will be associated with the dependent variable. I will determine a priori the magnitude of their correlation with the binary dependent variable, as well as their frequency in the data.
Generating a random set of binary variables is easy. But is there a way of doing this while ensuring that none of these variables are correlated with the dependent outcome?
Thank you!
"But is there a way of doing this while ensuring that none of these variables are correlated with the dependent outcome?"
With statistical sampling you can't ensure anything, you can only adjust the acceptable risk. Finding an acceptable level of risk may be harder than many people think.
Spurious correlations are a very real phenomenon. Real independent observations will often contain correlations, and if you want to actually test your algorithm to see how it will perform in reality then your tests should produce such phenomena in a manner similar to the real world—you should be generating independent candidate factors and allowing spurious correlations to occur.
If you are performing ~1000 independent tests of candidate factors at a significance level of α = 0.05, you can expect about 50 of the truly non-predictive factors to appear significant by chance. To avoid this, you need to tighten your testing threshold using something along the lines of a Bonferroni correction. Recall that statistical discriminating power depends on the standard error, which is inversely proportional to the square root of the sample size. Bonferroni says that 1000 simultaneous tests need their individual significance threshold tightened by a factor of 1000, which in turn demands a considerably larger sample size than a single test of significance would need.
So in summary I'd say that you shouldn't attempt to ensure a lack of correlation; it's going to occur in the real world. You can mitigate the risk of non-predictive factors being included due to spurious correlation by generating massive amounts of data. In practice some non-predictors will leak through unless you can obtain enough data, so I'd suggest that your testing address the rate of occurrence as a function of the number of candidate factors and the sample size.
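To illustrate the point about spurious correlations, here is a small simulation sketch (the sample size, test choice, and thresholds are assumptions for illustration) that generates a binary outcome and 1000 independent binary factors, then counts how many appear "significant" with and without a Bonferroni correction:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, p_null = 2000, 1000            # assumed sample size and number of null factors

y = rng.integers(0, 2, size=n)               # binary dependent variable
X = rng.integers(0, 2, size=(n, p_null))     # independent binary candidate factors

# Test each null factor against the outcome (Pearson correlation p-values here)
p_values = np.array([stats.pearsonr(X[:, j], y)[1] for j in range(p_null)])

print("spuriously 'significant' at alpha = 0.05:", (p_values < 0.05).sum())
print("after Bonferroni (alpha = 0.05 / 1000):  ", (p_values < 0.05 / p_null).sum())
```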
I’m working with different types of financial data inputs for my models and I would like to know more about normalizing them.
In particular, working with some technical indicators, I’ve normalized them to have a range between 0 and 1.
Others were normalized to have a range between -1 and 1.
What is your experience with mixed normalized data?
Could it be acceptable to have these two ranges, or is it always better to have the training dataset in a single range, i.e. [0, 1]?
It is important to note that when we discuss data normalization, we are usually referring to the normalization of continuous data; categorical data (usually) doesn't require it.
Furthermore, not all ML methods need you to normalize the data for them to function well; examples of such methods include Random Forests and Gradient Boosting Machines. Others, however, do, for instance Support Vector Machines and Neural Networks.
The reasons for input data normalization are dependent on the methods themselves. For SVMs, data normalization is done to ensure that input features are given equal importance in influencing the model's decisions. For neural networks, we normalize data to allow the gradient descent process to converge smoothly.
Finally, to answer your question: if you are working with continuous data and using a neural network to model your data, just make sure that the normalized values are close to each other (even if they are not in the same range), because that is what determines the ease with which the gradient descent process converges. If you are working with an SVM, it would be better to normalize your data to a single range, so that all features may be given equal importance by the similarity/distance function that your SVM uses. In other cases, the need for data normalization, whatever the ranges, may be removed entirely. Ultimately, it depends on the modeling technique you are using!
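As an illustrative sketch of the two options, using scikit-learn's MinMaxScaler and a made-up feature matrix standing in for your indicators:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix: four indicators on very different scales
X = np.random.default_rng(0).normal(size=(500, 4)) * np.array([1, 10, 100, 1000])

# A single [0, 1] range for every feature (safer for SVMs / distance-based models)
X_01 = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# A [-1, 1] range instead; for neural networks either works, as long as the
# resulting feature scales are comparable
X_pm1 = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)
```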
Credit to #user3666197 for the helpful feedback in the comments.
I use Albumentations augmentations in my computer vision tasks. However, I don't fully understand when to use normalization on my images (I use min-max normalization). Do I need to apply normalization before the augmentation functions (but then the values would not be between 0 and 1), or do I apply normalization just after the augmentations, so that values are between 0 and 1, or do I normalize in both cases, before and after the augmentations?
For example, when I use Sharpen, values are not in the 0-1 range (they vary in the -0.5 to 1.5 range). Does that affect model performance? If yes, how?
Thanks in advance.
The basic idea is that the input to your neural network should be centered around 0 and have a variance of about 1. There is a mathematical reason why this helps the learning process of a neural network; this is not the case for other algorithms like tree boosting.
If you train from scratch, the type of normalization (min-max or other) should not impact the model performance (except if, for example, your max/min value is really extreme compared to your other data points).
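One common arrangement consistent with this is to run the augmentations on the raw uint8 image and apply the min-max scaling once, as the last step. A minimal Albumentations sketch (the specific transforms and parameters are illustrative, not a recommendation):

```python
import numpy as np
import albumentations as A

transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.Sharpen(p=1.0),            # augmentation applied before the final scaling
    A.ToFloat(max_value=255.0),  # min-max style scaling to [0, 1] as the last step
])

image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)  # dummy image
out = transform(image=image)["image"]
print(out.dtype, out.min(), out.max())   # float32, values within [0, 1]
```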
Suppose you have a function/method that uses two metrics to return a value, essentially a 2D matrix of possible values. Is it better to use logic (nested if/switch statements) to choose the right value, or just to build that matrix (as an Array/Hash/Dictionary/whatever), so that the return value becomes simply a matter of performing a lookup?
My gut feeling says that for an M⨉N matrix, relatively small values for both M and N (like ≤3) would be OK to use logic, but for larger values it would be more efficient to just build the matrix.
What are general best practices for this? What about for an N-dimensional matrix?
The decision depends on multiple factors, including:
Which option makes the code more readable and hence easier to maintain
Which option performs faster, especially if the lookup happens squillions of times
How often do the values in the matrix change? If the answer is "often" then it is probably better to externalise the values out of the code and store them in a matrix kept somewhere it can be edited simply.
Not only how big the matrix is, but also how sparse it is.
I'd say that about nine conditions is the limit for an if ... else ladder or a switch. So for a cell in a 2D grid you can reasonably hard-code the up, down, diagonal, and other neighbouring cases. If you go to three dimensions you have 27 cases, which is too much, but it's OK if you're restricted to the six cube faces.
Once you've got a lot of conditions, start coding via look-up tables.
But there's no real answer. For example, Windows message loops need to deal with a lot of different messages, and you can't sensibly encode the handling code in look-up tables.
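For the table-driven approach, a minimal sketch in Python (the metrics and values are hypothetical) looks like this; the nested if/else equivalent would be four branches here and grows multiplicatively with each extra dimension:

```python
# Hypothetical example: pick a rate from two metrics (size, speed)
RATES = {
    ("small", "standard"): 4.99,
    ("small", "express"): 9.99,
    ("large", "standard"): 7.99,
    ("large", "express"): 14.99,
}

def rate_for(size: str, speed: str) -> float:
    try:
        return RATES[(size, speed)]
    except KeyError:
        raise ValueError(f"no rate defined for {size!r}/{speed!r}")
```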
We have several independent variables (some continuous with more than 5 levels, some binary, and some quasi-interval (5 levels, categorical)). We also have 5 dependent variables that share a common construct. Is it useful to conduct a MANOVA with all the continuous/quasi-interval variables as covariates and the binary variables as factors, or to perform 5 separate multiple regression analyses?
Thank you
In general, it is inadvisable to perform multiple (univariate) analyses when you can replace them with one (multivariate). Neglecting this can result in an inflated Type I error rate when a suitable correction is not applied. It might also lead to overlooking an effect arising from interaction between variables analysed separately.
With this in mind, you would probably be better off performing a single MANOVA. Covariates, however, should be determined by your experimental design rather than by the type of variable (nominal, categorical, or continuous).
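If you go the MANOVA route, here is a minimal sketch using statsmodels (the file, column names, and formula are placeholders for your own design):

```python
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

# Hypothetical data: y1..y5 are the related dependent variables,
# 'group' is a binary factor, x1/x2 are continuous covariates
df = pd.read_csv("study_data.csv")

fit = MANOVA.from_formula("y1 + y2 + y3 + y4 + y5 ~ group + x1 + x2", data=df)
print(fit.mv_test())
```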