Adjusting Binary Logistic Formula in SPSS

I am running a binary logistic regression in SPSS to test the effect of, e.g., TV advertisements on the probability that a consumer buys a product. My problem is that with the binary logistic regression formula:
P = 1 / (1 + e^(-(a + b*Adv)))
the maximum predicted probability is 100%. However, even if I increase the number of advertisements by 1000, it is not sensible to assume that the probability of purchase will reach 100%. So if I plot the logistic curve with the coefficients from the Binary Logistic Regression output, at some point the probability approaches 100%, which is never the case in a real-life setting. How can I control for that?
Is there a way to change the SPSS binary logistic regression to have a maximum probability of e.g. 20%?
Thank you!

The maximum hypothetical probability is 100%, but if you use real-world data, your model will fit the data such that the predicted y-value for any given value of x will be no higher than the real-world y-value (+/- your model's error term). I wouldn't worry too much about the hypothetical maximum probability as long as the model fits the data reasonably well. One of the key reasons for using logistic regression instead of OLS linear regression is to avoid impossible predicted values.
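A minimal sketch in R, with simulated data, of the point above: if the real purchase rate never rises much above 20% over the observed range of advertisements, the fitted logistic curve also stays around that level within the data, even though its mathematical maximum is 1.

    set.seed(1)
    n   <- 2000
    adv <- runif(n, 0, 10)              # number of advertisements seen
    p   <- 0.20 * plogis(adv - 5)       # true purchase probability, capped near 20%
    buy <- rbinom(n, 1, p)              # simulated purchase decisions

    fit <- glm(buy ~ adv, family = binomial)
    range(fitted(fit))                  # maximum fitted probability stays well below 1
    predict(fit, data.frame(adv = 10), type = "response")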

Related

Doesn't introduction of polynomial features lead to increased collinearity?

I was going through linear and logistic regression in ISLR, and in both cases I found that one of the approaches used to increase the flexibility of the model is to use polynomial features, i.e. both X and X^2 as features, and then apply the regression models as usual while treating X and X^2 as independent features (in sklearn, not the polynomial fit of statsmodels). Does that not increase the collinearity amongst the features though? How does it affect model performance?
To summarize my thoughts regarding this -
First, X and X^2 have substantial correlation no doubt.
Second, I wrote a blog demonstrating that, at least in Linear regression, collinearity amongst features does not affect the model fit score though it makes the model less interpretable by increasing coefficient uncertainty.
So does the second point have anything to do with this, given that model performance is measured by the fit score?
Multicollinearity isn't always a hindrance; it depends on the data. If your model isn't giving you the best results (high accuracy or low loss), you can remove outliers or highly correlated features to improve it, but if everything is hunky-dory, you don't need to bother about them.
The same goes for polynomial regression: yes, it adds multicollinearity by introducing x^2, x^3, ... features into your model.
To overcome that, you can use orthogonal polynomial regression, which constructs polynomial terms that are orthogonal (and hence uncorrelated) with each other.
However, it still uses higher-degree polynomials, which can become unstable at the boundaries of your data space.
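A minimal sketch in R of the collinearity point: raw x and x^2 are strongly correlated, while the columns produced by poly(), an orthogonal polynomial basis, are uncorrelated by construction.

    set.seed(1)
    x <- runif(200, 1, 10)
    cor(x, x^2)                  # close to 1 for positive x
    P <- poly(x, degree = 2)     # orthogonal polynomial basis
    round(cor(P), 3)             # off-diagonal correlation essentially 0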
To overcome the boundary instability, you can use regression splines, which divide the range of the data into separate portions and fit a linear or low-degree polynomial function within each portion. The points where the divisions occur are called knots, and the functions used to model each piece/bin are called piecewise functions. These pieces are constrained: for example, with cubic (degree-3) pieces, the overall function is required to be continuous and twice differentiable at the knots.
Such a piecewise polynomial of degree m with m-1 continuous derivatives is called a Spline.
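A minimal sketch in R, with simulated data, of a cubic regression spline; it uses bs() from the splines package that ships with R, and the knot locations here are arbitrary.

    library(splines)
    set.seed(1)
    x <- runif(300, 0, 10)
    y <- sin(x) + rnorm(300, sd = 0.3)

    # Piecewise cubic, continuous and twice differentiable at the knots
    fit <- lm(y ~ bs(x, knots = c(2.5, 5, 7.5), degree = 3))
    plot(x, y)
    lines(sort(x), fitted(fit)[order(x)], lwd = 2)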

Using OLS regression on binary outcome variable

I have previously been told that -- for reasons that make complete sense -- one shouldn't run OLS regressions when the outcome variable is binary (i.e. yes/no, true/false, win/loss, etc). However, I often read papers in economics/other social sciences in which researchers run OLS regressions on binary variables and interpret the coefficients just like they would for a continuous outcome variable. A few questions about this:
Why do they not run a logistic regression? Is there any disadvantage/limitation to using logit models? In economics, for example, I very often see papers using OLS regression for binary variables and not logit. Can logit only be used in certain situations?
In general, when can one run an OLS regression on ordinal data? If I have a variable that captures "number of times in a week survey respondent does X", can I - in any circumstance - use it as a dependent variable in a linear regression? I often see this being done in literature as well, even though we're always told in introductory statistics/econometrics that outcome variables in an OLS regression should be continuous.
Applying OLS to a binary outcome is known as the linear probability model (LPM). Compared to a logistic model, the LPM has advantages in implementation and interpretation that make it an appealing option for researchers conducting impact analysis. In the LPM, the parameters are mean marginal effects, whereas in logistic regression the parameters are log odds ratios. To obtain mean marginal effects from a logistic regression, we need to calculate the derivative of the predicted probability for every data point and then take the mean of those derivatives. Logistic regression and the LPM usually yield the same expected average impact estimate [1], and researchers often prefer the LPM for estimating treatment impacts.
In general, yes, we can apply OLS to an ordinal outcome. As in the binary case, applying OLS to a binary or ordinal outcome results in violations of the OLS assumptions. However, many econometricians believe the practical effect of these violations is minor, and that the simplicity of interpreting OLS coefficients outweighs the technical correctness of an ordered logit or probit model, especially when the ordinal outcome looks quasi-normal.
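A minimal sketch in R, with simulated data and a single made-up covariate, of the marginal-effect comparison above: the LPM slope and the average of the per-observation logistic derivatives are typically very close.

    set.seed(1)
    n <- 5000
    x <- rnorm(n)
    p <- plogis(-1 + 0.8 * x)            # true probability of the outcome
    y <- rbinom(n, 1, p)                 # binary outcome

    lpm   <- lm(y ~ x)                   # linear probability model
    logit <- glm(y ~ x, family = binomial)

    # Average marginal effect in the logit: mean of dP/dx over the sample,
    # where dP/dx = beta * p_hat * (1 - p_hat) for each observation
    p_hat <- fitted(logit)
    ame_logit <- mean(coef(logit)["x"] * p_hat * (1 - p_hat))

    coef(lpm)["x"]                       # LPM slope = mean marginal effect
    ame_logit                            # usually very close to the LPM slope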
Reference:
[1] Deke, J. (2014). Using the linear probability model to estimate impacts on binary outcomes in randomized controlled trials. Mathematica Policy Research.

When to choose zero-inflated Poisson models over traditional Poisson regression? When does the proportion of zeros start being problematic?

I am learning generalised linear models on my own and while reading about regression models for count outcomes I found a recommendation to use zero-inflated Poisson or Negative Binomial regressions when facing an "excessive" or "considerable" amount of zero values in the outcome variable. However, I have been having a lot of difficulty trying to find a reference explicitly stating what proportion of zeros should be considered "excessive" to the point of warranting the use of zero-inflated models.
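One informal way to see what "excessive" means in practice, sketched minimally in R with simulated counts, is to compare the observed number of zeros with the number implied by an ordinary Poisson fit.

    set.seed(1)
    n  <- 1000
    x  <- rnorm(n)
    mu <- exp(0.2 + 0.5 * x)
    y  <- rpois(n, mu)
    y[rbinom(n, 1, 0.3) == 1] <- 0       # add structural zeros to 30% of cases

    fit <- glm(y ~ x, family = poisson)
    observed_zeros <- sum(y == 0)
    expected_zeros <- sum(dpois(0, fitted(fit)))   # zeros implied by the Poisson fit
    c(observed = observed_zeros, expected = round(expected_zeros))
    # A large excess of observed over expected zeros is what motivates a
    # zero-inflated (or negative binomial) model, rather than any fixed cutoff.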

How to perform multiple logistic regression for a continuous dependent variable with values between 0 and 1?

I thought that logistic regression should only be used for true binary variables (0 or 1), but now one reviewer of my paper is asking me to perform multiple logistic regression for a dependent variable that is relative abundance (i.e. proportion data). I saw that logistic regression can also be used when there is an upper boundary to the value of the dependent variable. However, when I try to perform logistic regression in R with my data, I get the following warning message:
In eval(family$initialize): non-integer #successes in a binomial glm!
My dependent variable is the number of species weighted by abundance, or the cumulative abundance of all species that belong to a cluster (group) in a certain plot, divided by the total abundance in the plot (considering all functional groups), so it varies between 0 and 1. I have seen that some methods for applying logistic regression require two columns as the dependent variable, one with the resulting abundance and the other with 100 - abundance (or 1 - abundance), but I could not apply this given my limited R knowledge.
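A minimal sketch in R of the two-column (successes/failures) approach mentioned above; the data frame dat and its columns (cluster_abund, total_abund, treatment) are made up here to stand in for the real data.

    # Fake data standing in for the real plots
    set.seed(1)
    dat <- data.frame(treatment   = rnorm(50),
                      total_abund = rpois(50, 200) + 1)
    dat$cluster_abund <- rbinom(50, dat$total_abund, plogis(-1 + 0.5 * dat$treatment))

    # With integer counts, a binomial GLM takes successes and failures as a
    # two-column response, which avoids the non-integer #successes warning
    fit1 <- glm(cbind(cluster_abund, total_abund - cluster_abund) ~ treatment,
                family = binomial, data = dat)

    # If only the 0-1 proportion is available, a quasi-binomial fit on the
    # proportion is a common alternative: same logit link, no integer assumption
    dat$prop <- dat$cluster_abund / dat$total_abund
    fit2 <- glm(prop ~ treatment, family = quasibinomial, data = dat)
    summary(fit2)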

How does SPSS assign factor scores for cases where underlying variables were pairwise deleted?

Here's a simplified example of what I'm trying to figure out from a report. All analyses are being run in SPSS, which I don't have and don't use (my experience is with SAS and R).
They were running a regression to predict overall meal satisfaction from food type ordered, self-reported food flavor, and self-reported food texture.
But food flavor and texture are highly correlated, so they conducted a factor analysis, found food flavor and texture load on one factor, and used the factor scores in the regression.
However, about 40% of respondents don't have responses on self-reported food texture, so they used pairwise deletion while making the factors.
My question is when SPSS calculates the factor scores and outputs them as new variables in the data set, what does it do with people who had an input for a factor that was pairwise deleted?
How does it calculate (if it calculates it at all) factor scores for those people who had a response pairwise deleted during the creation of the factors and who therefore have missing data for one of the variables?
Factor scores are a linear combination of their scaled inputs. That is, given normalized variables X_1, ..., X_n and loadings L_1, ..., L_n, the factor score is
f = L_1*X_1 + L_2*X_2 + ... + L_n*X_n
In your case n = 2. Now suppose one of the X_i is missing. How do you take a sum involving a missing value? Clearly, you cannot... unless you impute it.
I don't know what the analyst who created your report did. I suggest you ask them.
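A minimal sketch in R, with made-up loadings and data, of the formula above: a factor score computed as a loading-weighted sum of standardized variables is missing as soon as one of its inputs is missing, unless that input is imputed first.

    flavor  <- c(4, 5, 3, NA, 2)
    texture <- c(4, 4, 2, 5, NA)
    z_flavor  <- scale(flavor)        # standardize; NA stays NA
    z_texture <- scale(texture)
    L <- c(0.8, 0.7)                  # hypothetical loadings

    score <- L[1] * z_flavor + L[2] * z_texture
    score                             # respondents 4 and 5 get NA: no score without imputation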