ANOVA and sample group associated with a variable

I have a variable X and 16 groups of samples. I would like to know which group is the most associated with this variable (the one with the lowest values, actually). I performed an ANOVA and a TukeyHSD post-hoc test, but that only highlights which groups differ on variable X.
Is there a way to determine which group is significantly associated with the lowest values of variable X?
Thanks for your help

With the post-hoc comparisons already in place, and with the information about which groups differ from one another, all you need to know is the mean of X within each group.
The group means are easily calculated in standard statistical software, as in the sketch below. You already know which of those means are significantly different from one another.
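For example, a minimal Python sketch (data and column names invented for illustration):

```python
import pandas as pd

# Toy data: replace with your 16 groups and measured X values.
df = pd.DataFrame({'group': ['a', 'a', 'b', 'b', 'c', 'c'],
                   'X':     [1.2, 0.8, 2.5, 2.9, 0.3, 0.5]})

# Per-group means of X, sorted so the lowest-mean group comes first.
print(df.groupby('group')['X'].mean().sort_values())
```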
Alternatively, you can use dummy coding for the group variable (i.e., 15 indicator variables with one reference group that replace the 16-level factor). A regression model that regresses X on the dummy variables is equivalent to the ANOVA model (in most respects) and allows for most pairwise comparisons (depending on the coding).
The regression coefficients will indicate the differences between groups, and the tests for the coefficients will indicate whether or not these differences are significant at some level of confidence.
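A sketch of that equivalence in Python (again with invented data; statsmodels' C() expands the factor into indicator variables, taking the first level as the reference group):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({'group': ['a', 'a', 'b', 'b', 'c', 'c'],
                   'X':     [1.2, 0.8, 2.5, 2.9, 0.3, 0.5]})

# Regress X on the dummy-coded group factor; this fits the same model
# as the one-way ANOVA. Each coefficient is a group mean minus the
# reference group's mean, with a significance test attached.
fit = smf.ols('X ~ C(group)', data=df).fit()
print(fit.summary())
```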

Related

ANOVA with unbalanced sample size (small sample size in one group)

I'm doing an ANOVA with a 3-level categorical variable to identify covariates for a later regression. The sample size of one of the groups is n = 2 to 3, and the sample size of the largest group is around 100. The results suggest that there are significant differences among groups, but I wonder if I can trust the results when one group has such a small sample size.
Also, I would like to include this categorical group variable (with 3 levels) as a covariate in a regression. With one of the categories having only 2 people, should I still keep it as a separate group (when dummy coding)?
Thank you!

Is there a way to not select a reference category for logistic regression in SPSS?

When doing logistic regression in SPSS, is there a way to remove the reference category in the independent variables so they're all compared against each other equally rather than against the reference category?
When you have a categorical predictor variable, the most fundamental way to encode it for modeling, sometimes referred to as the canonical representation, is to use a 0-1 indicator for each level of the predictor, where each case takes on a value of 1 for the indicator corresponding to its category, and 0 for all the other indicators. The multinomial logistic regression procedure in SPSS (NOMREG) uses this parameterization.
If you run NOMREG with a single categorical predictor with k levels, the design matrix is built with an intercept column and the k indicator variables, unless you suppress the intercept. If the intercept remains in the model, the last indicator will be redundant, linearly dependent on the intercept and the first k-1 indicators. Another way to say this is that the design matrix is of deficient rank, since any of the columns can be predicted given the other k columns.
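A toy numeric illustration of that rank deficiency (not SPSS output; the data are invented):

```python
import numpy as np

k, n = 4, 12
levels = np.repeat(np.arange(k), n // k)       # a k-level factor
indicators = np.eye(k)[levels]                 # one 0-1 column per level
X = np.column_stack([np.ones(n), indicators])  # intercept + all k indicators

# k + 1 columns but rank only k: any column is predictable from the rest.
print(X.shape[1], np.linalg.matrix_rank(X))    # prints: 5 4
```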
The same redundancy will be true of any additional categorical predictors entered as main effects (only k-1 of k indicators can be nonredundant). If you add interactions among categorical predictors, an indicator for each combination of levels of the two predictors is generated, but more than one of these will also be redundant given the intercept and main effects preceding the interaction(s).
The fundamental or canonical representation of the model is thus overparameterized, meaning it has more parameters than can be uniquely estimated. There are multiple ways commonly used to deal with this fact. One approach is the one used in NOMREG and most other more recent regression-type modeling procedures in SPSS, which is to use a generalized inverse of the cross-product of the design matrix, which has the effect of aliasing parameters associated with redundant columns to 0. You'll see these parameters represented by 0 values with no standard errors or other statistics in the SPSS output.
The other way used in SPSS to handle the overparameterized nature of the basic model is to reparameterize the design matrix to full rank, which involves creating k-1 coded variables instead of k indicators for each main effect, and creating interaction variables from these. This is the approach taken in LOGISTIC REGRESSION.
Note that the overall model fit and predicted values from a logistic regression (or other form of linear or generalized linear model) will be the same regardless of what choices are made about parameterization, as long as the appropriate total number of unique columns is in the design matrix. Particular parameter estimates are of course highly dependent upon the particular parameterization used, but you can derive the results from any of the valid approaches using the results from any other valid approach.
If there are k levels in a categorical predictor, there are k-1 degrees of freedom for comparing those k groups, meaning that once you've made k-1 linearly independent or nonredundant comparisons, any others can be derived from those.
So the short answer is no, you can't do what you're talking about, but you don't need to, because the results for any valid parameterization will allow you to derive those for any other one.
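A small Python sketch of that equivalence (invented data; statsmodels stands in for SPSS): two different reference-category codings give different coefficients but identical fitted probabilities, and any pairwise comparison can be derived from either fit.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
g = rng.integers(0, 3, 300)                       # 3-level predictor
y = rng.binomial(1, np.array([0.2, 0.5, 0.8])[g])

# Coding A: reference = level 0; coding B: reference = level 2.
XA = sm.add_constant(pd.get_dummies(g, prefix='g', drop_first=True).astype(float))
XB = sm.add_constant(pd.get_dummies(g, prefix='g').astype(float).drop(columns='g_2'))
fitA = sm.Logit(y, XA).fit(disp=0)
fitB = sm.Logit(y, XB).fit(disp=0)

# Same fitted probabilities despite different parameterizations.
print(np.allclose(fitA.predict(XA), fitB.predict(XB)))   # True
# The level-1 vs level-2 contrast is derivable from either fit.
print(np.isclose(fitA.params['g_1'] - fitA.params['g_2'],
                 fitB.params['g_1']))                     # True
```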

In DQN, why is y_i calculated but not stored?

[Image in the original post: the DQN algorithm pseudocode, with a source link.]
We have phi_t, a_t, r_t, and phi_{t+1} fields in D's records. Why don't we have a 'y' field in D's records, so we can store the 'y' values once they are calculated?
I mean, the minibatches are chosen randomly from D without any restrictions, so one record may be chosen multiple times, especially when the number of records in D is not large enough. If that happens, y needs to be recalculated multiple times. Am I thinking about this correctly?
Because y_i is computed using the function Q, which changes from iteration to iteration. Therefore, the values stored in one iteration are not valid for later iterations.
Within the same iteration, I think you are right to point out that if you sample the same transition several times, it's not necessary to compute y_i several times; you can reuse the same result. I guess the pseudocode is more focused on the key concepts than on this kind of implementation detail.
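A minimal NumPy sketch of both points (all names and shapes are assumed for illustration, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the replay buffer fields and for Q(phi_{t+1}, .).
N, B, A = 1000, 32, 4
r    = rng.normal(size=N)                    # rewards r_t
done = rng.integers(0, 2, N).astype(float)   # terminal flags
q_next = rng.normal(size=(N, A))             # current network's Q-values

def td_targets(q_values_next, r, done, gamma=0.99):
    # y_i = r_i for terminal transitions, else r_i + gamma * max_a Q(phi_{i+1}, a).
    # Recomputed whenever sampled, because Q changes after every update.
    return r + gamma * (1.0 - done) * q_values_next.max(axis=1)

# Within one iteration, duplicates in the minibatch can share a single
# computation: evaluate y only for the unique sampled transitions.
idx = rng.integers(0, N, size=B)
uniq, inv = np.unique(idx, return_inverse=True)
y = td_targets(q_next[uniq], r[uniq], done[uniq])[inv]   # (B,) targets
```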

How does SPSS assign factor scores for cases where underlying variables were pairwise deleted?

Here's a simplified example of what I'm trying to figure out from a report. All analyses are being run in SPSS, which I don't have and don't use (my experience is with SAS and R).
They were running a regression to predict overall meal satisfaction from food type ordered, self-reported food flavor, and self-reported food texture.
But food flavor and texture are highly correlated, so they conducted a factor analysis, found food flavor and texture load on one factor, and used the factor scores in the regression.
However, about 40% of respondents don't have responses on self-reported food texture, so they used pairwise deletion while making the factors.
My question is: when SPSS calculates the factor scores and outputs them as new variables in the data set, what does it do with people whose input to the factor was pairwise deleted?
How does it calculate factor scores (if it calculates them at all) for people who had a response pairwise deleted during the creation of the factors and who therefore have missing data on one of the variables?
Factor scores are a linear combination of their scaled inputs. That is, given normalized variables X_1, ..., X_n and loadings L_1, ..., L_n, we have
f = \sum_{i=1}^n L_i X_i
In your case n = 2. Now suppose one of the X_i is missing. How do you take a sum involving a missing value? Clearly, you cannot... unless you impute it.
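A tiny NumPy illustration of why (loadings and data invented):

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, np.nan],    # this respondent skipped the texture item
              [2.0, 1.0]])
Z = (X - np.nanmean(X, axis=0)) / np.nanstd(X, axis=0)  # standardize
L = np.array([0.8, 0.7])                                # factor loadings

print(Z @ L)   # second score is nan: the sum involves a missing value

# One possible fix: impute the missing standardized value with 0 (the mean).
Z_imp = np.where(np.isnan(Z), 0.0, Z)
print(Z_imp @ L)
```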
I don't know what the analyst who created your report did. I suggest you ask them.

Calculate conditional mean

I'm new to CUDA programming and am interested in implementing an algorithm that, when coded serially, calculates two or more means from a vector in one pass. What would be an efficient scheme for doing something like this in CUDA?
There are two vectors of length N: the element values and indicator values identifying which subset each element belongs to.
Is there an efficient way to do this in one pass, or should this be done in M passes, where M is the number of means to be calculated, using a vector of index keys for the element values of each subset?
You can achieve this with one pass over the data with a single call to thrust::reduce_by_key. In particular, look at the "summary statistics" example, which computes several statistical properties of a single vector at once. You can generalize this method with reduce_by_key, which computes reductions over many sub-vectors in parallel. Your "indicator values" would be the "keys" reduce_by_key uses to determine which sub-vector each element belongs to.
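If it helps to see the idea independent of Thrust, here is the same one-pass keyed reduction on the CPU with NumPy (illustration only; on the GPU, thrust::reduce_by_key plays this role):

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # element values
keys   = np.array([0,   1,   0,   2,   1,   2])    # subset indicator per element

sums   = np.bincount(keys, weights=values)  # per-subset sums, one pass
counts = np.bincount(keys)                  # per-subset element counts
print(sums / counts)                        # one mean per subset
```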
Partition each vector into smaller vectors and use threads to sum the required elements of each sub-vector. Then combine the sums and compute the global means. I would try to generate the M means at the same time rather than do M passes.