anova with unbalanced sample size (small sample size in one group) - regression

I'm doing an ANOVA with a 3-level categorical variable to identify covariate for later regression. The sample size of one of the group is n= 2 to 3, and the sample size for the largest group is around 100. The results suggested that there are significant difference among groups, but I just wonder if I can trust the results when one group has such small sample size?
Also, if I would like to include this categorical group variable (with 3 level) as a covariate in a regression. With one of the category has only 2 person, should I still keep this category as a separate group (when dummy coding)
Thank you!

Related

Centering variables for multiple regression - interested in group effects

I'm trying to run a multiple regression model looking at the length-weight relationship in fish. So y = weight, x = weight. What I want to examine specifically is if the length-weight relationship between different populations (same species) differs - I've run the model as:
weight = length * population
BUT have also reading a lot about centering data in regression models. It seems to make no sense to me to grand-mean centre length for this analysis as i'm specifically interested in the differences in L-W relationship between the groups, but should I group-centre for length? Or, not centre at all?
Any help or pointers greatly apriciated.
Cheers.
G.

SQL - How to find optimal performance numbers for query

First time here so forgive me for any faux pas. I have a question about the limitation of SQL as I am new to the code, and what I need I believe to be rather complex.
Is it possible to automate finding the optimal data for a specific query. For example, say I have the following columns:
1) Vehicle type (Text) e.g. car,bike,bus
2) Number of passengers (Numeric) e.g. 0-7
3) Was in an accident (Boolean) e.g. t or f
From here, I would like to get percentages. So if I were to select only cars with 3 passengers, what percentage of the total accidents does that account for.
I understand how to get this as a one off or mathematically calculate it, however my question relates how to automate this process to get the optimum number.
So, keeping with this example, say I look at just cars, what number of passengers covers the highest percentage of accidents?
At the moment, I am currently going through and testing number by number, is there a way to 'find' the optimal number? It is easy when it is just 0-7 like in the example, but I would naturally like to deal with a larger range and even multiple ranges. For example, say we add another variable titled:
4) Number of doors (numeric) e-g- 0-3
Would there be a way of finding the best combination of numbers from these two variables that cover the highest percentage of accidents?
So say we took: Car, >2 passengers, <3 doors on the vehicle. Out of the accidents variable 50% were true
But if we change that to:Car, >4 passengers, <3 doors. Out of the accidents variable 80% were true.
I hope I have explained this well. I understand that this is most likely not possible with SQL, however is there another way to find these optimum numbers?
Thanks in advance
Here's an example that will give you an answer for all possibilities. You could add a limit clause to show only the top answer, or add to the where clause to limit to specific terms.
SELECT
`vehicle_type`,
`num_passengers`,
sum(if(`in_accident`,1,0)) as `num_accidents`,
count(*) as `num_in_group`,
sum(if(`in_accident`,1,0)) / count(*) as `percent_accidents`
FROM `accidents`
GROUP BY `vehicle_type`,
`num_passengers`
ORDER BY sum(if(`in_accident`,1,0)) / count(*)

How does SPSS assign factor scores for cases where underlying variables were pairwise deleted?

Here's a simplified example of what I'm trying to figure out from a report. All analyses are being run in SPSS, which I don't have and don't use (my experience is with SAS and R).
They were running a regression to predict overall meal satisfaction from food type ordered, self-reported food flavor, and self-reported food texture.
But food flavor and texture are highly correlated, so they conducted a factor analysis, found food flavor and texture load on one factor, and used the factor scores in the regression.
However, about 40% of respondents don't have responses on self-reported food texture, so they used pairwise deletion while making the factors.
My question is when SPSS calculates the factor scores and outputs them as new variables in the data set, what does it do with people who had an input for a factor that was pairwise deleted?
How does it calculate (if it calculates it at all) factor scores for those people who had a response pairwise deleted during the creation of the factors and who therefore have missing data for one of the variables?
Factor scores are a linear combination of their scaled inputs. That is, given normalized variables X1, ..., Xn, we have the following (where LaTeX formatting isn't supported and the L's indicate the loadings)
f = \sum_{i=1}^n L_i X_i
In your case n = 2. Now suppose one of the X_i is missing. How do you take a sum involving a missing value? Clearly, you cannot... unless you impute it.
I don't know what the analyst who created your report did. I suggest you ask them.

ANOVA and sample group associated to a variable

I have a variable X and and 16 groups of samples. I would like to know which group is the most associated to this variable (the one with the lowest values actually). I performed an ANOVA and a TukeyHSD/post-hoc but that only highlight which groups are different for variable X.
Is there a way to determine which group is significantly associated at lowest values for variable X ?
Thanks for your help
With the post-hoc comparisons already in place, and with the information which groups differ from one another, all you need to know is the mean of X within each group.
The group means are easily calculated in standard statistical software. You already know, which of those means are significantly different from one another.
Alternatively you can use a dummy coding for the group variable (i.e., 5 indicator variables with one reference group that replace the 6-level factor). A regression model that regresses X on the dummy variables is equivalent to the ANOVA model (in most parts) and allows for most pairwise comparisons (depending on the coding).
The regression coefficients will indicate the difference between groups, and the test for the coefficients will indicate whether or not these are significant on some level of confidence.

How to convert the following MySQL schema into CouchDB?

I am not sure how I can design the following problem in CouchDB.
I have a logger web app that keeps track of how many items are in a warehouse. To simplify the problem we just need to know the total number items currently in warehouse and how long is each item stays in warehouse before it ships. Lets say the warehouse only have shoes but each shoe have different id and need to keep track by id.
MySQL schema looks like this
id name date-in data-out
1 shoe 08/0/2010 null
2 shoe 07/20/2010 08/01/2010
The output will be
Number of shoe in warehouse: 1
Average time in warehouse: 14 days
Thanks
jhs' answer is great, but I just wanted to add something:
To use the build-in reduce function for the avg calculation (_stats in your case), you have to use two "separate" views. But if your map-function is exactly the same, CouchDB will detect that and not generate a whole new index for that second view. This way you can have one map function feeding multiple reduce functions.
If each shoe is a document, with a date_in and date_out, then your reduce function will +1 if the date_out is null, and +0 (no change) if date_out is not null. That will give you the total count of shoes in the warehouse.
To compute the average time, for each shoe, you know the time in the warehouse. So the reduce function simply accumulates the average. Since reduce functions must be commutative and associative, you use a different average algorithm. The easiest way is to reduce to a [sum, count] array, where sum is an accumulator of all time for all shoes, and count is a counter for the number of shoes counted. Then the client simply divides sum / count to compute the final average.
I think you could combine both of these into one big reduce if you want, perhaps building up a {"shoes in warehouse": 1, "average time in warehouse": [253, 15]} kind of object.
However, if you can accept two different views for this data, then there is a shortcut for the average. In the map, emit(null, time) where time is the time spent in the warehouse. In the reduce, set the entire reduce value to _stats (see Built-in reduce functions). The view output will be an object with the sum and count already computed.