Suppose I have a dataset with 'number of centuries scored by batsmen across the world'. Is the number-of-centuries column in this dataset an ordered categorical variable or a discrete numerical variable?
I assume the number of centuries is a measure of how a batsman performed in a match, or of batsman quality (good/average/bad). It doesn't have any mathematical properties; it just gives you the performance order. Is that correct or not?
Let's say you have a model that predicts the purchases of a specific user over a specific period of time.
A model that predicts whether or not a user will buy, with users sorted by that predicted probability, seems to work well.
However, when I build a model that predicts the purchase amount and sort users by the predicted amount, the expected performance is not achieved.
For me, it is important to predict that A will pay more than B; matching the exact purchase amount is not important.
What metrics and models can be used in this case?
I am using lightgbm regression as a base.
There is a large variance in the purchase amount. Most users spend 0 won for a certain period, but purchasers spend from a minimum of $1,000 to a maximum of $100,000.
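For concreteness, the kind of rank-based evaluation I have in mind looks something like the sketch below (in R; pred and actual are hypothetical vectors of predicted and actual purchase amounts, and rank correlation is just one possible metric):

# A sketch of a rank-based check: how well does the predicted ordering of
# users match the actual ordering of their purchase amounts?
pred   <- c(0, 1500, 80000, 0, 3000)    # hypothetical predicted amounts
actual <- c(0, 1000, 100000, 0, 2500)   # hypothetical actual amounts
cor(pred, actual, method = "spearman")  # rank correlation; 1 = perfect ordering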
Here's a simplified example of what I'm trying to figure out from a report. All analyses are being run in SPSS, which I don't have and don't use (my experience is with SAS and R).
They were running a regression to predict overall meal satisfaction from food type ordered, self-reported food flavor, and self-reported food texture.
But food flavor and texture are highly correlated, so they conducted a factor analysis, found food flavor and texture load on one factor, and used the factor scores in the regression.
However, about 40% of respondents don't have responses on self-reported food texture, so they used pairwise deletion while making the factors.
My question is: when SPSS calculates the factor scores and outputs them as new variables in the data set, what does it do with people whose response on one of the factor's input variables was pairwise deleted?
How does it calculate factor scores (if it calculates them at all) for those people who had a response pairwise-deleted during the creation of the factors and who therefore have missing data on one of the variables?
Factor scores are a linear combination of their scaled inputs. That is, given normalized variables X_1, ..., X_n and loadings L_1, ..., L_n, we have
f = \sum_{i=1}^n L_i X_i
In your case n = 2. Now suppose one of the X_i is missing. How do you take a sum involving a missing value? Clearly, you cannot... unless you impute it.
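To make that concrete, here is a minimal sketch in R (the loadings and responses are made-up numbers, not from your data):

# Minimal sketch: a factor score as a weighted sum of standardized inputs.
L <- c(flavor = 0.8, texture = 0.8)   # hypothetical loadings
X <- c(flavor = 1.2, texture = NA)    # standardized responses; texture is missing
sum(L * X)                            # NA: the missing value propagates into the sum
sum(L * X, na.rm = TRUE)              # silently dropping the term is not the same as imputing it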
I don't know what the analyst who created your report did. I suggest you ask them.
I have a variable X and 16 groups of samples. I would like to know which group is most associated with this variable (the one with the lowest values, actually). I performed an ANOVA and a TukeyHSD post-hoc test, but that only highlights which groups differ on variable X.
Is there a way to determine which group is significantly associated with the lowest values of variable X?
Thanks for your help
With the post-hoc comparisons already in place, and with the information about which groups differ from one another, all you need to know is the mean of X within each group.
The group means are easily calculated in standard statistical software, and you already know which of those means are significantly different from one another.
Alternatively, you can use dummy coding for the group variable (i.e., 15 indicator variables with one reference group that replace the 16-level factor). A regression model that regresses X on the dummy variables is equivalent to the ANOVA model (in most respects) and allows for most pairwise comparisons (depending on the coding).
The regression coefficients will indicate the differences between groups, and the tests for the coefficients will indicate whether or not these differences are significant at some level of confidence.
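A rough sketch in R, assuming a data frame d with a numeric column X and a 16-level factor group (the object and column names are hypothetical):

# Group means of X: the group with the lowest mean is the one you are after.
tapply(d$X, d$group, mean)

# Dummy-coded regression equivalent to the one-way ANOVA. The first factor
# level is the reference; each coefficient is that group's difference from
# the reference, with a test of whether the difference is zero.
fit <- lm(X ~ group, data = d)
summary(fit)$coefficients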
I am running a series of GLMs for a number of species of the form:
glm.sp <- glm(number ~ site + as.factor(year) + offset(log(visits)), family = poisson, data = data.sp)
Note that the year term is deliberately a factor, as it's not reasonable to assume a linear relationship. The yearly coefficients produced by this model are a measure of the number of each species per year, taking account of the amount of effort (visits). I then want to extract, exponentiate and index (relative to the last year) the year coefficients and run a GAM on them.
Currently I do this by eyeballing the coefficients and calling them directly:
data.sp.coef$coef <- exp(glm.sp$coefficients[60:77])
However, as the number of sites and the number of years recorded differ for each species, this means I need to eyeball each species; for example, a different species might have the year coefficients at 51:64. I'd rather not do that, and feel there must be a better way of calling out the coefficients for the years.
I've tried the below (which doesn't work!):
> coef(glm.sp)["year"]
<NA>
NA
And I also tried saving all the coefficients as a dataframe and using a fuzzy search to extract all the values that contained "year" (the coefficients are automatically saved in the format yearXXXX-YY).
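Something along these lines is what I was attempting (a sketch only; the exact coefficient names depend on how year enters the formula):

# Pick out the year coefficients by name rather than by position.
year.idx <- grep("year", names(coef(glm.sp)), fixed = TRUE)
data.sp.coef$coef <- exp(coef(glm.sp)[year.idx])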
I'm certain I'm missing something simple, so would very much appreciate being prodded in the right direction!
Thanks
Matt
I am not sure how I can design the following problem in CouchDB.
I have a logger web app that keeps track of how many items are in a warehouse. To simplify the problem, we just need to know the total number of items currently in the warehouse and how long each item stays in the warehouse before it ships. Let's say the warehouse only has shoes, but each shoe has a different id and needs to be tracked by id.
The MySQL schema looks like this:
id  name  date-in     date-out
1   shoe  08/0/2010   null
2   shoe  07/20/2010  08/01/2010
The output will be
Number of shoe in warehouse: 1
Average time in warehouse: 14 days
Thanks
jhs' answer is great, but I just wanted to add something:
To use the built-in reduce function for the average calculation (_stats in your case), you have to use two "separate" views. But if your map function is exactly the same, CouchDB will detect that and not generate a whole new index for the second view. This way you can have one map function feeding multiple reduce functions.
If each shoe is a document with a date_in and a date_out, then your reduce function adds 1 if date_out is null and 0 (no change) if date_out is not null. That gives you the total count of shoes in the warehouse.
To compute the average time, you know the time in the warehouse for each shoe, so the reduce function simply accumulates the average. Since reduce functions must be commutative and associative, you use a different average algorithm: the easiest way is to reduce to a [sum, count] array, where sum accumulates the total time over all shoes and count is the number of shoes counted. The client then simply divides sum / count to compute the final average.
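To see why the [sum, count] form works with re-reduce, here is a tiny numeric sketch (toy values in days, written in R just to show the arithmetic rather than as an actual CouchDB view):

# Partial reduces over two arbitrary batches of shoes.
r1 <- c(sum = 10 + 12, count = 2)
r2 <- c(sum = 14, count = 1)

# Re-reduce: merging partial results is element-wise addition, so the order
# and grouping of the batches does not matter.
merged <- r1 + r2

# The client divides at the end: 36 / 3 = 12 days on average.
merged["sum"] / merged["count"]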
I think you could combine both of these into one big reduce if you want, perhaps building up a {"shoes in warehouse": 1, "average time in warehouse": [253, 15]} kind of object.
However, if you can accept two different views for this data, then there is a shortcut for the average. In the map, emit(null, time) where time is the time spent in the warehouse. In the reduce, set the entire reduce value to _stats (see Built-in reduce functions). The view output will be an object with the sum and count already computed.