Warning appears in mixed effect model using SPSS

I am using SPSS to fit a mixed effect model for the following project: 
Participants are asked some open-ended questions and their answers are recorded.
For example, if a participant's answer is related to equality, the variable "equality" is coded as "1"; otherwise it is coded as "0". The dependent variable is therefore "equality".
Fixed effects: 
- participant's country (Asians vs. Westerners)
- gender (Male vs Female)
- age group (younger age group vs. older age group)
- condition (control group vs. intervention group)
Random effect: Subject ID (participants)
Sample size: over 600 participants
My syntax in SPSS: 
MIXED  Equality BY Country Gender AgeGroup Condition
/CRITERIA=CIN(95) MXITER(100) MXSTEP(10) SCORING(1) SINGULAR(0.000000000001) HCONVERGE(0, ABSOLUTE) LCONVERGE(0, ABSOLUTE) PCONVERGE(0.000001, ABSOLUTE)
/FIXED= Country Gender AgeGroup Condition  | SSTYPE(3)
/METHOD=ML
/PRINT=SOLUTION TESTCOV
/RANDOM=INTERCEPT | SUBJECT(SubID_R) COVTYPE(VC).
When running this analysis in SPSS, the following warning appears: 
Iteration was terminated but convergence has not been achieved.
The MIXED procedure continues despite this warning. Subsequent results produced are based on the last iteration. Validity of the model fit is uncertain.    
I tried increasing MXSTEP from 10 to 10000 in the syntax, but another warning appeared: 
The final Hessian matrix is not positive definite although all
convergence criteria are satisfied.
The MIXED procedure continues despite this warning. Validity of subsequent results cannot be ascertained.  
I also tried increasing MXITER, but the warning remains. How can I deal with this problem and get rid of the warning? 

Aside from what you've already tried, increasing the number of Fisher scoring steps can sometimes help, but it may be that your random intercept variance is truly redundant, in which case you won't be able to resolve this problem with those data and that model.
Also, you would typically not use a linear model for a binary response variable, but something like a logistic model instead (this can be done in GENLINMIXED, under Analyze > Mixed Models > Generalized Linear in the menus).


Would this be a valid implementation of an ordinal cross entropy?

Would this be a valid implementation of a cross entropy loss that takes the ordinal structure of the ground-truth label y into consideration? y_hat is the prediction from a neural network.
import torch
import torch.nn.functional as F

ce_loss = F.cross_entropy(y_hat, y, reduction="none")    # per-example cross entropy
distance_weight = torch.abs(y_hat.argmax(1) - y) + 1     # ordinal distance to the true class, plus 1
ordinal_ce_loss = torch.mean(distance_weight * ce_loss)
I'll attempt to answer this question by first fully defining the task, since the question is a bit sparse on details.
I have a set of ordinal classes (e.g. first, second, third, fourth,
etc.) and I would like to predict the class of each data example from
among this set. I would like to define an entropy-based loss-function
for this problem. I would like this loss function to weight the loss
between a predicted class torch.argmax(y_hat) and the true class y
according to the ordinal distance between the two classes. Does the
given loss expression accomplish this?
Short answer: sure, it is "valid". You've roughly implemented L1-norm ordinal class weighting. I'd question whether this is truly the correct weighting strategy for this problem.
For instance, consider that for a true label n, the bin n response is weighted by 1, but the bin n+1 and n-1 responses are weighted by 2. This means that a lot more emphasis will be placed on NOT predicting false positives than on correctly predicting true positives, which may imbue your model with some strange bias.
It also means that examples at the edges of the ordinal range carry a larger total sum of possible weights, so you'll effectively be weighting examples whose true label is, say, "first" or "last" more heavily than the intermediate classes. (Say you have 5 classes: 1,2,3,4,5. A true label of 1 gives possible distance_weight values of [1,2,3,4,5], which sum to 15. A true label of 3 gives [3,2,1,2,3], which sum to 11.)
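To make that concrete, here is a small sketch (my own illustration, with class indices 0-4 standing in for 1-5) that just tabulates the possible weights per predicted bin:

import torch

# For 5 ordinal classes, list the weight each possible predicted bin would get
# under distance_weight = |argmax(y_hat) - y| + 1, for two different true labels.
num_classes = 5
for true_label in (0, 2):                 # "1" and "3" in the example above
    weights = torch.abs(torch.arange(num_classes) - true_label) + 1
    print(true_label, weights.tolist(), weights.sum().item())
# true label 0 -> [1, 2, 3, 4, 5], sum 15
# true label 2 -> [3, 2, 1, 2, 3], sum 11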
In general, classification problems and entropy-based losses are underpinned by the assumption that no set of classes or categories is any more or less related than any other set of classes. In essence, the input data is embedded into an orthogonal feature space where each class represents one vector in the basis. This is quite plainly a bad assumption in your case, meaning that this embedding space is probably not particularly elegant: thus, you have to correct for it with sort of a hack-y weight fix. And in general, this assumption of class non-correlation is probably not true in a great many classification problems (consider e.g. the classic ImageNet classification problem, wherein the class pairs [bus,car], and [bus,zebra] are treated as equally dissimilar. But this is probably a digression into the inherent lack of usefulness of strict ontological structuring of information which is outside the scope of this answer...)
Long Answer: I'd highly suggest moving to a representation where the ordinal value you care about is expressed in a continuous space. (In the first, second, third example, you might for instance output a continuous value over the range [1, max_place].) This allows you to benefit from loss functions that already capture well the notion that predictions closer in an ordered space are better than predictions farther away in an ordered space (e.g. MSE, Smooth-L1, etc.).
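As a minimal sketch of that alternative (a hypothetical toy network with a single continuous output, trained with Smooth-L1 instead of cross entropy):

import torch
import torch.nn as nn

# Hypothetical regressor: one continuous output instead of num_classes logits.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.SmoothL1Loss()

x = torch.randn(8, 16)                   # dummy batch of 8 examples, 16 features
y = torch.randint(1, 6, (8, 1)).float()  # ordinal targets 1..5, treated as continuous

loss = loss_fn(model(x), y)              # closer predictions incur smaller loss
loss.backward()
# At inference time, round/clip the output back to the ordinal classes, outside training.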
Let's consider one more time the case of the [first,second,third,etc.] ordinal class example, and say that we are trying to predict the places of a set of runners in a race. Consider two races, one in which the first place runner wins by 30% relative to the second place runner, and the second in which the first place runner wins by only 1%. This nuance is entirely discarded by the ordinal discrete classification. In essence, the selection of an ordinal set of classes truncates the amount of information conveyed in the prediction, which means not only that the final prediction is less useful, but also that the loss function encodes this strange truncation and binarization, which is then reflected (perhaps harmfully) in the learned model. This problem could likely be much more elegantly solved by regressing the finishing position, or perhaps instead by regressing the finishing time, of each athlete, and then performing the final ordinal classification into places OUTSIDE of the network training.
In conclusion, you might expect a well-trained ordinal classifier to produce essentially a normal distribution of responses across the class bins, with the distribution peak on the true value: a binned discretization of a space that almost certainly could, and likely should, be treated as a continuous space.

LSTM Evolution Forecast

I am confused about how LSTM networks work when forecasting with a horizon that is not fixed; rather, I am looking for a prediction at an arbitrary time in the future. In physical terms I would call it the evolution of the system.
Suppose I have a time series $y(t)$ (output) I want to forecast, and some external inputs $u_1(t), u_2(t),\cdots u_N(t)$ on which the series $y(t)$ depends.
It's common to use the lagged value of the output $y(t)$ as input for the network, such that I schematically have something like (let's consider for simplicity just lag 1 for the output and no lag for the external input):
$[y(t-1), u_1(t), u_2(t),\cdots u_N(t)] \to y(t)$
With the network set up this way, when one wants to do a recursive forecast, it is forced to use the predicted value from the previous step as input for the next step. This propagates the prediction error forward and makes the long-term forecast behave badly.
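Schematically, the recursive loop looks something like this (a rough sketch with a hypothetical model and hypothetical input arrays, just to show where the feedback happens):

# Hypothetical recursive forecast: the prediction at step t becomes the lagged
# input at step t+1, so any prediction error is fed back into the model.
y_prev = y_history[-1]                  # last observed output (hypothetical array)
forecasts = []
for t in range(horizon):
    x_t = [y_prev] + list(u_future[t])  # lagged output plus external inputs u_1..u_N
    y_prev = model.predict(x_t)         # hypothetical model API
    forecasts.append(y_prev)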
Now, my confusion is this: I think of an RNN as a kind of (simple) implementation of a state-space model, where I have the inputs, my output, and one or more state variables responsible for the memory of the system. These state variables are hidden and not observed.
So now the question: if there is this kind of variable that already takes the previous states of the system into account, why would I need to use the lagged output value as an input to my network/model?
If I got rid of it, would my long-term forecast be better, since I would no longer expect the error of the forecasted output to propagate? (I guess there would still be an error propagating in the internal state.)
Thanks!
Please see DeepAR, an LSTM forecaster that predicts more than one step into the future.
The main contributions of the paper are twofold: (1) we propose an RNN
architecture for probabilistic forecasting, incorporating a negative
Binomial likelihood for count data as well as special treatment for
the case when the magnitudes of the time series vary widely; (2) we
demonstrate empirically on several real-world data sets that this
model produces accurate probabilistic forecasts across a range of
input characteristics, thus showing that modern deep learning-based
approaches can effectively address the probabilistic forecasting
problem, which is in contrast to common belief in the field and the
mixed results …
In this paper, they forecast multiple steps into the future, to counter exactly what you describe here: the propagation of error.
Skipping several steps allows you to get more accurate predictions further into the future.
One more thing done in this paper is predicting percentiles and interpolating, rather than predicting the value directly. This adds stability and provides an error assessment.
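For reference, the percentile idea comes down to a quantile (pinball) loss; here is a minimal PyTorch sketch (my own illustration, not code from the paper):

import torch

def quantile_loss(pred, target, q):
    # Pinball loss for quantile level q in (0, 1): under-predictions are
    # penalized with weight q, over-predictions with weight (1 - q).
    diff = target - pred
    return torch.mean(torch.maximum(q * diff, (q - 1) * diff))

# e.g. train one output per quantile level (0.1, 0.5, 0.9) and interpolate between them.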
Disclaimer - I read an older version of this paper.

Negative coefficient for dummy variable regression analysis

I am interpreting a multiple regression analysis with the dependent variable being the rate of return (ROR) on a stock. I have also included a dummy variable that represents the stock being included in or excluded from an index (1 = included, 0 = excluded).
The output showed a negative coefficient value (-0.03852), and I'm not quite sure what to make of this. Does this mean that being included in an index has a negative effect on the dependent variable?
Could someone please provide me an "easy" explanation?
Thanks!
It means that when the stock is included (value of 1), you expect the rate of return to be lower by 0.03852, holding the other predictors constant. More or less, I guess you can call it a negative effect. What you should check is whether this coefficient is significant, or what its standard error is. Regression tools usually provide that.
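To make that concrete, a minimal sketch (with hypothetical column names ror for the return and included for the dummy), assuming statsmodels is available:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("stock_returns.csv")            # hypothetical data file
fit = smf.ols("ror ~ included", data=df).fit()   # add your other predictors to the formula
print(fit.summary())                             # check the coef, std err and P>|t| for 'included'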

Get rows product (multiplication)

SO,
The problem
I have an issue with row multiplication. In SQL, there is a SUM() function which calculates the sum of some field over a set of rows. I want the product instead, i.e. for the table
+------+
| data |
+------+
|    2 |
|   -1 |
|    3 |
+------+
the result should be 2*(-1)*3 = -6. I'm using the DOUBLE data type for storing my data values.
My approach
From school math it is known that log(A × B) = log(A) + log(B), so that can be used to build the desired expression:
SELECT
  IF(COUNT(IF(SIGN(`col`)=0,1,NULL)), 0,
     IF(COUNT(IF(SIGN(`col`)<0,1,NULL))%2, -1, 1)
     *
     EXP(SUM(LN(ABS(`col`))))) AS product
FROM `test`;
Here you see the weakness of this method: since log(X) is undefined when X <= 0, I need to count the negative signs before calculating the whole expression. Sample data and a query for this are given in this fiddle.
Another weakness is that we need to check whether there is a 0 among the column values. (This is only a sample; in the real situation I am going to select the product for some subset of table rows matching some condition(s), i.e. I cannot simply remove the zeros from my table, because a zero product is a valid and expected result for some row subsets.)
Specifics
And now, finally, the main part of my question: how do I handle a situation where we have an expression like X*Y*Z with X < MAXF and Y < MAXF, but X*Y > MAXF while X*Y*Z < MAXF, i.e. a possible data type overflow (here MAXF is the limit of the MySQL DOUBLE data type)? The sample is here. The query above works well, but can I always be sure that it will handle this properly? That is, maybe there is another case with an overflow issue where some sub-product overflows but the entire product is fine (no overflow).
Or maybe there is another way to find the product of rows? Also, the table may contain millions of records (-1.1 < X <= 1.1 mostly, but probably with values such as 100 or 1000, i.e. high enough to overflow DOUBLE if multiplied by a certain quantity, if we have the issue I've described above). And maybe calculating via logs will be slow?
I guess this would work...
SELECT IF(MOD(SUM(data < 0), 2) = 1
         , EXP(SUM(LOG(ABS(data)))) * -1
         , EXP(SUM(LOG(ABS(data)))))
       AS x
FROM my_table;
-- SUM(data < 0) counts the negative rows, and LOG() needs ABS() because the
-- log of a negative value is NULL in MySQL; zeros still need separate
-- handling, as in the question's own query.
If you need this type of calculation often, I suggest you store the signs and the logarithms in separate columns.
The signs can be stored as 1 (for positives), -1 (for negatives) and 0 (for zero).
The logarithm for zero can be stored as 0 (or any other value), but it should not be used in the calculations.
Then the calculation would be:
SELECT
CASE WHEN EXISTS (SELECT 1 FROM test WHERE <condition> AND datasign = 0)
THEN 0
ELSE (SELECT 1-2*(SUM(datasign=-1)%2) FROM test WHERE <condition>)
END AS resultsign,
CASE WHEN EXISTS (SELECT 1 FROM test WHERE <condition> AND datasign = 0)
THEN -1 -- undefined log for result 0
ELSE (SELECT SUM(datalog) FROM test WHERE <condition> AND datasign <> 0)
END AS resultlog
;
This way, you have no overflow problems. You can check whether resultlog exceeds some limit, or just try to calculate resultdata = resultsign * EXP(resultlog) and see if an error is thrown.
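Just to illustrate the mechanics outside of MySQL, a minimal Python sketch of the same sign/zero/log decomposition:

import math

def product_via_logs(values):
    # Track zeros and the sign separately; accumulate logs of absolute values,
    # so intermediate results never overflow (only the final exp() might).
    if any(v == 0 for v in values):
        return 0.0
    sign = -1.0 if sum(v < 0 for v in values) % 2 else 1.0
    return sign * math.exp(sum(math.log(abs(v)) for v in values))

print(product_via_logs([2, -1, 3]))   # approximately -6.0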
This question is a remarkable one in the sea of low quality ones. Thank you, even reading it was a pleasure.
Precision
The exp(log(a)+log(b)) idea is a good one in itself. However, after reading "What Every Computer Scientist Should Know About Floating-Point Arithmetic", make sure you use DECIMAL or NUMERIC data types to be sure you are using precision math, or else your values may be surprisingly inaccurate. For a couple of million rows, errors can add up very quickly! DECIMAL (as per the MySQL doc) has a maximum of 65 digits of precision, while 64-bit IEEE 754 floating-point values have only about 16 significant digits (log10(2^53) ≈ 15.95).
Overflow
As per the relevant part of the MySQL doc:
Integer overflow results in silent wraparound.
DECIMAL overflow results in a truncated result and a warning.
Floating-point overflow produces a NULL result. Overflow for some operations can result in +INF, -INF, or NaN.
So you can detect floating-point overflow if it ever happens.
Sadly, if a series of operations would produce a final value that fits into the data type used, but at least one intermediate result in the calculation does not, then you won't get the correct value at the end.
Performance
Premature optimization is the root of all evil. Try it, and if it is slow, take appropriate action. Doing this might not be lightning quick, but it could still be quicker than fetching all the rows and computing the product on the application server. Only measurements can decide which is quicker...

Determining edge weights given a list of walks in a graph

These questions regard a set of data with lists of tasks performed in succession and the total time required to complete them. I've been wondering whether it would be possible to determine useful things about the tasks' lengths, either as they are or with some initial guesstimation based on appropriate domain knowledge. I've come to think graph theory would be the way to approach this problem in the abstract, and have a decent basic grasp of the stuff, but I'm unable to know for certain whether I'm on the right track. Furthermore, I think it's a pretty interesting question to crack. So here we go:
Is it possible to determine the weights of edges in a directed weighted graph, given a list of walks in that graph with the lengths (summed weights) of said walks? I recognize the amount and quality of permutations on the routes taken by the walks will dictate the quality of any possible answer, but let's assume all possible walks and their lengths are given. If a definite answer isn't possible, what kind of things can be concluded about the graph? How would you arrive at those conclusions?
What if there were several similar walks with possibly differing lengths given? Can you calculate a decent average (or other illustrative measure) for each edge, given enough permutations on different routes to take? How will discounting some permutations from the available data set affect the calculation's accuracy?
Finally, what if you had a set of initial guesses as to the weights and had to refine those using the walks given? Would that improve upon your guesstimation ability, and how could you apply the extra information?
EDIT: Clarification on the difficulties of a plain linear algebraic approach. Consider the following set of walks:
a = 5
b = 4
b + c = 5
a + b + c = 8
A matrix equation with these values has no exact solution, but we'd still like to estimate the terms. There might be some helpful initial data available, as in scenario 3, and in any case we can apply knowledge of the real world, such as that the length of a task can't be negative. I'd like to know if you have ideas on how to ensure we get reasonable estimates and also know what we don't know, e.g. when there's not enough data to tell a from b.
Seems like an application of linear algebra.
You have a set of linear equations which you need to solve, the variables being the lengths of the tasks (or edge weights).
For instance, suppose the task lengths are t1, t2, t3 for 3 tasks.
And you are given
t1 + t2 = 2 (task 1 and 2 take 2 hours)
t1 + t2 + t3 = 7 (all 3 tasks take 7 hours)
t2 + t3 = 6 (tasks 2 and 3 take 6 hours)
Solving gives t1 = 1, t2 = 1, t3 = 5.
You can use any linear algebra technique (e.g. http://en.wikipedia.org/wiki/Gaussian_elimination) to solve these, which will tell you whether there is a unique solution, no solution, or an infinite number of solutions (no other outcomes are possible).
If you find that the linear equations do not have a solution, you can try adding a very small random number to some of the task weights/coefficients of the matrix and solving again (I believe this falls under perturbation theory). Matrices are notorious for radically changing behavior with small changes in the values, so this will likely give you an approximate answer reasonably quickly.
Or you can try introducing a 'slack' task in each walk (i.e. add more variables) and pick the solution to the new equations where the slack tasks satisfy some linear constraints (like 0 < s_i < 0.0001, minimizing the sum of the s_i), using linear programming techniques.
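Following up on the least-squares/constraint idea with the example system from the question's edit, here is a minimal sketch (my own illustration) using scipy's non-negative least squares, which also enforces the "no negative task length" constraint:

import numpy as np
from scipy.optimize import nnls

# Each row is a walk: columns are the edges (a, b, c), entries count how often
# each edge appears in the walk; y holds the observed walk lengths.
A = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 1, 1],
              [1, 1, 1]], dtype=float)
y = np.array([5, 4, 5, 8], dtype=float)

estimate, residual = nnls(A, y)   # non-negative least-squares estimate of (a, b, c)
print(estimate)                   # roughly [4.33, 4.00, 0.33] for this over-determined system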
Assume you have an unlimited number of arbitrary symbols to represent each edge (a, b, c, d, etc.).
w is a list of all the walks, in the form of 0,a,b,c,d,e etc. (the 0 will be explained later.)
i = 1
while some walk still contains more than one unknown symbol do
    take walk w[i]; let s be its first unknown symbol
    express s as the LENGTH of w[i] minus all the other terms in w[i],
    and substitute that expression for s in every other walk
    i = i + 1
Example:
0,a,b,c,d,e 50
0,a,c,b,e 20
0,c,e 10
So:
a is the first symbol. Replace all instances of "a" with 50 - b - c - d - e.
New data:
50, 50
50,-d, 20
0,c,e 10
And repeat until only one unknown is left in each walk, and you're finished! Alternatively, the first number can simply be subtracted from the length of each walk.
I'd forget about graphs and treat the lists of tasks as vectors: every task is represented as a component whose value equals its cost (time to complete, in this case).
If the tasks appear in different orders initially, that's where you use domain knowledge to bring them into a canonical form, and to assign multipliers if domain knowledge tells you that the ratio of costs is substantially influenced by ordering/timing. Timing is an implicit initial ordering, but you may have to make a function of time just for adjustment factors (say, driving at lunch time vs. driving at midnight); the function might be tabular/discrete. In general it's always much easier to evaluate ratios and relative biases (the hardness of doing something). You may need a functional language to do repeated rewrites of your vectors until there is nothing more that domain knowledge and rules can change.
With the canonical vectors, consider just the presence or absence of each task (just 0|1 for this iteration) and look for minimal diffs, single-task diffs first: those will provide estimates with a small number of variables. Keep doing this recursively, be ready to backtrack, and have a heuristic rule for the goodness or quality of the estimates so far. Keep track of good "rounds" that you backtracked from.
When you reach a minimal irreducible state, where you can't make any more diffs and all vectors have the same remaining tasks, you can do some basic statistics (variance, mean, median), look for big outliers, and look for ways to improve the initial domain-knowledge-based estimates that led to the canonical form. If you find a lot of them and can infer new rules, take them in and start the whole process again from the beginning.
Yes, this can cost a lot :-)