Negative coefficient for dummy variable regression analysis - regression

I am interpreting multiple regression analysis with the dependent variable being rate of return (ROR) on a stock. Also, I have included a dummy-variable that represents the stock getting included and excluded out of an index (1 = included, 0 = excluded).
The output showed a negative coefficient-value (the value is -0,03852), and I´m not quite sure how to make of this. Does this mean that being included in an index has a negative effect on the dependent variable?
Could someone please provide me an "easy" explanation?
Thanks!

It means when the stock is included (value of 1), you expect the rate of return to decrease by 0.03852. More or less, i guess you can call it a negative effect. What you should check is whether this coefficient is significant or what is the standard error of this coefficient. Usually regression tools should provide that.

Related

Would this be a valid Implementation of an ordinal CrossEntropy?

Would this be a valid implementation of a cross entropy loss that takes the ordinal structure of the GT y into consideration? y_hat is the prediction from a neural network.
ce_loss = F.cross_entropy(y_hat, y, reduction="none")
distance_weight = torch.abs(y_hat.argmax(1) - y) + 1
ordinal_ce_loss = torch.mean(distance_weight * ce_loss)
I'll attempt to answer this question by first fully defining the task, since the question is a bit sparse on details.
I have a set of ordinal classes (e.g. first, second, third, fourth,
etc.) and I would like to predict the class of each data example from
among this set. I would like to define an entropy-based loss-function
for this problem. I would like this loss function to weight the loss
between a predicted class torch.argmax(y_hat) and the true class y
according to the ordinal distance between the two classes. Does the
given loss expression accomplish this?
Short answer: sure, it is "valid". You've roughly implemented L1-norm ordinal class weighting. I'd question whether this is truly the correct weighting strategy for this problem.
For instance, consider that for a true label n, the bin n response is weighted by 1, but the bin n+1 and n-1 responses are weighted by 2. This means that a lot more emphasis will be placed on NOT predicting false positives than on correctly predicting true positives, which may imbue your model with some strange bias.
It also means that examples on the edge will result in a larger total sum of weights, meaning that you'll be weighting examples where the true label is say "first" or "last" more highly than the intermediate classes. (Say you have 5 classes: 1,2,3,4,5. A true label of 1 will require distance_weight of [1,2,3,4,5], the sum of which is 15. A true label of 3 will require distance_weight of [3,2,1,2,3], the sum of which is 11.
In general, classification problems and entropy-based losses are underpinned by the assumption that no set of classes or categories is any more or less related than any other set of classes. In essence, the input data is embedded into an orthogonal feature space where each class represents one vector in the basis. This is quite plainly a bad assumption in your case, meaning that this embedding space is probably not particularly elegant: thus, you have to correct for it with sort of a hack-y weight fix. And in general, this assumption of class non-correlation is probably not true in a great many classification problems (consider e.g. the classic ImageNet classification problem, wherein the class pairs [bus,car], and [bus,zebra] are treated as equally dissimilar. But this is probably a digression into the inherent lack of usefulness of strict ontological structuring of information which is outside the scope of this answer...)
Long Answer: I'd highly suggest moving into a space where the ordinal value you care about is instead expressed in a continuous space. (In the first, second, third example, you might for instance output a continuous value over the range [1,max_place]. This allows you to benefit from loss functions that already capture well the notion that predictions closer in an ordered space are better than predictions farther away in an ordered space (e.g. MSE, Smooth-L1, etc.)
Let's consider one more time the case of the [first,second,third,etc.] ordinal class example, and say that we are trying to predict the places of a set of runners in a race. Consider two races, one in which the first place runner wins by 30% relative to the second place runner, and the second in which the first place runner wins by only 1%. This nuance is entirely discarded by the ordinal discrete classification. In essence, the selection of an ordinal set of classes truncates the amount of information conveyed in the prediction, which means not only that the final prediction is less useful, but also that the loss function encodes this strange truncation and binarization, which is then reflected (perhaps harmfully) in the learned model. This problem could likely be much more elegantly solved by regressing the finishing position, or perhaps instead by regressing the finishing time, of each athlete, and then performing the final ordinal classification into places OUTSIDE of the network training.
In conclusion, you might expect a well-trained ordinal classifier to produce essentially a normal distribution of responses across the class bins, with the distribution peak on the true value: a binned discretization of a space that almost certainly could, and likely should, be treated as a continuous space.

what's the meaning of 'parameterize' in deep learning?

no
13
what's the meaning of 'parameterize' in deep learning? As shown in the photo, does it means the matrix 'A' can be changed by the optimization during training?
Yes, when something can be parameterized it means that gradients can be calculated.
This means that the (dE/dw) which means the derivative of Error with respect to weight can be calculated (i.e it must be differentiable) and subtracted from the model weights along with obviously a learning_rate and other params being included depending on the optimizer.
What the paper is saying is that if you make a binary matrix a weight and then find the gradient (dE/dw) of that weight with respect to a loss and then make an update on the binary matrix through backpropagation, there is not really an activation function (which by requirement must be differentiable) that can keep the values discrete (like 0 and 1) but rather you will end up with continous values (like these decimal values).
Therefore it is saying since the idea of having binary values be weights and for them to be back-propagated in a way where the weights + activation function also yields an updated weight matrix that is also binary is difficult, another solution like the Bernoulli Distribution is used instead to initialize parameters of a model.
Hope this helps,

Find marginal effects in multiple equation model with ordered probit - cmp

I am really new to Stata, so that my question might be trivial
I am using package cmp to estimate a bivariate model that goes as follows:
cmp(d_ln_jobs = d_layer) (d_layer = d_tariff), vce(robust) ind($cmp_cont $cmp_probit) nolr quietly difficult
d_layer is an ordered variable that assumes -4, -3, ... 4.
How could I obtain the marginal effect of d_tariff on both dependent variables, evaluated at d_tariff's median?
Here is what I've tried:
margins, dydx(d_tariff) at((median)) force
I don't think this is correct since, as an output, the entry related to dy/dx says 0, and at the header of the output it shows:
"Expression: linear prediction, predict()"
Does this last part mean that it would show predicted probabilities rather than marginal effects? Besides, shouldn't I get a value different from 0? In my mind, d_tariff would change d_layer, which would change d_ln_jobs? Why don't I get two values, one showing the marginal effect on d_layer and the other on d_ln_jobs?

fft: fitting binned data

I want to fit a curve to data obtained from an FFT. While working on this, I remembered that an FFT gives binned data, and therefore I wondered if I should treat this differently with curve-fitting.
If the bins are narrow compared to the structure, I think it should not be necessary to treat the data differently, but for me that is not the case.
I expect the right way to fit binned data is by minimizing not the difference between values of the bin and fit, but between bin area and the area beneath the fitted curve, for each bin, such that the energy in each bin matches the energy in the range of the bin as signified by the curve.
So my question is: am I thinking correctly about this? If not, how should I go about it?
Also, when looking around for information about this subject, I encountered the "Maximum log likelihood" for example, but did not find enough information about it to understand if and how it applied to my situation.
PS: I have no clue if this is the right site for this question, please let me know if there is a better place.
For an unwindowed FFT, the correct interpolation between bins is by using a Sinc (sin(x)/x) or periodic Sinc (Dirichlet) interpolation kernel. For an FFT of samples of a band-limited signal, thus will reconstruct the continuous spectrum.
A very simple and effective way of interpolating the spectrum (from an FFT) is to use zero-padding. It works both with and without windowing prior to the FFT.
Take your input vector of length N and extend it to length M*N, where M is an integer
Set all values beyond the original N values to zeros
Perform an FFT of length (N*M)
Calculate the magnitude of the ouput bins
What you get is the interpolated spectrum.
Best regards,
Jens
This can be done by using maximum log likelihood estimation. This is a method that finds the set of parameters that is most likely to have yielded the measured data - the technique originates in statistics.
I have finally found an understandable source for how to apply this to binned data. Sadly I cannot enter formulas here, so I refer to that source for a full explanation: slide 4 of this slide show.
EDIT:
For noisier signals this method did not seem to work very well. A method that was a bit more robust is a least squares fit, where the difference between the area is minimized, as suggested in the question.
I have not found any literature to defend this method, but it is similar to what happens in the maximum log likelihood estimation, and yields very similar results for noiseless test cases.

Determining edge weights given a list of walks in a graph

These questions regard a set of data with lists of tasks performed in succession and the total time required to complete them. I've been wondering whether it would be possible to determine useful things about the tasks' lengths, either as they are or with some initial guesstimation based on appropriate domain knowledge. I've come to think graph theory would be the way to approach this problem in the abstract, and have a decent basic grasp of the stuff, but I'm unable to know for certain whether I'm on the right track. Furthermore, I think it's a pretty interesting question to crack. So here we go:
Is it possible to determine the weights of edges in a directed weighted graph, given a list of walks in that graph with the lengths (summed weights) of said walks? I recognize the amount and quality of permutations on the routes taken by the walks will dictate the quality of any possible answer, but let's assume all possible walks and their lengths are given. If a definite answer isn't possible, what kind of things can be concluded about the graph? How would you arrive at those conclusions?
What if there were several similar walks with possibly differing lengths given? Can you calculate a decent average (or other illustrative measure) for each edge, given enough permutations on different routes to take? How will discounting some permutations from the available data set affect the calculation's accuracy?
Finally, what if you had a set of initial guesses as to the weights and had to refine those using the walks given? Would that improve upon your guesstimation ability, and how could you apply the extra information?
EDIT: Clarification on the difficulties of a plain linear algebraic approach. Consider the following set of walks:
a = 5
b = 4
b + c = 5
a + b + c = 8
A matrix equation with these values is unsolvable, but we'd still like to estimate the terms. There might be some helpful initial data available, such as in scenario 3, and in any case we can apply knowledge of the real world - such as that the length of a task can't be negative. I'd like to know if you have ideas on how to ensure we get reasonable estimations and that we also know what we don't know - eg. when there's not enough data to tell a from b.
Seems like an application of linear algebra.
You have a set of linear equations which you need to solve. The variables being the lengths of the tasks (or edge weights).
For instance if the tasks lengths were t1, t2, t3 for 3 tasks.
And you are given
t1 + t2 = 2 (task 1 and 2 take 2 hours)
t1 + t2 + t3 = 7 (all 3 tasks take 7 hours)
t2 + t3 = 6 (tasks 2 and 3 take 6 hours)
Solving gives t1 = 1, t2 = 1, t3 = 5.
You can use any linear algebra techniques (for eg: http://en.wikipedia.org/wiki/Gaussian_elimination) to solve these, which will tell you if there is a unique solution, no solution or an infinite number of solutions (no other possibilities are possible).
If you find that the linear equations do not have a solution, you can try adding a very small random number to some of the task weights/coefficients of the matrix and try solving it again. (I believe falls under Perturbation Theory). Matrices are notorious for radically changing behavior with small changes in the values, so this will likely give you an approximate answer reasonably quickly.
Or maybe you can try introducing some 'slack' task in each walk (i.e add more variables) and try to pick the solution to the new equations where the slack tasks satisfy some linear constraints (like 0 < s_i < 0.0001 and minimize sum of s_i), using Linear Programming Techniques.
Assume you have an unlimited number of arbitrary characters to represent each edge. (a,b,c,d etc)
w is a list of all the walks, in the form of 0,a,b,c,d,e etc. (the 0 will be explained later.)
i = 1
if #w[i] ~= 1 then
replace w[2] with the LENGTH of w[i], minus all other values in w.
repeat forever.
Example:
0,a,b,c,d,e 50
0,a,c,b,e 20
0,c,e 10
So:
a is the first. Replace all instances of "a" with 50, -b,-c,-d,-e.
New data:
50, 50
50,-b,-d, 20
0,c,e 10
And, repeat until one value is left, and you finish! Alternatively, the first number can simply be subtracted from the length of each walk.
I'd forget about graphs and treat lists of tasks as vectors - every task represented as a component with value equal to it's cost (time to complete in this case.
In tasks are in different orderes initially, that's where to use domain knowledge to bring them to a cannonical form and assign multipliers if domain knowledge tells you that the ratio of costs will be synstantially influenced by ordering / timing. Timing is implicit initial ordering but you may have to make a function of time just for adjustment factors (say drivingat lunch time vs driving at midnight). Function might be tabular/discrete. In general it's always much easier to evaluate ratios and relative biases (hardnes of doing something). You may need a functional language to do repeated rewrites of your vectors till there's nothing more that romain knowledge and rules can change.
With cannonical vectors consider just presence and absence of task (just 0|1 for this iteratioon) and look for minimal diffs - single task diffs first - that will provide estimates which small number of variables. Keep doing this recursively, be ready to back track and have a heuristing rule for goodness or quality of estimates so far. Keep track of good "rounds" that you backtraced from.
When you reach minimal irreducible state - dan't many any more diffs - all vectors have the same remaining tasks then you can do some basic statistics like variance, mean, median and look for big outliers and ways to improve initial domain knowledge based estimates that lead to cannonical form. If you finsd a lot of them and can infer new rules, take them in and start the whole process from start.
Yes, this can cost a lot :-)