How can I use the median of a leaf instead of the mean in LightGBM regression?

I have a dataset with 6 numeric inputs and a numeric target column to predict. I'm using LightGBM and getting considerably good results. As far as I know, the predicted values are the learning-rate-weighted sums of the tree leaf values, and each leaf's value is the mean of the observations it contains. But I suspect that the observations in each leaf may not be normally distributed, so I want to use the median, or maybe a trimmed mean (trim_mean), of each leaf instead of the mean. Is there an option for this?
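LightGBM doesn't appear to expose a per-leaf median or trimmed-mean option directly, but one common workaround is its built-in quantile objective: with alpha=0.5 the model is trained against the pinball loss for the 50th percentile, so predictions target the conditional median rather than the conditional mean. A minimal sketch, assuming the sklearn-style API and made-up data standing in for the real 6-column dataset:

import lightgbm as lgb
import numpy as np

# hypothetical stand-in for the 6 numeric inputs and the numeric target
X = np.random.rand(1000, 6)
y = np.random.rand(1000)

# objective='quantile' with alpha=0.5 optimizes the pinball loss for the median,
# so the model targets the conditional median instead of the conditional mean
model = lgb.LGBMRegressor(objective='quantile', alpha=0.5, n_estimators=200)
model.fit(X, y)
preds = model.predict(X)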

Related

Is there a way to not select a reference category for logistic regression in SPSS?

When doing logistic regression in SPSS, is there a way to remove the reference category in the independent variables so they're all compared against each other equally rather than against the reference category?
When you have a categorical predictor variable, the most fundamental way to encode it for modeling, sometimes referred to as the canonical representation, is to use a 0-1 indicator for each level of the predictor, where each case takes on a value of 1 for the indicator corresponding to its category, and 0 for all the other indicators. The multinomial logistic regression procedure in SPSS (NOMREG) uses this parameterization.
If you run NOMREG with a single categorical predictor with k levels, the design matrix is built with an intercept column and the k indicator variables, unless you suppress the intercept. If the intercept remains in the model, the last indicator will be redundant, linearly dependent on the intercept and the first k-1 indicators. Another way to say this is that the design matrix is of deficient rank, since any of the columns can be predicted given the other k columns.
The same redundancy will be true of any additional categorical predictors entered as main effects (only k-1 of k indicators can be nonredundant). If you add interactions among categorical predictors, an indicator for each combination of levels of the two predictors is generated, but more than one of these will also be redundant given the intercept and main effects preceding the interaction(s).
The fundamental or canonical representation of the model is thus overparameterized, meaning it has more parameters than can be uniquely estimated. There are multiple ways commonly used to deal with this fact. One approach is the one used in NOMREG and most other more recent regression-type modeling procedures in SPSS, which is to use a generalized inverse of the cross-product of the design matrix, which has the effect of aliasing parameters associated with redundant columns to 0. You'll see these parameters represented by 0 values with no standard errors or other statistics in the SPSS output.
The other way used in SPSS to handle the overparameterized nature of the basic model is to reparameterize the design matrix to full rank, which involves creating k-1 coded variables instead of k indicators for each main effect, and creating interaction variables from these. This is the approach taken in LOGISTIC REGRESSION.
Note that the overall model fit and predicted values from a logistic regression (or other form of linear or generalized linear model) will be the same regardless of what choices are made about parameterization, as long as the appropriate total number of unique columns are in the design matrix. Particular parameter estimates are of course highly dependent upon the particular parameterization used, but you can derive the results from any of the valid approaches using the results from any other valid approach.
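To make that concrete, here is a small NumPy sketch (using ordinary least squares rather than logistic regression purely for brevity; the same column-space argument applies to the logistic case). The overparameterized design with an intercept plus all k indicators and the full-rank design with k-1 indicators span the same column space, so the fitted values agree even though the individual coefficients differ:

import numpy as np

rng = np.random.default_rng(1)
n, k = 30, 3
groups = rng.integers(0, k, size=n)     # one categorical predictor with k levels
y = rng.normal(size=n)                  # toy response

# overparameterized design: intercept + k indicators (rank deficient)
X_over = np.column_stack([np.ones(n)] + [(groups == j).astype(float) for j in range(k)])

# full-rank design: intercept + k-1 indicators (last level as reference)
X_full = np.column_stack([np.ones(n)] + [(groups == j).astype(float) for j in range(k - 1)])

# lstsq returns a generalized-inverse style solution, so the rank deficiency is tolerated
b_over, *_ = np.linalg.lstsq(X_over, y, rcond=None)
b_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

print(np.allclose(X_over @ b_over, X_full @ b_full))   # True: identical fitted values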
If there are k levels in a categorical predictor, there are k-1 degrees of freedom for comparing those k groups, meaning that once you've made k-1 linearly independent or nonredundant comparisons, any others can be derived from those.
So the short answer is no, you can't do what you're talking about, but you don't need to, because the results for any valid parameterization will allow you to derive those for any other one.

Can I find price floors and ceilings with CUDA?

Background
I'm trying to convert an algorithm from sequential to parallel, but I am stuck.
Point and Figure Charts
I am creating point and figure charts.
Decreasing
While the stock is going down, add an O every time it breaks through the floor.
Increasing
While the stock is going up, add an X every time it breaks through the ceiling.
Reversal
If the stock reverses direction but the change is less than a reversal threshold (3 units), do nothing. If the change is greater than the reversal threshold, start a new column (X or O).
Sequential vs Parallel
Sequentially, this is pretty straightforward. I keep a variable for the floor and the ceiling. If the current price breaks through the floor or ceiling, or changes by more than the reversal threshold, I can take the appropriate action.
My question is: is there a way to find these reversal points in parallel? I'm fairly new to thinking in parallel, so I'm sorry if this is trivial. I am trying to do this in CUDA, but I have been stuck for weeks. I have tried using the finite-difference algorithms from NVIDIA. These produce local maxima/minima but not the reversal points. Small fluctuations produce numerous relative maxima/minima, but most of them are trivial because the change is not greater than the reversal size.
My question is: is there a way to find these reversal points in parallel?
One possible approach (a sketch in code follows the list of steps):
use thrust::unique to remove periods where the price is numerically constant
use thrust::adjacent_difference to produce 1st difference data
use thrust::adjacent_difference on the 1st difference data to get the 2nd difference data, i.e. the points where there is a change in the sign of the slope.
use these points of change in sign of slope to identify separate regions of data - build a key vector from these (e.g. with a prefix sum). This key vector segments the price data into "runs" where the price change is in a particular direction.
use thrust::exclusive_scan_by_key on the 1st difference data, to produce the net change of the run
Wherever the net change of a run exceeds the reversal threshold, flag it as a "reversal".
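Here is a rough NumPy sketch of that pipeline on toy data, just to make the bookkeeping concrete; each step maps onto the corresponding Thrust primitive named above (the price series and threshold are made up):

import numpy as np

prices = np.array([10, 10, 11, 13, 12, 12, 9, 8, 11, 14], dtype=float)
reversal = 3.0                                   # hypothetical reversal threshold

# 1. drop consecutive duplicates (cf. thrust::unique)
keep = np.concatenate(([True], prices[1:] != prices[:-1]))
p = prices[keep]

# 2. first differences (cf. thrust::adjacent_difference)
d1 = np.diff(p)

# 3. sign changes of the slope mark the boundaries between runs
boundary = np.concatenate(([False], np.sign(d1[1:]) != np.sign(d1[:-1])))

# 4. prefix sum over the boundary flags builds the key vector (one key per run)
keys = np.cumsum(boundary)

# 5. net change of each run (cf. a segmented scan / scan_by_key over d1)
net = np.array([d1[keys == k].sum() for k in np.unique(keys)])

# 6. runs whose net change exceeds the threshold are flagged as reversals
print(net, np.abs(net) >= reversal)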
Your description of what constitutes a reversal may also be slightly unclear. The above method would not flag a reversal on certain data patterns that you might classify as a reversal. I suspect you are looking beyond a single run as I have defined it here. If that is the case, there may be a method to address that as well - with more steps.

For regression, how many kinds of loss functions are there? What are they, and when should we use them?

I'm trying to find a good loss function for a regression problem and then use it in TensorFlow.
The TensorFlow API provides a lot of loss functions, but most of them are designed for classification problems.
For a regression problem, are there any other good loss functions besides MSE?
I use the following list of fitting targets in my code:
lowest sum of squared absolute error
lowest sum of squared relative error
lowest sum of squared orthogonal distance
lowest sum of absolute value of absolute error
lowest sum of squared log[abs(predicted/actual)]
lowest sum of absolute value of relative error
lowest peak absolute value of absolute error
lowest peak absolute value of relative error
lowest Akaike Information Criterion
lowest Bayesian Information Criterion
Usually the standard lowest sum of squared absolute error is sufficient for my needs, though I have occasionally used the lowest sum of squared relative error. I have only used the other fitting targets rarely or as needed to reduce the fitting effect of outliers in the data.
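For concreteness, a few of the targets above written as plain NumPy objective functions that could be handed to a generic minimizer (the function names are mine, not from any library):

import numpy as np

def sse_absolute(y_true, y_pred):
    # lowest sum of squared absolute error (the ordinary least-squares target)
    return np.sum((y_pred - y_true) ** 2)

def sse_relative(y_true, y_pred):
    # lowest sum of squared relative error (assumes y_true contains no zeros)
    return np.sum(((y_pred - y_true) / y_true) ** 2)

def sae_absolute(y_true, y_pred):
    # lowest sum of absolute value of absolute error (an L1-style target)
    return np.sum(np.abs(y_pred - y_true))

def peak_absolute(y_true, y_pred):
    # lowest peak absolute value of absolute error (a minimax-style target)
    return np.max(np.abs(y_pred - y_true))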
See the more concise and readily available listing of sub-linear loss functions discussed in the scipy documentation for robust non-linear regression.
The soft_l1 loss is often effective at capturing the underlying signal while rejecting outliers, which is especially important if you have high-leverage outliers in a sparse data set.
https://scipy-cookbook.readthedocs.io/items/robust_regression.html
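A minimal sketch of that scipy approach, fitting a straight line with scipy.optimize.least_squares and comparing the default squared loss against soft_l1 on data containing a few made-up gross outliers:

import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.5 * x + 1.0 + rng.normal(0, 0.5, x.size)
y[::10] += 15.0                              # inject a few gross outliers

def residuals(params, x, y):
    a, b = params
    return a * x + b - y

# plain least squares: the outliers pull the fit
fit_l2 = least_squares(residuals, x0=[1.0, 0.0], args=(x, y))

# soft_l1 grows sub-linearly for large residuals, so the outliers are down-weighted
fit_soft = least_squares(residuals, x0=[1.0, 0.0], args=(x, y),
                         loss='soft_l1', f_scale=1.0)

print("squared loss fit:", fit_l2.x)
print("soft_l1 fit:     ", fit_soft.x)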

Optimal DB query for prefix search

I have a dataset which is a list of prefix ranges, and the prefixes aren't all the same size. Here are a few examples:
low: 54661601 high: 54661679 "bin": a
low: 526219100 high: 526219199 "bin": b
low: 4305870404 high: 4305870404 "bin": c
I want to look up which "bin" corresponds to a particular value with the corresponding prefix. For example, value 5466160179125211 would correspond to "bin" a. In the case of overlaps (of which there are few), we could return either the longest prefix or all prefixes.
The optimal algorithm is clearly some sort of tree into which the bin objects could be inserted, where each successive level of the tree represents more and more of the prefix.
The question is: how do we implement this (in one query) in a database? It is permissible to alter/add to the data set. What would be the best data & query design for this? An answer using mongo or MySQL would be best.
If you make a mild assumption about the number of overlaps in your prefix ranges, it is possible to do what you want optimally using either MongoDB or MySQL. In my answer below, I'll illustrate with MongoDB, but it should be easy enough to port this answer to MySQL.
First, let's rephrase the problem a bit. When you talk about matching a "prefix range", I believe what you're actually talking about is finding the correct range under a lexicographic ordering (intuitively, this is just the natural alphabetic ordering of strings). For instance, the set of numbers whose prefix matches 54661601 to 54661679 is exactly the set of numbers which, when written as strings, are lexicographically greater than or equal to "54661601", but lexicographically less than "54661680". So the first thing you should do is add 1 to all your high bounds, so that you can express your queries this way. In mongo, your documents would look something like
{low: "54661601", high: "54661680", bin: "a"}
{low: "526219100", high: "526219200", bin: "b"}
{low: "4305870404", high: "4305870405", bin: "c"}
Now the problem becomes: given a set of one-dimensional intervals of the form [low, high), how can we quickly find which interval(s) contain a given point? The easiest way to do this is with an index on either the low or high field. Let's use the high field. In the mongo shell:
db.coll.ensureIndex({high : 1})
For now, let's assume that the intervals don't overlap at all. If this is the case, then for a given query point "x", the only possible interval containing "x" is the one with the smallest high value greater than "x". So we can query for that document and check if its low value is also less than "x". For instance, this will print out the matching interval, if there is one:
db.coll.find({high : {'$gt' : "5466160179125211"}}).sort({high : 1}).limit(1).forEach(
function(doc){ if (doc.low <= "5466160179125211") printjson(doc) }
)
Suppose now that instead of assuming the intervals don't overlap at all, you assume that every interval overlaps with less than k neighboring intervals (I don't know what value of k would make this true for you, but hopefully it's a small one). In that case, you can just replace 1 with k in the "limit" above, i.e.
db.coll.find({high : {'$gt' : "5466160179125211"}}).sort({high : 1}).limit(k).forEach(
function(doc){ if (doc.low <= "5466160179125211") printjson(doc) }
)
What's the running time of this algorithm? The indexes are stored using B-trees, so if there are n intervals in your data set, it takes O(log n) time to lookup the first matching document by high value, then O(k) time to iterate over the next k documents, for a total of O(log n + k) time. If k is constant, or in fact anything less than O(log n), then this is asymptotically optimal (this is in the standard model of computation; I'm not counting number of external memory transfers or anything fancy).
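If it helps to see the logic outside the database, here is the same lookup as an in-memory Python sketch using bisect on the sorted high values; the binary search is the O(log n) part and the short scan over at most k candidates is the O(k) part (the data is the three example ranges from the question, already converted to exclusive upper bounds):

import bisect

# (high, low, bin) tuples, kept sorted by the exclusive upper bound `high`
intervals = sorted([
    ("54661680",   "54661601",   "a"),
    ("526219200",  "526219100",  "b"),
    ("4305870405", "4305870404", "c"),
])
highs = [h for h, _, _ in intervals]

def lookup(x, k=1):
    """Return the bins of the (at most k) candidate intervals containing x."""
    i = bisect.bisect_right(highs, x)                 # first interval with high > x
    return [b for h, low, b in intervals[i:i + k] if low <= x]

print(lookup("5466160179125211"))                     # ['a']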
The only case where this breaks down is when k is large, for instance if some large interval contains nearly all the other intervals. In this case, the running time is O(n). If your data is structured like this, then you'll probably want to use a different method. One approach is to use mongo's "2d" indexing, with your low and high values encoding x and y coordinates. Then your queries would correspond to querying for points in a given region of the x-y plane. This might do well in practice, although with the current implementation of 2d indexing, the worst case is still O(n).
There are a number of theoretical results that achieve O(log n) performance for all values of k. They go by names such as Priority Search Trees, Segment trees, Interval Trees, etc. However, these are special-purpose data structures that you would have to implement yourself. As far as I know, no popular database currently implements them.
"Optimal" can mean different things to different people. It seems that you could do something like save your low and high values as varchars. Then all you have to do is
select bin from datatable where '5466160179125211' between low and high
Or if you had some reason to keep the values as integers in the table, you could do the CASTing in the query.
I have no idea whether this would give you terrible performance with a large dataset. And I hope I understand what you want to do.
With MySQL you may have to use a stored procedure, which you call to map a value to its bin. The procedure would query the list of buckets for each row and do arithmetic or string operations to find the matching bucket. You could improve this design by using fixed-length prefixes arranged in a fixed number of layers: assign a fixed depth to your tree and give each layer its own table. You won't get tree-like performance with either of these approaches.
If you want to do something more sophisticated, I suspect you have to use a different platform.
SQL Server has a hierarchy data type (hierarchyid):
http://technet.microsoft.com/en-us/library/bb677173.aspx
PostgreSQL has a cidr data type. I'm not familiar with the level of query support it has, but in theory you could build a routing table inside of your db and use that to assign buckets:
http://www.postgresql.org/docs/7.4/static/datatype-net-types.html#DATATYPE-CIDR
Peyton! :)
If you need to keep everything as integers, and want it to work with a single query, this should work:
select bin from datatable where 5466160179125211 between
low*pow(10, floor(log10(5466160179125211))-floor(log10(low)))
and ((high+1)*pow(10, floor(log10(5466160179125211))-floor(log10(high)))-1);
In this case, it would search between the numbers 5466160100000000 (the lowest number with the low prefix and the same number of digits as the number to find) and 5466167999999999 (the highest number with the high prefix and the same number of digits as the number to find). This should still work in cases where the high prefix has more digits than the low prefix. It should also work (I think) in cases where the number is shorter than the length of the prefixes, where the varchar code in the previous solution can give incorrect results.
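As a quick sanity check of that arithmetic, here is the same bound computation mirrored in Python (purely illustrative; the values are the ones from the example above):

import math

x, low, high = 5466160179125211, 54661601, 54661679
shift_low  = 10 ** (int(math.log10(x)) - int(math.log10(low)))
shift_high = 10 ** (int(math.log10(x)) - int(math.log10(high)))
lo_bound = low * shift_low                 # 5466160100000000
hi_bound = (high + 1) * shift_high - 1     # 5466167999999999
print(lo_bound <= x <= hi_bound)           # True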
You'll want to experiment to compare the performance of having a lot of inline math in the query (as in this solution) vs. the performance of using varchars.
Edit: Performance seems to be really good either way, even on big tables with no indexes; if you can use varchars, then you might be able to further boost performance by indexing the low and high columns. Note that you'd definitely want to use varchars if any of the prefixes have leading zeroes. Here's a fix to allow for the case where the number is shorter than the prefix when using varchars:
select * from datatable2 where '5466' between low and high
and length('5466') >= length(high);

Randomly sorting an array

Does there exist an algorithm which, given an ordered list of symbols {a1, a2, a3, ..., ak}, produces in O(n) time a new list of the same symbols in a random order without bias? "Without bias" means the probability that any symbol s will end up in some position p in the list is 1/k.
Assume it is possible to generate a non-biased integer from 1-k inclusive in O(1) time. Also assume that O(1) element access/mutation is possible, and that it is possible to create a new list of size k in O(k) time.
In particular, I would be interested in a 'generative' algorithm. That is, I would be interested in an algorithm that has O(1) initial overhead, and then produces a new element for each slot in the list, taking O(1) time per slot.
If no solution exists to the problem as described, I would still like to know about solutions that do not meet my constraints in one or more of the following ways (and/or in other ways if necessary):
the time complexity is worse than O(n).
the algorithm is biased with regards to the final positions of the symbols.
the algorithm is not generative.
I should add that this problem appears to be the same as the problem of randomly sorting the integers from 1-k, since we can sort the list of integers from 1-k and then for each integer i in the new list, we can produce the symbol ai.
Yes - the Knuth Shuffle.
The Fisher-Yates Shuffle (Knuth Shuffle) is what you are looking for.
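For reference, a minimal Python sketch of the Fisher-Yates (Knuth) shuffle; it runs in O(k) time given O(1) unbiased random integers, and each symbol lands in each position with probability 1/k:

import random

def fisher_yates_shuffle(items):
    """Return a new list with the items in uniformly random order."""
    a = list(items)                      # O(k) copy, as allowed by the question
    for i in range(len(a) - 1, 0, -1):
        j = random.randint(0, i)         # unbiased index in [0, i], assumed O(1)
        a[i], a[j] = a[j], a[i]          # swap the chosen element into place
    return a

print(fisher_yates_shuffle(['a1', 'a2', 'a3', 'a4', 'a5']))

If the strictly generative, one-new-element-per-slot behaviour matters, the "inside-out" variant of the same algorithm builds the output list incrementally with O(1) work per slot.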