How to calculate the output of a regression decision tree

I'm trying to understand what happens after all possible splits in a regression tree have been made: you end up with terminal nodes, and then you need to calculate an output for each one to obtain the regression estimate. What is the rule in this case? Is it to average the observations in the terminal node, or to use some weighted average if there are multiple terminal nodes?

Yes, that is correct. You simply take the average of the target values in the node. And why does that make sense?
A decision tree divides your feature space into rectangular (hyper-)boxes. Say your box (terminal node) contains the values [2, 3, 4] from the training set. When we get a new data point, we traverse the tree and end up in that box; which value should we assign to the new point? We get the smallest squared error, on average, if we assign the mean of the values [2, 3, 4], i.e. 3, to the new data.
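A tiny sketch of that claim in Python (the leaf values [2, 3, 4] come from the example above; the candidate grid is only for illustration): the single constant that minimizes the summed squared error over a leaf is the mean of the training targets in it.

    import numpy as np

    leaf_values = np.array([2.0, 3.0, 4.0])   # training targets in one terminal node

    # try many constant predictions and measure the summed squared error of each
    candidates = np.linspace(1.5, 4.5, 301)
    sse = [((leaf_values - c) ** 2).sum() for c in candidates]
    best = candidates[int(np.argmin(sse))]

    print(best, leaf_values.mean())           # both ~3.0: the mean minimizes squared error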

Related

If I have a multi-modal regression model (outputs: a,b,c,d) based on some inputs (x,y,z), can I provide a prior (b) to predict one or more of the outputs?

If the inputs to my model are x,y,z and my outputs are the continuous variables a,b,c,d, I can obviously use the model to predict the vector [a,b,c,d] from [x,y,z].
However, what happens if I find myself in a situation where I already know, say, the value of b before inference? Can I run the network in a manner such that I am predicting [a,c,d] based on [x,y,z,b]?
Real-world example: I have an image with pixel locations (x,y) and pixel values (R,G,B). I build a neural implicit model to predict pixel values from pixel locations. Now say I have the green values for some pixels as well as their locations: can I use these green values as a prior with my original network to get an improved result? Note that I am not interested in training a new network on said data.
In mathematical terms: I have network f(x,y,z) -> (a,b,c,d) how can I perform f(x,y,z|b) -> (a,c,d)?
I have not tried much here; I'm thinking of maybe passing the prior value back through the network, but I'm somewhat lost.

How to decide tree depth of LGBM for high dimensional data?

I am using LightGBM for regression problems in my project, and the input data has 800 numeric variables, i.e. a high-dimensional and sparse dataset.
I want to use as many variables as possible in each iteration. In this case, should I remove the limit on max_depth?
I set max_depth=2 to overcome an overfitting issue, but it seems that only 1~3 variables are used in each iteration, and those variables are used repeatedly.
I also want to know how tree depth affects the learning result of a regression tree.
Detailed info on the present model:
Number of input variables=800 (numeric)
target variables=1 (numeric)
objective=regression
max_depth=2
num_leaves=3
num_iterations=2000
learning_rate=0.01
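For reference, a minimal sketch of that configuration with the LightGBM Python API (X_train and y_train are placeholders for the poster's data; only the parameters listed above are set):

    import lightgbm as lgb

    params = {
        "objective": "regression",
        "max_depth": 2,        # very shallow trees: each tree can only use a couple of features
        "num_leaves": 3,       # must stay <= 2**max_depth
        "learning_rate": 0.01,
    }

    train_set = lgb.Dataset(X_train, label=y_train)              # X_train/y_train are placeholders
    model = lgb.train(params, train_set, num_boost_round=2000)   # num_iterations=2000

For what it's worth, "unlimiting" max_depth in LightGBM means setting it to -1, which is also its default.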

Number ranges redistribution based on query results

I need to find a way to redistribute the number ranges when parallel exporting from MySQL.
Example output (SQL query results):
What is the best way to redistribute the number ranges after getting the initial results, so that the results will be more evenly distributed?
(Estimated) desired output:
It seems that originally you believed your data to be uniformly distributed, and now you have a view of the number of entries in each evenly spaced bin. You can update your belief about the distribution of your data: within every bin the data is still assumed uniform, but bins with a higher count have a larger concentration.
Your updated distribution says that the number of results below a value q equals the sum of the counts of all buckets whose max bound is below q, plus
(q - min(q)) / (max(q) - min(q)) * size(q)
where min(q) and max(q) are the min and max bounds of the bucket that q falls in, and size(q) is the number of results in that bucket. This is a piecewise linear function whose slope within bucket i is proportional to that bucket's size. Divide by the total number of results to get a probability distribution (an estimated CDF). To get the places where you should query, find the ten values of x where this piecewise function equals .1, .2, .3, ..., 1.0. This is a lot easier than inverting an arbitrary function if you exploit the piecewise linear property: for example, if you are trying to find the x associated with .2, first find the bucket i such that the cumulative fraction of results in the buckets before i is <= .2 and the cumulative fraction through bucket i is >= .2; that bucket gives you min(.2), max(.2) and size(.2).
Then you just have to invert the linear function associated with that bucket:
x = (.2 - sum_of_buckets_before_i/total) * (max(.2) - min(.2)) / (size(.2)/total) + min(.2)
Note that size(q) should not be 0, since you are dividing by it; this makes sense (if a bucket is empty, you should skip it). Finally, if you want the places you are querying to be integers, you can round the values using your preferred criteria. What we are doing here is updating, in a Bayesian way, our belief of where the 10 deciles are, based on the observed distribution at the 10 places you already queried. You can refine this further once you see the result of this query, and you will eventually reach convergence.
For the example in your table, to find the updated upper limit of bucket 1, you check that
2569/264118 < 0.1 (the first ten percent),
then you check that
(2569+14023)/264118 < 0.1,
and finally you check that
(2569+14023+123762)/264118 > 0.1,
so your new estimate for the first decile should be between 1014640 and 1141470.
Your new estimate for the upper threshold of the first bucket is
decile_1 = (.1 - (2569+14023)/264118) * (1141470-1014640) / (123762/264118) + 1014640 = 1024703
Similarly, your estimate for the upper bound of the second bucket is
decile_2 = (.2 - (2569+14023)/264118) * (1141470-1014640) / (123762/264118) + 1014640 = 1051770
Note that this linear interpolation only works until you update the upper limit of bucket 6, since (2569+14023+123762)/264118 < .6; from there you will need to use the limits of the old bucket ten when updating buckets 6 and higher.
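A small sketch of that interpolation in Python. The first two bucket bounds and the short bucket list are made up for illustration (the question's full table isn't shown); the third bucket's bounds (1014640-1141470), the counts 2569, 14023, 123762 and the total 264118 come from the worked example above.

    # Estimate quantile boundaries by inverting the piecewise-linear CDF
    # built from bucket bounds and counts, as described above.
    def quantile_estimates(bounds, counts, total, quantiles):
        cum = 0                                  # results in all buckets before the current one
        i = 0
        estimates = []
        for q in sorted(quantiles):
            # advance to the bucket whose cumulative fraction straddles q
            while i < len(counts) - 1 and (cum + counts[i]) / total < q:
                cum += counts[i]
                i += 1
            lo, hi = bounds[i]
            rel_size = counts[i] / total         # slope of the piecewise-linear CDF here
            estimates.append((q - cum / total) * (hi - lo) / rel_size + lo)
        return estimates

    # First two bucket bounds are hypothetical; the third matches the example above.
    bounds = [(888000, 951320), (951320, 1014640), (1014640, 1141470)]
    counts = [2569, 14023, 123762]
    print(quantile_estimates(bounds, counts, 264118, [0.1, 0.2]))
    # -> approximately [1024703, 1051770], matching decile_1 and decile_2 above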

Making sense of soundMixer.computeSpectrum

All examples that I can find on the Internet just visualize the result array of the function computeSpectrum, but I am tasked with something else.
I generate a musical note, and by analyzing the result array I need to be able to say what note is playing. I figured out that I need to set the second parameter of the call, 'FFTMode', to true, so that it returns sound frequencies. I thought it should then return only one non-zero value, which I could use to determine what note I generated with the Math.sin function, but that is not the case.
Can somebody suggest a way to accomplish this task? Using soundMixer.computeSpectrum is a requirement because I am going to analyze more complex sounds later.
The FFT transforms your signal window into a set of sine waves at fixed bin frequencies (up to the Nyquist limit), so unless 440 Hz is exactly one of them you will obtain more than just one non-zero value! For a single sine wave you would obtain 2 frequencies due to aliasing. Here is an example:
As you can see, for a frequency that falls exactly on an FFT bin the response is a single peak, but for nearby frequencies there are more peaks.
Due to the shape of the signal you can obtain a continuous spectrum with peaks instead of discrete values.
The frequency of the i-th bin is f(i) = i*samplerate/N, where i = {0, 1, 2, 3, 4, ..., (N/2)-1} is the bin index (the first one, i = 0, is the DC offset rather than a frequency) and N is the count of samples passed to the FFT.
So if you want to detect harmonics (multiples of a single fundamental frequency), set samplerate and N so that samplerate/N equals that fundamental frequency or a divisor of it. That way you would obtain just one peak per harmonic sine wave, easing up the computations.
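As a quick numerical illustration of the f(i) = i*samplerate/N relation (in NumPy rather than ActionScript; the 440 Hz tone and the chosen N are just an example):

    import numpy as np

    samplerate = 44100
    N = 4410                                      # samplerate / N = 10 Hz per bin
    t = np.arange(N) / samplerate
    signal = np.sin(2 * np.pi * 440.0 * t)        # a pure 440 Hz tone

    spectrum = np.abs(np.fft.rfft(signal))        # magnitudes for bins 0 .. N/2
    peak_bin = int(np.argmax(spectrum[1:])) + 1   # skip the DC bin
    print(peak_bin * samplerate / N)              # -> 440.0, since 440 is a multiple of 10 Hz

Because samplerate/N divides 440 here, all the energy lands in a single bin; choose an N for which it does not and the energy smears over neighbouring bins, which is the multi-peak behaviour described above.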

Cumulative Distribution Function For a Set of Values

I have a histogram in which I count the number of occurrences of a function taking particular values in the range 0.8 to 2.2.
I would like to get the cumulative distribution function for this set of values. Is it correct to just count the total number of occurrences up to each particular value?
For example, would the cdf at 0.9 be the sum of all the occurrences from 0.8 to 0.9?
Is it correct?
Thank you
The sum, normalised by the number of entries, will give you an estimate of the cdf, yes. It will be only as accurate as the histogram is as a representation of the pdf. If you want to evaluate the cdf anywhere other than the bin endpoints, it makes sense to include a fraction of the counts: if you have adjacent break points b_i and b_j, then to evaluate the cdf at some point b_i < p < b_j you should add the fraction (p - b_i) / (b_j - b_i) of the counts in that cell. Essentially this assumes uniform density within the cells.
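A small sketch of that interpolation (the 0.8-2.2 range comes from the question; the 14 equal-width bins and their counts are made up for illustration):

    import numpy as np

    def cdf_from_histogram(bin_edges, counts, p):
        """Estimate the CDF at p, assuming uniform density within each bin."""
        counts = np.asarray(counts, dtype=float)
        cum = np.concatenate(([0.0], np.cumsum(counts)))      # counts below each edge
        i = np.searchsorted(bin_edges, p, side="right") - 1
        if i < 0:
            return 0.0                                        # below the observed range
        if i >= len(counts):
            return 1.0                                        # above the observed range
        frac = (p - bin_edges[i]) / (bin_edges[i + 1] - bin_edges[i])
        return (cum[i] + frac * counts[i]) / counts.sum()

    edges = np.linspace(0.8, 2.2, 15)                            # 14 equal-width bins (illustrative)
    counts = [3, 5, 9, 14, 20, 24, 25, 22, 18, 13, 9, 6, 4, 2]   # illustrative counts
    print(cdf_from_histogram(edges, counts, 0.9))                # ~ first bin's count / total, as in the 0.8-0.9 example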
You can get an estimate of the cdf from the underlying values, too (based on your question I'm not quite sure what you have access to, whether it's the bin counts of the histogram or the actual values). Beware that doing so will give your CDF discontinuities (steps) at each data point, so think about whether you have enough data, and what you're using the CDF for, to decide whether this is appropriate.
As a final note of warning, beware that evaluating the cdf outside of the range of observed values will give you an estimated probability of zero or one (zero for x<0.8, one for x>2.2). You should consider whether the function is truly bounded to that interval, and if not, employ some smoothing to ensure small amounts of probability mass outside the range of observed values.