Info messages during CatBoost training - catboost

I am new to CatBoost and I am running CatBoostClassifier training with logging_level = "Info". My data consists of both categorical and numerical variables.
Firstly, for one of the categorical variables, I get the following message in the printed info: feature 21 is redundant categorical feature, skipping it. How is the redundancy of this feature determined?
Furthermore, I'm a bit lost as to what all the info for the iterations stands for. Here is the info output for one single iteration of my training:
{Feature1} pr0 tb1 type0, border=10 score 2.001737609
Feature2, bin=40 score 2.867480488
{Feature3, Feature2 b40} pr2 tb2 type0, border=6 score 3.533462883
Feature4, bin=5 score 4.105045044
46: learn: -1.2759319 total: 13.2s remaining: 843ms
In this case, Feature1 and Feature3 are categorical, while Feature2 and Feature4 are numerical.
What do values like pr0, tb1, type0, and score stand for? Any pointer to documentation would be much appreciated.

Related

What is the held-out probability in Mallet LDA? How can we calculate Perplexity by the held-out probability?

I am new to Mallet. I would like to get the perplexity scores for 10-100 topics in my LDA model, so I ran the held-out probability evaluation. It gives me a value of -8926490.73103205 for topic=100, which seems a little bit off. Is that the perplexity score? If not, how can we calculate the perplexity score from the held-out probability output?
For topic=10, the held-out probability is -8968935.68290883.
The value you're getting is the log probability of the entire held-out document set, that is, the sum of the log probabilities of each word token. Individual word tokens usually have a log prob of around -7, so I'm guessing your held-out set is around 1M tokens. A log prob of -7 is roughly equivalent to a 1 in 1000 chance. When developing Mallet we usually just focused on log probability directly, so you should check the formal definition of perplexity used in the work that you want to compare to.
A typical thing to do with the log probability of a collection is to divide it by the number of tokens to get an average log prob per token. Negating this number and exponentiating will give you a positive score representing the "1 in X" that I mentioned above.
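The arithmetic described above (divide by the token count, negate, exponentiate) can be sketched in a few lines of Python. This is only an illustration: the function name is made up, the log probability is assumed to be a natural log, and the token count of 1,000,000 is the rough guess from the answer, not a number reported by Mallet.

```python
import math


def per_token_perplexity(total_log_prob: float, num_tokens: int) -> float:
    """Turn a summed natural-log probability into a per-token "1 in X" score."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    avg_log_prob = total_log_prob / num_tokens  # average log prob per token
    return math.exp(-avg_log_prob)              # negate, then exponentiate


# Held-out log probability reported in the question for 100 topics.
held_out_log_prob = -8926490.73103205
# ASSUMPTION: the real token count is not given; 1,000,000 is only a rough guess.
assumed_num_tokens = 1_000_000

print(per_token_perplexity(held_out_log_prob, assumed_num_tokens))
```

Replace assumed_num_tokens with the actual number of tokens in your held-out set; the result is only comparable across topic counts if that set stays the same.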

Create regression tables with estout/esttab for interactions in Stata

In one of my models I use the standard built-in notation for interaction terms in Stata; in another, I have to code the interaction manually. In the end, I would like to present nice regression tables using esttab. How can I show identical, but slightly differently coded, interaction terms in the same row? Or, if it were actually a different interaction, how could I force esttab to ignore that?
// interaction model 1
sysuse auto, clear
regress price weight c.mpg##c.mpg foreign
estimates store model1
// interaction model 2
gen int_mpg_mpg = mpg*mpg
regress price weight mpg int_mpg_mpg foreign
estimates store model2
// make nice regression table side-by-side
// manually label the manual interaction
label variable int_mpg_mpg "Mileage (mpg) # Mileage (mpg)"
esttab model1 model2, label
// export to latex
label variable int_mpg_mpg "Mileage (mpg) $\times$ Mileage (mpg) "
esttab model1 model2 using "table.tex", ///
label nobaselevels beta not interaction(" $\times$ ") style(tex) replace
Output to console and output to LaTeX: (screenshots not included)
In both cases the manual variable label shows up as a name in the regression tables, but the identical variable names are not aligned in the same row. I am more interested in a solution for the LaTeX output, but the problem seems to be unrelated to LaTeX.
esttab won't be able to ignore something like that, as the variables are unique in how they're specified. I would recommend specifying all your interaction terms in one way that works across both specifications, as in interaction model 2.
For multiple different interaction terms, you can rename the interaction terms themselves before the regressions. For example, to estimate heterogeneous treatment effects by different covariates, you could run:
foreach var of varlist age education {
    cap drop interaction
    gen interaction = `var'
    reg outcome i.treatment##c.interaction
    est store `var'
}
In an esttab or estout there will be one row for the interaction effect, and one row for the main effect. This is a bit of a crude workaround, but normally does the job.
The issue should be addressed on the level "how Stata names the equations and coefficients across estimators". I adapted the code from Andrew:
https://www.statalist.org/forums/forum/general-stata-discussion/general/1551586-align-nls-and-mle-estimates-for-the-same-variable-in-the-same-row-esttab
He is using Ben Jann's program erepost from SSC (ssc install erepost).
* model 1
sysuse auto, clear
eststo clear
gen const=1
qui regress price weight c.mpg##c.mpg foreign
mat b=e(b)
* store estimates
eststo model1
* model 2
gen int_mpg_mpg = mpg*mpg // generate interaction manually
qui regress price weight mpg int_mpg_mpg foreign
* rename interaction with additional package erepost
local coln "b:weight b:mpg b:c.mpg#c.mpg b:foreign b:_cons"
mat colnames b= `coln'
capt prog drop replace_b
program replace_b, eclass
erepost b= b, rename
end
replace_b
eststo model2
esttab model1 model2, mtitle("Interaction V1" "Interaction V2")
Now, all interactions (automatic and manual) are aligned:
--------------------------------------------
                      (1)             (2)
             Interactio~1    Interactio~2
--------------------------------------------
main
weight              3.038***        3.038***
                   (3.84)          (3.84)
mpg                -298.1          -298.1
                  (-0.82)         (-0.82)
c.mpg#c.mpg         5.862           5.862
                   (0.90)          (0.90)
foreign            3420.2***       3420.2***
                   (4.62)          (4.62)
_cons              -527.0          -527.0
                  (-0.08)         (-0.08)
--------------------------------------------
N                      74              74
--------------------------------------------

How can we define an RNN - LSTM neural network with multiple output for the input at time "t"?

I am trying to construct an RNN to predict the probability of a player playing the match, along with the runs scored and wickets taken by the player. I would use an LSTM so that performance in the current match influences the player's future selection.
Architecture summary:
Input features: Match details - Venue, teams involved, team batting first
Input samples: Player roster of both teams.
Output:
Discrete: Binary: Did the player play.
Discrete: Wickets taken.
Continuous: Runs scored.
Continuous: Balls bowled.
Question:
Most often an RNN uses "Softmax" or "MSE" in the final layer to process "a" from the LSTM, providing only a single variable "Y" as output. But here there are four dependent variables (2 discrete and 2 continuous). Is it possible to stitch together all four as output variables?
If yes, how do we handle mix of continuous and discrete outputs with loss function?
(Though the output from LSTM "a" has multiple features and carries the information to the next time-slot, we need multiple features at output for training based on the ground-truth)
You just do it. Without more detail on the software (if any) in use, it is hard to give more detail.
The output of the LSTM unit at every time step is one of the hidden layers of your network.
You can then feed it into 4 output layers:
1: sigmoid
2: I'd mess around with this a bit. Maybe 4x sigmoid (4 wickets to an innings, right?) or relu4
3, 4: linear (squaring it is also an option, or relu)
For training purposes, your loss function is the sum of your 4 individual losses.
If they were all MSE, you could concatenate your 4 outputs before calculating the loss.
But since the first is cross-entropy (for a decision sigmoid), you would calculate the losses separately and sum them.
You can still concatenate them afterwards to have a single output vector.
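To make the "sum of individual losses" idea concrete, here is a framework-free sketch for a single player at a single time step. It is not tied to any particular library. The dictionary keys and function names are invented for illustration, and treating wickets as a plain regression target with squared error is an assumption, one of several options mentioned above.

```python
import math


def binary_cross_entropy(p: float, y: int) -> float:
    """Cross-entropy for one sigmoid output p against a 0/1 label y."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def squared_error(pred: float, target: float) -> float:
    """Squared error for one linear (or ReLU) output."""
    return (pred - target) ** 2


def combined_loss(pred: dict, target: dict) -> float:
    """Sum the per-head losses: one cross-entropy term plus three squared errors."""
    return (
        binary_cross_entropy(pred["played"], target["played"])
        + squared_error(pred["wickets"], target["wickets"])  # ASSUMPTION: regression head
        + squared_error(pred["runs"], target["runs"])
        + squared_error(pred["balls"], target["balls"])
    )


# Hypothetical outputs of the four heads and the matching ground truth.
prediction = {"played": 0.8, "wickets": 1.5, "runs": 30.0, "balls": 20.0}
truth = {"played": 1, "wickets": 2, "runs": 34, "balls": 24}
print(combined_loss(prediction, truth))
```

In practice the squared-error terms can dominate because runs and balls live on a much larger scale than the cross-entropy term, so weighting the terms or scaling the targets is worth experimenting with.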

How to get trained trees from Catboost?

I used --print-trees --verbose to print trees and get output like this:
441:
(f3, split0) score -0.01684494315
(f1, split0) score 0.00728615875
(f3, split0) score 0.02879532296
learn 0.1080262936passed: 0.00033 sec total: 234ms remaining: 30.7ms
442:
(f0, split0) score 0.02581825636
(f0, split0) score -0.05604439647
learn 0.1080003503passed: 0.000278 sec total: 234ms remaining: 30.1ms
How can I get split values and result class for each tree?
You can convert the model to CoreML format. It is a proto format from which you can get all split values and leaf values.
CoreML format doesn't support statistics on categorical features yet, so it is currently not possible to have a human readable model with these statistics. But we will add it later, there is an issue on GitHub for that: https://github.com/catboost/catboost/issues/23
Check out this one:
https://blog.csdn.net/l_xzmy/article/details/81532281
The idea is to draw trees from the detail info of the exported model:
cat_clf.save_model(fname, format="cbm", export_parameters=None)
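As a rough sketch of reading splits and leaf values from an exported model: save_model also accepts format="json", and the resulting file can be walked with plain Python. The key names used below ("oblivious_trees", "splits", "float_feature_index", "border", "leaf_values") are assumptions about that JSON layout and may differ between CatBoost versions, so inspect your own file first. The function name and the small example dictionary are made up for illustration.

```python
def describe_oblivious_trees(model: dict) -> list:
    """Collect (feature index, border) splits and leaf values for each tree."""
    trees = []
    for tree in model.get("oblivious_trees", []):
        splits = [
            (split.get("float_feature_index"), split.get("border"))
            for split in tree.get("splits", [])
        ]
        trees.append({"splits": splits, "leaf_values": tree.get("leaf_values", [])})
    return trees


# Hypothetical usage with a trained classifier named cat_clf:
#   import json
#   cat_clf.save_model("model.json", format="json")
#   with open("model.json") as fh:
#       trees = describe_oblivious_trees(json.load(fh))

# Hand-written stand-in for an exported file (ASSUMED layout, not real output).
example_model = {
    "oblivious_trees": [
        {
            "splits": [{"float_feature_index": 3, "border": 0.5}],
            "leaf_values": [-0.1, 0.2],
        }
    ]
}
print(describe_oblivious_trees(example_model))
```

A tree with n splits has 2^n leaf values, one for each combination of split outcomes.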

Dynamic Topic model output - Blei format

I am working with the Dynamic Topic Models package that was developed by Blei. I am new to LDA however I understand it.
I would like to know what the output file named
lda-seq/topic-000-var-obs.dat stores.
I know that lda-seq/topic-001-var-e-log-prob.dat stores the log of the variational posterior and by applying the exponential over it, I get the probability of the word within Topic 001.
Thanks
topic-000-var-e-log-prob.dat stores the log of the variational posterior for topic 1.
topic-001-var-e-log-prob.dat stores the log of the variational posterior for topic 2.
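A small Python sketch of turning such a file into word probabilities by exponentiating, as described in the question. It assumes the file is a flat list of numbers with one row per term and one column per time slice (the same layout the R snippet further down uses for the var-obs file); check that against your own run. The function name is made up.

```python
import math


def topic_word_probs(flat_log_probs, num_times):
    """Reshape a flat list into rows of num_times values and exponentiate them.

    ASSUMPTION: values are stored term by term, with num_times entries per term.
    """
    if num_times <= 0 or len(flat_log_probs) % num_times != 0:
        raise ValueError("length of the data must be a multiple of num_times")
    rows = [
        flat_log_probs[i:i + num_times]
        for i in range(0, len(flat_log_probs), num_times)
    ]
    return [[math.exp(value) for value in row] for row in rows]


# Hypothetical usage on a real file with 10 time slices:
#   with open("lda-seq/topic-000-var-e-log-prob.dat") as fh:
#       values = [float(token) for token in fh.read().split()]
#   probs = topic_word_probs(values, num_times=10)

# Toy data: 2 terms, 2 time slices.
toy = [math.log(0.25), math.log(0.5), math.log(0.75), math.log(0.5)]
print(topic_word_probs(toy, num_times=2))
```

If the layout assumption holds, each column (one time slice) should sum to roughly 1 across all terms, which is a quick sanity check.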
I have failed to find a concrete answer anywhere. However, the documentation's sample.sh states
The code creates at least the following files:
- topic-???-var-e-log-prob.dat: the e-betas (word distributions) for topic ??? for all times.
...
- gam.dat
without mentioning the topic-000-var-obs.dat file, which suggests that it is not imperative for most analyses.
Speculation
obs suggests observations. After digging around a little in the example/model_run results, I plotted the sum across epochs for each word/token using:
temp = scan("dtm/example/model_run/lda-seq/topic-000-var-obs.dat")
temp.matrix = matrix(temp, ncol = 10, byrow = TRUE)
plot(rowSums(temp.matrix))
and the result is something like:
The general trend of the non-negative values is decreasing, and many values are floored (in this case to -11.00972 = log(1.67e-05)), suggesting that these values are weightings or some other measure of influence on the model. The model removes some tokens, and the influence/importance of the others tapers off over the index. The latter trend may be caused by preprocessing, such as sorting tokens by tf-idf when creating the dictionary.
Interestingly, the row sum values vary for both the floored tokens and the set with more positive values:
temp = scan("~/Documents/Python/inference/project/dtm/example/model_run/lda-seq/topic-009-var-obs.dat")
temp.matrix = matrix(temp, ncol = 10, byrow = TRUE)
plot(rowSums(temp.matrix))