How to get trained trees from Catboost? - catboost

I used --print-trees --verbose to print trees and get output like this:
441:
(f3, split0) score -0.01684494315
(f1, split0) score 0.00728615875
(f3, split0) score 0.02879532296
learn 0.1080262936passed: 0.00033 sec total: 234ms remaining: 30.7ms
442:
(f0, split0) score 0.02581825636
(f0, split0) score -0.05604439647
learn 0.1080003503passed: 0.000278 sec total: 234ms remaining: 30.1ms
How can I get split values and result class for each tree?

You can convert the model to CoreML format. It is a proto format from which you can read all split values and leaf values.
CoreML format doesn't support statistics on categorical features yet, so it is currently not possible to have a human readable model with these statistics. But we will add it later, there is an issue on GitHub for that: https://github.com/catboost/catboost/issues/23

Check out this one:
https://blog.csdn.net/l_xzmy/article/details/81532281
The idea is to draw trees from the detail info of the exported model:
cat_clf.save_model(fname, format="cbm", export_parameters=None)
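Besides "cbm", save_model also accepts format="json", which is easier to inspect by hand. Below is a minimal sketch of reading splits and leaf values from such a file. The key names ("oblivious_trees", "splits", "float_feature_index", "border", "leaf_values") are my recollection of the JSON layout, so treat them as assumptions and check them against a file exported by your CatBoost version. The helper name describe_trees is hypothetical.

```python
import json


def describe_trees(model_json):
    """Collect (feature index, border) pairs and leaf values for every tree.

    Assumes the exported JSON keeps its trees under "oblivious_trees";
    missing keys are tolerated so the function degrades gracefully.
    """
    summary = []
    for tree in model_json.get("oblivious_trees", []):
        splits = [
            (split.get("float_feature_index"), split.get("border"))
            for split in tree.get("splits", [])
        ]
        summary.append({"splits": splits, "leaf_values": tree.get("leaf_values", [])})
    return summary


# Usage with a trained model (requires catboost, not executed here):
#   cat_clf.save_model("model.json", format="json")
#   with open("model.json") as fh:
#       for index, tree in enumerate(describe_trees(json.load(fh))):
#           print(index, tree["splits"], tree["leaf_values"])
```

Each entry then gives the split thresholds and the leaf values of one tree, which is the information the printed "score" lines do not show.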

Related

Named Entity Recognition confidence

I need to get confidence about each extracted entity (not to print it but to get it), however, I can't find a method that returns confidences.
Firstly, I have tried using Stanford Named Entity Recognizer library on Java and this solution:
Display Stanford NER confidence score
but it doesn't work (I guess getCliqueTree method is not available). I also have tried using NLTK in Python and Stanford NER model to extract entities, but again couldn't find a way to get confidences.
I know how to do it on Spacy:
https://github.com/explosion/spaCy/issues/831
but as the author says it's inefficient.
So, can you please advise me, how to get the probabilities of each extracted entity?
Usually NER is a token level classification task.
Confidences are usually derived from each prediction, which is commonly the output of some type of softmax.
The issue then becomes: how can I get a single confidence for a sequence of confidences?
There are multiple ways:
Entropy [Confidence is amount of information]
Average (Mean) [Confidence is the average]
Min/Max of confidences [Confidence is the min/max]
All of these give different answers, none are "better" and it really depends on your use case.
If you would like to order possible entity types, you can start with the following:
Get confidences assuming same label for each token
Get entropy for confidence (probability) sequence
Sort by entropy
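The aggregation options and the ranking steps above can be sketched as follows. This is only an illustration: the per-token distributions are made-up inputs, the function names are hypothetical, and I read "entropy" here as the average surprisal (-log p) of the assumed label, which is one possible interpretation rather than the only one.

```python
import math


def aggregate(probs, eps=1e-12):
    """Summarise per-token probabilities of one label over an entity span."""
    surprisal = [-math.log(max(p, eps)) for p in probs]
    return {
        "mean": sum(probs) / len(probs),
        "min": min(probs),
        "max": max(probs),
        # Average information content; lower means more confident.
        "surprisal": sum(surprisal) / len(surprisal),
    }


def rank_labels(token_distributions, labels):
    """Assume the same label for every token, score it, sort best first."""
    scored = []
    for label in labels:
        probs = [dist.get(label, 0.0) for dist in token_distributions]
        scored.append((label, aggregate(probs)))
    return sorted(scored, key=lambda item: item[1]["surprisal"])


# Two tokens of one candidate entity, each with a softmax-style distribution.
span = [{"PER": 0.9, "ORG": 0.1}, {"PER": 0.8, "ORG": 0.2}]
for label, scores in rank_labels(span, ["ORG", "PER"]):
    print(label, scores)
```

Which of the four numbers to report as "the" confidence remains a use-case decision, as noted above.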

How can we define an RNN - LSTM neural network with multiple output for the input at time "t"?

I am trying to construct an RNN to predict the probability of a player playing the match, along with the runs scored and wickets taken by the player. I would use an LSTM so that performance in the current match would influence the player's future selection.
Architecture summary:
Input features: Match details - Venue, teams involved, team batting first
Input samples: Player roster of both teams.
Output:
Discrete: Binary: Did the player play.
Discrete: Wickets taken.
Continuous: Runs scored.
Continuous: Balls bowled.
Question:
Most often an RNN uses "Softmax" or "MSE" in the final layer to process "a" from the LSTM, providing only a single variable "Y" as output. But here there are four dependent variables (2 discrete and 2 continuous). Is it possible to stitch together all four as output variables?
If yes, how do we handle mix of continuous and discrete outputs with loss function?
(Though the output from LSTM "a" has multiple features and carries the information to the next time-slot, we need multiple features at output for training based on the ground-truth)
You just do it. Without more detail on the software (if any) in use, it is hard to give more detail.
The output of the LSTM unit at every time step is one of the hidden layers of your network.
You can then feed it into 4 output layers:
1: sigmoid
2: I'd mess around with this a bit. Maybe 4x sigmoid (4 wickets to an innings, right?) or relu4
3, 4: linear (squaring it is also an option, or relu)
For training purposes your loss function is the sum of your 4 individual losses.
If they were all MSE, you could concatenate your 4 outputs before calculating the loss.
But since the first is cross-entropy (for a decision sigmoid), you'd calculate the losses separately and sum them.
You can still concatenate the outputs afterwards to get a single output vector.
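A framework-free sketch of that summed loss for a single time step may help. The field names are hypothetical, and treating wickets as a plain squared-error regression is my simplifying assumption (the answer leaves that head open); in a real framework you would attach one loss per output head and let it sum them.

```python
import math


def binary_cross_entropy(y_true, p, eps=1e-12):
    """Cross-entropy for the sigmoid head (did the player play: 0 or 1)."""
    p = min(max(p, eps), 1.0 - eps)  # keep log() finite
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


def squared_error(y_true, y_pred):
    """Squared error for the linear heads."""
    return (y_true - y_pred) ** 2


def combined_loss(target, pred):
    """Sum of the four per-head losses for one player at one time step."""
    loss = binary_cross_entropy(target["played"], pred["played"])
    for key in ("wickets", "runs", "balls"):
        loss += squared_error(target[key], pred[key])
    return loss


target = {"played": 1, "wickets": 2, "runs": 30.0, "balls": 24.0}
pred = {"played": 0.7, "wickets": 1.5, "runs": 27.0, "balls": 20.0}
print(combined_loss(target, pred))
```

Because runs and balls live on a much larger scale than the cross-entropy term, weighting the individual losses before summing is a common adjustment.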

How to deal with ordinal labels in keras?

I have data with integer target class in the range 1-5 where one is the lowest and five the highest. In this case, should I consider it as regression problem and have one node in the output layer?
My way of handling it is:
1- first I convert the labels to binary class matrix
labels = to_categorical(np.asarray(labels))
2- in the output layer, I have five nodes
main_output = Dense(5, activation='sigmoid', name='main_output')(x)
3- I use 'categorical_crossentropy' with 'mean_squared_error' when compiling
model.compile(optimizer='rmsprop',loss='categorical_crossentropy',metrics=['mean_squared_error'],loss_weights=[0.2])
Also, can anyone tell me: what is the difference between using 'categorical_accuracy' and 'mean_squared_error' in this case?
Regression and classification are vastly different things. If you reimagine this as a regression task, then predicting 2 when the ground truth is 4 will be penalized more than predicting 3 instead of 4. If you have classes like car, animal, person, you do not care about the ranking between those classes. Predicting car is just as wrong as predicting animal if the image shows a person.
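A tiny illustration of that difference, using squared error as the regression-style penalty and a 0/1 loss as the classification-style penalty (both chosen here only for illustration):

```python
def squared_error(y_true, y_pred):
    """Regression-style penalty: grows with the distance between labels."""
    return (y_true - y_pred) ** 2


def zero_one_loss(y_true, y_pred):
    """Classification-style penalty: every wrong class costs the same."""
    return 0 if y_true == y_pred else 1


for guess in (2, 3):
    print(guess, squared_error(4, guess), zero_one_loss(4, guess))
```

With a ground truth of 4, the regression penalty distinguishes the guesses 2 and 3, while the classification penalty treats them as equally wrong.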
Metrics do not impact your learning at all. A metric is just something that is computed in addition to the loss to show the performance of the model. Here the accuracy makes sense, because this is mostly the metric that we care about. Mean squared error does not tell you how well your model performs. If you get something like 0.0015 mean squared error it sounds good, but it is hard to visualize just how well this performs. In contrast, achieving 95% accuracy, for example, is meaningful.
One last thing: you should use softmax instead of sigmoid as your final activation to get a probability distribution in your final layer. Softmax outputs a probability for every class, and these sum to 1. Cross-entropy then measures the difference between the probability distribution of your network output and the ground truth.
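To make the softmax and cross-entropy step concrete, here is a small pure-Python sketch. The logits are invented, and shifting the labels 1-5 to indices 0-4 is my own choice; note that Keras' to_categorical infers the number of classes as the maximum label plus one, so unshifted labels 1-5 would produce six columns rather than five.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(label, num_classes=5):
    """Encode a label in 1..num_classes as a one-hot vector."""
    vec = [0.0] * num_classes
    vec[label - 1] = 1.0
    return vec


def categorical_crossentropy(target, probs, eps=1e-12):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, probs))


probs = softmax([0.1, 0.2, 0.4, 2.5, 0.3])  # network favours class 4
print(sum(probs))
print(categorical_crossentropy(one_hot(4), probs))
print(categorical_crossentropy(one_hot(1), probs))
```

The loss is small when the true class receives most of the probability mass and large otherwise.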

Dynamic Topic model output - Blei format

I am working with the Dynamic Topic Models package that was developed by Blei. I am new to LDA however I understand it.
I would like to know what the output file named
lda-seq/topic-000-var-obs.dat stores.
I know that lda-seq/topic-001-var-e-log-prob.dat stores the log of the variational posterior and by applying the exponential over it, I get the probability of the word within Topic 001.
Thanks
topic-000-var-e-log-prob.dat stores the log of the variational posterior of topic 1.
topic-001-var-e-log-prob.dat stores the log of the variational posterior of topic 2.
I have failed to find a concrete answer anywhere. However, the documentation's sample.sh states
The code creates at least the following files:
- topic-???-var-e-log-prob.dat: the e-betas (word distributions) for topic ??? for all times.
...
- gam.dat
without mentioning the topic-000-var-obs.dat file, which suggests that it is not essential for most analyses.
Speculation
obs suggests observations. After a little digging around in the example/model_run results, I plotted the sum across epochs for each word/token using:
temp = scan("dtm/example/model_run/lda-seq/topic-000-var-obs.dat")
temp.matrix = matrix(temp, ncol = 10, byrow = TRUE)
plot(rowSums(temp.matrix))
and the result is something like:
The general trend of the non-negative values is decreasing, and many values are floored (in this case to -11.00972 ≈ log(1.65e-05)), suggesting that these values are weightings or some other measure of influence on the model. The model removes some tokens, and the influence/importance of the others tapers off over the index. The latter trend may be caused by preprocessing such as sorting tokens by tf-idf when creating the dictionary.
Interestingly, the row sums vary for both the floored tokens and the set with more positive values:
temp = scan("~/Documents/Python/inference/project/dtm/example/model_run/lda-seq/topic-009-var-obs.dat")
temp.matrix = matrix(temp, ncol = 10, byrow = TRUE)
plot(rowSums(temp.matrix))
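For readers without R, the same exploration can be sketched in plain Python. It assumes, as the R snippet does, that the file is a flat list of numbers laid out row by row with one column per time slice (10 in the example run); the function names are hypothetical, and exponentiating only makes sense for the e-log-prob files described above.

```python
import math


def read_values(path):
    """Read a whitespace-separated .dat file into a flat list of floats."""
    with open(path) as fh:
        return [float(token) for token in fh.read().split()]


def to_rows(values, n_times):
    """Reshape the flat list into one row per term, one column per time slice."""
    if len(values) % n_times != 0:
        raise ValueError("value count is not a multiple of n_times")
    return [values[i:i + n_times] for i in range(0, len(values), n_times)]


def row_sums(values, n_times):
    """Sum each term's values across all time slices (R: rowSums)."""
    return [sum(row) for row in to_rows(values, n_times)]


def to_probabilities(log_values):
    """Exponentiate log probabilities from a var-e-log-prob file."""
    return [math.exp(v) for v in log_values]


# Usage (paths as in the example run, not executed here):
#   obs = read_values("dtm/example/model_run/lda-seq/topic-000-var-obs.dat")
#   sums = row_sums(obs, n_times=10)
```

Plotting sums against the term index reproduces the R plot, for example with matplotlib.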

Floating point Data generator

Is there a program or source code for data generation?
I want a data generator for Java. (Language does not matter, if I can get the result file)
I want correlated data, anti-correlated data, and independent data.
I want a data generator program that has
input : min, max, data-distribution (e.g., independent, anti-correlated, correlated, Gaussian, Poisson ... ), dimension, # of points (n)
output : n points that follow the given data-distribution.
Thank you :)
You can change the interval of the generated numbers with some simple math:
import java.util.Random;
Random r = new Random();
float x = min + r.nextFloat() * (max - min); // uniform in [min, max); min and max are floats you define
Java's Random class also has a nextGaussian() method that returns normally distributed values.
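Since the language does not matter as long as a result file can be produced, here is a minimal Python sketch of a generator for independent, correlated and anti-correlated points. The construction is my own simplification, not a standard tool: correlated points share one base value plus Gaussian noise in every dimension, and anti-correlated points mirror that base value in every second dimension. The names generate, kind and noise are hypothetical.

```python
import random


def generate(n, dim, lo, hi, kind="independent", noise=0.05, seed=None):
    """Return n points with dim coordinates each, scaled into [lo, hi]."""
    if kind not in ("independent", "correlated", "anti-correlated"):
        raise ValueError("unknown kind: " + kind)
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        if kind == "independent":
            unit = [rng.random() for _ in range(dim)]
        else:
            base = rng.random()  # shared by all dimensions of this point
            unit = []
            for d in range(dim):
                mirrored = kind == "anti-correlated" and d % 2 == 1
                centre = 1.0 - base if mirrored else base
                value = centre + rng.gauss(0.0, noise)
                unit.append(min(1.0, max(0.0, value)))  # clamp to [0, 1]
        points.append([lo + u * (hi - lo) for u in unit])
    return points


# Write 1000 anti-correlated 2-D points in [0, 100] to a CSV-style file.
with open("points.csv", "w") as fh:
    for point in generate(1000, 2, 0.0, 100.0, kind="anti-correlated", seed=1):
        fh.write(",".join(f"{x:.4f}" for x in point) + "\n")
```

Gaussian or Poisson marginals would need a different sampler per dimension; random.gauss covers the former, and this sketch does not attempt the latter.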