How to decide tree depth of LGBM for high dimensional data? - regression

I am using LightGBM for a regression problem in my project. The input data has 800 numeric variables, so the dataset is high-dimensional and sparse.
I want to use as many variables as possible in each iteration. In this case, should I leave max_depth unlimited?
I set max_depth=2 to reduce overfitting, but the model seems to use only 1 to 3 variables in each iteration, and those variables are used repeatedly.
I would also like to know how tree depth affects the learning result of a regression tree.
Details of the current model:
Number of input variables=800 (numeric)
target variables=1 (numeric)
objective=regression
max_depth=2
num_leaves=3
num_iterations=2000
learning_rate=0.01
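As a rough check on the settings above: a binary tree with L leaves has L - 1 splits, and a depth limit of d caps the leaf count at 2^d, so the number of distinct variables a single tree can split on is bounded by those two parameters. The helper below is only a sketch of that arithmetic. `max_features_per_tree` is a hypothetical name, not a LightGBM function, and it assumes that a max_depth of zero or less means "no depth limit" (LightGBM's default is -1).

```python
def max_features_per_tree(num_leaves, max_depth):
    """Upper bound on the distinct variables one tree can split on."""
    if max_depth > 0:
        # A binary tree of depth d has at most 2**d leaves.
        num_leaves = min(num_leaves, 2 ** max_depth)
    # Every split adds one leaf, so a tree with L leaves has L - 1 splits.
    return max(num_leaves - 1, 0)


current = max_features_per_tree(num_leaves=3, max_depth=2)
deeper = max_features_per_tree(num_leaves=31, max_depth=-1)
print(current, deeper)  # prints: 2 30
```

Under this bound, the posted settings allow at most 2 split variables per tree, so seeing only a handful of variables per iteration is expected. Raising num_leaves (and max_depth with it) raises the bound, at the cost of a higher risk of overfitting.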

Related

what's the meaning of 'parameterize' in deep learning?

As shown in the photo, does it mean that the matrix 'A' can be changed by the optimization during training?
Yes. When something is parameterized, it means that gradients can be calculated with respect to it.
That is, the derivative of the error with respect to the weight (dE/dw) can be calculated (i.e., it must be differentiable) and subtracted from the model weights, scaled by the learning rate and any other parameters the optimizer uses.
What the paper is saying is that if you make a binary matrix a weight, compute the gradient of the loss with respect to that weight (dE/dw), and then update the binary matrix through backpropagation, there is not really an activation function (which must be differentiable) that can keep the values discrete (like 0 and 1). Instead, you end up with continuous values (like the decimal values shown).
In other words, it is difficult to treat binary values as weights and backpropagate through them in a way that keeps the updated weight matrix binary, so another solution, such as the Bernoulli distribution, is used instead to initialize the parameters of the model.
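To see why a plain gradient update does not keep a matrix binary, here is a minimal sketch of one gradient-descent step in plain Python. The matrix, the gradient values, and the learning rate are made-up numbers for illustration and are not taken from the paper.

```python
def sgd_step(weights, grads, learning_rate):
    """Return weights - learning_rate * grads, element by element."""
    return [
        [w - learning_rate * g for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(weights, grads)
    ]


# A binary matrix treated as a trainable weight (illustrative values).
A = [[0.0, 1.0],
     [1.0, 0.0]]
# Hypothetical gradients dE/dA from some loss.
dE_dA = [[0.5, -0.5],
         [0.25, 0.0]]

A_new = sgd_step(A, dE_dA, learning_rate=0.1)
is_binary = all(v in (0.0, 1.0) for row in A_new for v in row)
print(A_new)
print(is_binary)  # False: the update moved entries off 0 and 1
```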
Hope this helps,

Why is the W_q matrix in torch.nn.MultiheadAttention square?

I am trying to implement nn.MultiheadAttention in my network. According to the docs,
embed_dim  – total dimension of the model.
However, according to the source file,
embed_dim must be divisible by num_heads
and
self.q_proj_weight = Parameter(torch.Tensor(embed_dim, embed_dim))
If I understand correctly, this means each head takes only a part of the features of each query, since the matrix is square. Is this a bug in the implementation, or is my understanding wrong?
Each head uses a different part of the projected query vector. You can imagine it as if the projected query gets split into num_heads vectors that are independently used to compute the scaled dot-product attention. So, each head operates on a different linear combination of the features in the queries (and keys and values, too). This linear projection is done using the self.q_proj_weight matrix, and the projected queries are passed to the F.multi_head_attention_forward function.
In F.multi_head_attention_forward, this is implemented by reshaping and transposing the query vector, so that the independent attentions for the individual heads can be computed efficiently by matrix multiplication.
The head size (embed_dim / num_heads) is a design decision of PyTorch. In theory, you could have a different head size, so the projection matrix would have a shape of embed_dim × (num_heads * head_dim). Some implementations of transformers (such as the C++-based Marian for machine translation, or Huggingface's Transformers) allow that.
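The projection and per-head split described above can be sketched in plain Python, without PyTorch. The helper names `project` and `split_heads` and the 4×4 weight matrix are illustrative assumptions. PyTorch itself works on batched tensors with reshapes and transposes rather than list slicing, so this only shows the idea.

```python
def project(weight, vector):
    """Multiply a square (embed_dim x embed_dim) matrix by a query vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]


def split_heads(projected, num_heads):
    """Cut the projected vector into num_heads equal chunks, one per head."""
    embed_dim = len(projected)
    if embed_dim % num_heads != 0:
        raise ValueError("embed_dim must be divisible by num_heads")
    head_dim = embed_dim // num_heads
    return [projected[i * head_dim:(i + 1) * head_dim] for i in range(num_heads)]


# Illustrative square projection matrix (embed_dim = 4) and one query.
W_q = [[1, 0, 0, 1],
       [0, 1, 1, 0],
       [1, 1, 0, 0],
       [0, 0, 1, 1]]
query = [1, 2, 3, 4]

heads = split_heads(project(W_q, query), num_heads=2)
print(heads)  # [[5, 5], [3, 7]]
```

Each head receives only 2 of the 4 projected values, but those values are linear combinations over all 4 input features, so no head is restricted to a fixed subset of the original query features.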

Predicting continuous valued output

I am working on predicting Semantic Textual Similarity (SemEval 2017 Task-1) between a pair of texts. The similarity score (output) is a continuous value in the range [0, 5]. The neural network model (link below) therefore has 6 units in the final layer for predicting values in [0, 5]. The objective function is the Pearson correlation coefficient, and the final layer uses a softmax activation. Now, in order to train the model, how can I give the target output values to the model? Since there are 6 output classes, I should probably send one-hot-encoded vectors of the output. In that case, how can we convert the output (which might be a float value such as 2.33) to a one-hot vector of length 6? Or is there another way of specifying the target output and training the model?
Paper: http://nlp.arizona.edu/SemEval-2017/pdf/SemEval016.pdf
If the value you're trying to predict is continuously-defined, you might be better off configuring this as a regression architecture. This will be simpler to train and interpret and will give you non-integer predictions (which you can then bucket or threshold however you please).
To do this, replace your softmax layer with a layer containing a single neuron with a linear activation function. Then you can simply train this network using your real-valued similarity scores as the targets. For the loss function, you can use MSE / L2 unless you have a reason to do otherwise.
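A minimal sketch of that setup in plain Python: a single output neuron with a linear (identity) activation, scored with MSE against real-valued targets. The feature values, weights, and targets are invented for illustration. In a real network, the inputs to this neuron would be the activations of the previous layer.

```python
def linear_output(features, weights, bias):
    """Single output neuron with a linear (identity) activation."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def mse(predictions, targets):
    """Mean squared error (L2 loss) averaged over the examples."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Hypothetical last-hidden-layer activations for two sentence pairs.
batch = [[1.0, 0.5], [0.2, 2.0]]
weights, bias = [1.0, 2.0], 0.0

predictions = [linear_output(x, weights, bias) for x in batch]
targets = [2.33, 3.5]  # real-valued similarity scores, used as-is
loss = mse(predictions, targets)
print(predictions, loss)
```

No one-hot encoding is needed here: a target such as 2.33 is passed to the loss directly.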

how to predict query topics using word-topic matrix?

I'm implementing LDA in Java, and I know how the algorithm works. At the end of training (after the given number of iterations), I get two matrices (topic-word and document-topic) that represent the set of input documents.
My problem is that when I input a new document (a query), I want to use these matrices (or any other method) to get the document-topic vector of that query. How would I do that?
Are you using Variational Inference or Gibbs Sampling?
For Gibbs sampling, a typical approach is to add the new document(s) to the inference and update only their own counters, keeping the counters for the documents used to learn the model constant.
This is specified in equations 84 and 85 of Parameter Estimation for Text Analysis.
I guess there has to be a similar approach for variational-inference LDA.
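The fold-in idea for Gibbs sampling can be sketched roughly as follows (in Python rather than Java, for brevity). This is a simplification, not a transcription of the referenced equations: it treats the trained topic-word matrix as fixed, normalized probabilities `phi` instead of raw counters, and only the new document's topic counters are updated. The function name `fold_in`, the prior `alpha`, and the toy matrix are all assumptions made for illustration.

```python
import random


def fold_in(doc, phi, alpha=0.1, sweeps=50, seed=0):
    """Estimate the topic mixture of a new document by Gibbs sampling.

    doc   -- list of word ids in the new document
    phi   -- phi[k][w]: trained probability of word w under topic k (kept fixed)
    alpha -- symmetric Dirichlet prior on the document-topic mixture
    """
    rng = random.Random(seed)
    num_topics = len(phi)
    # Random initial topic for every word of the new document.
    z = [rng.randrange(num_topics) for _ in doc]
    doc_topic = [0] * num_topics  # counters owned by the new document only
    for k in z:
        doc_topic[k] += 1

    for _ in range(sweeps):
        for i, w in enumerate(doc):
            doc_topic[z[i]] -= 1  # remove the current assignment
            # p(z_i = k) is proportional to phi[k][w] * (n_dk + alpha)
            weights = [phi[k][w] * (doc_topic[k] + alpha) for k in range(num_topics)]
            z[i] = rng.choices(range(num_topics), weights=weights)[0]
            doc_topic[z[i]] += 1

    total = len(doc) + num_topics * alpha
    return [(n + alpha) / total for n in doc_topic]


# Toy model: topic 0 only emits words 0 and 1, topic 1 only words 2 and 3.
phi = [[0.5, 0.5, 0.0, 0.0],
       [0.0, 0.0, 0.5, 0.5]]
theta = fold_in(doc=[0, 1, 0, 1, 1], phi=phi)
print(theta)
```

The returned list is the document-topic vector for the query. In the toy example every word can only come from topic 0, so almost all of the mass ends up on topic 0.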

Does PyTorch support variable with dynamic dimension?

I've updated my question to make clear that it is about a variable whose dimension changes.
Suppose the input tensor stores 3D points with dimension 10x3, where 10 is the number of points and 3 is the feature dimension (say, x, y, z coordinates). The dimension of the variable depends on the input tensor; say its dimension is 10x10. When the input tensor changes its dimension to 50x3, the dimension of the variable also has to change to 50x50.
I know that in TensorFlow, if the input dimension is changing or unknown, we can declare it as a tf.placeholder with shape (None, 3). However, I have never met a situation where the size of a variable is changing or unknown; it seems that a variable always has a fixed dimension.
I am currently learning PyTorch and don't know whether PyTorch supports this. Any information would be appreciated!
========= Original question ========
I have a variable whose size changes when the input dimension changes. For example, if the input is 10x2, then the variable should be 10x10. If the input is 25x2, then the variable should be 25x25. As I understand it, a variable is used to store weights, which normally have a fixed dimension. However, in my case, the dimension of the variable depends on the input data, which can change. Does PyTorch currently support this?
Thanks!
Your question is a little ambiguous. When you say your input is, say, 10x2, you need to define what the input tensor contains.
I am assuming you are talking about torch.autograd.Variable. If you want to use PyTorch's functionality, you need to provide your input as a tensor in the shape that the target function expects.
For example, if you want to use the RNN implemented in PyTorch for an input sentence of length 10, where each word is represented by a 300-dimensional vector (e.g., a word embedding), you can do the following.
import torch
import torch.nn as nn
from torch.autograd import Variable

rnn = nn.RNN(300, 256, 2)  # emb_size=300, hidden_size=256, num_layers=2
input = Variable(torch.randn(10, 1, 300))  # sent_length=10, batch_size=1, emb_size=300
h0 = Variable(torch.randn(2, 1, 256))  # num_layers=2, batch_size=1, hidden_size=256
output, hn = rnn(input, h0)
If you have more than one sentence, you can provide them as a batch. In that case, you need to pad them to handle the different lengths. As you can see, the RNN doesn't care about the sentence length and can handle variable lengths, but to provide many sentences in a batch, you need padding. You can explore the related functionality in the official documentation.
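The padding step can be sketched in plain Python as follows. The helper `pad_batch` and the token ids are hypothetical; in PyTorch itself, torch.nn.utils.rnn.pad_sequence does this job on tensors.

```python
def pad_batch(sentences, pad_value=0):
    """Right-pad every sentence to the length of the longest one."""
    max_len = max(len(s) for s in sentences)
    padded = [list(s) + [pad_value] * (max_len - len(s)) for s in sentences]
    lengths = [len(s) for s in sentences]  # true lengths, to ignore padding later
    return padded, lengths


# Three "sentences" of token ids with different lengths (illustrative data).
batch = [[4, 8, 15], [16, 23], [42]]
padded, lengths = pad_batch(batch)
print(padded)   # [[4, 8, 15], [16, 23, 0], [42, 0, 0]]
print(lengths)  # [3, 2, 1]
```

Keeping the true lengths alongside the padded batch lets later steps skip the padded positions.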
Since you didn't mention what your input actually is, I am assuming you need variables with a variable number of timesteps; in that case, PyTorch can serve your purpose. PyTorch is designed to provide the basic functionality required to build deep neural network architectures.