How to set a fixed random seed for the inner loop of nested hyper-parameter tuning? - mlr

I'm trying to do hyper-parameter tuning with nested cross-validation. These are my inner tuning wrappers for the two learners lrn1 and lrn2:
inner = makeResampleDesc("CV", iters = 3L)
tune_lrn1 <- makeTuneWrapper(lrn1, resampling = inner, par.set = ps, control = ctrl)
tune_lrn2 <- makeTuneWrapper(lrn2, resampling = inner, par.set = ps, control = ctrl)
Is there any way to set a fixed random seed every time before "inner" is instantiated, so that the two learners always use exactly the same data partitions for hyper-parameter evaluation?

There are two things that you can do, but they might not satisfy your needs completely.
Fix the resampling for a given task
...or at least a given n. In this example we use the task iris.task.
inner_fixed = makeResampleInstance(inner, iris.task)
tune_lrn1 <- makeTuneWrapper(lrn1, resampling = inner_fixed, par.set = ps, control = ctrl)
tune_lrn2 <- makeTuneWrapper(lrn2, resampling = inner_fixed, par.set = ps, control = ctrl)
If you want to apply this to multiple tasks, you have to solve it programmatically, for example as sketched below.
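A minimal sketch (assuming tasks is a named list of mlr tasks and ps/ctrl are defined as in the question): create one fixed inner instance per task up front and reuse it in both wrappers, so both learners see identical inner splits on every task.
# One fixed ResampleInstance per task; both tune wrappers share it, so the
# inner splits are identical for lrn1 and lrn2 on each task.
inner_fixed_list = lapply(tasks, function(tsk) makeResampleInstance(inner, task = tsk))
wrappers = lapply(inner_fixed_list, function(ri) {
  list(lrn1 = makeTuneWrapper(lrn1, resampling = ri, par.set = ps, control = ctrl),
       lrn2 = makeTuneWrapper(lrn2, resampling = ri, par.set = ps, control = ctrl))
})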
Setting the seed can fail!
The following setting is already the default
ctrl = makeTuneControl*(same.resampling.instance = TRUE, ...)
This means that all tuning evaluations are evaluated on the same train/test splits. In other words, makeResampleInstance is called at the very beginning of tune(). Now we can go with @pat-s' answer of setting the seed, which does not always work, because some learners use the RNG during training and consequently the subsequent train/test splits will "diverge":
library(mlr)
inner = makeResampleDesc("CV", iters = 3L)
task = iris.task
lrn1 = makeLearner("classif.rpart")
lrn2 = makeLearner("classif.svm")
ctrl = makeTuneControlRandom(same.resampling.instance = TRUE, budget = 4)
library(mlrHyperopt)
ps1 = getDefaultParConfig(lrn1)$par.set
ps2 = getDefaultParConfig(lrn2)$par.set
tune_lrn1 = makeTuneWrapper(lrn1, resampling = inner, par.set = ps1, control = ctrl)
tune_lrn2 = makeTuneWrapper(lrn2, resampling = inner, par.set = ps2, control = ctrl)
set.seed(1)
r1 = resample(tune_lrn1, resampling = cv10, task = iris.task, models = TRUE)
set.seed(1)
r2 = resample(tune_lrn2, resampling = cv10, task = iris.task, models = TRUE)
sapply(1:10, function(i) {
  identical(r2$models[[i]]$learner.model$opt.result$resampling$train.inds,
            r1$models[[i]]$learner.model$opt.result$resampling$train.inds)
})
# [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
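Given that divergence, the fixed-instance approach from above is the more reliable route. A hedged sketch (assuming every outer training set has the same size, here 135 rows for cv10 on the 150-row iris.task):
# Size-based instance ("or at least a given n" from above): with cv10 on
# iris every outer training set has 135 rows, so both tune wrappers can
# share one fixed inner instance.
inner_fixed = makeResampleInstance(inner, size = 135)
tune_lrn1 = makeTuneWrapper(lrn1, resampling = inner_fixed, par.set = ps1, control = ctrl)
tune_lrn2 = makeTuneWrapper(lrn2, resampling = inner_fixed, par.set = ps2, control = ctrl)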

Related

Merging two MLP models that use different datasets but the same target value

I have two datasets with the same target attribute. I designed two MLPs for them and need to merge them (see the desired-network image in the original post). My MLP looks like this:
def MLP(input_shape, neurons_in_layers, activations, classes):
    X = Input(input_shape)
    i = 0
    while i < len(neurons_in_layers):
        X = Dense(neurons_in_layers[i])(X)
        X = Activation(activations[i])(X)
        i += 1
    X = Dense(classes)(X)
    X = Activation('softmax')(X)
    return X
Examples of hyperparameters are:
n_rows_nume, n_columns_nume = dataset_nume.shape
n_rows_cate, n_columns_cate = dataset_cate.shape
target_index_nume = dataset_nume.columns.get_loc("Task")
target_index_cate = dataset_cate.columns.get_loc("Task")
As a result, I created two networks and merged them as below:
module_nume = MLP(input_shape_nume, neurons_nume, activations_nume, n_classes)
module_cate = MLP(input_shape_cate, neurons_cate, activations_cate, n_classes)
merged = average([module_nume, module_cate])
merged_model = Model(inputs = [Input(input_shape_nume), Input(input_shape_cate)], outputs = merged)
But it gives me an error:
ValueError: Graph disconnected: cannot obtain value for tensor KerasTensor(type_spec=TensorSpec(shape=(None, 12), dtype=tf.float32, name='input_2'), name='input_2', description="created by layer 'input_2'") at layer "dense_5". The following previous layers were accessed without issue: []

PyTorch-lightning models running out of Memory after 1st epoch

I saw a Kaggle kernel on PyTorch and ran it with the same img_size, batch_size, etc., and created another PyTorch Lightning kernel with the exact same values, but my Lightning model runs out of memory after about 1.5 epochs (each epoch contains 8750 steps) on the first fold, whereas the native PyTorch model runs for the whole 5 folds. Is there any way to improve the code or release memory? I could have tried to delete the models or do some garbage collection, but if it doesn't even complete the first fold I can't delete the models and related objects.
def run_fold(fold):
    df_train = train[train['fold'] != fold]
    df_valid = train[train['fold'] == fold]
    train_dataset = G2NetDataset(df_train, get_train_aug())
    valid_dataset = G2NetDataset(df_valid, get_test_aug())
    train_dl = DataLoader(train_dataset,
                          batch_size = config.batch_size,
                          num_workers = config.num_workers,
                          shuffle = True,
                          drop_last = True,
                          pin_memory = True)
    valid_dl = DataLoader(valid_dataset,
                          batch_size = config.batch_size,
                          num_workers = config.num_workers,
                          shuffle = False,
                          drop_last = False,
                          pin_memory = True)
    model = Classifier()
    logger = pl.loggers.WandbLogger(project='G2Net', name=f'fold: {fold}')
    trainer = pl.Trainer(gpus = 1,
                         max_epochs = config.epochs,
                         fast_dev_run = config.debug,
                         logger = logger,
                         log_every_n_steps = 10)
    trainer.fit(model, train_dl, valid_dl)
    result = trainer.test(test_dataloaders = valid_dl)
    wandb.run.finish()
    return result

def main():
    if config.train:
        results = []
        for fold in range(config.n_fold):
            result = run_fold(fold)
            results.append(result)
        return results

results = main()
I cannot say much without looking at your model class, but a couple of possible issues that I encountered were metric and loss evaluation for logging.
For example, stuff like
pl.metrics.Accuracy(compute_on_step=False)
requires an explicit call of .compute()
def training_epoch_end(self, outputs):
    loss = sum([out['loss'] for out in outputs]) / len(outputs)
    self.log_dict({'train_loss': loss.detach(),
                   'train_accuracy': self.train_metric.compute()})
at the epoch end.

How do I measure perplexity scores on an LDA model made with the textmineR package in R?

I've made an LDA topic model in R using the textmineR package; it looks as follows.
## get textmineR dtm
dtm2 <- CreateDtm(doc_vec = dat2$fulltext, # character vector of documents
                  ngram_window = c(1, 2),
                  doc_names = dat2$names,
                  stopword_vec = c(stopwords::stopwords("da"), custom_stopwords),
                  lower = T, # lowercase - this is the default value
                  remove_punctuation = T, # punctuation - this is the default
                  remove_numbers = T, # numbers - this is the default
                  verbose = T,
                  cpus = 4)
dtm2 <- dtm2[, colSums(dtm2) > 2]
dtm2 <- dtm2[, str_length(colnames(dtm2)) > 2]
############################################################
## RUN & EXAMINE TOPIC MODEL
############################################################
# Draw quasi-random sample from the pc
set.seed(34838)
model2 <- FitLdaModel(dtm = dtm2,
                      k = 8,
                      iterations = 500,
                      burnin = 200,
                      alpha = 0.1,
                      beta = 0.05,
                      optimize_alpha = TRUE,
                      calc_likelihood = TRUE,
                      calc_coherence = TRUE,
                      calc_r2 = TRUE,
                      cpus = 4)
The questions are then:
1. Which function should I apply to get the perplexity scores in the textmineR package? I can't seem to find one.
2. How do I measure perplexity scores for different numbers of topics (k)?
As asked: there's no way to calculate perplexity with textmineR unless you explicitly program it yourself. TBH, I've never seen a value in perplexity that you couldn't get from likelihood and coherence, so I didn't implement it.
However, the text2vec package does have an implementation. See below for an example:
library(textmineR)
# model ships with textmineR as example
m <- nih_sample_topic_model
# dtm ships with textmineR as example
d <- nih_sample_dtm
# get perplexity
p <- text2vec::perplexity(X = d,
                          topic_word_distribution = m$phi,
                          doc_topic_distribution = m$theta)
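For the second question, one option is to fit one model per candidate k and compare the perplexities on the same dtm. A minimal, untested sketch, assuming dtm2 from the question:
library(textmineR)
ks <- c(4, 6, 8, 10)
perplexities <- sapply(ks, function(k) {
  # fit one model per k with the question's sampler settings
  m <- FitLdaModel(dtm = dtm2, k = k, iterations = 500, burnin = 200,
                   alpha = 0.1, beta = 0.05)
  text2vec::perplexity(X = dtm2,
                       topic_word_distribution = m$phi,
                       doc_topic_distribution = m$theta)
})
names(perplexities) <- ks
perplexities # lower is generally better; compare alongside coherence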

Why is my model's loss always revolving around 1 in every epoch?

During training, the loss of my model revolves around 1; it is not converging. I tried various optimizers, but it still shows the same pattern. I am using Keras with the TensorFlow backend. What could be the possible reasons? Any help or reference link would be appreciated.
Here is my model:
def model_vgg19():
    vgg_model = VGG19(weights="imagenet", include_top=False, input_shape=(128,128,3))
    for layer in vgg_model.layers[:10]:
        layer.trainable = False
    intermediate_layer_outputs = get_layers_output_by_name(vgg_model, ["block1_pool", "block2_pool", "block3_pool", "block4_pool"])
    convnet_output = GlobalAveragePooling2D()(vgg_model.output)
    for layer_name, output in intermediate_layer_outputs.items():
        output = GlobalAveragePooling2D()(output)
        convnet_output = concatenate([convnet_output, output])
    convnet_output = Dense(2048, activation='relu')(convnet_output)
    convnet_output = Dropout(0.6)(convnet_output)
    convnet_output = Dense(2048, activation='relu')(convnet_output)
    convnet_output = Lambda(lambda x: K.l2_normalize(x, axis=1))(convnet_output)
    final_model = Model(inputs=[vgg_model.input], outputs=convnet_output)
    return final_model

model = model_vgg19()
Here is my loss function:
def hinge_loss(y_true, y_pred):
    y_pred = K.clip(y_pred, _EPSILON, 1.0 - _EPSILON)
    loss = tf.convert_to_tensor(0, dtype=tf.float32)
    g = tf.constant(1.0, shape=[1], dtype=tf.float32)
    for i in range(0, batch_size, 3):
        try:
            q_embedding = y_pred[i + 0]
            p_embedding = y_pred[i + 1]
            n_embedding = y_pred[i + 2]
            D_q_p = K.sqrt(K.sum((q_embedding - p_embedding)**2))
            D_q_n = K.sqrt(K.sum((q_embedding - n_embedding)**2))
            loss = (loss + g + D_q_p - D_q_n)
        except:
            continue
    loss = loss / (batch_size / 3)
    zero = tf.constant(0.0, shape=[1], dtype=tf.float32)
    return tf.maximum(loss, zero)
One definite problem is that you shuffle your data and then try to learn triplets from it.
As you can see here: https://keras.io/models/model/, model.fit shuffles your data in each epoch, which breaks your triplet setup. Try setting the shuffle parameter to False and see what happens; there might be other errors as well.

Additional seedwords argument in LDA() function from topicmodels

I am looking for an in-depth example of Latent Dirichlet Allocation (LDA) with seedwords specified for the topicmodels package in R.
The basic function takes on the form:
LDA(x, k, method = "Gibbs", control = NULL, model = NULL, ...)
And the documentation only states:
For method = "Gibbs" an additional argument seedwords can be specified
as a matrix or an object of class "simple_triplet_matrix"; the default
is NULL.
Can anyone point me to a complete example of how this would look and function?
Taken from this answer:
https://stats.stackexchange.com/questions/384183/seeded-lda-using-topicmodels-in-r
library("topicmodels")
data("AssociatedPress", package = "topicmodels")
## We fit 6 topics.
## We specify five seed words for five topics, the sixth topic has no
## seed words.
library("slam")
set.seed(123)
i <- rep(1:5, each = 5)
j <- sample(1:ncol(AssociatedPress), 25)
SeedWeight <- 500 - 0.1
deltaS <- simple_triplet_matrix(i, j, v = rep(SeedWeight, 25),
                                nrow = 6, ncol = ncol(AssociatedPress))
set.seed(1000)
ldaS <- LDA(AssociatedPress, k = 6, method = "Gibbs", seedwords = deltaS,
            control = list(alpha = 0.1, best = TRUE,
                           verbose = 500, burnin = 500, iter = 100,
                           thin = 100, prefix = character()))
## Which terms were seeded for each of the first five topics?
apply(deltaS, 1, function(x) which(x == SeedWeight))
## Top 5 terms of each fitted topic.
apply(posterior(ldaS)$terms, 1, function(x) order(x, decreasing = TRUE)[1:5])