LSTM timeseries recursive prediction converges to the same value - deep-learning

I'm working on timeseries sequence prediction using an LSTM.
My goal is to use a window of 25 past values to generate a prediction for the next 25 values. I'm doing that recursively:
I use 25 known values to predict the next value, append that value as a known value, then shift the 25-value window and predict the next one again, until I have 25 new generated values (or more).
I'm using Keras to implement the RNN.
Architecture:
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense

regressor = Sequential()
regressor.add(LSTM(units = 50, return_sequences = True, input_shape = (X_train.shape[1], 1)))
regressor.add(Dropout(0.1))
regressor.add(LSTM(units = 50, return_sequences = True))
regressor.add(Dropout(0.1))
regressor.add(LSTM(units = 50))
regressor.add(Dropout(0.1))
regressor.add(Dense(units = 1))
regressor.compile(optimizer = 'rmsprop', loss = 'mean_squared_error')
regressor.fit(X_train, y_train, epochs = 10, batch_size = 32)
Problem:
Recursive prediction always converges to the same value, no matter what sequence comes before it.
This is definitely not what I want: I was expecting the generated sequence to differ depending on what comes before, and I'm wondering if someone has an idea about this behavior and how to avoid it. Maybe I'm doing something wrong ...
I tried different numbers of epochs and it didn't help much; actually, more epochs made it worse. Changing the batch size, number of units, number of layers, and window size didn't help either.
I'm using MinMaxScaler for the data.
Edit:
scaling new inputs for testing:
dataset_test = sc.transform(dataset_test.reshape(-1, 1))
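For reference, the recursive generation loop looks roughly like this (a minimal sketch, assuming regressor is the trained model above, sc is the fitted MinMaxScaler, and last_window is a 1-D array holding the 25 most recent scaled values; last_window is just an illustrative name):
import numpy as np

window = list(last_window)
generated = []
for _ in range(25):
    # shape the current window as (1 sample, 25 timesteps, 1 feature)
    x = np.array(window[-25:]).reshape(1, 25, 1)
    next_scaled = regressor.predict(x)[0, 0]
    # append the prediction as a "known" value and slide the window
    generated.append(next_scaled)
    window.append(next_scaled)
# map the generated sequence back to the original scale
generated = sc.inverse_transform(np.array(generated).reshape(-1, 1))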

Related

Training LSTM but output converges to a single value after a few inputs

Background
I'm learning about LSTMs and figured I'd try training a very basic LSTM model (big believer in learning by doing).
To start with something basic, I tried to implement an LSTM that would sum up the last 10 inputs it has seen. I generated a dataset consisting of 1000 random numbers between 0 and 1, with 1000 labels representing the sum of the previous 10 numbers (label[i] = data[i-9:i+1].sum()), and tried to train the LSTM to recognize this pattern.
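Concretely, the dataset generation looks roughly like this (a minimal sketch of what I describe above; variable names are just for illustration):
import torch

# 1000 random numbers in [0, 1) and labels that sum the previous 10 numbers
# (including the current one); early labels just sum what is available
data = torch.rand(1000)
labels = torch.stack([data[max(0, i - 9):i + 1].sum() for i in range(len(data))])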
I know that simpler models can solve this (i.e. linear regression), but I believe that LSTMs should also be able to solve this fairly basic problem.
My initial implementation does seem to work, but when I try to improve the implementation, I start getting constant output values after a few timesteps; the constant looks like approximately the average of the training labels.
I'd appreciate any insights as to why the second and third iterations don't work, especially the third iteration, since it looks like it's the same implementation as what I read in "Deep Learning for Coders with fastai & PyTorch" book.
What I've tried so far
I've done 3 iterations so far:
Initially, I generated all sub-sequences of length 10 from the input data along with the corresponding label ([(data[i-9:i+1], label[i]) for i in range(9, len(data))]) and fed this into the LSTM.
This iteration worked very well: if I feed a sequence of 10 inputs, I get an output from the LSTM that is very close to the sum. However, it is kind of cheating in that I'm basically telling the LSTM that the sequence is of length 10. I believe an LSTM should be able to infer the sequence length, so I tried to remove that bit of information.
In my second iteration, I feed the entire sequence into the LSTM at once: ([data[i] for i in range(len(data))], [label[i] for i in range(len(data))]). Basically, a single input with a sequence length of 1000, and a single output of 1000 labels.
However, after training, when running on validation data of length 100, all but the first few outputs are a constant number, approximately the average of the training labels.
In my last iteration, I tried feeding inputs one at a time to the LSTM (1000 inputs, each with a sequence length of 1), manually storing the hidden and cell states and passing them into the next run with the next input. This produces similar results to #2.
Network setup
For all runs, I used a single-layer LSTM with 25 hidden units, since it's a fairly simple problem. I did try adding layers with dropout or increasing the hidden size, but it didn't help.
Code samples
First iteration:
class LSTMModel(Module):  # Module is presumably fastai's Module, which calls super().__init__() automatically
    def __init__(self, seq_len, layers, input_size, hidden_size):
        self.lstm1 = nn.LSTM(input_size=input_size, hidden_size=hidden_size, num_layers=layers, bidirectional=False, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        x, _ = self.lstm1(x)
        # [:,-1,:] to grab the last output for sequence length of 10
        x = self.fc(x[:, -1, :])
        return x[:, 0]
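As a usage sketch for this first iteration (assuming the data and labels from the generation sketch above; the hyperparameters are illustrative):
import torch

# stack all length-10 sub-sequences into a (N, 10, 1) batch and run the model
xs = torch.stack([data[i - 9:i + 1] for i in range(9, len(data))]).unsqueeze(-1)
ys = labels[9:]
model = LSTMModel(seq_len=10, layers=1, input_size=1, hidden_size=25)
preds = model(xs)  # shape (N,): one predicted sum per sub-sequence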
Second iteration:
class LSTMModel(Module):
    def __init__(self, layers, input_size, hidden_size):
        self.lstm1 = nn.LSTM(input_size=input_size, hidden_size=hidden_size, num_layers=layers, bidirectional=False)
        self.fc = nn.Linear(hidden_size, 1)
        self.input_size = input_size

    def forward(self, x):
        x, h = self.lstm1(x)
        x = self.fc(x)
        return x
Final iteration:
class LSTMModel(Module):
    def __init__(self, layers, input_size, hidden_size):
        self.lstm1 = nn.LSTM(input_size=input_size, hidden_size=hidden_size, num_layers=layers, bidirectional=False, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)
        self.h = [torch.zeros(layers, 1, hidden_size).cuda() for _ in range(2)]

    def forward(self, x):
        x, h = self.lstm1(x, self.h)
        self.h = [h_.detach() for h_ in h]
        x = self.fc(x)
        return x

    def reset(self):
        for h in self.h: h.zero_()

PyTorch: How to normalize a tensor when the image is cropped randomly?

Let's say we are working with the CIFAR-10 dataset and we want to apply some data augmentation and additionally normalize the tensors. Here is some reproducible code for this
from torchvision import transforms, datasets
import matplotlib.pyplot as plt
trafo = transforms.Compose([transforms.Pad(padding = 4, fill = 0, padding_mode = "constant"),
transforms.RandomHorizontalFlip(p=0.5),
transforms.RandomCrop(size = (32, 32)),
transforms.ToTensor(),
transforms.Normalize(mean = (0.0, 0.0, 0.0), std = (1.0, 1.0, 1.0))]
)
cifar10_full = datasets.CIFAR10(root = "CIFAR-10", train = True, transform = trafo, target_transform = None, download = True)
The normalization I chose so far would do nothing to the tensors, since I set the mean and std to 0 and 1 respectively. According to the documentation of torchvision.transforms.Normalize, the provided means and standard deviations are per channel of the input. However, the problem is that I cannot calculate the mean across each channel because of the random flipping and cropping. Therefore, my idea was something along the following lines:
trafo_1 = transforms.Compose([transforms.Pad(padding = 4, fill = 0, padding_mode = "constant"),
transforms.RandomHorizontalFlip(p=0.5),
transforms.RandomCrop(size = (32, 32)),
transforms.ToTensor()]
)
cifar10_full = datasets.CIFAR10(root = "CIFAR-10", train = True, transform = trafo_1, target_transform = None, download = True)
Now I could calculate the mean across each channel of the input and then normalize the tensors. However, I cannot simply use transforms.Normalize(), as cifar10_full is not the original dataset anymore. How could I proceed instead? (One solution would be to simply fix the seed of the random generators, i.e. use torch.manual_seed(0), but I would like to avoid this for now...)
The mean and std are not per tensor, but computed over the whole dataset. What you are trying to do doesn't really matter: you just want a scale that is good enough for the whole data representation. There is no exact mean or std you will get, since these are all random operations, so just use the mean and std of the actual (un-augmented) data, which is pretty much the standard.
First, try to calculate the mean and std of the dataset (try random sampling), and use that for normalization.
# Calculate the mean, std of the complete dataset
import glob
import cv2
import numpy as np
import tqdm
import random

# calculating 3 channel mean and std for image dataset
# note: cv2.imread returns channels in BGR order
means = np.array([0, 0, 0], dtype=np.float32)
stds = np.array([0, 0, 0], dtype=np.float32)
total_images = 0
randomly_sample = 5000
for f in tqdm.tqdm(random.sample(glob.glob("dataset_path/**/*.jpg", recursive = True), randomly_sample)):
    img = cv2.imread(f)
    means += img.mean(axis=(0, 1))
    stds += img.std(axis=(0, 1))
    total_images += 1

means = means / (total_images * 255.)
stds = stds / (total_images * 255.)
print("Total images: ", total_images)
print("Means: ", means)
print("Stds: ", stds)
Consider a simple scenario: do you think that in actual testing or inference your images will be augmented this way too? Probably not; you will have clean images that closely match the mean and std of the clean version of the data. So it's not useful to calculate the mean and std over augmented images (a few random samples are enough), unless you want to apply TTA.
If you do want to apply TTA, then you can go ahead and run some augmentation on the images, do random sampling, and take the mean and std of those images.
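As a minimal sketch of how the computed statistics could then be plugged back in (the mean/std values below are the commonly cited CIFAR-10 statistics and are only placeholders; substitute the ones you compute):
from torchvision import transforms, datasets
# illustrative per-channel statistics; replace with the values you computed
cifar_mean = (0.4914, 0.4822, 0.4465)
cifar_std = (0.2470, 0.2435, 0.2616)
trafo_2 = transforms.Compose([transforms.Pad(padding = 4, fill = 0, padding_mode = "constant"),
transforms.RandomHorizontalFlip(p=0.5),
transforms.RandomCrop(size = (32, 32)),
transforms.ToTensor(),
transforms.Normalize(mean = cifar_mean, std = cifar_std)]
)
cifar10_full = datasets.CIFAR10(root = "CIFAR-10", train = True, transform = trafo_2, target_transform = None, download = True)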

Compare two segmentation map predictions

I am using consistency between two predicted segmentation maps on unlabeled data. For labeled data, I'm using nn.BCEWithLogitsLoss and Dice Loss.
I'm working on videos, which is why the output has 5 dimensions:
(batch_size, channels, frames, height, width)
I want to know how we can compare two predicted segmentation maps.
# gt_seg - Ground truth segmentation map. - (8, 1, 8, 112, 112)
# aug_gt_seg - Augmented ground truth segmentation map - (8, 1, 8, 112, 112)
predicted_seg_1 = model(data, targets) # (8, 1, 8, 112, 112)
predicted_seg_2 = model(augmented_data, augmented_targets) # (8, 1, 8, 112, 112)

# define criteria (DiceLoss is assumed to come from a custom/third-party implementation; torch.nn has no built-in DiceLoss)
seg_criterion_1 = nn.BCEWithLogitsLoss(size_average=True)
seg_criterion_2 = nn.DiceLoss()

# labeled losses
supervised_loss_1 = seg_criterion_1(predicted_seg_1, gt_seg)
supervised_loss_2 = seg_criterion_2(predicted_seg_1, gt_seg)

# consistency loss between the two predictions
if consistency_loss == "l2":
    consistency_criterion = nn.MSELoss()
    cons_loss = consistency_criterion(predicted_seg_1, predicted_seg_2)
elif consistency_loss == "l1":
    consistency_criterion = nn.L1Loss()
    cons_loss = consistency_criterion(predicted_seg_1, predicted_seg_2)

total_supervised_loss = supervised_loss_1 + supervised_loss_2
total_consistency_loss = cons_loss
Is this the right way to apply consistency between two predicted segmentation maps?
I'm mainly confused by the definition on the PyTorch website: it describes a comparison of an input x with a target y. I thought my usage looks correct, since I want both predicted segmentation maps to be similar, but the second segmentation map is not a target, which is why I'm confused. If this were valid, then every loss function could be applied in one way or another, which doesn't look right to me. If it is the correct way to compare them, can it be extended to other segmentation-based losses such as Dice Loss, IoU Loss, etc.?
One more query regarding loss computation on labeled data:
# gt_seg - Ground truth segmentation map
# aug_gt_seg - Augmented ground truth segmentation map
predicted_seg_1 = model(data, targets)
predicted_seg_2 = model(augmented_data, augmented_targets)
# define criterion
seg_criterion_1 = nn.BCEWithLogitsLoss(size_average=True)
seg_criterion_2 = nn.DiceLoss()
# labeled losses
supervised_loss_1 = seg_criterion_1(predicted_seg_1, gt_seg)
supervised_loss_2 = seg_criterion_2(predicted_seg_1, gt_seg)
# augmented labeled losses
aug_supervised_loss_1 = seg_criterion_1(predicted_seg_2, aug_gt_seg)
aug_supervised_loss_2 = seg_criterion_2(predicted_seg_2, aug_gt_seg)
total_supervised_loss = supervised_loss_1 + supervised_loss_2 + aug_supervised_loss_1 + aug_supervised_loss_2
Is the calculation of total_supervised_loss correct? Can I apply loss.backward() on this?
Yes, this is a valid way to implement consistency loss. The nomenclature used by pytorch documentation lists one input as the target and the other as the prediction, but consider that L1, L2, Dice, and IOU loss are all symmetrical (that is, Loss(a,b) = Loss(b,a)). So any of these functions will accomplish a form of consistency loss with no regard for whether one input is actually a ground-truth or "target".
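As a quick sanity check of that symmetry (a minimal sketch; the tensors are random and only for illustration):
import torch
import torch.nn as nn

a = torch.rand(8, 1, 8, 112, 112)
b = torch.rand(8, 1, 8, 112, 112)
# the loss value is identical regardless of which prediction is passed as the "target"
print(torch.allclose(nn.MSELoss()(a, b), nn.MSELoss()(b, a)))  # True
print(torch.allclose(nn.L1Loss()(a, b), nn.L1Loss()(b, a)))    # True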

Why can't my LSTM autoencoder model detect outliers?

I am trying to build an LSTM autoencoder for anomaly detection,
but the model does not seem to work for my data.
Here is the normal data that I use for training.
And here is the abnormal data that I use for validation.
If the model worked, its loss should be high around samples #200000–#500000.
Unfortunately, here is the result when I feed the validation data to the model:
In the abnormal interval, the loss is still low.
Here is my code for training the model.
I would greatly appreciate any suggestions.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras import optimizers
from keras.callbacks import EarlyStopping

scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(healthy_data)
data_scaled = scaler.transform(healthy_data)
data_broken_scaled = scaler.transform(broken_data)

timesteps = 32
data = data_scaled
dim = 1
data.shape = (-1, timesteps, dim)  # in-place reshape to (samples, timesteps, features)

lr = 0.0001
Nadam = optimizers.Nadam(lr=lr)

model = Sequential()
model.add(LSTM(50, input_shape=(timesteps, dim), return_sequences=True))
model.add(Dense(dim))
model.compile(loss='mae', optimizer=Nadam, metrics=['mse'])

EStop = EarlyStopping(monitor='val_loss', min_delta=0.001, patience=150, verbose=2, mode='auto', restore_best_weights=True)
history = model.fit(data, data, validation_data=(data, data), epochs=3000, batch_size=72, verbose=2, shuffle=False, callbacks=[EStop]).history

# data_broken_scaled is expected to already have shape (samples, timesteps, dim) here
pred_broken = model.predict(data_broken_scaled)
loss_broken = np.mean(np.abs(pred_broken - data_broken_scaled), axis=1)

fig, ax = plt.subplots(figsize=(20, 6), dpi=80, facecolor='w', edgecolor='k')
ax.plot(range(0, len(loss_broken)), loss_broken, '-', color='red', animated=True, linewidth=1)
I think it may be better to use a Fourier transform to detect anomalies in the frequency domain.
That is, convert the train and test data to the frequency domain via a Fourier transform. I also recommend using time windows.
Most models perform better when they have enough good training data. Your anomaly points should be labeled as 1 (in a preprocessing step), and they should show up as high-frequency content within your time windows.
In short, I think your training data may not be sufficiently representative of the anomalies yet.
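A minimal sketch of what I mean by combining time windows with a Fourier transform (NumPy only; signal and the window length are placeholders for your own data):
import numpy as np

def windowed_fft_features(signal, window=32):
    # split the signal into non-overlapping windows and keep the FFT magnitudes
    n = (len(signal) // window) * window
    frames = np.asarray(signal[:n]).reshape(-1, window)
    return np.abs(np.fft.rfft(frames, axis=1))

# high-frequency energy per window; anomalous (noisy) windows tend to stand out here
features = windowed_fft_features(signal, window=32)
high_freq_energy = features[:, features.shape[1] // 2:].sum(axis=1)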

How to work with the CatBoost overfitting detector

I am trying to understand the catboost overfitting detector. It is described here:
https://tech.yandex.com/catboost/doc/dg/concepts/overfitting-detector-docpage/#overfitting-detector
Other gradient boosting packages like lightgbm and xgboost use a parameter called early_stopping_rounds, which is easy to understand (it stops the training once the validation error hasn't decreased for early_stopping_rounds steps).
However I have a hard time understanding the p_value approach used by catboost. Can anyone explain how this overfitting detector works and when it stops the training?
It's not documented on the Yandex website or at the github repository, but if you look carefully through the python code posted to github (specifically here), you will see that the overfitting detector is activated by setting "od_type" in the parameters. Reviewing the recent commits on github, the catboost developers also recently implemented a tool similar to the "early_stopping_rounds" parameter used by lightGBM and xgboost, called "Iter."
To set the number of rounds after the most recent best iteration to wait before stopping, provide a numeric value in the "od_wait" parameter.
For example:
fit_param <- list(
iterations = 500,
thread_count = 10,
loss_function = "Logloss",
depth = 6,
learning_rate = 0.03,
od_type = "Iter",
od_wait = 100
)
I am using the catboost library with R 3.4.1. I have found that setting the "od_type" and "od_wait" parameters in the fit_param list works well for my purposes.
I realize this is not answering your question about the way to use the p_value approach also implemented by the catboost developers; unfortunately I cannot help you there. Hopefully someone else can explain that setting to the both of us.
CatBoost now supports early_stopping_rounds: see the fit method parameters.
Sets the overfitting detector type to Iter and stops the training
after the specified number of iterations since the iteration with the
optimal metric value.
This works very much like early_stopping_rounds in xgboost.
Here is an example:
from catboost import CatBoostRegressor, Pool
from sklearn.model_selection import train_test_split
import numpy as np
y = np.random.normal(0, 1, 1000)
X = np.random.normal(0, 1, (1000, 1))
X[:, 0] += y * 2
X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size=0.1)
train_pool = Pool(X_train, y_train)
eval_pool = Pool(X_eval, y_eval)
model = CatBoostRegressor(iterations=1000, learning_rate=0.1)
model.fit(train_pool, eval_set=eval_pool, early_stopping_rounds=10)
The result should be something like this:
522: learn: 0.3994718 test: 0.4294720 best: 0.4292901 (514) total: 957ms remaining: 873ms
523: learn: 0.3994580 test: 0.4294614 best: 0.4292901 (514) total: 958ms remaining: 870ms
524: learn: 0.3994495 test: 0.4294806 best: 0.4292901 (514) total: 959ms remaining: 867ms
Stopped by overfitting detector (10 iterations wait)
bestTest = 0.4292900745
bestIteration = 514
Shrink model to first 515 iterations.
early_stopping_rounds takes care of both the od_type='Iter' and od_wait parameters. There is no need to set od_type and od_wait individually; just set the early_stopping_rounds parameter.
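For comparison, a minimal sketch of the equivalent setup via the explicit detector parameters (same data as the example above; od_wait=10 mirrors early_stopping_rounds=10):
# equivalent configuration using the explicit overfitting-detector parameters
model_od = CatBoostRegressor(iterations=1000, learning_rate=0.1, od_type='Iter', od_wait=10)
model_od.fit(train_pool, eval_set=eval_pool)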