PyTorch CIFAR-10 vs Kaggle CIFAR-10: totally different results for exactly the same architecture on CIFAR-10 - deep-learning

I have been learning PyTorch for a few weeks. While practicing with the CIFAR-10 dataset from the PyTorch datasets, I also wanted to practice with the ImageFolder class, so I found a version of CIFAR-10 on Kaggle where the images are organized into folders. (If you don't remember, the PyTorch dataset comes as a tar.gz archive, not in a folder structure.)
To my utter surprise, despite using the same loss function, learning rate and architecture, the Kaggle dataset's test accuracy starts at 0.18 while the PyTorch dataset's accuracy starts at 0.56 at epoch 1.
Finally, after 20 epochs, one saturates near 0.45 and the other settles near 0.86.
I have checked again and again, but I can't find any significant difference between the two codes.
I really want to know whether I have done something badly wrong, or whether there is something fundamentally different about the two datasets.
To clarify, I am using this PyTorch dataset and this Kaggle dataset.
The codes are too large to include here, so I am providing links to my notebooks; you are welcome to look at the whole code and run it if necessary. [You only need your Kaggle API key to download the dataset from Kaggle; I can't make mine public... sorry for the inconvenience.]
Kaggle Dataset Notebook here and PyTorch Dataset Notebook here
I am also providing the chunk of code that I think differs the most.
Kaggle Dataset:
Epoch 1 score = 0.18
Epoch 20 score = 0.45
from torch.utils.data import DataLoader

def createVal(train_list, root_folder, classes, valid_split):
    try:
        os.mkdir(os.path.join(root_folder, 'val'))
    except FileExistsError:
        pass
    for cls in classes:
        try:
            os.mkdir(os.path.join(root_folder, 'val', cls))
        except FileExistsError:
            pass
    np.random.shuffle(train_list)
    valid_len = len(train_list) * valid_split
    for i in tqdm(range(int(valid_len))):
        shutil.move(train_list[i], train_list[i].replace('/train/', '/val/'))
valid_split = 0.2
batch_size = 32
num_workers = 4

root_folder = "/content/cifar10/cifar10"
train_folder = os.path.join(root_folder, "train")
test_folder = os.path.join(root_folder, "test")

if valid_split:
    createVal(train_list, root_folder, classes, valid_split=valid_split)
    val_folder = os.path.join(root_folder, "val")
    val_data = datasets.ImageFolder(val_folder, transform=transform)
    val_loader = DataLoader(val_data, batch_size=batch_size, num_workers=num_workers)

train_data = datasets.ImageFolder(train_folder, transform=transform)
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size, num_workers=num_workers)
test_data = datasets.ImageFolder(test_folder, transform=transform)
test_loader = DataLoader(test_data, batch_size=batch_size, num_workers=num_workers)
PyTorch Dataset:
Epoch 1 score = 0.56
Epoch 20 score = 0.86
valid_split = 0.2
batch_size = 32
num_workers = 4

if valid_split:
    num_train = len(train_data)
    idx = list(range(num_train))
    np.random.shuffle(idx)
    train_idx = idx[int(valid_split * num_train):]
    val_idx = idx[:int(valid_split * num_train)]
    train_sampler = SubsetRandomSampler(train_idx)
    val_sampler = SubsetRandomSampler(val_idx)
    train_loader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size, num_workers=num_workers)
    val_loader = DataLoader(train_data, sampler=val_sampler, batch_size=batch_size, num_workers=num_workers)
else:
    train_loader = DataLoader(train_data, batch_size=batch_size, num_workers=num_workers)
test_loader = DataLoader(test_data, batch_size=batch_size, num_workers=num_workers)

I see a difference in how the training dataset is shuffled.
Kaggle dataset: train_loader uses shuffle=True.
PyTorch dataset: train_loader has no shuffle argument (it relies on the sampler instead).
When shuffle=True is set, the DataLoader uses a RandomSampler under the hood.
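For reference, a minimal sketch (mine, not from the notebooks) of what that means in PyTorch: the two loaders below draw samples the same way, since shuffle=True simply attaches a RandomSampler to the dataset.

from torch.utils.data import DataLoader, RandomSampler

# shuffle=True is shorthand for sampler=RandomSampler(dataset);
# both loaders visit the dataset in a freshly permuted order each epoch.
loader_a = DataLoader(train_data, shuffle=True, batch_size=32)
loader_b = DataLoader(train_data, sampler=RandomSampler(train_data), batch_size=32)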

Related

Multi-task problem - Print accuracy and MSE after every epoch

I am training a CNN on face images and I want it to perform classification and regression tasks at the same time. I figured out how to train the CNN as below:
resnet = tf.keras.applications.ResNet50(
    include_top=False,
    weights='imagenet',
    input_shape=(96, 96, 3),
    pooling="avg"
)
for layer in resnet.layers:
    layer.trainable = True

inputs = Input(shape=(96, 96, 3), name='main_input')
main_branch = resnet(inputs)
main_branch = Flatten()(main_branch)
expr_branch = Dense(8, activation='softmax', name='expr_output')(main_branch)
va_branch = Dense(2, name='va_output')(main_branch)

model = Model(inputs=inputs, outputs=[expr_branch, va_branch])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
              loss={'expr_output': 'sparse_categorical_crossentropy',
                    'va_output': 'mean_squared_error'})
I want to print the accuracy and the MSE metric for each task after every epoch. So far I have written the below:
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    model_path,
    save_weights_only=True,
    verbose=1
)

history = model.fit_generator(
    train_generator,
    epochs=2,
    steps_per_epoch=STEP_SIZE_TRAIN_resnet,
    validation_data=test_generator,
    validation_steps=STEP_SIZE_TEST_resnet,
    max_queue_size=1,
    shuffle=True,
    callbacks=[checkpoint],
    verbose=1
)
When I had only the classification task I would write
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    model_path,
    monitor='val_accuracy',
    save_best_only=True,
    mode='max',
    verbose=1
)
which printed val_accuracy at every epoch and saved the weights. How can I do the same (print MSE and accuracy and save the weights after every epoch) in a multi-task problem?
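A hedged sketch of one way this could be done (my suggestion, not code from the post): give compile a metric per named output, and Keras will then log and report each one after every epoch; the checkpoint can monitor any of the logged keys.

# Per-output metrics: Keras logs expr_output_accuracy and va_output_mse
# (plus their val_ counterparts) at the end of every epoch.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
    loss={'expr_output': 'sparse_categorical_crossentropy',
          'va_output': 'mean_squared_error'},
    metrics={'expr_output': 'accuracy', 'va_output': 'mse'}
)

# Save weights every epoch, or monitor one of the logged validation keys.
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    model_path,
    monitor='val_expr_output_accuracy',
    save_weights_only=True,
    save_best_only=True,
    mode='max',
    verbose=1
)

The exact key names that Keras logs can be confirmed from history.history.keys() after one epoch.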

Deep learning ResNet problem: calculate confusion matrix and other metrics

I am new to deep learning.
I am running code to train and test a model and to find its precision, recall, F1-score, support and confusion matrix.
Please look at the code and tell me whether I am computing the F1 score and the other metrics correctly; my accuracy is 0.97.
I am not sure whether:
I have taken the right predictions;
I have computed the right confusion matrix.
Please tell me whether the confusion matrix is OK or not.
input_shape = (128, 128, 3)
batch_size = 64
epochs = 10
epoch_list = list(range(1, epochs + 1))

# Paths to the training & testing sets.
train_dir = 'train'
test_dir = 'test'
train_dir_fake, test_dir_fake = os.path.join(train_dir, 'forged'), os.path.join(test_dir, 'forged')
train_dir_real, test_dir_real = os.path.join(train_dir, 'real'), os.path.join(test_dir, 'real')
train_fake_fnames, test_fake_fnames = os.listdir(train_dir_fake), os.listdir(test_dir_fake)
train_real_fnames, test_real_fnames = os.listdir(train_dir_real), os.listdir(test_dir_real)

# Training data generator.
train_datagen = ImageDataGenerator(rescale=1./255.)
# Testing data generator.
test_datagen = ImageDataGenerator(rescale=1./255.)

# Flow training images in batches of 64 using the train_datagen generator.
train_generator = train_datagen.flow_from_directory(train_dir,
                                                    target_size=(128, 128),
                                                    batch_size=batch_size,
                                                    shuffle=False,
                                                    class_mode='binary')
# Flow test images in batches of 64 using the test_datagen generator.
test_generator = test_datagen.flow_from_directory(test_dir,
                                                  target_size=(128, 128),
                                                  batch_size=batch_size,
                                                  shuffle=False,
                                                  class_mode='binary')

ResNet50V2_model = ResNet50V2(input_shape=input_shape, include_top=False, weights="imagenet", classes=2)
for i in range(50):
    l = ResNet50V2_model.get_layer(index=i)
    l.trainable = True

model = Sequential()
model.add(ResNet50V2_model)
model.add(GlobalAveragePooling2D())
model.add(Dense(units=1, activation='sigmoid'))

# Compiling the model.
model.compile(loss='binary_crossentropy',
              optimizer=optimizers.Adam(learning_rate=1e-6, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0),
              metrics=['accuracy'])

reduce = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=5, mode='auto')
early_stopping = EarlyStopping(monitor='val_loss', min_delta=1e-4, patience=5, verbose=0, mode='auto')

# Starting the training.
history = model.fit(train_generator, epochs=epochs, validation_data=test_generator)
# Storing the model.
network_name = "ResNet50V2"
try:
    os.mkdir("./Reference_Data")
    os.mkdir("./Reference_Data/Graphs")
    os.mkdir("./Reference_Data/Summary")
    os.mkdir("./Reference_Data/Model")
except OSError:
    pass
try:
    os.mkdir(os.path.join("./Reference_Data/Graphs", network_name))
except OSError:
    pass
!dir
acc = np.linspace(min(epoch_list), max(epoch_list), 200)
val_acc = np.linspace(min(epoch_list), max(epoch_list), 200)

# Define spline for training accuracy.
spl1 = make_interp_spline(epoch_list, history.history['accuracy'], k=3)
y_smooth1 = spl1(acc)
# Define spline for validation accuracy.
spl2 = make_interp_spline(epoch_list, history.history['val_accuracy'], k=3)
y_smooth2 = spl2(val_acc)

with open("./Reference_Data/Summary/" + network_name + "summary.txt", 'w+') as f:
    model.summary(print_fn=lambda x: f.write(x + '\n'))

# Saving the model for inference purposes.
model.save('./Reference_Data/Model/' + network_name + '/')
model.save('./Reference_Data/Model/' + network_name + '/' + network_name + '.h5')

test_generator.reset()
Y_pred = model.predict(test_generator)
classes = test_generator.classes[test_generator.index_array]
y_pred = np.argmax(Y_pred, axis=-1)
y_pred = y_pred.round()
sum(y_pred == classes) / 10000
pred = model.predict(test_generator, verbose=1)

def get_classification_report(
    model, data_dir, batch_size=64,
    steps=None, threshold=0.5, output_dict=False
):
    data = get_test_data_generator(data_dir, batch_size=batch_size)
    predictions = predict(model, data, steps, threshold)
    predictions = predictions.reshape((predictions.shape[0],))
    return classification_report(data.classes, predictions, output_dict=output_dict)

import sklearn.metrics as metrics
# y_pred = np.argmax(y_pred, axis=0)
# y_true = np.argmax(test_generator.classes, axis=0)
report = metrics.classification_report(true_classes, Y_pred.round(), target_names=class_labels, zero_division=0.0)
print(report)
              precision    recall  f1-score   support

      forged       0.40      0.40      0.40       773
        real       0.60      0.60      0.60      1172

    accuracy                           0.52      1945
   macro avg       0.50      0.50      0.50      1945
weighted avg       0.52      0.52      0.52      1945
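As a hedged aside (not part of the original post): with a single sigmoid unit, model.predict returns one probability per image, so argmax over that axis always returns 0; the usual approach would be to threshold the probabilities and compare them against the generator's labels, which only line up when the generator does not shuffle.

from sklearn.metrics import confusion_matrix, classification_report

# Assumes test_generator was created with shuffle=False so that
# test_generator.classes is in the same order as the predictions.
y_prob = model.predict(test_generator, verbose=1)   # shape (N, 1), probability of class 1
y_pred = (y_prob.ravel() > 0.5).astype(int)         # threshold the sigmoid output
y_true = test_generator.classes

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=list(test_generator.class_indices)))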

Finding the number of nodes and GPUs for DistributedDataParallel

I would like to know what numbers I should select for nodes and gpus.
I use a Tesla V100-SXM2 machine (8 boards).
I tried:
nodes=1, gpus=1 (only the first GPU works)
nodes=1, gpus=8 (it took a very long time and did not execute)
Did I get the parameters for nodes and gpus wrong, or is my code wrong? I would appreciate it if you could help me out. The code below is a simplified sample of DDP.
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-n', '--nodes', default=1, type=int, metavar='N')
    parser.add_argument('-g', '--gpus', default=1, type=int,
                        help='number of gpus per node')
    parser.add_argument('-nr', '--nr', default=0, type=int,
                        help='ranking within the nodes')
    parser.add_argument('--epochs', default=200, type=int, metavar='N',
                        help='number of total epochs to run')
    args = parser.parse_args()
    args.world_size = args.gpus * args.nodes
    os.environ['MASTER_ADDR'] = 'host1'
    os.environ['MASTER_PORT'] = '7777'
    mp.spawn(train, nprocs=args.gpus, args=(args,))

def train(gpu, args):
    rank = args.nr * args.gpus + gpu
    dist.init_process_group(
        backend='nccl',
        init_method='env://',
        world_size=args.world_size,
        rank=rank
    )
    torch.manual_seed(0)
    model = ConvNet()
    torch.cuda.set_device(gpu)
    model.cuda(gpu)
    batch_size = 100
    # define loss function (criterion) and optimizer
    criterion = nn.CrossEntropyLoss().cuda(gpu)
    optimizer = torch.optim.SGD(model.parameters(), 1e-4)
    # Wrapper around our model to handle parallel training
    model = nn.parallel.DistributedDataParallel(model, device_ids=[gpu])
    # Data loading code
    train_dataset = get_datasets()
    # Sampler that takes care of the distribution of the batches such that
    # the data is not repeated in the iteration and sampled accordingly
    train_sampler = torch.utils.data.distributed.DistributedSampler(
        train_dataset,
        num_replicas=args.world_size,
        rank=rank
    )
    # We pass in the train_sampler which can be used by the DataLoader
    train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                               batch_size=batch_size,
                                               shuffle=False,
                                               num_workers=0,
                                               pin_memory=True,
                                               sampler=train_sampler)
    start = datetime.now()
    total_step = len(train_loader)
    for epoch in range(args.epochs):
        for i, (images, labels) in enumerate(train_loader):
            images = images.cuda(non_blocking=True)
            labels = labels.cuda(non_blocking=True)
            # Forward pass
            outputs = model(images)
            loss = criterion(outputs, labels)
            # Backward and optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if (i + 1) % 100 == 0 and gpu == 0:
                print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(
                    epoch + 1,
                    args.epochs,
                    i + 1,
                    total_step,
                    loss.item())
                )
    if gpu == 0:
        print("Training complete")
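For what it's worth (a hedged usage sketch, not from the original post; the script filename is a placeholder), with the argument parser defined above a single node using all 8 GPUs would typically be launched as follows, with MASTER_ADDR pointing at the machine that hosts node rank 0 (e.g. localhost for a single node):

# hypothetical script name; one node, 8 GPUs per node, node rank 0
python ddp_train.py --nodes 1 --gpus 8 --nr 0 --epochs 200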

PyTorch: Network not learning at all + weights are too low

About the input (sorry for the bad formatting): for each pair of rows below, the first row holds the keys and the second row the values. 18~20_ride is the label and is not included in the input. Below is one example input, and the training set consists of 400,000 of these.
bus_route_id station_code latitude longitude 6~7_ride
0 4270000 344 33.48990 126.49373
7~8_ride 8~9_ride 9~10_ride 10~11_ride 11~12_ride 6~7_takeoff
0.0 1.0 2.0 5.0 2.0 6.0
7~8_takeoff 8~9_takeoff 9~10_takeoff 10~11_takeoff 11~12_takeoff
0.0 0.0 0.0 0.0 0.0
18~20_ride weekday dis_jejusi dis_seoquipo
0.0 6 2.954920 26.256744
Example weights, captured at the 4th epoch (after 20 epochs of training I got much smaller values, e.g. -7e-44 or 1e-55):
2.3937e-11, -2.6920e-12, -1.0445e-11, ..., -1.0754e-11, 1.1128e-11, -1.4814e-11
The model's prediction and target
#Target
[2.],
[0.],
[0.]
#Prediction
[1.4187],
[1.4187],
[1.4187]
MyDataset.py
from torch.utils.data import Dataset, DataLoader
import numpy as np
import pandas as pd
import torch
import os

class MyDataset(Dataset):
    def __init__(self, csv_filename):
        self.dataset = pd.read_csv(csv_filename, index_col=0)
        self.labels = self.dataset.pop("18~20_ride")
        self.dataset = self.dataset.values
        self.labels = np.reshape(self.labels.values, (-1, 1))

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        return self.dataset[idx], self.labels[idx]
Model
class Network(nn.Module):
    def __init__(self, input_num):
        super(Network, self).__init__()
        self.fc1 = nn.Sequential(
            nn.Linear(input_num, 64),
            nn.BatchNorm1d(64),
            GELU()
        )
        self.fc2 = nn.Sequential(
            nn.Linear(64, 64),
            nn.BatchNorm1d(64),
            GELU()
        )
        self.fc3 = nn.Sequential(
            nn.Linear(64, 64),
            nn.BatchNorm1d(64),
            GELU()
        )
        self.fc4 = nn.Sequential(
            nn.Linear(64, 64),
            nn.BatchNorm1d(64),
            GELU()
        )
        self.fc5 = nn.Sequential(
            nn.Linear(64, 64),
            nn.BatchNorm1d(64),
            GELU()
        )
        self.fc6 = nn.Sequential(
            nn.Linear(64, 64),
            nn.BatchNorm1d(64),
            GELU()
        )
        self.fc7 = nn.Sequential(
            nn.Linear(64, 64),
            nn.BatchNorm1d(64),
            GELU()
        )
        self.fc8 = nn.Sequential(
            nn.Linear(64, 64),
            nn.BatchNorm1d(64),
            GELU()
        )
        self.fc9 = nn.Linear(64, 1)
The training and validation
def train(model, device, train_loader, optimizer, loss_fn, log_interval, epoch):
    print("Training")
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.float().to(device), target.float().to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, (batch_idx + 1) * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))

def validate(model, device, loader, loss_fn):
    print("\nValidating")
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for batch_idx, (data, target) in enumerate(loader):
            data, target = data.float().to(device), target.float().to(device)
            output = model(data)
            test_loss += loss_fn(output, target).item()  # sum up batch loss
    test_loss /= len(loader)
    print('Validation average loss: {:.4f}\n'.format(test_loss))
    return test_loss
Entire process of training and validation
from MyDataset import MyDataset
from torch.utils.data import DataLoader
from torch.optim.lr_scheduler import StepLR
from datetime import datetime

train_dataset_path = "/content/drive/My Drive/root/bus/dataset/train.csv"
val_dataset_path = "/content/drive/My Drive/root/bus/dataset/val.csv"
model_base_path = "/content/drive/My Drive/root/bus/models/"
model_file = "/content/drive/My Drive/root/bus/models/checkpoints/1574427776.202017.pt"

"""
Training Config
"""
epochs = 10
batch_size = 32
learning_rate = 0.5
check_interval = 4
log_interval = int(40000 / batch_size)
gamma = 0.1
load_model = False
save_model = True
make_checkpoint = True
"""
End of config
"""

# Read the training and validation sets
train_set = MyDataset(train_dataset_path)
train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
val_set = MyDataset(val_dataset_path)
val_loader = DataLoader(val_set, batch_size=1)
print("Data READY")

device = torch.device("cuda")
net = Network(19).float().to(device)
if load_model:
    net.load_state_dict(torch.load(model_file))

loss_fn = torch.nn.MSELoss()
optimizer = optim.AdamW(net.parameters(), lr=learning_rate)

best_loss = float('inf')
isAbort = False
for epoch in range(1, epochs + 1):
    train(net, device, train_loader, optimizer, loss_fn, log_interval, epoch)
    val_loss = validate(net, device, val_loader, loss_fn)
    if epoch % check_interval == 0:
        if make_checkpoint:
            print("Saving new checkpoint")
            torch.save(net.state_dict(), model_base_path + "checkpoints/" + str(datetime.today().timestamp()) + ".pt")
    """
    if val_loss < best_loss and epoch % check_interval == 0:
        best_loss = val_loss
        if make_checkpoint:
            print("Saving new checkpoint")
            torch.save(net.state_dict(), model_base_path + "checkpoints/" + str(datetime.today().timestamp()) + ".pt")
    else:
        print("Model overfit detected. Aborting training")
        isAbort = True
        break
    """
if save_model and not isAbort:
    torch.save(net.state_dict(), model_base_path + "finals/" + str(datetime.today().timestamp()) + ".pt")
So I tried to train a fully connected model for a regression problem with Google Colab, but it did not train well; the loss did not decrease at all. So I dug in and found out that the weights were really small. Any idea why this is happening and how I could avoid it? Thank you.
I used MSE for the loss and the AdamW optimizer. Below are the things I have tried:
Tried other architectures (changing the number and size of layers, changing the activation function between ReLU and GELU), but the loss did not decrease
Tried changing the learning rate from 3e-1 to 1e-3, even tried 1
Tried other pre-processing (used day/month/year instead of weekday) for the data
Gave the label in the input data, but the loss did not decrease
Tried different batch sizes (4, 10, 32, 64)
Removed batch normalization
Tried other optimizers such as SGD and Adam
Trained for 20 epochs, but the loss did not decrease at all
The weights do change at loss.backward()
TL;DR: Invalid input data!! Check for NaN or NULL
Well, it has been some time since the question. I tried almost everything and thought maybe I had messed up the project setup, so I deleted the project and tried again: same result. Deleted it again and migrated to TF2: the same result! So I found out that there wasn't any problem with the setup and searched elsewhere. In the end I did find the reason. I had modified the input columns myself (to remove some highly correlated features), so they were not the originals, and during the modification I messed up some float values and the data ended up containing NaN values. So check whether your dataset contains invalid values.
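A minimal sketch of that sanity check (assuming, as in MyDataset above, that the data is loaded with pandas and all feature columns are numeric):

import numpy as np
import pandas as pd

df = pd.read_csv(train_dataset_path, index_col=0)

# Count missing values per column and verify that every entry is finite.
print(df.isna().sum())
assert np.isfinite(df.to_numpy(dtype=float)).all(), "dataset contains NaN/inf values"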

Getting polynomial regression to overfit with TensorFlow

The Sklearn documentation contains an example of a polynomial regression which beautifully illustrates the idea of overfitting (link).
The third plot shows a 15th order polynomial that overfits the simulated data. I replicated this model in TensorFlow, but I cannot get it to overfit.
Even when tuning the learning rate and the number of training epochs, I cannot get the model to overfit. What am I missing?
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

def true_fun(X):
    return np.cos(1.5 * np.pi * X)

# Generate dataset
n_samples = 30
np.random.seed(0)
x_train = np.sort(np.random.rand(n_samples))  # Draw from uniform distribution
y_train = true_fun(x_train) + np.random.randn(n_samples) * 0.1
x_test = np.linspace(0, 1, 100)
y_true = true_fun(x_test)

# Helper function
def run_dir(base_dir, dirname='run'):
    "Number log directories incrementally"
    import os
    import re
    pattern = re.compile(dirname + r'_(\d+)')
    try:
        previous_runs = os.listdir(base_dir)
    except FileNotFoundError:
        previous_runs = []
    run_number = 0
    for name in previous_runs:
        match = pattern.search(name)
        if match:
            number = int(match.group(1))
            if number > run_number:
                run_number = number
    run_number += 1
    logdir = os.path.join(base_dir, dirname + '_%02d' % run_number)
    return logdir
# Define the polynomial model
def model(X, w):
    """Polynomial model
    param X: data
    param w: coefficients of the polynomial regression
    returns: polynomial function Y(X, w)
    """
    terms = []
    for i in range(int(w.shape[0])):
        term = tf.multiply(w[i], tf.pow(X, i))
        terms.append(term)
    return tf.add_n(terms)

# Create the computation graph
order = 15
tf.reset_default_graph()
X = tf.placeholder("float")
Y = tf.placeholder("float")
w = tf.Variable([0.] * order, name="parameters")
lambda_reg = tf.placeholder('float', shape=[])
learning_rate_ph = tf.placeholder('float', shape=[])
y_model = model(X, w)
loss = tf.div(tf.reduce_mean(tf.square(Y - y_model)), 2)        # Squared error
loss_rg = tf.multiply(lambda_reg, tf.reduce_sum(tf.square(w)))  # L2 penalty
loss_total = tf.add(loss, loss_rg)
loss_hist1 = tf.summary.scalar('loss', loss)
loss_hist2 = tf.summary.scalar('loss_rg', loss_rg)
loss_hist3 = tf.summary.scalar('loss_total', loss_total)
summary = tf.summary.merge([loss_hist1, loss_hist2, loss_hist3])
train_op = tf.train.GradientDescentOptimizer(learning_rate_ph).minimize(loss_total)
init = tf.global_variables_initializer()
def train(sess, x_train, y_train, lambda_val=0, epochs=2000, learning_rate=0.01):
    feed_dict = {X: x_train, Y: y_train, lambda_reg: lambda_val, learning_rate_ph: learning_rate}
    logdir = run_dir("logs/polynomial_regression2/")
    writer = tf.summary.FileWriter(logdir)
    sess.run(init)
    for epoch in range(epochs):
        _, summary_str = sess.run([train_op, summary], feed_dict=feed_dict)
        writer.add_summary(summary_str, global_step=epoch)
    final_cost, final_cost_rg, w_learned = sess.run([loss, loss_rg, w], feed_dict=feed_dict)
    return final_cost, final_cost_rg, w_learned

def plot_test(w_learned, x_test, x_train, y_train):
    y_learned = calculate_y(x_test, w_learned)
    plt.scatter(x_train, y_train)
    plt.plot(x_test, y_true, label="true function")
    plt.plot(x_test, y_learned, 'r', label="learned function")
    # plt.title('$\lambda = {:03.2f}$'.format(lambda_values[i]))
    plt.ylabel('y')
    plt.xlabel('x')
    plt.legend()
    plt.show()

def calculate_y(x, w):
    y = 0
    for i in range(w.shape[0]):
        y += w[i] * np.power(x, i)
    return y

sess = tf.Session()
final_cost, final_cost_rg, w_learned = train(sess, x_train, y_train, lambda_val=0,
                                             learning_rate=0.3, epochs=2000)
sess.close()
plot_test(w_learned, x_test, x_train, y_train)
I had the same problem: when doing polynomial regression, I also couldn't overfit the data using gradient descent in TensorFlow.
Then I compared the coefficients (weights) of the model with those from sklearn's LinearRegression, and I found that when the polynomial degree is larger, the high-order coefficients are very small (e.g. 1e-4) while the low-order ones are relatively large (e.g. 0.1).
That means that when you use gradient descent to search for the best weights, the loss becomes extremely sensitive to changes in the high-order coefficients but not in the low-order ones.
And I guess the best (overfitting) coefficients are large for the low-order terms and tiny for the high-order terms. With a large learning rate it's impossible to find the right answer, and with a tiny learning rate you need a huge number of iterations.
This is especially apparent when you use gradient descent on a small dataset to try to overfit.
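A short sketch of the comparison described above (fitting the same 15th-order polynomial with sklearn's closed-form least squares and inspecting the coefficients); this is illustrative code, not part of the original posts:

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Closed-form least-squares fit on the same training data as above.
poly_model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
poly_model.fit(x_train.reshape(-1, 1), y_train)

# Inspect the learned coefficients; the very different magnitudes across
# orders are what makes plain gradient descent so hard to tune here.
print(poly_model.named_steps['linearregression'].coef_)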