How to solve data type error using pickle and pandas udf for XGBoost model deployment python? - pickle

I have created a pandas udf() which splits dataset, fit XGBoost model, save it using pickle and returns a df with the saved model as a string column. The problem is when I call pandas udf(). It gives 'unsupported data type' error. But, when I run code without pandas udf() framework it runs successfully. Does anyone have any ideas on this?
#pandas_udf(schema,PandasUDFType.GROUPED_MAP)
def pickle_model(df1):
X = df1.iloc[:,1:50]
Y= df1.iloc[:,50]
seed = 7
test_size = 0.30
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
model = XGBClassifier(max_depth=3,...)
eval_set = [(X_train, y_train), (X_test, y_test)]
model.fit(X_train, y_train, early_stopping_rounds=8,...)
model_str = pickle.dumps(model)
model_saved = pd.DataFrame([model_str],columns = ['model_str'])
return model_saved
pickled_model = df2.groupby('id').apply(pickle_model)
pickled_model.collect()
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 691.0 failed 4 times, most recent failure: Lost task 0.3 in stage 691.0 (TID 38146, 10.139.64.9, executor 40): java.lang.UnsupportedOperationException: Unsupported data type: struct<type:tinyint,size:int,indices:array<int>,values:array<double>>

For dumping model pickle files use
model_str = str(pickle.dumps(model),'latin1')
And when loading back
use
pickle.loads(bytes(model_str,'latin1'))

Related

ValueError: setting an array element with a sequence. The requested array would exceed the maximum number of dimension of 1

The model is not learning .. value error coming if the learning command is executed
import jsbsim
import sys
import gymnasium as gym
sys.modules["gym"] = gym
import jsbgym
import os
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
env = gym.make("JSBSim-HeadingControlTask-Cessna172P-Shaping.STANDARD-FG-v0")
for episode in range(1,6):
obs = env.reset()
done = False
total_reward = 0
while not done:
obs, reward, done, _, info = env.step(env.action_space.sample())
env.render()
model = PPO('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=10000)
ERROR :
ValueError: setting an array element with a sequence. The requested array would exceed the maximum number of dimension of 1.

Building neural network using k-fold cross validation

I am new to deep learning, trying to implement a neural network using 4-fold cross-validation for training, testing, and validating. The topic is to classify the vehicle using an existing dataset.
The accuracy result is 0.7.
Traning Accuracy
An example output for epochs
I also don't know whether the code is correct and what to do for increasing the accuracy.
Here is the code:
!pip install category_encoders
import tensorflow as tf
from sklearn.model_selection import KFold
import pandas as pd
import numpy as np
from tensorflow import keras
import category_encoders as ce
from category_encoders import OrdinalEncoder
car_data = pd.read_csv('car_data.csv')
car_data.columns = ['Purchasing', 'Maintenance', 'No_Doors','Capacity','BootSize','Safety','Evaluation']
# Extract the features and labels from the dataset
X = car_data.drop(['Evaluation'], axis=1)
Y = car_data['Evaluation']
encoder = ce.OrdinalEncoder(cols=['Purchasing', 'Maintenance', 'No_Doors','Capacity','BootSize','Safety'])
X = encoder.fit_transform(X)
X = X.to_numpy()
Y_df = pd.DataFrame(Y, columns=['Evaluation'])
encoder = OrdinalEncoder(cols=['Evaluation'])
Y_encoded = encoder.fit_transform(Y_df)
Y = Y_encoded.to_numpy()
input_layer = tf.keras.layers.Input(shape=(X.shape[1]))
# Define the hidden layers
hidden_layer_1 = tf.keras.layers.Dense(units=64, activation='relu', kernel_initializer='glorot_uniform')(input_layer)
hidden_layer_2 = tf.keras.layers.Dense(units=32, activation='relu', kernel_initializer='glorot_uniform')(hidden_layer_1)
# Define the output layer
output_layer = tf.keras.layers.Dense(units=1, activation='sigmoid', kernel_initializer='glorot_uniform')(hidden_layer_2)
# Create the model
model = tf.keras.Model(inputs=input_layer, outputs=output_layer)
# Initialize the 4-fold cross-validation
kfold = KFold(n_splits=4, shuffle=True, random_state=42)
# Initialize a list to store the scores
scores = []
quality_weights= []
# Compile the model
model.compile(optimizer='adam',
loss=''sparse_categorical_crossentropy'',
metrics=['accuracy'],
sample_weight_mode='temporal')
for train_index, test_index in kfold.split(X,Y):
# Split the data into train and test sets
X_train, X_test = X[train_index], X[test_index]
Y_train, Y_test = Y[train_index], Y[test_index]
# Fit the model on the training data
model.fit(X_train, Y_train, epochs=300, batch_size=64, sample_weight=quality_weights)
# Evaluate the model on the test data
score = model.evaluate(X_test, Y_test)
# Append the score to the scores list
scores.append(score[1])
plt.plot(history.history['accuracy'])
plt.title('Model Training Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train'], loc='upper left')
plt.show()
# Print the mean and standard deviation of the scores
print(f'Mean accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}')
The first thing that caught my attention was here:
model.fit(X_train, Y_train, epochs=300, batch_size=64, sample_weight=quality_weights)
Your quality_weights should be a numpy array of size of the input.
Refer here: https://keras.io/api/models/model_training_apis/#fit-method
If changing that doesn't seemt to help then may be your network doesn't seem to be learning from the data. A few possible reasons could be:
The network is a bit too shallow. Try adding just one more hidden layer to see if that improves anything
From the code I can't see the size of your input data. Does it have enough datapoints for 4-fold cross-validation? Can you somehow augment the data?

tfa.optimizers.MultiOptimizer - TypeError: 'Not JSON Serializable:'

I'm trying to use tfa.optimizers.MultiOptimizer(). I did everything according to the docs (https://www.tensorflow.org/addons/api_docs/python/tfa/optimizers/MultiOptimizer) yet I'm getting the following error:
TypeError: ('Not JSON Serializable:', <tf.Tensor 'gradient_tape/model_80/dense_3/Tensordot/MatMul/MatMul:0' shape=(1, 1) dtype=float32>)
Below is a minimal, working example that reproduces the error, just copy and paste it. The error occurs when the first epoch is finished and the callback trys to save the model.
##############################################################################
import tensorflow as tf
import tensorflow_addons as tfa
import tensorflow.keras.layers as l
import tensorflow_addons.layers as la
import tensorflow.keras as ke
import numpy as np
##############################################################################
def build_model_1():
model_input = l.Input(shape=(32,1))
x = l.Dense(1)(model_input)
model = ke.Model(inputs=model_input, outputs=x)
##########
optimizers = [tf.keras.optimizers.Adam(),
tf.keras.optimizers.Adam()]
optimizers_and_layers = [(optimizers[0], model.layers[:5]), (optimizers[1], model.layers[5:])]
optimizer = tfa.optimizers.MultiOptimizer(optimizers_and_layers)
model.compile(optimizer=optimizer, loss='mse', metrics='mse')
test = tf.keras.optimizers.serialize(optimizer)
return model
##############################################################################
input_data = np.arange( 0, 10000, 1).reshape(10000,1)
target_data = np.arange(-10000, 0, 1).reshape(10000,1)
model = build_model_1()
model_checkpoint = ke.callbacks.ModelCheckpoint('best_model.h5',
monitor='val_mse',
mode='min',
save_best_only=True,
verbose=1)
training_history = model.fit(x = input_data,
y = target_data,
validation_split = 0.2,
epochs = 5,
verbose = 1,
callbacks = [model_checkpoint])
##############################################################################
When saving a complete Keras model (with its own structure in the .h5 file) the tf.keras.Model object is completely serialized as a JSON: this means that every property of the model should be JSON serializable.
NOTE: tf.Tensor are NOT JSON serializable.
When using this multi optimizer from tfa you're adding properties to the model that the JSON serializer will try (and fail) to serialize.
In particular there's this attribute gv that I think it comes from the custom optimizer used.
'gv': [(<tf.Tensor 'gradient_tape/model/dense/Tensordot/MatMul/MatMul:0' shape=(1, 1) dtype=float32>, <tf.Variable 'dense/kernel:0' shape=(1, 1) dtype=float32, numpy=array([[-0.55191684]], dtype=float32)>), (<tf.Tensor 'gradient_tape/model/dense/BiasAdd/BiasAddGrad:0' shape=(1,) dtype=float32>, <tf.Variable 'dense/bias:0' shape=(1,) dtype=float32, numpy=array([-0.23444518], dtype=float32)>)]},
All this tf.Tensor are not JSON serializable, that's why it fails.
The only option is to do not save the model completely (with all its attributes, which should be defined as Keras layers, but in this case is not possible) but saving only the model parameters.
In short, if you add the save_weights_only=True to the callback your training (and checkpoint of the weights) will work fine.
model_checkpoint = ke.callbacks.ModelCheckpoint(
"best_model.h5",
monitor="val_mse",
mode="min",
save_best_only=True,
verbose=1,
save_weights_only=True,
)

An error while using tune for hyper-parameters a deep learning model

I'm trying to use tune to tune my model in Pytorch (https://docs.ray.io/en/latest/tune.html), as a starting point; I just tried to run their small example:
import torch.optim as optim
from ray import tune
from ray.tune.examples.mnist_pytorch import get_data_loaders, ConvNet, train, test
def train_mnist(config):
train_loader, test_loader = get_data_loaders()
model = ConvNet()
optimizer = optim.SGD(model.parameters(), lr=config["lr"])
for i in range(10):
train(model, optimizer, train_loader)
acc = test(model, test_loader)
tune.track.log(mean_accuracy=acc)
analysis = tune.run(
train_mnist, config={"lr": tune.grid_search([0.001, 0.01, 0.1])})
print("Best config: ", analysis.get_best_config(metric="mean_accuracy"))
# Get a dataframe for analyzing trial results.
df = analysis.dataframe()
And I got this error:
I run it on a remote machine requesting 100 GB of memory. Any help with that please?

tf.data.Dataset: The `batch_size` argument must not be specified for the given input type

I'm using Talos and Google colab TPU to run hyperparameter tuning of a Keras model. Note that I'm using Tensorflow 1.15.0 and Keras 2.2.4-tf.
import os
import tensorflow as tf
import talos as ta
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from sklearn.model_selection import train_test_split
def iris_model(x_train, y_train, x_val, y_val, params):
# Specify a distributed strategy to use TPU
resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.contrib.distribute.initialize_tpu_system(resolver)
strategy = tf.contrib.distribute.TPUStrategy(resolver)
# Use the strategy to create and compile a Keras model
with strategy.scope():
model = Sequential()
model.add(Dense(32, input_shape=(4,), activation=tf.nn.relu, name="relu"))
model.add(Dense(3, activation=tf.nn.softmax, name="softmax"))
model.compile(optimizer=Adam(learning_rate=0.1), loss=params['losses'])
# Convert data type to use TPU
x_train = x_train.astype('float32')
x_val = x_val.astype('float32')
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
dataset = dataset.cache()
dataset = dataset.shuffle(1000, reshuffle_each_iteration=True).repeat()
dataset = dataset.batch(params['batch_size'], drop_remainder=True)
# Fit the Keras model on the dataset
out = model.fit(dataset, batch_size=params['batch_size'], epochs=params['epochs'], validation_data=[x_val, y_val], verbose=0, steps_per_epoch=2)
return out, model
# Load dataset
X, y = ta.templates.datasets.iris()
# Train and test set
x_train, x_val, y_train, y_val = train_test_split(X, y, test_size=0.30, shuffle=False)
# Create a hyperparameter distributions
p = {'losses': ['logcosh'], 'batch_size': [128, 256, 384, 512, 1024], 'epochs': [10, 20]}
# Use Talos to scan the best hyperparameters of the Keras model
scan_object = ta.Scan(x_train, y_train, params=p, model=iris_model, experiment_name='test', x_val=x_val, y_val=y_val, fraction_limit=0.1)
After converting the train set to a Dataset using tf.data.Dataset, I get the following error when fitting the model with out = model.fit:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-3-c812209b95d0> in <module>()
8
9 # Use Talos to scan the best hyperparameters of the Keras model
---> 10 scan_object = ta.Scan(x_train, y_train, params=p, model=iris_model, experiment_name='test', x_val=x_val, y_val=y_val, fraction_limit=0.1)
8 frames
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py in _validate_or_infer_batch_size(self, batch_size, steps, x)
1813 'The `batch_size` argument must not be specified for the given '
1814 'input type. Received input: {}, batch_size: {}'.format(
-> 1815 x, batch_size))
1816 return
1817
ValueError: The `batch_size` argument must not be specified for the given input type. Received input: <DatasetV1Adapter shapes: ((512, 4), (512, 3)), types: (tf.float32, tf.float32)>, batch_size: 512
Then, if I follow those instructions and don't set the batch-size argument to model.fit. I get another error in:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-3-c812209b95d0> in <module>()
8
9 # Use Talos to scan the best hyperparameters of the Keras model
---> 10 scan_object = ta.Scan(x_train, y_train, params=p, model=iris_model, experiment_name='test', x_val=x_val, y_val=y_val, fraction_limit=0.1)
8 frames
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py in _distribution_standardize_user_data(self, x, y, sample_weight, class_weight, batch_size, validation_split, shuffle, epochs, allow_partial_batch)
2307 strategy) and not drop_remainder:
2308 dataset_size = first_x_value.shape[0]
-> 2309 if dataset_size % batch_size == 0:
2310 drop_remainder = True
2311
TypeError: unsupported operand type(s) for %: 'int' and 'NoneType'
There seems to be an issue on keras distributed code.
If you take a look at
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-3-c812209b95d0> in <module>()
8
9 # Use Talos to scan the best hyperparameters of the Keras model
---> 10 scan_object = ta.Scan(x_train, y_train, params=p, model=iris_model, experiment_name='test', x_val=x_val, y_val=y_val, fraction_limit=0.1)
8 frames
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py in _distribution_standardize_user_data(self, x, y, sample_weight, class_weight, batch_size, validation_split, shuffle, epochs, allow_partial_batch)
2307 strategy) and not drop_remainder:
2308 dataset_size = first_x_value.shape[0]
-> 2309 if dataset_size % batch_size == 0:
2310 drop_remainder = True
2311
TypeError: unsupported operand type(s) for %: 'int' and 'NoneType'
you can see that the error is thrown at operation "dataset_size % batch_size" and it states "unsupported operand type(s) for %: 'int' and 'NoneType'". This means that at that point the batch_size variable should have already been inferred from the Dataset object but it is still 'None'
If you take a look at the source code (you can access it from collab by clicking on the path), you will see that in the fit function
def fit(self,
model,
x=None,
y=None,
batch_size=None,
epochs=1,
verbose=1,
callbacks=None,
validation_split=0.,
validation_data=None,
shuffle=True,
class_weight=None,
sample_weight=None,
initial_epoch=0,
steps_per_epoch=None,
validation_steps=None,
validation_freq=1,
**kwargs):
"""Fit loop for Distribution Strategies."""
dist_utils.validate_callbacks(input_callbacks=callbacks,
optimizer=model.optimizer)
dist_utils.validate_inputs(x, y)
batch_size, steps_per_epoch = dist_utils.process_batch_and_step_size(
model._distribution_strategy,
x,
batch_size,
steps_per_epoch,
ModeKeys.TRAIN,
validation_split=validation_split)
batch_size = model._validate_or_infer_batch_size(
batch_size, steps_per_epoch, x)
dataset = model._distribution_standardize_user_data(
there is a step
batch_size = model._validate_or_infer_batch_size(
batch_size, steps_per_epoch, x)
in which the batch_size should change from 'None' (default value when not specified) to the one inferred from the Dataset object (but it doesn't, I checked by printing the variable). I think this might be related to the fact that your batch_size is in fact a list of batch_sizes. If you change the source code (you can directly edit it from collab and then click on restart runtime in order to try) to this:
batch_size, steps_per_epoch = dist_utils.process_batch_and_step_size(
model._distribution_strategy,
x,
batch_size,
steps_per_epoch,
ModeKeys.TRAIN,
validation_split=validation_split)
batch_size = model._validate_or_infer_batch_size(
batch_size, steps_per_epoch, x)
batch_size = 128
dataset = model._distribution_standardize_user_data(
(see that I manually inserted the batch_size in the source code after the point at which it should have been inferred) the program runs with no error.
Maybe the fact of trying different batch_sizes for hyparameter tunning is a feature that is just not feasible with this current versions. I tried tf 2.1 and did not work either.