Active Learning in AllenNLP v2.0.1 - allennlp

I tried implementing an Active Learning procedure in AllenNLP v2.0.1. However, with the current GradientDescentTrainer implementation, I am unable to continue training on a new batch of Instances.
The model (also trained using AllenNLP) has finished training for the predefined number of epochs on an initial training dataset. I restore the model using the Model.from_archive method and a Trainer is instantiated for it using the Trainer.from_params static constructor.
Thereafter, when I attempt to continue training on a new batch of Instances by calling trainer.train(), it skips training because of the following loop in the _try_train method:
for epoch in range(epoch_counter, self._num_epochs):
This is because epoch_counter is restored to 5 from the earlier training on the initial dataset. This is the relevant code snippet:
def _try_train(self) -> Tuple[Dict[str, Any], int]:
    try:
        epoch_counter = self._restore_checkpoint()
self._num_epochs is also 5, which I assume is the num_epochs defined in my .jsonnet training configuration.
Simply put, my requirement is to load an AllenNLP model that has already been trained and continue training it on a new batch of Instances (a single Instance, actually, which I would load using a SimpleDataLoader).
I have also attached the configuration for the Trainer below. The model I am using is a custom wrapper around the BasicClassifier, solely for the purpose of logging additional metrics.
Thanks in advance.
"trainer": {
"num_epochs": 5,
"patience": 1, // for early stopping
"grad_norm": 5.0,
"validation_metric": "+accuracy",
"optimizer": {
"type": "adam",
"lr": 0.001
},
"callbacks": [
{
"type": "tensorboard"
}
]
}

A couple of pieces of advice (see the sketch below):
don't turn the recover flag on when you run the command line or through a script
specify a different serialization directory for your second training
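A minimal sketch of that setup, assuming AllenNLP 2.x; the archive name, the new_instances variable, and the hyperparameters are placeholders, not the asker's actual values. The key points are a fresh serialization directory and no recovery from the old checkpoint, so the epoch counter starts from 0:
import torch
from allennlp.models import Model
from allennlp.data.data_loaders import SimpleDataLoader
from allennlp.training import GradientDescentTrainer

# Restore the previously trained model from its archive
model = Model.from_archive("my_archive.tar.gz")

# Wrap the new Instance(s) in a SimpleDataLoader and index with the existing vocab
data_loader = SimpleDataLoader(new_instances, batch_size=1)
data_loader.index_with(model.vocab)

# Build a fresh trainer that writes to a NEW serialization directory,
# so the old checkpoint (and its epoch counter) is not picked up
trainer = GradientDescentTrainer(
    model=model,
    optimizer=torch.optim.Adam(model.parameters(), lr=1e-3),
    data_loader=data_loader,
    num_epochs=5,
    serialization_dir="serialization_round_2",
)
trainer.train()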

Related

stable-baselines3, gym, train while also step/predict

With stable-baselines3, given an agent, we can call "action = agent.predict(obs)". And then with Gym, this would be "new_obs, reward, done, info = env.step(action)" (more or less, maybe missing an input or an output).
We also have "agent.learn(10_000)" as an example, yet here we're less involved in the process and don't call the environment ourselves.
I'm looking for a way to train the agent while still calling "env.step". If you wonder why: I'm trying to implement self-play (the agent and a previous version of it) playing in one environment (for example, turn-based play such as chess).
WKR, Oren.
But why do you need it? If you take a look at the implementation of any learn method, you will see it is nothing more than an iteration over time steps calling collect_rollouts and train, with some additional logging and setup at the beginning (e.g., for saving the agent later). Your env.step is called inside collect_rollouts.
I'd suggest writing a callback based on CheckpointCallback, which saves your agent (model) every N training steps, and attaching this callback to your learn call. In your environment you could then, every N steps, instantiate a "new previous" version of your model by calling ModelClass.load(file) on the file saved by the callback, so that you can select the other player's actions for self-play inside your environment. A sketch is shown below.
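A minimal sketch of that idea; SelfPlayChessEnv is a hypothetical environment and the paths, frequencies, and PPO choice are assumptions for illustration:
import glob
import os
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import CheckpointCallback

env = SelfPlayChessEnv()                 # hypothetical turn-based self-play env
model = PPO("MlpPolicy", env, verbose=1)

# Save the agent every save_freq steps so the env can reload a "previous" version
checkpoint_cb = CheckpointCallback(
    save_freq=10_000,
    save_path="./opponent_checkpoints",
    name_prefix="selfplay",
)
model.learn(total_timesteps=1_000_000, callback=checkpoint_cb)

# Inside the environment (e.g., every N of its own steps), load the most recent
# checkpoint and use it to choose the opponent's action:
latest = max(glob.glob("./opponent_checkpoints/*.zip"), key=os.path.getmtime)
opponent = PPO.load(latest)
# opponent_action, _ = opponent.predict(obs, deterministic=True)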

A simple distributed training python program for deep learning models by Horovod on GPU cluster

I am trying to run some example Python 3 code
https://docs.databricks.com/applications/deep-learning/distributed-training/horovod-runner.html
on a Databricks GPU cluster (1 driver and 2 workers).
Databricks environment:
ML 6.6, scala 2.11, Spark 2.4.5, GPU
It is for distributed deep learning model training.
I just tried a very simple example at first:
from sparkdl import HorovodRunner
import horovod.tensorflow as hvd  # needed for hvd.init() below

hr = HorovodRunner(np=2)

def train():
    print('in train')
    import tensorflow as tf
    print('after import tf')
    hvd.init()
    print('done')

hr.run(train)
But the command just keeps running without any progress. This is the output so far:
HorovodRunner will stream all training logs to notebook cell output. If there are too many logs, you can adjust the log level in your train method. Or you can set driver_log_verbosity to 'log_callback_only' and use a HorovodRunner log callback on the first worker to get concise progress updates.
The global names read or written to by the pickled function are {'print', 'hvd'}.
The pickled object size is 1444 bytes.
### How to enable Horovod Timeline? ###
HorovodRunner has the ability to record the timeline of its activity with Horovod Timeline. To record a Horovod Timeline, set the `HOROVOD_TIMELINE` environment variable to the location of the timeline file to be created. You can then open the timeline file using the chrome://tracing facility of the Chrome browser.
Am I missing something, or do I need to set something up to make it work?
Thanks
Your code does no actual training in it. You might have better luck with the fuller example code:
https://docs.databricks.com/applications/machine-learning/train-model/distributed-training/mnist-pytorch.html
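For reference, a minimal sketch (not the Databricks example itself) of a train function that does real work inside hr.run; the dataset, model, and hyperparameters here are illustrative assumptions:
from sparkdl import HorovodRunner

hr = HorovodRunner(np=2)

def train():
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()

    # Tiny illustrative dataset and model
    (x, y), _ = tf.keras.datasets.mnist.load_data()
    x = x.reshape(-1, 784).astype("float32") / 255.0

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

    # Wrap the optimizer so gradients are averaged across workers
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))
    model.compile(optimizer=opt,
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # Keep initial weights in sync across workers; log only on rank 0
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
    model.fit(x, y, batch_size=64, epochs=1, callbacks=callbacks,
              verbose=1 if hvd.rank() == 0 else 0)

hr.run(train)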

ClientError: An error occurred when calling the CreateModel operation

I want to deploy a sklearn model in SageMaker. I created a training script.
script_path = 'sklearn.py'
sklearn = SKLearn(entry_point=script_path,
                  train_instance_type='ml.m5.xlarge',
                  role=role,
                  output_path='s3://{}/{}/output'.format(bucket, prefix),
                  sagemaker_session=session)
sklearn.fit({'train-dir': train_input})
When I deploy it
predictor = sklearn.deploy(initial_instance_count=1, instance_type='ml.m5.xlarge')
It throws,
ClientError: An error occurred when calling the CreateModel operation: Could not find model data at s3://tree/sklearn/output/model.tar.gz
Can anyone say how to solve this issue?
When deploying models, SageMaker looks up S3 to find your trained model artifact. It seems that there is no trained model artifact at s3://tree/sklearn/output/model.tar.gz. Make sure to persist your model artifact in your training script at the appropriate local location inside the container, which is /opt/ml/model.
For example, in your training script this could look like:
joblib.dump(model, '/opt/ml/model/mymodel.joblib')
After training, SageMaker will copy the content of /opt/ml/model to S3 at the output_path location.
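A minimal sketch of how the end of such a training script might look; the model, data, and file name are illustrative assumptions, and the script reads the model directory from the SM_MODEL_DIR environment variable (which defaults to /opt/ml/model inside the training container):
import argparse
import os
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # SageMaker script mode exposes the model directory via SM_MODEL_DIR
    parser.add_argument('--model-dir',
                        default=os.environ.get('SM_MODEL_DIR', '/opt/ml/model'))
    args, _ = parser.parse_known_args()

    # Placeholder training on dummy data; replace with your real training code
    X = np.random.rand(20, 3)
    y = np.random.randint(0, 2, 20)
    model = LogisticRegression().fit(X, y)

    # Persist the artifact where SageMaker expects it
    joblib.dump(model, os.path.join(args.model_dir, 'mymodel.joblib'))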
If you deploy in the same session, a .deploy() call on the estimator will map automatically to the artifact path. If you want to deploy a model that you trained elsewhere, possibly during a different session or on different hardware, you need to explicitly instantiate a model before deploying:
from sagemaker.sklearn.model import SKLearnModel

model = SKLearnModel(
    model_data='s3://...model.tar.gz',  # your artifact
    role=get_execution_role(),
    entry_point='script.py')  # script containing inference functions

model.deploy(
    instance_type='ml.m5.xlarge',
    initial_instance_count=1,
    endpoint_name='your_endpoint_name')
See more about Sklearn in SageMaker here https://sagemaker.readthedocs.io/en/stable/using_sklearn.html

Get H2OFrame as object instead of getting reference to a location in H2O cluster

We created and trained a model using H2O libraries, configured H2O in an OpenShift container, and deployed the trained model for real-time inference. It worked well when we had one container. We have to scale up to handle the increase in transaction volume, and encountered an issue with the stateful nature of the H2OFrame. Please see my sample code.
Step-1: Converts the JSON dictionary in to Pandas frame.
Step-2: Converts the Pandas frame in to H2O frame.
Step-3: Run the model with H2O frame as input.
Here Step-2 returns a handle to the data stored in the container. "H2OFrame is similar to pandas’ DataFrame, or R’s data.frame. One of the critical distinction is that the data is generally not held in memory, instead it is located on a (possibly remote) H2O cluster, and thus H2OFrame represents a mere handle to that data." So Step-3's request must go to the same container; if not, it cannot find the H2OFrame and throws an error.
# Step-1: convert the JSON dictionary to a Pandas data frame
ToBeScored = pd.DataFrame([jsonDictionary])
# Step-2: convert the Pandas data frame to an H2O frame
ToBeScored_hex = h2o.H2OFrame(ToBeScored)
# Step-3: run the model
outPredections = rf_model.predict(ToBeScored_hex)
If the H2OFrame could be returned as an in-memory object in Step-2, then the stateful nature could be avoided. Is there any way to do that?
Or, can the H2O cluster be configured to store the H2OFrame in such a way that it is accessible from any OpenShift container in the cluster?
Useful links: H2O's predict function accepts data only in H2OFrame format.
Predict function - http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/model_categories.html#h2o.model.model_base.ModelBase.predict
H2O frame data type - http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/frame.html
Updated on 6/19/2019, a continuation question following @ErinLeDell's clarification:
We have upgraded to H2O 3.24 and used a MOJO model. We removed Step 2 and replaced Step 3 with this function call.
import h2o as h
result = h.mojo_predict_csv(input_csv_path="PredictionDataRow.csv",
                            mojo_zip_path="rf_model.zip",
                            genmodel_jar_path="h2o-genmodel.jar",
                            java_options='-Xmx512m -XX:ReservedCodeCacheSize=256m',
                            verbose=True)
Internally it executed the command below, which initialized a new JVM and started an H2O local server for every call. The H2O local server is initialized to find the path to java.
java = H2OLocalServer._find_java()  # finds the java path, then builds the command line below
C:\Program Files (x86)\Common Files\Oracle\Java\javapath\java.exe -Xmx512m -XX:ReservedCodeCacheSize=256m -cp h2o-genmodel.jar hex.genmodel.tools.PredictCsv --mojo C:\Users\admin\Documents\Code\python\rf_model.zip --input PredictionDataRow.csv --output C:\Users\admin\Documents\Code\python\prediction.csv --decimal
Question-1: Is there any way to use an existing JVM rather than spawning a new one for every transaction?
Question-2: Is there a way to pass the java path to avoid the H2O local server initialization? Is H2OLocalServer required for anything other than finding the java path? If it cannot be avoided, is it possible to initialize the local server once and direct new requests to the existing H2O local server instead of starting a new one?
An alternative is to use an H2O MOJO model (instead of a binary model, which needs to exist in H2O cluster memory to make predictions). MOJO models can sit on disk and do not require a running H2O cluster. Then you can skip Step 2 and use the h2o.mojo_predict_pandas() function in Step 3, as sketched below.
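A minimal sketch of the MOJO-based flow; the file names follow the question and the jsonDictionary contents are illustrative placeholders. No H2O cluster or H2OFrame is needed, only the MOJO zip and h2o-genmodel.jar on disk:
import pandas as pd
import h2o

# Step-1 as before: build a one-row Pandas frame from the incoming JSON
jsonDictionary = {"feature1": 1.0, "feature2": 2.0}  # placeholder payload
ToBeScored = pd.DataFrame([jsonDictionary])

# Replaces Steps 2 and 3: score directly against the MOJO on disk
result = h2o.mojo_predict_pandas(
    dataframe=ToBeScored,
    mojo_zip_path="rf_model.zip",
    genmodel_jar_path="h2o-genmodel.jar",
)
print(result)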

Creating a serving graph separately from training in tensorflow for Google CloudML deployment?

I am trying to deploy a tf.keras image classification model to Google Cloud ML Engine. Do I have to include code to create a serving graph separately from training to get it to serve my model in a web app? I already have my model in SavedModel format (saved_model.pb & variables files), so I'm not sure if I need this extra step to get it to work.
For example, this is code directly from the GCP TensorFlow deploying-models documentation:
def json_serving_input_fn():
    """Build the serving inputs."""
    inputs = {}
    for feat in INPUT_COLUMNS:
        inputs[feat.name] = tf.placeholder(shape=[None], dtype=feat.dtype)
    return tf.estimator.export.ServingInputReceiver(inputs, inputs)
You are probably training your model with actual image files, while it is best to send images as encoded byte strings to a model hosted on CloudML. Therefore you'll need to specify a ServingInputReceiver function when exporting the model, as you mention. Some boilerplate code to do this for a Keras model:
# Convert Keras model to a TF Estimator
tf_files_path = './tf'
estimator = tf.keras.estimator.model_to_estimator(keras_model=model,
                                                  model_dir=tf_files_path)

# Your serving input function will accept a string
# and decode it into an image
def serving_input_receiver_fn():
    def prepare_image(image_str_tensor):
        image = tf.image.decode_png(image_str_tensor, channels=3)
        return image  # apply additional processing if necessary

    # Ensure model is batchable
    # https://stackoverflow.com/questions/52303403/
    input_ph = tf.placeholder(tf.string, shape=[None])
    images_tensor = tf.map_fn(
        prepare_image, input_ph, back_prop=False, dtype=tf.float32)
    return tf.estimator.export.ServingInputReceiver(
        {model.input_names[0]: images_tensor},
        {'image_bytes': input_ph})

# Export the estimator - deploy it to CloudML afterwards
export_path = './export'
estimator.export_savedmodel(
    export_path,
    serving_input_receiver_fn=serving_input_receiver_fn)
You can refer to this very helpful answer for a more complete reference and other options for exporting your model.
Edit: If this approach throws a ValueError: Couldn't find trained model at ./tf. error, you can try the workaround solution that I documented in this answer.