Get H2OFrame as an in-memory object instead of a reference to a location in the H2O cluster - OpenShift

We created and trained a model using the H2O libraries, configured H2O in an OpenShift container, and deployed the trained model for real-time inference. It worked well while we had one container. Now we have to scale up to handle the increase in transaction volume, and we have run into an issue with the stateful nature of the H2OFrame. Please see my sample code below.
Step 1 converts the JSON dictionary into a pandas DataFrame.
Step 2 converts the pandas DataFrame into an H2OFrame.
Step 3 runs the model with the H2OFrame as input.
Here step 2 returns a handle to data stored in the container. From the H2O documentation: "H2OFrame is similar to pandas’ DataFrame, or R’s data.frame. One of the critical distinction is that the data is generally not held in memory, instead it is located on a (possibly remote) H2O cluster, and thus H2OFrame represents a mere handle to that data." So the request in step 3 must go to the same container; otherwise it cannot find the H2OFrame and throws an error.
# Step 1: convert the JSON dictionary to a pandas DataFrame
ToBeScored = pd.DataFrame([jsonDictionary])

# Step 2: convert the pandas DataFrame to an H2OFrame
ToBeScored_hex = h2o.H2OFrame(ToBeScored)

# Step 3: run the model
outPredections = rf_model.predict(ToBeScored_hex)
If the H2OFrame could be returned as an in-memory object in step 2, the stateful nature could be avoided. Is there any way to do this?
Or can H2O clustering be configured to store the H2OFrame in such a way that it is accessible from any OpenShift container in the cluster?
Useful links (H2O's predict function accepts data only in H2OFrame format):
Predict function - http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/model_categories.html#h2o.model.model_base.ModelBase.predict
H2O frame data type - http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/frame.html
Update (6/19/2019): continuation question following @ErinLeDell's clarification
We upgraded to H2O 3.24 and used a MOJO model. We removed step 2 and replaced step 3 with this function call:
import h2o as h
result = h.mojo_predict_csv(input_csv_path="PredictionDataRow.csv",
                            mojo_zip_path="rf_model.zip",
                            genmodel_jar_path="h2o-genmodel.jar",
                            java_options='-Xmx512m -XX:ReservedCodeCacheSize=256m',
                            verbose=True)
Internally it executed the command below, which initialized a new JVM and started an H2O local server for every call. The H2O local server is initialized only to find the path to java.
java = H2OLocalServer._find_java()  # finds the java path, then builds the command line below
C:\Program Files (x86)\Common Files\Oracle\Java\javapath\java.exe -Xmx512m -XX:ReservedCodeCacheSize=256m -cp h2o-genmodel.jar hex.genmodel.tools.PredictCsv --mojo C:\Users\admin\Documents\Code\python\rf_model.zip --input PredictionDataRow.csv --output C:\Users\admin\Documents\Code\python\prediction.csv --decimal
Question 1: Is there any way to use an existing JVM instead of spawning a new one for every transaction?
Question 2: Is there a way to pass the java path so that the H2O local server initialization can be avoided? Is H2OLocalServer required for anything other than finding the java path? If it cannot be avoided, is it possible to initialize the local server once and direct new requests to the existing H2O local server instead of starting a new one?

An alternative is to use an H2O MOJO model (instead of a binary model which needs to exist in H2O cluster memory to make predictions). MOJO models can sit on disk and do not require a running H2O cluster. Then you can skip Step 2 and use the h2o.mojo_predict_pandas() function in Step 3.
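For illustration, a minimal sketch of what step 3 could look like with h2o.mojo_predict_pandas(), assuming the rf_model.zip MOJO and h2o-genmodel.jar from the question are present on each container's local disk:

import pandas as pd
import h2o

# Step 1 stays the same: build a single-row pandas DataFrame from the JSON dictionary
ToBeScored = pd.DataFrame([jsonDictionary])

# Score directly against the MOJO on disk; no H2OFrame and no running H2O cluster
# is needed, so any container that holds these two files can serve the request.
outPredections = h2o.mojo_predict_pandas(
    dataframe=ToBeScored,
    mojo_zip_path="rf_model.zip",
    genmodel_jar_path="h2o-genmodel.jar",
)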

Related

How can I externalize IScheduledExecutorService to run tasks in an external Hazelcast cluster (Hazelcast 5.2) without using UserCodeDeployment?

I am working on externalizing our IScheduledExecutorService so I can run tasks on an external cluster. I am able to write a test and get the Runnable to actually run ONLY if I turn on user code deployment. If I change this task at all and run the tests again, I get the following in my external cluster member's logs:
java.lang.IllegalStateException: Class com.mycompany.task.ScheduledTask is already in local cache and has conflicting byte code representation
I want to be able to change the task and redeploy it, and have Hazelcast just handle it. I do this kind of thing with our external maps now; they can handle different versions of our objects using compact serialization.
Am I stuck using user code deployment for these functional objects? If I need to make a change to the task, I have to change the class name and redeploy to production. I'm hoping to get this task right the first time and never have to do that, but I have a way of handling it if I do.
The cluster is already running in production, and I'll have to add the following to each member:
HZ_USERCODEDEPLOYMENT_ENABLED=true
and the appropriate client code (listed below) to enable this.
What I've done...
Added the following to my local Docker file:
HZ_USERCODEDEPLOYMENT_ENABLED=true
and also, in the code that creates a Hazelcast client connecting to my external cluster, added:
ClientConfig clientConfig = new ClientConfig();
ClientUserCodeDeploymentConfig clientUserCodeDeploymentConfig = new ClientUserCodeDeploymentConfig();
clientUserCodeDeploymentConfig.addClass("com.mycompany.task.ScheduledTask");
clientUserCodeDeploymentConfig.setEnabled(true);
clientConfig.setUserCodeDeploymentConfig(clientUserCodeDeploymentConfig);
However, if I remove those two pieces, I get the following exception and a failing test; the cluster doesn't know about my class at all.
com.hazelcast.nio.serialization.HazelcastSerializationException: java.lang.ClassNotFoundException: com.mycompany.task.ScheduledTask
Side Note:
We are already using compact serialization for several maps, and when I try to configure this Runnable task via compact serialization I get the error below. I don't think that's the right approach either.
[Scheduler: myScheduledExecutorService][Partition: 121][Task: 7afe68d5-3185-475f-b375-5a82a7088de3] Exception occurred during run
java.lang.ClassCastException: class com.hazelcast.internal.serialization.impl.compact.DeserializedGenericRecord cannot be cast to class java.lang.Runnable (com.hazelcast.internal.serialization.impl.compact.DeserializedGenericRecord is in unnamed module of loader 'app'; java.lang.Runnable is in module java.base of loader 'bootstrap')
at com.hazelcast.scheduledexecutor.impl.ScheduledRunnableAdapter.call(ScheduledRunnableAdapter.java:49) ~[hazelcast-5.2.0.jar:5.2.0]
at com.hazelcast.scheduledexecutor.impl.TaskRunner.call(TaskRunner.java:78) ~[hazelcast-5.2.0.jar:5.2.0]
at com.hazelcast.internal.util.executor.CompletableFutureTask.run(CompletableFutureTask.java:64) ~[hazelcast-5.2.0.jar:5.2.0]

A simple distributed training Python program for deep learning models with Horovod on a GPU cluster

I am trying to run some example Python 3 code
https://docs.databricks.com/applications/deep-learning/distributed-training/horovod-runner.html
on a Databricks GPU cluster (with 1 driver and 2 workers).
Databricks environment:
ML 6.6, Scala 2.11, Spark 2.4.5, GPU
It is for distributed deep learning model training.
I just tried a very simple example at first:
from sparkdl import HorovodRunner

hr = HorovodRunner(np=2)

def train():
    print('in train')
    import tensorflow as tf
    print('after import tf')
    hvd.init()
    print('done')

hr.run(train)
But the command just keeps running without making any progress. The output is:
HorovodRunner will stream all training logs to notebook cell output. If there are too many logs, you can adjust the log level in your train method. Or you can set driver_log_verbosity to 'log_callback_only' and use a HorovodRunner log callback on the first worker to get concise progress updates.
The global names read or written to by the pickled function are {'print', 'hvd'}.
The pickled object size is 1444 bytes.
### How to enable Horovod Timeline? ###
HorovodRunner has the ability to record the timeline of its activity with Horovod Timeline. To record a Horovod Timeline, set the `HOROVOD_TIMELINE` environment variable to the location of the timeline file to be created. You can then open the timeline file using the chrome://tracing facility of the Chrome browser.
Am I missing something, or do I need to set something up to make it work?
Thanks
Your code does no actual training within it. You might have better luck with this more complete example code:
https://docs.databricks.com/applications/machine-learning/train-model/distributed-training/mnist-pytorch.html
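For reference, here is a minimal sketch (not taken from the linked example) of a train() function that performs some actual training with Horovod and TensorFlow/Keras on a tiny synthetic dataset; the horovod.tensorflow.keras module and its callback names are assumed to be available in the Databricks ML runtime:

from sparkdl import HorovodRunner

def train():
    import numpy as np
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()  # initialize Horovod on each worker process

    # Tiny synthetic regression problem so the sketch is self-contained
    x = np.random.rand(1000, 10).astype('float32')
    y = np.random.rand(1000, 1).astype('float32')

    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])

    # Scale the learning rate by the number of workers and wrap the optimizer
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))
    model.compile(optimizer=opt, loss='mse')

    # Broadcast initial variable states from rank 0 so all workers start identically
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
    model.fit(x, y, epochs=1, batch_size=32, callbacks=callbacks, verbose=2)

hr = HorovodRunner(np=2)
hr.run(train)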

ClientError: An error occurred when calling the CreateModel operation

I want to deploy an sklearn model in SageMaker. I created a training script:
scripPath = 'sklearn.py'
sklearn = SKLearn(entry_point=scripPath,
                  train_instance_type='ml.m5.xlarge',
                  role=role,
                  output_path='s3://{}/{}/output'.format(bucket, prefix),
                  sagemaker_session=session)
sklearn.fit({'train-dir': train_input})
When I deploy it with
predictor = sklearn.deploy(initial_count=1, instance_type='ml.m5.xlarge')
it throws:
ClientError: An error occurred when calling the CreateModel operation: Could not find model data at s3://tree/sklearn/output/model.tar.gz
Can anyone tell me how to solve this issue?
When deploying models, SageMaker looks in S3 to find your trained model artifact. It seems that there is no trained model artifact at s3://tree/sklearn/output/model.tar.gz. Make sure your training script persists the model artifact at the appropriate local location inside the Docker container, which is /opt/ml/model.
For example, in your training script this could look like:
joblib.dump(model, '/opt/ml/model/mymodel.joblib')
After training, SageMaker will copy the content of /opt/ml/model to S3 at the output_path location.
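For illustration, a minimal sketch of the tail end of a training script that persists the artifact where SageMaker expects it; the dataset, the mymodel.joblib file name, and the reliance on the SM_MODEL_DIR environment variable (which the SageMaker training container points at /opt/ml/model) are illustrative assumptions, not details from the question:

import os
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

if __name__ == '__main__':
    # SM_MODEL_DIR points at /opt/ml/model inside the training container;
    # everything written there ends up in model.tar.gz under output_path on S3.
    model_dir = os.environ.get('SM_MODEL_DIR', '/opt/ml/model')

    X, y = load_iris(return_X_y=True)  # placeholder data for the sketch
    model = LogisticRegression(max_iter=200).fit(X, y)

    joblib.dump(model, os.path.join(model_dir, 'mymodel.joblib'))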
If you deploy in the same session, a call to deploy() will map automatically to the artifact path. If you want to deploy a model that you trained elsewhere, possibly during a different session or on different hardware, you need to explicitly instantiate a model before deploying:
from sagemaker.sklearn.model import SKLearnModel

model = SKLearnModel(
    model_data='s3://...model.tar.gz',  # your artifact
    role=get_execution_role(),
    entry_point='script.py')  # script containing inference functions

model.deploy(
    instance_type='ml.m5.xlarge',
    initial_instance_count=1,
    endpoint_name='your_endpoint_name')
See more about Sklearn in SageMaker here https://sagemaker.readthedocs.io/en/stable/using_sklearn.html

How do I retrieve the JSON representation of an Azure Data Factory pipeline?

I want to track pipeline changes in source control, and I'm looking for a way to programmatically retrieve the JSON representation from ADF.
The .NET routines return the objects, but sadly ToString() does not return JSON (wouldn't THAT be convenient?), so right now I'm looking at copying the JSON down by hand (shoot me now!), or possibly trying to recreate the JSON from the .NET objects (shoot me later!).
Please tell me I'm being dense and there is an obvious way to do this.
You can serialize the object using Newtonsoft Json.
See (https://azure.microsoft.com/en-us/documentation/articles/data-factory-create-data-factories-programmatically/) for how to connect via the ADF SDK
// Authenticate against Azure and create the Data Factory management client
var aadTokenCredentials = new TokenCloudCredentials(ConfigurationManager.AppSettings["SubscriptionId"], GetAuthorizationHeader());
var resourceManagerUri = new Uri(ConfigurationManager.AppSettings["ResourceManagerEndpoint"]);
var manager = new DataFactoryManagementClient(aadTokenCredentials, resourceManagerUri);

// Fetch the pipeline and serialize its definition to indented JSON
var pipeline = manager.Pipelines.Get(resourceGroupName, dataFactoryName, pipelineName);
var pipelineAsJson = JsonConvert.SerializeObject(pipeline.Pipeline, Formatting.Indented);
I was expecting something more complex, but looking at the SDK source on GitHub, it is not doing anything special.
Our team has a deployment tool that takes git changes and deploys them appropriately. Everything is done asynchronously and is controlled and versioned through git.
In a nutshell, our deployment has the following flow:
Any completed git merge request triggers a VSO build. This simply builds the whole solution via MSBuild.
Every successful build is given a git tag for tracking the Last Known Good.
Next (if the build succeeded), our .NET ADFPublisher takes only the changed data factory files and asynchronously publishes them based on their git operation (modified, added, deleted, etc.).
For some failure cases our ADFPublisher will perform a retry.
This whole process (build + publish) takes ~65 seconds and has already saved us from several bugs. It also allows us to move definitions from one environment to another very easily.
Let me know if you think this is something you would be interested in, and I will set up a way to share it with you.

Spring Batch: dump a set of queries over a database in parallel to flat files

My scenario, drilled down to its essence, is as follows:
Essentially, I have a config file containing a set of SQL queries whose result sets need to be exported as CSV files.
Since some queries may return billions of rows, and because something may interrupt the process (a bug, a crash, ...), I want to use a framework such as Spring Batch, which gives me restartability and job monitoring.
I am using a file-based H2 database for persisting Spring Batch jobs.
So, here are my questions:
Upon creating a Job, I need to provide my RowMapper with some initial configuration. So what happens when a job needs to be restarted after, e.g., a crash? Concretely:
Is the state of the RowMapper automatically persisted, and upon restart will Spring Batch try to restore the object from its database, or
will the RowMapper object that is part of the original Spring Batch XML config file be used, or
do I have to maintain the RowMapper's state using the step's/job's ExecutionContext?
The above question is related to whether there is magic going on when using the Spring Batch XML configuration, or whether I could just as well create all these beans programmatically:
Since I need to parse my own config format into a Spring Batch job config, I would rather just use Spring Batch's Java classes (beans) and fill them out appropriately than attempt to manually write out valid XML. However, if my Job crashes, I would create all the beans myself again. Does Spring Batch automagically restore the Job state from its database?
If I really need XML, is there a way to serialize a Spring Batch JobRepository (or one of these objects) as a Spring Batch XML config?
Right now, I tried to configure my Step with the following code - but I am unsure if this is the proper way to do this:
Is TaskletStep the way to go?
Is the way I create the chunked reader/writer correct, or is there some other object which I should use instead?
I would have assumed that opening the reader and writer would occur automatically as part of the JobExecution, but if I don't open these resources prior to running the Job, I get an exception telling me that I need to open them first. Maybe I need to create some other object that manages the resources (JDBC connection and file handle)?
// Reader: stream rows from the database with a JDBC cursor
JdbcCursorItemReader<Foobar> itemReader = new JdbcCursorItemReader<Foobar>();
itemReader.setSql(sqlStr);
itemReader.setDataSource(dataSource);
itemReader.setRowMapper(rowMapper);
itemReader.afterPropertiesSet();
ExecutionContext executionContext = new ExecutionContext();
itemReader.open(executionContext);

// Writer: write each aggregated line to a flat file
FlatFileItemWriter<String> itemWriter = new FlatFileItemWriter<String>();
itemWriter.setLineAggregator(new PassThroughLineAggregator<String>());
itemWriter.setResource(outResource);
itemWriter.afterPropertiesSet();
itemWriter.open(executionContext);

// Chunking: commit every 50000 items
int commitInterval = 50000;
CompletionPolicy completionPolicy = new SimpleCompletionPolicy(commitInterval);
RepeatTemplate repeatTemplate = new RepeatTemplate();
repeatTemplate.setCompletionPolicy(completionPolicy);
RepeatOperations repeatOperations = repeatTemplate;
ChunkProvider<Foobar> chunkProvider = new SimpleChunkProvider<Foobar>(itemReader, repeatOperations);
ItemProcessor<Foobar, String> itemProcessor = new ItemProcessor<Foobar, String>() {
    /* Custom implementation */ };
ChunkProcessor<Foobar> chunkProcessor = new SimpleChunkProcessor<Foobar, String>(itemProcessor, itemWriter);

// Wrap the chunk handling in a tasklet-based step and add it to the job
Tasklet tasklet = new ChunkOrientedTasklet<Foobar>(chunkProvider, chunkProcessor); //new SplitFilesTasklet();
TaskletStep taskletStep = new TaskletStep();
taskletStep.setName(taskletName);
taskletStep.setJobRepository(jobRepository);
taskletStep.setTransactionManager(transactionManager);
taskletStep.setTasklet(tasklet);
taskletStep.afterPropertiesSet();
job.addStep(taskletStep);
Most of your questions are really complex, and it is difficult to give a good answer without writing a long paper.
I'm new to Spring Batch like you, and I found a lot of really useful info - and all the answers to your questions - by reading Spring Batch in Action: it's complete, well explained, full of examples, and covers all aspects of the framework (reader/writer/processor, job/tasklet/chunk lifecycle/persistence, tx/resource management, job flow, integration with other services, partitioning, restarting/retry, failure management, and a lot of other interesting things).
Hope this helps.