I'm developing an app which uses RapidMiner for classification. I train the classifier from time to time (e.g. daily), but I use the classifier at a very high rate (250 classifications per second).
For this purpose, I created two processes using the RapidMiner GUI. The first one trains the classifier and saves it to a model file, while the second one uses that file for classification.
In the second process I load the model file that the first process creates. This turned out to be very slow, since it seems the process loads the model file every time I want to classify an input.
You can see the second process in the following picture:
What's a smarter way of doing this?
P.S. I think one solution is to create another process which loads the trained classifier only once and then passes it to the Apply Model operator as another input. But I didn't find a way to do this in Java code.
Already discussed and solved here.
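Since the P.S. asks about doing this from Java code, below is a rough sketch of the "load the model once, apply it many times" idea. It assumes the RapidMiner 5 Java API, and the class and method names used here (RapidMiner.init, AbstractIOObject.read, Model.apply) are from memory, so treat this as a sketch rather than a definitive implementation.

import java.io.File;
import java.io.FileInputStream;

import com.rapidminer.RapidMiner;
import com.rapidminer.example.ExampleSet;
import com.rapidminer.operator.AbstractIOObject;
import com.rapidminer.operator.Model;

public class ClassifierService {

    private Model model;

    // Initialize RapidMiner and deserialize the model file once, at application startup.
    public void init(File modelFile) throws Exception {
        RapidMiner.setExecutionMode(RapidMiner.ExecutionMode.COMMAND_LINE);
        RapidMiner.init();
        model = (Model) AbstractIOObject.read(new FileInputStream(modelFile));
    }

    // Each incoming request only pays for applying the already-loaded model.
    public ExampleSet classify(ExampleSet unlabelled) throws Exception {
        return model.apply(unlabelled);
    }
}

The important part is that initialization and deserialization happen once, while the per-request path (250 calls per second) only applies an in-memory model.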
I want to know whether a job I'm debugging is using incremental computation or not, since this matters for my debugging techniques.
There are two ways to tell: the job's Spark Details will indicate this (if it's using Python), as will its code.
Spark Details
If you navigate to the Spark Details page as noted here, you'll notice there's a tab for Snapshot / Incremental. In this tab, if your job is using Python, you'll get a description of whether your job is running using incremental computation. If the page reports No Incremental Details Found and you ran the job recently, this means it is not using incremental computation. However, if your job is somewhat old (typically more than a couple of days), this may not be accurate, as these Spark details are removed for retention reasons.
A quick way to check if your job's information has been removed due to retention is to navigate to the Query Plan tab and see if any information is present. If nothing is present, this means your Spark details have been deleted and you will need to re-run your job in order to see anything. If you want a more reliable way of determining if a job is using incremental computation, I'd suggest following the second method below.
Code
If you navigate to the code backing the Transform, there are a couple of indicators to look for, depending on the language used.
Python
The Transform will have an @incremental() decorator on it if it's using incremental computation. However, this alone doesn't indicate whether it will read or write an incremental view. The backing code can choose what types of reads or writes it performs, so you'll want to inspect the code more closely to see what it is written to do.
from transforms.api import transform, Input, Output, incremental

@incremental()  # This decorator being present indicates incremental computation
@transform(...)
def my_compute_function(...):
    ...
Java
The Transform will have the getReadRange and getWriteMode methods overridden in the backing code.
I'm running Fairseq from the command line. Fairseq loads language models on the fly and does the translation. It works fine, but it takes time to load the models and do the translation. I'm thinking that if we ran Fairseq as an in-memory service and pre-loaded all language models, the service would be quick to run and do the translations.
My questions are:
Will it be more efficient if we run Fairseq as an in-memory service and pre-load the language models?
How much of an efficiency increase can we expect?
How easy will it be to implement such an in-memory Fairseq service?
Thank you very much for helping out.
There is an issue about preloading models:
https://github.com/pytorch/fairseq/issues/1694
For a custom model, the code below shows how to preload a fairseq model in memory. It is an official example, which can be found at: https://github.com/pytorch/fairseq/tree/master/examples/translation#example-usage-torchhub
from fairseq.models.transformer import TransformerModel
zh2en = TransformerModel.from_pretrained(
    '/path/to/checkpoints',
    checkpoint_file='checkpoint_best.pt',
    data_name_or_path='data-bin/wmt17_zh_en_full',
    bpe='subword_nmt',
    bpe_codes='data-bin/wmt17_zh_en_full/zh.code'
)
zh2en.translate('你好 世界')
# 'Hello World'
You can go through the source code to find more details about the method from_pretrained: https://github.com/pytorch/fairseq/blob/579a48f4be3876082ea646880061a98c94357af1/fairseq/models/fairseq_model.py#L237
Once preloaded, you can use the model repeatedly without going through the command line.
If you want to use a GPU, remember to execute: model.to('cuda').
It is certainly more efficient if you preload: a reasonably large model takes seconds just to be loaded into memory, so you only want to pay that cost once.
Caffe requires at least three .prototxt files: one for training, one for deployment, and one to define the solver parameters.
My training and deployment files contain identical pieces describing the network architecture. Is it possible to refactor this by moving the common part out of them into a separate file?
You are looking for an "all-in-one" network.
See this github discussion for more information.
Apparently, you can achieve this not only by using include { phase: XXX } rules, but also by taking advantage of stage and state.
I have been consulting several references to find out how to output trained Weka models as Java source code, so that I can use the classifiers I train in actual code for research applications I am developing.
While playing with Weka 3.7, I noticed that although it does output Java code to its main text buffer when I use simpler (supervised, in my case) classification methods such as the J48 decision tree, it disables the option (the checkbox is greyed out and cannot be ticked) to output Java code for RandomTree and RandomForest, which are the classifiers that give me the best performance in my situation.
Note: I am clicking on the "More Options" button and checking "Output source code:".
Does Weka not allow you to output RandomTree or RandomForest as Java code? If so, why? Or, if it does but just doesn't put it in the output buffer (since a RandomForest consists of multiple decision trees, I imagine Weka doesn't want to waste buffer space), where in the file system does Weka output Java code by default?
Are there any tricks to get Weka to give me my trained RandomForest as Java code? Or is serialization of the output *.model files my only hope when it comes to RandomForest and RandomTree?
Thanks in advance to those who provide help.
NOTE (as an addendum to the answer provided below): if you run across a similar situation (needing to use your trained classifier/ML model in your code), I recommend following the links posted in the answer to my question. If you do not specifically need the Java code for the RandomForest, for example, de-serializing the model works quite nicely and fits into Java application code, fulfilling its task as a trained model meant to predict future unlabelled instances.
RandomTree and RandomForest can't be output as Java code. I'm not sure of the reason why, but they don't implement the "Sourceable" interface.
This explains a little about outputting a classifier as Java code: Link 1
This shows which classifiers can be output as Java code: Link 2
Unfortunately, I think the easiest route will be serialization (sketched below), although you could try implementing "Sourceable" for other classifiers on your own.
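To illustrate the serialization route, here is a minimal sketch that deserializes a saved RandomForest and uses it to classify new instances. The file names (randomForest.model, unlabelled.arff) are placeholders for your own files.

import weka.classifiers.Classifier;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class UseSavedForest {
    public static void main(String[] args) throws Exception {
        // Load the RandomForest that was saved as a .model file (Java serialization).
        Classifier forest = (Classifier) SerializationHelper.read("randomForest.model");

        // Load unlabelled instances and tell Weka which attribute is the class.
        Instances unlabelled = DataSource.read("unlabelled.arff");
        unlabelled.setClassIndex(unlabelled.numAttributes() - 1);

        // Reuse the deserialized model for every instance; no retraining needed.
        for (int i = 0; i < unlabelled.numInstances(); i++) {
            double label = forest.classifyInstance(unlabelled.instance(i));
            System.out.println(unlabelled.classAttribute().value((int) label));
        }
    }
}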
Another, though perhaps inconvenient, solution would be to use Weka to build the classifier every time you use it. You wouldn't need to load the ".model" file, but you would need to load your training data and relearn the model. Here is a starter's guide to building classifiers in your own Java code: http://weka.wikispaces.com/Use+WEKA+in+your+Java+code.
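For completeness, a minimal sketch of that relearn-on-startup approach, assuming the training data lives in a placeholder training.arff file:

import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RebuildForest {
    public static void main(String[] args) throws Exception {
        // Load the training data and mark the last attribute as the class.
        Instances train = DataSource.read("training.arff");
        train.setClassIndex(train.numAttributes() - 1);

        // Relearn the RandomForest from scratch each time the application starts.
        RandomForest forest = new RandomForest();
        forest.buildClassifier(train);

        System.out.println(forest);
    }
}

Whether this is acceptable depends on how large the training data is and how often the application restarts.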
I solved the problem for myself by turning the output of WEKA's -printTrees option for the RandomForest classifier into Java source code:
http://pielot.org/2015/06/exporting-randomforest-models-to-java-source-code/
Since I am using classifiers with Android, all of the existing options had disadvantages:
shipping Android apps with serialized models didn't reliably work across devices
computing the model on the phone took too many resources
The final code consists of only three classes: the class with the generated model plus two classes to make the classification work.
I have been using KNIME 2.7.4 for running an analysis algorithm. I have integrated KNIME with our existing application to run in batch mode, using the command below.
<<KNIME_ROOT_PATH>>\plugins\org.eclipse.equinox.launcher_1.2.0.v20110502.jar -application org.knime.product.KNIME_BATCH_APPLICATION -reset -workflowFile=<<Workflow Archive>> -workflow.variable=<<parameter>>,<<value>>,<<DataType>>
KNIME provides different kinds of plots which I want to use. However, I am running the workflow in batch mode. Is there any option in KNIME where I can specify the node ID and "View" option as a parameter to KNIME_BATCH_APPLICATION?
I would appreciate suggestions or guidance on how to achieve this functionality.
I posted this question on the KNIME forum and got the satisfactory answer mentioned below.
This requirement does not fit the concept of command-line execution, and there is no way for the batch executor to open the view of a specific plot node.
Hence there are two possible solutions.
Solution 1
Write the output of the workflow to a file and use any charting plugin to plot the graph and do the drill-down activity.
Solution 2
Use JFreeChart and write the image using the ImageWriter node, so it can be displayed on any screen.
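As a rough illustration of Solution 2, here is a minimal JFreeChart sketch (assuming JFreeChart 1.0.x) that renders a chart to a PNG file which can then be displayed anywhere. The data values and the output file name are placeholders; in practice the data would come from the workflow's output.

import java.io.File;

import org.jfree.chart.ChartFactory;
import org.jfree.chart.ChartUtilities;
import org.jfree.chart.JFreeChart;
import org.jfree.chart.plot.PlotOrientation;
import org.jfree.data.xy.XYSeries;
import org.jfree.data.xy.XYSeriesCollection;

public class BatchPlot {
    public static void main(String[] args) throws Exception {
        // Placeholder data; in practice this would be read from the workflow's output file.
        XYSeries series = new XYSeries("score");
        series.add(1, 0.82);
        series.add(2, 0.87);
        series.add(3, 0.91);

        JFreeChart chart = ChartFactory.createXYLineChart(
                "Workflow output", "iteration", "score",
                new XYSeriesCollection(series),
                PlotOrientation.VERTICAL, true, false, false);

        // Write the chart as a PNG image that any screen or web page can display.
        ChartUtilities.saveChartAsPNG(new File("workflow-output.png"), chart, 800, 600);
    }
}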