Is it possible to prevent command-line CatBoost from shuffling the input data in cross-validation mode? - catboost

My data consists of 10 weeks of observations. I would like to cross-validate the model in 9-to-1 week mode, so I don't want CatBoost to shuffle the data before splitting. Is it possible with the command line?
I'm not sure whether "--cv-rand 0" (or any other value) works as "no shuffle".

Use the --cv-no-shuffle option; it has just been implemented in CatBoost.

Change the stratified parameter:
stratified = False
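If your CatBoost build predates the flag, the 9-to-1 week split can also be prepared by hand before training. A minimal plain-Python sketch; the (week, features) record layout and the two-observations-per-week sample data are illustrative assumptions, not your real format:

```python
# Split time-ordered observations into 9 training weeks and 1 test week
# without shuffling. Each observation is a (week, features) pair here.
def time_based_folds(observations, n_weeks=10):
    """Yield (train, test) pairs: each fold tests on one week and
    trains on the remaining weeks, preserving temporal order."""
    by_week = {w: [] for w in range(n_weeks)}
    for week, features in observations:
        by_week[week].append((week, features))
    for test_week in range(n_weeks):
        train = [obs for w in range(n_weeks) if w != test_week
                 for obs in by_week[w]]
        test = by_week[test_week]
        yield train, test

# Example: 20 observations, 2 per week
data = [(w, [float(w), float(i)]) for w in range(10) for i in range(2)]
folds = list(time_based_folds(data))
print(len(folds))        # 10 folds
print(len(folds[0][0]))  # 18 training observations in the first fold
print(len(folds[0][1]))  # 2 test observations in the first fold
```

Each fold can then be written out and passed to CatBoost as separate learn/test files. If your evaluation must never train on future weeks, restrict the training set to weeks before the test week instead.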


BentoML - Serving a CatBoostClassifier with cat_features

I am trying to create a BentoML service for a CatBoostClassifier model that was trained using a column as a categorical feature. If I save the model and try to make some predictions with the saved model (not as a BentoML service), everything works as expected, but when I create the service using BentoML I get an error:
_catboost.CatBoostError: Bad value for num_feature[non_default_doc_idx=0,feature_idx=2]="Tertiary": Cannot convert 'b'Tertiary'' to float
The value is found in a column named 'road_type' and the model was trained using 'object' as the data type for the column.
If I try to give a float or an integer for the 'road_type' column I get the following error
_catboost.CatBoostError: catboost/libs/data/model_dataset_compatibility.cpp:53: Feature road_type is Categorical in model but marked different in the dataset
If someone has encountered the same issue and found a solution I would appreciate it. Thanks!
I have tried different approaches for saving and loading the model, but unfortunately it did not work.
You can try to explicitly pass the cat_features to the bentoml runner.
It would be something like this:
import bentoml
from catboost import Pool

runner = bentoml.catboost.get("bentoml_catboost_model:latest").to_runner()
cat_features = [2]  # specify your cat_features indexes
prediction = runner.predict.run(Pool(input_data, cat_features=cat_features))
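For context on the original error: CatBoost treats every feature not declared in cat_features as numeric and attempts a float conversion, which is exactly what fails on "Tertiary". A simplified stand-in illustrates the mechanics (parse_feature is a hypothetical helper for illustration, not real CatBoost code):

```python
# Any column NOT listed in cat_features is treated as numeric,
# so CatBoost tries to convert it to float.
def parse_feature(value, feature_idx, cat_features):
    if feature_idx in cat_features:
        return str(value)    # categorical: kept as a string
    return float(value)      # numeric: must convert cleanly

row = ["35", "12000", "Tertiary"]  # column 2 is 'road_type'

# With cat_features declared, parsing succeeds:
parsed = [parse_feature(v, i, cat_features={2}) for i, v in enumerate(row)]
print(parsed)  # [35.0, 12000.0, 'Tertiary']

# Without it, we hit the same kind of failure as the service:
try:
    [parse_feature(v, i, cat_features=set()) for i, v in enumerate(row)]
except ValueError as e:
    print("conversion failed:", e)
```

This is why declaring the indexes on the Pool (as above) resolves the first error, and why the column must stay string-typed rather than being cast to a float or an integer.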

Working on migration of SPL 3.0 to 4.2 (TEDA)

I am working on migration of 3.0 code into new 4.2 framework. I am facing a few difficulties:
How to do CDR level deduplication in new 4.2 framework? (Note: Table deduplication is already done).
Where to implement PostDedupProcessor - context or chainsink custom? In either case, do I need to remove duplicate hashcodes from the list, or just reject the tuples? Here I am also updating columns for a few tuples.
My file is not moving into the archive. The temporary output file is being generated, but it is empty and outside the load directory. What could be the possible reasons? I have thoroughly checked the config parameters, and after adding logs it seems the correct output is being sent from the transformer custom, so I don't know where it is stuck. I had printed the TableRowGenerator stream to the logs (end of DataProcessor).
1. and 2.:
You need to select the type of deduplication. It does not make a big difference whether you choose "table-" or "cdr-level-deduplication"; the ite.businessLogic.transformation.outputType parameter controls this. There is only one dedup stage; you cannot have both.
Select recordStream for "cdr-level-deduplication", do the transformation to table row format (e.g. if you like to use the TableFileWriter) in xxx.chainsink.custom::PostContextDataProcessor.
In xxx.chainsink.custom::PostContextDataProcessor you need to add custom code for duplicate-handling: reject (discard) tuples or set special column values or write them to different target tables.
3.:
Possible reasons could be:
Missing forwarding of window punctuations or statistic tuples
An error in the BloomFilter configuration; you would spot this easily because the PE goes down and the error log hints at wrong sha2 functions being used
To troubleshoot your ITE application, I recommend enabling the following debug sinks if checking the StreamsStudio live graph is not sufficient:
ite.businessLogic.transformation.debug=on
ite.businessLogic.group.debug=on
ite.businessLogic.sink.debug=on
Run a test with a single input file only and check the flow of your record and statistic tuples. Debug sinks also write punctuation markers to the debug files.

How to check whether a file is serial or partitioned using abinitio functions only

I need to resolve a parameter value depending on whether I have a serial file or a multifile. Below is the scenario:
I have created a generic graph with a reformat component just after the input file component. At run time, I need to check whether the input file is serial or multifile, and populate the layout of the reformat accordingly.
Hence, to achieve this, I am looking for a specific Ab Initio function.
Thanks
I think there is a function for this: m_fs_check.
You can use this function in the graph parameters and use the resolved value as a condition to determine the layout.
m_fs_check will check whether a directory is a serial or multi directory. However, a user can still create a serial file in a multi directory. One option is to fire an m_ls -lt command; the result displays an 'M' flag denoting a multifile, while for serial files the flag remains blank.
Use m_expand($INPUT_FILE_PATH) in PDL at the PSET level to identify the directory depth. If the depth is greater than one, it is a multifile; otherwise it is serial. Then use the output flag in your reformat.

Splitting Data into a test set and training set

Which operator in Rapidminer can I use to make an out of bag sample as my training set, and use the remaining data as my test set?
The Split Data operator is one option. It produces two or more example sets, split the way you want, which you can then use as you like. An alternative that incorporates both the training and testing aspects is Split-Validation.
Use the X-validation operator.
Attach your data set to the X-validation operator, then attach the output of the operator to the output node.
After this, go into the X-Validation operator by double-clicking it or clicking the small double blue window at its bottom right corner.
Once inside the operator, attach whatever model you wish to create (in this instance I used a decision tree model) on the training side, then on the testing side attach the Apply Model operator to the Performance operator. Finally, attach the Performance operator to the output.
Then press play. It should work.
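For readers who want to see what X-Validation does behind the GUI, here is a rough plain-Python sketch of the bookkeeping; the mean predictor is a stand-in for a real model such as a decision tree:

```python
# What cross-validation does under the hood: k disjoint test folds,
# train on the rest, average the performance across folds.
def k_fold_indices(n, k):
    """Partition indices 0..n-1 into k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(xs, ys, k):
    scores = []
    for test_idx in k_fold_indices(len(xs), k):
        train_idx = [i for i in range(len(xs)) if i not in set(test_idx)]
        mean = sum(ys[i] for i in train_idx) / len(train_idx)  # "training"
        mse = sum((ys[i] - mean) ** 2 for i in test_idx) / len(test_idx)
        scores.append(mse)
    return sum(scores) / len(scores)  # averaged performance

ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
avg_mse = cross_validate(list(range(6)), ys, k=3)
print(avg_mse)  # 6.25
```

Each example is tested exactly once and trained on k-1 times, which is why X-Validation gives a more stable performance estimate than a single split.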

Test and Training Set are Not Compatible

I have seen various articles about the same issue and tried a lot of solutions, but nothing is working. Kindly advise.
I am getting an error in WEKA:
"Problem Evaluating Classifier: Test and Training Set are Not
Compatible".
I am using
J48 as my algorithm
These are my datasets:
Train set:
https://www.dropbox.com/s/fm0n1vkwc4yj8yn/train.csv
Eval set:
https://www.dropbox.com/s/2j9jgxnoxr8xjdx/Eval.csv
(I am unable to copy and paste due to long code)
I have tried "Batch Filtering" in WEKA (for the training set) but it still does not work.
EDIT: I have even converted my .csv files to .arff, but the issue remains.
EDIT2: I have made sure the headers in both CSVs match. Even then, same issue. Please help!
A common error when converting ".csv" files to ".arff" with Weka is that values for nominal attributes appear in a different order, or not at all, from one dataset to the other.
Your evaluation ".arff" file probably looks like this (skipping irrelevant data):
@relation Eval
@attribute a321 {TRUE}
Your train ".arff" file probably looks like this (skipping irrelevant data):
@relation train
@attribute a321 {FALSE}
However, both should contain all possible values for that attribute, and in the same order:
@attribute a321 {TRUE, FALSE}
You can remedy this by post-processing your ".arff" files in a text editor, changing the headers so that the nominal values appear in the same order (and quantity) from file to file.
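If you have many attributes, the header edit can be scripted. A minimal sketch, assuming simple single-line @attribute declarations with no quoted or escaped values (the attribute name a321 comes from the example above):

```python
import re

# Merge the nominal value sets of the same attribute from two ARFF
# headers so both files can declare identical values in identical order.
def merge_nominal(train_line, eval_line):
    def values(line):
        inner = re.search(r"\{(.*)\}", line).group(1)
        return [v.strip() for v in inner.split(",")]
    merged, seen = [], set()
    for v in values(train_line) + values(eval_line):
        if v not in seen:
            seen.add(v)
            merged.append(v)
    name = train_line.split()[1]
    return "@attribute %s {%s}" % (name, ", ".join(merged))

line = merge_nominal("@attribute a321 {FALSE}", "@attribute a321 {TRUE}")
print(line)  # @attribute a321 {FALSE, TRUE}
```

Write the merged declaration into both the train and eval headers; the exact order does not matter to Weka as long as it is identical in both files.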
How do I divide a dataset into a training set and a test set?
You can use the RemovePercentage filter (package weka.filters.unsupervised.instance).
In the Explorer just do the following:
training set:
Load the full dataset
select the RemovePercentage filter in the preprocess panel
set the correct percentage for the split
apply the filter
save the generated data as a new file
test set:
Load the full dataset (or just use undo to revert the changes to the dataset)
select the RemovePercentage filter if not yet selected
set the invertSelection property to true
apply the filter
save the generated data as a new file
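The two passes above can also be sketched outside Weka; here is a plain-Python analogue of RemovePercentage with its invertSelection flag (the 30% figure is just an example):

```python
# Analogue of Weka's RemovePercentage filter: drop the first
# `percentage` of instances; with invert=True, keep them instead.
def remove_percentage(instances, percentage, invert=False):
    cut = int(len(instances) * percentage / 100.0)
    return instances[:cut] if invert else instances[cut:]

data = list(range(10))                           # stand-in dataset
train = remove_percentage(data, 30)              # remaining 70%
test = remove_percentage(data, 30, invert=True)  # removed 30%
print(train)  # [3, 4, 5, 6, 7, 8, 9]
print(test)   # [0, 1, 2]
```

Running the filter twice with invertSelection flipped is what makes the two output files complementary: together they cover the full dataset with no overlap.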