Azure Machine Learning Data Transformation

Can machine learning be used to transform/modify a list of numbers?
I have many pairs of binary files read from vehicle ECUs, an original or stock file before the vehicle was tuned, and a modified file which has the engine parameters altered. The files are basically lists of little or big endian 16 bit numbers.
I was wondering if it is at all possible to feed these pairs of files into machine learning, and for it to take a new stock file and attempt to transform or tune that stock file.
I would appreciate it if somebody could tell me if this is something which is at all possible. All of the examples I've found appear to make decisions on data rather than do any sort of a transformation.
Also, I'm hoping to use Azure for this.

We would need more information about your specific problem to answer. But, supervised machine learning can take data with a lot of inputs (like your stock file, perhaps) and an output (say a tuned value), and learn the correlations between those inputs and output, and then be able to predict the output for new inputs. (In machine learning terminology, these inputs are called "features" and the output is called a "label".)
Now, within supervised machine learning, there is a category of algorithms called regression algorithms. Regression algorithms allow you to predict a number (sounds like what you want).
Now, the issue that I see, if I'm understanding your problem correctly, is that you have a whole list of values to tune. Two things:
Do those values depend on each other and influence each other? Do any other factors not included in your stock file affect how the numbers should be tuned? Those will need to be included as features in your model.
Regression algorithms typically predict a single value, so in the simplest setup you would need to build a model for each of the values in your stock file that you want to tune.
For more information, you might want to check out Choosing an Azure Machine Learning Algorithm and How to choose algorithms for Microsoft Azure Machine Learning.
Again, I would need to know more about your data to make better suggestions, but I hope that helps.
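
As a rough illustration of the one-model-per-value idea above, here is a minimal Python sketch. It assumes every file decodes to the same number of 16-bit values and that each tuned value depends only on the stock value at the same position, which may well not hold for real ECU data. The function names (`read_u16`, `fit_line`, `fit_per_position`, `predict`) and the toy data are made up for illustration, and a plain least-squares line stands in for whatever regression algorithm you would actually pick in Azure Machine Learning.

```python
import struct


def read_u16(data, little_endian=True):
    """Decode raw bytes into a list of unsigned 16-bit integers."""
    count = len(data) // 2
    fmt = ("<" if little_endian else ">") + str(count) + "H"
    return list(struct.unpack(fmt, data[: count * 2]))


def fit_line(xs, ys):
    """Least-squares fit of y = a * x + b for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        # The stock value never varies, so predict the mean tuned value.
        return 0.0, mean_y
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov / var_x
    return a, mean_y - a * mean_x


def fit_per_position(pairs):
    """Fit one independent model per value position (feature -> label)."""
    stock_columns = zip(*(stock for stock, _ in pairs))
    tuned_columns = zip(*(tuned for _, tuned in pairs))
    return [fit_line(xs, ys) for xs, ys in zip(stock_columns, tuned_columns)]


def predict(models, stock):
    """Apply each position's model to the values of a new stock file."""
    return [round(a * x + b) for (a, b), x in zip(models, stock)]


# Hypothetical toy pairs: position 0 is raised by 10% by the tune,
# position 1 is left unchanged.
pairs = [
    (read_u16(struct.pack("<2H", 100, 500)), read_u16(struct.pack("<2H", 110, 500))),
    (read_u16(struct.pack("<2H", 200, 600)), read_u16(struct.pack("<2H", 220, 600))),
    (read_u16(struct.pack("<2H", 300, 700)), read_u16(struct.pack("<2H", 330, 700))),
]
models = fit_per_position(pairs)
tuned_guess = predict(models, [400, 800])
```

If the values influence each other, each per-position model would need more features than the single stock value used here, which is the concern raised in the first of the two points above.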

Related

Is there a reality capture parameter to request the desired number of vertices?

In the previous Reality Capture system, users could set a parameter that determined the resolution of the output models. I want to end up with models of about 100-150K vertices. Is there a setting somewhere in the Forge API that lets me ask the modeler to keep the number of generated vertices within some bounds?

Vertex/triangle decimation is usually a "subjective" task, which also explains why there are so many optimization algorithms in the wild.
"Organic" models need one type of optimization, and an architectural building needs a totally different one.
The Reality Capture API provides only the raw high-resolution results and avoids "opinionated" optimizations. It should be considered just one step in an automation pipeline.
Another step would be to automatically optimize the resulting mesh, once received, based on the settings you need.
One such step could be Design Automation for 3ds Max, where you feed in a model and use the ProOptimizer modifier within 3ds Max to output a mesh with the needed level of detail. A sample of this step can be found here: https://forge-showroom.autodesk.io/post/prooptimizer.
There are also numerous open-source solutions that should help you cover this post-processing step.
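
To give a feel for what such a post-processing step does, here is a minimal Python sketch of vertex clustering, one of the simplest decimation techniques. It is not what ProOptimizer does and is not part of the Reality Capture API. The function name `cluster_decimate` and the `cell_size` parameter are made up for illustration, and the mesh is assumed to be a plain list of `(x, y, z)` vertices plus a list of index triples.

```python
def cluster_decimate(vertices, triangles, cell_size):
    """Merge all vertices that fall in the same grid cell into their average."""
    cell_to_new = {}  # grid cell -> index of the merged vertex
    sums = []         # per merged vertex: [sum_x, sum_y, sum_z, count]
    remap = []        # old vertex index -> merged vertex index

    for x, y, z in vertices:
        cell = (int(x // cell_size), int(y // cell_size), int(z // cell_size))
        idx = cell_to_new.setdefault(cell, len(sums))
        if idx == len(sums):
            # First vertex seen in this cell: start a new accumulator.
            sums.append([0.0, 0.0, 0.0, 0])
        acc = sums[idx]
        acc[0] += x
        acc[1] += y
        acc[2] += z
        acc[3] += 1
        remap.append(idx)

    new_vertices = [(sx / n, sy / n, sz / n) for sx, sy, sz, n in sums]

    new_triangles = []
    seen = set()
    for a, b, c in triangles:
        a, b, c = remap[a], remap[b], remap[c]
        if a == b or b == c or a == c:
            continue  # triangle collapsed to a line or a point
        key = frozenset((a, b, c))
        if key in seen:
            continue  # duplicate of a triangle already kept
        seen.add(key)
        new_triangles.append((a, b, c))
    return new_vertices, new_triangles


# Tiny example mesh: vertices 0 and 1 share a grid cell and get merged.
verts = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
tris = [(0, 1, 2), (0, 2, 3)]
small_verts, small_tris = cluster_decimate(verts, tris, cell_size=0.5)
```

A larger `cell_size` gives fewer vertices, so hitting a range such as 100-150K would mean trying a few cell sizes until the count lands in range. Dedicated tools generally preserve shape much better than this sketch.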

Creating a dataset of images for object detection for extremely specific task

Even though I am quite familiar with the concepts of Machine Learning & Deep Learning, I never needed to create my own dataset before.
Now, for my thesis, I have to create my own dataset with images of an object for which no datasets are available on the internet (just assume that this is the ground truth).
I have limited computational power, so I want to use YOLO, SSD or EfficientDet.
Do I need to go through every single image in my dataset by eye and record the bounding box center coordinates and dimensions, along with their labels?
Thanks

Yes, you will need to do that.
At the same time, though the task is niche, you could benefit from transfer learning. That is, you can use a pre-trained backbone to help your model learn faster, achieve better results, or need fewer annotated examples, but you will still need to annotate the new dataset on your own.
You can use software such as LabelBox as a starting point. It is very good since it allows you to export annotations in Pascal VOC, YOLO and COCO formats, so it is a matter of choice and of what is more suitable for you.
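
The formats mentioned above differ mainly in how a box is written down. As a small illustration, the Python sketch below converts between Pascal VOC style boxes (pixel corners `xmin, ymin, xmax, ymax`) and YOLO style boxes (center and size, normalized by the image dimensions). The function names are made up for this example, and the class label and file layout are left out.

```python
def voc_to_yolo(box, img_w, img_h):
    """Pixel corners (xmin, ymin, xmax, ymax) -> normalized (xc, yc, w, h)."""
    xmin, ymin, xmax, ymax = box
    return (
        (xmin + xmax) / 2 / img_w,
        (ymin + ymax) / 2 / img_h,
        (xmax - xmin) / img_w,
        (ymax - ymin) / img_h,
    )


def yolo_to_voc(box, img_w, img_h):
    """Normalized (xc, yc, w, h) -> pixel corners (xmin, ymin, xmax, ymax)."""
    xc, yc, w, h = box
    return (
        (xc - w / 2) * img_w,
        (yc - h / 2) * img_h,
        (xc + w / 2) * img_w,
        (yc + h / 2) * img_h,
    )


# A 200x200 pixel box inside a 400x500 pixel image.
yolo_box = voc_to_yolo((100, 50, 300, 250), img_w=400, img_h=500)
voc_box = yolo_to_voc(yolo_box, img_w=400, img_h=500)
```

Annotation tools normally handle this conversion on export, so a helper like this is mostly useful when mixing annotations from different sources.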

Best practices to fine-tune a model?

I have a few questions regarding the fine-tuning process.
I'm building an app that is able to recognize data from the following documents:
ID Card
Driving license
Passport
Receipts
All of them have different fonts (especially receipts) and it is hard to match exactly the same font and I will have to train the model on a lot of similar fonts.
So my questions are:
Should I train a separate model for each document type for better performance and accuracy, or is it fine to train a single eng model on a set of fonts similar to those used in these types of documents?
How many pages of training data should I generate per font? By default, I think tesstrain.sh generates around 4k pages.
Are there any suggestions on how I can generate training data that is as close as possible to real input data?
How many iterations should be used?
For example, if I'm using a font that has a high error rate and I want to target a 98%-99% accuracy rate.
Also, maybe some of you have experience working with these types of documents and know some common fonts that are used for them?
I know that the MRZ in passports and ID cards uses the OCR-B font, but what about the rest of the document?
Thanks in advance!

Ans 1
You can train a single model to achieve this, but if you want to detect different languages then I think you will need different models.
Ans 2
If you are looking for some datasets then have a look at this Mnist Png Dataset which has digits as well as alphabets from various computer-based fonts. Here is a link to some starter code to use the data set implemented in Pytorch.
Ans 3
You can use optuna to find the best set of params for your model, but you will need some of the
using-optuna-to-optimize-pytorch-hyperparameters
Have a look at these
PAN-Card-OCR
document-details-parsing-using-ocr
They are trying to achieve a similar task.
Hope this answers your question!

I would train a classifier on the 4 different types to classify an ID, license, passport, or receipt. Basically, so you know that a passport is a passport vs. a driver's license, etc. Then I would have 4 more models that are used for translating each specific type (passport, driver's license, ID, and receipts). It should be noted that if you are working with multiple languages, this will likely mean making 4 models for each specific language, meaning that if you have L languages you may need 4*L models for translating those.
Likely a lot. I don’t think that font is really an issue. Maybe what you should do is try to define some templates for things like driver's licenses and then generate data based on those templates?
This is the least of your problems, just test for this.
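
The template idea from the second point could look roughly like the Python sketch below. The templates, field names and sample values are all invented for illustration and would need to be replaced with layouts taken from real documents. The sketch only produces text lines, which would then be rendered with the fonts you care about to obtain training images.

```python
import random
import string

# Hypothetical one-line templates; real documents have many more fields.
TEMPLATES = {
    "driving_license": "DL NO: {id}  NAME: {name}  DOB: {date}",
    "receipt": "{name}  QTY {qty}  TOTAL {amount}",
}


def random_fields(rng):
    """Draw one random value for every field a template may use."""
    alphabet = string.ascii_uppercase + string.digits
    return {
        "id": "".join(rng.choice(alphabet) for _ in range(9)),
        "name": rng.choice(["JOHN SMITH", "MARIA GARCIA", "WEI CHEN"]),
        "date": "{:02d}.{:02d}.{}".format(
            rng.randint(1, 28), rng.randint(1, 12), rng.randint(1950, 2005)
        ),
        "qty": rng.randint(1, 9),
        "amount": "{:.2f}".format(rng.uniform(1, 500)),
    }


def generate_lines(doc_type, count, seed=0):
    """Fill the template for one document type `count` times."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    template = TEMPLATES[doc_type]
    return [template.format(**random_fields(rng)) for _ in range(count)]


license_lines = generate_lines("driving_license", 5)
receipt_lines = generate_lines("receipt", 5)
```

Generating text this way keeps the character distribution (digits, uppercase names, dates, amounts) closer to what the real documents contain than generic prose would.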

Assuming you are referring to an ML model that performs OCR using computer vision, I'd recommend the following:
Set up your taxonomy as required by your application.
This means categorizing the expected font sets per type of scanned document (PNG, JPG, TIFF, etc.) to include in the appropriate dataset. Select the fonts closest to the ones being used, as well as the type of information you need to gather (digits only, alphabetic characters).
Perform data cleanup on your dataset and make sure you have homogeneous data for the OCR functionality. For example, all document images should be of PNG type, with max dimensions of 46x46 to have an appropriate training model. Note that higher resolution images and smaller scale mean higher accuracy.
Cater for handwriting as well, if you have damaged or non-visible font images. This might improve character conversion in cases where fonts on paper are not clearly visible or are worn out.
If you are using the Keras module with TF on MNIST-provided datasets, set up a cancellation rule that stops ML model training when you reach 98%-99% accuracy, for more control in case you expect the fonts in your images to be error-prone (as stated above). This helps avoid a higher margin of error when you have bad images in your training dataset. For a dataset of 1000+ images, a good setting would be a TF Dense layer of 256 units and 5 epochs.
A sample training dataset can be found here.
If you just need to do some automation with your application, or data entry that requires OCR conversion from images, a good open source solution would be to gather the information automatically via the PSImaging module (PowerShell), then use the degrees of confidence retrieved (from PNG) and run them against your current datasets to improve your character match accuracy.
You can find the relevant link here
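
The cancellation rule mentioned above for training can be sketched without any framework, as below. In Keras the same rule is usually written as a callback that sets `model.stop_training = True` once the logged accuracy passes the target. Here `train_until` and `run_epoch` are hypothetical names, and the accuracy values are simulated rather than measured.

```python
def train_until(run_epoch, target_accuracy=0.98, max_epochs=5):
    """Run epochs until the target accuracy or the epoch limit is reached.

    `run_epoch` is any callable that trains for one epoch and returns the
    accuracy measured afterwards.
    """
    history = []
    for epoch in range(1, max_epochs + 1):
        accuracy = run_epoch(epoch)
        history.append(accuracy)
        if accuracy >= target_accuracy:
            break  # good enough: stop before the remaining epochs
    return history


# Simulated per-epoch accuracies standing in for a real training run.
simulated = iter([0.91, 0.96, 0.985, 0.99, 0.995])
history = train_until(lambda epoch: next(simulated))
```

With these simulated values, training stops after the third epoch, because 0.985 is the first accuracy at or above the 0.98 target.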

Custom translator - Model adjustment after training

I've used three parallel sentence files to train my custom translator model. I used no dictionary files and no tuning files. Now that training has finished and I've checked the test results, I want to make some adjustments to the model. Here are several questions:
Is it possible to tune the model after training? Am I right that the model can't be changed and the only way is to train a new model?
The best approach to adjusting the model is to use tune files. Is it correct?
There is no way to see an autogenerated tune file, so I have to provide my own tuning file for a more manageable tuning process. Is it so?
Could you please describe how the tuning file is generated when I have 3 sentence files with different numbers of sentences: 55k, 24k and 58k lines? Are all tuning sentences taken from the first file, or from all three files in proportion to their size? Which logic is used?

I wish there were more authoritative answers on this. I'll share what I know as a fellow user.
What Microsoft Custom Translator calls "tuning data" is what is normally known as a validation set. It's just a way to avoid overfitting.
Is it possible to tune the model after training? Am I right that the model can't be changed and the only way is to train a new model?
Yes, with Microsoft Custom Translator you can only train a model based on the generic category you have selected for the project.
(With Google AutoML technically you can choose to train a new model based on one of your previous custom models. However, it's also not usable without some trial and error.)
The best approach to adjusting the model is to use tune files. Is it correct?
It's hard to make a definitive statement on this. The training set also has an effect. A good validation set on top of a bad training set won't get us good results.
There is no way to see an autogenerated tune file, so I have to provide my own tuning file for a more manageable tuning process. Is it so?
Yes, it seems to me that if you let it decide how to split the training set into the training set, tuning set and test set, you can only download the training set and the test set.
Maybe neither includes the tuning set, so theoretically you can diff them. But that doesn't solve the problem of the split being different between different models.
... Which logic is used?
Good question.
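
If you decide to supply your own tuning and test files so that the split stays the same between models, one way to do it is a deterministic split keyed on the source sentence, as in the Python sketch below. This is only an assumption about how you might prepare the files yourself, not a description of what Custom Translator does internally. The function names and the 5% shares are made up for illustration.

```python
import hashlib


def assign_split(source, tune_share=0.05, test_share=0.05):
    """Map a source sentence to 'tune', 'test' or 'train', always the same way."""
    digest = hashlib.sha256(source.encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64
    if bucket < tune_share:
        return "tune"
    if bucket < tune_share + test_share:
        return "test"
    return "train"


def split_pairs(pairs, tune_share=0.05, test_share=0.05):
    """Split (source, target) sentence pairs into three lists."""
    splits = {"train": [], "tune": [], "test": []}
    for source, target in pairs:
        splits[assign_split(source, tune_share, test_share)].append((source, target))
    return splits


# Hypothetical parallel data pooled from all sentence files.
pairs = [("sentence %d" % i, "Satz %d" % i) for i in range(1000)]
splits = split_pairs(pairs)
```

Because the assignment depends only on the sentence text, retraining with extra data leaves existing sentences in the same split. Pooling all files before splitting also means each file contributes to the tuning set roughly in proportion to its size.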

Is it possible to let SPSS only display values that are significant in the Output?

Is it possible to display only significant P-values and/or R-values in the output of SPSS?
This would simplify the output significantly and reduce the tables to only the relevant parts (the ones I need).

I'm not sure that this is a good idea, but if you want to do things such as highlight significant coefficients in a regression or blank out nonsignificant correlations in a correlation matrix, the SPSSINC MODIFY OUTPUT extension command can do this. It is included in the Python Essentials for SPSS Statistics and can be downloaded from the SPSS Community site at www.ibm.com/developerworks/spssdevcentral or, for V21, from the same site where Statistics is kept for download or the trial site.

I agree that there are many cases where this is not a good idea.
In general, I find post-processing of SPSS output tables to be a little bit awkward. This is one area in which R is a lot easier to use.
For occasional analyses I often find it useful to paste an SPSS output table into Excel for further processing. For example, you could sort columns by size (e.g., mean difference, p-value, r, etc.), calculate new values (e.g., mean differences, absolute correlations, etc.), make the table easier to read, and so on.
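
The same kind of post-processing can be scripted once a table has been exported as CSV. The Python sketch below keeps only rows whose p-value is below a threshold. The column layout, the `Sig.` column name and the sample numbers are assumptions made for illustration, and real exported tables usually need some cleanup first.

```python
import csv
import io


def significant_rows(csv_text, p_column="Sig.", alpha=0.05):
    """Return the rows of a CSV table whose p-value is below `alpha`."""
    kept = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            p_value = float(row[p_column])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows without a usable p-value
        if p_value < alpha:
            kept.append(row)
    return kept


# Made-up table standing in for an exported correlation table.
table = "Variable,r,Sig.\nage,0.42,0.003\nheight,0.05,0.610\nweight,-0.31,0.021\n"
kept = significant_rows(table)
kept_names = [row["Variable"] for row in kept]
```

As both answers caution, hiding nonsignificant results is often not a good idea, so a filter like this is better suited to a working copy than to the table you report.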
