I want to recognise letters of the English alphabet using a hidden Markov model. I have extracted features using the zoning method.
I want to use the HTK toolkit for training. What format does HTK expect for the feature matrix, and how should I supply it as input?
How should the feature-vector matrix be referenced from the train.scp file?
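For reference, HTK expects each sample's features as an HTK parameter file: a 12-byte big-endian header (nSamples, sampPeriod, sampSize, parmKind) followed by big-endian 32-bit floats, with user-defined features stored under the USER parameter kind (code 9); train.scp is then just a plain-text list of these files, one path per line. A minimal Python sketch, where the file name, the 16-dimensional zoning vector, and the 10 ms sample period are placeholder assumptions:

import struct
import numpy as np

def write_htk_user_features(path, feats, samp_period=100000):
    # HTK header: nSamples (int32), sampPeriod (int32, in 100 ns units),
    # sampSize (int16, bytes per frame), parmKind (int16, 9 = USER).
    # Header and data are big-endian.
    feats = np.asarray(feats, dtype='>f4')
    n_samples, n_feats = feats.shape
    header = struct.pack('>iihh', n_samples, samp_period, n_feats * 4, 9)
    with open(path, 'wb') as f:
        f.write(header)
        f.write(feats.tobytes())

# Hypothetical zoning features for one letter: 20 frames x 16 features.
write_htk_user_features('A_0001.usr', np.random.rand(20, 16))

# train.scp is then a plain-text list of the feature files, one per line:
# A_0001.usr
# A_0002.usr
# ...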
I was curious whether it is possible to use transfer learning for text generation, i.e. to re-train or further pre-train a model on a specific kind of text.
For example, given a pre-trained BERT model and a small corpus of medical (or any other "type" of) text, can you build a language model that is able to generate medical text? The assumption is that you do not have a huge amount of medical text, which is why you have to use transfer learning.
Putting it as a pipeline, I would describe this as:
Using a pre-trained BERT tokenizer.
Obtaining new tokens from my new text and adding them to the existing pre-trained language model (i.e., vanilla BERT).
Re-training the pre-trained BERT model on the custom corpus with the combined tokenizer.
Generating text that resembles the text within the small custom corpus.
Does this sound familiar? Is it possible with Hugging Face?
I have not heard of the pipeline you just mentioned. To construct an LM for your use case, you basically have two options:
Further training the BERT (-base/-large) model on your own corpus. This process is called domain adaptation, as described in this recent paper. It adapts the learned parameters of the BERT model to your specific domain (bio/medical text). Note, however, that you will need quite a large corpus for the BERT model to update its parameters meaningfully.
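For this first option, a minimal further pre-training sketch with the Hugging Face transformers library could look roughly like the following (the corpus path, block size, and training hyper-parameters are placeholder assumptions; LineByLineTextDataset is used for brevity, although newer library versions favour the datasets package):

from transformers import (BertTokenizerFast, BertForMaskedLM,
                          DataCollatorForLanguageModeling,
                          LineByLineTextDataset, Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

# One sentence per line in medical_corpus.txt (placeholder path).
dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path='medical_corpus.txt',
                                block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)

args = TrainingArguments(output_dir='bert-domain-adapted',
                         num_train_epochs=3,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, data_collator=collator,
        train_dataset=dataset).train()
model.save_pretrained('bert-domain-adapted')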
Using a language model that has already been pre-trained on a large amount of domain-specific text, either from scratch or by fine-tuning the vanilla BERT model. As you may know, the vanilla BERT model released by Google was trained on Wikipedia and BookCorpus. Since then, researchers have trained the BERT architecture on other domains beyond those initial data collections, and you can use these pre-trained models, which have a deep understanding of domain-specific language. For your case, there are models such as BioBERT, BlueBERT, and SciBERT.
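If you go with this second option, such checkpoints can be loaded by name from the Hugging Face model hub; a small sketch (the model identifiers shown are the commonly used ones and should be double-checked):

from transformers import AutoTokenizer, AutoModel

# SciBERT (scientific text); BioBERT is published under names such as
# 'dmis-lab/biobert-base-cased-v1.1'.
tokenizer = AutoTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
model = AutoModel.from_pretrained('allenai/scibert_scivocab_uncased')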
Is it possible with Hugging Face?
I am not sure whether the Hugging Face developers have a robust approach for pre-training the BERT model on custom corpora (as they state, their code is still a work in progress), but if you are interested in this step, I suggest using Google Research's BERT code, which is written in TensorFlow and is robust (it was released by BERT's authors). In their README, under the "Pre-training with BERT" section, the exact process is described. This will give you a TensorFlow checkpoint, which can easily be converted to a PyTorch checkpoint if you would rather work with PyTorch/Transformers.
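As a hedged illustration of that conversion step, the transformers library can load a TensorFlow BERT checkpoint directly and re-save it in PyTorch format (paths are placeholders, and TensorFlow must be installed for the weight loading):

from transformers import BertConfig, BertForPreTraining

config = BertConfig.from_json_file('bert_config.json')
# Point at the .ckpt.index file written by Google's pre-training code.
model = BertForPreTraining.from_pretrained('bert_model.ckpt.index',
                                           from_tf=True, config=config)
model.save_pretrained('pytorch_bert')   # writes pytorch_model.bin + config.json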
It is entirely possible both to pre-train and to further pre-train BERT (or almost any other model available in the Hugging Face library).
Regarding the tokenizer: if you are pre-training on a small custom corpus (and therefore starting from a trained BERT checkpoint), you have to use the tokenizer that was used to train BERT. Otherwise, you will just confuse the model.
If your use case is text generation (from an initial sentence or fragment), then I advise you to check out GPT-2 (https://huggingface.co/gpt2). I haven't used GPT-2 myself, but from some basic research I think you can do:
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

# Load the pre-trained GPT-2 tokenizer and the model with its language-modelling
# head (the LM head is what is needed for text generation).
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = TFGPT2LMHeadModel.from_pretrained('gpt2')
and follow this tutorial on how to train the GPT-2 model: https://towardsdatascience.com/train-gpt-2-in-your-own-language-fc6ad4d60171
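As a rough sketch of the generation step itself, once the model with its LM head is loaded as above, something like the following should work (the prompt and sampling parameters are arbitrary placeholders):

# Encode a prompt, sample a continuation, and decode it back to text.
input_ids = tokenizer.encode('The patient was admitted with', return_tensors='tf')
output_ids = model.generate(input_ids, max_length=50, do_sample=True,
                            top_k=50, top_p=0.95)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))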
Note: I am not sure whether DeBERTa-V3, for example, can be pre-trained in the usual way. I have checked their GitHub repo and it seems there is no official pre-training code for V3 (https://github.com/microsoft/DeBERTa/issues/71). However, I think it can actually be done with Hugging Face; once I have time, I will run a pre-training script and verify this.
I have created an image classification model using the Microsoft Model Builder. Now I need to use that model to detect objects in a video stream and draw bounding boxes once an object is detected. I cannot find a C# sample that uses the model generated by Model Builder; all object detection samples use ONNX models, and I have not found a tool to convert the model.zip generated by Model Builder to model.onnx.
Any help would be appreciated.
The image classification model from Model Builder cannot detect objects in images - that is a different problem.
What you can do is combine the ONNX object detection sample with your own custom model.
Basically, you run the ONNX sample up until the parsing of the bounding boxes, then run each detected region of the image through your image classifier and use that label instead.
It is somewhat of a hack, and you will have a hard time getting anywhere near real-time performance.
ONNX sample for object detection:
https://github.com/dotnet/machinelearning-samples/tree/master/samples/csharp/getting-started/DeepLearning_ObjectDetection_Onnx
I have found ways to do CAM/saliency maps for multi-class classification, but not for the multi-label, multi-class case. Do you know of any resources I can use so that I don't reinvent the wheel, or do you have advice for implementing it?
My specific use case is a transfer-learned ResNet that outputs a binary 1x11 vector, where each entry corresponds to the presence of a certain feature in the input image. I want a saliency map for each feature, so I can see what the network was looking at when deciding whether each image has each of those features.
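One simple option, in case it helps, is plain gradient ("vanilla") saliency computed once per output unit: back-propagate each of the 11 logits separately and inspect the gradient on the input. A minimal PyTorch sketch, assuming model is the fine-tuned ResNet and image is a 1x3xHxW tensor:

import torch

def per_label_saliency(model, image, num_labels=11):
    model.eval()
    image = image.clone().requires_grad_(True)
    maps = []
    for k in range(num_labels):
        if image.grad is not None:
            image.grad = None                 # reset input gradient per label
        logits = model(image)                 # shape (1, num_labels)
        logits[0, k].backward()               # gradient of label k only
        # Collapse channels: take the max absolute gradient per pixel.
        maps.append(image.grad.abs().max(dim=1)[0].squeeze(0).detach().cpu())
    return maps                               # one HxW saliency map per label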
I want to train a basic translation system using only a glossary.
The language pair is EN to KO. I trained with 1,700 sentences in the Dictionary tab, in the manner described in the article below.
I did not select anything in the Training tab.
https://cognitive.uservoice.com/knowledgebase/articles/1166938-hub-building-a-custom-system-using-a-dictionary-o
However, contrary to expectation, the system did not translate the terms, and unlike what the document (Microsoft Translator Hub User Guide.pdf) says, the training took a long time to complete.
Dictionary-only training: You can now train a custom translation system with just a dictionary and no other parallel documents. There is no minimum size for that dictionary; one entry is enough. Just upload the dictionary, which is an Excel file with the language identifier as the column header, include it in your training set, and hit train. The training completes very quickly; then you can deploy and use your system with that dictionary. The dictionary applies the translation you provided with 100% probability, regardless of context. This type of training does not produce a BLEU score, and this option is only available if Microsoft models are available for the given language pair.
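For what it is worth, if the dictionary file itself is the problem: the quoted passage describes an Excel sheet whose column headers are the language identifiers, so a file produced like the following sketch (assuming "en" and "ko" are the expected headers, with placeholder entries) should match that format:

import pandas as pd

glossary = pd.DataFrame({
    'en': ['apple', 'book'],   # placeholder source terms
    'ko': ['사과', '책'],       # placeholder target terms
})
glossary.to_excel('glossary.xlsx', index=False)   # requires openpyxl (or another Excel writer)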
I would like to know why this training seems to skip the dictionary. If this is a feature that has not been rolled out yet, is there a planned schedule for it?
In addition, I am wondering whether there is a plan to add the dictionary feature to the NMT API as well.
Customizing NMT is available now by using Custom Translator (Preview), and we expect the dictionary feature to be available when Custom Translator becomes Generally Available.
You do need to be using the Microsoft Translator Text API v3, and Custom Translator supports language pairs for which NMT languages are available today (Korean is an NMT language).
Thank you.
Yes.
You can customize our en-ko general domain baseline with your dictionary. Please follow our quick start documentation.
I recently moved from Theano and Lasagne to Keras.
In Theano, I used a custom embedding layer, as described here:
How to keep the weight value to zero in a particular location using theano or lasagne?
It was useful when dealing with variable-length input padded to a fixed length.
Is such a custom embedding layer possible in Keras?
If so, how can I build it?
Or is such an embedding layer the wrong approach?
This may not be exactly what you want, but the solution I personally use, as in the Keras examples (e.g. this one), is to pad the data to a constant length before feeding it to the network.
Keras itself provides a pre-processing tool for this: keras.preprocessing.sequence.pad_sequences(sequences, maxlen=length).
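A short sketch of that approach, combined with mask_zero=True on the Embedding layer so that the padded positions (index 0) are ignored by downstream layers, which is roughly what the zero row in the custom Theano embedding achieved (vocabulary size and layer sizes below are arbitrary):

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
from keras.preprocessing.sequence import pad_sequences

# Variable-length sequences of word indices; 0 is reserved for padding.
seqs = [[5, 2, 9], [3, 1], [7, 4, 4, 8, 6]]
x = pad_sequences(seqs, maxlen=6, padding='post', value=0)

model = Sequential([
    # mask_zero=True tells downstream layers to skip the padded time steps.
    Embedding(input_dim=10, output_dim=8, mask_zero=True, input_length=6),
    LSTM(16),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')
print(model.predict(x).shape)   # (3, 1): one prediction per padded sequence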