How to train AllenNLP SRL on non-English languages?

I have been reading through the AllenNLP guide and documentation and was hoping to train an SRL BERT model on French.
The SRL demo page gives the command for training an SRL BERT model, seen below:
allennlp train \
https://raw.githubusercontent.com/allenai/allennlp-models/main/training_config/structured_prediction/bert_base_srl.jsonnet \
-s /path/to/output
Looking into that jsonnet file, AllenNLP points out that they use the CoNLL-formatted OntoNotes 5.0 data. Since, as AllenNLP mentions, this data is not publicly available, I went searching for what the format of this data looks like, which led me here.
Not fully understanding the format at that link, I found this description in AllenNLP's code for their Ontonotes class, which was extremely helpful.
In light of all the details above, I have a few questions:
When setting the environment variables SRL_TRAIN_DATA_PATH and SRL_VALIDATION_DATA_PATH that are used in the jsonnet file, does the directory structure need to look exactly like the structure described in the Ontonotes class code (seen below)? If not, what is the bare minimum if I will only have one file for training?
└── train
    └── data
        └── english
            └── annotations
                ├── bc
                ├── bn
                ├── mz
                ├── nw
                ├── pt
                ├── tc
                └── wb
My second question: using whatever directory structure is necessary, will I be able to train a French model if I create a file just like the CoNLL one, but with all the words in French?
Third and finally: if I can train an SRL BERT model using a CoNLL file in the appropriate format, do all of the columns in the CoNLL file need to contain data? For example, column 11 holds the named entities; is it necessary to have named entities for training, or can that column be left blank (i.e. nothing but hyphens)? If not all columns need data, which columns must have data for training and which can be empty?
I know it's a fair amount of questions so thank you in advance.

If you use the Ontonotes reader as is, I think you'll need a structure similar to the one described. Judging from this line, though, the subfolders don't need to be named exactly the same. You can also write your own dataset reader that reads the data in whatever format you have.
In theory, yes, you should be able to do that. The quality of the data will probably impact the model.
You can take a look at the SRL dataset reader, which uses the Ontonotes reader. From the _read() method, it looks like only words and srl_frames are used by the model, and I believe srl_frames is built from columns [11:-1], based on this.
To summarize: if you use the exact model in allennlp, those are the required columns. You may also choose to experiment with the other columns in a custom model.
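If you do write your own reader, here is a minimal sketch of the shape it takes, assuming AllenNLP 2.x. The registered name, the one-sentence-per-line word_TAG input format, and the field names are invented for illustration; the real SRL reader also adds a verb-indicator field:

from typing import Iterable, List

from allennlp.data import DatasetReader, Instance
from allennlp.data.fields import SequenceLabelField, TextField
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import Token


@DatasetReader.register("french-srl")  # hypothetical name
class FrenchSrlReader(DatasetReader):
    """Reads one sentence per line as word_TAG pairs; swap _read out
    for real CoNLL parsing."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._token_indexers = {"tokens": SingleIdTokenIndexer()}

    def _read(self, file_path: str) -> Iterable[Instance]:
        with open(file_path, encoding="utf-8") as data_file:
            for line in data_file:
                pairs = [token.rsplit("_", 1) for token in line.split()]
                words = [word for word, _ in pairs]
                tags = [tag for _, tag in pairs]
                yield self.text_to_instance(words, tags)

    def text_to_instance(self, words: List[str], tags: List[str]) -> Instance:
        tokens = TextField([Token(w) for w in words], self._token_indexers)
        return Instance({"tokens": tokens, "tags": SequenceLabelField(tags, tokens)})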

Related

How to write a configuration file to tell the AllenNLP trainer to randomly split dataset into train and dev

The official AllenNLP documentation suggests specifying "validation_data_path" in the configuration file, but what if one wants to construct a dataset from a single source and then randomly split it into train and validation sets with a given ratio?
Does AllenNLP support this? I would greatly appreciate your comments.
AllenNLP does not have this functionality yet, but we are working on some stuff to get there.
In the meantime, here is how I did it for the VQAv2 reader: https://github.com/allenai/allennlp-models/blob/main/allennlp_models/vision/dataset_readers/vqav2.py#L354
This reader supports Python slicing syntax where you, for example, specify a data_path of "my_source_file[:1000]" to take the first 1000 instances from my_source_file. You can also supply multiple paths by setting data_path: ["file1", "file2[:1000]", "file3[1000:]"]. You can probably steal the top two blocks in that file (lines 354 to 369) and put them into your own dataset reader to achieve the same result.
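If you only need the idea rather than that exact reader, the slicing syntax is straightforward to replicate. A rough sketch (the regex and helper names are mine, not the code from vqav2.py):

import itertools
import re

def parse_sliced_path(data_path: str):
    """Split e.g. 'train.jsonl[:1000]' into ('train.jsonl', slice(None, 1000))."""
    match = re.match(r"^(.+?)(?:\[(\d*):(\d*)\])?$", data_path)
    path, start, stop = match.group(1), match.group(2), match.group(3)
    as_int = lambda s: int(s) if s else None
    return path, slice(as_int(start), as_int(stop))

def sliced_lines(data_path: str):
    """Yield only the requested window of lines, without loading the whole file."""
    path, window = parse_sliced_path(data_path)
    with open(path, encoding="utf-8") as f:
        yield from itertools.islice(f, window.start, window.stop)

# "file[:1000]" -> the first 1000 lines; "file[1000:]" -> everything after them
train_lines = list(sliced_lines("my_source_file[:1000]"))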

Building a Pipeline Model using allennlp

I am pretty new to allennlp and I am struggling to build a model that does not seem to fit perfectly into the standard way of building models in allennlp.
I want to build an NLP pipeline model. The pipeline consists of two models, let's call them A and B. First A is trained, and then, based on the predictions of the fully trained A, B is trained.
What I have seen is that people define two separate models and train both using the command-line interface allennlp train ... in a shell script that looks like:
# set a bunch of environment variables
...
allennlp train -s $OUTPUT_BASE_PATH_A --include-package MyModel --force $CONFIG_MODEL_A
# prepare environment variables for model b
...
allennlp train -s $OUTPUT_BASE_PATH_B --include-package MyModel --force $CONFIG_MODEL_B
I have two concerns about that:
This code is hard to debug.
It's not very flexible. When I want to do a forward pass of the fully trained model, I have to write yet another bash script that does that.
Any ideas on how to do that in a better way?
I thought about using a Python script instead of a shell script and invoking allennlp.commands.main() directly. That way you at least have a single Python module that you can run under a debugger.
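For concreteness, that looks like the following; this is a common debugging recipe rather than anything official, and the config and output paths are placeholders:

import sys

from allennlp.commands import main

# equivalent to:
#   allennlp train config_a.jsonnet -s out_a --include-package MyModel --force
sys.argv = [
    "allennlp", "train", "config_a.jsonnet",
    "-s", "out_a", "--include-package", "MyModel", "--force",
]
main()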
There are two possibilities.
If you're really just plugging the output of one model into the input of another, you can merge them into one model and run it that way. You can do this with two already-trained models if you initialize the combined model with the two trained models using from_file. Doing it at training time is a little harder, but not impossible: you would train the first model as you do now, and for the second step you train the combined model directly, with the inner first model's weights frozen.
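A minimal PyTorch-level sketch of that combined model; real AllenNLP models take dictionaries of tensors rather than a single tensor, so treat this as the shape of the idea, not a drop-in:

import torch
from torch import nn


class PipelineModel(nn.Module):
    """Feed the output of a trained model A into model B, with A frozen."""

    def __init__(self, model_a: nn.Module, model_b: nn.Module):
        super().__init__()
        self.model_a = model_a
        self.model_b = model_b
        # freeze A so that only B's weights are updated in the second stage
        for param in self.model_a.parameters():
            param.requires_grad = False

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            intermediate = self.model_a(inputs)
        return self.model_b(intermediate)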
The other thing you can do is use AllenNLP as a library, without the config files. We have a template up on GitHub that shows you how to do this. The basic insight is that everything you configure in one of the Jsonnet configuration files corresponds 1:1 to a Python class that you can use directly from Python. There is no requirement to use the configuration files. If you use AllenNLP this way, you have much more flexibility, including chaining things together.
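To make the library route concrete, here is a rough outline assuming AllenNLP 2.x APIs; MyDatasetReader and MyModel are stand-ins for your own classes:

from torch.optim import Adam

from allennlp.data import Vocabulary
from allennlp.data.data_loaders import SimpleDataLoader
from allennlp.training import GradientDescentTrainer

reader = MyDatasetReader()                  # your own DatasetReader subclass
instances = list(reader.read("train.conll"))
vocab = Vocabulary.from_instances(instances)

loader = SimpleDataLoader(instances, batch_size=32, shuffle=True)
loader.index_with(vocab)

model = MyModel(vocab)                      # your own Model subclass
trainer = GradientDescentTrainer(
    model=model,
    optimizer=Adam(model.parameters()),
    data_loader=loader,
    num_epochs=5,
)
trainer.train()

Chaining the two stages is then just ordinary Python: train A, build the combined model around it, and train again.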

How should a "project" file be written?

With popular software packages, like Microsoft Word or Photoshop, we often have the option to save our progress as a "project" file and can later open that file to edit our work further. This file often contains all the options and the progress the user has made (i.e. the essay you typed in Word).
So my question is: if I am building a similar application that requires creating a similar "project" file, how should I go about it? My application is a scientific application, which means it requires a lot of (multi-dimensional) arrays. I understand there are a lot of options for doing this, but I would like to know the de facto way.
Here are some of the options I have outlined:
XML: Human-readable, but the size is too big and it's too much work to deal with arrays.
JSON: More popular/modern. Good with arrays.
Protocol Buffers: Created by Google. Probably faster.
Database: Probably not a good fit, since "project" files are most likely "temporary". Also, working with arrays is not very straightforward.
Creating your own binary format: Might be the most difficult solution for an inexperienced programmer like myself.
???
I would like to get some advice from you guys. Thank you :).
(Good question. :) Only some thoughts.) I'd prefer a text format for the main project file; you can diff it, and open, read, and modify it easily. Large ASCII or binary data can be stored as serialized data in external files, or in a database like SQLite, from where it can easily be accessed and processed by the application. The main project file then holds links to the external data store.
My advice for the main project file is a simple XML format that can easily be transformed to JSON. A list of key-value pairs (a dict) is good for a start; a value can be of a basic datatype, or be an array or dict. A complicated XML tree is not good. The key name can also help describe and structure the data, so I'd prefer key="rect.4711.pos.x" value="500" over <rect id="4711"><pos><x>500</x>...</pos>....
An important aspect is that the project data is portable and self-contained, and that the user can see the project as a single unit even if it is a directory on the file system; for this purpose, supporting some kind of zipped format for the project data is good.
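To make the zipped, self-contained variant concrete, here is a small Python sketch along those lines; the manifest name and archive layout are invented for illustration:

import io
import json
import zipfile

import numpy as np

def save_project(path, settings: dict, arrays: dict):
    """Write a single zipped 'project' file: a JSON manifest for the
    options plus one .npy entry per array."""
    with zipfile.ZipFile(path, "w") as zf:
        zf.writestr("manifest.json", json.dumps(settings, indent=2))
        for name, arr in arrays.items():
            buf = io.BytesIO()
            np.save(buf, arr)
            zf.writestr(f"arrays/{name}.npy", buf.getvalue())

def load_project(path):
    with zipfile.ZipFile(path) as zf:
        settings = json.loads(zf.read("manifest.json"))
        arrays = {}
        for info in zf.infolist():
            if info.filename.startswith("arrays/"):
                name = info.filename[len("arrays/"):-len(".npy")]
                arrays[name] = np.load(io.BytesIO(zf.read(info.filename)))
    return settings, arrays

# usage: flat key-value settings, big arrays kept out of the manifest
save_project("scene.proj", {"rect.4711.pos.x": 500},
             {"heightmap": np.zeros((256, 256))})
settings, arrays = load_project("scene.proj")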

How do I train tesseract 4 with image data instead of a font file?

I'm trying to train Tesseract 4 with images instead of fonts.
In the docs they are explaining only the approach with fonts, not with images.
I know how it works when using a prior version of Tesseract, but I didn't get how to use the box/tiff files to train with LSTM in Tesseract 4.
I looked into tesstrain.sh, which is used to generate LSTM training data but couldn't find anything helpful. Any ideas?
Clone the tesstrain repo at https://github.com/tesseract-ocr/tesstrain.
You'll also need to clone the tessdata_best repo, https://github.com/tesseract-ocr/tessdata_best, which acts as the starting point for your training. It takes hundreds of thousands of samples of training data to get good accuracy, so using a good starting point lets you fine-tune with much less data (tens to hundreds of samples can be enough).
Add your training samples to the directory in the tesstrain repo named ./tesstrain/data/my-custom-model-ground-truth
Your training samples should be image/text file pairs that share the same name but have different extensions. For example, you should have an image file named 001.png that is a picture of the text foobar, and a text file named 001.gt.txt that contains the text foobar.
These files need to be single lines of text.
In the tesstrain repo, run this command:
make training MODEL_NAME=my-custom-model START_MODEL=eng TESSDATA=~/src/tessdata_best
Once the training is complete, there will be a new file tesstrain/data/my-custom-model.traineddata. Copy that file to the directory where Tesseract searches for models; on my machine, it was /usr/local/share/tessdata/.
Then, you can run tesseract and use that model as a language.
tesseract -l my-custom-model foo.png -
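If you want to call the model from Python instead, the pytesseract wrapper takes the same language name; this assumes pytesseract and Pillow are installed and the .traineddata file is already in the tessdata directory:

from PIL import Image
import pytesseract

# "my-custom-model" is the MODEL_NAME used during training above
text = pytesseract.image_to_string(Image.open("foo.png"), lang="my-custom-model")
print(text)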

Weka: Limitations on what one can output as source?

I was consulting several references to discover how I might output trained Weka models as Java source code, so that I can use the classifiers I am training in actual code for research applications I have been developing.
As I was playing with Weka 3.7, I noticed that while it does output Java code to its main text buffer when using simpler (in my case supervised) classification methods such as the J48 decision tree, it removes the option to output Java code for RandomTree and RandomForest (rather, it voids it by disabling the checkbox and greying out the text), which are the classifiers that give me the best performance in my situation.
Note: I am clicking on the "More Options" button and checking "Output source code:".
Does Weka not allow you to output RandomTree or RandomForest as Java code? If so, why? Or, if it does and just doesn't put it in the output buffer (since a RandomForest is multiple decision trees, which I imagine it doesn't want to waste buffer space on), how does one find where in the file system Weka outputs Java code by default?
Are there any tricks to get Weka to give me my trained RandomForest as Java code? Or is serialization of the output *.model files my only hope when it comes to RF and RandomTree?
Thanks in advance to those who provide help.
NOTE (as an addendum to the answer provided below): if you run across a similar situation (requiring you to use your trained classifier/ML model in your code), I recommend following the links posted in the answer to my question. If you do not specifically need the Java code for the RandomForest, for example, deserializing the model works quite nicely and fits into Java application code, fulfilling its task as a trained model meant to predict future unlabelled instances.
RandomTree and RandomForest can't be output as Java code. I'm not sure of the reasoning why, but they don't implement the "Sourceable" interface.
This explains a little about outputting a classifier as Java code: Link 1
This shows which classifiers can be output as Java code: Link 2
Unfortunately, I think the easiest route will be serialization, although you could maybe try implementing "Sourceable" for other classifiers on your own.
Another, but perhaps inconvenient, solution would be to use Weka to rebuild the classifier every time you use it. You wouldn't need to load the ".model" file, but you would need to load your training data and relearn the model. Here is a starter's guide to building classifiers in your own Java code: http://weka.wikispaces.com/Use+WEKA+in+your+Java+code.
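Both of those routes can also be scripted against Weka's standard command-line options (-t/-d to train and save a model, -l/-T/-p to load one and print predictions); in Java proper, weka.core.SerializationHelper offers the same read/write functionality. A rough Python sketch, with placeholder jar and data paths:

import subprocess

# placeholder paths; adjust to wherever weka.jar and your data live
WEKA = ["java", "-cp", "weka.jar"]

# train a RandomForest on train.arff and serialize it to rf.model (-t/-d)
subprocess.run(WEKA + ["weka.classifiers.trees.RandomForest",
                       "-t", "train.arff", "-d", "rf.model"], check=True)

# later: load the serialized model and print predictions for new data (-l/-T/-p)
subprocess.run(WEKA + ["weka.classifiers.trees.RandomForest",
                       "-l", "rf.model", "-T", "unlabeled.arff", "-p", "0"],
               check=True)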
I solved the problem for myself by turning the output of WEKA's -printTrees option for the RandomForest classifier into Java source code.
http://pielot.org/2015/06/exporting-randomforest-models-to-java-source-code/
Since I am using classifiers with Android, all of the existing options had disadvantages:
shipping Android apps with serialized models didn't reliably work across devices
computing the model on the phone took too many resources
The final code will consist of three classes only: the class with the generated model + two classes to make the classification work.