Pre-processing training data for Microsoft Custom Translator Text JA->EN? (tokenization, lowercase)

I'm creating a custom model out of a training set in Microsoft Translator Text for Japanese (JA) to English (EN) translation. Should the training data be tokenized, and is all lowercase preferable?
In Japanese, the quotation characters (「」 and 『』) are different from those used in English. In the JA training data, should these be tokenized (separated by a space)? In the parallel EN training data, should the EN quotation marks ("") be used, or the JA quotation marks?
Beyond that, is any other pre-processing desirable, such as transforming the text to all lowercase? The text casing returned by the deployed model does not matter.

Leave the training material as you would present it to a human reader, with casing and punctuation intact. Casing and punctuation matter in translation; they are relevant signals for the engine to receive. There is no reason to apply your own tokenization, as it would interfere with the system's tokenization.
The best training material is sentence- or segment-aligned, such as you would get in a TMX or XLIFF export from a translation memory (TM).

Related

Word2Vec - How can I store and retrieve extra information regarding each instance of corpus?

I need to combine Word2Vec with my CNN model. To this end, I need to persist a flag (a binary one is enough) for each sentence, as my corpus has two types (a.k.a. target classes) of sentences. So, I need to retrieve this flag for each vector after creation. How can I store and retrieve this information alongside the input sentences of Word2Vec, as I need both of them in order to train my deep neural network?
p.s. I'm using Gensim implementation of Word2Vec.
p.s. My corpus has 6,925 sentences, and Word2Vec produces 5,260 vectors.
Edit: More detail regarding my corpus (as requested):
The structure of the corpus is as follows:
sentences (label: positive) -- A Python list
    Feature-A: String
    Feature-B: String
    Feature-C: String
sentences (label: negative) -- A Python list
    Feature-A: String
    Feature-B: String
    Feature-C: String
Then all the sentences were given as the input to Word2Vec.
from gensim.models import Word2Vec
word2vec = Word2Vec(all_sentences, min_count=1)
I'll feed my CNN with the extracted features (the vocabulary, in this case) and the targets of the sentences. So, I need the labels of the sentences as well.
Because the Word2Vec model doesn't retain any representation of the individual training texts, this is entirely a matter for you in your own Python code.
That doesn't seem like very much data. (It's rather tiny for typical Word2Vec purposes to have just a 5,260-word final vocabulary.)
Unless each text (aka 'sentence') is very long, you could even just use a Python dict where each key is the full string of a sentence, and the value is your flag.
But if, as is likely, your source data has some other unique identifier per text – like a unique database key, or even a line/row number in the canonical representation – you should use that identifier as a key instead.
In fact, if there's a canonical source ordering of your 6,925 texts, you could just have a list flags with 6,925 elements, in order, where each element is your flag. When you need to know the status of a text from position n, you just look at flags[n].
(To make more specific suggestions, you'd need to add more details about the original source of the data, and exactly when/why you'd need to be checking this extra property later.)
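As a minimal sketch of the list/dict idea above (all variable names here are assumptions for illustration, not anything Gensim provides or stores for you):

from gensim.models import Word2Vec

# Assumed example corpora: lists of pre-tokenized sentences per class.
positive_sentences = [["good", "movie"], ["great", "acting"]]
negative_sentences = [["bad", "plot"], ["poor", "sound"]]

all_sentences = positive_sentences + negative_sentences

# Option 1: a flags list aligned with the canonical ordering of all_sentences.
flags = [1] * len(positive_sentences) + [0] * len(negative_sentences)

# Option 2: a dict keyed by the full sentence text (works only if sentences are unique).
flag_by_text = {" ".join(tokens): flag for tokens, flag in zip(all_sentences, flags)}

word2vec = Word2Vec(all_sentences, min_count=1)

# Later: the label of the text at position n is flags[n], and its word vectors
# come from word2vec.wv; Word2Vec itself never stores the labels.
n = 2
label = flags[n]
vectors = [word2vec.wv[tok] for tok in all_sentences[n]]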

Language translation using TorchText (PyTorch)

I have recently started with ML/DL using PyTorch. The following PyTorch tutorial explains how we can train a simple model for translating from German to English:
https://pytorch.org/tutorials/beginner/torchtext_translation_tutorial.html
However, I am confused about how to use the model for running inference on custom input. From my understanding so far:
1) We will need to save the "vocab" for both German (input) and English (output) [using torch.save()] so that they can be used later for running predictions.
2) At the time of running inference on a German paragraph, we will first need to convert the German text to a tensor using the German vocab.
3) The above tensor will be passed to the model's forward method for translation.
4) The model will return a tensor for the destination language, i.e., English in the current example.
5) We will use the English vocab saved in the first step to convert this tensor back into English text (a rough sketch of these steps follows below).
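For concreteness, here is a minimal sketch of those five steps. Everything in it is a stand-in assumption: toy dict vocabs and a dummy model replace the tutorial's real torchtext vocabs and trained seq2seq network, whose forward pass may also need a target tensor or a step-by-step decoding loop.

import torch

# Toy placeholders (assumptions, not the tutorial's objects).
de_vocab = {"<unk>": 0, "ein": 1, "haus": 2}        # German token -> index
en_itos = {0: "<unk>", 1: "a", 2: "house"}          # English index -> token

class DummyModel(torch.nn.Module):
    def forward(self, src):
        # Returns random scores over the English vocab, shape (seq_len, 1, vocab_size).
        return torch.randn(src.size(0), 1, len(en_itos))
model = DummyModel()

# Step 1: persist the vocab; torch.save can pickle plain Python containers.
torch.save(de_vocab, "de_vocab.pt")

# Step 2: German text -> tensor of source-vocab indices (toy whitespace tokenization).
de_vocab = torch.load("de_vocab.pt")
src = torch.tensor([de_vocab.get(tok, 0) for tok in "ein haus".split()]).unsqueeze(1)

# Steps 3-4: forward pass produces scores over the English vocab.
with torch.no_grad():
    output = model(src)

# Step 5: pick the best index per position and map back to English tokens.
indices = output.argmax(dim=-1).squeeze(1).tolist()
translation = " ".join(en_itos[i] for i in indices)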
Questions:
1) If the above understanding is correct, can these steps be treated as a generic approach for running inference on any language-translation model, provided we know the source and destination languages and have the vocab files for them? Or can we use the vocab provided by third-party libraries like spaCy?
2) How do we convert the output tensor returned from the model back to the target language? I couldn't find any example of how to do that. The tutorial above only explains how to convert the input text to a tensor using the source-language vocab.
I could easily find various examples and detailed explanations for image/vision models, but not much for text.
Yes, globally what you are saying is correct, and of course you can use any vocab, e.g. one provided by spaCy. To convert a tensor into natural text, one of the most common techniques is to keep both a dict that maps indexes to words and another dict that maps words to indexes; the code below builds both:
from collections import defaultdict

tok2idx = defaultdict(lambda: 0)   # unknown tokens map to index 0
idx2tok = {}
index = 1                          # start at 1, reserving 0 for unknowns
for seq in sequences:
    for tok in seq:
        if tok not in tok2idx:
            tok2idx[tok] = index
            idx2tok[index] = tok
            index += 1
Here sequences is a list of all the sequences (i.e. the sentences in your dataset). You can adapt this easily if you only have a flat list of words or tokens, by keeping just the inner loop.
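Once idx2tok is built, converting a model output back to text is just a lookup. A tiny sketch, assuming the output has already been reduced to a tensor of target-vocab indices (e.g. by argmax or greedy decoding; the indices here are made up):

import torch

output_indices = torch.tensor([4, 17, 9, 2])   # assumed example indices
translated = " ".join(idx2tok.get(int(i), "<unk>") for i in output_indices)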

Is there a unicode character that is specifically not meant to be used normally, but instead only functions as a CSV separator?

Is there a Unicode character that is specifically not meant to be used normally, but instead only functions as a CSV separator? I know CSV stands for comma-separated, but I use the term here since it is the most common one for the concept I'm asking about. Basically, I would like to know whether there is a code point that was added to Unicode only for the purpose of being used as a separator character between records in a text file.
Yes: 0x1C through 0x1F, the control characters FS (file separator), GS (group separator), RS (record separator), and US (unit separator). They were created specifically for what you intend (standardised in ANSI X3.4-1968, i.e. ASCII, and later carried over into Unicode).
Summary from English Wikipedia:
Can be used as delimiters to mark fields of data structures. If used for hierarchical levels, US is the lowest level (dividing plain-text data items), while RS, GS, and FS are of increasing level to divide groups made up of items of the level beneath it.
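For example, in Python you could write and read such a file yourself using US between fields and RS between records; this is just one possible convention, the character choices being the only thing the standard fixes:

US = "\x1f"   # unit separator: between fields
RS = "\x1e"   # record separator: between records

rows = [["Alice", "42", "Paris"], ["Bob", "7", "Houston"]]

# Serialize: join fields with US, records with RS.
data = RS.join(US.join(fields) for fields in rows)

# Parse it back.
parsed = [record.split(US) for record in data.split(RS)]
assert parsed == rows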

What is the likely meaning of this character sequence? A&#C

I'm working on an application that imports data from a CSV file. I am told that the data in the CSV file comes from SAP, which I am totally unfamiliar with.
My client indicates that there is an issue. One column of data in the CSV file contains postal addresses. Sometimes, the system doesn't see a valid address. Here is a slightly fictionalized example:
1234 MAIN ST A&#C HOUSTON
As you can see, there is a street number, a street name, and a city, all in capital letters. There is no state or zip code specified. In the CSV file, all addresses are assumed to be in the same state.
Normally, where there is text between the street name and the city, it is an apartment number or letter. In the above example, we get errors when we try to use the address with other services, such as Google geolocation. One suggested fix is to simply strip out these special characters, but I believe there must be a better way.
I want to know what this A&#C means. It looks like some sort of escape sequence, but it isn't in a format I'm familiar with. Please tell me what this strange character sequence means.
I'm not totally sure, but I doubt there's a "canonical" escape sequence that looks like this. In the ABAP environment, # is used to replace non-printable characters. It might be that the data was improperly sanitized when it was imported into the SAP system in the first place, and when the output file was written, some non-printable character was replaced by #. Another explanation might be that one of the fields contained a non-ASCII Unicode character that the export program failed to convert to the selected target codepage. It's hard to tell without examining the actual source dataset. Of course, it might also be some programming error or a weird custom field separator...

storing tamil values in database

I have stored Tamil content as something like &agrave..........
But some content is stored as #2220.......
So while retrieving, a problem arises when I try to decode it back to the original Tamil content.
How can I convert the values from #2220........ back to &agrave.......?
In XML, &#nnnn; is a decimal numeric character reference, and &#xhhhh; is the hexadecimal form. Both refer to the Unicode character with that code point.
In HTML, there is also a set of named character entities, like &agrave; for à. You can use them in XML if your DTD includes their definitions.
In any case, any conforming XML parser will convert either form to the corresponding Unicode character. When you put your text into your database, &agrave; was converted to a single Unicode character. When you pulled it out, the mechanism you used did not choose to represent it with the symbolic name, but rather with the general numeric form.
If you want symbolic names, chances are that you need to post-process the output to get them.
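For example, in Python the standard-library html module decodes both named entities and numeric references into the actual Unicode characters (a small illustration using a real Tamil code point, not your exact data):

import html

print(html.unescape("&agrave;"))   # 'à'  - named HTML entity
print(html.unescape("&#2950;"))    # 'ஆ'  - decimal reference for U+0B86, a Tamil letter
print(html.unescape("&#x0B86;"))   # 'ஆ'  - the same character in hexadecimal form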