Can one define a chunk grammar for OpenNLP? - nltk

NLTK supports the defining of a chunk grammar. For example, I can define a chunk grammar for a noun phrase (NP) like this:
grammar = "NP: {<DT>?<JJ>*<NN>}"
I don't see a way to tell OpenNLP what a noun phrase is. I imagine I could walk the parse tree but that seems painful.
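For reference, here is a minimal sketch of how that grammar is used on the NLTK side; the tagged sentence is made up purely for illustration:

import nltk

# The chunk grammar from above: an optional determiner, any number of
# adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"
parser = nltk.RegexpParser(grammar)

# A hypothetical POS-tagged sentence.
tagged = [("the", "DT"), ("quick", "JJ"), ("brown", "JJ"),
          ("fox", "NN"), ("jumps", "VBZ")]

tree = parser.parse(tagged)
print(tree)  # (S (NP the/DT quick/JJ brown/JJ fox/NN) jumps/VBZ)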

Related

Determining a Part-of-Speech without NLTK or spaCy but rather regex

How can I identify whether a word is a noun, adjective or verb without using an external library like NLTK or spaCy, but simply using regex?

Word2Vec - How can I store and retrieve extra information regarding each instance of corpus?

I need to combine Word2Vec with my CNN model. To this end, I need to persist a flag (a binary one is enough) for each sentence, as my corpus has two types (a.k.a. target classes) of sentences. So, I need to retrieve this flag for each vector after creation. How can I store and retrieve this information alongside the input sentences of Word2Vec, since I need both of them in order to train my deep neural network?
p.s. I'm using Gensim implementation of Word2Vec.
p.s. My corpus has 6,925 sentences, and Word2Vec produces 5,260 vectors.
Edit: More detail regarding my corpus (as requested):
The structure of the corpus is as follows:
sentences (label: positive) -- a Python list
    Feature-A: String
    Feature-B: String
    Feature-C: String
sentences (label: negative) -- a Python list
    Feature-A: String
    Feature-B: String
    Feature-C: String
Then all the sentences were given as the input to Word2Vec.
from gensim.models import Word2Vec

word2vec = Word2Vec(all_sentences, min_count=1)
I'll feed my CNN with the extracted features (the vocabulary vectors, in this case) and the targets of the sentences, so I need the labels of the sentences as well.
Because the Word2Vec model doesn't retain any representation of the individual training texts, this is entirely a matter for you in your own Python code.
That doesn't seem like very much data. (It's rather tiny for typical Word2Vec purposes to have just a 5,260-word final vocabulary.)
Unless each text (aka 'sentence') is very long, you could even just use a Python dict where each key is the full string of a sentence, and the value is your flag.
But if, as is likely, your source data has some other unique identifier per text – like a unique database key, or even a line/row number in the canonical representation – you should use that identifier as a key instead.
In fact, if there's a canonical source ordering of your 6,925 texts, you could just have a list flags with 6,925 elements, in order, where each element is your flag. When you need to know the status of a text from position n, you just look at flags[n].
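As a rough sketch of that idea (the names positive_sentences, negative_sentences and sentence_flags are just illustrative, and each sentence is assumed to be a list of tokens, as Gensim expects):

# Keep a parallel list of flags, in the same canonical order as the sentences.
all_sentences = positive_sentences + negative_sentences
sentence_flags = [1] * len(positive_sentences) + [0] * len(negative_sentences)

# The label of the text at position n is then simply:
n = 42
label_of_nth_text = sentence_flags[n]

# Or, if each text is short and unique, key a dict by the full sentence string:
flag_by_text = {" ".join(sent): flag
                for sent, flag in zip(all_sentences, sentence_flags)}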
(To make more specific suggestions, you'd need to add more details about the original source of the data, and exactly when/why you'd need to be checking this extra property later.)

Language translation using TorchText (PyTorch)

I have recently started with ML/DL using PyTorch. The following pytorch example explains how we can train a simple model for translating from German to English.
https://pytorch.org/tutorials/beginner/torchtext_translation_tutorial.html
However, I am confused about how to use the model for running inference on custom input. My understanding so far:
1) We will need to save the "vocab" for both German (input) and English (output) [using torch.save()] so that they can be used later for running predictions.
2) At the time of running inference on a German paragraph, we will first need to convert the German text to a tensor using the German vocab.
3) The above tensor will be passed to the model's forward method for translation.
4) The model will return a tensor for the destination language, i.e., English in the current example.
5) We will use the English vocab saved in the first step to convert this tensor back to English text.
Questions:
1) If the above understanding is correct, can these steps be treated as a generic approach for running inference on any language-translation model, provided we know the source and destination languages and have the vocab files for them? Or can we use the vocab provided by third-party libraries like spaCy?
2) How do we convert the output tensor returned from the model back into the target language? I couldn't find any example of how to do that. The tutorial above explains how to convert the input text to a tensor using the source-language vocab.
I could easily find various examples and detailed explanations for image/vision models, but not much for text.
Yes, globally what you are saying is correct, and of course you can use any vocab, e.g. one provided by spaCy. To convert a tensor into natural text, one of the most common techniques is to keep both a dict that maps indexes to words and another dict that maps words to indexes; the code below builds them:
from collections import defaultdict

tok2idx = defaultdict(lambda: 0)  # unknown tokens map to index 0
idx2tok = {}
index = 1  # start at 1 so that 0 stays reserved for unknown tokens

for seq in sequences:
    for tok in seq:
        if tok not in tok2idx:
            tok2idx[tok] = index
            idx2tok[index] = tok
            index += 1
Here sequences is a list of all the sequences (i.e. sentences) in your dataset. You can easily adapt this if you only have a flat list of words or tokens, by keeping just the inner loop.
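To answer the second question concretely, here is a sketch of mapping the model output back to text with the idx2tok dict above; the [seq_len, vocab_size] shape of output and the special-token names are assumptions, not something the tutorial guarantees:

# output: tensor of shape [seq_len, vocab_size] with a score per vocab entry.
# Take the most likely index at each position (greedy decoding).
predicted_indexes = output.argmax(dim=-1).tolist()

# Map indexes back to tokens, skipping anything not in the vocab mapping.
tokens = [idx2tok[i] for i in predicted_indexes if i in idx2tok]

# Drop special tokens (names depend on how the vocab was built) and join.
translation = " ".join(tok for tok in tokens if tok not in ("<sos>", "<eos>", "<pad>"))
print(translation)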

What type of chart is used in ECMA-404?

I found this figure in the JSON spec:
http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf
What is the name of this type of chart?
Can I use this chart for specifying any programming language?
They are usually called railroad diagrams, and they are basically a way of presenting a finite-state automaton's transition diagram in a readable fashion. It is straightforward to convert any regular expression into this format, but some regular expressions are tidier than others.
There are variations which work for context-free languages, so you will also find railroad diagrams for push-down automata. It is common that some high level of the grammar (for example, the non-terminal statement) can be expressed as a regular expression of lower-level components (such as expression).

convert html entities to unicode(utf-8) strings in c? [duplicate]

Possible Duplicate:
How to decode HTML Entities in C?
This question is very similar to that one, but I need to do the same thing in C, not Python. Here are some examples of what the function should do:
input     output
&lt;      <
&gt;      >
&auml;    ä
&szlig;   ß
The function should have the signature char *html2str(char *html) or similar. I'm not reading byte by byte from a stream.
Is there a library function I can use?
There isn't a standard library function to do the job. There must be a large number of implementations available in the open-source world; just about any program that has to deal with HTML will have one.
There are two aspects to the problem:
Finding the HTML entities in the source string.
Inserting the appropriate replacement text in its place.
Since the shortest possible entity is '&x;' (but, AFAIK, they all use at least 2 characters between the ampersand and the semicolon), the replacement text will never be longer than the entity it replaces, because the longest possible UTF-8 character representation is 4 bytes. Hence, it is possible to edit in situ safely.
There's an illustration of HTML entity decoding in 'The Practice of Programming' by Kernighan and Pike, though it is done somewhat 'in passing'. They use a tokenizer to recognize the entity, and a sorted table of entity names plus the replacement value so that they can use a binary search to identify the replacements. This is only needed for the named entities. For numeric entities such as '&#223;' (which decodes to 'ß'), you use an algorithmic technique to decode them.
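The same idea, sketched in Python rather than C just to show the shape of the algorithm (a real C version would use a sorted array of name/value pairs plus bsearch(), and the table below is a tiny illustrative subset, not the full entity list):

import re

# Tiny subset of the named-entity table, purely for illustration.
NAMED = {"lt": "<", "gt": ">", "amp": "&", "auml": "ä", "szlig": "ß"}

def decode_entity(match):
    body = match.group(1)
    if body.startswith("#x") or body.startswith("#X"):  # hexadecimal numeric entity
        return chr(int(body[2:], 16))
    if body.startswith("#"):                            # decimal numeric entity
        return chr(int(body[1:]))
    return NAMED.get(body, match.group(0))              # named entity, or leave untouched

def html2str(html):
    return re.sub(r"&([#\w]+);", decode_entity, html)

print(html2str("&lt;b&gt;&auml;&#223;&#xDF;&lt;/b&gt;"))  # prints <b>äßß</b>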
This sounds like a job for flex. Granted, flex is usually stream-based, but you can change that using the flex function yy_scan_string (or its relatives). For details, see The flex Manual: Scanning Strings.
Flex's basic Unicode support is pretty bad, but if you don't mind coding the byte handling by hand, it could be a workaround. There are probably other tools that can do what you want as well.