Improving the recall of a custom Named Entity Recognition (NER) model in spaCy / NLTK

This is the second part of another question I posted. The two are different enough to be separate questions, but they could be related.
Previous question
Building a Custom Named Entity Recognition with Spacy, using random text as a sample
I have built a custom Named Entity Recognizer (NER) using the method described in the previous question. I essentially copied the method for building the NER from the spaCy website (under "Named Entity Recognizer" at https://spacy.io/usage/training#ner).
The custom NER sort of works. If I sentence-tokenize the text and lemmatize the words (so "strawberries" becomes "strawberry"), it can pick up an entity. However, it stops there; it sometimes picks up two entities, but very rarely.
Is there anything I can do to improve its accuracy?
Here is the code. I have TRAIN_DATA in this format, but for food items:
TRAIN_DATA = [
    ("Uber blew through $1 million a week", {"entities": [(0, 4, "ORG")]}),
    ("Google rebrands its business apps", {"entities": [(0, 6, "ORG")]}),
]
The data is in the object train_food
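For illustration, entries in train_food would look something like this (hypothetical examples with hand-computed offsets; the actual food data is not shown here):
train_food = [
    ("I bought a strawberry and some garlic",
     {"entities": [(11, 21, "FOOD"), (31, 37, "FOOD")]}),
    ("she ate a tomato for lunch",
     {"entities": [(10, 16, "FOOD")]}),
]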
# imports needed by the training loop below
import random
import warnings

import spacy
import nltk
from spacy.util import minibatch, compounding

nlp = spacy.blank("en")

# Create the built-in NER pipeline component and add it to the pipeline
if "ner" not in nlp.pipe_names:
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner, last=True)
else:
    ner = nlp.get_pipe("ner")

# Add the labels from the food training data
for _, annotations in train_food:
    for ent in annotations.get("entities"):
        ner.add_label(ent[2])

# get names of other pipes to disable them during training
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
model = "en"
n_iter = 20

# only train the NER
with nlp.disable_pipes(*other_pipes), warnings.catch_warnings():
    # show warnings for misaligned entity spans once
    warnings.filterwarnings("once", category=UserWarning, module="spacy")
    # reset and initialize the weights randomly, since we're training a new model
    nlp.begin_training()
    for itn in range(n_iter):
        random.shuffle(train_food)
        losses = {}
        # batch up the examples using spaCy's minibatch
        batches = minibatch(train_food, size=compounding(4.0, 32.0, 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(
                texts,        # batch of texts
                annotations,  # batch of annotations
                drop=0.5,     # dropout - make it harder to memorise data
                losses=losses,
            )
        print("Losses", losses)
text = "mike went to the supermarket today. he went and bought a potatoes, carrots, towels, garlic, soap, perfume, a fridge, a tomato, tomatoes and tuna."
After this, using text as a sample, I ran this code:
from nltk.stem import WordNetLemmatizer

# lemmatizer was not defined in the snippet above; WordNetLemmatizer is assumed here
lemmatizer = WordNetLemmatizer()

def text_processor(text):
    text = text.lower()
    token = nltk.word_tokenize(text)
    ls = []
    for x in token:
        p = lemmatizer.lemmatize(x)
        ls.append(f"{p}")
    new_text = " ".join(map(str, ls))
    return new_text

# note: this redefines the name `ner` used for the pipe component above
def ner(text):
    new_text = text_processor(text)
    tokenizer = nltk.PunktSentenceTokenizer()
    sentences = tokenizer.tokenize(new_text)
    for sentence in sentences:
        doc = nlp(sentence)
        for ent in doc.ents:
            print(ent.text, ent.label_)

ner(text)
This results in
potato FOOD
carrot FOOD
Running the following code
ner("mike went to the supermarket today. he went and bought garlic and tuna")
Results in
garlic FOOD
Ideally, I want the NER to pick up potato, carrot and garlic. Is there anything I can do?
Thank you
Kah

While you are training your model, you can try some data preprocessing and augmentation techniques, such as:
1. Lower-casing all of the words (see the sketch after this list)
2. Replacing words with their synonyms
3. Removing stop words
4. Rewriting sentences (this can be done automatically with back-translation, i.e. translating into another language such as Arabic and then translating back into English)
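A minimal sketch of point 1, applied to the training data before the training loop above (assuming train_food keeps the (text, {"entities": [...]}) format; lower-casing does not change character offsets, so the entity spans remain valid):
def lowercase_training_data(train_data):
    # Lower-casing leaves character offsets unchanged, so the entity spans stay valid.
    return [(text.lower(), annotations) for text, annotations in train_data]

train_food = lowercase_training_data(train_food)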
Also, consider using stronger pre-trained models, such as:
http://nlp.stanford.edu:8080/corenlp
https://huggingface.co/models


Dumping a list to a JSON file creates a list within a list [["x", "y", "z"]], why?

I want to append multiple list items to a JSON file, but it creates a list within a list, and therefore I cannot access the list from Python. Since the code overwrites the existing data in the JSON file, there should not be any list there already. I also tried it with just plain text in the file, without brackets. It still creates a list within a list, so [["x", "y", "z"]] instead of ["x", "y", "z"].
import json

filename = 'vocabulary.json'
print("Reading %s" % filename)
try:
    with open(filename, "rt") as fp:
        data = json.load(fp)
    print("Data: %s" % data)  # check
except IOError:
    print("Could not read file, starting from scratch")
    data = []

# Add some data
TEMPORARY_LIST = []
new_word = input("give new word: ")
TEMPORARY_LIST.append(new_word.split())
print(TEMPORARY_LIST)  # check
data = TEMPORARY_LIST

print("Overwriting %s" % filename)
with open(filename, "wt") as fp:
    json.dump(data, fp)
Example run and output when appending the list of split words:
Reading vocabulary.json
Data: [['my', 'dads', 'house', 'is', 'nice']]
give new word: but my house is nicer
[['but', 'my', 'house', 'is', 'nicer']]
Overwriting vocabulary.json
So, if I understand correctly, you are trying to overwrite a list in a JSON file with a new list created from user input. For easiest data manipulation, set up your JSON file in dictionary form:
{
    "words": [
        "my",
        "dad's",
        "house",
        "is",
        "nice"
    ]
}
You should then set up functions to separate your functionality to make it more manageable:
def load_json(filename):
    with open(filename, "r") as f:
        return json.load(f)
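The answer also calls a write_json() helper just below (and mentions deleting it later in the Edit), but its definition seems to be missing here; a minimal version consistent with how it is used would be:
def write_json(filename, data):
    with open(filename, "w") as f:
        json.dump(data, f, indent=4)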
Now, we can use those functions to load the JSON, access the words list, and overwrite it with the new word.
data = load_json("vocabulary.json")
new_word = input("Give new word: ").split()
data["words"] = new_word
write_json("vocabulary.json", data)
If the user inputs "but my house is nicer", the JSON file will look like this:
{
    "words": [
        "but",
        "my",
        "house",
        "is",
        "nicer"
    ]
}
Edit
Okay, I have a few suggestions to make before I get into solving the issue. Firstly, it's great that you have delegated much of the program's functionality to respective functions. However, using global variables is generally discouraged because it makes things extremely difficult to debug: any of the functions that use that variable could have mutated it by accident. To fix this, use function parameters and pass the data around accordingly. With small programs like this, you can think of main() as the point that all data comes to and from; that is, main() passes data to other functions and receives new or edited data back. One final recommendation: only use all-capital variable names for constants. For example, PI = 3.14159 is a constant, so it is conventional to write it in all caps.
Without using global, main() will look much cleaner:
def main():
choice = input("Do you want to start or manage the list? (start/manage)")
if choice == "start":
data = load_json()
words = data["words"]
dictee(words)
elif choice == "manage":
manage_list()
You can use the load_json() function from earlier (notice that I deleted write_json(), more on that later) if the user chooses to start the game. If the user chooses to manage the file, we can write something like this:
def manage_list():
    choice = input("Do you want to add or clear the list? (add/clear)")
    if choice == "add":
        words_to_add = get_new_words()
        add_words("vocabulary.json", words_to_add)
    elif choice == "clear":
        clear_words("vocabulary.json")
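manage_list() relies on a get_new_words() helper that isn't defined in the answer; a minimal sketch, matching the prompt shown in the example session further down, could be:
def get_new_words():
    # Hypothetical helper: ask for words separated by spaces and return them as a list.
    new_words = input("Please enter the words you want to add, separated by spaces: ")
    return new_words.split()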
We get the user input first and then we can call two other functions, add_words() and clear_words():
def add_words(filename, words):
    with open(filename, "r+") as f:
        data = json.load(f)
        data["words"].extend(words)
        f.seek(0)
        json.dump(data, f, indent=4)

def clear_words(filename):
    with open(filename, "w+") as f:
        data = {"words": []}
        json.dump(data, f, indent=4)
I did not utilize the load_json() function in the two functions above. My reasoning is that it would mean opening the file more times than needed, which would hurt performance. Furthermore, in these two functions we already need to open the file, so it is okay to load the JSON data there, since it only takes one line: data = json.load(f). You may also notice that in add_words() the file mode is "r+". This is the basic mode for reading and writing. "w+" is used in clear_words() because it not only opens the file for reading and writing, it also overwrites the file if it exists (that is also why we don't need to load the JSON data in clear_words()). Because we have these two functions for writing and/or overwriting data, we don't need the write_json() function that I had initially suggested.
We can then add to the list like so:
>>> Do you want to start or manage the list? (start/manage)manage
>>> Do you want to add or clear the list? (add/clear)add
>>> Please enter the words you want to add, separated by spaces: these are new words
And the JSON file becomes:
{
    "words": [
        "but",
        "my",
        "house",
        "is",
        "nicer",
        "these",
        "are",
        "new",
        "words"
    ]
}
We can then clear the list like so:
>>> Do you want to start or manage the list? (start/manage)manage
>>> Do you want to add or clear the list? (add/clear)clear
And the JSON file becomes:
{
    "words": []
}
Great! Now, we implemented the ability for the user to manage the list. Let's move on to creating the functionality for the game: dictee()
You mentioned that you want to randomly select an item from a list and remove it from that list so it doesn't get asked twice. There are many ways to accomplish this. For example, you could use random.shuffle:
def dictee(words):
    correct = 0
    incorrect = 0
    random.shuffle(words)
    for word in words:
        # ask word
        # evaluate response
        # increment correct/incorrect
        # ask if you want to play again
        pass
random.shuffle randomly shuffles the list around. Then, you can iterate through the list using for word in words: and start the game. You don't necessarily need random.choice here, because when you shuffle the list and iterate through it, you are essentially selecting random values anyway.
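If it helps, here is one hedged way the skeleton above could be filled in; the actual quiz logic (what "ask word" and "evaluate response" mean for your game) is only a placeholder:
import random

def dictee(words):
    correct = 0
    incorrect = 0
    random.shuffle(words)
    for word in words:
        # Placeholder task: ask the player to retype the word.
        answer = input("Type the word '%s': " % word)
        if answer.strip().lower() == word.lower():
            correct += 1
        else:
            incorrect += 1
    print("Score: %d correct, %d incorrect" % (correct, incorrect))
    if input("Play again? (y/n) ").lower() == "y":
        dictee(words)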
I hope this helped illustrate how powerful functions and function parameters are. They not only help you separate your code, but also make it easier to manage, understand, and keep clean.

Input Text Data Formatting for CNN in Flux, in Julia

I am implementing Yoon Kim's CNN (https://arxiv.org/abs/1408.5882) for text classification in Julia, using Flux as the deep learning framework, with individual sentences as input datapoints. The model zoo (https://github.com/FluxML/model-zoo) has proven useful to an extent, but it does not have an NLP example with CNNs. I'd like to check if my input data format is the correct one.
There is no explicit implementation in Flux of a 1D Conv, so I'm using Conv found in https://github.com/FluxML/Flux.jl/blob/master/src/layers/conv.jl
Here is part of the docstring that explains the input data format:
Data should be stored in WHCN order (width, height, # channels, # batches).
In other words, a 100×100 RGB image would be a `100×100×3×1` array,
and a batch of 50 would be a `100×100×3×50` array.
My format is as follows:
1. width: since text in a sentence is 1D, the width is always 1
2. height: this is the maximum number of tokens allowed in a sentence
3. # of channels: this is the embedding size
4. # of batches: the number of sentences in each batch
Following the MNIST example in the model zoo, I have
function make_minibatch(X, Y, idxs)
    X_batch = zeros(1, num_sentences, emb_dims, MAX_LEN)
    function get_sentence_matrix(sentence)
        embeddings = Vector{Array{Float64, 1}}()
        for word in sentence
            embedding = get_embedding(word)
            push!(embeddings, embedding)
        end
        embeddings = hcat(embeddings...)
        return embeddings
    end
    for i in 1:length(idxs)
        X_batch[1, i, :, :] = get_sentence_matrix(X[idxs[i]])
    end
    Y_batch = [Flux.onehot(label+1, 1:2) for label in Y[idxs]]
    return (X_batch, Y_batch)
end
where X is an array of arrays of words and the get_embedding function returns an embedding as an array.
X_batch is then an Array{Float64,4}. Is this the correct approach?

Getting alignment/attention during translation in OpenNMT-py

Does anyone know how to get the alignment weights when translating in OpenNMT-py? Usually the only output is the resulting sentences, and I have tried to find a debugging flag or something similar for the attention weights. So far, I have been unsuccessful.
I'm not sure if this is a new feature, since I did not come across this when looking for alignments a few months back, but onmt seems to have added a flag -report_align to output word alignments along with the translation.
https://opennmt.net/OpenNMT-py/FAQ.html#raw-alignments-from-averaging-transformer-attention-heads
Excerpt from opennmt.net:
Currently, we support producing word alignment while translating for Transformer based models. Using -report_align when calling translate.py will output the inferred alignments in Pharaoh format. Those alignments are computed from an argmax on the average of the attention heads of the second to last decoder layer.
You can get the attention matrices. Note that attention is not the same as alignment, which is a term from statistical (not neural) machine translation.
There is a thread on GitHub discussing it; here is a snippet from that discussion. When you get the translations from the model, the attentions are in the attn field.
import onmt
import onmt.io
import onmt.translate
import onmt.ModelConstructor
from collections import namedtuple

# Load the model.
Opt = namedtuple('Opt', ['model', 'data_type', 'reuse_copy_attn', 'gpu'])
opt = Opt("PATH_TO_SAVED_MODEL", "text", False, 0)
fields, model, model_opt = onmt.ModelConstructor.load_test_model(
    opt, {"reuse_copy_attn": False})

# Test data
data = onmt.io.build_dataset(
    fields, "text", "PATH_TO_DATA", None, use_filter_pred=False)
data_iter = onmt.io.OrderedIterator(
    dataset=data, device=0,
    batch_size=1, train=False, sort=False,
    sort_within_batch=True, shuffle=False)

# Translator
translator = onmt.translate.Translator(
    model, fields, beam_size=5, n_best=1,
    global_scorer=None, cuda=True)
builder = onmt.translate.TranslationBuilder(
    data, translator.fields, 1, False, None)

batch = next(data_iter)
batch_data = translator.translate_batch(batch, data)
translations = builder.from_batch(batch_data)
translations[0].attn  # <--- here are the attentions
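If you want to inspect those attentions further, here is a hedged follow-up sketch (assuming the snippet above ran and that attn holds one (target_len, source_len) tensor per n-best hypothesis):
attn_matrix = translations[0].attn[0]  # attention of the 1-best hypothesis (assumed layout)
print(attn_matrix.shape)               # expected: (target_len, source_len)
# For each target token, the source position it attends to most:
print(attn_matrix.argmax(dim=1))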

Extract fields from a log file where data is stored half JSON and half plain text

I am new to Spark and want to read a log file and create a DataFrame out of it. My data is half JSON, and I cannot convert it into a DataFrame properly. Below is the first row in the file:
[2017-01-06 07:00:01] userid:444444 11.11.111.0 info {"artist":"Tears For Fears","album":"Songs From The Big Chair","song":"Everybody Wants To Rule The World","id":"S4555","service":"pandora"}
The first part is plain text and the last part, between { }, is JSON. I tried a few things: converting it first to an RDD, then mapping and splitting, then converting back to a DataFrame, but I cannot extract the values from the JSON part of the row. Is there a trick to extract the fields in this context?
The final output should look like this:
TimeStamp            userid  ip           artist           album                     song                               id     service
2017-01-06 07:00:01  444444  11.11.111.0  Tears For Fears  Songs From The Big Chair  Everybody Wants To Rule The World  S4555  pandora
You just need to parse the pieces out into a tuple with a Python function applied via an RDD map, then tell Spark to convert the RDD to a DataFrame. The easiest way to do this is probably a regular expression. For example:
import re
import json

def parse(row):
    pattern = ' '.join([
        r'\[(?P<ts>\d{4}-\d\d-\d\d \d\d:\d\d:\d\d)\]',
        r'userid:(?P<userid>\d+)',
        r'(?P<ip>\d+\.\d+\.\d+\.\d+)',
        r'(?P<level>\w+)',
        r'(?P<json>.+$)'
    ])
    match = re.match(pattern, row)
    parsed_json = json.loads(match.group('json'))
    return (match.group('ts'), match.group('userid'), match.group('ip'),
            match.group('level'), parsed_json['artist'], parsed_json['song'],
            parsed_json['service'])

lines = [
    '[2017-01-06 07:00:01] userid:444444 11.11.111.0 info {"artist":"Tears For Fears","album":"Songs From The Big Chair","song":"Everybody Wants To Rule The World","id":"S4555","service":"pandora"}'
]

# sc is the SparkContext (available by default in the pyspark shell)
rdd = sc.parallelize(lines)
df = rdd.map(parse).toDF(['ts', 'userid', 'ip', 'level', 'artist', 'song', 'service'])
df.show()
This prints
+-------------------+------+-----------+-----+---------------+--------------------+-------+
| ts|userid| ip|level| artist| song|service|
+-------------------+------+-----------+-----+---------------+--------------------+-------+
|2017-01-06 07:00:01|444444|11.11.111.0| info|Tears For Fears|Everybody Wants T...|pandora|
+-------------------+------+-----------+-----+---------------+--------------------+-------+
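As an alternative sketch (not part of either answer here), Spark's built-in regexp_extract and get_json_object can do the same extraction without a Python map, assuming the log is read with spark.read.text so each line ends up in a column named value and spark is the SparkSession:
from pyspark.sql import functions as F

raw = spark.read.text("PATH_TO_LOG_FILE")  # one log line per row, column "value"
json_part = F.regexp_extract("value", r"(\{.*\})", 1)
df2 = raw.select(
    F.regexp_extract("value", r"\[(.*?)\]", 1).alias("ts"),
    F.regexp_extract("value", r"userid:(\d+)", 1).alias("userid"),
    F.regexp_extract("value", r"(\d+\.\d+\.\d+\.\d+)", 1).alias("ip"),
    F.get_json_object(json_part, "$.artist").alias("artist"),
    F.get_json_object(json_part, "$.album").alias("album"),
    F.get_json_object(json_part, "$.song").alias("song"),
    F.get_json_object(json_part, "$.service").alias("service"),
)
df2.show(truncate=False)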
I ended up using the following, just some parsing utilizing PySpark's power:
from pyspark.sql.types import StructField, StringType

# r1 is assumed to be an RDD of Rows from the log file (e.g. spark.read.text(path).rdd)
parts = r1.map(lambda x: x.value.replace('[', '').replace('] ', '###')
               .replace(' userid:', '###').replace('null', '"null"').replace('""', '"NA"')
               .replace(' music_info {"artist":"', '###').replace('","album":"', '###')
               .replace('","song":"', '###').replace('","id":"', '###')
               .replace('","service":"', '###').replace('"}', '###').split('###'))
people = parts.map(lambda p: (p[0], p[1], p[2], p[3], p[4], p[5], p[6], p[7]))
schemaString = "timestamp mac userid_ip artist album song id service"
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
With this I got almost what I want, and performance was super fast.
+-------------------+-----------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------+
| timestamp| mac| userid_ip| artist| album| song| id|service|
+-------------------+-----------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------+
|2017-01-01 00:00:00|00:00:00:00:00:00|111122 22.235.17...|The United States...| This Is Christmas!|Do You Hear What ...| S1112536|pandora|
|2017-01-01 00:00:00|00:11:11:11:11:11|123123 108.252.2...| NA| Dinner Party Radio| NA| null|pandora|

What is "orcl" in the following pyalgotrade code? Further, I have a CSV file; how can I load data from it into my code?

I wanted to implement my own strategy for backtesting, but I am unable to modify the code according to my needs.
from pyalgotrade.tools import yahoofinance
yahoofinance.download_daily_bars('orcl', 2000, 'orcl-2000.csv')

from pyalgotrade import strategy
from pyalgotrade.barfeed import yahoofeed
from pyalgotrade.technical import ma

# class to create objects
class MyStrategy(strategy.BacktestingStrategy):
    def __init__(self, feed, instrument):
        strategy.BacktestingStrategy.__init__(self, feed)
        # We want a 15 period SMA over the closing prices.
        self.__sma = ma.SMA(feed[instrument].getCloseDataSeries(), 15)
        self.__instrument = instrument

    def onBars(self, bars):
        bar = bars[self.__instrument]
        self.info("%s %s" % (bar.getClose(), self.__sma[-1]))

# Load the yahoo feed from the CSV file
feed = yahoofeed.Feed()
feed.addBarsFromCSV("orcl", "orcl-2000.csv")

# Evaluate the strategy with the feed's bars.
myStrategy = MyStrategy(feed, "orcl")
myStrategy.run()
Slightly modified from the documentation:
from pyalgotrade.tools import yahoofinance

for instrument in ["AA", "ACN"]:
    for year in [2015, 2016]:
        yahoofinance.download_daily_bars(instrument, year, r'D:\tmp\Trading\%s-%s.csv' % (instrument, year))
"orcl" is the name of the stock Oracle. If you want to use a different stock ticker, place it there.
You need to go to Yahoo Finance here: http://finance.yahoo.com/q/hp?s=ORCL&a=02&b=12&c=2000&d=05&e=26&f=2015&g=d
and then save the file as orcl-2000.csv.
This program reads in the orcl-2000.csv file from the directory and prints the prices.
If you want to download the data through Python, then use a command like:
instrument = "orcl"
feed = yahoofinance.build_feed([instrument], 2011, 2014, ".")
This will create files named orcl-2011-yahoofinance.csv and so on, from 2011 through 2014.
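For the second part of the question (loading your own CSV): if the file is already in Yahoo Finance format, the yahoofeed.Feed shown in the question works as-is with your own file name. For a generic OHLC CSV, here is a hedged sketch using pyalgotrade's GenericBarFeed and the MyStrategy class from the question (check your version's docs for the exact expected columns and date format):
from pyalgotrade.barfeed import csvfeed
from pyalgotrade import bar

# GenericBarFeed expects columns like: Date Time, Open, High, Low, Close, Volume, Adj Close
feed = csvfeed.GenericBarFeed(bar.Frequency.DAY)
feed.addBarsFromCSV("mystock", "my-data.csv")  # "mystock" is just the instrument label

myStrategy = MyStrategy(feed, "mystock")
myStrategy.run()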