How to train my own model and test it with spaCy / NLTK

I am using the code below to train an existing spaCy NER model, but I don't get correct results on my tests. What am I missing?
import spacy
import random
from spacy.gold import GoldParse
from spacy.language import EntityRecognizer

train_data = [
    ('Who is Rocky babu?', [(7, 16, 'PERSON')]),
    ('I like London and Berlin.', [(7, 13, 'LOC'), (18, 24, 'LOC')])
]

nlp = spacy.load('en', entity=False, parser=False)
ner = EntityRecognizer(nlp.vocab, entity_types=['PERSON', 'LOC'])

for itn in range(5):
    random.shuffle(train_data)
    for raw_text, entity_offsets in train_data:
        doc = nlp.make_doc(raw_text)
        gold = GoldParse(doc, entities=entity_offsets)
        nlp.tagger(doc)
        nlp.entity.update([doc], [gold])
Now, when I try to test the above model using the code below, I don't get the expected output.
text = ['Who is Rocky babu?']
for a in text:
    doc = nlp(a)
    print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
My output is as follows:
Entities []
whereas my expected output is as follows:
Entities [('Rocky babu', 'PERSON')]
Can someone please tell me what I'm missing?

Could you retry with
nlp = spacy.load('en_core_web_sm', entity=False, parser=False)
If that gives an error because you don't have that model installed, you can run
python -m spacy download en_core_web_sm
on the command line first.
And of course, keep in mind that to train the model properly, you'll need many more examples for it to be able to generalize!
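For reference, here is a minimal sketch of the same kind of update loop using the spaCy 2.x training API (it assumes en_core_web_sm is installed and that you supply many more examples than the two shown). Note also that with spaCy's end-exclusive character offsets, the PERSON span for 'Rocky babu' should be (7, 17) rather than (7, 16):
# Sketch only: spaCy 2.x API, en_core_web_sm installed, tiny toy data.
import random
import spacy

train_data = [
    ('Who is Rocky babu?', [(7, 17, 'PERSON')]),
    ('I like London and Berlin.', [(7, 13, 'LOC'), (18, 24, 'LOC')])
]

nlp = spacy.load('en_core_web_sm')
ner = nlp.get_pipe('ner')
for _, offsets in train_data:
    for start, end, label in offsets:
        ner.add_label(label)

# Update only the NER weights; leave the other pipeline components untouched.
other_pipes = [p for p in nlp.pipe_names if p != 'ner']
with nlp.disable_pipes(*other_pipes):
    for itn in range(20):
        random.shuffle(train_data)
        losses = {}
        for text, offsets in train_data:
            nlp.update([text], [{'entities': offsets}], drop=0.35, losses=losses)

doc = nlp('Who is Rocky babu?')
print('Entities', [(ent.text, ent.label_) for ent in doc.ents])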

Related

mmdet - WARNING - The model and loaded state dict do not match exactly. unexpected key in source state_dict:

I'm currently trying to run a deep learning tool that was created by someone else a few years ago. While trying to load a class called Evaluator, which wraps all of the important mmdetection functions, I keep getting the following error:
[Screenshot of the warning: "The model and loaded state dict do not match exactly. unexpected key in source state_dict: ..."]
The model was downloaded automatically while running the code due to the following part of the config file:
model = dict(
    type='FCOS',
    pretrained='open-mmlab://detectron/resnet101_caffe',
    backbone=dict(
        type='ResNet',
        depth=101,
        num_stages=4,
        out_indices=(0, 1, 2, 3),
        frozen_stages=1,
        norm_cfg=dict(type='BN', requires_grad=False),
        norm_eval=True,
        style='caffe'),
    neck=dict(
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        start_level=1,
        add_extra_convs=True,
        extra_convs_on_inputs=False,
        num_outs=5,
        relu_before_extra_convs=True),
    bbox_head=dict(
        type='FCOSHead',
        num_classes=15,
        in_channels=256,
        stacked_convs=4,
        feat_channels=256,
        strides=[8, 16, 32, 64, 128],
        loss_cls=dict(
            type='FocalLoss',
            use_sigmoid=True,
            gamma=2.0,
            alpha=0.25,
            loss_weight=1.0),
        loss_bbox=dict(type='IoULoss', loss_weight=1.0),
        loss_centerness=dict(
            type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0)))
I'm not sure how to determine if the model I'm trying to load and the state dictionary are compatible or how to fix this problem. I'm new to deep learning and using MMdetection.
Here is part of the code from the utils.py file that contains the Evaluator class:
from skimage.draw import rectangle_perimeter
import skimage.io as io
from skimage.transform import resize
import numpy as np
import skimage
import pickle
import torch
from mmcv import Config, DictAction
from mmdet.models import build_detector
from mmcv.runner import load_checkpoint
import mmcv
from mmdet.datasets.pipelines import Compose  # TO LOOK AT
from mmcv.parallel import collate, scatter
from mmdet.core import bbox2result
from skimage import data, io, filters
from matplotlib.pyplot import figure
import os

class_to_number = {"Yeast White": 0, "Budding White": 1, "Yeast Opaque": 2,
                   "Budding Opaque": 3, "Yeast Gray": 4, "Budding Gray": 5,
                   "Shmoo": 6, "Artifact": 7, "Unknown ": 8,
                   "Pseudohyphae": 9, "Hyphae": 10, "H-junction": 11,
                   "P-junction": 12, "P-Start": 13, "H-Start": 14}
number_to_class = {y: x for x, y in class_to_number.items()}

class Evaluator():
    def __init__(self, config, checkpoint_file):
        self.cfg = Config.fromfile(config)
        self.cfg["gpu-ids"] = 6
        self.model = build_detector(
            self.cfg.model, train_cfg=self.cfg.train_cfg, test_cfg=self.cfg.test_cfg)
        checkpoint_dict = load_checkpoint(self.model, checkpoint_file)
        state_dict = checkpoint_dict["state_dict"]
        self.model.CLASSES = checkpoint_dict['meta']['CLASSES']
        self.model.load_state_dict(state_dict)
        self.model.eval()
I checked the versions of mmdet, mmcv, and PyTorch to ensure they were the same versions used by the original creator of the software. I also redownloaded the model file to ensure that it wasn't corrupted.
It is normal that the model and loaded state dict do not match exactly, because the fully connected layers of the pretrained model are unused, and this will not affect training. It only becomes a problem if it causes further issues during testing; otherwise you should be good.
Refer to the issue here.
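If you want to see exactly which keys differ before deciding whether the mismatch matters, a minimal sketch along these lines compares the checkpoint's state dict against the freshly built model (the config and checkpoint paths are placeholders):
# Sketch: list the unexpected/missing keys before calling load_state_dict.
import torch
from mmcv import Config
from mmdet.models import build_detector

cfg = Config.fromfile("fcos_config.py")                       # placeholder path
model = build_detector(cfg.model, train_cfg=cfg.train_cfg, test_cfg=cfg.test_cfg)

ckpt = torch.load("checkpoint_file.pth", map_location="cpu")  # placeholder path
ckpt_keys = set(ckpt["state_dict"].keys())
model_keys = set(model.state_dict().keys())

print("unexpected keys (in checkpoint only):", sorted(ckpt_keys - model_keys))
print("missing keys (in model only):", sorted(model_keys - ckpt_keys))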

I am trying to pass values to a function from the reliability library, but without much success. Here are my sample data and code.

My sample data was as follows:
Tag_Typ              Alpha_Estimate  Beta_Estimate  PM01_Avg_Cost  PM02_Avg_Cost
OLK-AC-101-14A_PM01  497.665         0.946584       1105.635       462.3833775
OLK-AC-103-01_PM01   288.672         0.882831       1303.8875      478.744375
OLK-AC-1105-01_PM01  164.282         0.787158       763.4475758    512.185814
OLK-AC-236-05A_PM01  567.279         0.756839       640.718        450.3277778
OLK-AC-276-05A_PM01  467.53          0.894773       1536.78625     439.78
This is my sample code:
import pandas as pd
import numpy as np
from reliability.Repairable_systems import optimal_replacement_time
import matplotlib.pyplot as plt

data = pd.read_excel(r'C:\Users\\EU_1_EQ_PM01_Cost.xlsx')
data_frame = pd.DataFrame(data, columns=['Alpha_Estimate', 'Beta_Estimate', 'PM01_Avg_Cost', 'PM02_Avg_Cost'])
Alpha_Est = pd.DataFrame(data, columns=['Alpha_Estimate'])
Beta_Est = pd.DataFrame(data, columns=['Beta_Estimate'])
PM_Est = pd.DataFrame(data, columns=['PM02_Avg_Cost'])
CM_Est = pd.DataFrame(data, columns=['PM01_Avg_Cost'])
optimal_replacement_time(cost_PM=PM_Est, cost_CM=CM_Est, weibull_alpha=Alpha_Est, weibull_beta=Beta_Est, q=0)
plt.show()
I need to loop through the set of values for each tag and pass those values to the optimal_replacement_time function to return the results.
[Sample output]
ValueError: Can only compare identically-labeled DataFrame objects
I would appreciate any suggestions on how I can pass the values of the PM cost, PPM cost, and the distribution parameters alpha and beta to the function as I iterate through the tag types and print the results for each tag. Thanks.
The core of your question is how to iterate through a list in Python. This will achieve what you're after:
import pandas as pd
from reliability.Repairable_systems import optimal_replacement_time

df = pd.read_excel(io=r"C:\Users\Matthew Reid\Desktop\sample_data.xlsx")
alpha = df["Alpha_Estimate"].tolist()
beta = df["Beta_Estimate"].tolist()
CM = df["PM01_Avg_Cost"].tolist()
PM = df["PM02_Avg_Cost"].tolist()
ORT = []

for i in range(len(alpha)):
    ort = optimal_replacement_time(cost_PM=PM[i], cost_CM=CM[i], weibull_alpha=alpha[i], weibull_beta=beta[i], q=0)
    ORT.append(ort.ORT)

print('List of the optimal replacement times:\n', ORT)
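If you also want each result printed next to its tag, a small variant of the same idea (assuming the column names shown in the sample data, including Tag_Typ) can iterate over the rows directly:
# Variant sketch: iterate row by row so each tag prints beside its result.
import pandas as pd
from reliability.Repairable_systems import optimal_replacement_time

df = pd.read_excel(io=r"C:\Users\Matthew Reid\Desktop\sample_data.xlsx")
for row in df.itertuples(index=False):
    ort = optimal_replacement_time(cost_PM=row.PM02_Avg_Cost,
                                   cost_CM=row.PM01_Avg_Cost,
                                   weibull_alpha=row.Alpha_Estimate,
                                   weibull_beta=row.Beta_Estimate,
                                   q=0)
    print(row.Tag_Typ, "->", ort.ORT)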
On a separate note, all of your beta values are less than 1. This means the hazard rate is decreasing (aka. infant mortality / early life failures). When you run the above script, each iteration will print the warning:
"WARNING: weibull_beta is < 1 so the hazard rate is decreasing, therefore preventative maintenance should not be conducted."
If you have any further questions, you know how to contact me :)

Spacy - NLTK: Language detection

I am currently working on a project dealing with a bunch of social media posts.
Some of these posts are in English and some in Spanish.
My current code runs quite smoothly. However, I am asking myself: do spaCy/NLTK automatically detect which language stemmer/stopwords/etc. they have to use for each post (depending on whether it is an English or a Spanish post)? At the moment, I am just passing each post to a stemmer without explicitly specifying the language.
This is a snippet of my current script:
import re
import pandas as pd
!pip install pyphen
import pyphen
!pip install spacy
import spacy
!pip install nltk
import nltk
from nltk import SnowballStemmer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
!pip install spacy-langdetect
from spacy_langdetect import LanguageDetector
!pip install textblob
from textblob import TextBlob
# Download Stopwords
nltk.download('stopwords')
stop_words_eng = set(stopwords.words('english'))
stop_words_es = set(stopwords.words('spanish'))
# Import Stemmer
p_stemmer = PorterStemmer()
#Snowball (Porter2): Nearly universally regarded as an improvement over porter, and for good reason.
snowball_stemmer = SnowballStemmer("english")
dic = pyphen.Pyphen(lang='en')
# Load Data
data = pd.read_csv("mergerfile.csv", error_bad_lines=False)
pd.set_option('display.max_columns', None)
posts = data.loc[data["ad_creative"] != "NONE"]
# Functions
def get_number_of_sentences(text):
    sentences = [sent.string.strip() for sent in text.sents]
    return len(sentences)

def get_average_sentence_length(text):
    number_of_sentences = get_number_of_sentences(text)
    tokens = [token.text for token in text]
    return len(tokens) / number_of_sentences

def get_token_length(text):
    tokens = [token.text for token in text]
    return len(tokens)

def text_analyzer(data_frame):
    content = []
    label = []
    avg_sentence_length = []
    number_sentences = []
    number_words = []
    for string in data_frame:
        string.join("")
        if len(string) <= 4:
            print(string)
            print("filtered")
            content.append(string)
            avg_sentence_length.append("filtered")
            number_sentences.append("filtered")
            number_words.append("filtered")
        else:
            # print list
            print(string)
            content.append(string)
            ## Average sentence length
            result = get_average_sentence_length(nlp(string))
            avg_sentence_length.append(result)
            print("avg sentence length:", result)
            ## Number of sentences
            result = get_number_of_sentences(nlp(string))
            number_sentences.append(result)
            print("#sentences:", result)
            ## Number of words
            result = get_token_length(nlp(string))
            number_words.append(result)
            print("#Words", result)

content, avg_sentence_length, number_sentences, number_words = text_analyzer(
    data["posts"])
Short answer is no, neither NLTK nor SpaCy will automatically determine the language and apply appropriate algorithms to a text.
SpaCy has separate language models with their own methods, part-of-speech and dependency tagsets. It also has a set of stopwords for each available language.
NLTK is more modular; for stemming there is RSLPStemmer (Portuguese), ISRIStemmer (Arabic), and SnowballStemmer (Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish).
Once you determine the language of a post through spacy_langdetect, the next step is to explicitly load the appropriate spaCy language model or NLTK module, as sketched below.
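For example, a minimal sketch of that routing could look like the following; the 'en'/'es' code would come from a detector such as spacy_langdetect or googletrans, and the NLTK stopwords/punkt data must already be downloaded:
# Sketch: choose the stemmer and stopword list from a detected language code.
from nltk import SnowballStemmer, word_tokenize
from nltk.corpus import stopwords

stemmers = {"en": SnowballStemmer("english"), "es": SnowballStemmer("spanish")}
stop_words = {"en": set(stopwords.words("english")), "es": set(stopwords.words("spanish"))}

def stem_post(post_text, lang):
    """Tokenize, drop stopwords, and stem with the language-appropriate tools."""
    tokens = word_tokenize(post_text)
    return [stemmers[lang].stem(t) for t in tokens if t.lower() not in stop_words[lang]]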
Use the googletrans library for this:
#!/usr/bin/python
from googletrans import Translator
translator = Translator()
translator.detect('이 문장은 한글로 쓰여졌습니다.')
This returns:
<Detected lang=ko confidence=0.27041003>
So this is a good way to do it if you have an internet connection, and in most cases it works better than spaCy, as Google Translate's detection is more mature and has better algorithms. ;)

How to predict sentiments after training and testing the model using NLTK NaiveBayesClassifier in Python?

I am doing sentiment classification using NLTK's NaiveBayesClassifier. I trained and tested the model with labeled data. Now I want to predict the sentiment of data that is not labeled, but I run into an error.
The line that is giving the error is:
score_1 = analyzer.evaluate(list(zip(new_data['Articles'])))
The error is:
ValueError: not enough values to unpack (expected 2, got 1)
Below is the code:
import random
import pandas as pd

data = pd.read_csv("label data for testing .csv", header=0)
sentiment_data = list(zip(data['Articles'], data['Sentiment']))
random.shuffle(sentiment_data)
new_data = pd.read_csv("Japan Data.csv", header=0)
train_x, train_y = zip(*sentiment_data[:350])
test_x, test_y = zip(*sentiment_data[350:])

from unidecode import unidecode
from nltk import word_tokenize
from nltk.classify import NaiveBayesClassifier
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import extract_unigram_feats

TRAINING_COUNT = 350

def clean_text(text):
    text = text.replace("<br />", " ")
    return text

analyzer = SentimentAnalyzer()
vocabulary = analyzer.all_words([word_tokenize(unidecode(clean_text(instance)))
                                 for instance in train_x[:TRAINING_COUNT]])
print("Vocabulary: ", len(vocabulary))
print("Computing Unigram Features ...")
unigram_features = analyzer.unigram_word_feats(vocabulary, min_freq=10)
print("Unigram Features: ", len(unigram_features))
analyzer.add_feat_extractor(extract_unigram_feats, unigrams=unigram_features)

# Build the training set
_train_X = analyzer.apply_features([word_tokenize(unidecode(clean_text(instance)))
                                    for instance in train_x[:TRAINING_COUNT]], labeled=False)
# Build the test set
_test_X = analyzer.apply_features([word_tokenize(unidecode(clean_text(instance)))
                                   for instance in test_x], labeled=False)

trainer = NaiveBayesClassifier.train
classifier = analyzer.train(trainer, zip(_train_X, train_y[:TRAINING_COUNT]))

score = analyzer.evaluate(list(zip(_test_X, test_y)))
print("Accuracy: ", score['Accuracy'])

score_1 = analyzer.evaluate(list(zip(new_data['Articles'])))
print(score_1)
I understand that the problem arises because I have to give two parameters in the line that is giving the error, but I don't know how to do this.
Thanks in advance.
Documentation and example
The line that gives you the error calls the method SentimentAnalyzer.evaluate(...).
This method does the following.
Evaluate and print classifier performance on the test set.
See SentimentAnalyzer.evaluate.
The method has one mandatory parameter: test_set.
test_set – A list of (tokens, label) tuples to use as gold set.
In the example at http://www.nltk.org/howto/sentiment.html, test_set has the following structure:
[({'contains(,)': False, 'contains(.)': True, 'contains(and)': False, 'contains(the)': True}, 'subj'), ({'contains(,)': True, 'contains(.)': True, 'contains(and)': False, 'contains(the)': True}, 'subj'), ...]
Here is a symbolic representation of the structure.
[(dictionary,label), ... , (dictionary,label)]
Error in your code
You are passing
list(zip(new_data['Articles']))
to SentimentAnalyzer.evaluate. I assume you're getting the error because
list(zip(new_data['Articles']))
does not create a list of (tokens, label) tuples. You can check that by creating a variable which contains the list and printing it or looking at the value of the variable while debugging.
E.g.:
test_set = list(zip(new_data['Articles']))
print("begin test_set")
print(test_set)
print("end test_set")
You are calling evaluate correctly 3 lines above the one that is giving the error.
score = analyzer.evaluate(list(zip(_test_X, test_y)))
I guess you want to call SentimentAnalyzer.classify(instance) to predict unlabeled data. See SentimentAnalyzer.classify.
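For example, a minimal sketch of that prediction step (reusing clean_text, word_tokenize, unidecode, and the analyzer already set up in your script) could be:
# Sketch: classify() applies the stored feature extractors itself, so it takes
# the tokenized text rather than a (features, label) pair like evaluate() does.
predictions = [analyzer.classify(word_tokenize(unidecode(clean_text(article))))
               for article in new_data['Articles']]
print(predictions)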

How to calculate the number and area of habitat patches in ArcView 10

I'm currently working on my master's thesis and having real trouble with GIS. I've downloaded the ArcGIS grid dataset from http://www.kew.org/gis/projects/mad_veg/datasets_gis.html
I've successfully plotted it in ArcMap 10. The map consists of various different habitats. I want to know how I could take one of those habitat types, say "humid forest", and calculate how many patches of that habitat there are, and how big each patch is.
I've been at this for weeks and haven't made much headway. Someone suggested I look at Zonal Geometry as Table http://help.arcgis.com/en/arcgisdesktop/10.0/help/index.html#//009z000000w5000000.htm, which looks promising, but I gave the coding a try and couldn't get it to work. I posted some of my attempts below.
>>> import arcpy
>>> from arcpy import env
>>> from arcpy.sa import *
>>> env.workspace = "Q:/MADGIS"
>>> outZonalGeometryAsTable = ZonalGeometryAsTable("zones.shp", "Classes", "zonalgeomout", 0.2)
Runtime error <class 'arcgisscripting.ExecuteError'>: ERROR 000626: Tool ZonalGeometryAsTable is not licensed.
>>> arcpy.CheckOutExtension("Spatial")
u'CheckedOut'
>>> outZonalGeometryAsTable = ZonalGeometryAsTable(inZoneData, zoneField, "AREA", cellSize)
Runtime error <type 'exceptions.NameError'>: name 'inZoneData' is not defined
The problem is that some of the things I've copied from the example are specific to that example, but I'm not sure. If someone could even point me in the right direction, it would be a big help.
It seems that you didn't set some parameters.
According to the link above, you must set these parameters:
# Set local variables
inZoneData = "YourShapefileName.shp"
zoneField = "Classes"
outTable = "zonalgeomout02.dbf"
processingCellSize = 0.2
# Check out the ArcGIS Spatial Analyst extension license
arcpy.CheckOutExtension("Spatial")
Update:
You must use this code for your raster data:
import arcpy
from arcpy import env
from arcpy.sa import *
env.workspace = "C:/Users/Puya/Downloads/Documents/StackOverflow/veg_grid"
inZoneData = "vegetation"
zoneField = "Value"
outTable = "zonalgeomout02.dbf"
processingCellSize = 29
arcpy.CheckOutExtension("Spatial")
outZonalGeometryAsTable = ZonalGeometryAsTable(inZoneData, zoneField, "AREA", processingCellSize)
Also, in ArcMap you can use ArcToolbox -> Spatial Analyst -> Zonal -> ZonalGeometryAsTable, select the above parameters, and run it.