pos_tagging and NER tagging of a MUC dataset does not work correctly - nltk

I have a problem with the MUC dataset. I want to do NER on it, but all the words in this
dataset are in capital letters, so when the POS tagger is run it incorrectly tags every word
as a noun. To work around this, I first converted the whole text to lower case. However,
this raises another problem: on lowercase text the NER does not work properly and finds
literally no PERSON, ORGANIZATION or LOCATION entities. So I kept the lower-casing
in order to POS-tag successfully, and then manually capitalized each word again before
feeding the tagged tokens into the NER module. But this creates yet another problem:
now the NER detects everything as LOCATION.
Here is my code:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

def NER(input_file, output_file):
    output = open('{0}_NER.txt'.format(output_file), 'w')
    testset = open(input_file).readlines()
    for line in testset:
        # lower-case so that pos_tag does not see every all-caps word as a noun
        line_clean = line.lower().strip()
        tokens = nltk.word_tokenize(line_clean)
        poss = nltk.pos_tag(tokens)
        # restore capitalization before NE chunking
        mylist = []
        for w in poss:
            s1 = w[0].upper()
            mylist.append((s1, w[1]))
        ner_ = nltk.ne_chunk(mylist)
        output.write(str(ner_) + '\n')
    output.close()
Any help would be greatly appreciated.
Thanks.
Here is a piece of this dataset:
SAN SALVADOR, 3 JAN 90 -- [REPORT] [ARMED FORCES PRESS COMMITTEE,
COPREFA] [TEXT] THE ARCE BATTALION COMMAND HAS REPORTED THAT ABOUT 50
PEASANTS OF VARIOUS AGES HAVE BEEN KIDNAPPED BY TERRORISTS OF THE
FARABUNDO MARTI NATIONAL LIBERATION FRONT [FMLN] IN SAN MIGUEL
DEPARTMENT. ACCORDING TO THAT GARRISON, THE MASS KIDNAPPING TOOK PLACE ON
30 DECEMBER IN SAN LUIS DE LA REINA. THE SOURCE ADDED THAT THE TERRORISTS
FORCED THE INDIVIDUALS, WHO WERE TAKEN TO AN UNKNOWN LOCATION, OUT OF
THEIR RESIDENCES, PRESUMABLY TO INCORPORATE THEM AGAINST THEIR WILL INTO
CLANDESTINE GROUPS.

Your best bet is to train your own named entity classifier on case-folded text. The NLTK book has a step-by-step tutorial in chapters 6 and 7. For training you could use the CoNLL 2003 corpus.
Consider also training your own POS tagger on case-folded text; it might work better than the NLTK POS tagger you're using now (but check).
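For illustration, here is a minimal sketch of training a case-folded POS tagger with NLTK; the Penn Treebank sample bundled with NLTK and the simple back-off chain are just assumptions for the example, not the only reasonable setup:

import nltk
from nltk.corpus import treebank

# Case-fold the tagged sentences from the Penn Treebank sample shipped with NLTK
# (requires nltk.download('treebank')).
tagged_sents = [[(w.lower(), t) for (w, t) in sent] for sent in treebank.tagged_sents()]

split = int(len(tagged_sents) * 0.9)
train_sents, test_sents = tagged_sents[:split], tagged_sents[split:]

# Simple back-off chain: bigram -> unigram -> default noun tag.
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)

print(t2.accuracy(test_sents))  # .evaluate() on older NLTK versions
print(t2.tag('the arce battalion command has reported'.split()))

The same idea applies to the NER step: lower-case the labelled training corpus before extracting features, so the classifier never relies on case.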

Why do you need POS tagging if your task is NER? As far as I know, POS tags do not really improve the NER result. I agree with Alexis that you need to train your own classifier, since without case information you don't have access to the word-shape feature.

Related

how to label training data for Tesseract

I want to train my own model to detect and recognize ID cards with Tesseract, and to extract key information such as name and ID from them. The data looks like: [sample of data]
The training introduction only accepts single-line text as input. I'm confused about how to train the detection model in Tesseract, and whether I should label single characters or the whole text line in each box. (https://github.com/tesseract-ocr/tesstrain)
One-by-one character replacement from image to text is based on training in groups.
So in the first Tesseract training test sample, the idea is to let Tesseract understand that the ch ligature is to be output as two letters, that the δ is to be a lower-case d, f as k, that Uber is Aber, and so on.
However, that does not correct the spelling of words without a dictionary of accepted character permutations, so you need to either train every word you could expect (e.g. 123 is allowed but not 321) or else allow all numbers.
The problem then is: should ¦ be i, |, l, 1 or !? Only human, intelligent context is likely to agree on what is 100% correct, especially with italics: is / an i, |, l, 1 or !, or is it an italic /?
The clearer the characters are in contrast to the background, the better the result will usually be, and well-defined void space within a character helps distinguish between B and 8; thus resolution is also a help or a hindrance.
= INT 3O 80 S~A MARIA
A dictionary entry of BO and STA would possibly help in this case.
Oh, I think I get it. Tesseract doesn't need a detection model to get the position of the text line; it recognizes each blob (letter) and uses the position of each letter to locate the text line.

Data PreProcessing for BERT (base-german)

I am working on a sentiment analysis solution with BERT to analyze tweets in German. My training dataset consists of tweets that have been manually annotated into the classes neutral, positive and negative.
The dataset of 10,000 tweets is quite unevenly distributed:
approx.
3000 positive
2000 negative
5000 neutral
The tweets contain formulations with #names, https links, numbers, punctuation marks, smileys like :3 :D :) etc.
The interesting thing is that if I remove them with the following code during data cleaning, the F1 score gets worse. Only the removal of https links (done on its own) leads to a small improvement.
import re
import string

# removing the punctuation and numbers
def remove_punct(text):
    text = re.sub(r'http\S+', '', text)  # removing links
    text = re.sub(r'#\S+', '', text)     # removing references to usernames with #
    text = re.sub(r':\S+', '', text)     # removing smileys starting with : (like :), :D, :( etc.)
    text = "".join([char for char in text if char not in string.punctuation])
    text = re.sub('[0-9]+', '', text)    # removing numbers
    return text

data['Tweet_clean'] = data['Tweet'].apply(lambda x: remove_punct(x))  # extend the dataset with the column Tweet_clean
data.head(40)
Steps like stop-word removal or lemmatization also lead to a deterioration. Is this because I am doing something wrong, or can BERT actually handle such values?
A second question is:
I found other records that were also manually annotated, but these are not tweets and the structure of the sentences and language use is different. Would you still recommend adding these records to my original dataset?
There are about 3000 records in German.
My last question:
Should I reduce the class sizes to the size of the smallest class and thus balance the dataset?
BERT can handle punctuation, smileys etc. Of course, smileys contribute a lot to sentiment analysis. So, don't remove them. Next, it would be fair to replace #mentions and links with some special tokens, because the model will probably never see them again in the future.
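For example, a small sketch of that replacement; the [URL] and [USER] placeholder tokens are just illustrative choices:

import re

def mask_tweet(text):
    # Replace links and #references with placeholder tokens instead of deleting them.
    text = re.sub(r'http\S+', '[URL]', text)
    text = re.sub(r'#\S+', '[USER]', text)
    return text

print(mask_tweet('Danke #maxmustermann! Mehr dazu: https://example.com :)'))
# -> Danke [USER] Mehr dazu: [URL] :)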
If your model is designed for tweets, I suggest that you fine-tune BERT on the additional corpus first and then fine-tune on the Twitter corpus, or do both simultaneously. More training samples are generally better.
No, it is better to use class weights instead of downsampling.
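A hedged sketch of what class weighting could look like, assuming a PyTorch-style loss (the thread does not specify the training setup):

import numpy as np
import torch
from sklearn.utils.class_weight import compute_class_weight

labels = np.array([0] * 2000 + [1] * 3000 + [2] * 5000)  # toy labels: negative, positive, neutral
weights = compute_class_weight(class_weight='balanced', classes=np.unique(labels), y=labels)

# Pass the weights to the loss so under-represented classes count more per example.
loss_fn = torch.nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float))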
Based on this paper (by Adam Ek, Jean-Philippe Bernardy and Stergios Chatzikyriakidis), BERT models outperform BiLSTMs in terms of generalizing to punctuation. Looking at the experimental results in the paper, I'd say keep the punctuation.
I couldn't find anything solid for smiley faces; however, after doing some experiments with the HuggingFace API, I didn't notice much difference with or without them.

I need to turn the texts into vectors then feed the vectors into a classifier

I have a csv file named movie_reviews.csv and the data inside looks like this:
1 Pixar classic is one of the best kids' movies of all time.
1 Apesar de representar um imenso avanço tecnológico, a força
1 It doesn't enhance the experience, because the film's timeless appeal is down to great characters and wonderful storytelling; a classic that doesn't need goggles or gimmicks.
1 As such Toy Story in 3D is never overwhelming. Nor is it tedious, as many recent 3D vehicles have come too close for comfort to.
1 The fresh look serves the story and is never allowed to overwhelm it, leaving a beautifully judged yarn to unwind and enchant a new intake of young cinemagoers.
1 There's no denying 3D adds extra texture to Pixar's seminal 1995 buddy movie, emphasising Buzz and Woody's toy's-eye- view of the world.
1 If anything, it feels even fresher, funnier and more thrilling in today's landscape of over-studied demographically correct moviemaking.
1 If you haven't seen it for a while, you may have forgotten just how fantastic the snappy dialogue, visual gags and genuinely heartfelt story is.
0 The humans are wooden, the computer-animals have that floating, jerky gait of animated fauna.
1 Some thrills, but may be too much for little ones.
1 Like the rest of Johnston's oeuvre, Jumanji puts vivid characters through paces that will quicken any child's pulse.
1 "This smart, scary film, is still a favorite to dust off and take from the ""vhs"" bin"
0 All the effects in the world can't disguise the thin plot.
The first column, with 0s and 1s, is my label.
I want to first turn the texts in movie_reviews.csv into vectors, then split my dataset based on the labels (all 1s for training and 0s for testing), and then feed the vectors into a classifier such as a random forest.
For such a task you'll need to preprocess your data first with a few tools. First, lower-case all your sentences. Then delete all stopwords (the, and, or, ...) and tokenize (an introduction here: https://medium.com/#makcedward/nlp-pipeline-word-tokenization-part-1-4b2b547e6a3). You can also use stemming to keep only the root of each word, which can be helpful for sentiment classification.
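A minimal sketch of those steps with NLTK (requires nltk.download('punkt') and nltk.download('stopwords'); the stemmer choice is just an example):

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stops = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess(sentence):
    tokens = word_tokenize(sentence.lower())  # lower-case and tokenize
    # drop punctuation/numbers and stopwords, then keep only the word roots
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stops]

print(preprocess("Pixar classic is one of the best kids' movies of all time."))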
Then you'll assign an index to each word of your vocabulary and replace the words in your sentences with these indexes:
Imagine your vocabulary is: ['i', 'love', 'keras', 'pytorch', 'tensorflow']
index['None'] = 0 #in case a new word is not in your vocabulary
index['i'] = 1
index['love'] = 2
...
Thus the sentence 'I love Keras' will be encoded as [1 2 3].
However, you have to define a maximum length max_len for your sentences, and when a sentence contains fewer words than max_len you pad the vector to size max_len with zeros.
In the previous example, if max_len = 5, then [1 2 3] -> [1 2 3 0 0].
This is a basic approach. Feel free to check preprocessing tools provided by libraries such as NLTK, Pandas ...
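Here is a small sketch of the index-and-pad encoding described above; the toy vocabulary, max_len and sentences are made up purely for illustration:

def build_vocab(sentences):
    vocab = {'None': 0}  # index 0 is reserved for unknown words and padding
    for sent in sentences:
        for word in sent.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(sentence, vocab, max_len=5):
    ids = [vocab.get(w, 0) for w in sentence.lower().split()][:max_len]
    return ids + [0] * (max_len - len(ids))  # pad with zeros up to max_len

vocab = build_vocab(['i love keras', 'i love pytorch tensorflow'])
print(encode('I love Keras', vocab))  # [1, 2, 3, 0, 0]

The resulting fixed-length vectors can then be fed to a classifier such as a random forest.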

wordnet on different text?

I am new to NLTK, and I find the WordNet functionality pretty useful. It gives synsets, hypernyms, similarity, etc. However, it fails to give a similarity between locations like 'Delhi' and 'Hyderabad', obviously because these words are not in the WordNet corpus.
So I would like to know whether I can somehow update the WordNet corpus, or create a WordNet over a different corpus, e.g. a set of pages extracted from Wikipedia related to travel. If we can create a WordNet over a different corpus, what would be the format, the steps to do so, and any limitations?
Could you please point me to links that address the above concerns? I have searched the internet, googled, and read portions of the NLTK book, but I don't have a single hint for this question.
Pardon me if the question sounds completely ridiculous.
For flexibility in measuring the semantic similarity of very specific terms like Delhi or Hyderabad, what you want is not something hand-crafted like WordNet, but an automatically learned similarity measure from a very large database. These are statistical similarity approaches. Of course, you want to avoid having to train such a model on data yourself...
Thus one thing that may be useful is the Google Distance (wikipedia, original paper). It seems fairly simple to implement such a measure in a language like R (code), and the original paper reports 87% agreement with WordNet.
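If it helps, here is a sketch of the Normalized Google Distance formula from the cited paper, written in Python for consistency with the rest of this thread; the hit counts below are toy numbers, not real search results:

import math

def normalized_google_distance(fx, fy, fxy, n):
    # fx, fy: hit counts for each term alone; fxy: joint hit count; n: total indexed pages
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))

print(normalized_google_distance(2.0e7, 1.5e7, 5.0e6, 5.0e10))  # smaller means more related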
The similarity measures in Wordnet work as expected because Wordnet measures semantic similarity. In that sense, both are cities, so they are very similar. What you are looking for is probably called geographic similarity.
from nltk.corpus import wordnet as wn

delhi = wn.synsets('Delhi', 'n')[0]
print(delhi.definition())
# a city in north central India
hyderabad = wn.synsets('Hyderabad', 'n')[0]
print(hyderabad.definition())
# a city in southern Pakistan on the Indus River
delhi.wup_similarity(hyderabad)
# 0.9
melon = wn.synsets('melon', 'n')[0]
delhi.wup_similarity(melon)
# 0.3
There is a Wordnet extension called Geowordnet. I kind of had the same problem as you at one point and tried to unify Wordnet with some of its extensions: wnext. Hope that helps.

Can I test tesseract ocr in windows command line?

I am new to Tesseract OCR. I converted an image to TIF and tried to run Tesseract on it from the Windows command line to see the output, but I couldn't. Can you help me? What is the command to use?
Here is my sample image:
The simplest tesseract.exe syntax is tesseract.exe inputimage output-text-file.
The assumption here is that tesseract.exe has been added to the PATH environment variable.
You can add the -psm N argument if your text is particularly hard to recognize.
I see that the regular syntax (without any -psm switches) works well enough on the image you attached, unless you need a higher level of accuracy.
Note that non-English characters (such as the symbol next to the prescription) are not recognized; my default installation only contains the English training data.
Here's the tesseract syntax description:
C:\Users\vish\Desktop>tesseract.exe
Usage:tesseract.exe imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]
pagesegmode values are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
-l lang and/or -psm pagesegmode must occur before any configfile.
Single options:
-v --version: version info
--list-langs: list available languages for tesseract engine
And here's the output for your image (NOTE: When I downloaded it, it converted to a PNG image):
C:\Users\vish\Desktop>tesseract.exe ECL8R.png out.txt
Tesseract Open Source OCR Engine v3.02 with Leptonica
C:\Users\vish\Desktop>type out.txt.txt
1 Project Background
A prescription (R) is a written order by a physician or medical doctor to a pharmacist in the form of
medication instructions for an individual patient. You can't get prescription medicines unless someone
with authority prescribes them. Usually, this means a written prescription from your doctor. Dentists,
optometrists, midwives and nurse practitioners may also be authorized to prescribe medicines for you.
It can also be defined as an order to take certain medications.
A prescription has legal implications; this means the prescriber must assume his responsibility for the
clinical care ofthe patient.
Recently, the term "prescription" has known a wider usage being used for clinical assessments,