wordnet on different text? - nltk

I am new to nltk, and I find the wordnet functionality pretty useful. It gives synsets, hypernyms, similarity, etc. However, it fails to give a similarity between locations like 'Delhi' and 'Hyderabad', obviously because these words are not in the wordnet corpus.
So, I would like to know whether I can somehow update the wordnet corpus, OR create a wordnet over a different corpus, e.g. a set of pages extracted from Wikipedia related to travel. If we can create a wordnet over a different corpus, what would be the format and the steps to do so, and are there any limitations?
Could you please point me to links that address the above concerns? I have searched the internet, googled, and read portions of the NLTK book, but I don't have a single lead on this question.
Pardon me, if the question sounds completely ridiculous.

For flexibility in measuring the semantic similarity of very specific terms like Delhi or Hyderabad, what you want is not something hand-crafted like WordNet, but an automatically learned similarity measure from a very large database. These are statistical similarity approaches. Of course, you want to avoid having to train such a model on data yourself...
Thus one thing that may be useful is the Normalized Google Distance (wikipedia, original paper). It seems fairly simple to implement such a measure in a language like R (code), and the original paper reports 87% agreement with WordNet.
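For reference, the measure itself is only a few lines once you have page-hit counts; a minimal sketch in Python (the counts below are made-up placeholders — in practice they would come from a search-engine or corpus index):

from math import log

def ngd(fx, fy, fxy, n):
    # fx, fy: pages containing each term; fxy: pages containing both; n: total pages indexed
    return (max(log(fx), log(fy)) - log(fxy)) / (log(n) - min(log(fx), log(fy)))

# illustrative numbers only
print(ngd(fx=2_500_000, fy=1_200_000, fxy=300_000, n=25_000_000_000))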

The similarity measures in Wordnet work as expected because Wordnet measures semantic similarity. In that sense, both are cities, so they are very similar. What you are looking for is probably called geographic similarity.
from nltk.corpus import wordnet as wn

delhi = wn.synsets('Delhi', 'n')[0]
print(delhi.definition())
# a city in north central India
hyderabad = wn.synsets('Hyderabad', 'n')[0]
print(hyderabad.definition())
# a city in southern Pakistan on the Indus River
print(delhi.wup_similarity(hyderabad))
# 0.9
melon = wn.synsets('melon', 'n')[0]
print(delhi.wup_similarity(melon))
# 0.3
There is a Wordnet extension, called Geowordnet. I kind of had the same problem as you at one point and tried to unify Wordnet with some of its extensions: wnext. Hope that helps.

Related

Model suggestion: Keyword spotting

I want to predict the occurrences of the word "repeat" in a speech, as well as the word's approximate duration. For this task I'm planning to build a deep learning model. I have around 50 positive and 50 negative utterances (I couldn't collect more).
Initially I searched for pretrained keyword-spotting models, but I couldn't find a good one.
Then I tried speech recognition models (Deep Speech), but they couldn't pick out the exact "repeat" words because my data has an Indian accent. I also felt that using full ASR models for this task would be overkill.
Now I've split the entire audio into 1-second chunks with 50% overlap and tried binary audio classification on each chunk, i.e. whether the chunk contains the word "repeat" or not. For the classification model, I calculated MFCC features and built a sequence model on top of them. Nothing seems to work for me.
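Roughly, that chunking + MFCC step looks like the sketch below (assuming librosa; the file name, sample rate and window/hop sizes are only illustrative of the 1-second / 50%-overlap setup described above):

import librosa

y, sr = librosa.load("speech.wav", sr=16000)
win = sr          # 1-second windows
hop = sr // 2     # 50% overlap

chunks = [y[start:start + win] for start in range(0, len(y) - win + 1, hop)]
features = [librosa.feature.mfcc(y=chunk, sr=sr, n_mfcc=13).T for chunk in chunks]
# each entry is a (frames, 13) MFCC matrix fed to a per-chunk binary classifier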
If anyone has already worked on this kind of task, please point me to a suitable method/resources for building a DL model for it. Thanks in advance!

I want to develop an Android app that summarizes user-entered text (could be a news article)

I have looked into extractive and abstractive summarization methods. Because of the many disadvantages of abstractive summarization, I would like to do extractive summarization, and I would like to do it with a supervised learning method. In my research on extractive summarization I always came across the TextRank algorithm, but that is an unsupervised learning method. Is supervised extractive summarization possible? And can I run TextRank on a dataset containing, say, 15,000 documents?
The code below is not directly relevant to the question; I am only sharing it for context.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# load pre-trained GloVe word vectors into a dict: word -> 100-d vector
word_embeddings = {}
f = open('/content/drive/MyDrive/MetinAnalizi/glove.6B.100d.txt', encoding='utf-8')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    word_embeddings[word] = coefs
f.close()

# pairwise cosine-similarity matrix between sentence vectors
sim_mat = np.zeros([len(sentences), len(sentences)])
for i in range(len(sentences)):
    for j in range(len(sentences)):
        if i != j:
            sim_mat[i][j] = cosine_similarity(sentence_vectors[i].reshape(1, 100),
                                              sentence_vectors[j].reshape(1, 100))[0, 0]
There's a wide variety of text summarization methods, and the use of deep learning in NLP (language models, transformers, etc.) since late 2017 has led to many advancements.
Some of the trade-offs here depend on quality vs. cost. For example, extractive summarization with TextRank is relatively inexpensive and does not require a trained model. OTOH, abstractive summarization with DL models tends to be much more expensive, though it also tends to produce better results.
In PyTextRank we have different algorithm variants implemented, which produce different kinds of extractive summarization depending on the intended use case. News article summarization might prefer PositionRank, while research article summarization might prefer Biased TextRank. This is due to the kinds of phrases that are likely to be emphasized, given the typical style and structure of writing encountered in those domains.
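For reference, a minimal sketch of wiring PyTextRank into a spaCy pipeline (the component and attribute names assume a recent PyTextRank 3.x release; the model name and limits are illustrative):

import spacy
import pytextrank  # importing registers the "textrank" / "positionrank" / "biasedtextrank" factories

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed
nlp.add_pipe("textrank")            # swap in "positionrank" for news-style articles

article_text = "Your article text goes here ..."
doc = nlp(article_text)

# extractive summary: top-ranked sentences, driven by the top-ranked phrases
for sent in doc._.textrank.summary(limit_phrases=15, limit_sentences=3):
    print(sent)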
My advice is to experiment and see what fits your needs best. If you have many articles to summarize and want to keep the budget low, then TextRank might work well. If you need more natural-reading text in the summaries, abstractive summarization is probably the way to go.

word synonym / antonym detection

I need to create a classifier that takes two words and determines whether they are synonyms or antonyms. I tried NLTK's antonym/synonym relations (via WordNet), but there isn't enough data.
example:
capitalism <-[antonym]-> socialism
capitalism =[synonym]= free market
god <-[antonym]-> atheism
political correctness <-[antonym]-> free speech
advertising =[synonym]= marketing
I was thinking about taking a BERT model, because maybe some of these relations are embedded in it, and transfer-learning it on a dataset that I found.
I would suggest the following pipeline:
Construct a training set from an existing dataset of synonyms and antonyms (taken e.g. from the WordNet thesaurus). You'll need to craft negative examples carefully.
Take a pretrained model such as BERT and fine-tune it on your tasks. If you choose BERT, it should probably be BertForNextSentencePrediction, where you use your words/phrases instead of sentences and predict 1 if they are synonyms and 0 if they are not; the same goes for antonyms.
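As a rough sketch of that fine-tuning step, assuming the HuggingFace transformers library (the model name, label encoding and argument names are illustrative and may differ between library versions):

import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

# Encode a word/phrase pair the same way a sentence pair would be encoded:
# [CLS] capitalism [SEP] free market [SEP]
enc = tokenizer("capitalism", "free market", return_tensors="pt")

# Reuse the 2-way NSP head as a synonym / not-synonym decision.
# (In older transformers versions the argument is next_sentence_label.)
labels = torch.tensor([0])
outputs = model(**enc, labels=labels)
loss, logits = outputs.loss, outputs.logits  # back-propagate the loss during fine-tuning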

Data PreProcessing for BERT (base-german)

I am working on a sentiment analysis solution with BERT to analyze tweets in German. My training dataset consists of tweets which have been manually annotated with the classes neutral, positive and negative.
The dataset of 10,000 tweets is quite unevenly distributed:
approx.
3000 positive
2000 negative
5000 neutral
The tweets contain formulations with #names, https links, numbers, punctuation marks, smileys like :3 :D :) etc.
The interesting thing is that if I remove them with the following code during data cleaning, the F1 score gets worse. Only the removal of https links (if I do it on its own) leads to a small improvement.
import re
import string

# removing the punctuation and numbers
def remove_punct(text):
    text = re.sub(r'http\S+', '', text)  # removing links
    text = re.sub(r'#\S+', '', text)     # removing references to usernames with #
    text = re.sub(r':\S+', '', text)     # removing smileys starting with : (like :), :D, :( etc.)
    text = "".join([char for char in text if char not in string.punctuation])
    text = re.sub('[0-9]+', '', text)
    return text

data['Tweet_clean'] = data['Tweet'].apply(lambda x: remove_punct(x))  # extending the dataset with the column Tweet_clean
data.head(40)
Also, steps like stop-word removal or lemmatization lead to further deterioration. Is this because I am doing something wrong, or can BERT actually handle such values?
A second question is:
I found other records that were also manually annotated, but these are not tweets, and the sentence structure and language use are different. Would you still recommend adding these records to my original dataset?
There are about 3000 records in German.
My last question:
Should I reduce the class sizes to the size of the smallest class and thus balance the dataset?
BERT can handle punctuation, smileys, etc. Of course, smileys contribute a lot to sentiment analysis, so don't remove them. Next, it would be reasonable to replace #mentions and links with special tokens, because the model will probably never see those exact strings again in the future.
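For example, a rough sketch of that replacement (the regexes and token names are illustrative; remember to register any new tokens with the tokenizer):

import re

def normalize_tweet(text):
    text = re.sub(r'http\S+', '[URL]', text)    # replace links with a placeholder token
    text = re.sub(r'[@#]\S+', '[USER]', text)   # replace @/# references with a placeholder token
    return text

# tokenizer.add_special_tokens({'additional_special_tokens': ['[URL]', '[USER]']})
# model.resize_token_embeddings(len(tokenizer))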
If your model is meant for tweets, I suggest that you fine-tune BERT on the additional corpus first and then fine-tune on the Twitter corpus, or do both simultaneously. More training samples are generally better.
No, it is better to use class weights instead of downsampling.
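A minimal sketch of class weighting (PyTorch-style; the counts come from the question, and inverse-frequency weighting is just one common choice):

import torch
import torch.nn as nn

counts = torch.tensor([5000.0, 3000.0, 2000.0])  # neutral, positive, negative
weights = counts.sum() / (len(counts) * counts)  # inverse-frequency class weights

loss_fn = nn.CrossEntropyLoss(weight=weights)
# loss = loss_fn(logits, labels)  # use in place of an unweighted loss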
Based on this paper (by Adam Ek, Jean-Philippe Bernardy and Stergios Chatzikyriakidis), BERT models outperform BiLSTMs at generalizing over punctuation. Looking at the experiments' results in the paper, I'd say keep the punctuation.
I couldn't find anything solid for smiley faces; however, after doing some experiments with the HuggingFace API, I didn't notice much difference with or without smiley faces.

Text-correlation in MySQL [duplicate]

Suppose I want to match address records (or person names or whatever) against each other to merge records that are most likely referring to the same address. Basically, I guess I would like to calculate some kind of correlation between the text values and merge the records if this value is over a certain threshold.
Example:
"West Lawnmower Drive 54 A" is probably the same as "W. Lawn Mower Dr. 54A" but different from "East Lawnmower Drive 54 A".
How would you approach this problem? Would it be necessary to have some kind of context-based dictionary that knows, in the address case, that "W", "W." and "West" are the same? What about misspellings ("mover" instead of "mower" etc)?
I think this is a tricky one - perhaps there are some well-known algorithms out there?
A good baseline, though probably an impractical one given its relatively high computational cost and, more importantly, its production of many false positives, would be generic string-distance algorithms such as:
Edit distance (aka Levenshtein distance)
Ratcliff/Obershelp
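Both are a few lines to try in Python; for instance (difflib's SequenceMatcher uses a Ratcliff/Obershelp-style "gestalt pattern matching" algorithm; the addresses come from the question, and any threshold would need tuning):

from difflib import SequenceMatcher

def ratcliff_obershelp(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()  # 1.0 = identical

print(ratcliff_obershelp("West Lawnmower Drive 54 A", "W. Lawn Mower Dr. 54A"))
print(ratcliff_obershelp("West Lawnmower Drive 54 A", "East Lawnmower Drive 54 A"))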
Depending on the level of accuracy required (which, BTW, should be specified both in terms of its recall and precision, i.e. generally expressing whether it is more important to miss a correlation than to falsely identify one), a home-grown process based on [some of] the following heuristics and ideas could do the trick:
tokenize the input, i.e. treat the input as an array of words rather than as a single string
tokenization should also keep the line-number info
normalize the input with the help of a short dictionary of common substitutions (such as "dr" at the end of a line = "drive", "Jack" = "John", "Bill" = "William"..., "W." at the beginning of a line = "West", etc.)
Identify (a bit like tagging, as in POS tagging) the nature of some entities (for example ZIP code, extended ZIP code, and also city)
Identify (look up) some of these entities (for example, a relatively short database table can include all the cities/towns in the targeted area)
Identify (look up) some domain-related entities (if all/many of the addresses deal with, say, folks in the legal profession, a lookup of law firm names or of federal buildings may be of help)
Generally, put more weight on tokens that come from the last line of the address
Put more (or less) weight on tokens of a particular entity type (e.g. "Drive", "Street", "Court" should weigh much less than the tokens that precede them)
Consider a modified SOUNDEX algorithm to help with normalization of
With the above in mind, implement a rule-based evaluator. Tentatively, the rules could be implemented as visitors to a tree/array-like structure where the input is parsed initially (Visitor design pattern).
The advantage of the rule-based framework is that each heuristic lives in its own function, and rules can be prioritized, i.e. placed early in the chain, allowing the evaluation to be aborted early when some strong heuristics fire (e.g. different city => correlation = 0, level of confidence = 95%, etc.).
An important consideration when searching for correlations is the need, a priori, to compare every single item (here, an address) with every other item, which requires as many as n^2/2 item-level comparisons. Because of this, it may be useful to store the reference items in a pre-processed form (parsed, normalized...) and maybe also to keep a digest/key of sorts that can be used as a [very rough] indicator of a possible correlation (for example, a key made of the 5-digit ZIP code followed by the SOUNDEX value of the "primary" name).
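As an illustration of that digest/key idea, a rough sketch (the SOUNDEX below is a simplified version of the classic algorithm, and the field names are hypothetical):

import re

def soundex(word, length=4):
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = re.sub(r"[^a-z]", "", word.lower())
    if not word:
        return "0" * length
    out, prev = [word[0].upper()], codes.get(word[0], "")
    for c in word[1:]:
        d = codes.get(c, "")
        if d and d != prev:
            out.append(d)
        prev = d
    return ("".join(out) + "0" * length)[:length]

def blocking_key(zip_code, primary_name):
    # only records sharing a key need the expensive pairwise comparison
    return f"{zip_code[:5]}-{soundex(primary_name)}"

print(blocking_key("90210-1234", "Lawnmower"))  # '90210-L560'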
I would look at producing a similarity comparison metric that, given two objects (strings perhaps), returns "distance" between them.
It helps if your metric fulfils the following criteria:
the distance between an object and itself is zero (identity)
the distance from a to b is the same in both directions (symmetry)
the distance from a to c is no more than the distance from a to b plus the distance from b to c (triangle inequality)
If your metric obeys these, then you can arrange your objects in a metric space, which means you can run queries like:
Which other object is most like this one?
Give me the 5 objects most like this one.
There's a good book about it here. Once you've set up the infrastructure for hosting objects and running the queries you can simply plug in different comparison algorithms, compare their performance and then tune them.
I did this for geographic data at university and it was quite fun trying to tune the comparison algorithms.
I'm sure you could come up with something more advanced, but you could start with something simple like reducing the address line to the digits and the first letter of each word, and then comparing the results using a longest-common-subsequence algorithm.
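For instance, a naive sketch of that reduction plus a longest-common-subsequence comparison (illustrative only; any threshold for "same" would need tuning):

import re

def reduce_address(line):
    tokens = re.findall(r"[A-Za-z]+|\d+", line)
    return "".join(t if t.isdigit() else t[0].lower() for t in tokens)

def lcs_len(a, b):
    # classic dynamic-programming longest common subsequence
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def similarity(x, y):
    rx, ry = reduce_address(x), reduce_address(y)
    return lcs_len(rx, ry) / max(len(rx), len(ry), 1)

print(similarity("West Lawnmower Drive 54 A", "W. Lawn Mower Dr. 54A"))
print(similarity("West Lawnmower Drive 54 A", "East Lawnmower Drive 54 A"))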
Hope that helps in some way.
You can use Levenshtein edit distance to find strings that differ by only a few characters. BK Trees can help speed up the matching process.
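For instance, a small sketch of both pieces: a plain Levenshtein distance and a minimal BK-tree built on top of it, so that "everything within distance k" queries don't have to scan every record (purely illustrative):

def levenshtein(a, b):
    # iterative dynamic programming, keeping only the previous row
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

class BKTree:
    def __init__(self, dist=levenshtein):
        self.dist, self.root = dist, None  # each node: [word, {distance: child}]

    def add(self, word):
        if self.root is None:
            self.root = [word, {}]
            return
        node = self.root
        while True:
            d = self.dist(word, node[0])
            if d in node[1]:
                node = node[1][d]
            else:
                node[1][d] = [word, {}]
                return

    def query(self, word, tol):
        # the triangle inequality lets us prune children outside [d - tol, d + tol]
        results, stack = [], [self.root] if self.root else []
        while stack:
            node = stack.pop()
            d = self.dist(word, node[0])
            if d <= tol:
                results.append((d, node[0]))
            stack.extend(child for k, child in node[1].items() if d - tol <= k <= d + tol)
        return sorted(results)

tree = BKTree()
for name in ["lawnmower", "lawn mower", "lawnmover", "landmower", "east lawn"]:
    tree.add(name)
print(tree.query("lawnmower", tol=2))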
Disclaimer: I don't know any algorithm that does that, but I would really be interested in knowing one if it exists. This answer is a naive attempt at solving the problem, with no previous knowledge whatsoever. Comments welcome, please don't laugh too loud.
If you try doing it by hand, I would suggest applying some kind of "normalization" to your strings : lowercase them, remove punctuation, maybe replace common abbreviations with the full words (Dr. => drive, St => street, etc...).
Then you can try different alignments between the two strings you compare, and compute the correlation by averaging the absolute differences between corresponding letters (e.g. a = 1, b = 2, etc., and corr(a, b) = |a - b| = 1):
west lawnmover drive
w lawnmower street
Thus, even if some letters are different, the correlation would be high. Then simply keep the maximal correlation you found, and decide that they are the same if that correlation is above a given threshold.
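A literal (and admittedly naive) sketch of that idea, with a minimum-overlap guard added on top of the description above so that a trivial one-character overlap cannot win:

import re

def normalize(s):
    return re.sub(r"[^a-z0-9 ]", "", s.lower())

def alignment_score(a, b):
    # slide b over a; score each alignment by the mean absolute difference
    # between the character codes of the overlapping letters (lower = more alike)
    a, b = normalize(a), normalize(b)
    best = float("inf")
    min_overlap = max(1, min(len(a), len(b)) // 2)
    for shift in range(-len(b) + 1, len(a)):
        lo, hi = max(0, shift), min(len(a), shift + len(b))
        if hi - lo < min_overlap:
            continue
        pairs = [(a[i], b[i - shift]) for i in range(lo, hi)]
        best = min(best, sum(abs(ord(x) - ord(y)) for x, y in pairs) / len(pairs))
    return best  # 0.0 would mean a perfect character-for-character overlap

print(alignment_score("west lawnmover drive", "w lawnmower street"))
print(alignment_score("west lawnmover drive", "east lawnmower drive"))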
When I had to modify a proprietary program doing this, back in the early 90s, it took many thousands of lines of code in multiple modules, built up over years of experience. Modern machine-learning techniques ought to make it easier, and perhaps you don't need to perform as well (it was my employer's bread and butter).
So if you're talking about merging lists of actual mailing addresses, I'd do it by outsourcing if I can.
The USPS had some tests to measure quality of address standardization programs. I don't remember anything about how that worked, but you might check if they still do it -- maybe you can get some good training data.