TextBlob 0-Values for Polarity and Subjectivity - nltk

I am using TextBlob with python 3 to create sentiment values for a larger corpus of documents. I reviewed the distribution of polarity and subjectivity values and noticed a big share of values equal "0", see the distribution in the image (for polarity values on 1300 documents). I thought this might just be because TextBlob returns 0 as a default value if it wasnt able to calculate the sentiments right or for some other reason. I didnt find any docu on that, but maybe one of you can tell me where the high amount of zeros might origin in.
Best, Nero

If you are looking for the way polarity and subjectivity are defined in textblob, you can follow the following link:
https://github.com/sloria/TextBlob/blob/eb08c120d364e908646731d60b4e4c6c1712ff63/textblob/en/en-sentiment.xml
Basically, the lexicon referred by Textblob is en-sentiment.xml.

Related

Data PreProcessing for BERT (base-german)

I am working on a sentiment analysis solution with BERT to analyze tweets in german. My training dataset of is a class of 1000 tweets, which have been manually annotated into the classes neutral, positive and negative.
The dataset with 10.000 tweets is quite unevenly distributed:
approx.
3000 positive
2000 negative
5000 neutral
the tweets contain formulations with #names, https links, numbers, punctuation marks, smileys like :3 :D :) etc..
The interesting thing is, if I remove them with the following code during Data Cleaning, the F1 score gets worse. Only the removal of https links (if I do it alone) leads to a small improvement.
# removing the punctuation and numbers
def remove_punct(text):
text = re.sub(r'http\S+', '', text) # removing links
text = re.sub(r'#\S+', '', text) # removing referencing on usernames with #
text = re.sub(r':\S+', '', text) # removing smileys with : (like :),:D,:( etc)
text = "".join([char for char in text if char not in string.punctuation])
text = re.sub('[0-9]+', '', text)
return text
data['Tweet_clean'] = data['Tweet'].apply(lambda x: remove_punct(x)) # extending the dataset with the column tweet_clean
data.head(40)
also steps like stop words removal or lemmitazation lead more to a deterioration. Is this because I do something wrong or can the model BERT actually handle such values?
A second question is:
I found other records that were also manually annotated, but these are not tweets and the structure of the sentences and language use is different. Would you still recommend to add these records to my original?
There are about 3000 records in German.
My last question:
Should I reduce the class sizes to the size of the smallest unit and thus balance?
BERT can handle punctuation, smileys etc. Of course, smileys contribute a lot to sentiment analysis. So, don't remove them. Next, it would be fair to replace #mentions and links with some special tokens, because the model will probably never see them again in the future.
If your model is designed for tweets, I suggest that you fine-tune BERT with additional corpus, and after fine-tune with Twitter corpus. Or do it simultaneously. More training samples is generally better.
No, it is better to use class weights instead of downsampling.
Based on this paper (By Adam Ek, Jean-Philippe Bernardy and Stergios Chatzikyriakidis), BERT models outperform BiLSTM in terms of better generalizing to punctuations. Looking at the experiments' results in the paper, I say keep the punctuations.
I couln't find anything solid for smiley faces; However, after doing some experiments with the HuggingFace API, I didn't notice much difference with/without smiley faces.

Truncated rice binarization method

Based on HEVC standard, decimal quantized coefficients are binarized through different methods, e.g. Truncated Unary, Truncated Rice (k-th order), Exp-Golomb, etc.
Considering the Truncated Rice method, I cannot understand the role of cmax parameter that completely changes the output. I have understood how the algorithm works, but I do not find any reference about that parameter. It seems that the Truncated Rice does not depend by cmax, however different outputs return for associated different values.
Comparing the two attached images, you can note the different binarized outputs based on cmax. Can you provide any reference that explains how the algorithm works based on cmax?
Could you add references to these screen shots, so we can read and understand the notation?
PS: I assume you have already read Transform Coefficient Coding in HEVC paper, but you still need more detailed information.

wordnet on different text?

I am new to nltk, and I find wordnet functionality pretty useful. It gives synsets, hypernyms, similarity, etc. But however it fails to give similarity between locations like 'Delhi'-'Hyderabad' obviously as these words are not in the wordnet corpus.
So, I would like to know, if somehow I can update the wordnet corpus OR create wordnet over a different corpus e.g. Set of pages extracted from wikipedia related to travel? If at all we can create wordnet over different corpus, then what would be the format, steps to do the same, any limitations?
Please can you point me to links that describe the above concerns. I have searched the internet, googled, read portions of nltk book, but I don't have a single hint to above question.
Pardon me, if the question sounds completely ridiculous.
For flexibility in measuring the semantic similarity of very specific terms like Dehli or Hyderabad, what you want is not something hand-crafted like WordNet, but an automatically-learned similarity measure from a very large database. These are statistical similarity approaches. Of course, you want to avoid having to train such a model on data yourself...
Thus one thing that may be useful is the Google Distance (wikipedia, original paper). It seems fairly simple to implement such a measure in a language like R (code), and the original paper reports 87% agreement with WordNet.
The similarity measures in Wordnet work as expected because Wordnet measures semantic similarity. In that sense, both are cities, so they are very similar. What you are looking for is probably called geographic similarity.
delhi = wn.synsets('Delhi', 'n')[0]
print delhi.definition()
# a city in north central India
hyderabad = wn.synsets('Hyderabad', 'n')[0]
print hyderabad.definition()
# a city in southern Pakistan on the Indus River
delhi.wup_similarity(hyderabad)
# 0.9
melon = wn.synsets('melon', 'n')[0]
delhi.wup_similarity(melon)
# 0.3
There is a Wordnet extension, called Geowordnet. I kind of had the same problem as you at one point and tried to unify Wordnet with some of its extensions: wnext. Hope that helps.

Best method to train Tesseract 3.02

i'm wondering what is the best method to train Tesseract (kind of text/TIFF and so on) for a particular kind of documents, with these particularities:
the structure and main text of the documents is always the same
the only things that change are 5 alphanumeric codes (THIS ARE THE REAL IMPORTANT THING TO DETECT!)
Some of thes codes are bold
At the moment I used standard trained datas, I detect the entire text and I extrapolate the codes with some regular expressions.
It's okay, but I've got errors sometimes, for example:
0 / O
L / I / 1
Please someone knowns some "tricks" to improve precision?
Thanks!
during training part of Tesseract, you have to make a file manually to give to the engine in order to specify ambiguous characters.
For more information look at the "unicharambigs" part of the Tesseract documentation.
Best Regards.

Text-correlation in MySQL [duplicate]

Suppose I want to match address records (or person names or whatever) against each other to merge records that are most likely referring to the same address. Basically, I guess I would like to calculate some kind of correlation between the text values and merge the records if this value is over a certain threshold.
Example:
"West Lawnmower Drive 54 A" is probably the same as "W. Lawn Mower Dr. 54A" but different from "East Lawnmower Drive 54 A".
How would you approach this problem? Would it be necessary to have some kind of context-based dictionary that knows, in the address case, that "W", "W." and "West" are the same? What about misspellings ("mover" instead of "mower" etc)?
I think this is a tricky one - perhaps there are some well-known algorithms out there?
A good baseline, probably an impractical one in terms of its relatively high computational cost and more importantly its production of many false positive, would be generic string distance algorithms such as
Edit distance (aka Levenshtein distance)
Ratcliff/Obershelp
Depending on the level of accuracy required (which, BTW, should be specified both in terms of its recall and precision, i.e. generally expressing whether it is more important to miss a correlation than to falsely identify one), a home-grown process based on [some of] the following heuristics and ideas could do the trick:
tokenize the input, i.e. see the input as an array of words rather than a string
tokenization should also keep the line number info
normalize the input with the use of a short dictionary of common substituions (such as "dr" at the end of a line = "drive", "Jack" = "John", "Bill" = "William"..., "W." at the begining of a line is "West" etc.
Identify (a bit like tagging, as in POS tagging) the nature of some entities (for example ZIP Code, and Extended ZIP code, and also city
Identify (lookup) some of these entities (for example a relative short database table can include all the Cities / town in the targeted area
Identify (lookup) some domain-related entities (if all/many of the address deal with say folks in the legal profession, a lookup of law firm names or of federal buildings may be of help.
Generally, put more weight on tokens that come from the last line of the address
Put more (or less) weight on tokens with a particular entity type (ex: "Drive", "Street", "Court" should with much less than the tokens which precede them.
Consider a modified SOUNDEX algorithm to help with normalization of
With the above in mind, implement a rule-based evaluator. Tentatively, the rules could be implemented as visitors to a tree/array-like structure where the input is parsed initially (Visitor design pattern).
The advantage of the rule-based framework, is that each heuristic is in its own function and rules can be prioritized, i.e. placing some rules early in the chain, allowing to abort the evaluation early, with some strong heuristics (eg: different City => Correlation = 0, level of confidence = 95% etc...).
An important consideration with search for correlations is the need to a priori compare every single item (here address) with every other item, hence requiring as many as 1/2 n^2 item-level comparisons. Because of this, it may be useful to store the reference items in a way where they are pre-processed (parsed, normalized...) and also to maybe have a digest/key of sort that can be used as [very rough] indicator of a possible correlation (for example a key made of the 5 digit ZIP-Code followed by the SOUNDEX value of the "primary" name).
I would look at producing a similarity comparison metric that, given two objects (strings perhaps), returns "distance" between them.
If you fulfil the following criteria then it helps:
distance between an object and
itself is zero. (reflexive)
distance from a to b is the same in
both directions (transitive)
distance from a to c is not more
than distance from a to b plus
distance from a to c. (triangle
rule)
If your metric obeys these they you can arrange your objects in metric space which means you can run queries like:
Which other object is most like
this one
Give me the 5 objects
most like this one.
There's a good book about it here. Once you've set up the infrastructure for hosting objects and running the queries you can simply plug in different comparison algorithms, compare their performance and then tune them.
I did this for geographic data at university and it was quite fun trying to tune the comparison algorithms.
I'm sure you could come up with something more advanced but you could start with something simple like reducing the address line to the digits and the first letter of each word and then compare the result of that using a longest common subsequence algorithm.
Hope that helps in some way.
You can use Levenshtein edit distance to find strings that differ by only a few characters. BK Trees can help speed up the matching process.
Disclaimer: I don't know any algorithm that does that, but would really be interested in knowing one if it exists. This answer is a naive attempt of trying to solve the problem, with no previous knowledge whatsoever. Comments welcome, please don't laugh too laud.
If you try doing it by hand, I would suggest applying some kind of "normalization" to your strings : lowercase them, remove punctuation, maybe replace common abbreviations with the full words (Dr. => drive, St => street, etc...).
Then, you can try different alignments between the two strings you compare, and compute the correlation by averaging the absolute differences between corresponding letters (eg a = 1, b = 2, etc.. and corr(a, b) = |a - b| = 1) :
west lawnmover drive
w lawnmower street
Thus, even if some letters are different, the correlation would be high. Then, simply keep the maximal correlation you found, and decide that their are the same if the correlation is above a given threshold.
When I had to modify a proprietary program doing this, back in the early 90s, it took many thousands of lines of code in multiple modules, built up over years of experience. Modern machine-learning techniques ought to make it easier, and perhaps you don't need to perform as well (it was my employer's bread and butter).
So if you're talking about merging lists of actual mailing addresses, I'd do it by outsourcing if I can.
The USPS had some tests to measure quality of address standardization programs. I don't remember anything about how that worked, but you might check if they still do it -- maybe you can get some good training data.