How to Tokenize and Phonetize (Sounds-like) in NLTK?

Here is the code to tokenize. How do I phonetize the words after tokenizing them in NLTK?
>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
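NLTK itself does not ship a single "phonetize" call, but it does include the CMU Pronouncing Dictionary corpus, which maps words to ARPAbet phoneme sequences. A minimal sounds-like sketch built on the tokens above (assuming the cmudict corpus can be downloaded in your environment):
import nltk
from nltk.corpus import cmudict

nltk.download('cmudict')                 # pronunciation dictionary bundled with NLTK
pronunciations = cmudict.dict()          # lowercase word -> list of phoneme sequences

sentence = """At eight o'clock on Thursday morning
Arthur didn't feel very good."""
tokens = nltk.word_tokenize(sentence)

# Print each token next to its first phoneme sequence, or None if the word is unknown.
for token in tokens:
    phones = pronunciations.get(token.lower())
    print(token, phones[0] if phones else None)
Tokens that share a phoneme sequence "sound alike"; for a coarser sounds-like key such as Soundex or Metaphone, a third-party library like jellyfish is one option.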

Related

How to stop DjangoJSONEncoder from truncating microseconds datetime objects?

I have a dictionary with a datetime object inside it and when I try to json dump it, Django truncates the microseconds:
> dikt
{'date': datetime.datetime(2020, 6, 22, 11, 36, 25, 763835, tzinfo=<DstTzInfo 'Africa/Nairobi' EAT+3:00:00 STD>)}
> json.dumps(dikt, cls=DjangoJSONEncoder)
'{"date": "2020-06-22T11:36:25.763+03:00"}'
How can I preserve all six microsecond digits?
DjangoJSONEncoder follows the ECMA-262 date format, which only keeps millisecond precision.
You can easily overcome this by introducing your own custom encoder:
class MyCustomEncoder(DjangoJSONEncoder):
    def default(self, obj):
        if isinstance(obj, datetime.datetime):
            r = obj.isoformat()
            if r.endswith('+00:00'):
                r = r[:-6] + 'Z'
            return r
        return super(MyCustomEncoder, self).default(obj)

datetime_object = datetime.datetime.now()
print(datetime_object)
print(json.dumps(datetime_object, cls=MyCustomEncoder))
Output:
2020-06-22 11:54:29.127120
"2020-06-22T11:54:29.127120"

Word tokenizing gives different results at home than on Colaboratory

Local:
$ python
Python 3.8.0 (default, Nov 6 2019, 15:27:39)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import collections
>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('averaged_perceptron_tagger')
>>> nltk.download('stopwords')
>>> stop_words = set(nltk.corpus.stopwords.words('english'))
>>> text = """Former Kansas Territorial Governor James W. Denver visited his namesake city in 1875 and in 1882."""
>>> def preprocess(document):
...     sentence_list = list()
...     for sentence in nltk.sent_tokenize(document):
...         word_tokens = nltk.word_tokenize(sentence)
...         sentence_list.append([w for w in word_tokens if not w in stop_words and len(w) > 1])
...     sentences = [nltk.pos_tag(sent) for sent in sentence_list]
...     return sentences
>>> grammar = r'Chunk: {(<A.*>*|<N.*>*|<VB[DGNP]?>*)+}'
>>> chunk_parser = nltk.RegexpParser(grammar)
>>> tagged = preprocess(text)
>>> result = collections.Counter()
>>> for sentence in tagged:
...     my_tree = chunk_parser.parse(sentence)
...     for subtree in my_tree.subtrees():
...         if subtree.label() == 'Chunk':
...             leaves = [x[0] for x in subtree.leaves()]
...             phrase = " ".join(leaves)
...             result[phrase] += 1
Output at home is:
>>> print(result.most_common(10))
[('Former Kansas Territorial Governor James W. Denver', 1), ('visited', 1), ('city', 1)]
Same code on Colaboratory, result is:
>>> print(result.most_common(10))
[]
I have run non-NLTK code in both places and gotten identical output. Could it be that the local NLTK libraries are different? Different versions of NLTK?
I was running Python 3.8.0 locally. I changed it to 3.6.9 and now get the same results as on Colaboratory.
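If you want to see what actually differs between the two environments before changing interpreters, a minimal sketch is to print the Python and NLTK versions (and where NLTK looks for its downloaded data) in both places:
import sys
import nltk

print(sys.version)        # interpreter version, e.g. 3.8.0 locally vs 3.6.9 on Colaboratory
print(nltk.__version__)   # NLTK library version
print(nltk.data.path)     # directories searched for punkt and the other downloaded models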

Skipping AttributeError while importing Twitter data into pandas

I have a nearly 1 GB file storing about 0.2 million tweets, and a file of that size inevitably contains some errors. The error shown is
AttributeError: 'int' object has no attribute 'items'. It occurs when I try to run this code:
raw_data_path = input("Enter the path for raw data file: ")
tweet_data_path = raw_data_path
tweet_data = []
tweets_file = open(tweet_data_path, "r", encoding="utf-8")
for line in tweets_file:
    try:
        tweet = json.loads(line)
        tweet_data.append(tweet)
    except:
        continue
tweet_data2 = [tweet for tweet in tweet_data if isinstance(tweet, dict)]

from pandas.io.json import json_normalize
tweets = json_normalize(tweet_data2)[["text", "lang", "place.country",
                                      "created_at", "coordinates",
                                      "user.location", "id"]]
Is there a solution where the lines causing such errors are skipped, so that processing continues with the rest of the lines?
The issue here is not with the lines in the data but with tweet_data itself. If you check your tweet_data, you will find one or more elements of 'int' type (assuming your tweet_data is a list of dictionaries, since json_normalize only expects a dict or a list of dicts).
You may want to check your tweet data and remove values other than dictionaries.
I was able to reproduce it with the example below from the json_normalize documentation:
Working example:
from pandas.io.json import json_normalize
data = [{'state': 'Florida',
         'shortname': 'FL',
         'info': {'governor': 'Rick Scott'},
         'counties': [{'name': 'Dade', 'population': 12345},
                      {'name': 'Broward', 'population': 40000},
                      {'name': 'Palm Beach', 'population': 60000}]},
        {'state': 'Ohio',
         'shortname': 'OH',
         'info': {'governor': 'John Kasich'},
         'counties': [{'name': 'Summit', 'population': 1234},
                      {'name': 'Cuyahoga', 'population': 1337}]},
        ]
json_normalize(data)
Output:
Displays a dataframe.
Reproducing the error:
from pandas.io.json import json_normalize
data = [{'state': 'Florida',
         'shortname': 'FL',
         'info': {'governor': 'Rick Scott'},
         'counties': [{'name': 'Dade', 'population': 12345},
                      {'name': 'Broward', 'population': 40000},
                      {'name': 'Palm Beach', 'population': 60000}]},
        {'state': 'Ohio',
         'shortname': 'OH',
         'info': {'governor': 'John Kasich'},
         'counties': [{'name': 'Summit', 'population': 1234},
                      {'name': 'Cuyahoga', 'population': 1337}]},
        1  # *Added an integer to the list*
        ]
result = json_normalize(data)
Error:
AttributeError: 'int' object has no attribute 'items'
How to prune "tweet_data" (not needed if you follow the update below):
Before normalising, run:
tweet_data = [tweet for tweet in tweet_data if isinstance(tweet, dict)]
Update (for the for loop):
for line in tweets_file:
    try:
        tweet = json.loads(line)
        if isinstance(tweet, dict):
            tweet_data.append(tweet)
    except:
        continue
The final form of the code looks like this:
tweet_data_path = raw_data_path
tweet_data = []
tweets_file = open(tweet_data_path, "r", encoding="utf-8")
for line in tweets_file:
    try:
        tweet = json.loads(line)
        if isinstance(tweet, dict):
            tweet_data.append(tweet)
    except:
        continue
This removes any possibility of an AttributeError hindering the import into a pandas DataFrame.
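As a side note, recent pandas versions (1.0+) expose json_normalize at the top level and deprecate pandas.io.json.json_normalize. A sketch of the same normalisation step with the top-level function, reusing the column names from the question:
import pandas as pd

# tweet_data now contains only dicts thanks to the isinstance check in the loop above
tweets = pd.json_normalize(tweet_data)[["text", "lang", "place.country",
                                        "created_at", "coordinates",
                                        "user.location", "id"]]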

NLTK single-word part-of-speech tagging

Is there a way to use NLTK to get a set of possible parts of speech of a single string of letters, taking into account that different words might have homonyms?
For example: report -> {Noun, Verb} , kind -> {Adjective, Noun}
I have not been able to find a POS-tokenizer that tags part-of-speech for words outside of the context of a full sentence. This seems like a very basic request of NLTK, so I'm confused as to why I've had so much trouble finding it.
Yes. The simplest way is not to use a tagger, but simply load up one or more corpora and collect the set of all tags for the word you are interested in. If you're interested in more than one word, it's simplest to collect the tags for all words in the corpus, then look up anything you want. I'll add frequency counts, just because I can. For example, using the Brown corpus and the simple "universal" tagset:
>>> import nltk
>>> wordtags = nltk.ConditionalFreqDist((w.lower(), t)
...     for w, t in nltk.corpus.brown.tagged_words(tagset="universal"))
>>> wordtags["report"]
FreqDist({'NOUN': 135, 'VERB': 39})
>>> list(wordtags["kind"])
['ADJ', 'NOUN']
Because POS models are trained on sentence/document-based data, the expected input to the pre-trained model is a sentence/document. When there is only a single word, it is treated as a single-word sentence, so there should only be one tag in that single-word sentence context.
If you're trying to find all possible POS tags per English word, you would need a corpus with many different uses of the words, then tag the corpus and count/extract the number of tags per word. E.g.
>>> from nltk import pos_tag
>>> sent1 = 'The coaches are going from Singapore to Frankfurt'
>>> sent2 = 'He coaches the football team'
>>> pos_tag(sent1.split())
[('The', 'DT'), ('coaches', 'NNS'), ('are', 'VBP'), ('going', 'VBG'), ('from', 'IN'), ('Singapore', 'NNP'), ('to', 'TO'), ('Frankfurt', 'NNP')]
>>> pos_tag(sent2.split())
[('He', 'PRP'), ('coaches', 'VBZ'), ('the', 'DT'), ('football', 'NN'), ('team', 'NN')]
>>> from collections import defaultdict, Counter
>>> from itertools import chain
>>> counts = defaultdict(Counter)
>>> tagged_sents = [pos_tag(sent) for sent in [sent1.split(), sent2.split()]]
>>> for word, pos in chain(*tagged_sents):
...     counts[word][pos] += 1
...
>>> counts
defaultdict(<class 'collections.Counter'>, {'from': Counter({'IN': 1}), 'to': Counter({'TO': 1}), 'Singapore': Counter({'NNP': 1}), 'football': Counter({'NN': 1}), 'coaches': Counter({'VBZ': 1, 'NNS': 1}), 'going': Counter({'VBG': 1}), 'are': Counter({'VBP': 1}), 'team': Counter({'NN': 1}), 'The': Counter({'DT': 1}), 'Frankfurt': Counter({'NNP': 1}), 'the': Counter({'DT': 1}), 'He': Counter({'PRP': 1})})
>>> counts['coaches']
Counter({'VBZ': 1, 'NNS': 1})
Alternatively, there's WordNet:
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('coaches')
[Synset('coach.n.01'), Synset('coach.n.02'), Synset('passenger_car.n.01'), Synset('coach.n.04'), Synset('bus.n.01'), Synset('coach.v.01'), Synset('coach.v.02')]
>>> [ss.pos() for ss in wn.synsets('coaches')]
[u'n', u'n', u'n', u'n', u'n', u'v', u'v']
>>> Counter([ss.pos() for ss in wn.synsets('coaches')])
Counter({u'n': 5, u'v': 2})
But note that WordNet is a manually crafted resource, so you cannot expect every English word to be in it.

Extract a particular part of a JSON string using Python regex

I have the JSON string below:
"{"sweep_enabled":true,"product":"XYZ","page":"XYZ Profile","list":" {\"id\":205782,\"name\":\"Robert Shriwas\",\"gender\":\"F\",\"practicing_since\":null,\"years\":21,\"specializations\":[\"Mentor\"]}","form":{"q":"","city":"Delhi","locality":null},"cerebro":true}"
I want to extract the list part out of the above string:
{\"id\":205782,\"name\":\"Robert Shriwas\",\"gender\":\"F\",\"practicing_since\":null,\"years\":21,\"specializations\":[\"Mentor\"]}
How can I do this using Python regex?
There is a problem in your JSON: it encloses another JSON object in double quotes, which causes json.loads to fail. Try transforming the JSON string before passing it to json.loads.
The following works perfectly:
>>> p = json.loads('''{"sweep_enabled":true,"product":"XYZ","page":"XYZ Profile","list":{\"id\":205782,\"name\":\"Robert Shriwas\",\"gender\":\"F\",\"practicing_since\":null,\"years\":21,\"specializations\":[\"Mentor\"]},"form":{"q":"","city":"Delhi","locality":null},"cerebro":true}''')
And you extract the required part as:
>>> p["list"]
{u'name': u'Robert Shriwas', u'gender': u'F', u'specializations': [u'Mentor'], u'id': 205782, u'years': 21, u'practicing_since': None}
Check this out; I managed to correct the JSON you provided:
>>> p = '''{"sweep_enabled":true,"product":"XYZ","page":"XYZ Profile","list":" {\"id\":205782,\"name\":\"Robert Shriwas\",\"gender\":\"F\",\"practicing_since\":null,\"years\":21,\"specializations\":[\"Mentor\"]}","form":{"q":"","city":"Delhi","locality":null},"cerebro":true}'''
>>> q = re.sub(r'(:)\s*"\s*(\{[^\}]+\})\s*"',r'\1\2', p[1:-1])
>>> q
'"sweep_enabled":true,"product":"XYZ","page":"XYZ Profile","list":{"id":205782,"name":"Robert Shriwas","gender":"F","practicing_since":null,"years":21,"specializations":["Mentor"]},"form":{"q":"","city":"Delhi","locality":null},"cerebro":true'
>>> r = p[0] + q + p[-1]
>>> r
'{"sweep_enabled":true,"product":"XYZ","page":"XYZ Profile","list":{"id":205782,"name":"Robert Shriwas","gender":"F","practicing_since":null,"years":21,"specializations":["Mentor"]},"form":{"q":"","city":"Delhi","locality":null},"cerebro":true}'
>>> json.loads(r)
{u'product': u'XYZ', u'form': {u'q': u'', u'city': u'Delhi', u'locality': None}, u'sweep_enabled': True, u'list': {u'name': u'Robert Shriwas', u'gender': u'F', u'specializations': [u'Mentor'], u'id': 205782, u'years': 21, u'practicing_since': None}, u'cerebro': True, u'page': u'XYZ Profile'}
>>> s = json.loads(r)
>>> s['list']
{u'name': u'Robert Shriwas', u'gender': u'F', u'specializations': [u'Mentor'], u'id': 205782, u'years': 21, u'practicing_since': None}
>>>
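Note that if the raw payload really stores the list value as an escaped JSON string, exactly as shown in the question, no regex is strictly needed: the outer document can be parsed first and the nested string parsed with a second json.loads. A sketch, assuming the raw text is read from a hypothetical payload.txt file:
import json

raw = open('payload.txt').read()     # hypothetical file holding the raw string from the question
outer = json.loads(raw)              # parse the outer JSON object
inner = json.loads(outer["list"])    # the "list" value is itself a JSON-encoded string
print(inner["name"])                 # Robert Shriwas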