How to define special "untokenizable" words for nltk.word_tokenize - nltk

I'm using nltk.word_tokenize for tokenizing some sentences which contain programming languages, frameworks, etc., which get incorrectly tokenized.
For example:
>>> tokenize.word_tokenize("I work with C#.")
['I', 'work', 'with', 'C', '#', '.']
Is there a way to enter a list of "exceptions" like this to the tokenizer? I already have compiled a list of all the things (languages, etc.) that I don't want to split.

The Multi Word Expression Tokenizer should be what you need.
You add the list of exceptions as tuples and pass it the already tokenized sentences:
>>> import nltk
>>> tokenizer = nltk.tokenize.MWETokenizer()
>>> tokenizer.add_mwe(('C', '#'))
>>> tokenizer.add_mwe(('F', '#'))
>>> tokenizer.tokenize(['I', 'work', 'with', 'C', '#', '.'])
['I', 'work', 'with', 'C_#', '.']
>>> tokenizer.tokenize(['I', 'work', 'with', 'F', '#', '.'])
['I', 'work', 'with', 'F_#', '.']
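The joined tokens above contain an underscore because MWETokenizer joins the parts with its default separator '_'. A minimal sketch, assuming you would rather get C# than C_#: pass separator='' when constructing the tokenizer. The input list is already tokenized here, so the example does not depend on the punkt data that word_tokenize needs.

```python
from nltk.tokenize import MWETokenizer

# Expressions that should be merged back into a single token.
# An empty separator joins the parts without the default '_'.
tokenizer = MWETokenizer([('C', '#'), ('F', '#')], separator='')

# Assumed input: the output of word_tokenize("I work with C#.")
tokens = tokenizer.tokenize(['I', 'work', 'with', 'C', '#', '.'])
print(tokens)
```

In practice you would feed it the output of word_tokenize directly, for example tokenizer.tokenize(word_tokenize(sentence)).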

Related

dash Dropdown provides either string or list depending on number of selections

I'm currently working on a stock analysis tool utilizing dash. I have a dropdown populated with the NASDAQ 100 symbols and am attempting to get it to return a line graph with a line for each symbol selected.
With the dropdown, if I have one symbol selected, the returned value is a string; if I select multiple, it is a list.
I'm trying to use a callback such as:
@app.callback(
    Output(component_id='stock-graph-line', component_property='figure'),
    Input(component_id='stock-input', component_property='value'),
    suppress_callback_exceptions = True
)
def update_stock_symbol(input_value):
    for i in input_value:
        fig.append_trace({'x':df.index,'y':df[([i], 'close')], 'type':'scatter','name':'Price [Close]'},1,1)
    fig['layout'].update(height=1000, title=input_value, template="plotly_dark")
    return fig
However, the for loop does not work with only one symbol selected as it's getting a string, not a list. Is there an option in dash to specify the return type of the callbacks? (Can I force it to pass on the one symbol as a list item?) Or does this have to be handled with if statements testing the type?
I set up a dropdown and it is always returning either None or a list:
dcc.Dropdown(
    id='dropdown-id',
    options=[
        {'label': 'a', 'value': 'a'},
        {'label': 'b', 'value': 'b'},
        {'label': 'c', 'value': 'c'},
        {'label': 'd', 'value': 'd'},
        {'label': 'e', 'value': 'e'},
    ],
    multi=True
),
Perhaps yours is set up differently. If that won't work, then just check like this:
def update_stock_symbol(input_value):
    if isinstance(input_value, str):
        input_value = [input_value]
    for i in input_value:
        ...
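The type check can also be pulled out into a small helper so the callback body always sees a list. This is only a sketch: as_list is a hypothetical name, not part of dash, and it also covers the None case mentioned above.

```python
def as_list(value):
    """Normalize a dropdown value so it can always be iterated as a list."""
    if value is None:
        # Nothing selected
        return []
    if isinstance(value, str):
        # Single selection delivered as a plain string
        return [value]
    # Multiple selections: copy into a new list
    return list(value)


print(as_list('AAPL'))
print(as_list(['AAPL', 'MSFT']))
print(as_list(None))
```

Inside the callback you would then write for i in as_list(input_value): and leave the rest unchanged.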

can't convert text data to json

I am trying to convert the following (json) string into a python data type:
data = "{'id': 26, 'photo': '/media/f082b5af-ad0.png', 'first_name': 'Islam', 'last_name': 'Mansour', 'email': 'islammansour06+8#gmail.com', 'city': 'Giza', 'cv': '/media/fbb61609-442.pdf', 'reference': 'Facebook', 'campaign': OrderedDict([('id', 2), ('name', 'javascript')]), 'status': 'Invitation Sent', 'user': None, 'at': '2020-01-20', 'time': '23:02:58.359179', 'technologies': [OrderedDict([('id', 46), ('name', 'Django'), ('category', OrderedDict([('id', 24), ('name', 'Framework'), ('_type', 'skill')]))])]}"
I am trying to convert it to JSON by using
json.loads(data.replace("\'", "\""))
but I am having the following error
json.decoder.JSONDecodeError: Expecting value: line 1 column 219 (char 218)
The issue is that your data is not valid JSON.
The main problem starts here: [OrderedDict([('id', 46), ('name', 'Django'), ('category', OrderedDict([('id', 24), ('name', 'Framework'), ('_type', 'skill')]))])]}. This looks like a string representation of some Python objects.
Below is a more readable representation of your data.
I have marked the problematic parts with ** (basically everywhere there is an OrderedDict). The bare None is also not valid JSON, which uses null instead.
{
"id":26,
"photo":"/media/f082b5af-ad0.png",
"first_name":"Islam",
"last_name":"Mansour",
"email":"islammansour06+8#gmail.com",
"city":"Giza",
"cv":"/media/fbb61609-442.pdf",
"reference":"Facebook",
"campaign":**OrderedDict**([("id", 2), ("name", "javascript")]),
"status":"Invitation Sent",
"user":None,
"at":"2020-01-20",
"time":"23:02:58.359179",
"technologies":[
**OrderedDict**([("id", 46), ("name", "Django"), ("category", OrderedDict([("id", 24), ("name", "Framework"), ("_type", "skill")]))])]
}
You could try making use of an [online json parser][1] which might give you some friendlier output.
[1]: http://json.parser.online.fr/
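As previously said, OrderedDict is not valid JSON, but it is valid Python.
To fix it, evaluate the string as Python and dump the result as JSON (only do this with data you trust, since eval executes arbitrary code):
from collections import OrderedDict  # imported directly because the string refers to OrderedDict by its bare name
import json
jsonCorrect = json.dumps(eval(data))
json.loads(jsonCorrect)  # it works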
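If calling eval with everything in scope is a concern, a slightly more restrictive sketch is to hand it only the one name the string needs. The data string below is a shortened copy of the one in the question, and allowed_names is a name made up for this example. This narrows what the expression can reach, but it is not a real sandbox, so it is still meant for trusted input only.

```python
import json
from collections import OrderedDict

# Shortened copy of the string from the question (assumed to have the same shape).
data = (
    "{'id': 26, 'user': None, "
    "'campaign': OrderedDict([('id', 2), ('name', 'javascript')]), "
    "'technologies': [OrderedDict([('id', 46), ('name', 'Django')])]}"
)

# Expose only OrderedDict to the expression and hide the builtins.
allowed_names = {"OrderedDict": OrderedDict}
parsed = eval(data, {"__builtins__": {}}, allowed_names)

# OrderedDict serializes like a plain dict, and None becomes null.
json_text = json.dumps(parsed)
round_trip = json.loads(json_text)
print(round_trip["campaign"]["name"])
```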
Not sure why you are adding the replace call. If data were a valid JSON string, the following alone would be enough:
json.loads(data)

In my fetched json data, how can I separate out the balance?

So, I have been testing block.io api, and so far I have this:
knee = block_io.get_address_balance(labels='shibe1')
s1 = json.dumps(knee)
d2 = json.loads(s1)
print (d2)
It returns this batch of text:
{'status': 'success', 'data': {'network': 'DOGE', 'available_balance': '0.0', 'pending_received_balance': '0.0', 'balances': [{'user_id': 1, 'label': 'shibe1', 'address': 'A9Bda9UMBcb1183PtsBxnbj5QgP6jwkCFG', 'available_balance': '0.00000000', 'pending_received_balance': '0.00000000'}]}}
How would I get it so that I could grab only the available_balance part of it, and print it out instead of all of the json data?
EDIT: Please help! Cant find a solution.
Try using some regex.
import re
# The string is split across lines with implicit concatenation so that it stays valid Python.
data = ("{'status': 'success', 'data': {'network': 'DOGE', 'available_balance': '0.129', "
        "'pending_received_balance': '0.0', 'balances': [{'user_id': 1, 'label': 'shibe1', "
        "'address': 'A9Bda9UMBcb1183PtsBxnbj5QgP6jwkCFG', 'available_balance': '0.00000000', "
        "'pending_received_balance': '0.00000000'}]}}")
pattern = re.compile("(?<=available_balance': ').*?(?=')")
matches = pattern.finditer(data)
for match in matches:
    print(match.group())
Breakdown:
import re imports the regex library built into Python.
data is a string containing the data to match. You can replace it with the string form of your own JSON data.
pattern = re.compile("(?<=available_balance': ').*?(?=')") compiles the regex for finding the available balance values.
Regex breakdown:
(?<=available_balance': ') is a lookbehind: the match must be immediately preceded by available_balance': ', but that text is not part of the match.
.*? lazily matches as few characters as possible, so the match stops at the first closing quote.
(?=') is a lookahead: the match must be followed by a single quote, which is also not part of the match.
pattern.finditer(data) returns an iterator over every match of the regex in data.
The for loop prints each match with match.group().
If you run this code, you will get the following results:
0.129
0.00000000
If you want the code to use your variables, here you go (d2 is a dict, so it is converted to a string first):
import re
pattern = re.compile("(?<=available_balance': ').*?(?=')")
matches = pattern.finditer(str(d2))
for match in matches:
    print(match.group())
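Since d2 is already a Python dictionary after json.loads, a regex may not be needed at all: the values can be read by key. A minimal sketch, using the response printed in the question as sample data (per_address is a name introduced for this example):

```python
# Sample response copied from the question.
d2 = {
    'status': 'success',
    'data': {
        'network': 'DOGE',
        'available_balance': '0.0',
        'pending_received_balance': '0.0',
        'balances': [
            {
                'user_id': 1,
                'label': 'shibe1',
                'address': 'A9Bda9UMBcb1183PtsBxnbj5QgP6jwkCFG',
                'available_balance': '0.00000000',
                'pending_received_balance': '0.00000000',
            }
        ],
    },
}

# Overall balance reported at the top level of 'data'.
total = d2['data']['available_balance']

# Balance per address label, taken from the 'balances' list.
per_address = {
    entry['label']: entry['available_balance']
    for entry in d2['data']['balances']
}

print(total)
print(per_address)
```

Note that the balances are strings; wrap them in decimal.Decimal or float if you need to do arithmetic with them.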

NLTK tokenize but don't split named entities

I am working on a simple grammar-based parser. For this I first need to tokenize the input. Lots of cities appear in my texts (e.g., New York, San Francisco, etc.). When I just use the standard nltk word_tokenize, all these cities are split.
from nltk import word_tokenize
word_tokenize('What are we going to do in San Francisco?')
Current output:
['What', 'are', 'we', 'going', 'to', 'do', 'in', 'San', 'Francisco', '?']
Desired output:
['What', 'are', 'we', 'going', 'to', 'do', 'in', 'San Francisco', '?']
How can I tokenize such sentences without splitting named entities?
Identify the named entities, then walk the result and join the chunked tokens together:
>>> from nltk import ne_chunk, pos_tag, word_tokenize
>>> toks = word_tokenize('What are we going to do in San Francisco?')
>>> chunks = ne_chunk(pos_tag(toks))
>>> [ w[0] if isinstance(w, tuple) else " ".join(t[0] for t in w) for w in chunks ]
['What', 'are', 'we', 'going', 'to', 'do', 'in', 'San Francisco', '?']
Each element of chunks is either a (word, pos) tuple or a Tree() containing the parts of the chunk.

Problems using json_extract in Sqlite for key with colon (:) in it

I have an example data set like below
id|accountid|attributes|created|type
1|10|{'base:instances': '{}', 'cont:contact': 'CLOSED', 'cont:contactchanged': '1468516440931', 'devconn:lastchange': '1462387904432', 'devconn:signal': '100', 'devconn:state': 'ONLINE', 'devpow:backupbatterycapable': 'false', 'devpow:battery': '66', 'devpow:linecapable': 'false', 'devpow:source': 'BATTERY', 'devpow:sourcechanged': '1462387904403', 'temp:temperature': '25.75'}|2016-05-04 18:51:44+0000|Test
2|20|{'base:instances': '{}', 'cont:contact': 'CLOSED', 'cont:contactchanged': '1468516440931', 'devconn:lastchange': '1462387904432', 'devconn:signal': '100', 'devconn:state': 'ONLINE', 'devpow:backupbatterycapable': 'false', 'devpow:battery': '66', 'devpow:linecapable': 'false', 'devpow:source': 'BATTERY', 'devpow:sourcechanged': '1462387904403', 'temp:temperature': '25.75'}|2016-05-04 18:51:44+0000|Prod
3|30|{'base:instances': '{}', 'cont:contact': 'CLOSED', 'cont:contactchanged': '1468516440931', 'devconn:lastchange': '1462387904432', 'devconn:signal': '100', 'devconn:state': 'ONLINE', 'devpow:backupbatterycapable': 'false', 'devpow:battery': '66', 'devpow:linecapable': 'false', 'devpow:source': 'BATTERY', 'devpow:sourcechanged': '1462387904403', 'temp:temperature': '25.75'}|2016-05-04 18:51:44+0000|Prod
4|40|{'base:instances': '{}', 'cont:contact': 'CLOSED', 'cont:contactchanged': '1468516440931', 'devconn:lastchange': '1462387904432', 'devconn:signal': '100', 'devconn:state': 'ONLINE', 'devpow:backupbatterycapable': 'false', 'devpow:battery': '66', 'devpow:linecapable': 'false', 'devpow:source': 'BATTERY', 'devpow:sourcechanged': '1462387904403', 'temp:temperature': '25.75'}|2016-05-04 18:51:44+0000|Test
I import this to sqlite3 3.13 to do some analysis (.mode csv, .headers on, .separator '|', .import file.csv dev)
As you can see, the attributes field is JSON-formatted data whose keys all have : in their names, which I think is part of my issue.
I would like to select all rows with column type matching Test and print out the devpow:battery value from the JSON in the attributes column.
I have tried all the below and I can't get this to work
select json_extract(dev.attributes, '$.devpower:battery') from dev where type=="Test";
select attributes.[devpower:battery] from dev where type=="Test";
select 'attributes.devpower:battery' from dev where type=="Test";
And quite a few permutations of the above. Any help is greatly appreciated.
devpow:battery is a perfectly valid object label, and it works if you are actually using valid JSON (the values in your example are not), and if you spell the label correctly (which you did not):
> SELECT attributes FROM dev;
{"devpow:battery": "66"}
> SELECT json_extract(dev.attributes, '$.devpow:battery') FROM dev WHERE ...;
66
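The same query can be tried end to end from Python's built-in sqlite3 module. This is a sketch under a few assumptions: the underlying SQLite library has the JSON functions available, the table is reduced to three columns, and the attributes are stored as valid JSON (written with json.dumps rather than the single-quoted form in the sample data).

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dev (id INTEGER, attributes TEXT, type TEXT)")

# json.dumps produces double-quoted, valid JSON for the attributes column.
rows = [
    (1, json.dumps({"devpow:battery": "66", "devconn:state": "ONLINE"}), "Test"),
    (2, json.dumps({"devpow:battery": "66", "devconn:state": "ONLINE"}), "Prod"),
]
conn.executemany("INSERT INTO dev VALUES (?, ?, ?)", rows)

# The colon in the key needs no special treatment in the JSON path.
result = conn.execute(
    "SELECT id, json_extract(attributes, '$.devpow:battery') "
    "FROM dev WHERE type = ?",
    ("Test",),
).fetchall()
print(result)
conn.close()
```

If the stored text uses single quotes, as in the sample rows, it has to be converted to valid JSON before json_extract can read it.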