NLTK tokenize but don't split named entities - nltk

I am working on a simple grammar based parser. For this I need to first tokenize the input. In my texts lots of cities appear (e.g., New York, San Francisco, etc.). When I just use the standard nltk word_tokenizer, all these cities are split.
from nltk import word_tokenize
word_tokenize('What are we going to do in San Francisco?')
Current output:
['What', 'are', 'we', 'going', 'to', 'do', 'in', 'San', 'Francisco', '?']
Desired output:
['What', 'are', 'we', 'going', 'to', 'do', 'in', 'San Francisco', '?']
How can I tokenize such sentences without splitting named entities?

Identify the named entities, then walk the result and join the chunked tokens together:
>>> from nltk import ne_chunk, pos_tag, word_tokenize
>>> toks = word_tokenize('What are we going to do in San Francisco?')
>>> chunks = ne_chunk(pos_tag(toks))
>>> [ w[0] if isinstance(w, tuple) else " ".join(t[0] for t in w) for w in chunks ]
['What', 'are', 'we', 'going', 'to', 'do', 'in', 'San Francisco', '?']
Each element of chunks is either a (word, pos) tuple or a Tree() containing the parts of the chunk.

Related

Iterating Thru Two JSON Payloads with multiple loops in Python - Creating a dictionary from results

Hello Fellow Pythoineers, I am a little confused by this output
Goal - I have two loops iterating thru two JSON payloads.
1st Loop groups all Department keys into a list of unique values.
2nd Loop iterates thru the key (which should match the keys collected from the first loop, and then assigns that key a display name and key to a dictionary
Example
1st Loop Returns key SO
2nd Loop Finds SO and determines the displayName is SOCIOLOGY
the dict should return SO: SOCIOLOGY
import json
all_courses = open('methodist_all_courses.json')
all_courses_json = json.load(all_courses)
all_departments = open('departments_payload.json')
all_departments_json = json.load(all_departments)
departments =[];
department_long = dict()
for cd_crs_id, course_details in all_courses_json.items():
for department_code in course_details['departments']:
if department_code not in departments:
departments.append(department_code)
for depot_code, departmnt_details in all_courses_json.items():
for distilled_department_code in departments:
if depot_code == distilled_department_code:
department_long[distilled_department_code] = departmnt_details['displayName']
print(department_long)
Replit: https://replit.com/join/tafkwzqaya-terry-brooksjr
Payload for loop 1: https://pastebin.com/9Bu0XL3Y
Payload for loop 2: https://pastebin.com/s0kMSmaF
Output
Traceback (most recent call last):
File "main.py", line 7, in <module>
all_departments_json = json.load(all_departments)
File "/nix/store/p21fdyxqb3yqflpim7g8s1mymgpnqiv7-python3-3.8.12/lib/python3.8/json/__init__.py", line 293, in load
return loads(fp.read(),
File "/nix/store/p21fdyxqb3yqflpim7g8s1mymgpnqiv7-python3-3.8.12/lib/python3.8/json/__init__.py", line 357, in loads
return _default_decoder.decode(s)
File "/nix/store/p21fdyxqb3yqflpim7g8s1mymgpnqiv7-python3-3.8.12/lib/python3.8/json/decoder.py", line 340, in decode
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 879 column 2 (char 25384)
departments_payload.json is not a valid json file because it ends with an extra character.
}\
Delete that ending \ (or if you pull that from a site, just use slicing the get not include that final character.
Secondly, your code won't give you the desired output as you are iterating both loops through all_courses_json.
This will work though once you fix that json.
Code:
import json
all_courses = open('methodist_all_courses.json')
all_courses_json = json.load(all_courses)
all_departments = open('departments_payload.json')
all_departments_json = json.load(all_departments)
departments =[];
department_long = dict()
for cd_crs_id, course_details in all_courses_json.items():
for department_code in course_details['departments']:
if department_code not in departments:
departments.append(department_code)
for depot_code, departmnt_details in all_departments_json.items():
for distilled_department_code in departments:
if depot_code == distilled_department_code:
department_long[distilled_department_code] = departmnt_details['displayName']
print(department_long)
Output:
print(department_long)
{'AC': 'Accounting', 'AR': 'Art', 'AT': 'Athletic Training', 'BI': 'Biology', 'BU': 'Business Administration', 'CH': 'Chemistry and Physical Science', 'CM': 'Communication', 'CS': 'Computer Science', 'EC': 'Economics', 'ED': 'Education', 'EG': 'Engineering', 'LL': 'English,Literature, Languages, and Culture', 'EV': 'Environmental Management', 'HC': 'Health Care Administration', 'HI': 'History', 'HO': 'Honors', 'ID': 'Inderdisciplinary Studies', 'JU': 'Justice Studies', 'KI': 'Kinesiology', 'MK': 'Marketing', 'MB': 'Masters of Business Admninistration', 'ME': 'Masters of Education', 'MA': 'Mathematics', 'MI': 'Military Science', 'NU': 'Nursing', 'OT': 'Occupational Therapy', 'PF': 'Performing Arts', 'PH': 'Philosophy and Religion', 'DP': 'Physical Therapy', 'PA': 'Physician Assistant', 'GS': 'Political Science', 'PG': 'Professional Golf Management', 'PT': 'Professional Tennis Management', 'PS': 'Psychology', 'RM': 'Resort Management', 'SW': 'Social Work', 'SM': 'Sport Management', 'TE': 'Teacher Education'}

converting json into pandas dataframe

I have JSON output that I would like to convert to pandas dataframe. I downloaded from a website via HTTPS and utilizing an API key. thanks much. here is what I coded:
json_data = vehicle_miles_traveled.json()
print(json_data)
{'request': {'command': 'series', 'series_id': 'STEO.MVVMPUS.A'}, 'series': [{'series_id': 'STEO.MVVMPUS.A', 'name': 'Vehicle Miles Traveled, Annual', 'units': 'million miles/day', 'f': 'A', 'description': 'Includes gasoline and diesel fuel vehicles', 'copyright': 'None', 'source': 'U.S. Energy Information Administration (EIA) - Short Term Energy Outlook', 'geography': 'USA', 'start': '1990', 'end': '2023', 'lastHistoricalPeriod': '2021', 'updated': '2022-03-08T12:39:35-0500', 'data': [['2023', 9247.0281671], ['2022', 9092.4575671], ['2021', 8846.1232877], ['2020', 7933.3907104], ['2019', 8936.3589041], ['2018', 8877.6027397], ['2017', 8800.9479452], ['2016', 8673.2431694], ['2015', 8480.4712329], ['2014', 8289.4684932], ['2013', 8187.0712329], ['2012', 8110.8387978], ['2011', 8083.2931507], ['2010', 8129.4958904], ['2009', 8100.7205479], ['2008', 8124.3387978], ['2007', 8300.8794521], ['2006', 8257.8520548], ['2005', 8190.2136986], ['2004', 8100.5163934], ['2003', 7918.4136986], ['2002', 7823.3123288], ['2001', 7659.2054795], ['2000', 7505.2622951], ['1999', 7340.9808219], ['1998', 7192.7780822], ['1997', 7014.7205479], ['1996', 6781.9699454], ['1995', 6637.7369863], ['1994', 6459.1452055], ['1993', 6292.3424658], ['1992', 6139.7595628], ['1991', 5951.2712329], ['1990', 5883.5643836]]}]}
It hugely depends on your final goal. You could add all meta-data in a dataframe if you want to. I assume that you are interested in reading the data field into a dataframe.
We can just get those fields by accessing:
data = json_data['series'][0]['data']
# and pass them to the dataframe constructor. We can specify the column names as well!
df = pd.DataFrame(data, columns=['year', 'other_col_name'])

How to parse nested JSON file in Pandas

I'm trying to transform a JSON file generated by the Day One Journal to a text file using Python but hit a brick wall.
This is broadly the format:
{'metadata': {'version': '1.0'},
'entries': [{'richText': '{"meta":{"version":1,"small-lines-removed":true,"created":{"platform":"com.bloombuilt.dayone-mac","version":1344}},"contents":[{"attributes":{"line":{"header":1,"identifier":"F78B28DA-488E-489E-9C95-1A0648099792"}},"text":"2022\\n"},{"attributes":{"line":{"header":0,"identifier":"FA8C6594-F43D-4652-B442-DAF72A379799"}},"text":"\\n"},{"attributes":{"line":{"header":0,"identifier":"0923BCC8-B24A-4C0D-963C-73D09561EECD"}},"text":"It’s the beginning of a new year"},{"embeddedObjects":[{"type":"horizontalRuleLine"}]},{"text":"\\n\\n\\n\\n"},{"embeddedObjects":[{"type":"horizontalRuleLine"}]}]}',
'duration': 0,
'creationOSVersion': '12.1',
'weather': {'sunsetDate': '2022-01-12T16:15:28Z',
'temperatureCelsius': 7,
'weatherServiceName': 'HAMweather',
'windBearing': 230,
'sunriseDate': '2022-01-12T08:00:44Z',
'conditionsDescription': 'Mostly Clear',
'pressureMB': 1042,
'visibilityKM': 48.28020095825195,
'relativeHumidity': 81,
'windSpeedKPH': 6,
'weatherCode': 'clear-night',
'windChillCelsius': 6.699999809265137},
'editingTime': 2925.313938140869,
'timeZone': 'Europe/London',
'creationDeviceType': 'Hal 9000',
'uuid': '988D9D9876624FAEB88F9BCC666FD9CD',
'creationDeviceModel': 'MacBookPro15,2',
'starred': False,
'location': {'region': {'center': {'longitude': -0.0095,
'latitude': 51},
'radius': 75},
'localityName': 'London',
'country': 'United Kingdom',
'timeZoneName': 'Europe/London',
'administrativeArea': 'England',
'longitude': -0.0095,
'placeName': 'Somewhere',
'latitude': 51},
'isPinned': False,
'creationDevice': 'somedevice'...,
}
I only want the 'text' (of which there might be a number of 'text' entries and 'creationDate' so I've got a daily record.
My code to pull out the data is straightforward:
import json
# Opening JSON file
f = open('files/2022.json')
# returns JSON object as
# a dictionary
data = json.load(f)
# Closing file
f.close()
I've tried using list comprensions and then concatenating the Series in Pandas, but two don't match in length - because multiple entries on one day mix up the dataframe.
I wanted to use this code, but:
result = []
for i in data['entries']:
entry = i['creationDate'] + i['text']
result.append(entry)
but I get this error:
KeyError: 'text'
What do I need to do?
Update:
{'richText': '{"meta":{"version":1,"small-lines-removed":true,"created":{"platform":"com.bloombuilt.dayone-mac","version":1344}},"contents":[{"text":"Later than I planned\\n"}]}',
'duration': 0,
'creationOSVersion': '12.1',
'weather': {'sunsetDate': '2022-01-12T16:15:28Z',
'temperatureCelsius': 7,
'weatherServiceName': 'HAMweather',
'windBearing': 230,
'sunriseDate': '2022-01-12T08:00:44Z',
'conditionsDescription': 'Mostly Clear',
'pressureMB': 1042,
'visibilityKM': 48.28020095825195,
'relativeHumidity': 81,
'windSpeedKPH': 6,
'weatherCode': 'clear-night',
'windChillCelsius': 6.699999809265137},
'editingTime': 672.3099998235703,
'timeZone': 'Europe/London',
'creationDeviceType': 'Computer',
'uuid': 'F53DCC5E05BB4106A49C76954117DBF4',
'creationDeviceModel': 'xompurwe',
'isPinned': False,
'creationDevice': 'Computer',
'text': 'Later than I planned \\\n',
'modifiedDate': '2022-01-05T01:01:29Z',
'isAllDay': False,
'creationDate': '2022-01-05T00:39:19Z',
'creationOSName': 'macOS'},
Sort of managed to work a solution - thank you to everyone who helped this morning, particularly #Tomer S.
My solution was:
result = []
for i in data['entries']:
print (i['creationDate'] + i['text'])
result.append(entry)
It still won't get what I want

can't convert text data to json

I am trying to convert the following (json) string into a python data type:
data = "{'id': 26, 'photo': '/media/f082b5af-ad0.png', 'first_name': 'Islam', 'last_name': 'Mansour', 'email': 'islammansour06+8#gmail.com', 'city': 'Giza', 'cv': '/media/fbb61609-442.pdf', 'reference': 'Facebook', 'campaign': OrderedDict([('id', 2), ('name', 'javascript')]), 'status': 'Invitation Sent', 'user': None, 'at': '2020-01-20', 'time': '23:02:58.359179', 'technologies': [OrderedDict([('id', 46), ('name', 'Django'), ('category', OrderedDict([('id', 24), ('name', 'Framework'), ('_type', 'skill')]))])]}"
I am trying to convert it to JSON by using
json.loads(data.replace("\'", "\""))
but I am having the following error
json.decoder.JSONDecoderError: Expecting value: line 1 column 219 (char 218)
The issue is that your data is not valid json.
The main problem starts here: [OrderedDict([('id', 46), ('name', 'Django'), ('category', OrderedDict([('id', 24), ('name', 'Framework'), ('_type', 'skill')]))])]}. This looks like it is a string representaion of some python objects.
Below is a more friendly representation of your json data.
I have marked the problematic parts (with **) (basically everywhere there is a OrderedDict).
{
"id":26,
"photo":"/media/f082b5af-ad0.png",
"first_name":"Islam",
"last_name":"Mansour",
"email":"islammansour06+8#gmail.com",
"city":"Giza",
"cv":"/media/fbb61609-442.pdf",
"reference":"Facebook",
"campaign":**OrderedDict**([("id",
2), ("name", "javascript")]), "status":"Invitation Sent",
"user":None,
"at":"2020-01-20",
"time":"23:02:58.359179",
"technologies":[
**OrderedDict**([("id",
46),
("name",
"Django")
]("category", OrderedDict([("id", 24), ("name", "Framework"), ("_type", "skill")]))])]
}```
You could try making use of an [online json parser][1] which might give you some friendlier output.
[1]: http://json.parser.online.fr/
As previously said, OrderedDict is not correct JSON. But this is correct python.
To fix it:
from collections import OrderedDict # direct import because this is as this in your string
import json
jsonCorrect = json.dumps(eval(data))
json.loads(jsonCorrect) # it works
Not sure why you are adding the replace call. Should work with just the following:
json.loads(data)
You can read about it here.

How to define special "untokenizable" words for nltk.word_tokenize

I'm using nltk.word_tokenize for tokenizing some sentences which contain programming languages, frameworks, etc., which get incorrectly tokenized.
For example:
>>> tokenize.word_tokenize("I work with C#.")
['I', 'work', 'with', 'C', '#', '.']
Is there a way to enter a list of "exceptions" like this to the tokenizer? I already have compiled a list of all the things (languages, etc.) that I don't want to split.
The Multi Word Expression Tokenizer should be what you need.
You add the list of exceptions as tuples and pass it the already tokenized sentences:
tokenizer = nltk.tokenize.MWETokenizer()
tokenizer.add_mwe(('C', '#'))
tokenizer.add_mwe(('F', '#'))
tokenizer.tokenize(['I', 'work', 'with', 'C', '#', '.'])
['I', 'work', 'with', 'C_#', '.']
tokenizer.tokenize(['I', 'work', 'with', 'F', '#', '.'])
['I', 'work', 'with', 'F_#', '.']