parse json data to pandas column with missing quotes - json

I have a pandas dataframe "df" with a field "json_field" that contains JSON data (I think it's actually YAML). I'm trying to parse the values into their own columns. Example data is below. When I run the code below, I get an error when it hits the 'name' field. It seems like it may be having trouble because the value associated with the name field isn't quoted, or has spaces. Does anyone see what the issue might be, or can anyone suggest how to fix it?
example:
print(df['json_field'][0])
output:
- start:"test"
name: Price Unit
field: priceunit
type: number
operator: between
value:
- '40'
- '60'
code:
import yaml
pd.io.json.json_normalize(yaml.load(df['json_field'][0]), 'name','field','type','operator','value').head()
error:
An error was encountered:
mapping values are not allowed here
in "<unicode string>", line 3, column 7:
name: Price Unit
^
Traceback (most recent call last):
File "/usr/local/lib64/python3.6/site-packages/yaml/__init__.py", line 94, in safe_load
return load(stream, SafeLoader)
File "/usr/local/lib64/python3.6/site-packages/yaml/__init__.py", line 72, in load
return loader.get_single_data()
File "/usr/local/lib64/python3.6/site-packages/yaml/constructor.py", line 35, in get_single_data
node = self.get_single_node()
File "/usr/local/lib64/python3.6/site-packages/yaml/composer.py", line 36, in get_single_node
document = self.compose_document()
File "/usr/local/lib64/python3.6/site-packages/yaml/composer.py", line 55, in compose_document
node = self.compose_node(None, None)
File "/usr/local/lib64/python3.6/site-packages/yaml/composer.py", line 82, in compose_node
node = self.compose_sequence_node(anchor)
File "/usr/local/lib64/python3.6/site-packages/yaml/composer.py", line 110, in compose_sequence_node
while not self.check_event(SequenceEndEvent):
File "/usr/local/lib64/python3.6/site-packages/yaml/parser.py", line 98, in check_event
self.current_event = self.state()
File "/usr/local/lib64/python3.6/site-packages/yaml/parser.py", line 382, in parse_block_sequence_entry
if self.check_token(BlockEntryToken):
File "/usr/local/lib64/python3.6/site-packages/yaml/scanner.py", line 116, in check_token
self.fetch_more_tokens()
File "/usr/local/lib64/python3.6/site-packages/yaml/scanner.py", line 220, in fetch_more_tokens
return self.fetch_value()
File "/usr/local/lib64/python3.6/site-packages/yaml/scanner.py", line 580, in fetch_value
self.get_mark())
yaml.scanner.ScannerError: mapping values are not allowed here
in "<unicode string>", line 3, column 7:
name: Price Unit
desired output:
name field type operator value
Price Unit priceunit number between - '40' - '60'
Update:
I tried the suggestion below
import yaml
df['json_field']=df['json_field'].str.replace('"test"', "")
pd.io.json.json_normalize(yaml.safe_load(df['json_field'][0].lstrip()), 'name','field','type','operator','value').head()
and got the output below:
operator0 typefield
0 P priceunit
1 r priceunit
2 i priceunit
3 c priceunit
4 e priceunit

Seems to me that it's having an issue because your yaml is improperly formatted. That "test" should not be there at all if you want a dictionary named "start" with 5 mappings inside of it.
import yaml
a = """
- start:
    name: Price Unit
    field: priceunit
    type: number
    operator: between
    value:
      - '40'
      - '60'
""".lstrip()
# Single dictionary named start, with 5 entries
yaml.safe_load(a)[0]
{'start':
{'name': 'Price Unit',
'field': 'priceunit',
'type': 'number',
'operator': 'between',
'value': ['40', '60']}
}
To do this with your data, try:
data = yaml.safe_load(df['json_field'][0].replace('"test"', ""))
df = pd.json_normalize(data[0]["start"])
print(df)
name field type operator value
0 Price Unit priceunit number between [40, 60]

It seems that your real problem is one line earlier.
Note that your YAML fragment starts with start:"test".
One of the key principles of YAML is that a key is separated
from its value by a colon and a space, whereas your
sample does not contain this (mandatory) space.
Maybe you should manually edit your input, i.e. add the
missing space.
Another solution is to write a "specialized pre-parser"
that adds such missing spaces and feeds its result into
the YAML parser. This can be an option if the number of such
cases is large.
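A rough sketch of what such a pre-parser might look like, using only a regular expression. The helper name add_missing_spaces and the pattern are my own assumptions, not part of any library; the pattern only handles simple keys at the start of a line and would also split values such as http://... if they appear at the start of a line, so treat it as a starting point rather than a general fix.

```python
import re

# Hypothetical pattern: a simple key at the start of a line (optionally after "- ")
# followed by a colon that is immediately followed by a non-space character.
_MISSING_SPACE = re.compile(r'^([ \t]*(?:-[ \t]+)?[A-Za-z_][\w-]*):(?=\S)', re.MULTILINE)


def add_missing_spaces(text):
    """Insert the space YAML requires after the colon of a mapping key."""
    return _MISSING_SPACE.sub(r'\1: ', text)


sample = '- start:"test"\n  name: Price Unit\n'
fixed = add_missing_spaces(sample)
print(fixed)
```

The result could then be passed to yaml.safe_load. Whether it parses into the structure you want still depends on the indentation of your real data.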

Related

Parsing a Pandas column in JSON format

I am parsing a Pandas column of type string that is in JSON format such as the following
kafka_data["MESSAGE_DATA__C"].iloc[0]
Out[20]: '{"userId":"af33f42e","trackingCategory":"ACTION","trackedItem":{"id":"PERSONAL_IDENTIFICATION_STARTED","category":"PERSONAL_IDENTIFICATION","title":"Personal Identification Started"}}'
When I parse a single row everything works
json.loads(kafka_data["MESSAGE_DATA__C"].iloc[0])
Out[25]:
{'userId': 'af33f42e',
'trackingCategory': 'ACTION',
'trackedItem': {'id': 'PERSONAL_IDENTIFICATION_STARTED',
'category': 'PERSONAL_IDENTIFICATION',
'title': 'Personal Identification Started'}}
But when I try to parse the whole column at once, I get an error.
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 212 (char 211)
Am I missing anything?
I need to read this column into a new dataframe.
To apply a function to each row of a DataFrame, pass the axis=1 parameter.
Please try:
kafka_data[["MESSAGE_DATA__C"]].apply(lambda row: json.loads(str(row["MESSAGE_DATA__C"])), axis=1)
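Since a single row parses fine but the whole column does not, it may be that some other row holds malformed JSON. Below is a small sketch, using only the standard library, that parses each value separately and records which positions fail. The function name parse_rows and the sample rows are made up for illustration.

```python
import json


def parse_rows(values):
    """Parse each JSON string; return the parsed values and a list of failures."""
    parsed, failures = [], []
    for position, raw in enumerate(values):
        try:
            parsed.append(json.loads(raw))
        except (TypeError, json.JSONDecodeError) as exc:
            # Keep a placeholder so positions still line up with the input
            parsed.append(None)
            failures.append((position, str(exc)))
    return parsed, failures


# Made-up sample: the second row is missing a comma
rows = [
    '{"userId":"af33f42e","trackingCategory":"ACTION"}',
    '{"userId":"x" "broken":1}',
]
parsed, failures = parse_rows(rows)
print(failures)
```

With pandas you could pass the column itself, e.g. parse_rows(kafka_data["MESSAGE_DATA__C"]), and then inspect the rows listed in failures.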

ERROR: invalid input syntax for type json DETAIL: Token "'" is invalid. while importing csv in pgadmin

I have made a new table with three columns
customer_id,media_urls,survey_taste
in a db in pgadmin with attributes as
int,text[],jsonb
respectively.
I have a csv that I was trying to import into this table using pgadmin and
the contents of that file are like this
1,"{'http://example.com','http://example.com'}","{'taste':[1,2,3,4]}"
but I am getting this error
ERROR: invalid input syntax for type json
DETAIL: Token "'" is invalid.
CONTEXT: JSON data, line 1: '...
COPY survey_taste, line 2, column survey_taste: "{'taste': [-0.19101654669350904, 0.08575981750112513, 0.07133783942655376, -0.10579014363010293, 0.0..."
To address your comments in reverse order: to have this entered in one field, you would need to have it as:
'[{"http":"abc","http":"abc"},{"taste":[1,2,3,4]}]'
Per:
select '[{"http":"abc","http":"abc"},{"taste":[1,2,3,4]}]'::json;
json
---------------------------------------------------
[{"http":"abc","http":"abc"},{"taste":[1,2,3,4]}]
As to the quoting issue:
When you pass a dict to csv you will get:
d = {"taste":[1,2,3,4]}
print(d)
{'taste': [1, 2, 3, 4]}
What you need is:
import json
json.dumps(d)
'{"taste": [1, 2, 3, 4]}'
Using json.dumps will turn the dict into a proper JSON string representation.
Putting it all together:
# Create list of dicts
l = [{'http': 'abc', 'http': 'abc'}, {'taste': [1,2,3,4]}]
# Create JSON string representation
json.dumps(l)
'[{"http": "abc"}, {"taste": [1, 2, 3, 4]}]'
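If you want to keep the three separate columns from the question instead, a sketch along these lines might work for generating the CSV. It assumes the file is written from Python; pg_text_array is a hypothetical helper of mine that builds a Postgres array literal (elements in double quotes), and json.dumps produces the jsonb value. The csv module doubles the embedded quotes.

```python
import csv
import io
import json


def pg_text_array(items):
    """Build a Postgres array literal such as {"a","b"} (hypothetical helper)."""
    escaped = ['"' + s.replace('\\', '\\\\').replace('"', '\\"') + '"' for s in items]
    return '{' + ','.join(escaped) + '}'


buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow([
    1,
    pg_text_array(['http://example.com', 'http://example.com']),
    json.dumps({'taste': [1, 2, 3, 4]}),  # double-quoted keys, valid JSON
])
line = buffer.getvalue()
print(line)
```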

python error on string format with "\n" exec(compile(contents+"\n", file, 'exec'), glob, loc)

I am trying to construct JSON from a string that contains "\n", like this:
ver_str= 'Package ID: version_1234\nBuild\nnumber: 154\nBuilt\n'
proj_ver_str = 'Version_123'
comb = '{"r_content": {0}, "s_version": {1}}'.format(ver_str,proj_ver_str)
json_content = json.loads()
d =json.dumps(json_content )
getting this error:
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "C:/Dev/python/new_tester/simple_main.py", line 18, in <module>
comb = '{"r_content": {0}, "s_version": {1}}'.format(ver_str,proj_ver_str)
KeyError: '"r_content"'
The error arises not because of newlines in your values, but because of { and } characters in your format string other than the placeholders {0} and {1}. If you want to have an actual { or a } character in your string, double them.
Try replacing the line
comb = '{"r_content": {0}, "s_version": {1}}'.format(ver_str,proj_ver_str)
with
comb = '{{"r_content": {0}, "s_version": {1}}}'.format(ver_str,proj_ver_str)
However, this will give you a different error on the next line, loads() missing 1 required positional argument: 's'. This is because you presumably forgot to pass comb to json.loads().
Replacing json.loads() with json.loads(comb) gives you another error: json.decoder.JSONDecodeError: Expecting value: line 1 column 15 (char 14). This tells you that you've given json.loads malformed JSON to parse. If you print out the value of comb, you see the following:
{"r_content": Package ID: version_1234
Build
number: 154
Built
, "s_version": Version_123}
This isn't valid JSON, because the string values aren't surrounded by quotes. So a JSON parsing error is to be expected.
At this point, let's take a look at what your code is doing and what you seem to want it to do. It seems you want to construct a JSON string from your data, but your code puts together a JSON string from your data, parses it to a dict and then formats it back as a JSON string.
If you want to create a JSON string from your data, it's far simpler to create a dict with your values and use json.dumps on that:
d = json.dumps({"r_content": ver_str, "s_version": proj_ver_str})
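To tie the steps above together, here is a short sketch using the values from the question. It shows the doubled braces in the format string, with each value passed through json.dumps so it is quoted and escaped, next to the simpler dict approach. The variable name template is mine.

```python
import json

ver_str = 'Package ID: version_1234\nBuild\nnumber: 154\nBuilt\n'
proj_ver_str = 'Version_123'

# Literal braces are doubled; only {0} and {1} remain placeholders.
template = '{{"r_content": {0}, "s_version": {1}}}'
# json.dumps quotes each value and escapes the newlines.
comb = template.format(json.dumps(ver_str), json.dumps(proj_ver_str))
json_content = json.loads(comb)

# The simpler route: build a dict and serialise it once.
d = json.dumps({"r_content": ver_str, "s_version": proj_ver_str})
print(d)
```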

Passing a string variable to ast.literal_eval that has both single and double quotes

I have variables that store strings containing both single and double quotes. For example, one variable, testing7, is the following string:
{'address_components': [{'long_name': 'Fairhope', 'short_name': 'Fairhope' 'types': ['locality', 'political']}...
I need to pass this variable through ast.literal_eval and then json.dumps, and finally json.loads. However, when I try to pass it through ast.literal_eval I get an error:
line 48, in literal_eval
node_or_string = parse(node_or_string, mode='eval')
line 35, in parse
return compile(source, filename, mode, PyCF_ONLY_AST)
File "<unknown>", line 1
{'address_components': [{'long_name': 'Fairhope', 'short_name': 'Fairhope' 'types': ['locality', 'political']}...'
^
IndentationError: unexpected indent
If I copy and paste the string into literal_eval it will work if enclosed with triple quotes:
#This works
ast.literal_eval('''{'address_components': [{'long_name': 'Fairhope', 'short_name': 'Fairhope' 'types': ['locality', 'political']}...''')
#This does not work
ast.literal_eval(testing7)
It seems that the error isn't related to quotes. The cause of the error is that you have a space at the start of the string.
For example:
>>>ast.literal_eval(" {'x':\"t\"}")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/leo/miniconda3/lib/python3.6/ast.py", line 48, in literal_eval
node_or_string = parse(node_or_string, mode='eval')
File "/home/leo/miniconda3/lib/python3.6/ast.py", line 35, in parse
return compile(source, filename, mode, PyCF_ONLY_AST)
File "<unknown>", line 1
{'x':"t"}
^
IndentationError: unexpected indent
You may want to strip the string before passing it into the function. Besides, I didn't see mixed double quotes and single quotes in your code. In a single quoted string, you may want to use \ to escape the single quote. For example:
>>> x = ' {"x":\'x\'}'.strip()
>>> x
'{"x":\'x\'}'
>>> ast.literal_eval(x)
{'x': 'x'}
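Putting that together with the rest of the pipeline from the question, a sketch might look like this. The value of testing7 here is a shortened, made-up stand-in for the real string, with a leading space added on purpose.

```python
import ast
import json

# Made-up stand-in for the real variable, with leading whitespace
testing7 = "  {'long_name': 'Fairhope', 'types': ['locality', 'political']}"

# Strip surrounding whitespace before evaluating the literal
parsed = ast.literal_eval(testing7.strip())
as_json = json.dumps(parsed)
round_tripped = json.loads(as_json)
print(as_json)
```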

How to load a json file with strings including double quotes (")

I've been given a load of JSON files which I'm trying to load into python 3.5
I've already had to do some clean up work, removing double backslashes and extra quotations, however I've run into an issue I don't know how to solve.
I'm running the following code:
with open(filepath,'r') as json_file:
    reader = json_file.readlines()
    for row in reader:
        row = row.replace('\\', '')
        row = row.replace('"{', '{')
        row = row.replace('}"', '}')
        response = json.loads(row)
        for i in response:
            responselist.append(i['ActionName'])
However it's throwing up the error:
JSONDecodeError: Expecting ',' delimiter: line 1 column 388833 (char 388832)
The part of the JSON that's causing the issue is the status text entry below:
"StatusId":8,
"StatusIdString":"UnknownServiceError",
"StatusText":"u003cCompany docTypeu003d"Mobile.Tile" statusIdu003d"421" statusTextu003d"Start time of 11/30/2015 12:15:00 PM is more than 5 minutes in the past relative to the current time of 12/1/2015 12:27:01 AM." copyrightu003d"Copyright Company Inc." versionNumberu003d"7.3" createdDateu003d"2015-12-01T00:27:01Z" responseIdu003d"e74710c0-dc7c-42db-b608-bf905d95d153" /u003e",
"ActionName":"GetTrafficTile"
I added the line breaks to illustrate my point, it looks like python is unhappy that the string contains double quotes.
I have a feeling this may be to do with my replacing '\ \' with '' messing with the unicode characters in the string. Is there any way to repair these nested strings? I don't mind if the StatusText field is deleted completely, all I'm after is a list of the ActionName fields.
EDIT:
I've hosted an example file here:
https://www.dropbox.com/s/1oanrneg3aqandz/2015-12-01T00%253A00%253A42.527Z_2015-12-01T00%253A01%253A17.478Z?dl=0
This is exactly as I received, before I've replaced the extra backslashes and quotations
Here is a pared down version of the sample with one bad entry
["{\"apiServerType\":0,\"RequestId\":\"52a65260-1637-4653-a496-7555a2386340\",\"StatusId\":0,\"StatusIdString\":\"Ok\",\"StatusText\":null,\"ActionName\":\"GetCameraImage\",\"Url\":\"http://mosi-prod.cloudapp.net/api/v1/GetCameraImage?AuthToken=vo*AB57XLptsKXf0AzKjf1MOgQ1hZ4BKipKgYl3uGew%7C&CameraId=13782\",\"Lat\":0.0,\"Lon\":0.0,\"iVendorId\":12561,\"iConsumerId\":2986897,\"iSliverId\":51846,\"UserId\":\"2986897\",\"HardwareId\":null,\"AuthToken\":\"vo*AB57XLptsKXf0AzKjf1MOgQ1hZ4BKipKgYl3uGew|\",\"RequestTime\":\"2015-12-01T00:00:42.5278699Z\",\"ResponseTime\":\"2015-12-01T00:01:02.5926127Z\",\"AppId\":null,\"HttpMethod\":\"GET\",\"RequestHeaders\":\"{\\\"Connection\\\":[\\\"keep-alive\\\"],\\\"Via\\\":[\\\"HTTP/1.1 nycnz01msp1ts10.wnsnet.attws.com\\\"],\\\"Accept\\\":[\\\"application/json\\\"],\\\"Accept-Encoding\\\":[\\\"gzip\\\",\\\"deflate\\\"],\\\"Accept-Language\\\":[\\\"en-us\\\"],\\\"Host\\\":[\\\"mosi-prod.cloudapp.net\\\"],\\\"User-Agent\\\":[\\\"Traffic/5.4.0\\\",\\\"CFNetwork/758.1.6\\\",\\\"Darwin/15.0.0\\\"]}\",\"RequestContentHeaders\":\"{}\",\"RequestContentBody\":\"\",\"ResponseBody\":null,\"ResponseContentHeaders\":\"{\\\"Content-Type\\\":[\\\"image/jpeg\\\"]}\",\"ResponseHeaders\":\"{}\",\"MiniProfilerJson\":null}"]
The problem is a little different than you think. Whatever program built these files used data that was already json-encoded and ended up double and even triple encoding some of the information. I peeled it apart in a shell session and got usable python data. You can (1) go dope-slap whoever wrote the program that built this steaming pile of... um... goodness? and (2) manually scan through and decode inner json strings.
I decoded the data and it was a list of strings, but those strings looked suspiciously like json
>>> data = json.load(open('test.json'))
>>> type(data)
<class 'list'>
>>> d0 = data[0]
>>> type(d0)
<class 'str'>
>>> d0[:70]
'{"apiServerType":0,"RequestId":"52a65260-1637-4653-a496-7555a2386340",'
Sure enough, I can decode it
>>> d0_1 = json.loads(d0)
>>> type(d0_1)
<class 'dict'>
>>> d0_1
{'ResponseBody': None, 'StatusText': None, 'AppId': None, 'ResponseTime': '2015-12-01T00:01:02.5926127Z', 'HardwareId': None, 'RequestTime': '2015-12-01T00:00:42.5278699Z', 'StatusId': 0, 'Lon': 0.0, 'Url': 'http://mosi-prod.cloudapp.net/api/v1/GetCameraImage?AuthToken=vo*AB57XLptsKXf0AzKjf1MOgQ1hZ4BKipKgYl3uGew%7C&CameraId=13782', 'RequestContentBody': '', 'RequestId': '52a65260-1637-4653-a496-7555a2386340', 'MiniProfilerJson': None, 'RequestContentHeaders': '{}', 'ActionName': 'GetCameraImage', 'StatusIdString': 'Ok', 'HttpMethod': 'GET', 'iSliverId': 51846, 'ResponseHeaders': '{}', 'ResponseContentHeaders': '{"Content-Type":["image/jpeg"]}', 'apiServerType': 0, 'AuthToken': 'vo*AB57XLptsKXf0AzKjf1MOgQ1hZ4BKipKgYl3uGew|', 'iConsumerId': 2986897, 'RequestHeaders': '{"Connection":["keep-alive"],"Via":["HTTP/1.1 nycnz01msp1ts10.wnsnet.attws.com"],"Accept":["application/json"],"Accept-Encoding":["gzip","deflate"],"Accept-Language":["en-us"],"Host":["mosi-prod.cloudapp.net"],"User-Agent":["Traffic/5.4.0","CFNetwork/758.1.6","Darwin/15.0.0"]}', 'iVendorId': 12561, 'Lat': 0.0, 'UserId': '2986897'}
Picking one of the entries, that looks like more json
>>> hdrs = d0_1['RequestHeaders']
>>> type(hdrs)
<class 'str'>
Yep, it decodes to what I want
>>> hdrs_0 = json.loads(hdrs)
>>> type(hdrs_0)
<class 'dict'>
>>>
>>> hdrs_0["Via"]
['HTTP/1.1 nycnz01msp1ts10.wnsnet.attws.com']
>>>
>>> type(hdrs_0["Via"])
<class 'list'>
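The manual peeling above could be automated with something like the following sketch. The function decode_nested is my own helper, not part of the json module: it tries to decode every string and keeps the result only when it is a dict or list, so plain strings such as "2986897" stay strings. The sample data is a cut-down imitation of the structure in the question.

```python
import json


def decode_nested(value):
    """Recursively decode strings that themselves contain JSON objects or arrays."""
    if isinstance(value, str):
        try:
            inner = json.loads(value)
        except json.JSONDecodeError:
            return value  # not JSON at all, keep the text
        if isinstance(inner, (dict, list)):
            return decode_nested(inner)
        return value  # a scalar such as "123" stays a string
    if isinstance(value, dict):
        return {key: decode_nested(item) for key, item in value.items()}
    if isinstance(value, list):
        return [decode_nested(item) for item in value]
    return value


# Cut-down imitation of the double-encoded file
headers = json.dumps({"Via": ["HTTP/1.1 nycnz01msp1ts10.wnsnet.attws.com"]})
entry = json.dumps({"ActionName": "GetCameraImage", "UserId": "2986897",
                    "RequestHeaders": headers})
raw = json.dumps([entry])

data = decode_nested(json.loads(raw))
responselist = [item["ActionName"] for item in data]
print(responselist)
```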
Here you are :) :
responselist = []
with open('dataFile.json','r') as json_file:
    reader = json_file.readlines()
    for row in reader:
        strActNm = 'ActionName":"'; lenActNm = len(strActNm)
        actionAt = row.find(strActNm)
        while actionAt > 0:
            nxtQuotAt = row.find('"',actionAt+lenActNm+2)
            responselist.append( row[actionAt-1: nxtQuotAt+1] )
            actionAt = row.find('ActionName":"', nxtQuotAt)
print(responselist)
which gives:
>python3.6 -u "dataFile.py"
['"ActionName":"GetTrafficTile"']
>Exit code: 0
where dataFile.json is the file with the line you provided and dataFile.py the code provided above.
It's the hard way, but if the files are in a bad format you have to find a way around it, and simple pattern matching works in any case. For more complex cases you will need regex (regular expressions), but in this case a simple .find() is enough to do the job.
The code also finds multiple "actions" in a line (if the line contains more than one action).
Here is the result for the file you provided in your link, using the following small modification of the code above:
responselist = []
with open('dataFile1.json','r') as json_file:
    reader = json_file.readlines()
    for row in reader:
        # In this file the quotes are escaped, so the pattern includes the backslashes
        strActNm='\\"ActionName\\":\\"'
        # strActNm = 'ActionName":"'
        lenActNm = len(strActNm)
        actionAt = row.find(strActNm)
        while actionAt > 0:
            nxtQuotAt = row.find('"',actionAt+lenActNm+2)
            responselist.append( row[actionAt: nxtQuotAt+1].replace('\\','') )
            # Search for the next occurrence with the same escaped pattern
            actionAt = row.find(strActNm, nxtQuotAt)
print(responselist)
gives:
>python3.6 -u "dataFile.py"
['"ActionName":"GetCameraImage"']
>Exit code: 0
where dataFile1.json is the file you provided in the link.
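For comparison, here is a sketch of the regex variant mentioned above. The pattern is my own guess at what covers both files: it allows any number of backslashes before each quote, so it matches the plain and the escaped form. It returns only the action names rather than the whole "ActionName":"..." pair.

```python
import re

# Optional backslashes before each quote cover both "ActionName":"X"
# and the escaped form \"ActionName\":\"X\"
ACTION_NAME = re.compile(r'\\*"ActionName\\*"\s*:\s*\\*"([^"\\]*)')


def extract_action_names(text):
    """Return every ActionName value found in the text."""
    return ACTION_NAME.findall(text)


plain = '"StatusText":"bad "quoted" text","ActionName":"GetTrafficTile"'
escaped = '["{\\"ActionName\\":\\"GetCameraImage\\",\\"Url\\":\\"x\\"}"]'

print(extract_action_names(plain))
print(extract_action_names(escaped))
```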