How to extract certain information from a string and create a json object in python - json

I made a GET request to a website and parsed the response with BS4 using 'html.parser'. I want to extract the ID, size and availability from the string. I have parsed it down to this final string:
'{"id":706816278547,"parent_id":81935859731,"available":false,
"sku":"665570057894","featured_image":null,"public_title":null,
"requires_shipping":true,"price":40000,"options":["S"],
"option1":"s","option2":"","option3":"","option4":""},
{"id":707316252691,"parent_id":81935859731,"available":true,
"sku":"665570057900","featured_image":null,"public_title":null,
"requires_shipping":true,"price":40000,"options":["M"],
"option1":"m","option2":"","option3":"", "option4":""},
{"id":707316285459,"parent_id":81935859731,"available":true,
"sku":"665570057917","featured_image":null,"public_title":null,
"requires_shipping":true,"price":40000,"options":["L"],
"option1":"l","option2":"","option3":"","option4":""},
{"id":707316318227,"parent_id":81935859731,"available":true,
"sku":"665570057924","featured_image":null,"public_title":null,
"requires_shipping":true,"price":40000,"options":["XL"],
"option1":"xl","option2":"","option3":"","option4":""}'
I also tried using the split() method, but I'm unable to extract the needed information without creating a cluttered list and getting lost.
I tried using json.loads() so I could just extract the information needed by looking up the keys, but I get the following error:
final_id =
'{"id":706816278547,"parent_id":81935859731,"available":false,
"sku":"665570057894","featured_image":null,"public_title":null,
"requires_shipping":true,"price":40000,"options":["S"],
"option1":"s","option2":"","option3":"","option4":""},
{"id":707316252691,"parent_id":81935859731,"available":true,
"sku":"665570057900","featured_image":null,"public_title":null,
"requires_shipping":true,"price":40000,"options":["M"],
"option1":"m","option2":"","option3":"", "option4":""},
{"id":707316285459,"parent_id":81935859731,"available":true,
"sku":"665570057917","featured_image":null,"public_title":null,
"requires_shipping":true,"price":40000,"options":["L"],
"option1":"l","option2":"","option3":"","option4":""},
{"id":707316318227,"parent_id":81935859731,"available":true,
"sku":"665570057924","featured_image":null,"public_title":null,
"requires_shipping":true,"price":40000,"options":["XL"],
"option1":"xl","option2":"","option3":"","option4":""}'
find_id = json.loads(final_id)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/anaconda3/lib/python3.7/json/__init__.py", line 348, in loads
return _default_decoder.decode(s)
File "/anaconda3/lib/python3.7/json/decoder.py", line 340, in decode
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 233 (char 232)
I want to create a JSON object for each ID that holds its size and whether that size is available.
Any help is welcomed. Thank you.

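First, that string is not valid JSON: it is several objects separated by commas with nothing enclosing them, which is why json.loads() stops with "Extra data" after the first object.
Second, wrap the objects in square brackets so they form a single JSON array. json.loads() parses a string and json.load() parses a file, and both translate JSON values into Python ones (null becomes None, true/false become True/False). With the bracketed text saved in a file such as sof.json: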
import json
with open('sof.json', 'r') as stackof:
    final_id = json.load(stackof)
print(final_id)
will output
[{'id': 706816278547, 'parent_id': 81935859731, 'available': False, 'sku': '665570057894', 'featured_image': None, 'public_title': None, 'requires_shipping': True, 'price': 40000, 'options': ['S'], 'option1': 's', 'option2': '', 'option3': '', 'option4': ''}, {'id': 707316252691, 'parent_id': 81935859731, 'available': True, 'sku': '665570057900', 'featured_image': None, 'public_title': None, 'requires_shipping': True, 'price': 40000, 'options': ['M'], 'option1': 'm', 'option2': '', 'option3': '', 'option4': ''}, {'id': 707316285459, 'parent_id': 81935859731, 'available': True, 'sku': '665570057917', 'featured_image': None, 'public_title': None, 'requires_shipping': True, 'price': 40000, 'options': ['L'], 'option1': 'l', 'option2': '', 'option3': '', 'option4': ''}, {'id': 707316318227, 'parent_id': 81935859731, 'available': True, 'sku': '665570057924', 'featured_image': None, 'public_title': None, 'requires_shipping': True, 'price': 40000, 'options': ['XL'], 'option1': 'xl', 'option2': '', 'option3': '', 'option4': ''}]
The objects are now elements of a list, so to print the first ID you would write
print(final_id[0]['id'])
output:
706816278547
Let me know in the comments if that helped you.
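If the text is already in a Python string, the file is not strictly needed. The following is a minimal sketch, assuming the string has the shape shown in the question (shortened here to two variants with fewer keys): it wraps the string in square brackets before calling json.loads() and keeps only the ID, size and availability. The names `variants` and `summary` are illustrative, not from the original post.

```python
import json

# Shortened stand-in for the string from the question: comma-separated
# objects with no enclosing array.
final_id = (
    '{"id":706816278547,"available":false,"options":["S"],"option1":"s"},'
    '{"id":707316252691,"available":true,"options":["M"],"option1":"m"}'
)

# Wrapping the text in [ ] turns it into one valid JSON array.
variants = json.loads("[" + final_id + "]")

# Keep only the ID, size and availability of each variant.
summary = [
    {"id": v["id"], "size": v["options"][0], "available": v["available"]}
    for v in variants
]

# Serialize the reduced records back to JSON text.
print(json.dumps(summary, indent=2))
```

Taking the size from `options[0]` is an assumption based on the sample data, where each variant carries exactly one option.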

Related

How to handle the variable size json file in python to create DataFrame using pandas

I am trying to build a DataFrame using pandas, but I am not able to handle the JSON chunks I receive because they vary in size.
eg:
1st chunk:
{'ad': 0,
'country': 'US',
'ver': '1.0',
'adIdType': 2,
'adValue': '5',
'data': {'eventId': 99,
'clickId': '',
'eventType': 'PURCHASEMADE',
'tms': '2019-12-25T09:57:04+0000',
'productDetails': {'currency': 'DLR',
'productList': [
{'segment': 'Girls',
'vertical': 'Fashion Jewellery',
'brickname': 'Traditional Jewellery',
'price': 8,
'quantity': 10}]},
'transactionId': '1254'},
'appName': 'xer.tt',
'appId': 'XR',
'sdkVer': '1.0.0',
'language': 'en',
'tms': '2022-04-25T09:57:04+0000',
'tid': '124'}
2nd chunk:
{'ad': 0,
'country': 'US',
'ver': '1.0',
'adIdType': 2,
'adValue': '78',
'data': {'eventId': 7,
'clickId': '',
'eventType': 'PURCHASEMADE',
'tms': '20219-02-25T09:57:04+0000',
'productDetails': {'currency': 'DLR',
'productList': [{'segment': 'Boys',
'vertical': 'Fashion',
'brickname': 'Casuals',
'price': 10,
'quantity': 5},
{'segment': 'Girls',
'vertical': 'Fashion Jewellery',
'brickname': 'Traditional Jewellery',
'price': 8,
'quantity': 10}]},
'transactionId': '3258'},
'appName': 'xer.tt',
'appId': 'XR',
'sdkVer': '1.0.0',
'language': 'en',
'tms': '2029-02-25T09:57:04+0000',
'tid': '124'}
The number of products in productDetails changes from chunk to chunk: the first chunk lists only 1 product with its details, the 2nd chunk lists 2, and further chunks (i.e. records) can list ANY number of products.
I tried doing that by writing some python scripts but was not able to come to any good solution.
PS: If any further detail is required please let me know in the comments.
Thanks!
What you can do is use pd.json_normalize, with the innermost list of dictionaries as your record_path and all other data you are interested in as your meta. Here is an in-depth example of how you could construct that: pandas.io.json.json_normalize with very nested json
In your case, for a single object, that would be for example:
df = pd.json_normalize(
    obj,
    record_path=["data", "productDetails", "productList"],
    meta=[
        ["data", "productDetails", "currency"],
        ["data", "transactionId"],
        ["data", "clickId"],
        ["data", "eventType"],
        ["data", "tms"],
        "ad",
        "country",
    ],
)
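The same call also accepts a list of objects, which covers the case where each chunk holds a different number of products. Here is a minimal sketch, assuming the chunks have been collected into a list called `chunks` (a hypothetical name) and trimmed down to a few keys for brevity:

```python
import pandas as pd

# Two trimmed-down chunks with the same shape as the ones in the question.
chunks = [
    {"country": "US",
     "data": {"transactionId": "1254",
              "productDetails": {"currency": "DLR",
                                 "productList": [
                                     {"segment": "Girls", "price": 8, "quantity": 10}]}}},
    {"country": "US",
     "data": {"transactionId": "3258",
              "productDetails": {"currency": "DLR",
                                 "productList": [
                                     {"segment": "Boys", "price": 10, "quantity": 5},
                                     {"segment": "Girls", "price": 8, "quantity": 10}]}}},
]

# One row per product; the meta fields are repeated for each product of a chunk.
df = pd.json_normalize(
    chunks,
    record_path=["data", "productDetails", "productList"],
    meta=[["data", "transactionId"], ["data", "productDetails", "currency"], "country"],
)
print(df)
```

With this input the result has three rows: one for the first chunk and two for the second, each carrying its chunk's transactionId.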

How to parse nested JSON file in Pandas

I'm trying to transform a JSON file generated by the Day One Journal to a text file using Python but hit a brick wall.
This is broadly the format:
{'metadata': {'version': '1.0'},
'entries': [{'richText': '{"meta":{"version":1,"small-lines-removed":true,"created":{"platform":"com.bloombuilt.dayone-mac","version":1344}},"contents":[{"attributes":{"line":{"header":1,"identifier":"F78B28DA-488E-489E-9C95-1A0648099792"}},"text":"2022\\n"},{"attributes":{"line":{"header":0,"identifier":"FA8C6594-F43D-4652-B442-DAF72A379799"}},"text":"\\n"},{"attributes":{"line":{"header":0,"identifier":"0923BCC8-B24A-4C0D-963C-73D09561EECD"}},"text":"It’s the beginning of a new year"},{"embeddedObjects":[{"type":"horizontalRuleLine"}]},{"text":"\\n\\n\\n\\n"},{"embeddedObjects":[{"type":"horizontalRuleLine"}]}]}',
'duration': 0,
'creationOSVersion': '12.1',
'weather': {'sunsetDate': '2022-01-12T16:15:28Z',
'temperatureCelsius': 7,
'weatherServiceName': 'HAMweather',
'windBearing': 230,
'sunriseDate': '2022-01-12T08:00:44Z',
'conditionsDescription': 'Mostly Clear',
'pressureMB': 1042,
'visibilityKM': 48.28020095825195,
'relativeHumidity': 81,
'windSpeedKPH': 6,
'weatherCode': 'clear-night',
'windChillCelsius': 6.699999809265137},
'editingTime': 2925.313938140869,
'timeZone': 'Europe/London',
'creationDeviceType': 'Hal 9000',
'uuid': '988D9D9876624FAEB88F9BCC666FD9CD',
'creationDeviceModel': 'MacBookPro15,2',
'starred': False,
'location': {'region': {'center': {'longitude': -0.0095,
'latitude': 51},
'radius': 75},
'localityName': 'London',
'country': 'United Kingdom',
'timeZoneName': 'Europe/London',
'administrativeArea': 'England',
'longitude': -0.0095,
'placeName': 'Somewhere',
'latitude': 51},
'isPinned': False,
'creationDevice': 'somedevice'...,
}
I only want the 'text' (of which there might be a number of entries) and the 'creationDate', so that I get a daily record.
My code to pull out the data is straightforward:
import json
# Opening JSON file
f = open('files/2022.json')
# returns JSON object as
# a dictionary
data = json.load(f)
# Closing file
f.close()
I've tried using list comprehensions and then concatenating the Series in Pandas, but the two don't match in length, because multiple entries on one day mix up the dataframe.
I wanted to use this code:
result = []
for i in data['entries']:
    entry = i['creationDate'] + i['text']
    result.append(entry)
but I get this error:
KeyError: 'text'
What do I need to do?
Update:
{'richText': '{"meta":{"version":1,"small-lines-removed":true,"created":{"platform":"com.bloombuilt.dayone-mac","version":1344}},"contents":[{"text":"Later than I planned\\n"}]}',
'duration': 0,
'creationOSVersion': '12.1',
'weather': {'sunsetDate': '2022-01-12T16:15:28Z',
'temperatureCelsius': 7,
'weatherServiceName': 'HAMweather',
'windBearing': 230,
'sunriseDate': '2022-01-12T08:00:44Z',
'conditionsDescription': 'Mostly Clear',
'pressureMB': 1042,
'visibilityKM': 48.28020095825195,
'relativeHumidity': 81,
'windSpeedKPH': 6,
'weatherCode': 'clear-night',
'windChillCelsius': 6.699999809265137},
'editingTime': 672.3099998235703,
'timeZone': 'Europe/London',
'creationDeviceType': 'Computer',
'uuid': 'F53DCC5E05BB4106A49C76954117DBF4',
'creationDeviceModel': 'xompurwe',
'isPinned': False,
'creationDevice': 'Computer',
'text': 'Later than I planned \\\n',
'modifiedDate': '2022-01-05T01:01:29Z',
'isAllDay': False,
'creationDate': '2022-01-05T00:39:19Z',
'creationOSName': 'macOS'},
Sort of managed to work out a solution. Thank you to everyone who helped this morning, particularly @Tomer S.
My solution was:
result = []
for i in data['entries']:
    entry = i['creationDate'] + i['text']
    print(entry)
    result.append(entry)
It still won't get what I want
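The KeyError suggests that at least one entry has no 'text' key at all. One possible approach, sketched here under that assumption, is to skip such entries and then group the remaining ones by day. The sample `data` below is a trimmed stand-in for the real export, and the dates other than the one shown in the update are made up for illustration.

```python
import pandas as pd

# Trimmed stand-in for the parsed export; the second entry has no 'text' key,
# which is the situation that raises KeyError in the original loop.
data = {
    "entries": [
        {"creationDate": "2022-01-05T00:39:19Z", "text": "Later than I planned"},
        {"creationDate": "2022-01-12T09:00:00Z"},
        {"creationDate": "2022-01-05T18:10:00Z", "text": "Second entry that day"},
        {"creationDate": "2022-01-06T07:30:00Z", "text": "Next day"},
    ]
}

# Keep the date and text together in one record so the lengths always match.
rows = [
    {"creationDate": e["creationDate"], "text": e["text"]}
    for e in data["entries"]
    if "text" in e  # skip entries without a 'text' field
]

df = pd.DataFrame(rows)
df["day"] = pd.to_datetime(df["creationDate"]).dt.date

# One row per day, with that day's entries joined by a newline.
daily = df.groupby("day")["text"].agg("\n".join).reset_index()
print(daily)
```

Building one list of records, rather than two separate Series, keeps each text paired with its date, so several entries on the same day no longer misalign the columns.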

How to create MultiIndex Dataframe from a nested dictionary (many levels)

I am using the pyflightdata library to search for flight stats. It returns json inside a list of dicts.
Here is an example of the first dictionary in the list after my query:
flightlog = {'identification': {'number': {'default': 'KE504', 'alternative': 'None'}, 'callsign': 'KAL504', 'codeshare': 'None'}
, 'status': {'live': False, 'text': 'Landed 22:29', 'estimated': 'None', 'ambiguous': False, 'generic': {'status': {'text': 'landed', 'type': 'arrival', 'color': 'green', 'diverted': 'None'}
, 'eventTime': {'utc_millis': 1604611778000, 'utc_date': '20201105', 'utc_time': '2229', 'utc': 1604611778, 'local_millis': 1604615378000, 'local_date': '20201105', 'local_time': '2329', 'local': 1604615378}}}
, 'aircraft': {'model': {'code': 'B77L', 'text': 'Boeing 777-FEZ'}, 'registration': 'HL8075', 'country': {'name': 'South Korea', 'alpha2': 'KR', 'alpha3': 'KOR'}}
, 'airline': {'name': 'Korean Air', 'code': {'iata': 'KE', 'icao': 'KAL'}}
, 'airport': {'origin': {'name': 'London Heathrow Airport', 'code': {'iata': 'LHR', 'icao': 'EGLL'}, 'position': {'latitude': 51.471626, 'longitude': -0.467081, 'country': {'name': 'United Kingdom', 'code': 'GB'}, 'region': {'city': 'London'}}
, 'timezone': {'name': 'Europe/London', 'offset': 0, 'abbr': 'GMT', 'abbrName': 'Greenwich Mean Time', 'isDst': False}}, 'destination': {'name': 'Paris Charles de Gaulle Airport', 'code': {'iata': 'CDG', 'icao': 'LFPG'}, 'position': {'latitude': 49.012516, 'longitude': 2.555752, 'country': {'name': 'France', 'code': 'FR'}, 'region': {'city': 'Paris'}}, 'timezone': {'name': 'Europe/Paris', 'offset': 3600, 'abbr': 'CET', 'abbrName': 'Central European Time', 'isDst': False}}, 'real': 'None'}
, 'time': {'scheduled': {'departure_millis': 1604607300000, 'departure_date': '20201105', 'departure_time': '2115', 'departure': 1604607300, 'arrival_millis': 1604612700000, 'arrival_date': '20201105', 'arrival_time': '2245', 'arrival': 1604612700}, 'real': {'departure_millis': 1604609079000, 'departure_date': '20201105', 'departure_time': '2144', 'departure': 1604609079, 'arrival_millis': 1604611778000, 'arrival_date': '20201105', 'arrival_time': '2229', 'arrival': 1604611778}, 'estimated': {'departure': 'None', 'arrival': 'None'}, 'other': {'eta_millis': 1604611778000, 'eta_date': '20201105', 'eta_time': '2229', 'eta': 1604611778}}}
This dictionary is a huge, multi-nested, json mess and I am struggling to find a way to make it readable. I guess something like this:
identification  number     default      KE504
                           alternative  None
                callsign                KAL504
                codeshare               None
status          live                    False
                text                    Landed 22:29
                estimated               None
                ambiguous               False
...
I am trying to turn it into a pandas DataFrame, with mixed results.
In this post it was explained that MultiIndex values have to be tuples, not dictionaries, so I used their example to convert my dictionary:
flightlog_tuple = {(outerKey, innerKey): values for outerKey, innerDict in flightlog.items() for innerKey, values in innerDict.items()}
Which worked, up to a certain point.
df2 = pd.Series(flightlog_tuple)
gives the following output:
identification  number         {'default': 'KE504', 'alternative': 'None'}
                callsign       KAL504
                codeshare      None
status          live           False
                text           Landed 22:29
                estimated      None
                ambiguous      False
                generic        {'status': {'text': 'landed', 'type': 'arrival...
aircraft        model          {'code': 'B77L', 'text': 'Boeing 777-FEZ'}
                registration   HL8075
                country        {'name': 'South Korea', 'alpha2': 'KR', 'alpha...
airline         name           Korean Air
                code           {'iata': 'KE', 'icao': 'KAL'}
airport         origin         {'name': 'London Heathrow Airport', 'code': {'...
                destination    {'name': 'Paris Charles de Gaulle Airport', 'c...
                real           None
time            scheduled      {'departure_millis': 1604607300000, 'departure...
                real           {'departure_millis': 1604609079000, 'departure...
                estimated      {'departure': 'None', 'arrival': 'None'}
                other          {'eta_millis': 1604611778000, 'eta_date': '202...
dtype: object
Kind of what I was going for but some of the indexes are still in the column with values because there are so many levels. So I followed this explanation and tried to add more levels:
level_up = {(level1Key, level2Key, level3Key): values for level1Key, level2Dict in flightlog.items() for level2Key, level3Dict in level2Dict.items() for level3Key, values in level3Dict.items()}
df2 = pd.Series(level_up)
This code gives me AttributeError: 'str' object has no attribute 'items'. I don't understand why the first 2 indexes worked, but the others give an error.
I've tried other methods like MultiIndex.from_tuple or DataFrame.from_dict, but I can't get it to work.
This dictionary is too complex for me as a beginner. I don't know what the right approach is. Maybe I am using DataFrames in the wrong way. Maybe there is an easier way to access the data that I am overlooking.
Any help would be much appreciated!
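The three-level comprehension fails because not every second-level value is a dictionary: 'callsign', for example, maps straight to the string 'KAL504', which has no .items(). One possible way around this, sketched here with a hypothetical helper named `flatten` and a shortened copy of the dictionary, is to walk the structure recursively and pad the shorter key paths so every tuple has the same length:

```python
import pandas as pd

def flatten(d, prefix=()):
    """Yield (key_path_tuple, value) pairs for every leaf in a nested dict."""
    for key, value in d.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            # Descend only when the value is itself a dictionary.
            yield from flatten(value, path)
        else:
            yield path, value

# Shortened copy of the dictionary from the question.
flightlog = {
    "identification": {"number": {"default": "KE504", "alternative": "None"},
                       "callsign": "KAL504"},
    "status": {"live": False, "text": "Landed 22:29"},
}

flat = dict(flatten(flightlog))
depth = max(len(path) for path in flat)

# Pad shorter paths with "" so every tuple has the same number of levels.
padded = {path + ("",) * (depth - len(path)): value for path, value in flat.items()}

# Tuple keys of equal length become the levels of a MultiIndex.
series = pd.Series(padded)
print(series)
```

If a flat table with dotted column names is enough, `pd.json_normalize(flightlog)` may be a simpler alternative to the MultiIndex.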

Convert nested json to dataframe in python

I requested a URL to get the JSON response from a website. The response is a list of dicts, in which some elements contain another list of dicts. I have tried json_normalize, but it only flattens one layer and does not turn all the keys inside the nested dicts into DataFrame columns. I would be grateful for any suggestions.
Below is one of the list element data:
[{'matchID': '0b0943b1-5673-4408-bca4-c34e63a11cfc', 'matchIDinofficial': '20190202SAT77', 'matchNum': '77', 'matchDate': '2019-02-02+08:00', 'matchDay': 'SAT', 'coupon': {'couponID': '1', 'couponShortName': 'SAT', 'couponNameCH': '周六賽事', 'couponNameEN': 'Saturday Matches'}, 'league': {'leagueID': '124', 'leagueShortName': 'MXL', 'leagueNameCH': '墨西哥超級聯賽', 'leagueNameEN': 'Mexican Premier'}, 'homeTeam': {'teamID': '2041', 'teamNameCH': '迪祖亞拿', 'teamNameEN': 'Tijuana'}, 'awayTeam': {'teamID': '910', 'teamNameCH': '托盧卡', 'teamNameEN': 'Toluca'}, 'matchStatus': 'ResultIn', 'matchTime': '2019-02-03T11:06:00+08:00', 'statuslastupdated': '2019-02-03T11:56:05+08:00', 'inplaydelay': 'false', 'liveEvent': {'ilcLiveDisplay': True, 'hasLiveInfo': True, 'isIncomplete': False, 'matchIDbetradar': '16560915', 'matchstate': 'HalfTime', 'stateTS': '2019-02-03T11:07:44+08:00', 'liveevent': [{'order': 1, 'minutesElasped': '14', 'actionType': 'Regular', 'playerNameCH': '米拿保蘭奴斯', 'playerNameEN': 'Miller Bolanos', 'homeaway': 'Home'}, {'order': 2, 'minutesElasped': '25', 'actionType': 'YellowCard', 'playerNameCH': '菲臘比柏度', 'playerNameEN': 'Felipe Pardo', 'homeaway': 'Away'}]}, 'accumulatedscore': [{'periodvalue': 'FirstHalf', 'periodstatus': 'ResultFinal', 'home': '1', 'away': '0'}], 'livescore': {'home': '1', 'away': '0'}, 'cornerresult': '5', 'hasWebTV': False, 'hilodds': {'LINELIST': [{'LINENUM': '2', 'MAINLINE': 'false', 'LINESTATUS': '1', 'LINEORDER': '2', 'LINE': '3.5/3.5', 'L': '100#1.22', 'H': '100#3.80'}, {'LINENUM': '1', 'MAINLINE': 'true', 'LINESTATUS': '1', 'LINEORDER': '1', 'LINE': '2.5/2.5', 'H': '100#1.95', 'L': '100#1.75'}, {'LINENUM': '3', 'MAINLINE': 'false', 'LINESTATUS': '1', 'LINEORDER': '3', 'LINE': '2.0/2.5', 'L': '100#2.10', 'H': '100#1.65'}], 'ID': '0adfff89-9b63-4771-9008-96762227aca6', 'POOLSTATUS': 'Selling', 'INPLAY': 'true', 'ALLUP': 'true', 'Cur': '1'}, 'hasExtraTimePools': False, 'results': {}, 'definedPools': ['HAD', 'FHA', 'CRS', 'FCS', 'FTS', 'OOE', 
'TTG', 'HFT', 'HHA', 'HDC', 'HIL', 'FHL', 'CHL', 'NTS'], 'inplayPools': ['HAD', 'HIL', 'CHL', 'CRS', 'NTS']}]
import requests
import pandas as pd
from pandas.io.json import json_normalize
url = 'url'
response = requests.get(url).json()
newdf = pd.DataFrame()
for match in response:
    df = json_normalize(match)
    newdf = newdf.join(df)
It gives me a ValueError, as shown below:
ValueError: columns overlap but no suffix specified:
Index(['awayTeam.teamID', 'awayTeam.teamNameCH', 'awayTeam.teamNameEN',
'cornerresult', 'coupon.couponID', 'coupon.couponNameCH',
'coupon.couponNameEN', 'coupon.couponShortName', 'definedPools',
'hasExtraTimePools', 'hasWebTV', 'hilodds.ALLUP', 'hilodds.Cur',
'hilodds.ID', 'hilodds.INPLAY', 'hilodds.LINELIST',
'hilodds.POOLSTATUS', 'homeTeam.teamID', 'homeTeam.teamNameCH',
'homeTeam.teamNameEN', 'inplayPools', 'inplaydelay', 'league.leagueID',
'league.leagueNameCH', 'league.leagueNameEN', 'league.leagueShortName',
'liveEvent.hasLiveInfo', 'liveEvent.ilcLiveDisplay',
'liveEvent.isIncomplete', 'liveEvent.liveevent',
'liveEvent.matchIDbetradar', 'liveEvent.matchstate',
'liveEvent.stateTS', 'matchDate', 'matchDay', 'matchID',
'matchIDinofficial', 'matchNum', 'matchStatus', 'matchTime',
'statuslastupdated'],
dtype='object')
What I expect for the columns in dataframe is something like this:
matchID homeTeam.teamNameEN awayTeam.teamNameEN hilodds.LINELIST.LINENUM
The above is just a tiny example; I want all keys inside the list of dicts to be column headers in the dataframe.
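The ValueError comes from DataFrame.join, which combines frames side by side and therefore refuses two frames with the same column names. A possible alternative, sketched below on a trimmed stand-in for the response, is to normalize the whole list at once, expand the nested LINELIST separately, and merge the two on matchID. The variable names `matches`, `lines` and `wide` are illustrative.

```python
import pandas as pd

# Trimmed stand-in for the response; only a few keys are kept.
response = [
    {"matchID": "0b0943b1-5673-4408-bca4-c34e63a11cfc",
     "homeTeam": {"teamNameEN": "Tijuana"},
     "awayTeam": {"teamNameEN": "Toluca"},
     "hilodds": {"LINELIST": [
         {"LINENUM": "2", "LINE": "3.5/3.5"},
         {"LINENUM": "1", "LINE": "2.5/2.5"}]}},
]

# One row per match: nested dicts become dotted columns,
# but the LINELIST list stays packed inside a single cell.
matches = pd.json_normalize(response)

# One row per odds line, tagged with the matchID it belongs to.
lines = pd.json_normalize(
    response,
    record_path=["hilodds", "LINELIST"],
    meta=["matchID"],
    record_prefix="hilodds.LINELIST.",
)

# Replace the packed list column with the expanded line columns.
wide = matches.drop(columns="hilodds.LINELIST").merge(lines, on="matchID")
print(wide.columns.tolist())
```

The result has one row per odds line, so match-level values such as the team names repeat. Other nested lists (for example liveEvent.liveevent) would need the same treatment, and whether one wide table is the right shape depends on how the data will be used.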

Simple Json decoding with SimpleJSON - Python

I've just started learning Python and I'm having a go at using a Google API, but I hit a brick wall trying to parse the JSON with simplejson.
How do I go about pulling single values (i.e. the product or brand fields) out of this mess below?
{'currentItemCount': 25, 'etag': '"izYJutfqR9tRDg1H4X3fGx1UiCI/hqqZ6pMwV1-CEu5NSqfJO0Ix-gs"', 'id': 'tag:google.com,2010:shopping/products', 'items': [{'id': 'tag:google.com,2010:shopping/products/1196682/8186421160532506003',
'kind': 'shopping#product',
'product': {'author': {'accountId': '1196682',
'name': "Dillard's"},
'brand': 'Merrell',
'condition': 'new',
'country': 'US',
'creationTime': '2011-03-10T08:11:08.000Z',
'description': u'Merrell\'s "Trail Glove" barefoot running shoe lets your feet follow their natural i$
'googleId': '8186421160532506003',
'gtin': '00797240569847',
'images': [{'link': 'http://dimg.dillards.com/is/image/DillardsZoom/03528718_zi_amazon?$product$'}],
'inventories': [{'availability': 'inStock',
'channel': 'online',
'currency': 'USD',
'price': 110.0}],
'language': 'en',
'link': 'http://www.dillards.com/product/Merrell-Mens-Trail-Glove-Barefoot-Running-Shoes_301_-1_301_5$
'modificationTime': '2011-05-25T07:42:51.000Z',
'title': 'Merrell Men\'s "Trail Glove" Barefoot Running Shoes'},
'selfLink': 'https://www.googleapis.com/shopping/search/v1/public/products/1196682/gid/8186421160532506003?alt=js$
The JSON you've pasted in the question is not valid. Once you have fixed that, here's how to use simplejson:
import simplejson as json
your_response_body = '["foo", {"bar":["baz", null, 1.0, 2]}]'
obj = json.loads(your_response_body)
print(obj[1]['bar'])
And a link to the documentation.
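Applied to a response shaped like the one in the question, the lookup is a chain of dictionary keys and list indexes. The sketch below uses the standard-library json module, which offers the same loads() call as simplejson, and a small hand-written document that only imitates the structure of the real response:

```python
import json

# Small valid JSON document shaped like the response in the question.
body = """
{"currentItemCount": 1,
 "items": [{"kind": "shopping#product",
            "product": {"brand": "Merrell",
                        "condition": "new",
                        "inventories": [{"price": 110.0, "currency": "USD"}]}}]}
"""

data = json.loads(body)

# 'items' is a list; each item holds a 'product' dictionary.
for item in data["items"]:
    product = item["product"]
    print(product["brand"], product["inventories"][0]["price"])

# Collect a single field from every item.
brands = [item["product"]["brand"] for item in data["items"]]
```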