Parsing JSON files

Below is the code for visual analysis of a set of tweets obtained in a .json file. When the script runs, an error is raised at the map() call. Any way to fix it?
import json
import pandas as pd
import matplotlib.pyplot as plt

tweets_data_path = 'import_requests.txt'
tweets_data = []
tweets_file = open(tweets_data_path, "r")
for line in tweets_file:
    try:
        tweet = json.loads(line)
        tweets_data.append(tweet)
    except:
        continue
print(len(tweets_data))
tweets = pd.DataFrame()
tweets['text'] = map(lambda tweet: tweet['text'], tweets_data)
These are the lines leading up to the ValueError message I am getting for the above code:
Traceback (most recent call last):
  File "tweet_len.py", line 21, in <module>
    tweets['text'] = map(lambda tweet: tweet['text'], tweets_data)
  File "/usr/lib/python3/dist-packages/pandas/core/frame.py", line 1887, in __setitem__
    self._set_item(key, value)
  File "/usr/lib/python3/dist-packages/pandas/core/frame.py", line 1966, in _set_item
    self._ensure_valid_index(value)
  File "/usr/lib/python3/dist-packages/pandas/core/frame.py", line 1943, in _ensure_valid_index
    raise ValueError('Cannot set a frame with no defined index '
ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series
I am using Python 3.
EDIT: Below is a sample of the Twitter data collected (.json format).
{
  "created_at": "Sat Mar 05 05:47:23 +0000 2016",
  "id": 705993088574033920,
  "id_str": "705993088574033920",
  "text": "Tumi Inc. civil war: Staff manning US ceasefire hotline 'can't speak Arabic' #fakeheadlinebot #learntocode #makeatwitterbot #javascript",
  "source": "\u003ca href=\"http://javascriptiseasy.com\" rel=\"nofollow\"\u003eJavaScript is Easy\u003c/a\u003e",
  "truncated": false,
  "in_reply_to_status_id": null,
  "in_reply_to_status_id_str": null,
  "in_reply_to_user_id": null,
  "in_reply_to_user_id_str": null,
  "in_reply_to_screen_name": null,
  "user": {
    "id": 4382400263,
    "id_str": "4382400263",
    "name": "JavaScript is Easy",
    "screen_name": "javascriptisez",
    "location": "Your Console",
    "url": "http://javascriptiseasy.com",
    "description": "Get learning!",
    "protected": false,
    "verified": false,
    "followers_count": 167,
    "friends_count": 68,
    "listed_count": 212,
    "favourites_count": 11,
    "statuses_count": 55501,
    "created_at": "Sat Dec 05 11:18:00 +0000 2015",
    "utc_offset": null,
    "time_zone": null,
    "geo_enabled": false,
    "lang": "en",
    "contributors_enabled": false,
    "is_translator": false,
    "profile_background_color": "000000",
    "profile_background_image_url": "http://abs.twimg.com/images/themes/theme1/bg.png",
    "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme1/bg.png",
    "profile_background_tile": false,
    "profile_link_color": "FFCC4D",
    "profile_sidebar_border_color": "000000",
    "profile_sidebar_fill_color": "000000",
    "profile_text_color": "000000",
    "profile_use_background_image": false,
    "profile_image_url": "http://pbs.twimg.com/profile_images/673099606348070912/xNxp4zOt_normal.jpg",
    "profile_image_url_https": "https://pbs.twimg.com/profile_images/673099606348070912/xNxp4zOt_normal.jpg",
    "profile_banner_url": "https://pbs.twimg.com/profile_banners/4382400263/1449314370",
    "default_profile": false,
    "default_profile_image": false,
    "following": null,
    "follow_request_sent": null,
    "notifications": null
  },
  "geo": null,
  "coordinates": null,
  "place": null,
  "contributors": null,
  "is_quote_status": false,
  "retweet_count": 0,
  "favorite_count": 0,
  "entities": {
    "hashtags": [{
      "text": "fakeheadlinebot",
      "indices": [77, 93]
    }, {
      "text": "learntocode",
      "indices": [94, 106]
    }, {
      "text": "makeatwitterbot",
      "indices": [107, 123]
    }, {
      "text": "javascript",
      "indices": [124, 135]
    }],
    "urls": [],
    "user_mentions": [],
    "symbols": []
  },
  "favorited": false,
  "retweeted": false,
  "filter_level": "low",
  "lang": "en",
  "timestamp_ms": "1457156843690"
}

I think you can use read_json:
import pandas as pd

df = pd.read_json('file.json')
print(df.head())
contributors coordinates created_at entities \
contributors_enabled NaN NaN 2016-03-05 05:47:23 NaN
created_at NaN NaN 2016-03-05 05:47:23 NaN
default_profile NaN NaN 2016-03-05 05:47:23 NaN
default_profile_image NaN NaN 2016-03-05 05:47:23 NaN
description NaN NaN 2016-03-05 05:47:23 NaN
favorite_count favorited filter_level geo \
contributors_enabled 0 False low NaN
created_at 0 False low NaN
default_profile 0 False low NaN
default_profile_image 0 False low NaN
description 0 False low NaN
id id_str \
contributors_enabled 705993088574033920 705993088574033920
created_at 705993088574033920 705993088574033920
default_profile 705993088574033920 705993088574033920
default_profile_image 705993088574033920 705993088574033920
description 705993088574033920 705993088574033920
... is_quote_status lang \
contributors_enabled ... False en
created_at ... False en
default_profile ... False en
default_profile_image ... False en
description ... False en
place retweet_count retweeted \
contributors_enabled NaN 0 False
created_at NaN 0 False
default_profile NaN 0 False
default_profile_image NaN 0 False
description NaN 0 False
source \
contributors_enabled <a href="http://javascriptiseasy.com" rel="nof...
created_at <a href="http://javascriptiseasy.com" rel="nof...
default_profile <a href="http://javascriptiseasy.com" rel="nof...
default_profile_image <a href="http://javascriptiseasy.com" rel="nof...
description <a href="http://javascriptiseasy.com" rel="nof...
text \
contributors_enabled Tumi Inc. civil war: Staff manning US ceasefir...
created_at Tumi Inc. civil war: Staff manning US ceasefir...
default_profile Tumi Inc. civil war: Staff manning US ceasefir...
default_profile_image Tumi Inc. civil war: Staff manning US ceasefir...
description Tumi Inc. civil war: Staff manning US ceasefir...
timestamp_ms truncated \
contributors_enabled 2016-03-05 05:47:23.690 False
created_at 2016-03-05 05:47:23.690 False
default_profile 2016-03-05 05:47:23.690 False
default_profile_image 2016-03-05 05:47:23.690 False
description 2016-03-05 05:47:23.690 False
user
contributors_enabled False
created_at Sat Dec 05 11:18:00 +0000 2015
default_profile False
default_profile_image False
description Get learning!
[5 rows x 25 columns]
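As for the ValueError itself: in Python 3, map() returns a lazy iterator, and assigning one to a column of an empty DataFrame fails because pandas cannot infer an index from it. Wrapping the map() in list() fixes the original code; a minimal sketch, with an inline two-line sample standing in for the tweet file:

```python
import json
import pandas as pd

# Inline stand-in for the tweet file: one JSON document per line.
lines = [
    '{"text": "first tweet", "lang": "en"}',
    '{"text": "second tweet", "lang": "en"}',
]

tweets_data = []
for line in lines:
    try:
        tweets_data.append(json.loads(line))
    except ValueError:  # skip lines that are not valid JSON
        continue

tweets = pd.DataFrame()
# list(...) materializes the iterator so pandas can build the column.
tweets['text'] = list(map(lambda tweet: tweet['text'], tweets_data))
print(tweets['text'].tolist())
```

An equivalent and arguably more idiomatic form is a list comprehension: tweets['text'] = [t['text'] for t in tweets_data].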


How can I parse this HTML?

I'm trying to scrape https://PickleballBrackets.com using Selenium and BeautifulSoup with this code:
browser = webdriver.Safari()
browser.get('https://pickleballbrackets.com')
soup = BeautifulSoup(browser.page_source, 'lxml')
If I look at browser.page_source after I get the html, I can see 50 instances of
<div class="browse-row-box">
but after I create a soup object, they are lost. I believe that means the HTML is poorly formed. I've tried all three parsers ('lxml', 'html5lib', 'html.parser') without any luck.
Suggestions on how to proceed?
It's a lot easier to get the data from the source: the page is populated from a JSON endpoint you can POST to directly.
import pandas as pd
import requests

url = 'https://pickleballbrackets.com/Json.asmx/EventsSearch_PublicUI'
payload = {
    'AgeIDs': "",
    'Alpha': "All",
    'ClubID': "",
    'CountryID': "",
    'DateFilter': "future",
    'EventTypeIDs': "1",
    'FormatIDs': "",
    'FromDate': "",
    'IncludeTestEvents': "0",
    'OrderBy': "EventActivityFirstDate",
    'OrderDirection': "Asc",
    'PageNumber': "1",
    'PageSize': 9999,
    'PlayerGroupIDs': "",
    'PrizeMoney': "All",
    'RankIDs': "",
    'ReturnType': "json",
    'SearchWord': "",
    'ShowOnCalendar': "0",
    'SportIDs': "dc1894c6-7e85-43bc-bfa2-3993b0dd630f",
    'StateIDs': "",
    'ToDate': "",
    'prt': ""}

jsonData = requests.post(url, json=payload).json()
df = pd.DataFrame(jsonData['d'])
Output:
print(df.head(2).to_string())
RowNumber RecordCount PageCount CurrPage EventID ClubID Title TimeZoneAbbreviation UTCOffset HasDST StartTimesPosted Logo OnlineRegistration_Active Registration_DateOpen Registration_DateClosed IsSanctioned CancelTourney LocationOfEvent_Venue LocationOfEvent_StreetAddress LocationOfEvent_City LocationOfEvent_CountryTitle LocationOfEvent_StateTitle LocationOfEvent_Zip ShowDraws IsFavorite IsPrizeMoney MaxRegistrationsForEntireEvent Sanction_PCO SanctionLevelAppovedStatus_PCO SanctionLevelID_PCO Sanction_SSIPA SanctionLevelAppovedStatus_SSIPA SanctionLevelID_SSIPA Sanction_USAPA SanctionLevelAppovedStatus_USAP SanctionLevelID_USAP Sanction_WPF SanctionLevelAppovedStatus_WPF SanctionLevelID_WPF Sanction_GPA SanctionLevelAppovedStatus_GPA SanctionLevelID_GPA EventActivityFirstDate EventActivityLastDate IsRegClosed Cost_Registration_Current Cost_FeeOnEvents RegistrationCount_InAtLeastOneLiveEvent showResultsButton SantionLevels_PCO_Title SantionLevels_PCO_LevelLogo SantionLevels_SSIPA_Title SantionLevels_SSIPA_LevelLogo SantionLevels_USAP_Title SantionLevels_USAP_LevelLogo SantionLevels_WPF_Title SantionLevels_WPF_LevelLogo SantionLevels_GPA_Title SantionLevels_GPA_LevelLogo mng
0 1 152 1 1 410d04c2-49c5-48a4-847f-0f0ac0aa92f7 91c83e9c-c8e3-460d-b124-52f5c1036336 Cincinnati Pickleball Club 2022 March Mania EST -5 True False 410d04c2-49c5-48a4-847f-0f0ac0aa92f7_Logo.png True 1/24/2022 7:30:00 AM 3/22/2022 5:00:00 PM False False Five Seasons Ohio 11790 Snider Road Cincinnati United States Ohio 45249 -1 0 False 0 False False False False False 3/25/2022 4:00:00 PM 3/27/2022 2:00:00 PM 1 50.0 225.0 238 1 0
1 2 152 1 1 9f0c5976-94e9-4d58-a273-774744bdacec e5cd380b-fe72-4ef4-89e8-5053e94587a3 Flash Fridays Slam Series - March 25th EST -5 True False 9f0c5976-94e9-4d58-a273-774744bdacec_Logo.png True 3/1/2022 5:00:00 PM 3/23/2022 11:45:00 PM False False Holbrook Park 100 Sherwood Dr Huntersville United States North Carolina 28078 1 0 False 6 False False False False False 3/25/2022 4:00:00 PM 3/25/2022 4:00:00 PM 1 25.0 0.0 0 0 0
....
[152 rows x 60 columns]

I want to iterate over a JSON column and convert it into rows in PostgreSQL

Basically, I want to iterate over the JSON array in each table row, repeating the row's other column values for every element of the array.
My table has the columns id, line, txndate, metadata, docnumber. Sample rows:

id: 363
line: [{"Id": "0", "Amount": 135000.0, "DetailType": "JournalEntryLineDetail", "Description": "Paid Office Rent Of Office", "JournalEntryLineDetail": {"AccountRef": {"name": "Rent or lease payments", "value": "57"}, "PostingType": "Debit"}}, {"Id": "1", "Amount": 135000.0, "DetailType": "JournalEntryLineDetail", "Description": "Paid Office Rent Of Office", "JournalEntryLineDetail": {"AccountRef": {"name": "Cash and cash equivalents:Bank", "value": "83"}, "PostingType": "Credit"}}]
txndate: 2021-08-16 00:00:00.000000 +00:00
metadata: {"CreateTime": "2021-08-20T05:39:38.000000Z", "LastUpdatedTime": "2021-08-20T05:39:38.000000Z"}
docnumber: 332

id: 610
line: [{"Id": "0", "Amount": 4138088.25, "DetailType": "JournalEntryLineDetail", "Description": "Deposit in Bank", "JournalEntryLineDetail": {"AccountRef": {"name": "Cash and cash equivalents:Bank", "value": "83"}, "PostingType": "Debit"}}, {"Id": "1", "Amount": 4138088.25, "DetailType": "JournalEntryLineDetail", "Description": "Deposit in Bank", "JournalEntryLineDetail": {"AccountRef": {"name": "Share capital", "value": "8"}, "PostingType": "Credit"}}, {"Id": "2", "DetailType": "DescriptionOnly", "Description": "Deposit in Bank"}]
txndate: 2021-10-11 00:00:00.000000 +00:00
metadata: {"CreateTime": "2021-10-13T10:44:09.000000Z", "LastUpdatedTime": "2021-10-13T10:44:09.000000Z"}
docnumber: 560

id: 381
line: [{"Id": "0", "Amount": 30000.0, "DetailType": "JournalEntryLineDetail", "Description": "Paid to Punkish", "JournalEntryLineDetail": {"AccountRef": {"name": "Advance Against Salary", "value": "103"}, "PostingType": "Debit"}}, {"Id": "1", "Amount": 30000.0, "DetailType": "JournalEntryLineDetail", "Description": "Paid to Punkish", "JournalEntryLineDetail": {"AccountRef": {"name": "Cash and cash equivalents:Bank", "value": "83"}, "PostingType": "Credit"}}]
txndate: 2021-07-01 00:00:00.000000 +00:00
metadata: {"CreateTime": "2021-08-23T05:31:42.000000Z", "LastUpdatedTime": "2021-08-23T05:47:03.000000Z"}
docnumber: 521
But I want to extract the information like the following table:

id | line_id | Amount | Description | name | value | posting_type | txndate | CreatedTime | LastUpdatedTime
363 | 0 | 135000.0 | Paid Office Rent Of Office | Rent or lease payments | 57 | Debit | 2021-08-16 00:00:00.000000 +00:00 | 2021-08-20T05:39:38.000000Z | 2021-08-20T05:39:38.000000Z
363 | 1 | 135000.0 | Paid Office Rent Of Office | Cash and cash equivalents:Bank | 83 | Credit | 2021-08-16 00:00:00.000000 +00:00 | 2021-08-20T05:39:38.000000Z | 2021-08-20T05:39:38.000000Z
610 | 0 | 4138088.25 | Deposit in Bank | Cash and cash equivalents:Bank | 83 | Debit | 2021-10-11 00:00:00.000000 +00:00 | 2021-10-13T10:44:09.000000Z | 2021-10-13T10:44:09.000000Z
610 | 1 | 4138088.25 | .......... | .. | ... | ... | 2021-10-11 00:00:00.000000 +00:00 | 2021-10-13T10:44:09.000000Z | 2021-10-13T10:44:09.000000Z
610 | 2 | 4138088.25 | .......... | .. | ... | ... | 2021-10-11 00:00:00.000000 +00:00 | 2021-10-13T10:44:09.000000Z | 2021-10-13T10:44:09.000000Z
610 | 3 | 4138088.25 | .......... | .. | ... | ... | 2021-10-11 00:00:00.000000 +00:00 | 2021-10-13T10:44:09.000000Z | 2021-10-13T10:44:09.000000Z
I want to convert the JSON column entries into rows while keeping id, txndate, CreateTime, and LastUpdatedTime the same for every element of the JSON array (the line column in my case).
Please guide me with a solution if possible.
Note: I am using PostgreSQL and the datatype of the line column is jsonb.
Here you go. You can use the jsonb_array_elements function to convert a JSON array into rows and then query each row:
Demo
select
    t.id,
    e.value ->> 'Id' as line_id,
    e.value ->> 'Amount' as amount,
    e.value ->> 'Description' as description,
    e.value -> 'JournalEntryLineDetail' -> 'AccountRef' ->> 'name' as name,
    e.value -> 'JournalEntryLineDetail' -> 'AccountRef' ->> 'value' as value,
    e.value -> 'JournalEntryLineDetail' ->> 'PostingType' as posting_type,
    t.txndate,
    t.metadata ->> 'CreateTime' as CreatedTime,
    t.metadata ->> 'LastUpdatedTime' as LastUpdatedTime
from
    test t
    cross join jsonb_array_elements(t.line) e
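If the same flattening ever needs to happen client-side instead (say, after fetching the rows into Python), pandas' json_normalize produces an equivalent shape. A sketch using a one-element version of the sample line array above; the row dict and the column selection here are illustrative assumptions, not part of the original schema:

```python
import json
import pandas as pd

# One table row from the question, as it might arrive client-side.
row = {
    "id": 363,
    "txndate": "2021-08-16 00:00:00.000000 +00:00",
    "line": json.loads(
        '[{"Id": "0", "Amount": 135000.0, "Description": "Paid Office Rent Of Office", '
        '"JournalEntryLineDetail": {"AccountRef": {"name": "Rent or lease payments", '
        '"value": "57"}, "PostingType": "Debit"}}]'
    ),
}

# Flatten the nested "line" array; nested keys become dotted column names.
df = pd.json_normalize(row["line"])
df["id"] = row["id"]            # repeat the parent values for each array element
df["txndate"] = row["txndate"]
print(df[["id", "Id", "Amount", "JournalEntryLineDetail.PostingType"]])
```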

Working with nested lists in R

I have a dataset in JSON format. When I pull this data into a variable in R using jsonlite::fromJSON, a list is created. This list contains another list, which in turn contains another list, and the innermost list holds the actual values.
How do I make a dataframe out of this data?
Sample dataset:
{
  "AG2": {
    "-KB89nmaAHtjvIHl6X6I": {
      "attempts": 0,
      "isCorrect": true,
      "questionIndex": 0,
      "score": 20,
      "timeBonus": 95,
      "timestamp": "Mon Feb 22 2016 18:59:17 GMT+0530 (India Standard Time)",
      "totalScore": 20
    },
    "-KB89q8EJ-rYvjzVcIiQ": {
      "attempts": 0,
      "isCorrect": true,
      "questionIndex": 1,
      "score": 39,
      "timeBonus": 93,
      "timestamp": "Mon Feb 22 2016 18:59:27 GMT+0530 (India Standard Time)",
      "totalScore": 59
    },
    "-KB89s7SVjq0YZQ5XrOt": {
      "attempts": 0,
      "isCorrect": true,
      "questionIndex": 2,
      "score": 59,
      "timeBonus": 95,
      "timestamp": "Mon Feb 22 2016 18:59:35 GMT+0530 (India Standard Time)",
      "totalScore": 118
    },
    "-KB89xsifcqbxCDE9Ygt": {
      "attempts": 0,
      "isCorrect": true,
      "questionIndex": 3,
      "score": 78,
      "timeBonus": 93,
      "timestamp": "Mon Feb 22 2016 18:59:59 GMT+0530 (India Standard Time)",
      "totalScore": 196
    },
    "-KB8A-85L2P9nGmii3L_": {
      "attempts": 1,
      "isCorrect": "The output value should be your name",
      "questionIndex": 4,
      "score": 0,
      "timeBonus": 95,
      "timestamp": "Mon Feb 22 2016 19:00:08 GMT+0530 (India Standard Time)",
      "totalScore": 196
    },
    "-KB8A1JTmwee2a5gRkLq": {
      "attempts": 1,
      "isCorrect": true,
      "questionIndex": 4,
      "score": 85,
      "timeBonus": 87,
      "timestamp": "Mon Feb 22 2016 19:00:17 GMT+0530 (India Standard Time)",
      "totalScore": 281
    }
  },
  "AKD": {
    "-KB88bypMqEu3Y2zOEKD": {
      "attempts": 0,
      "isCorrect": true,
      "questionIndex": 0,
      "score": 19,
      "timeBonus": 90,
      "timestamp": "Mon Feb 22 2016 18:54:07 GMT+0530 (India Standard Time)",
      "totalScore": 19
    },
    "-KB88grpZUs3j3J4SNWc": {
      "attempts": 0,
      "isCorrect": true,
      "questionIndex": 1,
      "score": 37,
      "timeBonus": 85,
      "timestamp": "Mon Feb 22 2016 18:54:27 GMT+0530 (India Standard Time)",
      "totalScore": 56
    },
    "-KB88jYngOMf7chyXUch": {
      "attempts": 0,
      "isCorrect": true,
      "questionIndex": 2,
      "score": 58,
      "timeBonus": 91,
      "timestamp": "Mon Feb 22 2016 18:54:38 GMT+0530 (India Standard Time)",
      "totalScore": 114
    },
    "-KB88mtpsbwp6GD4VQ0f": {
      "attempts": 0,
      "isCorrect": true,
      "questionIndex": 3,
      "score": 76,
      "timeBonus": 89,
      "timestamp": "Mon Feb 22 2016 18:54:52 GMT+0530 (India Standard Time)",
      "totalScore": 190
    },
    "-KB88z1nL0Ckhn5A5Lsq": {
      "attempts": 1,
      "isCorrect": "The output value should be your name",
      "questionIndex": 4,
      "score": 0,
      "timeBonus": 60,
      "timestamp": "Mon Feb 22 2016 18:55:41 GMT+0530 (India Standard Time)",
      "totalScore": 190
    }
  }
}
Required dataframe:
User | TimeStamp | Question Index | Attempts | TimeBonus | isCorrect | Score | Total
Thanks.
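The required dataframe is a two-level flatten: the outer names (AG2, AKD) become the User column, and each inner keyed record becomes one row. A sketch of that logic, shown in Python for concreteness with an abbreviated copy of the sample data (in R, the analogous approach is to loop over the nested list and bind the pieces row-wise):

```python
import pandas as pd

# Abbreviated version of the nested structure from the sample above.
data = {
    "AG2": {
        "-KB89nmaAHtjvIHl6X6I": {"attempts": 0, "isCorrect": True, "questionIndex": 0,
                                 "score": 20, "timeBonus": 95, "totalScore": 20},
    },
    "AKD": {
        "-KB88bypMqEu3Y2zOEKD": {"attempts": 0, "isCorrect": True, "questionIndex": 0,
                                 "score": 19, "timeBonus": 90, "totalScore": 19},
    },
}

rows = []
for user, attempts in data.items():      # outer level: user id
    for key, rec in attempts.items():    # inner level: one attempt record
        rows.append({"User": user, **rec})

df = pd.DataFrame(rows)
print(df[["User", "questionIndex", "score", "totalScore"]])
```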

Converting R dataframes to a JSON object

Say I have the following dataframes:
df1 <- data.frame(Name = c("Harry","George"), color=c("#EA0001", "#EEEEEE"))
Name color
1 Harry #EA0001
2 George #EEEEEE
df.details <- data.frame(Name = c(rep("Harry", each = 3), rep("George", each = 3)),
                         age = 21:23,
                         total = c(14, 19, 24, 1, 9, 4))
Name age total
1 Harry 21 14
2 Harry 22 19
3 Harry 23 24
4 George 21 1
5 George 22 9
6 George 23 4
I know how to convert each df to json like this:
library(jsonlite)
toJSON(df.details)
[{"Name":"Harry","age":21,"total":14},{"Name":"Harry","age":22,"total":19},{"Name":"Harry","age":23,"total":24},{"Name":"George","age":21,"total":1},{"Name":"George","age":22,"total":9},{"Name":"George","age":23,"total":4}]
However, I am looking to get the following structure to my JSON data:
{
  "myjsondata": [
    {
      "Name": "Harry",
      "color": "#EA0001",
      "details": [
        { "age": 21, "total": 14 },
        { "age": 22, "total": 19 },
        { "age": 23, "total": 24 }
      ]
    },
    {
      "Name": "George",
      "color": "#EEEEEE",
      "details": [
        { "age": 21, "total": 1 },
        { "age": 22, "total": 9 },
        { "age": 23, "total": 4 }
      ]
    }
  ]
}
I think the answer may be in how I store the data in a list in R before converting, but not sure.
Try this format:
df1$details <- split(df.details[-1], df.details$Name)[df1$Name]
df1
# Name color details
#1 Harry #EA0001 21, 22, 23, 14, 19, 24
#2 George #EEEEEE 21, 22, 23, 1, 9, 4
toJSON(df1)
#[{
#"Name":"Harry",
#"color":"#EA0001",
#"details":[
# {"age":21,"total":14},
# {"age":22,"total":19},
# {"age":23,"total":24}]},
#{
#"Name":"George",
#"color":"#EEEEEE",
#"details":[
# {"age":21,"total":1},
# {"age":22,"total":9},
# {"age":23,"total":4}]}
#]
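For comparison, the same nest-then-serialize pattern sketched in Python: attach each parent's detail rows as a nested list before dumping, which yields the requested structure. The dict literals below are a hand-copied subset of the sample data:

```python
import json

df1 = [{"Name": "Harry", "color": "#EA0001"},
       {"Name": "George", "color": "#EEEEEE"}]
details = [
    {"Name": "Harry", "age": 21, "total": 14}, {"Name": "Harry", "age": 22, "total": 19},
    {"Name": "Harry", "age": 23, "total": 24}, {"Name": "George", "age": 21, "total": 1},
    {"Name": "George", "age": 22, "total": 9}, {"Name": "George", "age": 23, "total": 4},
]

for person in df1:
    # Attach each person's detail rows, dropping the redundant Name key.
    person["details"] = [{"age": d["age"], "total": d["total"]}
                         for d in details if d["Name"] == person["Name"]]

out = json.dumps({"myjsondata": df1}, indent=1)
print(out)
```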

Constructing request payload in R using rjson/jsonlite

My current code, shown below, attempts to construct a request payload (body), but it isn't giving me the desired result.
library(df2json)
library(rjson)

y = rjson::fromJSON(df2json::df2json(dataframe))
globalparam = ""
req = list(
  Inputs = list(
    input1 = y
  ),
  GlobalParameters = paste("{", globalparam, "}", sep = "")
)
body = enc2utf8(rjson::toJSON(req))
body currently turns out to be
{
  "Inputs": {
    "input1": [
      {
        "X": 7,
        "Y": 5,
        "month": "mar",
        "day": "fri",
        "FFMC": 86.2,
        "DMC": 26.2,
        "DC": 94.3,
        "ISI": 5.1,
        "temp": 8.2,
        "RH": 51,
        "wind": 6.7,
        "rain": 0,
        "area": 0
      }
    ]
  },
  "GlobalParameters": "{}"
}
However, I need it to look like this:
{
  "Inputs": {
    "input1": [
      {
        "X": 7,
        "Y": 5,
        "month": "mar",
        "day": "fri",
        "FFMC": 86.2,
        "DMC": 26.2,
        "DC": 94.3,
        "ISI": 5.1,
        "temp": 8.2,
        "RH": 51,
        "wind": 6.7,
        "rain": 0,
        "area": 0
      }
    ]
  },
  "GlobalParameters": {}
}
So basically GlobalParameters has to come out as a bare {}, not a hardcoded string. It seemed like a fairly simple problem, but I couldn't fix it. Please help!
EDIT:
This is the dataframe
X Y month day FFMC DMC DC ISI temp RH wind rain area
1 7 5 mar fri 86.2 26.2 94.3 5.1 8.2 51 6.7 0.0 0
2 7 4 oct tue 90.6 35.4 669.1 6.7 18.0 33 0.9 0.0 0
3 7 4 oct sat 90.6 43.7 686.9 6.7 14.6 33 1.3 0.0 0
4 8 6 mar fri 91.7 33.3 77.5 9.0 8.3 97 4.0 0.2 0
This is an example of another data frame
> a = data.frame("col1" = c(81, 81, 81, 81), "col2" = c(72, 69, 79, 84))
Using this sample data:

dd <- read.table(text = "X Y month day FFMC DMC DC ISI temp RH wind rain area
1 7 5 mar fri 86.2 26.2 94.3 5.1 8.2 51 6.7 0.0 0", header = TRUE)

You can do:

globalparam = setNames(list(), character(0))
req = list(
  Inputs = list(
    input1 = dd
  ),
  GlobalParameters = globalparam
)
body = enc2utf8(rjson::toJSON(req))

Note that globalparam looks a bit funny because we need to force it to a named (but empty) list so that rjson serializes it as {} rather than []. We only have to do this when it's empty.
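The underlying distinction is common to JSON serializers: an empty keyed container must come out as {}, while an empty array comes out as []. For comparison, Python's json module draws the same line between an empty dict and an empty list:

```python
import json

# Illustrative payload mirroring the shape above.
payload = {
    "Inputs": {"input1": [{"X": 7, "Y": 5}]},
    "GlobalParameters": {},  # empty dict -> serialized as {}
}
print(json.dumps(payload))
# An empty list would instead serialize as [].
print(json.dumps([]))
```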