Need Python code to parse JSON in a specific format to expand JSON data in a column of a pandas DataFrame [duplicate] - json

I am trying to use json_normalize to format the output of an API, but I keep getting a faulty, empty CSV file. I tried changing the call to df2 = pd.json_normalize(response, record_path=['LIST']), but I keep getting this error message:
TypeError: byte indices must be integers or slices, not str
Could you please guide me on what I am doing wrong?
Thanks a lot!
import requests
import json
import pandas as pd
url = "https://*hidden*Results/"
payload = json.dumps({
    "id": 12345
})
headers = {
    'Authorization': 'Basic *hidden*',
    'Content-Type': 'application/json'
}
response = requests.request("POST", url, headers=headers, data=payload)
df1 = pd.DataFrame(response).iloc[:,:-2]
df2 = pd.json_normalize(response, record_path=None)
df = pd.concat([df1, df2], axis=1)
df.to_csv("test.csv", index=False)

You are passing the variable response in the call:
df2 = pd.json_normalize(response, record_path=None)
which is a requests.models.Response object, but you need to pass a dict, so you should do something like pd.json_normalize(response.json(), record_path=['LIST']).
I tried it with this example and it works:
>>> import pandas as pd
>>> data = [
... {
... "state": "Florida",
... "shortname": "FL",
... "info": {"governor": "Rick Scott"},
... "counties": [
... {"name": "Dade", "population": 12345},
... {"name": "Broward", "population": 40000},
... {"name": "Palm Beach", "population": 60000},
... ],
... },
... {
... "state": "Ohio",
... "shortname": "OH",
... "info": {"governor": "John Kasich"},
... "counties": [
... {"name": "Summit", "population": 1234},
... {"name": "Cuyahoga", "population": 1337},
... ],
... },
... ]
>>> result = pd.json_normalize(data, ["counties"])
>>> result
         name  population
0        Dade       12345
1     Broward       40000
2  Palm Beach       60000
3      Summit        1234
4    Cuyahoga        1337
EDIT: I would try this:
import requests
import json
import pandas as pd
url = "https://*hidden*Results/"
payload = json.dumps({
    "id": 12345
})
headers = {
    'Authorization': 'Basic *hidden*',
    'Content-Type': 'application/json'
}
response = requests.request("POST", url, headers=headers, data=payload)
json_response = response.json()
df1 = pd.DataFrame(json_response).iloc[:,:-2]
df2 = pd.json_normalize(json_response, record_path=['LIST'])
df = pd.concat([df1, df2], axis=1)
df.to_csv("test.csv", index=False)
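Since the real API response is hidden here, 'LIST' is only an assumed record key; it may be worth dumping the top-level structure first to confirm which key actually holds the records:
# inspect the parsed payload before normalizing (key names are assumptions)
print(list(json_response.keys()))
print(json.dumps(json_response, indent=2)[:500])  # first 500 chars is enough to see the shape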

Related

json_normalize does not read all data

I have a json file that I want to flatten, retrieving all the information into a pandas dataframe. The json file looks like this:
jsonstr = {
    "calculation": {
        "id": "3k3k3k3kwk3kwk",
        "Id": 23,
        "submissionDate": 1622428064679,
        "serverVersion": "3.3.5.6.r",
        "tag": [
            {
                "code": "qq4059331155113278",
                "manual": {
                    "location": {
                        "x": 26.5717,
                        "y": 59.4313,
                        "z": 0.0,
                        "floor": 0
                    },
                    "timestamp": 1599486138000
                },
                "device": null,
                "measurements": [
                    {
                        "Address": "D_333",
                        "subcell": "",
                        "frequency": 14.0,
                        "dfId": 0
                    },
                    {
                        "trxAddress": "D_334",
                        "subcell": "",
                        "frequency": 11.0,
                        "dfId": 0
                    }
                ]
            }
        ]
    }
}
Now, as usual, I do the following. I thought this would return all the "fields", including id, Id, submissionDate, and so on:
import os, json
import pandas as pd
import numpy as np
import glob

pd.set_option('display.max_columns', None)
file = './Testjson.json'
#file = './jsondumps/ff80818178f93bd90179ab51781e1c95.json'
with open(file) as json_string:
    jsonstr = json.load(json_string)
labels = pd.json_normalize(jsonstr, record_path=['calculation','tag'])
But in fact, it returns:
code device \
0 qq4059331155113278 None
measurements manual.location.x \
0 [{'Address': 'D_333', 'subcell': '', 'frequenc... 26.5717
manual.location.y manual.location.z manual.location.floor \
0 59.4313 0.0 0
manual.timestamp
0 1599486138000
and trying the following
labels = pd.json_normalize(jsonstr, record_path=['calculation','tag'], meta=['id', 'Id'])
returns an error:
KeyError: 'id'
which makes sense. But what am I doing wrong to begin with? Why can I not get all the fields under calculation, since they are in the path?
Grateful for any insights!
Your syntax is slightly off on the meta argument; with the call below, calculation.id and calculation.Id end up at the end of the dataframe.
If you are looking to flatten the entire JSON, look into flatten_json (a minimal sketch follows the output below). It's a pretty good library to use with nested JSON.
pd.json_normalize(jsonstr, record_path=['calculation','tag'], meta=[['calculation','id'],['calculation','Id']])
code device measurements manual.location.x manual.location.y manual.location.z manual.location.floor manual.timestamp calculation.id calculation.Id
0 qq4059331155113278 null [{'Address': 'D_333', 'subcell': '', 'frequenc... 26.5717 59.4313 0.0 0 1599486138000 3k3k3k3kwk3kwk 23
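For reference, a minimal flatten_json sketch (assuming pip install flatten_json; the underscore-joined key names are its defaults):
from flatten_json import flatten

# flatten() collapses the nested dict into one flat dict; list elements
# get their index embedded in the key, e.g. calculation_tag_0_code
flat = flatten(jsonstr)
df = pd.json_normalize(flat)  # one wide row, every leaf value as a column
print(df.columns.tolist())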

Combine multiple JSON files, and parse into CSV

I have about 100 JSON files, all titled with different dates and I need to merge them into one CSV file that has headers "date", "real_name", "text".
There are no dates listed in the JSON itself, and the real_name is nested. I haven't worked with JSON in a while and am a little lost.
The basic structure of the JSON looks more or less like this:
Filename: 2021-01-18.json
[
    {
        "client_msg_id": "xxxx",
        "type": "message",
        "text": "THIS IS THE TEXT I WANT TO PULL",
        "user": "XXX",
        "user_profile": {
            "first_name": "XXX",
            "real_name": "THIS IS THE NAME I WANT TO PULL",
            "display_name": "XXX",
            "is_restricted": false,
            "is_ultra_restricted": false
        },
        "blocks": [
            {
                "type": "rich_text",
                "block_id": "yf=A9"
            }
        ]
    }
]
So far I have
import json
import glob

read_files = glob.glob("*.json")
output_list = []
all_items = []
for f in read_files:
    with open(f, "rb") as infile:
        output_list.append(json.load(infile))
    data = {}
    for obj in output_list:
        data['date'] = f
        data['text'] = 'text'
        data['real_name'] = 'real_name'
        all_items.append(data)
Once you've read the JSON object, just index into the dictionaries for the data. You might need obj[0]['text'], etc., if your JSON data really is a list in each file, but that seems odd, and I'm assuming your pasted data came from output_list after you'd collected it. So, assuming your file content is exactly like below:
{
    "client_msg_id": "xxxx",
    "type": "message",
    "text": "THIS IS THE TEXT I WANT TO PULL",
    "user": "XXX",
    "user_profile": {
        "first_name": "XXX",
        "real_name": "THIS IS THE NAME I WANT TO PULL",
        "display_name": "XXX",
        "is_restricted": false,
        "is_ultra_restricted": false
    },
    "blocks": [
        {
            "type": "rich_text",
            "block_id": "yf=A9"
        }
    ]
}
test.py:
import json
import glob
from pathlib import Path

read_files = glob.glob("*.json")
all_items = []
for f in read_files:
    with open(f, "rb") as infile:
        obj = json.load(infile)
    # build one fresh record per file; the date comes from the filename stem
    data = {}
    data['date'] = Path(f).stem
    data['text'] = obj['text']
    data['real_name'] = obj['user_profile']['real_name']
    all_items.append(data)
print(all_items)
Output:
[{'date': '2021-01-18', 'text': 'THIS IS THE TEXT I WANT TO PULL', 'real_name': 'THIS IS THE NAME I WANT TO PULL'}]
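To get from all_items to the CSV the question asks for, a minimal sketch using the standard csv module (the output filename is an assumption):
import csv

with open("combined.csv", "w", newline="", encoding="utf-8") as outfile:
    writer = csv.DictWriter(outfile, fieldnames=["date", "real_name", "text"])
    writer.writeheader()
    writer.writerows(all_items)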

Flatten JSON data to individual columns

I am working on Twitter streaming data and I am getting output like this:
"data": {
"author_id": "1318123716522479616",
"created_at": "2020-11-05T04:18:21.000Z",
"entities": {
"hashtags": [
{
"end": 107,
"start": 86,
"tag": "MilliHesaplarYanyana"
}
],
"mentions": [
{
"end": 15,
"start": 3,
"username": "MilliTaakip"
}
]
},
"id": "1324204381177323520",
"lang": "tr",
"text": "RT #MilliTaakip: Milli hesaplar\u0131m\u0131z\u0131n g\u00fc\u00e7lenmesi i\u00e7in\nCumhurba\u015fkan\u0131m\u0131z\u0131n talimat\u0131yla,\n#MilliHesaplarYanyana \u00e7al\u0131\u015fmas\u0131n\u0131 destekliyoruz;\n\n\ud83c\uddf9\ud83c\uddf7\u2026"
}
}
I want to extract specific information, like the hashtags, from this data and store it in my database.
I tried multiple approaches, like json_normalize and flatten_json, but it does not work; the output is not what I need.
Here's my code:
def connect_to_endpoint(url, headers):
    response = requests.request("GET", url, headers=headers, stream=True, params=payload)
    print(response.status_code)
    for response_line in response.iter_lines():
        if response_line:
            # print(ndjson.dumps(json_response["data"]["text"], indent=4, sort_keys=True))
            conn = psycopg2.connect(database="tweetData", user="postgres", password="pass", host="localhost", port="5432")
            cur = conn.cursor()
            try:
                data = json.loads(response_line.decode('utf-8'))
                index = 0
                # for created at
                var2 = json.loads(response_line.decode('utf-8'))["data"]["text"]
                # define a list of keywords
                keywords = ('biden', 'election', 'trump', 'stocks')
                if any(keyword in var2.lower() for keyword in keywords):
                    df = pd.json_normalize(data)
                    dffinal = pd.DataFrame(df)
                    engine = create_engine('postgresql+psycopg2://postgres:root#localhost:5432/tweetData')
                    dffinal.to_sql("new-tweets", engine, if_exists='append', dtype={'relevant_column': sqlalchemy.types.JSON})
                    print("loaded")
                else:
                    print("none")
                conn.commit()
                index += 1
                cur.close()
            except IOError as io:
                print("ERROR!")
    if response.status_code != 200:
        raise Exception(
            "Request returned an error: {} {}".format(
                response.status_code, response.text
            )
        )
Please advise on how I should proceed and what errors there are in my approach.
EDIT:
Whenever the tweet data has no entities or no hashtags, the code raises an error saying KeyError: 'entities'.
In PostgreSQL you could use
SELECT value ->> 'tag'
FROM jsonb_array_elements(your_json #> '{data,entities,hashtags}') AS x(value);
to extract the tags.
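On the Python side, the KeyError from the EDIT can be avoided by chaining dict.get() with defaults before anything touches the database; a minimal sketch against the sample payload above:
def extract_hashtags(tweet):
    # .get() with a default dict/list never raises KeyError when
    # "entities" or "hashtags" is absent from a particular tweet
    hashtags = tweet.get("data", {}).get("entities", {}).get("hashtags", [])
    return [h["tag"] for h in hashtags]

# with the sample above this returns ['MilliHesaplarYanyana']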

TypeError: list indices must be integers or slices, not str JSON Scrapy

I was scraping a JSON response but got the following error:
values = resp['acf']
TypeError: list indices must be integers or slices, not str
I am not sure what I did wrong.
Your response is highly appreciated.
# -*- coding: utf-8 -*-
import scrapy
import json

class MainSpider(scrapy.Spider):
    name = 'main'
    start_urls = 'https://chamber.vinylagency.com/wp-json/wp/v2/directory?industry-type=547&per_page=100'

    def parse(self, response):
        resp = json.loads(response.body)
        values = resp['acf']
        for value in values:
            name = value['OrgName']
            yield {
                "Name": name,
            }
The exception is raised because the response is a list of objects and you are trying to access it as a dict directly.
Here is a sample of the response:
[
{
"id": 33286,
"date": "2020-05-09T02:38:47",
"date_gmt": "2020-05-09T02:38:47",
"guid":
...
},
{
"id": 32954,
"date": "2020-05-09T02:38:22",
"date_gmt": "2020-05-09T02:38:22",
"guid":
...
}
]
You probably want to parse it like this:
def parse(self, response):
    resp = json.loads(response.body)
    # the top-level JSON is a list, so iterate it directly
    for value in resp:
        name = value['acf']['OrgName']
        yield {
            "Name": name,
        }
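As an aside, newer Scrapy versions (2.2 and later) expose a json() helper on text responses, so the explicit json.loads(response.body) can likely be shortened to:
resp = response.json()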

Join nested JSON dataframe and another dataframe

I am trying to join dataframe1, generated from the JSON below, with dataframe2 on the field order_id, and then assign the "status" from dataframe2 to the "status" of dataframe1. Does anyone know how to do this? Many thanks for your help.
dataframe1
[{
    "client_id": 1,
    "name": "Test01",
    "olist": [{
            "order_id": 10000,
            "order_dt_tm": "2012-12-01",
            "status": ""   <== use "status" from dataframe2 to populate this field
        },
        {
            "order_id": 10000,
            "order_dt_tm": "2012-12-01",
            "status": ""
        }
    ]
},
{
    "client_id": 2,
    "name": "Test02",
    "olist": [{
            "order_id": 10002,
            "order_dt_tm": "2012-12-01",
            "status": ""
        },
        {
            "order_id": 10003,
            "order_dt_tm": "2012-12-01",
            "status": ""
        }
    ]
}
]
dataframe2
order_id status
10002 "Delivered"
10001 "Ordered"
Here is your raw dataset as a json string:
d = """[{
"client_id": 1,
"name": "Test01",
"olist": [{
"order_id": 10000,
"order_dt_tm": "2012-12-01",
"status": ""
},
{
"order_id": 10000,
"order_dt_tm": "2012-12-01",
"status": ""
}
]
},
{
"client_id": 2,
"name": "Test02",
"olist": [{
"order_id": 10002,
"order_dt_tm": "2012-12-01",
"status": ""
},
{
"order_id": 10003,
"order_dt_tm": "2012-12-01",
"status": ""
}
]
}
]"""
Firstly, I would load it as json:
import json
data = json.loads(d)
Then, I would turn it into a pandas dataframe; notice that I drop the status field, as it will be populated by the join step:
import pandas as pd

df1 = pd.json_normalize(data, 'olist')[['order_id', 'order_dt_tm']]
Then, from the second dataframe sample, I would do a left join using the merge function:
data = {'order_id':[10002, 10001],'status':['Delivered', 'Ordered']}
df2 = pd.DataFrame(data)
result = df1.merge(df2, on='order_id', how='left')
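The merged result should look roughly like this (order_ids 10000 and 10003 have no match in df2, hence NaN):
   order_id order_dt_tm     status
0     10000  2012-12-01        NaN
1     10000  2012-12-01        NaN
2     10002  2012-12-01  Delivered
3     10003  2012-12-01        NaN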
Good luck
UPDATE
# JSON to Dataframe
df1 = pd.json_normalize(data)
# Sub JSON to dataframe
df1['sub_df'] = df1['olist'].apply(lambda x: pd.json_normalize(x).drop('status', axis=1))
# Build second dataframe
data2 = {'order_id':[10002, 10001],'status':['Delivered', 'Ordered']}
df2 = pd.DataFrame(data2)
# Populates status in sub dataframes
df1['sub_df'] = df1['sub_df'].apply(lambda x: x.merge(df2, on='order_id', how='left').fillna(''))
# Sub dataframes back to JSON
def back_to_json_str(df):
    # turns a df back to a JSON string
    return str(df.to_json(orient="records", indent=4))
df1['olist'] = df1['sub_df'].apply(lambda x: back_to_json_str(x))
# Global DF back to JSON string
parsed = str(df1.drop('sub_df', axis=1).to_json(orient="records", indent=4))
parsed = parsed.replace(r'\n', '\n')
parsed = parsed.replace(r'\"', '\"')
# Print result
print(parsed)
UPDATE 2
Here is a way to add an index column to a dataframe:
df1['index'] = [e for e in range(df1.shape[0])]
This is my code for assigning title values from a dataframe back to the JSON object. The assignment operation takes quite a bit of time when the number of records in the JSON object is around 100000. Does anyone know how to improve the performance of this code? Many thanks.
import json
import random
import pandas as pd
import pydash as _
data = [{"pid":1,"name":"Test1","title":""},{"pid":2,"name":"Test2","title":""}] # 5000 records
# dataframe1
df = pd.json_normalize(data)
# dataframe2
pid = [x for x in range(1, 5000)]
title_set = ["Boss", "CEO", "CFO", "PMO", "Team Lead"]
titles = [title_set[random.randrange(0, 5)] for x in range(1, 5000)]
df2 = pd.DataFrame({'pid': pid, 'title': titles})
#left join dataframe1 and dataframe2
df3 = df.merge(df2, on='pid', how='left')
#assign title values from dataframe back to the json object
for row in df3.iterrows():
    idx = _.find_index(data, lambda x: x['pid'] == row[1]['pid'])
    data[idx]['title'] = row[1]['title_y']
print(data)
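One way to cut the quadratic scan (a sketch, not a benchmark): build a pid-to-position lookup dict once, then assign in a single pass; itertuples is also cheaper than iterrows:
# build the lookup once: O(n) total instead of one linear find_index per row
pid_to_pos = {rec['pid']: i for i, rec in enumerate(data)}

for row in df3.itertuples(index=False):
    # merge suffixed the overlapping column as title_y
    data[pid_to_pos[row.pid]]['title'] = row.title_y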