Parse data from a complex dictionary in Python - json

I'm brand new to programming and have taken several days trying to solve this problem. I have the following response from an API that I would like to reformat so I can read it better, and use pieces of the data in future codes.
I'm using the following code
import requests
import pandas
for item in items: #item is defined previously in the code
response = request.get(url, parameters={put stuff here}, headers={more stuff here})
json_response = response.json()
print(item)
print(json_response)
output:
item1
{'series':{'data':[{'keyA':'value1', 'keyB:value2, 'keyC':value3}, {'keyA':'value4', 'keyB:value5, 'keyC':value6}, {'keyA':'value7', 'keyB:value8, 'keyC':value9}]}}
item2
{'series':{'data':[{'keyA':'value10', 'keyB:value11, 'keyC':value12}, {'keyA':'value13', 'keyB:value14, 'keyC':value15}, {'keyA':'value16', 'keyB:value17, 'keyC':value18}]}}
I'd like to get the output to look like this:
item1
keyA keyB keyC
value1 value2 value3
value4 value5 value6
value7 value8 value9
item2
keyA keyB keyC
value10 value11 value12
value13 value14 value15
value16 value17 value18
I've tried several pandas and numpy codes, but I can't find anything that works. Everything I try results in a multitude of errors.

To make a dataframe from item you can do:
item = {
"series": {
"data": [
{"keyA": "value1", "keyB": "value2", "keyC": "value3"},
{"keyA": "value4", "keyB": "value5", "keyC": "value6"},
{"keyA": "value7", "keyB": "value8", "keyC": "value9"},
]
}
}
df = pd.DataFrame(item["series"]["data"])
print(df)
Prints:
keyA keyB keyC
0 value1 value2 value3
1 value4 value5 value6
2 value7 value8 value9
With your pseudo-code:
for item in items:
response = request.get(url, parameters={put stuff here}, headers={more stuff here})
json_response = response.json()
df = pd.DataFrame(json_response["series"]["data"])
print(df)

Related

Read complex JSON to extract key values

I have a JSON and I'm trying to read part of it to extract keys and values.
Assuming response is my JSON data, here is my code:
data_dump = json.dumps(response)
data = json.loads(data_dump)
Here my data object becomes a list and I'm trying to get the keys as below
id = [key for key in data.keys()]
This fails with the error:
A list object does not have an attribute keys**. How can I get over this to get my below output?
Here is my JSON:
{
"1": {
"task": [
"wakeup",
"getready"
]
},
"2": {
"task": [
"brush",
"shower"
]
},
"3": {
"task": [
"brush",
"shower"
]
},
"activites": ["standup", "play", "sitdown"],
"statuscheck": {
"time": 60,
"color": 1002,
"change(me)": 9898
},
"action": ["1", "2", "3", "4"]
}
The output I need is as below. I do not need data from the rest of JSON.
id
task
1
wakeup, getready
2
brush , shower
If you know that the keys you need are "1" and "2", you could try reading the JSON string as a dataframe, unpivoting it, exploding and grouping:
from pyspark.sql import functions as F
df = (spark.read.json(sc.parallelize([data_dump]))
.selectExpr("stack(2, '1', `1`, '2', `2`) (id, task)")
.withColumn('task', F.explode('task.task'))
.groupBy('id').agg(F.collect_list('task').alias('task'))
)
df.show()
# +---+------------------+
# | id| task|
# +---+------------------+
# | 1|[wakeup, getready]|
# | 2| [brush, shower]|
# +---+------------------+
However, it may be easier to deal with it in Python:
data = json.loads(data_dump)
data2 = [(k, v['task']) for k, v in data.items() if k in ['1', '2']]
df = spark.createDataFrame(data2, ['id', 'task'])
df.show()
# +---+------------------+
# | id| task|
# +---+------------------+
# | 1|[wakeup, getready]|
# | 2| [brush, shower]|
# +---+------------------+

Convert json to long format table using jq

Given an array of JSON objects, all having the same key names (key1, key2, key3) and just one key (key3) whose value is an array, how can it be converted to a long format table?
Input:
[
{ "key1": "A",
"key2": 1,
"key3" : ["aaa", "bbb"]
},
{ "key1": "B",
"key2": 2,
"key3" : ["ccc", "ddd"]
}
]
Desired output:
key1
key2
key3
A
1
aaa
A
1
bbb
B
2
ccc
B
2
ddd
.[]| ([.key1,.key2] + (.key3[]|[.])) | #csv

read pretty json format data through spark

We read data present in hour format present in S3 through spark in scala.For example,
spark.read.textFile("s3://'Bucket'/'key'/'yyyy'/'MM'/'dd'/'hh'/*").
spark.read.textFile reads records one line at a time so for example records that are present in jsonLines(full json data in one line) are read and can be parsed later to retrieve data from json.
Now,I have to read data which is having multiple json but in pretty format instead of json lines.Using same strategy gives corrupt record error.For example Dataset[String] obtained after reading through spark.read.textFile:
{
"a": 1,
"b": 2
}
is
_corrupt_record|
+---------------+
| {|
| "a": 1, |
| "b": 2|
| }|
Input data :
{
"key1": "value1",
"key2": "value2"
}
{
"key1": "value1",
"key2": "value2"
}
ExpectedOutput
+------+------+
|key1 |key2 |
+------+------+
|value1|value2|
|value1|value2|
+------+------+
This file has multiple pretty formatted json with delimiter between records as newline.
Approaches already used
spark.read.option("multiline", "true").json("") .This will not work as multiline requires data to be present in form of [{},{}].
Approach working
val x=sparkSession
.read
.json(sc
.wholeTextFiles(filePath)
.values
.flatMap(x=> {x
.replace("\n", "")
.replace("}{", "}}{{")
.split("\\}\\{")}))
I just wanted to ask if there is a better approach as the above solution is doing some slice and dice on data which might lead to performance issue for large data?Thanks
This can be a working solution for you, use from_json() and correct schema in order to parse a json correctly
Create the dataframe here
df = spark.createDataFrame([(str([{"key1":"value1","key2":"value2"}, {"key1": "value3", "key2": "value4"}]))],T.StringType())
df.show(truncate=False)
+----------------------------------------------------------------------------+
|value |
+----------------------------------------------------------------------------+
|[{'key1': 'value1', 'key2': 'value2'}, {'key1': 'value3', 'key2': 'value4'}]|
+----------------------------------------------------------------------------+
Now, use explode() as the value/json column is a list in order to map correctly
And, finally use getItem() to extract the columns
df = df.withColumn('col', F.from_json("value", T.ArrayType(T.StringType())))
df = df.withColumn("col", F.explode("col"))
df = df.withColumn("col", F.from_json("col", T.MapType(T.StringType(), T.StringType())))
df = df.withColumn("key1", df.col.getItem("key1")).withColumn("key2", df.col.getItem("key2"))
+----------------------------------------------------------------------------+--------------------------------+------+------+
|value |col |key1 |key2 |
+----------------------------------------------------------------------------+--------------------------------+------+------+
|[{'key1': 'value1', 'key2': 'value2'}, {'key1': 'value3', 'key2': 'value4'}]|[key1 -> value1, key2 -> value2]|value1|value2|
|[{'key1': 'value1', 'key2': 'value2'}, {'key1': 'value3', 'key2': 'value4'}]|[key1 -> value3, key2 -> value4]|value3|value4|
+----------------------------------------------------------------------------+--------------------------------+------+------+
df.show(truncate=False)

Double Nested JSON to DF

I'm unable to make this JSON:
{
“profiles”: {
“1”: {
“id”: “1”,
“property1”: “value1”,
“property2”: “value2”
},
“2”: {
“id”: “2”,
“property1”: “value21”,
“property2”: “value22”
}
}}
To this format
Desired output
Id Property1 Property2
1 Value1 Value2
2 Value21 Value22
I've attempted different approaches, that just result in one col all data.
Can someone please orient me on this?
Based on this example:
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
pd.DataFrame.from_dict(data)
col_1 col_2
0 3 a
1 2 b
2 1 c
3 0 d
I would suggest something like:
your_json = {<your_json>}
property1 = []
property2 = []
for key, value in your_json.items():
for k, v in value.items():
property1.append(v['property1'])
property2.append(v['property2'])
data = {'property1': property1, 'property2': property2}
tt = pd.DataFrame.from_dict(data)
print(tt)

Create DF/RDD from nested other DF/RDD (Nested Json) in Spark

I'm a total newbie in Spark&Scala stuff, it would be great if someone could explain this to me.
Let's take following JSON
{
"id": 1,
"persons": [{
"name": "n1",
"lastname": "l1",
"hobbies": [{
"name": "h1",
"activity": "a1"
},
{
"name": "h2",
"activity": "a2"
}]
},
{
"name": "n2",
"lastname": "l2",
"hobbies": [{
"name": "h3",
"activity": "a3"
},
{
"name": "h4",
"activity": "a4"
}]
}]
}
I'm loading this Json to RDD via sc.parralelize(file.json) and to DF via sqlContext.sql.load.json(file.json). So far so good, this gives me RDD and DF (with schema) for mentioned Json, but I want to create annother RDD/DF from existing one that contains all distinct "hobbies" records. How can I achieve sth like that?
The only things I get from my operations are multiple WrappedArrays for Hobbies but I cannot go deeper nor assign them to DF/RDD.
Code for SqlContext I have so far
val jsonData = sqlContext.read.json("path/file.json")
jsonData.registerTempTable("jsonData") //I receive schema for whole file
val hobbies = sqlContext.sql("SELECT persons.hobbies FROM jasonData") //subschema for hobbies
hobbies.show()
That leaves me with
+--------------------+
| hobbies|
+--------------------+
|[WrappedArray([a1...|
+--------------------+
What I expect is more like:
+--------------------+-----------------+
| name | activity |
+--------------------+-----------------|
| h1| a1 |
+--------------------+-----------------+
| h2| a2 |
+--------------------+-----------------+
| h3| a3 |
+--------------------+-----------------+
| h4| a4 |
+--------------------+-----------------+
I loaded your example into the dataframe hobbies exactly as you do it and worked with it. You could run something like the following:
val distinctHobbies = hobbies.rdd.flatMap {row => row.getSeq[List[Row]](0).flatten}.map(row => (row.getString(0), row.getString(1))).distinct
val dhDF = distinctHobbies.toDF("activity", "name")
This essentially flattens your hobbies struct, transforms it into a tuple, and runs a distinct on the returned tuples. We then turn it back into a dataframe under the correct column aliases. Because we are doing this through the underlying RDD, there may also be a more efficient way to do it using just the DataFrame API.
Regardless, when I run on your example, I see:
scala> val distinctHobbies = hobbies.rdd.flatMap {row => row.getSeq[List[Row]](0).flatten}.map(row => (row.getString(0), row.getString(1))).distinct
distinctHobbies: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[121] at distinct at <console>:24
scala> val dhDF = distinctHobbies.toDF("activity", "name")
dhDF: org.apache.spark.sql.DataFrame = [activity: string, name: string]
scala> dhDF.show
...
+--------+----+
|activity|name|
+--------+----+
| a2| h2|
| a1| h1|
| a3| h3|
| a4| h4|
+--------+----+