Read complex JSON to extract key values

I have a JSON document and I'm trying to read part of it to extract keys and values.
Assuming response holds my JSON data, here is my code:
data_dump = json.dumps(response)
data = json.loads(data_dump)
Here my data object becomes a list, and I'm trying to get the keys as below:
id = [key for key in data.keys()]
This fails with the error:
'list' object has no attribute 'keys'
How can I get past this to produce the output below?
Here is my JSON:
{
  "1": {
    "task": [
      "wakeup",
      "getready"
    ]
  },
  "2": {
    "task": [
      "brush",
      "shower"
    ]
  },
  "3": {
    "task": [
      "brush",
      "shower"
    ]
  },
  "activites": ["standup", "play", "sitdown"],
  "statuscheck": {
    "time": 60,
    "color": 1002,
    "change(me)": 9898
  },
  "action": ["1", "2", "3", "4"]
}
The output I need is as below; I do not need data from the rest of the JSON.
id    task
1     wakeup, getready
2     brush, shower

If you know that the keys you need are "1" and "2", you could try reading the JSON string as a dataframe, unpivoting it, exploding and grouping:
from pyspark.sql import functions as F

df = (spark.read.json(sc.parallelize([data_dump]))
      .selectExpr("stack(2, '1', `1`, '2', `2`) (id, task)")   # unpivot keys "1" and "2" into rows
      .withColumn('task', F.explode('task.task'))              # flatten each nested task array
      .groupBy('id').agg(F.collect_list('task').alias('task'))
)
df.show()
# +---+------------------+
# | id|              task|
# +---+------------------+
# |  1|[wakeup, getready]|
# |  2|   [brush, shower]|
# +---+------------------+
However, it may be easier to deal with it in Python:
data = json.loads(data_dump)
data2 = [(k, v['task']) for k, v in data.items() if k in ['1', '2']]
df = spark.createDataFrame(data2, ['id', 'task'])
df.show()
# +---+------------------+
# | id|              task|
# +---+------------------+
# |  1|[wakeup, getready]|
# |  2|   [brush, shower]|
# +---+------------------+
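If the keys of interest aren't fixed, a small variation of the same Python approach (a sketch, assuming every numeric top-level key maps to an object with a 'task' list) picks them up dynamically:
data = json.loads(data_dump)
data2 = [(k, v['task']) for k, v in data.items()
         if k.isdigit() and isinstance(v, dict) and 'task' in v]   # keep only numeric keys that carry a task list
df = spark.createDataFrame(data2, ['id', 'task'])
df.show()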

Related

Opening a json column as a string in pyspark schema and working with it

I have a big dataframe whose schema I cannot infer. I have a column where each value could be read as JSON, but I cannot know its full structure in advance (i.e. the keys and values can vary and I do not know what they can be).
I want to read it as a string and work with it, but the format changes in a strange way in the process; here is an example:
from pyspark.sql.types import *
data = [{"ID": 1, "Value": {"a": 12, "b": "test"}},
        {"ID": 2, "Value": {"a": 13, "b": "test2"}}]
df = spark.createDataFrame(data)
# change my schema to open the column as string
schema = df.schema
j = schema.jsonValue()
j["fields"][1] = {"name": "Value", "type": "string", "nullable": True, "metadata": {}}
new_schema = StructType.fromJson(j)
df2 = spark.createDataFrame(data, schema=new_schema)
df2.show()
Gives me
+---+---------------+
| ID|          Value|
+---+---------------+
|  1| {a=12, b=test}|
|  2|{a=13, b=test2}|
+---+---------------+
As you can see, the values in the Value column have lost their quotes and use = instead of :, so I can no longer work with them properly.
How can I turn that back into a StructType or MapType?
Assuming this is your input dataframe:
df2 = spark.createDataFrame([
    (1, "{a=12, b=test}"), (2, "{a=13, b=test2}")
], ["ID", "Value"])
You can use the str_to_map function after removing the braces {} from the string column, like this:
from pyspark.sql import functions as F

df = df2.withColumn(
    "Value",
    F.regexp_replace("Value", "[{}]", "")      # strip the surrounding braces
).withColumn(
    "Value",
    F.expr("str_to_map(Value, ', ', '=')")     # split "k=v" pairs into a map column
)
df.printSchema()
#root
# |-- ID: long (nullable = true)
# |-- Value: map (nullable = true)
# | |-- key: string
# | |-- value: string (valueContainsNull = true)
df.show(truncate=False)
#+---+---------------------+
#|ID |Value                |
#+---+---------------------+
#|1  |{a -> 12, b -> test} |
#|2  |{a -> 13, b -> test2}|
#+---+---------------------+
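Once Value is a map column, individual keys can be pulled out by name, for example (a usage sketch; it assumes the keys a and b are present, and both come back as strings since str_to_map produces map<string,string>):
df.select("ID", F.col("Value")["a"].alias("a"), F.col("Value")["b"].alias("b")).show()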

Explode multiple columns from nested JSON but it is giving extra records

I have a JSON document like below:
{
  "Data": [{
    "Code": "ABC",
    "ID": 123456,
    "Type": "Yes",
    "Geo": "East"
  }, {
    "Code": "XYZ",
    "ID": 987654,
    "Type": "No",
    "Geo": "West"
  }],
  "Total": 2,
  "AggregateResults": null,
  "Errors": null
}
My PySpark sample code:
from pyspark.sql.functions import col, explode

getjsonresponsedata = json.dumps(getjsondata)
jsonDataList.append(getjsonresponsedata)
jsonRDD = sc.parallelize(jsonDataList)
df_Json = spark.read.json(jsonRDD)
display(df_Json.withColumn("Code", explode(col("Data.Code")))
               .withColumn("ID", explode(col("Data.ID")))
               .select('Code', 'ID'))
When I explode the JSON this way, I get the records below (it looks like a cross join):
Code    ID
ABC     123456
ABC     987654
XYZ     123456
XYZ     987654
But I expect records like below:
Code    ID
ABC     123456
XYZ     987654
Could you please help me get the expected result?
You only need to explode the Data column; then you can select fields from the resulting struct column (Code, ID, ...). What duplicates the rows here is that you're exploding the two arrays Data.Code and Data.ID separately.
Try this instead:
import pyspark.sql.functions as F
df_Json.withColumn("Data", F.explode("Data")).select("Data.Code", "Data.Id").show()
#+----+------+
#|Code|    Id|
#+----+------+
#| ABC|123456|
#| XYZ|987654|
#+----+------+
Or use the inline function directly on the Data array:
df_Json.selectExpr("inline(Data)").show()
#+----+----+------+----+
#|Code| Geo|    ID|Type|
#+----+----+------+----+
#| ABC|East|123456| Yes|
#| XYZ|West|987654|  No|
#+----+----+------+----+
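If only the two columns from the question are needed, the inline result can be trimmed in the same chain (a usage sketch):
df_Json.selectExpr("inline(Data)").select("Code", "ID").show()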

How to merge arrays from two files into one array with jq?

I would like to merge two files containing JSON. They each contain an array of JSON objects.
registration.json
[
{ "name": "User1", "registration": "2009-04-18T21:55:40Z" },
{ "name": "User2", "registration": "2010-11-17T15:09:43Z" }
]
useredits.json
[
{ "name": "User1", "editcount": 164 },
{ "name": "User2", "editcount": 150 },
{ "name": "User3", "editcount": 10 }
]
In the ideal scenario, I would like to have the following as a result of the merge operation:
[
{ "name": "User1", "editcount": 164, "registration": "2009-04-18T21:55:40Z" },
{ "name": "User2", "editcount": 150, "registration": "2010-11-17T15:09:43Z" }
]
I have found https://github.com/stedolan/jq/issues/1247#issuecomment-348817802 but I get
jq: error: module not found: jq
jq solution:
jq -s '[ .[0] + .[1] | group_by(.name)[]
| select(length > 1) | add ]' registration.json useredits.json
The output:
[
  {
    "name": "User1",
    "registration": "2009-04-18T21:55:40Z",
    "editcount": 164
  },
  {
    "name": "User2",
    "registration": "2010-11-17T15:09:43Z",
    "editcount": 150
  }
]
Although not strictly answering the question, the command below
jq -s 'flatten | group_by(.name) | map(reduce .[] as $x ({}; . * $x))' \
   registration.json useredits.json
generates this output:
[
{ "name": "User1", "editcount": 164, "registration": "2009-04-18T21:55:40Z" },
{ "name": "User2", "editcount": 150, "registration": "2010-11-17T15:09:43Z" },
{ "name": "User3", "editcount": 10 }
]
Source:
jq - error when merging two JSON files "cannot be multiplied"
The following assumes you have jq 1.5 or later, and that:
- joins.jq as shown below is in the directory ~/.jq/ or the directory ~/.jq/joins/
- there is no file named joins.jq in the pwd
- registration.json has been fixed to make it valid JSON (btw, this can be done by jq itself).
The invocation to use would then be:
jq -s 'include "joins"; joins(.name)' registration.json useredits.json
joins.jq
# joins.jq Version 1 (12-12-2017)

def distinct(s):
  reduce s as $x ({}; .[$x | (type[0:1] + tostring)] = $x)
  | .[];

# Relational Join
# joins/6 provides similar functionality to the SQL INNER JOIN statement:
#   SELECT (Table1|p1), (Table2|p2)
#   FROM Table1
#   INNER JOIN Table2 ON (Table1|filter1) = (Table2|filter2)
# where filter1, filter2, p1 and p2 are filters.
# joins(s1; s2; filter1; filter2; p1; p2)
# s1 and s2 are streams of objects corresponding to rows in Table1 and Table2;
# filter1 and filter2 determine the join criteria;
# p1 and p2 are filters determining the final results.
# Input: ignored
# Output: a stream of distinct pairs [p1, p2]
# Note: items in s1 for which filter1 == null are ignored, otherwise all rows are considered.
#
def joins(s1; s2; filter1; filter2; p1; p2):
  def it: type[0:1] + tostring;
  def ix(s; f):
    reduce s as $x ({}; ($x|f) as $y | if $y == null then . else .[$y|it] += [$x] end);
  # combine two dictionaries using the cartesian product of distinct elements
  def merge:
    .[0] as $d1 | .[1] as $d2
    | ($d1|keys_unsorted[]) as $k
    | if $d2[$k] then distinct($d1[$k][]|p1) as $a | distinct($d2[$k][]|p2) as $b | [$a, $b]
      else empty
      end;
  [ix(s1; filter1), ix(s2; filter2)] | merge;

def joins(s1; s2; filter1; filter2):
  joins(s1; s2; filter1; filter2; .; .) | add;

# Input: an array of two arrays of objects
# Output: a stream of the joined objects
def joins(filter1; filter2):
  joins(.[0][]; .[1][]; filter1; filter2);

# Input: an array of arrays of objects.
# Output: a stream of the joined objects where f defines the join criterion.
def joins(f):
  # j/0 is defined so TCO is applicable
  def j:
    if length < 2 then .[][]
    else [[ joins(.[0][]; .[1][]; f; f) ]] + .[2:] | j
    end;
  j;

Parse JSON Nested SubValues in Powershell to Table

I converted the JSON string to PowerShell objects in v5. The original JSON string is below:
$j = @'
[{
  "id": "1",
  "Members": [
    "A",
    "B",
    "C"
  ]
}, {
  "id": "2",
  "Members": [
    "A",
    "C"
  ]
}, {
  "id": "3",
  "Members": [
    "A",
    "D"
  ]
}]
'@
$json = $j | ConvertFrom-Json
I would like the result set to look like the output below. Eventually I will export it to SQL:
id Members
-- -------
1  A
1  B
1  C
2  A
2  C
3  A
3  D
Try this:
$json | % {
    $id = $_.id
    $_.members | select @{n='id'; e={$id}}, @{n='members'; e={$_}}
}

Create DF/RDD from nested other DF/RDD (Nested Json) in Spark

I'm a total newbie with Spark & Scala, so it would be great if someone could explain this to me.
Let's take the following JSON:
{
  "id": 1,
  "persons": [{
    "name": "n1",
    "lastname": "l1",
    "hobbies": [{
      "name": "h1",
      "activity": "a1"
    },
    {
      "name": "h2",
      "activity": "a2"
    }]
  },
  {
    "name": "n2",
    "lastname": "l2",
    "hobbies": [{
      "name": "h3",
      "activity": "a3"
    },
    {
      "name": "h4",
      "activity": "a4"
    }]
  }]
}
I'm loading this JSON into an RDD via sc.parallelize(file.json) and into a DF via sqlContext.read.json(file.json). So far so good; this gives me an RDD and a DF (with schema) for the JSON above, but I want to create another RDD/DF from the existing one that contains all distinct "hobbies" records. How can I achieve something like that?
The only things I get from my operations are multiple WrappedArrays for the hobbies, but I cannot go deeper nor assign them to a DF/RDD.
Here is the code I have so far using the SqlContext:
val jsonData = sqlContext.read.json("path/file.json")
jsonData.registerTempTable("jsonData") // I receive the schema for the whole file
val hobbies = sqlContext.sql("SELECT persons.hobbies FROM jsonData") // subschema for hobbies
hobbies.show()
That leaves me with
+--------------------+
|             hobbies|
+--------------------+
|[WrappedArray([a1...|
+--------------------+
What I expect is more like:
+--------------------+-----------------+
| name               | activity        |
+--------------------+-----------------+
| h1                 | a1              |
+--------------------+-----------------+
| h2                 | a2              |
+--------------------+-----------------+
| h3                 | a3              |
+--------------------+-----------------+
| h4                 | a4              |
+--------------------+-----------------+
I loaded your example into the dataframe hobbies exactly as you did and worked with it. You could run something like the following:
val distinctHobbies = hobbies.rdd.flatMap {row => row.getSeq[List[Row]](0).flatten}.map(row => (row.getString(0), row.getString(1))).distinct
val dhDF = distinctHobbies.toDF("activity", "name")
This essentially flattens your hobbies struct, transforms it into a tuple, and runs a distinct on the returned tuples. We then turn it back into a dataframe under the correct column aliases. Because we are doing this through the underlying RDD, there may also be a more efficient way to do it using just the DataFrame API.
Regardless, when I run it on your example, I see:
scala> val distinctHobbies = hobbies.rdd.flatMap {row => row.getSeq[List[Row]](0).flatten}.map(row => (row.getString(0), row.getString(1))).distinct
distinctHobbies: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[121] at distinct at <console>:24
scala> val dhDF = distinctHobbies.toDF("activity", "name")
dhDF: org.apache.spark.sql.DataFrame = [activity: string, name: string]
scala> dhDF.show
...
+--------+----+
|activity|name|
+--------+----+
| a2| h2|
| a1| h1|
| a3| h3|
| a4| h4|
+--------+----+
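For comparison, the DataFrame-only route hinted at above would look roughly like this in PySpark (a sketch, not part of the original answer; it assumes the same file is loaded with spark.read.json, and explode plus distinct replaces the RDD round-trip):
from pyspark.sql import functions as F

jsonData = spark.read.json("path/file.json")
distinctHobbies = (jsonData
    .select(F.explode("persons").alias("person"))      # one row per person
    .select(F.explode("person.hobbies").alias("hobby")) # one row per hobby
    .select("hobby.activity", "hobby.name")
    .distinct())
distinctHobbies.show()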