Create DF/RDD from nested other DF/RDD (Nested Json) in Spark - json

I'm a total newbie in Spark&Scala stuff, it would be great if someone could explain this to me.
Let's take following JSON
{
"id": 1,
"persons": [{
"name": "n1",
"lastname": "l1",
"hobbies": [{
"name": "h1",
"activity": "a1"
},
{
"name": "h2",
"activity": "a2"
}]
},
{
"name": "n2",
"lastname": "l2",
"hobbies": [{
"name": "h3",
"activity": "a3"
},
{
"name": "h4",
"activity": "a4"
}]
}]
}
I'm loading this JSON into an RDD via sc.parallelize(file.json) and into a DF via sqlContext.sql.load.json(file.json). So far so good: this gives me an RDD and a DF (with schema) for the JSON above, but I want to create another RDD/DF from the existing one that contains all distinct "hobbies" records. How can I achieve something like that?
The only things I get from my operations are multiple WrappedArrays for Hobbies but I cannot go deeper nor assign them to DF/RDD.
Code for SqlContext I have so far
val jsonData = sqlContext.read.json("path/file.json")
jsonData.registerTempTable("jsonData") //I receive schema for whole file
val hobbies = sqlContext.sql("SELECT persons.hobbies FROM jsonData") //subschema for hobbies
hobbies.show()
That leaves me with
+--------------------+
| hobbies|
+--------------------+
|[WrappedArray([a1...|
+--------------------+
What I expect is more like:
+----+--------+
|name|activity|
+----+--------+
|  h1|      a1|
|  h2|      a2|
|  h3|      a3|
|  h4|      a4|
+----+--------+

I loaded your example into the dataframe hobbies exactly as you do it and worked with it. You could run something like the following:
val distinctHobbies = hobbies.rdd.flatMap {row => row.getSeq[List[Row]](0).flatten}.map(row => (row.getString(0), row.getString(1))).distinct
val dhDF = distinctHobbies.toDF("activity", "name")
This essentially flattens your hobbies struct, transforms it into a tuple, and runs a distinct on the returned tuples. We then turn it back into a dataframe under the correct column aliases. Because we are doing this through the underlying RDD, there may also be a more efficient way to do it using just the DataFrame API.
Regardless, when I run on your example, I see:
scala> val distinctHobbies = hobbies.rdd.flatMap {row => row.getSeq[List[Row]](0).flatten}.map(row => (row.getString(0), row.getString(1))).distinct
distinctHobbies: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[121] at distinct at <console>:24
scala> val dhDF = distinctHobbies.toDF("activity", "name")
dhDF: org.apache.spark.sql.DataFrame = [activity: string, name: string]
scala> dhDF.show
...
+--------+----+
|activity|name|
+--------+----+
| a2| h2|
| a1| h1|
| a3| h3|
| a4| h4|
+--------+----+
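To make the flatten-then-distinct logic above easier to follow outside Spark, here is a minimal sketch in plain Python. It is not the answer's Spark code, only an illustration of the same two steps on the question's JSON; the function name distinct_hobbies is made up for this example.

```python
import json

# The JSON document from the question, inlined so the sketch is self-contained.
doc = json.loads("""
{"id": 1, "persons": [
  {"name": "n1", "lastname": "l1",
   "hobbies": [{"name": "h1", "activity": "a1"}, {"name": "h2", "activity": "a2"}]},
  {"name": "n2", "lastname": "l2",
   "hobbies": [{"name": "h3", "activity": "a3"}, {"name": "h4", "activity": "a4"}]}
]}
""")


def distinct_hobbies(record):
    """Flatten persons -> hobbies and keep each (name, activity) pair once.

    Hypothetical helper: it mirrors the flatMap + distinct steps, but keeps
    first-seen order instead of Spark's unspecified order.
    """
    seen = set()
    result = []
    for person in record.get("persons", []):
        for hobby in person.get("hobbies", []):
            pair = (hobby["name"], hobby["activity"])
            if pair not in seen:
                seen.add(pair)
                result.append(pair)
    return result


hobbies = distinct_hobbies(doc)
for name, activity in hobbies:
    print(name, activity)
```

The nested loops play the role of flatMap (one output row per inner array element), and the set plays the role of distinct.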

Related

Read complex JSON to extract key values

I have a JSON and I'm trying to read part of it to extract keys and values.
Assuming response is my JSON data, here is my code:
data_dump = json.dumps(response)
data = json.loads(data_dump)
Here my data object becomes a list and I'm trying to get the keys as below
id = [key for key in data.keys()]
This fails with the error:
A list object does not have an attribute keys. How can I get past this to get the output below?
Here is my JSON:
{
"1": {
"task": [
"wakeup",
"getready"
]
},
"2": {
"task": [
"brush",
"shower"
]
},
"3": {
"task": [
"brush",
"shower"
]
},
"activites": ["standup", "play", "sitdown"],
"statuscheck": {
"time": 60,
"color": 1002,
"change(me)": 9898
},
"action": ["1", "2", "3", "4"]
}
The output I need is as below. I do not need data from the rest of JSON.
id  task
1   wakeup, getready
2   brush, shower
If you know that the keys you need are "1" and "2", you could try reading the JSON string as a dataframe, unpivoting it, exploding and grouping:
from pyspark.sql import functions as F
df = (spark.read.json(sc.parallelize([data_dump]))
.selectExpr("stack(2, '1', `1`, '2', `2`) (id, task)")
.withColumn('task', F.explode('task.task'))
.groupBy('id').agg(F.collect_list('task').alias('task'))
)
df.show()
# +---+------------------+
# | id| task|
# +---+------------------+
# | 1|[wakeup, getready]|
# | 2| [brush, shower]|
# +---+------------------+
However, it may be easier to deal with it in Python:
data = json.loads(data_dump)
data2 = [(k, v['task']) for k, v in data.items() if k in ['1', '2']]
df = spark.createDataFrame(data2, ['id', 'task'])
df.show()
# +---+------------------+
# | id| task|
# +---+------------------+
# | 1|[wakeup, getready]|
# | 2| [brush, shower]|
# +---+------------------+
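If the parsed response is not guaranteed to be a dict (the question reports getting a list), a small guard before iterating can make the failure clearer. The following is a sketch only, in plain Python without Spark; task_rows and its wanted parameter are hypothetical names introduced for illustration.

```python
import json

# The JSON from the question, inlined as a string.
raw = """
{"1": {"task": ["wakeup", "getready"]},
 "2": {"task": ["brush", "shower"]},
 "3": {"task": ["brush", "shower"]},
 "activites": ["standup", "play", "sitdown"],
 "statuscheck": {"time": 60, "color": 1002, "change(me)": 9898},
 "action": ["1", "2", "3", "4"]}
"""


def task_rows(data, wanted=None):
    """Return (id, "task1, task2") rows for entries that carry a "task" list.

    Hypothetical helper. `wanted` optionally restricts the keys to keep.
    """
    if not isinstance(data, dict):
        # data.keys() only exists on dicts; fail with a clearer message.
        raise TypeError("expected a JSON object, got " + type(data).__name__)
    rows = []
    for key, value in data.items():
        if wanted is not None and key not in wanted:
            continue
        # Skip entries such as "activites" or "statuscheck" that have no task list.
        if isinstance(value, dict) and "task" in value:
            rows.append((key, ", ".join(value["task"])))
    return rows


rows = task_rows(json.loads(raw), wanted={"1", "2"})
for row in rows:
    print(row)
```

Leaving wanted as None keeps every entry that has a "task" list, which here would also include "3".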

Explode multiple columns from nested JSON but it is giving extra records

I have a JSON document like below:
{
"Data": [{
"Code": "ABC",
"ID": 123456,
"Type": "Yes",
"Geo": "East"
}, {
"Code": "XYZ",
"ID": 987654,
"Type": "No",
"Geo": "West"
}],
"Total": 2,
"AggregateResults": null,
"Errors": null
}
My PySpark sample code:
getjsonresponsedata=json.dumps(getjsondata)
jsonDataList.append(getjsonresponsedata)
jsonRDD = sc.parallelize(jsonDataList)
df_Json=spark.read.json(jsonRDD)
display(df_Json.withColumn("Code",explode(col("Data.Code"))).withColumn("ID",explode(col("Data.ID"))).select('Code','ID'))
When I explode the JSON then I get below records (it looks like cross join)
Code ID
ABC 123456
ABC 987654
XYZ 123456
XYZ 987654
But I expect the records like below:
Code ID
ABC 123456
XYZ 987654
Could you please help me on how to get the expected result?
You only need to explode the Data column; then you can select fields from the resulting struct column (Code, ID, ...). What duplicates the rows here is that you're exploding two arrays, Data.Code and Data.ID, independently.
Try this instead:
import pyspark.sql.functions as F
df_Json.withColumn("Data", F.explode("Data")).select("Data.Code", "Data.Id").show()
#+----+------+
#|Code| Id|
#+----+------+
#| ABC|123456|
#| XYZ|987654|
#+----+------+
Or using inline function directly on Data array:
df_Json.selectExpr("inline(Data)").show()
#+----+----+------+----+
#|Code| Geo| ID|Type|
#+----+----+------+----+
#| ABC|East|123456| Yes|
#| XYZ|West|987654| No|
#+----+----+------+----+
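The difference between the two approaches can be reproduced without Spark. The sketch below is only an analogy in plain Python: exploding two arrays one after the other behaves like a Cartesian product, while exploding the array of structs once keeps each record's fields together.

```python
from itertools import product

# The "Data" array from the question.
data = [
    {"Code": "ABC", "ID": 123456, "Type": "Yes", "Geo": "East"},
    {"Code": "XYZ", "ID": 987654, "Type": "No", "Geo": "West"},
]

# Exploding Data.Code and then Data.ID: every code is paired with every ID.
codes = [d["Code"] for d in data]
ids = [d["ID"] for d in data]
crossed = list(product(codes, ids))

# Exploding Data once and then selecting fields: one row per original record.
paired = [(d["Code"], d["ID"]) for d in data]

print(crossed)
print(paired)
```

With two records the first approach yields four rows and the second yields two, matching the outputs shown in the question and the answer.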

read pretty json format data through spark

We read data stored in hourly partitions in S3 through Spark in Scala. For example,
spark.read.textFile("s3://'Bucket'/'key'/'yyyy'/'MM'/'dd'/'hh'/*").
spark.read.textFile reads records one line at a time, so records stored as JSON Lines (a full JSON document on one line) are read and can be parsed later to retrieve data from the JSON.
Now I have to read data that contains multiple JSON documents in pretty-printed format instead of JSON Lines. Using the same strategy gives a corrupt record error. For example, the Dataset[String] obtained after reading through spark.read.textFile:
{
"a": 1,
"b": 2
}
is
+---------------+
|_corrupt_record|
+---------------+
|              {|
|        "a": 1,|
|         "b": 2|
|              }|
+---------------+
Input data :
{
"key1": "value1",
"key2": "value2"
}
{
"key1": "value1",
"key2": "value2"
}
ExpectedOutput
+------+------+
|key1 |key2 |
+------+------+
|value1|value2|
|value1|value2|
+------+------+
This file has multiple pretty-formatted JSON documents, with a newline as the delimiter between records.
Approaches already used
spark.read.option("multiline", "true").json(""). This will not work, as multiline requires the data to be in the form [{},{}].
Approach working
val x=sparkSession
.read
.json(sc
.wholeTextFiles(filePath)
.values
.flatMap(x=> {x
.replace("\n", "")
.replace("}{", "}}{{")
.split("\\}\\{")}))
I just wanted to ask if there is a better approach, as the above solution does some slicing and dicing on the data, which might lead to performance issues for large data. Thanks
This can be a working solution for you: use from_json() with the correct schema in order to parse the JSON correctly.
Create the dataframe here
from pyspark.sql import functions as F, types as T
df = spark.createDataFrame([(str([{"key1":"value1","key2":"value2"}, {"key1": "value3", "key2": "value4"}]))],T.StringType())
df.show(truncate=False)
+----------------------------------------------------------------------------+
|value |
+----------------------------------------------------------------------------+
|[{'key1': 'value1', 'key2': 'value2'}, {'key1': 'value3', 'key2': 'value4'}]|
+----------------------------------------------------------------------------+
Now use explode(), since the value/json column is a list, so that each element maps to its own row.
Finally, use getItem() to extract the columns.
df = df.withColumn('col', F.from_json("value", T.ArrayType(T.StringType())))
df = df.withColumn("col", F.explode("col"))
df = df.withColumn("col", F.from_json("col", T.MapType(T.StringType(), T.StringType())))
df = df.withColumn("key1", df.col.getItem("key1")).withColumn("key2", df.col.getItem("key2"))
df.show(truncate=False)
+----------------------------------------------------------------------------+--------------------------------+------+------+
|value |col |key1 |key2 |
+----------------------------------------------------------------------------+--------------------------------+------+------+
|[{'key1': 'value1', 'key2': 'value2'}, {'key1': 'value3', 'key2': 'value4'}]|[key1 -> value1, key2 -> value2]|value1|value2|
|[{'key1': 'value1', 'key2': 'value2'}, {'key1': 'value3', 'key2': 'value4'}]|[key1 -> value3, key2 -> value4]|value3|value4|
+----------------------------------------------------------------------------+--------------------------------+------+------+
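For the original problem (several pretty-printed JSON objects back to back in one file), another option is to let a JSON decoder find the object boundaries instead of rewriting "}{" by hand. The sketch below is a different technique from the one in the answer above and is shown in plain Python, not Scala: it uses the standard library's json.JSONDecoder.raw_decode, which parses one value starting at an index and returns the index where it stopped. The name iter_json_objects is hypothetical, and applying it per file inside Spark (for example to the contents returned by wholeTextFiles, as in the question) is an assumption left to the reader.

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON value from a string of concatenated documents.

    Hypothetical helper; assumes the values are separated only by whitespace.
    """
    decoder = json.JSONDecoder()
    pos = 0
    length = len(text)
    while pos < length:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < length and text[pos].isspace():
            pos += 1
        if pos >= length:
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# The input data from the question.
sample = """{
"key1": "value1",
"key2": "value2"
}
{
"key1": "value1",
"key2": "value2"
}"""

records = list(iter_json_objects(sample))
# Re-serialise each record on a single line, i.e. as JSON Lines.
json_lines = [json.dumps(r) for r in records]
print(json_lines)
```

Unlike the string-replace approach, this does not break when a string value happens to contain "}{", because the decoder tracks the JSON structure. It still requires the whole file content in memory, so it does not by itself address very large files.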

Karate API framework how to match the response values with the table columns?

I have below API response sample
{
"items": [
{
"id":11,
"name": "SMITH",
"prefix": "SAM",
"code": "SSO"
},
{
"id":10,
"name": "James",
"prefix": "JAM",
"code": "BBC"
}
]
}
As per the above response, my test says that whenever I hit the API, ID 11 should be SMITH and ID 10 should be James.
So I thought I would store this in a table and assert it against the actual response:
* table person
| id | name |
| 11 | SMITH |
| 10 | James |
| 9 | RIO |
Now how would I match these one by one? That is, first parse the first ID and first name from the API response and match them with the table's first ID and first name.
Please share any convenient way of doing this with Karate.
There are a few possible ways, here is one:
* def lookup = { 11: 'SMITH', 10: 'James' }
* def items =
"""
[
{
"id":11,
"name":"SMITH",
"prefix":"SAM",
"code":"SSO"
},
{
"id":10,
"name":"James",
"prefix":"JAM",
"code":"BBC"
}
]
"""
* match each items contains { name: "#(lookup[_$.id+''])" }
And you already know how to use table instead of JSON.
Please read the docs and other stack-overflow answers to get more ideas.

How to remove {} and [] from json column postgreSQL

I have a column in PostgreSQL with the json data type. Until today there were no rows that contained {} or [].
However, I have started to see {} and [] due to a new implementation. I want to remove them.
Example: the following is what my table looks like. The json column has the json data type.
id | json
----+------------------
a | {"st":[{"State": "TX", "Value":"0.02"}, {"State": "CA", "Value":"0.2" ...
----+------------------
b | {"st":[{"State": "TX", "Value":"0.32"}, {"State": "CA", "Value":"0.47" ...
----+------------------
d | {}
----+------------------
e | []
What I want is the following:
id | json
----+------------------
a | {"st":[{"State": "TX", "Value":"0.02"}, {"State": "CA", "Value":"0.2" ...
----+------------------
b | {"st":[{"State": "TX", "Value":"0.32"}, {"State": "CA", "Value":"0.47" ...
How can I do this?
I have written the following query:
SELECT *
FROM tableA
WHERE json::text <> '[]'::text
With this I am able to filter out the rows containing [], but I am still seeing the rows containing {}.
Very easy, just select all rows that don't contain those values:
SELECT *
FROM tableA
WHERE json :: text NOT IN ('{}', '[]')
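One caveat with comparing text: PostgreSQL's json type stores the input text as written, so a value entered as "{ }" (with a space) would not equal '{}' after the cast. If the filtering ever has to happen on the client side, parsing the value and checking for emptiness avoids that. The following is a sketch in plain Python under that assumption; is_empty_container is a hypothetical helper, and row "f" is made up to show the whitespace case.

```python
import json


def is_empty_container(json_text):
    """Return True when the JSON text is an empty object or an empty array."""
    value = json.loads(json_text)
    return isinstance(value, (dict, list)) and len(value) == 0


# (id, json text) pairs modelled on the table in the question.
rows = [
    ("a", '{"st": [{"State": "TX", "Value": "0.02"}]}'),
    ("d", "{}"),
    ("e", "[]"),
    ("f", "{ }"),  # hypothetical row: same empty object, different spelling
]

kept = [(row_id, text) for row_id, text in rows if not is_empty_container(text)]
print(kept)
```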