read a file of JSON string in pyspark - json

I have a file that looks like this:
'{"Name": "John", "Age": 23}'
'{"Name": "Mary", "Age": 21}'
How can I read this file and get a pyspark dataframe like this:
Name | Age
"John" | 23
"Mary" | 21

First read the file in text format, then use the from_json function to parse each row into two columns.
df = spark.read.load(path_to_your_file, format='text')
df = df.selectExpr("from_json(trim('\\'' from value), 'Name string,Age int') as data").select('data.*')
df.show(truncate=False)
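For completeness, here is a hedged alternative sketch of the same idea using the DataFrame API instead of selectExpr; it assumes each line is one JSON object wrapped in single quotes, exactly as in the question, and uses regexp_replace to drop the wrapping quotes before parsing:
from pyspark.sql import functions as F, types as T

schema = T.StructType([
    T.StructField("Name", T.StringType()),
    T.StructField("Age", T.IntegerType()),
])

df = spark.read.text(path_to_your_file)                               # one row per line, in a column named "value"
df = (df.withColumn("value", F.regexp_replace("value", "^'|'$", ""))  # strip the leading/trailing single quote
        .select(F.from_json("value", schema).alias("data"))
        .select("data.*"))
df.show()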

Related

Pyspark: How to create a nested Json by adding dynamic prefix to each column based on a row value

I have a dataframe in the below format.
Input:
id | Name_type | Name | Car
1  | First     | rob  | Nissan
2  | First     | joe  | Hyundai
1  | Last      | dent | Infiniti
2  | Last      | Kent | Genesis
I need to transform this into a JSON column, prefixing each column name with the row's Name_type value for a given id key column, as shown below.
Result expected:
id | json_column
1  | {"First_Name":"rob","First_Car":"Nissan","Last_Name":"dent","Last_Car":"Infiniti"}
2  | {"First_Name":"joe","First_Car":"Hyundai","Last_Name":"kent","Last_Car":"Genesis"}
With the below piece of code:
column_set = ['Name','Car']
df = df.withColumn("json_data", to_json(struct([df[x] for x in column_set])))
I was able to generate data as:
id | Name_type | Json_data
1  | First     | {"Name":"rob", "Car": "Nissan"}
2  | First     | {"Name":"joe", "Car": "Hyundai"}
1  | Last      | {"Name":"dent", "Car": "infiniti"}
2  | Last      | {"Name":"kent", "Car": "Genesis"}
I was able to create a JSON column using to_json for a given row, but I am not able to figure out how to prepend the Name_type value to the column names and build the nested JSON per id key column.
To do what you want, you first need to manipulate your input dataframe a little bit. You can do this by grouping by the id column, and pivoting around the Name_type column like so:
from pyspark.sql.functions import first
df = spark.createDataFrame(
    [
        ("1", "First", "rob", "Nissan"),
        ("2", "First", "joe", "Hyundai"),
        ("1", "Last", "dent", "Infiniti"),
        ("2", "Last", "Kent", "Genesis")
    ],
    ["id", "Name_type", "Name", "Car"]
)
output = df.groupBy("id").pivot("Name_type").agg(first("Name").alias('Name'), first("Car").alias('Car'))
output.show()
+---+----------+---------+---------+--------+
| id|First_Name|First_Car|Last_Name|Last_Car|
+---+----------+---------+---------+--------+
| 1| rob| Nissan| dent|Infiniti|
| 2| joe| Hyundai| Kent| Genesis|
+---+----------+---------+---------+--------+
Then you can use the exact same code you used before to get your wanted result, but with 4 columns instead of 2:
from pyspark.sql.functions import to_json, struct
column_set = ['First_Name','First_Car', 'Last_Name', 'Last_Car']
output = output.withColumn("json_data", to_json(struct([output[x] for x in column_set])))
output.show(truncate=False)
+---+----------+---------+---------+--------+----------------------------------------------------------------------------------+
|id |First_Name|First_Car|Last_Name|Last_Car|json_data |
+---+----------+---------+---------+--------+----------------------------------------------------------------------------------+
|1 |rob |Nissan |dent |Infiniti|{"First_Name":"rob","First_Car":"Nissan","Last_Name":"dent","Last_Car":"Infiniti"}|
|2 |joe |Hyundai |Kent |Genesis |{"First_Name":"joe","First_Car":"Hyundai","Last_Name":"Kent","Last_Car":"Genesis"}|
+---+----------+---------+---------+--------+----------------------------------------------------------------------------------+
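If the Name_type values are not known in advance, the column list does not have to be hardcoded; a small sketch under that assumption, reusing the output dataframe from above and taking every pivoted column except the id key:
column_set = [c for c in output.columns if c != 'id']  # all pivoted columns, e.g. First_Name, First_Car, Last_Name, Last_Car
output = output.withColumn("json_data", to_json(struct([output[x] for x in column_set])))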

Read complex JSON to extract key values

I have a JSON and I'm trying to read part of it to extract keys and values.
Assuming response is my JSON data, here is my code:
data_dump = json.dumps(response)
data = json.loads(data_dump)
Here my data object becomes a list and I'm trying to get the keys as below
id = [key for key in data.keys()]
This fails with the error:
'list' object has no attribute 'keys'
How can I get around this to get my output below?
Here is my JSON:
{
  "1": {
    "task": [
      "wakeup",
      "getready"
    ]
  },
  "2": {
    "task": [
      "brush",
      "shower"
    ]
  },
  "3": {
    "task": [
      "brush",
      "shower"
    ]
  },
  "activites": ["standup", "play", "sitdown"],
  "statuscheck": {
    "time": 60,
    "color": 1002,
    "change(me)": 9898
  },
  "action": ["1", "2", "3", "4"]
}
The output I need is as below; I do not need data from the rest of the JSON.
id | task
1  | wakeup, getready
2  | brush, shower
If you know that the keys you need are "1" and "2", you could try reading the JSON string as a dataframe, unpivoting it, exploding and grouping:
from pyspark.sql import functions as F
df = (spark.read.json(sc.parallelize([data_dump]))
      .selectExpr("stack(2, '1', `1`, '2', `2`) (id, task)")
      .withColumn('task', F.explode('task.task'))
      .groupBy('id').agg(F.collect_list('task').alias('task'))
     )
df.show()
# +---+------------------+
# | id| task|
# +---+------------------+
# | 1|[wakeup, getready]|
# | 2| [brush, shower]|
# +---+------------------+
However, it may be easier to deal with it in Python:
data = json.loads(data_dump)
data2 = [(k, v['task']) for k, v in data.items() if k in ['1', '2']]
df = spark.createDataFrame(data2, ['id', 'task'])
df.show()
# +---+------------------+
# | id| task|
# +---+------------------+
# | 1|[wakeup, getready]|
# | 2| [brush, shower]|
# +---+------------------+
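If you really need the tasks as a single comma-separated string, as in the question's expected output, rather than an array, one extra step on either dataframe above should do it (a small sketch, assuming the array column is named task):
from pyspark.sql import functions as F

df = df.withColumn('task', F.concat_ws(', ', 'task'))  # ["wakeup", "getready"] -> "wakeup, getready"
df.show()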

PySpark - Referencing a column named "name" in DataFrame

I am trying to use PySpark to parse json data. Below is the script.
arrayData = [
    {"resource":
        {
            "id": "123456789",
            "name2": "test123"
        }
    }
]
df = spark.createDataFrame(data=arrayData)
df3 = df.select(df.resource.id, df.resource.name2)
df3.show()
The script works and the output is
+------------+---------------+
|resource[id]|resource[name2]|
+------------+---------------+
| 123456789| test123|
+------------+---------------+
However, after I changed the text "name2" in the variable arrayData to "name", and referenced it in df3 as below,
df3 = df.select(df.resource.id, df.resource.name)
I got the following error
TypeError: Invalid argument, not a string or column: <bound method alias of Column<b'resource'>> of type <class 'method'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
I think the root cause might be that "name" is a reserved word. If so, how can I get around this?
You can use the bracket notation that Suresh mentioned. Following is the code:
df3 = df.select(df.resource.id, df.resource["name"])
df3.show()
+------------+--------------+
|resource[id]|resource[name]|
+------------+--------------+
| 123456789| test123|
+------------+--------------+
If you want only id and name as the column names in your dataframe, you can use the following:
from pyspark.sql import functions as f
df4 = df.select(f.col("resource.id"), f.col("resource.name"))
df4.show()
+---------+-------+
| id| name|
+---------+-------+
|123456789|test123|
+---------+-------+
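As a side note, the failure does not seem to be about a reserved word: df.resource.name resolves to the Column.name method (an alias of Column.alias, which is what the error message is pointing at), so plain attribute access is shadowed. Two further equivalent spellings that avoid the attribute lookup (a sketch, assuming the same df as above):
df3 = df.select(df.resource.id, df.resource.getField("name"))  # explicit struct-field access
df3 = df.select(df["resource"]["id"], df["resource"]["name"])  # bracket notation all the way down
df3.show()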

read pretty json format data through spark

We read data stored in hourly folders in S3 through Spark in Scala. For example,
spark.read.textFile("s3://'Bucket'/'key'/'yyyy'/'MM'/'dd'/'hh'/*").
spark.read.textFile reads records one line at a time, so records stored as JSON Lines (the full JSON object on one line) are read and can be parsed later to retrieve the data from the JSON.
Now I have to read data containing multiple JSON objects in pretty-printed format instead of JSON Lines. Using the same strategy gives a corrupt record error. For example, reading
{
  "a": 1,
  "b": 2
}
through spark.read.textFile yields a Dataset[String] that parses as:
+---------------+
|_corrupt_record|
+---------------+
|              {|
|        "a": 1,|
|         "b": 2|
|              }|
+---------------+
Input data:
{
  "key1": "value1",
  "key2": "value2"
}
{
  "key1": "value1",
  "key2": "value2"
}
Expected output:
+------+------+
|key1 |key2 |
+------+------+
|value1|value2|
|value1|value2|
+------+------+
The file has multiple pretty-printed JSON records with a newline as the delimiter between records.
Approaches already tried:
spark.read.option("multiline", "true").json(""). This will not work, as multiline requires the data to be in the form [{},{}].
Approach that works:
val x=sparkSession
.read
.json(sc
.wholeTextFiles(filePath)
.values
.flatMap(x=> {x
.replace("\n", "")
.replace("}{", "}}{{")
.split("\\}\\{")}))
I just wanted to ask if there is a better approach, as the above solution is doing some slice and dice on the data, which might lead to performance issues for large data. Thanks!
This can be a working solution for you: use from_json() with the correct schema in order to parse the JSON correctly.
Create the dataframe here:
from pyspark.sql import functions as F, types as T
df = spark.createDataFrame([(str([{"key1":"value1","key2":"value2"}, {"key1": "value3", "key2": "value4"}]))], T.StringType())
df.show(truncate=False)
+----------------------------------------------------------------------------+
|value |
+----------------------------------------------------------------------------+
|[{'key1': 'value1', 'key2': 'value2'}, {'key1': 'value3', 'key2': 'value4'}]|
+----------------------------------------------------------------------------+
Now, use explode(), since the value/JSON column is a list, in order to map each record correctly.
And finally, use getItem() to extract the columns:
df = df.withColumn('col', F.from_json("value", T.ArrayType(T.StringType())))
df = df.withColumn("col", F.explode("col"))
df = df.withColumn("col", F.from_json("col", T.MapType(T.StringType(), T.StringType())))
df = df.withColumn("key1", df.col.getItem("key1")).withColumn("key2", df.col.getItem("key2"))
df.show(truncate=False)
+----------------------------------------------------------------------------+--------------------------------+------+------+
|value                                                                        |col                             |key1  |key2  |
+----------------------------------------------------------------------------+--------------------------------+------+------+
|[{'key1': 'value1', 'key2': 'value2'}, {'key1': 'value3', 'key2': 'value4'}]|[key1 -> value1, key2 -> value2]|value1|value2|
|[{'key1': 'value1', 'key2': 'value2'}, {'key1': 'value3', 'key2': 'value4'}]|[key1 -> value3, key2 -> value4]|value3|value4|
+----------------------------------------------------------------------------+--------------------------------+------+------+
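To answer the original question more directly, here is a hedged sketch that applies the same from_json()/explode() idea to the pretty-printed files themselves, reusing the newline-stripping trick from the question. It assumes records are separated only by newlines and that "}{" never occurs inside a value; filePath is the path from the question:
from pyspark.sql import functions as F, types as T

raw = sc.wholeTextFiles(filePath).values()                                         # one string per file
wrapped = raw.map(lambda s: "[" + s.replace("\n", "").replace("}{", "},{") + "]")  # turn each file into one JSON array
df = spark.createDataFrame(wrapped.map(lambda s: (s,)), ["value"])
df = (df.withColumn("col", F.explode(F.from_json("value", T.ArrayType(T.MapType(T.StringType(), T.StringType())))))
        .select(F.col("col").getItem("key1").alias("key1"), F.col("col").getItem("key2").alias("key2")))
df.show()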

Create DF/RDD from nested other DF/RDD (Nested Json) in Spark

I'm a total newbie in Spark & Scala stuff; it would be great if someone could explain this to me.
Let's take the following JSON:
{
  "id": 1,
  "persons": [{
    "name": "n1",
    "lastname": "l1",
    "hobbies": [{
      "name": "h1",
      "activity": "a1"
    },
    {
      "name": "h2",
      "activity": "a2"
    }]
  },
  {
    "name": "n2",
    "lastname": "l2",
    "hobbies": [{
      "name": "h3",
      "activity": "a3"
    },
    {
      "name": "h4",
      "activity": "a4"
    }]
  }]
}
I'm loading this JSON into an RDD via sc.parallelize(file.json) and into a DF via sqlContext.read.json(file.json). So far so good; this gives me an RDD and a DF (with schema) for the mentioned JSON, but I want to create another RDD/DF from the existing one that contains all distinct "hobbies" records. How can I achieve something like that?
The only thing I get from my operations is multiple WrappedArrays for hobbies, but I cannot go deeper nor assign them to a DF/RDD.
Code for SqlContext I have so far
val jsonData = sqlContext.read.json("path/file.json")
jsonData.registerTempTable("jsonData") //I receive schema for whole file
val hobbies = sqlContext.sql("SELECT persons.hobbies FROM jsonData") //subschema for hobbies
hobbies.show()
That leaves me with
+--------------------+
| hobbies|
+--------------------+
|[WrappedArray([a1...|
+--------------------+
What I expect is more like:
+--------------------+-----------------+
|                name|         activity|
+--------------------+-----------------+
|                  h1|               a1|
|                  h2|               a2|
|                  h3|               a3|
|                  h4|               a4|
+--------------------+-----------------+
I loaded your example into the dataframe hobbies exactly as you do it and worked with it. You could run something like the following:
val distinctHobbies = hobbies.rdd.flatMap {row => row.getSeq[List[Row]](0).flatten}.map(row => (row.getString(0), row.getString(1))).distinct
val dhDF = distinctHobbies.toDF("activity", "name")
This essentially flattens your hobbies struct, transforms it into a tuple, and runs a distinct on the returned tuples. We then turn it back into a dataframe under the correct column aliases. Because we are doing this through the underlying RDD, there may also be a more efficient way to do it using just the DataFrame API.
Regardless, when I run on your example, I see:
scala> val distinctHobbies = hobbies.rdd.flatMap {row => row.getSeq[List[Row]](0).flatten}.map(row => (row.getString(0), row.getString(1))).distinct
distinctHobbies: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[121] at distinct at <console>:24
scala> val dhDF = distinctHobbies.toDF("activity", "name")
dhDF: org.apache.spark.sql.DataFrame = [activity: string, name: string]
scala> dhDF.show
...
+--------+----+
|activity|name|
+--------+----+
| a2| h2|
| a1| h1|
| a3| h3|
| a4| h4|
+--------+----+
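As the answer notes, the same result can likely be reached with the DataFrame API alone by exploding the nested arrays instead of dropping to the RDD. A hedged sketch in PySpark (to match the rest of this page), assuming the JSON is loaded as jsonData above:
from pyspark.sql import functions as F

hobbies_df = (jsonData
    .select(F.explode("persons").alias("person"))        # one row per person
    .select(F.explode("person.hobbies").alias("hobby"))  # one row per hobby
    .select("hobby.name", "hobby.activity")
    .distinct())
hobbies_df.show()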