PySpark - Referencing a column named "name" in DataFrame - json

I am trying to use PySpark to parse json data. Below is the script.
arrayData = [
    {"resource":
        {
            "id": "123456789",
            "name2": "test123"
        }
    }
]
df = spark.createDataFrame(data=arrayData)
df3 = df.select(df.resource.id, df.resource.name2)
df3.show()
The script works and the output is
+------------+---------------+
|resource[id]|resource[name2]|
+------------+---------------+
|   123456789|        test123|
+------------+---------------+
However, after I changed the text "name2" in the variable arrayData to "name", and referenced it in df3 as below,
df3 = df.select(df.resource.id, df.resource.name)
I got the following error
TypeError: Invalid argument, not a string or column: <bound method alias of Column<b'resource'>> of type <class 'method'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
I think the root cause might be that "name" is a reserved word. If so, how can I work around this?

You can use the bracket notation that Suresh mentioned. Following is the code:
df3 = df.select(df.resource.id, df.resource["name"])
df3.show()
+------------+--------------+
|resource[id]|resource[name]|
+------------+--------------+
|   123456789|       test123|
+------------+--------------+
If you want only id and name as the column names in your dataframe, you can use the following:
from pyspark.sql import functions as f
df4 = df.select(f.col("resource.id"), f.col("resource.name"))
df4.show()
+---------+-------+
|       id|   name|
+---------+-------+
|123456789|test123|
+---------+-------+
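A note on the error itself: it is not caused by a SQL reserved word. name (like alias) is an existing method on the PySpark Column object, so the attribute access df.resource.name returns that bound method instead of the field, which is exactly what the error message shows. Any accessor that bypasses attribute lookup works; a small sketch using the same df as above:
# getItem and plain indexing avoid the clash with Column methods such as name and alias
df.select(df.resource.getItem("id"), df.resource.getItem("name")).show()
df.select(df["resource"]["id"], df["resource"]["name"]).show()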

Related

Flatten a json column containing multiple comma separated json in spark dataframe

In my Spark dataframe I have a column that contains a single JSON string made up of multiple comma-separated JSON objects with key-value pairs. I need to flatten the JSON data into separate columns.
The record of json column student_data looks like below
+--+------+---------------------------------------------------------------------------------------------------------------------------------------+
|id|name |student_data |
+--+------+---------------------------------------------------------------------------------------------------------------------------------------+
|11|stephy|{{"key":"hindi","value":{"hindi_mythology":80}},{"key":"social_science","value":{"civics":65}},{"key":"maths","value":{"geometry":70}}}|
+--+------+---------------------------------------------------------------------------------------------------------------------------------------+
Schema of record is as below.
root
|-- id : int
|-- name : string
|-- student_data : string
The requirement is to flatten the JSON; the expected output is as below.
+---+------+-----+--------------+-----+
|id |name  |hindi|social_science|maths|
+---+------+-----+--------------+-----+
|1  |stephy|80   |65            |70   |
+---+------+-----+--------------+-----+
You can transform your JSON into a struct type using the Spark function from_json() with a schema that represents the schema of the JSON string. After that, to get the expected result, you can pivot the column to go from row to column format:
The input JSON file:
{
    "id": 11,
    "name": "stephy",
    "student_data": "[{\"key\":\"hindi\",\"value\":{\"hindi_mythology\":80}},{\"key\":\"social_science\",\"value\":{\"civics\":65}},{\"key\":\"maths\",\"value\":{\"geometry\":70}}]"
}
Code:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val df = spark.read.json("file.json")
val schema = new StructType()
  .add("key", StringType, true)
  .add("value", MapType(StringType, IntegerType), true)
val res = df.withColumn("student_data", from_json(col("student_data"), ArrayType(schema)))
  .select(col("id"), col("name"), explode(col("student_data")).as("student_data"))
  .select("id", "name", "student_data.*")
  .select(col("id"), col("name"), col("key"), map_values(col("value")).getItem(0).as("value"))

res.groupBy("id", "name").pivot("key").agg(first(col("value"))).show(false)
+---+------+-----+-----+--------------+
|id |name |hindi|maths|social_science|
+---+------+-----+-----+--------------+
|11 |stephy|80 |70 |65 |
+---+------+-----+-----+--------------+

Pyspark: How to create a nested Json by adding dynamic prefix to each column based on a row value

I have a dataframe in the below format.
Input:
id | Name_type | Name | Car
1  | First     | rob  | Nissan
2  | First     | joe  | Hyundai
1  | Last      | dent | Infiniti
2  | Last      | Kent | Genesis
I need to transform this into a JSON column by appending the Name_type row value as a prefix to each column name for a given key column, as shown below.
Result expected:
id | json_column
1  | {"First_Name":"rob","First_Car":"Nissan","Last_Name":"dent","Last_Car":"Infiniti"}
2  | {"First_Name":"joe","First_Car":"Hyundai","Last_Name":"kent","Last_Car":"Genesis"}
With the below piece of code
column_set = ['Name','Car']
df = df.withColumn("json_data", to_json(struct([df[x] for x in column_set])))
I was able to generate data as
id | Name_type | Json_data
1  | First     | {"Name":"rob", "Car": "Nissan"}
2  | First     | {"Name":"joe", "Car": "Hyundai"}
1  | Last      | {"Name":"dent", "Car": "infiniti"}
2  | Last      | {"Name":"kent", "Car": "Genesis"}
I was able to create a JSON column using to_json for each row, but I am not able to figure out how to append the row value to the column names and convert the result to a nested JSON for a given key column.
To do what you want, you first need to manipulate your input dataframe a little bit. You can do this by grouping by the id column, and pivoting around the Name_type column like so:
from pyspark.sql.functions import first
df = spark.createDataFrame(
[
("1", "First", "rob", "Nissan"),
("2", "First", "joe", "Hyundai"),
("1", "Last", "dent", "Infiniti"),
("2", "Last", "Kent", "Genesis")
],
["id", "Name_type", "Name", "Car"]
)
output = df.groupBy("id").pivot("Name_type").agg(first("Name").alias('Name'), first("Car").alias('Car'))
output.show()
+---+----------+---------+---------+--------+
| id|First_Name|First_Car|Last_Name|Last_Car|
+---+----------+---------+---------+--------+
| 1| rob| Nissan| dent|Infiniti|
| 2| joe| Hyundai| Kent| Genesis|
+---+----------+---------+---------+--------+
Then you can use the exact same code as what you used to get your wanted result, but using 4 columns instead of 2:
from pyspark.sql.functions import to_json, struct
column_set = ['First_Name','First_Car', 'Last_Name', 'Last_Car']
output = output.withColumn("json_data", to_json(struct([output[x] for x in column_set])))
output.show(truncate=False)
+---+----------+---------+---------+--------+----------------------------------------------------------------------------------+
|id |First_Name|First_Car|Last_Name|Last_Car|json_data |
+---+----------+---------+---------+--------+----------------------------------------------------------------------------------+
|1 |rob |Nissan |dent |Infiniti|{"First_Name":"rob","First_Car":"Nissan","Last_Name":"dent","Last_Car":"Infiniti"}|
|2 |joe |Hyundai |Kent |Genesis |{"First_Name":"joe","First_Car":"Hyundai","Last_Name":"Kent","Last_Car":"Genesis"}|
+---+----------+---------+---------+--------+----------------------------------------------------------------------------------+
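If the set of Name_type values is not known in advance, the same to_json/struct line can be driven by a dynamically built column list instead of a hardcoded one (a small sketch under that assumption):
# Every column except the grouping key id carries a Name_type prefix after the pivot.
column_set = [c for c in output.columns if c != "id"]
output = output.withColumn("json_data", to_json(struct([output[x] for x in column_set])))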

read a file of JSON string in pyspark

I have a file that looks like this:
'{"Name": "John", "Age": 23}'
'{"Name": "Mary", "Age": 21}'
How can I read this file and get a pyspark dataframe like this:
Name | Age
"John" | 23
"Mary" | 21
First read in the file in text format, and then use the from_json function to convert the row to two columns.
# Read each line of the file as plain text into a single column named "value".
df = spark.read.load(path_to_your_file, format='text')
# Strip the surrounding single quotes, then parse the JSON with an explicit schema.
df = df.selectExpr("from_json(trim('\\'' from value), 'Name string,Age int') as data").select('data.*')
df.show(truncate=False)
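For reference, an equivalent sketch using the DataFrame API instead of selectExpr, stripping the surrounding quotes with regexp_replace before parsing (it assumes the same path_to_your_file as above):
from pyspark.sql import functions as F

df = spark.read.text(path_to_your_file)
# Remove the leading and trailing single quote, then parse with an explicit schema.
cleaned = F.regexp_replace("value", "^'|'$", "")
df = df.select(F.from_json(cleaned, "Name string, Age int").alias("data")).select("data.*")
df.show(truncate=False)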

Reading REST API JSON response using Spark Scala [closed]

I want to hit an API by applying some parameters from a dataframe, get the JSON response body, and from the body pull out all the distinct values of a particular key.
I then need to add this as a new column to the first dataframe.
Suppose I have a dataframe like below:
df1:
+-----+-------+--------+
| DB | User | UserID |
+-----+-------+--------+
| db1 | user1 | 123 |
| db2 | user2 | 456 |
+-----+-------+--------+
I want to hit a REST API by providing the column values of df1 as parameters.
If my URL parameters are db=db1 and User=user1 (the first record of df1), the response will be JSON in the following format:
{
    "data": [
        {
            "db": "db1",
            "User": "User1",
            "UserID": 123,
            "Query": "Select * from A",
            "Application": "App1"
        },
        {
            "db": "db1",
            "User": "User1",
            "UserID": 123,
            "Query": "Select * from B",
            "Application": "App2"
        }
    ]
}
From this JSON, I want to get the distinct values of the Application key as an array or list and attach it as a new column to df1.
My output will look similar to below:
Final df:
+-----+-------+--------+-------------+
| DB | User | UserID | Apps |
+-----+-------+--------+-------------+
| db1 | user1 | 123 | {App1,App2} |
| db2 | user2 | 456 | {App3,App3} |
+-----+-------+--------+-------------+
I have come up with a high level plan on how to achieve it.
Add a new column called response URL built from multiple columns in input.
Define a scala function that takes in URL and return an array of application and convert it to UDF.
Create another column by applying the UDF by passing response URL.
Since I am pretty new to Scala Spark and have never worked with REST APIs, can someone please help me achieve this result?
Any other idea or suggestion is always welcome.
I am using Spark 1.6.
Check the below code. You may need to write the logic to invoke the REST API; once you get the result, the rest of the process is simple.
scala> val df = Seq(("db1","user1",123),("db2","user2",456)).toDF("db","user","userid")
df: org.apache.spark.sql.DataFrame = [db: string, user: string, userid: int]
scala> df.show(false)
+---+-----+------+
|db |user |userid|
+---+-----+------+
|db1|user1|123 |
|db2|user2|456 |
+---+-----+------+
scala> :paste
// Entering paste mode (ctrl-D to finish)
def invokeRestAPI(db:String,user: String) = {
import org.json4s._
import org.json4s.jackson.JsonMethods._
implicit val formats = DefaultFormats
// Write your invoke logic & for now I am hardcoding your sample json here.
val json_data = parse("""{"data":[ {"db": "db1","User": "User1","UserID": 123,"Query": "Select * from A","Application": "App1"},{"db": "db1","User": "User1","UserID": 123,"Query": "Select * from B","Application": "App2"}]}""")
(json_data \\ "data" \ "Application").extract[Set[String]].toList
}
// Exiting paste mode, now interpreting.
invokeRestAPI: (db: String, user: String)List[String]
scala> val fetch = udf(invokeRestAPI _)
fetch: org.apache.spark.sql.UserDefinedFunction = UserDefinedFunction(<function2>,ArrayType(StringType,true),List(StringType, StringType))
scala> df.withColumn("apps",fetch($"db",$"user")).show(false)
+---+-----+------+------------+
|db |user |userid|apps |
+---+-----+------+------------+
|db1|user1|123 |[App1, App2]|
|db2|user2|456 |[App1, App2]|
+---+-----+------+------------+
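For completeness, a rough PySpark sketch of the same UDF idea, assuming a PySpark dataframe df with the same db and user columns; the HTTP call itself is left out and the sample response from the question is hardcoded, exactly as in the Scala answer:
import json
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

def invoke_rest_api(db, user):
    # Replace this with the real REST call for the given db and user.
    body = '{"data":[{"db":"db1","User":"User1","UserID":123,"Query":"Select * from A","Application":"App1"},{"db":"db1","User":"User1","UserID":123,"Query":"Select * from B","Application":"App2"}]}'
    parsed = json.loads(body)
    return sorted({rec["Application"] for rec in parsed["data"]})

fetch = F.udf(invoke_rest_api, ArrayType(StringType()))
df.withColumn("apps", fetch("db", "user")).show(truncate=False)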

from_json of Spark sql return null values

I loaded a parquet file into a Spark dataframe as follows:
val message= spark.read.parquet("gs://defenault-zdtt-devde/pubsub/part-00001-e9f8c58f-7de0-4537-a7be-a9a8556sede04a-c000.snappy.parquet")
When I perform a collect on my dataframe, I get the following result:
message.collect()
Array[org.apache.spark.sql.Row] = Array([118738748835150,2018-08-20T17:44:38.742Z,{"id":"uplink-3130-85bc","device_id":60517119992794222,"group_id":69,"group":"box-2478-2555","profile_id":3,"profile":"eolane-movee","type":"uplink","timestamp":"2018-08-20T17:44:37.048Z","count":3130,"payload":[{"timestamp":"2018-08-20T17:44:37.048Z","data":{"battery":3.5975599999999996,"temperature":27}}],"payload_encrypted":"9da25e36","payload_cleartext":"fe1b01aa","device_properties":{"appeui":"7ca97df000001190","deveui":"7ca97d0000001bb0","external_id":"Product: 3.7 / HW: 3.1 / SW: 1.8.8","no_de_serie_eolane":"4904","no_emballage":"S02066","product_version":"1.3.1"},"protocol_data":{"AppNonce":"e820ef","DevAddr":"0e6c5fda","DevNonce":"85bc","NetID":"000007","best_gateway_id":"M40246","gateway.
The schema of this dataframe is
message.printSchema()
root
|-- Id: string (nullable = true)
|-- publishTime: string (nullable = true)
|-- data: string (nullable = true)
My aim is to work on the data column which holds json data and to flatten it.
I wrote the following code
val schemaTotal = new StructType (
Array (StructField("id",StringType,false),StructField("device_id",StringType),StructField("group_id",LongType), StructField("group",StringType),StructField("profile_id",IntegerType),StructField("profile",StringType),StructField("type",StringType),StructField("timestamp",StringType),
StructField("count",StringType),
StructField("payload",new StructType ()
.add("timestamp",StringType)
.add("data",new ArrayType (new StructType().add("battery",LongType).add("temperature",LongType),false))),
StructField("payload_encrypted",StringType),
StructField("payload_cleartext",StringType),
StructField("device_properties", new ArrayType (new StructType().add("appeui",StringType).add("deveui",StringType).add("external_id",StringType).add("no_de_serie_eolane",LongType).add("no_emballage",StringType).add("product_version",StringType),false)),
StructField("protocol_data", new ArrayType (new StructType().add("AppNonce",StringType).add("DevAddr",StringType).add("DevNonce",StringType).add("NetID",LongType).add("best_gateway_id",StringType).add("gateways",IntegerType).add("lora_version",IntegerType).add("noise",LongType).add("port",IntegerType).add("rssi",DoubleType).add("sf",IntegerType).add("signal",DoubleType).add("snr",DoubleType),false)),
StructField("lat",StringType),
StructField("lng",StringType),
StructField("geolocation_type",StringType),
StructField("geolocation_precision",StringType),
StructField("delivered_at",StringType)))
val dataframe_extract=message.select($"Id",
$"publishTime",
from_json($"data",schemaTotal).as("content"))
val table = dataframe_extract.select(
$"Id",
$"publishTime",
$"content.id" as "id",
$"content.device_id" as "device_id",
$"content.group_id" as "group_id",
$"content.group" as "group",
$"content.profile_id" as "profile_id",
$"content.profile" as "profile",
$"content.type" as "type",
$"content.timestamp" as "timestamp",
$"content.count" as "count",
$"content.payload.timestamp" as "timestamp2",
$"content.payload.data.battery" as "battery",
$"content.payload.data.temperature" as "temperature",
$"content.payload_encrypted" as "payload_encrypted",
$"content.payload_cleartext" as "payload_cleartext",
$"content.device_properties.appeui" as "appeui"
)
table.show() gives me null values for all columns:
+---------------+--------------------+----+---------+--------+-----+----------+-------+----+---------+-----+----------+-------+-----------+-----------------+-----------------+------+
| Id| publishTime| id|device_id|group_id|group|profile_id|profile|type|timestamp|count|timestamp2|battery|temperature|payload_encrypted|payload_cleartext|appeui|
+---------------+--------------------+----+---------+--------+-----+----------+-------+----+---------+-----+----------+-------+-----------+-----------------+-----------------+------+
|118738748835150|2018-08-20T17:44:...|null| null| null| null| null| null|null| null| null| null| null| null| null| null| null|
+---------------+--------------------+----+---------+--------+-----+----------+-------+----+---------+-----+----------+-------+-----------+-----------------+-----------------+------+
whereas table.printSchema() gives me the expected result. Any idea how to solve this, please?
I am working with Zeppelin as a first prototyping step. Thanks a lot in advance for your help.
Best regards
The from_json() SQL function has the below constraint that must be followed when converting a column value to a dataframe:
whatever datatype you have defined in the schema should match the value present in the JSON; if any column's value does not match, it leads to null in all column values.
e.g.:
'{"name": "raj", "age": 12}' for this column value
StructType(List(StructField(name,StringType,true),StructField(age,StringType,true)))
The above schema will return you a null value on both the columns
StructType(List(StructField(name,StringType,true),StructField(age,IntegerType,true)))
The above schema will return you an expected dataframe
For this thread, the possible reason could be this: if there is any mismatched column value present, from_json will return all column values as null.
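If the schema is large, one way to find which field is causing the mismatch, instead of silently getting nulls, is to run from_json in FAILFAST mode. A small PySpark debugging sketch, assuming a recent Spark version (3.0+) where from_json accepts a mode option:
from pyspark.sql import functions as F

df = spark.createDataFrame([('{"name": "raj", "age": 12}',)], ["value"])

# A deliberately wrong type for age: the default PERMISSIVE mode yields a row of nulls.
bad_schema = "name string, age array<int>"
df.select(F.from_json("value", bad_schema).alias("d")).select("d.*").show()

# The same schema with mode=FAILFAST raises an error describing the malformed record,
# which points directly at the field whose type does not match.
df.select(F.from_json("value", bad_schema, {"mode": "FAILFAST"}).alias("d")).select("d.*").show()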