Extracting JSON values and concatenating them using PySpark

I have an array of JSONs, as below.
id | address
1 | [{street: 11 Summit Ave, city: null, postal_code: 07306, state: NJ, country: null}, {street: 11 Sum Ave, city: null, postal_code: null, state: NJ, country: US}, {street: 12 Oliver Avenue, city: Seattle, postal_code: 98121, state: WA, country: US}]
Here's what the data types are:
root
|-- id: string (nullable = true)
|-- addresses: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
| | |-- street: string (nullable = true)
| | |-- postalCode: string (nullable = true)
| | |-- country: string (nullable = true)
I want to create a single string of the addresses, ignoring nulls and separating entries with a delimiter (say ;). So the output should look like:
id | addresses
1 | 11 Summit Ave 07306 NJ; 11 Sum Ave NJ US; 12 Oliver Avenue Seattle 98121 WA US
How can I achieve this in PySpark? If it matters, my original address column was of string type, but I converted it to the schema above using from_json.

This would work:
df.withColumn("allAdd", F.explode("addresses"))\
.withColumn("asString", F.expr("concat_ws(' ', allAdd.*)"))\
.groupBy("id")\
.agg(F.concat_ws("; ", F.collect_list("asString")).alias("asString"))\
.show(truncate=False)
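With the sample data from the question, the final show() yields one concatenated row per id, along the lines of:
+---+-------------------------------------------------------------------------------+
|id |addresses                                                                      |
+---+-------------------------------------------------------------------------------+
|1  |11 Summit Ave 07306 NJ; 11 Sum Ave NJ US; 12 Oliver Avenue Seattle 98121 WA US|
+---+-------------------------------------------------------------------------------+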

Related

Separate string of JSONs into multiple rows PySpark

I want to separate a string of JSONs in my dataframe column into multiple rows in PySpark. Example:
Input:
id | addresses
1 | [{"city":null,"state":null,"street":"123, ABC St, ABC Square","postalCode":"11111","country":"USA"},{"city":"Dallas","state":"TX","street":"456, DEF Plaza, Test St","postalCode":"99999","country":"USA"}]
Expected output:
id | addresses
1 | {"city":null,"state":null,"street":"123, ABC St, ABC Square","postalCode":"11111","country":"USA"}
1 | {"city":"Dallas","state":"TX","street":"456, DEF Plaza, Test St","postalCode":"99999","country":"USA"}
Any ideas how to do this?
Looking at the example in your question, it is not clear what the type of the addresses column is, nor what type you need in the output column. So let's explore the different combinations.
addresses column is of type ArrayType: in this case, you can use explode:
from pyspark.sql import functions as F

df.select('id', F.explode('addresses').alias('address'))
The result is:
+---+-----------------------------------------------------------------------------------------------------+
|id |address |
+---+-----------------------------------------------------------------------------------------------------+
|1 |{country -> USA, state -> null, city -> null, street -> 123, ABC St, ABC Square, postalCode -> 11111}|
|1 |{country -> USA, state -> TX, city -> Dallas, street -> 456, DEF Plaza, Test St, postalCode -> 99999}|
+---+-----------------------------------------------------------------------------------------------------+
The type of the output column will be the same as the type of the items in the input column.
addresses column is an array of StringType, but you want your output to be a StructType: in this case, you can convert each string into a struct using from_json:
from pyspark.sql import functions as F, types as T

json_schema = T.StructType([
    T.StructField("city", T.StringType()),
    T.StructField("state", T.StringType()),
    T.StructField("street", T.StringType()),
    T.StructField("postalCode", T.StringType()),
    T.StructField("country", T.StringType()),
])

df_struct_from_array = (
    df
    .withColumn('address', F.explode('addresses'))
    .select('id', F.from_json('address', json_schema).alias('address'))
)
The following dataframe is the result:
+---+-------------------------------------------------+
|id |address |
+---+-------------------------------------------------+
|1 |{null, null, 123, ABC St, ABC Square, 11111, USA}|
|1 |{Dallas, TX, 456, DEF Plaza, Test St, 99999, USA}|
+---+-------------------------------------------------+
The schema of df_struct_from_array is:
root
|-- id: long (nullable = true)
|-- address: struct (nullable = true)
| |-- city: string (nullable = true)
| |-- state: string (nullable = true)
| |-- street: string (nullable = true)
| |-- postalCode: string (nullable = true)
| |-- country: string (nullable = true)
addresses column is of StringType and you want a StructType column in output: in this case, you have to convert from JSON first and then explode:
json_schema = T.ArrayType(T.StructType([
    T.StructField("city", T.StringType()),
    T.StructField("state", T.StringType()),
    T.StructField("street", T.StringType()),
    T.StructField("postalCode", T.StringType()),
    T.StructField("country", T.StringType()),
]))

df_struct_from_str = (
    df
    .withColumn('addresses_conv', F.from_json('addresses', json_schema))
    .select('id', F.explode('addresses_conv').alias('address'))
)
This is what you get:
+---+-------------------------------------------------+
|id |address |
+---+-------------------------------------------------+
|1 |{null, null, 123, ABC St, ABC Square, 11111, USA}|
|1 |{Dallas, TX, 456, DEF Plaza, Test St, 99999, USA}|
+---+-------------------------------------------------+
The schema of df_struct_from_str is:
root
|-- id: long (nullable = true)
|-- address: struct (nullable = true)
| |-- city: string (nullable = true)
| |-- state: string (nullable = true)
| |-- street: string (nullable = true)
| |-- postalCode: string (nullable = true)
| |-- country: string (nullable = true)

Define a schema from a DF column in array type

I have a metadata file with a column describing the schema of another file:
[{"column_datatype": "varchar", "column_description": "Indicates whether the Customer belongs to a particular business size, business activity, retail segment, demography, or other group and is used for reporting on regio performance regio migration.", "column_length": "4", "column_name": "clnt_grp_cd", "column_personally_identifiable_information": "False", "column_precision": "4", "column_primary_key": "True", "column_scale": null, "column_security_classifications": [], "column_sequence_number": "1"}
root
|-- column_info: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- column_datatype: string (nullable = true)
| | |-- column_description: string (nullable = true)
| | |-- column_length: string (nullable = true)
| | |-- column_name: string (nullable = true)
| | |-- column_personally_identifiable_information: string (nullable = true)
| | |-- column_precision: string (nullable = true)
| | |-- column_primary_key: string (nullable = true)
| | |-- column_scale: string (nullable = true)
| | |-- column_security_classifications: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- column_sequence_number: string (nullable = true)
I want to read a df using this schema. Something like:
schema = StructType([
    StructField("clnt_grp_cd", StringType(), True),
    StructField("clnt_grp_lvl1_nm", StringType(), True),
    (...)
])
df = spark.read.schema(schema).format("csv").option("header","true").load(filenamepath)
Is there a built-in method to parse this as a schema?
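There is no built-in parser for this custom metadata format, but you can build the StructType yourself from the parsed column. A minimal sketch, assuming meta_df holds the column_info array shown above and that every column_datatype has a known Spark equivalent (only varchar is mapped here; TYPE_MAP is a hypothetical helper, not a Spark API):
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical mapping from metadata type names to Spark types;
# extend it for whatever datatypes the metadata actually contains.
TYPE_MAP = {"varchar": StringType()}

# One row per column descriptor, ordered by column_sequence_number.
rows = (
    meta_df
    .select(F.explode("column_info").alias("c"))
    .select("c.column_name", "c.column_datatype", "c.column_sequence_number")
    .collect()
)

schema = StructType([
    StructField(r.column_name, TYPE_MAP.get(r.column_datatype, StringType()), True)
    for r in sorted(rows, key=lambda r: int(r.column_sequence_number))
])

df = spark.read.schema(schema).format("csv").option("header", "true").load(filenamepath)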

How to convert the nested JSON to dataframe [duplicate]

This question already has answers here:
reading json file in pyspark
(4 answers)
Closed 3 years ago.
I have the following JSON data and I want to convert it into a dataframe:
[
{FlierNumber:,BaggageTypeReturn:,FirstName:K,Title:1,MiddleName:D,LastName:Gupta,MealTypeOnward:,DateOfBirth:,BaggageTypeOnward:,SeatTypeOnward:,MealTypeReturn:,FrequentAirline:null,Type:A,SeatTypeReturn:},
{FlierNumber:,BaggageTypeReturn:,FirstName:Sweety,Title:2,MiddleName:,LastName:Gupta,MealTypeOnward:,DateOfBirth:,BaggageTypeOnward:,SeatTypeOnward:,MealTypeReturn:,FrequentAirline:null,Type:A,SeatTypeReturn:}
]
The JSON you gave above is invalid. Here is the syntactically correct JSON:
[{"FlierNumber":"","BaggageTypeReturn":"","FirstName":"K","Title":"1","MiddleName":"D","LastName":"Gupta","MealTypeOnward":"","DateOfBirth":"","BaggageTypeOnward":"","SeatTypeOnward":"","MealTypeReturn":"","FrequentAirline":"null","Type":"A","SeatTypeReturn":""},{"FlierNumber":"","BaggageTypeReturn":"","FirstName":"Sweety","Title":"2","MiddleName":"","LastName":"Gupta","MealTypeOnward":"","DateOfBirth":"","BaggageTypeOnward":"","SeatTypeOnward":"","MealTypeReturn":"","FrequentAirline":"null","Type":"A","SeatTypeReturn":""}]
If it is present in a file, you can read it in Spark directly using:
val jsonDF = spark.read.json("filepath/sample.json")
jsonDF.printSchema()
jsonDF.show()
Result is:
root
|-- BaggageTypeOnward: string (nullable = true)
|-- BaggageTypeReturn: string (nullable = true)
|-- DateOfBirth: string (nullable = true)
|-- FirstName: string (nullable = true)
|-- FlierNumber: string (nullable = true)
|-- FrequentAirline: string (nullable = true)
|-- LastName: string (nullable = true)
|-- MealTypeOnward: string (nullable = true)
|-- MealTypeReturn: string (nullable = true)
|-- MiddleName: string (nullable = true)
|-- SeatTypeOnward: string (nullable = true)
|-- SeatTypeReturn: string (nullable = true)
|-- Title: string (nullable = true)
|-- Type: string (nullable = true)
+-----------------+-----------------+-----------+---------+-----------+---------------+--------+--------------+--------------+----------+--------------+--------------+-----+----+
|BaggageTypeOnward|BaggageTypeReturn|DateOfBirth|FirstName|FlierNumber|FrequentAirline|LastName|MealTypeOnward|MealTypeReturn|MiddleName|SeatTypeOnward|SeatTypeReturn|Title|Type|
+-----------------+-----------------+-----------+---------+-----------+---------------+--------+--------------+--------------+----------+--------------+--------------+-----+----+
| | | | K| | null| Gupta| | | D| | | 1| A|
| | | | Sweety| | null| Gupta| | | | | | 2| A|
+-----------------+-----------------+-----------+---------+-----------+---------------+--------+--------------+--------------+----------+--------------+--------------+-----+----+
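Note that spark.read.json expects one JSON document per line (JSON Lines) by default; the corrected single-line array above reads fine, but a pretty-printed file needs multiline mode. A PySpark sketch of the same read, assuming the file path from the answer:
json_df = (
    spark.read
    .option("multiLine", "true")   # needed only if the JSON spans multiple lines
    .json("filepath/sample.json")
)
json_df.printSchema()
json_df.show()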

How to parse jsonfile with spark

I have a JSON file to be parsed. The JSON format is like this:
{"cv_id":"001","cv_parse": { "educations": [{"major": "English", "degree": "Bachelor" },{"major": "English", "degree": "Master "}],"basic_info": { "birthyear": "1984", "location": {"state": "New York"}}}}
I have to get every field in the file. How can I get "major" out of the educations array, and do I have to get "state" using the method df.select("cv_parse.basic_info.location.state")?
This is the result I want:
cv_id | major   | degree   | birthyear | state
001   | English | Bachelor | 1984      | New York
001   | English | Master   | 1984      | New York
This might not be the best way of doing it but you can give it a shot.
// import the implicits functions
import org.apache.spark.sql.functions._
import sqlContext.implicits._
//read the json file
val jsonDf = sqlContext.read.json("sample-data/sample.json")
jsonDf.printSchema
Your schema would be:
root
|-- cv_id: string (nullable = true)
|-- cv_parse: struct (nullable = true)
| |-- basic_info: struct (nullable = true)
| | |-- birthyear: string (nullable = true)
| | |-- location: struct (nullable = true)
| | | |-- state: string (nullable = true)
| |-- educations: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- degree: string (nullable = true)
| | | |-- major: string (nullable = true)
Now you can explode the educations column:
val explodedResult = jsonDf.select($"cv_id", explode($"cv_parse.educations"),
  $"cv_parse.basic_info.birthyear", $"cv_parse.basic_info.location.state")
explodedResult.printSchema
Now your schema would be
root
|-- cv_id: string (nullable = true)
|-- col: struct (nullable = true)
| |-- degree: string (nullable = true)
| |-- major: string (nullable = true)
|-- birthyear: string (nullable = true)
|-- state: string (nullable = true)
Now you can select the columns
explodedResult.select("cv_id", "birthyear", "state", "col.degree", "col.major").show
+-----+---------+--------+--------+-------+
|cv_id|birthyear| state| degree| major|
+-----+---------+--------+--------+-------+
| 001| 1984|New York|Bachelor|English|
| 001| 1984|New York| Master |English|
+-----+---------+--------+--------+-------+
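For anyone doing this in PySpark rather than Scala, a minimal sketch of the same approach (reusing the sample path from above):
from pyspark.sql import functions as F

json_df = spark.read.json("sample-data/sample.json")

# Explode the educations array, then pull the nested basic_info fields alongside it.
exploded = json_df.select(
    "cv_id",
    F.explode("cv_parse.educations").alias("education"),
    F.col("cv_parse.basic_info.birthyear").alias("birthyear"),
    F.col("cv_parse.basic_info.location.state").alias("state"),
)
exploded.select("cv_id", "birthyear", "state",
                "education.degree", "education.major").show()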

spark hivecontext working with queries issues

I'm trying to get information from Jsons to create tables in Hive.
This is my Json schema:
root
|-- info: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- stations: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- bikes: string (nullable = true)
| | | | |-- id: string (nullable = true)
| | | | |-- slots: string (nullable = true)
| | | | |-- streetName: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | |-- updateTime: long (nullable = true)
|-- date: string (nullable = true)
|-- numRecords: string (nullable = true)
I'm using this query:
sqlContext.sql("SELECT info.updateTime FROM STATIONS").foreach(println)
This is what I get:
[WrappedArray(1449098169, 1449108553, 1449098468)]
But I don't know how to put this information into a table so I can use it later from the Hive console.
I used this:
query.write.save("/home/cloudera/Desktop/select")
And it creates something, but I don't know how to use it.
Thanks
You can do it in several ways...it depends.
First way: Have the table created in the query
sqlContext.sql("create table mytable AS SELECT info.updateTime FROM STATIONS")
// now you can query mytable
Second way: write the DataFrame with saveAsTable():
sqlContext.sql("SELECT info.updateTime FROM STATIONS").write.saveAsTable("othertable")