Spark SQL DataFrame pretty print - json

I'm not very good with Scala (I'm more of an R addict). I want to display the content of the WrappedArray element (see sqlDf.show() below) as two rows, using Scala in spark-shell. I tried the explode() function but couldn't get any further.
scala> val sqlDf = spark.sql("select t.articles.donneesComptablesArticle.taxes from dau_temp t")
sqlDf: org.apache.spark.sql.DataFrame = [taxes: array<array<struct<baseImposition:bigint,codeCommunautaire:string,codeNatureTaxe:string,codeTaxe:string,droitCautionnable:boolean,droitPercu:boolean,imputationCreditCautionne:boolean,montantLiquidation:bigint,quotite:double,statutAi2:boolean,statutDeLiquidation:string,statutRessourcesPropres:boolean,typeTaxe:string>>>]
scala> sqlDf.show
16/12/21 15:13:21 WARN util.Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
+--------------------+
| taxes|
+--------------------+
|[WrappedArray([12...|
+--------------------+
scala> sqlDf.printSchema
root
|-- taxes: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: struct (containsNull = true)
| | | |-- baseImposition: long (nullable = true)
| | | |-- codeCommunautaire: string (nullable = true)
| | | |-- codeNatureTaxe: string (nullable = true)
| | | |-- codeTaxe: string (nullable = true)
| | | |-- droitCautionnable: boolean (nullable = true)
| | | |-- droitPercu: boolean (nullable = true)
| | | |-- imputationCreditCautionne: boolean (nullable = true)
| | | |-- montantLiquidation: long (nullable = true)
| | | |-- quotite: double (nullable = true)
| | | |-- statutAi2: boolean (nullable = true)
| | | |-- statutDeLiquidation: string (nullable = true)
| | | |-- statutRessourcesPropres: boolean (nullable = true)
| | | |-- typeTaxe: string (nullable = true)
scala> val sqlDfTaxes = sqlDf.select(explode(sqlDf("taxes")))
sqlDfTaxes: org.apache.spark.sql.DataFrame = [col: array<struct<baseImposition:bigint,codeCommunautaire:string,codeNatureTaxe:string,codeTaxe:string,droitCautionnable:boolean,droitPercu:boolean,imputationCreditCautionne:boolean,montantLiquidation:bigint,quotite:double,statutAi2:boolean,statutDeLiquidation:string,statutRessourcesPropres:boolean,typeTaxe:string>>]
scala> sqlDfTaxes.show()
16/12/21 15:22:28 WARN util.Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
+--------------------+
| col|
+--------------------+
|[[12564,B00,TVA,A...|
+--------------------+
The "readable" content looks like this (THIS IS MY GOAL: a classic row x columns structure display with headers):
codeTaxe  codeCommunautaire  baseImposition  quotite  montantLiquidation  statutDeLiquidation
A445      B00                12564           20.0     2513                C
U165      A00                12000           4.7      564                 C

codeNatureTaxe  typeTaxe  statutRessourcesPropres  statutAi2  imputationCreditCautionne
TVA             ADVAL     FALSE                    TRUE       FALSE
DD              ADVAL     TRUE                     FALSE      TRUE

droitCautionnable  droitPercu
FALSE              TRUE
FALSE              TRUE
and the class of each row is (found it using R package sparklyr):
<jobj[100]>
class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
[12564,B00,TVA,A445,false,true,false,2513,20.0,true,C,false,ADVAL]
[[1]][[1]][[2]]
<jobj[101]>
class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
[12000,A00,DD,U165,false,true,true,564,4.7,false,C,true,ADVAL]

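You can explode on each column. Because taxes is an array of arrays here, explode the outer array first and then explode the field you need:
val flattenedtaxes = sqlDf.select(org.apache.spark.sql.functions.explode($"taxes").as("taxes")).withColumn("codeCommunautaire", org.apache.spark.sql.functions.explode($"taxes.codeCommunautaire"))
After this, flattenedtaxes has two columns: taxes (all the fields as they are) and the new column codeCommunautaire.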
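To get the row-by-column display the question asks for, one option is to explode both array levels and then expand the struct with "tax.*". The sketch below is only an illustration: the JSON sample is a hypothetical, shortened stand-in for the dau_temp data (five of the thirteen fields), the aliases inner and tax are made up, and it assumes Spark 2.2 or later so that spark.read.json accepts a Dataset[String].

```scala
import org.apache.spark.sql.{DataFrame, Encoders, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

val spark = SparkSession.builder().master("local[*]").appName("explode-taxes").getOrCreate()

// Hypothetical sample shaped like the question's column: array<array<struct<...>>>
val sample =
  """{"taxes":[[
    |{"codeTaxe":"A445","codeCommunautaire":"B00","baseImposition":12564,"quotite":20.0,"montantLiquidation":2513},
    |{"codeTaxe":"U165","codeCommunautaire":"A00","baseImposition":12000,"quotite":4.7,"montantLiquidation":564}
    |]]}""".stripMargin.replace("\n", "")

val sqlDf: DataFrame = spark.read.json(spark.createDataset(Seq(sample))(Encoders.STRING))

val taxRows = sqlDf
  .select(explode(col("taxes")).as("inner")) // outer array: one row per inner array
  .select(explode(col("inner")).as("tax"))   // inner array: one row per struct
  .select("tax.*")                           // struct fields become top-level columns

taxRows.show(false)
```

Each struct ends up as one row with one column per field, so show() prints a regular table with headers.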

Related

scala dataframe column names replace '-' with _ for nested json

I am working with Nested json, using scala and need to replace the - in column names with _.
Schema of json:
|-- a-type: struct (nullable = true)
| |-- x-Type: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- part: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- x-Type: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| | | |-- Length: long (nullable = true)
| | | |-- Order: long (nullable = true)
| | | |-- y-Name: string (nullable = true)
| | | |-- Payload-Text: string (nullable = true)
| |-- Date: string (nullable = true)
I am using the code below, which only works at the first level. However, I have to replace - with _ at all levels. Any help is really appreciated.
Code used currently:
scJsonDF.columns.foreach { col =>
  // replace '-' with '_' (the original call replaced it with an empty string)
  println(col + " after column replace " + col.replaceAll("-", "_"))
  scJsonDFCorrectedCols = scJsonDFCorrectedCols.withColumnRenamed(col, col.replaceAll("-", "_"))
}
I am looking for a dynamic solution as there are different structures available.
One solution I found is to flatten the JSON and update the column names. I used this gist as a starting point: https://gist.github.com/fahadsiddiqui/d5cff15698f9dc57e2dd7d7052c6cc43 and changed the line
col(x.toString).as(x.toString.replace(".", "_"))
to
col(x.toString).as(x.toString.replaceAll("-","_").replace(".", "_"))
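If flattening is not wanted, another possible approach is to rewrite the schema recursively and keep the nesting. The sketch below is an assumption-laden illustration, not the method from the gist: renameFields is a hypothetical helper, and the schema is a shortened copy of the one in the question.

```scala
import org.apache.spark.sql.types._

// Recursively rebuild a data type, replacing '-' with '_' in every struct field name.
def renameFields(dt: DataType): DataType = dt match {
  case s: StructType =>
    StructType(s.fields.map { f =>
      f.copy(name = f.name.replace("-", "_"), dataType = renameFields(f.dataType))
    })
  case a: ArrayType => a.copy(elementType = renameFields(a.elementType))
  case m: MapType =>
    m.copy(keyType = renameFields(m.keyType), valueType = renameFields(m.valueType))
  case other => other
}

// Shortened version of the schema shown in the question.
val original = new StructType()
  .add("a-type", new StructType()
    .add("x-Type", ArrayType(StringType))
    .add("part", ArrayType(new StructType()
      .add("y-Name", StringType)
      .add("Payload-Text", StringType)))
    .add("Date", StringType))

val renamed = renameFields(original).asInstanceOf[StructType]
println(renamed.treeString)

// To apply it, assuming scJsonDF from the question and a SparkSession named spark:
// val corrected = spark.createDataFrame(scJsonDF.rdd, renamed)
```

Only the names change, so the data types and nesting stay exactly as they were, and the function works for any structure without listing columns by hand.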

How to convert the dataframe column type from string to (array and struct) in spark

I have a DataFrame with the following schema, where 'name' is a string column whose value is a complex JSON document containing arrays and structs.
With the string data type I am not able to parse the data and write it into rows,
so I am trying to convert the data type and then apply explode to parse the data.
Current:
root
|--id: string (nullable = true)
|--partitionNo: string (nullable = true)
|--name: string (nullable = true)
Expected after conversion:
root
|--id: string (nullable = true)
|--partitionNo: string (nullable = true)
|--name: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- extension: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- url: string (nullable = true)
| | | | |-- valueMetadata: struct (nullable = true)
| | | | |-- modifiedDateTime: string (nullable = true)
| | | | |-- code: string (nullable = true)
| | |-- lastName: string (nullable = true)
| | |-- firstName: array (nullable = true)
| | | |-- element: string (containsNull = true)
You can use from_json, but you need to provide a schema. The schema can be inferred automatically with the somewhat convoluted code below, which is needed because from_json only accepts a schema in the form of lit:
val df2 = df.withColumn(
  "name",
  from_json(
    $"name",
    // the lines below generate the schema
    lit(
      df.select(
        schema_of_json(
          lit(
            df.select($"name").head()(0)
          )
        )
      ).head()(0)
    )
    // end of schema generation
  )
)
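The snippet above infers the schema from the first row only. As a hedged alternative, the schema can be inferred from every row by letting spark.read.json scan the string column. This is a sketch, assuming Spark 2.2 or later; the sample row and its values are hypothetical. Since spark.read.json treats a top-level JSON array as one record per element, the inferred schema describes a single element and is wrapped in ArrayType.

```scala
import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.ArrayType

val spark = SparkSession.builder().master("local[*]").appName("infer-json-schema").getOrCreate()

// Hypothetical row: 'name' holds a JSON array of objects stored as a plain string.
val nameJson =
  """[{"lastName":"Doe","firstName":["John"],"extension":[{"url":"http://example.org/ext","code":"X"}]}]"""
val df = spark.createDataFrame(Seq(("1", "0", nameJson))).toDF("id", "partitionNo", "name")

// Infer the element schema from all rows of the string column.
val elementSchema = spark.read
  .json(df.select(col("name")).as[String](Encoders.STRING))
  .schema

// Parse the string column into array<struct<...>> using the inferred schema.
val df2 = df.withColumn("name", from_json(col("name"), ArrayType(elementSchema)))
df2.printSchema()
```

This costs an extra pass over the data, so for large inputs it may be preferable to write the schema out once and reuse it.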

Need to pull out Json Data (nested array) from a single column dataframe - Table is coming out as Null with schema read -Scala

I am trying to extract the data shown below from a DataFrame. The JSON data, which is nested, sits entirely in one column (_c1). I want to pull it out into a separate DataFrame with valid column names. One sample record is shown below.
|_c1 |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"Id":"31279605299","Type":"12121212","client":"Checklist _API","eventTime":"2020-03-17T15:50:30.640Z","eventType":"Event","payload":{"sourceApp":"ios","questionnaire":{"version":"1.0","question":"How to resolve ? ","fb":"Na"}}}
I am reading it with this schema:
val schema=StructType(Array(
StructField("Id", StringType, false),
StructField("Type", StringType, false),
StructField("client", StringType, false),
StructField("eventTime", StringType, false),
StructField("eventType", StringType, false),
StructField("payload", ArrayType(StructType(Array(
StructField("sourceApp", StringType, false),
StructField("questionnaire", ArrayType(StructType(Array(
StructField("version", StringType, false),
StructField("question", StringType, false),
StructField("fb", StringType, false)))))
))))
))
val json_paral = DF.select(from_json(col("_c1"),schema))
The structure comes out as below:
|-- jsontostructs(_c1): struct (nullable = true)
| |-- Id: string (nullable = true)
| |-- Type: string (nullable = true)
| |-- client: string (nullable = true)
| |-- eventTime: string (nullable = true)
| |-- eventType: string (nullable = true)
| |-- payload: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- sourceApp: string (nullable = true)
| | | |-- questionnaire: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- version: string (nullable = true)
| | | | | |-- question: string (nullable = true)
| | | | | |-- fb: string (nullable = true)
The structure looks right, but when I check the DataFrame all the data comes out as NULL. Is the read correct? I am not getting any parsing errors either.
Please check if this helps-
1. Load the data
val data = """{"Id":"31279605299","Type":"12121212","client":"Checklist _API","eventTime":"2020-03-17T15:50:30.640Z","eventType":"Event","payload":{"sourceApp":"ios","questionnaire":{"version":"1.0","question":"How to resolve ? ","fb":"Na"}}} """
val df = Seq(data).toDF("jsonCol")
df.show(false)
df.printSchema()
Output-
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|jsonCol |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"Id":"31279605299","Type":"12121212","client":"Checklist _API","eventTime":"2020-03-17T15:50:30.640Z","eventType":"Event","payload":{"sourceApp":"ios","questionnaire":{"version":"1.0","question":"How to resolve ? ","fb":"Na"}}} |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
root
|-- jsonCol: string (nullable = true)
2. Extract the JSON string into separate fields
df.select(json_tuple(col("jsonCol"), "Id", "Type", "client", "eventTime", "eventType", "payload"))
.show(false)
Output-
+-----------+--------+--------------+------------------------+-----+----------------------------------------------------------------------------------------------+
|c0 |c1 |c2 |c3 |c4 |c5 |
+-----------+--------+--------------+------------------------+-----+----------------------------------------------------------------------------------------------+
|31279605299|12121212|Checklist _API|2020-03-17T15:50:30.640Z|Event|{"sourceApp":"ios","questionnaire":{"version":"1.0","question":"How to resolve ? ","fb":"Na"}}|
+-----------+--------+--------------+------------------------+-----+----------------------------------------------------------------------------------------------+
3. using from_json(..)
val processed = df.select(
expr("from_json(jsonCol, 'struct<Id:string,Type:string,client:string,eventTime:string, eventType:string," +
"payload:struct<questionnaire:struct<fb:string,question:string,version:string>,sourceApp:string>>')")
.as("json_converted"))
processed.show(false)
processed.printSchema()
Output-
+-------------------------------------------------------------------------------------------------------------+
|json_converted |
+-------------------------------------------------------------------------------------------------------------+
|[31279605299, 12121212, Checklist _API, 2020-03-17T15:50:30.640Z, Event, [[Na, How to resolve ? , 1.0], ios]]|
+-------------------------------------------------------------------------------------------------------------+
root
|-- json_converted: struct (nullable = true)
| |-- Id: string (nullable = true)
| |-- Type: string (nullable = true)
| |-- client: string (nullable = true)
| |-- eventTime: string (nullable = true)
| |-- eventType: string (nullable = true)
| |-- payload: struct (nullable = true)
| | |-- questionnaire: struct (nullable = true)
| | | |-- fb: string (nullable = true)
| | | |-- question: string (nullable = true)
| | | |-- version: string (nullable = true)
| | |-- sourceApp: string (nullable = true)
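A likely cause of the NULL result is that the question's schema declares payload and questionnaire as ArrayType, while the sample record holds plain JSON objects there. The sketch below expresses step 3 with an explicit StructType instead of a DDL string; the names nested, json and flat are illustrative, and the record is the one from the question.

```scala
import org.apache.spark.sql.{Encoders, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

val spark = SparkSession.builder().master("local[*]").appName("from-json-struct").getOrCreate()

// payload and questionnaire are JSON objects, so they are modelled as structs, not arrays.
val questionnaire = new StructType()
  .add("version", StringType)
  .add("question", StringType)
  .add("fb", StringType)
val payload = new StructType()
  .add("sourceApp", StringType)
  .add("questionnaire", questionnaire)
val schema = new StructType()
  .add("Id", StringType)
  .add("Type", StringType)
  .add("client", StringType)
  .add("eventTime", StringType)
  .add("eventType", StringType)
  .add("payload", payload)

val data = """{"Id":"31279605299","Type":"12121212","client":"Checklist _API","eventTime":"2020-03-17T15:50:30.640Z","eventType":"Event","payload":{"sourceApp":"ios","questionnaire":{"version":"1.0","question":"How to resolve ? ","fb":"Na"}}}"""
val df = spark.createDataset(Seq(data))(Encoders.STRING).toDF("_c1")

val nested = df.select(from_json(col("_c1"), schema).as("json"))

// Pull the nested fields up into ordinary columns.
val flat = nested.select(
  col("json.Id"), col("json.Type"), col("json.client"),
  col("json.eventTime"), col("json.eventType"),
  col("json.payload.sourceApp"),
  col("json.payload.questionnaire.version"),
  col("json.payload.questionnaire.question"),
  col("json.payload.questionnaire.fb"))

flat.show(false)
```

With the struct-based schema no explode is needed, because the record contains no arrays.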
Instead of reading it with a schema, I converted the column to a Dataset of strings:
val Df = json_DF.map(r => r.getString(0))
This pulls the data out as strings. Reading those strings as JSON then produces a DataFrame with the keys as column names:
val g1DF=spark.read.json(Df)
I then used nested LATERAL VIEW explode to pull out the nested array values.

what is optimal way to parse following kafka JSON message to pyspark dataframe?

I'm using Spark Structured Streaming to read a Kafka topic and want to convert the following complex JSON (Kafka messages) into a DataFrame with the columns "NAME,ADDRESS,DESCRIPTION,CODE,DEPARTMENT,INFA_OP_TYPE,DTL__CAPXTIMESTAMP".
{
"meta_data": [{"name":{"string":"INFA_SEQUENCE"},"value":
{"string":"2,PWX_GENERIC"},"type":null},
{"name":{"string":"INFA_TABLE_NAME"},"value":{"string":"customers"},"type":null},
{"name":{"string":"INFA_OP_TYPE"},"value":{"string":"INSERT_EVENT"},"type":null},
{"name":{"string":"DTL__CAPXRESTART1"},"value":{"string":"B+IABwAfA"},"type":null},
{"name":{"string":"DTL__CAPXRESTART2"},"value":{"string":"AAABpMwgRDk="},"type":null},
{"name":{"string":"DTL__CAPXUOW"},"value":{"string":"AAMKPgAAqaIABg=="},"type":null},
{"name":{"string":"DTL__CAPXUSER"},"value":null,"type":null},
{"name":{"string":"DTL__CAPXTIMESTAMP"},"value":{"string":"201807310934257270000000"},"type":null},
{"name":{"string":"DTL__CAPXACTION"},"value":{"string":"I"},"type":null}],
"columns":{"array":[{"name":{"string":"NAME"},"value":{"string":"ABCD"},"isPresent":{"boolean":true}},
{"name":{"string":"ADDRESS"},"value":{"string":"123,Bark street"},"isPresent":{"boolean":true}},
{"name":{"string":"DESCRIPTION"},"value":{"string":"Canadian"},"isPresent":{"boolean":true}},
{"name":{"string":"CODE"},"value":{"string":"3_1"},"isPresent":{"boolean":true}},
{"name":{"string":"DEPARTMENT"},"value":{"string":"HR"},"isPresent":{"boolean":true}}
] }
}
I'm able to extract the two JSON objects "meta_data" and "columns", but I'm unable to explode "columns.array":
newJsonObj = events.select(get_json_object(events.value,'$.meta_data').alias('meta_data'),get_json_object(events.value,'$.columns.array').alias('columns'))
I also don't know how to extract values from the two JSON objects and create a DataFrame with columns from both of them.
-- Schema of events dataframe --
root
|-- columns: struct (nullable = true)
| |-- array: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- isPresent: struct (nullable = true)
| | | | |-- boolean: boolean (nullable = true)
| | | |-- name: struct (nullable = true)
| | | | |-- string: string (nullable = true)
| | | |-- value: struct (nullable = true)
| | | | |-- string: string (nullable = true)
|-- meta_data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: struct (nullable = true)
| | | |-- string: string (nullable = true)
| | |-- type: string (nullable = true)
| | |-- value: struct (nullable = true)
| | | |-- string: string (nullable = true)

spark hivecontext working with queries issues

I'm trying to get information from Jsons to create tables in Hive.
This is my Json schema:
root
|-- info: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- stations: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- bikes: string (nullable = true)
| | | | |-- id: string (nullable = true)
| | | | |-- slots: string (nullable = true)
| | | | |-- streetName: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | |-- updateTime: long (nullable = true)
|-- date: string (nullable = true)
|-- numRecords: string (nullable = true)
I'm using this query:
sqlContext.sql("SELECT info.updateTime FROM STATIONS").foreach(println)
This is what I get:
[WrappedArray(1449098169, 1449108553, 1449098468)]
But I don't know how to put this information into a table so that I can use it afterwards from the Hive console.
I used this:
query.write.save("/home/cloudera/Desktop/select")
It creates something, but I don't know how to use it.
Thanks
You can do it in several ways...it depends.
First way: Have the table created in the query
sqlContext.sql("create table mytable AS SELECT info.updateTime FROM STATIONS")
// now you can query mytable
Second way: write the DataFrame with saveAsTable()
sqlContext.sql("SELECT info.updateTime FROM STATIONS").write.saveAsTable("othertable")
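Both options store the WrappedArray as a single array-valued row. If one row per timestamp is preferred, the array can be exploded before saving. The following is a minimal sketch, assuming Spark 2.2 or later with a SparkSession instead of the question's HiveContext; the sample record (including its date value), the table name update_times and the temporary warehouse directory are hypothetical. For the table to be visible from the Hive console, the session would also need Hive support enabled, which this local sketch leaves out.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{Encoders, SparkSession}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("save-update-times")
  // Throwaway warehouse directory so the sketch runs anywhere; drop this on a real cluster.
  .config("spark.sql.warehouse.dir", Files.createTempDirectory("warehouse").toString)
  .getOrCreate()

// Hypothetical record following the question's schema (stations omitted for brevity).
val sample =
  """{"info":[{"updateTime":1449098169},{"updateTime":1449108553},{"updateTime":1449098468}],"date":"2015-12-03","numRecords":"3"}"""
spark.read
  .json(spark.createDataset(Seq(sample))(Encoders.STRING))
  .createOrReplaceTempView("STATIONS")

// info.updateTime is an array<bigint>; explode turns it into one row per value.
val updateTimes = spark.sql("SELECT explode(info.updateTime) AS updateTime FROM STATIONS")

// Persist the result as a managed table that later queries can read by name.
updateTimes.write.mode("overwrite").saveAsTable("update_times")

spark.table("update_times").show()
```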