Extract multiple columns from a JSON string

I have JSON data that I want to represent in tabular form and later write to a different format (Parquet).
Schema
root
|-- : string (nullable = true)
Sample data
+----------------------------------------------+
+----------------------------------------------+
|{"deviceTypeId":"A2A","deviceId":"123","geo...|
|{"deviceTypeId":"A2B","deviceId":"456","geo...|
+----------------------------------------------+
Expected Output
+------------+--------+---+
|deviceTypeId|deviceId|...|
+------------+--------+---+
|         A2A|     123|   |
|         A2B|     456|   |
+------------+--------+---+
I tried splitting the string, but this doesn't seem like an efficient approach:
split_col = split(df_explode[''], ',')
and then extracting the columns, but it carries the initial string along as well:
df_1 = df_explode.withColumn('deviceId', split_col.getItem(1))
# df_1 = df_explode.withColumn('deviceTypeId', split_col.getItem(0))
printOutput(df_1)
I'm looking for a better way to solve this problem.

The explode function only works on arrays. In your case, where the column is a JSON string, you should use the from_json function.
See from_json in pyspark.sql.functions.
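For example, a minimal sketch (assuming Spark 2.4+ for schema_of_json, and that the single input column is literally named with an empty string, as the schema above suggests):
from pyspark.sql.functions import col, from_json, schema_of_json

# Sketch: infer the struct from one sample row, then parse every row with from_json.
sample = df_explode.first()[0]                      # one raw JSON string
parsed = (df_explode
          .withColumn("json", from_json(col(''), schema_of_json(sample)))
          .select("json.*"))
parsed.show()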

I was able to do it using the from_json function.
# Convert the JSON column into multiple columns
from pyspark.sql.functions import col, from_json

schema = getSchema()
dfJSON = df_explode.withColumn("jsonData", from_json(col(''), schema)) \
                   .select("jsonData.*")
dfJSON.printSchema()
dfJSON.limit(100).toPandas()
We need to create a schema that from_json will use to parse the JSON data.
from pyspark.sql.types import StructType, StructField, StringType

def getSchema():
    schema = StructType([
        StructField('deviceTypeId', StringType()),
        StructField('deviceId', StringType()),
        # ... remaining fields
    ])
    return schema
The column name is an empty string in this data, which is why col('') is used to reference it.
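Since the original goal was to write the result out as Parquet, the final step is just a write call (a minimal sketch; the output path is illustrative):
# Sketch: persist the flattened dataframe as Parquet.
dfJSON.write.mode("overwrite").parquet("/tmp/devices_parquet")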

Related

Convert string column to json and parse in pyspark

My dataframe looks like
| ID | Notes                              |
--------------------------------------------
| 1  | '{"Country":"USA","Count":"1000"}' |
| 2  | {"Country":"USA","Count":"1000"}   |
ID : int
Notes : string
When I use from_json to parse the Notes column, it gives all null values.
I need help parsing this Notes column into separate columns in PySpark.
When you use the from_json() function, make sure the column value is exactly a JSON/dictionary in string format. In the sample data you have given, the Notes value for ID=1 is not valid JSON: it is a string wrapped in an additional pair of single quotes, and that is why NULL values are returned. On well-formed input, the following parses Notes into a map of strings:
df = df.withColumn("Notes", from_json(df.Notes, MapType(StringType(), StringType())))
You need to change your input data so that the entire Notes column is in the same format, a JSON/dictionary as a string and nothing more, since the stray quotes are the root cause of the issue. The table below shows the correct format; a sketch for stripping the stray quotes in Spark itself follows the table.
| ID | Notes                            |
------------------------------------------
| 1  | {"Country":"USA","Count":"1000"} |
| 2  | {"Country":"USA","Count":"1000"} |
To parse the Notes column values into separate columns in PySpark, you can simply use the json_tuple() function (no need for from_json()). It extracts the elements from a JSON string column and returns them as new columns.
df = df.select(col("id"),json_tuple(col("Notes"),"Country","Count")) \
.toDF("id","Country","Count")
df.show()
NOTE: json_tuple() also returns null if the column value is not in the correct format (make sure the column values are json/dictionary as a string without additional quotes).

pySpark - convert an entire dataframe column into JSON object before inserting into DB

My knowledge of PySpark is quite limited at this point, so I'm looking for a quick solution to this one issue in my current implementation. I'm attempting to read a JSON file via PySpark into a dataframe and convert it into an object that I can insert into a database table (DynamoDB). The columns in the table should correspond to the keys specified in the JSON file. For example, if my JSON file consists of the following elements:
{
    "Records":[
        {
            "column1":"Value1",
            "column2":"Value2",
            "column3":"Value3",
            "column4":{
                "sub1":"Value4",
                "sub2":"Value5",
                "sub3":{
                    "sub4":"Value6",
                    "sub5":"Value7"
                }
            }
        },
        {
            "column1":"Value8",
            "column2":"Value9",
            "column3":"Value10",
            "column4":{
                "sub1":"Value11",
                "sub2":"Value12",
                "sub3":{
                    "sub4":"Value13",
                    "sub5":"Value14"
                }
            }
        }
    ]
}
The columns in the database table are column1, column2, column3 and column4 respectively. In the case of column4, which is of Map type, I need the entire object to be converted to a string before it is inserted into the database. Hence, in the case of the first row, I can expect to see this for that column:
{
    "sub1":"Value4",
    "sub2":"Value5",
    "sub3":{
        "sub4":"Value6",
        "sub5":"Value7"
    }
}
However, this is what I am seeing in the database table after running my script:
{ Value4, Value5, { Value6, Value7 }}
I understand this is happening because something needs to be done to column4 before all of the column values are cast to String for the DB insertion operation:
for col in Rows.columns:
    Rows = Rows.withColumn(col, Rows[col].cast(StringType()))
I'm looking for a way to rectify the contents of column4 so that it represents the original JSON object before it is converted to String. Here is what I've written so far (DB insertion operation excluded):
import pyspark.sql.types as T
from pyspark.sql import functions as SF

df = spark.read.option("multiline", "true").json('/home/abhishek.tirkey/Documents/test')

Records = df.withColumn("Records", SF.explode(SF.col("Records")))

Rows = Records.select(
    "Records.column1",
    "Records.column2",
    "Records.column3",
    "Records.column4",
)

for col in Rows.columns:
    Rows = Rows.withColumn(col, Rows[col].cast(T.StringType()))

RowsJSON = Rows.toJSON()
There is a to_json function to do that:
from pyspark.sql import functions as F

df = df.withColumn("record", F.explode("records")).select(
    "record.column1",
    "record.column2",
    "record.column3",
    F.to_json("record.column4").alias("column4"),
)
df.show()
+-------+-------+-------+--------------------+
|column1|column2|column3| column4|
+-------+-------+-------+--------------------+
| Value1| Value2| Value3|{"sub1":"Value4",...|
| Value8| Value9|Value10|{"sub1":"Value11"...|
+-------+-------+-------+--------------------+
df.printSchema()
root
|-- column1: string (nullable = true)
|-- column2: string (nullable = true)
|-- column3: string (nullable = true)
|-- column4: string (nullable = true)

How to infer schema of serialized JSON column in Spark SQL?

I have a table with one column that is serialized JSON. I want to apply schema inference on this JSON column, but I don't know the schema to pass as input for JSON extraction (e.g. the from_json function).
I can do this in Scala like this:
val contextSchema = spark.read.json(data.select("context").as[String]).schema
val updatedData = data.withColumn("context", from_json(col("context"), contextSchema))
How can I translate this solution into pure Spark SQL?
For Spark SQL, use toDDL to generate the schema string, then use that schema in from_json.
Example:
df.show(10,false)
//+---+-------------------+
//|seq|json |
//+---+-------------------+
//|1 |{"id":1,"name":"a"}|
//+---+-------------------+
val sch=spark.read.json(df.select("json").as[String]).schema.toDDL
//sch: String = `id` BIGINT,`name` STRING
df.createOrReplaceTempView("tmp")
spark.sql(s"""select seq,jsn.* from (select *,from_json(json,"$sch") as jsn from tmp)""").
show(10,false)
//+---+---+----+
//|seq|id |name|
//+---+---+----+
//|1 |1 |a |
//+---+---+----+
You can use the schema_of_json() function to infer the JSON schema:
select from_json(<column_name>, schema_of_json(<sample_JSON>)) from <table>
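For instance, using the tmp view and sample row from the previous answer (an illustrative sketch; schema_of_json needs a foldable string, so one sample document is hard-coded):
# Sketch: infer the schema from a hard-coded sample document, then apply it to every row.
spark.sql("""
    SELECT seq, parsed.*
    FROM (
        SELECT *, from_json(json, schema_of_json('{"id":1,"name":"a"}')) AS parsed
        FROM tmp
    )
""").show(10, False)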

Spark: How to merge json objects into array

I have the following JSON array in one of the columns of a Dataframe in Spark (2.4):
[{"quoteID":"12411736"},{"quoteID":"12438257"},{"quoteID":"12438288"},{"quoteID":"12438296"},{"quoteID":"12438299"}]
I am trying to merge it as:
{"quoteIDs":["12411736","12438257","12438288","12438296","12438299"]}
I need help with this.
Use from_json to read your column data as array<struct>, then flatten it by exploding.
Finally, do groupBy + collect_list to create the array and to_json to produce the required JSON.
Example:
val df=Seq(("a","""[{"quoteID":"123"},{"quoteID":"456"}]""")).toDF("i","j")
df.show(false)
//+---+-------------------------------------+
//|i |j |
//+---+-------------------------------------+
//|a |[{"quoteID":"123"},{"quoteID":"456"}]|
//+---+-------------------------------------+
val sch=ArrayType(new StructType().add("quoteID",StringType))
val drop_cols=Seq("tmp1","tmp2","j")
// if you need it as a column in the dataframe
df.withColumn("tmp1",from_json(col("j"),sch)).
  withColumn("tmp2",explode(col("tmp1"))).
  selectExpr("*","tmp2.*").
  groupBy("i").
  agg(collect_list("quoteID").alias("quoteIDS")).
  withColumn("quoteIDS",to_json(struct(col("quoteIDS")))).
  drop(drop_cols:_*).
  show(false)
//+---+--------------------------+
//|i |quoteIDS |
//+---+--------------------------+
//|a |{"quoteIDS":["123","456"]}|
//+---+--------------------------+
// if you need it as a json string
val df1=df.withColumn("tmp1",from_json(col("j"),sch)).
  withColumn("tmp2",explode(col("tmp1"))).
  selectExpr("*","tmp2.*").
  groupBy("i").
  agg(collect_list("quoteID").alias("quoteIDS"))
// toJSON (or write as a json file)
df1.toJSON.first
//String = {"i":"a","quoteIDS":["123","456"]}

Write Spark dataframe as CSV with partitions

I'm trying to write a dataframe in Spark to an HDFS location, and I expect that if I add the partitionBy notation, Spark will create partition folders (similar to writing in Parquet format) of the form
partition_column_name=partition_value
(i.e. partition_date=2016-05-03). To do so, I ran the following command:
(df.write
   .partitionBy('partition_date')
   .mode('overwrite')
   .format("com.databricks.spark.csv")
   .save('/tmp/af_organic'))
but the partition folders were not created.
Any idea what I should do in order for Spark to automatically create those folders?
Thanks,
Spark 2.0.0+:
Built-in csv format supports partitioning out of the box so you should be able to simply use:
df.write.partitionBy('partition_date').mode(mode).format("csv").save(path)
without including any additional packages.
Spark < 2.0.0:
At this moment (v1.4.0) spark-csv doesn't support partitionBy (see databricks/spark-csv#123) but you can adjust built-in sources to achieve what you want.
You can try two different approaches. Assuming your data is relatively simple (no complex strings and no need for character escaping) and looks more or less like this:
df = sc.parallelize([
    ("foo", 1, 2.0, 4.0), ("bar", -1, 3.5, -0.1)
]).toDF(["k", "x1", "x2", "x3"])
You can manually prepare values for writing:
from pyspark.sql.functions import col, concat_ws
key = col("k")
values = concat_ws(",", *[col(x) for x in df.columns[1:]])
kvs = df.select(key, values)
and write using the text source:
kvs.write.partitionBy("k").text("/tmp/foo")
You can then read a single partition back, for example with spark-csv:
df_foo = (sqlContext.read.format("com.databricks.spark.csv")
    .options(inferSchema="true")
    .load("/tmp/foo/k=foo"))
df_foo.printSchema()
## root
## |-- C0: integer (nullable = true)
## |-- C1: double (nullable = true)
## |-- C2: double (nullable = true)
In more complex cases you can use a proper CSV parser to preprocess the values in a similar way, either with a UDF or by mapping over the RDD, but it will be significantly more expensive; a sketch of the UDF variant follows.
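For illustration, a minimal sketch of the UDF variant using Python's csv module to handle quoting and escaping (a sketch assuming Python 3, not part of the original answer; column names match the toy dataframe above):
import csv
import io

from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

def to_csv_line(*values):
    # Serialize one row's values into a single, properly quoted CSV line.
    buf = io.StringIO()
    csv.writer(buf).writerow(values)
    return buf.getvalue().rstrip("\r\n")

to_csv_line_udf = udf(to_csv_line, StringType())

kvs = df.select(col("k"), to_csv_line_udf(*[col(x) for x in df.columns[1:]]).alias("value"))
kvs.write.partitionBy("k").text("/tmp/foo_csv")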
If CSV format is not a hard requirement, you can also use the JSON writer, which supports partitionBy out of the box:
df.write.partitionBy("k").json("/tmp/bar")
as well as partition discovery on read.
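Reading that directory back recovers k as a column through partition discovery (a sketch; the path matches the write above):
df_bar = sqlContext.read.json("/tmp/bar")
df_bar.printSchema()   # expect x1, x2, x3 plus the partition column k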