I am learning Spark and was building a sample project. I have a Spark DataFrame that produced the following output when I saved it to a JSON file:
{"str":["1001","19035004":{"Name":"Chris","Age":"29","Location":"USA"}]}
{"str":["1002","19035005":{"Name":"John","Age":"20","Location":"France"}]}
{"str":["1003","19035006":{"Name":"Mark","Age":"30","Location":"UK"}]}
{"str":["1004","19035007":{"Name":"Mary","Age":"22","Location":"UK"}]}
JSONInput.show() gave me something like the below:
+---------------------------------------------------------------------+
|str |
+---------------------------------------------------------------------+
|[1001,{"19035004":{"Name":"Chris","Age":"29","Location":"USA"}}] |
|[1002,{"19035005":{"Name":"John","Age":"20","Location":"France"}}] |
|[1003,{"19035006":{"Name":"Mark","Age":"30","Location":"UK"}}] |
|[1004,{"19035007":{"Name":"Mary","Age":"22","Location":"UK"}}] |
+---------------------------------------------------------------------+
I know this is not the correct syntax for JSON, but this is what I have.
How can I get this into a relational structure like the following (I am pretty new to JSON and Spark, so this part is not mandatory):
Name Age Location
-----------------------
Chris 29 USA
John 20 France
Mark 30 UK
Mary 22 UK
And I want to filter for the specific country:
val resultToReturn = JSONInput.filter("Location=USA")
But this results in the below error:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
cannot resolve Location given input columns: [str]; line 1 pos 0;
How do I get rid of "str" and get the data into a proper JSON structure? Can anyone help?
You can use from_json to parse the string values:
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types._

val schema = MapType(
  StringType,
  StructType(Array(
    StructField("Name", StringType, true),
    StructField("Age", StringType, true),
    StructField("Location", StringType, true)
  )),
  true
)

val resultToReturn = JSONInput.select(
  explode(from_json(col("str")(1), schema))
).select("value.*")
resultToReturn.show
resultToReturn.show
//+-----+---+--------+
//| Name|Age|Location|
//+-----+---+--------+
//|Chris| 29| USA|
//| John| 20| France|
//| Mark| 30| UK|
//| Mary| 22| UK|
//+-----+---+--------+
Then you can filter :
resultToReturn.filter("Location = 'USA'").show
//+-----+---+--------+
//| Name|Age|Location|
//+-----+---+--------+
//|Chris| 29| USA|
//+-----+---+--------+
You can extract the innermost JSON using regexp_extract and parse that JSON using from_json. Then you can star-expand the extracted JSON struct.
val parsed_df = JSONInput.selectExpr("""
  from_json(
    regexp_extract(str[1], '(\\{[^{}]+\\})', 1),
    'Name string, Age string, Location string'
  ) as parsed
""").select("parsed.*")
parsed_df.show(false)
+-----+---+--------+
|Name |Age|Location|
+-----+---+--------+
|Chris|29 |USA |
|John |20 |France |
|Mark |30 |UK |
|Mary |22 |UK |
+-----+---+--------+
And you can filter it using
val filtered = parsed_df.filter("Location = 'USA'")
PS: remember to add single quotes around USA, since it is a string literal.
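To see what the regexp_extract pattern actually grabs, here is the same extract-then-parse step in plain Python (re and json stdlib only; the sample line is one row from the question):

```python
import json
import re

line = '[1001,{"19035004":{"Name":"Chris","Age":"29","Location":"USA"}}]'

# Same pattern as in regexp_extract: the first brace-delimited span that
# contains no nested braces, i.e. the innermost JSON object.
m = re.search(r'(\{[^{}]+\})', line)
inner = json.loads(m.group(1))
print(inner)  # {'Name': 'Chris', 'Age': '29', 'Location': 'USA'}
```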
I have a json like below:
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
{"name":"Bob", "age":29,"city":"New York"}
{"name":"Ross", "age":49,"data":{"id":1,"Name":"Test"}}
The following pyspark code:
sc = spark.sparkContext
peopleDF = spark.read.json("people.json")
peopleDF.createOrReplaceTempView("people")
tableDF = spark.sql("SELECT * from people")
tableDF.show()
Produces this output:
+----+--------+---------+-------+
| age| city| data| name|
+----+--------+---------+-------+
|null| null| null|Michael|
| 30| null| null| Andy|
| 19| null| null| Justin|
| 29|New York| null| Bob|
| 49| null|{Test, 1}| Ross|
+----+--------+---------+-------+
But I'm looking for an output like below (notice how the elements inside data have become columns):
+----+--------+----+----+-------+
| age| city| id|Name| name|
+----+--------+----+----+-------+
|null| null|null|null|Michael|
| 30| null|null|null| Andy|
| 19| null|null|null| Justin|
| 29|New York|null|null| Bob|
| 49| null| 1|Test| Ross|
+----+--------+----+----+-------+
The fields inside the data struct change constantly, so I cannot pre-define the columns. Is there a function in PySpark that can automatically promote every element of a struct to a top-level column? (It's okay if the performance is slow.)
You can use the "." operator to access nested elements and flatten your schema.
import spark.implicits._
val js = """[{"name":"Michael"},{"name":"Andy", "age":30},{"name":"Justin", "age":19},{"name":"Bob", "age":29,"city":"New York"},{"name":"Ross", "age":49,"data":{"id":1,"Name":"Test"}}]"""
val df = spark.read.json(Seq(js).toDS).select("age", "city", "data.Name", "data.id", "name")
df.show()
+----+--------+----+----+-------+
| age| city|Name| id| name|
+----+--------+----+----+-------+
|null| null|null|null|Michael|
| 30| null|null|null| Andy|
| 19| null|null|null| Justin|
| 29|New York|null|null| Bob|
| 49| null|Test| 1| Ross|
+----+--------+----+----+-------+
If you want to flatten schema without selecting columns manually, you can use the following method to do it:
import org.apache.spark.sql.Column
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.functions.col
def flattenSchema(schema: StructType, prefix: String = null): Array[Column] = {
  schema.fields.flatMap(f => {
    val colName = if (prefix == null) f.name else (prefix + "." + f.name)
    f.dataType match {
      case st: StructType => flattenSchema(st, colName)
      case _ => Array(col(colName))
    }
  })
}
val js = """[{"name":"Michael"},{"name":"Andy", "age":30},{"name":"Justin", "age":19},{"name":"Bob", "age":29,"city":"New York"},{"name":"Ross", "age":49,"data":{"id":1,"Name":"Test"}}]"""
val df = spark.read.json(Seq(js).toDS)
df.select(flattenSchema(df.schema):_*).show()
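The same recursion can be sketched in plain Python over a nested-dict stand-in for the schema (a toy helper for illustration, not part of the Spark API), which may make the traversal easier to follow:

```python
# Flatten nested field names the way flattenSchema walks a StructType:
# a dict value stands in for a nested struct, anything else for a leaf type.
def flatten_fields(schema, prefix=None):
    cols = []
    for name, dtype in schema.items():
        full = name if prefix is None else f"{prefix}.{name}"
        if isinstance(dtype, dict):      # nested struct: recurse with the prefix
            cols.extend(flatten_fields(dtype, full))
        else:                            # leaf: emit the dotted column name
            cols.append(full)
    return cols

schema = {"age": "long", "city": "string",
          "data": {"Name": "string", "id": "long"}, "name": "string"}
print(flatten_fields(schema))
# ['age', 'city', 'data.Name', 'data.id', 'name']
```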
I've parsed a nested json file using the following code:
df = df.select(df.data.attributes.signatures_by_constituency.ons_code.alias("wc_code"),
               df.data.attributes.signatures_by_constituency.signature_count.alias("sign_count"))
This results in the following dataframe with values sitting in arrays.
+--------------------+--------------------+
| wc_code| sign_count|
+--------------------+--------------------+
|[E14000530, E1400...|[28, 6, 17, 15, 5...|
+--------------------+--------------------+
The next step is to explode the arrays into rows, for which I'm using explode() as follows:
df1 = spark.createDataFrame(df.withColumn("wc_count", F.explode(F.arrays_zip("wc_code", "sign_count")))\
    .select("wc_count.wc_code", "wc_count.sign_count")\
    .show()
)
which results in the following error:
+---------+----------+
| wc_code|sign_count|
+---------+----------+
|E14000530| 28|
|E14000531| 6|
|E14000532| 17|
|E14000533| 15|
|E14000534| 54|
|E14000535| 12|
|E14000536| 34|
|E14000537| 10|
|E14000538| 32|
|E14000539| 29|
|E14000540| 3|
|E14000541| 10|
|E14000542| 8|
|E14000543| 13|
|E14000544| 15|
|E14000545| 19|
|E14000546| 8|
|E14000547| 28|
|E14000548| 7|
|E14000549| 13|
+---------+----------+
only showing top 20 rows
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-43-16ce85050366> in <module>
2
3 df1 = spark.createDataFrame(df.withColumn("wc_count", F.explode(F.arrays_zip("wc_code", "sign_count")))\
----> 4 .select("wc_count.wc_code", "wc_count.sign_count")\
5 .show()
6 )
/usr/local/spark/python/pyspark/sql/context.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
305 Py4JJavaError: ...
306 """
--> 307 return self.sparkSession.createDataFrame(data, schema, samplingRatio, verifySchema)
308
309 #since(1.3)
/usr/local/spark/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
746 rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
747 else:
--> 748 rdd, schema = self._createFromLocal(map(prepare, data), schema)
749 jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
750 jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
TypeError: 'NoneType' object is not iterable
Note that the table is displayed.
I don't get the error if I simply use df1 = ..., but no matter what I do with the variable afterwards, it errors out with 'NoneType' object has no attribute 'something'.
If I don't assign the above to a variable and don't use df1 = spark.createDataFrame(), I don't get this error, so I'm guessing something breaks when the variable gets created.
In case this is important, the df.printSchema() produces the following:
root
|-- wc_code: array (nullable = true)
| |-- element: string (containsNull = true)
|-- sign_count: array (nullable = true)
| |-- element: long (containsNull = true)
Can you try the query below?
val df = spark.sql("select array('E14000530', 'E1400') as wc_code, array(28, 6, 17, 15) as sign_count")
df.selectExpr("inline_outer(arrays_zip(wc_code, sign_count)) as (wc_code, sign_count)").show()
/**
* +---------+----------+
* | wc_code|sign_count|
* +---------+----------+
* |E14000530| 28|
* | E1400| 6|
* | null| 17|
* | null| 15|
* +---------+----------+
*/
I think I've fixed the issue.
Running the code like this seems to have fixed it:
df1 = df.withColumn("wc_count", F.explode(F.arrays_zip("wc_code", "sign_count")))\
    .select("wc_count.wc_code", "wc_count.sign_count")
For some reason, calling .show() at the end of it was messing with the newly created dataframe. Would be curious to know if there's a valid reason or if it's a bug.
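A likely explanation (my note, not from the thread): in PySpark, DataFrame.show() only prints and returns None, so the original snippet assigned None and then handed it to spark.createDataFrame(). A toy stand-in in plain Python (hypothetical Frame class) shows the chaining pitfall:

```python
class Frame:
    """Toy stand-in for a DataFrame: select() chains, show() prints."""
    def __init__(self, rows):
        self.rows = rows

    def select(self, *cols):
        return Frame(self.rows)      # returns a new Frame, safe to chain

    def show(self):
        print(self.rows)             # prints, but returns None

df1 = Frame([1, 2]).select("a").show()
print(df1 is None)  # True: assigning the result of show() yields None
```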
I have a CSV with a multi-line field, which I am trying to load through Spark as a DataFrame.
Cust_id, cust_address, city,zip
1, "1289 cobb parkway
Bufford", "ATLANTA",34343
2, "1234 IVY lane
Decatur", "ATLANTA",23435
val df = spark.read.format("csv")
  .option("multiLine", true)
  .option("header", true)
  .option("escape", "\"")
  .load("/home/SPARK/file.csv")
df.show()
This shows me the data frame like this:
+--------+-------------------+-----+----+
| id | address | city| zip|
+--------+-------------------+-----+----+
| 1| "1289 cobb parkway| null|null|
|Bufford"| "ATLANTA"|34343|null|
| 2| "1234 IVY lane| null|null|
|Decatur"| "ATLANTA"|23435|null|
+--------+-------------------+-----+----+
I want output like:
+---+--------------------+-------+-----+
| id| address| city| zip|
+---+--------------------+-------+-----+
| 1|1289 cobb parkway...|ATLANTA|34343|
| 2|1234 IVY lane Dec...|ATLANTA|23435|
+---+--------------------+-------+-----+
val File = sqlContext.read.format("com.databricks.spark.csv")
.option("delimiter", delimiter)
.option("header",true)
.option("quote", "\"")
.option("multiLine", "true")
.option("inferSchema", "true")
.option("parserLib", "UNIVOCITY")
.option("ignoreTrailingWhiteSpace","true")
.option("ignoreLeadingWhiteSpace", true)
.load(file_name)
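For intuition on the multiLine behavior: a conforming CSV parser treats a newline inside a quoted field as part of the field, not as a record break. A plain-Python check with the stdlib csv module (sample rows adapted from the question):

```python
import csv
import io

# Two records whose address field spans two physical lines
raw = ('Cust_id,cust_address,city,zip\n'
       '1,"1289 cobb parkway\nBufford",ATLANTA,34343\n'
       '2,"1234 IVY lane\nDecatur",ATLANTA,23435\n')

rows = list(csv.reader(io.StringIO(raw)))
print(len(rows))   # 3: header + 2 records, despite 5 physical lines
print(rows[1][1])  # the embedded newline stays inside the field
```

Note that the question's file has a space before each opening quote (`1, "1289 ...`), so the quote is not at the start of the field and gets treated literally; that is what the ignoreLeadingWhiteSpace option in the answer above fixes.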
I have an existing Spark dataframe that has columns as such:
--------------------
pid | response
--------------------
12 | {"status":"200"}
response is a string column.
Is there a way to cast it into JSON and extract specific fields? Can lateral view be used as it is in Hive? I looked up some examples online that used explode and lateral view, but they don't seem to work with Spark 2.1.1.
From pyspark.sql.functions, you can use any of from_json, get_json_object, or json_tuple to extract fields from a JSON string, as below:
>>> from pyspark.sql.functions import json_tuple,from_json,get_json_object
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.getOrCreate()
>>> l = [(12, '{"status":"200"}'),(13,'{"status":"200","somecol":"300"}')]
>>> df = spark.createDataFrame(l,['pid','response'])
>>> df.show()
+---+--------------------+
|pid| response|
+---+--------------------+
| 12| {"status":"200"}|
| 13|{"status":"200",...|
+---+--------------------+
>>> df.printSchema()
root
|-- pid: long (nullable = true)
|-- response: string (nullable = true)
Using json_tuple:
>>> df.select('pid',json_tuple(df.response,'status','somecol')).show()
+---+---+----+
|pid| c0| c1|
+---+---+----+
| 12|200|null|
| 13|200| 300|
+---+---+----+
Using from_json:
>>> from pyspark.sql.types import StructType, StructField, StringType
>>> schema = StructType([StructField("status", StringType()),StructField("somecol", StringType())])
>>> df.select('pid',from_json(df.response, schema).alias("json")).show()
+---+----------+
|pid| json|
+---+----------+
| 12|[200,null]|
| 13| [200,300]|
+---+----------+
Using get_json_object:
>>> df.select('pid',get_json_object(df.response,'$.status').alias('status'),get_json_object(df.response,'$.somecol').alias('somecol')).show()
+---+------+-------+
|pid|status|somecol|
+---+------+-------+
| 12| 200| null|
| 13| 200| 300|
+---+------+-------+
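Outside Spark, the same per-row extraction can be sanity-checked with the json stdlib (plain Python, not a Spark API):

```python
import json

rows = [(12, '{"status":"200"}'),
        (13, '{"status":"200","somecol":"300"}')]

# Mirror get_json_object('$.status') / '$.somecol' with dict lookups;
# .get() returns None for absent keys, matching the null cells above.
extracted = [(pid, json.loads(resp).get("status"), json.loads(resp).get("somecol"))
             for pid, resp in rows]
print(extracted)  # [(12, '200', None), (13, '200', '300')]
```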
I have two DataFrames, DF1 and DF2, and a JSON file which I need to use as a mapping file to create another dataframe (DF3).
DF1:
+-------+-------+-------+
|column1|column2|column3|
+-------+-------+-------+
| 100| John| Mumbai|
| 101| Alex| Delhi|
| 104| Divas|Kolkata|
| 108| Jerry|Chennai|
+-------+-------+-------+
DF2:
+-------+-----------+-------+
|column4| column5|column6|
+-------+-----------+-------+
| S1| New| xxx|
| S2| Old| yyy|
| S5|replacement| zzz|
| S10| New| ppp|
+-------+-----------+-------+
Apart from these, I have one mapping file in JSON format which will be used to generate DF3.
Below is the JSON mapping file:
{"targetColumn":"newColumn1","sourceField1":"column2","sourceField2":"column4"}
{"targetColumn":"newColumn2","sourceField1":"column7","sourceField2":"column5"}
{"targetColumn":"newColumn3","sourceField1":"column8","sourceField2":"column6"}
So from this JSON file I need to create DF3 with the columns named in the targetColumn fields of the mapping. For each target column, if sourceField1 is present in DF1, the value comes from DF1; otherwise it comes from sourceField2 in DF2.
Below is the expected output.
+----------+-----------+----------+
|newColumn1| newColumn2|newColumn3|
+----------+-----------+----------+
| John| New| xxx|
| Alex| Old| yyy|
| Divas|replacement| zzz|
| Jerry| New| ppp|
+----------+-----------+----------+
Any help here will be appreciated.
Parse the JSON and create the below List of custom objects:
case class SrcTgtMapping(targetColumn: String, sourceField1: String, sourceField2: String)

val srcTgtMappingList = List(
  SrcTgtMapping("newColumn1", "column2", "column4"),
  SrcTgtMapping("newColumn2", "column7", "column5"),
  SrcTgtMapping("newColumn3", "column8", "column6")
)
Add a dummy index column to both dataframes and join them on that column:
import org.apache.spark.sql.functions._
val df1WithIndex = df1.withColumn("index", monotonically_increasing_id())
val df2WithIndex = df2.withColumn("index", monotonically_increasing_id())
val joinedDf = df1WithIndex.join(df2WithIndex, df1WithIndex.col("index") === df2WithIndex.col("index"))
Create the query and execute it:
val df1Columns = df1WithIndex.columns.toList
val df2Columns = df2WithIndex.columns.toList
val query = srcTgtMappingList.map(stm =>
  if (df1Columns.contains(stm.sourceField1)) joinedDf.col(stm.sourceField1).alias(stm.targetColumn)
  else joinedDf.col(stm.sourceField2).alias(stm.targetColumn)
)
val output = joinedDf.select(query: _*)
output.show
Sample Output:
+----------+-----------+----------+
|newColumn1| newColumn2|newColumn3|
+----------+-----------+----------+
| John| New| xxx|
| Alex| Old| yyy|
| Jerry| New| ppp|
| Divas|replacement| zzz|
+----------+-----------+----------+
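The column-selection rule above (prefer sourceField1 when DF1 has that column, otherwise fall back to sourceField2) can be sketched in plain Python over a toy row dict (an illustration of the logic, not the Spark code itself):

```python
# Toy record standing in for one row of the joined DF1+DF2
joined = {"column1": 100, "column2": "John", "column3": "Mumbai",
          "column4": "S1", "column5": "New", "column6": "xxx"}

# (targetColumn, sourceField1, sourceField2) triples from the mapping file
mapping = [("newColumn1", "column2", "column4"),
           ("newColumn2", "column7", "column5"),
           ("newColumn3", "column8", "column6")]

df1_columns = {"column1", "column2", "column3"}

# For each target, take sourceField1 if DF1 has it, else sourceField2
out = {tgt: joined[s1] if s1 in df1_columns else joined[s2]
       for tgt, s1, s2 in mapping}
print(out)  # {'newColumn1': 'John', 'newColumn2': 'New', 'newColumn3': 'xxx'}
```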
Hope this approach will help you.