I now have JSON data as follows:
{"Id":11,"data":[{"package":"com.browser1","activetime":60000},{"package":"com.browser6","activetime":1205000},{"package":"com.browser7","activetime":1205000}]}
{"Id":12,"data":[{"package":"com.browser1","activetime":60000},{"package":"com.browser6","activetime":1205000}]}
......
This JSON records the active time of each app; the goal is to analyze the total active time per app.
I use Spark SQL to parse the JSON:
val sqlContext = spark.sqlContext
val behavior = sqlContext.read.json("behavior-json.log")
behavior.cache()
behavior.createOrReplaceTempView("behavior")
val appActiveTime = sqlContext.sql("SELECT data FROM behavior") // SQL query
appActiveTime.show(100100) // print the DataFrame
appActiveTime.rdd.foreach(println) // print RDD
But the printed DataFrame looks like this:
+----------------------------------------------------------------------+
| data|
+----------------------------------------------------------------------+
| [[60000, com.browser1], [12870000, com.browser]]|
| [[60000, com.browser1], [120000, com.browser]]|
| [[60000, com.browser1], [120000, com.browser]]|
| [[60000, com.browser1], [1207000, com.browser]]|
| [[120000, com.browser]]|
| [[60000, com.browser1], [1204000, com.browser5]]|
| [[60000, com.browser1], [12075000, com.browser]]|
| [[60000, com.browser1], [120000, com.browser]]|
| [[60000, com.browser1], [1204000, com.browser]]|
| [[60000, com.browser1], [120000, com.browser]]|
| [[60000, com.browser1], [1201000, com.browser]]|
| [[1200400, com.browser5]]|
| [[60000, com.browser1], [1200400, com.browser]]|
|[[60000, com.browser1], [1205000, com.browser6], [1205000, com.browser7]]|
+----------------------------------------------------------------------+
The RDD looks like this:
[WrappedArray([60000, com.browser1], [60000, com.browser1])]
[WrappedArray([120000, com.browser])]
[WrappedArray([60000, com.browser1], [1204000, com.browser5])]
[WrappedArray([12075000, com.browser], [12075000, com.browser])]
I want to turn the data into:
com.browser1 60000
com.browser1 60000
com.browser 12075000
com.browser 12075000
...
I want to turn each array element in the RDD into its own row, though any other structure that is easy to analyze would also be fine. I am still new to Spark and Scala; I have tried this for a long time without success, so I hope you can guide me.
From your given JSON data, you can view the schema of your DataFrame with printSchema:
appActiveTime.printSchema()
root
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- activetime: long (nullable = true)
| | |-- package: string (nullable = true)
Since data is an array, you need to explode it and then select the struct fields, as below:
import org.apache.spark.sql.functions._
appActiveTime.withColumn("data", explode($"data"))
.select("data.*")
.show(false)
Output:
+----------+------------+
|activetime| package|
+----------+------------+
| 60000|com.browser1|
| 1205000|com.browser6|
| 1205000|com.browser7|
| 60000|com.browser1|
| 1205000|com.browser6|
+----------+------------+
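Since the stated goal is the total active time of each app, you can aggregate the exploded rows. A minimal sketch building on the DataFrame above (the column alias total_activetime is just illustrative):

import org.apache.spark.sql.functions.{col, explode, sum}

appActiveTime.withColumn("data", explode(col("data")))
  .select("data.package", "data.activetime")
  .groupBy("package")
  .agg(sum("activetime").as("total_activetime"))
  .show(false)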
Hope this helps!
With @Shankar Koirala's help, I learned how to use explode to handle a JSON array.
val df = sqlContext.sql("SELECT data FROM behavior")
df.select(explode(df("data"))).toDF("data")
  .select("data.package", "data.activetime")
  .show(false)
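Equivalently, you can skip the temp view and explode the original behavior DataFrame directly; a small sketch of the same idea:

import org.apache.spark.sql.functions.explode

behavior.select(explode(behavior("data")).as("data"))
  .select("data.package", "data.activetime")
  .show(false)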
For Apache Spark with Java, we need to do something like the below:
import static org.apache.spark.sql.functions.explode;

Dataset<Row> dataDF = spark.read()
        .json("/file_path");
dataDF.createOrReplaceTempView("behavior");
String sqlQuery = "SELECT data FROM behavior";
Dataset<Row> jsonData = spark.sql(sqlQuery);
jsonData.withColumn("data", explode(jsonData.col("data")))
        .select("data.*")
        .show();
Related
I have a JSON file like below:
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
{"name":"Bob", "age":29,"city":"New York"}
{"name":"Ross", "age":49,"data":{"id":1,"Name":"Test"}}
The following pyspark code:
sc = spark.sparkContext
peopleDF = spark.read.json("people.json")
peopleDF.createOrReplaceTempView("people")
tableDF = spark.sql("SELECT * from people")
tableDF.show()
Produces this output:
+----+--------+---------+-------+
| age| city| data| name|
+----+--------+---------+-------+
|null| null| null|Michael|
| 30| null| null| Andy|
| 19| null| null| Justin|
| 29|New York| null| Bob|
| 49| null|{Test, 1}| Ross|
+----+--------+---------+-------+
But I'm looking for output like below (notice how the elements inside data have become columns):
+----+--------+----+----+-------+
| age| city| id|Name| name|
+----+--------+----+----+-------+
|null| null|null|null|Michael|
| 30| null|null|null| Andy|
| 19| null|null|null| Justin|
| 29|New York|null|null| Bob|
| 49| null| 1|Test| Ross|
+----+--------+----+----+-------+
The fields inside the data struct change constantly, so I cannot pre-define the columns. Is there a function in pyspark that can automatically extract every element of a struct into a top-level column? (It's okay if the performance is slow.)
You can use the "." operator to access nested elements and flatten your schema:
import spark.implicits._
val js = """[{"name":"Michael"},{"name":"Andy", "age":30},{"name":"Justin", "age":19},{"name":"Bob", "age":29,"city":"New York"},{"name":"Ross", "age":49,"data":{"id":1,"Name":"Test"}}]"""
val df = spark.read.json(Seq(js).toDS).select("age", "city", "data.Name", "data.id", "name")
df.show()
+----+--------+----+----+-------+
| age| city|Name| id| name|
+----+--------+----+----+-------+
|null| null|null|null|Michael|
| 30| null|null|null| Andy|
| 19| null|null|null| Justin|
| 29|New York|null|null| Bob|
| 49| null|Test| 1| Ross|
+----+--------+----+----+-------+
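Note that the "." operator works here because data is a struct; unlike the array column in the first question above, no explode is needed before accessing the nested fields.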
If you want to flatten the schema without selecting columns manually, you can use the following recursive helper:
import org.apache.spark.sql.Column
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.functions.col
def flattenSchema(schema: StructType, prefix: String = null): Array[Column] = {
  schema.fields.flatMap(f => {
    val colName = if (prefix == null) f.name else prefix + "." + f.name
    f.dataType match {
      // Recurse into nested structs, carrying the dotted path as the prefix.
      case st: StructType => flattenSchema(st, colName)
      // Leaf field: select it by its full dotted path.
      case _ => Array(col(colName))
    }
  })
}
val js = """[{"name":"Michael"},{"name":"Andy", "age":30},{"name":"Justin", "age":19},{"name":"Bob", "age":29,"city":"New York"},{"name":"Ross", "age":49,"data":{"id":1,"Name":"Test"}}]"""
val df = spark.read.json(Seq(js).toDS)
df.select(flattenSchema(df.schema):_*).show()
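Note that flattenSchema only recurses into struct fields; an array column (like data in the first question above) still needs an explode first. A hedged sketch, assuming a hypothetical df2 with an array-of-structs column arr:

import org.apache.spark.sql.functions.{col, explode}

// Explode the array so each element becomes a struct row, then flatten.
val exploded = df2.withColumn("arr", explode(col("arr")))
exploded.select(flattenSchema(exploded.schema): _*).show()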
I am learning Spark and building a sample project. I have a Spark DataFrame with the following content when saved to a JSON file:
{"str":["1001","19035004":{"Name":"Chris","Age":"29","Location":"USA"}]}
{"str":["1002","19035005":{"Name":"John","Age":"20","Location":"France"}]}
{"str":["1003","19035006":{"Name":"Mark","Age":"30","Location":"UK"}]}
{"str":["1004","19035007":{"Name":"Mary","Age":"22","Location":"UK"}]}
JSONInput.show() gives me something like the below:
+---------------------------------------------------------------------+
|str |
+---------------------------------------------------------------------+
|[1001,{"19035004":{"Name":"Chris","Age":"29","Location":"USA"}}] |
|[1002,{"19035005":{"Name":"John","Age":"20","Location":"France"}}] |
|[1003,{"19035006":{"Name":"Mark","Age":"30","Location":"UK"}}] |
|[1004,{"19035007":{"Name":"Mary","Age":"22","Location":"UK"}}] |
+---------------------------------------------------------------------+
I know this is not valid JSON syntax, but it is what I have. How can I get this into a relational structure like the following? (I am pretty new to JSON and Spark, so this exact shape is not mandatory.)
Name Age Location
-----------------------
Chris 29 USA
John 20 France
Mark 30 UK
Mary 22 UK
Then I want to filter for a specific country:
val resultToReturn = JSONInput.filter("Location=USA")
But this results in the error below:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
cannot resolve Location given input columns: [str]; line 1 pos 0;
How do I get rid of str and get the data into a proper structure? Can anyone help?
You can use from_json to parse the string values:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.{col, explode, from_json}

val schema = MapType(
  StringType,
  StructType(Array(
    StructField("Name", StringType, true),
    StructField("Age", StringType, true),
    StructField("Location", StringType, true)
  )),
  true
)
val resultToReturn = JSONInput.select(
explode(from_json(col("str")(1), schema))
).select("value.*")
resultToReturn.show
//+-----+---+--------+
//| Name|Age|Location|
//+-----+---+--------+
//|Chris| 29| USA|
//| John| 20| France|
//| Mark| 30| UK|
//| Mary| 22| UK|
//+-----+---+--------+
Then you can filter:
resultToReturn.filter("Location = 'USA'").show
//+-----+---+--------+
//| Name|Age|Location|
//+-----+---+--------+
//|Chris| 29| USA|
//+-----+---+--------+
You can extract the innermost JSON using regexp_extract and parse that JSON using from_json. Then you can star-expand the extracted JSON struct.
val parsed_df = JSONInput.selectExpr("""
from_json(
regexp_extract(str[0], '(\\{[^{}]+\\})', 1),
'Name string, Age string, Location string'
) as parsed
""").select("parsed.*")
parsed_df.show(false)
+-----+---+--------+
|Name |Age|Location|
+-----+---+--------+
|Chris|29 |USA |
|John |20 |France |
|Mark |30 |UK |
|Mary |22 |UK |
+-----+---+--------+
And you can filter it using:
val filtered = parsed_df.filter("Location = 'USA'")
P.S. Remember to add single quotes around USA.
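If you prefer the Column API, the equality comparison handles the quoting for you; a minimal sketch:

import org.apache.spark.sql.functions.col

val filtered = parsed_df.filter(col("Location") === "USA")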
I want to add a unique row number to my DataFrame in pyspark, and I don't want to use the monotonically_increasing_id or partitionBy methods.
This question might be a duplicate of similar questions asked earlier, but I am still looking for advice on whether I am doing it the right way.
Following is a snippet of my code. I have a CSV file with the below set of input records:
1,VIKRANT SINGH RANA ,NOIDA ,10000
3,GOVIND NIMBHAL ,DWARKA ,92000
2,RAGHVENDRA KUMAR GUPTA,GURGAON ,50000
4,ABHIJAN SINHA ,SAKET ,65000
5,SUPER DEVELOPER ,USA ,50000
6,RAJAT TYAGI ,UP ,65000
7,AJAY SHARMA ,NOIDA ,70000
8,SIDDHARTH BASU ,SAKET ,72000
9,ROBERT ,GURGAON ,70000
I have loaded this CSV file into a DataFrame:
PATH_TO_FILE="file:///u/user/vikrant/testdata/EMP_FILE.csv"
emp_df = spark.read.format("com.databricks.spark.csv") \
.option("mode", "DROPMALFORMED") \
.option("header", "true") \
.option("inferschema", "true") \
.option("delimiter", ",").load(PATH_TO_FILE)
+------+--------------------+--------+----------+
|emp_id| emp_name|emp_city|emp_salary|
+------+--------------------+--------+----------+
| 1|VIKRANT SINGH RAN...|NOIDA | 10000|
| 3|GOVIND NIMBHAL ...|DWARKA | 92000|
| 2|RAGHVENDRA KUMAR ...|GURGAON | 50000|
| 4|ABHIJAN SINHA ...|SAKET | 65000|
| 5|SUPER DEVELOPER ...|USA | 50000|
| 6|RAJAT TYAGI ...|UP | 65000|
| 7|AJAY SHARMA ...|NOIDA | 70000|
| 8|SIDDHARTH BASU ...|SAKET | 72000|
| 9|ROBERT ...|GURGAON | 70000|
+------+--------------------+--------+----------+
empRDD = emp_df.rdd.zipWithIndex()
newRDD = empRDD.map(lambda x: list(x[0]) + [x[1]])
newRDD.take(2)
[[1, u'VIKRANT SINGH RANA ', u'NOIDA ', 10000, 0], [3, u'GOVIND NIMBHAL ', u'DWARKA ', 92000, 1]]
When I appended the int value to my list, I lost the DataFrame schema.
newdf = newRDD.toDF(['emp_id','emp_name','emp_city','emp_salary','row_id'])
newdf.show()
+------+--------------------+--------+----------+------+
|emp_id| emp_name|emp_city|emp_salary|row_id|
+------+--------------------+--------+----------+------+
| 1|VIKRANT SINGH RAN...|NOIDA | 10000| 0|
| 3|GOVIND NIMBHAL ...|DWARKA | 92000| 1|
| 2|RAGHVENDRA KUMAR ...|GURGAON | 50000| 2|
| 4|ABHIJAN SINHA ...|SAKET | 65000| 3|
| 5|SUPER DEVELOPER ...|USA | 50000| 4|
| 6|RAJAT TYAGI ...|UP | 65000| 5|
| 7|AJAY SHARMA ...|NOIDA | 70000| 6|
| 8|SIDDHARTH BASU ...|SAKET | 72000| 7|
| 9|ROBERT ...|GURGAON | 70000| 8|
+------+--------------------+--------+----------+------+
Am I doing it the right way, or is there a better way to add the row number while preserving the DataFrame schema in pyspark?
Is it feasible to use the zipWithIndex method to add unique consecutive row numbers for a large DataFrame as well? Can we use this row_id to re-partition the DataFrame and distribute the data uniformly across partitions?
I have found a solution, and it's very simple.
Since my DataFrame has no column with the same value across all rows, using row_number with a partitionBy clause does not generate unique row numbers on its own.
So let's add a new column to the existing DataFrame with some constant default value:
from pyspark.sql import Window
from pyspark.sql.functions import lit, row_number

emp_df = emp_df.withColumn("new_column", lit("ABC"))
and create a window specification partitioned by that "new_column":
w = Window.partitionBy('new_column').orderBy(lit('A'))
df = emp_df.withColumn("row_num", row_number().over(w)).drop("new_column")
You will get the desired result:
+------+--------------------+--------+----------+-------+
|emp_id| emp_name|emp_city|emp_salary|row_num|
+------+--------------------+--------+----------+-------+
| 1|VIKRANT SINGH RAN...|NOIDA | 10000| 1|
| 2|RAGHVENDRA KUMAR ...|GURGAON | 50000| 2|
| 7|AJAY SHARMA ...|NOIDA | 70000| 3|
| 9|ROBERT ...|GURGAON | 70000| 4|
| 4|ABHIJAN SINHA ...|SAKET | 65000| 5|
| 8|SIDDHARTH BASU ...|SAKET | 72000| 6|
| 5|SUPER DEVELOPER ...|USA | 50000| 7|
| 3|GOVIND NIMBHAL ...|DWARKA | 92000| 8|
| 6|RAJAT TYAGI ...|UP | 65000| 9|
+------+--------------------+--------+----------+-------+
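One caveat: partitioning the window by a constant column moves every row into a single partition, so this approach will not scale well to very large DataFrames. The zipWithIndex approach from the question avoids that bottleneck, at the cost of a round-trip through the RDD API.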
Using Spark SQL:
df = spark.sql("""
SELECT
row_number() OVER (
PARTITION BY ''
ORDER BY ''
) as id,
*
FROM
VALUES
('Bob ', 20),
('Alice', 21),
('Gary ', 21),
('Kent ', 25),
('Gary ', 35)
""")
Output:
>>> df.printSchema()
root
|-- id: integer (nullable = true)
|-- col1: string (nullable = false)
|-- col2: integer (nullable = false)
>>> df.show()
+---+-----+----+
| id| col1|col2|
+---+-----+----+
| 1|Bob | 20|
| 2|Alice| 21|
| 3|Gary | 21|
| 4|Kent | 25|
| 5|Gary | 35|
+---+-----+----+
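The same single-partition caveat applies here, since PARTITION BY '' places all rows in one window partition.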
I have an existing Spark DataFrame with columns as follows:
--------------------
pid | response
--------------------
12 | {"status":"200"}
response is a string column.
Is there a way to parse it as JSON and extract specific fields? Can LATERAL VIEW be used as in Hive? I looked up some examples online that used explode and LATERAL VIEW, but they don't seem to work with Spark 2.1.1.
From pyspark.sql.functions, you can use any of from_json, get_json_object, or json_tuple to extract fields from a JSON string, as below:
>>> from pyspark.sql.functions import json_tuple, from_json, get_json_object
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.getOrCreate()
>>> l = [(12, '{"status":"200"}'),(13,'{"status":"200","somecol":"300"}')]
>>> df = spark.createDataFrame(l,['pid','response'])
>>> df.show()
+---+--------------------+
|pid| response|
+---+--------------------+
| 12| {"status":"200"}|
| 13|{"status":"200",...|
+---+--------------------+
>>> df.printSchema()
root
|-- pid: long (nullable = true)
|-- response: string (nullable = true)
Using json_tuple:
>>> df.select('pid',json_tuple(df.response,'status','somecol')).show()
+---+---+----+
|pid| c0| c1|
+---+---+----+
| 12|200|null|
| 13|200| 300|
+---+---+----+
Using from_json:
>>> from pyspark.sql.types import StructType, StructField, StringType
>>> schema = StructType([StructField("status", StringType()), StructField("somecol", StringType())])
>>> df.select('pid',from_json(df.response, schema).alias("json")).show()
+---+----------+
|pid| json|
+---+----------+
| 12|[200,null]|
| 13| [200,300]|
+---+----------+
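To promote the struct fields to top-level columns, follow the from_json call with a star-select on the struct, e.g. .select('pid', 'json.*'), as in the flattening answers earlier in this thread.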
Using get_json_object:
>>> df.select('pid',get_json_object(df.response,'$.status').alias('status'),get_json_object(df.response,'$.somecol').alias('somecol')).show()
+---+------+-------+
|pid|status|somecol|
+---+------+-------+
| 12| 200| null|
| 13| 200| 300|
+---+------+-------+
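As a rule of thumb: json_tuple is convenient for pulling a few fields ad hoc, from_json is best when you know the full schema and want typed columns, and get_json_object is handy for extracting single values with a JSONPath expression.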
How can I query an RDD with complex types such as maps/arrays?
For example, when writing this test code:
case class Test(name: String, map: Map[String, String])
val map = Map("hello" -> "world", "hey" -> "there")
val map2 = Map("hello" -> "people", "hey" -> "you")
val rdd = sc.parallelize(Array(Test("first", map), Test("second", map2)))
I thought the syntax would be something like:
sqlContext.sql("SELECT * FROM rdd WHERE map.hello = world")
or
sqlContext.sql("SELECT * FROM rdd WHERE map[hello] = world")
but I get
Can't access nested field in type MapType(StringType,StringType,true)
and
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes
respectively.
It depends on the type of the column. Let's start with some dummy data:
import org.apache.spark.sql.functions.{udf, lit}
import scala.util.Try
case class SubRecord(x: Int)
case class ArrayElement(foo: String, bar: Int, vals: Array[Double])
case class Record(
an_array: Array[Int], a_map: Map[String, String],
a_struct: SubRecord, an_array_of_structs: Array[ArrayElement])
val df = sc.parallelize(Seq(
Record(Array(1, 2, 3), Map("foo" -> "bar"), SubRecord(1),
Array(
ArrayElement("foo", 1, Array(1.0, 2.0, 2.0)),
ArrayElement("bar", 2, Array(3.0, 4.0, 5.0)))),
Record(Array(4, 5, 6), Map("foz" -> "baz"), SubRecord(2),
Array(ArrayElement("foz", 3, Array(5.0, 6.0)),
ArrayElement("baz", 4, Array(7.0, 8.0))))
)).toDF
df.registerTempTable("df")
df.printSchema
// root
// |-- an_array: array (nullable = true)
// | |-- element: integer (containsNull = false)
// |-- a_map: map (nullable = true)
// | |-- key: string
// | |-- value: string (valueContainsNull = true)
// |-- a_struct: struct (nullable = true)
// | |-- x: integer (nullable = false)
// |-- an_array_of_structs: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- foo: string (nullable = true)
// | | |-- bar: integer (nullable = false)
// | | |-- vals: array (nullable = true)
// | | | |-- element: double (containsNull = false)
array (ArrayType) columns:
Column.getItem method:
df.select($"an_array".getItem(1)).show
// +-----------+
// |an_array[1]|
// +-----------+
// | 2|
// | 5|
// +-----------+
Hive brackets syntax:
sqlContext.sql("SELECT an_array[1] FROM df").show
// +---+
// |_c0|
// +---+
// | 2|
// | 5|
// +---+
a UDF:
val get_ith = udf((xs: Seq[Int], i: Int) => Try(xs(i)).toOption)
df.select(get_ith($"an_array", lit(1))).show
// +---------------+
// |UDF(an_array,1)|
// +---------------+
// | 2|
// | 5|
// +---------------+
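Wrapping the element access in Try(...).toOption makes the UDF return null instead of throwing when the index is out of range.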
In addition to the methods listed above, Spark supports a growing list of built-in functions operating on complex types. Notable examples include higher-order functions like transform (SQL 2.4+, Scala 3.0+, PySpark / SparkR 3.1+):
df.selectExpr("transform(an_array, x -> x + 1) an_array_inc").show
// +------------+
// |an_array_inc|
// +------------+
// | [2, 3, 4]|
// | [5, 6, 7]|
// +------------+
import org.apache.spark.sql.functions.transform
df.select(transform($"an_array", x => x + 1) as "an_array_inc").show
// +------------+
// |an_array_inc|
// +------------+
// | [2, 3, 4]|
// | [5, 6, 7]|
// +------------+
filter (SQL 2.4+, Scala 3.0+, Python / SparkR 3.1+):
df.selectExpr("filter(an_array, x -> x % 2 == 0) an_array_even").show
// +-------------+
// |an_array_even|
// +-------------+
// | [2]|
// | [4, 6]|
// +-------------+
import org.apache.spark.sql.functions.filter
df.select(filter($"an_array", x => x % 2 === 0) as "an_array_even").show
// +-------------+
// |an_array_even|
// +-------------+
// | [2]|
// | [4, 6]|
// +-------------+
aggregate (SQL 2.4+, Scala 3.0+, PySpark / SparkR 3.1+):
df.selectExpr("aggregate(an_array, 0, (acc, x) -> acc + x, acc -> acc) an_array_sum").show
// +------------+
// |an_array_sum|
// +------------+
// | 6|
// | 15|
// +------------+
import org.apache.spark.sql.functions.aggregate
df.select(aggregate($"an_array", lit(0), (x, y) => x + y) as "an_array_sum").show
// +------------+
// |an_array_sum|
// +------------+
// | 6|
// | 15|
// +------------+
array processing functions (array_*) like array_distinct (2.4+):
import org.apache.spark.sql.functions.array_distinct
df.select(array_distinct($"an_array_of_structs.vals"(0))).show
// +-------------------------------------------+
// |array_distinct(an_array_of_structs.vals[0])|
// +-------------------------------------------+
// | [1.0, 2.0]|
// | [5.0, 6.0]|
// +-------------------------------------------+
array_max (array_min, 2.4+):
import org.apache.spark.sql.functions.array_max
df.select(array_max($"an_array")).show
// +-------------------+
// |array_max(an_array)|
// +-------------------+
// | 3|
// | 6|
// +-------------------+
flatten (2.4+):
import org.apache.spark.sql.functions.flatten
df.select(flatten($"an_array_of_structs.vals")).show
// +---------------------------------+
// |flatten(an_array_of_structs.vals)|
// +---------------------------------+
// | [1.0, 2.0, 2.0, 3...|
// | [5.0, 6.0, 7.0, 8.0]|
// +---------------------------------+
arrays_zip (2.4+):
import org.apache.spark.sql.functions.arrays_zip
df.select(arrays_zip($"an_array_of_structs.vals"(0), $"an_array_of_structs.vals"(1))).show(false)
// +--------------------------------------------------------------------+
// |arrays_zip(an_array_of_structs.vals[0], an_array_of_structs.vals[1])|
// +--------------------------------------------------------------------+
// |[[1.0, 3.0], [2.0, 4.0], [2.0, 5.0]] |
// |[[5.0, 7.0], [6.0, 8.0]] |
// +--------------------------------------------------------------------+
array_union (2.4+):
import org.apache.spark.sql.functions.array_union
df.select(array_union($"an_array_of_structs.vals"(0), $"an_array_of_structs.vals"(1))).show
// +---------------------------------------------------------------------+
// |array_union(an_array_of_structs.vals[0], an_array_of_structs.vals[1])|
// +---------------------------------------------------------------------+
// | [1.0, 2.0, 3.0, 4...|
// | [5.0, 6.0, 7.0, 8.0]|
// +---------------------------------------------------------------------+
slice (2.4+):
import org.apache.spark.sql.functions.slice
df.select(slice($"an_array", 2, 2)).show
// +---------------------+
// |slice(an_array, 2, 2)|
// +---------------------+
// | [2, 3]|
// | [5, 6]|
// +---------------------+
map (MapType) columns:
using Column.getField method:
df.select($"a_map".getField("foo")).show
// +----------+
// |a_map[foo]|
// +----------+
// | bar|
// | null|
// +----------+
using Hive brackets syntax:
sqlContext.sql("SELECT a_map['foz'] FROM df").show
// +----+
// | _c0|
// +----+
// |null|
// | baz|
// +----+
using a full path with dot syntax:
df.select($"a_map.foo").show
// +----+
// | foo|
// +----+
// | bar|
// |null|
// +----+
using a UDF:
val get_field = udf((kvs: Map[String, String], k: String) => kvs.get(k))
df.select(get_field($"a_map", lit("foo"))).show
// +--------------+
// |UDF(a_map,foo)|
// +--------------+
// | bar|
// | null|
// +--------------+
a growing number of map_* functions like map_keys (2.3+):
import org.apache.spark.sql.functions.map_keys
df.select(map_keys($"a_map")).show
// +---------------+
// |map_keys(a_map)|
// +---------------+
// | [foo]|
// | [foz]|
// +---------------+
or map_values (2.3+):
import org.apache.spark.sql.functions.map_values
df.select(map_values($"a_map")).show
// +-----------------+
// |map_values(a_map)|
// +-----------------+
// | [bar]|
// | [baz]|
// +-----------------+
Please check SPARK-23899 for a detailed list.
struct (StructType) columns using full path with dot syntax:
with the DataFrame API:
df.select($"a_struct.x").show
// +---+
// | x|
// +---+
// | 1|
// | 2|
// +---+
with raw SQL:
sqlContext.sql("SELECT a_struct.x FROM df").show
// +---+
// | x|
// +---+
// | 1|
// | 2|
// +---+
Fields inside an array of structs can be accessed using dot syntax, field names, and standard Column methods:
df.select($"an_array_of_structs.foo").show
// +----------+
// | foo|
// +----------+
// |[foo, bar]|
// |[foz, baz]|
// +----------+
sqlContext.sql("SELECT an_array_of_structs[0].foo FROM df").show
// +---+
// |_c0|
// +---+
// |foo|
// |foz|
// +---+
df.select($"an_array_of_structs.vals".getItem(1).getItem(1)).show
// +------------------------------+
// |an_array_of_structs.vals[1][1]|
// +------------------------------+
// | 4.0|
// | 8.0|
// +------------------------------+
Fields of user-defined types (UDTs) can be accessed using UDFs. See Spark SQL referencing attributes of UDT for details.
Notes:
Depending on the Spark version, some of these methods may be available only with HiveContext. UDFs should work independently of the version, with both the standard SQLContext and HiveContext.
Generally speaking, nested values are second-class citizens. Not all typical operations are supported on nested fields. Depending on the context, it can be better to flatten the schema and/or explode collections:
df.select(explode($"an_array_of_structs")).show
// +--------------------+
// | col|
// +--------------------+
// |[foo,1,WrappedArr...|
// |[bar,2,WrappedArr...|
// |[foz,3,WrappedArr...|
// |[baz,4,WrappedArr...|
// +--------------------+
Dot syntax can be combined with the wildcard character (*) to select (possibly multiple) fields without specifying names explicitly:
df.select($"a_struct.*").show
// +---+
// | x|
// +---+
// | 1|
// | 2|
// +---+
JSON columns can be queried using get_json_object and from_json functions. See How to query JSON data column using Spark DataFrames? for details.
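For example, a minimal get_json_object sketch (the jsonDf DataFrame and its columns here are hypothetical, and a Spark 2+ session is assumed for the implicits):

import org.apache.spark.sql.functions.get_json_object
import spark.implicits._

// Hypothetical example data: an id plus a JSON string payload.
val jsonDf = Seq((1, """{"status":"200"}""")).toDF("id", "payload")
jsonDf.select($"id", get_json_object($"payload", "$.status").alias("status")).show
// +---+------+
// | id|status|
// +---+------+
// |  1|   200|
// +---+------+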
Once you convert it to a DataFrame, you can simply fetch the data:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{MapType, StringType, StructField, StructType}

// Convert each Test record into a Row matching the schema below.
val rddRow = rdd.map(t => Row(t.name, t.map))

val myFld1 = StructField("name", StringType, true)
val myFld2 = StructField("map", MapType(StringType, StringType), true)
val arr = Array(myFld1, myFld2)
val schema = StructType(arr)

val rowrddDF = sqlContext.createDataFrame(rddRow, schema)
rowrddDF.registerTempTable("rowtbl")
val rowrddDFFinal = rowrddDF.select(rowrddDF("map.hello"))
or
val rowrddDFFinal = rowrddDF.select("map.hello")
Here is what I did, and it worked:
case class Test(name: String, m: Map[String, String])
val map = Map("hello" -> "world", "hey" -> "there")
val map2 = Map("hello" -> "people", "hey" -> "you")
val rdd = sc.parallelize(Array(Test("first", map), Test("second", map2)))
val rdddf = rdd.toDF
rdddf.registerTempTable("mytable")
sqlContext.sql("select m.hello from mytable").show
Results:
+------+
| hello|
+------+
| world|
|people|
+------+