spark streaming writestream issue - json

I am trying to make a dynamic schema creation out of JSON records from text file as every record will have different schema. The following is my code.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.functions.{lit, schema_of_json, from_json, col}
object streamingexample {
def main(args: Array[String]): Unit = {
val spark:SparkSession = SparkSession.builder()
.master("local[*]")
.appName("SparkByExamples")
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
val df1 = spark.readStream.textFile("C:\\Users\\sheol\\Desktop\\streaming")
val newdf11=df1
val json_schema = newdf11.select("value").collect().map(x => x.get(0)).mkString(",")
val df2 = df1.select(from_json($"value", schema_of_json(json_schema)).alias("value_new"))
val df3 = df2.select($"value_new.*")
df3.printSchema()
df3.writeStream
.option("truncate", "false")
.format("console")
.start()
.awaitTermination()
}
}
I am getting the following error. Please help on how to fix the code. I tried a lot. unable to figure out.
Error: Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
Sample data:
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}

This statement in your code causing the problem from your code, as you already know.
val json_schema = newdf11.select("value").collect().map(x => x.get(0)).mkString(",")
You can get json schema in a different way like below...
val dd: DataFrame = spark.read.json("C:\\Users\\sheol\\Desktop\\streaming")
dd.show()
/** you can use val df1 = spark.readStream.textFile(yourfile) also **/
val json_schema = dd.schema.json;
println(json_schema)
Result :
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
{"type":"struct","fields":[{"name":"age","type":"long","nullable":true,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}}]}
you can further refine to your requirements I will leave it to you

This exception occurred because you are trying to access the data from the stream before the stream was started. Issues is with the df3.printSchema() make sure to call this function after the stream start.

Related

Scala Spark - Split JSON column to multiple columns

Scala noob, using Spark 2.3.0.
I'm creating a DataFrame using a udf that creates a JSON String column:
val result: DataFrame = df.withColumn("decrypted_json", instance.decryptJsonUdf(df("encrypted_data")))
it outputs as follows:
+----------------+---------------------------------------+
| encrypted_data | decrypted_json |
+----------------+---------------------------------------+
|eyJleHAiOjE1 ...| {"a":547.65 , "b":"Some Data"} |
+----------------+---------------------------------------+
The UDF is an external code, that I can't change. I would like to split the decrypted_json column into individual columns so the output DataFrame will be like so:
+----------------+----------------------+
| encrypted_data | a | b |
+----------------+--------+-------------+
|eyJleHAiOjE1 ...| 547.65 | "Some Data" |
+----------------+--------+-------------+
Below solution is inspired by one of the solutions given by #Jacek Laskowski:
import org.apache.spark.sql.types._
val JsonSchema = new StructType()
.add($"a".string)
.add($"b".string)
val schema = new StructType()
.add($"encrypted_data".string)
.add($"decrypted_json".array(JsonSchema))
val schemaAsJson = schema.json
import org.apache.spark.sql.types.DataType
val dt = DataType.fromJson(schemaAsJson)
import org.apache.spark.sql.functions._
val rawJsons = Seq("""
{
"encrypted_data" : "eyJleHAiOjE1",
"decrypted_json" : [
{
"a" : "547.65",
"b" : "Some Data"
}
]
}
""").toDF("rawjson")
val people = rawJsons
.select(from_json($"rawjson", schemaAsJson, Map.empty[String, String]) as "json")
.select("json.*") // <-- flatten the struct field
.withColumn("address", explode($"decrypted_json")) // <-- explode the array field
.drop("decrypted_json") // <-- no longer needed
.select("encrypted_data", "address.*") // <-- flatten the struct field
Please go through Link for the original solution with the explanation.
I hope that helps.
Using from_jason you can give parse the JSON into a Struct type then select columns from that dataframe. You will need to know the schema of the json. Here is how -
val sparkSession = //create spark session
import sparkSession.implicits._
val jsonData = """{"a":547.65 , "b":"Some Data"}"""
val schema = {StructType(
List(
StructField("a", DoubleType, nullable = false),
StructField("b", StringType, nullable = false)
))}
val df = sparkSession.createDataset(Seq(("dummy data",jsonData))).toDF("string_column","json_column")
val dfWithParsedJson = df.withColumn("json_data",from_json($"json_column",schema))
dfWithParsedJson.select($"string_column",$"json_column",$"json_data.a", $"json_data.b").show()
Result
+-------------+------------------------------+------+---------+
|string_column|json_column |a |b |
+-------------+------------------------------+------+---------+
|dummy data |{"a":547.65 , "b":"Some Data"}|547.65|Some Data|
+-------------+------------------------------+------+---------+

Create nested JSON of all rows having same Id: DataFrame

I have a DataFrame df4 with three column
id annotating entity
data having JSON Array data
executor_id as string value
Code to create same is as follow:
val df1 = Seq((1, "n1", "d1")).toDF("id", "number", "data")
val df2 = df1.withColumn("data", to_json(struct($"number", $"data"))).groupBy("id").agg(collect_list($"data").alias("data")).withColumn("executor_id", lit("e1"))
val df3 = df1.withColumn("data", to_json(struct($"number", $"data"))).groupBy("id").agg(collect_list($"data").alias("data")).withColumn("executor_id", lit("e2"))
val df4 = df2.union(df3)
Content of DF4 is like
scala> df4.show(false)
+---+-----------------------------+-----------+
|id |data |executor_id|
+---+-----------------------------+-----------+
|1 |[{"number":"n1","data":"d1"}]|e1 |
|1 |[{"number":"n1","data":"d1"}]|e2 |
+---+-----------------------------+-----------+
I have to create new json data with executor_id as key and data as json data, group by id. Resultant dataFrame like
+---+------------------------------------------------------------------------+
|id |new_data |
+---+------------------------------------------------------------------------+
|1 |{"e1":[{"number":"n1","data":"d1"}], "e2":[{"number":"n1","data":"d1"}]}|
+---+------------------------------------------------------------------------+
Versions:
Spark: 2.2
Scala: 2.11
I have been struggling to solve this problem from past three days and finally able to work around it using UserDefinedAggregateFunction. Here is sample code for same
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
import scala.collection.mutable
import scala.collection.mutable.ListBuffer
class CustomAggregator extends UserDefinedAggregateFunction {
override def inputSchema: org.apache.spark.sql.types.StructType =
StructType(Array(StructField("key", StringType), StructField("value", ArrayType(StringType))))
// This is the internal fields you keep for computing your aggregate
override def bufferSchema: StructType = StructType(
Array(StructField("mapData", MapType(StringType, ArrayType(StringType))))
)
// This is the output type of your aggregatation function.
override def dataType = StringType
override def deterministic: Boolean = true
// This is the initial value for your buffer schema.
override def initialize(buffer: MutableAggregationBuffer): Unit = {
buffer(0) = scala.collection.mutable.Map[String, String]()
}
// This is how to update your buffer schema given an input.
override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
buffer(0) = buffer.getMap(0) + (input.getAs[String](0) -> input.getAs[String](1))
}
// This is how to merge two objects with the bufferSchema type.
override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
buffer1.update(0, buffer1.getAs[Map[String, Any]](0) ++ buffer2.getAs[Map[String, Any]](0))
}
// This is where you output the final value, given the final value of your bufferSchema.
override def evaluate(buffer: Row): Any = {
val map = buffer(0).asInstanceOf[Map[Any, Any]]
val buff: ListBuffer[String] = ListBuffer()
for ((k, v) <- map) {
val valArray = v.asInstanceOf[mutable.WrappedArray[Any]].array;
val tmp = {
for {
valString <- valArray
} yield valString.toString
}.toList.mkString(",")
buff += "\"" + k.toString + "\":[" + tmp + "]"
}
"{" + buff.toList.mkString(",") + "}"
}
}
Now use customAggregator,
val ca = new CustomAggregator
val df5 = df4.groupBy("id").agg(ca(col("executor_id"), col("data")).as("jsonData"))
Resultant DF is
scala> df5.show(false)
+---+-----------------------------------------------------------------------+
|id |jsonData |
+---+-----------------------------------------------------------------------+
|1 |{"e1":[{"number":"n1","data":"d1"}],"e2":[{"number":"n1","data":"d1"}]}|
+---+-----------------------------------------------------------------------+
Even though I have solved this problem, I am not sure whether this is right way or not. Reasons for my doubts are
In places, I have used Any. I don't feel this it is correct.
For each evaluation, I am creating ListBuffer and many other data type. I am not sure about performance of code.
I still have to test code for many dataType like double, date tpye, nested json etc. as data.

Serialize table to nested JSON using Apache Spark

I have a set of records like the following sample
|ACCOUNTNO|VEHICLENUMBER|CUSTOMERID|
+---------+-------------+----------+
| 10003014| MH43AJ411| 20000000|
| 10003014| MH43AJ411| 20000001|
| 10003015| MH12GZ3392| 20000002|
I want to parse into JSON and it should be look like this:
{
"ACCOUNTNO":10003014,
"VEHICLE": [
{ "VEHICLENUMBER":"MH43AJ411", "CUSTOMERID":20000000},
{ "VEHICLENUMBER":"MH43AJ411", "CUSTOMERID":20000001}
],
"ACCOUNTNO":10003015,
"VEHICLE": [
{ "VEHICLENUMBER":"MH12GZ3392", "CUSTOMERID":20000002}
]
}
I have written the program but failed to achieve the output.
package com.report.pack1.spark
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql._
object sqltojson {
def main(args:Array[String]) {
System.setProperty("hadoop.home.dir", "C:/winutil/")
val conf = new SparkConf().setAppName("SQLtoJSON").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val jdbcSqlConnStr = "jdbc:sqlserver://192.168.70.88;databaseName=ISSUER;user=bhaskar;password=welcome123;"
val jdbcDbTable = "[HISTORY].[TP_CUSTOMER_PREPAIDACCOUNTS]"
val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> jdbcSqlConnStr,"dbtable" -> jdbcDbTable)).load()
jdbcDF.registerTempTable("tp_customer_account")
val res01 = sqlContext.sql("SELECT ACCOUNTNO, VEHICLENUMBER, CUSTOMERID FROM tp_customer_account GROUP BY ACCOUNTNO, VEHICLENUMBER, CUSTOMERID ORDER BY ACCOUNTNO ")
res01.coalesce(1).write.json("D:/res01.json")
}
}
How can I serialize in the given format? Thanks in advance!
You can use struct and groupBy to get your desired result. Below is the code for same. I have commented the code whenever required.
val df = Seq((10003014,"MH43AJ411",20000000),
(10003014,"MH43AJ411",20000001),
(10003015,"MH12GZ3392",20000002)
).toDF("ACCOUNTNO","VEHICLENUMBER","CUSTOMERID")
df.show
//output
//+---------+-------------+----------+
//|ACCOUNTNO|VEHICLENUMBER|CUSTOMERID|
//+---------+-------------+----------+
//| 10003014| MH43AJ411| 20000000|
//| 10003014| MH43AJ411| 20000001|
//| 10003015| MH12GZ3392| 20000002|
//+---------+-------------+----------+
//create a struct column then group by ACCOUNTNO column and finally convert DF to JSON
df.withColumn("VEHICLE",struct("VEHICLENUMBER","CUSTOMERID")).
select("VEHICLE","ACCOUNTNO"). //only select reqired columns
groupBy("ACCOUNTNO").
agg(collect_list("VEHICLE").as("VEHICLE")). //for the same group create a list of vehicles
toJSON. //convert to json
show(false)
//output
//+------------------------------------------------------------------------------------------------------------------------------------------+
//|value |
//+------------------------------------------------------------------------------------------------------------------------------------------+
//|{"ACCOUNTNO":10003014,"VEHICLE":[{"VEHICLENUMBER":"MH43AJ411","CUSTOMERID":20000000},{"VEHICLENUMBER":"MH43AJ411","CUSTOMERID":20000001}]}|
//|{"ACCOUNTNO":10003015,"VEHICLE":[{"VEHICLENUMBER":"MH12GZ3392","CUSTOMERID":20000002}]} |
//+------------------------------------------------------------------------------------------------------------------------------------------+
You can also write this dataframe to a file using same statement as you mentioned in question.

Spark from_json with dynamic schema

I am trying to use Spark for processing JSON data with variable structure(nested JSON). Input JSON data could be very large with more than 1000 of keys per row and one batch could be more than 20 GB.
Entire batch has been generated from 30 data sources and 'key2' of each JSON can be used to identify the source and structure for each source is predefined.
What would be the best approach for processing such data?
I have tried using from_json like below but it works only with fixed schema and to use it first I need to group the data based on each source and then apply the schema.
Due to large data volume my preferred choice is to scan the data only once and extract required values from each source, based on predefined schema.
import org.apache.spark.sql.types._
import spark.implicits._
val data = sc.parallelize(
"""{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v1"}}"""
:: Nil)
val df = data.toDF
val schema = (new StructType)
.add("key1", StringType)
.add("key2", StringType)
.add("key3", (new StructType)
.add("key3_k1", StringType))
df.select(from_json($"value",schema).as("json_str"))
.select($"json_str.key3.key3_k1").collect
res17: Array[org.apache.spark.sql.Row] = Array([xxx])
This is just a restatement of #Ramesh Maharjan's answer, but with more modern Spark syntax.
I found this method lurking in DataFrameReader which allows you to parse JSON strings from a Dataset[String] into an arbitrary DataFrame and take advantage of the same schema inference Spark gives you with spark.read.json("filepath") when reading directly from a JSON file. The schema of each row can be completely different.
def json(jsonDataset: Dataset[String]): DataFrame
Example usage:
val jsonStringDs = spark.createDataset[String](
Seq(
("""{"firstname": "Sherlock", "lastname": "Holmes", "address": {"streetNumber": 121, "street": "Baker", "city": "London"}}"""),
("""{"name": "Amazon", "employeeCount": 500000, "marketCap": 817117000000, "revenue": 177900000000, "CEO": "Jeff Bezos"}""")))
jsonStringDs.show
jsonStringDs:org.apache.spark.sql.Dataset[String] = [value: string]
+----------------------------------------------------------------------------------------------------------------------+
|value
|
+----------------------------------------------------------------------------------------------------------------------+
|{"firstname": "Sherlock", "lastname": "Holmes", "address": {"streetNumber": 121, "street": "Baker", "city": "London"}}|
|{"name": "Amazon", "employeeCount": 500000, "marketCap": 817117000000, "revenue": 177900000000, "CEO": "Jeff Bezos"} |
+----------------------------------------------------------------------------------------------------------------------+
val df = spark.read.json(jsonStringDs)
df.show(false)
df:org.apache.spark.sql.DataFrame = [CEO: string, address: struct ... 6 more fields]
+----------+------------------+-------------+---------+--------+------------+------+------------+
|CEO |address |employeeCount|firstname|lastname|marketCap |name |revenue |
+----------+------------------+-------------+---------+--------+------------+------+------------+
|null |[London,Baker,121]|null |Sherlock |Holmes |null |null |null |
|Jeff Bezos|null |500000 |null |null |817117000000|Amazon|177900000000|
+----------+------------------+-------------+---------+--------+------------+------+------------+
The method is available from Spark 2.2.0:
http://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.DataFrameReader#json(jsonDataset:org.apache.spark.sql.Dataset[String]):org.apache.spark.sql.DataFrame
If you have data as you mentioned in the question as
val data = sc.parallelize(
"""{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v1"}}"""
:: Nil)
You don't need to create schema for json data. Spark sql can infer schema from the json string. You just have to use SQLContext.read.json as below
val df = sqlContext.read.json(data)
which will give you schema as below for the rdd data used above
root
|-- key1: string (nullable = true)
|-- key2: string (nullable = true)
|-- key3: struct (nullable = true)
| |-- key3_k1: string (nullable = true)
And you can just select key3_k1 as
df2.select("key3.key3_k1").show(false)
//+-------+
//|key3_k1|
//+-------+
//|key3_v1|
//+-------+
You can manipulate the dataframe as you wish. I hope the answer is helpful
I am not sure if my suggestion can help you although I had a similar case and I solved it as follows:
1) So the idea is to use json rapture (or some other json library) to
load JSON schema dynamically. For instance you could read the 1st
row of the json file to discover the schema(similarly to what I do
here with jsonSchema)
2) Generate schema dynamically. First iterate through the dynamic
fields (notice that I project values of key3 as Map[String, String])
and add a StructField for each one of them to schema
3) Apply the generated schema into your dataframe
import rapture.json._
import jsonBackends.jackson._
val jsonSchema = """{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v1", "key3_k2":"key3_v2", "key3_k3":"key3_v3"}}"""
val json = Json.parse(jsonSchema)
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.{StringType, StructType}
val schema = ArrayBuffer[StructField]()
//we could do this dynamic as well with json rapture
schema.appendAll(List(StructField("key1", StringType), StructField("key2", StringType)))
val items = ArrayBuffer[StructField]()
json.key3.as[Map[String, String]].foreach{
case(k, v) => {
items.append(StructField(k, StringType))
}
}
val complexColumn = new StructType(items.toArray)
schema.append(StructField("key3", complexColumn))
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
val sparkConf = new SparkConf().setAppName("dynamic-json-schema").setMaster("local")
val spark = SparkSession.builder().config(sparkConf).getOrCreate()
val jsonDF = spark.read.schema(StructType(schema.toList)).json("""your_path\data.json""")
jsonDF.select("key1", "key2", "key3.key3_k1", "key3.key3_k2", "key3.key3_k3").show()
I used the next data as input:
{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v11", "key3_k2":"key3_v21", "key3_k3":"key3_v31"}}
{"key1":"val2","key2":"source2","key3":{"key3_k1":"key3_v12", "key3_k2":"key3_v22", "key3_k3":"key3_v32"}}
{"key1":"val3","key2":"source3","key3":{"key3_k1":"key3_v13", "key3_k2":"key3_v23", "key3_k3":"key3_v33"}}
And the output:
+----+-------+--------+--------+--------+
|key1| key2| key3_k1| key3_k2| key3_k3|
+----+-------+--------+--------+--------+
|val1|source1|key3_v11|key3_v21|key3_v31|
|val2|source2|key3_v12|key3_v22|key3_v32|
|val2|source3|key3_v13|key3_v23|key3_v33|
+----+-------+--------+--------+--------+
An advanced alternative, which I haven't tested yet, would be to generate a case class e.g called JsonRow from the JSON schema in order to have a strongly typed dataset which provides better serialization performance apart the fact that make your code more maintainable. To make this work you need first to create a JsonRow.scala file then you should implement a sbt pre-build script which will modify the content of JsonRow.scala(you might have more than one of course) dynamically based on your source files. To generate class JsonRow dynamically you can use the next code:
def generateClass(members: Map[String, String], name: String) : Any = {
val classMembers = for (m <- members) yield {
s"${m._1}: String"
}
val classDef = s"""case class ${name}(${classMembers.mkString(",")});scala.reflect.classTag[${name}].runtimeClass"""
classDef
}
The method generateClass accepts a map of strings to create the class members and the class name itself. The members of the generated class you can again populate them from you json schema:
import org.codehaus.jackson.node.{ObjectNode, TextNode}
import collection.JavaConversions._
val mapping = collection.mutable.Map[String, String]()
val fields = json.$root.value.asInstanceOf[ObjectNode].getFields
for (f <- fields) {
(f.getKey, f.getValue) match {
case (k: String, v: TextNode) => mapping(k) = v.asText
case (k: String, v: ObjectNode) => v.getFields.foreach(f => mapping(f.getKey) = f.getValue.asText)
case _ => None
}
}
val dynClass = generateClass(mapping.toMap, "JsonRow")
println(dynClass)
This prints out:
case class JsonRow(key3_k2: String,key3_k1: String,key1: String,key2: String,key3_k3: String);scala.reflect.classTag[JsonRow].runtimeClass
Good luck

How to specify a missing value in a dataframe

I am trying to load a CSV file into a Spark data frame with spark-csv [1] using an Apache Zeppelin notebook and when loading a numeric field that doesn't have value the parser fails for that line and the line gets skipped.
I would have expected the line to get loaded and the value in the data frame load the line and have the value set to NULL so that aggregations just ignore the value.
%dep
z.reset()
z.addRepo("my-nexus").url("<my_local_nexus_repo_that_is_a_proxy_of_public_repos>")
z.load("com.databricks:spark-csv_2.10:1.1.0")
%spark
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types._
import com.databricks.spark.csv._
import org.apache.spark.sql.functions._
val schema = StructType(
StructField("identifier", StringType, true) ::
StructField("name", StringType, true) ::
StructField("height", DoubleType, true) ::
Nil)
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.format("com.databricks.spark.csv")
.schema(schema)
.option("header", "true")
.load("file:///home/spark_user/data.csv")
df.describe("height").show()
Here is the content of the data file: /home/spark_user/data.csv
identifier,name,height
1,sam,184
2,cath,180
3,santa, <-- note that there is not height recorded for Santa !
Here is the output:
+-------+------+
|summary|height|
+-------+------+
| count| 2| <- 2 of 3 lines loaded, ie. sam and cath
| mean| 182.0|
| stddev| 2.0|
| min| 180.0|
| max| 184.0|
+-------+------+
In the logs of zeppelin I can see the following error on parsing santa's line:
ERROR [2015-07-21 16:42:09,940] ({Executor task launch worker-45} CsvRelation.scala[apply]:209) - Exception while parsing line: 3,santa,.
java.lang.NumberFormatException: empty String
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1842)
at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
at java.lang.Double.parseDouble(Double.java:538)
at scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232)
at scala.collection.immutable.StringOps.toDouble(StringOps.scala:31)
at com.databricks.spark.csv.util.TypeCast$.castTo(TypeCast.scala:42)
at com.databricks.spark.csv.CsvRelation$$anonfun$com$databricks$spark$csv$CsvRelation$$parseCSV$1.apply(CsvRelation.scala:198)
at com.databricks.spark.csv.CsvRelation$$anonfun$com$databricks$spark$csv$CsvRelation$$parseCSV$1.apply(CsvRelation.scala:180)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:129)
at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:126)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
So you might tell me so far so good ... and you'd be right ;)
Now I want to add an extra column, say age and I always have data in that field.
identifier,name,height,age
1,sam,184,30
2,cath,180,32
3,santa,,70
Now ask politely for some stats about age:
%spark
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types._
import com.databricks.spark.csv._
import org.apache.spark.sql.functions._
val schema = StructType(
StructField("identifier", StringType, true) ::
StructField("name", StringType, true) ::
StructField("height", DoubleType, true) ::
StructField("age", DoubleType, true) ::
Nil)
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.format("com.databricks.spark.csv")
.schema(schema)
.option("header", "true")
.load("file:///home/spark_user/data2.csv")
df.describe("age").show()
Results
+-------+----+
|summary| age|
+-------+----+
| count| 2|
| mean|31.0|
| stddev| 1.0|
| min|30.0|
| max|32.0|
+-------+----+
ALL WRONG ! Since santa's height is not known, the whole line is lost and the calculation of age is only based on Sam and Cath while Santa has a perfectly valid age.
My question is what value do I need to plug in Santa's height so that the CSV can be loaded. I have tried to set the schema to be all StringType but then
The next question is more about
I have found in the API that one can handle N/A values using spark. SO I thought maybe I could load my data with all columns set to StringType and then do some cleanup and then only set the schema properly as written below:
%spark
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types._
import com.databricks.spark.csv._
import org.apache.spark.sql.functions._
val schema = StructType(
StructField("identifier", StringType, true) ::
StructField("name", StringType, true) ::
StructField("height", StringType, true) ::
StructField("age", StringType, true) ::
Nil)
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.format("com.databricks.spark.csv").schema(schema).option("header", "true").load("file:///home/spark_user/data.csv")
// eg. for each column of my dataframe, replace empty string by null
df.na.replace( "*", Map("" -> null) )
val toDouble = udf[Double, String]( _.toDouble)
df2 = df.withColumn("age", toDouble(df("age")))
df2.describe("age").show()
But df.na.replace() throws an exception and stops:
java.lang.IllegalArgumentException: Unsupported value type java.lang.String ().
at org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$convertToDouble(DataFrameNaFunctions.scala:417)
at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$4.apply(DataFrameNaFunctions.scala:337)
at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$4.apply(DataFrameNaFunctions.scala:337)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.Map$Map1.foreach(Map.scala:109)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.DataFrameNaFunctions.replace0(DataFrameNaFunctions.scala:337)
at org.apache.spark.sql.DataFrameNaFunctions.replace(DataFrameNaFunctions.scala:304)
Any help, & tips much appreciated !!
[1] https://github.com/databricks/spark-csv
Spark-csv lacks this option. It has been fixed in master branch. I guess you should use it or wait for the next stable version.