How to specify a missing value in a dataframe - csv

I am trying to load a CSV file into a Spark data frame with spark-csv [1] using an Apache Zeppelin notebook and when loading a numeric field that doesn't have value the parser fails for that line and the line gets skipped.
I would have expected the line to get loaded and the value in the data frame load the line and have the value set to NULL so that aggregations just ignore the value.
%dep
z.reset()
z.addRepo("my-nexus").url("<my_local_nexus_repo_that_is_a_proxy_of_public_repos>")
z.load("com.databricks:spark-csv_2.10:1.1.0")
%spark
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types._
import com.databricks.spark.csv._
import org.apache.spark.sql.functions._
val schema = StructType(
StructField("identifier", StringType, true) ::
StructField("name", StringType, true) ::
StructField("height", DoubleType, true) ::
Nil)
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.format("com.databricks.spark.csv")
.schema(schema)
.option("header", "true")
.load("file:///home/spark_user/data.csv")
df.describe("height").show()
Here is the content of the data file: /home/spark_user/data.csv
identifier,name,height
1,sam,184
2,cath,180
3,santa, <-- note that there is not height recorded for Santa !
Here is the output:
+-------+------+
|summary|height|
+-------+------+
| count| 2| <- 2 of 3 lines loaded, ie. sam and cath
| mean| 182.0|
| stddev| 2.0|
| min| 180.0|
| max| 184.0|
+-------+------+
In the logs of zeppelin I can see the following error on parsing santa's line:
ERROR [2015-07-21 16:42:09,940] ({Executor task launch worker-45} CsvRelation.scala[apply]:209) - Exception while parsing line: 3,santa,.
java.lang.NumberFormatException: empty String
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1842)
at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
at java.lang.Double.parseDouble(Double.java:538)
at scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232)
at scala.collection.immutable.StringOps.toDouble(StringOps.scala:31)
at com.databricks.spark.csv.util.TypeCast$.castTo(TypeCast.scala:42)
at com.databricks.spark.csv.CsvRelation$$anonfun$com$databricks$spark$csv$CsvRelation$$parseCSV$1.apply(CsvRelation.scala:198)
at com.databricks.spark.csv.CsvRelation$$anonfun$com$databricks$spark$csv$CsvRelation$$parseCSV$1.apply(CsvRelation.scala:180)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:129)
at org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:126)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
So you might tell me so far so good ... and you'd be right ;)
Now I want to add an extra column, say age and I always have data in that field.
identifier,name,height,age
1,sam,184,30
2,cath,180,32
3,santa,,70
Now ask politely for some stats about age:
%spark
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types._
import com.databricks.spark.csv._
import org.apache.spark.sql.functions._
val schema = StructType(
StructField("identifier", StringType, true) ::
StructField("name", StringType, true) ::
StructField("height", DoubleType, true) ::
StructField("age", DoubleType, true) ::
Nil)
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.format("com.databricks.spark.csv")
.schema(schema)
.option("header", "true")
.load("file:///home/spark_user/data2.csv")
df.describe("age").show()
Results
+-------+----+
|summary| age|
+-------+----+
| count| 2|
| mean|31.0|
| stddev| 1.0|
| min|30.0|
| max|32.0|
+-------+----+
ALL WRONG ! Since santa's height is not known, the whole line is lost and the calculation of age is only based on Sam and Cath while Santa has a perfectly valid age.
My question is what value do I need to plug in Santa's height so that the CSV can be loaded. I have tried to set the schema to be all StringType but then
The next question is more about
I have found in the API that one can handle N/A values using spark. SO I thought maybe I could load my data with all columns set to StringType and then do some cleanup and then only set the schema properly as written below:
%spark
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types._
import com.databricks.spark.csv._
import org.apache.spark.sql.functions._
val schema = StructType(
StructField("identifier", StringType, true) ::
StructField("name", StringType, true) ::
StructField("height", StringType, true) ::
StructField("age", StringType, true) ::
Nil)
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.format("com.databricks.spark.csv").schema(schema).option("header", "true").load("file:///home/spark_user/data.csv")
// eg. for each column of my dataframe, replace empty string by null
df.na.replace( "*", Map("" -> null) )
val toDouble = udf[Double, String]( _.toDouble)
df2 = df.withColumn("age", toDouble(df("age")))
df2.describe("age").show()
But df.na.replace() throws an exception and stops:
java.lang.IllegalArgumentException: Unsupported value type java.lang.String ().
at org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$convertToDouble(DataFrameNaFunctions.scala:417)
at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$4.apply(DataFrameNaFunctions.scala:337)
at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$4.apply(DataFrameNaFunctions.scala:337)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.Map$Map1.foreach(Map.scala:109)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.DataFrameNaFunctions.replace0(DataFrameNaFunctions.scala:337)
at org.apache.spark.sql.DataFrameNaFunctions.replace(DataFrameNaFunctions.scala:304)
Any help, & tips much appreciated !!
[1] https://github.com/databricks/spark-csv

Spark-csv lacks this option. It has been fixed in master branch. I guess you should use it or wait for the next stable version.

Related

spark streaming writestream issue

I am trying to make a dynamic schema creation out of JSON records from text file as every record will have different schema. The following is my code.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.functions.{lit, schema_of_json, from_json, col}
object streamingexample {
def main(args: Array[String]): Unit = {
val spark:SparkSession = SparkSession.builder()
.master("local[*]")
.appName("SparkByExamples")
.getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
val df1 = spark.readStream.textFile("C:\\Users\\sheol\\Desktop\\streaming")
val newdf11=df1
val json_schema = newdf11.select("value").collect().map(x => x.get(0)).mkString(",")
val df2 = df1.select(from_json($"value", schema_of_json(json_schema)).alias("value_new"))
val df3 = df2.select($"value_new.*")
df3.printSchema()
df3.writeStream
.option("truncate", "false")
.format("console")
.start()
.awaitTermination()
}
}
I am getting the following error. Please help on how to fix the code. I tried a lot. unable to figure out.
Error: Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
Sample data:
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
This statement in your code causing the problem from your code, as you already know.
val json_schema = newdf11.select("value").collect().map(x => x.get(0)).mkString(",")
You can get json schema in a different way like below...
val dd: DataFrame = spark.read.json("C:\\Users\\sheol\\Desktop\\streaming")
dd.show()
/** you can use val df1 = spark.readStream.textFile(yourfile) also **/
val json_schema = dd.schema.json;
println(json_schema)
Result :
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
{"type":"struct","fields":[{"name":"age","type":"long","nullable":true,"metadata":{}},{"name":"name","type":"string","nullable":true,"metadata":{}}]}
you can further refine to your requirements I will leave it to you
This exception occurred because you are trying to access the data from the stream before the stream was started. Issues is with the df3.printSchema() make sure to call this function after the stream start.

Scala Spark - Split JSON column to multiple columns

Scala noob, using Spark 2.3.0.
I'm creating a DataFrame using a udf that creates a JSON String column:
val result: DataFrame = df.withColumn("decrypted_json", instance.decryptJsonUdf(df("encrypted_data")))
it outputs as follows:
+----------------+---------------------------------------+
| encrypted_data | decrypted_json |
+----------------+---------------------------------------+
|eyJleHAiOjE1 ...| {"a":547.65 , "b":"Some Data"} |
+----------------+---------------------------------------+
The UDF is an external code, that I can't change. I would like to split the decrypted_json column into individual columns so the output DataFrame will be like so:
+----------------+----------------------+
| encrypted_data | a | b |
+----------------+--------+-------------+
|eyJleHAiOjE1 ...| 547.65 | "Some Data" |
+----------------+--------+-------------+
Below solution is inspired by one of the solutions given by #Jacek Laskowski:
import org.apache.spark.sql.types._
val JsonSchema = new StructType()
.add($"a".string)
.add($"b".string)
val schema = new StructType()
.add($"encrypted_data".string)
.add($"decrypted_json".array(JsonSchema))
val schemaAsJson = schema.json
import org.apache.spark.sql.types.DataType
val dt = DataType.fromJson(schemaAsJson)
import org.apache.spark.sql.functions._
val rawJsons = Seq("""
{
"encrypted_data" : "eyJleHAiOjE1",
"decrypted_json" : [
{
"a" : "547.65",
"b" : "Some Data"
}
]
}
""").toDF("rawjson")
val people = rawJsons
.select(from_json($"rawjson", schemaAsJson, Map.empty[String, String]) as "json")
.select("json.*") // <-- flatten the struct field
.withColumn("address", explode($"decrypted_json")) // <-- explode the array field
.drop("decrypted_json") // <-- no longer needed
.select("encrypted_data", "address.*") // <-- flatten the struct field
Please go through Link for the original solution with the explanation.
I hope that helps.
Using from_jason you can give parse the JSON into a Struct type then select columns from that dataframe. You will need to know the schema of the json. Here is how -
val sparkSession = //create spark session
import sparkSession.implicits._
val jsonData = """{"a":547.65 , "b":"Some Data"}"""
val schema = {StructType(
List(
StructField("a", DoubleType, nullable = false),
StructField("b", StringType, nullable = false)
))}
val df = sparkSession.createDataset(Seq(("dummy data",jsonData))).toDF("string_column","json_column")
val dfWithParsedJson = df.withColumn("json_data",from_json($"json_column",schema))
dfWithParsedJson.select($"string_column",$"json_column",$"json_data.a", $"json_data.b").show()
Result
+-------------+------------------------------+------+---------+
|string_column|json_column |a |b |
+-------------+------------------------------+------+---------+
|dummy data |{"a":547.65 , "b":"Some Data"}|547.65|Some Data|
+-------------+------------------------------+------+---------+

Importing schema from json with optional value

I'm trying to create a table from a json datasource.
The problem is that there is a field in the json data that is not always present for every entry and looks like this.
[ { "k1" : "someValue",
"optK" : { "nestedK" : true } },
{ "k1" : "someOtherValue" }
]
When I try to specify the optional field in the schema, all the entries without that field have all null value in the table:
columns: k1 | optK
row1: "someValue" [true]
row2: null null
is it possible to write a schema such that I would have null only in the column where the value is missing?
Like this:
columns: k1 | optK
row1: "someValue" "optV"
row2: "someOtherValue" null
My current code:
import org.apache.spark.sql.expressions.scalalang._
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
val session = SparkSession.builder().enableHiveSupport().getOrCreate()
val schema = StructType(Seq(
StructField("k1", StringType, false),
StructField("optK", StructType(Seq(StructField("nestedK", BooleanType, false))), false)
))
val df = session.read.schema(schema).json("data.json")
df.registerTempTable("Mr_Table")
There are several issues in your code/input data:
Input data - JSON keys aren't in quote.
You can use avoid this problem, by one of the following options:
Updating the input data by adding quotes to the json keys
Using .option("allowUnquotedFieldNames",true) in the following way:
val df = session.read.option("allowUnquotedFieldNames",true).schema(schema).json("data.json")
A string field in the input data was defined as boolean in the schema schema should be updated to be:
val schema = StructType(Seq(
StructField("k1", StringType, false),
StructField("optK", StructType(Seq(StructField("nestedK", StringType, false))), false)
))
JSON data format, I've update the sample json input to be in json lines format:
{ k1 : "someValue", optK : { nestedK : "optV" } }
{ k1 : "someOtherValue" }
Running the modify code shows the following:
Spark context available as 'sc' (master = yarn, app id = application_xxx).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.2.0
/_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_141)
scala> :paste
// Entering paste mode (ctrl-D to finish)
import org.apache.spark.sql.expressions.scalalang._
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
val schema = StructType(Seq(
StructField("k1", StringType, false),
StructField("optK", StructType(Seq(StructField("nestedK", StringType, false))), false)
))
val df = spark.read.option("allowUnquotedFieldNames",true).schema(schema).json("s3 location of data.json")
// Exiting paste mode, now interpreting.
import org.apache.spark.sql.expressions.scalalang._
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
schema: org.apache.spark.sql.types.StructType = StructType(StructField(k1,StringType,false), StructField(optK,StructType(StructField(nestedK,StringType,false)),false))
df: org.apache.spark.sql.DataFrame = [k1: string, optK: struct<nestedK: string>]
scala> df.show
+--------------+------+
| k1| optK|
+--------------+------+
| someValue|[optV]|
|someOtherValue| null|
+--------------+------+

Parsing epoch milliseconds from json with Spark 2

Has anyone parsed a millisecond timestamp using from_json in Spark 2+? How's it done?
So Spark changed the TimestampType to parse epoch numerical values as being in seconds instead of millis in v2.
My input is a hive table that has a json formatted string in a column which I'm trying to parse like this:
val spark = SparkSession
.builder
.appName("Problematic Timestamps")
.enableHiveSupport()
.getOrCreate()
import spark.implicits._
val schema = StructType(
StructField("categoryId", LongType) ::
StructField("cleared", BooleanType) ::
StructField("dataVersion", LongType) ::
StructField("details", DataTypes.createArrayType(StringType)) ::
…
StructField("timestamp", TimestampType) ::
StructField("version", StringType) :: Nil
)
val item_parsed =
spark.sql("select * FROM source.jsonStrInOrc")
.select('itemid, 'locale,
from_json('internalitem, schema)
as 'internalitem,
'version, 'createdat, 'modifiedat)
val item_flattened = item_parsed
.select('itemid, 'locale,
$"internalitem.*",
'version as'outer_version, 'createdat, 'modifiedat)
This can parse a row with a column containing:
{"timestamp": 1494790299549, "cleared": false, "version": "V1", "dataVersion": 2, "categoryId": 2641, "details": [], …}
And that gives me timestamp fields like 49338-01-08 00:39:09.0 from a value 1494790299549 which I'd rather read as: 2017-05-14 19:31:39.549
Now I could set the schema for timestamp to be a long, then divide the value by 1000 and cast to a timestamp, but then I'd have 2017-05-14 19:31:39.000 not 2017-05-14 19:31:39.549. I'm having trouble figuring out how I could either:
Tell from_json to parse a millisecond timestamp (maybe by subclassing the TimestampType in some way to use in the schema)
Use a LongType in the schema and cast that to a Timestamp which preserves the milliseconds.
Addendum on UDFs
I found that trying to do the division in the select and then casting didn't look clean to me, though it's a perfectly valid method. I opted for a UDF that used a java.sql.timestamp which is actually specified in epoch milliseconds.
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, from_json, udf}
import org.apache.spark.sql.types.
{BooleanType, DataTypes, IntegerType, LongType,
StringType, StructField, StructType, TimestampType}
val tsmillis = udf { t: Long => new Timestamp (t) }
val spark = SparkSession
.builder
.appName("Problematic Timestamps")
.enableHiveSupport()
.getOrCreate()
import spark.implicits._
val schema = StructType(
StructField("categoryId", LongType) ::
StructField("cleared", BooleanType) ::
StructField("dataVersion", LongType) ::
StructField("details", DataTypes.createArrayType(StringType)) ::
…
StructField("timestamp", LongType) ::
StructField("version", StringType) :: Nil
)
val item_parsed =
spark.sql("select * FROM source.jsonStrInOrc")
.select('itemid, 'locale,
from_json('internalitem, schema)
as 'internalitem,
'version, 'createdat, 'modifiedat)
val item_flattened = item_parsed
.select('itemid, 'locale,
$"internalitem.categoryId", $"internalitem.cleared",
$"internalitem.dataVersion", $"internalitem.details",
tsmillis($"internalitem.timestamp"),
$"internalitem.version",
'version as'outer_version, 'createdat, 'modifiedat)
See how that's in the select.
I think it would be worthwhile to do a performance test to see if using withcolumn division and casting is faster than the udf.
Now I could set the schema for timestamp to be a long, then divide the value by 1000
Actually this exactly what you need, just keep the types right. Let's say you have only Long timestamp field:
val df = spark.range(0, 1).select(lit(1494790299549L).alias("timestamp"))
// df: org.apache.spark.sql.DataFrame = [timestamp: bigint]
If you divide by 1000:
val inSeconds = df.withColumn("timestamp_seconds", $"timestamp" / 1000)
// org.apache.spark.sql.DataFrame = [timestamp: bigint, timestamp_seconds: double]
you'll get timestamp in seconds as double (note that this is SQL, not Scala behavior).
All what is left is cast (Spark < 3.1)
inSeconds.select($"timestamp_seconds".cast("timestamp")).show(false)
// +-----------------------+
// |timestamp_seconds |
// +-----------------------+
// |2017-05-14 21:31:39.549|
// +-----------------------+
or (Spark >= 3.1) timestamp_seconds (or directly timestamp_millis)
import org.apache.spark.sql.functions.{expr, timestamp_seconds}
inSeconds.select(timestamp_seconds($"timestamp_seconds")).show(false)
// +------------------------------------+
// |timestamp_seconds(timestamp_seconds)|
// +------------------------------------+
// |2017-05-14 21:31:39.549 |
// +------------------------------------+
df.select(expr("timestamp_millis(timestamp)")).show(false)
// +---------------------------+
// |timestamp_millis(timestamp)|
// +---------------------------+
// |2017-05-14 21:31:39.549 |
// +---------------------------+

Project_Bank.csv is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [110, 111, 13, 10]

So i was trying to load the csv file inferring custom schema but everytime i end up with the following errors:
Project_Bank.csv is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [110, 111, 13, 10]
This is how my program looks like and my csv file entries ,
age;job;marital;education;default;balance;housing;loan;contact;day;month;duration;campaign;pdays;previous;poutcome;y
58;management;married;tertiary;no;2143;yes;no;unknown;5;may;261;1;-1;0;unknown;no
44;technician;single;secondary;no;29;yes;no;unknown;5;may;151;1;-1;0;unknown;no
33;entrepreneur;married;secondary;no;2;yes;yes;unknown;5;may;76;1;-1;0;unknown;no
My Code :
$spark-shell --packages com.databricks:spark-csv_2.10:1.5.0
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import org.apache.spark.sql.types._
import org.apache.spark.sql.SQLContext
import sqlContext.implicits._
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
val bankSchema = StructType(Array(
StructField("age", IntegerType, true),
StructField("job", StringType, true),
StructField("marital", StringType, true),
StructField("education", StringType, true),
StructField("default", StringType, true),
StructField("balance", IntegerType, true),
StructField("housing", StringType, true),
StructField("loan", StringType, true),
StructField("contact", StringType, true),
StructField("day", IntegerType, true),
StructField("month", StringType, true),
StructField("duration", IntegerType, true),
StructField("campaign", IntegerType, true),
StructField("pdays", IntegerType, true),
StructField("previous", IntegerType, true),
StructField("poutcome", StringType, true),
StructField("y", StringType, true)))
val df = sqlContext.
read.
schema(bankSchema).
option("header", "true").
option("delimiter", ";").
load("/user/amit.kudnaver_gmail/hadoop/project_bank/Project_Bank.csv").toDF()
df.registerTempTable("people")
df.printSchema()
val distinctage = sqlContext.sql("select distinct age from people")
Any suggestion as why am not able to work with the csv file here after pushing the correct schema. Thanks in advance for your advise.
Thanks
Amit K
Here the problem is Data Frame expects Parquet file while processing it. In order to handle data in CSV. Here what you can do.
First of all, remove the header row from the data.
58;management;married;tertiary;no;2143;yes;no;unknown;5;may;261;1;-1;0;unknown;no
44;technician;single;secondary;no;29;yes;no;unknown;5;may;151;1;-1;0;unknown;no
33;entrepreneur;married;secondary;no;2;yes;yes;unknown;5;may;76;1;-1;0;unknown;no
Next we write following code to read the data.
Create case class
case class BankSchema(age: Int, job: String, marital:String, education:String, default:String, balance:Int, housing:String, loan:String, contact:String, day:Int, month:String, duration:Int, campaign:Int, pdays:Int, previous:Int, poutcome:String, y:String)
Read data from HDFS and parse it
val bankData = sc.textFile("/user/myuser/Project_Bank.csv").map(_.split(";")).map(p => BankSchema(p(0).toInt, p(1), p(2),p(3),p(4), p(5).toInt, p(6), p(7), p(8), p(9).toInt, p(10), p(11).toInt, p(12).toInt, p(13).toInt, p(14).toInt, p(15), p(16))).toDF()
And then register table and execute queries.
bankData.registerTempTable("bankData")
val distinctage = sqlContext.sql("select distinct age from bankData")
Here is what the output would look like
+---+
|age|
+---+
| 33|
| 44|
| 58|
+---+
Here the expected file format is csv but as per error its looking for parquet file format.
This can be overcome by explicitly mentioning the file format as below (which was missing in the problem shared) because if we don't specify the file format then it by default expects Parquet format.
As per Java code version (sample example):
Dataset<Row> resultData = session.read().format("csv")
.option("sep", ",")
.option("header", true)
.option("mode", "DROPMALFORMED")
.schema(definedSchema)
.load(inputPath);
Here, schema can be defined either by using a java class (ie. POJO class) or by using StructType as already mentioned.
And inputPath is the path of input csv file.