Spark CSV read/write for empty field - csv

I want to write my DataFrame's empty fields as empty, but they are always written as NULL. I want to write nulls as ? and empty strings as empty/blank, and the same when reading from a CSV.
val df = sqlContext.createDataFrame(Seq(
  (0, "a"),
  (1, "b"),
  (2, "c"),
  (3, ""),
  (4, null)
))
scala> df.show
+---+----+
| _1|  _2|
+---+----+
|  0|   a|
|  1|   b|
|  2|   c|
|  3|    |
|  4|null|
+---+----+
df.write.mode(SaveMode.Overwrite).format("com.databricks.spark.csv").option("nullValue","?").save("/xxxxx/test_out")
Written output:
0,a
1,b
2,c
3,?
4,?
.option("treatEmptyValuesAsNulls" , "false")
This option does not work.
I need the empty to write as empty
0,a
1,b
2,c
3,
4,?

Try using SQL (I am using Spark 2.2):
val ds = sqlContext.sql(
  "select `_1`, case when `_2` is null then '?' else `_2` end as val " +
  "from global_temp.test")
ds.write.csv("<output path>")
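As a side note, since Spark 2.4 the built-in CSV writer also accepts an emptyValue option next to nullValue, which keeps nulls and empty strings distinct without any SQL rewriting. A rough sketch in PySpark syntax (the Scala writer takes the same option names; the output path is a placeholder):
# Assumes Spark 2.4+, where the built-in CSV source supports the emptyValue option
(df.write
   .mode("overwrite")
   .option("nullValue", "?")     # nulls are written as ?
   .option("emptyValue", "")     # empty strings stay empty instead of the default quoted ""
   .csv("/tmp/test_out"))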

Related

Pyspark map a function to two array type

I am quite new to PySpark. Here is what I am trying to do. Below is the table; the column types are ArrayType(DoubleType) and ArrayType(DecimalType).
A         B
[1,2]     [2,4]
[1,2,4]   [1,3,3]
What I want to do is treat A and B as np.arrays and then pass a function that does the calculation.
def func(row):
    a = row.A
    b = row.B
    res = some_function(a, b)   # placeholder for the actual calculation
    return res
What I am trying now is
res = a.rdd.map(func)
resDF = res.toDF(res)
resDF.show()
But I am receiving the following error. Could someone guide me a bit here? Thank you.
TypeError: schema should be StructType or list or None, but got: PythonRDD[167] at RDD at PythonRDD.scala:53
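The error comes from resDF = res.toDF(res): toDF expects a schema or a list of column names, not the RDD itself, which is why it rejects the PythonRDD. A minimal sketch of the RDD route, assuming the dataframe is called df, the columns really are named A and B, and np.dot stands in for your own calculation:
import numpy as np

def func(row):
    a = np.array(row.A, dtype=float)
    b = np.array(row.B, dtype=float)
    return (float(np.dot(a, b)),)      # return a tuple so the schema can be inferred

res = df.rdd.map(func)
resDF = res.toDF(["result"])           # pass column names, not the RDD
resDF.show()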
You can use pandas_udf.
Sample data:
df = spark.createDataFrame([
([1,2], [2,4]),
([1,2,4], [1,3,3]),
], 'a array<int>, b array<int>')
df.show()
+---------+---------+
|a |b |
+---------+---------+
|[1, 2] |[2, 4] |
|[1, 2, 4]|[1, 3, 3]|
+---------+---------+
Create a new column with pandas_udf:
from pyspark.sql import functions as F

@F.pandas_udf("array<int>")
def func(a, b):
    # element-wise product of the two array columns
    return a * b

df.withColumn('c', func('a', 'b')).show()
+---------+---------+----------+
| a| b| c|
+---------+---------+----------+
| [1, 2]| [2, 4]| [2, 8]|
|[1, 2, 4]|[1, 3, 3]|[1, 6, 12]|
+---------+---------+----------+
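Note that pandas_udf relies on pyarrow being installed; if that is an issue, a plain Python UDF gives the same result, just without the vectorized execution. A small sketch against the same df as above:
from pyspark.sql import functions as F
from pyspark.sql import types as T

@F.udf(T.ArrayType(T.IntegerType()))
def multiply(a, b):
    # element-wise product of the two array columns
    return [x * y for x, y in zip(a, b)]

df.withColumn('c', multiply('a', 'b')).show()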

PySpark Explode JSON String into Multiple Columns

I have a dataframe with a column of string datatype. The string holds the JSON returned by an API request.
df = spark.createDataFrame([
("[{original={ranking=1.0, input=top3}, response=[{to=Sam, position=guard}, {to=John, position=center}, {to=Andrew, position=forward}]}]",1)],
"col1:string, col2:int")
df.show()
Which generates a dataframe like:
+--------------------+----+
| col1|col2|
+--------------------+----+
|[{original={ranki...| 1|
+--------------------+----+
In the output I would like to keep col2 and add two columns derived from the response: col3 would capture the player name (indicated by to=) and col4 their position (indicated by position=). The dataframe would then have three rows, since there are three players. Example:
+----+------+-------+
|col2| col3| col4|
+----+------+-------+
| 1| Sam| guard|
| 1| John| center|
| 1|Andrew|forward|
+----+------+-------+
I've read that I can leverage something like:
df.withColumn("col3",explode(from_json("col1")))
However, I'm not sure how to explode it, given that I want two columns instead of one and need a schema.
Note, I can modify the response using json_dumps to return only the response piece of the string or...
[{to=Sam, position=guard}, {to=John, position=center}, {to=Andrew, position=forward}]}]
If you simplify the output as mentioned, you can define a simple JSON schema, convert the JSON string into a struct, and read each field.
Input
df = spark.createDataFrame([("[{'to': 'Sam', 'position': 'guard'},{'to': 'John', 'position': 'center'},{'to': 'Andrew', 'position': 'forward'}]",1)], "col1:string, col2:int")
# +-----------------------------------------------------------------------------------------------------------------+----+
# |col1 |col2|
# +-----------------------------------------------------------------------------------------------------------------+----+
# |[{'to': 'Sam', 'position': 'guard'},{'to': 'John', 'position': 'center'},{'to': 'Andrew', 'position': 'forward'}]|1 |
# +-----------------------------------------------------------------------------------------------------------------+----+
And this is the transformation
from pyspark.sql import functions as F
from pyspark.sql import types as T
schema = T.ArrayType(T.StructType([
    T.StructField('to', T.StringType()),
    T.StructField('position', T.StringType())
]))
(df
 .withColumn('temp', F.explode(F.from_json('col1', schema=schema)))
 .select(
     F.col('col2'),
     F.col('temp.to').alias('col3'),
     F.col('temp.position').alias('col4'),
 )
 .show()
)
# Output
# +----+------+-------+
# |col2| col3| col4|
# +----+------+-------+
# | 1| Sam| guard|
# | 1| John| center|
# | 1|Andrew|forward|
# +----+------+-------+
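If the real column still holds the original to=Sam style rather than valid JSON, one possible workaround is to normalize it into JSON with regexp_replace before from_json. This is only a sketch: it assumes col1 has already been trimmed down to just the response list, that every key and value is a single token, and it reuses the schema defined above.
# turn key=value pairs into "key":"value" so from_json can parse the string
normalized = F.regexp_replace('col1', r'(\w+)=(\w+)', '"$1":"$2"')

(df
 .withColumn('temp', F.explode(F.from_json(normalized, schema=schema)))
 .select(
     F.col('col2'),
     F.col('temp.to').alias('col3'),
     F.col('temp.position').alias('col4'),
 )
 .show())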

Spark dataframe from Json string with nested key

I have several columns to extract from a JSON string. However, one field has nested values and I am not sure how to deal with that.
I need to explode it into multiple rows to get the values of field name, Value1 and Value2.
import spark.implicits._
val df = Seq(
("1", """{"k": "foo", "v": 1.0}""", "some_other_field_1"),
("2", """{"p": "bar", "q": 3.0}""", "some_other_field_2"),
("3",
"""{"nestedKey":[ {"field name":"name1","Value1":false,"Value2":true},
| {"field name":"name2","Value1":"100","Value2":"200"}
|]}""".stripMargin, "some_other_field_3")
).toDF("id","json","other")
df.show(truncate = false)
val df1= df.withColumn("id1",col("id"))
.withColumn("other1",col("other"))
.withColumn("k",get_json_object(col("json"),"$.k"))
.withColumn("v",get_json_object(col("json"),"$.v"))
.withColumn("p",get_json_object(col("json"),"$.p"))
.withColumn("q",get_json_object(col("json"),"$.q"))
.withColumn("nestedKey",get_json_object(col("json"),"$.nestedKey"))
.select("id1","other1","k","v","p","q","nestedKey")
df1.show(truncate = false)
You can parse the nestedKey using from_json and explode it:
val df2 = df1.withColumn(
"nestedKey",
expr("explode_outer(from_json(nestedKey, 'array<struct<`field name`:string, Value1:string, Value2:string>>'))")
).select("*", "nestedKey.*").drop("nestedKey")
df2.show
+---+------------------+----+----+----+----+----------+------+------+
|id1| other1| k| v| p| q|field name|Value1|Value2|
+---+------------------+----+----+----+----+----------+------+------+
| 1|some_other_field_1| foo| 1.0|null|null| null| null| null|
| 2|some_other_field_2|null|null| bar| 3.0| null| null| null|
| 3|some_other_field_3|null|null|null|null| name1| false| true|
| 3|some_other_field_3|null|null|null|null| name2| 100| 200|
+---+------------------+----+----+----+----+----------+------+------+
I did it in one dataframe:
val df1= df.withColumn("id1",col("id"))
.withColumn("other1",col("other"))
.withColumn("k",get_json_object(col("json"),"$.k"))
.withColumn("v",get_json_object(col("json"),"$.v"))
.withColumn("p",get_json_object(col("json"),"$.p"))
.withColumn("q",get_json_object(col("json"),"$.q"))
.withColumn("nestedKey",get_json_object(col("json"),"$.nestedKey"))
.withColumn(
"nestedKey",
expr("explode_outer(from_json(nestedKey, 'array<struct<`field name`:string, Value1:string, Value2:string>>'))")
).withColumn("fieldname",col("nestedKey.field name"))
.withColumn("valueone",col("nestedKey.Value1"))
.withColumn("valuetwo",col("nestedKey.Value2"))
.select("id1","other1","k","v","p","q","fieldname","valueone","valuetwo")```
still working to make it more elegant

How to read custom formatted dates as timestamp in pyspark

I want to use spark.read() to pull data from a .csv file, while enforcing a schema. However, I can't get spark to recognize my dates as timestamps.
First I create a dummy file to test with
%scala
Seq("1|1/15/2019 2:24:00 AM","2|test","3|").toDF().write.text("/tmp/input/csvDateReadTest")
Then I try to read it, and provide a dateFormat string, but it doesn't recognize my dates, and sends the records to the badRecordsPath
df = spark.read.format('csv') \
  .schema("id int, dt timestamp") \
  .option("delimiter", "|") \
  .option("badRecordsPath", "/tmp/badRecordsPath") \
  .option("dateFormat", "M/dd/yyyy hh:mm:ss aaa") \
  .load("/tmp/input/csvDateReadTest")
As a result, I get just one record in df (ID 3), when I'm expecting to see two (IDs 1 and 3).
df.show()
+---+----+
| id| dt|
+---+----+
| 3|null|
+---+----+
You must change dateFormat to timestampFormat, since in your case you need a timestamp type and not a date. Additionally, the timestamp format should be M/d/yyyy h:mm:ss a (uppercase M is the month; lowercase mm means minutes).
Sample data:
Seq(
"1|1/15/2019 2:24:00 AM",
"2|test",
"3|5/30/1981 3:11:00 PM"
).toDF().write.text("/tmp/input/csvDateReadTest")
With the changes for the timestamp:
val df = spark.read.format("csv")
.schema("id int, dt timestamp")
.option("delimiter","|")
.option("badRecordsPath","/tmp/badRecordsPath")
.option("timestampFormat","mm/dd/yyyy h:mm:ss a")
.load("/tmp/input/csvDateReadTest")
And the output:
+----+-------------------+
| id| dt|
+----+-------------------+
| 1|2019-01-15 02:24:00|
| 3|1981-05-30 15:11:00|
|null| null|
+----+-------------------+
Note that the record with id 2 failed to comply with the schema definition and therefore contains null. If you also want to keep the invalid records, you need to change the timestamp column to string, and the output in this case will be:
+---+--------------------+
| id| dt|
+---+--------------------+
| 1|1/15/2019 2:24:00 AM|
| 3|5/30/1981 3:11:00 PM|
| 2| test|
+---+--------------------+
UPDATE:
In order to change the string dt into timestamp type you could try df.withColumn("dt", $"dt".cast("timestamp")), but the cast cannot parse this custom format and will simply replace all the values with null.
You can achieve it with the following code instead:
import org.apache.spark.sql.Row
import java.text.SimpleDateFormat
import java.util.{Date, Locale}
import java.sql.Timestamp
import scala.util.{Try, Success, Failure}
val formatter = new SimpleDateFormat("M/d/yyyy h:mm:ss a", Locale.US)
df.map { case Row(id: Int, dt: String) =>
  val tryParse = Try[Date](formatter.parse(dt))
  val p_timestamp = tryParse match {
    case Success(parsed) => new Timestamp(parsed.getTime())
    case Failure(_) => null
  }
  (id, p_timestamp)
}.toDF("id", "dt").show
Output:
+---+-------------------+
| id| dt|
+---+-------------------+
| 1|2019-01-15 02:24:00|
| 3|1981-05-30 15:11:00|
| 2| null|
+---+-------------------+
Alternatively, here is a sample using unix_timestamp and from_unixtime on the string column:
df.withColumn("times",
    from_unixtime(unix_timestamp(col("dt"), "M/dd/yyyy hh:mm:ss a"),
      "yyyy-MM-dd HH:mm:ss.SSSSSS"))
  .show(false)
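If dt is read as a plain string, to_timestamp (available since Spark 2.2 in both the Scala and Python APIs) also does the conversion in a single call. A quick sketch in PySpark syntax, assuming a string column dt holding the custom format:
from pyspark.sql import functions as F

# M = month, d = day of month, h = 12-hour clock, a = AM/PM marker
df.withColumn("dt_ts", F.to_timestamp("dt", "M/d/yyyy h:mm:ss a")).show(truncate=False)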

How to select only first row from repeating values in columns of dataframe in apache-spark?

Consider that I have a dataframe containing the following data:
val seq = Seq((1, "John"), (1, "John"), (2, "Michael"), (3, "Sham"),(4, "Dan"), (2, "Michael"), (4, "Dan"))
val rdd = sc.parallelize(seq)
val df = rdd.toDF("id","name")
I want the output to be:
1, "John"
2, "Michael"
3, "Sham"
4, "Dan"
How can I select only one row for each repeated combination of the id and name columns?
You can use dropDuplicates() on dataframe/dataset.
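For example (a quick sketch against the df built above; the call is the same in the Scala and Python APIs, and with no arguments dropDuplicates deduplicates on every column):
df.dropDuplicates().orderBy("id").show()  # keeps one row per distinct (id, name) pair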
You may be looking for the distinct values from the DataFrame.
df.distinct.orderBy("id").show()
You can drop the orderBy if you don't want the results ordered.
+---+-------+
| id| name|
+---+-------+
| 1| John|
| 2|Michael|
| 3| Sham|
| 4| Dan|
+---+-------+