spark df.write quote all fields but not null values - csv

I am trying to create a csv from values stored in the table:
| col1 | col2 | col3 |
| "one" | null | "one" |
| "two" | "two" | "two" |
hive > select * from table where col2 is null;
one null one
I am generating the CSV using the code below:
df.repartition(1)
.write.option("header",true)
.option("delimiter", ",")
.option("quoteAll", true)
.option("nullValue", "")
.csv(S3Destination)
CSV I get:
"col1","col2","col3"
"one","","one"
"two","two","two"
Expected CSV (no double quotes for the null value):
"col1","col2","col3"
"one",,"one"
"two","two","two"
Any help is appreciated; does the DataFrame writer have an option to do this?

You can take a UDF approach and apply it (using withColumn on the repartitioned dataframe above) to the column where a double-quoted empty string can occur; see the sample code below.
sqlContext.udf().register("convertToEmptyWithOutQuotes",
    (String abc) -> (abc.trim().length() > 0 ? abc : abc.replace("\"", " ")),
    DataTypes.StringType);
String has a replace method which does the job:
val a = Array("'x'","","z")
println(a.mkString(",").replace("\"", " "))
will produce 'x',,z
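For reference, a minimal Scala sketch (assuming the UDF above has been registered under the same name and that col2 is the column that may hold the empty strings) of wiring it in with withColumn before the write:
import org.apache.spark.sql.functions.{callUDF, col}

// apply the registered UDF to the affected column; rows where col2 is null reach the UDF as null
val cleaned = df.withColumn("col2", callUDF("convertToEmptyWithOutQuotes", col("col2")))
cleaned.repartition(1)
  .write.option("header", true)
  .option("delimiter", ",")
  .option("quoteAll", true)
  .option("nullValue", "")
  .csv(S3Destination)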

Related

Loading quoted numbers into snowflake table from CSV with COPY TO <TABLE>

I have a problem loading CSV data into a Snowflake table. Fields are wrapped in double quote marks, and hence there is a problem importing them into the table.
I know that COPY INTO has the CSV-specific option FIELD_OPTIONALLY_ENCLOSED_BY = '"', but it's not working at all.
Here are some pieces of the table definition and copy command:
CREATE TABLE ...
(
GamePlayId NUMBER NOT NULL,
etc...
....);
COPY INTO ...
FROM ...csv.gz'
FILE_FORMAT = (TYPE = CSV
STRIP_NULL_VALUES = TRUE
FIELD_DELIMITER = ','
SKIP_HEADER = 1
error_on_column_count_mismatch=false
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
)
ON_ERROR = "ABORT_STATEMENT"
;
Csv file looks like this:
"3922000","14733370","57256","2","3","2","2","2019-05-23 14:14:44",",00000000",",00000000",",00000000",",00000000","1000,00000000","1000,00000000","1317,50400000","1166,50000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000",",00000000"
I get an error
'''Numeric value '"3922000"' is not recognized '''
I'm pretty sure it's because the NUMBER value is interpreted as a string when Snowflake reads the quote marks, but since I use
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
it shouldn't even be there... Does anyone have a solution to this?
Maybe something is incorrect with your file? I was just able to run the following without issue.
1. create the test table:
CREATE OR REPLACE TABLE
dbNameHere.schemaNameHere.stacko_58322339 (
num1 NUMBER,
num2 NUMBER,
num3 NUMBER);
2. create test file, contents as follows
1,2,3
"3922000","14733370","57256"
3,"2",1
4,5,"6"
3. create stage and put file in stage
4. run the following copy command
COPY INTO dbNameHere.schemaNameHere.STACKO_58322339
FROM #stageNameHere/stacko_58322339.csv.gz
FILE_FORMAT = (TYPE = CSV
STRIP_NULL_VALUES = TRUE
FIELD_DELIMITER = ','
SKIP_HEADER = 0
ERROR_ON_COLUMN_COUNT_MISMATCH=FALSE
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
)
ON_ERROR = "CONTINUE";
5. results
+-----------------------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
| file | status | rows_parsed | rows_loaded | error_limit | errors_seen | first_error | first_error_line | first_error_character | first_error_column_name |
|-----------------------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------|
| stageNameHere/stacko_58322339.csv.gz | LOADED | 4 | 4 | 4 | 0 | NULL | NULL | NULL | NULL |
+-----------------------------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+-------------------------+
1 Row(s) produced. Time Elapsed: 2.436s
6. view the records
>SELECT * FROM dbNameHere.schemaNameHere.stacko_58322339;
+---------+----------+-------+
| NUM1 | NUM2 | NUM3 |
|---------+----------+-------|
| 1 | 2 | 3 |
| 3922000 | 14733370 | 57256 |
| 3 | 2 | 1 |
| 4 | 5 | 6 |
+---------+----------+-------+
Can you try a similar test to this?
EDIT: A quick look at your data shows many of your numeric fields appear to start with commas, so something is definitely amiss with the data.
Assuming your numbers are European-formatted (a comma for the decimal place and a period for thousands), reading the numeric formatting help it seems Snowflake does not support this as input. I'd open a feature request.
But if you read the column in as text then use REPLACE like
SELECT '100,1234'::text as A
,REPLACE(A,',','.') as B
,TRY_TO_DECIMAL(b, 20,10 ) as C;
gives:
A B C
100,1234 100.1234 100.1234000000
Safer would be to strip the thousands separators first, like:
SELECT '1.100,1234'::text as A
,REPLACE(A,'.') as B
,REPLACE(B,',','.') as C
,TRY_TO_DECIMAL(C, 20,10 ) as D;

Output Error for Matrix in Python?

I am trying to create a function that outputs a matrix containing each item in a list on a separate line, with lines in between. The only output I'm getting is empty quotation marks ('') and I do not understand why. I think I set it all up correctly to output what is needed, but there must be something missing.
I included examples below my code.
def show_table(table):
    table = []
    s = [[str(e) for e in row] for row in table]
    lens = [max(map(len, col)) for col in zip(*s)]
    fmt = '\t'.join('{{:{}}}'.format(x) for x in lens)
    table = [fmt.format(*row) for row in s]
    return '\n'.join(table)
show_table([['A','BB'],['C','DD']])
output:
'| A | BB |\n| C | DD |\n'
print(show_table([['A','BB'],['C','DD']]))
output:
| A | BB |
| C | DD |
The issue is on the second line where you are initialising your list to an empty list. Instead try:
if table is None:
    table = []
Perhaps a better way to accomplish this could be:
def show_table(table):
    if table is None:
        table = []
    data = ""
    for row in table:
        for val in row:
            data += "| " + val + " "
        data += "|\n"
    return data.strip("\n")
print(show_table([['a','bb'],['c','dd']]))
Output:
| a | bb |
| c | dd |

Remove double quotes from the return of a function in PostgreSQL

I have the following function in PostgreSQL
CREATE OR REPLACE FUNCTION public.translatejson(JSONB, TEXT)
RETURNS TEXT
AS
$BODY$
SELECT ($1->$2)::TEXT
$BODY$
LANGUAGE sql STABLE;
When I execute it I receive the values surrounded by double quotes. For example:
SELECT id, translatejson("title", 'en-US') AS "tname" FROM types."FuelTypes";
in return I get a table like this
-------------------
| id | tname |
-------------------
| 1 | "gasoline" |
| 2 | "diesel" |
-------------------
The values in the 'title' column are in JSON format:
{ "en-US":"gasoline", "fr-FR":"essence" }.
How can I omit the double quotes and return just the string?
The -> operator returns a json result. Casting it to text leaves it in a json representation.
The ->> operator returns a text result. Use that instead.
test=> SELECT '{"car": "going"}'::jsonb -> 'car';
?column?
----------
"going"
(1 row)
test=> SELECT '{"car": "going"}'::jsonb ->> 'car';
?column?
----------
going
(1 row)

Spark Row to JSON

I would like to create JSON from a Spark v1.6 dataframe (using Scala). I know that there is the simple solution of doing df.toJSON.
However, my problem looks a bit different. Consider for instance a dataframe with the following columns:
| A | B | C1 |  C2 | C3 |
-------------------------------------------
| 1 | test | ab | 22 | TRUE |
| 2 | mytest | gh | 17 | FALSE |
I would like to have at the end a dataframe with
| A | B | C |
----------------------------------------------------------------
| 1 | test | { "c1" : "ab", "c2" : 22, "c3" : TRUE } |
| 2 | mytest | { "c1" : "gh", "c2" : 17, "c3" : FALSE } |
where C is JSON containing C1, C2, C3. Unfortunately, at compile time I do not know what the dataframe looks like (except for the columns A and B, which are always "fixed").
As for the reason why I need this: I am using Protobuf for sending around the results. Unfortunately, my dataframe sometimes has more columns than expected and I would still send those via Protobuf, but I do not want to specify all columns in the definition.
How can I achieve this?
Spark 2.1 should have native support for this use case (see #15354).
import org.apache.spark.sql.functions.to_json
df.select(to_json(struct($"c1", $"c2", $"c3")))
I use this command to solve the to_json problem:
output_df = (df.select(to_json(struct(col("*"))).alias("content")))
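Along the same lines, a minimal Scala sketch (assuming Spark 2.1+, where to_json and struct are available, and that A and B are the only fixed columns) that keeps A and B and packs whatever other columns exist into a single JSON column:
import org.apache.spark.sql.functions.{col, struct, to_json}

// collect every column except the fixed A and B, without knowing them at compile time
val others = df.columns.filterNot(Seq("A", "B").contains).map(col)
val result = df.select(col("A"), col("B"), to_json(struct(others: _*)).alias("C"))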
Here is a version with no JSON parser that adapts to your schema:
import org.apache.spark.sql.functions.{col, concat, concat_ws, lit}

df.select(
  col(df.columns(0)),
  col(df.columns(1)),
  concat(
    lit("{"),
    concat_ws(",", df.dtypes.slice(2, df.dtypes.length).map(dt => {
      val c = dt._1
      val t = dt._2
      concat(
        lit("\"" + c + "\":" + (if (t == "StringType") "\"" else "")),
        col(c),
        lit(if (t == "StringType") "\"" else "")
      )
    }): _*),
    lit("}")
  ) as "C"
).collect()
First let's convert the C columns to a struct:
val dfStruct = df.select($"A", $"B", struct($"C1", $"C2", $"C3").alias("C"))
This structure can be converted to JSONL using toJSON as before:
dfStruct.toJSON.collect
// Array[String] = Array(
// {"A":1,"B":"test","C":{"C1":"ab","C2":22,"C3":true}},
// {"A":2,"B":"mytest","C":{"C1":"gh","C2":17,"C3":false}})
I am not aware of any built-in method that can convert a single column, but you can either convert it individually and join, or use your favorite JSON parser in a UDF.
case class C(C1: String, C2: Int, C3: Boolean)
object CJsonizer {
import org.json4s._
import org.json4s.JsonDSL._
import org.json4s.jackson.Serialization
import org.json4s.jackson.Serialization.write
implicit val formats = Serialization.formats(org.json4s.NoTypeHints)
def toJSON(c1: String, c2: Int, c3: Boolean) = write(C(c1, c2, c3))
}
val cToJSON = udf((c1: String, c2: Int, c3: Boolean) =>
CJsonizer.toJSON(c1, c2, c3))
df.withColumn("c_json", cToJSON($"C1", $"C2", $"C3"))

Why does reading csv file with empty values lead to IndexOutOfBoundException?

I have a CSV file with the following structure:
Name | Val1 | Val2 | Val3 | Val4 | Val5
John 1 2
Joe 1 2
David 1 2 10 11
I am able to load this into an RDD fine. I tried to create a schema and then a DataFrame from it, and I get an IndexOutOfBound error.
Code is something like this ...
val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))
When I try to perform an action on rowRDD, it gives the error.
Any help is greatly appreciated.
This is not an answer to your question, but it may help to solve your problem.
From the question I see that you are trying to create a dataframe from a CSV.
Creating a dataframe from a CSV can easily be done using the spark-csv package.
With spark-csv, the Scala code below can be used to read a CSV:
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load(csvFilePath)
For your sample data I got the following result
+-----+----+----+----+----+----+
| Name|Val1|Val2|Val3|Val4|Val5|
+-----+----+----+----+----+----+
| John| 1| 2| | | |
| Joe| 1| 2| | | |
|David| 1| 2| | 10| 11|
+-----+----+----+----+----+----+
You can also use inferSchema with the latest version; see this answer.
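For example, a sketch of what that looks like (using the spark-csv option names; adjust to your Spark version):
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true") // sample the data to infer column types instead of defaulting to strings
  .load(csvFilePath)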
Empty values are not the issue if the CSV file contains a fixed number of columns and your CSV looks like this (note the empty field separated by its own commas):
David,1,2,10,,11
The problem is your CSV file contains 6 columns, yet with:
val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5), p(6)))
You try to read 7 columns. Just change your mapping to:
val rowRDD = fileRDD.map(p => Row(p(0), p(1), p(2), p(3), p(4), p(5)))
And Spark will take care of the rest.
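For completeness, a minimal sketch (assuming the six-column rowRDD above, an available sqlContext, and column names taken from the sample header) of the schema-and-DataFrame step the question asks about:
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// every field is read as a string here; cast later if you need numeric types
val schema = StructType(
  Seq("Name", "Val1", "Val2", "Val3", "Val4", "Val5").map(name => StructField(name, StringType, nullable = true)))
val df = sqlContext.createDataFrame(rowRDD, schema)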
A possible solution to this problem is to replace the missing values with Double.NaN. Suppose I have a file example.csv with these columns in it:
David,1,2,10,,11
You can read the CSV file as a text file as follows:
val fileRDD = sc.textFile("example.csv").map { x =>
  val y = x.split(",")
  y.map(k => if (k == "") Double.NaN else k.toDouble)
}
You can then use your code to create a dataframe from it.
You can do it as follows.
val rowRDD = sc.textFile(csvFilePath)
  .map(_.split(delimiter_of_file, -1))
  .map(p =>
    Row(
      p(0),
      p(1),
      p(2),
      p(3),
      p(4),
      p(5)))
Split using the delimiter of your file. When you set -1 as the limit, it considers all the empty fields.