Read empty string as NULL in Athena - CSV

I want to create a table in Amazon Athena over a CSV file on S3. The CSV file looks like this:
id,name,invalid
1,abc,
2,cba,y
The code for creating the table looks like this:
CREATE EXTERNAL TABLE IF NOT EXISTS {schema}.{table_name} (
  id int,
  name string,
  invalid string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ','
)
LOCATION '{s3}'
TBLPROPERTIES ('has_encrypted_data'='false','compressionType'='gzip')
So, my problem is that Athena reads an empty string as an actual empty string, but I'd like to see it as NULL. I haven't found any property for that in the docs.

LazySimpleSerDe interprets \N as NULL by default, but you can configure it to use another string with the serialization.null.format SerDe property.
See this guide on CSV and Athena for more details.
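As a sketch, the question's DDL with empty strings read back as NULL needs only one extra SerDe property (everything else unchanged; {schema}, {table_name}, and {s3} are the question's own placeholders):

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS {schema}.{table_name} (
  id int,
  name string,
  invalid string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ',',
  -- read empty fields back as NULL instead of ''
  'serialization.null.format' = ''
)
LOCATION '{s3}'
TBLPROPERTIES ('has_encrypted_data'='false','compressionType'='gzip')
```

With this in place, a query like SELECT id FROM {schema}.{table_name} WHERE invalid IS NULL should match row 1 of the sample file.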

Related

Apache Drill: Convert JSON as String to JSON object to retrieve each element

I have the below string in a column in a Hive table, which I am trying to query using Apache Drill:
{"cdrreasun":"52","cdxscarc":"20150407161405","cdrend":"20150407155201","cdrdnrar.1un":"24321.70","servlnqlp":"54.201.25.50","men":"42403","xa:lnqruup":"3","cemcau":"120","accuuncl":"21","cdrc":"5","volcuca":"1.7"}
I want to retrieve all values for the key cdrreasun using Apache Drill SQL.
I can't use FLATTEN on the column, as it says Flatten does not work with inputs of non-list types.
I can't use KVGEN either, as it works only with the MAP datatype.
Drill has the function convert_fromJSON, which allows converting from a string to a JSON object. For more details about this function and examples of its usage, see https://drill.apache.org/docs/data-type-conversion/#convert_to-and-convert_from
For the example you specified, you can run:
convert_fromJSON(colWithJsonText)['cdrreasun']
I figured it out; hope it will be helpful for others.
If the column is of type MAP, we have to do it in 3 steps:
KVGEN() -> FLATTEN() -> convert_from()
If it's of type STRING, then the KVGEN() step is not needed.
SELECT ratinggrouplist
      ,t3.cdrlist3.cdrreason AS cdrreason
      ,t3.cdrlist3.cdrstart AS cdrstart
      ,t3.cdrlist3.cdrend AS cdrend
      ,t3.cdrlist3.cdrduration AS cdrduration
FROM (
    SELECT ratinggrouplist, convert_from(t2.cdrlist2.`element`, 'JSON') AS cdrlist3
    FROM (
        SELECT ratinggrouplist, flatten(t1.cdrlist1.`value`) AS cdrlist2
        FROM (
            SELECT ratinggrouplist, kvgen(cdrlist) AS cdrlist1
            FROM dfs.tmp.SOME_TABLE
        ) AS t1
    ) AS t2
) AS t3;

Hive: Current token (VALUE_STRING) not numeric, can not use numeric value accessors, while trying to query an external table in Hive on nested JSON

I am unable to query an external Hive table on a nested JSON due to:
Error: java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Current token (VALUE_STRING) not numeric, can not use numeric value accessors
Json looks like-
Create table command used-
create external table s
(
  magic String,
  type String,
  headers String,
  messageSchemaId String,
  messageSchema String,
  message struct<data:struct<s_ID:double,s_TYPE_ID:Int,NAME:String,DESCR:String,ACTIVE_s:double,s_ID:double,s_ENABLED:Int,pharmacy_location:Int>,seData:struct<APPLICATION_ID:double,s_TYPE_ID:Int,NAME:String,DESCR:String,s_STAT:double,PROGRAM_ID:double,s_ENABLED:Int,s_location:Int>,headers:struct<operation:String, changeSequence:String, timestamp: String, streamPosition: String, transactionId: String>>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS textfile
LOCATION '/user/eh2524/pt_rpt/MGPH.APPLICATION';
For the same JSON, I am able to create an external table with:
CREATE EXTERNAL TABLE `MGPH_ZT`(
  `jsonstr` string)
PARTITIONED BY (
  `dt` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  '/user/eh2524/pt_rpt/MGPH.APPLICATION/'
TBLPROPERTIES (
  'transient_lastDdlTime'='1510776187')
But to query the table created above, I have to use the json_tuple method, like:
select count(*) from pt_rpt_stg.hvf_modules j
lateral view json_tuple(j.jsonstr, 'message') m as message
lateral view json_tuple(m.message, 'data') d as datacntr
lateral view json_tuple(d.datacntr,'l_location') s as pharmacy_location
where pharmacy_location is null;
I want to create the table using the JSON SerDe so that my team can query it directly, like we do for a normal Hive table; right now it fails when you query it.
What did I try?
I checked if there were any \n characters in the JSON file, but there were none; I also tried with a single record.
I checked the table creation definition for a nested JSON against https://community.hortonworks.com/questions/29814/how-to-load-nested-json-file-in-hive-table-using-h.html, but mine seems correct, as I have used the required complex data types.
The problem is that you are declaring pharmacy_location as Int in your table definition, but your sample data is a string: "pharmacy_location": "93". If you change that in your table definition, it should work.
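As a sketch of the fix (the table name s_fixed is hypothetical; only the affected field is shown, the other columns stay as in the question's DDL):

```sql
-- pharmacy_location arrives as a quoted value ("93") in the JSON,
-- so declare it String here and cast it at query time if needed
CREATE EXTERNAL TABLE s_fixed (
  message struct<data:struct<pharmacy_location:String>>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS textfile
LOCATION '/user/eh2524/pt_rpt/MGPH.APPLICATION';
```

A later SELECT can still treat the field numerically with cast(message.data.pharmacy_location as int).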

I am trying to set the empty values in a CSV file to zero in Hive, but this code doesn't seem to work. What changes should I make?

This is the input .csv file
"1","","Animation"
"2","Jumanji",""
"","Grumpier Old Men","Comedy"
Hive Code
CREATE TABLE IF NOT EXISTS movies (movie_id int, movie_name string, genre string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar" = "\"",
  "serialization.null.format" = '0'
);
Output
1 Animation
2 Jumanji
Grumpier Old Men Comedy
Empty strings in a CSV are interpreted as empty strings, not NULLs. To represent NULL inside a delimited text file you should use \N. Hive also provides the table property serialization.null.format, which can be used to treat a character sequence of your choice as NULL in Hive SQL; in your case it should be the empty string "". To convert NULLs to zeroes, use the NVL(col, 0) or COALESCE(col, 0) function, depending on your Hive version (COALESCE should work in all versions).
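A sketch of the query side, assuming the empty fields have been made to surface as NULL (column names are from the question; COALESCE works across Hive versions):

```sql
-- map NULL movie ids to zero at query time
SELECT COALESCE(movie_id, 0) AS movie_id,
       movie_name,
       genre
FROM movies;
```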

Hive - Complex regexp_replace

I'm not a specialist in regular expressions, and I'm facing issues using regexp_replace in Hive.
I would like to load a CSV file into Hive which contains rows like these:
AAA,1234,BBB,,,"""CC,CCC""","""DDD""","""EE"EEE""",,
"""AAA""",1234,BBB,,,CCCC,"""DD,DD""",,"""FFFF""",
As you can see, the format isn't perfect:
There are non-escaped commas in string fields
Some string fields are enclosed by """ (3 double quotes)
There are non-escaped double quotes in string fields
There are empty fields
When I try to import it into a Hive table, the columns are not parsed correctly because of the non-escaped commas.
So I imported the raw data as rows into a Hive table like this:
CREATE EXTERNAL TABLE MyRawTable
(
RAW_DATA STRING
)
STORED AS TEXTFILE
LOCATION '/path/to/hdfs/file'
And I'm trying to use the regexp_replace function to transform the rows, to:
Escape the commas and the double and single quotes in the string fields
Not enclose string fields in double quotes
So the data will look like this:
AAA,1234,BBB,,,CC\,CCC,DDD,EE\"EEE,,
AAA,1234,BBB,,,CCCC,DD\,DD,,FFFF,
I can't find the solution for this regex; any ideas? Thanks a lot!
Forget about the regexp; you don't need it. The commas aren't escaped, but they are surrounded by double quotes, so you can simply use the OpenCSVSerde:
CREATE EXTERNAL TABLE yourtable (foo int, bar string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar" = "\"",
  "escapeChar" = "\""
)
LOCATION '/your/folder/containing/csv/files/';

Date variable is NULL while loading csv data into hive External table

I am trying to load a SAS dataset into a Hive external table. For that, I first converted the SAS dataset into CSV format. In the SAS dataset, the date variable (i.e. as_of_dt) contents show this:
LENGTH=8, FORMAT=DATE9., INFORMAT=DATE9., LABEL=as_of_dt
And for converting SAS into CSV, I used the code below (I had used a 'retain' statement earlier in SAS so that the order of variables is maintained):
proc export data=input_SASdataset_for_csv_conv
  outfile="/mdl/myData/final_merged_table_201501.csv"
  dbms=csv
  replace;
  putnames=no;
run;
Up to here (i.e. up to CSV file creation), the date variable is read correctly. But after this, when I load it into a Hive external table using the command below, the DATE variable (i.e. as_of_dt) is assigned NULL:
CREATE EXTERNAL TABLE final_merged_table_20151(as_of_dt DATE, client_cm_id STRING, cm11 BIGINT, cm_id BIGINT, corp_id BIGINT, iclic_id STRING, mkt_segment_cd STRING, product_type_cd STRING, rated_company_id STRING, recovery_amt DOUBLE, total_bal_amt DOUBLE, write_off_amt DOUBLE) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '/mdl/myData';
Also, when I run desc formatted final_merged_table_201501 in Hive, I get the following table parameters:
Table Parameters:
COLUMN_STATS_ACCURATE false
EXTERNAL TRUE
numFiles 0
numRows -1
rawDataSize -1
totalSize 0
transient_lastDdlTime 1447151851
But even though it shows numRows=-1, I am still able to see data inside the table using the Hive command SELECT * FROM final_merged_table_20151 limit 10;, with the date variable (as_of_dt) stored as NULL.
Where might be the problem?
Based on madhu's comment, you need to change the format of as_of_dt to yymmdd10., so that the dates in the CSV match the yyyy-MM-dd layout Hive expects for a DATE column.
You can do that with PROC DATASETS. Here is an example:
data test;
  /* Test data with AS_OF_DT formatted DATE9. per your question */
  format as_of_dt date9.;
  do as_of_dt=today() to today()+5;
    output;
  end;
run;

proc datasets lib=work nolist;
  /* Modify the test data set and set the format for the AS_OF_DT variable */
  modify test;
  attrib as_of_dt format=yymmdd10.;
run;
quit;

/* Create CSV */
proc export file="C:\temp\test.csv"
  data=test
  dbms=csv
  replace;
  putnames=no;
run;
If you open the CSV, you will see the date in YYYY-MM-DD format.