Date variable is NULL while loading CSV data into Hive external table

I am trying to load a SAS dataset into a Hive external table. To do that, I first converted the SAS dataset into CSV file format. In the SAS dataset, the date variable (i.e. as_of_dt) has these attributes:
LENGTH=8, FORMAT=DATE9., INFORMAT=DATE9., LABEL=as_of_dt
To convert the SAS dataset to CSV, I used the code below (I had used a RETAIN statement earlier in SAS so that the order of the variables is maintained):
proc export data=input_SASdataset_for_csv_conv
    outfile="/mdl/myData/final_merged_table_201501.csv"
    dbms=csv
    replace;
    putnames=no;
run;
Up to this point (i.e. CSV file creation), the date variable is written correctly. But when I load the file into a Hive external table using the command below, the date variable (i.e. as_of_dt) is read as NULL:
CREATE EXTERNAL TABLE final_merged_table_20151 (
    as_of_dt DATE,
    client_cm_id STRING,
    cm11 BIGINT,
    cm_id BIGINT,
    corp_id BIGINT,
    iclic_id STRING,
    mkt_segment_cd STRING,
    product_type_cd STRING,
    rated_company_id STRING,
    recovery_amt DOUBLE,
    total_bal_amt DOUBLE,
    write_off_amt DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/mdl/myData';
Also, when I run desc formatted final_merged_table_201501 in Hive, I get the following table parameters:
Table Parameters:
COLUMN_STATS_ACCURATE false
EXTERNAL TRUE
numFiles 0
numRows -1
rawDataSize -1
totalSize 0
transient_lastDdlTime 1447151851
Even though it shows numRows=-1, I can still see data in the table with SELECT * FROM final_merged_table_20151 LIMIT 10;, but the date variable (as_of_dt) is NULL.
Where might the problem be?

Based on madhu's comment, you need to change the format on as_of_dt to yymmdd10., because Hive's DATE type expects values in yyyy-MM-dd form; DATE9. values such as 01JAN2015 cannot be parsed and are read as NULL.
You can make that change with PROC DATASETS. Here is an example:
data test;
    /* Test data with AS_OF_DT formatted DATE9. per your question */
    format as_of_dt date9.;
    do as_of_dt=today() to today()+5;
        output;
    end;
run;

proc datasets lib=work nolist;
    /* Modify the test data set and set the format for AS_OF_DT */
    modify test;
    attrib as_of_dt format=yymmdd10.;
run;
quit;
/* Create the CSV */
proc export data=test
    outfile="C:\temp\test.csv"
    dbms=csv
    replace;
    putnames=no;
run;
If you open the CSV, you will see the date in YYYY-MM-DD format.
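If regenerating the CSV is not an option, one possible workaround (a sketch of my own, not part of the original answer; the raw table name is illustrative) is to declare as_of_dt as STRING and convert the DATE9.-style text at query time with Hive's built-in date functions:
CREATE EXTERNAL TABLE final_merged_table_raw (
    as_of_dt STRING,          -- raw text such as 01JAN2015
    client_cm_id STRING,
    cm11 BIGINT,
    cm_id BIGINT,
    corp_id BIGINT,
    iclic_id STRING,
    mkt_segment_cd STRING,
    product_type_cd STRING,
    rated_company_id STRING,
    recovery_amt DOUBLE,
    total_bal_amt DOUBLE,
    write_off_amt DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/mdl/myData';

-- unix_timestamp() takes a Java SimpleDateFormat pattern, so 'ddMMMyyyy'
-- should parse values like 01JAN2015; to_date() then keeps the date part.
SELECT to_date(from_unixtime(unix_timestamp(as_of_dt, 'ddMMMyyyy'))) AS as_of_dt
FROM final_merged_table_raw
LIMIT 10;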

Related

Read empty string as NULL in Athena

I want to create a table in Amazon Athena over a CSV file on S3. The CSV file looks like this:
id,name,invalid
1,abc,
2,cba,y
The code for creating the table looks like this:
CREATE EXTERNAL TABLE IF NOT EXISTS {schema}.{table_name} (
id int,
name string,
invalid string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
)
LOCATION '{s3}'
TBLPROPERTIES ('has_encrypted_data'='false','compressionType'='gzip')
So, my problem is that Athena reads an empty string as an actual empty string, but I'd like to see it as NULL. I haven't found any property for that in the docs.
LazySimpleSerDe will interpret \N as NULL by default, but you can configure it to use other strings with the serialization.null.format serde property.
See this guide on CSV and Athena for more details.
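For example, a minimal sketch of the question's DDL with that property added (the {schema}, {table_name}, and {s3} placeholders are kept from the question; the empty string becomes the null marker):
CREATE EXTERNAL TABLE IF NOT EXISTS {schema}.{table_name} (
    id int,
    name string,
    invalid string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
    'serialization.format' = ',',
    'field.delim' = ',',
    'serialization.null.format' = ''
)
LOCATION '{s3}'
TBLPROPERTIES ('has_encrypted_data'='false', 'compressionType'='gzip')
With this in place, the empty third field in row 1,abc, should surface as NULL rather than an empty string.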

Hive: "Current token (VALUE_STRING) not numeric, can not use numeric value accessors" while trying to query an external table in Hive on nested JSON

Unable to query an external Hive table on a nested JSON due to:
Error: java.io.IOException: org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Current token (VALUE_STRING) not numeric, can not use numeric value accessors
The JSON looks like:
The CREATE TABLE command used:
create external table s (
    magic String,
    type String,
    headers String,
    messageSchemaId String,
    messageSchema String,
    message struct<
        data:struct<s_ID:double, s_TYPE_ID:Int, NAME:String, DESCR:String, ACTIVE_s:double, s_ID:double, s_ENABLED:Int, pharmacy_location:Int>,
        seData:struct<APPLICATION_ID:double, s_TYPE_ID:Int, NAME:String, DESCR:String, s_STAT:double, PROGRAM_ID:double, s_ENABLED:Int, s_location:Int>,
        headers:struct<operation:String, changeSequence:String, timestamp:String, streamPosition:String, transactionId:String>
    >
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/user/eh2524/pt_rpt/MGPH.APPLICATION';
For the same JSON I am able to create an external table with:
CREATE EXTERNAL TABLE `MGPH_ZT`(
`jsonstr` string)
PARTITIONED BY (
`dt` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'/user/eh2524/pt_rpt/MGPH.APPLICATION/'
TBLPROPERTIES (
'transient_lastDdlTime'='1510776187')
But to query the table created above, I have to use the json_tuple method, like this:
select count(*) from pt_rpt_stg.hvf_modules j
lateral view json_tuple(j.jsonstr, 'message') m as message
lateral view json_tuple(m.message, 'data') d as datacntr
lateral view json_tuple(d.datacntr,'l_location') s as pharmacy_location
where pharmacy_location is null;
I want to create the table using the JSON serde so that my team can query it directly, like we do for a normal Hive table; right now it fails when you query it.
What did I try:
I checked whether there were any \n characters in the JSON file, but there were none; I also tried with a single record.
I checked the table creation definition against https://community.hortonworks.com/questions/29814/how-to-load-nested-json-file-in-hive-table-using-h.html for a nested JSON, but it seems correct, as I have used the required complex data types.
The problem is that you are declaring pharmacy_location as int in your table definition, but in your sample data it is a string: "pharmacy_location": "93". If you change that in your table definition, it should work.
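A minimal sketch of the fix, assuming the rest of the DDL stays as in the question (only the affected struct is shown, and s_fixed is an illustrative table name; a query-time cast recovers the numeric value):
CREATE EXTERNAL TABLE s_fixed (
    magic String,
    message struct<
        data:struct<NAME:String, DESCR:String, pharmacy_location:String>
    >
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/user/eh2524/pt_rpt/MGPH.APPLICATION';

-- pharmacy_location is declared String to match "93"; cast when needed:
SELECT CAST(message.data.pharmacy_location AS INT) AS pharmacy_location
FROM s_fixed;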

I am trying to set the empty values in a CSV file to zero in Hive, but this code doesn't seem to work. What changes should I make?

This is the input .csv file:
"1","","Animation"
"2","Jumanji",""
"","Grumpier Old Men","Comedy"
Hive Code
CREATE TABLE IF NOT EXISTS movies(movie_id int, movie_name string,genre string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\"",
"serialization.null.format" = '0'
);
Output
1 Animation
2 Jumanji
Grumpier Old Men Comedy
Empty strings in a CSV are interpreted as empty strings, not NULLs. To represent NULL inside a delimited text file you should use \N. Hive also provides the table property serialization.null.format, which can be used to treat a character sequence of your choice as NULL in Hive SQL; in your case it should be the empty string "". To convert NULLs to zeroes, use NVL(col, 0) or COALESCE(col, 0), depending on your Hive version (COALESCE should work on all of them).
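If the empty strings still come through (OpenCSVSerde presents every column as text), a hedged sketch of a purely query-time fallback that maps both empty strings and NULLs to zero:
SELECT
    CASE WHEN movie_id IS NULL OR movie_id = '' THEN '0' ELSE movie_id END AS movie_id,
    CASE WHEN movie_name IS NULL OR movie_name = '' THEN '0' ELSE movie_name END AS movie_name,
    CASE WHEN genre IS NULL OR genre = '' THEN '0' ELSE genre END AS genre
FROM movies;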

Column names missing when exporting files using SAS data step

I have a large SAS dataset raw_data which contains data collected from various countries. This dataset has a column "country" which lists the country from which the observation originated. I would like to export a separate .csv file for each country in raw_data. I use the following data step to produce the output:
data _null_;
    set raw_data;
    length fv $ 200;
    fv = "/directory/" || strip(put(country, $32.)) || ".csv";
    file write filevar=fv dsd dlm=',';
    put (_all_) (:);
run;
However, the resulting .csv files no longer have the column names from raw_data. I have over a hundred columns in my dataset, so listing all of the column names by hand is prohibitive. Can anyone give me some guidance on how to modify the above code so that the column names are written to the exported .csv files? Any help is appreciated!
You can create a macro variable that holds the variable names and write it as the first record of each CSV file.
proc sql noprint;
    select name into :var_list separated by ", "
        from sashelp.vcolumn
        where libname="WORK" and memname='RAW_DATA'
        order by varnum;
quit;

/* raw_data must be sorted by country for BY-group processing */
data _null_;
    set raw_data;
    by country;
    length fv $ 200;
    fv = "/directory/" || strip(put(country, $32.)) || ".csv";
    /* Select the output file before writing the header row */
    file write filevar=fv dsd dlm=',';
    if first.country then put "&var_list";
    put (_all_) (:);
run;
Consider this data step, which is very similar to your program. It uses CALL VNEXT to query the PDV and writes the variable names as the first record of each file.
proc sort data=sashelp.class out=class;
    by age;
run;

data _null_;
    set class;
    by age;
    filevar = catx('\', 'C:\Users\name\Documents', catx('.', age, 'csv'));
    file dummy filevar=filevar ls=256 dsd;
    if first.age then link names;
    put (_all_)(:);
    return;

  names:
    /* Walk the PDV with CALL VNEXT, writing each variable name;
       stop when the automatic FIRST. variables are reached. */
    length _name_ $32;
    call missing(_name_);
    do while(1);
        call vnext(_name_);
        if _name_ eq: 'FIRST.' then leave;
        put _name_ @;
    end;
    put;
run;

Dealing with currency values in Pig - PigStorage

I have a 2-column CSV file loaded in HDFS. Column 1 is a model name, column 2 is a price in dollars, for example Model: IE33, Price: $52678.00.
When I run the following script, the price values all come back truncated to two digits, e.g. $52.
ultraPrice = LOAD '/user/maria_dev/UltrasoundPrice.csv' USING PigStorage(',') AS (
Model, Price);
dump ultraPrice;
All my values are between $20000 and $60000. I don't know why it is being cut off.
If I change the CSV file and remove the $ from the price values everything works fine, but I know there has to be a better way.
Note that in your LOAD statement you are not specifying the datatypes. By default, Model and Price will be of type bytearray, hence the discrepancy.
You can either remove the $ from the CSV file, or load each line as chararray, strip the $ sign, and cast the price to float:
-- Read each line whole, as a single chararray
A = LOAD '/user/maria_dev/UltrasoundPrice.csv' USING TextLoader() as (line:chararray);
-- Drop characters that are not alphanumeric, dot, comma, or whitespace (removes the $)
A1 = FOREACH A GENERATE REPLACE(line, '([^a-zA-Z0-9.,\\s]+)', '');
-- Split the cleaned line on the comma back into fields
B = FOREACH A1 GENERATE FLATTEN(STRSPLIT($0, ','));
-- Name the fields and cast Price to float
B1 = FOREACH B GENERATE $0 as Model, (float)$1 as Price;
DUMP B1;