Error in data while creating external tables in Athena - csv

I have my data in CSV format with the following schema:
Id -> tinyint
Name -> string
Id Name
1  Alex
2  Sam
When I export the CSV file to S3 and create an Athena table over it, the data turns into the following:
Id Name
1  "Alex"
2  "Sam"
How do I get rid of the double quotes when creating the table?
Any help is appreciated.

By default, when no SerDe is specified, Athena uses LazySimpleSerDe, which does not support quoted values and reads the quotes as part of the value. If your CSV file contains quoted values, use OpenCSVSerDe instead (specify the correct separatorChar if it is not a comma):
CREATE EXTERNAL TABLE mytable (
  id   tinyint,
  Name string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar'     = '\"',
  'escapeChar'    = '\\'
)
LOCATION 's3://my-bucket/mytable/';
Read the docs: https://docs.aws.amazon.com/athena/latest/ug/csv-serde.html
See also this answer about data types in OpenCSVSerDe
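With the OpenCSVSerDe table in place, a quick sanity check (a sketch against the DDL above): comparisons should now match the bare, unquoted value.
select id, name
from mytable
where name = 'Alex';  -- the SerDe strips the quotes, so this matches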

Related

Can Redshift Spectrum handle a trailing space after a field enclosed in double quotes while reading a CSV file?

I have a CSV file with fields enclosed in double quotes. I created a Redshift external table over it using OpenCSVSerDe.
The issue is that one of the rows in the file has a trailing space outside the double quotes, something like this:
"name1" ,"123","something"
"name2","234","somethingelse"
Now a SELECT on the external table returns NULL for the first column of the first row:
NULL   123  something
name2  234  somethingelse
However, the S3 Select functionality returns the proper values:
name1  123  something
name2  234  somethingelse
Is there any property at the table level which I can use to retrieve the data properly, or is this a limitation?
Table DDL:
CREATE EXTERNAL TABLE test_table
(
  column1 varchar(50),
  column2 varchar(50),
  column3 varchar(100)
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ',') -- checked with and without this
STORED AS textfile
LOCATION 's3://s3bucketlocation';
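As far as I know, OpenCSVSerDe has no table-level property for tolerating stray characters outside the quotes. One possible workaround, not from the original thread: parse the lines with RegexSerDe instead, allowing optional whitespace around the commas (a sketch; it assumes your Redshift Spectrum setup supports RegexSerDe and that every row has exactly three quoted fields):
CREATE EXTERNAL TABLE test_table_regex
(
  column1 varchar(50),
  column2 varchar(50),
  column3 varchar(100)
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- one capture group per column; \s* tolerates spaces outside the quotes
  'input.regex' = '"([^"]*)"\\s*,\\s*"([^"]*)"\\s*,\\s*"([^"]*)"'
)
STORED AS textfile
LOCATION 's3://s3bucketlocation';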

Redshift External Table not handling Linefeed character within a field

I have an external table using the Glue catalog that reads a CSV file. Fields are enclosed in double quotes if they contain a comma or an LF (line feed). I can read a field correctly as a single value when it contains the delimiter, but fields containing a line feed get split, and the remaining columns of that row come back as NULL.
I have used the SerDe row format to specify the quote character, and the plain row format delimiter with lines delimited by the line feed ASCII character as well, but so far none of it works.
CREATE EXTERNAL TABLE schema.ext_table
(
  id   varchar(18),
  name varchar(80)
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ',', 'quoteChar' = '"', 'escapeChar' = '\\')
STORED AS textfile
LOCATION 's3://path/'
TABLE PROPERTIES ('skip.header.line.count'='1');
For a file like this:
id,name,addCRLF
1,abc,defCRLF
2,"a,b",mnoCRLF
3,"aLF
b",xyzCRLF
Note that the CRLF and LF markers in the file above can be seen in tools like Notepad++.
I want the output to be like:
1  abc  def
2  a,b  mno
3  a    xyz
   b         <- this "b" needs to be in the same cell as the "a" above
But the output comes out like:
1     abc   def
2     a,b   mno
3     a     null
null  null  null
Got the official response from AWS support: Redshift Spectrum does not support embedded line breaks in a CSV file.
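If you control the pipeline, one common way around this limitation (not from the thread; a sketch that assumes the files can be converted upstream, e.g. by a Glue job, and the table name and location here are made up) is to store the data in a columnar format such as Parquet, which does not use newlines as record separators, and point the external table at the converted files:
CREATE EXTERNAL TABLE schema.ext_table_parquet
(
  id   varchar(18),
  name varchar(80)
)
STORED AS PARQUET
LOCATION 's3://path-parquet/';  -- hypothetical location for the converted files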

csv file to hive table using load data - how to format the date in the csv so the hive table accepts it

I am using the LOAD DATA syntax to load a CSV file into a table. The file is in the same format Hive accepts, but after LOAD DATA is issued, the last 2 columns return NULL on select.
1750,651,'2013-03-11','2013-03-17'
1751,652,'2013-03-18','2013-03-24'
1752,653,'2013-03-25','2013-03-31'
1753,654,'2013-04-01','2013-04-07'
create table dattable(
  DATANUM    INT,
  ENTRYNUM   BIGINT,
  START_DATE DATE,
  END_DATE   DATE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH '/path/dtatable.csv' OVERWRITE INTO TABLE dattable;
The select returns NULL values for the last 2 columns.
The other question: what if the date format is different from YYYY-MM-DD? Is it possible to make Hive recognize the format? (Right now I am modifying the CSV file into a format Hive accepts.)
Answer to your 2nd question:
You will need an additional temporary table to read your input file, and then you can do the date conversions in your INSERT ... SELECT statement. In the temporary table, store the date fields as string. Ex.:
create table dattable_ext(
  DATANUM    INT,
  ENTRYNUM   BIGINT,
  START_DATE String,
  END_DATE   String
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Load data into the temporary table:
LOAD DATA LOCAL INPATH '/path/dtatable.csv' OVERWRITE INTO TABLE dattable_ext;
Insert from the temporary table into the managed table:
insert into table dattable
select DATANUM,
       ENTRYNUM,
       from_unixtime(unix_timestamp(START_DATE,'yyyy/MM/dd'),'yyyy-MM-dd'),
       from_unixtime(unix_timestamp(END_DATE,'yyyy/MM/dd'),'yyyy-MM-dd')
from dattable_ext;
Replace the date format in the unix_timestamp function with your input date format.
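For the sample rows above, the dates are additionally wrapped in single quotes. In the SimpleDateFormat patterns that unix_timestamp uses, two single quotes stand for one literal quote, so the wrapping quotes can be consumed by the pattern itself (a small sketch, not from the original answer):
select from_unixtime(unix_timestamp("'2013-03-11'", "''yyyy-MM-dd''"), 'yyyy-MM-dd');
-- returns 2013-03-11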
LazySimpleSerDe (the default) does not work with quoted CSV. Use OpenCSVSerDe:
create table dattable(
  DATANUM    INT,
  ENTRYNUM   BIGINT,
  START_DATE DATE,
  END_DATE   DATE
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar"     = "'"
)
STORED AS TEXTFILE;
Also read this: CSVSerDe treats all columns as type String.
Define your date columns as string and apply the conversion in the select.
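For example, a minimal sketch of that approach: declare everything as string in the OpenCSVSerDe table and put a casting view on top (the names dattable_csv and dattable_typed are made up here):
create table dattable_csv(
  DATANUM    string,
  ENTRYNUM   string,
  START_DATE string,
  END_DATE   string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "'")
STORED AS TEXTFILE;

create view dattable_typed as
select cast(DATANUM as int)     as DATANUM,
       cast(ENTRYNUM as bigint) as ENTRYNUM,
       cast(START_DATE as date) as START_DATE,
       cast(END_DATE as date)   as END_DATE
from dattable_csv;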

How to handle a json file having newline characters in hive?

I'm trying to load JSON data into a Hive table. The JSON data contains newline characters, and when I load it into the table it is not inserted properly.
My Hive table creation:
CREATE EXTERNAL TABLE serde_tab(
  gender STRING,
  name   STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/user/input/text';
My JSON data:
{"gender":"femal\ne","name":"xyz"}
My Hive table data:
select * from serde_tab;
OK
serde_tab.gender serde_tab.name
femal
e xyz
Can anyone please help me out with this?
You can use the regexp_replace function to replace \n with ''.
hive> select regexp_replace(string("femal\ne"),'\n','');
+---------+--+
| _c0 |
+---------+--+
| female |
+---------+--+
(or)
Write a shell script to replace all \n newline characters with empty values ('').
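Applied to the table from the question, the same fix works at query time (or can be wrapped in a view):
select regexp_replace(gender, '\n', '') as gender,
       name
from serde_tab;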

Load csv file to Hive Table

I have a csv file which has contents like this.
"DepartmentID","Name","GroupName","ModifiedDate"
"1","Engineering","Research and Development","2008-04-30 00:00:00"
I have:
create external table if not exists AdventureWorks2014.Department
(
  DepartmentID smallint,
  Name string,
  GroupName string,
  rate_code string,
  ModifiedDate timestamp
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '","' lines terminated by '\n'
STORED AS TEXTFILE LOCATION 'wasb:///ds/Department' TBLPROPERTIES('skip.header.line.count'='1');
And after loading the data:
LOAD DATA INPATH 'wasb:///ds/Department.csv' INTO TABLE AdventureWorks2014.Department;
the data is not loaded.
select * from AdventureWorks2014.Department;
The select above returns nothing.
I think the double quotes around each field are the issue. Is there a way to load the data from such a file into Hive tables without having to strip out the double quotes?
Try this:
create external table if not exists AdventureWorks2014.Department
(
  DepartmentID smallint,
  Name string,
  GroupName string,
  rate_code string,
  ModifiedDate timestamp
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION 'wasb:///ds/Department';
** Limitation **
This SerDe treats all columns as type String. Even if you create a table with non-string column types using this SerDe, the DESCRIBE TABLE output will show string column types; the type information is retrieved from the SerDe. To convert columns to the desired type, you can create a view over the table that CASTs to the desired types.
https://cwiki.apache.org/confluence/display/Hive/CSV+Serde
FIELDS TERMINATED BY '","' is incorrect. Your fields are terminated by a comma (,), not by ",". Change your DDL to FIELDS TERMINATED BY ','.
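With that change, the DDL from the question becomes the following sketch (note that the double quotes will then remain part of the values, which is exactly what the OpenCSVSerDe answer above addresses):
create external table if not exists AdventureWorks2014.Department
(
  DepartmentID smallint,
  Name string,
  GroupName string,
  rate_code string,
  ModifiedDate timestamp
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' lines terminated by '\n'
STORED AS TEXTFILE LOCATION 'wasb:///ds/Department'
TBLPROPERTIES ('skip.header.line.count'='1');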
LOAD DATA LOCAL INPATH '/home/hadoop/hive/log_2013805_16210.log' INTO TABLE table_name;