Redshift External Table not handling Linefeed character within a field - csv

I have an external table using the Glue catalog to read a CSV file. Fields are enclosed in double quotes if they contain a comma or an LF (line feed). A field containing the delimiter is read correctly as a single value, but fields containing a line feed get split, and the remaining columns in that row come back as NULL.
I have used the SerDe row format to specify the quote character, and I have also tried the plain ROW FORMAT DELIMITED clause with lines terminated by the line-feed ASCII character. So far, neither approach works.
CREATE EXTERNAL TABLE schema.ext_table
(
    id varchar(18),
    name varchar(80)
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ',', 'quoteChar' = '"', 'escapeChar' = '\\')
STORED AS textfile
LOCATION 's3://path/'
TABLE PROPERTIES ('skip.header.line.count'='1')
;
For a file like this:
id,name,addCRLF
1,abc,defCRLF
2,"a,b",mnoCRLF
3,"aLF
b",xyzCRLF
Please note that the CRLF and LF markers above stand for the actual line-ending characters in the file, visible in tools like Notepad++.
I want the output to be like:
1 abc def
2 a,b mno
3 a xyz
b (this "b" needs to be in the same cell as the "a" above)
But the output actually comes out like this:
1 abc def
2 a,b mno
3 a null
null null null

I got the official response from AWS Support: Redshift Spectrum doesn't support embedded line breaks in a CSV file.
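Since Spectrum itself can't be configured around this, the practical route is to flatten the line breaks before the file lands in S3. A minimal preprocessing sketch, assuming Python and its standard csv module (file names are placeholders): the quote-aware reader keeps a multi-line value in one field, and the embedded breaks are replaced with spaces, so the "a"/"b" value above would come back as "a b" in a single cell.

import csv

# Cleanup sketch (file names are placeholders). csv.reader is quote-aware,
# so a field containing a line break arrives as one value; the embedded
# breaks are then replaced with spaces before the file is rewritten.
with open("input.csv", newline="") as infile, \
        open("cleaned.csv", "w", newline="") as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for row in reader:
        writer.writerow([f.replace("\r\n", " ").replace("\n", " ").replace("\r", " ")
                         for f in row])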

Related

Impala create external table prevent parsing of newline in double quotes

Impala version: impalad version 4.0.0.2022.0.11.0-122
I have a CSV in S3 with a field that contains newlines, but the field is wrapped in double quotes. The CSV itself is well formed and the quoted newlines are fine, but when I query the table created by the CREATE statement below, Impala treats each newline as an actual row terminator rather than part of the field value, which mangles the structure of the ingested data.
What can I do to ensure that newlines inside double-quoted field values are ignored by Impala rather than treated as row breaks?
CSV: (sample file was shown as an image in the original post)
SQL CREATE statement:
CREATE EXTERNAL TABLE IF NOT EXISTS schema_name.table_name (
    `week` VARCHAR(10),
    notes STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
-- ESCAPED BY '"' -- tried this, didn't work
STORED AS TEXTFILE
LOCATION 's3a://bucket_name/folder_name/'
TBLPROPERTIES("skip.header.line.count"="1")
-- Also tried this (get syntax error, also tried without ROW FORMAT keywords):
-- ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( "separatorChar" = ",", "quoteChar" = """ )
Table: (query result was shown as an image in the original post)
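No answer was posted here, but text-format tables in Impala have the same limitation as Spectrum above: rows are split on every newline, regardless of quoting. One workaround is to rewrite the CSV into a format that preserves structure, such as Parquet. A minimal sketch, assuming Python with pandas and pyarrow installed (file names are placeholders):

import pandas as pd

# pandas' CSV parser is quote-aware, so multi-line values survive the
# round-trip; Parquet then stores them without any row-splitting issue.
df = pd.read_csv("notes.csv", quotechar='"')
df.to_parquet("notes.parquet", index=False)

The Impala table would then be declared with STORED AS PARQUET over the rewritten file.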

Error in data while creating external tables in Athena

I have my data in CSV format in the below form:
Id -> tinyint
Name -> String
Id Name
1 Alex
2 Sam
When I export the CSV file to S3 and create an Athena table over it, the data comes back in the following format.
Id Name
1 "Alex"
2 "Sam"
How do I get rid of the double quotes while creating the table?
Any help is appreciated.
By default, if no SerDe is specified, Athena uses LazySimpleSerDe, which does not support quoted values and reads the quotes as part of the value. If your CSV file contains quoted values, use OpenCSVSerDe (specify the correct separatorChar if it is not a comma):
CREATE EXTERNAL TABLE mytable(
    id tinyint,
    Name string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
LOCATION 's3://my-bucket/mytable/'
;
Read the manuals: https://docs.aws.amazon.com/athena/latest/ug/csv-serde.html
See also this answer about data types in OpenCSVSerDe

Can Redshift Spectrum handle trailing space after a field enclosed in double quotes while reading csv file?

I have a csv file with fields enclosed in double quotes. I created a Redshift external table over it using OpenCSVSerDe.
The issue is one of my rows in the file has a trailing space outside the double quotes. Something like this:
"name1" ,"123","something"
"name2","234","somethingelse"
Now a SELECT on the external table returns NULL for the first column of the first row:
123 something
name2 234 somethingelse
However, the S3 SELECT functionality is returning proper values as below:
name1 123 something
name2 234 somethingelse
Is there any property at the table level which I can use to properly retrieve the data, or is this a limitation?
Table DDL:
CREATE EXTERNAL TABLE test_table
(
    column1 varchar(50)
    ,column2 varchar(50)
    ,column3 varchar(100)
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ',') --checked with and without this
STORED AS textfile
LOCATION 's3://s3bucketlocation'
;
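There doesn't appear to be a table-level property for this, but the file can be normalized before Spectrum reads it. A minimal sketch, assuming Python's standard csv module (file names are placeholders): the lenient parser keeps the stray space after the closing quote as part of the field, so stripping each value and re-quoting repairs the row.

import csv

# Cleanup sketch: strip whitespace around every field and re-quote,
# so '"name1" ,' becomes '"name1",' and OpenCSVSerDe can parse it again.
with open("input.csv", newline="") as infile, \
        open("cleaned.csv", "w", newline="") as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile, quoting=csv.QUOTE_ALL)
    for row in reader:
        writer.writerow([field.strip() for field in row])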

Hive imports data from csv into incorrect columns in table

Below is my table creation and a sample from my csv:
DROP TABLE IF EXISTS xxx.fbp;
CREATE TABLE IF NOT EXISTS xxx.fbp (id bigint, p_name string, h_name string, ufi int, city string, country string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
74905,xxx,xyz,-5420642,City One,France
74993,xxx,zyx,-874432,City,Germany
75729,xxx,yzx,-1284248,City Two Long Name,France
I then load the data into a hive table with the following query:
LOAD DATA
INPATH '/user/xxx/hdfs_import/fbp.csv'
INTO TABLE xxx.fbp;
It seems that there is data leaking from the 5th csv "column" into the 6th column of the table. So, I'm seeing city data in my country column.
SELECT country, count(country) from xxx.fbp group by country
+---------+------+
| country | _c1  |
+---------+------+
| Germany | 1143 |
| City    | 1    |
+---------+------+
I'm not sure why city data is occasionally being imported to the country column. The csv is downloaded from Google Sheets and I've removed the header.
The reason could be that your line termination is not '\n'; Windows-based tools add additional characters (CRLF), which creates issues. It may also be that some fields contain the column separator, which would cause this.
Solution:
1. Print the lines that have the issue with a WHERE country = 'City' clause; this will give you some idea of how Hive created the records.
2. Try a binary storage format to be 100% sure about the data processed by Hive.
Hope it helps.
The issue was within the CSV itself. Some columns, such as p_name, contained , in several fields, which caused rows to split into more columns than expected and shifted values to the right. I had to clean the data and remove all commas; after that, it imported correctly. Quickly done with Python:
with open("fbp.csv") as infile, open("outfile.csv", "w") as outfile:
for line in infile:
outfile.write(line.replace(",", ""))
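If the commas need to be preserved, an alternative sketch (same caveats: file names and the pipe delimiter are assumptions) is to parse the file with Python's quote-aware csv reader and rewrite it with a separator that never occurs in the data; the Hive DDL would then use FIELDS TERMINATED BY '|':

import csv

# Google Sheets quotes fields that contain commas; csv.reader honors the
# quotes, so each row comes back with the right column count. Rewriting
# with '|' (assumed absent from the data) keeps the commas in the values.
with open("fbp.csv", newline="") as infile, \
        open("outfile.csv", "w", newline="") as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile, delimiter="|", quoting=csv.QUOTE_NONE)
    for row in reader:
        writer.writerow(row)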

How to handle json file having new line characters in hive?

I'm trying to load JSON data into a Hive table. The JSON contains newline characters, and when I try to load it, it is not inserted properly.
My hive table creation:
CREATE EXTERNAL TABLE serde_tab(
gender STRING, name STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/user/input/text' ;
My json data:
{"gender":"femal\ne","name":"xyz"}
My hive table data:
select * from serde_tab;
OK
serde_tab.gender serde_tab.name
femal
e xyz
Can anyone please help me out with this?
You can use the regexp_replace function to replace \n with ''.
hive> select regexp_replace(string("femal\ne"),'\n','');
+---------+--+
|   _c0   |
+---------+--+
| female  |
+---------+--+
Alternatively, write a script to replace all \n newline characters with empty strings ('').
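For that script, a minimal sketch in Python (file names are placeholders; one JSON object per line is assumed): decode each record, strip the newlines out of every string value, and write it back.

import json

with open("input.json") as infile, open("cleaned.json", "w") as outfile:
    for line in infile:
        record = json.loads(line)
        # Drop embedded newlines from every string value in the record
        cleaned = {k: v.replace("\n", "") if isinstance(v, str) else v
                   for k, v in record.items()}
        outfile.write(json.dumps(cleaned) + "\n")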