Impala create external table: prevent parsing of newlines inside double quotes - CSV

Impala version: impalad version 4.0.0.2022.0.11.0-122
I have a CSV in S3 that contains a field with embedded newlines, but the field is wrapped in double quotes. A CSV-aware reader handles those newlines correctly, yet when I issue the CREATE statement in Impala, the table treats each newline as a row terminator instead of as part of the field value, which breaks the structure of the ingested data.
What can I do to ensure that newlines inside field values wrapped in double quotes are treated as part of the value in the Impala table?
CSV:
SQL CREATE statement:
CREATE EXTERNAL TABLE IF NOT EXISTS schema_name.table_name (
    `week` VARCHAR(10),
    notes STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
-- ESCAPED BY '"' -- tried this, didn't work
STORED AS TEXTFILE
LOCATION 's3a://bucket_name/folder_name/'
TBLPROPERTIES("skip.header.line.count"="1")
-- Also tried this (get syntax error, also tried without ROW FORMAT keywords):
-- ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( "separatorChar" = ",", "quoteChar" = """ )
Table:
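Impala's plain-text scanner splits rows on every newline before any quote handling happens, and Impala does not support pluggable SerDes such as OpenCSVSerde, so no ROW FORMAT variation will keep a quoted multi-line field together. A minimal sketch of one workaround, assuming a Spark job (e.g., a Glue or CDP Spark step) can reach the bucket: re-parse the file with Spark's multi-line CSV reader and rewrite it as Parquet, which Impala reads natively. The names notes_csv, table_name_parquet, and folder_name_parquet below are placeholders.

-- Spark SQL: parse the quoted, multi-line CSV and rewrite it as Parquet
CREATE TEMPORARY VIEW notes_csv
USING csv
OPTIONS (
    path 's3a://bucket_name/folder_name/',
    header 'true',
    multiLine 'true',  -- lets quoted fields span line breaks
    quote '"',
    escape '"'
);

CREATE TABLE schema_name.table_name_parquet
USING parquet
LOCATION 's3a://bucket_name/folder_name_parquet/'  -- placeholder output path
AS SELECT * FROM notes_csv;

After the job runs, issue INVALIDATE METADATA schema_name.table_name_parquet in Impala and query the Parquet table instead of the raw CSV.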

Related

Error in data while creating external tables in Athena

I have my data in CSV format in the form below:
Id -> tinyint
Name -> String
Id Name
1 Alex
2 Sam
When I export the CSV file to S3 and create an Athena table, the data is transformed into the following format.
Id Name
1 "Alex"
2 "Sam"
How do I get rid of the double quotes while creating the table?
Any help is appreciated.
By default, if no SerDe is specified, Athena uses LazySimpleSerDe, which does not support quoted values and reads the quotes as part of the value. If your CSV file contains quoted values, use OpenCSVSerde (and specify the correct separatorChar if it is not a comma):
CREATE EXTERNAL TABLE mytable (
    id tinyint,
    Name string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    'separatorChar' = ',',
    'quoteChar' = '\"',
    'escapeChar' = '\\'
)
LOCATION 's3://my-bucket/mytable/';
Read the manuals: https://docs.aws.amazon.com/athena/latest/ug/csv-serde.html
See also this answer about data types in OpenCSVSerDe
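Since OpenCSVSerDe hands every column to the engine as a string, and Athena's built-in coercion on top of it has caveats (see the linked answer), a view with explicit CASTs is a safe fallback. A sketch, with mytable_typed as a hypothetical name:

CREATE VIEW mytable_typed AS
SELECT CAST(id AS tinyint) AS id,  -- recover the intended type
       Name
FROM mytable;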

Redshift External Table not handling Linefeed character within a field

I have an external table using the Glue catalog and reading a CSV file. Fields are enclosed in double quotes if they contain a comma or an LF (line feed). I can read a field correctly as a single value when it contains a delimiter, but fields containing a line feed get split, and all columns after them come through as NULL.
I have used the SerDe row format to specify the quote character, and I have also tried the plain delimited row format with lines terminated by the line-feed ASCII character, but so far neither works.
CREATE EXTERNAL TABLE schema.ext_table (
    id varchar(18),
    name varchar(80)
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ',', 'quoteChar' = '"', 'escapeChar' = '\\')
STORED AS textfile
LOCATION 's3://path/'
TABLE PROPERTIES ('skip.header.line.count'='1');
For a file like this:
id,name,addCRLF
1,abc,defCRLF
2,"a,b",mnoCRLF
3,"aLF
b",xyzCRLF
Please note that the CRLF and LF markers in the file above can be seen with tools like Notepad++.
I want the output to be like:
1 abc def
2 a,b mno
3 a xyz
b    (this b needs to be in the same cell as the a above)
But the output is coming out like this:
1 abc def
2 a,b mno
3 a null
null null null
I got the official response from AWS support: Redshift Spectrum doesn't support embedded line breaks in a CSV file.
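If the line breaks can't be removed upstream, one common sidestep is to bypass Spectrum for this file and COPY it into an ordinary Redshift table, since COPY in CSV mode honors quoted fields (including, reportedly, embedded line breaks; verify against your data). A sketch, with a hypothetical staging table and IAM role:

CREATE TABLE schema.stg_table (
    id varchar(18),
    name varchar(80)
);

COPY schema.stg_table
FROM 's3://path/'
IAM_ROLE 'arn:aws:iam::111122223333:role/MyCopyRole'  -- hypothetical role
CSV
IGNOREHEADER 1;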

Load csv file to Hive Table

I have a csv file which has contents like this.
"DepartmentID","Name","GroupName","ModifiedDate"
"1","Engineering","Research and Development","2008-04-30 00:00:00"
I have
create external table if not exists AdventureWorks2014.Department
(
    DepartmentID smallint,
    Name string,
    GroupName string,
    rate_code string,
    ModifiedDate timestamp
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '","' lines terminated by '\n'
STORED AS TEXTFILE LOCATION 'wasb:///ds/Department' TBLPROPERTIES('skip.header.line.count'='1');
And after loading the data
LOAD DATA INPATH 'wasb:///ds/Department.csv' INTO TABLE AdventureWorks2014.Department;
The data is not loaded.
select * from AdventureWorks2014.Department;
The above select returns nothing.
I think the double quotes around each field are the issue. Is there a way to load the data from such a file into Hive tables without having to strip out the double quotes?
Try this (typed on a cellphone, so the formatting is rough):
create external table if not exists AdventureWorks2014.Department (
    DepartmentID smallint, Name string, GroupName string,
    rate_code string, ModifiedDate timestamp
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION 'wasb:///ds/Department';
** Limitation **
This SerDe treats all columns as type String. Even if you create a table with non-string column types using this SerDe, the DESCRIBE TABLE output will show string column types, because the type information is retrieved from the SerDe. To convert columns to the desired types, you can create a view over the table that CASTs to those types.
https://cwiki.apache.org/confluence/display/Hive/CSV+Serde
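Putting that together with the header row in the posted file, a fuller sketch might look like this; per the limitation above, the columns are declared as string, and the SERDEPROPERTIES spelled out below are just the SerDe's defaults:

create external table if not exists AdventureWorks2014.Department (
    DepartmentID string,
    Name string,
    GroupName string,
    rate_code string,
    ModifiedDate string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = ',', 'quoteChar' = '"')
STORED AS TEXTFILE
LOCATION 'wasb:///ds/Department'
TBLPROPERTIES ('skip.header.line.count' = '1');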
FIELDS TERMINATED BY '","' is incorrect. Your fields are terminated by ',', not '","'. Change your DDL to FIELDS TERMINATED BY ','.
LOAD DATA LOCAL INPATH '/home/hadoop/hive/log_2013805_16210.log' INTO TABLE table_name;

Importing Scalding-produced CSV into MySQL

I produced a CSV file using Scalding's default Csv writer (specifying only the p parameter for the output path, and none of the other parameters that control how the CSV data is written), and I am looking to import it into MySQL. I am running into a problem on the import.
Example queries to load the data:
CREATE TABLE `example_table` (
    `a` varchar(255) DEFAULT NULL,
    `b` varchar(255) DEFAULT NULL,
    `c` varchar(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

LOAD DATA LOCAL INFILE '~/example.csv'
INTO TABLE example_table
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(a,b,c);
Example data (i.e. ~/example.csv):
row1,this is neither quoted nor enclosed,"This is quoted, and contains a comma"
row2,"this is enclosed","This is quoted, and contains a comma"
row3,""this"" is quoted at the start,"This is quoted, and contains a comma"
row4,"""this"" is quoted at the start and enclosed","This is quoted, and contains a comma"
When I run the queries with the data file, the resulting table is (excuse the formatting, I can't figure out how to make a table nicely here):
row1|this is neither quoted nor enclosed|This is quoted, and contains a comma
row2|this is enclosed|This is quoted, and contains a comma
row3|"this" is quoted at the start,"This is quoted, and contains a comma|NULL
row4|"this" is quoted at the start and enclosed|This is quoted, and contains a comma
Row 3 is malformed; it is how Scalding outputs CSV when a field equals "this" is quoted at the start (i.e. the value has quotes at the beginning and does not contain the field delimiter; if it did contain the delimiter, it would look like row 4).
Is there a way, by adjusting the FIELDS TERMINATED BY, OPTIONALLY ENCLOSED BY, etc. options in MySQL, to get it to import the fields correctly?
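Row 3 is not valid CSV by MySQL's rules (an unenclosed field that begins with the enclosure character), so no combination of those options will round-trip it. If you control the Scalding job, one sidestep (an assumption, since it changes the producer side) is to write tab-separated output with Tsv instead of Csv, so quoting never enters the picture, then load it with:

LOAD DATA LOCAL INFILE '~/example.tsv'  -- hypothetical TSV rewrite of the data
INTO TABLE example_table
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
(a, b, c);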

Error loading CSV with MySQL LOAD DATA INFILE

LOAD DATA LOCAL INFILE '/local/home/rep/saloncodeforde.csv' INTO TABLE account_code
My table has 3 columns, as does my CSV (id int, zipc varchar, and ph varchar).
The result is OK for the id column, but for zipc and ph I get NULL.
Try using more complete syntax. Reference link: http://dev.mysql.com/doc/refman/5.0/en/load-data.html
For instance:
LOAD DATA LOCAL INFILE '/local/home/rep/saloncodeforde.csv'
INTO TABLE account_code
FIELDS TERMINATED BY ' ';
There are also parameters for end-of-line terminators, escape characters, and optional field enclosures (such as quotes around strings with spaces, though I doubt you have those in your data as described).
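If the file is in fact comma-separated with quoted strings, the fuller form would look something like this (assuming Unix \n line endings; use \r\n for a Windows-produced file):

LOAD DATA LOCAL INFILE '/local/home/rep/saloncodeforde.csv'
INTO TABLE account_code
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(id, zipc, ph);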