Presto (Athena) loading of a CSV file with quote-escaped commas

Consider the following row in a CSV file:
1,0,True,"{""foo"":null,""bar"":null}",0,1
The comma after the first null inside the quoted JSON is part of a column value. That is, this full text: "{""foo"":null,""bar"":null}" is the value of a single column. However, AWS Athena interprets that comma as a column-delimiting comma, incorrectly splitting the text into multiple columns.
I know I could change the column delimiter to something else to avoid this problem. My question is: Is this a bug in AWS Athena / Presto? How can I escape these commas?

If your data is enclosed in double quotes, you need to use the OpenCSVSerDe.
For the sample data above, the following table definition works:
CREATE EXTERNAL TABLE `extra_comma`(
  `a` string COMMENT 'from deserializer',
  `b` string COMMENT 'from deserializer',
  `c` string COMMENT 'from deserializer',
  `d` string COMMENT 'from deserializer',
  `e` string COMMENT 'from deserializer',
  `f` string COMMENT 'from deserializer'
)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://aws-glue-stackoverflow/comma_in_data/'
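As a quick sanity check (a sketch, assuming the sample file is in that S3 location), selecting the fourth column should return the quoted JSON as one value rather than a fragment:

SELECT d FROM extra_comma LIMIT 10;
-- expected: {"foo":null,"bar":null} returned as a single string, not split at the inner comma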


Hive load data OpenCSVSerde comment control

How to reproduce the problem:
Create a table using the Hive CREATE TABLE statement, for example:
create table `db`.`table`(
  `field1` string,
  `field2` string,
  `field3` string
) row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde';
Load a CSV file using the Hive LOAD DATA statement, for example:
load data local inpath 'my/file/path' overwrite into table `db`.`table`;
The table schema will be:
field1 string from deserializer
field2 string from deserializer
field3 string from deserializer
Notice that the comment of every field is 'from deserializer'.
My question:
How can I get rid of this comment or customize it?
Set the hive.serdes.using.metastore.for.schema=org.apache.hadoop.hive.serde2.OpenCSVSerde parameter in your Hive shell; then you can have an empty comment or a custom comment.
hive> set hive.serdes.using.metastore.for.schema=org.apache.hadoop.hive.serde2.OpenCSVSerde;
hive> create table `db`.`table`(
`field1` string COMMENT '',
`field2` string COMMENT 'field2 comment',
`field3` string COMMENT 'field3 comment'
) row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde';
hive> desc formatted `db`.`table`;
OK
# col_name data_type comment
field1 string
field2 string field2 comment
field3 string field3 comment

Set value=' '(empty String) where the field value is NULL (multiple columns)

I have a table that contains lots of columns. Many columns contain NULL values. But when I try to create a CSV file from it, NULL is replaced by '\N'. I want to set all column values to an empty string instead of NULL so that I don't have to deal with '\N'.
You can use the FIELDS ESCAPED BY clause along with a SELECT ... INTO OUTFILE query and specify an empty string or whitespace (' ') to get empty values instead of \N, e.g.:
SELECT * INTO OUTFILE 'file'
  FIELDS ESCAPED BY ''
  LINES TERMINATED BY '\n'
FROM test_table;
Here's MySQL's documentation about it, this is what it says:
If the FIELDS ESCAPED BY character is empty, no characters are escaped
and NULL is output as NULL, not \N.
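For an actual CSV you usually also want a field separator and quoting. A sketch combining those export options (the path /tmp/test_table.csv is only an example; the server must be allowed to write there, e.g. under secure_file_priv):

SELECT *
INTO OUTFILE '/tmp/test_table.csv'
  FIELDS TERMINATED BY ','
  OPTIONALLY ENCLOSED BY '"'
  ESCAPED BY ''
  LINES TERMINATED BY '\n'
FROM test_table;
-- with ESCAPED BY '', NULL values are written as described in the documentation quoted above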
You can update all the columns and set them to an empty string instead of NULL, if you don't want to keep the NULLs.
If you do want to keep NULL in the DB, export into a temporary table and apply the logic below there.
First update the column's default value, then replace the NULL values:
ALTER TABLE `table_name` CHANGE `column_name` `column_name` varchar(length) DEFAULT '';
UPDATE `table_name` SET `column_name` = '' WHERE `column_name` IS NULL;
Hope this helps.
You can use the ResultSet metadata: iterate through all the columns and check for NULL values.
ResultSet rsMain = stmt.executeQuery(selectQuery);
ResultSetMetaData rsmd = rsMain.getMetaData();
while (rsMain.next()) {
    for (int i = 1; i <= rsmd.getColumnCount(); i++) {
        String value = rsMain.getString(i); // the column must be read before wasNull() is meaningful
        if (rsMain.wasNull()) {
            sb.append("");                  // write an empty string for NULL
        } else {
            sb.append(value);
        }
        sb.append(',');                     // sb is the string buffer that builds the CSV
    }
}
You can update the table like this:
UPDATE yourTable
SET col1 = IFNULL(col1, ''),
col2 = IFNULL(col2, ''),
col3 = IFNULL(col3, ''),
...
Once you've done this, you should probably change the schema so that nulls aren't allowed and the default value is '', so you don't have to go through this again.
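As a sketch of that schema change (the column names and the VARCHAR(255) type below are placeholders, not taken from the original table):

ALTER TABLE yourTable
  MODIFY col1 VARCHAR(255) NOT NULL DEFAULT '',
  MODIFY col2 VARCHAR(255) NOT NULL DEFAULT '',
  MODIFY col3 VARCHAR(255) NOT NULL DEFAULT '';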

MySQL error with CAST and concatenation

Consider the below code
UPDATE users SET class = '-' + CAST(class AS CHAR(50)) + '-' WHERE 1=1
The above query throws the following error:
#1292 - Truncated incorrect DOUBLE value: '-'
I recently updated the class column from int(11) to varchar(255) and am trying to update every row to a format of: -class- where class is the previous int value.
The error occurs because + is numeric addition in MySQL, so the string '-' is being converted to a number. If you have already updated the column from int to varchar, there is no point casting it; just use CONCAT:
UPDATE users SET class = CONCAT('-', class, '-') WHERE 1=1;
If class is NULL, the result will be NULL, so to make sure you only update rows where class has a value:
UPDATE users SET class = CONCAT('-', class, '-') WHERE class IS NOT NULL;
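If you would rather wrap the NULL rows as well (producing '--' for them), a possible variant, assuming that behaviour is acceptable, substitutes an empty string with COALESCE:

UPDATE users SET class = CONCAT('-', COALESCE(class, ''), '-');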

output hive query result as csv enclosed in quotes

I have to export data from a Hive table to a CSV file in which fields are enclosed in double quotes.
So far I am able to generate a CSV without quotes using the following query:
INSERT OVERWRITE DIRECTORY '/user/vikas/output'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
SELECT QUERY
The output generated looks like
1,Vikas Saxena,Banking,JL5
However, I need the output as
"1","Vikas Saxena","Banking","JL5"
I tried changing the query to
INSERT OVERWRITE DIRECTORY '/user/vikas/output'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\"",
"escapeChar" = "\\"
)
SELECT QUERY
But it fails with the error:
Error while compiling statement: FAILED: ParseException line 1:0 cannot recognize input near 'ROW' 'FORMAT' 'SERDE'
Create an external table:
CREATE EXTERNAL TABLE new_table(field1 type1, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\""
)
STORED AS TEXTFILE
LOCATION '/user/vikas/output';
Then select into that table:
insert into new_table select * from original_table;
Your CSV is then on disk at /user/vikas/output
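For the sample output shown in the question, a concrete version of that table might look like the following (the column names id, name, department and grade are hypothetical, chosen only to match the four fields of the example row):

CREATE EXTERNAL TABLE quoted_export(
  id string,
  name string,
  department string,
  grade string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar" = "\""
)
STORED AS TEXTFILE
LOCATION '/user/vikas/output';

insert into quoted_export select * from original_table;

Note that OpenCSVSerde treats every column as a string, so non-string columns of original_table are written out as their string representation.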

MySQL string quotation for REPLACE()

I have a problem using the REPLACE() function on specific data. It doesn't match the string occurrence that it should replace.
The string I want to replace is the following.
s:54:"Doctrine\Common\Collections\ArrayCollection_elements
It is stored in the following field
`definitions` longtext COLLATE utf8_unicode_ci NOT NULL COMMENT '(DC2Type:object)',
Here is the LIKE query, which matches all rows that contain the string (notice the \0 in the string):
SELECT `definitions`
FROM `entity_type`
WHERE `definitions` LIKE '%s:54:"\0Doctrine\\\\Common\\\\Collections\\\\ArrayCollection\0_elements%'
At the same time, when I run the following query I get a '0 rows affected' message and nothing is replaced:
UPDATE `entity_type`
SET `definitions` = REPLACE(
`definitions`,
's:54:"\0Doctrine\\\\Common\\\\Collections\\\\ArrayCollection\0_elements',
's:53:"\0Doctrine\\\\Common\\\\Collections\\\\ArrayCollection\0elements'
);
How should I modify the string to make REPLACE() match the text I need and replace it?
PS: Please don't blame me for what I'm trying to replace. it is not my fault :-)
If your "where condition" works, you can try with:
UPDATE `entity_type`
SET `definitions` = REPLACE(REPLACE(
`definitions`,
's:54:',
's:53:'),'ArrayCollection_elements','ArrayCollectionelements')
where `definitions` LIKE '%s:54:"\0Doctrine\\\\Common\\\\Collections\\\\ArrayCollection\0_elements%';
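The likely reason the original single REPLACE matched nothing is backslash escaping: in a LIKE pattern the backslash is processed twice (once by the string parser, once by the pattern matcher), so \\\\ matches one literal backslash, whereas REPLACE() does a plain string comparison, so \\\\ in the literal becomes two backslashes and never matches the data. Assuming that is the cause, a single REPLACE with half the backslashes should work; this is a sketch, not a tested fix:

UPDATE `entity_type`
SET `definitions` = REPLACE(
    `definitions`,
    's:54:"\0Doctrine\\Common\\Collections\\ArrayCollection\0_elements',
    's:53:"\0Doctrine\\Common\\Collections\\ArrayCollection\0elements'
);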