Hive load data OpenCSVSerde comment control - csv

How to reproduce the problem:
Create a table using a Hive CREATE TABLE statement, such as:
create table `db`.`table`(
`field1` string,
`field2` string,
`field3` string
) row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde';
Load a CSV file using the Hive LOAD DATA statement, such as:
load data local inpath 'my/file/path' overwrite into table `db`.`table`;
The table schema will be:
field1 string from deserializer
field2 string from deserializer
field3 string from deserializer
Notice that the comment of every field is 'from deserializer'.
My question:
How can I get rid of this comment, or customize it?

Set the hive.serdes.using.metastore.for.schema=org.apache.hadoop.hive.serde2.OpenCSVSerde parameter in your Hive shell (it tells Hive to read the schema, including column comments, for this serde from the metastore instead of from the serde itself); then you can have an empty comment or a custom comment:
hive> set hive.serdes.using.metastore.for.schema=org.apache.hadoop.hive.serde2.OpenCSVSerde;
hive> create table `db`.`table`(
`field1` string COMMENT '',
`field2` string COMMENT 'field2 comment',
`field3` string COMMENT 'field3 comment'
) row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde';
hive> desc formatted `db`.`table`;
OK
# col_name data_type comment
field1 string
field2 string field2 comment
field3 string field3 comment
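If the table already exists, you may also be able to set a comment afterwards with ALTER TABLE ... CHANGE COLUMN. A hedged sketch (with the setting above enabled, the comment should then be read from the metastore, but verify this on your Hive version):
hive> set hive.serdes.using.metastore.for.schema=org.apache.hadoop.hive.serde2.OpenCSVSerde;
hive> alter table `db`.`table` change column `field1` `field1` string comment 'custom comment for field1';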

Related

Presto (Athena) loading of a CSV file with quote-escaped commas

Consider the following row in a CSV file:
1,0,True,"{""foo"":null,""bar"":null}",0,1
The comma inside the quoted JSON value (between the two null fields) is part of a column. That is, this full text: "{""foo"":null,""bar"":null}" is the value of a single column. However, AWS Athena is interpreting that comma as a column-delimiting comma, incorrectly splitting the text into multiple columns.
I know I could change the column delimiter to something else to avoid this problem. My question is: Is this a bug in AWS Athena / Presto? How can I escape these commas?
If your data is enclosed in double quotes, you need to use OpenCSVSerDe.
For the sample data
1,0,True,"{""foo"":null,""bar"":null}",0,1
the following table definition works:
CREATE EXTERNAL TABLE `extra_comma`(
`a` string COMMENT 'from deserializer',
`b` string COMMENT 'from deserializer',
`c` string COMMENT 'from deserializer',
`d` string COMMENT 'from deserializer',
`e` string COMMENT 'from deserializer',
`f` string COMMENT 'from deserializer'
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://aws-glue-stackoverflow/comma_in_data/'
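If you need to override the delimiter, quote, or escape character, OpenCSVSerDe also accepts explicit serde properties. A minimal sketch (the table name is a hypothetical variant of the one above, and the property values shown are the serde's defaults):
CREATE EXTERNAL TABLE `extra_comma_explicit`(
`a` string, `b` string, `c` string, `d` string, `e` string, `f` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
-- defaults shown; change these if your file uses different characters
'separatorChar' = ',',
'quoteChar' = '"',
'escapeChar' = '\\'
)
STORED AS TEXTFILE
LOCATION 's3://aws-glue-stackoverflow/comma_in_data/';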

Set value=' ' (empty string) where the field value is NULL (multiple columns)

I have a table that contains lots of columns, and many of them contain NULL values. But when I try to create a CSV file from it, NULL is replaced with '\N'. I want to set all the column values to an empty string instead of NULL so that I don't have to deal with '\N'.
You can use the FIELDS ESCAPED BY clause of SELECT ... INTO OUTFILE and specify an empty string to get empty values instead of \N, e.g.:
SELECT * INTO OUTFILE '/path/to/file.csv'
FIELDS TERMINATED BY ',' ESCAPED BY ''
LINES TERMINATED BY '\n'
FROM test_table;
Here's what MySQL's documentation says about it:
If the FIELDS ESCAPED BY character is empty, no characters are escaped
and NULL is output as NULL, not \N.
If you don't want to keep NULLs, you can update all the columns and set them to an empty string instead of NULL.
If you do want to keep NULL in the DB, export into a temp table first and apply the logic below there.
First update the column definition so that its default value is an empty string, then replace the existing NULLs:
ALTER TABLE `table_name` CHANGE `column_name` `column_name` varchar(length) DEFAULT '';
UPDATE `table_name` SET `column_name` = '' WHERE `column_name` IS NULL;
Hope this helps.
You can use ResultSet metadata: iterate through all the columns and check for null values.
ResultSet rsMain = stmt.executeQuery(selectQuery);
ResultSetMetaData rsmd = rsMain.getMetaData();
while (rsMain.next()) {
    for (int i = 1; i <= rsmd.getColumnCount(); i++) {
        String value = rsMain.getString(i); // read the column first; wasNull() is only valid after a getter call
        if (rsMain.wasNull()) {
            sb.append("");                  // write an empty string instead of null
        } else {
            sb.append(value);
        }
        sb.append(',');                     // sb is the string buffer that writes to the csv
    }
}
You can update the table like this:
UPDATE yourTable
SET col1 = IFNULL(col1, ''),
col2 = IFNULL(col2, ''),
col3 = IFNULL(col3, ''),
...
Once you've done this, you should probably change the schema so that nulls aren't allowed and the default value is '', so you don't have to go through this again.
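A sketch of that schema change for one column (the VARCHAR type and length are placeholders; repeat for each nullable column):
ALTER TABLE yourTable
MODIFY col1 VARCHAR(255) NOT NULL DEFAULT '';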

SQL append strings

How do I insert a string at a particular position inside another string stored in a database column? I have a column named Name in table World, and the Name column has the value 'Test'. How do I insert the string 'New' after the 2nd position, i.e. after the 'e' in 'Test', so that 'Test' becomes 'TeNewst'?
Database name: World
Table name: Name
A value in the Name table: Test
Desired output: TeNewst
(That is, adding the string "New" after the 2nd position in the "Test" string.)
For SQL Server
--Original string
DECLARE @orgString varchar(50) = 'This is some test string'
--Search string
DECLARE @searchString varchar(50) = 'te'
--String to insert into the original string
DECLARE @insertString varchar(50) = 'NEW'
SELECT
CONCAT(SUBSTRING(@orgString,1,CHARINDEX(@searchString,@orgString)+1),
@insertString,
SUBSTRING(@orgString,CHARINDEX(@searchString,@orgString)+2,LEN(@orgString)))
AS String
To run something like this against data in your table, replace the original string variable with your column name
--Search string
DECLARE @searchString varchar(50) = 'te'
--String to insert into the original string
DECLARE @insertString varchar(50) = 'NEW'
SELECT CONCAT(SUBSTRING(Name,1,CHARINDEX(@searchString,Name)+1),
@insertString,
SUBSTRING(Name,CHARINDEX(@searchString,Name)+2,LEN(Name)))
AS String
FROM Table_1
If it is ALWAYS going to be between the 2nd and 3rd position, you could simplify it a little to this.
--String to insert into the original string
DECLARE @insertString varchar(50) = 'NEW'
SELECT CONCAT(SUBSTRING(Name,1,2),
@insertString,
SUBSTRING(Name,3,LEN(Name)))
AS String
FROM Table_1
For more options, see the SQL Server documentation on string functions.
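As an alternative sketch (not from the original answer), SQL Server's STUFF function can insert a string at a fixed position directly; with 'Test' in the Name column this also yields 'TeNEWst':
--Insert 'NEW' at position 3 without deleting any characters
SELECT STUFF(Name, 3, 0, 'NEW') AS String
FROM Table_1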

Csv file ingestion from hdfs to hive

I am trying to ingest a CSV file from HDFS into Hive using the command below.
create table test (col1 string, col2 int, col3 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",","quoteChar" = "\"")
stored as textfile;
But I am still getting double quotes in my hive table, so I tried the command below.
alter table test
set TBLPROPERTIES ('skip.header.line.count'='1','serialization.null.format' = '');
But still getting double quotes. What can I do to remove these double quotes?
You need to specify the file location.
For example:
CREATE TABLE test (col1 string, col2 int, col3 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",")
STORED AS textfile
LOCATION 'hdfs://<your-data-node-address>:8020/hdfs/path/to/csv/files-dir';
When I create the table this way, the values in my table don't have quotes (even though the source CSV file does).

Error while creating Parquet table from CSV using Apache Drill

I'm trying to create a Parquet table from a CSV extract (generated from an Oracle database table) that has over a million rows. About 25 of those rows have null values for START_DATE, and CTAS is failing to interpret "" as null. Any suggestions would be greatly appreciated.
CREATE TABLE dfs.tmp.FOO as
select cast(columns[0] as INT) as `PRODUCT_ID`,
cast(columns[1] as INT) as `LEG_ID`,
columns[2] as `LEG_TYPE`,
to_timestamp(columns[3], 'dd-MMM-yy HH.mm.ss.SSSSSS a') as `START_DATE`
from dfs.`c:\work\prod\data\foo.csv`;
Error: SYSTEM ERROR: IllegalArgumentException: Invalid format ""
You can always include a CASE statement to map the empty entries to null:
CREATE TABLE dfs.tmp.FOO as
select cast(columns[0] as INT) as `PRODUCT_ID`,
cast(columns[1] as INT) as `LEG_ID`,
columns[2] as `LEG_TYPE`,
CASE WHEN columns[3] = '' THEN null
ELSE to_timestamp(columns[3], 'dd-MMM-yy HH.mm.ss.SSSSSS a')
END as `START_DATE`
from dfs.`c:\work\prod\data\foo.csv`;
You can also use the NULLIF() function, as below:
CREATE TABLE dfs.tmp.FOO as
select cast(columns[0] as INT) as `PRODUCT_ID`,
cast(columns[1] as INT) as `LEG_ID`,
columns[2] as `LEG_TYPE`,
to_timestamp(NULLIF(columns[3],''), 'dd-MMM-yy HH.mm.ss.SSSSSS a') as `START_DATE`
from dfs.`c:\work\prod\data\foo.csv`;
NULLIF will convert the empty string to null, so the cast won't fail.
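To double-check the result, you could then query the new Parquet table for the rows that had empty dates (the table and column names are the ones defined above):
SELECT COUNT(*) FROM dfs.tmp.FOO WHERE `START_DATE` IS NULL;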