CSV file ingestion from HDFS to Hive

I am trying to ingest a CSV file from HDFS into Hive using the command below:
create table test (col1 string, col2 int, col3 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",","quoteChar" = "\"")
stored as textfile;
But I am still getting double quotes in my Hive table, so I tried the command below:
alter table test
set TBLPROPERTIES ('skip.header.line.count'='1','serialization.null.format' = '');
But I am still getting double quotes. What can I do to remove them?

You need to specify the file location.
For example:
CREATE TABLE test (col1 string, col2 int, col3 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",")
STORED AS textfile
LOCATION 'hdfs://<your-data-node-address>:8020/hdfs/path/to/csv/files-dir';
When I create the table this way, the values in my table have no quotes (even though the source CSV file does).
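As a side note, the quote stripping the serde performs is just the standard CSV rule: quotes delimit a field and are removed from the parsed value. Python's csv module (used here purely for illustration, with a made-up sample line) behaves the same way:

```python
import csv
import io

# Hypothetical sample line, quoted the way the source file is
line = '"abc",42,"d,e"\n'

# delimiter/quotechar play the roles of separatorChar/quoteChar:
# the quotes delimit fields and are stripped from the parsed values,
# and the comma inside "d,e" does not split the field
row = next(csv.reader(io.StringIO(line), delimiter=',', quotechar='"'))
print(row)  # ['abc', '42', 'd,e']
```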


Hive load data OpenCSVSerde comment control

How to reproduce the problem:
Create a table using Hive's CREATE TABLE statement, such as:
create table `db`.`table`(
`field1` string,
`field2` string,
`field3` string
) row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde';
Load a CSV file using Hive's LOAD DATA statement, such as:
load data local inpath 'my/file/path' overwrite into table `db`.`table`;
The table schema will be:
field1 string from deserializer
field2 string from deserializer
field3 string from deserializer
Notice that the comment of every field is 'from deserializer'.
My question:
How can I get rid of this comment, or customize it?
Set the hive.serdes.using.metastore.for.schema parameter in your Hive shell; then you can have an empty (or custom) comment:
hive> set hive.serdes.using.metastore.for.schema=org.apache.hadoop.hive.serde2.OpenCSVSerde;
hive> create table `db`.`table`(
`field1` string COMMENT '',
`field2` string COMMENT 'field2 comment',
`field3` string COMMENT 'field3 comment'
) row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde';
hive> desc formatted table;
OK
# col_name data_type comment
field1 string
field2 string field2 comment
field3 string field3 comment

Read CSV file in SQL

I have a csv file which I want to directly use without creating a table. Is there a way to read and manipulate it directly?
As long as you can connect to the server, you will be able to create a temp table.
For Microsoft SQL Server:
CREATE TABLE #csvtable
(
firstCol varchar(50) NOT NULL,
secondCol varchar(50) NOT NULL
);
BULK INSERT #csvtable FROM 'PathToCSVFile' WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');
GO
For MySQL:
CREATE TEMPORARY TABLE csvtable
(
firstCol varchar(50) NOT NULL,
secondCol varchar(50) NOT NULL
);
LOAD DATA INFILE 'PathToCSVFile'
INTO TABLE csvtable
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;
I know this is tagged as MySQL, but I think the question is really about running SQL queries on a CSV. If you are open to using Python, you can use FugueSQL to do that.
A sample Python snippet would be:
from fugue_sql import fsql
query = """
df = LOAD "/path/to/myfile.csv"
SELECT *
FROM df
WHERE col > 1
PRINT
"""
fsql(query).run()
and this will use Pandas to run the query by default. There is also a SAVE keyword, so you can write the output to another file.
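If FugueSQL is not an option, the same "SQL over a CSV, no permanent table" idea can be sketched with only the Python standard library, loading the file into an in-memory SQLite table. The sample data below is made up to stand in for myfile.csv:

```python
import csv
import io
import sqlite3

# Made-up CSV contents standing in for /path/to/myfile.csv
csv_text = "col,name\n1,a\n2,b\n3,c\n"
rows = list(csv.reader(io.StringIO(csv_text)))[1:]  # skip the header row

# Load into an in-memory SQLite table; INTEGER affinity coerces '2' -> 2
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (col INTEGER, name TEXT)")
conn.executemany("INSERT INTO df VALUES (?, ?)", rows)

# Run an ordinary SQL query against the CSV contents
result = conn.execute("SELECT name FROM df WHERE col > 1").fetchall()
print(result)  # [('b',), ('c',)]
```

With a real file you would read it via `open(path, newline='')` instead of `io.StringIO`.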

Hive external table pointing to CSV file with embedded double quotes

I am trying to create an external Hive table pointing to a CSV file.
My CSV file has a column(col2) that could have double quotes and comma as part of the column value.
Data in each column:
Col1 : 150
Col2 : BATWING, ABC "D " TEST DATA
Col3 : 300
Row in CSV:
150,"BATWING, ABC ""D "" TEST DATA",300
Create table DDL :
CREATE EXTERNAL TABLE test (
col1 INT,
col2 STRING,
col3 INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '"'
LOCATION 's3://test-folder/test-file.csv'
When I query the table, I see null values in col3.
What am I missing here while creating the table? Any help is appreciated
Use OpenCSVSerde. Here is an example
Create table
CREATE TABLE bala (col1 int, col2 string, col3 int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES(
"separatorChar" = ",", "escapeChar"='\"'
);
Load data
hive> LOAD DATA INPATH '/../test.csv' INTO TABLE bala;
Loading data to table bala
Table testing.bala stats: [numFiles=1, totalSize=40]
OK
Time taken: 0.514 seconds
Check if it has loaded
hive> select * from bala;
OK
150 BATWING, ABC "D " TEST DATA 300
Time taken: 0.288 seconds, Fetched: 1 row(s)
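The "escapeChar" = '\"' setting matches the quote-doubling convention the sample row uses ("" inside a quoted field stands for one literal quote). As a quick sanity check, Python's csv module (illustration only) parses the exact row from the question the same way:

```python
import csv
import io

# The row from the question: embedded quotes are doubled inside a quoted field
line = '150,"BATWING, ABC ""D "" TEST DATA",300\n'

# The comma inside col2 stays in the field, and "" collapses to a single "
row = next(csv.reader(io.StringIO(line), quotechar='"'))
print(row)  # ['150', 'BATWING, ABC "D " TEST DATA', '300']
```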
Create hive external table:
DROP TABLE IF EXISTS ${hiveconf:dbnm}.tblnm ;
CREATE EXTERNAL TABLE ${hiveconf:dbnm}.tblnm (
C1 string,
C2 string
)
PARTITIONED BY (C3 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = '|' -- change '|' to your separator
,"quoteChar" = '\"'
)
STORED AS TEXTFILE
LOCATION '/hdfspath'
--tblproperties ("skip.header.line.count"="1")
;
MSCK REPAIR TABLE ${hiveconf:dbnm}.tblnm;

Output Hive query result as CSV enclosed in quotes

I have to export data from a Hive table to a CSV file in which fields are enclosed in double quotes.
So far I am able to generate a CSV without quotes using the following query:
INSERT OVERWRITE DIRECTORY '/user/vikas/output'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
SELECT QUERY
The output generated looks like
1,Vikas Saxena,Banking,JL5
However, I need the output as
"1","Vikas Saxena","Banking","JL5"
I tried changing the query to
INSERT OVERWRITE DIRECTORY '/user/vikas/output'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\"",
"escapeChar" = "\\"
)
SELECT QUERY
But it fails with the error:
Error while compiling statement: FAILED: ParseException line 1:0 cannot recognize input near 'ROW' 'FORMAT' 'SERDE'
Create an external table:
CREATE EXTERNAL TABLE new_table(field1 type1, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\""
)
STORED AS TEXTFILE
LOCATION '/user/vikas/output';
Then select into that table:
insert into new_table select * from original_table;
Your CSV is then on disk at /user/vikas/output.
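For what the quoted output looks like, here is the same row written with Python's csv module in all-quote mode (csv.QUOTE_ALL plays the role the serde's quoteChar plays on output; the row values are copied from the question):

```python
import csv
import io

row = ['1', 'Vikas Saxena', 'Banking', 'JL5']

# QUOTE_ALL wraps every field in the quote character, like the desired output
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator='\n').writerow(row)
print(buf.getvalue(), end='')  # "1","Vikas Saxena","Banking","JL5"
```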

Using a sequence in SQL*Loader

I have created a table as:
CREATE TABLE TEST2
(Seq varchar2(255 CHAR),
ID varchar2(255 CHAR),
NAME VARCHAR2 (255 CHAR),
DOB TIMESTAMP(3)
);
My control file is:
load data
infile 'C:\Users\sgujar\Documents\CDAR\test2.csv'
append into table TEST2
fields terminated by ","
(ID,
NAME,
DOB "TO_TIMESTAMP (:DOB, 'YYYY-MM-DD HH24:MI:SS.FF')",
seq "TEST2_seq.nextval"
)
I am not able to use a sequence in SQL*Loader.
Can you please help?
Although not a particularly pretty solution, it does what you ask:
CREATE OR REPLACE
FUNCTION get_test2_seq RETURN INTEGER
IS
BEGIN
RETURN TEST2_seq.nextval;
END;
/
And then your control file would be:
load data
infile 'C:\Users\sgujar\Documents\CDAR\test2.csv'
append into table TEST2
fields terminated by ","
(
ID,
NAME,
DOB "TO_TIMESTAMP (:DOB, 'YYYY-MM-DD HH24:MI:SS.FF')",
SEQ "get_test2_seq()"
)
This will work for sure:
OPTIONS (DIRECT=TRUE, readsize=4096000, bindsize=4096000, skip=1, errors=1, rows=50000)
LOAD DATA
CHARACTERSET AL32UTF8 LENGTH SEMANTICS CHARACTER
INFILE '/path/test.csv'
BADFILE '/path/file.bad'
INSERT INTO TABLE test_table
FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY '"' TRAILING NULLCOLS
(
Col1 SEQUENCE(1,1),
Col2 CONSTANT "N"
)