Hive external table pointing to CSV file with embedded double quotes

I am trying to create an external Hive table pointing to a CSV file.
My CSV file has a column (col2) that can contain double quotes and commas as part of the column value.
Data in each column:
Col1 : 150
Col2 : BATWING, ABC "D " TEST DATA
Col3 : 300
Row in CSV:
150,"BATWING, ABC ""D "" TEST DATA",300
Create table DDL:
CREATE EXTERNAL TABLE test (
col1 INT,
col2 STRING,
col3 INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '"'
LOCATION 's3://test-folder/test-file.csv'
When I query the table, I see null values in col3.
What am I missing here while creating the table? Any help is appreciated.

Use OpenCSVSerde. Here is an example:
Create table
CREATE TABLE bala (col1 int, col2 string, col3 int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "escapeChar" = "\""
);
Load data
hive> LOAD DATA INPATH '/../test.csv' INTO TABLE bala;
Loading data to table bala
Table testing.bala stats: [numFiles=1, totalSize=40]
OK
Time taken: 0.514 seconds
Check if it has loaded
hive> select * from bala;
OK
150 BATWING, ABC "D " TEST DATA 300
Time taken: 0.288 seconds, Fetched: 1 row(s)
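To answer the original question directly: the same SerDe works for an external table over S3. A minimal sketch, mirroring the SerDe properties above and assuming the file is moved into a folder of its own, since LOCATION must point to a directory rather than a single file:
CREATE EXTERNAL TABLE test (col1 int, col2 string, col3 int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "escapeChar" = "\""
)
STORED AS TEXTFILE
LOCATION 's3://test-folder/';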

Create a Hive external table:
DROP TABLE IF EXISTS ${hiveconf:dbnm}.tblnm;
CREATE EXTERNAL TABLE ${hiveconf:dbnm}.tblnm (
  C1 string,
  C2 string
)
PARTITIONED BY (C3 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = "|",  -- change this to your separator
  "quoteChar" = "\""
)
STORED AS TEXTFILE
LOCATION '/hdfspath'
-- TBLPROPERTIES ("skip.header.line.count"="1")
;
MSCK REPAIR TABLE ${hiveconf:dbnm}.tblnm;
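The ${hiveconf:dbnm} placeholder is substituted when the script runs. A typical invocation, assuming the DDL above is saved under a hypothetical file name create_tblnm.hql and a database named mydb:
hive --hiveconf dbnm=mydb -f create_tblnm.hql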

Related

Export big psql table to JSON

How can I export a big table to JSON when the output file is over 1 GB?
copy (SELECT json_agg(export_data)::text FROM "table_name" export_data) TO '{{ path_name }}/{{ table_name }}.json' with csv quote E'\t' encoding 'UTF8'
I receive: out of memory, Cannot enlarge string buffer containing 1073741822 bytes by 1 more bytes.
Table columns:
First: uuid
Second: timestamp
Three: uuid
Four: timestamp
Five: uuid
Six: int4
Seven: text
Eight: uuid
Nine: int4
Ten: uuid
Eleven: uuid
Twelve: varchar(50)
Maybe there is a way to split the output by lines?
Split the table into several parts and export each part separately.
Let's assume that the size is proportional to the number of rows and decide that one n-th of the total number of rows is a reasonable size. You can then execute a procedure like the following one and get n resulting files:
CREATE OR REPLACE PROCEDURE table_export (table_name text, path_name text, n integer)
LANGUAGE plpgsql AS
$$
DECLARE
  total_count bigint ;
  batch_size bigint ;
BEGIN
  -- EXECUTE ... INTO goes outside the query string, not inside it
  EXECUTE format('SELECT count(*) FROM %I', table_name) INTO total_count ;
  batch_size := ceil(total_count::numeric / n) ;
  FOR i IN 0 .. (n-1) LOOP
    -- LIMIT/OFFSET must run in a subquery, before json_agg collapses
    -- everything into a single row; add an ORDER BY for a stable split
    EXECUTE format(
      'COPY ( SELECT json_agg(t)::text
              FROM (SELECT * FROM %1$I OFFSET %2$s LIMIT %3$s) t )
       TO %4$L WITH csv QUOTE E''\t'' ENCODING ''UTF8'''
      , table_name
      , i * batch_size
      , batch_size
      , format('%s/%s_%s.json', path_name, table_name, i)
    ) ;
  END LOOP ;
END ;
$$ ;
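Usage is then a single CALL per export (procedures require PostgreSQL 11+, and COPY TO a file writes on the server, so the target directory must be writable by the server process); the table name and path here are placeholders:
CALL table_export('table_name', '/path/to/export/dir', 10);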

How to fix "invalid input syntax for integer: NUM"

I have this CSV file and I want to copy it into the table I created, but pgAdmin outputs:
ERROR: invalid input syntax for integer: "NUM" CONTEXT: COPY tickets,
line 1, column num: "NUM" SQL state: 22P02
The COPY code:
copy TICKETS(NUM,KIND,LOCATIONS,PRICE,DATES,CAT)
FROM 'C:\tmp\tickets.csv' DELIMITER ',' CSV
The CSV file begins with a header row containing the column names (as the error message shows, line 1 holds the literal text "NUM").
Why don't you try it this way:
create table TICKETS(
  NUM INT,
  KIND INT,
  LOCATIONS VARCHAR(100),
  PRICE INT,
  DATES DATE,
  CAT CHAR(1)
);
copy TICKETS(NUM,KIND,LOCATIONS,PRICE,DATES,CAT)
FROM 'C:\tmp\tickets.csv' DELIMITER ',' CSV HEADER;
The important point is the HEADER option at the end: it skips the title row, so the column names in line 1 are no longer parsed as data and no error is raised.
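Note that COPY FROM reads the file on the database server. If tickets.csv lives on your client machine instead, psql's \copy meta-command accepts the same options but reads the file client-side:
\copy TICKETS(NUM,KIND,LOCATIONS,PRICE,DATES,CAT) FROM 'C:\tmp\tickets.csv' DELIMITER ',' CSV HEADER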

CSV file ingestion from HDFS to Hive

I am trying to ingest a CSV file from HDFS into Hive using the command below.
create table test (col1 string, col2 int, col3 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",","quoteChar" = "\"")
stored as textfile;
But I am still getting double quotes in my Hive table, so I tried the command below.
alter table test
set TBLPROPERTIES ('skip.header.line.count'='1','serialization.null.format' = '');
But I am still getting double quotes. What can I do to remove them?
You need to specify the file location.
For example:
CREATE TABLE test (col1 string, col2 int, col3 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",")
STORED AS textfile
LOCATION 'hdfs://<your-data-node-address>:8020/hdfs/path/to/csv/files-dir';
When I create the table this way, the values in my table have no quotes (even though the source CSV file does).
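If the table already exists, it can also be pointed at the data directory after the fact instead of being recreated; a sketch, assuming the same path as above:
ALTER TABLE test SET LOCATION 'hdfs://<your-data-node-address>:8020/hdfs/path/to/csv/files-dir';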

output hive query result as csv enclosed in quotes

I have to export data from a Hive table into a CSV file in which fields are enclosed in double quotes.
So far I am able to generate a CSV without quotes using the following query:
INSERT OVERWRITE DIRECTORY '/user/vikas/output'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
SELECT QUERY
The output generated looks like
1,Vikas Saxena,Banking,JL5
However, I need the output as
"1","Vikas Saxena","Banking","JL5"
I tried changing the query to
INSERT OVERWRITE DIRECTORY '/user/vikas/output'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\"",
"escapeChar" = "\\"
)
SELECT QUERY
But it displays the error:
Error while compiling statement: FAILED: ParseException line 1:0 cannot recognize input near 'ROW' 'FORMAT' 'SERDE'
Create an external table:
CREATE EXTERNAL TABLE new_table(field1 type1, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\""
)
STORED AS TEXTFILE
LOCATION '/user/vikas/output';
Then select into that table:
insert into new_table select * from original_table;
Your CSV is then on disk at /user/vikas/output.
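The directory will typically hold one or more part files (000000_0, ...) rather than a single CSV; assuming a standard Hadoop client is available, they can be merged into one local file with:
hdfs dfs -getmerge /user/vikas/output /tmp/output.csv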

using a sequence in SQL*Loader

I have created a table as:
CREATE TABLE TEST2
(Seq varchar2(255 CHAR),
ID varchar2(255 CHAR),
NAME VARCHAR2 (255 CHAR),
DOB TIMESTAMP(3)
);
My control file is:
load data
infile 'C:\Users\sgujar\Documents\CDAR\test2.csv'
append into table TEST2
fields terminated by ","
(ID,
NAME,
DOB "TO_TIMESTAMP (:DOB, 'YYYY-MM-DD HH24:MI:SS.FF')",
seq"TEST2_seq.nextval"
)
I am not able to use a sequence in SQL*Loader.
Can you please help?
Although not a particularly pretty solution, it does what you ask:
CREATE OR REPLACE
FUNCTION get_test2_seq RETURN INTEGER
IS
BEGIN
RETURN TEST2_seq.nextval;
END;
/
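This assumes the sequence TEST2_seq already exists; if it does not, a minimal definition would be:
CREATE SEQUENCE TEST2_seq START WITH 1 INCREMENT BY 1;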
And then your control file would be:
load data
infile 'C:\Users\sgujar\Documents\CDAR\test2.csv'
append into table TEST2
fields terminated by ","
(
ID,
NAME,
DOB "TO_TIMESTAMP (:DOB, 'YYYY-MM-DD HH24:MI:SS.FF')",
SEQ "get_test2_seq()"
)
This will work for sure:
OPTIONS (DIRECT=TRUE, readsize=4096000, bindsize=4096000, skip=1, errors=1, rows=50000)
LOAD DATA
CHARACTERSET AL32UTF8 LENGTH SEMANTICS CHARACTER
INFILE '/path/test.csv'
BADFILE '/path/file.bad'
INSERT INTO TABLE test_table
FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY '"' TRAILING NULLCOLS
(
  Col1 SEQUENCE(1,1),
  Col2 CONSTANT "N"
)
Note that SEQUENCE(1,1) here is SQL*Loader's own row counter, not a reference to a database sequence object.
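Either way, the control file is then executed with the sqlldr utility; a typical invocation, assuming the file is saved as test.ctl and using placeholder credentials:
sqlldr userid=scott/tiger control=test.ctl log=test.log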