I have to export data from a Hive table to a CSV file in which every field is enclosed in double quotes.
So far I can generate a CSV without quotes using the following query:
INSERT OVERWRITE DIRECTORY '/user/vikas/output'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
SELECT QUERY
The output generated looks like
1,Vikas Saxena,Banking,JL5
However, I need the output as
"1","Vikas Saxena","Banking","JL5"
I tried changing the query to
INSERT OVERWRITE DIRECTORY '/user/vikas/output'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\"",
"escapeChar" = "\\"
)
SELECT QUERY
But it fails with this error:
Error while compiling statement: FAILED: ParseException line 1:0 cannot recognize input near 'ROW' 'FORMAT' 'SERDE'
Create an external table:
CREATE EXTERNAL TABLE new_table(field1 type1, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\""
)
STORED AS TEXTFILE
LOCATION '/user/vikas/output';
Then select into that table:
insert into new_table select * from original_table;
The CSV output is then on disk under /user/vikas/output.
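If the export is small enough to post-process outside Hive, the unquoted output from the first query can also be re-quoted afterwards. The following is only a minimal sketch using Python's standard csv module; the function name quote_all_fields and the sample row are illustrative assumptions, and it assumes that no field value itself contains a comma (the plain delimited output does not protect such values).

```python
import csv
import io


def quote_all_fields(unquoted_csv: str) -> str:
    """Re-emit comma-separated text with every field wrapped in double quotes."""
    reader = csv.reader(io.StringIO(unquoted_csv))
    out = io.StringIO()
    # QUOTE_ALL quotes every field, not only those containing special characters.
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


# Sample row taken from the question's unquoted output.
sample = "1,Vikas Saxena,Banking,JL5\n"
quoted = quote_all_fields(sample)
print(quoted, end="")
```

For large tables the external-table approach above is likely the better fit, since it keeps the work inside Hive.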
I have this CSV file and I want to copy it into the table I created, but pgAdmin outputs:
ERROR: invalid input syntax for integer: "NUM" CONTEXT: COPY tickets,
line 1, column num: "NUM" SQL state: 22P02
The COPY command:
copy TICKETS(NUM,KIND,LOCATIONS,PRICE,DATES,CAT)
FROM 'C:\tmp\tickets.csv' DELIMITER ',' CSV
The CSV file:
Why don't you try this way:
create table TICKETS(
NUM INT,
KIND INT,
LOCATION VARCHAR(100),
PRICE INT,
DATE DATE,
CAT CHAR(1)
);
LOAD DATA INFILE 'C:/tmp/tickets.csv'
INTO TABLE TICKETS
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;
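The important part is the last line: IGNORE 1 ROWS skips the header row, so the column titles are not parsed as data and the error no longer occurs. Note that LOAD DATA INFILE is MySQL syntax; if you stay with PostgreSQL's COPY, adding the HEADER option (for example, CSV HEADER) skips the first line in the same way.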
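To see why the header row triggers the error, here is a small sketch in Python. The file contents and the helper load_nums are hypothetical (the actual CSV is not shown in the question); only the column names come from the COPY command above.

```python
import csv
import io
from typing import List

# Hypothetical file contents: a header row followed by one data row.
tickets_csv = "NUM,KIND,LOCATIONS,PRICE,DATES,CAT\n1,2,Paris,50,2020-01-31,A\n"


def load_nums(text: str, skip_header: bool) -> List[int]:
    """Parse the first column as integers, optionally skipping the header row."""
    reader = csv.reader(io.StringIO(text))
    if skip_header:
        next(reader, None)  # discard the column titles
    return [int(row[0]) for row in reader]


try:
    load_nums(tickets_csv, skip_header=False)
    header_error = None
except ValueError as exc:
    # int("NUM") fails, much like the integer column rejecting "NUM" above.
    header_error = str(exc)

nums = load_nums(tickets_csv, skip_header=True)
print(header_error)
print(nums)
```

Without skipping, the literal text "NUM" reaches the integer conversion; with the header discarded, only data rows are parsed.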
I have a csv file which I want to directly use without creating a table. Is there a way to read and manipulate it directly?
As long as you can connect to the server, you can create a temporary table.
For Microsoft SQL Server:
CREATE TABLE #TempTable
(
firstCol varchar(50) NOT NULL,
secondCol varchar(50) NOT NULL
)
BULK INSERT #TempTable FROM 'PathToCSVFile' WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n')
GO
For MySQL:
CREATE TEMPORARY TABLE csvtable
(
firstCol varchar(50) NOT NULL,
secondCol varchar(50) NOT NULL
);
LOAD DATA INFILE 'PathToCSVFile'
INTO TABLE csvtable
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;
I know this is tagged as MySQL, but I think what you really want is to run SQL queries on a CSV file. If you are open to using Python, you can use FugueSQL to do that.
A sample Python snippet would be:
from fugue_sql import fsql
query = """
df = LOAD "/path/to/myfile.csv"
SELECT *
FROM df
WHERE col > 1
PRINT
"""
fsql(query).run()
By default, this uses Pandas to run the query. There is also a SAVE keyword, so you can save the output to another file.
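If installing an extra package is not an option, a similar effect can be had with Python's standard library alone by loading the CSV into an in-memory SQLite database. This is a different technique from FugueSQL, shown only as a rough sketch: the helper query_csv, the table name df, and the sample data are all assumptions made for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical CSV contents; in practice, read the real file instead.
csv_text = "col,name\n1,a\n2,b\n3,c\n"


def query_csv(text, sql):
    """Load CSV text into an in-memory SQLite table named df and run sql on it."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    conn = sqlite3.connect(":memory:")
    try:
        columns = ", ".join(f'"{name}"' for name in header)
        placeholders = ", ".join("?" for _ in header)
        conn.execute(f"CREATE TABLE df ({columns})")
        conn.executemany(f"INSERT INTO df VALUES ({placeholders})", data)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


# Values are stored as text, so cast before comparing numerically.
result = query_csv(csv_text, "SELECT * FROM df WHERE CAST(col AS INTEGER) > 1")
print(result)
```

The table exists only for the lifetime of the connection, so nothing is created on any database server.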
I am trying to create an external Hive table pointing to a CSV file.
My CSV file has a column (col2) whose value can contain double quotes and commas.
Data in each column:
Col1 : 150
Col2 : BATWING, ABC "D " TEST DATA
Col3 : 300
Row in CSV:
150,"BATWING, ABC ""D "" TEST DATA",300
Create table DDL :
CREATE EXTERNAL TABLE test (
col1 INT,
col2 STRING,
col3 INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '"'
LOCATION 's3://test-folder/test-file.csv'
When I query the table, I see null values in col3.
What am I missing in the table definition? Any help is appreciated.
Use OpenCSVSerde. Here is an example
Create table
CREATE TABLE bala (col1 int, col2 string, col3 int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES(
"separatorChar" = ",", "escapeChar"='\"'
);
Load data
hive>LOAD DATA INPATH '/../test.csv' INTO TABLE bala
Loading data to table bala
Table testing.bala stats: [numFiles=1, totalSize=40]
OK
Time taken: 0.514 seconds
Check if it has loaded
hive> select * from bala;
OK
150 BATWING, ABC "D " TEST DATA 300
Time taken: 0.288 seconds, Fetched: 1 row(s)
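The difference between plain delimiter splitting and CSV-aware parsing can be illustrated outside Hive. The sketch below uses Python's standard csv module on the row from the question; it only approximates the two behaviours (it does not model Hive's ESCAPED BY clause), so treat it as an illustration rather than a description of Hive internals.

```python
import csv
import io

# The row from the question, exactly as it appears in the CSV file.
row = '150,"BATWING, ABC ""D "" TEST DATA",300'

# Splitting on every comma, roughly what a plain delimited format does.
naive_fields = row.split(",")

# CSV-aware parsing: quotes group the field and "" stands for one literal quote.
csv_fields = next(csv.reader(io.StringIO(row)))

print(len(naive_fields), naive_fields)
print(len(csv_fields), csv_fields)
```

The naive split yields four fields, and the third one is text rather than a number, which is a plausible reason for col3 coming back as null. The CSV-aware parse yields the three intended values.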
Create hive external table:
DROP TABLE IF EXISTS ${hiveconf:dbnm}.tblnm ;
CREATE EXTERNAL TABLE ${hiveconf:dbnm}.tblnm (
C1 string,
C2 string
)
PARTITIONED BY (C3 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = '|' -- change this to your separator
,"quoteChar" = '\"'
)
STORED AS TEXTFILE
LOCATION '/hdfspath'
--tblproperties ("skip.header.line.count"="1")
;
MSCK REPAIR TABLE ${hiveconf:dbnm}.tblnm;
I am trying to ingest the csv file from my hdfs to hive using the command below.
create table test (col1 string, col2 int, col3 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",","quoteChar" = "\"")
stored as textfile;
But I am still getting double quotes in my hive table, so I tried the command below.
alter table test
set TBLPROPERTIES ('skip.header.line.count'='1','serialization.null.format' = '');
But I am still getting double quotes. What can I do to remove them?
You need to specify the file location.
For example:
CREATE TABLE test (col1 string, col2 int, col3 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",")
STORED AS textfile
LOCATION 'hdfs://<your-data-node-address>:8020/hdfs/path/to/csv/files-dir';
When I create the table this way, the values in my table have no quotes, even though the source CSV file does have them.
I have created table as
CREATE TABLE TEST2
(Seq varchar2(255 CHAR),
ID varchar2(255 CHAR),
NAME VARCHAR2 (255 CHAR),
DOB TIMESTAMP(3)
);
my control file is
load data
infile 'C:\Users\sgujar\Documents\CDAR\test2.csv'
append into table TEST2
fields terminated by ","
(ID,
NAME,
DOB "TO_TIMESTAMP (:DOB, 'YYYY-MM-DD HH24:MI:SS.FF')",
seq"TEST2_seq.nextval"
)
I am not able to use a sequence in SQL*Loader.
Can you please help?
Although not a particularly pretty solution, it does what you ask:
CREATE OR REPLACE
FUNCTION get_test2_seq RETURN INTEGER
IS
BEGIN
RETURN TEST2_seq.nextval;
END;
/
And then your control file would be
load data
infile 'C:\Users\sgujar\Documents\CDAR\test2.csv'
append into table TEST2
fields terminated by ","
(
ID,
NAME,
DOB "TO_TIMESTAMP (:DOB, 'YYYY-MM-DD HH24:MI:SS.FF')",
SEQ "get_test2_seq()"
)
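The following control file should work. It uses SQL*Loader's built-in SEQUENCE column specification instead of a database sequence: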
options (DIRECT=TRUE,readsize=4096000,bindsize=4096000,skip=1,errors=1,rows=50000)
LOAD DATA
CHARACTERSET AL32UTF8 LENGTH SEMANTICS CHARACTER
INFILE '/path/test.csv'
BADFILE '/path/file.bad'
INSERT INTO TABLE test_table
FIELDS TERMINATED BY "," OPTIONALLY ENCLOSED BY '"' TRAILING NULLCOLS
(
Col1 sequence(1,1),
Col2 constant "N"
)