Upload .csv data with enclosed (quoted) fields into Hive

My .csv file is in an enclosed format.
"13","9827259163","0","D","2"
"13","9827961481","0","D","2"
"13","9827202228","0","A","2"
"13","9827529897","0","A","2"
"13","9827700249","0","A","2"
"12","9883219029","0","A","2"
"17","9861065312","0","A","2"
"17","9861220761","0","D","2"
"13","9827438384","0","A","2"
"13","9827336733","0","D","2"
"13","9827380905","0","D","2"
"13","9827115358","0","D","2"
"17","9861475884","0","D","2"
"17","9861511646","0","D","2"
"17","9861310397","0","D","2"
"13","9827035035","0","A","2"
"13","9827304969","0","D","2"
"13","9827355786","0","A","2"
"13","9827702373","0","A","2"
As in MySQL, I tried using the ENCLOSED BY keyword as follows:
CREATE EXTERNAL TABLE dnd (ServiceAreaCode varchar(50), PhoneNumber varchar(15), Preferences varchar(15), Opstype varchar(15), PhoneType varchar(10))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
LOCATION '/dnd';
But it gives the following error:
NoViableAltException(26#[1704:103: ( tableRowFormatMapKeysIdentifier )?])
at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
at org.antlr.runtime.DFA.predict(DFA.java:144)
at org.apache.hadoop.hive.ql.parse.HiveParser.rowFormatDelimited(HiveParser.java:30427)
at org.apache.hadoop.hive.ql.parse.HiveParser.tableRowFormat(HiveParser.java:30662)
at org.apache.hadoop.hive.ql.parse.HiveParser.createTableStatement(HiveParser.java:4683)
at org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:2144)
at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1398)
at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1036)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:199)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:404)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:322)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:975)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1040)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:268)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:220)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:359)
at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:456)
at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:466)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:748)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:686)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:625)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
FAILED: ParseException line 5:33 cannot recognize input near 'ENCLOSED' 'BY' ''"'' in serde properties specification
Is there a way I can import this file directly? Thanks in advance.

Another way to do this is with a SerDe. Download the SerDe jar from this link: https://github.com/downloads/IllyaYalovyy/csv-serde/csv-serde-0.9.1.jar
Then follow the steps below at the Hive prompt:
add jar path/to/csv-serde.jar;
create table dnd (ServiceAreaCode varchar(50), PhoneNumber varchar(15), Preferences varchar(15), Opstype varchar(15), PhoneType varchar(10))
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties(
"separatorChar" = "\,",
"quoteChar" = "\"")
stored as textfile
;
Then load the data from your path using the query below:
load data local inpath 'path/xyz.csv' into table dnd;
And then run:
select * from dnd;
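To see what a quote-aware parser changes compared with splitting on the delimiter alone, here is a minimal Python sketch using two rows from the question. Python's csv module is only a stand-in here; it is not the SerDe's implementation, and the variable names are made up for the illustration.

```python
import csv
import io

# Two rows copied from the sample file in the question.
sample = '"13","9827259163","0","D","2"\n"17","9861065312","0","A","2"\n'

# Splitting on the delimiter only: the quotes remain part of every value.
naive_rows = [line.split(",") for line in sample.splitlines()]

# Quote-aware parsing: the quote character is consumed by the parser.
parsed_rows = list(csv.reader(io.StringIO(sample), delimiter=",", quotechar='"'))

print(naive_rows[0])
print(parsed_rows[0])
```

The first list still carries the double quotes inside each value, while the second holds the bare values, which is the result you would want to see from select * from dnd.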

I got quoted CSV data into a Hive table as follows.
First download the CSV SerDe (I downloaded csv-serde-1.1.2.jar).
Then:
hive>add jar /opt/hive-1.1.1/lib/csv-serde-1.1.2.jar;
hive>create table t1(schema) row format serde 'com.bizo.hive.serde.csv.CSVSerde' with serdeproperties ("separatorChar" = ",") LOCATION '/user/hive/warehouse/dwb/ot1/';
Then add the SerDe jar to hive-site.xml as shown below, so that the table can be queried from the Hive shell.
<property><name>hive.aux.jars.path</name><value>hdfs://master-ip:54310/hive-serde/csv-serde-1.1.2.jar</value></property>

In Hive we can use a SerDe jar to read data that is enclosed in double quotes.
For your problem, please refer to this link:
http://stackoverflow.com/questions/21156071/why-dont-hive-have-fields-enclosed-by-like-in-mysql

Related

How to remove double quotes when loading csv into external table in impala?

This is the data (you can also download it from here):
"Creation Date","Status","First 3 Chars of Postal Code","Intersection Street 1","Intersection Street 2","Ward","Service Request Type","Division","Section"
"2010-01-01 00:38:26.0000000","Closed","Intersection","High Park Blvd","Parkside Dr","Parkdale-High Park (13)","Road - Sanding / Salting Required","Transportation Services","Road Operations"
"2010-01-01 01:19:18.0000000","Closed","M4T","","","Toronto Centre-Rosedale (27)","Water Service Line-Turn On","Toronto Water","District Ops"
This is my create table query:
CREATE TABLE sr.sr2013 (
creation_date STRING,
status STRING,
first_3_chars_of_postal_code STRING,
intersection_street_1 STRING,
intersection_street_2 STRING,
ward STRING,
service_request_type STRING,
division STRING,
section STRING )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
WITH SERDEPROPERTIES (
'colelction.delim'='\u0002',
'mapkey.delim'='\u0003',
'serialization.format'=',',
'field.delim'=',',
'skip.header.line.count'='1',
'quoteChar'= "\"") ;
This is the load data query:
load data inpath '/user/rxie/SR2013.csv' into table sr2013;
After the data is loaded, checking the table shows that all the original quotes are retained.
So there are at least two issues here:
1. The header is not excluded by the option 'skip.header.line.count'='1' in the table creation.
2. The double quotes are not removed, as the option 'quoteChar'= "\"" would suggest, when loading data into the table.
Can anyone shed more light on this? It looks like a bug to me.
UPDATE 1:
In Hue/Hive editor:
creation_date STRING,
status STRING,
first_3_chars_of_postal_code STRING,
intersection_street_1 STRING,
intersection_street_2 STRING,
ward STRING,
service_request_type STRING,
division STRING,
section STRING )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'colelction.delim'='\u0002',
'field.delim'=',',
'mapkey.delim'='\u0003',
'serialization.format'=',',
'skip.header.line.count'='1',
'quoteChar'= "\"")
LOAD DATA LOCAL INPATH '/home/rxie/data/csv/SR2015.csv' INTO TABLE sr2015;
Error:
Error while compiling statement: FAILED: SemanticException line 1:26
Invalid path ''/home/rxie/data/csv/SR2015.csv'': No files matching
path file:/home/rxie/data/csv/SR2015.csv
Below is what worked for me to load the CSV with the quotes excluded:
In Hive Editor (I assume beeline is good too though I didn't test it out):
Create Hive table
CREATE EXTERNAL TABLE sr2015(
creation_date STRING,
status STRING,
first_3_chars_of_postal_code STRING,
intersection_street_1 STRING,
intersection_street_2 STRING,
ward STRING,
service_request_type STRING,
division STRING,
section STRING )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'colelction.delim'='\u0002',
'field.delim'=',',
'mapkey.delim'='\u0003',
'serialization.format'=',',
'skip.header.line.count'='1',
'quoteChar'= "\"")
Load data into Hive table:
LOAD DATA INPATH "hdfs:///user/rxie/SR2015.csv" INTO TABLE sr2015;
Pending issue(will be discussed here):
The table is not accessible in Impala

hive sql, serde how to not quote my fields?

Since by default the SerDe quotes fields with ", how can I not quote my fields using the SerDe?
I tried:
row format serde "org.apache.hadoop.hive.serde2.OpenCSVSerde"
with serdeproperties(
"separatorChar" = ",",
"quoteChar" = "")
But I'm getting:
FAILED: SemanticException java.lang.StringIndexOutOfBoundsException: String index out of range: 0
You could achieve this by specifying \u0000 as the quote character. Since quoteChar expects a string, you should use this unicode version of NULL.
ROW FORMAT SERDE
"org.apache.hadoop.hive.serde2.OpenCSVSerde"
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\u0000")
This Unicode NULL \u0000 is what the CSV writer class uses as the value for NO_QUOTE_CHARACTER: http://www.java2s.com/Code/Java/Development-Class/AverysimpleCSVwriterreleasedunderacommercialfriendlylicense.htm
For some reason "quoteChar" = "\u0000" didn't work for me as suggested in Nirmal's answer above.
When saving to file without quotes around the fields, I use:
-- saving to file
INSERT OVERWRITE LOCAL DIRECTORY 'file:/home/sidazhou/temp'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT *
FROM temp_table
;
PS. I know this isn't what's being asked, which concerns ROW FORMAT SERDE instead of ROW FORMAT DELIMITED FIELDS.

HIVE escaped by not working '\\'

I have a data-set in S3
123, "some random, text", "", "", 236
I built an external table on this dataset:
CREATE EXTERNAL TABLE db1.myData(
field1 bigint,
field2 string,
field3 string,
field4 string,
field5 bigint)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LOCATION 's3n://thisMyData/';
Problem/Issue:
When I run
select * from db1.myData
field2 is shown as
some random
I need the field to be
some random, text
Gotchas:
1. I cannot change the delimiter, as there are over 300 .csv files at this location.
2. ESCAPED BY '\\' is not escaping anything.
3. I'm using Hive 0.13, so I cannot use the CSV SerDe, and I'm not allowed to import new jars to the cluster either (adding a new jar is a complicated process that requires Director-level approvals).
Question:
Is there a workaround for making ESCAPED BY work?
Are there any other workarounds for this?
All suggestions are welcome!
N.B.: This is not a repeat question. If you think it is, please point me to the right page and I will take this one down.
I had to use: ESCAPED BY '\134' which translates to: ESCAPED BY '\'.
Additionally, because I was calling the Athena CREATE TABLE statement by passing it in from a JSON file, I had to add an extra \ to escape the original \ in JSON. So my final statement within the JSON file looked like this: ESCAPED BY '\\134'.
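The extra backslash can be checked outside Athena with a small Python sketch. The key name "clause" and the surrounding JSON object are made up for the illustration; only the ESCAPED BY text comes from the answer above.

```python
import json

# Octal 134 is the code point of the backslash character.
backslash = chr(0o134)

# Text as it would sit inside the JSON file. Raw strings are used so that
# Python itself does not unescape anything.
json_text = r'''{"clause": "ESCAPED BY '\\134'"}'''
clause = json.loads(json_text)["clause"]
print(clause)  # the doubled backslash collapses to a single one

# With a single backslash, \1 is not a legal JSON escape sequence,
# so the document fails to parse.
try:
    json.loads(r'''{"clause": "ESCAPED BY '\134'"}''')
    single_backslash_ok = True
except json.JSONDecodeError:
    single_backslash_ok = False
print(single_backslash_ok)
```

After JSON decoding, the statement contains '\134' with one backslash, which is the form the first part of the answer says was needed.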
If you are using Hive 0.14 or later, you can use the CSV SerDe like this:
CREATE TABLE my_table(a string, b string, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\t",
"quoteChar" = "'",
"escapeChar" = "\\"
)
STORED AS TEXTFILE;
Refer to the link below for details:
https://cwiki.apache.org/confluence/display/Hive/CSV+Serde
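As a rough illustration of why a plain delimiter split loses part of field2 in the sample row from the question, and what a quote-aware parser returns instead, here is a short Python sketch. It uses Python's csv module as a stand-in, not Hive's code; skipinitialspace is needed here only because the sample row has a space after each comma.

```python
import csv
import io

# The sample row from the question.
line = '123, "some random, text", "", "", 236'

# Splitting on every comma cuts the quoted field in two.
naive = line.split(",")

# A quote-aware parser treats the comma inside the quotes as data.
parsed = next(csv.reader(io.StringIO(line), skipinitialspace=True))

print(len(naive), naive[1])
print(len(parsed), parsed[1])
```

The naive split yields six pieces with the second one cut off after "some random", while the quote-aware parse yields the five expected fields.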

Error while trying to import csv into hive

I tried to import my CSV data into Hive.
My query:
CREATE EXTERNAL TABLE student(Stud_name String,dept String,year String)
> ROW FORMAT serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
> with serdeproperties (
> "separatorChar" = ","
> )
> STORED AS TEXTFILE
> LOCATION '/home/codewarrior/Desktop/csv';
But it gives an error and quits from Hive. I hope somebody can help me.
You can try this code instead:
CREATE EXTERNAL TABLE student(Stud_name String,dept String,year String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/home/codewarrior/Desktop/csv';

Error running Hive query with JSON data?

I have data containing the following:
{"field1":{"data1": 1},"field2":100,"field3":"more data1","field4":123.001}
{"field1":{"data2": 1},"field2":200,"field3":"more data2","field4":123.002}
{"field1":{"data3": 1},"field2":300,"field3":"more data3","field4":123.003}
{"field1":{"data4": 1},"field2":400,"field3":"more data4","field4":123.004}
I uploaded it to S3 and converted it to a Hive table using the following from the Hive console:
ADD JAR s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar;
CREATE EXTERNAL TABLE impressions (json STRING ) ROW FORMAT DELIMITED LINES TERMINATED BY '\n' LOCATION 's3://my-bucket/';
The query:
SELECT * FROM impressions;
gives output fine but as soon as I try and use the get_json_object UDF
and run the query:
SELECT get_json_object(impressions.json, '$.field2') FROM impressions;
I get the following error:
> SELECT get_json_object(impressions.json, '$.field2') FROM impressions;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
java.io.IOException: cannot find dir = s3://nick.bucket.dev/snapshot.csv in pathToPartitionInfo: [s3://nick.bucket.dev/]
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:291)
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:258)
at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat$CombineHiveInputSplit.<init>(CombineHiveInputFormat.java:108)
at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:423)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1036)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1028)
at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:172)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:944)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:897)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:871)
at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:479)
at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:136)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:133)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1332)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1123)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:931)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:261)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:218)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:409)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:684)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:567)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Job Submission failed with exception 'java.io.IOException(cannot find dir = s3://my-bucket/snapshot.csv in pathToPartitionInfo: [s3://my-bucket/])'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask
Does anyone know what is wrong?
Any reason why you are declaring:
ADD JAR s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar;
But not using the serde in your table definition? See the code snippet below on how to use it. I can't see any reason to use get_json_object here.
CREATE EXTERNAL TABLE impressions (
field1 string, field2 string, field3 string, field4 string
)
ROW FORMAT
serde 'com.amazon.elasticmapreduce.JsonSerde'
with serdeproperties ( 'paths'='field1, field2, field3, field4' )
LOCATION 's3://mybucket' ;
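Independently of Hive, it can help to confirm that the sample lines are valid JSON and to see which values a lookup of field2 should produce. The following is a minimal Python sketch over the first two records from the question; it says nothing about the path error in the stack trace and is only a check on the data itself.

```python
import json

# First two records copied from the question.
records = [
    '{"field1":{"data1": 1},"field2":100,"field3":"more data1","field4":123.001}',
    '{"field1":{"data2": 1},"field2":200,"field3":"more data2","field4":123.002}',
]

# Parse each line and pull out the top-level key "field2".
field2_values = [json.loads(record)["field2"] for record in records]
print(field2_values)
```

If these values come out as expected, the data is well formed and the problem lies in how the table or its location is defined.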