Error running Hive query with JSON data? - json

I have data containing the following:
{"field1":{"data1": 1},"field2":100,"field3":"more data1","field4":123.001}
{"field1":{"data2": 1},"field2":200,"field3":"more data2","field4":123.002}
{"field1":{"data3": 1},"field2":300,"field3":"more data3","field4":123.003}
{"field1":{"data4": 1},"field2":400,"field3":"more data4","field4":123.004}
I uploaded it to S3 and converted it to a Hive table using the following from the Hive console:
ADD JAR s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar;
CREATE EXTERNAL TABLE impressions (json STRING ) ROW FORMAT DELIMITED LINES TERMINATED BY '\n' LOCATION 's3://my-bucket/';
The query:
SELECT * FROM impressions;
gives output fine but as soon as I try and use the get_json_object UDF
and run the query:
SELECT get_json_object(impressions.json, '$.field2') FROM impressions;
I get the following error:
> SELECT get_json_object(impressions.json, '$.field2') FROM impressions;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
java.io.IOException: cannot find dir = s3://nick.bucket.dev/snapshot.csv in pathToPartitionInfo: [s3://nick.bucket.dev/]
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:291)
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:258)
at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat$CombineHiveInputSplit.<init>(CombineHiveInputFormat.java:108)
at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:423)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1036)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1028)
at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:172)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:944)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:897)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:897)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:871)
at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:479)
at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:136)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:133)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1332)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1123)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:931)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:261)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:218)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:409)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:684)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:567)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Job Submission failed with exception 'java.io.IOException(cannot find dir = s3://my-bucket/snapshot.csv in pathToPartitionInfo: [s3://my-bucket/])'
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask
Does anyone know what is wrong?

Any reason why you are declaring:
ADD JAR s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar;
But not using the serde in your table definition? See the code snippet below on how to use it. I can't see any reason to use get_json_object here.
CREATE EXTERNAL TABLE impressions (
field1 string, field2 string, field3 string, field4 string
)
ROW FORMAT
serde 'com.amazon.elasticmapreduce.JsonSerde'
with serdeproperties ( 'paths'='field1, field2, field3,
field4)
LOCATION 's3://mybucket' ;

Related

how to tokenize a text by nltk python

i have a text like this:
Exception in org.baharan.dominant.dao.core.nonPlanAllocation.INonPlanAllocationRepository.getAllGrid()
with cause = 'org.hibernate.exception.SQLGrammarException: could not extract ResultSet'
Caused by: java.sql.SQLSyntaxErrorException: ORA-00942: table or view does not exist
i tokenize this text with word_tokenize in python and output is:
Exception
org.baharan.dominant.dao.core.nonPlanAllocation.INonPlanAllocationRepository.getAllGrid
cause
'org.hibernate.exception.SQLGrammarException
could
extract
ResultSet'
Caused
java.sql.SQLSyntaxErrorException
ORA-00942
table
view
exist
But as you can see, the second line outputs several words that are dotted together. How to separate these as a Word?!
i use this python code:
>>> f = open('001.txt')
>>> text = [w for w in word_tokenize(f.read()) if w not in stopwords]
and In fact, I want all words to be separated like this:
Exception
org
baharan
dominant
dao
core
nonPlanAllocation
INonPlanAllocationRepository
getAllGrid
cause
'org
hibernate
exception
SQLGrammarException
could
extract
ResultSet'
Caused
java
sql
SQLSyntaxErrorException
ORA-00942
table
view
exist
f = "Exception in org.baharan.dominant.dao.core.nonPlanAllocation.INonPlanAllocationRepository.getAllGrid() \
with cause = 'org.hibernate.exception.SQLGrammarException: could not extract ResultSet' \
Caused by: java.sql.SQLSyntaxErrorException: ORA-00942: table or view does not exist'"
s = ''
f_list = f.replace('.', ' ').split(' ')
for item in f_list:
#print(item)
s = s + ' ' + ''.join(item)+'\n'
print(s)
output
Exception
in
org
baharan
dominant
dao
core
nonPlanAllocation
INonPlanAllocationRepository
getAllGrid()
with
cause
=
'org
hibernate
exception
SQLGrammarException:
could
not
extract
ResultSet'
Caused
by:
java
sql
SQLSyntaxErrorException:
ORA-00942:
table
or
view
does
not
exist'
i found a simple way that use of RegexpTokenizer of nltk.tokenize like this:
>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r'\w+')
The output after considering remove stopwords is as follows:
Exception
org
baharan
dominant
dao
core
nonPlanAllocation
INonPlanAllocationRepository
getAllGrid
cause
org
hibernate
exception
SQLGrammarException
could
extract
ResultSet
Caused
java
sql
SQLSyntaxErrorException
ORA-00942
table
view
exist

Unexpected character ('O' (code 79)): expected a valid value (number, String, array, object, 'true', 'false' or 'null') How to solve this?

While doing Twitter sentiment analysis i come across this error despite searching for extensively for three days for the solution i couldn't get any solution.
Error
hive> select * from load_tweets;
OK
Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected character ('O' (code 79)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
at [Source: java.io.StringReader#5a82bc58; line: 1, column: 2]
Time taken: 1.698 seconds
hive>
Here is the Table creation
hive> create external table load_tweets(id BIGINT,text STRING) ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe' LOCATION '/user/flume/tweets'
> ;
I am using Cloudera Json serde version
json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar
I have checked jar file is properly added in hive class path.
Database:
http://freetexthost.com/ik4jyogkfm
This is database fetch by flume.
I am referring this article for the data fetching
https://acadgild.com/blog/streaming-twitter-data-using-flume/

Error :JsonStorage in Pig Local mode

I am running my Pigscript in Local mode in eclipse.
when I try to store the output in JsonStorage.
Exception in thread "main" java.lang.RuntimeException: Cannot instantiate:org.apache.pig.builtin.JsonStorage
at org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:473)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.NonEvalFuncSpec(QueryParser.java:4976)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.StoreClause(QueryParser.java:3473)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1351)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:893)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:706)
at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1017)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:967)
at org.apache.pig.PigServer.registerQuery(PigServer.java:383)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:716)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
at org.apache.pig.PigServer.registerScript(PigServer.java:407)
at com.paypal.debugpig.DebugPig.main(DebugPig.java:13)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve org.apache.pig.builtin.JsonStorage using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
at org.apache.pig.impl.PigContext.resolveClassName(PigContext.java:458)
at org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:470)
... 14 more
PigScript :
REGISTER C:/path/to/jar/pig.jar;
REGISTER C:/path/to/jar/UpperUDf/UpperUDf_fat.jar;
A = LOAD 'C:/path/to/data/file/student.txt' using PigStorage('\t') AS (name: chararray, age: int, gpa: float);
B = FOREACH A GENERATE myudfs.UPPER(name) ,age, gpa ;
Store B into 'output_student_Json' using org.apache.pig.builtin.JsonStorage();
when I dump or store the ouput in text file its working and but issues occurs when I try to store in JSON format.
Any pointers appreciated
Thank you
I have verified it, and it is working for me if i am using the below line of code for storing output into json file format.
store B into 'json_output' using JsonStorage();

Upload .csv data in Hive which is in enclosed format

My .csv file is in an enclosed format.
"13","9827259163","0","D","2"
"13","9827961481","0","D","2"
"13","9827202228","0","A","2"
"13","9827529897","0","A","2"
"13","9827700249","0","A","2"
"12","9883219029","0","A","2"
"17","9861065312","0","A","2"
"17","9861220761","0","D","2"
"13","9827438384","0","A","2"
"13","9827336733","0","D","2"
"13","9827380905","0","D","2"
"13","9827115358","0","D","2"
"17","9861475884","0","D","2"
"17","9861511646","0","D","2"
"17","9861310397","0","D","2"
"13","9827035035","0","A","2"
"13","9827304969","0","D","2"
"13","9827355786","0","A","2"
"13","9827702373","0","A","2"
Like it is in mysql, I have tried using "enclosed" keyword as follows..
CREATE EXTERNAL TABLE dnd (ServiceAreaCode varchar(50), PhoneNumber varchar(15), Preferences varchar(15), Opstype varchar(15), PhoneType varchar(10))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
LOCATION '/dnd';
But, it is giving an error as follows...
NoViableAltException(26#[1704:103: ( tableRowFormatMapKeysIdentifier )?])
at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
at org.antlr.runtime.DFA.predict(DFA.java:144)
at org.apache.hadoop.hive.ql.parse.HiveParser.rowFormatDelimited(HiveParser.java:30427)
at org.apache.hadoop.hive.ql.parse.HiveParser.tableRowFormat(HiveParser.java:30662)
at org.apache.hadoop.hive.ql.parse.HiveParser.createTableStatement(HiveParser.java:4683)
at org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:2144)
at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1398)
at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1036)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:199)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:404)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:322)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:975)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1040)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:268)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:220)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:359)
at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:456)
at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:466)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:748)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:686)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:625)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
FAILED: ParseException line 5:33 cannot recognize input near 'ENCLOSED' 'BY' ''"'' in serde properties specification
Is there a way I can directly import this file ?? Thanks in advance.
Find another way. The solution is serde. Please download serde jar using this link : https://github.com/downloads/IllyaYalovyy/csv-serde/csv-serde-0.9.1.jar
then follow below steps using hive prompt :
add jar path/to/csv-serde.jar;
create table dnd (ServiceAreaCode varchar(50), PhoneNumber varchar(15), Preferences varchar(15), Opstype varchar(15), PhoneType varchar(10))
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties(
"separatorChar" = "\,",
"quoteChar" = "\"")
stored as textfile
;
and then load data from your given path using below query:
load data local inpath 'path/xyz.csv' into table dnd;
and then run :
select * from dnd;
Hey I did it quoted csv data in hive table:
first download csv serde(I downloaded csv-serde-1.1.2.jar)
Then
hive>add jar /opt/hive-1.1.1/lib/csv-serde-1.1.2.jar;
Hive>create table t1(schema) row format serde 'com.bizo.hive.serde.csv.CSVSerde' with serdeproperties ("separatorChar" = ",") LOCATION '/user/hive/warehouse/dwb/ot1/';
Then we have to add serde in the hive-site.xml as below mentioned, so that we can query table from hive-shell.
<property><name>hive.aux.jars.path</name><value>hdfs://master-ip:54310/hive-serde/csv-serde-1.1.2.jar</value></property>
In hive we can use jar file to retrieve the data which is enclosed in double quotes.
For your problem please refer this link :
http://stackoverflow.com/questions/21156071/why-dont-hive-have-fields-enclosed-by-like-in-mysql

How can I execute a select into outfile query using Hibernate?

I am trying to execute de following code in Hibernate to create a .csv file from a mySQL database. :
String sql =
"SELECT * INTO OUTFILE 'table.csv' FIELDS TERMINATED BY ','" +
" OPTIONALLY ENCLOSED BY '\"' LINES TERMINATED BY '\\n' " +
"FROM match INNER JOIN totala ON match_code= match";
The .csv file is created correctly but then I get the following error:
Exception in thread "main" org.hibernate.exception.GenericJDBCException: could not execute query
at org.hibernate.exception.SQLStateConverter.handledNonSpecificException(SQLStateConverter.java:140)
at org.hibernate.exception.SQLStateConverter.convert(SQLStateConverter.java:128)
at org.hibernate.exception.JDBCExceptionHelper.convert(JDBCExceptionHelper.java:66)
at org.hibernate.loader.Loader.doList(Loader.java:2536)
at org.hibernate.loader.Loader.listIgnoreQueryCache(Loader.java:2276)
at org.hibernate.loader.Loader.list(Loader.java:2271)
at org.hibernate.loader.custom.CustomLoader.list(CustomLoader.java:316)
at org.hibernate.impl.SessionImpl.listCustomQuery(SessionImpl.java:1842)
at org.hibernate.impl.AbstractSessionImpl.list(AbstractSessionImpl.java:165)
at org.hibernate.impl.SQLQueryImpl.list(SQLQueryImpl.java:157)
at datuak.DatuBasea.sortu_csv_fitxategia(DatuBasea.java:101)
at sortzailea.csv_sortzailea.Csv.Sortu(Csv.java:8)
at html_erauzlea.Nagusia.main(Nagusia.java:34)
Caused by: java.sql.SQLException: ResultSet is from UPDATE. No Data
at com.mysql.jdbc.ResultSet.next(ResultSet.java:2491)
at org.hibernate.loader.Loader.doQuery(Loader.java:825)
at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:274)
at org.hibernate.loader.Loader.doList(Loader.java:2533)
... 9 more
I think that the problem is that I am executing a select query that doesn´t return any value, it only creates a .csv file and the method is expecting to return a ResultSet.
So, could someone give some suggestions?
Thanks in advance