I'm running Pig:
example$ pig --version
Apache Pig version 0.8.1-cdh3u1 (rexported)
compiled Jul 18 2011, 08:29:40
on a very simple dataset:
example$ hadoop fs -cat /user/pavel/trivial.log
1 one
2 two
3 three
I'm trying to save the grouped bag as JSON using the following script:
REGISTER ./pig.jar;
A = LOAD 'trivial.log' USING PigStorage('\t') AS (mynum: int, mynumstr: chararray);
B = GROUP A BY mynum;
DUMP B;
STORE B INTO 'trivial_json.out' USING JsonStorage();
and I get an error:
Backend error message
---------------------
java.lang.NullPointerException
at org.apache.pig.ResourceSchema.<init>(ResourceSchema.java:239)
at org.apache.pig.builtin.JsonStorage.prepareToWrite(JsonStorage.java:129)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.<init>(PigOutputFormat.java:124)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.getRecordWriter(PigOutputFormat.java:85)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:559)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:414)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
Pig Stack Trace
---------------
ERROR 2997: Unable to recreate exception from backed error: java.lang.NullPointerException
org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backed error: java.lang.NullPointerException
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:221)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:154)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:337)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:382)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1209)
at org.apache.pig.PigServer.execute(PigServer.java:1201)
at org.apache.pig.PigServer.access$100(PigServer.java:129)
at org.apache.pig.PigServer$Graph.execute(PigServer.java:1528)
at org.apache.pig.PigServer.executeBatchEx(PigServer.java:373)
at org.apache.pig.PigServer.executeBatch(PigServer.java:340)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:115)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:172)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
at org.apache.pig.Main.run(Main.java:396)
at org.apache.pig.Main.main(Main.java:107)
================================================================================
I'm not strong enough in Java to debug this quickly; can somebody suggest what might be going on?
Thanks much!
-Pavel
I got some questions about this by email, so I thought the answer might be useful for others:
It turned out that the JsonStorage class wasn't included in my Pig installation. In fact, it's not in a stable release yet; you won't find it in 0.9.2. But if you get the latest trunk
http://svn.apache.org/repos/asf/pig/trunk/
then trunk/test/org/apache/pig/test/TestJsonLoaderStorage.java shows how it works. If you are not averse to unreleased versions, you could try it. If you go this route, you should also take a look at Avro (a binary format with JSON metadata); it's not officially released either.
I was trying to run a streaming Python job and wanted to avoid manually specifying the schema. I ended up passing Pig bags to the script and parsing them by hand, roughly as sketched below.
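For reference, here is a minimal sketch of that by-hand parsing, assuming PigStorage's textual output (tab-separated fields, bags rendered as {(...),(...)}) and no nested bags or commas inside values:

import sys

def parse_bag(field):
    # Parse a Pig bag string like '{(1,one),(1,uno)}' into a list of tuples.
    inner = field.strip()[1:-1]          # drop the surrounding { }
    tuples = []
    for chunk in inner.split('),('):     # naive split: breaks on nested bags
        tuples.append(tuple(chunk.strip('()').split(',')))
    return tuples

for line in sys.stdin:
    key, bag = line.rstrip('\n').split('\t')
    print(key, parse_bag(bag))           # e.g. 1 [('1', 'one')]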
Hope this helps.
-Pavel
Related
I am having a problem executing a script in Apache Pig. I have 3 files, namely movies.csv, ratings.csv, and tags.csv. First I want to load movies.csv, then load ratings.csv, and join both tables, but I am encountering an error while loading the files. My code is as follows:
register 'piggybank-0.15.0.jar';
DEFINE CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
part1 = LOAD '/home/cloudera/ml-20m/movies' as (movieId: chararray, title: chararray, genre: chararray);
cat part1;
When I run the cat command, I get this error:
Pig Stack Trace
ERROR 2997: Encountered IOException. Directory part1 does not exist.
java.io.IOException: Directory part1 does not exist.
at org.apache.pig.tools.grunt.GruntParser.processCat(GruntParser.java:677)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:233)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:547)
at org.apache.pig.Main.main(Main.java:158)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
But the file is at the specified location, so I don't know why Pig is not able to recognize it. I have also tried placing the input file in HDFS and loading it from there, but the error is the same. Can anyone please help me? Thanks in advance.
part1 is not a file but a relation. When you use the LOAD command in Pig, you are instructing it to load the contents of the file into a relation. You cannot use cat on a relation, since cat reads the contents of files.
To display the content of part1, use
DUMP part1;
If you insist on using cat, then specify the full path to the file:
cat /home/cloudera/ml-20m/movies;
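As an aside, the DEFINE'd CSVLoader above is never actually used in the LOAD statement; if the intent is to parse the CSV with it, a sketch would be (assuming the file on disk is movies.csv, as named in the question):

part1 = LOAD '/home/cloudera/ml-20m/movies.csv' USING CSVLoader() AS (movieId: chararray, title: chararray, genre: chararray);
DUMP part1;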
I'm using Hive 0.13.0, and I was expecting it to work with table and column names containing non-alphanumeric characters, as stated in the documentation, but it does not.
I've been able to create a table with column names containing dots, for instance:
hive> create external table frb_test (recvTime string, fiwareServicePath string, entityId string, entityType string, `ORL.SOU.DH.SSTA10.T.HVAC.HeatLoad` string, `ORL.SOU.DH.SSTA10.T.HVAC.HeatLoad_md` array<struct<name:string,type:string,value:string>>) row format serde 'org.openx.data.jsonserde.JsonSerDe' location '/user/frb/test';
OK
Time taken: 0.286 seconds
As you can see, I'm using https://github.com/rcongiu/Hive-JSON-Serde as the JSON serde. Here is the content of hdfs:///user/frb/test:
$ hadoop fs -cat /user/frb/test/deleteme
{"recvTime":"2016-02-09T18:03:48.986Z","fiwareServicePath":"orl_sou","entityId":"ORL.SOU.DH.SSTA10","entityType":"ETS", "ORL.SOU.DH.SSTA10.T.HVAC.HeatLoad":"10.673299789428711", "ORL.SOU.DH.SSTA10.T.HVAC.HeatLoad_md":[{"name":"dofTimestamp","type":"ms","value":"2016-02-08T23:00:00.000Z"},{"name":"tag","type":"text","value":"ORL.SOU.DH.SSTA10.T.HVAC.HeatLoad"},{"name":"description","type":"text","value":"Electrical heat load"},{"name":"quality","type":"0:GOOD, +0:ERROR","value":"10813440"},{"name":"max","type":"max","value":"null"},{"name":"min","type":"min","value":"null"},{"name":"lcl","type":"lcl","value":"null"},{"name":"ucl","type":"ucl","value":"null"}]}
I'm not able to select the orl.sou.dh.ssta10.t.hvac.heatload column:
hive> add jar /home/frb/json-serde-1.3.7-jar-with-dependencies.jar;
hive> select `orl.sou.dh.ssta10.t.hvac.heatload` from frb_test;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1455032234756_0008, Tracking URL = http://namenode.fiware.org:8088/proxy/application_1455032234756_0008/
Kill Command = /opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p0.10/lib/hadoop/bin/hadoop job -kill job_1455032234756_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-02-11 17:05:56,150 Stage-1 map = 0%, reduce = 0%
2016-02-11 17:06:23,653 Stage-1 map = 100%, reduce = 0%
Ended Job = job_1455032234756_0008 with errors
Error during job, obtaining debugging information...
Examining task ID: task_1455032234756_0008_m_000000 (and more) from job job_1455032234756_0008
Task with the most failures(4):
-----
Task ID:
task_1455032234756_0008_m_000000
URL:
http://namenode.fiware.org:8088/taskdetails.jsp?jobid=job_1455032234756_0008&tipid=task_1455032234756_0008_m_000000
-----
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:446)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1642)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
... 9 more
Caused by: java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38)
... 14 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106)
... 17 more
Caused by: java.lang.RuntimeException: Map operator initialization failed
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:157)
... 22 more
Caused by: java.lang.RuntimeException: cannot find field orl from [0:recvtime, 1:fiwareservicepath, 2:entityid, 3:entitytype, 4:orl.sou.dh.ssta10.t.hvac.heatload, 5:orl.sou.dh.ssta10.t.hvac.heatload_md]
at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getStandardStructFieldRef(ObjectInspectorUtils.java:415)
at org.apache.hadoop.hive.serde2.objectinspector.StandardStructObjectInspector.getStructFieldRef(StandardStructObjectInspector.java:150)
at org.apache.hadoop.hive.ql.exec.ExprNodeColumnEvaluator.initialize(ExprNodeColumnEvaluator.java:79)
at org.apache.hadoop.hive.ql.exec.Operator.initEvaluators(Operator.java:934)
at org.apache.hadoop.hive.ql.exec.Operator.initEvaluatorsAndReturnStruct(Operator.java:960)
at org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:65)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:376)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:460)
at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:416)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.initializeOp(TableScanOperator.java:189)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:376)
at org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:424)
at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:376)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.configure(ExecMapper.java:136)
... 22 more
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
I've seen that the Hive property governing how Hive treats non-alphanumeric characters is hive.support.quoted.identifiers, which can be set to none (Hive then behaves as in version 0.12.0) or column, which I guess is the default in 0.13.0. Nevertheless, I've tried setting it, with no results:
hive> set hive.support.quoted.identifiers=column;
hive> select `orl.sou.dh.ssta10.t.hvac.heatload` from frb_test;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1455032234756_0009, Tracking URL = http://namenode.fiware.org:8088/proxy/application_1455032234756_0009/
Kill Command = /opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p0.10/lib/hadoop/bin/hadoop job -kill job_1455032234756_0009
...
Caused by: java.lang.RuntimeException: cannot find field orl from [0:recvtime, 1:fiwareservicepath, 2:entityid, 3:entitytype, 4:orl.sou.dh.ssta10.t.hvac.heatload, 5:orl.sou.dh.ssta10.t.hvac.heatload_md]
...
FAILED: Execution Error, return code 2 from
org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
I would bet that the HQL parser treats the dot character as a way to access the inner fields of a STRUCT, and nothing else.
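For instance, with a hypothetical table t, just to show the clash:

create table t (s struct<heatload:string>);
select s.heatload from t;    -- dot = struct field access
select `s.heatload` from t;  -- quoted identifier containing a dot: same surface syntax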
And I would bet that among all the people involved in the support of "quoted identifiers" in Hive, no one ever thought of a test case with a "dot" in a column name. After all, who on earth would be crazy enough to use a "dot" in a column name??
OK, maybe. Then who would be crazy enough to define a STRUCT column with a "dot" in its name, out of perversity, just to add an extra "dot" in the mix??
OK, let's assume this might happen. Then would that hypothetical person push the perversity even further, by insisting on using the first version of Hive that ever supported "quoted identifiers"? With no battle-testing of that feature in actual production systems? And no chance to benefit from later bug fixes??
My 2 cents: since you clearly have no control over that junk JSON you receive, just run a fast sed on it (or a slow Java regular expression, if you wish) to replace these dotted monstrosities with sane column names, and live happily ever after.
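A sketch of that sed pass, assuming the two keys from the sample above and mapping them to made-up names heatload and heatload_md (anchoring on the trailing colon so the tag value containing the same string is left alone):

hadoop fs -cat /user/frb/test/deleteme \
  | sed -e 's/"ORL\.SOU\.DH\.SSTA10\.T\.HVAC\.HeatLoad_md":/"heatload_md":/g' \
        -e 's/"ORL\.SOU\.DH\.SSTA10\.T\.HVAC\.HeatLoad":/"heatload":/g' \
  > deleteme.fixed

The table would then be recreated with heatload and heatload_md as column names.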
I wanted to try out the encrypted BigQuery client for Google BigQuery, and I've been having some trouble.
I'm following the instructions outlined in this PDF:
https://docs.google.com/file/d/0B-WB8hYCrhZ6cmxfWFpBci1lOVE/edit
I get to the point where I'm running this command:
ebq load --master_key_filename="key_file" testdataset.cars cars.csv cars.schema
and I'm getting an error message that ends with:
raise ValueError("No JSON object could be decoded")
I've tried a few different formats for my .csv and .schema files, but none have worked. Here are my latest versions.
cars.schema:
[{"name": "Year", "type": "integer", "mode": "required", "encrypt": "none"}
{"name": "Make", "type": "string", "mode": "required", "encrypt": "pseudonym"}
{"name": "Model", "type": "string", "mode": "required", "encrypt": "probabilistic_searchwords"}
{"name": "Description", "type": "string", "mode": "nullable", "encrypt": "searchwords"}
{"name": "Website", "type": "string", "mode": "nullable", "encrypt": "searchwords","searchwords_separator": "/"}
{"name": "Price", "type": "float", "mode": "required", "encrypt": "probabilistic"}
{"name": "Invoice_Price", "type": "integer", "mode": "required", "encrypt": "homomorphic"}
{"name": "Holdback_Percentage", "type": "float", "mode": "required", "encrypt":"homomorphic"}]
cars.csv:
1997,Ford,E350, "ac\xc4a\x87, abs, moon","www.ford.com",3000.00,2000,1.2
1999,Chevy,"Venture ""Extended Edition""","","www.cheverolet.com",4900.00,3800,2.3
1999,Chevy,"Venture ""Extended Edition, Very Large""","","www.chevrolet.com",5000.00,4300,1.9
1996,Jeep,Grand Cherokee,"MUST SELL! air, moon roof,loaded","www.chrysler.com/jeep/grandcherokee",4799.00,3950,2.4
I believe the issue may be that you need to move the --master_key_filename argument before the load argument. If that doesn't work, can you send the output of adding --apilog=- as the first argument?
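That is, something like this (a sketch based on the command above):

ebq --master_key_filename="key_file" load testdataset.cars cars.csv cars.schema

and, to capture the API log:

ebq --apilog=- --master_key_filename="key_file" load testdataset.cars cars.csv cars.schema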
Also, there is an example script showing how to run ebq here:
https://code.google.com/p/bigquery-e2e/source/browse/#git%2Fsamples%2Fch13
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve org.apache.pig.piggybank.storage.JsonLoader using imports...
When I look at pig-0.12.1/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/, there is indeed no JsonLoader.java.
Can it be added, or is there a 'correct' piggybank I can download?
TIA!!
I'm using:
Apache Pig version 0.12.2-SNAPSHOT (r: unknown)
compiled Apr 15 2014, 20:35:46
It looks like the correct location is org.apache.pig.builtin.JsonLoader. You shouldn't need the package path at all though.
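For example, a sketch along these lines (the file name and schema here are made up for illustration):

A = LOAD 'data.json' USING JsonLoader('movieId: chararray, title: chararray');
DUMP A;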
We're running into a situation where certain documents that exist in our MongoDB can't be queried without triggering:
db.collection.find({"_id":ObjectId("50d393be70a580280b117ea5")})
Wed Jan 2 12:30:44 Assertion: 10320:BSONElement: bad type 65
0x6073f1 0x5d1aa9 0x4b0d98 0x5c17a6 0x6b3f35 0x6b6a2c 0x69be0a 0x6aa13f 0x668e46 0x668ec2 0x66a2ce 0x5cbcc4 0x4a4a14 0x4a67e6 0x7f2223434c4d 0x49f669
mongo(_ZN5mongo15printStackTraceERSo+0x21) [0x6073f1]
mongo(_ZN5mongo11msgassertedEiPKc+0x99) [0x5d1aa9]
mongo(_ZNK5mongo11BSONElement4sizeEv+0x1d8) [0x4b0d98]
mongo(_ZN5mongo16resolveBSONFieldEP9JSContextP8JSObjectljPS3_+0x146) [0x5c17a6]
mongo(js_LookupPropertyWithFlags+0x3f5) [0x6b3f35]
mongo(js_GetProperty+0x7c) [0x6b6a2c]
mongo(js_Interpret+0x10ea) [0x69be0a]
mongo(js_Execute+0x36f) [0x6aa13f]
mongo(JS_EvaluateUCScriptForPrincipals+0x66) [0x668e46]
mongo(JS_EvaluateUCScript+0x22) [0x668ec2]
mongo(JS_EvaluateScript+0x6e) [0x66a2ce]
mongo(_ZN5mongo7SMScope4execERKNS_10StringDataERKSsbbbi+0x144) [0x5cbcc4]
mongo(_Z5_mainiPPc+0x26c4) [0x4a4a14]
mongo(main+0x26) [0x4a67e6]
/lib/libc.so.6(__libc_start_main+0xfd) [0x7f2223434c4d]
mongo(__gxx_personality_v0+0x2a1) [0x49f669]
I can run a mongodump of that particular record, but then when I convert the BSON to JSON I get:
$ bsondump server1_collection.bson > server1_collection.json
terminate called after throwing an instance of 'std::length_error'
what(): basic_string::_S_create
Aborted
Looks like there might be an issue with UTF-8 characters and/or base64-encoded strings causing invalid BSON to be stored:
- https://jira.mongodb.org/browse/SERVER-7769
Sounds like, moving forward, I can start mongod with --objcheck to ensure no invalid data can be inserted (though I'm unsure of the performance penalties).
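Presumably something like this (the dbpath is just a placeholder):

mongod --objcheck --dbpath /data/db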
I'm not sure how I can easily scrub through my old data and remove these invalid records, though.
Here is the related question and answer. You should probably repair the database corruption.
You can first validate the data to check whether there are any errors (using the mongo shell):
db.getSiblingDB("your_base").oplog.rs.validate(true)
Then try to run: db.repairDatabase()
Note that it affects only the current database.
Or start the server with mongod --repair.
A null-termination character in the data could be the problem too (see this topic).
Sometimes it's not possible to repair the database; in that case, one would recover the data from a replica or from an earlier dump.