Processing JSON using Java MapReduce

I am new to Hadoop MapReduce. I have an input text file where the data is stored as follows; here are just a few tuples (data.txt):
{"author":"Sharīf Qāsim","book":"al- Rabīʻ al-manshūd"}
{"author":"Nāṣir Nimrī","book":"Adīb ʻAbbāsī"}
{"author":"Muẓaffar ʻAbd al-Majīd Kammūnah","book":"Asmāʼ Allāh al-ḥusná al-wāridah fī muḥkam kitābih"}
{"author":"Ḥasan Muṣṭafá Aḥmad","book":"al- Jabhah al-sharqīyah wa-maʻārikuhā fī ḥarb Ramaḍān"}
{"author":"Rafīqah Salīm Ḥammūd","book":"Taʻlīm fī al-Baḥrayn"}
This is the Java file in which I am supposed to write my code (CombineBooks.java):
package org.hwone;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.GenericOptionsParser;
//TODO import necessary components
/*
* Modify this file to combine books from the same author into a
* single JSON object.
* i.e. {"author": "Tobias Wells", "books": [{"book":"A die in the country"},{"book": "Dinky died"}]}
* Be aware that this may run on any number of nodes!
*
*/
public class CombineBooks {
//TODO define variables and implement necessary components
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args)
.getRemainingArgs();
if (otherArgs.length != 2) {
System.err.println("Usage: CombineBooks <in> <out>");
System.exit(2);
}
//TODO implement CombineBooks
Job job = new Job(conf, "CombineBooks");
//TODO implement CombineBooks
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
My task is to create a Hadoop program in “CombineBooks.java”,
returned in the “question-2” directory. The program should do
the following: given the input author-book tuples, the MapReduce
program should produce a JSON object which contains all the
books from the same author in a JSON array, i.e.
{"author": "Tobias Wells", "books":[{"book":"A die in the country"},{"book": "Dinky died"}]}
Any idea how it can be done?

First, the JSON classes you are trying to work with are not available to you. To solve this:
Go here and download as zip: https://github.com/douglascrockford/JSON-java
Extract to your sources folder in subdirectory org/json/*
Next, the first line of your code creates a package "org.json", which is incorrect; you should create a separate package, for instance "my.books".
Third, a combiner is useless here: the reducer emits a different key type (NullWritable) than the mapper, so its logic cannot be reused as a combine step.
Here's the code I ended up with; it works and solves your problem:
package my.books;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.json.*;
public class CombineBooks {
public static class Map extends Mapper<LongWritable, Text, Text, Text>{
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException{
String author;
String book;
String line = value.toString();
String[] tuple = line.split("\\n");
try{
for(int i=0;i<tuple.length; i++){
JSONObject obj = new JSONObject(tuple[i]);
author = obj.getString("author");
book = obj.getString("book");
context.write(new Text(author), new Text(book));
}
}catch(JSONException e){
e.printStackTrace();
}
}
}
public static class Reduce extends Reducer<Text,Text,NullWritable,Text>{
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException{
try{
JSONObject obj = new JSONObject();
JSONArray ja = new JSONArray();
for(Text val : values){
JSONObject jo = new JSONObject().put("book", val.toString());
ja.put(jo);
}
obj.put("books", ja);
obj.put("author", key.toString());
context.write(NullWritable.get(), new Text(obj.toString()));
}catch(JSONException e){
e.printStackTrace();
}
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
if (args.length != 2) {
System.err.println("Usage: CombineBooks <in> <out>");
System.exit(2);
}
Job job = new Job(conf, "CombineBooks");
job.setJarByClass(CombineBooks.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Here's the folder structure of my project:
src
src/my
src/my/books
src/my/books/CombineBooks.java
src/org
src/org/json
src/org/json/zip
src/org/json/zip/BitReader.java
...
src/org/json/zip/None.java
src/org/json/JSONStringer.java
src/org/json/JSONML.java
...
src/org/json/JSONException.java
Here's the input
[localhost:CombineBooks]$ hdfs dfs -cat /example.txt
{"author":"author1", "book":"book1"}
{"author":"author1", "book":"book2"}
{"author":"author1", "book":"book3"}
{"author":"author2", "book":"book4"}
{"author":"author2", "book":"book5"}
{"author":"author3", "book":"book6"}
The command to run:
hadoop jar ./bookparse.jar my.books.CombineBooks /example.txt /test_output
Here's the output:
[pivhdsne:CombineBooks]$ hdfs dfs -cat /test_output/part-r-00000
{"books":[{"book":"book3"},{"book":"book2"},{"book":"book1"}],"author":"author1"}
{"books":[{"book":"book5"},{"book":"book4"}],"author":"author2"}
{"books":[{"book":"book6"}],"author":"author3"}
You can use one of three options to put the org.json.* classes onto your cluster:
Pack the org.json.* classes into your jar file (this can easily be done from a GUI IDE). This is the option I used in my answer.
Put the jar file containing the org.json.* classes on each of the cluster nodes, in one of the CLASSPATH directories (see yarn.application.classpath).
Put the jar file containing org.json.* into HDFS (hdfs dfs -put <org.json jar> <hdfs path>) and use the job.addFileToClassPath call for this jar file to be available to all of the tasks executing your job on the cluster. In my answer you would add job.addFileToClassPath(new Path("<jar_file_on_hdfs_location>")); to the main, as sketched below.
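For illustration, a minimal sketch of how that call could look inside the driver's main; the HDFS location of the org.json jar is a hypothetical placeholder you would replace with your own path:
Job job = new Job(conf, "CombineBooks");
job.setJarByClass(CombineBooks.class);
// make the org.json jar (previously uploaded with hdfs dfs -put) available on every task's classpath;
// "/libs/org.json.jar" is a hypothetical HDFS location
job.addFileToClassPath(new Path("/libs/org.json.jar"));
// ... the rest of the job configuration stays exactly as in the code above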

For splittable multi-line JSON, refer to:
https://github.com/alexholmes/json-mapreduce

How can we read JSON file

1. Utility to read a JSON file from a file server
2. Utility should run at a scheduled time, let's say 6 AM
3. Error message if the JSON file is not properly formatted
4. Error message if a Category is missing and an ITEM needs that Category to be saved
5. Entity mapping as per the given relationship in the data model
6. Documentation for the API, preferably using a tool
7. JUnit test cases using Mockito
8. Use either MySQL or Oracle for API development
9. JSON validation for not null and value range
Thanks
I tried the code below to cover the first question of this problem, but I am not sure what to do to answer the remaining questions:
Create a class named "JsonRead" in Eclipse. In it we will use JSONParser to convert the JSON string in the file to a JSONObject.
In order to use the JSON parser, make sure that your string is in JSON format.
package logicProgramming;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.Iterator;
import org.json.simple.JSONArray;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;
public class JsonRead {
public static void main(String[] args) {
JSONParser parser = new JSONParser();
//JsonParser to convert JSON string into Json Object
try {
Object obj = parser.parse(new FileReader("g:\\newfile.json"));
//parsing the JSON string inside the file that we created earlier.
JSONObject jsonObject = (JSONObject) obj;
System.out.println(jsonObject);
//Json string has been converted into JSONObject
String name = (String) jsonObject.get("name");
System.out.println(name);
String department = (String) jsonObject.get("department");
System.out.println(department);
String branch = (String) jsonObject.get("branch");
System.out.println(branch);
long year = (long) jsonObject.get("year");
System.out.println(year);
//Displaying values from JSON OBject by using Keys
JSONArray remarks = (JSONArray) jsonObject.get("remarks");
//casting the value of "remarks" to a JSONArray, as remarks was stored as an array.
Iterator<String> iterator = remarks.iterator();
//Iterator is used to access each element in the list
//loop will continue as long as there are elements in the array.
while (iterator.hasNext()) {
System.out.println(iterator.next());
//accessing each element using the next() method.
}
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} catch (ParseException e) {
e.printStackTrace();
}
}
}
Thanks

AWS EMR File Already exists: Hadoop Job reading and writing to S3

I have a Hadoop job running in EMR, and I am passing an S3 path as input and output to this job.
When I run locally everything works fine (as there is a single node).
However, when I run in EMR with a 5-node cluster I run into a "File already exists" IOException.
The output path has a timestamp in it, so the output path doesn't already exist in S3.
Error: java.io.IOException: File already exists:s3://<mybucket_name>/8_9_0a4574ca-96d0-47c8-8eb8-4deb82944d4b/customer/RawFile12.txt/1523583593585/TOKENIZED/part-m-00000
I have a very simple Hadoop app (primarily my mapper) which reads each line from a file and converts it (using an existing library).
I am not sure why each node is trying to write with the same file name.
Here is the mapper:
public static class TokenizeMapper extends Mapper<Object,Text,Text,Text>{
public void map(Object key, Text value, Context context) throws IOException,InterruptedException{
//TODO: Invoke Core Engine to transform the Data
Encryption encryption = new Encryption();
String tokenizedVal = encryption.apply(value.toString());
// both key and value must be wrapped in Text to match the declared output types
context.write(new Text(tokenizedVal), new Text("1"));
}
}
And my reducer:
public static class TokenizeReducer extends Reducer<Text,Text,Text,Text> {
public void reduce(Text text, Iterable<Text> lines, Context context) throws IOException,InterruptedException{
Iterator<Text> iterator = lines.iterator();
int counter = 0;
while(iterator.hasNext()){
// advance the iterator, otherwise hasNext() never changes and this loops forever
iterator.next();
counter++;
}
Text output = new Text("" + counter);
context.write(text, output);
}
}
And my main class
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
long startTime = System.currentTimeMillis();
try {
Configuration config = new Configuration();
String[] additionalArgs = new GenericOptionsParser(config, args).getRemainingArgs();
if (additionalArgs.length != 2) {
System.err.println("Usage: Tokenizer Input_File and Output_File ");
System.exit(2);
}
Job job = Job.getInstance(config, "Raw File Tokenizer");
job.setJarByClass(Tokenizer.class);
job.setMapperClass(TokenizeMapper.class);
job.setReducerClass(TokenizeReducer.class);
job.setNumReduceTasks(0);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(additionalArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(additionalArgs[1]));
boolean status = job.waitForCompletion(true);
if (status) {
//System.exit(0);
System.out.println("Completed Job Successfully");
} else {
System.out.println("Job did not Succeed");
}
}
catch(Exception e){
e.printStackTrace();
}
finally{
System.out.println("Total Time for processing =["+(System.currentTimeMillis()-startTime)+"]");
}
}
I am passing the arguments when I launch the cluster as:
s3://<mybucket>/8_9_0a4574ca-96d0-47c8-8eb8-4deb82944d4b/customer/RawFile12.txt
s3://<mybucket>/8_9_0a4574ca-96d0-47c8-8eb8-4deb82944d4b/customer/RawFile12.txt/1523583593585/TOKENIZED
Appreciate any inputs.
Thanks
In the driver code you have set the number of reduce tasks to 0, so the reducer code is not needed; the mapper output is written directly.
In case you need to clear the output directory before the job launches, you can use this snippet to delete it if it exists (a fuller sketch follows the snippet):
FileSystem fileSystem = FileSystem.get(<hadoop config object>);
if(fileSystem.exists(new Path(<pathTocheck>)))
{
fileSystem.delete(new Path(<pathTocheck>), true);
}
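For example, a minimal sketch of how that check could be wired into your driver before the job is submitted, reusing the config, additionalArgs and job variables from your main; resolving the FileSystem from the path itself means it also works for s3:// URIs:
Path outputPath = new Path(additionalArgs[1]);
FileSystem fileSystem = outputPath.getFileSystem(config);
// recursively delete the output directory if a previous run left it behind
if (fileSystem.exists(outputPath)) {
fileSystem.delete(outputPath, true);
}
FileOutputFormat.setOutputPath(job, outputPath);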

Loading json into my unit test from a text file

I am working in AEM, trying to create txt files with JSON output so that I can load them into my unit test as strings and test my model / model processors. So far I have this...
public String readFile(String path, Charset encoding) throws IOException
{
byte[] encoded = Files.readAllBytes(Paths.get(path));
return new String(encoded, encoding);
}
private String sampleInput = readFile("/test/resources/map/sample-input.txt", Charset.forName("UTF-8"));
I need sampleInput to take the JSON that is in 'sample-input.txt' and convert it to a string. I am also running into issues with the Charset encoding.
I think the easiest way to manage JSON documents you use for unit testing is by keeping them organized in the classpath. Guava provides a neat wrapper for loading classpath resources.
import com.google.common.base.Charsets;
import com.google.common.io.Resources;
import java.io.IOException;
import java.net.URL;
public class TestJsonDocumentLoader {
private final Class<?> clazz;
public TestJsonDocumentLoader(Class<?> clazz) {
this.clazz = clazz;
}
public String loadTestJson(String fileName) {
URL url = Resources.getResource(clazz, fileName);
try {
String data = Resources.toString(url, Charsets.UTF_8);
return data;
} catch (IOException e) {
throw new RuntimeException("Couldn't load a JSON file.", e);
}
}
}
This can then be used to load arbitrary JSON files placed in the same package as the test class. It is assumed that the files are UTF-8 encoded. I suggest keeping all sources encoded that way, regardless of the OS your team is using. It saves you a lot of trouble with version control.
Let's say you have MyTest in src/test/java/com/example/mytestsuite, then you could place a file data.json in src/test/resources/com/example/mytestsuite and load it by calling
TestJsonDocumentLoader loader = new TestJsonDocumentLoader(MyTest.class);
String jsonData = loader.loadTestJson("data.json");
String someOtherExample = loader.loadTestJson("other.json");
Actually, this could be used for all sorts of text files.
You could also use Jackson's ObjectMapper as an alternative:
public class JsonResourceObjectMapper<T> {
private Class<T> model;
public JsonResourceObjectMapper(Class<T> model) {
this.model = model;
}
public T loadTestJson(String fileName) throws IOException{
ClassLoader classLoader = this.getClass().getClassLoader();
InputStream inputStream= classLoader.getResourceAsStream(fileName);
return new ObjectMapper().readValue(inputStream, this.model);
}
}
And then set up a fixture in the test, passing a .class:
private JsonClass json;
@Before
public void setUp() throws IOException {
JsonResourceObjectMapper<JsonClass> mapper = new JsonResourceObjectMapper<>(JsonClass.class);
json = mapper.loadTestJson("json/testJson.json");
}
Note that the testJson.json file is in the resources/json folder, as @toniedzwiedz mentioned.
So then you could use the json model as:
@Test
public void testJsonNameProperty(){
//act
String name = json.getName();
// assert
assertEquals("testName", name);
}
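For completeness, a minimal sketch of what the JsonClass model used above might look like; the class and its single name property are assumptions based on the test, since Jackson only needs a default constructor plus getters/setters matching the JSON keys:
public class JsonClass {
// hypothetical model; the field name must match the "name" key in testJson.json
private String name;
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
}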

Camel bindy marshal to file creates multiple header row

I have the following Camel route:
from(inputDirectory)
.unmarshal(jaxb)
.process(jaxb2CSVDataProcessor)
.split(body()) //because there is a list of CSVRecords
.marshal(bindyCsvDataFormat)
.to(outputDirectory); //appending to existing file using "?autoCreate=true&fileExist=Append"
For my CSV model class I am using annotations:
@CsvRecord(separator = ",", generateHeaderColumns = true)
...
and for properties
@DataField(pos = 0)
...
My problem is that the headers are appended every time a new csv record is appended.
Is there a non-dirty way to control this? Am I missing anything here?
I made a workaround which works quite nicely: I create the header by querying the column names of the @DataField annotations. This happens only once, the first time the file is written. I wrote down the whole solution here (a rough sketch of the idea follows the link):
How to generate a Flat file with header and footer using Camel Bindy
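As an illustration only (not the linked solution), here is a minimal sketch of building such a header line by reflecting over the @DataField annotations of a bindy model class; using the Java field name as the column label is an assumption, the linked solution reads the configured column names instead:
import java.lang.reflect.Field;
import java.util.TreeMap;
import org.apache.camel.dataformat.bindy.annotation.DataField;
public class CsvHeaderBuilder {
// builds a comma-separated header from the @DataField positions of a bindy model class
public static String buildHeader(Class<?> modelClass) {
TreeMap<Integer, String> columns = new TreeMap<>();
for (Field field : modelClass.getDeclaredFields()) {
DataField dataField = field.getAnnotation(DataField.class);
if (dataField != null) {
// order columns by their declared position
columns.put(dataField.pos(), field.getName());
}
}
return String.join(",", columns.values());
}
}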
I ended up adding a processor that checks whether the CSV file already exists, just before the "to" clause. In it I manipulate the byte array and remove the headers; a rough sketch is below.
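For illustration, a minimal sketch of such a processor, assuming the target file path is passed in (the targetFile argument is hypothetical) and that the header is the first line of the marshalled byte array; it would sit in the route right before the .to(outputDirectory) step:
import java.io.File;
import java.nio.charset.StandardCharsets;
import org.apache.camel.Exchange;
import org.apache.camel.Processor;
public class StripHeaderIfFileExists implements Processor {
private final File targetFile; // hypothetical: the CSV file the route appends to
public StripHeaderIfFileExists(File targetFile) {
this.targetFile = targetFile;
}
@Override
public void process(Exchange exchange) throws Exception {
if (!targetFile.exists()) {
return; // first write: keep the header row
}
byte[] body = exchange.getIn().getBody(byte[].class);
String csv = new String(body, StandardCharsets.UTF_8);
int firstLineBreak = csv.indexOf('\n');
if (firstLineBreak >= 0) {
// drop everything up to and including the first line break (the header row)
exchange.getIn().setBody(csv.substring(firstLineBreak + 1).getBytes(StandardCharsets.UTF_8));
}
}
}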
Hope this helps anyone else. I needed to do something similar where, after my first split message, I wanted to suppress the header output. Here is a complete class (FieldUtils is part of the Apache Commons Lang library):
package com.routes;
import java.io.OutputStream;
import org.apache.camel.Exchange;
import org.apache.camel.dataformat.bindy.BindyAbstractFactory;
import org.apache.camel.dataformat.bindy.BindyCsvFactory;
import org.apache.camel.dataformat.bindy.BindyFactory;
import org.apache.camel.dataformat.bindy.FormatFactory;
import org.apache.camel.dataformat.bindy.csv.BindyCsvDataFormat;
import org.apache.commons.lang3.reflect.FieldUtils;
public class StreamingBindyCsvDataFormat extends BindyCsvDataFormat {
public StreamingBindyCsvDataFormat(Class<?> type) {
super(type);
}
@Override
public void marshal(Exchange exchange, Object body, OutputStream outputStream) throws Exception {
final StreamingBindyModelFactory factory = (StreamingBindyModelFactory) super.getFactory();
final int splitIndex = exchange.getProperty(Exchange.SPLIT_INDEX, -1, int.class);
final boolean splitComplete = exchange.getProperty(Exchange.SPLIT_COMPLETE, false, boolean.class);
super.marshal(exchange, body, outputStream);
if (splitIndex == 0) {
factory.setGenerateHeaderColumnNames(false); // turn off header generate after first exchange
} else if(splitComplete) {
factory.setGenerateHeaderColumnNames(true); // turn on header generate when split complete
}
}
@Override
protected BindyAbstractFactory createModelFactory(FormatFactory formatFactory) throws Exception {
BindyCsvFactory bindyCsvFactory = new StreamingBindyModelFactory(getClassType());
bindyCsvFactory.setFormatFactory(formatFactory);
return bindyCsvFactory;
}
public class StreamingBindyModelFactory extends BindyCsvFactory implements BindyFactory {
public StreamingBindyModelFactory(Class<?> type) throws Exception {
super(type);
}
public void setGenerateHeaderColumnNames(boolean generateHeaderColumnNames) throws IllegalAccessException {
FieldUtils.writeField(this, "generateHeaderColumnNames", generateHeaderColumnNames, true);
}
}
}
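To use it, you would substitute this data format for the plain BindyCsvDataFormat inside your RouteBuilder's configure(); MyCsvModel is a placeholder for your own @CsvRecord-annotated class:
// hypothetical model class annotated with @CsvRecord / @DataField
StreamingBindyCsvDataFormat bindyCsvDataFormat = new StreamingBindyCsvDataFormat(MyCsvModel.class);
from(inputDirectory)
.unmarshal(jaxb)
.process(jaxb2CSVDataProcessor)
.split(body())
.marshal(bindyCsvDataFormat)
.to(outputDirectory);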

Hadoop: Unable to run mapreduce program ..java.io.IOException: error=12

I'm trying to run a MapReduce program in Hadoop. Basically it takes a text file as input, in which each line is JSON text. I'm using json-simple to parse this data in my mapper, and the reducer does some other stuff. I have included the json-simple jar file in the hadoop/lib folder. Here is the code below:
package org.myorg;
import java.io.IOException;
import java.util.Iterator;
import java.util.*;
import org.json.simple.JSONArray;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class ALoc
{
public static class AMapper extends Mapper<LongWritable, Text, Text, Text>
{
private Text kword = new Text();
private Text vword = new Text();
JSONParser parser = new JSONParser();
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException{
try {
String line = value.toString();
Object obj = parser.parse(line);
JSONObject jsonObject = (JSONObject) obj;
String val = (String)jsonObject.get("m1") + "," + (String)jsonObject.get("m3");
kword.set((String)jsonObject.get("m0"));
vword.set(val);
context.write(kword, vword);
}
catch (IOException e) {
e.printStackTrace();
}
catch (ParseException e) {
e.printStackTrace();
}
}
}
public static class CountryReducer
extends Reducer<Text,Text,Text,Text>
{
private Text result = new Text();
public void reduce(Text key, Iterable<Text> values,
Context context
) throws IOException, InterruptedException
{
int ccount = 0;
HashMap<Text, Integer> hm = new HashMap<Text, Integer>();
for (Text val : values)
{
if(hm.containsKey(val)){
Integer n = (Integer)hm.get(val);
hm.put(val, n+1);
}else{
// copy the Text: Hadoop reuses the same Text instance for every value in the iterator
hm.put(new Text(val), Integer.valueOf(1));
}
}
Set set = hm.entrySet();
Iterator i = set.iterator();
String agr = "";
while(i.hasNext()) {
Map.Entry me = (Map.Entry)i.next();
agr += "|" + me.getKey() + me.getValue();
}
result.set(agr);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception
{
Configuration conf = new Configuration();
Job job = new Job(conf, "ALoc");
job.setJarByClass(ALoc.class);
job.setMapperClass(AMapper.class);
job.setReducerClass(CountryReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(TextInputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
When I try to run the job, it gives the following error.
I am running this on an AWS micro instance, single node.
I have been following this tutorial: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
hadoop#domU-18-11-19-02-92-8E:/$ bin/hadoop jar ALoc.jar org.myorg.ALoc /user/hadoop/adata /user/hadoop/adata-op5 -D mapred.reduce.tasks=16
13/02/12 08:39:50 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/02/12 08:39:50 INFO input.FileInputFormat: Total input paths to process : 1
13/02/12 08:39:50 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/02/12 08:39:50 WARN snappy.LoadSnappy: Snappy native library not loaded
13/02/12 08:39:51 INFO mapred.JobClient: Running job: job_201302120714_0006
13/02/12 08:39:52 INFO mapred.JobClient: map 0% reduce 0%
13/02/12 08:40:10 INFO mapred.JobClient: Task Id : attempt_201302120714_0006_m_000000_0, Status : FAILED
java.lang.RuntimeException: Error while running command to get file permissions : java.io.IOException: Cannot run program "/bin/ls": java.io.IOException: error=12, Cannot allocate memory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:475)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:200)
at org.apache.hadoop.util.Shell.run(Shell.java:182)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)
at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:710)
at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:443)
at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getOwner(RawLocalFileSystem.java:426)
at org.apache.hadoop.mapred.TaskLog.obtainLogDirOwner(TaskLog.java:267)
at org.apache.hadoop.mapred.TaskLogsTruncater.truncateLogs(TaskLogsTruncater.java:124)
at org.apache.hadoop.mapred.Child$4.run(Child.java:260)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
at java.lang.UNIXProcess.<init>(UNIXProcess.java:164)
at java.lang.ProcessImpl.start(ProcessImpl.java:81)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:468)
... 15 more
at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:468)
at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getOwner(RawLocalFileSystem.java:426)
at org.apache.hadoop.mapred.TaskLog.obtainLogDirOwner(TaskLog.java:267)
at org.apache.hadoop.mapred.TaskLogsTruncater.truncateLogs(TaskLogsTruncater.java:124)
at org.apache.hadoop.mapred.Child$4.run(Child.java:260)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
I guess you must be trying Hadoop on a micro instance, which has very little memory (~700 MB).
Try increasing the HADOOP heapsize parameter (in hadoop/conf/hadoop-env.sh), as the basic reason is a shortage of the memory required to fork processes.