Hadoop: Unable to run MapReduce program - java.io.IOException: error=12 - json

I'm trying to run a MapReduce program in Hadoop. Basically it takes a text file as input, in which each line is JSON text. I'm using simple-json to parse this data in my mapper, and the reducer does some other stuff. I have included the simple-json jar file in the hadoop/lib folder. Here is the code:
package org.myorg;
import java.io.IOException;
import java.util.Iterator;
import java.util.*;
import org.json.simple.JSONArray;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class ALoc
{
// with TextInputFormat the input key is a LongWritable byte offset, so KEYIN must be LongWritable
public static class AMapper extends Mapper<LongWritable, Text, Text, Text>
{
private Text kword = new Text();
private Text vword = new Text();
JSONParser parser = new JSONParser();
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException{
try {
String line = value.toString();
Object obj = parser.parse(line);
JSONObject jsonObject = (JSONObject) obj;
String val = (String)jsonObject.get("m1") + "," + (String)jsonObject.get("m3");
kword.set((String)jsonObject.get("m0"));
vword.set(val);
context.write(kword, vword);
}
catch (IOException e) {
e.printStackTrace();
}
catch (ParseException e) {
e.printStackTrace();
}
}
}
public static class CountryReducer
extends Reducer<Text,Text,Text,Text>
{
private Text result = new Text();
public void reduce(Text key, Iterable<Text> values,
Context context
) throws IOException, InterruptedException
{
int ccount = 0;
HashMap<Text, Integer> hm = new HashMap<Text, Integer>();
for (Text val : values)
{
// copy the value first: Hadoop reuses the same Text instance across the values iterator,
// so the incoming object cannot be used directly as a HashMap key
Text v = new Text(val);
if(hm.containsKey(v)){
Integer n = (Integer)hm.get(v);
hm.put(v, n+1);
}else{
hm.put(v, Integer.valueOf(1));
}
}
Set<Map.Entry<Text, Integer>> set = hm.entrySet();
Iterator<Map.Entry<Text, Integer>> i = set.iterator();
String agr = "";
while(i.hasNext()) {
Map.Entry<Text, Integer> me = i.next();
agr += "|" + me.getKey() + me.getValue();
}
result.set(agr);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception
{
Configuration conf = new Configuration();
Job job = new Job(conf, "ALoc");
job.setJarByClass(ALoc.class);
job.setMapperClass(AMapper.class);
job.setReducerClass(CountryReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(TextInputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
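For reference, each input line is a flat JSON object along the lines of the sample below (the field names m0, m1 and m3 come from the mapper; the values are made up purely for illustration):
{"m0":"someKey","m1":"someValue1","m3":"someValue3"}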
When I try to run the job, it gives the following error.
I am running this on a single-node AWS micro instance.
I have been following this tutorial: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
hadoop#domU-18-11-19-02-92-8E:/$ bin/hadoop jar ALoc.jar org.myorg.ALoc /user/hadoop/adata /user/hadoop/adata-op5 -D mapred.reduce.tasks=16
13/02/12 08:39:50 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/02/12 08:39:50 INFO input.FileInputFormat: Total input paths to process : 1
13/02/12 08:39:50 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/02/12 08:39:50 WARN snappy.LoadSnappy: Snappy native library not loaded
13/02/12 08:39:51 INFO mapred.JobClient: Running job: job_201302120714_0006
13/02/12 08:39:52 INFO mapred.JobClient: map 0% reduce 0%
13/02/12 08:40:10 INFO mapred.JobClient: Task Id : attempt_201302120714_0006_m_000000_0, Status : FAILED
java.lang.RuntimeException: Error while running command to get file permissions : java.io.IOException: Cannot run program "/bin/ls": java.io.IOException: error=12, Cannot allocate memory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:475)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:200)
at org.apache.hadoop.util.Shell.run(Shell.java:182)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)
at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:710)
at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:443)
at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getOwner(RawLocalFileSystem.java:426)
at org.apache.hadoop.mapred.TaskLog.obtainLogDirOwner(TaskLog.java:267)
at org.apache.hadoop.mapred.TaskLogsTruncater.truncateLogs(TaskLogsTruncater.java:124)
at org.apache.hadoop.mapred.Child$4.run(Child.java:260)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
at java.lang.UNIXProcess.<init>(UNIXProcess.java:164)
at java.lang.ProcessImpl.start(ProcessImpl.java:81)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:468)
... 15 more
at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:468)
at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getOwner(RawLocalFileSystem.java:426)
at org.apache.hadoop.mapred.TaskLog.obtainLogDirOwner(TaskLog.java:267)
at org.apache.hadoop.mapred.TaskLogsTruncater.truncateLogs(TaskLogsTruncater.java:124)
at org.apache.hadoop.mapred.Child$4.run(Child.java:260)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)

I guess you must be trying Hadoop on a micro instance, which has very little memory (~700 MB).
Try increasing the HADOOP heap size parameter (in hadoop/conf/hadoop-env.sh), as the basic reason is a shortage of the memory required to fork processes.
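For reference, a hedged sketch of where that parameter lives (HADOOP_HEAPSIZE is given in MB and defaults to 1000; the value below is only an illustration and has to be tuned to what the instance can actually spare):
# hadoop/conf/hadoop-env.sh
export HADOOP_HEAPSIZE=512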

Related

AWS EMR File Already exists: Hadoop Job reading and writing to S3

I have a Hadoop job running in EMR, and I am passing the S3 path as input and output to this job.
When I run locally everything works fine (as there is a single node).
However, when I run in EMR with a 5-node cluster I run into a File Already Exists IOException.
The output path has a timestamp in it, so the output path doesn't already exist in S3.
Error: java.io.IOException: File already exists:s3://<mybucket_name>/8_9_0a4574ca-96d0-47c8-8eb8-4deb82944d4b/customer/RawFile12.txt/1523583593585/TOKENIZED/part-m-00000
I have a very simple Hadoop app (primarily my mapper) which reads each line from a file and converts it (using an existing library).
Not sure why each node is trying to write with the same file name.
Here is my mapper:
public static class TokenizeMapper extends Mapper<Object,Text,Text,Text>{
public void map(Object key, Text value,Mapper.Context context) throws IOException,InterruptedException{
//TODO: Invoke Core Engine to transform the Data
Encryption encryption = new Encryption();
String tokenizedVal = encryption.apply(value.toString());
context.write(new Text(tokenizedVal), new Text("1")); // output key/value must be Text to match the declared Mapper types
}
}
And my reducer:
public static class TokenizeReducer extends Reducer<Text,Text,Text,Text> {
public void reduce(Text text,Iterable<Text> lines,Context context) throws IOException,InterruptedException{
Iterator<Text> iterator = lines.iterator();
int counter =0;
while(iterator.hasNext()){
iterator.next(); // advance the iterator; without this the loop never terminates
counter++;
}
Text output = new Text(""+counter);
context.write(text,output);
}
}
And my main class
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
long startTime = System.currentTimeMillis();
try {
Configuration config = new Configuration();
String[] additionalArgs = new GenericOptionsParser(config, args).getRemainingArgs();
if (additionalArgs.length != 2) {
System.err.println("Usage: Tokenizer Input_File and Output_File ");
System.exit(2);
}
Job job = Job.getInstance(config, "Raw File Tokenizer");
job.setJarByClass(Tokenizer.class);
job.setMapperClass(TokenizeMapper.class);
job.setReducerClass(TokenizeReducer.class);
job.setNumReduceTasks(0);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class); // the previous line already sets the key class; this one should set the value class
FileInputFormat.addInputPath(job, new Path(additionalArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(additionalArgs[1]));
boolean status = job.waitForCompletion(true);
if (status) {
//System.exit(0);
System.out.println("Completed Job Successfully");
} else {
System.out.println("Job did not Succeed");
}
}
catch(Exception e){
e.printStackTrace();
}
finally{
System.out.println("Total Time for processing =["+(System.currentTimeMillis()-startTime)+"]");
}
}
I am passing the arguments when I launch the cluster as:
s3://<mybucket>/8_9_0a4574ca-96d0-47c8-8eb8-4deb82944d4b/customer/RawFile12.txt
s3://<mybucket>/8_9_0a4574ca-96d0-47c8-8eb8-4deb82944d4b/customer/RawFile12.txt/1523583593585/TOKENIZED
Appreciate any inputs.
Thanks
In the driver code, you have set the number of reduce tasks to 0, so the reducer code is not needed.
In case you need to clear the output directory before the job launches, you can use this snippet to delete the directory if it exists:
FileSystem fileSystem = FileSystem.get(<hadoop config object>);
if(fileSystem.exists(new Path(<pathTocheck>)))
{
fileSystem.delete(new Path(<pathTocheck>), true);
}
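A hedged sketch of how that guard could sit in the driver from the question, right before the output path is set (it reuses the question's config and additionalArgs variables and assumes imports of org.apache.hadoop.fs.FileSystem and org.apache.hadoop.fs.Path):
Path outputPath = new Path(additionalArgs[1]);
// resolve the filesystem for this particular path, which also works for s3:// URIs
FileSystem fs = outputPath.getFileSystem(config);
if (fs.exists(outputPath)) {
    fs.delete(outputPath, true); // recursive delete of the stale output directory
}
FileOutputFormat.setOutputPath(job, outputPath);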

NetworkOnMainThreadException when trying to read a json file

Good morning from Germany!
I am trying to read a small JSON file from a web server. When I start the app, I get the error message android.os.NetworkOnMainThreadException.
The goal is for logcat to show my JSON content:
{"first":"one","second":"two"}
The web server is working: when I access 127.0.0.1/index.php, the browser shows me the line above.
This is my Main:
package com.example.u0017007.jsonclient;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import android.app.Activity;
import android.os.Bundle;
import android.util.Log;
public class MainActivity extends Activity {
/** Called when the activity is first created. */
@Override
public void onCreate(Bundle savedInstanceState) {
super.onCreate(savedInstanceState);
setContentView(R.layout.activity_main);
HttpClient httpClient = new DefaultHttpClient();
HttpGet httpGet = new HttpGet("http://127.0.0.1:80/index.php");
StringBuilder stringBuilder = new StringBuilder();
try {
HttpResponse response = httpClient.execute(httpGet);
int statusCode = response.getStatusLine().getStatusCode();
if (statusCode == 200) {
BufferedReader reader = new BufferedReader(
new InputStreamReader(response.getEntity().getContent()));
String line;
while ((line = reader.readLine()) != null) {
stringBuilder.append(line);
}
Log.i(this.getClass().getSimpleName(), stringBuilder.toString());
} else {
Log.e(this.getClass().getSimpleName(), "Fehler");
}
} catch (ClientProtocolException e) {
Log.e(this.getClass().getSimpleName(), e.getMessage());
} catch (IOException e) {
Log.e(this.getClass().getSimpleName(), e.getMessage());
}
}
}
That is the error log
FATAL EXCEPTION: main
Process: com.example.u0017007.jsonclient, PID: 16115
java.lang.RuntimeException: Unable to start activity ComponentInfo{com.example.u0017007.jsonclient/com.example.u0017007.jsonclient.MainActivity}: android.os.NetworkOnMainThreadException
at android.app.ActivityThread.performLaunchActivity(ActivityThread.java:2646)
at android.app.ActivityThread.handleLaunchActivity(ActivityThread.java:2707)
at android.app.ActivityThread.-wrap12(ActivityThread.java)
at android.app.ActivityThread$H.handleMessage(ActivityThread.java:1460)
at android.os.Handler.dispatchMessage(Handler.java:102)
at android.os.Looper.loop(Looper.java:154)
at android.app.ActivityThread.main(ActivityThread.java:6077)
at java.lang.reflect.Method.invoke(Native Method)
at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:866)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:756)
Caused by: android.os.NetworkOnMainThreadException
at android.os.StrictMode$AndroidBlockGuardPolicy.onNetwork(StrictMode.java:1303)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:333)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:196)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:178)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:356)
at java.net.Socket.connect(Socket.java:586)
at org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:124)
at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:149)
at org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:169)
at org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:124)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:366)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:560)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:492)
at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:470)
at com.example.u0017007.jsonclient.MainActivity.onCreate(MainActivity.java:29)
at android.app.Activity.performCreate(Activity.java:6662)
at android.app.Instrumentation.callActivityOnCreate(Instrumentation.java:1118)
at android.app.ActivityThread.performLaunchActivity(ActivityThread.java:2599)
at android.app.ActivityThread.handleLaunchActivity(ActivityThread.java:2707) 
at android.app.ActivityThread.-wrap12(ActivityThread.java) 
at android.app.ActivityThread$H.handleMessage(ActivityThread.java:1460) 
at android.os.Handler.dispatchMessage(Handler.java:102) 
at android.os.Looper.loop(Looper.java:154) 
at android.app.ActivityThread.main(ActivityThread.java:6077) 
at java.lang.reflect.Method.invoke(Native Method) 
at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:866) 
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:756
There are related questions that you can find and make use of, but to give you the gist of the available solutions, you can solve this using either of the approaches below:
You can refrain from making any network-related calls on the main UI thread and instead use an AsyncTask to do the work (a minimal sketch follows the StrictMode snippet below).
Alternatively, you can add the following code to your main activity, together with the following import.
The import:
import android.os.StrictMode;
The code:
if (android.os.Build.VERSION.SDK_INT > 9) {
StrictMode.ThreadPolicy policy = new StrictMode.ThreadPolicy.Builder().permitAll().build();
StrictMode.setThreadPolicy(policy);
}
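A minimal sketch of the AsyncTask route mentioned above, reusing the request code from the question; the class name is illustrative, an import of android.os.AsyncTask is assumed, and it would be declared inside MainActivity and started from onCreate with new FetchJsonTask().execute():
private class FetchJsonTask extends AsyncTask<Void, Void, String> {
    @Override
    protected String doInBackground(Void... params) {
        // runs on a background thread, so the network call is allowed here
        HttpClient httpClient = new DefaultHttpClient();
        HttpGet httpGet = new HttpGet("http://127.0.0.1:80/index.php");
        StringBuilder stringBuilder = new StringBuilder();
        try {
            HttpResponse response = httpClient.execute(httpGet);
            if (response.getStatusLine().getStatusCode() == 200) {
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(response.getEntity().getContent()));
                String line;
                while ((line = reader.readLine()) != null) {
                    stringBuilder.append(line);
                }
            }
        } catch (IOException e) {
            Log.e("FetchJsonTask", "Request failed", e);
        }
        return stringBuilder.toString();
    }
    @Override
    protected void onPostExecute(String result) {
        // back on the UI thread; just log the JSON content as in the question
        Log.i("FetchJsonTask", result);
    }
}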

ElasticSearch : Exception in thread "main" NoNodeAvailableException[None of the configured nodes are available

I am trying to use the Java Bulk API for creating the documents in ElasticSearch. I am using a JSON file as the bulk input. When I execute, I get the following exception:
Exception in thread "main" NoNodeAvailableException[None of the configured nodes are available: []]
at org.elasticsearch.client.transport.TransportClientNodesService.ensureNodesAreAvailable(TransportClientNodesService.java:280)
at org.elasticsearch.client.transport.TransportClientNodesService.execute(TransportClientNodesService.java:197)
at org.elasticsearch.client.transport.support.TransportProxyClient.execute(TransportProxyClient.java:55)
at org.elasticsearch.client.transport.TransportClient.doExecute(TransportClient.java:272)
at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:347)
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:85)
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:59)
at bulkApi.test.App.main(App.java:92)
This is the java code.
package bulkApi.test;
import java.io.BufferedReader;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.common.transport.TransportAddress;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;
/**
* Hello world!
*
*/
public class App
{
public static void main( String[] args ) throws IOException, ParseException
{
System.out.println( "Hello World!" );
// configuration setting
Settings settings = Settings.settingsBuilder()
.put("cluster.name", "test-cluster").build();
TransportClient client = TransportClient.builder().settings(settings).build();
String hostname = "localhost";
int port = 9300;
client.addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName(hostname),port));
// bulk API
BulkRequestBuilder bulkBuilder = client.prepareBulk();
long bulkBuilderLength = 0;
String readLine = "";
String index = "testindex";
String type = "testtype";
String _id = null;
BufferedReader br = new BufferedReader(new InputStreamReader(new DataInputStream(new FileInputStream("/home/nilesh/Shiney/Learn/elasticSearch/ES/test/parseAccount.json"))));
JSONParser parser = new JSONParser();
while((readLine = br.readLine()) != null){
// it will skip the metadata line which is supported by bulk insert format
if (readLine.startsWith("{\"index")){
continue;
}
else {
Object json = parser.parse(readLine);
if(((JSONObject)json).get("account_number")!=null){
_id = String.valueOf(((JSONObject)json).get("account_number"));
System.out.println(_id);
}
//_id is unique field in elasticsearch
JSONObject jsonObject = (JSONObject) json;
bulkBuilder.add(client.prepareIndex(index, type, String.valueOf(_id)).setSource(jsonObject));
bulkBuilderLength++;
try {
if(bulkBuilderLength % 100== 0){
System.out.println("##### " + bulkBuilderLength + "data indexed.");
BulkResponse bulkRes = bulkBuilder.execute().actionGet();
if(bulkRes.hasFailures()){
System.out.println("##### Bulk Request failure with error: " + bulkRes.buildFailureMessage());
}
bulkBuilder = client.prepareBulk();
}
} catch (Exception e) {
e.printStackTrace();
}
}
}
br.close();
if(bulkBuilder.numberOfActions() > 0){
System.out.println("##### " + bulkBuilderLength + " data indexed.");
BulkResponse bulkRes = bulkBuilder.execute().actionGet();
if(bulkRes.hasFailures()){
System.out.println("##### Bulk Request failure with error: " + bulkRes.buildFailureMessage());
}
bulkBuilder = client.prepareBulk();
}
}
}
A couple of things to check:
Check that your cluster name is the same as the one defined in elasticsearch.yml.
Try creating the index with the index name ('testindex' in your case) with the PUT API from any REST client (e.g. Advanced REST Client): http://localhost:9200/testindex (your request has to be successful, i.e. status code == 200).
Run your program after you create the index with the above PUT request.
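A hedged sketch of both checks from the command line, assuming the default REST port 9200 on localhost:
# the response includes "cluster_name"; it must match the cluster.name used in your Settings
curl -XGET 'http://localhost:9200/'
# creates the index; expect an acknowledged response (HTTP 200)
curl -XPUT 'http://localhost:9200/testindex'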

Processing JSON using java Mapreduce

I am new to Hadoop MapReduce.
I have an input text file where the data has been stored as follows. Here are only a few tuples (data.txt):
{"author":"Sharīf Qāsim","book":"al- Rabīʻ al-manshūd"}
{"author":"Nāṣir Nimrī","book":"Adīb ʻAbbāsī"}
{"author":"Muẓaffar ʻAbd al-Majīd Kammūnah","book":"Asmāʼ Allāh al-ḥusná al-wāridah fī muḥkam kitābih"}
{"author":"Ḥasan Muṣṭafá Aḥmad","book":"al- Jabhah al-sharqīyah wa-maʻārikuhā fī ḥarb Ramaḍān"}
{"author":"Rafīqah Salīm Ḥammūd","book":"Taʻlīm fī al-Baḥrayn"}
This is my Java file that I am supposed to write my code in (CombineBooks.java):
package org.hwone;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.GenericOptionsParser;
//TODO import necessary components
/*
* Modify this file to combine books from the same author into
* single JSON object.
* i.e. {"author": "Tobias Wells", "books": [{"book":"A die in the country"},{"book": "Dinky died"}]}
* Be aware that this may work on any number of nodes!
*
*/
public class CombineBooks {
//TODO define variables and implement necessary components
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args)
.getRemainingArgs();
if (otherArgs.length != 2) {
System.err.println("Usage: CombineBooks <in> <out>");
System.exit(2);
}
//TODO implement CombineBooks
Job job = new Job(conf, "CombineBooks");
//TODO implement CombineBooks
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
My task is to create a Hadoop program in "CombineBooks.java", returned in the "question-2" directory. The program should do the following: given the input author-book tuples, the map-reduce program should produce a JSON object which contains all the books from the same author in a JSON array, i.e.
{"author": "Tobias Wells", "books":[{"book":"A die in the country"},{"book": "Dinky died"}]}
Any idea how it can be done ?
First, the JSON classes you are trying to work with are not available to you. To solve this:
Go here and download as zip: https://github.com/douglascrockford/JSON-java
Extract to your sources folder in subdirectory org/json/*
Next, the first line of your code declares a package "org.json", which is incorrect; you should create a separate package, for instance "my.books".
Third, using a combiner here is useless.
Here's the code I ended up with; it works and solves your problem:
package my.books;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.json.*;
import javax.security.auth.callback.TextInputCallback;
public class CombineBooks {
public static class Map extends Mapper<LongWritable, Text, Text, Text>{
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException{
String author;
String book;
String line = value.toString();
String[] tuple = line.split("\\n");
try{
for(int i=0;i<tuple.length; i++){
JSONObject obj = new JSONObject(tuple[i]);
author = obj.getString("author");
book = obj.getString("book");
context.write(new Text(author), new Text(book));
}
}catch(JSONException e){
e.printStackTrace();
}
}
}
public static class Reduce extends Reducer<Text,Text,NullWritable,Text>{
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException{
try{
JSONObject obj = new JSONObject();
JSONArray ja = new JSONArray();
for(Text val : values){
JSONObject jo = new JSONObject().put("book", val.toString());
ja.put(jo);
}
obj.put("books", ja);
obj.put("author", key.toString());
context.write(NullWritable.get(), new Text(obj.toString()));
}catch(JSONException e){
e.printStackTrace();
}
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
if (args.length != 2) {
System.err.println("Usage: CombineBooks <in> <out>");
System.exit(2);
}
Job job = new Job(conf, "CombineBooks");
job.setJarByClass(CombineBooks.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Here's the folder structure of my project:
src
src/my
src/my/books
src/my/books/CombineBooks.java
src/org
src/org/json
src/org/json/zip
src/org/json/zip/BitReader.java
...
src/org/json/zip/None.java
src/org/json/JSONStringer.java
src/org/json/JSONML.java
...
src/org/json/JSONException.java
Here's the input
[localhost:CombineBooks]$ hdfs dfs -cat /example.txt
{"author":"author1", "book":"book1"}
{"author":"author1", "book":"book2"}
{"author":"author1", "book":"book3"}
{"author":"author2", "book":"book4"}
{"author":"author2", "book":"book5"}
{"author":"author3", "book":"book6"}
The command to run:
hadoop jar ./bookparse.jar my.books.CombineBooks /example.txt /test_output
Here's the output:
[pivhdsne:CombineBooks]$ hdfs dfs -cat /test_output/part-r-00000
{"books":[{"book":"book3"},{"book":"book2"},{"book":"book1"}],"author":"author1"}
{"books":[{"book":"book5"},{"book":"book4"}],"author":"author2"}
{"books":[{"book":"book6"}],"author":"author3"}
You can use one of these three options to put the org.json.* classes onto your cluster:
Pack the org.json.* classes into your jar file (this can easily be done using a GUI IDE). This is the option I used in my answer.
Put the jar file containing the org.json.* classes on each of the cluster nodes into one of the CLASSPATH directories (see yarn.application.classpath).
Put the jar file containing org.json.* into HDFS (hdfs dfs -put <org.json jar> <hdfs path>) and use the job.addFileToClassPath call so that the jar is available to all of the tasks executing your job on the cluster. With my answer's code you would add job.addFileToClassPath(new Path("<jar_file_on_hdfs_location>")); to the main method (a one-line sketch follows this list).
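A hedged one-line sketch of option 3 inside main, after the Job is created (the HDFS jar location used here is hypothetical):
job.addFileToClassPath(new Path("/libs/json-java.jar")); // jar previously uploaded with: hdfs dfs -put json-java.jar /libs/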
For splittable multi-line JSON, refer to:
https://github.com/alexholmes/json-mapreduce

Load testing with Cucumber and Java

I need to perform load testing for my REST web service using Cucumber and Java. This REST web service accepts one input, a String called id, and returns a complex JSON object.
I wrote a .feature file with Given, When and Then steps, which are defined in Java.
The skeleton definition of the classes and annotations is given below.
1) Feature (UserActivity.feature)
@functional @integration
Feature: User System Load Test
Scenario Outline: Load test for user data summary from third party UserSystem
Given Simultaneously multiple users are hitting XYZ services with an id=<ids>
When I invoke third party link with above id for multiple users simultaneously
Then I should get response code and response message for all users
Examples:
| ids |
| "pABC123rmqst" |
| "fakXYZ321rmv" |
| "bncMG4218jst" |
2) LoadTestStepDef.java (Feature definition)
package com.system.test.cucumber.steps;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.junit.runners.model.InitializationError;
import com.system.test.restassured.LoadTestUtil;
import cucumber.api.java.en.Given;
import cucumber.api.java.en.Then;
import cucumber.api.java.en.When;
public class LoadTestStepDef
{
private static Logger LOG = LogManager.getLogger( LoadTestStepDef.class );
private String id = null;
private LoadTestUtil service = null;
#Given("^Simultaneously multiple users are hitting XYZ services with an a id=\"(.*?)\"$" )
public void Simultaneously_multiple_users_are_hitting_XYZ_services_with_a_id( String id )
{
LOG.debug( "ID {}", id );
LOG.info( "ID {}", id );
this.id = id;
}
#When( "^I invoke third party link with above id for multiple users simultaneously$" )
public void invoke_third_party_link_With_Above_ID_for_multiple_users_simultaneously() throws InitializationError
{
LOG.debug( " *** Calling simulatenously {} ", id );
LOG.info( " *** Calling simulatenously {}", id );
//Create object of service
service = new LoadTestUtil();
//Set the id to the created service and invoke method
service.setData(id);
service.invokeSimultaneosCalls(10);
}
#Then( "^I should get response code and response message for all users$" )
public void Should_get_response_code_and_response_message_for_all_users()
{
LOG.info( "*** Assert for response Code" );
service.assertHeaderResponseCodeAndMessage();
}
}
3) LoadTestUtil.java
package com.system.test.restassured;
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNotNull;
import java.util.concurrent.Callable;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import com.jayway.restassured.path.json.JsonPath;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
public class LoadTestUtil
{
private static Logger LOG = LogManager.getLogger( LoadTestUtil.class ); // LOG is used below but was never declared
private String id = null;
private int numberofTimes;
//List to hold the JsonPath responses collected from all calls
private List<JsonPath> jsonResponseList = new ArrayList<JsonPath>();
//No arg Constructor
public LoadTestUtil()
{
}
//Set data method to set the initial id
public void setData(String id)
{
LOG.info( "LoadTestUtil.setData()", id );
this.id = id;
}
//This method is used call the REST webservice N times using threads and get response
public void invokeSimultaneosCalls(int numberofTimes)
{
LOG.info( "LoadTestUtil.invokeSimultaneosCalls() - Start" );
this.numberofTimes = numberofTimes;
try
{
long start = System.nanoTime();
int numberOfThreads = Runtime.getRuntime().availableProcessors();
LOG.info("Number of processor available {}" , numberOfThreads);
//Create pool for the Executor Service with numberOfThreads.
ExecutorService executor = Executors.newFixedThreadPool(numberOfThreads);
//Create a list to hold the Future object associated with Callable
List<Future<JsonPath>> futureList = new ArrayList<Future<JsonPath>>();
//Create new RESTServiceCallTask instance
Callable<JsonPath> callable = new RESTServiceCallTask(id);
Future<JsonPath> future = null;
//Iterate N number of times to submit the callable object
for(int count=1; count<=numberofTimes;count++)
{
//Submit Callable tasks to the executor
future = executor.submit(callable);
//Add Future to the list to get return value using Future
futureList.add(future);
}
//Create a flag to monitor the thread status. Check whether all worker threads are completed or not
boolean threadStatus = true;
while (threadStatus)
{
if (future.isDone())
{
threadStatus = false;
//Iterate the response obtained from the futureList
for(Future<JsonPath> futuree : futureList)
{
try
{
//print the return value of Future, notice the output delay in console
// because Future.get() waits for task to get completed
JsonPath response = futuree.get();
jsonResponseList.add(response);
}
catch(InterruptedException ie)
{
ie.printStackTrace();
}
catch(ExecutionException ee)
{
ee.printStackTrace();
}
catch(Exception e)
{
e.printStackTrace();
}
}//End of for to iterate the futuree list
} //End of future.isDone()
} //End of while (threadStatus)
//shut down the executor service now
executor.shutdown();
//Calculate the time taken by the threads for execution
executor.awaitTermination(1, TimeUnit.HOURS); // or longer.
long time = System.nanoTime() - start;
logger.info("Tasks took " + time/1e6 + " ms to run");
long milliSeconds = time / 1000000;
long seconds, minutes, hours;
seconds = milliSeconds / 1000;
hours = seconds / 3600;
seconds = seconds % 3600;
seconds = seconds / 60;
minutes = seconds % 60;
logger.info("Task took " + hours + " hours, " + minutes + " minutes and " + seconds + " seconds to complete");
} //End of try block
catch (Exception e)
{
e.printStackTrace();
}
LOG.info("LoadTestUtil.invokeSimultaneosCalls() - jsonResponseList {} " , jsonResponseList);
System.out.println("LoadTestUtil.invokeSimultaneosCalls() - jsonResponseList {} " + jsonResponseList);
LOG.info( "*** LoadTestUtil.invokeSimultaneosCalls() - End" );
}
public void assertHeaderResponseCodeAndMessage(){
//Number of response objects available
int size = jsonResponseList.size();
LOG.info("Number of REST service calls made = ", size);
for(JsonPath jsonResponse : jsonResponseList)
{
String responseCode = jsonResponse.get( "header.response_code").toString();
String responseMessage = jsonResponse.get( "header.response_message").toString();
assertEquals( "200", responseCode);
assertEquals( "success", responseMessage);
}
}
}
4) RESTServiceCallTask.java
This class implements Callable and overrides the call() method.
In the call() method, the response, in the form of a JsonPath, is returned for each call.
package com.system.test.restassured;
import static com.jayway.restassured.RestAssured.basePath;
import static com.jayway.restassured.RestAssured.baseURI;
import static com.jayway.restassured.RestAssured.given;
import static com.jayway.restassured.RestAssured.port;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import com.system.test.restassured.TestUtil;
import com.jayway.restassured.path.json.JsonPath;
import com.jayway.restassured.response.Response;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
public class RESTServiceCallTask implements Callable<JsonPath>
{
private static Logger LOG = LogManager.getLogger(RESTServiceCallTask.class);
private Response response = null;
private String id;
private String environment;
//private JsonPath jsonPath;
/**
* Constructor initializes the call to third party system
*
* @param id
*/
public RESTServiceCallTask(String id)
{
LOG.info("In RESTServiceCallTask() constructor ");
this.id = id;
//Read the environment variable ENV to get the corresponding environment's REST URL to call
this.environment = System.getProperty("ENV");
baseURI = TestUtil.getbaseURL(environment);
basePath = "/bluelink/tracker/member_summary";
port = 80;
LOG.info(" *** Environment : {}, URI: {} and Resource {} ", environment, baseURI, basePath);
}
//This method is called by the threads to fire the REST service and returns JSONPath for each execution
@Override
public JsonPath call() throws Exception
{
LOG.info(" *** In call() method ");
try
{
response = given().headers("id", this.id).log().all().get();
} catch (Exception e)
{
LOG.error("System Internal Server Error", e);
}
String strResponse = this.response.asString();
LOG.info("Response : {}", strResponse);
JsonPath jsonResponse = new JsonPath(strResponse);
return jsonResponse;
}
}
5) TestUtil.java
This utility class is used to get the REST URL corresponding to the passed environment
package com.system.test.restassured;
import java.util.HashMap;
import java.util.Map;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
public class TestUtil
{
private static Logger LOG = LogManager.getLogger(TestUtil.class);
private static final Map<String, String> ENVIRONMENT_MAP = new HashMap<String, String>();
static
{
ENVIRONMENT_MAP.put("LOCAL", "http://localhost:9080");
ENVIRONMENT_MAP.put("ENV1", "http://localhost:9080");
ENVIRONMENT_MAP.put("ENV2", "http://localhost:9080");
ENVIRONMENT_MAP.put("ENV3", "http://localhost:9080");
}
public static String getbaseURL(String environment)
{
LOG.info("Environment value fetched = {}", environment);
return ENVIRONMENT_MAP.get(environment);
}
}
The problem here is that the multi-threaded execution is not actually happening.
I used the Maven Surefire plugin and tried parallel classes and methods. In those cases the above scenario doesn't work either.
Does Cucumber support Java multi-threading? If so, what is wrong with the above feature definition?
Note: the same task is performed by a standalone program and runs 10,000 times using 4 threads without any issues. However, I am not able to run the above code 2,000 times using Maven; with 2,000 iterations the system crashed abruptly.
I am using Rational Application Developer 8.5 and WebSphere Server 8.0 with Maven 3.x for the above setup.
Thanks for your response.