AWS EMR File Already exists: Hadoop Job reading and writing to S3 - hadoop2

I have a Hadoop job running in EMR, and I am passing an S3 path as the input and output for the job.
When I run it locally everything works fine (as there is a single node).
However, when I run it in EMR on a 5-node cluster, I run into a "File already exists" IOException.
The output path has a timestamp in it, so the output path does not already exist in S3.
Error: java.io.IOException: File already exists:s3://<mybucket_name>/8_9_0a4574ca-96d0-47c8-8eb8-4deb82944d4b/customer/RawFile12.txt/1523583593585/TOKENIZED/part-m-00000
I have a very simple Hadoop app (primarily the mapper) which reads each line from a file and converts it (using an existing library).
I am not sure why each node is trying to write to the same file name.
Here is my mapper:
public static class TokenizeMapper extends Mapper<Object, Text, Text, Text> {
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        //TODO: Invoke Core Engine to transform the Data
        Encryption encryption = new Encryption();
        String tokenizedVal = encryption.apply(value.toString());
        context.write(new Text(tokenizedVal), new Text("1"));
    }
}
And my reducer:
public static class TokenizeReducer extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text text, Iterable<Text> lines, Context context) throws IOException, InterruptedException {
        Iterator<Text> iterator = lines.iterator();
        int counter = 0;
        while (iterator.hasNext()) {
            iterator.next();
            counter++;
        }
        Text output = new Text("" + counter);
        context.write(text, output);
    }
}
And my main class:
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
    long startTime = System.currentTimeMillis();
    try {
        Configuration config = new Configuration();
        String[] additionalArgs = new GenericOptionsParser(config, args).getRemainingArgs();
        if (additionalArgs.length != 2) {
            System.err.println("Usage: Tokenizer Input_File and Output_File ");
            System.exit(2);
        }
        Job job = Job.getInstance(config, "Raw File Tokenizer");
        job.setJarByClass(Tokenizer.class);
        job.setMapperClass(TokenizeMapper.class);
        job.setReducerClass(TokenizeReducer.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(additionalArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(additionalArgs[1]));
        boolean status = job.waitForCompletion(true);
        if (status) {
            //System.exit(0);
            System.out.println("Completed Job Successfully");
        } else {
            System.out.println("Job did not Succeed");
        }
    }
    catch (Exception e) {
        e.printStackTrace();
    }
    finally {
        System.out.println("Total Time for processing =[" + (System.currentTimeMillis() - startTime) + "]");
    }
}
I am passing the arguments when I launch the cluster as:
s3://<mybucket>/8_9_0a4574ca-96d0-47c8-8eb8-4deb82944d4b/customer/RawFile12.txt
s3://<mybucket>/8_9_0a4574ca-96d0-47c8-8eb8-4deb82944d4b/customer/RawFile12.txt/1523583593585/TOKENIZED
Appreciate any inputs.
Thanks

In the driver code, you have set the number of reducers to 0, so the reducer code is not needed.
In case you need to clear the output directory before launching the job, you can use this snippet to delete it if it exists:
FileSystem fileSystem = FileSystem.get(<hadoop config object>);
if (fileSystem.exists(new Path(<pathTocheck>))) {
    fileSystem.delete(new Path(<pathTocheck>), true);
}
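For an s3:// output path specifically, a minimal sketch might look like this (the bucket and prefix are placeholders; the FileSystem is resolved from the path's own URI so that the S3 filesystem, not the cluster default, is used):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration config = new Configuration();
Path outputPath = new Path("s3://<mybucket>/<output-prefix>");   // placeholder path
FileSystem fs = FileSystem.get(outputPath.toUri(), config);      // filesystem matching the s3:// scheme
if (fs.exists(outputPath)) {
    fs.delete(outputPath, true);   // true = recursive delete
}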

Related

How to correctly load Firebase ServiceAccount json resource with Spring MVC?

I'm trying to connect my Spring MVC (not Spring Boot) application to Firebase. My application's folder structure looks like this:
[folder structure screenshot]
The problem is that I don't know where to place the API key JSON file, how to load the resource, or the correct order of the method calls.
I tried loading the resource the way shown below. Before that I also tried using a ClassLoader to load it from the WEB-INF folder, and it worked, but then I changed the code, kept receiving a NullPointerException (why not a FileNotFoundException?) for the InputStream, and couldn't restore the previous state.
In the current state I keep receiving a FileNotFoundException, as I am not able to load the resource no matter how much I google "Spring MVC load resource", and according to the debugger the service account's "init" method annotated with @PostConstruct isn't running when the server starts.
I understand that I should load the resource and call the "init" method in order to make it work (I suppose it's enough to call it once, after creating the bean and before using Firebase methods), but I just couldn't come up with a working implementation.
I used examples from here:
https://github.com/savicprvoslav/Spring-Boot-starter
(Bottom of the Page)
My Controller Class:
@Controller
@RequestMapping("/firebase")
public class FirebaseController {

    @Autowired
    private FirebaseService firebaseService;

    @GetMapping(value = "/upload/maincategories")
    public void uploadMainRecordCategories() {
        firebaseService.uploadMainRecordCategories();
    }
}
My Service Class:
@Service
public class FirebaseServiceBean implements FirebaseService {

    @Value("/api.json")
    Resource apiKey;

    @Override
    public void uploadMainRecordCategories() {
        // do something
    }

    @PostConstruct
    public void init() {
        try (InputStream serviceAccount = apiKey.getInputStream()) {
            FirebaseOptions options = new FirebaseOptions.Builder()
                    .setCredentials(GoogleCredentials.fromStream(serviceAccount))
                    .setDatabaseUrl(FirebaseStringValue.DB_URL)
                    .build();
            FirebaseApp.initializeApp(options);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
How about saving the value in a Spring property and using @Value("${firebase.apiKey}")?
Alternatively, save the path to the file in a property and reference that in @Value():
@Value("${service.account.path}")
private String serviceAccountPath;
In application.properties:
service.account.path = /path/to/service-account.json
Then the config code:
private String getAccessToken() throws IOException {
GoogleCredential googleCredential = GoogleCredential
.fromStream(getServiceAccountInputStream())
.createScoped(Collections.singletonList("https://www.googleapis.com/auth/firebase.messaging"));
googleCredential.refreshToken();
return googleCredential.getAccessToken();
}
private InputStream getServiceAccountInputStream() {
File file = new File(serviceAccountPath);
try {
return new FileInputStream(file);
} catch (FileNotFoundException e) {
throw new RuntimeException("Couldn't find service-account.json");
}
}
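If the goal is still to initialize the Firebase Admin SDK as in the question's init() method, the same property-driven stream can feed the options builder. This is only a sketch that reuses the calls already shown above:
@PostConstruct
public void init() {
    try (InputStream serviceAccount = getServiceAccountInputStream()) {
        FirebaseOptions options = new FirebaseOptions.Builder()
                .setCredentials(GoogleCredentials.fromStream(serviceAccount))
                .setDatabaseUrl(FirebaseStringValue.DB_URL)
                .build();
        FirebaseApp.initializeApp(options);
    } catch (IOException e) {
        e.printStackTrace();
    }
}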

Loading json into my unit test from a text file

I am working in AEM, trying to create txt files with JSON output so that I can load them into my unit tests as strings and test my model / model processors. So far I have this...
public String readFile(String path, Charset encoding) throws IOException
{
byte[] encoded = Files.readAllBytes(Paths.get(path));
return new String(encoded, encoding);
}
private String sampleInput = readFile("/test/resources/map/sample-input.txt", Charset.forName("UTF-8"));
I need sampleInput to take the JSON that is in 'sample-input.txt' and convert it to a string. I am also running into issues with the Charset encoding.
I think the easiest way to manage JSON documents you use for unit testing is by keeping them organized in the classpath. Guava provides a neat wrapper for loading classpath resources.
import com.google.common.base.Charsets;
import com.google.common.io.Resources;
import java.io.IOException;
import java.net.URL;
public class TestJsonDocumentLoader {

    private final Class clazz;

    public TestJsonDocumentLoader(Class clazz) {
        this.clazz = clazz;
    }

    public String loadTestJson(String fileName) {
        URL url = Resources.getResource(clazz, fileName);
        try {
            String data = Resources.toString(url, Charsets.UTF_8);
            return data;
        } catch (IOException e) {
            throw new RuntimeException("Couldn't load a JSON file.", e);
        }
    }
}
This can then be used to load arbitrary JSON files placed in the same package as the test class. It is assumed that the files are UTF-8 encoded. I suggest keeping all sources encoded that way, regardless of the OS your team is using. It saves you a lot of trouble with version control.
Let's say you have MyTest in src/test/java/com/example/mytestsuite; then you could place a file data.json in src/test/resources/com/example/mytestsuite and load it by calling
TestJsonDocumentLoader loader = new TestJsonDocumentLoader(MyTest.class);
String jsonData = loader.loadTestJson("data.json");
String someOtherExample = loader.loadTestJson("other.json");
Actually, this could be used for all sorts of text files.
You could also use Jackson's ObjectMapper as an alternative:
public class JsonResourceObjectMapper<T> {
private Class<T> model;
public JsonResourceObjectMapper(Class<T> model) {
this.model = model;
}
public T loadTestJson(String fileName) throws IOException{
ClassLoader classLoader = this.getClass().getClassLoader();
InputStream inputStream= classLoader.getResourceAsStream(fileName);
return new ObjectMapper().readValue(inputStream, this.model);
}
}
And then set up a fixture in the test, passing a .class:
private JsonClass json;
@Before
public void setUp() throws IOException {
JsonResourceObjectMapper mapper = new JsonResourceObjectMapper(JsonClass.class);
json = (JsonClass) mapper.loadTestJson("json/testJson.json");
}
Note that the testJson.json file is in the resources/json folder, as @toniedzwiedz mentioned.
You can then use the JSON model as:
@Test
public void testJsonNameProperty(){
//act
String name = json.getName();
// assert
assertEquals("testName", name);
}
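For completeness, a possible shape for JsonClass and the test resource, inferred from the assertion above (the field name and file contents are assumptions, not from the original post):
// src/test/resources/json/testJson.json could contain: {"name": "testName"}
public class JsonClass {
    private String name;

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }
}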

Starting and stopping instances on Google Compute Engine

I want to start/resume and stop/suspend instances on Google Compute Engine, but it throws a java.lang.UnsupportedOperationException. Is there any alternative way to perform these operations?
public class Example {
public static void main(String[] args)
{
String provider = "google-compute-engine";
String identity = "****@developer.gserviceaccount.com";
String credential = "path to private key";
String groupName = "newgroup";
credential = getCredentialFromJsonKeyFile(credential);
Iterable<Module> modules = ImmutableSet.<Module> of(
new SshjSshClientModule(),
new SLF4JLoggingModule(),
new EnterpriseConfigurationModule());
ContextBuilder builder = ContextBuilder.newBuilder(provider)
.credentials(identity, credential)
.modules(modules);
ComputeService compute=builder.buildView(ComputeServiceContext.class).getComputeService();
compute.suspendNode("Instance id");
//compute.suspendNodesMatching(Predicates.<NodeMetadata> and(inGroup(groupName)));
System.out.println("suspended");
compute.getContext().close();
}
private static String getCredentialFromJsonKeyFile(String filename) {
try {
String fileContents = Files.toString(new File(filename), UTF_8);
Supplier<Credentials> credentialSupplier = new GoogleCredentialsFromJson(fileContents);
String credential = credentialSupplier.get().credential;
return credential;
} catch (IOException e) {
System.err.println("Exception reading private key from '%s': " + filename);
e.printStackTrace();
System.exit(1);
return null;
}
}
}
Output:
suspending node(node id)
Exception in thread "main" java.lang.UnsupportedOperationException: suspend is not supported by GCE
at org.jclouds.googlecomputeengine.compute.GoogleComputeEngineServiceAdapter.suspendNode(GoogleComputeEngineServiceAdapter.java:251)
at org.jclouds.compute.strategy.impl.AdaptingComputeServiceStrategies.suspendNode(AdaptingComputeServiceStrategies.java:171)
at org.jclouds.compute.internal.BaseComputeService.suspendNode(BaseComputeService.java:503)
at org.jclouds.examples.compute.basics.Example.main(Example.java:79)
It is not directly supported in the portable jclouds ComputeService, but from the ComputeServiceContext you can get the GoogleComputeEngineApi and the InstanceApi, and use the start/stop methods in there.
FYI, there is an ongoing patch to add support for the start/stop operations in the ComputeService: https://github.com/jclouds/jclouds-labs-google/pull/141
You can stop an instance from the API.
POST https://www.googleapis.com/compute/v1/projects/<project>/zones/<zone>/instances/<instance>/stop
Where:
project in the URL is your project ID.
zone in the URL is the name of the zone for the request.
instance in the URL is the name of the instance to stop.
Here's the docs
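If you want to call that endpoint from Java without jclouds, a minimal sketch could look like the following; the project, zone, and instance values and the way you obtain the OAuth2 access token (it needs the compute scope) are placeholders, not something from the question:
import java.net.HttpURLConnection;
import java.net.URL;

public class StopInstance {
    public static void main(String[] args) throws Exception {
        String project = "my-project";                  // placeholder
        String zone = "us-central1-a";                  // placeholder
        String instance = "my-instance";                // placeholder
        String accessToken = "<oauth2-access-token>";   // obtain e.g. via a service-account credential

        URL url = new URL("https://www.googleapis.com/compute/v1/projects/" + project
                + "/zones/" + zone + "/instances/" + instance + "/stop");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Authorization", "Bearer " + accessToken);
        conn.setDoOutput(true);
        conn.getOutputStream().close();                 // empty POST body
        System.out.println("Response code: " + conn.getResponseCode());
    }
}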

Processing JSON using Java MapReduce

I am new to Hadoop MapReduce.
I have an input text file where data has been stored as follows. Here are just a few tuples (data.txt):
{"author":"Sharīf Qāsim","book":"al- Rabīʻ al-manshūd"}
{"author":"Nāṣir Nimrī","book":"Adīb ʻAbbāsī"}
{"author":"Muẓaffar ʻAbd al-Majīd Kammūnah","book":"Asmāʼ Allāh al-ḥusná al-wāridah fī muḥkam kitābih"}
{"author":"Ḥasan Muṣṭafá Aḥmad","book":"al- Jabhah al-sharqīyah wa-maʻārikuhā fī ḥarb Ramaḍān"}
{"author":"Rafīqah Salīm Ḥammūd","book":"Taʻlīm fī al-Baḥrayn"}
This is the Java file that I am supposed to write my code in (CombineBooks.java):
package org.hwone;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.GenericOptionsParser;
//TODO import necessary components
/*
* Modify this file to combine books from the same author into a
* single JSON object.
* i.e. {"author": "Tobias Wells", "books": [{"book":"A die in the country"},{"book": "Dinky died"}]}
* Be aware that this may run on any number of nodes!
*
*/
public class CombineBooks {
//TODO define variables and implement necessary components
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args)
.getRemainingArgs();
if (otherArgs.length != 2) {
System.err.println("Usage: CombineBooks <in> <out>");
System.exit(2);
}
//TODO implement CombineBooks
Job job = new Job(conf, "CombineBooks");
//TODO implement CombineBooks
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
My task is to create a Hadoop program in “CombineBooks.java”, returned in the “question-2” directory. The program should do the following: given the input author-book tuples, the map-reduce program should produce a JSON object which contains all the books from the same author in a JSON array, i.e.
{"author": "Tobias Wells", "books":[{"book":"A die in the country"},{"book": "Dinky died"}]}
Any idea how it can be done?
First, the JSON classes you are trying to work with are not available to you out of the box. To solve this:
Go here and download as zip: https://github.com/douglascrockford/JSON-java
Extract to your sources folder in subdirectory org/json/*
Next, the first line of your code declares the package "org.json", which is incorrect; you should create a separate package, for instance "my.books".
Third, using a combiner here is useless.
Here's the code I ended up with; it works and solves your problem:
package my.books;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.json.*;
public class CombineBooks {
public static class Map extends Mapper<LongWritable, Text, Text, Text>{
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException{
String author;
String book;
String line = value.toString();
String[] tuple = line.split("\\n");
try{
for(int i=0;i<tuple.length; i++){
JSONObject obj = new JSONObject(tuple[i]);
author = obj.getString("author");
book = obj.getString("book");
context.write(new Text(author), new Text(book));
}
}catch(JSONException e){
e.printStackTrace();
}
}
}
public static class Reduce extends Reducer<Text,Text,NullWritable,Text>{
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException{
try{
JSONObject obj = new JSONObject();
JSONArray ja = new JSONArray();
for(Text val : values){
JSONObject jo = new JSONObject().put("book", val.toString());
ja.put(jo);
}
obj.put("books", ja);
obj.put("author", key.toString());
context.write(NullWritable.get(), new Text(obj.toString()));
}catch(JSONException e){
e.printStackTrace();
}
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
if (args.length != 2) {
System.err.println("Usage: CombineBooks <in> <out>");
System.exit(2);
}
Job job = new Job(conf, "CombineBooks");
job.setJarByClass(CombineBooks.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Here's the folder structure of my project:
src
src/my
src/my/books
src/my/books/CombineBooks.java
src/org
src/org/json
src/org/json/zip
src/org/json/zip/BitReader.java
...
src/org/json/zip/None.java
src/org/json/JSONStringer.java
src/org/json/JSONML.java
...
src/org/json/JSONException.java
Here's the input
[localhost:CombineBooks]$ hdfs dfs -cat /example.txt
{"author":"author1", "book":"book1"}
{"author":"author1", "book":"book2"}
{"author":"author1", "book":"book3"}
{"author":"author2", "book":"book4"}
{"author":"author2", "book":"book5"}
{"author":"author3", "book":"book6"}
The command to run:
hadoop jar ./bookparse.jar my.books.CombineBooks /example.txt /test_output
Here's the output:
[pivhdsne:CombineBooks]$ hdfs dfs -cat /test_output/part-r-00000
{"books":[{"book":"book3"},{"book":"book2"},{"book":"book1"}],"author":"author1"}
{"books":[{"book":"book5"},{"book":"book4"}],"author":"author2"}
{"books":[{"book":"book6"}],"author":"author3"}
You can use one of three options to put the org.json.* classes onto your cluster:
Pack the org.json.* classes into your jar file (this can easily be done using a GUI IDE). This is the option I used in my answer.
Put the jar file containing the org.json.* classes on each of the cluster nodes, in one of the CLASSPATH directories (see yarn.application.classpath).
Put the jar file containing org.json.* into HDFS (hdfs dfs -put <org.json jar> <hdfs path>) and use the job.addFileToClassPath call so that the jar is available to all of the tasks executing your job on the cluster. With my answer, you would add job.addFileToClassPath(new Path("<jar_file_on_hdfs_location>")); to the main method, as sketched below.
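For the third option, the driver change is a single extra call before submitting the job (the HDFS location below is the same placeholder used above):
Job job = new Job(conf, "CombineBooks");
// jar previously uploaded with: hdfs dfs -put <org.json jar> <hdfs path>
job.addFileToClassPath(new Path("<jar_file_on_hdfs_location>"));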
For splittable multi-line JSON, refer to:
https://github.com/alexholmes/json-mapreduce

Camel route loop not working

I am trying to insert JSON data into a MySQL database using Camel and Hibernate.
Everything is working except the loop.
for (Module module : modules) {
from("timer://foo?delay=10000")
.loop(7)//not working
.to(module.getUrl() + "/api/json")
.convertBodyTo(String.class)
.process(new Processor() {
@Override
public void process(Exchange exchange) throws Exception {
int index = (Integer)exchange.getProperty("CamelLoopIndex"); // not working
ObjectMapper mapper = new ObjectMapper();
JsonNode root = mapper.readTree(exchange.getIn().getBody().toString());
String[] lijst = {"lastBuild", "lastCompletedBuild", "lastFailedBuild", "lastStableBuild", "lastSuccessfulBuild", "lastUnstableBuild", "lastUnsuccessfulBuild"};
JSONObject obj = new JSONObject();
JsonNode node = root.get(lijst[index]);
JsonNode build = node.get("number");
obj.put("description", lijst[index]);
obj.put("buildNumber", build);
exchange.getIn().setBody(obj.toString());
}
})
.unmarshal(moduleDetail)
.to("hibernate:be.kdg.teamf.model.ModuleDetail")
.end();
}
When I debug, CamelLoopIndex remains 0, so it is not incremented each time the route goes through the loop.
All help is welcome!
In your case only the first instruction is processed within the scope of the loop: .to(module.getUrl() + "/api/json"). You can add more instructions to a loop using the Spring DSL, but I don't know how to declare the loop scope explicitly using the Java DSL. I hope experts will explain more about loop scope in the Java DSL.
As a workaround I suggest moving all per-iteration instructions to a separate direct: route, as sketched below.
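A sketch of that workaround (the direct: endpoint name and the URL are made up, and the processor body is reduced to reading the loop index):
from("timer://foo?delay=10000")
    .loop(7)
        .to("direct:handleBuild")   // everything that should run once per iteration lives in the called route
    .end();

from("direct:handleBuild")
    .to("http://some-module/api/json")   // placeholder for module.getUrl() + "/api/json"
    .convertBodyTo(String.class)
    .process(new Processor() {
        @Override
        public void process(Exchange exchange) throws Exception {
            // CamelLoopIndex travels on the same Exchange into the called route
            int index = exchange.getProperty("CamelLoopIndex", Integer.class);
            exchange.getIn().setBody("index=" + index);
        }
    });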
I can't reproduce your problem. This works:
from("restlet:http://localhost:9010}/loop?restletMethod=get")
.loop(7)
.process(new Processor() {
@Override
public void process(Exchange exchange) throws Exception {
int index = (int) exchange.getProperty("CamelLoopIndex");
exchange.getIn().setBody("index=" + index);
}
})
.convertBodyTo(String.class)
.end();
Output:
index=6