Weka: loading CSV file without headers - csv

How to load a CSV file without headers in Weka?
There are a few related questions, but none seems to get to the point.
MWE
Here is the test.csv file:
20,1,"+"
30,2,"+"
30,1,"+"
15,1,"-"
10,0,"-"
Here is the Test.java code:
// javac -Xlint -cp weka.jar Test.java && java -cp .:weka.jar Test
import weka.core.converters.CSVLoader;
import weka.core.Instances;
import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.Evaluation;
import java.io.File;
class Test
{
public static void main(String[] args) {
try {
CSVLoader loader = new CSVLoader();
loader.setOptions(new String[] {"-H"});
loader.setSource(new File("test.csv"));
Instances tr = loader.getDataSet();
tr.setClassIndex(tr.numAttributes() - 1);
Classifier m = (Classifier) new NaiveBayes();
m.buildClassifier(tr);
Evaluation eval = new Evaluation(tr);
eval.evaluateModel(m, tr);
System.out.println(eval.toSummaryString());
}
catch(Exception ex) {
System.out.println(ex);
}
}
}
When running, it only reports 4 instances, not 5. If I add headers, then it works correctly.
Correctly Classified Instances 4 100 %
Incorrectly Classified Instances 0 0 %
Kappa statistic 1
Mean absolute error 0.0065
Root mean squared error 0.0112
Relative absolute error 1.3088 %
Root relative squared error 2.2477 %
Total Number of Instances 4
Notice I have used:
loader.setOptions(new String[] {"-H"});
I have also tried the direct API loader.setNoHeaderRowPresent(true);, but it seems to not be available in Weka 3.6.13.
References:
CSVLoader API
EDIT: It turns out this was a problem in 3.6.13. The code works fine for 3.7.10.

I am not sure about 3.6.13, but the code for 3.7.10 shows that first row of data is added if setNoHeaderRowPresent is set true.
You are setting false, set it to true.Refrence from grepcode of CSVLoader
Set whether there is no header row in the data.
Parameters: b true if
there is no header row in the data
public void setNoHeaderRowPresent(boolean b) {
m_noHeaderRow = b; 293
}
if (m_noHeaderRow) {
m_rowBuffer.add(firstRow);
}
So in your code use
loader.setNoHeaderRowPresent(true)
and not loader.setNoHeaderRowPresent(false) to include first row in data set.

As a work-around, this reads the CSV file and passes it along as an ARFF file:
// javac -Xlint -cp weka.jar Test.java && java -cp .:weka.jar Test
import weka.core.converters.CSVLoader;
import weka.core.Instances;
import weka.classifiers.Classifier;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.Evaluation;
import java.io.FileReader;
import java.io.BufferedReader;
import java.io.StringReader;
import java.lang.StringBuffer;
class Test
{
public static void main(String[] args) {
try {
String filename = "test.csv";
BufferedReader br = new BufferedReader(new FileReader(filename));
String line = br.readLine();
int cols = line.length() - line.replace(",", "").length() + 1;
StringBuilder arff = new StringBuilder("#RELATION test\n");
for(int i = 0; i < cols-1; i++) {
arff.append("#ATTRIBUTE ");
arff.append(String.valueOf((char)(i + 'a')));
arff.append(" NUMERIC\n");
}
arff.append("#ATTRIBUTE class {+,-}\n");
arff.append("#DATA\n");
while(line != null) {
arff.append(line);
arff.append("\n");
line = br.readLine();
}
System.out.println(arff.toString());
Instances tr = new Instances(new StringReader(arff.toString()));
tr.setClassIndex(tr.numAttributes() - 1);
Classifier m = (Classifier) new NaiveBayes();
m.buildClassifier(tr);
Evaluation eval = new Evaluation(tr);
eval.evaluateModel(m, tr);
System.out.println(eval.toSummaryString());
}
catch(Exception ex) {
System.out.println(ex);
}
}
}

Related

Camel Bindy Streaming Payload and Writing to File

I have a route which supposes to read a huge XML file and then write a CSV file with a header. XML Record needs to be transformed first so I map it to java POJO and then marshal it again to write into a csv file.
I can't load all of the records in memory as the file contains more 200k records.
Issue: I am only seeing the last record being added to the CSV file. Not sure why it's not appending the data into the existing file.
Any idea how to make it work. The header is required in CSV.I am not seeing any other option to directly transform the stream and write headers along with to CSV without unmarshalling it to Pojo first. I tried using BeanIO as well, which requires me to add a Header record and not sure how that can be injected into a stream.
from("{{xml.files.route}}")
.split(body().tokenizeXML("EMPLOYEE", null))
.streaming()
.unmarshal().jacksonXml(Employee.class)
.marshal(bindyDataFormat)
.to("file://C:/Files/Test/emp/csv/?fileName=test.csv")
.end();
If I try to append into the existing file then CSV file appends headers to each iteration of records.
.to("file://C:/Files/Test/emp/csv/?fileName=test.csv&fileExist=append")
Your problem here is related to camel-bindy and not the file-component. It kinda expects you to marshal collection objects instead of individual objects hence if you marshal each object individually and have #CsvRecord(generateHeaderColumns = true ) on your Employee class then you'll get headers every time you marshal an individual Employee object.
You could set generateHeaderColumns to false and start the file with headers string manually. One way to obtain headers for Bindy annotated class is to get fields annotated with DataField using org.apache.commons.lang3.reflect.FieldUtils from apache-commons and construct headers string based on position, columnName and fieldName.
I usually prefer camel-stream over file-component when I need to stream something to a file but using file-component with appends probably works just as well.
Example:
package com.example;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import org.apache.camel.RoutesBuilder;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.dataformat.bindy.annotation.DataField;
import org.apache.camel.dataformat.bindy.csv.BindyCsvDataFormat;
import org.apache.camel.test.junit4.CamelTestSupport;
import org.apache.commons.lang3.reflect.FieldUtils;
import org.junit.Test;
public class ExampleTest extends CamelTestSupport {
#Test
public void testStreamEmployeesToCsvFile(){
List<Employee> body = new ArrayList<>();
body.add(new Employee("John", "Doe", 1965));
body.add(new Employee("Mary", "Sue", 1987));
body.add(new Employee("Gary", "Sue", 1991));
template.sendBody("direct:streamEmployeesToCSV", body);
}
#Override
protected RoutesBuilder createRouteBuilder() throws Exception {
return new RouteBuilder(){
#Override
public void configure() throws Exception {
BindyCsvDataFormat csvDataFormat = new BindyCsvDataFormat(Employee.class);
System.out.println(getCSVHeadersForClass(Employee.class, ","));
from("direct:streamEmployeesToCSV")
.setProperty("Employees", body())
// a bit hacky due to camel writing first entry and headers
// on the same line for some reason with (camel 2.25.2)
.setBody().constant("")
.to("file:target/testoutput?fileName=test.csv&fileExist=Override")
.setBody().constant(getCSVHeadersForClass(Employee.class, ","))
.to("stream:file?fileName=./target/testoutput/test.csv")
.split(exchangeProperty("Employees"))
.marshal(csvDataFormat)
.to("stream:file?fileName=./target/testoutput/test.csv")
.end()
.log("Done");
}
private String getCSVHeadersForClass(Class clazz, String separator ) {
Field[] fieldsArray = FieldUtils.getFieldsWithAnnotation(clazz, DataField.class);
List<Field> fields = new ArrayList<>(Arrays.asList(fieldsArray));
fields.sort(new Comparator<Field>(){
#Override
public int compare(Field lhsField, Field rhsField) {
DataField lhs = lhsField.getAnnotation(DataField.class);
DataField rhs = rhsField.getAnnotation(DataField.class);
return lhs.pos() < rhs.pos() ? -1 : (lhs.pos() > rhs.pos()) ? 1 : 0;
}
});
String[] fieldHeaders = new String[fields.size()];
for (int i = 0; i < fields.size(); i++) {
DataField dataField = fields.get(i).getAnnotation(DataField.class);
if(dataField.columnName().equals(""))
fieldHeaders[i] = fields.get(i).getName();
else
fieldHeaders[i] = dataField.columnName();
}
String csvHeaders = "";
for (int i = 0; i < fieldHeaders.length; i++) {
csvHeaders += fieldHeaders[i];
csvHeaders += i < fieldHeaders.length - 1 ? separator : "";
}
return csvHeaders;
}
};
}
}
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
<version>${apache-commons.version}</version>
</dependency>

ElasticSearch : Exception in thread "main" NoNodeAvailableException[None of the configured nodes are available

I am trying to use the Java Bulk API for creating the documents in ElasticSearch. I am using a JSON file as the bulk input. When I execute, I get the following exception:
Exception in thread "main" NoNodeAvailableException[None of the configured nodes are available: []]
at org.elasticsearch.client.transport.TransportClientNodesService.ensureNodesAreAvailable(TransportClientNodesService.java:280)
at org.elasticsearch.client.transport.TransportClientNodesService.execute(TransportClientNodesService.java:197)
at org.elasticsearch.client.transport.support.TransportProxyClient.execute(TransportProxyClient.java:55)
at org.elasticsearch.client.transport.TransportClient.doExecute(TransportClient.java:272)
at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:347)
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:85)
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:59)
at bulkApi.test.App.main(App.java:92)
This is the java code.
package bulkApi.test;
import java.io.BufferedReader;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.common.transport.TransportAddress;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;
/**
* Hello world!
*
*/
public class App
{
public static void main( String[] args ) throws IOException, ParseException
{
System.out.println( "Hello World!" );
// configuration setting
Settings settings = Settings.settingsBuilder()
.put("cluster.name", "test-cluster").build();
TransportClient client = TransportClient.builder().settings(settings).build();
String hostname = "localhost";
int port = 9300;
client.addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName(hostname),port));
// bulk API
BulkRequestBuilder bulkBuilder = client.prepareBulk();
long bulkBuilderLength = 0;
String readLine = "";
String index = "testindex";
String type = "testtype";
String _id = null;
BufferedReader br = new BufferedReader(new InputStreamReader(new DataInputStream(new FileInputStream("/home/nilesh/Shiney/Learn/elasticSearch/ES/test/parseAccount.json"))));
JSONParser parser = new JSONParser();
while((readLine = br.readLine()) != null){
// it will skip the metadata line which is supported by bulk insert format
if (readLine.startsWith("{\"index")){
continue;
}
else {
Object json = parser.parse(readLine);
if(((JSONObject)json).get("account_number")!=null){
_id = String.valueOf(((JSONObject)json).get("account_number"));
System.out.println(_id);
}
//_id is unique field in elasticsearch
JSONObject jsonObject = (JSONObject) json;
bulkBuilder.add(client.prepareIndex(index, type, String.valueOf(_id)).setSource(jsonObject));
bulkBuilderLength++;
try {
if(bulkBuilderLength % 100== 0){
System.out.println("##### " + bulkBuilderLength + "data indexed.");
BulkResponse bulkRes = bulkBuilder.execute().actionGet();
if(bulkRes.hasFailures()){
System.out.println("##### Bulk Request failure with error: " + bulkRes.buildFailureMessage());
}
bulkBuilder = client.prepareBulk();
}
} catch (Exception e) {
e.printStackTrace();
}
}
}
br.close();
if(bulkBuilder.numberOfActions() > 0){
System.out.println("##### " + bulkBuilderLength + " data indexed.");
BulkResponse bulkRes = bulkBuilder.execute().actionGet();
if(bulkRes.hasFailures()){
System.out.println("##### Bulk Request failure with error: " + bulkRes.buildFailureMessage());
}
bulkBuilder = client.prepareBulk();
}
}
}
Couple of things to check:
Check that your cluster name is same as the one defined in
elastic.yml
Try creating the index with the index name
('testindex' in your case) with PUT api from any REST Client
(ex:AdvancedRestClient) http://localhost:9200/testindex (Your
request has to be successful i.e status code==200 )
Run your program after you create the index with the above PUT request.

Processing JSON using java Mapreduce

I am new to hadoop mapreduce
I have input text file where data has been stored as follow. Here are only a few tuples (data.txt)
{"author":"Sharīf Qāsim","book":"al- Rabīʻ al-manshūd"}
{"author":"Nāṣir Nimrī","book":"Adīb ʻAbbāsī"}
{"author":"Muẓaffar ʻAbd al-Majīd Kammūnah","book":"Asmāʼ Allāh al-ḥusná al-wāridah fī muḥkam kitābih"}
{"author":"Ḥasan Muṣṭafá Aḥmad","book":"al- Jabhah al-sharqīyah wa-maʻārikuhā fī ḥarb Ramaḍān"}
{"author":"Rafīqah Salīm Ḥammūd","book":"Taʻlīm fī al-Baḥrayn"}
This is my java file that I am supposed to write my code in (CombineBooks.java)
package org.hwone;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.GenericOptionsParser;
//TODO import necessary components
/*
* Modify this file to combine books from the same other into
* single JSON object.
* i.e. {"author": "Tobias Wells", "books": [{"book":"A die in the country"},{"book": "Dinky died"}]}
* Beaware that, this may work on anynumber of nodes!
*
*/
public class CombineBooks {
//TODO define variables and implement necessary components
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args)
.getRemainingArgs();
if (otherArgs.length != 2) {
System.err.println("Usage: CombineBooks <in> <out>");
System.exit(2);
}
//TODO implement CombineBooks
Job job = new Job(conf, "CombineBooks");
//TODO implement CombineBooks
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
My task is to create a Hadoop program in “CombineBooks.java”
returned in the “question-2” directory. The program should do
the following: Given the input author-book tuples, map-reduce
program should procude a JSON object which contains all the
books from same author in a JSON array, i.e.
{"author": "Tobias Wells", "books":[{"book":"A die in the country"},{"book": "Dinky died"}]}
Any idea how it can be done ?
First, the JSON objects you are trying to work with are not available for you. To solve this:
Go here and download as zip: https://github.com/douglascrockford/JSON-java
Extract to your sources folder in subdirectory org/json/*
Next, the first line of your code makes a package "org.json", which is incorrect, you shold create a separate package, for instance "my.books".
Third, using combiner here is useless.
Here's the code I ended up with, it works and solves your problem:
package my.books;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.json.*;
import javax.security.auth.callback.TextInputCallback;
public class CombineBooks {
public static class Map extends Mapper<LongWritable, Text, Text, Text>{
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException{
String author;
String book;
String line = value.toString();
String[] tuple = line.split("\\n");
try{
for(int i=0;i<tuple.length; i++){
JSONObject obj = new JSONObject(tuple[i]);
author = obj.getString("author");
book = obj.getString("book");
context.write(new Text(author), new Text(book));
}
}catch(JSONException e){
e.printStackTrace();
}
}
}
public static class Reduce extends Reducer<Text,Text,NullWritable,Text>{
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException{
try{
JSONObject obj = new JSONObject();
JSONArray ja = new JSONArray();
for(Text val : values){
JSONObject jo = new JSONObject().put("book", val.toString());
ja.put(jo);
}
obj.put("books", ja);
obj.put("author", key.toString());
context.write(NullWritable.get(), new Text(obj.toString()));
}catch(JSONException e){
e.printStackTrace();
}
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
if (args.length != 2) {
System.err.println("Usage: CombineBooks <in> <out>");
System.exit(2);
}
Job job = new Job(conf, "CombineBooks");
job.setJarByClass(CombineBooks.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Here's the folder structure of my project:
src
src/my
src/my/books
src/my/books/CombineBooks.java
src/org
src/org/json
src/org/json/zip
src/org/json/zip/BitReader.java
...
src/org/json/zip/None.java
src/org/json/JSONStringer.java
src/org/json/JSONML.java
...
src/org/json/JSONException.java
Here's the input
[localhost:CombineBooks]$ hdfs dfs -cat /example.txt
{"author":"author1", "book":"book1"}
{"author":"author1", "book":"book2"}
{"author":"author1", "book":"book3"}
{"author":"author2", "book":"book4"}
{"author":"author2", "book":"book5"}
{"author":"author3", "book":"book6"}
The command to run:
hadoop jar ./bookparse.jar my.books.CombineBooks /example.txt /test_output
Here's the output:
[pivhdsne:CombineBooks]$ hdfs dfs -cat /test_output/part-r-00000
{"books":[{"book":"book3"},{"book":"book2"},{"book":"book1"}],"author":"author1"}
{"books":[{"book":"book5"},{"book":"book4"}],"author":"author2"}
{"books":[{"book":"book6"}],"author":"author3"}
You can use on of the three options to put the org.json.* classes into your cluster:
Pack the org.json.* classes into your jar file (can easily be done using GUI IDE). This is the option I used in my answer
Put the jar file containing org.json.* classes on each of the cluster nodes into one of the CLASSPATH directories (see yarn.application.classpath)
Put the jar file containing org.json.* into HDFS (hdfs dfs -put <org.json jar> <hdfs path>) and use job.addFileToClassPath call for this jar file to be available for all of the tasks executing your job on the cluster. In my answer you should add job.addFileToClassPath(new Path("<jar_file_on_hdfs_location>")); to the main
Refer for splittable multi-line JSON:
https://github.com/alexholmes/json-mapreduce

Extract first line of CSV file in Pig

I have several CSV files and the header is always the first line in the file. What's the best way to get that line out of the CSV file as a string in Pig? Preprocessing with sed, awk etc is not an option.
I've tried loading the file with regular PigStorage and the Piggy bank CsvLoader, but its not clear to me how I can get that first line, if at all.
I'm open to writing an UDF, if that's what it takes.
Disclaimer: I'm not great with Java.
You are going to need a UDF. I'm not sure exactly what you are asking for, but this UDF will take a series of CSV files and turn them into maps, where the keys are the values at the top of the file. This should hopefully be enough of a skeleton so that you can change it into what you want.
The couple of tests I've done remotely and locally indicate that this will work.
package myudfs;
import java.io.IOException;
import org.apache.pig.LoadFunc;
import java.util.Map;
import java.util.HashMap;
import java.util.ArrayList;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.PigException;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
public class ExampleCSVLoader extends LoadFunc {
protected RecordReader in = null;
private String fieldDel = "" + '\t';
private Map<String, String> outputMap = null;
private TupleFactory mTupleFactory = TupleFactory.getInstance();
// This stores the fields that are defined in the first line of the file
private ArrayList<Object> topfields = null;
public ExampleCSVLoader() {}
public ExampleCSVLoader(String delimiter) {
this();
this.fieldDel = delimiter;
}
#Override
public Tuple getNext() throws IOException {
try {
boolean notDone = in.nextKeyValue();
if (!notDone) {
outputMap = null;
topfields = null;
return null;
}
String value = in.getCurrentValue().toString();
String[] values = value.split(fieldDel);
Tuple t = mTupleFactory.newTuple(1);
ArrayList<Object> tf = new ArrayList<Object>();
int pos = 0;
for (int i = 0; i < values.length; i++) {
if (topfields == null) {
tf.add(values[i]);
} else {
readField(values[i], pos);
pos = pos + 1;
}
}
if (topfields == null) {
topfields = tf;
t = mTupleFactory.newTuple();
} else {
t.set(0, outputMap);
}
outputMap = null;
return t;
} catch (InterruptedException e) {
int errCode = 6018;
String errMsg = "Error while reading input";
throw new ExecException(errMsg, errCode,
PigException.REMOTE_ENVIRONMENT, e);
}
}
// Applies foo to the appropriate value in topfields
private void readField(String foo, int pos) {
if (outputMap == null) {
outputMap = new HashMap<String, String>();
}
outputMap.put((String) topfields.get(pos), foo);
}
#Override
public InputFormat getInputFormat() {
return new TextInputFormat();
}
#Override
public void prepareToRead(RecordReader reader, PigSplit split) {
in = reader;
}
#Override
public void setLocation(String location, Job job)
throws IOException {
FileInputFormat.setInputPaths(job, location);
}
}
Sample output loading a directory with:
csv1.in csv2.in
------- ---------
A|B|C D|E|F
Hello|This|is PLEASE|WORK|FOO
FOO|BAR|BING OR|EVERYTHING|WILL
BANG|BOSH BE|FOR|NAUGHT
Produces this output:
A: {M: map[]}
()
([D#PLEASE,E#WORK,F#FOO])
([D#OR,E#EVERYTHING,F#WILL])
([D#BE,E#FOR,F#NAUGHT])
()
([A#Hello,B#This,C#is])
([A#FOO,B#BAR,C#BING])
([A#BANG,B#BOSH])
The ()s are the top lines of the file. getNext() requires that we return something, otherwise the file will stop being processed. Therefore they return a null schema.
If your CSV comply with CSV conventions of Excel 2007 you can use already available loader from Piggybank http://svn.apache.org/viewvc/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java?view=markup
It has an option to skip the CSV header SKIP_INPUT_HEADER

Hadoop: Unable to run mapreduce program ..java.io.IOException: error=12

Im trying to run a mapreduce program in hadoop. Basically it takes in a text file as input in which each line is a json text. Im using simple json to parse this data in my mapper and the reducer does some other stuff. I have included the simple json jar file in hadoop/lib folder. here is the code below
package org.myorg;
import java.io.IOException;
import java.util.Iterator;
import java.util.*;
import org.json.simple.JSONArray;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class ALoc
{
public static class AMapper extends Mapper<Text, Text, Text, Text>
{
private Text kword = new Text();
private Text vword = new Text();
JSONParser parser = new JSONParser();
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException{
try {
String line = value.toString();
Object obj = parser.parse(line);
JSONObject jsonObject = (JSONObject) obj;
String val = (String)jsonObject.get("m1") + "," + (String)jsonObject.get("m3");
kword.set((String)jsonObject.get("m0"));
vword.set(val);
context.write(kword, vword);
}
catch (IOException e) {
e.printStackTrace();
}
catch (ParseException e) {
e.printStackTrace();
}
}
}
public static class CountryReducer
extends Reducer<Text,Text,Text,Text>
{
private Text result = new Text();
public void reduce(Text key, Iterable<Text> values,
Context context
) throws IOException, InterruptedException
{
int ccount = 0;
HashMap<Text, Integer> hm = new HashMap<Text, Integer>();
for (Text val : values)
{
if(hm.containsKey(val)){
Integer n = (Integer)hm.get(val);
hm.put(val, n+1);
}else{
hm.put(val, new Integer(1));
}
}
Set set = hm.entrySet();
Iterator i = set.iterator();
String agr = "";
while(i.hasNext()) {
Map.Entry me = (Map.Entry)i.next();
agr += "|" + me.getKey() + me.getValue();
}
result.set(agr);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception
{
Configuration conf = new Configuration();
Job job = new Job(conf, "ALoc");
job.setJarByClass(ALoc.class);
job.setMapperClass(AMapper.class);
job.setReducerClass(CountryReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(TextInputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
When I try to run the job. It gives the following error.
I am running this in a aws micro instance single node.
I have been following this tutorial http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
hadoop#domU-18-11-19-02-92-8E:/$ bin/hadoop jar ALoc.jar org.myorg.ALoc /user/hadoop/adata /user/hadoop/adata-op5 -D mapred.reduce.tasks=16
13/02/12 08:39:50 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/02/12 08:39:50 INFO input.FileInputFormat: Total input paths to process : 1
13/02/12 08:39:50 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/02/12 08:39:50 WARN snappy.LoadSnappy: Snappy native library not loaded
13/02/12 08:39:51 INFO mapred.JobClient: Running job: job_201302120714_0006
13/02/12 08:39:52 INFO mapred.JobClient: map 0% reduce 0%
13/02/12 08:40:10 INFO mapred.JobClient: Task Id : attempt_201302120714_0006_m_000000_0, Status : FAILED
java.lang.RuntimeException: Error while running command to get file permissions : java.io.IOException: Cannot run program "/bin/ls": java.io.IOException: error=12, Cannot allocate memory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:475)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:200)
at org.apache.hadoop.util.Shell.run(Shell.java:182)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)
at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:710)
at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:443)
at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getOwner(RawLocalFileSystem.java:426)
at org.apache.hadoop.mapred.TaskLog.obtainLogDirOwner(TaskLog.java:267)
at org.apache.hadoop.mapred.TaskLogsTruncater.truncateLogs(TaskLogsTruncater.java:124)
at org.apache.hadoop.mapred.Child$4.run(Child.java:260)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
at java.lang.UNIXProcess.<init>(UNIXProcess.java:164)
at java.lang.ProcessImpl.start(ProcessImpl.java:81)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:468)
... 15 more
at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:468)
at org.apache.hadoop.fs.RawLocalFileSystem$RawLocalFileStatus.getOwner(RawLocalFileSystem.java:426)
at org.apache.hadoop.mapred.TaskLog.obtainLogDirOwner(TaskLog.java:267)
at org.apache.hadoop.mapred.TaskLogsTruncater.truncateLogs(TaskLogsTruncater.java:124)
at org.apache.hadoop.mapred.Child$4.run(Child.java:260)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
I guess you must be trying hadoop on Micro instance which have very less memory (~700MB).
Try increasing the HADOOP Heapsize parameter (in hadoop/conf/hadoop-env.sh) .. as the basic reason is shortage of memory required to fork processes