In Hadoop MapReduce, how to know the end of a map task or end of file for the mapper - hadoop2

In a MapReduce job, the mapper processes the input file from the first line through the n-th line. I need to find out when the mapper starts processing the n-th line: I want to perform some action when the mapper is executing the last line of the input, so I need some indication to the mapper that it has reached the last line of the file. Is there any method in the Hadoop library that can achieve this?
I am using Hadoop 2.4.

It seems you're trying to perform some cleanup in the mapper before the task is destroyed. Is that correct? If so, would overriding org.apache.hadoop.mapreduce.Mapper#cleanup(Context) suffice?
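If so, here is a minimal sketch (class and field names are invented for illustration): cleanup(Context) runs exactly once per map task, after the last record of the task's input split has been passed to map(). One caveat: with a splittable input, the end of the split is not necessarily the end of the file, since a large file may be divided among several mappers, each of which gets its own cleanup call.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LastLineMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    // Hypothetical helper state: remember the most recent line seen.
    private final Text lastLine = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        lastLine.set(value); // track the latest record
        // ... normal per-record processing and context.write(...) here ...
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Runs once, after the final record of this task's split has been
        // mapped - the natural hook for "last line" / end-of-task actions.
        context.write(lastLine, new LongWritable(1L));
    }
}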

Related

LDA using Mallet

I ran the file SimpleLDA.java and got:
exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 0
at cc.mallet.topics.SimpleLDA.main(SimpleLDA.java:560)
It looks like you are using the command line interface without any parameters.
If you look at the source file at https://github.com/mimno/Mallet/blob/master/src/cc/mallet/topics/SimpleLDA.java, you can see that line 560 is expecting to find that there is a command line argument specifying a file at the first position in the arguments array.
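As a simplified illustration of the failure mode (this is not Mallet's actual code), accessing args[0] on an empty argument array throws exactly this exception:
public class ArgsDemo {
    public static void main(String[] args) {
        // With no command line arguments, args.length == 0,
        // so this throws java.lang.ArrayIndexOutOfBoundsException: 0.
        String trainingFile = args[0];
        System.out.println("Reading instances from " + trainingFile);
    }
}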
If you need to train a topic model from the command line, there is information here: http://mallet.cs.umass.edu/topics.php
The interface you are using is intended as a base for constructing new, custom topic models in Java. Without significant programming experience in Java this will be frustrating and unlikely to succeed.

spring batch: Dump a set of queries over a database in parallel to flat files

My scenario, drilled down to its essence, is as follows: I have a config file containing a set of SQL queries whose result sets need to be exported as CSV files.
Since some queries may return billions of rows, and because something may interrupt the process (a bug, a crash, ...), I want to use a framework such as Spring Batch, which gives me restartability and job monitoring.
I am using a file-based H2 database for persisting Spring Batch jobs.
So, here are my questions:
Upon creating a Job, I need to provide my RowMapper with some initial configuration. So what happens when a job needs to be restarted after, e.g., a crash? Concretely:
Is the state of the RowMapper automatically persisted, so that upon restart Spring Batch will try to restore the object from its database, or
will the RowMapper object that is part of the original Spring Batch XML config file be used, or
do I have to maintain the RowMapper's state myself using the step's/job's ExecutionContext?
The above question is related to whether there is magic going on when using the Spring Batch XML configuration, or whether I could just as well create all these beans programmatically:
Since I need to parse my own config format into a Spring Batch job config, I would rather just use Spring Batch's Java classes (beans) and fill them in appropriately, rather than attempting to manually write out valid XML. However, if my Job crashes, I would be creating all the beans myself again. Does Spring Batch automagically restore the Job state from its database?
If I really need XML, is there a way to serialize a spring-batch JobRepository (or one of these objects) as a spring batch XML config?
Right now, I have tried to configure my Step with the following code, but I am unsure whether this is the proper way to do it:
Is TaskletStep the way to go?
Is the way I create the chunked reader/writer correct, or is there some other object which I should use instead?
I would have assumed that opening the reader and writer would occur automatically as part of the JobExecution, but if I don't open these resources prior to running the Job, I get an exception telling me that I need to open them first. Maybe I need to create some other object that manages the resources (the JDBC connection and the file handle)?
JdbcCursorItemReader<Foobar> itemReader = new JdbcCursorItemReader<Foobar>();
itemReader.setSql(sqlStr);
itemReader.setDataSource(dataSource);
itemReader.setRowMapper(rowMapper);
itemReader.afterPropertiesSet();

ExecutionContext executionContext = new ExecutionContext();
itemReader.open(executionContext);

FlatFileItemWriter<String> itemWriter = new FlatFileItemWriter<String>();
itemWriter.setLineAggregator(new PassThroughLineAggregator<String>());
itemWriter.setResource(outResource);
itemWriter.afterPropertiesSet();
itemWriter.open(executionContext);

int commitInterval = 50000;
CompletionPolicy completionPolicy = new SimpleCompletionPolicy(commitInterval);
RepeatTemplate repeatTemplate = new RepeatTemplate();
repeatTemplate.setCompletionPolicy(completionPolicy);
RepeatOperations repeatOperations = repeatTemplate;

ChunkProvider<Foobar> chunkProvider = new SimpleChunkProvider<Foobar>(itemReader, repeatOperations);
ItemProcessor<Foobar, String> itemProcessor = new ItemProcessor<Foobar, String>() {
    /* custom implementation of process() */
};
ChunkProcessor<Foobar> chunkProcessor = new SimpleChunkProcessor<Foobar, String>(itemProcessor, itemWriter);
// Note: the tasklet's type parameter must match the chunk provider's item type.
Tasklet tasklet = new ChunkOrientedTasklet<Foobar>(chunkProvider, chunkProcessor); //new SplitFilesTasklet();

TaskletStep taskletStep = new TaskletStep();
taskletStep.setName(taskletName);
taskletStep.setJobRepository(jobRepository);
taskletStep.setTransactionManager(transactionManager);
taskletStep.setTasklet(tasklet);
taskletStep.afterPropertiesSet();

job.addStep(taskletStep);
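On the last point, a minimal sketch (assuming the Spring Batch 2.x TaskletStep API): instead of opening the reader and writer by hand with your own ExecutionContext, you can register them as streams on the step. Both JdbcCursorItemReader and FlatFileItemWriter implement ItemStream, so the step then opens, updates, and closes them as part of its own lifecycle and persists their read/write position in the step's ExecutionContext:
// Replaces the manual itemReader.open(...) / itemWriter.open(...) calls above;
// ItemStream is org.springframework.batch.item.ItemStream.
taskletStep.setStreams(new ItemStream[] { itemReader, itemWriter });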
Most of your questions are really complex, and it is difficult to give a good answer without writing a long paper.
I'm as new to Spring Batch as you, and I found a lot of really useful info - and all the answers to your questions - by reading Spring Batch in Action: it's complete, well explained, full of examples, and covers all aspects of the framework (readers/writers/processors, job/tasklet/chunk lifecycle and persistence, transaction/resource management, job flow, integration with other services, partitioning, restarting/retry, failure management, and a lot of other interesting things).
Hope this helps.
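On the restart question specifically, a minimal sketch (class and key names are invented for illustration): Spring Batch does not persist arbitrary beans such as a RowMapper; what it saves and restores across a restart is the ExecutionContext, so custom state must be written there explicitly, typically by implementing ItemStream:
import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.ItemStream;
import org.springframework.batch.item.ItemStreamException;

// Hypothetical holder for whatever state the RowMapper needs across restarts.
public class StatefulMapperSupport implements ItemStream {

    private static final String KEY = "my.rowMapper.rowsMapped"; // invented key
    private long rowsMapped = 0;

    public void open(ExecutionContext ctx) throws ItemStreamException {
        // Called at step (re)start: restore previously saved state, if any.
        if (ctx.containsKey(KEY)) {
            rowsMapped = ctx.getLong(KEY);
        }
    }

    public void update(ExecutionContext ctx) throws ItemStreamException {
        // Called before each commit: save state so a restart can resume here.
        ctx.putLong(KEY, rowsMapped);
    }

    public void close() throws ItemStreamException {
    }
}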

Write to Log file in Flex 4.6

Is there any way to write to a text file in Flex 4.6? It's a desktop app for AIR. I would like to write the data from several arrays, as well as the time and date.
Threw a simple logger together for this test project:
http://www.shaunhusain.com/DrawTextRandomly/srcview/
it's in src/util/Logger.as
As is, it marks the time the first log entry is made and then counts the time from then to each subsequent log entry, outputting it along with the logged string. It also outputs the time difference from the last log entry, so you can get some idea of how long a function/algorithm/operation takes to complete. Feel free to grab this; since it's just a test snippet I should probably post a license on my code, so I'll update the src folders with a license.txt containing the MIT License: http://www.opensource.org/licenses/mit-license.html
You can re-purpose this class and have it write using a FileStream/File object in Flex. File itself is basically a handle to a particular file; FileStream will allow you to call writeUTFBytes(string) to write data to a file.
Code would be something like this:
// Note: in AIR, the File constructor needs an absolute URL or native path,
// so resolve the file name against a known directory instead.
var file:File = File.applicationStorageDirectory.resolvePath("logfile.txt");
var fs:FileStream = new FileStream();
fs.open(file, FileMode.WRITE);
fs.writeUTFBytes("Some output");
fs.close();
http://help.adobe.com/en_US/FlashPlatform/reference/actionscript/3/flash/filesystem/FileStream.html
The as3corelib has a FileTarget class that can be used with the Flex Logging API.
This documentation page explains how to use the logging API.

SSIS: How to read WebSphere MQ, transform, and write to flat file?

I have data on a WebSphere MQ queue. I've written a script task to read the data, and I can output it to a variable or a text file. But I want to use that as input to a dataflow step and transform the data. The ultimate destination is a flat file.
Is there a way to read the variable as a source into a dataflow step? I could write the MQ data to a text file and read the text file in the dataflow, but that seems like a lot of overhead. Or I could skip the dataflow altogether and write all the transformations in a script (but then why bother with SSIS in the first place?)
Is there a way to write a Raw File out of the script step, to pass into the dataflow component?
Any ideas appreciated!
If you've got a script that consumes the web service, you can skip all the intermediary outputs and simply use it as a source in your data flow.
Drag a Data Flow Task onto the canvas and then add a Script Component. Instead of selecting Transformation (the last option), select Source.
Double-Click on the Script Component and choose the Input and Output Properties. Under Output 0, select Output Columns and click Add Column for however many columns the web service has. Name them appropriately and be certain to correctly define their metadata.
Once the columns are defined, click back to the Script tab, select your language, and edit the script. Take all of your existing code that consumes the service; we'll use it here.
In the CreateNewOutputRows method, you will need to iterate through the results of the Websphere MQ request. For each row that is returned, you would apply the following pattern.
public override void CreateNewOutputRows()
{
    // TODO: Add code here or in the PreExecute to fill the iterable object, mqcollection
    foreach (var row in mqcollection)
    {
        // Adds a new row into the downstream buffer
        Output0Buffer.AddRow();

        // Assign all the data to the correct locations
        Output0Buffer.Column = row.Column;
        Output0Buffer.Column1 = row.Column1;

        // Handle nulls appropriately
        if (string.IsNullOrEmpty(row.Column2))
        {
            Output0Buffer.Column2_IsNull = true;
        }
        else
        {
            Output0Buffer.Column2 = row.Column2;
        }
    }
}
You must handle nulls via the _IsNull attribute or your script will blow up. It's tedious work versus a normal source but you'll be far more efficient, faster and consume fewer resources than dumping to disk or some other staging mechanism.
Since I ran into some additional "gotchas", I thought I'd post my final solution.
The script I am using does not call a web service; it connects directly to the WebSphere queue and reads it. However, in order to do this, I have to add a reference to amqmdnet.dll.
You can add a reference to a Script Task (which sits on the Control Flow canvas), but not to a Script Component (which is part of the Data Flow).
So I have a Script Task, with reference and code to read the contents of the queue. Each line in the queue is just a fixed width record, and each is added to a List. At the end, the List is put into a Read/Write object variable declared at the package level.
The Script Task feeds into a Data Flow Task. The first component of the Data Flow is a Script Component, created as a Source, as billinkc describes above. This script casts the object variable back to a list, then parses each item in the list into fields in the Output Buffer.
Various split and transform tasks take over from there.
Try using the Q program available in the MA01 MQ SupportPac instead of your script.

Perl: Can't call method "parse_html_string" on unblessed reference

I'm trying to use the HTML::Grabber module to parse HTML in Perl. It works when I just use it in my main process, but it throws an error when I attempt to use it with threading.
Specifically, I got this error:
Thread 1 terminated abnormally: Can't call method "parse_html_string"
on unblessed reference at /usr/local/ActivePerl-5.10/site/lib/HTML/Grabber.pm line 79.
The error points to where the Grabber object is created:
$mech->get($link);
$dom = HTML::Grabber->new(html => $mech->content); #at this point
Any idea how to fix this weird problem?
The parse_html_string method is called on an XML::LibXML parser object.
XML::LibXML seems to have mixed support for threads:
http://search.cpan.org/~shlomif/XML-LibXML-1.78/LibXML.pod#THREAD_SUPPORT
What is probably happening is that HTML::Grabber is creating the parser object when it is imported by your script in the main thread. Then you create a child thread, and since XML::LibXML does not clone between threads, the object disappears. You will need to do a runtime load of HTML::Grabber with require in the thread after it is spawned.
If that is not the case, you will have to boil down your problem to a small example and post the code here.