Flink KeyedStream generates duplicate results with the same key and window timestamp

Here is my Flink job workflow:
DataStream<FlinkEvent> events = env
    .addSource( consumer )
    .flatMap( ... )
    .assignTimestampsAndWatermarks( new EventTsExtractor() );

DataStream<SessionStatEvent> sessionEvents = events
    .keyBy( new KeySelector<FlinkEvent, Tuple2<String, String>>()
    {
        @Override
        public Tuple2<String, String> getKey( FlinkEvent value ) throws Exception {
            return Tuple2.of( value.getF0(), value.getSessionID() );
        }
    } )
    .window( TumblingEventTimeWindows.of( Time.minutes( 2 ) ) )
    .allowedLateness( Time.seconds( 10 ) )
    .aggregate( new SessionStatAggregator(), new SessionStatProcessor() );
/* ... */
sessionEvents.addSink( esSinkBuilder.build() );
First I encountered
java.lang.Exception: org.apache.flink.streaming.runtime.tasks.ExceptionInChainedOperatorException: Could not forward element to next operator
in the flatMap operator and the task kept restarting. I observed many duplicate results with different values for the same key and window timestamp.
Q1: I guess the duplicates were caused by the downstream operators consuming the same messages again after the job restarted. Am I right?
I resubmitted the job after fixing the ExceptionInChainedOperatorException problem. I observed duplicates in the first time window again, and after that the job seemed to work correctly (one result per time window per key).
Q2: Where did the duplicates come from?

... there should be one result per key for one window
This is not (entirely) correct. Because of the allowedLateness, any late events (within the period of allowed lateness) will cause late (or in other words, extra) firings of the relevant windows. With the default EventTimeTrigger (which you appear to be using), each late event causes an additional window firing, and an updated window result will be emitted.
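If you need exactly one emitted result per key and window, one option (a sketch, not necessarily what the original job should do) is to drop allowedLateness and route late events to a side output instead of re-firing the window. This assumes Flink 1.3+ (OutputTag / sideOutputLateData) and a hypothetical EventKeySelector equivalent to the anonymous KeySelector above:
final OutputTag<FlinkEvent> lateTag = new OutputTag<FlinkEvent>("late-events") {};

SingleOutputStreamOperator<SessionStatEvent> sessionEvents = events
    .keyBy( new EventKeySelector() )                       // hypothetical: same Tuple2 key as above
    .window( TumblingEventTimeWindows.of( Time.minutes( 2 ) ) )
    .sideOutputLateData( lateTag )                         // instead of allowedLateness(...)
    .aggregate( new SessionStatAggregator(), new SessionStatProcessor() );

// Late events are now available separately instead of triggering extra window firings.
DataStream<FlinkEvent> lateEvents = sessionEvents.getSideOutput( lateTag );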

This is how Flink achieves exactly-once semantics: in case of failures, Flink replays events from the last successful checkpoint. It is important to note that exactly-once means affecting state exactly once, not processing or publishing events exactly once.
Answering Q1: yes, each restart resulted in processing the same messages over and over again.
Answering Q2: the first window after your bug fix processed these replayed messages again; then everything went back to normal.
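Since replays after a failure are expected, a common mitigation is to make the sink idempotent. A rough sketch, assuming the flink-connector-elasticsearch ElasticsearchSinkFunction API, hypothetical SessionStatEvent accessors (getKey(), getWindowEnd()) and a hypothetical toJsonMap() helper: give every window result a deterministic document id, so a replayed result overwrites the earlier one instead of adding a duplicate.
ElasticsearchSinkFunction<SessionStatEvent> idempotentSink =
    new ElasticsearchSinkFunction<SessionStatEvent>() {
        @Override
        public void process( SessionStatEvent element, RuntimeContext ctx, RequestIndexer indexer ) {
            // Deterministic id: the same key + window always maps to the same document.
            String docId = element.getKey() + "_" + element.getWindowEnd();
            indexer.add( Requests.indexRequest()
                .index( "session-stats" )
                .type( "session-stat" )
                .id( docId )
                .source( toJsonMap( element ) ) );   // hypothetical helper returning Map<String, Object>
        }
    };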

Related

GCP dataflow - processing JSON takes too long

I am trying to process json files in a bucket and write the results into a bucket:
DataflowPipelineOptions options = PipelineOptionsFactory.create()
    .as(DataflowPipelineOptions.class);
options.setRunner(BlockingDataflowPipelineRunner.class);
options.setProject("the-project");
options.setStagingLocation("gs://some-bucket/temp/");

Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.from("gs://some-bucket/2016/04/28/*/*.json"))
 .apply(ParDo.named("SanitizeJson").of(new DoFn<String, String>() {
     @Override
     public void processElement(ProcessContext c) {
         try {
             JsonFactory factory = JacksonFactory.getDefaultInstance();
             String json = c.element();
             SomeClass e = factory.fromString(json, SomeClass.class);
             // manipulate the object a bit...
             c.output(factory.toString(e));
         } catch (Exception err) {
             LOG.error("Failed to process element: " + c.element(), err);
         }
     }
 }))
 .apply(TextIO.Write.to("gs://some-bucket/output/"));
p.run();
I have around 50,000 files under the path gs://some-bucket/2016/04/28/ (in sub-directories).
My question is: does it make sense that this takes more than an hour to complete? Doing something similar on a Spark cluster on Amazon takes about 15-20 minutes. I suspect that I might be doing something inefficiently.
EDIT:
In my Spark job I aggregate all the results in a DataFrame and only then write the output, all at once. I noticed that my pipeline here writes each file separately; I assume that is why it's taking much longer. Is there a way to change this behavior?
Your jobs are hitting a couple of performance issues in Dataflow, caused by the fact that it is optimized for executing work in larger increments, while your job is processing lots of very small files. As a result, some aspects of the job's execution end up dominated by per-file overhead. Here are some details and suggestions.
The job is limited more by writing output than by reading input (though reading input is also a significant part). You can significantly cut that overhead by specifying withNumShards on your TextIO.Write, depending on how many files you want in the output; e.g. 100 could be a reasonable value. By default you get an unspecified number of files which, in this case, given the current behavior of the Dataflow optimizer, matches the number of input files. Usually that is a good idea because it allows us not to materialize the intermediate data, but here it is not, because the input files are so small and per-file overhead dominates.
I recommend setting maxNumWorkers to a value like 12 - currently the second job is autoscaling to an excessively large number of workers. This is caused by Dataflow's autoscaling currently being geared toward jobs that process data in larger increments; it does not yet take per-file overhead into account and so behaves poorly in your case.
The second job is also hitting a bug that prevents it from finalizing the written output. We're investigating; however, setting maxNumWorkers should also make it complete successfully.
In short:
set maxNumWorkers=12
set TextIO.Write.to("...").withNumShards(100)
and it should run much better.
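Putting both suggestions into the pipeline from the question, a sketch (assuming the Dataflow SDK 1.x API used above; setMaxNumWorkers comes from the worker-pool options that DataflowPipelineOptions extends, and SanitizeJsonFn stands in for the anonymous DoFn shown earlier):
DataflowPipelineOptions options = PipelineOptionsFactory.create()
    .as(DataflowPipelineOptions.class);
options.setRunner(BlockingDataflowPipelineRunner.class);
options.setProject("the-project");
options.setStagingLocation("gs://some-bucket/temp/");
options.setMaxNumWorkers(12);                                   // cap autoscaling

Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.from("gs://some-bucket/2016/04/28/*/*.json"))
 .apply(ParDo.named("SanitizeJson").of(new SanitizeJsonFn()))   // same DoFn as above, extracted into a class
 .apply(TextIO.Write.to("gs://some-bucket/output/")
        .withNumShards(100));                                   // bound the number of output files
p.run();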

RxJava: retryWhen with retry limit

I am new to ReactiveX and reactive programming in general. I need to implement a retry mechanism for Couchbase CAS operations, but the example on the Couchbase website shows a retryWhen which seems to retry indefinitely. I need to have a retry limit and retry count somewhere in there.
The simple retry() would work, since it accepts a retryLimit, but I don't want it to retry on every exception, only on CASMismatchException.
Any ideas? I'm using the RxJava library.
In addition to what Simon Basle said, here is a quick version with linear backoff:
.retryWhen(notification -> notification
    .zipWith(Observable.range(1, 5), Tuple::create)
    .flatMap(att ->
        att.value2() == 3 ? Observable.error(att.value1())
                          : Observable.timer(att.value2(), TimeUnit.SECONDS)
    )
)
Note that "att" here is a tuple consisting of both the throwable and the number of retries, so you can implement your retry logic very specifically based on those two params.
If you want to learn even more, you can peek at the resilient doc I'm currently writing: https://gist.github.com/daschl/db9fcc9d2b932115b679#retry-with-delay
retryWhen is clearly a little bit more complicated than simple retry, but here's the gist of it:
you pass a notificationHandler function to retryWhen which takes an Observable<Throwable> and outputs an Observable<?>
the emissions of this returned Observable determine when a retry should occur or stop
so, for each Exception occurring in the original stream, if the handler's Observable emits 1 item there'll be 1 retry; if it emits 2 items, there'll be 2, and so on...
as soon as the handler's stream emits an error, retry is aborted.
Using this, you can both:
work only on CASMismatchException: just have your function return an Observable.error(t) in the other cases
retry only for a specific number of times: for each exception, flatMap from an Observable.range representing the max number of retries, have it return an Observable.timer using the retry # if you need increasing delays.
Your use case is pretty close to the one in RxJava doc here
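Putting those two points together with the backoff example above, a rough sketch (RxJava 1.x, reusing the Couchbase Tuple helper from the earlier snippet; sourceObservable is assumed) that retries only on CASMismatchException, at most 3 times, with a linearly growing delay:
sourceObservable.retryWhen(errors -> errors
    .zipWith(Observable.range(1, 4), Tuple::create)
    .flatMap(att ->
        (att.value1() instanceof CASMismatchException) && att.value2() < 4
            ? Observable.timer(att.value2(), TimeUnit.SECONDS)   // wait 1s, 2s, 3s before retrying
            : Observable.<Long>error(att.value1())               // other exception, or attempts exhausted
    )
);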
Reviving this thread: since the Couchbase Java SDK 2.1.2 there's a new, simpler way to do that: use the RetryBuilder:
Observable<Something> retryingObservable = sourceObservable.retryWhen(
    RetryBuilder
        //will limit to the relevant exception
        .anyOf(CASMismatchException.class)
        //will retry only 5 times
        .max(5)
        //delay doubling each time, from 100ms to 2s
        .delay(Delay.exponential(TimeUnit.MILLISECONDS, 2000, 100, 2))
        .build()
);

In the "ActionBarTabsPager" tutorial, getActivity returns null

I have successfully implemented the tutorial:
http://developer.android.com/reference/android/support/v4/view/ViewPager.html, as a tabbed ViewPager activity with fragments on each tab. Each Fragment maintains various UI TextFields etc., and everything works fine except getActivity(), which returns null when called from any of the fragments.
UPDATE: Read this, then please see my own answer below that broadens the scope regarding the cause of this error. Continued:
BUT the null only appears after a while. Initially, in fragment.onStart(), getActivity() works, so the default UI setup can be performed. But by the first time the user makes changes, getActivity() already returns null.
Strangely, at the same moment it is still possible to change the fragment's UI fields from the Activity: if the context (the activity) is passed to the fragment in a setSomeText(this, ...) call, the fragment can make the corresponding changes. Of course, the design should be such that the Fragment takes care of its own detailed tasks.
It does not help to save the context in onStart(), because that reference also points to null after a while.
It is explicitly stated in the tutorial that the feature is in early development, but as this "null" problem has become quite a time thief here, and as I see that "getActivity returns null" is a very common problem, I wanted to ask aloud whether there could be a bug in getActivity() when combined with ViewPager and/or tabs.
What took me so long to detect the problem was that it is hard to guess that a fragment would EVER lose knowledge of its activity. Anyway, I am on to the next hurdle and just wanted to share this finding: don't trust getActivity(); instead pass the context from the Activity to its Fragments as a parameter in the set/get methods or other API.
This is not an answer, but I needed the space to explain and I am circling in on the problem:
It seems there is a more general problem than just getActivity(): the fragments' member variables are also "vanishing" to null. A new instance of the fragment has "taken over". This happens when the current tab is shifted more than one tab to either side.
EXAMPLE: I define 5 tabs. Tab 2 can be manipulated from the UI. After changing tab 2's content, I move between tabs, either with a tab click or a finger sweep; either way.
RESULTS? As long as I only visit the next tab to the left or right and then move back, the changed data are still there on tab 2. As soon as I move two or more tabs away from tab 2 and then return, the fragment instance of tab 2 is always reset. It does not matter how many tabs are present, nor whether I hit the last or the first tab during this process. The code? It is the same as in the referenced tutorial, and in addition:
//add tabs (notice the once-only saving of the fragment into profileViewer)
mTabsAdapter.addTab(actionBar.newTab()
        //.setText(R.string.action_favorite)
        .setIcon(R.drawable.ic_action_favorite),
        TabFragmentDemo.class, null);
mTabsAdapter.addTab(actionBar.newTab()
        //.setText(R.string.new_profile)
        .setIcon(R.drawable.ic_action_add_person),
        ProfileViewer.class, null);
profileViewer = (ProfileViewer) mTabsAdapter.getItem(NEW_PROFILE);
mTabsAdapter.addTab(actionBar.newTab()
        //.setText(R.string.action_select)
        .setIcon(R.drawable.ic_action_view_as_list),
        TabFragmentDemo.class, null);
mTabsAdapter.addTab(actionBar.newTab()
        //.setText(R.string.action_select)
        .setIcon(R.drawable.ic_action_view_as_list),
        TabFragmentDemo.class, null);
mTabsAdapter.addTab(actionBar.newTab()
        //.setText(R.string.action_select)
        .setIcon(R.drawable.ic_action_view_as_list),
        TabFragmentDemo.class, null);
Then a simple DialogFragment selects a date, and this date is set (it does not help to omit the date picker and just set a date directly):
public void showDatePickerDialog(View v) {
    dateOfBirthPicker = new DateOfBirthPicker();
    dateOfBirthPicker.show(fragmentManager, datePickerTag);
}

//the callback from the date picker:
@Override
public void onDateSet(DatePicker view, int year, int month, int day) {
    //update the fragment
    profileViewer.setCardFromDate(this, day, month, year);
}
Notice that "this" is passed on as a forced context to the fragment. The big question here is why the tab loses its original fragment when there is nothing in my code requesting that?
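A likely contributor (a guess, since it depends on the adapter in use): a ViewPager only keeps the pages within its offscreen page limit (default 1) instantiated, so a fragment two or more tabs away can be torn down and recreated, which invalidates any reference saved to the old instance. A minimal mitigation sketch for the 5-tab example above:
// Keep all five pages alive so their fragments are not destroyed when the user
// moves more than one tab away (mViewPager is the pager from the tutorial).
mViewPager.setOffscreenPageLimit(4);   // number of tabs minus one
Raising the limit trades memory for state retention; the more robust fix is to let each fragment rebuild its own state (e.g. in onActivityCreated()) instead of holding references such as profileViewer across recreation.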

Get Current Location Instantly

I am using the following code to request current location:
private void RequestCurrentLocation()
{
    Criteria locationCriteria = new Criteria () {
        Accuracy = Accuracy.NoRequirement,
        PowerRequirement = Power.NoRequirement
    };
    this.mgr = GetSystemService (Context.LocationService) as LocationManager;
    String locationProvider = this.mgr.GetBestProvider (locationCriteria, true);
    this.mgr.RequestLocationUpdates (locationProvider, 0, 0, this);
}
As I need the location only once, I call RemoveUpdates() as soon as OnLocationChanged is raised and then carry on with the rest of the work:
public void OnLocationChanged (Location location)
{
    this.mgr.RemoveUpdates (this);
    //Rest of the method
}
I face two issues:
1) Although I have provided zero for the distance and time parameters of RequestLocationUpdates, it still takes at least 2-3 seconds before OnLocationChanged is triggered. How can I make it instantaneous?
2) I intermittently face the issue that the OnLocationChanged event does not fire at all. Yesterday I spent the whole day getting my code to work, which had been working flawlessly a day earlier. It's really strange that something works properly one day and simply stops the next, even with the very same source code! Can somebody give me an idea?
Thanks.
There's no way to get an accurate, up-to-date location instantly.
For a device to determine location it needs to track GPS, WiFi or 3G towers, and that depends on radio signal strength, transmission speed and other little things like that. You can tweak the parameters of your request to try to get a faster update (for example, instead of the best provider, try to use any provider available).
As an alternative approach you can call getLastKnownLocation; this method is synchronous and returns instantly with the last location the system found. The issue is that the "last known location" might be null (if the system never found any location) or very old (if the last fix was days ago).
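For completeness, a minimal sketch of that fallback in Android framework terms (Java here; the Xamarin.Android calls used in the question mirror these; useLocation() is a hypothetical handler):
LocationManager mgr = (LocationManager) getSystemService(Context.LOCATION_SERVICE);
String provider = mgr.getBestProvider(new Criteria(), true);
Location last = (provider != null) ? mgr.getLastKnownLocation(provider) : null;
if (last != null) {
    // Instant, but possibly stale: check last.getTime() before trusting it.
    useLocation(last);
} else if (provider != null) {
    // No cached fix: fall back to waiting for a fresh one, as in the question.
    mgr.requestLocationUpdates(provider, 0, 0, this);
}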

Adding timestamp in kafka message payload

Is there any way I can add a timestamp header to the Kafka message payload? I want to check at the consumer end when the message was created and apply custom logic based on that.
EDIT:
I'm trying to find a way to attach some custom value (basically a timestamp) to the messages published by producers, so that I can consume messages for a specific time duration. Right now Kafka only makes sure that messages will be delivered in the order they were put in the queue. But in my case a previously generated record might arrive after a certain delay (so a message generated at time T1 might have a higher offset, 1, than another generated at a later time T2 with offset 0). For this reason they will not be in the order I expect at the consumer's end. So I am basically looking for a way to consume them in an ordered way.
The current Kafka 0.8 release provides no way to attach anything other than the "Message Key" at the producer end. I found a similar topic here where it was advised to encode the timestamp in the message payload, but despite a lot of searching I couldn't find a workable approach.
Also, I don't know whether such an approach would have any impact on the overall performance of Kafka, as it manages message offsets internally and no such API is exposed so far, as can be seen from this page.
I'd really appreciate any clue as to whether I'm thinking about this the right way at all, or whether there is any viable approach; I'm all set to give it a try.
If you want to consume messages for a specific time duration then I can provide a solution; however, consuming messages from that time duration in an ordered way is difficult. I am also looking for that solution. Check the link below:
Message Sorting in Kafka Queue
Solution to fetch data for a specific time:
For times T1, T2, ..., TN, where T is the range of time, divide the topic into N partitions. Now produce the messages using a Partitioner class in such a way that the message's generation time decides which partition is used for that message.
Similarly, while consuming, subscribe to the exact partitions for the time range you want to consume.
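A rough sketch of such a partitioner (assuming the 0.8 Scala producer's kafka.producer.Partitioner contract, where the class is set via partitioner.class and constructed with VerifiableProperties, and assuming the message key carries the creation time in milliseconds as a string; the one-hour bucket size is an arbitrary choice):
public class TimeRangePartitioner implements kafka.producer.Partitioner {
    private static final long BUCKET_MS = 60 * 60 * 1000L;   // one partition per hour bucket (assumption)

    public TimeRangePartitioner(kafka.utils.VerifiableProperties props) { }

    @Override
    public int partition(Object key, int numPartitions) {
        long creationTimeMs = Long.parseLong((String) key);   // key = creation time as a string
        return (int) ((creationTimeMs / BUCKET_MS) % numPartitions);
    }
}
A consumer that wants a given time range then reads only the partitions whose buckets overlap that range.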
You can make a class that contains your partitioning information and the timestamp when this message was created, and then use this as the key to the Kafka message. You can then use a wrapper Serde that transforms this class into a byte array and back because Kafka can only understand bytes. Then, when you receive the message at the consumer end as a bag of bytes, you can deserialize it and retrieve the timestamp and then channel that into your logic.
For example:
public class KafkaKey implements Serializable {

    private long mTimeStampInSeconds;

    /* This contains other partitioning data that will be used by the
       appropriate partitioner in Kafka. */
    private PartitionData mPartitionData;

    public KafkaKey(long timeStamp, ...) {
        /* Initialize key */
        mTimeStampInSeconds = timeStamp;
    }

    /* Simple getter for the timestamp */
    public long getTimeStampInSeconds() {
        return mTimeStampInSeconds;
    }

    public static byte[] toBytes(KafkaKey kafkaKey) {
        /* Some serialization logic. */
    }

    public static KafkaKey fromBytes(byte[] bytes) throws Exception {
        /* Some deserialization logic. */
    }
}

/* Producer end */
KafkaKey kafkaKey = new KafkaKey(System.currentTimeMillis(), ... );
KeyedMessage<byte[], byte[]> kafkaMessage = new KeyedMessage<>(topic, KafkaKey.toBytes(kafkaKey), KafkaValue.toBytes(kafkaValue));

/* Consumer end */
MessageAndMetadata<byte[], byte[]> receivedMessage = (get from consumer);
KafkaKey kafkaKey = KafkaKey.fromBytes(receivedMessage.key());
long timestamp = kafkaKey.getTimeStampInSeconds();
/* ... and happily ever after. */
This will be more flexible than making specific partitions correspond to time intervals. Otherwise, you'll have to keep adding partitions for different time ranges and keep a separate, synchronized tabulation of which partition corresponds to which time range, which can get unwieldy quickly.
This looks like it will help you achieve your goals. It lets you, with little effort, define and write your message headers while hiding the (de)serialization burden. The only thing you have to provide is a (de)serializer for the actual object you're sending over the wire. This implementation delays the deserialization of the payload object as much as possible, which means that you can (in a very performant and transparent way) deserialize the headers, check the timestamp and only deserialize the payload (the heavy bit) if/when you are sure the object is useful to you.
Note, Kafka introduced timestamps to the internal representation of a message pursuant to this discussion:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-32+-+Add+timestamps+to+Kafka+message
and these tickets:
https://issues.apache.org/jira/browse/KAFKA-2511
It should be available in all versions of Kafka 0.10.0.0 and greater.
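With those versions the timestamp no longer needs to live in the key or payload. A minimal sketch (topic name, serializer config, the surrounding producer/consumer setup and the rangeStartMs/rangeEndMs bounds are assumptions):
// Producer: stamp the record explicitly (or let it default to CreateTime).
ProducerRecord<String, String> record =
    new ProducerRecord<>("events", null, System.currentTimeMillis(), "some-key", "some-value");
producer.send(record);

// Consumer: every record exposes its timestamp (CreateTime or LogAppendTime).
for (ConsumerRecord<String, String> rec : consumer.poll(1000)) {
    if (rec.timestamp() >= rangeStartMs && rec.timestamp() < rangeEndMs) {
        // apply the time-based logic here
    }
}
From 0.10.1 on, KafkaConsumer.offsetsForTimes() can additionally be used to seek directly to the first offset at or after a given timestamp.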
The problem here is that you ingested messages in an order you no longer want. If the order matters, then you need to abandon parallelism in the relevant Producer(s). Then the problem at the Consumer level goes away.