Caffeine Cache - How to get the creation date of an element

Is there any way to access the creation timestamp of an element from the CaffeineCache?
Something like cache.get("x").getTimestamp()?

The cache tries to store the minimal metadata required to perform its operations, so some conveniences are left out to avoid waste. In those cases you should consider adding that metadata yourself by wrapping your value.
The cache does expose runtime metadata, but what is available often depends on how the cache was constructed. It can be accessed via Cache.policy(). For example, cache.policy().expireAfterWrite() offers an ageOf(key) method that tells you how long the entry has resided in the cache since its expiration timestamp was last reset. To calculate how much time remains until the entry expires, you can subtract that age from the policy's duration (via getExpiresAfter()).

The Cache interface provides a policy() method to inspect and perform low-level operations based on the cache's runtime characteristics. For example, the snippet below prints the age of an entry using the expire-after-write policy.
private static void printAgeOfEntryUsingAfterWritePolicy(Cache<Object, Object> cache, Object key) {
    // Only present if the cache was built with expireAfterWrite
    Optional<FixedExpiration<Object, Object>> expirationPolicy = cache.policy().expireAfterWrite();
    if (!expirationPolicy.isPresent()) {
        return;
    }
    // Empty if the entry is not present in the cache
    Optional<Duration> ageOfEntry = expirationPolicy.get().ageOf(key);
    if (!ageOfEntry.isPresent()) {
        return;
    }
    Duration duration = ageOfEntry.get();
    System.out.println("Element with key " + key + " has been in the cache for " + duration.getSeconds() + " seconds.");
}
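To estimate the time remaining before an entry expires, as mentioned above, you can subtract its age from the policy's duration; a minimal sketch, reusing the same cache and key:
cache.policy().expireAfterWrite().ifPresent(policy ->
    policy.ageOf(key).ifPresent(age -> {
        // Remaining lifetime = configured expire-after-write duration minus the current age
        Duration remaining = policy.getExpiresAfter().minus(age);
        System.out.println("Element with key " + key + " expires in ~" + remaining.getSeconds() + " seconds.");
    }));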

Related

Flink KeyedStream generates duplicate results with the same key and window timestamp

Here is my Flink job workflow:
DataStream<FlinkEvent> events = env
    .addSource(consumer)
    .flatMap(...)
    .assignTimestampsAndWatermarks(new EventTsExtractor());

DataStream<SessionStatEvent> sessionEvents = events
    .keyBy(new KeySelector<FlinkEvent, Tuple2<String, String>>() {
        @Override
        public Tuple2<String, String> getKey(FlinkEvent value) throws Exception {
            return Tuple2.of(value.getF0(), value.getSessionID());
        }
    })
    .window(TumblingEventTimeWindows.of(Time.minutes(2)))
    .allowedLateness(Time.seconds(10))
    .aggregate(new SessionStatAggregator(), new SessionStatProcessor());

/* ... */

sessionEvents.addSink(esSinkBuilder.build());
First I encountered
java.lang.Exception: org.apache.flink.streaming.runtime.tasks.ExceptionInChainedOperatorException: Could not forward element to next operator
in the flatMap operator, and the task kept restarting. I observed many duplicate results with different values for the same key and window timestamp.
Q1: I guess the duplicates were caused by the downstream operators consuming the same messages again after the job restarted. Am I right?
I resubmitted the job after fixing the ExceptionInChainedOperatorException problem. I observed duplicates in the first time window again, and after that the job seemed to work correctly (one result per time window per key).
Q2: Where did the duplicates come from?
... there should be one result per key for one window
This is not (entirely) correct. Because of the allowedLateness, any late events (within the period of allowed lateness) will cause late (or in other words, extra) firings of the relevant windows. With the default EventTimeTrigger (which you appear to be using), each late event causes an additional window firing, and an updated window result will be emitted.
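If those extra firings are unwanted, one option (just a sketch, reusing the stream and types from the question) is to drop allowedLateness and route late events to a side output instead of re-firing the window:
final OutputTag<FlinkEvent> lateTag = new OutputTag<FlinkEvent>("late-events") {};

SingleOutputStreamOperator<SessionStatEvent> sessionEvents = events
    .keyBy(new KeySelector<FlinkEvent, Tuple2<String, String>>() {
        @Override
        public Tuple2<String, String> getKey(FlinkEvent value) throws Exception {
            return Tuple2.of(value.getF0(), value.getSessionID());
        }
    })
    .window(TumblingEventTimeWindows.of(Time.minutes(2)))
    .sideOutputLateData(lateTag) // late events are captured here instead of triggering extra firings
    .aggregate(new SessionStatAggregator(), new SessionStatProcessor());

// Late events can then be inspected, logged, or merged separately
DataStream<FlinkEvent> lateEvents = sessionEvents.getSideOutput(lateTag);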
As for the duplicates after the restarts: this is a consequence of how Flink achieves exactly-once semantics. In case of failures, Flink replays events from the last successful checkpoint. It is important to note that exactly-once means affecting state once, not processing or publishing events exactly once.
Answering Q1: yes, each restart resulted in processing the same messages over and over again.
Answering Q2: the first window after your bug fix processed these messages again; then everything went back to normal.
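For completeness, that replay behavior is driven by checkpointing, which is configured on the execution environment; a minimal sketch, assuming the env from the question:
// Take a checkpoint every 60 seconds; on failure Flink restores state from the
// last completed checkpoint and replays events from that point onward
env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);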

AWS SDK in Java - How to get activities from a worker when multiple executions are ongoing for a state machine

AWS Step Functions
My problem is how to sendTaskSuccess or sendTaskFailure to an Activity that is running under a state machine in AWS.
My actual intent is to notify the specific activities that belong to a particular state machine execution.
I can successfully send a notification to all waiting activities by activity ARN, but my actual need is to send the notification to the specific activity that belongs to a particular state machine execution.
Example: state machine SM1.
There are two executions ongoing for SM1: SM1E1 and SM1E2. In that case I want to sendTaskSuccess to the activity that belongs to SM1E1.
The following code is what I used, but it sends the notification to all activities:
GetActivityTaskResult getActivityTaskResult = client.getActivityTask(
        new GetActivityTaskRequest().withActivityArn("arn detail"));

if (getActivityTaskResult.getTaskToken() != null) {
    try {
        JsonNode json = Jackson.jsonNodeOf(getActivityTaskResult.getInput());
        String outputResult = patientRegistrationActivity.setStatus(json.get("patientId").textValue());
        System.out.println("outputResult " + outputResult);
        SendTaskSuccessRequest sendTaskRequest = new SendTaskSuccessRequest()
                .withOutput(outputResult)
                .withTaskToken(getActivityTaskResult.getTaskToken());
        client.sendTaskSuccess(sendTaskRequest);
    } catch (Exception e) {
        client.sendTaskFailure(
                new SendTaskFailureRequest().withTaskToken(getActivityTaskResult.getTaskToken()));
    }
}
As far as I know, you have no control over which task token is returned. You may get one for SM1E1 or SM1E2, and you cannot tell which by looking at the task token. GetActivityTask returns the "input", so based on that you may be able to tell which execution you are dealing with; but if you get a token you are not interested in, I don't think there's a way to put it back, so you won't be able to get it again with GetActivityTask later. I guess you could store it in a database somewhere for later use.
One idea you can try is the new callback integration pattern. You can specify the Payload parameter in the state definition to include the task token, like this: token.$: "$$.Task.Token". Then use GetExecutionHistory to find the TaskScheduled event of the execution you are interested in, retrieve the parameters.Payload.token value, and use that with sendTaskSuccess (see the sketch after the state definition below).
Here's a snippet of my serverless.yml file that describes the state
WaitForUserInput: # Wait for the user to do something
  Type: Task
  Resource: arn:aws:states:::lambda:invoke.waitForTaskToken
  Parameters:
    FunctionName:
      Fn::GetAtt: [WaitForUserInputLambdaFunction, Arn]
    Payload:
      token.$: "$$.Task.Token"
      executionArn.$: "$$.Execution.Id"
  Next: DoSomethingElse
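A rough sketch of the history lookup with the v1 Java SDK follows (executionArn and outputJson are placeholders; note that Parameters comes back as a JSON string that still has to be parsed):
GetExecutionHistoryResult history = client.getExecutionHistory(
        new GetExecutionHistoryRequest().withExecutionArn(executionArn));

for (HistoryEvent event : history.getEvents()) {
    if ("TaskScheduled".equals(event.getType())) {
        // Parameters contains the Payload defined in the state,
        // including the token injected via "$$.Task.Token"
        String parametersJson = event.getTaskScheduledEventDetails().getParameters();
        JsonNode payload = Jackson.jsonNodeOf(parametersJson).get("Payload");
        String taskToken = payload.get("token").textValue();
        client.sendTaskSuccess(new SendTaskSuccessRequest()
                .withTaskToken(taskToken)
                .withOutput(outputJson));
    }
}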
I did a POC to check, and below is the solution.
If the token is consumed by getActivityTaskResult.getTaskToken() and your conditions are not satisfied by the request input, you can use the line below to avoid consuming the token: awsStepFunctionClient.sendTaskHeartbeat(new SendTaskHeartbeatRequest().withTaskToken(taskToken))

What is the standard procedure for generating keys for each document in Java?

I want to insert documents into Couchbase in bulk from Java. What is the standard procedure for generating a key for each document?
You could use a Couchbase "counter" document as a form of sequence. Using the reactive approach with the Java SDK, this would go something like this, assuming your batch is a List<JsonObject> with each content to save to Couchbase:
// Start with a sequence of contents to save
Observable.from(listOfDocumentContent)
    // For each content, asynchronously generate something...
    .flatMap(content -> bucket.async() // assuming bucket is a `Bucket`
        // Atomically generate an increasing value, starting from 0
        .counter("DOCUMENT_KEY_GENERATOR", 1, 0) // use a more relevant document key
        // This gives a `JsonLongDocument`, so extract the number and turn that
        // + the original content into a `JsonDocument` to be saved
        .map(cDoc -> JsonDocument.create(KEY_PREFIX + cDoc.content(), content))
    )
    // Next up, asynchronously save each document generated in the previous step...
    // You could also use insert, since you don't expect the keys to already exist in Couchbase
    .flatMap(docToSave -> bucket.async().upsert(docToSave))
    // This performs the above asynchronously but waits for the last doc in the batch to finish saving
    .toBlocking().last();
Notice we use a KEY_PREFIX when generating the document to be saved, so that there is less risk of collision (otherwise, documents of other types in the same bucket could also end up named "1", "2", and so on).
Also tune the saving method to your needs (here upsert, but also insert or replace, plus TTL, durability requirements, etc.).
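If the reactive pipeline is more than you need, the same idea works with the blocking API; a minimal sketch reusing the placeholder names above:
for (JsonObject content : listOfDocumentContent) {
    // Atomically increment the counter document to get the next sequence number
    JsonLongDocument next = bucket.counter("DOCUMENT_KEY_GENERATOR", 1, 0);
    bucket.upsert(JsonDocument.create(KEY_PREFIX + next.content(), content));
}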

GCP Dataflow - processing JSON takes too long

I am trying to process json files in a bucket and write the results into a bucket:
DataflowPipelineOptions options = PipelineOptionsFactory.create()
    .as(DataflowPipelineOptions.class);
options.setRunner(BlockingDataflowPipelineRunner.class);
options.setProject("the-project");
options.setStagingLocation("gs://some-bucket/temp/");

Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.from("gs://some-bucket/2016/04/28/*/*.json"))
 .apply(ParDo.named("SanitizeJson").of(new DoFn<String, String>() {
     @Override
     public void processElement(ProcessContext c) {
         try {
             JsonFactory factory = JacksonFactory.getDefaultInstance();
             String json = c.element();
             SomeClass e = factory.fromString(json, SomeClass.class);
             // manipulate the object a bit...
             c.output(factory.toString(e));
         } catch (Exception err) {
             LOG.error("Failed to process element: " + c.element(), err);
         }
     }
 }))
 .apply(TextIO.Write.to("gs://some-bucket/output/"));
p.run();
I have around 50,000 files under the path gs://some-bucket/2016/04/28/ (in sub-directories).
My question is: does it make sense that this takes more than an hour to complete? Doing something similar on a Spark cluster in Amazon takes about 15-20 minutes. I suspect that I might be doing something inefficiently.
EDIT:
In my Spark job I aggregate all the results in a DataFrame and only then write the output, all at once. I noticed that my pipeline here writes each file separately; I assume that is why it's taking much longer. Is there a way to change this behavior?
Your jobs are hitting a couple of performance issues in Dataflow, caused by the fact that Dataflow is more optimized for executing work in larger increments, while your job is processing lots of very small files. As a result, some aspects of the job's execution end up dominated by per-file overhead. Here are some details and suggestions.
The job is limited by writing output rather than by reading input (though reading input is also a significant part). You can significantly cut that overhead by specifying withNumShards on your TextIO.Write, depending on how many files you want in the output; e.g. 100 could be a reasonable value. By default you get an unspecified number of files, which in this case, given the current behavior of the Dataflow optimizer, matches the number of input files. Usually that is a good idea because it allows us to avoid materializing the intermediate data, but in this case it is not, because the input files are so small and the per-file overhead dominates.
I recommend setting maxNumWorkers to a value like 12 - currently the second job is autoscaling to an excessively large number of workers. This is caused by Dataflow's autoscaling currently being geared toward jobs that process data in larger increments; it currently doesn't take per-file overhead into account and does not behave well in your case.
The second job is also hitting a bug because of which it fails to finalize the written output. We're investigating; however, setting maxNumWorkers should also make it complete successfully.
To put it shortly:
set maxNumWorkers=12
set TextIO.Write.to("...").withNumShards(100)
and it should run much better.
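Applied to the pipeline from the question, that comes down to something like the following sketch (SanitizeJsonFn stands for the anonymous DoFn above, factored into a named class):
// Cap autoscaling so per-file overhead doesn't drive the worker count up
options.setMaxNumWorkers(12);

Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.from("gs://some-bucket/2016/04/28/*/*.json"))
 .apply(ParDo.named("SanitizeJson").of(new SanitizeJsonFn()))
 .apply(TextIO.Write.to("gs://some-bucket/output/")
                    .withNumShards(100)); // write a fixed, modest number of output files
p.run();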

Adding timestamp in kafka message payload

Is there any way I can add a timestamp header to the Kafka message payload? I want to check when the message was created at the consumer end and apply custom logic based on that.
EDIT:
I'm trying to find a way to attach some custom value (basically a timestamp) to the messages published by producers, so that I can consume messages for a specific time duration. Right now Kafka only makes sure that messages are delivered in the order they were put in the queue. But in my case a previously generated record might arrive after a certain delay (so a message generated at time T1 might end up with offset 1 while a message generated at a later time T2 got offset 0). For this reason they will not be in the order I expect at the consumer's end. So I am basically looking for a way to consume them in an ordered way.
The current Kafka 0.8 release provides no way to attach anything other than the "Message Key" at the producer end. I found a similar topic here where it was advised to encode the timestamp in the message payload, but despite a lot of searching I couldn't find a workable approach.
Also, I don't know whether such an approach has any impact on the overall performance of Kafka, since it manages message offsets internally and no such API is exposed so far, as can be seen from this page.
I would really appreciate any clue as to whether this is the right way of thinking at all, or whether there is another feasible approach; I am all set to give it a try.
If you want to consume messages for a specific time duration then I can provide a solution; however, consuming messages from that duration in an ordered way is difficult. I am also looking for the same solution. Check the link below:
Message Sorting in a Kafka Queue
Solution to fetch data for a specific time:
For times T1, T2, ..., TN, where T is the range of time, divide the topic into N partitions. Now produce the messages using a Partitioner class in such a way that the message's generation time is used to decide which partition should be used for that message. A sketch of such a partitioner is shown below.
Similarly, while consuming, subscribe to the exact partitions for the time range you want to consume.
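A sketch of a time-bucket partitioner against the newer producer API (0.9+), assuming the producer supplies the event timestamp in milliseconds as the record key; the old 0.8 producer has a different Partitioner interface, but the idea is the same:
public class TimeBucketPartitioner implements org.apache.kafka.clients.producer.Partitioner {

    // Assumed bucket width: each partition covers one hour of event time
    private static final long BUCKET_MILLIS = 60 * 60 * 1000L;

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, org.apache.kafka.common.Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        long eventTimeMillis = (Long) key; // the event timestamp supplied as the record key
        long bucket = eventTimeMillis / BUCKET_MILLIS;
        return (int) (bucket % numPartitions);
    }

    @Override
    public void close() { }

    @Override
    public void configure(java.util.Map<String, ?> configs) { }
}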
You can make a class that contains your partitioning information and the timestamp at which the message was created, and then use this as the key of the Kafka message. You can then use a wrapper Serde that transforms this class into a byte array and back, because Kafka can only understand bytes. Then, when you receive the message at the consumer end as a bag of bytes, you can deserialize it, retrieve the timestamp, and channel that into your logic.
For example:
public class KafkaKey implements Serializable {

    private long mTimeStampInSeconds;

    /* This contains other partitioning data that will be used by the
       appropriate partitioner in Kafka. */
    private PartitionData mPartitionData;

    public KafkaKey(long timeStamp, ...) {
        /* Initialize key */
        mTimeStampInSeconds = timeStamp;
    }

    /* Simple getter for the timestamp */
    public long getTimeStampInSeconds() {
        return mTimeStampInSeconds;
    }

    public static byte[] toBytes(KafkaKey kafkaKey) {
        /* Some serialization logic. */
    }

    public static KafkaKey fromBytes(byte[] bytes) throws Exception {
        /* Some deserialization logic. */
    }
}

/* Producer end */
KafkaKey kafkaKey = new KafkaKey(System.currentTimeMillis(), ...);
KeyedMessage<byte[], byte[]> kafkaMessage = new KeyedMessage<>(topic, KafkaKey.toBytes(kafkaKey), KafkaValue.toBytes(kafkaValue));

/* Consumer end */
MessageAndMetadata<byte[], byte[]> receivedMessage = (get from consumer);
KafkaKey kafkaKey = KafkaKey.fromBytes(receivedMessage.key());
long timestamp = kafkaKey.getTimeStampInSeconds();

/* ...and happily ever after. */
This will be more flexible than making specific partitions correspond to time intervals. Otherwise, you'd have to keep adding partitions for different time ranges and maintain a separate, synchronized tabulation of which partition corresponds to which time range, which can quickly get unwieldy.
This looks like it will help you achieve your goals. It lets you, with little effort, define and write your message headers, hiding the (de)serialization burden. The only thing you have to provide is a (de)serializer for the actual object you're sending over the wire. This implementation delays deserialization of the payload object as much as possible, which means that you can (in a very performant and transparent way) deserialize the headers, check the timestamp, and only deserialize the payload (the heavy bit) if/when you are sure the object is useful to you.
Note, Kafka introduced timestamps to the internal representation of a message pursuant to this discussion:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-32+-+Add+timestamps+to+Kafka+message
and these tickets:
https://issues.apache.org/jira/browse/KAFKA-2511
It should be available in all versions of Kafka 0.10.0.0 and greater.
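On 0.10+ clients the timestamp is therefore available directly on the consumer side, without any custom key wrapping; a minimal sketch, where fromMillis/toMillis and process() are placeholders for your own time window and handling logic:
ConsumerRecords<String, String> records = consumer.poll(1000);
for (ConsumerRecord<String, String> record : records) {
    // CreateTime (set by the producer) or LogAppendTime (set by the broker),
    // depending on the topic's message.timestamp.type configuration
    long ts = record.timestamp();
    if (ts >= fromMillis && ts < toMillis) {
        process(record.value());
    }
}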
The problem here is that you ingested messages in an order you no longer want. If the order matters, then you need to abandon parallelism in the relevant Producer(s). Then the problem at the Consumer level goes away.