I am trying to run a Flink streaming job, and I want to determine the throughput and latency of the streaming process. I have started the Kafka broker server and have incoming messages from Kafka. How do I count messages per second (throughput)?
(Similar to rdd.count. Is there any comparable method to get the count of incoming messages?)
(Complete scenario: I send the message through the producer as a JSON object. I add some information, such as a name as a string and also System.currentTimeMillis, to the JSON object.
During streaming, how do I obtain the sent JSON object from messageStream (a DataStream)?)
Thanks in advance.
CODE :
/**
* Read Strings from Kafka and print them to standard out.
*/
public static void main(String[] args) throws Exception {
System.setProperty("hadoop.home.dir", "c:/winutils/");
// parse input arguments
final ParameterTool parameterTool = ParameterTool.fromArgs(args);
if(parameterTool.getNumberOfParameters() < 4) {
System.out.println("Missing parameters!\nUsage: Kafka --topic <topic> " +
"--bootstrap.servers <kafka brokers> --zookeeper.connect <zk quorum> --group.id <some id>");
return;
}
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.getConfig().disableSysoutLogging();
env.getConfig().setRestartStrategy(RestartStrategies.fixedDelayRestart(4, 10000));
env.enableCheckpointing(5000); // create a checkpoint every 5 seconds
env.getConfig().setGlobalJobParameters(parameterTool); // make parameters available in the web interface
DataStream<String> messageStream = env
.addSource(new FlinkKafkaConsumer010<>(
parameterTool.getRequired("topic"),
new SimpleStringSchema(),
parameterTool.getProperties()));
messageStream.print();
env.execute();
}
There are a few metrics available in the Flink UI from which you can calculate the number of events per second and similar figures.
You can also add your own metrics, computing whatever numbers your requirements call for, and these can be displayed in the Flink UI.
And lastly, specifically for latency tracking, you can try what is explained under latency-tracking; similarly, you can measure throughput using meters.
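To make the custom-metrics suggestion concrete, here is a minimal sketch (the class name, metric names, and 60-second window are made up for illustration) of a pass-through RichMapFunction that registers a Flink Meter; once the job runs, the meter shows up under the task's metrics in the Flink UI. You would drop it into the pipeline above with messageStream.map(new ThroughputMeter()).
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;
import org.apache.flink.metrics.Meter;
import org.apache.flink.metrics.MeterView;

// Pass-through operator that counts every record and exposes an events-per-second meter.
public class ThroughputMeter extends RichMapFunction<String, String> {

    private transient Meter meter;

    @Override
    public void open(Configuration parameters) {
        Counter eventCounter = getRuntimeContext().getMetricGroup().counter("numEvents");
        // MeterView computes the rate of the counter over the given time span (in seconds).
        this.meter = getRuntimeContext()
                .getMetricGroup()
                .meter("eventsPerSecond", new MeterView(eventCounter, 60));
    }

    @Override
    public String map(String value) {
        meter.markEvent(); // one event counted per record
        return value;
    }
}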
This benchmarking application might be a good place to start. The documentation on latency tracking and the metrics available from Flink's Kafka connector should also be interesting reading.
AWS Step Function
My problem is how to sendTaskSuccess or sendTaskFailure to activities that are running under a state machine in AWS.
My actual intent is to notify the specific activities that belong to a particular state machine execution.
I can successfully send a notification to all waiting activities by activity ARN, but what I actually need is to send a notification to the specific activity that belongs to a particular state machine execution.
Example: state machine SM1.
There are two executions ongoing for SM1: SM1E1 and SM1E2. In that case, I want to sendTaskSuccess to the activity that belongs to SM1E1.
The following is the code I used, but it sends the notification to all activities:
GetActivityTaskResult getActivityTaskResult = client.getActivityTask(new GetActivityTaskRequest()
        .withActivityArn("arn detail"));
if (getActivityTaskResult.getTaskToken() != null) {
    try {
        JsonNode json = Jackson.jsonNodeOf(getActivityTaskResult.getInput());
        String outputResult = patientRegistrationActivity.setStatus(json.get("patientId").textValue());
        System.out.println("outputResult " + outputResult);
        SendTaskSuccessRequest sendTaskRequest = new SendTaskSuccessRequest().withOutput(outputResult)
                .withTaskToken(getActivityTaskResult.getTaskToken());
        client.sendTaskSuccess(sendTaskRequest);
    } catch (Exception e) {
        client.sendTaskFailure(
                new SendTaskFailureRequest().withTaskToken(getActivityTaskResult.getTaskToken()));
    }
}
As far as I know, you have no control over which task token is returned. You may get one for SM1E1 or SM1E2, and you cannot tell which by looking at the task token. GetActivityTask returns the "input", so based on that you may be able to tell which execution you are dealing with, but if you get a token you are not interested in, I don't think there is a way to put it back, so you won't be able to get it again with GetActivityTask later. I guess you could store it in a database somewhere for later use.
One idea you can try is to use the new callback integration pattern. You can specify the Payload parameter in the state definition to include the task token, like this: token.$: "$$.Task.Token". Then use GetExecutionHistory to find the TaskScheduled event of the execution you are interested in, retrieve the parameters.Payload.token value, and use that with sendTaskSuccess.
Here's a snippet of my serverless.yml file that describes the state
WaitForUserInput: # Wait for the user to do something
  Type: Task
  Resource: arn:aws:states:::lambda:invoke.waitForTaskToken
  Parameters:
    FunctionName:
      Fn::GetAtt: [WaitForUserInputLambdaFunction, Arn]
    Payload:
      token.$: "$$.Task.Token"
      executionArn.$: "$$.Execution.Id"
  Next: DoSomethingElse
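For illustration, here is a rough sketch of the lookup described above, assuming the AWS SDK for Java v1 Step Functions client; the class name and surrounding structure are hypothetical, the output JSON is whatever your workflow expects, and pagination of the execution history is ignored.
import com.amazonaws.services.stepfunctions.AWSStepFunctions;
import com.amazonaws.services.stepfunctions.AWSStepFunctionsClientBuilder;
import com.amazonaws.services.stepfunctions.model.GetExecutionHistoryRequest;
import com.amazonaws.services.stepfunctions.model.GetExecutionHistoryResult;
import com.amazonaws.services.stepfunctions.model.HistoryEvent;
import com.amazonaws.services.stepfunctions.model.SendTaskSuccessRequest;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class CallbackNotifier {

    // Finds the task token scheduled for one specific execution and completes that task.
    public static void notifyExecution(String executionArn, String output) throws Exception {
        AWSStepFunctions client = AWSStepFunctionsClientBuilder.defaultClient();
        ObjectMapper mapper = new ObjectMapper();

        GetExecutionHistoryResult history = client.getExecutionHistory(
                new GetExecutionHistoryRequest().withExecutionArn(executionArn));

        for (HistoryEvent event : history.getEvents()) {
            // The TaskScheduled event of the callback state carries the Payload defined above,
            // including the task token injected via "$$.Task.Token".
            if ("TaskScheduled".equals(event.getType())) {
                JsonNode params = mapper.readTree(event.getTaskScheduledEventDetails().getParameters());
                JsonNode token = params.path("Payload").path("token");
                if (!token.isMissingNode()) {
                    client.sendTaskSuccess(new SendTaskSuccessRequest()
                            .withTaskToken(token.asText())
                            .withOutput(output));
                    return;
                }
            }
        }
    }
}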
I did a POC to check, and below is the solution.
If the token is consumed by getActivityTaskResult.getTaskToken() and your conditions are not satisfied by the request input, then you can use the line below to avoid consuming the token: awsStepFunctionClient.sendTaskHeartbeat(new SendTaskHeartbeatRequest().withTaskToken(taskToken))
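As a rough sketch of that pattern (assuming the AWS SDK for Java v1; the poller class, the executionId field in the input, and the output JSON are hypothetical):
import com.amazonaws.services.stepfunctions.AWSStepFunctions;
import com.amazonaws.services.stepfunctions.model.GetActivityTaskRequest;
import com.amazonaws.services.stepfunctions.model.GetActivityTaskResult;
import com.amazonaws.services.stepfunctions.model.SendTaskHeartbeatRequest;
import com.amazonaws.services.stepfunctions.model.SendTaskSuccessRequest;
import com.amazonaws.util.json.Jackson;
import com.fasterxml.jackson.databind.JsonNode;

public class ActivityPoller {

    // Polls the activity once; completes the task only if it belongs to the execution we care about.
    public static void pollOnce(AWSStepFunctions client, String activityArn, String myExecutionId) {
        GetActivityTaskResult task = client.getActivityTask(
                new GetActivityTaskRequest().withActivityArn(activityArn));

        if (task.getTaskToken() == null) {
            return; // nothing is currently waiting on this activity
        }

        JsonNode input = Jackson.jsonNodeOf(task.getInput());
        // Hypothetical condition: the state's input carries something identifying its execution.
        if (myExecutionId.equals(input.path("executionId").asText())) {
            client.sendTaskSuccess(new SendTaskSuccessRequest()
                    .withTaskToken(task.getTaskToken())
                    .withOutput("{\"status\":\"done\"}"));
        } else {
            // Not our execution: per the suggestion above, send a heartbeat
            // instead of completing or failing the task.
            client.sendTaskHeartbeat(new SendTaskHeartbeatRequest().withTaskToken(task.getTaskToken()));
        }
    }
}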
Hey, I read these JDBC docs:
https://www.playframework.com/documentation/2.1.0/ScalaDatabase
and this question:
Is it good to put jdbc operations in actors?
Now I have an actor class for my MySQL transactions, and this actor is instantiated several times, whenever a request comes in. So each request instantiates a new actor. Is that safe for the connection pool?
Can I use
val connection = DB.getConnection()
Can the connection object handle asynchronous transactions?
Could I just use a singleton to handle the MySQL connection and use it in all instantiated actors? Also, if I want to use Anorm, how do I make an implicit connection object?
Thanks
Your DB.getConnection() should return a Promise[Connection] or a Future[Connection] if you don't want to block the actor (caveats at the end of the answer).
If your DB.getConnection() is synchronous (returning a plain Connection without a wrapping type), your actor will block while processing the message until it actually gets a connection from the pool. It doesn't matter whether your DB is a singleton or not; in the end it will hit the connection pool.
That being said, you can create some actors to handle the messaging and other actors to handle persistence in the database, and put them on different thread dispatchers, giving more threads to the database-intensive one. This is also suggested in the Play Framework documentation.
Caveats:
If you run futures inside the actor you have no guarantee about the thread/timing they will run on. I'm assuming you did something along these lines (read the comments):
def receive = {
  case aMessage =>
    val aFuture = Future(db.getConnection)
    aFuture.map { theConn =>
      // Between acquiring the connection (previous line) and executing the next line,
      // a long time can pass, and they may run on different threads.
      // That's why it is better to create an actor that handles this synchronously
      // and let Akka do the async part.
      theConn.prepareStatement(someSQL)
      // omitted code...
    }
}
So my suggestion would be:
// actor A receives the messages,
// actor B handles the db work (and we have multiple instances of it because the db is slow)
class ActorA(routerOfB: ActorRef) extends Actor {
  def receive = {
    case aMessage =>
      routerOfB ! aMessage
  }
}

class ActorB(db: DB) extends Actor {
  def receive = {
    case aMessage =>
      val conn = db.getConnection // this blocks, but we have multiple instances
                                  // and it forces the work to run on the same thread
      val ps = conn.prepareStatement(someSQL)
  }
}
You will need routing: http://doc.akka.io/docs/akka/2.4.1/scala/routing.html
As far as I know, you cannot run multiple concurrent queries on a single connection with an RDBMS (I have not seen any resource/reference for async/non-blocking calls for MySQL, even in the C API). To run your queries concurrently, you must have multiple connection instances.
DB.getConnection isn't expensive as long as you have multiple connection instances. The most expensive part of working with a DB is running the SQL query and waiting for its response.
To make your DB calls async, you should run them on other threads (not in the main thread pool of Akka or Play); Slick does this for you. It manages a thread pool and runs your DB calls on it, so your main threads stay free to process incoming requests. Then you don't need to wrap your DB calls in actors to be async.
I think you should take a connection from the pool and return it when you are done. If we go with a single connection per actor, what happens if that connection gets disconnected? You may need to reinitialize it.
For transactions you may want to try:
DB.withTransaction { conn => /* do whatever you need with the connection */ }
For a more functional way of doing database access, I would recommend looking into Slick, which has a nice API that can be integrated with plain actors and, going further, with streams.
Is there any way I can add a timestamp header to the Kafka message payload? I want to check when the message was created at the consumer end and apply custom logic based on that.
EDIT:
I'm trying to find a way to attach a custom value (basically a timestamp) to messages published by producers, so that I can consume messages for a specific time duration. Right now Kafka only makes sure that messages will be delivered in the order they were put in the queue. But in my case a previously generated record might arrive after a certain delay (so a message generated at time T1 might have a higher offset, 1, than another generated at a later time T2 with offset 0). For this reason they will not be in the order I expect at the consumer's end. So I am basically looking for a way to consume them in an ordered way.
The current Kafka 0.8 release provides no way to attach anything other than the "message key" at the producer end. I found a similar topic here where it was advised to encode it in the message payload, but despite a lot of searching I couldn't find a workable approach.
Also, I don't know whether such an approach would have any impact on the overall performance of Kafka, since it manages the message offset internally and no such API is exposed so far, as can be seen from this page.
I would really appreciate any clue as to whether this is the right way to think about it, or whether there is another viable approach; I am all set to give it a try.
If you want to consume messages for a specific time duration, then I can provide you with a solution; however, consuming the messages from that duration in an ordered way is difficult. I am also looking for the same solution. Check the link below:
Message Sorting in Kafka Queue
Solution to fetch data for a specific time:
For times T1, T2, ..., TN, where T is the time range, divide the topic into N partitions. Now produce the messages using a Partitioner class in such a way that the message's generation time is used to decide which partition is used for that message.
Similarly, while consuming, subscribe to the exact partition for the time range you want to consume.
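As a minimal sketch of that idea, assuming the newer org.apache.kafka.clients.producer.Partitioner interface (Kafka 0.9+; in the 0.8 producer the equivalent is kafka.producer.Partitioner with a simpler partition(key, numPartitions) signature) and assuming the record key carries the creation timestamp in milliseconds; the class name and bucket size are illustrative only:
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

// Maps each message to a partition based on the time bucket its creation timestamp falls into.
public class TimeRangePartitioner implements Partitioner {

    private static final long BUCKET_MILLIS = 60 * 60 * 1000L; // one-hour buckets, chosen arbitrarily

    @Override
    public void configure(Map<String, ?> configs) {
        // no configuration needed for this sketch
    }

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionCountForTopic(topic);
        long timestamp = Long.parseLong(key.toString()); // key is assumed to be the creation time in ms
        long bucket = timestamp / BUCKET_MILLIS;
        return (int) (bucket % numPartitions); // map the time bucket onto the available partitions
    }

    @Override
    public void close() {
        // nothing to release
    }
}
You would register it via the producer's partitioner.class property and have each consumer read only the partition matching the time range it cares about.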
You can make a class that contains your partitioning information and the timestamp when this message was created, and then use this as the key to the Kafka message. You can then use a wrapper Serde that transforms this class into a byte array and back because Kafka can only understand bytes. Then, when you receive the message at the consumer end as a bag of bytes, you can deserialize it and retrieve the timestamp and then channel that into your logic.
For example:
public class KafkaKey implements Serializable {
private long mTimeStampInSeconds;
/* This contains other partitioning data that will be used by the
appropriate partitioner in Kafka. */
private PartitionData mPartitionData;
public KafkaKey(long timeStamp, ...) {
/* Initialize key */
mTimeStampInSeconds = timeStamp;
}
/* Simple getter for timestamp */
public long getTimeStampInSeconds() {
return mTimeStampInSeconds;
}
public static byte[] toBytes(KafkaKey kafkaKey) {
/* Some serialization logic. */
}
public static KafkaKey fromBytes(byte[] kafkaKey) throws Exception {
/* Some deserialization logic. */
}
}
/* Producer End */
KafkaKey kafkaKey = new KafkaKey(System.currentTimeMillis(), ... );
KeyedMessage<byte[], byte[]> kafkaMessage = new KeyedMessage<>(topic, KafkaKey.toBytes(kafkaKey), KafkaValue.toBytes(kafkaValue));
/* Consumer End */
MessageAndMetadata<byte[],byte[]> receivedMessage = (get from consumer);
KafkaKey kafkaKey = KafkaKey.fromBytes(receivedMessage.key());
long timestamp = kafkaKey.getTimeStampInSeconds();
/*
* And happily ever after */
This will be more flexible than making specific partitions correspond to time intervals. Otherwise, you'll have to keep adding partitions for different time ranges and keep a separate, synchronized tabulation of which partition corresponds to which time range, which can get unwieldy quickly.
This looks like it will help you achieve your goals. It lets you define and write your message headers with little effort, hiding the (de)serialization burden. The only thing you have to provide is a (de)serializer for the actual object you're sending over the wire. This implementation actually delays deserialization of the payload object as long as possible, which means that you can (in a very performant and transparent way) deserialize the headers, check the timestamp, and only deserialize the payload (the heavy part) if/when you are sure the object is useful to you.
Note, Kafka introduced timestamps to the internal representation of a message pursuant to this discussion:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-32+-+Add+timestamps+to+Kafka+message
and these tickets:
https://issues.apache.org/jira/browse/KAFKA-2511
It should be available in all versions of Kafka 0.10.0.0 and greater.
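For completeness, here is a minimal sketch of reading that built-in timestamp on the consumer side via ConsumerRecord.timestamp(); the topic name, broker address, and group id are placeholders, and poll(Duration) assumes a 2.0+ client (older 0.10.x clients use the poll(long) overload).
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class TimestampConsumer {
    public static void main(String[] args) {
        // Minimal consumer config; broker address and group id are placeholders.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "timestamp-demo");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // timestamp() is the producer- or broker-assigned timestamp introduced by KIP-32.
                    System.out.printf("offset=%d timestamp=%d value=%s%n",
                            record.offset(), record.timestamp(), record.value());
                }
            }
        }
    }
}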
The problem here is that you ingested messages in an order you no longer want. If the order matters, then you need to abandon parallelism in the relevant Producer(s). Then the problem at the Consumer level goes away.
Consider a running Hadoop job, in which a custom InputFormat needs to communicate ("return", similarly to a callback) a few simple values to the driver class (i.e., to the class that has launched the job), from within its overriden getSplits() method, using the new mapreduce API (as opposed to mapred).
These values should ideally be returned in-memory (as opposed to saving them to HDFS or to the DistributedCache).
If these values were only numbers, one could be tempted to use Hadoop counters. However, in numerous tests counters do not seem to be available at the getSplits() phase and anyway they are restricted to numbers.
An alternative could be to use the Configuration object of the job, which, as the source code reveals, should be the same object in memory for both the getSplits() and the driver class.
In such a scenario, if the InputFormat wants to "return" a (say) positive long value to the driver class, the code would look something like:
// In the custom InputFormat.
public List<InputSplit> getSplits(JobContext job) throws IOException
{
...
long value = ... // A value >= 0
job.getConfiguration().setLong("value", value);
...
}
// In the Hadoop driver class.
Job job = ... // Get the job to be launched
...
job.submit(); // Start running the job
...
while (!job.isComplete())
{
...
if (job.getConfiguration().getLong("value", -1) >= 0)
{
...
}
else
{
continue; // Wait for the value to be set by getSplits()
}
...
}
The above works in tests, but is it a "safe" way of communicating values?
Or is there a better approach for such in-memory "callbacks"?
UPDATE
The "in-memory callback" technique may not work in all Hadoop distributions, so, as mentioned above, a safer way is, instead of saving the values to be passed back in the Configuration object, create a custom object, serialize it (e.g., as JSON), saved it (in HDFS or in the distributed cache) and have it read in the driver class. I have also tested this approach and it works as expected.
Using the configuration is a perfectly suitable solution (admittedly for a problem I'm not sure I understand), but once the job has actually been submitted to the job tracker, you will not be able to amend this value (client side or task side) and expect to see the change on the opposite side of the comms (setting configuration values in a map task, for example, will not be persisted to the other mappers, nor to the reducers, nor will it be visible to the job tracker).
So communicating information from within getSplits back to your client polling loop (to see when the job has actually finished defining the input splits) is fine in your example.
What's your greater aim or use case for using this?
We use RabbitMQ to send jobs from a producer on one machine to a small group of consumers distributed across several machines.
The producer generates jobs and places them on the queue, and the consumers check the queue every 10 ms to see if there are any unclaimed jobs, fetching one job at a time when one is available. If one particular worker takes too long to process a job (GC pauses or another transient issue), the other consumers are free to remove jobs from the queue, so overall job throughput stays high.
When we originally set up this system, we were unable to figure out how to set up a subscription relationship for more than one consumer on the queue that would save us from having to poll and introduce that little extra bit of latency.
Inspecting the documentation has not yielded satisfying answers. We are new to using message queues and it is possible that we don't know the words that accurately describe the above scenario. This is something like a blackboard system, but in this case the "specialists" are all identical and never consume each other's results -- results are always reported back to the job producer.
Any ideas?
Getting publish-subscribe working is straightforward; I initially had the same problems, but it works well. The project now has some great help pages at http://www.rabbitmq.com/getstarted.html
RabbitMQ has timeouts and a resent (redelivered) flag, which you can use as you see fit.
You can also make the workers event-driven, as opposed to checking every 10 ms. If you need help with this, I have a small project at http://rabbitears.codeplex.com/ which might help slightly.
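To illustrate the event-driven approach, here is a minimal sketch of a push-based worker using the RabbitMQ Java client; the queue name and host are placeholders. basicQos(1) gives fair dispatch, so a worker stalled on a long job does not hoard unclaimed messages, and basicConsume with manual acks replaces the 10 ms polling loop.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import com.rabbitmq.client.AMQP;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DefaultConsumer;
import com.rabbitmq.client.Envelope;

public class Worker {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // placeholder broker host
        Connection connection = factory.newConnection();
        final Channel channel = connection.createChannel();

        channel.queueDeclare("jobs", true, false, false, null);
        // Fair dispatch: don't push a new job to a consumer until it has acked the previous one.
        channel.basicQos(1);

        // Push-based delivery: the broker invokes handleDelivery, so there is no polling loop.
        channel.basicConsume("jobs", false, new DefaultConsumer(channel) {
            @Override
            public void handleDelivery(String consumerTag, Envelope envelope,
                                       AMQP.BasicProperties properties, byte[] body) throws IOException {
                try {
                    processJob(new String(body, StandardCharsets.UTF_8)); // do the actual work
                } finally {
                    channel.basicAck(envelope.getDeliveryTag(), false); // report completion back to the broker
                }
            }
        });
    }

    private static void processJob(String job) {
        System.out.println("processing: " + job);
    }
}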
Here you have to keep in mind that a RabbitMQ channel is not thread-safe,
so create a singleton class that will handle all of these RabbitMQ operations,
like the following.
I am writing the code sample in Scala.
object QueueManager {
  val FACTORY = new ConnectionFactory
  FACTORY setUsername (RABBITMQ_USERNAME)
  FACTORY setPassword (RABBITMQ_PASSWORD)
  FACTORY setVirtualHost (RABBITMQ_VIRTUALHOST)
  FACTORY setPort (RABBITMQ_PORT)
  FACTORY setHost (RABBITMQ_HOST)

  val conn = FACTORY.newConnection
  val channel: com.rabbitmq.client.Channel = conn.createChannel

  // declare the consumer for queue1
  channel.exchangeDeclare(EXCHANGE_NAME, "direct", durable)
  channel.queueDeclare(QUEUE1, durable, false, false, null)
  channel queueBind (QUEUE1, EXCHANGE_NAME, QUEUE1_ROUTING_KEY)
  val queue1Consumer = new QueueingConsumer(channel)
  channel basicConsume (QUEUE1, false, queue1Consumer)

  // declare the consumer for queue2
  channel.exchangeDeclare(EXCHANGE_NAME, "direct", durable)
  channel.queueDeclare(QUEUE2, durable, false, false, null)
  channel queueBind (QUEUE2, EXCHANGE_NAME, QUEUE2_ROUTING_KEY)
  val queue2Consumer = new QueueingConsumer(channel)
  channel basicConsume (QUEUE2, false, queue2Consumer)

  // use a distinct routing key for each queue

  def addToQueueOne(obj: String): Unit = {
    channel.basicPublish(EXCHANGE_NAME, QUEUE1_ROUTING_KEY, MessageProperties.PERSISTENT_TEXT_PLAIN, obj.getBytes)
  }

  def addToQueueTwo(obj: String): Unit = {
    channel.basicPublish(EXCHANGE_NAME, QUEUE2_ROUTING_KEY, MessageProperties.PERSISTENT_TEXT_PLAIN, obj.getBytes)
  }

  def getFromQueue1: Delivery = queue1Consumer.nextDelivery

  def getFromQueue2: Delivery = queue2Consumer.nextDelivery
}
I have written a code sample for two queues; you can add more queues in the same way.