Group messages by key for pulsar key_shared subscription type

Say I have an unbounded set of keys for messages published to a pulsar topic.
{k0, k1, ..., kn}
And a finite set of expected message categories, where the category information is part of the message payload.
{c0, c1, c2}
Whenever all message categories for a given key are consumed I want to invoke an action in my application. For example, if I see the following key/category pairs, I would expect to see an action invoked.
{(k0, c0), (k0, c1), (k0, c2)} => action invoked for key k0
{(k1, c0), (k1, c1), (k1, c2)} => action invoked for key k1
In order to ensure application resiliency I only ack messages once all categories have been consumed. If a message pertaining to the same category is consumed twice I can ack the older message, holding on to one message per category.
Now, let's say I have a single consumer attached to the subscription and configured with the key_shared subscription type. We consume the following key/category pairs.
{(k0, c0), (k0, c1)}
And while waiting for (k0, c2) a second consumer is added to the subscription. According to this issue, the new consumer will not receive messages until the existing consumer acks or nacks the pending messages. This seems to be expected behaviour, and is indeed the behaviour I am seeing.
I am wondering if there is a more idiomatic way I can go about implementing this feature. Does it make sense to delay acking of messages in order to achieve this grouping behaviour?

Using a partitioned topic with the failover subscription type achieves our design goal. Below is a description of the approaches we explored and the behaviour we observed.
Non-partitioned topic with key_shared subscription
When the application is scaled out (more consumers added to the subscription), any pending messages (messages delivered to a consumer, but not yet acked) cause the new consumer to not receive any messages until the pending messages are acked/nacked or the pre-existing consumer unsubscribes.
Partitioned topic with failover subscription
When the application is scaled out, topic/partition pairs are re-assigned evenly across consumers and pending messages (if any) are re-delivered. Consumers need to be informed when ownership of a topic partition changes in order to clear their internal state; for this, the consumer event listener can be used. A rough sketch of the resulting consumer logic is shown below.
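For illustration, here is a minimal sketch of such a consumer against the partitioned topic using the Python client, assuming the category is carried in a JSON payload field named category and that the expected categories are known up front (topic, subscription, and field names are placeholders, not taken from the original setup).

import json
import pulsar

# Placeholder names; the 'category' payload field is an assumption.
EXPECTED_CATEGORIES = {'c0', 'c1', 'c2'}

client = pulsar.Client('pulsar://localhost:6650')
consumer = client.subscribe(
    'persistent://public/default/my-partitioned-topic',
    subscription_name='category-grouping',
    consumer_type=pulsar.ConsumerType.Failover)

# Per-key state: category -> latest unacked message for that category.
pending = {}

while True:
    msg = consumer.receive()
    key = msg.partition_key()
    category = json.loads(msg.data())['category']

    per_key = pending.setdefault(key, {})
    older = per_key.get(category)
    if older is not None:
        # Same category seen twice for this key: keep one message per
        # category and ack the older duplicate.
        consumer.acknowledge(older)
    per_key[category] = msg

    if set(per_key) == EXPECTED_CATEGORIES:
        # All categories observed for this key: invoke the action,
        # then ack everything so redelivery stops.
        print(f'action invoked for key {key}')
        for m in per_key.values():
            consumer.acknowledge(m)
        del pending[key]

When ownership of a partition moves to another consumer, the corresponding entries in pending should be discarded (this is where the consumer event listener comes in); since the held messages are unacked, they are re-delivered to the new owner, which rebuilds the state there.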

Related

Looking for an example of a OBD-II complete data frame

I'm developing an OBD-II reader where I want to query requests to read PID parameters with a stm32 processor. I already understand what should go on the data field, but the ID is giving me a headache. As I have read, one must send 0x7DF to broadcast a request, and each ECU will respond with his own ID. However, I have been asked to do this within the SAE J1939 protocol, which uses the 29 bit extended identifier, and I don't know what I need to add to this ID.
As I stated in the title, could someone show me some actual data from a bus using this method? I've been searching on the internet for real frames but did not have any luck so far.
I would also appreciate it if someone could shed some light on whether the OBD-II communication needs some acknowledgment to work properly.
Thanks
I would suggest taking a look at the SAE J1939 documentation, more specifically J1939/21, J1939-71 and J1939/73.
Generally, a J1939 transport protocol response sequence can be processed as follows:
Identify the BAM frame, indicating a new sequence being initiated (via PGN 60416 - 0xEC00, which can be reached via the ID 0x1CECFF00)
Extract the J1939 PGN from bytes 6-8 of the BAM payload to use as the identifier of the new frame
Construct the new data payload by concatenating bytes 2-8 of the data transfer frames (i.e. excluding the 1st byte)
As an example, take J1939 data transfer messages with ID 1CEBFF00 (PGN 60160 or EB00), preceded by a BAM whose last 3 bytes equal E3FE00. When reordered, these equal the PGN FEE3, aka Engine Configuration 1 (EC1). The payload is then found by concatenating the first 39 bytes across the 6 data transfer packets/frames.
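As a rough sketch of those three steps (not tied to any particular CAN library; frames are assumed to arrive as (can_id, data) pairs, with field offsets per J1939/21):

def pgn_of(can_id):
    # The PGN occupies bits 8-25 of the 29-bit identifier.
    return (can_id >> 8) & 0x3FFFF

def reassemble(frames):
    pgn = None
    total_size = 0
    payload = bytearray()
    for can_id, data in frames:
        pf = pgn_of(can_id) & 0xFF00          # drop the destination address byte (PDU1)
        if pf == 0xEC00 and data[0] == 0x20:  # TP.CM with control byte 0x20 = BAM
            total_size = data[1] | (data[2] << 8)             # bytes 2-3: total message size
            pgn = data[5] | (data[6] << 8) | (data[7] << 16)  # bytes 6-8: PGN, LSB first
        elif pf == 0xEB00:                    # TP.DT data transfer frame
            payload += data[1:8]              # bytes 2-8 carry the payload
    # e.g. a BAM ending in E3 FE 00 yields PGN 0xFEE3 (Engine Configuration 1)
    return pgn, bytes(payload[:total_size])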
The administrative control device, or any device issuing the vehicle use status PID, should be sensitive to the run switch status (SPN 3046 - 0xFDC0, which can probably be reached via 0x0CFDC000) and any other locally defined criteria for authorized use (i.e., driver log-ons) before the vehicle use status PID is used to generate an unauthorized use alarm.
Also, don't forget to read/send using extended ID messages, since J1939 uses the extended (29-bit) identifier.
In fact, I would suggest using can-utils to make your analysis even easier. With a simple candump or cansniffer you can see what is coming in on your broadcast.
Some cars' DBC files: https://github.com/commaai/opendbc

How does Deduplication work in Apache Pulsar?

I'm trying to use Deduplication feature of Apache Pulsar.
brokerDeduplicationEnabled=true is set in the standalone.conf file, but when I send the same message from the producer multiple times, I get all the messages at the consumer end. Is this expected behaviour?
Doesn't deduplication mean content-based deduplication, as in AWS SQS?
Here is my producer code for reference.
import pulsar
import json

client = pulsar.Client('pulsar://localhost:6650')
producer = client.create_producer(
    'persistent://public/default/my-topic',
    send_timeout_millis=0,
    producer_name="producer-1")

data = {'key1': 0, 'key2': 1}

for i in range(10):
    encoded_data = json.dumps(data).encode('utf-8')
    producer.send(encoded_data)

client.close()
In Pulsar, deduplication doesn't work on the content of the message. It works on the individual message. The intention isn't to deduplicate the content but to ensure an individual message cannot be published more than once.
When you send a message, Pulsar assigns it a unique identifier. Deduplication ensures that in failure scenarios the same message doesn't get stored in (or written to) Pulsar more than once. It does this by comparing the identifier to a list of already stored identifiers. If the identifier of the message has already been stored, Pulsar ignores it. This way, Pulsar will only store the message once. This is part of Pulsar's mechanism to guarantee that a message will be sent exactly once.
For more details, see PIP 6: Guaranteed Message Deduplication.
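To make the distinction concrete, here is a small sketch against the same Python client as above (broker URL and topic reused from the question), assuming the client version in use exposes the sequence_id parameter on send(). Deduplication is keyed on the producer name plus the message's sequence id, not on the payload bytes.

import pulsar

client = pulsar.Client('pulsar://localhost:6650')
producer = client.create_producer(
    'persistent://public/default/my-topic',
    producer_name='producer-1')

payload = b'{"key1": 0, "key2": 1}'

# The loop in the question lets the client auto-increment the sequence id,
# so every send() is a brand-new message and nothing is deduplicated.
# Re-publishing with the same explicit sequence id (e.g. after a retry)
# is what the broker drops:
producer.send(payload, sequence_id=42)
producer.send(payload, sequence_id=42)  # ignored: this sequence id is already stored

client.close()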

Duplicates on Apache Beam / Dataflow inputs even when using withIdAttribute

I am trying to ingest data from a 3rd party API into a Dataflow pipeline. Since the 3rd party doesn't make webhooks available, I wrote a custom script that constantly polls their endpoint for more data.
The data is refreshed every 15 minutes, but since I don't want to miss any datapoints and I want to consume them as soon as new data is available, my "crawler" runs every minute. The script then sends the data to a PubSub topic. It's easy to see that PubSub will receive about 15 repeated messages for each datapoint in the source.
My first attempt to identify and discard those repeated messages was to add a custom attribute to each PubSub message (eventid), created from a hash of its [ID + updated_time] at source.
const attributes = {
    eventid: Buffer.from(`${item.lastupdate}|${item.segmentid}`).toString('base64'),
    timestamp: item.timestamp.toString()
};
const dataBuffer = Buffer.from(JSON.stringify(item));
publisher.publish(dataBuffer, attributes);
Then I configured Dataflow with a withIdAttribute() (which is the new idLabel(), based on Record IDs).
PCollection<String> input = p
    .apply("ReadFromPubSub", PubsubIO
        .readStrings()
        .fromTopic(String.format("projects/%s/topics/%s", options.getProject(), options.getIncomingDataTopic()))
        .withTimestampAttribute("timestamp")
        .withIdAttribute("eventid"))
    .apply("OutputToBigQuery", ...)
With that implementation, I was expecting that when the script sends the same datapoint a second time, the repeated eventid would be the same and the message discarded. But for some reason, I still see duplicates on the output dataset.
Some questions:
Is there a clever way to ingest the data to dataflow from that 3rd party API if they don't provide webhooks?
Any ideas on why Dataflow is not discarding the messages in this situation?
I know about the 10-minute restriction for deduplication on dataflow, but I see duplicated data even on the 2nd insertion (2 minutes).
Any help will be greatly appreciated!
I think you are on the right track; instead of the hash, I recommend using timestamps. A better way to do this is by using windows. Review this document, which filters data that is outside of the window.
Regarding the additional duplicate data: if you are using pull subscriptions and the acknowledgement deadline is reached before the data has been processed, the message will be resent as per the at-least-once delivery. In this case, change the acknowledgement deadline; the default is 10 seconds.
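For completeness, here is a Python sketch of the publishing side (project, topic, and field names are placeholders), which highlights the property withIdAttribute() relies on: the eventid attribute must be exactly the same string every time the same datapoint is re-published, and deduplication only applies within roughly a 10-minute window, as noted in the question.

import base64
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path('my-project', 'incoming-data-topic')

def publish(item):
    # Deterministic id derived from the source record, mirroring the
    # JavaScript snippet above: identical input -> identical eventid.
    eventid = base64.b64encode(
        f"{item['lastupdate']}|{item['segmentid']}".encode()).decode()
    data = json.dumps(item).encode('utf-8')
    # Attributes must be strings; they ride alongside the payload.
    return publisher.publish(
        topic_path, data,
        eventid=eventid,
        timestamp=str(item['timestamp']))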

Is TTL implemented in Pika?

I'd like my queue to drop messages not processed within a certain time.
I already do this in the consumer by recording the publish time. However, in the case that no one is subscribing, it would be better for the queue to simply drop stale messages.
Can I set an expiry time (TTL) on messages in Pika? The RabbitMQ docs talk about it but I don't see TTL references in the Pika docs.
You can set the per message TTL using the expiration flag on the BasicProperties object, as seen in the pika documentation here.
Using it would look something like this.
channel.basic_publish(
    exchange='',
    routing_key='hello_world',
    properties=pika.BasicProperties(
        expiration='60000',
    ),
    body='my message'
)
Keep in mind that the expiration policy is expressed in milliseconds as a string, so '60000' translates to 60 seconds.
You can read more about message-based TTL and its caveats here.
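For a self-contained sketch (the local broker and queue name are assumptions), combining the per-message expiration above with RabbitMQ's per-queue x-message-ttl argument, which covers the "nobody is subscribed" case at the queue level:

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

# Per-queue TTL: every message in this queue expires after 60 seconds.
# Note the arguments must match any existing declaration of the same queue,
# otherwise the declare fails with a precondition error.
channel.queue_declare(queue='hello_world',
                      arguments={'x-message-ttl': 60000})

# Per-message TTL via the expiration property, as shown above.
channel.basic_publish(
    exchange='',
    routing_key='hello_world',
    properties=pika.BasicProperties(expiration='60000'),
    body='my message')

connection.close()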

kafka-python 1.3.3: KafkaProducer.send with explicit key fails to send message to broker

(Possibly a duplicate of Can't send a keyedMessage to brokers with partitioner.class=kafka.producer.DefaultPartitioner, although the OP of that question didn't mention kafka-python. And anyway, it never got an answer.)
I have a Python program that has been successfully (for many months) sending messages to the Kafka broker, using essentially the following logic:
producer = kafka.KafkaProducer(bootstrap_servers=[some_addr],
                               retries=3)
...
msg = json.dumps(some_message)
res = producer.send(some_topic, value=msg)
Recently, I tried to upgrade it to send messages to different partitions based on a definite key value extracted from the message:
producer = kafka.KafkaProducer(bootstrap_servers=[some_addr],
                               key_serializer=str.encode,
                               retries=3)
...
try:
    key = some_message[0]
except:
    key = None
msg = json.dumps(some_message)
res = producer.send(some_topic, value=msg, key=key)
However, with this code, no messages ever make it out of the program to the broker. I've verified that the key value extracted from some_message is always a valid string. Presumably I don't need to define my own partitioner, since, according to the documentation:
The default partitioner implementation hashes each non-None key using the same murmur2 algorithm as the java client so that messages with the same key are assigned to the same partition.
Furthermore, with the new code, when I try to determine what happened to my send by calling res.get (to obtain a kafka.FutureRecordMetadata), that call throws a TypeError exception with the message "descriptor 'encode' requires a 'str' object but received a 'unicode'".
(As a side question, I'm not exactly sure what I'd do with the FutureRecordMetadata if I were actually able to get it. Based on the kafka-python source code, I assume I'd want to call either its succeeded or its failed method, but the documentation is silent on the point. The documentation does say that the return value of send "resolves to" RecordMetadata, but I haven't been able to figure out, from either the documentation or the code, what "resolves to" means in this context.)
Anyway: I can't be the only person using kafka-python 1.3.3 who's ever tried to send messages with a partitioning key, and I have not seen anything on teh Intertubes describing a similar problem (except for the SO question I referenced at the top of this post).
I'm certainly willing to believe that I'm doing something wrong, but I have no idea what that might be. Is there some additional parameter I need to supply to the KafkaProducer constructor?
The fundamental problem turned out to be that my key value was a unicode, even though I was quite convinced that it was a str. Hence the selection of str.encode for my key_serializer was inappropriate, and was what led to the exception from res.get. Omitting the key_serializer and calling key.encode('utf-8') was enough to get my messages published, and partitioned as expected.
A large contributor to the obscurity of this problem (for me) was that the kafka-python 1.3.3 documentation does not go into any detail on what a FutureRecordMetadata really is, nor what one should expect in the way of exceptions its get method can raise. The sole usage example in the documentation:
# Asynchronous by default
future = producer.send('my-topic', b'raw_bytes')

# Block for 'synchronous' sends
try:
    record_metadata = future.get(timeout=10)
except KafkaError:
    # Decide what to do if produce request failed...
    log.exception()
    pass
suggests that the only kind of exception it will raise is KafkaError, which is not true. In fact, get can and will (re-)raise any exception that the asynchronous publishing mechanism encountered in trying to get the message out the door.
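Concretely, the working variant looked roughly like this (server address, topic, and message contents are placeholders, not from the original program):

import json
import kafka
from kafka.errors import KafkaError

# No key_serializer; the key is encoded explicitly so send() gets bytes.
producer = kafka.KafkaProducer(bootstrap_servers=['localhost:9092'], retries=3)

some_message = ['my-key', {'field': 'value'}]
key = some_message[0].encode('utf-8')
msg = json.dumps(some_message).encode('utf-8')

future = producer.send('some-topic', value=msg, key=key)
try:
    record_metadata = future.get(timeout=10)   # blocks until acked or failed
    print(record_metadata.topic, record_metadata.partition, record_metadata.offset)
except KafkaError:
    # Broker-side failures land here; as noted above, get() can also
    # re-raise other exception types hit by the background sender.
    raise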
I also faced the same error. Once I added json.dumps while sending the key, it worked.
producer.send(
    topic="first_topic",
    key=json.dumps(key).encode('utf-8'),
    value=json.dumps(msg).encode('utf-8')
).add_callback(on_send_success).add_errback(on_send_error)