kafka-python 1.3.3: KafkaProducer.send with explicit key fails to send message to broker - kafka-producer-api

(Possibly a duplicate of Can't send a keyedMessage to brokers with partitioner.class=kafka.producer.DefaultPartitioner, although the OP of that question didn't mention kafka-python. And anyway, it never got an answer.)
I have a Python program that has been successfully (for many months) sending messages to the Kafka broker, using essentially the following logic:
producer = kafka.KafkaProducer(bootstrap_servers=[some_addr],
retries=3)
...
msg = json.dumps(some_message)
res = producer.send(some_topic, value=msg)
Recently, I tried to upgrade it to send messages to different partitions based on a definite key value extracted from the message:
producer = kafka.KafkaProducer(bootstrap_servers=[some_addr],
key_serializer=str.encode,
retries=3)
...
try:
key = some_message[0]
except:
key = None
msg = json.dumps(some_message)
res = producer.send(some_topic, value=msg, key=key)
However, with this code, no messages ever make it out of the program to the broker. I've verified that the key value extracted from some_message is always a valid string. Presumably I don't need to define my own partitioner, since, according to the documentation:
The default partitioner implementation hashes each non-None key using the same murmur2 algorithm as the java client so that messages with the same key are assigned to the same partition.
Furthermore, with the new code, when I try to determine what happened to my send by calling res.get (to obtain a kafka.FutureRecordMetadata), that call throws a TypeError exception with the message descriptor 'encode' requires a 'str' object but received a 'unicode'.
(As a side question, I'm not exactly sure what I'd do with the FutureRecordMetadata if I were actually able to get it. Based on the kafka-python source code, I assume I'd want to call either its succeeded or its failed method, but the documentation is silent on the point. The documentation does say that the return value of send "resolves to" RecordMetadata, but I haven't been able to figure out, from either the documentation or the code, what "resolves to" means in this context.)
Anyway: I can't be the only person using kafka-python 1.3.3 who's ever tried to send messages with a partitioning key, and I have not seen anything on teh Intertubes describing a similar problem (except for the SO question I referenced at the top of this post).
I'm certainly willing to believe that I'm doing something wrong, but I have no idea what that might be. Is there some additional parameter I need to supply to the KafkaProducer constructor?

The fundamental problem turned out to be that my key value was a unicode, even though I was quite convinced that it was a str. Hence the selection of str.encode for my key_serializer was inappropriate, and was what led to the exception from res.get. Omitting the key_serializer and calling key.encode('utf-8') was enough to get my messages published, and partitioned as expected.
A large contributor to the obscurity of this problem (for me) was that the kafka-python 1.3.3 documentation does not go into any detail on what a FutureRecordMetadata really is, nor what one should expect in the way of exceptions its get method can raise. The sole usage example in the documentation:
# Asynchronous by default
future = producer.send('my-topic', b'raw_bytes')
# Block for 'synchronous' sends
try:
record_metadata = future.get(timeout=10)
except KafkaError:
# Decide what to do if produce request failed...
log.exception()
pass
suggests that the only kind of exception it will raise is KafkaError, which is not true. In fact, get can and will (re-)raise any exception that the asynchronous publishing mechanism encountered in trying to get the message out the door.

I also faced the same error. Once I added json.dumps while sending the key, it worked.
producer.send(topic="first_topic", key=json.dumps(key)
.encode('utf-8'), value=json.dumps(msg)
.encode('utf-8'))
.add_callback(on_send_success).add_errback(on_send_error)

Related

How does Deduplication work in Apache Pulsar?

I'm trying to use Deduplication feature of Apache Pulsar.
brokerDeduplicationEnabled=true is set in standalone.conf file, But
when I send the same message from producer multiple times, I get all the messages at consumer end, is this expected behaviour ?
Isn't deduplication means content based deduplication as in AWS SQS ?
Here is my producer code for reference.
import pulsar
import json
client = pulsar.Client('pulsar://localhost:6650')
producer = client.create_producer(
'persistent://public/default/my-topic',
send_timeout_millis=0,
producer_name="producer-1")
data = {'key1': 0, 'key2' : 1}
for i in range(10):
encoded_data = json.dumps(data).encode('utf-8')
producer.send(encoded_data)
client.close()
In Pulsar, deduplication doesn't work on the content of the message. It works on the individual message. The intention isn't to deduplicate the content but to ensure an individual message cannot be be published more than once.
When you send a message, Pulsar assigns it an unique identifier. Deduplication ensures that in failure scenarios the same message doesn't get stored in (or written to) Pulsar more than once. It does this by comparing the identifier to a list of already stored identifiers. If the identifier of the message has already been stored, Pulsar ignores it. This way, Pulsar will only store the message once. This is part of Pulsar's mechanism to guarantee a message will be sent exactly once.
For more details, see PIP 6: Guaranteed Message Deduplication.

Simperium Data Dictionary or Decoder Ring for Return Value on "all" call?

I've looked through all of the Simperium API docs for all of the different programming languages and can't seem to find this. Is there any documentation for the data returned from an ".all" call (e.g. api.todo.all(:cv=>nil, :data=>false, :username=>false, :most_recent=>false, :timeout=>nil) )?
For example, this is some data returned:
{"ccid"=>"10101010101010101010101010110101010",
"o"=>"M",
"cv"=>"232323232323232323232323232",
"clientid"=>"ab-123123123123123123123123",
"v"=>{
"date"=>{"o"=>"+", "v"=>"2015-08-20T00:00:00-07:00"},
"calendar"=>{"o"=>"+", "v"=>false},
"desc"=>{"o"=>"+", "v"=>"<p>test</p>\r\n"},
"location"=>{"o"=>"+", "v"=>"Los Angeles"},
"id"=>{"o"=>"+", "v"=>43}
},
"ev"=>1,
"id"=>"abababababababababababababab/10101010101010101010101010110101010"}
I can figure out some of it just from context or from the name of the key but a lot of it is guesswork and trial and error. The one that concerns me is the value returned for the "o" key. I assume that a value of "M" is modify and a value of "+" is add. I've also run into "-" for delete and just recently discovered that there is also a "! '-'" which is also a delete but don't know what else it signifies. What other values can be returned in the "o" key? Are there other keys/values that can be returned but are rare? Is there documentation that details what can be returned (that would be the most helpful)?
If it matters, I am using the Ruby API but I think this is a question that, if answered, can be helpful for all APIs.
The response you are seeing is a list of all of the changes which have occurred in the given bucket since some point in its history. In the case where cv is blank, it tries to get the full history.
You can find some of the details in the protocol documentation though it's incomplete and focused on the WebSocket message syntax (the operations are the same however as with the HTTP API).
The information provided by the v parameter is the result of applying the JSON-diff algorithm to the data between changes. With this diff information you can reconstruct the data at any given version as the changes stream in.

How to best validate JSON on the server-side

When handling POST, PUT, and PATCH requests on the server-side, we often need to process some JSON to perform the requests.
It is obvious that we need to validate these JSONs (e.g. structure, permitted/expected keys, and value types) in some way, and I can see at least two ways:
Upon receiving the JSON, validate the JSON upfront as it is, before doing anything with it to complete the request.
Take the JSON as it is, start processing it (e.g. access its various key-values) and try to validate it on-the-go while performing business logic, and possibly use some exception handling to handle vogue data.
The 1st approach seems more robust compared to the 2nd, but probably more expensive (in time cost) because every request will be validated (and hopefully most of them are valid so the validation is sort of redundant).
The 2nd approach may save the compulsory validation on valid requests, but mixing the checks within business logic might be buggy or even risky.
Which of the two above is better? Or, is there yet a better way?
What you are describing with POST, PUT, and PATCH sounds like you are implementing a REST API. Depending on your back-end platform, you can use libraries that will map JSON to objects which is very powerful and performs that validation for you. In JAVA, you can use Jersey, Spring, or Jackson. If you are using .NET, you can use Json.NET.
If efficiency is your goal and you want to validate every single request, it would be ideal if you could evaluate on the front-end if you are using JavaScript you can use json2.js.
In regards to comparing your methods, here is a Pro / Cons list.
Method #1: Upon Request
Pros
The business logic integrity is maintained. As you mentioned trying to validate while processing business logic could result in invalid tests that may actually be valid and vice versa or also the validation could inadvertently impact the business logic negatively.
As Norbert mentioned, catching the errors before hand will improve efficiency. The logical question this poses is why spend the time processing, if there are errors in the first place?
The code will be cleaner and easier to read. Having validation and business logic separated will result in cleaner, easier to read and maintain code.
Cons
It could result in redundant processing meaning longer computing time.
Method #2: Validation on the Go
Pros
It's efficient theoretically by saving process and compute time doing them at the same time.
Cons
In reality, the process time that is saved is likely negligible (as mentioned by Norbert). You are still doing the validation check either way. In addition, processing time is wasted if an error was found.
The data integrity can be comprised. It could be possible that the JSON becomes corrupt when processing it this way.
The code is not as clear. When reading the business logic, it may not be as apparent what is happening because validation logic is mixed in.
What it really boils down to is Accuracy vs Speed. They generally have an inverse relationship. As you become more accurate and validate your JSON, you may have to compromise some on speed. This is really only noticeable in large data sets as computers are really fast these days. It is up to you to decide what is more important given how accurate you think you data may be when receiving it or whether that extra second or so is crucial. In some cases, it does matter (i.e. with the stock market and healthcare applications, milliseconds matter) and both are highly important. It is in those cases, that as you increase one, for example accuracy, you may have to increase speed by getting a higher performant machine.
Hope this helps.
The first approach is more robust, but does not have to be noticeably more expensive. It becomes way less expensive even when you are able to abort the parsing process due to errors: Your business logic usually takes >90% of the resources in a process, so if you have an error % of 10%, you are already resource neutral. If you optimize the validation process so that the validations from the business process are performed upfront, your error rate might be much lower (like 1 in 20 to 1 in 100) to stay resource neutral.
For an example on an implementation assuming upfront data validation, look at GSON (https://code.google.com/p/google-gson/):
GSON works as follows: Every part of the JSON can be cast into an object. This object is typed or contains typed data:
Sample object (JAVA used as example language):
public class someInnerDataFromJSON {
String name;
String address;
int housenumber;
String buildingType;
// Getters and setters
public String getName() { return name; }
public void setName(String name) { this.name=name; }
//etc.
}
The data parsed by GSON is by using the model provided, already type checked.
This is the first point where your code can abort.
After this exit point assuming the data confirmed to the model, you can validate if the data is within certain limits. You can also write that into the model.
Assume for this buildingType is a list:
Single family house
Multi family house
Apartment
You can check data during parsing by creating a setter which checks the data, or you can check it after parsing in a first set of your business rule application. The benefit of first checking the data is that your later code will have less exception handling, so less and easier to understand code.
I would definitively go for validation before processing.
Let's say you receive some json data with 10 variables of which you expect:
the first 5 variables to be of type string
6 and 7 are supposed to be integers
8, 9 and 10 are supposed to be arrays
You can do a quick variable type validation before you start processing any of this data and return a validation error response if one of the ten fails.
foreach($data as $varName => $varValue){
$varType = gettype($varValue);
if(!$this->isTypeValid($varName, $varType)){
// return validation error
}
}
// continue processing
Think of the scenario where you are directly processing the data and then the 10th value turns out to be of invalid type. The processing of the previous 9 variables was a waste of resources since you end up returning some validation error response anyway. On top of that you have to rollback any changes already persisted to your storage.
I only use variable type in my example but I would suggest full validation (length, max/min values, etc) of all variables before processing any of them.
In general, the first option would be the way to go. The only reason why you might need to think of the second option is if you were dealing with JSON data which was tens of MBs large or more.
In other words, only if you are trying to stream JSON and process it on the fly, you will need to think about second option.
Assuming that you are dealing with few hundred KB at most per JSON, you can just go for option one.
Here are some steps you could follow:
Go for a JSON parser like GSON that would just convert your entire
JSON input into the corresponding Java domain model object. (If GSON
doesn't throw an exception, you can be sure that the JSON is
perfectly valid.)
Of course, the objects which were constructed using GSON in step 1
may not be in a functionally valid state. For example, functional
checks like mandatory fields and limit checks would have to be done.
For this, you could define a validateState method which repeatedly
validates the states of the object itself and its child objects.
Here is an example of a validateState method:
public void validateState(){
//Assume this validateState is part of Customer class.
if(age<12 || age>150)
throw new IllegalArgumentException("Age should be in the range 12 to 120");
if(age<18 && (guardianId==null || guardianId.trim().equals(""))
throw new IllegalArgumentException("Guardian id is mandatory for minors");
for(Account a:customer.getAccounts()){
a.validateState(); //Throws appropriate exceptions if any inconsistency in state
}
}
The answer depends entirely on your use case.
If you expect all calls to originate in trusted clients then the upfront schema validation should be implement so that it is activated only when you set a debug flag.
However, if your server delivers public api services then you should validate the calls upfront. This isn't just a performance issue - your server will likely be scrutinized for security vulnerabilities by your customers, hackers, rivals, etc.
If your server delivers private api services to non-trusted clients (e.g., in a closed network setup where it has to integrate with systems from 3rd party developers), then you should at least run upfront those checks that will save you from getting blamed for someone else's goofs.
It really depends on your requirements. But in general I'd always go for #1.
Few considerations:
For consistency I'd use method #1, for performance #2. However when using #2 you have to take into account that rolling back in case of non valid input may become complicated in the future, as the logic changes.
Json validation should not take that long. In python you can use ujson for parsing json strings which is a ultrafast C implementation of the json python module.
For validation, I use the jsonschema python module which makes json validation easy.
Another approach:
if you use jsonschema, you can validate the json request in steps. I'd perform an initial validation of the most common/important parts of the json structure, and validate the remaining parts along the business logic path. This would allow to write simpler json schemas and therefore more lightweight.
The final decision:
If (and only if) this decision is critical I'd implement both solutions, time-profile them in right and wrong input condition, and weight the results depending on the wrong input frequency. Therefore:
1c = average time spent with method 1 on correct input
1w = average time spent with method 1 on wrong input
2c = average time spent with method 2 on correct input
2w = average time spent with method 2 on wrong input
CR = correct input rate (or frequency)
WR = wrong input rate (or frequency)
if ( 1c * CR ) + ( 1w * WR) <= ( 2c * CR ) + ( 2w * WR):
chose method 1
else:
chose method 2

What are the cons of returning an Exception instance instead of raising it in Python?

I have been doing some work with python-couchdb and desktopcouch. In one of the patches I submitted I wrapped the db.update function from couchdb. For anyone that is not familiar with python-couchdb the function is the following:
def update(self, documents, **options):
"""Perform a bulk update or insertion of the given documents using a
single HTTP request.
>>> server = Server('http://localhost:5984/')
>>> db = server.create('python-tests')
>>> for doc in db.update([
... Document(type='Person', name='John Doe'),
... Document(type='Person', name='Mary Jane'),
... Document(type='City', name='Gotham City')
... ]):
... print repr(doc) #doctest: +ELLIPSIS
(True, '...', '...')
(True, '...', '...')
(True, '...', '...')
>>> del server['python-tests']
The return value of this method is a list containing a tuple for every
element in the `documents` sequence. Each tuple is of the form
``(success, docid, rev_or_exc)``, where ``success`` is a boolean
indicating whether the update succeeded, ``docid`` is the ID of the
document, and ``rev_or_exc`` is either the new document revision, or
an exception instance (e.g. `ResourceConflict`) if the update failed.
If an object in the documents list is not a dictionary, this method
looks for an ``items()`` method that can be used to convert the object
to a dictionary. Effectively this means you can also use this method
with `schema.Document` objects.
:param documents: a sequence of dictionaries or `Document` objects, or
objects providing a ``items()`` method that can be
used to convert them to a dictionary
:return: an iterable over the resulting documents
:rtype: ``list``
:since: version 0.2
"""
As you can see, this function does not raise the exceptions that have been raised by the couchdb server but it rather returns them in a tuple with the id of the document that we wanted to update.
One of the reviewers went to #python on irc to ask about the matter. In #python they recommended to use sentinel values rather than exceptions. As you can imaging just an approach is not practical since there are lots of possible exceptions that can be received. My questions is, what are the cons of using Exceptions over sentinel values besides that using exceptions is uglier?
I think it is ok to return the exceptions in this case, because some parts of the update function may succeed and some may fail. When you raise the exception, the API user has no control over what succeeded already.
Raising an Exception is a notification that something that was expected to work did not work. It breaks the program flow, and should only be done if whatever is going on now is flawed in a way that the program doesn't know how to handle.
But sometimes you want to raise a little error flag without breaking program flow. You can do this by returning special values, and these values can very well be exceptions.
Python does this internally in one case. When you compare two values like foo < bar, the actual call is foo.__lt__(bar). If this method raises an exception, program flow will be broken, as expected. But if it returns NotImplemented, Python will then try bar.__ge__(foo) instead. So in this case returning the exception rather than raising it is used to flag that it didn't work, but in an expected way.
It's really the difference between an expected error and an unexpected one, IMO.
exceptions intended to be raised. It helps with debugging, handling causes of the errors and it's clear and well-established practise of other developers.
I think looking at the interface of the programme, it's not clear what am I supposed to do with returned exception. raise it? from outside of the chain that actually caused it? it seems a bit convoluted.
I'd suggest, returning docid, new_rev_doc tuple on success and propagating/raising exception as it is. Your approach duplicates success and type of 3rd returned value too.
Exceptions cause the normal program flow to break; then exceptions go up the call stack until they're intercept, or they may reach the top if they aren't. Hence they're employed to mark a really special condition that should be handled by the caller. Raising an exception is useful since the program won't continue if a necessary condition has not been met.
In languages that don't support exceptions (like C) you're often forced to check return values of functions to verify everything went on correctly; otherwise the program may misbehave.
By the way the update() is a bit different:
it takes multiple arguments; some may fail, some may succeed, hence it needs a way to communicate results for each arg.
a previous failure has no relation with operations coming next, e.g. it is not a permanent error
In that situation raising an exception would NOT be usueful in an API. On the other hand, if the connection to the db drops while executing the query, then an exception is the way to go (since it's a permament error and would impact all operations coming next).
By the way if your business logic requires all operations to complete successfully and you don't know what to do when an update fails (i.e. your design says it should never happen), feel free to raise an exception in your own code.

api documentation and "value limits": do they match?

Do you often see in API documentation (as in 'javadoc of public functions' for example) the description of "value limits" as well as the classic documentation ?
Note: I am not talking about comments within the code
By "value limits", I mean:
does a parameter can support a null value (or an empty String, or...) ?
does a 'return value' can be null or is guaranteed to never be null (or can be "empty", or...) ?
Sample:
What I often see (without having access to source code) is:
/**
* Get all readers name for this current Report. <br />
* <b>Warning</b>The Report must have been published first.
* #param aReaderNameRegexp filter in order to return only reader matching the regexp
* #return array of reader names
*/
String[] getReaderNames(final String aReaderNameRegexp);
What I like to see would be:
/**
* Get all readers name for this current Report. <br />
* <b>Warning</b>The Report must have been published first.
* #param aReaderNameRegexp filter in order to return only reader matching the regexp
* (can be null or empty)
* #return array of reader names
* (null if Report has not yet been published,
* empty array if no reader match criteria,
* reader names array matching regexp, or all readers if regexp is null or empty)
*/
String[] getReaderNames(final String aReaderNameRegexp);
My point is:
When I use a library with a getReaderNames() function in it, I often do not even need to read the API documentation to guess what it does. But I need to be sure how to use it.
My only concern when I want to use this function is: what should I expect in term of parameters and return values ? That is all I need to know to safely setup my parameters and safely test the return value, yet I almost never see that kind of information in API documentation...
Edit:
This can influence the usage or not for checked or unchecked exceptions.
What do you think ? value limits and API, do they belong together or not ?
I think they can belong together but don't necessarily have to belong together. In your scenario, it seems like it makes sense that the limits are documented in such a way that they appear in the generated API documentation and intellisense (if the language/IDE support it).
I think it does depend on the language as well. For example, Ada has a native data type that is a "restricted integer", where you define an integer variable and explicitly indicate that it will only (and always) be within a certain numeric range. In that case, the datatype itself indicates the restriction. It should still be visible and discoverable through the API documentation and intellisense, but wouldn't be something that a developer has to specify in the comments.
However, languages like Java and C# don't have this type of restricted integer, so the developer would have to specify it in the comments if it were information that should become part of the public documentation.
I think those kinds of boundary conditions most definitely belong in the API. However, I would (and often do) go a step further and indicate WHAT those null values mean. Either I indicate it will throw an exception, or I explain what the expected results are when the boundary value is passed in.
It's hard to remember to always do this, but it's a good thing for users of your class. It's also difficult to maintain it if the contract the method presents changes (like null values are changed to no be allowed)... you have to be diligent also to update the docs when you change the semantics of the method.
Question 1
Do you often see in API documentation (as in 'javadoc of public functions' for example) the description of "value limits" as well as the classic documentation?
Almost never.
Question 2
My only concern when I want to use this function is: what should I expect in term of parameters and return values ? That is all I need to know to safely setup my parameters and safely test the return value, yet I almost never see that kind of information in API documentation...
If I used a function not properly I would expect a RuntimeException thrown by the method or a RuntimeException in another (sometimes very far) part of the program.
Comments like #param aReaderNameRegexp filter in order to ... (can be null or empty) seems to me a way to implement Design by Contract in a human-being language inside Javadoc.
Using Javadoc to enforce Design by Contract was used by iContract, now resurrected into JcontractS, that let you specify invariants, preconditions, postconditions, in more formalized way compared to the human-being language.
Question 3
This can influence the usage or not for checked or unchecked exceptions.
What do you think ? value limits and API, do they belong together or not ?
Java language doesn't have a Design by Contract feature, so you might be tempted to use Execption but I agree with you about the fact that you have to be aware about When to choose checked and unchecked exceptions. Probably you might use unchecked IllegalArgumentException, IllegalStateException, or you might use unit testing, but the major problem is how to communicate to other programmers that such code is about Design By Contract and should be considered as a contract before changing it too lightly.
I think they do, and have always placed comments in the header files (c++) arcordingly.
In addition to valid input/output/return comments, I also note which exceptions are likly to be thrown by the function (since I often want to use the return value for...well returning a value, I prefer exceptions over error codes)
//File:
// Should be a path to the teexture file to load, if it is not a full path (eg "c:\example.png") it will attempt to find the file usign the paths provided by the DataSearchPath list
//Return: The pointer to a Texture instance is returned, in the event of an error, an exception is thrown. When you are finished with the texture you chould call the Free() method.
//Exceptions:
//except::FileNotFound
//except::InvalidFile
//except::InvalidParams
//except::CreationFailed
Texture *GetTexture(const std::string &File);
#Fire Lancer: Right! I forgot about exception, but I would like to see them mentioned, especially the unchecked 'runtime' exception that this public method could throw
#Mike Stone:
you have to be diligent also to update the docs when you change the semantics of the method.
Mmmm I sure hope that the public API documentation is at the very least updated whenever a change -- that affects the contract of the function -- takes place. If not, those API documentations could be drop altogether.
To add food to yours thoughts (and go with #Scott Dorman), I just stumble upon the future of java7 annotations
What does that means ? That certain 'boundary conditions', rather than being in the documentation, should be better off in the API itself, and automatically used, at compilation time, with appropriate 'assert' generated code.
That way, if a '#CheckForNull' is in the API, the writer of the function might get away with not even documenting it! And if the semantic change, its API will reflect that change (like 'no more #CheckForNull' for instance)
That kind of approach suggests that documentation, for 'boundary conditions', is an extra bonus rather than a mandatory practice.
However, that does not cover the special values of the return object of a function. For that, a complete documentation is still needed.