Understanding the max.in.flight property of the Kafka producer configuration

I am benchmarking my Kafka cluster, version 1.0.0-cp1.
In the part of my benchmark that focuses on the maximum throughput possible with an ordering guarantee and no data loss (a topic with only one partition), do I need to set the max.in.flight.requests.per.connection property to 1?
I've read this article, and I understand that I only have to set max.in.flight to 1 if I enable retries on my producer with the retries property.
Another way to ask my question: is only one partition + retries=0 (producer props) sufficient to guarantee ordering in Kafka?
I need to know because increasing max.in.flight drastically increases throughput.

Your use case is slightly unclear. You mention ordering and no data loss but don't specify whether you tolerate duplicate messages, so it's unclear if you want At Least Once (QoS 1) or Exactly Once.
Either way, as you're using 1.0.0 and only a single partition, you should have a look at the Idempotent Producer instead of tweaking the producer configs. It allows you to properly and efficiently guarantee ordering and no data loss.
From the documentation:
Idempotent delivery ensures that messages are delivered exactly once to a particular topic partition during the lifetime of a single producer.
The early Idempotent Producer forced max.in.flight.requests.per.connection to 1 (for the same reasons you mentioned), but in the latest releases it can now be used with max.in.flight.requests.per.connection set to up to 5 and still keep its guarantees.
Using the Idempotent Producer you'll not only get stronger delivery semantics (Exactly Once instead of At Least Once), but it might even perform better!
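For illustration, here's a minimal sketch of enabling it with the Java client; the broker address, topic name and serializers below are placeholders, not taken from your setup:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
// Idempotence implies acks=all and retries enabled; ordering is still
// preserved with up to 5 in-flight requests per connection.
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
producer.send(new ProducerRecord<>("my-topic", "key", "value"));        // placeholder topic
producer.close();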
I recommend you check the delivery semantics in the docs: http://kafka.apache.org/documentation/#semantics
Back to your question
Yes: without the idempotent (or transactional) producer, if you want to avoid data loss (QoS 1) and preserve ordering, you have to set max.in.flight.requests.per.connection to 1, allow retries, and use acks=all. As you saw, this comes at a significant performance cost.
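Concretely, reusing the props object from the snippet above, that conservative configuration would look like this (a sketch; the retries value is just one common choice):

// Strict ordering without idempotence: safe, but at a throughput cost.
props.put(ProducerConfig.ACKS_CONFIG, "all");
props.put(ProducerConfig.RETRIES_CONFIG, "2147483647");
props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "1");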

Yes, you must set the max.in.flight.requests.per.connection property to 1.
In the article you read, there was an initial mistake (since corrected) where the author wrote:
max.in.flights.requests.per.session
which doesn't exist in the Kafka documentation.
This erratum probably comes from the book "Kafka: The Definitive Guide" (1st edition), where on page 52 you can read:
"...so if guaranteeing order is critical, we recommend setting in.flight.requests.per.session=1 to make sure that while a batch of messages is retrying, additional messages will not be sent..."

IMO, it is invaluable to also know about the following behaviour, which makes things far more interesting and slightly more complicated.
When you enable enable.idempotence=true, every time you send a batch of messages to the broker, you also send a sequence number, starting from zero. The broker stores that sequence number on its side too. Say the broker currently has sequence_id=3 stored; when the next request arrives, it can look at the incoming sequence number and say:
if it's 4: good, it's a new batch of records
if it's 3: it's a duplicate
if it's 5 (or higher): it means messages were lost
And now max.in.flight.requests.per.connection: a producer can have up to this many concurrent requests outstanding without actually waiting for an answer from the broker. When we reach 3 (say max.in.flight.requests.per.connection=3), we start to ask the broker for the previous results (and meanwhile we can't send any new batches, even if they are ready).
Now, for the sake of the example, let's say the broker replies: "1 was OK, I stored it", "2 has failed", and now the important part: because 2 failed, the only possible thing you can get for 3 is "out of order", which means it was not stored. The client now knows that it needs to reprocess 2 and 3, so it builds a list and resends them, in that exact order, if retries are enabled.
This explanation is probably oversimplified, but this is my basic understanding after reading the source code a bit.
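To make the bookkeeping concrete, here is a toy model of that broker-side check. It is purely illustrative, not Kafka's actual code:

// Simplified model of the broker's per-producer sequence check described above.
enum Outcome { APPENDED, DUPLICATE, OUT_OF_ORDER }

static Outcome checkSequence(int lastStored, int incoming) {
    if (incoming == lastStored + 1) return Outcome.APPENDED;   // new batch, store it
    if (incoming <= lastStored)     return Outcome.DUPLICATE;  // already stored, don't append again
    return Outcome.OUT_OF_ORDER;                               // gap: an earlier batch failed or was lost
}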

Related

Spring Data JPA - Best Way to Update Concurrently Accessed "Total" Field

(Using Spring Boot 2.3.3 w/ MySQL 8.0.)
Let's say I have an Account entity that contains a total field, and one of those account entities represents some kind of master account. I.e. that master account has its total field updated by almost every transaction, and it's important that any updates to that total field are done on the most recent value.
Which is the better choice within such a transaction:
Using a PESSIMISTIC_WRITE lock, fetch the master account, increment the total field, and commit the transaction. Or,
Have a dedicated query that essentially does something like UPDATE Account SET total = total + x as part of the transaction? I'm assuming I'd still need the same pessimistic lock in this case for the UPDATE query, e.g. via @Query and @Lock.
Also, is it an anti-pattern to retry a failed transaction a set number of times due to a lock-acquisition timeout (or other lock-based exception)? Or is it better to let it fail, report it to the client, and let the client try to call the transaction/service again?
Apologies for the basic question, but, it's been some time since I've had to worry about doing such a thing in Spring.
Thanks in advance!
After exercising my Google-fu a bit more and digging even deeper, it seems variations of this question have already been asked, at least insofar as the 'locking' portion goes.
That is, while the Spring Data JPA docs mention redeclaring repository methods and adding the @Lock annotation, it seems that this is meant strictly for queries that only read. This is what I'd originally thought, as it wouldn't make much sense to "lock" an UPDATE query unless there were some additional magic happening with the JPQL query.
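For what it's worth, the UPDATE approach shouldn't need @Lock at all, since the single atomic statement lets the database take the row lock itself. A sketch, assuming an Account entity with id and total fields as described in the question:

import java.math.BigDecimal;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Modifying;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;
import org.springframework.transaction.annotation.Transactional;

public interface AccountRepository extends JpaRepository<Account, Long> {

    // One atomic statement; the database locks the row for the duration
    // of the UPDATE, so no explicit @Lock is required.
    @Modifying
    @Transactional
    @Query("UPDATE Account a SET a.total = a.total + :amount WHERE a.id = :id")
    int addToTotal(@Param("id") Long id, @Param("amount") BigDecimal amount);
}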
As for retrying, it does seem to be the way to go, but of course with a number of retries that makes sense for the situation.
Hopefully this helps someone else in the future who has a brain cramp like I did.

How to handle "View count" in redis

Our DB is mostly reads, but we want to add a "View count" and "thumbs up/thumbs down" to our videos.
When we stress tested incrementing views in mysql, our database started deadlocking.
I was thinking about handling this problem by having a Redis DB that holds the view count, and only writing to the DB once the key expires. But I hear the notifications are not consistent, and I don't want to lose the view data.
Is there a better way of going about this? Or is the talk of Redis notifications being inconsistent not true?
Thanks,
Sammy
Redis' keyspace notifications are consistent, but delivery isn't guaranteed.
If you don't want to lose data, implement your own background process that manually expires the counters, i.e. copies them to MySQL and deletes them from Redis.
There are several approaches to implementing this lazy eviction pattern. For example, you can use a Redis Hash with two fields: a value field that you can HINCRBY and a timestamp field for expiry logic purposes. Your background process can then SCAN the keyspace to identify outdated keys.
Another way is to use Sorted Sets to manage the counters. In some cases you can use just one Sorted Set, encoding both TTL and count into each member's score (using the float's integer and fractional parts, respectively), but in most cases it is simpler to use two Sorted Sets: one for TTLs and the other for values.
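Here's a minimal sketch of the Hash variant using the Jedis client; the key scheme, field names and flush threshold are my own assumptions:

import redis.clients.jedis.Jedis;

public class ViewCounter {
    private final Jedis jedis = new Jedis("localhost", 6379);

    // Hot path: bump the counter and record when it was last touched.
    public void incrementView(String videoId) {
        String key = "views:" + videoId;   // assumed key scheme
        jedis.hincrBy(key, "count", 1);
        jedis.hset(key, "updatedAt", String.valueOf(System.currentTimeMillis()));
    }

    // Background process: persist counters older than maxAgeMs, then delete them.
    // A real implementation should make the read-persist-delete step atomic
    // (MULTI or a Lua script) so increments arriving in between aren't lost.
    public void flushIfStale(String key, long maxAgeMs) {
        String ts = jedis.hget(key, "updatedAt");
        if (ts != null && System.currentTimeMillis() - Long.parseLong(ts) > maxAgeMs) {
            long count = Long.parseLong(jedis.hget(key, "count"));
            // writeToMySql(key, count);   // persist before removing (stub)
            jedis.del(key);
        }
    }
}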

Could it make sense to schedule an export of SQL database to NoSQL for graphical data mining?

Would it make sense for me to schedule an export of my SQL database to a graph database (such as Neo4j) in order to generate interactive graphics of relationships, such as shown here?
UPDATE: Or, by extension, should I even be looking to move over to a graph database altogether?
My graphical database would not need to be a live reflection of the relational database - an extract every few days would be more than sufficient.
In my case, I currently have a relational database (MySQL) where I’m recording stock items as they pass between individuals/depots. The concept is as follows:
Items:
STOCKID DISPATCHDATE
0001 2014-01-01
0002 2015-06-03
Individuals:
USERID FIRSTNAME
0001 Tom
0002 Jones
Depots:
DEPOTID ZIPCODE
0001 50421
0002 71028
Owners:
STOCK_ID USER_ID RECEIVED DISPATCHED
0001 0001 2015-05-01 2015-05-10
0001 0002 2015-05-11 2015-05-20
From the NoSQL database I would like to be able to visually see things such as:
The flow of which people an item has passed through (and dates of each relationship)
Which items are at each individual/depot (on a given date)
Which individuals are at which depots (on a given date)
As N.B. says in the comments, if the tool is useful then use it - worst case is you find that the tool isn't useful after all and you stop using it (having wasted some time in setting it up, but such is life).
In general, there are three ways to sync the database:
Two Phase Commit: modify MySql in one transaction, modify Neo4j in another transaction, if either transaction fails then you roll back both transactions; the transactions don't commit until both signal that they can be committed. This provides the highest data integrity but is very expensive.
Loosely synchronized transactions: modify MySql in one transaction, modify Neo4j in another transaction; if one succeeds and the other fails then retry the failed transaction a few times, and if it still fails then decide what to do (e.g. undo the successful transaction, which is complicated by the fact that the transaction has already committed and the values may have been used; or log the error and ask a database administrator to manually sync the databases; or some other option). This offers decent data integrity and is cheaper than two-phase commit, but is more difficult to recover from if something goes horribly wrong.
Batch synchronization: modify MySql, and then after a time interval (five minutes, an hour, whatever's appropriate) you sync the changes with Neo4j based on a row version number or a timestamp (note that it's not much of a problem if you sync a bit too much data since you'll just be overwriting a value with the same value, so err on the side of syncing too much per batch). This solution is easy to program, and is appropriate if Neo4j doesn't need the latest and greatest data.
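As a sketch of the batch option in plain JDBC, using the Owners table from the question; the updated_at audit column and the Neo4j stub are assumptions:

import java.sql.Connection;
import java.sql.Date;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;

public class BatchSync {

    // Pull every Owners row changed since the last sync and push it to Neo4j.
    // Syncing a little too much is harmless: equal values just get overwritten.
    public static void syncSince(Connection mysql, Timestamp lastSync) throws SQLException {
        String sql = "SELECT STOCK_ID, USER_ID, RECEIVED, DISPATCHED "
                   + "FROM Owners WHERE updated_at >= ?";   // assumed audit column
        try (PreparedStatement ps = mysql.prepareStatement(sql)) {
            ps.setTimestamp(1, lastSync);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    upsertIntoNeo4j(rs.getString("STOCK_ID"), rs.getString("USER_ID"),
                                    rs.getDate("RECEIVED"), rs.getDate("DISPATCHED"));
                }
            }
        }
    }

    private static void upsertIntoNeo4j(String stockId, String userId,
                                        Date received, Date dispatched) {
        // Stub: e.g. run a Cypher MERGE through the Neo4j Java driver.
    }
}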
I worked on a similar project where we were syncing MySql with a key-value NoSQL database (caching expensive queries), using loosely synchronized transactions. We wrote a customized Transaction wrapper that contained a concurrent queue of side effects (i.e. changes to be made to the key-value database). If the MySql transaction succeeded, we committed all of the side effects in the queue to the key-value database, with three retries in the case of transient network failure; after that we logged the error, invalidated the key-value database entry (which would result in a fallback to MySql), and notified a database admin. This happened one time, when the key-value database crashed for an extended period, and was solved by running a batch synchronization. If the MySql transaction failed, we discarded the queued side effects.
I think before starting with the migration there are some questions worth asking yourself:
Can I do the graphical representation without migrating/adding a new data source (using MySQL)?
What degree of efficiency do I want when using such a graphical interface?
How easy would it be, if needed, to add a new data source?
What you see in that video is done by a visual component working on data from either databases or flat files, so I'd say the answer to the first question is likely to be yes.
Depending on how many people, and what kind of users (internal or external, analysts or not, etc.), are going to use such a graphical representation, this can be another driver for the decision.
About the third question, without duplicating the other answer, I think @Zim-Zam O'Pootertoot has already covered it. As usual, with many data sources the problem is always keeping things in sync, plus the entity-resolution problem (which you minimise by using the same dataset).
In my experience, what Neo4j is very good at is "pattern" querying: given a specific network pattern (drawn with the Cypher language), it will find it in the network dataset.
When it comes to neighbour querying, a SQL solution can also achieve the same result in small projects without too many problems. Of course, if your solution has to scale to hundreds of analysts and hundreds of thousands of queries per day, consider moving.
Anyway, given your dataset, it looks to me that you are working with time-based data. In this kind of scenario it could be worth having a look at the dynamic behaviour of your network to find temporal patterns as well, beyond simple structural ones.
From the same author as the video you posted, also have a look at this other graphical representation.
In case you want to model a time-based graph, just note that there isn't a bulletproof solution with any data source yet.
Here's a Neo4j tutorial on how to model and represent the data in the case of a time-based dataset.
I bet you can do similar things with MySQL too (probably with less efficiency and elegance in querying), but I haven't done it myself yet, so I can't give numbers. Maybe someone else has and can add some benchmarks here.
Disclaimer: I work in the KeyLines team.

Can I achieve ordered processing with multiple consumers in Kafka?

In Kafka, I have a producer queuing up work of clients. Each piece of work has a client ID on it. Work of different clients can be processed out of order, but work of one client must be processed in order.
To do this, I intend to have (for example) 20 topics to achieve parallelism. The producer will queue up a client's work into topic[client ID mod 20]. I then intend to have many consumers, each capable of processing work of any client, but I still want the work processed in order. This means that the next piece of work in a topic can't begin to be processed before the previous piece has completed. In case of consumer failure it's OK to process work twice, but it means that the offset of that topic can't progress to the next piece of work.
Note: the number of messages per second is rather small (10s-100s messages).
To sum up:
'At least once' processing of every message (=work)
In-order processing of work within each topic
Multiple consumers for every topic to support consumer failure
Can this be done using Kafka?
Yes, you can do this with Kafka. But you shouldn't do it quite the way you've described. Kafka already supports semantic partitioning within a topic if you provide a key with each message. In this case you'd create a topic with 20 partitions, then make the key for each message the client ID. This guarantees all messages with the same key end up in the same partition, i.e. it will do the partitioning that you were going to do manually.
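A minimal sketch of the keyed send with the Java producer; the topic name and serializers are placeholders:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
// Same key => same partition => per-client ordering is preserved.
String clientId = "client-42";   // example value
producer.send(new ProducerRecord<>("work", clientId, "some work payload"));
producer.close();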
When consuming, use the high level consumer, which automatically balances partitions across available consumers. If you want to absolutely guarantee at least once processing, you should commit the offsets manually and make sure you have fully processed messages you have consumed before committing them. Beware that consumers joining or leaving the group will cause rebalancing of partitions across the instances, and you'll have to make sure you handle that correctly (e.g. if your processing is stateful, you'll have to make sure that state can be moved between consumers upon rebalancing).
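And a sketch of the consumer side with manual offset commits; the group id and topic are placeholders, and process() stands in for your real handler:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "workers");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");   // commit manually

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("work"));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> record : records) {
        process(record);    // fully finish the work first (your handler)
    }
    consumer.commitSync();  // at-least-once: a crash before this line means reprocessing
}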

Is there / would be feasible a service providing random elements from a given SQL table?

ABSTRACT
Talking with some colleagues, we came across the "extract a random row from a big database table" issue. It's a classic one, and we know the naive approach (also on SO) is usually something like:
SELECT * FROM mytable ORDER BY RAND() LIMIT 1
THE PROBLEM
We also know a query like that is utterly inefficient and actually usable only with very few rows. There are some approaches that could be taken to attain better efficiency, like these ones, still on SO, but they won't work with arbitrary primary keys, and the randomness will be skewed as soon as you have holes in your numeric primary keys. An answer to the last cited question links to this article, which has a good explanation and some bright solutions involving an additional "equal distribution" table that must be maintained whenever the "master data" table changes. But then again, if you have frequent DELETEs on a big table, you'll probably be hurt by the constant updating of the added table. Also note that many solutions rely on COUNT(*), which is ridiculously fast on MyISAM but only "just fast" on InnoDB (I don't know how it performs on other platforms, but I suspect the InnoDB case is representative of other transactional database systems).
In addition to that, even the best solutions I was able to find are fast but not Ludicrous Speed fast.
THE IDEA
A separate service could be responsible to generate, buffer and distribute random row ids or even entire random rows:
it could choose the best method to extract random row ids depending on how the original PKs are structured. An ordered list of keys could be maintained in RAM by the service (it shouldn't take too many bytes per row in addition to the actual size of the PK; it's probably OK up to 100~1000M rows on standard PCs and up to 1~10 billion rows on a beefy server)
once the keys are in memory you have an implicit "row number" for each key and no holes in it so it's just a matter of choosing a random number and directly fetch the corresponding key
a buffer of random keys ready to be consumed could be maintained to quickly respond to spikes in the incoming requests
consumers of the service will connect and request N random rows from the buffer
rows are returned as simple keys or the service could maintain a (pool of) db connection(s) to fetch entire rows
if the buffer is empty, the request could block or return an EOF-like response
if data is added to the master table the service must be signaled to add the same data to its copy too, flush the buffer of random picks and go on from that
if data is deleted from the master table the service must be signaled to remove that data too from both the "all keys" list and "random picks" buffer
if data is updated in the master table the service must be signaled to update corresponding rows in the key list and in the random picks
WHY WE THINK IT'S COOL
does not touch disks other than the initial load of keys at startup or when signaled to do so
works with any kind of primary key, numerical or not
if you know you're going to update a large batch of data you can just signal it when you're done (i.e. not at every single insert/update/delete on the original data), it's basically like having a fine grained lock that only blocks requests for random rows
really fast on updates of any kind in the original data
offloads some work from the relational db to another, memory only process: helps scalability
responds really fast from its buffers without waiting for any querying, scanning, sorting
could easily be extended to similar use cases beyond the SQL one
WHY WE THINK IT COULD BE A STUPID IDEA
because we had the idea without help from any third party
because nobody (that we've heard of) has ever bothered to do something similar
because it adds complexity in the mix to keep it updated whenever original data changes
AND THE QUESTION IS...
Does anything similar already exist? If not, would it be feasible? If not, why?
The biggest risk with your "cache of eligible primary keys" concept is keeping the cache up to date, when the origin data is changing continually. It could be just as costly to keep the cache in sync as it is to run the random queries against the original data.
How do you expect to signal the cache that a value has been added/deleted/updated? If you do it with triggers, keep in mind that a trigger can fire even if the transaction that spawned it is rolled back. This is a general problem with notifying external systems from triggers.
If you notify the cache from the application after the change has been committed in the database, then you have to worry about other apps that make changes without being fitted with the signaling code. Or ad hoc queries. Or queries from apps or tools for which you can't change the code.
In general, the added complexity is probably not worth it. Most apps can tolerate some compromise and they don't need an absolutely random selection all the time.
For example, the inequality lookup may be acceptable for some needs, even with the known weakness that numbers following gaps are chosen more often.
Or you could pre-select a small number of random values (e.g. 30) and cache them. Let app requests choose from these. Every 60 seconds or so, refresh the cache with another set of randomly chosen values.
Or choose a random value evenly distributed between MIN(id) and MAX(id). Try a lookup by equality, not inequality. If the value corresponds to a gap in the primary key, just loop and try again with a different random value. You can terminate the loop if it's not successful after a few tries. Then try another method instead. On average, the improved simplicity and speed of an equality lookup may make up for the occasional retries.
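A sketch of that retry loop in JDBC; the table and column names follow the naive example above, and the retry limit is arbitrary:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.concurrent.ThreadLocalRandom;

// Returns a random existing id from mytable, or -1 after too many gap misses.
static long randomId(Connection conn, long minId, long maxId) throws SQLException {
    String sql = "SELECT id FROM mytable WHERE id = ?";
    for (int attempt = 0; attempt < 10; attempt++) {
        long candidate = ThreadLocalRandom.current().nextLong(minId, maxId + 1);
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, candidate);
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) return rs.getLong("id");   // hit: the id exists
            }
        }
        // miss: candidate fell into a gap in the primary key, try again
    }
    return -1;   // caller should fall back to another method
}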
It appears you are basically addressing a performance issue here. Most DB performance experts recommend having as much RAM as your DB size; then disk is no longer a bottleneck, since your DB lives in RAM and flushes to disk as required.
You're basically proposing a custom-developed, in-RAM CDC/hashing system.
You could just build this as a standard database-only application and lock your mapping table in RAM, if your DB supports this.
I guess I am saying that you can address performance issues without developing custom applications, just use already existing performance tuning methods.