I have an application that produces messages to Pulsar under a specific topic and shuts down when it's finished; no consumer exists to read this topic at that time.
After a while, when I create a consumer and try to read the written data back, I find all the data is gone because the topic I wrote to has been deleted by Pulsar.
How can I disable the auto-deletion of inactive topics in Pulsar?
Generally, there are two ways to achieve this.
Firstly, retention policies keep the data for at least X hours (or up to Y GB); you can set this to infinite at the namespace level via pulsar-admin:
pulsar-admin namespaces set-retention my-tenant/my-ns \
--size 1T \
--time -1
Secondly, setting brokerDeleteInactiveTopicsEnabled=false in conf/broker.conf disables the deletion of inactive topics as well.
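For example (the broker needs a restart to pick up the change):

# conf/broker.conf
brokerDeleteInactiveTopicsEnabled=false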
It's recommended to set the above two settings simultaneously for proper control.
If you create a subscription on the topic using the Consumer client interface or the REST API, messages are kept until they are acknowledged. An unacknowledged message in a subscription backlog will never be removed, unless you configure a time-to-live (TTL) which will automatically acknowledge the message after some time.
Messages that are not in a subscription or have already been acknowledged are retained in the topic based on the retention policy. You can specify messages to be retained by size or time.
If you want to use a Pulsar topic like a queue that holds all messages until they are acknowledged, which sounds like what you are trying to do, you just need to use the Consumer client interface with a named subscription. Then all your messages will be kept in the topic while your application is inactive. And because the topic still has messages, it won't be automatically deleted (though you can disable that behavior as explained in the answer by yjshen).
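For example, a minimal sketch with the Pulsar Python client (the service URL, topic, and subscription name are just placeholders):

import pulsar

# Creating the subscription once is enough; Pulsar then retains unacknowledged
# messages for that subscription even while no consumer is connected.
client = pulsar.Client('pulsar://localhost:6650')
consumer = client.subscribe('persistent://my-tenant/my-ns/my-topic',
                            subscription_name='my-durable-sub')

msg = consumer.receive()
print(msg.data())
consumer.acknowledge(msg)  # only acknowledged messages become eligible for deletion
client.close()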
I am reading this write-up: https://medium.com/#narengowda/system-design-dropbox-or-google-drive-8fd5da0ce55b. In the Synchronization Service part, it says:
The Response Queues that correspond to individual subscribed clients are responsible for delivering the update messages to each client. Since a message will be deleted from the queue once received by a client, we need to create separate Response Queues for each client to be able to share an update message which should be sent to multiple subscribed clients.
The context is that we need a response queue to send the file updates from one client to other clients. I am confused by this statement. If Dropbox has 100 million clients, we need to create 100 million queues, based on the statement. It is unimaginable to me. For example, a Kafka cluster can support up to 5K topics (https://stackoverflow.com/questions/32950503/can-i-have-100s-of-thousands-of-topics-in-a-kafka-cluster#:~:text=The%20rule%20of%20thumb%20is,5K%20topics%20should%20be%20fine.). We need 20K Kafka clusters in this case. Which queuing system can do 100 million "topics"?
I'm not sure, but I expect such notifications are delivered to clients via WebSockets only.
Additionally, as the Medium post states, if a client is not online then messages might have to be persisted in a DB. When the client comes online again, it can request all updates after a certain timestamp, after which WebSockets can be set up to facilitate future communication.
Happy to know your thoughts on this.
P.S.: Most Dropbox system-design blogs/vlogs have just copied from each other without going into low-level detail.
Background
I am attempting to implement token authentication with my web application using JSON Web tokens.
There are two things I am trying to maintain with whatever strategy I end up using: statelessness and security. However, from reading answers on this site and blog posts around the internet, there appear to be some folks who are convinced that these two properties are mutually exclusive.
There are some practical nuances that come into play when trying to maintain statelessness. I can think of the following list:
Invalidating compromised tokens on a per-user basis before their expiration date.
Allowing a user to log out of all of their "sessions" on all machines at once and having it take immediate effect.
Allowing a user to log out of the current "session" on their current machine and having it take immediate effect.
Making permission/role changes on a user record take immediate effect.
Current Strategy
If you utilize an "issued time" claim inside the JWT in conjunction with a "last modified" column in the database table representing user records, then I believe all of the points above can be handled gracefully.
When a web token comes in for authentication, you could query the database for the user record and:
if (token.issued_at < user.last_modified) then token_valid = false;
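For instance, here is a minimal sketch of that check with PyJWT (the claim names and the user lookup are just illustrative):

import datetime
import jwt  # PyJWT

SECRET = "change-me"

def validate(token, load_user):
    # load_user is any callable returning a user record with a UTC last_modified
    claims = jwt.decode(token, SECRET, algorithms=["HS256"])
    user = load_user(claims["sub"])
    issued_at = datetime.datetime.fromtimestamp(claims["iat"], datetime.timezone.utc)
    if issued_at < user.last_modified:
        return None  # token predates the last critical change, so treat it as invalid
    return user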
If you find out someone has compromised a user's account, then the user can change their password and the last_modified column can be updated, thus invalidating any previously issued tokens. This also takes care of the problem with permission/role changes not taking immediate effect.
Additionally, if the user requests an immediate log out of all devices then, you guessed it: update the last_modified column.
The final problem that this leaves is per-device log out. However, I believe this doesn't even require a trip to the server, let alone a trip to the database. Couldn't the sign out action just trigger some client-side event listener to delete the secure cookie holding the JWT?
Problems
First of all, are there any security flaws that you see in the approach above? How about a usability issue that I am missing?
Once that question is resolved, I'm really not fond of having to query the database each time someone makes an API request to a secure end point, but this is the only strategy that I can think of. Does anyone have any better ideas?
You have made a very good analysis of how some common needs break the statelessness of JWT. I can only propose some improvements on your current strategy.
Current strategy
The drawback I see is that a query to the database is always required, and trivial modifications to the user's data could change last_modified and invalidate tokens.
An alternative is to maintain a token blacklist. Usually an ID is assigned to each token, but I think you can use last_modified instead. Since token revocations are probably rare operations, you could keep a light blacklist (even cached in memory) with just userId and last_modified.
You only need to add an entry after updating critical data on the user (password, permissions, etc.), and only while currentTime - maxExpiryTime < last_login_date (i.e. the user may still hold non-expired tokens). The entry can be discarded once currentTime - maxExpiryTime > last_modified (no non-expired tokens issued before the change remain).
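A minimal in-memory sketch of that blacklist (the names are illustrative; in a multi-node deployment it would live in a shared store such as Redis):

import time

MAX_EXPIRY_SECONDS = 3600   # assumed maximum token lifetime
blacklist = {}              # userId -> last_modified (epoch seconds)

def revoke(user_id):
    blacklist[user_id] = time.time()

def is_revoked(user_id, token_issued_at):
    last_modified = blacklist.get(user_id)
    if last_modified is None:
        return False
    if time.time() - MAX_EXPIRY_SECONDS > last_modified:
        del blacklist[user_id]  # every token issued before the change has expired
        return False
    return token_issued_at < last_modified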
Couldn't the sign out action just trigger some client-side event listener to delete the secure cookie holding the JWT?
If you are in the same browser with several open tabs, you can use localStorage events to sync info between tabs and build a logout mechanism (or handle login / user-changed events). If you mean different browsers or devices, then you would need to send some kind of event from the server to the client. That means maintaining an active channel, for example a WebSocket, or sending a push message to a native mobile app.
Are there any security flaws that you see in the above approach?
If you are using a cookie, note that you need additional protection against CSRF attacks. Also, if you do not need to access the cookie from the client side, mark it as HttpOnly.
How about a usability issue that I am missing?
You also need to deal with rotating tokens when they are close to expiring.
I need to architect a database and service. I have resources that I need to deliver to users, and the delivery takes some time or requires the user to do some more work.
These are the tables I store information into.
Table - Description
_______________________
R - to store resources
RESERVE - to reserve requested resources
HACK - to track requests that couldn't have been made with my client application (statistics)
FAIL - to track requests that can't be resolved, but the user isn't at fault (statistics)
SUCCESS - to track successful deliveries (statistics)
The first step, when a user requests a resource:
IF (condition1 is true - the user has the right to request the resource) THEN
    IF (I've successfully RESERVE-d the resource and committed the transaction) THEN
        nothing more to do
    ELSE
        save the request into FAIL
ELSE
    save the request into HACK
Then the second step:
IF (condition2 is true - the user has done their job and requests the reserved resource) THEN
    IF (the resource is delivered successfully) THEN
        save the request into SUCCESS
    ELSE
        save the request into FAIL
        depending on the application logic, move the resource from RESERVE back to R or not
ELSE
    save the request into HACK, contact the user,
    and if this really is a hacker, move the resource from RESERVE back to R
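A rough, simplified sketch of how I imagine the service layer driving these two steps (the in-memory structures stand in for the real tables and stored procedures; all names are made up):

# In-memory stand-ins for the real tables and stored procedures, only to show
# which layer decides what; the real versions are SQL transactions.
R = {"resource-1": 3}          # available resources and their counts
RESERVE = {}                   # (user, resource) -> reserved
SUCCESS, FAIL, HACK = [], [], []

def request_resource(user, resource, user_has_right):
    # step 1: the service layer checks condition1 and picks which "procedure" to call
    if not user_has_right:
        HACK.append((user, resource))
        return False
    try:
        if R.get(resource, 0) <= 0:
            raise RuntimeError("nothing left to reserve")
        R[resource] -= 1                       # "RESERVE-d and committed"
        RESERVE[(user, resource)] = True
        return True
    except Exception:
        FAIL.append((user, resource))
        return False

def deliver_resource(user, resource, user_did_their_job, deliver):
    # step 2: check condition2, then attempt the actual delivery
    if not user_did_their_job or (user, resource) not in RESERVE:
        HACK.append((user, resource))          # ...then contact the user
        return False
    try:
        deliver(user, resource)                # the real delivery work
        RESERVE.pop((user, resource))
        SUCCESS.append((user, resource))       # RESERVE -> SUCCESS
        return True
    except Exception:
        FAIL.append((user, resource))
        RESERVE.pop((user, resource), None)    # move back to R, or not,
        R[resource] = R.get(resource, 0) + 1   # depending on the logic
        return False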
This is how I'm thinking of implementing the system. I've put the transactions into stored procedures, but the main application logic, where I decide which procedure to call, lives in the application/service layer.
Am I on the right track? Is such a division of code between the DB and the service layers normal? Your experienced opinions are very important.
Clarifying and answering RecentCoin's questions:
1. The difference between the HACK and FAIL tables is that I store more information in the HACK table, like the user's IP and XFF header. I'm not going to penalize every user that appears in that table. There can be two reasons a request is tracked as a hack. The first is that I have a bug (mainly in the client app), and this will help me fix it. The second is that someone is making requests manually and trying to bypass the rules; if he tries 'harder', I'll be able to take some precautions.
2. The separation of the reserve and the success tables has these reasons:
2.1. I use the reserve table in some transactions and queries without touching the success table, so I can lock them separately.
2.2. The data stored in success will not slow down my queries while I'm querying the reserve table.
2.3. The success table is kind of a log for statistics, which I can delete or move to another database for later analysis.
2.4. I delete rows from reserve after I move them to the success table, so I can estimate the maximum row count in that table, because I have a maximum limit on reservations per user.
Points 2.3 and 2.4 could also be achieved with a single table.
So are reasons 2.1 and 2.2 good enough to keep the data separate?
3. "Delivered successfully" means that the admin and the service have done everything they could successfully; if they couldn't, then the reservation fails.
4 and 6. The restrictions and rights are simple; they are like city and country restrictions. The users are 'flat' and don't have any roles or hierarchy.
5. I have some tables to store users and their information. I don't have LDAP or AD.
You're going in the right direction, but there are some other things that need to be more clearly thought out.
1. You're going to have to define what constitutes a "hack" vs a "fail". Especially with new systems, users get confused and it's pretty easy for them to make honest mistakes. This seems like something you want to penalize them for in some fashion, so I'd be extremely careful with this.
2. You will want to consider making "reserve" and "success" equivalent. Why store the same record twice? You should have a really compelling reason to do that.
3. You will need to define "delivered successfully", since that could be anything from an entry in a calendar to getting more pens and Post-it notes.
4. You will want to define your resources as well as which user(s) have rights to them. For example, you may have a conference room that only managers are allowed to book, but you might want to include the managers' administrative assistants in that list since they would be booking the room for the manager(s).
5. Do you have a database of users? LDAP or Active Directory, or will you need to create all of that yourself? If you do have LDAP or AD, can you use something like SAML?
6. You are going to want to consider how you want to assign those rights. Will they be group-based, where group membership confers the rights to reserve, request, or use a given thing? For example, you may only want architects printing to the large-format printer.
I'm intending to use AMQP to allow a distributed collection of machines to report to a central location asynchronously. The idea is to drop messages into the queue and allow the central logging entity to process the queue in a decoupled fashion; the 'process' is simply to create or update a row in a database table.
A problem that I'm anticipating is the effect of network jitter in the message queuing process - what happens if an update accidentally gets in front of an insert because the time between the two messages being issued is less than the network jitter?
Reading the AMQP spec, it seems that I could just apply a higher priority to inserts so they skip the queue and get processed first. But presumably this only applies if a queue actually exists at the broker to be skipped. Is there a way to impose a buffer or delay at the broker to absorb this jitter and allow priority to be enacted before the messages are passed on to the consumer(s)?
Or do I have to go down the route of a resequencer as ActiveMQ suggests?
The lack of ordering between multiple publishers has nothing to do with network jitter; it's a completely natural property of distributed applications. Messages from the same publisher will always be ordered. If you really need causal ordering of actions performed by different nodes, then a resequencer or a global sequence-numbering scheme are your only options. Note that you cannot use sender timestamps for this, which is what everyone seems to try first.
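To illustrate the resequencer idea, here is a minimal sketch (not tied to any particular broker; it just assumes every message carries a global sequence number):

class Resequencer:
    """Buffers out-of-order messages and releases them in sequence order."""

    def __init__(self, deliver, first_seq=1):
        self.deliver = deliver        # callback invoked in order
        self.expected = first_seq
        self.pending = {}             # seq -> message, held until its turn

    def on_message(self, seq, message):
        self.pending[seq] = message
        # flush every consecutive message starting from the expected one
        while self.expected in self.pending:
            self.deliver(self.pending.pop(self.expected))
            self.expected += 1

# usage: rs = Resequencer(process_row); rs.on_message(2, upd); rs.on_message(1, ins)
# -> process_row(ins) then process_row(upd), regardless of arrival order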
I've got something like a job queue over RabbitMQ and, upon a request to cancel a job, I'd like to retract the tasks that have not yet started processing (their messages have not been ack'd), which corresponds to retracting these messages from the queues that they've been routed to.
I haven't found this functionality in AMQP or in the RabbitMQ API; perhaps I haven't searched well enough? Or will I have to use a workaround (it's not hard, but still)?
I would solve this scenario by having the worker check some sort of authoritative data source to determine whether the job should proceed. For example, the worker would check the job's status in a database to see if the job was canceled already.
For scenarios where the speed of processing jobs may be faster than the speed with which the authoritative store can be updated and read, a less guaranteed data store that trades speed for other characteristics may be useful.
An example of this would be to use Redis as the store for canceling processing of a message instead of a relational DB like MySQL. Redis is very fast, but makes fewer guarantees regarding the data it holds, whereas MySQL is much slower, but offers more guarantees about the data it holds.
In the end, the concept of checking with another source for whether or not to process a message is the same, but the way you implement that depends on your particular scenario.
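As a rough sketch of that idea with pika and redis-py (the queue name, key layout, and cancellation flag are all assumptions):

import pika
import redis

r = redis.Redis()
conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = conn.channel()
channel.queue_declare(queue="jobs", durable=True)

def process_job(job_id):
    print("processing", job_id)   # stand-in for the real work

def on_message(ch, method, properties, body):
    job_id = body.decode()
    # consult the authoritative store before doing any work
    if r.get(f"job:{job_id}:cancelled"):
        ch.basic_ack(delivery_tag=method.delivery_tag)  # drop the cancelled job
        return
    process_job(job_id)
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="jobs", on_message_callback=on_message)
channel.start_consuming()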
RabbitMQ doesn't let you modify or delete messages after they've been enqueued. For that, you want some kind of database to hold the state of each job, and to use RabbitMQ to notify interested parties of changes in that state.
For lowish volumes, you can kludge it together with a queue per job. Create the queue, post the job description to it, and announce the name of the queue to the workers. If the job needs to be cancelled before it is processed, delete the job's queue; when the workers come to fetch the job description, they'll notice the queue has vanished.
Lighter weight and generally better would be to use Redis or another key/value store to hold the job state (with a deleted or absent record meaning a cancelled or nonexistent job) and to use RabbitMQ to notify about new/removed/changed records in the key/value store.
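A rough pika sketch of that queue-per-job kludge (the queue naming convention is just an assumption):

import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()

def submit_job(job_id, description):
    ch.queue_declare(queue=f"job.{job_id}", durable=True)
    ch.basic_publish(exchange="", routing_key=f"job.{job_id}", body=description)
    # then announce the queue name "job.<id>" to the workers

def cancel_job(job_id):
    ch.queue_delete(queue=f"job.{job_id}")  # workers will find the queue gone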
There are at least two ways to achieve your goal:
basic.reject will requeue the message if requeue=true is set (otherwise it will discard the message).
(supported since RabbitMQ 2.0.0; see http://www.rabbitmq.com/blog/2010/08/03/well-ill-let-you-go-basicreject-in-rabbitmq/).
basic.recover will ask the broker to redeliver unacked messages on the channel.
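For reference, this is roughly how those two calls look with pika (connection details and queue name are placeholders):

import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()
ch.queue_declare(queue="tasks", durable=True)

def on_message(channel, method, properties, body):
    # put the message back on the broker instead of processing it
    channel.basic_reject(delivery_tag=method.delivery_tag, requeue=True)

ch.basic_consume(queue="tasks", on_message_callback=on_message)

# alternatively, ask the broker to redeliver every unacked message on this channel
ch.basic_recover(requeue=True)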
You need to subscribe to all the queues to which messages have been routed, and consume them with ack.
For instance, if you publish to a topic exchange with "test" as the routing key and there are 3 persistent queues which subscribe to "test", you would need to consume those three queues. It might be better to add another queue that your consumer processes also listen to, and tell them to ignore those messages.
An alternative, since you are using RabbitMQ, is to write a custom exchange plugin that will accept some out-of-band instruction to clear all queues. For instance, you might have that exchange read a special message header that tells it to clear all queues to which the message is destined. This does require writing Erlang code, but there are 4 different exchange types implemented, so you would only need to copy the most similar one and write the code for the new behaviours. If you only use custom headers for this, then the body of the message can be a normal message for the consumers.
To sum up:
1) the publisher needs to consume the messages itself
2) the publisher can send a special message in a special queue to tell consumers to ignore the message
3) the publisher can send a special message to a custom exchange that will clear any existing messages from the queues before sending this special message to consumers.