What are examples of real-world scenarios where a message queuing system can accept the loss of some messages? - message-queue

I was reading this blog post, in which the author proposes the following question, in the context of message queues:
does it matter if a message is lost? If you application node, processing the request, dies, can you recover? You’ll be surprised how often it doesn’t actually matter, and you can function properly without guaranteeing all messages are processed
At first I thought that the main point of handling messages was to never loose a single message - after all, a message lost could mean a hotel reservation not booked, a checkout not completed, or any other functionality not carried through, which seems too similar to a bug for me. I suppose I am missing something, so, what are examples of scenarios where it is OK for a messaging system to loose a few messages?

Well, your initial expectation:
the main point of handling messageswas to never loose a single message
was just not a correct one.
Right, if one strives for a one certain type of robustness, where fail-safe measures have to take all due care and precautions, so as not a single message could get lost, yes, there your a priori expressed expectation fits.
This does not mean that all other system designs have to carry all the immense burdens and have to pay all that incurred costs ( resources-wise, latency-wise et al ), as the "100+% guaranteed delivery" systems do ( but, again, only if they can ).
Anti-pattern cases:
There are many use-cases, where an absolute certainty of delivery of each and every message originally sent is actually an anti-pattern.
Just imagine a weakly synchronised system ( including ones, that have nothing like backthrottling or even any simplest form of feedback propagation at all ), where the sensors read an actual temperature, a sound, a video-frame and send a message with that value(s).
Whenever a postprocessing system gets such information delivered, there may be a reason not to read any and all "old" values, but the most recent one(s).
If a delivery framework already got any newer set of values, all the "older" values, not processed yet, just hanging at some depth from the queue-head, yet in the queue, might create the anti-pattern, where one would not like to have to read and process any and all of those "older" values, but just the most recent one(s).
Like no one will make a trade with you based on yesterday prices, there is not positive value to make any new, current, decision based on reading any and all "old" temperature readings, that still wait in the queue.
Some smart-messaging frameworks provide explicit means for taking just the very "newest" message from a given source - thus enabling to imperatively discard any "older" messages, avoiding them from being read and processed right due to a known presence of a one "most" recent.
This answers the original question about the assumed main point of handling messages.
Efficiency first:
In any case, where a smart-delivery takes place ( either deliver an exact copy of the original message content or noting-at-all ), the resources are used at their best efforts, yet, without spending a single penny on anything but the "just-enough" smart-delivery.
Building robustness costs more than that.
Building an ultimate robustness, costs even way more than that.
Systems than do have such an extreme requirement can and may extend the resources-efficient smart-delivery so as to reach some requirements defined level of robustness, at some add-on costs.
The same but reversed is not possible -- if an "everything-proof" system is to get a slimmer form and fashion, so as to fit onto any restricted-resources hardware or to make it "forget" some "old" messages, that are of no positive value at this very moment ( but on the contrary, constitute a must for the processing element to read and process each and every "unwanted" message, just due to the fact it was delivered, while knowing a core-logic needs just the most recent one ).
Distributed systems accrue E2E-latency from many distributed sources, so any rigid-delivery system just block and penalise the only one element, who is ( latency-wise ) innocent -- the receiver.

I suppose it's OK to loose few messages from some measurement units that deliver the value once in.... Also for big data analytics solutions few lost messages won't make a big difference

It all depends on the application/larger system. The message queue is only one link in the chain, so to speak. If the application(s) at the ends are prepared to deal with loss, losing some messages is not a problem. If the application(s) rely on total messaging integrity then there will be problems.
An example of a system that will be ok with loss is weather updates for your phone. If a few temperature/wind updates don't make it to you there's no real harm in that.
Now, if you're running a nuclear reactor and you lose a few temperature updates on the core, well that is a problem.
I work a lot on safety critical, infrastructure-level systems, and am responsible for messaging much of the time. Many of those systems state clearly that messaging may reorder, duplicate, or lose messages; it's just a fact of life where distributed systems and networks are involved. The endpoint systems need to be designed to work correctly in that environment. So they track messages, ack end to end, deal with duplicates and retransmits, etc.

Related

State definition in Reinforcement learning

When defining state for a specific problem in reinforcement learning, How to decide what to include and what to leave for the definition, and also how to set difference between an observation and a state.
For example assuming that the agent is in the context of human resource and planning where it needs to hire some workers based on the demand of jobs, considering the cost of hiring them (assuming the budget is limited) is a state in the format of (# workers, cost) a good definition of state?
In total I don't know what information is needed to be in state and what should be left as it's rather observation.
Thank you
I am assuming you are formulating this as an RL problem because the demand is an unknown quantity. And, maybe [this is optional criteria] the Cost of hiring them may take into account a worker's contribution towards the job which is unknown initially. If however, both these quantities are known or can be approximated beforehand then you can just run a Planning algorithm to solve the problem [or just some sort of Optimization].
Having said this, the state in this problem could be something as simple as (#workers). Note I'm not including the cost, because cost must be experienced by the agent, and therefore is unknown to the agent until it reaches a specific state. Depending on the problem, you might need to add another factor of "time", or the "job-remaining".
Most of the theoretical results on RL hinge on a key assumption in several setups that the environment is Markovian. There are several works where you can get by without this assumption, but if you can formulate your environment in a way that exhibits this property, then you would have much more tools to work with. The key idea being, the agent can decide which action to take (in your case, an action could be : Hire 1 more person. Other actions could be Fire a person) based on the current state, say (#workers = 5, time=6). Note that we are not distinguishing between workers yet, so firing "a" person, instead of firing "a specific" person x. If the workers have differing capabilities, you may need to add several other factors each representing which worker is currently hired, and which are currently in the pool, yet to be hired so like a boolean array of a fixed length. (I hope you get the idea of how to form a state representation, and this can vary based on the specifics of the problem, which are missing in your question).
Now, once we have the State definition S, the action definition A (hire / fire), we have the "known" quantities for an MDP-setup in an RL framework. We also need an environment that can supply us with the cost function when we query it (Reward Function / Cost Function), and tell us the outcome of taking a certain action on a certain state (Transition). Note that we don't necessarily need to know these Reward / Transition function beforehand, but we should have a means of getting these values when we query for a specific (state, action).
Coming to your final part, the difference between observation and state. There are much better resources to dig deep into it, but in a crude sense, observation is an agent's (any agent, AI, human etc) sensory data. For example, in your case the agent has the ability to count number of workers currently employed (but it does not have an ability to distinguish between workers).
A state, more formally, a true MDP state must be something that is Markovian and captures the environment at its fundamental level. So, maybe in order to determine the true cost to the company, the agent needs to be able to differentiate between workers, working hours of each worker, jobs they are working at, interactions between workers and so on. Note that, much of these factors may not be relevant to your task, for example a worker's gender. Typically one would like to form a good hypothesis on which factors are relevant beforehand.
Now, even though we can agree that a worker's assignment (to a specific job) maybe a relevant feature which making a decision to hire or fire them, your observation does not have this information. So you have two options, either you can ignore the fact that this information is important and work with what you have available, or you try to infer these features. If your observation is incomplete for the decision making in your formulation we typically classify them as Partially Observable Environments (and use POMDP frameworks for it).
I hope I clarified a few points, however, there is huge theory behind all of this and the question you asked about "coming up with a state definition" is a matter of research. (Much like feature engineering & feature selection in Machine Learning).

Dealing with exceptions in an event driven world

I'm trying to understand how exceptions are handled in an event driven world using micro-services (using apache kafka). For example, if you take the following order scenario whereby the following actions need to happen before the order can be completed.
1) Authorise the payment with the payment service provider
2) Reserve the item from stock
3.1) Capture the payment with the payment service provider
3.2) Order the item
4) Send a email notification accepting the order with a receipt
At any stage in this scenario, there could be a failure such as:
The item is no longer in stock
The payment information was incorrect
The account the payee is using doesn't have the funds available
External calls such as those to the payment service provider fail, such as downtime
How do you track that each stage has been called for and/or completed?
How do you deal with issues that arise? How would you notify the frontend of the failure?
Some of the things you describe are not errors or exceptions, but alternative flows that you should consider in your distributed architecture.
For example, that an item is out of stock is a perfectly valid alternative flow in your business process. One that possibly requires human intervention. You could move the message to a separate queue and provide some UI where a human operator can deal with the problem, solve it and cause the flow of events to continue.
A similar thing could be said of the payment problems you describe. If an order cannot successfully be settled, a human operator will need to investigate the case and solve it. For that matter, your design must contemplate that alternative flow as part of it, and make it so a human can intervene somehow when the messages end up in a queue that requires a person to review them.
Those cases should be differentiated from errors or exceptions being thrown by the program. Those cases, depending on the circumstance, might in fact require to move the message to a dead letter queue (DLQ) for an engineer to take a look at them.
This is a very broad topic and entire books could written about this.
I believe you could probably benefit from gaining more understanding of concepts like:
Compensating Transactions Pattern
Try/Cancel/Confirm Pattern
Long Running Transactions
Sagas
The idea behind compensating transactions is that every ying has its yang: if you have one transaction that can place an order, then you could undo that with a transaction that cancels that order. This latter transaction is a compensating transaction. So, if you carry out a number of successful transactions and then one of them fails, you can trace back your steps and compensate every successful transaction you did and, as a result, revert their side effects.
I particularly liked a chapter in the book REST from Research to Practice. Its chapter 23 (Towards Distributed Atomic Transactions over RESTful Services) goes deep in explaining the Try/Cancel/Confirm pattern.
In general terms it implies that when you do a group of transactions, their side effects are not effective until a transaction coordinator gets a confirmation that they all were successful. For example, if you make a reservation in Expedia and your flight has two legs with different airlines, then one transaction would reserve a flight with American Airlines and another one would reserve a flight with United Airlines. If your second reservation fails, then you want to compensate the first one. But not only that, you want to avoid that the first reservation is effective until you have been able to confirm both. So, initial transaction makes the reservation but keeps its side effects pending to confirm. And the second reservation would do the same. Once the transaction coordinator knows everything is reserved, it can send a confirmation message to all parties such that they confirm their reservations. If reservations are not confirmed within a sensible time window, they are automatically reversed by the affected system.
The book Enterprise Integration Patterns has some basic ideas on how to implement this kind of event coordination (e.g. see process manager pattern and compare with routing slip pattern which are similar ideas to orchestration vs choreography in the Microservices world).
As you can see, being able to compensate transactions might be complicated depending on how complex is your distributed workflow. The process manager may need to keep track of the state of every step and know when the whole thing needs to be undone. This is pretty much that idea of Sagas in the Microservices world.
The book Microservices Patterns has an entire chapter called Managing Transactions with Sagas that delves in detail on how to implement this type of solution.
A few other aspects I also typically consider are the following:
Idempotency
I believe that a key to a successful implementation of your service transactions in a distributed system consists in making them idempotent. Once you can guarantee a given service is idempotent, then you can safely retry it without worrying about causing additional side effects. However, just retrying a failed transaction won't solve your problems.
Transient vs Persistent Errors
When it comes to retrying a service transaction, you shouldn't just retry because it failed. You must first know why it failed and depending on the error it might make sense to retry or not. Some types of errors are transient, for example, if one transaction fails due to a query timeout, that's probably fine to retry and most likely it will succeed the second time; but if you get a database constraint violation error (e.g. because a DBA added a check constraint to a field), then there is no point in retrying that transaction: no matter how many times you try it will fail.
Embrace Error as an Alternative Flow
As mentioned at the beginning of my answer, not everything is an error. Some things are just alternative flows.
In those cases of interservice communication (computer-to-computer interactions) , when a given step of your workflow fails, you don't necessarily need to undo everything you did in previous steps. You can just embrace error as part of you workflow. Catalog the possible causes of error and make them an alternative flow of events that simply requires human intervention. It is just another step in the full orchestration that requires a person to intervene to make a decision, resolve an inconsistency with the data or just approve which way to go.
For example, maybe when you're processing an order, the payment service fails because you don't have enough funds. So, there is no point in undoing everything else. All we need is to put the order in a state that some problem solver can address it in the system and, once fixed, you can continue with the rest of the workflow.
Transaction and Data Model State are Key
I have discovered that this type of transactional workflows require a good design of the different states your model has to go through. As in the case of Try/Cancel/Confirm pattern, this implies initially applying the side effects without necessarily making the data model available to the users.
For example, when you place an order, maybe you add it to the database in a "Pending" status that will not appear in the UI of the warehouse systems. Once payments have been confirmed the order will then appear in the UI such that a user can finally process its shipments.
The difficulty here is discovering how to design transaction granularity in way that even if one step of your transaction workflow fails, the system remains in a valid state from which you can resume once the cause of the failure is corrected.
Designing for Distributed Transactional Workflows
So, as you can see, designing a distributed system that works in this way is a bit more complicated than individually invoking distributed transactional services. Now every service invocation may fail for a number of reasons and leave your distributed workflow in a inconsistent state. And retrying the transaction may not always solve the problem. And your data needs to be modeled like a state machine, such that side effects are applied but not confirmed until the entire orchestration is successful.
That‘s why the whole thing may need to be designed in a different way than you would typically do in a monolithic client–server application. Your users may now be part of the designed solution when it comes to solving conflicts, and contemplate that transactional orchestrations could potentially take hours or even days to complete depending on how their conflicts are resolved.
As I was originally saying, the topic is way too broad and it would require a more specific question to discuss, perhaps, just one or two of these aspects in detail.
At any rate, I hope this somehow helped you with your investigation.

Any good threads related job-interview question?

When interviewing graduates I usually ask them questions about data structures, algorithms and complexity theory. I would really like to ask a question that will enable them to show their familiarity with multi-threaded concepts, without dwelling into language specific issues.
Any good questions? The only question I could think of is how to write a Singleton that supports multi-threaded access.
I find the classic "write me a consumer-producer queue" question to be quite good. You can talk about synchronization in a handwavy way beforehand for five minutes or so (e.g. start with "What does Object.wait() do? What other methods on Object is it related to? Can you give me an example of when you might use these? What other concurrency techniques might you use in practice [because really, it's quite rare that actually using the wait/notify primitives is the best approach]?"). Make sure the candidate addresses (or at least makes clear he is aware of) both atomicity ("missed updates") and volatility (visibility of the new value on other threads)
Then after you've had a chat about the theory of these, get them to spend a few minutes actually writing the code for a primitive producer-consumer queue. This should be straightforward to anyone who actually understands what they were talking about above, yet it will weed out those who can "talk the talk" but don't actually understand it in practice (arguably the most dangerous group).
What I like about these mini-coding exercises, is that they're often easy to extend. For instance, if the candidate completes the task easily, you can ask how they would extend it for situation XXX - invent requirements that you know will push the limits of the noddy solution you asked for. This not only lets you tailor the depth of questions you're asking but gives some insight into how well the candidate handles clarification of requirements, and modifications of existing design (which is pretty important in this industry).
Here you can find some topics to discuss:
threads implementation ( kernel vs user space)
thread local storage
synchronization primitives
deadlocks, livelocks
Differences between mutex and
semaphore.
Use of condition variables.
When not to use threads. (eg. IO multiplexing)
Talk with them about a popular, but not well-known topic, where thread handling is essential.
I recommend you, build a web server with them, of course, only on paper or just in words. The result should look something like this: there is a main thread, it's listening on a socket. When something arrives, it passes the socket into the pool, then this thread returns back to socket listening. The pool has fixed number of slots. The request processing threads are dedicated to get job from the pool. Find out, what's better, if the threads are checking the pool concurrently, or the listner main thread selects a free slot/thread for the new incoming request. Try to write a small pseudocode, or a graph for both side of the pool handling.
Let's introduce a small application: page counter, which tells that how many page request has been made since server startup. Don't tell them that the counter must be protected against concurrent modification, let them to find it out how to do this with mutexes or synchronization or whatsoever. Maybe you could skip the web server part, the page counter app is easier to specify.
Another example is a chat, with 2+ clients and a server, find out, how to solve the problem, that all the messages should arrive in the same order for all clients. Or reflex game: the server waits for 1..5 secs random, then says "peek-a-boo", and the player wins who presses space key first. Specify it with 2 player, then try to expand it to N players.
Also, be aware of NPPs. NPP stands: "non-programming programmer". There are dudes, who can talk about programming issues, they know all the 3/4-letter abbrevations (there're lot in the Java world, EJB, JSP, XSLT, and my favourite: POJO, which means Pure Old Java Objects, lol), they understand and modify codes, or make similar programs from a base, but they fail even with small problems, it it has to do it theirselves, e.g. finding the nearest element to a base in an array. Sometimes it takes months, until it turns out. They performs well at interviews, because they prepare for it. Maybe they don't even known, that they're NPPs, this is a known effect: http://en.wikipedia.org/wiki/Dunning-Kruger_effect
It's hard to recognize the opposite dudes, who have not heard about trendy libraries or patterns, but they can learn it even at the job interview. (Personal remark: my last interview was in 1999, and it seems that I will not do interview anymore. I have never heard of dynamic web pages before, but I've figured out the term "session" during the interview, the question was that how to build a simple hanging man web app. I was hired.)

What's the difference between polling and pulling?

What's the difference between polling and pulling (if any)?
They're two distinct words. To "poll" is to ask for an answer. To "pull" is to use force to move (actually or conceptually) something towards oneself (again, actually or conceptually).
One "polls" a server when software on a client periodically asks the server for something. One "pulls" data from a database towards client software.
Note that both words have various distinct uses even within the world of computing, but I can't think of any case where they're interchangeable in such a way as to leave meaning unchanged. Low-level device driver code may "poll" an interface to check whether it's ready for some operation, and there's no network traffic involved. In electronics, one "pulls" a signal up or down.
Clients may both "poll" a server and "pull" data from a server, but note that when I use each verb I use different direct objects. It only makes sense to say "pull the server" when you're dragging it across the computer room floor.
Poll is like when Gallup does a poll of the American people. They are querying for specific information by asking a question.
Pull is like what you do to a rope. You want the rope (or a file, or some data) to be in your location, so you pull it towards you.
There is a possible slight difference.
Polling is attempting to request information at set intervals.
Pulling just refers to the fact that you are requesting data from somebody else rather than having them send it to you.
That being said, I've heard them used interchangeably.
With respect to network communications, they both refer to the same scheme, where you are periodically requesting data from an external source. See Pull Technology.
Of course the opposite is Pushing, where data is sent as it becomes available.
A poll is quick request while a pull is a slow demand.
One may poll asking if information is immediately available which can be pulled. The distinction is not that the answer to a poll must be boolean, but that the answer to a poll is quick and readily available or the answer will be denied. A poll implies that a choice is being offered which is contrary to a pull, where no choice is offered. A pull may cause the caller to wait for the information to become available or may offer other means of returning the detailed information to the caller later when it actually becomes available.

How to prioritize bugs?

In my current company there isn't clear understanding between the test and development teams as to how severe a bug should be? There are arguments which go back and forth to reduce or to increase the severity. We are not as of now aware of any documents which lays the rules. The tester raises the bug and assigns priority based on his intuition. The developer would request a change based on his load or some other factor.
How are severity/priority of bugs classified? Are there any standards which guide how software defect priorities needs to be determined based on customer needs, time lines and other things?
Use priority levels that deliberately have nothing to do with severity or impact, and describe only the conceptual position of the bug in the schedule. This field will determine which bugs get worked on, so it will be a target for negotiation.
Use severity levels that deliberately have concrete, verifiable definitions not open to negotiation, that have nothing to do with scheduling or priority. I've worked successfully with the severity definitions used by the Debian BTS, generalised to apply to programming projects in general.
That way, the severity is much more a matter of verifiable fact, independent of a statement of priority. The priority is then free to be tweaked up and down by negotiation or whatever, without affecting the factual information in the severity field.
Attempting to conflate both “severity” and “priority” into a single field will lead to soul-draining arguments and wasted time. The bug reporter needs a firm guide of fact to determine how “bad” the bug is, and this needs to be easily agreed on by independent parties. The priority, on the other hand, is the correct target for negotiation and scheduling games.
I work on emergency control centre systems, so this set of bug levels is a little, well... extreme:
someone dies
total system failure requiring DR invocation
server failure requiring engineer response
failure involving loss of call continuity
failure involving loss of data
incorrect data recorded
application failure - non-recoverable
application failure - non-recoverable, but automatically restarted
does not meet requirement spec, no workaround
does not meet requirement spec, but has workaround
cosmetic - layout etc.
actually a feature request
That's off the top of my head. In case you were wondering, it's from most extreme to least :-)
Some stuff we used before. We split the defect rating into priority and severity.
Severity (set by submitter during submission of defect)
Highest (5): Data loss, hardware damage possible, or a security-related failure
High (4): Loss of functionality without any reasonable workaround
Medium (3): Loss of functionality with a reasonable workaround
Low (2): Partial loss of a function or a feature set (feature still hits the design requirements)
Lowest (1): A cosmetic error
Priority (adjusted by development, management and QA during defect evaluation)
Highest (5): The system is practically unusable with this defect.
High (4): The defect will have a serious impact on the company’s ability to sell and maintain this system.
Medium (3): The company will lose some money if this defect is in the system, but it might be more important to meet the schedule. Fix after release.
Low (2): Do not delay the release, but do fix this problem afterwards.
Lowest (1): Fix as time and resources allow.
Both numbers together create a risk priority number (RPN). Simply multiply severity with priority. Higher result means higher risk. 25 defines the ultimate defect bomb. 1 can be done during idle time or if someone is bored and needs something to do.
First goal: Defects with a rating of highest or high of any kind should be fixed before release.
Second goal: Defects with RPN > 8 should be fixed before releasing the product.
This is of course a little bit artificial but helps to give all parties (Support, QA/Test, Engineering, and Product Managers) a tool to set priorities without blowing away the opinion of other side.
Replace your bug tracking system with fogbugz and get rid of severity field altogether.
See Priority vs Severity
"Been there done that".
I've had this discussion over and over again, on different projects. We've tried to combine priority with severity, but the lesson I've learned: do not combine severity with priority !
We've had a lot of brainstorms and meetings which ended with the words "this is it". Multiple guideline-documents have been created and spread between the different "parties", but after a while we discovered that it didn't work at the end. Different "parties" think different about bugs: our helpdesk has another understanding of priority than the development team or the sales has.
Having both a severity and a priority level will very quickly become very confusion because:
when using numbers (between 1 to 5) one will not know what each number means
what if an issue has the highest possible priority, but the lowest possible severity - and I'm sure that this will happen!
what if someone reduces a severity, does he need to reduce the priority also?
"So what should you do then?":
Only use one kind of indicator for the 'level' of an issue: Doesn't matter how you call it.
Use numbers (eg 1 - 5, but could be more or less depending on your needs) to clearly indicate the importance but combine it with a keyword so that it's clear what it means (eg. 'nice to have', 'show stopper'). For some people prio 1 means the most import, for others 5 does -> therefore a keyword to indicate what a number means is necessary.
Make a distinction between a 'normal issue' or a 'red alert'. In our case a 'Red Alert' must be solved immediately and put in production immediately. A normal issue will follow the normal development-test-deployment-flow. The priority/severity/however-how-you-call-it should only be set for normal issues and will be ignored for 'red alerts'.
*> In practice, a 'Red Alert' can become a
'Normal Issue': the support team
discovered a major bug and created a
'Red Alert'. But after some
investigation we discovered that data
had become 'corrupt' in the database
since it was inserted there directly
and not via the application.*
Choose a good tool that allows you to customize the flow; but most tools do.
As for a standard, IEEE guide to classification for software anomalies although I am not sure how widely this is adopted. IEEE 1044.1-1995
One option is to have the product owner determine the priority of the bug. While there is some general intuition on how "bad" a bug is, it can be the responsibility of the owner of the product to set an order of precidence (i.e. bug A should be fixed before bug B etc...).
The more information (clear and concise) that can be provided to the product owner can assist that individual make those determinations (i.e. how many users have experienced the bug, what features are not available as a result of the bug, etc...)
Must be done now
Must be done before we ship
Minor annoyance (Doesn't prevent the user from exercising the functionality)
Edge case/Remote/Tester-from-Mordor scenario
Well I just made that up... my point being categorizing bugs should not be a weekly hour+ long ritual..
IMHO, prioritizing acc to a flowchart is wasted time. Fix bugs in Cat#1 and #2 - as quickly as they surface. If you find yourself swamped by bugs, slow down and reflect. Defer Cat#3 and Cat#4 if the schedule doesn't permit or higher priority items override.
The critical thing is that all of you have a shared understanding of this severity and expected quality. Don't let compliance to the holy standards of X slow you down from delivering what the customer wants... working software.
Personally I favour the two tier severity/priority model. I know the arguments for a single level but the places I've worked generally I've just seen a two level heirarchy work better
Severity is set by the support team (based on input from the client). Priority is set by the client (with input from the support team).
For severity I use:
1 - Blocker/show stopped
2 - Major functionality unavailable (or effectively unavailable), no practical work around possible
3 - Major functionality unavailable (or ...), work around possible
4 - Minor functionality unavailable (or effectively unavailable), no work around possible
5 - Minor functionality unavailable (or ...), work around possible
6 - Cosmetic or other trivial
Then for priority I just use High, Medium, Low but anything from 3 - 5 levels works (much more than that is just over the top).
I'd generally then order by Priority first and then severity within that. The important thing about this is that the client has the most important say. If they say the way their logo is printing out on a report is the highest priority then that's what gets looked at BUT it gets looked at after the other client's high priority which is stopping them logging in.
Generally speaking I wouldn't release with any high priority issues or any medium priority issues with severity 1 - 4. Obviously in an ideal world you'd fix everything but I've never been lucky enough to have that option.
The tester tells what is broken
The developer estimates how much work it will be to fix
The customer decides the business value, i.e. the priority.
Set the requirements of the project so you can base the priority of a fix on the priority of the requirements interfered by the bug.
I had the same issue with one of our customers. In the end we set up a document together describing what kind of bugs would match to a certain severity. Aside from an occasional discussion using this document as a guideline appears to work.
But be well aware that test teams and development teams may have very different opinions on what is a severe bug and what is not. From the point of view of the testers a small layout bug can be high priority when a developer would just say that no one will notice.
In our document those bugs can be high priority if they are "brand damaging", i.e. if the layout bug is in the logo or one of the products then it is severe - if it's just a paragraph on the page that is 2 pixels off then it's not.
I use the following categories both for features and bugs:
Showstopper, the program (or a major feature) will not work
Must have, a significant part of the customers will be bothered by this
Would have, some customers will be bothered
Nice to have, a few customers want this
Normally you plan to fix 1, 2 and 3, but 3 is often postponed to a next release due to time constraints.
I think this is the scale we used at a previous job:
Causes loss of files or system instability.
Crashes the program.
Feature doesn't work.
Feature doesn't work, but there are workarounds.
Cosmetic issue.
Request for enhancement.
Sometimes this was abused - if a feature was so poorly designed that someone couldn't figure out how to use it, that was classified as a 6, and it never got fixed.
I agree with the FogBugz folks that this should be kept super simple: http://fogbugz.stackexchange.com/questions/352/priority-vs-severity
I made up this scheme, which I find easy to remember:
pS: seconds matter, eg, server is on fire
pM: minutes matter, eg, something is broken
pH: hours matter, ie, don't go to bed till this is done
pd: days matter, ie, normal priority
pw: weeks matter, ie, lower priority
pm: months matter, ie, no hurry
py: years matter, ie, maybe/someday, ie, wishlist
It roughly parallels Debian's scheme: http://www.debian.org/Bugs/Developer#severities
I like it because it straightforwardly combines priority and severity into a single field that's easy to set a value for.
PS: You can also pick intermediate urgencies like "pMH" for in between "minutes matter" and "hours matter". Or "pHd" is in between "hours mattter" and "days matter" -- roughly, "don't literally pull an all-nighter for it but don't work on anything else till it's done".