How to prioritize bugs? - language-agnostic

In my current company there isn't a clear understanding between the test and development teams as to how severe a bug should be. There are arguments that go back and forth about whether to reduce or increase the severity. As of now we aren't aware of any document that lays down the rules. The tester raises the bug and assigns a priority based on his intuition. The developer then requests a change based on his workload or some other factor.
How are severity/priority of bugs classified? Are there any standards that guide how software defect priorities need to be determined, based on customer needs, timelines and other factors?

Use priority levels that deliberately have nothing to do with severity or impact, and describe only the conceptual position of the bug in the schedule. This field will determine which bugs get worked on, so it will be a target for negotiation.
Use severity levels that deliberately have concrete, verifiable definitions not open to negotiation, that have nothing to do with scheduling or priority. I've worked successfully with the severity definitions used by the Debian BTS, generalised to apply to programming projects in general.
That way, the severity is much more a matter of verifiable fact, independent of a statement of priority. The priority is then free to be tweaked up and down by negotiation or whatever, without affecting the factual information in the severity field.
Attempting to conflate both “severity” and “priority” into a single field will lead to soul-draining arguments and wasted time. The bug reporter needs a firm guide of fact to determine how “bad” the bug is, and this needs to be easily agreed on by independent parties. The priority, on the other hand, is the correct target for negotiation and scheduling games.
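As a minimal illustration of keeping the two fields independent (the field and level names below are my own, loosely echoing the Debian BTS severities rather than any particular tracker's schema):

```python
from dataclasses import dataclass
from enum import IntEnum

class Severity(IntEnum):
    """Verifiable fact: what the bug does, per the agreed definitions."""
    CRITICAL = 4   # e.g. data loss, security hole
    GRAVE = 3      # e.g. feature unusable, no workaround
    NORMAL = 2     # e.g. feature impaired, workaround exists
    MINOR = 1      # e.g. cosmetic

class Priority(IntEnum):
    """Negotiable: where the bug sits in the schedule."""
    NOW = 3
    NEXT_RELEASE = 2
    BACKLOG = 1

@dataclass
class Bug:
    title: str
    severity: Severity   # set once from the definitions, rarely changes
    priority: Priority   # freely re-negotiated without touching severity
```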

I work on emergency control centre systems, so this set of bug levels is a little, well... extreme:
someone dies
total system failure requiring disaster recovery (DR) invocation
server failure requiring engineer response
failure involving loss of call continuity
failure involving loss of data
incorrect data recorded
application failure - non-recoverable
application failure - non-recoverable, but automatically restarted
does not meet requirement spec, no workaround
does not meet requirement spec, but has workaround
cosmetic - layout etc.
actually a feature request
That's off the top of my head. In case you were wondering, it's from most extreme to least :-)
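Encoded as data, that ladder is just an ordered scale; a possible sketch (names abbreviated by me, higher number = more extreme):

```python
from enum import IntEnum

class EmergencySeverity(IntEnum):
    FEATURE_REQUEST = 1
    COSMETIC = 2
    SPEC_MISS_WITH_WORKAROUND = 3
    SPEC_MISS_NO_WORKAROUND = 4
    APP_FAILURE_AUTO_RESTARTED = 5
    APP_FAILURE_NON_RECOVERABLE = 6
    INCORRECT_DATA_RECORDED = 7
    LOSS_OF_DATA = 8
    LOSS_OF_CALL_CONTINUITY = 9
    SERVER_FAILURE_ENGINEER_RESPONSE = 10
    TOTAL_FAILURE_DR_INVOCATION = 11
    SOMEONE_DIES = 12
```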

Some stuff we used before. We split the defect rating into priority and severity.
Severity (set by submitter during submission of defect)
Highest (5): Data loss, hardware damage possible, or a security-related failure
High (4): Loss of functionality without any reasonable workaround
Medium (3): Loss of functionality with a reasonable workaround
Low (2): Partial loss of a function or a feature set (feature still hits the design requirements)
Lowest (1): A cosmetic error
Priority (adjusted by development, management and QA during defect evaluation)
Highest (5): The system is practically unusable with this defect.
High (4): The defect will have a serious impact on the company’s ability to sell and maintain this system.
Medium (3): The company will lose some money if this defect is in the system, but it might be more important to meet the schedule. Fix after release.
Low (2): Do not delay the release, but do fix this problem afterwards.
Lowest (1): Fix as time and resources allow.
Both numbers together create a risk priority number (RPN). Simply multiply severity by priority. A higher result means higher risk. 25 defines the ultimate defect bomb; 1 can be done during idle time or if someone is bored and needs something to do.
First goal: Defects with a rating of highest or high of any kind should be fixed before release.
Second goal: Defects with RPN > 8 should be fixed before releasing the product.
This is of course a little bit artificial, but it helps to give all parties (Support, QA/Test, Engineering, and Product Managers) a tool to set priorities without blowing away the opinion of the other side.
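A tiny sketch of that arithmetic and the two release goals, assuming the 1-5 scales above (the function names are mine):

```python
def risk_priority_number(severity: int, priority: int) -> int:
    """Both inputs are on the 1 (lowest) .. 5 (highest) scale."""
    if not (1 <= severity <= 5 and 1 <= priority <= 5):
        raise ValueError("severity and priority must be between 1 and 5")
    return severity * priority

def must_fix_before_release(severity: int, priority: int) -> bool:
    # First goal: anything rated high (4) or highest (5) on either axis.
    if severity >= 4 or priority >= 4:
        return True
    # Second goal: RPN > 8.
    return risk_priority_number(severity, priority) > 8

print(risk_priority_number(5, 5))        # 25 -> the "ultimate defect bomb"
print(must_fix_before_release(3, 3))     # True, since 9 > 8
print(must_fix_before_release(2, 3))     # False: 6 <= 8 and nothing is high
```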

Replace your bug tracking system with FogBugz and get rid of the severity field altogether.
See Priority vs Severity

"Been there done that".
I've had this discussion over and over again, on different projects. We've tried to combine priority with severity, but the lesson I've learned: do not combine severity with priority!
We've had a lot of brainstorms and meetings which ended with the words "this is it". Multiple guideline documents have been created and spread between the different "parties", but after a while we discovered that it didn't work in the end. Different "parties" think differently about bugs: our helpdesk has a different understanding of priority than the development team or sales does.
Having both a severity and a priority level will very quickly become very confusing because:
when using numbers (between 1 and 5) one will not know what each number means
what if an issue has the highest possible priority, but the lowest possible severity - and I'm sure that this will happen!
what if someone reduces a severity, does he also need to reduce the priority?
"So what should you do then?":
Only use one kind of indicator for the 'level' of an issue; it doesn't matter what you call it.
Use numbers (eg 1 - 5, but it could be more or fewer depending on your needs) to clearly indicate the importance, but combine each with a keyword so that it's clear what it means (eg. 'nice to have', 'show stopper'). For some people prio 1 means the most important, for others 5 does -> therefore a keyword to indicate what a number means is necessary (see the sketch after this list).
Make a distinction between a 'normal issue' and a 'red alert'. In our case a 'Red Alert' must be solved immediately and put into production immediately. A normal issue will follow the normal development-test-deployment flow. The priority/severity/whatever-you-call-it should only be set for normal issues and is ignored for 'red alerts'.
In practice, a 'Red Alert' can become a 'Normal Issue': the support team discovered a major bug and created a 'Red Alert', but after some investigation we discovered that the data had become 'corrupt' in the database since it was inserted there directly and not via the application.
Choose a good tool that allows you to customize the flow; most tools do.
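A rough sketch of the single-indicator idea from the list above: a number paired with a keyword so nobody has to remember whether 1 or 5 is most important, plus a separate red-alert flag that bypasses the normal flow. All names here are illustrative, not from any particular tool.

```python
from dataclasses import dataclass
from enum import IntEnum

class Level(IntEnum):
    SHOW_STOPPER = 5
    CRITICAL = 4
    MAJOR = 3
    MINOR = 2
    NICE_TO_HAVE = 1

@dataclass
class Issue:
    title: str
    level: Level             # ignored when red_alert is True
    red_alert: bool = False  # solved and deployed immediately, outside the flow

def work_queue(issues):
    """Red alerts first, then normal issues by descending level."""
    return sorted(issues, key=lambda i: (not i.red_alert, -i.level))
```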

As for a standard, there is the IEEE Guide to Classification for Software Anomalies (IEEE 1044.1-1995), although I am not sure how widely it is adopted.

One option is to have the product owner determine the priority of the bug. While there is some general intuition about how "bad" a bug is, it can be the responsibility of the owner of the product to set an order of precedence (i.e. bug A should be fixed before bug B, etc.).
The more clear and concise information that can be provided to the product owner, the easier it is for that individual to make those determinations (i.e. how many users have experienced the bug, which features are unavailable as a result of the bug, etc.)

Must be done now
Must be done before we ship
Minor annoyance (Doesn't prevent the user from exercising the functionality)
Edge case/Remote/Tester-from-Mordor scenario
Well I just made that up... my point being that categorizing bugs should not be a weekly hour-plus ritual.
IMHO, prioritizing according to a flowchart is wasted time. Fix bugs in Cat#1 and #2 as quickly as they surface. If you find yourself swamped by bugs, slow down and reflect. Defer Cat#3 and Cat#4 if the schedule doesn't permit them or higher priority items override.
The critical thing is that all of you have a shared understanding of this severity and expected quality. Don't let compliance to the holy standards of X slow you down from delivering what the customer wants... working software.

Personally I favour the two-tier severity/priority model. I know the arguments for a single level, but in the places I've worked I've generally seen a two-level hierarchy work better.
Severity is set by the support team (based on input from the client). Priority is set by the client (with input from the support team).
For severity I use:
1 - Blocker/show stopper
2 - Major functionality unavailable (or effectively unavailable), no practical work around possible
3 - Major functionality unavailable (or ...), work around possible
4 - Minor functionality unavailable (or effectively unavailable), no work around possible
5 - Minor functionality unavailable (or ...), work around possible
6 - Cosmetic or other trivial
Then for priority I just use High, Medium, Low but anything from 3 - 5 levels works (much more than that is just over the top).
I'd generally then order by Priority first and then severity within that. The important thing about this is that the client has the most important say. If they say the way their logo is printing out on a report is the highest priority then that's what gets looked at BUT it gets looked at after the other client's high priority which is stopping them logging in.
Generally speaking I wouldn't release with any high priority issues or any medium priority issues with severity 1 - 4. Obviously in an ideal world you'd fix everything but I've never been lucky enough to have that option.
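A small sketch of that ordering and the release rule in code; the sample bugs, the dict representation and the function name are made up for illustration:

```python
# Client priority first, then severity within it (1 = blocker ... 6 = cosmetic).
PRIORITY_ORDER = {"High": 0, "Medium": 1, "Low": 2}

bugs = [
    {"title": "logo prints wrong on report", "priority": "High", "severity": 6},
    {"title": "other client cannot log in",  "priority": "High", "severity": 1},
    {"title": "tooltip typo",                "priority": "Low",  "severity": 6},
]

# Both High-priority items come before everything else, but within High the
# severity-1 login blocker is looked at before the severity-6 logo issue.
bugs.sort(key=lambda b: (PRIORITY_ORDER[b["priority"]], b["severity"]))

def ok_to_release(bug_list):
    """No High priority issues, and no Medium priority issue with severity 1-4."""
    return not any(
        b["priority"] == "High"
        or (b["priority"] == "Medium" and b["severity"] <= 4)
        for b in bug_list
    )

print([b["title"] for b in bugs])
print(ok_to_release(bugs))
```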

The tester tells what is broken
The developer estimates how much work it will be to fix
The customer decides the business value, i.e. the priority.

Set the requirements of the project so you can base the priority of a fix on the priority of the requirements affected by the bug.

I had the same issue with one of our customers. In the end we set up a document together describing what kind of bugs would correspond to a certain severity. Aside from the occasional discussion, using this document as a guideline appears to work.
But be well aware that test teams and development teams may have very different opinions on what is a severe bug and what is not. From the point of view of the testers a small layout bug can be high priority when a developer would just say that no one will notice.
In our document those bugs can be high priority if they are "brand damaging", i.e. if the layout bug is in the logo or one of the products then it is severe - if it's just a paragraph on the page that is 2 pixels off then it's not.

I use the following categories both for features and bugs:
Showstopper, the program (or a major feature) will not work
Must have, a significant part of the customers will be bothered by this
Would have, some customers will be bothered
Nice to have, a few customers want this
Normally you plan to fix 1, 2 and 3, but 3 is often postponed to the next release due to time constraints.

I think this is the scale we used at a previous job:
Causes loss of files or system instability.
Crashes the program.
Feature doesn't work.
Feature doesn't work, but there are workarounds.
Cosmetic issue.
Request for enhancement.
Sometimes this was abused - if a feature was so poorly designed that someone couldn't figure out how to use it, that was classified as a 6, and it never got fixed.

I agree with the FogBugz folks that this should be kept super simple: http://fogbugz.stackexchange.com/questions/352/priority-vs-severity
I made up this scheme, which I find easy to remember:
pS: seconds matter, eg, server is on fire
pM: minutes matter, eg, something is broken
pH: hours matter, ie, don't go to bed till this is done
pd: days matter, ie, normal priority
pw: weeks matter, ie, lower priority
pm: months matter, ie, no hurry
py: years matter, ie, maybe/someday, ie, wishlist
It roughly parallels Debian's scheme: http://www.debian.org/Bugs/Developer#severities
I like it because it straightforwardly combines priority and severity into a single field that's easy to set a value for.
PS: You can also pick intermediate urgencies like "pMH" for in between "minutes matter" and "hours matter", or "pHd" for in between "hours matter" and "days matter": roughly, "don't literally pull an all-nighter for it, but don't work on anything else till it's done".
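If you want the scheme as data, something like the following works; the concrete second counts are my own and exist only to make the ordering come out right, including the intermediate codes from the PS:

```python
URGENCY_SECONDS = {
    "pS": 1,              # seconds matter
    "pM": 60,             # minutes matter
    "pMH": 60 * 30,       # between minutes and hours
    "pH": 3600,           # hours matter
    "pHd": 3600 * 12,     # between hours and days
    "pd": 86400,          # days matter
    "pw": 86400 * 7,      # weeks matter
    "pm": 86400 * 30,     # months matter
    "py": 86400 * 365,    # years matter / wishlist
}

def by_urgency(bugs):
    """Sort (title, code) pairs so that 'seconds matter' comes first."""
    return sorted(bugs, key=lambda bug: URGENCY_SECONDS[bug[1]])

print(by_urgency([("wishlist item", "py"), ("server on fire", "pS")]))
```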

Related

State definition in Reinforcement learning

When defining the state for a specific problem in reinforcement learning, how do you decide what to include and what to leave out of the definition, and how do you draw the distinction between an observation and a state?
For example, assuming that the agent is in the context of human resources and planning, where it needs to hire some workers based on the demand for jobs while considering the cost of hiring them (assuming the budget is limited): is a state in the format (# workers, cost) a good definition of state?
In short, I don't know what information needs to be in the state and what should be left out because it is really just an observation.
Thank you
I am assuming you are formulating this as an RL problem because the demand is an unknown quantity. And maybe [this is an optional criterion] the cost of hiring a worker takes into account the worker's contribution towards the job, which is unknown initially. If, however, both these quantities are known or can be approximated beforehand, then you can just run a planning algorithm to solve the problem [or just some sort of optimization].
Having said this, the state in this problem could be something as simple as (#workers). Note I'm not including the cost, because cost must be experienced by the agent, and therefore is unknown to the agent until it reaches a specific state. Depending on the problem, you might need to add another factor of "time", or the "job-remaining".
Most of the theoretical results on RL hinge on a key assumption in several setups that the environment is Markovian. There are several works where you can get by without this assumption, but if you can formulate your environment in a way that exhibits this property, then you have many more tools to work with. The key idea is that the agent can decide which action to take (in your case, an action could be: hire one more person; another action could be: fire a person) based on the current state, say (#workers = 5, time = 6). Note that we are not distinguishing between workers yet, so we fire "a" person rather than "a specific" person x. If the workers have differing capabilities, you may need to add several other factors, each representing which worker is currently hired and which are currently in the pool yet to be hired, e.g. a boolean array of fixed length. (I hope you get the idea of how to form a state representation; this can vary based on the specifics of the problem, which are missing in your question.)
Now, once we have the State definition S, the action definition A (hire / fire), we have the "known" quantities for an MDP-setup in an RL framework. We also need an environment that can supply us with the cost function when we query it (Reward Function / Cost Function), and tell us the outcome of taking a certain action on a certain state (Transition). Note that we don't necessarily need to know these Reward / Transition function beforehand, but we should have a means of getting these values when we query for a specific (state, action).
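To make the (state, action, reward, transition) pieces concrete, here is a minimal environment sketch under the assumptions above; the demand model, wage and penalty numbers are invented purely for illustration and are not part of the question:

```python
import random

ACTIONS = ("hire", "fire", "hold")

class HiringEnv:
    def __init__(self, horizon=12, wage=1.0, penalty=5.0):
        self.horizon = horizon          # episode length in time steps
        self.wage = wage                # cost per worker per step
        self.penalty = penalty          # cost per unit of unmet demand
        self.reset()

    def reset(self):
        self.workers, self.t = 0, 0
        return (self.workers, self.t)   # the state the agent actually sees

    def step(self, action):
        if action == "hire":
            self.workers += 1
        elif action == "fire" and self.workers > 0:
            self.workers -= 1
        demand = random.randint(0, 10)  # unknown to the agent in advance
        unmet = max(0, demand - self.workers)
        reward = -(self.workers * self.wage + unmet * self.penalty)
        self.t += 1
        done = self.t >= self.horizon
        return (self.workers, self.t), reward, done

# Usage: roll out one episode with a random policy.
env, total = HiringEnv(), 0.0
state, done = env.reset(), False
while not done:
    state, r, done = env.step(random.choice(ACTIONS))
    total += r
print("return of random policy:", total)
```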
Coming to your final part, the difference between observation and state: there are much better resources to dig deep into it, but in a crude sense, an observation is an agent's (any agent: AI, human, etc.) sensory data. For example, in your case the agent has the ability to count the number of workers currently employed (but it does not have the ability to distinguish between workers).
A state, more formally, a true MDP state must be something that is Markovian and captures the environment at its fundamental level. So, maybe in order to determine the true cost to the company, the agent needs to be able to differentiate between workers, working hours of each worker, jobs they are working at, interactions between workers and so on. Note that, much of these factors may not be relevant to your task, for example a worker's gender. Typically one would like to form a good hypothesis on which factors are relevant beforehand.
Now, even though we can agree that a worker's assignment (to a specific job) may be a relevant feature when making a decision to hire or fire them, your observation does not contain this information. So you have two options: either you ignore the fact that this information is important and work with what you have available, or you try to infer these features. If your observations are incomplete for the decision-making in your formulation, we typically classify the problem as a Partially Observable Environment (and use POMDP frameworks for it).
I hope I clarified a few points, however, there is huge theory behind all of this and the question you asked about "coming up with a state definition" is a matter of research. (Much like feature engineering & feature selection in Machine Learning).

Code duplication metrics - Best practice

When looking at code duplication metrics over a long period of time (>10 years) are there guidelines / best practices for what level of code duplication is "normal" or "recommended"?
I have great difficulty with this question, because if the code quality were great then nobody would need to maintain it, so who cares? But, in general terms, are there references on what is "normal", say for a 10-line duplication threshold?
Is, say, X% duplication unusual or normal? If normal, does that mean there are healthy, profitable projects out there with this level of duplication?
Perhaps the answer is a study that includes code duplication as one of the metrics against success / average / failure? Perhaps people can share their experience of maintenance costs for a given level of code duplication?
In my opinion there is no general answer to this.
You should inspect every finding of the tool and decide whether it is a false positive, justified (perhaps some standard coding pattern or generated code), or problematic copy-pasted code that should be refactored and extracted into its own function/module/whatever.
In the case of a false positive or intended duplication, there should be a tool-specific way to suppress the warning for this occurrence, or in general for a specific pattern.
That way, in future runs you should always get warnings if new duplicates are found (or older unfixed true positives remain).
For false positives, consider filing a bug report with the tool's author (crafting the smallest possible code configuration that still triggers the warning).
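For the "X%" part of the question, the number most tools report is just the share of lines covered by detected duplicate blocks. A hedged sketch, assuming a hypothetical report format of (file, start_line, block_length) tuples; it shows how the percentage is derived, not what a healthy value is:

```python
def duplication_percentage(duplicate_blocks, total_lines):
    """Percentage of lines that fall inside at least one reported duplicate block."""
    duplicated = set()
    for path, start, length in duplicate_blocks:
        duplicated.update((path, line) for line in range(start, start + length))
    return 100.0 * len(duplicated) / total_lines

# Overlapping blocks in the same file are counted once.
report = [("a.py", 10, 12), ("b.py", 40, 12), ("a.py", 15, 10)]
print(f"{duplication_percentage(report, total_lines=2000):.1f}% duplicated")
```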

What are examples of real-world scenarios where a message queuing system can accept the loss of some messages?

I was reading this blog post, in which the author proposes the following question, in the context of message queues:
does it matter if a message is lost? If your application node, processing the request, dies, can you recover? You’ll be surprised how often it doesn’t actually matter, and you can function properly without guaranteeing all messages are processed
At first I thought that the main point of handling messages was to never lose a single message; after all, a lost message could mean a hotel reservation not booked, a checkout not completed, or any other functionality not carried through, which seems too similar to a bug to me. I suppose I am missing something, so what are examples of scenarios where it is OK for a messaging system to lose a few messages?
Well, your initial expectation:
the main point of handling messages was to never lose a single message
was just not a correct one.
Right, if one strives for a certain type of robustness, where fail-safe measures have to take all due care and precautions so that not a single message can get lost, then yes, your a priori expressed expectation fits.
This does not mean that all other system designs have to carry all the immense burdens and pay all the incurred costs (resource-wise, latency-wise, et al.) that the "100+% guaranteed delivery" systems do (but, again, only if they can).
Anti-pattern cases:
There are many use-cases, where an absolute certainty of delivery of each and every message originally sent is actually an anti-pattern.
Just imagine a weakly synchronised system (including ones that have nothing like back-throttling or even the simplest form of feedback propagation at all), where the sensors read an actual temperature, a sound or a video frame and send a message with that value(s).
Whenever a post-processing system gets such information delivered, there may be a reason not to read any and all "old" values, but only the most recent one(s).
If the delivery framework has already received a newer set of values, then all the "older" values not yet processed, still hanging at some depth from the queue head, create the anti-pattern: one would not like to have to read and process all of those "older" values, but just the most recent one(s).
Just as no one will make a trade with you based on yesterday's prices, there is no positive value in making any new, current decision based on reading any and all "old" temperature readings that still wait in the queue.
Some smart-messaging frameworks provide explicit means for taking just the very "newest" message from a given source, thus making it possible to imperatively discard any "older" messages and avoid reading and processing them, due to the known presence of a "most recent" one.
This answers the original question about the assumed main point of handling messages.
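A rough sketch of that "act only on the newest reading" consumer in plain Python; the queue, the message shape and the sensor names are invented, and some brokers offer the same thing natively as a last-value cache or conflating queue:

```python
import threading, queue, time

inbox = queue.Queue()

def consume_latest():
    latest = {}
    while True:
        msg = inbox.get()                    # block for at least one message
        latest[msg["sensor"]] = msg
        while not inbox.empty():             # drain whatever piled up meanwhile
            msg = inbox.get_nowait()
            latest[msg["sensor"]] = msg      # older readings are overwritten
        for sensor, reading in latest.items():
            print(f"acting on newest {sensor} reading: {reading['value']}")
        time.sleep(0.5)                      # simulate slow post-processing

threading.Thread(target=consume_latest, daemon=True).start()

for i in range(20):                          # fast producer outruns the consumer
    inbox.put({"sensor": "temp", "value": 20 + i})
    time.sleep(0.05)
time.sleep(2)
```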
Efficiency first:
In any case where a smart delivery takes place (either deliver an exact copy of the original message content or nothing-at-all), the resources are used at their best, without spending a single penny on anything but the "just-enough" smart delivery.
Building robustness costs more than that.
Building an ultimate robustness, costs even way more than that.
Systems that do have such an extreme requirement can and may extend the resource-efficient smart delivery so as to reach some requirements-defined level of robustness, at some add-on cost.
The same in reverse is not possible: an "everything-proof" system cannot be slimmed down so as to fit onto restricted-resources hardware, or be made to "forget" some "old" messages that are of no positive value at this very moment (on the contrary, the processing element is obliged to read and process each and every "unwanted" message just because it was delivered, while the core logic needs only the most recent one).
Distributed systems accrue end-to-end latency from many distributed sources, so any rigid-delivery system just blocks and penalises the one element that is (latency-wise) innocent: the receiver.
I suppose it's OK to lose a few messages from some measurement units that deliver the value once in.... Also, for big data analytics solutions a few lost messages won't make a big difference.
It all depends on the application/larger system. The message queue is only one link in the chain, so to speak. If the application(s) at the ends are prepared to deal with loss, losing some messages is not a problem. If the application(s) rely on total messaging integrity then there will be problems.
An example of a system that will be ok with loss is weather updates for your phone. If a few temperature/wind updates don't make it to you there's no real harm in that.
Now, if you're running a nuclear reactor and you lose a few temperature updates on the core, well that is a problem.
I work a lot on safety critical, infrastructure-level systems, and am responsible for messaging much of the time. Many of those systems state clearly that messaging may reorder, duplicate, or lose messages; it's just a fact of life where distributed systems and networks are involved. The endpoint systems need to be designed to work correctly in that environment. So they track messages, ack end to end, deal with duplicates and retransmits, etc.

Modeling an RTS or how Blizzard was able to put together Starcraft 1 & 2?

Not sure if this question is related to software development, but I hope someone can at least point me in the right direction (no, not that direction...).
I am very curious as to how Blizzard achieves such a balance of strategic/tactical forces in their games. If you look at Starcraft 1, or now 2, each race has unique features that sort of counterbalance the unique features of the other races, and all together they create a pretty beautiful (to my mind at least) balance.
Is there some sort of area of mathematics that could help model these things? How do they do it basically?
I don't have a full answer, but here is what I know. Initially, when the game is technically ready, the balance is not ideal. When they started the first public beta there were holes in the balance that they patched very fast. They let players (testers) play as-is, captured statistics on the percentage of wins per race, and tuned the parameters accordingly. By the end of the beta the ratio was almost ideal: 33%/33%/33%.
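A toy sketch of that bookkeeping; the list of winners is made up, and a real pipeline would obviously also condition on player skill, map and match-up:

```python
from collections import Counter

# Winning race of each recorded game between similarly skilled players.
winners = ["Terran", "Protoss", "Zerg", "Terran", "Zerg", "Zerg",
           "Protoss", "Terran", "Zerg", "Protoss", "Terran", "Zerg"]

wins = Counter(winners)
total = len(winners)
for race, count in wins.items():
    share = count / total
    flag = "  <- tune this race" if abs(share - 1 / 3) > 0.05 else ""
    print(f"{race}: {share:.0%} of all wins{flag}")
```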
I've no idea how Blizzard specifically did it - it might have just been through a lot of user testing.
However, the entire field of CS and statistics dedicated to these kinds of problems is simulation. Generally the idea is to construct a model of your system and then generate inputs according to some statistical distribution to try to understand the behaviour of that model. In the case of finding game balance, you would want to show that most sequences of game events led to some kind of equilibrium. This would probably involve some kind of stochastic modeling.
Linear algebra, maybe a little calculus, and a lot of testing.
There's no real mystery to this sort of thing, but people often just don't think about it in the rigorous terms you need to get a system that is fairly well-balanced.
Basically the designers know how quickly you can gather resources (both the best case and the average case), and they know how long it takes to build a unit, and they know roughly how powerful a unit is (eg. by reference to approximations such as damage per second). With this, you can ensure a unit's cost in resources makes sense. And from that, it's possible to compare resource gathering with unit cost to model the strength of a force growing over time. Similarly you can measure a force's capacity for damage over time, and compare the two. If there's a big disparity then they can tweak one or more of the variables to reduce it. Obviously there are many permutations of units but the interesting thing is that you only really need to understand the most powerful permutations - if a player picks a poor combination then it's ok if they lose. You don't want every approach to be equally good, as that would imply a boring game where your decisions are meaningless.
This is, of course, a simplification, but it helps get everything in roughly the right place and ensure that there are no units that are useless. Then testing can hammer down the rough edges and find most of the exploits that remain.
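As a hedged illustration of that resource-rate versus unit-cost comparison (every number below is invented for the sake of the example):

```python
def force_over_time(income_per_min, unit_cost, unit_dps, unit_hp, minutes):
    """Return (minute, total_dps, total_hp) of the force you can afford each minute."""
    timeline = []
    for t in range(1, minutes + 1):
        units = (income_per_min * t) // unit_cost
        timeline.append((t, units * unit_dps, units * unit_hp))
    return timeline

# Two hypothetical units: cheap-and-weak vs expensive-and-strong.
cheap = force_over_time(income_per_min=300, unit_cost=50, unit_dps=8, unit_hp=40, minutes=10)
elite = force_over_time(income_per_min=300, unit_cost=150, unit_dps=20, unit_hp=130, minutes=10)

for (t, dps_a, hp_a), (_, dps_b, hp_b) in zip(cheap, elite):
    print(f"min {t:2d}: cheap force {dps_a:4d} dps / {hp_a:5d} hp"
          f"   vs   elite force {dps_b:4d} dps / {hp_b:5d} hp")
```

If the two curves diverge badly, one of the tweakable variables (cost, build time, damage, hit points) gets adjusted until they stay within an acceptable band.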
If you've tried SCII you've noticed that Blizzard records the relevant data for each game played on B.Net. They did the same in WC3, and presumably in SC1. With enough games stored, it is possible to get accurate results from statistical analysis. For example, the Protoss should win a third of all match-ups with similarly skilled opponents. If this is not the case, Blizzard can analyze the games where the Protoss won vs the games where they lost, and see what units made the difference. Then they can nerf those units (with a bit of in-house testing), and introduce the change at the next patch. Balancing is a difficult problem - a change that fixes problems in top-level games may break the balance in mid-level games.
Proper testing in a closed beta is impossible - you just can't accumulate enough data. This is why Blizzard do open betas. Even so the real test is the release of the game.

How to measure usability to get hard data?

There are a few posts on usability but none of them was useful to me.
I need a quantitative measure of usability of some part of an application.
I need to estimate it in hard numbers to be able to compare it with future versions (e.g. for reporting purposes). The simplest way is to count clicks and keystrokes, but this seems too simple (for example, is the cost of filling a text field simply the sum of typing all the letters? I guess it is more complicated).
I need some mathematical model for that so I can estimate the numbers.
Does anyone know anything about this?
P.S. I don't need links to resources about designing user interfaces. I already have them. What I need is a mathematical apparatus to measure existing applications interface usability in hard numbers.
Thanks in advance.
http://www.techsmith.com/morae.asp
This is what Microsoft used in part when they spent millions redesigning Office 2007 with the ribbon toolbar.
Here is how Office 2007 was analyzed:
http://cs.winona.edu/CSConference/2007proceedings/caty.pdf
Be sure to check out the references at the end of the PDF too, there's a ton of good stuff there. Look up how Microsoft did Office 2007 (regardless of how you feel about it), they spent a ton of money on this stuff.
Your main ideas to approach in this are Effectiveness and Efficiency (and, in some cases, Efficacy). The basic points to remember are outlined on this webpage.
What you really want to look at doing is 'inspection' methods of measuring usability. These are typically more expensive to set up (both in terms of time, and finance), but can yield significant results if done properly. These methods include things like heuristic evaluation, which is simply comparing the system interface, and the usage of the system interface, with your usability heuristics (though, from what you've said above, this probably isn't what you're after).
More suited to your use, however, will be 'testing' methods, whereby you observe users performing tasks on your system. This is partially related to the point of effectiveness and efficiency, but can include various things, such as the "Think Aloud" concept (which works really well in certain circumstances, depending on the software being tested).
Jakob Nielsen has a decent (short) article on his website. There's another one, but it's more related to how to test in order to be representative, rather than how to perform the testing itself.
Consider measuring the time to perform critical tasks (using a new user and an experienced user) and the number of data entry errors for performing those tasks.
First you want to define goals: for example increasing the percentage of users who can complete a certain set of tasks, and reducing the time they need for it.
Then, get two cameras, a few users (5-10) give them a list of tasks to complete and ask them to think out loud. Half of the users should use the "old" system, the rest should use the new one.
Review the tapes, measure the time it took, measure success rates, discuss endlessly about interpretations.
Alternatively, you can develop a system for bucket-testing -- it works the same way, though it makes it far more difficult to find out something new. On the other hand, it's much cheaper, so you can do many more iterations. Of course that's limited to sites you can open to public testing.
That obviously implies you're trying to get comparative data between two designs. I can't think of a way of expressing usability as a value.
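A small sketch of turning those tapes into the "hard numbers" the question asks for, assuming you logged each participant's task time and completion flag; all figures are invented, and with 5-10 users per design the comparison is only indicative:

```python
from statistics import mean, stdev

old = {"times": [145, 210, 180, 250, 195], "completed": [True, True, False, True, True]}
new = {"times": [120, 150, 140, 170, 130], "completed": [True, True, True, True, False]}

def summarise(label, data):
    success = sum(data["completed"]) / len(data["completed"])
    print(f"{label}: mean time {mean(data['times']):.0f}s "
          f"(sd {stdev(data['times']):.0f}s), success rate {success:.0%}")

summarise("old design", old)
summarise("new design", new)
```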
You might want to look into the GOMS model (Goals, Operators, Methods, and Selection rules). It is a very difficult research tool to use in my opinion, but it does provide a "mathematical" basis to measure performance in a strictly controlled environment. It is best used with "expert" users. See this very interesting case study of Project Ernestine for New England Telephone operators.
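The simplest member of the GOMS family, the Keystroke-Level Model, is probably the closest thing to the "mathematical apparatus" the question asks for. A sketch using the commonly cited operator times; the example task and its operator sequence are made up:

```python
# Commonly cited KLM operator times, in seconds.
KLM = {
    "K": 0.20,   # keystroke (skilled typist)
    "P": 1.10,   # point with mouse to a target
    "B": 0.10,   # mouse button press or release
    "H": 0.40,   # home hands between keyboard and mouse
    "M": 1.35,   # mental preparation
}

def klm_estimate(ops: str) -> float:
    """Sum operator times for a sequence like 'MPBBH' plus typing."""
    return sum(KLM[op] for op in ops)

# Hypothetical task: click a text field, type an 8-character value, click OK.
fill_field = "M" + "P" + "BB" + "H" + "K" * 8 + "H" + "M" + "P" + "BB"
print(f"estimated error-free expert time: {klm_estimate(fill_field):.1f} s")
```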
Measuring usability quantitatively is an extremely hard problem. I tackled this as a part of my doctoral work. The short answer is, yes, you can measure it; no, you can't use the results in a vacuum. You have to understand why something took longer or shorter; simply comparing numbers is worse than useless, because it's misleading.
For comparing alternate interfaces it works okay. In a longitudinal study, where users are bringing their past expertise with version 1 into their use of version 2, it's not going to be as useful. You will also need to take into account time to learn the interface, including time to re-understand the interface if the user's been away from it. Finally, if the task is of variable difficulty (and this is the usual case in the real world) then your numbers will be all over the map unless you have some way to factor out this difficulty.
GOMS (mentioned above) is a good method to use during the design phase to get an intuition about whether interface A is better than B at doing a specific task. However, it only addresses error-free performance by expert users, and only measures low-level task execution time. If the user figures out a more efficient way to do their work that you haven't thought of, you won't have a GOMS estimate for it and will have to draft one up.
Some specific measures that you could look into:
Measuring clock time for a standard task is good if you want to know what takes a long time. However, lab tests generally involve test subjects working much harder and concentrating much more than they do in everyday work, so comparing results from the lab to real users is going to be misleading.
Error rate: how often the user makes mistakes or backtracks. Especially if you notice the same sort of error occurring over and over again.
Appearance of workarounds: if your users are working around a feature, or taking a bunch of steps that you think are dumb, it may be a sign that your interface doesn't give them the tools they need to solve their problems.
Don't underestimate simply asking users how well they thought things went. Subjective usability is finicky but can be revealing.