When are MySQL Triggers not a good idea?

It seems to me that any data tier consistency/integrity updates should almost always be handled by a trigger.
I've been told in the past they can reduce performance, but I'm not sure under what circumstances. On one hand I could see increased locking contention when further actions are chained by triggers, but it seems as though the aggregate performance should still be improved by reducing the need for multiple round-trip queries. One counterexample would be logging, which might be better handled asynchronously outside the application critical path for performance. It also seems that one would not want too much application-specific algorithmic complexity implemented in the data tier.
I've read the docs, FAQs, Forum and other sites and witnessed plenty of use cases, but haven't come across a discussion of best practices or anti-patterns.
Are there general rules-of-thumb or specific cases where triggers are not a good idea?

Related

How do you make real, secure benchmarks?

According to this question, a benchmark run on the same machine produced highly variable results.
I'm not asking about how to use microtime or whichever framework, but rather, how do you make sure that your benchmarks are not biased in any way? Any machine setup, software setup, process setup? Is there a way to make sure your benchmarks can be safely used as a reference?
Basically, benchmarking is a kind of scientific study, so the same rules apply. A benchmark is usually done to answer some kind of question, so start by formulating a good question. After that, it takes practice and experience to eliminate the various sources of bias.
Make sure you know and document the runtime environment in detail (e.g., switch off power management and other background tasks that might disturb measurements).
Make sure you repeat the experiment (benchmark run) often enough to get good and stable averages and document it.
Make sure you know what you are measuring (e.g., use a working set that's larger than all caches if you want to measure memory performance, or use as many threads as you have cores, and so on).
In some cases this involves getting caches filled and datasets cached, in other cases you need to do the exact opposite. Depends on the question you want to answer with your benchmark.
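To make the repeat-and-document advice concrete, here is a minimal Java sketch of a timed benchmark loop with a warm-up phase. The workload, warm-up length, and run count are arbitrary placeholders, not recommendations.

import java.util.Arrays;

public class Bench {
    // Placeholder workload; substitute the code you actually want to measure.
    static long workload() {
        long acc = 0;
        for (int i = 0; i < 1_000_000; i++) acc += i * 31L;
        return acc;
    }

    public static void main(String[] args) {
        final int warmupRuns = 10, measuredRuns = 30;   // arbitrary; tune for your environment
        long sink = 0;

        // Warm-up: let JIT compilation and caches settle before measuring.
        for (int i = 0; i < warmupRuns; i++) sink += workload();

        double[] millis = new double[measuredRuns];
        for (int i = 0; i < measuredRuns; i++) {
            long start = System.nanoTime();
            sink += workload();                          // keep the result live so it cannot be optimized away
            millis[i] = (System.nanoTime() - start) / 1e6;
        }

        double mean = Arrays.stream(millis).average().orElse(0);
        double variance = Arrays.stream(millis).map(m -> (m - mean) * (m - mean)).average().orElse(0);
        System.out.printf("mean %.3f ms, stddev %.3f ms over %d runs (sink=%d)%n",
                mean, Math.sqrt(variance), measuredRuns, sink);
    }
}

Reporting the spread alongside the mean makes it obvious when the environment is too noisy for the numbers to be trusted.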

Weighing High-Volume Database Servers

I suppose this is partially subjective in that it's probably dependent on everyone's interpretation of "high volume", but for the sake of discussion, I'd like to approach this in a hypothetical way. Also, if this is something that should be exclusive to ServerFault, let me know and I'll happily repost there.
Obviously there are numerous well-known database servers, the most lauded of which is probably MySQL. Many people swear by SQLite, PostgreSQL, or even MSSQL (I've admittedly only used MySQL and SQLite). I've had plenty of success with MySQL for low-to-medium traffic (<= 1,000,000 hits/month) where database interaction was minimal or moderate (e.g., no complex subqueries, wide joins, etc.), and with MySQL clusters for medium-to-high traffic. That said, I'm wondering about the validity of filesystem-based systems for extremely high traffic (say 100,000 concurrent connections, hypothetically).
There's always the approach of "build something solid, optimize it, and then scale it by throwing more CPUs at it" which isn't unreasonable given the cloud, and I'm not necessarily afraid of spawning slaves to keep things well distributed. But from a minimalist (and efficiency) standpoint, for something with that many concurrent requests, it seems like adding more gears to the machine is just adding unnecessary complexity.
I know that something like MySQL Cluster can redistribute queries across working slaves should one fail, but if you had a single application such that logically breaking usage into separate servers was not possible, is there a solution more efficient than just adding CPUs? Possibly using filesystem storage across N mount points? I'd love to get some thoughts on the pros and cons.
See Wikipedia on the subject of the C10K problem, or the references from that page, since the Wikipedia page is rather light on material. Suffice it to say, C10K refers to the problem of handling 10,000 concurrent clients. You are asking about a problem an order of magnitude larger, which is correspondingly harder and less achievable in practice. You are rapidly encroaching on Google's search territory, and would require Google-sized infrastructure to cope.

How to measure usability to get hard data?

There are a few posts on usability but none of them was useful to me.
I need a quantitative measure of usability of some part of an application.
I need to estimate it in hard numbers to be able to compare it with future versions (e.g., for reporting purposes). The simplest way is to count clicks and keystrokes, but this seems too simple (for example, is the cost of filling a text field simply the sum of typing all the letters? I suspect it is more complicated).
I need some mathematical model for that so I can estimate the numbers.
Does anyone know anything about this?
P.S. I don't need links to resources about designing user interfaces. I already have them. What I need is a mathematical apparatus to measure existing applications interface usability in hard numbers.
Thanks in advance.
http://www.techsmith.com/morae.asp
This is what Microsoft used in part when they spent millions redesigning Office 2007 with the ribbon toolbar.
Here is how Office 2007 was analyzed:
http://cs.winona.edu/CSConference/2007proceedings/caty.pdf
Be sure to check out the references at the end of the PDF too; there's a ton of good material there. Look up how Microsoft did Office 2007 (regardless of how you feel about it); they spent a ton of money on this research.
The main ideas to focus on here are Effectiveness and Efficiency (and, in some cases, Efficacy). The basic points to remember are outlined on this webpage.
What you really want to look at is 'inspection' methods of measuring usability. These are typically more expensive to set up (in terms of both time and money), but can yield significant results if done properly. These methods include things like heuristic evaluation, which is simply comparing the system interface, and the usage of the system interface, against your usability heuristics (though, from what you've said above, this probably isn't what you're after).
More suited to your use, however, will be 'testing' methods, whereby you observe users performing tasks on your system. This is partially related to the point of effectiveness and efficiency, but can include various things, such as the "Think Aloud" concept (which works really well in certain circumstances, depending on the software being tested).
Jakob Nielsen has a decent (short) article on his website. There's another one, but it's more related to how to test in order to be representative, rather than how to perform the testing itself.
Consider measuring the time to perform critical tasks (using a new user and an experienced user) and the number of data entry errors for performing those tasks.
First you want to define goals: for example, increasing the percentage of users who can complete a certain set of tasks, and reducing the time they need for it.
Then get two cameras and a few users (5-10), give them a list of tasks to complete, and ask them to think out loud. Half of the users should use the "old" system, the rest should use the new one.
Review the tapes, measure the time it took, measure success rates, discuss endlessly about interpretations.
Alternatively, you can develop a system for bucket-testing -- it works the same way, though it makes it far more difficult to find out something new. On the other hand, it's much cheaper, so you can do many more iterations. Of course that's limited to sites you can open to public testing.
That obviously implies you're trying to get comparative data between two designs. I can't think of a way of expressing usability as a value.
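As a rough illustration of turning those tape reviews into numbers, here is a small Java sketch that compares completion rate and mean task time between the two designs; the session records are invented for the example.

import java.util.List;

public class UsabilityCompare {
    // One observed session: which design variant, whether the task was completed, and how long it took.
    record Session(String variant, boolean completed, double seconds) {}

    static void summarize(String variant, List<Session> sessions) {
        List<Session> matching = sessions.stream().filter(s -> s.variant().equals(variant)).toList();
        double completionRate = 100.0 * matching.stream().filter(Session::completed).count() / matching.size();
        double meanSeconds = matching.stream().filter(Session::completed)
                .mapToDouble(Session::seconds).average().orElse(Double.NaN);
        System.out.printf("%s design: %.0f%% completed, mean time %.1f s%n", variant, completionRate, meanSeconds);
    }

    public static void main(String[] args) {
        List<Session> observed = List.of(                // invented observations
                new Session("old", true, 74), new Session("old", false, 120), new Session("old", true, 91),
                new Session("new", true, 52), new Session("new", true, 63), new Session("new", true, 58));
        summarize("old", observed);
        summarize("new", observed);
    }
}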
You might want to look into the GOMS model (Goals, Operators, Methods, and Selection rules). It is a very difficult research tool to use in my opinion, but it does provide a "mathematical" basis to measure performance in a strictly controlled environment. It is best used with "expert" users. See this very interesting case study of Project Ernestine for New England Telephone operators.
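The simplest member of the GOMS family is the Keystroke-Level Model (KLM), which estimates expert, error-free task time by summing standard operator times. Here is a rough Java sketch; the operator values below are the commonly cited textbook averages and the encoded task is invented, so treat the output as a ballpark figure only.

import java.util.Map;

public class KlmEstimate {
    // Commonly cited KLM operator times in seconds (rough averages; consult a KLM reference for variants).
    static final Map<Character, Double> OPERATOR_SECONDS = Map.of(
            'K', 0.28,  // press a key (average typist)
            'P', 1.10,  // point at a target with the mouse
            'B', 0.10,  // press or release a mouse button
            'H', 0.40,  // move hands between keyboard and mouse
            'M', 1.35); // mental preparation

    static double estimateSeconds(String encodedTask) {
        return encodedTask.chars()
                .mapToDouble(c -> OPERATOR_SECONDS.getOrDefault((char) c, 0.0))
                .sum();
    }

    public static void main(String[] args) {
        // Hypothetical task: think, point at a field, click (press + release), home to keyboard, type five characters.
        System.out.printf("Estimated task time: %.2f s%n", estimateSeconds("MPBBHKKKKK"));
    }
}

Like GOMS in general, this only models error-free expert performance, so it complements rather than replaces observing real users.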
Measuring usability quantitatively is an extremely hard problem. I tackled this as a part of my doctoral work. The short answer is, yes, you can measure it; no, you can't use the results in a vacuum. You have to understand why something took longer or shorter; simply comparing numbers is worse than useless, because it's misleading.
For comparing alternate interfaces it works okay. In a longitudinal study, where users are bringing their past expertise with version 1 into their use of version 2, it's not going to be as useful. You will also need to take into account time to learn the interface, including time to re-understand the interface if the user's been away from it. Finally, if the task is of variable difficulty (and this is the usual case in the real world) then your numbers will be all over the map unless you have some way to factor out this difficulty.
GOMS (mentioned above) is a good method to use during the design phase to get an intuition about whether interface A is better than B at doing a specific task. However, it only addresses error-free performance by expert users, and only measures low-level task execution time. If the user figures out a more efficient way to do their work that you haven't thought of, you won't have a GOMS estimate for it and will have to draft one up.
Some specific measures that you could look into:
Task time: measuring clock time for a standard task is good if you want to know what takes a long time. However, lab tests generally involve test subjects working much harder and concentrating much more than they do in everyday work, so comparing results from the lab to real users is going to be misleading.
Error rate: how often the user makes mistakes or backtracks, especially if you notice the same sort of error occurring over and over again.
Appearance of workarounds: if your users are working around a feature, or taking a bunch of steps that you think are dumb, it may be a sign that your interface doesn't give them the tools they need to solve their problems.
Don't underestimate simply asking users how well they thought things went. Subjective usability is finicky but can be revealing.
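One common way to put a number on that last point is the System Usability Scale (SUS), a ten-item questionnaire scored on a 0-100 scale. A small Java sketch of the standard scoring rule, with invented responses:

public class SusScore {
    // Scores one completed SUS questionnaire: ten answers on a 1..5 scale, item 1 first.
    // Odd-numbered items contribute (answer - 1), even-numbered items (5 - answer); the sum is multiplied by 2.5.
    static double sus(int[] responses) {
        if (responses.length != 10) throw new IllegalArgumentException("SUS has exactly 10 items");
        int sum = 0;
        for (int i = 0; i < 10; i++) {
            sum += (i % 2 == 0) ? responses[i] - 1 : 5 - responses[i];
        }
        return sum * 2.5;
    }

    public static void main(String[] args) {
        int[] madeUpAnswers = {4, 2, 5, 1, 4, 2, 4, 1, 5, 2};   // invented answers from one participant
        System.out.println("SUS score: " + sus(madeUpAnswers)); // 0..100
    }
}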

What optimizations are OK to do right away?

One of the most common mantras in computer science and programming is to never optimize prematurely, meaning that you should not optimize anything until a problem has been identified, since code readability/maintainability is likely to suffer.
However, sometimes you might know that a particular way of doing things will perform poorly. When is it OK to optimize before identifying a problem? What sorts of optimizations are allowable right from the beginning?
For example: using as few DB connections as possible, and paying close attention to that while developing, rather than opening a new connection whenever one is needed and worrying about the performance cost later.
I think you are missing the point of that dictum. There's nothing wrong with doing something the most efficient way possible right from the start, provided it's also clear, straightforward, etc.
The point is that you should not tie yourself (and worse, your code) in knots trying to solve problems that may not even exist. Save that level of extreme optimization, which is often costly in terms of development time, maintenance, technical debt, bugs, and portability, for cases where you really need it.
I think you're looking at this the wrong way. The point of avoiding premature optimization isn't to avoid optimizing, it's to avoid the mindset you can fall into.
Write your algorithm in the clearest way that you can first. Then make sure it's correct. Then (and only then) worry about performance. But also think about maintenance etc.
If you follow this approach, then your question answers itself. The only "optimizations" that are allowable right from the beginning are those that are at least as clear as the straightforward approach.
The best optimization you can make at any time is to pick the correct algorithm for the problem. It's amazing how often a little thought yields a better approach that will save orders of magnitude, rather than a few percent. It's a complete win.
Things to look for:
Mathematical formulas rather than iteration (see the sketch after this list).
Patterns that are well known and documented.
Existing code / components
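As a small illustration of the first item, here is the classic case of a closed-form formula replacing a loop, sketched in Java with an arbitrary n:

public class SumExample {
    // O(n): add the integers 1..n one at a time.
    static long sumByIteration(long n) {
        long total = 0;
        for (long i = 1; i <= n; i++) total += i;
        return total;
    }

    // O(1): the same result from the closed-form formula n(n+1)/2.
    static long sumByFormula(long n) {
        return n * (n + 1) / 2;
    }

    public static void main(String[] args) {
        long n = 1_000_000;
        System.out.println(sumByIteration(n));  // 500000500000
        System.out.println(sumByFormula(n));    // 500000500000, without the loop
    }
}

Both versions are clear, which is exactly the kind of optimization that costs nothing in readability.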
IMHO, none. Write your code without ever thinking about "optimisation". Instead, think "clarity", "correctness", "maintainability" and "testability".
From Wikipedia:
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%." - Donald Knuth
I think that sums it up. The question is knowing whether you are in the 3% and what route to take. Personally, I ignore most optimizations until I at least get my code working, and then do a separate pass with a profiler so I can make sure I am optimizing things that actually matter. Often, code simply runs fast enough that anything you do will have little or no effect.
If you don't have a performance problem, then you should not sacrifice readability for performance. However, when choosing a way to implement some functionality, you should avoid using code you know is problematic from a performance point of view. So if there are 2 ways to implement a function, choose the one likely to perform better, but if it's not the most intuitive solution, make sure you put in some comments as to why you coded it that way.
As you develop in your career as a developer, you'll simply grow in awareness of better, more reasonable approaches to various problems. In most cases I can think of,
performance enhancement work resulted in code that was actually smaller and simpler than some complex tangle that evolved from working through a problem. As you get better, such simpler, faster solutions just become easier and more natural to generate.
Update: I'm voting +1 for everyone on the thread so far because the answers are so good. In particular, DWC has captured the essence of my position with some wonderful examples.
Documentation
Documenting your code is the #1 optimization (of the development process) that you can do right from the get-go. As a project grows, the more people you interact with and the more people who need to understand what you wrote, the more time you will spend explaining what you wrote.
Toolkits
Make sure your toolkit is appropriate for the application you're developing. If you're making a small app, there's no reason to invoke the mighty power of an Eclipse-based GUI system.
Compilers
Let the compiler do the tough work. Most of the time, optimization switches on a compiler will do most of the important things you need.
System Specific Optimizations
Especially in the embedded world, gain an understanding of the underlying architecture of the CPU and system you're interacting with. For example, on a Coldfire CPU, you can gain large performance improvements by ensuring that your data lies on the proper byte boundary.
Algorithms
Strive to make access algorithms O(1) or O(Log N). Strive to make iteration over a list no more than O(N). If you're dealing with large amounts of data, avoid anything more than O(N^2) if it's at all possible.
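As a hedged illustration of the difference those bounds make, here is a duplicate check done with a nested loop versus a hash set; the data is arbitrary.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DuplicateCheck {
    // O(N^2): compare every pair of elements.
    static boolean hasDuplicateQuadratic(List<String> items) {
        for (int i = 0; i < items.size(); i++)
            for (int j = i + 1; j < items.size(); j++)
                if (items.get(i).equals(items.get(j))) return true;
        return false;
    }

    // O(N) expected: each HashSet membership check is O(1) on average.
    static boolean hasDuplicateLinear(List<String> items) {
        Set<String> seen = new HashSet<>();
        for (String item : items)
            if (!seen.add(item)) return true;   // add() returns false if the element was already present
        return false;
    }

    public static void main(String[] args) {
        List<String> data = List.of("a", "b", "c", "b");
        System.out.println(hasDuplicateQuadratic(data));  // true
        System.out.println(hasDuplicateLinear(data));     // true
    }
}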
Code Tricks
Avoid, if possible. This is an optimization in itself - an optimization to make your application more maintainable in the long run.
You should avoid optimizations that are based only on the belief that the code you are optimizing will be slow. The only code you should optimize is code you know is slow (preferably identified through a profiler).
If you write clear, easy to understand code then odds are it'll be fast enough, and if it isn't then when you go to speed it up it should be easier to do.
That being said, common sense should apply (!). Should you read a file over and over again or should you cache the results? Probably cache the results. So from a high level architecture point of view you should be thinking of optimization.
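A minimal Java sketch of that read-once-and-cache idea; the file name and the decision to never invalidate the cache are placeholders for whatever policy your application actually needs.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CachedConfig {
    private final Path path;
    private String cached;                    // null until the file has been read once

    CachedConfig(Path path) {
        this.path = path;
    }

    // Reads the file on first use, then serves the cached contents on every later call.
    synchronized String contents() throws IOException {
        if (cached == null) {
            cached = Files.readString(path);
        }
        return cached;
    }

    public static void main(String[] args) throws IOException {
        CachedConfig config = new CachedConfig(Path.of("settings.conf"));  // hypothetical file
        System.out.println(config.contents());   // hits the disk
        System.out.println(config.contents());   // served from memory
    }
}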
The "evil" part of optimization is the "sins" that are committed in the name of making something faster - those sins generally result in the code being very hard to understand. I am not 100% sure this is one of them.. but look at this question here, this may or may not be an example of optimization (could be the way the person thought to do it), but there are more obvious ways to solve the problem than what was chosen.
Another thing you can do, which I recently did do, is when you are writing the code and you need to decide how to do something write it both ways and run it through a profiler. Then pick the clearest way to code it unless there is a large difference in speed/memory (depending on what you are after). That way you are not guessing at what is "better" and you can document why you did it that way so that someone doesn't change it later.
The case I was looking at was memory-mapped files vs. stream I/O. The memory-mapped file was significantly faster than the other way, so I wasn't concerned about whether the code was harder to follow (it wasn't), because the speed-up was significant.
Another case I had was deciding to "intern" String in Java or not. Doing so should save space, but at a cost of time. In my case the space savings wasn't huge, and the time was double, so I didn't do the interning. Documenting it lets someone else know not to bother interning it (or if they want to see if a newer version of Java makes it faster then they can try).
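To make the memory-mapped vs. stream comparison above concrete, here is a Java sketch of both ways of summing the bytes of a file. The file name is hypothetical, and which approach wins depends on file size, access pattern, and platform, so measure both as described rather than trusting this example.

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ReadComparison {
    // Plain buffered stream I/O.
    static long sumViaStream(Path file) throws IOException {
        long sum = 0;
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            int b;
            while ((b = in.read()) != -1) sum += b;
        }
        return sum;
    }

    // Memory-mapped I/O: the file is mapped into the address space and read like an array.
    // (A single mapping is limited to 2 GB, so larger files need to be mapped in chunks.)
    static long sumViaMappedFile(Path file) throws IOException {
        long sum = 0;
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            while (buffer.hasRemaining()) sum += buffer.get() & 0xFF;
        }
        return sum;
    }

    public static void main(String[] args) throws IOException {
        Path file = Path.of("large-input.bin");   // hypothetical input; time both with a profiler
        System.out.println(sumViaStream(file));
        System.out.println(sumViaMappedFile(file));
    }
}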
In addition to being clear and straightforward, you also have to take a reasonable amount of time to implement the code correctly. If it takes you a day to get the code to work right, instead of the two hours it would have taken if you'd just written it, then you've quite possibly wasted time you could have spent on fixing the real performance problem (Knuth's 3%).
Agree with Neil's opinion here: doing performance optimizations in code right away is a bad development practice.
IMHO, performance optimization is dependent on your system design. If your system has been designed poorly, from the perspective of performance, no amount of code optimization will get you 'good' performance - you may get relatively better performance, but not good performance.
For instance, if one intends to build an application that accesses a database, a well-designed data model that has been de-normalized just enough is likely to yield better performance characteristics than its opposite: a poorly designed data model that has been optimized/tuned to obtain relatively better performance.
Of course, one must not forget requirements in this mix. There are implicit performance requirements that one must consider during design - designing a public facing web site often requires that you reduce server-side trips to ensure a 'high-performance' feel to the end user. That doesn't mean that you rebuild the DOM on the browser on every action and repaint the same (I've seen this in reality), but that you rebuild a portion of the DOM and let the browser do the rest (which would have been handled by a sensible designer who understood the implicit requirements).
Picking appropriate data structures. I'm not even sure it counts as optimizing but it can affect the structure of your app (thus good to do early on) and greatly increase performance.
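A small Java example of the kind of structural choice meant here: draining work items from the front of an ArrayList versus an ArrayDeque. The element count is arbitrary.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class QueueChoice {
    public static void main(String[] args) {
        int n = 200_000;    // arbitrary size

        // ArrayList: remove(0) shifts every remaining element, so draining the list is O(N^2) overall.
        List<Integer> list = new ArrayList<>();
        for (int i = 0; i < n; i++) list.add(i);
        long t0 = System.nanoTime();
        while (!list.isEmpty()) list.remove(0);
        System.out.printf("ArrayList drain:  %d ms%n", (System.nanoTime() - t0) / 1_000_000);

        // ArrayDeque: pollFirst() is O(1), so draining the deque is O(N) overall.
        Deque<Integer> deque = new ArrayDeque<>();
        for (int i = 0; i < n; i++) deque.add(i);
        long t1 = System.nanoTime();
        while (!deque.isEmpty()) deque.pollFirst();
        System.out.printf("ArrayDeque drain: %d ms%n", (System.nanoTime() - t1) / 1_000_000);
    }
}

The two collections are nearly interchangeable at the call site, which is why this kind of choice is cheap to make early on.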
Don't call Collection.ElementCount directly in the loop check expression if you know for sure this value will be calculated on each pass.
Instead of:
for (int i = 0; i < myArray.Count; ++i)
{
// Do something
}
Do:
int elementCount = myArray.Count;
for (int i = 0; i < elementCount; ++i)
{
// Do something
}
A classical case.
Of course, you have to know what kind of collection it is (or rather, how the Count property/method is implemented). It may not necessarily be costly.

How to robustly, but minimally, distribute items across a peer-to-peer system

If one has a peer-to-peer system that can be queried, one would like to
reduce the total number of queries across the network (by distributing "popular" items widely and "similar" items together)
avoid excess storage at each node
assure good availability to even moderately rare items in the face of client downtime, hardware failure, and users leaving (possibly detecting rare items for archivists/historians)
avoid queries failing to find matches in the event of network partitions
Given these requirements:
Are there any standard approaches? If not, is there any respected, but experimental, research? I'm somewhat familiar with distribution schemes, but I haven't seen anything that really addresses learning for robustness.
Am I missing any obvious criteria?
Is anybody interested in working on/solving this problem? (If so, I'm happy to open-source part of a very lame simulator I threw together this weekend, and generally offer unhelpful advice).
#cdv: I've now watched the video and it is very good, and although I don't feel it quite gets to a pluggable distribution strategy, it's definitely 90% of the way there. The questions, however, highlight useful differences with this approach that address some of my further concerns, and give me some references to follow up on. Thus, I'm provisionally accepting your answer, although I consider the question open.
There are multiple systems out there with various aspects of what you seek and each making different compromises, including but not limited to:
Amazon's Dynamo: http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
Kai: http://www.slideshare.net/takemaru/kai-an-open-source-implementation-of-amazons-dynamo-472179
Hadoop: http://hadoop.apache.org/core/docs/current/hdfs_design.html
Chord: http://pdos.csail.mit.edu/chord/
Beehive: http://www.cs.cornell.edu/People/egs/beehive/
and many others. After building a custom system along those lines, I let some of the building blocks out in open source form as well: http://code.google.com/p/distributerl/
(that's not a whole system, but a few libraries useful in building one)
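Most of the systems above place items using some form of consistent hashing (Chord and Dynamo in particular). For orientation, here is a bare-bones Java sketch of that idea, without the replication and failure handling those papers add; the peer names and virtual-node count are arbitrary.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashRing {
    private final SortedMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodesPerPeer;   // more virtual nodes -> smoother load distribution

    ConsistentHashRing(int virtualNodesPerPeer) {
        this.virtualNodesPerPeer = virtualNodesPerPeer;
    }

    void addPeer(String peer) {
        for (int i = 0; i < virtualNodesPerPeer; i++) ring.put(hash(peer + "#" + i), peer);
    }

    void removePeer(String peer) {
        for (int i = 0; i < virtualNodesPerPeer; i++) ring.remove(hash(peer + "#" + i));
    }

    // An item is owned by the first peer clockwise from its hash position on the ring.
    String peerFor(String itemKey) {
        SortedMap<Long, String> tail = ring.tailMap(hash(itemKey));
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private static long hash(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (digest[i] & 0xFF);  // fold the first 8 bytes into a long
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing(64);
        for (String peer : new String[]{"peer-a", "peer-b", "peer-c"}) ring.addPeer(peer);
        System.out.println(ring.peerFor("some-item"));
        ring.removePeer("peer-b");                      // only items owned by peer-b move elsewhere
        System.out.println(ring.peerFor("some-item"));
    }
}

Because only the keys a departed peer owned need to move to its successor, churn is proportional to that peer's share of the data rather than to the whole data set, which is what keeps redistribution minimal.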