segmented reduction with scattered segments - cuda

I got to solve a pretty standard problem on the GPU, but I'm quite new to practical GPGPU, so I'm looking for ideas to approach this problem.
I have many points in 3-space which are assigned to a very small number of groups (each point belongs to one group), specifically 15 in this case (doesn't ever change). Now I want to compute the mean and covariance matrix of all the groups. So on the CPU it's roughly the same as:
for each point p
{
mean[p.group] += p.pos;
covariance[p.group] += p.pos * p.pos;
++count[p.group];
}
for each group g
{
mean[g] /= count[g];
covariance[g] = covariance[g]/count[g] - mean[g]*mean[g];
}
Since the number of groups is extremely small, the last step can be done on the CPU (I need those values on the CPU, anyway). The first step is actually just a segmented reduction, but with the segments scattered around.
So the first idea I came up with, was to first sort the points by their groups. I thought about a simple bucket sort using atomic_inc to compute bucket sizes and per-point relocation indices (got a better idea for sorting?, atomics may not be the best idea). After that they're sorted by groups and I could possibly come up with an adaption of the segmented scan algorithms presented here.
But in this special case, I got a very large amount of data per point (9-10 floats, maybe even doubles if the need arises), so the standard algorithms using a shared memory element per thread and a thread per point might make problems regarding per-multiprocessor resources as shared memory or registers (Ok, much more on compute capability 1.x than 2.x, but still).
Due to the very small and constant number of groups I thought there might be better approaches. Maybe there are already existing ideas suited for these specific properties of such a standard problem. Or maybe my general approach isn't that bad and you got ideas for improving the individual steps, like a good sorting algorithm suited for a very small number of keys or some segmented reduction algorithm minimizing shared memory/register usage.
I'm looking for general approaches and don't want to use external libraries. FWIW I'm using OpenCL, but it shouldn't really matter as the general concepts of GPU computing don't really differ over the major frameworks.

Even though there are few groups, I don't think you will be able to avoid the initial sorting into groups while still keeping the reduction step efficient. You will probably also want to perform the full sort, not just sorting indexes, because that will help keep memory access efficient in the reduction step.
For sorting, read about general strategies here:
http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter46.html
For reduction (old but still good):
http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf
For an example implementation of parallel reduction:
http://developer.nvidia.com/cuda-cc-sdk-code-samples#reduction

Related

speed up ideas -- can CUDA help here?

I'm working on an algorithm that has to do a small number
of operations on a large numbers of small arrays, somewhat independently.
To give an idea:
1k sorting of arrays of length typically of 0.5k-1k elements.
1k of LU-solve of matrices that have rank 10-20.
everything is in floats.
Then, there is some horizontality to this problem: the above
operations have to be carried independently on 10k arrays.
Also, the intermediate results need not be stored: for example, i don't
need to keep the sorted arrays, only the sum of the smallest $m$ elements.
The whole thing has been programmed in c++ and runs. My question is:
would you expect a problem like this to enjoy significant speed ups
(factor 2 or more) with CUDA?
You can run this in 5 lines of ArrayFire code. I'm getting speedups of ~6X with this over the CPU. I'm getting speedups of ~4X with this over Thrust (which was designed for vectors, not matrices). Since you're only using a single GPU, you can run ArrayFire Free version.
array x = randu(512,1000,f32);
array y = sort(x); // sort each 512-element column independently
array x = randu(15,15,1000,f32), y;
gfor (array i, x.dim(2))
y(span,span,i) = lu(x(span,span,i)); // LU-decomposition of each 15x15 matrix
Keep in mind that GPUs perform best when memory accesses are aligned to multiples of 32, so a bunch of 32x32 matrices will perform better than a bunch of 31x31.
If you "only" need a factor of 2 speed up I would suggest looking at more straightforward optimisation possibilities first, before considering GPGPU/CUDA. E.g. assuming x86 take a look at using SSE for a potential 4x speed up by re-writing performance critical parts of your code to use 4 way floating point SIMD. Although this would tie you to x86 it would be more portable in that it would not require the presence of an nVidia GPU.
Having said that, there may even be simpler optimisation opportunities in your code base, such as eliminating redundant operations (useless copies and initialisations are a favourite) or making your memory access pattern more cache-friendly. Try profiling your code with a decent profiler to see where the bottlenecks are.
Note however that in general sorting is not a particularly good fit for either SIMD or CUDA, but other operations such as LU decomposition may well benefit.
Just a few pointers, you maybe already incorporated:
1) If you just need the m smallest elements, you are probably better of to just search the smallest element, remove it and repeat m - times.
2) Did you already parallelize the code on the cpu? OpenMP or so ...
3) Did you think about buying better hardware? (I know it´s not the nice think to do, but if you want to reach performance goals for a specific application it´s sometimes the cheapest possibility ...)
If you want to do it on CUDA, it should work conceptually, so no big problems should occur. However, there are always the little things, which depend on experience and so on.
Consider the thrust-library for the sorting thing, hopefully someone else can suggest some good LU-decomposition algorithm.

Does using binary numbers in code improves performance?

I've seen quite a few examples where binary numbers are being used in code, like 32,64,128 and so on (for instance, very well known example - minecraft)
I want to ask, does using binary numbers in such high level languages as Java / C++ help anything?
I know assembly and that you would always rather use these because in low level language it overcomplicates things if you go above register limit.
Will programs run any faster/save up more memory if you use binary numbers?
As with most things, "it depends".
In compiled languages, the better compilers will deduce that slow machine instructions can sometimes be done with different faster machine instructions (but only for special values, such as powers of two). Sometimes coders know this and program accordingly. (e.g. multiplying by a power of two is cheap)
Other times, algorithms are suited towards representations involving powers of two (e.g. many divide and conquer algorithms like the Fast Fourier Transform or a merge sort).
Yet other times, it's the most compact way to represent boolean values (like a bitmask).
And on top of that, other times it's more efficiency for memory purposes (typically because it's so fast do to multiply and divide logic with powers of two, the OS/hardware/etc will use cache line / page sizes / etc that are powers of two, so you'd do well to have nice power of two sizes for your important data structures).
And then, on top of that, other times.. programmers are just so used to using powers of two that they simply do it because it seems like a nice number.
There are some benefits of using powers of two numbers in your programs. Bitmasks are one application of this, mainly because bitwise operators (&, |, <<, >>, etc) are incredibly fast.
In C++ and Java, this is done a fair bit- especially with GUI applications. You could have a field of 32 different menu options (such as resizable, removable, editable, etc), and apply each one without having to go through convoluted addition of values.
In terms of raw speedup or any performance improvement, that really depends on the application itself. GUI packages can be huge, so getting any speedup out of those when applying menu/interface options is a big win.
From the title of your question, it sounds like you mean, "Does it make your program more efficient if you write constants in binary?" If that's what you meant, the answer is emphatically, No. The compiler translates all your constants to binary at compile time, so by the time the program runs, it makes no difference. I don't know if the compiler can interpret binary constants faster than decimal, but the difference would surely be trivial.
But the body of your question seems to indicate that you mean, "use constants that are round number in binary" rather than necessarily expressing them in binary digits.
For most purposes, the answer would be no. If, say, the computer has to add two numbers together, adding a number that happens to be a round number in binary is not going to be any faster than adding a not-round number.
It might be slightly faster for multiplication. Some compilers are smart enough to turn multiplication by powers of 2 into a bit shift operation rather than a hardware multiply, and bit shifts are usually faster than multiplies.
Back in my assembly-language days I often made elements in arrays have sizes that were powers of 2 so I could index into the array with a bit-shift rather than a multiply. But in a high-level language that would be hard to do, as you'd have to do some research to find out just how much space your primitives take in memory, whether the compiler adds padding bytes between them, etc etc. And if you did add some bytes to an array element to pad it out to a power of 2, the entire array is now bigger, and so you might generate an extra page fault, i.e. the operating system runs out of memory and has to write a chunck of your data to the hard drive and then read it back when it needs it. One extra hard drive right takes more time than 1000 multiplications.
In practice, (a) the difference is so trivial that it would almost never be worth worrying about; and (b) you don't normally know everything happenning at the low level, so it would often be hard to predict whether a change with its intendent ramifications would help or hurt.
In short: Don't bother. Use the constant values that are natural to the problem.
The reason they're used is probably different - e.g. bitmasks.
If you see them in array sizes, it doesn't really increase performance, but usually memory is allocated by power of 2. E.g. if you wrote char x[100], you'd probably get 128 allocated bytes.
No, your code will ran the same way, no matter what is the number you use.
If by binary numbers you mean numbers that are power of 2, like: 2, 4, 8, 16, 1024.... they are common due to optimization of space, normally. Example, if you have a 8 bit pointer it is capable of point to 256 (that is a power of 2), addresses, so if you use less than 256 you are wasting your pointer.... so normally you allocate a 256 buffer... this same works for all other power of 2 numbers....
In most cases the answer is almost always no, there is no noticeable performance difference.
However, there are certain cases (very few) when NOT using binary numbers for array/structure sizes/length will give noticeable performance benefits. These are cases when you're filling the cache and because you're looping over a structure that fills the cache in a such a way that you have cache collisions every time you loop through your array/structure. This case is very rare, and shouldn't be preoptimized unless you're having problems with your code performing much more slowly than theoretical limits say it should. Also, this case is very hardware dependent and will change from system to system.

Nvidia Cuda Program - is my prob appropriate for the Cuda architecture?

I've been reading about Nvidia Cuda and I've seen some questions on SO that people have answered where they include the comment that "your problem is not appropriate to be running on a GPU".
At my office, we have a database that has an enormous number of records that we query against, and it can take forever. We've implemented SQL queries that SELECT DISTINCT or they apply an uppercase function against a value. As an introduction to Cuda, I thought about writing a program that could take all the strings and uppercase them on the GPU.
I've been reading a book about Cuda where the author talks about trying to make the GPU cores execute as much as possible in order to hide latency of reading data across the PCI bus or putting things in global memory. Since the memory sizes are pretty small and since I have millions of distinct words, naturally I'm going to saturate the bus and starve the GPU cores.
Is this a problem that would not receive a fantastic performance boost from a graphics card as opposed to the CPU?
Thanks,
mj
We've implemented SQL queries that SELECT DISTINCT or they apply an uppercase function against a value.
Have you considered adding a column in your table with precomputed upper case versions of your strings?
I'm inclined to think that if your database is entirely in RAM and queries still take "forever", your database may not be properly structured and indexed. Examine your query plans.
I think that, in the normal case, where your selects are neatly covered by indexes, you won't be able to optimize with the GPU. But maybe there are things that could be optimized for the GPU, like queries that require table scans such as LIKE queries with wildcards and queries that select rows based on calculations (value less than, etc). Maybe even things like queries with many joins when join columns have many duplicated values.
The key to such an implementation would be to keep a mirror of some the data in your database on the GPU and keep it in sync with the database. And then run operations such as parallel reductions on that data to come up with row IDs to then use for selects against the regular database.
Before taking such a step though, I would explore the countless possibilities for datebase query optimizations that use space-time tradeoffs.
You will have a pretty big bottleneck in global memory access since your operation/transfer ratio is O(1).
What would probably be more worthwhile is doing the comparisons on the GPU, as that has a operation/transfer ratio is much larger.
While you load a string into shared memory to do this, you could also capitalize it, effectively including what you wanted to do before, and a bit more.
I can't help but feel a CPU based implementation would probably give you better performance. It would, at least, give you fewer headaches...

matrix multiplication in cuda

say I want to multiply two matrices together, 50 by 50. I have 2 ways to arrange threads and blocks.
a) one thread to calculate each element of the result matrix. So I have a loop in thread multiplies one row and one column.
b) one thread to do each multiplication. Each element of the result matrix requires 50 threads. After multiplications are done, I can use binary reduction to sum the results.
I wasn't sure which way to take, so I took b. It wasn't ideal. In fact it was slow. Any idea why? My guess would be there are just too many threads and they are waiting for resource most of time, is that true?
As with so many things in high performance computing, the key to understanding performance here is understanding the use of memory.
If you are using one thread do to do one multiplication, then for that thread you have to pull two pieces of data from memory, multiply them, then do some logarthmic number of adds. That's three memory accesses for a mult and an add and a bit - the arithmatic intensity is very low. The good news is that there are many many threads worth of tasks this way, each of which only needs a tiny bit of memory/registers, which is good for occupancy; but the memory access to work ratio is poor.
The simple one thread doing one dot product approach has the same sort of problem - each multiplication requires two memory accesses to load. The good news is that there's only one store to global memory for the whole dot product, and you avoid the binary reduction which doesn't scale as well and requires a lot of synchronization; the down side is there's way less threads now, which at least your (b) approach had working for you.
Now you know that there should be some way of doing more operations per memory access than this; for square NxN matricies, there's N^3 work to do the multiplication, but only 3xN^2 elements - so you should be able to find a way to do way more than 1 computation per 2ish memory accesses.
The approach taken in the CUDA SDK is the best way - the matricies are broken into tiles, and your (b) approach - one thread per output element - is used. But the key is in how the threads are arranged. By pulling in entire little sub-matricies from slow global memory into shared memory, and doing calculations from there, it's possible to do many multiplications and adds on each number you've read in from memory. This approach is the most successful approach in lots of applications, because getting data - whether it's over a network, or from main memory for a CPU, or off-chip access for a GPU - often takes much longer than processing the data.
There's documents in NVidia's CUDA pages (esp http://developer.nvidia.com/object/cuda_training.html ) which describe their SDK example very nicely.
Have you looked at the CUDA documentation: Cuda Programming Model
Also, sample source code: Matrix Multiplication
Did you look at
$SDK/nvidia-gpu-sdk-3.1/C/src/matrixMul
i.e. the matrix multiplication example in the SDK?
If you don't need to implement this yourself, just use a library -- CUBLAS, MAGMA, etc., provide tuned matrix multiplication implementations.

What are the lesser known but useful data structures?

Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
There are some data structures around that are really useful but are unknown to most programmers. Which ones are they?
Everybody knows about linked lists, binary trees, and hashes, but what about Skip lists and Bloom filters for example. I would like to know more data structures that are not so common, but are worth knowing because they rely on great ideas and enrich a programmer's tool box.
PS: I am also interested in techniques like Dancing links which make clever use of properties of a common data structure.
EDIT:
Please try to include links to pages describing the data structures in more detail. Also, try to add a couple of words on why a data structure is cool (as Jonas Kölker already pointed out). Also, try to provide one data-structure per answer. This will allow the better data structures to float to the top based on their votes alone.
Tries, also known as prefix-trees or crit-bit trees, have existed for over 40 years but are still relatively unknown. A very cool use of tries is described in "TRASH - A dynamic LC-trie and hash data structure", which combines a trie with a hash function.
Bloom filter: Bit array of m bits, initially all set to 0.
To add an item you run it through k hash functions that will give you k indices in the array which you then set to 1.
To check if an item is in the set, compute the k indices and check if they are all set to 1.
Of course, this gives some probability of false-positives (according to wikipedia it's about 0.61^(m/n) where n is the number of inserted items). False-negatives are not possible.
Removing an item is impossible, but you can implement counting bloom filter, represented by array of ints and increment/decrement.
Rope: It's a string that allows for cheap prepends, substrings, middle insertions and appends. I've really only had use for it once, but no other structure would have sufficed. Regular strings and arrays prepends were just far too expensive for what we needed to do, and reversing everthing was out of the question.
Skip lists are pretty neat.
Wikipedia
A skip list is a probabilistic data structure, based on multiple parallel, sorted linked lists, with efficiency comparable to a binary search tree (order log n average time for most operations).
They can be used as an alternative to balanced trees (using probalistic balancing rather than strict enforcement of balancing). They are easy to implement and faster than say, a red-black tree. I think they should be in every good programmers toolchest.
If you want to get an in-depth introduction to skip-lists here is a link to a video of MIT's Introduction to Algorithms lecture on them.
Also, here is a Java applet demonstrating Skip Lists visually.
Spatial Indices, in particular R-trees and KD-trees, store spatial data efficiently. They are good for geographical map coordinate data and VLSI place and route algorithms, and sometimes for nearest-neighbor search.
Bit Arrays store individual bits compactly and allow fast bit operations.
Zippers - derivatives of data structures that modify the structure to have a natural notion of 'cursor' -- current location. These are really useful as they guarantee indicies cannot be out of bound -- used, e.g. in the xmonad window manager to track which window has focused.
Amazingly, you can derive them by applying techniques from calculus to the type of the original data structure!
Here are a few:
Suffix tries. Useful for almost all kinds of string searching (http://en.wikipedia.org/wiki/Suffix_trie#Functionality). See also suffix arrays; they're not quite as fast as suffix trees, but a whole lot smaller.
Splay trees (as mentioned above). The reason they are cool is threefold:
They are small: you only need the left and right pointers like you do in any binary tree (no node-color or size information needs to be stored)
They are (comparatively) very easy to implement
They offer optimal amortized complexity for a whole host of "measurement criteria" (log n lookup time being the one everybody knows). See http://en.wikipedia.org/wiki/Splay_tree#Performance_theorems
Heap-ordered search trees: you store a bunch of (key, prio) pairs in a tree, such that it's a search tree with respect to the keys, and heap-ordered with respect to the priorities. One can show that such a tree has a unique shape (and it's not always fully packed up-and-to-the-left). With random priorities, it gives you expected O(log n) search time, IIRC.
A niche one is adjacency lists for undirected planar graphs with O(1) neighbour queries. This is not so much a data structure as a particular way to organize an existing data structure. Here's how you do it: every planar graph has a node with degree at most 6. Pick such a node, put its neighbors in its neighbor list, remove it from the graph, and recurse until the graph is empty. When given a pair (u, v), look for u in v's neighbor list and for v in u's neighbor list. Both have size at most 6, so this is O(1).
By the above algorithm, if u and v are neighbors, you won't have both u in v's list and v in u's list. If you need this, just add each node's missing neighbors to that node's neighbor list, but store how much of the neighbor list you need to look through for fast lookup.
I think lock-free alternatives to standard data structures i.e lock-free queue, stack and list are much overlooked.
They are increasingly relevant as concurrency becomes a higher priority and are much more admirable goal than using Mutexes or locks to handle concurrent read/writes.
Here's some links
http://www.cl.cam.ac.uk/research/srg/netos/lock-free/
http://www.research.ibm.com/people/m/michael/podc-1996.pdf [Links to PDF]
http://www.boyet.com/Articles/LockfreeStack.html
Mike Acton's (often provocative) blog has some excellent articles on lock-free design and approaches
I think Disjoint Set is pretty nifty for cases when you need to divide a bunch of items into distinct sets and query membership. Good implementation of the Union and Find operations result in amortized costs that are effectively constant (inverse of Ackermnan's Function, if I recall my data structures class correctly).
Fibonacci heaps
They're used in some of the fastest known algorithms (asymptotically) for a lot of graph-related problems, such as the Shortest Path problem. Dijkstra's algorithm runs in O(E log V) time with standard binary heaps; using Fibonacci heaps improves that to O(E + V log V), which is a huge speedup for dense graphs. Unfortunately, though, they have a high constant factor, often making them impractical in practice.
Anyone with experience in 3D rendering should be familiar with BSP trees. Generally, it's the method by structuring a 3D scene to be manageable for rendering knowing the camera coordinates and bearing.
Binary space partitioning (BSP) is a
method for recursively subdividing a
space into convex sets by hyperplanes.
This subdivision gives rise to a
representation of the scene by means
of a tree data structure known as a
BSP tree.
In other words, it is a method of
breaking up intricately shaped
polygons into convex sets, or smaller
polygons consisting entirely of
non-reflex angles (angles smaller than
180°). For a more general description
of space partitioning, see space
partitioning.
Originally, this approach was proposed
in 3D computer graphics to increase
the rendering efficiency. Some other
applications include performing
geometrical operations with shapes
(constructive solid geometry) in CAD,
collision detection in robotics and 3D
computer games, and other computer
applications that involve handling of
complex spatial scenes.
Huffman trees - used for compression.
Have a look at Finger Trees, especially if you're a fan of the previously mentioned purely functional data structures. They're a functional representation of persistent sequences supporting access to the ends in amortized constant time, and concatenation and splitting in time logarithmic in the size of the smaller piece.
As per the original article:
Our functional 2-3 finger trees are an instance of a general design technique in- troduced by Okasaki (1998), called implicit recursive slowdown. We have already noted that these trees are an extension of his implicit deque structure, replacing pairs with 2-3 nodes to provide the flexibility required for efficient concatenation and splitting.
A Finger Tree can be parameterized with a monoid, and using different monoids will result in different behaviors for the tree. This lets Finger Trees simulate other data structures.
Circular or ring buffer - used for streaming, among other things.
I'm surprised no one has mentioned Merkle trees (ie. Hash Trees).
Used in many cases (P2P programs, digital signatures) where you want to verify the hash of a whole file when you only have part of the file available to you.
<zvrba> Van Emde-Boas trees
I think it'd be useful to know why they're cool. In general, the question "why" is the most important to ask ;)
My answer is that they give you O(log log n) dictionaries with {1..n} keys, independent of how many of the keys are in use. Just like repeated halving gives you O(log n), repeated sqrting gives you O(log log n), which is what happens in the vEB tree.
How about splay trees?
Also, Chris Okasaki's purely functional data structures come to mind.
An interesting variant of the hash table is called Cuckoo Hashing. It uses multiple hash functions instead of just 1 in order to deal with hash collisions. Collisions are resolved by removing the old object from the location specified by the primary hash, and moving it to a location specified by an alternate hash function. Cuckoo Hashing allows for more efficient use of memory space because you can increase your load factor up to 91% with only 3 hash functions and still have good access time.
A min-max heap is a variation of a heap that implements a double-ended priority queue. It achieves this by by a simple change to the heap property: A tree is said to be min-max ordered if every element on even (odd) levels are less (greater) than all childrens and grand children. The levels are numbered starting from 1.
http://internet512.chonbuk.ac.kr/datastructure/heap/img/heap8.jpg
I like Cache Oblivious datastructures. The basic idea is to lay out a tree in recursively smaller blocks so that caches of many different sizes will take advantage of blocks that convenient fit in them. This leads to efficient use of caching at everything from L1 cache in RAM to big chunks of data read off of the disk without needing to know the specifics of the sizes of any of those caching layers.
Left Leaning Red-Black Trees. A significantly simplified implementation of red-black trees by Robert Sedgewick published in 2008 (~half the lines of code to implement). If you've ever had trouble wrapping your head around the implementation of a Red-Black tree, read about this variant.
Very similar (if not identical) to Andersson Trees.
Work Stealing Queue
Lock-free data structure for dividing the work equaly among multiple threads
Implementation of a work stealing queue in C/C++?
Bootstrapped skew-binomial heaps by Gerth Stølting Brodal and Chris Okasaki:
Despite their long name, they provide asymptotically optimal heap operations, even in a function setting.
O(1) size, union, insert, minimum
O(log n) deleteMin
Note that union takes O(1) rather than O(log n) time unlike the more well-known heaps that are commonly covered in data structure textbooks, such as leftist heaps. And unlike Fibonacci heaps, those asymptotics are worst-case, rather than amortized, even if used persistently!
There are multiple implementations in Haskell.
They were jointly derived by Brodal and Okasaki, after Brodal came up with an imperative heap with the same asymptotics.
Kd-Trees, spatial data structure used (amongst others) in Real-Time Raytracing, has the downside that triangles that cross intersect the different spaces need to be clipped. Generally BVH's are faster because they are more lightweight.
MX-CIF Quadtrees, store bounding boxes instead of arbitrary point sets by combining a regular quadtree with a binary tree on the edges of the quads.
HAMT, hierarchical hash map with access times that generally exceed O(1) hash-maps due to the constants involved.
Inverted Index, quite well known in the search-engine circles, because it's used for fast retrieval of documents associated with different search-terms.
Most, if not all, of these are documented on the NIST Dictionary of Algorithms and Data Structures
Ball Trees. Just because they make people giggle.
A ball tree is a data structure that indexes points in a metric space. Here's an article on building them. They are often used for finding nearest neighbors to a point or accelerating k-means.
Not really a data structure; more of a way to optimize dynamically allocated arrays, but the gap buffers used in Emacs are kind of cool.
Fenwick Tree. It's a data structure to keep count of the sum of all elements in a vector, between two given subindexes i and j. The trivial solution, precalculating the sum since the begining doesn't allow to update a item (you have to do O(n) work to keep up).
Fenwick Trees allow you to update and query in O(log n), and how it works is really cool and simple. It's really well explained in Fenwick's original paper, freely available here:
http://www.cs.ubc.ca/local/reading/proceedings/spe91-95/spe/vol24/issue3/spe884.pdf
Its father, the RQM tree is also very cool: It allows you to keep info about the minimum element between two indexes of the vector, and it also works in O(log n) update and query. I like to teach first the RQM and then the Fenwick Tree.
Van Emde-Boas trees. I have even a C++ implementation of it, for up to 2^20 integers.
Nested sets are nice for representing trees in the relational databases and running queries on them. For instance, ActiveRecord (Ruby on Rails' default ORM) comes with a very simple nested set plugin, which makes working with trees trivial.
It's pretty domain-specific, but half-edge data structure is pretty neat. It provides a way to iterate over polygon meshes (faces and edges) which is very useful in computer graphics and computational geometry.