How to represent a binomial tree in memory - language-agnostic

I've got a structure that is described as a "binomial tree". Let's see a drawing:
What is the best way to represent this in memory? Just to clarify, it is not a simple binary tree, since node N4 is both the left child of N1 and the right child of N2; the same sharing happens for N7 and N8, and so on. I need a construction algorithm that easily avoids duplicating such nodes, referencing them instead.
UPDATE
Many of you do not agree with this "binomial tree" definition, but it comes from finance (especially derivative pricing); have a look here for an example: http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter45.html. So I used the domain-accepted definition.

You could generate the structure level by level. In each iteration, create one level of nodes, put them in an array, and connect the previous level to them. Something like this (C#):
Node GenerateStructure(int levels)
{
    Node root = null;
    Node[] previous = null;
    for (int level = 1; level <= levels; level++)
    {
        // Level k contains exactly k nodes.
        int count = level;
        var current = new Node[count];
        for (int i = 0; i < count; i++)
            current[i] = new Node();
        if (level == 1)
            root = current[0];
        // Connect the previous level: previous[i].Right is the same node as
        // previous[i+1].Left, so shared nodes are referenced, not duplicated.
        for (int i = 0; i < count - 1; i++)
        {
            previous[i].Left = current[i];
            previous[i].Right = current[i + 1];
        }
        previous = current;
    }
    return root;
}
The whole structure requires O(N^2) memory, where N is the number of levels. This approach needs only O(N) additional memory for the two arrays. Another approach would be to generate the graph from left to right, but that would require O(N) additional memory too.
The time complexity is obviously O(N^2).

Rather than a tree, which I would define as a connected graph with N vertices and N-1 edges, that structure looks like a Pascal triangle (or Tartaglia's triangle, as it is taught in Italy). As such, an array with suitable indexing suffices.
The details of the construction depend on your input data: please give some more hints.
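To illustrate, here is a minimal sketch (C++; names and values are hypothetical) of that flat-array layout: level L (0-based) starts at offset L(L+1)/2, and because child positions are computed rather than stored, shared nodes are shared automatically and no pointers are needed at all:

#include <cstddef>
#include <vector>

// Flat-array layout of a recombining triangle: level L (0-based) holds
// L+1 nodes and starts at offset L*(L+1)/2.
std::size_t nodeIndex(std::size_t level, std::size_t i)
{
    return level * (level + 1) / 2 + i;
}

// The children of node (level, i) are (level+1, i) and (level+1, i+1);
// node (level+1, i+1) is also the left child of node (level, i+1).
int main()
{
    const std::size_t levels = 4;
    std::vector<double> tree(levels * (levels + 1) / 2);
    tree[nodeIndex(1, 0)] = 42.0; // first node of the second level, for example
}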

Related

How to optimize finding values in 2D array in verilog

I need to set up a function that determines if a match exists in a 2D array (index).
My current implementation works, but is creating a large chain of LUTs due to if statements checking each element of the array.
function result_type index_search ( index_type index, logic[7:0] address );
    result_type result;
    for ( int i = 0; i < 8; i++ ) begin
        if ( index[i] == address ) begin
            result = i;
        end
    end
    return result;
endfunction
Is there a way to check for matches in a more efficient manner?
Not much to be done, really, at least for the code at hand. Since your code targets hardware, think in terms of hardware to optimize it, not in terms of the function/Verilog code.
For a general-purpose implementation, without any known data patterns, you'll definitely need (a) N W-bit equality checks, plus (b) an N:1 FPA (Fixed Priority Arbiter, aka priority encoder, aka leading-zero detector) that returns the first match, assuming N W-bit inputs.
Not much optimization to be done, but here are some possible general-purpose optimizations:
Pipelining, if timing is an issue.
Consider an alternative FPA implementation that makes use of 2's-complement characteristics and may result in a more LUT-efficient implementation: assign fpa_out = fpa_in & ((~fpa_in)+1); (the result is one-hot encoded, not weighted binary as in your code)
Sticking to one-hot encoding can come in handy and reduce some of the logic further down your path, but I cannot say for sure until we see some more code.
This is what the implementation would look like:
logic [N-1:0] addr_eq_idx;
logic [N-1:0] result;

for (genvar i = 0; i < N; i++) begin: g_eq_N
    // multiple matches may exist in this vector
    assign addr_eq_idx[i] = (address == index[i]) ? 1'b1 : 1'b0;

    // pipelined version:
    // always_ff @(posedge clk, negedge arstn)
    //     if (!arstn)
    //         addr_eq_idx[i] <= 1'b0;
    //     else
    //         addr_eq_idx[i] <= (address == index[i]) ? 1'b1 : 1'b0;
end

// result has a '1' at the position where the first match is found
assign result = addr_eq_idx & ((~addr_eq_idx) + 1);
Finally, consider whether your design can be simplified thanks to known run-time data characteristics. For example, let's say you are 100% sure that the address you're looking for can exist within the index 2D array in at most one position. If that is the case, then you do not need an FPA at all, since the first match will be the only match; addr_eq_idx already points to the matching index, as a one-hot vector.

Make a good hash algorithm for a Scrabble Cheater

I want to build a Scrabble Cheater which stores strings in an array of linked lists. In a perfect scenario, each linked list would only contain words that are permutations of the same letters (like POOL and LOOP, for example). The user would put in a string like OLOP, and the matching linked list would be printed out.
I want the task to be explicitly solved using hashing.
I have built a stringHashFunction() for that (Java code):
public int stringHashFunction(String wordToHash) {
    int hashKeyValue = 7;
    // toLowerCase and sort letters in alphabetical order
    wordToHash = normalize(wordToHash);
    for (int i = 0; i < wordToHash.length(); i++) {
        int charCode = wordToHash.charAt(i) - 96;
        // calculate the hash key using the 26 letters
        hashKeyValue = (hashKeyValue * 26 + charCode) % hashTable.length;
    }
    return hashKeyValue;
}
Does it look like an OK hash function? I realize it's far from a perfect hash, but how could I improve it?
My code overall works but I have the following statistics for now:
Number of buckets: 24043
All items: 24043
The biggest bucket holds 11 items.
There are 10264 empty buckets.
On average there are 1.7449016619493432 items per bucket.
Is it possible to avoid the collisions so that I only have buckets (linked lists) whose words are permutations of each other? I think that would be useful with a whole dictionary in there, so that you don't have to run an isPermutation() method on each bucket every time you want to get the possible permutations of your string.
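For illustration, here is a minimal sketch of that collision-free bucketing (in C++ for consistency with the other examples; the same idea maps directly to Java's HashMap): use the normalized, sorted word itself as the map key, so two words land in the same bucket exactly when they are permutations of each other, and no per-bucket isPermutation() check is needed:

#include <algorithm>
#include <cctype>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Lowercase and sort the letters: all anagrams share the same key.
std::string anagramKey(std::string word)
{
    std::transform(word.begin(), word.end(), word.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    std::sort(word.begin(), word.end());
    return word;
}

int main()
{
    std::unordered_map<std::string, std::vector<std::string>> buckets;
    for (const std::string& w : { "POOL", "LOOP", "POLO", "CAT" })
        buckets[anagramKey(w)].push_back(w);
    for (const std::string& w : buckets[anagramKey("OLOP")])
        std::cout << w << '\n'; // prints POOL, LOOP, POLO
}

The table's internal hashing of that key can still collide, but equal keys guarantee equal letter multisets, so each bucket of the map holds exactly one anagram class.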

How slow is comparison and branching on GPU

I have read that comparisons and branching are slow on a GPU. I would like to know how slow. (I'm familiar with OpenCL, but the question is general, also for CUDA, AMP, ...)
I would like to know this before I start to port my code to the GPU. In particular, I'm interested in finding the lowest value in the neighborhood (4 or 9 nearest neighbors) of each point in a 2D array, i.e. something like a convolution, but instead of summing and multiplying I need comparisons and branching.
For example, code like this (NOTE: this example code is not yet optimized for the GPU, to be more readable ... so partitioning into workgroups, prefetching into local memory ... is missing):
for (int i = 1; i < n-1; i++) {
    for (int j = 1; j < n-1; j++) { // iterate over 2D array
        float hij = h[i][j];
        int imin = 0, jmin = 0;
        float dh, dhmin = 0;
        // find lowest neighboring element h[i+imin][j+jmin] of h[i][j]
        dh = h[i-1][j  ] - hij; if (dh < dhmin) { imin = -1; jmin =  0; dhmin = dh; }
        dh = h[i+1][j  ] - hij; if (dh < dhmin) { imin = +1; jmin =  0; dhmin = dh; }
        dh = h[i  ][j-1] - hij; if (dh < dhmin) { imin =  0; jmin = -1; dhmin = dh; }
        dh = h[i  ][j+1] - hij; if (dh < dhmin) { imin =  0; jmin = +1; dhmin = dh; }
        if (dhmin < -0.00001) { // if lower
            // ... Do something with hij, dhmin and save to h[i+imin][j+jmin] ...
        }
    }
}
Would it be worth porting to the GPU despite all the if branching and comparisons? (i.e. if these 4-5 comparisons per element were 10x slower than the same 4-5 comparisons on the CPU, they would be a bottleneck)
Is there any optimization trick to minimize the slow-down from if branching and comparison?
I used this in the following hydraulic erosion code:
http://www.openprocessing.org/sketch/146982
Branching itself is not slow. Divergence is what gets you. GPUs compute multiple work items (typically 16 or 32) in lock-step in "warps" or "wavefronts"; if different work items take different paths, all of them execute all paths, but writes are gated based on which path each item is on (using predicate flags). So if your work items always (or mostly) branch the same way, you're good. If they don't, the penalty can rob performance.
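To make that concrete, here is a sketch of the neighbor-min loop above written with selects instead of divergent branches (plain C++ standing in for kernel code; the function name and the flattened row-major indexing are illustrative). Ternary assignments like these typically compile to predicated moves, so every work item in a warp follows the same instruction path:

// Branchless rewrite of the neighbor-min search.
void lowestNeighbor(const float* h, int n, int i, int j,
                    int& imin, int& jmin, float& dhmin)
{
    const float hij = h[i * n + j]; // 2D array flattened row-major
    imin = 0; jmin = 0; dhmin = 0.0f;

    const int di[4] = { -1, +1,  0,  0 };
    const int dj[4] = {  0,  0, -1, +1 };
    for (int k = 0; k < 4; ++k) {
        const float dh = h[(i + di[k]) * n + (j + dj[k])] - hij;
        const bool lower = (dh < dhmin); // the comparison itself is cheap
        imin  = lower ? di[k] : imin;    // select, not a divergent branch
        jmin  = lower ? dj[k] : jmin;
        dhmin = lower ? dh    : dhmin;
    }
}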
If you need to do comparisons and the array length n is really big, then you can use a reduction instead of sequential comparison. A reduction does the comparisons in parallel in O(log n) time, as opposed to O(n) when done sequentially.
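As a sketch of that idea (host-side C++ standing in for what a work group does in parallel; assumes a power-of-two length for brevity):

#include <algorithm>
#include <vector>

// Tree-style min-reduction: each round halves the active range, so the
// parallel depth is O(log n). On a GPU, the inner loop over k is what the
// work items of a group would execute simultaneously in one round.
float minReduce(std::vector<float> v) // assumes v.size() is a power of two
{
    for (std::size_t stride = v.size() / 2; stride > 0; stride /= 2)
        for (std::size_t k = 0; k < stride; ++k)
            v[k] = std::min(v[k], v[k + stride]);
    return v[0];
}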
Also think about how you access memory: it's good to use coalesced reads, where neighboring work items read neighboring addresses, rather than having each thread walk through memory sequentially on its own. You can find a plethora of examples on this.
On GPUs, don't access the same global memory elements multiple times (GPU memory management and caching do not work exactly like a CPU's). Instead, cache the global memory elements in the thread's private variables / shared memory as much as possible.

thrust equivalent of cilk::reducer_list_append

I have a list of n intervals or domains. I would like to subdivide each interval in parallel into k parts, making a new (unordered) list. However, most of the subdivisions won't pass certain criteria and shouldn't be added to the new list.
cilk::reducer_list_append extends the idea of parallel reduction to forming a list with push_back. This way I can collect only the valid sub-intervals, in parallel.
What is the Thrust way of accomplishing this task? I suspect one way would be to form a large n*k list, then use a parallel filter and stream compaction? But I really hope there is a reduction list-append operation, because n*k can be very large indeed.
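For what it's worth, here is a minimal sketch of the generate-then-compact route the question suspects, using thrust::transform plus thrust::copy_if; the Sub type, the subdivision rule, and the validity criterion are hypothetical placeholders:

#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/transform.h>

// Hypothetical sub-interval type, purely for illustration.
struct Sub { float lo, hi; };

struct MakeSub {
    int k;
    __host__ __device__ Sub operator()(int idx) const {
        int interval = idx / k;                 // which parent interval
        int part     = idx % k;                 // which slice of it
        float lo = interval + part / float(k);  // stand-in subdivision rule
        return Sub{ lo, lo + 1.0f / k };
    }
};

struct IsValid {
    __host__ __device__ bool operator()(const Sub& s) const {
        return s.hi - s.lo > 0.0f;              // stand-in validity criterion
    }
};

int main()
{
    const int n = 10, k = 5;
    thrust::device_vector<Sub> all(n * k), kept(n * k);

    // 1) materialize all n*k candidate subdivisions ...
    thrust::transform(thrust::counting_iterator<int>(0),
                      thrust::counting_iterator<int>(n * k),
                      all.begin(), MakeSub{ k });

    // 2) ... then stream-compact the ones that pass the criterion.
    auto end = thrust::copy_if(all.begin(), all.end(), kept.begin(), IsValid{});
    kept.resize(end - kept.begin());
}

This does materialize all n*k candidates; if that is too large, processing the intervals in batches keeps the temporary storage bounded.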
I am new to this forum, but maybe you will find some of this useful.
If you are not fixed on Thrust, you can also have a look at ArrayFire. I learned about it quite recently, and it's free for these sorts of problems.
For example, with ArrayFire you can evaluate the selection criterion for each interval in parallel using the gfor construct, i.e. consider:
// # of intervals n and # of subintervals k
const int n = 10, k = 5;
// this array represents the original intervals
array A = seq(n); // A = 0,1,2,...,n-1
// for each interval A[i], subI[i] counts the # of subintervals
array subI = zeros(n);
gfor(array i, n) { // in parallel for all intervals
    // here evaluate your predicate for the interval's subdivision
    array pred = A(i)*A(i) + 1234;
    subI(i) = pred % (k + 1);
}
//array acc = accum(subI);
int n_total = sum<float>(subI); // compute the total # of subintervals
// this array keeps the intervals after subdivision
array B = zeros(n_total);
std::cout << "total # of subintervals: " << n_total << "\n";
print(A);
print(subI);
gfor(array i, n_total) {
    // populate the array of new intervals
    B(i) = ...
}
print(B);
Of course, it depends on how your intervals are represented and which criterion you use for subdivision.

AS3 math: nearest neighbour in array

So let's say I have T, T = 1200. I also have A; A is an array that contains 1000s of numerical entries ranging from 1000 to 2000, but does not include an entry for 1200.
What's the fastest way of finding the nearest neighbour (closest value)? Let's say we ceil it, so it'll match 1201, not 1199, in A.
Note: this will be run on ENTER_FRAME.
Also note: A is static.
It is also very fast to use Vector.<int> instead of Array and do a simple for-loop:
var vector:Vector.<int> = new <int>[ 0, 1, 2, /*....*/ 2000 ];

function seekNextLower( searchNumber:int ) : int {
    for (var i:int = vector.length - 1; i >= 0; i--) {
        if (vector[i] <= searchNumber) return vector[i];
    }
    return vector[0]; // fallback if no lower value exists
}

function seekNextHigher( searchNumber:int ) : int {
    for (var i:int = 0; i < vector.length; i++) {
        if (vector[i] >= searchNumber) return vector[i];
    }
    return vector[vector.length - 1]; // fallback if no higher value exists
}
Using any array methods will be more costly than iterating over Vector.<int> - it was optimized for exactly this kind of operation.
If you're looking to run this on every ENTER_FRAME event, you'll probably benefit from some extra optimization.
If you keep track of the entries when they are written to the array, you don't have to sort them.
For example, you'd have a lookup array where T is the index, and each slot holds an object with the indexes of the A entries holding that value. You could also store the closest value's index as part of that object, so when you're retrieving this every frame, you only need to access that value rather than search.
Of course, this only helps if you read a lot more than you write, because recreating the object is quite expensive, so it really depends on the use.
You might also want to look into linked lists; for certain operations they are quite a bit faster (slower on sort, though).
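As an illustration of that precompute-once idea, here is a C++ sketch (the 1000..2000 value range and all names are assumptions taken from the question): build the table once whenever A changes, and each frame becomes a single array read.

#include <cstdlib>
#include <vector>

// nearest[t - lo] holds the entry of `a` closest to t, preferring the
// larger value on ties (the "ceil" preference from the question).
std::vector<int> buildNearestTable(const std::vector<int>& a, int lo, int hi)
{
    std::vector<int> nearest(hi - lo + 1);
    for (int t = lo; t <= hi; ++t) {
        int best = a[0];
        for (int v : a)
            if (std::abs(t - v) < std::abs(t - best) ||
                (std::abs(t - v) == std::abs(t - best) && v > best))
                best = v;
        nearest[t - lo] = best; // O(1) lookup at frame time
    }
    return nearest;
}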
You have to read each value, so the complexity will be linear. It's pretty much like finding the smallest int in an array.
var closestIndex:uint;
var closestDistance:uint = uint.MAX_VALUE;
var currentDistance:uint;
var arrayLength:uint = A.length;
for (var index:int = 0; index < arrayLength; index++)
{
    currentDistance = Math.abs(T - A[index]);
    if (currentDistance < closestDistance ||
        (currentDistance == closestDistance && A[index] > T)) // between two values with the same distance, prefer the one larger than T
    {
        closestDistance = currentDistance;
        closestIndex = index;
    }
}
return A[closestIndex]; // note: A, not T -- we want the stored value closest to T
Since your array is sorted, you could adapt a straightforward binary search (such as the one explained in this answer) to find the 'pivot' where the left subdivision and the right subdivision at a recursive step bracket the value you are searching for.
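A hedged sketch of that adaptation (C++ for illustration; assumes A is sorted ascending): narrow the range down to the first element >= T, then compare it with its left neighbour.

#include <cstddef>
#include <vector>

// Adapted binary search: find the first element >= t, then pick the nearer
// of it and its left neighbour; ties go to the larger value, matching the
// "ceil" preference in the question.
int nearest(const std::vector<int>& a, int t)
{
    std::size_t lo = 0, hi = a.size();
    while (lo < hi) {                     // O(log n) halving
        std::size_t mid = (lo + hi) / 2;
        if (a[mid] < t) lo = mid + 1; else hi = mid;
    }
    if (lo == 0)        return a.front(); // t is below the smallest entry
    if (lo == a.size()) return a.back();  // t is above the largest entry
    int below = a[lo - 1], above = a[lo];
    return (t - below < above - t) ? below : above;
}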
Just a thought I had... Sort A (since it's static, you can sort it once before you start), then take a guess at what index to start at (say A has length 100 and you want 1200: 100*(200/1000) = 20), and start guessing at that index. If A[guess] is higher than 1200, check the value at A[guess-1]; if it is still higher, keep going down until you find one value that is higher and one that is lower, then determine which is closer. If your initial guess was too low, keep going up instead.
This won't be great and might not be the best performance-wise, but it would be a lot better than checking every single value, and it will work quite well if A is evenly spaced between 1000 and 2000.
Good luck!
public function nearestNumber(value:Number, list:Array):Number {
    var currentNumber:Number = list[0];
    for (var i:int = 0; i < list.length; i++) {
        if (Math.abs(value - list[i]) < Math.abs(value - currentNumber)) {
            currentNumber = list[i];
        }
    }
    return currentNumber;
}