I have been trying to find a better Tokyo Cabinet (or Tokyo Tyrant) configuration for my application, but I don't know exactly how. I know what some of the parameters mean, but I want fine-grained control over the tuning, so I need to know the impact of each one. The Tokyo Cabinet documentation is really good, but not on this point.
TCHDB -> *bool tchdbtune(TCHDB *hdb, int64_t bnum, int8_t apow, int8_t fpow, uint8_t opts);*
How do I use: bnum, apow and fpow?
TCBDB -> *bool tcbdbtune(TCBDB *bdb, int32_t lmemb, int32_t nmemb, int64_t bnum, int8_t apow, int8_t fpow, uint8_t opts);*
How do I use: lmemb, nmemb, bnum, apow and fpow?
TCFDB -> *bool tcfdbtune(TCFDB *fdb, int32_t width, int64_t limsiz);*
How do I use: width and limsiz? Note: I am only putting this to get all types of database in the topic, this one is really simple.
TCTDB -> *bool tctdbtune(TCTDB *tdb, int64_t bnum, int8_t apow, int8_t fpow, uint8_t opts);*
How do I use: bnum, apow and fpow?
I faced the same problem.
But because the results will depend heavily on your application, my advice is to run a two-level factorial experiment:
Benchmark your application with a low value and a high value for each parameter (use multiple runs to gain confidence in the results)
From the benchmark results, calculate the effect of each factor (see the sketch below)
You then have data that indicates the importance of each parameter. Factors with a high effect are very significant for performance; factors with a low effect are not important.
You should then fine-tune the important parameters.
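As a concrete sketch of that procedure, here is a minimal two-level factorial harness for, say, tchdbtune's bnum/apow/fpow (the low/high values and the benchmark hook are illustrative assumptions, not recommendations): run every low/high combination, then estimate each parameter's main effect as the mean of (high run - low run) over the other factors.

#include <cstdio>

// Hypothetical hook: rebuild the database with these tuning values,
// run your real workload, and return the elapsed time in seconds.
double benchmark(long long bnum, int apow, int fpow) {
    (void)bnum; (void)apow; (void)fpow;
    return 0.0; // stub: replace with a real measurement
}

int main() {
    const long long bnumLv[2] = {131071, 4000000}; // low/high (assumed)
    const int       apowLv[2] = {2, 8};
    const int       fpowLv[2] = {8, 12};

    double t[2][2][2];
    for (int b = 0; b < 2; ++b)
        for (int a = 0; a < 2; ++a)
            for (int f = 0; f < 2; ++f)
                t[b][a][f] = benchmark(bnumLv[b], apowLv[a], fpowLv[f]);

    // Main effect of a factor = mean(high runs) - mean(low runs).
    double effB = 0, effA = 0, effF = 0;
    for (int x = 0; x < 2; ++x)
        for (int y = 0; y < 2; ++y) {
            effB += (t[1][x][y] - t[0][x][y]) / 4.0;
            effA += (t[x][1][y] - t[x][0][y]) / 4.0;
            effF += (t[x][y][1] - t[x][y][0]) / 4.0;
        }
    printf("main effects (s): bnum=%.3f apow=%.3f fpow=%.3f\n", effB, effA, effF);
}

Factors whose effect magnitude dominates are the ones worth fine-tuning first.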
It seems that CUDA does not allow me to "pass an object of a class derived from virtual base classes to __global__ function", for some reason related to the "virtual table" or "virtual pointer".
I wonder whether there is some way for me to set up the "virtual pointer" manually, so that I can use polymorphism?
Is There Any Way To Copy vtable From Host To Device
You wouldn't want to copy the vtable from host to device. The vtable on the host (i.e. in an object created on the host) has a set of host function pointers in the vtable. When you copy such an object to the device, the vtable doesn't get changed or "fixed up", and so you end up with an object on the device, whose vtable is full of host pointers.
If you then try and call one of those virtual functions (using the object on the device, from device code), bad things happen. The numerical function entry points listed in the vtable are addresses that don't make any sense in device code.
so that I can use polymorphism
My recommendation for a way to use polymorphism in device code is to create the object on the device. This sets up the vtable with a set of device function pointers, rather than host function pointers, and questions such as this demonstrate that it works. To a first order approximation, if you have a way to create a set of polymorphic objects in host code, I don't know of any reason why you shouldn't be able to use a similar method in device code. The issue really has to do with interoperability - moving such objects between host and device - which is what the stated limitations in the programming guide are referring to.
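For illustration, a minimal sketch of that approach (Shape and Circle are hypothetical names, not from the question; device-side new requires compute capability 2.0+ and allocates from the device heap):

#include <cstdio>

struct Shape {
    __device__ virtual float area() const = 0;
    __device__ virtual ~Shape() {}
};

struct Circle : Shape {
    float r;
    __device__ Circle(float r_) : r(r_) {}
    __device__ float area() const override { return 3.14159f * r * r; }
};

__global__ void makeAndUse() {
    Shape* s = new Circle(2.0f);       // constructed on the device: device vtable
    printf("area = %f\n", s->area());  // virtual dispatch works in device code
    delete s;
}

int main() {
    makeAndUse<<<1, 1>>>();
    cudaDeviceSynchronize();
}

Because the object never crosses the host/device boundary, none of the vtable problems above arise.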
I wonder whether there is some way for me to set up the "virtual pointer" manually
There might be. In the interest of sharing knowledge, I will outline a method. However, I don't know C++ well enough to say whether this is acceptable/legal. The only thing I can say is in my very limited testing, it appears to work. But I would assume it is not legal and so I do not recommend you use this method for anything other than experimentation. Even if we don't resolve whether or not it is legal, there is already a stated CUDA limitation (as indicated above) that you should not attempt to pass objects with virtual functions between host and device. So I offer it merely as an observation, which may be interesting for experimentation or research. I don't suggest it for production code.
The basic idea is outlined in this thread. It is predicated on the idea that an ordinary object-copy does not seem to copy the virtual function pointer table, which makes sense to me, but that the object as a whole does contain the table. Therefore if we use a method like this:
template<typename T>
__device__ void fixVirtualPointers(T *other) {
    T temp = T(*other);              // object-copy moves the "guts" of the object w/o changing the vtable
    memcpy(other, &temp, sizeof(T)); // byte-wise copy of the whole object seems to move the vtable
}
it seems to be possible to take a given object, create a new "dummy" object of that type, and then "fix up" the vtable by doing a pointer-based copy of the object (considering the entire object size) rather than a "typical" object-copy. Use this at your own risk. This blog may also be interesting reading, although I can't vouch for the correctness of any statements there.
Beyond this, there are a variety of other suggestions here on the cuda tag; you may wish to review them.
I would like to provide a different way to fix the vtable which does not rely on copying the vtable between objects. The idea is to use placement new on the device to let the compiler generate the appropriate vtable. However, this approach also violates the restrictions stated in the programming guide.
#include <cstdio>
#include <new> // placement new

struct A{
    __host__ __device__
    virtual void foo(){
        printf("A\n");
    }
};

struct B : public A{
    B(int i = 13) : data(i){}
    __host__ __device__
    virtual void foo() override{
        printf("B %d\n", data);
    }
    int data;
};

// Reconstruct the object in place: the device-side constructor call
// installs a vtable of device function pointers.
template<class T>
__global__
void fixKernel(T* ptr){
    T tmp(*ptr);      // copy the data members aside
    new (ptr) T(tmp); // placement new rebuilds the object with a device vtable
}

__global__
void useKernel(A* ptr){
    ptr->foo(); // virtual call; safe only after the vtable has been fixed
}

int main(){
    A a;
    a.foo();

    B b(7);
    b.foo();

    A* ab = new B();
    ab->foo();

    A* d_a;
    cudaMalloc(&d_a, sizeof(A));
    cudaMemcpy(d_a, &a, sizeof(A), cudaMemcpyHostToDevice);

    B* d_b;
    cudaMalloc(&d_b, sizeof(B));
    cudaMemcpy(d_b, &b, sizeof(B), cudaMemcpyHostToDevice);

    fixKernel<<<1,1>>>(d_a); // repair the vtable before any virtual call
    useKernel<<<1,1>>>(d_a);

    fixKernel<<<1,1>>>(d_b);
    useKernel<<<1,1>>>(d_b);

    cudaDeviceSynchronize();

    cudaFree(d_b);
    cudaFree(d_a);
    delete ab;
}
Because my pointers all point to non-overlapping memory, I've gone all out and marked the pointers passed to my kernels (and their inlined functions) as restricted, and made them const too, wherever possible. This however increased the register usage of some kernels and decreased it for others. This doesn't make much sense to me.
Does anybody know why this can be the case?
Yes, it can increase register usage.
Referring to the programming guide for __restrict__:
The effects here are a reduced number of memory accesses and reduced number of computations. This is balanced by an increase in register pressure due to "cached" loads and common sub-expressions.
Since register pressure is a critical issue in many CUDA codes, use of restricted pointers can have negative performance impact on CUDA code, due to reduced occupancy.
const __restrict__ may be beneficial for at least 2 reasons:
On architectures that support it, it may enable the compiler to discover uses for the constant cache which may be a performance-enhancing feature.
As indicated in the above linked programming guide section, it may enable other optimizations to be made by the compiler (e.g. reducing instructions and memory accesses) which also may improve performance if the corresponding register pressure does not become an issue.
Reducing instructions and memory accesses leading to increased register pressure may be non-intuitive. Let's consider the example given in the above programming guide link:
void foo(const float* a, const float* b, float* c) {
    c[0] = a[0] * b[0];
    c[1] = a[0] * b[0];
    c[2] = a[0] * b[0] * a[1];
    c[3] = a[0] * a[1];
    c[4] = a[0] * b[0];
    c[5] = b[0];
    ...
}
If we allow for pointer aliasing in the above example, then the compiler can't make many optimizations, and the compiler is essentially reduced to performing the code exactly as written. The first line of code:
c[0] = a[0] * b[0];
will require 3 registers. The next line of code:
c[1] = a[0] * b[0];
will also require 3 registers, and because everything is being generated as-written, they can be the same 3 registers, reused. Similar register reuse can occur for the remainder of the example, resulting in low overall register usage/pressure.
But if we allow the compiler to re-order things, then we must have registers assigned for each value loaded up front, and reserved until that value is retired. This re-ordering can increase register usage/pressure, but may ultimately lead to faster code (or it may lead to slower code, if the register pressure becomes a performance limiter.)
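To make that concrete, here is roughly the transformation the programming guide describes for the example above once all three pointers carry __restrict__ (a sketch of what the compiler may do, not its literal output):

void foo(const float* __restrict__ a,
         const float* __restrict__ b,
         float*       __restrict__ c) {
    // With aliasing ruled out, loads are hoisted and subexpressions cached:
    float t0 = a[0];
    float t1 = b[0];
    float t2 = t0 * t1;  // a[0]*b[0], computed once
    float t3 = a[1];
    c[0] = t2;
    c[1] = t2;
    c[2] = t2 * t3;
    c[3] = t0 * t3;
    c[4] = t2;
    c[5] = t1;
    // t0..t3 stay live across all the stores: fewer loads and multiplies,
    // but more registers in flight at once.
}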
The concept of lambdas (anonymous functions) is very clear to me. And I'm aware of polymorphism in terms of classes, with runtime/dynamic dispatch used to call the appropriate method based on the instance's most derived type. But how exactly can a lambda be polymorphic? I'm yet another Java programmer trying to learn more about functional programming.
You will observe that I don't talk about lambdas much in the following answer. Remember that in functional languages, any function is simply a lambda bound to a name, so what I say about functions translates to lambdas.
Polymorphism
Note that polymorphism doesn't really require the kind of "dispatch" that OO languages implement through derived classes overriding virtual methods. That's just one particular kind of polymorphism, subtyping.
Polymorphism itself simply means a function allows not just for one particular type of argument, but is able to act accordingly for any of the allowed types. The simplest example: you don't care for the type at all, but simply hand on whatever is passed in. Or, to make it not quite so trivial, wrap it in a single-element container. You could implement such a function in, say, C++:
template<typename T> std::vector<T> wrap1elem( T val ) {
    return std::vector<T>{ val };
}
but you couldn't implement it as a lambda, because C++ (time of writing: C++11) doesn't support polymorphic lambdas.
Untyped values
...At least not in this way, that is. C++ templates implement polymorphism in rather an unusual way: the compiler actually generates a monomorphic function for every type that anybody passes to the function, in all the code it encounters. This is necessary because of C++' value semantics: when a value is passed in, the compiler needs to know the exact type (its size in memory, possible child-nodes etc.) in order to make a copy of it.
In most newer languages, almost everything is just a reference to some value, and when you call a function it doesn't get a copy of the argument objects but just a reference to the already-existing ones. Older languages require you to explicitly mark arguments as reference / pointer types.
A big advantage of reference semantics is that polymorphism becomes much easier: pointers always have the same size, so the same machine code can deal with references to any type at all. That makes, very uglily1, a polymorphic container-wrapper possible even in C:
#include <stdlib.h> /* for malloc */

typedef struct{
    void** contents;
    int size;
} vector;

vector wrap1elem_by_voidptr(void* ptr) {
    vector v;
    v.contents = malloc(sizeof(void*)); /* room for one generic pointer */
    v.contents[0] = ptr;
    v.size = 1;
    return v;
}
#define wrap1elem(val) wrap1elem_by_voidptr(&(val))
Here, void* is just a pointer to any unknown type. The obvious problem thus arising: vector doesn't know what type(s) of elements it "contains"! So you can't really do anything useful with those objects. Except if you do know what type it is!
int sum_contents_int(vector v) {
    int acc = 0, i;
    for(i=0; i<v.size; ++i) {
        acc += * (int*) (v.contents[i]);
    }
    return acc;
}
obviously, this is extremely laborious. What if the type is double? What if we want the product, not the sum? Of course, we could write each case by hand. Not a nice solution.
What would be better is if we had a generic function that takes the instruction for what to do as an extra argument! C has function pointers:
int accum_contents_int(vector v, void (*combine)(int*, int)) {
    int acc = 0, i;
    for(i=0; i<v.size; ++i) {
        combine(&acc, * (int*) (v.contents[i]));
    }
    return acc;
}
That could then be used like
void multon(int* acc, int x) {
    *acc *= x;
}

int main() {
    int a = 3, b = 5;
    vector v = wrap2elems(a, b); /* assumed two-element helper; a possible definition is sketched below */
    /* note: accum_contents_int starts acc at 0, so the product printed here is 0;
       the generic version further down improves on this by taking acc as a parameter */
    printf("%i\n", accum_contents_int(v, multon));
}
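To keep the snippet self-contained: the wrap2elems helper is never defined above; a possible definition, by analogy with wrap1elem (hypothetical, not from the original), would be

vector wrap2elems_by_voidptr(void* a, void* b) {
    vector v;
    v.contents = (void**) malloc(2 * sizeof(void*)); /* room for two generic pointers */
    v.contents[0] = a;
    v.contents[1] = b;
    v.size = 2;
    return v;
}
#define wrap2elems(x, y) wrap2elems_by_voidptr(&(x), &(y))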
Apart from still being cumbersome, all the above C code has one huge problem: it's completely unchecked whether the container elements actually have the right type! The casts from void* will happily fire on any type, but in doubt the result will be complete garbage2.
Classes & Inheritance
That problem is one of the main issues which OO languages solve by trying to bundle all operations you might perform right together with the data, in the object, as methods. While compiling your class, the types are monomorphic so the compiler can check the operations make sense. When you try to use the values, it's enough if the compiler knows how to find the method. In particular, if you make a derived class, the compiler knows "aha, it's ok to call that method from the base class even on a derived object".
Unfortunately, that would mean all you achieve by polymorphism is equivalent to composing data and simply calling the (monomorphic) methods on a single field. To actually get different behaviour (but controlledly!) for different types, OO languages need virtual methods. What this amounts to is basically that the class has extra fields with pointers to the method implementations, much like the pointer to the combine function I used in the C example – with the difference that you can only implement an overriding method by adding a derived class, for which the compiler again knows the type of all the data fields etc., so you're safe and all.
Sophisticated type systems, checked parametric polymorphism
While inheritance-based polymorphism obviously works, I can't help saying it's ~~just crazy stupid~~3 sure a bit limiting. If you want to use just one particular operation that happens not to be implemented as a class method, you need to make an entire derived class. Even if you just want to vary an operation in some way, you need to derive and override a slightly different version of the method.
Let's revisit our C code. On the face of it, we notice it should be perfectly possible to make it type-safe, without any method-bundling nonsense. We just need to make sure no type information is lost – not during compile-time, at least. Imagine (Read ∀T as "for all types T")
∀T: {
    typedef struct{
        T** contents;
        int size;
    } vector<T>;
}

∀T: {
    vector<T> wrap1elem(T* elem) {
        vector<T> v;
        v.contents = malloc(sizeof(T*));
        v.contents[0] = elem;
        v.size = 1;
        return v;
    }
}

∀T: {
    void accum_contents(vector<T> v, void (*combine)(T*, const T*), T* acc) {
        int i;
        for(i=0; i<v.size; ++i) {
            combine(acc, v.contents[i]);
        }
    }
}
Observe how, even though the signatures look a lot like the C++ template thing on top of this post (which, as I said, really is just auto-generated monomorphic code), the implementation actually is pretty much just plain C. There are no T values in there, just pointers to them. No need to compile multiple versions of the code: at runtime, the type information isn't needed, we just handle generic pointers. At compile time, we do know the types and can use the function head to make sure they match. I.e., if you wrote
void evil_sumon (int* acc, double* x) { *acc += *x; }
and tried to do
vector<float> v; char acc;
accum_contents(v, evil_sumon, &acc);
the compiler would complain because the types don't match: in the declaration of accum_contents it says the type may vary, but all occurrences of T do need to resolve to the same type.
And that is exactly how parametric polymorphism works in languages of the ML family as well as Haskell: the functions really don't know anything about the polymorphic data they're dealing with. But they are given the specialised operators which have this knowledge, as arguments.
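For comparison, a sketch of how the same compile-time guarantee looks in actual C++ with templates (which, as noted above, monomorphize rather than share one compiled body, but reject a mismatched call just the same):

#include <vector>

template <typename T>
void accum_contents(const std::vector<T>& v,
                    void (*combine)(T*, const T*), T* acc) {
    for (const T& x : v)
        combine(acc, &x);
}

void sum_int(int* acc, const int* x) { *acc += *x; }

int main() {
    std::vector<int> v{3, 5};
    int acc = 0;
    accum_contents(v, sum_int, &acc);     // fine: every T resolves to int
    // std::vector<float> w{1.0f};
    // accum_contents(w, sum_int, &acc);  // error: T can't be both float and int
}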
In a language like Java (prior to lambdas), parametric polymorphism doesn't gain you much: since the compiler makes it deliberately hard to define "just a simple helper function" in favour of having only class methods, you can simply go the derive-from-class way right away. But in functional languages, defining small helper functions is the easiest thing imaginable: lambdas!
And so you can write incredibly terse code in Haskell:
Prelude> foldr (+) 0 [1,4,6]
11
Prelude> foldr (\x y -> x+y+1) 0 [1,4,6]
14
Prelude> let f start = foldr (\_ (xl,xr) -> (xr, xl)) start
Prelude> :t f
f :: (t, t) -> [a] -> (t, t)
Prelude> f ("left", "right") [1]
("right","left")
Prelude> f ("left", "right") [1, 2]
("left","right")
Note how in the lambda I defined as a helper for f, I didn't have any clue about the type of xl and xr, I merely wanted to swap a tuple of these elements which requires the types to be the same. So that would be a polymorphic lambda, with the type
\_ (xl, xr) -> (xr, xl) :: ∀ a t. a -> (t,t) -> (t,t)
1Apart from the weird explicit malloc stuff, type safety etc.: code like that is extremely hard to work with in languages without a garbage collector, because somebody always needs to clean up memory once it's no longer needed, yet must not free it while somebody else still holds a reference to the data and might in fact still need it. That's nothing you have to worry about in Java, Lisp, Haskell...
2There is a completely different approach to this: the one dynamic languages choose. In those languages, every operation needs to make sure it works with any type (or, if that's not possible, raise a well-defined error). Then you can arbitrarily compose polymorphic operations, which is on one hand "nicely trouble-free" (not as trouble-free as with a really clever type system like Haskell's, though) but OTOH incurs quite a heavy overhead, since even primitive operations need type-decisions and safeguards around them.
3I'm of course being unfair here. The OO paradigm has more to it than just type-safe polymorphism; it enables many things that e.g. old ML with its Hindley-Milner type system couldn't do (ad-hoc polymorphism: Haskell has type classes for that, SML has modules), and even some things that are pretty hard in Haskell (mainly, storing values of different types in a variable-size container). But the more you get accustomed to functional programming, the less need you will feel for such stuff.
In C++, a polymorphic (or generic) lambda, available starting from C++14, is a lambda that can take an argument of any type. Basically it's a lambda that declares its parameter type as auto:
auto lambda = [](auto){};
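A minimal illustration (the names are mine): one generic lambda, instantiated by the compiler for two different argument types.

#include <iostream>
#include <string>

int main() {
    auto twice = [](auto x) { return x + x; };      // one lambda...
    std::cout << twice(21) << '\n';                 // ...used with int: 42
    std::cout << twice(std::string("ab")) << '\n';  // ...and with std::string: "abab"
}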
Is there a context that you've heard the term "polymorphic lambda"? We might be able to be more specific.
The simplest way that a lambda can be polymorphic is to accept arguments whose type is (partly-)irrelevant to the final result.
e.g. the lambda
\(head:tail) -> tail
has the type [a] -> [a] -- i.e. it's fully polymorphic in the element type of the list.
Other simple examples are the likes of
\_ -> 5 :: Num n => a -> n
\x f -> f x :: a -> (a -> b) -> b
\n -> n + 1 :: Num n => n -> n
etc.
(Notice the Num n examples which involve typeclass dispatch)
In my code I am using structures in order to facilitate the passing of arguments to functions (I don't use arrays of structures, but instead structures of arrays, in general).
When I am in cuda-gdb and I examine the point in a kernel where I give values to a simple structure like
struct pt{
    int i;
    int j;
    int k;
};
even though I am not doing anything complicated and it's obvious that the members should have the assigned values, I get...
Asked for position 0 of stack, stack only has 0 elements on it.
So I am thinking that even though it's not an array, maybe there is a problem with the alignment of memory at that point. So I change the definition in the header file to
struct __align__(16) pt{
    int i;
    int j;
    int k;
};
but then, when the compiler tries to compile the host-code files that use the same definitions, it gives the following errors:
error: expected unqualified-id before numeric constant
error: expected ‘)’ before numeric constant
error: expected constructor, destructor, or type conversion before ‘;’ token
So, am I supposed to have two different definitions for host and device structures?
Further I would like to ask how to generalize the logic of alignment. I am not a computer scientist, so the two examples in the programming guide don't help me get the big picture.
For example, how should the following two be aligned? Or, how should a structure with 6 floats be aligned? Or 4 integers? Again, I'm not using arrays of those, but I still define lots of variables with these structures within kernels or __device__ functions.
struct {
    int a;
    int b;
    int c;
    int d;
    float* el;
};

struct {
    int a;
    int b;
    int c;
    int d;
    float* i;
    float* j;
    float* k;
};
Thank you in advance for any advice or hints
There are a lot of questions in this post. Since the CUDA programming guide does a pretty good job of explaining alignment in CUDA, I'll just explain a few things that are not obvious in the guide.
First, the reason your host compiler gives you errors is that it doesn't know what __align__(n) means, so it reports a syntax error. What you need is to put something like the following in a header for your project.
#if defined(__CUDACC__) // NVCC
#define MY_ALIGN(n) __align__(n)
#elif defined(__GNUC__) // GCC
#define MY_ALIGN(n) __attribute__((aligned(n)))
#elif defined(_MSC_VER) // MSVC
#define MY_ALIGN(n) __declspec(align(n))
#else
#error "Please provide a definition for MY_ALIGN macro for your host compiler!"
#endif
So, am I supposed to have two different definitions for host and device structures?
No, just use MY_ALIGN(n), like this
struct MY_ALIGN(16) pt { int i, j, k; };
For example, how should the following two be aligned?
First, __align__(n) (or any of the host compiler flavors) enforces that the memory for the struct begins at an address that is a multiple of n bytes. If the size of the struct is not a multiple of n, then in an array of those structs, padding will be inserted to ensure each struct is properly aligned. To choose a proper value for n, you want to minimize the amount of padding required. As explained in the programming guide, the hardware requires each thread to read words aligned to 1, 2, 4, 8 or 16 bytes. So...
struct MY_ALIGN(16) {
    int a;
    int b;
    int c;
    int d;
    float* el;
};
In this case let's say we choose 16-byte alignment. On a 32-bit machine, the pointer takes 4 bytes, so the struct takes 20 bytes, which gets padded to 32: 16-byte alignment wastes 16 * ceil(20/16) - 20 = 12 bytes per struct. On a 64-bit machine, it will waste only 8 bytes per struct (24 bytes padded to 32), due to the 8-byte pointer. We can reduce the waste by using MY_ALIGN(8) instead. The tradeoff will be that the hardware will have to use 3 8-byte loads instead of 2 16-byte loads to load the struct from memory. If you are not bottlenecked by the loads, this is probably a worthwhile tradeoff. Note that you don't want to align smaller than 4 bytes for this struct.
struct MY_ALIGN(16) {
    int a;
    int b;
    int c;
    int d;
    float* i;
    float* j;
    float* k;
};
In this case with 16-byte alignment you waste only 4 bytes per struct on 32-bit machines (28 bytes padded to 32), or 8 on 64-bit machines (40 padded to 48). Loading one struct then takes two 16-byte loads (or three on a 64-bit machine). We could eliminate the waste entirely with 4-byte alignment (8-byte on 64-bit machines), but that would result in excessive loads. Again, tradeoffs.
or, how should a structure with 6 floats be aligned?
Again, tradeoffs: 6 floats take 24 bytes, so you either waste 8 bytes per struct (16-byte alignment: padded to 32, two 16-byte loads) or accept 8-byte alignment with no waste but three 8-byte loads per struct.
or 4 integers?
No tradeoff here. MY_ALIGN(16).
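As a quick sanity check of the sizes discussed above (a sketch; C++11 static_assert, sizes as on a typical 64-bit platform):

struct MY_ALIGN(16) FourInts { int a, b, c, d; };
static_assert(sizeof(FourInts) == 16,
              "four 4-byte ints exactly fill one 16-byte word: no padding");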
again, I'm not using arrays of those, but I still define lots of variables with these structures within kernels or __device__ functions.
Hmmm, if you are not using arrays of these, then you may not need to align at all. But how are you assigning to them? As you are probably seeing, all that waste is important to worry about—it's another good reason to favor structures of arrays over arrays of structures.
These days, you should use the C++11 alignas specifier, which is supported by GCC (including the versions compatible with current CUDA), by MSVC since the 2015 version, and, if I am not mistaken, by nvcc as well. That should save you the need to resort to macros.
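A minimal sketch of that, assuming both your host compiler and nvcc accept C++11:

struct alignas(16) pt {
    int i, j, k;
};
static_assert(alignof(pt) == 16, "pt objects start on 16-byte boundaries");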
Is it worth replacing all multiplications with the __umul24 function in a CUDA kernel? I have read differing and opposite opinions, and I can't yet run a benchmark to figure it out.
Only on devices with an architecture prior to Fermi, that is, with compute capability below 2.0, where the integer arithmetic unit is 24-bit.
On CUDA devices with compute capability >= 2.0 the architecture is 32-bit, and __umul24 will be slower instead of faster, because it has to emulate the 24-bit operation on the 32-bit hardware.
The question now is: is it worth the effort for the speed gain? Probably not.
Just wanted to chime in with a slightly different opinion than Ashwin/fabrizioM...
If you're just trying to teach yourself CUDA, their answer is probably more or less acceptable. But if you're actually trying to deploy a production-grade app to a commercial or research setting, that sort of attitude is generally not acceptable, unless you are absolutely sure that your end users' hardware (or your own, if you're the end user) is Fermi or later.
More likely, there are many users who will be running CUDA on legacy machines who would benefit from using compute-capability-appropriate functionality. And it's not as hard as Ashwin/fabrizioM make it out to be.
e.g. in code I'm working on, I'm using:
//For prior to Fermi use __umul24; for Fermi onward, use
//native multiplication.
__device__ inline unsigned int MultiplyFermi(unsigned int a, unsigned int b)
{ return a*b; }

__device__ inline unsigned int MultiplyAddFermi(unsigned int a, unsigned int b,
                                                unsigned int c)
{ return a*b+c; }

__device__ inline unsigned int MultiplyOld(unsigned int a, unsigned int b)
{ return __umul24(a,b); }

__device__ inline unsigned int MultiplyAddOld(unsigned int a, unsigned int b,
                                              unsigned int c)
{ return __umul24(a,b)+c; }

//Maximum Occupancy =
//16384

// ComputeCapabilityLimits_t is this code's own struct; pass it by reference
// so the result is visible to the caller.
void GetComputeCharacteristics(ComputeCapabilityLimits_t & MyCapability)
{
    cudaDeviceProp DeviceProperties;
    cudaGetDeviceProperties(&DeviceProperties, 0);
    MyCapability.ComputeCapability =
        double(DeviceProperties.major) + double(DeviceProperties.minor)*0.1;
}
Now there IS a downside here. What is it?
Well, for any kernel in which you use a multiplication, you must have two different versions of the kernel.
Is it worth it?
Well, consider: this is a trivial copy & paste job, and you're gaining efficiency, so yes, in my opinion it's worth it. After all, CUDA isn't the easiest form of programming conceptually (nor is any parallel programming). If performance is NOT critical, ask yourself: why are you using CUDA?
If performance is critical, it's negligent to code lazily and either abandon legacy devices or deliver less-than-optimal execution, unless you're absolutely confident you can drop legacy support for your deployment (allowing optimal execution).
For most, it makes sense to provide legacy support, given that it's not that hard once you realize how to do it. Be aware this means that you will also need to update your code in order to adjust to changes in future architectures.
Generally, you should note which architecture the code was targeted at when it was written, and perhaps print some sort of warning to users if they have a compute capability beyond what your latest implementation is optimized for.
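For what it's worth, a compile-time alternative (a sketch, not from the answers above) that avoids maintaining two kernel versions: during device compilation nvcc defines __CUDA_ARCH__ for the target architecture, so one wrapper can select the right multiply per architecture at compile time.

__device__ inline unsigned int fastMul(unsigned int a, unsigned int b)
{
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 200)
    return __umul24(a, b);  // pre-Fermi: the 24-bit multiply is the fast path
#else
    return a * b;           // Fermi and later: native 32-bit multiply
#endif
}

Each compiled architecture (each -gencode target) then gets the appropriate instruction, with no runtime dispatch.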