I am trying to measure the GPU performance difference between allocating memory with 'malloc' inside a kernel function and using storage pre-allocated on the host with 'cudaMalloc'. To do this, I have two kernel functions, one that uses malloc, one that uses a pre-allocated array, and I time the execution of each function repeatedly.
The problem is that the first execution of each kernel function takes between 400 and 2,500 microseconds, but all subsequent runs take about 15 to 30 microseconds.
Is this behavior expected, or am I witnessing some sort of carryover effect from previous runs? If this is carryover, what can I do to prevent it?
I have tried putting in a kernel function that zeros out all memory on the GPU between each timed test run to eliminate that carryover, but nothing changed. I have also tried reversing the order in which I run the tests, and that has no effect on relative or absolute execution times.
const int TEST_SIZE = 1000;

struct node {
    node* next;
    int data;
};

// forward declarations for the test functions defined below
int staticTest();
int dynamicTest();
void memClear();

int main() {
    int numTests = 5;
    for (int i = 0; i < numTests; ++i) {
        memClear();
        staticTest();
        memClear();
        dynamicTest();
    }
    return 0;
}
__global__ void staticMalloc(int* sum) {
    // build the nodes in a fixed-size local array
    // (the "static" counterpart of the linked list below)
    node head[TEST_SIZE];
    // initialize nodes
    for (int j = 0; j < TEST_SIZE; j++) {
        head[j].next = NULL;
        head[j].data = j;
    }
    // verify creation by adding up values
    int total = 0;
    for (int j = 0; j < TEST_SIZE; j++) {
        total += head[j].data;
    }
    sum[0] = total;
}
/**
* This is a test that will time execution of static allocation
*/
int staticTest() {
    int expectedValue = 0;
    for (int i = 0; i < TEST_SIZE; ++i) {
        expectedValue += i;
    }
    // host output vector
    int* h_sum = new int[1];
    h_sum[0] = -1;
    // device output vector
    int* d_sum;
    // vector size
    size_t bytes = sizeof(int);
    // allocate memory on device
    cudaMalloc(&d_sum, bytes);
    // only use 1 CUDA thread
    dim3 blocksize(1, 1, 1), gridsize(1, 1, 1);
    Timer runTimer;
    int runTime = 0;
    // check static allocation time
    runTimer.start();
    staticMalloc<<<gridsize, blocksize>>>(d_sum);
    runTime += runTimer.lap();
    h_sum[0] = 0;
    cudaMemcpy(h_sum, d_sum, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_sum);
    delete[] h_sum; // allocated with new[], so delete[]
    return 0;
}
__global__ void dynamicMalloc(int* sum) {
    // start a linked list
    node* headPtr = (node*) malloc(sizeof(node));
    headPtr->data = 0;
    headPtr->next = NULL;
    node* curPtr = headPtr;
    // add nodes to test in-kernel malloc
    for (int j = 1; j < TEST_SIZE; j++) {
        // allocate the node & assign values
        node* nodePtr = (node*) malloc(sizeof(node));
        nodePtr->data = j;
        nodePtr->next = NULL;
        // add it to the linked list
        curPtr->next = nodePtr;
        curPtr = nodePtr;
    }
    // verify creation by adding up values
    curPtr = headPtr;
    int total = 0;
    while (curPtr != NULL) {
        // add and increment current value
        total += curPtr->data;
        curPtr = curPtr->next;
        // clean up memory
        free(headPtr);
        headPtr = curPtr;
    }
    sum[0] = total;
}
/**
* Host function that prepares data array and passes it to the CUDA kernel.
*/
int dynamicTest() {
    // host output vector
    int* h_sum = new int[1];
    h_sum[0] = -1;
    // device output vector
    int* d_sum;
    // vector size
    size_t bytes = sizeof(int);
    // allocate memory on device
    cudaMalloc(&d_sum, bytes);
    // only use 1 CUDA thread
    dim3 blocksize(1, 1, 1), gridsize(1, 1, 1);
    Timer runTimer;
    int runTime = 0;
    // check dynamic allocation time
    runTimer.start();
    dynamicMalloc<<<gridsize, blocksize>>>(d_sum);
    runTime += runTimer.lap();
    h_sum[0] = 0;
    cudaMemcpy(h_sum, d_sum, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_sum);
    delete[] h_sum; // allocated with new[], so delete[]
    return 0;
}
__global__ void clearMemory(char *zeros) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    zeros[i] = 0;
}

void memClear() {
    char *zeros[1024]; // device pointers
    for (int i = 0; i < 1024; ++i) {
        cudaMalloc((void**) &(zeros[i]), 4 * 1024 * 1024);
        // 4096 blocks of 1024 threads cover the 4 MB buffer; a block may
        // not exceed 1024 threads, so the grid/block dimensions must be
        // in this order (not <<<1024, 4 * 1024>>>)
        clearMemory<<<4 * 1024, 1024>>>(zeros[i]);
    }
    for (int i = 0; i < 1024; ++i) {
        cudaFree(zeros[i]);
    }
}
The first execution of a kernel takes more time because a lot of things have to be loaded onto the GPU on first use (the kernel, libraries, the context, and so on). To prove it, you can just measure how long it takes to launch an empty kernel, and you will see that it takes some time. Try something like this (a runnable version of the idea; cudaDeviceSynchronize() is needed because launches are asynchronous, and <chrono> provides the host clock):
__global__ void emptyKernel() {}

// first launch: pays the one-time initialization cost
auto start = std::chrono::high_resolution_clock::now();
emptyKernel<<<1, 1>>>();
cudaDeviceSynchronize();
auto firstTiming = std::chrono::high_resolution_clock::now() - start;

// second launch: only the launch overhead remains
start = std::chrono::high_resolution_clock::now();
emptyKernel<<<1, 1>>>();
cudaDeviceSynchronize();
auto secondTiming = std::chrono::high_resolution_clock::now() - start;
You will see that secondTiming is significantly smaller than firstTiming.
The first CUDA (kernel) call initializes the CUDA system transparently. You can avoid this by calling an empty kernel first. Note that this is required in e.g. OpenCL, but there you have to do all that init-stuff manually. CUDA does it for you in the background.
Then there are some problems with your timing: CUDA kernel launches are asynchronous. So (assuming your Timer class is a host timer like time()) you are currently measuring the kernel launch time (plus, for the first call, the CUDA init time), not the kernel execution time.
At the very least you HAVE to do a cudaDeviceSynchronize() before starting AND stopping the timer.
You are better off using CUDA events, which measure exactly the kernel execution time and only that. With host timers you still include the launch overhead. See https://devblogs.nvidia.com/parallelforall/how-implement-performance-metrics-cuda-cc/
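For illustration, a minimal event-timing sketch around the question's staticMalloc launch (error checking omitted):
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);                  // enqueue start marker in the stream
staticMalloc<<<gridsize, blocksize>>>(d_sum);
cudaEventRecord(stop);                   // enqueue stop marker after the kernel

cudaEventSynchronize(stop);              // block until the kernel has finished
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // GPU-side elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);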
I'm using a very big BitmapData as a pathing map for my platformer game; however, I only use 4 particular pixel values instead of, well, all 4294967295.
Would converting this BitmapData to 2 2D Vectors of Boolean save me some memory?
And if it does, what about performance? Would it be faster or slower to do something like:
function MapGetPixel(x:int, y:int):int
{
    return int(MapBoolFirst[x][y]) + int(MapBoolSecond[x][y]) * 2;
}
instead of the BitmapData getPixel32(x:int, y:int):uint?
In short, I'm looking for a way to reduce the size and/or optimize my 4-color BitmapData.
Edit:
Using my Boolean method apparently consumes 2 times more memory than the BitmapData one.
I guess a Boolean takes more than one bit in memory, else that would be too easy. So I'm thinking about bitshifting ints and thus having one int store the values of several pixels, but I'm not sure about this…
Edit 2:
Using int bitshifts I can manage the data of 16 pixels in a single int. This trick should save some memory, even if it'll probably hit performance a bit.
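For what it's worth, here is a minimal sketch of that 2-bits-per-pixel packing in C++ (the names and layout are illustrative, not any AS3 API):
#include <cstdint>
#include <vector>

// Packs 16 pixels (2 bits each) into every 32-bit word.
struct PackedMap {
    int width;
    std::vector<uint32_t> words;
    PackedMap(int w, int h) : width(w), words((w * h + 15) / 16, 0) {}

    int get(int x, int y) const {
        int i = y * width + x;
        return int((words[i / 16] >> ((i % 16) * 2)) & 3u); // extract the 2-bit value
    }
    void set(int x, int y, int v) {
        int i = y * width + x;
        uint32_t shift = uint32_t((i % 16) * 2);
        words[i / 16] = (words[i / 16] & ~(3u << shift)) | ((uint32_t(v) & 3u) << shift);
    }
};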
Bitshifting will be the most memory-optimized way of handling it. Performance-wise it shouldn't be too big of an issue unless you need to do a lot of lookups each frame. The issue with AS is that a Boolean takes 4 bytes in memory :(
As I see it you can handle it in different cases:
1) Create a lower-res texture for the hit detections; usually it is okay to shrink it 4 times (256x256 --> 64x64)
2) Use some kind of technique for saving that data into some kind of storage (bool is easiest, but if that is too big, then you need to find another solution)
3) Do the integer solution (I haven't worked with bit-shifting before, so I thought it would be a fun challenge; here's the result of that)
The integer solution is way smaller than the Boolean one, and also way harder to understand :/
public class Foobar extends MovieClip {
    const MAX_X:int = 32;
    const MAX_Y:int = 16;
    var _itemPixels:Vector.<int> = new Vector.<int>(Math.ceil(MAX_X * MAX_Y / 32));

    public function Foobar() {
        var pre:Number = System.totalMemory;
        init();
        trace("size=" + _itemPixels.length);
        for (var i = 0; i < MAX_Y; ++i) {
            for (var j = 0; j < MAX_X; ++j) {
                trace("item=" + (i * MAX_X + j) + "=" + isWalkablePixel(j, i));
            }
        }
        trace("memory preInit=" + pre);
        trace("memory postInit=" + System.totalMemory);
    }

    public function init() {
        var id:int = 0;
        var val:int = 0;
        var b:Number = 0;
        for (var y = 0; y < MAX_Y; ++y) {
            for (var x = 0; x < MAX_X; ++x) {
                b = Math.round(Math.random()); // look up the pixel from some kind of texture, or however you expose the items
                if (b == 1) {
                    id = Math.floor((y * MAX_X + x) / 32);
                    val = _itemPixels[id];
                    var it:uint = (y * MAX_X + x) % 32;
                    b = b << it;
                    val |= b;
                    _itemPixels[id] = val;
                }
            }
        }
    }

    public function isWalkablePixel(x, y):Boolean {
        var val:int = _itemPixels[Math.floor((y * MAX_X + x) / 32)];
        var it:uint = 1 << (y * MAX_X + x) % 32;
        return (val & it) != 0;
    }
}
One simple improvement is to use a ByteArray instead of BitmapData. That means each "pixel" only takes up 1 byte instead of 4. This is still a bit wasteful since you're only needing 2 bits per pixel and not 8, but it's a lot less than using BitmapData. It also gives you some "room to grow" without having to change anything significant later if you need to store more than 4 values per pixel.
ByteArray.readByte()/ByteArray.writeByte() works with integers, so it's really convenient to use. Of course, only the low 8 bits of the integer are written when calling writeByte().
You set ByteArray.position to the point (0-based index) where you want the next read or write to start from.
To sum up: Think of the ByteArray as a one dimensional Array of integers valued 0-255.
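For comparison, the same byte-per-cell idea sketched in C++ (hypothetical names; the AS3 version would go through position and readByte()/writeByte() as described above):
#include <cstdint>
#include <vector>

// One byte per map cell: a quarter of a 32-bit-per-pixel bitmap,
// while still allowing 256 distinct values per cell.
struct ByteMap {
    int width;
    std::vector<uint8_t> cells;
    ByteMap(int w, int h) : width(w), cells(w * h, 0) {}
    int get(int x, int y) const { return cells[y * width + x]; }
    void set(int x, int y, int v) { cells[y * width + x] = uint8_t(v); }
};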
Here are the results. I was using an imported 8-bit colored .png, by the way; I'm not sure if that changes anything once it gets converted into a BitmapData.
Memory usage:
BitmapData: 100%
Double Boolean vectors: 200%
Int bitshifting: 12%
So int bitshifting wins hands down. It works pretty much the same way as hexadecimal color components, except in this case I store 16 components (pixel values in 2 bits each) instead of the 4 ARGB ones:
var pixels:int = -1; // in binary: all ones
for (var i:int = 0; i < 16; i++)
    trace("pixel " + (i + 1) + " value : " + (pixels >> i * 2 & 3));
which outputs, as expected, "pixel i value : 3" for each of the 16 pixels.
Here is the kernel that I am launching for calculating some array in parallel.
__device__ bool mult(int colsize, int rowsize, int *Aj, int *Bi, int *val)
{
    for (int j = 0; j < rowsize; j++)
    {
        for (int k = 0; k < colsize; k++)
        {
            if (Aj[j] == Bi[k])
            {
                return true;
            }
        }
    }
    return false;
}
__global__ void kernel(int *Aptr, int *Aj, int *Bptr, int *Bi, int rows, int cols, int *Cjc)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int i;
    if (tid < cols)
    {
        int beg = Bptr[tid];
        int end = Bptr[tid + 1];
        for (i = 0; i < rows; i++)
        {
            int cbeg = Aptr[i];
            int cend = Aptr[i + 1];
            if (mult(end - beg, cend - cbeg, Aj + cbeg, Bi + beg))
            {
                Cjc[tid + 1] += 1;
                //atomicAdd(Cjc+tid+1,1);
            }
        }
    }
}
And here is how I decide the configuration of grid and blocks
int numBlocks, numThreads;
if (q % 32 == 0)
{
    numBlocks = q / 32;
    numThreads = 32;
}
else
{
    numBlocks = (q + 31) / 32;
    numThreads = 32;
}
findkernel<<<numBlocks, numThreads>>>(devAptr, devAcol, devBjc, devBir, m, q, d_Cjc);
I am using a GTX 480 with compute capability (CC) 2.0.
Now the problem I am facing is that whenever q increases beyond 4096, the values in the Cjc array all come out as 0.
I know the maximum number of blocks I can use in the X direction is 65535, and each block can have at most (1024, 1024, 64) threads. So why does this kernel compute the wrong output for the Cjc array?
It seems like there are a couple of things wrong with the code you posted:
1) I guess findkernel is kernel in the CUDA code above?
2) kernel has 8 parameters, but you only use 7 parameters to call findkernel. This doesn't look right!
3) In kernel, you test if(tid < cols) - I guess this should be if(tid < count)??
4) Why does kernel expect count to be a pointer? I think you don't pass in an int pointer but a regular integer value to findkernel.
5) Why does __device__ bool mult get count/int *val if it is not used?
I guess #3 or #4 could be the source of your problem, but you should look at the other things as well.
OK, so I finally figured out, using cudaError_t, that when I try to cudaMemcpy the d_Cjc array from device to host, it throws the following error:
CUDA error: the launch timed out and was terminated
It turns out that some of the calculations in findkernel take a reasonably large amount of time, which causes the display driver to terminate the program because of the OS 'watchdog' time limit.
I believe I will have to shut down the X server, or ssh into my GPU machine (from another machine) after removing its display. This will buy me some time to do calculations that do not exceed the OS 'watchdog' limit.
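As an aside, a wrapper like the following (a sketch, not from the original code) makes such failures visible right at the offending call:
#include <cstdio>
#include <cstdlib>

#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err = (call);                                  \
        if (err != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",           \
                    cudaGetErrorString(err), __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// usage after a kernel launch:
findkernel<<<numBlocks, numThreads>>>(devAptr, devAcol, devBjc, devBir, m, q, d_Cjc);
CUDA_CHECK(cudaGetLastError());       // catches launch-configuration errors
CUDA_CHECK(cudaDeviceSynchronize());  // catches execution errors such as the watchdog kill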
I have a CUDA kernel which I'm compiling to a cubin file without any special flags:
nvcc text.cu -cubin
It compiles, though with this message:
Advisory: Cannot tell what pointer points to, assuming global memory space
and a reference to a line in some temporary cpp file. I can get this to work by commenting out some seemingly arbitrary code which makes no sense to me.
The kernel is as follows:
__global__ void string_search(char** texts, int* lengths, char* symbol, int* matches, int symbolLength)
{
    int localMatches = 0;
    int blockId = blockIdx.x + blockIdx.y * gridDim.x;
    int threadId = threadIdx.x + threadIdx.y * blockDim.x;
    int blockThreads = blockDim.x * blockDim.y;
    __shared__ int localMatchCounts[32]; // assumes at most 32 threads per block
    // start at threadId (not 0) so each thread scans its own positions
    for (int i = threadId; i < (lengths[blockId] - (symbolLength - 1)); i += blockThreads)
    {
        if (texts[blockId][i] == symbol[0])
        {
            bool breaking = false; // reset for every candidate position
            for (int j = 1; j < symbolLength; j++)
            {
                if (texts[blockId][i + j] != symbol[j])
                {
                    breaking = true;
                    break;
                }
            }
            if (breaking) continue;
            localMatches++;
        }
    }
    localMatchCounts[threadId] = localMatches;
    __syncthreads();
    if (threadId == 0)
    {
        int sum = 0;
        for (int i = 0; i < blockThreads; i++) // only sum slots that were written
        {
            sum += localMatchCounts[i];
        }
        matches[blockId] = sum;
    }
}
If I replace the line
localMatchCounts[threadId] = localMatches;
after the first for loop with this line
localMatchCounts[threadId] = 5;
it compiles with no notices. This can also be achieved by commenting out seemingly random parts of the loop above the line. I have also tried replacing the local memory array with a normal array to no effect. Can anyone tell me what the problem is?
The system is Vista 64-bit, for what it's worth.
Edit: I fixed the code so it actually works, though it still produces the compiler notice. It does not seem as though the warning is a problem, at least with regards to correctness (it might affect performance).
Arrays of pointers like char** are problematic in kernels: the device cannot dereference host pointers, so every pointer in such an array must itself point to device memory.
It is better to allocate a single contiguous buffer and divide it in a manner that enables parallel access.
In this case I'd define a 1D array which contains all the strings positioned one after another, and another 1D array, sized 2*numberOfStrings, which contains the offset of each string within the first array and its length:
For example - preparation for the kernel (a sketch assuming the strings live in a std::string array st):
std::string buffer; // all strings concatenated
for (int i = 0; i < numberOfStrings; ++i)
    buffer += st[i];
int* metadata = new int[numberOfStrings * 2];
int lastpos = 0;
for (int cnt = 0; cnt < 2 * numberOfStrings; cnt += 2)
{
    metadata[cnt] = lastpos;                       // offset of string cnt/2
    metadata[cnt + 1] = (int) st[cnt / 2].size();  // its length
    lastpos += (int) st[cnt / 2].size();
}
In the kernel (buffer and metadata here being the device copies):
int currentIndex = threadIdx.x + blockIdx.x * blockDim.x; // one string per thread
char* currentString = buffer + metadata[2 * currentIndex];
int currentStringLength = metadata[2 * currentIndex + 1];
The problem seems to be associated with the char** parameter. Turning this into a char* solved the warning, so I suspect that CUDA might have problems with this form of data. Perhaps CUDA prefers that one uses its dedicated 2D array facilities in this case.
Not so long ago I was in an interview, that required solving two very interesting problems. I'm curious how would you approach the solutions.
Problem 1:
Product of everything except current
Write a function that takes as input two integer arrays of length len, input and index, and generates a third array, result, such that:
result[i] = product of everything in input except input[index[i]]
For instance, if the function is called with len=4, input={2,3,4,5}, and index={1,3,2,0}, then result will be set to {40,24,30,60}.
IMPORTANT: Your algorithm must run in linear time.
Problem 2: (the topic was in one of Jeff's posts)
Shuffle card deck evenly
Design (either in C++ or in C#) a class Deck to represent an ordered deck of cards, where a deck contains 52 cards, divided in 13 ranks (A, 2, 3, 4, 5, 6, 7, 8, 9, 10, J, Q, K) of the four suits: spades (♠), hearts (♥), diamonds (♦) and clubs (♣).
Based on this class, devise and implement an efficient algorithm to shuffle a deck of cards. The cards must be evenly shuffled, that is, every card in the original deck must have the same probability to end up in any possible position in the shuffled deck.
The algorithm should be implemented in a method shuffle() of the class Deck:
void shuffle()
What is the complexity of your algorithm (as a function of the number n of cards in the deck)?
Explain how you would test that the cards are evenly shuffled by your method (black box testing).
P.S. I had two hours to code the solutions
First question:
int countZeroes(int[] vec) {
    int ret = 0;
    foreach (int i in vec) if (i == 0) ret++;
    return ret;
}

int[] mysticCalc(int[] values, int[] indexes) {
    int zeroes = countZeroes(values);
    int[] retval = new int[values.Length];
    int product = 1;
    if (zeroes >= 2) { // 2 or more zeroes, all results will be 0
        for (int i = 0; i < values.Length; i++) {
            retval[i] = 0;
        }
        return retval;
    }
    foreach (int i in values) {
        if (i != 0) product *= i; // we have at most 1 zero, don't include it in the product
    }
    int indexcounter = 0;
    foreach (int idx in indexes) {
        if (zeroes == 1 && values[idx] != 0) { // one zero at another index, so this result is 0
            retval[indexcounter] = 0;
        }
        else if (zeroes == 1) { // the single zero is at this index, result is the product
            retval[indexcounter] = product;
        }
        else { // no zeroes, return product / value at index
            retval[indexcounter] = product / values[idx];
        }
        indexcounter++;
    }
    return retval;
}
Worst case this program will step through 3 vectors once.
For the first one, first calculate the product of entire contents of input, and then for every element of index, divide the calculated product by input[index[i]], to fill in your result array.
Of course I have to assume that the input has no zeros.
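A minimal sketch of that approach (in C++, and assuming no zeros as stated):
void productExcept(const int* input, const int* index, int* result, int len) {
    long long prod = 1; // widened to reduce overflow risk
    for (int i = 0; i < len; ++i)
        prod *= input[i];
    for (int i = 0; i < len; ++i)
        result[i] = int(prod / input[index[i]]);
}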
Tnilsson, great solution (because I've done it the exact same way :P).
I don't see any other way to do it in linear time. Does anybody? Because the recruiting manager told me that this solution was not strong enough.
Are we missing some super complex, do-everything-in-one-return-line solution?
A linear-time solution in C# 3 for the first problem is:
IEnumerable<int> ProductExcept(List<int> l, List<int> indexes) {
    if (l.Count(i => i == 0) == 1) {
        int singleZeroProd = l.Aggregate(1, (x, y) => y != 0 ? x * y : x);
        return from i in indexes select l[i] == 0 ? singleZeroProd : 0;
    } else {
        int prod = l.Aggregate(1, (x, y) => x * y);
        return from i in indexes select prod == 0 ? 0 : prod / l[i];
    }
}
Edit: Took into account a single zero!! My last solution took me 2 minutes while I was at work so I don't feel so bad :-)
Product of everything except current in C
void product_except_current(int input[], int index[], int out[], int len) {
    int prod = 1, nzeros = 0, izero = -1;
    for (int i = 0; i < len; ++i)
        if ((out[i] = input[index[i]]) != 0)
            // compute product of non-zero elements
            prod *= out[i]; // ignore possible overflow problem
        else {
            if (++nzeros == 2)
                // if number of zeros greater than 1 then out[i] = 0 for all i
                break;
            izero = i; // save index of zero-valued element
        }
    for (int i = 0; i < len; ++i)
        out[i] = nzeros ? 0 : prod / out[i];
    if (nzeros == 1)
        out[izero] = prod; // the only non-zero-valued element
}
Here's the answer to the second one in C#, with a test method. Shuffle looks O(n) to me.
Edit: Having looked at the Fisher-Yates shuffle, I discovered that I'd re-invented that algorithm without knowing about it :-) it is obvious, however. I implemented the Durstenfeld approach, which takes us from O(n^2) to O(n), really clever!
public enum CardValue { A, Two, Three, Four, Five, Six, Seven, Eight, Nine, Ten, J, Q, K }
public enum Suit { Spades, Hearts, Diamonds, Clubs }

public class Card {
    public Card(CardValue value, Suit suit) {
        Value = value;
        Suit = suit;
    }
    public CardValue Value { get; private set; }
    public Suit Suit { get; private set; }
}

public class Deck : IEnumerable<Card> {
    public Deck() {
        initialiseDeck();
        Shuffle();
    }

    private Card[] cards = new Card[52];

    private void initialiseDeck() {
        for (int i = 0; i < 4; ++i) {
            for (int j = 0; j < 13; ++j) {
                cards[i * 13 + j] = new Card((CardValue)j, (Suit)i);
            }
        }
    }

    public void Shuffle() {
        Random random = new Random();
        for (int i = 0; i < 52; ++i) {
            // Pick j uniformly from [0, 51 - i] inclusive. Next's upper bound
            // is exclusive, so it must be 52 - i (with 51 - i the card at
            // position 51 - i could never stay put, biasing the shuffle).
            int j = random.Next(52 - i);
            // Swap the cards.
            Card temp = cards[51 - i];
            cards[51 - i] = cards[j];
            cards[j] = temp;
        }
    }

    public IEnumerator<Card> GetEnumerator() {
        foreach (Card c in cards) yield return c;
    }

    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator() {
        foreach (Card c in cards) yield return c;
    }
}

class Program {
    static void Main(string[] args) {
        foreach (Card c in new Deck()) {
            Console.WriteLine("{0} of {1}", c.Value, c.Suit);
        }
        Console.ReadKey(true);
    }
}
In Haskell (building prefix products with scanl and suffix products with scanr, so zeros need no special handling):
import Array

problem1 input index = [(left ! i) * (right ! (i + 1)) | i <- index]
    where left  = scanWith scanl
          right = scanWith scanr
          scanWith scan = listArray (0, length input) (scan (*) 1 input)
Vaibhav, unfortunately we have to assume that there could be a 0 in the input array.
Second problem.
public static void shuffle(int[] array)
{
    Random rng = new Random();  // i.e., java.util.Random.
    int n = array.length;       // The number of items left to shuffle (loop invariant).
    while (n > 1)
    {
        int k = rng.nextInt(n); // 0 <= k < n.
        n--;                    // n is now the last pertinent index;
        int temp = array[n];    // swap array[n] with array[k] (does nothing if k == n).
        array[n] = array[k];
        array[k] = temp;
    }
}
This is a copy/paste from the wikipedia article about the Fisher-Yates shuffle. O(n) complexity
Tnilsson, I agree that YXJuLnphcnQ's solution is arguably faster, but the idea is the same. I forgot to add that the language is optional in the first problem, as well as in the second.
You're right that calculating the zeroes and the product in the same loop is better. Maybe that was the thing.
Tnilsson, I've also used the Fisher-Yates shuffle :). I'm very interested, though, in the testing part :)
Trilsson made a separate topic about the testing part of the question
How to test randomness (case in point - Shuffling)
Very good idea, Trilsson :)
YXJuLnphcnQ, that's the way I did it too. It's the most obvious.
But the fact is that if you write an algorithm that just shuffles all the cards in the collection one position to the right every time you call shuffle(), it would pass the test, even though the output is not random.
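To make that concrete, a sketch (in C++) of a "shuffle" that merely rotates the deck; over many runs each card still occupies every position equally often, so a pure per-position frequency test cannot tell it apart from a real shuffle:
#include <algorithm>

void bogusShuffle(char* cards, int n) {
    // Deterministic rotation: after n calls every card has visited
    // every position exactly once, so the frequency table looks uniform.
    std::rotate(cards, cards + 1, cards + n);
}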
Shuffle card deck evenly in C++
#include <algorithm>

class Deck {
    // each card is 8-bit: 4-bit for suit, 4-bit for value
    // suits and values are extracted using bit-magic
    char cards[52];
public:
    // ...
    void shuffle() {
        std::random_shuffle(cards, cards + 52);
    }
    // ...
};
Complexity: Linear in N. Exactly 51 swaps are performed. See http://www.sgi.com/tech/stl/random_shuffle.html
Testing:
// ...
int main() {
    typedef std::map<std::pair<size_t, Deck::value_type>, size_t> Map;
    Map freqs;
    Deck d;
    const size_t ntests = 100000;
    // compute frequencies of events: card at position
    for (size_t i = 0; i < ntests; ++i) {
        d.shuffle();
        size_t pos = 0;
        for (Deck::const_iterator j = d.begin(); j != d.end(); ++j, ++pos)
            ++freqs[std::make_pair(pos, *j)];
    }
    // if Deck.shuffle() is correct then all frequencies must be similar
    for (Map::const_iterator j = freqs.begin(); j != freqs.end(); ++j)
        std::cout << "pos=" << j->first.first << " card=" << j->first.second
                  << " freq=" << j->second << std::endl;
}
As usual, one test is not sufficient.
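One way to go beyond eyeballing the printed frequencies is a chi-square goodness-of-fit statistic over the freqs map from the test above (a sketch; the pass/fail threshold would come from a chi-square table at the chosen significance level):
// expected count of each (position, card) pair under a uniform shuffle
const double expected = double(ntests) / 52.0;
double chi2 = 0.0;
for (Map::const_iterator j = freqs.begin(); j != freqs.end(); ++j) {
    double diff = double(j->second) - expected;
    chi2 += diff * diff / expected;
}
std::cout << "chi-square statistic = " << chi2 << std::endl;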