I can't get the CUDA profiling tools to work. My Asus laptop has two video cards: an integrated Intel GPU and an NVIDIA GTX 960M.
I suspected that the Visual Profiler was using the integrated video card, so under "NVIDIA Control Panel" → "Manage 3D Settings" → "Program Settings" I changed the preferred graphics processor for this specific application to the "High-Performance NVIDIA Processor".
Nothing changed. Running the Visual Profiler, in the "Overall GPU usage" tab I get "No GPU devices in Session", which, as far as I can understand, means that no GPUs were used.
Also, I noticed that the Nvidia display icon in the notification area is not reporting any applications that are using the video card.
What seems to be the problem here? How can I enable the NVIDIA GPU for both the Visual Profiler and the command-line nvprof.exe application? Nsight doesn't seem to work for me either.
The code I am testing is the following:
#include <stdio.h>
#include <iostream>
#include <stdlib.h>
#include <string.h>

#define NUM_THREADS 256
#define IMG_SIZE 1048576

struct Coefficients_SOA {
    int r;
    int b;
    int g;
    int hue;
    int saturation;
    int maxVal;
    int minVal;
    int finalVal;
};

__global__
void complicatedCalculation(Coefficients_SOA* data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int grayscale = (data[i].r + data[i].g + data[i].b) / data[i].maxVal;
    int hue_sat = data[i].hue * data[i].saturation / data[i].minVal;
    data[i].finalVal = grayscale * hue_sat;
}

void complicatedCalculation()
{
    Coefficients_SOA* d_x;
    cudaMalloc(&d_x, IMG_SIZE * sizeof(Coefficients_SOA));

    int num_blocks = IMG_SIZE / NUM_THREADS;
    complicatedCalculation<<<num_blocks, NUM_THREADS>>>(d_x);

    cudaFree(d_x);
}

int main(int argc, char* argv[])
{
    complicatedCalculation();
    return 0;
}
Best Regards,
PS: I installed CUDA version 11 under Windows 10, 64-bit.
Also, I verified the CUDA installation according to https://docs.nvidia.com/cuda/pdf/CUDA_Installation_Guide_Windows.pdf
I attach the output of the deviceQuery and bandwidthTest CUDA sample programs for your convenience.
deviceQuery Sample report
D:\Program Files\nVidia\CUDA Samples\v11.0\bin\win64\Release>deviceQuery
deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 960M"
CUDA Driver Version / Runtime Version 11.0 / 11.0
CUDA Capability Major/Minor version number: 5.0
Total amount of global memory: 4096 MBytes (4294967296 bytes)
( 5) Multiprocessors, (128) CUDA Cores/MP: 640 CUDA Cores
GPU Max Clock rate: 1176 MHz (1.18 GHz)
Memory Clock rate: 2505 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 4 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: No
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.0, CUDA Runtime Version = 11.0, NumDevs = 1
Result = PASS
BandwidthTest Sample report
D:\Program Files\nVidia\CUDA Samples\v11.0\bin\win64\Release>bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: GeForce GTX 960M
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 12.2
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 11.8
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 68.9
Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
Problem solved. As a beginner in the CUDA world, I didn't know that I had to add the -gencode parameter when compiling my CUDA files on the command line (the Visual Studio projects of the CUDA SDK samples already have these parameters, which is why those did show GPU activity).
So, for my Maxwell architecture with CUDA compute capability 5.0, the full command line should look like this:
nvcc -run -m64 -gencode arch=compute_50,code=sm_50 -o aos_soa.exe aos_soa.cu
Unfortunately, my first book, "Learn CUDA Programming" from Packt Publishing, says on page 49 to compile with ONLY the following parameters, even though the accompanying source code contains a Makefile with all of the parameters above (Linux-only, so I ignored it).
$ nvcc -o aos_soa ./aos_soa.cu
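As an aside for anyone hitting the same issue: if you are not sure which -gencode values match your GPU, you can query the compute capability at runtime with cudaGetDeviceProperties. A minimal sketch (the file name is just illustrative):

#include <cstdio>
#include <cuda_runtime.h>

// print_cc.cu - list each device's compute capability so you know which
// -gencode arch=compute_XY,code=sm_XY pair to pass to nvcc.
int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}

For the GTX 960M this reports 5.0, which maps to the -gencode arch=compute_50,code=sm_50 flags above.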
Now I can see my GPU statistics under nvprof.
nvprof aos_soa.exe
==18308== NVPROF is profiling process 18308, command: aos_soa.exe
==18308== Profiling application: aos_soa.exe
==18308== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 1.1421ms 1 1.1421ms 1.1421ms 1.1421ms complicatedCalculation(Coefficients_SOA*)
API calls: 83.40% 226.57ms 1 226.57ms 226.57ms 226.57ms cudaMalloc
15.90% 43.183ms 1 43.183ms 43.183ms 43.183ms cuDevicePrimaryCtxRelease
0.58% 1.5790ms 1 1.5790ms 1.5790ms 1.5790ms cudaFree
0.07% 198.40us 1 198.40us 198.40us 198.40us cuModuleUnload
0.03% 70.100us 1 70.100us 70.100us 70.100us cudaLaunchKernel
0.01% 26.800us 1 26.800us 26.800us 26.800us cuDeviceTotalMem
0.01% 20.200us 101 200ns 100ns 3.3000us cuDeviceGetAttribute
0.00% 11.600us 1 11.600us 11.600us 11.600us cuDeviceGetPCIBusId
0.00% 1.4000us 3 466ns 200ns 700ns cuDeviceGetCount
0.00% 1.4000us 2 700ns 200ns 1.2000us cuDeviceGet
0.00% 600ns 1 600ns 600ns 600ns cuDeviceGetName
0.00% 400ns 1 400ns 400ns 400ns cuDeviceGetLuid
0.00% 300ns 1 300ns 300ns 300ns cuDeviceGetUuid
Related
A device query on my Titan Xp shows that I have 30 multiprocessors with a maximum of 2048 threads per multiprocessor. Is it correct to think that the maximum number of threads that can simultaneously be executed physically on the hardware is 30 * 2048? That is, will a kernel configuration like the following exploit this?
kernel<<<60, 1024>>>(...);
I'd really like to physically have the maximum number of blocks executing while avoiding having blocks waiting to be scheduled. Here's the full output of device query:
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "TITAN Xp"
CUDA Driver Version / Runtime Version 9.0 / 9.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 12190 MBytes (12781682688 bytes)
(30) Multiprocessors, (128) CUDA Cores/MP: 3840 CUDA Cores
GPU Max Clock rate: 1582 MHz (1.58 GHz)
Memory Clock rate: 5705 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 3145728 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 4 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1, Device0 = TITAN Xp
Result = PASS
Yes, your conclusion is correct. The maximum number of threads that can be "in-flight" is 2048 * # of SMs for all GPUs supported by CUDA 9 or CUDA 9.1. (Fermi GPUs, supported by CUDA 8, are a bit lower at 1536 * # of SMs)
This is an upper bound, and the specifics of your kernel (resource utilization) may mean that fewer than this number can actually be "resident" or "in flight". This is in the general topic of GPU occupancy. CUDA includes an occupancy calculator spreadsheet and also a programmatic occupancy API to help determine this, for your specific kernel.
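For example, here is a minimal sketch of the programmatic route using cudaOccupancyMaxActiveBlocksPerMultiprocessor (the kernel below is just a placeholder; your real kernel's register and shared-memory usage is what actually determines the result):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data) { /* placeholder */ }

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // How many blocks of 1024 threads can be resident per SM for this kernel?
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, 1024, 0);

    printf("Resident blocks of 1024 threads per SM: %d\n", blocksPerSM);
    printf("Threads in flight for this kernel: %d\n",
           blocksPerSM * 1024 * prop.multiProcessorCount);
    return 0;
}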
The usual kernel strategy to have a limited number of threads (e.g. 60 * 1024 in your case) handle an arbitrary data set size is to use some form of a construct called a grid striding loop.
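A minimal sketch of such a loop (the kernel and data are illustrative):

__global__ void scale(float *x, int n)
{
    // Each thread starts at its global index and jumps by the total number
    // of threads in the grid, so a fixed launch such as <<<60, 1024>>> can
    // cover a data set of any size n.
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        x[i] *= 2.0f;
}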
I'm a newbie in CUDA programming. I have a problem with this code (it was written by my teacher):
#include <stdio.h>

#define THREAD_PER_BLOCK 128

__global__ void add(int *a, const int N) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < N)
        a[index] = a[index] + 10;
}

int main(void) {
    int *a;    // managed
    int i;
    int N = 1024;
    int size = N * sizeof(int);

    cudaMallocManaged(&a, size);
    for (i = 0; i < N; i++) {
        a[i] = i;
    }

    add<<< N/THREAD_PER_BLOCK, THREAD_PER_BLOCK >>>(a, N);
    cudaDeviceSynchronize();

    for (int i = 0; i < 10; i++) {
        printf("%d %d\n", i, a[i]);
    }

    cudaFree(a);
    return 0;
}
I get a segfault in the first for-loop, and I have no idea why the program crashes. My operating system is Ubuntu 14.04, and this is the output of deviceQuery:
Detected 1 CUDA Capable device(s)
Device 0: "GeForce 820M"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 2.1
Total amount of global memory: 1985 MBytes (2081095680 bytes)
( 2) Multiprocessors, ( 48) CUDA Cores/MP: 96 CUDA Cores
GPU Max Clock rate: 1550 MHz (1.55 GHz)
Memory Clock rate: 900 Mhz
Memory Bus Width: 64-bit
L2 Cache Size: 131072 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce 820M
Result = PASS
The problem here is that your GPU is a Fermi GPU (compute capability 2.x):
Device 0: "GeForce 820M"
...
CUDA Capability Major/Minor version number: 2.1
^^^
and unified memory (for cudaMallocManaged) requires a compute capability 3.0 or higher GPU.
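If you still want to run the exercise on this GPU, the usual workaround is explicit device allocation plus copies instead of managed memory. A sketch of what main() could look like under that assumption (untested, only to show the shape; it reuses the add kernel and THREAD_PER_BLOCK from the question, and needs #include <stdlib.h> for malloc/free):

int main(void) {
    const int N = 1024;
    const int size = N * sizeof(int);

    // Host buffer, filled on the CPU.
    int *h_a = (int *)malloc(size);
    for (int i = 0; i < N; i++)
        h_a[i] = i;

    // Explicit device allocation and copies replace cudaMallocManaged,
    // which is not supported on compute capability 2.x devices.
    int *d_a = NULL;
    cudaMalloc(&d_a, size);
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);

    add<<< N/THREAD_PER_BLOCK, THREAD_PER_BLOCK >>>(d_a, N);

    cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);
    for (int i = 0; i < 10; i++)
        printf("%d %d\n", i, h_a[i]);

    cudaFree(d_a);
    free(h_a);
    return 0;
}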
Any time you are having trouble with CUDA code, it's good practice to use proper CUDA error checking before asking others for help. Even if you don't understand the error output, it will be useful to others trying to help you. In this case, you would have gotten a concise error message saying that the cudaMallocManaged operation is not supported.
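A common pattern is a small checking macro wrapped around every runtime call; this is just one widely used variant, not the only way to do it:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Print the failing call's location and the CUDA error string, then abort.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage: CUDA_CHECK(cudaMallocManaged(&a, size));
// On this compute capability 2.1 device that call would then report a
// "not supported" style error instead of silently leaving 'a' invalid
// and crashing later in the host loop.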
Comments / Notes
Can I have more thread blocks than the maximum number of CUDA cores?
How does warp size relate to what I am doing?
I am running a CUDA program using the following code to launch CUDA kernels:
cuda_kernel_func<<<960, 1>>> (... arguments ...)
I thought this would be the limit of what I would be allowed to do, as I have a GTX670MX graphics processor on a laptop, which according to Nvidia's website has 960 CUDA cores.
So I tried changing 960 to 961 assuming that the program would crash. It did not...
What's going on here?
This is the output of deviceQuery:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 670MX"
CUDA Driver Version / Runtime Version 7.5 / 7.5
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 3072 MBytes (3221028864 bytes)
( 5) Multiprocessors, (192) CUDA Cores/MP: 960 CUDA Cores
GPU Max Clock rate: 601 MHz (0.60 GHz)
Memory Clock rate: 1400 Mhz
Memory Bus Width: 192-bit
L2 Cache Size: 393216 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = GeForce GTX 670MX
Result = PASS
I am not sure how to interpret this information. It says here "960 CUDA cores", but then "2048 threads per multiprocessor" and "1024 threads per block".
I am slightly confused about what these things mean, and therefore what the limitations of the cuda_kernel_func<<<..., ...>>> arguments are. (And how to get the maximum performance out of my device.)
I guess you could also interpret this question as "What do all the statistics about my device mean." For example, what actually is a CUDA core / thread / multiprocessor / texture dimension size?
It didn't crash because the number of 'CUDA cores' has nothing to do with the number of blocks. Not all blocks necessarily execute in parallel. CUDA just schedules some of your blocks after others, returning after all block executions have taken place.
You see, NVIDIA is misstating the number of cores in its GPUs, so as to make for a simpler comparison with single-threaded, non-vectorized CPU execution. Your GPU actually has 5 cores in the proper sense of the word (your deviceQuery output lists 5 multiprocessors), but each of these can execute a lot of instructions in parallel on a lot of data. Note that the bona-fide cores on a Kepler GPU are called "SMX"es (and are described here briefly).
So:
[Number of actual cores] x [max number of instructions a single core can execute in parallel] = [Number of "CUDA cores"]
e.g. 5 x 192 = 960 for your card.
Even this is a rough description of things, and what happens within an SMX doesn't always allow us to execute 192 instructions in parallel per cycle. For example, when each block has only 1 thread, that number goes down by a factor of 32 (!)
Thus even if you use 960 rather than 961 blocks, your execution isn't as well-parallelized as you would hope. And - you should really use more threads per block to utilize a GPU's capabilities for parallel execution. More importantly, you should find a good book on CUDA/GPU programming.
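As a rough illustration of that last point, something along these lines (the numbers are only an example) keeps the SMXes far busier than one thread per block:

// Instead of <<<960, 1>>>, give each block a full complement of threads
// (a multiple of the warp size, 32) and derive the grid size from the
// amount of work, guarding the tail inside the kernel with if (idx < N).
const int N = 960;                 // number of work items (example value)
const int threadsPerBlock = 256;   // 8 warps per block
const int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
cuda_kernel_func<<<blocks, threadsPerBlock>>>(/* ... arguments ... */);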
A simpler answer:
Blocks do not all execute at the same time. Some blocks may finish before others have even started. The GPU takes X blocks at a time, finishes those, grabs more blocks, and continues until all blocks are finished.
Aside: this is why __syncthreads() only synchronizes threads within a block, and not across the whole kernel.
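For the avoidance of doubt, here is a tiny sketch of the kind of block-local cooperation __syncthreads() is meant for (the kernel is illustrative only, and assumes blockDim.x == 256 with a grid that exactly covers the data):

__global__ void reverseInBlock(int *data)
{
    __shared__ int tile[256];                  // one tile per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = data[i];
    __syncthreads();                           // wait for the whole block
    data[i] = tile[blockDim.x - 1 - threadIdx.x];
}

Threads in other blocks may not even be resident on the GPU yet, so there is nothing for them to synchronize with.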
I am trying to understand how the memory organization of my GPU is working.
According to the technical specifications, which are tabulated below, my GPU can have 8 active blocks/SM and 768 threads/SM. Based on that, I was thinking that in order to take advantage of both limits, each block should have 96 (= 768/8) threads. The closest block size to that, I think, is a 9x9 block of 81 threads. Using the fact that 8 blocks can run simultaneously on one SM, we get 648 threads. What about the remaining 120 (= 768 - 648)?
I know something is wrong with this reasoning. A simple example describing the connection between the maximum number of threads per SM, the maximum number of threads per block, and the warp size, based on my GPU's specifications, would be very helpful.
Device 0: "GeForce 9600 GT"
CUDA Driver Version / Runtime Version 5.5 / 5.0
CUDA Capability Major/Minor version number: 1.1
Total amount of global memory: 512 MBytes (536870912 bytes)
( 8) Multiprocessors x ( 8) CUDA Cores/MP: 64 CUDA Cores
GPU Clock rate: 1680 MHz (1.68 GHz)
Memory Clock rate: 700 Mhz
Memory Bus Width: 256-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per multiprocessor: 768
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Concurrent kernel execution: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 1 / 0
You can find the technical specifications of your device in the CUDA programming guide, rather than relying on the output of a CUDA sample program:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities
From the hardware point of view, we generally try to maximize the warp occupancy per Multiprocessor (SM) to get max performance. The max occupancy is limited by 3 types of hardware resources: #warp/SM, #register/SM and #shared memory/SM.
You could try the following tool in your CUDA installation directory to understand how to do the calculation. It will give you a clearer understanding of the connections between #threads/SM, #threads/block, #warps/SM, etc.
$CUDA_HOME/tools/CUDA_Occupancy_Calculator.xls
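As a worked example for this particular card (ignoring register and shared-memory limits for simplicity): the per-SM limits are 768 threads, 8 blocks, and a warp size of 32, i.e. 24 warps per SM. Blocks of 96 threads (3 warps) give full occupancy with 8 blocks per SM (8 x 96 = 768), and so do blocks of 256 threads (8 warps) with 3 blocks per SM (3 x 256 = 768). A 9x9 block of 81 threads is a poor choice precisely because 81 is not a multiple of 32: the hardware still schedules 3 warps (96 thread slots) per block, so 15 lanes of each block's last warp sit idle, and with the 8-block limit only 8 x 81 = 648 of the 768 thread slots ever hold useful work. That is where the "missing" 120 threads go.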
I have exactly the same problem as described in the post:
CUDA Error on cudaBindTexture2D
I even get the same errors: "error 18: invalid texture reference", and I also experienced that it "wouldn't throw the error on cudaMalloc, but only on cudaBindTexture".
Unfortunately, the poster (Anton Roth) answered his own question in a manner that was a bit too cryptic for someone such as myself who is just starting out with CUDA:
The answer was in the comments, I used a sm that my GPU wasn't compatible to.
The "not compatible with GPU" makes sense since the sample program FluidsGL (called "Fluids (OpenGL Version)" in NVIDIA CUDA Samples Browser) fails on my laptop, but works fine on my desktop at work. Unfortunately, I still don't know what "in the comments" was referring it, or how to even check for GPU SM compatibilities.
Here is the code that seems to be causing the issue:
#define DIM 512
In main:
setupTexture(DIM, DIM);
bindTexture();
In fluidsGL_kernels.cu:
texture<float2, 2> texref;
static cudaArray *array = NULL;

void setupTexture(int x, int y)
{
    // Wrap mode appears to be the new default
    texref.filterMode = cudaFilterModeLinear;
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float2>();
    cudaMallocArray(&array, &desc, y, x);
    getLastCudaError("cudaMalloc failed");
}

void bindTexture(void)
{
    // This call itself doesn't throw the error, but error 18 is caught
    // by the getLastCudaError() call below.
    cudaBindTextureToArray(texref, array);
    getLastCudaError("cudaBindTexture failed");
}
Hardware information
Here is the output of deviceQuery:
Device 0: "GeForce 9800M GS"
CUDA Driver Version / Runtime Version 5.0 / 5.0
CUDA Capability Major/Minor version number: 1.1
Total amount of global memory: 1024 MBytes (1073741824 bytes)
( 8) Multiprocessors x ( 8) CUDA Cores/MP: 64 CUDA Cores
GPU Clock rate: 1325 MHz (1.32 GHz)
Memory Clock rate: 799 Mhz
Memory Bus Width: 256-bit
Max Texture Dimension Size (x,y,z) 1D=(8192), 2D=(65536,32768), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(8192) x 512, 2D=(8192,8192) x 512
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per multiprocessor: 768
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 8 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.0, CUDA Runtime Version = 5.0, NumDevs = 1, Device0 = GeForce 9800M GS
I know my GPU is kind of old, but it still runs most of the examples pretty well.
You need to compile your code for the proper architecture (as explained in the post you linked).
Since you have a CC 1.1 device, use the following nvcc compilation options:
-gencode arch=compute_11,code=sm_11
The default Visual Studio project or Makefile may not compile for the proper architectures, so always make sure that it does.
For Visual Studio, refer to this answer: https://stackoverflow.com/a/14413360/1043187
For a Makefile, it depends. The CUDA SDK samples often have a GENCODE_FLAGS variable that you can modify.
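For instance, with a samples-style Makefile that honours such a variable, something along these lines should pick up the right architecture (the exact variable name can differ between SDK versions, so check your Makefile first):

make GENCODE_FLAGS="-gencode arch=compute_11,code=sm_11"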