IADD.X GPU instruction - cuda

When looking at the SASS output generated for the NVIDIA Fermi architecture, I noticed the instruction IADD.X. According to the NVIDIA documentation, IADD means integer add, but I don't understand what the .X suffix means. Can somebody please help? Does it mean an integer addition with an extended number of bits?
The instruction snippet is:
IADD.X R5, R3, c[0x0][0x24]; /* 0x4800400090315c43 */

Yes, the .X stands for eXtended precision. You will see IADD.X used together with IADD.CC: the latter adds the less significant bits and produces a carry flag (hence the .CC), and that carry flag is then incorporated into the addition of the more significant bits performed by IADD.X.
Since NVIDIA GPUs are basically 32-bit processors with 64-bit addressing capability, a frequent use of this idiom is in address (pointer) arithmetic. The use of 64-bit integer types, such as long long int or uint64_t, will likewise lead to the use of these instructions.
Here is a worked example of a kernel doing 64-bit integer addition. This CUDA code was compiled for compute capability 3.5 with CUDA 7.5, and the machine code dumped with cuobjdump --dump-sass.
__global__ void addint64(long long int a, long long int b, long long int *res)
{
    *res = a + b;
}
MOV R1, c[0x0][0x44];
MOV R2, c[0x0][0x148]; // b[31:0]
MOV R0, c[0x0][0x14c]; // b[63:32]
IADD R4.CC, R2, c[0x0][0x140]; // tmp[31:0] = b[31:0] + a[31:0]; carry-out
MOV R2, c[0x0][0x150]; // res[31:0]
MOV R3, c[0x0][0x154]; // res[63:32]
IADD.X R5, R0, c[0x0][0x144]; // tmp[63:32] = b[63:32] + a[63:32] + carry-in
ST.E.64 [R2], R4; // [res] = tmp[63:0]
EXIT
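As a rough sketch, the same carry chain can also be expressed explicitly from CUDA C with inline PTX: add.cc.u32 produces the carry and addc.u32 consumes it, and ptxas will typically map this pair to the IADD.CC / IADD.X idiom shown above. The function below is purely illustrative.
// illustrative device function: a 64-bit add built from two 32-bit adds
__device__ unsigned long long add64_via_carry(unsigned long long a,
                                              unsigned long long b)
{
    unsigned int alo = (unsigned int)a, ahi = (unsigned int)(a >> 32);
    unsigned int blo = (unsigned int)b, bhi = (unsigned int)(b >> 32);
    unsigned int rlo, rhi;
    asm("add.cc.u32 %0, %2, %4;\n\t"   // low words, sets the carry flag (IADD.CC)
        "addc.u32   %1, %3, %5;"       // high words plus carry-in       (IADD.X)
        : "=r"(rlo), "=r"(rhi)
        : "r"(alo), "r"(ahi), "r"(blo), "r"(bhi));
    return ((unsigned long long)rhi << 32) | rlo;
}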

Related

For CUDA, is there a guarantee that Ternary Operator can avoid branch divergence?

I have read a lot of threads about CUDA branch divergence, telling me that using the ternary operator is better than if/else statements because the ternary operator doesn't result in branch divergence.
I wonder, for the following code:
foo = (a > b) ? (bar(a)) : (b);
where bar is another function or some more complicated statement, is it still true that there is no branch divergence?
I don't know what sources you consulted, but with the CUDA toolchain there is no noticeable performance difference between the use of the ternary operator and the equivalent if-then-else sequence in most cases. In the case where such differences are noticed, they are due to second order effects in the code generation, and the code based on if-then-else sequence may well be faster in my experience. In essence, ternary operators and tightly localized branching are treated in much the same way. There can be no guarantees that a ternary operator may not be translated into machine code containing a branch.
The GPU hardware offers multiple mechanisms that help avoid branches, and the CUDA compiler makes good use of these mechanisms to minimize branches. One is predication, which can be applied to pretty much any instruction. The other is support for select-type instructions, which are essentially the hardware equivalent of the ternary operator. The compiler uses if-conversion to translate short branches into branch-less code sequences. Often, it chooses a combination of predicated code and a uniform branch. In cases of non-divergent control flow (all threads in a warp take the same branch), the uniform branch skips over the predicated code section.
Except in cases of extreme performance optimization, CUDA can (and should) be written in natural idioms that are clear and appropriate to the task at hand, using either if-then-else sequences or ternary operators as you see fit. The compiler will take care of the rest.
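As a minimal sketch of that point (illustrative code, not taken from any particular toolkit output), both forms below are short and localized enough that the compiler will typically if-convert them into the same branch-less SASS, using a select or predicated instructions:
__global__ void pick(const float *a, const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // if-then-else form
    float v;
    if (a[i] > b[i]) v = a[i];
    else             v = b[i];

    // equivalent ternary form; with current toolchains both usually compile
    // to the same select/predicated code rather than an actual branch
    float w = (a[i] > b[i]) ? a[i] : b[i];

    out[i] = v + w;
}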
(I would like to add this as a comment on njuffa's answer, but my reputation is not enough.)
I found a performance difference between them in my program.
The if-clause style costs 4.78 ms:
// fin is in {0..4}, range_limit = 5
if (fin >= range_limit) {
    res_set = res_set ^ 1;
    fx_ref = fx + (fxw * (float)DEPTH_M_H / (float)DEPTH_BLOCK_X);
    fin = 0;
}
// then branch for the next loop iteration
// nvvp reports these instructions:
#!P1 LOP32I.XOR R48, R48, 0x1;
#!P1 FMUL.FTZ R39, R7, 14;
#!P1 MOV R0, RZ;
MOV R40, R48;
{ #!P1 FFMA.FTZ R6, R39, c[0x2][0x0], R5;
#!P0 BRA `(.L_35); } // this predicate is also used for the loop's branch
And the ternary style costs 4.46 ms:
res_set = (fin < range_limit) ? res_set : (res_set ^ 1);
fx_ref  = (fin < range_limit) ? fx_ref  : fx + (fxw * (float)DEPTH_M_H / (float)DEPTH_BLOCK_X);
fin     = (fin < range_limit) ? fin : 0;
// the comments mark where nvvp attributes the instructions to the particular source lines
ISETP.GE.AND P2, PT, R34.reuse, c[0x0][0x160], PT; //res_set
FADD.FTZ R27, -R25, R4;
ISETP.LT.AND P0, PT, R34, c[0x0][0x160], PT; //fx_ref
IADD32I R10, R10, 0x1;
SHL R0, R9, 0x2;
SEL R4, R4, R27, P1;
ISETP.LT.AND P1, PT, R10, 0x5, PT;
IADD R33, R0, R26;
{ SEL R0, RZ, 0x1, !P2;
STS [R33], R58; }
{ FADD.FTZ R3, R3, 74.75;
STS [R33+0x8], R29; }
{ #!P0 FMUL.FTZ R28, R4, 14; //fx_ref
STS [R33+0x10], R30; }
{ IADD32I R24, R24, 0x1;
STS [R33+0x18], R31; }
{ LOP.XOR R9, R0, R9; //res_set
STS [R33+0x20], R32; }
{ SEL R0, R34, RZ, P0; //fin
STS [R33+0x28], R36; }
{ #!P0 FFMA.FTZ R2, R28, c[0x2][0x0], R3; //fx_ref
The inserted lines are from the next loop iteration calculation.
I think that when many instructions share the same predicate value, the ternary style may provide more opportunity for ILP optimization.

Cuda Mutex, why deadlock?

I am trying to implement an atomic-based mutex.
I succeeded, but I have one question about warps / deadlock.
This code works well.
bool blocked = true;
while (blocked) {
    if (0 == atomicCAS(&mLock, 0, 1)) {
        index = mSize++;
        doCriticJob();
        atomicExch(&mLock, 0);
        blocked = false;
    }
}
But this one doesn't...
while (true) {
    if (0 == atomicCAS(&mLock, 0, 1)) {
        index = mSize++;
        doCriticJob();
        atomicExch(&mLock, 0);
        break;
    }
}
I think it has to do with the position of the loop exit. In the first version, the exit happens where the condition is checked; in the second, it happens at the end of the if, so the thread that acquired the lock waits for the other threads of the warp to finish the loop, but the other threads are waiting for the first thread as well... But I think I am wrong, so please explain it to me if you can :).
Thanks !
There are other questions here on mutexes. You might want to look at some of them. Search on "cuda critical section", for example.
Assuming that one will work and one won't because it seemed to work for your test case is dangerous. Managing mutexes or critical sections, especially when the negotiation is among threads in the same warp, is notoriously difficult and fragile. The general advice is to avoid it. As discussed elsewhere, if you must use mutexes or critical sections, have a single thread in the threadblock negotiate for any thread that needs it, then control behavior within the threadblock using intra-threadblock synchronization mechanisms, such as __syncthreads().
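A minimal sketch of that pattern (my own illustration; gLock and the critical work are placeholders): one thread per block negotiates for a global lock, the block synchronizes around the critical section, and the same thread releases the lock, so there is never intra-warp contention for the lock.
__device__ int gLock = 0;                       // 0 = free, 1 = held

__global__ void blockLevelCritical(int *data)
{
    if (threadIdx.x == 0)
        while (atomicCAS(&gLock, 0, 1) != 0);   // only thread 0 of each block spins
    __syncthreads();                            // whole block enters the critical section

    if (threadIdx.x == 0)
        data[0] += 1;                           // placeholder critical work

    __threadfence();                            // make the writes visible to other blocks
    __syncthreads();

    if (threadIdx.x == 0)
        atomicExch(&gLock, 0);                  // the same thread releases the lock
}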
This question (IMO) can't really be answered without looking at the way the compiler is ordering the various paths of execution. Therefore we need to look at the SASS code (the machine code). You can use the cuda binary utilities to do this, and will probably want to refer to both the PTX reference as well as the SASS reference. This also means that you need a complete code, not just the snippets you've provided.
Here's my code for analysis:
$ cat t830.cu
#include <stdio.h>
__device__ int mLock = 0;
__device__ void doCriticJob(){
}
__global__ void kernel1(){
    int index = 0;
    int mSize = 1;
    while(true) {
        if(0 == atomicCAS(&mLock, 0, 1)) {
            index = mSize++;
            doCriticJob();
            atomicExch(&mLock, 0);
            break;
        }
    }
}
__global__ void kernel2(){
    int index = 0;
    int mSize = 1;
    bool blocked = true;
    while(blocked) {
        if(0 == atomicCAS(&mLock, 0, 1)) {
            index = mSize++;
            doCriticJob();
            atomicExch(&mLock, 0);
            blocked = false;
        }
    }
}
int main(){
    kernel2<<<4,128>>>();
    cudaDeviceSynchronize();
}
kernel1 is my representation of your deadlock code, and kernel2 is my representation of your "working" code. When I compile this on Linux under CUDA 7 and run it on a cc2.0 device (Quadro5000), the code deadlocks if I call kernel1, and it doesn't if I call kernel2 (as shown).
I use cuobjdump -sass to dump the machine code:
$ cuobjdump -sass ./t830
Fatbin elf code:
================
arch = sm_20
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit
code for sm_20
Fatbin elf code:
================
arch = sm_20
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit
code for sm_20
Function : _Z7kernel1v
.headerflags #"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/ MOV32I R4, 0x1; /* 0x1800000004011de2 */
/*0010*/ SSY 0x48; /* 0x60000000c0000007 */
/*0018*/ MOV R2, c[0xe][0x0]; /* 0x2800780000009de4 */
/*0020*/ MOV R3, c[0xe][0x4]; /* 0x280078001000dde4 */
/*0028*/ ATOM.E.CAS R0, [R2], RZ, R4; /* 0x54080000002fdd25 */
/*0030*/ ISETP.NE.AND P0, PT, R0, RZ, PT; /* 0x1a8e0000fc01dc23 */
/*0038*/ #P0 BRA 0x18; /* 0x4003ffff600001e7 */
/*0040*/ NOP.S; /* 0x4000000000001df4 */
/*0048*/ ATOM.E.EXCH RZ, [R2], RZ; /* 0x547ff800002fdd05 */
/*0050*/ EXIT; /* 0x8000000000001de7 */
............................
Function : _Z7kernel2v
.headerflags #"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/ MOV32I R0, 0x1; /* 0x1800000004001de2 */
/*0010*/ MOV32I R3, 0x1; /* 0x180000000400dde2 */
/*0018*/ MOV R4, c[0xe][0x0]; /* 0x2800780000011de4 */
/*0020*/ MOV R5, c[0xe][0x4]; /* 0x2800780010015de4 */
/*0028*/ ATOM.E.CAS R2, [R4], RZ, R3; /* 0x54061000004fdd25 */
/*0030*/ ISETP.NE.AND P1, PT, R2, RZ, PT; /* 0x1a8e0000fc23dc23 */
/*0038*/ #!P1 MOV R0, RZ; /* 0x28000000fc0025e4 */
/*0040*/ #!P1 ATOM.E.EXCH RZ, [R4], RZ; /* 0x547ff800004fe505 */
/*0048*/ LOP.AND R2, R0, 0xff; /* 0x6800c003fc009c03 */
/*0050*/ I2I.S32.S16 R2, R2; /* 0x1c00000008a09e84 */
/*0058*/ ISETP.NE.AND P0, PT, R2, RZ, PT; /* 0x1a8e0000fc21dc23 */
/*0060*/ #P0 BRA 0x18; /* 0x4003fffec00001e7 */
/*0068*/ EXIT; /* 0x8000000000001de7 */
............................
Fatbin ptx code:
================
arch = sm_20
code version = [4,2]
producer = cuda
host = linux
compile_size = 64bit
compressed
$
Considering a single warp, with either code, all threads must acquire the lock (via atomicCAS) once, in order for the code to complete successfully. With either code, only one thread in a warp can acquire the lock at any given time, and in order for other threads in the warp to (later) acquire the lock, that thread must have an opportunity to release it (via atomicExch).
The key difference between these realizations then, lies in how the compiler scheduled the atomicExch instruction with respect to conditional branches.
Let's consider the "deadlock" code (kernel1). In this case, the ATOM.E.EXCH instruction does not occur until after the one (and only) conditional branch (#P0 BRA 0x18;) instruction. A conditional branch in CUDA code represents a possible point of warp divergence, and execution after warp divergence is, to some degree, unspecified and up to the specifics of the machine. But given this uncertainty, it's possible that the thread that acquired the lock will wait for the other threads to complete their branches, before executing the atomicExch instruction, which means that the other threads will not have a chance to acquire the lock, and we have deadlock.
If we then compare that to the "working" code, we see that once the ATOM.E.CAS instruction is issued, there are no conditional branches in between that point and the point at which the ATOM.E.EXCH instruction is issued, thus releasing the lock just acquired. Since each thread that acquires the lock (via ATOM.E.CAS) will release it (via ATOM.E.EXCH) before any conditional branching occurs, there isn't any possibility (given this code realization) for the kind of deadlock witnessed previously (with kernel1) to occur.
(#P0 is a form of predication, and you can read about it in the PTX reference here to understand how it can lead to conditional branching.)
NOTE: I consider both of these codes to be dangerous, and possibly flawed. Even though the current tests don't seem to uncover a problem with the "working" code, I think it's possible that a future CUDA compiler might choose to schedule things differently and break that code. It's even possible that compiling for a different machine architecture might produce different code here. I consider a mechanism like the one described earlier (a single thread per threadblock negotiating for the lock) to be more robust, since it avoids intra-warp contention entirely. Even such a mechanism, however, can lead to inter-threadblock deadlocks. Any mutex must be used under specific programming and usage limitations.

The concept of branch (taken, not taken, diverged) in CUDA

In Nsight Visual Studio Edition, there is a graph presenting the statistics of "taken", "not taken", and "diverged" branches. I am confused about the difference between "not taken" and "diverged".
For example
kernel()
{
    if (tid % 32 != 31)
    { ... }
    else
    { ... }
}
In my opinion, when tid % 32 == 31 for a thread in a warp, divergence will happen, but what is "not taken"?
From the Nsight Visual Studio Edition User Guide:
Not Taken / Taken Total: number of executed branch instructions with a uniform control flow decision; that is all active threads of a warp either take or not take the branch.
Diverged: Total number of executed branch instruction for which the conditional resulted in different outcomes across the threads of the warp. All code paths with at least one participating thread get executed sequentially. Lower numbers are better, however, check the Flow Control Efficiency to understand the impact of control flow on the device utilization.
Now, let us consider the following simple code, which perhaps is what you are currently considering in your tests:
#include <thrust/device_vector.h>

__global__ void test_divergence(int* d_output) {
    int tid = threadIdx.x;
    if (tid % 32 != 31)
        d_output[tid] = tid;
    else
        d_output[tid] = 30000;
}

int main() {
    const int N = 32;
    thrust::device_vector<int> d_vec(N, 0);
    test_divergence<<<2,32>>>(thrust::raw_pointer_cast(d_vec.data()));
    return 0;
}
The Branch Statistics graph produced by Nsight shows that Taken is equal to 100%, since all the threads hit the if statement. The perhaps surprising result is that there is no divergence. This can be explained by taking a look at the disassembled code of the kernel function (compiled for a compute capability of 2.1):
MOV R1, c[0x1][0x100];
S2R R0, SR_TID.X;
SHR R2, R0, 0x1f;
IMAD.U32.U32.HI R2, R2, 0x20, R0;
LOP.AND R2, R2, -0x20;
ISUB R2, R0, R2;
ISETP.EQ.AND P0, PT, R2, 0x1f, PT;
ISCADD R2, R0, c[0x0][0x20], 0x2;
SEL R0, R0, 0x7530, !P0;
ST [R2], R0;
EXIT;
As you can see, the compiler is able to optimize the code so that no branching is present, except for the uniform branch due to the EXIT instruction, as pointed out by Greg Smith in the comments.
EDIT: A MORE COMPLEX EXAMPLE FOLLOWING GREG SMITH'S COMMENT
I'm now considering the following more complex example
/**************************/
/* TEST DIVERGENCE KERNEL */
/**************************/
__global__ void testDivergence(float *a, float *b)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < 16) a[tid] = tid + 1;
    else          b[tid] = tid + 2;
}

/********/
/* MAIN */
/********/
int main() {
    const int N = 64;

    float* d_a; cudaMalloc((void**)&d_a, N * sizeof(float));
    float* d_b; cudaMalloc((void**)&d_b, N * sizeof(float));

    testDivergence<<<2,32>>>(d_a, d_b);

    return 0;
}
The Branch Statistics graph for this case is discussed after the listing; this is the disassembled code:
MOV R1, c[0x1][0x100];
S2R R0, SR_CTAID.X; R0 = blockIdx.x
S2R R2, SR_TID.X; R2 = threadIdx.x
IMAD R0, R0, c[0x0][0x8], R2; R0 = threadIdx.x + blockIdx.x * blockDim.x
ISETP.LT.AND P0, PT, R0, 0x10, PT; Checks if R0 < 16 and puts the result in predicate register P0
/*0028*/ #P0 BRA.U 0x58; If P0 = true, jumps to line 58
#!P0 IADD R2, R0, 0x2; If P0 = false, R2 = R0 + 2
#!P0 ISCADD R0, R0, c[0x0][0x24], 0x2; If P0 = false, calculates address to store b[tid] in global memory
#!P0 I2F.F32.S32 R2, R2; "
#!P0 ST [R0], R2; "
/*0050*/ #!P0 BRA.U 0x78; If P0 = false, jumps to line 78
/*0058*/ #P0 IADD R2, R0, 0x1; R2 = R0 + 1
#P0 ISCADD R0, R0, c[0x0][0x20], 0x2;
#P0 I2F.F32.S32 R2, R2;
#P0 ST [R0], R2;
/*0078*/ EXIT;
As can be seen, there are now two BRA instructions in the disassembled code. From the graph, each warp hits 3 branches (one for the EXIT and two for the BRAs). Both warps have 1 taken branch, since all the threads uniformly reach the EXIT instruction. The first warp has 2 not-taken branches, since the two BRA paths are not followed uniformly across the warp's threads. The second warp has 1 not-taken branch and 1 taken branch, since all of its threads uniformly follow one of the two BRAs. I would say that diverged is again equal to zero because the instructions in the two paths are exactly the same, although performed on different operands.
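For contrast, here is a sketch of a kernel that typically does report diverged branches (my own example, not part of the profiled code above): the loop trip count, and therefore its backward branch, is data-dependent and differs between threads of the same warp.
__global__ void testRealDivergence(const int *work, float *out)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    float acc = 0.0f;
    for (int i = 0; i < work[tid]; ++i)   // per-thread trip count => divergent backward branch
        acc += sqrtf((float)(i + tid));
    out[tid] = acc;
}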

CUDA stack frame size increase by __forceinline__

When I declare device functions with __forceinline__, the linker outputs this information:
2> nvlink : info : Function properties for '_ZN3GPU4Flux4calcILj512EEEvv':
2> nvlink : info : used 28 registers, 456 stack, 15776 bytes smem, 320 bytes cmem[0], 0 bytes lmem
and without it the output is:
2> nvlink : info : Function properties for '_ZN3GPU4Flux4calcILj512EEEvv':
2> nvlink : info : used 23 registers, 216 stack, 15776 bytes smem, 320 bytes cmem[0], 0 bytes lmem
Why is the stack frame smaller when __forceinline__ is not used?
How important is it to keep the stack frame as small as possible?
Thank you for your help.
The main reason to reduce the stack frame is that the stack is allocated in local memory, which resides in off-chip device memory. This makes access to the stack (if not cached) slow.
To show this, let me make a simple example. Consider the case:
__device__ __noinline__ void func(float* d_a, float* test, int tid) {
    d_a[tid] = test[tid] * d_a[tid];
}

__global__ void kernel_function(float* d_a) {
    float test[16];
    test[threadIdx.x] = threadIdx.x;
    func(d_a, test, threadIdx.x);
}
Note that the __device__ function is declared __noinline__. In this case
ptxas : info : Function properties for _Z15kernel_functionPf
64 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 7 registers, 36 bytes cmem[0]
i.e., we have 64 bytes of stack frame. The corresponding disassembled code is
MOV R1, c[0x1][0x100];
ISUB R1, R1, 0x40;
S2R R6, SR_TID.X; R6 = ThreadIdx.x
MOV R4, c[0x0][0x20];
IADD R5, R1, c[0x0][0x4];
I2F.F32.U32 R2, R6; R2 = R6 (integer to float conversion)
ISCADD R0, R6, R1, 0x2;
STL [R0], R2; stores R2 to test[ThreadIdx.x]
CAL 0x50;
EXIT ; the __device__ function body follows
ISCADD R2, R6, R5, 0x2;
ISCADD R3, R6, R4, 0x2;
LD R2, [R2]; loads d_a[tid]
LD R0, [R3]; loads test[tid]
FMUL R0, R2, R0; d_a[tid] = d_a[tid]*test[tid]
ST [R3], R0; store the new value of d_a[tid] to global memory
RET ;
As you can see, test is stored to and loaded from the stack in local memory (which resides in off-chip device memory), forming the stack frame (16 floats = 64 bytes).
Now change the device function to
__device__ __forceinline__ void func(float* d_a, float* test, int tid) {
    d_a[tid] = test[tid] * d_a[tid];
}
that is, change the __device__ function from __noinline__ to __forceinline__. In this case, we have
ptxas : info : Compiling entry function '_Z15kernel_functionPf' for 'sm_20'
ptxas : info : Function properties for _Z15kernel_functionPf
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
i.e., we have an empty stack frame now. Indeed, the disassembled code becomes:
MOV R1, c[0x1][0x100];
S2R R2, SR_TID.X; R2 = ThreadIdx.x
ISCADD R3, R2, c[0x0][0x20], 0x2;
I2F.F32.U32 R2, R2; R2 = R2 (integer to float conversion)
LD R0, [R3]; R2 = d_a[ThreadIdx.x] (load from global memory)
FMUL R0, R2, R0; d_a[ThreadIdx.x] = d_a[ThreadIdx.x] * ThreadIdx.x
ST [R3], R0; stores the new value of d_a[ThreadIdx.x] to global memory
EXIT ;
As you can see, forcing the inlining enables the compiler to perform proper optimizations so that now test is fully discarded from the code.
In the above example, __forceinline__ has an effect that is opposite to what you are experiencing, which also shows that, without any further information, the first question cannot be answered.

How to force var4 into registers

I have the following representative code:
__global__ void func()
{
    register ushort4 result = make_ushort4(__float2half_rn(0.5), __float2half_rn(0.5),
                                           __float2half_rn(0.5), __float2half_rn(1.0));
}
When compiling, result is stored in local memory. Is it possible to force it into registers? Local memory is too slow for the intended application.
Furthermore, this result must be stored to an array of var4 elements. I would like these stores to be coalesced, like ((ushort4*)(output))[x + y * width] = result;. A solution without var4 is also an option.
A vector type should be compiled into registers if there are registers available to do so. Turning your snippet into something that will survive dead code removal:
__global__ void func(ushort4 *out)
{
    ushort4 result = make_ushort4(__float2half_rn(0.5), __float2half_rn(0.5),
                                  __float2half_rn(0.5), __float2half_rn(1.0));
    out[threadIdx.x + blockDim.x * blockIdx.x] = result;
}
and compiling it:
>nvcc -cubin -arch=sm_20 -Xptxas="-v" ushort4.cu
ushort4.cu
ushort4.cu
tmpxft_000010b8_00000000-3_ushort4.cudafe1.gpu
tmpxft_000010b8_00000000-10_ushort4.cudafe2.gpu
ptxas info : Compiling entry function '_Z4funcP7ushort4' for 'sm_20'
ptxas info : Function properties for _Z4funcP7ushort4
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 8 registers, 36 bytes cmem[0]
shows no spills (ie. local memory). Further, disassembling the resulting cubin file shows:
>cuobjdump --dump-sass ushort4.cubin
code for sm_20
Function : _Z4funcP7ushort4
/*0000*/ /*0x00005de428004404*/ MOV R1, c [0x1] [0x100];
/*0008*/ /*0x01101c041000cfc0*/ F2F.F16.F32 R0, 0x3f000;
/*0010*/ /*0x94009c042c000000*/ S2R R2, SR_CTAid_X;
/*0018*/ /*0x8400dc042c000000*/ S2R R3, SR_Tid_X;
/*0020*/ /*0x01111c041000cfe0*/ F2F.F16.F32 R4, 0x3f800;
/*0028*/ /*0x00915c041c000000*/ I2I.U16.U16 R5, R0;
/*0030*/ /*0x20209c0320064000*/ IMAD.U32.U32 R2, R2, c [0x0] [0x8], R3;
/*0038*/ /*0x40019c03280ac040*/ BFI R6, R0, 0x1010, R5;
/*0040*/ /*0x4041dc03280ac040*/ BFI R7, R4, 0x1010, R5;
/*0048*/ /*0x80201c6340004000*/ ISCADD R0, R2, c [0x0] [0x20], 0x3;
/*0050*/ /*0x00019ca590000000*/ ST.64 [R0], R6;
/*0058*/ /*0x00001de780000000*/ EXIT;
.................................
i.e. the ushort4 is stuffed into registers and then a 64-bit store is used to write the packed vector out to global memory. No local memory access to be seen.
So if you have convinced yourself that you have a vector value compiling into local memory, it is either because you have a kernel with a lot of register pressure, or because you are explicitly asking the compiler to put it there (the volatile keyword will do that), or you have misinterpreted what the compiler/assembler is telling you at compile time.
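To illustrate that last point with a sketch (hypothetical kernel, not from the build above): declaring the vector volatile forces the compiler to keep a memory-backed copy instead of registers, which will then show up as stack frame / local memory usage in the ptxas output.
__global__ void func_volatile(ushort4 *out)
{
    volatile ushort4 result;                 // volatile forces a local-memory backed copy
    result.x = 1; result.y = 2; result.z = 3; result.w = 4;
    out[threadIdx.x + blockDim.x * blockIdx.x] =
        make_ushort4(result.x, result.y, result.z, result.w);
}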
EDIT: Using the CUDA 4.0 release toolkit with Visual Studio Express 2008 and compiling on 32-bit Windows 7 for a compute 1.1 device gives:
>nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2011 NVIDIA Corporation
Built on Fri_May_13_02:42:40_PDT_2011
Cuda compilation tools, release 4.0, V0.2.1221
>cl.exe
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 15.00.30729.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
usage: cl [ option... ] filename... [ /link linkoption... ]
>nvcc -cubin -arch=sm_11 -Xptxas=-v ushort4.cu
ushort4.cu
ushort4.cu
tmpxft_00001788_00000000-3_ushort4.cudafe1.gpu
tmpxft_00001788_00000000-10_ushort4.cudafe2.gpu
ptxas info : Compiling entry function '_Z4funcP7ushort4' for 'sm_11'
ptxas info : Used 4 registers, 4+16 bytes smem
which again shows no local memory being used, the same result as the original build for the compute 2.0 target.