I have been given an exam question that I have partly solved, but I do not understand it completely.
Why is volatile used here? And for the missing expression,
my guess is that it must be switches >> 8.
When it comes to the translation into MIPS, I have some difficulty.
Eight switches are memory mapped to the memory address 0xabab0020, where the
least significant bit (index 0) represents switch number 1 and the bit with index 7
represents switch number 8. A bit value 1 indicates that the switch is on and
0 means that it is off. Write down the missing C code expression, such that the
while loop exits if the toggle switch number 8 is off.
volatile int * switches = (volatile int *) 0xabab0020;
volatile int * leds = (volatile int *) 0xabab0040;
while(/* MISSING C CODE EXPRESSION */){
*leds = (*switches >> 4) & 1;
}
Translate the complete C code above into MIPS assembly code, including the missing C code expression. You are not allowed to use pseudo instructions.
Without volatile, your code can legally be interpreted by the compiler as:
int * switches = (int *) 0xabab0020;
int * leds = (int *) 0xabab0040;
*leds = (*switches >> 4) & 1;
while(/* MISSING C CODE EXPRESSION */){
}
The volatile qualifier is an indication to the C compiler that the data at addresses switches and leds can be changed by another agent in the system. Without the volatile qualifier, the compiler would be allowed to optimize references to these variables away.
The problem description says the loop should run while bit 7 of *switches is set, i.e.: while ((*switches & 0x80) != 0)
Translating the code is left as an exercise for the reader.
volatile int * switches = (volatile int *) 0xabab0020;
volatile int * leds = (volatile int *) 0xabab0040;
while((*switches >> 7) & 1){
*leds = (*switches >> 4) & 1;
}
Translated to MIPS:
lui $s0,0xabab          # load the upper half of the switches address
ori $s0,$s0,0x0020      # $s0 = 0xabab0020
lui $s1,0xabab          # load the upper half of the leds address
ori $s1,$s1,0x0040      # $s1 = 0xabab0040
while:
lw $t0,0($s0)           # read *switches
srl $t0,$t0,7           # switch 8 is the bit with index 7
andi $t0,$t0,1          # clear the other bits, keep only that one
beq $t0,$0,done         # exit the loop if switch 8 is off
lw $t1,0($s0)           # read *switches again for the loop body
srl $t1,$t1,4
andi $t1,$t1,1
sw $t1,0($s1)           # *leds = (*switches >> 4) & 1
j while
done:
Related
I have translated this C program into MIPS assembly code.
So, here I want to know:
Have I done something wrong?
If my assembly code is correct, can I make it more concise by reducing the number of instructions? I have used 14 instructions here.
Here is my approach for converting this C program
This is my MIPS assembly code
#s0 = res
main:
addi $a0 , $0 , 27
addi $a1 , $0 , 3
jal division # call function
add $s0 , $v0 , $0 # res = returned value
division:
addi $t0 , $0 , 0 # $t0 = 0
addi $s0 , $0 , 0 # $s0 = val and val = 0
add $s1 , $0 , $a0 # $s1 = i and i = 27
loop:
slt $t1 , $s1 , $t0 # checking if i>0
bne $t1 , $0 , break # if $t1 = 0 then break
addi $s0 , $s0 , 1 # else val = val + 1
sub $s1 , $s1 , $a1 # i = i - y
add $v0 , $s0 , $0 # put return value in $v0
j loop
break:
jr $ra # return to caller
This is the high level C program
int main(){
int res;
res = division (27, 3);
}
int division(int x, int y)
{
int i;
int val = 0;
for (i = x; i>0; i = i-y){
val = val + 1;
}
return val;
}
Errors:
The relation is negated incorrectly. You have translated it as
for ( i = x; i >= 0; i -= y )
so it will loop one more time than you want (when the integer division is exact). You are aware that the loop-condition test requires negation for the if-goto-label style of assembly:
if ( ! (i > 0) ) goto break;
However, ! (i > 0) is i <= 0, whereas you have i < 0.
You don't properly terminate the program when main is finished, so execution will accidentally fall through into the division subroutine. The solution is to add an exit syscall at the end of main.
Suggestions:
The register usage is excessive, and using fewer registers will also decrease the number of instructions required. Further, using the $s registers without preserving their original values is a violation of the calling convention. However, the best solution is to simply avoid using them in that function — just use $a0 and $a1 directly instead of copying them elsewhere.
(It is ok to use s registers in main without preserving their original values, b/c main is at the top of the call chain — it has no caller — by contrast we might consider division as a general purpose function that could be called from any caller, and so should follow the calling convention.)
We want to minimize not only the total number of instructions, but also the number of instructions inside the loop. You're doing the work of "return val" by copying $s0 into $v0, but you do it on every iteration of the loop whereas it is only necessary once, so it is better left until after the loop. Better still, simply use $v0 for the counter (val) in the first place.
The slt instruction has an equivalent slti that is handy when comparing to a constant. With slti, no need to load a constant 0 into $t0 register. If you did need a constant 0 in a register, there is always the $0 register as well.
The MIPS instruction set has relational branch instructions that compare with zero, so you can branch on register <= 0 (or < 0) directly without slt/slti.
CUDA has popcount intrinsics for 32-bit and 64-bit types: __popc() and __popcll().
Does CUDA also have intrinsics to get the parity of 32-bit and 64-bit types?
(The parity refers to whether an integer has an even or odd number of 1-bits.)
For example, GCC has __builtin_parityl() for 64-bit integers.
And here's a C function that does the same thing:
#include <stdint.h>

static inline unsigned int parity64(uint64_t n){
    n ^= n >> 1;                                  /* fold adjacent bit pairs */
    n ^= n >> 2;                                  /* bit 0 of each nibble now holds that nibble's parity */
    n = (n & 0x1111111111111111ULL) * 0x1111111111111111ULL;
    return (unsigned int)((n >> 60) & 1);         /* top nibble accumulates the nibble parities */
}
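For a quick sanity check (assuming GCC or Clang, which provide __builtin_parityll), the function above can be compared against the builtin on a few sample values:
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumes parity64() from above is in scope. */
int main(void){
    uint64_t samples[] = {0x0ULL, 0x1ULL, 0x7ULL, 0xFFFFFFFFFFFFFFFFULL, 0x123456789ABCDEF0ULL};
    for (size_t i = 0; i < sizeof(samples)/sizeof(samples[0]); ++i){
        /* __builtin_parityll is a GCC/Clang extension, used here only for checking */
        assert(parity64(samples[i]) == (unsigned int)__builtin_parityll(samples[i]));
    }
    printf("parity64 agrees with __builtin_parityll on the samples\n");
    return 0;
}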
I'm not aware of a parity intrinsic for CUDA.
However you should be able to create a fairly simple function to do it using either the __popc() (32-bit unsigned case) or __popcll() (64-bit unsigned case) intrinsics.
For example, the following function should indicate whether the number of 1 bits in a 64-bit unsigned quantity is odd (true) or even (false):
__device__ bool my_parity(unsigned long long d){
    return (__popcll(d) & 1);
}
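A 32-bit counterpart (a sketch along the same lines, using the __popc() intrinsic) could look like:
__device__ bool my_parity32(unsigned int d){
    // __popc() returns the number of set bits; the low bit of that count is the parity
    return (__popc(d) & 1);
}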
The Maxwell architecture introduced a new PTX assembly instruction called LOP3, which according to the NVIDIA blog:
"Can save instructions when performing complex logic operations
on multiple inputs."
At GTC 2016, some CUDA developers managed to accelerate the atan2f function for the Tegra X1 processor (Maxwell) with such instructions.
However, the function below, defined within a .cu file, leads to undefined identifiers for __SET_LT and __LOP3_0xe2.
Do I have to define them in a .ptx file instead? If so, how?
float atan2f(const float dy, const float dx)
{
float flag, z = 0.0f;
__SET_LT(flag, fabsf(dy), fabsf(dx));
uint32_t m, t1 = 0x80000000;
float t2 = float(M_PI) / 2.0f;
__LOP3_0x2e(m, __float_as_int(dx), t1, __float_as_int(t2));
float w = flag * __int_as_float(m) + float(M_PI)/2.0f;
float Offset = copysignf(w, dy);
float t = fminf(fabsf(dx), fabsf(dy)) / fmaxf(fabsf(dx), fabsf(dy));
uint32_t r, b = __float_as_int(flag) << 2;
uint32_t mask = __float_as_int(dx) ^ __float_as_int(dy) ^ (~b);
__LOP3_0xe2(r, mask, t1, __float_as_int(t));
const float p = fabsf(__int_as_float(r)) - 1.0f;
return ((-0.0663f*(-p) + 0.311f) * (-p) + float(float(M_PI)/4.0)) * (*(float *)&r) + Offset;
}
Edit:
The macro defines are finally:
#define __SET_LT(D, A, B) asm("set.lt.f32.f32 %0, %1, %2;" : "=f"(D) : "f"(A), "f"(B))
#define __SET_GT(D, A, B) asm("set.gt.f32.f32 %0, %1, %2;" : "=f"(D) : "f"(A), "f"(B))
#define __LOP3_0x2e(D, A, B, C) asm("lop3.b32 %0, %1, %2, %3, 0x2e;" : "=r"(D) : "r"(A), "r"(B), "r"(C))
#define __LOP3_0xe2(D, A, B, C) asm("lop3.b32 %0, %1, %2, %3, 0xe2;" : "=r"(D) : "r"(A), "r"(B), "r"(C))
The lop3.b32 PTX instruction can perform a more-or-less arbitrary boolean (logical) operation on 3 variables A,B, and C.
In order to set the actual operation to be performed, we must provide a "lookup-table" immediate argument (immLut -- an 8-bit quantity). As indicated in the documentation, a method to compute the necessary immLut argument for a given operation F(A,B,C) is to substitute the values of 0xF0 for A, 0xCC for B, and 0xAA for C in the actual desired equation. For example suppose we want to compute:
F = (A || B) && (!C), i.e., (A or B) and (not C)
Then we would compute immLut argument by:
immLut = (0xF0 | 0xCC) & (~0xAA)
Note that the specified equation for F is a boolean equation, treating the arguments A,B, and C as boolean values, and producing a true/false result (F). However, the equation to compute immLut is a bitwise logical operation.
For the above example, immLut would have a computed value of 0x54
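As an illustration of that recipe, here is a small host-side sketch (make_immLut is a hypothetical helper name, not part of the CUDA toolkit) that builds the 8-bit lookup value by evaluating the desired boolean function on all eight input combinations; the bit index for a combination (A,B,C) is (A<<2)|(B<<1)|C, which is exactly what the 0xF0/0xCC/0xAA substitution encodes:
#include <stdio.h>

/* Hypothetical helper: builds the immLut byte for an arbitrary F(A,B,C). */
static unsigned char make_immLut(int (*F)(int, int, int)) {
    unsigned char lut = 0;
    for (int a = 0; a < 2; ++a)
        for (int b = 0; b < 2; ++b)
            for (int c = 0; c < 2; ++c)
                if (F(a, b, c))
                    lut |= (unsigned char)(1u << ((a << 2) | (b << 1) | c));
    return lut;
}

static int A_or_B_and_notC(int a, int b, int c) { return (a || b) && !c; }

int main(void) {
    printf("immLut = 0x%02x\n", make_immLut(A_or_B_and_notC)); /* prints 0x54 */
    return 0;
}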
If it's desired to use a PTX instruction in ordinary CUDA C/C++ code, probably the most common (and arguably easiest) method would be to use inline PTX. Inline PTX is documented, and there are other questions discussing how to use it (such as this one), so I'll not repeat that here.
Here is a worked example of the above example case. Note that this particular PTX instruction is only available on cc5.0 and higher architectures, so be sure to compile for at least that level of target.
$ cat t1149.cu
#include <stdio.h>
const unsigned char A_or_B_and_notC=((0xF0|0xCC)&(~0xAA));
__device__ int my_LOP_0x54(int A, int B, int C){
int temp;
asm("lop3.b32 %0, %1, %2, %3, 0x54;" : "=r"(temp) : "r"(A), "r"(B), "r"(C));
return temp;
}
__global__ void testkernel(){
printf("A=true, B=false, C=true, F=%d\n", my_LOP_0x54(true, false, true));
printf("A=true, B=false, C=false, F=%d\n", my_LOP_0x54(true, false, false));
printf("A=false, B=false, C=false, F=%d\n", my_LOP_0x54(false, false, false));
}
int main(){
printf("0x%x\n", A_or_B_and_notC);
testkernel<<<1,1>>>();
cudaDeviceSynchronize();
}
$ nvcc -arch=sm_50 -o t1149 t1149.cu
$ ./t1149
0x54
A=true, B=false, C=true, F=0
A=true, B=false, C=false, F=1
A=false, B=false, C=false, F=0
$
Since immLut is an immediate constant in PTX code, I know of no way using inline PTX to pass this as a function parameter - even if templating is used. Based on your provided link, it seems that the authors of that presentation also used a separately defined function for the specific desired immediate value -- presumably 0xE2 and 0x2E in their case. Also, note that I have chosen to write my function so that it returns the result of the operation as the function return value. The authors of the presentation you linked appear to be passing the return value back via a function parameter. Either method should be workable. (In fact, it appears they have written their __LOP3... codes as functional macros rather than ordinary functions.)
Also see here for a method of understanding how the 8 bit truthtable (immLut) works for LOP3 at the source code level.
Given this CUDA code, I am trying to perform bit-shifting operations, but the values they produce are zero, which should not be happening. Does anyone know how to fix this issue? Am I missing a CUDA header include?
Code
__device__ unsigned int FI( unsigned int in_data, unsigned int subkey,
unsigned int *KLi1, unsigned int *KLi2, unsigned int *KOi1, unsigned int *KOi2,
unsigned int *KOi3, unsigned int *KIi1, unsigned int *KIi2, unsigned int *KIi3) {
unsigned int nine, seven;
unsigned int S7[128] = {};
unsigned int S9[512] = {};
nine = (in_data>>7);
seven = (in_data&0x7F);
/* Now run the various operations */
nine = (unsigned int)(S9[nine] ^ seven);
seven = (unsigned int)(S7[seven] ^ (nine & 0x7F));
seven ^= (subkey>>9);
nine ^= (subkey&0x1FF);
nine = (unsigned int)(S9[nine] ^ seven);
seven = (unsigned int)(S7[seven] ^ (nine & 0x7F));
in_data = (unsigned int)((seven<<9) + nine);
return( in_data );
}
Breakpoint Analysis
Here is an example of a code snippet that shifts an unsigned int 7 places to the right. When I run my executable under cuda-gdb and break at that instruction, I observe that the value after shifting remains zero when it shouldn't. When I execute the same operation directly at the cuda-gdb command prompt, I get a non-zero value. Any suggestions or hints?
The variables nine and seven should be non-zero based on the value of in_data.
nine = (in_data>>7);
seven = (in_data&0x7F);
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (1,0,0), device 0, sm 0, warp 0, lane 1]
Breakpoint 1, FI (KLi1=0x3fffae0, KLi2=0x3fffb00, KOi1=0x3fffb20, KOi2=0x3fffb40, KOi3=0x3fffb60,
KIi1=0x3fffb80, KIi2=0x3fffba0, KIi3=0x3fffbc0, in_data=461, subkey=0) at kasumiOp.cu:61
61 nine = (in_data>>7);
(cuda-gdb) p in_data
$1 = 461
(cuda-gdb) step
62 seven = (in_data&0x7F);
(cuda-gdb) p nine
$2 = 0
(cuda-gdb) step
65 nine = (unsigned int)(S9[nine] ^ seven);
(cuda-gdb) p seven
$3 = 0
(cuda-gdb) p 461 >> 7
$4 = 3
(cuda-gdb) cuda thread
thread (1,0,0)
(cuda-gdb) p 561 & 0x7f
$5 = 49
(cuda-gdb) p 461 & 0x7f
$6 = 77
So in_data does hold a value. I will try a trivial example and see if I can reproduce the same behavior.
With the limited information provided (no code) I might take a guess:
the CUDA-GDB documentation states:
The GDB print command has been extended to decipher the location of any program variable and can be used to display the contents of any CUDA program variable including:
* data allocated via cudaMalloc()
* data that resides in various GPU memory regions, such as shared, local, and global memory
* special CUDA runtime variables, such as threadIdx
If in_data refers to a particular memory area, then it might be that you're dealing with memory pointers instead of real data.
Just my two cents though.
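As a side note, a minimal self-contained kernel along the lines of the "trivial example" mentioned in the question might look like the sketch below (all names and values here are made up for illustration); if it prints the expected non-zero results when run normally, the shift operations themselves are not the problem:
#include <cstdio>

// Hypothetical minimal repro: shift and mask a known value on the device.
__global__ void shift_test(unsigned int in_data) {
    unsigned int nine  = in_data >> 7;
    unsigned int seven = in_data & 0x7F;
    printf("in_data=%u nine=%u seven=%u\n", in_data, nine, seven);
}

int main() {
    shift_test<<<1, 1>>>(461);   // expect nine=3, seven=77
    cudaDeviceSynchronize();
    return 0;
}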
I was reading Joel's book, where he suggests this as an interview question:
Write a program to reverse the "ON" bits in a given byte.
I can only think of a solution using C.
I'm asking here so you can show me how to do it in a non-C way (if possible).
I claim trick question. :) Reversing all the bits would mean flipping them, but reversing only the bits that are on clearly means:
return 0;
What specifically does that question mean?
Good question. If reversing the "ON" bits means reversing only the bits that are "ON", then you will always get 0, no matter what the input is. If it means reversing all the bits, i.e. changing all 1s to 0s and all 0s to 1s, which is how I initially read it, then that's just a bitwise NOT, or complement. C-based languages have a complement operator, ~, that does this. For example:
unsigned char b = 102; /* 0x66, 01100110 */
unsigned char reverse = ~b; /* 0x99, 10011001 */
What specifically does that question mean?
Does reverse mean setting 1's to 0's and vice versa?
Or does it mean 00001100 --> 00110000 where you reverse their order in the byte? Or perhaps just reversing the part that is from the first 1 to the last 1? ie. 00110101 --> 00101011?
Assuming it means reversing the bit order in the whole byte, here's an x86 assembler version:
; al is input register
; bl is output register
xor bl, bl ; clear output
; first bit
rcl al, 1 ; rotate al through carry
rcr bl, 1 ; rotate carry into bl
; duplicate above 2-line statements 7 more times for the other bits
Not the most optimal solution; a table lookup (sketched below) is faster.
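A C sketch of that table-lookup approach (here the table is filled at startup; in practice it would typically be a precomputed constant array):
#include <stdio.h>

static unsigned char rev_table[256];

/* Build the 256-entry table by reversing each byte bit-by-bit once. */
static void init_rev_table(void) {
    for (int i = 0; i < 256; ++i) {
        unsigned char r = 0;
        for (int bit = 0; bit < 8; ++bit)
            if (i & (1 << bit))
                r |= (unsigned char)(1 << (7 - bit));
        rev_table[i] = r;
    }
}

int main(void) {
    init_rev_table();
    unsigned char b = 0x2A;                        /* 00101010 */
    printf("0x%02X -> 0x%02X\n", b, rev_table[b]); /* expect 0x54 (01010100) */
    return 0;
}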
Reversing the order of bits in C#:
byte ReverseByte(byte b)
{
byte r = 0;
for(int i=0; i<8; i++)
{
int mask = 1 << i;
int bit = (b & mask) >> i;
int reversedMask = bit << (7 - i);
r |= (byte)reversedMask;
}
return r;
}
I'm sure there are more clever ways of doing it, but in that precise case the interview question is meant to determine whether you know bitwise operations, so I guess this solution would work.
In an interview, the interviewer usually wants to know how you find a solution, what your problem-solving skills are, and whether your approach is clean or a hack. So don't come up with too clever a solution, because that will probably suggest you found it somewhere on the Internet beforehand. Don't try to fake that you don't know it either, pretending you just came up with the answer because you are a genius; that will be even worse if she figures out you are basically lying.
If you're talking about switching 1's to 0's and 0's to 1's, using Ruby:
n = 0b11001100
~n
If you mean reverse the order:
n = 0b11001100
eval("0b" + n.to_s(2).reverse)
If you mean counting the on bits, as mentioned by another user:
n = 123
count = 0
0.upto(7) { |i| count = count + n[i] }
♥ Ruby
I'm probably misremembering, but I thought that Joel's question was about counting the "on" bits rather than reversing them.
Here you go:
#include <stdio.h>
int countBits(unsigned char byte);
int main(){
FILE* out = fopen( "bitcount.c" ,"w");
int i;
fprintf(out, "#include <stdio.h>\n#include <stdlib.h>\n#include <time.h>\n\n");
fprintf(out, "int bitcount[256] = {");
for(i=0;i<256;i++){
fprintf(out, "%i", countBits((unsigned char)i));
if( i < 255 ) fprintf(out, ", ");
}
fprintf(out, "};\n\n");
fprintf(out, "int main(){\n");
fprintf(out, "srand ( time(NULL) );\n");
fprintf(out, "\tint num = rand() %% 256;\n");
fprintf(out, "\tprintf(\"The byte %%i has %%i bits set to ON.\\n\", num, bitcount[num]);\n");
fprintf(out, "\treturn 0;\n");
fprintf(out, "}\n");
fclose(out);
return 0;
}
int countBits(unsigned char byte){
unsigned char mask = 1;
int count = 0;
while(mask){
if( mask&byte ) count++;
mask <<= 1;
}
return count;
}
The classic Bit Hacks page has several (really very clever) ways to do this, but it's all in C. Any language derived from C syntax (notably Java) will likely have similar methods. I'm sure we'll get some Haskell versions in this thread ;)
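For instance, one trick I recall from that page reverses a byte with a 64-bit multiply and a modulus; it's worth double-checking the constants against the page itself before relying on them:
#include <stdio.h>

int main(void) {
    unsigned char b = 0x2A;  /* 00101010 */
    /* Bit Twiddling Hacks: reverse a byte in 3 operations (64-bit multiply and modulus) */
    unsigned char r = (unsigned char)(((b * 0x0202020202ULL) & 0x010884422010ULL) % 1023);
    printf("0x%02X -> 0x%02X\n", b, r);  /* expect 0x54 (01010100) */
    return 0;
}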
byte ReverseByte(byte b)
{
return b ^ 0xff;
}
That works if ^ is XOR in your language, but not if it's AND, which it often is.
And here's a version directly cut and pasted from OpenJDK, which is interesting because it involves no loop. On the other hand, unlike the Scheme version I posted, this version only works for 32-bit and 64-bit numbers. :-)
32-bit version:
public static int reverse(int i) {
// HD, Figure 7-1
i = (i & 0x55555555) << 1 | (i >>> 1) & 0x55555555;
i = (i & 0x33333333) << 2 | (i >>> 2) & 0x33333333;
i = (i & 0x0f0f0f0f) << 4 | (i >>> 4) & 0x0f0f0f0f;
i = (i << 24) | ((i & 0xff00) << 8) |
((i >>> 8) & 0xff00) | (i >>> 24);
return i;
}
64-bit version:
public static long reverse(long i) {
// HD, Figure 7-1
i = (i & 0x5555555555555555L) << 1 | (i >>> 1) & 0x5555555555555555L;
i = (i & 0x3333333333333333L) << 2 | (i >>> 2) & 0x3333333333333333L;
i = (i & 0x0f0f0f0f0f0f0f0fL) << 4 | (i >>> 4) & 0x0f0f0f0f0f0f0f0fL;
i = (i & 0x00ff00ff00ff00ffL) << 8 | (i >>> 8) & 0x00ff00ff00ff00ffL;
i = (i << 48) | ((i & 0xffff0000L) << 16) |
((i >>> 16) & 0xffff0000L) | (i >>> 48);
return i;
}
Pseudo code:
while (Read())
Write(0);
I'm probably misremembering, but I thought that Joel's question was about counting the "on" bits rather than reversing them.
Here's the obligatory Haskell solution for complementing the bits; it uses the library function complement:
import Data.Bits
import Data.Int
i = 123::Int
i32 = 123::Int32
i64 = 123::Int64
var2 = 123::Integer
test1 = sho i
test2 = sho i32
test3 = sho i64
test4 = sho var2 -- Exception
sho i = putStrLn $ showBits i ++ "\n" ++ (showBits $ complement i)
showBits v = concatMap f (showBits2 v) where
f False = "0"
f True = "1"
showBits2 v = map (testBit v) [0..(bitSize v - 1)]
If the question means to flip all the bits, and you aren't allowed to use C-like operators such as XOR and NOT, then this will work:
bFlipped = -1 - bInput;
(In two's complement arithmetic, ~b equals -1 - b, which is why this gives the same result as a bitwise NOT.)
I'd modify palmsey's second example, eliminating a bug and eliminating the eval:
n = 0b11001100
n.to_s(2).rjust(8, '0').reverse.to_i(2)
The rjust is important if the number to be bitwise-reversed is a fixed-length bit field -- without it, the reverse of 0b00101010 would be 0b10101 rather than the correct 0b01010100. (Obviously, the 8 should be replaced with the length in question.) I just got tripped up by this one.
Asking here so you can show me how to do in a Non C way (if possible)
Say you have the number 10101010. To change 1s to 0s (and vice versa) you just use XOR:
10101010
^11111111
--------
01010101
Doing it by hand is about as "Non C" as you'll get.
However from the wording of the question it really sounds like it's only turning off "ON" bits... In which case the answer is zero (as has already been mentioned) (unless of course the question is actually asking to swap the order of the bits).
Since the question asked for a non-C way, here's a Scheme implementation, cheerfully plagiarised from SLIB:
(define (bit-reverse k n)
(do ((m (if (negative? n) (lognot n) n) (arithmetic-shift m -1))
(k (+ -1 k) (+ -1 k))
(rvs 0 (logior (arithmetic-shift rvs 1) (logand 1 m))))
((negative? k) (if (negative? n) (lognot rvs) rvs))))
(define (reverse-bit-field n start end)
(define width (- end start))
(let ((mask (lognot (ash -1 width))))
(define zn (logand mask (arithmetic-shift n (- start))))
(logior (arithmetic-shift (bit-reverse width zn) start)
(logand (lognot (ash mask start)) n))))
Rewritten as C (for people unfamiliar with Scheme), it'd look something like this (with the understanding that in Scheme, numbers can be arbitrarily big):
int
bit_reverse(int k, int n)
{
int m = n < 0 ? ~n : n;
int rvs = 0;
while (--k >= 0) {
rvs = (rvs << 1) | (m & 1);
m >>= 1;
}
return n < 0 ? ~rvs : rvs;
}
int
reverse_bit_field(int n, int start, int end)
{
int width = end - start;
int mask = ~(-1 << width);
int zn = mask & (n >> start);
return (bit_reverse(width, zn) << start) | (~(mask << start) & n);
}
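For a quick usage check of the C version above (the expected outputs were worked out by hand, so treat them as illustrative):
#include <stdio.h>

/* Assumes bit_reverse() and reverse_bit_field() from above are in scope. */
int main(void)
{
    /* Reverse all 8 low bits of 0x6B (01101011): expect 0xD6 (11010110). */
    printf("0x%X\n", reverse_bit_field(0x6B, 0, 8));
    /* Reverse only bits 4..7 of 0x1234 (nibble 0x3 becomes 0xC): expect 0x12C4. */
    printf("0x%X\n", reverse_bit_field(0x1234, 4, 8));
    return 0;
}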
Reversing the bits.
For example, say we have the number represented by 01101011. If we reverse the bits, this number becomes 11010110. To achieve this, you should first know how to swap two bits in a number.
Swapping two bits in a number:
XOR the two bits with each other and see if they differ. If they do not, both bits are the same and nothing needs to change; otherwise toggle both of them (XOR each with 1) and store the result back into the number.
Now, to reverse the number (see the C sketch below):
FOR I from 0 to NumberOfBits/2 - 1
    swap(Number, I, NumberOfBits-1-I);
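A minimal C sketch of that approach, assuming the helper names swap_bits and reverse_bits (both made up for this example):
#include <stdio.h>

/* Toggle bits i and j if and only if they differ. */
static unsigned char swap_bits(unsigned char n, int i, int j) {
    if (((n >> i) & 1) != ((n >> j) & 1))
        n ^= (unsigned char)((1u << i) | (1u << j));
    return n;
}

static unsigned char reverse_bits(unsigned char n) {
    for (int i = 0; i < 8 / 2; ++i)
        n = swap_bits(n, i, 8 - 1 - i);
    return n;
}

int main(void) {
    printf("0x%02X -> 0x%02X\n", 0x6B, reverse_bits(0x6B)); /* 01101011 -> 11010110 (0xD6) */
    return 0;
}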