How does QEMU risc-v put FDT address in register a1? - qemu

From RISC-V OpenSBI's source code and documents, in OpenSBI firmware a1 preserves FDT address from the prior booting stage, which I guess is QEMU if the following command is used to boot OpenSBI firmware:
qemu-system-riscv64 -M virt -m 256M -nographic -bios build/platform/generic/firmware/fw_payload.bin
and OpenSBI firmware is built with
make PLATFORM=generic CROSS_COMPILE=riscv64-linux-gnu-
fw_base.S in OpenSBI's source code will use the value of a1 to invoke fw_platform_init, which assumes a1 contains the FDT address.
My question is when and how a1 is set before fw_base.S?

a1 is set by function riscv_setup_rom_reset_vec(...) which in qemu/hw/riscv/boot.c inserts some instructions to set FDT address:
/* reset vector */
uint32_t reset_vec[10] = {
0x00000297, /* 1: auipc t0, %pcrel_hi(fw_dyn) */
0x02828613, /* addi a2, t0, %pcrel_lo(1b) */
0xf1402573, /* csrr a0, mhartid */
0,
0,
0x00028067, /* jr t0 */
start_addr, /* start: .dword */
start_addr_hi32,
fdt_load_addr, /* fdt_laddr: .dword */
0x00000000,
/* fw_dyn: */
};
if (riscv_is_32bit(harts)) {
reset_vec[3] = 0x0202a583; /* lw a1, 32(t0) */
reset_vec[4] = 0x0182a283; /* lw t0, 24(t0) */
} else {
reset_vec[3] = 0x0202b583; /* ld a1, 32(t0) */
reset_vec[4] = 0x0182b283; /* ld t0, 24(t0) */
}
When qemu runs, it will first jump to 0x1000 to execute these instructions and then jump to _start.

Related

Is there a way to access value of constant memory bank in CUDA

I have been trying to debug cuda programs that use inline PTX assembly. Specifically, I am debugging at the instruction level, and am trying to determine the values of arguments to the instructions. Occasionally, the disassembly includes a reference to constant memory. I am trying to have gdb print the value of this constant memory, but have not found any documentation that shows how to do this.
For instance, a disassembly includes
IADD R0, R0, c[0x0] [0x148]
I want to determine how to have gdb print the value of c[0x0] [0x148]. I have tried using print * (#constant) ... but this does not seem to work (I pass 0x148 here and it prints out nothing). Is this possible to do in cuda-gdb?
I have tried to avoid this by passing the compiler option --disable-optimizer-constants during compilation, but this does not work.
The way to do this is to
print *(void * #parameter *) addr
where addr is the address inside the constant bank 0 that should be printed.
Example
Suppose we have a simple kernel in a file called foo.cu:
#include <cuda.h>
#include <stdio.h>
#include <cuda_runtime.h>
__global__ void myKernel(int a, int b, int *d)
{
*d = a + b;
}
int main(int argc, char *argv[]) {
if (argc < 3) {
printf("Requires inputs a and b to be specified\n");
return 0;
}
int * dev_d;
int d;
cudaMalloc(&dev_d, sizeof(*dev_d));
myKernel<<<1, 1>>>(atoi(argv[1]), atoi(argv[2]), dev_d);
cudaMemcpy(&d, dev_d, sizeof(d), cudaMemcpyDeviceToHost);
cudaFree(dev_d);
printf("D is: %d\n", d);
return 0;
}
which is compiled via
$ nvcc foo.cu -o foo.out
Next, suppose we are interested in disassembling this program, so we execute cuda-gdb with a command-line for our program:
$ cuda-gdb --args ./foo.out 10 15
Inside cuda-gdb, we get to the kernel by typing
(cuda-gdb) set cuda break_on_launch application
(cuda-gdb) start
Temporary breakpoint 1, 0x000055555555b12a in main ()
(cuda-gdb) cont
Inside the kernel, we view the disassembly we are interested in debugging:
(cuda-gdb) x/15i $pc
=> 0x555555b790a8 <_Z8myKerneliiPi+8>: MOV R1, c[0x0][0x20]
0x555555b790b0 <_Z8myKerneliiPi+16>: MOV R0, c[0x0][0x144]
0x555555b790b8 <_Z8myKerneliiPi+24>: MOV R2, c[0x0][0x148]
0x555555b790c0 <_Z8myKerneliiPi+32>:
0x555555b790c8 <_Z8myKerneliiPi+40>: MOV R3, c[0x0][0x14c]
0x555555b790d0 <_Z8myKerneliiPi+48>: IADD R0, R0, c[0x0][0x140]
0x555555b790d8 <_Z8myKerneliiPi+56>: STG.E [R2], R0
0x555555b790e0 <_Z8myKerneliiPi+64>:
0x555555b790e8 <_Z8myKerneliiPi+72>: NOP
0x555555b790f0 <_Z8myKerneliiPi+80>: NOP
0x555555b790f8 <_Z8myKerneliiPi+88>: NOP
0x555555b79100 <_Z8myKerneliiPi+96>:
0x555555b79108 <_Z8myKerneliiPi+104>: EXIT
0x555555b79110 <_Z8myKerneliiPi+112>: BRA 0x70
0x555555b79118 <_Z8myKerneliiPi+120>: NOP
The second argument being passed to the IADD instruction is in one of the constant memory banks. Let's find out what its value actually is. We advance go to the IADD instruction:
(cuda-gdb) stepi 4
0x0000555555b790d0 in myKernel(int, int, int*)<<<(1,1,1),(1,1,1)>>> ()
(cuda-gdb) x/i $pc
=> 0x555555b790d0 <_Z8myKerneliiPi+48>: IADD R0, R0, c[0x0][0x140]
We can now obtain the contents of c[0x0][0x140] as follows:
(cuda-gdb) print (int) *(void * #parameter *) 0x140
$1 = 10
Here, we knew the argument should have 32 bits, so we cast it as an (32-bit) int. If we hadn't done this, we would get too many bits, e.g.:
(cuda-gdb) print *(void * #parameter *) 0x140
$2 = 0xf0000000a
Note the hexadecimal format can be retained by adding /x after the print command:
(cuda-gdb) print/x (int) *(void * #parameter *)0x140
$3 = 0xa

MIPS instructions - addiu $sp, $sp, -4 [duplicate]

Currently I'm studying for my computer organization midterm, and I'm trying to fully understand the stack pointer and the stack. I know these following facts that surround the concept:
It follows the first in last out principle
And adding something to the stack takes a two step process:
addi $sp, $sp, -4
sw $s0, 0($sp)
What I think is stopping me from fully understanding is that I can't come up with a relevant, self apparent situation where I would need and/or want to keep track of data with a stack pointer.
Could someone elaborate on the concept as a whole and give me some useful code examples?
An important use of stack is nesting subroutine calls.
Each subroutine may have a set of variables local to that subroutine. These variables can be conveniently stored on a stack in a stack frame. Some calling conventions pass arguments on the stack as well.
Using subroutines also means you have to keep track of the caller, that is the return address.
Some architectures have a dedicated stack for this purpose, while others implicitly use the "normal" stack. MIPS by default only uses a register, but in non-leaf functions (ie. functions that call other functions) that return address is overwritten. Hence you have to save the original value, typically on the stack among your local variables. The calling conventions may also declare that some register values must be preserved across function calls, you can similarly save and restore them using the stack.
Suppose you have this C fragment:
extern void foo();
extern int bar();
int baz()
{
int x = bar();
foo();
return x;
}
MIPS assembly may then look like:
addiu $sp, $sp, -8 # allocate 2 words on the stack
sw $ra, 4($sp) # save $ra in the upper one
jal bar # this overwrites $ra
sw $v0, ($sp) # save returned value (x)
jal foo # this overwrites $ra and possibly $v0
lw $v0, ($sp) # reload x so we can return it
lw $ra, 4($sp) # reload $ra so we can return to caller
addiu $sp, $sp, 8 # restore $sp, freeing the allocated space
jr $ra # return
The MIPS calling convention requires first four function parameters to be in registers a0 through a3 and the rest, if there are more, on the stack. What's more, it also requires the function caller to allocate four slots on the stack for the first four parameters, despite those being passed in the registers.
So, if you want to access parameter five (and further parameters), you need to use sp. If the function in turn calls other functions and uses its parameters after the calls, it will need to store a0 through a3 in those four slots on the stack to avoid them being lost/overwritten. Again, you use sp to write these registers to the stack.
If your function has local variables and can't keep all of them in registers (like when it can't keep a0 through a3 when it calls other functions), it will have to use the on-stack space for those local variables, which again necessitates the use of sp.
For example, if you had this:
int tst5(int x1, int x2, int x3, int x4, int x5)
{
return x1 + x2 + x3 + x4 + x5;
}
its disassembly would be something like:
tst5:
lw $2,16($sp) # r2 = x5; 4 slots are skipped
addu $4,$4,$5 # x1 += x2
addu $4,$4,$6 # x1 += x3
addu $4,$4,$7 # x1 += x4
j $31 # return
addu $2,$4,$2 # r2 += x1
See, sp is used to access x5.
And then if you have code something like this:
int binary(int a, int b)
{
return a + b;
}
void stk(void)
{
binary(binary(binary(1, 2), binary(3, 4)), binary(binary(5, 6), binary(7, 8)));
}
this is what it looks in disassembly after compilation:
binary:
j $31 # return
addu $2,$4,$5 # r2 = a + b
stk:
subu $sp,$sp,32 # allocate space for local vars & 4 slots
li $4,0x00000001 # 1
li $5,0x00000002 # 2
sw $31,24($sp) # store return address on stack
sw $17,20($sp) # preserve r17 on stack
jal binary # call binary(1,2)
sw $16,16($sp) # preserve r16 on stack
li $4,0x00000003 # 3
li $5,0x00000004 # 4
jal binary # call binary(3,4)
move $16,$2 # r16 = binary(1,2)
move $4,$16 # r4 = binary(1,2)
jal binary # call binary(binary(1,2), binary(3,4))
move $5,$2 # r5 = binary(3,4)
li $4,0x00000005 # 5
li $5,0x00000006 # 6
jal binary # call binary(5,6)
move $17,$2 # r17 = binary(binary(1,2), binary(3,4))
li $4,0x00000007 # 7
li $5,0x00000008 # 8
jal binary # call binary(7,8)
move $16,$2 # r16 = binary(5,6)
move $4,$16 # r4 = binary(5,6)
jal binary # call binary(binary(5,6), binary(7,8))
move $5,$2 # r5 = binary(7,8)
move $4,$17 # r4 = binary(binary(1,2), binary(3,4))
jal binary # call binary(binary(binary(1,2), binary(3,4)), binary(binary(5,6), binary(7,8)))
move $5,$2 # r5 = binary(binary(5,6), binary(7,8))
lw $31,24($sp) # restore return address from stack
lw $17,20($sp) # restore r17 from stack
lw $16,16($sp) # restore r16 from stack
addu $sp,$sp,32 # remove local vars and 4 slots
j $31 # return
nop
I hope I've annotated the code without making mistakes.
So, note that the compiler chooses to use r16 and r17 in the function but preserves them on the stack. Since the function calls another one, it also needs to preserve its return address on the stack instead of simply keeping it in r31.
PS Remember that all branch/jump instructions on MIPS effectively execute the immediately following instruction before actually transferring control to a new location. This might be confusing.

Is there a way to accelerate the resize image performance of GraphicsMagick on mips CPU?

I'm using a netgear WNDR4300 to run GraphicsMaigick to resize JPEG images.
This is the cpuinfo,
cat /proc/cpuinfo
system type : Atheros AR9344 rev 2
machine : NETGEAR WNDR4300
processor : 0
cpu model : MIPS 74Kc V4.12
BogoMIPS : 278.93
wait instruction : yes
microsecond timers : yes
tlb_entries : 32
extra interrupt vector : yes
hardware watchpoint : yes, count: 4, address/irw mask: [0x0ffc, 0x0ffc, 0x0ffb, 0x0ffb]
isa : mips1 mips2 mips32r1 mips32r2
ASEs implemented : mips16 dsp dsp2
shadow register sets : 1
kscratch registers : 0
package : 0
core : 0
VCED exceptions : not available
VCEI exceptions : not available
It's very slow, and took over 4 minuntes for every photo.
root#OpenWrt:/mnt/sda1/media/# ll
drwxrwxrwx 2 root root 131072 Jun 10 14:07 ./
drwxrwxrwx 7 root root 131072 Jun 8 10:18 ../
-rwxrwxrwx 1 root root 4130026 Feb 10 2018 IMG_20180210_115807.jpg*
root#OpenWrt:/mnt/sda1/media/kidds# time gm mogrify -output-directory > /mnt/sda1/web -resize 64x64 -quality 75 IMG_20170628_100052.jpg
real 4m 19.72s
user 3m 21.86s
sys 0m 9.29s
UPDATE:
root#OpenWrt:~# gm identify -version
GraphicsMagick 1.3.31 2018-11-17 Q8 http://www.GraphicsMagick.org/
Copyright (C) 2002-2018 GraphicsMagick Group.
Additional copyrights and licenses apply to this software.
See http://www.GraphicsMagick.org/www/Copyright.html for details.
Feature Support:
Native Thread Safe yes
Large Files (> 32 bit) yes
Large Memory (> 32 bit) no
BZIP no
DPS no
FlashPix no
FreeType yes
Ghostscript (Library) no
JBIG no
JPEG-2000 no
JPEG yes
Little CMS no
Loadable Modules yes
OpenMP no
PNG yes
TIFF yes
TRIO no
UMEM no
WebP no
WMF no
X11 no
XML no
ZLIB yes
Host type: mips-openwrt-linux-gnu
Configured using the command:
./configure '--target=mips-openwrt-linux' '--host=mips-openwrt-linux' '--build=x86_64-pc-linux-gnu' '--program-prefix=' '--program-suffix=' '--prefix=/usr' '--exec-prefix=/usr' '--bindir=/usr/bin' '--sbindir=/usr/sbin' '--libexecdir=/usr/lib' '--sysconfdir=/etc' '--datadir=/usr/share' '--localstatedir=/var' '--mandir=/usr/man' '--infodir=/usr/info' '--disable-nls' '--enable-shared' '--disable-static' '--enable-dependency-tracking' '--with-modules' '--with-threads' '--without-magick-plus-plus' '--without-perl' '--without-bzlib' '--without-dps' '--without-fpx' '--without-jbig' '--without-webp' '--with-jpeg' '--without-jp2' '--without-lcms2' '--without-lzma' '--with-png' '--with-tiff' '--without-trio' '--with-ttf' '--without-umem' '--without-wmf' '--without-xml' '--with-zlib' '--without-zstd' '--without-x' 'build_alias=x86_64-pc-linux-gnu' 'host_alias=mips-openwrt-linux' 'target_alias=mips-openwrt-linux' 'CC=mips-openwrt-linux-musl-gcc' 'CFLAGS=-Os -pipe -mno-branch-likely -mips32r2 -mtune=24kc -fno-caller-save
Final Build Parameters:
CC = mips-openwrt-linux-musl-gcc
CFLAGS = -Os -pipe -mno-branch-likely -mips32r2 -mtune=24kc -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -msoft-float -mips16 -minterlink-mips16 -iremap/opt/buildbot/slaves/lede-slave-tah/mips_24kc/build/sdk/build_dir/target-mips_24kc_musl/GraphicsMagick-1.3.31:GraphicsMagick-1.3.31 -Wformat -Werror=format-security -fstack-protector -D_FORTIFY_SOURCE=1 -Wl,-z,now -Wl,-z,relro -flto -Wall
CPPFLAGS = -I/opt/buildbot/slaves/lede-slave-tah/mips_24kc/build/sdk/staging_dir/target-mips_24kc_musl/usr/include -I/opt/buildbot/slaves/lede-slave-tah/mips_24kc/build/sdk/staging_dir/target-mips_24kc_musl/include -I/opt/buildbot/slaves/lede-slave-tah/mips_24kc/build/sdk/staging_dir/toolchain-mips_24kc_gcc-7.4.0_musl/usr/include -I/opt/buildbot/slaves/lede-slave-tah/mips_24kc/build/sdk/staging_dir/toolchain-mips_24kc_gcc-7.4.0_musl/include/fortify -I/opt/buildbot/slaves/lede-slave-tah/mips_24kc/build/sdk/staging_dir/toolchain-mips_24kc_gcc-7.4.0_musl/include -I/opt/buildbot/slaves/lede-slave-tah/mips_24kc/build/sdk/staging_dir/target-mips_24kc_musl/usr/include/freetype2
CXX = mips-openwrt-linux-musl-g++
CXXFLAGS = -Os -pipe -mno-branch-likely -mips32r2 -mtune=24kc -fno-caller-saves -fno-plt -fhonour-copts -Wno-error=unused-but-set-variable -Wno-error=unused-result -msoft-float -mips16 -minterlink-mips16 -iremap/opt/buildbot/slaves/lede-slave-tah/mips_24kc/build/sdk/build_dir/target-mips_24kc_musl/GraphicsMagick-1.3.31:GraphicsMagick-1.3.31 -Wformat -Werror=format-security -fstack-protector -D_FORTIFY_SOURCE=1 -Wl,-z,now -Wl,-z,relro -flto
LDFLAGS = -L/opt/buildbot/slaves/lede-slave-tah/mips_24kc/build/sdk/staging_dir/target-mips_24kc_musl/usr/lib -L/opt/buildbot/slaves/lede-slave-tah/mips_24kc/build/sdk/staging_dir/target-mips_24kc_musl/lib -L/opt/buildbot/slaves/lede-slave-tah/mips_24kc/build/sdk/staging_dir/toolchain-mips_24kc_gcc-7.4.0_musl/usr/lib -L/opt/buildbot/slaves/lede-slave-tah/mips_24kc/build/sdk/staging_dir/toolchain-mips_24kc_gcc-7.4.0_musl/lib -znow -zrelro -L/opt/buildbot/slaves/lede-slave-tah/mips_24kc/build/sdk/staging_dir/target-mips_24kc_musl/usr/lib
LIBS = -lfreetype -lz -lltdl -lm -lpthread
UPDATE2: sample mode
root#OpenWrt:/mnt/sda1/media/kidds# time gm mogrify -output-directory /mnt/sda1/web -sample 64x64 -quality 75 IMG_20170628_100052.jpg
real 0m 20.76s
user 0m 8.70s
sys 0m 3.29s
Is there any way to improve the performance? I just want resize some photoes, if there is some library provided by MIPS, I can write a tool to call it using C language.
I grabbed the file example.c from the v9c source on the IJG website and quickly hacked it about as below to only write every 8th row and every 8th column from your 4640x3480 input file. It creates the following 580x435 image which GraphicsMagick can then presumably manage to resize down to 64x64 in reasonable time since the file is 64x smaller.
You would use:
gm convert result.ppm -resize 64x64 thumb.jpg
It only uses 1.1MB of RAM because it only reads 1 line at a time - I tested using:
/usr/bin/time -l ./example photo.jpg
You could experiment with the decimation factor, I used 8, but you could try 4, or 10 and see the tradeoff between time and quality. Just change the 8 on the second-to-last line of the file.
The code looks like this - I have added "HACK" in the comments so you can see where I was busy making ugly dirty code:
#include <stdio.h>
#include <stdlib.h>
#include "jpeglib.h"
#include <setjmp.h>
struct my_error_mgr {
struct jpeg_error_mgr pub; /* "public" fields */
jmp_buf setjmp_buffer; /* for return to caller */
};
typedef struct my_error_mgr * my_error_ptr;
METHODDEF(void)
my_error_exit (j_common_ptr cinfo)
{
/* cinfo->err really points to a my_error_mgr struct, so coerce pointer */
my_error_ptr myerr = (my_error_ptr) cinfo->err;
/* Always display the message. */
/* We could postpone this until after returning, if we chose. */
(*cinfo->err->output_message) (cinfo);
/* Return control to the setjmp point */
longjmp(myerr->setjmp_buffer, 1);
}
GLOBAL(int)
read_JPEG_file (char * filename,int factor)
{
/* This struct contains the JPEG decompression parameters and pointers to
* working space (which is allocated as needed by the JPEG library).
*/
struct jpeg_decompress_struct cinfo;
/* We use our private extension JPEG error handler.
* Note that this struct must live as long as the main JPEG parameter
* struct, to avoid dangling-pointer problems.
*/
struct my_error_mgr jerr;
/* More stuff */
FILE * infile; /* source file */
FILE * outfile; /* result file */ // HACK
JSAMPARRAY buffer; /* Output row buffer */
int row_stride; /* physical row width in output buffer */
/* In this example we want to open the input file before doing anything else,
* so that the setjmp() error recovery below can assume the file is open.
* VERY IMPORTANT: use "b" option to fopen() if you are on a machine that
* requires it in order to read binary files.
*/
if ((infile = fopen(filename, "rb")) == NULL) {
fprintf(stderr, "can't open %s\n", filename);
return 0;
}
/* Step 1: allocate and initialize JPEG decompression object */
/* We set up the normal JPEG error routines, then override error_exit. */
cinfo.err = jpeg_std_error(&jerr.pub);
jerr.pub.error_exit = my_error_exit;
/* Establish the setjmp return context for my_error_exit to use. */
if (setjmp(jerr.setjmp_buffer)) {
/* If we get here, the JPEG code has signaled an error.
* We need to clean up the JPEG object, close the input file, and return.
*/
jpeg_destroy_decompress(&cinfo);
fclose(infile);
return 0;
}
/* Now we can initialize the JPEG decompression object. */
jpeg_create_decompress(&cinfo);
/* Step 2: specify data source (eg, a file) */
jpeg_stdio_src(&cinfo, infile);
/* Step 3: read file parameters with jpeg_read_header() */
(void) jpeg_read_header(&cinfo, TRUE);
/* We can ignore the return value from jpeg_read_header since
* (a) suspension is not possible with the stdio data source, and
* (b) we passed TRUE to reject a tables-only JPEG file as an error.
* See libjpeg.txt for more info.
*/
/* Step 4: set parameters for decompression */
/* In this example, we don't need to change any of the defaults set by
* jpeg_read_header(), so we do nothing here.
*/
/* Step 5: Start decompressor */
(void) jpeg_start_decompress(&cinfo);
/* We can ignore the return value since suspension is not possible
* with the stdio data source.
*/
/* We may need to do some setup of our own at this point before reading
* the data. After jpeg_start_decompress() we have the correct scaled
* output image dimensions available, as well as the output colormap
* if we asked for color quantization.
* In this example, we need to make an output work buffer of the right size.
*/
/* JSAMPLEs per row in output buffer */
row_stride = cinfo.output_width * cinfo.output_components;
/* Make a one-row-high sample array that will go away when done with image */
buffer = (*cinfo.mem->alloc_sarray)
((j_common_ptr) &cinfo, JPOOL_IMAGE, row_stride, 1);
/* Step 6: while (scan lines remain to be read) */
/* jpeg_read_scanlines(...); */
// Write a simple PPM file called "result.ppm" that can be thumbnailed with GraphicsMagick
// gm convert result.ppm -resize 64x64 thumbnail.jpg
if ((outfile = fopen("result.ppm", "wb")) == NULL) { // HACK
fprintf(stderr, "can't open result.ppm\n"); // HACK
return 0; // HACK
}
int outwidth = cinfo.output_width/factor; // HACK
int outheight = cinfo.output_height/factor; // HACK
fprintf(outfile,"P6\n%d %d\n255\n",outwidth,outheight); // HACK
unsigned char * outbuf; // HACK
outbuf = (unsigned char*)malloc(outwidth*3); // HACK
int line=0; // HACK
while (cinfo.output_scanline < cinfo.output_height) {
/* jpeg_read_scanlines expects an array of pointers to scanlines.
* Here the array is only one element long, but you could ask for
* more than one scanline at a time if that's more convenient.
*/
(void) jpeg_read_scanlines(&cinfo, buffer, 1);
// Do every Nth row // HACK
if(line%factor==0){ // HACK
int pixel,op;
op=0;
for(pixel=0;pixel<row_stride/3;pixel+=factor){
outbuf[op++] = buffer[0][(pixel*3) +0];
outbuf[op++] = buffer[0][(pixel*3) +1];
outbuf[op++] = buffer[0][(pixel*3) +2];
}
fwrite(outbuf,1,op,outfile); // HACK
} // HACK
line++; // HACK
}
/* Step 7: Finish decompression */
(void) jpeg_finish_decompress(&cinfo);
/* Step 8: Release JPEG decompression object */
/* This is an important step since it will release a good deal of memory. */
jpeg_destroy_decompress(&cinfo);
fclose(infile);
return 1;
}
int main(int argc, char*argv[]){
// Read "test.jpg" and only output every 8th row and every 8th column for a 64x size reduction
read_JPEG_file ("test.jpg",8); // HACK
}
I compiled it with:
clang -std=c11 $(pkg-config --cflags --libs libjpeg) example.c -o example

__ctype_b table and its usage

I have this piece of asm code that uses __ctype_b array, i'm trying to understand what it is, anybody know what is it doing?
ps: i know what __ctype_b_loc() is, this is different.
mov eax, [rbp+counter]
cdqe
add rax, [rbp+server_received_buffer]
movzx eax, byte ptr [rax]
movsx rax, al
add rax, rax
mov rdx, rax
mov rax, cs:__ctype_b
lea rax, [rdx+rax]
movzx eax, word ptr [rax]
movzx eax, ax
and eax, 20h
test eax, eax
I found the answer.
__ctype_b is an array that be used in "__isctype_l" macro:
# define __isctype_l(c, type, locale) \
((locale)->__ctype_b[(int) (c)] & (unsigned short int) type)
and it is used here:
# define __isalnum_l(c,l) __isctype_l((c), _ISalnum, (l))
# define __isalpha_l(c,l) __isctype_l((c), _ISalpha, (l))
# define __iscntrl_l(c,l) __isctype_l((c), _IScntrl, (l))
# define __isdigit_l(c,l) __isctype_l((c), _ISdigit, (l))
# define __islower_l(c,l) __isctype_l((c), _ISlower, (l))
# define __isgraph_l(c,l) __isctype_l((c), _ISgraph, (l))
# define __isprint_l(c,l) __isctype_l((c), _ISprint, (l))
# define __ispunct_l(c,l) __isctype_l((c), _ISpunct, (l))
# define __isspace_l(c,l) __isctype_l((c), _ISspace, (l))
# define __isupper_l(c,l) __isctype_l((c), _ISupper, (l))
# define __isxdigit_l(c,l) __isctype_l((c), _ISxdigit, (l))
# define __isblank_l(c,l) __isctype_l((c), _ISblank, (l))
then we will reach to this enum
enum
{
_ISupper = _ISbit (0), /* UPPERCASE. */
_ISlower = _ISbit (1), /* lowercase. */
_ISalpha = _ISbit (2), /* Alphabetic. */
_ISdigit = _ISbit (3), /* Numeric. */
_ISxdigit = _ISbit (4), /* Hexadecimal numeric. */
_ISspace = _ISbit (5), /* Whitespace. */
_ISprint = _ISbit (6), /* Printing. */
_ISgraph = _ISbit (7), /* Graphical. */
_ISblank = _ISbit (8), /* Blank (usually SPC and TAB). */
_IScntrl = _ISbit (9), /* Control character. */
_ISpunct = _ISbit (10), /* Punctuation. */
_ISalnum = _ISbit (11) /* Alphanumeric. */
};
now we need to understand what 0x20 is stand for from _ISbit definition
# if __BYTE_ORDER == __BIG_ENDIAN
# define _ISbit(bit) (1 << (bit))
# else /* __BYTE_ORDER == __LITTLE_ENDIAN */
# define _ISbit(bit) ((bit) < 8 ? ((1 << (bit)) << 8) : ((1 << (bit)) >> 8))
# endif
0x20 : 0010 0000
from the asm code i know it is big endian. so the shift number should be 5 and it is Whitespace.
Conclusion: that asm code finds Whitespace from the server_received_buffer

Cuda Mutex, why deadlock?

I am trying to implement a atomic based mutex.
I succeed it but I have one question about warps / deadlock.
This code works well.
bool blocked = true;
while(blocked) {
if(0 == atomicCAS(&mLock, 0, 1)) {
index = mSize++;
doCriticJob();
atomicExch(&mLock, 0);
blocked = false;
}
}
But this one doesn't...
while(true) {
if(0 == atomicCAS(&mLock, 0, 1)) {
index = mSize++;
doCriticJob();
atomicExch(&mLock, 0);
break;
}
}
I think it's a position of exiting loop. In the first one, exit happens where the condition is, in the second one it happens in the end of if, so the thread wait for other warps finish loop, but other threads wait the first thread as well... But I think I am wrong, so if you can explain me :).
Thanks !
There are other questions here on mutexes. You might want to look at some of them. Search on "cuda critical section", for example.
Assuming that one will work and one won't because it seemed to work for your test case is dangerous. Managing mutexes or critical sections, especially when the negotiation is amongst threads in the same warp is notoriously difficult and fragile. The general advice is to avoid it. As discussed elsewhere, if you must use mutexes or critical sections, have a single thread in the threadblock negotiate for any thread that needs it, then control behavior within the threadblock using intra-threadblock synchronization mechanisms, such as __syncthreads().
This question (IMO) can't really be answered without looking at the way the compiler is ordering the various paths of execution. Therefore we need to look at the SASS code (the machine code). You can use the cuda binary utilities to do this, and will probably want to refer to both the PTX reference as well as the SASS reference. This also means that you need a complete code, not just the snippets you've provided.
Here's my code for analysis:
$ cat t830.cu
#include <stdio.h>
__device__ int mLock = 0;
__device__ void doCriticJob(){
}
__global__ void kernel1(){
int index = 0;
int mSize = 1;
while(true) {
if(0 == atomicCAS(&mLock, 0, 1)) {
index = mSize++;
doCriticJob();
atomicExch(&mLock, 0);
break;
}
}
}
__global__ void kernel2(){
int index = 0;
int mSize = 1;
bool blocked = true;
while(blocked) {
if(0 == atomicCAS(&mLock, 0, 1)) {
index = mSize++;
doCriticJob();
atomicExch(&mLock, 0);
blocked = false;
}
}
}
int main(){
kernel2<<<4,128>>>();
cudaDeviceSynchronize();
}
kernel1 is my representation of your deadlock code, and kernel2 is my representation of your "working" code. When I compile this on linux under CUDA 7 and run on a cc2.0 device (Quadro5000), if I call kernel1 the code will deadlock, and if I call kernel2 (as is shown) it doesn't.
I use cuobjdump -sass to dump the machine code:
$ cuobjdump -sass ./t830
Fatbin elf code:
================
arch = sm_20
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit
code for sm_20
Fatbin elf code:
================
arch = sm_20
code version = [1,7]
producer = cuda
host = linux
compile_size = 64bit
code for sm_20
Function : _Z7kernel1v
.headerflags #"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/ MOV32I R4, 0x1; /* 0x1800000004011de2 */
/*0010*/ SSY 0x48; /* 0x60000000c0000007 */
/*0018*/ MOV R2, c[0xe][0x0]; /* 0x2800780000009de4 */
/*0020*/ MOV R3, c[0xe][0x4]; /* 0x280078001000dde4 */
/*0028*/ ATOM.E.CAS R0, [R2], RZ, R4; /* 0x54080000002fdd25 */
/*0030*/ ISETP.NE.AND P0, PT, R0, RZ, PT; /* 0x1a8e0000fc01dc23 */
/*0038*/ #P0 BRA 0x18; /* 0x4003ffff600001e7 */
/*0040*/ NOP.S; /* 0x4000000000001df4 */
/*0048*/ ATOM.E.EXCH RZ, [R2], RZ; /* 0x547ff800002fdd05 */
/*0050*/ EXIT; /* 0x8000000000001de7 */
............................
Function : _Z7kernel2v
.headerflags #"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
/*0000*/ MOV R1, c[0x1][0x100]; /* 0x2800440400005de4 */
/*0008*/ MOV32I R0, 0x1; /* 0x1800000004001de2 */
/*0010*/ MOV32I R3, 0x1; /* 0x180000000400dde2 */
/*0018*/ MOV R4, c[0xe][0x0]; /* 0x2800780000011de4 */
/*0020*/ MOV R5, c[0xe][0x4]; /* 0x2800780010015de4 */
/*0028*/ ATOM.E.CAS R2, [R4], RZ, R3; /* 0x54061000004fdd25 */
/*0030*/ ISETP.NE.AND P1, PT, R2, RZ, PT; /* 0x1a8e0000fc23dc23 */
/*0038*/ #!P1 MOV R0, RZ; /* 0x28000000fc0025e4 */
/*0040*/ #!P1 ATOM.E.EXCH RZ, [R4], RZ; /* 0x547ff800004fe505 */
/*0048*/ LOP.AND R2, R0, 0xff; /* 0x6800c003fc009c03 */
/*0050*/ I2I.S32.S16 R2, R2; /* 0x1c00000008a09e84 */
/*0058*/ ISETP.NE.AND P0, PT, R2, RZ, PT; /* 0x1a8e0000fc21dc23 */
/*0060*/ #P0 BRA 0x18; /* 0x4003fffec00001e7 */
/*0068*/ EXIT; /* 0x8000000000001de7 */
............................
Fatbin ptx code:
================
arch = sm_20
code version = [4,2]
producer = cuda
host = linux
compile_size = 64bit
compressed
$
Considering a single warp, with either code, all threads must acquire the lock (via atomicCAS) once, in order for the code to complete successfully. With either code, only one thread in a warp can acquire the lock at any given time, and in order for other threads in the warp to (later) acquire the lock, that thread must have an opportunity to release it (via atomicExch).
The key difference between these realizations then, lies in how the compiler scheduled the atomicExch instruction with respect to conditional branches.
Let's consider the "deadlock" code (kernel1). In this case, the ATOM.E.EXCH instruction does not occur until after the one (and only) conditional branch (#P0 BRA 0x18;) instruction. A conditional branch in CUDA code represents a possible point of warp divergence, and execution after warp divergence is, to some degree, unspecified and up to the specifics of the machine. But given this uncertainty, it's possible that the thread that acquired the lock will wait for the other threads to complete their branches, before executing the atomicExch instruction, which means that the other threads will not have a chance to acquire the lock, and we have deadlock.
If we then compare that to the "working" code, we see that once the ATOM.E.CAS instruction is issued, there are no conditional branches in between that point and the point at which the ATOM.E.EXCH instruction is issued, thus releasing the lock just acquired. Since each thread that acquires the lock (via ATOM.E.CAS) will release it (via ATOM.E.EXCH) before any conditional branching occurs, there isn't any possibility (given this code realization) for the kind of deadlock witnessed previously (with kernel1) to occur.
(#P0 is a form of predication, and you can read about it in the PTX reference here to understand how it can lead to conditional branching.)
NOTE: I consider both of these codes to be dangerous, and possibly flawed. Even though the current tests don't seem to uncover a problem with the "working" code, I think it's possible that a future CUDA compiler might choose to schedule things differently, and break that code. It's even possible that compiling for a different machine architecture might produce different code here. I consider a mechanism like this to be more robust, which avoids intra-warp contention entirely. Even such a mechanism, however, can lead to inter-threadblock deadlocks. Any mutex must be used under specific programming and usage limitations.