run valgrind for a single file (ignore library) - warnings

I have a main.c that's using library of 1000's of files
is there a method to ask valgrnd to look for memory leaks only in main.c rather than digging through the library ?
also if valgrind reports an error like this within the library
vex x86->IR: unhandled instruction bytes: 0xC5 0xF8 0x10 0x83
==1796== valgrind: Unrecognised instruction at address 0x812b234.
is the ability to use valgrind toast ?

You can use a suppression file to tell valgrind to ignore errors, including memory leaks.
The instruction 0xC5 0xF8 0x10 0x83 is probably the start of a VEX-encoded instruction such as vmovups. You will need a newer version of valgrind with AVX support.

Related

Loading ROM bootloader in qemu

I have a custom bootloader, I have the entry point of the bootloader, How do I specify this address to qemu?
I also have this warning when I try to load the image with this line qemu-system-mips -pflash img_:
WARNING: Image format was not specified for 'img_' and probing guessed raw.
Automatically detecting the format is dangerous for raw images, write operations on block 0 will be restricted.
Specify the 'raw' format explicitly to remove the restrictions.
I tried -pflash=img_,format=raw but it didn't work.
Thanks.
You should perform some dig into qemu sources to play with custom bootloader images at qemu. QEMU loads booloader at board initilizaton function, which is board specific.
Run following to view all available MIPS board models:
qemu-system-mips64 -machine ?
Supported machines are:
magnum MIPS Magnum
malta MIPS Malta Core LV (default)
mips mips r4k platform
mipssim MIPS MIPSsim platform
none empty machine
pica61 Acer Pica 61
This models are implemented in qemu/hw/mips/ files. (look for * _init funtctions)
For example, for malta board (which is default) it is hw/mips/mips_malta.c. Look at mips_malta_init function: it constructs devices, buses and cpu, register memory regions and place bootloader to memory. Seems that FLASH_ADDRESS macro is that you looking for.
Note that this init functions is common thing to all boards that QEMU implements.
Also every board have some reference/datasheet document, and QEMU model should complement with them, as a programer view.

Determining shared memory usage in CUDA Fortran

I've been writing some basic CUDA Fortran code. I would like to be able to determine the amount of shared memory my program uses per thread block (for occupancy calculation). I have been compiling with -Mcuda=ptxinfo in the hope of finding this information. The compilation output ends with
ptxas info : Function properties for device_procedures_main_kernel_
432 bytes stack frame, 1128 bytes spill stores, 604 bytes spill loads
ptxas info : Used 63 registers, 96 bytes smem, 320 bytes cmem[0]
which is the only place in the output that smem is mentioned. There is one array in the global subroutine main_kernel with the shared attribute. If I remove the shared attribute then I get
ptxas info : Function properties for device_procedures_main_kernel_
432 bytes stack frame, 1124 bytes spill stores, 532 bytes spill loads
ptxas info : Used 63 registers, 320 bytes cmem[0]
The smem has disappeared. It seems that only shared memory in main_kernel is being counted: device subroutines in my code use variables with the shared attribute but these don't appear to be mentioned in the output e.g the device subroutine evalfuncs includes shared variable declarations but the relevant output is
ptxas info : Function properties for device_procedures_evalfuncs_
504 bytes stack frame, 1140 bytes spill stores, 508 bytes spill loads
Do all variables with the shared attribute need to be declared in a global subroutine?
Do all variables with the shared attribute need to be declared in a global subroutine?
No.
You haven't shown an example code, your compile command, nor have you identified the version of the PGI compiler tools you are using. However, the most likely explanation I can think of for what you are seeing is that as of PGI 14.x, the default CUDA compile option is to generate relocatable device code. This is documented in section 2.2.3 of the current PGI release notes:
2.2.3. Relocatable Device Code
An rdc option is available for the –ta=tesla and –Mcuda flags that specifies to generate
relocatable device code. Starting in PGI 14.1 on Linux and in PGI 14.2 on Windows, the default
code generation and linking mode for Tesla-target OpenACC and CUDA Fortran is rdc,
relocatable device code.
You can disable the default and enable the old behavior and non-relocatable code by specifying
any of the following: –ta=tesla:nordc, –Mcuda=nordc, or by specifying any 1.x compute
capability or any Radeon target.
So a specific option to (disable)enable this is:
–Mcuda=(no)rdc
(note that -Mcuda=rdc is the default, if you don't specify this option)
CUDA Fortran separates Fortran host code from device code. For the device code, the CUDA Fortran compiler does a CUDA Fortran->CUDA C conversion, and passes the auto-generated CUDA C code to the CUDA C compiler. Therefore, the behavior and expectations of switches like rdc and ptxinfo are derived from the behavior of the underlying equivalent CUDA compiler options (-rdc=true and -Xptxas -v, respectively).
When CUDA device code is compiled without the rdc option, the compiler will normally try to inline device (sub)routines that are called from a kernel, into the main kernel code. Therefore, when the compiler is generating the ptxinfo, it can determine all resource requirements (e.g. shared memory, registers, etc.) when it is compiling (ptx assembly) the kernel code.
When the rdc option is specified, however, the compiler may (depending on some other switches and function attributes) leave the device subroutines as separately callable routines with their own entry point (i.e. not inlined). In that scenario, when the device compiler is compiling the kernel code, the call to the device subroutine just looks like a call instruction, and the compiler (at that point) has no visibility into the resource usage requirements of the device subroutine. This does not mean that there is an underlying flaw in the compile sequence. It simply means that the ptxinfo mechanism cannot accurately roll up the resource requirements of the kernel and all of it's called subroutines, at that point in time.
The ptxinfo output also does not declare the total amount of shared memory used by a device subroutine, when it is compiling that subroutine, in rdc mode.
If you turn off the rdc mode:
–Mcuda=nordc
I believe you will see an accurate accounting of the shared memory used by a kernel plus all of its called subroutines, given a few caveats, one of which is that the compiler is able to successfully inline your called subroutines (pretty likely, and the accounting should still work even if it can't) another of which is that you are working with a kernel plus all of its called subroutines in the same file (i.e. translation unit). If you have kernels that are calling device subroutines in different translation units, then the rdc option is the only way to make it work.
Shared memory will still be appropriately allocated for your code at runtime, regardless (assuming you have not violated the total amount of shared memory available). You can also get an accurate reading of the shared memory used by a kernel by profiling your code, using a profiler such as nvvp or nvprof.
If this explanation doesn't describe what you are seeing, I would suggest providing a complete sample code, as well as the exact compile command you are using, plus the version of PGI tools you are using. (I think it's a good suggestion for future questions as well.)

Self-modifying program: Why does it raise an exception?

Just for the purposes of experimenting and playing around, I wrote the following short x64 assembly program:
.code
AsmFun proc
mov rax, MyLabel
mov byte ptr [rax], 0C3h ; C3 is x64 machine code for "ret"
MyLabel:
mov rax, 239847 ; This isn't "ret"
AsmFun endp
end
(I then called the code from C.)
It compiles/assembles/links just fine, but when I walk through the program, Visual Studio complains that an un-handled exception has been raised: "Access writing violation as [MyLabel].", where of course it doesn't actually say "[MyLabel]", but rather the address that happens to be at in memory.
Why is this happening? Is it a Windows thing that was put in place to avoid security exploits?
I live in Linux world, but perhaps you can adapt what I've found out.
Memory pages are generally read-only if they have execute permission. How I got around this was with mmap() and mprotect()... I'm sure there's something similar in Windows. It's a good bet the Mono source code would shed some light.
I used mmap() to allocate a new page with write access (but not read or execute). I populated it, then called mprotect() to change it to read-only and executable.
Don't forget... there are registers you want to avoid trashing. See the ABI documentation for further details.

Compiling CUDA program for a GeForce 310 (compute capability 1.2) with unmatched options "-arch=compute_20 -code=sm_20"

I'm compiling a CUDA program using nvcc with options -arch=20 -code=20 for a GeForce 310 GPU having compute capability 1.2. The program seems to run normally as follows.
wangli#wangli-desktop:~/wangliC2050/1D-EncodeV6.1$ make
nvcc -O --ptxas-options=-v 1D-EncodeV6.1.cu -o 1D-EncodeV6.1 -I../../NVIDIA_GPU_Computing_SDK/C/common/inc -I../../NVIDIA_GPU_Computing_SDK/shared/inc -arch=compute_20 -code=sm_20
ptxas info : Compiling entry function '_Z6EncodePhPjS0_S_S_' for 'sm_20'
ptxas info : Function properties for _Z6EncodePhPjS0_S_S_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 14 registers, 52 bytes cmem[0]
wangli#wangli-desktop:~/wangliC2050/1D-EncodeV6.1$ ./1D-EncodeV6.1
########################### Encoding start (loopCount=10)#######################
#p n size averageTime(s) averageThroughput(MB/s) errorRate(0~1)
#================= Encode on GPU v6.1 ===============
4 4 4 0.000294 0.051837 100.000000
#################### Encoding stop #########################
So, I wonder:
Why this program could run on a GeForce 310 with nvcc options -arch=compute_20 -code=sm_20 which do not match the compute capability 1.2 of the card?
What will happen if the value of the -arch option will differ from that of the -code option?
Thanks.
A CUDA executable typically contains 2 types of program data: SASS code which is basically GPU machine code, and PTX which is an intermediate code (although it's pretty close to machine code). As long as PTX code is present in the executable, then if the driver decides that a proper SASS binary is not available for the GPU that the code will actually run on, it will do a "JIT-compile" step at application launch, to create the necessary binary code appropriate for the device in question, using the PTX code in the application package.
This is what is happening in your case.
If arch != code, then you're creating device code that architecturally conforms to the arch type, but is compiled to use machine level instructions that are associated with the code type. For example, if I compile for arch = 1.2 and code = 2.0, I cannot use double types (they will be demoted to float, because double is not supported in a 1.2 architecture) but the SASS machine code generated will be ready to execute on a cc 2.0 device, and will not require a JIT-compile step for that kind of device.
The NVCC manual has more information particularly the section on steering code generation.

nvcc -Xptxas –v compiler flag has no effect

I have a CUDA project. It consists of several .cpp files that contain my application logic and one .cu file that contains multiple kernels plus a __host__ function that invokes them.
Now I would like to determine the number of registers used by my kernel(s). My normal compiler call looks like this:
nvcc -arch compute_20 -link src/kernel.cu obj/..obj obj/..obj .. -o bin/..exe -l glew32 ...
Adding the "-Xptxas –v" compiler flag to this call unfortunately has no effect. The compiler still produces the same textual output as before. The compiled .exe also works the same way as before with one exception: My framerate jumps to 1800fps, up from 80fps.
I had the same problem, here is my solution:
Compile *cu files into device only *ptx file, this will discard host code
nvcc -ptx *.cu
Compile *ptx file:
ptxas -v *.ptx
The second step will show you number of used registers by kernel and amount of used shared memory.
Convert the compute_20 to sm_20 in your compiler call. That should fix it.
When using "-Xptxas -v", "-arch" together, we can not get verbose information(register num, etc.). If we want to see the verbose without losing the chance of assigning GPU architecture(-arch, -code) ahead, we can do the following steps: nvcc -arch compute_XX *.cu -keep then ptxas -v *.ptx. But we will obtain many processing files. Certainly, kogut's answer is to the point.
when you compile
nvcc --ptxas-options=-v
You may want to ctrl your compiler verbose option defaults.
For example is VStudio goto :
Tools->Options->ProjectsAndSolutions->BuildAndRun
then set the verbosity output to Normal.
Not exactly what you were looking for, but you can use the CUDA visual profiler shipped with the nvidia gpu computing sdk. Besides many other useful informations, it shows the number of registers used by each kernel in you application.