QEMU AARCH64 "virt" Machine SMP CPUs Starting in "running" vs. "halted" State - qemu

I'm working on bare-metal. No Linux, libraries, etc. I'm writing processor boot code in ASM and jumping to my compiled C code.
My command line is:
% qemu-system-aarch64 \
-s -S \
-machine virt,secure=on,virtualization=on \
-cpu cortex-a53 \
-d int \
-m 512M \
-smp 4 \
-display none \
-nographic \
-semihosting \
-serial mon:stdio \
-kernel my_file.elf \
-device loader,addr=0x40004000,cpu-num=0 \
-device loader,addr=0x40004000,cpu-num=1 \
-device loader,addr=0x40004000,cpu-num=2 \
-device loader,addr=0x40004000,cpu-num=3 \
;
When I connect gdb at the beginning, I can see:
(gdb) info threads
Id Target Id Frame
* 1 Thread 1.1 (CPU#0 [running]) _start () at .../start.S:20
2 Thread 1.2 (CPU#1 [halted ]) _start () at .../start.S:20
3 Thread 1.3 (CPU#2 [halted ]) _start () at .../start.S:20
4 Thread 1.4 (CPU#3 [halted ]) _start () at .../start.S:20
I want those other three processors to start in the "running" state, not "halted". How?
Note that my DTS contains this section:
psci {
migrate = < 0xc4000005 >;
cpu_on = < 0xc4000003 >;
cpu_off = < 0x84000002 >;
cpu_suspend = < 0xc4000001 >;
method = "smc";
compatible = "arm,psci-0.2\0arm,psci";
};
However, I'm not sure what to do with that. Adding various lines of this form doesn't seem to help:
-device loader,addr=0xc4000003,data=0x80000000,data-len=4
I'm not sure if I'm on the right track with this ARM PSCI thing? ARM's specification seems to define the "interface", not the system "implementation". However, I don't see the PSCI as "real" registers mentioned in the "virt" documentation/source. There is no "SMC" device mentioned in the DTS.
How does QEMU decide whether an SMP processor is "running" or "halted" on start and how can I influence that?
Based on @Peter-Maydell's answer below, I need to do one of two things:
Switch "-kernel" to "-bios". I did this, but my code doesn't load as I expect. My *.elf file has several sections; some in FLASH and some in DDR (above 0x40000000). Maybe that's the problem?
Change my boot code to set up and issue the SMC instruction that makes the ARM PSCI "CPU_ON" call, which QEMU will recognize and use to power up the other processors. Code like this runs but doesn't seem to "do" anything:
ldr w0, =0xc4000003 // CPU_ON code from the DTS file
mov x1, 1 // CPU #1 in cluster zero (format of MPIDR register?)
ldr x2, _boot // Jump address 0x40006000 (FYI)
mov x3, 1 // context ID (meaningful only to caller)
smc #0 // GO!
// result is in x0 -> PSCI_RET_INVALID_PARAMS
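The value left in x0 after the SMC is a signed PSCI status code. As a small host-side aid for reading it in a debugger, here is a sketch in Python. The helper name psci_status is made up for this illustration, and the names follow the PSCI_RET_* style used in the comment above.

```python
# PSCI status codes as signed integers (names in the PSCI_RET_* style).
PSCI_RETURN_CODES = {
    0: "SUCCESS",
    -1: "NOT_SUPPORTED",
    -2: "INVALID_PARAMS",
    -3: "DENIED",
    -4: "ALREADY_ON",
    -5: "ON_PENDING",
    -6: "INTERNAL_FAILURE",
    -7: "NOT_PRESENT",
    -8: "DISABLED",
    -9: "INVALID_ADDRESS",
}


def psci_status(x0, bits=64):
    """Interpret a raw register value as a two's-complement PSCI status."""
    signed = x0 - (1 << bits) if x0 >> (bits - 1) else x0
    return PSCI_RETURN_CODES.get(signed, "UNKNOWN(%d)" % signed)


# Register value as gdb would print it in hex for a status of -2.
print(psci_status(0xFFFFFFFFFFFFFFFE))
```

Pass bits=32 when looking at w0 rather than x0.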

Using the response provided by Peter Maydell, I am providing here a Minimal, Reproducible Example for people who may be interested.
Downloading/installing aarch64-elf toolchain:
wget -O gcc-arm-8.3-2019.03-x86_64-aarch64-elf.tar.xz "https://developer.arm.com/-/media/Files/downloads/gnu-a/8.3-2019.03/binrel/gcc-arm-8.3-2019.03-x86_64-aarch64-elf.tar.xz?revision=d678fd94-0ac4-485a-8054-1fbc60622a89&la=en"
mkdir -p /opt/arm
tar Jxf gcc-arm-8.3-2019.03-x86_64-aarch64-elf.tar.xz -C /opt/arm
Example files:
loop.s:
.title "loop.s"
.arch armv8-a
.text
.global Reset_Handler
Reset_Handler: mrs x0, mpidr_el1
and x0,x0, 0b11
cmp x0, #0
b.eq Core0
cmp x0, #1
b.eq Core1
cmp x0, #2
b.eq Core2
cmp x0, #3
b.eq Core3
Error: b .
Core0: b .
Core1: b .
Core2: b .
Core3: b .
.end
build.sh:
#!/bin/bash
set -e
CROSS_COMPILE=/opt/arm/gcc-arm-8.3-2019.03-x86_64-aarch64-elf/bin/aarch64-elf-
AS=${CROSS_COMPILE}as
LD=${CROSS_COMPILE}ld
OBJCOPY=${CROSS_COMPILE}objcopy
OBJDUMP=${CROSS_COMPILE}objdump
${AS} -g -o loop.o loop.s
${LD} -g -gc-sections -g -e Reset_Handler -Ttext-segment=0x40004000 -Map=loop.map -o loop.elf loop.o
${OBJDUMP} -d loop.elf
qemu.sh:
#!/bin/bash
set -e
QEMU_SYSTEM_AARCH64=qemu-system-aarch64
${QEMU_SYSTEM_AARCH64} \
-s -S \
-machine virt,secure=on,virtualization=on \
-cpu cortex-a53 \
-d int \
-m 512M \
-smp 4 \
-display none \
-nographic \
-semihosting \
-serial mon:stdio \
-bios loop.elf \
-device loader,addr=0x40004000,cpu-num=0 \
-device loader,addr=0x40004000,cpu-num=1 \
-device loader,addr=0x40004000,cpu-num=2 \
-device loader,addr=0x40004000,cpu-num=3 \
;
loop.gdb:
target remote localhost:1234
file loop.elf
load loop.elf
disassemble Reset_Handler
info threads
continue
debug.sh:
#!/bin/bash
CROSS_COMPILE=/opt/arm/gcc-arm-8.3-2019.03-x86_64-aarch64-elf/bin/aarch64-elf-
GDB=${CROSS_COMPILE}gdb
${GDB} --command=loop.gdb
Executing the program - two consoles will be needed.
First console:
./build.sh
Output should look like:
/opt/arm/gcc-arm-8.3-2019.03-x86_64-aarch64-elf/bin/aarch64-elf-ld: warning: address of `text-segment' isn't multiple of maximum page size
loop.elf: file format elf64-littleaarch64
Disassembly of section .text:
0000000040004000 <Reset_Handler>:
40004000: d53800a0 mrs x0, mpidr_el1
40004004: 92400400 and x0, x0, #0x3
40004008: f100001f cmp x0, #0x0
4000400c: 54000100 b.eq 4000402c <Core0> // b.none
40004010: f100041f cmp x0, #0x1
40004014: 540000e0 b.eq 40004030 <Core1> // b.none
40004018: f100081f cmp x0, #0x2
4000401c: 540000c0 b.eq 40004034 <Core2> // b.none
40004020: f1000c1f cmp x0, #0x3
40004024: 540000a0 b.eq 40004038 <Core3> // b.none
0000000040004028 <Error>:
40004028: 14000000 b 40004028 <Error>
000000004000402c <Core0>:
4000402c: 14000000 b 4000402c <Core0>
0000000040004030 <Core1>:
40004030: 14000000 b 40004030 <Core1>
0000000040004034 <Core2>:
40004034: 14000000 b 40004034 <Core2>
0000000040004038 <Core3>:
40004038: 14000000 b 40004038 <Core3>
Then:
./qemu.sh
Second console:
./debug.sh
Output should look like:
GNU gdb (GNU Toolchain for the A-profile Architecture 8.3-2019.03 (arm-rel-8.36)) 8.2.1.20190227-git
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "--host=x86_64-pc-linux-gnu --target=aarch64-elf".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://bugs.linaro.org/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
warning: No executable has been specified and target does not support
determining executable automatically. Try using the "file" command.
0x0000000040004000 in ?? ()
Loading section .text, size 0x3c lma 0x40004000
Start address 0x40004000, load size 60
Transfer rate: 480 bits in <1 sec, 60 bytes/write.
Dump of assembler code for function Reset_Handler:
=> 0x0000000040004000 <+0>: mrs x0, mpidr_el1
0x0000000040004004 <+4>: and x0, x0, #0x3
0x0000000040004008 <+8>: cmp x0, #0x0
0x000000004000400c <+12>: b.eq 0x4000402c <Core0> // b.none
0x0000000040004010 <+16>: cmp x0, #0x1
0x0000000040004014 <+20>: b.eq 0x40004030 <Core1> // b.none
0x0000000040004018 <+24>: cmp x0, #0x2
0x000000004000401c <+28>: b.eq 0x40004034 <Core2> // b.none
0x0000000040004020 <+32>: cmp x0, #0x3
0x0000000040004024 <+36>: b.eq 0x40004038 <Core3> // b.none
End of assembler dump.
Id Target Id Frame
* 1 Thread 1.1 (CPU#0 [running]) Reset_Handler () at loop.s:5
2 Thread 1.2 (CPU#1 [running]) Reset_Handler () at loop.s:5
3 Thread 1.3 (CPU#2 [running]) Reset_Handler () at loop.s:5
4 Thread 1.4 (CPU#3 [running]) Reset_Handler () at loop.s:5
All four cores are stopped at address 0x40004000/Reset_Handler, and were started by the continue command in loop.gdb.
Press CTRL+C in the second console:
^C
Thread 1 received signal SIGINT, Interrupt.
Core0 () at loop.s:16
16 Core0: b .
(gdb)
Core #0 was executing code at Core0 label.
Enter the following command (still in the second console):
(gdb) info threads
Id Target Id Frame
* 1 Thread 1.1 (CPU#0 [running]) Core0 () at loop.s:16
2 Thread 1.2 (CPU#1 [running]) Core1 () at loop.s:17
3 Thread 1.3 (CPU#2 [running]) Core2 () at loop.s:18
4 Thread 1.4 (CPU#3 [running]) Core3 () at loop.s:19
(gdb)
Cores #1, #2 and #3 were executing the code at their respective Core1, Core2 and Core3 labels before being stopped by CTRL+C.
A description of the MPIDR_EL1 register is available here. The two lowest bits of MPIDR_EL1.Aff0 were used by all four cores to determine their respective core numbers.
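For readers who want to check the masking on the host side, the affinity fields of MPIDR_EL1 can be pulled apart as in the sketch below. The function names and the sample value 0x80000003 are made up for this illustration and were not read from QEMU.

```python
def mpidr_affinity(mpidr):
    """Extract the four affinity fields of an MPIDR_EL1 value."""
    return {
        "aff0": mpidr & 0xFF,          # bits 7:0
        "aff1": (mpidr >> 8) & 0xFF,   # bits 15:8
        "aff2": (mpidr >> 16) & 0xFF,  # bits 23:16
        "aff3": (mpidr >> 32) & 0xFF,  # bits 39:32
    }


def core_number(mpidr):
    """Same masking as `and x0, x0, 0b11` in loop.s."""
    return mpidr & 0b11


sample = 0x80000003  # illustrative value only
print(mpidr_affinity(sample), core_number(sample))
```

Masking only two bits is enough for four cores; a larger system would need the full Aff0 field, and possibly Aff1.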

This depends on the board model -- generally we follow what the hardware does, and some boards start all CPUs from power-on, and some don't. For the 'virt' board (which is specific to QEMU) what we generally do is use PSCI, which is the Arm standard firmware interface for powering SMP CPUs up and down (among other things; you can also use it for 'power down entire machine', for instance). On startup only the primary CPU is running, and it's the job of guest code to use the PSCI API to start the secondaries. That's what that psci node in the DTS is telling the guest -- it tells the guest what specific form of the PSCI ABI QEMU implements, and in particular whether the guest should use the 'hvc' or 'smc' instruction to call PSCI functions. What QEMU is doing here is emulating a "hardware + firmware" combination -- the guest executes an 'smc' instruction and QEMU performs the actions that on real hardware would be performed by a bit of firmware code running at EL3.
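The numbers in that psci node are function IDs laid out according to the Arm SMC Calling Convention. As a rough illustration (a host-side Python sketch, not QEMU code; the helper name decode_smccc_id is invented here), they can be decoded like this:

```python
def decode_smccc_id(fid):
    """Split a 32-bit SMCCC function ID into its main fields."""
    return {
        "fast_call": bool((fid >> 31) & 1),  # bit 31: fast call
        "smc64": bool((fid >> 30) & 1),      # bit 30: 64-bit calling convention
        "owner": (fid >> 24) & 0x3F,         # bits 29:24: owning service (4 = standard secure service)
        "function": fid & 0xFFFF,            # bits 15:0: function number
    }


# Values copied from the psci node in the question's DTS.
for name, fid in [("cpu_suspend", 0xC4000001), ("cpu_off", 0x84000002),
                  ("cpu_on", 0xC4000003), ("migrate", 0xC4000005)]:
    print(name, decode_smccc_id(fid))
```

This shows, for instance, that cpu_on is the 64-bit variant of function 3 while cpu_off uses the 32-bit convention.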
The virt board does also have another mode of operation which is intended for when you want to run a guest which is itself EL3 firmware (for instance if you want to run OVMF/UEFI at EL3). If you start QEMU with -machine secure=true to enable EL3 emulation and you also provide a guest firmware blob via either -bios or -drive if=pflash,..., then QEMU will assume your firmware wants to run at EL3 and provide PSCI services itself, so it will start with all CPUs powered on and let the firmware deal with sorting them out.
A simple example of making a PSCI call to turn on another CPU (in this case cpu #4 of 8):
.equ PSCI_0_2_FN64_CPU_ON, 0xc4000003
ldr x0, =PSCI_0_2_FN64_CPU_ON
ldr x1, =4 /* target CPU's MPIDR affinity */
ldr x2, =0x10000 /* entry point */
ldr x3, =0 /* context ID: put into target CPU's x0 */
smc 0

Related

Cuda gdb print constant

I am in cuda-gdb, where I can use ((@global float *)array)[0],
but how do I access constant memory in gdb?
I tried ((@parameter float *)const_array).
I declared const_array like this :
__constant__ float const_array[1 << 14]
I tried with 1 << 5, and it's the same problem.
I don't seem to have any trouble with it. In order to print device memory, you must be stopped at a breakpoint in device code.
Example:
$ cat t1973.cu
const int cs = 1 << 14;
__constant__ int cdata[cs];
__global__ void k(int *gdata){
gdata[0] = cdata[0];
}
int main(){
int *hdata = new int[cs];
for (int i = 0; i < cs; i++) hdata[i] = i+1;
cudaMemcpyToSymbol(cdata, hdata, cs*sizeof(cdata[0]));
int *gdata;
cudaMalloc(&gdata, sizeof(gdata[0]));
cudaMemset(gdata, 0, sizeof(gdata[0]));
k<<<1,1>>>(gdata);
cudaDeviceSynchronize();
}
$ nvcc -o t1973 t1973.cu -g -G -arch=sm_70
$ cuda-gdb ./t1973
sh: python3: command not found
Unable to determine python3 interpreter version. Python integration disabled.
NVIDIA (R) CUDA Debugger
11.4 release
Portions Copyright (C) 2007-2021 NVIDIA Corporation
GNU gdb (GDB) 10.1
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./t1973...
(cuda-gdb) b 5
Breakpoint 1 at 0x403b0c: file t1973.cu, line 6.
(cuda-gdb) run
Starting program: /home/user2/misc/t1973
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[Detaching after fork from child process 22872]
[New Thread 0x7fffef475700 (LWP 22879)]
[New Thread 0x7fffeec74700 (LWP 22880)]
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
Thread 1 "t1973" hit Breakpoint 1, k<<<(1,1,1),(1,1,1)>>> (
gdata=0x7fffcdc00000) at t1973.cu:5
5 gdata[0] = cdata[0];
(cuda-gdb) print gdata[0]
$1 = 0
(cuda-gdb) print cdata[0]
$2 = 1
(cuda-gdb) s
6 }
(cuda-gdb) print gdata[0]
$3 = 1
(cuda-gdb) print cdata[0]
$4 = 1
(cuda-gdb) print cdata[1]
$5 = 2
(cuda-gdb)
Try putting your __constant__ into a .cuh file, then use it as a classic C global variable.

Microblaze on QEMU not producing serial output

I'm trying to emulate bare-metal MicroBlaze code using QEMU but don't get any output from the "print" function. The MicroBlaze design comes from a Xilinx project, which produces a .dts file that is used to build a .dtb for use with QEMU. I'm using Xilinx's fork of QEMU.
I run QEMU with the following command:
~/.local/bin/qemu-system-microblazeel -M microblaze-fdt -dtb system-top.dtb -m 256 -serial mon:stdio -display none -kernel ./workspace/app_0/Debug/app_0.elf -s -S -nographic
I can connect with gdb and step through the code. It clearly writes to address 0x40600004, which is the UART TX data FIFO, but nothing is seen at the QEMU terminal. I even added some debug output inside the QEMU Xilinx UART model: the model was registered, but it was never called when the code ran.
#include <stdio.h>
#include "platform.h"
#include "xil_printf.h"
int main()
{
init_platform();
print("Hello World\n\r");
cleanup_platform();
return 0;
}
This is the UART node from the .dts file
top_axi_uartlite_0: serial@40600000 {
clock-frequency = <294999169>;
clocks = <&clk_bus_0>;
compatible = "xlnx,axi-uartlite-2.0", "xlnx,xps-uartlite-1.00.a";
current-speed = <115200>;
device_type = "serial";
interrupt-names = "interrupt";
interrupt-parent = <&top_axi_intc_0>;
interrupts = <1 0>;
port-number = <0>;
reg = <0x40600000 0x10000>;
xlnx,baudrate = <0x1c200>;
xlnx,data-bits = <0x8>;
xlnx,odd-parity = <0x0>;
xlnx,s-axi-aclk-freq-hz-d = "294.999169";
xlnx,use-parity = <0x0>;
};
QEMU monitor shows the following memory space
address-space: memory
0000000000000000-ffffffffffffffff (prio 0, i/o): system
0000000000000000-000000000fffffff (prio 0, ram): memory@0
address-space: I/O
0000000000000000-000000000000ffff (prio 0, i/o): io
address-space: cpu-memory-0
0000000000000000-000000000fffffff (prio 0, ram): memory@0

Decompiling 8051 binary, read from EEPROM

I'm trying to decompile the firmware of a Logitech Freedom 2.4 Cordless Joystick. I've managed to get something off the EEPROM (here).
The EEPROM used is the Microchip 25AA320, which is a 32 Kbit SPI EEPROM. The MCU is an nRF24E1G, which contains an 8051 MCU.
The ROM should be 4096 bytes, so I think that my reading program looped over itself 4 times.
I managed to extract a 4kB ROM (here), but the start of the file doesn't look clean.
I loaded both files into IDA Pro and Ghidra and selected the 8051 processor. They don't generate anything useful.
Could anyone help me decompile this ROM?
I used this Arduino sketch to dump the ROM,
together with this Python script:
## Author: Arpan Das
## Date: Fri Jan 11 12:16:59 2019 +0530
## URL: https://github.com/Cyberster/SPI-Based-EEPROM-Reader-Writer
## It listens to serial port and writes contents into a file
## requires pySerial to be installed
import sys
import serial
import time
start = time.time()
MEMORY_SIZE = 4096 # In bytes
serial_port = 'COM5'
baud_rate = 115200 # In arduino, Serial.begin(baud_rate)
write_to_file_path = "dump.rom"
output_file = open(write_to_file_path, "wb")
ser = serial.Serial(serial_port, baud_rate)
print("Press d for dump ROM else CTRL+C to exit.")
ch = sys.stdin.read(1)
if ch == 'd':
    ser.write(b'd')
    for i in range(MEMORY_SIZE // 32):  # one 32-byte chunk per iteration
        # wait until the Arduino responds with 'W', i.e. a write request for the next chunk
        while ser.read() != b'W':
            continue
        ser.write(b'G')  # send back the "write request granted" signal
        for j in range(32):
            byte = ser.read(1)
            output_file.write(byte)
        print(str(MEMORY_SIZE - (i * 32)) + " bytes remaining.")
output_file.close()
print('\nIt took', time.time() - start, 'seconds.')
This is what I did; the rest is left for you. My machine is a Win10 notebook, but I used Unix tools because they are so capable.
First of all, I divided the 16KB dump into four 4KB parts. The first one was different from the other three. And the provided 4KB dump is different to all of these parts. I did not investigate this further, and simply took one of the other three parts that are all equal.
$ split -b 4K LogitechFreedom2.4CordlessJoystick.rom part
$ cmp partaa partab
partaa partab differ: byte 1, line 1
$ cmp partab partac
$ cmp partac partad
$ cmp dump.rom partaa
dump.rom partaa differ: byte 9, line 1
$ cmp dump.rom partab
dump.rom partab differ: byte 1, line 1
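The same split-and-compare step can be done without Unix tools. The sketch below mimics it in Python on synthetic data; the helper names are made up for this illustration, and the real dump file is deliberately not read here.

```python
def split_chunks(data, size=4096):
    """Cut a byte string into consecutive chunks of `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def first_difference(a, b):
    """Return the 1-based offset of the first differing byte, or None if equal."""
    for offset, (x, y) in enumerate(zip(a, b), start=1):
        if x != y:
            return offset
    if len(a) != len(b):
        return min(len(a), len(b)) + 1
    return None


# Synthetic 16 KiB "dump": the first 4 KiB part differs in its first byte.
dump = bytes([0xFF]) + bytes(4095) + bytes(4096) * 3
parts = split_chunks(dump)
for i in range(len(parts) - 1):
    print(i, i + 1, first_difference(parts[i], parts[i + 1]))
```

For the real file, the bytes would come from open(path, "rb").read() instead of the synthetic value.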
From the microcontroller's data sheet I learned that the EEPROM contents have a header of at least 3 bytes (chapter 10.2 on page 61).
These bytes are:
0b Version = 00, Reserved = 00, SPEED = 0.5MHz, XO_FREQ = 16MHz
03 Offset to start of user program = 3
0f Number of 256 bytes block = 15
The last entry seems to be off by one, because there seems to be code in the 16th block, too.
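To make the arithmetic explicit, here is a small Python sketch that unpacks the three header bytes listed above. The function name parse_header is made up, and the first byte is kept raw because its bit layout is not decoded here.

```python
def parse_header(image):
    """Unpack the 3-byte EEPROM header described above."""
    config, offset, blocks = image[0], image[1], image[2]
    return {
        "config": config,                # raw configuration byte, bit fields not decoded
        "program_offset": offset,        # offset to the start of the user program
        "block_count": blocks,           # number of 256-byte blocks
        "program_bytes": blocks * 256,   # size implied by the block count
    }


header = parse_header(bytes([0x0B, 0x03, 0x0F]))
print(header)
```

With 15 blocks the header accounts for 3840 bytes, which is less than the 4093 bytes that follow the header in a 4 KB part.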
Anyway, these bytes look decent, so I cut the first 3 bytes.
$ dd if=partad of=rom.bin bs=1 skip=3
4093+0 records in
4093+0 records out
4093 bytes (4,1 kB, 4,0 KiB) copied, 0,0270132 s, 152 kB/s
$ dd if=partad of=head.bin bs=1 count=3
3+0 records in
3+0 records out
3 bytes copied, 0,0043809 s, 0,7 kB/s
$ od -Ax -t x1 rom.bin > rom.hex
$ od -Ax -t x1 head.bin > head.hex
The hex files are handy for loading into an editor and looking around.
I loaded the remaining 4093 bytes into a disassembler I once wrote and peeked around a bit. It looks promising, so I think you can go on without me now:
C0000: ljmp C0F54
C0003: setb 021H.2
reti
C000B: xch a,r5
inc r6
xrl a,r6
mov a,#0B2H
movc a,@a+pc
movx @r1,a
mov r7,a
setb 021H.2
reti
C0F54: mov psw,#000H
mov sp,#07BH
mov r0,#0FFH
mov @r0,#000H
djnz r0,C0F5C
ljmp C0C09

Determining which gencode (compute_, arch_) values I need for nvcc - within CMake

I'm using CMake as a build system for my code, which involves CUDA. I was thinking of automating the task of deciding which compute_XX and arch_XX I need to pass to my nvcc in order to compile for the GPU(s) on my current machine.
Is there a way to do this:
With the NVIDIA GPU deployment kit?
Without the NVIDIA GPU deployment kit?
Does CMake's FindCUDA help you in determining the values for these switches?
My strategy has been to compile and run a bash script that probes the card and returns the gencode for cmake. Inspiration came from University of Chicago's SLURM. To handle errors or multiple gpus or other circumstances, modify as necessary.
In your project folder create a file cudaComputeVersion.bash and ensure it is executable from the shell. Into this file put:
#!/bin/bash
# create a 'here document' that is code we compile and use to probe the card
cat << EOF > /tmp/cudaComputeVersion.cu
#include <stdio.h>
int main()
{
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop,0);
int v = prop.major * 10 + prop.minor;
printf("-gencode arch=compute_%d,code=sm_%d\n",v,v);
}
EOF
# probe the card and cleanup
/usr/local/cuda/bin/nvcc /tmp/cudaComputeVersion.cu -o /tmp/cudaComputeVersion
/tmp/cudaComputeVersion
rm /tmp/cudaComputeVersion.cu
rm /tmp/cudaComputeVersion
And in your CMakeLists.txt put:
# at cmake-build-time, probe the card and set a cmake variable
execute_process(COMMAND ${CMAKE_CURRENT_SOURCE_DIR}/cudaComputeVersion.bash OUTPUT_VARIABLE GENCODE)
# at project-compile-time, include the gencode into the compile options
set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS}; "${GENCODE}")
# this makes CMake all chatty and allows you to see that GENCODE was set correctly
set(CMAKE_VERBOSE_MAKEFILE TRUE)
cheers
You can use the cuda_select_nvcc_arch_flags() macro in the FindCUDA module for this without any additional scripts when using CMake 3.7 or newer.
include(FindCUDA)
set(CUDA_ARCH_LIST Auto CACHE STRING
"List of CUDA architectures (e.g. Pascal, Volta, etc) or \
compute capability versions (6.1, 7.0, etc) to generate code for. \
Set to Auto for automatic detection (default)."
)
cuda_select_nvcc_arch_flags(CUDA_ARCH_FLAGS ${CUDA_ARCH_LIST})
list(APPEND CUDA_NVCC_FLAGS ${CUDA_ARCH_FLAGS})
The above sets CUDA_ARCH_FLAGS to -gencode arch=compute_61,code=sm_61 on my machine, for example.
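The relationship between a compute capability and the resulting flag is purely textual. As an illustration only (a Python sketch with made-up helper names, not what CMake runs internally):

```python
def parse_capability(text):
    """Turn a dotted compute capability such as '6.1' into (major, minor)."""
    major, minor = text.split(".")
    return int(major), int(minor)


def gencode_flag(major, minor):
    """Build the nvcc -gencode flag for one compute capability."""
    cc = major * 10 + minor
    return "-gencode arch=compute_%d,code=sm_%d" % (cc, cc)


print(gencode_flag(*parse_capability("6.1")))
```

For a capability of 6.1 this produces the same string as quoted above.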
The CUDA_ARCH_LIST cache variable can be configured by the user to generate code for specific compute capabilities instead of automatic detection.
Note: the FindCUDA module has been deprecated since CMake 3.10. However, no equivalent alternative to the cuda_select_nvcc_arch_flags() macro appears to be provided yet in the latest CMake release (v3.14). See this relevant issue at the CMake issue tracker for further details.
A slight improvement over @orthopteroid's answer: it pretty much ensures that a unique temporary file is generated, and it requires only one temporary file instead of two.
The following goes into scripts/get_cuda_sm.sh:
#!/bin/bash
#
# Prints the compute capability of the first CUDA device installed
# on the system, or alternatively the device whose index is the
# first command-line argument
device_index=${1:-0}
timestamp=$(date +%s.%N)
gcc_binary=$(which g++)
gcc_binary=${gcc_binary:-g++}
cuda_root=${CUDA_DIR:-/usr/local/cuda}
CUDA_INCLUDE_DIRS=${CUDA_INCLUDE_DIRS:-${cuda_root}/include}
CUDA_CUDART_LIBRARY=${CUDA_CUDART_LIBRARY:-${cuda_root}/lib64/libcudart.so}
generated_binary="/tmp/cuda-compute-version-helper-$$-$timestamp"
# create a 'here document' that is code we compile and use to probe the card
source_code="$(cat << EOF
#include <stdio.h>
#include <cuda_runtime_api.h>
int main()
{
cudaDeviceProp prop;
cudaError_t status;
int device_count;
status = cudaGetDeviceCount(&device_count);
if (status != cudaSuccess) {
fprintf(stderr,"cudaGetDeviceCount() failed: %s\n", cudaGetErrorString(status));
return -1;
}
if (${device_index} >= device_count) {
fprintf(stderr, "Specified device index %d exceeds the maximum (the device count on this system is %d)\n", ${device_index}, device_count);
return -1;
}
status = cudaGetDeviceProperties(&prop, ${device_index});
if (status != cudaSuccess) {
fprintf(stderr,"cudaGetDeviceProperties() for device ${device_index} failed: %s\n", cudaGetErrorString(status));
return -1;
}
int v = prop.major * 10 + prop.minor;
printf("%d\\n", v);
}
EOF
)"
echo "$source_code" | $gcc_binary -x c++ -I"$CUDA_INCLUDE_DIRS" -o "$generated_binary" - -x none "$CUDA_CUDART_LIBRARY"
# probe the card and cleanup
$generated_binary
rm $generated_binary
and the following goes into CMakeLists.txt or a CMake module:
if (NOT CUDA_TARGET_COMPUTE_CAPABILITY)
if("$ENV{CUDA_SM}" STREQUAL "")
set(ENV{CUDA_INCLUDE_DIRS} "${CUDA_INCLUDE_DIRS}")
set(ENV{CUDA_CUDART_LIBRARY} "${CUDA_CUDART_LIBRARY}")
set(ENV{CMAKE_CXX_COMPILER} "${CMAKE_CXX_COMPILER}")
execute_process(COMMAND
bash -c "${CMAKE_CURRENT_SOURCE_DIR}/scripts/get_cuda_sm.sh"
OUTPUT_VARIABLE CUDA_TARGET_COMPUTE_CAPABILITY_)
else()
set(CUDA_TARGET_COMPUTE_CAPABILITY_ $ENV{CUDA_SM})
endif()
set(CUDA_TARGET_COMPUTE_CAPABILITY "${CUDA_TARGET_COMPUTE_CAPABILITY_}"
CACHE STRING "CUDA compute capability of the (first) CUDA device on \
the system, in XY format (like the X.Y format but no dot); see table \
of features and capabilities by capability X.Y value at \
https://en.wikipedia.org/wiki/CUDA#Version_features_and_specifications")
execute_process(COMMAND
bash -c "echo -n $(echo ${CUDA_TARGET_COMPUTE_CAPABILITY})"
OUTPUT_VARIABLE CUDA_TARGET_COMPUTE_CAPABILITY)
execute_process(COMMAND
bash -c "echo ${CUDA_TARGET_COMPUTE_CAPABILITY} | sed 's/^\\([0-9]\\)\\([0-9]\\)/\\1.\\2/;' | xargs echo -n"
OUTPUT_VARIABLE FORMATTED_COMPUTE_CAPABILITY)
message(STATUS
"CUDA device-side code will assume compute capability \
${FORMATTED_COMPUTE_CAPABILITY}")
endif()
set(CUDA_GENCODE
"arch=compute_${CUDA_TARGET_COMPUTE_CAPABILITY}, code=compute_${CUDA_TARGET_COMPUTE_CAPABILITY}")
set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS} -gencode ${CUDA_GENCODE} )

divide by zero exception handling in Linux

I am curious to understand divide-by-zero exception handling in Linux. When a divide-by-zero operation is performed, a trap is generated, i.e. INT0 is raised on the processor, and ultimately a SIGFPE signal is sent to the process that performed the operation.
As I see it, the divide-by-zero exception is registered in the trap_init() function as
set_trap_gate(0, &divide_error);
I want to know in detail what happens between INT0 being generated and SIGFPE being sent to the process.
The trap handler is registered in the trap_init function from arch/x86/kernel/traps.c:
void __init trap_init(void)
..
set_intr_gate(X86_TRAP_DE, &divide_error);
set_intr_gate writes the address of the handler function into idt_table (see arch/x86/include/asm/desc.h).
How is the divide_error function defined? Through a macro in traps.c:
DO_ERROR_INFO(X86_TRAP_DE, SIGFPE, "divide error", divide_error, FPE_INTDIV,
regs->ip)
And the macro DO_ERROR_INFO is defined a bit above in the same traps.c:
#define DO_ERROR_INFO(trapnr, signr, str, name, sicode, siaddr) \
dotraplinkage void do_##name(struct pt_regs *regs, long error_code) \
{ \
    siginfo_t info; \
    enum ctx_state prev_state; \
    \
    info.si_signo = signr; \
    info.si_errno = 0; \
    info.si_code = sicode; \
    info.si_addr = (void __user *)siaddr; \
    prev_state = exception_enter(); \
    if (notify_die(DIE_TRAP, str, regs, error_code, \
                   trapnr, signr) == NOTIFY_STOP) { \
        exception_exit(prev_state); \
        return; \
    } \
    conditional_sti(regs); \
    do_trap(trapnr, signr, str, regs, error_code, &info); \
    exception_exit(prev_state); \
}
(Strictly speaking, the macro defines the do_divide_error function, which is called with prepared arguments by a small asm-coded "entry point" stub. The stub is defined in entry_32.S as ENTRY(divide_error) and in entry_64.S through the zeroentry macro: zeroentry divide_error do_divide_error)
So, when a user program divides by zero (and the operation reaches retirement in an out-of-order core), the hardware generates a trap and sets %eip to the divide_error stub, which sets up the frame and calls the C function do_divide_error. The function do_divide_error creates the siginfo_t struct describing the error (signo=SIGFPE, addr=address of the failed instruction, etc.), then tries to inform all notifiers registered with register_die_notifier (this is a hook, sometimes used by the in-kernel debugger "kgdb"; kprobe's kprobe_exceptions_notify, only for int3 or gpf; uprobe's arch_uprobe_exception_notify, again only for int3; etc.).
Because DIE_TRAP is usually not blocked by a notifier, the do_trap function will be called. Its code is short:
static void __kprobes
do_trap(int trapnr, int signr, char *str, struct pt_regs *regs,
        long error_code, siginfo_t *info)
{
    struct task_struct *tsk = current;
...
    tsk->thread.error_code = error_code;
    tsk->thread.trap_nr = trapnr;
...
    if (info)
        force_sig_info(signr, info, tsk);
...
}
do_trap sends a signal to the current process with force_sig_info, which will "Force a signal that the process can't ignore". If the process is being traced (our current process is ptrace-ed by gdb or strace), it is stopped before the SIGFPE is delivered and the debugger is notified first, so the debugger can decide whether to pass the signal on. If there is no debugger, SIGFPE should kill the process and save a core file, because that is the default action for SIGFPE (check man 7 signal, section "Standard signals", and search for SIGFPE in the table).
The process can't set SIGFPE to be ignored (I'm not sure here), but it can define its own signal handler to handle the signal (one example of handling SIGFPE, and another). This handler may just print %eip from siginfo, run backtrace() and die; or it may even try to recover and return to the failed instruction. This may be useful, for example, in some JITs like qemu, java, or valgrind; or in high-level languages like java or ghc, which can turn SIGFPE into a language exception that programs in these languages can handle (for example, the spaghetti from openjdk is in hotspot/src/os/linux/vm/os_linux.cpp).
There is a list of SIGFPE handlers in Debian via codesearch for sigaction SIGFPE or for signal SIGFPE.
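As a user-space illustration of installing a custom SIGFPE handler, here is a minimal Python sketch. It is only an approximation of the kernel path described above: Python turns 1 // 0 into a ZeroDivisionError without a hardware trap, so the signal is raised artificially, and the handler name on_sigfpe is made up. It assumes Python 3.8 or newer and must run in the main thread.

```python
import signal

received = []


def on_sigfpe(signum, frame):
    # Record the signal instead of letting the default action kill the process.
    received.append(signal.Signals(signum).name)


previous = signal.signal(signal.SIGFPE, on_sigfpe)

# Deliver SIGFPE to this process artificially; no real division fault occurs.
signal.raise_signal(signal.SIGFPE)

# Put the previous disposition back once the demonstration is done.
signal.signal(signal.SIGFPE, previous)

print(received)
```

A handler for a real hardware-generated SIGFPE in C would also have to deal with the faulting instruction, which this sketch does not attempt.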