I'm studying reverse engineering, and I wrote the following code to see how struct variables are passed and returned in the disassembly.
struct s {
    int a;
    int b;
    int c;
};

struct s get_some_values(int a)
{
    struct s rt;
    rt.a = a + 1;
    rt.b = a + 2;
    rt.c = a + 3;
    return rt;
}

int main()
{
    get_some_values(4);
}
The environment is:
Windows 7 x64
WinDbg x86
VS2017
compiled for Debug, with ASLR and DEP disabled
Commands in WinDbg after loading the executable:
0:000> bl
0:000> x *!main
00411770 returnvalue!main ()
0:000> bp 0x00411770
Since then, the left corner of WinDbg says "BUSY", and it doesn't respond to any commands like 'g'.
Here's the screenshot:
Any idea what I'm facing now?
Thanks in advance!
I'm having a weird error with an OpenCL kernel. When I try to build it using clBuildProgram():
err = clBuildProgram(program, 1, &ocl->device, "", NULL, NULL);
my process starts using more and more memory until it reaches 13 GB (normally it uses about 400 MB), and then it yields:
"0xC0000005: Access violation executing location"
The weird part is that this happens only if I use the integrated card, which is an Intel HD 4000. If I choose another device like the GTX 960 or the CPU, it works fine.
Another strange thing is that if there is any syntax error, clBuildProgram finishes fine and reports the compilation error; the problem only appears when there are no mistakes. Also, if I comment out part of my code, it works.
This is my function:
__kernel void update(__global struct PhysicsComponent_ocl_t* vecPhy, __constant struct BoxCollider_ocl_t* vecBx, __constant ulong* vecIdx, __constant float* deltaTime) {
    unsigned int i = get_global_id(0);
    unsigned int j = get_global_id(1);
    if (j > i) { //From size_t j = i + 1; i < vec.size()...
        //Copy data to local memory to avoid race conditions
        struct AuxPhy_ocl_t phy1;
        copyPhyGL(&vecPhy[vecIdx[i]], &phy1);
        struct AuxPhy_ocl_t phy2;
        copyPhyGL(&vecPhy[vecIdx[j]], &phy2);
        if (collide(&phy1, &phy2, &vecBx[i], &vecBx[j])) {
            //Check speed correction for obj 1
            struct mivec3_t speed1 = phy1.speed;
            struct mivec3_t speed2 = phy2.speed;
            modifySpeedAndVelocityOnCollision(&phy1, &phy2, &vecBx[i], &vecBx[j], *deltaTime); //Check both objects, which is why the parameters are swapped
            modifySpeedAndVelocityOnCollision(&phy2, &phy1, &vecBx[j], &vecBx[i], *deltaTime);
            //Make the objects not move
            struct mivec3_t auxSub;
            multiplyVectorByScalarLL(&speed1, *deltaTime, &auxSub);
            substractVectorsLL(&phy1.position, &auxSub, &phy1.position);
            multiplyVectorByScalarLL(&speed2, *deltaTime, &auxSub);
            substractVectorsLL(&phy2.position, &auxSub, &phy2.position);
            //Copy data back to global
            copyPhyLG(&phy1, &vecPhy[vecIdx[i]]);
            copyPhyLG(&phy2, &vecPhy[vecIdx[j]]);
        }
    }
}
For example, if I comment out the last two function calls, the program builds.
//Copy data back to global
//copyPhyLG(&phy1, &vecPhy[vecIdx[i]]);
//copyPhyLG(&phy2, &vecPhy[vecIdx[j]]);
But they are not the cause of this, because if I keep these calls but comment out part of the body, it also works.
__kernel void update(__global struct PhysicsComponent_ocl_t* vecPhy, __constant struct BoxCollider_ocl_t* vecBx, __constant ulong* vecIdx, __constant float* deltaTime) {
    unsigned int i = get_global_id(0);
    unsigned int j = get_global_id(1);
    if (j > i) { //From size_t j = i + 1; i < vec.size()...
        //Copy data to local memory to avoid race conditions
        struct AuxPhy_ocl_t phy1;
        copyPhyGL(&vecPhy[vecIdx[i]], &phy1);
        struct AuxPhy_ocl_t phy2;
        copyPhyGL(&vecPhy[vecIdx[j]], &phy2);
        //Removed code was here
        copyPhyLG(&phy1, &vecPhy[vecIdx[i]]);
        copyPhyLG(&phy2, &vecPhy[vecIdx[j]]);
    }
}
I'm mind-blown by this; the only thing that comes to mind is that the code takes up too much space.
Here is the complete kernel code.
I ran into a similar problem, and in my case it was an infinite loop in one of my kernels. I guess the compiler tried to unroll it or optimize it in some way without checking for bounds.
To validate my hypothesis, I built my OpenCL program with optimizations turned off:
int err = program.build("-cl-opt-disable");
and the build succeeded as I expected.
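In case it helps, the question uses the C API rather than the C++ wrapper; the equivalent experiment there would look roughly like this. This is just a sketch: build_unoptimized is a made-up helper name, and program/device stand for whatever your code already created (e.g. ocl->device in the question).

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* Build with optimizations disabled and dump the build log on failure. */
static cl_int build_unoptimized(cl_program program, cl_device_id device)
{
    cl_int err = clBuildProgram(program, 1, &device, "-cl-opt-disable", NULL, NULL);
    if (err != CL_SUCCESS) {
        /* The build log usually explains what the compiler choked on. */
        size_t log_size = 0;
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
        char *log = malloc(log_size);
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
        fprintf(stderr, "build log:\n%s\n", log);
        free(log);
    }
    return err;
}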
When you introduce a syntax error, the compilation process stops early on and never reaches the optimization pass where the compiler bug resides.
The compilers for the other devices don't have this bug, and they will give you back an executable that you can run but that probably won't terminate (correctly).
#include <stdio.h>

int main()
{
    int x = 5;     //x = interest rate(5%)
    int y = 10000; //y = principal
    int n = 0;     //n = after years

    while (1)
    {
        n++;
        y += y*(x/100);
        if (y == 20000)
            break;
    }
    printf("%d years later, double.", n);
    return 0;
}
When I run it, nothing happens.
Description: cannot open output file mm.exe: Permission denied
Resource: mm
Type: C/C++ Problem
I would appreciate it if you let me know what went wrong.
Since you have x as an integer and its value is 5, the line
y += y*(x/100);
is equivalent to
y += 0;
because (5/100) with integer division yields 0. This makes the while(1) loop run forever, so the program never terminates.
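A minimal sketch of a corrected loop (note it also compares with >= rather than ==, since with integer rounding y may jump past exactly 20000 and never equal it):

#include <stdio.h>

int main(void)
{
    int x = 5;      /* x = interest rate (5%) */
    int y = 10000;  /* y = principal */
    int n = 0;      /* n = after years */

    while (1)
    {
        n++;
        y += y * x / 100;  /* multiply first so the interest isn't truncated to 0 */
        if (y >= 20000)    /* >= because y may skip past exactly 20000 */
            break;
    }
    printf("%d years later, double.\n", n);
    return 0;
}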
Additionally, the permission denied error looks like it can be fixed by changing your save file location. Here is my source and some extra info
Hope this helps!
Two possible mechanisms come to mind:

- IPC like the existing QMP and QAPI
- QEMU loads a shared library plugin that contains the model

Required capabilities (of course all possible through the C API, but not necessarily IPC APIs):

- inject interrupts
- register callbacks for register access
- modify main memory

Why I want this:

- use QEMU as a submodule and leave its source untouched
- additional advantages only present for IPC methods:
  - write the models in any language I want
  - use a non-GPL license for my device
I'm aware of in-tree devices as explained at: How to add a new device in QEMU source code? which are the traditional way of doing things.
What I've found so far:

- interrupts: could only find NMI generation with the nmi monitor command
- IO ports: IO is possible with the i and o monitor commands, so I'm fine there
- main memory:
  - the ideal solution would be to map memory to the host directly, but that seems hard:
    - http://kvm.vger.kernel.narkive.com/rto1dDqn/sharing-variables-memory-between-host-and-guest
    - https://www.linux-kvm.org/images/e/e8/0.11.Nahanni-CamMacdonell.pdf
    - http://www.fp7-save.eu/papers/SCALCOM2016.pdf
  - memory read is possible through the x and xp monitor commands
  - could not find how to write to memory with monitor commands, but I think the GDB API supports it, so it should not be too hard to implement
The closest working piece of code I could find was: https://github.com/texane/vpcie , which serializes PCI on both sides, and sends it through QEMU's TCP API. But this is more inefficient and intrusive, as it requires extra setup on both guest and host.
This creates an out-of-tree PCI device; it just displays the device in lspci.
It will make PCI driver implementation faster, as it will act as a module.
Can we extend this to have functionality similar to QEMU's edu PCI device?
https://github.com/alokprasad/pci-hacking/blob/master/ksrc/virtual_pcinet/virtual_pci.c
/*
*/
#include <linux/init.h>
#include <linux/module.h>
#include <linux/sysfs.h>
#include <linux/fs.h>
#include <linux/kobject.h>
#include <linux/device.h>
#include <linux/proc_fs.h>
#include <linux/types.h>
#include <linux/pci.h>
#include <linux/version.h>
#include <linux/kernel.h>

#define PCI_VENDOR_ID_XTREME 0x15b3
#define PCI_DEVICE_ID_XTREME_VNIC 0x1450

static struct pci_bus *vbus;
static struct pci_sysdata *sysdata;

static DEFINE_PCI_DEVICE_TABLE(vpci_dev_table) = {
    {PCI_DEVICE(PCI_VENDOR_ID_XTREME, PCI_DEVICE_ID_XTREME_VNIC)},
    {0}
};

MODULE_DEVICE_TABLE(pci, vpci_dev_table);
int vpci_read(struct pci_bus *bus, unsigned int devfn, int where,
              int size, u32 *val)
{
    switch (where) {
    case PCI_VENDOR_ID:
        *val = PCI_VENDOR_ID_XTREME | PCI_DEVICE_ID_XTREME_VNIC << 16;
        /* our id */
        break;
    case PCI_COMMAND:
        *val = 0;
        break;
    case PCI_HEADER_TYPE:
        *val = PCI_HEADER_TYPE_NORMAL;
        break;
    case PCI_STATUS:
        *val = 0;
        break;
    case PCI_CLASS_REVISION:
        *val = (4 << 24) | (0 << 16) | 1;
        /* network class, ethernet controller, revision 1 */ /*2 or 4*/
        break;
    case PCI_INTERRUPT_PIN:
        *val = 0;
        break;
    case PCI_SUBSYSTEM_VENDOR_ID:
        *val = 0;
        break;
    case PCI_SUBSYSTEM_ID:
        *val = 0;
        break;
    default:
        *val = 0;
        /* sensible default */
    }
    return 0;
}

int vpci_write(struct pci_bus *bus, unsigned int devfn, int where,
               int size, u32 val)
{
    switch (where) {
    case PCI_BASE_ADDRESS_0:
    case PCI_BASE_ADDRESS_1:
    case PCI_BASE_ADDRESS_2:
    case PCI_BASE_ADDRESS_3:
    case PCI_BASE_ADDRESS_4:
    case PCI_BASE_ADDRESS_5:
        break;
    }
    return 0;
}

struct pci_ops vpci_ops = {
    .read = vpci_read,
    .write = vpci_write
};

void vpci_remove_vnic()
{
    struct pci_dev *pcidev = NULL;

    if (vbus == NULL)
        return;

    pci_remove_bus_device(pcidev);
    pci_dev_put(pcidev);
}
EXPORT_SYMBOL(vpci_remove_vnic);
void vpci_vdev_remove(struct pci_dev *dev)
{
}

static struct pci_driver vpci_vdev_driver = {
    .name = "Xtreme-Virtual-NIC1",
    .id_table = vpci_dev_table,
    .remove = vpci_vdev_remove
};

int vpci_bus_init(void)
{
    struct pci_dev *pcidev = NULL;

    sysdata = kzalloc(sizeof(void *), GFP_KERNEL);
    vbus = pci_scan_bus_parented(NULL, 2, &vpci_ops, sysdata);
    //vbus = pci_create_root_bus(NULL,i,& vpci_ops, sysdata,NULL);
    //if (vbus != NULL)
    //    break;
    memset(sysdata, 0, sizeof(void *));
    if (vbus == NULL) {
        kfree(sysdata);
        return -EINVAL;
    }
    if (pci_register_driver(&vpci_vdev_driver) < 0) {
        pci_remove_bus(vbus);
        vbus = NULL;
        return -EINVAL;
    }
    pcidev = pci_scan_single_device(vbus, 0);
    if (pcidev == NULL)
        return 0;
    else
        pci_dev_get(pcidev);
    pci_bus_add_devices(vbus);
    return 0;
}

void vpci_bus_remove(void)
{
    if (vbus) {
        pci_unregister_driver(&vpci_vdev_driver);
        device_unregister(vbus->bridge);
        pci_remove_bus(vbus);
        kfree(sysdata);
        vbus = NULL;
    }
}

static int __init pci_init(void)
{
    printk("module loaded");
    vpci_bus_init();
    return 0;
}

static void __exit pci_exit(void)
{
    printk(KERN_ALERT "unregister PCI Device\n");
    pci_unregister_driver(&vpci_vdev_driver);
}

module_init(pci_init);
module_exit(pci_exit);

MODULE_LICENSE("GPL");
There is at least one fork of QEMU I'm aware of that offers shared library plugins for QEMU... but it's a fork of QEMU 4.0.
https://github.com/cromulencellc/qemu-shoggoth
It is possible to build out of tree plugins with this fork, though it's not documented.
On Nov 11 2019 Peter Maydell, a major QEMU contributor, commented on another Stack Overflow question that:
Device plugins are specifically off the menu, because upstream does not want to provide a nice easy mechanism for people to use to have out-of-tree non-GPL/closed-source devices.
So it seems that the QEMU developers opposed this idea at that point in time. It is worth learning about the QEMU plugin system though, which might come in handy for related applications in any case: How to count the number of guest instructions QEMU executed from the beginning to the end of a run?
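For a taste of what that plugin system looks like, here is a rough sketch of a TCG plugin that counts instructions in translated blocks. This assumes the qemu-plugin.h API from recent QEMU versions, it is instrumentation only (not device emulation), and the exact callback names should be checked against your QEMU tree.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <qemu-plugin.h>

QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_API_VERSION;

static uint64_t tb_insns;

/* Called each time a translation block is translated; add up its instructions. */
static void vcpu_tb_trans(qemu_plugin_id_t id, struct qemu_plugin_tb *tb)
{
    tb_insns += qemu_plugin_tb_n_insns(tb);
}

static void plugin_exit(qemu_plugin_id_t id, void *userdata)
{
    fprintf(stderr, "instructions in translated blocks: %" PRIu64 "\n", tb_insns);
}

QEMU_PLUGIN_EXPORT int qemu_plugin_install(qemu_plugin_id_t id,
                                           const qemu_info_t *info,
                                           int argc, char **argv)
{
    qemu_plugin_register_vcpu_tb_trans_cb(id, vcpu_tb_trans);
    qemu_plugin_register_atexit_cb(id, plugin_exit, NULL);
    return 0;
}

Built as a shared object, something like this is loaded with -plugin ./libcount.so on the QEMU command line.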
This is a shame. Imagine if the Linux kernel didn't have a kernel module interface! I suggest QEMU expose this interface but just not make it stable, so that it won't impose a burden on developers, with the upside that those who do merge won't have such painful rebases.
I'm trying to execute some sample code from the Thrust Quick Start Guide; it's pasted below. What is killing me is that when I run it, an exception is thrown ("R6010 - abort() has been called") whenever I hit the find_if.
I've tried this using both the 4.1 and 4.2 runtimes. I'm building this in Visual Studio 2010 Ultimate using the latest NSight release candidate (downloaded May 4th, 2012). My graphics card is an NVidia NVS 3100m.
I can run the vector addition sample generated in a new VS project (that doesn't use Thrust) and it works okay. Adding Thrust however gives me this weirdness.
Any suggestions are appreciated.
mj
thrust::device_vector<int> input(4);
input[0] = 0;
input[1] = 5;
input[2] = 3;
input[3] = 7;
thrust::device_vector<int>::iterator iter;
iter = thrust::find_if(input.begin(), input.end(), greater_than_four());
iter = thrust::find_if(input.begin(), input.end(), greater_than_ten());
EDIT1
Another tidbit of information. Digging deeper into this, I see that an error is caught during cudaThreadSynchronize(). The message is "launch_closure_by_value".
I figured it out. The __host__ and __device__ tags were missing.
struct greater_than_four
{
    __host__ __device__
    bool operator()(int x)
    {
        return x > 4;
    }
};
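Presumably the second predicate used in the question needs the same decoration. Its body isn't shown in the original post, so the comparison here is just a guess at what it looks like:

struct greater_than_ten
{
    __host__ __device__
    bool operator()(int x)
    {
        return x > 10;  // assumed body; the important part is the __host__ __device__ tags
    }
};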
I have recently started learning CUDA, and I've integrated it into MS Visual Studio 2010 with Nsight. I have also acquired the book "CUDA by Example", and I'm going through all the examples and compiling them. However, I have come across an error which I do not understand.
The program comes from chapter 4 and it's the julia_gpu example. Original code:
#include "../common/book.h"
#include "../common/cpu_bitmap.h"
#define DIM 1000
struct cuComplex {
float r;
float i;
cuComplex( float a, float b ) : r(a), i(b) {}
__device__ float magnitude2( void ) {
return r * r + i * i;
}
__device__ cuComplex operator*(const cuComplex& a) {
return cuComplex(r*a.r - i*a.i, i*a.r + r*a.i);
}
__device__ cuComplex operator+(const cuComplex& a) {
return cuComplex(r+a.r, i+a.i);
}
};
__device__ int julia( int x, int y ) {
const float scale = 1.5;
float jx = scale * (float)(DIM/2 - x)/(DIM/2);
float jy = scale * (float)(DIM/2 - y)/(DIM/2);
cuComplex c(-0.8, 0.156);
cuComplex a(jx, jy);
int i = 0;
for (i=0; i<200; i++) {
a = a * a + c;
if (a.magnitude2() > 1000)
return 0;
}
return 1;
}
__global__ void kernel( unsigned char *ptr ) {
// map from blockIdx to pixel position
int x = blockIdx.x;
int y = blockIdx.y;
int offset = x + y * gridDim.x;
// now calculate the value at that position
int juliaValue = julia( x, y );
ptr[offset*4 + 0] = 255 * juliaValue;
ptr[offset*4 + 1] = 0;
ptr[offset*4 + 2] = 0;
ptr[offset*4 + 3] = 255;
}
// globals needed by the update routine
struct DataBlock {
unsigned char *dev_bitmap;
};
int main( void ) {
DataBlock data;
CPUBitmap bitmap( DIM, DIM, &data );
unsigned char *dev_bitmap;
HANDLE_ERROR( cudaMalloc( (void**)&dev_bitmap, bitmap.image_size() ) );
data.dev_bitmap = dev_bitmap;
dim3 grid(DIM,DIM);
kernel<<<grid,1>>>( dev_bitmap );
HANDLE_ERROR( cudaMemcpy( bitmap.get_ptr(), dev_bitmap,
bitmap.image_size(),
cudaMemcpyDeviceToHost ) );
HANDLE_ERROR( cudaFree( dev_bitmap ) );
bitmap.display_and_exit();
}
My Visual Studio, however, forces me to decorate the cuComplex constructor with __device__, otherwise it won't compile (it tells me I cannot use it later in the julia function), which I guess is fair enough. So I have:
__device__ cuComplex( float a, float b ) : r(a), i(b) {}
But when I run the example (having added the includes it needs to build in VS, namely cuda_runtime.h and device_launch_parameters.h, and having copied glut32.dll into the same folder as the exe), it quickly fails, killing my device driver and reporting an unknown error in line 94, which is the cudaMemcpy call in main. To be exact, it's the line containing cudaMemcpyDeviceToHost. To be frank, though, I have tried setting breakpoints line after line, and the driver dies at the kernel call.
Could someone please tell me what might be wrong? I am a noob with CUDA and have no real idea why a trivial example would kill itself like that. What could I be doing wrong? Because frankly, I don't really even know what to investigate.
I have the CUDA 4.1 toolkit, Nsight 2.1, and a GeForce GT 445M with compute capability 2.1, on driver version 295.
I haven't had time to test this yet, but I think it may be your graphics card "timing out" as far as Windows is concerned.
Since Vista, Windows by default tells the graphics driver to recover after 2 seconds. If your job takes longer, you get booted. You can increase or remove this timeout through the registry. I assume you need a reboot for this, because I just made the changes and it's not working yet.
See this link for detail:
http://msdn.microsoft.com/en-us/windows/hardware/gg487368.aspx
...
Timeout Detection and Recovery: Windows Vista attempts to detect these problematic hang situations and recover a responsive desktop dynamically. In this process, the Windows Display Driver Model (WDDM) driver is reinitialized and the GPU is reset. No reboot is necessary, which greatly enhances the user experience. The only visible artifact from the hang detection to the recovery is a screen flicker, which results from resetting some portions of the graphics stack, causing a screen redraw. Some older Microsoft DirectX applications may render to a black screen at the end of this recovery. The end user would have to restart these applications. The following is a brief overview of the TDR process: ...
Clearly this is why it's a weird bug: it will give you that memcpy error at different scales for different people, depending on how fast their graphics card is.
This is a known issue in CUDA.
You can try changing this:
const float scale = 1.5;
to something larger like 3.5, 4.5, 5.5.
example:
const float scale = 5.5;