How to define a global memory array in device code and pass its contents to the host after execution? - cuda

I am trying to create a device global memory array in kernel code and, after execution has finished, pass the array contents to host memory. Is it possible to create a global memory array at device code scope dynamically, or do I need to define the array outside of the device code scope as a global array?
__global__ void kernel_code(..., int array_size){
    __device__ int array_data[array_size]; // is this possible?
    // fill the array_data
    ...
}
int main(){
    // pass data from array_data to a host array
}
Is it possible to do that? If not, what is the usual practice?

The allocation of the array must be performed statically by the compiler, so you cannot make its size a parameter that you pass to a kernel.
Furthermore, a __device__ variable declaration is not allowed inside a function body, so it has to be at global (file) scope in your module, not at function scope.
Apart from that, you can pass data between a statically declared device array and a host array. The __device__ variable has the following characteristics:
Resides in global memory space,
Has the lifetime of an application,
Is accessible from all the threads within the grid and from the host through the runtime library (cudaGetSymbolAddress() / cudaGetSymbolSize() / cudaMemcpyToSymbol() / cudaMemcpyFromSymbol()).
So in your host code, you would use cudaMemcpyToSymbol to transfer data from your host array to the device array, and cudaMemcpyFromSymbol to transfer data from the device array back to the host array.
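For example, a minimal sketch of the static approach (the size of 256 is chosen arbitrarily and must be a compile-time constant; error checking omitted):

#include <cstdio>
#include <cuda_runtime.h>

#define ARRAY_SIZE 256                      // must be known at compile time

__device__ int array_data[ARRAY_SIZE];      // file scope, not inside the kernel

__global__ void kernel_code(int value){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < ARRAY_SIZE)
        array_data[i] = value + i;          // fill the array on the device
}

int main(){
    kernel_code<<<1, ARRAY_SIZE>>>(10);
    int host_array[ARRAY_SIZE];
    // copy the __device__ array back to the host, addressed by symbol
    cudaMemcpyFromSymbol(host_array, array_data, sizeof(host_array));
    printf("%d\n", host_array[0]);          // prints 10
    return 0;
}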
For dynamically sized device arrays, the most common practice is to allocate them with ordinary host runtime API functions such as cudaMalloc, and to transfer data between a host array and a device array (in either direction) with cudaMemcpy.
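A minimal sketch of the dynamic approach (names are illustrative; error checking omitted):

__global__ void kernel_code(int *array_data, int array_size){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < array_size)
        array_data[i] = i;                  // fill the array on the device
}

int main(){
    const int array_size = 1000;            // now a runtime value
    int *d_array;
    cudaMalloc(&d_array, array_size * sizeof(int));
    kernel_code<<<(array_size + 255) / 256, 256>>>(d_array, array_size);

    int *h_array = new int[array_size];
    // pass data from the device array to the host array
    cudaMemcpy(h_array, d_array, array_size * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_array);
    delete[] h_array;
    return 0;
}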

Normal practice is to manipulate device memory only in kernels (it's much faster). Simply use cudaMemcpy(dst, src, size, cudaMemcpyDeviceToHost) to copy the data into host memory (in main()).

Related

What's the replacement for cuModuleGetSurfRef and cuModuleGetTexRef?

CUDA 12 indicates that these two functions:
CUresult cuModuleGetSurfRef (CUsurfref* pSurfRef, CUmodule hmod, const char* name);
CUresult cuModuleGetTexRef (CUtexref* pTexRef, CUmodule hmod, const char* name);
which obtain a reference to a surface or a texture, respectively, from a loaded module - are deprecated.
What are they deprecated in favor of? Are surfaces and textures in modules to be accessed differently? Will they be entirely out of modules? If it's the latter, how would one work with them using the CUDA driver API?
So, based on @talonmies' comment, it seems the replacements are "texture objects" and "surface objects". The main difference - as far as is evident in the API - is that the new "objects" involve far fewer API calls, which take richer descriptors: the user fills in the descriptor fields directly, so the large number of cuTexRefGetXXXX and cuTexRefSetXXXX calls is no longer needed. There are also "tensor map objects", appearing with Compute Capability 9.0 and later.
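For illustration, a minimal sketch of creating a texture object with the driver API (it assumes an already-populated CUarray and an existing context; the address/filter modes are arbitrary choices; error checking omitted):

#include <cuda.h>
#include <cstring>

CUtexObject make_tex_object(CUarray hArray){
    CUDA_RESOURCE_DESC resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = CU_RESOURCE_TYPE_ARRAY;
    resDesc.res.array.hArray = hArray;      // the data the texture reads from

    CUDA_TEXTURE_DESC texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.addressMode[0] = CU_TR_ADDRESS_MODE_CLAMP;
    texDesc.filterMode = CU_TR_FILTER_MODE_POINT;

    CUtexObject tex;
    cuTexObjectCreate(&tex, &resDesc, &texDesc, NULL);
    return tex;  // passed to the kernel as an ordinary argument,
                 // instead of being looked up by name in the module
}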

Can't write into iomem region in qemu using gdb

I'm trying to add a new device in QEMU.
In the respective CPU file, I used sysbus_mmio_map to set the base address.
sysbus_mmio_map(SYS_BUS_DEVICE(&s->brif), 0, BASE_ADDRESS);
In the newly created device file,
memory_region_init_io(&s->iomem, obj, &ops, s, "brif", SIZE);
sysbus_init_mmio(SYS_BUS_DEVICE(obj), &s->iomem);
The ops has the corresponding read and write handlers.
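(For context, a minimal sketch of what such an ops structure looks like; the handler names are illustrative:)

static uint64_t brif_read(void *opaque, hwaddr addr, unsigned size)
{
    /* return register contents selected by addr */
    return 0;
}

static void brif_write(void *opaque, hwaddr addr, uint64_t val, unsigned size)
{
    /* update device state from addr/val */
}

static const MemoryRegionOps ops = {
    .read = brif_read,
    .write = brif_write,
    .endianness = DEVICE_NATIVE_ENDIAN,
};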
My read handler is getting called when I access the IO memory region using gdb, but my write handler is not getting called when I write to the IO memory region using gdb.
What am I missing?
Update: the write handler is called if I write to the IO memory region from code running inside the guest; the problem occurs only when I access it from gdb.
I believe it's just a bug. See this bug report (with a patch included).

Android kernel: How to create /dev/video0 before the ueventd daemon gets started?

I want to access /dev/video0 from a kernel module after the camera is initialized.
For that I want to create the /dev/video0 node before the ueventd daemon gets started.
Looking more deeply at the kernel's handling of the /dev/video0 node: whenever an application opens this file, it gets a FILE *fp pointer. The Linux kernel virtual file system checks whether this is a regular file or a device file; if it is a device file, it uses the major number to track down the driver that registered it, and saves the minor number in the i_rdev field of the struct inode *inode, which is in turn embedded in the struct file *fp passed to that driver.
So for every FILE *fp opened by an application there is a struct file *fp in the registered driver, i.e. the v4l2 driver in our case. This file pointer is passed on to the kernel ioctl handler v4l2_ioctl.
Now, internally the v4l2 driver maintains an array of pointers to all the registered video devices, as seen below:
static struct video_device *video_device[VIDEO_NUM_DEVICES];
Now look at the implementation of the main ioctl call:
static long v4l2_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
    struct video_device *vdev = video_devdata(filp);
    ...
}
This video_device structure is extracted from the file pointer, and it is the key by which we can control the video device (i.e. our camera) from within the kernel, as it contains function pointers to all the registered v4l2 ioctls. So our target is to access the video_device structure from within the kernel.
Now look again at how the kernel accesses the video device when it gets a request from an application:
struct video_device *video_devdata(struct file *file)
{
    return video_device[iminor(file->f_path.dentry->d_inode)];
}
EXPORT_SYMBOL(video_devdata);

static inline unsigned iminor(const struct inode *inode)
{
    return MINOR(inode->i_rdev);
}
As seen above, it uses the i_rdev field to get the minor number passed from struct file *fp through the VFS.
To summarise: if we want to call the ioctl from within the kernel, we need to fill a dummy struct file *fp pointer whose file->f_path.dentry->d_inode->i_rdev field contains the minor number. The v4l2 subsystem will find the video_device structure using this field and will then be able to drive the ioctl operations through the video_device->ioctl_ops field, as seen below:
struct video_device
{
#if defined(CONFIG_MEDIA_CONTROLLER)
    struct media_entity entity;
#endif
    /* device ops */
    const struct v4l2_file_operations *fops;
    const struct v4l2_ioctl_ops *ioctl_ops;
    ...
};
To set file->f_path.dentry->d_inode->i_rdev, we need to wire up inode and dentry structures inside a file structure, as in the following pseudocode:
static int enumerate_camera(void)
{
    struct inode inode;
    struct dentry dentry;
    struct file file;

    inode.i_rdev = cam_minor_number;    /* saved when the camera device registered */
    dentry.d_inode = &inode;
    file.f_path.dentry = &dentry;
    ...
}

Is it possible to (deep) copy back all of the dynamically allocated memory on the device (in an array-of-pointers manner)? [duplicate]

This question already has an answer here:
How to copy the memory allocated in device function back to main memory
(1 answer)
Closed 7 years ago.
I need to use polymorphism in my kernels. The only way of doing this is to create those objects on the device (to make the virtual method table available on the device). Here are the objects being created:
class Production {
    Vertex * boundVertex;
};
class Vertex {
    Vertex * leftChild;
    Vertex * rightChild;
};
Then on the host I do:
Production* dProd;
cudaMalloc(&dProd, sizeof(Production *));
createProduction<<<1,1>>>(dProd);
where
__global__ void createProduction(Production * prod) {
    prod = new Production();
    prod->leftChild = new Vertex();
    prod->rightChild = new Vertex();
}
The question is: how do I get both the left and right vertices of the production created on the device back onto the host? I know using pointers in classes makes them very hard to handle, but there is no other way of creating such a tree structure.
You can't do that.
The host runtime and driver memory management APIs can't be used to access allocations made on the device runtime heap using new or malloc. There is no way for the host to copy those Vertex instances from the device directly.
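One common workaround is to have a kernel copy the heap-allocated objects into a buffer allocated with cudaMalloc, which the host can then read with cudaMemcpy. A minimal sketch, assuming you have a device pointer to the root Vertex (the kernel and buffer names are illustrative):

__global__ void gatherVertices(Vertex *root, Vertex *out){
    // copy the heap-allocated objects by value into the cudaMalloc'ed buffer
    out[0] = *root->leftChild;
    out[1] = *root->rightChild;
}

// host side:
Vertex *dOut, hOut[2];
cudaMalloc(&dOut, 2 * sizeof(Vertex));
gatherVertices<<<1,1>>>(dRoot, dOut);
cudaMemcpy(hOut, dOut, 2 * sizeof(Vertex), cudaMemcpyDeviceToHost);

Note that the leftChild/rightChild pointers inside the copied objects are still device-heap addresses, meaningless on the host; to recover a whole tree you would have to flatten it (e.g. into array indices) on the device first.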

Do STL containers support ARC when storing Obj-C objects in Objective-C++?

For example, would this leak?
static std::tuple<CGSize, NSURL *> getThumbnailURL() {
    return std::make_tuple(CGSizeMake(100, 100), [NSURL URLWithString:@"http://examples.com/image.jpg"]);
}
No, it wouldn't leak. That NSURL object would be managed by ARC properly.
http://clang.llvm.org/docs/AutomaticReferenceCounting.html#template-arguments
If a template argument for a template type parameter is a retainable object owner type that does not have an explicit ownership qualifier, it is adjusted to have __strong qualification.
std::tuple<CGSize, NSURL *> is the same as std::tuple<CGSize, NSURL __strong *>. Thus the NSURL object will be released when the std::tuple instance is destructed.
Yes, they work. STL containers are templated (STL = Standard Template Library), so whenever you use one, it is as if you re-compiled its source code with the template arguments substituted in (template instantiation). And in that re-compiled code, ARC performs all the memory management appropriate for the managed pointer types.
Another way to think about it is that ARC-managed pointer types behave like C++ smart pointer types: they have a constructor that initializes the pointer to nil, an assignment operator that releases the existing value and retains (or, for block types, copies) the new value, and a destructor that releases the value. So to the same extent that STL containers work with such C++ smart pointer types, they work with ARC-managed pointer types.