GPUDirect RDMA transfer from GPU to remote host

Scenario:
I have two machines, a client and a server, connected with InfiniBand. The server machine has an NVIDIA Fermi GPU, but the client machine has no GPU. I have an application running on the GPU machine that uses the GPU for some calculations. The result data on the GPU is never used by the server machine, but is instead sent directly to the client machine without any processing. Right now I'm doing a cudaMemcpy to get the data from the GPU to the server's system memory, then sending it off to the client over a socket. I'm using SDP to enable RDMA for this communication.
Question:
Is it possible for me to take advantage of NVIDIA's GPUDirect technology to get rid of the cudaMemcpy call in this situation? I believe I have the GPUDirect drivers correctly installed, but I don't know how to initiate the data transfer without first copying it to the host.
My guess is that it isn't possible to use SDP in conjunction with GPUDirect, but is there some other way to initiate an RDMA data transfer from the server machine's GPU to the client machine?
Bonus: If someone has a simple way to test whether I have the GPUDirect dependencies correctly installed, that would be helpful as well!

Yes, it is possible with supporting networking hardware. See the GPUDirect RDMA documentation.
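For illustration, here is a minimal sketch of the registration step, assuming a GPU, HCA, and driver stack that actually support GPUDirect RDMA (e.g. the nv_peer_mem / nvidia-peermem kernel module is loaded); queue-pair setup and the RDMA operations themselves are omitted:

    /* Minimal sketch: registering GPU memory for RDMA with libibverbs.
     * Assumes GPUDirect RDMA support is installed and loaded. */
    #include <cuda_runtime.h>
    #include <infiniband/verbs.h>
    #include <stdio.h>

    int main(void)
    {
        const size_t len = 1 << 20;              /* 1 MiB result buffer */
        void *gpu_buf = NULL;

        if (cudaMalloc(&gpu_buf, len) != cudaSuccess) {
            fprintf(stderr, "cudaMalloc failed\n");
            return 1;
        }

        /* Open the first HCA and allocate a protection domain. */
        int num = 0;
        struct ibv_device **devs = ibv_get_device_list(&num);
        if (!devs || num == 0) { fprintf(stderr, "no HCA found\n"); return 1; }
        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);

        /* The key step: with GPUDirect RDMA the device pointer can be
         * registered directly, so the HCA reads GPU memory with no
         * intermediate cudaMemcpy to host memory. */
        struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ);
        if (!mr) { perror("ibv_reg_mr"); return 1; }

        /* ... create a QP and post sends/RDMA writes using mr->lkey ... */

        ibv_dereg_mr(mr);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        cudaFree(gpu_buf);
        return 0;
    }

As a practical side effect, this also addresses the bonus question for a setup like this: if ibv_reg_mr fails when given a device pointer, the GPUDirect RDMA pieces are most likely not installed or loaded correctly.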

Related

Can I use DPDK as a packet capture module for a network monitoring application?

My passive network monitoring application needs packets to be captured from a network interface (at high packet rates). The packet capture module should be able to call a monitoring function upon capture of each packet (and also write the packet to a pcap file).
I thought of using DPDK as the packet capture module in my monitoring application (the way we use pcap_loop and pfring_loop in libpcap and PF_RING respectively), but I am not sure whether this is one of the intended use cases of DPDK, or whether DPDK is meant to be used like this.
So my questions are:
Can I use DPDK to fulfill my requirements? If yes, how do I start?
OS: Linux.
Kernel version: 4.
DPDK version: Latest stable.
Capture on physical device.
The capturing application has root privileges and will be used by the network administrator (as part of passive asset discovery).
I want to use DPDK because it supports capture at line rate up to 10 Gbps.
Thank you.
Based on the updates and the clarification in the comments, the request is: can one replace an existing C application that uses PF_RING API calls with the DPDK API? The simple answer is yes, it can be done.
Here is how one should start:
Identify the platform (preferably Linux/BSD; Windows support in DPDK 21.02 is still a work in progress).
Identify the processor against DPDK's list of supported CPUs.
Identify a NIC to use from the list of DPDK-supported NICs.
Set up the Linux environment as described in the DPDK Getting Started Guide for Linux.
Explore the basic examples/skeleton (basicfwd) sample for usage.
Get the start of the Ethernet header for each packet using the DPDK API rte_pktmbuf_mtod; many samples in the DPDK examples folder do the same.
Invoke the packet-processing logic between rx_burst and tx_burst of examples/skeleton, as in the sketch below.
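As a rough illustration, here is a minimal capture-loop sketch modeled on examples/skeleton (basicfwd). monitor_packet() is a hypothetical stand-in for the application's per-packet monitoring function; EAL, mempool, and port setup are omitted and work exactly as in the skeleton example:

    #include <rte_eal.h>
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>
    #include <rte_ether.h>

    #define BURST_SIZE 32

    static void monitor_packet(const struct rte_ether_hdr *eth, uint32_t len)
    {
        (void)eth; (void)len;
        /* ... inspect the packet, optionally dump it to a pcap file ... */
    }

    static void capture_loop(uint16_t port_id)
    {
        struct rte_mbuf *bufs[BURST_SIZE];

        for (;;) {
            /* Poll a burst of packets from RX queue 0 of the port. */
            const uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, bufs, BURST_SIZE);

            for (uint16_t i = 0; i < nb_rx; i++) {
                /* rte_pktmbuf_mtod() gives the start of packet data,
                 * i.e. the Ethernet header. */
                struct rte_ether_hdr *eth =
                    rte_pktmbuf_mtod(bufs[i], struct rte_ether_hdr *);
                monitor_packet(eth, rte_pktmbuf_pkt_len(bufs[i]));
                rte_pktmbuf_free(bufs[i]);   /* return mbuf to the pool */
            }
        }
    }

    int main(int argc, char **argv)
    {
        if (rte_eal_init(argc, argv) < 0)
            return -1;
        /* ... mempool and port initialization as in examples/skeleton ... */
        capture_loop(0);
        return 0;
    }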
Newer versions of libpcap can themselves use DPDK, at least on Linux. The libpcap on your system might or might not be configured to use it. (There are also versions of libpcap modified to use PF_RING.)
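Either way, the application-facing pattern stays the familiar callback loop; whether the backend is DPDK is decided when libpcap itself is built/configured. A minimal sketch ("eth0" is a placeholder interface name):

    #include <pcap/pcap.h>
    #include <stdio.h>

    static void on_packet(u_char *user, const struct pcap_pkthdr *h,
                          const u_char *bytes)
    {
        (void)user; (void)bytes;
        /* ... monitoring logic; pcap_dump() could write to a pcap file ... */
        printf("captured %u bytes\n", h->caplen);
    }

    int main(void)
    {
        char errbuf[PCAP_ERRBUF_SIZE];
        pcap_t *p = pcap_open_live("eth0", 65535, 1, 1000, errbuf);
        if (!p) {
            fprintf(stderr, "pcap_open_live: %s\n", errbuf);
            return 1;
        }
        pcap_loop(p, -1, on_packet, NULL);   /* -1: run until error/break */
        pcap_close(p);
        return 0;
    }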

Resizing cloud VM disk without taking instance down (Google Cloud)

So I saw there is an option in Google Compute Engine (I assume the same option exists with other cloud VM providers, so the question isn't specifically about Google Compute Engine but about the underlying technology) to resize the disk without having to restart the machine, and I ask: how is this possible?
Even if it uses some sort of abstraction of the disk and they don't actually assign a physical disk to the VM, but just part of a disk (or parts of a number of disks), once the disk is created in the guest VM it has a certain size. How can it change without needing a restart? Does it utilize NFS somehow?
This is built directly into disk protocols these days. This capability has existed for a while, since disks have been virtualized since the late 1990s (either through network protocols like iSCSI / FibreChannel, or through a software-emulated version of hardware like VMware).
Like the VMware model, GCE doesn't require any additional network hops or protocols to do this; the hypervisor just exposes the virtual disk as if it were a physical device, and the guest knows that its size can change and handles that. GCE uses a virtualization-specific driver type for its disks called VirtIO SCSI, but this feature is implemented in many other driver types (across many OSes) as well.
Since a disk can be resized at any time, disk protocols need a way to tell the guest that an update has occurred. In general terms, this works as follows in most protocols:
Administrator resizes disk from hypervisor UI (or whatever storage virtualization UI they're using).
Nothing happens inside the guest until it issues an IO to the disk.
Guest OS issues an IO command to the disk, via the device driver in the guest OS.
Hypervisor emulates that IO command, notices that the disk has been resized and the guest hasn't been alerted yet, and returns a response to the guest telling it to update its view of the device.
The guest OS recognizes this response and re-queries the device size and other details via some other command.
I'm not 100% sure, but I believe the reason it's structured like this is that traditionally disks cannot send updates to the OS unless the OS requests them first. This is probably because the disk has no way to know what memory is free to write to, and even if it did, no way to synchronize access to that memory with the OS. However, those constraints are becoming less true to enable ultra-high-throughput / ultra-low-latency SSDs and NVRAM, so new disk protocols such as NVMe may do this slightly differently (I don't know).
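To make step 5 concrete from the guest's side, here is a minimal Linux sketch that re-queries a block device's current size. The in-band resize notification itself is handled inside the disk driver, so this only shows the "re-query the device size" part; /dev/sda is a placeholder and the program needs root:

    #include <fcntl.h>
    #include <linux/fs.h>      /* BLKGETSIZE64 */
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/sda", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        uint64_t bytes = 0;
        /* Ask the kernel for the device's current size in bytes. */
        if (ioctl(fd, BLKGETSIZE64, &bytes) < 0) {
            perror("ioctl");
            close(fd);
            return 1;
        }
        printf("/dev/sda is %llu bytes\n", (unsigned long long)bytes);
        close(fd);
        return 0;
    }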

Does cudamalloc incur any kernel calls?

I'm reading the HSA spec and it says the user-mode application can submit jobs to GPU queues directly without any OS interaction. I think this must be because the application can talk with the GPU driver directly, and therefore doesn't need to incur any OS kernel calls.
So my question is: for a very simple example, in a CUDA application, when we make a cudaMalloc() call, does it incur any OS kernel calls?
The entire premise of this question is flawed. "Submitting a job" and allocating memory are not the same thing. Even a user space process running on the host CPU which calls malloc will (most of the time) result in a kernel call as the standard library gathers or releases physical memory to its memory heap, normally via sbrk or mmap.
So yes, cudaMalloc results in an OS kernel call - if you run strace you will see the GPU driver invoking ioctl to issue commands to the GPU MMU/TLB. But so does running malloc in host code, and so, undoubtedly, would malloc on a theoretical HSA platform as well.
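If you want to see this for yourself, here is a minimal program to run under strace; the exact ioctl traffic depends on the driver version:

    /* Compile with nvcc and run e.g.:
     *   strace -e trace=ioctl,mmap ./a.out */
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        void *d = NULL;
        cudaError_t err = cudaMalloc(&d, 1 << 20);   /* 1 MiB device alloc */
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaMalloc: %s\n", cudaGetErrorString(err));
            return 1;
        }
        cudaFree(d);
        return 0;
    }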

Networked CUDA GPU

Is it possible for multiple low-end computers to each make CUDA calls to a GPU located on a central server in a request/response over the cloud scenario? To make it as if these low-end computers possess a "virtual" GPU.
I had a similar problem to solve.
The database was living on the low-end machine and I had a cluster of GPUs at my disposal on the local network.
I made a small client (on the low-end machine) to parse the database, serialize the data with Google protocol buffers, and send it to the server with zmq sockets. For data distribution you can have asynchronous publisher/subscriber sockets.
On the server side you deserialize the data and you have the CUDA program run the calculations (it can also be a daemonized application so you don't have to fire it up yourself every time).
Once the data is ready on the server you can issue a synchronous message (request/reply socket) from the client and when the server receives the message it calls a function wrapper to the CUDA kernel.
If you need to process the results back on the client you can follow the reverse route to send the data back to the client.
If the data is already on the server, it's even easier. You only need the request/reply socket to send a message and call the function.
Check the zmq manual, they have a lot of examples in many programming languages.
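For illustration, here is a minimal sketch of the server side of that request/reply step, using the zmq C API. run_kernel() is a hypothetical wrapper around the CUDA kernel launch; the endpoint and message contents are placeholders:

    #include <zmq.h>
    #include <stdio.h>

    static void run_kernel(void)
    {
        /* ... launch the CUDA kernel on the already-staged data ... */
    }

    int main(void)
    {
        void *ctx = zmq_ctx_new();
        void *rep = zmq_socket(ctx, ZMQ_REP);
        zmq_bind(rep, "tcp://*:5555");

        for (;;) {
            char req[256];
            int n = zmq_recv(rep, req, sizeof req - 1, 0); /* wait for a request */
            if (n < 0)
                break;
            if (n > (int)sizeof req - 1)
                n = sizeof req - 1;                        /* message truncated */
            req[n] = '\0';

            run_kernel();                       /* do the GPU work */
            zmq_send(rep, "done", 4, 0);        /* reply to the client */
        }

        zmq_close(rep);
        zmq_ctx_destroy(ctx);
        return 0;
    }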

How to prevent two CUDA programs from interfering

I've noticed that if two users try to run CUDA programs at the same time, it tends to lock up either the card or the driver (or both?). We need to either reset the card or reboot the machine to restore normal behavior.
Is there a way to get a lock on the GPU so other programs can't interfere while it's running?
Edit
OS is Ubuntu 11.10 running on a server. While there is no X Windows running, the card is used to display the text system console. There are multiple users.
If you are running on either Linux or Windows with the TCC driver, you can put the GPU into compute exclusive mode using the nvidia-smi utility (e.g. nvidia-smi -c EXCLUSIVE_PROCESS on recent drivers).
Compute exclusive mode makes the driver refuse a context establishment request if another process already holds a context on that GPU. Any process trying to run on a busy compute exclusive GPU will receive a no device available error and fail.
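To check what mode a device is currently in from application code, here is a minimal CUDA runtime sketch (querying device 0); setting the mode itself is done with nvidia-smi as described above:

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void)
    {
        struct cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
            fprintf(stderr, "failed to query device 0\n");
            return 1;
        }

        switch (prop.computeMode) {
        case cudaComputeModeDefault:
            puts("Default: multiple processes may share the device");
            break;
        case cudaComputeModeProhibited:
            puts("Prohibited: no contexts allowed on the device");
            break;
        default:
            puts("Exclusive: only one context at a time");
            break;
        }
        return 0;
    }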
You can use something like Task Spooler to queue the programs and run them one at a time.
We use TORQUE Resource Manager, but it's harder to configure than ts. With TORQUE you can have multiple queues (e.g. one for CUDA jobs, two for CPU jobs) and assign a different job to each GPU.