Google Compute Engine - Machine types supporting the V100 GPU

I am trying to find a machine type that allows me to select a V100 GPU in the Singapore region. I can't find this in the documentation, and I can't seem to select it among the machine types in the pricing calculator. I can only select a T4 or P4 with the N1 machine type, or an A100 with the A2 machine type. Besides clicking on every instance of every machine type and checking whether a GPU is even available (largely not, so far), is the machine type -> GPU availability listed in the documentation somewhere?
Many thanks

Currently, the NVIDIA Tesla V100 GPU is not available in some regions, and asia-southeast1 (Singapore) is one of them. If you want to stay in the Singapore region, you can select the A100, P4, or T4 GPU types. To use a V100 GPU, you need to select one of the other zones listed in the documentation.
Try to create your instance in another zone where the GPU is available (request a quota increase if needed).
Refer to the image, which shows that you cannot select a V100 for the Singapore region.
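As a side note, if you would rather not click through the console, the gcloud CLI can list GPU availability per zone directly. A rough sketch is below; the exact filter syntax may differ slightly between gcloud versions:

```
# List every zone where a given accelerator type is offered
gcloud compute accelerator-types list --filter="name=nvidia-tesla-v100"

# List the accelerator types offered in the Singapore zones
gcloud compute accelerator-types list --filter="zone ~ asia-southeast1"
```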

Related

What is the difference between NVIDIA Hyper-Q and CUDA streams?

I always thought that Hyper-Q technology is nothing but streams in the GPU. Later I found I was wrong (am I?). So I did some reading about Hyper-Q and got more confused.
I was going through one article and it had these two statements:
A. Hyper-Q is a flexible solution that allows separate connections from multiple CUDA streams, from multiple Message Passing Interface (MPI) processes, or even from multiple threads within a process
B. Hyper-Q increases the total number of connections (work queues) between the host and the GK110 GPU by allowing 32 simultaneous, hardware-managed connections (compared to the single connection available with Fermi)
In the aforementioned points, Point B says that there can be multiple connections created to a single GPU from the host. Does that mean I can create multiple contexts on a single GPU through different applications? Does it mean that I will have to execute all applications on different streams? And if all my connections consume memory and compute resources, who manages the resource (memory/cores) scheduling?
Think of HyperQ as streams implemented in hardware on the device side.
Before the arrival of HyperQ, e.g. on Fermi, commands (kernel launches, memory transfers, etc.) from all streams were placed in a single work queue by the driver on the host. That meant that commands could not overtake each other, and you had to be careful issuing them in the right order on the host to achieve best overlap.
On the GK110 GPU and later devices with HyperQ, there are (at least) 32 work queues on the device. This means that commands from different queues can be reordered relative to each other until they start execution. So both orderings in the example linked above lead to good overlap on a GK110 device.
This is particularly important for multithreaded host code, where you can't control the order without additional synchronization between threads.
Note that of the 32 hardware queues only 8 are used by default to save resources. Set the CUDA_DEVICE_MAX_CONNECTIONS environment variable to a higher value if you need more.
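To make the above concrete, here is a minimal sketch (not from the original answer; the kernel and sizes are placeholders) of issuing independent work on many streams so that Hyper-Q's hardware queues can overlap it, and of raising CUDA_DEVICE_MAX_CONNECTIONS before the CUDA context is created:

```
// Minimal sketch: independent kernels on several streams can be picked up
// by separate Hyper-Q hardware queues and overlap on GK110+ devices.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void busyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 1000; ++k)
            data[i] = data[i] * 0.999f + 0.001f;
}

int main()
{
    // Optionally raise the number of device-side connections (default is 8).
    // Must be set before the CUDA context is created.
    setenv("CUDA_DEVICE_MAX_CONNECTIONS", "32", 1);

    const int nStreams = 16, n = 1 << 16;
    cudaStream_t streams[nStreams];
    float *buffers[nStreams];

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&buffers[s], n * sizeof(float));
        // Issue order on the host no longer matters as much as on Fermi:
        // each stream maps onto its own hardware work queue.
        busyKernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(buffers[s], n);
    }

    cudaDeviceSynchronize();
    for (int s = 0; s < nStreams; ++s) {
        cudaStreamDestroy(streams[s]);
        cudaFree(buffers[s]);
    }
    return 0;
}
```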

How to enable/disable a specific graphic card?

I'm working on a "fujitsu" machine. It has 2 GPUs installed: a Quadro 2000 and a Tesla C2075. The Quadro GPU has 1 GB of RAM and the Tesla GPU has 5 GB (I checked using the output of nvidia-smi -q). When I run nvidia-smi, the output shows 2 GPUs, but the Tesla one's display is shown as Off.
I'm running a memory-intensive program and would like to use the 5 GB of RAM available, but whenever I run the program, it seems to be using the Quadro GPU.
Is there some way to use a particular GPU out of the 2 in a program? Does the Tesla GPU being "disabled" mean its drivers are not installed?
You can control access to CUDA GPUs either using the environment or programmatically.
You can use the environment variable CUDA_VISIBLE_DEVICES to specify a list of 1 or more GPUs that will be visible to any application, as well as their order of visibility. For example if nvidia-smi reports your Tesla GPU as GPU 1 (and your Quadro as GPU 0), then you can set CUDA_VISIBLE_DEVICES=1 to enable only the Tesla to be used by CUDA code.
See my blog post on the subject.
To control which GPU your application uses programmatically, use the device management API of the CUDA runtime. Query the number of devices using cudaGetDeviceCount, query each device's properties using cudaGetDeviceProperties, select the device that fits your application's criteria, and then call cudaSetDevice on it. You can also use cudaChooseDevice to select the device that most closely matches the device properties you specify.
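For illustration, a minimal sketch of the programmatic route follows; the "most global memory" criterion is just an example stand-in for whatever your application actually needs:

```
// Minimal sketch: pick the CUDA device with the most global memory.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);

    int best = 0;
    size_t bestMem = 0;
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s, %zu MB\n", d, prop.name, prop.totalGlobalMem >> 20);
        if (prop.totalGlobalMem > bestMem) {
            bestMem = prop.totalGlobalMem;
            best = d;
        }
    }

    cudaSetDevice(best);   // subsequent CUDA calls in this thread use this device
    printf("Using device %d\n", best);
    return 0;
}
```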

nvidia-smi -ac equivalent in NVML

I learnt that nvidia-smi -ac can be used to change the clock rate of GPU cores and memory. Is nvidia-smi built upon the NVML library?
What is its equivalent in NVML? I checked the document
http://cyber.sibsutis.ru:82/GPGPU/sdk/CUDA_TOOLKIT/nvml.pdf
but could only see APIs used to get the values of clock rates rather than set them.
Thanks
Yes, nvidia-smi is built on the NVML library.
According to the latest NVML API documentation available here (which is linked from the site I previously suggested to you here), the "Set Application Clocks" command is supported on Tesla K10 and K20 GPUs (page 6). I believe it is also supported on "Kepler" members of the Quadro family, such as the Quadro K5000.
If you have a Tesla K10, K20, or K20X GPU, the Set Application Clocks command is described on p. 68, which I am also reproducing here for convenience:
7.12.2.2 nvmlReturn_t DECLDIR nvmlDeviceSetApplicationsClocks (nvmlDevice_t device, unsigned int memClockMHz, unsigned int graphicsClockMHz)
Set clocks that applications will lock to.
Sets the clocks that compute and graphics applications will be running at, e.g. the CUDA driver requests these clocks during context creation, which means this property defines the clocks at which CUDA applications will be running unless some overspec event occurs (e.g. over power, over thermal or external HW brake).
Can be used as a setting to request constant performance.
For Tesla™ products, and Quadro® products from the Kepler family. Requires root/admin permissions.
See nvmlDeviceGetSupportedMemoryClocks and nvmlDeviceGetSupportedGraphicsClocks for details on how to list available clock combinations.
After system reboot or driver reload, application clocks go back to their default value.
Parameters:
device The identifier of the target device
memClockMHz Requested memory clock in MHz
graphicsClockMHz Requested graphics clock in MHz
Returns:
• NVML_SUCCESS if new settings were successfully set
• NVML_ERROR_UNINITIALIZED if the library has not been successfully initialized
• NVML_ERROR_INVALID_ARGUMENT if device is invalid or memClockMHz and graphicsClockMHz is not a valid clock combination
• NVML_ERROR_NO_PERMISSION if the user doesn’t have permission to perform this operation
• NVML_ERROR_NOT_SUPPORTED if the device doesn’t support this feature
• NVML_ERROR_UNKNOWN on any unexpected error
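For convenience, here is a minimal sketch of calling this API from C (not part of the quoted documentation; error handling is trimmed, the clock values are placeholders for a K20-class board, and you should query the supported clock lists first). Link against -lnvidia-ml and run as root:

```
// Minimal sketch: set application clocks via NVML (needs root/admin).
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;

    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);

    /* Placeholder values (memory MHz, graphics MHz) - replace with a
       combination reported by nvmlDeviceGetSupportedMemoryClocks /
       nvmlDeviceGetSupportedGraphicsClocks for your GPU. */
    nvmlReturn_t r = nvmlDeviceSetApplicationsClocks(dev, 2600, 758);
    if (r != NVML_SUCCESS)
        printf("Failed to set application clocks: %s\n", nvmlErrorString(r));

    nvmlShutdown();
    return 0;
}
```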

Kernel time increases for same number of particles

I am trying to run my code on NVIDIA's K10 GPU. I am using the 5.0 CUDA driver and the 4.2 CUDA runtime. The problem is that the time taken by the kernel increases with iterations, even though each iteration uses the same number of sources and targets (or particles). Because of this, the kernel eventually takes a very long time, and the code crashes with a runtime error that says something like "GPU fallen off the bus".
The plot showing the behavior of increasing kernel run time with number of iterations can be seen here:
https://docs.google.com/open?id=0B5QLL4ig3LVqODdmVjNBTlp5UFU
I tried to run NVIDIA's "nbody" example to understand whether the same thing happens there too, and yes it does. For the number of particles/bodies (Np) = 1e5 and 10 iterations, the code runs fine. For Np = 1e5 and iterations = 100, or Np = 1e6 and iterations = 10, the code goes into a mode where it hangs the entire system.
When I run my own kernel as well as NVIDIA's nbody example on a different machine with Tesla C2050 NVIDIA card (CUDA Driver version: 3.2, and runtime version: 3.2), there is no problem, and kernel takes the same amount of time for every iteration.
I am trying to understand what's going on in the machine with the K10 GPU. I have tried different combinations of CUDA driver and runtime versions on this machine, and here is what I get:
For 5.0 CUDA Driver, 4.2 Runtime, it just hangs and sometimes says "GPU fallen off the bus".
For 4.2 CUDA Driver, 4.2 Runtime, the codes (nbody as well as my code) crash with error: "CUDA Runtime API error 39: uncorrectable ECC error encountered."
For 5.0 CUDA Driver, 5.0 Runtime, it just hangs and sometimes says "GPU fallen off the bus".
This is a 64-bit Linux machine, which we have recently assembled with an NVIDIA K10 GPU card. I am using gfortran44 and gcc44.
Please let me know if any other info is required to track the problem.
Thanks in advance for the help!
M
I'm mostly just creating an answer so we can call this question closed, but I'll try to add a few details.
Tesla GPUs come in 2 distinct categories: those with a fan, and those without. Those with a fan carry (at this time) the "C" designation, although the K20 product family naming will be slightly different:
These are not exhaustive lists:
Tesla GPUs with a Fan: C870, C1060, C2050, C2070, C2075, K20c ("C Class")
Tesla GPUs without a Fan: M1060, M2050, M2070, M2075, M2090, K10, K20, K20X ("M class")
(note that there is currently no K10 type product with a fan or "C" designation)
Tesla GPUs with a fan are designed to be plugged into a wide variety of PC boxes and chassis, including various workstation and server variants. Since they have their own fan, they require a supply of inlet air that is below a certain temperature level, but given that, they will keep themselves cool. As the workload increases, and the generated heat increases, they will spin up their own fan to keep themselves cool. The main ways you can screw up this process are by either restricting the inlet air flow or by putting it in an ambient air environment that is hotter than its max inlet spec.
Tesla GPUs without a fan have something called a passive heatsink; they cannot keep themselves cool independently and take a passive role in the cooling process. They still have a temperature sensor, but it becomes the responsibility of the server BMC (baseboard management controller) to monitor this temperature sensor (this is done directly at the hardware/firmware level, independent of any OS or any activity being directed at the GPU), and to direct a level of airflow over the card that is sufficient to keep the card cool based on its indicated temperature. The BMC does this by ramping up whatever fans are designed into the server chassis that control airflow over the GPU. Normally there will be shrouding/ducting within the chassis to aid in this process. Server manufacturers integrating these cards have a variety of responsibilities and must follow various technical specifications from NVIDIA in order to make this work.
If you happen to get your hands on a Tesla GPU without a fan and just slap it in some random chassis, you're pretty much guaranteed to have the behavior as described in this question. For this reason, Tesla "M" series and "K" series GPUs are normally only sold to OEMs who have undergone the qualification process.
Since the average sysadmin/system assembler is not likely to devise a suitable closed-loop fan control system and normally does not have easy access to the necessary specifications defining the temperature sensor and access method, the only kludgey workaround, if you have one of these that you simply must play with, is to direct a high level of continuous airflow over the card, in whatever setting you put it. Be advised that this will most likely be noisy. If you don't have a noisy level of airflow, you probably do not have enough airflow to keep a card cool that is in a high-workload situation. In addition, you should probably keep an eye on GPU temps. Note that the nvidia-smi method for monitoring GPU temps does not work for all M class GPUs (i.e. GPUs without a fan). Unfortunately, the method of temperature sensor access in Fermi and prior generations for the M class GPUs (different from the C class GPUs) was such that it could not be readily monitored in-system via the nvidia-smi command, so in these cases you will get no temperature reading from nvidia-smi, making this approach even harder to manage. Things changed with the Kepler generation, so now the temperature can be monitored both by the nvidia-smi method and by the server BMC at the hardware/firmware level.
C class products with a fan have a temperature that can be monitored with nvidia-smi, regardless of generation. But this is normally not necessary since the card has its own control system to keep itself cool.
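If you want to watch temperatures programmatically rather than parsing nvidia-smi output, a minimal NVML sketch follows (not from the original answer; it assumes the driver exposes the sensor on your board, which per the caveats above is not the case for Fermi-era M class parts). Link against -lnvidia-ml:

```
// Minimal sketch: poll the GPU core temperature via NVML.
#include <stdio.h>
#include <unistd.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    unsigned int tempC;

    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);

    for (int i = 0; i < 10; ++i) {            /* sample for ~10 seconds */
        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &tempC);
        printf("GPU 0 core temperature: %u C\n", tempC);
        sleep(1);
    }

    nvmlShutdown();
    return 0;
}
```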
As mentioned in the comments, all GPUs also have a variety of protection mechanisms, none of which are guaranteed to prevent damage. (If you throw the card in a fire, there's nothing to be done about that.) But the first typical mechanism is thermal throttling. At some predefined high temperature near the maximum safe operating range of the GPU, the GPU firmware will independently reduce its clocks to attempt to prevent further temperature rise. (If the card is clocked slower, then generally its ability to generate heat is also somewhat reduced.) This is a crude mechanism, and when this thermal throttling occurs, something in the cooling arena is already wrong. The card is designed never to enter thermal throttling under normal operating conditions. If temperatures continue to rise (and there is not much headroom at this point), the card will enter its final protection mode, which is to halt itself. At this point the GPU has become unresponsive to the system, and at the OS level, messages like "GPU has fallen off the bus" are typical. This means cooling has failed and the protection mechanisms have failed.

How to choose device when running a CUDA executable?

I'm connecting to a GPU cluster from the outside and I have no idea how to select the device on which to run my CUDA programs.
I know there are two Tesla GPUs in the cluster, and I'd like to choose one of them.
Any ideas how? How do you choose the device you want to use when there are many connected to your computer?
The canonical way to select a device in the runtime API is using cudaSetDevice. That will configure the runtime to perform lazy context establishment on the nominated device. Prior to CUDA 4.0, this call didn't actually establish a context; it just told the runtime which GPU to try and use. Since CUDA 4.0, this call will establish a context on the nominated GPU at the time of calling. There is also cudaChooseDevice, which will select amongst available devices to find one which matches criteria supplied by the caller.
You can enumerate the available GPUs on a system with cudaGetDeviceCount, and retrieve their particulars using cudaGetDeviceProperties. The SDK deviceQuery example shows full details of how to do this.
You may need to be careful, however, about how you select GPUs in a multi-GPU system, depending on the host and driver configuration. In both the Linux and the Windows TCC driver, there exists the option for GPUs to be marked "compute exclusive", meaning that the driver will limit each GPU to one active context at a time, or "compute prohibited", meaning that no CUDA program can establish a context on that device. If your code attempts to establish a context on a compute prohibited device, or on a compute exclusive device which is in use, the result will be an invalid device error. In a multiple GPU system where the policy is to use compute exclusivity, the correct approach is not to try and select a particular GPU, but simply to allow lazy context establishment to happen implicitly. The driver will automagically select a free GPU for your code to run on. The compute mode status of any device can be checked by reading the cudaDeviceProp.computeMode field using the cudaGetDeviceProperties call. Note that you are free to check unavailable or prohibited GPUs and query their properties, but any operation which would require context establishment will fail.
See the runtime API documentation for all of these calls.
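To illustrate the compute-mode check, here is a minimal sketch (not from the original answer; error handling omitted) that reports each device's compute mode and skips compute-prohibited devices:

```
// Minimal sketch: enumerate CUDA devices and report their compute mode,
// skipping devices marked compute-prohibited.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);

        const char *mode =
            prop.computeMode == cudaComputeModeDefault    ? "Default" :
            prop.computeMode == cudaComputeModeExclusive  ? "Exclusive" :
            prop.computeMode == cudaComputeModeProhibited ? "Prohibited" :
                                                            "Exclusive Process";
        printf("Device %d (%s): compute mode %s\n", d, prop.name, mode);

        if (prop.computeMode == cudaComputeModeProhibited)
            continue;   // no context can be created on this device

        // In a compute-exclusive setup you would normally NOT call
        // cudaSetDevice here, but let the runtime pick a free device.
    }
    return 0;
}
```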
You can set the environment variable CUDA_VISIBLE_DEVICES to a comma-separated list of device IDs to make only those devices visible to an application. Use this either to mask out devices or to change the visibility order of devices so that the CUDA runtime enumerates them in a specific order.
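For example (./my_app is a hypothetical executable; device numbers follow the order reported by the runtime):

```
# Only GPU 1 is visible to the application (it appears as device 0 inside it)
CUDA_VISIBLE_DEVICES=1 ./my_app

# Both GPUs visible, but enumerated in reverse order
CUDA_VISIBLE_DEVICES=1,0 ./my_app
```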