I'm trying to implement a simple example that computes two theano.tensor.dot products on two different GPUs, where the two dots share the same A but use different Bs:
theano.tensor.dot(A, B0); theano.tensor.dot(A, B1)
I want to store B0 and B1 on different GPUs. A is originally stored on one GPU, and I make a copy of it on the other GPU with an explicit transfer function. Finally, I compute the two dot products separately, one on each GPU.
My implementation is below:
import numpy
import theano

# A lives on dev0; copy it to dev1 with an explicit transfer
va0 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                    target='dev0')
va1 = va0.transfer('dev1')
# B0 on dev0, B1 on dev1
vb0 = theano.shared(numpy.random.random((1024, 512)).astype('float32'),
                    target='dev0')
vb1 = theano.shared(numpy.random.random((1024, 2048)).astype('float32'),
                    target='dev1')
vc0 = theano.tensor.dot(va0, vb0)  # intended to run on dev0
vc1 = theano.tensor.dot(va1, vb1)  # intended to run on dev1
f = theano.function([], [vc1, vc0])
print f()
Looking at the nvprof results, I found that the two dots still run on the same GPU, and that va0.transfer('dev1') doesn't take effect: vb1 was actually copied into dev0 instead, and both dots were computed on dev0.
I tried several combinations of Theano flags, but none of them worked. Can anyone help?
(nvprof screenshot: the two dots run on the same GPU)
The Theano flag setting below solves the issue:
export THEANO_FLAGS="contexts=dev0->cuda0;dev1->cuda1,optimizer_verbose=True,optimizer_excluding=local_cut_gpua_host_gpua"
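Equivalently, the same settings can live in Theano's config file instead of the shell environment. This is a sketch, assuming the default ~/.theanorc location and the usual mapping between THEANO_FLAGS entries and config options:

[global]
contexts = dev0->cuda0;dev1->cuda1
optimizer_excluding = local_cut_gpua_host_gpua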
optimizer_verbose prints the graph optimizations applied while compiling the theano function. I noticed one line like the following:
local_cut_gpu_transfers HostFromGpu(gpuarray).0 HostFromGpu(gpuarray).0
where local_cut_gpu_transfers is the reason (the optimization responsible), the first HostFromGpu(gpuarray).0 is the original node, and the last segment is what the original node was replaced with.
Then I searched for the keyword "local_cut_gpu_transfers" in the Theano source code until I found:
optdb['canonicalize'].register('local_cut_gpua_host_gpua',
                               local_cut_gpu_transfers,
                               'fast_compile', 'fast_run', 'gpuarray')
Then I added 'local_cut_gpua_host_gpua' to optimizer_excluding in the Theano flags, which tells the optimizer not to remove the explicit dev0 -> dev1 transfer.
I hope Theano will eventually provide a detailed mapping from each optimization reason to the Theano flag that controls that optimizer.
I have an Nvidia GPU, downloaded CUDA, and am trying to make use of it.
Say I have this code:
from numba import njit

# @cuda.jit                       (Attempted fix #1)
# @cuda.jit(device=True)          (Attempted fix #2)
# @cuda.jit(int32(int32, int32))  (Attempted fix #3)
@njit
def product(rho, theta):
    x = rho * theta
    return x

a = product(1, 2)
print(a)
How do I make it work with the @cuda.jit decorator instead of @njit?
Things I've tried:
When I switch the decorator from @njit to @cuda.jit, I get: TypingError: No conversion from int64 to none for '$0.5', defined at None.
When I switch the decorator to @cuda.jit(device=True), I get: TypeError: 'DeviceFunctionTemplate' object is not callable.
And when I specify the types for my inputs and outputs and use the decorator @cuda.jit(int32(int32,int32)), I get: TypeError: CUDA kernel must have void return type.
Numba CUDA kernels don't return anything; you must return results via parameters/arguments to the function. The starting point for this is usually some kind of numpy array. Here is an example:
$ cat t44.py
from numba import cuda
import numpy as np

@cuda.jit
def product(rho, theta, x):
    # the kernel writes its result into the output array x
    x[0] = rho * theta

x = np.ones(1, dtype=np.float32)
product(1, 2, x)
print(x)
$ python t44.py
[ 2.]
$
There are potentially many other things that could be said; you may wish to avail yourself of the documentation linked above, or e.g. this tutorial. Usually you will want to handle problems much bigger than multiplying two scalars before GPU computation becomes interesting.
Also, numba provides other methods to access GPU computation that don't depend on use of the @cuda.jit decorator. These methods, such as @vectorize, are documented.
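For example, here is a sketch of the @vectorize route; the signature and target below are illustrative choices, not taken from the question:

from numba import vectorize
import numpy as np

# element-wise ufunc compiled for the GPU; numba handles transfers and the launch
@vectorize(['float32(float32, float32)'], target='cuda')
def product(rho, theta):
    return rho * theta

a = np.arange(4, dtype=np.float32)
b = np.full(4, 2.0, dtype=np.float32)
print(product(a, b))  # [0. 2. 4. 6.]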
I'm also omitting any kernel launch configuration syntax on the call to product in t44.py above. This is legal in numba cuda, but it results in launching a kernel with 1 block that contains 1 thread. That works for this particular example, but it is basically a nonsensical way to use a CUDA GPU.
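If you do want an explicit launch configuration, here is a minimal sketch; the grid/block sizing and the cuda.grid indexing are my additions, not part of the original example:

from numba import cuda
import numpy as np

@cuda.jit
def product(rho, theta, x):
    # one thread per output element, with a guard against excess threads
    i = cuda.grid(1)
    if i < x.size:
        x[i] = rho * theta

x = np.ones(4, dtype=np.float32)
threads_per_block = 32
blocks = (x.size + threads_per_block - 1) // threads_per_block
product[blocks, threads_per_block](1, 2, x)  # kernel[grid, block](...) launch syntax
print(x)  # [2. 2. 2. 2.]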
Basically, as this thread discusses, you cannot use a Python list to wrap your sub-modules (for example, your layers); otherwise PyTorch will not update the parameters of the sub-modules inside the list. Instead, you should use nn.ModuleList to wrap your sub-modules so that their parameters are updated. But I have also seen code like the following, where the author uses a Python list to collect the losses and then calls loss.backward() to do the update (in the REINFORCE algorithm of RL). Here is the code:
policy_loss = []
for log_prob in self.controller.log_probability_slected_action_list:
    policy_loss.append(-log_prob * (average_reward - b))

self.optimizer.zero_grad()
final_policy_loss = torch.cat(policy_loss).sum() * gamma
final_policy_loss.backward()
self.optimizer.step()
Why does using a list in this way work for updating the parameters of the modules, while the first case does not? I am very confused now. If I change policy_loss = [] to policy_loss = nn.ModuleList([]) in the code above, it throws an exception saying that a float tensor is not a sub-module.
You are misunderstanding what Modules are. A Module stores parameters and defines an implementation of the forward pass.
You're allowed to perform arbitrary computation with tensors and parameters, resulting in other new tensors; Modules need not be aware of those tensors. You're also allowed to store lists of tensors in Python lists. When calling backward, it needs to be called on a scalar tensor, hence the sum of the concatenation. These tensors are losses, not parameters, so they should not be attributes of a Module nor wrapped in a ModuleList.
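A minimal sketch of the registration difference (the module and layer sizes here are arbitrary, just for illustration):

import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # registered: these parameters appear in net.parameters() and get optimizer updates
        self.registered = nn.ModuleList([nn.Linear(4, 4) for _ in range(2)])
        # NOT registered: a plain Python list hides these sub-modules from PyTorch
        self.hidden = [nn.Linear(4, 4) for _ in range(2)]

net = Net()
# counts only the ModuleList parameters: 2 * (4*4 weights + 4 biases) = 40
print(sum(p.numel() for p in net.parameters()))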
In my CUDA code I am using the cusparse<t>gtsv() function (more precisely, the cusparseZgtsv and cusparseZgtsvStridedBatch variants).
The documentation says that this function solves the equation A*x = alpha*B. My question is: what is alpha? I didn't find it among the input parameters, and I have no idea how to specify it. Is it always equal to 1?
I performed some testing (I solved some random systems of equations whose tridiagonal matrices were always diagonally dominant and checked my solutions by direct matrix-by-vector multiplication).
It looks like in the current version alpha = 1 always, so one can just ignore it. I suspect it will be added as an input parameter in a future release.
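The verification itself is easy to reproduce on the host. Here is a NumPy sketch of the kind of check described (in the actual test, x came from the cuSPARSE solver on the device, not from a host solve):

import numpy as np

rng = np.random.default_rng(0)
n = 8
lower = rng.random(n - 1)
upper = rng.random(n - 1)
main = 4.0 + rng.random(n)  # dominant diagonal: |main| > |lower| + |upper| in each row

# assemble the dense tridiagonal matrix and a right-hand side
A = np.diag(main) + np.diag(lower, -1) + np.diag(upper, 1)
b = rng.random(n)

x = np.linalg.solve(A, b)  # stand-in for the gtsv solve
# if the solver actually solved A*x = alpha*b with alpha != 1, this check would fail
print(np.allclose(A @ x, b))  # True, consistent with alpha == 1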
The picture below shows a Cauer network, which is a continued fraction network.
I have built the 3rd-order transfer function in Octave like this:
function uebertragung = G(R1, Tau1, R2, Tau2, R3, Tau3)
  s = tf("s");
  C1 = Tau1/R1;
  C2 = Tau2/R2;
  C3 = Tau3/R3;
  # --- 3rd-order transfer function --- #
  uebertragung = 1/((s*R1*C1)^3 + 5*(s*R2*C2)^2 + 6*s*R3*C3 + 1);
endfunction
R1, R2, R3, C1, C2, C3 are the 6 parameters my characteristic curve depends on.
I need to put these parameters into the transfer function, get a result, and plot the characteristic curve from the data.
The characteristic curve shows thermal impedance vs. time, like these two curves from an IGBT data sheet.
My problem is that I don't know how to handle transfer functions properly. I need data to plot the characteristic curve, but I don't know how to generate it from the transfer function.
Any tips are welcome. Do I have to perform a Laplace transformation?
If you need further information, ask me and I will try to provide it.
From the data sheet, the equation they are using for their transient thermal impedance graph is the Foster chain step function response:
Z(t) = sum (R_i * (1-exp(-t/tau_i))) = sum (R_i * (1-exp(-t/(R_i*C_i))))
I verified that the stage R's and C's in the table next to the graph reproduce the plot you shared when plugged into that function.
The method for producing a step function response of an s-domain (Laplace domain) impedance function (Z) is to take the inverse Laplace transform of the product of the transfer function and 1/s (the Laplace domain form of a constant value step function). With the Foster model impedance function:
Z(s) = sum (R_i/(1+R_i*C_i*s))
that will produce the equation above.
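For a single Foster stage, the partial-fraction step makes this explicit (my worked derivation, not part of the original answer):

\mathcal{L}^{-1}\left\{\frac{1}{s}\cdot\frac{R_i}{1 + R_i C_i s}\right\}
= \mathcal{L}^{-1}\left\{\frac{R_i}{s} - \frac{R_i^2 C_i}{1 + R_i C_i s}\right\}
= R_i\left(1 - e^{-t/(R_i C_i)}\right)

Summing over the stages gives the Z(t) expression above.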
Using the transfer function in Octave, you can use the Control package function step to calculate the transient response for you rather than performing the inverse Laplace transform yourself. So once you have Z(s), step(Z) will produce or plot the transient response. See help step for details. You can then adjust the plot (switch to log scale, set axes limits, etc) to look like one of the spec sheet plots.
Now, you want to do the same thing with a Cauer network model. It is important to realize that the R's and C's will not be the same for the two models. The Foster network is a decoupled model that has each primary complex pole isolated by layout, but its R's and C's are actually convolutions of the physical thermal resistances and capacitances in the real package. In contrast, the Cauer model has R's and C's that match the physical package layers, and the poles in its s-domain transfer function are complex products of the multiple layers.
So, however you are obtaining your R's and C's for the Cauer model, you can't just use the same values they have in their Foster model parameter table. They can be calculated from physical layer and material properties, however, assuming you have that information. Once you do have useful values, the procedure for going from Z(s) to the transient impedance function is the same for either network, and they should produce the same result.
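For reference, the Cauer ladder impedance has the standard continued-fraction form (assuming the usual thermal-network orientation, with the junction-side capacitance first):

Z(s) = \cfrac{1}{C_1 s + \cfrac{1}{R_1 + \cfrac{1}{C_2 s + \cfrac{1}{R_2 + \cdots}}}}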
As an example, the following procedure should work in both Octave and Matlab to plot the Thermal impedance curve from the spec sheet data using the Foster Z(s) model as a starting point. For the Cauer model, just use a different Z(s) function.
(Note that Octave has some issues in the step function that insert t = 0 entries into the time-series output even when they aren't specified, which can cause errors when plotting on a log scale. This example therefore includes a t = 0 node and then ignores it when plotting; I wanted to explain that so the line doesn't seem confusing.)
s = tf('s')
R1 = 8.5e-3; R2 = 2e-3;
tau1 = 151e-3; tau2 = 5.84e-3;
C1 = tau1/R1; C2 = tau2/R2;
input_imped = R1/(1+R1*C1*s) + R2/(1+R2*C2*s)
times = linspace(0, 10, 100000);
[Zvals, output_times] = step(input_imped, times);
loglog(output_times(2:end), Zvals(2:end));  # skip the t = 0 entry for the log-scale plot
xlim([.001 10]); ylim([0.0001, .1]);
grid;
xlabel('t [s]');
ylabel('Z_t_h_(_j_-_c_) [K/W] IGBT');
text(1, 0.013, 'Z_t_h_(_j_-_c_) IGBT');
I started learning Octave recently. How do I generate a matrix from another matrix by applying a function to each element?
For example: apply 2x+1, or 2x/(x^2+1), or 1/x + 3 to a 3x5 matrix A.
The result should be a 3x5 matrix whose values are now 2x+1.
If A(1,1) = 1, then after the operation, with output matrix B,
B(1,1) = 2*1 + 1 = 3
My main concern is how to apply a function that uses the value of x, like taking the reciprocal or the expressions indicated above.
Regards.
You can try
B = A.*2 + 1
The dot in .* makes the following operation * apply element-wise to the matrix (for multiplication by a scalar, plain A*2 + 1 gives the same result).
You will find a lot of documentation for Octave in the distribution package and on the Web. Even better, you can usually also use the extensive Matlab documentation.
ADDED: For more complex operations you can use arrayfun(), e.g.
B = arrayfun(@(x) 2*x/(x^2+1), A)