max of octave sparse matrix failed - octave

I ran into a problem when using an Octave sparse matrix:
max(speye(65536)(:))
will result in a 0x0 variable.
However, speye(65535) and speye(65537) work. How does that happen? My Octave version is 3.2.4 on Fedora 14.
max(max(speye(65536))) gives the same result.

That got a quick response: http://savannah.gnu.org/bugs/index.php?40287
A fix is available; you can either wait for an update or compile a patched version of Octave yourself.

Related

How to check if Caffe is using my GPU?

Same as this question, but for Caffe. I want a command I can put in my Python script to check whether the GPU is being utilized.
I checked nvidia-smi while my model was running and I see that python is recognized as a process, but Usage is N/A.
I also tried running caffe.set_mode_cpu(), thinking that the times would be very different, but the times with and without the command were the same.
I would like to suggest using gpustat. You can query the GPU and check whether your process is in the list returned by the command.
It is simple, not particularly elegant, but it works.
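Here is a minimal sketch of that check in Python, using nvidia-smi's machine-readable query output rather than parsing the human-readable listing (the helper function name is mine; run it from the process whose GPU usage you want to verify):
import os
import subprocess

def my_process_uses_gpu():
    # Ask the driver for the PIDs of all compute processes currently on the GPU.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-compute-apps=pid", "--format=csv,noheader"])
    pids = {int(p) for p in out.decode().split() if p.strip().isdigit()}
    return os.getpid() in pids

print("GPU in use by this process:", my_process_uses_gpu())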

Theano: cublasSgemm failed (14) an internal operation failed

Sometimes, after a while of running fine, I get such an error with Theano / CUDA:
RuntimeError: cublasSgemm failed (14) an internal operation failed
unit=0 N=0, c.dims=[512 2048], a.dim=[512 493], alpha=%f, beta=%f, a=%p, b=%p, c=%p sa_0=%d, sa_1=%d, sb_0=%d, sb_1=%d, sc_0=%d, sc_1=%d
Apply node that caused the error: GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
Inputs types: [CudaNdarrayType(float32, matrix), CudaNdarrayType(float32, matrix)]
Inputs shapes: [(512, 493), (493, 2048)]
Inputs strides: [(493, 1), (2048, 1)]
Inputs values: ['not shown', 'not shown']
As my code runs fine for a while (I do neural network training, and it runs all the way through most of the time; even when this error occurred, it had already run fine for >2000 mini-batches), I wonder about the cause of this. Maybe some hardware fault?
This is with CUDA 6.0 and a very recent Theano (yesterday from Git), Ubuntu 12.04, GTX 580.
I also got the error with CUDA 6.5 on a K20:
RuntimeError: cublasSgemm failed (14) an internal operation failed
unit=0 N=0, c.dims=[2899 2000], a.dim=[2899 493], alpha=%f, beta=%f, a=%p, b=%p, c=%p sa_0=%d, sa_1=%d, sb_0=%d, sb_1=%d, sc_0=%d, sc_1=%d
Apply node that caused the error: GpuDot22(GpuReshape{2}.0, GpuReshape{2}.0)
Inputs types: [CudaNdarrayType(float32, matrix), CudaNdarrayType(float32, matrix)]
Inputs shapes: [(2899, 493), (493, 2000)]
Inputs strides: [(493, 1), (2000, 1)]
Inputs values: ['not shown', 'not shown']
(Another error that I sometimes got in the past now appears as this one instead. Not sure if this is related.)
Via Markus, who got the same error:
RuntimeError: cublasSgemm failed (14) an internal operation failed
unit=0 N=0, c.dims=[2 100], a.dim=[2 9919], alpha=%f, beta=%f, a=%p, b=%p, c=%p sa_0=%d, sa_1=%d, sb_0=%d, sb_1=%d, sc_0=%d, sc_1=%d
Apply node that caused the error: GpuDot22(GpuFlatten{2}.0, weight_hidden_)
Inputs types: [CudaNdarrayType(float32, matrix), CudaNdarrayType(float32, matrix)]
Inputs shapes: [(2, 9919), (9919, 100)]
Inputs strides: [(9919, 1), (100, 1)]
Inputs values: ['not shown', 'not shown']
With CUDA 6.5, Windows 8.1, Python 2.7, GTX 970M.
The error only occurs in my own network; if I run the LeNet example from Theano, it runs fine. The network also compiles and runs fine on the CPU (and on the GPU for some colleagues using Linux). Does anyone have an idea what the problem could be?
Just for reference in case anyone stumbles upon this:
This doesn't occur anymore for me. I'm not exactly sure what fixed it, but I think the main difference is that I avoid any multithreading and forks (without exec). This caused many similar problems, e.g. Theano CUDA error: an illegal memory access was encountered (StackOverflow), and Theano CUDA error: an illegal memory access was encountered (Google Groups discussion). Especially that discussion on Google Groups is very helpful.
Theano functions are not multithreading safe. That is not directly a problem for me because I only use them in one thread, but I still think that other threads might cause these problems. Maybe it is related to Python's GC, which frees some CudaNdarray in some other thread while the theano.function is running. I looked a bit at the relevant Theano code and I'm not sure it covers all such cases.
Note that you might not even be aware that you have some background threads; some Python stdlib code can spawn them, e.g. multiprocessing.Queue will do that.
I cannot avoid having multiple threads, so until this is fixed in Theano, I create a new subprocess with a single thread where I do all the Theano work. This also has several advantages, such as: a clearer separation of the code, being faster in some cases because everything really runs in parallel, and being able to use multiple GPUs.
Note that just using the multiprocessing module did not work that well for me, because a few libs (NumPy and others, and maybe Theano itself) can behave badly in a forked process (depending on the versions, the OS and race conditions). Thus, I needed a real subprocess (fork + exec, not just fork).
My code is here, in case anyone is interested in this. There is ExecingProcess, which is modeled after multiprocessing.Process but does a fork+exec. (By the way, on Windows the multiprocessing module will do this anyway, because there is no fork on Windows.) And there is AsyncTask, which adds a duplex pipe to this and works with both ExecingProcess and the standard multiprocessing.Process.
See also: Theano Wiki: Using multiple GPUs
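For reference, a minimal sketch of the same pattern using only the standard library: Python 3's "spawn" start method gives fork+exec semantics (a fresh interpreter with no inherited threads), and a duplex pipe carries the work back and forth. The worker function and the toy task below are placeholders, not my actual code.
import multiprocessing as mp

def gpu_worker(conn):
    # All Theano/CUDA imports would happen inside this child process, which was
    # started via "spawn" (fork+exec), so it has no inherited threads or state.
    for task in iter(conn.recv, None):   # None is the shutdown sentinel
        conn.send(task * 2)              # placeholder for the real GPU work

if __name__ == "__main__":
    ctx = mp.get_context("spawn")        # fresh interpreter, not a bare fork
    parent_conn, child_conn = ctx.Pipe(duplex=True)
    proc = ctx.Process(target=gpu_worker, args=(child_conn,))
    proc.start()
    parent_conn.send(21)
    print(parent_conn.recv())            # prints 42
    parent_conn.send(None)               # tell the worker to exit
    proc.join()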
Ran into a similar issue, and fwiw, in my case it was solved by eliminating the import of another library that used pycuda. It appears theano really does not like to share.

How can I tell PyCUDA which GPU to use?

I have two NVIDIA cards in my machine, and both are CUDA capable. When I run the example script to get started with PyCUDA, seen here: http://documen.tician.de/pycuda/ , I get the error
nvcc fatal : Value 'sm_30' is not defined for option 'gpu-architecture'
My computing GPU is compute capability 3.0, so sm_30 should be the right option for the nvcc compiler. My graphics GPU is only CC 1.2, so I thought maybe that's the problem. I've installed the CUDA 5.0 release for Linux with no errors, along with all the compiler and Python components.
Is there a way to tell PyCUDA explicitly which GPU to use?
nvcc isn't going to complain based on the specific GPUs you have installed. It will compile for whatever GPU type you tell it to compile for. The problem is you are specifying sm_30 which is not a valid option for --gpu-architecture when a --gpu-code option is also specified.
You should be passing compute_30 for --gpu-architecture and sm_30 for --gpu-code.
Also be sure you have the correct nvcc in use and are not inadvertently using some old version of the CUDA toolkit.
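If the build is going through PyCUDA's compiler rather than a hand-written nvcc command line, the same pair of options can be passed explicitly. A sketch, assuming PyCUDA's SourceModule (the trivial kernel is only a placeholder):
import pycuda.autoinit                     # creates a context on the default device
from pycuda.compiler import SourceModule

# Passing the virtual architecture and the real code target explicitly mirrors
# --gpu-architecture=compute_30 --gpu-code=sm_30 on the nvcc command line.
mod = SourceModule("__global__ void noop() {}", arch="compute_30", code="sm_30")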
Once you have the compile problem sorted out, there is an environment variable CUDA_DEVICE that pycuda will observe to select a particular installed GPU.
From here:
CUDA_DEVICE=2 python my-script.py
By the way, someone else had your problem.
Are you sure you don't have an old version of the CUDA toolkit lying around that PyCUDA is using?
I don't know about the Python wrapper (or about Python in general), but in C++ you have the WGL_NV_gpu_affinity NVIDIA extension, which allows you to target a specific GPU. You could probably write a wrapper for it in Python.
EDIT:
Now that I see you are actually running Linux, the solution is simpler (C++): you just need to open the appropriate X display before context init.
Basically, the default GPU is usually targeted with the display string ":0.0".
To open a display on the second GPU you can do something like this:
#include <X11/Xlib.h>
#include <cstdio>

// ":0.1" selects screen 1 on display 0, i.e. the second GPU's screen
// (depending on your X configuration it may instead be a separate display such as ":1").
const char* gpuNum = ":0.1";
Display* _display = XOpenDisplay(gpuNum);
if (!_display) {
    printf("error: %s\n", "failed to open display");
} else {
    printf("message: %s\n", "display created");
}
// ...here comes the rest of the context setup...
At least currently, it seems possible to just say
import pycuda.driver as drv
drv.Device(6).make_context()
and this sets Device 6 as current context.
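A slightly fuller sketch of that approach, with initialization and cleanup (the device index here is illustrative):
import pycuda.driver as drv

drv.init()                       # must be called before querying devices
dev = drv.Device(1)              # pick whichever enumerated device you want
ctx = dev.make_context()         # makes this device's context current
try:
    print(dev.name(), dev.compute_capability())
    # ... do the actual PyCUDA work here ...
finally:
    ctx.pop()                    # release the context when done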

CURAND_STATUS_DOUBLE_PRECISION_REQUIRED is undefined

While building my CUDA project I get the following error:
cutil_inline_runtime.h(328): error: identifier "CURAND_STATUS_DOUBLE_PRECISION_REQUIRED" is undefined
So I started googling. Since I couldn't find a solution (nor the actual cause of the problem), I downloaded the NVIDIA CURAND guide PDF and started reading. In that PDF it says:
Enumerator:
...
CURAND_STATUS_DOUBLE_PRECISION_REQUIRED GPU does not have double precision required by MRG32k3a
...
This means I can't perform operations with the double data type... right? Well, that seems wrong because, I assure you, a couple of days ago I made a project using the double data type and it worked just fine. Does anyone have a suggestion? Thank you!
EDIT: Since I was unable to find the "real" solution to the problem, I eventually commented out the lines with "CURAND_STATUS_DOUBLE_PRECISION_REQUIRED" in cutil_inline_runtime.h and now it works. I know it is "ugly", but it was the only thing that helped...
I also had this problem. The issue was that I was using different versions of the CUDA SDK and the CUDA toolkit. I was using a newer version of the SDK, which referenced the definition of CURAND_STATUS_DOUBLE_PRECISION_REQUIRED, but that definition was missing from curand.h in the older version of the CUDA toolkit that I had installed.
Try installing the version of the toolkit that matches your cuda SDK version.
Try googling "compute capability"; it's how NVIDIA defines the various CUDA feature levels. The CUDA C Programming Guide mentions a couple of times that devices with compute capability 1.2 or lower do not support double-precision FP math: see page 140, and also check table F-1, namely the row about double-precision support.
The GeForce 8800 GT is a G80-architecture chip, which is compute capability 1.0. You cannot do double-precision computations on this card. Keep in mind that much example code makes use of emulation modes and the like, which can fool you into thinking it works.
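If you happen to have PyCUDA installed (it appears elsewhere on this page), a quick way to check what your cards actually support is the following sketch; the 1.3 threshold is the compute capability required for double precision:
import pycuda.driver as drv

drv.init()
for i in range(drv.Device.count()):
    dev = drv.Device(i)
    major, minor = dev.compute_capability()
    supports_double = (major, minor) >= (1, 3)   # double precision needs CC >= 1.3
    print("%s: compute capability %d.%d, doubles: %s"
          % (dev.name(), major, minor, supports_double))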

Reading mxArray in CUSP or in cuSPARSE

I am trying to read an mxArray from MATLAB into my custom-made .cu file.
I have two sparse matrices to operate on.
How do I read them into CUSP sparse matrices, say A and B (or into cuSPARSE matrices), so that I can perform operations on them and return the results to MATLAB?
One idea I could come up with is to write the mxArrays to a .mtx file and then read from it. But again, are there any alternatives?
Further, I am trying to understand the various CUSP mechanisms using the examples posted on its website. But every time I try to compile and run the examples, I get the following error:
terminate called after throwing an instance of
'thrust::system::detail::bad_alloc'
what(): N6thrust6system6detail9bad_allocE: CUDA driver version is
insufficient for CUDA runtime version
Abort
Here is what is installed on the machine I am using:
CUDA v4.2
Thrust v1.6
Cusp v0.3
I am using GTX 480 with Linux x86_64 on my machine.
Strangely enough, the device query code also returns this output:
CUDA Device Query...
There are 0 CUDA devices.
Press any key to exit...
I updated my drivers and SDK a few days ago.
Not sure what's wrong.
I know I am asking a lot in one question, but I have been facing this problem for quite a while, and upgrading and downgrading the drivers doesn't seem to solve it.
Cheers
This error is the most revealing: "CUDA driver version is insufficient for CUDA runtime version". You definitely need to update your driver.
I use CUSPARSE/CUSP through Jacket's Sparse Linear Algebra library. It's been good, but I wish there were more sparse features available in CUSPARSE/CUSP. I hear Jacket is going to get CULA Sparse into it soon, so that'll be nice.