What is the `ipa` pipeline in the CUDA architecture?

Looking through the output of ncu --query-metrics, I found that several counters refer to an ipa pipeline that isn't mentioned anywhere in the Nsight docs, smsp__inst_executed_pipe_ipa for example. All of the other pipelines come with a proper explanation, but for ipa I wasn't able to find any reference at all.

IPA is the Interpolate Attribute pipeline used in pixel/fragment shaders to interpolate a varying attribute over a quad (4 threads). This pipeline is not accessible to a compute shader.

Related

Does Caffe Scale up to multiple CPU Cores?

I wish to run Caffe on a 32 core machine.
Does Caffe scale up to the available number of cores and make the best use of them?
Although there are 32 cores, can I make Caffe use only a selected number of them?
Generally caffe doesn't support multiple CPUs/cores in its source code, but it uses BLAS routines.
Thus answers to your questions are the following:
Yes, but only through BLAS configuration, i.e. your BLAS version should be compiled with multithreading support (see related discussions: here or here; at the second link you can also find some modifications for Caffe itself).
Also through BLAS (if it was compiled with OpenMP support, you can set OMP_NUM_THREADS to the desired value).
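As a minimal sketch of the second point, assuming a Python workflow and a BLAS built with OpenMP: the thread cap has to be in the environment before the BLAS library is loaded, so it is set before Caffe is imported. The helper name limit_blas_threads is hypothetical, and OPENBLAS_NUM_THREADS and MKL_NUM_THREADS are set only in case the BLAS in use is OpenBLAS or MKL, which is an assumption about your build.

```python
import os


def limit_blas_threads(num_threads):
    """Cap the threads a multithreaded BLAS may use (hypothetical helper)."""
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    value = str(num_threads)
    # OMP_NUM_THREADS is read by OpenMP-based BLAS builds. The other two are
    # the OpenBLAS- and MKL-specific variables (assumed; depends on the build).
    for name in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
        os.environ[name] = value
    return value


# Use 8 of the 32 cores. This must run before the BLAS library is loaded.
limit_blas_threads(8)
# import caffe  # assumed to be installed; import only after setting the cap
```

Whether the cap is honored still depends on how the BLAS library was compiled, as noted above.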
Caffe itself does not, but you can use Intel Caffe, which is optimized for CPUs and supports multi-node setups:
https://github.com/intel/caffe/wiki/Multinode-guide

How does CUDA raytracing match raycasts against vertices stored in the graphics pipeline?

So, I think I understand the basic functionality of CUDA, and also how the graphics pipeline works. But what I don't understand is how CUDA raytracing engines combine the two. Since the vertices of a scene are stored in the graphics pipeline via DirectX or OpenGL, I can't see how this information can be accessed via the CUDA pipeline.
"I can't see how this information can be accessed via the CUDA pipeline."
First of all, you can write a ray-tracer that doesn't use the graphics pipeline at all, except for final display of the pixels (e.g. glDrawPixels). An example is here.
But in general, the sharing of data between OGL/DX and CUDA is facilitated by the interop APIs. These APIs allow for various kinds of data to be shared between CUDA and OpenGL, including final rendered pixels, geometry data, and textures. There are plenty of CUDA sample codes which demonstrate all of these types of data sharing, in both directions, including display of results.

Does a CUDA application's compute capability automatically upgrade?

If I compile a CUDA program for a lower compute capability, e.g. 1.3 (nvcc flag sm_13), and run it on a device with compute capability 2.1, will it exploit the features of compute capability 2.1 or not?
In that situation, will the compute 2.1 device behave like a compute 1.3 device?
No, it won't exploit any features you need to explicitly program for.
Only those features that are transparent to the user (like cache or larger register files) will be used.
Additionally, you need to make sure your object file contains a version of the code compiled to the PTX intermediate language, which can be dynamically compiled for the target architecture, or your program will not even run.
Compile to a virtual architecture (nvcc -arch compute_13) to ensure that, or create a fat binary with code for multiple architectures using the -gencode option to nvcc.
With a fat binary, you can program for features available only on higher compute capability if you wrap the code inside #if __CUDA_ARCH__ >= xyz preprocessor conditionals.

CUDA: 1-dimensional cubic spline interpolation in CUDA

I'm building medical imaging equipment, and I want to use CUDA to make it faster.
I receive 1D data of size 1024 from a CCD, 512 times.
Before I perform the IFFT, I have to apply a high-performance interpolation algorithm (like cubic spline interpolation) to each 1024-sample line (so 1D interpolation, 512 times).
1. Is there any CUDA library that performs cubic spline interpolation?
(I found one library, but it is for 2- or 3-dimensional images. Since I need to perform other complicated filtering functions, I need the data in global memory, not in texture memory.)
2. Is there any NUFFT (non-uniform fast Fourier transform) library (it doesn't need to be written for CUDA)?
I'm thinking that with a NUFFT function I wouldn't have to do the interpolation and the IFFT separately, which could make the equipment even faster.
Since more people have asked this, I have extended my CUDA cubic interpolation code with 1D cubic interpolation as well. You can find the updated code here: http://www.dannyruijters.nl/cubicinterpolation/
A working CUDA example that also contains 1D cubic interpolation can be found in the cudaAccuracyTest sample in the examples subdirectory in CI.zip.
For those of you who are more interested in an SSE approach, I have some working SSE-optimized multi-threaded cubic interpolation code (albeit in 3D, not 1D) in the referenceCubicTexture3D sample in the examples subdirectory.
edit: The cubic interpolation code is now available on github. The 1D cubic interpolation code is here.
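For readers who only want to see what a 1D cubic interpolation over uniformly spaced samples looks like in plain CPU code, here is a minimal reference sketch. It uses a Catmull-Rom cubic, which is not the scheme implemented in the library linked above, and the function names and the clamp-to-edge border handling are assumptions made for illustration. Each output sample depends on only four neighbouring inputs, so the same arithmetic would map naturally to one GPU thread per output sample.

```python
def catmull_rom_1d(samples, x):
    """Interpolate uniformly spaced `samples` at fractional index `x`."""
    n = len(samples)
    if n == 0:
        raise ValueError("samples must not be empty")
    # Clamp the query position to the valid index range [0, n - 1].
    x = min(max(float(x), 0.0), n - 1.0)
    # Index of the sample to the left of x (kept inside the last interval).
    i = min(int(x), n - 2) if n > 1 else 0
    t = x - i

    def at(k):
        # Clamp-to-edge: indices outside the array repeat the border sample.
        return samples[min(max(k, 0), n - 1)]

    p0, p1, p2, p3 = at(i - 1), at(i), at(i + 1), at(i + 2)
    # Catmull-Rom cubic through p1 (t = 0) and p2 (t = 1).
    return 0.5 * (
        2.0 * p1
        + (p2 - p0) * t
        + (2.0 * p0 - 5.0 * p1 + 4.0 * p2 - p3) * t * t
        + (3.0 * p1 - 3.0 * p2 + p3 - p0) * t * t * t
    )


def resample_line(samples, positions):
    """Evaluate one 1D line at every requested fractional position."""
    return [catmull_rom_1d(samples, x) for x in positions]


# One short line of samples; in the question there would be 512 lines of 1024.
line = [0.0, 1.0, 4.0, 9.0, 16.0]
print(resample_line(line, [0.0, 1.5, 2.0, 4.0]))
```

At integer positions the function returns the input samples unchanged, so it can be checked against the raw data before moving the arithmetic into a kernel.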
Regarding #1
Ruijters' bi/tricubic spline interpolation, which I think is what you refer to (http://dannyruijters.nl/cubicinterpolation), now works with 1D data (edited!), thank you! See Danny Ruijters' answer on this page.
Regarding #2
Here are a few NUFFT implementations that I'm aware of, with brief thoughts on each.
The first library mentioned by @ardiyu07, Greengard et al.'s implementation of fast Gaussian gridding, is in Fortran, which I don't know, so I didn't look at it for long (though it does offer type-III nonuniform-to-nonuniform transforms).
The second one is Ferrara's implementation of Greengard's algorithm in Matlab/MEX, and I couldn't get it to give me the correct solution (see my comment to that effect on Mathworks FileExchange, which I just posted).
Potts, et al., http://www-user.tu-chemnitz.de/~potts/nfft/: I couldn't get this to compile on Windows, so I gave up on it. It also has type-III NUFFTs.
Fessler, et al., http://web.eecs.umich.edu/~fessler/code/ written in Matlab/MEX and pre-compiled binaries provided for Linux and Windows at least. Definitely written by non-professional programmers, but it's the only one of the 4 that I've gotten to work correctly. I even got it to work in GNU Octave after changing their Matlab source code in a handful of places (basically by seeing where Octave errors were raised), since Octave can use pre-compiled MEX binaries. This also uses a different algorithm than Greengard's or Potts', based on min-max criteria (its solutions are guaranteed to minimize the maximum DFT error), but lacks type-III NUFFTs (only types-I and II: one of the domains has to be uniform).
I believe a fifth NUFFT/"gridding" implementation is by Hargreaves, et al.: http://www-mrsrl.stanford.edu/~brian/gridding/ (paper at http://dx.doi.org/10.1109/TMI.2005.848376). It is in Matlab/MEX. As is, it is not as general-purpose as the previous four on this list, as it's very much embedded in its MRI context.
And here's a sixth implementation, in Cython (fast Python), with type-III nonuniform-to-nonuniform transforms and some other nice features, alas under GPL: https://github.com/mrbell/gfft
I'm working, at a glacial pace, on porting Fessler's algorithm to Python/Cython, and maybe CUDA ("maybe" because just zero-padding the standard (CU)FFT and linear interpolation seems to work well enough). Best of luck.
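The zero-padding-plus-linear-interpolation shortcut mentioned above can be sketched as follows, for the case of uniform input samples and non-uniform output frequencies. This is only an illustration under stated assumptions: a naive O(N*M) DFT stands in for a real FFT (such as cuFFT), the function names and the oversampling factor are hypothetical, and it is neither Fessler's min-max algorithm nor Greengard's Gaussian gridding.

```python
import cmath
import math


def dft_zero_padded(signal, oversample):
    """Naive DFT of `signal` zero-padded to oversample * len(signal) points."""
    n = len(signal)
    m = oversample * n
    # The padded zeros contribute nothing, so the sum runs over n terms only.
    return [
        sum(signal[idx] * cmath.exp(-2j * math.pi * k * idx / m) for idx in range(n))
        for k in range(m)
    ]


def spectrum_at(grid, oversample, freq):
    """Linearly interpolate the oversampled spectrum at fractional bin `freq`.

    `freq` is measured in bins of the original (unpadded) DFT.
    """
    m = len(grid)
    pos = (freq * oversample) % m
    base = int(pos)
    t = pos - base
    k = base % m
    return (1.0 - t) * grid[k] + t * grid[(k + 1) % m]


def dtft_direct(signal, freq):
    """Exact spectrum value at fractional bin `freq`, for comparison only."""
    n = len(signal)
    return sum(
        signal[idx] * cmath.exp(-2j * math.pi * freq * idx / n) for idx in range(n)
    )


signal = [1.0, 2.0, 3.0, 4.0]
grid = dft_zero_padded(signal, 16)
approx = spectrum_at(grid, 16, 0.3)
exact = dtft_direct(signal, 0.3)
print(abs(approx - exact))  # interpolation error at one off-grid frequency
```

Because linear interpolation error shrinks with the square of the grid spacing, a larger oversampling factor trades memory and FFT time for accuracy; whether that is "good enough" depends on the application.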
I don't know about that algorithm, but if you think what you've found is fast enough for your equipment, then why don't you change the implementation from using texture memory to just a simple array? Maybe you can get more speedup by using shared memory.
I've found some NUFFT implementations written in Matlab and Fortran 77:
http://www.cims.nyu.edu/cmcl/nufft/nufft.html
http://www.mathworks.com/matlabcentral/fileexchange/25135-nufft-nufft-usffft
To be honest, your parallelism seems to be a bit low for the GPU. A 6-core CPU with SSE optimizations might outperform a GPU here.

Mathematica and CUDA

Is it possible that built in functions in Mathematica (like Minimize[expr,{x1,x2,...}]) will start to work via CUDA after installation of CUDA module for Mathematica?
I don't believe so, no. Mathematica's CUDALink module currently provides only a handful of GPU-accelerated functions: some basic image processing operations, BLAS-style linear algebra calls, Fourier transforms, and simple parallel reductions (argmin, argmax, and summation). There are also tools for integrating user-written CUDA code and for generating CUDA code symbolically. Outside of that, the rest of Mathematica's core functionality remains CPU-only.
You can see full details of current CUDA and OpenCL support here.