Hey there,
is there a way to compile (or rather, 'translate') a MATLAB m-function into a C function so that I can use it in the CUDA kernel of my MEX file?
thanks a lot!
MATLAB Coder will generate C code for mex files. I do not yet have a copy to evaluate, so I can't speak with any authority about the quality and nature of the generated code.
However, if I had to guess, I'd say the generated code would likely require a lot of massaging to get it working on your GPU. You may have better luck with a product like Jacket, depending on what you're doing.
You can call a MATLAB function (m or mex) from C or Fortran using the MEX API routine mexCallMATLAB. You could then interface that with your CUDA kernel.
However, it may not be the most efficient way to do things. You could write your own C code for the m-file that you have, or look it up on MATLAB Central in case someone else has already done it.
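For illustration, a minimal sketch of that approach (assuming a hypothetical m-function named myFcn on the MATLAB path; mexCallMATLAB is the documented MEX API routine, everything else here is illustrative):
#include "mex.h"

/* Evaluate y = myFcn(x) from inside a MEX file via mexCallMATLAB, then
   hand the result off to your CUDA set-up code. Error checking omitted. */
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    mxArray *result = NULL;
    mexCallMATLAB(1, &result, 1, (mxArray **)prhs, "myFcn");
    /* ... copy mxGetPr(result) to device memory, launch the kernel ... */
    plhs[0] = result;
}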
So the C function would eventually set up device variables and call a CUDA kernel?
I originally wanted to try this for a project because I thought this method would be easier than converting all of my MATLAB code to C first, but I ended up doing that anyway.
There are some user-created MATLAB scripts that help provide this functionality, but since they aren't from The MathWorks you use them at your own risk. I tried them and never found anything malicious, but you never know. I couldn't get them to work with my project due to its specific complications, but they should work for simpler tasks.
1) NvMEX: This is directly from Nvidia. http://developer.stage.nvidia.com/matlab-cuda (see also http://www.mathworks.com/discovery/matlab-gpu.html)
2) CUDA MEX: This is from a user. http://www.mathworks.com/matlabcentral/fileexchange/25314-cuda-mex
This is not really a direct answer to your question, but if your goal is simply to have your MATLAB code run on the GPU, then you may find that, if you have access to Parallel Computing Toolbox, you can use gpuArray data with arrayfun. For example, if the function you wish to evaluate across many points looks like this:
function y = myFcn( x )
y = 1;
for ii = 1:10
    y = sin(x * y);
end
Then you could call this on the GPU like so:
gx = gpuArray( rand(1000) );
gy = arrayfun( @myFcn, gx );
I am trying to integrate a vector valued (49 components) function f[] which as an example may look like:
f[0] = 1
f[1] = cos(x)
f[2] = cos(2x)
f[3] = cos(3x)
... and so on.
I was wondering if there is a way to integrate such a vector function in GSL using a single command. I can currently do this only by having n = 49 different cquad integration handles/procedures, which seems inefficient, as I wish to use the same integration "mesh" for all the function components.
Thank you for your attention.
As far as I know, such things cannot currently be done with GSL alone.
But your task can be parallelized with the help of OpenMP (with #pragma omp parallel for), although there may be a large overhead when the GSL DLL is used by several threads at the same time.
You may possibly (I am not certain of this) have to rebuild the GSL library itself with the OpenMP compiler flags, but GSL is quite thread-safe on its own. A sketch of the OpenMP approach follows.
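For illustration, a minimal sketch under assumptions (the integrand cos(k*x) comes from the question, but the interval [0, 1] and the tolerances are placeholders; each thread allocates its own cquad workspace, since workspaces must not be shared between threads):
/* compile with e.g. gcc -fopenmp integrate.c -lgsl -lgslcblas -lm */
#include <math.h>
#include <gsl/gsl_integration.h>

#define NCOMP 49

/* k-th component of the vector-valued integrand: 1, cos(x), cos(2x), ... */
static double component(double x, void *params)
{
    int k = *(int *)params;
    return (k == 0) ? 1.0 : cos(k * x);
}

int main(void)
{
    double result[NCOMP], abserr[NCOMP];

    #pragma omp parallel for
    for (int k = 0; k < NCOMP; ++k) {
        gsl_integration_cquad_workspace *ws =
            gsl_integration_cquad_workspace_alloc(100);
        int kk = k;                          /* thread-private component index */
        gsl_function f = { &component, &kk };
        gsl_integration_cquad(&f, 0.0, 1.0, 1e-10, 1e-8,
                              ws, &result[k], &abserr[k], NULL);
        gsl_integration_cquad_workspace_free(ws);
    }
    return 0;
}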
So, I'm wondering how to implement an STFT in Julia, possibly using a Hamming window. I can't find anything on the internet.
What is the best way to do it? I'd prefer not to use Python libraries, but pure native Julia if possible. Maybe it's a feature still being developed in Julia...?
Thanks!
I'm not aware of any native STFT implementation in Julia. As David stated in his comment, you will have to implement this yourself. This is fairly straightforward:
1) Break up your signal into short time sections
2) Apply your window of choice (in your case, Hamming)
3) Take the FFT using Julia's fft function.
All of the above are fairly standard mathematical operations and you will find a lot of references online. The below code generates a Hamming window if you don't have one already (watch out for Julia's 1-indexing when using online references as a lot of the signal processing reference material likes to use 0-indexing when describing window generation).
## Hamming Window Generation
Wb = Vector{Float64}(undef, N)            # N-point window
for n in 1:N
    Wb[n] = 0.54 - 0.46*cos(2pi*(n-1)/N)  # (n-1) compensates for Julia's 1-indexing
end
You could also use the Hamming window function from DSP.jl.
P.S. If you are running Julia on OS X, check out the Julia interface to Apple's Accelerate framework. This provides a very fast Hamming window implementation, as well as convolution and elementwise multiplication functions that might be helpful.
I have code written in old-style Fortran 95 for combustion modelling. One of the features of this problem is that one has to solve a stiff ODE system to take into account the influence of chemical reactions. For this purpose I use the Fortran SLATEC library, which is also quite old. The solution procedure is straightforward: one just needs to call the subroutine ddriv3 in every cell of the computational domain, so it looks something like this:
do i = 1, Number_of_cells   ! Number_of_cells is about 2000
    call ddriv3(...)        ! All calls are independent of cell number i
end do
ddriv3 is quite complex and utilizes many other library functions.
Is there any way to get an advantage from CUDA Fortran without hunting for another library for this purpose? If I just run this as a "parallel loop", will that be efficient, or is there perhaps a better way?
I'm sorry for asking the kind of question that immediately invites the obvious answer "why don't you just try it and find out?", but I'm under really tight time constraints. I have no experience with CUDA and I just want to choose the most suitable and easiest way to start.
Thanks in advance!
You won't be able to use or parallelize the ddriv3 call without some effort. Your usage of the phrase "parallel loop" suggests to me you may be thinking of using OpenACC directives with Fortran, as opposed to CUDA Fortran, but the general answer isn't any different in either case.
The ddriv3 call, being part of a Fortran library (which is presumably compiled for x86 usage), cannot be directly used in either CUDA Fortran (i.e. using CUDA GPU kernels within Fortran) or in OpenACC Fortran, for essentially the same reason: the library code is x86 code and cannot be used on the GPU.
Since presumably you have access to the source implementation of ddriv3, you might be able to extract the source code and work on creating a CUDA version of it (or a version that OpenACC won't choke on). But since it uses many other library routines, that may mean you have to create CUDA versions (or direct Fortran source, for OpenACC) of each of those library calls as well. If you have no experience with CUDA, this might not be what you want to do (I don't know). Going down this path would certainly imply learning more about CUDA, or at least converting the library calls to direct Fortran source (for an OpenACC version).
For the above reasons, it might make sense to investigate whether a GPU library replacement (or something similar) exists for the ddriv3 call (though you specifically excluded that option in your question). There are certainly GPU libraries that can assist in solving ODEs.
As the following error implies, calling a host function ('rand') is not allowed in a kernel, and I wonder whether there is a solution for it if I really do need to do that.
error: calling a host function("rand") from a __device__/__global__ function("xS_v1_cuda") is not allowed
Unfortunately you cannot call functions from device code unless they are marked with the __device__ qualifier. If you need random numbers in device code, look at the CUDA random generator cuRAND: http://developer.nvidia.com/curand
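For example, a minimal sketch of per-thread generation with the cuRAND device API (the kernel name and launch configuration are illustrative):
#include <curand_kernel.h>

__global__ void randKernel(float *out, unsigned long long seed)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    curandState state;
    curand_init(seed, id, 0, &state);   // one generator state per thread
    out[id] = curand_uniform(&state);   // uniform float in (0, 1]
}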
If you have your own host function that you want to call from a kernel use both the __host__ and __device__ modifiers on it:
__host__ __device__ int add( int a, int b )
{
return a + b;
}
When this file is compiled by the NVCC compiler driver, two versions of the function are compiled: one callable by host code and another callable by device code. That is why this function can now be called by both host and device code.
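A minimal usage sketch (the kernel and variable names are made up): the same add() can then be called from a kernel and from ordinary host code:
__global__ void addKernel(int *out)
{
    *out = add(threadIdx.x, 1);   // device-side call
}

int main()
{
    int h = add(2, 3);            // host-side call
    // ... cudaMalloc a d_out buffer, then: addKernel<<<1, 32>>>(d_out);
    return 0;
}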
The short answer is that there is no solution to that issue.
Everything that normally runs on a CPU must be tailored for the CUDA environment, with no guarantee that this is even possible. "Host function" is just the CUDA name for an ordinary C function: one that runs on a conventional CPU-plus-main-memory von Neumann architecture, as all C/C++ code on PCs has up to this point. GPUs give you tremendous amounts of computing power, but the cost is that they are not nearly as flexible or compatible. Most importantly, device functions run without the ability to access main memory, and the memory they can access is limited.
If what you are trying to get is a random number generator, you are in luck: Nvidia went to the trouble of specifically implementing a highly efficient Mersenne Twister that can support up to 256 threads per SMP. It is callable inside a device function and is described in an earlier post of mine here. If anyone finds a better link describing this functionality, please remove mine and replace the appropriate text here along with the link.
One thing I am continually surprised by is how many programmers seem unaware of how standardized high quality pseudo-random number generators are. "Rolling your own" is really not a good idea considering how much of an art pseudo-random numbers are. Verifying a generator as providing acceptably unpredictable numbers takes a lot of work and academic talent...
While this is not applicable to 'rand()', a few host functions like printf are available in device code when compiling for compute capability >= 2.0.
e.g. compiling with

nvcc.exe -gencode=arch=compute_10,code=\"sm_10,compute_10\" ...

gives

error : calling a host function("printf") from a __device__/__global__ function("myKernel") is not allowed

but it compiles and works with -gencode=arch=compute_20,code=\"sm_20,compute_20\".
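For illustration, a minimal kernel using device-side printf (the kernel name matches the error message above; the body is made up):
#include <cstdio>

__global__ void myKernel()
{
    // device-side printf requires compute capability >= 2.0
    printf("hello from thread %d\n", threadIdx.x);
}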
I have to disagree with some of the other answers in the following sense:
OP does not describe a problem: it is not unfortunate that you cannot call __host__ functions from device code - it is entirely impossible for it to be any other way, and that's not a bad thing.
To explain: think of the host (CPU) code as a CD which you put into a CD player, and of the device code as, say, an SD card which you put into a miniature music player. OP's question is "how can I shove a disc into my miniature music player?" You can't, and it makes no sense to want to. It might be essentially the same music (code with the same functionality; although usually, host code and device code don't perform quite the same computational task), but the media are not interchangeable.
So, I think I have a very weird question.
So, let's say that I already have a program on my GPU, and in that program I call a function X. But that function X is not defined yet.
I want to be able to modify that function X dynamically: completely change its code and put it into the program without recompiling the rest or losing any pointers whatsoever.
To compare it with something that most of us know: I want to be able to do what shaders do in OpenGL. In the middle of execution I can change the code of one shader, recompile only that shader, activate the program, and use the new version.
So, is it possible? Or do I need to recompile the whole thing every time? And if I have to recompile, do I lose the various arrays that I created in global memory?
Thanks
W
If you compile with the -cuda flag using nvcc, you can get the intermediate C++ source that streams PTX to the processor. In theory, you could post-process this intermediate output to dynamically generate PTX on the fly and send it over. You might even be able to make the PTX self-modifying, but that's way out of my league.
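As a rough illustration of the "send it over" step (not necessarily the author's exact workflow): PTX generated at run time can be JIT-loaded through the CUDA driver API. cuModuleLoadData, cuModuleGetFunction and cuLaunchKernel are real driver-API calls; the PTX string and kernel name are hypothetical, and error checking is omitted:
#include <cuda.h>   // CUDA driver API

// Assumes a CUDA context already exists (e.g. via cuInit/cuCtxCreate).
void launchFromPtx(const char *ptxSource)   // ptxSource: PTX built at run time
{
    CUmodule module;
    CUfunction kernel;
    cuModuleLoadData(&module, ptxSource);              // JIT-compile the PTX
    cuModuleGetFunction(&kernel, module, "myKernel");  // look up the entry point
    cuLaunchKernel(kernel, 1, 1, 1,       // grid dimensions
                   32, 1, 1,              // block dimensions
                   0, NULL, NULL, NULL);  // shared mem, stream, params, extra
    cuModuleUnload(module);
}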