CUFFT_INVALID_VALUE in cufftGetSize1d

What is the proper way to use cufftGetSize1d (or any of the cufftGetSize*) functions?
I tried with:
cufftHandle plan;
size_t workSize;
cufftResult result;
cufftCreate(&plan);
result = cufftGetSize1d(plan, 1000, CUFFT_C2C, 1, &workSize);
However, the result of the last call is always CUFFT_INVALID_VALUE, regardless of the size, type, or batch I use. The same happens with the 2d and 3d variants. cufftEstimate1d works correctly.

This appears to be a bug which was introduced during the CUDA 6 release cycle and subsequently fixed in CUDA 7. The following code:
#include <iostream>
#include <cufft.h>

int main()
{
    cufftHandle plan;
    size_t workSize;
    cufftResult result;

    cufftCreate(&plan);
    result = cufftGetSize1d(plan, 1000, CUFFT_C2C, 1, &workSize);
    std::cout << "result = " << result << std::endl;

    return 0;
}
fails with CUFFT_INVALID_VALUE when compiled and run against the CUFFT shipped in CUDA 6.5, but succeeds when built and run against the CUFFT version in CUDA 7.0. As noted in the comments, cufftGetSize appears to work correctly in CUDA 6.5. So the workaround is to use cufftGetSize, or to upgrade to a CUFFT version newer than CUDA 6.5's.
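A minimal sketch of the cufftGetSize workaround. Note that, per the cuFFT documentation, cufftGetSize reports the work area of a plan that has already been generated, so (unlike the estimate and GetSize1d queries) the plan must actually be made first:

#include <iostream>
#include <cufft.h>

int main()
{
    cufftHandle plan;
    size_t workSize = 0;

    // Create the handle and generate the plan; cufftGetSize() queries
    // the work area of an existing plan rather than a hypothetical one.
    cufftCreate(&plan);
    cufftMakePlan1d(plan, 1000, CUFFT_C2C, 1, &workSize);

    cufftResult result = cufftGetSize(plan, &workSize);
    std::cout << "result = " << result
              << ", workSize = " << workSize << std::endl;

    cufftDestroy(plan);
    return 0;
}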
[This community wiki entry was added mostly from comments to get this question off the unanswered question list]

Related

What does the nvprof output "No kernels were profiled" mean, and how do I fix it?

I have recently installed CUDA on my Arch Linux machine through the system's package manager, and I have been trying to test whether or not it is working by running a simple vector addition program.
I simply copy-pasted the code from this tutorial (both the single-kernel and the multi-kernel version) into a file titled cuda_test.cu and ran
> nvcc cuda_test.cu -o cuda_test
In either case, the program runs and I get no errors (the program doesn't crash, and its output reports that there were no errors). But when I try to run the CUDA profiler on the program:
> sudo nvprof ./cuda_test
I get result:
==3201== NVPROF is profiling process 3201, command: ./cuda_test
Max error: 0
==3201== Profiling application: ./cuda_test
==3201== Profiling result:
No kernels were profiled.
No API activities were profiled.
==3201== Warning: Some profiling data are not recorded. Make sure cudaProfilerStop() or cuProfilerStop() is called before application exit to flush profile data.
The latter warning is not my main problem or the topic of my question; my problem is the messages saying that no kernels and no API activities were profiled.
Does this mean that the program ran entirely on my CPU, or is it an error in nvprof?
I have found a discussion about the same error here, but there the answer was that the wrong version of CUDA was installed, whereas in my case the installed version is the latest one available through the system's package manager (version 10.1.243-1).
Is there any way I can get nvprof to display the expected output?
Edit
Trying to adhere to the warning at the end does not solve the problem:
Adding a call to cudaProfilerStop() (or cuProfilerStop()), also adding cudaDeviceReset(); at the end as suggested, including the appropriate header (cuda_profiler_api.h or cudaProfiler.h), and compiling with
> nvcc cuda_test.cu -o cuda_test -lcuda
yields a program which still runs, but which, when nvprof is run on it, returns:
==12558== NVPROF is profiling process 12558, command: ./cuda_test
Max error: 0
==12558== Profiling application: ./cuda_test
==12558== Profiling result:
No kernels were profiled.
No API activities were profiled.
==12558== Warning: Some profiling data are not recorded. Make sure cudaProfilerStop() or cuProfilerStop() is called before application exit to flush profile data.
======== Error: Application received signal 139
This has not solved the original problem, and has in fact created a new error; the same happens when cudaProfilerStop() is used on its own or alongside cuProfilerStop() and cudaDeviceReset().
The code
The code is, as mentioned, copied from a tutorial to test whether CUDA is working, though I have also included calls to cudaProfilerStop() and cudaDeviceReset(). For clarity, it is included here:
#include <iostream>
#include <math.h>
#include <cuda_profiler_api.h>

// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
    int index = threadIdx.x;
    int stride = blockDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

int main(void)
{
    int N = 1<<20;
    float *x, *y;

    cudaProfilerStart();

    // Allocate Unified Memory – accessible from CPU or GPU
    cudaMallocManaged(&x, N*sizeof(float));
    cudaMallocManaged(&y, N*sizeof(float));

    // initialize x and y arrays on the host
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    // Run kernel on 1M elements on the GPU
    add<<<1, 1>>>(N, x, y);

    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();

    // Check for errors (all values should be 3.0f)
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(y[i]-3.0f));
    std::cout << "Max error: " << maxError << std::endl;

    // Free memory
    cudaFree(x);
    cudaFree(y);

    cudaDeviceReset();
    cudaProfilerStop();

    return 0;
}
This problem is apparently somewhat well known; after some searching I found this thread about the error code in the edited version. The solution, as discussed there, is to call nvprof with the flag --unified-memory-profiling off:
> sudo nvprof --unified-memory-profiling off ./cuda_test
This makes nvprof work as expected, even without the call to cudaProfilerStop().
Alternatively, you can solve the problem by using
sudo nvprof --unified-memory-profiling per-process-device <your program>
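As for whether the kernel ran at all: independently of nvprof, you can check the runtime's error status around the launch. A minimal sketch (not part of the original tutorial) that could replace the bare add<<<1, 1>>>(N, x, y) launch in the code above:

// If either call below reports an error, the kernel did not
// actually execute on the GPU.
add<<<1, 1>>>(N, x, y);
cudaError_t launchErr = cudaGetLastError();      // launch-time errors
cudaError_t syncErr   = cudaDeviceSynchronize(); // execution-time errors
if (launchErr != cudaSuccess)
    std::cerr << "launch: " << cudaGetErrorString(launchErr) << std::endl;
if (syncErr != cudaSuccess)
    std::cerr << "run:    " << cudaGetErrorString(syncErr) << std::endl;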

Error while Compiling Eigen library v3.3.4 with VS2017 + nvcc (CUDA 9.0)

I tried to compile the following code, which uses Eigen and CUDA at the same time, and I get an error.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <iostream>
#include <Eigen/Dense>
#include <Eigen/IterativeLinearSolvers>
__global__ void printWithCUDA()
{
if (threadIdx.x == 0)
{
printf(" Printed with thread %d \n", threadIdx.x);
}
}
int main()
{
// Eigen Operation
Eigen::Matrix3d eigenA;
eigenA << 1, 2, 3,
4, 5, 6,
7, 8, 9;
Eigen::Matrix3d eigenB;
eigenB << -1, -2, -3,
-4, -5, -6,
-7, -8, -9;
Eigen::MatrixXd eigenC = eigenA * eigenB;
std::cout << " \n Eigen Matrix " << std::endl;
std::cout << eigenC;
// CUDA Operation
printWithCUDA <<< 1, 32 >>>();
if (cudaPeekAtLastError() != cudaSuccess)
{
fprintf(stderr, "addWithCuda failed!");
return 1;
}
return 0;
}
With VS 2017, Eigen v3.3.4 and CUDA 9.0, I get the following error:
eigen\src/Core/util/Macros.h(402): fatal error C1017: invalid integer constant expression
In my original project, the Eigen code is separated from the CUDA code in a .h file, but the error is the same.
PS: it works well
- if I comment out the Eigen part, or
- if I use Eigen in a pure C++ project with VS 2017 without nvcc.
Is this specific to VS2017 + CUDA 9.0 + Eigen v3.3.4? According to Compiling Eigen library with nvcc (CUDA): update2, it worked for other versions.
Thanks
Update1:
Thanks Avi Ginsburg, I have downloaded the latest version of the dev branch. With that version, I don't get this error anymore.
However, I have other errors that I don't understand. I have just replaced the latest stable release version with the one here: The unstable source code from the development branch.
The full error is available in the image here (Error_Compil), but it looks like this:
1>kernel.cu
1>g:\librray_quant\issues_lib\eigen_nvcc\eigen_nvcc\3rdparties\dev_branch\eigen\src/SVD/JacobiSVD.h(614): error C2244: 'Eigen::JacobiSVD::allocate': unable to match function definition to an existing declaration
1>g:\librray_quant\issues_lib\eigen_nvcc\eigen_nvcc\3rdparties\dev_branch\eigen\src/SVD/JacobiSVD.h(613): note: see declaration of 'Eigen::JacobiSVD::allocate'
1>g:\librray_quant\issues_lib\eigen_nvcc\eigen_nvcc\3rdparties\dev_branch\eigen\src/SVD/JacobiSVD.h(614): note: definition
1>g:\librray_quant\issues_lib\eigen_nvcc\eigen_nvcc\3rdparties\dev_branch\eigen\src/SVD/JacobiSVD.h(614): note: 'void Eigen::JacobiSVD::allocate(::Eigen::SVDBase>::Index,Eigen::SVDBase::Index,unsigned int)'
1>g:\librray_quant\issues_lib\eigen_nvcc\eigen_nvcc\3rdparties\dev_branch\eigen\src/SVD/JacobiSVD.h(614): note: existing declarations
1>g:\librray_quant\issues_lib\eigen_nvcc\eigen_nvcc\3rdparties\dev_branch\eigen\src/SVD/JacobiSVD.h(614): note: 'void Eigen::JacobiSVD::allocate(Eigen::SVDBase::Index,Eigen::SVDBase::Index,unsigned int)'
Apparently, __CUDACC_VER__ is no longer supported in CUDA 9.0, and therefore __CUDACC_VER__ >= 80000 is no longer a valid comparison. I'm not sure what it is defined to be (I assume #define __CUDACC_VER__ "" to cause this error), as I do not have CUDA installed on this computer. Try the dev branch of Eigen; they might have a fix for it. If not, the check should be for __CUDACC_VER_MAJOR__ and __CUDACC_VER_MINOR__ instead. You can submit a proposed fix if you get it working.
Update
The Eigen devs already fixed it in the dev branch (not sure when). They bypassed the issue with:
// Starting with CUDA 9 the composite __CUDACC_VER__ is not available.
#if defined(__CUDACC_VER_MAJOR__) && (__CUDACC_VER_MAJOR__ >= 9)
#define EIGEN_CUDACC_VER ((__CUDACC_VER_MAJOR__ * 10000) + (__CUDACC_VER_MINOR__ * 100))
#elif defined(__CUDACC_VER__)
#define EIGEN_CUDACC_VER __CUDACC_VER__
#else
#define EIGEN_CUDACC_VER 0
#endif
in Eigen/Core and replaced the Macros.h line with (EIGEN_CUDACC_VER >= 80000).
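So the fixed check in Macros.h presumably reads along these lines (a sketch of the described change, not the verbatim Eigen source):

// Macros.h, after the fix: compare the composite version macro defined
// in Eigen/Core instead of the removed __CUDACC_VER__.
#if EIGEN_CUDACC_VER >= 80000
  // ... code path enabled for CUDA 8.0 and newer ...
#endif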

grid_group not found in CUDA 9

I tried using Cooperative Groups in CUDA 9, but I get a compilation error.
Does anyone know the solution?
The development environment is as follows:
- CUDA 9
- Kepler K80
- Compute Capability: 3.7
#include <cstdint>
#include <iostream>
#include <vector>
#include <cooperative_groups.h>

__global__
void kernel(uint32_t values[])
{
    using namespace cooperative_groups;
    grid_group g = this_grid();
}

int main(void)
{
    constexpr uint32_t kNum = 1 << 24;
    std::vector<uint32_t> h_values(kNum);
    uint32_t *d_values;
    cudaMalloc(&d_values, sizeof(uint32_t) * kNum);
    cudaMemcpy(d_values, h_values.data(), sizeof(uint32_t) * kNum, cudaMemcpyHostToDevice);

    const uint32_t thread_num = 256;
    const dim3 block(thread_num);
    const dim3 grid((kNum + block.x - 1) / block.x);
    void *params[] = {&d_values};
    cudaLaunchCooperativeKernel((void *)kernel, grid, block, params);

    cudaMemcpy(h_values.data(), d_values, sizeof(uint32_t) * kNum, cudaMemcpyDeviceToHost);
    cudaFree(d_values);
    return 0;
}
$ nvcc -arch=sm_37 test.cu --std=c++11 -o test
test.cu(12): error: identifier "grid_group" is undefined
test.cu(12): error: identifier "this_grid" is undefined
The grid_group features are only supported in the Pascal architecture and later.
You can try compiling for, e.g., sm_60 (of course the executable won't run on your GPU). Additionally, you need to enable relocatable device code (-rdc=true).
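For example, the compile line from the question would then become something like:
$ nvcc -arch=sm_60 -rdc=true test.cu --std=c++11 -o test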
Unfortunately, the Programming Guide is not very clear about that; I couldn't find this information there. However, it is mentioned in some posts on devblogs.nvidia.com:
From https://devblogs.nvidia.com/cuda-9-features-revealed/
While Cooperative Groups works on all GPU architectures, certain functionality is inevitably architecture-dependent as GPU capabilities have evolved. Basic functionality, such as synchronizing groups smaller than a thread block down to warp granularity, is supported on all architectures, while Pascal and Volta GPUs enable new grid-wide and multi-GPU synchronizing groups.
Or at the very end of https://devblogs.nvidia.com/cooperative-groups/
New features in Pascal and Volta GPUs help Cooperative Groups go farther, by enabling creation and synchronization of thread groups that span an entire kernel launch running on one or even multiple GPUs.

not able to use printf in cuda kernel function

It seems that printf doesn't work inside the kernel of a CUDA program
#include "Common.h"
#include<cuda.h>
#include <stdio.h>
__device__ __global__ void Kernel(float *a_d , float *b_d ,int size)
{
int idx = threadIdx.x ;
int idy = threadIdx.y ;
//Allocating memory in the share memory of the device
__shared__ float temp[16][16];
//Copying the data to the shared memory
temp[idy][idx] = a_d[(idy * (size+1)) + idx] ;
printf("idx=%d, idy=%d, size=%d\n", idx, idy, size);
for(int i =1 ; i<size ;i++) {
if((idy + i) < size) { // NO Thread divergence here
float var1 =(-1)*( temp[i-1][i-1]/temp[i+idy][i-1]);
temp[i+idy][idx] = temp[i-1][idx] +((var1) * (temp[i+idy ][idx]));
}
__syncthreads(); //Synchronizing all threads before Next iterat ion
}
b_d[idy*(size+1) + idx] = temp[idy][idx];
}
When compiling, it says:
error: calling a host function("printf") from a __device__/__global__ function("Kernel") is not allowed
The CUDA version is 4.
Quoting the CUDA Programming Guide: "Formatted output is only supported by devices of compute capability 2.x and higher." See the programming guide for additional information.
Devices of compute capability < 2.x can use cuPrintf.
If you are on a device of compute capability 2.x or above and you are trying to use printf, make sure you have specified arch=sm_20 (or higher). The default is sm_10, which does not have sufficient features to support printf.
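For example (the file name here is just a placeholder):
nvcc -arch=sm_20 kernel.cu -o kernel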
NVIDIA offers three source level debuggers for CUDA. You may find these more useful than printf for inspecting variables.
- Nsight Visual Studio Edition CUDA Debugger
- Nsight Eclipse Edition CUDA Debugger
- cuda-gdb
You need to use cuPrintf, as in this example. Note that printf is a pretty limited way of debugging; the Nsight or Nsight Eclipse Edition IDEs are much nicer.
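For reference, a minimal cuPrintf sketch, assuming cuPrintf.cu from the CUDA SDK samples is available on the include path (its exact location depends on your SDK installation):

#include <cstdio>
#include "cuPrintf.cu"   // ships with the CUDA SDK samples

__global__ void Kernel()
{
    cuPrintf("idx=%d\n", threadIdx.x);   // device-side formatted output
}

int main()
{
    cudaPrintfInit();                    // allocate the device-side buffer
    Kernel<<<1, 16>>>();
    cudaPrintfDisplay(stdout, true);     // copy the buffer back and print it
    cudaPrintfEnd();                     // release the buffer
    return 0;
}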