cudaGetDeviceCount returns error: 1 invalid argument - cuda

Running this code:
int device = 0;
cudaGetDevice(&device);
cudaDeviceProp props;
cudaGetDeviceProperties(&props, device);
const int kb = 1024;
const int mb = kb * kb;
cout << "Module Start:" << endl;
cout << props.name << ": " << props.major << "." << props.minor << endl;
cout << " Global memory: " << props.totalGlobalMem / mb << "mb" << endl;
cout << " Shared memory: " << props.sharedMemPerBlock / kb << "kb" << endl;
cout << " Constant memory: " << props.totalConstMem / kb << "kb" << endl;
cout << " Block registers: " << props.regsPerBlock << endl;
cout << " Warp size: " << props.warpSize << endl;
cout << " Threads per block: " << props.maxThreadsPerBlock << endl;
cout << " Max block dimensions: [ " << props.maxThreadsDim[0] << ", " << props.maxThreadsDim[1] << ", " << props.maxThreadsDim[2] << " ]" << endl;
cout << " Max grid dimensions: [ " << props.maxGridSize[0] << ", " << props.maxGridSize[1] << ", " << props.maxGridSize[2] << " ]" << endl;
cout << endl;
Results in the following garbage output and crash on RTX 2060, x64 Windows 10 on friend's computer:
EDIT:
I added some error checks:
int devicesCount;
cudaError_t error_id = cudaGetDeviceCount(&devicesCount);
if (error_id != cudaSuccess) {
printf("cudaGetDeviceCount returned %d\n%s\n", (int)error_id, cudaGetErrorString(error_id));
return 1;
} else {
printf("Found %d GPUs\n", devicesCount);
}
and this is the error:
cudaGetDeviceCount returns error: "return 1, invalid argument"
It also appears he is using Windows insider edition and Driver Version 465.21, which is newer than the current stable release.
Works on the 1070, x64 Windows 10:
I tried using this post to get and set the active device, but that gave the same garbage output.
I am compiling to a .DLL and calling the functions through Python. It is possible that my Visual Studio project settings got messed up somehow, because it was working on friend's 2060 before.
"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin\nvcc.exe" -gencode=arch=compute_35,code=\"sm_35,compute_35\" -gencode=arch=compute_37,code=\"sm_37,compute_37\" -gencode=arch=compute_50,code=\"sm_50,compute_50\" -gencode=arch=compute_52,code=\"sm_52,compute_52\" -gencode=arch=compute_60,code=\"sm_60,compute_60\" -gencode=arch=compute_61,code=\"sm_61,compute_61\" -gencode=arch=compute_70,code=\"sm_70,compute_70\" -gencode=arch=compute_75,code=\"sm_75,compute_75\" -gencode=arch=compute_80,code=\"sm_80,compute_80\" -gencode=arch=compute_86,code=\"sm_86,compute_86\" --use-local-env -ccbin "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29333\bin\HostX64\x64" -x cu -I./ -I../../common/inc -I./ -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\/include" -I../../common/inc -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\include" --keep-dir x64\Release -maxrregcount=0 --machine 64 --compile -cudart shared --threads 0 -DWIN32 -D_MBCS -D_WINDLL -D_MBCS -Xcompiler "/EHsc /W3 /nologo /O2 /Fdx64/Release/vc142.pdb /FS /MD " -o x64/Release/gpu_compute.cu.obj "D:\Tests\MyDLL\gpu_compute.cu"

The error was caused by not yet released Driver Version 465.21 shipped with the Windows Insider edition. Rolling back to current 461.40 solved this.

Related

How to combined html with c++

I am pretty new to both html and c++ with a´lot of knowledge in c#, I have just writen my first 100% own c++ program a little banking system and need a litte help on how to integrate html.
For example how would i dispaly the funds in html?
IDE Clion.
int press;
int newFunds;
cout << "Available funds ";
cout << funds;
cout << "$ \n";
cout << "Press 1 for Withdraw \n";
cout << "Press 2 for Deposit \n";
cin >> press;
if(press == 1) {
cout << "Amount";
cin >> withdraw;
funds -= withdraw;
newFunds += withdraw;
if(funds < 0) {
funds += newFunds;
cout << "Not enough funds!";
}
else {
cout << "Remaining funds ", cout << funds, cout << "$ \n";
}
}
else if(press == 2) {
cout << "Amount\n";
cin >> deposit;
funds += deposit;
cout << "Funds ";
cout << funds;
cout << "$";
}
if(press > 2) {
cout << "Invalid";
}
return 0;
}
I welcome any suggestion!
I have tried to look up som tut but they wherent so good!
C++ and HTML can't connect just by creating a localhost web server. To make a "connection" between them, you'll need a C++ API to interact with the server and your C++ program. FastCGI protocol is simple and easy for beginners, but maybe not be the best option for more professional projects. Check information about FastCGI++ that's an API for C++. Here is a useful GitHub link that I recommend to start: https://github.com/eddic/fastcgipp.

Parameter Passing in C-like

int i, a[] = {0, 1, 2};
void foo(int x) {
i++;
x++;
cout << a[0] << " " << a[1] << " " << a[2];
}
void main() {
i = 0;
foo(a[i]);
}
So, the printing output will be:
By value-result: 0 - 1 - 2
By reference: 1 - 1 - 2
By name: 0 - 2 - 2
By constant reference: 0 - 1 - 2
Right ?
Beware, the cout stream and << operators are pure C++ primitives, you are NOT in C !
You also have to understand that side-effect inside a sequence of cout << foo << bar << fuzz; will produce a totally impredictable output depending on the choices made by the compiler (and NOT triggered by the language specification because the variables are supposed to stay constant along the evaluation of the expression).
To illustrate what I am saying, try to compile the following small program (example.cpp):
#include <iostream>
using namespace std;
int main ()
{
int i = 0;
cout << "i++: " << i++ << i++ << i++ << i++ << endl;
i = 0;
cout << "++i: " << ++i << ++i << ++i << ++i << endl;
cout << "The address of i is: " << &i << endl;
return 0;
}
When compiled with g++ -o example example.cpp, you should get something like:
i++: 3210
++i: 4321
The address of i is: 0xbfb1c44c
Then, try to compile it with g++ -O2 -o example example.cpp, you should get something like:
i++: 3210
++i: 4444
The address of i is: 0xbf9af0cc
In fact, the difference of these two execution comes from the fact that once you trigger the optimization in g++, the compiler assumes that you conform to the specification of C++ and that there will be no side-effect inside the cout << ... << endl; expression. So, it will use the last value of i all the time.

CUDA Thrust - Run Length Encoding with run index

I am trying to build a "run length encoder" which produces a report of occurrences of runs within a file using CUDA Thrust. I will use this "report" to perform the run length encoding step later.
e.g.
Input sequence:
inputSequence = [a, a, b, c, a, a, a];
Output sequences:
runChar = [a, a];
runCount = [2, 3];
runPosition = [0, 4];
The output desribes a run of 2 a's starting at position 0 and a run of 3 a's starting at the position 4.
The Thrust run length encoder example described below outputs two arrays - one for the output char and one for its length.
I would like to modify this so runs of less than 2 are excluded and it also outputs the position each run occurs.
// input data on the host
const char data[] = "aaabbbbbcddeeeeeeeeeff";
const size_t N = (sizeof(data) / sizeof(char)) - 1;
// copy input data to the device
thrust::device_vector<char> input(data, data + N);
// allocate storage for output data and run lengths
thrust::device_vector<char> output(N);
thrust::device_vector<int> lengths(N);
// print the initial data
std::cout << "input data:" << std::endl;
thrust::copy(input.begin(), input.end(), std::ostream_iterator<char>(std::cout, ""));
std::cout << std::endl << std::endl;
// compute run lengths
size_t num_runs = thrust::reduce_by_key
(input.begin(), input.end(), // input key sequence
thrust::constant_iterator<int>(1), // input value sequence
output.begin(), // output key sequence
lengths.begin() // output value sequence
).first - output.begin(); // compute the output size
// print the output
std::cout << "run-length encoded output:" << std::endl;
for(size_t i = 0; i < num_runs; i++)
std::cout << "(" << output[i] << "," << lengths[i] << ")";
std::cout << std::endl;
return 0;
One possible approach, building on what you have shown already:
Take your output lengths, and do an exclusive_scan on them. This creates a corresponding vector of the starting indexes of each run.
Use stream compaction (remove_if) to remove elements from all arrays (output, lengths, and indexes) whose corresponding length is 1. We do this in two steps, the first remove_if operation to clean up output and indexes, using lengths as the stencil, and the second operating directly on lengths. This can probably be significantly improved by operating on all 3 at once, which will make the output length calculation a bit more complicated. How you handle this exactly will depend on which sets of data you intend to retain.
Here is a fully worked example, extending your code:
$ cat t601.cu
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/iterator/zip_iterator.h>
struct is_not_one{
template <typename T>
__host__ __device__
bool operator()(T data){
return data != 1;
}
};
int main(){
// input data on the host
const char data[] = "aaabbbbbcddeeeeeeeeeff";
const size_t N = (sizeof(data) / sizeof(char)) - 1;
// copy input data to the device
thrust::device_vector<char> input(data, data + N);
// allocate storage for output data and run lengths
thrust::device_vector<char> output(N);
thrust::device_vector<int> lengths(N);
// print the initial data
std::cout << "input data:" << std::endl;
thrust::copy(input.begin(), input.end(), std::ostream_iterator<char>(std::cout, ""));
std::cout << std::endl << std::endl;
// compute run lengths
size_t num_runs = thrust::reduce_by_key
(input.begin(), input.end(), // input key sequence
thrust::constant_iterator<int>(1), // input value sequence
output.begin(), // output key sequence
lengths.begin() // output value sequence
).first - output.begin(); // compute the output size
// print the output
std::cout << "run-length encoded output:" << std::endl;
for(size_t i = 0; i < num_runs; i++)
std::cout << "(" << output[i] << "," << lengths[i] << ")";
std::cout << std::endl;
thrust::device_vector<int> indexes(num_runs);
thrust::exclusive_scan(lengths.begin(), lengths.begin()+num_runs, indexes.begin());
thrust::device_vector<char> foutput(num_runs);
thrust::device_vector<int> findexes(num_runs);
thrust::device_vector<int> flengths(num_runs);
thrust::copy_if(thrust::make_zip_iterator(thrust::make_tuple(output.begin(), indexes.begin())), thrust::make_zip_iterator(thrust::make_tuple(output.begin()+num_runs, indexes.begin()+num_runs)), lengths.begin(), thrust::make_zip_iterator(thrust::make_tuple(foutput.begin(), findexes.begin())), is_not_one());
size_t fnum_runs = thrust::copy_if(lengths.begin(), lengths.begin()+num_runs, flengths.begin(), is_not_one()) - flengths.begin();
std::cout << "output: " << std::endl;
thrust::copy_n(foutput.begin(), fnum_runs, std::ostream_iterator<char>(std::cout, ","));
std::cout << std::endl << "lengths: " << std::endl;
thrust::copy_n(flengths.begin(), fnum_runs, std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl << "indexes: " << std::endl;
thrust::copy_n(findexes.begin(), fnum_runs, std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl;
return 0;
}
$ nvcc -arch=sm_20 -o t601 t601.cu
$ ./t601
input data:
aaabbbbbcddeeeeeeeeeff
run-length encoded output:
(a,3)(b,5)(c,1)(d,2)(e,9)(f,2)
output:
a,b,d,e,f,
lengths:
3,5,2,9,2,
indexes:
0,3,9,11,20,
$
I'm certain that this code can be improved upon, but my purpose is to show you one possible general approach.
In my opinion, for future reference, it's not very helpful for you to strip off the include headers from your sample code. I think it's better to provide a complete, compilable code. Not a big deal in this case.
Also note that there are thrust example codes for run length encoding and decoding.

thrust 1.7 tabulate on CUDA device fails

The new thrust::tabulate function works for me on the host but not on the device. The device is a K20x with compute capability 3.5. The host is an Ubuntu machine with 128GB of memory. Help?
I think that the unified addressing is not the problem since I can sort a unifiedly addressed array on the device.
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/tabulate.h>
#include <thrust/version.h>
using namespace std;
// Print an expression's name then its value, possible followed by a
// comma or endl. Ex: cout << PRINTC(x) << PRINTN(y);
#define PRINT(arg) #arg "=" << (arg)
#define PRINTC(arg) #arg "=" << (arg) << ", "
#define PRINTN(arg) #arg "=" << (arg) << endl
// Execute an expression and check for CUDA errors.
#define CE(exp) { \
cudaError_t e; e = (exp); \
if (e != cudaSuccess) { \
cerr << #exp << " failed at line " << __LINE__ << " with error " << cudaGetErrorString(e) << endl; \
exit(1); \
} \
}
const int N(10);
int main(void) {
int major = THRUST_MAJOR_VERSION;
int minor = THRUST_MINOR_VERSION;
cout << "Thrust v" << major << "." << minor
<< ", CUDA_VERSION: " << CUDA_VERSION << ", CUDA_ARCH: " << __CUDA_ARCH__
<< endl;
cout << PRINTN(N);
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
if (!prop.unifiedAddressing) {
cerr << "Unified addressing not available." << endl;
exit(1);
}
cudaGetDeviceProperties(&prop, 0);
if (!prop.canMapHostMemory) {
cerr << "Can't map host memory." << endl;
exit(1);
}
cudaSetDeviceFlags(cudaDeviceMapHost);
int *p, *q;
CE(cudaHostAlloc(&p, N*sizeof(int), cudaHostAllocMapped));
CE(cudaHostAlloc(&q, N*sizeof(int), cudaHostAllocMapped));
thrust::tabulate(thrust::host, p, p+N, thrust::negate<int>());
thrust::tabulate(thrust::device, q, q+N, thrust::negate<int>());
for (int i=0; i<N; i++)
cout << PRINTC(i) << PRINTC(p[i]) << PRINTN(q[i]);
}
Output:
Thrust v1.7, CUDA_VERSION: 6000, CUDA_ARCH: 0
N=10
i=0, p[i]=0, q[i]=0
i=1, p[i]=-1, q[i]=0
i=2, p[i]=-2, q[i]=0
i=3, p[i]=-3, q[i]=0
i=4, p[i]=-4, q[i]=0
i=5, p[i]=-5, q[i]=0
i=6, p[i]=-6, q[i]=0
i=7, p[i]=-7, q[i]=0
i=8, p[i]=-8, q[i]=0
i=9, p[i]=-9, q[i]=0
The following does not add any info content to my post but is required before stackoverflow will accept it: Much of the program is error checking and version checking.
The problem appears to be fixed in the thrust master branch at the moment. This master branch currently identifies itself as Thrust v1.8.
I ran your code with CUDA 6RC (appears to be what you are using) and I was able to duplicate your observation.
I then updated to the master branch, and removed the __CUDA_ARCH__ macro from your code, and I got the expected results (host and device tabulations match).
Note that according to the programming guide, the __CUDA_ARCH__ macro is only defined when it's used in code that is being compiled by the device code compiler. It is officially undefined in host code. Therefore it's acceptable to use it as follows in host code:
#ifdef __CUDA_ARCH__
but not as you are using it. Yes, I understand the behavior is different between thrust v1.7 and thrust master in this regard, but that appears to (also) be a thrust issue, that has been fixed in the master branch.
Both of these issues I expect would be fixed whenever the next version of thrust gets incorporated into an official CUDA drop. Since we are very close to CUDA 6.0 official release, I'd be surprised if these issues were fixed in CUDA 6.0.
Further notes about the tabulate issue:
One workaround would be to update thrust to master
Issue doesn't appear to be specific to thrust::tabulate in my testing. Many thrust functions that I tested seem to fail in that when used with thrust::device and raw pointers, they fail to write values correctly (seem to write all zeroes), but they do seem to be able to read values correctly (e.g. thrust::reduce seems to work)
Another possible workaround is to wrap your raw pointers with thrust::device_ptr<> using thrust::device_ptr_cast<>(). That seemed to work for me as well.

qserialport does not send a char to arduino

I'm having a trouble in trying to send a char (i.e. "R") from my qt5 application on WIN7 to comport which is connected to an Arduino.
I intend to blink a led on Arduino and my arduino part works OK.
Here is my qt code:
#include <QTextStream>
#include <QCoreApplication>
#include <QtSerialPort/QSerialPortInfo>
#include <QSerialPort>
#include <iostream>
#include <QtCore>
QT_USE_NAMESPACE
using namespace std;
QSerialPort serial;
int main(int argc, char *argv[])
{
QCoreApplication a(argc, argv);
QTextStream out(stdout);
QList<QSerialPortInfo> serialPortInfoList = QSerialPortInfo::availablePorts();
out << QObject::tr("Total number of ports available: ") << serialPortInfoList.count() << endl;
foreach (const QSerialPortInfo &serialPortInfo, serialPortInfoList) {
out << endl
<< QObject::tr("Port: ") << serialPortInfo.portName() << endl
<< QObject::tr("Location: ") << serialPortInfo.systemLocation() << endl
<< QObject::tr("Description: ") << serialPortInfo.description() << endl
<< QObject::tr("Manufacturer: ") << serialPortInfo.manufacturer() << endl
<< QObject::tr("Vendor Identifier: ") << (serialPortInfo.hasVendorIdentifier() ? QByteArray::number(serialPortInfo.vendorIdentifier(), 16) : QByteArray()) << endl
<< QObject::tr("Product Identifier: ") << (serialPortInfo.hasProductIdentifier() ? QByteArray::number(serialPortInfo.productIdentifier(), 16) : QByteArray()) << endl
<< QObject::tr("Busy: ") << (serialPortInfo.isBusy() ? QObject::tr("Yes") : QObject::tr("No")) << endl;
}
serial.setPortName("COM5");
serial.open(QIODevice::ReadWrite);
serial.setBaudRate(QSerialPort::Baud9600);
serial.setDataBits(QSerialPort::Data8);
serial.setParity(QSerialPort::NoParity);
serial.setStopBits(QSerialPort::OneStop);
serial.setFlowControl(QSerialPort::NoFlowControl);
if(!serial.isOpen())
{
std::cout<<"port is not open"<<endl;
//serial.open(QIODevice::ReadWrite);
}
if(serial.isWritable()==true)
{
std::cout<<"port writable..."<<endl;
}
QByteArray data("R");
serial.write(data);
serial.flush();
std::cout<<"value sent!!! "<<std::endl;
serial.close();
return 0;
}
My source code consists of two parts,
1- serialportinfolist .... which works just fine
2- opening and writing data... I get no issue when running the code and the display shows the result as if nothing has gone wrong!
HOWEVER, the led on the board does not turn on when I run this code.
I test this with Arduino Serial Monitor and it turns on but cant turn on from Qt.
Are you waiting for cr lf (0x0D 0x0A) in your arduino code?
QByteArray ba;
ba.resize(3);
ba[0] = 0x5c; //'R'
ba[1] = 0x0d;
ba[2] = 0x0a;
Or append it to your string with
QByteArray data("R\r\n");
Or
QByteArray data("R\n");
I think I have found a partial solution but it is still incomplete.
When I press debug the first time, qt does not send any signal to Arduino, but when I press debug for the second time it behaves as expected.
So, is'nt it so weird that one has to run it twice to get it working???
Let me know if the problem exists somewhere else,
any help...