I'm using the STL function count_if to count all the positive values
in a vector of doubles. For example my code is something like:
vector<double> Array(1,1.0);
Array.push_back(-1.0);
Array.push_back(1.0);
cout << count_if(Array.begin(), Array.end(), isPositive);
where the function isPositive is defined as
bool isPositive(double x)
{
return (x>0);
}
The code above prints 2. Is there a way of doing this
without writing my own function isPositive? Is there a built-in
function I could use?
Thanks!
std::count_if(v.begin(), v.end(), std::bind1st(std::less<double>(), 0)) is what you want.
If you're already using namespace std, the clearer version reads
count_if(v.begin(), v.end(), bind1st(less<double>(), 0));
bind1st and less are declared in the <functional> header, alongside the other standard predicates; count_if itself comes from <algorithm>.
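As a rough sketch of the same idea in current C++: std::bind1st and std::bind2nd were deprecated in C++11 and removed in C++17, so the version below uses std::bind with a placeholder instead. The helper name count_positive is made up for illustration.

```cpp
#include <algorithm>   // std::count_if
#include <cstddef>     // std::ptrdiff_t
#include <functional>  // std::bind, std::less, std::placeholders
#include <vector>

// Hypothetical helper: counts the elements x for which 0 < x holds.
std::ptrdiff_t count_positive(const std::vector<double>& v) {
    using namespace std::placeholders;
    // less(0.0, x) is true exactly when x is positive; binding the first
    // argument to 0.0 mirrors bind1st(less<double>(), 0).
    return std::count_if(v.begin(), v.end(),
                         std::bind(std::less<double>(), 0.0, _1));
}
```

With the three-element vector from the question (1.0, -1.0, 1.0), count_positive returns 2.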
If you are compiling with MSVC++ 2010 or GCC 4.5+ you can use real lambda functions:
std::count_if(Array.begin(), Array.end(), [](double d) { return d > 0; });
I don't think there is a built-in function.
However, you could use Boost.Lambda (http://www.boost.org/doc/libs/1_43_0/doc/html/lambda.html)
to write it:
cout << count_if(Array.begin(), Array.end(), _1 > 0);
cout << std::count_if(Array.begin(), Array.end(), std::bind2nd(std::greater<double>(), 0));
Use greater_equal<double>() instead of greater<double>() if you want to count the values that are >= 0.
I browsed the internet, and I've only seen people doing forward declarations of classes using the typedef keyword. How would I do that with functions/tasks?
I want to put the main function above the definitions of the other functions/tasks so the code is easier to read. In C++, a forward declaration of a function looks something like this:
//forward declaration of sub2
int sub2(int A, int B);
int main(){
cout << "Difference: " << sub2(25, 10);
return 0;
}
int sub2(int A, int B) { // Defining sub2 here
return A - B;
}
For SystemVerilog, will it be something like this?
function somefunction();
virtual task body();
somefunction();
endtask: body
function somefunction();
// do something here.
endfunction: somefunction
Should I use typedef for forward declarations with functions/tasks?
Functions and tasks do not need to be declared before use as long as the call has a set of trailing parentheses (), which may also contain the required arguments. They use search rules similar to hierarchical references. See section 23.8.1, "Task and function name resolution", in the IEEE 1800-2017 SystemVerilog LRM.
Unlike in C, function declaration order doesn't matter. You can call somefunction in body before declaring it.
You don't need to do any kind of declarations.
I am trying to write constraints using the ln and exp functions, but I get an error saying that CPLEX can't extract the expression.
forall (t in time)
Gw_C["Mxr"] == 20523 + 17954 * ln(maxl(pbefore[t]));
Ed_c ["RC"]== 0.0422* exp(0.1046* (maxl(pbefore[t])));
Gw_C["RC"] == 3590* pow(maxl(pbefore[t]), 0.6776);
Is there any other way to code these constraints in CPLEX?
Thanks
You may use exp and log if you rely on Constraint Programming within CPLEX:
using CP;
int scale=1000;
dvar int scalex in 1..10000;
dexpr float x=scalex/scale;
maximize x;
subject to
{
exp(x)<=100;
}
execute
{
writeln("x=",x);
}
works fine and gives:
x=4.605
But with Math Programming within CPLEX you cannot use exp like that.
What you can do instead is go through linearization.
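One way to read "linearization" here is a piecewise-linear approximation: sample the nonlinear function at a handful of breakpoints and interpolate linearly between them. The sketch below only illustrates that arithmetic in C++. The names Breakpoints, make_exp_breakpoints and eval_piecewise are hypothetical, and nothing in it uses a CPLEX API.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container for the sampled points (x[i], y[i]).
struct Breakpoints {
    std::vector<double> x;
    std::vector<double> y;
};

// Sample exp on [lo, hi] at segments + 1 evenly spaced points.
// Assumes lo < hi and segments >= 1.
Breakpoints make_exp_breakpoints(double lo, double hi, std::size_t segments) {
    Breakpoints bp;
    for (std::size_t i = 0; i <= segments; ++i) {
        const double xi = lo + (hi - lo) * static_cast<double>(i)
                                         / static_cast<double>(segments);
        bp.x.push_back(xi);
        bp.y.push_back(std::exp(xi));
    }
    return bp;
}

// Evaluate the piecewise-linear interpolant at x, clamped to the sampled range.
double eval_piecewise(const Breakpoints& bp, double x) {
    if (x <= bp.x.front()) return bp.y.front();
    if (x >= bp.x.back()) return bp.y.back();
    std::size_t i = 1;
    while (bp.x[i] < x) ++i;  // first breakpoint at or to the right of x
    const double t = (x - bp.x[i - 1]) / (bp.x[i] - bp.x[i - 1]);
    return bp.y[i - 1] + t * (bp.y[i] - bp.y[i - 1]);
}
```

Because exp is convex, the interpolant never lies below the true curve. More segments tighten the approximation at the cost of a larger model.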
I have a const thrust vector of elements from which I would like to extract at most N elements that pass a predicate (in any order), where the thrust vector size and N are known at compile-time. In my specific case, my vector is 500k elements and N is 100k.
My initial thought was to use thrust::copy_if to get all elements that pass the predicate, then to use only the first N elements for my subsequent calculations. However, in that case I would have to allocate two vectors of 500k elements (one for the initial vector, and one for the output of copy_if) and I'd have to process every element.
As this is an operation I have to do many times and across several CUDA streams, I would like to know if there is a way to obtain the N output elements while minimizing the memory footprint required, and ideally, minimizing the number of elements that need to be processed (i.e. breaking the process once N valid elements have been found).
One possible method to perform a stream compaction operation is to perform a predicated prefix-sum followed by a conditional indexed copy. By breaking a "monolithic" operation into these 2 pieces, it becomes fairly easy to insert the desired limiting behavior on output size.
The prefix sum is a fairly involved operation. We will use thrust for that. The conditional indexed copy is fairly trivial, so we will write our own CUDA kernel for that, rather than try to wrestle with a thrust::copy_if operation to get the copy logic just right. This kernel is where we will insert the limiting behavior on the output size.
Here is a worked example:
$ cat t34.cu
#include <thrust/scan.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <iostream>
#include <iterator>
using namespace thrust::placeholders;
typedef int mt;
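// d: input data, i: inclusive prefix sum of the predicate flags,
// r: output buffer for up to `limit` elements, size: length of d and i
__global__ void my_copy(mt *d, int *i, mt *r, int limit, int size){
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < size){
    // element 0 passes the predicate when its scan value is 1
    if ((idx == 0) && (*i == 1) && (limit > 0))
      *r = *d;
    // a later element passes when the scan value increases at its position;
    // copy it only while its output slot is still within the limit
    else if ((idx > 0) && (i[idx] > i[idx-1]) && (i[idx] <= limit)){
      r[i[idx]-1] = d[idx];}
  }
}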
int main(){
int rs = 3;
mt d[] = {0, 1, 0, 2, 0, 3, 0, 4, 0, 5};
int ds = sizeof(d)/sizeof(d[0]);
thrust::device_vector<mt> data(d, d+ds);
thrust::device_vector<int> idx(ds);
thrust::device_vector<mt> result(rs);
auto my_cmp = thrust::make_transform_iterator(data.begin(), 0+(_1>0));
thrust::inclusive_scan(my_cmp, my_cmp+ds, idx.begin());
my_copy<<<(ds+255)/256, 256>>>(thrust::raw_pointer_cast(data.data()), thrust::raw_pointer_cast(idx.data()), thrust::raw_pointer_cast(result.data()), rs, ds);
thrust::host_vector<mt> h_result = result;
thrust::copy_n(h_result.begin(), rs, std::ostream_iterator<mt>(std::cout, ","));
std::cout << std::endl;
}
$ nvcc -std=c++14 -o t34 t34.cu -arch=sm_52
$ ./t34
1,2,3,
$
(CUDA 11.0, Fedora 29, GTX 960)
Note that this code is provided for demonstration purposes. You should not assume that it is defect-free or suitable for any particular purpose. Use it at your own risk.
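To check the indexing logic without a GPU, the same two-step scheme (inclusive prefix sum of the predicate flags, then an indexed copy that drops anything past the limit) can be sketched on the host in plain C++. The helper name compact_limited and the hard-coded "> 0" predicate are assumptions for illustration. Each iteration of the second loop plays the role of one thread of the my_copy kernel.

```cpp
#include <algorithm>  // std::min
#include <cstddef>
#include <numeric>    // std::partial_sum
#include <vector>

// Host-side sketch (hypothetical helper): keep at most `limit` elements of
// `data` that are > 0, preserving their original order.
std::vector<int> compact_limited(const std::vector<int>& data, std::size_t limit) {
    // Step 1: 0/1 flags, then an inclusive scan in place.
    // Afterwards scan[i] is the number of passing elements in data[0..i].
    std::vector<int> scan(data.size());
    for (std::size_t i = 0; i < data.size(); ++i) scan[i] = data[i] > 0 ? 1 : 0;
    std::partial_sum(scan.begin(), scan.end(), scan.begin());

    // Step 2: element i passes if the scan value increased at position i;
    // its output slot is scan[i] - 1, and slots past the limit are skipped.
    const std::size_t total =
        data.empty() ? 0 : static_cast<std::size_t>(scan.back());
    std::vector<int> result(std::min(limit, total));
    for (std::size_t i = 0; i < data.size(); ++i) {
        const int prev = (i == 0) ? 0 : scan[i - 1];
        if (scan[i] > prev && static_cast<std::size_t>(scan[i]) <= limit)
            result[static_cast<std::size_t>(scan[i]) - 1] = data[i];
    }
    return result;
}
```

For the input of the worked example (0, 1, 0, 2, 0, 3, 0, 4, 0, 5) and a limit of 3, this yields 1, 2, 3. Unlike a real early exit, every element is still visited; only the writes are limited.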
A bit of study with a profiler will show that the thrust::inclusive_scan operation does perform a cudaMalloc and cudaFree operation "under the hood". So even though we have pulled most of the allocations "out into the open" here, thrust apparently still needs to perform a single temporary allocation (of unknown size) to support the scan operation.
Responding to a question in the comments below. To understand this: 0+(_1>0), there are two things to note:
The general syntax is using thrust::placeholders. This capability of thrust allows us to write simple unary or binary functions inline, avoiding the need to use lambdas or write separate functors.
The reason for the 0+ is as follows. If we simply used (_1>0), then thrust would use as its unary function a boolean test of the item returned by dereferencing the iterator, compared to zero. The result of that comparison is a boolean, and if we leave it that way, the prefix sum will ultimately be computed using boolean arithmetic, which we do not want. We want the result of the boolean greater-than test (i.e. true/false) to be converted to an integer, so that the subsequent prefix sum gets performed using integer arithmetic. Prepending the (_1>0) boolean test with 0+ accomplishes that.
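The effect of the element type on a prefix sum can be reproduced on the host with std::partial_sum, which accumulates in the value type of its input iterator. This is only an analogy in standard C++, not a statement about Thrust's internals; the function names scan_as_bool and scan_as_int are made up for illustration.

```cpp
#include <deque>
#include <numeric>  // std::partial_sum
#include <vector>

// Scan with bool as the element type: every running total is converted back
// to bool, so it can never exceed 1.
std::vector<int> scan_as_bool(const std::deque<bool>& flags) {
    std::vector<int> out(flags.size());
    std::partial_sum(flags.begin(), flags.end(), out.begin());
    return out;
}

// Same flags, but each one is widened with 0 + flag first, so the running
// total is an int and actually counts.
std::vector<int> scan_as_int(const std::deque<bool>& flags) {
    std::vector<int> widened;
    for (bool f : flags) widened.push_back(0 + f);
    std::vector<int> out(widened.size());
    std::partial_sum(widened.begin(), widened.end(), out.begin());
    return out;
}
```

For the flags false, true, false, true, false, true, scan_as_bool gives 0, 1, 1, 1, 1, 1, while scan_as_int gives 0, 1, 1, 2, 2, 3. Only the second is usable as a set of output positions.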
I have this piece of code whose output is 4. I assumed the answer would be 3 because of the pre-increment. Can anyone explain this?
#include<iostream>
#include<cstdio>
#define MAX(A,B) ((A>B)? A : B)
using namespace std;
int main()
{
int i=1,j=2,k;
k= MAX(++i,++j);
cout<<k;
return 0;
}
#define does not work like a function. Think of it more like find and replace: doing the macro expansion manually, you get
int main()
{
int i=1,j=2,k;
k= ((++i > ++j) ? ++i : ++j);
cout<<k;
return 0;
}
This means you increment i and j once when you compare them, and increment the larger of the two another time before assigning it to k. I generally avoid putting pre- and post-increment operations inside other logic, since it is harder to reason about. You are better off incrementing i and j on their own lines before using them in MAX.
Macros are not functions; they just perform text substitution before compilation. Your line of code becomes
k=((++i>++j)? ++i : ++j);
which clearly increments j twice.
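A small side-by-side sketch of the difference, using the macro from the question and std::max as the function alternative. The two wrapper function names are made up for illustration.

```cpp
#include <algorithm>  // std::max

#define MAX(A, B) ((A > B) ? A : B)  // macro from the question

// Expands to ((++i > ++j) ? ++i : ++j): the selected argument is evaluated twice.
int max_with_macro() {
    int i = 1, j = 2;
    return MAX(++i, ++j);  // compares 2 > 3, then increments j again, giving 4
}

// std::max is a real function: each argument is evaluated exactly once.
int max_with_function() {
    int i = 1, j = 2;
    return std::max(++i, ++j);  // max(2, 3), giving 3
}
```

max_with_macro returns 4 and max_with_function returns 3, which is the result the question expected.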
The ?: in the macro is the conditional operator (also called the ternary operator). A macro such as
#define MAX(a,b) ((a) > (b) ? (a) : (b))
Means:
if ((a) > (b)){
return a;
} else {
return b;
}
So if you would do:
int test = MAX(5,10);
test would be 10
I would like to use Thrust's stream compaction functionality (copy_if) for distilling indices of elements from a vector if the elements adhere to a number of constraints. One of these constraints depends on the values of neighboring elements (8 in 2D and 26 in 3D). My question is: how can I obtain the neighbors of an element in Thrust?
The function call operator of the functor for the 'copy_if' basically looks like:
__host__ __device__ bool operator()(float x) {
bool mark = x < 0.0f;
if (mark) {
if (left neighbor of x > 1.0f) return false;
if (right neighbor of x > 1.0f) return false;
if (top neighbor of x > 1.0f) return false;
//etc.
}
return mark;
}
Currently I use a work-around by first launching a CUDA kernel (in which it is easy to access neighbors) to appropriately mark the elements. After that, I pass the marked elements to Thrust's copy_if to distill the indices of the marked elements.
I came across counting_iterator as a sort of substitute for directly using threadIdx and blockIdx to acquire the index of the processed element. I tried the solution below, but compiling it gives me the error "/usr/include/cuda/thrust/detail/device/cuda/copy_if.inl(151): Error: Unaligned memory accesses not supported". As far as I know, I'm not trying to access memory in an unaligned fashion. Does anybody know what's going on and/or how to fix this?
struct IsEmpty2 {
float* xi;
IsEmpty2(float* pXi) { xi = pXi; }
__host__ __device__ bool operator()(thrust::tuple<float, int> t) {
bool mark = thrust::get<0>(t) < -0.01f;
if (mark) {
int countindex = thrust::get<1>(t);
if (xi[countindex] > 1.01f) return false;
//etc.
}
return mark;
}
};
thrust::copy_if(indices.begin(),
indices.end(),
thrust::make_zip_iterator(thrust::make_tuple(xi, thrust::counting_iterator<int>())),
indicesEmptied.begin(),
IsEmpty2(rawXi));
@phoad: you're right about the shared mem. It struck me after I had already posted my reply, and I then figured that the cache would probably help me, but you beat me to it with your quick response. The if-statement, however, is executed in less than 5% of all cases, so either using shared mem or relying on the cache will probably have a negligible impact on performance.
Tuples only support 10 values, so I would need tuples of tuples for the 26 values in the 3D case. Working with tuples and zip_iterator was already quite cumbersome, so I'll pass on this option (also from a code readability standpoint). I tried your suggestion of directly using threadIdx.x etc. in the device function, but Thrust doesn't like that. I get some inexplicable results, and sometimes I end up with a Thrust error. The following program, for example, generates a 'thrust::system::system_error' with an 'unspecified launch failure', although it first correctly prints "Processing 10" to "Processing 41":
struct printf_functor {
__host__ __device__ void operator()(int e) {
printf("Processing %d\n", threadIdx.x);
}
};
int main() {
thrust::device_vector<int> dVec(32);
for (int i = 0; i < 32; ++i)
dVec[i] = i + 10;
thrust::for_each(dVec.begin(), dVec.end(), printf_functor());
return 0;
}
The same applies to printing blockIdx.x. Printing blockDim.x, however, generates no error. I was hoping for a clean solution, but I guess I am stuck with my current work-around.
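For reference, the index-based idea (iterate over positions rather than values, and let the predicate look up neighbors through a pointer to the underlying array) can be sketched on the host with std::copy_if. This is plain C++, not Thrust, and it says nothing about the cause of the errors above. The names HasQuietNeighbors and select_indices, the row-major 2D layout, and the treatment of out-of-grid neighbors as 0 are all assumptions; only the four axis-aligned neighbors are checked to keep it short.

```cpp
#include <algorithm>  // std::copy_if
#include <iterator>   // std::back_inserter
#include <numeric>    // std::iota
#include <vector>

// Hypothetical predicate over element indices of a row-major width x height grid.
struct HasQuietNeighbors {
    const float* xi;
    int width;
    int height;

    // Value at (col, row), or 0 for positions outside the grid.
    float at(int col, int row) const {
        if (col < 0 || col >= width || row < 0 || row >= height) return 0.0f;
        return xi[row * width + col];
    }

    bool operator()(int index) const {
        if (!(xi[index] < 0.0f)) return false;  // the element itself must be negative
        const int col = index % width;
        const int row = index / width;
        // Reject the element if any axis-aligned neighbor exceeds 1.
        return !(at(col - 1, row) > 1.0f || at(col + 1, row) > 1.0f ||
                 at(col, row - 1) > 1.0f || at(col, row + 1) > 1.0f);
    }
};

// Returns the indices of all elements accepted by the predicate.
std::vector<int> select_indices(const std::vector<float>& grid, int width, int height) {
    std::vector<int> indices(grid.size());
    std::iota(indices.begin(), indices.end(), 0);  // 0, 1, 2, ...
    std::vector<int> selected;
    std::copy_if(indices.begin(), indices.end(), std::back_inserter(selected),
                 HasQuietNeighbors{grid.data(), width, height});
    return selected;
}
```

The predicate never receives the element value directly; it receives the index and reads both the element and its neighbors from the same array, which is what makes the neighbor lookup possible.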