How to simplify this loop? - language-agnostic

Considering an array a[i], i=0,1,...,g, where g could be any given number, and a[0]=1.
for a[1]=a[0]+1 to 1 do
for a[2]=a[1]+1 to 3 do
for a[3]=a[2]+1 to 5 do
...
for a[g]=a[g-1]+1 to 2g-1 do
#print a[1],a[2],...a[g]#
The problem is that every time we change the value of g, we need to modify the code by adding or removing loops. That is not good code.

Recursion is one way to solve this (although I would love to see an iterative solution).
!!! Warning, untested code below !!!
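template<typename A, unsigned int Size>
void recurse(A (&arr)[Size], int level, int g)
{
    if (level > g)
    {
        // I am at the bottom level, do stuff here
        return;
    }
    // The upper bound is inclusive, matching "to 2*level-1" in the question
    for (arr[level] = arr[level-1]+1; arr[level] <= 2 * level - 1; arr[level]++)
    {
        // Pass the same array down so deeper levels see the values chosen so far
        recurse(arr, level+1, g);
    }
}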
Then call with recurse(arr,1,g);
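As a hedged, self-contained sketch of the same recursive idea: the version below uses std::vector instead of a fixed-size array and takes the per-level upper bound and the bottom-level action as parameters. The names enumerate, countArrays, upper and visit are assumptions made for this illustration, not part of the question. Note that with a[0]=1 the literal bound 2*level-1 makes the first loop empty (it starts at 2 while the bound is 1), so the bound is left to the caller.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Visits every array a[0..g] with a[0] fixed by the caller and, for each
// level >= 1, a[level-1] + 1 <= a[level] <= upper(level).
// `upper` and `visit` are hypothetical parameters introduced for this sketch.
void enumerate(std::vector<int>& a, int level, int g,
               const std::function<int(int)>& upper,
               const std::function<void(const std::vector<int>&)>& visit) {
    if (level > g) {  // all g loop variables are fixed: bottom level
        visit(a);
        return;
    }
    for (a[level] = a[level - 1] + 1; a[level] <= upper(level); ++a[level]) {
        enumerate(a, level + 1, g, upper, visit);
    }
}

// Convenience wrapper: counts how many arrays are visited for a given g.
int countArrays(int g, const std::function<int(int)>& upper) {
    std::vector<int> a(g + 1);
    a[0] = 1;
    int count = 0;
    enumerate(a, 1, g, upper, [&count](const std::vector<int>&) { ++count; });
    return count;
}
```

For example, passing a bound of 2 for level 1 and 2*level-1 otherwise (that is, the bounds 2, 3, 5, 7, 9 for g = 5) visits 14 arrays, while the literal bound 2*level-1 at every level visits none.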

Imagine you are representing numbers with an array of digits. For example, 682 would be [6,8,2].
If you wanted to count from 0 to 999 you could write:
for (int n[0] = 0; n[0] <= 9; ++n[0])
for (int n[1] = 0; n[1] <= 9; ++n[1])
for (int n[2] = 0; n[2] <= 9; ++n[2])
// Do something with three digit number n here
But when you want to count to 9999 you need an extra for loop.
Instead, you use the procedure for adding 1 to a number: increment the final digit, if it overflows move to the preceding digit and so on. Your loop is complete when the first digit overflows. This handles numbers with any number of digits.
You need an analogous procedure to "add 1" to your loop variables.
Increment the final "digit", that is a[g]. If it overflows (i.e. exceeds 2g-1) then move on to the next most-significant "digit" (a[g-1]) and repeat. A slight complication compared to doing this with numbers is that having gone back through the array as values overflow, you then need to go forward to reset the overflowed digits to their new base values (which depend on the values to the left).
The following C# code implements both methods and prints the arrays to the console.
static void Print(int[] a, int n, ref int count)
{
++count;
Console.Write("{0} ", count);
for (int i = 0; i <= n; ++i)
{
Console.Write("{0} ", a[i]);
}
Console.WriteLine();
}
private static void InitialiseRight(int[] a, int startIndex, int g)
{
for (int i = startIndex; i <= g; ++i)
a[i] = a[i - 1] + 1;
}
static void Main(string[] args)
{
const int g = 5;
// Old method
int count = 0;
int[] a = new int[g + 1];
a[0] = 1;
for (a[1] = a[0] + 1; a[1] <= 2; ++a[1])
for (a[2] = a[1] + 1; a[2] <= 3; ++a[2])
for (a[3] = a[2] + 1; a[3] <= 5; ++a[3])
for (a[4] = a[3] + 1; a[4] <= 7; ++a[4])
for (a[5] = a[4] + 1; a[5] <= 9; ++a[5])
Print(a, g, ref count);
Console.WriteLine();
count = 0;
// New method
// Initialise array
a[0] = 1;
InitialiseRight(a, 1, g);
int index = g;
// Loop until all "digits" have overflowed
while (index != 0)
{
// Do processing here
Print(a, g, ref count);
// "Add one" to array
index = g;
bool carry = true;
while ((index > 0) && carry)
{
carry = false;
++a[index];
if (a[index] > 2 * index - 1)
{
--index;
carry = true;
}
}
// Re-initialise digits that overflowed.
if (index != g)
InitialiseRight(a, index + 1, g);
}
}

I'd say you don't want nested loops in the first place. Instead, you just want to call a suitable function that takes as arguments the current nesting level, the maximum nesting level (i.e. g), the start of the loop, and whatever it needs as context for the computation:
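template <typename T>
void process(int level, int g, int start, T& context) {
    // Assuming the first call uses level 1, levels 1..g each run one loop
    if (level <= g) {
        // The end bound is inclusive, as in the question's "to 2*level-1"
        for (int a(start + 1), end(2 * level - 1); a <= end; ++a) {
            process(level + 1, g, a, context);
        }
    }
    else {
        // computation goes here
    }
}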

Related

C# How to count how many anagrams are in a given string

I have to calculate how many anagrams are in a given word.
I have tried using factorials, permutations and the possibilities for each letter in the word.
This is what I have done.
static int DoAnagrams(string a, int x)
{
int anagrams = 1;
int result = 0;
x = a.Length;
for (int i = 0; i < x; i++)
{ anagrams *= (x - 1); result += anagrams; anagrams = 1; }
return result;
}
Example: for aabv I have to get 12; for aaab I have to get 4
As already stated in a comment, there is a formula for calculating the number of different anagrams:
#anagrams = n! / (c_1! * c_2! * ... * c_k!)
where n is the length of the word, k is the number of distinct characters and c_i is the count of how often a specific character occurs.
So first of all, you will need to calculate the factorial
int fac(int n) {
int f = 1;
for (int i = 2; i <=n; i++) f*=i;
return f;
}
and you will also need to count the characters in the word
Dictionary<char, int> countChars(string word){
var r = new Dictionary<char, int>();
foreach (char c in word) {
if (!r.ContainsKey(c)) r[c] = 0;
r[c]++;
}
return r;
}
Then the anagram count can be calculated as follows
int anagrams(string word) {
int ac = fac(word.Length);
var cc = countChars(word);
foreach (int ct in cc.Values)
ac /= fac(ct);
return ac;
}
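As a hedged C++ sketch of the same formula, assuming a 64-bit result is acceptable: instead of computing n! and dividing afterwards, the version below builds the result incrementally so that intermediate values stay smaller. The function name countAnagrams is an assumption made for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Number of distinct anagrams of `word`: n! / (c_1! * c_2! * ... * c_k!).
// Built up as a product of binomial coefficients, so every intermediate
// division is exact and no full factorial is ever stored.
std::uint64_t countAnagrams(const std::string& word) {
    std::map<char, int> counts;
    for (char c : word) ++counts[c];  // c_i for each distinct character

    std::uint64_t result = 1;
    std::uint64_t placed = 0;  // letters accounted for so far
    for (const auto& entry : counts) {
        for (int j = 1; j <= entry.second; ++j) {
            ++placed;
            // After j steps of a group this equals previous * C(placed, j)
            result = result * placed / static_cast<std::uint64_t>(j);
        }
    }
    return result;
}
```

With the examples from the question, countAnagrams("aabv") gives 12 and countAnagrams("aaab") gives 4. Very long words can still overflow 64 bits.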
Answer with Code
This is written in C#, so it may not apply to the language you want, but you didn't specify a language.
This works by generating every possible permutation of the string, adding every duplicate found in the list to another list, then removing those duplicates from the original list. After that, the count of the original list is the number of unique anagrams the string has.
private static List<string> anagrams = new List<string>();
static void Main(string[] args)
{
string str = "AAAB";
char[] charArry = str.ToCharArray();
Permute(charArry, 0, str.Count() - 1);
List<string> copyList = new List<string>();
for(int i = 0; i < anagrams.Count - 1; i++)
{
List<string> anagramSublist = anagrams.GetRange(i + 1, anagrams.Count - 1 - i);
var perm = anagrams.ElementAt(i);
if (anagramSublist.Contains(perm))
{
copyList.Add(perm);
}
}
foreach(var copy in copyList)
{
anagrams.Remove(copy);
}
Console.WriteLine(anagrams.Count);
Console.ReadKey();
}
static void Permute(char[] arry, int i, int n)
{
int j;
if (i == n)
{
var temp = string.Empty;
foreach(var character in arry)
{
temp += character;
}
anagrams.Add(temp);
}
else
{
for (j = i; j <= n; j++)
{
Swap(ref arry[i], ref arry[j]);
Permute(arry, i + 1, n);
Swap(ref arry[i], ref arry[j]); //backtrack
}
}
}
static void Swap(ref char a, ref char b)
{
char tmp;
tmp = a;
a = b;
b = tmp;
}
Final Notes
I know this isn't the cleanest or the best solution. It is simply the one that carries over most easily across the three object-oriented languages I know, without being too complex. It is simple to understand and simple to port to another language, so it's the answer I've decided to give.
EDIT
Here's a new answer based on the comments on this answer.
static void Main(string[] args)
{
var str = "abaa";
var strAsArray = new string(str.ToCharArray());
var denominator = 1;
List<char> dupedCharacters = new List<char>();
foreach(var character in strAsArray)
{
if(str.Count(f => (f == character)) > 1 && !dupedCharacters.Contains(character))
{
// Each repeated character contributes its own factorial to the denominator;
// summing the counts first would be wrong for words such as "aabb".
denominator *= factorial(str.Count(f => (f == character)));
dupedCharacters.Add(character);
}
}
Console.WriteLine("The number of possible anagrams is: " + (factorial(str.Count()) / denominator));
Console.ReadLine();
int factorial(int num)
{
if(num <= 1)
return 1;
return num * factorial(num - 1);
}
}

Cuda Implementation of Partitioned Subgroup

Is there a more efficient way to implement the "Partitioned Subgroup" functions of Vulkan/OpenGL in CUDA, one that does not have to loop over all elements in the subgroup? My current implementation just uses a loop from 0 to WARP_SIZE.
References:
(slide 37+38) https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9909-nvidia-vulkan-features-update.pdf
https://github.com/KhronosGroup/GLSL/blob/master/extensions/nv/GL_NV_shader_subgroup_partitioned.txt
Simple Implementation:
__device__ uint32_t subgroupPartitionNV(ivec2 p)
{
uint32_t result = 0;
for (int i = 0; i < 32; ++i)
{
int x = __shfl_sync(0xFFFFFFFF, p(0), i);
int y = __shfl_sync(0xFFFFFFFF, p(1), i);
uint32_t b = __ballot_sync(0xFFFFFFFF, p(0) == x && p(1) == y);
if (i == (threadIdx.x & 31)) result = b; // parentheses needed: == binds tighter than &
}
return result;
}
__device__ float subgroupPartitionedAddNV(float value, uint32_t ballot) // returns the float sum, not a mask
{
float result = 0;
for ( unsigned int i = 0; i < 32; ++i)
{
float other_value = __shfl_sync(0xFFFFFFFF, value, i);
if ((1U << i) & ballot) result += other_value;
}
return result;
}
Thanks to Abator's hint, I came up with a more efficient solution. It's a little ugly because labeled_partition is only implemented for int, but it works quite well.
template <int GROUP_SIZE = 32>
__device__ cooperative_groups::coalesced_group subgroupPartitionNV(ivec2 p)
{
using namespace cooperative_groups;
thread_block block = this_thread_block();
thread_block_tile<GROUP_SIZE> tile32 = tiled_partition<GROUP_SIZE>(block);
coalesced_group g1 = labeled_partition(tile32, p(0));
coalesced_group g2 = labeled_partition(tile32, p(1));
details::_coalesced_group_data_access acc;
return acc.construct_from_mask<coalesced_group>(acc.get_mask(g1) & acc.get_mask(g2));
}
template <typename T, int GROUP_SIZE = 32>
__device__ T subgroupPartitionedAddNV(T value, cooperative_groups::coalesced_group group)
{
int s = group.size();
int r = group.thread_rank();
for (int offset = GROUP_SIZE / 2; offset > 0; offset /= 2)
{
auto v = group.template shfl_down(value, offset);
if (r + offset < s) value += v;
}
return value;
}

CUDA __syncthreads(); not working; inverse in breakpoint hit order

I have a problem, I think with __syncthreads();, in the following code:
__device__ void prefixSumJoin(const bool *g_idata, int *g_odata, int n)
{
__shared__ int temp[Config::bfr*Config::bfr]; // allocated on invocation
int thid = threadIdx.y*blockDim.x + threadIdx.x;
if(thid<(n>>1))
{
int offset = 1;
temp[2*thid] = (g_idata[2*thid]?1:0); // load input into shared memory
temp[2*thid+1] = (g_idata[2*thid+1]?1:0);
for (int d = n>>1; d > 0; d >>= 1) // build sum in place up the tree
{
__syncthreads();
if (thid < d)
{
int ai = offset*(2*thid+1)-1; // <-- breakpoint B
int bi = offset*(2*thid+2)-1;
temp[bi] += temp[ai];
}
offset *= 2;
}
if (thid == 0) { temp[n - 1] = 0; } // clear the last element
for (int d = 1; d < n; d *= 2) // traverse down tree & build scan
{
offset >>= 1;
__syncthreads();
if (thid < d)
{
int ai = offset*(2*thid+1)-1;
int bi = offset*(2*thid+2)-1;
int t = temp[ai];
temp[ai] = temp[bi];
temp[bi] += t;
}
}
__syncthreads();
g_odata[2*thid] = temp[2*thid]; // write results to device memory
g_odata[2*thid+1] = temp[2*thid+1];
}
}
__global__ void selectKernel3(...)
{
int tidx = threadIdx.x;
int tidy = threadIdx.y;
int bidx = blockIdx.x;
int bidy = blockIdx.y;
int tid = tidy*blockDim.x + tidx;
int bid = bidy*gridDim.x+bidx;
int noOfRows1 = ...;
int noOfRows2 = ...;
__shared__ bool isRecordSelected[Config::bfr*Config::bfr];
__shared__ int selectedRecordsOffset[Config::bfr*Config::bfr];
isRecordSelected[tid] = false;
selectedRecordsOffset[tid] = 0;
__syncthreads();
if(tidx<noOfRows1 && tidy<noOfRows2)
if(... == ...)
isRecordSelected[tid] = true;
__syncthreads();
prefixSumJoin(isRecordSelected,selectedRecordsOffset,Config::bfr*Config::bfr); // <-- breakpoint A
__syncthreads();
if(isRecordSelected[tid]==true){
{
some_instruction;// <-- breakpoint C
...
}
}
}
...
f(){
dim3 dimGrid(13, 5);
dim3 dimBlock(Config::bfr, Config::bfr);
selectKernel3<<<dimGrid, dimBlock>>>(...)
}
//other file
class Config
{
public:
static const int bfr = 16; // blocking factor = number of rows per block
public:
Config(void);
~Config(void);
};
The prefixSum is from GPU Gems 3: Parallel Prefix Sum (Scan) with CUDA, with little change.
OK, now I set three breakpoints: A, B and C. They should be hit in the order A, B, C. The problem is that they are hit in the order A, B*x, C, B, so at point C selectedRecordsOffset is not ready, which causes errors. After A, B is hit a few times (but not for all iterations), then C is hit and execution continues further in the code, and then B is hit again for the rest of the loop. x differs depending on the input (for some inputs the breakpoints are not reordered at all, so C is the last one hit).
Moreover, when I look at the thread numbers that trigger each hit, A and C are hit by threadIdx.y = 0 and B by threadIdx.y = 10. How is this possible within the same block? Why do some threads skip the sync? There is no conditional sync.
Does anyone have any ideas on where to look for bugs?
If you need some more clarification, just ask.
Thanks in advance for any advice on how to work this out.
Adam
Thou shalt not use __syncthreads() in conditional code if the condition does not evaluate uniformly across all threads of each block.

An error with dec to bin

I have been debugging this function, but I don't know why it returns 99 when I pass 4 to it.
This is a function to convert from decimal to binary.
I have tried to cout exp, res and the other variables at each step and then multiply them by hand, but the result doesn't make sense.
int DecToBinary(long num) {
if(num == 0) {
return 0;
}
else if(num == 1) {
return 1;
}
int exp = 0;
int res = 0;
for (; num != 0; exp++){
res = res+num%2*pow(10,exp);
num = num/2;
}
return res;
}
Thank you guys.
if(num == 0) {
return 0;
}
else if(num == 0) {
return 1;
}
You know the second branch will never be executed, right?
Furthermore:
pow(10,exp);
This yields a floating-point number, so be prepared for rounding errors. Even better: don't use pow() at all (you don't need floating-point numbers for working with integers). Simply do the division step by step, accumulating the result in a variable.
int dec2bin(int n)
{
int r = 0, tp = 1;
while (n) {
r += (n % 2) * tp;
n >>= 1;
tp *= 10;
}
return r;
}
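A related point: packing binary digits into a decimal int overflows a 32-bit int once the input needs more than ten binary digits (anything above 1023). As a hedged sketch, assuming a string result is acceptable to the caller, the same divide-by-two loop can collect characters instead. The name toBinaryString is an assumption made for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Returns the binary representation of `num` as text, e.g. 4 -> "100".
// Uses only integer arithmetic, so there is no pow() rounding involved.
std::string toBinaryString(unsigned long num) {
    if (num == 0) return "0";
    std::string bits;
    while (num != 0) {
        bits.push_back(static_cast<char>('0' + num % 2));  // least significant bit first
        num /= 2;
    }
    std::reverse(bits.begin(), bits.end());  // most significant bit first
    return bits;
}
```

Here toBinaryString(4) returns "100", the value the question expected instead of 99.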

Upper Triangular matrix

I want to create an upper triangular matrix with CUDA.
In an upper triangular matrix, the elements located below the diagonal are zeros. This function should assign the given value to the other elements.
But the code below sets all elements to 0. Why?
__global__ void initUpperTrinagleGPU(int *devMatrix, int numR, int numC, int value) {
int x = blockDim.x*blockIdx.x + threadIdx.x;
int y = blockDim.y*blockIdx.y + threadIdx.y;
int offset = y * numC + x;
if(numC <= numR) {
devMatrix[offset] = 0;
}
else
devMatrix[offset] = value;
}
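This condition is wrong: if(numC <= numR) compares the matrix dimensions, not the element's position, so it is true for every element whenever the matrix has no more columns than rows.
This might work, but it's just off the top of my head, not tested:
if(y > x) { // row index greater than column index: below the diagonal
    devMatrix[offset] = 0;
}
else {
    devMatrix[offset] = value;
}
Note that you should wrap this in a bounds check such as: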
if(y < numR && x < numC) { ...
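To check a kernel like this, a host-side reference can help. The following is a hedged C++ sketch, assuming the same row-major layout (offset = y * numC + x, with y as the row and x as the column). The function name upperTriangular is an assumption made for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference: row-major numR x numC matrix where elements on or above
// the diagonal (x >= y) hold `value` and elements below it are zero.
std::vector<int> upperTriangular(int numR, int numC, int value) {
    std::vector<int> m(static_cast<std::size_t>(numR) * numC, 0);
    for (int y = 0; y < numR; ++y) {          // y: row index
        for (int x = 0; x < numC; ++x) {      // x: column index
            if (x >= y) {
                m[static_cast<std::size_t>(y) * numC + x] = value;
            }
        }
    }
    return m;
}
```

Copying devMatrix back to the host and comparing it element by element against this vector gives a quick correctness check for the kernel.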