CUDA: Max of array, how to prevent write collisions? - cuda

I have an array of doubles stored in GPU global memory and I need to find the maximum value in it. I have read some texts about parallel reduction, so I know that one should divide the array between blocks, have each block find its own maximum, and so on.
But they never seem to address the issue of threads trying to write to the same memory position simultaneously.
Let's say that local_max=0.0 at the beginning of a block's execution. Then each thread reads its value from the input vector, decides that it is larger than local_max, and then tries to write its value to local_max. When all of this happens at the exact same time (at least inside the same warp), how can this work and end up with the actual maximum within this block?
I would think either an atomic function or some kind of lock or critical section would be needed, but I haven't seen this addressed in the answers I have found (e.g. http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf).

The answers to your questions are contained in the very document you linked to, and the SDK reduction example shows concrete implementations of the reduction concept.
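For completeness, here is a concrete example of a reduction kernel:
// Launch with blockDim.x == BLOCKSIZE; BLOCKSIZE must be a multiple of 32 and N >= 1.
template <typename T, int BLOCKSIZE>
__global__ void reduction(T *inputvals, T *outputvals, int N)
{
    __shared__ T data[BLOCKSIZE];
    // Any input element is a safe starting value for a maximum.
    T maxval = inputvals[0];
    // Grid-stride loop: the blocks divide the input between them.
    for(int i=blockIdx.x*blockDim.x + threadIdx.x; i<N; i+=blockDim.x*gridDim.x)
    {
        maxfunc(maxval, inputvals[i]);
    }
    data[threadIdx.x] = maxval;
    __syncthreads();
    // Here maxfunc(a,b) sets a to the maximum of a and b.
    // Each thread only ever writes its own slot, data[threadIdx.x].
    if (threadIdx.x < 32) {
        for(int i=32+threadIdx.x; i < BLOCKSIZE; i+= 32) {
            maxfunc(data[threadIdx.x], data[i]);
        }
        // __syncwarp() orders the steps explicitly (required on Volta and later).
        __syncwarp();
        if (threadIdx.x < 16) maxfunc(data[threadIdx.x], data[threadIdx.x+16]);
        __syncwarp();
        if (threadIdx.x < 8) maxfunc(data[threadIdx.x], data[threadIdx.x+8]);
        __syncwarp();
        if (threadIdx.x < 4) maxfunc(data[threadIdx.x], data[threadIdx.x+4]);
        __syncwarp();
        if (threadIdx.x < 2) maxfunc(data[threadIdx.x], data[threadIdx.x+2]);
        __syncwarp();
        if (threadIdx.x == 0) {
            maxfunc(data[0], data[1]);
            outputvals[blockIdx.x] = data[0];
        }
    }
}
The key point is that every thread only ever writes to its own shared-memory slot, data[threadIdx.x], so no two threads write to the same location; the synchronization within a warp orders the steps of the reduction in shared memory. That synchronization was implicit on the hardware this answer was originally written for; on Volta and later GPUs it has to be made explicit with __syncwarp(), as in the kernel above. The result is a single per-block maximum value. A second reduction pass is required to reduce the set of block maximums to the global maximum (often it is faster to do this on the host). In this example, maxfunc is the "compare and set" function, which needs to be declared before the kernel and could be as simple as
template <typename T>
__device__ void maxfunc(T & a, const T & b)
{
    a = (b > a) ? b : a;
}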
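One way to see why a tree-shaped reduction avoids write collisions is a sequential CPU model of it. The sketch below is plain host C++ rather than CUDA, and the names block_max and array_max are hypothetical; it assumes a power-of-two block size. The inner loop over t stands in for the threads of a block, and in every step "thread" t writes only slot[t] while reading a slot that nobody writes in that step.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Pass 1 for a single block. The end of each stride iteration stands in for
// the barrier that separates the steps on the GPU.
double block_max(const std::vector<double>& input, std::size_t begin,
                 std::size_t block_size)
{
    // "Shared memory": one slot per thread, padded with a real input element
    // so the padding can never beat the true maximum.
    std::vector<double> slot(block_size, input[begin]);
    for (std::size_t t = 0; t < block_size && begin + t < input.size(); ++t)
        slot[t] = input[begin + t];

    for (std::size_t stride = block_size / 2; stride > 0; stride /= 2) {
        for (std::size_t t = 0; t < stride; ++t)             // threads 0..stride-1
            slot[t] = std::max(slot[t], slot[t + stride]);   // writes slot[t] only
    }
    return slot[0];
}

// Pass 1 over all blocks, then pass 2 over the per-block results.
double array_max(const std::vector<double>& input, std::size_t block_size)
{
    assert(!input.empty());
    assert(block_size > 0 && (block_size & (block_size - 1)) == 0);
    std::vector<double> block_maxima;
    for (std::size_t begin = 0; begin < input.size(); begin += block_size)
        block_maxima.push_back(block_max(input, begin, block_size));
    return *std::max_element(block_maxima.begin(), block_maxima.end());
}
```

Seeding the slots from the input, rather than from 0.0 as in the question, also keeps the result correct when every value is negative.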

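Don't roll your own code; use Thrust (included in the CUDA SDK since version 4.0):
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/copy.h>
#include <thrust/extrema.h> // thrust::max_element
#include <iostream>
#include <iterator> // std::ostream_iterator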
int main(void)
{
thrust::host_vector<int> h_vec(10000);
thrust::sequence(h_vec.begin(), h_vec.end());
// show hvec
thrust::copy(h_vec.begin(), h_vec.end(),
std::ostream_iterator<int>(std::cout, "\n"));
// transfer to device
thrust::device_vector<int> d_vec = h_vec;
int max_dvec_value = *thrust::max_element(d_vec.begin(), d_vec.end());
std::cout << "max value: " << max_dvec_value << "\n";
return 0;
}
Note that thrust::max_element returns an iterator rather than a value, which is why the result is dereferenced in the code above.

Your question is clearly answered in the document you link to. I think you just need to spend some more time reading it and understanding the CUDA concepts used in it. In particular, I would focus on shared memory, the __syncthreads() method, and how to uniquely identify a thread while inside a kernel. Additionally, you should try to understand why the reduction may need to be run in 2 passes to find the global maximum.

Related

Are atomic operations in CUDA guaranteed to be scheduled per warp?

Suppose I have 8 blocks of 32 threads each running on a GTX 970. Each block either writes all 1's or all 0's to an array of length 32 in global memory, where thread 0 in a block writes to position 0 in the array.
To write the actual values, atomicExch is used, exchanging the current value in the array with the value that the block attempts to write. Because of SIMD, the atomic operation, and the fact that a warp executes in lockstep, I would expect the array to contain, at any point in time, only 1's or only 0's, but never a mix of the two.
However, while running code like this there are several cases where at some point in time the array contains a mix of 0's and 1's. This appears to indicate that atomic operations are not executed per warp, and are instead scheduled using some other scheme.
From other sources I have not really found a conclusive write-up detailing the scheduling of atomic operations across different warps (please correct me if I'm wrong), so I was wondering if there is any information on this topic. I need to write many small vectors, each consisting of several 32-bit integers, atomically to global memory, so an atomic operation that is guaranteed to write a single vector atomically is obviously very important.
For those wondering, the code I wrote was executed on a GTX 970, compiled for compute capability 5.2, using CUDA 8.0.
The atomic instructions, like all instructions, are scheduled per warp. However there is an unspecified pipeline associated with atomics, and the scheduled instruction flow through the pipeline is not guaranteed to be executed in lockstep, for every thread, for every stage through the pipeline. This gives rise to the possibility for your observations.
I believe a simple thought experiment will demonstrate that this must be true: what if 2 threads in the same warp targeted the same location? Clearly every aspect of the processing could not proceed in lockstep. We could extend this thought experiment to the case where we have multiple issue per clock within an SM, and even across SMs, as additional examples.
If the vector length were short enough (16 bytes or less) then it should be possible to accomplish this ("atomic update") simply by having a thread in a warp write an appropriate vector-type quantity, e.g. int4. As long as all threads (regardless of where they are in the grid) are attempting to update a naturally aligned location, the write should not be corrupted by other writes.
However, after discussion in the comments, it seems that OP's goal is to be able to have a warp or threadblock update a vector of some length, without interference from other warps or threadblocks. It seems to me that really what is desired is access control (so that only one warp or threadblock is updating a particular vector at a time) and OP had some code that wasn't working as desired.
This access control can be enforced using an ordinary atomic operation (atomicCAS in the example below) to permit only one "producer" to update a vector at a time.
What follows is an example producer-consumer code, where there are multiple threadblocks that are updating a range of vectors. Each vector "slot" has a "slot control" variable, which is atomically updated to indicate:
vector is empty
vector is being filled
vector is filled, ready for "consumption"
With this three-state scheme, ordinary (non-atomic) accesses to the vector itself are sufficient for both the consumer and the multiple producers; only the slot control variable needs an ordinary atomic access mechanism. Here is an example code:
#include <assert.h>
#include <iostream>
#include <stdio.h>
const int num_slots = 256;
const int slot_length = 32;
const int max_act = 65536;
const int slot_full = 2;
const int slot_filling = 1;
const int slot_empty = 0;
const int max_sm = 64; // needs to be greater than the maximum number of SMs for any GPU that it will be run on
__device__ int slot_control[num_slots] = {0};
__device__ int slots[num_slots*slot_length];
__device__ int observations[max_sm] = {0}; // reported by consumer
__device__ int actives[max_sm] = {0}; // reported by producers
__device__ int correct = 0;
__device__ int block_id = 0;
__device__ volatile int restricted_sm = -1;
__device__ int num_act = 0;
static __device__ __inline__ int __mysmid(){
int smid;
asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
return smid;}
// this code won't work on a GPU with a single SM!
__global__ void kernel(){
__shared__ volatile int done, update, next_slot;
int my_block_id = atomicAdd(&block_id, 1);
int my_sm = __mysmid();
if (my_block_id == 0){
if (!threadIdx.x){
restricted_sm = my_sm;
__threadfence();
// I am "block 0" and process the vectors, checking for coherency
// "consumer"
next_slot = 0;
volatile int *vslot_control = slot_control;
volatile int *vslots = slots;
int scount = 0;
while(scount < max_act){
if (vslot_control[next_slot] == slot_full){
scount++;
int slot_val = vslots[next_slot*slot_length];
for (int i = 1; i < slot_length; i++) if (slot_val != vslots[next_slot*slot_length+i]) { assert(0); /* badness - incoherence */}
observations[slot_val]++;
vslot_control[next_slot] = slot_empty;
correct++;
__threadfence();
}
next_slot++;
if (next_slot >= num_slots) next_slot = 0;
}
}}
else {
// "producer"
while (restricted_sm < 0); // wait for signaling
if (my_sm == restricted_sm) return;
next_slot = 0;
done = 0;
__syncthreads();
while (!done) {
if (!threadIdx.x){
while (atomicCAS(slot_control+next_slot, slot_empty, slot_filling) > slot_empty) {
next_slot++;
if (next_slot >= num_slots) next_slot = 0;}
// we grabbed an empty slot, fill it with my_sm
if (atomicAdd(&num_act, 1) < max_act) update = 1;
else {done = 1; update = 0;}
}
__syncthreads();
if (update) slots[next_slot*slot_length+threadIdx.x] = my_sm;
__threadfence(); //enforce ordering
if ((update) && (!threadIdx.x)){
slot_control[next_slot] = 2; // mark slot full
atomicAdd(actives+my_sm, 1);}
__syncthreads();
}
}
}
int main(){
kernel<<<256, slot_length>>>();
cudaDeviceSynchronize();
cudaError_t res= cudaGetLastError();
if (res != cudaSuccess) printf("kernel failure: %d\n", (int)res);
int *h_obs = new int[max_sm];
int *h_act = new int[max_sm];
int h_correct;
cudaMemcpyFromSymbol(h_obs, observations, sizeof(int)*max_sm);
cudaMemcpyFromSymbol(h_act, actives, sizeof(int)*max_sm);
cudaMemcpyFromSymbol(&h_correct, correct, sizeof(int));
int h_total_act = 0;
int h_total_obs = 0;
for (int i = 0; i < max_sm; i++){
std::cout << h_act[i] << "," << h_obs[i] << " ";
h_total_act += h_act[i];
h_total_obs += h_obs[i];}
std::cout << std::endl << h_total_act << "," << h_total_obs << "," << h_correct << std::endl;
}
I don't claim this code to be defect free for any use case. It is advanced to demonstrate the workability of a concept, not as production-ready code. It seems to work for me on linux, on a couple different systems I tested it on. It should not be run on GPUs that have only a single SM, as one SM is reserved for the consumer, and the remaining SMs are used by the producers.
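The slot protocol on its own, stripped of the kernel around it, can be sketched on the host with std::atomic. This is only a model under stated assumptions: the Slot struct and the function names are hypothetical, compare_exchange_strong stands in for atomicCAS, and release/acquire ordering stands in for the __threadfence() calls.

```cpp
#include <array>
#include <atomic>
#include <cassert>

enum SlotState : int { slot_empty = 0, slot_filling = 1, slot_full = 2 };

struct Slot {
    std::atomic<int> control{slot_empty};
    std::array<int, 32> payload{};
};

// Producer, step 1: claim the slot. Only one caller can move it from
// empty to filling, which is what keeps other producers out.
bool try_claim(Slot& s)
{
    int expected = slot_empty;
    return s.control.compare_exchange_strong(expected, slot_filling);
}

// Producer, step 2: fill the vector with ordinary writes, then publish it.
void fill_and_publish(Slot& s, int value)
{
    s.payload.fill(value);
    s.control.store(slot_full, std::memory_order_release);
}

// Consumer: only touch the vector once it is marked full, check that it is
// coherent, then hand the slot back to the producers.
bool try_consume(Slot& s, int& value_out)
{
    if (s.control.load(std::memory_order_acquire) != slot_full) return false;
    value_out = s.payload[0];
    for (int v : s.payload) assert(v == value_out);
    s.control.store(slot_empty, std::memory_order_release);
    return true;
}
```

The vector itself is never accessed atomically; the control word alone decides who may touch it at any moment.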

CUDA cub::DeviceScan and the temp_storage_bytes parameter

I'm using the cub::DeviceScan functions, and the sample code snippet has a parameter temp_storage_bytes, which it uses to allocate memory (which, incidentally, the code snippet never frees).
The code snippet calls cub::DeviceScan functions with a pointer to NULL memory which triggers it to calculate the required amount of temporary device memory needed for the function, and then returns. The necessary temporary memory is allocated with cudaMalloc, and the function call is repeated pointing to this memory. The temporary memory is then freed with cudaFree (or probably should be).
I'm doing many repetitions of the device scan on different float arrays, but every float array has the same length.
My question is, can I assume that temp_storage_bytes will always be the same value? If so, I can then do a single cudaMalloc and a single cudaFree for many function calls.
The example is unclear on how the required memory is determined and whether it can change for a given array of a given length.
If you make repeated calls to cub::DeviceScan::InclusiveSum over different arrays of the same length (and element type), you can assume that a single size query is enough to determine the temp_storage_bytes required. In the example below, cub::DeviceScan::InclusiveSum is called several times over different arrays of the same length, with only one call used to determine the temporary storage size:
// Ensure printing of CUDA runtime errors to console
#define CUB_STDERR
#include <stdio.h>
#include <stdlib.h> // rand, abs
#include <algorithm> // std::generate
#include <cub/cub.cuh> // or equivalently <cub/device/device_scan.cuh>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
int main(void)
{
// Declare, allocate, and initialize device pointers for input and output
int num_items = 7;
thrust::device_vector<int> d_in(num_items);
thrust::device_vector<int> d_out(num_items);
// Determine temporary device storage requirements for inclusive prefix sum
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, d_in.data(), d_out.data(), num_items);
// Allocate temporary storage for inclusive prefix sum
cudaMalloc(&d_temp_storage, temp_storage_bytes);
for (int k=0; k<10; k++) {
thrust::host_vector<int> h_in(num_items);
thrust::host_vector<int> h_out(num_items,0);
std::generate(h_in.begin(), h_in.end(), rand);
d_in = h_in;
// Run inclusive prefix sum
cub::DeviceScan::InclusiveSum(d_temp_storage, temp_storage_bytes, d_in.data(), d_out.data(), num_items);
int difference = 0;
int prev = 0;
for (int i=0; i<num_items; i++) {
h_out[i] = prev + h_in[i];
prev = h_out[i];
int val = d_out[i]; // copies one element back from the device
printf("%i %i %i %i\n",i,difference,h_out[i],val);
difference = difference + abs(h_out[i] - val);
}
if (difference == 0) printf("Test passed!\n");
else printf("A problem occurred!\n");
h_in.shrink_to_fit();
h_out.shrink_to_fit();
}
// Release the temporary storage once all scans are done
cudaFree(d_temp_storage);
getchar();
}

What does #pragma unroll do exactly? Does it affect the number of threads?

I'm new to CUDA, and I can't understand loop unrolling. I've written a piece of code to understand the technique
__global__ void kernel(float *b, int size)
{
int tid = blockDim.x * blockIdx.x + threadIdx.x;
#pragma unroll
for(int i=0;i<size;i++)
b[i]=i;
}
Above is my kernel function. In main I call it like below
int main()
{
float * a; //host array
float * b; //device array
int size=100;
a=(float*)malloc(size*sizeof(float));
cudaMalloc((float**)&b,size);
cudaMemcpy(b, a, size, cudaMemcpyHostToDevice);
kernel<<<1,size>>>(b,size); //size=100
cudaMemcpy(a, b, size, cudaMemcpyDeviceToHost);
for(int i=0;i<size;i++)
cout<<a[i]<<"\t";
_getch();
return 0;
}
Does it mean I have size*size=10000 threads running to execute the program? Are 100 of them created when loop is unrolled?
No. It means you have called a CUDA kernel with one block and that one block has 100 active threads. You're passing size as the second function parameter to your kernel. In your kernel each of those 100 threads executes the for loop 100 times.
#pragma unroll is a compiler directive that requests an optimization which can, for example, replace a piece of code like
for ( int i = 0; i < 5; i++ )
b[i] = i;
with
b[0] = 0;
b[1] = 1;
b[2] = 2;
b[3] = 3;
b[4] = 4;
by putting the #pragma unroll directive right before the loop. The advantage of the unrolled version is that it involves less work for the processor. In the for-loop version, the processing, in addition to assigning each i to b[i], involves initializing i, evaluating i<5 six times, and incrementing i five times. In the unrolled version, it only involves filling the b array (perhaps plus int i=5; if i is used later). Another benefit of loop unrolling is improved instruction-level parallelism (ILP): in the unrolled version, the processor may have more independent operations to push into its pipeline without having to evaluate the loop condition on every iteration.
Posts like this explain that runtime loop unrolling cannot happen in CUDA. In your case the CUDA compiler has no way of knowing that size is going to be 100, so compile-time loop unrolling will not occur, and if you force unrolling you may end up hurting performance.
If you are sure that the size is 100 for all executions, you can unroll your loop like below:
#pragma unroll
for(int i=0;i<SIZE;i++) //or simply for(int i=0;i<100;i++)
b[i]=i;
where SIZE is known at compile time through #define SIZE 100.
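As a minimal sketch of the compile-time case, the trip count can also be supplied as a template parameter instead of a macro. The helper name fill_indices is hypothetical, and this is plain host C++ so that it compiles anywhere; in CUDA device code the #pragma unroll line would sit directly above the loop.

```cpp
#include <cassert>

// Hypothetical helper: N is a compile-time constant, so the compiler knows
// the exact trip count and is free to unroll the loop completely.
template <int N>
void fill_indices(float* b)
{
    assert(b != nullptr);
    // In a CUDA kernel, "#pragma unroll" would go on the line above this loop.
    for (int i = 0; i < N; ++i)
        b[i] = static_cast<float>(i);
}
```

Calling fill_indices<100>(b) corresponds to the SIZE == 100 case. A different size needs a different instantiation, which is the price of making the trip count visible to the compiler.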
I also suggest adding proper CUDA error checking to your code (explained here).

Implementing Neural Network using CUDA

I am trying to create a Neural Network using CUDA:
My kernel looks like :
__global__ void feedForward(float *input, float *output, float **weight) {
//Here the threadId uniquely identifies weight in a neuron
int weightIndex = threadIdx.x;
//Here the blockId uniquely identifies a neuron
int neuronIndex = blockIdx.x;
if(neuronIndex<NO_OF_NEURONS && weightIndex<NO_OF_WEIGHTS)
output[neuronIndex] += weight[neuronIndex][weightIndex]
* input[weightIndex];
}
While copying the output back to host, I'm getting an error
Error unspecified launch failure at line xx
At line xx :
CUDA_CHECK_RETURN(cudaMemcpy(h_output, d_Output, output_size, cudaMemcpyDeviceToHost));
Am I doing something wrong here?
Is it because of how I'm using both the block index as well as thread index to reference the weight matrix.
Or does the problem lie elsewhere ?
I'm allocating the weight matrix as follows:
cudaMallocPitch((void**)&d_Weight, &pitch_W,input_size,NO_OF_NEURONS);
My kernel call is:
feedForward<<<NO_OF_NEURONS,NO_OF_WEIGHTS>>>(d_Input,d_Output,d_Weight);
After that I call:
cudaThreadSynchronize();
I am new to programming with CUDA.
Any help would be appreciated.
Thanks
There is a problem in the code that computes the output. Though it won't produce the error described, it will produce incorrect results.
int neuronIndex = blockIdx.x;
if(neuronIndex<NO_OF_NEURONS && weightIndex<NO_OF_WEIGHTS)
output[neuronIndex] += weight[neuronIndex][weightIndex] * input[weightIndex];
We can see that all threads in a single block are writing concurrently to one memory cell, so undefined results are expected. To avoid this, I suggest reducing all values within a block in shared memory and performing a single write to global memory. Something like this:
__global__ void feedForward(float *input, float *output, float **weight) {
    int weightIndex = threadIdx.x;
    int neuronIndex = blockIdx.x;
    __shared__ float out_reduce[NO_OF_WEIGHTS];
    out_reduce[weightIndex] =
        (weightIndex<NO_OF_WEIGHTS && neuronIndex<NO_OF_NEURONS) ?
        weight[neuronIndex][weightIndex] * input[weightIndex]
        : 0.0f;
    __syncthreads();
    // Tree reduction; assumes NO_OF_WEIGHTS is a power of two.
    // Start at half the array so that weightIndex + s stays in bounds.
    for (int s = NO_OF_WEIGHTS / 2; s > 0 ; s >>= 1)
    {
        if (weightIndex < s) out_reduce[weightIndex] += out_reduce[weightIndex + s];
        __syncthreads();
    }
    if (weightIndex == 0) output[neuronIndex] += out_reduce[weightIndex];
}
It turned out that I had to rewrite half of your small kernel to help with the reduction code...
I built a very simple MLP network using CUDA. You can find my code here if it interests you: https://github.com/PirosB3/CudaNeuralNetworks/
For any questions, just shoot!
Daniel
You're using cudaMallocPitch, but don't show how the variables are initialized; I'd be willing to bet this is where your error stems from. cudaMallocPitch is rather tricky; the 3rd parameter should be in bytes, while the 4th parameter is not. i.e.
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);
Is your variable input_size in bytes? If not, then you might be allocating too little memory (i.e. you'll think you're requesting 64 elements, but instead you'll be getting 64 bytes), and as such you'll be accessing memory out of range in your kernel. In my experience, an "unspecified launch failure" error usually means I have a segfault, i.e. an out-of-bounds memory access in the kernel.
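The addressing that goes with a pitched allocation can be sketched as plain arithmetic. One thing worth checking, as an assumption about the asker's code that the excerpt cannot confirm: cudaMallocPitch returns one linear allocation, not an array of row pointers, so indexing it as weight[neuronIndex][weightIndex] through a float** would read matrix data as if it were addresses. The helper names below are hypothetical, and the padding rule is only a stand-in for whatever pitch the driver actually picks.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the padding the driver applies: round the row
// width (in bytes) up to a multiple of `alignment`. The real pitch must be
// taken from the output parameter of cudaMallocPitch.
std::size_t padded_pitch(std::size_t width_bytes, std::size_t alignment)
{
    assert(alignment > 0);
    return (width_bytes + alignment - 1) / alignment * alignment;
}

// Byte offset of element (row, col) in a pitched float matrix:
// rows are pitch_bytes apart, columns are sizeof(float) apart.
std::size_t element_offset(std::size_t pitch_bytes, std::size_t row,
                           std::size_t col)
{
    return row * pitch_bytes + col * sizeof(float);
}
```

In a kernel the same arithmetic would read const float* row = (const float*)((const char*)weight + neuronIndex * pitch); followed by row[weightIndex], with weight passed as a float* and pitch passed as an extra kernel argument.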

Efficient method to check for matrix stability in CUDA

A number of algorithms iterate until a certain convergence criterion is reached (e.g. stability of a particular matrix). In many cases, one CUDA kernel must be launched per iteration. My question is: how then does one efficiently and accurately determine whether a matrix has changed over the course of the last kernel call? Here are three possibilities which seem equally unsatisfying:
Writing a global flag each time the matrix is modified inside the kernel. This works, but is highly inefficient and is not technically thread safe.
Using atomic operations to do the same as above. Again, this seems inefficient since in the worst case scenario one global write per thread occurs.
Using a reduction kernel to compute some parameter of the matrix (e.g. sum, mean, variance). This might be faster in some cases, but still seems like overkill. Also, it is possible to dream up cases where a matrix has changed but the sum/mean/variance haven't (e.g. two elements are swapped).
Is there any of the three options above, or an alternative, that is considered best practice and/or is generally more efficient?
I'll also go back to the answer I would have posted in 2012 but for a browser crash.
The basic idea is that you can use warp voting instructions to perform a simple, cheap reduction and then use zero or one atomic operations per block to update a pinned, mapped flag that the host can read after each kernel launch. Using a mapped flag eliminates the need for an explicit device to host transfer after each kernel launch.
This requires one word of shared memory per warp in the kernel, which is a small overhead, and some templating tricks can allow for loop unrolling if you provide the number of warps per block as a template parameter.
A complete working example (with C++ host code; I don't have access to a working PyCUDA installation at the moment) looks like this:
#include <cstdlib>
#include <vector>
#include <algorithm>
#include <assert.h>
__device__ unsigned int process(int & val)
{
return (++val < 10);
}
template<int nwarps>
__global__ void kernel(int *inout, unsigned int *kchanged)
{
__shared__ int wchanged[nwarps];
unsigned int laneid = threadIdx.x % warpSize;
unsigned int warpid = threadIdx.x / warpSize;
// Do calculations then check for change/convergence
// and set tchanged to be !=0 if required
int idx = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int tchanged = process(inout[idx]);
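// Simple blockwise reduction using voting primitives
// increments kchanged if any thread in the block
// returned tchanged != 0
// (__any_sync replaces the legacy __any, which is not available on Volta and later)
tchanged = __any_sync(0xffffffff, tchanged != 0);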
if (laneid == 0) {
wchanged[warpid] = tchanged;
}
__syncthreads();
if (threadIdx.x == 0) {
int bchanged = 0;
#pragma unroll
for(int i=0; i<nwarps; i++) {
bchanged |= wchanged[i];
}
if (bchanged) {
atomicAdd(kchanged, 1);
}
}
}
int main(void)
{
const int N = 2048;
const int min = 5, max = 15;
std::vector<int> data(N);
for(int i=0; i<N; i++) {
data[i] = min + (std::rand() % (int)(max - min + 1));
}
int* _data;
size_t datasz = sizeof(int) * (size_t)N;
cudaMalloc<int>(&_data, datasz);
cudaMemcpy(_data, &data[0], datasz, cudaMemcpyHostToDevice);
unsigned int *kchanged, *_kchanged;
cudaHostAlloc((void **)&kchanged, sizeof(unsigned int), cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&_kchanged, kchanged, 0);
const int nwarps = 4;
dim3 blcksz(32*nwarps), grdsz(16);
// Loop while the kernel signals it needs to run again
do {
*kchanged = 0;
kernel<nwarps><<<grdsz, blcksz>>>(_data, _kchanged);
cudaDeviceSynchronize();
} while (*kchanged != 0);
cudaMemcpy(&data[0], _data, datasz, cudaMemcpyDeviceToHost);
cudaDeviceReset();
int minval = *std::min_element(data.begin(), data.end());
assert(minval == 10);
return 0;
}
Here, kchanged is the flag the kernel uses to signal to the host that it needs to run again. The kernel runs until each entry in the input has been incremented to at least the threshold value. At the end of each thread's processing, it participates in a warp vote, after which one thread from each warp stores the vote result in shared memory. One thread then reduces the per-warp results and, if any warp reported a change, atomically updates the kchanged value. The host thread waits until the device is finished, and can then directly read the result from the mapped host variable.
You should be able to adapt this to whatever your application requires.
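The shape of that reduction (thread flag, then warp vote, then one result per block) can be followed in a sequential host model. This is a sketch rather than GPU code: count_changed_blocks and kWarpSize are hypothetical names, and the loops run one after another where the GPU would run lanes and warps concurrently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const std::size_t kWarpSize = 32;

// thread_changed[i] plays the role of tchanged for global thread i. The
// return value is what kchanged would hold after one launch: the number of
// blocks in which at least one thread reported a change.
unsigned int count_changed_blocks(const std::vector<int>& thread_changed,
                                  std::size_t warps_per_block)
{
    const std::size_t block_size = warps_per_block * kWarpSize;
    assert(block_size > 0 && thread_changed.size() % block_size == 0);

    unsigned int kchanged = 0;
    const std::size_t num_blocks = thread_changed.size() / block_size;
    for (std::size_t block = 0; block < num_blocks; ++block) {
        bool bchanged = false;
        for (std::size_t warp = 0; warp < warps_per_block; ++warp) {
            // Warp vote: true if any lane in this warp reports a change.
            bool wchanged = false;
            for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
                std::size_t tid = block * block_size + warp * kWarpSize + lane;
                wchanged = wchanged || (thread_changed[tid] != 0);
            }
            bchanged = bchanged || wchanged;  // thread 0 ORs the per-warp results
        }
        if (bchanged) ++kchanged;             // at most one atomic update per block
    }
    return kchanged;
}
```

The host loop only needs to know whether the result is zero, so the number of global updates is bounded by the number of blocks rather than the number of threads.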
I'll go back to my original suggestion. I've updated the related question with an answer of my own, which I believe is correct.
Create a flag in global memory:
__device__ int flag;
At each iteration, initialize the flag to zero (in host code):
int init_val = 0;
cudaMemcpyToSymbol(flag, &init_val, sizeof(int));
In your kernel device code, modify the flag to 1 if a change is made to the matrix:
__global__ void iter_kernel(float *matrix){
...
if (new_val[i] != matrix[i]){
matrix[i] = new_val[i];
flag = 1;}
...
}
After calling the kernel, at the end of the iteration (in host code), test for modification:
int modified = 0;
cudaMemcpyFromSymbol(&modified, flag, sizeof(int));
if (modified){
...
}
Even if multiple threads, in separate blocks or even separate grids, are writing the flag value, as long as the only thing they do is write the same value (i.e. 1 in this case), there is no hazard. The write will not get "lost" and no spurious values will show up in the flag variable.
Testing float or double quantities for equality in this fashion is questionable, but that doesn't seem to be the point of your question. If you have a preferred method to declare "modification" use that instead (such as testing for equality within a tolerance, perhaps).
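A tolerance-based test of that kind might look like the following sketch. The function name and both tolerance parameters are hypothetical, suitable values depend entirely on the algorithm, and the same expression could be used inside the kernel in place of the != comparison.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: report a modification only when the values differ by
// more than an absolute tolerance and by more than a relative tolerance
// scaled to the larger magnitude of the two.
bool modified(float old_val, float new_val, float abs_tol, float rel_tol)
{
    assert(abs_tol >= 0.0f && rel_tol >= 0.0f);
    float diff = std::fabs(new_val - old_val);
    float scale = std::fmax(std::fabs(old_val), std::fabs(new_val));
    return diff > std::fmax(abs_tol, rel_tol * scale);
}
```

The absolute tolerance covers values near zero, where a purely relative test would flag almost any difference as a change.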
Some obvious enhancements to this method would be to create one (local) flag variable per thread, and have each thread update the global flag variable once per kernel, rather than on every modification. This would result in at most one global write per thread per kernel. Another approach would be to keep one flag variable per block in shared memory, and have all threads simply update that variable. At the completion of the block, one write is made to global memory (if necessary) to update the global flag. We don't need to resort to complicated reductions in this case, because there is only one boolean result for the entire kernel, and we can tolerate multiple threads writing to either a shared or global variable, as long as all threads are writing the same value.
I can't see any reason to use atomics, or how it would benefit anything.
A reduction kernel seems like overkill, at least compared to one of the optimized approaches (e.g. a shared flag per block). And it would have the drawbacks you mention, such as the fact that anything less than a CRC or similarly complicated computation might alias two different matrix results as "the same".