cache behaviour on redundant writes - language-agnostic

Edit - I guess the question I asked was too long, so I'm making it very specific.
Question: Suppose a memory location is in the L1 cache, is not marked dirty, and holds a value X. What happens if you try to write X to the same location? Is there any CPU that would see that such a write is redundant and skip it?
For example, is there an optimization which compares the two values and discards a redundant write back to the main memory? Specifically, how do mainstream processors handle this? What about when the value is a special value like 0? If there's no such optimization even for a special value like 0, is there a reason?
Motivation: We have a buffer that can easily fit in the cache. Multiple threads could potentially use it by recycling it amongst themselves. Each use involves writing to n locations (not necessarily contiguous) in the buffer. Recycling simply means setting all values to 0. Each time we recycle, (size - n) locations are already 0. To me it seems (intuitively) that avoiding so many redundant write-backs would make the recycling process faster, hence the question.
Doing this in code wouldn't make sense, since the branch instruction itself might cause an unnecessary cache miss (if (buf[i]) {...}).

I am not aware of any processor that does the optimization you describe - eliminating writes to clean cache lines that would not change the value - but it's a good question, a good idea, great minds think alike and all that.
I wrote a great big reply, and then I remembered: this is called "Silent Stores" in the literature. See "Silent Stores for Free", K. Lepak and M. Lipasti, UWisc, MICRO-33, 2000.
Anyway, in my reply I described some of the implementation issues.
By the way, topics like this are often discussed in the Usenet newsgroup comp.arch.
I also write about them on my wiki, http://comp-arch.net

Your suggested hardware optimization would not reduce the latency. Consider the operations at the lowest level:
1) The old value at the location is loaded from the cache to the CPU (assuming it is already in the cache).
2) The old and new values are compared.
3) If the old and new values are different, the new value is written to the cache. Otherwise it is ignored.
Step 1 may actually take longer than steps 2 and 3 together, because steps 2 and 3 cannot start until the old value from step 1 has been brought into the CPU. The situation would be the same if it were implemented in software.
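As a rough illustration of the three steps, here is a toy Python model of a single cached word with a dirty bit. It is only a sketch: the class and method names are made up for this example, and it models bookkeeping, not real hardware timing. It shows the trade being discussed: the compare variant keeps the line clean on a redundant write, but it always pays for reading the old value first.

```python
class CacheLine:
    """Toy model of one cached word with a dirty bit (not real hardware)."""

    def __init__(self, value):
        self.value = value
        self.dirty = False   # clean: the cached value matches main memory
        self.reads = 0       # how often the old value had to be read first

    def store(self, new_value):
        """Plain store: always write and mark the line dirty."""
        self.value = new_value
        self.dirty = True

    def store_with_compare(self, new_value):
        """Hypothetical 'silent store' check following the three steps."""
        old = self.value             # step 1: load the old value
        self.reads += 1
        if old != new_value:         # step 2: compare old and new
            self.value = new_value   # step 3: write only on a difference
            self.dirty = True


plain = CacheLine(0)
plain.store(0)                 # redundant write still dirties the line

checked = CacheLine(0)
checked.store_with_compare(0)  # redundant write is dropped, line stays clean
```

In this model the plain store never waits on a read, which is the latency argument made in the steps above.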
Consider what happens if we simply write the new values to the cache without checking the old value. It is actually faster than the three-step process mentioned above, for two reasons. Firstly, there is no need to wait for the old value. Secondly, the CPU can simply schedule the write operation in an output buffer. The output buffer can perform the cache write simultaneously while the ALU starts working on something else.
So far, the only latencies involved are that of between the CPU and the cache, not between the cache and the main memory.
The situation is more complicated in modern-day microprocessors, because their cache is organized into cache-lines. When a byte value is written to a cache-line, the complete cache-line has to be loaded because the other part of the cache-line that is not rewritten has to keep its old values.
http://blogs.amd.com/developer/tag/sse4a/
Read
  Cache hit: Data is read from the cache line to the target register
  Cache miss: Data is moved from memory to the cache, and read into the target register
Write
  Cache hit: Data is moved from the register to the cache line
  Cache miss: The cache line is fetched into the cache, and the data from the register is moved to the cache line

This is not an answer to your original question on computer-architecture, but might be relevant to your goal.
In this discussion, all array indices start at zero.
Assuming n is much smaller than size, change your algorithm so that it keeps two pieces of information:
The full-size array, of length size.
An array of length n and a counter, used together to emulate a set container. Duplicate values are allowed.
Every time a non-zero value is written to index k in the full-size array, insert the value k into the set container.
When the full-size array needs to be cleared, get each value stored in the set container (which will contain k, among others), and set each corresponding index in the full-size array to zero.
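A minimal sketch of this bookkeeping in Python follows. The class name SparseClearBuffer and its methods are made up for illustration, and the sketch assumes at most n non-zero writes happen between two clears, as in the question.

```python
class SparseClearBuffer:
    """Full-size array plus a list of touched indices (duplicates allowed)."""

    def __init__(self, size, n):
        self.data = [0] * size    # the full-size array
        self.touched = [0] * n    # emulated set container of written indices
        self.count = 0            # number of valid entries in self.touched

    def write(self, k, value):
        self.data[k] = value
        if value != 0:
            # Record the index; raises IndexError after more than n writes.
            self.touched[self.count] = k
            self.count += 1

    def clear(self):
        # Zero only the recorded indices instead of the whole array.
        for i in range(self.count):
            self.data[self.touched[i]] = 0
        self.count = 0


buf = SparseClearBuffer(size=16, n=4)
buf.write(3, 7)
buf.write(9, 1)
buf.write(3, 2)   # duplicate index, recorded a second time
buf.clear()       # touches 3 recorded entries, not all 16 slots
```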
A similar technique, known as a two-level histogram or radix histogram, can also be used.
Two pieces of information are stored:
The full-size array, of length size.
A boolean array of length ceil(size / M), where M is the radix and ceil is the ceiling function.
Every time a non-zero value is written to index k in the full-size array, the element floor(k / M) in the boolean array should be marked.
Let's say, bool_array[j] is marked. This corresponds to the range from j*M to (j+1)*M-1 in the full-size array.
When the full-size array needs to be cleared, scan the boolean array for marked elements and, for each one, clear the corresponding range in the full-size array and unmark the element.
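The block-flag variant can be sketched the same way. Again the names (BlockFlagBuffer, marked, m) are hypothetical, chosen only for this example.

```python
class BlockFlagBuffer:
    """Full-size array plus one boolean flag per block of m elements."""

    def __init__(self, size, m):
        self.data = [0] * size
        self.m = m
        self.marked = [False] * ((size + m - 1) // m)   # ceil(size / m)

    def write(self, k, value):
        self.data[k] = value
        if value != 0:
            self.marked[k // self.m] = True   # floor(k / m)

    def clear(self):
        size = len(self.data)
        for j, is_marked in enumerate(self.marked):
            if is_marked:
                start = j * self.m
                end = min(start + self.m, size)   # last block may be short
                self.data[start:end] = [0] * (end - start)
                self.marked[j] = False


blocks = BlockFlagBuffer(size=10, m=4)
blocks.write(9, 5)        # marks block 2, which covers indices 8..9
marked_after_write = list(blocks.marked)
blocks.clear()
```

A larger M means fewer flags to scan but more redundant zeroing per marked block, so the best value depends on how clustered the writes are.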

Related

What's the best way to read in an entire LOB using ODBC?

Reading in an entire LOB whose size you don't know beforehand (without a max allocation + copy) should be a fairly common problem, but finding good documentation and/or examples on the "right" way to do this has proved utterly maddening for me.
I wrestled with SQLBindCol but couldn't see any good way to make it work. SQLDescribeCol and SQLColAttribute return column metadata that seemed to be a default or an upper bound on the column size and not the current LOB's actual size. In the end, I settled on using the following:
1) Put any / all LOB columns as the highest numbered columns in your SELECT statement
2) SQLPrepare the statement
3) SQLBindCol any earlier non-LOB columns that you want
4) SQLExecute the statement
5) SQLFetch a result row
6) SQLGetData on your LOB column with a buffer of size 0 just to query its actual size
7) Allocate a buffer just big enough to hold your LOB
8) SQLGetData again on your LOB column with your correctly sized allocated buffer this time
9) Repeat Steps 6-8 for each later LOB column
10) Repeat Steps 5-9 for any more rows in your result set
11) SQLCloseCursor when you are done with your result set
This seems to work for me, but also seems rather involved.
Are the calls to SQLGetData going back to the server or just processing the results already sent to the client?
Are there any gotchas where the server and/or client will refuse to process very large objects this way (e.g. - some size threshold is exceeded so they generate an error instead)?
Most importantly, is there a better way to do this?
Thanks!
I see several improvements to be done.
If you need to allocate a buffer, then you should do it once for all the records and columns. So, you could use the technique suggested by @RickJames, improved with a MAX like this:
SELECT MAX(LENGTH(blob1)) AS max1, MAX(LENGTH(blob2)) AS max2, ...
You could use max1 and max2 to upfront allocate the buffers, or maybe only the largest one for all columns.
The buffer length obtained this way might be too large for your application. You could decide at runtime how large the buffer should be. Anyway, SQLGetData is designed to be called multiple times for each column. Calling it again with the same column number fetches the next chunk. The count of available bytes is stored where StrLen_or_IndPtr (the last argument) points, and this count decreases after each call by the number of bytes fetched.
And certainly there will be roundtrips to the server for each call because the purpose of all this is to prevent the driver from fetching more than the application can handle.
The trick with passing NULL as buffer pointer in order to get the length is prohibited in this case, check SQLGetData on Microsoft's Docs.
However, you could allocate a minimal buffer, say 8 bytes, and pass it along with its length. The function will then write 7 bytes of data in our case, because it adds a null character, and will put the count of remaining bytes at StrLen_or_IndPtr. But you probably won't need this if you allocate the buffer as explained above.
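The repeated-call pattern described above can be sketched in Python against a stand-in object. To be clear, FakeLobColumn and get_data are invented for this illustration and are not part of ODBC or any driver; they only mimic the described behaviour, where each call hands back the next chunk together with the byte count that was still available before the call.

```python
class FakeLobColumn:
    """Stand-in for one LOB column; NOT an ODBC API."""

    def __init__(self, payload: bytes):
        self._payload = payload
        self._pos = 0

    def get_data(self, buffer_len: int):
        """Return (chunk, available): the next chunk and the number of
        bytes that were still available before this call."""
        available = len(self._payload) - self._pos
        chunk = self._payload[self._pos:self._pos + buffer_len]
        self._pos += len(chunk)
        return chunk, available


def read_whole_lob(column, buffer_len=8):
    """Call get_data repeatedly with a small buffer until the column is drained."""
    parts = []
    while True:
        chunk, available = column.get_data(buffer_len)
        parts.append(chunk)
        if available <= len(chunk):   # this call returned everything left
            break
    return b"".join(parts)


payload = bytes(range(20))
result = read_whole_lob(FakeLobColumn(payload), buffer_len=8)
```

With a real driver the first reported count could also be used to allocate one buffer of exactly the right size, which is the approach taken in the question.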
Note: The LOBs need to be at the end of the select list and must be fetched in that order precisely.
SQLGetData
SQLGetData gets data from a row that has already been fetched. For example, if you have called SQLFetch on the first row of your table, SQLGetData will return data from that first row. It is used when you don't know whether you can SQLBindCol the result.
But the way it is handled depends on your driver and is not described in the standard. If your database is a SQL database, the cursor cannot go backward, so the result may still be in memory.
Large object query
The server may refuse to process large objects, depending on the limits of the server and of your ODBC driver. This is not described in the ODBC standard.
To avoid a max allocation and an extra copy, and to be efficient:
Getting the size first is not a bad approach -- it takes virtually no extra time to do
SELECT LENGTH(your_blob) FROM ...
Then do the allocation and actually fetch the blob.
If there are multiple BLOB columns, grab all the lengths in a single pass:
SELECT LENGTH(blob1), LENGTH(blob2), ... FROM ...
In MySQL, the length of a BLOB or TEXT is readily available in front of the bytes. But, even if it must read the column to get the length, think of that as merely priming the cache. That is, the overall time is not hurt much in either case.

Data structure/Algorithm to manage non-overlapping ranges of values?

I'm working on a system that has 10's of thousands of flags for a user. The flags are all sequential in number, 0 through X, whatever X ends up being. X is expected to grow over time. And we're expecting to have lots and lots of users as well.
Our primary concerns are:
Being able to quickly test whether the user has set any given flag.
Being able to quickly set a flag.
Being able to optimize the data storage to as small a size as possible.
With 10k flags, we're looking at around 1k of data per user, in memory, if we use a bit vector, which might be too much. And to make matters worse, this is in Javascript, being stored in a document database serialized as JSON, which means that we have several storage options, none of which I particularly like.
Store flags as the JSON output of the Uint32Array object. Looks like: "{"0":10,"1":4294967295}". Unfortunately needs an average of 17 bytes per 4 bytes stored as the flags approach their filled state, which is over 4x the memory, and leads to about 5k of memory when serialized. This is not ideal.
Perform our own serialization of the JSON, using base64 to avoid the bloated sizes of the numbers-as-strings approach. Unfortunately that adds an extra processing step to the JSON input/output phases which complicates things because now we have to modify our data during the process and will slow everything down.
So... putting aside the bitvector idea for a bit. I was wondering if there's possibly a better approach. I considered using an "array of ranges", something like:
[{"m":0,"x":100},{"m":102},{"m":108,"x":204}]
We can make a few assumptions about the data in this system, which is what led me to this approach:
Flags are never un-set. Once it's set, it will remain set.
Flags are generally clustered. If flag X is set, there's a huge probability that X-1 and X+1 will be set as well.
Flags will generally be set at increasing index values. If Flag X is being set, then X-1 is more likely to be set than X+1, and X+1 is likely to be set fairly soon afterwards.
So because of these conditions, I think storing an array of range objects might be the optimal solution. That way, over time, the user's flags eventually condense down into one large range entry. The optimal case is of course:
[{"m":0,"x":10000}]
The worst case scenario, of course, is if they somehow manage to find themselves in a state where they set every other flag.
[{"m":0},{"m":2},{"m":4},{"m":6},{"m":8},{"m":10}...{"m":10000}]
That would be bad. Far worse than the bitvector solution, I think. But we're pretty confident that won't happen.
So, as to the ability to quickly decide if a flag is set: that's simply an O(log n) binary search (since the array will be sorted). Just find the range object closest to your number, check whether your number is in that range, and return.
Insertions are more tricky. It'll still be a binary search, but now we're modifying the array.
1) One adjacent sibling insert: optimal scenario. We find a range where the min or max is one off from the number we're inserting, and simply decrement or increment the value on the current range. O(1)
2) No adjacent siblings insert: simply insert a new node with the min set. O(n), because we'll be moving everything after it in the array downwards.
3) Two adjacent siblings insert: Change the max to the max value of the right sibling range, delete the right sibling range from the array and shift everything after it to the left. O(n).
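For reference, the lookup and the three insert cases can be sketched as follows. This is written in Python rather than Javascript, stores each range as an inclusive [lo, hi] pair instead of the {"m":..,"x":..} objects above, and uses a made-up class name, so treat it as an outline of the logic rather than the actual implementation.

```python
import bisect


class RangeSet:
    """Sorted, non-overlapping, non-adjacent inclusive ranges as [lo, hi] pairs."""

    def __init__(self):
        self.ranges = []

    def _find(self, flag):
        """Index of the last range whose lo <= flag, or -1 if there is none."""
        return bisect.bisect_right(self.ranges, [flag, float("inf")]) - 1

    def has(self, flag):
        i = self._find(flag)                     # O(log n) binary search
        return i >= 0 and self.ranges[i][1] >= flag

    def add(self, flag):
        i = self._find(flag)
        if i >= 0 and self.ranges[i][1] >= flag:
            return                               # already set
        left = i >= 0 and self.ranges[i][1] == flag - 1
        right = (i + 1 < len(self.ranges)
                 and self.ranges[i + 1][0] == flag + 1)
        if left and right:                       # case 3: merge two siblings
            self.ranges[i][1] = self.ranges[i + 1][1]
            del self.ranges[i + 1]               # O(n) shift
        elif left:                               # case 1: extend max
            self.ranges[i][1] = flag
        elif right:                              # case 1: extend min
            self.ranges[i + 1][0] = flag
        else:                                    # case 2: new single-flag range
            self.ranges.insert(i + 1, [flag, flag])   # O(n) shift


flags = RangeSet()
for f in (0, 1, 2, 5, 4):
    flags.add(f)
before_merge = [list(r) for r in flags.ranges]   # [[0, 2], [4, 5]]
flags.add(3)                                     # bridges the gap
```

With the stated assumptions (clustered, mostly increasing flags) the array stays short, so the O(n) shifts in cases 2 and 3 move very few elements in practice.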
So cases 2+3 have me wondering if I shouldn't try to use some sort of balanced binary search tree for this. A Red-Black tree, for example.
Is that worth the bother? Am I overthinking this?

performance differences between iterative and recursive method in mips

I have 2 versions of a program I must analyze. One is recursive and the other is iterative. I must compare the cache hit rate for both and examine how performance varies between the two methods, as well as their instruction counts.
For both methods, regardless of the block setting, I get roughly 100 fewer memory accesses for the iterative method. Both trigger 2 misses. I can only drop the cache hit rate to 85% if I set the setting to 1 block of size 256.
For the instruction count, the iterative version uses roughly 1000 fewer instructions.
Can someone explain to me why this happens, or point me to some literature I can read about this? I can't seem to find anything. I would just like a general overview of why this occurs.
Took some info from this question: Recursion or Iteration? and some from COMP 273 at McGill, which I assume you're also in and that's why you asked.
Each recursive call generally requires the return address of that call to be pushed onto the stack; in MIPS (assembly) this must be done manually otherwise the return address gets overwritten with each jal. As such, usually more cache space is used for a recursive solution and as such the memory access count is higher. In an iterative solution this isn't necessary whatsoever.

How to create Large Bit array in cuda?

I need to keep track of around 10000 elements of an array in my algorithm. So for this I need a boolean for each record. If I used a char array to keep track of the 10000 records (as 0/1), it would take up a lot of memory.
So can I create a bit array of 10000 bits in CUDA where each bit represents the corresponding array record?
As Roger said, the answer is yes, CUDA provides the same bitwise operations (i.e. >>, << and &) as normal C so you can implement your bit array essentially normally (almost, see thread synchronisation issues below).
However, for your situation it is almost certainly not a good idea.
There are problems with thread synchronisation. Imagine two of the threads on your GPU are inverting two different bits of a single entry of your array. Each thread will read the same value out of memory and apply its operation to it, but the thread that writes its value back to memory last will overwrite the result of the other thread. (Note: if your bit array is not being modified by the GPU code then this isn't a problem.)
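The lost update can be replayed by hand. The following Python snippet is a deterministic, single-threaded re-enactment of one bad interleaving, not real concurrent code, and the variable names are invented for the illustration.

```python
word = 0b0000          # shared array entry, as it sits in memory

a_copy = word          # thread A reads the entry
b_copy = word          # thread B reads the same old value
a_copy ^= 1 << 0       # A inverts bit 0 in its private copy
b_copy ^= 1 << 1       # B inverts bit 1 in its private copy
word = a_copy          # A writes back first
word = b_copy          # B writes back last and overwrites A's change

lost_update_result = word    # only B's bit survived
intended_result = 0b0011     # what two indivisible updates would give
```

On the GPU, atomic read-modify-write functions such as atomicOr, atomicAnd and atomicXor avoid this interleaving for a single word, at some cost in throughput when many threads hit the same word.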
And, unless this is explicitly required, you shouldn't be optimising for memory use. An array with 10K elements does not take much memory at all: even if you were storing each boolean in a 64-bit integer, it's only 80 KB. And obviously you can store them in a smaller datatype. (You should only start worrying about compressing the array as much as possible when you are getting upwards of tens of millions, or even hundreds of millions, of elements.)
Also, the way GPUs work means that you might get best performance by using a reasonably large data type for each boolean (most likely a 32 bit one) so that, for example, memory coalescing works better. (I haven't tested this assertion, you would need to run some benchmarks to check it.)

What's the difference between safe, regular and atomic registers?

Can anyone provide exhaustive explanation, please? I'm diving into concurrent programming and met those registers while trying to understand consensus.
From Lamport's "On interprocess communication": ...a regular register is atomic if two successive reads that overlap the same write cannot obtain the new then the old value....
Assume that thread0.write(0) comes first, with no overlapping. Basically, using Lamport's definition, one can say that thread1 can first read 1 and then read 0 again, if both reads are consecutive and overlap with thread0.write(1). But how is that possible?
Reads and writes to a shared memory location take a finite period of time, so they may either overlap, or be completely distinct.
e.g.
Thread 1:     wwwww     wwwww
Thread 2:   rrrrr               rrrrr
Thread 3:  rrrrr rrrrr
The first read from thread 2 overlaps with the first write from thread 1, whilst the second read and second write do not overlap. In thread 3, both reads overlap the first write.
A safe register is only safe as far as reads that do not overlap writes. If a read does not overlap any writes then it must read the value written by the most recent write. Otherwise it may return any value that the register may hold. So, in thread 2, the second read must return the value written by the second write, but the first read can return any valid value.
A regular register adds the additional guarantee that if a read overlaps with a write then it will either read the old value or the new one, but multiple reads that overlap the write do not have to agree on which, and the value may appear to "flicker" back and forth. This means that two reads from the same thread (such as in thread 3 above) that both overlap the write may appear "out of order": the earlier read returning the new value, and the later returning the old value.
An atomic register guarantees that each read and write appears to happen at a single point in time. Readers that act before a write's point will all read the old value, and readers that act after that point will all read the new value. In particular, if two reads from the same thread overlap a write then the later read cannot return the old value if the earlier read returns the new one. Atomic registers are linearizable.
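To make the three definitions concrete, here is a small Python sketch that enumerates which pairs of results are permitted for two successive reads by one thread when both overlap a single write (the thread 3 situation above). The function name and the string labels are made up for this example, and it covers only this one scenario.

```python
def allowed_outcomes(kind, old, new, domain):
    """Pairs (first_read, second_read) permitted when both reads overlap
    one write that replaces `old` with `new`."""
    if kind == "safe":
        values = domain            # any value the register may hold
    else:
        values = [old, new]        # regular and atomic: old or new only
    pairs = {(a, b) for a in values for b in values}
    if kind == "atomic":
        pairs.discard((new, old))  # no new-then-old inversion
    return pairs


domain = [0, 1, 2]
safe = allowed_outcomes("safe", old=0, new=1, domain=domain)
regular = allowed_outcomes("regular", old=0, new=1, domain=domain)
atomic = allowed_outcomes("atomic", old=0, new=1, domain=domain)
```

The pair (1, 0), reading the new value and then the old one, is in the regular set but not in the atomic set, which is exactly the condition in the Lamport quote from the question.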
The Art of Multiprocessor Programming by Maurice Herlihy and Nir Shavit gives a good description, along with examples and use cases.