I have a CUDA loop in which a variable, cumulative_value, accumulates a sum in double precision:
double cumulative_value = (double)0;
loop(...)
{
// ...
double valueY = computeValueY();
// ...
cumulative_value += valueY;
}
This code is compiled with two different SDK versions and run on two computers:
M1 : TeslaM2075 CUDA 5.0
M2 : TeslaM2075 CUDA 7.5
At step 10, the results differ. The operands and results of this addition (double-precision representation in hexadecimal) are:
0x 41 0d d3 17 34 79 27 4d => cumulative_value
+ 0x 40 b6 60 1d 78 6f 09 b0 => valueY
-------------------------------------------------------
=
0x 41 0e 86 18 20 3c 9f 9b (for M1)
0x 41 0e 86 18 20 3c 9f 9a (for M2)
As far as I can see, the rounding mode is not specified in the PTX file (the instruction is a plain add.f64), but M1 seems to use round toward plus infinity and M2 another mode.
If I force one of the four rounding modes on M2 for this instruction (__dadd_XX()), cumulative_value always differs from M1's value, even before step 10.
If I force the same rounding mode on both M1 and M2, the two results match each other, but they no longer equal the M1 results from before the modification.
My aim is to reproduce the M1 (CUDA 5.0) results on the M2 machine (CUDA 7.5), but I don't understand the default rounding behavior at runtime. Is the rounding mode chosen dynamically at runtime when it is not specified? Do you have any idea?
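As a host-side cross-check, the step-10 addition can be replayed outside CUDA. The sketch below is Python, not device code, and assumes the two operands are exactly the bit patterns listed above; the helper names are made up for illustration. Python floats are IEEE 754 binary64 and add with round-to-nearest-even, which is also what the PTX documentation gives as the default (.rn) for add.f64 when no modifier is written.

```python
import struct
from fractions import Fraction

def bits_to_double(hex_bits: str) -> float:
    """Interpret 16 hex digits as a big-endian IEEE 754 binary64 value."""
    return struct.unpack(">d", bytes.fromhex(hex_bits))[0]

def double_to_bits(x: float) -> str:
    """Return the 16-hex-digit bit pattern of a binary64 value."""
    return struct.pack(">d", x).hex()

cumulative_value = bits_to_double("410dd3173479274d")
value_y = bits_to_double("40b6601d786f09b0")

# Host addition: one rounding step, round-to-nearest-even.
nearest = cumulative_value + value_y
print(double_to_bits(nearest))

# Compare the exact rational sum with the two candidate results.
lower = bits_to_double("410e8618203c9f9a")  # reported for M2
upper = bits_to_double("410e8618203c9f9b")  # reported for M1
exact = Fraction(cumulative_value) + Fraction(value_y)
is_tie = (exact - Fraction(lower)) == (Fraction(upper) - exact)
print(is_tie)
```

For these operands the exact sum lies exactly halfway between the two candidates, so round-to-nearest-even yields ...9a and rounding up yields ...9b.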
After further PTX analysis, it turns out that in my case valueY is computed with an FMA instruction under CUDA 5.0, while the CUDA 7.5 compiler emits separate MUL and ADD instructions. The CUDA documentation explains that a single FMA instruction has only one rounding step, whereas MUL followed by ADD has two. Thank you very much for helping me :)
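The effect of one rounding step versus two can be illustrated on the host. The following is a minimal Python sketch, not CUDA: the fused operation is emulated with exact rational arithmetic and a single final rounding, and the function names and input values are chosen purely for illustration.

```python
from fractions import Fraction

def mul_then_add(a: float, b: float, c: float) -> float:
    """Two roundings: the product is rounded, then the sum is rounded."""
    return a * b + c

def fused_multiply_add(a: float, b: float, c: float) -> float:
    """One rounding: the exact value of a*b + c is rounded once.

    Emulated with rationals; float(Fraction) rounds to nearest.
    """
    return float(Fraction(a) * Fraction(b) + Fraction(c))

a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

# The exact product is 1 - 2**-60, which rounds to 1.0 as a double,
# so the unfused version loses the small term entirely.
unfused = mul_then_add(a, b, c)
fused = fused_multiply_add(a, b, c)
print(unfused, fused)
```

With these inputs the unfused result is 0.0 while the fused result is -2**-60, so a compiler that contracts MUL and ADD into FMA (or stops doing so) can change the low bits of valueY. In CUDA, the nvcc option --fmad=false or the explicit __dmul_rn()/__dadd_rn() and __fma_rn() intrinsics are the usual ways to pin down which form is used.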
I'm in the process of writing some N-body simulation code with short-ranged interactions in CUDA, targeted at Volta and Turing series cards. I plan on using shared memory, but it's not quite clear to me how to avoid bank conflicts when doing so. Since my interactions are local, I was planning on sorting my particle data into local groups that I can send to each SM's shared memory (not yet worrying about particles that have a neighbor being worked on by another SM). In order to get good performance (avoid bank conflicts), is it sufficient only that each thread reads/writes from/to a different address of shared memory, but each thread may access that memory non-sequentially without penalty?
All of the information I see seems to mention only that memory accesses should be coalesced for the copy from global memory to shared memory, but I don't see anything about whether threads in a warp (or the whole SM) care about coalescence in shared memory.
In order to get good performance (avoid bank conflicts), is it sufficient only that each thread reads/writes from/to a different address of shared memory, but each thread may access that memory non-sequentially without penalty?
Bank conflicts are only possible between threads in a single warp that are performing a shared memory access, and then only on a per-instruction (issued) basis. The instructions I am talking about here are SASS (GPU assembly code) instructions, but they should nevertheless be directly identifiable from shared memory references in CUDA C++ source code.
There is no such thing as a bank conflict:
between threads in different warps
between shared memory accesses arising from different (issued) instructions
A given thread may access shared memory in any pattern, with no concern about or possibility of shared memory bank conflicts due to its own activity. Bank conflicts only arise between 2 or more threads in a single warp, as a result of a particular shared memory instruction or access, issued warp-wide.
Furthermore it is not sufficient that each thread reads/writes from/to a different address. For a given issued instruction (i.e. a given access) roughly speaking, each thread in the warp must read from a different bank, or else it must read from an address that is the same as another address in the warp (broadcast).
Let's assume that we are referring to 32-bit banks, and an arrangement of 32 banks.
Shared memory can readily be imagined as a 2D arrangement:
Addr Bank
v     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
0     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
32   32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
64   64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
96   96 97 98 ...
We see that addresses (or indices, offsets, locations) 0, 32, 64, 96, etc. are in the same bank. Addresses 1, 33, 65, 97, etc. are in the same bank, and so on, for each of the 32 banks. Banks are like columns of locations when the addresses of shared memory are visualized in this 2D arrangement.
The requirement for non-bank-conflicted access for a given instruction (load or store) issued to a warp is:
no 2 threads in the warp may access locations in the same bank/column.
a special case exists if the locations in the same column are actually the same location. This invokes the broadcast rule and does not lead to bank conflicts.
And to repeat some statements above in a slightly different way:
If I have a loop in CUDA code, there is no possibility for bank conflicts to arise between separate iterations of that loop
If I have two separate lines of CUDA C++ code, there is no possibility for bank conflicts to arise between those two separate lines of CUDA C++ code.
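The rule above can be sketched as a small host-side model. This is a simplified illustration in Python, assuming 32 banks of 32-bit words as stated earlier; conflict_degree is a hypothetical helper, not a CUDA API. It takes the 32-bit word index that each thread of one warp touches in a single instruction and counts how many distinct words land in the busiest bank.

```python
from collections import defaultdict

NUM_BANKS = 32  # assumption from the text: 32 banks, each 32 bits wide

def conflict_degree(word_indices):
    """Return the worst-case number of distinct words per bank.

    word_indices: the 32-bit word index accessed by each thread of one
    warp for a single issued instruction. A result of 1 means the access
    is conflict-free; n means an n-way bank conflict in this model.
    Threads reading the very same word count once (broadcast).
    """
    words_per_bank = defaultdict(set)
    for word in word_indices:
        words_per_bank[word % NUM_BANKS].add(word)
    return max(len(words) for words in words_per_bank.values())

warp = range(32)
print(conflict_degree([t for t in warp]))        # consecutive words
print(conflict_degree([2 * t for t in warp]))    # stride of 2 words
print(conflict_degree([32 * t for t in warp]))   # stride of 32 words
print(conflict_degree([33 * t for t in warp]))   # stride of 33 words
print(conflict_degree([7 for _ in warp]))        # all threads, same word
```

In this model a stride of 32 words puts every thread in the same column, while a stride of 33 (the usual padding trick) spreads the threads over all 32 banks again, even though the accesses are far from sequential.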
As the title says, I'm curious how the checksum value is calculated; from what I've read, it is calculated using two's complement. Below are two lines from the hex file that was loaded onto my microcontroller. I've added spaces to make them easier to read. S315 appears on every line; the address on line 1 is 080C0000, followed by 16 hex values that represent the data bytes. I assume the values AA on line 1 and AB on line 2 are the checksums.
For line 1 I tried adding 15+08+0C+00+00+4D+53+53+70+6F+74+31+00+66+10+AE+19+7E+63+1F+78, which gives 555 hex, or 010101010101 in binary. I entered the binary value into an online two's complement calculator, but it always says "invalid binary".
S3 15 080C0000 4D 53 53 70 6F 74 31 00 66 10 AE 19 7E 63 1F 78 AA
S3 15 080C0010 00 00 00 00 45 85 63 EB FF FF FF FF 04 00 03 00 AB
You add the byte values, as you've done. From that sum you take only the least significant byte.
Then for Motorola HEX (SREC):
You take the one's complement of that byte by inverting its bits (i.e. 1s turn into 0s and vice versa).
Then for Intel HEX:
You take the two's complement of that byte by inverting its bits (i.e. 1s turn into 0s and vice versa) and then adding 1.
Going by your example you have the sum 0x555. Then take the least significant byte, which is 0x55.
For Motorola HEX (SREC): Calculate the one's complement of that. You get 0xAA as the checksum.
For Intel HEX: Calculate the two's complement of that. You get 0xAB as the checksum.
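The steps above can be sketched in a few lines of Python. The function names are made up for illustration; the input is everything between the "S3" record type and the trailing checksum byte (byte count, address, data), taken from the two lines in the question.

```python
def low_byte_of_sum(record_hex: str) -> int:
    """Sum all bytes of the record body and keep the least significant byte."""
    return sum(bytes.fromhex(record_hex)) & 0xFF

def srec_checksum(record_hex: str) -> int:
    """Motorola SREC: one's complement of the low byte of the sum."""
    return ~low_byte_of_sum(record_hex) & 0xFF

def ihex_checksum(record_hex: str) -> int:
    """Intel HEX: two's complement of the low byte of the sum."""
    return -low_byte_of_sum(record_hex) & 0xFF

line1 = "15 080C0000 4D 53 53 70 6F 74 31 00 66 10 AE 19 7E 63 1F 78"
line2 = "15 080C0010 00 00 00 00 45 85 63 EB FF FF FF FF 04 00 03 00"

print(f"{srec_checksum(line1):02X}")  # line 1, SREC rule
print(f"{srec_checksum(line2):02X}")  # line 2, SREC rule
print(f"{ihex_checksum(line1):02X}")  # line 1 bytes under the Intel HEX rule
```

Since the file uses S3 records, the SREC rule is the one that applies: line 1 sums to 0x555 and line 2 to 0x654, giving AA and AB respectively. The AB at the end of line 2 is therefore the one's complement for that line; it only coincides with the two's complement of line 1's low byte.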
I have tried some of the online references, as well as the Unix time format etc., but none of these seem to work. See the examples below.
Running MySQL 5.5.5 on Ubuntu with the InnoDB engine.
Nothing is custom. This is using a built-in datetime function.
Here are some examples with the 6-byte hex string and the decoded date/time next to it. We are looking for the decoding algorithm, i.e. how to turn the 6-byte hex string into the correct date/time. The algorithm must work correctly on the examples below. The rightmost byte seems to correctly track the difference in seconds for small differences in time between records; for instance, we show an example with a 14-second difference.
The full records, nicely highlighted and formatted, are in this Word document:
https://www.dropbox.com/s/zsqy9o2rw1h0e09/mysql%20datetime%20examples%20.docx?dl=0
contact frank%simrex.com re. reward.
replace % with #
Hex strings and decoded date/time pairs are below, pulled from a healthy file running MySQL:
12 51 72 78 B9 46 ... 2014-10-22 16:53:18
12 51 72 78 B9 54 ... 2014-10-22 16:53:32
12 51 72 78 BA 13 ... 2014-10-22 16:55:23
12 51 72 78 CC 27 ... 2014-10-22 17:01:51
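Here you go. Read the six bytes as one hexadecimal integer; its decimal digits are the date and time in the form YYYYMMDDHHMMSS: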
select str_to_date(conv(replace('12 51 72 78 CC 27',' ', ''), 16, 10), '%Y%m%d%H%i%s')
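The same decoding can be checked outside MySQL. This is a small Python sketch of the query above, with decode_packed_datetime as an illustrative name; it is verified only against the four samples in the question.

```python
from datetime import datetime

def decode_packed_datetime(hex_bytes: str) -> datetime:
    """Decode a 6-byte hex string whose integer value reads YYYYMMDDHHMMSS."""
    digits = str(int(hex_bytes.replace(" ", ""), 16))
    return datetime.strptime(digits, "%Y%m%d%H%M%S")

samples = [
    "12 51 72 78 B9 46",
    "12 51 72 78 B9 54",
    "12 51 72 78 BA 13",
    "12 51 72 78 CC 27",
]
for sample in samples:
    print(sample, "...", decode_packed_datetime(sample))
```

This also explains why the last byte appears to count seconds: 0x54 - 0x46 = 14, which matches the 14-second gap between the first two records as long as the seconds digits do not roll over.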
I have a video with an unknown frame rate. I need to calculate the frame rate it was encoded for. I am trying to calculate it using the data in SPS but I cannot decode it.
The bitstream for the NAL is :
67 64 00 1e ac d9 40 a0 2f f9 61 00 00 03 00 7d 00 00 17 6a 0f 16 2d 96
From an online guide (http://www.cardinalpeak.com/blog/the-h-264-sequence-parameter-set/), I could figure out its profile and level fields, but to figure out everything after the "seq_parameter_set_id" field in the table, I need to know the ue(v). Here is where I get confused. According to this page the "ue(v)" should be called with the value v=32? (why?) What exactly should I feed into the exponential-golomb function? Do I read 32 digits from the beginning of the bitstream, or from after the previously read bytes, to regard it as the "seq_parameter_set_id"?
( My ultimate goal is to decode the VUI parameters so that I can recalculate the framerate.)
Thanks!
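ue = unsigned Exponential-Golomb coding.
(v) = variable number of bits. The v is not a value you pass in: the length of each field is determined by the bits themselves, and you start reading at the current bit position, right after the previously read fields.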
http://en.wikipedia.org/wiki/Exponential-Golomb_coding
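To make this concrete, here is a minimal Python sketch of a bit reader with a ue(v) decoder, applied to the first fields of the NAL unit from the question. BitReader and its methods are illustrative names, not part of any library. The sketch stops after chroma_format_idc; to reach the VUI parameters, the remaining SPS fields would have to be read in the order given by the H.264 syntax table, and the emulation-prevention byte in the sequence 00 00 03 would have to be removed first.

```python
class BitReader:
    """Reads bits MSB-first from a byte string."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # current position, in bits

    def read_bit(self) -> int:
        byte = self.data[self.pos // 8]
        bit = (byte >> (7 - self.pos % 8)) & 1
        self.pos += 1
        return bit

    def read_bits(self, count: int) -> int:
        value = 0
        for _ in range(count):
            value = (value << 1) | self.read_bit()
        return value

    def read_ue(self) -> int:
        """Decode one unsigned Exp-Golomb value, ue(v)."""
        leading_zeros = 0
        while self.read_bit() == 0:
            leading_zeros += 1
        # The terminating 1 bit is consumed; read the same number of
        # suffix bits as there were leading zeros.
        return (1 << leading_zeros) - 1 + self.read_bits(leading_zeros)

nal = bytes.fromhex(
    "67 64 00 1e ac d9 40 a0 2f f9 61 00 00 03 00 7d 00 00 17 6a 0f 16 2d 96"
)
reader = BitReader(nal[1:])            # skip the NAL header byte 0x67
profile_idc = reader.read_bits(8)      # fixed-length u(8)
constraint_flags = reader.read_bits(8)
level_idc = reader.read_bits(8)
seq_parameter_set_id = reader.read_ue()
chroma_format_idc = reader.read_ue()   # present because profile_idc is 100
print(profile_idc, level_idc, seq_parameter_set_id, chroma_format_idc)
```

The byte 0xac is 1010 1100 in binary: the leading 1 decodes to seq_parameter_set_id = 0, and the following 010 decodes to chroma_format_idc = 1.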
I'm using the crypto module to create salts and hashes for storage in my database. I'm using SHA-512, if that's relevant.
What I have is a 64 byte salt, presently in the form of a "SlowBuffer", created by crypto.randomBytes(64, . . .). Example:
<SlowBuffer 91 0d e9 23 c0 8e 8c 32 5c d6 0b 1e 2c f0 30 0c 17 95 5c c3 95 41 fd 1e e7 6f 6e f0 19 b6 34 1a d0 51 d3 b2 8d 32 2d b8 cd be c8 92 e3 e5 48 93 f6 a7 81 ...>
I also have a 64-byte hash that is currently a string. Example:
'de4c2ff99fb34242646a324885db79ca9ef82a5f4b36c657b83ecf6931c008de87b6daf99a1c46336f36687d0ab1fc9b91f5bc07e7c3913bec3844993fd2fbad'
In my database, I have two fields, called passhash and passsalt, which are binary(64)s.
I'm using the mysql module (require('mysql')) to add the row. How can I include the binaries for insertion?
First of all, I'm no longer using the mysql module, but the mysql2 module (because it supports prepared statements). This changes roughly nothing about the question, but I'd like to mention that those reading this who are using 'mysql' should probably use 'mysql2'.
Second, both of these modules can take Buffers as parameters for these fields. This works perfectly. To do what I was originally attempting to do, I have it like this:
var hash; // Pretend this is initialized as a 64-byte Buffer
var salt; // " "
connection.query('INSERT INTO users SET ?', {..., passhash: hash, passsalt: salt, ...}, callback);
Additionally, I didn't realize that the crypto "digest" function has a default behavior when called with no parameter, which is to return a Buffer.
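The relationship between the 128-character hex string and the 64 raw bytes that a binary(64) column expects can be shown with a short Python analogue. This is only an illustration of the byte/hex relationship, not Node or mysql2 code, and the salt-plus-password concatenation is a placeholder rather than the scheme used above.

```python
import hashlib
import os

salt = os.urandom(64)  # 64 random bytes, analogous to crypto.randomBytes(64)

# Placeholder input; how salt and password are combined is an assumption.
hasher = hashlib.sha512(salt + b"example password")

raw_digest = hasher.digest()     # 64 raw bytes, fits binary(64)
hex_digest = hasher.hexdigest()  # 128 hex characters, does not

print(len(raw_digest), len(hex_digest))

# A hex string can always be turned back into the raw bytes.
assert bytes.fromhex(hex_digest) == raw_digest
```

In Node the corresponding conversion from a hex string to a Buffer is Buffer.from(hexString, 'hex') in current versions.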
This is not the best worded answer, but that's because no one seems to be paying much attention to this question, and it was my question. :) If you would like more clarification, feel free to comment.