As far as I understand, a ready warp is one that can be issued by the warp scheduler, while a waiting warp is still waiting for source operands to be fetched or computed, so it cannot be issued. The warp scheduler chooses a ready warp to execute; this is "warp scheduling".
On the other hand, when a warp hits a pipeline stall or a long global memory latency, another warp is brought into execution to hide the latency. This is the basic idea of "warp context switching" in CUDA.
My question is: what is the relation between warp scheduling and warp context switching in CUDA? To elaborate, here is an example.
E.g. warp A stalls, waiting for data to be fetched from global memory; once the data arrives, warp A is scheduled (switched) back into the ready warp pool. On this reading, warp context switching is a part of warp scheduling. Is that correct?
Can anyone also provide references on warp context switching and warp scheduling in CUDA? It seems NVIDIA does not make these documents publicly available.
Thanks in advance for any reply.
The ready warps are those which can be scheduled on the next cycle. Stalled warps cannot be scheduled.
To answer the question about latency with an extremely simplified example, suppose that the latency to main memory is 8 execution cycles, and let's ignore the fact that the machine is pipelined. Let's assume all instructions can execute in one cycle, if the data is ready.
Now suppose I have C code like this:
int idx = threadIdx.x+blockDim.x*blockIdx.x;
int myval = global_data[idx]*global_data[idx];
That is, myval should contain the square of an item in global memory, when the code is complete. This will be decomposed into a sequence of assembly language instructions. Let's suppose they look something like this:
I0: R0 = global_data[idx];
I1: R1 = R0 * R0;
I2: ...
Every thread can execute the first line of code without stalling; there is no dependency yet, and a read by itself does not cause a stall. However, when each thread moves on to the second line of code, the value of R0 must be correct, so a stall occurs while the read is retrieved. As mentioned already, suppose the latency is 8 cycles; with a warp width of 32 and a threadblock size of 512, we have a total of 16 warps. Let's suppose for simplicity we have a Fermi SM with only 32 units of execution. The sequence will look something like this:
cycle:  ready warps:  executing warp:  instruction executed:
  0     1-16          0                I0 -> I1 (stall)   <- warp 0's read latency begins
  1     2-16          1                I0 -> I1 (stall)
  2     3-16          2                I0 -> I1 (stall)
  3     4-16          3                I0 -> I1 (stall)
  4     5-16          4                I0 -> I1 (stall)
  5     6-16          5                I0 -> I1 (stall)
  6     7-16          6                I0 -> I1 (stall)
  7     8-16          7                I0 -> I1 (stall)
  8     0,9-16        8                I0 -> I1 (stall)   <- warp 0's read latency satisfied
  9     1,9-16        0                I1 -> I2           <- warp 1's read latency satisfied
What we see is that after the latency is fulfilled by executing instructions from other warps, a previously "stalled" warp will re-enter the ready warp pool, and it's possible for the scheduler to schedule that warp again (i.e. to do the multiply operation contained in I1) on the very next cycle after the stall condition is removed.
There is no contradiction between latency hiding and warp scheduling. They work together, for a code with sufficient work to do, to hide the latency associated with various operations, such as reading from global memory.
The above example is a simplification compared to actual behavior, but it adequately represents the concepts of latency hiding and warp scheduling, to demonstrate how warp scheduling, in the presence of "enough work to do", can hide latency.
Quoting from the Independent Thread Scheduling section (page 27) of the Volta whitepaper:
Note that execution is still SIMT: at any given clock cycle, CUDA cores execute the same instruction for all active threads in a warp just as before, retaining the execution efficiency of previous architectures.
From my understanding, this implies that if there is no divergence within a warp (i.e. all threads of the warp are active), the threads should execute in lockstep.
Now, consider listing 8 from this blog post, reproduced below:
unsigned tid = threadIdx.x;
int v = 0;
v += shmem[tid+16]; __syncwarp(); // 1
shmem[tid] = v; __syncwarp(); // 2
v += shmem[tid+8]; __syncwarp(); // 3
shmem[tid] = v; __syncwarp(); // 4
v += shmem[tid+4]; __syncwarp(); // 5
shmem[tid] = v; __syncwarp(); // 6
v += shmem[tid+2]; __syncwarp(); // 7
shmem[tid] = v; __syncwarp(); // 8
v += shmem[tid+1]; __syncwarp(); // 9
shmem[tid] = v;
Since we don't have any divergence here, I would expect the threads to already be executing in lockstep without
any of the __syncwarp() calls.
This seems to contradict the statement I quote above.
I would appreciate it if someone could clarify this confusion.
From my understanding, this implies that if there is no divergence within a warp (i.e. all threads of the warp are active), the threads should execute in lockstep.
If all threads in a warp are active for a particular instruction, then by definition there is no divergence. This has been true since day 1 in CUDA. It's not logical in my view to connect your statement with the one you excerpted, because it is a different case:
Note that execution is still SIMT: at any given clock cycle, CUDA cores execute the same instruction for all active threads in a warp just as before, retaining the execution efficiency of previous architectures
This indicates that the active threads are in lockstep. Divergence is still possible. The inactive threads (if any) would be somehow divergent from the active threads. Note that both of these statements are describing the CUDA SIMT model and they have been correct and true since day 1 of CUDA. They are not specific to the Volta execution model.
For the remainder of your question, I guess instead of this:
I would appreciate it if someone could clarify this confusion.
You are asking:
Why is the __syncwarp() needed?
Two reasons:
As stated near the top of that post:
Thread synchronization: __syncwarp synchronizes threads in a warp and provides a memory fence.
A memory fence is needed in this case, to prevent the compiler from "optimizing" shared memory locations into registers.
The CUDA programming model provides no specified order of thread execution. It would be a good idea for you to acknowledge that statement as ground truth. If you write code that requires a specific order of thread execution for correctness, and you don't provide for it explicitly in your source code as a programmer, your code is broken, regardless of how it behaves or what results it produces.
The volta whitepaper is describing the behavior of a specific hardware implementation of a CUDA-compliant device. The hardware may ensure things that are not guaranteed by the programming model.
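For concreteness, here is a hedged sketch that embeds the __syncwarp() pattern from the listing above in a complete single-warp kernel; the surrounding kernel, the initial load, and the first pairwise-sum step are my assumptions, not part of the original listing:

// Minimal sketch: a single-warp shared-memory sum of 64 ints using the
// __syncwarp() pattern from the listing above. Each __syncwarp() separates
// a read step from the following write to the same shared locations.
__global__ void warpSum(const int *in, int *out)
{
    __shared__ int shmem[64];
    unsigned tid = threadIdx.x;           // assumes launch with <<<1, 32>>>

    shmem[tid]      = in[tid];
    shmem[tid + 32] = in[tid + 32];
    __syncwarp();

    int v = shmem[tid] + shmem[tid + 32]; __syncwarp();
    shmem[tid] = v;     __syncwarp();
    v += shmem[tid+16]; __syncwarp();     // threads >= 16 compute unused values,
    shmem[tid] = v;     __syncwarp();     // but all reads stay within shmem[64]
    v += shmem[tid+8];  __syncwarp();
    shmem[tid] = v;     __syncwarp();
    v += shmem[tid+4];  __syncwarp();
    shmem[tid] = v;     __syncwarp();
    v += shmem[tid+2];  __syncwarp();
    shmem[tid] = v;     __syncwarp();
    v += shmem[tid+1];
    if (tid == 0) *out = v;               // thread 0 holds the sum of all 64 inputs
}

Launched as warpSum<<<1, 32>>>(d_in, d_out), thread 0 ends up with the full sum; without the __syncwarp() calls, a thread could write shmem[tid] before its neighbor has read the old value, or the compiler could keep shmem values in registers.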
What is "coalesced" in CUDA global memory transaction? I couldn't understand even after going through my CUDA guide. How to do it? In CUDA programming guide matrix example, accessing the matrix row by row is called "coalesced" or col.. by col.. is called coalesced?
Which is correct and why?
It's likely that this information applies only to compute capability 1.x, or CUDA 2.0. More recent architectures and CUDA 3.0 have more sophisticated global memory access, and in fact "coalesced global loads" are not even profiled for these chips.
Also, this logic can be applied to shared memory to avoid bank conflicts.
A coalesced memory transaction is one in which all of the threads in a half-warp access global memory at the same time. This is an oversimplification, but the correct way to do it is just to have consecutive threads access consecutive memory addresses.
So, if threads 0, 1, 2, and 3 read global memory 0x0, 0x4, 0x8, and 0xc, it should be a coalesced read.
In a matrix example, keep in mind that you want your matrix to reside linearly in memory. You can do this however you want, and your memory access should reflect how your matrix is laid out. So, the 3x4 matrix below
0 1 2 3
4 5 6 7
8 9 a b
could be done row after row, like this, so that (r,c) maps to memory (r*4 + c)
0 1 2 3 4 5 6 7 8 9 a b
Suppose you need to access each element once, and say you have four threads. Which threads will be used for which elements? Probably either
thread 0: 0, 1, 2
thread 1: 3, 4, 5
thread 2: 6, 7, 8
thread 3: 9, a, b
or
thread 0: 0, 4, 8
thread 1: 1, 5, 9
thread 2: 2, 6, a
thread 3: 3, 7, b
Which is better? Which will result in coalesced reads, and which will not?
Either way, each thread makes three accesses. Let's look at the first access and see if the threads access memory consecutively. In the first option, the first access is 0, 3, 6, 9. Not consecutive, not coalesced. The second option, it's 0, 1, 2, 3. Consecutive! Coalesced! Yay!
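To make the two options concrete, here is a hedged sketch (kernel and parameter names are mine) of the two assignments:

// Option 1: each thread walks its own contiguous chunk -> strided across threads.
__global__ void option1(const int *in, int *out, int perThread)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int sum = 0;
    for (int k = 0; k < perThread; ++k)
        sum += in[t * perThread + k];   // first access across threads: 0, 3, 6, 9 - uncoalesced
    out[t] = sum;
}

// Option 2: consecutive threads touch consecutive addresses at each step.
__global__ void option2(const int *in, int *out, int perThread, int nThreads)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int sum = 0;
    for (int k = 0; k < perThread; ++k)
        sum += in[k * nThreads + t];    // first access across threads: 0, 1, 2, 3 - coalesced
    out[t] = sum;
}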
The best way is probably to write your kernel and then profile it to see if you have non-coalesced global loads and stores.
Memory coalescing is a technique which allows optimal usage of the global memory bandwidth.
That is, when parallel threads running the same instruction access consecutive locations in global memory, the most favorable access pattern is achieved.
The example in the figure above helps explain the coalesced arrangement:
In Fig. (a), n vectors of length m are stored in a linear fashion. Element i of vector j is denoted by v_j^i. Each thread in the GPU kernel is assigned to one m-length vector. Threads in CUDA are grouped in an array of blocks, and every thread in the GPU has a unique id which can be defined as indx = bd*bx + tx, where bd represents the block dimension, bx denotes the block index and tx is the thread index within each block.
Vertical arrows demonstrate the case in which parallel threads access the first components of each vector, i.e. addresses 0, m, 2m, ... of the memory. As shown in Fig. (a), in this case the memory access is not consecutive. By zeroing the gap between these addresses (the red arrows shown in the figure above), the memory access becomes coalesced.
However, the problem gets slightly tricky here, since the number of resident threads per GPU block is limited to bd. Therefore the coalesced data arrangement can be achieved by storing the first elements of the first bd vectors in consecutive order, followed by the first elements of the second bd vectors, and so on. The rest of the vectors' elements are stored in a similar fashion, as shown in Fig. (b). If n (the number of vectors) is not a multiple of bd, the remaining data in the last block needs to be padded with some trivial value, e.g. 0.
In the linear data storage of Fig. (a), component i (0 ≤ i < m) of vector indx (0 ≤ indx < n) is addressed by m × indx + i. The same component in the coalesced storage pattern of Fig. (b) is addressed as

(m × bd) × ixC + bd × ixB + ixA,

where ixC = floor[(m × indx + i) / (m × bd)] = bx, ixB = i, and ixA = mod(indx, bd) = tx.

In summary, in the example of storing a number of vectors of size m, linear indexing is mapped to coalesced indexing according to:

m × indx + i −→ m × bd × bx + i × bd + tx
This data rearrangement can lead to significantly higher memory bandwidth of the GPU global memory.
source: "GPU‐based acceleration of computations in nonlinear finite element deformation analysis." International journal for numerical methods in biomedical engineering (2013).
If the threads in a block are accessing consecutive global memory locations, then all the accesses are combined into a single request (coalesced) by the hardware. In the matrix example, matrix elements in a row are arranged linearly, followed by the next row, and so on.
For e.g. a 2x2 matrix and 2 threads in a block, memory locations are arranged as:
(0,0) (0,1) (1,0) (1,1)
In row access (one thread per row), the first accesses are (0,0) by thread 0 and (1,0) by thread 1; these are not adjacent, so they cannot be coalesced.
In column access (one thread per column), the first accesses are (0,0) by thread 0 and (0,1) by thread 1; these are adjacent, so they can be coalesced.
The criteria for coalescing are nicely documented in the CUDA 3.2 Programming Guide, Section G.3.2. The short version is as follows: threads in the warp must access memory in sequence, and the words being accessed should be >= 32 bits in size. Additionally, the base address accessed by the warp should be 64-, 128-, or 256-byte aligned for 32-, 64- and 128-bit accesses, respectively.
Tesla2 and Fermi hardware does an okay job of coalescing 8- and 16-bit accesses, but they are best avoided if you want peak bandwidth.
Note that despite improvements in Tesla2 and Fermi hardware, coalescing is BY NO MEANS obsolete. Even on Tesla2 or Fermi class hardware, failing to coalesce global memory transactions can result in a 2x performance hit. (On Fermi class hardware, this seems to be true only when ECC is enabled. Contiguous-but-uncoalesced memory transactions take about a 20% hit on Fermi.)
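To illustrate the alignment point, here is a hedged sketch (kernel names are mine) of two copies that differ only by an element offset. Under the cc 1.x rules quoted above, the offset shifts the warp's base address off the required boundary and de-coalesces the transaction; on newer hardware it merely costs an extra cache line per warp:

__global__ void copyAligned(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];        // warp's base address falls on a 64/128-byte boundary
}

__global__ void copyOffset(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i + 1];    // base address shifted by 4 bytes: misaligned access
}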
Is there any relationship between blockIdx and the order in which thread blocks are executed on the GPU device?
My motivation is that I have a kernel in which multiple blocks will read from the same location in global memory, and it would be nice if these blocks would run concurrently (because L2 cache hits are nice). In deciding how to organize these blocks into a grid, would it be safe to say that blockIdx.x=0 is more likely to run concurrently with blockIdx.x=1 than with blockIdx.x=200? And that I should try to assign consecutive indices to blocks that read from the same location in global memory?
To be clear, I'm not asking about inter-block dependencies (as in this question) and the thread blocks are completely independent from the point of view of program correctness. I'm already using shared memory to broadcast data within a block, and I can't make the blocks any larger.
EDIT: Again, I am well aware that
Thread blocks are required to execute independently: It must be possible to execute them in any order, in parallel or in series.
and the blocks are fully independent---they can run in any order and produce the same output. I am just asking if the order in which I arrange the blocks into a grid will influence which blocks end up running concurrently, because that does affect performance via L2 cache hit rate.
I found a writeup in which a CS researcher used micro-benchmarking to reverse engineer the block scheduler on a Fermi device:
http://cs.rochester.edu/~sree/fermi-tbs/fermi-tbs.html
I adapted his code to run on my GPU device (GTX 1080, with the Pascal GP104 GPU) and to randomize the runtimes.
Methods
Each block contains only 1 thread, and is launched with enough shared memory that only 2 blocks can be resident per SM. The kernel records its start time (obtained via clock64()) and then runs for a random amount of time (the task, appropriately enough, is generating random numbers using the multiply-with-carry algorithm).
The GTX 1080 consists of 4 Graphics Processing Clusters (GPCs) with 5 streaming multiprocessors (SMs) each. Each GPC has its own clock, so I used the same method described in the link to determine which SMs belonged to which GPCs, and then subtracted a fixed offset to convert all of the clock values to the same time zone.
Results
For a 1-D block grid, I found that the blocks were indeed launched in consecutive order.
We have 40 blocks starting immediately (2 blocks per SM * 20 SMs) and the subsequent blocks start when the previous blocks end.
For 2-D grids, I found the same linear-sequential order, with blockIdx.x being the fast dimension and blockIdx.y the slow dimension.
NB: I made a terrible typo when labeling these plots. All instances of "threadIdx" should be replaced with "blockIdx".
And likewise for a 3-D block grid.
Conclusions
For a 1-D grid, these results match what Dr. Pai reported in the linked writeup. For 2-D grids, however, I did not find any evidence for a space-filling curve in block execution order, so this may have changed somewhere between Fermi and Pascal.
And of course, the usual caveats with benchmarking apply, and there's no guarantee that this isn't specific to a particular processor model.
Appendix
For reference, I also compared the results for random vs. fixed runtimes.
The fact that we see this trend with randomized runtimes gives me more confidence that this is a real result and not just a quirk of the benchmarking task.
Yes, there definitely is a correlation (although of course it is not guaranteed).
You are probably best off just trying it out on your device. You can use the %globaltimer and %smid special PTX registers with a bit of inline assembly:
#include <stdio.h>
__managed__ unsigned long long starttime;
// Read the %globaltimer PTX special register: a nanosecond-resolution
// timer shared across the whole device.
__device__ unsigned long long globaltime(void)
{
unsigned long long time;
asm("mov.u64 %0, %%globaltimer;" : "=l"(time));
return time;
}
// Read the %smid PTX special register: the ID of the SM this thread runs on.
__device__ unsigned int smid(void)
{
unsigned int sm;
asm("mov.u32 %0, %%smid;" : "=r"(sm));
return sm;
}
__global__ void logkernel(void)
{
unsigned long long t = globaltime();
    // The first block to arrive sets starttime; all blocks report relative to it.
    unsigned long long t0 = atomicCAS(&starttime, 0ull, t);
if (t0==0) t0 = t;
printf("Started block %2u on SM %2u at %llu.\n", blockIdx.x, smid(), t - t0);
}
int main(void)
{
starttime = 0;
logkernel<<<30, 1, 49152>>>();
cudaDeviceSynchronize();
return 0;
}
I've used 48K of shared memory to make the results a bit more interesting; you should substitute your kernel of interest, with its actual launch configuration, instead.
If I run this code on my laptop with a GTX 1050, I get output like the following:
Started block 1 on SM 1 at 0.
Started block 6 on SM 1 at 0.
Started block 8 on SM 3 at 0.
Started block 0 on SM 0 at 0.
Started block 3 on SM 3 at 0.
Started block 5 on SM 0 at 0.
Started block 2 on SM 2 at 0.
Started block 7 on SM 2 at 0.
Started block 4 on SM 4 at 0.
Started block 9 on SM 4 at 0.
Started block 10 on SM 3 at 152576.
Started block 11 on SM 3 at 152576.
Started block 18 on SM 1 at 153600.
Started block 16 on SM 1 at 153600.
Started block 17 on SM 0 at 153600.
Started block 14 on SM 0 at 153600.
Started block 13 on SM 2 at 153600.
Started block 12 on SM 2 at 153600.
Started block 19 on SM 4 at 153600.
Started block 15 on SM 4 at 153600.
Started block 20 on SM 0 at 210944.
Started block 21 on SM 3 at 210944.
Started block 22 on SM 0 at 211968.
Started block 23 on SM 3 at 211968.
Started block 24 on SM 1 at 214016.
Started block 26 on SM 1 at 215040.
Started block 25 on SM 2 at 215040.
Started block 27 on SM 2 at 215040.
Started block 28 on SM 4 at 216064.
Started block 29 on SM 4 at 217088.
So you see there is indeed a strong correlation.
What is "coalesced" in CUDA global memory transaction? I couldn't understand even after going through my CUDA guide. How to do it? In CUDA programming guide matrix example, accessing the matrix row by row is called "coalesced" or col.. by col.. is called coalesced?
Which is correct and why?
It's likely that this information applies only to compute capabality 1.x, or cuda 2.0. More recent architectures and cuda 3.0 have more sophisticated global memory access and in fact "coalesced global loads" are not even profiled for these chips.
Also, this logic can be applied to shared memory to avoid bank conflicts.
A coalesced memory transaction is one in which all of the threads in a half-warp access global memory at the same time. This is oversimple, but the correct way to do it is just have consecutive threads access consecutive memory addresses.
So, if threads 0, 1, 2, and 3 read global memory 0x0, 0x4, 0x8, and 0xc, it should be a coalesced read.
In a matrix example, keep in mind that you want your matrix to reside linearly in memory. You can do this however you want, and your memory access should reflect how your matrix is laid out. So, the 3x4 matrix below
0 1 2 3
4 5 6 7
8 9 a b
could be done row after row, like this, so that (r,c) maps to memory (r*4 + c)
0 1 2 3 4 5 6 7 8 9 a b
Suppose you need to access element once, and say you have four threads. Which threads will be used for which element? Probably either
thread 0: 0, 1, 2
thread 1: 3, 4, 5
thread 2: 6, 7, 8
thread 3: 9, a, b
or
thread 0: 0, 4, 8
thread 1: 1, 5, 9
thread 2: 2, 6, a
thread 3: 3, 7, b
Which is better? Which will result in coalesced reads, and which will not?
Either way, each thread makes three accesses. Let's look at the first access and see if the threads access memory consecutively. In the first option, the first access is 0, 3, 6, 9. Not consecutive, not coalesced. The second option, it's 0, 1, 2, 3. Consecutive! Coalesced! Yay!
The best way is probably to write your kernel and then profile it to see if you have non-coalesced global loads and stores.
Memory coalescing is a technique which allows optimal usage of the global memory bandwidth.
That is, when parallel threads running the same instruction access to consecutive locations in the global memory, the most favorable access pattern is achieved.
The example in Figure above helps explain the coalesced arrangement:
In Fig. (a), n vectors of length m are stored in a linear fashion. Element i of vector j is denoted by v j i. Each thread in GPU kernel is assigned to one m-length vector. Threads in CUDA are grouped in an array of blocks and every thread in GPU has a unique id which can be defined as indx=bd*bx+tx, where bd represents block dimension, bx denotes the block index and tx is the thread index in each block.
Vertical arrows demonstrate the case that parallel threads access to the first components of each vector, i.e. addresses 0, m, 2m... of the memory. As shown in Fig. (a), in this case the memory access is not consecutive. By zeroing the gap between these addresses (red arrows shown in figure above), the memory access becomes coalesced.
However, the problem gets slightly tricky here, since the allowed size of residing threads per GPU block is limited to bd. Therefore coalesced data arrangement can be done by storing the first elements of the first bd vectors in consecutive order, followed by first elements of the second bd vectors and so on. The rest of vectors elements are stored in a similar fashion, as shown in Fig. (b). If n (number of vectors) is not a factor of bd, it is needed to pad the remaining data in the last block with some trivial value, e.g. 0.
In the linear data storage in Fig. (a), component i (0 ≤ i < m) of vector indx
(0 ≤ indx < n) is addressed by m × indx +i; the same component in the coalesced
storage pattern in Fig. (b) is addressed as
(m × bd) ixC + bd × ixB + ixA,
where ixC = floor[(m.indx + j )/(m.bd)]= bx, ixB = j and ixA = mod(indx,bd) = tx.
In summary, in the example of storing a number of vectors with size m, linear indexing is mapped to coalesced indexing according to:
m.indx +i −→ m.bd.bx +i .bd +tx
This data rearrangement can lead to a significant higher memory bandwidth of GPU global memory.
source: "GPU‐based acceleration of computations in nonlinear finite element deformation analysis." International journal for numerical methods in biomedical engineering (2013).
If the threads in a block are accessing consecutive global memory locations, then all the accesses are combined into a single request(or coalesced) by the hardware. In the matrix example, matrix elements in row are arranged linearly, followed by the next row, and so on.
For e.g 2x2 matrix and 2 threads in a block, memory locations are arranged as:
(0,0) (0,1) (1,0) (1,1)
In row access, thread1 accesses (0,0) and (1,0) which cannot be coalesced.
In column access, thread1 accesses (0,0) and (0,1) which can be coalesced because they are adjacent.
The criteria for coalescing are nicely documented in the CUDA 3.2 Programming Guide, Section G.3.2. The short version is as follows: threads in the warp must be accessing the memory in sequence, and the words being accessed should >=32 bits. Additionally, the base address being accessed by the warp should be 64-, 128-, or 256-byte aligned for 32-, 64- and 128-bit accesses, respectively.
Tesla2 and Fermi hardware does an okay job of coalescing 8- and 16-bit accesses, but they are best avoided if you want peak bandwidth.
Note that despite improvements in Tesla2 and Fermi hardware, coalescing is BY NO MEANS obsolete. Even on Tesla2 or Fermi class hardware, failing to coalesce global memory transactions can result in a 2x performance hit. (On Fermi class hardware, this seems to be true only when ECC is enabled. Contiguous-but-uncoalesced memory transactions take about a 20% hit on Fermi.)
How are threads organized to be executed by a GPU?
Hardware
If a GPU device has, for example, 4 multiprocessing units and they can run 768 threads each, then at a given moment no more than 4*768 threads will really be running in parallel (if you request more threads, they will wait their turn).
Software
Threads are organized in blocks. A block is executed by a multiprocessing unit.
The threads of a block can be identified (indexed) using a 1D (x), 2D (x,y) or 3D (x,y,z) index, but in any case x*y*z <= 768 for our example (other restrictions apply to x, y, z; see the guide and your device capability).
Obviously, if you need more than those 4*768 threads you need more than 4 blocks.
Blocks may be also indexed 1D, 2D or 3D. There is a queue of blocks waiting to enter
the GPU (because, in our example, the GPU has 4 multiprocessors and only 4 blocks are
being executed simultaneously).
Now a simple case: processing a 512x512 image
Suppose we want one thread to process one pixel (i,j).
We can use blocks of 64 threads each. Then we need 512*512/64 = 4096 blocks (so as to have 512x512 threads = 4096*64).
To make indexing the image easier, it's common to organize the threads in 2D blocks with blockDim = 8 x 8 (the 64 threads per block). I prefer to call it threadsPerBlock.
dim3 threadsPerBlock(8, 8); // 64 threads
and 2D gridDim = 64 x 64 blocks (the 4096 blocks needed). I prefer to call it numBlocks.
dim3 numBlocks(imageWidth/threadsPerBlock.x, /* for instance 512/8 = 64*/
imageHeight/threadsPerBlock.y);
The kernel is launched like this:
myKernel <<<numBlocks,threadsPerBlock>>>( /* params for the kernel function */ );
Finally: there will be something like "a queue of 4096 blocks", where a block is waiting to be assigned one of the multiprocessors of the GPU to get its 64 threads executed.
In the kernel the pixel (i,j) to be processed by a thread is calculated this way:
uint i = (blockIdx.x * blockDim.x) + threadIdx.x;
uint j = (blockIdx.y * blockDim.y) + threadIdx.y;
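Putting the pieces together, here is a minimal sketch of what myKernel itself might look like for this example. The pixel operation (inverting an 8-bit grayscale value) and the image format are my assumptions, not from the original answer:

// Hedged sketch: one thread inverts one pixel of a 512x512 grayscale image.
__global__ void myKernel(unsigned char *image, unsigned int width, unsigned int height)
{
    uint i = (blockIdx.x * blockDim.x) + threadIdx.x;
    uint j = (blockIdx.y * blockDim.y) + threadIdx.y;
    if (i < width && j < height)    // guard in case the dims don't divide evenly
        image[j * width + i] = 255 - image[j * width + i];
}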
Suppose a 9800GT GPU:
it has 14 multiprocessors (SM)
each SM has 8 thread-processors (AKA stream-processors, SP or cores)
allows up to 512 threads per block
warpsize is 32 (which means each of the 14x8=112 thread-processors can schedule up to 32 threads)
https://www.tutorialspoint.com/cuda/cuda_threads.htm
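If you want these numbers for your own device rather than from a spec sheet, here is a short sketch using the standard CUDA runtime API query:

#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    printf("multiprocessors (SMs): %d\n", prop.multiProcessorCount);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("warp size:             %d\n", prop.warpSize);
    return 0;
}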
A block cannot have more than 512 active threads, therefore __syncthreads can only synchronize a limited number of threads. I.e., if you execute the following with 600 threads:
func1();
__syncthreads();
func2();
__syncthreads();
then the kernel must run twice and the order of execution will be:
func1 is executed for the first 512 threads
func2 is executed for the first 512 threads
func1 is executed for the remaining threads
func2 is executed for the remaining threads
Note:
The main point is __syncthreads is a block-wide operation and it does not synchronize all threads.
I'm not sure about the exact number of threads that __syncthreads can synchronize, since you can create a block with more than 512 threads and let the warp scheduler handle the scheduling. To my understanding it's more accurate to say: func1 is executed at least for the first 512 threads.
Before I edited this answer (back in 2010) I measured 14x8x32 threads were synchronized using __syncthreads.
I would greatly appreciate if someone test this again for a more accurate piece of information.
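To make the barrier semantics concrete, here is a hedged sketch (the bodies of func1 and func2 are my illustrations, standing in for the arbitrary work in the fragment above) where __syncthreads guarantees every thread's func1 write is visible before any thread runs func2:

#include <cstdio>

__device__ void func1(float *buf, int tid) { buf[tid] = tid * 2.0f; }

__device__ float func2(const float *buf, int tid)
{
    // Reads a neighbour's value; safe only because the barrier between
    // func1 and func2 made every thread's write visible block-wide.
    return buf[tid] + buf[(tid + 1) % blockDim.x];
}

__global__ void kernel(void)
{
    extern __shared__ float buf[];
    int tid = threadIdx.x;
    func1(buf, tid);
    __syncthreads();               // block-wide barrier: all func1 writes are done
    float v = func2(buf, tid);
    if (tid == 0) printf("thread 0 sees %f\n", v);
}

int main(void)
{
    kernel<<<1, 256, 256 * sizeof(float)>>>();
    cudaDeviceSynchronize();
    return 0;
}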