Grid size limits in Nvidia CUDA with 2D grid?
Asked by trubshaw
Hello everyone. My question may seem obvious to many at first glance, but please do not dismiss it as stupid until you have read it to the end.
Here is the essence of the question. As the CUDA documentation states, the grid size of a kernel launch is limited, and the limit depends on the specific device. On most modern video cards the limit is 65535x65535x1. I checked, and that is exactly the limit on my g210m and 8800gt cards. But here I ran into a rather strange thing: for some unknown reason, my program cannot launch a kernel larger than 5808x5808 (measured in threads; this number can be smaller depending on the block size, I have given the strict maximum) or larger than 264x264 (measured in blocks), and the latter number stays the same. As soon as the number of launched blocks exceeds 265x265, the kernel starts and runs, but always returns zeros as its result.
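The per-device limits mentioned above can be read at run time rather than taken from a table. The following is a minimal sketch, assuming the standard runtime call `cudaGetDeviceProperties`; the helper names `gridFitsDevice` and `printLaunchLimits` are hypothetical and exist only for this illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not part of the CUDA API): true when each grid
// dimension is within the limit the device reports.
bool gridFitsDevice(const cudaDeviceProp& prop, dim3 grid)
{
    return grid.x <= static_cast<unsigned>(prop.maxGridSize[0]) &&
           grid.y <= static_cast<unsigned>(prop.maxGridSize[1]) &&
           grid.z <= static_cast<unsigned>(prop.maxGridSize[2]);
}

// Hypothetical helper: prints the launch limits reported for one device.
cudaError_t printLaunchLimits(int device)
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) return err;

    printf("%s: max grid %d x %d x %d, max block %d x %d x %d, "
           "max threads per block %d\n",
           prop.name,
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2],
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2],
           prop.maxThreadsPerBlock);
    return cudaSuccess;
}
```

Calling `printLaunchLimits(0)` on each card would show whether a 265x265 grid is anywhere near the reported limits; with the 65535x65535x1 limit quoted in the question it is not.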
The Nvidia Nsight debugger is silent and no errors are thrown; the profiler shows results indicating that the kernel was launched. The restriction shows up on every video card I ran the program on, 8 different models in total (8400m g, 8800gt, 9600gso, 8500gt, 9600gt, ION, g210m, gf9300).
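A kernel launch does not report failure by itself: the error has to be requested from the runtime, and an unchecked failed launch leaves the output buffer untouched. The sketch below shows one common way to check, using `cudaGetLastError` and `cudaDeviceSynchronize`. The kernel `fillOnes` and the helpers `cudaOk` and `launchChecked` are hypothetical names, and nothing in the thread confirms that a failed launch is the actual cause here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints a labelled message on error, returns true on success.
bool cudaOk(cudaError_t err, const char* what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Stand-in kernel (an assumption, not the asker's code): writes 1 to every cell.
__global__ void fillOnes(int* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[(size_t)y * width + x] = 1;
}

// Launches the kernel and checks both the launch and the execution.
bool launchChecked(int* d_out, int width, int height, dim3 grid, dim3 block)
{
    fillOnes<<<grid, block>>>(d_out, width, height);
    // Errors in the launch configuration show up here.
    if (!cudaOk(cudaGetLastError(), "kernel launch")) return false;
    // Errors raised while the kernel runs show up after synchronization.
    return cudaOk(cudaDeviceSynchronize(), "kernel execution");
}
```

If either check fails, the printed message usually narrows the search far more than the debugger's silence does.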
All this makes me think that there are limits not only on the grid dimensions but also on the total number of threads in the grid (there is a limit on the number of threads in a block, so why not here too). But neither the official documentation, nor the Boreskov / Kharlamov textbook, nor the best practices guide says anything on this score: they only mention the limits already stated at the very beginning of the question.
I have been digging into this for about two hours a day for the past week without any progress, so I am asking for help: where should I dig? Any comments are welcome; if anything needs clarifying, just say so.
Answers
I have a GTX470.
So, here is the kernel I wrote:
__global__ void testKernel(int* g_odata)
{
    // One thread per block records that block's coordinates.
    if (threadIdx.x == 0)
    {
        g_odata[2 * (blockIdx.y * gridDim.x + blockIdx.x)] = blockIdx.y;
        g_odata[2 * (blockIdx.y * gridDim.x + blockIdx.x) + 1] = blockIdx.x;
    }
}
I launched it on 8192x8192 blocks with 1024 threads each (your cards support at most 512 threads per block; Fermi supports 1024):
dim3 grid( 8192, 8192, 1);
dim3 threads( 1024, 1, 1);
testKernel<<< grid, threads, 0 >>>( d_odata);
Naturally, I allocated the memory and so on.
The last element of the array came out as 8191x8191.
I did not test larger sizes because the memory runs out :( Handling that would need some extra logic.
In general, it is not clear where your non-round values of 265 and 264 come from.
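The remark about running out of memory can be made concrete with a little arithmetic: the test kernel stores two ints per block, so the buffer grows with the square of the grid side. The sketch below assumes 4-byte `int` and 64-bit `size_t`; `outputBytes` and `fitsInFreeMemory` are hypothetical helper names, while `cudaMemGetInfo` is the standard runtime call.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Bytes needed by g_odata for a gridX x gridY launch of the test kernel:
// two ints per block.
size_t outputBytes(size_t gridX, size_t gridY)
{
    return 2 * gridX * gridY * sizeof(int);
}

// True when a buffer of `bytes` fits into the device memory that is
// currently free; returns false if the query itself fails.
bool fitsInFreeMemory(size_t bytes)
{
    size_t freeBytes = 0, totalBytes = 0;
    if (cudaMemGetInfo(&freeBytes, &totalBytes) != cudaSuccess) return false;
    return bytes <= freeBytes;
}
```

Under those assumptions an 8192x8192 grid needs 512 MiB and a 16384x16384 grid needs 2 GiB, which fits the observation that larger grids could not be tested with this kernel.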
Thanks for checking!  marysol bishara
__global__ void testKernel(int* g_odata)
{
    int indexX = blockIdx.x * blockDim.x + threadIdx.x;
    int indexY = blockIdx.y * blockDim.y + threadIdx.y;
    if (indexX == 2097088 && indexY == 2097088)
    {
        g_odata[0] = indexX;
        g_odata[1] = indexY;
    }
    __syncthreads();
}
dim3 grid(65534, 65534, 1);
dim3 threads(32, 32, 1);
testKernel<<<grid, threads, 0>>>(d_odata);  carolyn kriete
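A quick check of the index arithmetic in this snippet may help. Global indices are zero-based, so the largest index along one axis is gridDim * blockDim - 1. The helper name `lastGlobalIndex` below is hypothetical and used only for illustration.

```cuda
#include <cuda_runtime.h>

// Largest zero-based global thread index along one axis for a launch with
// `gridDimN` blocks of `blockDimN` threads on that axis.
__host__ __device__ inline unsigned lastGlobalIndex(unsigned gridDimN,
                                                    unsigned blockDimN)
{
    return gridDimN * blockDimN - 1;
}
```

For 65534 blocks of 32 threads this gives 2097087, so, if the snippet is read as posted, the condition `indexX == 2097088` is never true and `g_odata` is never written. Note also that a 32x32 block is 1024 threads, which, as mentioned earlier in the thread, exceeds the 512-thread limit of the pre-Fermi cards.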
After slightly modifying the code, I got this result:
Checking 1x1
Processing time: 55.926998 (ms)
Last 31x31

Checking 2x2
Processing time: 0.098000 (ms)
Last 63x63

...
Checking 256x256
Processing time: 3.470000 (ms)
Last 8191x8191

...
Checking 8192x8192
Processing time: 3465.157959 (ms)
Last 262143x262143

Checking 16384x16384
Processing time: 13827.656250 (ms)
Last 524287x524287

Checking 32768x32768
template.cu (98): cudaSafeCall () Runtime API error: the launch timed out and was terminated.
 merrily
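The "launch timed out" error in the last run is what the runtime reports when a kernel runs longer than the display watchdog allows; the `kernelExecTimeoutEnabled` field of `cudaDeviceProp` tells whether that limit applies to a device. One common workaround is to split a large grid into smaller launches. The sketch below is an illustration under assumptions, not the code used in this thread: `tilesNeeded`, `tileKernel`, and `launchTiled` are hypothetical names, and the tile size is left to the caller.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Number of tiles of `tileBlocks` blocks needed to cover `totalBlocks`.
inline unsigned tilesNeeded(unsigned totalBlocks, unsigned tileBlocks)
{
    return (totalBlocks + tileBlocks - 1) / tileBlocks;
}

// Same idea as the test kernel above, but each launch covers one tile and
// receives the tile's block offset inside the full logical grid.
__global__ void tileKernel(int* out, unsigned offX, unsigned offY,
                           unsigned totalX, unsigned totalY)
{
    unsigned bx = blockIdx.x + offX;
    unsigned by = blockIdx.y + offY;
    if (bx >= totalX || by >= totalY) return;  // last tile may overhang
    if (threadIdx.x == 0 && threadIdx.y == 0) {
        size_t i = 2 * ((size_t)by * totalX + bx);
        out[i]     = (int)by;
        out[i + 1] = (int)bx;
    }
}

// Covers a totalX x totalY grid of blocks with tile x tile launches,
// checking for errors after every launch.
cudaError_t launchTiled(int* d_out, unsigned totalX, unsigned totalY,
                        unsigned tile, dim3 block)
{
    for (unsigned ty = 0; ty < tilesNeeded(totalY, tile); ++ty) {
        for (unsigned tx = 0; tx < tilesNeeded(totalX, tile); ++tx) {
            tileKernel<<<dim3(tile, tile, 1), block>>>(
                d_out, tx * tile, ty * tile, totalX, totalY);
            cudaError_t err = cudaGetLastError();
            if (err != cudaSuccess) return err;
            err = cudaDeviceSynchronize();  // keeps each launch short-lived
            if (err != cudaSuccess) return err;
        }
    }
    return cudaSuccess;
}
```

Each individual launch then finishes well inside the time limit, at the cost of some launch overhead.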
So, regrettably, I have a mistake somewhere.  tosit agarwal
The logic there is simple: a matrix is taken and filled by a two-dimensional array of threads. This matrix can never be all zeros, because in the very first step I assign a nonzero value to every diagonal element. Nevertheless, both during debugging and in the final result the matrix consists exclusively of zeros, and no errors are reported.  bhaskar
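Since the original code was not posted, the following is only a guess at what such a first step might look like: one thread per matrix cell, with a bounds check so that a grid larger than the matrix does no harm. The names `setDiagonal` and `blocksFor`, the `float` element type, and the value written are all assumptions.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Blocks needed along one axis to cover n cells with blockSide threads.
inline unsigned blocksFor(int n, int blockSide)
{
    return (unsigned)((n + blockSide - 1) / blockSide);
}

// Writes `value` to every diagonal element of an n x n row-major matrix.
__global__ void setDiagonal(float* m, int n, float value)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < n && y < n && x == y)
        m[(size_t)y * n + x] = value;
}
```

With the numbers from the question, 5808 threads per side and 264 blocks per side imply a block side of 22 (an inference, not something the asker stated), and `blocksFor(5809, 22)` is already 265. If a matrix filled this way still reads back as all zeros, the result of `cudaGetLastError()` right after the launch is worth inspecting.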