Grid size limits in Nvidia CUDA with 2D grid?

Asked by trubshaw

Hello everyone. My question may seem obvious at first glance, but please don't dismiss it as silly until you've read it to the end.

So, the essence of the question. As the CUDA documentation states, the grid size of a launched kernel is limited, and the limit depends on the specific device. On most modern video cards it is 65535x65535x1; I verified this on my G210M and 8800GT. But here I ran into a rather strange thing: in my program, for some unknown reason, I cannot launch a kernel whose dimensions exceed 5808x5808 threads (this number can be smaller depending on the block size; I'm quoting the absolute maximum) or 264x264 blocks, and that last number stays the same regardless of block size. As soon as the grid reaches 265x265 blocks, the kernel launches and runs, but always returns zeros as its result.

The Nsight debugger from Nvidia is silent, no errors are reported, and the profiler shows results for the runs in which the kernel was launched. The restriction shows up on every video card I ran the program on, 8 different models in total (8400M G, 8800GT, 9600GSO, 8500GT, 9600GT, ION, G210M, GF9300).

All this makes me think that there is a limit not only on the grid dimensions but also on the total number of threads in the grid (after all, there is a limit on the number of threads per block, so why not one here too). But neither the official documentation, nor the Boreskov/Kharlamov textbook, nor the Best Practices Guide says anything on this score; they only mention the limits I already quoted at the very beginning of the question.

I have been digging into this for about two hours a day over the past week with no progress, so I'm asking for help: where should I dig? Any comments are welcome; if something needs clarifying, just say so.

Comments:
What do you mean by "always returns zeros as its result"? - joanne parkington
I meant the following:
The logic there is simple: a matrix is taken and covered by a two-dimensional grid of threads. This matrix can never be all zeros, because at the very first step I set all the diagonal elements to non-zero values. Nevertheless, both during debugging and in the final result, the matrix turns out to consist of nothing but zeros, and no errors are reported. - bhaskar

Answers

miguel corte real
Just checked; I was unable to reproduce your problem.
I have a GTX 470.
Here is the kernel I launched:
 
__global__ void testKernel( int* g_odata) 
{
  if(threadIdx.x==0)
  {
    g_odata[2*(blockIdx.y*gridDim.x+blockIdx.x)] = blockIdx.y;
    g_odata[2*(blockIdx.y*gridDim.x+blockIdx.x)+1] = blockIdx.x;
  }
}

I launched it on an 8192x8192 grid of blocks with 1024 threads each (your cards allow at most 512 threads per block; Fermi allows 1024):
 
    dim3  grid( 8192, 8192, 1);
    dim3  threads( 1024, 1, 1);
    testKernel<<< grid, threads, 0 >>>(  d_odata);

Memory allocation etc. was done, naturally.
The last element of the array came out as 8191x8191.
I didn't test larger grids because I ran out of memory :( You'd need to implement some chunking logic.

In general, it is not clear where your non-round numbers 264 and 265 come from.
Replies:
I'm at a loss, honestly. It looks like I made a mistake somewhere, but it's strange that I can't catch it and that the program works correctly almost everywhere.
Thanks for checking! - marysol bishara
Could you try launching something like this:

__global__ void testKernel(int* g_odata)
{
  int indexX = blockIdx.x * blockDim.x + threadIdx.x;
  int indexY = blockIdx.y * blockDim.y + threadIdx.y;
  if (indexX == 2097087 && indexY == 2097087)  // last thread: 65534*32 - 1
  {
    g_odata[0] = indexX;
    g_odata[1] = indexY;
  }
  __syncthreads();
}

dim3 grid(65534, 65534, 1);
dim3 threads(32, 32, 1);
testKernel<<<grid, threads, 0>>>(d_odata); - carolyn kriete
I tried. Your code, specifically, crashes with a launch timeout. More about timeouts here: forums.nvidia.com/lofiversion/index.php?t106635.html.

After slightly modifying the code, I got these results:

Checking 1x1
Processing time: 55.926998 (ms)
Last 31x31

Checking 2x2
Processing time: 0.098000 (ms)
Last 63x63

...

Checking 256x256
Processing time: 3.470000 (ms)
Last 8191x8191

...

Checking 8192x8192
Processing time: 3465.157959 (ms)
Last 262143x262143

Checking 16384x16384
Processing time: 13827.656250 (ms)
Last 524287x524287

Checking 32768x32768
template.cu(98): cudaSafeCall() Runtime API error: the launch timed out and was terminated.
- merrily
Thanks, that's exactly what I needed.
So the mistake is somewhere on my side, to my regret - tosit agarwal