Class DriverAPINativeMethods.Launch
Groups all kernel launch API calls
Namespace: ManagedCuda
Assembly: ManagedCuda.dll
Syntax
public static class Launch
Methods
cuLaunch(CUfunction)
Invokes the kernel f on a 1 x 1 x 1 grid of blocks. The block contains the number of threads specified by a previous call to cuFuncSetBlockShape(CUfunction, Int32, Int32, Int32).
Declaration
[Obsolete("Don't use this CUDA API call with CUDA version >= 4.0.")]
public static CUResult cuLaunch(CUfunction f)
Parameters
Type | Name | Description |
---|---|---|
CUfunction | f | Kernel to launch |
Returns
Type | Description |
---|---|
CUResult | CUDA Error Codes: Success, ErrorDeinitialized, ErrorNotInitialized, ErrorInvalidContext, ErrorInvalidValue, ErrorLaunchFailed, ErrorLaunchOutOfResources, ErrorLaunchTimeout, ErrorLaunchIncompatibleTexturing. |
cuLaunchCooperativeKernel(CUfunction, UInt32, UInt32, UInt32, UInt32, UInt32, UInt32, UInt32, CUstream, IntPtr[])
Launches a CUDA function where thread blocks can cooperate and synchronize as they execute.
Invokes the kernel f on a gridDimX x gridDimY x gridDimZ grid of blocks. Each block contains blockDimX x blockDimY x blockDimZ threads.
sharedMemBytes sets the amount of dynamic shared memory that will be available to each thread block.
The device on which this kernel is invoked must have a non-zero value for the device attribute CU_DEVICE_ATTRIBUTE_COOPERATIVE_LAUNCH.
The total number of blocks launched cannot exceed the maximum number of blocks per multiprocessor as returned by cuOccupancyMaxActiveBlocksPerMultiprocessor (or cuOccupancyMaxActiveBlocksPerMultiprocessorWithFlags) times the number of multiprocessors as specified by the device attribute CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT.
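The grid-size limit above is plain arithmetic, and a minimal sketch may make it easier to check before launching. The class and method names below are hypothetical and not part of ManagedCuda; the two inputs are assumed to have been queried beforehand (occupancy per multiprocessor and the multiprocessor count).

```csharp
using System;

// Hypothetical helper, not part of ManagedCuda: checks the cooperative-launch
// grid limit described above using values the caller has already queried.
static class CooperativeGridLimit
{
    // Upper bound on the total number of blocks:
    // (max active blocks per multiprocessor) x (number of multiprocessors).
    public static long MaxBlocks(int maxActiveBlocksPerSm, int multiprocessorCount)
    {
        if (maxActiveBlocksPerSm < 0 || multiprocessorCount < 0)
            throw new ArgumentOutOfRangeException("Inputs must be non-negative.");
        return (long)maxActiveBlocksPerSm * multiprocessorCount;
    }

    // True when a gridDimX x gridDimY x gridDimZ grid stays within the bound.
    public static bool Fits(uint gridDimX, uint gridDimY, uint gridDimZ,
                            int maxActiveBlocksPerSm, int multiprocessorCount)
    {
        ulong totalBlocks = (ulong)gridDimX * gridDimY * gridDimZ;
        return totalBlocks <= (ulong)MaxBlocks(maxActiveBlocksPerSm, multiprocessorCount);
    }
}
```

For instance, with 2 active blocks per multiprocessor and 80 multiprocessors, a 160 x 1 x 1 grid fits while a 161 x 1 x 1 grid does not.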
The kernel cannot make use of CUDA dynamic parallelism.
Kernel parameters must be specified via kernelParams. If f has N parameters, then kernelParams needs to be an array of N pointers. Each of kernelParams[0] through kernelParams[N-1] must point to a region of memory from which the actual kernel parameter will be copied. The number of kernel parameters and their offsets and sizes do not need to be specified as that information is retrieved directly from the kernel's image.
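One possible way to build such an array of N pointers from managed values is to pin each boxed argument and collect its address. The sketch below is an illustration only: PinnedKernelParams is a hypothetical helper, not a ManagedCuda type, and it assumes every argument is a blittable value such as int, float, or a 64-bit device pointer stored as long.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper, not part of ManagedCuda: pins each boxed argument so that
// Pointers[i] addresses the memory the i-th kernel parameter is copied from.
sealed class PinnedKernelParams : IDisposable
{
    private readonly GCHandle[] _handles;

    // One pointer per kernel parameter, in declaration order.
    public IntPtr[] Pointers { get; }

    public PinnedKernelParams(params object[] args)
    {
        _handles = new GCHandle[args.Length];
        Pointers = new IntPtr[args.Length];
        for (int i = 0; i < args.Length; i++)
        {
            // Pinning only works for blittable values (assumption stated above).
            _handles[i] = GCHandle.Alloc(args[i], GCHandleType.Pinned);
            Pointers[i] = _handles[i].AddrOfPinnedObject();
        }
    }

    public void Dispose()
    {
        // Index into the array so Free() acts on the stored handle, not a copy.
        for (int i = 0; i < _handles.Length; i++)
        {
            if (_handles[i].IsAllocated)
                _handles[i].Free();
        }
    }
}
```

Under these assumptions, Pointers would be passed as kernelParams, and the helper should stay undisposed at least until the launch call has returned, since the parameters are copied from the pinned memory.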
Calling cuLaunchCooperativeKernel() sets persistent function state that is the same as the function state set through the cuLaunchKernel API.
When the kernel f is launched via cuLaunchCooperativeKernel(), the previous block shape, shared size and parameter info associated with f is overwritten.
Note that to use cuLaunchCooperativeKernel(), the kernel f must either have been compiled with toolchain version 3.2 or later so that it will contain kernel parameter information, or have no kernel parameters. If neither condition is met, cuLaunchCooperativeKernel() will return CUDA_ERROR_INVALID_IMAGE.
Declaration
public static CUResult cuLaunchCooperativeKernel(CUfunction f, uint gridDimX, uint gridDimY, uint gridDimZ, uint blockDimX, uint blockDimY, uint blockDimZ, uint sharedMemBytes, CUstream hStream, IntPtr[] kernelParams)
Parameters
Type | Name | Description |
---|---|---|
CUfunction | f | Kernel to launch |
System.UInt32 | gridDimX | Width of grid in blocks |
System.UInt32 | gridDimY | Height of grid in blocks |
System.UInt32 | gridDimZ | Depth of grid in blocks |
System.UInt32 | blockDimX | X dimension of each thread block |
System.UInt32 | blockDimY | Y dimension of each thread block |
System.UInt32 | blockDimZ | Z dimension of each thread block |
System.UInt32 | sharedMemBytes | Dynamic shared-memory size per thread block in bytes |
CUstream | hStream | Stream identifier |
System.IntPtr[] | kernelParams | Array of pointers to kernel parameters |
Returns
Type | Description |
---|---|
CUResult |
cuLaunchCooperativeKernelMultiDevice(CudaLaunchParams[], UInt32, CudaCooperativeLaunchMultiDeviceFlags)
Launches CUDA functions on multiple devices where thread blocks can cooperate and synchronize as they execute.
Invokes kernels as specified in the launchParamsList array, where each element of the array specifies all the parameters required to perform a single kernel launch. These kernels can cooperate and synchronize as they execute. The size of the array is specified by numDevices.
No two kernels can be launched on the same device. All the devices targeted by this multi-device launch must be identical. All devices must have a non-zero value for the device attribute CU_DEVICE_ATTRIBUTE_COOPERATIVE_MULTI_DEVICE_LAUNCH.
All kernels launched must be identical with respect to the compiled code. Note that any __device__, __constant__ or __managed__ variables present in the module that owns the kernel launched on each device are independently instantiated on every device. It is the application's responsibility to ensure these variables are initialized and used appropriately.
The size of the grids as specified in blocks, the size of the blocks themselves and the amount of shared memory used by each thread block must also match across all launched kernels.
The streams used to launch these kernels must have been created via either cuStreamCreate or cuStreamCreateWithPriority. The NULL stream, CU_STREAM_LEGACY and CU_STREAM_PER_THREAD cannot be used.
The total number of blocks launched per kernel cannot exceed the maximum number of blocks per multiprocessor as returned by cuOccupancyMaxActiveBlocksPerMultiprocessor (or cuOccupancyMaxActiveBlocksPerMultiprocessorWithFlags) times the number of multiprocessors as specified by the device attribute CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT. Since the total number of blocks launched per device has to match across all devices, the maximum number of blocks that can be launched per device will be limited by the device with the least number of multiprocessors.
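Because the per-device grid size has to match everywhere, the usable limit is the smallest per-device bound. The following is a minimal sketch of that reasoning; MultiDeviceGridLimit is a hypothetical name, not a ManagedCuda type, and the per-device occupancy and multiprocessor counts are assumed to have been queried by the caller.

```csharp
using System;

// Hypothetical helper, not part of ManagedCuda: derives the common per-device
// block limit for a multi-device cooperative launch from per-device values.
static class MultiDeviceGridLimit
{
    public static long MaxBlocksPerDevice((int BlocksPerSm, int SmCount)[] devices)
    {
        if (devices == null || devices.Length == 0)
            throw new ArgumentException("At least one device is required.", nameof(devices));

        long limit = long.MaxValue;
        foreach (var device in devices)
        {
            // Per-device bound: blocks per multiprocessor x multiprocessor count.
            long deviceLimit = (long)device.BlocksPerSm * device.SmCount;
            // The grid must match on all devices, so the smallest bound wins.
            limit = Math.Min(limit, deviceLimit);
        }
        return limit;
    }
}
```

With two devices reporting (2, 80) and (2, 72), the common limit is 144 blocks per device.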
The kernels cannot make use of CUDA dynamic parallelism.
By default, the kernel won't begin execution on any GPU until all prior work in all the specified streams has completed. This behavior can be overridden by specifying the flag CUDA_COOPERATIVE_LAUNCH_MULTI_DEVICE_NO_PRE_LAUNCH_SYNC. When this flag is specified, each kernel will only wait for prior work in the stream corresponding to that GPU to complete before it begins execution.
Similarly, by default, any subsequent work pushed in any of the specified streams will not begin execution until the kernels on all GPUs have completed. This behavior can be overridden by specifying the flag CUDA_COOPERATIVE_LAUNCH_MULTI_DEVICE_NO_POST_LAUNCH_SYNC. When this flag is specified, any subsequent work pushed in any of the specified streams will only wait for the kernel launched on the GPU corresponding to that stream to complete before it begins execution.
Calling cuLaunchCooperativeKernelMultiDevice() sets persistent function state that is the same as the function state set through the cuLaunchKernel API when called individually for each element in launchParamsList.
When kernels are launched via cuLaunchCooperativeKernelMultiDevice(), the previous block shape, shared size and parameter info associated with each CUDA_LAUNCH_PARAMS::function in launchParamsList is overwritten.
Note that to use cuLaunchCooperativeKernelMultiDevice(), the kernels must either have been compiled with toolchain version 3.2 or later so that they will contain kernel parameter information, or have no kernel parameters. If neither condition is met, cuLaunchCooperativeKernelMultiDevice() will return CUDA_ERROR_INVALID_IMAGE.
Declaration
public static CUResult cuLaunchCooperativeKernelMultiDevice(CudaLaunchParams[] launchParamsList, uint numDevices, CudaCooperativeLaunchMultiDeviceFlags flags)
Parameters
Type | Name | Description |
---|---|---|
CudaLaunchParams[] | launchParamsList | List of launch parameters, one per device |
System.UInt32 | numDevices | Size of the launchParamsList array |
CudaCooperativeLaunchMultiDeviceFlags | flags | Flags to control launch behavior |
Returns
Type | Description |
---|---|
CUResult |
cuLaunchGrid(CUfunction, Int32, Int32)
Invokes the kernel f on a grid_width x grid_height grid of blocks. Each block contains the number of threads specified by a previous call to cuFuncSetBlockShape(CUfunction, Int32, Int32, Int32).
Declaration
[Obsolete("Don't use this CUDA API call with CUDA version >= 4.0.")]
public static CUResult cuLaunchGrid(CUfunction f, int grid_width, int grid_height)
Parameters
Type | Name | Description |
---|---|---|
CUfunction | f | Kernel to launch |
System.Int32 | grid_width | Width of grid in blocks |
System.Int32 | grid_height | Height of grid in blocks |
Returns
Type | Description |
---|---|
CUResult | CUDA Error Codes: Success, ErrorDeinitialized, ErrorNotInitialized, ErrorInvalidContext, ErrorInvalidValue, ErrorLaunchFailed, ErrorLaunchOutOfResources, ErrorLaunchTimeout, ErrorLaunchIncompatibleTexturing. |
cuLaunchGridAsync(CUfunction, Int32, Int32, CUstream)
Invokes the kernel f on a grid_width x grid_height grid of blocks. Each block contains the number of threads specified by a previous call to cuFuncSetBlockShape(CUfunction, Int32, Int32, Int32).
cuLaunchGridAsync(CUfunction, Int32, Int32, CUstream) can optionally be associated with a stream by passing a non-zero hStream argument.
Declaration
[Obsolete("Don't use this CUDA API call with CUDA version >= 4.0.")]
public static CUResult cuLaunchGridAsync(CUfunction f, int grid_width, int grid_height, CUstream hStream)
Parameters
Type | Name | Description |
---|---|---|
CUfunction | f | Kernel to launch |
System.Int32 | grid_width | Width of grid in blocks |
System.Int32 | grid_height | Height of grid in blocks |
CUstream | hStream | Stream identifier |
Returns
Type | Description |
---|---|
CUResult | CUDA Error Codes: Success, ErrorDeinitialized, ErrorNotInitialized, ErrorInvalidContext, ErrorInvalidValue, ErrorLaunchFailed, ErrorLaunchOutOfResources, ErrorLaunchTimeout, ErrorLaunchIncompatibleTexturing. |
cuLaunchKernel(CUfunction, UInt32, UInt32, UInt32, UInt32, UInt32, UInt32, UInt32, CUstream, IntPtr[], IntPtr[])
Invokes the kernel f on a gridDimX x gridDimY x gridDimZ grid of blocks. Each block contains blockDimX x blockDimY x blockDimZ threads.
sharedMemBytes sets the amount of dynamic shared memory that will be available to each thread block.
cuLaunchKernel can optionally be associated with a stream by passing a non-zero hStream argument.
Kernel parameters to f can be specified in one of two ways:
1) Kernel parameters can be specified via kernelParams. If f has N parameters, then kernelParams needs to be an array of N pointers. Each of kernelParams[0] through kernelParams[N-1] must point to a region of memory from which the actual kernel parameter will be copied. The number of kernel parameters and their offsets and sizes do not need to be specified as that information is retrieved directly from the kernel's image.
2) Kernel parameters can also be packaged by the application into a single buffer that is passed in via the extra parameter. This places the burden on the application of knowing each kernel parameter's size and alignment/padding within the buffer.
The extra parameter exists to allow cuLaunchKernel to take additional less commonly used arguments. extra specifies a list of names of extra settings and their corresponding values. Each extra setting name is immediately followed by the corresponding value. The list must be terminated with either NULL or CU_LAUNCH_PARAM_END.
- CU_LAUNCH_PARAM_END, which indicates the end of the extra array;
- CU_LAUNCH_PARAM_BUFFER_POINTER, which specifies that the next value in extra will be a pointer to a buffer containing all the kernel parameters for launching kernel f;
- CU_LAUNCH_PARAM_BUFFER_SIZE, which specifies that the next value in extra will be a pointer to a size_t containing the size of the buffer specified with CU_LAUNCH_PARAM_BUFFER_POINTER.
The error CUDA_ERROR_INVALID_VALUE will be returned if kernel parameters are specified with both kernelParams and extra (i.e. both kernelParams and extra are non-NULL).
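For the second option (a single packed buffer handed over through extra), the application has to lay out the parameters itself. The sketch below shows one way to do that packing in managed code. ParamBufferBuilder is a hypothetical helper, not a ManagedCuda type, and it assumes each parameter is aligned to its own size (natural alignment), which should be verified against the actual kernel signature.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of ManagedCuda: packs kernel parameters into one
// byte buffer, padding so that each value starts at a multiple of its own size.
sealed class ParamBufferBuilder
{
    private readonly List<byte> _bytes = new List<byte>();

    // Inserts zero padding until the current offset is a multiple of 'alignment'.
    private void AlignTo(int alignment)
    {
        while (_bytes.Count % alignment != 0)
            _bytes.Add(0);
    }

    public ParamBufferBuilder Add(int value)
    {
        AlignTo(sizeof(int));
        _bytes.AddRange(BitConverter.GetBytes(value));
        return this;
    }

    public ParamBufferBuilder Add(float value)
    {
        AlignTo(sizeof(float));
        _bytes.AddRange(BitConverter.GetBytes(value));
        return this;
    }

    // Intended for 64-bit values such as a device pointer stored as long.
    public ParamBufferBuilder Add(long value)
    {
        AlignTo(sizeof(long));
        _bytes.AddRange(BitConverter.GetBytes(value));
        return this;
    }

    // The packed buffer; its Length is the size to report alongside the pointer.
    public byte[] ToArray() => _bytes.ToArray();
}
```

Packing an int followed by a long yields a 16-byte buffer: 4 bytes of data, 4 bytes of padding, then the 8-byte value. Following the description above, the pinned address of this buffer would follow CU_LAUNCH_PARAM_BUFFER_POINTER in extra, a pointer to its size would follow CU_LAUNCH_PARAM_BUFFER_SIZE, and kernelParams would be left null.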
Calling cuLaunchKernel sets persistent function state that is the same as the function state set through the following deprecated APIs: cuFuncSetBlockShape(), cuFuncSetSharedSize(), cuParamSetSize(), cuParamSeti(), cuParamSetf(), cuParamSetv().
When the kernel f is launched via cuLaunchKernel, the previous block shape, shared size and parameter info associated with f is overwritten.
Note that to use cuLaunchKernel, the kernel f must either have been compiled with toolchain version 3.2 or later so that it will contain kernel parameter information, or have no kernel parameters. If neither condition is met, cuLaunchKernel will return ErrorInvalidImage.
Declaration
public static CUResult cuLaunchKernel(CUfunction f, uint gridDimX, uint gridDimY, uint gridDimZ, uint blockDimX, uint blockDimY, uint blockDimZ, uint sharedMemBytes, CUstream hStream, IntPtr[] kernelParams, IntPtr[] extra)
Parameters
Type | Name | Description |
---|---|---|
CUfunction | f | Kernel to launch |
System.UInt32 | gridDimX | Width of grid in blocks |
System.UInt32 | gridDimY | Height of grid in blocks |
System.UInt32 | gridDimZ | Depth of grid in blocks |
System.UInt32 | blockDimX | X dimension of each thread block |
System.UInt32 | blockDimY | Y dimension of each thread block |
System.UInt32 | blockDimZ | Z dimension of each thread block |
System.UInt32 | sharedMemBytes | Dynamic shared-memory size per thread block in bytes |
CUstream | hStream | Stream identifier |
System.IntPtr[] | kernelParams | Array of pointers to kernel parameters |
System.IntPtr[] | extra | Extra options |
Returns
Type | Description |
---|---|
CUResult | CUDA Error Codes: Success, ErrorDeinitialized, ErrorNotInitialized, ErrorInvalidContext, ErrorInvalidValue, ErrorInvalidHandle, ErrorInvalidImage, ErrorLaunchFailed, ErrorLaunchOutOfResources, ErrorLaunchTimeout, ErrorLaunchIncompatibleTexturing, ErrorSharedObjectInitFailed. |