Class DriverAPINativeMethods.Launch

Groups all kernel launch API calls

Inheritance

System.Object

DriverAPINativeMethods.Launch

Inherited Members

System.Object.Equals(System.Object)

System.Object.Equals(System.Object, System.Object)

System.Object.GetHashCode()

System.Object.GetType()

System.Object.MemberwiseClone()

System.Object.ReferenceEquals(System.Object, System.Object)

System.Object.ToString()

Namespace: ManagedCuda

Assembly: ManagedCuda.dll

Syntax

public static class Launch

Methods

cuLaunch(CUfunction)

Invokes the kernel f on a 1 x 1 x 1 grid of blocks. The block contains the number of threads specified by a previous call to cuFuncSetBlockShape(CUfunction, Int32, Int32, Int32).

Declaration

[Obsolete("Don't use this CUDA API call with CUDA version >= 4.0.")]
public static CUResult cuLaunch(CUfunction f)

Parameters

Type	Name	Description
CUfunction	f	Kernel to launch

Returns

Type	Description
CUResult	CUDA Error Codes: Success, ErrorDeinitialized, ErrorNotInitialized, ErrorInvalidContext, ErrorInvalidValue, ErrorLaunchFailed, ErrorLaunchOutOfResources, ErrorLaunchTimeout, ErrorLaunchIncompatibleTexturing. Note that this function may also return error codes from previous, asynchronous launches.

cuLaunchCooperativeKernel(CUfunction, UInt32, UInt32, UInt32, UInt32, UInt32, UInt32, UInt32, CUstream, IntPtr[])

Launches a CUDA function where thread blocks can cooperate and synchronize as they execute.

Invokes the kernel f on a gridDimX x gridDimY x gridDimZ grid of blocks. Each block contains blockDimX x blockDimY x blockDimZ threads.

sharedMemBytes sets the amount of dynamic shared memory that will be available to each thread block.

The device on which this kernel is invoked must have a non-zero value for the device attribute CU_DEVICE_ATTRIBUTE_COOPERATIVE_LAUNCH.

The total number of blocks launched cannot exceed the maximum number of blocks per multiprocessor as returned by cuOccupancyMaxActiveBlocksPerMultiprocessor (or cuOccupancyMaxActiveBlocksPerMultiprocessorWithFlags) times the number of multiprocessors as specified by the device attribute CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT.
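The block-count limit described above is plain arithmetic, and it can help to check a grid against it before launching. The following is a minimal sketch, not part of ManagedCuda: the class and method names are hypothetical, and the two inputs are assumed to have been obtained beforehand from the occupancy query and the multiprocessor-count device attribute.

```csharp
using System;

// Hypothetical helper (not part of ManagedCuda) illustrating the limit above.
public static class CooperativeLaunchLimits
{
    // Upper bound on the total number of blocks in one cooperative launch:
    // (max active blocks per multiprocessor) x (number of multiprocessors).
    public static long MaxCooperativeBlocks(int maxActiveBlocksPerMultiprocessor, int multiprocessorCount)
    {
        if (maxActiveBlocksPerMultiprocessor < 0)
            throw new ArgumentOutOfRangeException(nameof(maxActiveBlocksPerMultiprocessor));
        if (multiprocessorCount < 0)
            throw new ArgumentOutOfRangeException(nameof(multiprocessorCount));
        return (long)maxActiveBlocksPerMultiprocessor * multiprocessorCount;
    }

    // Returns true when gridDimX * gridDimY * gridDimZ does not exceed the bound.
    public static bool GridFits(uint gridDimX, uint gridDimY, uint gridDimZ, long maxCooperativeBlocks)
    {
        if (maxCooperativeBlocks < 0) return false;

        // The product of two uint values always fits in a ulong.
        ulong total = (ulong)gridDimX * gridDimY;

        // Multiplying by the third dimension could overflow; an overflowing
        // product is certainly larger than any long-valued bound.
        if (gridDimZ != 0 && total > ulong.MaxValue / gridDimZ) return false;
        total *= gridDimZ;

        return total <= (ulong)maxCooperativeBlocks;
    }
}
```

For instance, with 2 active blocks per multiprocessor and 80 multiprocessors, the bound is 160 blocks, so a 10 x 4 x 4 grid fits while a 161 x 1 x 1 grid does not.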

The kernel cannot make use of CUDA dynamic parallelism.

Kernel parameters must be specified via kernelParams. If f has N parameters, then kernelParams needs to be an array of N pointers. Each of kernelParams[0] through kernelParams[N-1] must point to a region of memory from which the actual kernel parameter will be copied. The number of kernel parameters and their offsets and sizes do not need to be specified as that information is retrieved directly from the kernel's image.
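One way to build such an array of N pointers from managed code is to pin each argument value and collect the pinned addresses. The sketch below is an illustration only: PinnedKernelParams is a hypothetical helper, not a ManagedCuda type, and it assumes every argument is a blittable value (such as int, long, float, or IntPtr) so that its boxed form can be pinned.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper (not part of ManagedCuda): pins boxed argument values
// and exposes one pointer per argument, in order.
public sealed class PinnedKernelParams : IDisposable
{
    private readonly GCHandle[] _handles;

    // Pointers[i] is the address of the i-th argument value.
    public IntPtr[] Pointers { get; }

    public PinnedKernelParams(params object[] args)
    {
        _handles = new GCHandle[args.Length];
        Pointers = new IntPtr[args.Length];
        for (int i = 0; i < args.Length; i++)
        {
            // Pinning keeps the boxed value at a fixed address until Free().
            // GCHandle.Alloc throws for values that are not blittable.
            _handles[i] = GCHandle.Alloc(args[i], GCHandleType.Pinned);
            Pointers[i] = _handles[i].AddrOfPinnedObject();
        }
    }

    public void Dispose()
    {
        // Index into the array so that Free() updates the stored handle.
        for (int i = 0; i < _handles.Length; i++)
        {
            if (_handles[i].IsAllocated) _handles[i].Free();
        }
    }
}
```

Under these assumptions, the Pointers array would be passed as kernelParams, and the helper would be disposed only after the launch call has returned, since the parameter values are copied from the pointed-to memory.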

Calling cuLaunchCooperativeKernel() sets persistent function state that is the same as the function state set through the cuLaunchKernel API.

When the kernel f is launched via cuLaunchCooperativeKernel(), the previous block shape, shared size and parameter info associated with f is overwritten.

Note that to use cuLaunchCooperativeKernel(), the kernel f must either have been compiled with toolchain version 3.2 or later so that it will contain kernel parameter information, or have no kernel parameters. If neither of these conditions is met, then cuLaunchCooperativeKernel() will return CUDA_ERROR_INVALID_IMAGE.

Declaration

public static CUResult cuLaunchCooperativeKernel(CUfunction f, uint gridDimX, uint gridDimY, uint gridDimZ, uint blockDimX, uint blockDimY, uint blockDimZ, uint sharedMemBytes, CUstream hStream, IntPtr[] kernelParams)

Parameters

Type	Name	Description
CUfunction	f	Kernel to launch
System.UInt32	gridDimX	Width of grid in blocks
System.UInt32	gridDimY	Height of grid in blocks
System.UInt32	gridDimZ	Depth of grid in blocks
System.UInt32	blockDimX	X dimension of each thread block
System.UInt32	blockDimY	Y dimension of each thread block
System.UInt32	blockDimZ	Z dimension of each thread block
System.UInt32	sharedMemBytes	Dynamic shared-memory size per thread block in bytes
CUstream	hStream	Stream identifier
System.IntPtr[]	kernelParams	Array of pointers to kernel parameters

Returns

Type	Description
CUResult

cuLaunchCooperativeKernelMultiDevice(CudaLaunchParams[], UInt32, CudaCooperativeLaunchMultiDeviceFlags)

Launches CUDA functions on multiple devices where thread blocks can cooperate and synchronize as they execute.

Invokes kernels as specified in the launchParamsList array where each element of the array specifies all the parameters required to perform a single kernel launch. These kernels can cooperate and synchronize as they execute. The size of the array is specified by numDevices.

No two kernels can be launched on the same device. All the devices targeted by this multi-device launch must be identical. All devices must have a non-zero value for the device attribute CU_DEVICE_ATTRIBUTE_COOPERATIVE_MULTI_DEVICE_LAUNCH.

All kernels launched must be identical with respect to the compiled code. Note that any __device__, __constant__ or __managed__ variables present in the module that owns the kernel launched on each device are independently instantiated on every device. It is the application's responsibility to ensure these variables are initialized and used appropriately.

The size of the grids as specified in blocks, the size of the blocks themselves and the amount of shared memory used by each thread block must also match across all launched kernels.

The streams used to launch these kernels must have been created via either cuStreamCreate or cuStreamCreateWithPriority. The NULL stream, CU_STREAM_LEGACY and CU_STREAM_PER_THREAD cannot be used.

The total number of blocks launched per kernel cannot exceed the maximum number of blocks per multiprocessor as returned by cuOccupancyMaxActiveBlocksPerMultiprocessor (or cuOccupancyMaxActiveBlocksPerMultiprocessorWithFlags) times the number of multiprocessors as specified by the device attribute CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT. Since the total number of blocks launched per device has to match across all devices, the maximum number of blocks that can be launched per device will be limited by the device with the least number of multiprocessors.
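Because the per-device block count must match across devices, the usable grid size is the smallest per-device bound. A minimal sketch of that reduction follows; the class and method names are hypothetical and not part of ManagedCuda, and the per-device inputs are assumed to have been queried beforehand.

```csharp
using System;

// Hypothetical helper (not part of ManagedCuda) illustrating the limit above.
public static class MultiDeviceLaunchLimits
{
    // Returns the largest block count that every device can host, i.e. the
    // minimum over devices of (blocks per multiprocessor x multiprocessors).
    public static long MaxBlocksPerDevice(int[] maxActiveBlocksPerMultiprocessor, int[] multiprocessorCounts)
    {
        if (maxActiveBlocksPerMultiprocessor == null)
            throw new ArgumentNullException(nameof(maxActiveBlocksPerMultiprocessor));
        if (multiprocessorCounts == null)
            throw new ArgumentNullException(nameof(multiprocessorCounts));
        if (maxActiveBlocksPerMultiprocessor.Length == 0 ||
            maxActiveBlocksPerMultiprocessor.Length != multiprocessorCounts.Length)
            throw new ArgumentException("Both arrays need one entry per device.");

        long limit = long.MaxValue;
        for (int i = 0; i < multiprocessorCounts.Length; i++)
        {
            long deviceLimit = (long)maxActiveBlocksPerMultiprocessor[i] * multiprocessorCounts[i];
            limit = Math.Min(limit, deviceLimit);
        }
        return limit;
    }
}
```

With two devices that each sustain 2 blocks per multiprocessor but report 80 and 72 multiprocessors, the result is 144 blocks per device.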

The kernels cannot make use of CUDA dynamic parallelism.

By default, the kernel won't begin execution on any GPU until all prior work in all the specified streams has completed. This behavior can be overridden by specifying the flag CUDA_COOPERATIVE_LAUNCH_MULTI_DEVICE_NO_PRE_LAUNCH_SYNC. When this flag is specified, each kernel will only wait for prior work in the stream corresponding to that GPU to complete before it begins execution.

Similarly, by default, any subsequent work pushed in any of the specified streams will not begin execution until the kernels on all GPUs have completed. This behavior can be overridden by specifying the flag CUDA_COOPERATIVE_LAUNCH_MULTI_DEVICE_NO_POST_LAUNCH_SYNC. When this flag is specified, any subsequent work pushed in any of the specified streams will only wait for the kernel launched on the GPU corresponding to that stream to complete before it begins execution.

Calling cuLaunchCooperativeKernelMultiDevice() sets persistent function state that is the same as the function state set through the cuLaunchKernel API when called individually for each element in launchParamsList.

When kernels are launched via cuLaunchCooperativeKernelMultiDevice(), the previous block shape, shared size and parameter info associated with each CUDA_LAUNCH_PARAMS::function in launchParamsList is overwritten.

Note that to use cuLaunchCooperativeKernelMultiDevice(), the kernels must either have been compiled with toolchain version 3.2 or later so that they contain kernel parameter information, or have no kernel parameters. If neither of these conditions is met, then cuLaunchCooperativeKernelMultiDevice() will return CUDA_ERROR_INVALID_IMAGE.

Declaration

public static CUResult cuLaunchCooperativeKernelMultiDevice(CudaLaunchParams[] launchParamsList, uint numDevices, CudaCooperativeLaunchMultiDeviceFlags flags)

Parameters

Type	Name	Description
CudaLaunchParams[]	launchParamsList	List of launch parameters, one per device
System.UInt32	numDevices	Size of the launchParamsList array
CudaCooperativeLaunchMultiDeviceFlags	flags	Flags to control launch behavior

Returns

Type	Description
CUResult

cuLaunchGrid(CUfunction, Int32, Int32)

Invokes the kernel f on a grid_width x grid_height grid of blocks. Each block contains the number of threads specified by a previous call to cuFuncSetBlockShape(CUfunction, Int32, Int32, Int32).

Declaration

[Obsolete("Don't use this CUDA API call with CUDA version >= 4.0.")]
public static CUResult cuLaunchGrid(CUfunction f, int grid_width, int grid_height)

Parameters

Type	Name	Description
CUfunction	f	Kernel to launch
System.Int32	grid_width	Width of grid in blocks
System.Int32	grid_height	Height of grid in blocks

Returns

Type	Description
CUResult	CUDA Error Codes: Success, ErrorDeinitialized, ErrorNotInitialized, ErrorInvalidContext, ErrorInvalidValue, ErrorLaunchFailed, ErrorLaunchOutOfResources, ErrorLaunchTimeout, ErrorLaunchIncompatibleTexturing. Note that this function may also return error codes from previous, asynchronous launches.

cuLaunchGridAsync(CUfunction, Int32, Int32, CUstream)

cuLaunchGridAsync(CUfunction, Int32, Int32, CUstream) can optionally be associated with a stream by passing a non-zero hStream argument.

Declaration

[Obsolete("Don't use this CUDA API call with CUDA version >= 4.0.")]
public static CUResult cuLaunchGridAsync(CUfunction f, int grid_width, int grid_height, CUstream hStream)

Parameters

Type	Name	Description
CUfunction	f	Kernel to launch
System.Int32	grid_width	Width of grid in blocks
System.Int32	grid_height	Height of grid in blocks
CUstream	hStream	Stream identifier

Returns

Type	Description
CUResult	CUDA Error Codes: Success, ErrorDeinitialized, ErrorNotInitialized, ErrorInvalidContext, ErrorInvalidValue, ErrorLaunchFailed, ErrorLaunchOutOfResources, ErrorLaunchTimeout, ErrorLaunchIncompatibleTexturing. Note that this function may also return error codes from previous, asynchronous launches.

cuLaunchKernel(CUfunction, UInt32, UInt32, UInt32, UInt32, UInt32, UInt32, UInt32, CUstream, IntPtr[], IntPtr[])

Invokes the kernel f on a gridDimX x gridDimY x gridDimZ grid of blocks. Each block contains blockDimX x blockDimY x blockDimZ threads.

sharedMemBytes sets the amount of dynamic shared memory that will be available to each thread block.

cuLaunchKernel(CUfunction, UInt32, UInt32, UInt32, UInt32, UInt32, UInt32, UInt32, CUstream, IntPtr[], IntPtr[]) can optionally be associated with a stream by passing a non-zero hStream argument.

Kernel parameters to f can be specified in one of two ways:

1) Kernel parameters can be specified via kernelParams. If f has N parameters, then kernelParams needs to be an array of N pointers. Each of kernelParams[0] through kernelParams[N-1] must point to a region of memory from which the actual kernel parameter will be copied. The number of kernel parameters and their offsets and sizes do not need to be specified as that information is retrieved directly from the kernel's image.

2) Kernel parameters can also be packaged by the application into a single buffer that is passed in via the extra parameter. This places the burden on the application of knowing each kernel parameter's size and alignment/padding within the buffer.

The extra parameter exists to allow cuLaunchKernel(CUfunction, UInt32, UInt32, UInt32, UInt32, UInt32, UInt32, UInt32, CUstream, IntPtr[], IntPtr[]) to take additional less commonly used arguments. extra specifies a list of names of extra settings and their corresponding values. Each extra setting name is immediately followed by the corresponding value. The list must be terminated with either NULL or CU_LAUNCH_PARAM_END. The recognized setting names are:

- CU_LAUNCH_PARAM_END, which indicates the end of the extra array;
- CU_LAUNCH_PARAM_BUFFER_POINTER, which specifies that the next value in extra will be a pointer to a buffer containing all the kernel parameters for launching kernel f;
- CU_LAUNCH_PARAM_BUFFER_SIZE, which specifies that the next value in extra will be a pointer to a size_t containing the size of the buffer specified with CU_LAUNCH_PARAM_BUFFER_POINTER.

The error CUDA_ERROR_INVALID_VALUE will be returned if kernel parameters are specified with both kernelParams and extra (i.e. both kernelParams and extra are non-NULL).
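The second way of passing parameters can be sketched as follows: the arguments are packed into one unmanaged buffer at naturally aligned offsets, and the extra array alternates setting names and values. This is an illustration under stated assumptions, not ManagedCuda code. PackedLaunchParams is a hypothetical helper, the kernel signature (an int followed by a device pointer) is invented for the example, the device pointer is assumed to have the size of a host pointer, and the three marker values (0, 1 and 2) are assumed to match the CU_LAUNCH_PARAM_* definitions in cuda.h.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper (not part of ManagedCuda) that packs the arguments of
// an assumed kernel(int n, T* data) into one buffer and builds the extra list.
public sealed class PackedLaunchParams : IDisposable
{
    // Marker values assumed to match the CU_LAUNCH_PARAM_* macros in cuda.h.
    public static readonly IntPtr LaunchParamEnd = IntPtr.Zero;
    public static readonly IntPtr LaunchParamBufferPointer = new IntPtr(1);
    public static readonly IntPtr LaunchParamBufferSize = new IntPtr(2);

    private IntPtr _buffer;    // packed kernel arguments
    private IntPtr _sizeCell;  // unmanaged size_t holding the buffer size

    public IntPtr[] Extra { get; }
    public int Size { get; }

    public PackedLaunchParams(int n, IntPtr devicePointer)
    {
        // Place each argument at the next offset aligned to its own size.
        int nOffset = AlignUp(0, sizeof(int));
        int pointerOffset = AlignUp(nOffset + sizeof(int), IntPtr.Size);
        Size = pointerOffset + IntPtr.Size;

        _buffer = Marshal.AllocHGlobal(Size);
        Marshal.WriteInt32(_buffer, nOffset, n);
        Marshal.WriteIntPtr(_buffer, pointerOffset, devicePointer);

        // The size is passed indirectly, as a pointer to a size_t.
        _sizeCell = Marshal.AllocHGlobal(IntPtr.Size);
        Marshal.WriteIntPtr(_sizeCell, new IntPtr(Size));

        // Each setting name is immediately followed by its value.
        Extra = new[]
        {
            LaunchParamBufferPointer, _buffer,
            LaunchParamBufferSize, _sizeCell,
            LaunchParamEnd
        };
    }

    private static int AlignUp(int offset, int alignment)
    {
        return (offset + alignment - 1) / alignment * alignment;
    }

    public void Dispose()
    {
        if (_buffer != IntPtr.Zero) { Marshal.FreeHGlobal(_buffer); _buffer = IntPtr.Zero; }
        if (_sizeCell != IntPtr.Zero) { Marshal.FreeHGlobal(_sizeCell); _sizeCell = IntPtr.Zero; }
    }
}
```

With this approach the Extra array would be passed as extra while kernelParams is left null, since supplying both is rejected as described above.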

Calling cuLaunchKernel(CUfunction, UInt32, UInt32, UInt32, UInt32, UInt32, UInt32, UInt32, CUstream, IntPtr[], IntPtr[]) sets persistent function state that is the same as function state set through the following deprecated APIs:

- cuFuncSetBlockShape()
- cuFuncSetSharedSize()
- cuParamSetSize()
- cuParamSeti()
- cuParamSetf()
- cuParamSetv()

When the kernel f is launched via cuLaunchKernel(CUfunction, UInt32, UInt32, UInt32, UInt32, UInt32, UInt32, UInt32, CUstream, IntPtr[], IntPtr[]), the previous block shape, shared size and parameter info associated with f is overwritten.

Note that to use cuLaunchKernel(CUfunction, UInt32, UInt32, UInt32, UInt32, UInt32, UInt32, UInt32, CUstream, IntPtr[], IntPtr[]), the kernel f must either have been compiled with toolchain version 3.2 or later so that it will contain kernel parameter information, or have no kernel parameters. If neither of these conditions is met, then cuLaunchKernel(CUfunction, UInt32, UInt32, UInt32, UInt32, UInt32, UInt32, UInt32, CUstream, IntPtr[], IntPtr[]) will return ErrorInvalidImage.

Declaration

public static CUResult cuLaunchKernel(CUfunction f, uint gridDimX, uint gridDimY, uint gridDimZ, uint blockDimX, uint blockDimY, uint blockDimZ, uint sharedMemBytes, CUstream hStream, IntPtr[] kernelParams, IntPtr[] extra)

Parameters

Type	Name	Description
CUfunction	f	Kernel to launch
System.UInt32	gridDimX	Width of grid in blocks
System.UInt32	gridDimY	Height of grid in blocks
System.UInt32	gridDimZ	Depth of grid in blocks
System.UInt32	blockDimX	X dimension of each thread block
System.UInt32	blockDimY	Y dimension of each thread block
System.UInt32	blockDimZ	Z dimension of each thread block
System.UInt32	sharedMemBytes	Dynamic shared-memory size per thread block in bytes
CUstream	hStream	Stream identifier
System.IntPtr[]	kernelParams	Array of pointers to kernel parameters
System.IntPtr[]	extra	Extra options

Returns

Type	Description
CUResult	CUDA Error Codes: Success, ErrorDeinitialized, ErrorNotInitialized, ErrorInvalidContext, ErrorInvalidValue, ErrorInvalidHandle, ErrorInvalidImage, ErrorLaunchFailed, ErrorLaunchOutOfResources, ErrorLaunchTimeout, ErrorLaunchIncompatibleTexturing, ErrorSharedObjectInitFailed. Note that this function may also return error codes from previous, asynchronous launches.