    Class DriverAPINativeMethods.Launch

    Groups all kernel launch API calls

    Inheritance
    System.Object
    DriverAPINativeMethods.Launch
    Inherited Members
    System.Object.Equals(System.Object)
    System.Object.Equals(System.Object, System.Object)
    System.Object.GetHashCode()
    System.Object.GetType()
    System.Object.MemberwiseClone()
    System.Object.ReferenceEquals(System.Object, System.Object)
    System.Object.ToString()
    Namespace: ManagedCuda
    Assembly: ManagedCuda.dll
    Syntax
    public static class Launch

    Methods

    cuLaunch(CUfunction)

    Invokes the kernel f on a 1 x 1 x 1 grid of blocks. The block contains the number of threads specified by a previous call to cuFuncSetBlockShape(CUfunction, Int32, Int32, Int32).

    Declaration
    [Obsolete("Don't use this CUDA API call with CUDA version >= 4.0.")]
    public static CUResult cuLaunch(CUfunction f)
    Parameters
    Type Name Description
    CUfunction f

    Kernel to launch

    Returns
    Type Description
    CUResult

    CUDA Error Codes: Success, ErrorDeinitialized, ErrorNotInitialized, ErrorInvalidContext, ErrorInvalidValue, ErrorLaunchFailed, ErrorLaunchOutOfResources, ErrorLaunchTimeout, ErrorLaunchIncompatibleTexturing. Note that this function may also return error codes from previous, asynchronous launches.

    cuLaunchCooperativeKernel(CUfunction, UInt32, UInt32, UInt32, UInt32, UInt32, UInt32, UInt32, CUstream, IntPtr[])

    Launches a CUDA function where thread blocks can cooperate and synchronize as they execute.

    Invokes the kernel f on a gridDimX x gridDimY x gridDimZ grid of blocks. Each block contains blockDimX x blockDimY x blockDimZ threads.

    sharedMemBytes sets the amount of dynamic shared memory that will be available to each thread block.

    The device on which this kernel is invoked must have a non-zero value for the device attribute CU_DEVICE_ATTRIBUTE_COOPERATIVE_LAUNCH.

    The total number of blocks launched cannot exceed the maximum number of blocks per multiprocessor as returned by cuOccupancyMaxActiveBlocksPerMultiprocessor (or cuOccupancyMaxActiveBlocksPerMultiprocessorWithFlags) times the number of multiprocessors as specified by the device attribute CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT.
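    The block limit described above is a simple product, and a launch can be checked against it before it is attempted. The following is a minimal sketch; CooperativeGridLimit and its members are hypothetical helpers, not part of ManagedCuda, and the two inputs are assumed to have been queried from the driver beforehand.

```csharp
using System;
using System.Numerics;

// Hypothetical helper, not part of ManagedCuda.
public static class CooperativeGridLimit
{
    // Upper bound on the number of blocks in a cooperative launch:
    // (max active blocks per multiprocessor) x (number of multiprocessors).
    public static long MaxBlocks(int maxActiveBlocksPerSm, int multiprocessorCount)
    {
        if (maxActiveBlocksPerSm < 0)
            throw new ArgumentOutOfRangeException(nameof(maxActiveBlocksPerSm));
        if (multiprocessorCount < 0)
            throw new ArgumentOutOfRangeException(nameof(multiprocessorCount));
        return (long)maxActiveBlocksPerSm * multiprocessorCount;
    }

    // True if a gridDimX x gridDimY x gridDimZ grid stays within the bound.
    // BigInteger avoids overflow when multiplying three 32-bit dimensions.
    public static bool Fits(uint gridDimX, uint gridDimY, uint gridDimZ, long maxBlocks)
    {
        BigInteger total = (BigInteger)gridDimX * gridDimY * gridDimZ;
        return total <= maxBlocks;
    }
}
```

    For example, with 2 active blocks per multiprocessor and 80 multiprocessors, MaxBlocks returns 160, so a 160 x 1 x 1 grid fits and a 161 x 1 x 1 grid does not.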

    The kernel cannot make use of CUDA dynamic parallelism.

    Kernel parameters must be specified via kernelParams. If f has N parameters, then kernelParams needs to be an array of N pointers. Each of kernelParams[0] through kernelParams[N-1] must point to a region of memory from which the actual kernel parameter will be copied. The number of kernel parameters and their offsets and sizes do not need to be specified as that information is retrieved directly from the kernel's image.
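    One way to obtain an array of N pointers from managed values is to pin each boxed value and collect its address. The following is a sketch under that assumption; PinnedKernelArgs is a hypothetical helper, not a ManagedCuda type, and it only works for blittable values such as int or float.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper, not part of ManagedCuda.
// Pins each argument so that its address stays valid until Dispose is called.
public sealed class PinnedKernelArgs : IDisposable
{
    private readonly GCHandle[] _handles;

    // One pointer per kernel parameter, in declaration order.
    public IntPtr[] Pointers { get; }

    public PinnedKernelArgs(params object[] args)
    {
        _handles = new GCHandle[args.Length];
        Pointers = new IntPtr[args.Length];
        try
        {
            for (int i = 0; i < args.Length; i++)
            {
                // Throws ArgumentException for non-blittable values.
                _handles[i] = GCHandle.Alloc(args[i], GCHandleType.Pinned);
                Pointers[i] = _handles[i].AddrOfPinnedObject();
            }
        }
        catch
        {
            Dispose(); // release the handles pinned so far
            throw;
        }
    }

    public void Dispose()
    {
        // Index into the array so the stored struct itself is updated.
        for (int i = 0; i < _handles.Length; i++)
        {
            if (_handles[i].IsAllocated)
                _handles[i].Free();
        }
    }
}
```

    The Pointers array would then be passed as the kernelParams argument, and the helper disposed once the launch call has returned, for example: using (var args = new PinnedKernelArgs(42, 1.5f)) { /* pass args.Pointers as kernelParams */ }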

    Calling cuLaunchCooperativeKernel() sets persistent function state that is the same as the function state set through the cuLaunchKernel API.

    When the kernel f is launched via cuLaunchCooperativeKernel(), the previous block shape, shared size and parameter info associated with f are overwritten.

    Note that to use cuLaunchCooperativeKernel(), the kernel f must either have been compiled with toolchain version 3.2 or later so that it will contain kernel parameter information, or have no kernel parameters. If neither of these conditions is met, then cuLaunchCooperativeKernel() will return CUDA_ERROR_INVALID_IMAGE.

    Declaration
    public static CUResult cuLaunchCooperativeKernel(CUfunction f, uint gridDimX, uint gridDimY, uint gridDimZ, uint blockDimX, uint blockDimY, uint blockDimZ, uint sharedMemBytes, CUstream hStream, IntPtr[] kernelParams)
    Parameters
    Type Name Description
    CUfunction f

    Kernel to launch

    System.UInt32 gridDimX

    Width of grid in blocks

    System.UInt32 gridDimY

    Height of grid in blocks

    System.UInt32 gridDimZ

    Depth of grid in blocks

    System.UInt32 blockDimX

    X dimension of each thread block

    System.UInt32 blockDimY

    Y dimension of each thread block

    System.UInt32 blockDimZ

    Z dimension of each thread block

    System.UInt32 sharedMemBytes

    Dynamic shared-memory size per thread block in bytes

    CUstream hStream

    Stream identifier

    System.IntPtr[] kernelParams

    Array of pointers to kernel parameters

    Returns
    Type Description
    CUResult

    cuLaunchCooperativeKernelMultiDevice(CudaLaunchParams[], UInt32, CudaCooperativeLaunchMultiDeviceFlags)

    Launches CUDA functions on multiple devices where thread blocks can cooperate and synchronize as they execute.

    Invokes kernels as specified in the launchParamsList array, where each element of the array specifies all the parameters required to perform a single kernel launch. These kernels can cooperate and synchronize as they execute. The size of the array is specified by numDevices.

    No two kernels can be launched on the same device. All the devices targeted by this multi-device launch must be identical. All devices must have a non-zero value for the device attribute CU_DEVICE_ATTRIBUTE_COOPERATIVE_MULTI_DEVICE_LAUNCH.

    All kernels launched must be identical with respect to the compiled code. Note that any __device__, __constant__ or __managed__ variables present in the module that owns the kernel launched on each device are independently instantiated on every device. It is the application's responsibility to ensure these variables are initialized and used appropriately.

    The size of the grids as specified in blocks, the size of the blocks themselves and the amount of shared memory used by each thread block must also match across all launched kernels.

    The streams used to launch these kernels must have been created via either cuStreamCreate or cuStreamCreateWithPriority. The NULL stream, CU_STREAM_LEGACY and CU_STREAM_PER_THREAD cannot be used.

    The total number of blocks launched per kernel cannot exceed the maximum number of blocks per multiprocessor as returned by cuOccupancyMaxActiveBlocksPerMultiprocessor (or cuOccupancyMaxActiveBlocksPerMultiprocessorWithFlags) times the number of multiprocessors as specified by the device attribute CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT. Since the total number of blocks launched per device has to match across all devices, the maximum number of blocks that can be launched per device will be limited by the device with the least number of multiprocessors.
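    Because every device must launch the same number of blocks, the usable limit is the smallest per-device limit. The sketch below illustrates that rule; MultiDeviceGridLimit is a hypothetical helper, not part of ManagedCuda, and the per-device values are assumed to have been queried from the driver beforehand.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of ManagedCuda.
public static class MultiDeviceGridLimit
{
    // Each entry holds, for one device, the maximum number of active blocks
    // per multiprocessor and the number of multiprocessors.
    // Returns the largest block count that every device can accept.
    public static long MaxBlocksPerDevice(IEnumerable<(int BlocksPerSm, int SmCount)> devices)
    {
        var list = devices.ToList();
        if (list.Count == 0)
            throw new ArgumentException("At least one device is required.", nameof(devices));
        return list.Min(d => (long)d.BlocksPerSm * d.SmCount);
    }
}
```

    With two devices reporting (2, 80) and (2, 72), the per-device limits are 160 and 144, so each kernel may launch at most 144 blocks.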

    The kernels cannot make use of CUDA dynamic parallelism.

    By default, the kernel won't begin execution on any GPU until all prior work in all the specified streams has completed. This behavior can be overridden by specifying the flag CUDA_COOPERATIVE_LAUNCH_MULTI_DEVICE_NO_PRE_LAUNCH_SYNC. When this flag is specified, each kernel will only wait for prior work in the stream corresponding to that GPU to complete before it begins execution.

    Similarly, by default, any subsequent work pushed in any of the specified streams will not begin execution until the kernels on all GPUs have completed. This behavior can be overridden by specifying the flag CUDA_COOPERATIVE_LAUNCH_MULTI_DEVICE_NO_POST_LAUNCH_SYNC. When this flag is specified, any subsequent work pushed in any of the specified streams will only wait for the kernel launched on the GPU corresponding to that stream to complete before it begins execution.

    Calling cuLaunchCooperativeKernelMultiDevice() sets persistent function state that is the same as the function state set through the cuLaunchKernel API when called individually for each element in launchParamsList.

    When kernels are launched via cuLaunchCooperativeKernelMultiDevice(), the previous block shape, shared size and parameter info associated with each CUDA_LAUNCH_PARAMS::function in launchParamsList are overwritten.

    Note that to use cuLaunchCooperativeKernelMultiDevice(), the kernels must either have been compiled with toolchain version 3.2 or later so that they will contain kernel parameter information, or have no kernel parameters. If neither of these conditions is met, then cuLaunchCooperativeKernelMultiDevice() will return CUDA_ERROR_INVALID_IMAGE.

    Declaration
    public static CUResult cuLaunchCooperativeKernelMultiDevice(CudaLaunchParams[] launchParamsList, uint numDevices, CudaCooperativeLaunchMultiDeviceFlags flags)
    Parameters
    Type Name Description
    CudaLaunchParams[] launchParamsList

    List of launch parameters, one per device

    System.UInt32 numDevices

    Size of the launchParamsList array

    CudaCooperativeLaunchMultiDeviceFlags flags

    Flags to control launch behavior

    Returns
    Type Description
    CUResult

    cuLaunchGrid(CUfunction, Int32, Int32)

    Invokes the kernel f on a grid_width x grid_height grid of blocks. Each block contains the number of threads specified by a previous call to cuFuncSetBlockShape(CUfunction, Int32, Int32, Int32).

    Declaration
    [Obsolete("Don't use this CUDA API call with CUDA version >= 4.0.")]
    public static CUResult cuLaunchGrid(CUfunction f, int grid_width, int grid_height)
    Parameters
    Type Name Description
    CUfunction f

    Kernel to launch

    System.Int32 grid_width

    Width of grid in blocks

    System.Int32 grid_height

    Height of grid in blocks

    Returns
    Type Description
    CUResult

    CUDA Error Codes: Success, ErrorDeinitialized, ErrorNotInitialized, ErrorInvalidContext, ErrorInvalidValue, ErrorLaunchFailed, ErrorLaunchOutOfResources, ErrorLaunchTimeout, ErrorLaunchIncompatibleTexturing. Note that this function may also return error codes from previous, asynchronous launches.

    cuLaunchGridAsync(CUfunction, Int32, Int32, CUstream)

    Invokes the kernel f on a grid_width x grid_height grid of blocks. Each block contains the number of threads specified by a previous call to cuFuncSetBlockShape(CUfunction, Int32, Int32, Int32).

    cuLaunchGridAsync can optionally be associated with a stream by passing a non-zero hStream argument.

    Declaration
    [Obsolete("Don't use this CUDA API call with CUDA version >= 4.0.")]
    public static CUResult cuLaunchGridAsync(CUfunction f, int grid_width, int grid_height, CUstream hStream)
    Parameters
    Type Name Description
    CUfunction f

    Kernel to launch

    System.Int32 grid_width

    Width of grid in blocks

    System.Int32 grid_height

    Height of grid in blocks

    CUstream hStream

    Stream identifier

    Returns
    Type Description
    CUResult

    CUDA Error Codes: Success, ErrorDeinitialized, ErrorNotInitialized, ErrorInvalidContext, ErrorInvalidValue, ErrorLaunchFailed, ErrorLaunchOutOfResources, ErrorLaunchTimeout, ErrorLaunchIncompatibleTexturing. Note that this function may also return error codes from previous, asynchronous launches.

    cuLaunchKernel(CUfunction, UInt32, UInt32, UInt32, UInt32, UInt32, UInt32, UInt32, CUstream, IntPtr[], IntPtr[])

    Invokes the kernel f on a gridDimX x gridDimY x gridDimZ grid of blocks. Each block contains blockDimX x blockDimY x blockDimZ threads.
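    A common way to choose the grid dimensions is to divide the number of elements by the block size and round up, so that every element is covered by a thread. The following is a minimal sketch of that calculation for one dimension; GridSizing is a hypothetical helper, not part of ManagedCuda.

```csharp
using System;

// Hypothetical helper, not part of ManagedCuda.
public static class GridSizing
{
    // Number of blocks needed so that blocks x threadsPerBlock >= elementCount.
    // A result of 0 means there is nothing to launch.
    public static uint BlocksFor(ulong elementCount, uint threadsPerBlock)
    {
        if (threadsPerBlock == 0)
            throw new ArgumentOutOfRangeException(nameof(threadsPerBlock));

        // Divide first, then add one block for any remainder,
        // which avoids overflow in the usual (n + b - 1) / b form.
        ulong blocks = elementCount / threadsPerBlock
                     + (elementCount % threadsPerBlock != 0 ? 1UL : 0UL);

        // Throws OverflowException if the count does not fit in a uint.
        return checked((uint)blocks);
    }
}
```

    For 1000 elements and 256 threads per block this gives gridDimX = 4. The last block then contains threads beyond the end of the data, so the kernel is assumed to check its index against the element count.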

    sharedMemBytes sets the amount of dynamic shared memory that will be available to each thread block.

    cuLaunchKernel can optionally be associated with a stream by passing a non-zero hStream argument.

    Kernel parameters to f can be specified in one of two ways:

    1) Kernel parameters can be specified via kernelParams. If f has N parameters, then kernelParams needs to be an array of N pointers. Each of kernelParams[0] through kernelParams[N-1] must point to a region of memory from which the actual kernel parameter will be copied. The number of kernel parameters and their offsets and sizes do not need to be specified as that information is retrieved directly from the kernel's image.

    2) Kernel parameters can also be packaged by the application into a single buffer that is passed in via the extra parameter. This places the burden on the application of knowing each kernel parameter's size and alignment/padding within the buffer.

    The extra parameter exists to allow cuLaunchKernel to take additional less commonly used arguments. extra specifies a list of names of extra settings and their corresponding values. Each extra setting name is immediately followed by the corresponding value. The list must be terminated with either NULL or CU_LAUNCH_PARAM_END. The recognized setting names are:

    - CU_LAUNCH_PARAM_END, which indicates the end of the extra array;
    - CU_LAUNCH_PARAM_BUFFER_POINTER, which specifies that the next value in extra will be a pointer to a buffer containing all the kernel parameters for launching kernel f;
    - CU_LAUNCH_PARAM_BUFFER_SIZE, which specifies that the next value in extra will be a pointer to a size_t containing the size of the buffer specified with CU_LAUNCH_PARAM_BUFFER_POINTER.

    The error CUDA_ERROR_INVALID_VALUE will be returned if kernel parameters are specified with both kernelParams and extra (i.e. both kernelParams and extra are non-NULL).
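    The second way of passing parameters can be sketched as follows: the values are packed into one buffer at naturally aligned offsets, and the extra array alternates setting names and values. LaunchExtra and its members are hypothetical names, not part of ManagedCuda. The three constants mirror the macro values defined in the CUDA header cuda.h, and the (int, double) parameter list is only an assumed example.

```csharp
using System;

// Hypothetical helper, not part of ManagedCuda.
public static class LaunchExtra
{
    // Values of the corresponding macros in cuda.h.
    public static readonly IntPtr ParamEnd = IntPtr.Zero;             // CU_LAUNCH_PARAM_END
    public static readonly IntPtr ParamBufferPointer = new IntPtr(1); // CU_LAUNCH_PARAM_BUFFER_POINTER
    public static readonly IntPtr ParamBufferSize = new IntPtr(2);    // CU_LAUNCH_PARAM_BUFFER_SIZE

    // Rounds offset up to the next multiple of alignment (a power of two).
    public static int AlignUp(int offset, int alignment)
    {
        return (offset + alignment - 1) & ~(alignment - 1);
    }

    // Packs the parameters of an assumed kernel f(int n, double scale).
    // The double must start at a multiple of 8, so 4 padding bytes follow n.
    public static byte[] Pack(int n, double scale)
    {
        int offsetN = 0;
        int offsetScale = AlignUp(offsetN + sizeof(int), sizeof(double));
        var buffer = new byte[offsetScale + sizeof(double)];
        BitConverter.GetBytes(n).CopyTo(buffer, offsetN);
        BitConverter.GetBytes(scale).CopyTo(buffer, offsetScale);
        return buffer;
    }

    // Builds the extra array: name, value, name, value, terminator.
    // bufferPtr points to the packed buffer; sizePtr points to a size_t
    // holding its length. Both must stay valid (pinned) during the call.
    public static IntPtr[] BuildExtra(IntPtr bufferPtr, IntPtr sizePtr)
    {
        return new[] { ParamBufferPointer, bufferPtr, ParamBufferSize, sizePtr, ParamEnd };
    }
}
```

    When an array built this way is passed as extra, kernelParams is left as null, since supplying both results in an invalid-value error.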

    Calling cuLaunchKernel sets persistent function state that is the same as the function state set through the following deprecated APIs:

    cuFuncSetBlockShape(), cuFuncSetSharedSize(), cuParamSetSize(), cuParamSeti(), cuParamSetf(), cuParamSetv()

    When the kernel f is launched via cuLaunchKernel, the previous block shape, shared size and parameter info associated with f are overwritten.

    Note that to use cuLaunchKernel, the kernel f must either have been compiled with toolchain version 3.2 or later so that it will contain kernel parameter information, or have no kernel parameters. If neither of these conditions is met, then cuLaunchKernel will return ErrorInvalidImage.

    Declaration
    public static CUResult cuLaunchKernel(CUfunction f, uint gridDimX, uint gridDimY, uint gridDimZ, uint blockDimX, uint blockDimY, uint blockDimZ, uint sharedMemBytes, CUstream hStream, IntPtr[] kernelParams, IntPtr[] extra)
    Parameters
    Type Name Description
    CUfunction f

    Kernel to launch

    System.UInt32 gridDimX

    Width of grid in blocks

    System.UInt32 gridDimY

    Height of grid in blocks

    System.UInt32 gridDimZ

    Depth of grid in blocks

    System.UInt32 blockDimX

    X dimension of each thread block

    System.UInt32 blockDimY

    Y dimension of each thread block

    System.UInt32 blockDimZ

    Z dimension of each thread block

    System.UInt32 sharedMemBytes

    Dynamic shared-memory size per thread block in bytes

    CUstream hStream

    Stream identifier

    System.IntPtr[] kernelParams

    Array of pointers to kernel parameters

    System.IntPtr[] extra

    Extra options

    Returns
    Type Description
    CUResult

    CUDA Error Codes: Success, ErrorDeinitialized, ErrorNotInitialized, ErrorInvalidContext, ErrorInvalidHandle, ErrorInvalidImage, ErrorInvalidValue, ErrorLaunchFailed, ErrorLaunchOutOfResources, ErrorLaunchTimeout, ErrorLaunchIncompatibleTexturing, ErrorSharedObjectInitFailed. Note that this function may also return error codes from previous, asynchronous launches.
