Introduction to Metal Compute: Kernel Encoder

CPU Side: Encoder

Now it’s time to write the CPU side of the Metal pipeline. First, let’s take a quick look at how GPU work scheduling is organised.

To get the GPU to perform work on your behalf, you need to send commands to it. There are three types of commands: render, compute and blit. A compute command is what we need to schedule our adjustments shader for execution.

The objects that you operate while creating a command for the GPU are:

- MTLDevice: an abstraction of the GPU itself and the root object that creates all the others;
- MTLCommandQueue: a serial queue of command buffers scheduled for execution on the device;
- MTLCommandBuffer: a container for the encoded commands of a single batch of GPU work;
- MTLCommandEncoder: a lightweight object that writes commands and resource bindings into a command buffer.

The hierarchy of creation of these objects is sketched below.
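Here is a minimal sketch of that hierarchy in code (the force-unwraps are for brevity only):

import Metal

// Each object is created by the one above it in the hierarchy.
let device = MTLCreateSystemDefaultDevice()!              // the GPU
let commandQueue = device.makeCommandQueue()!             // serial queue of command buffers
let commandBuffer = commandQueue.makeCommandBuffer()!     // one batch of GPU work
let encoder = commandBuffer.makeComputeCommandEncoder()!  // encodes compute commands into the buffer
// For render or blit work, you would ask the buffer for a render or blit encoder instead.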

Now let’s make an empty Swift file, Adjustments.swift.

Adjustments

We are going to create an Adjustments class which will be responsible for encoding the work for the GPU and passing it all the necessary data: temperature and tint in our case. Following Metal’s paradigm of compiling the instructions once and quickly reusing them at runtime, Adjustments will store the pipeline state as a property.

import Metal

final class Adjustments {

}

Create temperature and tint properties. These values will be modified by the UI and then sent to the kernel while encoding.

var temperature: Float = .zero
var tint: Float = .zero

Create the dispatch flag and the pipeline state. These values are initialised once and stored so that they can be reused during encoding.

private var deviceSupportsNonuniformThreadgroups: Bool
private let pipelineState: MTLComputePipelineState

The constructor of the Adjustments class takes a Metal library as an argument. The library is used later to initialise the function for the pipeline state. As one library can contain multiple functions, it’s good practice to initialise the library once and then reuse it, so we’re going to create and store the library outside of the class.

init(library: MTLLibrary) throws {

}
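Since the library lives outside of the class, a hypothetical call site might look like this, assuming the adjustments kernel from the previous part is compiled into the app’s default library:

let device = MTLCreateSystemDefaultDevice()!
let library = device.makeDefaultLibrary()!   // contains the compiled adjustments kernel
let adjustments = try Adjustments(library: library)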

From this point on, we are going to fill in the constructor step by step.

Constructor

Different iPhones have different hardware (including the GPU) that supports different sets of features. In order to initialise the deviceSupportsNonuniformThreadgroups property correctly, we need to look up the non-uniform threadgroups feature in the Metal Feature Set Tables and find the corresponding feature set that describes the hardware supporting it.

self.deviceSupportsNonuniformThreadgroups = library.device.supportsFeatureSet(.iOS_GPUFamily4_v1)
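A side note: on iOS 13 and later the feature-set API is deprecated in favour of GPU families. If your deployment target allows it, the equivalent check should be the family-based one (Apple4, the A11 GPU, is the first family listed with non-uniform threadgroup support):

self.deviceSupportsNonuniformThreadgroups = library.device.supportsFamily(.apple4)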

Initialise a function constants object and set the deviceSupportsNonuniformThreadgroups value on it. The index is set to 0, the same as it was declared in the shader.

let constantValues = MTLFunctionConstantValues()
constantValues.setConstantValue(&self.deviceSupportsNonuniformThreadgroups,
                                type: .bool,
                                index: 0)

Create a function from the library with the previously initialised function constants. The name of the function is the same as in the shader.

let function = try library.makeFunction(name: "adjustments",
                                        constantValues: constantValues)

The final step is the pipeline state creation. At this point the shader will be compiled into GPU instructions, and the passed function constants will be used to determine whether the boundary check will be among them.

self.pipelineState = try library.device.makeComputePipelineState(function: function)

The result should look like this:

import Metal

final class Adjustments {

    var temperature: Float = .zero
    var tint: Float = .zero
    private var deviceSupportsNonuniformThreadgroups: Bool
    private let pipelineState: MTLComputePipelineState
    
    init(library: MTLLibrary) throws {
        self.deviceSupportsNonuniformThreadgroups = library.device.supportsFeatureSet(.iOS_GPUFamily4_v1)
        let constantValues = MTLFunctionConstantValues()
        constantValues.setConstantValue(&self.deviceSupportsNonuniformThreadgroups,
                                        type: .bool,
                                        index: 0)
        let function = try library.makeFunction(name: "adjustments",
                                                constantValues: constantValues)
        self.pipelineState = try library.device.makeComputePipelineState(function: function)
    }
    
}

Encoding Function

Next, we’re going to write the encoding of the kernel. The main thing we need to do here is use the command buffer’s encoder to encode all the necessary resources and instructions for the GPU.

Below the class constructor, add the encoding function.

func encode(source: MTLTexture,
            destination: MTLTexture,
            in commandBuffer: MTLCommandBuffer) {

}

Now let’s fill it in. Create a command encoder. This lightweight object is used to encode everything into the command buffer.

guard let encoder = commandBuffer.makeComputeCommandEncoder()
else { return }

Set source and destination textures at the same indices we used in the shaders.

encoder.setTexture(source,
                   index: 0)
encoder.setTexture(destination,
                   index: 1)

Set the tint and temperature values. Given that the data we send to the GPU is just two float values, which is tiny, we use the setBytes function recommended for such cases. If the data were large, we’d create an MTLBuffer for it and use setBuffer instead. The indices are the same as in the kernel’s arguments.

encoder.setBytes(&self.temperature,
                 length: MemoryLayout<Float>.stride,
                 index: 0)
encoder.setBytes(&self.tint,
                 length: MemoryLayout<Float>.stride,
                 index: 1)
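For contrast, here is a hedged sketch of the buffer-based path for a larger payload; the array contents and the buffer index 2 are made up for illustration:

// Hypothetical larger payload passed via an MTLBuffer instead of setBytes.
var lut = [Float](repeating: 0, count: 4096)
let lutBuffer = commandBuffer.device.makeBuffer(bytes: &lut,
                                                length: MemoryLayout<Float>.stride * lut.count,
                                                options: .storageModeShared)
encoder.setBuffer(lutBuffer, offset: 0, index: 2)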

Calculate the size of the grid and the threadgroups. The grid size should be the same as the texture’s so that each thread can work on its own pixel. As for the threadgroup size, we want it to be as large as possible in order to maximise parallelisation. The calculation of the threadgroup size is based on two properties of the pipeline state: maxTotalThreadsPerThreadgroup and threadExecutionWidth. The first defines the maximum number of threads that can be in a single threadgroup; the second equals the width of a SIMD group and defines the number of threads executed in parallel on the GPU.

let gridSize = MTLSize(width: source.width,
                       height: source.height,
                       depth: 1)
let threadGroupWidth = self.pipelineState.threadExecutionWidth
let threadGroupHeight = self.pipelineState.maxTotalThreadsPerThreadgroup / threadGroupWidth
let threadGroupSize = MTLSize(width: threadGroupWidth,
                              height: threadGroupHeight,
                              depth: 1)
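To make this concrete with made-up numbers: on a GPU where threadExecutionWidth is 32 and maxTotalThreadsPerThreadgroup is 1024, the code above yields a 32 × 32 threadgroup of 1024 threads, the maximum allowed.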

Set the pipeline state, which contains the precompiled instructions of our adjustments kernel.

encoder.setComputePipelineState(self.pipelineState)

If the device supports non-uniform threadgroups, we let Metal calculate the number of threadgroups and generate smaller ones along the edges of the grid. If the device doesn’t support this feature, we calculate the number of threadgroups by hand so that they cover the whole texture.

if self.deviceSupportsNonuniformThreadgroups {
    encoder.dispatchThreads(gridSize,
                            threadsPerThreadgroup: threadGroupSize)
} else {
    let threadGroupCount = MTLSize(width: (gridSize.width + threadGroupSize.width - 1) / threadGroupSize.width,
                                   height: (gridSize.height + threadGroupSize.height - 1) / threadGroupSize.height,
                                   depth: 1)
    encoder.dispatchThreadgroups(threadGroupCount,
                                 threadsPerThreadgroup: threadGroupSize)
}
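To check the arithmetic on a made-up example: a 1000 × 700 texture with 32 × 32 threadgroups gives (1000 + 31) / 32 = 32 groups horizontally and (700 + 31) / 32 = 22 vertically, i.e. a 1024 × 704 thread grid. The threads that fall outside the texture bounds are exactly the ones the shader’s boundary check discards.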

After all the encoding is done, we call endEncoding(). Without calling this function, the command buffer won’t know that it is ready to dispatch the commands to the GPU.

encoder.endEncoding()

Here’s the final encoding function:

func encode(source: MTLTexture,
            destination: MTLTexture,
            in commandBuffer: MTLCommandBuffer) {
    guard let encoder = commandBuffer.makeComputeCommandEncoder()
    else { return }

    encoder.setTexture(source,
                       index: 0)
    encoder.setTexture(destination,
                       index: 1)

    encoder.setBytes(&self.temperature,
                     length: MemoryLayout<Float>.stride,
                     index: 0)
    encoder.setBytes(&self.tint,
                     length: MemoryLayout<Float>.stride,
                     index: 1)

    let gridSize = MTLSize(width: source.width,
                           height: source.height,
                           depth: 1)
    let threadGroupWidth = self.pipelineState.threadExecutionWidth
    let threadGroupHeight = self.pipelineState.maxTotalThreadsPerThreadgroup / threadGroupWidth
    let threadGroupSize = MTLSize(width: threadGroupWidth,
                                  height: threadGroupHeight,
                                  depth: 1)

    encoder.setComputePipelineState(self.pipelineState)
    
    if self.deviceSupportsNonuniformThreadgroups {
        encoder.dispatchThreads(gridSize,
                                threadsPerThreadgroup: threadGroupSize)
    } else {
        let threadGroupCount = MTLSize(width: (gridSize.width + threadGroupSize.width - 1) / threadGroupSize.width,
                                       height: (gridSize.height + threadGroupSize.height - 1) / threadGroupSize.height,
                                       depth: 1)
        encoder.dispatchThreadgroups(threadGroupCount,
                                     threadsPerThreadgroup: threadGroupSize)
    }
    
    encoder.endEncoding()
}
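As a preview of how this will be driven (the next part covers the setup properly), a hypothetical call site could look like the following; commandQueue, sourceTexture and destinationTexture are assumed to already exist:

let commandBuffer = commandQueue.makeCommandBuffer()!
adjustments.temperature = 0.3
adjustments.tint = -0.1
adjustments.encode(source: sourceTexture,
                   destination: destinationTexture,
                   in: commandBuffer)
commandBuffer.commit()             // hand the buffer over to the GPU
commandBuffer.waitUntilCompleted() // block until the kernel finishes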

Excellent! Now we have a compute kernel and the corresponding encoder for it. In the next part we are going to write the UIImage to MTLTexture conversion to pass textures to the encoder, create a command queue, and dispatch the kernel 👍.
