MLIR Super Vectorization


Super-vectorization, which provides a single, compact representation in the vector types, is a very cool feature of the MLIR Affine dialect. It implements a high-level, abstract vectorization strategy on functions.

Vector granularity

This pass is designed to perform vectorization at a super-vector granularity. A super-vector is loosely defined as a vector type that is a multiple of a “good” vector size so the HW can efficiently implement a set of high-level primitives. Multiple is understood along any dimension; e.g. vector<16xf32> and vector<2x8xf32> are valid super-vectors for a vector<8xf32> HW vector. Note that a “good vector size so the HW can efficiently implement a set of high-level primitives” is not necessarily an integer multiple of actual hardware registers. We leave details of this distinction unspecified for now.
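
To make the "multiple along any dimension" idea concrete, here is a minimal MLIR sketch, assuming a hypothetical target whose native vector is vector<8xf32>; both functions below operate on valid super-vectors for it:

    // Hypothetical HW vector: vector<8xf32>.
    // vector<16xf32> is 2x the HW vector along its single dimension.
    func.func @supervec_1d(%a: vector<16xf32>, %b: vector<16xf32>) -> vector<16xf32> {
      %0 = arith.addf %a, %b : vector<16xf32>
      return %0 : vector<16xf32>
    }

    // vector<2x8xf32> is a 2-D super-vector: a "tile" of two HW vectors.
    func.func @supervec_2d(%a: vector<2x8xf32>, %b: vector<2x8xf32>) -> vector<2x8xf32> {
      %0 = arith.mulf %a, %b : vector<2x8xf32>
      return %0 : vector<2x8xf32>
    }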

Some may prefer the terminology of a “tile of HW vectors”. In this case, one should note that super-vectors implement an “always full tile” abstraction: they guarantee that no partial-tile separation is necessary, by relying on a high-level copy-reshape abstraction that we call vector.transfer. This copy-reshape operation is also responsible for performing layout transposition if necessary. In the general case this will require a scoped allocation in some notional local memory.
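
As a sketch of what such a copy-reshape looks like in IR (the shapes are illustrative; permutation_map is the attribute vector.transfer_read uses to express a transposed load):

    func.func @transfer_transpose(%A: memref<64x128xf32>, %i: index, %j: index) -> vector<16x8xf32> {
      %pad = arith.constant 0.0 : f32
      // Read the 8x16 region %A[%i : %i+8, %j : %j+16] and transpose it into
      // a 16x8 super-vector; %pad fills any out-of-bounds elements.
      %v = vector.transfer_read %A[%i, %j], %pad
             {permutation_map = affine_map<(d0, d1) -> (d1, d0)>}
           : memref<64x128xf32>, vector<16x8xf32>
      return %v : vector<16x8xf32>
    }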

Whatever mental model one prefers for this abstraction, the key point is that we burn into a single, compact representation in the vector types information that is expected to reduce the impact of the phase-ordering problem. Indeed, a vector type conveys information that:

  • the associated loops have dependency semantics that do not prevent vectorization;
  • the associated loops have been sliced in chunks of static sizes that are compatible with vector sizes (i.e. similar to unroll-and-jam);
  • the inner loops, in the unroll-and-jam analogy of the previous point, are captured by the vector type and no vectorization-hampering transformations can be applied to them anymore;
  • the underlying memrefs are accessed in some notional contiguous way that allows loading into vectors with some amount of spatial locality.

In other words, super-vectorization provides a level of separation of concerns by way of opacity to subsequent passes. This has the effect of encapsulating and propagating vectorization constraints down the list of passes until we are ready to lower further.
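
To see all of this at once, here is a minimal before/after sketch, assuming the Affine super-vectorizer is run with a virtual vector size of 128 (the exact output of the pass may differ):

    // Before: a scalar affine loop.
    func.func @vadd(%A: memref<1024xf32>, %B: memref<1024xf32>) {
      affine.for %i = 0 to 1024 {
        %a = affine.load %A[%i] : memref<1024xf32>
        %b = affine.load %B[%i] : memref<1024xf32>
        %s = arith.addf %a, %b : f32
        affine.store %s, %B[%i] : memref<1024xf32>
      }
      return
    }

    // After (sketch): the loop is sliced in chunks of 128 and the body now
    // operates on vector<128xf32>; the vector type carries the guarantees
    // listed above.
    func.func @vadd_vec(%A: memref<1024xf32>, %B: memref<1024xf32>) {
      %pad = arith.constant 0.0 : f32
      affine.for %i = 0 to 1024 step 128 {
        %va = vector.transfer_read %A[%i], %pad : memref<1024xf32>, vector<128xf32>
        %vb = vector.transfer_read %B[%i], %pad : memref<1024xf32>, vector<128xf32>
        %vs = arith.addf %va, %vb : vector<128xf32>
        vector.transfer_write %vs, %B[%i] : vector<128xf32>, memref<1024xf32>
      }
      return
    }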

For a particular target, a notion of minimal n-D vector size will be specified, and vectorization targets a multiple of it. In the following paragraphs, let “k .” represent “a multiple of”, to be understood as a multiple in the same dimension (e.g. vector<16 x k . 128> summarizes vector<16 x 128>, vector<16 x 256>, vector<16 x 1024>, etc.).

Some notable super-vector sizes of interest include (non-exhaustively):

  • CPU: vector<k . HW_vector_size>, vector<k’ . core_count x k . HW_vector_size>, vector<socket_count x k’ . core_count x k . HW_vector_size>;
  • GPU: vector<k . warp_size>, vector<k . warp_size x float2>, vector<k . warp_size x float4>, vector<k . warp_size x 4 x 4 x 4> (for tensor_core sizes).

Loops and operations are emitted that operate on those super-vector shapes. Subsequent lowering passes will materialize to actual HW vector sizes. These passes are expected to be (gradually) more target-specific.
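
For instance, with upstream mlir-opt this whole flow can be sketched as a pass pipeline; the pass names below are those of recent MLIR releases and have changed across versions, so treat them as an assumption to check against your build:

    # Super-vectorize the affine loops, then progressively lower the
    # resulting super-vector ops toward LLVM.
    mlir-opt vadd.mlir \
      -affine-super-vectorize="virtual-vector-size=128" \
      -convert-vector-to-scf -lower-affine \
      -convert-scf-to-cf -convert-vector-to-llvm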
