wavecrafters blog

Our small contribution to the developer community

Exclusive Prefix Sum Scan


Today I'm going to try to explain an algorithm that is really useful for any parallel developer, but that took me a while to understand and to program myself. In this example we'll be doing the exclusive version of the prefix sum scan, as it is a little more difficult than the inclusive version. It is possible to compute the inclusive scan and then subtract the original vector from the result, but in this post we are going to do it a little differently. The sequential algorithm is pretty simple: it takes a vector and returns another vector in which each value is the sum of the preceding values, not counting the value itself. This is what it looks like:
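
As a rough sketch, a sequential version (with made-up input values, just for illustration) could look like this:

//Sequential reference: out[i] = in[0] + ... + in[i-1], with out[0] = 0.
#include <stdio.h>

void exclusive_scan(const int* in, int* out, int n)
{
    int sum = 0;
    for (int i = 0; i < n; ++i){
        out[i] = sum;           //everything before position i, not counting in[i]
        sum += in[i];
    }
}

int main(void)
{
    int in[8] = {3, 1, 7, 0, 4, 1, 6, 3};
    int out[8];
    exclusive_scan(in, out, 8);
    for (int i = 0; i < 8; ++i)
        printf("%d ", out[i]);  //prints: 0 3 4 11 11 15 16 22
    return 0;
}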

As a base I'm going to take Sean Baxter's Modern GPU version of the algorithm. One of the things that must be explained before getting into it is what VT, or values per thread, is. As taken from Sean's page, this has the following consequences (a small sketch of how these constants usually appear in code follows the list):

  • Increasing VT does more sequential processing per thread, improving overall work-efficiency, but also uses more per-thread state, reducing occupancy and compromising the device's ability to hide latency.

  • Decreasing the VT does less sequential processing per thread, decreasing overall work-efficiency, but also uses less per-thread state, increasing occupancy and enabling the device to better hide latency.
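
In code, VT usually appears as a compile-time constant next to the number of threads per block. A possible set of definitions (the concrete values are only an example, and NT is my name for the thread count; SHARED_SIZE is the shared-memory size used by the kernels below, which keep two entries per thread for their double-buffered scan):

#define NT 128                 //threads per block (example value)
#define VT 7                   //values (input elements) per thread
#define SHARED_SIZE (2 * NT)   //the double-buffered scan uses 2 ints per thread

Each block then processes NT * VT input elements.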

This leads to a three-phase algorithm, described here:

  1. Upsweep. Each thread in each block takes the values that correspond to it (as defined by VT) and adds them. The block then performs an exclusive scan on these per-thread sums, stored in our case in the variable localScan. The last computed value (the block total) is stored in another array, which we will call interBlockScan, as it will be the carry-on for the following blocks.

  2. Exclusive scan the interBlockScan array to find out the carry-on for each block.

  3. Downsweep. Each thread performs an exclusive scan on its values and then adds the carry-ons: the local carry-on is the sum of the preceding values in that block, and the interBlock carry-on is the sum of the preceding blocks. Adding all of them gives us the correct value for the scan (a sequential sketch of this decomposition follows).
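
Before jumping to the CUDA kernels, here is a sequential sketch of the same decomposition (my own illustration, not the device code): cut the input into chunks, scan each chunk locally, scan the chunk totals, and combine the two.

#include <stdlib.h>

void exclusive_scan_3phase(const int* in, int* out, int n, int chunk)
{
    int numChunks = (n + chunk - 1) / chunk;
    int* chunkTotal = (int*)malloc(numChunks * sizeof(int));

    //1. "Upsweep": local exclusive scan per chunk, plus the chunk totals
    for (int c = 0; c < numChunks; ++c){
        int sum = 0;
        for (int i = c * chunk; i < n && i < (c + 1) * chunk; ++i){
            out[i] = sum;
            sum += in[i];
        }
        chunkTotal[c] = sum;            //plays the role of interBlockScan
    }

    //2. Exclusive scan of the chunk totals: the carry-on of each chunk
    int carry = 0;
    for (int c = 0; c < numChunks; ++c){
        int total = chunkTotal[c];
        chunkTotal[c] = carry;
        carry += total;
    }

    //3. "Downsweep": add each chunk's carry-on to its local scan
    for (int c = 0; c < numChunks; ++c)
        for (int i = c * chunk; i < n && i < (c + 1) * chunk; ++i)
            out[i] += chunkTotal[c];

    free(chunkTotal);
}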

Let's explain it graphically. Let's assume the input is the vector shown in the example above. To keep things small, let's also assume that VT is 3 and that the number of threads per block is also 3, so the first block takes the first nine values. In the first phase of the upsweep, every thread takes three values, adds them, and stores the result in localScan. The tricky thing here is that the correspondence is not linear. What I mean by this is that thread 0 takes the values from index -1 to 1, thread 1 from 2 to 4, and so on, instead of each thread taking "its own" three values (0-2 for the first thread, 3-5 for the second, etc.). The reason is that we are computing an exclusive scan, so each thread must take the values preceding its positions, not the positions themselves. As there is no index -1 in a vector, thread 0 takes a 0 as its first value. The last step of the upsweep phase is to store the last computed value in the interBlockScan array.
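
The small loop below (again my own illustration, not kernel code) prints which global indices each upsweep thread reads under these assumptions:

#include <stdio.h>
#define VT 3
#define NT 3   //threads per block in this toy example

int main(void)
{
    for (int block = 0; block < 2; ++block){
        for (int t = 0; t < NT; ++t){
            int gIdx = block * NT + t;
            printf("block %d, thread %d:", block, t);
            for (int i = 0; i < VT; ++i){
                if (t + i == 0)
                    printf(" skip");   //thread 0's first slot: it does not exist in
                                       //block 0 and is covered by the carry-on later
                else
                    printf(" %2d", gIdx * VT + i - 1);
            }
            printf("\n");
        }
    }
    return 0;
}

Running it shows block 0 reading (skip, 0, 1), (2, 3, 4), (5, 6, 7) and block 1 reading (skip, 9, 10), (11, 12, 13), (14, 15, 16). Notice that index 8, the last element of block 0, is not covered by any window; that is why the last thread of each block adds it separately before writing the block total to interBlockScan.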

The next phase is the exclusive scan of the per-block totals computed in the upsweep.

Now, to the downsweep. Here each thread takes its three values (again, the three values preceding its positions) and adds them, step by step, to the local carry-on and the interBlock carry-on. This is easier to see than to explain.

And now, to the code:

Upsweep:

__global__ void k_upsweep(int* result, int* interBlockScan, int* vector, int vectorSize, int vt, int realSize){

    int gIdx = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (gIdx >= realSize) return;

    int tIdx = threadIdx.x;
    __shared__ int s_vector[SHARED_SIZE];

    //Each thread sums the VT values of its shifted window (the values preceding
    //its positions). Thread 0 skips its first slot: for block 0 it is the
    //non-existent index -1, for later blocks it is covered by the carry-on.
    int partial_sum = 0;

    #pragma unroll
    for (int i = 0; i < VT; i++){

        int global_index = gIdx * VT + i;
        if (global_index < vectorSize && (tIdx + i) != 0)
            partial_sum += vector[global_index - 1];
    }

    //Add the block's last element (skipped by the shifted windows) so that the
    //block total is complete; VT - 1 is the offset of a thread's last value.
    if (tIdx == blockDim.x - 1 && gIdx * VT + VT - 1 < vectorSize){
        partial_sum += vector[gIdx * VT + VT - 1];
    }

    s_vector[tIdx] = partial_sum;
    int first = 0;
    __syncthreads();

    //Performs the scan using double buffering in shared memory (which is why
    //SHARED_SIZE must hold at least 2 * blockDim.x values)
    #pragma unroll
    for (int offset = 1; offset < blockDim.x; offset += offset)
    {
        if (tIdx >= offset){
            partial_sum += s_vector[first + tIdx - offset];
        }
        first = blockDim.x - first;
        s_vector[first + tIdx] = partial_sum;
        __syncthreads();
    }

    //Each thread writes the scanned value of the preceding thread (it is an exclusive scan); thread 0 writes 0
    if (tIdx != 0){
        result[gIdx] = s_vector[tIdx + first - 1];
    }else{
        result[gIdx] = 0;
    }

    //Store in another array the last computed element, as it will be the carry on for the next block.
    if (tIdx == 0)
        interBlockScan[blockIdx.x] = s_vector[blockDim.x + first - 1];

}

Exclusive scan:

__global__ void k_exclusiveScan(int* result, int*vector, int vectorSize, int vt)
{
    int gIdx = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (gIdx >= vectorSize) return;

    int tIdx = threadIdx.x;
    __shared__ int s_vector[SHARED_SIZE];

    //Each thread sums its vt values into a register. Note that there is no carry
    //between blocks here, so this kernel produces a complete exclusive scan only
    //when a single block covers the whole input (here, the per-block totals).
    int partial_sum = 0;

    #pragma unroll
    for (int i = 0; i < vt; i++){

        int global_index = gIdx * vt + i;
        if (global_index < vectorSize)
            partial_sum += vector[global_index];

    }

    s_vector[tIdx] = partial_sum;
    int first = 0;
    __syncthreads();

    #pragma unroll
    for (int offset = 1; offset < blockDim.x; offset += offset)
    {
        if (tIdx >= offset){
            partial_sum += s_vector[first + tIdx - offset];
        }
        first = blockDim.x - first;
        s_vector[first + tIdx] = partial_sum;
        __syncthreads();
    }

    if (tIdx != 0){
        result[gIdx] = s_vector[tIdx + first - 1];
    }else{
        result[gIdx] = 0;
    }
}

There is something in this code that I think is worth explaining, and that is how Sean Baxter does the scan in the shared memory array. Let's take the state of the shared array right after the per-thread sums have been loaded. The -1 values are simply uninitialized positions of the shared memory array.

In the successive iterations, the array will be filled as follows.
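
To make this easier to follow, here is a small sequential emulation (the per-thread sums 3, 1, 7 and 0 are made up) that performs the same read/toggle/write steps as the kernels and prints the active half of the array after each iteration:

#include <stdio.h>
#define NT 4   //threads per block in this toy trace

int main(void)
{
    int s[2 * NT];                  //the double-buffered "shared" array
    int x[NT] = {3, 1, 7, 0};       //per-thread partial sums (made up)
    int first = 0;

    for (int t = 0; t < NT; ++t)
        s[first + t] = x[t];

    for (int offset = 1; offset < NT; offset += offset){
        //every "thread" t reads from the currently active half...
        for (int t = 0; t < NT; ++t)
            if (t >= offset)
                x[t] += s[first + t - offset];
        //...and writes its running sum into the other half
        first = NT - first;
        for (int t = 0; t < NT; ++t)
            s[first + t] = x[t];

        printf("offset %d:", offset);
        for (int t = 0; t < NT; ++t)
            printf(" %2d", s[first + t]);
        printf("\n");               //offset 1:  3  4  8  7
                                    //offset 2:  3  4 11 11
    }
    return 0;
}

After the last iteration the active half holds the inclusive scan of the per-thread sums (3 4 11 11 here); the kernels then read one position to the left per thread to turn it into an exclusive scan.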

Downsweep:

__global__ void k_downsweep(int* result,
                            int* originalArray,
                            int* localScan,
                            int* interBlockScan,
                            int vt,
                            int size)
{

    int gIdx = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (gIdx >= size) return;

    int tIdx = threadIdx.x;
    int carryOn = 0;

    if (blockIdx.x != 0){
        carryOn = interBlockScan[blockIdx.x];
    }

    int localCarryOn = 0;
    if (tIdx != 0)
        localCarryOn = localScan[gIdx];

    int currentSum = 0;

    //Takes each value and adds the preceding values (per VT) plus
    //the local carry-on and the interBlock carry-on
    #pragma unroll
    for (int i = 0; i < VT; i++)
    {
        //Thread 0 has no element preceding its first slot: for block 0 it does
        //not exist, for later blocks it is already included in carryOn.
        if (tIdx == 0 && i == 0)
            currentSum = 0;
        else
            currentSum += originalArray[gIdx * VT + i - 1];
        result[gIdx * VT + i] = currentSum + localCarryOn + carryOn;
    }

}
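
Finally, this is roughly how the three kernels can be chained from the host. The launch code is not part of this post, so the grid and block sizes, the buffer names and the single-block launch of k_exclusiveScan are assumptions of this sketch (that launch only works while the number of blocks fits into one CUDA block and SHARED_SIZE can hold twice that many values):

//Hypothetical host driver; the buffers are assumed to be padded to
//numBlocks * NT * VT elements, since k_downsweep has no per-element bounds check.
void exclusiveScan(int* d_in, int* d_out, int vectorSize)
{
    int elemsPerBlock = NT * VT;
    int numBlocks  = (vectorSize + elemsPerBlock - 1) / elemsPerBlock;
    int numThreads = numBlocks * NT;              //"realSize" in k_upsweep

    int *d_localScan, *d_blockSums, *d_blockScan;
    cudaMalloc(&d_localScan, numThreads * sizeof(int));
    cudaMalloc(&d_blockSums, numBlocks  * sizeof(int));
    cudaMalloc(&d_blockScan, numBlocks  * sizeof(int));

    //1. Per-block local scans and per-block totals
    k_upsweep<<<numBlocks, NT>>>(d_localScan, d_blockSums, d_in,
                                 vectorSize, VT, numThreads);

    //2. Exclusive scan of the block totals: the carry-on of each block
    k_exclusiveScan<<<1, numBlocks>>>(d_blockScan, d_blockSums, numBlocks, 1);

    //3. Combine local scans, local carry-ons and inter-block carry-ons
    k_downsweep<<<numBlocks, NT>>>(d_out, d_in, d_localScan, d_blockScan,
                                   VT, numThreads);

    cudaFree(d_localScan);
    cudaFree(d_blockSums);
    cudaFree(d_blockScan);
}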

There is of course plenty of room for improvement. Testing a vector of 100,000,000 elements against Thrust's exclusive scan (on a GTX 750 Ti), we found our version to be around 11 ms slower (they took ~21 ms and we ~31 ms), which is not that bad at all :)
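
For reference, the Thrust baseline can be reproduced with something along these lines (the exact benchmark setup is not shown here, so the fill values and the CUDA-event timing are just a sketch):

#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/scan.h>

int main()
{
    const int n = 100 * 1000 * 1000;          //100,000,000 elements
    thrust::device_vector<int> d_in(n, 1);    //made-up contents
    thrust::device_vector<int> d_out(n);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    thrust::exclusive_scan(d_in.begin(), d_in.end(), d_out.begin());
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("thrust::exclusive_scan: %.2f ms\n", ms);
    return 0;
}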
