wavecrafters blog

Our small contribution to the developer community

Exclusive Prefix Sum Scan


Today I'm going to try to explain a really useful algorithm for any parallel developer, but one that took me a while to understand and to program myself. In this example we'll be doing the exclusive version of the prefix sum scan, as it is a little more difficult than the inclusive version. It is possible to do this by computing the inclusive scan and then subtracting the original vector from the result, but in this post we are going to do it a little differently. The sequential version of the algorithm is pretty simple. It takes a vector and returns another vector, where each value is the sum of the preceding values, not counting itself. This is what it looks like:
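
For example, a minimal sequential sketch (my own, with made-up values): for an input {3, 1, 7, 0, 4} the exclusive scan is {0, 3, 4, 11, 11}.

//Sequential exclusive prefix sum: out[i] = in[0] + in[1] + ... + in[i-1], with out[0] = 0.
void exclusive_scan_cpu(const int* in, int* out, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
    {
        out[i] = sum;     //sum of everything before position i, not counting in[i]
        sum += in[i];
    }
}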

As a base I'm going to take Sean Baxter's Modern GPU version of the algorithm. One of the things that must be explained before getting into it is what VT, or values per thread, is. As taken from Sean's page, the choice of VT has the following consequences:

  • Increasing VT does more sequential processing per thread, improving overall work-efficiency, but also uses more per-thread state, reducing occupancy and compromising the device's ability to hide latency.

  • Decreasing VT does less sequential processing per thread, decreasing overall work-efficiency, but also uses less per-thread state, increasing occupancy and enabling the device to better hide latency.
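
To make the grain size concrete, here is a small sketch of how VT and the threads per block determine the launch; NT, NV and the arithmetic are my own illustration (the post's kernels only rely on the VT macro):

#define NT 128                                //threads per block (illustrative value)
#define VT 7                                  //values per thread (illustrative value)
#define NV (NT * VT)                          //inputs covered by one block

int blocks_needed(int vectorSize)             //blocks needed to cover the whole input
{
    return (vectorSize + NV - 1) / NV;
}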

This leads to a three-phase algorithm, described here:

  1. Upsweep. Each thread in each block takes the values that correspond to it (defined by VT) and adds them. Then the block performs an exclusive scan on those per-thread sums, stored in our case in the variable localScan. The block's total is also stored in another array, which we will call interBlockScan, as these totals will become the carry-on for the following blocks.

  2. Exclusive scan the interBlockScan array to find out the carry-on for each block.

  3. Downsweep. Each thread performs an exclusive scan on its values and then adds the carry-ons. The local carry-on is the sum of the preceding values in that block, and the interBlock carry-on is the sum of the preceding blocks. Adding all of them gives us the correct value of the scan.

Let's explain it graphically. Let's assume we have as input the vector shown in the example above. In order to explain it, let's also assume that our VT is 3, and that the number of threads per block is also 3. The first block would then take the first nine values. In the first phase of the Upsweep, every thread takes three values, adds them, and stores the sum in localScan. The tricky thing here is that the correspondence is not linear. What I mean by this is that thread 0 will take the values at positions -1 to 1, thread 1 those at positions 2 to 4, and so on, instead of each thread taking its own three values (0-2 for the first thread, 3-5 for the second, etc). The reason for this is that we are computing an exclusive scan, so each thread must take the values preceding its own, not its own. As there is no position -1 in a vector, thread 0 will take a 0 as its first value. The last step of the Upsweep phase is to store the last computed value into the interBlockScan array.
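
As a sketch of that mapping (a host-side illustration of mine, not code from the post), this is what one thread of the first block would accumulate, for VT = 3:

//Which inputs each thread of the first block sums during the upsweep, for vt = 3:
//  thread 0 : in[-1], in[0], in[1]   (index -1 does not exist, so it contributes 0)
//  thread 1 : in[ 2], in[3], in[4]
//  thread 2 : in[ 5], in[6], in[7]
int upsweep_partial_sum(const int* in, int t, int vt)   //t = thread index inside the block
{
    int partial_sum = 0;
    for (int i = 0; i < vt; i++){
        int idx = t * vt + i - 1;                 //shifted one position left of the thread's own range
        if (idx >= 0) partial_sum += in[idx];     //skip the nonexistent element before the first one
    }
    return partial_sum;
}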

The next phase is to do the exclusive scan of the values calculated by each block.

Now, to the Downsweep. Here each thread takes its 3 values (again, the three preceding its own), and adds them, step by step, to the local carry-on and the interBlock carry-on. This is easier to see than to explain.
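
A short trace of the first block may help (VT = 3, three threads, inputs x0…x8, so the inter-block carry-on is 0; the local carry-ons come from the upsweep):

//local carry-ons after the upsweep: thread 0 -> 0, thread 1 -> x0+x1, thread 2 -> x0+...+x4
//
//thread 1, value 0 : running sum = x2       output[3] = x2        + (x0+x1)     + 0  =  x0+...+x2
//thread 1, value 1 : running sum = x2+x3    output[4] = (x2+x3)   + (x0+x1)     + 0  =  x0+...+x3
//thread 2, value 0 : running sum = x5       output[6] = x5        + (x0+...+x4) + 0  =  x0+...+x5
//
//Thread 0 of each block skips the element just before its range: for the first block it
//does not exist, and for later blocks it is already included in the inter-block carry-on.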

And now, to the code:

Upsweep:

__global__ void k_upsweep(int* result, int* interBlockScan, int* vector, int vectorSize, int vt, int realSize){

    int gIdx = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (gIdx >= realSize) return;

    int tIdx = threadIdx.x;
    __shared__ int s_vector[SHARED_SIZE];

    //Each thread accumulates its VT values; the per-thread sum goes to shared memory below.
    int partial_sum = 0;

    #pragma unroll
    for (int i = 0; i < VT; i++){

        int global_index = gIdx * VT + i;
        //Read the element *before* global_index (exclusive scan). Thread 0 skips i == 0:
        //that element does not exist for the first block, and for later blocks it is
        //already accounted for by the inter-block carry-on.
        if (global_index < vectorSize && (tIdx + i) != 0)
            partial_sum += vector[global_index-1];
    }

    //add the block's last element (skipped by the shifted load) to the last thread's sum
    if (tIdx == blockDim.x - 1 && gIdx * VT + VT - 1 < vectorSize){
        partial_sum += vector[gIdx * VT + VT - 1];   //generalises the original hard-coded "+ 2", which only works for VT == 3
    }

    s_vector[tIdx] = partial_sum;
    int first = 0;
    __syncthreads();

    //Performs the scan using double buffering in the shared memory
    #pragma unroll
    for (int offset = 1; offset < blockDim.x; offset += offset)
    {
        if (tIdx >= offset){
            partial_sum += s_vector[first + tIdx - offset];
        }
        first = blockDim.x - first;
        s_vector[first + tIdx] = partial_sum;
        __syncthreads();
    }

    //each thread stores the value at the preceding position (because it's an exclusive scan)
    if (tIdx != 0){
        result[gIdx] = s_vector[tIdx + first - 1];
    }else{
        result[gIdx] = 0;
    }

    //Store the last computed element (the block total) in another array; it will become the carry-on for the following blocks.
    if (tIdx == 0)
        interBlockScan[blockIdx.x] = s_vector[blockDim.x + first - 1];

}

Exclusive scan:

__global__ void k_exclusiveScan(int* result, int* vector, int vectorSize, int vt)
{
    int gIdx = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (gIdx >= vectorSize) return;

    int tIdx = threadIdx.x;
    __shared__ int s_vector[SHARED_SIZE];

    //Load values in shared memory
    int partial_sum = 0;

    #pragma unroll
    for (int i= 0; i < vt; i++){

        int global_index = gIdx * vt + i;
        if (global_index < vectorSize)
            partial_sum += vector[global_index];

    }

    s_vector[tIdx] = partial_sum;
    int first = 0;
    __syncthreads();

    #pragma unroll
    for (int offset = 1; offset < blockDim.x; offset += offset)
    {
        if (tIdx >= offset){
            partial_sum += s_vector[first + tIdx - offset];
        }
        first = blockDim.x - first;
        s_vector[first + tIdx] = partial_sum;
        __syncthreads();
    }

    if (tIdx != 0){
        result[gIdx] = s_vector[tIdx + first - 1];
    }else{
        result[gIdx] = 0;
    }
}

There is something in this code that I think is worth explaining, and that is how Sean Baxter does the scan in the shared memory array. Let's take the state of the shared array after we have loaded the VT values. The -1 values are simply uninitialised positions of the shared memory array.

In the successive iterations, the array will be filled as follows.
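
As a small trace, assume four threads whose per-thread partial sums are a, b, c and d (values of my own, purely for illustration):

//State of s_vector (size 2 * blockDim.x) during the double-buffered scan, blockDim.x = 4:
//
//  initial          : [ a, b, c, d                 | ?, ?, ?, ? ]          first = 0, '?' = uninitialised
//  after offset = 1 : [ a, b, c, d                 | a, a+b, b+c, c+d ]    first = 4, writes went to the second half
//  after offset = 2 : [ a, a+b, a+b+c, a+b+c+d     | a, a+b, b+c, c+d ]    first = 0, writes came back to the first half
//
//Each pass reads from one half and writes to the other, so no extra __syncthreads() is
//needed between the read and the write of the same pass. When the loop ends,
//s_vector[first + tIdx] holds the inclusive scan of the partial sums; reading position
//(first + tIdx - 1) and writing 0 for thread 0 turns it into the exclusive result that is
//stored to global memory (and whose last element feeds interBlockScan).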

Downsweep:

__global__ void k_downsweep(int* result,
                            int* originalArray,
                            int* localScan,
                            int* interBlockScan,
                            int vt,
                            int size)
{

    int gIdx = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (gIdx >= size) return;

    int tIdx = threadIdx.x;
    int carryOn = 0;

    //interBlockScan holds the block totals already scanned in phase 2,
    //so this is the sum of every element in the preceding blocks
    if (blockIdx.x != 0){
        carryOn = interBlockScan[blockIdx.x];
    }

    //localScan holds the per-thread exclusive scan computed during the upsweep
    int localCarryOn = 0;
    if (tIdx != 0)
        localCarryOn = localScan[gIdx];

    int currentSum = 0;

    //Takes each value and adds the preceding values (per VT) plus
    //the local carry-on and the interBlock carry-on
    #pragma unroll
    for (int i = 0; i < VT; i++)
    {
        //Thread 0 skips the element just before its range: it does not exist for the
        //first block, and for later blocks it is already included in carryOn.
        if (tIdx == 0 && i == 0)
            currentSum = 0;
        else
            currentSum += originalArray[gIdx * VT + i - 1];
        result[gIdx * VT + i] = currentSum + localCarryOn + carryOn;
    }

}
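
To tie the three kernels together, a minimal host-side sketch could look like the following. The buffer names, the NT/VT macros, the sizing arithmetic and the assumption that the block totals fit into a single block for phase 2 are all mine; the original host code is not shown in this post.

//Host-side sketch of the three-phase exclusive scan (illustrative only).
void exclusive_scan(int* d_vector, int* d_result, int vectorSize)
{
    int numThreads = (vectorSize + VT - 1) / VT;        //one thread per VT inputs
    int numBlocks  = (numThreads + NT - 1) / NT;        //NT = threads per block

    int *d_localScan, *d_blockTotals, *d_blockCarry;
    cudaMalloc(&d_localScan,   numThreads * sizeof(int));
    cudaMalloc(&d_blockTotals, numBlocks  * sizeof(int));
    cudaMalloc(&d_blockCarry,  numBlocks  * sizeof(int));

    //1. Per-thread sums plus the per-block exclusive scan; block totals go to d_blockTotals.
    k_upsweep<<<numBlocks, NT>>>(d_localScan, d_blockTotals, d_vector,
                                 vectorSize, VT, numThreads);

    //2. Exclusive scan of the block totals: d_blockCarry[b] = sum of all blocks before b.
    //   (Assumes numBlocks fits in one block; larger inputs would need to apply the scheme recursively.)
    k_exclusiveScan<<<1, numBlocks>>>(d_blockCarry, d_blockTotals, numBlocks, 1);

    //3. Combine each value with its local carry-on and its block's carry-on.
    k_downsweep<<<numBlocks, NT>>>(d_result, d_vector, d_localScan,
                                   d_blockCarry, VT, numThreads);

    cudaFree(d_localScan);
    cudaFree(d_blockTotals);
    cudaFree(d_blockCarry);
}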

There are, of course, many things that could be improved. Testing a vector of 100,000,000 elements against Thrust's exclusive scan (on a GTX 750 Ti), ours came out around 10 ms slower (Thrust took ~21 ms, ours ~31 ms), which is not bad at all :)
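
For reference, the Thrust call we measured against looks roughly like this (a sketch; the timing harness is omitted and the input values are illustrative):

#include <thrust/device_vector.h>
#include <thrust/scan.h>

int main()
{
    const int vectorSize = 100000000;
    thrust::device_vector<int> d_in(vectorSize, 1);     //illustrative input, all ones
    thrust::device_vector<int> d_out(vectorSize);
    thrust::exclusive_scan(d_in.begin(), d_in.end(), d_out.begin());
    return 0;
}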

GTC 2014 Third Day


PIXAR

Today's big keynote was presented by the people from PIXAR. It obviously wasn't full of big news like yesterday's keynote was, but it was very impressive all the same. They talked about how they use NVIDIA's GPUs to animate and light their movies. One of the most impressive examples they showed was their lighting technique. They have software that throws millions of rays at thousands of objects and is then capable of following those rays' trajectories to form the final lighting. What is impressive is that it is a real-time calculation (it actually takes a couple of seconds until it stabilises a little; at the beginning it is rather blurry, and you can almost see where the rays hit), and that allows their lighting artists to completely change the lighting of a scene in just a few steps, a process that used to take them a handful of hours.

About Pascal and Volta

Apparently Pascal doesn't replace Volta; it is the transition from Maxwell to Volta. Given that 3D stacked memory was a Volta feature that is now present in Pascal, and that the hardware support (NVLink) for unified memory was initially planned for Maxwell, maybe we can venture that Pascal will not feature the 16nm FinFET initially planned for Volta in 2016? In that scenario, that would be the improvement Volta brings to the table. We are no experts in chip processes, but we've been reading articles about delays in the introduction of any chip going beyond 16nm; it is not even clear how economically viable they could be without an important new leap forward in chip engineering.

TALKS

Besides that, one of the talks that most impressed us was the one by the people from Shadertoy; Iñigo Quilez left everyone with their mouths open. He makes incredible 3D animations with just a fragment shader, using math to form the objects and shapes without an array of vertices defining them. You can check out their webpage here: [Shadertoy](https://www.shadertoy.com/)

COMPUTER VISION

Another very interesting talk was the one by Ikuro Sato and Hideki Niihara. They use a deep neural network for pedestrian detection, running on a Tegra K1. They gave a live demo of their work at the talk, and it was impressive to see how it could calculate distance, height and direction in real time. Along the same lines was the talk by NVIDIA representatives, which centered on the NVIDIA VisionWorks toolkit. This will be a library for computer vision and image processing that will allow programmers to develop new applications quickly. It will be modular and extensible, and will run on the Tegra K1. The possibilities for the automotive and mobile industries are enormous. We can't wait to have a K1 in our hands.

Second GTC Day and Keynote


Today was the big day! Today was the day NVIDIA was supposed to make its big announcements. And it definitely didn't disappoint. But we'll cover that a little later and offer our thoughts.

The setting was REALLY impressive. We've been to many conferences, but you can tell the difference when the pros do it. The screens, the epic music, everything contributed to an excellent atmosphere. As we are quite exhausted and actually looking forward to spending our remaining energy playing a bit with our new NVIDIA Shields, we will keep it short and expand on the content in the next few days. And now, to the announcements.

New technology to increase memory bandwidth between the CPU and GPU as well as across multiple GPUs. We hope to get more details soon, but it looks fantastic: great for single-node, multiple-GPU configurations, which also profit from the unified memory in CUDA 6.0.

PASCAL

This was one of the announcements that got us really excited. There had already been talk about 3D stacked memory, but it wasn't clear whether it would actually be feasible to build chips that way. Well, it turns out that they can. And it looks like it will revolutionize the way chips are made, as, in the words of Jen-Hsun Huang (NVIDIA's CEO), it makes it possible to maintain the speedup predicted by Moore's law, all in a GPU the size of two credit cards. Just the name change from Volta (at the last GTC) to Pascal was kind of surprising. As for the practical effect of Pascal, it increases memory bandwidth to 1 TB/s, basically removing the bottleneck in memory-bound algorithms, which in practice are most of them. So performance-wise, many applications should find very, very nice speedups, larger than the "official" GigaFLOPS difference between Maxwell and Pascal. Pascal will also continue to increase energy efficiency, with a nice jump in performance per watt. This energy efficiency bodes well for NVIDIA in the HPC market. The energy savings they can achieve are simply amazing: for some projects, like the Google Brain, they can replicate computations using 0.3% of the energy that normal servers use. With those numbers they could just change their business model and start charging a percentage of the saved energy bill.

GOOGLE BRAIN and MACHINE LEARNING

One of the most amazing examples: 0.3% of the energy cost and 0.25% of the hardware cost for the same computations!!! This is actually along the same lines as what we are experiencing in some of our projects, where we have $3,000 workstations going 6x faster than a $300,000 cluster.

Yay! Big savings!!

PHYSICS SOLVER / FLEX

The usual amazing stuff from the physics simulations; NVIDIA has a fantastic staff that always brings very impressive demos. We will analyze Flex in more detail in a future post. There was also a nice real-time fight demo running on Unreal Engine 4.

Tegra K1 and JETSON TK1

The Tegra K1 development board, aka Jetson TK1, will be available for developers in April. The K1 is looking pretty sweet: the performance slides look really good, it is very easy to program, and there are plenty of tools to do so. We can hardly wait to get our hands on one of those babies. With 192 cores, 326 GFLOPS and 4x the energy efficiency of the A15, no wonder most car companies are getting on board with Tegra.

No word on the 64-bit K1, also known as Denver. There is already a lot of speculation about it in "the tubes". For one, it looks like NVIDIA will start to be more careful with Tegra announcements, a bit like they do with the GeForce line. When competition is so tough you want to keep your cards well guarded until you want to play your hand. So we don't think the 64-bit Tegra is out of the picture; they are probably saving the announcement for a better time. We've seen this time and again with the GeForce cards.

Erista

Of course, as usually happens with technology, the moment you get your hands on one desired piece of it, the next version is announced, and then you only have eyes for what you cannot enjoy just yet. In Tegra's case that will be Erista (Logan's stranded son… the naming of chips is starting to look like a soap opera). Erista will feature the Maxwell architecture, and we expect a much lower TDP (1.5 W maybe?). Left to speculation was the number of CUDA cores: since the Maxwell SM has 128 cores against the Kepler SMX's 192, we are left to wonder whether Erista will use two of them, for a nice 256 CUDA cores. This seems to be the pattern for the Kepler-to-Maxwell transition: more multiprocessors in exchange for the 192-to-128 reduction.

And finally… in an Oprah-like move, every GTC attendee got a SHIELD!! :D

PS: We will go through the rest of the day in future posts; our Shields just finished charging…

GTC 2014 First Day


Today was our first day at the GTC (GPU Technology Conference) in San Jose. We arrived early in the morning, ready for registration and the first set of talks.

The first couple of hours were quite light. I went to a couple of CUDA tutorial sessions, which were informative but rather easy. More interesting was Peter Messmer's talk, "How to Visualize Your GPU-Accelerated Simulation Results", which provided some thoughtful insights that we would like to test in future applications.

Also very promising was the talk by Mark Ebersole and Amit Rao called "Mobile GPU Compute with Tegra K1". The K1 is definitely looking really good. They could not say much about the K1 itself, probably saving the fireworks for the Big Boss's keynote, but they discussed the tools for working with the K1 extensively. And it is indeed looking pretty sweet: full OpenCL support, full CUDA (all libraries included too), OpenGL 4.3, OpenGL ES 3.1, etc. All the same features and tools that you can enjoy on a desktop GPU. We are really looking forward to the big round of announcements in the keynote, especially about when we will be able to enjoy the K1.

After that, it was time for a quick lunch, which definitely wasn't the highlight of the day. The meal consisted of a tortilla wrap filled with meat and rice: filling, but we felt it could definitely improve.

The rest of the day consisted of a couple more GPU tutorials and, the most exciting part of today's session, the presentation of this year's GTC posters. This year we contributed to one of them, "Markov Chains Interactions on GPU-based Massive Parallel Simulated Annealing: A Case Study with the Faculty Assignment Problem", which can be downloaded directly from here: GTC poster

So we had a great time today and are really excited about tomorrow, when the GTC will be firing on all cylinders!

k-NN: Normalizing the Data


In a previous post we saw the k-NN classifier but said little about the data itself, just that we use a hash with two keys: {features: [f1, f2, ..., fn], label: lbl}. The new point can use the same format, {features:[f1, f2, ..., fn]}, but without the label, because that is what we want to find.

We can play a little with an invented bunch of data, for example:

data = [{features:[1,2,3], label:27},
        {features:[2,1,1], label:18},
        {features:[3,1,4], label:2},
        {features:[2,4,6], label:9},
        {features:[2,2,0], label:16},
        {features:[4,2,5], label:33},
        {features:[1,0,0], label:1},
        {features:[0,1,0], label:1},
        {features:[1,0,1], label:10},
        {features:[0,0,1], label:1},
        {features:[1,1,0], label:10},
        {features:[1,1,1], label:1}
       ]

If we want to label a new point, point = {features:[1,1,1]}, we run the classifier and we get a new label: 8. Oops, this is far from the last point in the dataset, {features:[1,1,1], label:1}. That's because we're using the default k = 5, so it's taking the five nearest points to estimate the label. Let's try some other values of k.

1.upto(10) do |k|
    puts "#{k} => #{Knn.weightedknn(data,point,k)}"
end

and results

1 => 1.0
2 => 9.478750044270722
3 => 9.65221020528869
4 => 9.739048833857241
5 => 7.999965000102083
6 => 6.838162000016251
7 => 6.007117289305309
8 => 7.2468827213222085
9 => 9.40774593640336
10 => 8.703210689823191

Well, with this data and this point, the best result is with k = 1; that is, choosing the label of the identical point that is already in your dataset.

Sometimes each feature has its own scale and can influence the estimation in different ways. If we don't want this, we can normalize the features, putting them all in the same interval [0, 1], using the minimum and maximum of each series: the minimum sets the offset, and the difference between maximum and minimum fixes the range, in this way:

def maxymin(data)
    # Start from the first vector's values so the true minimum and maximum are found
    # even when some features are negative
    max = data[0][:features].map(&:to_f)
    min = data[0][:features].map(&:to_f)
    data.each do |vect|
      vect[:features].each_index do |i|
        max[i] = vect[:features][i].to_f if max[i] < vect[:features][i]
        min[i] = vect[:features][i].to_f if min[i] > vect[:features][i]
      end
    end
    return max, min
  end

  def normalize(data, point)
    # The new point is included when computing max and min, then both the
    # dataset and the point are rescaled in place
    all_data = data + [point]
    max, min = maxymin(all_data)
    rescaling(data, max, min)
    rescaling([point], max, min)
  end

  def rescaling(data, max, min)
    # Maps every feature to [0, 1]; assumes max[i] != min[i] for each feature
    data.each do |vect|
      vect[:features].each_index{ |i| vect[:features][i] = (vect[:features][i] - min[i]) / (max[i] - min[i]) }
    end
  end

In a real-world scenario, it's possible that you'd need to rescale each feature differently, so you'd have to try a bunch of factors for each feature until you find the best model for your problem. But that's a story for another post; for this example we'll use a standard normalization.

Now, to normalize our dataset and point:

Knn.normalize(data, point)

and the new dataset and point look like:

data_normalized = [{:features=>[0.25, 0.5, 0.5], :label=>27},
                  {:features=>[0.5, 0.25, 0.16666666666666666], :label=>18},
                  {:features=>[0.75, 0.25, 0.6666666666666666], :label=>2},
                  {:features=>[0.5, 1.0, 1.0], :label=>9},
                  {:features=>[0.5, 0.5, 0.0], :label=>16},
                  {:features=>[1.0, 0.5, 0.8333333333333334], :label=>33},
                  {:features=>[0.25, 0.0, 0.0], :label=>1},
                  {:features=>[0.0, 0.25, 0.0], :label=>1},
                  {:features=>[0.25, 0.0, 0.16666666666666666], :label=>10},
                  {:features=>[0.0, 0.0, 0.16666666666666666], :label=>1},
                  {:features=>[0.25, 0.25, 0.0], :label=>10},
                  {:features=>[0.25, 0.25, 0.16666666666666666], :label=>1}
                  ]

point_normalized = {:features=>[0.25, 0.25, 0.16666666666666666]}

and the label estimation:

1 => 1.0
2 => 5.499687500000502
3 => 9.66578318278914
4 => 9.749327232834395
5 => 7.999826325111753
6 => 6.8333911434197425
7 => 6.000297548549487
8 => 7.249782996274919
9 => 9.443303452024507
10 => 8.700354686441994

The best case is still k = 1. The k = 2 case is a little better than before, when we used the non-normalized data, but not by enough. It's time to use real data.

Note: You can download the code from GitHub, https://github.com/dxfl/wc-blog