wavecrafters blog

Our small contribution to the developer community

k-NN: Normalizing the Data


In a previous post we saw the k-NN classifier but said little about the data itself, only that we use a hash with two keys: {features: [f1, f2, ..., fn], label: lbl}. The new point uses the same format, {features: [f1, f2, ..., fn]}, but without the label, because the label is what we want to find.

We can play a little with some invented data, for example:

data = [{features:[1,2,3], label:27},
        {features:[2,1,1], label:18},
        {features:[3,1,4], label:2},
        {features:[2,4,6], label:9},
        {features:[2,2,0], label:16},
        {features:[4,2,5], label:33},
        {features:[1,0,0], label:1},
        {features:[0,1,0], label:1},
        {features:[1,0,1], label:10},
        {features:[0,0,1], label:1},
        {features:[1,1,0], label:10},
        {features:[1,1,1], label:1}
       ]

If we want to label a new point, point = {features:[1,1,1]}, we run the classifier and get the label 8. Oops, that's far from the last point in the dataset, {features:[1,1,1], label:1}. That's because we're using the default k = 5, so the classifier estimates the label from the five nearest points.
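For reference, a minimal sketch of that first call (assuming Knn.weightedknn takes an optional k that defaults to 5, consistent with the loop below):

point = {features: [1, 1, 1]}
Knn.weightedknn(data, point)
# => 7.999965000102083, the "8" mentioned above

Let's try some other values of k: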

1.upto(10) do |k|
    puts "#{k} => #{Knn.weightedknn(data,point,k)}"
end

and the results:

1 => 1.0
2 => 9.478750044270722
3 => 9.65221020528869
4 => 9.739048833857241
5 => 7.999965000102083
6 => 6.838162000016251
7 => 6.007117289305309
8 => 7.2468827213222085
9 => 9.40774593640336
10 => 8.703210689823191

Well, with this data and this point, the best result comes with k = 1, that is, choosing the label of the identical point already present in the dataset.

Sometimes each feature has its own scale, and features with larger ranges can dominate the distance calculation, influencing the estimation more than the others. If we don't want this, we can normalize the features, that is, rescale them all to the interval [0, 1] using the minimum and maximum of each series: the minimum sets the offset, and the difference between maximum and minimum fixes the range, so each feature value f becomes (f - min) / (max - min). In code:

def maxymin(data)
  # Start from the first vector's features: initializing both arrays to 0
  # only works when every feature actually takes the value 0 somewhere.
  max = data[0][:features].map(&:to_f)
  min = data[0][:features].map(&:to_f)
  data.each do |vect|
    vect[:features].each_index do |i|
      max[i] = vect[:features][i].to_f if max[i] < vect[:features][i]
      min[i] = vect[:features][i].to_f if min[i] > vect[:features][i]
    end
  end
  return max, min
end

def normalize(data, point)
  # Compute the per-feature ranges over the dataset plus the query point,
  # then rescale both in place.
  all_data = data + [point]
  max, min = maxymin(all_data)
  rescaling(data, max, min)
  rescaling([point], max, min)
end

def rescaling(data, max, min)
  # Note: a constant feature (max == min) would divide by zero;
  # real data may need a guard for that case.
  data.each do |vect|
    vect[:features].each_index { |i| vect[:features][i] = (vect[:features][i] - min[i]) / (max[i] - min[i]) }
  end
end
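As a quick sanity check of the formula with the dataset above: the first feature ranges from 0 to 4 once the query point is included, so the point's first coordinate works out to:

(1 - 0) / (4.0 - 0)  # => 0.25, matching the normalized output below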

In a real-world scenario it's possible that you'd need to rescale each feature differently, trying a set of factors for each feature until you find the model that best fits your problem (a sketch of the idea below). But that's a story for another post; for this example we'll use the standard normalization.
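To give the flavor, here is a hypothetical sketch of such per-feature weighting; the scale_features helper and the example weights are illustrative only, not part of the post's Knn module:

def scale_features(data, weights)
  # Multiply each feature by its own factor; in practice the weights
  # would be tuned per problem, e.g. by cross-validation.
  data.each do |vect|
    vect[:features].each_index { |i| vect[:features][i] *= weights[i] }
  end
end

scale_features(data, [1.0, 0.5, 2.0])  # hypothetical weights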

Now, to normalize our dataset and point:

Knn.normalize(data, point)

and the normalized dataset and point look like this:

data_normalized = [{:features=>[0.25, 0.5, 0.5], :label=>27},
                  {:features=>[0.5, 0.25, 0.16666666666666666], :label=>18},
                  {:features=>[0.75, 0.25, 0.6666666666666666], :label=>2},
                  {:features=>[0.5, 1.0, 1.0], :label=>9},
                  {:features=>[0.5, 0.5, 0.0], :label=>16},
                  {:features=>[1.0, 0.5, 0.8333333333333334], :label=>33},
                  {:features=>[0.25, 0.0, 0.0], :label=>1},
                  {:features=>[0.0, 0.25, 0.0], :label=>1},
                  {:features=>[0.25, 0.0, 0.16666666666666666], :label=>10},
                  {:features=>[0.0, 0.0, 0.16666666666666666], :label=>1},
                  {:features=>[0.25, 0.25, 0.0], :label=>10},
                  {:features=>[0.25, 0.25, 0.16666666666666666], :label=>1}
                  ]

point_normalized = {:features=>[0.25, 0.25, 0.16666666666666666]}

and the new label estimates:
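These come from re-running the same k sweep as before, which works unchanged because Knn.normalize rescales data and point in place:

1.upto(10) do |k|
    puts "#{k} => #{Knn.weightedknn(data,point,k)}"
end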

1 => 1.0
2 => 5.499687500000502
3 => 9.66578318278914
4 => 9.749327232834395
5 => 7.999826325111753
6 => 6.8333911434197425
7 => 6.000297548549487
8 => 7.249782996274919
9 => 9.443303452024507
10 => 8.700354686441994

The best case is still k = 1. The k = 2 estimate is a little better than before, with the non-normalized data, but not enough. It's time to use real data.

Note: You can download the code from GitHub: https://github.com/dxfl/wc-blog
