wavecrafters blog

Our small contribution to the developer community

Playing With Classifiers: *k*-NN

| Comments

In a recent project we needed a classifier, so we’ve been playing with clasiffiers last months. At the end we choosed SVM for the project, but I want to show you one that I like and it’s easy to implement, the k-nearest neighbors (k-NN) classifier.

k-NN is an algorithm that classifies, or labels, a new point comparing its features to the features of the rest of the dataset that is already classified, well, not all the dataset, just the k nearest ones, the neighbors. The label of the new point will be the average of the labels of the neigbors.

You have a dataset, you have a space of points, each one is a list of features and a label.

This is the algorithm:

1. Take your new point and measure the distance to each point of the dataset.  
2. Select the *k* nearest points of the dataset to your new point.  
3. Assign the average label of the selected points to the new one.

Ok, thats it, end of theory, time for fun, lets code!

First of all, credits, I’m going to use code from the Toby Segaran great book Programming Collective Intelligence, but instead of Python I’m using Ruby, I love Ruby.

def knn(data, new_point, k=5)
  dlist = getdistances(data, new_point)
  avg = 0.0
  totalweight = 0.0
  0.upto(k) do |i|
    dist = dlist[i][0]
    point_id = dlist[i][1]
    weight = gaussian(dist)
    avg += weight*data[point_id][:label]
    totalweight += weight
  avg = avg/totalweight

This is the complex version, with weights, because maybe you want that each point contributes differently in a function of its distance to your new point. The closest point contributes to the new label more than the farther.

In this example, the points are a hash with 2 keys, :features that is an array of the features and :label is the label.

First it needs an array with all the distances sorted, then it picks the label of each of the k closest points and calculates the weighted average. There are a bunch of weight functions, you should try a few to see which one fits better to your dataset.

Here is the Gaussian function, the bell curve.

def gaussian(dist, sigma=10.0)

The method to get the distances.

def getdistances(data,vec1)
  distancelist = []
  data.size.times do |i|
    vec2 = data[i][:features]
    distancelist << [euclidean(vec1, vec2), i]

And last, the measure of the distances, the euclidean distance, sum the squares of the difference of each feature, one by one, and take the square root.

def euclidean(p,q)
  d = 0.0
  p.size.times do |i|
    d += (p[i] - q[i])**2

And thats all folks, just 3 methods to build a powerful classifier, cool, isn’t it?

One more thing, As you can see, this algorithm doesn’t need training, it classifies online! but, the downside, it has to calculate the distance to every point in the dataset, so if you have a big dataset it could take a while to get the new label.

And here is where the GPU computing gets into action, speed of light, but that is for another post.