wavecrafters blog

Our small contribution to the developer community

Playing With Classifiers: *k*-NN

| Comments

In a recent project we needed a classifier, so we’ve been playing with clasiffiers last months. At the end we choosed SVM for the project, but I want to show you one that I like and it’s easy to implement, the k-nearest neighbors (k-NN) classifier.

k-NN is an algorithm that classifies, or labels, a new point comparing its features to the features of the rest of the dataset that is already classified, well, not all the dataset, just the k nearest ones, the neighbors. The label of the new point will be the average of the labels of the neigbors.

You have a dataset, you have a space of points, each one is a list of features and a label.

This is the algorithm:

1. Take your new point and measure the distance to each point of the dataset.  
2. Select the *k* nearest points of the dataset to your new point.  
3. Assign the average label of the selected points to the new one.

Ok, thats it, end of theory, time for fun, lets code!

First of all, credits, I’m going to use code from the Toby Segaran great book Programming Collective Intelligence, but instead of Python I’m using Ruby, I love Ruby.

def knn(data, new_point, k=5)
  dlist = getdistances(data, new_point)
  avg = 0.0
  totalweight = 0.0
  0.upto(k) do |i|
    dist = dlist[i][0]
    point_id = dlist[i][1]
    weight = gaussian(dist)
    avg += weight*data[point_id][:label]
    totalweight += weight
  end
  avg = avg/totalweight
end

This is the complex version, with weights, because maybe you want that each point contributes differently in a function of its distance to your new point. The closest point contributes to the new label more than the farther.

In this example, the points are a hash with 2 keys, :features that is an array of the features and :label is the label.

First it needs an array with all the distances sorted, then it picks the label of each of the k closest points and calculates the weighted average. There are a bunch of weight functions, you should try a few to see which one fits better to your dataset.

Here is the Gaussian function, the bell curve.

def gaussian(dist, sigma=10.0)
  Math.exp(-dist**2/(2*sigma**2))
end

The method to get the distances.

def getdistances(data,vec1)
  distancelist = []
  data.size.times do |i|
    vec2 = data[i][:features]
    distancelist << [euclidean(vec1, vec2), i]
  end
  distancelist.sort
end

And last, the measure of the distances, the euclidean distance, sum the squares of the difference of each feature, one by one, and take the square root.

def euclidean(p,q)
  d = 0.0
  p.size.times do |i|
    d += (p[i] - q[i])**2
  end
  d**0.5
end

And thats all folks, just 3 methods to build a powerful classifier, cool, isn’t it?

One more thing, As you can see, this algorithm doesn’t need training, it classifies online! but, the downside, it has to calculate the distance to every point in the dataset, so if you have a big dataset it could take a while to get the new label.

And here is where the GPU computing gets into action, speed of light, but that is for another post.

Related Posts for Octopress Using Categories

| Comments

I was trying to add the related posts feature without relying on the octopress-relatedposts plugin, because I had problems running the lsi in my computer. I also wanted, if possible, to make use of the categories to decide the importance of a given post. Then I came to this wonderful jekyll plugin, which does exacly what I wanted… but was programmed just for jekyll, which apparently uses tags instead of categories. So after a couple of minor changes, voila, it works in octopress

Installation

Copy related_posts.rb to your plugins directory

related_posts.rb
require 'jekyll/post'

module RelatedPosts

  # Used to remove #related_posts so that it can be overridden
  def self.included(klass)
    klass.class_eval do
      remove_method :related_posts
    end
  end

  # Calculate related posts.
  #
  # Returns [<Post>]
  def related_posts(posts)
    return [] unless posts.size > 1
    highest_freq = Jekyll::Post.category_freq(posts).values.max
    related_scores = Hash.new(0)
    posts.each do |post|
      post.categories.each do |category|
        if self.categories.include?(category) && post != self
          cat_freq = Jekyll::Post.category_freq(posts)[category]
          related_scores[post] += (1+highest_freq-cat_freq)
        end
      end
    end

    Jekyll::Post.sort_related_posts(related_scores)
  end

  module ClassMethods
    # Calculate the frequency of each tag.
    #
    # Returns {tag => freq, tag => freq, ...}
    def category_freq(posts)
      return @category_freq if @category_freq
      @category_freq = Hash.new(0)
      posts.each do |post|
        post.categories.each {|category| @category_freq[category] += 1}
      end
      @category_freq
    end

    # Sort the related posts in order of their score and date
    # and return just the posts
    def sort_related_posts(related_scores)
      related_scores.sort do |a,b|
        if a[1] < b[1]
          1
        elsif a[1] > b[1]
          -1
        else
          b[0].date <=> a[0].date
        end
      end.collect {|post,freq| post}
    end
  end

end

module Jekyll
  class Post
    include RelatedPosts
    extend RelatedPosts::ClassMethods
  end
end

Then create a file such as source/_includes/custom/asides/related.html with the following content

related.html
<section>
  <h1>Related Posts</h1>
  <ul class="posts">
    
      <li class="related">
        <a href="/blog/2013/12/16/mis-cositas/">Octopress, accents and rare characters</a>
      </li>
    
      <li class="related">
        <a href="/blog/2014/01/23/knn-normalizing-the-data/">k-NN: normalizing the data</a>
      </li>
    
      <li class="related">
        <a href="/blog/2013/12/27/playing-with-classifiers-knn/">Playing with classifiers: *k*-NN</a>
      </li>
    
  </ul>
</section>

Finall add the file to your default asides list in your _config.yml

_config.yml
default_asides: [custom/asides/related.html, ...]

Octopress, Accents and Rare Characters

| Comments

When I fist tried to configure Octopress, we encountered a problem: Jekyll doesnt do too well with accents and weird characters. When running rake generate I received the following error message:

Starting to watch source with Jekyll and Compass. Starting Rack on port 4000
C:/Ruby193/lib/ruby/1.9.1/psych.rb:154:in `parse': couldn't parse YAML at line 0 column 0 (Psych::SyntaxError)
        from C:/Ruby193/lib/ruby/1.9.1/psych.rb:154:in `parse_stream'
        from C:/Ruby193/lib/ruby/1.9.1/psych.rb:125:in `parse'
        from C:/Ruby193/lib/ruby/1.9.1/psych.rb:112:in `load'
        from C:/Ruby193/lib/ruby/1.9.1/psych.rb:229:in `load_file'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/jekyll-0.12.1/lib/jekyll.rb:130:in `configuration'
        from C:/Ruby193/lib/ruby/gems/1.9.1/gems/jekyll-0.12.1/bin/jekyll:221:in `<top (required)>'
        from C:/Ruby193/bin/jekyll:19:in `load'
        from C:/Ruby193/bin/jekyll:19:in `<main>'
[2013-12-17 10:38:36] INFO  WEBrick 1.3.1
[2013-12-17 10:38:36] INFO  ruby 1.9.3 (2011-10-30) [i386-mingw32]

I played around a bit, trying to make jekyll to accept UTF-8. I have only found a partial solution, and that is inserting the html escaping characters. For example instead of í you could write `&iacute`