
Feature - Neighborly efficiency: Scaling kNN problems (nearly) linearly using GPUs

This image depicts an example of a two-dimensional feature space. In this case, the unknown dot would be classified as "green," because three of the five nearest neighbors are green.

Image courtesy of Cyrus Stoller and Libin Sun.

What is k-Nearest Neighbor?

The class of machine learning algorithms known as kNN classifies objects based on the closest training examples in the feature space. What does that mean? Let us illustrate that with an example.

Imagine you have a large set of digital images of friends and family. Many of them have already been tagged according to the person in the image. You would like a way to automate the tagging of the remaining photographs.

First, you would need a program that can analyze images of faces, quantifying features such as eye color, the distance between the eyes, and so on. The program would take in each image - say, 1000 of them - and return a value for each facial feature it analyzed. So if the program analyzes 100 features, it will return 1000 sets of 100 numbers.

Now imagine that each facial feature is a dimension in a multidimensional space. This is known as the feature space, and in our example, it would have 100 dimensions.

Each image can be mapped as a vector in the feature space. The example images that make up the training set are also each classified; in our example the class of each photograph would be the tags identifying the people.

To determine whom an untagged photograph - in general terms, an unclassified object - depicts, the kNN algorithm must calculate the distance within the feature space between the untagged photo and each tagged photo, then sort the images by distance, from nearest to farthest. Next, using an integer value of k that you choose, the algorithm looks at the classes of the k nearest neighbors; the most frequently occurring class becomes the class of the new image.

For example, if you chose 20 for k, the kNN algorithm will look at how you tagged the top 20 nearest images. If nine of those images are tagged as your mother, four are tagged as you, five are tagged as your aunt, one is tagged as your brother, and one is tagged as your mother's dog, then the algorithm will tag the photo as being of your mother.
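The steps just described (compute distances, sort, and take a majority vote) can be sketched in a few lines of Python. This is a toy illustration, not the researchers' code; the function name and the tiny two-dimensional dataset are invented for this example:

```python
import math
from collections import Counter

def knn_classify(query, training, k):
    """Classify `query` by majority vote among its k nearest training examples.

    `training` is a list of (feature_vector, label) pairs; `query` is a
    feature vector of the same dimensionality.
    """
    # Euclidean distance between two feature vectors
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Sort training examples by distance to the query, nearest first
    neighbors = sorted(training, key=lambda pair: dist(query, pair[0]))

    # Tally the labels of the k nearest neighbors; the most common label wins
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

# Tiny 2-D example, echoing the figure: three of the five nearest
# neighbors are "green", so the query point is classified as "green".
training = [((1, 1), "green"), ((1, 2), "green"), ((2, 1), "green"),
            ((8, 8), "red"), ((9, 9), "red")]
print(knn_classify((1.5, 1.5), training, k=5))  # -> green
```

In a real photo-tagging setting, the 2-D points would be the 100-dimensional feature vectors described above, and the labels would be the tags.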

A new approach to solving a class of computational problems known as k-Nearest Neighbor could speed up applications ranging from face and fingerprint recognition to music genre classification.

There are a number of approaches to solving kNN problems, explained Libin Sun, a doctoral student studying computer vision at Brown University.

"Current approaches tend to have strong assumptions on the dimensionality or the geometric properties of the data, whereas our approach is highly general and can be applied to all kinds of kNN problems," Sun said.

As an example, Sun points to the parallel approach as exemplified by Piyush Kumar, which focuses on 3-D spatial data. Although Kumar's method is generally considered to be state of the art, it cannot solve problems involving thousands of dimensions. Yet these problems arise frequently in computer vision, where each image can have millions of pixels, and each pixel has three dimensions to account for color (red, green, and blue).

"Some people are using sequential approaches. Some people are using approximations. Some are using one layer of parallelization," explained Cyrus Stoller, a software engineer at the Palo Alto Medical Foundation. "We wanted to calculate the actual k-nearest neighbors."

Stoller and Sun were both undergraduates double majoring in computer science and mathematics, taking a course in parallel and distributed computing under associate professor Tia Newhall at Swarthmore College when they collaborated with Newhall and Andrew Danner to create a new approach to kNN problems. They split the calculations up in order to create two layers of parallelization. The first and most conventional layer used the popular MPI protocol, while the second layer used GPUs to accelerate the calculations.

Next, they tested their approach using the Lincoln cluster at the National Center for Supercomputing Applications, which offers 96 Intel 64-bit Linux nodes, each equipped with two NVIDIA Tesla S1070 accelerator units. The result, according to a poster they presented at the TeraGrid conference this past summer, was impressive: results came back 80 times faster than they would expect using the GPU layer alone.

Of course, this solution isn't right for all problems. Stoller and Sun's approach is best-suited to particularly large problems.

"Our experiments show that our method scales close to linear if the problem is large enough," Stoller said. "As long as the problem is large enough, the time taken for the merge phase will always be dominated by the time needed for each node to calculate/sort distances for their designated chunk of data. This allows us to (almost linearly) scale by throwing in more nodes."

Sun expressed hopes that their approach would prove useful to the larger computer vision field.

"A recent trend in computer vision is the emphasis on data driven approaches, which almost always require performing kNN on datasets containing millions of images," Sun said. "We anticipate that our method can be of use to a greater field of research, not just handling 3D spatial/GIS data, which motivated most of the previous kNN algorithms."

What next?

Stoller, Sun, Newhall, and Danner plan to follow up with additional experiments, with an eye towards eventually publishing a paper.

For more information, download the poster PDF.

-Miriam Boon, iSGTW

