# Explain the scale-invariant feature transformation algorithm in detail.

Medium Last updated on May 7, 2022, 1:28 a.m.

The Scale Invariant Feature Transformation (SIFT) algorithm is one of the breakthrough algorithms in Computer Vision. It was developed by Prof. David Lowe of University of British Columbia in 1999 and was first introduced in 2004. The algorithm is developed to mimic the process of image understanding by a human brain.

For example, when we look at an image of a “cat in garden” and the image of the same cat present in a house we can identify the cat as the same object present in both scenarios irrespective of the background. Likewise, the SIFT algorithm focuses on local rather than global features to accurately identify the objects present in an image by their characteristic composition.

The ability to obtain these unique features leverages the fact that these can be used to perform feature matching with objects present in a large database. Another significant advantage of SIFT for image matching, object classification, or scene detection is its’ rotation and scale-invariance.

To understand SIFT algorithm better, let us dive into the exciting mathematical steps behind this algorithm.

Step 1: Scale-Space Construction: The term “scale-space” means creating a set of images with multiple scales from the same image, i.e., belonging to the same space. This concept replicates the real-world scenario of the presence of objects at multiple scales, which will aid in identifying at what scale the object is perfectly visible accurately. The first step is to perform a Gaussian Blur on the input image to start with this idea. Mathematically, gaussian blur is the convolution of the gaussian blur kernel with the input image. Lets’ define scale space of an image as $L(x, y, \sigma)$ as:

$$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y)$$

Here, Gaussian Operator is

$$G(x, y, \sigma) = \frac{1}{2 \pi \sigma^2} e^{-\frac{(x^2+y^2)}{2\sigma^2}}$$

The gaussian filter will remove unnecessary background noise and preserve edges and shape critical information.

To make the algorithm scale-invariant, we create an image set known as octaves. So the first octave comprises the scale (shape) of the original image. For the subsequent octaves, the image sizes are divided by half hence making their scale half of the previous octave. Ideally, there are four octaves in total. Each octave comprises a set of five images which are subsequent blurred versions of each other starting from the original image. Let’s have a look at the visual representation of octaves:

Step 2: Difference of Gaussian (DoG): Next, we perform a subtraction operation on the gaussian blurred images of the octaves. For every octave, we have a set of images that have different blur intensities. So, here we take the difference between the subsequent blurred images and obtain DoGs.

The mathematical equations of this step can be defined as:

$$D(x, y, \sigma) = (G(x, y, k \sigma) - G(x, y, \sigma) * I(x,y)$$

therefore, $D$ can simply be computed by simple image subtraction.

$$D(x, y, \sigma) = L(x, y, k \sigma) - L(x, y, \sigma)$$

This technique enhances the features, and as a result, the irrelevant information gets deleted.

Note: The difference-of-Gaussian function provides a close approximation to the scale-normalized Laplacian of Gaussian.
$$LoG \approx DoG$$

$$G(x, y, k \sigma) - G(x, y, \sigma) \approx (k-1) \sigma^2 \Delta^2 G$$

Step 3: Keypoint Selection and Localization: We have extracted multiple features from the previous step. Now, to choose the important key points, the local maxima and minima of the image should be identified. It is done by calculating the laplacian of gaussian.

Laplacian of gaussian method involves comparing the pixel intensity of a pixel of an image with the neighboring pixels of the same image and with pixels in the previous and the next image in the octave.

The maxima and minima of the difference-of-Gaussian images are detected by comparing a pixel (marked with X, refer to figure above) is compared with 8 neighboring pixels of the same image and with 18 (9 + 9) pixels from other two images which results in 26 total comparisons. If that particular pixel turns out to be the lowest or highest among the comparisons it is chosen as a key point.

After the keypoints are selected, there is again a requirement to eliminate non-significant keypoints such as those which have low contrast or are located towards the edge.

I. The low contrast key points are simply removed by comparing the pixel intensities with a given threshold. First, a second-order Taylor expansion up to the quadratic term is performed to obtain an accurate location of the maxima or minima pixel (keypoint). If the intensity of the keypoint is less than 0.03 then it is neglected.
Mathematically, it can be written as:

• Taylor Expansion: $D(x) = D + \frac{\partial D^T}{\partial x} x+ \frac{1}{2} x^T \frac{\partial^2 D}{\partial x^2} x$

• Location of the extremum, $\hat{x}$ determined by taking the derivative of
above function with respect to x and setting it to zero:
$$\hat{x} = {\frac{\partial^2 D}{\partial x^2}}^{-1} \frac{\partial D}{\partial x}$$

• Rejecting unstable extrema with low contrast. $$D(\hat{x}) = D + \frac{1}{2} \frac{\partial D^T}{\partial x} \hat{x}$$

II. The keypoints which are detected near the edge are because of the high edge response of DoGs. Those keypoints may be vulnerable to noise so they also need to be checked and removed. To perform that, a second-order Hessian matrix is used. For detailed equations refer to page 12 of the Original Paper

Step 4: Orientation Assignment: As we mentioned earlier, SIFT algorithm is both scale and rotation invariant. Therefore, to perform the latter, an orientation is assigned to each of the keypoints. First, the magnitude and orientation are calculated and next a histogram of orientation vs magnitude is created.

• To calculate the magnitude of a particular keypoint present at location (x,y), magnitude along the x-direction would be $M_x$ = Value at (x + 1) - Value at (x - 1) and magnitude along the y-direction would be $M_y$ = Value at (y + 1) - Value at (y - 1). Now, the final magnitude will be the squareroot of sum of squared distances along x and y directions $\sqrt {{M_x}^2 + {M_y}^2}$. Final angle will be $\tan^{-1} \frac{M_y} {M_x}$ .
$$m(x,y) = \sqrt { (L(x+1,y)-L(x-1,y))^2 + (L(x,y+1)-L(x,y-1))^2}$$
$$\theta(x,y) = \tan^{-1} \frac{(L(x,y+1)-L(x,y-1))}{(L(x+1,y)-L(x-1,y))}$$
• The histogram consists of angle values on the x axis and magnitude values on the y axis. The plotting is done for all the pixels around the keypoint. To choose the orientation for the keypoint, the bin which achieves the highest peak will be chosen. Additionally, any other peak above 80% will also be considered. In that case, a new keypoint will be created which will have the same scale and magnitude but a different direction.

Step 5: Keypoint Descriptor: From the above steps, we have found the stable keypoints that prove to be scale and rotation invariant. Further, to describe these keypoints uniquely (creating a fingerprint), the neighboring pixels will be considered for computing the descriptor.

• The standard method is to take a grid of 16x16 around the keypoint and further divide it into sub-blocks of 4x4.
• Compute the 8 bin histograms for each of these 4x4 blocks. This will result in 4x4x8 = 128 bin values.
• These 128 values form a feature vector (unique descriptor) for that particular keypoint.

## Coding Exercise: SIFT Implementation for Feature Matching

We will consider the below two images to perform feature matching.