Scale-Invariant Feature Extraction Technique for Object Detection

Annapurna Vinod
4 min readMar 18, 2022


A Feature Extraction technique to match local features in images.

Scale-invariant feature transform (SIFT) addresses the problem of matching features with changing scale and rotations. It is one of the successful approaches to feature matching and is used for recognizing objects from image databases. This algorithm is patented, so this algorithm is included in the non-free module in OpenCV.

Finding features for all the points in an image is a time-consuming task. To reduce the cost of feature extraction, the complex features are extracted at those points only which pass a certain test. These points are known as points of interest. SIFT algorithm finds these interest points and calculates features for these points only.

SIFT finds the points which are invariant to scale change of image. This can be done by finding features that are stable across all scales. The Gaussian function is convoluted with the image to find scale space. A set of progressively blurred images (using a Gaussian function) is constructed. Then, we perform a search over all scales and image locations, and the difference of Gaussian function is applied to identify potential interest points. For this, scale-space extrema are found in the difference of Gaussian convoluted with the image. These interest points are invariant to scale and orientation. The difference of Gaussian convoluted with the image can be computed as:



where G(x, y, r) is a Gaussian function at x,y locations and standard deviation r: I(x, y) is the pixel value at x, y locations of the image. And, ‘‘k’’ is a multiplicative factor that separates two nearby scales.

We take an image and blur it a little using the Gaussian function. Calculate the second-order derivatives or the Laplacian — which locates edges and corners that are good for detecting key points. The computation of the second-order derivative is also extremely sensitive to noise, and the blurring helps.

1. Key point localization: Location and scale are determined for each candidate location. The selection of key points is based on the measure of their stability. Key points with intensity less than a certain threshold value are rejected. Also, key points on the edges are rejected. Now, only strong key points are left.

To create a scale space, progressively blur out images using Gaussian kernel. Scale space images are obtained for different octaves. This is done by progressively blurring out images using Gaussian kernels, and then repeating the same after down sampling the original image by half to create the next octave.

Stable key point locations are detected using scale-space extrema in the difference-of-Gaussian function convolved with the image D(x, y, σ). The keypoint orientation is also determined from the local image appearance and is covariant to image rotations. Depending on the symmetry of the keypoint appearance, determining the orientation can be ambiguous. Keypoints are further refined by eliminating those that are likely to be unstable, either because they are selected nearby an image edge, rather than an image blob, or are found on image structures with low contrast.

2. Assigning orientation: On the basis of local image gradient directions, each key point is assigned one or more orientations. All operations are then performed on the transformed image data with respect to orientation, scale and location. In this way, images remain invariant to the transformations.

3. Key point descriptor: The measurement of the local image gradient is done at the scale selected in the region around each key point. To compute the key point descriptor, a 16x16 neighbourhood around the key point is taken. It divides into 16 sub-blocks of size 4x4.

A peak in the DoG scale space fixes 2 parameters of the key point: the position and scale. It remains to choose an orientation. In order to do this, SIFT computes an histogram of the gradient orientations in a Gaussian window with a standard deviation which is 1.5 times bigger than the scale of the keypoint.

For each sub-block, histogram for 8 orientation bins is created. Length of key point descriptor is the number of sub-blocks multiplied by the number of bins. So, a total of 16 x 8=128 bin values are computed.

It is represented as a feature vector to form a keypoint descriptor. This feature vector introduces a few complications. We need to get rid of them before finalizing the fingerprint.

• Rotation dependence: The feature vector uses gradient orientations. Clearly, if you rotate the image, everything changes. All gradient orientations also change. To achieve rotation independence, the key point’s rotation is subtracted from each orientation. Thus, each gradient orientation is relative to the key point’s orientation.

• Illumination dependence: If we threshold numbers that are big, we can achieve illumination independence. So, any number (of the 128) greater than 0.2 is changed to 0.2. This resultant feature vector is normalized again. And now you have an illumination-independent feature vector.

4. Keypoint matching: Key points between two images are matched by identifying their nearest neighbours using Euclidian distance.