Image Stitching

What is Image Stitching?

Image stitching combines multiple images, captured at intervals with some overlap in their fields of view, into a single large panoramic view. It is commonly used in 3D reconstruction, surveillance, mapping, and virtual reality. The input images are as follows:

We can combine the above images using the following steps:

  1. Feature extraction: we first need to find the most distinctive, reliable features shared between the left image and the center image, and likewise between the right image and the center image. This can be done in various ways; I have used SIFT features, one of the most commonly used and reliable methods.
  2. Feature matching: this step can be done with a simple matcher available in the cv2 library, the brute-force matcher. It takes the descriptor of each feature in the first image and matches it against the descriptors of the second image by distance (L2 norm). Select only the top matches (minimum distance) so that we don't end up keeping erroneous matches.

  3. Finding the homography (perspective transform): any two images of the same planar surface in space are related by a homography (assuming a pinhole camera model). We can use it to transform (translate and rotate) one image into the other image's perspective, which makes it easy to mesh the two images together.

Homography Equation

The concept of homography comes up when studying the pinhole camera model. When we project a 3D point onto the 2D image plane, both points are expressed in homogeneous coordinates, and the projection can be written as:

2D projection = [Camera Intrinsic][Camera Extrinsic][3D point in homogeneous coordinates]

The two matrices in this product are the camera extrinsic and camera intrinsic matrices. The extrinsic matrix encodes the camera's position and orientation in 3D space, while the intrinsic matrix converts image-plane coordinates into pixels and shifts the origin so that the top-left corner of the image is at (0, 0).
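As a quick numerical sketch of this projection (all of the matrix values below are invented purely for illustration):

```python
import numpy as np

# Hypothetical intrinsics: focal length 800 px, principal point (320, 240).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])

# Hypothetical extrinsics: identity rotation, small translation along x.
R = np.eye(3)
t = np.array([[0.1], [0.0], [0.0]])

P = K @ np.hstack([R, t])           # 3x4 projection matrix

X = np.array([1.0, 0.5, 4.0, 1.0])  # 3D point in homogeneous coordinates
x = P @ X
x = x / x[2]                        # normalize to get pixel coordinates
# x is now (540, 340, 1): the pixel where the 3D point projects.
```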

These two matrices can be multiplied into a single matrix known as the camera calibration matrix, which is used to derive the homography.

Camera Calibration Matrix

When the 3D points we map onto the 2D plane all lie on a plane, we can choose coordinates so that Z = 0. The third column of the calibration matrix (C13, C23, C33) is then multiplied by zero and never used, so we can drop it and keep just the first, second, and fourth columns. The resulting 3×3 matrix is the homography matrix.

Homography from Camera Calibration Matrix

In order to compute the homography, we need to solve a system of linear equations with at least 8 constraints, since the homography has 8 degrees of freedom (it is only defined up to scale).

Three linear equations for each point pair (only two are independent)

The above equations relate a single matching pair to the homography, and each pair contributes two independent constraints. Hence, using 4 pairs we get the 8 equations needed to solve for the homography. Stacking them gives a linear system A h = 0, where the matrix A is built from the point coordinates and the vector h contains the entries of the homography.

Final Equation to derive homography
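The derivation above can be turned into a small direct linear transform (DLT) solver. The function name and the translation-only sanity check below are illustrative, not taken from the original post:

```python
import numpy as np

def homography_dlt(src, dst):
    """Estimate H from 4+ point pairs by solving A h = 0 with the SVD.
    Each pair (x, y) -> (u, v) contributes two rows of A."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # h is the right-singular vector for the smallest singular value.
    _, _, Vt = np.linalg.svd(np.array(A))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]  # fix the arbitrary scale

# Sanity check with a known homography: a pure translation by (10, 5).
src = [(0, 0), (1, 0), (0, 1), (1, 1)]
dst = [(x + 10, y + 5) for x, y in src]
H = homography_dlt(src, dst)
# H recovers [[1, 0, 10], [0, 1, 5], [0, 0, 1]].
```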

Note: We can solve this system by choosing any 4 point pairs, but how do we decide which points yield the most accurate homography for the complete set of matches? To find the most suitable set, we can use a method like RANSAC to estimate the homography that gives the minimum transformation error. RANSAC stands for Random Sample Consensus and works by repeatedly picking 4 random pairs of points and computing the homography from them. Each time a homography is computed, we count how many matches are inliers within a specified threshold, and we keep the homography with the most inliers. In this problem, I performed 1000 iterations and obtained the following homographies:

Homography between the left and center image
Homography between the center and right image

  4. Using the homography, we can warp the images over each other using the cv2.warpPerspective(src_img, homography, shape_of_final_image) function.

  5. Lastly, merge all the obtained warps together to obtain the final stitched image:

Meshed Image