Medium Last updated on May 11, 2022, 5:05 p.m.
In the 21st century, the tremendous growth of visual data in the form of images and videos has driven a proliferation of computer vision applications. These applications are prevalent across a plethora of industries such as retail, banking, agriculture, and healthcare. Here, we will talk about one important computer vision application heavily used in the surveillance industry - Object Tracking.
Object tracking is essentially a mechanism that continuously estimates the trajectory of an object based on a priori information. Let’s understand this better with an example. Imagine a truck containing explosives passing along a highway, and the state police have come to know about it. All the highways are equipped with CCTV cameras. Given that the number plate or some other distinguishing information about the truck is known, an object tracking algorithm applied to the camera feeds can predict the truck’s future positions and help intercept it. Similarly, object tracking is useful for security concerns such as theft detection and illegal activity recognition, as well as for traffic monitoring and vehicle navigation.
Object tracking mainly comprises two domains - Single Object Tracking (SOT) and Multiple Object Tracking (MOT). In SOT, a single particular object is tracked, so information about its appearance is known beforehand. In MOT, however, multiple objects such as vehicles, traffic lights, and pedestrians may need to be tracked in a video, so prior information about the appearance or number of objects is unavailable. Hence, MOT is a considerably more challenging problem than SOT.
Before understanding how MOT works, let’s look at some of the common challenges faced by tracking algorithms.
The MOT pipeline comprises the following major steps - object representation, object detection, assigning identities, and tracking.
In the video frames, an object detection algorithm first identifies the objects belonging to the target class and generates bounding boxes around them. Feature extraction algorithms are then applied to these detections to obtain distinctive appearance or motion features. These features are used to compute a similarity score that quantifies how closely two detections are related. Finally, based on this similarity metric, the detections are associated with their target IDs. Check out the below figure for a clear understanding.
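The steps above can be sketched as a minimal tracking-by-detection loop. The `detect`, `extract_features`, and `match` functions below are hypothetical stand-ins for a real detector, a real feature extractor, and a real association step, chosen only to make the flow of the pipeline concrete:

```python
# Minimal tracking-by-detection skeleton. The three helper functions are
# placeholder stand-ins: a real pipeline would use a trained detector,
# a CNN feature extractor, and Hungarian-style association.

def detect(frame):
    # Placeholder detector: the frame already carries (x, y, w, h) boxes.
    return frame["boxes"]

def extract_features(frame, boxes):
    # Placeholder feature extractor: one vector per box (here the centre).
    return [(x + w / 2, y + h / 2) for (x, y, w, h) in boxes]

def match(tracks, features, boxes):
    # Greedy nearest-neighbour association on feature distance.
    assignments = {}
    for tid, prev in list(tracks.items()):
        if not features:
            break
        dists = [(f[0] - prev[0]) ** 2 + (f[1] - prev[1]) ** 2 for f in features]
        i = dists.index(min(dists))
        assignments[tid] = boxes[i]
        tracks[tid] = features.pop(i)
        boxes.pop(i)
    return assignments

def track_video(frames):
    tracks = {}      # track id -> last observed feature
    next_id = 0
    history = []
    for frame in frames:
        boxes = list(detect(frame))
        feats = extract_features(frame, boxes)
        assigned = match(tracks, feats, boxes)
        # Unmatched detections start new tracks.
        for f, b in zip(feats, boxes):
            tracks[next_id] = f
            assigned[next_id] = b
            next_id += 1
        history.append(assigned)
    return history
```

Each frame thus yields a mapping from stable track IDs to bounding boxes, which is exactly the output an MOT system is expected to produce.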
The first step to approach the task of object tracking is to decide how to represent an object. Based on the various scenarios under which an object can appear, the following are the common ways of representing it.
Point: Objects which occupy a small part of the image can be represented as points.
Geometric Shapes: Shapes such as squares, rectangles, and ellipses can represent rigid objects.
Silhouette/Contour: To track non-rigid objects, object silhouette or contour can be used to define the object’s boundary.
Articulated Object: Objects which have a defined set of portions, such as a human body that consists of a head, hands, torso, legs, etc., can be exhaustively marked part-wise.
Skeletal: To understand the shape of the object, skeletal representation can be obtained from the silhouette and used for rigid and articulated objects.
Identifying the right set of targets in real time is extremely important for the success of any object tracking algorithm. Many methods utilize detections from already available datasets, while others develop custom detectors. Faster R-CNN, SSD, and versions of YOLO are some of the commonly used detectors. While Faster R-CNN is more accurate overall, SSD outperforms Faster R-CNN for detecting large objects. On the other hand, YOLO is a one-shot detector that is speedier than Faster R-CNN and SSD.
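Whichever detector is used, its raw output typically contains many overlapping candidate boxes that are pruned with non-maximum suppression (NMS) before tracking begins. A minimal pure-Python sketch (the thresholds are illustrative):

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Keep the highest-scoring box in each cluster of overlapping boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept one too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```

The same `iou` helper reappears later in the pipeline, since IoU is also a common motion-similarity measure during association.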
Robust feature extraction is an essential pipeline component. Although many traditional vision algorithms such as SIFT, SURF, and ORB were initially used for feature extraction, deep learning-based CNN models now perform much better. The strong representative power of CNNs makes them a suitable choice. One of the first uses of DL methods was autoencoders for feature refinement, which greatly improved performance. Many algorithms also employ custom residual networks. Additionally, siamese networks with contrastive loss have been widely used to identify target pairs belonging to the same class and to differentiate well between pairs from different classes.
Once the features of the detections are extracted, the next step is to assign target identities based on an affinity score. Simple metrics such as Euclidean distance and cosine similarity can be used to calculate affinity scores; however, DL-based approaches are considerably more accurate. Neural networks such as RNNs and LSTMs are extensively used for predicting motion and calculating association vectors. Additionally, networks such as Siamese LSTMs, Bidirectional LSTMs, and deep multi-layer perceptrons also achieve good tracking accuracy.
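As a sketch of the simpler metrics, an affinity matrix built from cosine similarity can be turned into a globally optimal ID assignment with the Hungarian algorithm, here via SciPy's `linear_sum_assignment`. The feature vectors are made up purely for illustration:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cosine_affinity(track_feats, det_feats):
    """Matrix of cosine similarities between track and detection features."""
    t = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    d = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
    return t @ d.T

# Toy appearance vectors (illustrative): two tracks, two new detections.
tracks = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
dets   = np.array([[0.1, 0.9, 0.0], [0.9, 0.1, 0.1]])

affinity = cosine_affinity(tracks, dets)
# The Hungarian algorithm minimises cost, so negate the affinity.
row, col = linear_sum_assignment(-affinity)
matches = list(zip(row, col))  # track row[i] is matched to detection col[i]
```

Here the first track is most similar to the second detection and vice versa, so the optimal assignment swaps them - exactly the kind of identity-preserving matching the affinity step is responsible for.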
Object tracking algorithms mainly comprise two models: the motion model and the appearance model. The motion model keeps track of the object’s velocity and location, whereas the appearance model captures what the object looks like. Usually, pre-trained classifiers perform the task of target object detection; however, robust classifiers for all kinds of objects are not readily available. Hence, online training - training on the fly - is often required.
OpenCV provides multiple object tracking algorithms which can be trained at runtime on user-supplied examples. Let’s briefly understand some of those algorithms.
BOOSTING Tracker: This algorithm is based on online AdaBoost, which combines weak classifiers into a strong one, giving more weight to the examples the current classifier gets wrong. The user initially selects the region of the frame where the object is located, and that region is treated as a positive example, while the remaining background is treated as negative. For each upcoming frame, the classifier scores candidate regions; the region with the maximum score is taken as the new positive detection, whereas the surrounding low-scoring regions are rejected. Although this algorithm demonstrates good tracking accuracy, it is relatively slow.
MIL (Multiple Instance Learning) Tracker: The approach of MIL is quite similar to that of the Boosting tracker. Here, instead of predicting only a single target location in the next frame, a bag of candidate patches containing at least one positive example is selected. This method is more robust to noise; however, it has no mechanism to stop tracking when the real target object is lost, so failures go unreported.
KCF (Kernelized Correlation Filters) Tracker: This method combines ideas from both the Boosting and MIL trackers. First, a bag of positives containing many overlapping regions is obtained, as in MIL. Next, correlation filters are applied to these regions to detect the target and predict its motion with good accuracy.
TLD (Tracking, Learning, Detection) Tracker: The TLD tracker consists of three processes, i.e. tracking, learning, and detection. The tracking module is responsible for following the object from frame to frame. Simultaneously, the detection module localizes the object independently and flags errors or failure cases. The learning module learns from these mistakes and tries to prevent them in the future. This method is robust at detecting objects under occlusion; however, it is prone to instability in detection and tracking, as it sometimes loses track of the object.
MedianFlow Tracker: This approach is built on the Lucas-Kanade optical flow method, which relies on directional information rather than color. MedianFlow tracks the trajectories of the object in both the forward and backward directions and estimates the tracking error from the discrepancy between the two. It provides high accuracy when the object is clearly visible; however, the track can be lost when the object moves at high speed.
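The forward-backward error at the heart of MedianFlow can be sketched for a single point: track it forward through the frames, track the endpoint backward, and measure how far the backward track lands from the starting position. The per-frame displacement lists below are toy stand-ins for real Lucas-Kanade flow estimates:

```python
def forward_backward_error(point, forward_flows, backward_flows):
    """Apply forward displacements, then backward displacements, and
    return the distance between the final position and the start point.
    Each flow list holds (dx, dy) displacements standing in for optical flow."""
    x, y = point
    for dx, dy in forward_flows:    # forward pass through the frames
        x, y = x + dx, y + dy
    for dx, dy in backward_flows:   # backward pass from the end frame
        x, y = x + dx, y + dy
    return ((x - point[0]) ** 2 + (y - point[1]) ** 2) ** 0.5
```

With perfect flow the backward track returns exactly to the start and the error is zero; in practice, noisy estimates make it drift, and points with a large forward-backward error are discarded before the median displacement is taken.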
SORT (Simple Online and Realtime Tracking): This framework performs tracking-by-detection, using the Faster R-CNN object detector. It does not utilize appearance features after detection, but rather uses the size and coordinates of the bounding box for motion estimation and ID association. The target object is traced using a constant-velocity model. A state vector is maintained which comprises the bounding box center, the scale, the aspect ratio, and their velocities. For ID assignment, the bounding box coordinates are predicted from the state and then compared with the actual detections.
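A minimal sketch of SORT's constant-velocity motion model is shown below. This is only the Kalman filter's predict step, with the covariance bookkeeping omitted: the state holds the box centre (u, v), scale s, aspect ratio r, and the velocities of the first three, with r assumed constant as in SORT.

```python
import numpy as np

def predict(state, dt=1.0):
    """Constant-velocity prediction for a SORT-style state vector
    [u, v, s, r, u_dot, v_dot, s_dot]: positions advance by velocity * dt,
    the aspect ratio r stays fixed."""
    u, v, s, r, du, dv, ds = state
    return np.array([u + du * dt, v + dv * dt, s + ds * dt, r, du, dv, ds])

def state_to_box(state):
    """Convert [u, v, s, r, ...] back to an (x1, y1, x2, y2) box,
    where s is the box area and r = width / height."""
    u, v, s, r = state[:4]
    w = (s * r) ** 0.5
    h = s / w
    return (u - w / 2, v - h / 2, u + w / 2, v + h / 2)
```

The predicted box from `state_to_box(predict(state))` is what gets compared (e.g. via IoU) against the actual detections in the assignment step.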
DeepSORT: Since the SORT algorithm is vulnerable to ID switches under poor illumination or occlusion, DeepSORT comes to the rescue. The ID association algorithm in DeepSORT considers both motion and appearance features. The main idea is to obtain a vector that represents the image of the target accurately. Hence, a DL-based classifier is trained, and its final dense layer is used to generate the feature vector. Additionally, multiple states have been added to the motion tracking module, which helps diminish the number of ID switches by 45% compared to SORT.
FairMOT: FairMOT is a fairly advanced method, as it incorporates both object detection and re-ID in a single framework. This multi-task learning approach allows both tasks to be optimized jointly. Other approaches with similar networks suffered from a bias toward the detection task, as the accuracy of re-ID was heavily dependent on the accuracy of the primary detection task.
As the name suggests, “Fair” MOT gives equal priority to both tasks. It has a streamlined network architecture that comprises an encoder-decoder block that results in two branches, one for detection and the other for re-ID.
FairMOT uses the anchor-free CenterNet as its object detection architecture. The detection branch outputs heatmaps, bounding box sizes, and center offsets, while the re-ID branch generates distinguishable features for different objects. FairMOT uses ResNet-34 as the backbone architecture, which helps achieve a good balance of speed and accuracy. The method is also robust to varying object scales and poses, as it employs deformable convolutions that can adapt their receptive fields.
ByteTrack: Unlike trackers that discard low-confidence detections outright, ByteTrack associates every detection box. First, motion similarity is computed for the bounding boxes with scores above the threshold: the locations of the tracklets in the new frame are predicted using a Kalman filter, and similarity is computed as the Intersection over Union (IoU) of the predicted and detected boxes. Next, the same motion similarity is computed between the low-score detection boxes and the remaining unmatched tracklets. Because ByteTrack computes similarity against the tracklets, it can distinguish the objects of interest from the background even among low-score detections.
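The two-stage association described above can be sketched as follows. High-score detections are matched to tracklets first, and only the leftover tracklets are matched against the low-score detections; greedy IoU matching stands in for the Kalman-prediction and Hungarian machinery of the real implementation, and the thresholds are illustrative:

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def byte_associate(tracklets, detections, high=0.6, iou_thresh=0.3):
    """tracklets: {id: predicted box}; detections: [(box, score), ...].
    Returns {id: matched box}. Pass 1 uses high-score detections only;
    pass 2 matches leftover tracklets to the low-score detections."""
    high_dets = [d for d in detections if d[1] >= high]
    low_dets = [d for d in detections if d[1] < high]
    matched, remaining = {}, dict(tracklets)
    for dets in (high_dets, low_dets):
        for tid, tbox in list(remaining.items()):
            if not dets:
                break
            best = max(range(len(dets)), key=lambda i: iou(tbox, dets[i][0]))
            if iou(tbox, dets[best][0]) >= iou_thresh:
                matched[tid] = dets.pop(best)[0]
                del remaining[tid]
    return matched
```

A tracklet occluded in the current frame often yields only a low-score detection; the second pass is what lets such a tracklet keep its identity instead of being dropped.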