In multi-object tracking for Intelligent Transportation Systems, there are objects of different sizes in images or videos, especially pedestrians and traffic lights with low resolution. Meanwhile, objects are subject to occlusion and loss during tracking. All of the above situations may lead to unsatisfactory multi-object tracking results. Motivated by the effectiveness of deep convolutional neural networks, this paper proposes a multi-object tracking network, CNN-based Multi-Object Tracking Networks with Position Correction and IMM (CNN_PC_IMM), to solve those problems. Our proposed method consists of an object detection module and an object tracking module. Compared to other networks, our proposed network has several main contributions that play an essential role in achieving state-of-the-art object tracking performance. In the detection phase, a feature fusion technique is used. We add a scale branch to the YOLOv3 network to increase the accuracy of small object prediction, and import a residual structure to enhance gradient propagation and avoid gradient disappearance or explosion in the whole network. In addition, we determine the size of the anchor boxes based on the size of the objects in the dataset to better detect and track the objects. In the tracking phase, IMM is used to calculate the motion state information of the object at a certain moment. Next, an optimization algorithm is proposed to fine-tune the object position when the tracked object is occluded, due to dense objects in traffic scenes, or lost due to incomplete object information. Finally, experimental results and analysis are presented on the MOT16 benchmark dataset, with several popular tracking algorithms used to compare performance with the proposed algorithm. It is demonstrated that the proposed algorithm has better performance on MOTA, MOTP, and ML.
Keywords: Intelligent transportation system, multi-object tracking, deep convolution neural networks, YOLOv3, IMM
With the further improvement of people's material and cultural levels in modern society, vehicles have gradually developed into a consumer product needed by the public in daily life. However, along with the yearly increase in the total number of vehicles comes a series of serious traffic safety and traffic congestion problems in cities. For the prospective planning and development of each city, and to reduce the occurrence of traffic accidents and traffic congestion, experts and scholars in industry and academia have invested a lot of human and material resources in the exploration of Intelligent Transportation Systems.
Multiple Object Tracking (MOT), also known as Multiple Target Tracking (MTT), is a computer vision task that seeks to optimize the analysis of video to identify and track objects belonging to more than one category [1,2]. Multi-object tracking is an important part of the unmanned system perception algorithm. It not only provides data on multiple spatio-temporal points and positions for moving objects, from which the specific running trajectory of each object is obtained, but also supplies highly informative data for scene understanding. For example, the object motion contains information such as the acceleration and timing of the corresponding object's motion, and the start-stop and duration information of the acceleration in the trajectory indicates when the object entered and left the scene. The implicit information in the trajectory can also indirectly express the behavioral motility and behavioral psychological characteristics of each object, which can offer significant data for high-level computerized motion visual recognition such as behavioral feature analysis and behavioral feature prediction. In addition, rich information can be obtained by analyzing the motion state of each object, such as the number of objects present in a specific area over a period of time, the relationships between objects, etc. Therefore, multi-object tracking technology has important research and application value in Intelligent Transportation Systems. Traditional multi-object tracking involves three main elements: an appearance model, a motion model, and an online update mechanism [3]. The implementation stages of traditional multi-object tracking are: first, constructing the initial appearance model, in which the initial area is obtained by object detection methods; second, obtaining the corresponding features and model by using the motion model to predict and analyze the area where the object is likely to appear; third, determining and matching the candidate area; finally, adjusting the appearance model based on the information from the previous stages to calculate the object area for the current frame. Traditional multi-object tracking methods usually use texture, shape, color, SIFT, SURF, and other features for object discrimination [4,5]. Most of these features are designed for specific application scenarios, and their robustness is weak, resulting in a limited range of use.
In 2012, a milestone of the deep learning era, a group of researchers from the University of Toronto achieved impressive results in the ImageNet competition, beating all other teams and taking first place by a significant margin [6]. Deep learning has subsequently been utilized in a broad scope of applications: speech recognition, automatic machine translation, self-driving cars, face recognition, etc. Multi-object tracking methods based on deep learning have also emerged in recent years. Usually, these algorithms use convolutional neural networks to detect objects and employ tracking models to accomplish multi-object tracking, and they have achieved many excellent results. Yu et al. [7] developed a revised Faster R-CNN that contained skip-pooling and multi-region characteristics and fine-tuned it on various pedestrian detection datasets. They were able to enhance the performance of the proposed model by more than 30% with this structure, attaining state-of-the-art performance on the MOT16 dataset. Zhang et al. [8] compared SSD with Faster R-CNN and R-FCN in the context of their pig tracking work and reported that its performance is better on their data. They used an online tracking method named Discriminative Correlation Filters (DCF) with HOG and Colour Names features to track the objects, and the output of the DCF tracker was applied to refine the bounding boxes in case of tracking failure. Zhao et al. [9] then used SSD to detect pedestrians and vehicles in the scene, but they made use of a CNN-based correlation filter to enable the SSD to produce far more accurate bounding boxes. Wang et al. [10] developed a new RGBT object tracker with short-term historical information in correlation filter tracking to solve the issue of RGBT and RGB tracking in difficult environments by exploiting multimodal data. CNN features and the object bounding box were used to obtain object features in the whole framework, which achieved results comparable to state-of-the-art algorithms on three RGBT datasets. To obtain better bounding boxes when objects undergo serious deformation, Xie et al. [11] combined the deepest-layer feature of the CNN model and an affine transformation into a new information model based on region CNN. RoI pooling and NMS were applied in the model, and the proposed model gained promising results.
The object scale changes a lot and the actual motion pattern is complicated, which makes it difficult for a single model to describe the motion pattern of the object. In this paper, we propose a multi-object tracking algorithm named CNN_PC_IMM that integrates the improved YOLOv3, named YOLOv3_I, with an interactive multi-model and a position correction optimization algorithm, and that can also automatically adjust the model parameters according to the size of the objects in the dataset. The model is finally tested and analyzed on the MOT16 dataset. The main contributions of this paper are: a new model for multi-scale detection of objects based on YOLOv3 is proposed and experimentally proven to be feasible for object detection. The detection results of the previous step are used as input for the subsequent tracking, and the motion state of the object is recorded with the interactive multi-model for object matching. A position correction optimization algorithm is proposed to reduce the multi-object tracking error rate by fine-tuning the positions of the detection and prediction results when an object is lost. The reliability of the proposed algorithm is verified and analyzed on the MOT16 dataset in comparison with other algorithms.
The rest of the paper is organized as follows. Section 2 introduces the related work on multi-object tracking in Intelligent Transportation Systems. Then, Section 3 is dedicated to our approach to implementing multi-object tracking. Section 4 covers the comparison of the experimental results and their analysis. Finally, the conclusion is given in Section 5.
Intelligent Transportation System (ITS) provides intelligent guidance for relieving traffic jams and reducing environmental pollution. The development of Intelligent Transportation Systems has been progressing rapidly. Meanwhile, ITS has been encouraging much research in various fields such as vehicle detection, congestion detection, vehicle counting, and multi-object tracking in recent years. Detection and tracking of traffic objects are an indispensable part of Intelligent Transportation Systems. The following reviews the development of object tracking in Intelligent Transportation Systems with deep learning techniques or traditional methods.
Video-based car detection is considered a component of Intelligent Transportation Systems due to its non-intrusive and holistic car behavior data acquisition abilities. Inspired by the Harris-Stephens corner detector, Chintalacheruvr et al. [12] designed a vehicle detection system that measured the count and speed of vehicles on arterial roads and highways. This system requires no complex calibration, is robust against changes in contrast, and achieves good performance on low-resolution video. In the field of ITS, Hinz et al. [13] proposed the first multi-object tracking model based on vision sensors for the neural vision system. The system capabilities were tested on real dynamic vision data of a highway bridge scenario. Liang et al. [14] employed the YOLO model and a multi-object tracking algorithm to calculate the number of vehicles in various traffic environments. In that paper, a real vehicle dataset was obtained from highway surveillance cameras. The experimental results showed that the proposed method was feasible for real-life vehicle counting scenarios. To explore the possibility of applying deep learning to low-resolution 3D LiDAR sensors, Pino et al. [15] devised a LiDAR-based system that performed point-to-point vehicle detection on PUCK data with a CNN and used an MH-EKF to estimate the real position and speed of the detected vehicle. The results showed that the proposed low-resolution DL algorithm could successfully perform the vehicle detection task with better performance than the geometric baseline approach. Furthermore, the system realized tracking performance at close range comparable to that of the high-end HDL-64 sensor; at long distances, however, detection was restricted to half the range of the high-end sensor. Liu et al. [16] proposed the HSAN model to boost ReID performance and obtain robust features of the various objects. To distinguish objects, an attention mechanism and pose information were employed. HSAN was compared on the Market-1501, CUHK03, DUKE ReID, and MOT benchmarks. Abbas et al. [17] designed a V-ITS system to predict and track vehicles and drivers' activities during highway driving. This V-ITS system enabled automated traffic regulation and thus reduced traffic accidents. To develop the V-ITS system, a pre-trained convolutional neural network (CNN) model with 4 layers was used, and illegal behavior was identified with a deep belief network (DBN) model. Vehicle counting plays a vital role in Intelligent Transportation Systems as it assists in the creation of autonomous driving systems and better road planning. Dirir et al. [18] suggested an effective object counting system and evaluated its capabilities with 20 different video datasets.
Nizar et al. [19] used HOG to extract object features, SVM to classify objects, and a KLT tracker to count objects in order to detect traffic situations by computer vision. The developed system achieved 95.2% accuracy. Tian et al. [20] proposed a new tracking method to address object association in the presence of motion noise or extended occlusions by bringing together information from the expanded structural and spatial domains. In this approach, the detections are first combined into small tracklets based on meta-measurements of object proximity; the task of associating the small tracklets is then settled by structural information of the motion patterns based on their interactions. The work in [21] is a broad overview of the MDP framework for MOT introduced by Xiang et al., with some additional crucial extensions. Firstly, the authors tracked objects with various cameras and sensor modalities by merging object candidate proposals. Secondly, the objects were tracked directly in the real world, which differs from other methods and makes the tracking available to autonomous navigation and related tasks. Vasic et al. [22] used collaborative fusion to extend the GM-PHD filter for vehicle tracking, addressing the problems of clutter and occlusion. Emami et al. [23] presented a utility MOT framework that merged trajectories from a new video MOT neural architecture devised for low-power edge devices with trajectories from commercially accessible traffic radars. The proposed architecture implemented efficient spatio-temporal object re-identification by depth-separable convolution for joint predictive object detection and dense grid features at a single scale. Due to the complex interaction and representation of road participants (e.g., vehicles, bicycles, or pedestrians) and road context information (e.g., traffic lights, lanes, and regulatory elements), behavior prediction in multi-agent and dynamic environments is essential for smart vehicles. Gómez-Huélamo et al. [24] described SmartMOT, a powerful and simple model that introduced semantic information and the idea of tracking-by-detection to predict the next trajectories of obstacles by assuming a CTRV (constant turn rate and velocity) motion model. The system pipeline was fed by the monitored lanes around the ego-vehicle, which were accounted for by the planning layer, the status of the ego-vehicle, including its odometry and speed, and the corresponding bird's eye view (BEV) detections.
The scale of traffic objects varies greatly, and the actual motion of objects in traffic scenes is complex, so it is difficult to describe the motion pattern of the objects using only one model. Therefore, to address the above problems, this article proposes a multi-object tracking algorithm based on the improved YOLOv3 with an interactive multi-model [25,26] and an object position correction algorithm. The algorithm uses the current state-of-the-art "detection + tracking" idea (Figure 1). First, the algorithm uses the improved YOLOv3 model (YOLOv3_I) for multi-scale detection of the objects. A feature fusion technique is used, and a scale branch is added to the YOLOv3 network to increase the accuracy of small object prediction. In addition, we also import a residual structure to enhance gradient propagation and avoid gradient disappearance or explosion in the whole network. The result of object detection is taken as the input for the subsequent tracking, and the object detection frame and features are mainly adopted for the later tracker's matching calculation. Image pre-processing includes the usual operations such as data normalization and flipping, and the results are refined by removing objects with confidence less than 0.7. Non-maximum suppression is also used to get more accurate results. Second, in order to accommodate the complexity of moving objects, the interactive multi-model is used to calculate the motion state information of the object at a certain moment. An optimization algorithm then tries to find a matching object detection frame for each predicted object: if one is found, the object is assigned to its tracker; otherwise, the object is judged to be lost, and the detection and prediction results need to be fine-tuned. The position correction algorithm is then used to correct the object position. If an object detection box cannot be matched to any predicted box, a new object has appeared. Finally, the objects are matched, and the tracker and feature set are updated.
Figure 1. Network architecture of our approach
Figure 2 gives the flow chart of our model. It mainly contains two phases.
Figure 2. General flow chart of our multi-object tracking model
Detection phase:
Step 1: Split the original image into S*S cells or grids; each cell produces K bounding boxes according to the number of anchor boxes.
Step 2: Employ the convolutional neural network to extract features and predict the bounding box coordinates, the confidence score, and the class probabilities.
Step 3: Compare the maximum confidence of the bounding boxes with the threshold; if the maximum confidence exceeds the threshold, the bounding box contains an object. Otherwise, the bounding box does not contain an object.
Step 4: Select the class with the highest probability as the object category. Adopt NMS to perform a local-maximum search that suppresses redundant boxes, and output the object detection results (see the sketch below).
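As a concrete illustration of Steps 3 and 4, the following minimal NumPy sketch applies the confidence threshold and a greedy NMS pass. The 0.7 confidence threshold comes from Section 3; the 0.5 IoU threshold is an assumed typical setting, not a value specified by the paper.

```python
import numpy as np

def filter_and_nms(boxes, scores, conf_thresh=0.7, iou_thresh=0.5):
    """Drop low-confidence boxes ([x1, y1, x2, y2]), then greedily suppress overlaps."""
    keep_conf = scores >= conf_thresh            # Step 3: confidence test
    boxes, scores = boxes[keep_conf], scores[keep_conf]
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]               # highest confidence first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the current best box against all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou < iou_thresh]      # Step 4: suppress redundant boxes
    return boxes[keep], scores[keep]
```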
Tracking phase:
Step 1: Use IMM to predict the tracks (Bboxes) for the current frame. If objects are confirmed, the detection results and predicted tracks undergo correlation and matching; otherwise, the position correction algorithm is used for the unmatched tracks and detections (a simple association sketch follows these steps).
Step 2: Update the tracked Bbox after the matching is completed.
Step 3: After updating the current frame, the next frame is predicted, observed, and updated; then predicted again, and so on for each subsequent frame.
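The paper does not spell out the association metric used in Step 1. As a simple stand-in, the sketch below matches predicted track boxes to detections by greedy IoU (the Hungarian algorithm is a common alternative), returning the matched pairs plus the unmatched tracks and detections that would be handed to the position correction step.

```python
def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def associate(pred_boxes, det_boxes, iou_thresh=0.3):
    """Greedily match predicted track boxes to detections by IoU."""
    matches, unmatched_tracks, used = [], [], set()
    for ti, p in enumerate(pred_boxes):
        candidates = [(iou(p, d), di) for di, d in enumerate(det_boxes)
                      if di not in used]
        best = max(candidates, default=(0.0, -1))
        if best[0] >= iou_thresh:
            matches.append((ti, best[1]))        # track ti follows detection
            used.add(best[1])
        else:
            unmatched_tracks.append(ti)          # candidate for position correction
    unmatched_dets = [di for di in range(len(det_boxes)) if di not in used]
    return matches, unmatched_tracks, unmatched_dets
```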
In traffic scenes, people, traffic lights, or cars in the input image usually have low resolution and are small objects, while the receptive field of the convolutional layers at the end of the YOLOv3 backbone is very large, so it is difficult for YOLOv3 to detect such objects accurately. In computer vision tasks, invariance and equivalent transformations are two important properties of image feature representation. Classification aims to learn high-level semantic information and therefore requires invariant feature representations. The goal of object localization is to distinguish position and scale changes, so it requires equivalent transformations. Object detection consists of two subtasks, object identification and localization, and learning both invariance and equivalent transformations is the key to detection. YOLOv3 is composed of a series of convolutional and pooling layers. The deep-level features have stronger invariance, but their equivalent transformation is weak; although this is beneficial for classification, it leads to low localization accuracy in object detection. On the contrary, the shallow-layer features are less conducive to semantic learning but contain more edge and contour information, which helps object localization. Combining multi-scale features can increase the global and local feature information in the model, improving the accuracy of object detection and, in turn, of subsequent object tracking. Therefore, to boost the detection performance of the model for small objects in traffic scenes, this paper uses a feature fusion technique to enhance the prediction of YOLOv3 by further integrating deep and shallow features in the model, adding one more scale to strengthen small object prediction, and adding a residual structure in the corresponding scale branch to control gradient propagation and prevent the gradient from vanishing and the network from degrading or failing to converge. This makes training the deeper network less difficult. The structure of the improved YOLOv3 network is shown in Figure 3. The detailed fusion process can be described as follows:
(1)
(2)
where the r-th layer feature map of the backbone serves as input; the fusion operation consists of convolution, BN, ReLU, and tensor addition, and produces the p-th fused feature map from a high-resolution feature map and a low-resolution feature map; the low-resolution map undergoes upsampling, full connection, and convolution before fusion. In the following, the improved model is abbreviated as YOLOv3_I.
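As a concrete illustration of the fusion just described (the precise form of Eqs. (1)-(2) is not reproduced here), the following PyTorch sketch upsamples the low-resolution (deep) map and passes both inputs through conv + BN + ReLU before tensor addition. All channel counts and kernel sizes are assumed values, not the paper's.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Fuse a shallow (high-resolution) and a deep (low-resolution) feature map:
    the deep map is upsampled, both branches go through conv + BN + ReLU,
    and the results are combined by tensor addition."""
    def __init__(self, shallow_ch, deep_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.conv_shallow = nn.Sequential(
            nn.Conv2d(shallow_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.conv_deep = nn.Sequential(
            nn.Conv2d(deep_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, shallow, deep):
        return self.conv_shallow(shallow) + self.conv_deep(self.up(deep))

# Example: fuse a 52x52 shallow map with a 26x26 deep map.
fused = FusionBlock(256, 512, 256)(torch.randn(1, 256, 52, 52),
                                   torch.randn(1, 512, 26, 26))
```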
Figure 3. The structure of YOLOv3_I network
The model is similar to YOLOv3, with only convolutional layers. The size of the output feature map is controlled by adjusting the convolution stride, so there is no special restriction on the input image size. Drawing on the pyramid feature map idea, small-size feature maps are applied to detect large objects, while large-size feature maps detect small objects. According to the analysis of the dataset, a total of four feature maps are output: the first feature map is down-sampled 32 times, the second 16 times, the third 8 times, and the fourth 4 times. Each of the four detections corresponds to a different receptive field: 32x down-sampling gives the largest receptive field, suitable for detecting large objects; 16x suits objects of average size; 8x suits small objects; and 4x, with the smallest receptive field, suits even smaller objects. Considering that objects have different pixel sizes, we use anchor boxes of different scales to match them. The anchor size of each cell is shown in Table 1, according to the object sizes in MOT16 (a clustering sketch for deriving such anchors follows the table). The whole network, which draws on the essence of ResNet [27], DenseNet [28], and FPN [29], incorporates currently effective object detection techniques.
Table 1. Anchor box sizes for each feature map scale

| Feature map | 13*13 | 26*26 | 52*52 | 104*104 |
|---|---|---|---|---|
| Receptive field | Large | Medium | Small | Smallest |
| Prior box | (116*90) (156*198) (371*326) | (30*61) (62*45) (59*119) | (10*13) (16*30) (33*23) | (10*11) (19*15) (25*21) |
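The anchors in Table 1 are stated to be derived from the object sizes in MOT16. A common way to obtain such anchors (used by the YOLO family, though the paper does not give its exact procedure) is k-means clustering of ground-truth box sizes under a 1 - IoU distance. The sketch below uses random box sizes as stand-in data.

```python
import numpy as np

def kmeans_anchors(wh, k=12, iters=100, seed=0):
    """Cluster (width, height) pairs of ground-truth boxes into k anchors
    using the 1 - IoU distance popularized by YOLO anchor selection."""
    rng = np.random.default_rng(seed)
    anchors = wh[rng.choice(len(wh), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # IoU between every box and every anchor, treating both as
        # co-centered rectangles (only width and height matter).
        inter = (np.minimum(wh[:, None, 0], anchors[None, :, 0]) *
                 np.minimum(wh[:, None, 1], anchors[None, :, 1]))
        union = (wh[:, 0] * wh[:, 1])[:, None] + anchors[:, 0] * anchors[:, 1] - inter
        assign = np.argmax(inter / union, axis=1)   # nearest anchor by IoU
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = np.median(wh[assign == j], axis=0)
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]

# Toy usage: random box sizes standing in for MOT16 statistics,
# k = 12 matching the 4 scales x 3 anchors of Table 1.
anchors = kmeans_anchors(np.random.default_rng(1).uniform(5, 400, (1000, 2)), k=12)
```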
Due to the complexity of the actual motion of objects in traffic scenes, it is difficult to describe their motion pattern using only one model. Therefore, the algorithm in this paper adopts the Interacting Multiple Model (IMM) [30] to estimate the motion patterns of multiple objects (Figure 4). Its main feature is the ability to approximate the state of dynamic systems that have multiple patterns of behavior and can switch from one behavior pattern to another. In particular, the IMM estimator acts as a self-tuning variable-bandwidth filter, which allows it to naturally track maneuvering objects. IMM handles multi-object motion models in the Bayesian framework with a model set containing different motion sub-models, each corresponding to a filter. To handle the uncertainty of the object motion, the filters run in parallel, and the algorithm switches between models according to the updated weights. During object tracking, how well each sub-model of the system fits the current object depends on the probability of the model. So, the interaction between models can be performed according to this principle, and finally the probabilistic outputs of the sub-models are fused.
Figure 4. Flow chart of IMM algorithm
(1) Model initialization: the superscript <math display="inline">j</math> denotes the j-th filter at the k-th frame, <math display="inline">j=1,2,\ldots ,r</math>.
Model initialization state <math display="inline">{\hat{X}}_{k-1\vert k-1}^{0j}</math>; covariance matrix <math display="inline">{P}_{k-1\vert k-1}^{0j}</math>:

<math>{\hat{X}}_{k-1\vert k-1}^{0j}=\sum _{i=1}^{r}{\hat{X}}_{k-1\vert k-1}^{i}{u}_{k-1\vert k-1}^{i\vert j}</math> (3)

<math>{P}_{k-1\vert k-1}^{0j}=\sum _{i=1}^{r}{u}_{k-1\vert k-1}^{i\vert j}\left[{P}_{k-1\vert k-1}^{i}+\left({\hat{X}}_{k-1\vert k-1}^{i}-{\hat{X}}_{k-1\vert k-1}^{0j}\right){\left({\hat{X}}_{k-1\vert k-1}^{i}-{\hat{X}}_{k-1\vert k-1}^{0j}\right)}^{T}\right]</math> (4)
Mixing probability <math display="inline">{u}_{k-1\vert k-1}^{i\vert j}={\overline{c}}_{j}^{-1}{p}_{ij}{u}_{k-1}^{i}</math>; normalization constants <math display="inline">{\overline{c}}_{j}=\sum _{i=1}^{r}{p}_{ij}{u}_{k-1}^{i}</math>; <math display="inline">{p}_{ij}</math> is the Markov transition probability of filter <math display="inline">i</math> to <math display="inline">j</math>.
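A minimal NumPy sketch of this mixing step follows; the two-model transition matrix and probabilities are illustrative values, not the paper's.

```python
import numpy as np

def mixing_probabilities(P_trans, mu_prev):
    """IMM mixing probabilities u[i, j] = p_ij * mu_i / c_j and the
    normalization constants c_j = sum_i p_ij * mu_i (formulas above)."""
    c = P_trans.T @ mu_prev                    # c_j, one per target model j
    u = P_trans * mu_prev[:, None] / c[None, :]
    return u, c

# Two models (e.g. constant velocity / constant acceleration):
P_trans = np.array([[0.95, 0.05],
                    [0.05, 0.95]])             # p_ij: Markov transitions
mu_prev = np.array([0.7, 0.3])                 # model probabilities at k-1
u, c = mixing_probabilities(P_trans, mu_prev)  # each column of u sums to 1
```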
(2) Model filtering estimation: in this paper, the standard IMM model is used, and the filtering phase has two main steps: prediction and correction. The prediction phase is responsible for calculating the a priori state estimates for each system in the current state. The correction phase is responsible for combining the actual measurements into each prior estimate to obtain the corresponding a posteriori state estimate. The motion and measurement models of the Kalman Filter are described as follows:
<math>{\hat{X}}_{k\vert k-1}^{j}=F{\hat{X}}_{k-1\vert k-1}^{j}+{W}_{k-1}^{j}</math> (5)

<math>{Z}_{k}^{j}=H{\hat{X}}_{k\vert k-1}^{j}+{V}_{k}^{j}</math> (6)
where <math display="inline">{\hat{X}}_{k\vert k-1}^{j}</math> is the a priori state estimate; <math display="inline">F</math> and <math display="inline">H</math> are the state transition matrix and the measurement matrix, respectively; <math display="inline">{Z}_{k}^{j}</math> represents the measured value at the k-th frame; <math display="inline">{W}_{k-1}^{j}</math> and <math display="inline">{V}_{k}^{j}</math> are the corresponding noise terms, which obey Gaussian distributions. Finally, the posterior state estimate is obtained:
<math>{\hat{X}}_{k\vert k}^{j}={\hat{X}}_{k\vert k-1}^{j}+{K}_{k}^{j}{\tilde{Z}}_{k}^{j}</math> (7)
<math display="inline">{\tilde{Z}}_{k}^{j}={Z}_{k}^{j}-H{\hat{X}}_{k\vert k-1}^{j}</math> is the residual between the motion model and the measurement model. The filtering gain <math display="inline">{K}_{k}^{j}</math> is defined as:
<math>{K}_{k}^{j}={P}_{k\vert k-1}^{j}{H}^{T}{\left(H{P}_{k\vert k-1}^{j}{H}^{T}+{R}_{k}^{j}\right)}^{-1}</math> (8)
where the a priori covariance matrix is:
<math>{P}_{k\vert k-1}^{j}=F{P}_{k-1\vert k-1}^{j}{F}^{T}+{Q}_{k-1}^{j}</math> (9)
The posterior covariance matrix is updated as:
<math>{P}_{k\vert k}^{j}=\left(I-{K}_{k}^{j}H\right){P}_{k\vert k-1}^{j}</math> (10)
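Equations (5)-(10) form one Kalman predict/update cycle per model. A minimal NumPy sketch follows; the concrete F, H, Q, R matrices depend on the chosen motion model (e.g., constant velocity), which is not specified here.

```python
import numpy as np

def kf_predict(x, P, F, Q):
    """A priori state estimate and covariance, equations (5) and (9)."""
    return F @ x, F @ P @ F.T + Q

def kf_update(x_pred, P_pred, z, H, R):
    """Gain, posterior state, and posterior covariance, eqs. (7), (8), (10)."""
    S = H @ P_pred @ H.T + R                  # residual covariance
    K = P_pred @ H.T @ np.linalg.inv(S)       # filtering gain (8)
    resid = z - H @ x_pred                    # measurement residual
    x = x_pred + K @ resid                    # posterior state (7)
    P = (np.eye(len(x)) - K @ H) @ P_pred     # posterior covariance (10)
    return x, P, resid, S
```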
(3) Model probability updating. Calculating model probabilities:
<math>{u}_{k}^{j}=\frac{1}{c}{\Lambda }_{k}^{j}{\overline{c}}_{j}</math> (11)

<math>c=\sum _{j=1}^{r}{\Lambda }_{k}^{j}{\overline{c}}_{j}</math> (12)
Likelihood function <math display="inline">{\Lambda }_{k}^{j}</math>:
<math>{\Lambda }_{k}^{j}=\frac{1}{\sqrt{\left\vert 2\pi {S}_{k}^{j}\right\vert }}\exp \left\{-\frac{1}{2}{\left({\tilde{Z}}_{k}^{j}\right)}^{T}{\left({S}_{k}^{j}\right)}^{-1}{\tilde{Z}}_{k}^{j}\right\}</math> (13)
Here <math display="inline">{\tilde{Z}}_{k}^{j}</math> is the filter residual and <math display="inline">{S}_{k}^{j}=H{P}_{k\vert k-1}^{j}{H}^{T}+{R}_{k}^{j}</math> is the corresponding filtering residual covariance matrix; the residual is assumed to follow a Gaussian distribution. If the residuals are larger, the model deviates more from the object localization and its weight decreases; conversely, the weight increases.
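Equations (11)-(13) can be written compactly as below. The names resids, Ss, and c_bar are illustrative: they hold each filter's residual and residual covariance from the update step, and the normalization constants from the mixing step.

```python
import numpy as np

def update_model_probabilities(resids, Ss, c_bar):
    """Model probability update, equations (11)-(13): Gaussian likelihood
    of each filter's residual, weighted by its mixing constant c_bar_j."""
    L = np.array([
        np.exp(-0.5 * r @ np.linalg.solve(S, r)) /
        np.sqrt(np.linalg.det(2 * np.pi * S))
        for r, S in zip(resids, Ss)])          # likelihoods Lambda_k^j (13)
    mu = L * c_bar                             # numerator of (11)
    return mu / mu.sum()                       # normalization by c, eq. (12)
```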
(4) Estimated fusion. Final state estimates and their covariances:
<math>{\hat{X}}_{k\vert k}=\sum _{j=1}^{r}{u}_{k}^{j}{\hat{X}}_{k\vert k}^{j}</math>, <math>{P}_{k\vert k}=\sum _{j=1}^{r}{u}_{k}^{j}\left[{P}_{k\vert k}^{j}+\left({\hat{X}}_{k\vert k}^{j}-{\hat{X}}_{k\vert k}\right){\left({\hat{X}}_{k\vert k}^{j}-{\hat{X}}_{k\vert k}\right)}^{T}\right]</math> (14)
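Equation (14) is a moment-matched fusion of the per-model posterior estimates; a short sketch under the same assumptions as the previous snippets:

```python
import numpy as np

def fuse_estimates(mu, xs, Ps):
    """Moment-matched fusion of per-model estimates, equation (14)."""
    x = sum(m * xj for m, xj in zip(mu, xs))       # weighted mean state
    P = sum(m * (Pj + np.outer(xj - x, xj - x))    # covariance + spread term
            for m, xj, Pj in zip(mu, xs, Ps))
    return x, P
```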
Published on 29/11/23
Accepted on 04/11/23
Submitted on 27/07/23
Volume 39, Issue 4, 2023
DOI: 10.23967/j.rimni.2023.11.001
Licence: CC BY-NC-SA license