Researchers Develop DCN-Mobile-YOLO Model for Accurate Multi-Lane Vehicle Counting on Mobile Terminals

In the era of smart city construction, real-time and accurate vehicle flow statistics are at the core of intelligent traffic management, providing critical data support for traffic signal optimization, road planning and congestion relief. Traditional object detection methods often struggle to balance detection accuracy and processing speed, and single-detector pipelines are prone to missed detections and repeated counting in multi-lane vehicle counting scenarios. On mobile and embedded devices in particular, the large parameter counts of mainstream neural network models have become a major bottleneck for real-time detection. To address these pain points, a research team led by He Biao from Shenzhen University has proposed DCN-Mobile-YOLO, a lightweight target tracking and multi-lane vehicle counting model for mobile terminals that breaks the trade-off between detection accuracy and real-time performance and provides a new technical solution for intelligent traffic data collection.

The research, built on the classic YOLO v4 object detection framework, innovatively optimizes the network structure and integrates multi-target tracking technology, achieving significant improvements in both detection accuracy and mobile terminal adaptation. The results were published in the Journal of Shenzhen University Science and Engineering in November 2021, marking important progress in applying deep learning to mobile intelligent traffic detection.

In computer vision-based target counting, technical routes divide mainly into traditional feature extraction methods and convolutional neural network-based methods. Traditional methods, such as those relying on Haar-like features, local binary patterns (LBP) and histograms of oriented gradients (HOG), detect and count targets by extracting edge features from images. However, these methods demand highly accurate edge detection, and their performance degrades sharply in scenes with overlapping targets or dense vehicle flow. Traditional methods are also highly sensitive to environmental changes such as lighting and background, leading to poor generality in real traffic scenarios. For example, color-feature extraction is easily disturbed by lighting changes, which scrambles the color space distribution; motion-based methods such as frame differencing and optical flow cannot operate on a single frame; and background subtraction is highly sensitive to the external scene, making it hard to adapt to the complex and changeable road traffic environment.

With the rapid development of deep learning, convolutional neural network (CNN) based target detection methods have gradually become the mainstream of vehicle counting technology. These methods divide into two-stage detectors, represented by Faster R-CNN and Mask R-CNN, and one-stage detectors, represented by SSD and the YOLO series. Two-stage detectors have the advantage in detection accuracy and localization precision, but their complex network structure leads to slow inference; one-stage detectors cast bounding-box localization as a regression problem and process it directly, making them faster and better suited to real-time detection. Among them, YOLO v4, as an optimized version of the YOLO series, achieves a good balance between recognition accuracy and efficiency through improvements in data processing and augmentation, backbone design, training strategies, activation functions and loss functions, and has been widely applied in traffic sign recognition, pedestrian and vehicle detection, industrial part defect detection and other fields.

However, the mainstream YOLO v4 algorithm uses CSPDarkNet as its backbone, which has a large number of parameters and requires substantial GPU computing power for real-time detection. This makes it difficult to port to the mobile and embedded devices widely needed in practical traffic detection scenarios such as roadside monitoring and mobile patrol. At the same time, a single detection framework is prone to missed and false detections during continuous video detection, leading to inaccurate vehicle counts; in multi-lane counting, the traditional virtual detection line and virtual frame methods suffer large statistical errors caused by shooting angle, vehicle speed and vehicle size, and cannot meet the demand for accurate traffic flow statistics. These problems have become the key factors restricting the practical application of deep learning-based vehicle counting in mobile intelligent traffic systems.

To solve these problems, the Shenzhen University team proposed the DCN-Mobile-YOLO model, which applies three core optimizations to the YOLO v4 framework: lightweight backbone reconstruction, detection accuracy enhancement based on deformable convolution, and multi-target tracking with accurate per-lane counting based on the DeepSORT algorithm. The model not only reduces the number of network parameters, enabling deployment on mobile terminals, but also effectively addresses missed detection, repeated counting and inaccurate lane division in multi-lane vehicle counting, unifying detection accuracy, real-time performance and counting precision.

The core architecture of DCN-Mobile-YOLO comprises three parts: the backbone network DCN-MobileNet, a feature enhancement neck, and a detection output head. The backbone's biggest innovation is replacing the original CSPDarkNet with the mobile-friendly MobileNet v3 framework and introducing depthwise separable convolution, which decomposes a standard convolution into a depthwise convolution followed by a pointwise convolution, greatly reducing parameter count and computation while preserving feature extraction ability. To compensate for the drop in detection accuracy caused by the lighter backbone, the team replaces the conventional YOLO v4 convolution kernels with deformable convolution v2 (DCN v2) kernels. By expanding the region of interest (ROI) of the feature layer, deformable convolution adapts better to the deformation and varied postures of vehicles in real traffic scenes, effectively improving target localization and feature extraction. In addition, the model integrates the cross stage partial network (CSPNet) into the backbone, folding gradient changes into the feature map to further balance detection accuracy against computation; the traditional ReLU activation is replaced with the smoother Mish function, improving the network's expressiveness for deep features.
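As a back-of-the-envelope illustration of why this factorization shrinks the network, the weight counts of a standard convolution and its depthwise separable counterpart can be compared directly (the kernel size and channel widths below are example values, not figures from the paper):

```python
# Parameter-count comparison between a standard convolution and the
# depthwise separable factorization used in MobileNet-style backbones.
# Illustrative sketch; biases and batch-norm parameters are omitted.

def standard_conv_params(k, c_in, c_out):
    """Weights of a k x k standard convolution."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Depthwise k x k conv (one filter per input channel)
    followed by a 1 x 1 pointwise conv mixing the channels."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise

k, c_in, c_out = 3, 256, 256
std = standard_conv_params(k, c_in, c_out)        # 589,824 weights
sep = depthwise_separable_params(k, c_in, c_out)  # 2,304 + 65,536 = 67,840
print(std, sep, round(sep / std, 3))  # ratio ≈ 0.115, i.e. ~8.7x fewer weights
```

The ratio approaches the textbook bound 1/c_out + 1/k², which is why the savings grow with wider layers and larger kernels.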

For the feature enhancement neck, the model combines SPP-Net (spatial pyramid pooling) and PANet (Path Aggregation Network) to strengthen the network's feature extraction ability. The SPP module applies pooling windows of three different sizes to enlarge the receptive field over the backbone output; this differentiated pooling strategy both reduces the risk of overfitting and produces fixed-size feature outputs, suiting multi-scale vehicle detection. PANet builds on the feature pyramid network (FPN), fusing feature maps of different scales through simultaneous downsampling and upsampling so the output layers carry richer feature information, effectively improving the network's expression of shallow features (such as vehicle edges and shape) and deep semantic information (such as vehicle type and category). The detection output head corresponds to the last three feature layers of the backbone and performs multi-scale detection on three feature maps of different sizes, accurately detecting both small and large vehicles across lane positions and distances and further improving the model's adaptability to complex traffic scenes.
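The SPP idea can be sketched in a few lines of pure Python: stride-1, same-padded max pooling at several window sizes, with the results stacked as extra channels. The window sizes 5/9/13 are the common YOLO v4 choice and are assumed here rather than quoted from the paper; a real implementation would of course operate on framework tensors.

```python
# Minimal sketch of YOLO v4-style spatial pyramid pooling (SPP) on a
# single-channel feature map represented as a 2-D list of numbers.

def maxpool_same(fmap, k):
    """Stride-1, same-padded k x k max pooling on a 2-D list."""
    h, w = len(fmap), len(fmap[0])
    p = k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            window = [fmap[y][x]
                      for y in range(max(0, i - p), min(h, i + p + 1))
                      for x in range(max(0, j - p), min(w, j + p + 1))]
            row.append(max(window))
        out.append(row)
    return out

def spp(fmap, kernels=(5, 9, 13)):
    """Return the input plus its pooled versions as a list of channels."""
    return [fmap] + [maxpool_same(fmap, k) for k in kernels]

feat = [[(i * 13 + j) % 7 for j in range(13)] for i in range(13)]
channels = spp(feat)
print(len(channels))  # 4 output channels from 1 input channel
```

Because every pooled map keeps the spatial size of the input, the channels concatenate cleanly, which is what lets SPP widen the receptive field without changing the feature map geometry.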

To solve missed detection and repeated counting in a single detection framework, DCN-Mobile-YOLO integrates the DeepSORT multi-target tracking algorithm, which tracks vehicles continuously across video frames and ensures each vehicle is counted exactly once. DeepSORT is an upgraded version of the SORT algorithm: it combines the classic Kalman filter (KF) and the Hungarian algorithm, and adds a re-identification module for feature matching and updating. In the DCN-Mobile-YOLO pipeline, DeepSORT first reads the detection box positions of the current frame from the DCN-Mobile-YOLO detector and assigns each vehicle a unique ID based on the deep features of its detection box's image patch. It then sorts the detection boxes by confidence, deletes boxes and features below a threshold, and applies non-maximum suppression (NMS) to eliminate redundant boxes for the same target, so each vehicle keeps a single bounding box. A Kalman filter predicts each vehicle's position in the current frame from the state parameters of its box (bounding box area, aspect ratio, center coordinates), yielding predictions more accurate than the detector's raw output. The Hungarian algorithm then matches the predicted trackers to the current frame's detections through cascade matching and IoU matching, updates the Kalman tracker's parameters and feature set, and determines when old targets disappear and new ones appear. This pipeline tracks every vehicle continuously through the video, fundamentally solving the missed detection and repeated counting of single-frame detection and laying the foundation for accurate vehicle counting.
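The data-association step above can be illustrated with a small sketch: an IoU score between predicted track boxes and current detections, and an optimal one-to-one assignment. Brute force over permutations stands in for the Hungarian algorithm here to keep the example stdlib-only; the function names and the IoU threshold are illustrative assumptions, not values from the paper.

```python
from itertools import permutations

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def match(tracks, detections, iou_threshold=0.3):
    """One-to-one track/detection assignment maximizing total IoU.
    Brute force over permutations; a real tracker uses the Hungarian
    algorithm to reach the same optimum in polynomial time."""
    n = min(len(tracks), len(detections))
    best, best_score = [], -1.0
    for perm in permutations(range(len(detections)), n):
        pairs = list(zip(range(n), perm))
        score = sum(iou(tracks[t], detections[d]) for t, d in pairs)
        if score > best_score:
            best, best_score = pairs, score
    # drop matches whose overlap is too weak to be the same vehicle
    return [(t, d) for t, d in best
            if iou(tracks[t], detections[d]) >= iou_threshold]

tracks = [(0, 0, 10, 10), (50, 50, 60, 60)]   # Kalman-predicted boxes
dets = [(52, 51, 61, 60), (1, 1, 11, 11)]     # current-frame detections
print(match(tracks, dets))  # [(0, 1), (1, 0)]
```

Unmatched tracks then accumulate "missed" counts until deletion, and unmatched detections spawn new IDs, which is how DeepSORT handles targets leaving and entering the scene.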

For accurate multi-lane counting, the research team abandoned the traditional virtual detection line and fixed virtual frame methods and proposed an adaptive lane detection rule based on actual road traffic characteristics and video shooting angles. The traditional virtual detection line method is computationally cheap and highly real-time, but accumulates large statistical errors when vehicles move slowly; the virtual frame method judges a vehicle's lane by the overlap ratio between a fixed detection area and the detection box, which is more accurate but works only when the camera directly faces the lane lines, and is prone to misjudgment at oblique shooting angles. The team's adaptive rule centers on the vehicle's projected center point and sets the actual lane detection area by jointly considering vehicle height, detection box height and their relationship to the lane. This rule avoids the missed detections caused by wide lanes or tall vehicles in traditional IoU-based judgment, and also resolves the misjudgments that detection-box center-point methods make when vehicles drift toward lane edges or the shooting angle is skewed. Validated against manual calibration, the adaptive rule accurately matches vehicles to lanes across different shooting angles and road conditions, ensuring the accuracy of multi-lane vehicle counting.
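The paper's full rule also factors in vehicle and box height; the sketch below captures only the core idea, judging lane membership by the projected center point against per-lane x-intervals at the counting row. All names and boundary values are illustrative assumptions, not the authors' implementation.

```python
def lane_of(box, lane_bounds):
    """Assign a detection box (x1, y1, x2, y2) to a lane by the
    horizontal position of its projected center point, rather than
    by IoU with a fixed virtual frame. lane_bounds lists the
    (left, right) pixel interval of each lane at the counting row.
    Returns a 1-based lane index, or None if outside all lanes."""
    cx = (box[0] + box[2]) / 2.0   # horizontal center of the box
    for i, (left, right) in enumerate(lane_bounds, start=1):
        if left <= cx < right:
            return i
    return None

lanes = [(0, 100), (100, 200), (200, 300)]   # three lanes, 100 px wide
print(lane_of((130, 40, 190, 120), lanes))   # 2: center x = 160
print(lane_of((310, 40, 360, 120), lanes))   # None: off the roadway
```

Because the decision uses a single point rather than box overlap, a tall truck whose box spills across a lane line is still credited to the lane its center occupies.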

To verify the effectiveness and performance of the DCN-Mobile-YOLO model, the research team ran a series of comparative experiments and real traffic scene tests under a controlled experimental environment. The hardware comprised an Intel Core i9-9900K CPU at 3.60 GHz, 64 GB of memory, and two NVIDIA GeForce RTX 2080 Ti graphics cards with 11 GB of video memory each; the software environment was 64-bit Ubuntu 16.04, with TensorFlow 1.13.1 and Keras 2.3.1 as the deep learning framework and CUDA 10 as the parallel computing framework, ensuring the stability and reliability of the experimental results.

The team selected the VOC2007 + 2012 data set as the training and validation set to test the model's detection performance, and compared it against YOLO v4 with MobileNet and with CSPDarkNet as backbones on precision, recall, average precision (AP), mean average precision (mAP), parameter count and training time. Precision is the proportion of detected samples that are true positives, and recall is the proportion of all positive samples that the model detects; AP is the area under the precision-recall (P-R) curve for a single category, and mAP is the mean of the APs over all categories, the core indicator of a model's overall detection accuracy.
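These definitions reduce to a few lines of arithmetic. The sketch below computes precision and recall from confusion counts and approximates AP as the area under a (recall, precision) curve with a simple step rule; the sample numbers are illustrative, not experimental results from the paper.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(pr_points):
    """Area under a (recall, precision) curve, given points sorted
    by recall, using a rectangular (step) approximation."""
    ap, prev_recall = 0.0, 0.0
    for recall, precision in pr_points:
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap

p, r = precision_recall(tp=80, fp=20, fn=40)
print(p, round(r, 3))                               # 0.8 0.667
print(average_precision([(0.5, 1.0), (1.0, 0.5)]))  # 0.75
# mAP is then simply the mean of the per-class APs.
```

Benchmark protocols such as PASCAL VOC refine the step rule with precision interpolation, but the area-under-curve intuition is the same.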

The experimental results show a significant improvement in detection accuracy: DCN-Mobile-YOLO reaches an mAP of 76.41%, which is 13.19 percentage points higher than YOLO v4 with a MobileNet backbone (mAP 63.22%) and 6.63 percentage points higher than YOLO v4 with a CSPDarkNet backbone (mAP 69.78%). In parameter count, the DCN-MobileNet backbone has only 0.48% more parameters than MobileNet, and its total is only 17% of CSPDarkNet's, keeping the model lightweight while improving accuracy. In training time, with a batch size of 32 over 200 epochs, DCN-Mobile-YOLO trains as fast as the MobileNet-based YOLO v4 and in half the time of the CSPDarkNet-based YOLO v4, which eases subsequent model iteration and optimization.

In detection efficiency and real-time performance, the DCN-Mobile-YOLO model averages 12 frames per second (fps) in actual tests, meeting the real-time detection requirements of mobile terminals. In vehicle counting tests, the model records higher current-frame and total detection counts than the MobileNet-YOLO and CSPDarkNet-YOLO baselines while maintaining a faster frame rate at the same detection scale. Notably, frame rate depends on the number of detected targets: for a given model, more targets mean a lower frame rate, yet DCN-Mobile-YOLO remains stably real-time even in dense traffic, reflecting the computational efficiency of its lightweight backbone.

To further verify the practical effect of the DCN-Mobile-YOLO model in real traffic scenarios, the research team used GoPro action cameras to record the morning peak period (07:05-09:00) at the Shennan-Beihuan Interchange in Shenzhen, China on January 9, 2020, and performed multi-lane vehicle flow statistics and analysis. The test divided the morning peak into 23 five-minute periods and counted the flow of 8 lanes: the tidal lane (Lane 1), left turn lane (Lane 2), straight lanes (Lanes 3-7) and right turn lane (Lane 8). Over the 23 periods, the tidal lane and the left turn lane averaged 26 and 17 vehicles per period respectively, while the straight lanes averaged 55 vehicles per lane, revealing a clear imbalance in road traffic flow.

Grouping the lanes into three types (tidal + left turn, straight, and right turn) and averaging the flow of each type per period, the team found that the tidal + left turn, straight and right turn lanes carried averages of 43, 274 and 22 vehicles per 5 minutes respectively across the morning peak, a proportion of approximately 2:10:1. This shows that the straight direction is the bottleneck of morning peak traffic at the interchange and that the current lane configuration leaves substantial room for optimization. Based on this traffic flow data, the team proposed a traffic optimization suggestion: changing the left turn lane into a variable lane and the tidal lane into a left turn lane, which could further improve lane throughput and relieve pressure in the straight direction. The suggestion reflects the practical value of the DCN-Mobile-YOLO model in intelligent traffic management, supplying data support and a decision-making basis for traffic departments carrying out lane optimization and traffic signal adjustment.

The research and development of the DCN-Mobile-YOLO model has important theoretical and practical significance for intelligent traffic and computer vision. Theoretically, the model innovatively combines deformable convolution with a mobile convolutional network, balancing model lightweighting against detection accuracy and offering a new design approach for mobile target detection models; the integration of the DeepSORT algorithm and the adaptive lane detection rule enriches the technical toolkit for multi-target tracking and counting, solving the key problems of missed detection, repeated counting and inaccurate lane division in complex scenes.

In practical application, the DCN-Mobile-YOLO model is lightweight, accurate and real-time, and can be readily ported to mobile and embedded devices such as roadside intelligent monitoring terminals, mobile patrol equipment and unmanned inspection vehicles, delivering real-time, accurate multi-lane vehicle flow statistics across traffic scenes. The precise traffic flow data it collects can give traffic management departments a reliable basis for intelligent traffic management such as signal timing optimization, tidal lane adjustment, road congestion early warning and traffic planning, improving urban road traffic efficiency, alleviating congestion and advancing smart city construction.

Looking ahead, with the continued development of deep learning and edge computing, the DCN-Mobile-YOLO model can be further optimized in model compression, multi-scene adaptation and multi-target joint detection. For example, quantization and pruning could further reduce the parameter count for lower-power mobile devices; scene recognition and adaptive adjustment modules could improve robustness to different road conditions, weather and lighting; and integrating vehicle type, speed and other detection functions would enrich the traffic data collected. Combined with 5G, the Internet of Things and big data technologies, the model could support interconnected, shared traffic data and a comprehensive intelligent traffic data collection and analysis system, injecting new technical momentum into the construction of smart cities and intelligent transportation systems.

Author Information: Wen Nu, Guo Renzhong, He Biao, School of Architecture & Urban Planning, Research Institute for Smart Cities, Shenzhen University, Shenzhen 518061, Guangdong Province, P. R. China; Guangdong-Hong Kong-Macau Joint Laboratory for Smart Cities, Shenzhen 518061, Guangdong Province, P. R. China; Key Laboratory of Urban Land Resources Monitoring and Simulation, Ministry of Natural Resources, Shenzhen 518034, Guangdong Province, P. R. China.

Journal: Journal of Shenzhen University Science and Engineering, Vol. 38, No. 6, November 2021. DOI: 10.3724/SP.J.1249.2021.06628