Ⅰ. Introduction
Autonomous vehicles make decisions and control their movement based on a perception system for the road environment, and this perception relies mainly on object detection, a core technology of computer vision[1]. The task of on-road object detection is to accurately determine the class and position of the various objects on the road in an image sequence, which has become an important research topic in the field of self-driving. Objects on the road include static objects and moving objects. Static objects mainly include traffic signs, lanes, obstacles, etc. Moving objects mainly include vehicles, pedestrians, non-motor vehicles, animals and so on. Detecting moving objects is the more important task in vehicle driving. At present, researchers face four main difficulties in detecting moving objects:
(1) Moving objects overlap one another[2]; (2) The direction and speed of the objects are uncertain[3]; (3) The detection precision for small objects is still not high[4]; (4) It is difficult to optimize detection speed and accuracy simultaneously[5].
In recent years, convolutional neural networks have been increasingly applied in the field of computer vision. Object detection algorithms based on convolutional neural networks fall into two families: those based on candidate regions (also known as two-stage detectors)[6-8] and those based on regression (also known as one-stage detectors)[9-11]. Compared with two-stage detectors, one-stage detectors have a faster detection speed and comparable detection accuracy[12]. In order to further improve the detection accuracy of a one-stage detector, this paper examines how the data set is divided.
Generally, a convolutional neural network model is trained by supervised learning on a data set divided into three parts: a training set, a validation set and a test set. How to divide the data set so as to obtain the best-performing network parameters has not been clearly reported. In this paper, the YOLOv4[11] network model was used to study training, validation and test sets of different sizes on the Udacity data set.
Specifically, the technical contributions of our paper can be summarized as follows:
(1) On the Udacity data set, the initial candidate boxes were clustered with the k-Means++[13] dimension clustering algorithm, the relationship between Avg IoU (Average Intersection over Union), the number of iterations and the number of candidate boxes was analyzed, and the number and values of the initial candidate boxes were determined.
(2) The YOLOv4 network model was trained and tested on three object classes, namely Truck, Car and Pedestrian, on the Udacity data set. The detection performance of the model on the three classes was compared across training, validation and test sets containing different numbers of images, in order to determine the optimal proportions of the three sets.
Ⅱ. Related work
Among detectors based on candidate regions, R-CNN (Regions with CNN Features)[6] proposed an object detection model based on a convolutional neural network. The mAP (Mean Average Precision) of R-CNN on the VOC2007 (Visual Object Classes 2007) data set reached 62.4%, but the algorithm is complex and training takes a long time. SPP (Spatial Pyramid Pooling)[7] reported an accuracy of 59.2% on the VOC2007 data set. TridentNet[8] was built with dilated convolution, and its accuracy on the COCO (Common Objects in Context) data set reached 48.4%. Among regression-based detectors, SSD (Single Shot MultiBox Detector)[9] raised the detection accuracy on the VOC2007 data set to 74.3% and the detection speed to 59 FPS. YOLOv3[10] reached an accuracy of 33.0% on the COCO data set. YOLOv4[11] combines a series of tuning techniques to balance detection accuracy and speed, and runs at about 65 FPS on a Tesla V100 for the COCO data set.
Many data sets have also been studied in connection with self-driving. The KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) data set[14] takes advantage of an autonomous driving platform to develop novel, challenging benchmarks for the tasks of stereo, optical flow, visual odometry/SLAM (Simultaneous Localization And Mapping) and 3D object detection. The Mapillary Vistas Dataset[15] is a novel, large-scale street-level image data set containing 25,000 high-resolution images annotated into 66 object categories. The Udacity Self-driving Dataset[16] contains two subsets of 9,423 and 15,000 images taken from continuous videos. The videos were shot with Point Grey Research cameras running at a full resolution of 1920×1200 at 2 Hz while driving in Mountain View, California and neighboring cities during daylight.
Ⅲ. Dataset partition and candidate boxes
1. Dataset and network model
The Udacity data set includes occluded objects, objects whose direction and speed of movement change, and pedestrians, which appear smaller than cars, so it covers each of the difficulties listed in the introduction. Therefore, the Udacity data set was selected in this paper for research on improving network detection performance. When the image resolution is 416×416, YOLOv4 requires a total of 59.578 BFLOPs (billion floating-point operations) per forward pass. The simplified YOLOv4 network architecture is shown in figure 1.
Fig. 1. Simplified YOLOv4 network architecture
2. Dataset partition
We used the subset of 9,423 images, which contains over 65,000 2D labels annotated as Car, Truck and Pedestrian. Of these, 9,218 images contain objects of the three classes Car, Truck and Pedestrian, while the remaining 205 images contain no objects of the selected classes.
First, the label format of the 9,218 images was converted to the Pascal VOC2007 format. Then, the images in the Udacity data set were divided into three subsets: a training set (T1), a validation set (V) and a test set (T2); the three subsets together form a group. Finally, the number of images in the three subsets of each group was allocated in the proportion T1:V:T2. Seven groups were formed, denoted G1 to G7, and the specific values are shown in table 1.
Table 1. Proportions of 7 groups
In table 1, T1 decreases from 80% to 20% in steps of 10%, while V and T2 each increase from 10% to 40% in steps of 5%. Figure 2 shows the number of images in each group.
Fig. 2. Distributions of the number of images in 7 groups
To keep the results on the validation set and the test set comparable, the two sets contain the same number of images.
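The grouping rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `split_ratios` and `split_counts` are hypothetical, and the rounding rule (round V, set T2 equal to V, give the remainder to T1) is an assumption, since the exact image counts are given only in table 1 and figure 2.

```python
def split_ratios(num_groups=7, t1_start=80, t1_step=10):
    """Return the (T1, V, T2) percentages of groups G1..G7.

    T1 falls by t1_step per group; V and T2 share the rest equally.
    """
    groups = {}
    for g in range(num_groups):
        t1 = t1_start - t1_step * g
        v = t2 = (100 - t1) / 2
        groups[f"G{g + 1}"] = (t1, v, t2)
    return groups


def split_counts(total, ratios):
    """Turn percentages into image counts (assumed rounding rule).

    V is rounded to the nearest image, T2 is set equal to V,
    and T1 receives the remaining images so the counts sum to total.
    """
    _, v_pct, _ = ratios
    v = round(total * v_pct / 100)
    return total - 2 * v, v, v


if __name__ == "__main__":
    for name, ratios in split_ratios().items():
        print(name, ratios, split_counts(9218, ratios))
```

Keeping V and T2 identical in size by construction mirrors the requirement that the validation and test results remain comparable.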
3. Candidate boxes
YOLOv4 adopts the anchor box idea of Faster R-CNN. At the beginning of training, a set of initial candidate boxes is selected manually according to the data set. However, the choice of initial candidate boxes often affects how well the network learns. In the Udacity data set, the objects in the images are divided into three classes: Car, Truck and Pedestrian. For objects of the same class, the size of the ground-truth box varies with the distance from the camera and the state of the object. Objects of different classes also have their own characteristics. Compared with Car and Truck, the height-to-width ratio of the ground-truth box of the Pedestrian class is relatively large. This slender box shape makes it difficult to set the initial candidate boxes.
In order to let the network determine the initial candidate box (anchor) parameters according to the features of the three classes, the k-Means++ algorithm was used to cluster the ground-truth boxes of the labeled objects, with the distance Dis as the clustering metric. IoU is defined in formula (1) and Dis in formula (2).
\(\mathrm{IoU}=\frac{\operatorname{area}\left(B_{p} \cap B_{gt}\right)}{\operatorname{area}\left(B_{p} \cup B_{gt}\right)}\) (1)
\(Dis_{IoU}=1-\mathrm{IoU}\) (2)
IoU measures the degree of overlap between two bounding boxes (or, in the more general case, two objects). It is the area of the intersection of the two bounding boxes divided by the area of their union. Here Bgt = (xgt, ygt, wgt, hgt) is the ground-truth box and Bp = (x, y, w, h) is the predicted box; xgt and ygt are the horizontal and vertical coordinates of the center point of Bgt, and wgt and hgt are its width and height. Likewise, x and y are the horizontal and vertical coordinates of the center point of Bp, and w and h are its width and height.
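As a concrete illustration of formulas (1) and (2), the following sketch computes the IoU and the distance Dis for two boxes given in the (center x, center y, width, height) form used above. The function names `iou` and `iou_distance` are illustrative choices, not names from the original work.

```python
def iou(box_p, box_gt):
    """IoU of two boxes given as (x_center, y_center, width, height)."""
    x, y, w, h = box_p
    xg, yg, wg, hg = box_gt
    # Edges of the intersection rectangle.
    left = max(x - w / 2, xg - wg / 2)
    right = min(x + w / 2, xg + wg / 2)
    top = max(y - h / 2, yg - hg / 2)
    bottom = min(y + h / 2, yg + hg / 2)
    # Clamp at zero so that disjoint boxes give an empty intersection.
    inter = max(0.0, right - left) * max(0.0, bottom - top)
    union = w * h + wg * hg - inter
    return inter / union if union > 0 else 0.0


def iou_distance(box_p, box_gt):
    """Distance of formula (2): the larger the IoU, the smaller the distance."""
    return 1.0 - iou(box_p, box_gt)
```

For example, two 2×2 boxes whose centers are one unit apart horizontally share an intersection of area 2 and a union of area 6, so their IoU is 1/3 and their distance is 2/3.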
Formula (1) shows that the value of IoU does not depend directly on the absolute size of the anchor boxes, which avoids the large errors that a size-dependent distance would produce for large boxes. In formula (2), the greater the IoU, the smaller the distance Dis, indicating that the initial candidate boxes produced by k-Means++ clustering fit better.
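A minimal sketch of anchor clustering with the distance of formula (2) is given below. It assumes that only the widths and heights of the ground-truth boxes are clustered and that boxes are compared as if they shared the same center, a common convention for anchor clustering that the text does not spell out. The names `wh_iou` and `kmeanspp_anchors`, the mean-based centroid update and the stopping rule are illustrative assumptions, not the authors' implementation.

```python
import random


def wh_iou(box, anchor):
    """IoU of two (w, h) boxes, assumed to share the same center."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    return inter / (box[0] * box[1] + anchor[0] * anchor[1] - inter)


def kmeanspp_anchors(boxes, k, max_iter=100, seed=0):
    """Cluster (w, h) boxes into k anchors using Dis = 1 - IoU.

    Returns the anchors, the average IoU between each box and its
    best anchor, and the number of iterations that were run.
    Requires at least k distinct boxes.
    """
    rng = random.Random(seed)
    # k-means++ seeding: each further anchor is drawn with probability
    # proportional to the squared distance to the nearest chosen anchor.
    anchors = [rng.choice(boxes)]
    while len(anchors) < k:
        weights = [min(1.0 - wh_iou(b, a) for a in anchors) ** 2 for b in boxes]
        anchors.append(rng.choices(boxes, weights=weights, k=1)[0])

    assignment = None
    for iteration in range(1, max_iter + 1):
        # Assign every box to the anchor with the highest IoU.
        new_assignment = [
            max(range(k), key=lambda j: wh_iou(b, anchors[j])) for b in boxes
        ]
        if new_assignment == assignment:
            break  # assignments are stable, so clustering has converged
        assignment = new_assignment
        # Move each anchor to the mean width and height of its members.
        for j in range(k):
            members = [b for b, a in zip(boxes, assignment) if a == j]
            if members:
                anchors[j] = (
                    sum(w for w, _ in members) / len(members),
                    sum(h for _, h in members) / len(members),
                )

    avg_iou = sum(max(wh_iou(b, a) for a in anchors) for b in boxes) / len(boxes)
    return anchors, avg_iou, iteration
```

Running the function for a range of K values and recording `avg_iou` and `iteration` for each gives the kind of data needed to compare candidate box counts.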
By comparing the IoU with an assigned threshold T, we can classify a detection as being correct or incorrect. If IoU ≥ T then the result is considered as correct. If IoU < T the result is considered as incorrect.
4. Evaluation metrics
Both the speed and the accuracy of moving object detection are very important in self-driving. The detection speed should keep pace with the driving speed of the vehicle so that the vehicle can respond in real time to emergencies on the road. Accuracy covers both the class and the location of an object. We selected the mean average precision (mAP) and the frames per second (FPS) as the evaluation indexes of the model.
In addition, network performance is analyzed using common performance measures, namely Precision, Recall and F1-score, defined in formulas (3) to (5).
\(\text { Precision }=\frac{T P}{T P+F P}=\frac{T P}{\text { all detections }}\) (3)
\(\text { Recall }=\frac{T P}{T P+F N}=\frac{T P}{\text { all groundtruths }}\) (4)
\(F_{1}\text{-score}=\frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision}+\text{Recall}}\) (5)
Where TP (true positive) is the number of true-positive detections, FP (false positive) is the number of false-positive detections, and FN (false negative) is the number of false negatives.
Take the Car class as an example: TP is the number of Cars identified as Car by the network model, FP is the number of Trucks or Pedestrians identified as Car, and FN is the number of Cars identified as Truck or Pedestrian.
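Formulas (3) to (5) translate directly into code. The sketch below uses a hypothetical function name, `precision_recall_f1`, and returns zero when a denominator is zero, which is a convention chosen here rather than one stated in the text.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, Recall and F1-score from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

As a made-up example for the Car class, 8 Cars detected as Car (TP), 2 other objects detected as Car (FP) and 2 Cars assigned to other classes (FN) give Precision = Recall = F1-score = 0.8.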
AP is calculated by formula (6):
\(A P=\int_{0}^{1} p(r) d r\) (6)
Where p is Precision, r is Recall, and p(r) is precision as a function of recall.
AP1, AP2 and AP3 of Car, Truck and Pedestrian are calculated respectively, and their mean gives the mAP of the YOLOv4 model on the Udacity data set, as shown in formula (7).
\(m A P=\frac{1}{n} \sum_{i=1}^{n} A P_{i}\) (7)
Where APi is the AP of the i-th class and n is the total number of classes being evaluated; here, n = 3.
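The integral in formula (6) has to be approximated from a finite set of precision-recall points. The sketch below uses all-point interpolation over a monotone precision envelope, which is one common choice and is an assumption here; the text does not state which interpolation was used. The function names and the sample values are hypothetical.

```python
def average_precision(recalls, precisions):
    """Approximate AP as the area under the precision-recall curve.

    recalls must be sorted in ascending order, with precisions[i]
    being the precision measured at recalls[i].
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Replace each precision by the maximum precision at any higher
    # recall, which makes the curve monotonically non-increasing.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangles between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class):
    """Formula (7): the mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


if __name__ == "__main__":
    # Made-up AP values, used only to show the call.
    print(mean_average_precision({"Car": 0.9, "Truck": 0.6, "Pedestrian": 0.3}))
```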
The Udacity data set is composed of images captured from videos, and the FPS is calculated based on the number of images in the test set divided by the time taken for network model detection, as shown in formula (8).
\(FPS=\frac{N_{\text{test}}}{T_{\text{total}}}\) (8)
Here, Ntest is the number of images in the test set, and Ttotal is the time spent on detection over the test set.
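Formula (8) amounts to timing one full pass over the test set. In the sketch below, `detect` is a hypothetical stand-in for a call to the trained detector on a single image, and `measure_fps` is an illustrative name.

```python
import time


def measure_fps(detect, images):
    """Number of images divided by the total detection time in seconds."""
    start = time.perf_counter()
    for image in images:
        detect(image)
    total_time = time.perf_counter() - start
    return len(images) / total_time
```

Whether image loading and preprocessing are included in the timed section changes the result, so the timed region should be kept the same for all groups being compared.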
Ⅳ. Experiment and analysis
1. Experimental environment
The experimental environment of this paper is as follows: Intel Core i5-6500 CPU (3.20 GHz, 4 cores), 16 GB memory, Windows 10 Professional 64-bit operating system, and one NVIDIA GeForce GTX 1060 graphics card with 3 GB graphics memory and compute capability 6.1.
2. Experiments and discussion
In principle, the number of initial candidate boxes K can be chosen arbitrarily. In order to study the relationship between K and Avg IoU, K was varied from 2 to 11. After dimension clustering of the Udacity data set with the k-Means++ algorithm, the relationship between K, Avg IoU and the number of iterations was obtained, as shown in figure 3.
Fig. 3. Relation between Avg IoU and number of iterations
In figure 3, Avg IoU increases as K increases, and the increment gradually becomes smaller from K = 7. The number of iterations initially also increases with K, but it drops at K = 9 and then increases again. Increasing K raises the probability of accurately matching the ground-truth object boxes and thus the detection accuracy, although a larger K also slows down the detection speed of the model. We therefore chose K = 9 as a good tradeoff between model complexity and high Avg IoU.
The YOLOv4 network was run under the Darknet framework for the 7 groups of proportions in table 1. For testing, the IoU threshold was set to 50% and 75% respectively.
After testing, the three performance parameters Precision, Recall and F1-score are shown in figure 4.
Fig. 4. The results of the Precision, Recall and F1-score
Based on the performance parameters in figure 4, the 6 parameters obtained from the 7 groups were analyzed using the standard deviation S, given in formula (9).
\(S=\sqrt{\frac{1}{N-1} \sum_{i=1}^{N}\left(X_{i}-\bar{X}\right)^{2}}\) (9)
In formula (9), Xi denotes a value of the performance parameter, \(\bar{X}\) denotes the arithmetic mean of the performance parameter, and N denotes the number of performance parameters; here, N = 6. The largest S among the 6 performance parameters is that of Precision50, at 0.0058. The 6 performance parameters are therefore largely unaffected by the grouping of the data set.
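Formula (9) is the sample standard deviation with the N − 1 denominator. A short sketch follows; the function name `sample_std` and the input values are made up for illustration and are not measurements from figure 4.

```python
import math


def sample_std(values):
    """Sample standard deviation of formula (9), dividing by N - 1."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))


if __name__ == "__main__":
    # Hypothetical values of one performance parameter, not data from the paper.
    print(sample_std([0.80, 0.81, 0.80, 0.79, 0.80, 0.81]))
```

A value of S that is small relative to the mean indicates that the parameter hardly changes from one sample to the next.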
Next, the performance parameters of mAP, Avg IoU, and FPS are drawn in figure 5.
Fig. 5. The results of the mAP, Avg IoU and FPS
As shown in figure 5, when the threshold is 50%, the maximum mAP50 is 80.89, in G2, and the maximum AvgIoU50 is 68.87, also in G2. When the threshold is 75%, the maximum mAP75 is 47.08 and the maximum AvgIoU75 is 54.72, both in G2. The deviation of each parameter is shown in figure 6.
Fig. 6. Deviations of the mAP, Avg IoU and FPS
As can be seen from figure 6, every parameter in G2 has a positive deviation, and the values of mAP50, mAP75, Avg IoU50 and Avg IoU75 in G2 are the maximum values among the 7 groups. However, the average FPS deviation in G2 is 0.14, lower than the maximum value of 0.20 in G6.
From the above analysis, it can be concluded that when Car, Truck and Pedestrian are trained, validated and detected on the Udacity data set with the YOLOv4 network model, the best detection results are achieved when the images are allocated to the training set, validation set and test set according to the ratio of G2. That is, the training set accounts for 70%, the validation set for 15%, and the test set for 15%.
Table 2. The test results of the 2 sample images
Two sample images were selected and tested with 4 network models trained on the data sets of G1, G2, G4 and G7. The test results with a threshold of 50% are shown in table 2.
As shown in table 2, the 2 images in the first row are the original images before testing. The number of objects of each of the 3 classes is given below each image. The following four rows show the images as detected by the 4 network models from G1, G2, G4 and G7 respectively, and the test results, including the number and average precision of the three classes of objects, are given below each row.
Ⅴ. Conclusion
In this paper, three types of targets in the automatic driving scene, Car, Truck and Pedestrian, were detected on the Udacity data set with the YOLOv4 network model. The k-Means++ algorithm was used to cluster the initial candidate boxes, and 9 initial candidate boxes were determined for the network model. The data set was divided into three subsets: a training set, a validation set and a test set. According to the number of images in the subsets, 7 groups were formed. For each group, the YOLOv4 network model was trained and validated on the training set and validation set to obtain the optimal parameters of each network model. Finally, testing was carried out on the 7 test sets to obtain performance parameters such as mAP and FPS.
The results show that the number of images in the training set, the validation set and the test set has a great impact on the trained YOLOv4 network model. Comparative analysis shows that the network model parameters are optimal when the Udacity data set is divided into training set, validation set and test set in the ratio T1:V:T2 = 7:1.5:1.5. This network model achieves high accuracy in detecting moving targets of different sizes in suburban automatic driving scenarios: mAP50 and mAP75 reach 80.89% and 47.08% respectively, at a speed of 10.56 FPS.
In this paper, the numbers of images in the training set, validation set and test set were allocated according to an arithmetic progression, so the ratios vary in discrete steps. The optimal network model obtained from such discrete groupings may therefore miss a better model lying between the tested ratios. We will continue to study this question in future work.
References
- Khodabandeh M, Vahdat A, Ranjbar M, et al. "A robust learning approach to domain adaptive object detection", Proceedings of the IEEE/CVF ICCV, pp. 480-490, 2019.
- Hong-Gi Park, Kyoung-Ho Bae. "A Study on Detection and Resolving of Occlusion Area by Street", Journal of the Korea Academia-Industrial cooperation Society. Vol. 21, No. 10, pp. 77-83, 2020. DOI: https://doi.org/10.5762/KAIS.2020.21.10.77.
- Seung-Cheol Lim, Jae-Seung Go. "A Study on Design and Implementation of Driver's Blind Spot Assist System Using CNN Technique", The Journal of The Institute of Internet, Broadcasting and Communication (JIIBC), Vol. 20, No. 2, pp. 149-155, 2020. DOI: https://doi.org/10.7236/JIIBC.2020.20.2.149.
- Bing Han, Yunhao Wang, Zheng Yang, et al. "Small-Scale Pedestrian Detection Based on Deep Neural Network", IEEE Trans. Intell. Transport. Syst., 21 (7), pp. 3046-3055, 2020. DOI: 10.1109/TITS.2019.2923752.
- Hojoon Lee, Jeongsik Yoon, Yonghwan Jeong, et al. "Moving Object Detection and Tracking Based on Interaction of Static Obstacle Map and Geometric Model-Free Approach for Urban Autonomous Driving", IEEE Trans. Intell. Transport. Syst., pp. 1-10, 2020. DOI: 10.1109/TITS.2020.2981938.
- Ross Girshick, Jeff Donahue, Trevor Darrell, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation", IEEE CVPR, pp. 580-587, 2014.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, et al. "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition", IEEE Trans. Pattern Anal. Mach. Intell., 37 (9), pp. 1904-1916, 2015. DOI: 10.1109/TPAMI.2015.2389824.
- Yanghao Li, Yuntao Chen, Naiyan Wang, et al. "Scale-Aware Trident Networks for Object Detection", IEEE/CVF ICCV, pp. 6054-6063, 2019.
- W Liu, D Anguelov, D Erhan, et al. "SSD: Single shot multibox detector", ECCV, Springer, Cham, pp. 21-37, 2016.
- J Redmon, A Farhadi, "Yolov3: An incremental improvement", ArXiv, 1804.02767, 2018.
- A Bochkovskiy, CY Wang, HY Liao. "YOLOv4: Optimal Speed and Accuracy of Object Detection", ArXiv, 2004.10934, 2020.
- S Wu, X Li, X Wang. "IoU-aware single-stage object detector for accurate localization", Image and Vision Computing, 103911, 2020.
- David Arthur, Sergei Vassilvitskii, "k-means++: the advantages of careful seeding", 18th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027-1035, 2007.
- A Geiger, P Lenz, R Urtasun, "Are we ready for autonomous driving? the kitti vision benchmark suite", IEEE CVPR, pp. 3354-3361, 2012. DOI: 10.1109/CVPR.2012.6248074.
- G Neuhold, T Ollmann, S R Bulo, et al. "The mapillary vistas dataset for semantic understanding of street scenes", ICCV, pp. 22-29, 2017.
- Udacity. An open source self-driving car, 2017.