YOLOv4 네트워크를 이용한 자동운전 데이터 분할이 검출성능에 미치는 영향

Influence of Self-driving Data Set Partition on Detection Performance Using YOLOv4 Network

  • 왕욱비 (배재대학교 컴퓨터공학과) ;
  • 진락 (배재대학교 컴퓨터공학과) ;
  • 이추담 (배재대학교 컴퓨터공학과) ;
  • 손진구 (배재대학교 컴퓨터공학과) ;
  • 정석용 (배재대학교 컴퓨터공학과) ;
  • 송정영 (배재대학교 컴퓨터공학과)
  • Submitted : 2020.10.23
  • Reviewed : 2020.12.04
  • Published : 2020.12.31

초록

뉴럴 네트워크와 자동운전 데이터 셋을 개발하는 목표 중의 하나로, 데이터 셋을 분할함에 따라 움직이는 물체를 검출하는 성능을 개선하는 방법이 있다. 다크넷(DarkNet) 프레임워크에서 YOLOv4 네트워크 모델을 Udacity 데이터 셋의 훈련과 테스트에 사용하였다. Udacity 데이터 셋의 7개 비율에 따라서 이 데이터 셋은 훈련 셋, 검증 셋, 테스트 셋을 포함한 3개의 부분 셋으로 나누어진다. K-means++ 알고리즘은 7개 그룹에서 개체 Box 차원 군집화를 수행하기 위해 사용되었다. 훈련을 위한 YOLOv4 네트워크의 슈퍼 파라메타를 조절하여 7개 그룹들에 대하여 최적 모델 파라메타가 각각 구해졌다. 이 모델 파라메타는 각각 7개 테스트 셋의 검출과 비교에 사용되었다. 실험 결과에서 YOLOv4 네트워크는 Udacity 데이터 셋에서 트럭, 자동차, 행인으로 표현되는 움직이는 물체에 대하여 대/중/소 물체 검출을 할 수 있음을 보여 주었다. 훈련 셋과 검증 셋, 테스트 셋의 비율이 7 : 1.5 : 1.5일 때 최적의 모델 파라메타로서 가장 높은 검출 성능을 보였다. 그 결과값은 mAP50가 80.89%, mAP75가 47.08%에 달하고, 검출 속도는 10.56 FPS에 달한다.

Aiming at the development of neural networks and self-driving data sets, one way to improve the performance of a network model in detecting moving objects is to adjust how the data set is divided. Under the Darknet framework, the YOLOv4 (You Only Look Once v4) network model was trained and tested on the Udacity data set. The Udacity data set was divided into three subsets, a training set, a validation set and a test set, according to 7 different proportions. The k-means++ algorithm was used to conduct dimensional clustering of the object boxes in the 7 groups. By adjusting the hyper-parameters of the YOLOv4 network for training, the optimal model parameters for the 7 groups were obtained respectively. These model parameters were then used to detect and compare the 7 test sets. The experimental results show that YOLOv4 can effectively detect the large, medium and small moving objects represented by Truck, Car and Pedestrian in the Udacity data set. When the ratio of training set, validation set and test set is 7:1.5:1.5, the optimal model parameters of YOLOv4 yield the highest detection performance: mAP50 reaches 80.89%, mAP75 reaches 47.08%, and the detection speed reaches 10.56 FPS.

Ⅰ. Introduction

Autonomous vehicles make decisions on their own and control their movement based on a perception system for the road environment, in which the perception of environmental information mainly relies on object detection, a core technology of computer vision[1]. The task of object detection on the road is to accurately locate the classes and position information of the various objects appearing on the road in an image sequence, which has become an important research topic in the field of self-driving. The objects on the road include static objects and moving objects: static objects mainly include traffic signs, lanes, obstacles, etc., while moving objects mainly include vehicles, pedestrians, non-motorized vehicles, animals and so on. Detection of moving objects is the more important task in vehicle driving. At present, researchers face four main difficulties in detecting moving objects:

(1) Moving objects overlap one another[2]; (2) The direction and speed of the objects are uncertain[3]; (3) The detection precision for small objects is still not high[4]; (4) It is difficult to optimize detection speed and accuracy simultaneously[5].

In recent years, convolutional neural networks have been increasingly applied in the field of computer vision. Object detection algorithms using convolutional neural networks have developed along two lines: one is based on candidate regions (also known as two-stage detectors)[6-8], and the other is based on regression (also known as one-stage detectors)[9-11]. Compared with two-stage detectors, one-stage detectors have faster detection speed and comparable detection accuracy[12]. In order to further improve the detection accuracy of the one-stage detector, this paper investigates how the data set is divided.

Generally, a convolutional neural network model performs supervised learning on a data set that is divided into three parts: a training set, a validation set and a test set. How to divide the data set so as to obtain the network parameters with the best performance has not been clearly reported. In this paper, the YOLOv4[11] network model was used to study training sets, validation sets and test sets of different sizes on the Udacity data set.

Specifically, the technical contributions of this paper can be summarized as follows:

(1) On the Udacity data set, the relationship between the Avg IoU (Average Intersection over Union), the number of iterations and the number of candidate boxes was analyzed by clustering the initial candidate boxes with the k-means++[13] dimension clustering algorithm, and the number and values of the initial candidate boxes were determined.

(2) The YOLOv4 network model was used to train and test three classes of objects, namely Truck, Car and Pedestrian, on the Udacity data set. With training sets, validation sets and test sets containing different numbers of images, the detection performance of the model on the three classes was compared and analyzed to determine the optimal proportional relationship among the training set, validation set and test set.

Ⅱ. Related work

In terms of candidate-region research, R-CNN (Regions with CNN Features)[6] proposed an object detection model based on a convolutional neural network. The mAP (Mean Average Precision) of R-CNN on the VOC2007 (Visual Object Classes 2007) data set reached 62.4%, but the algorithm complexity is high and the training time is long. SPP (Spatial Pyramid Pooling)[7] reached 59.2% on the VOC2007 data set. TridentNet[8] was constructed using dilated convolution, and its accuracy on the COCO (Common Objects in Context) data set reached 48.4%. Among regression-based detectors, the detection accuracy of SSD (Single Shot MultiBox Detector)[9] on the VOC2007 data set was improved to 74.3%, while the detection speed was also improved, to 59 FPS. The accuracy of YOLOv3[10] on the COCO data set was improved to 33.0%. YOLOv4[11] combines a series of tuning techniques to trade off detection accuracy and speed, running at 65 FPS on a Tesla V100 on the COCO data set.

Of course, many data sets have been developed in connection with self-driving. The KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) data set[14] takes advantage of an autonomous driving platform to develop novel, challenging benchmarks for the tasks of stereo, optical flow, visual odometry/SLAM (Simultaneous Localization And Mapping) and 3D object detection. The Mapillary Vistas data set[15] is a novel, large-scale street-level image data set containing 25,000 high-resolution images annotated into 66 object categories. The Udacity self-driving data set[16] contains two subsets of 9,423 and 15,000 images taken from continuous videos. The videos were shot by a Point Grey research camera running at a full resolution of 1920x1200 at 2 Hz while driving in Mountain View, California, and neighboring cities during daylight.

Ⅲ. Dataset partition and candidate boxes

1. Dataset and network model

The Udacity data set includes occluded objects, objects whose direction and speed change, and pedestrian objects that are smaller than cars, which correspond to the difficulties listed above. Therefore, the Udacity data set was selected in this paper for research on improving network detection performance. When the image resolution is 416×416, a total of 59.578 BFLOPS (billions of floating-point operations) is required by YOLOv4 for one forward pass. The simplified YOLOv4 network architecture is shown in figure 1.

[Figure image: OTNBBE_2020_v20n6_157_f0001.png]

Fig. 1. Simplified YOLOv4 network architecture

그림 1. 간소화된 YOLOv4 네트워크 구조

2. Dataset partition

We used the subset of 9,423 images containing over 65,000 2D labels annotated for the Car, Truck and Pedestrian classes. In fact, 9,218 of these images contain object information for the three classes of Car, Truck and Pedestrian, while the remaining 205 images do not include objects of the selected classes.

First, the label information of the 9,218 images was converted to the Pascal VOC2007 format. Then, the images of the Udacity data set were divided into three subsets: a training set (T1), a validation set (V) and a test set (T2), and each such partition constitutes a group. Finally, the number of images contained in the three subsets of each group was allocated in proportion to T1:V:T2. The data set was divided into 7 groups, denoted G1 to G7; the specific values are shown in table 1.

Table 1. Proportions of 7 groups

표 1. 7개 그룹의 비율

[Table image: OTNBBE_2020_v20n6_157_t0001.png]

Group   T1 : V : T2 (%)
G1      80 : 10 : 10
G2      70 : 15 : 15
G3      60 : 20 : 20
G4      50 : 25 : 25
G5      40 : 30 : 30
G6      30 : 35 : 35
G7      20 : 40 : 40

In table 1, T1 decreases from 80% to 20% in steps of 10%, while V and T2 increase from 10% to 40% in steps of 5%. Figure 2 shows the distribution of the number of images in each group.

[Figure image: OTNBBE_2020_v20n6_157_f0002.png]

Fig. 2. Distributions of the number of images in 7 groups

그림 2. 7개 그룹의 이미지 데이터 갯수 분포

In order to ensure the comparability of the analyses on the validation set and the test set, the numbers of images in the validation set and the test set are kept equal.
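To make the partitioning procedure concrete, the sketch below splits a directory of annotated images into the three subsets by a given ratio. It is a minimal illustration only, not the authors' preprocessing code; the directory layout, file extension and the `split_dataset` name are assumptions.

```python
import random
from pathlib import Path

def split_dataset(image_dir, ratios=(0.7, 0.15, 0.15), seed=0):
    """Shuffle the annotated images once and cut them into T1/V/T2 by the given ratios."""
    assert abs(sum(ratios) - 1.0) < 1e-9, "ratios must sum to 1"
    images = sorted(Path(image_dir).glob("*.jpg"))   # assumed JPEG file layout
    random.Random(seed).shuffle(images)              # fixed seed for reproducibility
    n_train = int(len(images) * ratios[0])
    n_val = int(len(images) * ratios[1])
    t1 = images[:n_train]                            # training set
    v = images[n_train:n_train + n_val]              # validation set
    t2 = images[n_train + n_val:]                    # test set
    return t1, v, t2

# Example: the G2 proportion (70% : 15% : 15%) used later in this paper.
# t1, v, t2 = split_dataset("udacity/images", ratios=(0.7, 0.15, 0.15))
```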

3. Candidate boxes

YOLOv4 borrows the anchor-box idea from Faster R-CNN. At the beginning of training, a set of initial candidate boxes is manually selected and set according to the data set. However, such initial candidate boxes often affect how well the network learns. For the Udacity data set, the objects contained in the images fall into three classes: Car, Truck and Pedestrian. For objects of the same class, the ground-truth box size varies with the distance from the camera and the state of the object, and objects of different classes tend to reflect their own shape characteristics. Compared with Car and Truck, the height-to-width ratio of the ground-truth box of the Pedestrian class is relatively large. This slender box shape makes it difficult to set suitable initial candidate boxes by hand.

In order to let the network determine the initial candidate box (anchor) parameters according to the features of the three classes, the k-means++ algorithm was used to cluster the ground-truth boxes of the labeled objects, with the distance Dis as the clustering metric. See formula (1) for the definition of IoU and formula (2) for the calculation of Dis.

\(IoU=\frac{\operatorname{area}\left(B_{p} \cap B_{gt}\right)}{\operatorname{area}\left(B_{p} \cup B_{gt}\right)}\)       (1)

\(Dis_{IoU}=1-IoU\)       (2)

IoU is a measure of the degree of overlap between two bounding boxes (or, in the more general case, two objects). It is the area of the overlap between the two bounding boxes divided by the area of their union. Here Bgt = (xgt, ygt, wgt, hgt) is the ground-truth box and Bp = (x, y, w, h) is the predicted box; xgt and ygt are the horizontal and vertical coordinates of the center point of the ground-truth box Bgt, wgt and hgt are its width and height, x and y are the horizontal and vertical coordinates of the center of the predicted box Bp, and w and h are its width and height.

Formula (1) shows that the value of IoU does not depend directly on the absolute size of the boxes, which avoids the large errors caused by scale differences. In formula (2), the greater the value of IoU, the smaller the distance Dis, indicating that the initial candidate boxes produced by k-means++ clustering fit the data better.
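For reference, formulas (1) and (2) can be implemented for axis-aligned boxes in the (x, y, w, h) format used above, with (x, y) the box center. This is an illustrative sketch, not code from the paper.

```python
def iou(box_p, box_gt):
    """Formula (1): Intersection over Union of two (x, y, w, h) boxes, (x, y) being the center."""
    xp, yp, wp, hp = box_p
    xg, yg, wg, hg = box_gt
    # Intersection rectangle in corner coordinates.
    x1 = max(xp - wp / 2, xg - wg / 2)
    y1 = max(yp - hp / 2, yg - hg / 2)
    x2 = min(xp + wp / 2, xg + wg / 2)
    y2 = min(yp + hp / 2, yg + hg / 2)
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = wp * hp + wg * hg - inter
    return inter / union if union > 0 else 0.0

def dis_iou(box_p, box_gt):
    """Formula (2): distance used for dimension clustering."""
    return 1.0 - iou(box_p, box_gt)
```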

By comparing the IoU with an assigned threshold T, we can classify a detection as correct or incorrect: if IoU ≥ T, the detection is considered correct; if IoU < T, it is considered incorrect.

4. Evaluation metrics

The detection speed and accuracy of moving object detection are very important in self-driving. The speed should match the driving speed of the vehicle to meet the real-time response of the vehicle to emergencies on the road. The accuracy includes the accuracy of class and location. We selected the mean average precision (mAP) and the frames per second (FPS) as the evaluation indexes of the model.

In addition, network performance is analyzed using the most common performance measures, Precision, Recall and F1-score, defined in formulas (3) to (5), respectively.

\(\text { Precision }=\frac{T P}{T P+F P}=\frac{T P}{\text { all detections }}\)       (3)

\(\text { Recall }=\frac{T P}{T P+F N}=\frac{T P}{\text { all groundtruths }}\)       (4)

\(F_{1}\text{-score}=\frac{2(\text{Precision})(\text{Recall})}{\text{Precision}+\text{Recall}}\)       (5)

Where TP (true positive) is the total number of true-positive detections, FP (false positive) is the total number of false-positive detections, and FN (false negative) is the total number of false-negative detections.

Taking the Car class as an example, TP refers to the number of Car objects identified as Car by the network model, FP refers to the number of Truck or Pedestrian objects identified as Car, and FN refers to the number of Car objects identified as Truck or Pedestrian.
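Formulas (3) to (5) translate directly into code once the per-class TP, FP and FN counts have been accumulated over the test set; the sketch and the counts below are hypothetical illustrations, not values from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, Recall and F1-score from raw counts (formulas 3-5)."""
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1

# Hypothetical counts for the Car class on one test set.
p, r, f1 = precision_recall_f1(tp=850, fp=120, fn=150)
print(f"Precision = {p:.3f}, Recall = {r:.3f}, F1 = {f1:.3f}")
```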

AP is calculated as shown in formula (6):

\(A P=\int_{0}^{1} p(r) d r\)       (6)

Where p stands for Precision, r for Recall, and p is treated as a function of r.

AP1, AP2 and AP3 are calculated for Car, Truck and Pedestrian respectively, and the mean of the three AP values then gives the mAP of the Udacity data set detected by the YOLOv4 model, as shown in formula (7).

\(m A P=\frac{1}{n} \sum_{i=1}^{n} A P_{i}\)       (7)

Where APi is the AP of the i-th class and n is the total number of classes being evaluated; here, n = 3.
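In practice the integral in formula (6) is evaluated numerically from a sampled precision-recall curve, and formula (7) is a simple average over the per-class APs. The sketch below uses the all-point interpolation commonly applied in VOC-style evaluation; that interpolation step is an assumption about standard practice rather than a detail stated in the paper, and the AP values at the end are placeholders.

```python
import numpy as np

def average_precision(recall, precision):
    """Formula (6): AP as the area under the precision-recall curve p(r).

    The recall samples are assumed to be sorted in increasing order.
    """
    r = np.concatenate(([0.0], np.asarray(recall, dtype=float), [1.0]))
    p = np.concatenate(([0.0], np.asarray(precision, dtype=float), [0.0]))
    # Make the precision envelope monotonically decreasing in recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum precision over the recall increments (all-point interpolation).
    return float(sum((r[i] - r[i - 1]) * p[i] for i in range(1, len(r))))

def mean_average_precision(ap_per_class):
    """Formula (7): mAP is the arithmetic mean of the per-class APs (here n = 3)."""
    return sum(ap_per_class) / len(ap_per_class)

# Hypothetical per-class APs for Car, Truck and Pedestrian.
print(mean_average_precision([0.86, 0.82, 0.74]))
```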

The Udacity data set is composed of images captured from videos, and the FPS is calculated based on the number of images in the test set divided by the time taken for network model detection, as shown in formula (8).

\(FPS=\frac{N}{T_{\text{total}}}\)       (8)

Here, N is the number of images in the test set, and Ttotal is the time spent testing on the test set.
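Formula (8) amounts to timing the detection loop over the whole test set. In the minimal sketch below, `detect` is a hypothetical placeholder for one forward pass of the trained network, not a function provided by Darknet.

```python
import time

def measure_fps(detect, images):
    """Formula (8): FPS = number of test images / total detection time."""
    start = time.perf_counter()
    for image in images:
        detect(image)                      # one forward pass per test image
    t_total = time.perf_counter() - start
    return len(images) / t_total

# Example with a dummy detector that takes about 95 ms per image (~10.5 FPS):
# fps = measure_fps(lambda img: time.sleep(0.095), list(range(100)))
```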

Ⅳ. Experiment and analysis

1. Experimental environment

The experimental environment of this paper is configured as follows: Intel Core i5-6500 CPU (3.20 GHz, 4 cores), 16 GB memory, Windows 10 Professional 64-bit operating system; NVIDIA GeForce GTX 1060 graphics card with 3 GB of graphics memory, compute capability 6.1, GPU count 1.

2. Experiments and discussion

In principle, the initial number of candidate boxes K can be chosen arbitrarily. In order to study the relationship between the initial number of candidate boxes K and the Avg IoU, values of K from 2 to 11 were taken. After dimension clustering of the Udacity data set with the k-means++ algorithm, the relationship between K, Avg IoU and the number of iterations was obtained, as shown in figure 3.

[Figure image: OTNBBE_2020_v20n6_157_f0003.png]

Fig. 3. Relation between Avg IoU and iteration times

그림 3. Avg IoU와 반복 횟수의 관계

In figure 3, Avg IoU increases with K, and the increment gradually decreases from K = 7 onwards. At first the number of iterations also increases with K, but at K = 9 the number of iterations drops before increasing again. Increasing K raises the probability of accurately matching the real object boxes and thus the detection accuracy, although it also slows down the detection speed of the model. We therefore choose K = 9 as a good trade-off between model complexity and a high Avg IoU.
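The sweep over K described above can be reproduced roughly as follows: cluster the labeled box sizes (width, height) with a k-means++-seeded procedure that uses 1 - IoU as the distance, and record the Avg IoU for each K from 2 to 11. This is an independent sketch under the usual convention that boxes are compared as if centered at the same point; it is not the Darknet code used for the experiments, and the box data at the end are synthetic placeholders rather than the Udacity annotations.

```python
import numpy as np

def wh_iou(boxes, anchors):
    """IoU between (w, h) boxes and (w, h) anchors, both treated as centered at the origin."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0]) *
             np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def kmeans_pp_init(boxes, k, rng):
    """k-means++ seeding with 1 - IoU as the distance."""
    anchors = boxes[rng.integers(len(boxes))][None, :]
    while len(anchors) < k:
        d = 1.0 - wh_iou(boxes, anchors).max(axis=1)       # distance to the nearest anchor
        anchors = np.vstack([anchors, boxes[rng.choice(len(boxes), p=d / d.sum())]])
    return anchors

def cluster_anchors(boxes, k, iters=100, seed=0):
    """Cluster box dimensions and return the anchors and the resulting Avg IoU."""
    rng = np.random.default_rng(seed)
    anchors = kmeans_pp_init(boxes, k, rng)
    for _ in range(iters):
        assign = wh_iou(boxes, anchors).argmax(axis=1)     # nearest anchor per box
        new = np.array([boxes[assign == j].mean(axis=0) if np.any(assign == j) else anchors[j]
                        for j in range(k)])
        if np.allclose(new, anchors):
            break
        anchors = new
    return anchors, wh_iou(boxes, anchors).max(axis=1).mean()

# Synthetic (w, h) data standing in for the Udacity ground-truth boxes.
rng = np.random.default_rng(1)
boxes = np.abs(rng.normal(loc=(80.0, 60.0), scale=(40.0, 30.0), size=(1000, 2))) + 1.0
for k in range(2, 12):
    _, avg_iou = cluster_anchors(boxes, k)
    print(f"K = {k:2d}  Avg IoU = {avg_iou:.3f}")
```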

The YOLOv4 network was run under the Darknet framework for each of the 7 groups with the different proportions listed in table 1. During testing, the IoU threshold was set to 50% and 75%, respectively.

After testing, the three performance parameters Precision, Recall and F1-score are shown in figure 4.

[Figure image: OTNBBE_2020_v20n6_157_f0004.png]

Fig. 4. The results of the Precision, Recall and F1-score

그림 4. 정밀도, 비율 및 F1값의 결과

Based on the statistical data of the performance parameters in figure 4, the 6 parameters (Precision, Recall and F1-score at the 50% and 75% thresholds) obtained from the 7 groups were analyzed with the standard deviation S. The formula for S is shown in formula (9).

\(S=\sqrt{\frac{1}{N-1} \sum_{i=1}^{N}\left(X_{i}-\bar{X}\right)^{2}}\)       (9)

In formula (9), Xi denotes a performance parameter value, \(\bar{X}\) denotes the arithmetic mean of that performance parameter, and N denotes the number of performance parameters, here N = 6. Through calculation, the maximum S among the 6 performance parameters is that of Precision50, at 0.0058. Hence, the 6 performance parameters are largely unaffected by the grouping of the data sets.
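Formula (9) is the ordinary sample standard deviation with an N - 1 denominator, which corresponds to `ddof=1` in NumPy; the values in the sketch below are placeholders, not the measured Precision50 figures.

```python
import numpy as np

# Hypothetical values of one performance parameter.
values = np.array([0.87, 0.88, 0.87, 0.88, 0.87, 0.88, 0.87])
s = np.std(values, ddof=1)   # ddof=1 gives the N - 1 denominator of formula (9)
print(f"S = {s:.4f}")
```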

Next, the performance parameters of mAP, Avg IoU, and FPS are drawn in figure 5.

[Figure image: OTNBBE_2020_v20n6_157_f0005.png]

Fig. 5. The results of the mAP, Avg IoU and FPS

그림 5. mAP, Avg IoU 및 FPS의 결과

As shown in figure 5, when the threshold is 50%, the maximum mAP50 is 80.89% in G2 and the maximum Avg IoU50 is 68.87%, also in G2. When the threshold is 75%, the maximum mAP75 is 47.08% in G2 and the maximum Avg IoU75 is 54.72%, again in G2. The deviation of each parameter is shown in figure 6.

[Figure image: OTNBBE_2020_v20n6_157_f0006.png]

Fig. 6. Deviations of the mAP, Avg IoU and FPS

그림 6. mAP, Avg IoU 및 FPS의 오차

As can be seen from figure 6, every parameter in G2 shows a positive deviation, and the values of mAP50, mAP75, Avg IoU50 and Avg IoU75 in G2 are all the maxima among the 7 groups. However, the FPS deviation in G2 is 0.14, lower than the maximum of 0.20 in G6.

From the above analysis, it can be concluded that when Car, Truck and Pedestrian are trained, validated and detected on the Udacity data set with the YOLOv4 network model, allocating the numbers of images in the training set, validation set and test set according to the ratio of G2 achieves the best detection result; that is, the training set accounts for 70%, the validation set for 15% and the test set for 15%.

Table 2. The test results of the 2 sample images

표 2. 2종류 샘플 데이터에 대한 실험 결과

[Table image: OTNBBE_2020_v20n6_157_t0002.png]

Two sample images were selected and tested with the 4 network models obtained after training on the data sets of G1, G2, G4 and G7. The test results with a threshold of 50% are shown in table 2.

As shown in table 2, the 2 images in the first row are the original images before testing, with the number of objects of the 3 classes given below each image. The following four rows show the images tested with the network model parameters from G1, G2, G4 and G7, respectively, and the test results, including the number and average precision of the three classes of objects, are given below each row.

Ⅴ. Conclusion

In this paper, Car, Truck and Pedestrian, three classes of targets in the autonomous driving scene, are detected on the Udacity data set with the high-performance YOLOv4 network model. The k-means++ algorithm was used to cluster the initial candidate boxes, and 9 initial candidate boxes were adopted in the network model. The data set was divided into three subsets (training set, validation set and test set) and, according to the number of images contained in the subsets, into 7 groups. First, the initial candidate boxes were determined by k-means++; then the YOLOv4 network model was trained on each training set and validation set to obtain the optimal model parameters for each group; finally, detection was carried out on the 7 test sets respectively to obtain performance parameters such as mAP and FPS.

The results show that changes in the numbers of images in the training set, the validation set and the test set have a great impact on the optimal model parameters obtained by training the YOLOv4 network. Through comparative analysis, when the Udacity data set is divided into training set, validation set and test set with the numbers of images in the ratio T1:V:T2 = 7:1.5:1.5, the resulting network model parameters are optimal. This network model ensures high accuracy in detecting moving targets of different sizes in the suburban autonomous driving scenario, with mAP50 and mAP75 reaching 80.89% and 47.08% respectively, at a speed of 10.56 FPS.

In this paper, the numbers of images in the training set, validation set and test set were allocated according to an arithmetic progression, which changes only in discrete steps. The optimal network model obtained from such a discrete grouping of the data set may therefore miss an even better network model. We will continue to study this question in future work.

참고문헌

  1. Khodabandeh M, Vahdat A, Ranjbar M, et al. "A robust learning approach to domain adaptive object detection", Proceedings of the IEEE/CVF ICCV, pp. 480-490, 2019.
  2. Hong-Gi Park, Kyoung-Ho Bae. "A Study on Detection and Resolving of Occlusion Area by Street", Journal of the Korea Academia-Industrial cooperation Society. Vol. 21, No. 10, pp. 77-83, 2020. DOI: https://doi.org/10.5762/KAIS.2020.21.10.77.
  3. Seung-Cheol Lim, Jae-Seung Go. "A Study on Design and Implementation of Driver's Blind Spot Assist System Using CNN Technique", The Journal of The Institute of Internet, Broadcasting and Communication (JIIBC), Vol. 20, No. 2, pp. 149-155, 2020. DOI: https://doi.org/10.7236/JIIBC.2020.20.2.149.
  4. Bing Han, Yunhao Wang, Zheng Yang, et al. "Small-Scale Pedestrian Detection Based on Deep Neural Network", IEEE Trans. Intell. Transport. Syst., 21 (7), pp. 3046-3055, 2020. DOI: 10.1109/TITS.2019.2923752.
  5. Hojoon Lee, Jeongsik Yoon, Yonghwan Jeong, et al. "Moving Object Detection and Tracking Based on Interaction of Static Obstacle Map and Geometric Model-Free Approach for Urban Autonomous Driving", IEEE Trans. Intell. Transport. Syst., pp. 1-10, 2020. DOI: 10.1109/TITS.2020.2981938.
  6. Ross Girshick, Jeff Donahue, Trevor Darrell, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation", IEEE CVPR, pp. 580-587, 2014.
  7. Kaiming He, Xiangyu Zhang, Shaoqing Ren, et al. "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition", IEEE Trans. Pattern Anal. Mach. Intell., 37 (9), pp. 1904-1916, 2015. DOI: 10.1109/TPAMI.2015.2389824.
  8. Yanghao Li, Yuntao Chen, Naiyan Wang, et al. "Scale-Aware Trident Networks for Object Detection", IEEE/CVF ICCV, pp. 6054-6063, 2019.
  9. W Liu, D Anguelov, D Erhan, et al. "SSD: Single shot multibox detector" ECCV, Springer, Cham, pp. 21-37, 2016.
  10. J Redmon, A Farhadi, "YOLOv3: An incremental improvement", ArXiv, 1804.02767, 2018.
  11. A Bochkovskiy, CY Wang, HY Liao, et al. "YOLOv4: Optimal Speed and Accuracy of Object Detection", ArXiv, 2004.10934, 2020.
  12. S Wu, X Li, X Wang. "IoU-aware single-stage object detector for accurate localization", Image and Vision Computing, 103911, 2020.
  13. David Arthur, Sergei Vassilvitskii, "k-means++: the advantages of careful seeding", 18th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027-1035, 2007.
  14. A Geiger, P Lenz, R Urtasun, "Are we ready for autonomous driving? the kitti vision benchmark suite", IEEE CVPR, pp. 3354-3361, 2012. DOI: 10.1109/CVPR.2012.6248074.
  15. G Neuhold, T Ollmann, S R Bulo, et al. "The mapillary vistas dataset for semantic understanding of street scenes", ICCV, pp. 22-29, 2017.
  16. Udacity. An open source self-driving car, 2017.