A Deep Learning Based Approach to Recognizing Accompanying Status of Smartphone Users Using Multimodal Data

스마트폰 다종 데이터를 활용한 딥러닝 기반의 사용자 동행 상태 인식

  • Kim, Kilho (SUALAB)
  • Choi, Sangwoo (Recommendations, Coupang)
  • Chae, Moon-jung (Department of Industrial Engineering and Institute for Industrial Systems Innovation, Seoul National University)
  • Park, Heewoong (Department of Industrial Engineering and Institute for Industrial Systems Innovation, Seoul National University)
  • Lee, Jaehong (kakaomobility datalab)
  • Park, Jonghun (Department of Industrial Engineering and Institute for Industrial Systems Innovation, Seoul National University)
  • Received : 2018.06.25
  • Accepted : 2019.03.11
  • Published : 2019.03.31


As smartphones have become widely used, human activity recognition (HAR) tasks that recognize the personal activities of smartphone users from multimodal data have been actively studied. The research area is expanding from recognizing the simple body movements of an individual user to recognizing low-level and high-level behavior. However, HAR tasks that recognize interaction behavior with other people, such as whether the user is accompanying or communicating with someone else, have received less attention so far. Moreover, previous research on recognizing interaction behavior has usually depended on audio, Bluetooth, and Wi-Fi sensors, which are vulnerable to privacy issues and require much time to collect enough data. In contrast, physical sensors such as the accelerometer, magnetic field sensor, and gyroscope are less vulnerable to privacy issues and can collect a large amount of data within a short time. In this paper, we propose a deep learning based method for detecting accompanying status using only multimodal physical sensor data from the accelerometer, magnetic field sensor, and gyroscope. Accompanying status is defined as a subset of user interaction behavior: whether the user is accompanied by an acquaintance at close distance, and whether the user is actively communicating with that acquaintance.
We propose a framework based on convolutional neural networks (CNN) and long short-term memory (LSTM) recurrent networks for classifying accompanying and conversation. First, a data preprocessing method was introduced that consists of time synchronization of multimodal data from different physical sensors, data normalization, and sequence data generation. Nearest-neighbor interpolation was applied to synchronize the timestamps of data collected from different sensors. Normalization was performed separately for the x-, y-, and z-axis values of the sensor data, and the sequence data were generated with a sliding window. The sequence data were then fed to the CNN, which extracted feature maps representing local dependencies of the original sequence. The CNN consisted of three convolutional layers and had no pooling layer, so that the temporal information of the sequence data was preserved. Next, the LSTM recurrent networks received the feature maps, learned long-term dependencies from them, and extracted features. The LSTM recurrent networks consisted of two layers, each with 128 cells. Finally, the extracted features were classified by a softmax classifier. The loss function of the model was cross entropy, and the weights of the model were randomly initialized from a normal distribution with a mean of 0 and a standard deviation of 0.1. The model was trained with the adaptive moment estimation (Adam) optimization algorithm, and the mini-batch size was set to 128. Dropout was applied to the inputs of the LSTM recurrent networks to prevent overfitting. The initial learning rate was set to 0.001 and was decayed exponentially by a factor of 0.99 at the end of each training epoch.
An Android smartphone application was developed and released to collect data, and smartphone data were collected from a total of 18 subjects. Using these data, the model classified accompanying and conversation with accuracies of 98.74% and 98.83%, respectively. Both the F1 score and the accuracy of the model were higher than those of the majority vote classifier, support vector machine, and deep recurrent neural network baselines. In future research, we will focus on more rigorous multimodal sensor data synchronization methods that minimize timestamp differences. In addition, we will study transfer learning methods that allow a model fitted to the training data to be transferred to evaluation data that follow a different distribution. We expect this to yield a model that maintains robust recognition performance under changes in the data that were not considered during training.
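The three preprocessing steps named in the abstract (time synchronization by nearest interpolation, per-axis normalization, and sliding-window sequence generation) can be illustrated with a minimal standard-library sketch. This is not the authors' code. The function names, the choice of z-score normalization, and the window and stride parameters are all assumptions made for illustration; the abstract does not state which normalization or window settings were used.

```python
from bisect import bisect_left
from statistics import mean, pstdev


def nearest_resample(timestamps, values, grid):
    """For each time in `grid`, take the sample whose timestamp is closest.

    `timestamps` must be sorted in ascending order.
    """
    out = []
    for t in grid:
        i = bisect_left(timestamps, t)
        if i == 0:
            j = 0
        elif i == len(timestamps):
            j = len(timestamps) - 1
        else:
            # Pick the closer neighbour; ties go to the earlier sample.
            j = i if timestamps[i] - t < t - timestamps[i - 1] else i - 1
        out.append(values[j])
    return out


def zscore_per_axis(samples):
    """Normalize each axis (column) to zero mean and unit variance.

    Assumption: z-score normalization; a constant axis is left at zero.
    """
    columns = list(zip(*samples))
    stats = [(mean(c), pstdev(c) or 1.0) for c in columns]
    return [tuple((v - m) / s for v, (m, s) in zip(row, stats))
            for row in samples]


def sliding_windows(samples, window, stride):
    """Cut a sequence into fixed-length, possibly overlapping windows."""
    return [samples[i:i + window]
            for i in range(0, len(samples) - window + 1, stride)]


# Hypothetical example: align gyroscope readings to accelerometer timestamps.
accel_t = [0, 20, 40]            # milliseconds
gyro_t = [5, 15, 22, 47]
gyro = ["g0", "g1", "g2", "g3"]
aligned = nearest_resample(gyro_t, gyro, accel_t)   # ['g0', 'g2', 'g3']
```

Integer millisecond timestamps are used in the example so that nearest-neighbour comparisons are exact; with floating-point timestamps, near-ties may resolve either way.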

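As smartphones have become widespread and deeply embedded in daily life, research on recognizing an individual user's activities from multimodal data collected on smartphones has been actively pursued. However, research on recognizing interaction behavior with other people has so far remained relatively limited. Existing studies on interaction behavior recognition have used audio, Bluetooth, and Wi-Fi data, but these carry a high risk of violating user privacy and make it difficult to collect a sufficient amount of data in a short time. Physical sensors such as the accelerometer, magnetic field sensor, and gyroscope, by contrast, carry a low privacy risk and can collect a sufficient amount of data in a short time. Motivated by this, this study proposes a deep learning based method for recognizing a user's accompanying status using only multimodal physical sensor data from a smartphone. The accompanying status classification model, which classifies whether the user is accompanied and whether the user is in conversation, has a hybrid structure combining a convolutional neural network and a long short-term memory recurrent network. First, the timestamp differences present in the data collected from the smartphone's multiple physical sensors are offset, and the data are normalized and converted into time-ordered sequence data to generate the input of the classification model. This input is fed to the convolutional neural network, which outputs feature maps reflecting the temporal local dependencies of the data. The long short-term memory recurrent network takes the feature maps as input, learns sequential relationships over time, and extracts features for accompanying status classification, on which the softmax classifier bases the final classification. Experimental data were collected by distributing a smartphone application developed in-house, and the proposed method was evaluated on these data. When the accompanying status classification model was trained and evaluated with the best parameter settings, it classified accompanying status and conversation status with high accuracies of 98.74% and 98.83%, respectively.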


Figure 1. Example of Converting Physical Sensor Data

Figure 2. CNN Structure of Proposed Model
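The abstract states that the CNN stacks three convolutional layers and omits pooling in order to keep the temporal information of the sequence. The sketch below illustrates only that design point, using a single-channel, stride-1 convolution without padding. The kernel values, kernel size, input length, and pooling size are hypothetical; the actual model uses learned multi-channel filters and nonlinearities that are not reproduced here.

```python
def conv1d_valid(sequence, kernel):
    """Single-channel 1D cross-correlation, stride 1, no padding."""
    k = len(kernel)
    return [sum(sequence[i + j] * kernel[j] for j in range(k))
            for i in range(len(sequence) - k + 1)]


def max_pool1d(sequence, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(sequence[i:i + size])
            for i in range(0, len(sequence) - size + 1, size)]


def stack(sequence, kernel, layers, pool_size=None):
    """Apply `layers` convolutions, optionally pooling after each one."""
    for _ in range(layers):
        sequence = conv1d_valid(sequence, kernel)
        if pool_size:
            sequence = max_pool1d(sequence, pool_size)
    return sequence


window = [0.0] * 100          # hypothetical window of 100 time steps
kernel = [1.0, 0.0, -1.0]     # hypothetical kernel of size 3

no_pool = stack(window, kernel, layers=3)
with_pool = stack(window, kernel, layers=3, pool_size=2)
print(len(no_pool), len(with_pool))   # 94 10
```

Under these assumed sizes, three pooling-free layers keep 94 of the 100 time steps for the LSTM to consume, whereas pooling by 2 after each layer would leave only 10.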

Figure 3. LSTM Recurrent Network and Last Classifier of Proposed Model
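The final classification and training settings described in the abstract (softmax classifier, cross-entropy loss, weights drawn from a normal distribution with mean 0 and standard deviation 0.1, and a learning rate starting at 0.001 that decays by 0.99 per epoch) can be written out as a short sketch. The function names are illustrative, and reading "decreased exponentially by 0.99" as multiplying the rate by 0.99 after every epoch is an interpretation, not something the abstract spells out.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Cross-entropy loss for one example with an integer class label."""
    return -math.log(probs[label])


def learning_rate(epoch, initial=0.001, decay=0.99):
    """Learning rate after `epoch` completed epochs (assumed schedule)."""
    return initial * decay ** epoch


def init_weights(n, rng, mean=0.0, std=0.1):
    """Draw n weights from a normal distribution N(mean, std^2)."""
    return [rng.gauss(mean, std) for _ in range(n)]


# Hypothetical two-class example (e.g. accompanied vs. not accompanied).
probs = softmax([2.0, 0.0])
loss = cross_entropy(probs, label=0)
weights = init_weights(4, random.Random(0))
```

With this schedule the rate falls slowly: after 100 epochs it is roughly 0.001 × 0.99^100 ≈ 0.00037.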

Figure 4. Main UI of the SCDC Application

Table 1. Summary of Collected Smartphone Multimodal Data

Table 2. Error Rate and F1 Score Comparison between Proposed Model and Baseline Model
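For readers unfamiliar with the metrics in this comparison, the sketch below shows how error rate and F1 score are computed for a binary task, together with a simple majority vote baseline. It assumes the majority vote classifier always predicts the most frequent training label, which is the usual meaning but is not confirmed by the abstract. The labels are made-up toy values, not data from the paper, and how the paper averaged F1 across classes is not stated.

```python
from collections import Counter


def majority_vote_predict(train_labels, n):
    """Predict the most frequent training label for all n test examples."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return [majority] * n


def error_rate(y_true, y_pred):
    """Fraction of examples that were misclassified (1 - accuracy)."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for illustration only (1 = accompanied, 0 = alone).
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
baseline = majority_vote_predict(train_labels=[0, 0, 1], n=len(y_true))
```

On imbalanced data a majority vote baseline can reach a low error rate while its F1 score for the minority class stays at zero, which is one reason to report both metrics.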


Supported by: Samsung Electronics

