Three-stream network with context convolution module for human-object interaction detection

Siadari, Thomhert S.; Han, Mikyong; Yoon, Hyunjin

doi:10.4218/etrij.2019-0230

ETRI Journal

Volume 42 Issue 2 / Pages 230-238 / 2020 / 1225-6463 (pISSN) / 2233-7326 (eISSN)

Electronics and Telecommunications Research Institute (한국전자통신연구원)

Siadari, Thomhert S. (ICT Major of ETRI School, University of Science and Technology) ;
Han, Mikyong (City and Transportation ICT Research Department, Electronics and Telecommunications Research Institute) ;
Yoon, Hyunjin (ICT Major of ETRI School, University of Science and Technology)

Received : 2019.04.29
Accepted : 2019.08.14
Published : 2020.04.03

https://doi.org/10.4218/etrij.2019-0230

Abstract

Human-object interaction (HOI) detection is a popular computer vision task that detects interactions between humans and objects. This task can be useful in many applications that require a deeper semantic understanding of scenes. Current HOI detection networks typically consist of a feature extractor followed by detection layers comprising small filters (e.g., 1 × 1 or 3 × 3). Although small filters can capture local spatial features with few parameters, their small receptive fields prevent them from capturing the larger context needed to recognize interactions between humans and distant objects. Hence, we propose a three-stream HOI detection network that employs a context convolution module (CCM) in each stream. The CCM captures larger contexts from input feature maps by combining large separable convolution layers with residual-based convolution layers; because it uses fewer large separable filters, it does so without increasing the number of parameters. We evaluate our HOI detection method on two benchmark datasets, V-COCO and HICO-DET, and demonstrate its state-of-the-art performance.
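
The parameter argument in the abstract can be made concrete with a little arithmetic. The sketch below is not the authors' implementation; it assumes that "large separable convolution" means a k × 1 convolution followed by a 1 × k convolution, which covers the same k × k window as a dense kernel. The function names and the channel width of 256 are hypothetical and chosen only for illustration.

```python
def dense_conv_params(k, c_in, c_out):
    """Weight count of a dense k x k convolution (bias terms ignored)."""
    return k * k * c_in * c_out


def separable_conv_params(k, c_in, c_out):
    """Weight count of a k x 1 convolution followed by a 1 x k convolution.

    Assumption: the intermediate feature map keeps c_out channels.
    """
    return k * c_in * c_out + k * c_out * c_out


if __name__ == "__main__":
    c_in, c_out = 256, 256  # hypothetical channel widths
    for k in (3, 7, 15):
        dense = dense_conv_params(k, c_in, c_out)
        sep = separable_conv_params(k, c_in, c_out)
        print(f"k={k}: dense={dense}, separable={sep}, ratio={dense / sep:.1f}")
```

With equal input and output widths, the dense layer needs k/2 times as many weights as the separable pair, so the saving grows with kernel size. Under these assumptions, reducing the number of filters in the separable layers lowers the cost further, which is one way a large-kernel module could stay within the budget of small dense filters.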
