Image Understanding for Visual Dialog

  • Cho, Yeongsu (Dept. of Computer Science, Graduate School of Kyonggi University)
  • Kim, Incheol (Dept. of Computer Science, Kyonggi University)
  • Received : 2019.06.26
  • Accepted : 2019.09.04
  • Published : 2019.10.31

Abstract

This study proposes a deep neural network model with an encoder-decoder structure for visual dialog. In visual dialog, where a sequence of questions and answers is grounded in an image, generating a correct answer requires an ongoing linguistic understanding of the dialog history and context. In many cases, however, a visual understanding that identifies the scene or the attributes of the objects in the image is also needed. Hence, at the encoding stage, the proposed model employs a separate person detector and attribute recognizer in addition to the visual features that a convolutional neural network extracts from the entire input image; the recognized attributes of the people in the image, such as gender, age, and clothing, are emphasized and used to generate answers. Experiments on VisDial v0.9, a large benchmark dataset, confirm that the proposed model performs well.
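The sketch below illustrates, in PyTorch, the kind of encoder-decoder fusion the abstract describes: global CNN image features are combined with person-attribute features from a separate detector and attribute recognizer, and with recurrent encodings of the question and dialog history, to condition an answer decoder. This is a minimal illustration, not the authors' released code; all module names, feature dimensions, and the concatenation-based fusion are assumptions.

```python
import torch
import torch.nn as nn


class VisualDialogEncoder(nn.Module):
    """Fuses question, dialog history, global image features, and person-attribute
    features into a single context vector (illustrative sketch)."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512,
                 img_feat_dim=4096, attr_feat_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Recurrent encoders for the current question and the concatenated history.
        self.question_rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.history_rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Projection of global CNN features extracted from the entire input image.
        self.img_proj = nn.Linear(img_feat_dim, hidden_dim)
        # Projection of attribute features (gender, age, clothing) produced by an
        # external person detector + attribute recognizer (assumed precomputed).
        self.attr_proj = nn.Linear(attr_feat_dim, hidden_dim)
        self.fusion = nn.Linear(4 * hidden_dim, hidden_dim)

    def forward(self, question, history, img_feat, attr_feat):
        _, (q_h, _) = self.question_rnn(self.embed(question))
        _, (h_h, _) = self.history_rnn(self.embed(history))
        fused = torch.cat(
            [q_h[-1], h_h[-1],
             torch.relu(self.img_proj(img_feat)),
             torch.relu(self.attr_proj(attr_feat))], dim=-1)
        # Context vector that conditions the answer decoder.
        return torch.tanh(self.fusion(fused))


class AnswerDecoder(nn.Module):
    """Generates the answer token by token from the fused encoder context."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, answer_tokens, context):
        # Initialize the decoder hidden state with the encoder context.
        h0 = context.unsqueeze(0)
        c0 = torch.zeros_like(h0)
        out, _ = self.rnn(self.embed(answer_tokens), (h0, c0))
        return self.out(out)  # per-token vocabulary logits
```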

Keywords
