• Title/Summary/Keyword: VQA

Korean VQA with Deep learning (딥러닝을 이용한 한국어 VQA)

  • Bae, Jangseong; Lee, Changki
    • Annual Conference on Human and Language Technology / 2018.10a / pp.364-366 / 2018
  • Visual Question Answering (VQA) is the task of finding the correct answer to a question about a given image. VQA is an important technology with applications in areas such as children's education and AI assistants. However, because relevant Korean-language data is difficult to obtain, little research has been done on VQA in Korean. In this paper, we translate an existing English VQA dataset into Korean and use it as Korean VQA data, and we apply a gate that can appropriately modulate the image and question information to Korean VQA. Experimental results show that the proposed model outperforms other models on both the English and Korean VQA data.

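To make the gate idea above concrete, here is a minimal sketch (assuming PyTorch, with illustrative layer sizes not taken from the paper) of a sigmoid gate that decides, per dimension, how much image versus question information flows into the fused representation:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Illustrative gated fusion of an image feature and a question feature."""
    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)  # computes the gate g from [v; q]

    def forward(self, v, q):
        # v: image feature (batch, dim); q: question feature (batch, dim)
        g = torch.sigmoid(self.gate(torch.cat([v, q], dim=-1)))
        return g * v + (1 - g) * q  # gated blend of the two modalities

fused = GatedFusion()(torch.randn(2, 512), torch.randn(2, 512))  # (2, 512)
```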

Study of the Application of VQA Deep Learning Technology to the Operation and Management of Urban Parks - Analysis of SNS Images - (도시공원 운영 및 관리를 위한 VQA 딥러닝 기술 활용 연구 - SNS 이미지 분석을 중심으로 -)

  • Lee, Da-Yeon; Park, Seo-Eun; Lee, Jae Ho
    • Journal of the Korean Institute of Landscape Architecture / v.51 no.5 / pp.44-56 / 2023
  • This research explores the enhancement of park operation and management by analyzing the changing demands of park users. While traditional methods depended on surveys, there has been a recent shift towards utilizing social media data to understand park usage trends. Notably, most research has focused on text data from social media, overlooking the valuable insights from image data. Addressing this gap, our study introduces a novel method of assessing park usage using social media image data and then applies it to actual city park evaluations. A unique image analysis tool, built on Visual Question Answering (VQA) deep learning technology, was developed. This tool revealed specific city park details such as user demographics, behaviors, and locations. Our findings highlight three main points: (1) The VQA-based image analysis tool's validity was proven by matching its results with traditional text analysis outcomes. (2) VQA deep learning technology offers insights like gender, age, and usage time, which aren't accessible from text analysis alone. (3) Using VQA, we derived operational and management strategies for city parks. In conclusion, our VQA-based method offers significant methodological advancements for future park usage studies.
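
As a rough illustration of the kind of tool the paper describes, the sketch below poses a fixed set of questions to an off-the-shelf VQA model for each SNS photo; the model checkpoint and the question list are assumptions for illustration, not the authors' actual setup:

```python
from transformers import pipeline

# Off-the-shelf VQA pipeline; this checkpoint is a common public one,
# not necessarily what the authors used.
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

QUESTIONS = [
    "How many people are in the photo?",
    "What are the people doing?",
    "Is this photo taken during the day or at night?",
]

def analyze_park_photo(image_path):
    # Ask each fixed question and keep the top-scoring answer.
    return {q: vqa(image=image_path, question=q)[0]["answer"] for q in QUESTIONS}
```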

Hybrid No-Reference Video Quality Assessment Focusing on Codec Effects

  • Liu, Xingang; Chen, Min; Wan, Tang; Yu, Chen
    • KSII Transactions on Internet and Information Systems (TIIS) / v.5 no.3 / pp.592-606 / 2011
  • Multimedia communication has developed so rapidly that video program services have become a basic expectation of ordinary customers. The quality of experience (QoE) of the visual signal is of fundamental importance for numerous image and video processing applications, where the goal of video quality assessment (VQA) is to measure the quality of the visual signal automatically, in agreement with human judgments of video quality. Considering the effect of the codec on video quality, this paper proposes an efficient no-reference (NR) VQA algorithm that estimates the video quality (VQ) using only the distorted video signal available at the destination. Feature vectors (FVs) that correlate strongly with the subjective quality of the distorted video are investigated, and a hybrid NR VQA (HNRVQA) function is established by combining the multiple FVs. Simulation results on the SDTV program material provided by VCEG Phase I show that the proposed algorithm represents the VQ accurately and can replace subjective VQA to measure video signal quality automatically at the destination.
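
The general recipe the abstract describes, extracting codec-sensitive feature vectors from the decoded video and mapping them to a quality score, can be sketched as follows; the toy features and the SVR regressor are stand-ins, not the paper's actual FVs or HNRVQA function:

```python
import numpy as np
from sklearn.svm import SVR

def extract_features(frames):
    # frames: (T, H, W) grayscale clip; toy proxies for blockiness, blur,
    # and temporal flicker stand in for the paper's feature vectors.
    blockiness = np.mean(np.abs(np.diff(frames[:, :, ::8], axis=2)))
    blur = np.mean(np.abs(np.gradient(frames, axis=1)))
    flicker = np.mean(np.abs(np.diff(frames, axis=0)))
    return np.array([blockiness, blur, flicker])

# Fit a regressor mapping feature vectors to subjective scores (MOS);
# random data here merely demonstrates the shapes involved.
X = np.random.rand(50, 3)      # feature vectors of 50 training clips
y = np.random.rand(50) * 5     # their MOS labels
model = SVR().fit(X, y)
score = model.predict(extract_features(np.random.rand(30, 64, 64))[None])
```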

MMA: Multi-modal Message Aggregation for Korean VQA (MMA: 한국어 시각적 질의응답을 위한 멀티 모달 메시지 통합)

  • Park, Sungjin; Park, Chanjun; Seo, Jaehyung; Lim, Heuiseok
    • Annual Conference on Human and Language Technology / 2020.10a / pp.468-472 / 2020
  • Visual Question Answering (VQA) is the task of predicting correct answers to diverse questions about a given image. It is actively studied at the intersection of computer vision and natural language processing; it is important both to grasp the intent of the question precisely and to find the relevant clue information in the image, and also to integrate information with heterogeneous characteristics (image objects, object positions, and the question). In this paper, we propose Multi-modal Message Aggregation (MMA) over the multi-modal inputs (image objects, object positions, and the question) so that information matching the question's intent is used efficiently, and we show that it achieves better performance than other models on KVQA, a Korean visual question answering dataset.

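A minimal sketch of the aggregation idea, assuming PyTorch and illustrative dimensions rather than the authors' exact MMA architecture, is a question-conditioned attention that pools object appearance and position features into one representation:

```python
import torch
import torch.nn as nn

class MessageAggregation(nn.Module):
    def __init__(self, obj_dim=2048, pos_dim=4, q_dim=512, dim=512):
        super().__init__()
        self.obj_proj = nn.Linear(obj_dim + pos_dim, dim)  # fuse object + box
        self.attn = nn.Linear(dim + q_dim, 1)              # question-aware score

    def forward(self, obj_feats, boxes, q):
        # obj_feats: (B, N, obj_dim); boxes: (B, N, 4); q: (B, q_dim)
        msgs = torch.tanh(self.obj_proj(torch.cat([obj_feats, boxes], -1)))
        q_exp = q.unsqueeze(1).expand(-1, msgs.size(1), -1)
        alpha = torch.softmax(self.attn(torch.cat([msgs, q_exp], -1)), dim=1)
        return (alpha * msgs).sum(dim=1)  # attention-weighted aggregate

out = MessageAggregation()(torch.randn(2, 36, 2048), torch.randn(2, 36, 4),
                           torch.randn(2, 512))  # (2, 512)
```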

KG_VCR: A Visual Commonsense Reasoning Model Using Knowledge Graph (KG_VCR: 지식 그래프를 이용하는 영상 기반 상식 추론 모델)

  • Lee, JaeYun; Kim, Incheol
    • KIPS Transactions on Software and Data Engineering / v.9 no.3 / pp.91-100 / 2020
  • Unlike existing Visual Question Answering (VQA) problems, the new Visual Commonsense Reasoning (VCR) problems require deep commonsense reasoning to answer questions: recognizing the specific relationship between two objects in the image and presenting the rationale for the answer. In this paper, we propose a novel deep neural network model, KG_VCR, for VCR problems. In addition to making use of the visual relations and contextual information between objects extracted from the input data (images, natural language questions, and response lists), KG_VCR also utilizes commonsense knowledge embeddings extracted from an external knowledge base called ConceptNet. Specifically, the proposed model employs a Graph Convolutional Network (GCN) module to obtain a commonsense knowledge embedding from the retrieved ConceptNet knowledge graph. Through a series of experiments on the VCR benchmark dataset, we show that the proposed KG_VCR model outperforms both the state-of-the-art (SOTA) VQA model and the R2C VCR model.
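
The GCN step the abstract mentions can be sketched as follows (assuming PyTorch; the sizes, adjacency normalization, and mean pooling are illustrative assumptions, not the paper's exact design):

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim=300, out_dim=300):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: (N, in_dim) node features; adj: (N, N) normalized adjacency
        return torch.relu(self.lin(adj @ x))  # aggregate neighbors, transform

# Toy retrieved subgraph: 5 concept nodes with self-loops and random edges.
n = 5
adj = (torch.rand(n, n) > 0.5).float() + torch.eye(n)
adj = adj / adj.sum(-1, keepdim=True)           # row-normalize
nodes = torch.randn(n, 300)                     # e.g., concept word embeddings
kg_embedding = GCNLayer()(nodes, adj).mean(0)   # graph-level commonsense vector
```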

Using similarity based image caption to aid visual question answering (유사도 기반 이미지 캡션을 이용한 시각질의응답 연구)

  • Kang, Joonseo; Lim, Changwon
    • The Korean Journal of Applied Statistics / v.34 no.2 / pp.191-204 / 2021
  • Visual Question Answering (VQA) and image captioning are tasks that require understanding both the features of images and the linguistic features of text; co-attention, which can connect image and text, may therefore be the key to both. In this paper, we propose a model that achieves high VQA performance by using image captions generated with a standard transformer model pretrained on the MSCOCO dataset. Since captions unrelated to the question can actually interfere with answering, only captions similar to the question are selected, based on their similarity to it. In addition, because stopwords in a caption neither help nor hinder answering, the experiments were conducted after removing stopwords. Experiments on the VQA-v2 data compare the proposed model with the deep modular co-attention network (MCAN) model, which performs well by using co-attention between images and text. As a result, the proposed model outperformed the MCAN model.
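
The caption-filtering step lends itself to a short sketch: keep only captions whose similarity to the question exceeds a threshold, with stopwords removed. The TF-IDF representation and the threshold below are assumptions for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_captions(question, captions, threshold=0.2):
    vec = TfidfVectorizer(stop_words="english")   # stopword removal built in
    mat = vec.fit_transform([question] + captions)
    sims = cosine_similarity(mat[0], mat[1:])[0]  # question vs. each caption
    return [c for c, s in zip(captions, sims) if s >= threshold]

print(select_captions("What color is the bus?",
                      ["A red bus parked on the street.",
                       "Two dogs playing in a park."]))
```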

A Study on Performance Improvement of GVQA Model Using Transformer (트랜스포머를 이용한 GVQA 모델의 성능 개선에 관한 연구)

  • Park, Sung-Wook; Kim, Jun-Yeong; Park, Jun; Lee, Han-Sung; Jung, Se-Hoon; Sim, Cun-Bo
    • Proceedings of the Korea Information Processing Society Conference / 2021.11a / pp.749-752 / 2021
  • Reasoning is one of the hardest capabilities to realize in artificial intelligence (AI) today. Recently, AI models for the Visual Question Answering (VQA) task in multi-modal settings that combine vision and language have been published, soon followed by the GVQA (Grounded Visual Question Answering) model, which improved on VQA performance. However, GVQA still does not achieve perfect performance. In this paper, to improve the GVQA model, we replace the VCC (Visual Concept Classifier) model with ViT-G (Vision Transformer-Giant)/14 and the ACP (Answer Cluster Predictor) model with GPT-3 (Generative Pretrained Transformer 3). We expect these changes to contribute substantially to improving performance.

3D Visual Attention Model and its Application to No-reference Stereoscopic Video Quality Assessment (3차원 시각 주의 모델과 이를 이용한 무참조 스테레오스코픽 비디오 화질 측정 방법)

  • Kim, Donghyun; Sohn, Kwanghoon
    • Journal of the Institute of Electronics and Information Engineers / v.51 no.4 / pp.110-122 / 2014
  • As multimedia technologies develop, three-dimensional (3D) technologies are attracting increasing attention from researchers. In particular, video quality assessment (VQA) has become a critical issue in stereoscopic image/video processing applications. The human visual system (HVS) could play an important role in measuring stereoscopic video quality, yet existing VQA methods have done little to model the HVS for stereoscopic video. We address this by proposing a 3D visual attention (3DVA) model that simulates the HVS for stereoscopic video by combining multiple perceptual stimuli such as depth, motion, color, intensity, and orientation contrast. We use the 3DVA model to pool over the most significant regions of very poor video quality, and we propose a no-reference (NR) stereoscopic VQA (SVQA) method. We validated the proposed SVQA method using subjective test scores from our own experiments and those reported by others. Our approach yields high correlation with the measured mean opinion score (MOS) as well as consistent performance under asymmetric coding conditions. Additionally, the 3DVA model is used to extract region-of-interest (ROI) information; subjective evaluations indicate that the 3DVA-based ROI extraction outperforms the compared extraction methods that use spatial and/or temporal terms.
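
In the spirit of the 3DVA model (not its actual formulation), saliency-weighted pooling can be sketched as combining per-pixel stimulus maps into one attention map and then pooling a local quality map over the most attended regions; the equal weights and the top-10% pooling rule are assumptions:

```python
import numpy as np

def attention_map(stimuli, weights=None):
    # stimuli: dict of (H, W) maps, each normalized to [0, 1]
    weights = weights or {k: 1.0 / len(stimuli) for k in stimuli}
    return sum(weights[k] * stimuli[k] for k in stimuli)

def pooled_score(quality_map, attn, top_frac=0.1):
    # Pool quality over the most attended fraction of pixels.
    k = max(1, int(top_frac * attn.size))
    idx = np.argsort(attn.ravel())[-k:]
    return quality_map.ravel()[idx].mean()

maps = {k: np.random.rand(64, 64)
        for k in ["depth", "motion", "color", "intensity", "orientation"]}
score = pooled_score(np.random.rand(64, 64), attention_map(maps))
```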

Visual Commonsense Reasoning with Knowledge Graph (지식 그래프를 이용한 영상 기반 상식 추론)

  • Lee, Jae-Yun; Kim, In-Cheol
    • Proceedings of the Korea Information Processing Society Conference / 2019.10a / pp.994-997 / 2019
  • Unlike existing visual question answering (VQA) problems, the visual commonsense reasoning (VCR) problem is a new intelligence task that additionally requires commonsense reasoning, such as recognizing relationships between objects in the image and presenting the rationale for an answer. In this paper, we propose KG_VCR, a new deep neural network model that, in addition to modules extracting the relationships between objects and contextual information from the input data (image, natural language question, and response list), includes modules that retrieve related commonsense knowledge directly from an external knowledge base such as ConceptNet and exploit it through GCN-based knowledge graph embedding. We present the detailed design of KG_VCR and demonstrate its performance through various experiments on the VCR benchmark dataset.

Deep Neural Network-Based Scene Graph Generation for 3D Simulated Indoor Environments (3차원 가상 실내 환경을 위한 심층 신경망 기반의 장면 그래프 생성)

  • Shin, Donghyeop; Kim, Incheol
    • KIPS Transactions on Software and Data Engineering / v.8 no.5 / pp.205-212 / 2019
  • A scene graph is a kind of knowledge graph that represents the objects found in an image and the relationships between them. This paper proposes a 3D scene graph generation model for three-dimensional indoor environments. A 3D scene graph includes not only object types, positions, and attributes, but also the three-dimensional spatial relationships between objects. It can be viewed as a prior knowledge base describing the environment in which an agent will later be deployed; 3D scene graphs can therefore be used in many applications, such as visual question answering (VQA) and service robots. The proposed model consists of four sub-networks: an object detection network (ObjNet), an attribute prediction network (AttNet), a transfer network (TransNet), and a relationship prediction network (RelNet). Through several experiments in the 3D simulated indoor environments provided by AI2-THOR, we confirmed that the proposed model achieves high performance.
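
The four-stage pipeline named in the abstract can be sketched schematically as below; the sub-networks are placeholder callables and the data types are simplified assumptions, not the paper's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    label: str
    position: tuple                 # 3D coordinates
    attributes: list = field(default_factory=list)

def generate_scene_graph(frame, obj_net, att_net, trans_net, rel_net):
    objects = obj_net(frame)                  # ObjNet: detect objects
    for o in objects:
        o.attributes = att_net(frame, o)      # AttNet: predict attributes
    feats = trans_net(frame, objects)         # TransNet: transfer features
    relations = rel_net(feats, objects)       # RelNet: spatial relations
    return {"nodes": objects, "edges": relations}

# Toy stand-ins showing the data flow only.
graph = generate_scene_graph(
    frame=None,
    obj_net=lambda f: [SceneObject("chair", (1.0, 0.0, 2.0)),
                       SceneObject("table", (1.5, 0.0, 2.0))],
    att_net=lambda f, o: ["wooden"],
    trans_net=lambda f, objs: objs,
    rel_net=lambda feats, objs: [(objs[0].label, "next_to", objs[1].label)],
)
```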