• Title/Summary/Keyword: Parallel corpus

A Hybrid Sentence Alignment Method for Building a Korean-English Parallel Corpus (한영 병렬 코퍼스 구축을 위한 하이브리드 기반 문장 자동 정렬 방법)

  • Park, Jung-Yeul;Cha, Jeong-Won
    • MALSORI / v.68 / pp.95-114 / 2008
  • The recent growing popularity of statistical methods in machine translation requires much larger parallel corpora. A Korean-English parallel corpus, however, is not yet sufficiently available, and little research on this subject is being conducted. In this paper we present a hybrid method of aligning sentences for Korean-English parallel corpora. We use bilingual newswire web pages, reading comprehension materials for English learners, computer-related technical documents, and help files of localized software to build a Korean-English parallel corpus. Our hybrid method combines sentence-length-based and word-correspondence-based methods. We show the results of experimentation and evaluate them. Alignment results from using a full translation model are very encouraging, especially when we apply the alignment results to an SMT system: improvements of 0.66% in BLEU score and 9.94% in NIST score compared to the previous method.
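The idea of combining a sentence-length signal with a word-correspondence signal can be illustrated with a small sketch. This is not the authors' implementation: the linear interpolation, the `weight` and `ratio` parameters, the toy bilingual lexicon, and the greedy one-to-one search are all assumptions chosen for brevity (the abstract mentions a full translation model, which is not reproduced here).

```python
def length_score(src, tgt, ratio=1.0):
    """Length-based similarity in [0, 1]; `ratio` is an assumed expected
    target/source character-length ratio."""
    expected = len(src) * ratio
    actual = len(tgt)
    if expected == 0 or actual == 0:
        return 0.0
    return min(expected, actual) / max(expected, actual)


def lexical_score(src, tgt, lexicon):
    """Fraction of source tokens that have at least one lexicon translation
    present in the target sentence. `lexicon` maps a source token to a set
    of lowercase target tokens (a hypothetical toy dictionary)."""
    src_tokens = src.split()
    if not src_tokens:
        return 0.0
    tgt_tokens = set(tgt.lower().split())
    hits = sum(1 for tok in src_tokens if lexicon.get(tok, set()) & tgt_tokens)
    return hits / len(src_tokens)


def hybrid_score(src, tgt, lexicon, weight=0.5, ratio=1.0):
    """Linear interpolation of the two signals (an assumed combination rule)."""
    return (weight * length_score(src, tgt, ratio)
            + (1.0 - weight) * lexical_score(src, tgt, lexicon))


def align(src_sents, tgt_sents, lexicon, weight=0.5):
    """Greedy alignment: for each source sentence, return the index of the
    best-scoring target sentence. Real aligners usually use dynamic
    programming and allow 1-to-many links; this sketch does not."""
    return [
        max(range(len(tgt_sents)),
            key=lambda j: hybrid_score(s, tgt_sents[j], lexicon, weight))
        for s in src_sents
    ]
```

With a romanized toy lexicon such as `{"naneun": {"i"}, "hakgyoe": {"school"}, "ganda": {"go"}}`, `align` would be expected to prefer "I go to school" over an unrelated sentence of similar length, because the lexical signal outweighs the length signal for that pair.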

A Study on the Performance Improvement of Machine Translation Using Public Korean-English Parallel Corpus (공공 한영 병렬 말뭉치를 이용한 기계번역 성능 향상 연구)

  • Park, Chanjun;Lim, Heuiseok
    • Journal of Digital Convergence / v.18 no.6 / pp.271-277 / 2020
  • Machine translation refers to software that translates a source language into a target language; after rule-based and statistical machine translation, Neural Machine Translation is now being actively researched. One of the important factors in Neural Machine Translation is obtaining a high-quality parallel corpus, but high-quality parallel corpora for Korean language pairs have not been easy to find. Recently, the AI Hub of the National Information Society Agency (NIA) unveiled a high-quality Korean-English parallel corpus of 1.6 million sentences. This paper attempts to verify the quality of each dataset by comparing performance between the data published by AI Hub and OpenSubtitles, the most popular Korean-English parallel corpus. For objectivity, the test data is the test set published by IWSLT, an official test set for Korean-English machine translation. Experimental results show better performance than existing papers evaluated on the same test set, which shows the importance of high-quality data.

A Comparative Study on Korean Connective Morpheme '-myenseo' to the Chinese expression - based on Korean-Chinese parallel corpus (한국어 연결어미 '-면서'와 중국어 대응표현의 대조연구 -한·중 병렬 말뭉치를 기반으로)

  • YI, CHAO
    • Cross-Cultural Studies / v.37 / pp.309-334 / 2014
  • This study uses a Korean-Chinese parallel corpus to contrast the Korean connective morpheme '-myenseo' with its corresponding Chinese expressions. Learners of Korean often struggle with connective morphemes, especially when there is a lexical gap with their mother tongue. '-myenseo' is one of the most frequently used Korean connective morphemes and is usually matched with Chinese coordinating conjunctions. According to the corpus, however, the Chinese expressions corresponding to '-myenseo' go beyond coordinating conjunctions. Because the parallel corpus reveals a variety of Chinese expressions related to '-myenseo', this study can help Chinese learners of Korean learn '-myenseo' more easily. The study first discusses the semantic features and syntactic characteristics of '-myenseo'. Its significant semantic features are 'simultaneous' and 'conflict', and usage examples are used to analyse its specific uses. The study then analyses the syntactic characteristics of '-myenseo' in terms of subject constraints, predicate constraints, temporal constraints, mood constraints, and negation constraints, and summarizes them in a table. The most important part of the study is Chapter 4, which contrasts '-myenseo' with Chinese expressions by analysing the Korean-Chinese parallel corpus. As a result of the analysis, the frequencies of the Chinese expressions corresponding to '-myenseo' are summarized in a table. The table shows that the most common Chinese counterpart of '-myenseo' is the non-marker pattern. That is, Korean can connect clauses with a connective morpheme, which is an explicit linguistic marker, whereas Chinese often connects them through their intrinsic logical relationships. The chapter concludes that '-myenseo' can correspond to Chinese conjunctions, expressions, non-marker patterns, and liberal translation patterns, a wider range than the Chinese conjunctions identified previously. The last chapter concludes the study by summarizing it and suggesting its limitations and directions for future research.

    Judging Translated Web Document & Constructing Bilingual Corpus (웹 번역문서 판별과 병렬 말뭉치 구축)

    • Jee-hyung, Kim;Yill-byung, Lee
      • Proceedings of the Korean Information Science Society Conference / 2004.10a / pp.787-789 / 2004
    • People frequently feel the need for a general search tool free from language barriers when they look for information on the internet. Therefore, a multilingual parallel corpus is needed so that a search can use a word that contains the search keyword and has a corresponding word in another language. A multilingual parallel corpus can be built and reused effectively through several processes: judgment of web documents, sentence alignment, and word alignment. To build a multilingual parallel corpus, a multilingual dictionary should be constructed for each language and the HTML should be simplified. By understanding the meaning and the statistics of document structure, translated web documents are judged and the retrieved web pages are aligned at the sentence level.

    Automatic Extraction of Paraphrases from a Parallel Bible Corpus (정렬된 성경 코퍼스로부터 바꿔쓰기표현(paraphrase)의 자동 추출)

    • Lee, Kong-Joo;Yun, Bo-Hyun
      • Korean Journal of Cognitive Science / v.17 no.4 / pp.323-336 / 2006
    • In this paper, we present a pilot system that can extract paraphrases from a parallel corpus using a co-training method. Paraphrases are useful for applications that should create varied and fluent text, such as machine translation, question-answering systems, and multidocument summarization systems. One of the difficulties in extracting paraphrases is finding a rich source from which we can extract them. The Bible is a good source for extracting paraphrases, as it has several Korean versions in which every sentence can be easily aligned by chapter and verse. Using the co-training method, we can extract not only lexical-level paraphrases but also phrasal-level paraphrases from the parallel corpus, which consists of these Bible versions.

    The Study on the Principles of Selecting Korean Particle 'Ka' and 'Nun' Using Korean-English Parallel Corpus (한영 병렬 말뭉치를 이용한 한국어 조사 '가'와 '는'의 선택 원리 연구)

    • Yoo, Hyun-Kyung;An, Ye-Ri;Yang, Su-Hyang
      • Language and Information / v.11 no.1 / pp.1-23 / 2007
    • This study aims to investigate the meaning of the Korean particles 'ka' and 'nun' inductively by examining the correspondences between those particles and English articles in a Korean-English parallel corpus. The correspondences were checked in three ways: semantically, syntactically and pragmatically. This study found that when the semantic or syntactic tier is not salient, the pragmatic tier is activated and particles are selected according to pragmatic elements such as the amount of information or the change of topic. However, if the meaning of the particles is salient or if there is any syntactic motive, particles are selected in accordance with the semantic or syntactic elements. Former studies that focused on only one of those three tiers cannot properly explain such correspondences in the Korean-English parallel corpus. This study shows that the semantic, syntactic and pragmatic tiers hierarchically affect the selection of a particle and that the selection process is also related to the speaker's intention. This dimensional analysis of particles is expected to contribute to theoretical studies as well as applied studies such as Korean language education.

    Filter-mBART Based Neural Machine Translation Using Parallel Corpus Filtering (병렬 말뭉치 필터링을 적용한 Filter-mBART기반 기계번역 연구)

    • Moon, Hyeonseok;Park, Chanjun;Eo, Sugyeong;Park, JeongBae;Lim, Heuiseok
      • Journal of the Korea Convergence Society / v.12 no.5 / pp.1-7 / 2021
    • In the latest trend of machine translation research, a model is pretrained on a large monolingual corpus and then fine-tuned with a parallel corpus. Although many studies tend to increase the amount of data used in the pretraining stage, it is hard to say that the amount of data must be increased to improve machine translation performance. In this study, through an experiment based on the mBART model using parallel corpus filtering, we propose that high-quality data can yield better machine translation performance even with a smaller amount of data. We propose that it is important to consider the quality of data rather than the amount of data, and that this finding can be used as a guideline for building a training corpus.
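To make "parallel corpus filtering" concrete, the sketch below applies a few widely used heuristic filters to sentence pairs. The abstract does not state which filters the paper uses, so every rule and threshold here (`max_ratio`, `min_tokens`, `max_tokens`, dropping identical and duplicate pairs) is an assumption for illustration only.

```python
def filter_parallel_corpus(pairs, max_ratio=2.0, min_tokens=1, max_tokens=100):
    """Keep (source, target) pairs that pass simple quality heuristics.

    All thresholds are illustrative assumptions, not values from the paper.
    """
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        n_src, n_tgt = len(src.split()), len(tgt.split())
        # Drop empty, too-short, or too-long sentences on either side.
        if not (min_tokens <= n_src <= max_tokens
                and min_tokens <= n_tgt <= max_tokens):
            continue
        # Drop pairs whose token counts differ too much.
        if max(n_src, n_tgt) / max(min(n_src, n_tgt), 1) > max_ratio:
            continue
        # Drop pairs where the "translation" is a copy of the source.
        if src == tgt:
            continue
        # Drop exact duplicate pairs.
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept
```

The filtered pairs would then serve as the fine-tuning data; whether a smaller filtered set actually improves translation quality is the empirical question the paper addresses, and this sketch makes no claim about it.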

    A Corpus Analysis of Temporal Adverbs and Verb Tenses Cooccurrence in Spanish, English, and Chinese

    • Cheng, An Chung;Lu, Hui-Chuan
      • Asia Pacific Journal of Corpus Research / v.3 no.2 / pp.1-16 / 2022
    • This study investigates the cooccurrence between temporal adverbs and grammatical tenses in Spanish and contrasts temporal specifications across Spanish, English, and Chinese. Based on a monolingual Spanish corpus and a trilingual parallel corpus, the study identified the ten most frequent single-word temporal adverbs collocating with grammatical tenses in Spanish. It also contrasted the cooccurrence of temporal adverbs and verb tenses in the three languages. The results show that aun 'still', hoy 'today', and ahora 'now' collocate with the present tense at more than 80%. Ayer 'yesterday' and finalmente 'finally' cooccur with the simple past tense at 84% and 69%, respectively. Mientras 'meanwhile' collocates with the past imperfect at 55%, the highest of all. Mañana 'tomorrow' cooccurs with the future and present tenses at 34%. Other adverbs, ya 'already', siempre 'always', and nuevamente 'again', do not present a strong cooccurrence tendency with any tense overall. The contrastive analysis of the trilingual parallel corpus gives a comprehensive view of temporal specifications in the three languages. However, no clear one-to-one mapping pattern of the cooccurrence across the three languages can be concluded, which provides helpful insights for second language instruction based on natural language data rather than intuition. Future research with larger corpora is needed.
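Percentages like those reported above can be read as the share of an adverb's occurrences that fall under each tense. A minimal sketch of that computation follows; the `(adverb, tense)` observation format and the function name are assumptions, and no data from the study is reproduced.

```python
from collections import Counter, defaultdict


def cooccurrence_rates(observations):
    """Return, for each adverb, the percentage of its occurrences per tense.

    `observations` is an iterable of (adverb, tense) pairs, one per corpus
    hit (a hypothetical input format).
    """
    counts = defaultdict(Counter)
    for adverb, tense in observations:
        counts[adverb][tense] += 1
    rates = {}
    for adverb, tense_counts in counts.items():
        total = sum(tense_counts.values())
        rates[adverb] = {
            tense: 100.0 * n / total for tense, n in tense_counts.items()
        }
    return rates
```

For example, an adverb observed four times with the present tense and once with the simple past would get 80% for the present tense under this definition.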

    Mining Parallel Text from the Web based on Sentence Alignment

    • Li, Bo;Liu, Juan;Zhu, Huili
      • Proceedings of the Korean Society for Language and Information Conference / 2007.11a / pp.285-292 / 2007
    • The parallel corpus is an important resource in data-driven natural language processing research, but only a few parallel corpora are publicly available today, mostly because of the heavy labor needed to construct this kind of resource. This paper proposes a novel strategy for automatically fetching parallel text from the web, which may help to address the lack of high-quality parallel corpora. The system we develop first downloads the web pages from certain hosts. Candidate parallel page pairs are then prepared from the page set based on the outer features of the web pages. In the last step, the candidate page pairs are evaluated: the sentences in each candidate pair are first extracted and aligned, and the similarity of the two web pages is then evaluated based on the similarities of the aligned sentences. Experiments on a multilingual web site show the satisfactory performance of the system.
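The final evaluation step, scoring a page pair from its aligned sentences, might look like the sketch below. The abstract does not give the formula, so the normalisation by the longer page (which makes unaligned sentences count as zero) and the `threshold` value are assumptions.

```python
def page_similarity(aligned_scores, n_src_sents, n_tgt_sents):
    """Aggregate sentence-pair similarities into a page-pair similarity.

    `aligned_scores` holds one similarity in [0, 1] per aligned sentence
    pair. Dividing by the longer page's sentence count penalises pages
    with many unaligned sentences (an assumed design choice).
    """
    longest = max(n_src_sents, n_tgt_sents)
    if longest == 0:
        return 0.0
    return sum(aligned_scores) / longest


def is_parallel(aligned_scores, n_src_sents, n_tgt_sents, threshold=0.6):
    """Accept a candidate page pair if its similarity reaches `threshold`
    (a hypothetical cut-off, to be tuned on held-out data)."""
    return page_similarity(aligned_scores, n_src_sents, n_tgt_sents) >= threshold
```

Normalising by the longer page rather than by the number of aligned pairs keeps a page pair with one well-matched sentence and many unmatched ones from being accepted.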

    Translating English By-Phrase Passives into Korean: A Parallel Corpus Analysis (영한 병렬 코퍼스에 나타난 영어 수동문의 한국어 번역)

    • Lee, Seung-Ah
      • Journal of English Language & Literature / v.56 no.5 / pp.871-905 / 2010
    • This paper is motivated by Watanabe's (2001) observation that English by-phrase passives are sometimes translated into Japanese object topicalization constructions. That is, the original English sentence in the passive may be translated into the active voice with the logical object topicalized. A number of scholars, including Chomsky (1981) and Baker (1992), have remarked that languages have various ways to avoid focusing on the logical subject. The aim of the present study is to examine the translation equivalents of English by-phrase passives in an English-Korean parallel corpus compiled by the author. A small sample of articles from Newsweek magazine and its published Korean translation reveals that there are indeed many ways to translate English by-phrase passives, including object topicalization (12.5%). Among the 64 translated sentences analyzed and classified, 12 (18.8%) were problematic in terms of agent defocusing, which is the primary function of passives. Of these 12 instances, five were cases where an alternative translation would be more suitable. The results suggest that the functional characteristics of English by-phrase passives should be highlighted in translator training as well as in language teaching.