• Title/Summary/Keyword: Corpus Linguistics

An Algorithm for Predicting the Relationship between Lemmas and Corpus Size

  • Yang, Dan-Hee;Gomez, Pascual Cantos;Song, Man-Suk
    • ETRI Journal / v.22 no.2 / pp.20-31 / 2000
  • Much research on natural language processing (NLP), computational linguistics and lexicography has relied on linguistic corpora. In recent years, many organizations around the world have been constructing their own large corpora to achieve corpus representativeness and/or linguistic comprehensiveness. However, there is no reliable guideline as to how large machine-readable corpus resources should be in order to develop practical NLP software and/or complete dictionaries for human and computational use. To shed some new light on this issue, we reveal the flaws of several previous studies aiming to predict corpus size, especially those using pure regression or curve-fitting methods. To overcome these flaws, we devise a new mathematical tool, a piecewise curve-fitting algorithm, and then suggest how to determine the tolerance error of the algorithm for good prediction, using a specific corpus. Finally, we show experimentally that the algorithm presented is valid, accurate and very reliable. We are confident that this study can contribute to solving some inherent problems of corpus linguistics, such as corpus predictability, compiling methodology, corpus representativeness and linguistic comprehensiveness.

Characteristics of Intermediate/Advanced Korean Inter-Englishes: A Corpus-Linguistic Analysis. (우리나라 중.상급학습자 영어의 특징 : 말뭉치 언어학적 분석)

  • 안성호;이영미
    • Korean Journal of English Language and Linguistics / v.4 no.1 / pp.83-102 / 2004
  • The purpose of this paper is to identify some major characteristics of intermediate-advanced Korean learners' English by corpus-linguistically analyzing their essays in comparison with those of native speakers. We construct a corpus of CBT TOEFL essays by Korean learners, NNS1 (94,076 words in 402 texts), and its sub-corpus, NNS2 (14,291 words in 45 texts), and then a corpus of model essays written or meticulously edited by native speakers, NS (14,833 words in 35 texts). We compare NNS1 and NNS2 with NS, and with some other corpora, in terms of high-frequency words, and show that Korean learners' writings have more features of informal writing than of formal writing, which accords with the reports in Granger (1998) that EFL writings by European advanced learners are characterized by informality.

Development and Evaluation of a Korean Treebank and its Application to NLP

  • Han, Chung-Hye;Han, Na-Rae;Ko, Eon-Suk;Martha Palmer
    • Language and Information / v.6 no.1 / pp.123-138 / 2002
  • This paper discusses issues in building a 54-thousand-word Korean Treebank using a phrase structure annotation, along with developing annotation guidelines based on the morpho-syntactic phenomena represented in the corpus. Various methods that were employed for quality control are presented. An evaluation of the quality of the Treebank and some of the NLP applications under development using the Treebank are also presented.

Using Corpora for Studying English Grammar

  • Kwon, Heok-Seung
    • Korean Journal of English Language and Linguistics / v.4 no.1 / pp.61-81 / 2004
  • This paper looks at several grammatical phenomena that illustrate the kinds of questions that can be addressed with a corpus-based approach. We use this approach to investigate the following subjects in English grammar: number ambiguity, subject-verb concord, concord with measure expressions, and (reflexive) pronoun choice in coordinated noun phrases. We emphasize the distinctive features of the corpus-based approach, particularly its strengths in investigating language use, as opposed to traditional descriptions or prescriptions of structure in English grammar. The paper shows that a corpus-based approach has made it possible to conduct new kinds of investigations into grammar in use and to expand the scope of earlier investigations. Native speakers rarely have accurate information about frequency of use. A large representative corpus (i.e., the British National Corpus) is one of the most reliable sources of frequency information. It is important to base an analysis of language on real data rather than intuition. Any description of grammar is more complete and accurate if it is based on a body of real data.

A Transformation-Based Learning Method on Generating Korean Standard Pronunciation

  • Kim, Dong-Sung;Roh, Chang-Hwa
    • Proceedings of the Korean Society for Language and Information Conference / 2007.11a / pp.241-248 / 2007
  • In this paper, we propose a Transformation-Based Learning (TBL) method for generating Korean standard pronunciation. Previous studies on phonological processing have focused on phonological rule application and finite state automata (Johnson 1984; Kaplan and Kay 1994; Koskenniemi 1983; Bird 1995). In the case of Korean computational phonology, earlier studies have taken a phonological rule-based approach to pronunciation generation (Lee et al. 2005; Lee 1998). This study suggests a corpus-based, data-oriented rule learning method for generating Korean standard pronunciation. In order to replace rule-based generation with a corpus-based one, an aligned corpus pairing each input with its pronunciation counterpart has been devised. We conducted an experiment on generating the standard pronunciation with the TBL algorithm, based on this aligned corpus.

Korean Nominal Bank, Using Language Resources of Sejong Project (세종계획 언어자원 기반 한국어 명사은행)

  • Kim, Dong-Sung
    • Language and Information / v.17 no.2 / pp.67-91 / 2013
  • This paper describes the Korean Nominal Bank, a project that provides argument structure for instances of the predicative nouns in the Sejong parsed corpus. We use the language resources of the Sejong project, so that the same set of data is annotated with more and more levels of annotation, since a new type of language resource building project could bring new information from separate and isolated processing. We have based the annotation scheme on the Sejong electronic dictionary, the semantically tagged corpus, and the syntactically analyzed corpus. Our work also involves deep linguistic knowledge of the syntax-semantics interface in general. We consider semantic theories including the Frame Semantics of Fillmore (1976), the argument structure of Grimshaw (1990), and the argument alternation of Levin (1993) and Levin and Rappaport Hovav (2005). Various syntactic theories are needed to explain various sentence types, including empty categories, raising, and left (or right) dislocation. We also need an explanation of idiosyncratic lexical features, such as collocation.

A Novel Theory of Support in Social Media Discourse

  • Solomon, Bazil Stanley
    • Asia Pacific Journal of Corpus Research / v.1 no.1 / pp.95-125 / 2020
  • This paper aims to inform people how to support each other on social media. It alludes to an architecture for social media discourse and proposes a novel theory of support in social media discourse. It also makes a methodological contribution by combining predominantly artificial intelligence with corpus linguistics analysis, applied to a large-scale dataset of anonymised diabetes-related users' posts from the Facebook platform. Log-likelihood and precision measures help with validation. A multi-method approach with Discourse Analysis helps in understanding any potential patterns. People living with Diabetes are found to employ sophisticated high-frequency patterns of device-enabled categories of purpose and content. These include, for example, linguistic forms of Advice with stance-taking and targets such as Diabetes, among other interactional ways. There can be uncertainty and variation of effect displayed when sharing information for support. The implications of the new theory are aimed at healthcare communicators and corpus linguists, and at preliminary work on AI support-bots. These bots may be programmed to utilise the language patterns to automatically support people who need them.

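A Review of Research Trends and Research Methods in Asia Pacific Journal of Corpus Research Vol. 1, No. 1 (『Asia Pacific Journal of Corpus Research』 1권 1호의 연구 동향과 연구 방법에 관한 고찰)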

  • Jung, Chae Kwan
    • Asia Pacific Journal of Corpus Research / v.1 no.1 / pp.127-132 / 2020
  • The purpose of this review is to provide local readers, more specifically Korean student readers who are not very familiar with the English language, with a general overview of the research articles published in Asia Pacific Journal of Corpus Research vol. 1, no. 1. A brief summary of each research article, focusing on research methods, is presented, followed by an overall review and some insights on research issues.

Clustering Keywords to Define Cybersecurity: An Analysis of Malaysian and ASEAN Countries' Cyber Laws

  • Joharry, Siti Aeisha;Turiman, Syamimi;Nor, Nor Fariza Mohd
    • Asia Pacific Journal of Corpus Research / v.3 no.2 / pp.17-33 / 2022
  • While the term is nothing new, 'cybersecurity' still seems to be defined quite loosely and subjectively depending on context. This is problematic, especially for legal writers prosecuting cybercrimes that do not fit a particular clause or act. What makes this more difficult is that Malaysia has no single 'cybersecurity law'; instead, 10 related cyber security acts are currently implemented. In this paper, the 10 acts are compiled into a corpus to analyse the language used in these acts via a corpus linguistics approach. A list of frequent words is first investigated to see whether the so-called related laws do talk about cybersecurity, followed by close inspection of the concordance lines and habitually associated phrases (clusters) to explore the use of these words in context. The 'compare 2 wordlist' feature is used to identify similarities and differences between the 10 Malaysian cybersecurity-related laws and a corpus of cyber laws from other ASEAN countries. Findings revealed that ASEAN cyber laws refer mostly to three dominant cybersecurity themes identified in the literature (technological solutions; events; and strategies, processes, and methods), whereas Malaysian cybersecurity-related laws revolved around themes such as human engagement and referent objects (of security). Although these so-called cyber-related policies and laws in Malaysia are highlighted by the National Cyber Security Agency (NACSA), their practical application in combating cybercrimes remains uncertain.

Critical Discourse Analysis of '5.18' in 'Honam' and 'Yeongnam' Local Newspapers by Using Corpus (코퍼스를 이용한 '호남'과 '영남' 지역신문에서의 '5.18'에 대한 비판적 담화분석)

  • Lee, Sukeui;Jin, Duhyeon
    • Korean Linguistics / v.76 / pp.83-112 / 2017
  • In this paper, newspaper articles were collected from '5.18' keyword search results, and a news corpus was constructed from the collected data. The ideological differences regarding '5.18' in articles from local newspapers in 'Honam' and 'Yeongnam' were investigated, and these differences in local newspaper discourse were analyzed through objective figures. The subjects of the newspaper articles and the frequency of nouns and predicates were analyzed, and the use and meaning of the intended vocabulary were examined. Analysis of the article titles showed that the discourse written in 'Honam' emphasized the necessity of re-recognition of 5.18. In both regions, the word 'Gwangju' is often used. However, 'Gwangju' in 'Honam' newspapers means a spiritual space, not a physical space. In Honam regional newspapers, there are many words describing the events, such as 'shoot' and 'fire', which call for recollection and memory of '5.18'. In the analysis of newspaper discourse, contrastive analysis between local newspapers has been very scarce; this study was therefore conducted to analyze the discourse among local newspapers.