• 제목/요약/키워드: Corpus Frequency

검색결과 165건 처리시간 0.023초

한국어 코퍼스에서 나타나는 빈도효과 비교 (A Comparison of Frequency Effects among Korean Corpus)

  • 정재범;임희석;남기춘
    • 대한음성학회:학술대회논문집
    • /
    • 대한음성학회 2002년도 11월 학술대회지
    • /
    • pp.93-96
    • /
    • 2002
  • This research studied the correlation of word frequency effect in Korean corpus. Experiment 1 showed that word frequency of each other corpus was significant correlated. Experiment 2 showed significant correlation between word frequency of each corpus and lexical decision time of participants. These results support that 4 corpus in this research should have stability to word frequency effect of participants

  • PDF

A Comparison of Korean EFL Learners' Oral and Written Productions

  • Lee, Eun-Ha
    • 영어어문교육
    • /
    • 제12권2호
    • /
    • pp.61-85
    • /
    • 2006
  • The purpose of the present study is to compare Korean EFL learners' speech corpus (i.e. oral productions) with their composition corpus (i.e. written productions). Four college students participated in the study. The composition corpus was collected through a writing assignment, and the speech corpus was gathered by audio-taping their oral presentations. The results of the data analysis indicate that (i) As for error frequency, young adult low-intermediate Korean EFL learners showed high frequency in determiners (mostly, indefinite articles), vocabulary (mostly, semantic errors), and prepositions. The frequency order did not show much difference between the speech corpus and the composition corpus; and (ii) When comparing the oral productions with the written productions, there were not many differences between them in terms of the contents, a style (i.e., colloquial vs. literary), vocabulary selection, and error types and frequency. Therefore, it is assumed that the proficiency in oral presentation of EFL learners at this learning stage heavily depends on how much/how well they are able to write. In other words, EFL learners' writing and speaking skills are closely co-related. It implies that the teacher does not need to separate teaching how to speak from teaching how to write. The teacher may use the same methods or strategies to help the learners improve their English speaking and writing skills. Furthermore, it will be more effective to teach writing before speaking since they have more opportunities to write than speak in the EFL contexts.

  • PDF

A Study on the Diachronic Evolution of Ancient Chinese Vocabulary Based on a Large-Scale Rough Annotated Corpus

  • Yuan, Yiguo;Li, Bin
    • 아시아태평양코퍼스연구
    • /
    • 제2권2호
    • /
    • pp.31-41
    • /
    • 2021
  • This paper makes a quantitative analysis of the diachronic evolution of ancient Chinese vocabulary by constructing and counting a large-scale rough annotated corpus. The texts from Si Ku Quan Shu (a collection of Chinese ancient books) are automatically segmented to obtain ancient Chinese vocabulary with time information, which is used to the statistics on word frequency, standardized type/token ratio and proportion of monosyllabic words and dissyllabic words. Through data analysis, this study has the following four findings. Firstly, the high-frequency words in ancient Chinese are stable to a certain extent. Secondly, there is no obvious dissyllabic trend in ancient Chinese vocabulary. Moreover, the Northern and Southern Dynasties (420-589 AD) and Yuan Dynasty (1271-1368 AD) are probably the two periods with the most abundant vocabulary in ancient Chinese. Finally, the unique words with high frequency in each dynasty are mainly official titles with real power. These findings break away from qualitative methods used in traditional researches on Chinese language history and instead uses quantitative methods to draw macroscopic conclusions from large-scale corpus.

벅아이 코퍼스에서의 젊은 성인 남성의 모음 포먼트 분석 (An Analysis of the Vowel Formants of the Young Males in the Buckeye Corpus)

  • 윤규철;노혜욱
    • 말소리와 음성과학
    • /
    • 제4권2호
    • /
    • pp.41-49
    • /
    • 2012
  • The purpose of this paper is to extract the vowel formants of the ten young male speakers from the Buckeye Corpus of Conversational Speech [1] and to analyze them in comparison to earlier works in terms of various phonetic factors that are expected to affect the realization of the formant distribution. The first two formant frequency values were automatically extracted with a Praat script along with such factors as the place of articulation, the content versus function word information, syllabic stress information, the location in a word, location in utterance, speech rate of three consecutive words, and the word frequency in the corpus. The results indicated that the formant patterns from the corpus were very different from those of earlier works although the overall pattern was similar and that the factors were strongly responsible for the realization of the two formants. The purpose of this paper is to extract the vowel formants of the ten young male speakers from the Buckeye Corpus of Conversational Speech [1] and to analyze them in comparison to earlier works in terms of various phonetic factors that are expected to affect the realization of the formant distribution. The first two formant frequency values were automatically extracted with a Praat script along with such factors as the place of articulation, the content versus function word information, the syllabic stress information, the location in a word, the location in an utterance, the speech rate of the three consecutive words, and the word frequency in the corpus. The result indicated that the formant patterns from the corpus were very different from those of earlier works although the overall pattern was similar and that the factors were strongly responsible for the realization of the two formants.

Reduction and Frequency Analyses of Vowels and Consonants in the Buckeye Speech Corpus

  • Yang, Byung-Gon
    • 말소리와 음성과학
    • /
    • 제4권3호
    • /
    • pp.75-83
    • /
    • 2012
  • The aims of this study were three. First, to examine the degree of deviation from dictionary prescribed symbols and actual speech made by American English speakers. Second, to measure the frequency of vowel and consonant production of American English speakers. And third, to investigate gender differences in the segmental sounds in a speech corpus. The Buckeye Speech Corpus was recorded by forty American male and female subjects for one hour per subject. The vowels and consonants in both the phonemic and phonetic transcriptions were extracted from the original files of the corpus and their frequencies were obtained using codes of a free software R. Results were as follows: Firstly, the American English speakers produced a reduced number of vowels and consonants in daily conversation. The reduction rate from the dictionary transcriptions to the actual transcriptions was around 38.2%. Secondly, the American English speakers used more front high and back low vowels while three-fourths of the consonants accounted for stops, fricatives, and nasals. This indicates that the segmental inventory has nonlinear frequency distribution in the speech corpus. Thirdly, the two gender groups produced vowels and consonants similarly even though there were a few noticeable differences in their speech. From these results we propose that English teachers consider pronunciation education reflecting the actual speech sounds and that linguists find a way to establish unmarked segmentals from speech corpora.

주택디자인에서 건축가들의 어휘 사용행태 및 기본어휘에 관한 연구 (A Study on the Lexicon-Use Behaviour of Architects & the Basic Lexicons in House Design)

  • 윤대한
    • 한국주거학회논문집
    • /
    • 제17권5호
    • /
    • pp.27-37
    • /
    • 2006
  • This paper analyzed statistically two corpora that were constructed from the texts about house designs written by Korean architects and PA Awards architects. The main results are as follows; (1) The numbers of words in Korean house-design corpus were 9,352 and those of words in PA Awards house design corpus were 2,379. The former were 18.7% and the latter 4.8% of about 50,000 words regarded as the rest using scale in actual life. (2) When the architects described their house designs, the lexicon-concentration phenomenon was pervasive in both groups. Therefore, we can infer that the high-frequency lexicons are very important in house design. (3) The architects' behaviour patterns of using the house-design lexicons, went by rules according to the word frequency order. The tendency formulas of them had the $R^{2}$ values which were more than 90%. (4) In Korean house design corpus, the high frequency lexicons were '공간', '층', '주택', '집', '대지', '거실', and '실'. In PA awards house design corpus, they were 'house','room','space','living','wall','level' and 'area'. From these results, We can tell that 'space' is the highest frequency word in house design of the two groups, and that '대지 ' and 'wall' are the words that reveal well the differences between the two groups.

Using Corpora for Studying English Grammar

  • Kwon, Heok-Seung
    • 한국영어학회지:영어학
    • /
    • 제4권1호
    • /
    • pp.61-81
    • /
    • 2004
  • This paper will look at some grammatical phenomena which will illustrate some of the questions that can be addressed with a corpus-based approach. We will use this approach to investigate the following subjects in English grammar: number ambiguity, subject-verb concord, concord with measure expressions, and (reflexive) pronoun choice in coordinated noun phrases. We will emphasize the distinctive features of the corpus-based approach, particularly its strengths in investigating language use, as opposed to traditional descriptions or prescriptions of structure in English grammar. This paper will show that a corpus-based approach has made it possible to conduct new kinds of investigations into grammar in use and to expand the scope of earlier investigations. Native speakers rarely have accurate information about frequency of use. A large representative corpus (i.e., The British National Corpus) is one of the most reliable sources of frequency information. It is important to base an analysis of language on real data rather than intuition. Any description of grammar is more complete and accurate if it is based on a body of real data.

  • PDF

벅아이 코퍼스 오류 수정과 코퍼스 활용을 위한 프랏 스크립트 툴 (Error Correction and Praat Script Tools for the Buckeye Corpus of Conversational Speech)

  • 윤규철
    • 말소리와 음성과학
    • /
    • 제4권1호
    • /
    • pp.29-47
    • /
    • 2012
  • The purpose of this paper is to show how to convert the label files of the Buckeye Corpus of Spontaneous Speech [1] into Praat format and to introduce some of the Praat scripts that will enable linguists to study various aspects of spoken American English present in the corpus. During the conversion process, several types of errors were identified and corrected either manually or automatically by the use of scripts. The Praat script tools that have been developed can help extract from the corpus massive amounts of phonetic measures such as the VOT of plosives, the formants of vowels, word frequency information and speech rates that span several consecutive words. The script tools can extract additional information concerning the phonetic environment of the target words or allophones.

An Attempt to Measure the Familiarity of Specialized Japanese in the Nursing Care Field

  • Haihong Huang;Hiroyuki Muto;Toshiyuki Kanamaru
    • 아시아태평양코퍼스연구
    • /
    • 제4권2호
    • /
    • pp.57-74
    • /
    • 2023
  • Having a firm grasp of technical terms is essential for learners of Japanese for Specific Purposes (JSP). This research aims to analyze Japanese nursing care vocabulary based on objective corpus-based frequency and subjectively rated word familiarity. For this purpose, we constructed a text corpus centered on the National Examination for Certified Care Workers to extract nursing care keywords. The Log-Likelihood Ratio (LLR) was used as the statistical criterion for keyword identification, giving a list of 300 keywords as target words for a further word recognition survey. The survey involved 115 participants of whom 51 were certified care workers (CW group) and 64 were individuals from the general public (GP group). These participants rated the familiarity of the target keywords through crowdsourcing. Given the limited sample size, Bayesian linear mixed models were utilized to determine word familiarity rates. Our study conducted a comparative analysis of word familiarity between the CW group and the GP group, revealing key terms that are crucial for professionals but potentially unfamiliar to the general public. By focusing on these terms, instructors can bridge the knowledge gap more efficiently.

코퍼스 빈도 정보 활용을 위한 적정 통계 모형 연구: 코퍼스 규모에 따른 타입/토큰의 함수관계 중심으로 (The Statistical Relationship between Linguistic Items and Corpus Size)

  • 양경숙;박병선
    • 한국언어정보학회지:언어와정보
    • /
    • 제7권2호
    • /
    • pp.103-115
    • /
    • 2003
  • In recent years, many organizations have been constructing their own large corpora to achieve corpus representativeness. However, there is no reliable guideline as to how large corpus resources should be compiled, especially for Korean corpora. In this study, we have contrived a new statistical model, ARIMA (Autoregressive Integrated Moving Average), for predicting the relationship between linguistic items (the number of types) and corpus size (the number of tokens), overcoming the major flaws of several previous researches on this issue. Finally, we shall illustrate that the ARIMA model presented is valid, accurate and very reliable. We are confident that this study can contribute to solving some inherent problems of corpus linguistics, such as corpus predictability, corpus representativeness and linguistic comprehensiveness.

  • PDF