Interplay of Text Mining and Data Mining for Classifying Web Contents

A Study on Integrating Text Mining and Data Mining for the Classification of Web Content

  • 최윤정 (Department of Computer Science, Ewha Womans University)
  • 박승수 (Department of Computer Science, Ewha Womans University)
  • Published : 2002.09.01

Abstract

Recently, unstructured data such as website logs, texts, and tables have been flooding the Internet. Some of this unstructured data is potentially very useful, such as bulletin boards and e-mails used for customer service and the output of search engines. Various text mining tools have been introduced to deal with such data, but most of them lack accuracy compared to traditional data mining tools that deal with structured data. Hence, a way to apply data mining techniques to text data has been sought. In this paper, we propose a text mining system that can incorporate existing data mining methods. We use text mining as a preprocessing tool to generate formatted data that serves as input to the data mining system. The output of the data mining system is fed back to the text mining step to guide further categorization. This feedback cycle can enhance the accuracy of the text mining. We apply this method to categorize websites containing adult content as well as illegal content. The results show improved categorization performance on previously ambiguous data.

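Recently, large amounts of text data have been generated on the Internet that do not take the form of conventional databases and have no fixed structure, yet hold considerable potential value. Bulletin boards and e-mails used as customer service channels, and the data initially collected by search engines, are good examples of such unstructured data. Various text mining tools are being developed to classify these text documents, but because they are mostly based on simple statistical methods, their accuracy is low, and a way to make use of more diverse data mining techniques is needed. However, data mining techniques require formatted input data, so applying them directly to text is difficult. To solve this problem, this study proposes and implements a method in which text mining is performed in the preprocessing stage, the refined intermediate results are processed by data mining, and the outcome is fed back to the text mining to improve accuracy. To verify its validity, we applied the method to the task of classifying the web content of harmful sites and analyzed the results. The analysis shows that the proposed method significantly reduced the error rate compared with applying conventional text mining alone.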
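
The feedback cycle described in the abstract can be illustrated with a minimal sketch. This is not the authors' system: the vocabulary, the training documents, the initial weights, and the function names are all invented for illustration, and a simple perceptron update stands in for the unspecified data mining method. The text mining step turns free text into fixed-length term-count rows, the data mining step learns from those rows, and the learned weights are handed back to the categorizer.

```python
from collections import Counter

# Hypothetical vocabulary; the paper does not list its actual terms.
VOCABULARY = ["casino", "bet", "free", "news"]


def extract_features(text, vocabulary=VOCABULARY):
    """Text mining step: map free text to a fixed-length term-count row."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def categorize(text, weights, vocabulary=VOCABULARY):
    """Return 1 (flagged) if the weighted term score is positive, else 0."""
    features = extract_features(text, vocabulary)
    return 1 if sum(f * w for f, w in zip(features, weights)) > 0 else 0


def refine_weights(rows, labels, weights, rate=0.5, epochs=10):
    """Data mining step (stand-in): perceptron updates on the structured rows."""
    weights = list(weights)
    for _ in range(epochs):
        for features, label in zip(rows, labels):
            score = sum(f * w for f, w in zip(features, weights))
            error = label - (1 if score > 0 else 0)
            weights = [w + rate * error * f for w, f in zip(weights, features)]
    return weights


# Labeled training documents (invented for illustration).
training = [
    ("casino bet free bet", 1),
    ("free news today", 0),
    ("bet casino", 1),
    ("news free free", 0),
]
rows = [extract_features(text) for text, _ in training]
labels = [label for _, label in training]

initial = [1.0, 1.0, 1.0, 0.0]  # naive keyword weights before feedback
refined = refine_weights(rows, labels, initial)  # fed back to the categorizer

ambiguous = "free news free weather"
print(categorize(ambiguous, initial), categorize(ambiguous, refined))  # prints: 1 0
```

In this toy setup the naive keyword weights flag the ambiguous document because of the word "free", while the weights returned by the data mining step no longer do. That mirrors, in miniature, the improvement on previously ambiguous data that the abstract reports.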
