• Title/Summary/Keyword: Suffix Tree

An Index Data Structure for String Search in External Memory (외부 메모리에서 문자열을 효율적으로 탐색하기 위한 인덱스 자료 구조)

  • Na, Joong-Chae;Park, Kun-Soo
    • Journal of KIISE:Computer Systems and Theory / v.32 no.11_12 / pp.598-607 / 2005
  • We propose a new external-memory index data structure, the Suffix B-tree. Like the String B-tree, the Suffix B-tree is a B-tree whose keys are strings. Whereas a node in the String B-tree is implemented with a Patricia trie, a node in the Suffix B-tree is implemented with an array, so the Suffix B-tree is simpler and easier to implement than the String B-tree. Nevertheless, the branching algorithm of the Suffix B-tree is as efficient as that of the String B-tree. Consequently, the Suffix B-tree requires the same number of worst-case disk accesses as the String B-tree to solve the string matching problem, which is fundamental in the area of string algorithms.

Comparison Architecture for Large Number of Genomic Sequences

  • Choi, Hae-won;Ryoo, Myung-Chun;Park, Joon-Ho
    • Journal of Information Technology and Architecture / v.9 no.1 / pp.11-19 / 2012
  • Generally, a suffix tree is an efficient data structure since it reveals the detailed internal structures of given sequences within linear time. However, it is difficult to implement a suffix tree for a large number of sequences because of memory size constraints. Therefore, in order to compare multi-megabase genomic sequence sets using suffix trees, the suffix tree algorithms need to be redesigned. We introduce a new method for constructing a suffix tree on secondary storage for a large number of sequences. Our algorithm divides three files, in a designated sequence, into parts, storing references to the locations of edges in hash tables. In our experiments, we used 1,300,000 EST sequences (around 300 MB) to generate a suffix tree on disk.

Relation Extraction Using Suffix Tree and Distant Supervision (Suffix Tree와 Distant Supervision을 이용한 관계 추출)

  • Lee, HyunGoo;Choi, Maengsik;Kim, Harksoo
    • Annual Conference on Human and Language Technology / 2014.10a / pp.149-152 / 2014
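  • Relation extraction is an important research area in natural language processing. Many relation extraction studies use supervised learning, but building the labeled answer data is costly. In this paper, we build the data using distant supervision and propose a rule-based relation extraction model using a suffix tree. Relation extraction using the suffix tree achieved a Macro F1-measure of 84.05%, showing that the approach is usable for relation extraction.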

Finding All-Pairs Suffix-Prefix Matching Using Suffix Array (접미사 배열을 이용한 Suffix-Prefix가 일치하는 모든 쌍 찾기)

  • Han, Seon-Mi;Woo, Jin-Woon
    • The KIPS Transactions:PartA / v.17A no.5 / pp.221-228 / 2010
  • As string operations have found applications in computational biology, security, and Internet search, various data structures and algorithms for efficient string operations have been studied. All-pairs suffix-prefix matching finds, for every pair of given strings, the longest suffix of one string that matches a prefix of the other. The matching algorithm is used in fast approximation algorithms for finding the shortest superstring, as well as in bioinformatics and data compression. In this paper, we propose an algorithm that finds all-pairs suffix-prefix matches using the suffix array in $O(k \cdot m)$ time. Experiments show that the suffix array algorithm outperforms the suffix tree algorithm, taking less time and memory.
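
To make the problem statement concrete, here is a minimal brute-force sketch of all-pairs suffix-prefix matching. It is not the suffix array algorithm proposed in the paper and does not achieve its time bound; the function names and the sample strings are assumptions chosen for illustration.

```python
def longest_suffix_prefix(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for length in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def all_pairs_suffix_prefix(strings):
    """Brute-force table: entry [i][j] is the overlap between a suffix of
    strings[i] and a prefix of strings[j]; the diagonal is left at 0."""
    k = len(strings)
    return [
        [0 if i == j else longest_suffix_prefix(strings[i], strings[j])
         for j in range(k)]
        for i in range(k)
    ]


# Hypothetical input strings, not taken from the paper.
reads = ["abcab", "cabx", "bxab"]
table = all_pairs_suffix_prefix(reads)
print(table)  # [[0, 3, 1], [0, 0, 2], [2, 0, 0]]
```

For example, the entry for ("abcab", "cabx") is 3 because the suffix "cab" of the first string is a prefix of the second. This overlap table is the kind of input that greedy shortest-superstring approximations consume.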

Suffix Tree Constructing Algorithm for Large DNA Sequences Analysis (대용량 DNA서열 처리를 위한 서픽스 트리 생성 알고리즘의 개발)

  • Choi, Hae-Won
    • Journal of Korea Society of Industrial Information Systems / v.15 no.1 / pp.37-46 / 2010
  • A suffix tree is an efficient data structure that exposes the internal structure of a string and allows efficient solutions to a wide range of complex string problems, particularly in computational biology. However, as the volume of biological information explodes, it becomes impossible to construct suffix trees in main memory, so an efficient technique is needed to construct them in secondary storage. In this paper, we present a method for constructing a suffix tree on disk for a large set of DNA strings using a new index scheme. We also show a typical application example that uses the suffix tree on disk.

Improving Lookup Time Complexity of Compressed Suffix Arrays using Multi-ary Wavelet Tree

  • Wu, Zheng;Na, Joong-Chae;Kim, Min-Hwan;Kim, Dong-Kyue
    • Journal of Computing Science and Engineering / v.3 no.1 / pp.1-4 / 2009
  • Given a text $T$ of size $n$, we want to search for information of interest. To support fast searching, an index must be constructed by preprocessing the text. The suffix array is one such index data structure. The compressed suffix array (CSA) is a compressed index that exploits the regularity of the suffix array and can be compressed to the $k$-th order empirical entropy. In this paper, we improve the lookup time complexity of the compressed suffix array by using the multi-ary wavelet tree, at the cost of more space. In our implementation, the lookup time complexity of the compressed suffix array is $O(\log_{\sigma}^{\varepsilon/(1-\varepsilon)} n \, \log_r \sigma)$, and its space is $\varepsilon^{-1} n H_k(T) + O(n \log\log n / \log_{\sigma}^{\varepsilon} n)$ bits, where $\sigma$ is the size of the alphabet, $H_k$ is the $k$-th order empirical entropy, $r$ is the branching factor of the multi-ary wavelet tree such that $2 \leq r \leq \sqrt{n}$ and $r \leq O(\log_{\sigma}^{1-\varepsilon} n)$, and $0 < \varepsilon < 1/2$ is a constant.
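
As background for the abstract above, the following is a minimal sketch of a plain (uncompressed) suffix array used as a search index. It is not the compressed suffix array or the multi-ary wavelet tree from the paper; the naive sort-based construction and the function names are assumptions made for illustration.

```python
def build_suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes of `text`, in lexicographic order.
    Naive construction by sorting suffixes; fine for short texts only."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """All start positions of `pattern` in `text`, via two binary searches
    over the suffix array (lower and upper bound of the matching range)."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


text = "banana"
sa = build_suffix_array(text)
print(sa)                                 # [5, 3, 1, 0, 4, 2]
print(find_occurrences(text, sa, "ana"))  # [1, 3]
```

A "lookup" in the abstract's sense corresponds to reading one entry `sa[i]`; compressed suffix arrays trade that constant-time array access for a slower computed lookup in exchange for much less space.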

A Suffix Tree Transform Technique for Substring Selectivity Estimation (부분 문자열 선택도 추정을 위한 서픽스트리 변환 기법)

  • Lee, Hong-Rae;Shim, Kyu-Seok;Kim, Hyoung-Joo
    • Journal of KIISE:Databases / v.34 no.2 / pp.141-152 / 2007
  • Selectivity estimation is a crucial component of query optimization in relational databases. While extensive research has been done on this topic for predicates over numerical data, little work has been done for substring predicates. We propose novel suffix tree transform algorithms for this problem. Unlike previous approaches, in which a full suffix tree is pruned and then an estimation algorithm is employed, we systematically transform a suffix tree into a suffix graph. In our approach, nodes with similar counts are merged while structural information in the original suffix tree is preserved in a controlled manner. We present both an error-bound algorithm and a space-bound algorithm. Experimental results with real-life data sets show that our algorithms have lower average relative error than previous work as well as good error distribution characteristics.

A Generalization of the Linearized Suffix Tree to Square Matrices

  • Na, Joong-Chae;Lee, Sun-Ho;Kim, Dong-Kyue
    • Journal of Korea Multimedia Society / v.13 no.12 / pp.1760-1766 / 2010
  • The linearized suffix tree (LST) is an array data structure supporting traversals on suffix trees. We apply the LST to two-dimensional (2D) suffix trees and obtain a space-efficient substitute for 2D suffix trees. Given an $n \times n$ text matrix and an $m \times m$ pattern matrix over an alphabet $\Sigma$, our 2D-LST provides pattern matching in $O(m^2 \log |\Sigma|)$ time and $O(n^2)$ space.

Performance Comparison of Keyword Extraction Methods for Web Document Cluster using Suffix Tree Clustering (Suffix Tree를 이용한 웹 문서 클러스터의 제목 생성 방법 성능 비교)

  • 염기종;권영식
    • Proceedings of the Korea Intelligent Information System Society Conference / 2002.11a / pp.328-335 / 2002
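  • With recent advances in Internet technology, a large amount of material is scattered across the web. Users rely on keyword search to find the information they want, but keyword search retrieves results based on the fragmentary information users enter, ranks them by its own criteria, and presents them as a flat list. As a result, the results may differ from what users had in mind. There is therefore a growing need for an environment that reduces users' search time and makes searching more convenient. In this paper, we use the Suffix Tree algorithm to cluster related documents and, to generate a title for each cluster, compare and evaluate document frequency, term frequency with inverse document frequency (TF-IDF), the chi-square test, mutual information, and entropy methods to determine which is most effective for title generation. The evaluation shows that document frequency performs about 10% better than TF-IDF.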
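
To illustrate two of the scoring methods compared in the abstract above, here is a minimal sketch of document frequency and TF-IDF scoring for candidate cluster-title terms. The exact formulas used in the paper are not given in the abstract, so the weighting below (cluster term count times log(N / df)), the function names, and the toy documents are all assumptions.

```python
import math
from collections import Counter


def document_frequency(docs):
    """Number of documents in which each term appears at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df


def tf_idf_scores(cluster_docs, all_docs):
    """Term count inside the cluster, weighted by log(N / df) computed over
    the whole collection. Assumes cluster_docs is a subset of all_docs."""
    tf = Counter(term for doc in cluster_docs for term in doc)
    df_all = document_frequency(all_docs)
    n = len(all_docs)
    return {term: count * math.log(n / df_all[term])
            for term, count in tf.items()}


# Hypothetical tokenized documents, not data from the paper.
cluster = [["suffix", "tree", "index"], ["suffix", "tree", "search"]]
others = [["web", "search", "engine"], ["tree", "planting", "guide"]]

df = document_frequency(cluster)
tfidf = tf_idf_scores(cluster, cluster + others)
print(df["suffix"], df["tree"])          # 2 2
print(tfidf["suffix"] > tfidf["tree"])   # True
```

In this toy data, document frequency rates "suffix" and "tree" equally because both occur in every cluster document, while TF-IDF penalizes "tree" for also appearing outside the cluster. Which behavior yields better cluster titles is the empirical question the paper evaluates.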

High Performance Pattern Matching algorithm with Suffix Tree Structure for Network Security (네트워크 보안을 위한 서픽스 트리 기반 고속 패턴 매칭 알고리즘)

  • Oh, Doohwan;Ro, Won Woo
    • Journal of the Institute of Electronics and Information Engineers / v.51 no.6 / pp.110-116 / 2014
  • Pattern matching algorithms are widely used in security systems for computer networks, ubiquitous networks, sensor networks, and so on. However, advances in information technology have increased both the amount of data and the computational complexity of pattern matching. There is therefore strong demand for novel high-performance pattern matching algorithms. In light of this, this paper proposes a suffix tree based pattern matching algorithm. The suffix tree is constructed from the suffix values of all patterns. Shift nodes, which indicate how many characters can be skipped without matching operations, are then added to the suffix tree to boost matching performance. The proposed algorithm reduces the memory usage of the suffix tree and, through the shift nodes, the number of matching operations. In our performance evaluation, the algorithm achieved a 24% performance gain over the traditional Wu-Manber algorithm.