본문 바로가기
Bioinformatics/Algorithms

[bwt] suffix array 관련 정리

by 임은천 2015. 3. 27.


1. BWT and SA

Given a string S, a sorted suffix array SA,

BWT[i] = S[SA[i] - 1]


2. Backward search


2-1. Rank Function

rank() sometimes is named with occurrence()

rank() counts the number of an alphabet up to a certain position.

In practice, we uses sampled cumulative alphabet counters.

Given a pattern P, a global alphabet counter C, a low suffix array index l, a high suffix array index h,

for (i = len(P) - 1; h > l && i >= 0; --i) {

   c = P[i]

   l = C[c] + rank(l - 1, c) + 1

   h = C[c] + rank(h, c)

}


l, h usually implemented as a class named with Interval in practical applications

struct Interval {

  int64_t low;

  int64_t high;

}


2-2. Select Function

select() sometimes is named with locate(), or psi()

select() gives the location of a pattern in the original string S.

The value of suffix array is the position of the pattern in S.

Given a suffix array SA, a low suffix array index l, a high suffix array index h,

for (suffix_array_index = l; suffix_array_index <= h; ++suffix_array_index) {

  print(SA[suffix_array_index])



To get SA[suffix_array_index] from a BWT we need an algorithm as follows.

get_SA(suffix_array_index, BWT) {

  suffix_array_index = -1;

  do {

    c = BWT[suffix_array_index];

    suffix_array_index = C[c] + rank(suffix_array_index, c);

    ++index_in_S;

  } while(suffix_array_index > 0);

  return index_in_S;

}


댓글