In computer science, Hirschberg's algorithm, named after its inventor, Dan Hirschberg, is a dynamic programming algorithm that finds the optimal sequence alignment between two strings. Optimality is measured with the Levenshtein distance, defined to be the sum of the costs of insertions, replacements, deletions, and null actions needed to change one string into the other. Hirschberg's algorithm is simply described as a more space-efficient version of the Needleman–Wunsch algorithm that uses dynamic programming.[1] Hirschberg's algorithm is commonly used in computational biology to find maximal global alignments of DNA and protein sequences.

Algorithm information

edit

Hirschberg's algorithm is a generally applicable algorithm for optimal sequence alignment. BLAST and FASTA are suboptimal heuristics. If and are strings, where and , the Needleman–Wunsch algorithm finds an optimal alignment in time, using space. Hirschberg's algorithm is a clever modification of the Needleman–Wunsch Algorithm, which still takes time, but needs only space and is much faster in practice.[2] One application of the algorithm is finding sequence alignments of DNA or protein sequences. It is also a space-efficient way to calculate the longest common subsequence between two sets of data such as with the common diff tool.

The Hirschberg algorithm can be derived from the Needleman–Wunsch algorithm by observing that:[3]

  1. one can compute the optimal alignment score by only storing the current and previous row of the Needleman–Wunsch score matrix;
  2. if is the optimal alignment of , and is an arbitrary partition of , there exists a partition of such that .

Algorithm description

edit

denotes the i-th character of , where . denotes a substring of size , ranging from the i-th to the j-th character of . is the reversed version of .

and are sequences to be aligned. Let be a character from , and be a character from . We assume that , and are well defined integer-valued functions. These functions represent the cost of deleting , inserting , and replacing with , respectively.

We define , which returns the last line of the Needleman–Wunsch score matrix :

function NWScore(X, Y)
    Score(0, 0) = 0 // 2 * (length(Y) + 1) array
    for j = 1 to length(Y)
        Score(0, j) = Score(0, j - 1) + Ins(Yj)
    for i = 1 to length(X) // Init array
        Score(1, 0) = Score(0, 0) + Del(Xi)
        for j = 1 to length(Y)
            scoreSub = Score(0, j - 1) + Sub(Xi, Yj)
            scoreDel = Score(0, j) + Del(Xi)
            scoreIns = Score(1, j - 1) + Ins(Yj)
            Score(1, j) = max(scoreSub, scoreDel, scoreIns)
        end
        // Copy Score[1] to Score[0]
        Score(0, :) = Score(1, :)
    end
    for j = 0 to length(Y)
        LastLine(j) = Score(1, j)
    return LastLine

Note that at any point, only requires the two most recent rows of the score matrix. Thus, is implemented in space.

The Hirschberg algorithm follows:

function Hirschberg(X, Y)
    Z = ""
    W = ""
    if length(X) == 0
        for i = 1 to length(Y)
            Z = Z + '-'
            W = W + Yi
        end
    else if length(Y) == 0
        for i = 1 to length(X)
            Z = Z + Xi
            W = W + '-'
        end
    else if length(X) == 1 or length(Y) == 1
        (Z, W) = NeedlemanWunsch(X, Y)
    else
        xlen = length(X)
        xmid = length(X) / 2
        ylen = length(Y)

        ScoreL = NWScore(X1:xmid, Y)
        ScoreR = NWScore(rev(Xxmid+1:xlen), rev(Y))
        ymid = arg max ScoreL + rev(ScoreR)

        (Z,W) = Hirschberg(X1:xmid, y1:ymid) + Hirschberg(Xxmid+1:xlen, Yymid+1:ylen)
    end
    return (Z, W)

In the context of observation (2), assume that is a partition of . Index is computed such that and .

Example

edit

Let

The optimal alignment is given by

 W = AGTACGCA
 Z = --TATGC-

Indeed, this can be verified by backtracking its corresponding Needleman–Wunsch matrix:

         T   A   T   G   C
     0  -2  -4  -6  -8 -10
 A  -2  -1   0  -2  -4  -6
 G  -4  -3  -2  -1   0  -2
 T  -6  -2  -4   0  -2  -1
 A  -8  -4   0  -2  -1  -3
 C -10  -6  -2  -1  -3   1
 G -12  -8  -4  -3   1  -1
 C -14 -10  -6  -5  -1   3
 A -16 -12  -8  -7  -3   1

One starts with the top level call to , which splits the first argument in half: . The call to produces the following matrix:

        T   A   T   G   C
    0  -2  -4  -6  -8 -10
 A -2  -1   0  -2  -4  -6
 G -4  -3  -2  -1   0  -2
 T -6  -2  -4   0  -2  -1
 A -8  -4   0  -2  -1  -3

Likewise, generates the following matrix:

       C   G   T   A   T
    0 -2  -4  -6  -8 -10
 A -2 -1  -3  -5  -4  -6
 C -4  0  -2  -4  -6  -5
 G -6 -2   2   0  -2  -4
 C -8 -4   0   1  -1  -3

Their last lines (after reversing the latter) and sum of those are respectively

 ScoreL      = [ -8 -4  0 -2 -1 -3 ]
 rev(ScoreR) = [ -3 -1  1  0 -4 -8 ]
 Sum         = [-11 -5  1 -2 -5 -11]

The maximum (shown in bold) appears at ymid = 2, producing the partition .

The entire Hirschberg recursion (which we omit for brevity) produces the following tree:

               (AGTACGCA,TATGC)
               /               \
        (AGTA,TA)             (CGCA,TGC)
         /     \              /        \
     (AG, )   (TA,TA)      (CG,TG)     (CA,C)
              /   \        /   \       
           (T,T) (A,A)  (C,T) (G,G) 

The leaves of the tree contain the optimal alignment.

See also

edit

References

edit
  1. ^ Hirschberg's algorithm.
  2. ^ "The Algorithm".
  3. ^ Hirschberg, D. S. (1975). "A linear space algorithm for computing maximal common subsequences". Communications of the ACM. 18 (6): 341–343. CiteSeerX 10.1.1.348.4774. doi:10.1145/360825.360861. MR 0375829. S2CID 207694727.

📚 Artikel Terkait di Wikipedia

Edit distance

operations. A linear-space solution to this problem is offered by Hirschberg's algorithm. A general recursive divide-and-conquer framework for solving such

Longest common subsequence

quadratic-time linear-space algorithm for finding the LCS length along with an optimal sequence which runs faster than Hirschberg's algorithm in practice due to

Hirschberg–Sinclair algorithm

The Hirschberg–Sinclair algorithm is a distributed algorithm designed for leader election problem in a synchronous ring network. It is named after its

Hirschberg

historian Max Hirschberg (1883–1964), German Jewish lawyer Hirschberg test, a medical screening test for an eye condition Hirschberg's algorithm, a dynamic

Needleman–Wunsch algorithm

The Needleman–Wunsch algorithm is an algorithm used in bioinformatics to align protein or nucleotide sequences. It was one of the first applications of

List of algorithms

An algorithm is a fundamental set of rules or defined procedures that are typically designed and used to be a simpler way to solve a specific problem

Dan Hirschberg

synchronous ring. Lynch named this algorithm the HS algorithm, after its authors. Hirschberg, D. S. (1975). "A linear space algorithm for computing maximal common

Thompson's construction

computer science, Thompson's construction algorithm, also called the McNaughton–Yamada–Thompson algorithm, is a method of transforming a regular expression