DOC

DNA Sequence Alignment

By Norma Wagner,2014-03-28 08:11
15 views 0
DNA Sequence Alignment

Introduction to

DNA Sequence Alignment

College of Electrical Engineering and Computer Science

National Taiwan University

胡哲銘

Ghe-Ming Hu

指導教授;丁建均 博士

DNA Sequence Alignment

Introduction

Similarity and alignment of DNA sequence can be applied to lots of biological technologies. We compare two sequence to search for the homology of a newly one of the reference sequence so that we can analyze the relation between the two DNA sequences. DNA sequence analysis is a fast-growing field and many similarity measurement of sequence of methods have been proposed and developed. Because the numbers of DNA sequence are always huge, we have to seek for the help of computer. Therefore, there are many algorithms for solving the sequence alignment have been proposed.

Dynamic programming is the currently most popular algorithm for determining the similarity between two sequences. However, its complexities are O(MN), where M

and N represent the lengths of two DNA sequences been compared. Thus, some fast DNA analyzing tool such as FASTA and BLAST will be introduced later in this paper. FASTA and BLAST are the algorithms can approximate the result of dynamic programming and they are also very popular.

Next, we will introduce a new algorithm for comparing the similarity between the two DNA sequences, that is, the unitary discrete correlation (UDCR) algorithm. We use it instead of dynamic programming for similarity measurement. UDCR can reduce the complexities of semi-global and local alignment from O(MN) to O(N?logM) 2

2or O(L), where M and N represent the lengths of the two DNA sequences being compared,

and L is the size of the matched subsequences.

Finally, we combine the advantage of dynamic programming and UDCR and develop a new algorithm, we call it combined unitary discrete correlation (CUDCR) algorithm. This algorithm requires less computing time and more accurate than dynamic programming. For example, in semi-global and local alignments, only a small part of input DNA sequence is actually be used. Thus, we can use UCDR algorithm to measure the better-aligned location and then use dynamic programming (DP) algorithm to find the detailed sequence alignment.

1.DNA Sequence Assembly

1.1 Shotgun Sequencing

Before we introduce the DNA alignment algorithm, we briefly present the structure of DNA sequence assembly. Due to the current technology, we can not read a whole strand of DNA at one time, but a strand of 350 to 1000 nucleotides in length instead. Thus, we have to reconstruct a whole DNA sequence from a set of its subsequences called fragment. The technique presented above is DNA sequence assembly problem or DNA sequencing.

The most popular DNA sequencing is shotgun sequencing. It requires a lot of fragments to reconstruct the original DNA sequence. The fragment may be created by breaking of DNA, copies of the original one and at random intervals of the original one. The central idea is that we can infer the fragments from copies of the original DNA sequence will overlap with each other and they will merge into a new fragment called contig or meta-fragment (See Fig.1.1). To insure the successful reconstruction, its recommended that that the cover ratio of the fragments should be 5x to 10x.

Original DNA

Copies of the original DNA

Fragments of the copies

(Shotgun)

Reconstruct the original DNA

Fig.1.1 The concept of shotgun sequencing.

1.2 Greedy algorithm

Shotgun sequencing is essentially a greedy algorithm. The process of greedy algorithm is presented as below:

Step1. Calculate pair-wise alignments of all fragments. Step2. Choose two fragments with the largest overlap. Step3. Merge the chosen fragments.

Step4. Repeat step 2 and 3 until there is not any fragment which can be merged.

1.3 Issues of shotgun sequencing

There are four major issues in shotgun sequencing as below:

1. The errors in fragments.

2. The unknown orientation of a fragment.

3. Gaps in fragment coverage.

4. Repeats in fragments.

1.4 The shortest superstring problem

Shotgun sequencing seems a method for DNA sequencing; however, the wrong assembled DNA sequence will happen due to the false overlaps between sequences. Shortest superstring means the shortest string which contains all fragments in a given set. Suppose there is no sequencing error and the fragments are randomly generated, the output sequence approaches the original DNA sequence as the number of fragments increases. However, the issues of shotgun sequencing we mentioned above may happen, it seems not a good solution.

Terefore, the sequence by hybridization (SBH) techniques comes in our mind.

kSBH contains very short k-letter fragments called probes. Next, the SBH 4

technique tries to reconstruct the DNA sequence from the k-letter probe composition.

Suppose that there is not any sequencing error, the output string approaches the original DNA sequence as the value of k increases. Now the directed path graph is

used to solve the SBH problem efficiently. The SBH adopts V (vertexes in the graph)

as a set of fragments and (edges in the graph) as the edge between two Evv,;；ij

fragments of V, then the directed path graph G of the DNA sequence can be figured

out. With the aid of the directed path graph G, the sequencing problem is reduced to

finding Hamiltonian path in G. It is also proved that finding the Eulerian path is

equivalent to finding the Hamiltonian path. However, SBH has many unsolved problem now. A small k value makes it hard to recreate the DNA sequence while a large k value makes the computation huge. Unfortunately, the value of k is limited to 8 or 10 currently.

2. Dynamic Programming

We have introduced DP algorithm before and now we will present how DP work in detail. Note that the term programming is similar to the word optimization. DP

is based on the divide-and-conquer method. It divides the original problem into sub-problems, and the sub-problems will be divided into more sub-problems, one after another. DP solves this problem by the recurrence relation instead of wasting time on unnecessary computation.

2.1 The edit distance between two strings

We need a way to score an alignment to find the optimal sequence alignment. There is a common way called edit distance to measure what is the difference

between the two strings. There are four edit operators in the edit distance --- insertion, deletion, replacement (substitution) and match. Insertions and deletions are both called the indels, and an indel is represented by a dash “-” in an alignment. The

insertion operation, denoted by I, indicates inserting an “empty” letter to the first sequence, and the deletion operation, denoted by D, indicates deleting a letter from the first sequence. (Note that an insertion/deletion in one string can be seen as deletion/insertion in another one). The replacement operation, denoted by R, means there is a mismatch between the two strings at the aligned position. In addition, a match, denoted by M, means the aligned characters of the two strings are identical.

Here, we have a example as Fig.2.1 shown.

Sequence 1: a b b d e Sequence 1: a b b d e - Align Sequence 2: a c d e c Sequence 2: a c - d e c

Transcripts: MRDMMI

Fig.2.1 The global alignment of “abbde” and “acdec”.

2.2 String similarity method

String similarity method is an alternative method to edit distance method and both of them are developed at the same time. String similarity method is often preferred than the edit distance method and it was proved that both of them can transform to each other by a formula. Here, we give a example as shown in Fig.2.2 nd Fig.2.3. Let A be the alphabet used by strings and A={a, b, c, -}. We define the a

L

similarity score as , where represents the ith character of sSiSi,Si;；;；;；;；?1211i

string , represents the ith character of string , s(x, y) represents the SSiS;；122

score from scoring matrix of A and L is the length of the string.

a b c -

a 1 -1 -2 -1

b -1 3 -2 0

c -2 -2 0 -2

- -1 0 -2 0

Fig.2.2 The scoring matrix of A.

: a b c - Alignment : S1

S: a c - b 2

Pairwise score : 1 - 2 - 2 0

Similarity score = 1-2-2+0 = -3

Fig. 2.1 The calculation of similarity score.

We can found that the scoring matrix A is a symmetric matrix and this simple example above can demonstrate how to calculate the similarity score of an alignment.

2.3 Dynamic programming calculation of similarity

DP algorithm has three essential components- the recurrence relation, the tabular computation, and the traceback.

; The recurrence relation:

The recurrence relation establishes a recursive relationship between the value of D(i,j), for i and j both positive and D(i,j) means that the distance between S(i) and 1

S(j). The recurrence relation has the base condition below: 2

D(i, 0)=i (2.1)

and

D(0, j)=j (2.2)

The recurrence relation for D(i,j) when D(i,j) when both i and j are strictly positive is

D(i,j)=min[D(i-1, j)+1, D(i, j-1)+1, D(i-1, j-1)+1] (2.3) Where t(i, j) is defined to have value 1 if S(i) ? S(j), and t(i, j) has the value 0 if 12

S(i) = S(j). 12

Here, we give a example as shown in Fig.2.4. Let Svintner and Swriters. 1=2=

First, we construct a table computing the edit distance between vintner and writers.

D(i,j) S w r i t e r s 2

S 0 1 2 3 4 5 6 7 1

v 1

i 2

n 3

t 4

n 5

e 6

r 7

Fig.2.4 The computing table with the base condition

; Tabular computation

With the base condition, it means that we have some initial value in D(i,j) table.

Next, we can compute all the value in this table by the equation: D(i,j)=min[D(i-1, j)+1, D(i, j-1)+1, D(i-1, j-1)+1] as in (2.3). By this rule, we can find

the final table as shown in Fig.2.5.

D(i,j) S w r i t e r s 2

S 0 1 2 3 4 5 6 7 1

v 1 1 2 3 4 5 6 7

i 2 2 2 2 3 4 5 6

n 3 3 3 3 3 4 5 6

t 4 4 4 4 3 4 5 6

n 5 5 5 5 4 4 5 6

e 6 6 6 6 5 4 5 6

r 7 7 6 7 6 5 4 5

Fig.2.5 The computing table with the final result

; The traceback

After the tabular computation is finished, we have to find out the traceback in this computation table. In order to find out the traceback, we use the pointers to point where the direction is. We use the bottom-up computation. We start at the right-bottom position in this table and judge where the pointer point by a specified formula (From this case, we start at the position D(8,8)). The formula is below:

Set a pointer from (i,j) to cell (i,j-1) if D(i,j)= D(i,j-1)+1 (2.4)

Set a pointer from (i,j) to cell (i-1,j) if D(i,j)= D(i-1,j)+1 (2.5)

Set a pointer from (i,j) to cell (i-1,j-1) if D(i,j)= D(i-1,j-1)+t(i.j) (2.6)

From the formula (2.4) (2.5) and (2.6), we can figure that it only need to compute three cases by these formula for one cell, and then we can decide which direction the next cell points. Because there are three routes for every single cell, it will be six permutation combinations. However, the results are not always distinct at all. For this case, we just have three results as shown in Fig.2.6 Fig.2.7 and Fig.2.8.

D(i,j) S w r i t e r s 2

S 0 0 0 0 0 0 0 0 1

v 0 3 1 0 0 0 0 0

i 0 0 0 3 0 0 0 0

n 0 0 0 2 0 0 0 0

t 0 0 0 0 3 0 0 0

n 0 0 0 0 2 0 0 0

e 0 0 0 0 0 3 0 0

r 0 0 0 0 0 0 3 1

Fig 2.6 The first result

Report this document

For any questions or suggestions please email
cust-service@docsford.com