Sequence Alignment

  • View
    13

  • Download
    1

Embed Size (px)

DESCRIPTION

Sequence Alignment. Kun-Mao Chao ( ) Department of Computer Science and Information Engineering National Taiwan University, Taiwan WWW: http://www.csie.ntu.edu.tw/~kmchao. GenBank 200.0. orzs sequence evolution. the origin? their evolutionary relationships? - PowerPoint PPT Presentation

Transcript

  • Sequence AlignmentKun-Mao Chao ()Department of Computer Science and Information EngineeringNational Taiwan University, Taiwan

    WWW: http://www.csie.ntu.edu.tw/~kmchao

  • GenBank 200.0*

  • *Useful WebsitesMIT Open Coursewaree.g. Biology 7.012 Introduction to BiologyThe International Society for Computational Biology:http://www.iscb.org/National Center for Biotechnology Information (NCBI, NIH):http://www.ncbi.nlm.nih.gov/European Bioinformatics Institute (EBI):http://www.ebi.ac.uk/DNA Data Bank of Japan (DDBJ):http://www.ddbj.nig.ac.jp/index-e.html

  • *orzs sequence evolutionorz (kid)OTZ (adult)Orz (big head)Crz (motorcycle driver)on_ (soldier)or2 (bottom up)o (back high)STO (the other way around)Oroz (me)the origin?their evolutionary relationships?their putative functional relationships?

  • *What?THETR UTHIS MOREI MPORT ANTTH ANTHE FACTS

    The truth is more important than the facts.

  • *Dot Matrix

  • *

  • *C---TTAACT CGGATCA--TPairwise AlignmentSequence A: CTTAACTSequence B: CGGATCAT

    An alignment of A and B:Sequence ASequence B

  • *C---TTAACT CGGATCA--TPairwise AlignmentSequence A: CTTAACTSequence B: CGGATCAT

    An alignment of A and B:Insertion gapMatchMismatchDeletion gap

  • *Alignment GraphSequence A: CTTAACTSequence B: CGGATCATC G G A T C A TC T T A A C TC---TTAACT CGGATCA--T

  • *A simple scoring schemeMatch: +8 (w(x, y) = 8, if x = y)Mismatch: -5 (w(x, y) = -5, if x y)Each gap symbol: -3 (w(-,x)=w(x,-)=-3)C - - - T T A A C T C G G A T C A - - T +8 -3 -3 -3 +8 -5 +8 -3 -3 +8 = +12Alignment score

  • *An optimal alignment-- the alignment of maximum score

    Let A=a1a2am and B=b1b2bn .Si,j: the score of an optimal alignment between a1a2ai and b1b2bjWith proper initializations, Si,j can be computed as follows.

  • *Computing Si,j ijw(ai,-)w(-,bj)w(ai,bj)Sm,n

  • *Initializations C G G A T C A TC T T A A C TMatch: 8Mismatch: -5Gap symbol: -3

    0-3-6-9-12-15-18-21-24-3-6-9-12-15-18-21

  • *S3,5 = C G G A T C A TC T T A A C TMatch: 8Mismatch: -5Gap symbol: -3

    0-3-6-9-12-15-18-21-24-3852-1-4-7-10-13-6530-3741-2-920-2-5?-12-15-18-21

  • *S3,5 = 5 C G G A T C A TC T T A A C Toptimal scoreMatch: 8Mismatch: -5Gap symbol: -3

    0-3-6-9-12-15-18-21-24-3852-1-4-7-10-13-6530-3741-2-920-2-552-19-12-1-3-5630107-15-4-6-831-285-18-7-9-110-2963-21-10-12-14-386414

  • *C T T A A C TC G G A T C A T C G G A T C A TC T T A A C T8 5 5 +8 -5 +8 -3 +8 = 14

    0-3-6-9-12-15-18-21-24-3852-1-4-7-10-13-6530-3741-2-920-2-552-19-12-1-3-5630107-15-4-6-831-285-18-7-9-110-2963-21-10-12-14-386414

  • *Now try this example in classSequence A: CAATTGASequence B: GAATCTGC

    Their optimal alignment

  • *Initializations G A A T C T G CC AA T T G AMatch: 8Mismatch: -5Gap symbol: -3

    0-3-6-9-12-15-18-21-24-3-6-9-12-15-18-21

  • *S4,2 = G A A T C T G CC AA T T G AMatch: 8Mismatch: -5Gap symbol: -3

    0-3-6-9-12-15-18-21-24-3-5-8-11-14-4-7-10-13-6-830-3-6-9-12-15-9-11011852-1-4-12-14?-15-18-21

  • *S5,5 = G A A T C T G CC AA T T G AMatch: 8Mismatch: -5Gap symbol: -3

    0-3-6-9-12-15-18-21-24-3-5-8-11-14-4-7-10-13-6-830-3-6-9-12-15-9-11011852-1-4-12-14-38191613107-15-17-6516?-18-21

  • *S5,5 = 14 G A A T C T G CC AA T T G Aoptimal scoreMatch: 8Mismatch: -5Gap symbol: -3

    0-3-6-9-12-15-18-21-24-3-5-8-11-14-4-7-10-13-6-830-3-6-9-12-15-9-11011852-1-4-12-14-38191613107-15-17-651614242118-18-7-921311213229-21-101-1108182927

  • * G A A T C T G CC AA T T G A-5 +8 +8 +8 -3 +8 +8 -5 = 27C A A T - T G AG A A T C T G C

    0-3-6-9-12-15-18-21-24-3-5-8-11-14-4-7-10-13-6-830-3-6-9-12-15-9-11011852-1-4-12-14-38191613107-15-17-651614242118-18-7-921311213229-21-101-1108182927

  • *

  • *Global Alignment vs. Local Alignmentglobal alignment: local alignment:

  • *Maximum-sum intervalGiven a sequence of real numbers a1a2an , find a consecutive subsequence with the maximum sum.

    9 3 1 7 15 2 3 4 2 7 6 2 8 4 -9

    For each position, we can compute the maximum-sum interval ending at that position in O(n) time. Therefore, a naive algorithm runs in O(n2) time.

  • *Computing a segment sum in O(1) time?Input: a sequence of real numbers a1a2anQuery: the sum of ai ai+1aj

    Kun-Mao Chao

  • *Computing a segment sum in O(1) timeprefix-sum(i) = a1+a2++aiall n prefix sums are computable in O(n) time.sum(i, j) = prefix-sum(j) prefix-sum(i-1)prefix-sum(j)ijprefix-sum(i-1)

    Kun-Mao Chao

  • *Maximum-sum interval(The recurrence relation)Define S(i) to be the maximum sum of the intervals ending at position i.

    If S(i-1) < 0, concatenating ai with its previous interval gives less sum than ai itself.

  • *Maximum-sum interval(Tabular computation)9 3 1 7 15 2 3 4 2 7 6 2 8 4 -9S(i) 9 6 7 14 1 2 5 1 3 4 6 4 12 16 7The maximum sum

  • *Maximum-sum interval(Traceback)9 3 1 7 15 2 3 4 2 7 6 2 8 4 -9S(i) 9 6 7 14 1 2 5 1 3 4 6 4 12 16 7The maximum-sum interval: 6 -2 8 4

  • *An optimal local alignment

    Si,j: the score of an optimal local alignment ending at (i, j) between a1a2ai and b1b2bj.With proper initializations, Si,j can be computed as follows.

  • *local alignment C G G A T C A TC T T A A C TMatch: 8Mismatch: -5Gap symbol: -3

    000000000085200852053008531302000852110000853?000

  • *local alignment C G G A T C A TC T T A A C TMatch: 8Mismatch: -5Gap symbol: -3The best score

    0000000000852008520530085313020008521100008531310000085211808525313107053021310818

  • * C G G A T C A TC T T A A C TThe best scoreA C - T A T C A T8-3+8-3+8 = 18

    0000000000852008520530085313020008521100008531310000085211808525313107053021310818

  • *Now try this example in classSequence A: CAATTGASequence B: GAATCTGC

    Their optimal local alignment

  • *Did you get it right? G A A T C T G CC AA T T G A

    000000000000008528008855305008161310742005132421181512002102119292623085718162637340516131513233432

  • * G A A T C T G CC AA T T G AA A T T G A A T C T G8+8+8-3+8+8 = 37

    000000000000008528008855305008161310742005132421181512002102119292623085718162637340516131513233432

  • *Osamu Gotoh

  • *Affine gap penaltiesMatch: +8 (w(a, b) = 8, if a = b)Mismatch: -5 (w(a, b) = -5, if a b)Each gap symbol: -3 (w(-,b) = w(a,-) = -3)Each gap is charged an extra gap-open penalty: -4.C - - - T T A A C T C G G A T C A - - T +8 -3 -3 -3 +8 -5 +8 -3 -3 +8 = +12-4-4Alignment score: 12 4 4 = 4

  • *Affine gap panaltiesA gap of length k is penalized x + ky.gap-open penaltygap-symbol penaltyThree cases for alignment endings:...x ...x...x ...-...- ...xan aligned paira deletionan insertionThis is the same as the scoring schemethat penalizes the first symbol x + yand an extended symbol y.

  • *Affine gap penaltiesLet D(i, j) denote the maximum score of any alignment between a1a2ai and b1b2bj ending with a deletion.Let I(i, j) denote the maximum score of any alignment between a1a2ai and b1b2bj ending with an insertion.Let S(i, j) denote the maximum score of any alignment between a1a2ai and b1b2bj.

  • *Affine gap penalties(A gap of length k is penalized x + ky.)

  • *Affine gap penaltiesSID-y-x-y-x-y-yw(ai,bj)

  • *Constant gap penaltiesMatch: +8 (w(a, b) = 8, if a = b)Mismatch: -5 (w(a, b) = -5, if a b)Each gap symbol: 0 (w(-,b) = w(a,-) = 0)Each gap is charged a constant penalty: -4.C - - - T T A A C T C G G A T C A - - T +8 0 0 0 +8 -5 +8 0 0 +8 = +27-4-4Alignment score: 27 4 4 = 19

  • *Constant gap penaltiesLet D(i, j) denote the maximum score of any alignment between a1a2ai and b1b2bj ending with a deletion.Let I(i, j) denote the maximum score of any alignment between a1a2ai and b1b2bj ending with an insertion.Let S(i, j) denote the maximum score of any alignment between a1a2ai and b1b2bj.

  • *Constant gap penalties

  • *Restricted affine gap panaltiesA gap of length k is penalized x + f(k)y. where f(k) = k for k cFive cases for alignment endings:...x ...x...x ...-...- ...xand 5. for long gapsan aligned paira deletionan insertion

  • *Restricted affine gap penalties

  • *D(i, j) vs. D(i, j)Case 1: the best alignment ending at (i, j) with a deletion at the end has the last deletion gap of length = D(i, j)Case 2: the best alignment ending at (i, j) with a deletion at the end has the last deletion gap of length >= c D(i, j)
  • *Max{S(i,j)-x-ky, S(i,j)-x-cy}kcS(i,j)-x-cy

    ***********************************************