perform NW-algorithm

Anonymous
Asked: Dec 11th, 2017
$20

Question description

The file (Question file.doc) contains my questions, and the other file (Assignment B) is referred to in the questions.

MUTATION

Mutational implications on the residues in biosequences: Mutations can be classified in several different ways. This tutorial focuses on sorting mutations by their effect on the structure of DNA or a chromosome. For this categorization, mutations can be separated into two main groups, each with multiple specific types. The two general categories are large-scale and small-scale mutations.

Small-scale mutations: These affect the DNA at the molecular level by changing the normal sequence of nucleotide base pairs. These types of mutations may occur during DNA replication in either meiosis or mitosis. There are three possible small-scale mutations: substitution, deletion and insertion, as described below. The occurrence of substitutions, deletions and insertions is in general due to mutations. A mutation refers to an event in which a DNA gene is damaged or changed in such a way as to alter the genetic message carried by that gene. An agent or substance that causes such a permanent alteration to the physical composition of a DNA gene (so that the genetic message is changed) is called a mutagen.

Large-scale mutations: These mutations affect entire portions of the chromosome. Some large-scale mutations affect only single chromosomes; others occur across nonhomologous pairs. Some large-scale mutations in the chromosome are analogous to the small-scale mutations in DNA; the difference is that for large-scale mutations entire genes or sets of genes are altered rather than only a single nucleotide of the DNA. Single chromosome mutations are most likely to occur through some error in the DNA replication stage of cell growth, and therefore could occur during meiosis or mitosis. Mutations involving multiple chromosomes are more likely to occur in meiosis, during the crossing-over that takes place in prophase I.
Large-scale mutations are of the deletion, duplication, inversion, insertion, translocation and non-disjunction types, as will be explained later.

Mutation and derivatives: Mutation results in a change in DNA, usually in its sequence, the number of copies of a sequence that are present, how the DNA is arranged, or its location (namely, on which chromosome). Use one or more of the following methods for mutating the design, and build both the resulting single strand and the "duplex" relevant to a query sequence.

Small-scale mutation - definition:

(A) Point mutation: Substitute an individual base with another. A point mutation, or single base modification, is a type of mutation that causes a change in a single nucleotide base via substitution, insertion, or deletion in the genetic material, DNA or RNA. Some common substitutions: A for C; A for G; C for T; G for T; A for T; G for C. A point mutant is an individual that is affected by a point mutation.

Illustration of three types of point mutations to a codon: Schematic of a single-stranded RNA molecule illustrating a series of three-base codons. Each three-nucleotide codon corresponds to an amino acid when translated to protein. When one of these codons is changed by a point mutation, the corresponding amino acid of the protein is changed.

Substitution: A substitution is a mutation that exchanges one base for another (i.e., a change in a single "chemical letter", such as switching an A to a G). Such a substitution could (i) change a codon to one that encodes a different amino acid and cause a small change in the protein produced. For example, sickle cell anemia is caused by a substitution in the beta-hemoglobin gene, which alters a single amino acid in the protein produced; (ii) change a codon to one that encodes the same amino acid and causes no change in the protein produced.
These are called silent mutations; and (iii) change an amino-acid-coding codon to a single "stop" codon and cause an incomplete protein. This can have serious effects, since the incomplete protein is unlikely to be functionally useful.

Insertion: These are mutations in which extra base pairs are inserted into a new place in the DNA.

Deletion: These are mutations in which a section of DNA is lost, or deleted; that is, a segment of a sequence is removed.

Frameshift: The term frameshift mutation indicates an insertion or deletion of base pairs (in a number not divisible by three) that shifts the reading frame. Since protein-coding DNA is divided into trinucleotides, insertions and deletions can alter a gene so that its message is no longer correctly parsed. Such changes are called frameshifts. (For example, consider the sentence, "The fat cat sat." Each word represents a codon. If we delete the first letter and parse the sentence in the same way, it doesn't make sense.) With frameshifts, a similar error occurs at the DNA level, causing the codons to be parsed incorrectly. This usually generates truncated proteins that are as uninformative as "hef atc ats at".

Transposition: Move a segment of the sequence from one place to another in the overall order.

Duplication: Repeat a section of the sequence one or more times.

Repeat-induced point (RIP) mutations: These are recurring point mutations. RIP is a genome defense in fungi that hypermutates repetitive DNA. It is suggested that RIP limits the accumulation of transposable elements [M. E. Hood, M. Katawczik and T. Giraud: Repeat-induced point mutation and the population structure of transposable elements in Microbotryum violaceum. Genetics, 2005, vol. 170(3), 1081–1089].

Large-scale mutations – definitions

Deletion: Large-scale deletion is a single chromosome mutation. This involves the loss of one or more genes from the parent chromosome.

Duplication: Duplication is the addition of one or more genes that are already present in the chromosome.
This is a single chromosome mutation.

Inversion: This involves inverting a segment of the sequence, say, a complete reversal of one or more genes within a chromosome. The genes are retained post-inversion, but their order is reversed relative to the parent chromosome. This is also a single chromosome mutation. That is, inversions are one type of genetic mutation that creates changes within a chromosome.

Insertion: Large-scale insertion involves multiple chromosomes. For this type of insertion, one or more genes are removed from one chromosome and inserted into another nonhomologous chromosome. This can occur through an error during prophase I of meiosis, when the chromosomes are swapping genes to increase diversity.

Translocation: Translocation also involves multiple nonhomologous chromosomes. Here a chromosome swaps one or more genes with another chromosome.

Non-disjunction: A non-disjunction mutation does not involve any errors in DNA replication or crossing-over. Instead, these mutations occur during anaphase and telophase, when the chromosomes are not separated properly into the new cells. Common non-disjunctions are missing or extra chromosomes. When gametes with non-disjunctions are produced during meiosis, the result can be an offspring with a monosomy or trisomy (referring to a missing or extra homologous chromosome).

Effects of mutations

The effects of mutations may range from nothing at all to unviability of a cell. Mutations in coding regions can affect the proteins created during protein synthesis, but not all mutations have a significant impact on the final product. Such effects can also differ between small-scale and large-scale mutations.

Conservative substitution: This refers to a nucleotide mutation that alters the amino acid sequence of the protein, causing the substitution of one amino acid with another that has a side chain with similar charge/polarity characteristics.
The size of the side chain may also be an important consideration. Conservative mutations are generally considered unlikely to profoundly alter the structure or function of a protein, but there are many exceptions.

Non-conservative substitution: This corresponds to a mutation that results in the substitution of one amino acid within a polypeptide chain with an amino acid belonging to a different physico-chemical group, such as a different polarity/charge group.

Convergent and parallel substitutions: In comparisons among orthologous proteins from a given set of species, convergent substitutions at a particular site refer to independent changes from different ancestral amino acids to the same derived amino acid. In illustration (a), there is a change from G (the ancestral state) to T (the derived state) in one species, and a change from A to T in another species. The convergent substitutions are denoted by red bold lines. Parallel substitutions at a site refer to independent changes from the same ancestral amino acid to the same derived amino acid. In illustration (b), changes from A to T occurred in two different species. The parallel substitutions are denoted by red bold lines. In sets of closely related species, parallelism is generally more common than convergence simply because, at any given site, close relatives will be more likely to share the same ancestral state prior to the occurrence of independent substitutions [J. F. Storz: Causes of molecular convergence and parallelism in protein evolution. Nature Reviews Genetics, 2016, vol. 17, 239-250].

Coincidental substitutions: The occurrence of two substitutions at the same nucleotide site in two homologous sequences.

--------------------------------------------------------------------
Example A: Assume a hypothetical initial strand … AAAAGGGGTTTTGACC … and perform an insertion version of mutation with a sub-sequence inserted at an arbitrary location.
Solution: For the assumed strand, inserting a sub-sequence, for example 'CCCC', at an arbitrary location results in the following: ... AAAAGGCCCCGGTTTTGACC ...

--------------------------------------------------------------------
Example B: Suppose one or more ancestral sequences are given. Assuming that different types of mutational changes occur as indicated, evaluate the outcomes.

Ancestral sequence: … A C C C T A C G …

Presumed mutational change type, with the result on the sequence(s):

No change (retained as it is): …ACCCTACG… …ACCCTACG…
Single substitution (C→A): … A A A A T A A G …
Multiple sequential substitutions (G→A→T): … A A A A T A A G … → … A A A A T A A C … → … A A A A T A A T …
Back substitution (C→T→C): … A A A A T A A C … → … A A A A T A A T … → … A A A A T A A C …
Coincidental substitutions (T→G): With reference to two homologous sequences, two substitutions at the same nucleotide site.
    Homolog sequence Y1: … A A A T A A T …
    Homolog sequence Y2: … C A A T A A T …
    Coincidental substitutions (shown bold in the original):
    Y1*: … A A A G A A T …
    Y2*: … C A A G A A T …
Parallel substitutions at a site (T → C or G): Independent changes from the same ancestral amino acid to the same derived amino acid.
    Given homolog species Z: … G A A A C A A T …
    Parallel mutations (shown bold in the original):
    Z1*: → … G A A A C A A C …
    Z2*: → … G A A A C A A G …
Convergent substitutions: Independent changes from different ancestral amino acids to the same derived amino acid.
    Say two different ancestral AAs, Z1 and Z2, are considered:
    Z1: … A A T G A T    Z2: ... A A T
    Independent changes from different ancestral amino acids to the same derived residue, T.

--------------------------------------------------------------------
Problems on mutational changes

Problem B.1 Construct a matrix of the set {A, C, T, G} to illustrate the characteristic of the transition and transversion mutations.
(Hint: You may use a score of 100% for the matrix element pertinent to no mutation, for example 100% for A-to-A as shown, and use prorated percentages for the other elements to illustrate the characteristic above. The ratio of transitions to transversions among spontaneous base substitutions is approximately 2:1. Therefore each transition should have a probability of 2/3 and each transversion 1/3.)

Answer:

        A       C       T       G
A       100%
C
T
G

--------------------------------------------------------------------
Problem # B.2 (a) A strand is presumably mutated at a location underlined in the sequences shown below. In each case, (i) write down the eventual resulting strand for the following additional mutations happening in succession:

1) Inversion of some subsequence part
2) Deletion of some subsequence part
3) Transposition of some subsequence part
4) Duplication of a base pair in the sequence
5) Point mutation of one base into another

(Hint: The answers may depend on subjective selection as required.)

(a) …..TTAAGGGGGGCCTTTTGAAA…. Example answer: (5) GGGG → GGCG
(b) …..AAAAGGGGGGCCTTGGGACC….
(c) …..CCAAGGGTGTCCTTTTGAGG….

(ii) In each case, also write down the corresponding final resulting duplex strands at the pre-RNA level.

1. GGCG (Example answer: CCGC)
2. AAATAA
3. TCCT
4. CGTTA

--------------------------------------------------------------------
Problem B.2 (b) Consider an ancestral sequence: …ACCCTAC… Suppose a sequence of changes occurs on this segment as follows: single substitution, no change, multiple substitutions, back substitution, parallel substitutions, coincidental substitutions and convergent substitutions. Write down two possible resulting sequences.
1) Initial: ACCCTAC
   Single Sub: (A→G)CCCTAC (Example answer)
   No Change:
   Multiple Sub:
   Back Sub:
   Parallel Sub:
   Coincidental Sub:
   Convergent Sub:

2) Initial: ACCCTAC
   Single Sub:
   No Change:
   Multiple Sub: A(C→T→A)CCTTC (Example answer)
   Back Sub:
   Parallel Sub:
   Coincidental Sub:
   Convergent Sub:

--------------------------------------------------------------------
Problem B.3 Given a sequence:

X: 5’ - TAC GGA TCG AAT GCT CCC GTA ATC – 3’

Suppose the following mutations have occurred in succession: a single point mutation, deletion of a triplet, and duplication of a triplet twice in succession; and the resulting complementary strand is found to be:

Y: 3’ – ATG CCT AAC TTA CGG CAT CAT CAT TAG – 5’

Find:
Single point mutation – (Example answer: Highlight in YELLOW. In triplet 3, the second base C was changed to T: .... GGA TCG .... to .... CCT AAC ....)
Deletion of a triplet – Highlight in RED.
Duplication of a triplet – Highlight in GREEN.

Trace/identify all the mutational changes that occurred from X to Y.

--------------------------------------------------------------------
SEQUENCE ALIGNMENT & SCORING: TUTORIAL EXERCISES

Problems on sequence alignment and scoring

The following notations are used to denote the alignment status between a pair of sequences:

(i) Vertical bar: Identical residues
(ii) One dot: Somewhat similar residues
(iii) Two dots: Very similar residues

EXAMPLE
x: AAGCTTACGCAAACCG
   | · | : || · | ·· ·:
y: GCTCACGGTTGCCACT

Problem B.4 (i) Apply the above notations for the alignment status for the given pair:

….L F D E L N R V V..........
    | | | : : | . | .
….L F D D I N Q V L ……..
(ii) Denoting “s” for transition mutation and “v” for transversion mutation, determine the sites at which s or v have occurred in the following test pair (a, b):

Example: Black brackets and V denote transversions, while green S denotes transition.
x: Q [Q D] [I L] F ....
   S S
y: Q [D Q] [L V] V ....
   V V

Test pair:
a: F Q D I L F R R D D I I I F Q L
b: F D Q L V V R E N D D D N Q F I
   V V V V
Find transitions in y after all transversions have occurred. V→I, V→F, E→R, D→I, N→I, N→L.

--------------------------------------------------------------------
Problems on basic pairwise alignment procedure

Example: Consider the following two short nucleotide sequences, each of seven residues only. Construct two possible alignments allowing two gaps. (A gap is defined as any maximal consecutive run of spaces in a single string of a given alignment. Gaps facilitate creating alignments that better conform to underlying biological models and more closely fit the patterns expected of a meaningful alignment.)

X: T A C C A G T
Y: C C C G T A A

Solution(s):

(i)  X: T A C C A G T - -
     Y: C - C C - G T A A

(ii) X: T A C C A G T - -
     Y: - - C C C G T A A

Problem B.5 Consider the following two short nucleotide sequences, each of 10 residues only. Construct two possible alignments allowing three gaps. (A gap is defined as above.)

X: A A C C A G T A A T
Y: T C C C T A A G T T

--------------------------------------------------------------------
Problems on scoring the alignments

Example: Consider the following alignment:

S :x T: ATCG GATGGAC ACGGAAT - CC

This alignment has four gaps containing a total of six spaces (-).
Further, it can be described as having five matches and two mismatches.

Problem B.6 Consider the following two pairs of aligned nucleotide sequences (X, Y) and (U, V). (i) Describe the alignments in each pair in terms of the counts of existing matches, mismatches, gaps and spaces. (ii) Apply the following award/penalty scoring scheme and compare the alignment (X, Y) versus (U, V) in terms of the overall scores obtained in each case.

Scoring scheme:
Match: say, 100
Mismatch, transition (Purine ↔ Purine or Pyrimidine ↔ Pyrimidine): say, 75
Mismatch, transversion (Purine ↔ Pyrimidine or Pyrimidine ↔ Purine): say, 10
Space: −50

X: A - GCC - ATATA
Y: A G G AC A A - T T A
U: AGCCATATA
V: AGCAATTA

Example answer for X, Y: This alignment has two gaps with three total spaces. It also has five matches and three mismatches. The score for this alignment would be (5 × 100) + (75 × 1) + (10 × 2) − (50 × 3) = 445.

Deduce the score for U : V (550).

--------------------------------------------------------------------
Problems on: Sequence similarity and notion of “distance”

Given two character strings, the measures of “distance” between them are: (i) statistical distances (in the Euclidean sense) such as the Mahalanobis distance and its variations; (ii) Hamming distance; and (iii) Levenshtein distance (edit distance).

Hamming distance

The Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. That is, it measures the minimum number of substitutions required to change one string into the other, or the minimum number of errors that could have transformed one string into the other.
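As a small illustration of the definition above, the Hamming distance can be sketched in Python as follows. The function name `hamming_distance` is a hypothetical choice made for this sketch and is not part of the assignment; the code simply counts the positions at which two equal-length strings differ.

```python
def hamming_distance(s: str, t: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(s) != len(t):
        # The Hamming distance is only defined for strings of equal length.
        raise ValueError("strings must have the same length")
    return sum(a != b for a, b in zip(s, t))


if __name__ == "__main__":
    print(hamming_distance("AGTC", "CGTA"))          # 2
    print(hamming_distance("KENTUCKY", "TENTURKI"))  # 3
```

Because the comparison is strictly position by position, no gaps are considered; distances that allow insertions and deletions need the edit distance instead.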
Examples

AGTC
CGTA
Hamming distance (HD) = 2

KENTUCKY
TENTURKI
Hamming distance = 3 (K/T, C/R, Y/I)

Edit distance

This refers to the edit distance (also known as Levenshtein distance) between two sequences, expressed in terms of the minimal number of operations (indels and substitutions) needed to transform one sequence into the other. The edit distance approximately indicates the number of DNA replications that have taken place between two sequences. That is, the Levenshtein distance (LD) is a string metric for measuring the difference between two sequences. Simply put, the LD between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.

Examples

AG–TCC
CGCTCA
Levenshtein distance (LD) = 3

For the two sequences indicated below, determine the minimum number of edit operations required to transform one into the other.

X: ACCUGA
Y: AGCUA

Solution:
Step (i): Substitution of C by G in X: → AGCUGA
Step (ii): Indel operation (insertion of G) in Y: → AGCUGA

Thus, a minimum of two edit operations is required for the transformation. (Note that edit distance implies that all operations are exercised on only one sequence.) Therefore, the Levenshtein score is: 2

--------------------------------------------------------------------
Problem # B.7

(i): Consider the following two sequences, namely:

X: A G T G G G C A T T C C T T
Y: T T C T A G A A T T T C T G T T

The following alignment is done with minimal editing, permitting identity matches and transitions [purines (A G) or pyrimidines (C T)] to stay. Determine the associated Levenshtein score.

X*: AGTGGGCATTCCTT
Y*: -TCTA-GA-TTCTGTT

Is a better score feasible in aligning X and Y? If so, indicate it.

(ii): Perform a visual comparison of each of the following putatively related pair of nucleic acid sequences x and y.
Indicate in your answer the identities by 1s on matches and 0s on mismatches. Also indicate mutations by the scoring indices (notations) S for transition and V for transversion:

Pair 1:
x: T CG [C T] G G C G C A A A C C G
   1 0 0 0 0 1 0 1 0 0 0 0 0 0 0
y: C C [T C] A G G G T T G C A A C A
(Answer hint: Bases bracketed in red in sequence y show transversions from the red bracketed bases in x.)

Pair 2:
x: A A [G C A G T C] T C A A [A C G G]
   1 0 1 0 0 0 0 1 0 0 0 0 0 0 0
y: A [C T C A C G] T T T G [G G [A C]] C

(iii): Determine the edit distance in transforming one into the other of the following sequence pair:

X: TACCAGT
Y: CCCGUAA

Problem # B.8

(a) Determine the Hamming distance between CUMBERLAND and TIMBERLAND.
(b) Determine the Levenshtein distance between: (a) BIOINFORMATICS and BIOINFORMATION; (b) TELE-INFORMATICS and HAIL-FORMATION; (c) TELEINFORMATICS and C-O-NFORMATION.

--------------------------------------------------------------------
Problem # B.9 The Hamming distance (HD) measurements between locally-aligned pairs of nucleotide sequences (s and t) are as shown.

(a)
       s           t           HD(s,t)
I      GGU         UGG         2
II     AGCAA       ACAUA       3
III    AGCACACA    ACACACUA    6

(Answers?) New HD is 2.

Considering group III of the sequences s and t (that is, AGCACACA and ACACACUA), align the associated eight nucleotides by introducing two gaps in each sequence randomly so that the HD score (or cost) is optimally improved. (The gap denotes a deletion in a sequence or an insertion in the sequence being compared.)

(b)
       s           t           HD(s,t)
I      CCU         UCC         2
II     UCGUU       UGUAU       3
III    UCGUGUGU    UGUGUGAU    6

New HD is 2.

Considering group III of the sequences s and t (i.e., UCGUGUGU and UGUGUGAU), align the associated eight nucleotides by introducing two gaps in each sequence randomly so that the HD score (or cost) is optimally improved. (The gap denotes a deletion in a sequence or an insertion in the sequence being compared.)
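The Levenshtein (edit) distance defined earlier can be sketched with the usual dynamic-programming recurrence. This is only a minimal illustration under unit costs for insertion, deletion and substitution; the function name `levenshtein_distance` is a hypothetical choice for this sketch. It also shows why allowing gaps lowers the cost for group III of Problem B.9 from 6 to 2.

```python
def levenshtein_distance(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn s into t (unit cost per edit)."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, b in enumerate(t, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # delete a from s
                            curr[j - 1] + 1,      # insert b into s
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(levenshtein_distance("ACCUGA", "AGCUA"))        # 2
    print(levenshtein_distance("AGCACACA", "ACACACUA"))   # 2
```

Only two rows of the table are kept at a time, which is enough when just the distance (not the alignment itself) is needed.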
--------------------------------------------------------------------
Example: Assuming 0 insertion(s) and 2 deletion(s), determine the Hamming distance between:

s: AGACCA
t: CACACA

Positions:  3 5 7 1
Insertions: * * * *
Deletions:  C A T A

Answer: HD(s,t) = 4

Problem # B.10

AGCAACCA
ACACACAT

With reference to the above example, obtain solutions for the following cases:

Assume: 1 insertion(s), 1 deletion(s)
==================================
Assume: 0 insertion(s), 2 deletion(s)
==================================
Assume: 1 insertion(s), 1 deletion(s)
==================================
Assume: 1 insertion(s), 1 deletion(s)
==================================
Assume: 0 insertion(s), 2 deletion(s)
==================================
Assume: 0 insertion(s), 2 deletion(s)

--------------------------------------------------------------------
Problem # B.11 Find the Hamming distance between 2 sequences after applying 2 random insertions or deletions on a pair of original sequences (s, t) given below:

s: TGCACACC
t: TCACACTC
s’: TGCAACTC
t’: TGCAACTC
(HD is 0?)

==================================
Problem # B.12 Given the following two binary sequences X and Y, plot the HD between them as a function of binary digit locations in the 0 to (about) 100 binary residues listed below. Hence indicate the most common substring locations between them.

X: 0110111001011010011 0110 0011111110100101011111011110011010101110 0011011100010100110100011111001111
Y: 1010001010011011100101100011111110100101011111011110011010100001 011010001111110 00101110011110101111

(Hint: Select a window of size 4. For each window, calculate HD.
Plot window # versus HD. Note: In the case of binary strings, the HD is decided by an XOR operation across the two residues placed one below the other in the strings; the number of 1’s in each window of the resulting XOR output string is counted, and this count depicts the HD.)

Problem # B.13 For the two binary sequences X and Y indicated above in Problem B.12, plot the Kullback-Leibler (KL) measure between the strings. Hence confirm the most common substring locations between them as decided via the HD measure in the previous problem.

(Hint: Again, select a window of size 4. For a given sequence in each window, calculate the KL measure. Plot window # versus KL = KL1 + KL2 for each string, where
KL1 = [p(0) log_e(p(0)/q(1))]_window#1 + ….
KL2 = [q(1) log_e(q(1)/p(0))]_window#1 + ….
p(0): Probability of 0 in that window; q(1): Probability of 1 in that window.)

--------------------------------------------------------------------
SEQUENCE ALIGNMENT: NW & SW ALGORITHMS

Implementing NW- and SW-algorithms

Outlined earlier are details on the sequence alignments of interest in bioinformatics. The pertinent global- and local-alignment algorithms were explained as conforming to (i) a global optimization strategy that enforces the alignment across the entire span of all the query sequences and (ii) a local alignment strategy that identifies alignments only in the locally significant segments of similarity (within the long sequences). The relevant computational schemes, based on dynamic programming, are the NW-algorithm for global alignment and the SW-algorithm for local alignment. Presented in this tutorial are exemplars of such algorithms using hypothetical sequences; following the exemplar set of solved exercises, a set of problems (with solution hints/answers if needed) is presented.
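To make the contrast between the global and local strategies concrete, the following Python sketch computes only the optimal alignment score under the commonly taught textbook recurrences. It is an illustration under stated assumptions and not the scoring used in this tutorial's worked example: the function name `alignment_score` and the scores (match +1, mismatch −1, gap −1) are hypothetical choices. The global (NW-style) variant penalizes leading gaps and reads the score from the last cell; the local (SW-style) variant clamps every cell at zero and takes the best cell anywhere in the matrix.

```python
def alignment_score(x: str, y: str, local: bool = False,
                    match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Best global (default) or local alignment score of x and y.

    The scoring values are illustrative assumptions, not the tutorial's.
    """
    rows, cols = len(x) + 1, len(y) + 1
    h = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are charged in the first row/column.
        for i in range(1, rows):
            h[i][0] = i * gap
        for j in range(1, cols):
            h[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if x[i - 1] == y[j - 1] else mismatch
            value = max(h[i - 1][j - 1] + s,   # align x[i-1] with y[j-1]
                        h[i - 1][j] + gap,     # gap in y
                        h[i][j - 1] + gap)     # gap in x
            if local:
                value = max(value, 0)          # a local alignment may restart
            h[i][j] = value
            best = max(best, value)
    return best if local else h[-1][-1]


if __name__ == "__main__":
    print(alignment_score("TTACGTT", "GGACGGG"))              # global: -1
    print(alignment_score("TTACGTT", "GGACGGG", local=True))  # local: 3
```

In this toy pair only the shared segment ACG is similar, so the local score stays high while the global score is pulled down by the flanking mismatches.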
Sequence alignment - global, local and glocal versions: Examples

To illustrate the differences in the post-alignment results of global, local and glocal alignments enforced on a pair of sequences, the following examples are furnished.

EXAMPLE: This example illustrates the differences in the post-alignment results of global versus local alignments performed on a pair of sequences. Suppose a pair of hypothetical amino acid sequences X and Y subjected to the alignment procedure is as follows:

X: AGPSSKQNGKPSSRIWDN
Y: ANITKSAGKPAIMRLGDD

The results of global and local alignments of X and Y are shown below so as to understand the underlying differences. (The results shown are obtained using the NW- and SW-algorithms as explained in later examples; performing those algorithms on X and Y given above will be indicated as problem exercises to solve.)

1. Global alignment: Result of aligning X and Y over their entire lengths via the NW algorithm. When the alignment is completed, both sequences are the same length.

X: AGPSSKQNGKPS−SRIWDN
   |    |  |||   |  |
Y: AN−ITKSAGKPAIMRLGDD

2. Local alignment: Result of aligning X and Y via the SW algorithm, showing the longest or best subsequence pair that has maximum similarity.

X: −−−−−−−NGKP−−−−−−−−
           |||
Y: −−−−−−−AGKP−−−−−−−−

EXAMPLE: The results of global and glocal alignments of two nucleotide sequences u and v are presented below to understand the underlying differences.

u: TGTCTGTGGGTGG
v: TGCTTG

1. Global alignment

u: TGTCTGTGGGTGG
   || |  |   | |
v: TG−C−−T−−−T−G

2.
Glocal (semiglobal) alignment

u: TGTCTG−TGGGTGG
   ||  ||
v: TGCTTG

The result of (1) shown is obtained using the NW-algorithm. The semi-global algorithm is also based on the NW-algorithm, modified as follows: once the NW-algorithm-based updating of the values of the underlying matrix is completed, the trace-back is started at the greatest element of the last row of the alignment matrix (scores matrix), or of the last column if there are more rows than columns. This is in contrast with the NW-algorithm, where the trace-back commences from the absolute last cell (leading cell) of the matrix.

Global sequence alignment: Implementation of NW algorithm

As explained before, the global alignment of sequences is based on dynamic programming. The NW algorithm can be adopted to align protein or nucleotide sequences. The underlying optimal global alignment can be understood with the examples and exercises furnished below.

EXAMPLE: Given the sequence pair X: C T C G T and Y: C T A A G T, the problem is to determine the best global alignment between them via the trace-back procedure using the NW algorithm.

Solution

Global alignment refers to aligning sequences over their entire length, resulting in sequences of the same length. The global alignment of the sequences can be determined using the Needleman-Wunsch (NW) algorithm.

Step 1: Construct a matrix for the two sequences as shown.

         X:  C   T   C   G   T
Y:  C
    T
    A
    A
    G
    T

Step 2: Initialization: Cells representing identities are scored 1, and cells representing mismatches are scored 0.

         X:  C   T   C   G   T
Y:  C        1   0   1   0   0
    T        0   1   0   0   1
    A        0   0   0   0   0
    A        0   0   0   0   0
    G        0   0   0   1   0
    T        0   1   0   0   1

Step 3: Add “dummy” columns (I2, I1, ...) and rows (J2, J1, ...) at the end of the matrix as illustrated, and fill these columns and rows with zeros. Mark the leading cell, LC.
(The last corner cell on the right side of the matrix is designated as the (i = I3, j = J3)th cell, as shown.)

            I7  I6  I5  I4  I3  I2  I1
            C   T   C   G   T   (Dummy)
J8  C       1   0   1   0   0   0   0
J7  T       0   1   0   0   1   0   0
J6  A       0   0   0   0   0   0   0
J5  A       0   0   0   0   0   0   0
J4  G       0   0   0   1   0   0   0
J3  T       0   1   0   0   1   0   0
J2  Dummy   0   0   0   0   0   0   0
J1  Dummy   0   0   0   0   0   0   0

LC: the cell at (I3, J3), with value 1.

Step 4: Starting from the leading cell, LC (designated as the (i = I3, j = J3)th cell), perform the following procedure: move to the (I2, J2)th cell (going diagonally downward) and apply the following algorithm to update the value in LC.

Algorithm: Update the (leading) cell entry by adding the maximum value encountered at A, the (i – 1, j – 1)th cell, or along the tracks B or C, where:

A: the value seen at the (I2 = i – 1, J2 = j – 1)th cell
B: the maximum value seen at J2, J1, ..., etc. along the jth column
C: the maximum value seen at I2, I1, ..., etc. along the jth row

Observed maximum value at A or along B and C: 0
Existing value in the LC: 1
Hence, the updated value for the leading cell is (1 + 0) = 1.

Step 5: Use the same procedure and algorithm as above to update the value for each cell in the entire matrix, pursuing the track-route indicated below:

(a) After updating the value in the LC, consider the other cells one-by-one along row J3, following the track moving leftward and encountering the cells I4, I5, ..., and I7. The corresponding updated values (row J3) are shown in the matrix below:

            I7  I6  I5  I4  I3  I2  I1
            C   T   C   G   T   (Dummy)
J8  C       1   0   1   0   0   0   0
J7  T       0   1   0   0   1   0   0
J6  A       0   0   0   0   0   0   0
J5  A       0   0   0   0   0   0   0
J4  G       0   0   0   1   0   0   0
J3  T       0   1   0   0   1   0   0
J2  Dummy   0   0   0   0   0   0   0
J1  Dummy   0   0   0   0   0   0   0

(b) Next, considering the cells one-by-one, I3, I4, ..., and I7, encountered along the upper row J4 (again following the track moving leftward), use the same procedure and algorithm as above to update the existing value in each cell.
Corresponding updated values are shown bold in the matrix indicated below: I7 I6 I5 I4 I3 C T C G T Dummy J8 C 1 0 1 0 0 I2 I1 J7 T 0 1 0 0 1 0 0 J6 A 0 0 0 0 0 0 0 J5 A 0 0 0 0 0 0 0 J4 G 1 1 1 2 0 0 0 J3 T 0 1 0 0 0 0 J2 0 0 0 1 LC 0 0 0 J1 0 0 0 0 0 0 Dummy (c) I6 J8 Dummy (b) I7 Likewise, considering the cells one-by-one, I3, I4, ..., and I7, encountered along the next upper row- J5 (by following the track, moving leftward) and using the same procedure and algorithm indicated above the existing value in each cell is updated. Corresponding updated values are shown bold in the matrix indicated below: I7 I6 I5 I4 I3 C T C G T Dummy J8 C 1 0 1 0 0 I2 I1 J7 T 0 1 0 0 1 0 0 J6 A 0 0 0 0 0 0 0 J5 A 2 2 2 1 0 0 0 Page 18 of 39 19 J4 G 1 1 1 2 0 0 0 J3 T 0 1 0 0 0 0 J2 0 0 0 1 LC 0 0 0 J1 0 0 0 0 0 0 Dummy (d) The aforesaid procedure is repeated for the cells one-by-one, I3, I4, ..., and I7. encountered along the next upper rows- J6 through J7 (by following the track, moving leftward); and, by using the procedure and algorithm indicated above, the existing value in each cell is updated. 
Corresponding updated values are shown bold in the following matrices: I6 I5 I4 I3 C T C G T Dummy J8 C 1 0 1 0 0 I2 I1 J7 T 0 1 0 0 1 0 0 J6 A 0 2 2 1 0 0 0 J5 A 2 2 2 1 0 0 0 J4 G 1 1 1 2 0 0 0 J3 T 0 1 0 0 0 0 J2 0 0 0 1 LC 0 0 0 J1 0 0 0 0 0 0 I7 I6 I5 I4 I3 C T C G T Dummy Dummy J8 C 1 0 1 0 0 I2 I1 J7 T 0 3 2 1 1 0 0 J6 A 0 2 2 1 0 0 0 J5 A 2 2 2 1 0 0 0 J4 G 1 1 1 2 0 0 0 J3 T 0 1 0 0 0 0 J2 0 0 0 1 LC 0 0 0 J1 0 0 0 0 0 0 Dummy (e) I7 Lastly the procedure repeated for the cells one-by-one, I3, I4, ..., and I7 encountered along the next upper row- J8 (by following the track, moving leftward) completes the updating with the values shown bold in the following matrix I7 I6 I5 I4 I3 C T C G T Page 19 of 39 Dummy 20 J8 C J7 T J6 A J5 A J4 G J3 T Dummy 4 0 0 2 1 0 2 3 2 2 1 1 3 2 2 2 1 0 1 1 1 1 2 0 0 1 0 0 0 1 LC I2 I1 0 0 0 0 0 0 0 0 0 0 J2 0 0 0 0 0 0 J1 0 0 0 0 0 0 (f) The best global alignment is then determined using the back-tracing method as follows: (i) Starting from the lead cell (LC), trace an upward diagonal path. This is done regardless if the cell corresponds to a match or a mismatch. C T A A G T C 4 0 0 2 1 0 T 2 3 2 2 1 1 C 3 2 2 2 1 0 G 1 1 1 1 2 0 T 0 1 0 0 0 1 LC (ii) This trace-path is continued until an island is met. An island refers to a set of 4 cells with 3 or more entries that are identical as shown with bold entries below. C T A A G T C 4 2 2 2 1 0 T 2 3 2 2 1 1 C 3 2 2 2 1 0 G 1 1 1 1 2 0 T 0 1 0 0 0 1 (iii) From the island, there are two possible paths that can be taken: 1. First vertical, then diagonal C T A C 4 2 2 T 2 3 2 C 3 2 2 Page 20 of 39 G 1 1 1 T 0 1 0 21 A G T 2 1 0 2 1 1 2 1 0 1 2 0 0 0 1 2. First horizontal, then diagonal: C T A A G T C 4 2 2 2 1 0 T 2 3 2 2 1 1 C 3 2 2 2 1 0 G 1 1 1 1 2 0 T 0 1 0 0 0 1 Back-tracing: Summary: 1. 2. 3. 4. 
Start from lead cell Go diagonal from the cell to the next regardless of the entry in the new cell corresponds to a match or a mismatch Continue on diagonal trace until an “island” shown with bold score entries is reached. (The island can be identified as the set of 4 cells with 3 or more entries are identical in the four cells) Across the island, go “vertical” and continue on diagonal track; or, go “horizontal” and, then continue on the diagonal trace, as feasible. The choice of picking one of the two paths depends on the number of cells ahead of the island - whichever path that has the more cells traced ahead should be the selected path. In the present example, taking the vertical path leads to four cells, whereas the horizontal path has only one ahead. Therefore, the vertical path is selected. (g) Now, the given sequences can be aligned as per the following rules: I. A vertical track implies introducing a gap in the X (upper) sequence at its site II. A horizontal track implies introducing a gap in the Y (bottom) sequence at its site Relevant to present example of sequences X: CTCGT and Y: CTAAGT, the trace-back is illustrated below Page 21 of 39 22 Gap C T C G T C T A A G T Hence, the aligned sequences are written as follows, with the gap introduced. on sequence X: X: Y: C T – C G T | | | | C T A A G T (i) In summary, the NW-algorithm involves: (i) Setting up the matrix, (ii) updating the scores in the matrix cells and (iii) identifying the optimal alignments via a trace-back suite. (The four aspects of trace-back procedure are as follows: (a) Encountering identity of residues at a site between X and Y: Stay tracking along the diagonal; (b) mismatch of residues encountered at a site between X and Y: Stay tracking along the diagonal; (c) gap in top (X) sequence corresponds to tracking vertically in the island; and (d) gap in the bottom (Y) sequence corresponds to track horizontally in the island). 
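The matrix-updating steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the assignment's prescribed solution: the function names `nw_score_matrix` and `nw_best_score` are hypothetical, the matrix is held with X along the rows and Y along the columns, and a single dummy row and column of zeros is assumed to be enough because all scores are non-negative. The trace-back with "islands" is not implemented here; only the score update is shown.

```python
def nw_score_matrix(x, y):
    """Fill the score matrix from the bottom-right corner (hypothetical helper).

    Each cell gets its match score (1 or 0) plus the largest value found at the
    diagonal neighbour A = (i+1, j+1), along the row through A, or along the
    column through A.
    """
    n, m = len(x), len(y)
    # One dummy row and one dummy column of zeros at the end of the matrix.
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):          # rows, bottom to top
        for j in range(m - 1, -1, -1):      # cells, moving leftward
            match = 1 if x[i] == y[j] else 0
            along_row = max(s[i + 1][j + 1:])                           # A and the row beyond A
            along_col = max(s[k][j + 1] for k in range(i + 1, n + 1))   # A and the column beyond A
            s[i][j] = match + max(along_row, along_col)
    # Drop the dummy row and column before returning.
    return [row[:m] for row in s[:n]]


def nw_best_score(x, y):
    """Largest entry in the first row or first column of the final matrix."""
    s = nw_score_matrix(x, y)
    return max(max(s[0]), max(row[0] for row in s))


if __name__ == "__main__":
    for row in nw_score_matrix("CTCGT", "CTAAGT"):
        print(row)
    print("best score:", nw_best_score("CTCGT", "CTAAGT"))
```

For X = CTCGT and Y = CTAAGT this reproduces the final score matrix of step (f), with 4 in the top-left cell.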
Furnished below is an illustrative tutorial on the trace-back procedure relevant to the NW algorithm. Based on the vertical or horizontal trace-pursuit at the island indicated above, the following can be specified as alignment rules:

• Outside the island, where the diagonal pursuit is done, the trace implies a region where "identity" or "mismatch" of residues across the compared sequences exists, without any gaps at those sites.
• Within the island, if a vertical pursuit is done, it implies a "gap" to be introduced at the site of the upper sequence, for example in U of a hypothetical pair of sequences, U and V:

   U: A A G - C T G
      | |       | |
   V: A A T C G T G

• Within the island, if a horizontal pursuit is done, it implies a "gap" to be introduced at the site of the bottom sequence, for example in V of a hypothetical pair of sequences, U and V:

   U: G A G C C T A
      | |       | |
   V: G A T - G T A

______________________________________________
EXAMPLE
Given the sequence pair:
   x: W F G Q E T S A I S
   y: S F T Q F S E D A I
Perform an NW-algorithm based comparison between the two sequences and elucidate the optimal pathway in aligning them globally.

Solution
Step 1: Develop a score matrix for the two sequences, initializing the cells representing identities with the score value 1 and the cells representing mismatches with the score value 0. The constructed initial matrix is shown below:

        y:   S  F  T  Q  F  S  E  D  A  I
   x:   W    0  0  0  0  0  0  0  0  0  0
        F    0  1  0  0  1  0  0  0  0  0
        G    0  0  0  0  0  0  0  0  0  0
        Q    0  0  0  1  0  0  0  0  0  0
        E    0  0  0  0  0  0  1  0  0  0
        T    0  0  1  0  0  0  0  0  0  0
        S    1  0  0  0  0  1  0  0  0  0
        A    0  0  0  0  0  0  0  0  1  0
        I    0  0  0  0  0  0  0  0  0  1
        S    1  0  0  0  0  1  0  0  0  0

Step 2: Add "dummy" rows and columns at the end of the matrix and fill these columns and rows with zeros. Name the lines of the matrix along x (i: I12, I11, ..., I3) and along y (j: J12, J11, ..., J3) as shown. Identify the corner-most cell (the last cell of the matrix, not including the dummy rows/columns) and designate it as the leading cell, LC, specified with the coordinate (i, j).

                  J12 J11 J10  J9  J8  J7  J6  J5  J4  J3  J2  J1
                   S   F   T   Q   F   S   E   D   A   I  (dummy)
   I12  W          0   0   0   0   0   0   0   0   0   0   0   0
   I11  F          0   1   0   0   1   0   0   0   0   0   0   0
   I10  G          0   0   0   0   0   0   0   0   0   0   0   0
   I9   Q          0   0   0   1   0   0   0   0   0   0   0   0
   I8   E          0   0   0   0   0   0   1   0   0   0   0   0
   I7   T          0   0   1   0   0   0   0   0   0   0   0   0
   I6   S          1   0   0   0   0   1   0   0   0   0   0   0
   I5   A          0   0   0   0   0   0   0   0   1   0   0   0
   I4   I          0   0   0   0   0   0   0   0   0   1   0   0
   I3   S          1   0   0   0   0   1   0   0   0   0   0   0     (the entry in column J3 is LC)
   I2   (dummy)    0   0   0   0   0   0   0   0   0   0   0   0
   I1   (dummy)    0   0   0   0   0   0   0   0   0   0   0   0

Step 3: As described in the previous example, starting from the leading cell (LC: i, j), move to the (i − 1, j − 1) cell (downward and diagonal). Then update the LC score value, S(i, j), by adding the maximum of the following three observed values as per the NW algorithm:
• S(i − 1, j − 1)
• Maximum of S(i − 1, j − 2), S(i − 1, j − 3), ..., S(i − 1, j − n)
• Maximum of S(i − 2, j − 1), S(i − 3, j − 1), ..., S(i − n, j − 1)

This procedure is repeated for all the cells one-by-one, encountered along the lines I3, I4, ..., I12 and J3, J4, ..., J12. The completed updating of the scores leads to the values shown in the following matrix:

        y:   S  F  T  Q  F  S  E  D  A  I
   x:   W    5  4  4  4  3  3  2  2  1  0
        F    4  5  4  3  4  3  2  2  1  0
        G    4  4  4  3  3  3  2  2  1  0
        Q    4  4  3  4  3  3  2  2  1  0
        E    4  4  3  3  3  2  3  2  1  0
        T    3  3  4  3  3  2  2  2  1  0
        S    3  2  2  2  2  3  2  2  1  0
        A    1  1  1  1  1  1  1  1  2  0
        I    1  1  1  1  1  0  0  0  0  1
        S    1  0  0  0  0  1  0  0  0  0     (the last entry is LC)

Step 4: Using the final score matrix, a trace-back from the LC is done via diagonal pursuit, resorting to a vertical or horizontal path whenever an island is encountered. The track is followed on the final score matrix of Step 3, and the aligned sequences are therefore:

   x: W F G Q E T S - - A I S
        |   |     |     | |
   y: S F T Q F - S E D A I

--------------------------------------------------------------------------------
EXAMPLE
The following pair of sequences is indicated in [T. K. Attwood and D. J. Parry-Smith: Introduction to Bioinformatics. Pearson Education Ltd., Essex UK: 1999], where the relevant global alignment via the NW algorithm is described. The present exercise is to obtain the final score matrix given in the cited reference for the test pair and verify the result on alignment as posted:

   u: ADLGAVFALCDRYFQ
   v: ADLGRTQNCDRYYQ

The final gapped alignment shown in the cited reference is:

   u: A D L G A V F A L C D R Y F Q
      | | | |           | | | |   |
   v: A D L G R T Q N - C D R Y Y Q

--------------------------------------------------------------------------------
Local sequence alignment: Implementation of SW algorithm

The local alignment of sequences is based on dynamic programming using the SW algorithm, which can be adopted to align protein and/or nucleotide sequences. The underlying optimal local alignment can be understood with the examples and exercises furnished below.

EXAMPLE
Given a pair of sequences, u and v, as shown below, establish their best local alignment using the SW algorithm.
   u: ...W R N D C Q E G S A...
   v: ...W G Q E G S I E A...

Solution
Application of the Smith-Waterman (SW) algorithm towards aligning u and v and elucidating the local alignment conforms to the procedure with the following steps:
(1) Construct an initial matrix framed with the u and v residues as shown below.
(2) Add a set of edge elements x: "0" along the left-most column and the top-most row of the matrix as shown.
Inasmuch as the first row and first column cannot be an end-point of any alignment, the zeros (x: 0) are introduced as indicated so as to serve as dummy placeholders.
(3) Next, populate the cells: those corresponding to identical matches (of the residues of u and v at the cell site) are scored with entries of "1", and those corresponding to mismatches are scored with entries of "0".

        u:   x  W  R  N  D  C  Q  E  G  S  A
   v:   x    0  0  0  0  0  0  0  0  0  0  0
        W    0  1  0  0  0  0  0  0  0  0  0
        G    0  0  0  0  0  0  0  0  1  0  0
        Q    0  0  0  0  0  0  1  0  0  0  0
        E    0  0  0  0  0  0  0  1  0  0  0
        G    0  0  0  0  0  0  0  0  1  0  0
        S    0  0  0  0  0  0  0  0  0  1  0
        I    0  0  0  0  0  0  0  0  0  0  0
        E    0  0  0  0  0  0  0  1  0  0  0
        A    0  0  0  0  0  0  0  0  0  0  1

(4) Highlight all the match-score entries (1s) in bold. Relevant to each cell having a match-score entry (1), pursue the following three possible tracks to populate the rest of the cells with updated entries.

(a) Diagonal track: Suppose a cell having a match-score entry (1) is designated with the coordinate (i − 1, j − 1), and the diagonal tracking is done downward towards the cell (i, j). If the score value at (i − 1, j − 1) is S(i − 1, j − 1), then the score at the cell (i, j) is decided as follows:

   S(i, j) = S(i − 1, j − 1) + 1.0, if a similarity (match) of u and v exists at the site (i, j);
   S(i, j) = S(i − 1, j − 1) − 0.3, if a dissimilarity (mismatch) of u and v exists at the site (i, j).

In the above updating of scores, the addition of 1.0 is an "award" given to an observed similarity (match), and the subtraction of 0.3 is a "penalty" given to an observed dissimilarity (mismatch). (The value 0.3 is an approximation of 1/3 depicting the degree-of-freedom.)

[Figure: diagonal track from cell (i − 1, j − 1) to cell (i, j); award: +1.0, penalty: −0.3]

The diagonal-track score update is illustrated with an example: an nth cell reached from an existing score Sn gets an award of (+1.0) due to the identity of the residues (A = A) of u and v, and its updated score therefore becomes (Sn + 1.0). On the other hand, an mth cell reached from a score Sm takes a penalty of (−0.3) due to the mismatch of the residues (C ≠ T) of u and v, and its updated score becomes (Sm − 0.3).

(5) The diagonal pursuit as per the above step is exercised at all those cells that show the score entry "1" as set in the initialization, and all the relevant cells along the diagonal pursuits are updated with the new scores. The updating along a diagonal path is terminated when the computed score value becomes negative; at that cell and subsequently, the score entries along the diagonal path are rendered as "0". Further, the diagonal score-filling is discontinued when a cell having a positive score value (possibly the 1 registered in the initialization) is encountered along the way.

(6) The next step involves performing the following two algorithms, with a horizontal pursuit along a row towards the right and a vertical pursuit along a column downwards, so as to populate the cells with updated entries.

Suppose (i, j) is any cell considered. Then, the horizontal track along the row from this cell leads to a sequential set of cells whose entries are updated with scores using the following algorithm:

   S(i, j + k) = S(i, j) − (1.0 + 0.3 × k),   k = 1, 2, ...

The run over k (denoting the kth cell from (i, j)) is ended when a negative value of the score results; thereupon, the subsequent cells are filled with 0 scores. This horizontal-track based filling is continued and eventually stopped when a match (identity) value of 1 of the initial matrix is encountered on the row.

[Figure: horizontal track from cell (i, j) through S(i, j + 1), S(i, j + 2), ..., S(i, j + k)]

Next, again considering a cell (i, j), the vertical track along the column from this cell leads to a set of sequential cells downward whose entries are updated with scores using the following algorithm:

   S(i + ℓ, j) = S(i, j) − (1.0 + 0.3 × ℓ),   ℓ = 1, 2, ...

The run over ℓ (denoting the ℓth cell from (i, j)) is ended when a negative value of the score results; thereupon, the subsequent cells are filled with 0 scores. This vertical-track based filling is continued and eventually stopped when a match value of 1 of the initial matrix is encountered on the column.

[Figure: vertical track from cell (i, j) through S(i + 1, j), S(i + 2, j), ..., S(i + ℓ, j)]

In the above procedures, the values of k and ℓ denote penalty-lengths that specify the extent of the penalties imposed on the scores of the cells, consistent with the possible deletions. While proceeding along k or ℓ, if a negative value of the computed score results, then the corresponding cell and the cells seen subsequently are filled with 0 scores. (It implies that there is no alignment similarity up to the current cell position.) Further, score-filling horizontally (along the row) or vertically (along the column) is terminated when a cell with the identity (similarity) score of 1 (registered in the initialization) is seen ahead along the way.
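As a cross-check on hand calculations, the diagonal, horizontal and vertical update rules above can be sketched in Python. The sketch below is an assumption-laden illustration rather than the procedure of this tutorial: it uses the standard cell-by-cell Smith-Waterman recurrence (each cell takes the best of zero, the diagonal move, or a gap of any length k with penalty 1 + k/3) instead of tracking outward from the match cells, it uses the exact fraction 1/3 in place of the rounded 0.3, and the names `gap_penalty`, `sw_matrix` and `sw_align` are hypothetical.

```python
from fractions import Fraction

THIRD = Fraction(1, 3)  # assumed exact value behind the rounded 0.3


def gap_penalty(k):
    """Penalty for a gap of length k: 1.0 + (1/3) * k."""
    return 1 + THIRD * k


def sw_matrix(u, v):
    """Score matrix with a dummy first row and column of zeros."""
    n, m = len(u), len(v)
    h = [[Fraction(0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = Fraction(1) if u[i - 1] == v[j - 1] else -THIRD
            best = max(Fraction(0), h[i - 1][j - 1] + sub)       # diagonal track
            for k in range(1, j + 1):                            # horizontal track
                best = max(best, h[i][j - k] - gap_penalty(k))
            for k in range(1, i + 1):                            # vertical track
                best = max(best, h[i - k][j] - gap_penalty(k))
            h[i][j] = best
    return h


def sw_align(u, v):
    """Return (best score, aligned part of u, aligned part of v)."""
    h = sw_matrix(u, v)
    score, i, j = Fraction(0), 0, 0
    for a in range(len(u) + 1):              # first cell holding the largest score
        for b in range(len(v) + 1):
            if h[a][b] > score:
                score, i, j = h[a][b], a, b
    au, av = [], []                          # built back-to-front
    while i > 0 and j > 0 and h[i][j] > 0:
        sub = 1 if u[i - 1] == v[j - 1] else -THIRD
        if h[i][j] == h[i - 1][j - 1] + sub:                     # came diagonally
            au.append(u[i - 1]); av.append(v[j - 1])
            i, j = i - 1, j - 1
            continue
        k = next((k for k in range(1, j + 1)
                  if h[i][j] == h[i][j - k] - gap_penalty(k)), None)
        if k is not None:                                        # gap in u
            au.extend("-" * k); av.extend(reversed(v[j - k:j]))
            j -= k
            continue
        k = next(k for k in range(1, i + 1)
                 if h[i][j] == h[i - k][j] - gap_penalty(k))     # gap in v
        au.extend(reversed(u[i - k:i])); av.extend("-" * k)
        i -= k
    return score, "".join(reversed(au)), "".join(reversed(av))


if __name__ == "__main__":
    print(sw_align("WRNDCQEGSA", "WGQEGSIEA"))
```

Exact fractions are used so that the trace-back can compare scores for equality without rounding trouble; with floating-point values such as 0.3 or 0.33 a tolerance would be needed instead.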
Thus, commencing from each cell (i − 1, j − 1) with the initialized score entry S(i − 1, j − 1) = 1, the updating of score values is done via: (i) the diagonal passage from cell (i − 1, j − 1) to (i, j), (ii) horizontal path-lengths, k (along the row), or (iii) vertical path-lengths, ℓ (along the column).

[Figure: diagonal path from (i − 1, j − 1) to (i, j), with the k-path along the row and the ℓ-path along the column]

(7) Now, considering the alignment exercise in hand, the diagonal, horizontal and vertical scoring procedures indicated above are performed in order to update the cell scores using the aforesaid algorithms pertinent to the SW scheme of local alignment. In the original illustration the updated scores are shown bold, the struck-out values are the existing (initial) scores, and all the pertinent diagonal pursuits are marked with arrows; the resulting final values are those of the score matrix in step (8).

(8) Alignment via trace-back: The trace-back refers to commencing a trace bottom-up from the largest score value observed on the final score matrix and proceeding diagonally upward, as illustrated below.

        u:   x     W     R     N     D     C     Q     E     G     S     A
   v:   x    0     0     0     0     0     0     0     0     0     0     0
        W    0     1     0     0     0     0     0     0     0     0     0
        G    0     0     0.66  0     0     0     0     0     1     0     0
        Q    0     0     0     0.33  0     0     1     0     0     0.66  0
        E    0     0     0     0     0     0     0     2     0.67  0.33  0.33
        G    0     0     0     0     0     0     0     0.66  3     1.66  1.33
        S    0     0     0     0     0     0     0     0.33  1.66  4     2.66
        I    0     0     0     0     0     0     0     0     1.33  2.66  3.66
        E    0     0     0     0     0     0     0     1     1.00  2.33  2.33
        A    0     0     0     0     0     0     0     0     0.66  2     2

The largest score is 4, at the cell (S, S); tracing diagonally upward passes through the scores 3, 2 and 1. Hence, the final result on local alignment is as follows:

   Q E G S
   | | | |
   Q E G S

--------------------------------------------------------------------------------
EXAMPLE
Given a pair of sequences, u and v, as shown below, determine the best local alignment via the trace-back method using the SW algorithm.
   u: AASTHECWCTWH
   v: AASRNPSCWTTWHT

Solution
Step 1: As indicated in the previous example, the starting point for the Smith-Waterman algorithm is to construct the matrix with the given residue sequences and initialize it with edge elements x: 0. These denote the placeholder that accommodates the following condition: the first row and the first column of the matrix cannot form the end-point of any specified alignment.

Step 2: Next, the cells in the matrix representing identities of residues (between u and v) are scored 1, and the rest of the cells, representing mismatches, are scored 0. Shown below is the resulting matrix after Steps 1 and 2. (The mismatch values "0" are omitted in the matrix illustration for clarity and appear as dots.)

        v:   x  A  A  S  R  N  P  S  C  W  T  T  W  H  T
   u:   x    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
        A    0  1  1  .  .  .  .  .  .  .  .  .  .  .  .
        A    0  1  1  .  .  .  .  .  .  .  .  .  .  .  .
        S    0  .  .  1  .  .  .  1  .  .  .  .  .  .  .
        T    0  .  .  .  .  .  .  .  .  .  1  1  .  .  1
        H    0  .  .  .  .  .  .  .  .  .  .  .  .  1  .
        E    0  .  .  .  .  .  .  .  .  .  .  .  .  .  .
        C    0  .  .  .  .  .  .  .  1  .  .  .  .  .  .
        W    0  .  .  .  .  .  .  .  .  1  .  .  1  .  .
        C    0  .  .  .  .  .  .  .  1  .  .  .  .  .  .
        T    0  .  .  .  .  .  .  .  .  .  1  1  .  .  1
        W    0  .  .  .  .  .  .  .  .  1  .  .  1  .  .
        H    0  .  .  .  .  .  .  .  .  .  .  .  .  1  .

Step 3: With reference to the initialized matrix above, the cells are filled with updated scores following the algorithm indicated in the last example.
The final updated matrix is:

        v:   x    A    A    S    R    N    P    S    C    W    T    T    W    H    T
   u:   x    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
        A    0    1    1    0    0    0    0    0    0    0    0    0    0    0    0
        A    0    1    2    0.7  0.3  0    0    0    0    0    0    0    0    0    0
        S    0    0    0.7  3    1.7  1.3  1    1    0    0    0    0    0    0    0
        T    0    0    0.3  1.7  2.7  1.3  1    0.7  0.7  0    1    1    0    0    1
        H    0    0    0    1.3  1.3  2.3  1    0.7  0.3  0.3  0    0.7  0.7  1    0
        E    0    0    0    1    1    1    2    0.7  0.3  0    0    0    0.3  0.3  0.7
        C    0    0    0    0.7  0.7  0.7  0.7  1.7  1.7  0    0    0    0    0    0
        W    0    0    0    0.3  0.3  0.3  0.3  0.3  1.3  2.7  0    0    1    0    0
        C    0    0    0    0    0    0    0    0    1.3  1.3  1.3  0    0    0.7  0
        T    0    0    0    0    0    0    0    0    0    1    2.3  2.3  0    0    1.7
        W    0    0    0    0    0    0    0    0    0    1    1    2    3.3  0    0
        H    0    0    0    0    0    0    0    0    0    0    0.7  0.7  2    4.3  0

Step 4: Using the final score matrix, the trace-back is performed starting from the highest value (4.3). The diagonal pursuit is continued along the path having the highest values until an "island" is met. (The description of an island is given earlier in the exercise pertinent to the NW algorithm.) Here the trace is directed in the horizontal and then the vertical direction, and is then pursued diagonally to 2.3 and further. It implies introducing a gap between the H and E residues of u. Hence, the aligned sequences are written as follows:

   u: A A S T H - E C W C T W H
      | | |       : | |   | | |
   v: A A S R N P S C W T T W H

The local-alignment segment is shown bold in the original.
--------------------------------------------------------------------------------
EXAMPLE
With reference to the following pair of sequences, perform an SW-algorithm based comparison and elucidate the locally significant, common regions of similarity.
   u: WYGQEQSYIQ
   v: WYTQETSDIQ

Solution
Step 1: Pertinent to implementing the Smith-Waterman algorithm, construct the initial score matrix with the edge elements x: 0 being a placeholder, as was done in the prior examples. Next, the cells representing identities are scored 1 and those representing mismatches are scored 0. The resulting matrix is shown below, with the 0 scores on mismatches omitted for clarity (they appear as dots).

        v:   x  W  Y  T  Q  E  T  S  D  I  Q
   u:   x    0  0  0  0  0  0  0  0  0  0  0
        W    0  1  .  .  .  .  .  .  .  .  .
        Y    0  .  1  .  .  .  .  .  .  .  .
        G    0  .  .  .  .  .  .  .  .  .  .
        Q    0  .  .  .  1  .  .  .  .  .  1
        E    0  .  .  .  .  1  .  .  .  .  .
        Q    0  .  .  .  1  .  .  .  .  .  1
        S    0  .  .  .  .  .  .  1  .  .  .
        Y    0  .  1  .  .  .  .  .  .  .  .
        I    0  .  .  .  .  .  .  .  .  1  .
        Q    0  .  .  .  1  .  .  .  .  .  1

Step 2: The cells are populated with updated scores via application of the SW algorithm: the diagonal pursuit, followed by the row-wise and column-wise pursuits, is exercised at each cell showing the entry 1 (match score). The resulting final score matrix is as follows:

                  J0   J1   J2   J3   J4   J5   J6   J7   J8   J9   J10
             v:   x    W    Y    T    Q    E    T    S    D    I    Q
   I0   u:   x    0    0    0    0    0    0    0    0    0    0    0
   I1        W    0    1    0    0    0    0    0    0    0    0    0
   I2        Y    0    0    2    0.7  0.3  0    0    0    0    0    0
   I3        G    0    0    0.7  1.7  0.3  0    0    0    0    0    0
   I4        Q    0    0    0.3  0.3  2.7  1.3  1    0.7  0.3  0    1
   I5        E    0    0    0    0    1.3  3.7  2.3  1    0.7  0.3  0
   I6        Q    0    0    0    0    1    2.3  3.3  2    1.7  1.3  1.3
   I7        S    0    0    0    0    0.7  2    2    4.3  3    2.7  2.3
   I8        Y    0    0    1    0    0.3  1.7  1.7  3    4    2.7  2.3
   I9        I    0    0    0    0.7  0    1.3  1.3  2.7  2.7  5    3.7
   I10       Q    0    0    0    0    1.7  1    1    2.3  2.3  3.7  6

Step 3: The trace-back pathway conforms to the passage that accumulates the most matches. The common regions of similarity can be determined by referencing these matches on the final score matrix of Step 2, where the path runs along the main diagonal from the score 6 at (I10, J10) back to the score 1 at (I1, J1).

The aligned sequences are as follows; the matched residues (shown bold in the original) could be the locally significant aligned pairs of interest.

   u: W Y G Q E Q S Y I Q
      | |   | |   |   | |
   v: W Y T Q E T S D I Q

PROBLEMS/EXERCISES ON NW and SW ALIGNMENTS

EXAMPLE - NW Algorithm
Given a pair of sequences, U and V, as indicated below, determine the best global alignment via trace-back using the NW algorithm.
   U: CACTHETW
   V: CACSCATTW

Solution: Hand Calculation

        V:   C  A  C  S  C  A  T  T  W
   U:   C    4  2  3  2  2  1  1  0  0
        A    2  3  2  2  1  2  1  0  0
        C    2  1  2  1  2  1  1  0  0
        T    0  0  0  0  0  0  1  1  0
        H    0  0  0  0  0  0  0  0  0
        E    0  0  0  0  0  0  0  0  0
        T    0  0  0  0  0  0  1  1  0
        W    0  0  0  0  0  0  0  0  1

Alignment =
   C A C T - H E T W
   | | | :       | |
   C A C S C A T T W
--------------------------------------------------------------------------------
Problem B.14 NW Algorithm
Given a pair of sequences, X and Y, as indicated below, determine the best global alignment via trace-back using the NW algorithm.
(i) X: GAGCA
    Y: GATTCA

Solution Hint
The following is the solution on alignment:
   G A - G C A
   | |     | |
   G A T T C A
--------------------------------------------------------------------------------
Problem B.15 NW Algorithm
Given the sequence pair:
   x: W F G Q F T S A I W
   y: S S T Q F S E D A I
Perform an NW-algorithm based comparison between the two sequences and elucidate the optimal pathway in aligning them globally.
--------------------------------------------------------------------------------
Problem B.16
Assigned is a pair of amino acid sequences (S and T).
Determine the best global alignment.
   S: C U U A C G C A
   T: A U G A G A A C U U

Solution Hint: Final alignment
   S: C U U - A C G - - C - A
   T: A U - G A - G A A C U U
--------------------------------------------------------------------------------
Problem B.17
Given a pair of sequences, U and V, as indicated below, determine the best global alignment via trace-back using the NW algorithm.
   U: CTCGT
   V: CTAAGT

Answer: (ii) Alignment =
   C T - C G T
   | |     | |
   C T A A G T

Problem B.18
Via hand calculations, perform an NW-algorithm based comparison between the two given sequences indicated below and elucidate the maximum path-way:
   MAVRKLSLEG
   MSTALPGLGS

Problem B.19: Bonus Problem for extra credits
Via hand calculations, perform an NW-algorithm based comparison between the two given sequences and elucidate the maximum path-way:
Sequence pair
   WFGQETSAIS
   SFTQFSEDAI
--------------------------------------------------------------------------------
LOCAL ALIGNMENT: SW ALGORITHM

EXAMPLE
Determine the best local alignment via the trace-back method of the Smith-Waterman algorithm.

Step 1: Construct an n × m matrix and add the row of dummy '0's towards initialization.

[Matrix: dummy row and column x filled with 0; columns A U G A G A A C U U; rows C U U A C G C A]

Step 2: Use the Smith-Waterman algorithm to fill the table and find the best scoring path (blank spaces represent '0's).

[Matrix: same rows and columns as in Step 1. The cell positions are not preserved; the non-zero entries, in reading order, are: 1.7 0.3 1.0 1.0 1 1.0 1.0 0.7 1.0 0.7 0.7 1.0 2.0 1.3 1.3 0.3 1.3 1.3 3.0 1.0 0.7 0.7 0.7 1.0 1.0 1.0 0.3 1.0 1.0 0.7 0.7 1.0 0.3 0.3 0.7]

Step 3: The row of dummy '0's is removed.

[Matrix: rows C U U A C G C A, columns A U G A G A A C U U, without the dummy row and column; the largest entry is 3.0]

Results: Best Local Alignment
   C U U
   | | |
   C U U

EXAMPLE
Given the sequence pair, X and Y, as indicated below, determine the best local alignment via trace-back using the SW algorithm.
   X: PAWHEAE
   Y: HEAGEWGHEA

Hand Calculation

        Y:   x  H  E       A       G       E       W       G       H       E  A
   X:   x    0  0  0       0       0       0       0       0       0       0  0
        P    0  0  0       0       0       0       0       0       0       0  0
        A    0  0  0       1       0       0       0       0       0       0  1
        W    0  0  0       0       0.6667  0       1       0       0       0  0
        H    0  1  0       0       0       0.3334  0       0.6667  1       0  0
        E    0  0  2       0.6667  0.3334  1       0       0       0.3334  2  0.6667
        A    0  0  0.6667  3       1.6667  1.3334  1.0001  0.6668  0.3335  0  3
        E    0  0  1       1.6667  2.6667  1.3334  1.0001  0.6668  0.3335  1  1.6667

Answer: Alignment
   W - H E A
   |   | | |
   W G H E A
--------------------------------------------------------------------------------
EXAMPLE
Determine the common regions of similarity (that is, the optimal local alignment) via the SW algorithm.
   VSTVVLENPGLGRALS
   MSTVVTPNPGLGKAS

[Matrix: dummy row and column x filled with 0; rows V S T V V L E N P G L G R A L S; columns M S T V V T P N P G L G K A S. The cell positions are not preserved; the non-zero entries, in reading order, are: 1.0 2.0 1.0 0.7 3.0 1.0 0.7 0.3 1.7 4.0 0.7 1.7 0.3 2.7 3.7 0.3 1.3 2.3 2.3 3.3 1.0 2.0 4.3 0.7 5.3 0.3 6.3 7.3 8.3 8.0 9.0 8.7]

Problem B.20 SW Algorithm
For the following pair of sequences, u and v, obtain the relevant local alignment via the SW algorithm by obtaining the final score matrix as outlined earlier:
   u: ACAGCCUCGCUUAG
   v: AAUGCCAUUGACGG

Solution hint: The local alignment is:
   ... G C C - U C G ...
   ... G C C A U U G ...
--------------------------------------------------------------------------------
Problem B.21
Given a pair of sequences, X and Y, as shown below, determine the best local alignment via trace-back using the SW algorithm.
   X: WRNDCQEGSA
   Y: WGQEGSIEA

Answers
Alignment =
   Q E G S
   | | | |
   Q E G S

Problem B.22: Bonus problem for extra credits
Given a pair of sequences, U and V, as shown below, determine the best local alignment via trace-back using the SW algorithm.
   U: AASTHECWCTWH
   V: AASRNPSCWTTWHT

Answers
Alignment =
   A A S T H - E C W C T W H
   | | |       : | |   | | |
   A A S R N P S C W T T W H
--------------------------------------------------------------------------------
SUBMISSION (HARD COPY AND SOFT COPY) DUE DATE: BY 3rd November, 2017

BONUS CREDIT will be given for:
(a) Neat step-by-step demonstration of the problems
(b) Developing your own MatLab/C/C++ codes as necessary
(c) At least 2 hand-calculations of the NW and SW problems indicated
________________________________________________________________________
Course SPL PROJECT ASSIGNMENTS
Time accommodation for the project: duration 1 hr 20 min

(I) Bonus Problem for extra credits: Problem B.19 (Page 15/ASSIGNMENT B)
Via hand calculations, perform an NW-algorithm based comparison between the two given sequences and elucidate the maximum path-way.
Sequence pair:
   WFGQETSAIS
   SFTQFSEDAI

(II) Bonus Problem for extra credits: Problem B.22 (Page 15/ASSIGNMENT B)
Via hand calculations, determine the best local alignment via trace-back using the SW algorithm for the given sequence pair, U and V, shown below.
   U: AASTHECWCTWH
   V: AASRNPSCWTTWHT

Details and Submission Date
DUE CREDIT will be given for:
(a) Neat step-by-step demonstration of the problems
(b) Developing your own MatLab/C/C++ codes as necessary, supplementing the hand-calculations

Tutor Answer

Olav
School: UIUC

Hello,kindly not...
