MUTATION
Mutational implications on the residues in biosequences: Mutations can be classified in several
different ways. This tutorial sorts mutations by their effect on the structure
of DNA or a chromosome. Under this categorization, mutations fall into two main
groups, each with multiple specific types: large-scale and small-scale mutations.
Small-scale mutations: These affect the DNA at the molecular level by changing the
normal sequence of nucleotide base pairs. They may occur during DNA replication in
either meiosis or mitosis. Three small-scale mutations are possible: substitution,
deletion and insertion, as described below. A mutation refers to an event wherein a DNA
gene is damaged or changed in such a way as to alter the genetic message carried by that
gene. A permanent alteration to the physical composition of a DNA gene (such that the
genetic message is changed) can be caused by an agent or substance called a mutagen.
Large-scale mutations: These mutations affect entire portions of the chromosome. Some
large-scale mutations affect only single chromosomes; others occur across nonhomologous pairs.
Some large-scale mutations in the chromosome are analogous to the small-scale mutations in
DNA; the difference is that large-scale mutations alter entire genes or sets of genes
rather than only a single nucleotide of the DNA. Single-chromosome mutations are most likely to
arise from an error in the DNA replication stage of cell growth, and could therefore occur
during meiosis or mitosis. Mutations involving multiple chromosomes are more likely to occur in
meiosis, during the crossing-over that takes place in prophase I. Large-scale mutations are of
the deletion, duplication, inversion, insertion, translocation and nondisjunction types, as
explained later.
Mutation and derivatives: Mutation results in a change in DNA, usually in its sequence, the
number of copies of a sequence that are present, how the DNA is arranged, or its location
(namely, on which chromosome). One or more of the following methods can be used to mutate a
query sequence and to build both the resulting single strand and the relevant "duplex".
Small-scale mutations – definitions
(A) Point mutation: Substitute an individual base with another. A point mutation is a type of
mutation that causes a single nucleotide base substitution, insertion, or deletion in the genetic
material, DNA or RNA. Some common substitutions: A for C; A for G; C for T; G for T; A for T;
G for C. A point mutant is an individual that is affected by a point mutation.
Illustration of three types of point mutations to a codon.
Schematic of a single-stranded RNA molecule illustrating a series of three-base codons. Each
three-nucleotide codon corresponds to an amino acid when translated to protein. When one of
these codons is changed by a point mutation, the corresponding amino acid of the protein may be
changed.
Substitution: A substitution is a mutation that exchanges one base for another (i.e., a change in a
single "chemical letter", such as switching an A to a G). Such a substitution could: (i) change a
codon to one that encodes a different amino acid and cause a small change in the protein
produced (for example, sickle cell anemia is caused by a substitution in the beta-hemoglobin
gene, which alters a single amino acid in the protein produced); (ii) change a codon to one that
encodes the same amino acid and cause no change in the protein produced (these are called
silent mutations); or (iii) change an amino-acid-coding codon to a single "stop" codon and cause
an incomplete protein. This can have serious effects, since the incomplete protein is unlikely
to be functional.
Insertion: These are mutations in which extra base pairs are inserted into a new place in the
DNA.
Deletion: These are mutations in which a section of DNA is lost, or deleted; that is, a segment
of a sequence is removed.
Frameshift: A frameshift mutation results from the addition or deletion of base pairs in a
number that is not a multiple of three. Since protein-coding DNA is read in trinucleotides
(codons), such insertions and deletions can alter a gene so that its message is no longer
correctly parsed. Such changes are called frameshifts.
(For example, consider the sentence, "The fat cat sat." Each word represents a codon. If
we delete the first letter and parse the sentence in the same way, it doesn't make sense.)
With frameshifts, a similar error occurs at the DNA level, causing the codons to be parsed
incorrectly. The misread message, like "hef atc ats at", is uninformative, and the resulting
protein is usually truncated.
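The sentence analogy above can be tried out directly. The following is a minimal sketch, not part of the original tutorial: the helper name parse_codons is hypothetical, and spaces and the full stop are simply stripped before the letters are regrouped in threes.

```python
def parse_codons(message: str, size: int = 3) -> list[str]:
    """Split a message into consecutive fixed-size 'codons'.

    Spaces and full stops are dropped first, so that only the reading
    frame (where the grouping starts) decides how the letters are parsed.
    """
    letters = message.replace(" ", "").replace(".", "")
    return [letters[i:i + size] for i in range(0, len(letters), size)]


original = "The fat cat sat."
print(parse_codons(original))      # intact reading frame
print(parse_codons(original[1:]))  # first letter deleted: every codon shifts
```

With the first letter deleted, the same letters are regrouped as "hef", "atc", "ats", "at", which is the garbled reading quoted in the text.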
Transposition: Move a segment of the sequence from one place to another in the overall order.
Duplication: Repeat a section of the sequence one or more times.
Repeat-induced point (RIP) mutations: These are recurring point mutations. RIP is a genome
defense in fungi that hypermutates repetitive DNA. It is suggested that RIP limits the
accumulation of transposable elements [M. E. Hood, M. Katawczik and T. Giraud: Repeat-induced
point mutation and the population structure of transposable elements in Microbotryum
violaceum. Genetics, 2005, vol. 170(3), 1081–1089].
Large-scale mutations – definitions
Deletion: Large-scale deletion is a single-chromosome mutation. It involves the loss of
one or more genes from the parent chromosome.
Duplication: Duplication is the addition of one or more genes that are already present in the
chromosome. This is a single-chromosome mutation.
Inversion: This involves inverting a segment of the sequence, that is, a complete reversal of one
or more genes within a chromosome. The genes are retained post-inversion, but their order is
backwards relative to the parent chromosome. This is also a single-chromosome mutation.
Insertion: Large-scale insertion involves multiple chromosomes. For this type of insertion,
one or more genes are removed from one chromosome and inserted into another,
nonhomologous chromosome. This can occur through an error during prophase I of meiosis,
when the chromosomes are swapping genes to increase diversity.
Translocation: Translocation also involves multiple nonhomologous chromosomes. Here,
the chromosomes swap one or more genes with one another.
Nondisjunction: A nondisjunction mutation does not involve any errors in DNA
replication or crossing-over. Instead, these mutations occur during anaphase and
telophase, when the chromosomes are not separated properly into the new cells. Common
nondisjunctions are missing or extra chromosomes. When gametes with nondisjunctions are
produced during meiosis, the result can be an offspring with a monosomy or trisomy (referring
to a missing or extra homologous chromosome).
Effects of mutations
The effects of mutations may range from nothing all the way to unviability of a cell.
Mutations within coding sequences can affect the proteins created during protein synthesis, but
not all mutations have a significant impact on the final product. Such effects can also differ
between small-scale and large-scale mutations.
Conservative substitution: This refers to a nucleotide mutation that alters the amino acid
sequence of the protein by substituting one amino acid with another whose side chain has
similar charge/polarity characteristics. The size of the side chain may also be an
important consideration. Conservative mutations are generally considered unlikely to profoundly
alter the structure or function of a protein, but there are many exceptions.
Nonconservative substitution: This corresponds to a mutation that results in the substitution
of one amino acid within a polypeptide chain with an amino acid belonging to a different
physicochemical group (for example, a different polarity/charge group).
Convergent and parallel substitutions: In comparisons among orthologous proteins from a
given set of species, convergent substitutions at a particular site refer to independent changes
from different ancestral amino acids to the same derived amino acid. In illustration (a),
there is a change from G (the ancestral state) to T (the derived state) in one species, and a change
from A to T in another species; the convergent substitutions are denoted by red bold lines.
Parallel substitutions at a site refer to independent changes from the same ancestral amino acid
to the same derived amino acid. In illustration (b), changes from A to T occurred in
two different species; the parallel substitutions are denoted by red bold lines.
In sets of closely related species, parallelism is generally more common than
convergence simply because, at any given site, close relatives are more likely to share the
same ancestral state prior to the occurrence of independent substitutions [J. F. Storz: Causes of
molecular convergence and parallelism in protein evolution. Nature Reviews Genetics, 2016,
vol. 17, 239–250].
Coincidental substitutions: The occurrence of two substitutions at the same nucleotide site in
two homologous sequences.
Example A: Assume a hypothetical initial strand
… AAAAGGGGTTTTGACC …
and perform an insertion mutation, with a subsequence inserted at an arbitrary location.
Solution
For the assumed strand, an insertion mutation with, for example, the subsequence
‘CCCC’ at an arbitrary location will result in the following:
... AAAAGGCCCCGGTTTTGACC ...
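The small-scale operations defined earlier (insertion, deletion, inversion, duplication and point substitution) can be sketched as plain string operations. This is only an illustrative sketch: all function names are hypothetical, positions are 0-based, and inversion is modelled here as a simple reversal of the segment, following the definition given in this tutorial rather than a reverse complement.

```python
def insert(seq: str, pos: int, sub: str) -> str:
    """Insert subsequence `sub` before 0-based position `pos`."""
    return seq[:pos] + sub + seq[pos:]


def delete(seq: str, start: int, end: int) -> str:
    """Delete the segment seq[start:end]."""
    return seq[:start] + seq[end:]


def invert(seq: str, start: int, end: int) -> str:
    """Reverse the order of the segment seq[start:end] (assumed meaning of inversion)."""
    return seq[:start] + seq[start:end][::-1] + seq[end:]


def duplicate(seq: str, start: int, end: int, times: int = 1) -> str:
    """Repeat the segment seq[start:end] `times` extra times, right after itself."""
    return seq[:end] + seq[start:end] * times + seq[end:]


def substitute(seq: str, pos: int, base: str) -> str:
    """Point mutation: replace the base at 0-based position `pos`."""
    return seq[:pos] + base + seq[pos + 1:]


strand = "AAAAGGGGTTTTGACC"
print(insert(strand, 6, "CCCC"))  # the insertion of Example A
```

Inserting ‘CCCC’ at position 6 reproduces the strand shown in the solution of Example A; the choice of position is arbitrary, as in the example.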
Example B: Suppose one or more ancestral sequences are given. Assuming that the different types
of mutational changes indicated below occur, evaluate the outcomes.
Each presumed mutational change type is listed with its result on the sequence(s):
(1) No change (retained as it is):
    … A C C C T A C G …
    … A C C C T A C G …
(2) Single substitution (C→A):
    … A C C C T A C G …
    … A A A A T A A G …
(3) Multiple sequential substitutions (G→A→T):
    … A A A A T A A G …
    … A A A A T A A C …
    … A A A A T A A T …
(4) Back substitution (C→T→C):
    … A A A A T A A C …
    … A A A A T A A T …
    … A A A A T A A C …
(5) Coincidental substitutions (T→G): with reference to two homologous sequences, two
    substitutions at the same nucleotide site.
    Homolog sequence Y1: … A A A T A A T …
    Homolog sequence Y2: … C A A T A A T …
    After the coincidental substitutions:
    Y1*: … A A A G A A T …
    Y2*: … C A A G A A T …
(6) Parallel substitutions at a site (T→C or G): independent changes from the same ancestral
    amino acid to the same derived amino acid.
    Given homolog species Z:
    Z:   … G A A A C A A T …
    After the parallel mutations:
    Z1*: … G A A A C A A C …
    Z2*: … G A A A C A A G …
(7) Convergent substitutions: independent changes from different ancestral amino acids to the
    same derived amino acid.
    Say two different ancestral AAs, Z1 and Z2, are considered:
    Z1: … A A T G A T
    Z2: ... A A T
    Independent changes from different ancestral amino acids to the same derived residue, T.
Problems on mutational changes
Problem B.1
Construct a matrix over the set {A, C, T, G} to illustrate the characteristics of transition and
transversion mutations.
(Hint: You may use a score of 100% for the matrix elements pertinent to no mutation, for
example 100% for A-to-A as shown, and use prorated percentages for the other elements. The
ratio of transitions to transversions among spontaneous base substitutions is approximately
2:1. Therefore, for each base, the single possible transition should have a probability of 2/3,
and the two possible transversions should share the remaining 1/3.)
Answer:
        A       C       T       G
A       100%
C
T
G
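The layout of such a matrix follows from classifying each pair of bases. The sketch below, with hypothetical names classify and substitution_matrix, only labels every cell as "none", "transition" or "transversion"; it deliberately assigns no percentages, which are left to the exercise.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(a: str, b: str) -> str:
    """Label the substitution a -> b.

    A transition stays within the purines (A, G) or within the
    pyrimidines (C, T); a transversion crosses between the two classes.
    """
    if a == b:
        return "none"
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def substitution_matrix(bases: str = "ACTG") -> dict[str, dict[str, str]]:
    """Build the 4 x 4 table of substitution types over the given bases."""
    return {a: {b: classify(a, b) for b in bases} for a in bases}


for base, row in substitution_matrix().items():
    print(base, row)
```

Each row ends up with one "none" cell on the diagonal, one transition and two transversions.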
Problem # B.2 (a)
A strand is presumably mutated at a location underlined in the sequences shown below.
(i) In each case, write down the eventual resulting strand for the following additional mutations
happening in succession:
1) Inversion of some subsequence
2) Deletion of some subsequence
3) Transposition of some subsequence
4) Duplication of a base pair in the sequence
5) Point mutation of one base into another
(Hint: The answers may depend on subjective selection as required.)
(a) …..TTAAGGGGGGCCTTTTGAAA….
Example answer: (5) GGGG → GGCG
(b) …..AAAAGGGGGGCCTTGGGACC….
(c) ……CCAAGGGTGTCCTTTTGAGG….
(ii) In each case, also write down the corresponding final resulting duplex strands at the
pre-RNA level:
1. GGCG
Example answer: CCGC
2. AAATAA
3. TCCT
4. CGTTA
Problem B. 2(b)
Consider an ancestral sequence:
…ACCCTAC …
Suppose a sequence of changes occur on this segment as follows: Single substitution, no change,
multiple substitutions, back substitution, parallel substitutions, coincidental substitutions and
convergent substitutions.
Write down two possible resulting sequences.
1) Initial:            ACCCTAC
   Single Sub:         (A→G)CCCTAC (Example answer)
   No Change:
   Multiple Sub:
   Back Sub:
   Parallel Sub:
   Coincidental Sub:
   Convergent Sub:
2) Initial:            ACCCTAC
   Single Sub:
   No Change:
   Multiple Sub:       A(C→T→A)CCTTC (Example answer)
   Back Sub:
   Parallel Sub:
   Coincidental Sub:
   Convergent Sub:
Problem B.3
Given a sequence:
X: 5’ – TAC GGA TCG AAT GCT CCC GTA ATC – 3’
Suppose the following mutations have occurred in succession: a single point mutation, deletion
of a triplet, and duplication of a triplet twice in succession; and the resulting complementary
strand is found to be:
Y: 3’ – ATG CCT AAC TTA CGG CAT CAT CAT TAG – 5’
Find:
Single point mutation – (Example answer: highlight in YELLOW. In triplet 3, the
second base C was changed to T: .... GGA TCG .... to .... CCT AAC ....)
Deletion of a triplet – highlight in RED.
Duplication of a triplet – highlight in GREEN.
Trace/identify all the mutational changes that occurred from X to Y.
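When tracing changes between a strand and its complementary strand, it can help to first write out the unmutated complement triplet by triplet. The following is a small sketch with hypothetical helper names (complement, triplets); it complements base by base in the same left-to-right order, which matches the way Y is written 3’ to 5’ under X.

```python
# Base-pairing table for DNA: A<->T and C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Return the base-by-base complement, keeping the written order."""
    return strand.translate(COMPLEMENT)


def triplets(strand: str) -> list[str]:
    """Split a strand (spaces ignored) into consecutive triplets."""
    s = strand.replace(" ", "")
    return [s[i:i + 3] for i in range(0, len(s), 3)]


x = "TAC GGA TCG AAT GCT CCC GTA ATC"
print(triplets(complement(x)))  # unmutated complement, for comparison with Y
```

Comparing this list against the triplets of Y position by position is one way to spot where substitutions, deletions and duplications have taken place.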
SEQUENCE ALIGNMENT & SCORING: TUTORIAL
EXERCISES: Problems on sequence alignment and scoring
The following notations are used to denote the alignment status between a pair of sequences:
(i) Vertical bar: Identical residues
(ii) One dot: Somewhat similar residues
(iii) Two dots: Very similar residues
EXAMPLE
x: AAGCTTACGCAAACCG
 ·  :  ·  ·· ·:
y: GCTCACGGTTGCCACT
Problem B.4
(i) Apply the above notations for the alignment status for the given pair:
….L F D E L N R V V..........
   : :  .  .
….L F D D I N Q V L ……..
(ii) Denoting “s” for a transition mutation and “v” for a transversion mutation, determine
the sites at which s or v have occurred in the following test pair (a, b):
Example: Black brackets and V denote transversions, while Green S denotes transition
x: Q [Q D] [I L] F ....
S
S
y: Q [D Q] [L V] V ....
V
V
Test pair:
a: F Q D I L F R R D D I I I F Q L
b: F D Q L V V R E N D D D N Q F I
V
V
V
V
Find transitions in y after all transversions have occurred.
V→I, V→F, E→R, D→I, N→I, N→L.
Problems on basic pairwise alignment procedure
Example: Consider the following two short nucleotide sequences, each of seven residues only.
Construct two possible alignments allowing two gaps. (A gap is defined as any maximal
consecutive run of spaces in a single string of a given alignment. Gaps facilitate creating
alignments that better conform to underlying biological models and more closely fit the
patterns expected of a meaningful alignment.)
X: T A C C A G T
Y: C C C G T A A
Solution(s) :
(i)
X: T A C C A G T
Y: C C C G T A A
(ii)
X: T A C C A G T
Y: C C C G T A A
Problem B.5
Consider the following two short nucleotide sequences, each of 10 residues only. Construct two
possible alignments allowing three gaps. (A gap is defined as any maximal consecutive run of
spaces in a single string of a given alignment. Gaps facilitate creating alignments that better
conform to underlying biological models and more closely fit the patterns expected of a
meaningful alignment.)
X: A A C C A G T A A T
Y: T C C C T A A G T T
Problems on scoring the alignments
Example
Consider the following alignment:
S: ATCG GATGGAC
T: ACGGAAT CC
This alignment has four gaps containing a total of six spaces. Further, it can be described as
having five matches and two mismatches.
Problem B.6
Consider the following two pairs of aligned nucleotide sequences (X, Y) and (U, V).
(i) Describe the alignments in each pair in terms of the counts of existing matches,
mismatches, gaps and spaces.
(ii) Apply the following award/penalty scoring scheme and compare the alignment
(X, Y) versus (U, V) in terms of the overall scores obtained in each case.
Scoring scheme: Match: say, 100; Mismatch of transition type (Purine ↔ Purine or
Pyrimidine ↔ Pyrimidine): say, 75; Mismatch of transversion type (Purine ↔ Pyrimidine):
say, 10; Space: −50.
X: A GCC ATATA
Y: A G G AC A A T T A
U: AGCCATATA
V: AGCAATTA
Example answer for X, Y: This alignment has two gaps with three total spaces. It also has five
matches and three mismatches (one transition and two transversions). The score for this
alignment would be (5 × 100) + (75 × 1) + (10 × 2) − (50 × 3) = 445.
Deduce the score for U, V. (Answer: 550)
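The award/penalty scheme of Problem B.6 can be applied column by column to any gapped alignment. The sketch below is an illustration only: the function name score_alignment is hypothetical, a space is assumed to be written as "-", and the two test strings in the usage line are made-up sequences chosen to contain five matches, one transition, two transversions and three spaces.

```python
PURINES = {"A", "G"}


def score_alignment(a: str, b: str, match: int = 100, transition: int = 75,
                    transversion: int = 10, space: int = -50,
                    gap_char: str = "-") -> int:
    """Score two equal-length aligned strings column by column."""
    if len(a) != len(b):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for p, q in zip(a, b):
        if p == gap_char or q == gap_char:
            total += space                      # a space in either string
        elif p == q:
            total += match                      # identical residues
        elif (p in PURINES) == (q in PURINES):
            total += transition                 # purine-purine or pyrimidine-pyrimidine
        else:
            total += transversion               # purine-pyrimidine
    return total


# Hypothetical alignment: 5 matches, 1 transition, 2 transversions, 3 spaces.
print(score_alignment("AAAAAGAC---", "AAAAAATATTT"))
```

For this made-up pair the total is (5 × 100) + 75 + (2 × 10) − (3 × 50) = 445, the same arithmetic as in the example answer.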
Problems on: Sequence similarity and notion of “distance”
Given two character strings, measures of the “distance” between them include: (i) statistical
distances (in the Euclidean sense), such as the Mahalanobis distance and its variations; (ii) the
Hamming distance; and (iii) the Levenshtein distance (edit distance).
Hamming distance
The Hamming distance between two strings of equal length is the number of positions at which
the corresponding symbols are different. That is, it measures the minimum number of
substitutions required to change one string into the other, or the minimum number of errors that
could have transformed one string into the other.
Examples
AGTC
CGTA
Hamming distance (HD) = 2
KENTUCKY
TENTURKI
Hamming Distance = 3 (K/T, C/R, Y/I)
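The definition above translates directly into a position-by-position comparison. A minimal sketch follows; the function name hamming is an illustrative choice.

```python
def hamming(s: str, t: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(s, t))


print(hamming("AGTC", "CGTA"))          # first example above
print(hamming("KENTUCKY", "TENTURKI"))  # second example above
```

The two calls give 2 and 3, in agreement with the worked examples.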
Edit distance
The edit distance (also known as the Levenshtein distance, LD) between two sequences is the
minimal number of operations (indels and substitutions) needed to transform one sequence into
the other. It is a string metric for measuring the difference between two sequences, and it can
be read as a rough indicator of the number of mutational events (arising over DNA replications)
that separate the two sequences. Simply put, the LD between two words is the minimum number
of single-character edits (i.e., insertions, deletions or substitutions) required to change one
word into the other.
Examples
AG–TCC
CGCTCA
Levenshtein distance (LD) = 3
For the two sequences indicated below, determine the minimum number of edit operations
required to transform one into the other.
X: ACCUGA
Y: AGCUA
Solution:
Step (i): Substitution of C by G in X: AGCUGA
Step (ii): Indel operation (insertion of G) in Y: AGCUGA
Thus, a minimum of two edit operations is required for the transformation. (Note that
edit distance implies that all operations are exercised on only one sequence; inserting G in Y
is equivalent to deleting G from X.) Therefore, the Levenshtein score is: 2
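The minimum number of edits can also be found mechanically by dynamic programming. The sketch below uses the standard two-row formulation with unit cost for every insertion, deletion and substitution; the function name levenshtein is an illustrative choice, not taken from the tutorial.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn s into t (unit cost for each edit)."""
    # prev[j] = distance between the first i-1 characters of s and t[:j]
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,              # delete a from s
                            curr[j - 1] + 1,          # insert b into s
                            prev[j - 1] + (a != b)))  # match or substitute
        prev = curr
    return prev[-1]


print(levenshtein("ACCUGA", "AGCUA"))  # the worked example above
```

For ACCUGA and AGCUA the function returns 2, in line with the two steps of the worked solution.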
Problem # B.7
(i): Consider the following two sequences, namely:
X: A G T G G G C A T T C C T T T
Y: T C T A G A A T T T C T G T T
The following alignment is done with minimal editing, permitting identity matches and
transitions [purines (A ↔ G) or pyrimidines (C ↔ T)] to stay. Determine the associated
Levenshtein score.
X*: AGTGGGCATTCCTT
Y*: TCTA GA TTCTGTT
Is a better-scoring alignment of X and Y feasible? If so, indicate it.
(ii): Perform a visual comparison of each of the following putatively related pair of nucleic acid
sequences x and y. Indicate in your answer the identities by 1s on match and 0s on mismatches.
Also indicate mutations by the scoring indices (notations) S for transition and V for transversion:
Pair 1:
x:
T CG [C T] G G C G C A A A C C G
1 0 0 0 0 1 0 1 0 0 0 0 0 0 0
y
C C [T C] A G G G T T G C A A C A
(Answer Hint
Bases bracketed in red in sequence Y show transversions from red bracketed sequences in X.)
Pair 2
x:
A A [G C A G T C] T C A A [A C G G]
1 0 1 0 0 0 0 1 0 0 0 0 0 0 0
y
A [C T C A C G] T T T G [G G [A C]] C
(iii): Determine the edit distance in transforming one into the other of the following sequence
pair:
X: TACCAGT
Y: CCCGUAA
Problem # B.8
(a) Determine the Hamming distance between CUMBERLAND and TIMBERLAND.
(b) Determine the Levenshtein distance between: (i) BIOINFORMATICS and
BIOINFORMATION; (ii) TELEINFORMATICS and HAILFORMATION; (iii)
TELEINFORMATICS and CONFORMATION.
Problem # B.9
The Hamming distance (HD) measurements between locally aligned pairs of nucleotide
sequences (s and t) are as shown.
(a)
        s           t           HD(s,t)
I       GGU         UGG         2
II      AGCAA       ACAUA       3
III     AGCACACA    ACACACUA    6
Considering group III of the sequences s and t (that is, AGCACACA and ACACACUA),
align the associated eight nucleotides by introducing two gaps in each sequence so that
the HD score (or cost) is optimally improved. (A gap denotes a deletion in a sequence or an
insertion in the sequence being compared.)
(Answer: the new HD is 2.)
(b)
        s           t           HD(s,t)
I       CCU         UCC         2
II      UCGUU       UGUAU       3
III     UCGUGUGU    UGUGUGAU    6
Considering group III of the sequences s and t (i.e., UCGUGUGU and UGUGUGAU), align
the associated eight nucleotides by introducing two gaps in each sequence so that the
HD score (or cost) is optimally improved. (A gap denotes a deletion in a sequence or an
insertion in the sequence being compared.)
(Answer: the new HD is 2.)
Example
Assuming 0 Insertion(s) and 2 Deletion(s), determine the Hamming distance between:
s: AGACCA
t: CACACA
Positions: 3 5
7 1
Insertions: * *
* *
Deletions: C A
T A
Answer: HD(s,t) = 4
Problem # B.10
AGCAACCA
ACACACAT
With reference to the above example, obtain solutions for the following cases:
Assume 1 Insertion(s) 1 Deletion(s)
==================================
Assume 0 Insertion(s) 2 Deletion(s)
==================================
Assume 1 Insertion(s) 1 Deletion(s)
==================================
Assume 1 Insertion(s) 1 Deletion(s)
==================================
Assume 0 Insertion(s) 2 Deletion(s)
==================================
Assume 0 Insertion(s) 2 Deletion(s)
Problem # B.11
Find the Hamming distance between 2 sequences after applying 2 random insertions or deletions
on a pair of original sequences (s, t) given as below:
s: TGCACACC
t: TCACACTC
s’:TGCAACTC
t’:TGCAACTC
(HD is 0?)
==================================
Problem # B. 12
Given the following two binary sequences X and Y, plot the HD between them as a function of
binary digit location over the 0 to (about) 100 binary residues listed below. Hence indicate the
most common substring locations between them.
X:
0110111001011010011 0110 0011111110100101011111011110011010101110
0011011100010100110100011111001111
Y:
1010001010011011100101100011111110100101011111011110011010100001
011010001111110 00101110011110101111
(Hint: Select a window of size 4. For each window, calculate the HD. Plot window # versus HD.
Note: In the case of binary strings, the HD is obtained by an XOR operation across the two
residues lying one below the other in the strings; the number of 1’s in each window of the
resulting XOR output string is then counted, giving the HD.)
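The windowed XOR count described in the hint can be sketched as follows. The function name windowed_hamming is hypothetical, the windows are assumed to be non-overlapping, blanks inside the strings are ignored, and any trailing bits that do not fill a whole window (or that extend past the shorter string) are dropped; plotting is left out.

```python
def windowed_hamming(x: str, y: str, window: int = 4) -> list[int]:
    """Hamming distance per non-overlapping window of two binary strings.

    Each window pair is XORed and the 1's in the result are counted.
    """
    x = x.replace(" ", "")
    y = y.replace(" ", "")
    n = min(len(x), len(y))
    distances = []
    for start in range(0, n - window + 1, window):
        xor = int(x[start:start + window], 2) ^ int(y[start:start + window], 2)
        distances.append(bin(xor).count("1"))
    return distances


# Short made-up strings; windows with distance 0 mark shared substrings.
print(windowed_hamming("0110 1110 0101", "1010 1110 0100"))
```

Runs of windows with distance 0 point to substrings that the two sequences have in common at the same locations.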
Problem # B.13
For the two binary sequences X and Y indicated above in Problem B.12, plot the Kullback-Leibler
(KL) measure between the strings. Hence confirm the most common substring locations
between them as decided via the HD measure in the previous problem.
(Hint: Again, select a window of size 4. For a given sequence in each window, calculate the KL
measure. Plot window # versus KL = KL1 + KL2 for each string, where
KL1 = {p(0) loge[p(0)/q(1)]}window#1 + …
KL2 = {q(1) loge[q(1)/p(0)]}window#1 + …
p(0): Probability of 0 in that window; q(1): Probability of 1 in that window.)
SEQUENCE ALIGNMENT: NW & SW ALGORITHMS
Implementing NW and SW algorithms
Sequence alignments of interest in bioinformatics were outlined earlier, together with the
pertinent global- and local-alignment algorithms, which conform to (i) a global optimization
strategy that enforces the alignment across the entire span of all the query sequences and (ii) a
local alignment strategy that identifies alignments only in the locally significant segments of
similarity (within the long sequences). The relevant computational schemes based on dynamic
programming are the NW (Needleman-Wunsch) algorithm for global alignment and the SW
(Smith-Waterman) algorithm for local alignment. This tutorial presents exemplars of such
algorithms using hypothetical sequences; following the set of solved examples, a set of problems
(with solution hints/answers where needed) is presented.
Sequence alignment – global, local and glocal versions: Examples
The following examples illustrate the differences in the post-alignment results of global, local
and glocal alignments enforced on a pair of sequences.
EXAMPLE: This example illustrates the differences in the post-alignment results
of global versus local alignments performed on a pair of sequences. Suppose a pair of
hypothetical amino acid sequences X and Y subjected to the alignment procedure is as follows:
X: AGPSSKQNGKPSSRIWDN
Y: ANITKSAGKPAIMRLGDD
The results of global and local alignments of X and Y are shown below so as to understand the
underlying differences. (The results shown are obtained using the NW and SW algorithms as
explained in later examples; performing those algorithms on X and Y given above is
left as a problem exercise.)
1. Global alignment: Result of aligning X and Y over their entire lengths via the NW algorithm.
When the alignment is completed, both sequences are of the same length.
X: AGPSSKQNGKPS−SRIWDN
   |    |  |||   |  |
Y: AN−ITKSAGKPAIMRLGDD
2. Local alignment: Result of aligning X and Y via the SW algorithm, showing the longest or
best subsequence pair that has maximum similarity.
X: −−−−−−−NGKP−−−−−−−−
           |||
Y: −−−−−−−AGKP−−−−−−−−
EXAMPLE: The results of global and glocal alignments of two nucleotide sequences u and v
are presented below to understand the underlying differences.
u: TGTCTGTGGGTGG
v: TGCTTG
1. Global alignment
u: TGTCTGTGGGTGG
   || |  |   | |
v: TG−C−−T−−−T−G
2. Glocal (semiglobal) alignment
u: TGTCTG−TGGGTGG
v: TGCTTG
The result of (1) is obtained using the NW algorithm. The semiglobal algorithm is also
based on the NW algorithm, modified as follows: once the NW-based updating of the values
of the underlying matrix is completed, the traceback is started at the greatest element of the last
row of the alignment matrix (scores matrix), or of the last column if there are more rows than
columns. This is in contrast with the NW algorithm, where the traceback commences from the
absolute last cell (leading cell) of the matrix.
Global sequence alignment: Implementation of NW algorithm
As explained before, the global alignment of sequences is based on dynamic programming.
The NW algorithm can be adopted to align protein or nucleotide sequences. The underlying
optimal global alignment can be understood with the examples and exercises furnished below.
EXAMPLE: Given a pair of sequences, X: C T C G T and Y: C T A A G T, the problem is
to determine the best global alignment between them via the traceback procedure using the NW
algorithm.
Solution
Global alignment refers to aligning sequences over their entire length, resulting in sequences of
the same length. The global alignment of the sequences can be determined using the
Needleman-Wunsch (NW) algorithm.
Step 1: Construct a matrix for the two sequences as shown.
       Y:  C   T   A   A   G   T
X:  C
    T
    C
    G
    T
Step 2: Initialization: Cells representing identities are scored 1, and cells representing
mismatches are scored 0.
       Y:  C   T   A   A   G   T
X:  C      1   0   0   0   0   0
    T      0   1   0   0   0   1
    C      1   0   0   0   0   0
    G      0   0   0   0   1   0
    T      0   1   0   0   0   1
Step 3: Add “dummy” columns (I2, I1, ...) and rows (J2, J1, ...) at the end of the matrix as
illustrated, and fill these columns and rows with zeros. Mark the leading cell, LC (the last
corner cell on the right side of the matrix, designated as the (i = I3, j = J3)th cell, as shown).
             I7  I6  I5  I4  I3  I2  I1
              C   T   C   G   T  (Dummy)
J8  C         1   0   1   0   0   0   0
J7  T         0   1   0   0   1   0   0
J6  A         0   0   0   0   0   0   0
J5  A         0   0   0   0   0   0   0
J4  G         0   0   0   1   0   0   0
J3  T         0   1   0   0   1   0   0
J2  (Dummy)   0   0   0   0   0   0   0
J1  (Dummy)   0   0   0   0   0   0   0
LC: the cell (I3, J3), holding the value 1.
Step 4: Starting from the leading cell, LC (designated as the (i = I3, j = J3)th cell), perform the
following procedure:
Move to the (I2, J2)th cell (going diagonally downward) and apply the following algorithm to
update the value in the LC.
Algorithm: Update the (leading) cell entry by adding to it the maximum value encountered at A,
the (i – 1, j – 1)th cell, or along the tracks B and C, as shown:
Leading cell: (i = I3, j = J3)
A: The maximum value seen at the (I2 = i – 1, J2 = j – 1)th cell
B: The maximum value seen at J2, J1, ..., etc. along the jth column
C: The maximum value seen at I2, I1, ..., etc. along the jth row
Observed maximum value at A or along B and C: 0
Existing value in the LC: 1
Hence, the updated value for the leading cell is: (1 + 0) = 1.
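The update rule of Step 4, applied to every cell starting from the leading cell, can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the tutorial's own implementation: the function name nw_fill is hypothetical, the matrix is stored with X along the rows and Y along the columns, and tracks B and C are assumed to run along the row and the column that pass through the diagonal neighbour A, beyond A itself. The extra last row and column play the role of the zero-filled dummy border.

```python
def nw_fill(x: str, y: str) -> list[list[int]]:
    """Fill the identity-score matrix from the leading (bottom-right) cell.

    Each cell starts as 1 for an identity and 0 for a mismatch, and is
    updated by adding the largest value found at the diagonal neighbour (A)
    or further along that neighbour's row and column (assumed tracks B, C).
    """
    n, m = len(x), len(y)
    # One extra zero row and column act as the "dummy" border.
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):          # rows, bottom to top
        for j in range(m - 1, -1, -1):      # columns, right to left
            score = 1 if x[i] == y[j] else 0
            best = M[i + 1][j + 1]                                    # A
            best = max(best, max(M[i + 1][j + 2:], default=0))        # along the row
            best = max(best, max((M[k][j + 1] for k in range(i + 2, n + 1)),
                                 default=0))                          # along the column
            M[i][j] = score + best
    return [row[:m] for row in M[:n]]       # drop the dummy border


for base, row in zip("CTCGT", nw_fill("CTCGT", "CTAAGT")):
    print(base, row)
```

For X = CTCGT and Y = CTAAGT, the row for the first C of X comes out as [4, 2, 2, 2, 1, 0], so the top-left cell, which accumulates the best total over the whole alignment, holds 4.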
Step 5: Use the same procedure and algorithm as above to update the value of each cell in the
entire matrix, pursuing the track-route indicated below:
(a) After updating the value in the LC, consider the other cells one-by-one along the row J3,
following the track moving leftward and encountering the cells I4, I5, ..., and I7.
The corresponding updated values appear in row J3 of the matrix below:
             I7  I6  I5  I4  I3  I2  I1
              C   T   C   G   T  (Dummy)
J8  C         1   0   1   0   0   0   0
J7  T         0   1   0   0   1   0   0
J6  A         0   0   0   0   0   0   0
J5  A         0   0   0   0   0   0   0
J4  G         0   0   0   1   0   0   0
J3  T         0   1   0   0   1   0   0
J2  (Dummy)   0   0   0   0   0   0   0
J1  (Dummy)   0   0   0   0   0   0   0
(b) Next, considering the cells one-by-one, I3, I4, ..., and I7, encountered along the upper row J4
(by following again the track moving leftward), use the same procedure and algorithm as
above to update the existing value in each cell. The corresponding updated values appear in
row J4 of the matrix below:
             I7  I6  I5  I4  I3  I2  I1
              C   T   C   G   T  (Dummy)
J8  C         1   0   1   0   0   0   0
J7  T         0   1   0   0   1   0   0
J6  A         0   0   0   0   0   0   0
J5  A         0   0   0   0   0   0   0
J4  G         1   1   1   2   0   0   0
J3  T         0   1   0   0   1   0   0
J2  (Dummy)   0   0   0   0   0   0   0
J1  (Dummy)   0   0   0   0   0   0   0
(c)
Likewise, considering the cells one-by-one, I3, I4, ..., and I7, encountered along the next upper
row J5 (by following the track, moving leftward) and using the same procedure and
algorithm indicated above, the existing value in each cell is updated. The corresponding updated
values appear in row J5 of the matrix below:
             I7  I6  I5  I4  I3  I2  I1
              C   T   C   G   T  (Dummy)
J8  C         1   0   1   0   0   0   0
J7  T         0   1   0   0   1   0   0
J6  A         0   0   0   0   0   0   0
J5  A         2   2   2   1   0   0   0
J4  G         1   1   1   2   0   0   0
J3  T         0   1   0   0   1   0   0
J2  (Dummy)   0   0   0   0   0   0   0
J1  (Dummy)   0   0   0   0   0   0   0
(d) The aforesaid procedure is repeated for the cells one-by-one, I3, I4, ..., and I7, encountered
along the next upper rows J6 through J7 (by following the track, moving leftward); and, by
using the procedure and algorithm indicated above, the existing value in each cell is updated.
The corresponding updated values appear in row J6 of the first matrix and in row J7 of the
second matrix below:
             I7  I6  I5  I4  I3  I2  I1
              C   T   C   G   T  (Dummy)
J8  C         1   0   1   0   0   0   0
J7  T         0   1   0   0   1   0   0
J6  A         2   2   2   1   0   0   0
J5  A         2   2   2   1   0   0   0
J4  G         1   1   1   2   0   0   0
J3  T         0   1   0   0   1   0   0
J2  (Dummy)   0   0   0   0   0   0   0
J1  (Dummy)   0   0   0   0   0   0   0

             I7  I6  I5  I4  I3  I2  I1
              C   T   C   G   T  (Dummy)
J8  C         1   0   1   0   0   0   0
J7  T         2   3   2   1   1   0   0
J6  A         2   2   2   1   0   0   0
J5  A         2   2   2   1   0   0   0
J4  G         1   1   1   2   0   0   0
J3  T         0   1   0   0   1   0   0
J2  (Dummy)   0   0   0   0   0   0   0
J1  (Dummy)   0   0   0   0   0   0   0
(e)
Lastly, the procedure is repeated for the cells one-by-one, I3, I4, ..., and I7, encountered along
the next upper row J8 (by following the track, moving leftward). This completes the updating,
with the new values appearing in row J8 of the following matrix:
             I7  I6  I5  I4  I3  I2  I1
              C   T   C   G   T  (Dummy)
J8  C         4   2   3   1   0   0   0
J7  T         2   3   2   1   1   0   0
J6  A         2   2   2   1   0   0   0
J5  A         2   2   2   1   0   0   0
J4  G         1   1   1   2   0   0   0
J3  T         0   1   0   0   1   0   0
J2  (Dummy)   0   0   0   0   0   0   0
J1  (Dummy)   0   0   0   0   0   0   0
(f) The best global alignment is then determined by backtracing as follows:
(i) Starting from the lead cell (LC), trace an upward diagonal path. This is done regardless of whether the cell corresponds to a match or a mismatch.
      C   T   A   A   G   T
C     4   0   0   2   1   0
T     2   3   2   2   1   1
C     3   2   2   2   1   0
G     1   1   1   1   2   0
T     0   1   0   0   0   1   (LC)
(ii) This trace path is continued until an island is met. An island is a set of 4 cells in which 3 or more entries are identical, as in the matrix below.
      C   T   A   A   G   T
C     4   2   2   2   1   0
T     2   3   2   2   1   1
C     3   2   2   2   1   0
G     1   1   1   1   2   0
T     0   1   0   0   0   1
(iii) From the island, there are two possible paths that can be taken:
1. First vertical, then diagonal
      C   T   A   A   G   T
C     4   2   2   2   1   0
T     2   3   2   2   1   1
C     3   2   2   2   1   0
G     1   1   1   1   2   0
T     0   1   0   0   0   1
2. First horizontal, then diagonal:
      C   T   A   A   G   T
C     4   2   2   2   1   0
T     2   3   2   2   1   1
C     3   2   2   2   1   0
G     1   1   1   1   2   0
T     0   1   0   0   0   1
Backtracing: Summary
1. Start from the lead cell.
2. Go diagonally from one cell to the next, regardless of whether the entry in the new cell corresponds to a match or a mismatch.
3. Continue on the diagonal trace until an "island" is reached. (An island is a set of 4 cells in which 3 or more entries are identical.)
4. Across the island, either go "vertical" and then continue on the diagonal track, or go "horizontal" and then continue on the diagonal trace, as feasible.
The choice between the two paths depends on the number of cells ahead of the island: the path with more cells traced ahead is the one selected. In the present example, the vertical path leads to four cells, whereas the horizontal path has only one ahead. Therefore, the vertical path is selected.
(g) Now, the given sequences can be aligned as per the following rules:
I. A vertical track implies introducing a gap in the X (upper) sequence at that site.
II. A horizontal track implies introducing a gap in the Y (bottom) sequence at that site.
For the present example, with sequences X: CTCGT and Y: CTAAGT, the traceback is illustrated below:
[Figure: traceback path for X: C T C G T and Y: C T A A G T, with the gap marked.]
Hence, the aligned sequences are written as follows, with the gap introduced in sequence X:
X:  C T – C G T
Y:  C T A A G T
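The two gap rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `align_from_path` and the move letters 'D', 'V' and 'H' are hypothetical, and the path is read from the start of the sequences to the end (the reverse of the order in which the traceback visits the cells).

```python
def align_from_path(x, y, moves):
    """Build gapped strings from a traceback path.

    moves is read from the first residue to the last:
      'D' = diagonal   (pair one residue of x with one of y)
      'V' = vertical   (gap in the upper sequence x)
      'H' = horizontal (gap in the bottom sequence y)
    """
    top, bottom = [], []
    i = j = 0
    for move in moves:
        if move == "D":
            top.append(x[i])
            bottom.append(y[j])
            i += 1
            j += 1
        elif move == "V":
            top.append("-")
            bottom.append(y[j])
            j += 1
        elif move == "H":
            top.append(x[i])
            bottom.append("-")
            i += 1
        else:
            raise ValueError("unknown move: " + move)
    if i != len(x) or j != len(y):
        raise ValueError("path does not consume both sequences")
    return "".join(top), "".join(bottom)


# Vertical step taken at the island of the worked example.
aligned_x, aligned_y = align_from_path("CTCGT", "CTAAGT", "DDVDDD")
print(aligned_x)
print(aligned_y)
```

With the vertical step placed at the third position, the sketch prints the two gapped rows of the alignment written above.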
(i) In summary, the NW algorithm involves: (i) setting up the matrix, (ii) updating the scores in the matrix cells and (iii) identifying the optimal alignments via a traceback. The four aspects of the traceback procedure are as follows: (a) on encountering identity of residues at a site between X and Y, keep tracking along the diagonal; (b) on encountering a mismatch of residues at a site between X and Y, keep tracking along the diagonal; (c) a gap in the top (X) sequence corresponds to tracking vertically in the island; and (d) a gap in the bottom (Y) sequence corresponds to tracking horizontally in the island. Furnished below is an illustrative tutorial on the traceback procedure relevant to the NW algorithm.
Thus, based on the vertical or horizontal trace pursued at the island indicated above, the following can be specified as alignment rules:
• Outside the island, where the diagonal pursuit is done, the residues of the compared sequences are either identical or mismatched, and no gaps exist at those sites.
• Within the island, a vertical pursuit implies a "gap" to be introduced at that site of the upper sequence, for example in U of a hypothetical pair of sequences, U and V:
U:  A A G – C T G
V:  A A T C G T G
• Within the island, a horizontal pursuit implies a "gap" to be introduced at that site of the bottom sequence, for example in V of a hypothetical pair of sequences, U and V:
U:  G A G C C T A
V:  G A T – G T A
______________________________________________
EXAMPLE
Given the sequence pairs:
x: W F G Q E T S A I S
y: S F T Q F S E D A I
Perform an NW-algorithm-based comparison between the two sequences and elucidate the optimal pathway for aligning them globally.
Solution
Step 1: Develop a score matrix for the two sequences and initialize the cells representing identities with the score value 1 and the cells representing mismatches with the score value 0. The constructed initial matrix is shown below:
(rows: x; columns: y)
      S  F  T  Q  F  S  E  D  A  I
W     0  0  0  0  0  0  0  0  0  0
F     0  1  0  0  1  0  0  0  0  0
G     0  0  0  0  0  0  0  0  0  0
Q     0  0  0  1  0  0  0  0  0  0
E     0  0  0  0  0  0  1  0  0  0
T     0  0  1  0  0  0  0  0  0  0
S     1  0  0  0  0  1  0  0  0  0
A     0  0  0  0  0  0  0  0  1  0
I     0  0  0  0  0  0  0  0  0  1
S     1  0  0  0  0  1  0  0  0  0
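The initialization of Step 1 (score 1 for an identity, 0 for a mismatch) can be sketched as follows. The function name `match_matrix` is hypothetical; rows follow x and columns follow y, as in the matrix above.

```python
def match_matrix(x, y):
    """Return the initial score matrix: 1 where residues agree, else 0.

    Row r corresponds to x[r]; column c corresponds to y[c].
    """
    return [[1 if a == b else 0 for b in y] for a in x]


x = "WFGQETSAIS"
y = "SFTQFSEDAI"
print(" ", *y)
for residue, row in zip(x, match_matrix(x, y)):
    print(residue, *row)
```

Each printed row starts with the x residue, followed by its ten match or mismatch scores against y.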
Step 2: Add "dummy" rows and columns (I2, I1 and J2, J1) to the matrix and fill them with zeros. Name the columns (i: I12, I11, ..., I3) and rows (j: J12, J11, ..., J3) as shown. Identify the corner-most cell (the last cell of the matrix, not including the dummy rows and columns) and designate it as the leading cell, LC, with the coordinate (i, j).
             I12  I11  I10  I9   I8   I7   I6   I5   I4   I3   I2   I1
             W    F    G    Q    E    T    S    A    I    S    Dummy
J12  S       0    0    0    0    0    0    1    0    0    1    0    0
J11  F       0    1    0    0    0    0    0    0    0    0    0    0
J10  T       0    0    0    0    0    1    0    0    0    0    0    0
J9   Q       0    0    0    1    0    0    0    0    0    0    0    0
J8   F       0    1    0    0    0    0    0    0    0    0    0    0
J7   S       0    0    0    0    0    0    1    0    0    1    0    0
J6   E       0    0    0    0    1    0    0    0    0    0    0    0
J5   D       0    0    0    0    0    0    0    0    0    0    0    0
J4   A       0    0    0    0    0    0    0    1    0    0    0    0
J3   I       0    0    0    0    0    0    0    0    1    0    0    0
J2   Dummy   0    0    0    0    0    0    0    0    0    0    0    0
J1   Dummy   0    0    0    0    0    0    0    0    0    0    0    0
LC: cell (I3, J3)
Step 3: As described in the previous example, start from the leading cell LC at (i, j) and consider the cell (i − 1, j − 1) located downward along the diagonal. Then update the LC score value, S(i, j), by adding the maximum of the following three values, as per the NW algorithm:
• S(i − 1, j − 1)
• Maximum of S(i − 1, j − 2), S(i − 1, j − 3), ..., S(i − 1, j − n) (along I2, I1, ...)
• Maximum of S(i − 2, j − 1), S(i − 3, j − 1), ..., S(i − n, j − 1) (along J2, J1, ...)
This procedure is repeated for all the cells, one by one, along the columns I3, I4, ..., I12 and the rows J3, J4, ..., J12. The completed score update leads to the values shown in the following matrix:
(rows: X; columns: Y)
      S  F  T  Q  F  S  E  D  A  I
W     5  4  4  4  3  3  2  2  1  0
F     4  5  4  3  4  3  2  2  1  0
G     4  4  4  3  3  3  2  2  1  0
Q     4  4  3  4  3  3  2  2  1  0
E     4  4  3  3  3  2  3  2  1  0
T     3  3  4  3  3  2  2  2  1  0
S     3  2  2  2  2  3  2  2  1  0
A     1  1  1  1  1  1  1  1  2  0
I     1  1  1  1  1  0  0  0  0  1
S     1  0  0  0  0  1  0  0  0  0
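The update rule of Step 3 can be sketched as below. The sketch assumes that the matrix is held with its last residue pair in the bottom-right corner, so that the "next" cells are those one row down and one column to the right; the function name `nw_score_matrix` and its argument names are hypothetical. It is tried on the shorter pair from the earlier example (Y: CTAAGT labelling the rows, X: CTCGT labelling the columns).

```python
def nw_score_matrix(row_seq, col_seq):
    """Fill the score matrix from the bottom-right corner backwards.

    Each cell starts at 1 (identity) or 0 (mismatch) and is then increased
    by the largest score found in the rest of the next row or the rest of
    the next column, both taken from the diagonal neighbour onwards.
    """
    n_rows, n_cols = len(row_seq), len(col_seq)
    score = [[1 if r == c else 0 for c in col_seq] for r in row_seq]
    # The last row and last column border only dummy zeros, so they keep
    # their initial values; every other cell is updated.
    for i in range(n_rows - 2, -1, -1):
        for j in range(n_cols - 2, -1, -1):
            best_in_next_row = max(score[i + 1][j + 1:])
            best_in_next_col = max(score[k][j + 1] for k in range(i + 1, n_rows))
            score[i][j] += max(best_in_next_row, best_in_next_col)
    return score


for residue, row in zip("CTAAGT", nw_score_matrix("CTAAGT", "CTCGT")):
    print(residue, *row)
```

The top-left cell of the printed matrix holds the largest score, which is the cell where the accumulated path for the two full sequences begins.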
Step 4: Using the final score matrix, a traceback from the LC is done via diagonal pursuit, resorting to a vertical or horizontal path whenever an island is encountered. The resulting track for the present problem is shown below:
(rows: X; columns: Y)
      S  F  T  Q  F  S  E  D  A  I
W     5  4  4  4  3  3  2  2  1  0
F     4  5  4  3  4  3  2  2  1  0
G     4  4  4  3  3  3  2  2  1  0
Q     4  4  3  4  3  3  2  2  1  0
E     4  4  3  3  3  2  3  2  1  0
T     3  3  4  3  3  2  2  2  1  0
S     3  2  2  2  2  3  2  2  1  0
A     1  1  1  1  1  1  1  1  2  0
I     1  1  1  1  1  0  0  0  0  1
S     1  0  0  0  0  1  0  0  0  0   (LC)
The aligned sequences are therefore:
x:  W F G Q E T S − − A I S
y:  S F T Q F − S E D A I
EXAMPLE
The following pair of sequences is given in [T. K. Attwood and D. J. Parry-Smith: Introduction to Bioinformatics. Pearson Education Ltd., Essex, UK: 1999], where obtaining the relevant global alignment via the NW algorithm is described. The present exercise is to obtain the final score matrix given in [ ] for the test pair and to verify the alignment result as posted:
u:  ADLGAVFALCDRYFQ
v:  ADLGRTQNCDRYYQ
The final gapped alignment shown in [ ] is:
u:  ADLGAVFALCDRYFQ
v:  ADLGRTQN−CDRYYQ
Local sequence alignment: Implementation of SW algorithm
The local alignment of sequences is based on dynamic programming using the Smith-Waterman (SW) algorithm, which can be adopted to align protein and/or nucleotide sequences. The underlying optimal local alignment can be understood with the examples and exercises furnished below.
EXAMPLE
Given a pair of sequences, u and v, as shown below, establish their best local alignment using the SW algorithm.
(i)
u:  ...W R N D C Q E G S A...
v:  ...W G Q E G S I E A...
Solution
Applying the Smith-Waterman (SW) algorithm to align u and v and elucidate the local alignment involves the following steps:
(1) Construct an initial matrix framed with the u and v residues as shown below.
(2) Add a set of edge elements x: "0" along the leftmost column and the topmost row of the matrix as shown. Inasmuch as the first row and first column cannot be an endpoint of any alignment, these (x: 0's) are introduced to serve as dummy placeholders.
(3) Next, populate the cells corresponding to identical matches (of residues of u and v at the cell site) with entries of "1", and the cells corresponding to mismatches of residues with entries of "0".
(rows: v; columns: u)
      x  W  R  N  D  C  Q  E  G  S  A
x     0  0  0  0  0  0  0  0  0  0  0
W     0  1  0  0  0  0  0  0  0  0  0
G     0  0  0  0  0  0  0  0  1  0  0
Q     0  0  0  0  0  0  1  0  0  0  0
E     0  0  0  0  0  0  0  1  0  0  0
G     0  0  0  0  0  0  0  0  1  0  0
S     0  0  0  0  0  0  0  0  0  1  0
I     0  0  0  0  0  0  0  0  0  0  0
E     0  0  0  0  0  0  0  1  0  0  0
A     0  0  0  0  0  0  0  0  0  0  1
(4) Highlight all the match-score entries (1's) in bold. For each cell having a match-score entry (1), pursue the following three possible tracks to populate the rest of the cells with updated entries:
(a) Diagonal track: Suppose the cell having a match-score entry (1) is designated with the coordinate (i − 1, j − 1), and the diagonal tracking is done downward towards the cell (i, j) as illustrated below. If the score value at (i − 1, j − 1) is S(i − 1, j − 1), then the score of the cell (i, j) is decided as follows:
S(i, j) = S(i − 1, j − 1) + 1.0,
if a similarity (match) of u and v exists at the site (i, j); otherwise,
S(i, j) = S(i − 1, j − 1) − 0.3,
if a dissimilarity (mismatch) of u and v exists at the site (i, j).
In the above updating of scores, the addition of 1.0 is an "award" given to an observed similarity (match), and the subtraction of 0.3 is a "penalty" given to an observed dissimilarity (mismatch). (The value 0.3 is an approximation of 1/3, depicting the degree of freedom.)
[Figure: diagonal step from cell (i − 1, j − 1) to cell (i, j), with award + 1.0 for a match and penalty − 0.3 for a mismatch.]
The above diagonal-track score update is illustrated below with an example: an nth cell with an existing score Sn gets an award of (+ 1.0) due to the identity of the residues (A, A) of u and v, so its updated score becomes (Sn + 1). On the other hand, the mth cell shown, with a score Sm, takes a penalty of (− 0.3) due to the mismatch of the residues (C, T) of u and v, so its updated score becomes (Sm − 0.3).
[Figure: cells Sn − 1 and Sn for u: A, v: A (award + 1.0); cells Sm − 1 and Sm for u: C, v: T (penalty − 0.3).]
(5) The diagonal pursuit as per the above step is exercised at all those cells that show the score entry of "1" as set in the initialization, and all the relevant cells along the diagonal pursuits are updated with the new scores. This diagonal path of updating the score is terminated when the computed score value becomes negative. At that cell and subsequently, the score entries along the diagonal path are set to "0". Further, this diagonal score-filling is discontinued when a cell having a positive score value (possibly 1, as registered in the initialization) is encountered en route.
(6) The next step involves performing the following two algorithms, with a horizontal pursuit along a row towards the right and a vertical pursuit along a column downwards, so as to populate the cells with updated entries.
Suppose (i, j) is any cell considered. Then the horizontal track along the row from this cell leads to a sequential set of cells whose entries are updated with scores using the following algorithm:
S(i, j + k) = [S(i, j) − (1.0 + 0.3 × k)], k = 1, 2, ...
The count k, denoting the kth cell, is ended when a negative value of the score results; thereupon, the subsequent cells are filled with 0 scores. This horizontal-track-based filling is continued and eventually stopped when a match (identity) value of 1 of the initial matrix is encountered on the row.
[Figure: horizontal track from cell (i, j) through S(i, j + 1), S(i, j + 2), ..., S(i, j + k).]
Next, again considering a cell (i, j), the vertical track along the column from this cell leads to a set of sequential cells downward whose entries are updated with scores using the following algorithm:
S(i + ℓ, j) = [S(i, j) − (1.0 + 0.3 × ℓ)], ℓ = 1, 2, ...
The count ℓ, denoting the ℓth cell, is ended when a negative value of the score results; thereupon, the subsequent cells are filled with 0 scores. This vertical-track-based filling is continued and eventually stopped when a match value of 1 of the initial matrix is encountered on the column.
[Figure: vertical track from cell (i, j) through S(i + 1, j), S(i + 2, j), ..., S(i + ℓ, j).]
In the above procedures, the values of k and ℓ denote penalty lengths that specify the extent of the penalties imposed on the scores of the cells, consistent with the possible deletions.
While proceeding along k or ℓ, if a negative value of the computed score results, then the corresponding cell and the rest seen subsequently are filled with 0 scores. (It implies that there is no alignment similarity up to the current cell position.)
Further, score-filling horizontally (along the row) or vertically (along the column) is terminated when a cell with an identity (similarity) score of 1 (registered in the initialization) is seen ahead en route.
Thus, commencing from each cell (i − 1, j − 1) with the initialized score entry S(i − 1, j − 1) = 1, the updating of score values is done via: (i) the diagonal passage from cell (i − 1, j − 1) to (i, j), (ii) horizontal path lengths, k (along the row), or (iii) vertical path lengths, ℓ (along the column), as illustrated below:
[Figure: diagonal path from cell (i − 1, j − 1) to cell (i, j), with the k-path along the row and the ℓ-path along the column.]
(7) Now, for the alignment exercise in hand, the diagonal, horizontal and vertical scoring procedures indicated above are performed in order to update the cell scores using the aforesaid algorithms of the SW scheme of local alignment. The updated scores as computed are shown in bold in the following matrix. The struck-out values are the existing scores, and all the pertinent diagonal pursuits are marked with arrows.
[Matrix: updated SW scores for u (W R N D C Q E G S A) against v (W G Q E G S I E A), with the struck-out initial scores alongside; the same final values appear in the traceback matrix of step (8).]
(8) Alignment via traceback: The traceback commences bottom-up from the largest score value observed in the final score matrix and proceeds diagonally upward, as illustrated below.
(rows: v; columns: u)
      x     W     R     N     D     C     Q     E     G     S     A
x     0     0     0     0     0     0     0     0     0     0     0
W     0     1     0     0     0     0     0     0     0     0     0
G     0     0     0.66  0     0     0     0     0     1     0     0
Q     0     0     0     0.33  0     0     1     0     0     0.66  0
E     0     0     0     0     0     0     0     2     0.67  0.33  0.33
G     0     0     0     0     0     0     0     0.66  3     1.66  1.33
S     0     0     0     0     0     0     0     0.33  1.66  4     2.66
I     0     0     0     0     0     0     0     0     1.33  2.66  3.66
E     0     0     0     0     0     0     0     1     1.00  2.33  2.33
A     0     0     0     0     0     0     0     0     0.66  2     2
Hence, the final result on local alignment is as follows:
QEGS
   
QEGS
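For comparison, the same result can be obtained with the standard Smith-Waterman recurrence, sketched below with the scoring stated above: award + 1 for a match, penalty − 1/3 for a mismatch (the text rounds this to 0.3) and a gap penalty of 1 + k/3 for a gap of length k. This is a sketch under those assumptions and not a transcription of the hand procedure, so individual cell values may differ from the hand-filled matrix; the function names `sw_fill` and `sw_traceback` are hypothetical.

```python
from fractions import Fraction

MATCH = Fraction(1)
MISMATCH = Fraction(-1, 3)  # the text rounds 1/3 to 0.3


def gap_penalty(length):
    """Penalty 1 + length/3 for a gap spanning `length` cells."""
    return 1 + Fraction(length, 3)


def sw_fill(u, v):
    """Score matrix H with v labelling rows and u labelling columns."""
    rows, cols = len(v) + 1, len(u) + 1
    H = [[Fraction(0)] * cols for _ in range(rows)]  # row 0 / column 0 = x: 0
    for i in range(1, rows):
        for j in range(1, cols):
            step = MATCH if v[i - 1] == u[j - 1] else MISMATCH
            options = [Fraction(0), H[i - 1][j - 1] + step]
            options += [H[i][j - k] - gap_penalty(k) for k in range(1, j + 1)]
            options += [H[i - k][j] - gap_penalty(k) for k in range(1, i + 1)]
            H[i][j] = max(options)
    return H


def sw_traceback(u, v, H):
    """Trace back from the largest score until a zero score is reached."""
    best = max(max(row) for row in H)
    i, j = next((r, c) for r, row in enumerate(H)
                for c, val in enumerate(row) if val == best)
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        step = MATCH if v[i - 1] == u[j - 1] else MISMATCH
        if H[i][j] == H[i - 1][j - 1] + step:        # diagonal move
            top.append(u[j - 1])
            bottom.append(v[i - 1])
            i, j = i - 1, j - 1
            continue
        k = next((k for k in range(1, j + 1)
                  if H[i][j] == H[i][j - k] - gap_penalty(k)), None)
        if k is not None:                            # gap in v
            for _ in range(k):
                top.append(u[j - 1])
                bottom.append("-")
                j -= 1
            continue
        k = next(k for k in range(1, i + 1)
                 if H[i][j] == H[i - k][j] - gap_penalty(k))
        for _ in range(k):                           # gap in u
            top.append("-")
            bottom.append(v[i - 1])
            i -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


u, v = "WRNDCQEGSA", "WGQEGSIEA"
score, seg_u, seg_v = sw_traceback(u, v, sw_fill(u, v))
print(score, seg_u, seg_v)
```

Exact fractions are used instead of floating-point numbers so that the traceback can compare scores for equality without rounding issues.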
EXAMPLE
Given a pair of sequences, u and v, as shown below, determine the best local alignment via the traceback method using the SW algorithm.
Given pair of sequences:
u: AASTHECWCTWH
v: AASRNPSCWTTWHT
Solution
Step 1: As indicated in the previous example, the starting point for the Smith-Waterman algorithm is to construct the matrix with the given residue sequences and initialize its edge elements to x: 0, the placeholder that accommodates the following condition: the first row and the first column of the matrix cannot form the endpoint of any alignment.
Step 2: Next, the cells in the matrix representing identities of residues (between u and v) are scored 1, and the rest of the cells, representing mismatches, are scored 0. Shown below is the resulting matrix after Steps 1 and 2. (The mismatch values "0" are omitted in the matrix illustration for clarity.)
(rows: u; columns: v)
       x  A  A  S  R  N  P  S  C  W  T  T  W  H  T
x      0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
A      0  1  1
A      0  1  1
S      0        1           1
T      0                             1  1        1
H      0                                      1
E      0
C      0                       1
W      0                          1        1
C      0                       1
T      0                             1  1        1
W      0                          1        1
H      0                                      1
Step 3: With reference to the initialized matrix above, the cells are filled with updated scores following the algorithm indicated in the last example.
The final updated matrix is:
(rows: u; columns: v)
     x    A    A    S    R    N    P    S    C    W    T    T    W    H    T
x    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
A    0    1    1    0    0    0    0    0    0    0    0    0    0    0    0
A    0    1    2    0.7  0.3  0    0    0    0    0    0    0    0    0    0
S    0    0    0.7  3    1.7  1.3  1    1    0    0    0    0    0    0    0
T    0    0    0.3  1.7  2.7  1.3  1    0.7  0.7  0    1    1    0    0    1
H    0    0    0    1.3  1.3  2.3  1    0.7  0.3  0.3  0    0.7  0.7  1    0
E    0    0    0    1    1    1    2    0.7  0.3  0    0    0    0.3  0.3  0.7
C    0    0    0    0.7  0.7  0.7  0.7  1.7  1.7  0    0    0    0    0    0
W    0    0    0    0.3  0.3  0.3  0.3  0.3  1.3  2.7  0    0    1    0    0
C    0    0    0    0    0    0    0    0    1.3  1.3  1.3  0    0    0.7  0
T    0    0    0    0    0    0    0    0    0    1    2.3  2.3  0    0    1.7
W    0    0    0    0    0    0    0    0    0    1    1    2    3.3  0    0
H    0    0    0    0    0    0    0    0    0    0    0.7  0.7  2    4.3  0
Step 4: Using the final score matrix, the traceback is performed starting from the highest value (4.3), as shown. The diagonal pursuit is continued along the path having the highest values until an "island" is met. (The description of an island is given earlier in the exercise on the NW algorithm.) Here the trace is directed horizontally and then vertically as shown, and is then pursued diagonally to 2.3 and further. This implies introducing a gap between the H and E residues of u. Hence, the aligned sequences are written as follows:
u:  AASTH−ECWCTWH
v:  AASRNPSCWTTWH
The local-alignment segment is shown in bold.
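The cost of a gap such as the one introduced above follows the penalty 1.0 + 0.3k of the horizontal and vertical score-filling rule. A minimal sketch of that rule is given below, taking 1/3 exactly in place of the rounded 0.3 (an assumption); the function name `gap_run` is hypothetical.

```python
from fractions import Fraction


def gap_run(score, length):
    """Scores written into the next `length` cells along a row or column.

    The kth cell receives score - (1 + k/3); once this turns negative the
    cell, and hence every later one, is filled with 0.
    """
    cells = []
    for k in range(1, length + 1):
        value = score - (1 + Fraction(k, 3))
        cells.append(max(value, Fraction(0)))
    return cells


# Starting from a cell with score 4, as at the S/S cell of the first SW example.
print([float(round(c, 2)) for c in gap_run(Fraction(4), 3)])
```

Starting from a score of 4, the first three cells come out as 8/3, 7/3 and 2, which correspond to the entries 2.67, 2.33 and 2.0 seen next to the score 4 in the first SW example.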
EXAMPLE
With reference to the following pair of sequences, perform an SW-algorithm-based comparison and elucidate the locally significant, common regions of similarity.
u:  WYGQEQSYIQ
v:  WYTQETSDIQ
Solution
Step 1: To implement the Smith-Waterman algorithm, construct the initial score matrix with the edge elements x: 0 as placeholders, as was done in the prior examples. Next, the cells representing identities are scored 1 and those representing mismatches are scored 0. The resulting matrix is shown below, with the 0 scores on mismatches omitted for clarity.
(rows: u; columns: v)
       x  W  Y  T  Q  E  T  S  D  I  Q
x      0  0  0  0  0  0  0  0  0  0  0
W      0  1
Y      0     1
G      0
Q      0           1                 1
E      0              1
Q      0           1                 1
S      0                    1
Y      0     1
I      0                          1
Q      0           1                 1
Step 2: The cells are populated with updated scores via the SW algorithm, applying the diagonal pursuit followed by the row-wise and column-wise pursuits at each cell showing the entry 1 (match score). The resulting final score matrix is as follows:
(rows: u; columns: v)
           J0   J1   J2   J3   J4   J5   J6   J7   J8   J9   J10
           x    W    Y    T    Q    E    T    S    D    I    Q
I0   x     0    0    0    0    0    0    0    0    0    0    0
I1   W     0    1    0    0    0    0    0    0    0    0    0
I2   Y     0    0    2    0.7  0.3  0    0    0    0    0    0
I3   G     0    0    0.7  1.7  0.3  0    0    0    0    0    0
I4   Q     0    0    0.3  0.3  2.7  1.3  1    0.7  0.3  0    1
I5   E     0    0    0    0    1.3  3.7  2.3  1    0.7  0.3  0
I6   Q     0    0    0    0    1    2.3  3.3  2    1.7  1.3  1.3
I7   S     0    0    0    0    0.7  2    2    4.3  3    2.7  2.3
I8   Y     0    0    1    0    0.3  1.7  1.7  3    4    2.7  2.3
I9   I     0    0    0    0.7  0    1.3  1.3  2.7  2.7  5    3.7
I10  Q     0    0    0    0    1.7  1    1    2.3  2.3  3.7  6
Step 3: The traceback pathway conforms to the passage that accumulates most matches. The
common regions of similarity can be determined by referencing these matches as illustrated
below:
[Matrix: the final score matrix of Step 2 is repeated here, with the traceback path highlighted.]
The aligned sequences are as follows; the residues shown in bold could be the locally significant aligned pairs of interest.
WYGQEQSYIQ
WYTQETSDIQ
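Since this alignment has no gaps, its scores can be checked with the diagonal rule alone (award + 1.0 for a match, penalty − 0.3 for a mismatch, reset to 0 when the score turns negative). The sketch below assumes the exact value 1/3 in place of 0.3; the function name `diagonal_scores` is hypothetical.

```python
from fractions import Fraction


def diagonal_scores(u, v):
    """Running score along the main diagonal of the SW matrix (no gaps)."""
    scores, running = [], Fraction(0)
    for a, b in zip(u, v):
        running += 1 if a == b else Fraction(-1, 3)
        running = max(running, Fraction(0))  # a negative score is reset to 0
        scores.append(running)
    return scores


for a, b, s in zip("WYGQEQSYIQ", "WYTQETSDIQ",
                   diagonal_scores("WYGQEQSYIQ", "WYTQETSDIQ")):
    print(a, b, float(round(s, 1)))
```

The printed values follow the main diagonal of the final score matrix above, ending at the highest score of 6.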
PROBLEMS/EXERCISES ON NW and SW ALIGNMENTS
EXAMPLE – NW Algorithm
Given a pair of sequences, U and V, as indicated below, determine the best global alignment via traceback using the NW algorithm.
U:  CACTHETW
V:  CACSCATTW
Solution: Hand Calculation
(rows: U; columns: V)
      C   A   C   S   C   A   T   T   W
C     4   2   3   2   2   1   1   0   0
A     2   3   2   2   1   2   1   0   0
C     2   1   2   1   2   1   1   0   0
T     0   0   0   0   0   0   1   1   0
H     0   0   0   0   0   0   0   0   0
E     0   0   0   0   0   0   0   0   0
T     0   0   0   0   0   0   1   1   0
W     0   0   0   0   0   0   0   0   1
Alignment =
CACTHETW
:

CACSCATTW
Problem B.14
NW Algorithm
Given a pair of sequences, X and Y, as indicated below, determine the best global alignment via traceback using the NW algorithm.
(i)
X:  GAGCA
Y:  GATTCA
Solution Hint
The following is the solution on alignment:
GAGCA
 
 
GATTCA

Problem B.15
NW Algorithm
Given the sequence pairs:
x: W F G Q F T S A I W
y: S S T Q F S E D A I
Perform an NW-algorithm-based comparison between the two sequences and elucidate the optimal pathway for aligning them globally.
Problem B.16
Assigned is a pair of nucleotide (RNA) sequences (S and T). Determine the best global alignment.
S: C U U A C G C A
T: A U G A G A A C U U
Solution Hint: Final alignment
S: C U U  A C G   C  A
T: A U  G A  G A A C U U
Problem B.17
Given a pair of sequences, U and V, as indicated below, determine the best global alignment via traceback using the NW algorithm.
U:  CTCGT
V:  CTAAGT
Answer:
(ii) Alignment =
C T – C G T
C T A A G T
Problem B.18
Via hand calculations, perform an NW-algorithm-based comparison between the two sequences given below and elucidate the maximum pathway:
MAVRKLSLEG
MSTALPGLGS
Problem B.19: Bonus Problem for extra credits
Via hand calculations, perform an NW-algorithm-based comparison between the two given sequences and elucidate the maximum pathway:
Sequence Pair
WFGQETSAIS
SFTQFSEDAI
LOCAL ALIGNMENT: SW ALGORITHM
EXAMPLE
Determine the best local alignment via the traceback method of the Smith-Waterman algorithm.
Step 1: Construct an n × m matrix and add the row of dummy '0's for initialization.
x
x
0
A
0
U
G
0
0
A
G
0
0
A
A
C
0
0
0
U
U
0
0
C
0
U
0
U
0
A
0
C
0
G
0
C
0
A
0
Step 2: Use the Smith-Waterman algorithm to fill the table and find the best-scoring path (blank spaces represent '0's).
x
x
0
C
0
U
0
U
0
A
0
C
0
A
0
U
G
0
0
A
G
0
0
1.7
0.3
A
A
C
0
0
0
1.0
1.0
U
U
0
0
G
0
C
0
A
0
1
1.0
1.0
0.7
1.0
0.7
0.7
1.0
2.0
1.3
1.3
0.3
1.3
1.3
3.0
1.0
0.7
0.7
0.7
1.0
1.0
1.0
0.3
1.0
1.0
0.7
0.7
1.0
0.3
0.3
0.7
Step 3: The row of dummy '0's is removed.
C
U
U
1.0
1.0
0.7
A
U
G
A
1.0
0.7
1.7
0.3
A
A
C
1.0
1.0
1.0
2.0
1.3
1.3
3.0
G
C
A
1.0
1.0
1.0
0.7
0.7
A
G
U
U
C
1.3
0.3
1.3
1.0
0.7
0.7
0.3
0.3
0.7
1.0
1.0
1.0
0.3
0.7
Results:
Best local alignment:
C U U
C U U
EXAMPLE
Given the sequence pair, X and Y, as indicated below, determine the best local alignment via traceback using the SW algorithm.
X:  PAWHEAE
Y:  HEAGEWGHEA
Hand Calculation
(rows: X; columns: Y)
     x       H       E       A       G       E       W       G       H       E       A
x    0       0       0       0       0       0       0       0       0       0       0
P    0       0       0       0       0       0       0       0       0       0       0
A    0       0       0       1       0       0       0       0       0       0       1
W    0       0       0       0       0.6667  0       1       0       0       0       0
H    0       1       0       0       0       0.3334  0       0.6667  1       0       0
E    0       0       2       0.6667  0.3334  1       0       0       0.3334  2       0.6667
A    0       0       0.6667  3       1.6667  1.3334  1.0001  0.6668  0.3335  0       3
E    0       0       1       1.6667  2.6667  1.3334  1.0001  0.6668  0.3335  1       1.6667
Answer: Alignment
WHEA
 
WGHEA
EXAMPLE
Determine the common regions of similarity (that is, optimal local alignment) via SW algorithm
VSTVVLENPGLGRALS
MSTVVTPNPGLGKAS
x
M
S
T
x
0
0
0
0
V
0
S
0
T
0
V
0
V
0
L
0
E
0
N
0
P
0
G
0
L
0
G
0
R
0
A
0
L
0
S
1.0
2.0
V
V
T
P
N
P
G
L
G
K
A
S
0
0
0
0
0
0
0
0
0
0
0
0
1.0
0.7 3.0
1.0 0.7 0.3 1.7 4.0
0.7 1.7 0.3 2.7 3.7
0.3 1.3 2.3 2.3 3.3
1.0 2.0
4.3
0.7
5.3
0.3
6.3
7.3
8.3
8.0
9.0
8.7
Problem B.20
SW Algorithm
For the following pair of sequences, u and v, obtain the relevant local alignment via the SW algorithm by obtaining the final score matrix as outlined earlier:
u:  ACAGCCUCGCUUAG
v:  AAUGCCAUUGACGG
Solution hint: The local alignment is:
... G C C − U C G ...
... G C C A U U G ...
Problem B.21
Given a pair of sequences, X and Y, as shown below, determine the best local alignment via traceback using the SW algorithm.
X:  WRNDCQEGSA
Y:  WGQEGSIEA
Answer
Alignment =
QEGS
QEGS
Problem B.22 : Bonus problem for extra credits
Given a pair of sequences, U and V, as shown below, determine the best local alignment via traceback using the SW algorithm.
U:  AASTHECWCTWH
V:  AASRNPSCWTTWHT
Answer
Alignment =
AASTH−ECWCTWH
AASRNPSCWTTWH
SUBMISSION (HARD COPY AND SOFT COPY) DUE DATE: BY 3rd November, 2017
BONUS CREDIT will be given for:
(a) Neat stepbystep demonstration of the problems
(b) Developing your own MatLab/C/C++ codes as necessary
(c) At least 2 handcalculations of NW and SW problems indicated
________________________________________________________________________