Creating Multiple Sequence Alignments

By Russell Smith,2014-03-28 08:11

    BIT150 Lab3

    Sequence Alignment, Multiple Sequence Alignment and Phylogenetics

    Copy 10_Lab3 from Z: to C:.

A. SEQUENCE ALIGNMENT

    The most basic task in sequence analysis is to ask whether two sequences are similar and can be compared. Proteins with very similar sequences probably share structural properties and similar functions.

    Objective: Explore different methods of sequence alignment, interpret their results, and compare them.

A1. Graphical method

    Dotter (http://www.cgb.ki.se/cgb/groups/sonnhammer/Dotter.html): A dot-matrix program with interactive grayscale rendering for DNA and protein sequence analysis.

    • Dotter is preinstalled on your lab computers.

    Follow these steps to run Dotter:

    1.1. The DNA sequence file WIS.txt to be used with Dotter is in ‘10_Lab3\Dotter files’. Copy this file into C:\BIT150\Programs\Dotter (alternatively, type the full path of each file when you run the program).

    1.2. Dotter needs to be started from the Command Prompt window:

    Start -> Programs -> Accessories -> Command Prompt (create a shortcut on your desktop).

    Alternatively, Start -> Run… -> in Open, type cmd -> OK.

    The Command Prompt is a DOS-style command line (case insensitive).

    1.3. Move to the Dotter directory (‘C:\BIT150\Programs\Dotter’) by typing:

    call C: -> press Enter;

    cd BIT150\Programs\Dotter (to change directory).

    To see the files present in the Dotter directory, type dir. Check for WIS.txt and

    MITE2.txt.

    1.4. Using Dotter, align the DNA sequence of the retroelement WIS (WIS.txt) with itself to look for internal repeats. To do so, type:

    dotter WIS.txt WIS.txt -> press Enter -> wait….

    1.5. Analyze the Dotter output:

    • Dotter window: The first sequence runs along the x-axis and the second sequence along the y-axis. Segments of 25 bp in one sequence (x-axis) are compared to segments of 25 bp in the second sequence (y-axis). In regions where the two sequences are similar to each other, a row of high scores runs diagonally across the dot matrix.

        o Set the width of the sliding window: right-click on the Dotter window and select ‘Change size of sliding window’. The default width of 25 residues over which the pairwise scores are averaged has proven to be very robust, but you can change it.

        o Print to a file: right-click on the Dotter window and select ‘Print’. You can print the alignment to a PostScript file and later convert it to PDF.

    • Greyramp Tool window: Dotter slides a window along the diagonals and sums the scores of all ‘dots’ within it. Windows whose sum is above the maximum threshold are drawn at maximum intensity, windows below the minimum threshold get the minimum intensity, and windows in between are rendered with a grayscale intensity proportional to their sum of scores. Interactively changing the maximum and minimum thresholds allows you to explore various signal stringencies.

    • Alignment Tool window: Allows you to see the match that causes a given dot in the dotplot. Move the crosshair of the Dotter window to the dot with the left mouse button, and pop up the Alignment Tool. Once in the proximity, use the cursor keys to move the crosshair one residue at a time.

    - Copy and paste the alignment into your Word document (use Shift/PrintScreen to copy everything on your screen, open Start/Programs/Accessories/Paint, paste the image, select what you want, cut it, and finally paste it into your Word document).

    - After aligning WIS.txt with itself, what type of repeat is present in the sequence?
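    The sliding-window comparison behind a dot plot can be sketched in a few lines of Python. This is only an illustration, not Dotter's implementation: it counts identical bases in each window instead of averaging scores from a substitution matrix, and the function names, the toy sequence, and the thresholds are assumptions made for the example.

```python
def window_scores(seq_x, seq_y, window=25):
    """Score every window start (i in seq_x, j in seq_y) by counting identities."""
    seq_x, seq_y = seq_x.upper(), seq_y.upper()
    scores = {}
    for i in range(len(seq_x) - window + 1):
        for j in range(len(seq_y) - window + 1):
            scores[(i, j)] = sum(
                1 for k in range(window) if seq_x[i + k] == seq_y[j + k]
            )
    return scores


def dots(seq_x, seq_y, window=25, min_score=20):
    """Window starts whose score reaches the threshold (the visible dots)."""
    scores = window_scores(seq_x, seq_y, window)
    return sorted(pos for pos, score in scores.items() if score >= min_score)


if __name__ == "__main__":
    # Toy sequence (hypothetical) with the same 8-base segment at both ends.
    repeat = "ACGTTGCA"
    seq = repeat + "GGGGGCCCCCTTTTT" + repeat
    hits = dots(seq, seq, window=8, min_score=8)
    # In a self-comparison the main diagonal always matches, so only the
    # off-diagonal dots reveal internal repeats.
    print([(i, j) for i, j in hits if i != j])
```

    When a sequence is compared with itself, dots away from the main diagonal are the ones to look at: a direct repeat appears as a short line parallel to the main diagonal.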

A2. Dynamic-programming methods

    • Global: Needleman-Wunsch algorithm (1970)

    • Local: Smith-Waterman algorithm (1981)

    >Seq1
    ACCAACCATACGAGTATCAGACCTATCAGGCCTATCCAGAGCAGATCATGGACTAACCCTAGGACATACCATCT
    >Seq2
    ACTAATCATGGACTAACCCCCTAGGACATACCACTACATATGGCCTGATACCTCTGATACTCGTATGGTATCT

2.1. Open the link: http://www.ebi.ac.uk/emboss/align/

     Paste Seq1 and Seq2 into the Sequence1 and Sequence2 windows, respectively. Select DNA as molecule where asked. Compare needle (global) and water (local) alignment results. For both, use the default settings of Gap 10 Extend 0.5.

    NEEDLE - GLOBAL

    Seq1  1 ACCAACCATACGAGTATCAGACCTATCAGGCCTATCCAGAGCAGATCATG 50
                                          .|||          ||||||
    Seq2  1 ------------------------------ACTA----------ATCATG 10

    Seq1 51 GACTAA--CCCTAGGACATACCATCT------------------------ 74
            ||||||  ||||||||||||||| ||
    Seq2 11 GACTAACCCCCTAGGACATACCA-CTACATATGGCCTGATACCTCTGATA 59

    Seq1 74 -------------- 74

    Seq2 60 CTCGTATGGTATCT 73

    WATER - LOCAL

    Seq1 32 CTATCCAGAGCAGATCATGGACTAA--CCCTAGGACATACCA 71
            |||          ||||||||||||  |||||||||||||||
    Seq2  2 CTA----------ATCATGGACTAACCCCCTAGGACATACCA 33
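    The difference between the global and the local method can be seen in a minimal Python sketch of the two dynamic-programming recurrences. It is a simplification under stated assumptions: it uses a single linear gap penalty and made-up match/mismatch scores, whereas needle and water use a scoring matrix with separate gap-open and gap-extend penalties, so the scores will not match the EMBOSS output. Only scores are computed, not the alignments themselves.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score: all of a must be aligned with all of b."""
    prev = [j * gap for j in range(len(b) + 1)]  # first row: leading gaps cost
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman score: best-scoring pair of substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)  # first row: an alignment may start anywhere
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)  # 0 = restart
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    seq1 = "ACCAACCATACGAGTATCAGACCTATCAGGCCTATCCAGAGCAGATCATGGACTAACCCTAGGACATACCATCT"
    seq2 = "ACTAATCATGGACTAACCCCCTAGGACATACCACTACATATGGCCTGATACCTCTGATACTCGTATGGTATCT"
    print("global:", global_score(seq1, seq2))
    print("local: ", local_score(seq1, seq2))
```

    The only differences are the first row and column (gap costs versus zeros) and the extra 0 in the maximum, which lets a local alignment restart instead of carrying a negative score.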


A3. Word methods (heuristic)

    • BLASTN: The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences.

    3.1. Using BLAST 2 Sequences (http://www.ncbi.nlm.nih.gov/blast/bl2seq/wblast2.cgi), run the same two sequences, Seq1 and Seq2 from A2. Select blastn as ‘Program’.

    - Copy and paste the Dot Matrix View and the alignment into your Word document.

    - What is the orientation of the conserved segments?

    - Compare this alignment with those previously obtained using needle (global) and water (local).

    BLAST is more flexible and can also find inverted segments!

    3.2. Change the ‘gap open penalty’ from 5 (default) to 3. Run.

    - Copy and paste the alignment into your Word document (use Shift/PrintScreen).

    - What types of repeats present in the sequences can you identify now?

    3.3. Which of the three methods (needle (global), water (local), BLAST 2 Sequences) best detected the similarities observed in Dotter?
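    One reason a word method can report inverted segments is that it looks for short exact matches ("words") on both strands. The Python sketch below illustrates only that seeding idea and is not the BLAST implementation: real BLAST also extends and scores each seed. The function names and the word size of 11 are assumptions made for the example.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def word_hits(query, subject, word_size=11):
    """(query_pos, subject_pos) pairs where a word of word_size matches exactly."""
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    hits = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            hits.append((i, j))
    return hits


def hits_both_strands(query, subject, word_size=11):
    """Seed matches for the query as given (plus) and reverse-complemented (minus).

    Minus-strand query positions refer to the reverse-complemented query.
    """
    return {
        "plus": word_hits(query, subject, word_size),
        "minus": word_hits(reverse_complement(query), subject, word_size),
    }


if __name__ == "__main__":
    seq1 = "ACCAACCATACGAGTATCAGACCTATCAGGCCTATCCAGAGCAGATCATGGACTAACCCTAGGACATACCATCT"
    seq2 = "ACTAATCATGGACTAACCCCCTAGGACATACCACTACATATGGCCTGATACCTCTGATACTCGTATGGTATCT"
    for strand, hits in hits_both_strands(seq1, seq2).items():
        print(strand, len(hits), "word hits")
```

    A global or local dynamic-programming alignment of the two sequences as given only considers one orientation, so minus-strand matches like these are invisible to it unless one sequence is reverse-complemented first.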

    • BLASTX: DNA-protein alignment (compares a translated nucleotide query against protein sequences).

    3.4. Using BLAST 2 Sequences, compare the genomic DNA sequence of the Acyl Co-A Synthetase from Lab1 with the predicted protein sequence. The sequences are in the file ‘10_Lab1\Sequin Acyl Co-A Synthetase\Final annotation.doc’ and also in 10_Lab3.

    Paste the Acyl Co-A synthetase DNA sequence in the Sequence 1 window and the Acyl Co-A synthetase protein sequence in the Sequence 2 window. Select blastx as ‘Program’.

    - Can you identify the 6 exons?

    - Are the borders of the exons as precise as in the flat file prepared using Sequin?

    3.5. Change the ‘gap extension penalty’ from 1 (default) to 2.

    - Can you see any improvement?

    • BLASTP: Comparing two proteins.

    3.6. Using BLAST 2 Sequences, align the following sequences. Select blastp as ‘Program’.

    >K_transport
    VGALLLYLPISTTRPISFLDALFTATSAVTVTGLAVLDTYSDFTLFGKLVILFLIQVGGLGYMTLSTFFLVLLGRRIGLKERLILAESLEYPSMHGLIRFLKRVFSFVFITELTGAILLSIYFSLKGVEDPVFNGIFHSVSAFNNAGFSTFKNG
    >TRK system potassium uptake protein
    NDIQTKYALIVTAFISIIISIKDKVPIIDSLFTVVSAMTSTGFTTINVGNLSSLSLFLIIFLMLIGGGAGTTTGGVKIIRFLVILKALLYEIKEIIYPKSAVIHEHLDDMDLNYRIIREAFVVFFLYCLSSFLTALIFIALGYNPYDSIFDAVSF

    - Compare alignments with Matrix BLOSUM62/BLOSUM80/PAM30/PAM70. Does anything change when you change matrices?

    PAM (Percentage of Acceptable point Mutations per 10^8 years) matrices

    BLOSUM (BLOcks SUbstitution Matrix) matrices

B. Creating Multiple Sequence Alignments (MSA)

    Objective: Perform multiple sequence alignments, calculate distance matrices, and construct phylogenetic trees, to understand and interpret relationships between species.

    In this example, we will create a multiple alignment of protein sequences that will be imported into the alignment editor using different methods. Multiple protein sequence alignment is a central tool to infer protein function, predict protein secondary structure, and identify residues important for protein specificity.

Open the file ‘FT proteins for MEGA.doc’.

    B1. Start MEGA4 by using Start\Programs\BioInformatics\MEGA4.

    B2. In the MEGA4 window, go to Alignment|Alignment Explorer/CLUSTAL. Select ‘Create a new alignment’ and click on OK. Click on [NO] to create a protein sequence alignment.

    B3. Sequences can be entered either from FASTA files (opening the TXT file of concatenated FASTA sequences using MEGA) or by hand. We will enter the sequences by hand, one by one. In the Alignment Explorer window, go to Edit|Insert Blank Sequence (or click on the corresponding toolbar button), and repeat it to generate 8 blank sequences. Right-click on each blank sequence name and edit it to match the protein names in the Word document ‘FT Proteins for MEGA’. Copy and paste each sequence.

    B4. Go to Edit|Select All to select every site for all the protein sequences in the alignment.

    B5. Go to Alignment|Align by ClustalW (or click on the corresponding toolbar button) to align the selected protein sequences using the ClustalW algorithm.

    B6. Save the current alignment by selecting Data|Save Session. Save it as ‘FT.mas’. This will allow the current alignment to be restored for future editing. Also, export it (Data|Export Alignment) as both a FASTA file (‘FT.fas’) and a MEGA file (‘FT.meg’).
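    Before pasting an exported FASTA alignment into another tool, it can be worth checking that every sequence has the same length once gaps are included. The following Python sketch is one way to do that; the function names are made up for the example, and it assumes a plain FASTA layout in which header lines start with ‘>’.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) tuples from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)  # sequence lines may be wrapped
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def is_aligned(records):
    """True when every sequence (gaps included) has the same length."""
    return len({len(seq) for _, seq in records}) <= 1


if __name__ == "__main__":
    # Assumes FT.fas (exported in step B6) is in the current directory.
    with open("FT.fas") as handle:
        records = parse_fasta(handle.read())
    print(len(records), "sequences; aligned:", is_aligned(records))
```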

C. Generating a publishable MSA using BoxShade

    C1. Using Word, open the previously created FASTA file (‘FT.fas’). Copy the FASTA sequences (including gaps). Paste them into BOXSHADE: http://www.ch.embnet.org/software/BOX_form.html. In ‘Output format’ select RTF_new, and in ‘Input sequence format’ select other. Click on Run BOXSHADE. Click on ‘here is your output number 1’. The alignment will open in a Word document.

D. Exploring the MSA and identifying patterns

    D1. Back in MEGA4, exit the Alignment Explorer window by selecting Data|Exit AlnExplorer. A dialog box will appear asking if you would like to open the data file in MEGA; click on ‘Yes’.

    D2. Observe the different coloring schemes by clicking on:

    C: conserved residues (the same amino acid at a given site in all the aligned sequences)

    V: variable residues (at least 2 different amino acids at a given site)

    Pi: parsimony-informative sites (at least 2 different amino acids at a given site, with at least 2 of them occurring with a minimum frequency of 2)

    S: singletons (at least 2 different amino acids at a given site, with at most 1 of them occurring multiple times)

    (When you have a coding DNA sequence, you can translate it into a protein sequence by clicking on UUC->Phe. Click again to go back to the DNA sequence.)

    - Can you discover some groups by looking at the Pi characters?

    - Move sequences to have OsFT2 close to TaFT2, and also TaFT, OsFTa, and

    OsFTb close to each other. Can you see patterns now?
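    The site categories defined in D2 can be written down as a short Python sketch. It follows the definitions given in this handout rather than MEGA's code, and it assumes an alignment without gaps or missing data; the function names and the toy sequences are made up for the example.

```python
from collections import Counter


def classify_site(column):
    """Label one alignment column: 'C' (conserved), 'Pi' or 'S' (both variable)."""
    counts = Counter(column)
    if len(counts) == 1:
        return "C"
    # Residues seen at least twice in this column.
    frequent = sum(1 for n in counts.values() if n >= 2)
    # Parsimony informative: at least 2 residues that each occur at least twice.
    # Otherwise at most 1 residue occurs multiple times, which is a singleton site.
    return "Pi" if frequent >= 2 else "S"


def classify_alignment(sequences):
    """Classify every column of a list of equal-length aligned sequences."""
    return [classify_site(column) for column in zip(*sequences)]


if __name__ == "__main__":
    toy = ["MAKT", "MAKS", "MGRS", "MGRA"]  # hypothetical aligned sequences
    print(classify_alignment(toy))
```

    Every site labeled Pi or S is also a variable (V) site. Only the Pi sites split the sequences into groups that are each supported by more than one sequence.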

    D3. To see the format of a MEGA file, in the MEGA4 window, go to File|Export Data,

    and click on OK to take a look at it. Exit (File|Exit Editor) this window.

    D4.

    Mutations   T   V   L
    TaFT2       Q   D   P

    Which of the 3 mutations found in a TILLING screen of TaFT2 would you prioritize for characterizing a non-functional TaFT2 gene?

    (BLOSUM62 information for mutations: T;Q=-1; V;D=-3; L;P=-3.

    BLOSUM62 information for changes at the mutation positions: T;I=-1; V;I=3; I;L=2; E;L=-2.)

    Maximize the conservation of the position and the negative impact of the mutation.

    D5. Using T-COFFEE as a consistency-based program

    Copy the sequences below and open T-COFFEE in your web browser: http://tcoffee.vital-it.ch/cgi-bin/Tcoffee/tcoffee_cgi/index.cgi. Use the Regular form of T-COFFEE. Paste the sequences in the INPUT window and press submit. Click on the link for score_pdf and save the file. (NOTE: Once the file is saved you may need to rename it so that it has a .pdf extension, or it may not open properly.)

    >OsVIL1
    MASSAGGDPPPPGLFAAALHACSGASALEEHIHADDSNTISDNTLEQLGFLDQESNDASVNTEKIQSSTPKCKSVEDIPIAPAAKRCKNMDSKKLVPNSNNNSCLTGSQAPRKLPRKGDYPVQLRRNETFQDTKPPSTWICKNAACKAVLTADNTFCKRCSCCICHLFDDNKDPSLWLVCSSETGDRDCCESSCHIECALQHQKVGCVDLGQSIQLDGNYCCAACGKVIGILGFWKRQLMVAKDARRVDILCSRIYLSHRLLDGTTRFKEFHKIVEDAKAKLETEVGPLDGTSSKMARGIVGRLPVAADVQKLCSLAIDMADAWLKSNCKAETKQIDTLPAACRFRFEDITTSSLVVVLKEAASSQYHAIKGYKLWYWNSREQPSTRVPAIFPKDQRRILVSNLQPCTEYAFRIISFTEYGDLGHSECKCFTKSVEIIHKNMEHGAEGCSSTAKRDSKSRNGWSSGFQVHQLGKVLRKAWAEENGCPSEACKDEIEDSCCQSDSALHDKDQAAHVVSHELDLNESSVPDLNAEVVMPTESFRNENICSPGKNGLRKSNGSSDSDICAEGLVGEAPAMESRSQSRKQTSDLEQETYLEQETGADDSTLLISPPKHFSRRLGQLDDNYEYCVKVIRWLECSGHIEKDFRMKFLTWFSLRSTEQERRVVITFIRTLADDPSSLAGQLLDSFEEIVSSKKPRTGFCSKLWH*
    >TmVIL1
    MESTGGDPSGFAAAALHASSDVSEHEEIKPADDSNTISDYAQEPLNFFPEQESNDASVSTEKKESVVSKCKSVEEIPREATVKRCKNIDSKKLFSNNKNSPSLTGIQALRKPPRKGPHPIQLRESEMFQDKKPPSTWICKNAACKAVLTSENTFCKRCSCCICHLFDDNKDPSLWLVCSSETGDTDCCESSCHVECALQRRKAGRIDLGQSMHLDGNYCCAACGKVIGILGFWKRQLAVAKDARRVDILCSRIYLSHRLLDGTTRFKELHQIVQDAKAKLETEVGPLDGSSKMARCIVGRLPVAADVQKLCSLAMEKVDDWLQSNSQAETKQIDTLPTACRFRFEDITASSLVIVLKETASSQYHAIKGYKLWYWNSREPPSTGEPVIFPKDQRRILISNLQPCTEYAFRIISFVEDGELGHSESKCFTRSVEIMHKNIEHGAEGCSSTAKRNVKRHNGRSSGFKVRQLGKVLRRAWEEDGFPSEFCKDEIEDSCDQSDSVILEKGQVAHVVSRKLDLNETSVPDLNAEVVMPTECLRNENAYSSGKNDLRKSNGCGDFATCTEGHVGEAPAMESRSQSRKQTSDLEQETCAEDGNLVIGSQRHFSRRLGELDNNYEYCVKTIRWLECCGHIEKEFRMRFLTWFSLRSTEQERRVVLTFIRTLVDEPGSLAGQLLDSFEEIVASKRPRTGFCTKLWH*
    >OsVIL2
    MDPPYAGVPIDPAKCRLMSVDEKRELVRELSKRPESAPDKLQSWSRREIVEILCADLGRERKYTGLSKQRMLEYLFRVVTGKSSGGGVVEHVQEKEPTPEPNTANHQSPAKRQRKSDNPSRLPIVASSPTTEIPRPASNARFCHNLACRATLNPEDKFCRRCSCCICFKYDDNKDPSLWLFCSSDQPLQKDSCVFSCHLECALKDGRTGIMQSGQCKKLDGGYYCTRCRKQNDLLGSWKKQLVIAKDARRLDVLCHRIFLSHKILVSTEKYLVLHEIVDTAMKKLEAEVGPISGVANMGRGIVSRLAVGAEVQKLCARAIETMESLFCGSPSNLQFQRSRMIPSNFVKFEAITQTSVTVVLDLGPILAQDVTCFNVWHRVAATGSFSSSPTGIILAPLKTLVVTQLVPATSYIFKVVAFSNYKEFGSWEAKMKTSCQKEVDLKGLMPGGSGLDQNNGSPKANSGGQSDPSSEGVDSNNNTAVYADLNKSPESDFEYCENPEILDSDKASHHPNEPTNNSQSMPMVVARVTEVSGLEEAPGLSASALDEEPNSAVQTQLLRESSNSMEQNQRSEVPGSQDASNAPAGNEVVIVPPRYSGSIPPTAPRYMENGKDISGRSLKAKPGDNILQNGSSKPEREPGNSSNKRTSGKCEEIGHKDGCPEASYEYCVKVVRWLECEGYIETNFRVKFLTWYSLRATPHDRKIVSVYVNTLIDDPVSLSGQLADTFSEAIYSKRPPSVRSGFCMELWH*
    >TmVIL2
    MDPPYAGAIIEPAKCRLMSVDEKKDLVRELSKRPQTAPDKLQSWSRRDIVEILCADLGRERKYTGLSKQRMLDYLFRVVTGKSSGPVVHVQEKEPTLDPNTSNHQYPAKRQRKSDNPSRLPIAVNNPQTAVVPVQINNVRSCRNIACRAILSMEDKFCRRCSCCICFKYDDNKDPTIWLSCSSDHPMQKDSCGLSCHLECALKDGRTGILPSGQCKKLDGAYYCPNCRKQHDLLRSWKKQLMLAKEARRLDILCYRIFLGHKVLFSTEKYSVLHKFVDIAKQKLEAEVGSVAGHGSMGRGIVSRLTCGAEVQKLCAEALDVMQSKFPVESPTNSQFERSNMMPSSFIKFEPITPTSITVVFDLARCPYISQGVTGFKVWHQVDGTGFYSLNPTGTVHLMSKTFVVTALKPATCYMIKVTAFSNSSEFVPWEARVSTSSLKESDLKGLAPGGAGLVDQNNRSPKTNSGGQSDRSSEGVDSNNNATVYTDLNKSPESDFEYCENPEILDSDKVPHHPNGPSNNLQNMQIVAARVPEVTELEEAPGLSASALDEEPNSTVQAALLRESSNSMEQNQRSEVPISQDASNATAGVELALVPRFVGSMPPTAPRVMETGKETGGRSFNTKPSDNIFQNGSSKPDREPGNSSNKRSGKFEDAGHKDGCPEATYEYCVRVVRWLETEGYIETNFRVKFLTWYSLRATPHDRKIVSVYVDTLINDPASLCGQLTDTFSEAIYSKKPPSVPSGFCMNLWH*

    Note: The NCBI multiple alignment tool for proteins is COBALT, which performs progressive multiple alignment of protein sequences. The alignment is aided by a collection of pairwise constraints derived from the conserved domain database, the protein motif database, and local sequence similarity, using RPS-BLAST, PHI-BLAST, and BLASTP, respectively. Computation time is reduced by forming clusters of sequences that share a large number of common words and finding conserved domains and motif matches for only one sequence per cluster.

    D6. Creating a graphical representation of amino acid conservation. A FASTA file of the first 50 amino acids of the FT protein alignment has been saved in

    the 10_Lab3 folder. Open the FASTA file, ‘FTclipped.fas’, using Microsoft Word.

    Copy the FASTA alignment and paste it in the Multiple Sequence Alignment window of

    WebLogo: http://weblogo.berkeley.edu/logo.cgi. Click Create Logo.
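    The height of each stack in a sequence logo reflects how conserved that column is, measured in bits. The Python sketch below shows the basic calculation as an illustration only: it assumes a 20-letter protein alphabet, ignores gap characters, and leaves out the small-sample correction that WebLogo applies, so its values can be somewhat higher than those in the WebLogo output. The function names are made up for the example.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Information content (bits) of one alignment column, gaps ignored."""
    counts = Counter(residue for residue in column if residue != "-")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


def letter_heights(column, alphabet_size=20):
    """Height of each letter in the stack: its frequency times the column information."""
    counts = Counter(residue for residue in column if residue != "-")
    total = sum(counts.values())
    info = column_information(column, alphabet_size)
    return {residue: (n / total) * info for residue, n in counts.items()}


if __name__ == "__main__":
    for column in ("AAAA", "AACC", "ACDE"):  # hypothetical alignment columns
        print(column, round(column_information(column), 3), letter_heights(column))
```

    A fully conserved protein column reaches the maximum of log2(20), about 4.32 bits, and every additional residue type seen in the column lowers the stack.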

E. Calculating a Distance Matrix

    E1. In the MEGA4 window, go to Distances|Compute Pairwise. In the ‘Analysis Preferences’ window, change ‘Model’ to Amino Acid|No. of differences (leave the default parameters in the other options). Click on Compute.

    E2. See the Pairwise Distances matrix.

    - Which sequences are the closest ones?

    - Which sequences are the most distant ones?
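    The ‘No. of differences’ model simply counts the sites at which two aligned sequences differ. A minimal Python sketch is shown below. It assumes that positions with a gap in either sequence of a pair are skipped, which may differ from the gap handling selected in MEGA, and the function names and toy alignment are made up for the example.

```python
from itertools import combinations


def n_differences(seq_a, seq_b):
    """Count sites where two aligned sequences differ, skipping gapped sites."""
    return sum(
        1 for x, y in zip(seq_a, seq_b) if x != y and x != "-" and y != "-"
    )


def pairwise_differences(alignment):
    """Map each pair of names to its number of differences.

    alignment: dict of name -> aligned sequence (all the same length).
    """
    return {
        (a, b): n_differences(alignment[a], alignment[b])
        for a, b in combinations(alignment, 2)
    }


if __name__ == "__main__":
    toy = {"s1": "MKTA", "s2": "MKTV", "s3": "MRSV"}  # hypothetical alignment
    distances = pairwise_differences(toy)
    print(distances)
    print("closest:", min(distances, key=distances.get))
    print("most distant:", max(distances, key=distances.get))
```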

    E3. To see the matrix in a MEGA file and save it, go to File|Export/Print Distances, and change the ‘Output Format’ from ‘Publication’ to ‘MEGA’. Click on Print/Save Matrix.

    E4. After you have inspected the matrix, go to File|Quit Viewer to close the Pairwise

    Distances matrix.

F. Drawing a Phylogenetic Tree

    F1. In the MEGA4 window, go to Phylogeny|Construct Phylogeny|Neighbor-Joining (NJ). In the ‘Analysis Preferences’ window, in the ‘Options Summary’ tab, change Model to Amino Acid|No. of differences (leave the default parameters in the other options). Click on Compute.

    F2. See the tree in the Tree Explorer window.

    F3. To select a branch, left-click on it. If you right-click on a branch, you will find several options to perform different operations on the ‘Selected subtree’. To edit the accession labels, double-click on them. Change the branch style by selecting View|Tree/Branch Style.

    F4. To copy the tree to the clipboard so that you can save it in a Word document, go to Image|Copy to clipboard. Open a Word document and paste the tree. Exit the Tree Explorer window (File|Exit Tree Explorer) without saving.

    - Use Phylogeny|Construct Phylogeny to produce minimum evolution, maximum parsimony and UPGMA trees. Copy and paste each of them into the same Word document to compare them. Are the results consistent?
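    Of the distance-based methods above, UPGMA is the simplest to write down: it repeatedly joins the two closest clusters and averages their distances to the remaining clusters. The Python sketch below illustrates that clustering step only. It returns the tree topology as nested tuples without branch lengths, it is not MEGA's implementation, and the function name and toy distances are made up for the example.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster names by UPGMA and return the topology as nested tuples.

    names: list of sequence labels.
    dist:  dict mapping a pair (a, b) of labels to their distance.
    """
    sizes = {name: 1 for name in names}  # cluster -> number of leaves
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(sizes) > 1:
        # Join the two clusters separated by the smallest distance.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        # Distance to the new cluster is the size-weighted average of the old ones.
        for other in sizes:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


if __name__ == "__main__":
    toy = {  # hypothetical pairwise distances
        ("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
        ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 10,
    }
    print(upgma(["A", "B", "C", "D"], toy))
```

    UPGMA assumes that all lineages evolve at the same rate, which is one reason its tree can disagree with the Neighbor-Joining tree built from the same distance matrix.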

    [Figure: example trees for the eight proteins (OsFTa, TaFT, OsFTb, OsFT2, TaFT2, AtFT, AtTSF, AtTFL1), one panel each for NJ, UPGMA, Max Parsimony, and Min Evolution.]

G. Evaluating a Phylogenetic Tree

    G1. In the MEGA4 window, go to Phylogeny|Construct Phylogeny|Neighbor-Joining

    (NJ). In the ‘Analysis Preferences’ window, in the ‘Test of Phylogeny’ tab, select

    Bootstrap with 1,000 replications. Click on Compute.

    G2. See the tree and the bootstrap values in the Tree Explorer window.

    - What is the confidence of the OsFTa-TaFT branch?

    G3. Go to Image|Copy to clipboard and paste the tree into your Word document. Exit the Tree Explorer window (File|Exit Tree Explorer) without saving.
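    A bootstrap replicate is built by resampling the columns of the alignment with replacement, and the bootstrap value of a grouping is the fraction of replicates in which that grouping is recovered. The Python sketch below illustrates the resampling step. As a simplified stand-in for rebuilding a full tree, it only checks whether a given pair of sequences is the closest pair in each replicate; the function names, the toy alignment, and the tie handling (the alphabetically first pair wins) are assumptions made for the example.

```python
import random


def bootstrap_columns(alignment, rng):
    """One bootstrap replicate: sample alignment columns with replacement."""
    length = len(next(iter(alignment.values())))
    picks = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(seq[i] for i in picks) for name, seq in alignment.items()
    }


def closest_pair(alignment):
    """Pair of names with the fewest differences (first pair wins ties)."""
    names = sorted(alignment)
    best = None
    for index, a in enumerate(names):
        for b in names[index + 1:]:
            diff = sum(x != y for x, y in zip(alignment[a], alignment[b]))
            if best is None or diff < best[0]:
                best = (diff, (a, b))
    return best[1]


def pair_support(alignment, pair, replicates=1000, seed=0):
    """Fraction of bootstrap replicates in which pair is the closest pair."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    target = tuple(sorted(pair))
    hits = sum(
        closest_pair(bootstrap_columns(alignment, rng)) == target
        for _ in range(replicates)
    )
    return hits / replicates


if __name__ == "__main__":
    toy = {"s1": "MKTAYLG", "s2": "MKTAYIG", "s3": "MRSVFIG"}  # hypothetical
    print(pair_support(toy, ("s1", "s2"), replicates=1000))
```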

H. Within MEGA Alignment Explorer we can retrieve sequences directly from GenBank

    We have discovered a MADS box protein from barley (GenBank # CAB97352) and we want to determine the closest protein among the following three Arabidopsis proteins: AP1 = CAA78909; AGL2 = AAA32732; AGL6 = AAA79328.

    H1. In the MEGA4 window, go to Alignment|Query databanks.

    H2. In the NCBI Entrez site, select the Protein database, enter the first GenBank number (CAB97352) into the search box, and click on Go. When the search result is displayed, open it and then click on ‘Add to Alignment’.

    H3. Repeat step H2 for the three Arabidopsis sequences.

    H4. Align the protein sequences using ClustalW as before, save the alignment as ‘MADS.mas’, exit, and open the file in MEGA.

    H5. Perform a Neighbor-Joining (NJ) analysis. Copy and paste the phylogenetic tree

    into your Word document.

    - Which Arabidopsis protein is the closest one to the MADS box protein from

    barley?

I. Viewing the 3D structure of a protein

    I1. Cn3D is an application that allows you to view 3-dimensional structures of proteins. Go to protein BLAST (blastp). Copy and paste the AtFT protein sequence and click on BLAST.

    I2. Once your results are completely displayed, go to Show Conserved Domains.

    - What is the name of the conserved domain?

    Click on it to find more information about the conserved domain.

     - What biological functions have been attributed to this

    conserved domain?

    I3. Click on Structure to go to the Entrez Structure database. In the Structure database, enter the name of the conserved domain you found and click on Go. Click on the link displayed in your results. In the Structure Summary window, click on Structure View in Cn3D. Open the file with Cn3D. Cn3D tutorial: http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3dtut.shtml.

    I4. Go to View|Animation|Spin for a complete view of the 3D structure of the conserved domain. You can change the Style in which you want to see the 3D structure.

