Data Analysis Project
Objectives and Overview
The primary objective of this project is to give students experience with an array of different phylogenetic techniques used to analyze DNA sequence data. Each student will analyze a data set of 15-20 aligned DNA sequences, including at least one sequence that will serve as the outgroup. Data will be analyzed using the computer programs MEGA, PAUP*4.0, PhyML, MrBayes and other programs that we encounter in our workshop sessions. The project will include analyses conducted using genetic distance approaches, maximum parsimony, maximum likelihood, and Bayesian statistics. Students should endeavor to include in the project as many of the skills and methods learned in the workshop sessions as possible.
Students will prepare a written report of their projects in standard journal format, and make a brief oral PowerPoint/Keynote presentation (10 minutes maximum; strictly enforced) to the class as a whole the week of Dec. 7, 2009. Emphasis in the write-up is placed on presentation of the results of the phylogenetic analyses and on interpretation of the results of the MEGA, PAUP*, PhyML, and MrBayes analyses in the context of the biology of the organisms from which the DNA sequences were drawn.
The Data Analysis Project includes the following elements:
A. Choosing a Data Set and Preparing a Proposal
The Data Set
Choose a set of 15-20 DNA sequences for your analysis, including at least one sequence that will serve as the outgroup. Consult with Dr. Smith on your choice of taxa and sequences to be analyzed. Sequences should be nucleotide data for protein coding genes.
Sequences should be drawn from one of the following sources:
a. The literature. Find a published paper with a "phylogenetic tree" and GenBank accession numbers for the sequences used. These sequences (or a subset of them) can be downloaded to create the data set. Alternatively, many aligned data sets are available from TreeBASE, from the journals in which papers were published (e.g., Systematic Biology), or from the authors themselves (many authors are flattered that someone else is interested in their dataset).
b. Create a data set of DNA sequences for a set of taxa in which you are interested, and for which appropriate sequences exist in GenBank. Download the sequences to create the data set. Be sure to consult with Dr. Smith if you choose this option.
c. Your own (or a colleague's) data. If you have some sequence data, I encourage you to use it for the project.
Prepare a one-page proposal (word-processed, 12 pt. type w/ 1 in. margins) that includes the following:
a. A brief introduction to your system in which you briefly describe your taxon set and the molecular marker(s) you've chosen to analyse. Explain why this system is of interest to you and why you have chosen these marker(s).
b. List the set of analyses that you will be conducting, in outline form. (This should make a nice checklist). Also propose a timeline for your work. E.g., "on Nov. 18, I will carry out the distance analyses and the parsimony analyses".
c. In conjunction with part b, list a core set of figures and tables that you will be preparing for your final paper. E.g., Figure - ML tree obtained using PhyML shown as a phylogram (with aLRT test values on the branches).
d. Make an appointment with Dr. Smith for a 15-20 minute interview to agree upon the taxon set and the DNA sequences to be analyzed, and to discuss your proposal. Proposal interviews are scheduled to occur in the afternoon on Thurs. Nov. 12, Fri. Nov. 13, Mon. Nov. 16, and Tues. Nov. 17. See the list of possible proposal interview times, and then email Dr. Smith to make an appointment.
B. Data Analysis
Required elements for the project are:
1. Alignment of Sequences and Data Block Creation (refer to Workshop I)
a. Choose your taxa and sequences to be analyzed. Do this in consultation with Dr. Smith (see Proposal above).
b. Align your sequences using Clustal W as implemented in MEGA4. (For protein coding regions, this may not be too difficult to do, even without Clustal W.)
c. Create a datablock of aligned sequences in MEGA and/or Nexus format.
d. When you write your paper, include in an Appendix a web link or an electronic file reference to a figure that shows your aligned dataset. (Both MEGA and MacClade provide options for exporting data in publication style formats.)
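As a rough illustration of what step 1c produces, the following Python sketch writes a few already-aligned sequences to a minimal Nexus DATA block. The taxon names and sequences are hypothetical placeholders, and the function name to_nexus is an assumption made for this example only; for the project itself, the export options in MEGA and MacClade remain the intended route.

```python
def to_nexus(alignment):
    """Return a minimal Nexus DATA block for a dict of aligned DNA sequences."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    nchar = lengths.pop()
    width = max(len(name) for name in alignment)
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(alignment)} NCHAR={nchar};",
        "  FORMAT DATATYPE=DNA MISSING=? GAP=-;",
        "  MATRIX",
    ]
    # One row per taxon: padded name, then the aligned sequence.
    for name, seq in alignment.items():
        lines.append(f"    {name.ljust(width)}  {seq}")
    lines += ["  ;", "END;"]
    return "\n".join(lines)


# Hypothetical placeholder data (not a real data set).
example_alignment = {
    "Outgroup": "ATGGCC-TTA",
    "Taxon_A": "ATGGCCATTA",
    "Taxon_B": "ATGGTCATTG",
}
nexus_text = to_nexus(example_alignment)
print(nexus_text)
```

Checking that NTAX and NCHAR in the DIMENSIONS line match the matrix is a quick way to catch a sequence that was truncated or left unaligned before the file is executed in PAUP*.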
2. Analysis of Pairwise Distances (Neighbor-joining) (refer to Workshop II)
a. Execute your datablock using MEGA4.
b. Generate a Neighbor-joining tree and save it to a file, both as a Newick file and as a graphic image.
c. Carry out a bootstrap analysis to obtain Bootstrap Confidence Limits (BCL) for branches on this tree.
d. Create a figure that contains the NJ tree, the bootstrap values on the branches, and a complete, descriptive figure legend. Make sure you include a scale bar in the figure to indicate genetic distance.
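To make the notion of "genetic distance" on the scale bar concrete, here is a minimal Python sketch of a pairwise p-distance and its Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3). The two sequences are hypothetical, and the Jukes-Cantor model is used here only because it is the simplest correction; MEGA offers several other models, and the one you choose for the project may differ.

```python
import math


def p_distance(seq1, seq2):
    """Proportion of differing sites, skipping sites with gaps or ambiguity codes."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a in "ACGT" and b in "ACGT":
            compared += 1
            if a != b:
                differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed proportion p."""
    if p >= 0.75:
        raise ValueError("distance is undefined for p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


# Hypothetical aligned sequences differing at 2 of 10 sites.
p = p_distance("ATGGCCATTA", "ATGGTCATTG")
d = jukes_cantor(p)
print(f"p = {p:.3f}, JC distance = {d:.4f}")
```

The corrected distance is always at least as large as p, because the correction accounts for multiple substitutions at the same site.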
3. Maximum Parsimony Analysis (refer to Workshops III & IV)
a. Execute your datablock using PAUP*. Make sure that the analysis type is set to "Parsimony", and take note of whether or not all of your characters are included in the analysis.
b. Search the treespace for the shortest tree(s). Save these trees to a treefile.
c. Open the treefile in edit mode of PAUP, and add to it the NJ tree from step 2b.
d. Record treelength, tree retention index and tree consistency index for the MPRs.
e. Perform a bootstrap analysis to obtain a bootstrap consensus tree.
f. Determine Bremer support (Decay Indices) using AutoDecay.
g. Create a figure that contains the MP tree (or a strict consensus of the MP trees), with bootstrap values on the branches, Decay indices at the nodes, and a complete, descriptive figure legend.
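The tree length and consistency index recorded in step 3d can be illustrated for a single character with a small Python sketch of Fitch's parsimony counting on a rooted binary tree. The trees, taxon names, and character states below are hypothetical, and the nested-tuple tree representation is an assumption chosen for brevity; PAUP* computes these values over all characters at once.

```python
def fitch(node, states):
    """Return (possible state set, number of steps) for a nested-tuple binary tree."""
    if isinstance(node, str):
        # A tip: its state set is the observed state, with zero steps.
        return {states[node]}, 0
    left, right = node
    left_set, left_steps = fitch(left, states)
    right_set, right_steps = fitch(right, states)
    common = left_set & right_set
    if common:
        return common, left_steps + right_steps
    # No shared state: take the union and count one change.
    return left_set | right_set, left_steps + right_steps + 1


def consistency_index(steps, states):
    """CI for one character: minimum possible steps divided by observed steps."""
    minimum = len(set(states.values())) - 1
    return minimum / steps


# Hypothetical character scored for four taxa.
site = {"A": "T", "B": "T", "C": "C", "Out": "C"}
tree_1 = ((("A", "B"), "C"), "Out")
tree_2 = ((("A", "C"), "B"), "Out")

steps_1 = fitch(tree_1, site)[1]
steps_2 = fitch(tree_2, site)[1]
print(steps_1, consistency_index(steps_1, site))
print(steps_2, consistency_index(steps_2, site))
```

On the first tree the character changes once and fits without homoplasy; on the second tree it requires an extra step, so its consistency index drops.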
4. Analysis of Character Evolution using MacClade (refer to Workshop V)
a. Open your data block in MacClade.
b. Open your treefile when prompted.
c. Reconstruct a character phylogeny (Trace Character) for a character that has different CI and RI on two of the trees in your treefile.
d. Use "Compare Trees" to compare character evolution in these same two trees.
i. If there are no significant differences between the two trees, use MacClade to move a branch and create a new tree. Use this tree for the Tree Comparison.
5. Maximum Likelihood Analysis (refer to Workshops VI & VII)
a. Run Modeltest to determine the appropriate model of nucleotide sequence evolution to use for your maximum likelihood analysis.
b. Incorporate the likelihood settings that you obtain into your Nexus file.
c. Search for the maximum likelihood tree(s) using either PAUP*4.0 or PhyML. Append the ML tree to the treefile that you created above.
d. Conduct an aLRT branch support test using PhyML.
e. Compare the likelihoods of the trees in your treefile using the Kishino-Hasegawa test and the Shimodaira-Hasegawa test. (Note: you should have at a minimum the NJ tree, one MP tree, and the ML tree in this comparison.)
f. Create a figure that contains the ML tree (with aLRT values on the branches), and a complete, descriptive figure legend.
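Model selection in step 5a can be illustrated with the Akaike Information Criterion, AIC = 2k - 2 lnL, which is one of the criteria Modeltest reports. The Python sketch below is only an illustration: the log-likelihood values are hypothetical, and the parameter counts assume that only substitution-model parameters are counted (branch lengths are ignored).

```python
def aic(log_likelihood, n_parameters):
    """Akaike Information Criterion: 2k - 2 lnL (lower is better)."""
    return 2 * n_parameters - 2 * log_likelihood


# Hypothetical log-likelihoods; k counts substitution-model parameters only.
candidates = {
    "JC69": {"lnL": -3250.4, "k": 0},
    "HKY85": {"lnL": -3198.7, "k": 4},
    "GTR+G": {"lnL": -3180.2, "k": 9},
}

scores = {name: aic(m["lnL"], m["k"]) for name, m in candidates.items()}
best_model = min(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name:6s} AIC = {score:.1f}")
print("Preferred model:", best_model)
```

A more complex model is preferred only if its gain in likelihood outweighs the penalty for its extra parameters.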
6. Bayesian Inference of Phylogeny (refer to Workshops VIII & IX)
a. Determine an appropriate set of likelihood settings to put into your "mrbayes" block in your Nexus file. See p. 147 in PTME3 and our course Bayes block templates for help.
b. Determine appropriate MCMC parameters and an appropriate burnin value to incorporate into your Bayes block.
c. Run MrBayes using the procedures outlined in Hall's PTME3 (2008), pp. 146-149.
d. Use TreeView to visualize your Bayesian consensus tree. Have you remembered to exclude the "burn-in"?
e. Create a figure that contains the Bayes tree as a phylogram, with posterior probability values on the branches, and a complete, descriptive figure legend.
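The role of the burn-in and the meaning of a posterior probability on a branch can be sketched in a few lines of Python. This is a conceptual illustration only, not how MrBayes stores its output: each sampled tree is represented here as a set of clades, the samples are hypothetical, and the 25% burn-in fraction is an assumed default for the example.

```python
def clade_posterior(sampled_trees, clade, burnin_fraction=0.25):
    """Fraction of post-burn-in samples that contain the given clade."""
    if not 0 <= burnin_fraction < 1:
        raise ValueError("burnin_fraction must be in [0, 1)")
    start = int(len(sampled_trees) * burnin_fraction)
    kept = sampled_trees[start:]  # discard the burn-in samples
    if not kept:
        raise ValueError("no samples left after burn-in")
    target = frozenset(clade)
    return sum(target in tree for tree in kept) / len(kept)


# Hypothetical MCMC output: each sample is the set of clades in that tree.
AB = frozenset({"A", "B"})
AC = frozenset({"A", "C"})
samples = [{AC}, {AC}, {AB}, {AB}, {AB}, {AB}, {AB}, {AC}]

pp_AB = clade_posterior(samples, {"A", "B"})
print(f"Posterior probability of clade (A,B): {pp_AB:.3f}")
```

In this toy chain the first two samples are discarded as burn-in, and the clade appears in five of the remaining six samples. Forgetting to exclude the burn-in would lower the estimate to five of eight.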
7. Calculation of Nucleotide Diversity and Polymorphism (refer to Workshop X) (Bonus Section)
a. For your data set, use DnaSP to estimate the parameters π (pi), k, and θ (theta). Explain in your write-up what these parameters are, and what the observed values mean.
b. Examine your data set for evidence of selection using Tajima's test and Fu and Li's Test.
c. Use MEGA to examine codon usage and to check the GC content at 3rd positions.
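As an aid to explaining the parameters in step 7a, the following Python sketch computes k (the average number of pairwise nucleotide differences), π (k per site), and Watterson's θ (from the number of segregating sites S) for a small alignment. The sequences are hypothetical, the sketch assumes an alignment without gaps or missing data, and DnaSP may report θ per site or per sequence depending on the option chosen.

```python
from itertools import combinations


def diversity_stats(sequences):
    """Return k, pi, S, and Watterson's theta for gap-free aligned sequences."""
    n = len(sequences)
    length = len(sequences[0])
    if n < 2 or any(len(seq) != length for seq in sequences):
        raise ValueError("need at least two sequences of equal length")

    # k: average number of differences over all pairs of sequences.
    pairs = list(combinations(sequences, 2))
    total = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    k = total / len(pairs)

    # S: number of segregating (polymorphic) sites.
    s_sites = sum(len(set(column)) > 1 for column in zip(*sequences))

    # Watterson's estimator: S divided by the harmonic sum a_n.
    a_n = sum(1 / i for i in range(1, n))
    theta = s_sites / a_n

    return {
        "k": k,
        "pi": k / length,
        "S": s_sites,
        "theta_per_sequence": theta,
        "theta_per_site": theta / length,
    }


# Hypothetical aligned sequences (no gaps).
stats = diversity_stats(["ATGGCCATTA", "ATGGTCATTG", "ATGGTCATTA", "ATGGCCATTA"])
for name, value in stats.items():
    print(name, round(value, 4))
```

Tajima's test in step 7b is built on the difference between these two estimates, k and S divided by the harmonic sum, which are expected to agree under neutrality.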
C. Progress Report
Progress toward completion of the project will be assessed in the Workshop sessions on Wed. Nov. 25 and Wed. Dec. 2. The progress report will be given orally during the workshop in a one-on-one conversation with Dr. Smith. The report will allow me to assess who needs help and in what area(s).
Your report should be in standard journal article format, containing the following sections: Introduction, Materials and Methods, Results and Discussion, and References.
1. Introduction
Good introductions contain the following elements:
Some additional items that can be addressed in the introduction to this paper are: What was the rationale for your choice of taxa? What molecule was chosen for the analysis, and why?
2. Materials and Methods
Be as concise as possible but be sure to include information about the following:
3. Results and Discussion
a. State your results objectively and without interpretation. Do the interpretation in the discussion.
b. Some of the questions that should be addressed in the discussion are:
All work and ideas that you present in your paper that are not your own need citations. Refer to the "Instructions for Authors" for a journal of your choice and use that format for your citations. (When I searched on the string "Instructions for Authors" using Google, I obtained 121,000 hits (!).)
D. Oral Presentation of Your Project
Grading Criteria for Presentations
Each student will prepare an oral presentation of their Data Analysis Project. These are informal presentations that will take place in class on Monday Dec. 7 and Wednesday Dec. 9. Presentations will be done in a "lab meeting" format, and each student will have 10 minutes (strictly enforced) to tell the class what the project is, what kinds of analyses were conducted, and what results were obtained.
A sign-up sheet for presentations will be available in class on Monday Nov. 16th.
One purpose of these reports is to solicit ideas and feedback from the class as a whole. These ideas and suggestions can then be incorporated into the final write-up.
E. What to Turn In
On Friday December 18th (by 5pm), students should turn in:
1. A hard copy of the written report.
2. An electronic version of the ppt presentation.
3. An electronic version of your Nexus file.
The Data Analysis Project is worth a total of 160 points. These points are distributed within the project as follows:
1. Proposal and Proposal Interview (20 points)
2. Oral Presentation of Project (20 points)
3. The paper (100 points)
4. The following items turned in on time (Dec. 18th, 5 pm):
a. A hard copy of the written report. (5 points)
b. An electronic version of the ppt presentation. (10 points)
c. An electronic version of your Nexus file. (5 points)