I just published a paper on a user-friendly webserver for prokaryote genomics analysis with colleagues from the universities of Sussex, Wageningen and Nijmegen. The webserver is available here and the paper is freely accessible at the PLoS ONE website.
From the abstract/cover letter: Quantifying selection is a central goal in evolutionary biology. The McDonald-Kreitman (MK) test is a powerful test of selection that compares patterns of nucleotide substitutions within a species to those separating this species from an outgroup species. The MK test recognizes two classes of mutations: synonymous mutations, which change the DNA sequence but not the protein it codes for, and nonsynonymous mutations, which change both DNA and protein. The former class of mutations is presumed to be (largely) neutral, as the phenotype of the organism is not changed. The latter class could also be neutral, or deleterious (most random changes to an organism are for the worse), or beneficial. Under purely neutral evolution, the ratio of nonsynonymous to synonymous substitutions within a species is expected to be the same as that between this species and an outgroup species. If a species has diverged due to adaptation (having changed its phenotype), an excess of nonsynonymous changes is expected between species relative to that within species. This is because adaptive mutations are fixed relatively rapidly and so contribute little to intra-specific polymorphism but do contribute to between-species divergence.
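To make the logic of the test concrete, here is a minimal sketch in Python of an MK test for a single gene. It is not the ODoSE code: the function names and the counts in the usage example are purely illustrative, and I am assuming the common choices of Fisher's exact test on the 2x2 table and the neutrality index NI = (Pn/Ps)/(Dn/Ds) as summary.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more probable than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


def mk_test(dn, ds, pn, ps):
    """Return (neutrality index, p-value) for one gene.

    dn, ds: nonsynonymous / synonymous fixed differences to the outgroup.
    pn, ps: nonsynonymous / synonymous polymorphisms within the species.
    """
    if dn and ds and ps:
        ni = (pn / ps) / (dn / ds)  # NI < 1 suggests adaptive divergence
    else:
        ni = float("nan")  # undefined when a required count is zero
    return ni, fisher_exact_two_sided(dn, ds, pn, ps)


# Illustrative counts only (hypothetical gene).
ni, p = mk_test(dn=7, ds=17, pn=2, ps=42)
print(f"NI = {ni:.3f}, p = {p:.4f}")
```

A neutrality index below 1 together with a small p-value points to an excess of nonsynonymous divergence, the signature of adaptive evolution described above. Note that NI is undefined as soon as one of the counts it divides by is zero, which happens often for genes with little variation.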
Both a lack of software applications and statistical difficulties with the MK test have prevented this test from being commonly used on a genome-wide scale. Two novel extensions of the MK test were introduced last year: the Direction of Selection (DoS) statistic can be calculated for genes of low nucleotide diversity, substantially increasing the total number of genes that can be analysed, and the weighted-average statistic (NITG) can be calculated for the entire genome regardless of between-gene heterogeneity. The ODoSE webserver is the first tool that allows the MK test of selection, and its two novel extensions, to be performed on the level of entire prokaryote genomes.
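For readers who like to see the arithmetic, the two extensions can be sketched in a few lines of Python. This is only my shorthand for the statistics, not the ODoSE implementation. I am assuming the definitions DoS = Dn/(Dn+Ds) - Pn/(Pn+Ps) for a single gene, and NITG = Σ(Ds·Pn/(Ps+Ds)) / Σ(Ps·Dn/(Ps+Ds)) summed over genes; the function names and example counts are hypothetical.

```python
def direction_of_selection(dn, ds, pn, ps):
    """DoS for one gene; positive values suggest adaptive divergence.

    Returns None only when a gene has no fixed differences or no
    polymorphisms at all, so it stays defined for sparse tables where
    the neutrality index would require division by zero.
    """
    divergence, polymorphism = dn + ds, pn + ps
    if divergence == 0 or polymorphism == 0:
        return None
    return dn / divergence - pn / polymorphism


def ni_tg(genes):
    """Weighted-average neutrality index over (dn, ds, pn, ps) tuples."""
    numerator = denominator = 0.0
    for dn, ds, pn, ps in genes:
        weight = ps + ds
        if weight == 0:
            continue  # gene carries no synonymous information; skip it
        numerator += ds * pn / weight
        denominator += ps * dn / weight
    return numerator / denominator if denominator else None


# Hypothetical per-gene counts: (dn, ds, pn, ps).
genes = [(7, 17, 2, 42), (3, 0, 1, 5), (0, 4, 2, 6)]
for gene in genes:
    print(gene, direction_of_selection(*gene))
print("NI_TG:", ni_tg(genes))
```

The second and third genes in the example each contain a zero that would make a per-gene neutrality index undefined, yet both still get a DoS value and still contribute to the genome-wide weighted average.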
I have had a long-standing interest in elucidating past recombination and selection from sequence data, but my interest in the McDonald-Kreitman (MK) test developed while writing a paper on bacterial speciation (see also this post). Very briefly, my reasoning was: bacterial types could be defined on the basis of their niche > the different selective pressures experienced in each niche will manifest themselves in the genome > these genomic patterns of differential adaptation could be tested using the MK test > different clusters that yield significant differences in adaptation can be classified as distinct types/species. My idea was to use this test not in the classical sense of taking an outgroup species to test for the presence of positive selection, but just the other way around: by defining the outgroup to be a separate species only when there is evidence of positive selection (‘inhabitation of a different niche’). Of course, this reasoning is relatively simplistic. It only compares genes shared by different types (because this test cannot be performed on genes that are present in one type but not the other), and natural selection can act in ways that are not picked up by the MK test. However, it does provide some sort of handle on the issue and has the potential to come up with a biologically meaningful ‘species border’ (statistical significance) instead of an arbitrarily chosen measure of genetic distance.
Adaptive divergence between different bacterial types is central to bacterial speciation research, but the MK test as implemented in ODoSE can of course be used in the original sense of detecting patterns of selection, for instance screening for candidate genes under suspected positive selection in a pathogenic bacterium that could give clues to its virulence. ODoSE allows the analysis of user-generated genome data as well as genomes deposited in NCBI, or combinations of both. What is particularly nice about it is that it has a graphical user interface and so no coding is required. Population geneticists might not be afraid of coding, but I know from experience that many microbiologists are! The Galaxy environment in which it is hosted allows customization and/or coupling to workflows developed by others. Moreover, the webserver gives a lot of additional output that can be used for a range of analyses. For instance, the distribution of each gene across the genomes is given, three tests of recombination are performed, and all orthologous genes are automatically trimmed, aligned and concatenated for each genome, which makes it easy to generate high-resolution phylogenies. For those who are interested, the website contains a manual with more information.
Lastly, Mark van Passel and I were only able to embark on this project because of generous support from the Netherlands Bioinformatics Centre (NBIC), specifically in the form of hard work by bioinformatician Tim te Beek, who actually scripted this pipeline. Thanks, Tim!