Wondering why it was taking PLoS ONE so long to publish my new paper, I decided to google it and found it was actually out already! There is no proofreading stage, so that might have something to do with email contact halting early in the publishing process. Anyhow, the paper was produced with (now ex-)colleagues from the Netherlands Institute of Ecology and Christopher Quince from the University of Glasgow, who did most of the work.
It is a bit of an exploratory paper, taking a first look at using an alternative marker to the gold standard 16S rRNA gene (‘16S’) for assessing bacterial diversity. Quoting from the paper: 16S is universally present and contains both highly conserved fragments, facilitating the design of PCR primers targeting all members of a community, and more variable regions that allow for the discrimination of different microbial taxa. The identity of 16S sequences from the environment can be related to the taxonomic identity of sequences obtained from cultivated, characterized strains. With the introduction of high throughput (pyro)sequencing methods, very large datasets are accruing in studies of microbial diversity. However, 16S is not without potential drawbacks and the use of alternative (protein-coding) markers has been proposed, including the beta subunit of RNA polymerase, rpoB.
We hypothesized that using the rpoB gene could offer potential advantages over standard 16S-based approaches. First, since most bacterial genomes contain multiple copies of the 16S rRNA gene, and copy number varies per species, extrapolation of relative abundances from gene recovery frequencies is seriously impaired. (This is further complicated by the fact that sequence variation exists between the different 16S copies present in some genomes.) rpoB typically occurs in a single copy. Second, the high level of conservation across 16S rRNA genes can obscure most intraspecific, and sometimes interspecific, variation. In contrast, the higher resolution rpoB marker is capable of revealing molecular variation down to the population level. Third, it has been shown that genetic divergence of rpoB correlates better with overall genomic divergence and provides better bootstrap support for phylogenetic reconstruction. Fourth, given the fact that rpoB is a protein-encoding gene, the data generated from this marker are more readily interpreted in an evolutionary framework. Fifth, (pyro)sequencing error is an important confounding factor in studies of microbial diversity using 16S rRNA gene sequences, whereas in single-copy, essential protein-encoding genes, sequence errors can be identified and removed if they introduce disruptions in reading frame (see below).
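To make the first point concrete, here is a minimal sketch of how copy number alone can skew apparent abundances. The taxa, cell counts and copy numbers are made up for illustration (they are not from the paper), and the function assumes every gene copy is recovered with equal probability, which real PCR does not guarantee.

```python
def gene_fractions(cell_counts, copies_per_genome):
    """Fraction of recovered gene copies per taxon, assuming unbiased recovery."""
    gene_copies = {taxon: cell_counts[taxon] * copies_per_genome[taxon]
                   for taxon in cell_counts}
    total = sum(gene_copies.values())
    return {taxon: n / total for taxon, n in gene_copies.items()}


# Hypothetical community: two taxa present in equal cell numbers.
cells = {"taxon_A": 100, "taxon_B": 100}
copies_16s = {"taxon_A": 1, "taxon_B": 7}   # made-up 16S copy numbers
copies_rpob = {"taxon_A": 1, "taxon_B": 1}  # rpoB assumed single-copy

frac_16s = gene_fractions(cells, copies_16s)
frac_rpob = gene_fractions(cells, copies_rpob)
print(frac_16s)   # taxon_B looks seven times as abundant as taxon_A
print(frac_rpob)  # both taxa come out equally abundant
```

With equal cell numbers, the single-copy marker returns equal fractions, whereas the multi-copy marker inflates the taxon that carries more copies.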
OK, sounds promising, right? Before expanding on our ‘good’ results, I’ll discuss three aspects where rpoB fails when compared to 16S. First, although rpoB might be the most conserved protein-coding gene, it comes nowhere near 16S in this respect. We used degenerate primers (a pool of different primers to cover as much variation as possible) and targeted Proteobacteria, not the whole community. This means that rpoB cannot be used for whole-community surveys, which is what most microbial ecologists are interested in. It can be used when targeting specific groups of interest though, and this population-level approach combines well with the fact that protein-coding sequences can be aligned and that population-genetic tests (detecting, for instance, recombination or selection) are available for them. A variety of genes with known functions of interest can be used, not only rpoB of course. Second, assigning taxonomy to the sequences is a problem. Databases (and classifiers matching query sequence to database sequence) exist for 16S but not for other genes (at least not of comparable size and sophistication). Third, we compared 16S and rpoB sequences generated from a single soil sample. We did not know the actual composition of the community in the sample, so biases in rpoB and in 16S amplification may have distorted our findings without us being able to tell. A more rigorous test would have been possible using an artificial community of known composition. I must confess that I could well see rpoB performing worse than 16S….
OK, enough with the negativity. What is cool about protein-coding genes is that an internal check for sequencing error is possible (rpoB is essential and single-copy and so frame shifts are expected to be lethal and not present in the sample):
Roche 454 pyrosequencing does yield a very large number of sequences, but it is also vulnerable to miscalls of homopolymer runs that cause frame shifts. Denoising algorithms such as the popular PyroNoise/AmpliconNoise software developed by Chris can mitigate this effect to a large extent. Reading frame correction in rpoB was found to reduce diversity more than did denoising; assuming we threw out ‘bad’ proteins and not the ‘good’ ones, this indicates that this method can outperform denoising. This correction step might prove especially useful for sequencing protein-coding genes using the Illumina method, for which no denoising program yet exists.
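The idea behind the reading frame check can be sketched in a few lines. This is a toy illustration and not the pipeline used in the paper: it assumes each read starts at a known codon position and that the expected amplicon length is known, and the example sequences and function names are invented.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_inframe_stop(seq, frame=0):
    """True if any complete codon in the given frame is a stop codon."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return any(codon in STOP_CODONS for codon in codons)


def passes_frame_check(read, expected_len, frame=0):
    """Keep a read only if its net indel is a multiple of 3 and it has no in-frame stop."""
    indel = len(read) - expected_len
    return indel % 3 == 0 and not has_inframe_stop(read, frame)


# Hypothetical 12-nt amplicon: ATG GCT GAA GGT
good_read = "ATGGCTGAAGGT"
# Same read with one extra A in the homopolymer run (a typical 454 miscall)
shifted_read = "ATGGCTGAAAGGT"

print(passes_frame_check(good_read, expected_len=12))     # True
print(passes_frame_check(shifted_read, expected_len=12))  # False
```

A check like this catches only errors that break the frame or create a stop codon; substitutions that leave a plausible protein go through, which is where denoising still has a role.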
Focusing just on the sequences that could be assigned to the Proteobacteria, it turned out that for a given sample size, more species were detected using rpoB than using 16S. One explanation for this phenomenon is that whereas single cells have one rpoB copy, they usually have multiple 16S copies and the repeated sampling of these (usually) identical copies results in the sampling curve flattening out. One thing to keep in mind though is that sampling curves are dependent on how you define species (or ‘Operational Taxonomic Units’, a preferred term as we have little idea of what the majority of the organisms we sampled sequences from look like). We used the criterion of 1% sequence dissimilarity for 16S (in many studies this % is greater, and often people use a couple of cut-off values so nobody can accuse them of using the wrong one) and 2.3% divergence for rpoB, based on previous work which I shall not go into now.
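For readers unfamiliar with sampling curves, here is a minimal sketch of one common way to compute them: the standard rarefaction formula for the expected number of OTUs in a random subsample drawn without replacement. The OTU counts are hypothetical, and this is not necessarily how the curves in the paper were produced.

```python
from math import comb


def expected_otus(counts, n):
    """Expected number of OTUs seen in a random subsample of n sequences."""
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must lie between 0 and the total number of sequences")
    denom = comb(total, n)
    # For each OTU: 1 minus the probability that none of its sequences is drawn.
    return sum(1 - comb(total - c, n) / denom for c in counts)


# Hypothetical OTU table: sequences per OTU after clustering at some cut-off
otu_counts = [50, 30, 15, 4, 1]

for n in (1, 10, 50, 100):
    print(n, round(expected_otus(otu_counts, n), 2))
```

Changing the clustering cut-off changes the OTU table, and with it the whole curve, which is why the choice of threshold matters when comparing markers.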
Finally, we messed around a bit by zooming into a cluster of closely related sequences and testing for recombination. This is mainly proof of principle stuff. As bacterial communities are so diverse, much more specific primers (and more sequencing power) are needed to be able to find things that are closely related. (Who knows, with enough computer power, it will eventually be possible to analyze all sequences on the nucleotide level, rather than binning them into groups based on overall similarity…)
More work, focusing on multiple samples collected at small spatial scales, is in progress with the same team. This approach would also be interesting to use for bacteriophage sequencing studies. Bacteriophages do not possess 16S (so reviewers cannot argue that that approach would always be superior, as they did for this paper!), but do possess family-specific capsid genes that could be used to monitor their diversity. More about that later this year if all goes well.