
Guest post from Rachid Ounit on CLARK: Fast and Accurate Classification of Metagenomic and Genomic Sequences

Recently I received an email from Rachid Ounit pointing me to a new open access paper he had on a metagenomics analysis tool called CLARK.  I asked him if he would be willing to write a guest post about it and, well, he did.  Here it is:


CLARK: Accurate metagenomic analysis of a million reads in 20 seconds or less…

At the University of California, Riverside, we have developed a new lightweight algorithm that classifies metagenomic samples accurately while requiring fewer computational resources than other classifiers (e.g., Kraken).  While CLARK and Kraken have comparable accuracy, CLARK is significantly faster (cf. Fig. a) and uses less RAM and disk space (cf. Fig. b-c). In default mode and single-threaded, CLARK classifies more than 3 million short reads per minute (cf. Fig. a), and it also scales better with multithreading (cf. Fig. d). Like Kraken, CLARK uses k-mers (short DNA words of length k) to solve the classification problem. However, while Kraken and other k-mer-based classifiers consider the whole taxonomy tree and must resolve k-mers that match genomes from different taxa (using the "lowest common ancestor" concept from MEGAN), CLARK instead considers taxa defined at a single taxonomic rank (e.g., species or genus) and, during preprocessing, discards any k-mer that occurs in more than one taxon. In other words, CLARK exploits what is specific to each taxon (against all others) to populate a light and efficient data structure: a customized dictionary of k-mers in which each k-mer is associated with at most one taxon, which makes k-mer queries fast. Each read is then assigned to the taxon with which it shares the highest number of k-mer matches. Since these matches are discriminative, CLARK's assignments are highly accurate. We also show that the choice of the value of k is critical for optimal performance, and that long k-mers (e.g., 31-mers) are not necessarily the best choice for accurate identification.  For example, high-confidence assignments using 20-mers from real metagenomes show strong consistency with several published, independent results.
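To make the approach concrete, here is a minimal Python sketch of discriminative k-mer classification in the spirit described above. It is illustrative only, not CLARK's implementation: real values of k are much larger (e.g., 20-31), CLARK's dictionary is a customized data structure rather than a Python dict, and all function names and toy sequences below are invented.

```python
# Minimal sketch of discriminative k-mer classification (illustrative only).
from collections import Counter

K = 5  # toy value; CLARK uses much longer k-mers (e.g., 20-31)

def kmers(seq, k=K):
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def build_index(genomes_by_taxon):
    """Map each k-mer to its taxon, discarding k-mers seen in 2+ taxa."""
    owner, shared = {}, set()
    for taxon, genomes in genomes_by_taxon.items():
        for genome in genomes:
            for km in kmers(genome):
                if km in shared:
                    continue
                if km in owner and owner[km] != taxon:
                    del owner[km]      # k-mer is not taxon-specific: drop it
                    shared.add(km)
                else:
                    owner[km] = taxon
    return owner

def classify(read, owner):
    """Assign the read to the taxon with the most discriminative k-mer hits."""
    hits = Counter(owner[km] for km in kmers(read) if km in owner)
    return hits.most_common(1)[0][0] if hits else None

# Example with two toy "genomes" that share no 5-mers:
index = build_index({"taxonA": ["ACGTACGTAC"], "taxonB": ["TTGCATTGCA"]})
print(classify("ACGTACG", index))  # -> taxonA
```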

Finally, CLARK can be used to detect contamination in draft reference genomes or, in genomics, chimeras in sequenced BACs. We are currently investigating new techniques for improving the sensitivity and speed of the tool, and we plan to release a new version later this year. We are also extending the tool for comparative genomics/metagenomics purposes. A "RAM-light" version of CLARK for your 4 GB RAM laptop is also available. CLARK is user-friendly (i.e., easy to use; it does not require a strong background in programming/bioinformatics) and self-contained (i.e., it does not depend on any external software tools). The latest version of CLARK (v1.1.2) contains several features for analyzing your results and is freely available under the GNU GPL license (for more details, please visit CLARK's webpage). Experimental results and algorithm details can be found in the BMC Genomics manuscript.


Performance of Kraken (v0.10.4-beta) and CLARK (v1.0) for the classification of a metagenome sample of 10,000 reads (average read length 92 bp).  a) Classification speed (in 10^3 reads per minute) in default mode. b) RAM usage (in GB) for the classification. c) Disk space (in GB) required for the database (bacterial genomes from NCBI/RefSeq). d) Classification speed (in 10^3 reads per minute) using 1, 2, 4, and 8 threads.




Do preprints count for anything? Not according to eLife & G3 & some authors ..



Well, just got pointed to this paper: Metagenomic chromosome conformation capture (meta3C) unveils the diversity of chromosome organization in microorganisms | eLife by Martial Marbouty, Axel Cournac, Jean-François Flot, Hervé Marie-Nelly, Julien Mozziconacci, Romain Koszul.  Seems potentially really interesting.

It is similar in concept and in many aspects to a paper we published in PeerJ earlier in the year (see Beitel CW, Froenicke L, Lang JM, Korf IF, Michelmore RW, Eisen JA, Darling AE. (2014) Strain- and plasmid-level deconvolution of a synthetic metagenome by sequencing proximity ligation products. PeerJ 2:e415 http://dx.doi.org/10.7717/peerj.415).

Yet despite the similarities to our paper and to another paper that was formally published around the same time as ours, this new paper does not mention these other pieces of work anywhere in the introduction as having any "prior work" relevance.  Instead, the authors wait until late in their discussion:
Taking advantage of chromatin conformation capture data to address genomic questions is a dynamic field: while this paper was under review, two studies were released that also aimed at exploiting the physical contacts between DNA molecules to deconvolve genomes from controlled mixes of microorganisms (Beitel et al., 2014; Burton et al., 2014).
Clearly, what they are trying to do here is to claim that since their paper was submitted before these other two (including ours) were published, they should get some sort of "priority" for their work.  Let's look at that in more detail.  Their paper was received May 9, 2014.  Our paper was published online May 27, and the other related paper, by Burton et al., was published online May 22.  In general, if a paper on the same topic as yours comes out just after you submit your paper, while your paper is still in review, the common, normal thing to be asked to do is to rewrite your paper to deal with the fact that you were, in essence, scooped.  But that does not really appear to be the case here.  They are treating this as "oh look, some new papers came out at the last minute and we have commented on them."  The "last minute" in this case was six months before this new paper was accepted.  That seems like a long time to treat this as "ooh - a new paper came up that we will add a few comments about".

But - one could quibble about the ethics and policies of dealing with papers that were published after one submitted one's own paper.  In my experience, I have always had to do major rewrites to deal with such papers.  But maybe eLife has different policies.  Who knows.  That is where things get really annoying here, because it was May 27 when our FINAL paper came out online at PeerJ.  The preprint of the paper, however, was published on February 27, more than two months before their paper was even submitted.  So does this mean that the authors of this new paper do not believe that preprints exist?  It is pretty clear on the website for our paper that there is a preprint that was published earlier.  Given that they were working on something directly related to what our preprint/paper was about, one would assume they would have found it with a few simple Google searches.  Or a reviewer might have pointed them to it.  Maybe not.  I do not know.  But either way, our preprint was published long before their paper was submitted, and therefore I believe they should have discussed it in more detail.

Is this a sign that some people believe preprints are nothing more than rumors?  I hope not.  Preprints are a great way to share research before the delays that come with peer review.  And in my opinion, preprints should count as prior research and be cited as such.  I note that the Burton group, in their paper in G3, also did not reference our preprint in what I consider to be a reasonable manner.  They add some comments in their acknowledgements:
While this manuscript was in preparation, a preprint describing a related method appeared in PeerJ PrePrints (Beitel et al. 2014a). Note added in proof: this preprint was subsequently published (Beitel et al. 2014b). 
Given that our preprint was published before their paper was submitted too, I believe that they also should have referenced it more fully in their paper.  But again, I can only guess that both the Burton and the Marbouty groups just do not see preprints as respectable scientific objects.  That is a bad precedent to set, and I think the wrong one too.  And it is a shame.  A preprint is a publication.  No - it is not peer reviewed.  But that does not mean it should not be considered part of the scientific literature in some way.  I note that this new paper from the Marbouty group seems really interesting.  I am just not sure I want to dig into it any deeper if they are going to play games with the timing of submission vs. published "papers" as part of positioning themselves to be viewed as doing something novel.

Story Behind the Paper: Comparative Analysis of Functional Metagenomic Annotation and the Mappability of Short Reads (by Rogan Carr and Elhanan Borenstein)

Here is another post in my "Story Behind the Paper" series, where I ask authors of open access papers to tell the story behind their paper.  This one comes from Rogan Carr and Elhanan Borenstein.  Note - this was crossposted at microBEnet.  If anyone out there has an open access paper whose story you want to tell -- let me know.


We’d like to first thank Jon for the opportunity to discuss our work in this forum. We recently published a study, stemming from protocol development in our lab, that investigates direct functional annotation of short metagenomic reads. Jon invited us to write a blog post on the subject, and we thought it would be a great venue to discuss some practical applications of our work and to share with the research community the motivation for our study and how it came about.

Our lab, the Borenstein Lab at the University of Washington, is broadly interested in metabolic modeling of the human microbiome (see, for example, our Metagenomic Systems Biology approach) and in the development of novel computational methods for analyzing functional metagenomic data (see, for example, Metagenomic Deconvolution). In this capacity, we often perform large-scale analyses of publicly available metagenomic datasets and collaborate with experimental labs to analyze new metagenomic datasets, and accordingly we have developed extensive expertise in performing functional, community-level annotation of metagenomic samples. We focused primarily on protocols that derive functional profiles directly from short sequencing reads (e.g., by mapping the short reads to a collection of annotated genes), as such protocols provide gene abundance profiles that are relatively unbiased by species abundance in the sample or by the availability of closely related reference genomes. Such functional annotation protocols are extremely common in the literature and are essential when approaching metagenomics from a gene-centric point of view, where the goal is to describe the community as a whole.

However, when we began to design our in-house annotation pipeline, we pored over the literature and realized that each research group and each metagenomic study applied a slightly different approach to functional annotation. When we implemented and evaluated these methods in the lab, we also discovered that the functional profiles obtained by the various methods often differ significantly. When we discussed these findings with colleagues, some further expressed doubt that such short sequencing reads even contain enough information to map back unambiguously to the correct function. Perhaps the whole approach was wrong!

We therefore set out to develop a set of ‘best practices’ for metagenomic sequence annotation in our lab and to prove (or disprove) quantitatively that such direct functional annotation of short reads provides a valid functional representation of the sample. We specifically decided to pursue a large-scale study, performed as rigorously as possible, taking into account both the phylogeny of the microbes in the sample and the phylogenetic coverage of the database, as well as several technical aspects of sequencing such as base-calling error and read length. We have found this evaluation approach and the results we obtained quite useful for designing our lab protocols, and thought it would be helpful to share them with the wider metagenomics and microbiome research community. The result is our recent paper in PLoS ONE, Comparative Analysis of Functional Metagenomic Annotation and the Mappability of Short Reads.

The performance of BLAST-based annotation of short reads across the bacterial and archaeal tree of life. The phylogenetic tree was obtained from Ciccarelli et al. Colored rings represent the recall for identifying reads originating from a KO gene using the ‘top gene’ protocol. The 4 rings correspond to varying levels of database coverage. Specifically, the innermost ring illustrates the recall obtained when the strain from which the reads originated is included in the database, while the other 3 rings, respectively, correspond to cases where only genomes from the same species, genus, or more remote taxonomic relationships are present in the database. Entries where no data were available (for example, when the strain from which the reads originated was the only member of its species) are shaded gray. For one genome in each phylum, denoted by a black dot at the branch tip, every possible 101-bp read was generated for this analysis. For the remaining genomes, every 10th possible read was used. Blue bars represent the fraction of the genome's peptide genes associated with a KO; for reference, the values are shown for E. coli, B. thetaiotaomicron, and S. pneumoniae. Figure and text adapted from: Carr R, Borenstein E (2014) Comparative Analysis of Functional Metagenomic Annotation and the Mappability of Short Reads. PLoS ONE 9(8): e105776. doi:10.1371/journal.pone.0105776. See the manuscript for full details.


To perform a rigorous study of functional annotation, we needed a set of reads whose true annotations were known (a “ground truth”). In other words, we had to know the exact locus and the exact genome from which each sequencing read originated, and the functional classification associated with that locus. We further wanted complete control over technical sources of error. To accomplish this, we chose to implement a simulation scheme, deriving a huge collection of sequence reads from fully sequenced, well annotated, and curated genomes. This scheme gave us complete information about the origin of each read and allowed us to simulate the various technical factors we were interested in. Moreover, simulating sequencing reads allowed us to systematically eliminate variation in annotation performance due to technological or biological effects that would typically be confounded in an experimental setup. For a set of curated genomes, we settled on the KEGG database, as it contains a large collection of microbial genomes with consistent functional curation and has been widely used in metagenomics for sample annotation. The KEGG hierarchy of KEGG Orthology groups (KOs), Modules, and Pathways could then serve as a common basis for comparative analysis. To control for phylogenetic bias in our results, we sampled broadly across 23 phyla and 89 genera in the bacterial and archaeal tree of life, using a randomly selected strain in KEGG for each tip of the tree from Ciccarelli et al. From each of the selected 170 strains, we generated either *every* possible contiguous sequence of a given length or (in some cases) every 10th contiguous sequence, using a sliding window approach. We additionally introduced various models to simulate sequencing errors. This large collection of reads (totaling ~16 Gb) was then aligned to the KEGG genes database using a translated BLAST mapping. To control for the phylogenetic coverage of the database (the phylogenetic relationship of the database to the sequence being aligned), we also simulated mapping to many partial collections of genomes. We further used four common protocols from the literature to convert the obtained BLAST alignments into functional annotations. Comparing the resulting annotation of each read to the annotation of the gene from which it originated allowed us to systematically evaluate the accuracy of this annotation approach and to examine the effect of various factors, including read length, sequencing error, and phylogeny.
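As a concrete picture of the read-generation step, here is a small Python sketch of sliding-window read simulation with a simple uniform substitution-error model. The function and parameter names are ours, and the error model is a toy stand-in for the more detailed models used in the paper.

```python
# Sketch of sliding-window read simulation: every possible (or every Nth)
# fixed-length substring of a genome, with optional uniform base errors.
import random

def simulate_reads(genome, read_len=101, step=1, error_rate=0.0, seed=0):
    """Yield (start, read) pairs from a sliding window over the genome."""
    rng = random.Random(seed)
    bases = "ACGT"
    for start in range(0, len(genome) - read_len + 1, step):
        read = list(genome[start:start + read_len])
        for i in range(len(read)):
            if rng.random() < error_rate:  # substitute a random wrong base
                read[i] = rng.choice([b for b in bases if b != read[i]])
        yield start, "".join(read)

# e.g., every 10th possible 101-bp read with a 1% per-base error rate:
# reads = list(simulate_reads(genome, read_len=101, step=10, error_rate=0.01))
```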

First and foremost, we confirmed that direct annotation of short reads indeed provides an overall accurate functional description of both individual reads and the sample as a whole. In other words, short reads appear to contain enough information to identify the functional annotation of the gene they originated from (although not necessarily the specific taxon of origin). Functions of individual reads were identified with high precision and recall, yet recall was found to be clade-dependent. As expected, recall and precision decreased with increasing phylogenetic distance to the reference database, but generally, having a representative of the genus in the reference database was sufficient to achieve relatively high accuracy. We also found variability in the accuracy of identifying individual KOs, with KOs that are more variable in length or in copy number having lower recall. Our paper includes an abundance of data on these results, a detailed characterization of mapping accuracy across different clades, and a description of the impact of additional properties (e.g., read length, sequencing error, etc.).
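For readers who want to reproduce this kind of evaluation, a minimal sketch of the per-read precision and recall computation might look like the following (all names here are ours, not the paper's code):

```python
# Precision/recall of predicted KO assignments against the known KO of each
# read's source gene (the simulation "ground truth").
def precision_recall(true_kos, predicted_kos):
    """Both arguments map read id -> KO id; predicted_kos may omit reads
    that received no annotation at all."""
    correct = sum(1 for read, ko in predicted_kos.items()
                  if true_kos.get(read) == ko)
    precision = correct / len(predicted_kos) if predicted_kos else 0.0
    recall = correct / len(true_kos) if true_kos else 0.0
    return precision, recall
```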

A principal component analysis of the pathway abundance profiles obtained for 15 HMP samples and by four different annotation protocols. HMP samples are numbered from 1 to 15 according to the list that appears in the Methods section of the manuscript. The different protocols are represented by color and shape. Note that two outlier protocols for sample 14 are not shown but were included in the PCA calculation. Figure and text adapted from: Carr R, Borenstein E (2014) Comparative Analysis of Functional Metagenomic Annotation and the Mappability of Short Reads. PLoS ONE 9(8): e105776. doi:10.1371/journal.pone.0105776. See the manuscript for full details.
Importantly, while the obtained functional annotations are in general representative of the true content of the sample, the exact protocol used to analyze the BLAST alignments and to assign a functional annotation to each read can still dramatically affect the obtained profile. For example, in analyzing stool samples from the Human Microbiome Project, we found that each protocol left a consistent “fingerprint” on the resulting profile and that the variation introduced by the different protocols was on the same order of magnitude as the biological variation across samples. Differences in annotation protocols are thus analogous to batch effects arising from variation in experimental procedures and should be carefully taken into consideration when designing the bioinformatic pipeline for a study.
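A toy sketch of the analysis in the figure above (not the paper's actual pipeline; all names are ours): arrange one pathway-abundance profile per (sample, protocol) pair into a matrix and project it onto two principal components, to see whether points group by sample (biology) or by protocol (a "batch effect").

```python
# PCA over pathway-abundance profiles, one row per (sample, protocol) pair.
import numpy as np
from sklearn.decomposition import PCA

def project_profiles(profiles):
    """profiles: dict mapping (sample, protocol) -> {pathway: abundance}."""
    pathways = sorted({p for prof in profiles.values() for p in prof})
    labels = list(profiles)
    matrix = np.array([[profiles[key].get(p, 0.0) for p in pathways]
                       for key in labels])
    coords = PCA(n_components=2).fit_transform(matrix)
    return labels, coords  # plot coords colored by protocol, shaped by sample
```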

Generally, however, we found that assigning each read the annotation of the top E-value hit (the ‘top gene’ protocol) had the highest precision for identifying the function of a sequencing read, and only slightly lower recall than methods that enrich for known annotations (such as the commonly used ‘top 20 KOs’ protocol). Given our lab's interests, this finding led us to adopt the ‘top gene’ protocol for functionally annotating metagenomic samples. Specifically, our work often requires high precision in annotating individual reads for model reconstruction (e.g., utilizing the presence and absence of individual genes) and the most accurate functional abundance profile possible for statistical method development. If your lab has similar interests, we would recommend this approach for your annotation pipelines. If, however, you have different or more specific needs, we encourage you to make use of the datasets we have published along with our paper to help you design your own solution. We would also be very happy to discuss such issues further with labs that are considering various approaches for functional annotation, to assess some of the factors that can impact downstream analyses, or to assist in such functional annotation efforts.
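For readers building their own pipelines, here is a hedged sketch of the ‘top gene’ protocol as described above. It assumes BLAST tabular output (-outfmt 6, where the E-value is the 11th column) and a gene-to-KO lookup table you supply; the function names are ours.

```python
# 'Top gene' protocol sketch: assign each read the KO of its single best hit.
import csv

def top_gene_annotations(blast_tsv, gene_to_ko):
    """Return {read id: KO id} using the lowest-E-value hit per read."""
    best = {}  # read id -> (evalue, gene id); ties keep the first hit seen
    with open(blast_tsv) as fh:
        for row in csv.reader(fh, delimiter="\t"):
            read, gene, evalue = row[0], row[1], float(row[10])
            if read not in best or evalue < best[read][0]:
                best[read] = (evalue, gene)
    return {read: gene_to_ko[gene]
            for read, (evalue, gene) in best.items() if gene in gene_to_ko}
```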
