Nat Commun. 2019 Apr 5;10(1):1556.
doi: 10.1038/s41467-019-09583-2.

Improved measures for evolutionary conservation that exploit taxonomy distances

Nawar Malhis et al. Nat Commun.

Abstract

Selective pressures on protein-coding regions that provide fitness advantages can lead to the regions' fixation and conservation in genome duplications and speciation events. Consequently, conservation analyses relying on sequence similarities are exploited by a myriad of applications across all biosciences to identify functionally important protein regions. While very potent, existing conservation measures based on multiple sequence alignments are so pervasive that improvements to solutions of many problems have become incremental. We introduce a new framework for evolutionary conservation with measures that exploit taxonomy distances across species. Results show that our taxonomy-based framework comfortably outperforms existing conservation measures in identifying deleterious variants observed in the human population, including variants located in non-abundant sequence domains such as intrinsically disordered regions. The predictive power of our approach emphasizes that the phenotypic effects of sequence variants can be taxonomy-level specific and thus, conservation needs to be interpreted accordingly.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
New conservation measures based on alignment identity and taxonomy distances. a Simplified taxonomy tree. Shared taxa (ST) is defined as the number of taxonomy tree edges that are shared between human and another species. Goldfish, for instance, shares 13 edges with the human taxonomic lineage, and thus its ST value is 13. It is important to note that a given taxon can include multiple species. For instance, shared taxa 22 contains mouse, rat and other rodents not listed, as well as lagomorphs, treeshrews, colugos, and primates. The entire human taxonomy lineage can be found in Supplementary Table 1. b Simplified MSA used to illustrate the calculation of different LIST measures that include local identity (LI) and ST. LI for a sequence at a location τ is computed by counting the number of residues that are identical to the query sequence (shaded in blue) in a window of size nine centered at τ, excluding the residue at τ. c The STP at position τ associated with the simplified MSA presented in b.
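The two quantities defined in the Fig. 1 caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the toy lineages, and the toy sequences are hypothetical; lineages are assumed to be lists of taxa ordered from just below the root down to the species, so that each entry stands for one tree edge; and the handling of gaps and of windows that run past the ends of the alignment is a guess, since the caption does not specify either.

```python
def shared_taxa(human_lineage, other_lineage):
    """Count the leading taxa (tree edges) that two lineages share.

    Assumption: each lineage lists taxa from just below the root to the
    species, so one list entry corresponds to one taxonomy tree edge.
    """
    count = 0
    for human_taxon, other_taxon in zip(human_lineage, other_lineage):
        if human_taxon != other_taxon:
            break
        count += 1
    return count


def local_identity(query, aligned, tau, window=9):
    """Count residues identical to the query in a window centered at tau.

    The residue at tau itself is excluded, as in the caption.
    Assumptions: gap characters ('-') never count as matches, and the
    window is simply truncated at the ends of the alignment.
    """
    half = window // 2
    start = max(0, tau - half)
    end = min(len(query), tau + half + 1)
    return sum(
        1
        for i in range(start, end)
        if i != tau and query[i] != "-" and query[i] == aligned[i]
    )


# Hypothetical, heavily shortened lineages (the real human lineage is in
# Supplementary Table 1 of the paper).
human = ["Eukaryota", "Metazoa", "Chordata", "Mammalia", "Primates", "Homo sapiens"]
mouse = ["Eukaryota", "Metazoa", "Chordata", "Mammalia", "Rodentia", "Mus musculus"]
print(shared_taxa(human, mouse))  # 4 shared edges in this toy tree

# Hypothetical aligned rows of equal length; one mismatch inside the window.
query_row = "ACDEFGHIKLM"
other_row = "ACDEYGHIKLM"
print(local_identity(query_row, other_row, tau=5))  # 7 of 8 neighbours match
```

With a window of nine, LI ranges from 0 to 8 because the central residue is left out; the toy ST values are far smaller than those in the paper only because the example lineages are shortened.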
Fig. 2
New conservation measures separate benign and deleterious variants. a Distribution of variant shared taxa (VST) for deleterious and benign human variants that have a matching allele in the raw MSA. For each of the 32 possible VST values that were found in the MSA analysis, the percentages of benign and deleterious variants are shown. VST values can only be calculated when a matching amino acid is found in the MSA, which defines the number of benign and deleterious variants that could be used for this plot (see methods for details on data). b The average shared taxa profiles (STP) of deleterious and benign variants (see methods for details on data)
Fig. 3
LIST performs better than other predictors in separating benign and pathogenic variants. a ROC curves calculated for the predictions by LIST, phyloP_Vertebrata (phyloP_V), SIFT, PROVEAN, and SiPhy on the variants from the ClinVar/ExAC test set that are scored by all methods compared (Supplementary Table 3). Shown here are only the best performing methods that solely use conservation measures (see Supplementary Table 3 for the results of other methods tested). AUC values are provided for each method in parentheses. b Precision-recall curves for the same tools and data set
Fig. 4
Example illustrating the advantage of new conservation measures. STP for position 150 of protein RAD51 compared with the averaged STPs of benign and deleterious variants.
