Burrows-Wheeler Transform (BWT) is an extremely useful tool for textual lossless data compression. Recently, it has found many applications to bioinformatics. In this paper, BWT is introduced from the view of combinatorics, and then an equivalence relation on words is proposed which shows that the transformation captures some common features of equivalent words. Based on the rationale that to what extent two words differ can be evaluated by the factors excluding their common features, a matrix representation for a DNA sequence is defined by means of a “subtraction operation” between the original word and its BWT word, thus a DNA sequences is converted into a 24-D vector whose components are the spectral norms of such matrices. To illustrate the use of the quantitative characterization of DNA sequences, phylogenetic trees of the full β-globin genes of 15 species and the S segments of 13 hantaviruses are constructed. The resulting monophyletic clusters agree well with the established taxonomic groups.
Published in | Computational Biology and Bioinformatics (Volume 2, Issue 3) |
DOI | 10.11648/j.cbb.20140203.11 |
Page(s) | 33-37 |
Creative Commons |
This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited. |
Copyright |
Copyright © The Author(s), 2014. Published by Science Publishing Group |
BWT, Elementary Row Operations, Equivalence, k-Word, Matrix Representation, Permutation, Phylogenetic Analysis
[1] | Yang LP, Chang G, Zhang X, Wang TM, Use of the Burrows-Wheeler similarity distribution to the comparison of the proteins, Amino acids, 39: 887-898, 2010. |
[2] | Hamori E, Ruskin J, H curves, a novel method of representation of nucleotide series especially suited for long DNA sequences, The Journal of biological chemistry, 258: 1318-1327, 1983. |
[3] | Nandy A, A new graphical re-presentation and analysis of DNA sequence structure. I: Methodology and application to globin genes, Current science, 66: 309-314, 1994. |
[4] | Nandy A, Empirical Relationship between In-tra-Purine and Intra-Pyrimidine Differences in Conserved Gene Sequences. PLoS ONE, 4(8): e6829, 2009. |
[5] | Waz P, Bielinska-Waz D, Nandy A, Descriptors of 2D-dynamic graphs as a classifica-tion tool of DNA sequences, J. Math. Chem. 52:132-140, 2014. |
[6] | Randić M, Vracko M, Lers N, Plavsic D, Novel 2-D graphical representation of DNA sequences and their numerical characte-rization, Chemical Physics Letters, 368: 1-6, 2003. |
[7] | Randić M, Vracko M, Lers N, Plavsic D, Analysis of similarity/dissimilarity of DNA sequences based on novel 2-D graphical representation, Chemical Physics Letters, 371: 202-207, 2003. |
[8] | Randic M, Pisanski T, Novic M, Plavsic D, Novel graph distance matrix, J Comput. Chem., 31: 1832-1841, 2010. |
[9] | Randic M, Zupan J, Balaban AT, Vikic-Topic D, Plavsic D., Graphical representation of proteins. Chem Rev., 111(2): 790-862, 2011. |
[10] | Randic M, Novic M, Plavsic D, Milestones in graphical bioinformatics, Int. J. Quantum Chem., 113: 2413-2446 , 2013. |
[11] | Qi ZH, Li L, Qi XQ, Using Huffman coding me-thod to visualize and analyze DNA sequences, J. Comput. Chem., 32: 3233-3240, 2011. |
[12] | Zhang YS, Liao B, Ding K, On 3DD-curves of DNA sequences, Molecular Simulation, 32(1): 29-34, 2006. |
[13] | Li C, Tang NN, Wang J, Directed graphs of DNA sequences and their numerical characterization. J. Theor. Biol. 241: 173-177, 2006. |
[14] | Yao YH, Dai Q, Nan X, He P, Nie ZM, Zhou SP, Zhang YZ, Analysis of similarity/dissimilarity of DNA sequences based on a class of 2D graphical representation, Journal of computational chemistry, 29: 1632-1639 , 2008. |
[15] | He P, Li D, Zhang Y, Wang X, Yao Y., A 3D graphical representation of protein se-quences based on the Gray code. J Theor Biol., 304: 81-87, 2012. |
[16] | Randić M, Vracko M, Nandy A, Basak SC, On 3-D graphical representation of DNA primary sequences and their numer-ical characterization, Journal of chemical information and computer sciences, 40: 1235-1244, 2000. |
[17] | Li C, Ma H, Zhou Y, Wang X, Zheng X, Similarity analysis of DNA sequences based on the weighted pseudo-entropy, J. Comput. Chem., 32(4): 675-680, 2011. |
[18] | Kantorovitz MR, Robinson GE, Sinha S, A statistical method for alignment-free comparison of regulatory sequences, Bioinformatics, 23: i249-255, 2007. |
[19] | Van Helden J, Metrics for comparing regulatory sequences on the basis of pattern counts, Bioinformatics, 20: 399-406, 2004. |
[20] | Dai Q, Yang Y, Wang T, Markov model plus k-word distributions: a synergy that produces novel statistical measures for sequence comparison, Bioinformatics, 24(20): 2296-2302, 2008. |
[21] | Gao L, Qi J, Sun JD, Hao BL, Prokaryote phylogeny meets taxonomy: an exhaustive comparison of composition vector trees with systematic bacteriology, Science in China. Series C, Life sciences, 50(5): 587-599, 2007. |
[22] | Koohy H, Dyer NP, Reid JE, Koentges G, Ott S, An alignment-free model for comparison of regulatory sequences, Bioinformatics , 26(19): 2391-2397, 2010. |
[23] | Otu HH, Sayood K, A new sequence distance measure for phylogenetic tree construction, Bioinformatics, 19: 2122-2130, 2003. |
[24] | Mantaci S, Restivo A, Sciortino M., Burrows-Wheeler transform and Sturmian words, Information Processing Letters, 86: 241-246, 2003. |
[25] | Mantaci S, Restivo A, Rosone G, Sciortino M., An extension of the Burrows-Wheeler transform, Theoretical Computer Science, 387: 298-312, 2007. |
[26] | Mantaci S, Restivo A, Sciortino M., Distance measures for biological sequences: Some recent approaches, International Journal of Approximate Reasoning, 47: 109-124, 2008. |
[27] | Zheng X, Li C, Wang J., A complexity-based measure and its applica-tion to phylogenetic analysis, Journal of mathematical chemistry, 46: 1149-1157, 2009. |
[28] | Zheng X, Li C, Wang J., An information-theoretic approach to the prediction of protein structural class, Journal of computational chemistry, 31(6): 1201-1206, 2010. |
[29] | Burrows M, Wheeler DJ, A block-sorting lossless data compression algorithm. SRC Research Report 124, 1994. |
[30] | Mantaci S, Restivo A, Rosone G, Sciortino M, A new combi-natorial approach to sequence comparison, Theory of Computing Systems, 42: 411-429, 2008. |
[31] | Restivo A, Rosone G, Balanced Words Having Simple Burrows-Wheeler Transform. Lecture Notes in Computer Science, 5583: 431-442, 2009. |
[32] | Konstantin ML, Arseny MS, Two Combinatorial Criteria for BWT Images, Computer Science-Theory and Applications, Lecture Notes in Computer Science, 6651: 385-396, 2011. |
[33] | Randić M, Guo XF, Basak SC, On the characterization of DNA primary sequences by triplet of nucleic acid bases, Journal of chemical information and computer sciences, 41: 619-626, 2001. |
[34] | He PA, Wang J, Characteristic sequences for DNA primary sequence, Journal of chemical information and computer sciences, 42: 1080-1085, 2002. |
[35] | Yang Y, Zhang YY, Jia MD, Li C, Meng LY, Non-degenerate graphical representation of DNA sequences and its applications to phylogenetic analysis, Comb. Chem. High Throughput Screen., 16: 585-589 , 2013. |
[36] | Zhang YZ, Dong X, Li X, Ma C, Xiong HP, Yan GJ, Gao N, Jiang D, Li LP, Zou Y, Plyusnin A, Seoul virus and hantavirus disease, Shenyang, People's Republic of China, Emerging infectious diseases, 15(2): 200-206, 2009. |
[37] | Yao PP, Zhu HP, Deng XZ, Xu F, Xie RH, Yao CH, Weng JQ, Zhang Y, Yang ZQ, Zhu ZY, Molecular evolution analysis of hantaviruses in Zhejiang Province, Chinese journal of virology, 26(6): 465-470, 2010. |
APA Style
Chun Li, Huan Liu, Junhong Liu, Yuping Qin, Zhifu Wang. (2014). A Burrows-Wheeler Transform Based Method for DNA Sequence Comparison. Computational Biology and Bioinformatics, 2(3), 33-37. https://doi.org/10.11648/j.cbb.20140203.11
ACS Style
Chun Li; Huan Liu; Junhong Liu; Yuping Qin; Zhifu Wang. A Burrows-Wheeler Transform Based Method for DNA Sequence Comparison. Comput. Biol. Bioinform. 2014, 2(3), 33-37. doi: 10.11648/j.cbb.20140203.11
AMA Style
Chun Li, Huan Liu, Junhong Liu, Yuping Qin, Zhifu Wang. A Burrows-Wheeler Transform Based Method for DNA Sequence Comparison. Comput Biol Bioinform. 2014;2(3):33-37. doi: 10.11648/j.cbb.20140203.11
@article{10.11648/j.cbb.20140203.11, author = {Chun Li and Huan Liu and Junhong Liu and Yuping Qin and Zhifu Wang}, title = {A Burrows-Wheeler Transform Based Method for DNA Sequence Comparison}, journal = {Computational Biology and Bioinformatics}, volume = {2}, number = {3}, pages = {33-37}, doi = {10.11648/j.cbb.20140203.11}, url = {https://doi.org/10.11648/j.cbb.20140203.11}, eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.cbb.20140203.11}, abstract = {Burrows-Wheeler Transform (BWT) is an extremely useful tool for textual lossless data compression. Recently, it has found many applications to bioinformatics. In this paper, BWT is introduced from the view of combinatorics, and then an equivalence relation on words is proposed which shows that the transformation captures some common features of equivalent words. Based on the rationale that to what extent two words differ can be evaluated by the factors excluding their common features, a matrix representation for a DNA sequence is defined by means of a “subtraction operation” between the original word and its BWT word, thus a DNA sequences is converted into a 24-D vector whose components are the spectral norms of such matrices. To illustrate the use of the quantitative characterization of DNA sequences, phylogenetic trees of the full β-globin genes of 15 species and the S segments of 13 hantaviruses are constructed. The resulting monophyletic clusters agree well with the established taxonomic groups.}, year = {2014} }
TY - JOUR T1 - A Burrows-Wheeler Transform Based Method for DNA Sequence Comparison AU - Chun Li AU - Huan Liu AU - Junhong Liu AU - Yuping Qin AU - Zhifu Wang Y1 - 2014/05/30 PY - 2014 N1 - https://doi.org/10.11648/j.cbb.20140203.11 DO - 10.11648/j.cbb.20140203.11 T2 - Computational Biology and Bioinformatics JF - Computational Biology and Bioinformatics JO - Computational Biology and Bioinformatics SP - 33 EP - 37 PB - Science Publishing Group SN - 2330-8281 UR - https://doi.org/10.11648/j.cbb.20140203.11 AB - Burrows-Wheeler Transform (BWT) is an extremely useful tool for textual lossless data compression. Recently, it has found many applications to bioinformatics. In this paper, BWT is introduced from the view of combinatorics, and then an equivalence relation on words is proposed which shows that the transformation captures some common features of equivalent words. Based on the rationale that to what extent two words differ can be evaluated by the factors excluding their common features, a matrix representation for a DNA sequence is defined by means of a “subtraction operation” between the original word and its BWT word, thus a DNA sequences is converted into a 24-D vector whose components are the spectral norms of such matrices. To illustrate the use of the quantitative characterization of DNA sequences, phylogenetic trees of the full β-globin genes of 15 species and the S segments of 13 hantaviruses are constructed. The resulting monophyletic clusters agree well with the established taxonomic groups. VL - 2 IS - 3 ER -