Efficient computation of absent words in genomic sequences

Herold J, Kurtz S, Giegerich R (2008)
BMC Bioinformatics 9(1): 167.

Journal article | Published | English
Abstract / Note
Background: Analysis of sequence composition is a routine task in genome research. Organisms are characterized by their base composition, dinucleotide relative abundance, codon usage, and so on. Unique subsequences are markers of special interest in genome comparison, expression profiling, and genetic engineering. Relative to a random sequence of the same length, unique subsequences are overrepresented in real genomes. Shortest words absent from a genome have been addressed in two recent studies. Results: We describe a new algorithm and software for the computation of absent words. It is more efficient than previous algorithms and easier to use. It directly computes unwords without the need to specify a length estimate. Moreover, it avoids the space requirements of index structures such as suffix trees and suffix arrays. Our implementation is available as an open source package. We compute unwords of human and mouse as well as some other organisms, covering a genome size range from 10^9 down to 10^5 bp. Conclusion: The new algorithm computes absent words for the human genome in 10 minutes on standard hardware, using only 2.5 Mb of space. This enables us to perform this type of analysis not only for the largest genomes available so far, but also for the emerging pan- and meta-genome data.
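To make the notion of an unword concrete, the following is a minimal brute-force sketch, not the algorithm described in the paper. It assumes an unword is a shortest word over the DNA alphabet that does not occur in the given sequence, looks at the given strand only, and ignores characters outside the alphabet; the function name `shortest_absent_words` is hypothetical. Like the approach in the abstract it needs no length estimate, because it simply tries k = 1, 2, ... until some word is missing, but it makes no attempt to match the paper's time or space efficiency.

```python
from itertools import product


def shortest_absent_words(sequence, alphabet="ACGT"):
    """Return (k, words): the smallest k for which some length-k word over
    `alphabet` is missing from `sequence`, and all such missing words.

    Illustrative brute force only; not the method of Herold et al.
    """
    seq = sequence.upper()
    letters = set(alphabet)
    k = 1
    while True:
        # All length-k substrings made up solely of alphabet letters.
        seen = {
            seq[i:i + k]
            for i in range(len(seq) - k + 1)
            if set(seq[i:i + k]) <= letters
        }
        # If fewer than |alphabet|^k distinct words occur, some are absent.
        if len(seen) < len(alphabet) ** k:
            absent = sorted(
                "".join(p)
                for p in product(alphabet, repeat=k)
                if "".join(p) not in seen
            )
            return k, absent
        k += 1


if __name__ == "__main__":
    k, words = shortest_absent_words("ACGT")
    print(k, len(words))
```

For the toy input "ACGT" every single base occurs, so the shortest absent words have length 2. The loop always terminates, since no word longer than the sequence can occur in it.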


Herold, J., Kurtz, S., & Giegerich, R. (2008). Efficient computation of absent words in genomic sequences. BMC Bioinformatics, 9(1), 167. https://doi.org/10.1186/1471-2105-9-167
All files available under the following license(s):
Copyright Statement:
This object is protected by copyright and/or related rights. [...]
Access Level
OA Open Access

20 citations in Europe PMC

Data provided by Europe PubMed Central.

Absent words and the (dis)similarity analysis of DNA sequences: an experimental study.
Rahman MS, Alatabbi A, Athar T, Crochemore M, Rahman MS., BMC Res Notes 9, 2016
PMID: 27004958
The bulk and the tail of minimal absent words in genome sequences.
Aurell E, Innocenti N, Zhou HJ., Phys Biol 13(2), 2016
PMID: 27043075
Nullomers and High Order Nullomers in Genomic Sequences.
Vergni D, Santoni D., PLoS One 11(12), 2016
PMID: 27906971
Three minimal sequences found in Ebola virus genomes and absent from human DNA.
Silva RM, Pratas D, Castro L, Pinho AJ, Ferreira PJ., Bioinformatics 31(15), 2015
PMID: 25840045
keeSeek: searching distant non-existing words in genomes for PCR-based applications.
Falda M, Fontana P, Barzon L, Toppo S, Lavezzo E., Bioinformatics 30(18), 2014
PMID: 24867942
Linear-time computation of minimal absent words using suffix array.
Barton C, Heliou A, Mouchard L, Pissis SP., BMC Bioinformatics 15, 2014
PMID: 25526884
Pervasive sequence patents cover the entire human genome.
Rosenfeld JA, Mason CE., Genome Med 5(3), 2013
PMID: 23522065
Clustering of DNA words and biological function: a proof of principle.
Hackenberg M, Rueda A, Carpena P, Bernaola-Galván P, Barturen G, Oliver JL., J Theor Biol 297, 2012
PMID: 22226985
Insertion site preference of Mu, Tn5, and Tn7 transposons.
Green B, Bouchier C, Fairhead C, Craig NL, Cormack BP., Mob DNA 3(1), 2012
PMID: 22313799
Minimal absent words in prokaryotic and eukaryotic genomes.
Garcia SP, Pinho AJ, Rodrigues JM, Bastos CA, Ferreira PJ., PLoS One 6(1), 2011
PMID: 21386877
Microbial diversity in saliva of oral squamous cell carcinoma.
Pushalkar S, Mane SP, Ji X, Li Y, Evans C, Crasta OR, Morse D, Meagher R, Singh A, Saxena D., FEMS Immunol Med Microbiol 61(3), 2011
PMID: 21205002
Minimal absent words in four human genome assemblies.
Garcia SP, Pinho AJ., PLoS One 6(12), 2011
PMID: 22220210
On finding minimal absent words.
Pinho AJ, Ferreira PJ, Garcia SP, Rodrigues JM., BMC Bioinformatics 10, 2009
PMID: 19426495
Word-based characterization of promoters involved in human DNA repair pathways.
Lichtenberg J, Jacox E, Welch JD, Kurz K, Liang X, Yang MQ, Drews F, Ecker K, Lee SS, Elnitski L, Welch LR., BMC Genomics 10 Suppl 1, 2009
PMID: 19594877
Multiplex primer prediction software for divergent targets.
Gardner SN, Hiddessen AL, Williams PL, Hara C, Wagner MC, Colston BW., Nucleic Acids Res 37(19), 2009
PMID: 19759213
Genomic DNA k-mer spectra: models and modalities.
Chor B, Horn D, Goldman N, Levy Y, Massingham T., Genome Biol 10(10), 2009
PMID: 19814784
The word landscape of the non-coding segments of the Arabidopsis thaliana genome.
Lichtenberg J, Yilmaz A, Welch JD, Kurz K, Liang X, Drews F, Ecker K, Lee SS, Geisler M, Grotewold E, Welch LR., BMC Genomics 10, 2009
PMID: 19814816

24 References


GISMO--gene identification using a support vector machine for ORF classification.
Krause L, McHardy AC, Nattkemper TW, Puhler A, Stoye J, Meyer F., Nucleic Acids Res. 35(2), 2006
PMID: 17175534
Structure and function of type II restriction endonucleases.
Pingoud A, Jeltsch A., Nucleic Acids Res. 29(18), 2001
PMID: 11557805
Monotony of Surprise And Large-Scale Quest for Unusual Words
Apostolico A, Bock ME, Lonardi S., 2002
Verbumculus and the Discovery of Unusual Words
Apostolico A, Gong F, Lonardi S., 2004
Mauve: multiple alignment of conserved genomic sequence with rearrangements.
Darling AC, Mau B, Blattner FR, Perna NT., Genome Res. 14(7), 2004
PMID: 15231754
Genome comparison without alignment using shortest unique substrings.
Haubold B, Pierstorff N, Moller F, Wiehe T., BMC Bioinformatics 6, 2005
PMID: 15910684
Absent sequences: nullomers and primes.
Hampikian G, Andersen T., Pac Symp Biocomput, 2007
PMID: 17990505
Nullomers: really a matter of natural selection?
Acquisti C, Poste G, Curtiss D, Kumar S., PLoS ONE 2(10), 2007
PMID: 17925870
Replacing Suffix Trees with Enhanced Suffix Arrays
Abouelhoda M, Kurtz S, Ohlebusch E., 2004
On the distribution of the number of missing words in random texts
Rahmann S, Rivals E., 2003
Human Genome
Mouse Genome
Drosophila Genomes
C. elegans Genome
The genome sequence of the filamentous fungus Neurospora crassa
Galagan J, Calvo S, Borkovich K, Selker E, Read N, Jaffe D, FitzHugh W, Ma L, Smirnov S, Purcell S, Rehman B, Elkins T, Engels R, Wang S, Nielsen C, Butler J, Endrizzi M, Qui D, Ianakiev P, Bell-Pedersen D, Nelson M, Werner-Washburne M, Selitrennikoff C, Kinsey J, Braun E, Zelter A, Schulte U, Kothe G, Jedd G, Mewes W, Staben C, Marcotte E, Greenberg D, Roy A, Foley K, Naylor J, Stange-Thomann N, Barrett R, Gnerre S, Kamal M, Kamvysselis M, Mauceli E, Bielke C, Rudd S, Frishman D, Krystofova S, Rasmussen C, Metzenberg R, Perkins D, Kroken S, Cogoni C, Macino G, Catcheside D, Li W, Pratt R, Osmani S, DeSouza C, Glass L, Orbach M, Berglund J, Voelker R, Yarden O, Plamann M, Seiler S, Dunlap J, Radford A, Aramayo R, Natvig D, Alex L, Mannhaupt G, Ebbole D, Freitag M, Paulsen I, Sachs M, Lander E, Nusbaum C, Birren B., 2003
S. cerevisiae Genome
Complete genome sequence of the hyperthermophilic archaeon Thermococcus kodakaraensis KOD1 and comparison with Pyrococcus genomes.
Fukui T, Atomi H, Kanai T, Matsumi R, Fujiwara S, Imanaka T., Genome Res. 15(3), 2005
PMID: 15710748
Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii.
Bult CJ, White O, Olsen GJ, Zhou L, Fleischmann RD, Sutton GG, Blake JA, FitzGerald LM, Clayton RA, Gocayne JD, Kerlavage AR, Dougherty BA, Tomb JF, Adams MD, Reich CI, Overbeek R, Kirkness EF, Weinstock KG, Merrick JM, Glodek A, Scott JL, Geoghagen NS, Venter JC., Science 273(5278), 1996
PMID: 8688087
Construction of a large signature-tagged mini-Tn5 transposon library and its application to mutagenesis of Sinorhizobium meliloti.
Pobigaylo N, Wetter D, Szymczak S, Schiller U, Kurtz S, Meyer F, Nattkemper TW, Becker A., Appl. Environ. Microbiol. 72(6), 2006
PMID: 16751548
Computing Unwords on BibiServ

PMID: 18366790