Efficient computation of absent words in genomic sequences

Herold J, Kurtz S, Giegerich R (2008)
BMC Bioinformatics 9(1).

Journal Article | Published | English
Abstract
Background: Analysis of sequence composition is a routine task in genome research. Organisms are characterized by their base composition, dinucleotide relative abundance, codon usage, and so on. Unique subsequences are markers of special interest in genome comparison, expression profiling, and genetic engineering. Relative to a random sequence of the same length, unique subsequences are overrepresented in real genomes. Shortest words absent from a genome have been addressed in two recent studies.

Results: We describe a new algorithm and software for the computation of absent words. It is more efficient than previous algorithms and easier to use. It directly computes unwords without the need to specify a length estimate. Moreover, it avoids the space requirements of index structures such as suffix trees and suffix arrays. Our implementation is available as an open source package. We compute unwords of human and mouse as well as some other organisms, covering a genome size range from 10^9 down to 10^5 bp.

Conclusion: The new algorithm computes absent words for the human genome in 10 minutes on standard hardware, using only 2.5 Mb of space. This enables us to perform this type of analysis not only for the largest genomes available so far, but also for the emerging pan- and meta-genome data.
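
To make the notion of an unword concrete, the following is a minimal brute-force sketch and not the algorithm described in the article. It assumes an unword is a shortest word over the DNA alphabet that does not occur in the input sequence, and it finds that length by increasing k until some k-mer is missing, so no length estimate has to be supplied. The function name `shortest_absent_words`, the `max_len` cutoff, and the decision to ignore the reverse strand are illustrative assumptions. Unlike the published method, this sketch holds all observed k-mers in memory.

```python
from itertools import product


def shortest_absent_words(seq, alphabet="ACGT", max_len=12):
    """Return (k, absent_words) for the smallest k at which at least one
    word over `alphabet` does not occur in `seq`.

    Hypothetical helper for illustration only: a brute-force scan over
    k-mer sets, not the space-efficient method of the article.
    """
    seq = seq.upper()
    symbols = set(alphabet)
    for k in range(1, max_len + 1):
        # Collect every length-k window of the sequence.
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        # Ignore windows containing symbols outside the alphabet (e.g. N).
        seen = {w for w in seen if set(w) <= symbols}
        if len(seen) < len(alphabet) ** k:
            # At least one word of length k is missing: enumerate them.
            absent = sorted(
                "".join(p)
                for p in product(alphabet, repeat=k)
                if "".join(p) not in seen
            )
            return k, absent
    # Every word up to max_len occurs in the sequence.
    return None, []


if __name__ == "__main__":
    k, words = shortest_absent_words("AACAGATCCGCTGGTTA")
    print(k, len(words))
```

The example input contains all 16 dinucleotides, so the search moves on to length 3 before it finds a missing word. Enumerating all 4^k candidates is only practical for small k, which is acceptable here because shortest absent words in real genomes are short.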

Cite this

Herold J, Kurtz S, Giegerich R. Efficient computation of absent words in genomic sequences. BMC Bioinformatics. 2008;9(1).
Herold, J., Kurtz, S., & Giegerich, R. (2008). Efficient computation of absent words in genomic sequences. BMC Bioinformatics, 9(1).
Herold, J., Kurtz, S., and Giegerich, R. (2008). Efficient computation of absent words in genomic sequences. BMC Bioinformatics 9.
Herold, J., Kurtz, S., & Giegerich, R., 2008. Efficient computation of absent words in genomic sequences. BMC Bioinformatics, 9(1).
J. Herold, S. Kurtz, and R. Giegerich, “Efficient computation of absent words in genomic sequences”, BMC Bioinformatics, vol. 9, 2008.
Herold, J., Kurtz, S., Giegerich, R.: Efficient computation of absent words in genomic sequences. BMC Bioinformatics. 9, (2008).
Herold, Julia, Kurtz, Stefan, and Giegerich, Robert. “Efficient computation of absent words in genomic sequences”. BMC Bioinformatics 9.1 (2008).
Access Level
OA Open Access

15 Citations in Europe PMC

Data provided by Europe PubMed Central.

Three minimal sequences found in Ebola virus genomes and absent from human DNA.
Silva RM, Pratas D, Castro L, Pinho AJ, Ferreira PJ., Bioinformatics 31(15), 2015
PMID: 25840045
Linear-time computation of minimal absent words using suffix array.
Barton C, Heliou A, Mouchard L, Pissis SP., BMC Bioinformatics 15(), 2014
PMID: 25526884
keeSeek: searching distant non-existing words in genomes for PCR-based applications.
Falda M, Fontana P, Barzon L, Toppo S, Lavezzo E., Bioinformatics 30(18), 2014
PMID: 24867942
Pervasive sequence patents cover the entire human genome.
Rosenfeld JA, Mason CE., Genome Med 5(3), 2013
PMID: 23522065
Insertion site preference of Mu, Tn5, and Tn7 transposons.
Green B, Bouchier C, Fairhead C, Craig NL, Cormack BP., Mob DNA 3(1), 2012
PMID: 22313799
Clustering of DNA words and biological function: a proof of principle.
Hackenberg M, Rueda A, Carpena P, Bernaola-Galvan P, Barturen G, Oliver JL., J. Theor. Biol. 297(), 2012
PMID: 22226985
Minimal absent words in four human genome assemblies.
Garcia SP, Pinho AJ., PLoS ONE 6(12), 2011
PMID: 22220210
Minimal absent words in prokaryotic and eukaryotic genomes.
Garcia SP, Pinho AJ, Rodrigues JM, Bastos CA, Ferreira PJ., PLoS ONE 6(1), 2011
PMID: 21386877
Microbial diversity in saliva of oral squamous cell carcinoma.
Pushalkar S, Mane SP, Ji X, Li Y, Evans C, Crasta OR, Morse D, Meagher R, Singh A, Saxena D., FEMS Immunol. Med. Microbiol. 61(3), 2011
PMID: 21205002
The word landscape of the non-coding segments of the Arabidopsis thaliana genome.
Lichtenberg J, Yilmaz A, Welch JD, Kurz K, Liang X, Drews F, Ecker K, Lee SS, Geisler M, Grotewold E, Welch LR., BMC Genomics 10(), 2009
PMID: 19814816
Genomic DNA k-mer spectra: models and modalities.
Chor B, Horn D, Goldman N, Levy Y, Massingham T., Genome Biol. 10(10), 2009
PMID: 19814784
Multiplex primer prediction software for divergent targets.
Gardner SN, Hiddessen AL, Williams PL, Hara C, Wagner MC, Colston BW Jr., Nucleic Acids Res. 37(19), 2009
PMID: 19759213
On finding minimal absent words.
Pinho AJ, Ferreira PJ, Garcia SP, Rodrigues JM., BMC Bioinformatics 10(), 2009
PMID: 19426495

24 References

GISMO--gene identification using a support vector machine for ORF classification.
Krause L, McHardy AC, Nattkemper TW, Puhler A, Stoye J, Meyer F., Nucleic Acids Res. 35(2), 2007
PMID: 17175534
Structure and function of type II restriction endonucleases.
Pingoud A, Jeltsch A., Nucleic Acids Res. 29(18), 2001
PMID: 11557805
Monotony of Surprise And Large-Scale Quest for Unusual Words
Apostolico A, Bock ME, Lonardi S., 2002
Verbumculus and the Discovery of Unusual Words
Apostolico A, Gong F, Lonardi S., 2004
Mauve: multiple alignment of conserved genomic sequence with rearrangements.
Darling AC, Mau B, Blattner FR, Perna NT., Genome Res. 14(7), 2004
PMID: 15231754
Genome comparison without alignment using shortest unique substrings.
Haubold B, Pierstorff N, Moller F, Wiehe T., BMC Bioinformatics 6(), 2005
PMID: 15910684
Absent sequences: nullomers and primes.
Hampikian G, Andersen T., Pac Symp Biocomput (), 2007
PMID: 17990505
Nullomers: really a matter of natural selection?
Acquisti C, Poste G, Curtiss D, Kumar S., PLoS ONE 2(10), 2007
PMID: 17925870
Replacing Suffix Trees with Enhanced Suffix Arrays
Abouelhoda M, Kurtz S, Ohlebusch E., 2004
Vmatch
AUTHOR UNKNOWN, 0
On the distribution of the number of missing words in random texts
Rahmann S, Rivals E., 2003
Human Genome
AUTHOR UNKNOWN, 0
Mouse Genome
AUTHOR UNKNOWN, 0
Drosophila Genomes
AUTHOR UNKNOWN, 0
C. elegans Genome
AUTHOR UNKNOWN, 0
The genome sequence of the filamentous fungus Neurospora crassa
Galagan J, Calvo S, Borkovich K, Selker E, Read N, Jaffe D, FitzHugh W, Ma L, Smirnov S, Purcell S, Rehman B, Elkins T, Engels R, Wang S, Nielsen C, Butler J, Endrizzi M, Qui D, Ianakiev P, Bell-Pedersen D, Nelson M, Werner-Washburne M, Selitrennikoff C, Kinsey J, Braun E, Zelter A, Schulte U, Kothe G, Jedd G, Mewes W, Staben C, Marcotte E, Greenberg D, Roy A, Foley K, Naylor J, Stange-Thomann N, Barrett R, Gnerre S, Kamal M, Kamvysselis M, Mauceli E, Bielke C, Rudd S, Frishman D, Krystofova S, Rasmussen C, Metzenberg R, Perkins D, Kroken S, Cogoni C, Macino G, Catcheside D, Li W, Pratt R, Osmani S, DeSouza C, Glass L, Orbach M, Berglund J, Voelker R, Yarden O, Plamann M, Seiler S, Dunlap J, Radford A, Aramayo R, Natvig D, Alex L, Mannhaupt G, Ebbole D, Freitag M, Paulsen I, Sachs M, Lander E, Nusbaum C, Birren B., 2003
S. cerevisiae Genome
AUTHOR UNKNOWN, 0
Complete genome sequence of the hyperthermophilic archaeon Thermococcus kodakaraensis KOD1 and comparison with Pyrococcus genomes.
Fukui T, Atomi H, Kanai T, Matsumi R, Fujiwara S, Imanaka T., Genome Res. 15(3), 2005
PMID: 15710748
Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii.
Bult CJ, White O, Olsen GJ, Zhou L, Fleischmann RD, Sutton GG, Blake JA, FitzGerald LM, Clayton RA, Gocayne JD, Kerlavage AR, Dougherty BA, Tomb JF, Adams MD, Reich CI, Overbeek R, Kirkness EF, Weinstock KG, Merrick JM, Glodek A, Scott JL, Geoghagen NS, Venter JC., Science 273(5278), 1996
PMID: 8688087
Construction of a large signature-tagged mini-Tn5 transposon library and its application to mutagenesis of Sinorhizobium meliloti.
Pobigaylo N, Wetter D, Szymczak S, Schiller U, Kurtz S, Meyer F, Nattkemper TW, Becker A., Appl. Environ. Microbiol. 72(6), 2006
PMID: 16751548
Computing Unwords on BibiServ
AUTHOR UNKNOWN, 0
Unwords
AUTHOR UNKNOWN, 0

Sources

PMID: 18366790