Efficient computation of absent words in genomic sequences

Herold, Julia; Kurtz, Stefan; Giegerich, Robert

Herold J, Kurtz S, Giegerich R (2008)
BMC Bioinformatics 9(1): 167.

Journal article | Published | English

DOI

https://doi.org/10.1186/1471-2105-9-167

URN

urn:nbn:de:0070-pub-17840255

Author(s)

Herold, Julia (UniBi); Kurtz, Stefan; Giegerich, Robert (UniBi)

Department

Centrum für Biotechnologie > Institut für Bioinformatik
Centrum für Biotechnologie > Arbeitsgruppe R. Giegerich
Center of Excellence - Cognitive Interaction Technology CITEC
Technische Fakultät > AG Praktische Informatik

Abstract / Note

Background: Analysis of sequence composition is a routine task in genome research. Organisms are characterized by their base composition, dinucleotide relative abundance, codon usage, and so on. Unique subsequences are markers of special interest in genome comparison, expression profiling, and genetic engineering. Relative to a random sequence of the same length, unique subsequences are overrepresented in real genomes. Shortest words absent from a genome have been addressed in two recent studies.

Results: We describe a new algorithm and software for the computation of absent words. It is more efficient than previous algorithms and easier to use. It directly computes unwords without the need to specify a length estimate. Moreover, it avoids the space requirements of index structures such as suffix trees and suffix arrays. Our implementation is available as an open source package. We compute unwords of human and mouse as well as some other organisms, covering a genome size range from 10^9 down to 10^5 bp.

Conclusion: The new algorithm computes absent words for the human genome in 10 minutes on standard hardware, using only 2.5 Mb of space. This enables us to perform this type of analysis not only for the largest genomes available so far, but also for the emerging pan- and meta-genome data.
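To make the notion of an unword (a shortest word absent from a sequence) concrete, the following is a minimal Python sketch. It is a naive enumeration of all words of increasing length, not the algorithm described in the paper, and it does not share that algorithm's time or space behavior. The function name `shortest_absent_words`, the `max_len` cutoff, and the optional `both_strands` flag are assumptions made for illustration only.

```python
from itertools import product


def shortest_absent_words(sequence, alphabet="ACGT", max_len=12, both_strands=False):
    """Naive search for the shortest words over `alphabet` missing from `sequence`.

    Returns (k, sorted_list_of_absent_words) for the smallest k that has at
    least one absent word, or None if no absent word exists up to max_len.
    Illustrative brute force only; not the method used in the paper.
    """
    seq = sequence.upper()
    texts = [seq]
    if both_strands:
        # Assumption: also scan the reverse complement of the input.
        complement = str.maketrans("ACGT", "TGCA")
        texts.append(seq.translate(complement)[::-1])

    letters = set(alphabet)
    for k in range(1, max_len + 1):
        present = set()
        for text in texts:
            for i in range(len(text) - k + 1):
                window = text[i:i + k]
                # Skip windows that contain symbols outside the alphabet (e.g. N).
                if set(window) <= letters:
                    present.add(window)
        if len(present) < len(alphabet) ** k:
            every_word = ("".join(p) for p in product(alphabet, repeat=k))
            return k, sorted(w for w in every_word if w not in present)
    return None


if __name__ == "__main__":
    length, words = shortest_absent_words("ACGT")
    print(length, len(words))
```

For the toy input `ACGT` every single letter occurs, but only the three 2-mers `AC`, `CG`, and `GT` do, so the sketch reports length 2 with 13 absent words. Because it enumerates all 4^k candidate words, this approach is only practical for short sequences and small k.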

Publication year

2008

Journal title

BMC Bioinformatics

Volume

9

Issue

1

Article no.

167

ISSN

1471-2105

Page URI

https://pub.uni-bielefeld.de/record/1784025

Cite

Herold J, Kurtz S, Giegerich R. Efficient computation of absent words in genomic sequences. BMC Bioinformatics. 2008;9(1): 167.

Herold, J., Kurtz, S., & Giegerich, R. (2008). Efficient computation of absent words in genomic sequences. BMC Bioinformatics, 9(1), 167. https://doi.org/10.1186/1471-2105-9-167

Herold, Julia, Kurtz, Stefan, and Giegerich, Robert. 2008. “Efficient computation of absent words in genomic sequences”. BMC Bioinformatics 9 (1): 167.

Herold, J., Kurtz, S., and Giegerich, R. (2008). Efficient computation of absent words in genomic sequences. BMC Bioinformatics 9:167.

Herold, J., Kurtz, S., & Giegerich, R., 2008. Efficient computation of absent words in genomic sequences. BMC Bioinformatics, 9(1): 167.

J. Herold, S. Kurtz, and R. Giegerich, “Efficient computation of absent words in genomic sequences”, BMC Bioinformatics, vol. 9, 2008, 167.

Herold, J., Kurtz, S., Giegerich, R.: Efficient computation of absent words in genomic sequences. BMC Bioinformatics. 9, 167 (2008).

Herold, Julia, Kurtz, Stefan, and Giegerich, Robert. “Efficient computation of absent words in genomic sequences”. BMC Bioinformatics 9.1 (2008): 167.

All files available under the following license(s):

Copyright Statement:

This object is protected by copyright and/or related rights. [...]

Full text(s)

Name

gieg_2.pdf

Access Level

Open Access

Last uploaded

2019-09-06T08:48:53Z

MD5 checksum

d2ae09323894f1609aa545bf52e0f47a
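
A downloaded copy of the full text can be compared against the MD5 value listed in this record. The snippet below is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_record` and the local file path passed to them are assumptions for illustration.

```python
import hashlib

# MD5 value as listed in this record for gieg_2.pdf.
EXPECTED_MD5 = "d2ae09323894f1609aa545bf52e0f47a"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path):
    """True if the file's MD5 digest equals the value listed in the record."""
    return md5_of_file(path) == EXPECTED_MD5
```

A matching digest indicates that the local file is byte-identical to the uploaded one; MD5 is suitable here as an integrity check against transfer errors, not as a security guarantee.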

Data provided by European Bioinformatics Institute (EBI)

20 citations in Europe PMC

Data provided by Europe PubMed Central.

Absent words and the (dis)similarity analysis of DNA sequences: an experimental study.
Rahman MS, Alatabbi A, Athar T, Crochemore M, Rahman MS., BMC Res Notes 9, 2016
PMID: 27004958

The bulk and the tail of minimal absent words in genome sequences.
Aurell E, Innocenti N, Zhou HJ., Phys Biol 13(2), 2016
PMID: 27043075

Spatial distribution of predicted transcription factor binding sites in Drosophila ChIP peaks.
Pettie KP, Dresch JM, Drewell RA., Mech Dev 141, 2016
PMID: 27264535

Nullomers and High Order Nullomers in Genomic Sequences.
Vergni D, Santoni D., PLoS One 11(12), 2016
PMID: 27906971

Three minimal sequences found in Ebola virus genomes and absent from human DNA.
Silva RM, Pratas D, Castro L, Pinho AJ, Ferreira PJ., Bioinformatics 31(15), 2015
PMID: 25840045

keeSeek: searching distant non-existing words in genomes for PCR-based applications.
Falda M, Fontana P, Barzon L, Toppo S, Lavezzo E., Bioinformatics 30(18), 2014
PMID: 24867942

Enhancing the detection of barcoded reads in high throughput DNA sequencing data by controlling the false discovery rate.
Buschmann T, Zhang R, Brash DE, Bystrykh LV., BMC Bioinformatics 15, 2014
PMID: 25099007

Linear-time computation of minimal absent words using suffix array.
Barton C, Heliou A, Mouchard L, Pissis SP., BMC Bioinformatics 15, 2014
PMID: 25526884

Comparative analysis of DNA word abundances in four yeast genomes using a novel statistical background model.
Hariharan R, Simon R, Pillai MR, Taylor TD., PLoS One 8(3), 2013
PMID: 23472131

Pervasive sequence patents cover the entire human genome.
Rosenfeld JA, Mason CE., Genome Med 5(3), 2013
PMID: 23522065

Clustering of DNA words and biological function: a proof of principle.
Hackenberg M, Rueda A, Carpena P, Bernaola-Galván P, Barturen G, Oliver JL., J Theor Biol 297, 2012
PMID: 22226985

Insertion site preference of Mu, Tn5, and Tn7 transposons.
Green B, Bouchier C, Fairhead C, Craig NL, Cormack BP., Mob DNA 3(1), 2012
PMID: 22313799

Minimal absent words in prokaryotic and eukaryotic genomes.
Garcia SP, Pinho AJ, Rodrigues JM, Bastos CA, Ferreira PJ., PLoS One 6(1), 2011
PMID: 21386877

Microbial diversity in saliva of oral squamous cell carcinoma.
Pushalkar S, Mane SP, Ji X, Li Y, Evans C, Crasta OR, Morse D, Meagher R, Singh A, Saxena D., FEMS Immunol Med Microbiol 61(3), 2011
PMID: 21205002

Minimal absent words in four human genome assemblies.
Garcia SP, Pinho AJ., PLoS One 6(12), 2011
PMID: 22220210

On finding minimal absent words.
Pinho AJ, Ferreira PJ, Garcia SP, Rodrigues JM., BMC Bioinformatics 10, 2009
PMID: 19426495

Word-based characterization of promoters involved in human DNA repair pathways.
Lichtenberg J, Jacox E, Welch JD, Kurz K, Liang X, Yang MQ, Drews F, Ecker K, Lee SS, Elnitski L, Welch LR., BMC Genomics 10 Suppl 1, 2009
PMID: 19594877

Multiplex primer prediction software for divergent targets.
Gardner SN, Hiddessen AL, Williams PL, Hara C, Wagner MC, Colston BW., Nucleic Acids Res 37(19), 2009
PMID: 19759213

Genomic DNA k-mer spectra: models and modalities.
Chor B, Horn D, Goldman N, Levy Y, Massingham T., Genome Biol 10(10), 2009
PMID: 19814784

The word landscape of the non-coding segments of the Arabidopsis thaliana genome.
Lichtenberg J, Yilmaz A, Welch JD, Kurz K, Liang X, Drews F, Ecker K, Lee SS, Geisler M, Grotewold E, Welch LR., BMC Genomics 10, 2009
PMID: 19814816

24 References

Data provided by Europe PubMed Central.

The spectrum of genomic signatures: from dinucleotides to chaos game representation.
Wang Y, Hill K, Singh S, Kari L., Gene 346, 2005
PMID: 15716010

No evidence that mRNAs have lower folding free energies than random sequences with the same dinucleotide distribution.
Workman C, Krogh A., Nucleic Acids Res. 27(24), 1999
PMID: 10572183

GISMO--gene identification using a support vector machine for ORF classification.
Krause L, McHardy AC, Nattkemper TW, Puhler A, Stoye J, Meyer F., Nucleic Acids Res. 35(2), 2006
PMID: 17175534

Structure and function of type II restriction endonucleases.
Pingoud A, Jeltsch A., Nucleic Acids Res. 29(18), 2001
PMID: 11557805

Monotony of Surprise And Large-Scale Quest for Unusual Words
Apostolico A, Bock ME, Lonardi S., 2002

Verbumculus and the Discovery of Unusual Words
Apostolico A, Gong F, Lonardi S., 2004

Mauve: multiple alignment of conserved genomic sequence with rearrangements.
Darling AC, Mau B, Blattner FR, Perna NT., Genome Res. 14(7), 2004
PMID: 15231754

Genome comparison without alignment using shortest unique substrings.
Haubold B, Pierstorff N, Moller F, Wiehe T., BMC Bioinformatics 6, 2005
PMID: 15910684

Absent sequences: nullomers and primes.
Hampikian G, Andersen T., Pac Symp Biocomput, 2007
PMID: 17990505

Nullomers: really a matter of natural selection?
Acquisti C, Poste G, Curtiss D, Kumar S., PLoS ONE 2(10), 2007
PMID: 17925870

Replacing Suffix Trees with Enhanced Suffix Arrays
Abouelhoda M, Kurtz S, Ohlebusch E., 2004

Vmatch
Author unknown, no year

On the distribution of the number of missing words in random texts
Rahmann S, Rivals E., 2003

Human Genome
Author unknown, no year

Mouse Genome
Author unknown, no year

Drosophila Genomes
Author unknown, no year

C. elegans Genome
Author unknown, no year

The genome sequence of the filamentous fungus Neurospora crassa
Galagan J, Calvo S, Borkovich K, Selker E, Read N, Jaffe D, FitzHugh W, Ma L, Smirnov S, Purcell S, Rehman B, Elkins T, Engels R, Wang S, Nielsen C, Butler J, Endrizzi M, Qui D, Ianakiev P, Bell-Pedersen D, Nelson M, Werner-Washburne M, Selitrennikoff C, Kinsey J, Braun E, Zelter A, Schulte U, Kothe G, Jedd G, Mewes W, Staben C, Marcotte E, Greenberg D, Roy A, Foley K, Naylor J, Stange-Thomann N, Barrett R, Gnerre S, Kamal M, Kamvysselis M, Mauceli E, Bielke C, Rudd S, Frishman D, Krystofova S, Rasmussen C, Metzenberg R, Perkins D, Kroken S, Cogoni C, Macino G, Catcheside D, Li W, Pratt R, Osmani S, DeSouza C, Glass L, Orbach M, Berglund J, Voelker R, Yarden O, Plamann M, Seiler S, Dunlap J, Radford A, Aramayo R, Natvig D, Alex L, Mannhaupt G, Ebbole D, Freitag M, Paulsen I, Sachs M, Lander E, Nusbaum C, Birren B., 2003

S. cerevisiae Genome
Author unknown, no year

Complete genome sequence of the hyperthermophilic archaeon Thermococcus kodakaraensis KOD1 and comparison with Pyrococcus genomes.
Fukui T, Atomi H, Kanai T, Matsumi R, Fujiwara S, Imanaka T., Genome Res. 15(3), 2005
PMID: 15710748

Complete genome sequence of the methanogenic archaeon, Methanococcus jannaschii.
Bult CJ, White O, Olsen GJ, Zhou L, Fleischmann RD, Sutton GG, Blake JA, FitzGerald LM, Clayton RA, Gocayne JD, Kerlavage AR, Dougherty BA, Tomb JF, Adams MD, Reich CI, Overbeek R, Kirkness EF, Weinstock KG, Merrick JM, Glodek A, Scott JL, Geoghagen NS, Venter JC., Science 273(5278), 1996
PMID: 8688087

Construction of a large signature-tagged mini-Tn5 transposon library and its application to mutagenesis of Sinorhizobium meliloti.
Pobigaylo N, Wetter D, Szymczak S, Schiller U, Kurtz S, Meyer F, Nattkemper TW, Becker A., Appl. Environ. Microbiol. 72(6), 2006
PMID: 16751548

Computing Unwords on BibiServ
Author unknown, no year

Unwords
Author unknown, no year
