Efficient computation of absent words in genomic sequences

Herold J, Kurtz S, Giegerich R (2008)
BMC Bioinformatics 9(1): 167.

Journal article | Published | English
 
Abstract / Note
Background: Analysis of sequence composition is a routine task in genome research. Organisms are characterized by their base composition, dinucleotide relative abundance, codon usage, and so on. Unique subsequences are markers of special interest in genome comparison, expression profiling, and genetic engineering. Relative to a random sequence of the same length, unique subsequences are overrepresented in real genomes. Shortest words absent from a genome have been addressed in two recent studies.
Results: We describe a new algorithm and software for the computation of absent words. It is more efficient than previous algorithms and easier to use. It directly computes unwords without the need to specify a length estimate. Moreover, it avoids the space requirements of index structures such as suffix trees and suffix arrays. Our implementation is available as an open source package. We compute unwords of human and mouse as well as some other organisms, covering a genome size range from 10^9 down to 10^5 bp.
Conclusion: The new algorithm computes absent words for the human genome in 10 minutes on standard hardware, using only 2.5 Mb of space. This enables us to perform this type of analysis not only for the largest genomes available so far, but also for the emerging pan- and meta-genome data.
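To make the notion of a shortest absent word concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm described in the article, which is not reproduced in this record; it simply tries k = 1, 2, ... and reports the first length at which some word over the alphabet does not occur. The function name `shortest_absent_words`, the four-letter alphabet, and the choice to look at a single strand only (no reverse complement) are assumptions made for illustration.

```python
from itertools import product

ALPHABET = "ACGT"  # assumption: plain DNA alphabet, single strand only


def shortest_absent_words(sequence, alphabet=ALPHABET):
    """Return (k, words): the smallest k for which at least one k-mer over
    `alphabet` does not occur in `sequence`, plus the sorted absent k-mers.

    Naive illustration only: it stores every observed k-mer in a set, so it
    is suitable for short sequences, not for whole genomes.
    """
    seq = sequence.upper()
    total = 1
    for k in range(1, len(seq) + 2):
        total *= len(alphabet)  # number of possible words of length k
        # Collect all length-k windows of the sequence.
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        # Ignore windows containing symbols outside the alphabet (e.g. N).
        seen = {w for w in seen if all(c in alphabet for c in w)}
        if len(seen) < total:
            absent = sorted(
                word
                for word in ("".join(p) for p in product(alphabet, repeat=k))
                if word not in seen
            )
            return k, absent
    # At k = len(seq) + 1 no window exists, so the loop always returns.
    raise AssertionError("unreachable")


if __name__ == "__main__":
    k, words = shortest_absent_words("ACGTACGGTCA")
    print(k, len(words))
```

Because this sketch holds all observed k-mers in memory, its space use grows with the sequence length. That growth is the kind of cost the abstract says the published method avoids.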
Year of publication
2008
Journal title
BMC Bioinformatics
Volume
9
Issue
1
Page(s)
167
ISSN
1471-2105
Page URI
https://pub.uni-bielefeld.de/record/1784025

Cite

Herold J, Kurtz S, Giegerich R. Efficient computation of absent words in genomic sequences. BMC Bioinformatics. 2008;9(1):167.
Herold, J., Kurtz, S., & Giegerich, R. (2008). Efficient computation of absent words in genomic sequences. BMC Bioinformatics, 9(1), 167. doi:10.1186/1471-2105-9-167
Herold, J., Kurtz, S., and Giegerich, R. (2008). Efficient computation of absent words in genomic sequences. BMC Bioinformatics 9, 167.
Herold, J., Kurtz, S., & Giegerich, R., 2008. Efficient computation of absent words in genomic sequences. BMC Bioinformatics, 9(1), p 167.
J. Herold, S. Kurtz, and R. Giegerich, “Efficient computation of absent words in genomic sequences”, BMC Bioinformatics, vol. 9, 2008, p. 167.
Herold, J., Kurtz, S., Giegerich, R.: Efficient computation of absent words in genomic sequences. BMC Bioinformatics. 9, 167 (2008).
Herold, Julia, Kurtz, Stefan, and Giegerich, Robert. “Efficient computation of absent words in genomic sequences”. BMC Bioinformatics 9.1 (2008): 167.
All files are available under the following license(s):
Copyright Statement:
This Item is protected by copyright and/or related rights. [...]
Full text(s)
Access Level
OA Open Access
Last uploaded
2019-09-06T08:48:53Z
MD5 checksum
d2ae09323894f1609aa545bf52e0f47a

Sources

PMID: 18366790