@phdthesis{2302056,
abstract = {This thesis treats two problem areas in bioinformatics which can both be beneficially formalized as string problems.
The first (and larger) part deals with weighted string problems as they arise from biotechnological mass spectrometry applications. In a mass spectrometry experiment - put in a simplified manner - the molecular mass of the sample molecules is determined. The aim is to identify the sample, either with or without additional information from a database, relying on the fact that the different building blocks of proteins (namely, amino acids) and of DNA molecules (nucleotides) have different molecular masses. Viewing proteins and DNA molecules as strings, this leads naturally to the definition of weighted strings as strings over a finite alphabet with an additional weight (or mass) function.
We develop some results in weighted string combinatorics, and then present efficient algorithms for three weighted string problems which are motivated by mass spectrometry.
First, the mass decomposition problem is the problem of finding all compomers whose mass equals a query mass. Here, a compomer is an integer vector specifying the number of occurrences of each character. A compomer abstracts from the order of characters in a string and identifies instead all strings with the same number of occurrences of each character - since these strings all have the same mass. We also present simulation results and tables detailing the number of compomers for different biomolecules, in the ranges appropriate for mass spectrometry applications.
Next, we give several efficient algorithms for the submass problem: to test, for a given weighted string s and a query mass M, whether s has a substring with mass M; to find where such a substring occurs; and other variants. We present an algorithm for binary alphabets which runs in time logarithmic in the length of s. Furthermore, we present several algorithms for the problem where multiple masses are sought; these algorithms encode the submasses of s in a polynomial and rely for their efficiency on Fast Fourier Transform for polynomial multiplication.
The third weighted string problem we discuss is de novo peptide sequencing, i.e., recovering the amino acid sequence of a sample peptide from tandem MS data, a specialized mass spectrometry experiment. We describe an algorithm which is an enhancement of a dynamic programming algorithm presented by Chen et al. in 2000. We describe our implementation and present some simulation results.
The second part of the thesis deals with EST clustering. ESTs (expressed sequence tags) are short DNA sequences which are partial copies of mRNAs; they are produced in a high-throughput manner and submitted to public databases. In EST clustering, the aim is to produce a partition of a large set of ESTs, where each cluster corresponds to a gene; thus enabling the researcher to identify which genes were being expressed. Clearly, the quality of the clustering is highly dependent on the dissimilarity measure (distance/similarity measure) for strings and on the clustering algorithm employed. We have developed a method for evaluating different string dissimilarity measures and clustering algorithms. Finally, we present simulation results for five dissimilarity measures and one clustering algorithm.},
author = {Lipták, Zsuzsanna},
publisher = {Bielefeld University},
title = {{Strings in proteomics and transcriptomics : algorithmic and combinatorial questions in mass spectrometry and EST clustering}},
year = {2005},
}