From Hidden Data and Information towards Data-Driven Research in the Biomedical Domain
Kühnel L (2024)
Bielefeld: Universität Bielefeld.
Bielefeld E-Dissertation | English
Author
Kühnel, Lisa
Reviewer / Supervisor
Fluck, Juliane;
Cimiano, Philipp;
Klinger, Roman
Department
Abstract / Remarks
Machine-readability and access to data, information and knowledge are key requirements for data-driven research to generate new knowledge and insights from data. However, data-driven research, especially in the biomedical domain, is hampered by several factors. While literature data is freely available to researchers, it is neither machine-readable nor easy to find, given the enormous growth of electronic publications. Medical data, on the other hand, is (at least partially) stored in a structured format, but access to it is restricted by privacy laws. This thesis therefore addresses these two problems to advance data-driven research.
A great deal of research has been devoted to automatically extracting information from unstructured data, a field known as text mining. Current state-of-the-art methods have been boosted by advances in deep learning and show promising results on available corpora. However, since manual curation of data sets is a complex and time-consuming task that often yields very small data sets, it remains questionable how robust these methods are on new data and whether they are ready to be used in digital services that support researchers.
A central aim of this thesis is therefore to investigate the robustness of current state-of-the-art methods and how they can be incorporated into such services. We show that despite high performance on a single corpus, cross evaluations, in which a model trained on one corpus is tested on another, lead to a significant drop in performance. Using biomedical named entity recognition (e.g. of diseases and genes) as an example, we analyse the available corpora and conclude that they are either too small or too specific to train a robust model. Furthermore, we show the importance of annotation guidelines and how they influence the final model. Despite these limitations, the increasing amount of literature appearing daily, especially during the COVID-19 pandemic, underlines the need for digital services with automatic indexing methods.
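To make the cross-evaluation setting concrete, here is a minimal Python sketch of the protocol, with a toy tagger that simply memorises the gold mentions of its training corpus; the corpora, the tagger and the scoring are invented stand-ins, not the models or data sets analysed in the thesis.

from itertools import product

# Toy corpora: each is a list of (sentence, gold disease mentions).
# Invented stand-ins for real benchmark corpora.
CORPORA = {
    "corpus_A": [("BRCA1 mutations cause breast cancer", {"breast cancer"})],
    "corpus_B": [("Patients with type 2 diabetes were enrolled", {"type 2 diabetes"})],
}

def train(corpus):
    """'Training' memorises the gold mentions (a stand-in for fitting a deep NER model)."""
    lexicon = set()
    for _, mentions in corpus:
        lexicon |= mentions
    return lexicon

def predict(lexicon, sentence):
    """Tag every memorised mention that occurs verbatim in the sentence."""
    return {m for m in lexicon if m in sentence}

def f1(corpus, lexicon):
    """Entity-level F1 of the lexicon tagger over a corpus."""
    tp = fp = fn = 0
    for sentence, gold in corpus:
        pred = predict(lexicon, sentence)
        tp += len(pred & gold)
        fp += len(pred - gold)
        fn += len(gold - pred)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Cross evaluation: train on each corpus, evaluate on every corpus.
for train_name, eval_name in product(CORPORA, repeat=2):
    model = train(CORPORA[train_name])
    print(f"train={train_name} eval={eval_name} F1={f1(CORPORA[eval_name], model):.2f}")

Because this stand-in model is maximally overfitted, it scores a perfect F1 on its own corpus and zero on the other, an exaggerated version of the in-corpus versus cross-corpus gap described above.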
To support researchers during the pandemic, we set up a text mining-based semantic search engine that combines preprints (i.e. non-peer-reviewed articles) from several different preprint servers. Semantic indexing and further search functionalities facilitate the retrieval of relevant information. Based on this service, we iteratively analyse the requirements for building such an engine in close cooperation with its users, who are mainly information specialists. We also show how a running system can be used to improve text mining methods by implementing feedback modes.
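As a rough illustration of what the semantic indexing mentioned above adds over plain keyword search, the following sketch tags documents with concept identifiers so that a query matches any synonym of a concept; the concept table, identifiers and documents are hypothetical, and the real service relies on curated biomedical vocabularies and far richer search functionality.

# Minimal sketch of concept-based ("semantic") indexing: documents are
# tagged with ontology concepts so that a query matches any synonym.
# The concept table and documents below are invented stand-ins.

# Map surface forms to concept identifiers (cf. MeSH/UMLS in practice).
CONCEPTS = {
    "covid-19": "C000657245",
    "sars-cov-2 infection": "C000657245",
    "remdesivir": "C000606551",
}

DOCUMENTS = {
    "doc1": "Remdesivir shows activity against SARS-CoV-2 infection.",
    "doc2": "Long-term outcomes of COVID-19 survivors remain unclear.",
}

def index(documents):
    """Build an inverted index from concept id to document ids."""
    inverted = {}
    for doc_id, text in documents.items():
        lowered = text.lower()
        for surface, concept in CONCEPTS.items():
            if surface in lowered:
                inverted.setdefault(concept, set()).add(doc_id)
    return inverted

def search(inverted, query):
    """Resolve the query to a concept and return all documents indexed with it."""
    concept = CONCEPTS.get(query.lower())
    return sorted(inverted.get(concept, set()))

inverted = index(DOCUMENTS)
# A query for "COVID-19" also retrieves the synonym "SARS-CoV-2 infection".
print(search(inverted, "COVID-19"))  # ['doc1', 'doc2']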
Such crowd-sourcing approaches can increase the amount of available training data, which should in turn be used to improve the algorithms. Since re-training from scratch is time-consuming and inefficient, a desirable solution is to continue training the already trained model on the new data set. A well-known problem that arises in this setting is catastrophic forgetting: the model becomes biased towards the most recent training data and forgets what it has previously learned.
We analyse this effect on biomedical named entity recognition with current state-of-the-art methods and develop a new, efficient continual learning algorithm that allows a model to be re-trained while preventing forgetting through a post-processing step.
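The following toy sketch demonstrates the forgetting effect itself, not the algorithm developed in the thesis: a one-parameter model is fitted by gradient descent on a first task and then naively re-trained on a second, conflicting task, after which its error on the first task explodes.

def sgd(w, data, lr=0.1, epochs=50):
    """Fit y = w * x by stochastic gradient descent on squared error."""
    for _ in range(epochs):
        for x, y in data:
            w -= lr * 2 * (w * x - y) * x
    return w

def mse(w, data):
    """Mean squared error of the model y = w * x on a data set."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

task_a = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0)]   # task A: y = 2x
task_b = [(x, -2.0 * x) for x in (1.0, 2.0, 3.0)]  # task B: y = -2x

w = sgd(0.0, task_a)
print(f"after task A: w={w:.2f} mse_A={mse(w, task_a):.3f}")  # w ~ 2, error ~ 0

w = sgd(w, task_b)  # naive continued training, nothing prevents forgetting
print(f"after task B: w={w:.2f} mse_A={mse(w, task_a):.3f}")  # error on task A explodes

Any continual learning method is judged by how far it closes this gap while still fitting the new data.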
In contrast to the automatic structuring of literature data and the provision of appropriate digital services to end users, restricted access to personal health data poses different challenges. While anonymisation techniques are widely used to protect individuals, the removal of information may undermine the usefulness of the data. Synthetic data generation methods have therefore been developed: they create a completely new data set that retains similar statistical properties. However, it is unclear to what extent such data sets can substitute for real data in analyses.
The second central aim of this thesis is therefore to investigate the applicability of current state-of-the-art methods for generating synthetic longitudinal data using a real-world example. We adapt the algorithms to handle dependencies across time points and, together with domain experts, evaluate the usefulness of the generated data at several levels.
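To indicate what evaluating usefulness at different levels can involve, the sketch below generates toy longitudinal records, produces a deliberately naive synthetic version that samples each time point independently, and checks two levels of fidelity: the marginal statistics survive, but the cross-time dependency is lost. All data and the generator are hypothetical, and statistics.correlation requires Python 3.10 or newer.

import random
import statistics

random.seed(0)

# Toy "real" longitudinal data: one lab value measured at two visits,
# where visit 2 depends on visit 1 (a stand-in for real patient records).
real = []
for _ in range(1000):
    v1 = random.gauss(100, 15)
    v2 = 0.8 * v1 + random.gauss(20, 5)  # dependency across time points
    real.append((v1, v2))

# Naive generator: samples each visit from its marginal distribution and
# therefore ignores the temporal dependency entirely.
m1, s1 = statistics.mean(v[0] for v in real), statistics.stdev(v[0] for v in real)
m2, s2 = statistics.mean(v[1] for v in real), statistics.stdev(v[1] for v in real)
synthetic = [(random.gauss(m1, s1), random.gauss(m2, s2)) for _ in range(1000)]

def corr(pairs):
    """Pearson correlation between the two visits."""
    return statistics.correlation([p[0] for p in pairs], [p[1] for p in pairs])

# Marginal statistics match, but the cross-time correlation is lost,
# which is exactly the kind of defect a multi-level evaluation reveals.
print(f"real:      mean_v2={statistics.mean(v[1] for v in real):.1f} corr={corr(real):.2f}")
print(f"synthetic: mean_v2={statistics.mean(v[1] for v in synthetic):.1f} corr={corr(synthetic):.2f}")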
With this work, we contribute to the availability and discoverability of machine-readable biomedical data and advance translational science by bringing the added value of new research findings to their users.
Year
2024
Page(s)
161
Copyright / Licenses
Page URI
https://pub.uni-bielefeld.de/record/2985870
Cite
Kühnel L. From Hidden Data and Information towards Data-Driven Research in the Biomedical Domain. Bielefeld: Universität Bielefeld; 2024.
All files available under the following license(s):
Creative Commons Attribution 4.0 International Public License (CC-BY 4.0)
Full text(s)
Name
phdthesis_lkuehnel.pdf
7.76 MB
Access Level
Open Access
Last Uploaded
2024-01-07T11:21:01Z
MD5 Checksum
12d7bc7f818972f547211280179042eb