- File Size 165.52 KB
- File Count 1
- Create Date November 9, 2023
- Last Updated November 9, 2023
D2.3 Final version of algorithms for extraction and the data pipeline - Executive Summary
Most existing data is stored in unstructured form, and most of that unstructured data is free text; i.e., natural language text. Free text, however, is not machine readable, which means that machines cannot directly make use of the large amount of knowledge it contains. To exploit the value of the information stored in natural language text, researchers have turned to constructing Knowledge Graphs (KGs) from it [1]. KGs are graph databases that usually store information as (subject, predicate, object) triples [2, 3, 4, 5, 6]. Once the information in text is represented in this structured format, it can be used for many downstream tasks such as information retrieval [7, 8], data exploration [9, 10], and automatic question answering [11, 12], as well as for targeted domain-specific tasks such as molecular property prediction and drug design [13].
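To make the (subject, predicate, object) format concrete, the following is a minimal sketch of an in-memory triple store with simple pattern matching. It is an illustration only and not the GLOMICAVE implementation: the names `Triple`, `TripleStore`, `add`, and `match`, as well as the example facts, are assumptions chosen for this example.

```python
from __future__ import annotations

from collections import defaultdict
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    """One fact in (subject, predicate, object) form."""
    subject: str
    predicate: str
    obj: str


class TripleStore:
    """Toy in-memory knowledge graph (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._triples: set[Triple] = set()
        # Index by subject so that subject-bound queries avoid a full scan.
        self._by_subject: dict[str, set[Triple]] = defaultdict(set)

    def add(self, subject: str, predicate: str, obj: str) -> None:
        triple = Triple(subject, predicate, obj)
        self._triples.add(triple)
        self._by_subject[subject].add(triple)

    def match(
        self,
        subject: Optional[str] = None,
        predicate: Optional[str] = None,
        obj: Optional[str] = None,
    ) -> list[Triple]:
        """Return all triples matching the pattern; None acts as a wildcard."""
        if subject is not None:
            candidates = self._by_subject.get(subject, set())
        else:
            candidates = self._triples
        return sorted(
            t for t in candidates
            if (predicate is None or t.predicate == predicate)
            and (obj is None or t.obj == obj)
        )


# Illustrative facts, as they might be extracted from free text.
kg = TripleStore()
kg.add("hexokinase", "is_a", "enzyme")
kg.add("hexokinase", "catalyzes", "glucose phosphorylation")
kg.add("glucose", "is_a", "metabolite")

# Query: which entities have an "is_a" relation, and to what?
for triple in kg.match(predicate="is_a"):
    print(triple)
```

Leaving a position as `None` turns it into a wildcard, which is the basic building block behind retrieval and question answering over a KG. A production system would typically use a dedicated graph database rather than Python sets.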
In this deliverable, we present the final version of the algorithms for information extraction from research papers in the omics sciences. We also present the final data processing pipeline integrated with the GLOMICAVE cloud platform. The final result of this deliverable is a KG-based structured representation of the information found in the research papers of interest. The information contained in the KG can be further enhanced and enriched by machine learning algorithms, namely graph neural networks, which, after appropriate training on the collected data, are able to discover new properties of entities and new relations between them. Moreover, the KG enables users to explore this information in a more structured and meaningful manner.