- File Size 163.58 KB
- File Count 1
- Create Date March 9, 2022
- Last Updated March 9, 2022
D2.1 Novel training dataset and labelling functions - Executive Summary
Modern machine learning approaches, in particular neural networks in the field of deep learning, require large amounts of training data (i.e., pairs of input data and expected output) to achieve their best performance. However, manually labeling large amounts of data is time-consuming and hence expensive. Labeling function development provides an alternative to manual labeling of training data. The key idea is to develop several heuristic labeling functions that can be used to automatically annotate large amounts of data. The heuristic labels can then be used to train weakly supervised models, which do not require high-quality labels but can also be trained with lower-quality (so-called noisy) labels.
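To make the idea concrete, the following is a minimal sketch of how heuristic labeling functions can be combined into a single noisy label. It is not taken from the deliverable: the function names, the keywords, and the majority-vote combination are assumptions chosen for illustration only.

```python
from collections import Counter

# Label values used in this sketch (hypothetical convention).
ABSTAIN, IRRELEVANT, RELEVANT = -1, 0, 1


def lf_mentions_metabolite(sentence):
    """Vote RELEVANT if an (assumed) domain keyword occurs, else abstain."""
    keywords = ("metabolite", "metabolomics")
    return RELEVANT if any(k in sentence.lower() for k in keywords) else ABSTAIN


def lf_too_short(sentence):
    """Vote IRRELEVANT for very short sentences, else abstain."""
    return IRRELEVANT if len(sentence.split()) < 4 else ABSTAIN


def lf_acknowledgement(sentence):
    """Vote IRRELEVANT for funding statements, else abstain."""
    return IRRELEVANT if "funded by" in sentence.lower() else ABSTAIN


LABELING_FUNCTIONS = [lf_mentions_metabolite, lf_too_short, lf_acknowledgement]


def weak_label(sentence):
    """Combine all non-abstaining votes by majority; abstain on ties."""
    votes = [lf(sentence) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting heuristics, no clear majority
    return ranked[0][0]
```

Each function may abstain when its heuristic does not apply, so a sentence only receives a label when at least one heuristic fires and the votes do not conflict. Majority voting is just one simple way to resolve disagreement; a weakly supervised model trained on such labels has to tolerate the remaining noise.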
In this deliverable, we present a set of labeling functions that have been developed for three different annotation levels (named entities, sentences, and publications) and a novel dataset that has been labeled with these functions. More specifically, we have developed labeling functions to label named entities in natural language texts, as this is of key importance for subsequent processing steps. Furthermore, we have developed labeling functions to annotate the relevance of individual sentences and publications, which can be used in the processing pipeline for content filtering. The developed labeling functions have been used to annotate a collection of scientific publications. Moreover, the generated dataset has already been used within the GLOMICAVE project for further analysis and weakly supervised training.
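As an illustration of the entity level, a labeling function can be as simple as a dictionary lookup that marks known terms in a text. The sketch below is an assumption for illustration, not the deliverable's implementation: the gazetteer entries, the label names, and the function name `label_entities` are all hypothetical.

```python
import re

# Hypothetical gazetteer mapping surface forms to entity types.
GAZETTEER = {
    "glucose": "METABOLITE",
    "lactate": "METABOLITE",
    "insulin": "PROTEIN",
}


def label_entities(text):
    """Return sorted (start, end, label) character spans for gazetteer hits."""
    spans = []
    for term, label in GAZETTEER.items():
        # Whole-word, case-insensitive match of the dictionary term.
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)
```

For example, `label_entities("Glucose and lactate rise.")` yields one span for each of the two dictionary terms. Lookups of this kind miss unseen names and mislabel ambiguous ones, which is why the resulting annotations count as noisy labels.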