PASCAL Pump Prime 1

Normalised Compression Distance Measures and Their Application in Unsupervised and Supervised Analysis of Polymorphic Data

The project researchers have built a generic open-source software package (Complearn) for building tree-structured representations from a normalised compression distance (NCD) or normalised Google distance (NGD) matrix. Preliminary empirical tests with the software indicated that the data representation is crucial for the performance of the algorithms, which led us to study further the application of lossy compression algorithms (audio stream to MIDI conversion and lossy image compression through wavelet transformation).
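As a rough illustration of the distance underlying these trees, the sketch below computes the NCD in its usual form, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length in bytes. It is a minimal sketch, not the Complearn implementation: zlib stands in for whichever compressor is actually used, and the function names and the sample strings are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalised compression distance between two byte strings."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compressed size of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(items):
    """Pairwise NCD matrix, the kind of input a tree-building step would consume."""
    return [[ncd(a, b) for b in items] for a in items]


# Hypothetical sample objects, invented purely for illustration.
samples = [
    b"the quick brown fox jumps over the lazy dog " * 20,
    b"the quick brown fox leaps over the lazy dog " * 20,
    bytes(range(256)) * 4,
]
matrix = distance_matrix(samples)
```

With a real compressor the values are only approximately symmetric and the diagonal is small rather than exactly zero, so any tree-building step has to tolerate a noisy matrix. The choice of representation (raw bytes here) directly changes the distances, which is consistent with the observation above that data representation matters.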

In the second part of the project, the methods and tools developed were applied to a challenging and exciting real-world problem. The goal was to recover the relations among different variants of a text that has been gradually altered through repeated imperfect copying. In addition to using the methods currently available in Complearn, we also developed a new compression-based method specifically designed for stemmatic analysis of text variants.

The various methods developed in the project were applied and tested using the tradition of the legend of St. Henry of Finland, which forms a collection of the oldest written texts found in Finland. The results were quite encouraging: the obtained family tree of the variants, the stemma, corresponds to a large extent with results obtained with more traditional methods (as verified by the leading domain expert, Tuomas Heikkilä, Ph.D., Department of History, University of Helsinki). Moreover, some of the identified groups of manuscripts had not previously been recognised. Because it is impossible to explore manually all plausible alternatives among the vast number of possible trees, this work is the first attempt at a complete stemma for the legend of St. Henry. The new compression-based methods developed specifically for the stemmatology domain will be released in the future as part of the open-source Complearn package. We are also considering the possibility of creating a PASCAL challenge using this type of data.

Learning with Labeled and Unlabeled Data

Semi-supervised learning is one of the main directions of recent machine learning research. Exploiting unlabeled data is an attractive way either to extend the capability of known methods or to derive novel learning devices. Learning a rule from a finite sample is the fundamental problem of machine learning. Two resources are needed for this purpose: a sufficiently large sample and sufficient computational power. While computational power has been growing rapidly, the cost of collecting a large sample remains high because it is labour intensive. Unlabeled data can be used to find a compact representation of the data that preserves as much of its original structure as possible.
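The last point can be made concrete with a small sketch. It assumes, purely for illustration, that the compact representation is a one-dimensional projection onto the leading principal direction estimated from all available points (labeled and unlabeled), and that the few labels are then used by a nearest-class-mean rule in that projection. Neither choice is claimed to be the project's method, and all function names are hypothetical.

```python
def mean_vector(points):
    """Component-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def principal_direction(points, iterations=100):
    """Leading principal direction via power iteration on the covariance matrix.

    Assumes the all-ones starting vector is not orthogonal to that direction.
    """
    mu = mean_vector(points)
    dim = len(mu)
    centred = [[p[i] - mu[i] for i in range(dim)] for p in points]
    cov = [[sum(c[i] * c[j] for c in centred) / len(centred)
            for j in range(dim)] for i in range(dim)]
    v = [1.0] * dim
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return mu, v


def project(point, mu, v):
    """One-dimensional compact representation of a point."""
    return sum((point[i] - mu[i]) * v[i] for i in range(len(mu)))


def fit_class_means(labeled, mu, v):
    """Mean projected value per class, using only the labeled points."""
    groups = {}
    for point, label in labeled:
        groups.setdefault(label, []).append(project(point, mu, v))
    return {label: sum(z) / len(z) for label, z in groups.items()}


def predict(point, mu, v, class_means):
    """Assign the class whose projected mean is nearest."""
    z = project(point, mu, v)
    return min(class_means, key=lambda label: abs(z - class_means[label]))
```

The unlabeled points enter only through `principal_direction`, so a handful of labels can suffice once the representation has been estimated from the larger, cheaper unlabeled sample.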