The project researchers have built a generic open-source
software package (Complearn) for building tree-structured
representations from a normalised compression distance (NCD)
or normalised Google distance (NGD) matrix.
Preliminary empirical tests with the software have indicated that
the data representation is crucial for the performance of the
algorithms, which led us to study further the application of lossy
compression algorithms (audio stream to MIDI conversion and lossy
image compression through wavelet transformation).
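To make the distance-matrix input concrete, the following is a minimal sketch of how an NCD matrix might be computed. It uses the standard definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. The choice of zlib as the compressor and the helper names compressed_size, ncd and ncd_matrix are assumptions made for illustration; they are not taken from the Complearn package.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalised compression distance between two byte strings."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compressed size of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def ncd_matrix(items: list[bytes]) -> list[list[float]]:
    """Pairwise NCD matrix; the diagonal is set to 0.0 by convention."""
    n = len(items)
    return [
        [0.0 if i == j else ncd(items[i], items[j]) for j in range(n)]
        for i in range(n)
    ]
```

With a real compressor the values are only approximately symmetric, and the distance of an object to itself is small but not exactly zero, which is why the sketch fixes the diagonal explicitly. A tree-building step would then take such a matrix as its input.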
In the second part of the project, the methods and tools
developed in the project were applied to a challenging and
exciting real-world problem. The goal was to recover the
relations among different variants of a text that has been
gradually altered as a result of being copied imperfectly over
and over again. In addition to using the currently available
methods in Complearn, we also developed a new compression-
based method that is specifically designed for stemmatic
analysis of text variants.
The various methods developed in the project were applied
and tested using the tradition of the legend of St. Henry of
Finland, which forms a collection of the oldest written texts
found in Finland. The results were quite encouraging: the
obtained family tree of the variants, the stemma, corresponds to
a large extent with results obtained with more traditional
methods (as verified by the leading domain expert, Tuomas
Heikkilä, Ph.D., Department of History, University of Helsinki).
Moreover, some of the identified groups of manuscripts had not
been recognised previously. Due to the impossibility of
manually exploring all plausible alternatives among the vast
number of possible trees, this work is the first attempt at a
complete stemma for the legend of St. Henry. The new
compression-based methods developed specifically for the
stemmatology domain will be released in the future as part of
the open-source Complearn package. We are also considering
the possibility of creating a PASCAL challenge using this type of
data.

Semi-supervised learning is one of the main directions of
recent machine learning research. Exploiting unlabeled data is
an attractive way either to extend the capability of known
methods or to derive novel learning devices. Learning a rule
from a finite sample is the fundamental
problem of machine learning. For this purpose two resources
are needed: a sufficiently large sample and sufficient
computational power. While computational power has been growing
rapidly, the cost of collecting a large sample remains high because
collecting it is labour intensive.
Unlabeled data can be used to find a compact
representation of the data that preserves as much of its
original structure as possible.