### 2013

1.

Wicker, Jörg

Large Classifier Systems in Bio- and Cheminformatics (PhD Thesis)

Technische Universität München, 2013.

Tags: biodegradation, bioinformatics, cheminformatics, computational sustainability, data mining, enviPath, machine learning, multi-label classification, multi-relational learning, toxicity

@phdthesis{wicker2013large,

title = {Large Classifier Systems in Bio- and Cheminformatics},

author = {J\"{o}rg Wicker},

url = {http://mediatum.ub.tum.de/node?id=1165858},

year = {2013},

date = {2013-01-01},

school = {Technische Universit\"{a}t M\"{u}nchen},

abstract = {Large classifier systems are machine learning algorithms that use multiple

classifiers to improve the prediction of target values in advanced

classification tasks. Although learning problems in bio- and

cheminformatics commonly provide data in schemes suitable for large

classifier systems, they are rarely used in these domains. This thesis

introduces two new classifiers incorporating systems of classifiers

using Boolean matrix decomposition to handle data in a schema that

often occurs in bio- and cheminformatics.

The first approach, called MLC-BMaD (multi-label classification using

Boolean matrix decomposition), uses Boolean matrix decomposition to

decompose the labels in a multi-label classification task. The

decomposed matrices are a compact representation of the information

in the labels (first matrix) and the dependencies among the labels

(second matrix). The first matrix is used in a further multi-label

classification while the second matrix is used to generate the final

matrix from the predicted values of the first matrix.

MLC-BMaD was evaluated on six standard multi-label data sets; the

experiments showed that MLC-BMaD can perform particularly well on data

sets with a high number of labels and a small number of instances and

can outperform standard multi-label algorithms.

Subsequently, MLC-BMaD is extended to a special case of

multi-relational learning, by considering the labels not as simple

labels, but instances. The algorithm, called ClassFact

(Classification factorization), uses both matrices in a multi-label

classification. Each label represents a mapping between two

instances.

Experiments on three data sets from the domain of bioinformatics show

that ClassFact can outperform the baseline method, which merges the

relations into one, on hard classification tasks.

Furthermore, large classifier systems are used on two cheminformatics

data sets. The first is used to predict the environmental fate of

chemicals by predicting biodegradation pathways. The second is a data

set from the domain of predictive toxicology. In biodegradation

pathway prediction, I extend a knowledge-based system and incorporate

a machine learning approach to predict a probability for

biotransformation products based on the structure- and knowledge-based

predictions of products, which are based on transformation rules. The

use of multi-label classification improves the performance of the

classifiers and extends the number of transformation rules that can be

covered.

For the prediction of toxic effects of chemicals, I applied large

classifier systems to the ToxCast\texttrademark{} data set, which maps

toxic effects to chemicals. As the given toxic effects are not easy to

predict due to missing information and a skewed class

distribution, I introduce a filtering step in the multi-label

classification, which finds labels that are usable in multi-label

prediction and excludes the others from the prediction. Experiments show

that this approach can improve upon the baseline method using binary

classification, as well as multi-label approaches using no filtering.

The presented results show that large classifier systems can play a

role in future research challenges, especially in bio- and

cheminformatics, where data sets frequently consist of more complex

structures and data can be rather small in terms of the number of

instances compared to other domains.},

keywords = {biodegradation, bioinformatics, cheminformatics, computational sustainability, data mining, enviPath, machine learning, multi-label classification, multi-relational learning, toxicity},

pubstate = {published},

tppubtype = {phdthesis}

}

Large classifier systems are machine learning algorithms that use multiple

classifiers to improve the prediction of target values in advanced

classification tasks. Although learning problems in bio- and

cheminformatics commonly provide data in schemes suitable for large

classifier systems, they are rarely used in these domains. This thesis

introduces two new classifiers incorporating systems of classifiers

using Boolean matrix decomposition to handle data in a schema that

often occurs in bio- and cheminformatics.

The first approach, called MLC-BMaD (multi-label classification using

Boolean matrix decomposition), uses Boolean matrix decomposition to

decompose the labels in a multi-label classification task. The

decomposed matrices are a compact representation of the information

in the labels (first matrix) and the dependencies among the labels

(second matrix). The first matrix is used in a further multi-label

classification while the second matrix is used to generate the final

matrix from the predicted values of the first matrix.
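
The abstract does not spell out the decomposition algorithm or the base classifier, so the following is only a minimal sketch of the final reconstruction step, assuming the predicted first matrix is binary and is combined with the second matrix through a Boolean matrix product. The function name `boolean_product` and the toy matrices are hypothetical and are not taken from the thesis.

```python
from typing import List

Matrix = List[List[int]]


def boolean_product(a: Matrix, b: Matrix) -> Matrix:
    """Boolean matrix product: out[i][j] = OR over k of (a[i][k] AND b[k][j])."""
    if a and len(a[0]) != len(b):
        raise ValueError("inner dimensions must match")
    n_cols = len(b[0]) if b else 0
    return [
        [int(any(row[k] and b[k][j] for k in range(len(b)))) for j in range(n_cols)]
        for row in a
    ]


# Hypothetical toy data: 3 instances, 2 latent labels, 4 original labels.
# predicted_latent stands in for the classifier output on the first matrix.
predicted_latent = [[1, 0], [0, 1], [1, 1]]
# label_dependencies stands in for the second matrix (latent label -> original labels).
label_dependencies = [[1, 1, 0, 0], [0, 1, 1, 1]]

# Final label predictions: an original label is set if any active latent label implies it.
reconstructed = boolean_product(predicted_latent, label_dependencies)
```

In the third row both latent labels are active and both imply the second original label; under Boolean arithmetic the entry stays 1 rather than summing to 2, which is what distinguishes this product from an ordinary matrix product.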

MLC-BMaD was evaluated on six standard multi-label data sets; the

experiments showed that MLC-BMaD can perform particularly well on data

sets with a high number of labels and a small number of instances and

can outperform standard multi-label algorithms.

Subsequently, MLC-BMaD is extended to a special case of

multi-relational learning, by considering the labels not as simple

labels, but instances. The algorithm, called ClassFact

(Classification factorization), uses both matrices in a multi-label

classification. Each label represents a mapping between two

instances.

Experiments on three data sets from the domain of bioinformatics show

that ClassFact can outperform the baseline method, which merges the

relations into one, on hard classification tasks.

Furthermore, large classifier systems are used on two cheminformatics

data sets. The first is used to predict the environmental fate of

chemicals by predicting biodegradation pathways. The second is a data

set from the domain of predictive toxicology. In biodegradation

pathway prediction, I extend a knowledge-based system and incorporate

a machine learning approach to predict a probability for

biotransformation products based on the structure- and knowledge-based

predictions of products, which are based on transformation rules. The

use of multi-label classification improves the performance of the

classifiers and extends the number of transformation rules that can be

covered.

For the prediction of toxic effects of chemicals, I applied large

classifier systems to the ToxCast™ data set, which maps

toxic effects to chemicals. As the given toxic effects are not easy to

predict due to missing information and a skewed class

distribution, I introduce a filtering step in the multi-label

classification, which finds labels that are usable in multi-label

prediction and excludes the others from the prediction. Experiments show

that this approach can improve upon the baseline method using binary

classification, as well as multi-label approaches using no filtering.

The presented results show that large classifier systems can play a

role in future research challenges, especially in bio- and

cheminformatics, where data sets frequently consist of more complex

structures and data can be rather small in terms of the number of

instances compared to other domains.
