Author: Andrew Estabrooks
Degree: Master of Computer Science, Dalhousie University
Date: October 2000
Chairperson of the Supervisory Committee: Nathalie Japkowicz
Abstract:
This thesis explores inductive learning and its application to imbalanced
data sets. Imbalanced data sets occur in two-class domains when one class
contains a large number of examples while the other contains only a few.
Learners presented with imbalanced data sets typically produce biased
classifiers that have high predictive accuracy on the over-represented
class but low predictive accuracy on the under-represented class. As a
result, the under-represented class can be largely ignored by an induced
classifier. This bias can be attributed to learning algorithms being
designed to maximize accuracy over a data set, on the assumption that an
induced classifier will encounter unseen data with the same class
distribution as the training data. When the positive examples belong to
the under-represented class, this limits the classifier's ability to
recognize them.
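As a plain-Python illustration of this accuracy trap (the 99:1 split
below is hypothetical, not a data set from the thesis), a degenerate
classifier that always predicts the majority class scores high on
accuracy while finding no positive examples:

    # Hypothetical 99:1 two-class data set.
    true_labels = [0] * 990 + [1] * 10

    # Degenerate classifier: always predict the over-represented class.
    predictions = [0] * len(true_labels)

    correct = sum(t == p for t, p in zip(true_labels, predictions))
    accuracy = correct / len(true_labels)

    found = sum(1 for t, p in zip(true_labels, predictions)
                if t == 1 and p == 1)
    recall = found / sum(true_labels)

    print(f"accuracy = {accuracy:.2f}")  # 0.99
    print(f"recall   = {recall:.2f}")    # 0.00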
This thesis investigates the nature of imbalanced data sets and examines
two external methods that can increase a learner's performance on
under-represented classes. Both techniques artificially balance the
training data: one randomly re-samples examples of the under-represented
class and adds the copies to the training set; the other randomly removes
examples of the over-represented class from the training set. Tested on an
artificial domain of k-DNF expressions, both techniques are effective at
increasing predictive accuracy on the under-represented class.
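A minimal sketch of the two balancing techniques in plain Python (the
function names and the choice to balance the classes exactly 1:1 are
illustrative assumptions, not details taken from the thesis):

    import random

    def oversample_minority(examples, labels, minority, seed=0):
        """Randomly re-sample minority-class examples (with replacement)
        and add the copies until both classes are the same size."""
        rng = random.Random(seed)
        minority_idx = [i for i, y in enumerate(labels) if y == minority]
        majority_idx = [i for i, y in enumerate(labels) if y != minority]
        extra = [rng.choice(minority_idx)
                 for _ in range(len(majority_idx) - len(minority_idx))]
        keep = list(range(len(labels))) + extra
        return [examples[i] for i in keep], [labels[i] for i in keep]

    def undersample_majority(examples, labels, minority, seed=0):
        """Randomly remove majority-class examples from the training set
        until both classes are the same size."""
        rng = random.Random(seed)
        minority_idx = [i for i, y in enumerate(labels) if y == minority]
        majority_idx = [i for i, y in enumerate(labels) if y != minority]
        keep = minority_idx + rng.sample(majority_idx, len(minority_idx))
        return [examples[i] for i in keep], [labels[i] for i in keep]

Over-sampling discards nothing but repeats minority examples;
under-sampling discards majority examples and so shrinks the training set.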
A combination scheme is then presented that combines multiple classifiers
in an attempt to further increase the performance of standard classifiers
on imbalanced data sets. In this approach, multiple classifiers are
arranged in a hierarchical structure according to their sampling
technique. The architecture consists of two experts: one boosts
performance by combining classifiers that re-sample the training data at
different rates; the other combines classifiers that remove data from the
training set at different rates.
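One way such a hierarchy might be realized is sketched below. The
rate-based sampler, the majority vote inside each expert, and the
"either expert wins" rule at the top level are all assumptions made for
illustration; the abstract does not specify the actual combination rules:

    import random

    def oversample_at_rate(examples, labels, minority, rate, rng):
        """Re-sample the minority class until it reaches `rate` times the
        majority-class size (rate = 1.0 gives an exactly balanced set)."""
        minority_idx = [i for i, y in enumerate(labels) if y == minority]
        majority_idx = [i for i, y in enumerate(labels) if y != minority]
        target = int(rate * len(majority_idx))
        extra = [rng.choice(minority_idx)
                 for _ in range(max(0, target - len(minority_idx)))]
        keep = list(range(len(labels))) + extra
        return [examples[i] for i in keep], [labels[i] for i in keep]

    def train_expert(examples, labels, minority, rates, learn, sampler,
                     seed=0):
        """One 'expert': a set of classifiers, each trained on a copy of
        the data balanced by `sampler` at a different rate. `learn` is any
        function mapping (examples, labels) to a classifier callable."""
        rng = random.Random(seed)
        return [learn(*sampler(examples, labels, minority, r, rng))
                for r in rates]

    def hierarchy_predict(experts, x, positive, negative):
        """Top level of the hierarchy: predict positive if a majority of
        the members of either expert predicts positive (an assumed rule)."""
        for members in experts:
            votes = sum(1 for clf in members if clf(x) == positive)
            if votes > len(members) / 2:
                return positive
        return negative

An under-sampling expert would use an analogous rate-based sampler that
removes over-represented examples instead of duplicating
under-represented ones.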
The combination scheme is tested on the real-world application of text
classification, which is typically associated with severely imbalanced
data sets. Using the F-measure, which combines precision and recall into
a single performance statistic, the combination scheme is shown to be
effective at learning from severely imbalanced data sets. Indeed, when
compared with a state-of-the-art combination technique, Adaptive Boosting
(AdaBoost), the proposed system is shown to be superior for learning on
imbalanced data sets.
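For reference, the F-measure combines precision P and recall R as
follows (these are the standard definitions; the abstract does not state
which weighting beta the thesis adopts):

    % Standard F-measure definitions (LaTeX); P = precision, R = recall.
    F_{\beta} = \frac{(1 + \beta^{2})\,P\,R}{\beta^{2}\,P + R},
    \qquad
    F_{1} = \frac{2\,P\,R}{P + R}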