Author: Andrew Estabrooks
Degree: Master of Computer Science, Dalhousie University
Date: October 2000
Chairperson of the Supervisory Committee: Nathalie Japkowicz
Abstract:
This thesis explores inductive learning and its application to imbalanced
data sets. Imbalanced data sets occur in two-class domains when one class
contains a large number of examples while the other class contains only a
few. Learners presented with imbalanced data sets typically produce biased
classifiers that have high predictive accuracy on the over-represented class
but low predictive accuracy on the under-represented class. As a result, the
under-represented class can be largely ignored by an induced classifier.
This bias can be attributed to learning algorithms being designed to maximize
accuracy over a data set, under the assumption that an induced classifier will
encounter unseen data with the same class distribution as the training data.
Because accuracy can then be maximized by favouring the over-represented
class, the classifier's ability to recognize positive examples is limited.
This thesis investigates the nature of imbalanced data sets and examines two
external methods that can increase a learner's performance on under-represented
classes. Both techniques artificially balance the training data: one randomly
re-samples examples of the under-represented class and adds the copies to the
training set; the other randomly removes examples of the over-represented class
from the training set. Tested on an artificial domain of k-DNF expressions,
both techniques are effective at increasing predictive accuracy on the
under-represented class.
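
As an illustration only (a minimal sketch, not the exact procedure used in the
thesis), the two balancing techniques can be written as follows; the function
names and the choice of balancing to an exact 1:1 class ratio are assumptions
made for this example.

    import random

    def oversample_minority(examples, labels, minority_label, seed=0):
        """Balance a two-class training set by randomly re-sampling (with
        replacement) examples of the under-represented class and adding the
        copies until both classes are the same size."""
        rng = random.Random(seed)
        minority = [(x, y) for x, y in zip(examples, labels) if y == minority_label]
        majority = [(x, y) for x, y in zip(examples, labels) if y != minority_label]
        copies = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
        balanced = minority + majority + copies
        rng.shuffle(balanced)
        return [x for x, _ in balanced], [y for _, y in balanced]

    def undersample_majority(examples, labels, minority_label, seed=0):
        """Balance a two-class training set by randomly removing examples of
        the over-represented class until both classes are the same size."""
        rng = random.Random(seed)
        minority = [(x, y) for x, y in zip(examples, labels) if y == minority_label]
        majority = [(x, y) for x, y in zip(examples, labels) if y != minority_label]
        kept = rng.sample(majority, len(minority))
        balanced = minority + kept
        rng.shuffle(balanced)
        return [x for x, _ in balanced], [y for _, y in balanced]

For instance, oversample_minority(X, y, minority_label=1) returns a training
set in which positive examples have been duplicated at random until the two
classes are equal in size.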
A combination scheme is then presented that combines multiple classifiers
in an attempt to further increase the performance of standard classifiers
on imbalanced data sets. In this approach, multiple classifiers are arranged
in a hierarchical structure according to their sampling techniques. The
architecture consists of two experts: one boosts performance by combining
classifiers that re-sample the training data at different rates; the other
combines classifiers that remove data from the training set at different
rates.
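
The hierarchical arrangement can be pictured with a small sketch. The names
and the simple combination rules below are assumptions made for illustration;
the thesis defines its own rules for combining classifiers within and across
the experts.

    from typing import Callable, Sequence

    # A classifier is modelled here as any callable mapping an example to a
    # predicted label, with 1 denoting the under-represented (positive) class.
    Classifier = Callable[[object], int]

    def expert_predict(classifiers: Sequence[Classifier], example: object) -> int:
        """One expert combines the classifiers trained at its different
        sampling rates; this sketch uses a simple positive-biased rule
        (predict positive if any member does)."""
        return int(any(clf(example) == 1 for clf in classifiers))

    def hierarchy_predict(oversampling_expert: Sequence[Classifier],
                          undersampling_expert: Sequence[Classifier],
                          example: object) -> int:
        """Top level of the hierarchy: the over-sampling expert and the
        under-sampling expert each make a prediction, and the final output
        is positive if either expert predicts positive."""
        return int(expert_predict(oversampling_expert, example) == 1 or
                   expert_predict(undersampling_expert, example) == 1)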
The combination scheme is tested on the real-world application of text
classification, which is typically associated with severely imbalanced data
sets. Using the F-measure, which combines precision and recall into a single
performance statistic, the combination scheme is shown to be effective at
learning from severely imbalanced data sets. In fact, when compared to a
state-of-the-art combination technique, Adaptive-Boosting, the proposed
system is shown to be superior for learning on imbalanced data sets.
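
For reference, the F-measure used above is the standard combination of
precision P and recall R, where beta weights recall relative to precision
and beta = 1 gives the balanced F1 most commonly reported:

    P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}

    F_{\beta} = \frac{(1 + \beta^{2})\, P R}{\beta^{2} P + R}, \qquad
    F_{1} = \frac{2 P R}{P + R}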