Author: Andrew Estabrooks
Degree: Master of Computer Science, Dalhousie University
Date: October 2000
Chairperson of the Supervisory Committee: Nathalie Japkowicz
Abstract:
This thesis explores inductive learning and its application to imbalanced
data sets. Imbalanced data sets occur in two-class domains when one class
contains a large number of examples while the other contains only a few.
Learners presented with imbalanced data sets typically produce biased
classifiers that have high predictive accuracy on the over-represented
class but low predictive accuracy on the under-represented class. As a
result, the under-represented class can be largely ignored by an induced
classifier. This bias can be attributed to learning algorithms being
designed to maximize accuracy over a data set, on the assumption that an
induced classifier will encounter unseen data with the same class
distribution as the training data. When the positive examples belong to
the under-represented class, this limits the classifier's ability to
recognize them.
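As a plain-Python illustration of this accuracy trap (the 99:1 split
below is hypothetical, not a data set from the thesis), a degenerate
classifier that always predicts the majority class scores high on
accuracy while finding no positive examples:

    # Hypothetical 99:1 two-class data set.
    true_labels = [0] * 990 + [1] * 10

    # Degenerate classifier: always predict the over-represented class.
    predictions = [0] * len(true_labels)

    correct = sum(t == p for t, p in zip(true_labels, predictions))
    accuracy = correct / len(true_labels)

    found = sum(1 for t, p in zip(true_labels, predictions)
                if t == 1 and p == 1)
    recall = found / sum(true_labels)

    print(f"accuracy = {accuracy:.2f}")  # 0.99
    print(f"recall   = {recall:.2f}")    # 0.00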
This thesis investigates the nature of imbalanced data sets and examines
two external methods that can increase a learner's performance on
under-represented classes. Both techniques artificially balance the
training data: one randomly re-samples examples of the under-represented
class and adds the copies to the training set; the other randomly removes
examples of the over-represented class from the training set. Tested on an
artificial domain of k-DNF expressions, both techniques are effective at
increasing predictive accuracy on the under-represented class.
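A minimal sketch of the two balancing techniques in plain Python (the
function names and the choice to balance the classes exactly 1:1 are
illustrative assumptions, not details taken from the thesis):

    import random

    def oversample_minority(examples, labels, minority, seed=0):
        """Randomly re-sample minority-class examples (with replacement)
        and add the copies until both classes are the same size."""
        rng = random.Random(seed)
        minority_idx = [i for i, y in enumerate(labels) if y == minority]
        majority_idx = [i for i, y in enumerate(labels) if y != minority]
        extra = [rng.choice(minority_idx)
                 for _ in range(len(majority_idx) - len(minority_idx))]
        keep = list(range(len(labels))) + extra
        return [examples[i] for i in keep], [labels[i] for i in keep]

    def undersample_majority(examples, labels, minority, seed=0):
        """Randomly remove majority-class examples from the training set
        until both classes are the same size."""
        rng = random.Random(seed)
        minority_idx = [i for i, y in enumerate(labels) if y == minority]
        majority_idx = [i for i, y in enumerate(labels) if y != minority]
        keep = minority_idx + rng.sample(majority_idx, len(minority_idx))
        return [examples[i] for i in keep], [labels[i] for i in keep]

Over-sampling discards nothing but repeats minority examples;
under-sampling discards majority examples and so shrinks the training set.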
A combination scheme is then presented that combines multiple classifiers
in an attempt to further increase the performance of standard classifiers
on imbalanced data sets. In this approach, multiple classifiers are
arranged in a hierarchical structure according to their sampling
technique. The architecture consists of two experts: one boosts
performance by combining classifiers that re-sample the training data at
different rates; the other combines classifiers that remove data from the
training set at different rates.
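One way such a hierarchy might be realized is sketched below. The
rate-based sampler, the majority vote inside each expert, and the
"either expert wins" rule at the top level are all assumptions made for
illustration; the abstract does not specify the actual combination rules:

    import random

    def oversample_at_rate(examples, labels, minority, rate, rng):
        """Re-sample the minority class until it reaches `rate` times the
        majority-class size (rate = 1.0 gives an exactly balanced set)."""
        minority_idx = [i for i, y in enumerate(labels) if y == minority]
        majority_idx = [i for i, y in enumerate(labels) if y != minority]
        target = int(rate * len(majority_idx))
        extra = [rng.choice(minority_idx)
                 for _ in range(max(0, target - len(minority_idx)))]
        keep = list(range(len(labels))) + extra
        return [examples[i] for i in keep], [labels[i] for i in keep]

    def train_expert(examples, labels, minority, rates, learn, sampler,
                     seed=0):
        """One 'expert': a set of classifiers, each trained on a copy of
        the data balanced by `sampler` at a different rate. `learn` is any
        function mapping (examples, labels) to a classifier callable."""
        rng = random.Random(seed)
        return [learn(*sampler(examples, labels, minority, r, rng))
                for r in rates]

    def hierarchy_predict(experts, x, positive, negative):
        """Top level of the hierarchy: predict positive if a majority of
        the members of either expert predicts positive (an assumed rule)."""
        for members in experts:
            votes = sum(1 for clf in members if clf(x) == positive)
            if votes > len(members) / 2:
                return positive
        return negative

An under-sampling expert would use an analogous rate-based sampler that
removes over-represented examples instead of duplicating
under-represented ones.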
The combination scheme is tested on the real-world application of text
classification, which is typically associated with severely imbalanced
data sets. Using the F-measure, which combines precision and recall into
a single performance statistic, the combination scheme is shown to be
effective at learning from severely imbalanced data sets. Indeed, when
compared with a state-of-the-art combination technique, Adaptive Boosting
(AdaBoost), the proposed system is shown to be superior for learning on
imbalanced data sets.
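For reference, the F-measure combines precision P and recall R as
follows (these are the standard definitions; the abstract does not state
which weighting beta the thesis adopts):

    % Standard F-measure definitions (LaTeX); P = precision, R = recall.
    F_{\beta} = \frac{(1 + \beta^{2})\,P\,R}{\beta^{2}\,P + R},
    \qquad
    F_{1} = \frac{2\,P\,R}{P + R}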