Authors: Nathalie Japkowicz and Robert Holte
Workshop Report:
The AAAI-2000 workshop on learning from imbalanced data sets
provided a venue for researchers to discuss fundamental
questions pertaining to machine learning and to challenge some
of the field's institutional practices.
Several observations were made and certain issues were explored
in particular depth. First, it was observed that a large number
of applications suffer from the class imbalance problem. A
distinction was nonetheless drawn between the small-sample problem and
the class imbalance problem, and it was remarked that although smart
sampling can sometimes help, it is not always possible.
Among the issues that received a lot of attention was the problem
of evaluating learning algorithms in the case of class imbalances.
It was emphasized that the use of common evaluation measures can
yield misleading conclusions. More informative measures include ROC
curves and cost curves. An evaluation measure was also proposed for
the case where only data from one class is available. The other issues
concerned the design of learning algorithms. It was shown that
concept-learning methods can use a one-sided approach focusing on
either the majority or the minority class. If both classes are used,
however, avoiding fragmentation in the minority class is useful. Another
important issue concerned the close connection between the class imbalance
problem and cost-sensitive learning. Finally, the goal of creating a
classifier that performs well across a range of costs/priors was declared
to be an important one.
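
The point about misleading evaluation measures can be illustrated with a
minimal sketch. It assumes a hypothetical data set of 5 positive and 95
negative examples and a trivial classifier that always predicts the majority
class; the function names and the data are illustrative assumptions, not
material from the workshop. Accuracy looks high, while the true positive
rate, one of the two quantities plotted on an ROC curve, shows that no
minority example is ever detected.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def summarize(y_true, y_pred, positive=1):
    """Return accuracy together with the two ROC coordinates (TPR, FPR)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / len(y_true)
    # Guard against empty classes so the ratios are always defined.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"accuracy": accuracy, "tpr": tpr, "fpr": fpr}


# Hypothetical imbalanced data: 5 minority (1) and 95 majority (0) examples.
y_true = [1] * 5 + [0] * 95
# A trivial classifier that always predicts the majority class.
always_majority = [0] * 100

result = summarize(y_true, always_majority)
print(result)
```

Here the trivial classifier is correct on 95 of the 100 examples, yet its
true positive rate is zero, which is the kind of discrepancy that a single
accuracy figure hides.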