Choosing a marginal class distribution for classifier induction
Foster Provost
New York University
(work done with Gary Weiss of AT&T)
Practitioners often face the question of choosing the marginal class
distribution with which to learn. This is especially so when the class
distribution is unbalanced, in which case practitioners often learn models
using a larger percentage of the minority class (a practical rule of thumb is
to train with a balanced distribution). I will talk about various aspects and
intricacies of this problem and present the results of an empirical study
examining the relationship between class distribution and generalization
performance. I will also present a new "budget-sensitive" progressive sampling
algorithm that selects a class distribution while staying within a
predetermined budget for procuring/preprocessing data.
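
To make the notion of "choosing a class distribution" concrete, the following
is a minimal sketch of drawing a fixed-size training set with a chosen
minority-class fraction. It is not the budget-sensitive algorithm from the
talk; the function name, its parameters, and the use of simple random
subsampling are all assumptions made for illustration.

```python
import random

def sample_with_class_distribution(examples, labels, minority_label,
                                   minority_fraction, n, seed=0):
    """Draw n examples so that round(n * minority_fraction) of them
    carry minority_label (hypothetical helper, illustration only)."""
    rng = random.Random(seed)
    # Split example indices by class.
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    n_min = round(n * minority_fraction)
    n_maj = n - n_min
    if n_min > len(minority) or n_maj > len(majority):
        raise ValueError("not enough examples of one class for this distribution")
    # Sample without replacement from each class, then shuffle the result.
    chosen = rng.sample(minority, n_min) + rng.sample(majority, n_maj)
    rng.shuffle(chosen)
    return [examples[i] for i in chosen], [labels[i] for i in chosen]
```

With 100 minority and 900 majority examples, for instance, calling the
function with minority_fraction=0.5 and n=100 would yield the balanced
training set that the rule of thumb recommends.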