Date: Monday, July 31, 2000
Location: Austin, Texas
Organizers: Rob Holte, University of Ottawa (holte@site.uottawa.ca), Nathalie Japkowicz, Dalhousie University (nat@cs.dal.ca), Charles Ling, University of Western Ontario (ling@csd.uwo.ca) and Stan Matwin, University of Ottawa (stan@site.uottawa.ca)
Workshop Description:
As the field of machine learning makes a rapid transition from the status
of "academic discipline" to that of "applied science", a myriad of new
issues, not previously considered by the machine learning research
community, are now coming to light. One such issue is the problem of
imbalanced data sets. Indeed, the majority of learning systems previously
designed and tested on toy problems or carefully crafted benchmark data
sets assume that their training sets are well balanced. In the
case of concept-learning, for example, classifiers typically expect
their training set to contain as many examples of the positive class as
of the negative class.
Unfortunately, this balance assumption is often violated in real-world
settings. Indeed, there exist many domains in which one class is better
represented than the other. This is the case, for example, in
fault-monitoring tasks, where non-faulty examples are plentiful since they
typically involve recording from the machine during normal operation,
whereas faulty examples involve recording from a malfunctioning machine,
which is not always possible, easy, or financially worthwhile. More
generally, the problem of imbalanced data sets occurs any time one class
represents a circumscribed concept, while the other represents the
counterpart of that concept. The imbalanced data set problem can thus
take two distinct forms: either the counterpart class is under-sampled
relative to the concept class (as in the above example) or it is
over-sampled but particularly sparse (e.g., it includes the profiles of
a large number of patients who do not have lung cancer).
Although the imbalanced data set problem is starting to attract researchers'
attention, attempts at tackling it have remained isolated. It is our
belief that much progress could be achieved through a concerted effort and
greater interaction among researchers interested in this issue. The
purpose of our workshop is to provide a forum to foster such interaction
and to identify future research directions.
To date, we have identified four categories of methods capable of
tackling the imbalanced data set problem in concept-learning tasks:
Proposed Format: The workshop will consist of four panels corresponding to the categories identified above. A fifth panel will be created for papers falling in categories that we did not anticipate. [Please note that this structure may be revisited once contributions have been received.] Each panel will consist of a short introduction by an invited discussant, a series of paper presentations, and a discussion also led by the discussant. The workshop will conclude with a general panel discussion during which four distinguished guests will comment on the presentations of the day, discuss future directions, and open the floor for general discussion.
Proposed Length: One day, during which each panel will be allocated 1 to 2 hours, depending on the number of contributions and the expected length of the discussion session.
Submissions:
Authors are invited to submit papers on the topics outlined above or
on other related issues. Submissions should be 6 pages long and should
follow the AAAI style sheet. Electronic submissions, in PostScript format,
are preferred and should be sent to Nathalie Japkowicz at nat@cs.dal.ca.
If electronic submission is inconvenient, please send four hard copies of
your submission to:
Timetable: