Usenet.com


Comp Thread Archive from Usenet.com


Article: Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction



JAIR is pleased to announce the publication of the following article:

Weiss, G.M. and Provost, F. (2003)
  "Learning When Training Data are Costly: The Effect of Class Distribution
   on Tree Induction",
   Volume 19, pages 315-354.

   For quick access via your WWW browser, use this URL:
     http://www.jair.org/abstracts/weiss03a.html

Abstract:
For large, real-world inductive learning problems, the number of
training examples often must be limited due to the costs associated
with procuring, preparing, and storing the training examples and/or
the computational costs associated with learning from them. In such
circumstances, one question of practical importance is: if only n
training examples can be selected, in what proportion should the
classes be represented?  In this article we help to answer this
question by analyzing, for a fixed training-set size, the relationship
between the class distribution of the training data and the
performance of classification trees induced from these data. We study
twenty-six data sets and, for each, determine the best class
distribution for learning.  The naturally occurring class distribution
is shown to generally perform well when classifier performance is
evaluated using undifferentiated error rate (0/1 loss).  However, when
the area under the ROC curve is used to evaluate classifier
performance, a balanced distribution is shown to perform well.  Since
neither of these choices for class distribution always generates the
best-performing classifier, we introduce a budget-sensitive
progressive sampling algorithm for selecting training examples based
on the class associated with each example.  An empirical analysis of
this algorithm shows that the class distribution of the resulting
training set yields classifiers with good (nearly-optimal)
classification performance. 
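
To make the question in the abstract concrete, here is a minimal sketch of how one might draw a fixed-size training set of n examples at a chosen class distribution. It is not the article's budget-sensitive progressive sampling algorithm. The helper name `sample_with_class_fraction`, the toy data pool, and the binary 0/1 label encoding are all assumptions made for illustration.

```python
import random
from collections import Counter


def sample_with_class_fraction(examples, n, positive_fraction, seed=0):
    """Draw n (features, label) pairs so that about positive_fraction have label 1.

    Hypothetical helper for illustration only; labels are assumed to be 0 or 1.
    """
    rng = random.Random(seed)
    positives = [ex for ex in examples if ex[1] == 1]
    negatives = [ex for ex in examples if ex[1] == 0]

    # Split the fixed budget n between the two classes.
    n_pos = round(n * positive_fraction)
    n_neg = n - n_pos
    if n_pos > len(positives) or n_neg > len(negatives):
        raise ValueError("not enough examples of one class for this distribution")

    # Sample without replacement within each class, then mix the order.
    chosen = rng.sample(positives, n_pos) + rng.sample(negatives, n_neg)
    rng.shuffle(chosen)
    return chosen


# Toy pool (assumed): 1000 examples, 10% of them positive.
pool = [((i,), 1 if i % 10 == 0 else 0) for i in range(1000)]

# Same budget n, two candidate class distributions: natural-like and balanced.
for fraction in (0.1, 0.5):
    train = sample_with_class_fraction(pool, n=100, positive_fraction=fraction)
    print(fraction, Counter(label for _, label in train))
```

With a sampler like this, one could train a classifier on each candidate distribution at the same n and compare error rate or area under the ROC curve on held-out data drawn at the natural distribution.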

The article is available via:
   
 -- comp.ai.jair.papers (also see comp.ai.jair.announce)

 -- World Wide Web: The URL for our World Wide Web server is
       http://www.jair.org/
    For direct access to this article and related files try:
       http://www.jair.org/abstracts/weiss03a.html

 -- Anonymous FTP from Carnegie Mellon University (USA):
        ftp://ftp.cs.cmu.edu/project/jair/volume19/weiss03a.ps
    The compressed PostScript file is named weiss03a.ps.Z 

For more information about JAIR, visit our WWW or FTP sites, or
contact [EMAIL PROTECTED]



-- 
Steven Minton
JAIR Managing Editor


