Benchmarking of learning algorithms

information repository page

Abstract: Proper benchmarking of (neural network and other) learning architectures is a prerequisite for orderly progress in this field. Many published papers show deficiencies in the benchmarking they perform.
A workshop about NN benchmarking at NIPS*95 addressed the status quo of benchmarking, common errors and how to avoid them, currently existing benchmark collections, and, most prominently, a new benchmarking facility including a results database.
This page contains pointers to written versions or slides of most of the talks given at the workshop, plus some related material. The page is intended as a repository of such information, to be used as a reference by researchers in the field. Note that most links lead to PostScript documents. Please send any additions or corrections you might have to Lutz Prechelt (prechelt@ira.uka.de).

What's new?

1996-01-05: added David Rosen's Data Sources page in 'Other sources of data' section.
1995-12-18: added ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/data/ in 'Other sources of data' section.
1995-12-12: added paper by Zhu/Rohwer in 'Related Information' section.

Workshop Chairs:

Thomas G. Dietterich <tgd@chert.cs.orst.edu>,
Geoffrey Hinton <hinton@cs.toronto.edu>,
Wolfgang Maass <maass@igi.tu-graz.ac.at>,
Lutz Prechelt <prechelt@ira.uka.de> [communicating chair],
Terry Sejnowski <terry@salk.edu>

Assessment of the status quo:

Lutz Prechelt. A quantitative study of current benchmarking practices.
A quantitative survey of 400 journal articles on NN algorithms from 1993 and 1994. Most articles used far too few problems during benchmarking.
Arthur Flexer. Statistical Evaluation of Neural Network Experiments: Minimum Requirements and Current Practice. Argues that published papers report too little about which benchmarks were run and how they were run.

Methodology:

Tom Dietterich. Experimental Methodology
Benchmarking types, correct statistical testing, synthetic versus real-world data, understanding via algorithm mutation or data mutation, data generators.
Lutz Prechelt. Some notes on neural learning algorithm benchmarking.
A few very basic general remarks about volume, validity, reproducibility, and comparability of benchmarking; DOs and DON'Ts.
Brian Ripley. What can we learn from the study of the design of experiments?
(Only two slides, though.)
Brian Ripley. Statistical Ideas for Selecting Network Architectures.
(Also somewhat related to benchmarking.)
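
As a minimal illustration of the kind of statistical testing these talks call for, the sketch below compares two learning algorithms across several benchmark problems with an exact two-sided sign test. It is not taken from any of the talks listed above; the function name sign_test_p_value and the error rates in the usage note are hypothetical and chosen only for illustration. Each algorithm is assumed to have been evaluated on the same problems with the same data splits, so that the error rates are paired.

```python
from math import comb


def sign_test_p_value(errors_a, errors_b):
    """Exact two-sided sign test on paired error rates.

    errors_a[i] and errors_b[i] are the test errors of algorithms A and B
    on benchmark problem i. Ties are dropped, as is usual for a sign test.
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("need exactly one error per problem for each algorithm")
    wins_a = sum(a < b for a, b in zip(errors_a, errors_b))
    wins_b = sum(b < a for a, b in zip(errors_a, errors_b))
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # only ties: no evidence of a difference
    k = min(wins_a, wins_b)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, with hypothetical error rates on eight problems where A beats B every time, sign_test_p_value returns 2/256 (about 0.008), whereas with only three problems the smallest possible two-sided p-value is 0.25. This is one way to see why using far too few problems makes a comparison inconclusive.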

Benchmarking facilities:

Previously available NN benchmarking data collections
Advantages of these: UCI is large, growing, and popular; Statlog has the largest and most orderly collection of results available (in a book, though); and Proben1 is the easiest to use and best supports reproducible experiments. Elena and nnbench have no particular advantages.
Disadvantages: UCI and Proben1 have too few and too unstructured results available; Proben1 is also inflexible and small; Statlog is partially confidential, and neither its data collection nor its results collection is growing.
Carl Rasmussen and Geoffrey Hinton. DELVE: A thoroughly designed benchmark collection
A proposal of data, terminology, and procedures and a facility for the collection of benchmarking results.
This is the newly proposed standard for benchmarking NN (and other) learning algorithms. DELVE is currently available as an Alpha release developed at the University of Toronto.

Other sources of data:

See the appropriate section in Part 4 of the comp.ai.neural-nets FAQ
(Thanks to Nici Schraudolph <schraudo@salk.edu>)
There is a large amount of game data about the board game Go available on the net. Good starting points are the Go game database project and the Go game server. The database holds several hundred thousand games of Go and could for instance be used for advanced reinforcement learning projects.
(Thanks to Matthias Blume <mablume@sdcc10.ucsd.edu>)
Tony Robinson has made some speech data available via ftp: ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/data/. In particular, the Peterson-Barney vowel data (file PetersonBarney.tar.Z) seems useful for NN benchmarking. The data set is well documented in journals and has been fairly widely used.
(Thanks to David Rosen <rosen@unr.edu>)
David Rosen maintains a Data Sources page, which is a part of the WWW Virtual Library: Statistics.
(Thanks to Partha Mitra <pmitra@bell-labs.com>)
Los Alamos Systems Neuroscience Archive.
(Thanks to Wouter.Favoreel@esat.kuleuven.ac.be)
DAISY: A platform for the interchange of information and data in the field of System Identification.

Other related information:

Huaiyu Zhu and Richard Rohwer. Bayesian regression filters and the issue of priors. Discusses:
Test of priors versus test of learning algorithms.
Is there a program that will do best on average on the set of benchmarking data sets freely retrievable over the Internet (the Internet Game)?
Is there a theoretically optimal algorithm for an arbitrary given prior?
A prior for the Internet Game, regardless of the problem domain.
Can the motivation of benchmarking be exploited in designing an algorithm?

This page has received a Key Resource award in the Neural Networks topic.

Last modified: Wed Mar 29 18:09:30 MET DST 2000