Bob's Favourite Papers
On this page I have collected my favourite papers and given an explanation of
their key ideas and significance. I have arranged the papers under the headings
of Machine Learning,
Signal Processing and
Miscellaneous.
Machine Learning
My work in machine learning has covered
learning theory,
online learning algorithms and
kernel machines.
Learning Theory
Peter L. Bartlett,
Philip M. Long and Robert C. Williamson,
``Fat-Shattering
and the Learnability of Real-Valued Functions''
Journal of Computer and System Sciences, 52(3),
434-452, (1996).
This paper provided the first necessary and sufficient conditions
for learnability of real valued functions. Previously it was known
that finiteness of the Pollard Pseudo-Dimension was sufficient, but
it was also known not to be necessary. They key idea of the proof
of the main result was inspired by ideas of dithering in
analog-to-digital converters used in audio. Fat-shattering was
introduced in the machine learning community by Alon et al, but
actually show up in work by Tihomirov in the early sixties. A
distribution dependent version was also used by Talagrand around
the time the present paper was written.
Wee Sun Lee, Peter L. Bartlett and Robert C. Williamson,
``Efficient Agnostic Learning of Neural Networks with Bounded Fan-in''
IEEE Transactions on Information Theory, 42(6),
2118-2132, (1996).
This was the first (I believe) paper to give a COLT-like proof of
the efficient learnability of a certain class of neural networks.
As with most COLT-style proofs, the constants are ridiculous and
useless...
Wee Sun Lee, Peter L. Bartlett and Robert C. Williamson,
``The
Importance of Convexity in Learning with Squared Loss''
IEEE Transactions on Information Theory 44(5),
1974-1980, 1998.
The normal bounds of learning theory provide for a convergence
rate dominated by 1/sqrt{n} in the agnostic setting (wheras one
can obtain 1/n in the realisable case). Somewhat amazingly,
when using squared loss for regression, if the class of functions
is convex, then one can obtain a faster rate. Yet another example
of the general rule: geometry matters! The following paper extends
the idea in a more esoteric manner and more importantly, points
out that the lower bound claimed in the present paper is false.
Shahar Mendelson and Robert C. Williamson, Agnostic Learning Nonconvex Function Classes, pages
1-13, in J. Kivinen and R.H. Sloan (Eds), Computational Learning Theory
15th Annual Conference on Computational Learning Theory, COLT 2002,
Sydney, Australia, July 8-10, 2002. Proceedings Lecture Notes in
Artificial Intelligence
2375.
This builds on the convexity result of the previous paper and shows
that what really matters is how far (in a certain special sense) ones
target function is to the set of nonuniqueness points of the function
class. Convex function classes have no non-uniqueness points (every
function has a unique best approximation in the class). It also shows
that the sort of lower bound claimed in the previous paper can not be
true.
Robert C. Williamson, John Shawe-Taylor, Bernhard
Schölkopf and Alex J. Smola, Sample Based Generalization Bounds, accepted subject
to revision to IEEE Transactions on Information Theory, November
1999.
Shows how empirical covering numbers can be used to bound the
performance of learning machines. The significan is twofold: the
empirical covering numbers are easier to compute, and they can reflect
the complexity of the task actually being considered (rather than an a
priori bound which is inevitably worst case).
This paper (which was accepted subject to changes) will probably never
be formally published. The changes requested would have been a lot of
work, but none of the authors felt they would actually add much value.
A shorter conference version was published (see P112 in publications
list).
Erik Weyer, Iven M.Y. Mareels, and Robert C. Williamson,
``Sample Complexity of Stochastic Least Squares System Identification'',
IEEE Transactions on Automatic Control, 44(7), 1370-1383 (1999)
This was the first work to study the finite sample complexity of
linear system identification. Historically only asymptotic results
were obtained.
(The paper was submitted in 1995 and actually based in part on
results in a CDC paper from 1992. Thus from time of doing the work
(1992) to publication was 7 years!)
John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson and Martin
Anthony, ``Structural Risk Minimization over Data-Dependent Hierarchies'',
IEEE Transactions on Information Theory, 44(5),
1926-1940 (1998).
This paper (which to my surprise is one of my most highly cited)
was the first COLT like analysis of the performance of the maximum
margin algorithm. The reason for the difficulty in analysing such
algorithms is that the usual machinery presumes there is a
fixed class of functions from which the algorithm draws an
hypothesis. In the MM case (and others), the function class
actually depends on the data. Now you simply can not sweep this
under the carpet: you have to use the information in the data only
once. The paper introduced the notion of luckiness which
is a way of thinking about how good your sample is. The following
paper provides a more powerful and general framework.
Ralf Herbrich and Robert C. Williamson, Algorithmic Luckiness, Journal
of Machine Learning Research 3, 175-212 (2002).
The traditional methods of statistical learning theory analyse learning
algorithms only in terms of the class of functions from which the
algorithm draws its hypotheses. The luckiness framework (previous paper)
introduced a method by which performance on the sample received could
be captured, but it still essentially used the device of considering the
class of functions from which the (data-dependent) hypotheses were
drawn. In contrast, the present paper, for the first time, allowed the use of
the traditional machinery of statistical learning theory in a manner that
allowed the use of properties of the algorithm beyond merely the class of
functions from which the hypothesis is drawn. The paper formally subsumes the
previous luckiness framework, and in addition allows the analysis of sparsity
and algorithmic stability. Finally, the new machinery is applied to the
analysis of the maximum margin algorithm to obtaion a new performance bound
which interestingly combines three key factors: margin, sparsity, and degree of
clustering of the data.
Kim L. Blackmore, Robert C. Williamson and Iven M.Y. M
areels, ``Decision Region
Approximation'',
IEEE Transactions on Information Theory, 43(3),
903-907, 1997.
Most work in Learning Theory focusses on the sample error in the
learning process. One also has to deal with the approximation
error. The present paper is one of the very few that deals
with the approximation by the levels sets of a function, which is
what is usually done for classification learning.
Online Learning Algortithms
Online learning algorithms are perhaps the most widely used since they
can deal with arbitrarily large amounts of data and can be used in real
time applications. The first three papers below focus on the use for
nonlinearly paramterized classes of functions. The last two papers
focus on the geometry of online learning.
Kim L. Blackmore, Iven M.Y. Mareels, and Robert C.
Williamson, ``Learning Nonlinearly Parametrized Decision Regions''
Summary in Journal of Mathematical Systems,
Estimation and Control , 6(1), 129-132 (1996). Full
version published electronically.
This paper presents a careful mathematical analysis of the
behaviour of stochastic gradient descent for classification
learning when the function class is nonlinearly parametrised.
This is a good example of the problems of picking the right
journal! The paper was submitted in 1993 but was not published
until 1996. And the journal became defunct in 1998...
Kim L. Blackmore, Robert C. Williamson and
Iven M.Y. Mareels, ``Local Minima and
Attractors at Infinity of Gradient Descent Learning
Algorithms,''
Summary in Journal of Mathematical Systems, Estimation and
Control, 6(2), pages 231-234, (1996). Full version was
published electronically.
Many things can go wrong with online learning when the functions
are nonlinearly parametrised. The algorithms can get "stuck" in
local minima. But they can also fail even when they are not stuck
by heading towards an attractor at infinity.
Kim L. Blackmore, Robert C. Williamson, Iven M.Y. Mareels,
William A. Sethares,
``On-line Learning via Congregational Gradient Descent'',
Mathematics of Control, Signals and Systems,
10(4), 331-363, 1997.
Genetic Algorithms were often held up to be superior to gradient
descent because they did not get stuck in a local minima. This
comparison always struck me as unfair as the gradient descent
algorithms only maintain a single hypothesis. A fairer comparison
is with a variant of gradient descent which maintains several
candidate hypotheses and regularly culls the weak ones. The
intricacy of the analysis of the present paper arises from the use
of stochastic gradient descent algorithms and the
subtelty of figuring out when to cull.
A followup paper was prepared but never
published. One result in it that might be of broader interest is
in section 7 and concerns the density of learning problems.
Robert E. Mahony and Robert C. Williamson, "Prior
Knowledge and Preferential Structures in Learning
Algorithms,"
Journal of Machine Learning Research, 1, 311-355, 2001. (see
the final version on the JMLR web page
Studies the geometry of prior knowledge for online learning
algorithms. Shows how one can define a prior for stochastic
gradient descent. One can derive the Exponentiated Gradient
descent algorithm from this framework. The framework is being
extended; see P174 in publication list.
Kernel Machines
Bernhard Schölkopf, Alex Smola, Robert Williamson and
Peter Bartlett, ``New Support Vector Algorithms,''
Neural Computation 12(5), 1207-1245, May 2000.
This paper introduced the nu-trick (hence "new"; geddit?) which is
a nicer parameterization to use for SVMs. The nu parametrisation is
now widely used, and this paper highly cited (117 according to
Google scholar in March 2005)
Bernhard Schölkopf, John C. Platt, John Shawe-Taylor, Robert
C. Williamson and Alex J.Smola, Estimating the Support of a High-Dimensional Distribution,,
Microsoft technical report MSR-TR-99-87. Slightly abridged version in
Neural Computation 13(7), 1443-1471, 2001.
Showed how to use kernel machines for novelty detection or one-class
classification. Previous algorithms in the statistical literature are
all limited to low dimensional input spaces. By utilising the nu-trick
(see previous paper) a very nice knob to control the percentage of
outliers is provided.
Robert C. Williamson, Alex Smola and Bernhard
Schölkpof, ``Generalization Performance
of Regularization Networks and Support Vector Machines via
Entropy Numbers of Compact
Operators,''
IEEE Transactions on Information Theory,
47(6), 2516-2532, 2001.
This paper showed (in a COLT setting) how the kernel of a SVM
affects its generalisation performance. The paper used the
machinery of entropy numbers of linear operators. A follow-up
paper provides a more explicit calculation of the bounds
which is easier to interpret.
Signal Processing
Brian C. Lovell and Robert C. Williamson, The
Statistical Performance of Some Instantaneous Frequency
Estimators IEEE Transactions on Signal Processing
40, pages 1708-1723,
(July 1992).
Brian C. Lovell, Robert C. Williamson and Boaulem Boashash,
The
Relationship Between Instantaneous Frequency and Time
Frequency Representations IEEE Transactions on
Signal Processing , 41, pages 1458-1461 (1993).
These two papers were written whilst I was a student along with Brian
Lovell who was also a student at the time. They show that the complex
and intricate schemes for estimating instantaneous frequency using the
Wigner-Ville distribution are, when done properly, nothing more than
filtered finite differences of the phase of the analytic signal.
"Doing it properly" simply means using the right formula for the
moments of a random variable defined on a circle.
Darren Ward,
Rodney A. Kennedy and Robert C. Williamson,
"The Theory of Broadband Sensor Arrays with Frequency Invariant Far-Field Beam Patterns"
Journal of the Acoustical Society of America 97(2),
1023-1034, (February 1995)
Shows how to design a beamformer (e.g. a microphone array) with a
beampattern that is invariant with frequency (over wide frequency
ranges).
Simon I. Hill and Robert C. Williamson, Convergence of Exponentiated Gradient Algorithms,IEEE
Transactions on Signal Procesing, 49(6), 1208-1215, June 2001.
This paper analysed the Exponentiated Gradient Descent Algorithm,
familiar in the Machine Learning community, using the standard
machinery of signal processing.
Paul D. Teal, Robert C. Williamson and Rodney A. Kennedy,
"Error Performance of a Channel of Known Impulse
Response",
Acoustics, Speech, and Signal Processing, 2000. ICASSP '00.
Proceedings. 2000 IEEE International Conference on ,
Volume: 5 , 2000 Page(s): 2733 -2736.
Classical text books on communication theory seperately consider
the error performance of a channel and its equalisation. It turns
out for a fixed Signal to Noise Ratio, some channels are much
harder than others. This has significant implications for mobile
communications when the channel is rapidly varying.
Biljana Radlovic, Robert C. Williamson and
Rodney A. Kennedy, "Equalization in an Acoustic
Reverberant Environment: Robustness
Results,"
IEEE Transactions on Speech and Audio Processing,
8(3), 311-319, May 2000.
Reverberation is a standard problem in audio transduction. (Humans
remove reverberation remarkably well; See why. The present
paper shows that linear equalisation is intrinsically
non-robust. Even supposing you perfectly removed the reverberation by
equalisation, if the source just moves a few centimetres, you will be
worse off than if you had not equalised at all. Interestingly, the
cocktail party effect (more specifically demixing convolutive
mixtures) is a harder problem than equalisation, so it too will be
very non-robust, but that does not stop lots of people writing papers
anbd doing experiments on simulated data!
Richard K. Martin, William A. Sethares, and Robert C. Williamson,
"Exploiting
Sparsity in Adaptive Filters," IEEE transactions on Signal
Processing, 50(8), 1883-1893, August 2002.
By exploting and simplifying work with Mahony on prior knowledge in
online learning we showed how to tune adaptive filtering algorithms
to exploit sparsity of the target in a very simple but effective way.
Darren B. Ward, Eric A. Lehmann and Robert C.
Williamson,
Particle
Filtering Algorithms for Tracking an Acoustic Source in a Reverberant
Environment,
IEEE Transactions on Speech
and Audio Processing, 11(6), 826-836, November 2003.
Shows how to use particle filtering combined with beamforming to track an
acoustic source in a real reverberant room. As far as I am aware this was
the first published work deminstrating reliable tracking in a typical
reverberant office.
Miscellaneous
Robert C. Williamson and Tom Downs, ``Probabilistic
Arithmetic: Numerical Methods for Calculating Convolutions
and Dependency Bounds,'' International
Journal of Approximate Reasoning, 4(2), pages
89-158 (1990).
The main contribution of my PhD thesis and my only paper that
filled an entire issue of a journal!
It presented numerical algorithms for computing bounds on the
probability distribuition of elementary functions of random
variables when only their marginal distributions are known.
Algorithms developed are now available in commercial software from
RAMAS Software. (See
Risk Calc, Probability Bounds Analysis) and seem widely cited.
I do not have an electronic version of the paper, but you can
look at my
thesis (without pictures).
(Here is a scanned copy of my thesis with pictures (12M))
Note that I never published part
2 of the paper (bad idea to call a paper part 1 without a definite
plan to make part 2!). Chapter 4 of my thesis (connection with
other ideas) was what I had in mind for part 2...
Robert C. Williamson, ``The Law of Large Numbers for Fuzzy
Variables under a General Triangular Norm Extension
Principle,'' Fuzzy Sets and Systems, 41,
55-81, (1991).
A paper arising from my PhD thesis. Apart from the technical
contribution (which incidentally contains an error), there is a
remark, which I believe was novel at the time, explaining how the
calculus of operations on fuzzy numbers can strictly be
interpreted in the theory of probability. The key is that one
needs to consider the marginals only, and compute a resultant
distribution which is the worst case over all possible joint
distributions. It is very often said that fuzzy numbers are
different to random variables because the manner by which they
are combined is so different. The truth of the matter is that one
can fully recover fuzzy sets from probability (!). A (very long!)
argument along these lines comprises section 4.5 of my PhD thesis.
I do not have an electronic version of the paper, but you can
look at my
thesis (without pictures) which contains everything in the
paper.
Here is a scanned copy of my thesis with pictures (12M)