CS667
Weight Space and the PLA Criterion
- Where perceptron weight vector W lives
- Pattern vector X lives in an N-dimensional space
- Abstraction of real-world objects
- Every point in the space represents a real-world object
- Up to now the perceptron weight vector W has lived in the same space as X
- W does not represent a real-world object!
- Really already required N+1 free parameters (the threshold adds one)
- More advanced networks will have no connection with X space at all
- Actually all dot products link two different spaces
- From now on think of W in a different space
- W space
- W lives in weight space
- Every point in W space represents a classifier
- A classifier is a dichotomization of X space
- A perceptron classifier is a hyperplane
- All classifiers that satisfy the training set form a region in W space
- For the perceptron, each pattern X defines a hyperplane (W·X = 0) in W space
- Converse of each weight vector W defining a hyperplane in X space (see the sketch below)
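A minimal sketch of this duality, assuming augmented (N+1)-dimensional vectors (the numbers and names below are illustrative, not from the lecture): the single dot product W·X = 0 describes a hyperplane in X space when W is held fixed, and a hyperplane in W space when the pattern X is held fixed.

    import numpy as np

    # Augmented 2-D patterns: x = (x1, x2, 1), so W = (w1, w2, w0) has N+1 = 3 entries.
    X = np.array([[ 1.0,  2.0, 1.0],    # a class +1 pattern
                  [-1.5,  0.5, 1.0]])   # a class -1 pattern
    t = np.array([+1, -1])              # target signs

    W = np.array([0.8, -0.3, 0.1])      # one point in weight space = one classifier

    # Fixed W: {X : W.X = 0} is a hyperplane in X space (the decision boundary).
    print("decisions:", np.sign(X @ W))

    # Fixed X: {W : W.X = 0} is a hyperplane through the origin of W space;
    # classifying X correctly restricts W to one side of that hyperplane.
    for x, sign in zip(X, t):
        print("pattern", x[:2], "classified correctly:", sign * (W @ x) > 0)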
- Agmon's relaxation algorithm
- PLA moves in the correct direction but by a fixed amount
- PLA with a decreasing step size always converges, even in the nonseparable case
- Distance of W from the hyperplane defined by a pattern
- Formula for a correction of controlled size (see the sketch below)
- Converges for 0 < rho < 2
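A minimal sketch of a single-sample relaxation rule of this kind, assuming augmented, sign-normalized patterns (the margin b, the step size rho, and the toy data are my choices): each correction moves W a fraction rho of its distance to the offending pattern's hyperplane.

    import numpy as np

    def relaxation_train(X, rho=1.5, b=0.1, max_epochs=1000):
        # X holds augmented, sign-normalized patterns: a solution satisfies W.x > b for every x.
        W = np.zeros(X.shape[1])
        for _ in range(max_epochs):
            corrections = 0
            for x in X:
                if W @ x <= b:                            # x violates the margin
                    W += rho * (b - W @ x) / (x @ x) * x  # controlled-size step toward x's hyperplane
                    corrections += 1
            if corrections == 0:                          # every pattern satisfied
                break
        return W

    # Toy separable problem (class -1 patterns were negated after augmentation).
    X = np.array([[ 1.0,  1.0,  1.0],
                  [ 2.0,  0.5,  1.0],
                  [ 1.5, -1.0, -1.0],
                  [ 0.5, -2.0, -1.0]])
    W = relaxation_train(X)
    print("weights:", W, "margins:", X @ W)

With 0 < rho < 1 each correction undershoots the pattern's hyperplane, rho = 1 lands exactly on it, and 1 < rho < 2 overshoots; outside (0, 2) the procedure need not converge.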
- Extension to multioutput perceptrons
- Many separate perceptrons
- Winner takes all decision
- Can fail to give unambiguous decision
- Soft decision
- Not a probability, but can be fixed with softmax
- MAXNET (see the sketch below)
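A minimal sketch of the three decision schemes, assuming a small bank of perceptrons with random weights (the sizes and the inhibition constant eps are my choices, not from the lecture):

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 5))            # four separate perceptrons on a 5-dimensional input
    x = rng.normal(size=5)
    a = W @ x                              # the four raw outputs

    # Hard winner-takes-all decision: ambiguous if two outputs (nearly) tie.
    print("winner:", int(np.argmax(a)))

    # Soft decision: softmax gives positive scores that sum to 1
    # (not true probabilities, but at least normalized).
    p = np.exp(a - a.max())
    p /= p.sum()
    print("softmax:", np.round(p, 3))

    # MAXNET: starting from the positive scores, every unit subtracts eps times the sum
    # of the other units' outputs at each step; iterate until one positive output remains.
    eps = 1.0 / (len(p) + 1)               # needs 0 < eps < 1 / (number of units)
    y = p.copy()
    while np.count_nonzero(y > 0) > 1:
        y = np.maximum(y - eps * (y.sum() - y), 0.0)
    print("MAXNET survivor:", int(np.argmax(y)))

All three schemes pick the same class here; MAXNET just reaches it by iteration (lateral inhibition) instead of by a global argmax.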
- Interpretation of training in weight space
- Intersection of constraint regions
- All solution classifiers lie in the intersection of all the constraint regions
- Think of each additional pattern as removing possible perceptrons (see the sketch below)
- This interpretation is used in modern machine learning and in statistical physics
- Impractical to use directly for real (large) problems
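Still, the picture can be checked on a toy problem. A minimal sketch, assuming augmented, sign-normalized patterns (the data and the sampling are my own illustration): sample random points of weight space and keep only those that classify every pattern seen so far; each additional pattern cuts the surviving region down further.

    import numpy as np

    rng = np.random.default_rng(1)

    # Augmented, sign-normalized training patterns: a solution W must satisfy W.x > 0 for every x.
    patterns = np.array([[ 1.0,  1.0,  1.0],
                         [ 2.0,  0.5,  1.0],
                         [ 1.5, -1.0, -1.0],
                         [ 0.5, -2.0, -1.0]])

    W = rng.normal(size=(100000, 3))       # 100000 random points in 3-D weight space
    alive = np.ones(len(W), dtype=bool)

    for k, x in enumerate(patterns, start=1):
        alive &= (W @ x) > 0               # keep classifiers on the correct side of this pattern's hyperplane
        print(f"after {k} pattern(s): {alive.mean():.3f} of the sampled classifiers remain")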
- Iterative improvement methods
- Start with an arbitrary classifier
- Every point in W space has an error value (a cost function defined over W space)
- Move to a neighboring point with lower error
- Gradient descent (steepest descent) uses derivative information to choose the direction (see the sketch below)
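A minimal sketch of the gradient descent loop itself (the step size eta and the example cost are my choices):

    def gradient_descent(grad, x0, eta=0.1, steps=200):
        # Repeatedly take a small step against the gradient of the cost.
        x = x0
        for _ in range(steps):
            x -= eta * grad(x)
        return x

    # Example cost E(x) = (x - 3)^2, with dE/dx = 2(x - 3) and its minimum at x = 3.
    print(gradient_descent(lambda x: 2.0 * (x - 3.0), x0=-5.0))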
- Perceptron criterion
- We originally presented the PLA and proved that it converges - but how did Rosenblatt know?
- We will now learn to derive training algorithms
- Would like to use the number of errors as the cost function
- Piecewise constant (staircase) function of W
- Too jumpy (discontinuous) at the transitions
- Too flat (zero derivative) in between
- Requirements for good cost functions:
- Non-negative
- Only zero for solutions
- Smaller for better classifiers
- Continuous and differentiable
- Perceptron criterion - sum of -W·X over the misclassified patterns (see the sketch below)
- Obeys the requirements
- Leads to PLA!
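A minimal sketch, assuming augmented, sign-normalized toy patterns (my own illustration): the perceptron criterion is non-negative, zero only for solutions, and its gradient descent step W <- W + eta * (sum of the misclassified patterns) is exactly a batch PLA correction.

    import numpy as np

    # Augmented, sign-normalized patterns: W classifies x correctly when W.x > 0.
    X = np.array([[ 1.0,  1.0,  1.0],
                  [ 2.0,  0.5,  1.0],
                  [ 1.5, -1.0, -1.0],
                  [ 0.5, -2.0, -1.0]])

    def misclassified(W):
        return X[X @ W <= 0]                       # patterns on the wrong side (or on the boundary)

    def criterion(W):
        # J(W) = sum of -W.x over the misclassified patterns
        #      (equivalently, sum of max(0, -W.x) over all patterns).
        return np.maximum(0.0, -(X @ W)).sum()

    # grad J = -(sum of the misclassified x), so the descent step W <- W - eta * grad J
    # adds eta times the sum of the misclassified patterns: the batch PLA correction.
    W, eta = np.zeros(3), 1.0
    for step in range(100):                        # guard in case the data were not separable
        bad = misclassified(W)
        if len(bad) == 0:
            break                                  # J(W) = 0: W satisfies the whole training set
        print(f"step {step}: {len(bad)} misclassified, J = {criterion(W):.3f}")
        W += eta * bad.sum(axis=0)
    print("weights:", W, "margins:", X @ W)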
Assignment
- Program MAXNET for a 10-input problem. (More on MAXNET in Lippmann, IEEE ASSP Magazine, April 1987.)
ANSWER
- Numerically find the minimum of
y = (x - 1)^2
and
y = x^2 - 1 = (x-1)(x+1)
using gradient descent.
From which initial points does gradient descent reach each minimum?
ANSWER