CS667
Weight Space and the PLA Criterion
- Where perceptron weight vector W lives
- Pattern vector X lives in an N-dimensional space
- Abstraction of real-world objects
- Every point in the space represents a real-world object
- Up to now the perceptron weight vector W has lived in the same space as X
- W does not represent a real-world object!
- Really already required N+1 free parameters (the threshold adds one)
- More advanced networks will have no connection with X space at all
- Actually all dot products link two different spaces
- From now on think of W in a different space
- W space
- W lives in weight space
- Every point in W space represents a classifier
- A classifier is a dichotomization of X space
- A perceptron classifier is a hyperplane
- All classifiers that satisfy the training set form a region in W space
- For the perceptron, each pattern X defines a hyperplane (W·X = 0) in W space
- Converse of each weight vector W defining a hyperplane in X space (see the sketch below)
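A minimal sketch of this duality, assuming augmented (N+1)-dimensional vectors (the numbers and names below are illustrative, not from the lecture): the single dot product W·X = 0 describes a hyperplane in X space when W is held fixed, and a hyperplane in W space when the pattern X is held fixed.

    import numpy as np

    # Augmented 2-D patterns: x = (x1, x2, 1), so W = (w1, w2, w0) has N+1 = 3 entries.
    X = np.array([[ 1.0,  2.0, 1.0],    # a class +1 pattern
                  [-1.5,  0.5, 1.0]])   # a class -1 pattern
    t = np.array([+1, -1])              # target signs

    W = np.array([0.8, -0.3, 0.1])      # one point in weight space = one classifier

    # Fixed W: {X : W.X = 0} is a hyperplane in X space (the decision boundary).
    print("decisions:", np.sign(X @ W))

    # Fixed X: {W : W.X = 0} is a hyperplane through the origin of W space;
    # classifying X correctly restricts W to one side of that hyperplane.
    for x, sign in zip(X, t):
        print("pattern", x[:2], "classified correctly:", sign * (W @ x) > 0)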
- Agmon's relaxation algorithm
- PLA moves in the correct direction but by a fixed amount
- PLA with a decreasing step size always converges, even in the nonseparable case
- Distance of W from the hyperplane defined by a pattern
- Formula for a correction of controlled size (see the sketch below)
- Converges for 0 < rho < 2
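A minimal sketch of a single-sample relaxation rule of this kind, assuming augmented, sign-normalized patterns (the margin b, the step size rho, and the toy data are my choices): each correction moves W a fraction rho of its distance to the offending pattern's hyperplane.

    import numpy as np

    def relaxation_train(X, rho=1.5, b=0.1, max_epochs=1000):
        # X holds augmented, sign-normalized patterns: a solution satisfies W.x > b for every x.
        W = np.zeros(X.shape[1])
        for _ in range(max_epochs):
            corrections = 0
            for x in X:
                if W @ x <= b:                            # x violates the margin
                    W += rho * (b - W @ x) / (x @ x) * x  # controlled-size step toward x's hyperplane
                    corrections += 1
            if corrections == 0:                          # every pattern satisfied
                break
        return W

    # Toy separable problem (class -1 patterns were negated after augmentation).
    X = np.array([[ 1.0,  1.0,  1.0],
                  [ 2.0,  0.5,  1.0],
                  [ 1.5, -1.0, -1.0],
                  [ 0.5, -2.0, -1.0]])
    W = relaxation_train(X)
    print("weights:", W, "margins:", X @ W)

With 0 < rho < 1 each correction undershoots the pattern's hyperplane, rho = 1 lands exactly on it, and 1 < rho < 2 overshoots; outside (0, 2) the procedure need not converge.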
- Extension to multioutput perceptrons
- Many separate perceptrons
- Winner takes all decision
- Can fail to give unambiguous decision
- Soft decision
- Not a probability, but can be fixed with softmax
- MAXNET (see the sketch below)
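A minimal sketch of the three decision schemes, assuming a small bank of perceptrons with random weights (the sizes and the inhibition constant eps are my choices, not from the lecture):

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 5))            # four separate perceptrons on a 5-dimensional input
    x = rng.normal(size=5)
    a = W @ x                              # the four raw outputs

    # Hard winner-takes-all decision: ambiguous if two outputs (nearly) tie.
    print("winner:", int(np.argmax(a)))

    # Soft decision: softmax gives positive scores that sum to 1
    # (not true probabilities, but at least normalized).
    p = np.exp(a - a.max())
    p /= p.sum()
    print("softmax:", np.round(p, 3))

    # MAXNET: starting from the positive scores, every unit subtracts eps times the sum
    # of the other units' outputs at each step; iterate until one positive output remains.
    eps = 1.0 / (len(p) + 1)               # needs 0 < eps < 1 / (number of units)
    y = p.copy()
    while np.count_nonzero(y > 0) > 1:
        y = np.maximum(y - eps * (y.sum() - y), 0.0)
    print("MAXNET survivor:", int(np.argmax(y)))

All three schemes pick the same class here; MAXNET just reaches it by iteration (lateral inhibition) instead of by a global argmax.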
- Interpretation of training in weight space
- Intersection of constraint regions
- All solution classifiers lie in the intersection of all the constraint regions
- Think of each additional pattern as removing possible perceptrons (see the sketch below)
- This interpretation is used in modern machine learning and in statistical physics
- Impractical to use directly for real (large) problems
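Still, the picture can be checked on a toy problem. A minimal sketch, assuming augmented, sign-normalized patterns (the data and the sampling are my own illustration): sample random points of weight space and keep only those that classify every pattern seen so far; each additional pattern cuts the surviving region down further.

    import numpy as np

    rng = np.random.default_rng(1)

    # Augmented, sign-normalized training patterns: a solution W must satisfy W.x > 0 for every x.
    patterns = np.array([[ 1.0,  1.0,  1.0],
                         [ 2.0,  0.5,  1.0],
                         [ 1.5, -1.0, -1.0],
                         [ 0.5, -2.0, -1.0]])

    W = rng.normal(size=(100000, 3))       # 100000 random points in 3-D weight space
    alive = np.ones(len(W), dtype=bool)

    for k, x in enumerate(patterns, start=1):
        alive &= (W @ x) > 0               # keep classifiers on the correct side of this pattern's hyperplane
        print(f"after {k} pattern(s): {alive.mean():.3f} of the sampled classifiers remain")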
- Iterative improvement methods
- Start with an arbitrary classifier
- Every point in W space has an error value (a cost function defined over W space)
- Move to a neighboring point with lower error
- Gradient descent (steepest descent) uses derivative information to choose the direction (see the sketch below)
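A minimal sketch of the gradient descent loop itself (the step size eta and the example cost are my choices):

    def gradient_descent(grad, x0, eta=0.1, steps=200):
        # Repeatedly take a small step against the gradient of the cost.
        x = x0
        for _ in range(steps):
            x -= eta * grad(x)
        return x

    # Example cost E(x) = (x - 3)^2, with dE/dx = 2(x - 3) and its minimum at x = 3.
    print(gradient_descent(lambda x: 2.0 * (x - 3.0), x0=-5.0))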
- Perceptron criterion
- We originally presented the PLA and proved that it converges - but how did Rosenblatt know?
- We will now learn to derive training algorithms
- Would like to use the number of errors as the cost function
- Piecewise constant (staircase) function of W
- Too jumpy (discontinuous) at the transitions
- Too flat (zero derivative) in between
- Requirements for good cost functions:
- Non-negative
- Only zero for solutions
- Smaller for better classifiers
- Continuous and differentiable
- Perceptron criterion - sum of -W·X over the misclassified patterns (see the sketch below)
- Obeys the requirements
- Leads to PLA!
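A minimal sketch, assuming augmented, sign-normalized toy patterns (my own illustration): the perceptron criterion is non-negative, zero only for solutions, and its gradient descent step W <- W + eta * (sum of the misclassified patterns) is exactly a batch PLA correction.

    import numpy as np

    # Augmented, sign-normalized patterns: W classifies x correctly when W.x > 0.
    X = np.array([[ 1.0,  1.0,  1.0],
                  [ 2.0,  0.5,  1.0],
                  [ 1.5, -1.0, -1.0],
                  [ 0.5, -2.0, -1.0]])

    def misclassified(W):
        return X[X @ W <= 0]                       # patterns on the wrong side (or on the boundary)

    def criterion(W):
        # J(W) = sum of -W.x over the misclassified patterns
        #      (equivalently, sum of max(0, -W.x) over all patterns).
        return np.maximum(0.0, -(X @ W)).sum()

    # grad J = -(sum of the misclassified x), so the descent step W <- W - eta * grad J
    # adds eta times the sum of the misclassified patterns: the batch PLA correction.
    W, eta = np.zeros(3), 1.0
    for step in range(100):                        # guard in case the data were not separable
        bad = misclassified(W)
        if len(bad) == 0:
            break                                  # J(W) = 0: W satisfies the whole training set
        print(f"step {step}: {len(bad)} misclassified, J = {criterion(W):.3f}")
        W += eta * bad.sum(axis=0)
    print("weights:", W, "margins:", X @ W)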
Assignment
- Program MAXNET for a 10-input problem. (More on MAXNET in Lippmann, IEEE ASSP Magazine, April 1987.)
ANSWER
- Numerically find the minimum of
y = (x - 1)^2
and
y = x^2 - 1 = (x-1)(x+1)
using gradient descent.
From which initial points does gradient descent reach each minimum?
ANSWER