**Where the perceptron weight vector W lives**
- Pattern vector X lives in N-dimensional space
- Abstraction of real world objects
- Every point in space represents a real world object
- Up to now perceptron weight vector W in same space as X
- W does not represent a real world object!
- Really already required N+1 free parameters (N weights plus the threshold/bias)
- More advanced networks will have no connection with X space at all
- Actually all dot products link two different spaces
- From now on think of W in a different space
**W space**
- W lives in weight space
- Every point in W space represents a classifier
- A classifier is a dichotomization of X space
- A perceptron classifier is a hyperplane
- All classifiers that satisfy the training set form a **region** in W space
- For the perceptron, each pattern defines a hyperplane in W space
- This is the converse of each perceptron defining a hyperplane in X space
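This duality can be checked numerically: a fixed W defines the hyperplane W·X = 0 in X space, and symmetrically a fixed pattern X defines a hyperplane in W space. A minimal sketch (the vectors here are illustrative, not from the lecture):

```python
import numpy as np

# One fixed training pattern: it defines the hyperplane {W : W . x = 0}
# in weight space.
x = np.array([2.0, 1.0])

# Two different weight vectors, both orthogonal to x:
w1 = np.array([1.0, -2.0])
w2 = np.array([-0.5, 1.0])

# Both classifiers lie on the hyperplane that the pattern x defines
# in W space (the dot product links the two spaces).
print(np.dot(w1, x), np.dot(w2, x))   # -> 0.0 0.0
```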
**Agmon's relaxation algorithm**
- PLA moves in correct direction but fixed amount
- PLA with decreasing step size always converges, even for nonseparable case
- Distance of W from hyperplane defined by pattern
- Formula for correction of controlled size
- Converges for 0 < rho < 2
**Extension to multioutput perceptrons**
- Many separate perceptrons
- Winner-takes-all decision (**MAXNET**)
- Can fail to give an unambiguous decision
- Soft decision
- Not a probability, but can be fixed with **softmax**

**Interpretation of training in weight space**
- Intersection of constraint regions
- All solution classifiers are in intersection of all constraint regions
- Think of each additional pattern removing possible perceptrons
- This interpretation used in modern machine learning and statistical physics
- Impractical to use for real (large) problems
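The constraint-region picture can be made concrete by sampling W space directly: each (augmented) pattern confines solutions to the half-space y·(W·X) > 0, and solution classifiers lie in the intersection of all such half-spaces. A sketch on an illustrative toy set (this brute-force sampling is exactly why the interpretation is impractical for large problems):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy separable problem, patterns augmented with a constant 1 for the bias
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], float)
y = np.array([-1, -1, -1, 1])

W = rng.normal(size=(100_000, 3))            # random points in W space
inside = np.all(y * (W @ X.T) > 0, axis=1)   # satisfy every constraint?

# A nonempty intersection means the problem is separable; each extra
# pattern can only shrink the surviving set of perceptrons.
print(inside.sum() > 0)   # -> True
```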
- Iterative improvement methods
- Start with arbitrary classifier
- Every point in W space has an **error (cost function)**
- Move to a neighboring point with lower error
- Gradient descent (steepest descent) uses derivative information to choose direction
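Gradient descent in W space can be sketched in a few lines; the step size, iteration count, and the quadratic toy error surface below are illustrative assumptions:

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, steps=200):
    """Steepest descent: repeatedly step against the gradient of the
    error surface over W space."""
    w = np.asarray(w0, float)
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

# Toy error surface E(w) = ||w - [1, 2]||^2 with gradient 2 (w - [1, 2]);
# descent converges to the minimum at [1, 2].
w_min = gradient_descent(lambda w: 2 * (w - np.array([1.0, 2.0])), [0.0, 0.0])
print(np.round(w_min, 3))   # -> [1. 2.]
```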
- Perceptron criterion
- We originally presented the PLA and proved it converges - but how did Rosenblatt know the rule in the first place?
- We will now learn to derive training algorithms
- Would like to use the number of errors as a cost function
- Piecewise constant function of W
- Too jumpy at the transitions (discontinuous)
- Too flat between them (zero gradient gives no descent direction)
- Requirements for good cost functions:
- Non-negative
- Only zero for solutions
- Smaller for better classifiers
- Continuous and differentiable
- Perceptron criterion - sum the (negated) dot products W·X over the misclassified patterns
- Obeys requirements
- Leads to PLA!
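The derivation can be sketched numerically: with sign-normalized margins, the criterion is J(W) = -Σ y·(W·X) over misclassified patterns, its gradient contribution from one such pattern is -y·X, and a single-sample descent step is exactly the PLA correction. The toy data and learning rate are illustrative:

```python
import numpy as np

def perceptron_criterion(w, X, y):
    """J(w) = -sum over misclassified patterns of y * (w . x):
    non-negative, zero only for solutions, smaller for better classifiers."""
    margins = y * (X @ w)
    return -margins[margins <= 0].sum()

def pla_step(w, x, y, eta=1.0):
    """One gradient-descent step on the criterion for a single
    misclassified pattern: exactly the PLA correction w + eta * y * x."""
    return w + eta * y * x

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], float)
y = np.array([-1, -1, -1, 1])
w = np.zeros(3)
for _ in range(100):                 # cycle until no pattern is misclassified
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * np.dot(w, xi) <= 0:
            w = pla_step(w, xi, yi)
            mistakes += 1
    if mistakes == 0:
        break
print(perceptron_criterion(w, X, y) == 0.0)   # -> True
```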

- Program MAXNET for a 10-input problem. (More on MAXNET in Lippmann, IEEE ASSP Magazine, April 1987) ANSWER
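A minimal MAXNET sketch one might start from (the inhibition constant and the 10 input values are illustrative; this is not the graded solution):

```python
import numpy as np

def maxnet(u, eps=None, max_iters=500):
    """MAXNET winner-take-all (after Lippmann, 1987): each node excites
    itself and inhibits every other node by eps < 1/M; iterating drives
    all nodes except the maximum to zero."""
    u = np.maximum(np.asarray(u, float), 0.0)
    M = len(u)
    eps = eps if eps is not None else 1.0 / (M + 1)
    for _ in range(max_iters):
        # each node subtracts eps times the total activity of the others
        new = np.maximum(u - eps * (u.sum() - u), 0.0)
        if np.count_nonzero(new) <= 1:
            return new
        u = new
    return u

# 10-input example: only the node with the largest input should survive
u = maxnet([0.2, 0.5, 0.9, 0.1, 0.4, 0.3, 0.8, 0.6, 0.7, 0.05])
print(int(np.argmax(u)))   # -> 2
```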
- Numerically find the minimum of y = (x - 1)^2 and y = x^2 - 1 = (x-1)(x+1) using gradient descent. From which initial points does gradient descent reach each minimum? ANSWER
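A possible starting sketch for the numerical part (step size and starting points are illustrative, not the official ANSWER):

```python
def descend(dfdx, x0, eta=0.1, steps=1000):
    """Plain one-dimensional gradient descent: x <- x - eta * dy/dx."""
    x = x0
    for _ in range(steps):
        x = x - eta * dfdx(x)
    return x

# y = (x - 1)^2, dy/dx = 2 (x - 1): minimum at x = 1
print(round(descend(lambda x: 2 * (x - 1), x0=-5.0), 4))   # -> 1.0

# y = x^2 - 1 = (x - 1)(x + 1), dy/dx = 2 x: minimum at x = 0
print(round(descend(lambda x: 2 * x, x0=3.0), 4))          # -> 0.0
```

Varying `x0` (and `eta`) then answers the question of which initial points reach which minimum.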