# Weight Space and the PLA Criterion

1. Where perceptron weight vector W lives
• Pattern vector X lives in N-dimensional space
• Abstraction of real world objects
• Every point in space represents a real world object
• Up to now the perceptron weight vector W has lived in the same space as X
• W does not represent a real world object!
• With the bias absorbed as an extra weight, W really already requires N+1 free parameters
• More advanced networks will have no connection with X space at all
• Actually all dot products link two different spaces
• From now on think of W in a different space
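The N+1 free parameters can be made concrete by absorbing the threshold into W. A minimal sketch (the helper names are illustrative, not from the lecture):

```python
import numpy as np

# A perceptron fires when w . x exceeds a threshold theta.
# Absorbing theta as a bias weight gives N + 1 free parameters:
# augment every pattern with a constant 1 and fold -theta into w.
def augment(x):
    """Append a constant 1 so the threshold becomes an ordinary weight."""
    return np.append(x, 1.0)

def classify(w, x):
    """Dichotomize X space by the sign of the dot product w . [x, 1]."""
    return 1 if w @ augment(x) > 0 else -1

# N = 2 pattern dimensions, so w has 3 entries (the last is the bias).
w = np.array([2.0, -1.0, 0.5])
print(classify(w, np.array([1.0, 1.0])))    # 2 - 1 + 0.5 = 1.5 > 0, prints 1
```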
2. W space
• W lives in weight space
• Every point in W space represents a classifier
• A classifier is a dichotomization of X space
• A perceptron classifier is a hyperplane
• All classifiers that satisfy training set form a region in W space
• For perceptron each pattern defines a hyperplane in W space
• Converse of each perceptron defining a hyperplane in X space
• Agmon's relaxation algorithm
• PLA moves W in the correct direction, but by a fixed amount
• PLA with a decreasing step size always converges, even in the non-separable case
• Distance of W from hyperplane defined by pattern
• Formula for correction of controlled size
• Converges for 0 < rho < 2
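One relaxation update can be sketched as follows, under the usual assumption that patterns are pre-normalized (negative-class patterns multiplied by -1) so every pattern should satisfy W·X > 0; the margin b and the names are illustrative:

```python
import numpy as np

def relaxation_step(w, x, b=1.0, rho=1.5):
    """One relaxation update for a pattern violating w @ x > b.
    In weight space the pattern x defines the hyperplane w @ x = b;
    the correction moves w a fraction rho of its distance to that
    hyperplane, so the step size is controlled by the violation.
    Converges for 0 < rho < 2 (on separable data)."""
    if w @ x <= b:
        w = w + rho * (b - w @ x) / (x @ x) * x
    return w

# Separable toy set, already augmented with a bias and sign-normalized.
patterns = np.array([[2.0, 1.0, 1.0],
                     [1.0, 2.0, 1.0],
                     [1.0, 1.0, -1.0]])
w = np.zeros(3)
for _ in range(100):                      # cycle through the training set
    for x in patterns:
        w = relaxation_step(w, x)
print(patterns @ w)                       # all entries positive: solved
```

Unlike PLA's fixed step, the correction here shrinks as W approaches the pattern's hyperplane.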
3. Extension to multioutput perceptrons
• Many separate perceptrons
• Winner-take-all decision
• Can fail to give unambiguous decision
• Soft decision
• Outputs are not probabilities, but softmax can fix that
• MAXNET
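The two decision rules for a bank of separate perceptrons (one score each) can be sketched as follows; the function names are illustrative:

```python
import numpy as np

def winner_take_all(scores):
    """Hard decision: pick the perceptron with the largest output.
    Ambiguous when the top scores tie."""
    return int(np.argmax(scores))

def softmax(scores):
    """Soft decision: exponentiate and normalize, so the raw scores
    (which are not probabilities) become positive and sum to 1."""
    z = np.exp(scores - np.max(scores))   # subtract max for stability
    return z / z.sum()

scores = np.array([2.0, 0.5, 1.0])
print(winner_take_all(scores))            # 0
print(softmax(scores))                    # largest entry is still index 0
```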
4. Interpretation of training in weight space
• Intersection of constraint regions
• All solution classifiers are in intersection of all constraint regions
• Think of each additional pattern as removing possible perceptrons
• This interpretation used in modern machine learning and statistical physics
• Impractical to use for real (large) problems
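A Monte Carlo sketch of the constraint-region picture (illustrative only, and exactly the brute-force style that does not scale): sample random classifiers W and watch the surviving fraction shrink as pattern constraints are added.

```python
import numpy as np

rng = np.random.default_rng(0)
patterns = rng.normal(size=(5, 2))        # pre-normalized: want w @ x > 0

# Each pattern defines a hyperplane in W space; solution classifiers lie
# in the intersection of the half-spaces w @ x > 0. Sample many candidate
# W and keep only those satisfying every constraint seen so far.
W = rng.normal(size=(10000, 2))
alive = np.ones(len(W), dtype=bool)
fractions = []
for k, x in enumerate(patterns, start=1):
    alive &= (W @ x > 0)
    fractions.append(alive.mean())
    print(f"after {k} patterns: {fractions[-1]:.3f} of sampled W survive")
```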
• Iterative improvement methods
• Every point in W space has an error (cost function)
• Move to neighboring point with lower error
• Gradient descent (steepest descent) uses derivative information to choose direction
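A minimal sketch of gradient descent on a one-dimensional toy cost (the step size eta and the toy cost are illustrative):

```python
def gradient_descent(grad, w0, eta=0.1, steps=100):
    """Steepest descent: repeatedly step against the cost's gradient,
    i.e. move to a neighboring point with lower error."""
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

# Toy cost E(w) = (w - 3)^2 with derivative 2(w - 3); minimum at w = 3.
w_star = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(round(w_star, 6))                   # close to 3.0
```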
• Perceptron criterion
• We originally presented PLA and proved its convergence - but how did Rosenblatt know the update rule?
• We will now learn to derive training algorithms
• Would like to use the number of errors as the cost function
• But the error count is a piecewise constant function of W
• Too jumpy at the transitions (discontinuities)
• Too flat between them (zero derivative, so no direction information)
• Requirements for good cost functions:
• Non-negative
• Only zero for solutions
• Smaller for better classifiers
• Continuous and differentiable
• Perceptron criterion - sum of -W·X (the dot products) over the misclassified patterns
• Obeys requirements
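The criterion and its gradient can be sketched as follows (patterns assumed pre-normalized so that correct classification means w @ x > 0; the names are illustrative). One gradient-descent step with unit step size adds the misclassified patterns to W, which recovers the batch-style PLA correction:

```python
import numpy as np

def perceptron_criterion(w, patterns):
    """J_p(w): sum of -w @ x over the misclassified patterns.
    Non-negative, zero only for solutions, smaller for better
    classifiers, and piecewise linear."""
    margins = patterns @ w
    return float((-margins[margins <= 0]).sum())

def criterion_gradient(w, patterns):
    """Gradient of J_p: minus the sum of the misclassified patterns."""
    margins = patterns @ w
    return -patterns[margins <= 0].sum(axis=0)

patterns = np.array([[2.0, 1.0, 1.0],
                     [1.0, 2.0, 1.0],
                     [1.0, 1.0, -1.0]])
w = np.array([-1.0, -1.0, 0.0])           # misclassifies everything
print(perceptron_criterion(w, patterns))  # 8.0
w = w - 1.0 * criterion_gradient(w, patterns)   # one step: PLA-style update
print(perceptron_criterion(w, patterns))  # 0.0: a solution
```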