CS661 Artificial Intelligence
Lecture 2 - Pattern Recognition
- The simplest type of AI - Pattern Recognition
- classification of real-world objects
- not a unified field, more art than science
- Not a trivial problem
- example: complete the sequence 1, 2, 3, 4, 5, 6, 7, X
- desired answer: 5048, since the sequence was generated by n + (n-1)(n-2)(n-3)(n-4)(n-5)(n-6)(n-7): the product vanishes for n = 1..7 and equals 7! = 5040 at n = 8 (see the check below)
- Occam's razor is hard to explain to a computer
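A quick numeric check of that formula (only the formula itself comes from the example above):

    # Verify that n + (n-1)(n-2)(n-3)(n-4)(n-5)(n-6)(n-7) reproduces 1..7 and then jumps.
    def term(n):
        prod = 1
        for k in range(1, 8):          # factors (n-1) through (n-7)
            prod *= n - k
        return n + prod

    print([term(n) for n in range(1, 9)])   # [1, 2, 3, 4, 5, 6, 7, 5048]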
- A few applications:
- speech and speaker recognition
- character or signature recognition
- image and scene recognition
- fingerprint recognition
- biomedical applications
- recognition can be from closed set (verification) or open set (spotting)
- for verification the figure of merit is the probability of correct classification
- for spotting there is also a false-alarm rate, traded off against the detection rate
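A minimal sketch of that trade-off, thresholding synthetic match scores (all numbers are invented for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    target_scores   = rng.normal(2.0, 1.0, 1000)   # scores for true targets
    impostor_scores = rng.normal(0.0, 1.0, 1000)   # scores for non-targets

    # Raising the threshold lowers the false-alarm rate but also the detection rate.
    for thr in (-1.0, 0.0, 1.0, 2.0):
        p_detect      = float(np.mean(target_scores > thr))
        p_false_alarm = float(np.mean(impostor_scores > thr))
        print(f"threshold {thr:+.1f}: detection {p_detect:.2f}, false alarm {p_false_alarm:.2f}")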
- Syntactical (Linguistic) Approach (GOFAI)
- Describe classes by rules (grammar) in a formal language
- Identify pattern's class by the grammar it obeys
- like identifying which programming language a source file is in by trying different compilers
- Reduces classification to string parsing
- Typical applications: Fingerprinting, Scene analysis, Chinese character recognition
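A minimal sketch of the syntactic idea, using regular expressions as stand-in grammars (the classes and strings are invented for illustration):

    import re

    # Each class is described by a "grammar" (here just a regular expression).
    grammars = {
        "binary":  re.compile(r"^[01]+$"),
        "decimal": re.compile(r"^[0-9]+$"),
        "word":    re.compile(r"^[a-z]+$"),
    }

    def classify(pattern):
        # A pattern belongs to the first class whose grammar accepts it,
        # like finding which compiler accepts a source file.
        for label, grammar in grammars.items():
            if grammar.match(pattern):
                return label
        return "rejected"

    print([classify(s) for s in ("0101", "4711", "hello", "x3!")])
    # ['binary', 'decimal', 'word', 'rejected']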
- Statistical (Decision Theoretic) Approach
- Describe patterns as points in n dimensional vector space
- Describe classes as hypervolumes or statistical clouds
- Reduces classification to geometry or function evaluation
- Typical applications: Signals, Speech, Latin character recognition
- Comparison (three example patterns per class):

                  Class A            Class B             Class C
    Statistical   (1,   0,   0  )    ( 0,   1,    0  )   ( 0,   0,   1  )
                  (0.9, 0.1, 0  )    ( 0.1, 1,   -0.1)   (-0.1, 0.1, 1  )
                  (1.1, 0,   0.1)    ( 0,   0.9, -0.1)   ( 0.1, 0,   1.1)
    Syntactic     (1, 1, 1)          (1, 2, 3)           (1, 2, 4 )
                  (2, 2, 2)          (2, 3, 4)           (2, 4, 8 )
                  (3, 3, 3)          (3, 4, 5)           (3, 6, 12)
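A minimal sketch of the statistical view applied to the statistical rows above: each class is summarized by the mean of its three example vectors and a new pattern goes to the nearest mean (nearest-centroid is just one illustrative choice):

    import numpy as np

    # Statistical example patterns from the table (three per class).
    classes = {
        "A": np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [1.1, 0.0, 0.1]]),
        "B": np.array([[0.0, 1.0, 0.0], [0.1, 1.0, -0.1], [0.0, 0.9, -0.1]]),
        "C": np.array([[0.0, 0.0, 1.0], [-0.1, 0.1, 1.0], [0.1, 0.0, 1.1]]),
    }
    means = {label: pts.mean(axis=0) for label, pts in classes.items()}

    def classify(x):
        # Geometry instead of parsing: pick the class whose mean is closest.
        return min(means, key=lambda label: np.linalg.norm(x - means[label]))

    print(classify(np.array([0.95, 0.05, 0.0])))   # -> A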
How hard can it be?
- small spheres far apart
- large spheres close together
- rotated cigars
- bananas
- true overlap
Bayes Decision Theory
- priors p(ci) and class-conditional likelihoods p(x|ci) - the a priori distributions
- posteriors p(ci|x) - classify by MAP (pick the class with the largest posterior)
- Bayes theorem: p(x, ci) = p(ci|x) p(x) = p(x|ci) p(ci)
- so p(ci|x) = p(ci) p(x|ci) / p(x) where p(x) = \sum_i p(ci) p(x|ci)
- the Bayes error is the minimal achievable error (picture two overlapping Gaussians: the overlap region is unavoidable)
- Why do we need anything else? Because the true priors and likelihoods are unknown!
- So does pattern recognition reduce to estimating these distributions?
- No - the estimates must also generalize (remember the real world)
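A minimal sketch of the MAP rule for two overlapping 1-d Gaussian classes (the priors, means, and variances here are made up for illustration):

    import numpy as np

    def gauss_pdf(x, mu, sigma):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    priors = {"c1": 0.6, "c2": 0.4}
    params = {"c1": (0.0, 1.0), "c2": (2.0, 1.0)}   # (mean, std) per class

    def posteriors(x):
        # Bayes theorem: p(ci|x) = p(ci) p(x|ci) / p(x), with p(x) = sum_i p(ci) p(x|ci)
        joint = {c: priors[c] * gauss_pdf(x, *params[c]) for c in priors}
        p_x = sum(joint.values())
        return {c: v / p_x for c, v in joint.items()}

    for x in (0.0, 1.0, 2.0):
        post = posteriors(x)
        print(x, {c: round(p, 3) for c, p in post.items()}, "-> MAP:", max(post, key=post.get))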
Learning Theory
- Supervised
- Unsupervised (clustering)
- Reinforcement (good dog, bad dog)
- Cross Validation
- Training set vs. test set
- Validation set
- Training error
- Test error
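A minimal sketch of the train/test discipline with a nearest-mean classifier on synthetic 2-d data (the data generation and split sizes are invented for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    # Two synthetic 2-d classes with some overlap.
    X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(1.5, 1.0, (200, 2))])
    y = np.array([0] * 200 + [1] * 200)

    # Shuffle, then hold out a test set the classifier never sees during training.
    idx = rng.permutation(len(y))
    train, test = idx[:300], idx[300:]

    means = np.array([X[train][y[train] == c].mean(axis=0) for c in (0, 1)])

    def predict(points):
        # Nearest-mean decision rule.
        d = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
        return d.argmin(axis=1)

    print("training error:", np.mean(predict(X[train]) != y[train]))
    print("test error:    ", np.mean(predict(X[test]) != y[test]))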
Structure of classical pattern recognition system
- Preprocessing (noise reduction, registration, etc.)
- Feature extraction (dimensionality reduction)
- Classification
- Postprocessing (smoothing, outlier removal, etc.)
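A minimal sketch of the four-stage pipeline on a synthetic 1-d signal; each stage is a deliberately simple stand-in for the operations named above:

    import numpy as np

    def preprocess(signal):
        # Noise reduction: 3-tap moving-average filter.
        return np.convolve(signal, np.ones(3) / 3, mode="same")

    def extract_features(signal, frame=10):
        # Dimensionality reduction: one mean and one energy value per frame.
        frames = signal[: len(signal) // frame * frame].reshape(-1, frame)
        return np.stack([frames.mean(axis=1), (frames ** 2).mean(axis=1)], axis=1)

    def classify(features):
        # Toy decision rule: high-energy frames are class 1, the rest class 0.
        return (features[:, 1] > 0.5).astype(int)

    def postprocess(labels):
        # Smoothing: majority vote over a 3-label window to remove isolated outliers.
        padded = np.pad(labels, 1, mode="edge")
        return np.array([np.bincount(padded[i:i + 3]).argmax() for i in range(len(labels))])

    rng = np.random.default_rng(2)
    signal = np.concatenate([rng.normal(0, 0.3, 50), rng.normal(0, 1.5, 50)])
    print(postprocess(classify(extract_features(preprocess(signal)))))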
Generic classifiers (from direct probability estimation toward pure class separation; a kNN sketch follows the list)
- Direct probability estimation (1NN, kNN, Parzen; LBG, LVQ)
- Hypersphere (potentials, Mahalanobis, RBF, RCE)
- Hyperplane (KL, Fisher, CART, MLP)
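A minimal sketch of the "direct probability estimation" end of the list: a kNN classifier whose neighbor votes serve as a crude local estimate of the posteriors (the data and k are illustrative):

    import numpy as np

    rng = np.random.default_rng(3)
    # Labeled training points from two 2-d classes.
    X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(2.0, 1.0, (100, 2))])
    y = np.array([0] * 100 + [1] * 100)

    def knn_posterior(x, k=5):
        # The fraction of the k nearest neighbors in each class is a
        # direct (local) estimate of p(ci|x).
        nearest = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
        votes = np.bincount(y[nearest], minlength=2)
        return votes / k

    post = knn_posterior(np.array([1.0, 1.0]))
    print("p(c0|x), p(c1|x) ~", post, "-> class", post.argmax())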
The VC revolution
- PAC (Probably Approximately Correct) Learning
- teacher and student draw examples from the same probability distribution over the world
- bound on generalization error: with probability at least 1 - delta the error E stays below epsilon, i.e. P(E > epsilon) < delta
- VC dimension - strength of classifier
- linear separator in 1-d: dVC = 2
- linear separator in 2-d: dVC = 3
- axis-aligned rectangle in 2-d: dVC = 4
- Match classifier's VC dimension to amount of training data
- roughly, Egen - Etrain is controlled by dVC / Ntrain (a standard form of the bound is given below)
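For reference, a common statement of the underlying bound (constants vary between texts, so treat this as the approximate form rather than the lecture's exact claim): with probability at least 1 - \delta over N i.i.d. training examples,

    E_{\mathrm{gen}} \le E_{\mathrm{train}}
        + \sqrt{\frac{d_{VC}\left(\ln\frac{2N}{d_{VC}} + 1\right) + \ln\frac{4}{\delta}}{N}}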