## What hypothesis set can we use?

## A Simple Hypothesis Set: the “Perceptron”

- Perceptron in $\mathbb{R}^2$

- features x: points on the plane (or points in $\mathbb{R}^d$ )
- labels y: (+1), × (-1)
- hypothesis h: lines (or hyperplanes in $\mathbb{R}^d$ ),positive on one side of a line, negative on the other side
- different line classifies simples differently

perceptrons <=> linear (binary) classifiers

## Select g from $\mathscr{H}$

- want: $g \approx f$ (hard when f unknown)
- almost necessary: $g \approx f$ on D, ideally $g(x_n) = f(x_n) = y_n$
- difficult: H is of infinite size
- idea: start from some $g_0$, and correct its mistakes on $D$

## Perceptron Learning Algorithm

- if PLA halts (i.e. no more mistakes),

(necessary condition) $D$ allows some w to make no mistake - call such $D$ linear separable
- as long as linear separable and correct by mistake
- inner product of $w_f$ and $w_t$ grows fast; length of wt grows slowly
- PLA ‘lines’ are more and more aligned with $w_f \rightarrow halts$

## Line with Noise Tolerance

$D$ is not linear separable?

## Pocket Algorithm

**Find the best weights in pocket until enough iterations**

### Question

Since we do not know whether D is linear separable in advance, we may decide to just go with pocket instead of PLA. If D is actually linear separable, what’s the difference between the two?

1 pocket on D is slower than PLA

2 pocket on D is faster than PLA

3 pocket on D returns a better g in approximating f than PLA

4 pocket on D returns a worse g in approximating f than PLAanswer: Because pocket need to check whether $w_{t+1}$ is better than $\hat{w}$ in each iteration, it is slower than PLA. On linear separable D, $w_{POCKET}$ is the same as $w_{PLA}$, both making no mistakes.