Probably Approximately Correct Learning
January 1, 2017
We often talk about machine learning algorithms, but how do we know whether a classifier is good enough, or how many data points we need to make it good enough? These are some of the questions that arise while building a machine learning model. In this post I'll talk about the PAC learning framework, which helps us answer these very basic questions about machine learning classifiers.
$X$ is the set of all possible examples, or the input space. $Y$ is the set of all possible labels. To keep things simple, we'll take the case of binary classification, where $Y = \{0, 1\}$. A concept $c$ maps the input space to the output space, $c: X \to Y$, and $C$ is the set of all concepts, the concept class. In real-world problems we don't know the concept or the concept class. We are just given $X$ and $Y$, and our goal is to come up with a hypothesis $h$ that has small generalization error (true error) with respect to $c$. There could be many hypotheses that classify the data; we call the set of hypotheses we consider the hypothesis space, $H$.
Let's say we are given a sample of $m$ data points drawn from an unknown distribution $D$ together with their output labels. We can denote the data as $((x_1, y_1),...,(x_m, y_m))$ or $((x_1, c(x_1)),...,(x_m, c(x_m)))$. Now we can define the generalization error (the error on unseen data), denoted by $R(h)$, as
$R(h) = \text{P}[h(x) \neq c(x)] = \text{E}[\text{1}_{h(x) \neq c(x)}]$
where $1_\omega$ is the indicator function of $\omega$.
But we don't know the distribution $D$ or the concept $c$, so how can we calculate the generalization error? We cannot calculate it directly, but we can calculate the empirical error, denoted by $\hat R(h)$ and defined as
$\hat R(h) = \frac{1}{m} \sum_{i=1}^{m}1_{h(x_i) \neq c(x_i)}$
which is the average error over the sample.
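To make this concrete, here is a minimal sketch in Python (the `empirical_error` helper and the toy threshold hypothesis are my own illustration, not from the post):

```python
# A minimal sketch: empirical error of a hypothesis on a labeled sample.
# The threshold hypothesis below is just an illustrative stand-in for h.

def empirical_error(h, xs, ys):
    """Average of the 0/1 indicator 1_{h(x_i) != y_i} over the sample."""
    m = len(xs)
    return sum(1 for x, y in zip(xs, ys) if h(x) != y) / m

# Toy example: a hypothesis that predicts 1 when x >= 0.5
h = lambda x: 1 if x >= 0.5 else 0
xs = [0.1, 0.4, 0.6, 0.9]
ys = [0, 1, 1, 1]                  # labels produced by some unknown concept c
print(empirical_error(h, xs, ys))  # 0.25: one of the four points is misclassified
```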
So, in a sense, the generalization error is the expected value of the empirical error, and we can prove that.
We can write that,
$\text{E}[\hat R(h)] = \text{E}[\frac{1}{m} \sum_{i=1}^{m}1_{h(x_i) \neq c(x_i)}]$
Using linearity of expectation (the expectation of a sum of random variables equals the sum of their expectations),
$\text{E}[\hat R(h)] = \frac{1}{m} \sum_{i=1}^{m} \text{E}[1_{h(x_i) \neq c(x_i)}]$
Also, since the data points are drawn independently from the same distribution, the expectation is the same for every $x_i$ in the sample, therefore we can write,
$\text{E}[\hat R(h)] = \frac{1}{m} \sum_{i=1}^{m} \text{E}[1_{h(x) \neq c(x)}] = \text{E}[1_{h(x) \neq c(x)}] = R(h)$
Therefore, the generalization error is indeed the expected value of the empirical error.
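We can sanity-check this identity with a small simulation. This is a sketch under assumptions of my own choosing: $D$ is uniform on $[0, 1]$, the concept $c$ is a threshold at $0.5$, and the hypothesis $h$ is a threshold at $0.6$, so the true error is exactly $R(h) = 0.1$. Averaging the empirical error over many independent samples should recover that value.

```python
import random

# Assumed toy setup (not from the post): D = Uniform[0, 1],
# c(x) = 1 if x >= 0.5, h(x) = 1 if x >= 0.6.
# h and c disagree exactly on [0.5, 0.6), so R(h) = 0.1.
c = lambda x: 1 if x >= 0.5 else 0
h = lambda x: 1 if x >= 0.6 else 0

def empirical_error(h, xs, ys):
    return sum(1 for x, y in zip(xs, ys) if h(x) != y) / len(xs)

random.seed(0)
m, trials = 50, 20000
avg = 0.0
for _ in range(trials):
    xs = [random.random() for _ in range(m)]
    ys = [c(x) for x in xs]
    avg += empirical_error(h, xs, ys)
print(avg / trials)  # close to the true error R(h) = 0.1
```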
Now, a concept class $C$ is said to be PAC learnable if there exists an algorithm $\text{A}$ such that for any $\epsilon > 0$ and $0 < \delta \leq 1/2$, for all distributions $D$ on $X$ and for any target concept $c \in C$, the hypothesis $h_S$ returned by $\text{A}$ on a sufficiently large sample $S$ satisfies
$\text{P}[R(h_S) \leq \epsilon] \geq 1 - \delta$
The above equation means that the hypothesis returned by our algorithm $\text{A}$ is approximately correct, with error at most $\epsilon$, with high probability, at least $1 - \delta$. We can write it the other way as well,
$\text{P}[R(h_S) > \epsilon] \leq \delta$
We are saying that the probability of something bad happening (large error) is upper bounded by $\delta$, which is good. When such an algorithm $\text{A}$ exists, we say that it is a PAC learning algorithm for the concept class $C$. Let's take an example with a consistent hypothesis. A consistent hypothesis is one that makes zero error on the sample data; requiring consistency is how we limit the hypothesis space.
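As a quick illustration (with toy threshold hypotheses of my own, not from the post), a hypothesis is consistent with the sample exactly when its empirical error is zero:

```python
# A hypothesis is consistent with the sample iff its empirical error is zero.
def is_consistent(h, xs, ys):
    return all(h(x) == y for x, y in zip(xs, ys))

xs = [0.1, 0.4, 0.6, 0.9]
ys = [0, 0, 1, 1]                       # labels from some unknown concept
print(is_consistent(lambda x: 1 if x >= 0.5 else 0, xs, ys))  # True
print(is_consistent(lambda x: 1 if x >= 0.8 else 0, xs, ys))  # False: misclassifies x = 0.6
```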
Learning axis-aligned rectangles
Suppose we are given a sample of points from $\mathbb{R}^2$. $C$ is the set of all axis-aligned rectangles lying in $\mathbb{R}^2$. Each concept $c$ is a particular rectangle, labeling the points inside it as positive and the points outside as negative. The machine learning problem is to determine an axis-aligned rectangle with small true error using the labeled training sample. Our goal is to show that the concept class of axis-aligned rectangles is PAC learnable.
Let $\text{R}$ denote the target rectangle (the target concept $c$), and suppose our algorithm $\text{A}$ returns as its hypothesis the tightest rectangle enclosing the positive sample points, say $\text{R}^{'}$. This hypothesis makes zero error on the sample, but what about the true error? Since $\text{R}^{'} \subseteq \text{R}$, the true error can only come from the region $\text{R} - \text{R}^{'}$. Assuming $\text{P}[\text{R}] > \epsilon$ (otherwise the error is trivially at most $\epsilon$), we can mark off four regions $r_1, r_2, r_3$ and $r_4$ along the sides of $\text{R}$ such that the probability mass of each region is $\epsilon/4$. This means that a random point drawn from the distribution $D$ falls in region $r_i$ with probability $\epsilon/4$. These regions might look like in the figure below.
The probability that a single point drawn from $D$ misses a region $r_i$ is at most $1 - \epsilon/4$, so if $m$ points are drawn i.i.d. from $D$, the probability that all of them miss $r_i$ is at most $(1 - \epsilon/4)^m$. Now observe that if $\text{R}^{'}$ touches all four regions, the error region $\text{R} - \text{R}^{'}$ is contained in $r_1 \cup r_2 \cup r_3 \cup r_4$, so the true error is at most $\epsilon$. Therefore, for our hypothesis to make an error of more than $\epsilon$, it must miss at least one region entirely, which can only happen if no sample point falls in that region. Denoting by $\bar r_i$ the event that the sample misses region $r_i$, the probability of error more than $\epsilon$ is bounded by the probability of the union of these events. So now we can write,
$\text{P}[R(\text{R}^{'}) > \epsilon] = \text{P}[\bar r_1 \cup \bar r_2 \cup \bar r_3 \cup \bar r_4]$
Using the union bound,
$\text{P}[R(\text{R}^{'}) > \epsilon] \leq \text{P}[\bar r_1] + \text{P}[\bar r_2] + \text{P}[\bar r_3] + \text{P}[\bar r_4] \leq 4(1 - \epsilon/4)^m$
Using the inequality $1 - x \leq \text{exp}(-x)$, we can write that
$\text{P}[R(\text{R}^{'}) > \epsilon] \leq 4\text{exp}(-m\epsilon/4)$
To satisfy the PAC condition $\text{P}[R(\text{R}^{'}) > \epsilon] \leq \delta$, it is enough to choose $m$ such that
$4\text{exp}(-m\epsilon/4) \leq \delta$
Taking logarithms on both sides and solving for $m$, the number of data points, we get
$m \geq (4/\epsilon)\log(4/\delta)$
So, if we are given at least $(4/\epsilon)\log(4/\delta)$ independent sample points and use the tightest rectangle $\text{R}^{'}$ as our hypothesis $h$, then with probability at least $1 - \delta$ the hypothesis has true error at most $\epsilon$.
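Here is a minimal sketch of the whole argument in code. The target rectangle, the uniform distribution on the unit square, and the helper names are my own illustrative assumptions: it learns the tightest rectangle from a sample of the size suggested by the bound and then estimates its true error on a large fresh test set.

```python
import math
import random

# Assumed toy setup: D = uniform on the unit square,
# target rectangle R = [0.2, 0.8] x [0.3, 0.7].
def target(x, y):
    return 1 if 0.2 <= x <= 0.8 and 0.3 <= y <= 0.7 else 0

def tightest_rectangle(points, labels):
    """Hypothesis R': smallest axis-aligned rectangle enclosing the positive points."""
    pos = [p for p, l in zip(points, labels) if l == 1]
    if not pos:                        # no positive examples: predict 0 everywhere
        return None
    xs, ys = zip(*pos)
    return (min(xs), max(xs), min(ys), max(ys))

def predict(rect, x, y):
    if rect is None:
        return 0
    x0, x1, y0, y1 = rect
    return 1 if x0 <= x <= x1 and y0 <= y <= y1 else 0

eps, delta = 0.1, 0.05
m = math.ceil((4 / eps) * math.log(4 / delta))   # sample size from the bound, ~176 here

random.seed(0)
sample = [(random.random(), random.random()) for _ in range(m)]
labels = [target(x, y) for x, y in sample]
rect = tightest_rectangle(sample, labels)

# Estimate the true error of R' with a large fresh test set.
test = [(random.random(), random.random()) for _ in range(100000)]
err = sum(1 for x, y in test if predict(rect, x, y) != target(x, y)) / len(test)
print(m, err)   # with probability at least 1 - delta, the true error is below eps = 0.1
```

In practice the estimated error tends to come out well below $\epsilon$, since the bound we derived is not tight.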
The example above was a consistent case, where the hypothesis makes zero error on the sample. When, in addition, the hypothesis is chosen from a finite hypothesis space, there is a general formula for $m$, the number of sample points, given by
$m \geq \frac{1}{\epsilon}(\log|H| + \log\frac{1}{\delta})$
Here $|H|$ is the number of hypotheses in the finite hypothesis space $H$ (for infinite hypothesis spaces we instead use the VC dimension, which I'll post about sometime later). For a sample of $m$ points and any $\delta > 0$, we can now write a bound on the generalization error that holds with probability at least $1 - \delta$,
$R(h_S) \leq \frac{1}{m}(\log|H| + \log\frac{1}{\delta})$
From the above relation we can infer that the upper bound scales as $1/m$: the larger the sample, the smaller the bound. Similarly for $\delta$. The bound also depends on the size of the hypothesis space $H$, but only logarithmically. It also gives us hope: whenever the hypothesis set is finite, any consistent algorithm $\text{A}$ is a PAC learning algorithm. Similar results can be shown for the inconsistent case and for infinite hypothesis sets; we'll see that later.
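To get a feel for the numbers in these bounds, here is a small sketch (the function names and the example values of $|H|$, $\epsilon$, $\delta$ and $m$ are my own):

```python
import math

# Sample size needed so that a consistent hypothesis from a finite space H
# has true error at most eps with probability at least 1 - delta.
def sample_complexity(H_size, eps, delta):
    return math.ceil((1 / eps) * (math.log(H_size) + math.log(1 / delta)))

# Error bound that holds with probability at least 1 - delta given m samples.
def error_bound(H_size, m, delta):
    return (1 / m) * (math.log(H_size) + math.log(1 / delta))

print(sample_complexity(H_size=10**6, eps=0.05, delta=0.01))   # ~369 samples
print(error_bound(H_size=10**6, m=1000, delta=0.01))           # ~0.0184
```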