Supervised learning superstitions cheat sheet

This notebook contains my notes and beliefs about several commonly-used supervised learning algorithms. My dream is that it will be useful as a quick reference for people who are studying for machine learning interviews/quizzes/etc.

After some setup code, the methods discussed are:

  • Logistic regression
  • Decision trees
  • Support vector machines
  • K Nearest neighbors
  • Naive Bayes

To better understand each classifier we train on various versions of the "two moons" dataset and plot empirical decision boundaries. Each plot shows the training data on top of a few thousand randomly chosen points which have been colored by the output of the learned model. Superstition #1: the plots suggest that linear classifiers are often outperformed on high-quality training sets but still produce sane results on small, noisy datasets. Note: not all the plots have the same xy dimensions.

The code is available here: https://github.com/rcompton/ml_cheat_sheet

In [1]:
import sklearn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.datasets
%pylab inline

#sklearn two moons generator makes lots of these...
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
Populating the interactive namespace from numpy and matplotlib
In [2]:
"""
Build some datasets that I'll demo the models on
"""
Xs = []
ys = []

#low noise, plenty of samples, should be easy
X0, y0 = sklearn.datasets.make_moons(n_samples=1000, noise=.05)
Xs.append(X0)
ys.append(y0)

#more noise, plenty of samples
X1, y1 = sklearn.datasets.make_moons(n_samples=1000, noise=.3)
Xs.append(X1)
ys.append(y1)

#less noise, few samples
X2, y2 = sklearn.datasets.make_moons(n_samples=200, noise=.05)
Xs.append(X2)
ys.append(y2)

#more noise, less samples, should be hard
X3, y3 = sklearn.datasets.make_moons(n_samples=200, noise=.3)
Xs.append(X3)
ys.append(y3)
In [3]:
def plotter(model, X, Y, ax, npts=10000):
    """
    Simple way to visualize the decision boundary by applying the model
    to randomly-chosen points inside the bounding box of X.
    Could alternately use sklearn's "decision_function".
    """
    xs = []
    ys = []
    cs = []
    x0spr = max(X[:,0]) - min(X[:,0])
    x1spr = max(X[:,1]) - min(X[:,1])
    for _ in range(npts):
        x = np.random.rand()*x0spr + min(X[:,0])
        y = np.random.rand()*x1spr + min(X[:,1])
        xs.append(x)
        ys.append(y)
        cs.append(model.predict([[x, y]])[0])  # sklearn expects a 2D array of samples
    ax.scatter(xs, ys, c=cs, alpha=.35)
    # overplotting is matplotlib's default behavior; ax.hold() is no longer needed
    ax.scatter(X[:,0], X[:,1],
               c=list(map(lambda x: 'r' if x else 'lime', Y)),
               linewidth=0, s=25, alpha=1)
    ax.set_xlim([min(X[:,0]), max(X[:,0])])
    ax.set_ylim([min(X[:,1]), max(X[:,1])])
    return

Logistic Regression

Logistic regression is the canonical example of a "discriminative" classifier (i.e. one that learns the mapping $f:X \rightarrow Y$ directly from the signal, as opposed to learning a model of how the data was generated). Here, $Y$ is categorical and $X$ may be either continuous or categorical. A good short introduction (which also touches on generalized linear models) is [10].

Logistic regression assumes that $f$ is a sigmoid function of an inner product with the features, i.e.

$$ P(Y=1 | X; \theta) = \sigma(\theta ^T X) = \frac{1}{1+\exp(-\theta ^T X)} $$
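As a concrete illustration, here is a minimal numpy sketch of that prediction rule (the names `sigmoid` and `predict_proba` are mine, not the notebook's, and no intercept term is included):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, X):
    # P(Y=1 | X; theta) for each row of X, with theta a 1-d parameter vector
    return sigmoid(X.dot(theta))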

Training a logistic regression classifier amounts to learning the parameter vector, $\theta$. The simplest approach is via maximum likelihood estimation, i.e.:

\begin{eqnarray} \theta^{MLE} &=& argmax_\theta \; \mathcal{L}(\theta)\\ &=& argmax_\theta \; P(Y | X; \theta)\\ \end{eqnarray}

Now, in the special case where $Y$ can be only $0$ or $1$, the conditional density can be concisely written as

$$ P(Y | X; \theta) = \sigma(\theta ^T X)^Y (1-\sigma(\theta ^T X))^{1-Y} $$

Assuming the $M$ training examples were generated independently, we can write the likelihood as:

\begin{eqnarray} \mathcal{L}(\theta) = \prod_{i=1}^M P(Y_i | X_i; \theta) \end{eqnarray}

It's easier to work with the log of the product; this leads to a convex, unconstrained optimization:

$$ \theta^{MLE} = argmax_\theta \; \sum_{i=1}^M \log(P(Y_i | X_i; \theta)) $$
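For concreteness, a small numpy sketch (illustrative only, not part of the original notebook code) of the log-likelihood being maximized:

import numpy as np

def log_likelihood(theta, X, y):
    # sum over training examples of log P(Y_i | X_i; theta), with y in {0, 1}
    p = 1.0 / (1.0 + np.exp(-X.dot(theta)))   # sigma(theta^T X_i) for each example
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))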

Remark: optimization problems tend to be stated as minimizations, so it's common to see this version too:

$$ \theta^{MLE} = argmin_\theta \; \sum_{i=1}^M -\log(P(Y_i | X_i; \theta)) $$

The standard way to solve this maximization is via stochastic gradient ascent. Usual gradient ascent uses an update rule over all of $\theta$, e.g. $\theta^{next} = \theta + \alpha \nabla \mathcal{L}(\theta)$. If we have lots of training data, computing the full gradient is slow. Stochastic gradient ascent instead updates $\theta$ using the gradient contribution of a single (randomly-ordered) training example at a time. The algorithm looks like this:

$$ \begin{align} 1. & \mbox{Select a random ordering of the training examples} \\ 2. & \mbox{while not converged:}\\ & \quad \mbox{for }i \in \{1, \ldots, M\}: \\ & \qquad \theta^{next} = \theta + \alpha \nabla \mathcal{L}_i(\theta) \end{align} $$

Here $\mathcal{L}_i(\theta) = \log(P(Y_i | X_i; \theta))$ is the contribution of the $i$-th training example to the log-likelihood.
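A rough numpy sketch of this procedure (not the notebook's code; the step size `alpha`, the epoch count, and the missing convergence check are simplifications):

import numpy as np

def sgd_logistic(X, y, alpha=0.1, n_epochs=100):
    theta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in np.random.permutation(len(y)):           # random ordering of examples
            p = 1.0 / (1.0 + np.exp(-X[i].dot(theta)))    # P(Y=1 | X_i; theta)
            theta += alpha * (y[i] - p) * X[i]            # gradient of log P(Y_i | X_i; theta)
        # a real implementation would test for convergence here
    return theta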

Unlike naive Bayes, logistic regression does not assume conditional independence (or any relation) between the features. As a result, when several extraneous features are provided, the MLE optimization can drive down the training loss by putting weight on the irrelevant features (i.e. overfitting). The standard approach to reduce overfitting is to replace the MLE estimate with a MAP estimate where a Laplacian prior, $P(\theta) = (\beta/2)^N \exp(-\beta |\theta|_1)$, has been assumed (the parameter $\beta$ is user-specified) [5]. The resulting minimization for such a MAP estimate is:

$$ \theta^{MAP} = argmin_\theta \; \beta |\theta|_1 + \sum_{i=1}^M -\log(P(Y_i | X_i; \theta)) $$

Note the $L_1$ regularization. It favors a sparser $\theta$, i.e. a simpler model that is less prone to overfitting, at the expense of a more difficult optimization problem.
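In scikit-learn the $L_1$-regularized (MAP) version looks roughly like the sketch below. Note that sklearn's `C` is the *inverse* regularization strength (roughly $1/\beta$ in the notation above), and depending on the sklearn version the `solver='liblinear'` argument may or may not be required for the L1 penalty:

import sklearn.linear_model

# L1-penalized logistic regression; C ~ 1/beta, so smaller C => sparser theta
l1_classifier = sklearn.linear_model.LogisticRegression(penalty='l1',
                                                        solver='liblinear',
                                                        C=1.0)
l1_classifier.fit(X0, y0)      # X0, y0: the low-noise moons dataset built above
print(l1_classifier.coef_)     # coefficients shrink toward zero as C decreases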

Pros:

  • Liblinear is usually available [1]
  • No user-defined parameters to experiment with unless you regularize (which you probably should); $\beta$ is intuitive
  • Fast to train
  • Fast to apply
  • You probably won't get fired for suggesting it
  • No assumptions about $P(X|Y)$ during the learning stage (true of any discriminative method)
  • Logistic Regression is available in Spark [6]
  • Mark Tygert has done some writing about it [7]
  • Robust to outliers compared with LDA (LDA assumes normally distributed training data within each class)

Cons:

  • Often less accurate than the newer methods
  • Interpreting $\theta$ isn't straightforward
  • $L_1$ optimization is not easy
  • Bizarre use of the word "regression" until you learn about generalized linear models

Neutral:

  • Parametric
  • Discriminative
  • Equivalent to a neural network with no hidden layers (a single sigmoid output unit)
  • Closely related to Gaussian naive Bayes: under certain assumptions both learn the same form of $P(Y|X)$ [2]

[1] http://www.csie.ntu.edu.tw/~cjlin/papers/liblinear.pdf

[2] https://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf

[3] https://cs.stanford.edu/people/ang//papers/nips01-discriminativegenerative.pdf

[4] http://stackoverflow.com/questions/879432/what-is-the-difference-between-a-generative-and-discriminative-algorithm

[5] https://cs.stanford.edu/people/ang//papers/aaai06-efficientL1logisticregression.pdf

[6] https://spark.apache.org/docs/1.1.0/mllib-linear-methods.html

[7] http://www.cims.nyu.edu/~tygert/lr.pdf

[8] http://math.arizona.edu/~hzhang/math574m/2014Lect6_LDAlog.pdf

[9] http://webdocs.cs.ualberta.ca/~greiner/C-466/SLIDES/3b-Regression.pdf

[10] http://cs229.stanford.edu/notes/cs229-notes1.pdf

In [4]:
import sklearn.linear_model
classifier = sklearn.linear_model.LogisticRegression()

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(11,13))
for i, (X, y) in enumerate(zip(Xs, ys)):
    classifier.fit(X, y)
    plotter(classifier, X, y, ax=axes[i//2, i%2])
plt.show()