Discussion of "The two cultures," by Leo Breiman

Presented on Feb 18, 2016
Presenter: Greg
Scribe: Allison

There are two goals in analyzing data: prediction and extracting information (or understanding and exploring the data). There are two approaches toward these goals: data modeling and algorithmic modeling.

The data modeling culture is popular among statisticians, and consists of assuming a stochastic model of how the data came to be (e.g., linear/logistic regression and the Cox model) and then estimating the model parameters from the data. The fitted parameters can then be used for either information extraction or prediction. Models are validated using goodness-of-fit tests and by examining residuals.

The algorithmic modeling culture considers the process from the input variables x to the output variables y to be unknown. Instead of assuming the data were generated in a particular way, this approach finds some function to approximate/mimic the data-generating process, such as a decision tree or a neural net. Models are validated by predictive accuracy, and this approach is most popular in disciplines outside of statistics, like computer science.
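To make the contrast concrete, here is a minimal sketch (not from the paper; the simulated dataset and the scikit-learn models are illustrative assumptions) of the two workflows applied to the same data: a logistic regression whose coefficients are read off for information extraction, and a decision tree judged purely by held-out predictive accuracy.

    # Illustrative sketch (not from Breiman's paper): the same data analyzed by the
    # two cultures. The data-modeling route fits a logistic regression and inspects
    # its coefficients; the algorithmic route fits a decision tree and is judged
    # purely on held-out predictive accuracy. Dataset and models are placeholders.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Data-modeling culture: assume a stochastic model, read off its parameters.
    lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print("logistic regression coefficients:", lr.coef_.round(2))
    print("logistic regression test accuracy:", lr.score(X_te, y_te))

    # Algorithmic-modeling culture: treat the mapping x -> y as unknown and
    # validate by predictive accuracy alone.
    tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)
    print("decision tree test accuracy:", tree.score(X_te, y_te))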

A more recent distinction within the data modeling culture, not in the paper: discriminative models vs. generative models. The author focuses only on discriminative data models.

This paper argues that the focus on data modeling in statistics has:
  • led to irrelevant theory and questionable scientific conclusions,
  • prevented statisticians from using more suitable algorithmic models, and
  • prevented statisticians from working on interesting problems.

Case studies:

The ozone project: a complete failure due to high false alarm rate. He blames generative modeling.
The chlorine project: success with 95% accuracy, using a decision tree. He claims it is a win for discriminative models.

Data models: conclusions are drawn about the model, not about the data. Gender bias in salary: a study collected 25 variables and concluded at the 5% significance level that gender bias exists in salaries. He concludes this could be erroneous, but does not give solutions.

Limitations of data models: he brushes away more complicated models (outside regression, etc.) as being too difficult to interpret.

The theory of algorithmic modeling still assumes that data are i.i.d. from an unknown multivariate distribution.
  • Rashomon: the multiplicity of good models. In statistics, many equations f(x) can produce the same error rate; which should we choose? Bagging (introduced by Breiman, the author) helps for both algorithmic and data modeling.
  • Occam's razor: interpretability and prediction accuracy are at odds; Breiman's advice is to go for predictive accuracy first.
  • Bellman's curse of dimensionality: we want to reduce the number of variables to reduce the variance of our model.

Greg showed that in genomics probabilistic models are more popular (in terms of citations), but algorithmic models are also very popular.

Comments / responses:
Efron: “At first glance Leo Breiman’s stimulating paper looks like an argument against parsimony and scientific insight, and in favor of black boxes with lots of knobs to twiddle. At second glance it still looks that way, but the paper is stimulating, and Leo has some important points to hammer home.”
Hoadley: “Algorithmic modeling is a very important area of statistics. It has evolved naturally in environments with lots of data and lots of decisions. But you can do it without suffering the Occam dilemma; for example, use medium trees with interpretable GAMs in the leaves.”
Parzen: prediction (management/profit) vs. science (truth)


Discussion of "Strength of Weak Learnability" by Robert E. Schapire

Presented: March 31, 2016
Presenter: Irineo
Scribe: Adam

Oracle, sampled.
Knowledge: barely above chance
Double boost away!

This past week we discussed the classical paper "Strength of Weak Learnability" by Robert E. Schapire, which describes a boosting algorithm for (conditionally) improving any better-than-chance classifier into an arbitrarily good classifier. Schapire proves this strong claim for any classification problem in the probably approximately correct (PAC) setting. The proof is a proof by construction: Schapire presents a method, termed boosting, which can take a better-than-chance classifier and create an arbitrarily accurate classifier. Towards the end of the paper, the practical aspects of the algorithm are analyzed, and the boosting method is improved via double boosting, which increases the algorithm's speed.

This paper has flavors of a number of very important current research directions. The boosting method has a lot in common with adaptive sampling methods in how it modifies the data used (or sampled) in order to iteratively improve deficient areas of a classifier. The final sections of this paper also touch on the important idea of the underlying complexity of functions, by relating learnability to the ``size'' of the circuit needed to calculate (or even approximate) functions of a certain class.

The work in this paper is focused on the PAC setting, where a noise-free oracle provides binary outputs for points sampled from a space X according to the distribution D. Specifically, the oracle returns EX = c(x) \in \{+1,-1\} for a given x\in X. The goal of learning in this PAC setting is to create an output "hypothesis" function h(x), which returns a binary value for a given input x\in X and matches the oracle function for an arbitrarily large fraction of the samples of X. Within the PAC setting, strong and weak learnability are defined with respect to the achievable accuracy of a derivable hypothesis h(x):
Weak Learnability: A function c(x) is weakly learnable if for all
  • sample numbers m,
  • class functions c(x) with "size" s,
  • sampling distributions D,
  • accuracy levels \epsilon > 0,
  • probability values \delta \leq 1,
there exist a method A and a polynomial p(m,s) such that, with access to the oracle samples EX_D, the method A outputs, in time polynomial in (m, s, \delta^{-1}), a hypothesis function h(x) whose error is at most 1/2 - p^{-1}(m,s), with probability at least 1-\delta.
Strong Learnability: A function c(x) is strongly learnable if for all
  • sample numbers m,
  • class functions c(x) with "size" s,
  • sampling distributions D,
  • accuracy levels \epsilon > 0,
  • probability values \delta \leq 1,
there exists a method A that, with access to the oracle samples EX_D, outputs, in time polynomial in (m, s, \delta^{-1}), a hypothesis function h(x) that is \epsilon-close to c(x) with probability at least 1-\delta.

With these definitions in hand, this paper came at a time when the distinction between the two forms of learnability was in question. In particular, whether a class could be weakly learnable but not strongly learnable was of great interest. In some cases, such as when the sampling distribution D is fixed to be uniform, it had been shown that weak learnability does NOT imply strong learnability. Schapire's paper is thus quite remarkable in that it shows that for a class of settings these two forms imply each other. Moreover, Schapire actually provides a method to transform weak hypotheses into strong hypotheses.

The method (which is also the proof, as the paper is based on proof-by-construction) has the flavor of an early adaptive sampling scheme. The algorithm starts with a weak learning method A and initializes by learning a weak hypothesis h(x) on a set of samples from the oracle EX. The method then modifies the sampling procedure to focus on locations where the first hypothesis makes errors. Essentially, points from the oracle outside of these regions are rejected at a rate that allows half the points from the modified oracle sampler EX_1 to come from the high-error locations. The method A is then used to make a new hypothesis h_1(x) based on these new points, and the total classification at a particular location x is obtained by majority vote over all hypotheses discovered in this fashion. This procedure is repeated with new oracle samplers EX_2, EX_3, \ldots until the areas that are incorrectly classified are arbitrarily small.
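A rough sketch of this three-hypothesis construction follows (an illustrative simplification, not Schapire's exact boosting-by-filtering procedure; the dataset and the decision-stump weak learner are assumptions):

    # Simplified sketch of the majority-of-three construction: h1 is trained
    # normally, h2 on a sample in which half the points are ones h1 misclassifies,
    # h3 on points where h1 and h2 disagree; the output is their majority vote.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X, y = make_classification(n_samples=5000, n_features=10, random_state=0)

    def weak_learner(X, y):
        # A decision stump stands in for the better-than-chance weak learner.
        return DecisionTreeClassifier(max_depth=1).fit(X, y)

    h1 = weak_learner(X, y)

    # Filtered second sample: half where h1 errs, half where it is correct.
    err = h1.predict(X) != y
    idx2 = np.concatenate([rng.choice(np.where(err)[0], 500),
                           rng.choice(np.where(~err)[0], 500)])
    h2 = weak_learner(X[idx2], y[idx2])

    # Third sample: points on which h1 and h2 disagree.
    disagree = h1.predict(X) != h2.predict(X)
    h3 = weak_learner(X[disagree], y[disagree]) if disagree.any() else h1

    # Majority vote of the three hypotheses (evaluated on the training set for brevity).
    votes = h1.predict(X) + h2.predict(X) + h3.predict(X)
    boosted = (votes >= 2).astype(int)
    print("h1 accuracy:      ", (h1.predict(X) == y).mean())
    print("majority accuracy:", (boosted == y).mean())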

This method can boost any weak method A that is \alpha-close into a method A` that is \gamma-close with \gamma \leq 3\alpha^2 - 2\alpha^3 (which is smaller than \alpha whenever \alpha < 1/2). The method can then be iterated on itself to turn any \epsilon-close method into a method with arbitrarily good performance. With Theorem 2 proving that this algorithm does just what is desired, Schapire then further improves the algorithm with the double-boosting method.

Double boosting was presented to alleviate the problem of sampling over small erroneous regions. Essentially, it takes many more samples to get a good sampling of small areas when a hypothesis (say h_2) is already very close to the desired error rate. The work-around is to introduce additional errors into the too-accurate hypothesis in order to artificially increase the sampling rate. The extra error is introduced by using a Bernoulli sample with a carefully selected bias to modify the hypothesis at some example points w. For each w, if the sample is a one, then the hypothesis at that example is not changed; otherwise, the hypothesis at w is set to \neg c(w), creating an error at that location. This enlarges the regions to sample from, hastening the algorithm.

The final sections of the paper draw some interesting conclusions from the presented results. First, it is shown that if a class is learnable at all, boosting can recover a strong learning method for it. The logic here is that any strong learning method A` can be ``dumbed down'' by artificially setting its error rate to \epsilon = 0.25. Boosting can then provably recover a strong learning method by using the dumbed-down method as the weak learning step.

Schapire also draws interesting connections to polynomial computation of functions. Specifically, if the binary function in question is learnable, then there exists a polynomial-size circuit for computing it; equivalently, if no polynomial-size circuit can compute a function, then that function is not learnable. Additionally, even approximating such functions to a given error tolerance can be very difficult with circuits of limited size. This result has interesting implications for the circuits needed to compute functions, by exposing these size-complexity limitations.


Discussion of: Consistent Nonparametric Regression (Charles Stone, 1977)

Presenter: Alex Ochoa
Scribe: Irineo Cabreros
Date: Dec 3, 2015

Introduction and Problem Statement:
This week we discussed the 1977 paper ``Consistent Nonparametric Regression" by Charles Stone. Given a collection of observations (x_i, y_i) drawn from some joint distribution P(X, Y), our goal is to estimate (non-parametrically) the conditional distribution P(Y|X), from which we can compute many quantities defined by this distribution (conditional expectations, variances, covariances, etc.). Fig. 1 gives a rough idea of what is going on. Suppose we wish to estimate E(Y|X=x) for this one-dimensional noisy sinusoidal data. Perhaps the most intuitive approach would be to take the average of the values y_i that are within a small window (dashed green lines) of the point x (blue line). This corresponds to an estimate of the distribution \hat{P}(Y|X=x) equal to the empirical distribution of the values of y_i within the window. In this example, both X and Y are in \mathbb{R}, but in the framework of Stone, X and Y can each lie in different high-dimensional spaces \mathbb{R}^d and \mathbb{R}^{d'}.

Figure 1: Simplistic method of non-parametric regression. E(Y|X=x) is estimated at the point x (blue line) by the average of the points within a small window around x (dashed green lines).
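A minimal sketch of this window-average estimator (the sinusoidal data and the window half-width h are illustrative assumptions):

    # Estimate E(Y|X=x) by averaging the y_i whose x_i fall within half-width h of x.
    # The corresponding weights are W_{n,i}(x) = 1/(#points in window) inside the
    # window and 0 outside, so the estimate is a weighted average of the y_i.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 500
    x_obs = rng.uniform(0, 2 * np.pi, n)
    y_obs = np.sin(x_obs) + rng.normal(scale=0.3, size=n)

    def window_average(x, x_obs, y_obs, h=0.2):
        in_window = np.abs(x_obs - x) <= h
        if not in_window.any():
            return np.nan  # no observations near x
        return y_obs[in_window].mean()

    print(window_average(np.pi / 2, x_obs, y_obs))  # should be close to sin(pi/2) = 1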

In this paper Stone seeks estimators of the form:

(1)   \begin{equation*} \hat{P}(Y \in A | X = x) = \sum_{i=1}^n W_{n,i}(x,x_1,\hdots, x_n) I_A(y_i)  \end{equation*}

where A is a Borel set, and

    \begin{displaymath}    I_A(y) = \left\{      \begin{array}{lr}        1 & : y \in A\\        0 & : \text{else}      \end{array}    \right. \end{displaymath}

Main Results: The W_{n,i} in equation (1) are weight functions, and providing conditions under which these weight functions result in ``consistent" estimators of P(Y|X=x) is the main result of this paper. When we are thinking about estimating the conditional expectation, we have (plugging in equation (1)):

    \begin{align*} \hat{E}_n(Y|X=x) &= \int_y y\hat{P}(y \in dy | X = x) \\ &= \sum_{i=1}^n W_{n,i}(x,x_1,\hdots,x_n) y_i \end{align*}

Therefore the estimated conditional expectation is just a weighted average of the y_i.

Definition
A sequence \{W_n\} of weights is said to be consistent if whenever (X,Y), (X_1,Y_1),(X_2,Y_2),\hdots are i.i.d., Y is real valued, r>1, and E|Y|^r < \infty, then \hat{E}_n(Y|X) \to E(Y|X) in L^r. Here Z_n \to Z in L^r means that E|Z_n - Z|^r \to 0.

Two natural constraints one may wish to impose on the weights are

  • normality: \sum_{i=1}^n W_{n,i}(X) = 1
  • non-negativity: W_{n,i}(X) \ge 0 ~~ \forall i

When both of these constraints are satisfied, \{W_{n}\} is referred to as a probability weight function. Theorem 1 considers general \{W_n\}, while Corollary 2 is restricted to probability weight functions.

Theorem 1. Let f \ge 0 be a Borel function in the domain of X. Suppose \exists C,D \ge 0 such that:

(2)   \begin{align*} E\left[\sum_{i=1}^n |W_{n,i} (X)| f(X_i) \right] \le C E[f(X)] ~~~~~ & \forall n \ge 1, \\ P\left(\sum_{i=1}^n |W_{n,i}(X)| \le D\right) = 1 ~~~~~ & \forall n \ge 1, \\ \sum_{i=1}^n |W_{n,i}(X)| I_{||X_i-X|| > a} \overset{P}{\to} 0 ~~~~~& \forall a > 0, \\ \sum_{i=1}^n W_{n,i}(X) \overset{P}{\to} 1, ~~~~~&  \\ \underset{i}{\max} |W_{n,i}(X)| \overset{P}{\to} 0.  ~~~~~& \end{align*}

Then that sequence \{W_n\} is consistent. These conditions are necessary and sufficient.

The first condition states that the estimate of the conditional expectation of f(X) must be bounded by a constant times the true expectation. The second condition states that the sum of the absolute weights must be bounded. The third (important) condition states that the total weight outside of a ball of radius a around X must vanish as n goes to infinity. The fourth condition indicates that the normality condition must hold asymptotically, and the final condition states that all W_{n,i}(X) must vanish asymptotically (i.e. no constant mass can be given to any single point as n\to \infty). Notice that these conditions rule out several ad-hoc schemes. For instance, taking the average of the ten closest points to x as an estimator of the conditional expectation E(Y|X=x) will not result in a consistent estimator (since \underset{i}{\max} |W_{n,i}(X)| = 0.1 does not vanish). This theorem also rules out some obviously nonsensical estimators: for instance, any estimator that places no weight inside balls around x (violating condition 3). The following corollary restricts to the case of probability weight functions:

Corollary 2. Let f \ge 0 be a Borel function in the domain of X. Suppose \exists C \ge 1 such that:

(3)   \begin{align*} E\left[\sum_{i=1}^n |W_{n,i} (X)| f(X_i) \right] \le C E[f(X)] ~~~~~ & \forall n \ge 1, \\ \sum_{i=1}^n |W_{n,i}(X)| I_{||X_i-X|| > a} \overset{P}{\to} 0 ~~~~~& \forall a > 0, \\ \underset{i}{\max} |W_{n,i}(X)| \overset{P}{\to} 0.  ~~~~~& \end{align*}

Then that sequence is consistent since the omitted conditions (2 and 4) are trivially satisfied for probability weights. These conditions are necessary and sufficient.
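As a quick illustration of the max-weight condition (an illustrative sketch, not from the paper), consider nearest-neighbor probability weights W_{n,i}(x) = 1/k_n for the k_n nearest neighbors of x and 0 otherwise: a fixed k_n violates the final condition, while k_n growing with n (but with k_n/n \to 0) satisfies it.

    # Nearest-neighbour probability weights: 1/k on the k nearest x_i, 0 elsewhere.
    import numpy as np

    def knn_weights(x, x_obs, k):
        w = np.zeros(len(x_obs))
        nearest = np.argsort(np.abs(x_obs - x))[:k]
        w[nearest] = 1.0 / k
        return w

    rng = np.random.default_rng(0)
    for n in [100, 10_000, 1_000_000]:
        x_obs = rng.uniform(0, 1, n)
        w_fixed = knn_weights(0.5, x_obs, k=10)              # max weight stuck at 0.1
        w_grow = knn_weights(0.5, x_obs, k=int(np.sqrt(n)))  # max weight -> 0
        print(n, w_fixed.max(), w_grow.max())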

Other results:

  • We briefly discussed the construction of several explicit consistent \{W_n\} sequences known as nearest neighbor weights (corollary 3).
  • Stone presents an idea for obtaining weights W_n from probability weights U_n that are locally linear fits, and which behave well after trimming. This idea is combined with a global linear fit, which he calls trend removal.
  • Second order quantities, such as covariances, variances, standard deviations, and correlations, are estimated consistently by plugging in the expectation estimator and combining with the continuous mapping theorem. Conditional quantiles also work out.

Discussion of: Why Isn't Everyone a Bayesian? Efron, 1986

Presenter: Will Lowe
Scribe: Greg Darnell
Date: Feb 11, 2016

Today we discussed Brad Efron's ``Why Isn't Everyone a Bayesian'' (1986). The article discusses four primary reasons why frequentist approaches may be used or preferred over their Bayesian counterparts: 1) ease of use, 2) model building, 3) division of labor, 4) objectivity. This paper is important to interpret in its context (1986), when, as Efron puts forth, most statistical analysis was conducted in a frequentist way. As a group we noted that Lindley's predicted transition from primarily frequentist to Bayesian practice has become evident over the past 30 years, but clearly frequentist practices are still in wide use.

In ``competition'' are three schools of thought: 1) Fisherian, 2) Neyman-Pearson-Wald (NPW), and 3) Bayesian. Fisherian statistics concerns itself with significance testing, NPW statistics with hypothesis testing, and Bayesian statistics with inference.

Fisherian statistics, Efron argues, is closer to Bayesian statistics than NPW by virtue of the belief that there is a \emph{correct} inference. While our discussion led us to question the meaning of ``correct'' in this context, Efron puts forth the following example of correct inference for the center \theta of a random sample x_i \sim \text{Cauchy}(\theta). The correct 95% confidence interval for \theta is \hat{\theta} \pm 1.96 / \sqrt{\hat{I}}, where \hat{I} = -\ddot{\ell}(\hat{\theta}) is the observed Fisher information (the negative second derivative of the log-likelihood at the MLE). The asymptotically equivalent interval based on the expected Fisher information is not ``correct'' in this sense (see Efron and Hinkley 1978 for a lengthier discussion). Using the word ``automatic,'' Efron puts forth the idea that Fisherian statistics is used because of point 1), namely the ease of use of methods such as maximum likelihood. On the other hand, Bayesian analysis requires more effort in creating models and building inference machinery.
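A small numerical sketch of this example (simulated data, sample size, and the numerical second derivative are illustrative assumptions):

    # Cauchy example: form the interval theta_hat +/- 1.96 / sqrt(I_obs), where
    # I_obs = -l''(theta_hat) is the observed Fisher information at the MLE.
    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(0)
    x = rng.standard_cauchy(50) + 2.0        # true theta = 2

    def neg_loglik(theta):
        return np.sum(np.log(1.0 + (x - theta) ** 2))

    med = np.median(x)
    theta_hat = minimize_scalar(neg_loglik, bounds=(med - 2, med + 2),
                                method="bounded").x

    # Observed information from a central second difference of the log-likelihood.
    eps = 1e-4
    obs_info = (neg_loglik(theta_hat + eps) - 2 * neg_loglik(theta_hat)
                + neg_loglik(theta_hat - eps)) / eps ** 2
    half_width = 1.96 / np.sqrt(obs_info)
    print("theta_hat:", theta_hat, "95% CI:", (theta_hat - half_width, theta_hat + half_width))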

As an aside, we note that many of the difficulties in using Bayesian methods have been mitigated recently with the advent of mathematical and computational methods that allow for inference in Bayesian models. Our presenter posits that the continued preference for frequentist models over Bayesian ones is largely sociological (a matter of practice).

The NPW school concerns itself with neither correctness nor coherence, but rather takes a pragmatic view. In the face of inference about a distribution F that a researcher may know nothing about, the NPW school might argue that a Bayesian prior on F is difficult to specify. Finally, the decision tree below is clearly overfit to the observed data: cross-validation shows much lower accuracy out-of-sample than in-sample. Bayesian analysis is problematic here because there is no notion of overfitting the tree - the Bayesian is only concerned with the likelihood function, not with which data were used to construct the tree.

[Figure: decision tree example from Efron (1986), illustrating overfitting to the observed data]

The final issue raised against Bayesian statistics is objectivity. The ``dominant Bayesian school'' is subjective (promoted by de Finetti and Savage). While subjective Bayes can be useful, Efron argues that science must be objectively correct: ``scientific objectivity'' is more than a catch-phrase. This is perhaps a more difficult claim to validate, as no rigorous definition is put forth. However, the Bayesian school promoted by Bayes and Laplace is objective Bayes theory, where prior distributions are meant to ``capture the idea of objectivity.''


Discussion of The 1988 Neyman Memorial Lecture: A Galtonian Perspective on Shrinkage Estimators, Stigler 1990

Presenter: Wei Hao
Scribe: Li-Fang Cheng
Date: Oct 22, 2015

This paper examines Stein's paradox (or the James-Stein estimator) from Galton's perspective. The problem starts from a simple setting: for a collection of k independent measurements X_{1}, X_{2}, \cdots, X_{k}, each drawn from a normal distribution \mathcal{N}(\theta_{i}, 1) with its own parameter \theta_{i}, we want to find the joint estimator \hat{\theta} = (\hat{\theta_{1}}, \hat{\theta_{2}}, \cdots, \hat{\theta_{k}}) by minimizing the loss function:

(1)   \begin{gather*} L(\theta, \hat{\theta})=\sum_{i=1}^{k}(\theta_{i}-\hat{\theta_{i}})^{2}. \end{gather*}

What Stein discovered is that in the case of k \geq 3, the ``ordinary'' estimator \hat{\theta}_{i}^{o}=X_{i} is no longer admissible. For instance, the James-Stein estimator \hat{\theta}^{\text{JS}}, written componentwise as

    \[ \hat{\theta}_{i}^{\text{JS}} = \left( 1 - \frac{c}{S^{2}}\right)X_{i}, \quad 0<c<2(k-2), \quad S^{2}=\sum_{j=1}^{k}X_{j}^{2}, \]

would dominate in the case of three or more dimensions. Stigler mentioned in the paper that there are several explanations (e.g. empirical Bayes estimator or ``pre-test'' estimator) but none of them are fully satisfactory due to additional prior distribution assumptions.

Instead, Stigler provides another explanation by deriving the James-Stein estimator from a simple linear regression problem without any prior assumptions. In this problem setting the X_{i} are observed but the \theta_{i} are unknown, so an ideal estimate would be E(\theta|X). The idea is then to consider the regression line of \theta on X (i.e., \theta_{i} as a function of X_{i}) and to find the best linear estimator. To be clear, by estimating \theta_{i} with the form

    \[ \hat{\theta_{i}}= a + bX_{i} \]

and minimizing the loss function, we can find the least squares line:

    \[ \hat{\theta_{i}}=\bar{\theta} + \hat{\beta}(X_{i}-\bar{X}) \]

where

    \[ \hat{\beta} = \frac{\sum{(X_{i}-\bar{X})(\theta_{i}-\bar{\theta})}}{\sum{(X_{i}-\bar{X})^{2}}}. \]

This regression line is optimal both for the loss function and as an approximation to the ideal regression function E(\theta|X). Stigler then shows that both the Efron-Morris estimator and the James-Stein estimator are estimators of this least squares line. This explains Stein's paradox in a transparent way: the ordinary estimators \hat{\theta}_{i}^{o}=X_{i} are derived from the regression line of X on \theta, while the Efron-Morris and James-Stein estimators are derived from approximations to the correct regression of \theta on X.
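A quick simulation (not from Stigler's paper; the dimension, true means, and number of replications are illustrative) comparing the risk of the ordinary estimator with that of the James-Stein estimator:

    # Compare total squared-error risk of theta_i = X_i against the James-Stein
    # estimator (1 - (k-2)/S^2) X_i with S^2 = sum_j X_j^2, for k = 10.
    import numpy as np

    rng = np.random.default_rng(0)
    k, n_reps = 10, 5000
    theta = rng.normal(size=k)                      # fixed true means

    loss_ols, loss_js = 0.0, 0.0
    for _ in range(n_reps):
        x = theta + rng.normal(size=k)              # X_i ~ N(theta_i, 1)
        shrink = 1.0 - (k - 2) / np.sum(x ** 2)     # c = k - 2
        loss_ols += np.sum((x - theta) ** 2)
        loss_js += np.sum((shrink * x - theta) ** 2)

    print("ordinary estimator risk:", loss_ols / n_reps)   # approximately k
    print("James-Stein risk:       ", loss_js / n_reps)    # strictly smaller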


Discussion of: Stein's Estimation Rule and its Competitors—An Empirical Bayes Approach, Efron and Morris 1972

Scribed by Chee Chen

This note focuses on the mathematical and statistical intuitions rather than
the math details presented in the discussion.

Motivation and Problem

A '|' will be used to denote either conditioning in the Bayesian sense or parameters in the frequentist sense. For example, f\left(\cdot|\theta\right) or F\left(\cdot|\theta\right) denotes the conditional density or distribution given \theta when \theta has a prior, and it denotes the same quantities with \theta as the parameter when \theta is deterministic. Consider the problem of estimating the mean vector \theta=\left(\theta_{1},\ldots,\theta_{k}\right)^{T} based on observing x|\theta\sim MVN\left(\theta,I\right), where x=\left(x_{1},\ldots,x_{k}\right)^{T}. The superscript T means transpose, MVN stands for ``multivariate normal'', and I is the identity matrix. For any estimator \hat{\theta}=\left(\hat{\theta}_{1},\ldots,\hat{\theta}_{k}\right)^{T} of \theta, consider the squared loss
L\left(\hat{\theta},\theta\right) = \frac{1}{k}\sum_{i=1}^{k}\left(\hat{\theta}_{i}-\theta_{i}\right)^{2}
and risk
R\left(\hat{\theta},\theta\right) = \mathbb{E}\,L\left(\hat{\theta},\theta\right).

Normal distribution, ordinary least squares and maximum likelihood
estimate

First, let \theta be deterministic. Then a candidate estimate of \theta is the ordinary least
squares (OLS) estimator \theta_{ols} , due to its excellent properties such as being the best linear unbiased estimator (BLUE) in terms of its variance. However, note that OLS estimator \theta_{ols} is a linear functional in the observed data x. More specifically, x|\theta\sim MVN\left(  \theta  ,I\right)   can be equivalently written as

x=\theta+\varepsilon\text{ \ with }\varepsilon\sim MVN\left(  0,I\right)  ,

which means that the design matrix is also I, the orthogonal projection is I=\left(I^{T}I\right)^{-1}I^{T}, and the OLS estimator \theta_{ols}=\left(I^{T}I\right)^{-1}I^{T}x=x. Put another way, in this case the OLS estimator \theta_{ols} is obtained by projecting x onto the whole Euclidean space \mathbb{R}^{k}, so the projection does nothing to x. Most importantly, the orthogonal projection I leaves each x_{i} centered at \theta_{i} and the x_{i}'s widely scattered. Note that the OLS estimator \theta_{ols} in this case is equal to the maximum likelihood estimate (MLE) of \theta because

\log f\left(  x|\theta\right)  \propto\log\left(  \exp\left(  -\frac{1}{2}\left\Vert x-\theta\right\Vert ^{2}\right)  \right)

where \left\Vert \cdot\right\Vert denotes the Euclidean norm. It is easy to see that
\mathbb{E}_{x}L\left(  \theta_{ols},\theta\right)  =1
where \mathbb{E}_{\cdot} denotes the expectation with respect to the distribution of the subscript.

Finally, note that the setting x|\theta\sim MVN\left(  \theta,I\right)    is equivalent to
x_{i}|\theta_{i}\sim MVN\left(  \theta_{i},1\right)  ,
i.e., there is one and only one observation x_{i} for each \theta_{i}. In other words, the sample size available to estimate each \theta_{i} is 1, which is in sharp contrast to the conventional setting where there are k observations for a single parameter. Statistically speaking, there is just not sufficient information to accurately estimate each \theta_{i}. Further, since the \theta_{i}'s are deterministic, the distributions of x_{i}|\theta_{i} can have centers \theta_{i} as wild and dispersed as possible.

To summarize, in the setting x_{i}|\theta_{i}\sim MVN\left(\theta_{i},1\right) with deterministic \theta_{i}'s, the OLS estimate equals the MLE, the orthogonal projection does nothing to the data x, and the OLS estimate is a linear estimator.

Shrinkage Estimators and nonlinear estimators

Recall the settings
x_{i}|\theta_{i}\sim MVN\left(  \theta_{i},1\right)  .
Once we understand that the OLS estimate is a linear estimator and that the orthogonal projection scatters the x_{i}'s very widely, it is natural to ask: can we do better in terms of the risk R\left(\hat{\theta},\theta\right) under the squared loss L\left(\hat{\theta},\theta\right) by considering nonlinear estimators? The answer is 'no' when k=1 or 2, since the squared loss is a quadratic function of the estimator \hat{\theta} and we do not have any more degrees of freedom to tweak the projections. However, the answer is 'yes' when k\geq3. Note that a k-dimensional simple random walk is recurrent if and only if k=1 or 2, and that there is a deep connection between this fact and the admissibility of estimators of \theta in the settings above.

Now the question is how to modify the OLS estimate \theta_{ols}. A natural perspective is to put a prior on the \theta_{i}'s, look at the Bayes risk of the estimator \hat{\theta}, and try to approximate the Bayes estimate \theta_{bayes} under the squared loss. In this perspective, the prior on the \theta_{i}'s tends to bundle the \theta_{i}'s together (think of a tail bound such as \Pr\left(\max_{1\leq i\leq k}\theta_{i}\geq C\right)\leq a_{k}\exp\left(-b_{k}C\right)); the x_{i}'s will then be less scattered, since x_{i}|\theta_{i} is highly concentrated around \theta_{i}, and thus a smaller R\left(\hat{\theta},\theta\right) can be achieved, since R\left(\hat{\theta},\theta\right) measures on average how scattered the estimated \hat{\theta}_{i}'s are. This is the approach taken by Efron and Morris in the paper under discussion.
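A small numerical check of this empirical Bayes reading (the choices of k, \tau^2, and simulation size are illustrative): if \theta_i \sim N(0, \tau^2) and x_i|\theta_i \sim N(\theta_i, 1), the Bayes estimate shrinks x_i by the factor 1 - 1/(1+\tau^2), and (k-2)/\|x\|^2 is an unbiased estimate of the unknown 1/(1+\tau^2), which is essentially how the James-Stein shrinkage factor arises.

    # Numerically verify that E[(k-2)/||x||^2] = 1/(1 + tau^2) in this model.
    import numpy as np

    rng = np.random.default_rng(0)
    k, tau2, n_reps = 20, 3.0, 100_000
    theta = rng.normal(scale=np.sqrt(tau2), size=(n_reps, k))
    x = theta + rng.normal(size=(n_reps, k))

    est = (k - 2) / np.sum(x ** 2, axis=1)
    print("mean of (k-2)/||x||^2:", est.mean())     # close to 1/(1 + tau^2)
    print("target 1/(1+tau^2):   ", 1.0 / (1.0 + tau2))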

However, a geometric point of view can be taken too, as revealed by James and Stein's original paper. Note that the James-Stein estimator is
\theta_{js}=\left(  1-\frac{k-2}{\left\Vert x\right\Vert ^{2}}\right)  x
or
\theta_{js,i}=\left(  1-\frac{k-2}{\left\Vert x\right\Vert ^{2}}\right)  x_{i}
There are three things to notice about \theta_{js} :
(1) in \theta_{js} each component \theta_{js,i} depends on all of the x_{j}'s via the factor \lambda=1-\frac{k-2}{\left\Vert x\right\Vert ^{2}} ,
(2) \theta_{js} is nonlinear in x , and it is an estimator employing a compound decision rule;
(3) the ``projection'' \mathcal{P} in \theta_{js} (which is not a projection matrix any more since \mathcal{P} is a nonlinear functional) is
\mathcal{P}=\left(  1-\frac{k-2}{\left\Vert x\right\Vert ^{2}}\right)  I\text{,}
it pulls (or shrinks) all the x_{i}'s towards the origin in order to estimate the \theta_{i}'s, and the amount of shrinkage is determined by the dimension k and by how dispersed the x_{i}'s are, i.e., by \left\Vert x\right\Vert ^{2}.

This makes sense since: (i) a large \left\vert x_{i}\right\vert does not necessarily mean that the corresponding \theta_{i} is large, since the sample size is 1; (ii) a large \left\Vert x\right\Vert ^{2} means that some of the \theta_{i}'s have to be large, but it is not clear which; (iii) since R\left(\hat{\theta},\theta\right) is the expectation of a quadratic function of \hat{\theta}, which essentially involves the variance of \hat{\theta} through \left\Vert x\right\Vert ^{2}, the appearance of \left\Vert x\right\Vert ^{2} in \lambda, or more generally \lambda=g\left(k,\left\Vert x\right\Vert^{2}\right), is expected.

Miscellaneous

Based on these intuitions, one can consider other cases of estimation where
shrinkage estimators are desirable.


Discussion of: A Stochastic Approximation Method, Robbins and Monro 1951

Presenter: Chee Chen
Scribe: Adam Charles
Date: 10/8/2015
Paper: A Stochastic Approximation Method, Robbins and Monro 1951

This week we kicked off the ``Classical Papers in Statistics'' series of the CSML reading group meetings by reviewing a seminal paper in the stochastic optimization literature: ``A Stochastic Approximation Method''. This paper presents a randomized approach to root finding that leverages the properties of weighted averages of random numbers. The randomized approach to root finding presented in this work has been cited by some as the basis for modern methods such as stochastic gradient descent.

The principal goal of this paper was to find a root x=\theta of the equation

(1)   \begin{gather*} M(x) = \alpha \end{gather*}

under the condition that M(x) is not explicitly known. As with other methods, such as Newton's method or gradient-based methods, the authors propose an iterative scheme wherein an initial guess for the root, x_1, is updated in the hope that \lim_{n\rightarrow\infty}x_n = \theta. In stark contrast to the preceding literature, which defines deterministic updates, the methodology used in this paper is based on randomized updates to the estimate of the root; the authors thus refer to it as a ``stochastic approximation method''.

For a standard, known function M(x), Newton's method can be used to quickly find a root. If M(x) is unknown, however, then standard methods may not work well. To find a root, the authors assume that, instead of being a directly known function, M(x) is defined as the expected value of a random variable y conditional on x:

(2)   \begin{gather*} M(x) = E[y|x] \end{gather*}

The proposed algorithm is based on a Markov chain where the values of x are iteratively chosen as

(3)   \begin{gather*} x_{n+1} = x_n + a_n(\alpha - y_n) = x_n + a_n\left(\alpha - M(x_n) - \epsilon_n\right), \qquad y_n = M(x_n) + \epsilon_n \end{gather*}

As compared to standard gradient methods, we can see that the random term in a sense approximates the gradient step, replacing it with a draw from the conditional distribution.
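A minimal sketch of the iteration (the regression function M, the noise level, and the step sizes a_n = 1/n are illustrative assumptions):

    # Robbins-Monro iteration for the root of M(x) = alpha from noisy observations
    # y_n with E[y|x] = M(x), using step sizes a_n = 1/n.
    import numpy as np

    rng = np.random.default_rng(0)
    alpha = 0.5
    M = np.tanh                          # stands in for the unknown increasing M(x)
    true_root = np.arctanh(alpha)

    x = 0.0                              # initial guess x_1
    for n in range(1, 10_001):
        y = M(x) + rng.normal(scale=0.3)        # noisy observation at x_n
        x = x + (1.0 / n) * (alpha - y)         # x_{n+1} = x_n + a_n (alpha - y_n)

    print("estimate:", x, "true root:", true_root)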

The bulk of the paper is spent proving when and how the above iterations converge.
In particular, the authors aim to show that E[(x_n - \theta)^2] \rightarrow 0 as n \rightarrow \infty, i.e. that x_n\rightarrow \theta in mean square (and hence in probability).
A critical point in this proof is the existence of a set of constants a_n such that

(4)   \begin{eqnarray*} \lim_{N\rightarrow\infty} \sum_{n=1}^N a_n & = & \infty \\ \lim_{N\rightarrow\infty} \sum_{n=1}^N a_n^2 & < & \infty \end{eqnarray*}

This pair of conditions on a_n ensures that the weights are large enough for the iterates to move all the way to the root, but not so large that the noise fails to average out: the sequence x_n converges to a single value rather than to a distribution. Another important component of the proof is the shape of the distribution of y_n. In the proof, the authors require y_n to be (almost surely) bounded. While they state that this condition should be replaceable with finite mean and variance, they also state that they do not know whether the proofs would still hold.

At this point the authors give an example of estimating a quantile (Section 4 in the paper).
Chee provided simulations for this example, and it seems that, while the algorithm is supposed to work, discrepancies such as using unbounded Gaussian distributions or mismatches between the conditional distribution and the function whose root is sought can cause more issues for accurate convergence than are discussed in the paper.

In the final small section, the authors talk about extending the method to more general regression problems. As an example, they mention a setting where the observed function M(x) is assumed to be linear in x (M(x) = \beta_0 + \beta_1x). The authors seem to recommend replacing procedures such as the least-squares estimate of the \beta values with some kind of model-free estimate using their root-finding method. This discussion is brief and ends on a bit of a cliff-hanger.

Overall, the importance of this paper can be summarized by its novel use of randomness in an iterative algorithm. The authors describe a deceptively simple setup and then provide an analysis of the algorithm under specific conditions. While this paper works in one dimension, and the approach leaves questions unanswered (such as the convergence rate and good choices of a_n), the concepts are very important in modern-day computing. Current methods employ similar localized, randomized algorithms with convergence and accuracy guarantees in settings where exhaustive calculations over large datasets are prohibitive.

 

 


Discussion of: Sparse Spectrum Gaussian Process Regression

Presenter: Mikio Aoi
Scribe: Li-Fang Cheng
Date: 07/02/15
Paper: Sparse Spectrum Gaussian Process Regression (Lázaro-Gredilla et al., 2010)
Theme: Gaussian Processes

In contrast to the paper discussed last week (``Gaussian Process Kernels for Pattern Discovery and Extrapolation'', Wilson et al.), which proposes a novel kernel inspired by the Fourier transform, today's paper focuses on improving the computational efficiency of Gaussian process (GP) regression. The proposed method is called the Sparse Spectrum Gaussian Process (SSGP) --- an approximation method that sparsifies the spectral representation of the covariance function in the GP model.

In a Gaussian process regression problem, what interests us most is the predictive distribution conditioned on the observed training inputs and outputs. Unfortunately, obtaining the exact predictive distribution requires \mathcal{O}(n^{3}) computation and \mathcal{O}(n^{2}) memory due to the inversion of an n \times n covariance matrix, where n is the number of training samples. These computational requirements limit GPs in real applications with large-scale datasets. In this paper, SSGP sparsely approximates a full stationary GP by decomposing the covariance function into a finite number m of pairs of trigonometric basis functions. The authors derive the main result in two ways:

From the perspective of Trigonometric Bayesian Regression:
Take the function f(\mathbf{x}) we are interested in and represent it directly with m pairs of basis functions with spectral frequencies \mathbf{s}_{r}:

    \[ f(\mathbf{x}) = \sum_{r=1}^{m}{a_{r}\cos(2 \pi \mathbf{s}_{r}^{\intercal} \mathbf{x}) + b_{r}\sin(2 \pi \mathbf{s}_{r}^{\intercal} \mathbf{x})}. \]

When using the same prior \mathcal{N}(0, \frac{\sigma_{0}^2}{m}) for the amplitudes a_{r} and b_{r} of each basis function, we can write the covariance function of f(\mathbf{x}) as

    \[ k(\mathbf{x}_{i}, \mathbf{x}_{j}) = \frac{\sigma_{0}^{2}}{m}\phi(\mathbf{x}_{i})^{\intercal} \phi(\mathbf{x}_{j}) = \frac{\sigma_{0}^{2}}{m}\sum_{r=1}^{m}{\cos\left(2 \pi \mathbf{s}_{r}^{\intercal} (\mathbf{x}_{i}-\mathbf{x}_{j}) \right)}, \]

where \phi(\mathbf{x}) is the 2m-dimensional vector containing the m pairs of trigonometric functions:

    \[ \phi(\mathbf{x}) = \begin{bmatrix} \cos(2\pi\mathbf{s}_{1}^{\intercal}\mathbf{x}) & \sin(2\pi\mathbf{s}_{1}^{\intercal}\mathbf{x}) & \cdots & \cos(2\pi\mathbf{s}_{m}^{\intercal}\mathbf{x}) & \sin(2\pi\mathbf{s}_{m}^{\intercal}\mathbf{x}) \end{bmatrix}. \]

From the perspective of power spectrum of covariance functions:
For a stationary GP, the autocorrelation function is equal to the stationary covariance function. By the Wiener-Khintchine theorem and Bochner's theorem, we can represent a stationary covariance function k(\mathbf{\tau}) via its power spectrum S(\mathbf{s}), written as an expectation with respect to the probability measure p_{S}(\mathbf{s}):

    \[ \begin{array}{ll} k(\mathbf{x}_{i}, \mathbf{x}_{j}) & = k(\mathbf{\tau}) \\ & = \displaystyle \int_{\mathbb{R}^{D}}{e^{2\pi i \mathbf{s}^{\intercal}(\mathbf{x}_{i}-\mathbf{x}_{j})}S(\mathbf{s})d\mathbf{s}}\\ & = \displaystyle \sigma_{0}^{2} \int_{\mathbb{R}^{D}}{ e^{2 \pi i \mathbf{s}^{\intercal} \mathbf{x}_{i}}\left( e^{2\pi i \mathbf{s}^{\intercal}\mathbf{x}_{j}} \right)^{*}p_{S}(\mathbf{s})d\mathbf{s}} \\ & = \displaystyle \sigma_{0}^{2} \mathbb{E}_{p_{S}}\left[ e^{2\pi i \mathbf{s}^{\intercal}\mathbf{x}_{i}}\left( e^{2\pi i \mathbf{s}^{\intercal}\mathbf{x}_{j}}\right)^{*}\right] \end{array} \]

Due to the symmetry of power spectrum around zero, we can sample it using Monte Carlo with m spectral points with pairwise frequencies \left\lbrace \mathbf{s}_{r}, -\mathbf{s}_{r} \right\rbrace. The final approximation of the covariance function is

    \[ \begin{array}{ll} k(\mathbf{x}_{i}, \mathbf{x}_{j}) & \simeq \displaystyle \frac{\sigma_{0}^{2}}{m} \sum_{r=1}^{m} \left[ e^{2 \pi i \mathbf{s}_{r}^{\intercal} \mathbf{x}_{i}} \left( e^{2 \pi i \mathbf{s}_{r}^{\intercal} \mathbf{x}_{j}} \right)^{*} + \left( e^{2 \pi i \mathbf{s}_{r}^{\intercal} \mathbf{x}_{i}} \right)^{*} e^{2 \pi i \mathbf{s}_{r}^{\intercal} \mathbf{x}_{j}} \right]\\ & = \displaystyle \frac{\sigma_{0}^{2}}{m}\text{Re} \left[ \sum_{r=1}^{m}{e^{2 \pi i \mathbf{s}_{r}^{\intercal} \mathbf{x}_{i}} \left( e^{2 \pi i \mathbf{s}_{r}^{\intercal} \mathbf{x}_{j}} \right)^{*}} \right]\\ & = \displaystyle \frac{\sigma_{0}^{2}}{m}\sum_{r=1}^{m}{\cos\left(2 \pi \mathbf{s}_{r}^{\intercal} (\mathbf{x}_{i}-\mathbf{x}_{j}) \right)}, \end{array} \]

which is the same as that derived by using the trigonometric basis functions.

Explanations:
The authors describe the result as ``sparsifying'' the spectrum of the GP because this approximation is equivalent to replacing the power spectrum S(\mathbf{s}) by a set of Dirac deltas distributed according to the probability measure p_{S}(\mathbf{s}). Furthermore, the authors draw an analogy to neural networks: ``A stationary GP can be seen as a neural network with infinitely many hidden units and trigonometric activations if independent priors following Gaussian and p_{S}(\mathbf{s}) distributions are placed on the output and input weights, respectively''.

Using this approximation of the covariance function, we can get an efficient computation of the predictive distribution as

    \[ \mu_{*} = \phi(\mathbf{x}_{*})^{\intercal} \mathbf{A}^{-1} \mathbf{\Phi_{f}} \mathbf{y}, \]

    \[ \sigma_{*}^{2} = \sigma_{n}^{2} + \sigma_{n}^{2} \phi(\mathbf{x}_{*})^{\intercal} \mathbf{A}^{-1} \phi(\mathbf{x}_{*}), \]

where \mathbf{\Phi_{f}} = [\phi(\mathbf{x}_{1}), \cdots, \phi(\mathbf{x}_{n})] and \mathbf{A} = \mathbf{\Phi_{f}}\mathbf{\Phi_{f}}^{\intercal} + \frac{m\sigma_{n}^{2}}{\sigma_{0}^{2}}\mathbf{I}_{2m}. Since the matrix \mathbf{A} to be inverted is of size 2m \times 2m, the computational cost is reduced to \mathcal{O}(nm^{2}) from the original \mathcal{O}(n^{3}).
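A rough sketch of these formulas (illustrative data; the spectral points are drawn from the spectral density of a squared-exponential kernel rather than optimized as in the paper):

    # Sparse-spectrum GP predictive mean with m random spectral points.
    import numpy as np

    rng = np.random.default_rng(0)
    n, m, D = 200, 20, 1
    sigma0, sigma_n, lengthscale = 1.0, 0.1, 0.5

    X = rng.uniform(-3, 3, size=(n, D))
    y = np.sin(2 * X[:, 0]) + sigma_n * rng.normal(size=n)

    # Spectral points s_r ~ p_S(s); for a squared-exponential kernel p_S is Gaussian.
    S = rng.normal(scale=1.0 / (2 * np.pi * lengthscale), size=(m, D))

    def phi(X):
        # 2m-dimensional feature map [cos(2 pi s_r^T x), sin(2 pi s_r^T x)].
        proj = 2 * np.pi * X @ S.T                        # (n, m)
        return np.hstack([np.cos(proj), np.sin(proj)])    # (n, 2m)

    Phi_f = phi(X).T                                      # (2m, n), as in the text
    A = Phi_f @ Phi_f.T + (m * sigma_n**2 / sigma0**2) * np.eye(2 * m)

    x_star = np.array([[1.0]])
    mu_star = phi(x_star) @ np.linalg.solve(A, Phi_f @ y)  # O(n m^2) overall
    print("prediction at x*=1:", mu_star[0], "vs sin(2) =", np.sin(2.0))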

The authors report comparisons of predictive accuracy with state-of-the-art sparse approximation methods for GPs, including FITC (Snelson and Ghahramani, 2006) and SMGP (Walder et al., 2008). On most of the tested datasets, SSGP provides better results than the other methods at the same computational cost. However, SSGP also performs significantly worse on one small dataset. There are also some unexpected cases where SSGP performs better than the full GP; the authors view these cases as the result of overfitting.

In summary, this paper provides a novel perspective for reducing the computational complexity of GP models. We also discussed the following question during the reading group:
The view of a stationary GP as a neural network is interesting. Would there be a way to link GPs and neural networks together? For example, how could one choose the number of hidden layers automatically?


Discussion of: Gaussian process optimization in the bandit setting: No regret and experimental design

Presenter: Li-Fang Cheng
Scribe: Po-Hsuan (Cameron) Chen
Date: 04/09/15
Paper: Gaussian process optimization in the bandit setting: No regret and experimental design (Srinivas et al., 2009)
Theme: Bayesian optimization

Gaussian process optimization in the bandit setting: No regret and experimental design (Srinivas, Krause, Kakade, & Seeger, 2009) is the third paper, after [Box 1951] and [Jones 2001], that we have discussed in the theme of Bayesian optimization. In [Box 1951] and [Jones 2001], the concepts of most probable improvement (MPI) and expected improvement (EI) as auxiliary functions for global optimization were discussed and tested in the context of a response surface modeled using kriging, also known as Gaussian process regression. However, in these earlier papers no theoretical guarantees were provided. This paper provides theoretical results for a related auxiliary function for Bayesian optimization.

The motivating problem for this paper is to optimize a noisy unknown function f in a regime where sampling is expensive. The challenge is twofold: it requires estimating an unknown function from noisy samples, and optimizing the estimate over the input space. Without any assumption on f there isn't much we can work with, so the authors model the unknown function f with a Gaussian process (GP). Specifically, the authors analyze the Gaussian Process Upper Confidence Bound (GP-UCB) rule [Auer 2003] and derive regret bounds under different smoothness assumptions. By putting the problem in the framework of no regret, the analysis is connected with the multi-armed bandit setting, and the effort to explore the unknown function with as few samples as possible is connected with ideas in experimental design.

In the problem set up, the authors want to design an algorithm that adaptively selects samples x_t such that there is no regret on average over time:

\lim_{T\rightarrow\infty} \frac{R_T}{T} = \lim_{T\rightarrow\infty} \frac{1}{T}\sum_{t=1}^T\left(f(x^*)-f(x_t)\right) = 0.

While a number of auxiliary functions have been proposed for this problem, none had theoretical guarantees. The authors' main contribution in this paper is to prove bounds on R_T in this setting under three different smoothness assumptions. To maximize f(x_t), a heuristic approach is to pick x_t with the highest posterior mean:
x_t = \arg\max_{x\in D} \mu_{t-1}(x),

where D is the input space, and \mu_{t-1}(x) is the posterior mean at input x. However, this can get stuck at suboptimal points due to inadequate exploration of the space. A pure exploration strategy would pick x such that:

x_t = \arg\max_{x\in D} \sigma_{t-1}(x),

where \sigma_{t-1}(x) is the posterior standard deviation at input x. GP-UCB provides a trade-off between exploration and exploitation by picking

x_t = \arg\max_{x\in D} \mu_{t-1}(x) + \sqrt{\beta_t}\sigma_{t-1}(x).
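A compact sketch of this rule over a finite input set D (the objective, kernel, noise level, and \beta_t schedule are illustrative assumptions, and scikit-learn's GP regressor stands in for the posterior updates):

    # GP-UCB: at each round pick x_t maximizing mu_{t-1}(x) + sqrt(beta_t)*sigma_{t-1}(x).
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    rng = np.random.default_rng(0)
    f = lambda x: -(x - 0.7) ** 2                   # unknown function to maximize
    D = np.linspace(0, 1, 200).reshape(-1, 1)       # finite input set

    X_obs, y_obs = [], []
    for t in range(1, 31):
        beta_t = 2 * np.log(len(D) * t ** 2)        # one simple (illustrative) schedule
        if X_obs:
            gp = GaussianProcessRegressor(kernel=RBF(0.1), alpha=1e-2).fit(X_obs, y_obs)
            mu, sigma = gp.predict(D, return_std=True)
        else:
            mu, sigma = np.zeros(len(D)), np.ones(len(D))
        x_t = D[np.argmax(mu + np.sqrt(beta_t) * sigma)]
        X_obs.append(x_t)
        y_obs.append(f(x_t[0]) + 0.1 * rng.normal())    # noisy evaluation

    print("best sampled point:", X_obs[int(np.argmax(y_obs))][0])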

The authors prove regret bounds under three smoothness conditions: (1) f sampled from a GP on a finite set, (2) f sampled from a GP on a compact set, and (3) the general RKHS case. The proof can be sketched in two steps: bounding the regret through the information gain, and bounding the information gain.


This paper is interesting for its theoretical guarantees in the form of regret bounds. From a practical standpoint, there are a few points that were not specified. In GP optimization, the selection of a kernel and the optimization of hyperparameters generally play an important role; in this paper, the effect of the kernel and hyperparameter choices on the bounds is not discussed.


Discussion of: A Taxonomy of Global Optimization Methods Based on Response Surfaces

Presenter: Mikio Aoi
Scribe: Nick Turner
Date: 03/12/15
Paper: A Taxonomy of Global Optimization Methods Based on Response Surfaces (Donald Jones, 2001)
Theme: Bayesian Optimization

Over the past two sessions, our discussion group has begun a tour of the foundations of Bayesian Optimization. In the first of these meetings, we considered On the Experimental Attainment of Optimum Conditions by G. E. P. Box and K. B. Wilson. This article attempted to motivate the use of numerical optimization methods driven by chemical experiments. There, by representing physical experiments as costly samples of some unknown function, the authors gently walk through the reasoning behind the (now very familiar) techniques of gradient-based optimization.

This time, we reviewed A Taxonomy of Global Optimization Methods Based on Response Surfaces by Donald R. Jones (2001). This paper is likewise motivated by finding relatively cheap surrogates for otherwise time-consuming functions - in this case, extensive computer simulations. The `taxonomy' acts as a user's guide to several attempts to create these surrogates via a response surface: a function which is hypothesized to mimic the real function of interest. One particularly neat figure summarizes each of the considered approaches.

[Figure: Jones's taxonomy of global optimization methods based on response surfaces]

The general strategy of using a response surface for optimization typically has two steps. The first is fitting a function of the following form to the existing samples:

y(x^*) = \sum_{k=1}^{m}a_k\pi_k(x^*) + \sum_{j=1}^{n}b_j\psi(x^*-x_j)

where the \pi_k form a basis of polynomial functions up to a certain degree, and \psi is a basis function of one of a few mentioned forms. After fitting, the second step takes the fitted surface as ground truth and makes an explicit decision about which point to sample next, based either on the fitted surface values or on an auxiliary function over those values.

The simplest example of this strategy is fitting a specific polynomial function to the sampled points (drastically limiting the terms within \pi_k and setting all b_j to zero), and then selecting the minimum of this polynomial function as the next sample. This tactic is often too simple for realistic scenarios, though it illustrates the general approach.
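A toy sketch of this simplest strategy (the objective function, the polynomial degree, and the grid search are illustrative assumptions):

    # Fit a polynomial response surface to the samples, then take its minimizer
    # (found here by a simple grid search) as the next point to evaluate.
    import numpy as np

    def expensive_f(x):
        return np.sin(3 * x) + 0.5 * x ** 2     # stand-in for a costly simulation

    x_samples = np.array([-2.0, -0.5, 1.0, 2.5])
    y_samples = expensive_f(x_samples)

    for step in range(5):
        coeffs = np.polyfit(x_samples, y_samples, deg=3)       # fit the surface
        grid = np.linspace(-3, 3, 601)
        x_next = grid[np.argmin(np.polyval(coeffs, grid))]     # minimize the surface
        x_samples = np.append(x_samples, x_next)
        y_samples = np.append(y_samples, expensive_f(x_next))

    print("best point found:", x_samples[np.argmin(y_samples)])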

Jones proceeds through a few more nuanced methods (including standard basis functions and kriging predictors) to motivate a general principle for any aspiring global optimization algorithm:

The practical lesson of the theorem is that any globally convergent method must have a feature that forces it to pay attention to parts of the space that have been relatively unexplored and, from time to time, to go back and sample in these regions.

This lesson motivates the use of more sophisticated auxiliary functions which incorporate uncertainty. To this end, the paper goes into further detail on kriging predictors, whose statistical interpretation allows uncertainty to be incorporated naturally. It then constructs two auxiliary functions that choose the next sample based on the probability of improvement or the expected improvement at each point. Below is a figure that demonstrates the expected-improvement auxiliary function in action: the first panel shows the function we would like to estimate, and the subsequent panels show the response surface for the given points. Underneath the surface is a plot of the auxiliary function, whose maximum indicates the next point to sample.

[Figure: the expected improvement auxiliary function in action]
Methods that incorporate uncertainty are largely able to sample unexplored regions, though Jones notes that they can still suffer given misleading samples. Addressing this flaw, Jones motivates one-step methods, which differ from the general strategy described above. One-step methods attempt to directly estimate the likelihood of the data given that the optimum value (a user-specified f^*) is achieved at a particular point. The next sample is then chosen as the point which maximizes this likelihood, iterating over potential points on the surface and potential optimal values f^*, and maximizing over the remaining model parameters for each candidate point. This was the procedure preferred in the paper, despite its apparent computational complexity.


A few discussion points were of particular interest. First, the group was quick to note that the paper does not incorporate any sample noise into the creation of the response surface. Further, some indicated that selecting the next sample point often implicitly involves an inefficient search over a grid of potential values.
