The aftermath of ORF523 and the final

It has been two weeks since my last post on the blog, and I have to admit that it felt really good to take a break from my 2-posts-per-week regime of the previous months. Now that ORF523 is over the blog will probably be much more quiet in the near future, though I have a few cool topics that I want to discuss in the coming weeks and I’m also hoping to get interesting guest posts. In any case the next period of intense activity will probably be next Fall, as I will be visiting the newly created Simons Institute for the Theory of Computing at UC Berkeley and I plan to blog about the new tricks that I will learn from the program on the Theoretical Foundations of Big Data Analysis and the one on Real Analysis in Computer Science. More details on this to come soon!

For those of you who followed assiduously my lectures I thought that you might want to take a look the final that I gave to my students this morning. The first part is based on this paper by Alon and Naor, and the second part is based on this paper by Nesterov (how could I not finish with a result from Nesterov??). Enjoy!


An SDP relaxation based on Grothendieck’s inequality

In this problem we consider the space of rectangular matrices \mathbb{R}^{m \times n} for m, n \geq 1. We denote by \langle \cdot, \cdot \rangle the Frobenius inner product on \mathbb{R}^{m \times n}, that is

    \[\langle A, B \rangle = \sum_{i=1}^m \sum_{j=1}^n A_{i,j} B_{i,j} .\]

Let p, p' \in \mathbb{R}_{++} \cup \{\infty\}. We consider the following norm on \mathbb{R}^{m \times n}:

    \[\|A\|_{p \rightarrow p'} = \max_{x \in \mathbb{R}^n : \|x\|_p \leq 1} \|A x\|_{p'} .\]

In other words this is the operator norm of A when we view it as a mapping from \ell_p^n to \ell_{p'}^m. In this problem we are interested in computing this operator norm for a given matrix.

1. Is the problem of computing the operator norm of A a convex problem?

2. For what values of p and p' can you easily propose a polynomial-time algorithm?

3. Let q' be such that 1/p' + 1/q' =1. Recall why one has

    \[\|A\|_{p \rightarrow p'} = \max_{x \in \mathbb{R}^n : \|x\|_p \leq 1, y \in \mathbb{R}^m : \|y\|_{q'} \leq 1} \langle A , y x^{\top} \rangle .\]

We will now focus on the case p=\infty and p'=1, that is

    \[\|A\|_{\infty \rightarrow 1} = \max_{x \in \mathbb{R}^n : \|x\|_{\infty} \leq 1, y \in \mathbb{R}^m : \|y\|_{\infty} \leq 1} \langle A , y x^{\top} \rangle.\]

We also consider the following quantity:

    \begin{eqnarray*} \text{SDP}(A) = & \max & \langle A , V U^{\top} \rangle \\ & \text{subject to} & U \in \mathbb{R}^{n \times (n+m)}, V \in \mathbb{R}^{m \times (n+m)} \\ & & \text{with unit norm rows (in} \; \ell_2). \end{eqnarray*}

4. Show that

    \[\|A\|_{\infty \rightarrow 1} \leq \text{SDP}(A) .\]

5. Show that the problem of computing \text{SDP}(A) for a given matrix A is an SDP.

Grothendieck’s inequality says that there exists a constant K_G \in <sup class='footnote'><a href='#fn-653-1' id='fnref-653-1' onclick='return fdfootnote_show(653)'>1</a></sup> such that

    \[\text{SDP}(A) \leq K_G \|A\|_{\infty \rightarrow 1} .\]

6. What is the implication of Grothendieck’s inequality for the problem of computing the norm \|A\|_{\infty \rightarrow 1}?


Coordinate Descent

Let f be a differentiable convex function on \mathbb{R}^n that admits a minimizer x^*. We denote \partial_i f(x) for the partial derivative in direction e_i of f at x (it is the i^{th} coordinate of \nabla f(x)). In this problem we consider the Coordinate Descent algorithm to solve the problem \min_{x \in \mathbb{R}^n} f(x): Let x_1 \in \mathbb{R}^n and for t>1 let

    \begin{eqnarray*} i_t & = & \mathrm{argmax}_{1 \leq i \leq n} | \partial_i f(x_t) | , \\ x_{t+1} & = & x_t - \eta \ \partial_{i_t} f (x_t) e_{i_t} . \end{eqnarray*}

We say that f is coordinate-wise \beta-smooth if for any x \in \mathbb{R}^n, h \in \mathbb{R}, i \in [n],

    \[|\partial_i f(x+h e_i) - \partial_i f(x) | \leq \beta |h| .\]

1. Is coordinate-wise smoothness a stronger or weaker assumption than standard smoothness?

2. Assume that f is coordinate-wise \beta-smooth and let \eta = \frac{1}{\beta}. Show first that

    \[f(x_{t+1}) - f(x_t) \leq - \frac{1}{2 \beta n} \| \nabla f(x_t) \|^2_2 .\]

Let R = \max \{\|x - x^*\|_2 : x \in \mathbb{R}^n \; \text{and} \; f(x) \leq f(x_1) \}, show next that

    \[f(x_{t+1}) - f(x_t) \leq - \frac{1}{2 \beta R^2 n} (f(x_t) - f(x^*))^2 .\]

Conclude by proving that

    \[f(x_t) - f(x^*) \leq \frac{2 \beta R^2 n}{t+3} .\]

Hint: to deal with the term f(x_1) - f(x^*) you can first show that f has to be (\beta n)-smooth.

3. Can you use this algorithm and the convergence result described above to show existence of sparse minimizers in the spirit of what we did with the Conditional Gradient Descent algorithm?

4. In general how does the computational complexity of Coordinate Descent compares to the computational complexity of Gradient Descent to attain an \epsilon-optimal point for a \beta-smooth function? What happens if you assume access to an oracle to compute i_t?

5. Same question but in the special case where f(x) = \sum_{i=1}^n f_i(x(i)) + \|A x - b \|_2^2 with A \in \mathbb{R}^{m \times n} and b \in \mathbb{R}^m.

6. Can you propose a variant of Coordinate Descent where the computation of i_t would be much faster and yet this new i_t would still allow us to prove the rate of convergence of question 2.? Hint: use randomization!

  1. 67, 1.79
This entry was posted in Optimization. Bookmark the permalink.