Smooth distributed convex optimization

A couple of months ago we (Kevin Scaman, Francis Bach, Yin Tat Lee, Laurent Massoulie and myself) uploaded a new paper on distributed convex optimization. We came up with a pretty clean picture for the optimal oracle complexity of this setting, which I will describe below. I should note that there are hundreds of papers on this topic, but the point of the post is to show our simple cute calculations and not to survey the immense literature on distributed optimization, see the paper itself for a number of pointers to other recent works.

Distributed optimization setting

Let G=(V,E) be an undirected graph on n vertices (|V| =n) and with diameter \Delta. We will think of the nodes in G as the computing units. To each vertex v \in V there is an associated convex function f_v : \mathbb{R}^d \rightarrow \mathbb{R}. For machine learning applications one can think that each computing unit has access to a “private” dataset, and f_v(x) represents the fit of the model corresponding to x \in \mathbb{R}^d on this dataset (say measured on least squares loss, or logistic loss for example). The goal will be to find in a distributed way the optimal “consensus” point:

    \[x^* \in \mathrm{argmin}_{x \in \mathbb{R}^d} \sum_{v \in V} f_v(x) .\]

The distributed processing protocol is as follows: asynchronously/in parallel, each node v can (i) compute a (local) gradient \nabla f_i in time 1, and (ii) communicate a vector in \mathbb{R}^d to its neighbors in G in time \tau. We denote by x_{v,t} the local model (essentially its guess for x^*) of node v at time t \in \mathbb{R}_+. We aim to characterize the smallest time T_{\epsilon} such that one can guarantee that all nodes v satisfy \bar{f}(x_{v,T}) - \bar{f}(x^*) \leq \epsilon where \bar{f} = \frac1n \sum_{v \in V} f_v.

We focus on the case where \bar{f} is \beta-smooth and \alpha-strongly convex (\kappa =\beta/\alpha is the condition number), which is arguably the most challenging case since one expects linear convergence (i.e., the scaling of T_{\epsilon} in \epsilon should be \log(1/\epsilon)) which a priori makes the interaction of optimization error and communication error potentially delicate (one key finding is that in fact it is not delicate!). Also, having in mind applications outside of large-scale machine learning (such as “federated” learning), we will make no assumptions about the functions at different vertices relate to each other.

A trivial answer

Recall that Nesterov’s accelerated gradient descent solves the serial problem in time O(\sqrt{\kappa} \log(1/\epsilon)). Trivially one can distribute a step of Nesterov’s accelerated gradient descent in time O(\tau \Delta) (simply designate a master node at the beginning, and everybody sends its local gradient to the master node in time O(\tau \Delta)). Thus we arrive at the upper bound T_{\epsilon} = O((1+\tau \Delta) \sqrt{\kappa} \log(1/\epsilon)) using a trivial (centralized) algorithm. We now show (slightly informally, see the paper for proper definitions) that this in fact optimal!

First let us recall the lower bound proof in the serial case (see for example Theorem 3.15 here). The idea is to introduce the function f(x) = \beta L[x,x] - 2 \alpha x(1) + \alpha |x|^2 where L is the Laplacian of the path graph on \N, or in other words

    \[L[x,x] = \sum_{i \in \mathbb{N}} (x(i) - x(i+1))^2 .\]

First it is easy to see that this function is indeed O(\beta)-smooth and \Omega(\alpha)-strongly convex. The key point is that, for any algorithm starting at x_0=0 and such that each iteration stays in the linear span of the previously computed gradients (a very natural assumption) then

    \[x_t(i) = 0 \;\; \text{for all} \;\; i > t ~.\]

In words one can say that each gradient calculation “discovers” a new edge of the path graph involved in the definition of f. Concluding the serial proof is then just a matter of brute force calculations.

Now let us move to the distributed setting, and consider two vertices u and v that realize the diameter of G. The idea goes as follows: let L_1 (respectively L_2) be the Laplacian of even edges of the path graph on \mathbb{N} (respectively the odd edges), that is

    \[L_1[x,x] = \sum_{i \in \mathbb{N} : i \; \text{even}} (x(i) - x(i+1))^2 .\]

Now define \frac{1}{n} f_u = \beta L_1[x,x] - 2 \alpha x(1) + \frac{\alpha}{2} |x|^2, \frac{1}{n} f_v = \beta L_2[x,x] + \frac{\alpha}{2} |x|^2, and f_w = 0 for any w \not\in \{u,v\}. The key observation is that node u does not “know” about the even edges until it receives a message from v and vice versa. Thus it fairly easy to show that in this case one has:

    \[x_{u,t}(i) = 0 \;\; \text{for all} \;\; i > t / (1+\tau \Delta) ~,\]

which effectively amounts to a slowdown by a factor (1+\tau \Delta) compared to the serial case and proves the lower bound T_{\epsilon} = \Omega((1+\tau \Delta) \sqrt{\kappa} \log(1/\epsilon).

Not so fast!

One can say that the algorithm proposed above defeats a bit the purpose of the distributed setting. Indeed the centralized communication protocol it relies on is not robust to various real-life issues such as machine failures, time-varying graphs, edges with different latency, etc. An elegant and practical solution is to restrict communication to be gossip-like. That is local computations have now to be communicated via matrix multiplication with a walk matrix W which we define as satisfying the following three conditions: (i) W(i,j) \neq 0 \Rightarrow (i,j) \in E, (ii) \mathrm{Ker}(W) =\mathrm{Span}((1,\hdots,1)), and (iii) W \succeq 0. Let us briefly discuss these conditions: (i) simply means that if x \in \mathbb{R}^V represents real values stored at the vertices, then W x can be calculated with a distributed communication protocol; (ii) says that if there is consensus (that is all vertices hold the same value) then no communication occurs with this matrix multiplication; and (iii) will turn out to be natural in a just a moment for our algorithm based on duality. A prominent example of a walk matrix would be the (normalized) Laplacian of G

We denote by \gamma the inverse condition number of W on (1,\hdots,1)^{\perp} (that is the ratio of the smallest non-zero eigvenvalue of W to its largest eigenvalue), also known as the spectral gap of G when W is the Laplacian. Notice that \gamma naturally controls the number of gossip steps to reach consensus, in the sense that gossip steps corresponds to gradient descent steps on f(x) = x^{\top} W x, which will converge in O(\gamma^{-1} \log(1/\epsilon)) steps. Doing an “accelerated gossip” (also known as Chebyshev gossiping) one could thus hope to essentially replace the diameter \Delta by O(\gamma^{-1/2} \log(1/\epsilon)). Notice that this is hopeful thinking because in the centralized model \Delta steps gets you to an exact consensus, while in the gossip model one only reaches an \epsilon-approximate consensus and errors might compound. In fact with a bit of graph theory one can immediately see that simply replacing \Delta by O(\gamma^{-1/2} \log(1/\epsilon)) is too good to be true: there are graphs (namely expanders) where \Delta is of order \log(n) while \gamma is of order of a constant, and thus an upper bound of the form (say) \Omega((1+\tau \gamma^{-1/2} \log(1/\epsilon)) \sqrt{\kappa} \log(1/\epsilon) would violate our previous lower bound by a factor 1/\log(n).

To save the day we will make extra assumptions, namely that each local function f_v has condition number \kappa and that in addition to computing local gradient the vertices can also compute local gradients of the Fenchel dual functions f_v^*. The latter assumption can be removed at the expense of extra logarithmic factors but we will ignore this point here (see the paper for some hints as well as further discussion on this point). For the former assumption we note that the lower bound proof given above completely breaks under this assumption. However one can save the construction for some specific graphs (finding the correct generalization to arbitrary graphs is one of our open problems). For example imagine a line graph, and cluster the vertices into three groups, the first third, the middle, and the last third. Then one could distribute the even part of the Laplacian on \mathbb{N} in the first group, and the odd part on the last group, as well as distribute the Euclidean norm evenly among all vertices. This construction verifies that each vertex function has condition number O(\kappa) and furthermore the rest of the argument still goes through. Interestingly in this case one also has \Delta \approx 1/\sqrt{\gamma} and thus this proves that for the line graph one has T_{\epsilon} = \Omega((1+\tau \gamma^{-1/2}) \sqrt{\kappa} \log(1/\epsilon) for gossip algorithms. We will now show a matching upper bound (which holds for arbitrary graphs).

Dual algorithm

For X \in \mathbb{R}^{d \times n} (which we think of as a set of column vectors, one for each vertex i \in V = [n]), denote X_i for the i^{th} column and let F(X) := \sum_{i =1}^n f_i(X_i). We are interested in minimizing F under the constraint that all columns are equal, which can be written as X \sqrt{W} = 0. By definition of the Fenchel dual F^* and a simple change of variable one has:

    \[\min_{X \in \mathbb{R}^{d\times n} : X \sqrt{W} = 0} F(X) = \max_{\lambda \in \mathbb{R}^{d \times n}} - F^*(\lambda \sqrt{W}) .\]

Next observe that gradient ascent on \lambda can be written as

    \[\lambda^+ = \lambda + \eta \nabla F^*(\lambda \sqrt{W}) \sqrt{W}\]

and with the notation y = \lambda \sqrt{W} this is simply y^+ = y - \eta \nabla F^*(y) W. Crucially \nabla F^*(y) W exactly corresponds to gossiping the local conjugate gradients (which are also the local models) \nabla f_i^*(y_{i,t}) =: X_{i,t}. In other words we only have to understand the condition number of the function \lambda \mapsto F^*(\lambda \sqrt{W}). The beauty of all of this is that this condition number is precisely \kappa / \gamma (i.e. it naturally combines the condition number of the vertex functions with the “condition number” of the graph). Thus by accelerating gradient ascent we arrive at a time complexity of O((1+\tau) \sqrt{\kappa / \gamma} \log(1/\epsilon)) (recall that a gossip step takes time \tau). We call the corresponding algorithm SSDA (Single-Step Dual Accelerated). One can improve it slightly in the case of low communication cost by doing multiple rounds of communication between two gradient computations (essentially replacing W by W^k). We call the corresponding algorithm MSDA (Multi-Step Dual Accelerated) and its attains the optimal (in the worst case over graphs) complexity of O((1+\tau \gamma^{-1/2}) \sqrt{\kappa} \log(1/\epsilon)).

This entry was posted in Optimization. Bookmark the permalink.

Leave a reply