Predict-and-recompute conjugate gradient variants

Tyler Chen

This is a companion piece to the publication:

@article{predict_and_recompute,
    title={Predict-and-recompute conjugate gradient variants},
    author={Tyler Chen and Erin C. Carson},
    doi = {10.1137/19m1276856},
    year = {2020},
    month = jan,
    publisher = {Society for Industrial {\&} Applied Mathematics ({SIAM})},
    volume = {42},
    number = {5},
    pages = {A3084--A3108},
    journal = {SIAM Journal on Scientific Computing},
    eprint={1905.01549},
    archivePrefix={arXiv},
    primaryClass={cs.NA},
}

The conjugate gradient algorithm (CG) is very popular for solving positive definite linear systems \(\mathbf{A}\mathbf{x}=\mathbf{b}\). CG is popular for many reasons, including its low storage costs and the fact that you don’t need to be able to explicitly represent \(\mathbf{A}\), only to be able to evaluate the product \(\mathbf{v}\mapsto \mathbf{A}\mathbf{v}\). While these properties make CG an attractive choice for solving very large sparse (lots of zeros in \(\mathbf{A}\)) systems, the standard implementation of the conjugate gradient algorithm requires that nearly every computation be done sequentially. In particular, it requires two inner products and one (often sparse) matrix-vector product per iteration, none of which can occur simultaneously. Each inner product requires global communication (meaning all the processors you use have to talk to one another), and the matrix-vector product (if sparse) requires local communication. Communication (moving data between places on a computer) takes time, and on supercomputers it is the biggest thing slowing down the conjugate gradient algorithm. In the past, others have come up with versions of the CG algorithm which take less time per iteration by reducing communication. I’ve written a bit about some of those variants here.
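
For reference, here is a minimal NumPy sketch of a standard (Hestenes–Stiefel style) CG iteration, meant only to make the per-iteration dependency chain explicit. The function name, stopping criterion, and other details are my own choices, not anything from the paper; note that only the action \(\mathbf{v}\mapsto \mathbf{A}\mathbf{v}\) is ever used.

```python
# A minimal sketch of a standard CG iteration. Each iteration has two inner
# products and one matrix-vector product, and each quantity depends on the
# one computed just before it, so none of them can be overlapped here.
import numpy as np

def cg(A, b, tol=1e-8, maxiter=1000):
    x = np.zeros_like(b)
    r = b - A @ x              # residual
    p = r.copy()               # search direction
    nu = r @ r                 # inner product: nu = r^T r
    for _ in range(maxiter):
        if np.sqrt(nu) <= tol * np.linalg.norm(b):
            break
        w = A @ p              # matrix-vector product (needs p)
        mu = p @ w             # inner product: mu = p^T A p (needs w)
        alpha = nu / mu        # step size (needs mu)
        x = x + alpha * p
        r = r - alpha * w
        nu_new = r @ r         # inner product for the next step (needs r)
        beta = nu_new / nu
        p = r + beta * p       # next search direction (needs beta)
        nu = nu_new
    return x
```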

However, it’s well known that CG behaves very differently in finite precision arithmetic than it does in exact arithmetic. Understanding why this happens is a hard problem, and only a few results have been proved about it. I’ve written an introduction to the effects of finite precision on CG here, but to summarize, the main effects are (i) a loss of ultimately attainable accuracy and (ii) an increase in the number of iterations needed to reach a given level of accuracy (a delay of convergence). Unfortunately, many of the past communication-hiding variants do not always work well numerically. We would therefore like to develop variants which reduce communication (and therefore the time per iteration) while simultaneously ensuring that their numerical stability is not too severely impacted (so that the number of iterations required does not increase too much).
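
To make the first of these effects concrete, here is a small experiment of my own construction (not from the paper) using the `cg` sketch above. We build a deliberately ill-conditioned positive definite matrix, ask for a very small relative tolerance, and then check the true residual \(\mathbf{b}-\mathbf{A}\mathbf{x}\); on problems like this, the updated residual that drives the stopping test keeps shrinking, while the true residual typically stalls several orders of magnitude above the requested tolerance.

```python
# Illustration of the loss of ultimately attainable accuracy (the exact
# numbers depend on the matrix; this one is an arbitrary choice).
import numpy as np

rng = np.random.default_rng(0)
n = 100
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(np.logspace(-6, 0, n)) @ Q.T   # SPD, condition number ~1e6
b = rng.standard_normal(n)

x = cg(A, b, tol=1e-14, maxiter=50000)
print("requested relative tolerance:", 1e-14)
print("true relative residual:      ",
      np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```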

The primary contributions of this paper are several new CG variants, each mathematically equivalent to standard CG, which perform better than their previously studied counterparts. A general framework for constructing these methods is presented. More importantly, the idea of using predictions of certain quantities so that a computation can begin earlier, and then recomputing those quantities at a later point (an idea originally due to Meurant), is applied to the “pipelined” versions of these variants. Despite the advances presented in this paper, there is still significant room for future work on high performance conjugate gradient variants, especially in the direction of further decreasing the communication costs.
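
Very roughly, the predict-and-recompute pattern looks like the serial sketch below. This is a simplified schematic of the general idea, not the pseudocode of any particular variant from the paper; which scalars are predicted, and which recurrences are used, differ between variants. Here \(\nu_k = \mathbf{r}_k^\top\mathbf{r}_k\) is first predicted using the exact-arithmetic identity \(\nu_k = \alpha_{k-1}^2\,\mathbf{w}_{k-1}^\top\mathbf{w}_{k-1} - \nu_{k-1}\) (with \(\mathbf{w}=\mathbf{A}\mathbf{p}\)), so the next search direction and matrix-vector product can be formed without waiting on a global reduction, and is then recomputed as an explicit inner product once that reduction is available.

```python
# A serial schematic of the predict-and-recompute pattern (an illustrative
# sketch only). In a distributed implementation, the recomputation of nu
# would be bundled with the other inner products into a single nonblocking
# reduction and overlapped with the matrix-vector product.
import numpy as np

def predict_and_recompute_cg(A, b, tol=1e-8, maxiter=1000):
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    w = A @ p                      # w = A p
    nu = r @ r                     # nu = r^T r
    for _ in range(maxiter):
        sigma = w @ w              # extra inner product used by the prediction
        mu = p @ w                 # mu = p^T A p
        alpha = nu / mu
        x = x + alpha * p
        r = r - alpha * w

        nu_pred = alpha**2 * sigma - nu    # predict r^T r: no new communication
        beta = nu_pred / nu
        p = r + beta * p                   # proceed using the prediction ...
        w = A @ p                          # ... overlapping the reduction below

        nu = r @ r                 # recompute: replaces the predicted value
        if np.sqrt(nu) <= tol * np.linalg.norm(b):
            break
    return x
```

Compared to the standard iteration above, this sketch uses an extra inner product (\(\mathbf{w}^\top\mathbf{w}\)); like other pipelined variants, the methods in this family trade a bit of extra arithmetic for the ability to merge and overlap communication.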