Using iterative methods and reusable utilities

Means and ends

In our previous post, we introduced iterative methods, implemented one (conjugate gradient) in Rust as a StreamingIterator, and saw a "cost" of its solutions decrease towards zero. The StreamingIterator interface allowed us to separate two concerns:

  • produce a sequence of solutions
  • use the solutions

In this post we focus on the second part, using the workflow of tracking solution quality as our running example. This is a handy tool for evaluating methods, from ensuring the initial implementation is correct, through validating it on a real problem, to the plot in a research paper demonstrating an advantage (hopefully) over the state of the art.

But what do we mean by quality? And how do we decompose the current problem into reusable parts?

Before diving into those questions, let us recall the end goal: users eventually want to call a function, pass it a problem, and receive one final good-enough solution (or an indication of error). The sequence of worse approximations preceding it is an implementation detail to them. However, this kind of interface is also easy to build on top of a stream-based implementation and a few of the same utilities.
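
To make this concrete, here is a rough sketch of what such a wrapper could look like, built from the assess and take_until utilities and the residual_l2 cost function introduced later in this post; the solve name, the LinearSystem problem type, and the tolerance parameter are illustrative assumptions, not part of the library:

    // Hypothetical convenience entry point: run the method until the residual
    // is small enough, then hand back only the final solution.
    // (A real version would also cap the iteration count and report failure.)
    fn solve(p: &LinearSystem, tolerance: f64) -> Option<V> {
        let cg_iter = ConjugateGradient::for_problem(p);
        // Annotate each approximate solution with its residual norm.
        let cg_iter = assess(cg_iter, residual_l2);
        // Stop at the first solution that is good enough.
        let mut cg_iter = take_until(cg_iter, |ar| ar.annotation < tolerance);
        // Keep only the last solution seen.
        let mut best = None;
        while let Some(ar) = cg_iter.next() {
            best = Some(ar.result.solution.clone());
        }
        best
    }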

Measures of progress

Let's generalize and nail down the terms we used loosely before. Let \(x_0\) be an initial solution to a problem \(p\), \(x_1,x_2\dots\) the sequence of solutions produced by an iterative method, and \(f(p,x)\) a real function measuring a cost of using solution \(x\) for problem \(p\).

In the workflows we are considering, we show \(f(p,x_t)\) as a function of either the iteration number \(t\) or elapsed time. Going back to the example of finding solutions to a matrix equality \(Ax=b\), we would like the vector of errors \(Ax-b\) to be all zeros, so a natural measure of cost is the Euclidean length (a.k.a. \(l_2\) norm) of the error vector: \(\sqrt{\sum_{i=1}^n\left(Ax-b\right)_i^2}\), which is what we printed last time.

There are different useful cost functions, not only across problem types, but even for one particular problem. Other measures of the size of a vector sometimes make more sense; the sum of absolute values (\(l_1\)) and their maximum (\(l_\infty\), the worst error in any single dimension) are two very common examples. Another source of variants is to give some error components a larger weight than others, for example based on their roles in an application. Finding a cost function that models the desiderata of a particular application well is a bit of an art, but a critical step in selecting among distinct methods.
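
As a quick illustration, here is what those variants might look like for a residual vector, sketched against ndarray's Array1<f64> (as used in the rest of this series); the weighted variant and its weight vector w are purely an example, not something the library provides:

    use ndarray::Array1;

    // l_1: sum of absolute errors.
    fn cost_l1(res: &Array1<f64>) -> f64 {
        res.mapv(f64::abs).sum()
    }

    // l_inf: the worst error in any single dimension.
    fn cost_linf(res: &Array1<f64>) -> f64 {
        res.fold(0.0, |worst, e| worst.max(e.abs()))
    }

    // Weighted l_2: components with larger weights contribute more to the cost.
    fn cost_weighted_l2(res: &Array1<f64>, w: &Array1<f64>) -> f64 {
        (res * res * w).sum().sqrt()
    }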

We have defined an iterative method as one that improves a solution. More precisely, this means there exists a particular cost function (also known as a Lyapunov function or energy function) which the improvement recipe can be proven to reduce (perhaps only in expectation). This energy function is generally tailored to the problem data, the method, its assumptions, and the proof technique, rather than to an application, but it has a special role to play in validation.

Implementing benchmarking

Time to get our hands dirty with some implementation. First let us review how we used ConjugateGradient last time (with the comments removed), and then improve on it:

    let p = make_3x3_pd_system_2();
    let mut cg_iter = ConjugateGradient::for_problem(&p);

    while let Some(result) = cg_iter.next() {
        let res = result.a.dot(&result.solution) - &result.b;
        let res_squared_length = res.dot(&res);
        println!(
            "||Ax - b||_2 = {:.5}, for x = {:+.3}, residual = {:+.3}",
            res_squared_length.sqrt(),
            result.solution,
            res
        );
        if res_squared_length < 1e-3 {
            break;
        }
    }

To start with, our computation of the \(l_2\) residual cost is mixed in with the I/O in the loop body. We extract it into a reusable function:

fn residual_l2(result: &ConjugateGradient) -> f64 {
    let res = result.a.dot(&result.solution) - &result.b;
    res.dot(&res).sqrt()
}

We can now separate the application of the cost function out of the loop body:

let p = make_3x3_pd_system_2();
let cg_iter = ConjugateGradient::for_problem(&p);

// Annotate each approximate solution with its cost
let cg_iter = assess(cg_iter, residual_l2);
// and use it in stopping condition
let mut cg_iter = take_until(cg_iter, |ar| ar.annotation < 1e-3);
// Deconstruct to break out the result and cost
while let Some(AnnotatedResult {
	result: cgi,
	annotation: euc,
}) = cg_iter.next()
{
	// Now the loop body is I/O only as it should be!
	println!("||Ax - b||_2 = {:.5}, for x = {:.4}", euc, cgi.solution);
}

What is happening here? assess composes a StreamingIterator adaptor onto ConjugateGradient (just like Iterator adaptors), so below that statement cg_iter is a new kind of StreamingIterator whose items pair the underlying items with their cost, as evaluated by the given function. Similarly, take_until adapts the stream to return elements and stop after the first one satisfying a predicate (take_while is similar but stops before the desired element). Lastly, the loop header now deconstructs the struct, and its body does only I/O. OK, cute trick, but does this scale to more general scenarios?
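
Before answering that, let us demystify assess a little. Here is a rough sketch of how such an adaptor could be written; this is not the crate's actual implementation, and it assumes the streaming_iterator crate's StreamingIterator trait and simply clones each item to attach its annotation:

    use streaming_iterator::StreamingIterator;

    // The annotated item: the underlying result paired with its cost.
    pub struct AnnotatedResult<T, A> {
        pub result: T,
        pub annotation: A,
    }

    // An adaptor wrapping an inner StreamingIterator and a cost function.
    pub struct Assess<I, T, A, F> {
        inner: I,
        cost: F,
        current: Option<AnnotatedResult<T, A>>,
    }

    pub fn assess<I, T, A, F>(inner: I, cost: F) -> Assess<I, T, A, F>
    where
        I: StreamingIterator<Item = T>,
        F: FnMut(&T) -> A,
    {
        Assess { inner, cost, current: None }
    }

    impl<I, T, A, F> StreamingIterator for Assess<I, T, A, F>
    where
        I: StreamingIterator<Item = T>,
        T: Clone,
        F: FnMut(&T) -> A,
    {
        type Item = AnnotatedResult<T, A>;

        // Pull the next item from the wrapped iterator and attach its cost.
        fn advance(&mut self) {
            let cost = &mut self.cost;
            self.current = self.inner.next().map(|r| AnnotatedResult {
                annotation: cost(r),
                result: r.clone(),
            });
        }

        fn get(&self) -> Option<&Self::Item> {
            self.current.as_ref()
        }
    }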

A larger scenario, or: more realistic benchmarking

Suppose that:

  • Having seen residual_l2 converge, we decide that to our application the \(l_\infty\) norm of the residual matters more, so we want to track convergence in that metric as well.
  • We also learn that an energy function for the conjugate gradient method is \(||x-x^*||_A\) (where \(||a||_A^2 = a^\top A a\)), where \(x^*\) is the optimal solution (usually not known in advance in applications, but available in benchmarks). We decide to track that as well. Both new measures are implemented just below.
fn residual_linf(result: &ConjugateGradient) -> f64 {
    (result.a.dot(&result.solution) - &result.b).fold(0.0, |m, e| m.max(e.abs()))
}

fn a_distance(result: &ConjugateGradient, optimum: V) -> f64 {
    let error = &result.solution - &optimum;
    error.dot(&result.a.dot(&error)).sqrt()
}
  • We want to compare performance to a different algorithm solving the same problems. Since the algorithms are different, comparing progress per iteration is apples to oranges, so we want to measure progress by time.
  • We want to stop at the earlier of two events: reaching a sufficiently good solution by both residual measures (useful for a solid method), or the iteration count exceeding N (especially useful for a method still under development).

These requirements can all be fulfilled using a readable composition of pre-existing components:

// Set up a problem for which we happen to know the solution
let p = make_3x3_pd_system_2();
let optimum = rcarr1(&[-4.0, 6., -4.]);
let cg_iter = ConjugateGradient::for_problem(&p);

// Cap the number of iterations.
let cg_iter = cg_iter.take(80);

// Time each iteration: the duration covers only the preceding steps (the
// method itself), excluding downstream evaluation and I/O (tracking overhead),
// while the start time records elapsed wall-clock time since the beginning
// (combining both).
let cg_iter = time(cg_iter);

// Record multiple measures of quality
let cg_iter = assess(cg_iter, |TimedResult { result, .. }| {
	(
		residual_l2(result),
		residual_linf(result),
		a_distance(result, optimum.clone()),
	)
});

// Stop if converged by both criteria
fn small_residual((euc, linf, _): &(f64, f64, f64)) -> bool {
	euc < &1e-3 && linf < &1e-3
}
let mut cg_iter = take_until(cg_iter, |ar| small_residual(&ar.annotation));

// Output progress
while let Some(AnnotatedResult {
	annotation: (euc, linf, a_dist),
	result:
		TimedResult {
			result,
			start_time,
			duration,
		},
}) = cg_iter.next()
{
	println!(
		"{:8} : {:6}, \
		 ||Ax - b||_2 = {:.3}, ||Ax - b||_inf = {:.3}, ||x-x*||_A = {:.3}, \
		 for x = {:+.3}",
		start_time.as_nanos(),
		duration.as_nanos(),
		euc,
		linf,
		a_dist,
		result.solution,
	);
}

The result happily shows steady improvement in the energy function from the very first iterations:

     651 :  30304, ||Ax - b||_2 = 1.000, ||Ax - b||_inf = 1.000, ||x-x*||_A = 2.449, for x = [+0.000, +0.000, +0.000]
   75431 :  31863, ||Ax - b||_2 = 0.943, ||Ax - b||_inf = 0.667, ||x-x*||_A = 2.309, for x = [+0.000, +0.667, +0.000]
  148849 :  31173, ||Ax - b||_2 = 0.000, ||Ax - b||_inf = 0.000, ||x-x*||_A = 0.000, for x = [-4.000, +6.000, -4.000]

So far the components have been pretty general purpose, like time and take_until, even when parametrized, like assess. In the next post, Daniel Fox describes some more substantial adaptors he has contributed, which I find pretty exciting!