
Repa Laplace and SOR

Using the Repa Library for Laplace’s Equation and SOR

This describes the results of some experiments to enhance an existing elegant Haskell (parallel array) Laplace solver by using successive over-relaxation (SOR) techniques. The starting point was a Laplace solver based on stencil convolution, using the Haskell Repa library (Regular Parallel Arrays), as reported in a paper by Lippmeier, Keller and Peyton Jones. (There is also a published 2011 paper by the first two authors (Lippmeier and Keller 2011).)

Background

Laplace’s Equation and Iterative solvers

The Laplace equation

\displaystyle {\partial^2 u\over\partial x^2} + {\partial^2 u\over\partial y^2} = 0

is usually abbreviated to simply \nabla^2 u = 0, where u(x,y) is a scalar potential (such as temperature) over a smooth surface.

For numerical solutions we will restrict attention to a rectangular surface with some fixed boundary values for u. (The boundary is not necessarily the border of the rectangle, as we can allow for fixed values at internal regions as well.) The problem is made discrete for finite difference methods by imposing a grid of points over the surface, and here we will assume this is spaced evenly (\Delta x in the x direction and \Delta y in the y direction). Approximating solutions numerically amounts to approximating a fixed point for the grid values satisfying the boundary values and also, for non-boundary points (i,j), satisfying

\displaystyle u_{i,j} = { u_{i-1,j} + u_{i+1,j} + \beta^2 ( u_{i,j-1} + u_{i,j+1}) \over 2(1+\beta^2)}

where \beta = {\Delta x \over \Delta y }

If we also assume \Delta x = \Delta y then this simplifies to

\displaystyle u_{i,j} = { u_{i-1,j} + u_{i+1,j} + u_{i,j-1} + u_{i,j+1} \over 4}

Iterating to find this fixed point from some suitable starting values is called relaxation. The simplest (Jacobi) method involves iterating the calculation of new values u^{n+1} from previous values u^n until successive iterations are sufficiently close (convergence).

\displaystyle u^{n+1}_{i,j} = { u^n_{i-1,j} + u^n_{i+1,j} + u^n_{i,j-1} + u^n_{i,j+1} \over 4}

This method, whilst very simple, can be extremely slow to converge. One improvement is given by the Gauss-Seidel method, which assumes the next set of values in each iteration is calculated row by row in a scan across the columns, allowing some new values to be used earlier during the same iteration step (changing n to n+1 for values already calculated in the previous column and in the previous row).

\displaystyle u^{n+1}_{i,j} = { u^{n+1}_{i-1,j} + u^n_{i+1,j} + u^{n+1}_{i,j-1} + u^n_{i,j+1} \over 4}

Unfortunately, this assumption about the order of evaluation of array elements interferes with opportunities to use parallel computation of the array elements (as we wish to do using Repa array primitives).
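To make the update rules concrete, here is a minimal sketch of one Jacobi step on a grid represented as a plain list of lists (ordinary Haskell, not Repa; jacobiStep and isInterior are illustrative names, and the outer border here stands in for the fixed boundary):

```haskell
-- One Jacobi relaxation step on a small list-of-lists grid.
-- Border cells are treated as fixed boundary values and copied through;
-- interior cells are replaced by the average of their four neighbours.
jacobiStep :: [[Double]] -> [[Double]]
jacobiStep u =
  [ [ if isInterior i j
        then (u !! (i-1) !! j + u !! (i+1) !! j
            + u !! i !! (j-1) + u !! i !! (j+1)) / 4
        else u !! i !! j
    | j <- [0 .. cols - 1] ]
  | i <- [0 .. rows - 1] ]
  where
    rows = length u
    cols = length (head u)
    isInterior i j = i > 0 && i < rows - 1 && j > 0 && j < cols - 1
```

(The list indexing makes this inefficient and it is only for exposition; the Repa stencil approach avoids explicit indexing entirely.)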

Successive over-relaxation (SOR)

Successive over-relaxation is a method used to speed up convergence of the iterations by using a weighted sum of the previous iteration with the new one at each relaxation step. Using 0 < \omega < 2 as the weight we calculate first u' (substituting u' for u^{n+1} in the above equation), then calculate

\displaystyle u^{n+1}_{i,j} = \omega u'_{i,j} + (1- \omega ) u^n_{i,j}

Values of \omega less than 1 will slow down convergence (under-relaxation). A value of \omega between 1 and 2 should speed up convergence. In fact the optimal value of \omega varies from iteration to iteration, but we will work with constant values only here. It is also important to note that values of \omega above 1 will generally cause oscillations (divergence) if just the Jacobi scheme is used, so successive over-relaxation relies on a speedup of the basic iteration such as that provided by the Gauss-Seidel scheme.
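Numerically, the weighted update is a one-liner; a sketch (sorUpdate is an illustrative name, not part of the solver):

```haskell
-- SOR update: blend the freshly relaxed value u' with the old value,
-- weighted by omega. omega = 1 recovers the unweighted iteration;
-- 1 < omega < 2 over-relaxes (extrapolates past the new value).
sorUpdate :: Double -> Double -> Double -> Double
sorUpdate omega old new = omega * new + (1 - omega) * old
```

For example, with omega = 1.5, an old value 2 and a new value 4 give 1.5*4 - 0.5*2 = 5, overshooting the relaxed value in the direction of change.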

The original stencil version using Repa

In the paper a stencil mapping technique is used to get fast performance using the Repa array primitives. The code uses version 3 of the Repa library.

The main relaxation step is expressed as

relaxLaplace arr
 = computeP
 $ szipWith (+) arrBoundValue   -- boundary setting
 $ szipWith (*) arrBoundMask    -- boundary clearing
 $ smap (/ 4)
 $ mapStencil2 (BoundConst 0)
    [stencil2|   0 1 0
                 1 0 1 
                 0 1 0 |] arr

We will be taking this as a starting point, so we review details of this (working from the bottom up).

A stencil is described in a quoted form to express which neighbours are to be added (using zeros and ones), with the element being scanned assumed to be the middle item. Thus this stencil describes adding the north, west, east, and south neighbours for each item. mapStencil2 is a Repa array primitive which applies the stencil to all elements of the previous iteration array (arr). It is also supplied with a border directive (BoundConst 0) which says that any stencil item arising from indices outside the array is to be treated as 0. This stencil mapping results in a non-manifest array, in fact a partitioned array split up into border parts and non-border parts to aid later parallel computation. After the stencil map there is a mapping across all the results to divide by 4 using smap (which improves over the basic Repa array map for partitioned arrays). These calculations implement the Jacobi scheme.

The subsequent two szipWith operations are used to reset the fixed boundary values. (The primitive szipWith again improves over the basic Repa array zipWith by catering explicitly for partitioned arrays.) More specifically, szipWith (*) is used with an array (arrBoundMask) which has zero for boundary positions and 1 for other positions, thus setting the boundary items to 0. After this, szipWith (+) is used with an array (arrBoundValue) which has the fixed initial values for the boundary items and zero for all other items. Thus the addition reinstates the initial boundary values.
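The mask-then-add trick can be seen on a one-dimensional fragment (plain lists, illustrative names only): multiplying by the mask zeroes the boundary slots, and adding the value array reinstates their fixed values while leaving the other slots untouched.

```haskell
-- Boundary resetting sketch: mask has 0 at boundary positions and 1
-- elsewhere; values has the fixed boundary values and 0 elsewhere.
resetBoundary :: Num a => [a] -> [a] -> [a] -> [a]
resetBoundary values mask xs = zipWith (+) values (zipWith (*) mask xs)
```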

These array calculations are all delayed so a final computeP is used to force the parallel evaluations of items to produce a manifest array result. This technique allows a single pass with inlining and fusion optimisations of the code to be possible for the calculation of each element. Use of computeP requires the whole calculation to be part of a monad computation. This is a type constraint for the primitive to exclude the possibility of nested calls of computeP overlapping.

A significant advantage of this stencil implementation is that programming directly with array indices is avoided (these are built into the primitives once and for all and are implemented to avoid array bound checks being repeated unnecessarily).

Unfortunately, though, this implementation is using the Jacobi scheme for the iteration steps which is known to have very slow convergence.

We would like to improve on this by using successive over-relaxation, but as pointed out earlier, this will not work with the Jacobi scheme. Furthermore, the Gauss-Seidel improvement will not help because it is based on retrieving values from an array still being updated. This is not compatible with functional array primitives and stencils and prevents simple exploitation of parallel array calculations.

To the rescue – the red-black scheme for calculating iterations.

Red-black Iterations

This scheme is well known and based on the observation that on a red-black checkerboard, for any red square, the north, west, east, and south neighbours are all black and conversely neighbours of black squares are all red. This leads to the simple idea of calculating updates to all the red squares first, then using these updated values to calculate the new black square values.

Implementation using stencils and repa

The set up

We split the original array into a red array and a black array with simple traverse operations. We assume the original array (arr) has an even number of columns, so the number of red cells equals the number of black cells. As a convention we will take the (0,0) item to be red, so for the red array (r) we want

r(i,j) = arr(i, 2*j + (i `mod` 2))

where

arr has i <- 0..n-1, j <- 0..2m-1
r   has i <- 0..n-1, j <- 0..m-1

The traverse operation from the Repa library takes as arguments the original array, a mapping to express the change in shape, and a lookup function to calculate values in the new array (when given a get operation for the original array and a coordinate (i,j) in the new array)

> projectRed :: Array U DIM2 Double -> Array D DIM2 Double
> projectRed arr =  
>   traverse arr 
>           (\ (e :. i :. j) -> (e :. i :. (j `div` 2)))
>           (\get (e :. i :. j) -> get (e :. i :. 2*j + (i `mod` 2)))

Here (and throughout) we have restricted the type to work with two dimensional arrays of Double, although a more general type is possible. Notice also that the argument array is assumed to be manifest (U) and the result is a delayed array (D).
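The index arithmetic is easy to check on a small example with plain lists (projectRedList is a throwaway analogue of projectRed, not part of the module):

```haskell
-- List analogue of projectRed: row i keeps columns 2*j + (i `mod` 2),
-- i.e. exactly the cells where i+j is even on the original grid.
projectRedList :: [[a]] -> [[a]]
projectRedList arr =
  [ [ row !! (2*j + (i `mod` 2)) | j <- [0 .. length row `div` 2 - 1] ]
  | (i, row) <- zip [0 ..] arr ]
```

For a 2x4 grid [[0,1,2,3],[10,11,12,13]] this picks [[0,2],[11,13]]: columns 0 and 2 on the even row, columns 1 and 3 on the odd row.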

Similarly for the black array (b) we want

b(i,j) = arr(i, 2*j + ((i+1) `mod` 2))

with the same extent (shape) as for r, hence

> projectBlack :: Array U DIM2 Double -> Array D DIM2 Double
> projectBlack arr =  
>     traverse arr 
>             (\ (e :. i :. j) -> (e :. i :. (j `div` 2)))
>             (\ get (e :. i :. j) -> get(e :. i :. 2*j + ((i+1) `mod` 2)))

We can also use these same functions to set up boundary mask and value arrays separately for red and black from the starting array (arrInit), initial mask (arrBoundMask), and boundary values (arrBoundValue)

do redBoundMask    <- computeP $ projectRed arrBoundMask
   blackBoundMask  <- computeP $ projectBlack arrBoundMask
   redBoundValue   <- computeP $ projectRed arrBoundValue
   blackBoundValue <- computeP $ projectBlack arrBoundValue
   redInit         <- computeP $ projectRed arrInit
   blackInit       <- computeP $ projectBlack arrInit

This is part of a monad computation with each step using a computeP (parallel array computation) to create manifest versions of each of the arrays we need before beginning the iterations. These calculations are independent, so the sequencing is arbitrary.

The finish

At the end of the iterations we will reverse the split into red and black, combining the two arrays using the traverse2 operation from the Repa library. We need

arr(i,j) = r(i, j `div` 2) when even(i+j)
         = b(i, j `div` 2) otherwise            

where

arr has i <- 0..n-1 , j <- 0..2m-1
r   has i <- 0..n-1 , j <- 0..m-1
b   has i <- 0..n-1 , j <- 0..m-1

The traverse2 operation takes two arrays (here of the same extent), a mapping to express the new extent (when given the two extents of the argument arrays), and a function to calculate values in the new array (when given the respective get operations for the original arrays (here get1 and get2) and a coordinate (i,j) in the new array).

> combineRB r b =   
>   traverse2 r b
>            (\ (e :. i :. j) _ -> (e :. i :. 2*j))
>            (\ get1 get2 (e :. i :. j) -> 
>               (if even(i+j) then get1 else get2) (e :. i :. j `div` 2)
>            )
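Again a plain-list analogue (combineRBList, illustrative only) shows that this interleaving inverts the red/black split:

```haskell
-- List analogue of combineRB: even (i+j) positions come from the red
-- half-array, the rest from the black one, each at column j `div` 2.
combineRBList :: [[a]] -> [[a]] -> [[a]]
combineRBList r b =
  [ [ (if even (i + j) then rrow else brow) !! (j `div` 2)
    | j <- [0 .. 2 * length rrow - 1] ]
  | (i, rrow, brow) <- zip3 [0 ..] r b ]
```

Combining the red half [[0,2],[11,13]] and black half [[1,3],[10,12]] of a 2x4 grid reconstructs [[0,1,2,3],[10,11,12,13]].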
 

Stencils in the iteration step

We use stencils as in the original, but when we consider a stencil on black cells which are to be combined as neighbours of the red cell r(i,j) we need the following shape, where b(i,j) corresponds to the middle item

                    0 1 0         
                    1 1 0
                    0 1 0

BUT this is only when processing EVEN rows from the black array. For odd rows we need a different shape

                    0 1 0 
                    0 1 1 
                    0 1 0

That is, we need to apply one of two different stencils depending on whether the row is even or odd. Similarly in processing the red array to combine red neighbours of b(i,j) we need the same shaped stencils but the first shape above is used on odd rows of red cells and the second shape on even rows. We define and name the stencils as leftSt and rightSt

> leftSt :: Stencil DIM2 Double
> leftSt  =   [stencil2|  0 1 0         
>                         1 1 0 
>                         0 1 0 |]

> rightSt :: Stencil DIM2 Double
> rightSt =   [stencil2|  0 1 0 
>                         0 1 1 
>                         0 1 0 |]

Critically, we will need an efficient way to apply alternate stencils as we map across an array. At first, it may seem that we might have to rebuild a version of the primitive mapStencil2 to accommodate this, but that will get us into some complex array representation handling which is built into that primitive. On reflection, though, the combination of lazy evaluation and smart inlining/fusion optimisation by the compiler should allow us to simply apply both stencils on all elements and then choose the results we actually want. The laziness should ensure that the unwanted stencil applications are not actually calculated, and the administration of choosing should be simplified by compiler optimisations.

To apply our stencils we will use the Repa array primitive mapStencil2, which takes as arguments a border handling directive, a stencil, and a (manifest) array, to produce a resulting (delayed) array.

 mapStencil2 :: Boundary Double
             -> Stencil DIM2 Double
             -> Array U DIM2 Double
             -> Array D DIM2 Double 

The choice of stencil will depend on the position (actually the evenness of the row). As we cannot refer to the indices when using the stencil mapping operation we are led to using traverse2 after mapping both stencils to select results from the two arrays produced. Our alternate stencil mapping operation will have a similar type to mapStencil2 except that it expects two stencils rather than one.

altMapStencil2 :: Boundary Double
               -> Stencil DIM2 Double
               -> Stencil DIM2 Double
               -> Array U DIM2 Double
               -> Array D DIM2 Double 

altMapStencil2 !bd !s1 !s2 !arr
  = traverse2 (mapStencil2 bd s1 arr) (mapStencil2 bd s2 arr)
              (\ e _ -> e)
              (\ get1 get2 (e :. i :. j) -> 
                 (if even i then get1 else get2) (e :. i :. j)
              )

This function needs to be inlined (with a pragma) and the bang annotations on arguments are there to improve optimisation opportunities.

The iteration step

The main relaxLaplace iteration step involves the following (where r and b are the previous red and black arrays)

  • Calculating a new version of r using stencils over b
    r1 = smap (/4)
         $ altMapStencil2 (BoundConst 0) leftSt rightSt b
  • Over-relaxation of this taking weighted sums (using r and r1)
    r2 = szipWith weightedSum r r1
         where weightedSum old new = (1-omega)*old + omega*new
  • Reset the boundaries to get the new version of r (r’)
    r' = szipWith (+) redBoundValue      -- boundary resetting
         $ szipWith (*) redBoundMask r2  -- boundary clearing
  • Do similar steps for calculating b’ but starting with stencils over r’ (not r) and switching the left and right stencils

This is combined in the following monad computation (where r and b are the old arrays). The monad is necessary because we want to use computeP to ensure that the final returned arrays are manifest.

 do r' <- computeP
          $ relaxStep r b redBoundValue redBoundMask leftSt rightSt
    b' <- computeP 
          $ relaxStep b r' blackBoundValue blackBoundMask rightSt leftSt
    ...

which uses

relaxStep !arrOld !arrNbs !boundValue !boundMask !stencil1 !stencil2 
   = szipWith (+) boundValue
     $ szipWith (*) boundMask
     $ szipWith weightedSum arrOld
     $ smap (/4)
     $ altMapStencil2 (BoundConst 0) stencil1 stencil2 arrNbs

weightedSum !old !new = (1-omega)*old + omega*new

The first argument for relaxStep is the old (red or black) array we are calculating an update for, and the second argument is the neighbour array to use stencils on. The old array is needed for the over-relaxation step with weightedSum.

It is also worth pointing out that we have the over-relaxation step being done before the two boundary resetting steps, but this could just as easily be done after the boundary resetting as it does not make a difference to the final array produced.

Laplace with Red-Black and Stencils

Finally the main function (solveLaplace) sets up the initial arrays and passes them to the function with the main loop (iterateLaplace)

> solveLaplace::
>   Monad m
>   => Int		   -- ^ Number of iterations to use.
>   -> Double		   -- ^ weight for over relaxing (>0.0 and <2.0)
>   -> Array U DIM2 Double -- ^ Boundary value mask.
>   -> Array U DIM2 Double -- ^ Boundary values.
>   -> Array U DIM2 Double -- ^ Initial state. Should have even number of columns
>   -> m (Array U DIM2 Double)

> solveLaplace !steps !omega !arrBoundMask !arrBoundValue !arrInit = do
>    redBoundMask    <- computeP $ projectRed arrBoundMask
>    blackBoundMask  <- computeP $ projectBlack arrBoundMask
>    redBoundValue   <- computeP $ projectRed arrBoundValue
>    blackBoundValue <- computeP $ projectBlack arrBoundValue
>    redInit         <- computeP $ projectRed arrInit
>    blackInit       <- computeP $ projectBlack arrInit
>    iterateLaplace steps omega redInit blackInit
>                   redBoundValue blackBoundValue redBoundMask blackBoundMask

where

iterateLaplace !steps !omega !redInit !blackInit
               !redBoundValue !blackBoundValue !redBoundMask !blackBoundMask 
     = go steps redInit blackInit
       where 
         go 0 !r !b = computeP $ combineRB r b -- return final combined array
         go n !r !b 
            = do r' <- computeP
                       $ relaxStep r b redBoundValue redBoundMask leftSt rightSt
                 b' <- computeP 
                       $ relaxStep b r' blackBoundValue blackBoundMask rightSt leftSt
                 go (n - 1) r' b'

         {-# INLINE relaxStep #-}
         relaxStep !arrOld !arrNbs !boundValue !boundMask !stencil1 !stencil2 
            = szipWith (+) boundValue
              $ szipWith (*) boundMask
              $ szipWith weightedSum arrOld
              $ smap (/4)
              $ altMapStencil2 (BoundConst 0) stencil1 stencil2 arrNbs

         {-# INLINE weightedSum #-}
         weightedSum !old !new = (1-omega)*old + omega*new

{-# INLINE iterateLaplace #-}

Performance and a New Version

The number of calculations in an iteration of red-black is comparable to the original stencil implementation (with just the weighting operations added). Although there are two arrays to update, they are each half the size of the original, so we would expect optimised performance to be only fractionally slower than the original. The speedup in progress towards convergence can be dramatic (one iteration for every 8 of the original in our test examples), so in principle this would be a big improvement. Unfortunately, optimisation did not achieve the same speedups as in the original, and the code ran roughly 12 times slower.

After much experimentation it looked as though the inner traverse2 operation of altMapStencil2 was inhibiting the optimisations (fusions with stencil mapping code and subsequent maps and zips).

A better performance was achieved by separating the stencil mapping from the traverse2 for alternate row selection and delaying the alternate row selection until after all the other operations. The new version drops altMapStencil2 and instead simply uses altRows

> altRows :: forall r1 r2 a . (Source r1 a, Source r2 a)
>         => Array r1 DIM2 a -> Array r2 DIM2 a -> Array D DIM2 a  
>             
> altRows !arr1 !arr2 =  -- assumes argument arrays with the same shape
>     traverse2 arr1 arr2
>               (\ e _ -> e)
>               (\ get1 get2 e@(_ :. i :. _) -> 
>                   if even i then get1 e else get2 e
>               )
>                  
> {-# INLINE altRows #-}
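The behaviour of altRows is easy to state with a list analogue (altRowsList, illustrative only): take the even-numbered rows from the first array and the odd-numbered rows from the second.

```haskell
-- List analogue of altRows: rows 0,2,4,... from the first argument,
-- rows 1,3,5,... from the second (the arrays must have the same shape).
altRowsList :: [[a]] -> [[a]] -> [[a]]
altRowsList a1 a2 =
  [ if even i then r1 else r2 | (i, r1, r2) <- zip3 [0 ..] a1 a2 ]
```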

The function relaxStep in the following revised version of iterateLaplace has altRows done last, with mapStencil2 applied first (using a different stencil in each alternative)

> iterateLaplace ::
>   Monad m
>   => Int
>   -> Double
>   -> Array U DIM2 Double
>   -> Array U DIM2 Double
>   -> Array U DIM2 Double
>   -> Array U DIM2 Double
>   -> Array U DIM2 Double
>   -> Array U DIM2 Double
>   -> m (Array U DIM2 Double)
>           
> iterateLaplace !steps !omega !redInit !blackInit
>                !redBoundValue !blackBoundValue !redBoundMask !blackBoundMask 
>      = go steps redInit blackInit
>        where 
>          go 0 !r !b = computeP $ combineRB r b -- return final combined array
>          go n !r !b 
>             = do r' <- computeP 
>                        $ relaxStep r b redBoundValue redBoundMask leftSt rightSt
>                  b' <- computeP
>                        $ relaxStep b r' blackBoundValue blackBoundMask rightSt leftSt
>                  go (n - 1) r' b'
>              
>          {-# INLINE relaxStep #-}
>          relaxStep !arrOld !arrNbs !boundValue !boundMask !stencil1 !stencil2 
>             = altRows (f stencil1) (f stencil2)
>               where 
>                     {-# INLINE f #-}
>                     f s = szipWith (+) boundValue
>                           $ szipWith (*) boundMask
>                           $ szipWith weightedSum arrOld
>                           $ smap (/4)
>                           $ mapStencil2 (BoundConst 0) s arrNbs
>
>          {-# INLINE weightedSum #-}
>          weightedSum !old !new = (1-omega)*old + omega*new
>
> {-# INLINE iterateLaplace #-}

This version now runs about 4 times slower than the original stencil solution (so about 3 times faster than the previous red-black version). Thus, with an approximately 8-fold speedup in convergence, it still gives an overall improvement.

In order to compile this version, it was necessary to use not just the GHC -O2 flag but also -fsimpl-tick-factor=1000. The need for this flag was indicated by a GHC compiler message.

All the code above with bird tracks (>) in this (incomplete) literate Haskell document is essentially that in the module RedBlackStencilOpt.hs (which has some extra preliminaries, imports, and argument checking that were elided here for clarity). This code can be found here along with the module RedBlackStencil.hs, which contains the first version. These can both be loaded by the wrapper newWrapper.hs, which is just an adaptation of the original wrapper to allow for passing the extra \omega parameter.

Acknowledgements and References

There are many numerical methods books covering the relevant background mathematics, but there are also lecture notes online: in particular, Computational Numerical Analysis of Partial Differential Equations by J. M. McDonough and Numerical Solution of Laplace Equation by G. E. Urroz. T. Kadin has some tutorial notes with a diagram for the red-black scheme (and an implementation using Fortran 90). There is a useful Repa tutorial online, and more examples of parallel array fusion are reported in (Lippmeier et al. 2012).

Lippmeier, Ben, and Gabriele Keller. 2011. “Efficient Parallel Stencil Convolution in Haskell.” In Proceedings of the 4th ACM Symposium on Haskell, 59–70. Haskell ’11. New York, NY, USA: ACM. doi:10.1145/2034675.2034684. http://doi.acm.org/10.1145/2034675.2034684.

Lippmeier, Ben, Manuel Chakravarty, Gabriele Keller, and Simon Peyton Jones. 2012. “Guiding Parallel Array Fusion with Indexed Types.” SIGPLAN Not. 47 (12) (September): 25–36. doi:10.1145/2430532.2364511. http://doi.acm.org/10.1145/2430532.2364511.