Graceful Recovery From malloc Failure

S. Gilles
2017-08-04

Background

An argu^Wpolite discussion I've seen play out a few times so far goes like this:

A: I have created a new library/operating system/programming language for you to use. Please enjoy.

B: I have a bug report: the memory allocations are not checked for failure. Very bad things will happen on line N + k if the memory allocation on line N failed.

A: Thank you for the report - I've made a fix.

B: Your fix is to call abort(3) if malloc(3) ever fails. This is better, but could you instead clean up the heap and report failure to the user?

A: That would be really hard, and what are you going to do besides abort() anyway?

In my experience, by this point in the conversation negotiations have broken down a little, and no progress is made. Here's a perfect example of what could be done besides abort():

Background (topology)

For my research, I'm interested in generalizing (very slightly) Thurston's Gluing Equations: for any hyperbolic 3-manifold M, if you are interested in listing all possible “nice” maps

  f : { loops in M } → { some Lie group G }

it turns out that it suffices to cut M up into a (finite) number of tetrahedra, and then associate to each tetrahedron a (finite) set of (non-zero) complex numbers: the A-coordinates. Performing some mechanical computations, you get a second (finite) set of (non-zero) complex numbers: the X-coordinates. Not only that, but the {A,X}-coordinates are associated to particular geometric locations on those tetrahedra, and f(γ) can be calculated by (roughly) multiplying together the X-coordinates that lie along the loop γ.

For the manifold M = S³ \ 4₁ (the complement of the figure-eight knot, which has a hyperbolic structure), for example, it turns out that there are only 2 tetrahedra, and considering maps into SO(5,ℝ) requires 23 X-coordinates for each tetrahedron (according to one convention, if I've counted correctly).
So if you know 46 (non-zero) complex numbers, you can compute f.

It turns out 46 numbers are too many. Some always come in pairs (x, 1/x), some may not be used in recovering f, etc. The question then arises: “What is the exact set of restrictions we need to put on these 46 numbers to guarantee that they correspond to some f?”. My research is to try to answer this question (using specific “quivers” developed recently, and independently, by Zickert and Le) when G is a Lie group of type Bₙ, Cₙ or (hopefully) G₂.

My approach is not very clever, but it gives context to the situation in which I care about memory errors. I fix a canonical set of A-coordinates as a basis/super-basis, then force the assumption that the A-coordinates belong to a triangulation (these equations aren't difficult). Then I represent all X-coordinates as rational polynomials in terms of that basis. Next, I evaluate those polynomials for some pseudorandom assignments of complex numbers to A-coordinates. This allows me to make some guesses about which products of X-coordinates will be 1. When I symbolically confirm a guess, I obtain a restriction on the X-coordinates which must be satisfied for an X-coordinate assignment to correspond to a triangulation.

Once I have enough necessary restrictions, I'll test whether they are sufficient through other means, so for now the important thing is to multiply together as many different combinations of polynomials as possible and list all the ways to get 1. The code in question:

  X := { the rational polynomials representing the X-coordinates }
  C := make_guesses_via_numerical_approximation(X)
  for c in C, (c is a list with length n) {
      if (Prod over i=1,2,…,n of X[c[i]] is 1) {    (*)
          output({ X[c[i]] : i=1,2,…,n })
      }
  }

Line (*) is performed using the PARI library[0]. Regardless of whatever optimizations are applied to obtain C, some products must be computed to check whether the variables cancel.
For reference, one of the worse elements of X, in one computation, is

  x'_{1'} =
    ((a'₁·a'₀₃·a'₃₂·a'₂₃²·a₁³·a₀₃·a₂₃²·a₀² + ((a'₁³·a'₃₂²·a₁³ + a'₂·a'₁³·a'₃₂·a₂·a₁)·a₀₃·a₂₃² + (a'₁·a'₀₃·a'₃₂·a'₂₃²·a₁⁵ + a'₂·a'₁·a'₀₃·a'₂₃²·a₂·a₁³)·a₃₂)·a₀ + (a'₁³·a'₃₂²·a₁⁵ + 2·a'₂·a'₁³·a'₃₂·a₂·a₁³ + a'₂²·a'₁³·a₂²·a₁)·a₃₂)·aₛ² + (((a'₂·a'₁³·a'₃₂·a₁·a₀₃·a₃₀ + a'₂·a'₁²·a'₃₀·a'₃₂·a₁²·a₀₃)·a₂₃³ + (a'₂·a'₁²·a'₃₂·a'₂₃·a₁²·a₀₃·a₃₀ + a'₂·a'₁·a'₃₀·a'₃₂·a'₂₃·a₁³·a₀₃)·a₂₃²)·a₀² + (((a'₂·a'₁³·a'₃₂·a₁³ + a'₂²·a'₁³·a₂·a₁)·a₃₀ + (a'₂·a'₁²·a'₃₀·a'₃₂·a₁⁴ + a'₂²·a'₁²·a'₃₀·a₂·a₁²))·a₃₂·a₂₃ + ((a'₂·a'₁²·a'₃₂·a'₂₃·a₁⁴ + a'₂²·a'₁²·a'₂₃·a₂·a₁²)·a₃₀ + (a'₂·a'₁·a'₃₀·a'₃₂·a'₂₃·a₁⁵ + a'₂²·a'₁·a'₃₀·a'₂₃·a₂·a₁³))·a₃₂)·a₀)·aₛ)
    / ((a'₀₃²·a'₂₃³·a₁²·a₀₃·a₂₃³·a₀³ + (a'₁²·a'₀₃·a'₃₂·a'₂₃·a₁²·a₀₃·a₂₃³ + 2·a'₂·a'₁·a'₀₃·a'₂₃²·a₂·a₁·a₀₃·a₂₃² + a'₀₃²·a'₂₃³·a₁⁴·a₃₂·a₂₃)·a₀² + ((a'₁²·a'₀₃·a'₃₂·a'₂₃·a₁⁴ + a'₂·a'₁²·a'₀₃·a'₂₃·a₂·a₁²)·a₃₂ + (a'₂·a'₁²·a'₃₂·a'₂₃·a₂·a₁² + a'₂²·a'₁²·a'₂₃·a₂²)·a₀₃)·a₂₃·a₀)·aₛ² + ((2·a'₂·a'₁·a'₀₃·a'₂₃²·a₁·a₀₃·a₃₀ + 2·a'₂·a'₀₃·a'₃₀·a'₂₃²·a₁²·a₀₃)·a₂₃³·a₀³ + ((2·a'₂²·a'₁²·a'₂₃·a₂·a₀₃·a₃₀ + 2·a'₂²·a'₁·a'₃₀·a'₂₃·a₂·a₁·a₀₃)·a₂₃² + (2·a'₂·a'₁·a'₀₃·a'₂₃²·a₁³·a₃₀ + 2·a'₂·a'₀₃·a'₃₀·a'₂₃²·a₁⁴)·a₃₂·a₂₃)·a₀²)·aₛ + ((a'₂²·a'₁²·a'₂₃·a₀₃·a₃₀² + 2·a'₂²·a'₁·a'₃₀·a'₂₃·a₁·a₀₃·a₃₀ + a'₂²·a'₃₀²·a'₂₃·a₁²·a₀₃)·a₂₃³·a₀³ + (a'₂²·a'₁²·a'₂₃·a₁²·a₃₀² + 2·a'₂²·a'₁·a'₃₀·a'₂₃·a₁³·a₃₀ + a'₂²·a'₃₀²·a'₂₃·a₁⁴)·a₃₂·a₂₃·a₀²))

This takes, I estimate, somewhere on the order of 10KiB for PARI to store in a way that can be used efficiently. If a few elements of that complexity are multiplied together, and the product doesn't cancel to 1, it's entirely possible that PARI's internal representation of the result will take more than 4GiB of memory (which appears to be the maximum size I can make PARI's stack). Unluckily, PARI's stack-based operations kill the program on failure.

In my particular situation, overflowing PARI's stack is something I could recover from: I'd just output a warning that that particular product needs to be checked by hand.
This is precisely the sort of example that I think B should be giving: programs performing memory-intensive operations, some of which are optional. I am content with a small number of warnings on stderr regarding products which are difficult to compute. I am not content with a 12 hour computation aborting 7 hours in.

Fixes

In my particular case, there are fixes. PARI primarily works off its own internal stack, but objects can be moved to main memory if needed. This removes the 4GiB limit.

More importantly, however, PARI exposes the top, bottom, and current position of its internal stack. This allows me to monitor memory usage in (*), and cleanly skip the computation if too much of the stack memory starts being used.

It also turns out that continual memory growth is an excellent heuristic for detecting that a computation is probably not going to cancel to 1. If

  o all factors have roughly the same initial size,
  o “unlucky” multiplications increase memory usage exponentially, and
  o such growth has been observed after multiplying the first ceil(n/2) of the n factors,

then it's very improbable that the last few factors have enough complexity to cancel the intermediate product, so we can abandon the computation, skipping the last (and probably the most intensive) steps. This also allows me to use low 10s of MiB, not GiB, for PARI's stack.

[0] http://pari.math.u-bordeaux.fr