Graceful Recovery From malloc Failure

S. Gilles
2017-08-04

Background

An argu^Wpolite discussion I've seen play out a few times so far goes like this:

A: I have created a new library/operating system/programming language for you to use. Please enjoy.

B: I have a bug report: the memory allocations are not checked for failure. Very bad things will happen on line N + k if the memory allocation on line N failed.

A: Thank you for the report - I've made a fix.

B: Your fix is to call abort(3) if malloc(3) ever fails. This is better, but could you instead clean up the heap and report failure to the user?

A: That would be really hard, and what are you going to do besides abort() anyway?

In my experience, by this point in the conversation negotiations have broken down a little, and no progress is made. Here's a perfect example of what could be done besides abort():

Background (topology)

For my research, I'm interested in generalizing (very slightly) Thurston's Gluing Equations: for any hyperbolic 3-manifold M, if you are interested in listing all possible “nice” maps

  f : { loops in M } → { some Lie group G }

it turns out that it suffices to cut M up into a (finite) number of tetrahedra, and then associate to each tetrahedron a (finite) set of (non-zero) complex numbers: the A-coordinates. Performing some mechanical computations, you get a second (finite) set of (non-zero) complex numbers: the X-coordinates. Not only that, but the {A,X}-coordinates are associated to particular geometric locations on those tetrahedra, and f(γ) can be calculated by (roughly) multiplying together the X-coordinates that lie along the loop γ.

For the manifold M = S³ \ 4₁ (the complement of the figure-eight knot, which has a hyperbolic structure), for example, it turns out that there are only 2 tetrahedra, and considering maps into SO(5,ℝ) requires 23 X-coordinates for each tetrahedron (according to one convention, if I've counted correctly).
So if you know 46 (non-zero) complex numbers, you can compute f.

It turns out 46 numbers are too many. Some always come in pairs (x, 1/x), some may not be used in recovering f, etc. The question then arises: “What is the exact set of restrictions we need to put on these 46 numbers to guarantee that they correspond to some f?”. My research is to try to answer this question (using specific “quivers” developed recently, and independently, by Zickert and Le) when G is a Lie group of type Bₙ, Cₙ or (hopefully) G₂.

My approach is not very clever, but it gives context to the situation in which I care about memory errors. I fix a canonical set of A-coordinates as a basis/super-basis, then force the assumption that the A-coordinates belong to a triangulation (these equations aren't difficult). Then I represent all X-coordinates as rational polynomials in terms of that basis. Next, I evaluate those polynomials for some pseudorandom assignments of complex numbers to A-coordinates. This allows me to make some guesses about which products of X-coordinates will be 1. When I symbolically confirm a guess, I obtain a restriction on the X-coordinates which must be satisfied for an X-coordinate assignment to correspond to a triangulation.

Once I have enough necessary restrictions, I'll test whether they are sufficient through other means, so for now the important thing is to multiply together as many different combinations of polynomials as possible and list all the ways to get 1. The code in question:

  X := { the rational polynomials representing the X-coordinates }
  C := make_guesses_via_numerical_approximation(X)
  for c in C, (c is a list with length n) {
      if (Prod over i=1,2,…,n of X[c[i]] is 1) {    (*)
          output({ X[c[i]] : i=1,2,…,n })
      }
  }

Line (*) is performed using the PARI library[0]. Regardless of whatever optimizations are applied to obtain C, some products must be computed to check whether the variables cancel.
For reference, one of the worse elements of X, in one computation, is

  x'_{1'} =
    ((a'₁·a'₀₃·a'₃₂·a'₂₃²·a₁³·a₀₃·a₂₃²·a₀² + ((a'₁³·a'₃₂²·a₁³ + a'₂·a'₁³·a'₃₂·a₂·a₁)·a₀₃·a₂₃² + (a'₁·a'₀₃·a'₃₂·a'₂₃²·a₁⁵ + a'₂·a'₁·a'₀₃·a'₂₃²·a₂·a₁³)·a₃₂)·a₀ + (a'₁³·a'₃₂²·a₁⁵ + 2·a'₂·a'₁³·a'₃₂·a₂·a₁³ + a'₂²·a'₁³·a₂²·a₁)·a₃₂)·aₛ² + (((a'₂·a'₁³·a'₃₂·a₁·a₀₃·a₃₀ + a'₂·a'₁²·a'₃₀·a'₃₂·a₁²·a₀₃)·a₂₃³ + (a'₂·a'₁²·a'₃₂·a'₂₃·a₁²·a₀₃·a₃₀ + a'₂·a'₁·a'₃₀·a'₃₂·a'₂₃·a₁³·a₀₃)·a₂₃²)·a₀² + (((a'₂·a'₁³·a'₃₂·a₁³ + a'₂²·a'₁³·a₂·a₁)·a₃₀ + (a'₂·a'₁²·a'₃₀·a'₃₂·a₁⁴ + a'₂²·a'₁²·a'₃₀·a₂·a₁²))·a₃₂·a₂₃ + ((a'₂·a'₁²·a'₃₂·a'₂₃·a₁⁴ + a'₂²·a'₁²·a'₂₃·a₂·a₁²)·a₃₀ + (a'₂·a'₁·a'₃₀·a'₃₂·a'₂₃·a₁⁵ + a'₂²·a'₁·a'₃₀·a'₂₃·a₂·a₁³))·a₃₂)·a₀)·aₛ)
    / ((a'₀₃²·a'₂₃³·a₁²·a₀₃·a₂₃³·a₀³ + (a'₁²·a'₀₃·a'₃₂·a'₂₃·a₁²·a₀₃·a₂₃³ + 2·a'₂·a'₁·a'₀₃·a'₂₃²·a₂·a₁·a₀₃·a₂₃² + a'₀₃²·a'₂₃³·a₁⁴·a₃₂·a₂₃)·a₀² + ((a'₁²·a'₀₃·a'₃₂·a'₂₃·a₁⁴ + a'₂·a'₁²·a'₀₃·a'₂₃·a₂·a₁²)·a₃₂ + (a'₂·a'₁²·a'₃₂·a'₂₃·a₂·a₁² + a'₂²·a'₁²·a'₂₃·a₂²)·a₀₃)·a₂₃·a₀)·aₛ² + ((2·a'₂·a'₁·a'₀₃·a'₂₃²·a₁·a₀₃·a₃₀ + 2·a'₂·a'₀₃·a'₃₀·a'₂₃²·a₁²·a₀₃)·a₂₃³·a₀³ + ((2·a'₂²·a'₁²·a'₂₃·a₂·a₀₃·a₃₀ + 2·a'₂²·a'₁·a'₃₀·a'₂₃·a₂·a₁·a₀₃)·a₂₃² + (2·a'₂·a'₁·a'₀₃·a'₂₃²·a₁³·a₃₀ + 2·a'₂·a'₀₃·a'₃₀·a'₂₃²·a₁⁴)·a₃₂·a₂₃)·a₀²)·aₛ + ((a'₂²·a'₁²·a'₂₃·a₀₃·a₃₀² + 2·a'₂²·a'₁·a'₃₀·a'₂₃·a₁·a₀₃·a₃₀ + a'₂²·a'₃₀²·a'₂₃·a₁²·a₀₃)·a₂₃³·a₀³ + (a'₂²·a'₁²·a'₂₃·a₁²·a₃₀² + 2·a'₂²·a'₁·a'₃₀·a'₂₃·a₁³·a₃₀ + a'₂²·a'₃₀²·a'₂₃·a₁⁴)·a₃₂·a₂₃·a₀²))

This takes, I estimate, somewhere on the order of 10KiB for PARI to store in a way that can be used efficiently. If a few elements of that complexity are multiplied together, and the product doesn't cancel to 1, it's entirely possible that PARI's internal representation of the result will take more than 4GiB of memory (which appears to be the maximum size I can make PARI's stack). Unluckily, PARI's stack-based operations kill the program on failure.

In my particular situation, overflowing PARI's stack is something I could recover from: I'd just output a warning that that particular product needs to be checked by hand.
This is precisely the sort of example that I think B should be giving: programs performing memory-intensive operations, some of which are optional. I am content with a small number of warnings on stderr regarding products which are difficult to compute. I am not content with a 12 hour computation aborting 7 hours in.

Fixes

In my particular case, there are fixes. PARI primarily works off its own internal stack, but objects can be moved to main memory if needed. This removes the 4GiB limit.

More importantly, however, PARI exposes the top, bottom, and current position of its internal stack. This allows me to monitor memory usage in (*), and cleanly skip the computation if too much of the stack memory starts being used.

It also turns out that continual memory growth is an excellent heuristic for detecting that a computation is probably not going to cancel to 1. If

  o all factors have roughly the same initial size,
  o “unlucky” multiplications increase memory usage exponentially, and
  o such growth has been observed after multiplying the first ceil(n/2) of the n factors,

then it's very improbable that the last few factors have enough complexity to cancel the intermediate product, so we can abandon the computation, skipping the last (and probably the most intensive) steps. This also allows me to use low 10s of MiB, not GiB, for PARI's stack.

[0] http://pari.math.u-bordeaux.fr