I fought the AFS, and it was a draw

S. Gilles
2018-08-05

This is a gripping tale of a delayed data corruption bug, and an unsatisfying solution. The setting is a huge, separate set of machines running everything from Solaris to RHEL to Ubuntu, with home directories tied together via AFS. I represent one side of the conflict, and AFS is my adversary. I'm not sure which specific implementation of AFS I was pitted against, but it probably doesn't matter: the principal battleground was running Linux 2.6.32 at the time of writing, so whatever bug I was up against has probably been fixed and buried for years.

The first shot of this war, though I didn't know it, was fired a few months ago, when a user of Clav mentioned that some of his data files had been erased. One day they contained definitions of carefully constructed graphs, the next day they were size 0. This was concerning, but after a comb-through, I was pretty sure that Clav itself couldn't be causing the issue. The only file-related oversight in the code was a missing call to fclose(3).

Open hostilities broke out when I tried to use a few machines on this network for some intensive calculation (about 600 CPU-hours of computation). The setup was pretty straightforward: I needed to get 256 output lines from program foo, which would output a line every two to three hours. This could be mostly parallelized, so on machine A (with 24 cores), I would run

    ./foo -n 1 | tee output/output-001.txt
    ./foo -n 2 | tee output/output-002.txt
    ...
    ./foo -n 23 | tee output/output-023.txt

So far, so good. I started the jobs and went to sleep. But when I woke up, the job pipeline was dead: tee had failed and, since the output had been implicitly buffered, I'd lost all of it. That's really not supposed to happen, but I shrugged and started again, this time with some paranoid anti-buffering:

    stdbuf -o0 ./foo -n 1 | stdbuf -i0 tee output/output-001.txt

Things proceeded well until the next night, at which point tee failed again.
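The implicit buffering above is stdio's default behavior: when stdout is a pipe rather than a terminal, it is fully buffered, so foo's finished lines can sit in its own memory for hours and die with the pipeline. stdbuf -o0 switches that off from outside the program; a program can do the same for itself with setvbuf(3). A minimal sketch of the idea (the helper name is mine, not anything from foo):

```c
#include <stdio.h>
#include <unistd.h>

/* stdio line-buffers stdout on a terminal but fully buffers it on a
 * pipe (as in ./foo | tee ...).  stdbuf -o0 forces it unbuffered from
 * the outside; setvbuf(3) does the same from the inside, provided it
 * is called before any output.  Returns 0 on success, like setvbuf. */
int
unbuffer_stdout(void)
{
	if (isatty(STDOUT_FILENO))
		return 0;	/* terminal: line buffering already flushes each line */
	return setvbuf(stdout, NULL, _IONBF, 0);
}
```

Called at the top of main, before the first printf, this makes every line reach the pipe immediately, at some throughput cost.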
“At least,” I thought, “this time I will have partial output, so this run isn't a total loss.” Indeed, my screen session had a few lines visible from each job. But the output files were empty!

At this point, I became concerned. I was a few days behind schedule, and I wasn't able to collect data except by copying and pasting from stdout. So I started up jobs on machines A, B, C, and D at the same time, burning about 60 cores to produce 60 output files. And I watched them like a hawk. As soon as the first output came through on machine A, I checked the output file to make sure tee was functioning properly. By chance, I checked the file on machine C, and everything fell into place: the file was empty on C, but from A I could see the contents.

The problem was that AFS was delaying the file unification as long as the file was still open for writing. But at midnight, some kind of forced synchronization occurred, which invariably destroyed the unsynchronized contents of the file. Since my jobs always ran for longer than 24 hours, the tee commands were consistently being clobbered.

This was also what had been killing the Clav output files from months ago. The workflow people had been following was

- log in
- run Clav
- write out a couple of graph files
- switch user so that Clav stays running
- wait until tomorrow
- log back in, write a few more graph files, &c.

Since I'd forgotten to fclose(3) the data files, they were clobbered at midnight unless, by coincidence, another user had happened to shut the machine down (unceremoniously killing all jobs, but at least flushing all files). So, by adding the missing fclose(3), I'd fixed Clav months ago.

But I still needed output from my 600 hours of jobs. The problem was tee, which blindly keeps its output file(s) open. So I modified suckless' tee[0] into yuu, which always closes and reopens its outputs.

    ...
    while ((n = read(0, buf, sizeof(buf))) > 0) {
        for (int i = 0; i < argc; i++) {
            int fd = open(argv[i], O_WRONLY|O_CREAT|aflag, 0666);
            if (fd < 0) {
                return 1;
            }
            if (writeall(fd, buf, n) < 0) {
                return 1;
            }
            close(fd);
        }
        writeall(1, buf, n);
        aflag = O_APPEND;
    }
    ...

and, a few hundred CPU-hours later, I had my output.

I suppose the moral of the story is that fclose(3) is actually necessary sometimes.

[0] https://git.suckless.org/sbase/