I fought the AFS, and it was a draw

S. Gilles
2018-08-05

This is a gripping tale of a delayed data corruption bug, and an unsatisfying solution. The setting is a huge, separate set of machines running everything from Solaris to RHEL to Ubuntu, with home directories tied together via AFS. I represent one side of the conflict, and AFS is my adversary. I'm not sure which specific implementation of AFS I was pitted against, but it probably doesn't matter: the principal battleground was running Linux 2.6.32 at the time of writing, so whatever bug I was up against has probably been fixed and buried for years.

The first shot of this war, though I didn't know it, was fired a few months ago, when a user of Clav mentioned that some of his data files had been erased. One day they contained definitions of carefully constructed graphs, the next day they were size 0. This was concerning, but after a comb-through, I was pretty sure that Clav itself couldn't be causing the issue. The only file-related oversight in the code was a missing call to fclose(3).

Open hostilities broke out when I tried to use a few machines on this network for some intensive calculation (about 600 CPU-hours of computation). The setup was pretty straightforward: I needed to get 256 output lines from program foo, which would output a line every two to three hours. This could be mostly parallelized, so on machine A (with 24 cores), I would run

    ./foo -n 1 | tee output/output-001.txt
    ./foo -n 2 | tee output/output-002.txt
    ...
    ./foo -n 23 | tee output/output-023.txt

So far, so good. I started the jobs and went to sleep. But when I woke up, the job pipeline was dead: tee had failed and, since the output had been implicitly buffered, I'd lost all of it. That's really not supposed to happen, but I shrugged and started again, this time with some paranoid anti-buffering:

    stdbuf -o0 ./foo -n 1 | stdbuf -i0 tee output/output-001.txt

Things proceeded well until the next night, at which point tee failed again.
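The implicit buffering above is stdio's default behavior: when stdout is a pipe rather than a terminal, it is fully buffered, so foo's finished lines can sit in its own memory for hours and die with the pipeline. stdbuf -o0 switches that off from outside the program; a program can do the same for itself with setvbuf(3). A minimal sketch of the idea (the helper name is mine, not anything from foo):

```c
#include <stdio.h>
#include <unistd.h>

/* stdio line-buffers stdout on a terminal but fully buffers it on a
 * pipe (as in ./foo | tee ...).  stdbuf -o0 forces it unbuffered from
 * the outside; setvbuf(3) does the same from the inside, provided it
 * is called before any output.  Returns 0 on success, like setvbuf. */
int
unbuffer_stdout(void)
{
	if (isatty(STDOUT_FILENO))
		return 0;	/* terminal: line buffering already flushes each line */
	return setvbuf(stdout, NULL, _IONBF, 0);
}
```

Called at the top of main, before the first printf, this makes every line reach the pipe immediately, at some throughput cost.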
“At least,” I thought, “this time I will have partial output, so this run isn't a total loss.” Indeed, my screen session had a few lines visible from each job. But the output files were empty!

At this point, I became concerned. I was a few days behind schedule, and I wasn't able to collect data except by copying and pasting from stdout. So I started up jobs on machines A, B, C, and D at the same time, burning about 60 cores to produce 60 output files. And I watched them like a hawk. As soon as the first output came through on machine A, I checked the output file to make sure tee was functioning properly. By chance, I checked the file on machine C, and everything fell into place: the file was empty on C, but from A I could see the contents.

The problem was that AFS was delaying the file unification as long as the file was still open for writing. But at midnight, some kind of forced synchronization occurred, which invariably destroyed the unsynchronized contents of the file. Since my jobs always ran for longer than 24 hours, the tee commands were consistently being clobbered.

This was also what had been killing the Clav output files from months ago. The workflow people had been following was

- log in
- run Clav
- write out a couple of graph files
- switch user so that Clav stays running
- wait until tomorrow
- log back in, write a few more graph files, &c.

Since I'd forgotten to fclose(3) the data files, they were clobbered at midnight unless, by coincidence, another user had happened to shut the machine down (unceremoniously killing all jobs, but at least flushing all files). So, by adding the missing fclose(3), I'd fixed Clav months ago.

But I still needed output from my 600 hours of jobs. The problem was tee, which blindly keeps its output file(s) open. So I modified suckless' tee[0] into yuu, which always closes and reopens its outputs.

    ...
    while ((n = read(0, buf, sizeof(buf))) > 0) {
        for (int i = 0; i < argc; i++) {
            int fd = open(argv[i], O_WRONLY|O_CREAT|aflag, 0666);
            if (fd < 0) {
                return 1;
            }
            if (writeall(fd, buf, n) < 0) {
                return 1;
            }
            close(fd);
        }
        writeall(1, buf, n);
        aflag = O_APPEND;
    }
    ...

and, a few hundred CPU-hours later, I had my output.

I suppose the moral of the story is that fclose(3) is actually necessary sometimes.

[0] https://git.suckless.org/sbase/