P03A: Why fsync() on OpenZFS can’t fail, and what happens when it does
Rob Norris

Abstract

On OpenZFS, `fsync()` cannot fail - it will wait until the application’s changes are on disk before it returns. If a problem such as a hardware failure causes the pool to suspend, it may wait forever. This feels strange, but is acceptable according to the API contract: `fsync()` never returned success, so the application has no reason to believe its data is on disk.
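
As a rough illustration of that contract from the application's side, the sketch below treats data as durable only once `fsync()` has returned successfully. It is not taken from the talk: `durable_write` is a hypothetical helper name, and the code uses only Python's standard `os` module.

```python
import os


def durable_write(path: str, data: bytes) -> bool:
    """Hypothetical helper: True only if fsync() reported success."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        # os.write() may write fewer bytes than requested, so loop.
        while len(view) > 0:
            written = os.write(fd, view)
            view = view[written:]
        # This call may block for as long as the filesystem needs.
        # Until it returns, the caller knows nothing about durability.
        os.fsync(fd)
        return True
    except OSError:
        # A reported failure: the data must not be assumed to be on disk.
        return False
    finally:
        os.close(fd)
```

A caller would act on durability (for example, acknowledging a transaction) only when `durable_write(...)` returns `True`. A call that never returns gives the caller no grounds to assume anything, which is why blocking forever stays within the contract.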

However, OpenZFS pools can recover if the fault is repaired, and so `fsync()` can still return. As it turns out, though, it’s possible in rare situations for the pool to return to service without having actually put the data on disk. `fsync()` returns success, because it cannot fail, and the application has been lied to.

In this paper I describe the path taken from the `fsync()` call, through the ZFS Intent Log, the transaction machinery, the pool failure system and the IO scheduler, to understand what happens to IO when disks fail and return, and why OpenZFS believed that writes had succeeded when they had not. I then describe the changes I made to make OpenZFS understand that something had gone wrong, and how I threaded that response back up the stack such that `fsync()` could finally return failure - and what it means when it does.

Speaker

Rob Norris, Klara Systems