Message ID: 20211021001059.438843-1-jane.chu@oracle.com (mailing list archive)
Series: dax poison recovery with RWF_RECOVERY_DATA flag
Looking over the series I have serious doubts that overloading the slow path clear poison operation over the fast path read/write path is such a great idea.
On 10/21/2021 4:31 AM, Christoph Hellwig wrote: > Looking over the series I have serious doubts that overloading the > slow path clear poison operation over the fast path read/write > path is such a great idea. > Understood, sounds like a concern on principle. But it seems to me that the task of recovery overlaps with the normal write operation on the write part. Without overloading some write operation for 'recovery', I guess we'll need to come up with a new userland command coupled with a new dax API ->clear_poison and propagate the new API support to each dm target that supports dax which, again, is an idea that sounds too bulky if I recall Dan's earlier rejection correctly. It is in my plan though, to provide pwritev2(2) and preadv2(2) man pages with a description of the RWF_RECOVERY_DATA flag, specifically not recommending the flag for normal read/write operations - due to the potential performance impact of searching badblocks in the range. That said, the badblock searching is turned on only if the pmem device contains badblocks (i.e. bb->count > 0), otherwise the performance impact is next to none. And once the badblock search starts, it is a binary search over the user-provided range. The unwanted impact arises when the pmem device contains badblocks that do not fall in the user-specified range, and in that case, the search would end in O(1). Thanks! -jane
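To make the gating described above concrete, the sketch below mirrors the logic Jane outlines: skip the lookup entirely when the device has recorded no bad blocks, otherwise binary-search a sorted list of bad ranges for an overlap with the I/O range. The types and names are illustrative stand-ins, not the kernel's badblocks implementation.

    /*
     * Illustrative sketch only -- not the kernel's badblocks code.
     * If bb->count is 0 the lookup is skipped entirely; otherwise a
     * binary search decides whether the I/O range overlaps a bad range.
     */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    struct bad_range {
            uint64_t start;   /* first bad sector */
            uint64_t len;     /* number of bad sectors */
    };

    struct badblocks_sketch {
            size_t count;                    /* 0 on a healthy device */
            const struct bad_range *ranges;  /* sorted by start, non-overlapping */
    };

    /* Return true if [sector, sector + nr) touches any recorded bad range. */
    static bool range_has_badblocks(const struct badblocks_sketch *bb,
                                    uint64_t sector, uint64_t nr)
    {
            size_t lo, hi;

            if (bb->count == 0)              /* fast path: nothing to search */
                    return false;

            lo = 0;
            hi = bb->count;
            while (lo < hi) {                /* binary search over sorted ranges */
                    size_t mid = lo + (hi - lo) / 2;
                    const struct bad_range *r = &bb->ranges[mid];

                    if (r->start + r->len <= sector)
                            lo = mid + 1;    /* bad range ends before the I/O range */
                    else if (r->start >= sector + nr)
                            hi = mid;        /* bad range begins after the I/O range */
                    else
                            return true;     /* overlap: recovery handling is needed */
            }
            return false;
    }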
On Fri, Oct 22, 2021 at 01:37:28AM +0000, Jane Chu wrote: > On 10/21/2021 4:31 AM, Christoph Hellwig wrote: > > Looking over the series I have serious doubts that overloading the > > slow path clear poison operation over the fast path read/write > > path is such a great idea. Why would data recovery after a media error ever be considered a fast/hot path? A normal read/write to a fsdax file would not pass the flag, which skips the poison checking with whatever MCE consequences that has, right? pwritev2(..., RWF_RECOVER_DATA) should be infrequent enough that carefully stepping around dax_direct_access only has to be faster than restoring from backup, I hope? > Understood, sounds like a concern on principle. But it seems to me > that the task of recovery overlaps with the normal write operation > on the write part. Without overloading some write operation for > 'recovery', I guess we'll need to come up with a new userland > command coupled with a new dax API ->clear_poison and propagate the > new API support to each dm targets that support dax which, again, > is an idea that sounds too bulky if I recall Dan's earlier rejection > correctly. > > It is in my plan though, to provide pwritev2(2) and preadv2(2) man pages > with description about the RWF_RECOVERY_DATA flag and specifically not > recommending the flag for normal read/write operation - due to potential > performance impact from searching badblocks in the range. Yes, this will help much. :) > That said, the badblock searching is turned on only if the pmem device > contains badblocks(i.e. bb->count > 0), otherwise, the performance > impact is next to none. And once the badblock search starts, > it is a binary search over user provided range. The unwanted impact > happens to be the case when the pmem device contains badblocks > that do not fall in the user specified range, and in that case, the > search would end in O(1). (I wonder about improving badblocks to be less sector-oriented and not have that weird 16-records-max limit, but that can be a later optimization.) --D > Thanks! > -jane > >
On Fri, Oct 22, 2021 at 01:37:28AM +0000, Jane Chu wrote: > On 10/21/2021 4:31 AM, Christoph Hellwig wrote: > > Looking over the series I have serious doubts that overloading the > > slow path clear poison operation over the fast path read/write > > path is such a great idea. > > > > Understood, sounds like a concern on principle. But it seems to me > that the task of recovery overlaps with the normal write operation > on the write part. Without overloading some write operation for > 'recovery', I guess we'll need to come up with a new userland > command coupled with a new dax API ->clear_poison and propagate the > new API support to each dm targets that support dax which, again, > is an idea that sounds too bulky if I recall Dan's earlier rejection > correctly. When I wrote the above I mostly thought about the in-kernel API, that is use a separate method. But reading your mail and thinking about this a bit more I'm actually less and less sure that overloading pwritev2 and preadv2 with this at the syscall level makes sense either. read/write are our I/O fast path. We really should not overload the core of the VFS with error recovery for a broken hardware interface.
On Thu, Oct 21, 2021 at 06:58:17PM -0700, Darrick J. Wong wrote: > On Fri, Oct 22, 2021 at 01:37:28AM +0000, Jane Chu wrote: > > On 10/21/2021 4:31 AM, Christoph Hellwig wrote: > > > Looking over the series I have serious doubts that overloading the > > > slow path clear poison operation over the fast path read/write > > > path is such a great idea. > > Why would data recovery after a media error ever be considered a > fast/hot path? Not sure what you're replying to (the text is from me, the mail you are replying to is from Jane), but my point is that the read/write hot path should not be overloaded with data recovery. > A normal read/write to a fsdax file would not pass the > flag, which skips the poison checking with whatever MCE consequences > that has, right? Exactly! > pwritev2(..., RWF_RECOVER_DATA) should be infrequent enough that > carefully stepping around dax_direct_access only has to be faster than > restoring from backup, I hope? Yes. And thus be kept as separate as possible.
On 10/21/2021 10:36 PM, Christoph Hellwig wrote: > On Fri, Oct 22, 2021 at 01:37:28AM +0000, Jane Chu wrote: >> On 10/21/2021 4:31 AM, Christoph Hellwig wrote: >>> Looking over the series I have serious doubts that overloading the >>> slow path clear poison operation over the fast path read/write >>> path is such a great idea. >>> >> >> Understood, sounds like a concern on principle. But it seems to me >> that the task of recovery overlaps with the normal write operation >> on the write part. Without overloading some write operation for >> 'recovery', I guess we'll need to come up with a new userland >> command coupled with a new dax API ->clear_poison and propagate the >> new API support to each dm targets that support dax which, again, >> is an idea that sounds too bulky if I recall Dan's earlier rejection >> correctly. > > When I wrote the above I mostly thought about the in-kernel API, that > is use a separate method. But reading your mail and thinking about > this a bit more I'm actually less and less sure that overloading > pwritev2 and preadv2 with this at the syscall level makes sense either. > read/write are our I/O fast path. We really should not overload the > core of the VFS with error recovery for a broken hardware interface. > Thanks - I try to be honest. As far as I can tell, the argument about the flag is a philosophical argument between two views. One view assumes design based on perfect hardware, and media error belongs to the category of brokenness. Another view sees media error as a build-in hardware component and make design to include dealing with such errors. Back when I was fresh out of school, a senior engineer explained to me about media error might be caused by cosmic ray hitting on the media at however frequency and at whatever timing. It's an argument that media error within certain range is a fact of the product, and to me, it argues for building normal software component with errors in mind from start. I guess I'm trying to articulate why it is acceptable to include the RWF_DATA_RECOVERY flag to the existing RWF_ flags. - this way, pwritev2 remain fast on fast path, and its slow path (w/ error clearing) is faster than other alternative. Other alternative being 1 system call to clear the poison, and another system call to run the fast pwrite for recovery, what happens if something happened in between? thanks! -jane
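For illustration, here is roughly what the single-call model Jane describes would look like from userspace, assuming (as the series proposes) that a plain write over poisoned media fails with EIO and that the new flag is exposed as RWF_RECOVERY_DATA. The flag value below is a placeholder purely for illustration, not a real uapi constant; the point is that poison clearing and the overwrite happen in one operation, with no window between a separate "clear" call and the write.

    #define _GNU_SOURCE
    #include <errno.h>
    #include <sys/types.h>
    #include <sys/uio.h>

    #ifndef RWF_RECOVERY_DATA
    #define RWF_RECOVERY_DATA 0x40  /* placeholder: the real value would come from the series */
    #endif

    static ssize_t write_with_recovery(int fd, void *buf, size_t len, off_t off)
    {
        struct iovec iov = { .iov_base = buf, .iov_len = len };

        /* Normal fast-path write first: no poison handling involved. */
        ssize_t ret = pwritev2(fd, &iov, 1, off, 0);
        if (ret >= 0 || errno != EIO)
            return ret;

        /*
         * The range apparently covers poisoned media. Retry once with the
         * recovery flag so that clearing the poison and rewriting the data
         * are bound into a single operation.
         */
        return pwritev2(fd, &iov, 1, off, RWF_RECOVERY_DATA);
    }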
On Fri, Oct 22, 2021 at 08:52:55PM +0000, Jane Chu wrote: > Thanks - I try to be honest. As far as I can tell, the argument > about the flag is a philosophical argument between two views. > One view assumes design based on perfect hardware, and media error > belongs to the category of brokenness. Another view sees media > error as a build-in hardware component and make design to include > dealing with such errors. No, I don't think so. Bit errors do happen in all media, which is why devices are built to handle them. It is just the Intel-style pmem interface to handle them which is completely broken. > errors in mind from start. I guess I'm trying to articulate why > it is acceptable to include the RWF_DATA_RECOVERY flag to the > existing RWF_ flags. - this way, pwritev2 remain fast on fast path, > and its slow path (w/ error clearing) is faster than other alternative. > Other alternative being 1 system call to clear the poison, and > another system call to run the fast pwrite for recovery, what > happens if something happened in between? Well, my point is doing recovery from bit errors is by definition not the fast path. Which is why I'd rather keep it away from the pmem read/write fast path, which also happens to be the (much more important) non-pmem read/write path.
On Tue, Oct 26, 2021 at 11:49:59PM -0700, Christoph Hellwig wrote: > On Fri, Oct 22, 2021 at 08:52:55PM +0000, Jane Chu wrote: > > Thanks - I try to be honest. As far as I can tell, the argument > > about the flag is a philosophical argument between two views. > > One view assumes design based on perfect hardware, and media error > > belongs to the category of brokenness. Another view sees media > > error as a build-in hardware component and make design to include > > dealing with such errors. > > No, I don't think so. Bit errors do happen in all media, which is > why devices are built to handle them. It is just the Intel-style > pmem interface to handle them which is completely broken. Yeah, I agree, this takes me back to learning how to use DISKEDIT to work around a hole punched in a file (with a pen!) in the 1980s... ...so would you happen to know if anyone's working on solving this problem for us by putting the memory controller in charge of dealing with media errors? > > errors in mind from start. I guess I'm trying to articulate why > > it is acceptable to include the RWF_DATA_RECOVERY flag to the > > existing RWF_ flags. - this way, pwritev2 remain fast on fast path, > > and its slow path (w/ error clearing) is faster than other alternative. > > Other alternative being 1 system call to clear the poison, and > > another system call to run the fast pwrite for recovery, what > > happens if something happened in between? > > Well, my point is doing recovery from bit errors is by definition not > the fast path. Which is why I'd rather keep it away from the pmem > read/write fast path, which also happens to be the (much more important) > non-pmem read/write path. The trouble is, we really /do/ want to be able to (re)write the failed area, and we probably want to try to read whatever we can. Those are reads and writes, not {pre,f}allocation activities. This is where Dave and I arrived at a month ago. Unless you'd be ok with a second IO path for recovery where we're allowed to be slow? That would probably have the same user interface flag, just a different path into the pmem driver. Ha, how about a int fd2 = recoveryfd(fd); call where you'd get whatever speshul options (retry raid mirrors! scrape the film off the disk if you have to!) you want that can take forever, leaving the fast paths alone? (Ok, that wasn't entirely serious...) --D
On Wed, Oct 27, 2021 at 05:24:51PM -0700, Darrick J. Wong wrote: > On Tue, Oct 26, 2021 at 11:49:59PM -0700, Christoph Hellwig wrote: > > On Fri, Oct 22, 2021 at 08:52:55PM +0000, Jane Chu wrote: > > > Thanks - I try to be honest. As far as I can tell, the argument > > > about the flag is a philosophical argument between two views. > > > One view assumes design based on perfect hardware, and media error > > > belongs to the category of brokenness. Another view sees media > > > error as a build-in hardware component and make design to include > > > dealing with such errors. > > > > No, I don't think so. Bit errors do happen in all media, which is > > why devices are built to handle them. It is just the Intel-style > > pmem interface to handle them which is completely broken. > > Yeah, I agree, this takes me back to learning how to use DISKEDIT to > work around a hole punched in a file (with a pen!) in the 1980s... > > ...so would you happen to know if anyone's working on solving this > problem for us by putting the memory controller in charge of dealing > with media errors? > > > > errors in mind from start. I guess I'm trying to articulate why > > > it is acceptable to include the RWF_DATA_RECOVERY flag to the > > > existing RWF_ flags. - this way, pwritev2 remain fast on fast path, > > > and its slow path (w/ error clearing) is faster than other alternative. > > > Other alternative being 1 system call to clear the poison, and > > > another system call to run the fast pwrite for recovery, what > > > happens if something happened in between? > > > > Well, my point is doing recovery from bit errors is by definition not > > the fast path. Which is why I'd rather keep it away from the pmem > > read/write fast path, which also happens to be the (much more important) > > non-pmem read/write path. > > The trouble is, we really /do/ want to be able to (re)write the failed > area, and we probably want to try to read whatever we can. Those are > reads and writes, not {pre,f}allocation activities. This is where Dave > and I arrived at a month ago. > > Unless you'd be ok with a second IO path for recovery where we're > allowed to be slow? That would probably have the same user interface > flag, just a different path into the pmem driver. I just don't see how 4 single line branches to propagate RWF_RECOVERY down to the hardware is in any way an imposition on the fast path. It's no different for passing RWF_HIPRI down to the hardware *in the fast path* so that the IO runs the hardware in polling mode because it's faster for some hardware. IOWs, saying that we shouldn't implement RWF_RECOVERY because it adds a handful of branches to the fast path is like saying that we shouldn't implement RWF_HIPRI because it slows down the fast path for non-polled IO.... Just factor the actual recovery operations out into a separate function like:

    static void noinline pmem_media_recovery(...)
    {
    }

    pmem_copy_from_iter()
    {
            if (unlikely(flag & RECOVERY))
                    pmem_media_recovery(...);
            return _copy_from_iter_flushcache(addr, bytes, i);
    }
    ....

And there's basically zero overhead in the fast paths for normal data IO operations, whilst supporting a simple, easy to use data recovery IO operation for regions that have bad media... > Ha, how about a int fd2 = recoveryfd(fd); call where you'd get whatever > speshul options (retry raid mirrors! scrape the film off the disk if > you have to!) you want that can take forever, leaving the fast paths > alone?
Why wouldn't we just pass RWF_RECOVERY down to a REQ_RECOVERY bio flag and have raid devices use that to trigger scraping whatever they can if there are errors? The io path through the VFS and filesystem to get the scraped data out to the user *is exactly the same*, so we're going to have to plumb this functionality into fast paths *somewhere along the line*. Really, I think this whole "flag propagation is too much overhead for the fast path" argument is completely invalid - if 4 conditional branches is too much overhead to add to the fast path, then we can't add *anything ever again* to the IO path because it has too much overhead and impact on the fast path. Cheers, Dave.
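As a rough sketch of the propagation Dave is describing, modelled on the existing RWF_HIPRI plumbing: a couple of single-line branches carry the flag from the syscall into the kiocb and then onto the bio so that stacked drivers (dm, md/raid) can see it too. All of the names below (RWF_RECOVERY, IOCB_RECOVERY, REQ_RECOVERY) and both helper functions are hypothetical, simplified stand-ins for the real VFS and block-layer structures, not merged kernel code.

    /* Hypothetical flag values, for illustration only. */
    #define RWF_RECOVERY    (1 << 5)        /* per-call flag from userspace */
    #define IOCB_RECOVERY   (1 << 20)       /* kiocb flag inside the VFS */
    #define REQ_RECOVERY    (1ULL << 30)    /* bio op flag seen by block drivers */

    struct kiocb_sketch { unsigned int ki_flags; };
    struct bio_sketch   { unsigned long long bi_opf; };

    /* Branch 1: syscall flag -> kiocb flag (the RWF_HIPRI precedent). */
    static void kiocb_set_flags_sketch(struct kiocb_sketch *iocb, int rwf)
    {
            if (rwf & RWF_RECOVERY)
                    iocb->ki_flags |= IOCB_RECOVERY;
    }

    /* Branch 2: kiocb flag -> bio flag, so RAID/dm layers can act on it. */
    static void bio_set_flags_sketch(struct bio_sketch *bio,
                                     const struct kiocb_sketch *iocb)
    {
            if (iocb->ki_flags & IOCB_RECOVERY)
                    bio->bi_opf |= REQ_RECOVERY;
    }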
On 10/28/21 23:59, Dave Chinner wrote: [...] >>> Well, my point is doing recovery from bit errors is by definition not >>> the fast path. Which is why I'd rather keep it away from the pmem >>> read/write fast path, which also happens to be the (much more important) >>> non-pmem read/write path. >> >> The trouble is, we really /do/ want to be able to (re)write the failed >> area, and we probably want to try to read whatever we can. Those are >> reads and writes, not {pre,f}allocation activities. This is where Dave >> and I arrived at a month ago. >> >> Unless you'd be ok with a second IO path for recovery where we're >> allowed to be slow? That would probably have the same user interface >> flag, just a different path into the pmem driver. > > I just don't see how 4 single line branches to propage RWF_RECOVERY > down to the hardware is in any way an imposition on the fast path. > It's no different for passing RWF_HIPRI down to the hardware *in the > fast path* so that the IO runs the hardware in polling mode because > it's faster for some hardware. Not particularly about this flag, but it is expensive. Surely looks cheap when it's just one feature, but there are dozens of them with limited applicability, default config kernels are already sluggish when it comes to really fast devices and it's not getting better. Also, pretty often every of them will add a bunch of extra checks to fix something of whatever it would be. So let's add a bit of pragmatism to the picture, if there is just one user of a feature but it adds overhead for millions of machines that won't ever use it, it's expensive. This one doesn't spill yet into paths I care about, but in general it'd be great if we start thinking more about such stuff instead of throwing yet another if into the path, e.g. by shifting the overhead from linear to a constant for cases that don't use it, for instance with callbacks or bit masks. > IOWs, saying that we shouldn't implement RWF_RECOVERY because it > adds a handful of branches the fast path is like saying that we > shouldn't implement RWF_HIPRI because it slows down the fast path > for non-polled IO.... > > Just factor the actual recovery operations out into a separate > function like:
On Fri, Oct 29, 2021 at 12:46:14PM +0100, Pavel Begunkov wrote: > On 10/28/21 23:59, Dave Chinner wrote: > [...] > > > > Well, my point is doing recovery from bit errors is by definition not > > > > the fast path. Which is why I'd rather keep it away from the pmem > > > > read/write fast path, which also happens to be the (much more important) > > > > non-pmem read/write path. > > > > > > The trouble is, we really /do/ want to be able to (re)write the failed > > > area, and we probably want to try to read whatever we can. Those are > > > reads and writes, not {pre,f}allocation activities. This is where Dave > > > and I arrived at a month ago. > > > > > > Unless you'd be ok with a second IO path for recovery where we're > > > allowed to be slow? That would probably have the same user interface > > > flag, just a different path into the pmem driver. > > > > I just don't see how 4 single line branches to propage RWF_RECOVERY > > down to the hardware is in any way an imposition on the fast path. > > It's no different for passing RWF_HIPRI down to the hardware *in the > > fast path* so that the IO runs the hardware in polling mode because > > it's faster for some hardware. > > Not particularly about this flag, but it is expensive. Surely looks > cheap when it's just one feature, but there are dozens of them with > limited applicability, default config kernels are already sluggish > when it comes to really fast devices and it's not getting better. > Also, pretty often every of them will add a bunch of extra checks > to fix something of whatever it would be. So we can't have data recovery because moving fast the only goal? That's so meta. --D > So let's add a bit of pragmatism to the picture, if there is just one > user of a feature but it adds overhead for millions of machines that > won't ever use it, it's expensive. > > This one doesn't spill yet into paths I care about, but in general > it'd be great if we start thinking more about such stuff instead of > throwing yet another if into the path, e.g. by shifting the overhead > from linear to a constant for cases that don't use it, for instance > with callbacks or bit masks. > > > IOWs, saying that we shouldn't implement RWF_RECOVERY because it > > adds a handful of branches the fast path is like saying that we > > shouldn't implement RWF_HIPRI because it slows down the fast path > > for non-polled IO.... > > > > Just factor the actual recovery operations out into a separate > > function like: > > -- > Pavel Begunkov
On 10/29/2021 4:46 AM, Pavel Begunkov wrote: > On 10/28/21 23:59, Dave Chinner wrote: > [...] >>>> Well, my point is doing recovery from bit errors is by definition not >>>> the fast path. Which is why I'd rather keep it away from the pmem >>>> read/write fast path, which also happens to be the (much more >>>> important) >>>> non-pmem read/write path. >>> >>> The trouble is, we really /do/ want to be able to (re)write the failed >>> area, and we probably want to try to read whatever we can. Those are >>> reads and writes, not {pre,f}allocation activities. This is where Dave >>> and I arrived at a month ago. >>> >>> Unless you'd be ok with a second IO path for recovery where we're >>> allowed to be slow? That would probably have the same user interface >>> flag, just a different path into the pmem driver. >> >> I just don't see how 4 single line branches to propage RWF_RECOVERY >> down to the hardware is in any way an imposition on the fast path. >> It's no different for passing RWF_HIPRI down to the hardware *in the >> fast path* so that the IO runs the hardware in polling mode because >> it's faster for some hardware. > > Not particularly about this flag, but it is expensive. Surely looks > cheap when it's just one feature, but there are dozens of them with > limited applicability, default config kernels are already sluggish > when it comes to really fast devices and it's not getting better. > Also, pretty often every of them will add a bunch of extra checks > to fix something of whatever it would be. > > So let's add a bit of pragmatism to the picture, if there is just one > user of a feature but it adds overhead for millions of machines that > won't ever use it, it's expensive. > > This one doesn't spill yet into paths I care about, but in general > it'd be great if we start thinking more about such stuff instead of > throwing yet another if into the path, e.g. by shifting the overhead > from linear to a constant for cases that don't use it, for instance > with callbacks or bit masks. May I ask what solution would you propose for pmem recovery that satisfy the requirement of binding poison-clearing and write in one operation? thanks! -jane > >> IOWs, saying that we shouldn't implement RWF_RECOVERY because it >> adds a handful of branches the fast path is like saying that we >> shouldn't implement RWF_HIPRI because it slows down the fast path >> for non-polled IO.... >> >> Just factor the actual recovery operations out into a separate >> function like: >
On 10/29/21 17:57, Darrick J. Wong wrote: > On Fri, Oct 29, 2021 at 12:46:14PM +0100, Pavel Begunkov wrote: >> On 10/28/21 23:59, Dave Chinner wrote: >> [...] >>>>> Well, my point is doing recovery from bit errors is by definition not >>>>> the fast path. Which is why I'd rather keep it away from the pmem >>>>> read/write fast path, which also happens to be the (much more important) >>>>> non-pmem read/write path. >>>> >>>> The trouble is, we really /do/ want to be able to (re)write the failed >>>> area, and we probably want to try to read whatever we can. Those are >>>> reads and writes, not {pre,f}allocation activities. This is where Dave >>>> and I arrived at a month ago. >>>> >>>> Unless you'd be ok with a second IO path for recovery where we're >>>> allowed to be slow? That would probably have the same user interface >>>> flag, just a different path into the pmem driver. >>> >>> I just don't see how 4 single line branches to propage RWF_RECOVERY >>> down to the hardware is in any way an imposition on the fast path. >>> It's no different for passing RWF_HIPRI down to the hardware *in the >>> fast path* so that the IO runs the hardware in polling mode because >>> it's faster for some hardware. >> >> Not particularly about this flag, but it is expensive. Surely looks >> cheap when it's just one feature, but there are dozens of them with >> limited applicability, default config kernels are already sluggish >> when it comes to really fast devices and it's not getting better. >> Also, pretty often every of them will add a bunch of extra checks >> to fix something of whatever it would be. > > So we can't have data recovery because moving fast the only goal? That's not what was said and you missed the point, which was in the rest of the message. > > That's so meta. > > --D > >> So let's add a bit of pragmatism to the picture, if there is just one >> user of a feature but it adds overhead for millions of machines that >> won't ever use it, it's expensive. >> >> This one doesn't spill yet into paths I care about, but in general >> it'd be great if we start thinking more about such stuff instead of >> throwing yet another if into the path, e.g. by shifting the overhead >> from linear to a constant for cases that don't use it, for instance >> with callbacks or bit masks. >> >>> IOWs, saying that we shouldn't implement RWF_RECOVERY because it >>> adds a handful of branches the fast path is like saying that we >>> shouldn't implement RWF_HIPRI because it slows down the fast path >>> for non-polled IO.... >>> >>> Just factor the actual recovery operations out into a separate >>> function like: >> >> -- >> Pavel Begunkov
On Fri, Oct 29, 2021 at 08:23:53PM +0100, Pavel Begunkov wrote: > On 10/29/21 17:57, Darrick J. Wong wrote: > > On Fri, Oct 29, 2021 at 12:46:14PM +0100, Pavel Begunkov wrote: > > > On 10/28/21 23:59, Dave Chinner wrote: > > > [...] > > > > > > Well, my point is doing recovery from bit errors is by definition not > > > > > > the fast path. Which is why I'd rather keep it away from the pmem > > > > > > read/write fast path, which also happens to be the (much more important) > > > > > > non-pmem read/write path. > > > > > > > > > > The trouble is, we really /do/ want to be able to (re)write the failed > > > > > area, and we probably want to try to read whatever we can. Those are > > > > > reads and writes, not {pre,f}allocation activities. This is where Dave > > > > > and I arrived at a month ago. > > > > > > > > > > Unless you'd be ok with a second IO path for recovery where we're > > > > > allowed to be slow? That would probably have the same user interface > > > > > flag, just a different path into the pmem driver. > > > > > > > > I just don't see how 4 single line branches to propage RWF_RECOVERY > > > > down to the hardware is in any way an imposition on the fast path. > > > > It's no different for passing RWF_HIPRI down to the hardware *in the > > > > fast path* so that the IO runs the hardware in polling mode because > > > > it's faster for some hardware. > > > > > > Not particularly about this flag, but it is expensive. Surely looks > > > cheap when it's just one feature, but there are dozens of them with > > > limited applicability, default config kernels are already sluggish > > > when it comes to really fast devices and it's not getting better. > > > Also, pretty often every of them will add a bunch of extra checks > > > to fix something of whatever it would be. > > > > So we can't have data recovery because moving fast the only goal? > > That's not what was said and you missed the point, which was in > the rest of the message. ...whatever point you were trying to make was so vague that it was totally uninformative and I completely missed it. What does "callbacks or bit masks" mean, then, specifically? How *exactly* would you solve the problem that Jane is seeking to solve by using callbacks? Actually, you know what? I'm so fed up with every single DAX conversation turning into a ****storm of people saying NO NO NO NO NO NO NO NO to everything proposed that I'm actually going to respond to whatever I think your point is, and you can defend whatever I come up with. > > > > That's so meta. > > > > --D > > > > > So let's add a bit of pragmatism to the picture, if there is just one > > > user of a feature but it adds overhead for millions of machines that > > > won't ever use it, it's expensive. Errors are infrequent, and since everything is cloud-based and disposble now, we can replace error handling with BUG_ON(). This will reduce code complexity, which will reduce code size, and improve icache usage. Win! > > > This one doesn't spill yet into paths I care about, ...so you sail in and say 'no' even though you don't yet care... > > > but in general > > > it'd be great if we start thinking more about such stuff instead of > > > throwing yet another if into the path, e.g. by shifting the overhead > > > from linear to a constant for cases that don't use it, for instance > > > with callbacks Ok so after userspace calls into pread to access a DAX file, hits the poisoned memory line and the machinecheck fires, what then? 
I guess we just have to figure out how to get from the MCA handler (assuming the machine doesn't just reboot instantly) all the way back into memcpy? Ok, you're in charge of figuring that out because I don't know how to do that. Notably, RWF_DATA_RECOVERY is the flag that we're calling *from* a callback that happens after memory controller realizes it's lost something, kicks a notification to the OS kernel through ACPI, and the kernel signal userspace to do something about it. Yeah, that's dumb since spinning rust already does all this for us, but that's pmem. > > > or bit masks. WTF does this even mean? --D > > > > > > > IOWs, saying that we shouldn't implement RWF_RECOVERY because it > > > > adds a handful of branches the fast path is like saying that we > > > > shouldn't implement RWF_HIPRI because it slows down the fast path > > > > for non-polled IO.... > > > > > > > > Just factor the actual recovery operations out into a separate > > > > function like: > > > > > > -- > > > Pavel Begunkov > > -- > Pavel Begunkov
On Fri, Oct 29, 2021 at 12:46:14PM +0100, Pavel Begunkov wrote: > On 10/28/21 23:59, Dave Chinner wrote: > [...] > > > > Well, my point is doing recovery from bit errors is by definition not > > > > the fast path. Which is why I'd rather keep it away from the pmem > > > > read/write fast path, which also happens to be the (much more important) > > > > non-pmem read/write path. > > > > > > The trouble is, we really /do/ want to be able to (re)write the failed > > > area, and we probably want to try to read whatever we can. Those are > > > reads and writes, not {pre,f}allocation activities. This is where Dave > > > and I arrived at a month ago. > > > > > > Unless you'd be ok with a second IO path for recovery where we're > > > allowed to be slow? That would probably have the same user interface > > > flag, just a different path into the pmem driver. > > > > I just don't see how 4 single line branches to propage RWF_RECOVERY > > down to the hardware is in any way an imposition on the fast path. > > It's no different for passing RWF_HIPRI down to the hardware *in the > > fast path* so that the IO runs the hardware in polling mode because > > it's faster for some hardware. > > Not particularly about this flag, but it is expensive. Surely looks > cheap when it's just one feature, but there are dozens of them with > limited applicability, default config kernels are already sluggish > when it comes to really fast devices and it's not getting better. > Also, pretty often every of them will add a bunch of extra checks > to fix something of whatever it would be. > > So let's add a bit of pragmatism to the picture, if there is just one > user of a feature but it adds overhead for millions of machines that > won't ever use it, it's expensive. Yup, you just described RWF_HIPRI! Seriously, Pavel, did you read past this? I'll quote what I said again, because I've already addressed this argument to point out how silly it is: > > IOWs, saying that we shouldn't implement RWF_RECOVERY because it > > adds a handful of branches to the fast path is like saying that we > > shouldn't implement RWF_HIPRI because it slows down the fast path > > for non-polled IO.... RWF_HIPRI functionality represents a *tiny* niche in the wider Linux ecosystem, so by your reasoning it is too expensive to implement because millions (billions!) of machines don't need or use it. Do you now see how silly your argument is? Seriously, this "optimise the IO fast path at the cost of everything else" craziness has gotten out of hand. Nobody in the filesystem or application world cares if you can do 10M IOPS per core when all the CPU is doing is sitting in a tight loop inside the kernel repeatedly overwriting data in the same memory buffers, essentially tossing the old away the data without ever accessing it or doing anything with it. Such speed racer games are *completely meaningless* as an optimisation goal - it's what we've called "benchmarketing" for a couple of decades now. If all we focus on is bragging rights because "bigger number is always better", then we'll end up with iand IO path that looks like the awful mess that the fs/direct-io.c turned into. That ended up being hyper-optimised for CPU performance right down to single instructions and cacheline load orders that the code became extremely fragile and completely unmaintainable. We ended up *reimplementing the direct IO code from scratch* so that XFS could build and submit direct IO smarter and faster because it simply couldn't be done to the old code. 
That's how iomap came about, and without *any optimisation at all* iomap was 20-30% faster than the old, hyper-optimised fs/direct-io.c code. IOWs, we always knew we could do direct IO faster than fs/direct-io.c, but we couldn't make the fs/direct-io.c faster because the hyper-optimisation of the code paths made it impossible to modify and maintain. The current approach of hyper-optimising the IO path for maximum per-core IOPS at the expense of everything else has been proven in the past to be exactly the wrong approach to be taking for IO path development. Yes, we need to be concerned about performance and work to improve it, but we should not be doing that at the cost of everything else that the IO stack needs to be able to do. Fundamentally, optimisation is something we do *after* we provide the functionality that is required; using "fast path optimisation" as a blunt force implement to prevent new features from being implemented is just ... obnoxious. > This one doesn't spill yet into paths I care about, but in general > it'd be great if we start thinking more about such stuff instead of > throwing yet another if into the path, e.g. by shifting the overhead > from linear to a constant for cases that don't use it, for instance > with callbacks or bit masks. This is orthogonal to providing data recovery functionality. If the claims that flag propagation is too expensive are true, then fixing this problem will also improve RWF_HIPRI performance regardless of whether RWF_DATA_RECOVERY exists or not... IOWs, *if* there is a fast path performance degradation as a result of flag propagation, then *go measure it* and show us how much impact it has on _real world applications_. *Show us the numbers* and document how much each additional flag propagation actually costs so we can talk about whether it is acceptable, mitigation strategies and/or alternative implementations.
On 10/29/21 23:32, Dave Chinner wrote: > On Fri, Oct 29, 2021 at 12:46:14PM +0100, Pavel Begunkov wrote: >> On 10/28/21 23:59, Dave Chinner wrote: >> [...] >>>>> Well, my point is doing recovery from bit errors is by definition not >>>>> the fast path. Which is why I'd rather keep it away from the pmem >>>>> read/write fast path, which also happens to be the (much more important) >>>>> non-pmem read/write path. >>>> >>>> The trouble is, we really /do/ want to be able to (re)write the failed >>>> area, and we probably want to try to read whatever we can. Those are >>>> reads and writes, not {pre,f}allocation activities. This is where Dave >>>> and I arrived at a month ago. >>>> >>>> Unless you'd be ok with a second IO path for recovery where we're >>>> allowed to be slow? That would probably have the same user interface >>>> flag, just a different path into the pmem driver. >>> >>> I just don't see how 4 single line branches to propage RWF_RECOVERY >>> down to the hardware is in any way an imposition on the fast path. >>> It's no different for passing RWF_HIPRI down to the hardware *in the >>> fast path* so that the IO runs the hardware in polling mode because >>> it's faster for some hardware. >> >> Not particularly about this flag, but it is expensive. Surely looks >> cheap when it's just one feature, but there are dozens of them with >> limited applicability, default config kernels are already sluggish >> when it comes to really fast devices and it's not getting better. >> Also, pretty often every of them will add a bunch of extra checks >> to fix something of whatever it would be. >> >> So let's add a bit of pragmatism to the picture, if there is just one >> user of a feature but it adds overhead for millions of machines that >> won't ever use it, it's expensive. > > Yup, you just described RWF_HIPRI! Seriously, Pavel, did you read > past this? I'll quote what I said again, because I've already > addressed this argument to point out how silly it is: And you almost got to the initial point in your penult paragraph. A single if for a single flag is not an issue, what is the problem is when there are dozens of them and the overhead for it is not isolated, so the kernel has to jump through dozens of those. And just to be clear I'll outline again, that's a general problem, I have no relation to the layers touched and it's up to relevant people, obviously. Even though I'd expect but haven't found the flag being rejected in other places, but well I may have missed something. >>> IOWs, saying that we shouldn't implement RWF_RECOVERY because it >>> adds a handful of branches to the fast path is like saying that we >>> shouldn't implement RWF_HIPRI because it slows down the fast path >>> for non-polled IO.... > > RWF_HIPRI functionality represents a *tiny* niche in the wider > Linux ecosystem, so by your reasoning it is too expensive to > implement because millions (billions!) of machines don't need or use > it. Do you now see how silly your argument is? > > Seriously, this "optimise the IO fast path at the cost of everything > else" craziness has gotten out of hand. Nobody in the filesystem or > application world cares if you can do 10M IOPS per core when all the > CPU is doing is sitting in a tight loop inside the kernel repeatedly > overwriting data in the same memory buffers, essentially tossing the > old away the data without ever accessing it or doing anything with > it. 
Such speed racer games are *completely meaningless* as an > optimisation goal - it's what we've called "benchmarketing" for a > couple of decades now. 10M you mentioned is just a way to measure, there is nothing wrong with it. And considering that there are enough of users considering or already switching to spdk because of performance, the approach is not wrong. And it goes not only for IO polling, normal irq IO suffers from the same problems. A related story is that this number is for a pretty reduced config, it'll go down with a more default-ish kernel. > If all we focus on is bragging rights because "bigger number is > always better", then we'll end up with iand IO path that looks like > the awful mess that the fs/direct-io.c turned into. That ended up > being hyper-optimised for CPU performance right down to single > instructions and cacheline load orders that the code became > extremely fragile and completely unmaintainable. > > We ended up *reimplementing the direct IO code from scratch* so that > XFS could build and submit direct IO smarter and faster because it > simply couldn't be done to the old code. That's how iomap came > about, and without *any optimisation at all* iomap was 20-30% faster > than the old, hyper-optimised fs/direct-io.c code. IOWs, we always > knew we could do direct IO faster than fs/direct-io.c, but we > couldn't make the fs/direct-io.c faster because of the > hyper-optimisation of the code paths made it impossible to modify > and maintain.> The current approach of hyper-optimising the IO path for maximum > per-core IOPS at the expensive of everything else has been proven in > the past to be exactly the wrong approach to be taking for IO path > development. Yes, we need to be concerned about performance and work > to improve it, but we should not be doing that at the cost of > everything else that the IO stack needs to be able to do. And iomap is great, what you described is a good typical example of unmaintainable code. I may get wrong what you exactly refer to, but I don't see maintainability not being considered. Even more interesting to notice that more often than not extra features (and flags) almost always hurt maintainability of the kernel, but then other benefits outweigh (hopefully). > Fundamentally, optimisation is something we do *after* we provide > the functionality that is required; using "fast path optimisation" > as a blunt force implement to prevent new features from being > implemented is just ... obnoxious. > >> This one doesn't spill yet into paths I care about, but in general >> it'd be great if we start thinking more about such stuff instead of >> throwing yet another if into the path, e.g. by shifting the overhead >> from linear to a constant for cases that don't use it, for instance >> with callbacks or bit masks. > > This is orthogonal to providing data recovery functionality. > If the claims that flag propagation is too expensive are true, then > fixing this problem this will also improve RWF_HIPRI performance > regardless of whether RWF_DATA_RECOVERY exists or not... > > IOWs, *if* there is a fast path performance degradation as a result > of flag propagation, then *go measure it* and show us how much > impact it has on _real world applications_. *Show us the numbers* > and document how much each additional flag propagation actually > costs so we can talk about whether it is acceptible, mitigation > strategies and/or alternative implementations. 
Flag propagation > overhead is just not a valid reason for preventing us adding new > flags to the IO path. Fix the flag propagation overhead if it's a > problem for you, don't use it as an excuse for preventing people > from adding new functionality that uses flag propagation...
On 10/29/21 21:08, Darrick J. Wong wrote: > On Fri, Oct 29, 2021 at 08:23:53PM +0100, Pavel Begunkov wrote: >> On 10/29/21 17:57, Darrick J. Wong wrote: >>> On Fri, Oct 29, 2021 at 12:46:14PM +0100, Pavel Begunkov wrote: >>>> On 10/28/21 23:59, Dave Chinner wrote: >>>> [...] >>>>>>> Well, my point is doing recovery from bit errors is by definition not >>>>>>> the fast path. Which is why I'd rather keep it away from the pmem >>>>>>> read/write fast path, which also happens to be the (much more important) >>>>>>> non-pmem read/write path. >>>>>> >>>>>> The trouble is, we really /do/ want to be able to (re)write the failed >>>>>> area, and we probably want to try to read whatever we can. Those are >>>>>> reads and writes, not {pre,f}allocation activities. This is where Dave >>>>>> and I arrived at a month ago. >>>>>> >>>>>> Unless you'd be ok with a second IO path for recovery where we're >>>>>> allowed to be slow? That would probably have the same user interface >>>>>> flag, just a different path into the pmem driver. >>>>> >>>>> I just don't see how 4 single line branches to propage RWF_RECOVERY >>>>> down to the hardware is in any way an imposition on the fast path. >>>>> It's no different for passing RWF_HIPRI down to the hardware *in the >>>>> fast path* so that the IO runs the hardware in polling mode because >>>>> it's faster for some hardware. >>>> >>>> Not particularly about this flag, but it is expensive. Surely looks >>>> cheap when it's just one feature, but there are dozens of them with >>>> limited applicability, default config kernels are already sluggish >>>> when it comes to really fast devices and it's not getting better. >>>> Also, pretty often every of them will add a bunch of extra checks >>>> to fix something of whatever it would be. >>> >>> So we can't have data recovery because moving fast the only goal? >> >> That's not what was said and you missed the point, which was in >> the rest of the message. > > ...whatever point you were trying to make was so vague that it was > totally uninformative and I completely missed it. > > What does "callbacks or bit masks" mean, then, specifically? How > *exactly* would you solve the problem that Jane is seeking to solve by > using callbacks? > > Actually, you know what? I'm so fed up with every single DAX > conversation turning into a ****storm of people saying NO NO NO NO NO NO > NO NO to everything proposed that I'm actually going to respond to > whatever I think your point is, and you can defend whatever I come up > with. Interesting, I don't want to break it to you but nobody is going to defend against yours made up out of thin air interpretations. I think there is one thing we can relate, I wonder as well what the bloody hell that opus of yours was > >>> >>> That's so meta. >>> >>> --D >>> >>>> So let's add a bit of pragmatism to the picture, if there is just one >>>> user of a feature but it adds overhead for millions of machines that >>>> won't ever use it, it's expensive. > > Errors are infrequent, and since everything is cloud-based and disposble > now, we can replace error handling with BUG_ON(). This will reduce code > complexity, which will reduce code size, and improve icache usage. Win! > >>>> This one doesn't spill yet into paths I care about, > > ...so you sail in and say 'no' even though you don't yet care... > >>>> but in general >>>> it'd be great if we start thinking more about such stuff instead of >>>> throwing yet another if into the path, e.g. 
by shifting the overhead >>>> from linear to a constant for cases that don't use it, for instance >>>> with callbacks > > Ok so after userspace calls into pread to access a DAX file, hits the > poisoned memory line and the machinecheck fires, what then? I guess we > just have to figure out how to get from the MCA handler (assuming the > machine doesn't just reboot instantly) all the way back into memcpy? > Ok, you're in charge of figuring that out because I don't know how to do > that. > > Notably, RWF_DATA_RECOVERY is the flag that we're calling *from* a > callback that happens after memory controller realizes it's lost > something, kicks a notification to the OS kernel through ACPI, and the > kernel signal userspace to do something about it. Yeah, that's dumb > since spinning rust already does all this for us, but that's pmem. > >>>> or bit masks. > > WTF does this even mean? > > --D > >>>> >>>>> IOWs, saying that we shouldn't implement RWF_RECOVERY because it >>>>> adds a handful of branches the fast path is like saying that we >>>>> shouldn't implement RWF_HIPRI because it slows down the fast path >>>>> for non-polled IO.... >>>>> >>>>> Just factor the actual recovery operations out into a separate >>>>> function like:
On Sun, Oct 31, 2021 at 01:19:48PM +0000, Pavel Begunkov wrote: > On 10/29/21 23:32, Dave Chinner wrote: > > Yup, you just described RWF_HIPRI! Seriously, Pavel, did you read > > past this? I'll quote what I said again, because I've already > > addressed this argument to point out how silly it is: > > And you almost got to the initial point in your penult paragraph. A > single if for a single flag is not an issue, what is the problem is > when there are dozens of them and the overhead for it is not isolated, > so the kernel has to jump through dozens of those. This argument can be used to reject *ANY* new feature. For example, by using your argument, we should have rejected the addition of IOCB_WAITQ because it penalises the vast majority of IOs which do not use it. But we didn't. Because we see that while it may not be of use to US today, it's a generally useful feature for Linux to support. You say yourself that this feature doesn't slow down your use case, so why are you spending so much time and energy annoying the people who actually want to use it? Seriously. Stop arguing about something you actually don't care about. You're just making Linux less fun to work on.
On Wed, Oct 27, 2021 at 05:24:51PM -0700, Darrick J. Wong wrote: > ...so would you happen to know if anyone's working on solving this > problem for us by putting the memory controller in charge of dealing > with media errors? The only one who could know is Intel.. > The trouble is, we really /do/ want to be able to (re)write the failed > area, and we probably want to try to read whatever we can. Those are > reads and writes, not {pre,f}allocation activities. This is where Dave > and I arrived at a month ago. > > Unless you'd be ok with a second IO path for recovery where we're > allowed to be slow? That would probably have the same user interface > flag, just a different path into the pmem driver. Which is fine with me. If you look at the API here we do have the RWF_ API, which then maps to the IOMAP API, which maps to the DAX_ API which then gets special casing over three methods. And while Pavel pointed out that he and Jens are now optimizing for single branches like this, I think this actually is silly and it is not my point. The point is that the DAX in-kernel API is a mess, and before we make it even worse we need to sort it first. What is directly relevant here is that the copy_from_iter and copy_to_iter APIs do not make sense. Most of the DAX API is based around getting a memory mapping using ->direct_access, it is just the read/write path which is a slow path that actually uses this. I have a very WIP patch series to try to sort this out here: http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dax-devirtualize But back to this series. The basic DAX model is that the caller gets a memory mapping and just works on that, maybe calling a sync after a write in a few cases. So any kind of recovery really needs to be able to work with that model as going forward the copy_to/from_iter path will be used less and less. i.e. file systems can and should use direct_access directly instead of using the block layer implementation in the pmem driver. As an example the dm-writecache driver, the pending bcache nvdimm support and the (horrible and out of tree) nova file system won't even use this path. We need to find a way to support recovery for them. And overloading it over the read/write path, which is not the main path for DAX but the absolute fast path for 99% of the kernel users, is a horrible idea. So how can we work around the horrible nvdimm design for data recovery in a way that:

a) actually works with the intended direct memory map use case
b) doesn't really affect the normal kernel too much

?
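To illustrate the distinction Christoph draws, here is a small userspace sketch of the two models: the mapping-based model, where I/O is plain loads and stores with no syscall to attach a recovery flag to (in-kernel users of ->direct_access look much the same, just with kernel addresses), and the read/write path, which is the only place an RWF_ flag can live. The file path is an assumption, error handling is trimmed, and fallback defines for MAP_SHARED_VALIDATE/MAP_SYNC are included in case the libc headers lack them.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #ifndef MAP_SHARED_VALIDATE
    #define MAP_SHARED_VALIDATE 0x03
    #endif
    #ifndef MAP_SYNC
    #define MAP_SYNC 0x80000
    #endif

    int main(void)
    {
        const size_t len = 4096;
        int fd = open("/mnt/pmem/file", O_RDWR);   /* assumed DAX-mounted fs and file */

        if (fd < 0)
            return 1;

        /* Model 1: direct mapping -- stores go straight to the media, no syscall. */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
        if (p != MAP_FAILED) {
            memcpy(p, "hello", 5);     /* a poisoned line here means SIGBUS/MCE, not EIO */
            msync(p, len, MS_SYNC);    /* or CPU cache-flush instructions with MAP_SYNC */
            munmap(p, len);
        }

        /* Model 2: the read/write path -- the only place an RWF_ flag can live. */
        if (pwrite(fd, "hello", 5, 0) < 0) {
            /* this is where a recovery retry (e.g. pwritev2 + flag) would go */
        }

        close(fd);
        return 0;
    }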
On Tue, Oct 26, 2021 at 11:50 PM Christoph Hellwig <hch@infradead.org> wrote: > > On Fri, Oct 22, 2021 at 08:52:55PM +0000, Jane Chu wrote: > > Thanks - I try to be honest. As far as I can tell, the argument > > about the flag is a philosophical argument between two views. > > One view assumes design based on perfect hardware, and media error > > belongs to the category of brokenness. Another view sees media > > error as a build-in hardware component and make design to include > > dealing with such errors. > > No, I don't think so. Bit errors do happen in all media, which is > why devices are built to handle them. It is just the Intel-style > pmem interface to handle them which is completely broken. No, any media can report checksum / parity errors. NVME also seems to do a poor job with multi-bit ECC errors consumed from DRAM. There is nothing "pmem" or "Intel" specific here. > > errors in mind from start. I guess I'm trying to articulate why > > it is acceptable to include the RWF_DATA_RECOVERY flag to the > > existing RWF_ flags. - this way, pwritev2 remain fast on fast path, > > and its slow path (w/ error clearing) is faster than other alternative. > > Other alternative being 1 system call to clear the poison, and > > another system call to run the fast pwrite for recovery, what > > happens if something happened in between? > > Well, my point is doing recovery from bit errors is by definition not > the fast path. Which is why I'd rather keep it away from the pmem > read/write fast path, which also happens to be the (much more important) > non-pmem read/write path. I would expect this interface to be useful outside of pmem as a "failfast" or "try harder to recover" flag for reading over media errors.
On Wed, Oct 27, 2021 at 5:25 PM Darrick J. Wong <djwong@kernel.org> wrote: > > On Tue, Oct 26, 2021 at 11:49:59PM -0700, Christoph Hellwig wrote: > > On Fri, Oct 22, 2021 at 08:52:55PM +0000, Jane Chu wrote: > > > Thanks - I try to be honest. As far as I can tell, the argument > > > about the flag is a philosophical argument between two views. > > > One view assumes design based on perfect hardware, and media error > > > belongs to the category of brokenness. Another view sees media > > > error as a build-in hardware component and make design to include > > > dealing with such errors. > > > > No, I don't think so. Bit errors do happen in all media, which is > > why devices are built to handle them. It is just the Intel-style > > pmem interface to handle them which is completely broken. > > Yeah, I agree, this takes me back to learning how to use DISKEDIT to > work around a hole punched in a file (with a pen!) in the 1980s... > > ...so would you happen to know if anyone's working on solving this > problem for us by putting the memory controller in charge of dealing > with media errors? What are you guys going on about? ECC memory corrects single-bit errors in the background, multi-bit errors cause the memory controller to signal that data is gone. This is how ECC memory has worked since forever. Typically the kernel's memory-failure path is just throwing away pages that signal data loss. Throwing away pmem pages is harder because unlike DRAM the physical address of the page matters to upper layers. > > > > errors in mind from start. I guess I'm trying to articulate why > > > it is acceptable to include the RWF_DATA_RECOVERY flag to the > > > existing RWF_ flags. - this way, pwritev2 remain fast on fast path, > > > and its slow path (w/ error clearing) is faster than other alternative. > > > Other alternative being 1 system call to clear the poison, and > > > another system call to run the fast pwrite for recovery, what > > > happens if something happened in between? > > > > Well, my point is doing recovery from bit errors is by definition not > > the fast path. Which is why I'd rather keep it away from the pmem > > read/write fast path, which also happens to be the (much more important) > > non-pmem read/write path. > > The trouble is, we really /do/ want to be able to (re)write the failed > area, and we probably want to try to read whatever we can. Those are > reads and writes, not {pre,f}allocation activities. This is where Dave > and I arrived at a month ago. > > Unless you'd be ok with a second IO path for recovery where we're > allowed to be slow? That would probably have the same user interface > flag, just a different path into the pmem driver. > > Ha, how about a int fd2 = recoveryfd(fd); call where you'd get whatever > speshul options (retry raid mirrors! scrape the film off the disk if > you have to!) you want that can take forever, leaving the fast paths > alone? I am still failing to see the technical argument for why RWF_RECOVER_DATA significantly impacts the fast path, and why you think this is somehow specific to pmem. In fact the pmem effort is doing the responsible thing and trying to plumb this path while other storage drivers just seem to be pretending that memory errors never happen.
On Mon, Nov 1, 2021 at 11:19 PM Christoph Hellwig <hch@infradead.org> wrote: > > On Wed, Oct 27, 2021 at 05:24:51PM -0700, Darrick J. Wong wrote: > > ...so would you happen to know if anyone's working on solving this > > problem for us by putting the memory controller in charge of dealing > > with media errors? > > The only one who could know is Intel.. > > > The trouble is, we really /do/ want to be able to (re)write the failed > > area, and we probably want to try to read whatever we can. Those are > > reads and writes, not {pre,f}allocation activities. This is where Dave > > and I arrived at a month ago. > > > > Unless you'd be ok with a second IO path for recovery where we're > > allowed to be slow? That would probably have the same user interface > > flag, just a different path into the pmem driver. > > Which is fine with me. If you look at the API here we do have the > RWF_ API, which them maps to the IOMAP API, which maps to the DAX_ > API which then gets special casing over three methods. > > And while Pavel pointed out that he and Jens are now optimizing for > single branches like this. I think this actually is silly and it is > not my point. > > The point is that the DAX in-kernel API is a mess, and before we make > it even worse we need to sort it first. What is directly relevant > here is that the copy_from_iter and copy_to_iter APIs do not make > sense. Most of the DAX API is based around getting a memory mapping > using ->direct_access, it is just the read/write path which is a slow > path that actually uses this. I have a very WIP patch series to try > to sort this out here: > > http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dax-devirtualize > > But back to this series. The basic DAX model is that the callers gets a > memory mapping an just works on that, maybe calling a sync after a write > in a few cases. So any kind of recovery really needs to be able to > work with that model as going forward the copy_to/from_iter path will > be used less and less. i.e. file systems can and should use > direct_access directly instead of using the block layer implementation > in the pmem driver. As an example the dm-writecache driver, the pending > bcache nvdimm support and the (horribly and out of tree) nova file systems > won't even use this path. We need to find a way to support recovery > for them. And overloading it over the read/write path which is not > the main path for DAX, but the absolutely fast path for 99% of the > kernel users is a horrible idea. > > So how can we work around the horrible nvdimm design for data recovery > in a way that: > > a) actually works with the intended direct memory map use case > b) doesn't really affect the normal kernel too much > > ? Ok, now I see where you are going, but I don't see line of sight to something better than RWF_RECOVER_DATA. This goes back to one of the original DAX concerns of wanting a kernel library for coordinating PMEM mmap I/O vs leaving userspace to wrap PMEM semantics on top of a DAX mapping. The problem is that mmap-I/O has this error-handling-API issue whether it is a DAX mapping or not. I.e. a memory failure in page cache is going to signal the process the same way and it will need to fall back to something other than mmap I/O to make forward progress. This is not a PMEM, Intel or even x86 problem, it's a generic CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE problem. 
CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE implies that processes will receive SIGBUS + BUS_MCEERR_A{R,O} when memory failure is signalled and then rely on readv(2)/writev(2) to recover. Do you see a readily available way to improve upon that model without CPU instruction changes? Even with CPU instruction changes, do you think it could improve much upon the model of interrupting the process when a load instruction aborts? I do agree with you that DAX needs to separate itself from block, but I don't think it follows that DAX also needs to separate itself from readv/writev for when a kernel slow-path needs to get involved because mmap I/O (just CPU instructions) does not have the proper semantics. Even if you got one of the ARCH_SUPPORTS_MEMORY_FAILURE architectures to implement those semantics in new / augmented CPU instructions you will likely not get all of them to move and certainly not in any near term timeframe, so the kernel path will be around indefinitely. Meanwhile, I think RWF_RECOVER_DATA is generically useful for other storage besides PMEM and helps storage drivers do better than large blast radius "I/O error" completions with no other recourse.
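To make the model above concrete, here is a minimal userspace sketch of the readv(2)/writev(2) fallback after a memory-failure SIGBUS, assuming the RWF_RECOVER_DATA flag proposed in this series. The flag is not in the upstream uapi headers, so the value defined below is only a stand-in for whatever the series ends up defining.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Placeholder: the real value would come from the series' uapi update. */
#ifndef RWF_RECOVER_DATA
#define RWF_RECOVER_DATA 0x80
#endif

/*
 * After SIGBUS/BUS_MCEERR_* on an mmap access, fall back from MMAP I/O to
 * pwritev2() with the recovery flag to overwrite (and thereby clear) the
 * poisoned range from a known-good copy of the data.
 */
static int recover_write(int fd, off_t offset, void *good_copy, size_t len)
{
        struct iovec iov = { .iov_base = good_copy, .iov_len = len };
        ssize_t n = pwritev2(fd, &iov, 1, offset, RWF_RECOVER_DATA);

        if (n < 0) {
                perror("pwritev2(RWF_RECOVER_DATA)");
                return -1;
        }
        return 0;
}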
On Tue, Nov 02, 2021 at 09:03:55AM -0700, Dan Williams wrote: > > why devices are built to handle them. It is just the Intel-style > > pmem interface to handle them which is completely broken. > > No, any media can report checksum / parity errors. NVME also seems to > do a poor job with multi-bit ECC errors consumed from DRAM. There is > nothing "pmem" or "Intel" specific here. If you do get data corruption from NVMe (which yes can happen despite the typical very good UBER rate) you just write over it again. You don't need to magically whack the underlying device. Same for hard drives. > > Well, my point is doing recovery from bit errors is by definition not > > the fast path. Which is why I'd rather keep it away from the pmem > > read/write fast path, which also happens to be the (much more important) > > non-pmem read/write path. > > I would expect this interface to be useful outside of pmem as a > "failfast" or "try harder to recover" flag for reading over media > errors. Maybe we need to sit down and define useful semantics then? The problem on the write side isn't really that the behavior with the flag is undefined, it is more that writes without the flag have horrible semantics if they don't just clear the error.
On Tue, Nov 02, 2021 at 12:57:10PM -0700, Dan Williams wrote: > This goes back to one of the original DAX concerns of wanting a kernel > library for coordinating PMEM mmap I/O vs leaving userspace to wrap > PMEM semantics on top of a DAX mapping. The problem is that mmap-I/O > has this error-handling-API issue whether it is a DAX mapping or not. Semantics of writes through shared mmaps are a nightmare. Agreed, including agreeing that this is neither new nor pmem specific. But it also has absolutely nothing to do with the new RWF_ flag. > CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE implies that processes will > receive SIGBUS + BUS_MCEERR_A{R,O} when memory failure is signalled > and then rely on readv(2)/writev(2) to recover. Do you see a readily > available way to improve upon that model without CPU instruction > changes? Even with CPU instructions changes, do you think it could > improve much upon the model of interrupting the process when a load > instruction aborts? The "only" thing we need is something like the exception table we use in the kernel for the uaccess helpers (and the new _nofault kernel access helper). But I suspect refitting that into userspace environments is probably non-trivial. > I do agree with you that DAX needs to separate itself from block, but > I don't think it follows that DAX also needs to separate itself from > readv/writev for when a kernel slow-path needs to get involved because > mmap I/O (just CPU instructions) does not have the proper semantics. > Even if you got one of the ARCH_SUPPORTS_MEMORY_FAILURE to implement > those semantics in new / augmented CPU instructions you will likely > not get all of them to move and certainly not in any near term > timeframe, so the kernel path will be around indefinitely. I think you misunderstood me. I don't think pmem needs to be decoupled from the read/write path. But I'm very skeptical of adding a new flag to the common read/write path for the special workaround that a plain old write will not actually clear errors unlike every other store interface. > Meanwhile, I think RWF_RECOVER_DATA is generically useful for other > storage besides PMEM and helps storage-drivers do better than large > blast radius "I/O error" completions with no other recourse. How?
On 11/1/2021 11:18 PM, Christoph Hellwig wrote: > On Wed, Oct 27, 2021 at 05:24:51PM -0700, Darrick J. Wong wrote: >> ...so would you happen to know if anyone's working on solving this >> problem for us by putting the memory controller in charge of dealing >> with media errors? > > The only one who could know is Intel.. > >> The trouble is, we really /do/ want to be able to (re)write the failed >> area, and we probably want to try to read whatever we can. Those are >> reads and writes, not {pre,f}allocation activities. This is where Dave >> and I arrived at a month ago. >> >> Unless you'd be ok with a second IO path for recovery where we're >> allowed to be slow? That would probably have the same user interface >> flag, just a different path into the pmem driver. > > Which is fine with me. If you look at the API here we do have the > RWF_ API, which them maps to the IOMAP API, which maps to the DAX_ > API which then gets special casing over three methods. > > And while Pavel pointed out that he and Jens are now optimizing for > single branches like this. I think this actually is silly and it is > not my point. > > The point is that the DAX in-kernel API is a mess, and before we make > it even worse we need to sort it first. What is directly relevant > here is that the copy_from_iter and copy_to_iter APIs do not make > sense. Most of the DAX API is based around getting a memory mapping > using ->direct_access, it is just the read/write path which is a slow > path that actually uses this. I have a very WIP patch series to try > to sort this out here: > > http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dax-devirtualize > > But back to this series. The basic DAX model is that the callers gets a > memory mapping an just works on that, maybe calling a sync after a write > in a few cases. So any kind of recovery really needs to be able to > work with that model as going forward the copy_to/from_iter path will > be used less and less. i.e. file systems can and should use > direct_access directly instead of using the block layer implementation > in the pmem driver. As an example the dm-writecache driver, the pending > bcache nvdimm support and the (horribly and out of tree) nova file systems > won't even use this path. We need to find a way to support recovery > for them. And overloading it over the read/write path which is not > the main path for DAX, but the absolutely fast path for 99% of the > kernel users is a horrible idea. > > So how can we work around the horrible nvdimm design for data recovery > in a way that: > > a) actually works with the intended direct memory map use case > b) doesn't really affect the normal kernel too much > > ? > This is clearer, I've looked at your 'dax-devirtualize' patch which removes pmem_copy_to/from_iter, and as you mentioned before, a separate API for poison-clearing is needed. So how about I go ahead rebase my earlier patch https://lore.kernel.org/lkml/20210914233132.3680546-2-jane.chu@oracle.com/ on 'dax-devirtualize', provide dm support for clear-poison? That way, the non-dax 99% of the pwrite use-cases aren't impacted at all and we resolve the urgent pmem poison-clearing issue? Dan, are you okay with this? I am getting pressure from our customers who are basically stuck at the moment. thanks! -jane
On Wed, Nov 3, 2021 at 9:58 AM Christoph Hellwig <hch@infradead.org> wrote: > > On Tue, Nov 02, 2021 at 12:57:10PM -0700, Dan Williams wrote: > > This goes back to one of the original DAX concerns of wanting a kernel > > library for coordinating PMEM mmap I/O vs leaving userspace to wrap > > PMEM semantics on top of a DAX mapping. The problem is that mmap-I/O > > has this error-handling-API issue whether it is a DAX mapping or not. > > Semantics of writes through shared mmaps are a nightmare. Agreed, > including agreeing that this is neither new nor pmem specific. But > it also has absolutely nothing to do with the new RWF_ flag. Ok. > > CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE implies that processes will > > receive SIGBUS + BUS_MCEERR_A{R,O} when memory failure is signalled > > and then rely on readv(2)/writev(2) to recover. Do you see a readily > > available way to improve upon that model without CPU instruction > > changes? Even with CPU instructions changes, do you think it could > > improve much upon the model of interrupting the process when a load > > instruction aborts? > > The "only" think we need is something like the exception table we > use in the kernel for the uaccess helpers (and the new _nofault > kernel access helper). But I suspect refitting that into userspace > environments is probably non-trivial. Is the exception table requirement not already fulfilled by: sigaction(SIGBUS, &act, 0); ... if (sigsetjmp(sj_env, 1)) { ... ...but yes, that's awkward when all you want is an error return from a copy operation. For _nofault I'll note that on the kernel side Linus was explicit about not mixing fault handling and memory error exception handling in the same accessor. That's why copy_mc_to_kernel() and copy_{to,from}_kernel_nofault() are distinct. I only say that to probe deeper about what a "copy_mc()" looks like in userspace? Perhaps an interface to suppress SIGBUS generation and register a ring buffer that gets filled with error-event records encountered over a given MMAP I/O code sequence? > > I do agree with you that DAX needs to separate itself from block, but > > I don't think it follows that DAX also needs to separate itself from > > readv/writev for when a kernel slow-path needs to get involved because > > mmap I/O (just CPU instructions) does not have the proper semantics. > > Even if you got one of the ARCH_SUPPORTS_MEMORY_FAILURE to implement > > those semantics in new / augmented CPU instructions you will likely > > not get all of them to move and certainly not in any near term > > timeframe, so the kernel path will be around indefinitely. > > I think you misunderstood me. I don't think pmem needs to be > decoupled from the read/write path. But I'm very skeptical of adding > a new flag to the common read/write path for the special workaround > that a plain old write will not actually clear errors unlike every > other store interfac. Ah, ok, yes, I agree with you there that needing to redirect writes to a platform firmware call to clear errors, and notify the device that its error-list has changed is exceedingly awkward. That said, even if the device-side error-list auto-updated on write (like the promise of MOVDIR64B) there's still the question about when to do management on the software error lists in the driver and/or filesytem. I.e. given that XFS at least wants to be aware of the error lists for block allocation and "list errors" type features. More below... 
> > Meanwhile, I think RWF_RECOVER_DATA is generically useful for other > > storage besides PMEM and helps storage-drivers do better than large > > blast radius "I/O error" completions with no other recourse. > > How? Hasn't this been a perennial topic at LSF/MM, i.e. how to get an interface for the filesystem to request "try harder" to return data? If the device has a recovery slow-path, or error tracking granularity is smaller than the I/O size, then RWF_RECOVER_DATA gives the device/driver leeway to do better than the typical fast path. For writes though, I can only come up with the use case of this being a signal to the driver to take the opportunity to do error-list management relative to the incoming write data. However, if signaling that "now is the time to update error-lists" is the requirement, I imagine the @kaddr returned from dax_direct_access() could be made to point to an unmapped address representing the poisoned page. Then, arrange for a pmem-driver fault handler to emulate the copy operation and do the slow path updates that would otherwise have been gated by RWF_RECOVER_DATA. Although, I'm not excited about teaching every PMEM arch's fault handler about this new source of kernel faults. Other ideas? RWF_RECOVER_DATA still seems the most viable / cleanest option, but I'm willing to do what it takes to move this error management capability forward.
On Wed, Nov 3, 2021 at 11:10 AM Jane Chu <jane.chu@oracle.com> wrote: > > On 11/1/2021 11:18 PM, Christoph Hellwig wrote: > > On Wed, Oct 27, 2021 at 05:24:51PM -0700, Darrick J. Wong wrote: > >> ...so would you happen to know if anyone's working on solving this > >> problem for us by putting the memory controller in charge of dealing > >> with media errors? > > > > The only one who could know is Intel.. > > > >> The trouble is, we really /do/ want to be able to (re)write the failed > >> area, and we probably want to try to read whatever we can. Those are > >> reads and writes, not {pre,f}allocation activities. This is where Dave > >> and I arrived at a month ago. > >> > >> Unless you'd be ok with a second IO path for recovery where we're > >> allowed to be slow? That would probably have the same user interface > >> flag, just a different path into the pmem driver. > > > > Which is fine with me. If you look at the API here we do have the > > RWF_ API, which them maps to the IOMAP API, which maps to the DAX_ > > API which then gets special casing over three methods. > > > > And while Pavel pointed out that he and Jens are now optimizing for > > single branches like this. I think this actually is silly and it is > > not my point. > > > > The point is that the DAX in-kernel API is a mess, and before we make > > it even worse we need to sort it first. What is directly relevant > > here is that the copy_from_iter and copy_to_iter APIs do not make > > sense. Most of the DAX API is based around getting a memory mapping > > using ->direct_access, it is just the read/write path which is a slow > > path that actually uses this. I have a very WIP patch series to try > > to sort this out here: > > > > http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dax-devirtualize > > > > But back to this series. The basic DAX model is that the callers gets a > > memory mapping an just works on that, maybe calling a sync after a write > > in a few cases. So any kind of recovery really needs to be able to > > work with that model as going forward the copy_to/from_iter path will > > be used less and less. i.e. file systems can and should use > > direct_access directly instead of using the block layer implementation > > in the pmem driver. As an example the dm-writecache driver, the pending > > bcache nvdimm support and the (horribly and out of tree) nova file systems > > won't even use this path. We need to find a way to support recovery > > for them. And overloading it over the read/write path which is not > > the main path for DAX, but the absolutely fast path for 99% of the > > kernel users is a horrible idea. > > > > So how can we work around the horrible nvdimm design for data recovery > > in a way that: > > > > a) actually works with the intended direct memory map use case > > b) doesn't really affect the normal kernel too much > > > > ? > > > > This is clearer, I've looked at your 'dax-devirtualize' patch which > removes pmem_copy_to/from_iter, and as you mentioned before, > a separate API for poison-clearing is needed. So how about I go ahead > rebase my earlier patch > > https://lore.kernel.org/lkml/20210914233132.3680546-2-jane.chu@oracle.com/ > on 'dax-devirtualize', provide dm support for clear-poison? > That way, the non-dax 99% of the pwrite use-cases aren't impacted at all > and we resolve the urgent pmem poison-clearing issue? > > Dan, are you okay with this? I am getting pressure from our customers > who are basically stuck at the moment. 
The concern I have with dax_clear_poison() is that it precludes atomic error clearing. Also, as Boris and I discussed, poisoned pages should be marked NP (not present) rather than UC (uncacheable) [1]. With those 2 properties combined I think that wants a custom pmem fault handler that knows how to carefully write to pmem pages with poison present, rather than an additional explicit dax-operation. That also meets Christoph's requirement of "works with the intended direct memory map use case". [1]: https://lore.kernel.org/r/CAPcyv4hrXPb1tASBZUg-GgdVs0OOFKXMXLiHmktg_kFi7YBMyQ@mail.gmail.com
On Wed, Nov 03, 2021 at 06:09:57PM +0000, Jane Chu wrote: > This is clearer, I've looked at your 'dax-devirtualize' patch which > removes pmem_copy_to/from_iter, and as you mentioned before, > a separate API for poison-clearing is needed. So how about I go ahead > rebase my earlier patch > > https://lore.kernel.org/lkml/20210914233132.3680546-2-jane.chu@oracle.com/ > on 'dax-devirtualize', provide dm support for clear-poison? > That way, the non-dax 99% of the pwrite use-cases aren't impacted at all > and we resolve the urgent pmem poison-clearing issue? FYI, I really do like the in-kernel interface in that series. And now that we discussed all the effects I also think that automatically doing the clear on EIO is a good idea. I'll go back to the series and will reply with a few nitpicks.
On Wed, Nov 03, 2021 at 01:33:58PM -0700, Dan Williams wrote: > Is the exception table requirement not already fulfilled by: > > sigaction(SIGBUS, &act, 0); > ... > if (sigsetjmp(sj_env, 1)) { > ... > > ...but yes, that's awkward when all you want is an error return from a > copy operation. Yeah. The nice thing about the kernel uaccess / _nofault helpers is that they allow normal error handling flows. > For _nofault I'll note that on the kernel side Linus was explicit > about not mixing fault handling and memory error exception handling in > the same accessor. That's why copy_mc_to_kernel() and > copy_{to,from}_kernel_nofault() are distinct. I've always wondered why we need all this mess. But if the head penguin wants it.. > I only say that to probe > deeper about what a "copy_mc()" looks like in userspace? Perhaps an > interface to suppress SIGBUS generation and register a ring buffer > that gets filled with error-event records encountered over a given > MMAP I/O code sequence? Well, the equivalent to the kernel uaccess model would be to register a SIGBUS handler that uses an exception table like the kernel, and then if you use the right helpers to load/store they can return errors. The other option would be something more like the Windows Structured Exception Handling: https://docs.microsoft.com/en-us/cpp/cpp/structured-exception-handling-c-cpp?view=msvc-160 > > I think you misunderstood me. I don't think pmem needs to be > > decoupled from the read/write path. But I'm very skeptical of adding > > a new flag to the common read/write path for the special workaround > > that a plain old write will not actually clear errors unlike every > > other store interfac. > > Ah, ok, yes, I agree with you there that needing to redirect writes to > a platform firmware call to clear errors, and notify the device that > its error-list has changed is exceedingly awkward. Yes. And that is the big difference to every other modern storage device. SSDs always write out of place and will just not reuse bad blocks, and all drivers built in the last 25-30 years will also do internal bad block remapping. > That said, even if > the device-side error-list auto-updated on write (like the promise of > MOVDIR64B) there's still the question about when to do management on > the software error lists in the driver and/or filesytem. I.e. given > that XFS at least wants to be aware of the error lists for block > allocation and "list errors" type features. More below... Well, the whole problem is that we should not have to manage this at all, and this is where I blame Intel. There is no good reason to not slightly overprovision the nvdimms and just do internal bad page remapping like every other modern storage device. > Hasn't this been a perennial topic at LSF/MM, i.e. how to get an > interface for the filesystem to request "try harder" to return data? Trying harder to _get_ data or to _store_ data? All the discussion here seems to be able about actually writing data. > If the device has a recovery slow-path, or error tracking granularity > is smaller than the I/O size, then RWF_RECOVER_DATA gives the > device/driver leeway to do better than the typical fast path. For > writes though, I can only come up with the use case of this being a > signal to the driver to take the opportunity to do error-list > management relative to the incoming write data. Well, while devices have data recovery slow path they tend to use those automatically. 
What I'm actually seeing in practice is flags in the storage interfaces to skip this slow path recovery because the higher level software would prefer to instead recover e.g. from remote copies. > However, if signaling that "now is the time to update error-lists" is > the requirement, I imagine the @kaddr returned from > dax_direct_access() could be made to point to an unmapped address > representing the poisoned page. Then, arrange for a pmem-driver fault > handler to emulate the copy operation and do the slow path updates > that would otherwise have been gated by RWF_RECOVER_DATA. That does sound like a much better interface than most of what we've discussed so far. > Although, I'm not excited about teaching every PMEM arch's fault > handler about this new source of kernel faults. Other ideas? > RWF_RECOVER_DATA still seems the most viable / cleanest option, but > I'm willing to do what it takes to move this error management > capability forward. Out of the low-intrusiveness options, Jane's previous series to automatically retry after calling a clear_poison operation seems like the best idea so far. We just need to also think about what we want to do for direct users of ->direct_access that do not use the mcsafe iov_iter helpers.
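As a side note on the userspace exception-table idea discussed a few messages up: the sigsetjmp()-based pattern Dan's fragment alludes to can be sketched in plain POSIX C as below. This is only an illustration of the approach (PMDK implements a more complete version); it turns a SIGBUS during the copy into an error return, and on failure the destination may be left partially written.

#include <setjmp.h>
#include <signal.h>
#include <string.h>

static __thread sigjmp_buf mc_env;

static void mc_sigbus(int sig)
{
        siglongjmp(mc_env, 1);
}

/*
 * Copy that returns -1 instead of crashing when the source raises SIGBUS,
 * e.g. because a poisoned pmem line was consumed through a DAX mapping.
 */
static int copy_from_map(void *dst, const void *src, size_t len)
{
        struct sigaction act = { .sa_handler = mc_sigbus }, old;
        int ret = 0;

        sigemptyset(&act.sa_mask);
        sigaction(SIGBUS, &act, &old);
        if (sigsetjmp(mc_env, 1))
                ret = -1;               /* got here via the SIGBUS handler */
        else
                memcpy(dst, src, len);
        sigaction(SIGBUS, &old, NULL);
        return ret;
}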
On Wed, Nov 03, 2021 at 11:21:39PM -0700, Dan Williams wrote: > The concern I have with dax_clear_poison() is that it precludes atomic > error clearing. atomic as in clear poison and write the actual data? Yes, that would be useful, but it is not how the Intel pmem support actually works, right? > Also, as Boris and I discussed, poisoned pages should > be marked NP (not present) rather than UC (uncacheable) [1]. This would not really have an effect on the series, right? But yes, that seems like the right thing to do. > With > those 2 properties combined I think that wants a custom pmem fault > handler that knows how to carefully write to pmem pages with poison > present, rather than an additional explicit dax-operation. That also > meets Christoph's requirement of "works with the intended direct > memory map use case". So we have 3 kinds of accesses to DAX memory: (1) user space mmap direct access. (2) iov_iter based access (could be from kernel or userspace) (3) open coded kernel access using ->direct_access One thing I noticed: (2) could also work with kernel memory or pages, but that doesn't use MC safe access. Which seems like a major issue independent of this discussion. I suspect all kernel access could work fine with a copy_mc_to_kernel helper as long as everyone actually uses it, which will cover the missing required bits of (2) and (3) together with something like the ->clear_poison series from Jane. We just need to think hard about what we want to do for userspace mmap access.
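For the open-coded kernel users in case (3), the mc-safe pattern looks roughly like the sketch below (hand-written for illustration, not taken from any of the drivers mentioned): copy_mc_to_kernel() returns the number of bytes it could not copy, so a non-zero return means poison was consumed partway through and the caller can fail the access with -EIO instead of taking a fatal machine check.

#include <linux/errno.h>
#include <linux/uaccess.h>

/*
 * Sketch of case (3): an open-coded ->direct_access user reading from a DAX
 * mapping with machine-check protection. Assumes kaddr and len came from a
 * successful dax_direct_access() call.
 */
static int dax_read_mc_safe(void *dst, const void *kaddr, size_t len)
{
        unsigned long rem;

        /* returns the number of bytes *not* copied; 0 means success */
        rem = copy_mc_to_kernel(dst, kaddr, len);
        if (rem)
                return -EIO;    /* poison hit after (len - rem) bytes */

        return 0;
}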
On Thu, Nov 04, 2021 at 01:30:48AM -0700, Christoph Hellwig wrote: > Well, the whole problem is that we should not have to manage this at > all, and this is where I blame Intel. There is no good reason to not > slightly overprovision the nvdimms and just do internal bad page > remapping like every other modern storage device. What makes you think they don't? The problem is persuading the CPU to do writes without doing reads. That's where magic instructions or out of band IO is needed.
On Thu, Nov 4, 2021 at 1:36 AM Christoph Hellwig <hch@infradead.org> wrote: > > On Wed, Nov 03, 2021 at 11:21:39PM -0700, Dan Williams wrote: > > The concern I have with dax_clear_poison() is that it precludes atomic > > error clearing. > > atomic as in clear poison and write the actual data? Yes, that would > be useful, but it is not how the Intel pmem support actually works, right? Yes, atomic clear+write new data. The ability to atomic clear requires either a CPU with the ability to overwrite cachelines without doing a RMW cycle (MOVDIR64B), or it requires a device with a suitable slow-path mailbox command like the one defined for CXL devices (see section 8.2.9.5.4.3 Clear Poison in CXL 2.0). I don't know why you think these devices don't perform wear-leveling with spare blocks? > > Also, as Boris and I discussed, poisoned pages should > > be marked NP (not present) rather than UC (uncacheable) [1]. > > This would not really have an affect on the series, right? But yes, > that seems like the right thing to do. It would because the implementations would need to be careful to clear poison in an entire page before any of it could be accessed. With an enlightened write-path RWF flag or custom fault handler it could do sub-page overwrites of poison. Not that I think the driver should optimize for multiple failed cachelines in a page, but it does mean dax_clear_poison() fails in more theoretical scenarios. > > With > > those 2 properties combined I think that wants a custom pmem fault > > handler that knows how to carefully write to pmem pages with poison > > present, rather than an additional explicit dax-operation. That also > > meets Christoph's requirement of "works with the intended direct > > memory map use case". > > So we have 3 kinds of accesses to DAX memory: > > (1) user space mmap direct access. > (2) iov_iter based access (could be from kernel or userspace) > (3) open coded kernel access using ->direct_access > > One thing I noticed: (2) could also work with kernel memory or pages, > but that doesn't use MC safe access. Yes, but after the fight to even get copy_mc_to_kernel() to exist for pmem_copy_to_iter() I did not have the nerve to push for wider usage. > Which seems like a major independent > of this discussion. > > I suspect all kernel access could work fine with a copy_mc_to_kernel > helper as long as everyone actually uses it, All kernel accesses do use it. They either route to pmem_copy_to_iter(), or like dm-writecache, call it directly. Do you see a kernel path that does not use that helper? > missing required bits of (2) and (3) together with something like the > ->clear_poison series from Jane. We just need to think hard what we > want to do for userspace mmap access. dax_clear_poison() is at least ready to go today and does not preclude adding the atomic and finer grained support later.
On Thu, Nov 4, 2021 at 1:30 AM Christoph Hellwig <hch@infradead.org> wrote: > > On Wed, Nov 03, 2021 at 01:33:58PM -0700, Dan Williams wrote: > > Is the exception table requirement not already fulfilled by: > > > > sigaction(SIGBUS, &act, 0); > > ... > > if (sigsetjmp(sj_env, 1)) { > > ... > > > > ...but yes, that's awkward when all you want is an error return from a > > copy operation. > > Yeah. The nice thing about the kernel uaccess / _nofault helpers is > that they allow normal error handling flows. > > > For _nofault I'll note that on the kernel side Linus was explicit > > about not mixing fault handling and memory error exception handling in > > the same accessor. That's why copy_mc_to_kernel() and > > copy_{to,from}_kernel_nofault() are distinct. > > I've always wondered why we need all this mess. But if the head > penguin wants it.. > > > I only say that to probe > > deeper about what a "copy_mc()" looks like in userspace? Perhaps an > > interface to suppress SIGBUS generation and register a ring buffer > > that gets filled with error-event records encountered over a given > > MMAP I/O code sequence? > > Well, the equivalent to the kernel uaccess model would be to register > a SIGBUS handler that uses an exception table like the kernel, and then > if you use the right helpers to load/store they can return errors. > > The other option would be something more like the Windows Structured > Exception Handling: > > https://docs.microsoft.com/en-us/cpp/cpp/structured-exception-handling-c-cpp?view=msvc-160 Yes, try-catch blocks for PMEM is what is needed if it's not there already from PMDK. Searching for SIGBUS in PMDK finds things like: https://github.com/pmem/pmdk/blob/26449a51a86861db17feabdfefaa5ee4d5afabc9/src/libpmem2/mcsafe_ops_posix.c > > > > > I think you misunderstood me. I don't think pmem needs to be > > > decoupled from the read/write path. But I'm very skeptical of adding > > > a new flag to the common read/write path for the special workaround > > > that a plain old write will not actually clear errors unlike every > > > other store interfac. > > > > Ah, ok, yes, I agree with you there that needing to redirect writes to > > a platform firmware call to clear errors, and notify the device that > > its error-list has changed is exceedingly awkward. > > Yes. And that is the big difference to every other modern storage > device. SSDs always write out of place and will just not reuse bad > blocks, and all drivers built in the last 25-30 years will also do > internal bad block remapping. > No, the big difference with every other modern storage device is access to byte-addressable storage. Storage devices get to "cheat" with guaranteed minimum 512-byte accesses. So you can arrange for writes to always be large enough to scrub the ECC bits along with the data. For PMEM and byte-granularity DAX accesses the "sector size" is a cacheline and it needed a new CPU instruction before software could atomically update data + ECC. Otherwise, with sub-cacheline accesses, a RMW cycle can't always be avoided. Such a cycle pulls poison from the device on the read and pushes it back out to the media on the cacheline writeback. > > That said, even if > > the device-side error-list auto-updated on write (like the promise of > > MOVDIR64B) there's still the question about when to do management on > > the software error lists in the driver and/or filesytem. I.e. given > > that XFS at least wants to be aware of the error lists for block > > allocation and "list errors" type features. More below... 
> > Well, the whole problem is that we should not have to manage this at > all, and this is where I blame Intel. There is no good reason to not > slightly overprovision the nvdimms and just do internal bad page > remapping like every other modern storage device. I don't understand what overprovisioning has to do with better error management? No other storage device has seen fit to be as transparent with communicating the error list and offering ways to proactively scrub it. Dave and Darrick rightly saw this and said "hey, the FS could do a much better job for the user if it knew about this error list". So I don't get what this argument about spare blocks has to do with what XFS wants? I.e. an rmap facility to communicate files that have been clobbered by cosmic rays and other calamities. > > Hasn't this been a perennial topic at LSF/MM, i.e. how to get an > > interface for the filesystem to request "try harder" to return data? > > Trying harder to _get_ data or to _store_ data? All the discussion > here seems to be able about actually writing data. > > > If the device has a recovery slow-path, or error tracking granularity > > is smaller than the I/O size, then RWF_RECOVER_DATA gives the > > device/driver leeway to do better than the typical fast path. For > > writes though, I can only come up with the use case of this being a > > signal to the driver to take the opportunity to do error-list > > management relative to the incoming write data. > > Well, while devices have data recovery slow path they tend to use those > automatically. What I'm actually seeing in practice is flags in the > storage interfaces to skip this slow path recovery because the higher > level software would prefer to instead recover e.g. from remote copies. Ok. > > However, if signaling that "now is the time to update error-lists" is > > the requirement, I imagine the @kaddr returned from > > dax_direct_access() could be made to point to an unmapped address > > representing the poisoned page. Then, arrange for a pmem-driver fault > > handler to emulate the copy operation and do the slow path updates > > that would otherwise have been gated by RWF_RECOVER_DATA. > > That does sound like a much better interface than most of what we've > discussed so far. > > > Although, I'm not excited about teaching every PMEM arch's fault > > handler about this new source of kernel faults. Other ideas? > > RWF_RECOVER_DATA still seems the most viable / cleanest option, but > > I'm willing to do what it takes to move this error management > > capability forward. > > So far out of the low instrusiveness options Janes' previous series > to automatically retry after calling a clear_poison operation seems > like the best idea so far. We just need to also think about what > we want to do for direct users of ->direct_access that do not use > the mcsafe iov_iter helpers. Those exist? Even dm-writecache uses copy_mc_to_kernel().
On Thu, Nov 04, 2021 at 09:24:15AM -0700, Dan Williams wrote: > No, the big difference with every other modern storage device is > access to byte-addressable storage. Storage devices get to "cheat" > with guaranteed minimum 512-byte accesses. So you can arrange for > writes to always be large enough to scrub the ECC bits along with the > data. For PMEM and byte-granularity DAX accesses the "sector size" is > a cacheline and it needed a new CPU instruction before software could > atomically update data + ECC. Otherwise, with sub-cacheline accesses, > a RMW cycle can't always be avoided. Such a cycle pulls poison from > the device on the read and pushes it back out to the media on the > cacheline writeback. Indeed. The fake byte addressability is the problem, and the fix is to not do that, at least on the second attempt. > I don't understand what overprovisioning has to do with better error > management? No other storage device has seen fit to be as transparent > with communicating the error list and offering ways to proactively > scrub it. Dave and Darrick rightly saw this and said "hey, the FS > could do a much better job for the user if it knew about this error > list". So I don't get what this argument about spare blocks has to do > with what XFS wants? I.e. an rmap facility to communicate files that > have been clobbered by cosmic rays and other calamities. Well, the answer for other interfaces (at least at the gold plated cost option) is internal CRCs so strong that user visible bits clobbered by cosmic rays don't realistically happen. But it is a problem with the cheaper ones, and at least SCSI and NVMe offer the error list through the Get LBA status command (and I bet ATA too, but I haven't looked into that). Oddly enough there has never been much interest from the fs community for those. > > So far out of the low instrusiveness options Janes' previous series > > to automatically retry after calling a clear_poison operation seems > > like the best idea so far. We just need to also think about what > > we want to do for direct users of ->direct_access that do not use > > the mcsafe iov_iter helpers. > > Those exist? Even dm-writecache uses copy_mc_to_kernel(). I'm sorry, I had completely missed that it has been added. And it's been in for a whole year..
On Thu, Nov 04, 2021 at 09:08:41AM -0700, Dan Williams wrote: > Yes, atomic clear+write new data. The ability to atomic clear requires > either a CPU with the ability to overwrite cachelines without doing a > RMW cycle (MOVDIR64B), or it requires a device with a suitable > slow-path mailbox command like the one defined for CXL devices (see > section 8.2.9.5.4.3 Clear Poison in CXL 2.0). > > I don't know why you think these devices don't perform wear-leveling > with spare blocks? Because the interface looks so broken. But yes, apparently it's not the media management that is broken but just the interface that fakes up byte level access. > All kernel accesses do use it. They either route to > pmem_copy_to_iter(), or like dm-writecache, call it directly. Do you > see a kernel path that does not use that helper? No, sorry. My knowledge is out of date. (nova does, but it is out of tree, and the lack of using copy_mc_to_kernel is the least of its problems)
On Thu, Nov 4, 2021 at 10:43 AM Christoph Hellwig <hch@infradead.org> wrote: > > On Thu, Nov 04, 2021 at 09:24:15AM -0700, Dan Williams wrote: > > No, the big difference with every other modern storage device is > > access to byte-addressable storage. Storage devices get to "cheat" > > with guaranteed minimum 512-byte accesses. So you can arrange for > > writes to always be large enough to scrub the ECC bits along with the > > data. For PMEM and byte-granularity DAX accesses the "sector size" is > > a cacheline and it needed a new CPU instruction before software could > > atomically update data + ECC. Otherwise, with sub-cacheline accesses, > > a RMW cycle can't always be avoided. Such a cycle pulls poison from > > the device on the read and pushes it back out to the media on the > > cacheline writeback. > > Indeed. The fake byte addressability is indeed the problem, and the > fix is to not do that, at least on the second attempt. Fair enough. > > I don't understand what overprovisioning has to do with better error > > management? No other storage device has seen fit to be as transparent > > with communicating the error list and offering ways to proactively > > scrub it. Dave and Darrick rightly saw this and said "hey, the FS > > could do a much better job for the user if it knew about this error > > list". So I don't get what this argument about spare blocks has to do > > with what XFS wants? I.e. an rmap facility to communicate files that > > have been clobbered by cosmic rays and other calamities. > > Well, the answer for other interfaces (at least at the gold plated > cost option) is so strong internal CRCs that user visible bits clobbered > by cosmic rays don't realisticly happen. But it is a problem with the > cheaper ones, and at least SCSI and NVMe offer the error list through > the Get LBA status command (and I bet ATA too, but I haven't looked into > that). Oddly enough there has never been much interested from the > fs community for those. It seems the entanglement with 'struct page', error handling, and reflink made people take notice. Hopefully someone could follow the same plumbing we're doing for pmem to offer error-rmap help for NVME badblocks.
On Thu, Nov 04, 2021 at 10:43:23AM -0700, Christoph Hellwig wrote: > Well, the answer for other interfaces (at least at the gold plated > cost option) is so strong internal CRCs that user visible bits clobbered > by cosmic rays don't realisticly happen. But it is a problem with the > cheaper ones, and at least SCSI and NVMe offer the error list through > the Get LBA status command (and I bet ATA too, but I haven't looked into > that). Oddly enough there has never been much interested from the > fs community for those. "don't realistically happen" is different when you're talking about "doesn't happen within the warranty period of my laptop's SSD" and "doesn't happen on my fleet of 10k servers before they're taken out of service". There's also a big difference in speeds between an NVMe drive (7GB/s) and a memory device (20-50GB/s). The UBER being talked about when I was still at Intel was similar to / slightly better than DRAM, but that's still several failures per year across an entire data centre that's using pmem flat-out.
Thanks for the enlightening discussion here, it's so helpful! Please allow me to recap what I've caught up on so far - 1. recovery write at page boundary due to NP setting in poisoned page to prevent undesirable prefetching 2. single interface to perform 3 tasks: { clear-poison, update error-list, write } such as an API in the pmem driver. For CPUs that support MOVDIR64B, the 'clear-poison' and 'write' tasks can be combined (would need something different from the existing _copy_mcsafe though) and 'update error-list' follows closely behind; For CPUs that rely on a firmware call to clear poison, the existing pmem_clear_poison() can be used, followed by the 'write' task. 3. if the user doesn't supply the RWF_RECOVERY_DATA flag, then dax recovery would be automatic for a write if the range is page aligned; otherwise, the write fails with EIO as usual. Also, the user mustn't have punched out the poisoned page, in which case poison repair would be a lot more complicated. 4. desirable to fetch as much data as possible from a poisoned range. If this understanding is in the right direction, then I'd like to propose the below changes to dax_direct_access(), dax_copy_to/from_iter(), pmem_copy_to/from_iter() and the dm layer copy_to/from_iter, dax_iomap_iter(). 1. dax_iomap_iter() relies on dax_direct_access() to decide whether there is likely a media error: if the API without DAX_F_RECOVERY returns -EIO, then switch to recovery-read/write code. In recovery code, supply DAX_F_RECOVERY to dax_direct_access() in order to obtain 'kaddr', and then call dax_copy_to/from_iter() with DAX_F_RECOVERY. 2. the _copy_to/from_iter implementation would be largely the same as in my recent patch, but some changes in Christoph's 'dax-devirtualize' may be kept, such as DAX_F_VIRTUAL; obviously virtual devices don't have the ability to clear poison, so there is no need to complicate them. And this also means that not every endpoint dax device has to provide dax_op.copy_to/from_iter, they may use the default. I'm not sure about nova and others: if they use a different 'write' path other than iomap, does that mean there will be a need for a new set of dax_op for their read/write? The 3-in-1 binding would always be required though. Maybe that'll be an ongoing discussion? Comments? Suggestions? Thanks! -jane
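A rough sketch of the control flow proposed in item 1 above, with the caveat that the DAX_F_RECOVERY flag and the extra 'flags' arguments on dax_direct_access()/dax_copy_from_iter() are assumptions about the not-yet-posted patches rather than existing interfaces:

/*
 * Illustrative only: assumes the proposed DAX_F_RECOVERY flag and that
 * dax_direct_access()/dax_copy_from_iter() grow a flags argument.
 */
static ssize_t dax_write_range(struct dax_device *dax_dev, pgoff_t pgoff,
                               long nr_pages, size_t bytes, struct iov_iter *iter)
{
        unsigned int flags = 0;
        void *kaddr;
        long rc;

        rc = dax_direct_access(dax_dev, pgoff, nr_pages, flags, &kaddr, NULL);
        if (rc == -EIO) {
                /* likely a media error in this range: retry in recovery mode */
                flags = DAX_F_RECOVERY;
                rc = dax_direct_access(dax_dev, pgoff, nr_pages, flags, &kaddr, NULL);
        }
        if (rc < 0)
                return rc;

        /*
         * With DAX_F_RECOVERY set, the driver's copy_from_iter is expected to
         * clear poison and update its badblocks list before storing new data.
         */
        return dax_copy_from_iter(dax_dev, pgoff, kaddr, bytes, iter, flags);
}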
On Thu, Nov 4, 2021 at 11:34 AM Jane Chu <jane.chu@oracle.com> wrote: > > Thanks for the enlightening discussion here, it's so helpful! > > Please allow me to recap what I've caught up so far - > > 1. recovery write at page boundary due to NP setting in poisoned > page to prevent undesirable prefetching > 2. single interface to perform 3 tasks: > { clear-poison, update error-list, write } > such as an API in pmem driver. > For CPUs that support MOVEDIR64B, the 'clear-poison' and 'write' > task can be combined (would need something different from the > existing _copy_mcsafe though) and 'update error-list' follows > closely behind; > For CPUs that rely on firmware call to clear posion, the existing > pmem_clear_poison() can be used, followed by the 'write' task. > 3. if user isn't given RWF_RECOVERY_FLAG flag, then dax recovery > would be automatic for a write if range is page aligned; > otherwise, the write fails with EIO as usual. > Also, user mustn't have punched out the poisoned page in which > case poison repairing will be a lot more complicated. > 4. desirable to fetch as much data as possible from a poisoned range. > > If this understanding is in the right direction, then I'd like to > propose below changes to > dax_direct_access(), dax_copy_to/from_iter(), pmem_copy_to/from_iter() > and the dm layer copy_to/from_iter, dax_iomap_iter(). > > 1. dax_iomap_iter() rely on dax_direct_access() to decide whether there > is likely media error: if the API without DAX_F_RECOVERY returns > -EIO, then switch to recovery-read/write code. In recovery code, > supply DAX_F_RECOVERY to dax_direct_access() in order to obtain > 'kaddr', and then call dax_copy_to/from_iter() with DAX_F_RECOVERY. I like it. It allows for an atomic write+clear implementation on capable platforms and coordinates with potentially unmapped pages. The best of both worlds from the dax_clear_poison() proposal and my "take a fault and do a slow-path copy". > 2. the _copy_to/from_iter implementation would be largely the same > as in my recent patch, but some changes in Christoph's > 'dax-devirtualize' maybe kept, such as DAX_F_VIRTUAL, obviously > virtual devices don't have the ability to clear poison, so no need > to complicate them. And this also means that not every endpoint > dax device has to provide dax_op.copy_to/from_iter, they may use the > default. Did I miss this series or are you talking about this one? https://lore.kernel.org/all/20211018044054.1779424-1-hch@lst.de/ > I'm not sure about nova and others, if they use different 'write' other > than via iomap, does that mean there will be need for a new set of > dax_op for their read/write? No, they're out-of-tree they'll adjust to the same interface that xfs and ext4 are using when/if they go upstream. > the 3-in-1 binding would always be > required though. Maybe that'll be an ongoing discussion? Yeah, let's cross that bridge when we come to it. > Comments? Suggestions? It sounds great to me!
On 11/4/2021 12:00 PM, Dan Williams wrote: >> >> If this understanding is in the right direction, then I'd like to >> propose below changes to >> dax_direct_access(), dax_copy_to/from_iter(), pmem_copy_to/from_iter() >> and the dm layer copy_to/from_iter, dax_iomap_iter(). >> >> 1. dax_iomap_iter() rely on dax_direct_access() to decide whether there >> is likely media error: if the API without DAX_F_RECOVERY returns >> -EIO, then switch to recovery-read/write code. In recovery code, >> supply DAX_F_RECOVERY to dax_direct_access() in order to obtain >> 'kaddr', and then call dax_copy_to/from_iter() with DAX_F_RECOVERY. > > I like it. It allows for an atomic write+clear implementation on > capable platforms and coordinates with potentially unmapped pages. The > best of both worlds from the dax_clear_poison() proposal and my "take > a fault and do a slow-path copy". > >> 2. the _copy_to/from_iter implementation would be largely the same >> as in my recent patch, but some changes in Christoph's >> 'dax-devirtualize' maybe kept, such as DAX_F_VIRTUAL, obviously >> virtual devices don't have the ability to clear poison, so no need >> to complicate them. And this also means that not every endpoint >> dax device has to provide dax_op.copy_to/from_iter, they may use the >> default. > > Did I miss this series or are you talking about this one? > https://lore.kernel.org/all/20211018044054.1779424-1-hch@lst.de/ I was referring to http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dax-devirtualize that has not come out yet, I said early on that I'll rebase on it, but looks like we still need pmem_copy_to/from_iter(), so. > >> I'm not sure about nova and others, if they use different 'write' other >> than via iomap, does that mean there will be need for a new set of >> dax_op for their read/write? > > No, they're out-of-tree they'll adjust to the same interface that xfs > and ext4 are using when/if they go upstream. > >> the 3-in-1 binding would always be >> required though. Maybe that'll be an ongoing discussion? > > Yeah, let's cross that bridge when we come to it. > >> Comments? Suggestions? > > It sounds great to me! > Thanks! I'll send out an updated patchset when it's ready. -jane
On Thu, Nov 4, 2021 at 1:27 PM Jane Chu <jane.chu@oracle.com> wrote: > > On 11/4/2021 12:00 PM, Dan Williams wrote: > > >> > >> If this understanding is in the right direction, then I'd like to > >> propose below changes to > >> dax_direct_access(), dax_copy_to/from_iter(), pmem_copy_to/from_iter() > >> and the dm layer copy_to/from_iter, dax_iomap_iter(). > >> > >> 1. dax_iomap_iter() rely on dax_direct_access() to decide whether there > >> is likely media error: if the API without DAX_F_RECOVERY returns > >> -EIO, then switch to recovery-read/write code. In recovery code, > >> supply DAX_F_RECOVERY to dax_direct_access() in order to obtain > >> 'kaddr', and then call dax_copy_to/from_iter() with DAX_F_RECOVERY. > > > > I like it. It allows for an atomic write+clear implementation on > > capable platforms and coordinates with potentially unmapped pages. The > > best of both worlds from the dax_clear_poison() proposal and my "take > > a fault and do a slow-path copy". > > > >> 2. the _copy_to/from_iter implementation would be largely the same > >> as in my recent patch, but some changes in Christoph's > >> 'dax-devirtualize' maybe kept, such as DAX_F_VIRTUAL, obviously > >> virtual devices don't have the ability to clear poison, so no need > >> to complicate them. And this also means that not every endpoint > >> dax device has to provide dax_op.copy_to/from_iter, they may use the > >> default. > > > > Did I miss this series or are you talking about this one? > > https://lore.kernel.org/all/20211018044054.1779424-1-hch@lst.de/ > > I was referring to > > http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dax-devirtualize > that has not come out yet, I said early on that I'll rebase on it, > but looks like we still need pmem_copy_to/from_iter(), so. Yeah, since the block-layer divorce gets rid of the old poison clearing path, then we're back to pmem_copy_to_iter() (or something like it) needing to pick up the slack for poison clearing. I do agree it would be nice to clean up all the unnecessary boilerplate, but the error-list coordination requires a driver specific callback. At least the DAX_F_VIRTUAL flag can eliminate the virtiofs and fuse callbacks.
On Thu, Nov 4, 2021 at 5:46 PM Dan Williams <dan.j.williams@intel.com> wrote: > > On Thu, Nov 4, 2021 at 1:27 PM Jane Chu <jane.chu@oracle.com> wrote: > > > > On 11/4/2021 12:00 PM, Dan Williams wrote: > > > > >> > > >> If this understanding is in the right direction, then I'd like to > > >> propose below changes to > > >> dax_direct_access(), dax_copy_to/from_iter(), pmem_copy_to/from_iter() > > >> and the dm layer copy_to/from_iter, dax_iomap_iter(). > > >> > > >> 1. dax_iomap_iter() rely on dax_direct_access() to decide whether there > > >> is likely media error: if the API without DAX_F_RECOVERY returns > > >> -EIO, then switch to recovery-read/write code. In recovery code, > > >> supply DAX_F_RECOVERY to dax_direct_access() in order to obtain > > >> 'kaddr', and then call dax_copy_to/from_iter() with DAX_F_RECOVERY. > > > > > > I like it. It allows for an atomic write+clear implementation on > > > capable platforms and coordinates with potentially unmapped pages. The > > > best of both worlds from the dax_clear_poison() proposal and my "take > > > a fault and do a slow-path copy". > > > > > >> 2. the _copy_to/from_iter implementation would be largely the same > > >> as in my recent patch, but some changes in Christoph's > > >> 'dax-devirtualize' maybe kept, such as DAX_F_VIRTUAL, obviously > > >> virtual devices don't have the ability to clear poison, so no need > > >> to complicate them. And this also means that not every endpoint > > >> dax device has to provide dax_op.copy_to/from_iter, they may use the > > >> default. > > > > > > Did I miss this series or are you talking about this one? > > > https://lore.kernel.org/all/20211018044054.1779424-1-hch@lst.de/ > > > > I was referring to > > > > http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dax-devirtualize > > that has not come out yet, I said early on that I'll rebase on it, > > but looks like we still need pmem_copy_to/from_iter(), so. > > Yeah, since the block-layer divorce gets rid of the old poison > clearing path, then we're back to pmem_copy_to_iter() (or something > like it) needing to pick up the slack for poison clearing. I do agree > it would be nice to clean up all the unnecessary boilerplate, but the > error-list coordination requires a driver specific callback. At least > the DAX_F_VIRTUAL flag can eliminate the virtiofs and fuse callbacks. Although, if error management is generically implemented relative to a 'struct dax_device' then there wouldn't be a need to call all the way back into the driver, and the cleanup could still move forward.
On Thu, Nov 04, 2021 at 12:00:12PM -0700, Dan Williams wrote: > > 1. dax_iomap_iter() rely on dax_direct_access() to decide whether there > > is likely media error: if the API without DAX_F_RECOVERY returns > > -EIO, then switch to recovery-read/write code. In recovery code, > > supply DAX_F_RECOVERY to dax_direct_access() in order to obtain > > 'kaddr', and then call dax_copy_to/from_iter() with DAX_F_RECOVERY. > > I like it. It allows for an atomic write+clear implementation on > capable platforms and coordinates with potentially unmapped pages. The > best of both worlds from the dax_clear_poison() proposal and my "take > a fault and do a slow-path copy". Fine with me as well. > > > 2. the _copy_to/from_iter implementation would be largely the same > > as in my recent patch, but some changes in Christoph's > > 'dax-devirtualize' maybe kept, such as DAX_F_VIRTUAL, obviously > > virtual devices don't have the ability to clear poison, so no need > > to complicate them. And this also means that not every endpoint > > dax device has to provide dax_op.copy_to/from_iter, they may use the > > default. > > Did I miss this series or are you talking about this one? > https://lore.kernel.org/all/20211018044054.1779424-1-hch@lst.de/ Yes. This is an early RFC, but I plan to finish this up and submit it after the updated decouple series. > > > I'm not sure about nova and others, if they use different 'write' other > > than via iomap, does that mean there will be need for a new set of > > dax_op for their read/write? > > No, they're out-of-tree they'll adjust to the same interface that xfs > and ext4 are using when/if they go upstream. Yepp.
On Tue, 2 Nov 2021 09:03:55 -0700 Dan Williams <dan.j.williams@intel.com> wrote: > On Tue, Oct 26, 2021 at 11:50 PM Christoph Hellwig <hch@infradead.org> wrote: > > > > On Fri, Oct 22, 2021 at 08:52:55PM +0000, Jane Chu wrote: > > > Thanks - I try to be honest. As far as I can tell, the argument > > > about the flag is a philosophical argument between two views. > > > One view assumes design based on perfect hardware, and media error > > > belongs to the category of brokenness. Another view sees media > > > error as a build-in hardware component and make design to include > > > dealing with such errors. > > > > No, I don't think so. Bit errors do happen in all media, which is > > why devices are built to handle them. It is just the Intel-style > > pmem interface to handle them which is completely broken. > > No, any media can report checksum / parity errors. NVME also seems to > do a poor job with multi-bit ECC errors consumed from DRAM. There is > nothing "pmem" or "Intel" specific here. > > > > errors in mind from start. I guess I'm trying to articulate why > > > it is acceptable to include the RWF_DATA_RECOVERY flag to the > > > existing RWF_ flags. - this way, pwritev2 remain fast on fast path, > > > and its slow path (w/ error clearing) is faster than other alternative. > > > Other alternative being 1 system call to clear the poison, and > > > another system call to run the fast pwrite for recovery, what > > > happens if something happened in between? > > > > Well, my point is doing recovery from bit errors is by definition not > > the fast path. Which is why I'd rather keep it away from the pmem > > read/write fast path, which also happens to be the (much more important) > > non-pmem read/write path. > > I would expect this interface to be useful outside of pmem as a > "failfast" or "try harder to recover" flag for reading over media > errors. Yeah, I think this flag could also be useful for non-raid btrfs. If you have an extent that is shared between multiple snapshots and its data is corrupted (without the disk returning an i/o error), btrfs won't be able to fix the corruption without raid and will always return an i/o error when accessing the affected range (due to checksum mismatch). Of course you could just overwrite the range in the file with good data, but that would only fix the file you are operating on; snapshots will still reference the corrupted data. With this flag, a read could just return the corrupted data without an i/o error and a write could write directly to the on-disk data to fix up the corruption everywhere. btrfs could also check that the newly written data actually matches the checksum. However, in this btrfs usecase the process still needs to be CAP_SYS_ADMIN or similar, since it's easy to create collisions for crc32 and so an attacker could write to a file that he has no permissions for, if that file shares an extent with one where he has write permissions. Regards, Lukas Straub --