Message ID | 20200218214841.10076-3-vgoyal@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | dax/pmem: Provide a dax operation to zero range of memory | expand |
On Tue, Feb 18, 2020 at 04:48:35PM -0500, Vivek Goyal wrote: > Currently pmem_clear_poison() expects offset and len to be sector aligned. > Atleast that seems to be the assumption with which code has been written. > It is called only from pmem_do_bvec() which is called only from pmem_rw_page() > and pmem_make_request() which will only passe sector aligned offset and len. > > Soon we want use this function from dax_zero_page_range() code path which > can try to zero arbitrary range of memory with-in a page. So update this > function to assume that offset and length can be arbitrary and do the > necessary alignments as needed. > > nvdimm_clear_poison() seems to assume offset and len to be aligned to > clear_err_unit boundary. But this is currently internal detail and is > not exported for others to use. So for now, continue to align offset and > length to SECTOR_SIZE boundary. Improving it further and to align it > to clear_err_unit boundary is a TODO item for future. > > Signed-off-by: Vivek Goyal <vgoyal@redhat.com> This looks sensibel to me, but I'd really like to have Dan take at look.
Vivek Goyal <vgoyal@redhat.com> writes: > Currently pmem_clear_poison() expects offset and len to be sector aligned. > Atleast that seems to be the assumption with which code has been written. > It is called only from pmem_do_bvec() which is called only from pmem_rw_page() > and pmem_make_request() which will only passe sector aligned offset and len. > > Soon we want use this function from dax_zero_page_range() code path which > can try to zero arbitrary range of memory with-in a page. So update this > function to assume that offset and length can be arbitrary and do the > necessary alignments as needed. What caller will try to zero a range that is smaller than a sector? > nvdimm_clear_poison() seems to assume offset and len to be aligned to > clear_err_unit boundary. But this is currently internal detail and is > not exported for others to use. So for now, continue to align offset and > length to SECTOR_SIZE boundary. Improving it further and to align it > to clear_err_unit boundary is a TODO item for future. When there is a poisoned range of persistent memory, it is recorded by the badblocks infrastructure, which currently operates on sectors. So, no matter what the error unit is for the hardware, we currently can't record/report to userspace anything smaller than a sector, and so that is what we expect when clearing errors. Continuing on for completeness, we will currently not map a page with badblocks into a process' address space. So, let's say you have 256 bytes of bad pmem, we will tell you we've lost 512 bytes, and even if you access a valid mmap()d address in the same page as the poisoned memory, you will get a segfault. Userspace can fix up the error by calling write(2) and friends to provide new data, or by punching a hole and writing new data to the hole (which may result in getting a new block, or reallocating the old block and zeroing it, which will clear the error). More comments below... > Signed-off-by: Vivek Goyal <vgoyal@redhat.com> > --- > drivers/nvdimm/pmem.c | 22 ++++++++++++++++++---- > 1 file changed, 18 insertions(+), 4 deletions(-) > > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c > index 075b11682192..e72959203253 100644 > --- a/drivers/nvdimm/pmem.c > +++ b/drivers/nvdimm/pmem.c > @@ -74,14 +74,28 @@ static blk_status_t pmem_clear_poison(struct pmem_device *pmem, > sector_t sector; > long cleared; > blk_status_t rc = BLK_STS_OK; > + phys_addr_t start_aligned, end_aligned; > + unsigned int len_aligned; > > - sector = (offset - pmem->data_offset) / 512; > + /* > + * Callers can pass arbitrary offset and len. But nvdimm_clear_poison() > + * expects memory offset and length to meet certain alignment > + * restrction (clear_err_unit). Currently nvdimm does not export ^^^^^^^^^^^^^^^^^^^^^^ > + * required alignment. So align offset and length to sector boundary What is "nvdimm" in that sentence? Because the nvdimm most certainly does export the required alignment. Perhaps you meant libnvdimm? > + * before passing it to nvdimm_clear_poison(). > + */ > + start_aligned = ALIGN(offset, SECTOR_SIZE); > + end_aligned = ALIGN_DOWN((offset + len), SECTOR_SIZE) - 1; > + len_aligned = end_aligned - start_aligned + 1; > + > + sector = (start_aligned - pmem->data_offset) / 512; > > - cleared = nvdimm_clear_poison(dev, pmem->phys_addr + offset, len); > - if (cleared < len) > + cleared = nvdimm_clear_poison(dev, pmem->phys_addr + start_aligned, > + len_aligned); > + if (cleared < len_aligned) > rc = BLK_STS_IOERR; > if (cleared > 0 && cleared / 512) { > - hwpoison_clear(pmem, pmem->phys_addr + offset, cleared); > + hwpoison_clear(pmem, pmem->phys_addr + start_aligned, cleared); > cleared /= 512; > dev_dbg(dev, "%#llx clear %ld sector%s\n", > (unsigned long long) sector, cleared, We could potentially support clearing less than a sector, but I'd have to understand the use cases better before offerring implementation suggestions. -Jeff
On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote: > Vivek Goyal <vgoyal@redhat.com> writes: > > > Currently pmem_clear_poison() expects offset and len to be sector aligned. > > Atleast that seems to be the assumption with which code has been written. > > It is called only from pmem_do_bvec() which is called only from pmem_rw_page() > > and pmem_make_request() which will only passe sector aligned offset and len. > > > > Soon we want use this function from dax_zero_page_range() code path which > > can try to zero arbitrary range of memory with-in a page. So update this > > function to assume that offset and length can be arbitrary and do the > > necessary alignments as needed. > > What caller will try to zero a range that is smaller than a sector? Hi Jeff, New dax zeroing interface (dax_zero_page_range()) can technically pass a range which is less than a sector. Or which is bigger than a sector but start and end are not aligned on sector boundaries. At this point of time, all I care about is that case of an arbitrary range is handeled well. So if a caller passes a range in, we figure out subrange which is sector aligned in terms of start and end, and clear poison on those sectors and ignore rest of the range. And this itself will be an improvement over current behavior where nothing is cleared if I/O is not sector aligned. > > > nvdimm_clear_poison() seems to assume offset and len to be aligned to > > clear_err_unit boundary. But this is currently internal detail and is > > not exported for others to use. So for now, continue to align offset and > > length to SECTOR_SIZE boundary. Improving it further and to align it > > to clear_err_unit boundary is a TODO item for future. > > When there is a poisoned range of persistent memory, it is recorded by > the badblocks infrastructure, which currently operates on sectors. So, > no matter what the error unit is for the hardware, we currently can't > record/report to userspace anything smaller than a sector, and so that > is what we expect when clearing errors. > > Continuing on for completeness, we will currently not map a page with > badblocks into a process' address space. So, let's say you have 256 > bytes of bad pmem, we will tell you we've lost 512 bytes, and even if > you access a valid mmap()d address in the same page as the poisoned > memory, you will get a segfault. > > Userspace can fix up the error by calling write(2) and friends to > provide new data, or by punching a hole and writing new data to the hole > (which may result in getting a new block, or reallocating the old block > and zeroing it, which will clear the error). Fair enough. I do not need poison clearing at finer granularity. It might be needed once dev_dax path wants to clear poison. Not sure how exactly that works. > > More comments below... > > > Signed-off-by: Vivek Goyal <vgoyal@redhat.com> > > --- > > drivers/nvdimm/pmem.c | 22 ++++++++++++++++++---- > > 1 file changed, 18 insertions(+), 4 deletions(-) > > > > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c > > index 075b11682192..e72959203253 100644 > > --- a/drivers/nvdimm/pmem.c > > +++ b/drivers/nvdimm/pmem.c > > @@ -74,14 +74,28 @@ static blk_status_t pmem_clear_poison(struct pmem_device *pmem, > > sector_t sector; > > long cleared; > > blk_status_t rc = BLK_STS_OK; > > + phys_addr_t start_aligned, end_aligned; > > + unsigned int len_aligned; > > > > - sector = (offset - pmem->data_offset) / 512; > > + /* > > + * Callers can pass arbitrary offset and len. But nvdimm_clear_poison() > > + * expects memory offset and length to meet certain alignment > > + * restrction (clear_err_unit). Currently nvdimm does not export > ^^^^^^^^^^^^^^^^^^^^^^ > > + * required alignment. So align offset and length to sector boundary > > What is "nvdimm" in that sentence? Because the nvdimm most certainly > does export the required alignment. Perhaps you meant libnvdimm? I meant nvdimm_clear_poison() function in drivers/nvdimm/bus.c. Whatever it is called. It first queries alignement required (clear_err_unit) and then makes sure range passed in meets that alignment requirement. > > > + * before passing it to nvdimm_clear_poison(). > > + */ > > + start_aligned = ALIGN(offset, SECTOR_SIZE); > > + end_aligned = ALIGN_DOWN((offset + len), SECTOR_SIZE) - 1; > > + len_aligned = end_aligned - start_aligned + 1; > > + > > + sector = (start_aligned - pmem->data_offset) / 512; > > > > - cleared = nvdimm_clear_poison(dev, pmem->phys_addr + offset, len); > > - if (cleared < len) > > + cleared = nvdimm_clear_poison(dev, pmem->phys_addr + start_aligned, > > + len_aligned); > > + if (cleared < len_aligned) > > rc = BLK_STS_IOERR; > > if (cleared > 0 && cleared / 512) { > > - hwpoison_clear(pmem, pmem->phys_addr + offset, cleared); > > + hwpoison_clear(pmem, pmem->phys_addr + start_aligned, cleared); > > cleared /= 512; > > dev_dbg(dev, "%#llx clear %ld sector%s\n", > > (unsigned long long) sector, cleared, > > We could potentially support clearing less than a sector, but I'd have > to understand the use cases better before offerring implementation > suggestions. I don't need clearing less than a secotr. Once somebody needs it they can implement it. All I am doing is making sure current logic is not broken when dax_zero_page_range() starts using this logic and passes an arbitrary range. We need to make sure we internally align I/O and carve out an aligned sub-range and pass that subrange to nvdimm_clear_poison(). So if you can make sure I am not breaking things and new interface will continue to clear poison on sector boundary, that will be great. Thanks Vivek
Vivek Goyal <vgoyal@redhat.com> writes: > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote: >> Vivek Goyal <vgoyal@redhat.com> writes: >> >> > Currently pmem_clear_poison() expects offset and len to be sector aligned. >> > Atleast that seems to be the assumption with which code has been written. >> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page() >> > and pmem_make_request() which will only passe sector aligned offset and len. >> > >> > Soon we want use this function from dax_zero_page_range() code path which >> > can try to zero arbitrary range of memory with-in a page. So update this >> > function to assume that offset and length can be arbitrary and do the >> > necessary alignments as needed. >> >> What caller will try to zero a range that is smaller than a sector? > > Hi Jeff, > > New dax zeroing interface (dax_zero_page_range()) can technically pass > a range which is less than a sector. Or which is bigger than a sector > but start and end are not aligned on sector boundaries. Sure, but who will call it with misaligned ranges? > At this point of time, all I care about is that case of an arbitrary > range is handeled well. So if a caller passes a range in, we figure > out subrange which is sector aligned in terms of start and end, and > clear poison on those sectors and ignore rest of the range. And > this itself will be an improvement over current behavior where > nothing is cleared if I/O is not sector aligned. I don't think this makes sense. The caller needs to know about the blast radius of errors. This is why I asked for a concrete example. It might make more sense, for example, to return an error if not all of the errors could be cleared. >> > nvdimm_clear_poison() seems to assume offset and len to be aligned to >> > clear_err_unit boundary. But this is currently internal detail and is >> > not exported for others to use. So for now, continue to align offset and >> > length to SECTOR_SIZE boundary. Improving it further and to align it >> > to clear_err_unit boundary is a TODO item for future. >> >> When there is a poisoned range of persistent memory, it is recorded by >> the badblocks infrastructure, which currently operates on sectors. So, >> no matter what the error unit is for the hardware, we currently can't >> record/report to userspace anything smaller than a sector, and so that >> is what we expect when clearing errors. >> >> Continuing on for completeness, we will currently not map a page with >> badblocks into a process' address space. So, let's say you have 256 >> bytes of bad pmem, we will tell you we've lost 512 bytes, and even if >> you access a valid mmap()d address in the same page as the poisoned >> memory, you will get a segfault. >> >> Userspace can fix up the error by calling write(2) and friends to >> provide new data, or by punching a hole and writing new data to the hole >> (which may result in getting a new block, or reallocating the old block >> and zeroing it, which will clear the error). > > Fair enough. I do not need poison clearing at finer granularity. It might > be needed once dev_dax path wants to clear poison. Not sure how exactly > that works. It doesn't. :) >> > + /* >> > + * Callers can pass arbitrary offset and len. But nvdimm_clear_poison() >> > + * expects memory offset and length to meet certain alignment >> > + * restrction (clear_err_unit). Currently nvdimm does not export >> ^^^^^^^^^^^^^^^^^^^^^^ >> > + * required alignment. So align offset and length to sector boundary >> >> What is "nvdimm" in that sentence? Because the nvdimm most certainly >> does export the required alignment. Perhaps you meant libnvdimm? > > I meant nvdimm_clear_poison() function in drivers/nvdimm/bus.c. Whatever > it is called. It first queries alignement required (clear_err_unit) and > then makes sure range passed in meets that alignment requirement. My point was your comment is misleading. >> We could potentially support clearing less than a sector, but I'd have >> to understand the use cases better before offerring implementation >> suggestions. > > I don't need clearing less than a secotr. Once somebody needs it they > can implement it. All I am doing is making sure current logic is not > broken when dax_zero_page_range() starts using this logic and passes > an arbitrary range. We need to make sure we internally align I/O An arbitrary range is the same thing as less than a sector. :) Do you know of an instance where the range will not be sector-aligned and sized? > and carve out an aligned sub-range and pass that subrange to > nvdimm_clear_poison(). And what happens to the rest? The caller is left to trip over the errors? That sounds pretty terrible. I really think there needs to be an explicit contract here. > So if you can make sure I am not breaking things and new interface > will continue to clear poison on sector boundary, that will be great. I think allowing arbitrary ranges /could/ break things. How it breaks things depends on what the caller is doing. If ther eare no callers using the interface in this way, then I see no need to relax the restriction. I do think we could document it better. -Jeff
On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote: > Vivek Goyal <vgoyal@redhat.com> writes: > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote: > >> Vivek Goyal <vgoyal@redhat.com> writes: > >> > >> > Currently pmem_clear_poison() expects offset and len to be sector aligned. > >> > Atleast that seems to be the assumption with which code has been written. > >> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page() > >> > and pmem_make_request() which will only passe sector aligned offset and len. > >> > > >> > Soon we want use this function from dax_zero_page_range() code path which > >> > can try to zero arbitrary range of memory with-in a page. So update this > >> > function to assume that offset and length can be arbitrary and do the > >> > necessary alignments as needed. > >> > >> What caller will try to zero a range that is smaller than a sector? > > > > Hi Jeff, > > > > New dax zeroing interface (dax_zero_page_range()) can technically pass > > a range which is less than a sector. Or which is bigger than a sector > > but start and end are not aligned on sector boundaries. > > Sure, but who will call it with misaligned ranges? create a file foo.txt of size 4K and then truncate it. "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to 4095. I have also written a test for this. https://github.com/rhvgoyal/misc/blob/master/pmem-tests/iomap-range-test.sh#L102 > > > At this point of time, all I care about is that case of an arbitrary > > range is handeled well. So if a caller passes a range in, we figure > > out subrange which is sector aligned in terms of start and end, and > > clear poison on those sectors and ignore rest of the range. And > > this itself will be an improvement over current behavior where > > nothing is cleared if I/O is not sector aligned. > > I don't think this makes sense. The caller needs to know about the > blast radius of errors. This is why I asked for a concrete example. > It might make more sense, for example, to return an error if not all of > the errors could be cleared. > > >> > nvdimm_clear_poison() seems to assume offset and len to be aligned to > >> > clear_err_unit boundary. But this is currently internal detail and is > >> > not exported for others to use. So for now, continue to align offset and > >> > length to SECTOR_SIZE boundary. Improving it further and to align it > >> > to clear_err_unit boundary is a TODO item for future. > >> > >> When there is a poisoned range of persistent memory, it is recorded by > >> the badblocks infrastructure, which currently operates on sectors. So, > >> no matter what the error unit is for the hardware, we currently can't > >> record/report to userspace anything smaller than a sector, and so that > >> is what we expect when clearing errors. > >> > >> Continuing on for completeness, we will currently not map a page with > >> badblocks into a process' address space. So, let's say you have 256 > >> bytes of bad pmem, we will tell you we've lost 512 bytes, and even if > >> you access a valid mmap()d address in the same page as the poisoned > >> memory, you will get a segfault. > >> > >> Userspace can fix up the error by calling write(2) and friends to > >> provide new data, or by punching a hole and writing new data to the hole > >> (which may result in getting a new block, or reallocating the old block > >> and zeroing it, which will clear the error). > > > > Fair enough. I do not need poison clearing at finer granularity. It might > > be needed once dev_dax path wants to clear poison. Not sure how exactly > > that works. > > It doesn't. :) > > >> > + /* > >> > + * Callers can pass arbitrary offset and len. But nvdimm_clear_poison() > >> > + * expects memory offset and length to meet certain alignment > >> > + * restrction (clear_err_unit). Currently nvdimm does not export > >> ^^^^^^^^^^^^^^^^^^^^^^ > >> > + * required alignment. So align offset and length to sector boundary > >> > >> What is "nvdimm" in that sentence? Because the nvdimm most certainly > >> does export the required alignment. Perhaps you meant libnvdimm? > > > > I meant nvdimm_clear_poison() function in drivers/nvdimm/bus.c. Whatever > > it is called. It first queries alignement required (clear_err_unit) and > > then makes sure range passed in meets that alignment requirement. > > My point was your comment is misleading. > > >> We could potentially support clearing less than a sector, but I'd have > >> to understand the use cases better before offerring implementation > >> suggestions. > > > > I don't need clearing less than a secotr. Once somebody needs it they > > can implement it. All I am doing is making sure current logic is not > > broken when dax_zero_page_range() starts using this logic and passes > > an arbitrary range. We need to make sure we internally align I/O > > An arbitrary range is the same thing as less than a sector. :) Do you > know of an instance where the range will not be sector-aligned and sized? > > > and carve out an aligned sub-range and pass that subrange to > > nvdimm_clear_poison(). > > And what happens to the rest? The caller is left to trip over the > errors? That sounds pretty terrible. I really think there needs to be > an explicit contract here. Ok, I think is is the contentious bit. Current interface (__dax_zero_page_range()) either clears the poison (if I/O is aligned to sector) or expects page to be free of poison. So in above example, of "truncate -s 23 foo.txt", currently I get an error because range being zeroed is not sector aligned. So __dax_zero_page_range() falls back to calling direct_access(). Which fails because there are poisoned sectors in the page. With my patches, dax_zero_page_range(), clears the poison from sector 1 to 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511 and returns success. So question is, is this better behavior or worse behavior. If sector 0 was poisoned, it will continue to remain poisoned and caller will come to know about it on next read and then it should try to truncate file to length 0 or unlink file or restore that file to get rid of poison. IOW, if a partial block is being zeroed and if it is poisoned, caller will not be return an error and poison will not be cleared and memory will be zeroed. What do we expect in such cases. Do we expect an interface where if there are any bad blocks in the range being zeroed, then they all should be cleared (and hence all I/O should be aligned) otherwise error is returned. If yes, I could make that change. Downside of current interface is that it will clear as many blocks as possible in the given range and leave starting and end blocks poisoned (if it is unaligned) and not return error. That means a reader will get error on these blocks again and they will have to try to clear it again. Thanks Vivek
On Fri, Feb 21, 2020 at 12:18 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote: > > Vivek Goyal <vgoyal@redhat.com> writes: > > > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote: > > >> Vivek Goyal <vgoyal@redhat.com> writes: > > >> > > >> > Currently pmem_clear_poison() expects offset and len to be sector aligned. > > >> > Atleast that seems to be the assumption with which code has been written. > > >> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page() > > >> > and pmem_make_request() which will only passe sector aligned offset and len. > > >> > > > >> > Soon we want use this function from dax_zero_page_range() code path which > > >> > can try to zero arbitrary range of memory with-in a page. So update this > > >> > function to assume that offset and length can be arbitrary and do the > > >> > necessary alignments as needed. > > >> > > >> What caller will try to zero a range that is smaller than a sector? > > > > > > Hi Jeff, > > > > > > New dax zeroing interface (dax_zero_page_range()) can technically pass > > > a range which is less than a sector. Or which is bigger than a sector > > > but start and end are not aligned on sector boundaries. > > > > Sure, but who will call it with misaligned ranges? > > create a file foo.txt of size 4K and then truncate it. > > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to > 4095. > > I have also written a test for this. > > https://github.com/rhvgoyal/misc/blob/master/pmem-tests/iomap-range-test.sh#L102 > > > > > > At this point of time, all I care about is that case of an arbitrary > > > range is handeled well. So if a caller passes a range in, we figure > > > out subrange which is sector aligned in terms of start and end, and > > > clear poison on those sectors and ignore rest of the range. And > > > this itself will be an improvement over current behavior where > > > nothing is cleared if I/O is not sector aligned. > > > > I don't think this makes sense. The caller needs to know about the > > blast radius of errors. This is why I asked for a concrete example. > > It might make more sense, for example, to return an error if not all of > > the errors could be cleared. > > > > >> > nvdimm_clear_poison() seems to assume offset and len to be aligned to > > >> > clear_err_unit boundary. But this is currently internal detail and is > > >> > not exported for others to use. So for now, continue to align offset and > > >> > length to SECTOR_SIZE boundary. Improving it further and to align it > > >> > to clear_err_unit boundary is a TODO item for future. > > >> > > >> When there is a poisoned range of persistent memory, it is recorded by > > >> the badblocks infrastructure, which currently operates on sectors. So, > > >> no matter what the error unit is for the hardware, we currently can't > > >> record/report to userspace anything smaller than a sector, and so that > > >> is what we expect when clearing errors. > > >> > > >> Continuing on for completeness, we will currently not map a page with > > >> badblocks into a process' address space. So, let's say you have 256 > > >> bytes of bad pmem, we will tell you we've lost 512 bytes, and even if > > >> you access a valid mmap()d address in the same page as the poisoned > > >> memory, you will get a segfault. > > >> > > >> Userspace can fix up the error by calling write(2) and friends to > > >> provide new data, or by punching a hole and writing new data to the hole > > >> (which may result in getting a new block, or reallocating the old block > > >> and zeroing it, which will clear the error). > > > > > > Fair enough. I do not need poison clearing at finer granularity. It might > > > be needed once dev_dax path wants to clear poison. Not sure how exactly > > > that works. > > > > It doesn't. :) > > > > >> > + /* > > >> > + * Callers can pass arbitrary offset and len. But nvdimm_clear_poison() > > >> > + * expects memory offset and length to meet certain alignment > > >> > + * restrction (clear_err_unit). Currently nvdimm does not export > > >> ^^^^^^^^^^^^^^^^^^^^^^ > > >> > + * required alignment. So align offset and length to sector boundary > > >> > > >> What is "nvdimm" in that sentence? Because the nvdimm most certainly > > >> does export the required alignment. Perhaps you meant libnvdimm? > > > > > > I meant nvdimm_clear_poison() function in drivers/nvdimm/bus.c. Whatever > > > it is called. It first queries alignement required (clear_err_unit) and > > > then makes sure range passed in meets that alignment requirement. > > > > My point was your comment is misleading. > > > > >> We could potentially support clearing less than a sector, but I'd have > > >> to understand the use cases better before offerring implementation > > >> suggestions. > > > > > > I don't need clearing less than a secotr. Once somebody needs it they > > > can implement it. All I am doing is making sure current logic is not > > > broken when dax_zero_page_range() starts using this logic and passes > > > an arbitrary range. We need to make sure we internally align I/O > > > > An arbitrary range is the same thing as less than a sector. :) Do you > > know of an instance where the range will not be sector-aligned and sized? > > > > > and carve out an aligned sub-range and pass that subrange to > > > nvdimm_clear_poison(). > > > > And what happens to the rest? The caller is left to trip over the > > errors? That sounds pretty terrible. I really think there needs to be > > an explicit contract here. > > Ok, I think is is the contentious bit. Current interface > (__dax_zero_page_range()) either clears the poison (if I/O is aligned to > sector) or expects page to be free of poison. > > So in above example, of "truncate -s 23 foo.txt", currently I get an error > because range being zeroed is not sector aligned. So > __dax_zero_page_range() falls back to calling direct_access(). Which > fails because there are poisoned sectors in the page. > > With my patches, dax_zero_page_range(), clears the poison from sector 1 to > 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511 > and returns success. > > So question is, is this better behavior or worse behavior. If sector 0 > was poisoned, it will continue to remain poisoned and caller will come > to know about it on next read and then it should try to truncate file > to length 0 or unlink file or restore that file to get rid of poison. > > IOW, if a partial block is being zeroed and if it is poisoned, caller > will not be return an error and poison will not be cleared and memory > will be zeroed. What do we expect in such cases. > > Do we expect an interface where if there are any bad blocks in the range > being zeroed, then they all should be cleared (and hence all I/O should > be aligned) otherwise error is returned. If yes, I could make that > change. This does not strike me as a good idea because it's a false security compared to the latent poison case. If the writes to an unknown poisoned location would otherwise succeed via a different I/O path (dax), it's an unsymmetric surprise to start returning errors just because you wrote zeroes as a side effect of truncate. > Downside of current interface is that it will clear as many blocks as > possible in the given range and leave starting and end blocks poisoned > (if it is unaligned) and not return error. That means a reader will > get error on these blocks again and they will have to try to clear it > again. I think what you have described in your truncate example is an improvement on what we have currently because x86 does not communicate write errors. Specifically, writing zeros via dax from userspace over unknown poison behaves the same as writing unaligned zeros over known poison. In both cases it's a best effort that always succeeds (no cpu exception), and may inadvertently clear poison as a side-effect. Otherwise, an error-block-aligned hole punch is the only way to trigger the kernel to try to clear known poison when the full block is reallocated. On movdir64b capable cpus the error clearing unit becomes 64-bytes rather than 256-bytes because that allows a cacheline to be written without triggering a line fill read. So the error clearing granularity gets better over time, but unfortunately not synchronous detection in the I/O path. I think a better way to improve poison handling is the long standing idea to integrate the badblock tracking into the filesystem directly. That way driver notifications of poison can be ingested into the filesystem and notifications sent on filenames rather than the current TOCTOU mess of trying to do a reverse lookup of badblock numbers to files. If the application can efficiently list and be notified of poison it can mitigate it immediately rather than trying to rely on write side effects.
On Fri, Feb 21, 2020 at 01:00:29PM -0800, Dan Williams wrote: > On Fri, Feb 21, 2020 at 12:18 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > > > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote: > > > Vivek Goyal <vgoyal@redhat.com> writes: > > > > > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote: > > > >> Vivek Goyal <vgoyal@redhat.com> writes: > > > >> > > > >> > Currently pmem_clear_poison() expects offset and len to be sector aligned. > > > >> > Atleast that seems to be the assumption with which code has been written. > > > >> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page() > > > >> > and pmem_make_request() which will only passe sector aligned offset and len. > > > >> > > > > >> > Soon we want use this function from dax_zero_page_range() code path which > > > >> > can try to zero arbitrary range of memory with-in a page. So update this > > > >> > function to assume that offset and length can be arbitrary and do the > > > >> > necessary alignments as needed. > > > >> > > > >> What caller will try to zero a range that is smaller than a sector? > > > > > > > > Hi Jeff, > > > > > > > > New dax zeroing interface (dax_zero_page_range()) can technically pass > > > > a range which is less than a sector. Or which is bigger than a sector > > > > but start and end are not aligned on sector boundaries. > > > > > > Sure, but who will call it with misaligned ranges? > > > > create a file foo.txt of size 4K and then truncate it. > > > > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to > > 4095. > > > > I have also written a test for this. > > > > https://github.com/rhvgoyal/misc/blob/master/pmem-tests/iomap-range-test.sh#L102 > > > > > > > > > At this point of time, all I care about is that case of an arbitrary > > > > range is handeled well. So if a caller passes a range in, we figure > > > > out subrange which is sector aligned in terms of start and end, and > > > > clear poison on those sectors and ignore rest of the range. And > > > > this itself will be an improvement over current behavior where > > > > nothing is cleared if I/O is not sector aligned. > > > > > > I don't think this makes sense. The caller needs to know about the > > > blast radius of errors. This is why I asked for a concrete example. > > > It might make more sense, for example, to return an error if not all of > > > the errors could be cleared. > > > > > > >> > nvdimm_clear_poison() seems to assume offset and len to be aligned to > > > >> > clear_err_unit boundary. But this is currently internal detail and is > > > >> > not exported for others to use. So for now, continue to align offset and > > > >> > length to SECTOR_SIZE boundary. Improving it further and to align it > > > >> > to clear_err_unit boundary is a TODO item for future. > > > >> > > > >> When there is a poisoned range of persistent memory, it is recorded by > > > >> the badblocks infrastructure, which currently operates on sectors. So, > > > >> no matter what the error unit is for the hardware, we currently can't > > > >> record/report to userspace anything smaller than a sector, and so that > > > >> is what we expect when clearing errors. > > > >> > > > >> Continuing on for completeness, we will currently not map a page with > > > >> badblocks into a process' address space. So, let's say you have 256 > > > >> bytes of bad pmem, we will tell you we've lost 512 bytes, and even if > > > >> you access a valid mmap()d address in the same page as the poisoned > > > >> memory, you will get a segfault. > > > >> > > > >> Userspace can fix up the error by calling write(2) and friends to > > > >> provide new data, or by punching a hole and writing new data to the hole > > > >> (which may result in getting a new block, or reallocating the old block > > > >> and zeroing it, which will clear the error). > > > > > > > > Fair enough. I do not need poison clearing at finer granularity. It might > > > > be needed once dev_dax path wants to clear poison. Not sure how exactly > > > > that works. > > > > > > It doesn't. :) > > > > > > >> > + /* > > > >> > + * Callers can pass arbitrary offset and len. But nvdimm_clear_poison() > > > >> > + * expects memory offset and length to meet certain alignment > > > >> > + * restrction (clear_err_unit). Currently nvdimm does not export > > > >> ^^^^^^^^^^^^^^^^^^^^^^ > > > >> > + * required alignment. So align offset and length to sector boundary > > > >> > > > >> What is "nvdimm" in that sentence? Because the nvdimm most certainly > > > >> does export the required alignment. Perhaps you meant libnvdimm? > > > > > > > > I meant nvdimm_clear_poison() function in drivers/nvdimm/bus.c. Whatever > > > > it is called. It first queries alignement required (clear_err_unit) and > > > > then makes sure range passed in meets that alignment requirement. > > > > > > My point was your comment is misleading. > > > > > > >> We could potentially support clearing less than a sector, but I'd have > > > >> to understand the use cases better before offerring implementation > > > >> suggestions. > > > > > > > > I don't need clearing less than a secotr. Once somebody needs it they > > > > can implement it. All I am doing is making sure current logic is not > > > > broken when dax_zero_page_range() starts using this logic and passes > > > > an arbitrary range. We need to make sure we internally align I/O > > > > > > An arbitrary range is the same thing as less than a sector. :) Do you > > > know of an instance where the range will not be sector-aligned and sized? > > > > > > > and carve out an aligned sub-range and pass that subrange to > > > > nvdimm_clear_poison(). > > > > > > And what happens to the rest? The caller is left to trip over the > > > errors? That sounds pretty terrible. I really think there needs to be > > > an explicit contract here. > > > > Ok, I think is is the contentious bit. Current interface > > (__dax_zero_page_range()) either clears the poison (if I/O is aligned to > > sector) or expects page to be free of poison. > > > > So in above example, of "truncate -s 23 foo.txt", currently I get an error > > because range being zeroed is not sector aligned. So > > __dax_zero_page_range() falls back to calling direct_access(). Which > > fails because there are poisoned sectors in the page. > > > > With my patches, dax_zero_page_range(), clears the poison from sector 1 to > > 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511 > > and returns success. > > > > So question is, is this better behavior or worse behavior. If sector 0 > > was poisoned, it will continue to remain poisoned and caller will come > > to know about it on next read and then it should try to truncate file > > to length 0 or unlink file or restore that file to get rid of poison. > > > > IOW, if a partial block is being zeroed and if it is poisoned, caller > > will not be return an error and poison will not be cleared and memory > > will be zeroed. What do we expect in such cases. > > > > Do we expect an interface where if there are any bad blocks in the range > > being zeroed, then they all should be cleared (and hence all I/O should > > be aligned) otherwise error is returned. If yes, I could make that > > change. > > This does not strike me as a good idea because it's a false security > compared to the latent poison case. If the writes to an unknown > poisoned location would otherwise succeed via a different I/O path > (dax), it's an unsymmetric surprise to start returning errors just > because you wrote zeroes as a side effect of truncate. > > > Downside of current interface is that it will clear as many blocks as > > possible in the given range and leave starting and end blocks poisoned > > (if it is unaligned) and not return error. That means a reader will > > get error on these blocks again and they will have to try to clear it > > again. > > I think what you have described in your truncate example is an > improvement on what we have currently because x86 does not communicate > write errors. Specifically, writing zeros via dax from userspace over > unknown poison behaves the same as writing unaligned zeros over known > poison. In both cases it's a best effort that always succeeds (no cpu > exception), and may inadvertently clear poison as a side-effect. > Otherwise, an error-block-aligned hole punch is the only way to > trigger the kernel to try to clear known poison when the full block is > reallocated. Hi Dan, Agreed. This new interface works uniformly for both known poison and latent poison cases. Existing interface is asymmetric and that means if poison is latent, unaligned zero range will succeed but if poison is known, unaligned zero range will fail. > > On movdir64b capable cpus the error clearing unit becomes 64-bytes > rather than 256-bytes because that allows a cacheline to be written > without triggering a line fill read. So the error clearing granularity > gets better over time, but unfortunately not synchronous detection in > the I/O path. > > I think a better way to improve poison handling is the long standing > idea to integrate the badblock tracking into the filesystem directly. > That way driver notifications of poison can be ingested into the > filesystem and notifications sent on filenames rather than the current > TOCTOU mess of trying to do a reverse lookup of badblock numbers to > files. If the application can efficiently list and be notified of > poison it can mitigate it immediately rather than trying to rely on > write side effects. Moving badblocks infrastructure in filesystem sounds like a major rework which should be taken up in a seprate patch series in future. For now, can we please take these patches which are an improvement over existing interface. Thanks Vivek
On Fri, Feb 21, 2020 at 1:25 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > On Fri, Feb 21, 2020 at 01:00:29PM -0800, Dan Williams wrote: > > On Fri, Feb 21, 2020 at 12:18 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > > > > > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote: > > > > Vivek Goyal <vgoyal@redhat.com> writes: > > > > > > > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote: > > > > >> Vivek Goyal <vgoyal@redhat.com> writes: > > > > >> > > > > >> > Currently pmem_clear_poison() expects offset and len to be sector aligned. > > > > >> > Atleast that seems to be the assumption with which code has been written. > > > > >> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page() > > > > >> > and pmem_make_request() which will only passe sector aligned offset and len. > > > > >> > > > > > >> > Soon we want use this function from dax_zero_page_range() code path which > > > > >> > can try to zero arbitrary range of memory with-in a page. So update this > > > > >> > function to assume that offset and length can be arbitrary and do the > > > > >> > necessary alignments as needed. > > > > >> > > > > >> What caller will try to zero a range that is smaller than a sector? > > > > > > > > > > Hi Jeff, > > > > > > > > > > New dax zeroing interface (dax_zero_page_range()) can technically pass > > > > > a range which is less than a sector. Or which is bigger than a sector > > > > > but start and end are not aligned on sector boundaries. > > > > > > > > Sure, but who will call it with misaligned ranges? > > > > > > create a file foo.txt of size 4K and then truncate it. > > > > > > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to > > > 4095. > > > > > > I have also written a test for this. > > > > > > https://github.com/rhvgoyal/misc/blob/master/pmem-tests/iomap-range-test.sh#L102 > > > > > > > > > > > > At this point of time, all I care about is that case of an arbitrary > > > > > range is handeled well. So if a caller passes a range in, we figure > > > > > out subrange which is sector aligned in terms of start and end, and > > > > > clear poison on those sectors and ignore rest of the range. And > > > > > this itself will be an improvement over current behavior where > > > > > nothing is cleared if I/O is not sector aligned. > > > > > > > > I don't think this makes sense. The caller needs to know about the > > > > blast radius of errors. This is why I asked for a concrete example. > > > > It might make more sense, for example, to return an error if not all of > > > > the errors could be cleared. > > > > > > > > >> > nvdimm_clear_poison() seems to assume offset and len to be aligned to > > > > >> > clear_err_unit boundary. But this is currently internal detail and is > > > > >> > not exported for others to use. So for now, continue to align offset and > > > > >> > length to SECTOR_SIZE boundary. Improving it further and to align it > > > > >> > to clear_err_unit boundary is a TODO item for future. > > > > >> > > > > >> When there is a poisoned range of persistent memory, it is recorded by > > > > >> the badblocks infrastructure, which currently operates on sectors. So, > > > > >> no matter what the error unit is for the hardware, we currently can't > > > > >> record/report to userspace anything smaller than a sector, and so that > > > > >> is what we expect when clearing errors. > > > > >> > > > > >> Continuing on for completeness, we will currently not map a page with > > > > >> badblocks into a process' address space. So, let's say you have 256 > > > > >> bytes of bad pmem, we will tell you we've lost 512 bytes, and even if > > > > >> you access a valid mmap()d address in the same page as the poisoned > > > > >> memory, you will get a segfault. > > > > >> > > > > >> Userspace can fix up the error by calling write(2) and friends to > > > > >> provide new data, or by punching a hole and writing new data to the hole > > > > >> (which may result in getting a new block, or reallocating the old block > > > > >> and zeroing it, which will clear the error). > > > > > > > > > > Fair enough. I do not need poison clearing at finer granularity. It might > > > > > be needed once dev_dax path wants to clear poison. Not sure how exactly > > > > > that works. > > > > > > > > It doesn't. :) > > > > > > > > >> > + /* > > > > >> > + * Callers can pass arbitrary offset and len. But nvdimm_clear_poison() > > > > >> > + * expects memory offset and length to meet certain alignment > > > > >> > + * restrction (clear_err_unit). Currently nvdimm does not export > > > > >> ^^^^^^^^^^^^^^^^^^^^^^ > > > > >> > + * required alignment. So align offset and length to sector boundary > > > > >> > > > > >> What is "nvdimm" in that sentence? Because the nvdimm most certainly > > > > >> does export the required alignment. Perhaps you meant libnvdimm? > > > > > > > > > > I meant nvdimm_clear_poison() function in drivers/nvdimm/bus.c. Whatever > > > > > it is called. It first queries alignement required (clear_err_unit) and > > > > > then makes sure range passed in meets that alignment requirement. > > > > > > > > My point was your comment is misleading. > > > > > > > > >> We could potentially support clearing less than a sector, but I'd have > > > > >> to understand the use cases better before offerring implementation > > > > >> suggestions. > > > > > > > > > > I don't need clearing less than a secotr. Once somebody needs it they > > > > > can implement it. All I am doing is making sure current logic is not > > > > > broken when dax_zero_page_range() starts using this logic and passes > > > > > an arbitrary range. We need to make sure we internally align I/O > > > > > > > > An arbitrary range is the same thing as less than a sector. :) Do you > > > > know of an instance where the range will not be sector-aligned and sized? > > > > > > > > > and carve out an aligned sub-range and pass that subrange to > > > > > nvdimm_clear_poison(). > > > > > > > > And what happens to the rest? The caller is left to trip over the > > > > errors? That sounds pretty terrible. I really think there needs to be > > > > an explicit contract here. > > > > > > Ok, I think is is the contentious bit. Current interface > > > (__dax_zero_page_range()) either clears the poison (if I/O is aligned to > > > sector) or expects page to be free of poison. > > > > > > So in above example, of "truncate -s 23 foo.txt", currently I get an error > > > because range being zeroed is not sector aligned. So > > > __dax_zero_page_range() falls back to calling direct_access(). Which > > > fails because there are poisoned sectors in the page. > > > > > > With my patches, dax_zero_page_range(), clears the poison from sector 1 to > > > 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511 > > > and returns success. > > > > > > So question is, is this better behavior or worse behavior. If sector 0 > > > was poisoned, it will continue to remain poisoned and caller will come > > > to know about it on next read and then it should try to truncate file > > > to length 0 or unlink file or restore that file to get rid of poison. > > > > > > IOW, if a partial block is being zeroed and if it is poisoned, caller > > > will not be return an error and poison will not be cleared and memory > > > will be zeroed. What do we expect in such cases. > > > > > > Do we expect an interface where if there are any bad blocks in the range > > > being zeroed, then they all should be cleared (and hence all I/O should > > > be aligned) otherwise error is returned. If yes, I could make that > > > change. > > > > This does not strike me as a good idea because it's a false security > > compared to the latent poison case. If the writes to an unknown > > poisoned location would otherwise succeed via a different I/O path > > (dax), it's an unsymmetric surprise to start returning errors just > > because you wrote zeroes as a side effect of truncate. > > > > > Downside of current interface is that it will clear as many blocks as > > > possible in the given range and leave starting and end blocks poisoned > > > (if it is unaligned) and not return error. That means a reader will > > > get error on these blocks again and they will have to try to clear it > > > again. > > > > I think what you have described in your truncate example is an > > improvement on what we have currently because x86 does not communicate > > write errors. Specifically, writing zeros via dax from userspace over > > unknown poison behaves the same as writing unaligned zeros over known > > poison. In both cases it's a best effort that always succeeds (no cpu > > exception), and may inadvertently clear poison as a side-effect. > > Otherwise, an error-block-aligned hole punch is the only way to > > trigger the kernel to try to clear known poison when the full block is > > reallocated. > > Hi Dan, > > Agreed. This new interface works uniformly for both known poison and latent > poison cases. Existing interface is asymmetric and that means if poison is > latent, unaligned zero range will succeed but if poison is known, unaligned > zero range will fail. > > > > > On movdir64b capable cpus the error clearing unit becomes 64-bytes > > rather than 256-bytes because that allows a cacheline to be written > > without triggering a line fill read. So the error clearing granularity > > gets better over time, but unfortunately not synchronous detection in > > the I/O path. > > > > I think a better way to improve poison handling is the long standing > > idea to integrate the badblock tracking into the filesystem directly. > > That way driver notifications of poison can be ingested into the > > filesystem and notifications sent on filenames rather than the current > > TOCTOU mess of trying to do a reverse lookup of badblock numbers to > > files. If the application can efficiently list and be notified of > > poison it can mitigate it immediately rather than trying to rely on > > write side effects. > > Moving badblocks infrastructure in filesystem sounds like a major > rework which should be taken up in a seprate patch series in future. > > For now, can we please take these patches which are an improvement > over existing interface. Oh you misunderstood my comment, the "move badblocks to filesystem" proposal is long term / down the road thing to consider. In the near term this unaligned block zeroing facility is an improvement.
Dan Williams <dan.j.williams@intel.com> writes: > Oh you misunderstood my comment, the "move badblocks to filesystem" > proposal is long term / down the road thing to consider. In the near > term this unaligned block zeroing facility is an improvement. I'm not sure I agree. I'm going to think about it and get back to you. -Jeff
On Fri, Feb 21, 2020 at 03:17:59PM -0500, Vivek Goyal wrote: > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote: > > Vivek Goyal <vgoyal@redhat.com> writes: > > > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote: > > >> Vivek Goyal <vgoyal@redhat.com> writes: > > >> > > >> > Currently pmem_clear_poison() expects offset and len to be sector aligned. > > >> > Atleast that seems to be the assumption with which code has been written. > > >> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page() > > >> > and pmem_make_request() which will only passe sector aligned offset and len. > > >> > > > >> > Soon we want use this function from dax_zero_page_range() code path which > > >> > can try to zero arbitrary range of memory with-in a page. So update this > > >> > function to assume that offset and length can be arbitrary and do the > > >> > necessary alignments as needed. > > >> > > >> What caller will try to zero a range that is smaller than a sector? > > > > > > Hi Jeff, > > > > > > New dax zeroing interface (dax_zero_page_range()) can technically pass > > > a range which is less than a sector. Or which is bigger than a sector > > > but start and end are not aligned on sector boundaries. > > > > Sure, but who will call it with misaligned ranges? > > create a file foo.txt of size 4K and then truncate it. > > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to > 4095. This should fail with EIO. Only full page writes should clear the bad page state, and partial writes should therefore fail because they do not guarantee the data in the filesystem block is all good. If this zeroing was a buffered write to an address with a bad sector, then the writeback will fail and the user will (eventually) get an EIO on the file. DAX should do the same thing, except because the zeroing is synchronous (i.e. done directly by the truncate syscall) we can - and should - return EIO immediately. Indeed, with your code, if we then extend the file by truncating up back to 4k, then the range between 23 and 512 is still bad, even though we've successfully zeroed it and the user knows it. An attempt to read anywhere in this range (e.g. 10 bytes at offset 100) will fail with EIO, but reading 10 bytes at offset 2000 will succeed. That's *awful* behaviour to expose to userspace, especially when they look at the fs config and see that it's using both 4kB block and sector sizes... The only thing that makes sense from a filesystem perspective is clearing bad page state when entire filesystem blocks are overwritten. The data in a filesystem block is either good or bad, and it doesn't matter how many internal (kernel or device) sectors it has. > > And what happens to the rest? The caller is left to trip over the > > errors? That sounds pretty terrible. I really think there needs to be > > an explicit contract here. > > Ok, I think is is the contentious bit. Current interface > (__dax_zero_page_range()) either clears the poison (if I/O is aligned to > sector) or expects page to be free of poison. > > So in above example, of "truncate -s 23 foo.txt", currently I get an error > because range being zeroed is not sector aligned. So > __dax_zero_page_range() falls back to calling direct_access(). Which > fails because there are poisoned sectors in the page. > > With my patches, dax_zero_page_range(), clears the poison from sector 1 to > 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511 > and returns success. Ok, kernel sectors are not the unit of granularity bad page state should be managed at. They don't match page state granularity, and they don't match filesystem block granularity, and the whacky "partial writes silently succeed, reads fail unpredictably" assymetry it leads to will just cause problems for users. > So question is, is this better behavior or worse behavior. If sector 0 > was poisoned, it will continue to remain poisoned and caller will come > to know about it on next read and then it should try to truncate file > to length 0 or unlink file or restore that file to get rid of poison. Worse, because the filesystem can't track what sub-parts of the block are bad and that leads to inconsistent data integrity status being exposed to userspace. > IOW, if a partial block is being zeroed and if it is poisoned, caller > will not be return an error and poison will not be cleared and memory > will be zeroed. What do we expect in such cases. > > Do we expect an interface where if there are any bad blocks in the range > being zeroed, then they all should be cleared (and hence all I/O should > be aligned) otherwise error is returned. If yes, I could make that > change. > > Downside of current interface is that it will clear as many blocks as > possible in the given range and leave starting and end blocks poisoned > (if it is unaligned) and not return error. That means a reader will > get error on these blocks again and they will have to try to clear it > again. Which is solved by having partial page writes always EIO on poisoned memory. Cheers, Dave.
On Sun, Feb 23, 2020 at 3:03 PM Dave Chinner <david@fromorbit.com> wrote: > > On Fri, Feb 21, 2020 at 03:17:59PM -0500, Vivek Goyal wrote: > > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote: > > > Vivek Goyal <vgoyal@redhat.com> writes: > > > > > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote: > > > >> Vivek Goyal <vgoyal@redhat.com> writes: > > > >> > > > >> > Currently pmem_clear_poison() expects offset and len to be sector aligned. > > > >> > Atleast that seems to be the assumption with which code has been written. > > > >> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page() > > > >> > and pmem_make_request() which will only passe sector aligned offset and len. > > > >> > > > > >> > Soon we want use this function from dax_zero_page_range() code path which > > > >> > can try to zero arbitrary range of memory with-in a page. So update this > > > >> > function to assume that offset and length can be arbitrary and do the > > > >> > necessary alignments as needed. > > > >> > > > >> What caller will try to zero a range that is smaller than a sector? > > > > > > > > Hi Jeff, > > > > > > > > New dax zeroing interface (dax_zero_page_range()) can technically pass > > > > a range which is less than a sector. Or which is bigger than a sector > > > > but start and end are not aligned on sector boundaries. > > > > > > Sure, but who will call it with misaligned ranges? > > > > create a file foo.txt of size 4K and then truncate it. > > > > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to > > 4095. > > This should fail with EIO. Only full page writes should clear the > bad page state, and partial writes should therefore fail because > they do not guarantee the data in the filesystem block is all good. > > If this zeroing was a buffered write to an address with a bad > sector, then the writeback will fail and the user will (eventually) > get an EIO on the file. > > DAX should do the same thing, except because the zeroing is > synchronous (i.e. done directly by the truncate syscall) we can - > and should - return EIO immediately. > > Indeed, with your code, if we then extend the file by truncating up > back to 4k, then the range between 23 and 512 is still bad, even > though we've successfully zeroed it and the user knows it. An > attempt to read anywhere in this range (e.g. 10 bytes at offset 100) > will fail with EIO, but reading 10 bytes at offset 2000 will > succeed. > > That's *awful* behaviour to expose to userspace, especially when > they look at the fs config and see that it's using both 4kB block > and sector sizes... > > The only thing that makes sense from a filesystem perspective is > clearing bad page state when entire filesystem blocks are > overwritten. The data in a filesystem block is either good or bad, > and it doesn't matter how many internal (kernel or device) sectors > it has. > > > > And what happens to the rest? The caller is left to trip over the > > > errors? That sounds pretty terrible. I really think there needs to be > > > an explicit contract here. > > > > Ok, I think is is the contentious bit. Current interface > > (__dax_zero_page_range()) either clears the poison (if I/O is aligned to > > sector) or expects page to be free of poison. > > > > So in above example, of "truncate -s 23 foo.txt", currently I get an error > > because range being zeroed is not sector aligned. So > > __dax_zero_page_range() falls back to calling direct_access(). Which > > fails because there are poisoned sectors in the page. > > > > With my patches, dax_zero_page_range(), clears the poison from sector 1 to > > 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511 > > and returns success. > > Ok, kernel sectors are not the unit of granularity bad page state > should be managed at. They don't match page state granularity, and > they don't match filesystem block granularity, and the whacky > "partial writes silently succeed, reads fail unpredictably" > assymetry it leads to will just cause problems for users. > > > So question is, is this better behavior or worse behavior. If sector 0 > > was poisoned, it will continue to remain poisoned and caller will come > > to know about it on next read and then it should try to truncate file > > to length 0 or unlink file or restore that file to get rid of poison. > > Worse, because the filesystem can't track what sub-parts of the > block are bad and that leads to inconsistent data integrity status > being exposed to userspace. The driver can't track it either. Latent poison isn't know until it is consumed, and writes to latent poison are not guaranteed to clear it. > > > > IOW, if a partial block is being zeroed and if it is poisoned, caller > > will not be return an error and poison will not be cleared and memory > > will be zeroed. What do we expect in such cases. > > > > Do we expect an interface where if there are any bad blocks in the range > > being zeroed, then they all should be cleared (and hence all I/O should > > be aligned) otherwise error is returned. If yes, I could make that > > change. > > > > Downside of current interface is that it will clear as many blocks as > > possible in the given range and leave starting and end blocks poisoned > > (if it is unaligned) and not return error. That means a reader will > > get error on these blocks again and they will have to try to clear it > > again. > > Which is solved by having partial page writes always EIO on poisoned > memory. The problem with the above is that partial page writes can not be guaranteed to return EIO. Poison is only detected on consumed reads, or a periodic scrub, not writes. IFF poison detection was always synchronous with poison creation then the above makes sense. However, with asynchronous signaling, it's fundamentally a false security blanket to assume even full block writes will clear poison unless a callback to firmware is made for every block.
Dan Williams <dan.j.williams@intel.com> writes: > On Sun, Feb 23, 2020 at 3:03 PM Dave Chinner <david@fromorbit.com> wrote: >> >> On Fri, Feb 21, 2020 at 03:17:59PM -0500, Vivek Goyal wrote: >> > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote: >> > > Vivek Goyal <vgoyal@redhat.com> writes: >> > > >> > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote: >> > > >> Vivek Goyal <vgoyal@redhat.com> writes: >> > > >> >> > > >> > Currently pmem_clear_poison() expects offset and len to be sector aligned. >> > > >> > Atleast that seems to be the assumption with which code has been written. >> > > >> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page() >> > > >> > and pmem_make_request() which will only passe sector aligned offset and len. >> > > >> > >> > > >> > Soon we want use this function from dax_zero_page_range() code path which >> > > >> > can try to zero arbitrary range of memory with-in a page. So update this >> > > >> > function to assume that offset and length can be arbitrary and do the >> > > >> > necessary alignments as needed. >> > > >> >> > > >> What caller will try to zero a range that is smaller than a sector? >> > > > >> > > > Hi Jeff, >> > > > >> > > > New dax zeroing interface (dax_zero_page_range()) can technically pass >> > > > a range which is less than a sector. Or which is bigger than a sector >> > > > but start and end are not aligned on sector boundaries. >> > > >> > > Sure, but who will call it with misaligned ranges? >> > >> > create a file foo.txt of size 4K and then truncate it. >> > >> > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to >> > 4095. >> >> This should fail with EIO. Only full page writes should clear the >> bad page state, and partial writes should therefore fail because >> they do not guarantee the data in the filesystem block is all good. >> >> If this zeroing was a buffered write to an address with a bad >> sector, then the writeback will fail and the user will (eventually) >> get an EIO on the file. >> >> DAX should do the same thing, except because the zeroing is >> synchronous (i.e. done directly by the truncate syscall) we can - >> and should - return EIO immediately. >> >> Indeed, with your code, if we then extend the file by truncating up >> back to 4k, then the range between 23 and 512 is still bad, even >> though we've successfully zeroed it and the user knows it. An >> attempt to read anywhere in this range (e.g. 10 bytes at offset 100) >> will fail with EIO, but reading 10 bytes at offset 2000 will >> succeed. >> >> That's *awful* behaviour to expose to userspace, especially when >> they look at the fs config and see that it's using both 4kB block >> and sector sizes... >> >> The only thing that makes sense from a filesystem perspective is >> clearing bad page state when entire filesystem blocks are >> overwritten. The data in a filesystem block is either good or bad, >> and it doesn't matter how many internal (kernel or device) sectors >> it has. >> >> > > And what happens to the rest? The caller is left to trip over the >> > > errors? That sounds pretty terrible. I really think there needs to be >> > > an explicit contract here. >> > >> > Ok, I think is is the contentious bit. Current interface >> > (__dax_zero_page_range()) either clears the poison (if I/O is aligned to >> > sector) or expects page to be free of poison. >> > >> > So in above example, of "truncate -s 23 foo.txt", currently I get an error >> > because range being zeroed is not sector aligned. So >> > __dax_zero_page_range() falls back to calling direct_access(). Which >> > fails because there are poisoned sectors in the page. >> > >> > With my patches, dax_zero_page_range(), clears the poison from sector 1 to >> > 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511 >> > and returns success. >> >> Ok, kernel sectors are not the unit of granularity bad page state >> should be managed at. They don't match page state granularity, and >> they don't match filesystem block granularity, and the whacky >> "partial writes silently succeed, reads fail unpredictably" >> assymetry it leads to will just cause problems for users. >> >> > So question is, is this better behavior or worse behavior. If sector 0 >> > was poisoned, it will continue to remain poisoned and caller will come >> > to know about it on next read and then it should try to truncate file >> > to length 0 or unlink file or restore that file to get rid of poison. >> >> Worse, because the filesystem can't track what sub-parts of the >> block are bad and that leads to inconsistent data integrity status >> being exposed to userspace. > > The driver can't track it either. Latent poison isn't know until it is > consumed, and writes to latent poison are not guaranteed to clear it. I believe we're discussing the case where we know there is a bad block. Obviously we can't know about latent errors. >> > IOW, if a partial block is being zeroed and if it is poisoned, caller >> > will not be return an error and poison will not be cleared and memory >> > will be zeroed. What do we expect in such cases. >> > >> > Do we expect an interface where if there are any bad blocks in the range >> > being zeroed, then they all should be cleared (and hence all I/O should >> > be aligned) otherwise error is returned. If yes, I could make that >> > change. >> > >> > Downside of current interface is that it will clear as many blocks as >> > possible in the given range and leave starting and end blocks poisoned >> > (if it is unaligned) and not return error. That means a reader will >> > get error on these blocks again and they will have to try to clear it >> > again. >> >> Which is solved by having partial page writes always EIO on poisoned >> memory. > > The problem with the above is that partial page writes can not be > guaranteed to return EIO. Poison is only detected on consumed reads, > or a periodic scrub, not writes. IFF poison detection was always > synchronous with poison creation then the above makes sense. However, > with asynchronous signaling, it's fundamentally a false security > blanket to assume even full block writes will clear poison unless a > callback to firmware is made for every block. Let's just focus on reporting errors when we know we have them. -Jeff
On Mon, Feb 24, 2020 at 10:03:30AM +1100, Dave Chinner wrote: [..] > > > > Hi Jeff, > > > > > > > > New dax zeroing interface (dax_zero_page_range()) can technically pass > > > > a range which is less than a sector. Or which is bigger than a sector > > > > but start and end are not aligned on sector boundaries. > > > > > > Sure, but who will call it with misaligned ranges? > > > > create a file foo.txt of size 4K and then truncate it. > > > > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to > > 4095. > > This should fail with EIO. Only full page writes should clear the > bad page state, and partial writes should therefore fail because > they do not guarantee the data in the filesystem block is all good. > > If this zeroing was a buffered write to an address with a bad > sector, then the writeback will fail and the user will (eventually) > get an EIO on the file. > > DAX should do the same thing, except because the zeroing is > synchronous (i.e. done directly by the truncate syscall) we can - > and should - return EIO immediately. > > Indeed, with your code, if we then extend the file by truncating up > back to 4k, then the range between 23 and 512 is still bad, even > though we've successfully zeroed it and the user knows it. An > attempt to read anywhere in this range (e.g. 10 bytes at offset 100) > will fail with EIO, but reading 10 bytes at offset 2000 will > succeed. Hi Dave, What is expected if I do "truncate -s 512 foo.txt". Say first sector (0 to 511) is poisoned and rest don't have poison. Should this fail with -EIO. In current implementation it does not. Because all sector aligned I/O we redirect through blkdev_issue_zeroout() and that will happly zero out sector 2-8 without worrying about the state of sector 1. Hence user which tries to read 10 bytes at offset 100, will still fail. This probably should be fixed if we want to retain existing behavior. Anyway, partial page truncate can't ensure that data in rest of the page can be read back successfully. Memory can get poison after the write and hence read after truncate will still fail. Hence, all we are trying to ensure is that if a poison is known at the time of writing partial page, then we should return error to user space. Thanks Vivek
On Mon, Feb 24, 2020 at 10:03:30AM +1100, Dave Chinner wrote: > On Fri, Feb 21, 2020 at 03:17:59PM -0500, Vivek Goyal wrote: > > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote: > > > Vivek Goyal <vgoyal@redhat.com> writes: > > > > > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote: > > > >> Vivek Goyal <vgoyal@redhat.com> writes: > > > >> > > > >> > Currently pmem_clear_poison() expects offset and len to be sector aligned. > > > >> > Atleast that seems to be the assumption with which code has been written. > > > >> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page() > > > >> > and pmem_make_request() which will only passe sector aligned offset and len. > > > >> > > > > >> > Soon we want use this function from dax_zero_page_range() code path which > > > >> > can try to zero arbitrary range of memory with-in a page. So update this > > > >> > function to assume that offset and length can be arbitrary and do the > > > >> > necessary alignments as needed. > > > >> > > > >> What caller will try to zero a range that is smaller than a sector? > > > > > > > > Hi Jeff, > > > > > > > > New dax zeroing interface (dax_zero_page_range()) can technically pass > > > > a range which is less than a sector. Or which is bigger than a sector > > > > but start and end are not aligned on sector boundaries. > > > > > > Sure, but who will call it with misaligned ranges? > > > > create a file foo.txt of size 4K and then truncate it. > > > > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to > > 4095. > > This should fail with EIO. Only full page writes should clear the > bad page state, and partial writes should therefore fail because > they do not guarantee the data in the filesystem block is all good. > > If this zeroing was a buffered write to an address with a bad > sector, then the writeback will fail and the user will (eventually) > get an EIO on the file. > > DAX should do the same thing, except because the zeroing is > synchronous (i.e. done directly by the truncate syscall) we can - > and should - return EIO immediately. > > Indeed, with your code, if we then extend the file by truncating up > back to 4k, then the range between 23 and 512 is still bad, even > though we've successfully zeroed it and the user knows it. An > attempt to read anywhere in this range (e.g. 10 bytes at offset 100) > will fail with EIO, but reading 10 bytes at offset 2000 will > succeed. > > That's *awful* behaviour to expose to userspace, especially when > they look at the fs config and see that it's using both 4kB block > and sector sizes... > > The only thing that makes sense from a filesystem perspective is > clearing bad page state when entire filesystem blocks are > overwritten. The data in a filesystem block is either good or bad, > and it doesn't matter how many internal (kernel or device) sectors > it has. > > > > And what happens to the rest? The caller is left to trip over the > > > errors? That sounds pretty terrible. I really think there needs to be > > > an explicit contract here. > > > > Ok, I think is is the contentious bit. Current interface > > (__dax_zero_page_range()) either clears the poison (if I/O is aligned to > > sector) or expects page to be free of poison. > > > > So in above example, of "truncate -s 23 foo.txt", currently I get an error > > because range being zeroed is not sector aligned. So > > __dax_zero_page_range() falls back to calling direct_access(). Which > > fails because there are poisoned sectors in the page. > > > > With my patches, dax_zero_page_range(), clears the poison from sector 1 to > > 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511 > > and returns success. > > Ok, kernel sectors are not the unit of granularity bad page state > should be managed at. They don't match page state granularity, and > they don't match filesystem block granularity, and the whacky > "partial writes silently succeed, reads fail unpredictably" > assymetry it leads to will just cause problems for users. > > > So question is, is this better behavior or worse behavior. If sector 0 > > was poisoned, it will continue to remain poisoned and caller will come > > to know about it on next read and then it should try to truncate file > > to length 0 or unlink file or restore that file to get rid of poison. > > Worse, because the filesystem can't track what sub-parts of the > block are bad and that leads to inconsistent data integrity status > being exposed to userspace. > > > > IOW, if a partial block is being zeroed and if it is poisoned, caller > > will not be return an error and poison will not be cleared and memory > > will be zeroed. What do we expect in such cases. > > > > Do we expect an interface where if there are any bad blocks in the range > > being zeroed, then they all should be cleared (and hence all I/O should > > be aligned) otherwise error is returned. If yes, I could make that > > change. > > > > Downside of current interface is that it will clear as many blocks as > > possible in the given range and leave starting and end blocks poisoned > > (if it is unaligned) and not return error. That means a reader will > > get error on these blocks again and they will have to try to clear it > > again. > > Which is solved by having partial page writes always EIO on poisoned > memory. Ok, how about if I add one more patch to the series which will check if unwritten portion of the page has known poison. If it has, then -EIO is returned. Subject: pmem: zero page range return error if poisoned memory in unwritten area Filesystems call into pmem_dax_zero_page_range() to zero partial page upon truncate. If partial page is being zeroed, then at the end of operation file systems expect that there is no poison in the whole page (atleast known poison). So make sure part of the partial page which is not being written, does not have poison. If it does, return error. If there is poison in area of page being written, it will be cleared. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> --- drivers/nvdimm/pmem.c | 13 +++++++++++++ 1 file changed, 13 insertions(+) Index: redhat-linux/drivers/nvdimm/pmem.c =================================================================== --- redhat-linux.orig/drivers/nvdimm/pmem.c 2020-02-24 10:57:48.520451478 -0500 +++ redhat-linux/drivers/nvdimm/pmem.c 2020-02-24 15:01:58.119854835 -0500 @@ -309,6 +309,19 @@ static int pmem_dax_zero_page_range(stru { struct pmem_device *pmem = dax_get_private(dax_dev); + if (offset_in_page(offset)) { + sector_t unwritten_start = (offset & PAGE_MASK) >> SECTOR_SHIFT; + unsigned int unwritten_len = ALIGN(offset_in_page(offset), SECTOR_SIZE); + /* If partial page is being zeroed, make sure there is no + * known poison in the area of page which is not being zeroed. + * Filesystems expect full page to be free of *known* poison + * at the end of partial page zeroing. + */ + if (unlikely(is_bad_pmem(&pmem->bb, unwritten_start, + unwritten_len))) + return -EIO; + } + return blk_status_to_errno(pmem_do_write(pmem, ZERO_PAGE(0), 0, offset, len)); }
On Mon, Feb 24, 2020 at 5:50 AM Jeff Moyer <jmoyer@redhat.com> wrote: > > Dan Williams <dan.j.williams@intel.com> writes: > > > On Sun, Feb 23, 2020 at 3:03 PM Dave Chinner <david@fromorbit.com> wrote: > >> > >> On Fri, Feb 21, 2020 at 03:17:59PM -0500, Vivek Goyal wrote: > >> > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote: > >> > > Vivek Goyal <vgoyal@redhat.com> writes: > >> > > > >> > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote: > >> > > >> Vivek Goyal <vgoyal@redhat.com> writes: > >> > > >> > >> > > >> > Currently pmem_clear_poison() expects offset and len to be sector aligned. > >> > > >> > Atleast that seems to be the assumption with which code has been written. > >> > > >> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page() > >> > > >> > and pmem_make_request() which will only passe sector aligned offset and len. > >> > > >> > > >> > > >> > Soon we want use this function from dax_zero_page_range() code path which > >> > > >> > can try to zero arbitrary range of memory with-in a page. So update this > >> > > >> > function to assume that offset and length can be arbitrary and do the > >> > > >> > necessary alignments as needed. > >> > > >> > >> > > >> What caller will try to zero a range that is smaller than a sector? > >> > > > > >> > > > Hi Jeff, > >> > > > > >> > > > New dax zeroing interface (dax_zero_page_range()) can technically pass > >> > > > a range which is less than a sector. Or which is bigger than a sector > >> > > > but start and end are not aligned on sector boundaries. > >> > > > >> > > Sure, but who will call it with misaligned ranges? > >> > > >> > create a file foo.txt of size 4K and then truncate it. > >> > > >> > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to > >> > 4095. > >> > >> This should fail with EIO. Only full page writes should clear the > >> bad page state, and partial writes should therefore fail because > >> they do not guarantee the data in the filesystem block is all good. > >> > >> If this zeroing was a buffered write to an address with a bad > >> sector, then the writeback will fail and the user will (eventually) > >> get an EIO on the file. > >> > >> DAX should do the same thing, except because the zeroing is > >> synchronous (i.e. done directly by the truncate syscall) we can - > >> and should - return EIO immediately. > >> > >> Indeed, with your code, if we then extend the file by truncating up > >> back to 4k, then the range between 23 and 512 is still bad, even > >> though we've successfully zeroed it and the user knows it. An > >> attempt to read anywhere in this range (e.g. 10 bytes at offset 100) > >> will fail with EIO, but reading 10 bytes at offset 2000 will > >> succeed. > >> > >> That's *awful* behaviour to expose to userspace, especially when > >> they look at the fs config and see that it's using both 4kB block > >> and sector sizes... > >> > >> The only thing that makes sense from a filesystem perspective is > >> clearing bad page state when entire filesystem blocks are > >> overwritten. The data in a filesystem block is either good or bad, > >> and it doesn't matter how many internal (kernel or device) sectors > >> it has. > >> > >> > > And what happens to the rest? The caller is left to trip over the > >> > > errors? That sounds pretty terrible. I really think there needs to be > >> > > an explicit contract here. > >> > > >> > Ok, I think is is the contentious bit. Current interface > >> > (__dax_zero_page_range()) either clears the poison (if I/O is aligned to > >> > sector) or expects page to be free of poison. > >> > > >> > So in above example, of "truncate -s 23 foo.txt", currently I get an error > >> > because range being zeroed is not sector aligned. So > >> > __dax_zero_page_range() falls back to calling direct_access(). Which > >> > fails because there are poisoned sectors in the page. > >> > > >> > With my patches, dax_zero_page_range(), clears the poison from sector 1 to > >> > 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511 > >> > and returns success. > >> > >> Ok, kernel sectors are not the unit of granularity bad page state > >> should be managed at. They don't match page state granularity, and > >> they don't match filesystem block granularity, and the whacky > >> "partial writes silently succeed, reads fail unpredictably" > >> assymetry it leads to will just cause problems for users. > >> > >> > So question is, is this better behavior or worse behavior. If sector 0 > >> > was poisoned, it will continue to remain poisoned and caller will come > >> > to know about it on next read and then it should try to truncate file > >> > to length 0 or unlink file or restore that file to get rid of poison. > >> > >> Worse, because the filesystem can't track what sub-parts of the > >> block are bad and that leads to inconsistent data integrity status > >> being exposed to userspace. > > > > The driver can't track it either. Latent poison isn't know until it is > > consumed, and writes to latent poison are not guaranteed to clear it. > > I believe we're discussing the case where we know there is a bad block. > Obviously we can't know about latent errors. > > >> > IOW, if a partial block is being zeroed and if it is poisoned, caller > >> > will not be return an error and poison will not be cleared and memory > >> > will be zeroed. What do we expect in such cases. > >> > > >> > Do we expect an interface where if there are any bad blocks in the range > >> > being zeroed, then they all should be cleared (and hence all I/O should > >> > be aligned) otherwise error is returned. If yes, I could make that > >> > change. > >> > > >> > Downside of current interface is that it will clear as many blocks as > >> > possible in the given range and leave starting and end blocks poisoned > >> > (if it is unaligned) and not return error. That means a reader will > >> > get error on these blocks again and they will have to try to clear it > >> > again. > >> > >> Which is solved by having partial page writes always EIO on poisoned > >> memory. > > > > The problem with the above is that partial page writes can not be > > guaranteed to return EIO. Poison is only detected on consumed reads, > > or a periodic scrub, not writes. IFF poison detection was always > > synchronous with poison creation then the above makes sense. However, > > with asynchronous signaling, it's fundamentally a false security > > blanket to assume even full block writes will clear poison unless a > > callback to firmware is made for every block. > > Let's just focus on reporting errors when we know we have them. That's the problem in my eyes. If software needs to contend with latent error reporting then it should always contend otherwise software has multiple error models to wrangle. Setting that aside we can start with just treating zeroing the same as the copy_from_iter() case and fail the I/O at the dax_direct_access() step. I'd rather have a separate op that filesystems can use to clear errors at block allocation time that can be enforced to have the correct alignment.
On Mon, Feb 24, 2020 at 12:13 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > On Mon, Feb 24, 2020 at 10:03:30AM +1100, Dave Chinner wrote: > > On Fri, Feb 21, 2020 at 03:17:59PM -0500, Vivek Goyal wrote: > > > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote: > > > > Vivek Goyal <vgoyal@redhat.com> writes: > > > > > > > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote: > > > > >> Vivek Goyal <vgoyal@redhat.com> writes: > > > > >> > > > > >> > Currently pmem_clear_poison() expects offset and len to be sector aligned. > > > > >> > Atleast that seems to be the assumption with which code has been written. > > > > >> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page() > > > > >> > and pmem_make_request() which will only passe sector aligned offset and len. > > > > >> > > > > > >> > Soon we want use this function from dax_zero_page_range() code path which > > > > >> > can try to zero arbitrary range of memory with-in a page. So update this > > > > >> > function to assume that offset and length can be arbitrary and do the > > > > >> > necessary alignments as needed. > > > > >> > > > > >> What caller will try to zero a range that is smaller than a sector? > > > > > > > > > > Hi Jeff, > > > > > > > > > > New dax zeroing interface (dax_zero_page_range()) can technically pass > > > > > a range which is less than a sector. Or which is bigger than a sector > > > > > but start and end are not aligned on sector boundaries. > > > > > > > > Sure, but who will call it with misaligned ranges? > > > > > > create a file foo.txt of size 4K and then truncate it. > > > > > > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to > > > 4095. > > > > This should fail with EIO. Only full page writes should clear the > > bad page state, and partial writes should therefore fail because > > they do not guarantee the data in the filesystem block is all good. > > > > If this zeroing was a buffered write to an address with a bad > > sector, then the writeback will fail and the user will (eventually) > > get an EIO on the file. > > > > DAX should do the same thing, except because the zeroing is > > synchronous (i.e. done directly by the truncate syscall) we can - > > and should - return EIO immediately. > > > > Indeed, with your code, if we then extend the file by truncating up > > back to 4k, then the range between 23 and 512 is still bad, even > > though we've successfully zeroed it and the user knows it. An > > attempt to read anywhere in this range (e.g. 10 bytes at offset 100) > > will fail with EIO, but reading 10 bytes at offset 2000 will > > succeed. > > > > That's *awful* behaviour to expose to userspace, especially when > > they look at the fs config and see that it's using both 4kB block > > and sector sizes... > > > > The only thing that makes sense from a filesystem perspective is > > clearing bad page state when entire filesystem blocks are > > overwritten. The data in a filesystem block is either good or bad, > > and it doesn't matter how many internal (kernel or device) sectors > > it has. > > > > > > And what happens to the rest? The caller is left to trip over the > > > > errors? That sounds pretty terrible. I really think there needs to be > > > > an explicit contract here. > > > > > > Ok, I think is is the contentious bit. Current interface > > > (__dax_zero_page_range()) either clears the poison (if I/O is aligned to > > > sector) or expects page to be free of poison. > > > > > > So in above example, of "truncate -s 23 foo.txt", currently I get an error > > > because range being zeroed is not sector aligned. So > > > __dax_zero_page_range() falls back to calling direct_access(). Which > > > fails because there are poisoned sectors in the page. > > > > > > With my patches, dax_zero_page_range(), clears the poison from sector 1 to > > > 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511 > > > and returns success. > > > > Ok, kernel sectors are not the unit of granularity bad page state > > should be managed at. They don't match page state granularity, and > > they don't match filesystem block granularity, and the whacky > > "partial writes silently succeed, reads fail unpredictably" > > assymetry it leads to will just cause problems for users. > > > > > So question is, is this better behavior or worse behavior. If sector 0 > > > was poisoned, it will continue to remain poisoned and caller will come > > > to know about it on next read and then it should try to truncate file > > > to length 0 or unlink file or restore that file to get rid of poison. > > > > Worse, because the filesystem can't track what sub-parts of the > > block are bad and that leads to inconsistent data integrity status > > being exposed to userspace. > > > > > > > IOW, if a partial block is being zeroed and if it is poisoned, caller > > > will not be return an error and poison will not be cleared and memory > > > will be zeroed. What do we expect in such cases. > > > > > > Do we expect an interface where if there are any bad blocks in the range > > > being zeroed, then they all should be cleared (and hence all I/O should > > > be aligned) otherwise error is returned. If yes, I could make that > > > change. > > > > > > Downside of current interface is that it will clear as many blocks as > > > possible in the given range and leave starting and end blocks poisoned > > > (if it is unaligned) and not return error. That means a reader will > > > get error on these blocks again and they will have to try to clear it > > > again. > > > > Which is solved by having partial page writes always EIO on poisoned > > memory. > > Ok, how about if I add one more patch to the series which will check > if unwritten portion of the page has known poison. If it has, then > -EIO is returned. > > > Subject: pmem: zero page range return error if poisoned memory in unwritten area > > Filesystems call into pmem_dax_zero_page_range() to zero partial page upon > truncate. If partial page is being zeroed, then at the end of operation > file systems expect that there is no poison in the whole page (atleast > known poison). > > So make sure part of the partial page which is not being written, does not > have poison. If it does, return error. If there is poison in area of page > being written, it will be cleared. No, I don't like that the zero operation is special cased compared to the write case. I'd say let's make them identical for now. I.e. fail the I/O at dax_direct_access() time. I think the error clearing interface should be an explicit / separate op rather than a side-effect. What about an explicit interface for initializing newly allocated blocks, and the only reliable way to destroy poison through the filesystem is to free the block?
On Mon, Feb 24, 2020 at 12:52:13PM -0800, Dan Williams wrote: > On Mon, Feb 24, 2020 at 12:13 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > > > On Mon, Feb 24, 2020 at 10:03:30AM +1100, Dave Chinner wrote: > > > On Fri, Feb 21, 2020 at 03:17:59PM -0500, Vivek Goyal wrote: > > > > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote: > > > > > Vivek Goyal <vgoyal@redhat.com> writes: > > > > > > > > > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote: > > > > > >> Vivek Goyal <vgoyal@redhat.com> writes: > > > > > >> > > > > > >> > Currently pmem_clear_poison() expects offset and len to be sector aligned. > > > > > >> > Atleast that seems to be the assumption with which code has been written. > > > > > >> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page() > > > > > >> > and pmem_make_request() which will only passe sector aligned offset and len. > > > > > >> > > > > > > >> > Soon we want use this function from dax_zero_page_range() code path which > > > > > >> > can try to zero arbitrary range of memory with-in a page. So update this > > > > > >> > function to assume that offset and length can be arbitrary and do the > > > > > >> > necessary alignments as needed. > > > > > >> > > > > > >> What caller will try to zero a range that is smaller than a sector? > > > > > > > > > > > > Hi Jeff, > > > > > > > > > > > > New dax zeroing interface (dax_zero_page_range()) can technically pass > > > > > > a range which is less than a sector. Or which is bigger than a sector > > > > > > but start and end are not aligned on sector boundaries. > > > > > > > > > > Sure, but who will call it with misaligned ranges? > > > > > > > > create a file foo.txt of size 4K and then truncate it. > > > > > > > > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to > > > > 4095. > > > > > > This should fail with EIO. Only full page writes should clear the > > > bad page state, and partial writes should therefore fail because > > > they do not guarantee the data in the filesystem block is all good. > > > > > > If this zeroing was a buffered write to an address with a bad > > > sector, then the writeback will fail and the user will (eventually) > > > get an EIO on the file. > > > > > > DAX should do the same thing, except because the zeroing is > > > synchronous (i.e. done directly by the truncate syscall) we can - > > > and should - return EIO immediately. > > > > > > Indeed, with your code, if we then extend the file by truncating up > > > back to 4k, then the range between 23 and 512 is still bad, even > > > though we've successfully zeroed it and the user knows it. An > > > attempt to read anywhere in this range (e.g. 10 bytes at offset 100) > > > will fail with EIO, but reading 10 bytes at offset 2000 will > > > succeed. > > > > > > That's *awful* behaviour to expose to userspace, especially when > > > they look at the fs config and see that it's using both 4kB block > > > and sector sizes... > > > > > > The only thing that makes sense from a filesystem perspective is > > > clearing bad page state when entire filesystem blocks are > > > overwritten. The data in a filesystem block is either good or bad, > > > and it doesn't matter how many internal (kernel or device) sectors > > > it has. > > > > > > > > And what happens to the rest? The caller is left to trip over the > > > > > errors? That sounds pretty terrible. I really think there needs to be > > > > > an explicit contract here. > > > > > > > > Ok, I think is is the contentious bit. Current interface > > > > (__dax_zero_page_range()) either clears the poison (if I/O is aligned to > > > > sector) or expects page to be free of poison. > > > > > > > > So in above example, of "truncate -s 23 foo.txt", currently I get an error > > > > because range being zeroed is not sector aligned. So > > > > __dax_zero_page_range() falls back to calling direct_access(). Which > > > > fails because there are poisoned sectors in the page. > > > > > > > > With my patches, dax_zero_page_range(), clears the poison from sector 1 to > > > > 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511 > > > > and returns success. > > > > > > Ok, kernel sectors are not the unit of granularity bad page state > > > should be managed at. They don't match page state granularity, and > > > they don't match filesystem block granularity, and the whacky > > > "partial writes silently succeed, reads fail unpredictably" > > > assymetry it leads to will just cause problems for users. > > > > > > > So question is, is this better behavior or worse behavior. If sector 0 > > > > was poisoned, it will continue to remain poisoned and caller will come > > > > to know about it on next read and then it should try to truncate file > > > > to length 0 or unlink file or restore that file to get rid of poison. > > > > > > Worse, because the filesystem can't track what sub-parts of the > > > block are bad and that leads to inconsistent data integrity status > > > being exposed to userspace. > > > > > > > > > > IOW, if a partial block is being zeroed and if it is poisoned, caller > > > > will not be return an error and poison will not be cleared and memory > > > > will be zeroed. What do we expect in such cases. > > > > > > > > Do we expect an interface where if there are any bad blocks in the range > > > > being zeroed, then they all should be cleared (and hence all I/O should > > > > be aligned) otherwise error is returned. If yes, I could make that > > > > change. > > > > > > > > Downside of current interface is that it will clear as many blocks as > > > > possible in the given range and leave starting and end blocks poisoned > > > > (if it is unaligned) and not return error. That means a reader will > > > > get error on these blocks again and they will have to try to clear it > > > > again. > > > > > > Which is solved by having partial page writes always EIO on poisoned > > > memory. > > > > Ok, how about if I add one more patch to the series which will check > > if unwritten portion of the page has known poison. If it has, then > > -EIO is returned. > > > > > > Subject: pmem: zero page range return error if poisoned memory in unwritten area > > > > Filesystems call into pmem_dax_zero_page_range() to zero partial page upon > > truncate. If partial page is being zeroed, then at the end of operation > > file systems expect that there is no poison in the whole page (atleast > > known poison). > > > > So make sure part of the partial page which is not being written, does not > > have poison. If it does, return error. If there is poison in area of page > > being written, it will be cleared. > > No, I don't like that the zero operation is special cased compared to > the write case. I'd say let's make them identical for now. I.e. fail > the I/O at dax_direct_access() time. So basically __dax_zero_page_range() will only write zeros (and not try to clear any poison). Right? > I think the error clearing > interface should be an explicit / separate op rather than a > side-effect. What about an explicit interface for initializing newly > allocated blocks, and the only reliable way to destroy poison through > the filesystem is to free the block? Effectively pmem_make_request() is already that interface filesystems use to initialize blocks and clear poison. So we don't really have to introduce a new interface? Or you are suggesting separate dax_zero_page_range() interface which will always call into firmware to clear poison. And that will make sure latent poison is cleared as well and filesystem should use that for block initialization instead? I do like the idea of not having to differentiate between known poison and latent poison. Once a block has been initialized all poison should be cleared (known/latent). I am worried though that on large devices this might slowdown filesystem initialization a lot if they are zeroing large range of blocks. If yes, this sounds like two different patch series. First patch series takes care of removing blkdev_issue_zeroout() from __dax_zero_page_range() and couple of iomap related cleans christoph wanted. And second patch series for adding new dax operation to zero a range and always call info firmware to clear poison and modify filesystems accordingly. Thanks Vivek
On Mon, Feb 24, 2020 at 1:17 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > On Mon, Feb 24, 2020 at 12:52:13PM -0800, Dan Williams wrote: > > On Mon, Feb 24, 2020 at 12:13 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > > > > > On Mon, Feb 24, 2020 at 10:03:30AM +1100, Dave Chinner wrote: > > > > On Fri, Feb 21, 2020 at 03:17:59PM -0500, Vivek Goyal wrote: > > > > > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote: > > > > > > Vivek Goyal <vgoyal@redhat.com> writes: > > > > > > > > > > > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote: > > > > > > >> Vivek Goyal <vgoyal@redhat.com> writes: > > > > > > >> > > > > > > >> > Currently pmem_clear_poison() expects offset and len to be sector aligned. > > > > > > >> > Atleast that seems to be the assumption with which code has been written. > > > > > > >> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page() > > > > > > >> > and pmem_make_request() which will only passe sector aligned offset and len. > > > > > > >> > > > > > > > >> > Soon we want use this function from dax_zero_page_range() code path which > > > > > > >> > can try to zero arbitrary range of memory with-in a page. So update this > > > > > > >> > function to assume that offset and length can be arbitrary and do the > > > > > > >> > necessary alignments as needed. > > > > > > >> > > > > > > >> What caller will try to zero a range that is smaller than a sector? > > > > > > > > > > > > > > Hi Jeff, > > > > > > > > > > > > > > New dax zeroing interface (dax_zero_page_range()) can technically pass > > > > > > > a range which is less than a sector. Or which is bigger than a sector > > > > > > > but start and end are not aligned on sector boundaries. > > > > > > > > > > > > Sure, but who will call it with misaligned ranges? > > > > > > > > > > create a file foo.txt of size 4K and then truncate it. > > > > > > > > > > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to > > > > > 4095. > > > > > > > > This should fail with EIO. Only full page writes should clear the > > > > bad page state, and partial writes should therefore fail because > > > > they do not guarantee the data in the filesystem block is all good. > > > > > > > > If this zeroing was a buffered write to an address with a bad > > > > sector, then the writeback will fail and the user will (eventually) > > > > get an EIO on the file. > > > > > > > > DAX should do the same thing, except because the zeroing is > > > > synchronous (i.e. done directly by the truncate syscall) we can - > > > > and should - return EIO immediately. > > > > > > > > Indeed, with your code, if we then extend the file by truncating up > > > > back to 4k, then the range between 23 and 512 is still bad, even > > > > though we've successfully zeroed it and the user knows it. An > > > > attempt to read anywhere in this range (e.g. 10 bytes at offset 100) > > > > will fail with EIO, but reading 10 bytes at offset 2000 will > > > > succeed. > > > > > > > > That's *awful* behaviour to expose to userspace, especially when > > > > they look at the fs config and see that it's using both 4kB block > > > > and sector sizes... > > > > > > > > The only thing that makes sense from a filesystem perspective is > > > > clearing bad page state when entire filesystem blocks are > > > > overwritten. The data in a filesystem block is either good or bad, > > > > and it doesn't matter how many internal (kernel or device) sectors > > > > it has. > > > > > > > > > > And what happens to the rest? The caller is left to trip over the > > > > > > errors? That sounds pretty terrible. I really think there needs to be > > > > > > an explicit contract here. > > > > > > > > > > Ok, I think is is the contentious bit. Current interface > > > > > (__dax_zero_page_range()) either clears the poison (if I/O is aligned to > > > > > sector) or expects page to be free of poison. > > > > > > > > > > So in above example, of "truncate -s 23 foo.txt", currently I get an error > > > > > because range being zeroed is not sector aligned. So > > > > > __dax_zero_page_range() falls back to calling direct_access(). Which > > > > > fails because there are poisoned sectors in the page. > > > > > > > > > > With my patches, dax_zero_page_range(), clears the poison from sector 1 to > > > > > 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511 > > > > > and returns success. > > > > > > > > Ok, kernel sectors are not the unit of granularity bad page state > > > > should be managed at. They don't match page state granularity, and > > > > they don't match filesystem block granularity, and the whacky > > > > "partial writes silently succeed, reads fail unpredictably" > > > > assymetry it leads to will just cause problems for users. > > > > > > > > > So question is, is this better behavior or worse behavior. If sector 0 > > > > > was poisoned, it will continue to remain poisoned and caller will come > > > > > to know about it on next read and then it should try to truncate file > > > > > to length 0 or unlink file or restore that file to get rid of poison. > > > > > > > > Worse, because the filesystem can't track what sub-parts of the > > > > block are bad and that leads to inconsistent data integrity status > > > > being exposed to userspace. > > > > > > > > > > > > > IOW, if a partial block is being zeroed and if it is poisoned, caller > > > > > will not be return an error and poison will not be cleared and memory > > > > > will be zeroed. What do we expect in such cases. > > > > > > > > > > Do we expect an interface where if there are any bad blocks in the range > > > > > being zeroed, then they all should be cleared (and hence all I/O should > > > > > be aligned) otherwise error is returned. If yes, I could make that > > > > > change. > > > > > > > > > > Downside of current interface is that it will clear as many blocks as > > > > > possible in the given range and leave starting and end blocks poisoned > > > > > (if it is unaligned) and not return error. That means a reader will > > > > > get error on these blocks again and they will have to try to clear it > > > > > again. > > > > > > > > Which is solved by having partial page writes always EIO on poisoned > > > > memory. > > > > > > Ok, how about if I add one more patch to the series which will check > > > if unwritten portion of the page has known poison. If it has, then > > > -EIO is returned. > > > > > > > > > Subject: pmem: zero page range return error if poisoned memory in unwritten area > > > > > > Filesystems call into pmem_dax_zero_page_range() to zero partial page upon > > > truncate. If partial page is being zeroed, then at the end of operation > > > file systems expect that there is no poison in the whole page (atleast > > > known poison). > > > > > > So make sure part of the partial page which is not being written, does not > > > have poison. If it does, return error. If there is poison in area of page > > > being written, it will be cleared. > > > > No, I don't like that the zero operation is special cased compared to > > the write case. I'd say let's make them identical for now. I.e. fail > > the I/O at dax_direct_access() time. > > So basically __dax_zero_page_range() will only write zeros (and not > try to clear any poison). Right? Yes, the zero operation would have already failed at the dax_direct_access() step if there was present poison. > > I think the error clearing > > interface should be an explicit / separate op rather than a > > side-effect. What about an explicit interface for initializing newly > > allocated blocks, and the only reliable way to destroy poison through > > the filesystem is to free the block? > > Effectively pmem_make_request() is already that interface filesystems > use to initialize blocks and clear poison. So we don't really have to > introduce a new interface? pmem_make_request() is shared with the I/O path and is too low in the stack to understand intent. DAX intercepts the I/O path closer to the filesystem and can understand zeroing vs writing today. I'm proposing we go a step further and make DAX understand free-to-allocated-block initialization instead of just zeroing. Inject the error clearing into that initialization interface. > Or you are suggesting separate dax_zero_page_range() interface which will > always call into firmware to clear poison. And that will make sure latent > poison is cleared as well and filesystem should use that for block > initialization instead? Yes, except latent poison would not be cleared until the zeroing is implemented with movdir64b instead of callouts to firmware. It's otherwise too slow to call out to firmware unconditionally. > I do like the idea of not having to differentiate > between known poison and latent poison. Once a block has been initialized > all poison should be cleared (known/latent). I am worried though that > on large devices this might slowdown filesystem initialization a lot > if they are zeroing large range of blocks. > > If yes, this sounds like two different patch series. First patch series > takes care of removing blkdev_issue_zeroout() from > __dax_zero_page_range() and couple of iomap related cleans christoph > wanted. > > And second patch series for adding new dax operation to zero a range > and always call info firmware to clear poison and modify filesystems > accordingly. Yes, but they may need to be merged together. I don't want to regress the ability of a block-aligned hole-punch to clear errors.
Dan Williams <dan.j.williams@intel.com> writes: >> Let's just focus on reporting errors when we know we have them. > > That's the problem in my eyes. If software needs to contend with > latent error reporting then it should always contend otherwise > software has multiple error models to wrangle. The only way for an application to know that the data has been written successfully would be to issue a read after every write. That's not a performance hit most applications are willing to take. And, of course, the media can still go bad at a later time, so it only guarantees the data is accessible immediately after having been written. What I'm suggesting is that we should not complete a write successfully if we know that the data will not be retrievable. I wouldn't call this adding an extra error model to contend with. Applications should already be checking for errors on write. Does that make sense? Are we talking past each other? > Setting that aside we can start with just treating zeroing the same as > the copy_from_iter() case and fail the I/O at the dax_direct_access() > step. OK. > I'd rather have a separate op that filesystems can use to clear errors > at block allocation time that can be enforced to have the correct > alignment. So would file systems always call that routine instead of zeroing, or would they first check to see if there are badblocks? -Jeff
On Mon, Feb 24, 2020 at 1:53 PM Jeff Moyer <jmoyer@redhat.com> wrote: > > Dan Williams <dan.j.williams@intel.com> writes: > > >> Let's just focus on reporting errors when we know we have them. > > > > That's the problem in my eyes. If software needs to contend with > > latent error reporting then it should always contend otherwise > > software has multiple error models to wrangle. > > The only way for an application to know that the data has been written > successfully would be to issue a read after every write. That's not a > performance hit most applications are willing to take. And, of course, > the media can still go bad at a later time, so it only guarantees the > data is accessible immediately after having been written. > > What I'm suggesting is that we should not complete a write successfully > if we know that the data will not be retrievable. I wouldn't call this > adding an extra error model to contend with. Applications should > already be checking for errors on write. > > Does that make sense? Are we talking past each other? The badblock list is late to update in both directions, late to add entries that the scrub needs to find and late to delete entries that were inadvertently cleared by cache-line writes that did not first ingest the poison for a read-modify-write. So I see the above as being wishful in using the error list as the hard source of truth and unfortunate to up-level all sub-sector error entries into full PAGE_SIZE data offline events. I'm hoping we can find a way to make the error handling more fine grained over time, but for the current patch, managing the blast radius as PAGE_SIZE granularity at least matches the zero path with the write path. > > Setting that aside we can start with just treating zeroing the same as > > the copy_from_iter() case and fail the I/O at the dax_direct_access() > > step. > > OK. > > > I'd rather have a separate op that filesystems can use to clear errors > > at block allocation time that can be enforced to have the correct > > alignment. > > So would file systems always call that routine instead of zeroing, or > would they first check to see if there are badblocks? The proposal is that filesystems distinguish zeroing from free-block allocation/initialization such that the fsdax implementation directs initialization to a driver callback. This "initialization op" would take care to check for poison and clear it. All other dax paths would not consult the badblocks list.
On Mon, Feb 24, 2020 at 01:32:58PM -0800, Dan Williams wrote: [..] > > > > Ok, how about if I add one more patch to the series which will check > > > > if unwritten portion of the page has known poison. If it has, then > > > > -EIO is returned. > > > > > > > > > > > > Subject: pmem: zero page range return error if poisoned memory in unwritten area > > > > > > > > Filesystems call into pmem_dax_zero_page_range() to zero partial page upon > > > > truncate. If partial page is being zeroed, then at the end of operation > > > > file systems expect that there is no poison in the whole page (atleast > > > > known poison). > > > > > > > > So make sure part of the partial page which is not being written, does not > > > > have poison. If it does, return error. If there is poison in area of page > > > > being written, it will be cleared. > > > > > > No, I don't like that the zero operation is special cased compared to > > > the write case. I'd say let's make them identical for now. I.e. fail > > > the I/O at dax_direct_access() time. > > > > So basically __dax_zero_page_range() will only write zeros (and not > > try to clear any poison). Right? > > Yes, the zero operation would have already failed at the > dax_direct_access() step if there was present poison. > > > > I think the error clearing > > > interface should be an explicit / separate op rather than a > > > side-effect. What about an explicit interface for initializing newly > > > allocated blocks, and the only reliable way to destroy poison through > > > the filesystem is to free the block? > > > > Effectively pmem_make_request() is already that interface filesystems > > use to initialize blocks and clear poison. So we don't really have to > > introduce a new interface? > > pmem_make_request() is shared with the I/O path and is too low in the > stack to understand intent. DAX intercepts the I/O path closer to the > filesystem and can understand zeroing vs writing today. I'm proposing > we go a step further and make DAX understand free-to-allocated-block > initialization instead of just zeroing. Inject the error clearing into > that initialization interface. > > > Or you are suggesting separate dax_zero_page_range() interface which will > > always call into firmware to clear poison. And that will make sure latent > > poison is cleared as well and filesystem should use that for block > > initialization instead? > > Yes, except latent poison would not be cleared until the zeroing is > implemented with movdir64b instead of callouts to firmware. It's > otherwise too slow to call out to firmware unconditionally. > > > I do like the idea of not having to differentiate > > between known poison and latent poison. Once a block has been initialized > > all poison should be cleared (known/latent). I am worried though that > > on large devices this might slowdown filesystem initialization a lot > > if they are zeroing large range of blocks. > > > > If yes, this sounds like two different patch series. First patch series > > takes care of removing blkdev_issue_zeroout() from > > __dax_zero_page_range() and couple of iomap related cleans christoph > > wanted. > > > > And second patch series for adding new dax operation to zero a range > > and always call info firmware to clear poison and modify filesystems > > accordingly. > > Yes, but they may need to be merged together. I don't want to regress > the ability of a block-aligned hole-punch to clear errors. Hi Dan, IIUC, block aligned hole punch don't go through __dax_zero_page_range() path. Instead they call blkdev_issue_zeroout() at later point of time. Only partial block zeroing path is taking __dax_zero_page_range(). So even if we remove poison clearing code from __dax_zero_page_range(), there should not be a regression w.r.t full block zeroing. Only possible regression will be if somebody was doing partial block zeroing on sector boundary, then poison will not be cleared. We now seem to be discussing too many issues w.r.t poison clearing and dax. Atleast 3 issues are mentioned in this thread. A. Get rid of dependency on block device in dax zeroing path. (__dax_zero_page_range) B. Provide a way to clear latent poison. And possibly use movdir64b to do that and make filesystems use that interface for initialization of blocks. C. Dax zero operation is clearing known poison while copy_from_iter() is not. I guess this ship has already sailed. If we change it now, somebody will complain of some regression. For issue A, there are two possible ways to deal with it. 1. Implement a dax method to zero page. And this method will also clear known poison. This is what my patch series is doing. 2. Just get rid of blkdev_issue_zeroout() from __dax_zero_page_range() so that no poison will be cleared in __dax_zero_page_range() path. This path is currently used in partial page zeroing path and full filesystem block zeroing happens with blkdev_issue_zeroout(). There is a small chance of regression here in case of sector aligned partial block zeroing. My patch series takes care of issue A without any regressions. In fact it improves current interface. For example, currently "truncate -s 512 foo.txt" will succeed even if first sector in the block is poisoned. My patch series fixes it. Current implementation will return error on if any non sector aligned truncate is done and any of the sector is poisoned. My implementation will not return error if poisoned can be cleared as part of zeroing. It will return only if poison is present in non-zeoring part. Why don't we solve one issue A now and deal with issue B and C later in a sepaprate patch series. This patch series gets rid of dependency on block device in dax path and also makes current zeroing interface better. Thanks Vivek
On Tue, Feb 25, 2020 at 5:37 AM Vivek Goyal <vgoyal@redhat.com> wrote: > > On Mon, Feb 24, 2020 at 01:32:58PM -0800, Dan Williams wrote: > > [..] > > > > > Ok, how about if I add one more patch to the series which will check > > > > > if unwritten portion of the page has known poison. If it has, then > > > > > -EIO is returned. > > > > > > > > > > > > > > > Subject: pmem: zero page range return error if poisoned memory in unwritten area > > > > > > > > > > Filesystems call into pmem_dax_zero_page_range() to zero partial page upon > > > > > truncate. If partial page is being zeroed, then at the end of operation > > > > > file systems expect that there is no poison in the whole page (atleast > > > > > known poison). > > > > > > > > > > So make sure part of the partial page which is not being written, does not > > > > > have poison. If it does, return error. If there is poison in area of page > > > > > being written, it will be cleared. > > > > > > > > No, I don't like that the zero operation is special cased compared to > > > > the write case. I'd say let's make them identical for now. I.e. fail > > > > the I/O at dax_direct_access() time. > > > > > > So basically __dax_zero_page_range() will only write zeros (and not > > > try to clear any poison). Right? > > > > Yes, the zero operation would have already failed at the > > dax_direct_access() step if there was present poison. > > > > > > I think the error clearing > > > > interface should be an explicit / separate op rather than a > > > > side-effect. What about an explicit interface for initializing newly > > > > allocated blocks, and the only reliable way to destroy poison through > > > > the filesystem is to free the block? > > > > > > Effectively pmem_make_request() is already that interface filesystems > > > use to initialize blocks and clear poison. So we don't really have to > > > introduce a new interface? > > > > pmem_make_request() is shared with the I/O path and is too low in the > > stack to understand intent. DAX intercepts the I/O path closer to the > > filesystem and can understand zeroing vs writing today. I'm proposing > > we go a step further and make DAX understand free-to-allocated-block > > initialization instead of just zeroing. Inject the error clearing into > > that initialization interface. > > > > > Or you are suggesting separate dax_zero_page_range() interface which will > > > always call into firmware to clear poison. And that will make sure latent > > > poison is cleared as well and filesystem should use that for block > > > initialization instead? > > > > Yes, except latent poison would not be cleared until the zeroing is > > implemented with movdir64b instead of callouts to firmware. It's > > otherwise too slow to call out to firmware unconditionally. > > > > > I do like the idea of not having to differentiate > > > between known poison and latent poison. Once a block has been initialized > > > all poison should be cleared (known/latent). I am worried though that > > > on large devices this might slowdown filesystem initialization a lot > > > if they are zeroing large range of blocks. > > > > > > If yes, this sounds like two different patch series. First patch series > > > takes care of removing blkdev_issue_zeroout() from > > > __dax_zero_page_range() and couple of iomap related cleans christoph > > > wanted. > > > > > > And second patch series for adding new dax operation to zero a range > > > and always call info firmware to clear poison and modify filesystems > > > accordingly. > > > > Yes, but they may need to be merged together. I don't want to regress > > the ability of a block-aligned hole-punch to clear errors. > > Hi Dan, > > IIUC, block aligned hole punch don't go through __dax_zero_page_range() > path. Instead they call blkdev_issue_zeroout() at later point of time. > > Only partial block zeroing path is taking __dax_zero_page_range(). So > even if we remove poison clearing code from __dax_zero_page_range(), > there should not be a regression w.r.t full block zeroing. Only possible > regression will be if somebody was doing partial block zeroing on sector > boundary, then poison will not be cleared. > > We now seem to be discussing too many issues w.r.t poison clearing > and dax. Atleast 3 issues are mentioned in this thread. > > A. Get rid of dependency on block device in dax zeroing path. > (__dax_zero_page_range) > > B. Provide a way to clear latent poison. And possibly use movdir64b to > do that and make filesystems use that interface for initialization > of blocks. > > C. Dax zero operation is clearing known poison while copy_from_iter() is > not. I guess this ship has already sailed. If we change it now, > somebody will complain of some regression. > > For issue A, there are two possible ways to deal with it. > > 1. Implement a dax method to zero page. And this method will also clear > known poison. This is what my patch series is doing. > > 2. Just get rid of blkdev_issue_zeroout() from __dax_zero_page_range() > so that no poison will be cleared in __dax_zero_page_range() path. This > path is currently used in partial page zeroing path and full filesystem > block zeroing happens with blkdev_issue_zeroout(). There is a small > chance of regression here in case of sector aligned partial block > zeroing. > > My patch series takes care of issue A without any regressions. In fact it > improves current interface. For example, currently "truncate -s 512 > foo.txt" will succeed even if first sector in the block is poisoned. My > patch series fixes it. Current implementation will return error on if any > non sector aligned truncate is done and any of the sector is poisoned. My > implementation will not return error if poisoned can be cleared as part > of zeroing. It will return only if poison is present in non-zeoring part. That asymmetry makes the implementation too much of a special case. If the dax mapping path forces error boundaries on PAGE_SIZE blocks then so should zeroing. > > Why don't we solve one issue A now and deal with issue B and C later in > a sepaprate patch series. This patch series gets rid of dependency on > block device in dax path and also makes current zeroing interface better. I'm ok with replacing blkdev_issue_zeroout() with a dax operation callback that deals with page aligned entries. That change at least makes the error boundary symmetric across copy_from_iter() and the zeroing path.
On Tue, Feb 25, 2020 at 08:25:27AM -0800, Dan Williams wrote: > On Tue, Feb 25, 2020 at 5:37 AM Vivek Goyal <vgoyal@redhat.com> wrote: > > > > On Mon, Feb 24, 2020 at 01:32:58PM -0800, Dan Williams wrote: > > > > [..] > > > > > > Ok, how about if I add one more patch to the series which will check > > > > > > if unwritten portion of the page has known poison. If it has, then > > > > > > -EIO is returned. > > > > > > > > > > > > > > > > > > Subject: pmem: zero page range return error if poisoned memory in unwritten area > > > > > > > > > > > > Filesystems call into pmem_dax_zero_page_range() to zero partial page upon > > > > > > truncate. If partial page is being zeroed, then at the end of operation > > > > > > file systems expect that there is no poison in the whole page (atleast > > > > > > known poison). > > > > > > > > > > > > So make sure part of the partial page which is not being written, does not > > > > > > have poison. If it does, return error. If there is poison in area of page > > > > > > being written, it will be cleared. > > > > > > > > > > No, I don't like that the zero operation is special cased compared to > > > > > the write case. I'd say let's make them identical for now. I.e. fail > > > > > the I/O at dax_direct_access() time. > > > > > > > > So basically __dax_zero_page_range() will only write zeros (and not > > > > try to clear any poison). Right? > > > > > > Yes, the zero operation would have already failed at the > > > dax_direct_access() step if there was present poison. > > > > > > > > I think the error clearing > > > > > interface should be an explicit / separate op rather than a > > > > > side-effect. What about an explicit interface for initializing newly > > > > > allocated blocks, and the only reliable way to destroy poison through > > > > > the filesystem is to free the block? > > > > > > > > Effectively pmem_make_request() is already that interface filesystems > > > > use to initialize blocks and clear poison. So we don't really have to > > > > introduce a new interface? > > > > > > pmem_make_request() is shared with the I/O path and is too low in the > > > stack to understand intent. DAX intercepts the I/O path closer to the > > > filesystem and can understand zeroing vs writing today. I'm proposing > > > we go a step further and make DAX understand free-to-allocated-block > > > initialization instead of just zeroing. Inject the error clearing into > > > that initialization interface. > > > > > > > Or you are suggesting separate dax_zero_page_range() interface which will > > > > always call into firmware to clear poison. And that will make sure latent > > > > poison is cleared as well and filesystem should use that for block > > > > initialization instead? > > > > > > Yes, except latent poison would not be cleared until the zeroing is > > > implemented with movdir64b instead of callouts to firmware. It's > > > otherwise too slow to call out to firmware unconditionally. > > > > > > > I do like the idea of not having to differentiate > > > > between known poison and latent poison. Once a block has been initialized > > > > all poison should be cleared (known/latent). I am worried though that > > > > on large devices this might slowdown filesystem initialization a lot > > > > if they are zeroing large range of blocks. > > > > > > > > If yes, this sounds like two different patch series. First patch series > > > > takes care of removing blkdev_issue_zeroout() from > > > > __dax_zero_page_range() and couple of iomap related cleans christoph > > > > wanted. > > > > > > > > And second patch series for adding new dax operation to zero a range > > > > and always call info firmware to clear poison and modify filesystems > > > > accordingly. > > > > > > Yes, but they may need to be merged together. I don't want to regress > > > the ability of a block-aligned hole-punch to clear errors. > > > > Hi Dan, > > > > IIUC, block aligned hole punch don't go through __dax_zero_page_range() > > path. Instead they call blkdev_issue_zeroout() at later point of time. > > > > Only partial block zeroing path is taking __dax_zero_page_range(). So > > even if we remove poison clearing code from __dax_zero_page_range(), > > there should not be a regression w.r.t full block zeroing. Only possible > > regression will be if somebody was doing partial block zeroing on sector > > boundary, then poison will not be cleared. > > > > We now seem to be discussing too many issues w.r.t poison clearing > > and dax. Atleast 3 issues are mentioned in this thread. > > > > A. Get rid of dependency on block device in dax zeroing path. > > (__dax_zero_page_range) > > > > B. Provide a way to clear latent poison. And possibly use movdir64b to > > do that and make filesystems use that interface for initialization > > of blocks. > > > > C. Dax zero operation is clearing known poison while copy_from_iter() is > > not. I guess this ship has already sailed. If we change it now, > > somebody will complain of some regression. > > > > For issue A, there are two possible ways to deal with it. > > > > 1. Implement a dax method to zero page. And this method will also clear > > known poison. This is what my patch series is doing. > > > > 2. Just get rid of blkdev_issue_zeroout() from __dax_zero_page_range() > > so that no poison will be cleared in __dax_zero_page_range() path. This > > path is currently used in partial page zeroing path and full filesystem > > block zeroing happens with blkdev_issue_zeroout(). There is a small > > chance of regression here in case of sector aligned partial block > > zeroing. > > > > My patch series takes care of issue A without any regressions. In fact it > > improves current interface. For example, currently "truncate -s 512 > > foo.txt" will succeed even if first sector in the block is poisoned. My > > patch series fixes it. Current implementation will return error on if any > > non sector aligned truncate is done and any of the sector is poisoned. My > > implementation will not return error if poisoned can be cleared as part > > of zeroing. It will return only if poison is present in non-zeoring part. > > That asymmetry makes the implementation too much of a special case. If > the dax mapping path forces error boundaries on PAGE_SIZE blocks then > so should zeroing. > > > > > Why don't we solve one issue A now and deal with issue B and C later in > > a sepaprate patch series. This patch series gets rid of dependency on > > block device in dax path and also makes current zeroing interface better. > > I'm ok with replacing blkdev_issue_zeroout() with a dax operation > callback that deals with page aligned entries. That change at least > makes the error boundary symmetric across copy_from_iter() and the > zeroing path. IIUC, you are suggesting that modify dax_zero_page_range() to take page aligned start and size and call this interface from __dax_zero_page_range() and get rid of blkdev_issue_zeroout() in that path? Something like. __dax_zero_page_range() { if(page_aligned_io) call_dax_page_zero_range() else use_direct_access_and_memcpy; } And other callers of blkdev_issue_zeroout() in filesystems can migrate to calling dax_zero_page_range() instead. If yes, I am not seeing what advantage do we get by this change. - __dax_zero_page_range() seems to be called by only partial block zeroing code. So dax_zero_page_range() call will remain unused. - dax_zero_page_range() will be exact replacement of blkdev_issue_zeroout() so filesystems will not gain anything. Just that it will create a dax specific hook. In that case it might be simpler to just get rid of blkdev_issue_zeroout() call from __dax_zero_page_range() and make sure there are no callers of full block zeroing from this path. Thanks Vivek
Dan Williams <dan.j.williams@intel.com> writes: > On Mon, Feb 24, 2020 at 1:53 PM Jeff Moyer <jmoyer@redhat.com> wrote: >> >> Dan Williams <dan.j.williams@intel.com> writes: >> >> >> Let's just focus on reporting errors when we know we have them. >> > >> > That's the problem in my eyes. If software needs to contend with >> > latent error reporting then it should always contend otherwise >> > software has multiple error models to wrangle. >> >> The only way for an application to know that the data has been written >> successfully would be to issue a read after every write. That's not a >> performance hit most applications are willing to take. And, of course, >> the media can still go bad at a later time, so it only guarantees the >> data is accessible immediately after having been written. >> >> What I'm suggesting is that we should not complete a write successfully >> if we know that the data will not be retrievable. I wouldn't call this >> adding an extra error model to contend with. Applications should >> already be checking for errors on write. >> >> Does that make sense? Are we talking past each other? > > The badblock list is late to update in both directions, late to add > entries that the scrub needs to find and late to delete entries that > were inadvertently cleared by cache-line writes that did not first > ingest the poison for a read-modify-write. We aren't looking for perfection. If the badblocks list is populated, then report the error instead of letting the user write data that we know they won't be able to access later. You have confused me, though, since I thought that stores to bad media would not clear errors. Perhaps you are talking about some future hardware implementation that doesn't yet exist? > So I see the above as being wishful in using the error list as the > hard source of truth and unfortunate to up-level all sub-sector error > entries into full PAGE_SIZE data offline events. The page size granularity is only an issue for mmap(). If you are using read/write, then 512 byte granularity can be achieved. Even with mmap, if you encounter an error on a 4k page, you can query the status of each sector in that page to isolate the error. So I'm not quite sure I understand what you're getting at. > I'm hoping we can find a way to make the error handling more fine > grained over time, but for the current patch, managing the blast > radius as PAGE_SIZE granularity at least matches the zero path with > the write path. I think the write path can do 512 byte granularity, not page size. Anyway, I think we've gone far enough off into the weeds that more patches will have to be posted for debate. :) >> > Setting that aside we can start with just treating zeroing the same as >> > the copy_from_iter() case and fail the I/O at the dax_direct_access() >> > step. >> >> OK. >> >> > I'd rather have a separate op that filesystems can use to clear errors >> > at block allocation time that can be enforced to have the correct >> > alignment. >> >> So would file systems always call that routine instead of zeroing, or >> would they first check to see if there are badblocks? > > The proposal is that filesystems distinguish zeroing from free-block > allocation/initialization such that the fsdax implementation directs > initialization to a driver callback. This "initialization op" would > take care to check for poison and clear it. All other dax paths would > not consult the badblocks list. What do you mean by "all other dax paths?" Would that include mmap/direct_access? Because that definitely should still consult the badblocks list. I'm not against having a separate operation for clearing errors, but I guess I'm not convinced it's cleaner, either. -Jeff
On Tue, Feb 25, 2020 at 12:33 PM Jeff Moyer <jmoyer@redhat.com> wrote: > > Dan Williams <dan.j.williams@intel.com> writes: > > > On Mon, Feb 24, 2020 at 1:53 PM Jeff Moyer <jmoyer@redhat.com> wrote: > >> > >> Dan Williams <dan.j.williams@intel.com> writes: > >> > >> >> Let's just focus on reporting errors when we know we have them. > >> > > >> > That's the problem in my eyes. If software needs to contend with > >> > latent error reporting then it should always contend otherwise > >> > software has multiple error models to wrangle. > >> > >> The only way for an application to know that the data has been written > >> successfully would be to issue a read after every write. That's not a > >> performance hit most applications are willing to take. And, of course, > >> the media can still go bad at a later time, so it only guarantees the > >> data is accessible immediately after having been written. > >> > >> What I'm suggesting is that we should not complete a write successfully > >> if we know that the data will not be retrievable. I wouldn't call this > >> adding an extra error model to contend with. Applications should > >> already be checking for errors on write. > >> > >> Does that make sense? Are we talking past each other? > > > > The badblock list is late to update in both directions, late to add > > entries that the scrub needs to find and late to delete entries that > > were inadvertently cleared by cache-line writes that did not first > > ingest the poison for a read-modify-write. > > We aren't looking for perfection. If the badblocks list is populated, > then report the error instead of letting the user write data that we > know they won't be able to access later. > > You have confused me, though, since I thought that stores to bad media > would not clear errors. Perhaps you are talking about some future > hardware implementation that doesn't yet exist? No, today cacheline aligned DMA, and cpu cachelines that are fully dirtied without a demand fill can end up clearing poison. The movdir64b instruction provides a way to force this behavior from the cpu where it was previously luck. > > So I see the above as being wishful in using the error list as the > > hard source of truth and unfortunate to up-level all sub-sector error > > entries into full PAGE_SIZE data offline events. > > The page size granularity is only an issue for mmap(). If you are using > read/write, then 512 byte granularity can be achieved. Even with mmap, > if you encounter an error on a 4k page, you can query the status of each > sector in that page to isolate the error. So I'm not quite sure I > understand what you're getting at. I'm getting at the fact that we don't allow mmap on PAGE_SIZE granularity in the presence of errors, and don't allow dax I/O to the page when an error is present. Only non-dax direct-I/O can read / write at sub-page granularity in the presence of errors. The interface to query the status is not coordinated with the filesystem, but that's a whole other discussion. Yes, we're getting a bit off in the weeds. > > I'm hoping we can find a way to make the error handling more fine > > grained over time, but for the current patch, managing the blast > > radius as PAGE_SIZE granularity at least matches the zero path with > > the write path. > > I think the write path can do 512 byte granularity, not page size. > Anyway, I think we've gone far enough off into the weeds that more > patches will have to be posted for debate. :) > It can't, see dax_iomap_actor(). That call to dax_direct_access() will fail if it hits a known badblock. > >> > Setting that aside we can start with just treating zeroing the same as > >> > the copy_from_iter() case and fail the I/O at the dax_direct_access() > >> > step. > >> > >> OK. > >> > >> > I'd rather have a separate op that filesystems can use to clear errors > >> > at block allocation time that can be enforced to have the correct > >> > alignment. > >> > >> So would file systems always call that routine instead of zeroing, or > >> would they first check to see if there are badblocks? > > > > The proposal is that filesystems distinguish zeroing from free-block > > allocation/initialization such that the fsdax implementation directs > > initialization to a driver callback. This "initialization op" would > > take care to check for poison and clear it. All other dax paths would > > not consult the badblocks list. > > What do you mean by "all other dax paths?" Would that include > mmap/direct_access? Because that definitely should still consult the > badblocks list. dax_direct_access() consults the badblock list, dax_copy_{to,from}_iter() do not, and badblocks discovered after a page is mapped do not invalidate the page unless the poison is consumed. > I'm not against having a separate operation for clearing errors, but I > guess I'm not convinced it's cleaner, either. The idea with a separate op is to have an explicit ABI to clear errors like page-aligned-hole-punch (FALLOC_FL_PUNCH_ERROR?) vs some implicit side effect of direct-I/O writes. I also feel like we're heading in a direction that tries to avoid relying on the cpu machine-check recovery path at all costs. That's the kernel's prerogative, of course, but it seems like at a minimum we're not quite on the same page about what role machine-check recovery plays in the error handling model.
On Tue, Feb 25, 2020 at 12:08 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > On Tue, Feb 25, 2020 at 08:25:27AM -0800, Dan Williams wrote: > > On Tue, Feb 25, 2020 at 5:37 AM Vivek Goyal <vgoyal@redhat.com> wrote: > > > > > > On Mon, Feb 24, 2020 at 01:32:58PM -0800, Dan Williams wrote: > > > > > > [..] > > > > > > > Ok, how about if I add one more patch to the series which will check > > > > > > > if unwritten portion of the page has known poison. If it has, then > > > > > > > -EIO is returned. > > > > > > > > > > > > > > > > > > > > > Subject: pmem: zero page range return error if poisoned memory in unwritten area > > > > > > > > > > > > > > Filesystems call into pmem_dax_zero_page_range() to zero partial page upon > > > > > > > truncate. If partial page is being zeroed, then at the end of operation > > > > > > > file systems expect that there is no poison in the whole page (atleast > > > > > > > known poison). > > > > > > > > > > > > > > So make sure part of the partial page which is not being written, does not > > > > > > > have poison. If it does, return error. If there is poison in area of page > > > > > > > being written, it will be cleared. > > > > > > > > > > > > No, I don't like that the zero operation is special cased compared to > > > > > > the write case. I'd say let's make them identical for now. I.e. fail > > > > > > the I/O at dax_direct_access() time. > > > > > > > > > > So basically __dax_zero_page_range() will only write zeros (and not > > > > > try to clear any poison). Right? > > > > > > > > Yes, the zero operation would have already failed at the > > > > dax_direct_access() step if there was present poison. > > > > > > > > > > I think the error clearing > > > > > > interface should be an explicit / separate op rather than a > > > > > > side-effect. What about an explicit interface for initializing newly > > > > > > allocated blocks, and the only reliable way to destroy poison through > > > > > > the filesystem is to free the block? > > > > > > > > > > Effectively pmem_make_request() is already that interface filesystems > > > > > use to initialize blocks and clear poison. So we don't really have to > > > > > introduce a new interface? > > > > > > > > pmem_make_request() is shared with the I/O path and is too low in the > > > > stack to understand intent. DAX intercepts the I/O path closer to the > > > > filesystem and can understand zeroing vs writing today. I'm proposing > > > > we go a step further and make DAX understand free-to-allocated-block > > > > initialization instead of just zeroing. Inject the error clearing into > > > > that initialization interface. > > > > > > > > > Or you are suggesting separate dax_zero_page_range() interface which will > > > > > always call into firmware to clear poison. And that will make sure latent > > > > > poison is cleared as well and filesystem should use that for block > > > > > initialization instead? > > > > > > > > Yes, except latent poison would not be cleared until the zeroing is > > > > implemented with movdir64b instead of callouts to firmware. It's > > > > otherwise too slow to call out to firmware unconditionally. > > > > > > > > > I do like the idea of not having to differentiate > > > > > between known poison and latent poison. Once a block has been initialized > > > > > all poison should be cleared (known/latent). I am worried though that > > > > > on large devices this might slowdown filesystem initialization a lot > > > > > if they are zeroing large range of blocks. > > > > > > > > > > If yes, this sounds like two different patch series. First patch series > > > > > takes care of removing blkdev_issue_zeroout() from > > > > > __dax_zero_page_range() and couple of iomap related cleans christoph > > > > > wanted. > > > > > > > > > > And second patch series for adding new dax operation to zero a range > > > > > and always call info firmware to clear poison and modify filesystems > > > > > accordingly. > > > > > > > > Yes, but they may need to be merged together. I don't want to regress > > > > the ability of a block-aligned hole-punch to clear errors. > > > > > > Hi Dan, > > > > > > IIUC, block aligned hole punch don't go through __dax_zero_page_range() > > > path. Instead they call blkdev_issue_zeroout() at later point of time. > > > > > > Only partial block zeroing path is taking __dax_zero_page_range(). So > > > even if we remove poison clearing code from __dax_zero_page_range(), > > > there should not be a regression w.r.t full block zeroing. Only possible > > > regression will be if somebody was doing partial block zeroing on sector > > > boundary, then poison will not be cleared. > > > > > > We now seem to be discussing too many issues w.r.t poison clearing > > > and dax. Atleast 3 issues are mentioned in this thread. > > > > > > A. Get rid of dependency on block device in dax zeroing path. > > > (__dax_zero_page_range) > > > > > > B. Provide a way to clear latent poison. And possibly use movdir64b to > > > do that and make filesystems use that interface for initialization > > > of blocks. > > > > > > C. Dax zero operation is clearing known poison while copy_from_iter() is > > > not. I guess this ship has already sailed. If we change it now, > > > somebody will complain of some regression. > > > > > > For issue A, there are two possible ways to deal with it. > > > > > > 1. Implement a dax method to zero page. And this method will also clear > > > known poison. This is what my patch series is doing. > > > > > > 2. Just get rid of blkdev_issue_zeroout() from __dax_zero_page_range() > > > so that no poison will be cleared in __dax_zero_page_range() path. This > > > path is currently used in partial page zeroing path and full filesystem > > > block zeroing happens with blkdev_issue_zeroout(). There is a small > > > chance of regression here in case of sector aligned partial block > > > zeroing. > > > > > > My patch series takes care of issue A without any regressions. In fact it > > > improves current interface. For example, currently "truncate -s 512 > > > foo.txt" will succeed even if first sector in the block is poisoned. My > > > patch series fixes it. Current implementation will return error on if any > > > non sector aligned truncate is done and any of the sector is poisoned. My > > > implementation will not return error if poisoned can be cleared as part > > > of zeroing. It will return only if poison is present in non-zeoring part. > > > > That asymmetry makes the implementation too much of a special case. If > > the dax mapping path forces error boundaries on PAGE_SIZE blocks then > > so should zeroing. > > > > > > > > Why don't we solve one issue A now and deal with issue B and C later in > > > a sepaprate patch series. This patch series gets rid of dependency on > > > block device in dax path and also makes current zeroing interface better. > > > > I'm ok with replacing blkdev_issue_zeroout() with a dax operation > > callback that deals with page aligned entries. That change at least > > makes the error boundary symmetric across copy_from_iter() and the > > zeroing path. > > IIUC, you are suggesting that modify dax_zero_page_range() to take page > aligned start and size and call this interface from > __dax_zero_page_range() and get rid of blkdev_issue_zeroout() in that > path? > > Something like. > > __dax_zero_page_range() { > if(page_aligned_io) > call_dax_page_zero_range() > else > use_direct_access_and_memcpy; > } > > And other callers of blkdev_issue_zeroout() in filesystems can migrate > to calling dax_zero_page_range() instead. > > If yes, I am not seeing what advantage do we get by this change. > > - __dax_zero_page_range() seems to be called by only partial block > zeroing code. So dax_zero_page_range() call will remain unused. > > > - dax_zero_page_range() will be exact replacement of > blkdev_issue_zeroout() so filesystems will not gain anything. Just that > it will create a dax specific hook. > > In that case it might be simpler to just get rid of blkdev_issue_zeroout() > call from __dax_zero_page_range() and make sure there are no callers of > full block zeroing from this path. I think you're right. The path I'm concerned about not regressing is the error clearing on new block allocation and we get that already via xfs_zero_extent() and sb_issue_zeroout(). For your fs we'll want a dax-device equivalent for that path, but that does mean that __dax_zero_page_range() stays out of the error clearing game.
On 2/24/2020 4:26 PM, Dan Williams wrote: > On Mon, Feb 24, 2020 at 1:53 PM Jeff Moyer <jmoyer@redhat.com> wrote: >> >> Dan Williams <dan.j.williams@intel.com> writes: >> >>>> Let's just focus on reporting errors when we know we have them. >>> >>> That's the problem in my eyes. If software needs to contend with >>> latent error reporting then it should always contend otherwise >>> software has multiple error models to wrangle. >> >> The only way for an application to know that the data has been written >> successfully would be to issue a read after every write. That's not a >> performance hit most applications are willing to take. And, of course, >> the media can still go bad at a later time, so it only guarantees the >> data is accessible immediately after having been written. >> >> What I'm suggesting is that we should not complete a write successfully >> if we know that the data will not be retrievable. I wouldn't call this >> adding an extra error model to contend with. Applications should >> already be checking for errors on write. >> >> Does that make sense? Are we talking past each other? > > The badblock list is late to update in both directions, late to add > entries that the scrub needs to find and late to delete entries that > were inadvertently cleared by cache-line writes that did not first > ingest the poison for a read-modify-write. So I see the above as being > wishful in using the error list as the hard source of truth and > unfortunate to up-level all sub-sector error entries into full > PAGE_SIZE data offline events. Sorry, don't mean to distract the discussion, but I'm wondering if anyone has noticed SIGBUS with si_code = MCEERR_AO in a single process poison test over a dax-xfs file? There is only 1 poison in the file which has been consumed, it's the recovery code path (hole punch/ munmap/mmap/pwrite/read) that encounters the _AO. I'm confident that latent error isn't the scenario per ARS scrub. Also, the _AO appears rarely. This is un-explainable given the kernel MCE pmem handling implementation. > > I'm hoping we can find a way to make the error handling more fine > grained over time, but for the current patch, managing the blast > radius as PAGE_SIZE granularity at least matches the zero path with > the write path. Maybe the new filesystem op for clearing pmem poison should insist on 4K alignment? because in hwpoison_clear() the starting pfn is given by PHYS_PFN which rounds down to the nearest page, so we might inadvertently clear the poison bit and 'noce' bit from a page when we only cleared a poison e.g. in the second half of the page. BTW, set_mce_nospec() doesn't seem to work in 5.5 release, [ 2321.209382] Could not invalidate pfn=0x1850600 from 1:1 map I will see if I can find more information. > >>> Setting that aside we can start with just treating zeroing the same as >>> the copy_from_iter() case and fail the I/O at the dax_direct_access() >>> step. >> >> OK. >> >>> I'd rather have a separate op that filesystems can use to clear errors >>> at block allocation time that can be enforced to have the correct >>> alignment. >> >> So would file systems always call that routine instead of zeroing, or >> would they first check to see if there are badblocks? > > The proposal is that filesystems distinguish zeroing from free-block > allocation/initialization such that the fsdax implementation directs > initialization to a driver callback. This "initialization op" would > take care to check for poison and clear it. All other dax paths would > not consult the badblocks list. thanks! -jane > _______________________________________________ > Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org > To unsubscribe send an email to linux-nvdimm-leave@lists.01.org >
On Tue, Feb 25, 2020 at 02:49:30PM -0800, Dan Williams wrote: [..] > > > > Hi Dan, > > > > > > > > IIUC, block aligned hole punch don't go through __dax_zero_page_range() > > > > path. Instead they call blkdev_issue_zeroout() at later point of time. > > > > > > > > Only partial block zeroing path is taking __dax_zero_page_range(). So > > > > even if we remove poison clearing code from __dax_zero_page_range(), > > > > there should not be a regression w.r.t full block zeroing. Only possible > > > > regression will be if somebody was doing partial block zeroing on sector > > > > boundary, then poison will not be cleared. > > > > > > > > We now seem to be discussing too many issues w.r.t poison clearing > > > > and dax. Atleast 3 issues are mentioned in this thread. > > > > > > > > A. Get rid of dependency on block device in dax zeroing path. > > > > (__dax_zero_page_range) > > > > > > > > B. Provide a way to clear latent poison. And possibly use movdir64b to > > > > do that and make filesystems use that interface for initialization > > > > of blocks. > > > > > > > > C. Dax zero operation is clearing known poison while copy_from_iter() is > > > > not. I guess this ship has already sailed. If we change it now, > > > > somebody will complain of some regression. > > > > > > > > For issue A, there are two possible ways to deal with it. > > > > > > > > 1. Implement a dax method to zero page. And this method will also clear > > > > known poison. This is what my patch series is doing. > > > > > > > > 2. Just get rid of blkdev_issue_zeroout() from __dax_zero_page_range() > > > > so that no poison will be cleared in __dax_zero_page_range() path. This > > > > path is currently used in partial page zeroing path and full filesystem > > > > block zeroing happens with blkdev_issue_zeroout(). There is a small > > > > chance of regression here in case of sector aligned partial block > > > > zeroing. > > > > > > > > My patch series takes care of issue A without any regressions. In fact it > > > > improves current interface. For example, currently "truncate -s 512 > > > > foo.txt" will succeed even if first sector in the block is poisoned. My > > > > patch series fixes it. Current implementation will return error on if any > > > > non sector aligned truncate is done and any of the sector is poisoned. My > > > > implementation will not return error if poisoned can be cleared as part > > > > of zeroing. It will return only if poison is present in non-zeoring part. > > > > > > That asymmetry makes the implementation too much of a special case. If > > > the dax mapping path forces error boundaries on PAGE_SIZE blocks then > > > so should zeroing. > > > > > > > > > > > Why don't we solve one issue A now and deal with issue B and C later in > > > > a sepaprate patch series. This patch series gets rid of dependency on > > > > block device in dax path and also makes current zeroing interface better. > > > > > > I'm ok with replacing blkdev_issue_zeroout() with a dax operation > > > callback that deals with page aligned entries. That change at least > > > makes the error boundary symmetric across copy_from_iter() and the > > > zeroing path. > > > > IIUC, you are suggesting that modify dax_zero_page_range() to take page > > aligned start and size and call this interface from > > __dax_zero_page_range() and get rid of blkdev_issue_zeroout() in that > > path? > > > > Something like. > > > > __dax_zero_page_range() { > > if(page_aligned_io) > > call_dax_page_zero_range() > > else > > use_direct_access_and_memcpy; > > } > > > > And other callers of blkdev_issue_zeroout() in filesystems can migrate > > to calling dax_zero_page_range() instead. > > > > If yes, I am not seeing what advantage do we get by this change. > > > > - __dax_zero_page_range() seems to be called by only partial block > > zeroing code. So dax_zero_page_range() call will remain unused. > > > > > > - dax_zero_page_range() will be exact replacement of > > blkdev_issue_zeroout() so filesystems will not gain anything. Just that > > it will create a dax specific hook. > > > > In that case it might be simpler to just get rid of blkdev_issue_zeroout() > > call from __dax_zero_page_range() and make sure there are no callers of > > full block zeroing from this path. > > I think you're right. The path I'm concerned about not regressing is > the error clearing on new block allocation and we get that already via > xfs_zero_extent() and sb_issue_zeroout(). For your fs we'll want a > dax-device equivalent for that path, but that does mean that > __dax_zero_page_range() stays out of the error clearing game. In virtiofs we do not manage our own blocks. We let host filesystem do that and we are just passthrough filesystem passing fuse messages around. So I have not seen need of block zeroing interface yet. I just happened to carry a patch in my patch series in this area because we wanted to get rid of this assumption that dax always has a block device associated. Apart from that, I don't need __dax_zero_page_range() for virtiofs. I am doing this cleanup so that we dont even try to use block device in this dax zeroing path. Anyway, I will cleanup this patch series and get rid of blkdev_issue_zeroout() call from __dax_zero_page_range() and post again for review and where does it go from there. Thanks Vivek
On Tue, Feb 25, 2020 at 02:49:30PM -0800, Dan Williams wrote: [..] > > > I'm ok with replacing blkdev_issue_zeroout() with a dax operation > > > callback that deals with page aligned entries. That change at least > > > makes the error boundary symmetric across copy_from_iter() and the > > > zeroing path. > > > > IIUC, you are suggesting that modify dax_zero_page_range() to take page > > aligned start and size and call this interface from > > __dax_zero_page_range() and get rid of blkdev_issue_zeroout() in that > > path? > > > > Something like. > > > > __dax_zero_page_range() { > > if(page_aligned_io) > > call_dax_page_zero_range() > > else > > use_direct_access_and_memcpy; > > } > > > > And other callers of blkdev_issue_zeroout() in filesystems can migrate > > to calling dax_zero_page_range() instead. > > > > If yes, I am not seeing what advantage do we get by this change. > > > > - __dax_zero_page_range() seems to be called by only partial block > > zeroing code. So dax_zero_page_range() call will remain unused. > > > > > > - dax_zero_page_range() will be exact replacement of > > blkdev_issue_zeroout() so filesystems will not gain anything. Just that > > it will create a dax specific hook. > > > > In that case it might be simpler to just get rid of blkdev_issue_zeroout() > > call from __dax_zero_page_range() and make sure there are no callers of > > full block zeroing from this path. > > I think you're right. The path I'm concerned about not regressing is > the error clearing on new block allocation and we get that already via > xfs_zero_extent() and sb_issue_zeroout(). Well I was wrong. I found atleast one user which uses __dax_zero_page_range() to zero full PAGE_SIZE blocks. xfs_io -c "allocsp 32K 0" foo.txt In that case, I will add a new dax method say dax_zero_page_range() which will only take PAGE_SIZE aligned range will clear known poison and call that from __dax_zero_page_range() if I/O is PAGE_SIZE aligned. For now I will limit this interface to take only single page at a time because otherwise implementation becomes complex in dm/md stack where range has to be broken across multiple devices and there are no users because current iomap API passes one page at a time. So once we have grown users, then one can also tackle the complexity of modifying dm/md to break a page range across multiple devices. Will post a patch series for this. Thanks Vivek
On Mon, Feb 24, 2020 at 10:38:44AM -0500, Vivek Goyal wrote: > On Mon, Feb 24, 2020 at 10:03:30AM +1100, Dave Chinner wrote: > > [..] > > > > > Hi Jeff, > > > > > > > > > > New dax zeroing interface (dax_zero_page_range()) can technically pass > > > > > a range which is less than a sector. Or which is bigger than a sector > > > > > but start and end are not aligned on sector boundaries. > > > > > > > > Sure, but who will call it with misaligned ranges? > > > > > > create a file foo.txt of size 4K and then truncate it. > > > > > > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to > > > 4095. > > > > This should fail with EIO. Only full page writes should clear the > > bad page state, and partial writes should therefore fail because > > they do not guarantee the data in the filesystem block is all good. > > > > If this zeroing was a buffered write to an address with a bad > > sector, then the writeback will fail and the user will (eventually) > > get an EIO on the file. > > > > DAX should do the same thing, except because the zeroing is > > synchronous (i.e. done directly by the truncate syscall) we can - > > and should - return EIO immediately. > > > > Indeed, with your code, if we then extend the file by truncating up > > back to 4k, then the range between 23 and 512 is still bad, even > > though we've successfully zeroed it and the user knows it. An > > attempt to read anywhere in this range (e.g. 10 bytes at offset 100) > > will fail with EIO, but reading 10 bytes at offset 2000 will > > succeed. > > Hi Dave, > > What is expected if I do "truncate -s 512 foo.txt". Say first sector (0 to > 511) is poisoned and rest don't have poison. Should this fail with -EIO. Yes - the filesystem block still contains bad data. > In current implementation it does not. I'm not surprised - the whole hardware error handling architecture for FS-DAX is fundamentally broken. It was designed for device-dax, and it just doesn't work for FS-DAX. For example, to get the hardware error handling to be able to kill userspace applications, a 1:1 physical-to-logical association constraint was added to fs/dax.c: /* * TODO: for reflink+dax we need a way to associate a single page with * multiple address_space instances at different linear_page_index() * offsets. */ static void dax_associate_entry(void *entry, struct address_space *mapping, struct vm_area_struct *vma, unsigned long address) because device-dax only has *linear mappings* and so has a fixed reverse mapping architecture. i.e. the whole hardware error handling infrastructure was designed around the constraints of device-dax. device-dax does not having any structure to serialise access to the physical storage, so locking was added to the mapping tree. THe mapping tree locking is accessed on hardware error via the reverse mappingi association in the struct page and that's how device-dax serialises direct physical storage access against hardware error processing. And while the page index is locked in the mapping tree, it can walk the process vmas that have the page mapped to kill them so that they don't try to access the bad page. That bad physical storage state is carried in a volatile struct page flag, hence requiring some mechanism to make it persistent (the device bad blocks register) and some other mechanism to clear the poison state (direct IO, IIRC). It's also a big, nasty, solid roadblock to implementing shared data extents in FS-DAX. We basically have to completely re-architect the hardware error handling for FS-DAX so that such errors are reported to the filesystem first and then the filesystem does what is needed to handle the error. None of this works for filesystems because they need to perform different operations depending on what the page that went bad contains. FS-DAX should never trip over an unexpected poisoned page; we do so now because such physical storage errors are completely hidden form the fielsystem. What you are trying to do is slap a band-aid over what to do when we hit an unexpected page containing bad data. Filesystems expect to find out about bad data in storage when they marshall the data into or out of memory. They make the assumption that once it is in memory it remains valid on the physical storage. Hence if an in-memory error occurs, we can just toss it away and re-read it from storage, and all is good. FS-DAX changes that - we are no longer marshalling data into and out of memory so we don't have a mechanism to get EIO when reading the page into the page cache or writing it back to disk. We also don't have an in-memory copy of the data - the physical storage is the in-memory copy, and so we can't just toss it away when an error occurs. What we actually require is instantaneous notification of physical storage errors so we can handle the error immediately. And that, in turn, means we should never poison or see poisoned pages during direct access operations because the filesystem doesn't need to poison pages to prevent user access - it controls how the storage is accessed directly. e.g. if this error is in filesystem metadata, we might be able to recover from it as that metadata might have a full copy in memory (metadata is buffered in both XFS and ext4) or we might be able to reconstruct it from other sources. Worst case, we have shut the filesystem down completely so the admin can repair the damage the lost page has caused. e.g. The physical page may be located in free space, in which case we don't care and can just zero it so all the bad hardware state is cleared. The page never goes bad or gets poisoned in that case. e.g. The physical page may be user data, in which case it may be buffered in the page cache (non-dax) and so can easily be recovered. It may not be recoverable, in which case we need to issue log messages indicating that data has been lost (path, offset, length), and do the VMA walk and kill processes that map that page. Then we can zero the page to clear the bad state. If, at any point we can't clear the bad state (e.g. the zeroing or the read-back verification fails), then we need to make sure that filesystem block is marked as allocated in the free space map, then tell the reverse map that it's owner is now "bad storage" so it never gets used again. i.e. this is the persistent bad block storage, but doing it this way results in the rest of the filesystem code never, ever seeing a poisoned page. And users never see it either, because it will never be returned to the free space pool. Of course, this relies of the filesystem having reverse mapping capability. XFS already has this funcitonality available as a mkfs option (mkfs.xfs -m rmapbt=1 ...), and we really need this so we can get rid of the association of a physical page with a mapping and file offset that device-dax requires for hardware page error handling. This means we don't need the physical storage to try to hold filesystem layer reverse mapping information for us, and this also removes the roadblock that the hardware error handling has placed on implementing reflink w/ FS-DAX. IOWs, the problem you are trying to solve is a direct result of filesysetms not being informed when a physical pmem page goes bad and the current error handling being implemented at entirely the wrong layer for FS-DAX. It may work for device-dax, but it's most definitely insufficient for correct error handling for filesystems. > Anyway, partial page truncate can't ensure that data in rest of the page can > be read back successfully. Memory can get poison after the write and > hence read after truncate will still fail. Which is where the notification requirement comes in. Yes, we may still get errors on read or write, but if memory at rest goes bad, we want to handle that and correct it ASAP, not wait days or months for something to trip over the poisoned page before we find out about it. > Hence, all we are trying to ensure is that if a poison is known at the > time of writing partial page, then we should return error to user space. I think within FS-DAX infrastructure, any access to the data (read or write) within a poisoned page or a page marked with PageError() should return EIO to the caller, unless it's the specific command to clear the error/poison state on the page. What happens with that error state is then up to the caller. Cheers, Dave.
On Wed, Feb 26, 2020 at 11:57:56AM -0500, Vivek Goyal wrote: > On Tue, Feb 25, 2020 at 02:49:30PM -0800, Dan Williams wrote: > [..] > > > > I'm ok with replacing blkdev_issue_zeroout() with a dax operation > > > > callback that deals with page aligned entries. That change at least > > > > makes the error boundary symmetric across copy_from_iter() and the > > > > zeroing path. > > > > > > IIUC, you are suggesting that modify dax_zero_page_range() to take page > > > aligned start and size and call this interface from > > > __dax_zero_page_range() and get rid of blkdev_issue_zeroout() in that > > > path? > > > > > > Something like. > > > > > > __dax_zero_page_range() { > > > if(page_aligned_io) > > > call_dax_page_zero_range() > > > else > > > use_direct_access_and_memcpy; > > > } > > > > > > And other callers of blkdev_issue_zeroout() in filesystems can migrate > > > to calling dax_zero_page_range() instead. > > > > > > If yes, I am not seeing what advantage do we get by this change. > > > > > > - __dax_zero_page_range() seems to be called by only partial block > > > zeroing code. So dax_zero_page_range() call will remain unused. > > > > > > > > > - dax_zero_page_range() will be exact replacement of > > > blkdev_issue_zeroout() so filesystems will not gain anything. Just that > > > it will create a dax specific hook. > > > > > > In that case it might be simpler to just get rid of blkdev_issue_zeroout() > > > call from __dax_zero_page_range() and make sure there are no callers of > > > full block zeroing from this path. > > > > I think you're right. The path I'm concerned about not regressing is > > the error clearing on new block allocation and we get that already via > > xfs_zero_extent() and sb_issue_zeroout(). > > Well I was wrong. I found atleast one user which uses __dax_zero_page_range() > to zero full PAGE_SIZE blocks. > > xfs_io -c "allocsp 32K 0" foo.txt That ioctl interface is deprecated and likely unused by any new application written since 1999. It predates unwritten extents (1998) and I don't think any native linux applications have ever used it. A newly written DAX aware application would almost certainly not use this interface. IOWs, I wouldn't use it as justification for some special case behaviour; I'm more likely to say "rip that ancient ioctl out" than to jump through hoops because it's implementation behaviour. Cheers, Dave.
On Wed, Feb 26, 2020 at 7:03 PM Dave Chinner <david@fromorbit.com> wrote: > > On Mon, Feb 24, 2020 at 10:38:44AM -0500, Vivek Goyal wrote: > > On Mon, Feb 24, 2020 at 10:03:30AM +1100, Dave Chinner wrote: > > > > [..] > > > > > > Hi Jeff, > > > > > > > > > > > > New dax zeroing interface (dax_zero_page_range()) can technically pass > > > > > > a range which is less than a sector. Or which is bigger than a sector > > > > > > but start and end are not aligned on sector boundaries. > > > > > > > > > > Sure, but who will call it with misaligned ranges? > > > > > > > > create a file foo.txt of size 4K and then truncate it. > > > > > > > > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to > > > > 4095. > > > > > > This should fail with EIO. Only full page writes should clear the > > > bad page state, and partial writes should therefore fail because > > > they do not guarantee the data in the filesystem block is all good. > > > > > > If this zeroing was a buffered write to an address with a bad > > > sector, then the writeback will fail and the user will (eventually) > > > get an EIO on the file. > > > > > > DAX should do the same thing, except because the zeroing is > > > synchronous (i.e. done directly by the truncate syscall) we can - > > > and should - return EIO immediately. > > > > > > Indeed, with your code, if we then extend the file by truncating up > > > back to 4k, then the range between 23 and 512 is still bad, even > > > though we've successfully zeroed it and the user knows it. An > > > attempt to read anywhere in this range (e.g. 10 bytes at offset 100) > > > will fail with EIO, but reading 10 bytes at offset 2000 will > > > succeed. > > > > Hi Dave, > > > > What is expected if I do "truncate -s 512 foo.txt". Say first sector (0 to > > 511) is poisoned and rest don't have poison. Should this fail with -EIO. > > Yes - the filesystem block still contains bad data. > > > In current implementation it does not. > > I'm not surprised - the whole hardware error handling architecture > for FS-DAX is fundamentally broken. It was designed for device-dax, > and it just doesn't work for FS-DAX. > > For example, to get the hardware error handling to be able to kill > userspace applications, a 1:1 physical-to-logical association > constraint was added to fs/dax.c: > > /* > * TODO: for reflink+dax we need a way to associate a single page with > * multiple address_space instances at different linear_page_index() > * offsets. > */ > static void dax_associate_entry(void *entry, struct address_space *mapping, > struct vm_area_struct *vma, unsigned long address) > > because device-dax only has *linear mappings* and so has a fixed > reverse mapping architecture. > > i.e. the whole hardware error handling infrastructure was designed > around the constraints of device-dax. device-dax does not having any > structure to serialise access to the physical storage, so locking > was added to the mapping tree. THe mapping tree locking is accessed > on hardware error via the reverse mappingi association in the struct > page and that's how device-dax serialises direct physical storage > access against hardware error processing. And while the page index > is locked in the mapping tree, it can walk the process vmas that > have the page mapped to kill them so that they don't try to access > the bad page. > > That bad physical storage state is carried in a volatile struct page > flag, hence requiring some mechanism to make it persistent (the > device bad blocks register) and some other mechanism to clear the > poison state (direct IO, IIRC). > > It's also a big, nasty, solid roadblock to implementing shared > data extents in FS-DAX. We basically have to completely re-architect > the hardware error handling for FS-DAX so that such errors are > reported to the filesystem first and then the filesystem does what > is needed to handle the error. > > None of this works for filesystems because they need to perform > different operations depending on what the page that went bad > contains. FS-DAX should never trip over an unexpected poisoned page; > we do so now because such physical storage errors are completely > hidden form the fielsystem. > > What you are trying to do is slap a band-aid over what to do when we > hit an unexpected page containing bad data. Filesystems expect to > find out about bad data in storage when they marshall the data into > or out of memory. They make the assumption that once it is in memory > it remains valid on the physical storage. Hence if an in-memory > error occurs, we can just toss it away and re-read it from storage, > and all is good. > > FS-DAX changes that - we are no longer marshalling data into and out > of memory so we don't have a mechanism to get EIO when reading the > page into the page cache or writing it back to disk. We also don't > have an in-memory copy of the data - the physical storage is the > in-memory copy, and so we can't just toss it away when an error > occurs. > > What we actually require is instantaneous notification of physical > storage errors so we can handle the error immediately. And that, in > turn, means we should never poison or see poisoned pages during > direct access operations because the filesystem doesn't need to > poison pages to prevent user access - it controls how the storage is > accessed directly. > > e.g. if this error is in filesystem metadata, we might be able to > recover from it as that metadata might have a full copy in memory > (metadata is buffered in both XFS and ext4) or we might be able to > reconstruct it from other sources. Worst case, we have shut the > filesystem down completely so the admin can repair the damage the > lost page has caused. > > e.g. The physical page may be located in free space, in which case > we don't care and can just zero it so all the bad hardware state is > cleared. The page never goes bad or gets poisoned in that case. > > e.g. The physical page may be user data, in which case it may be > buffered in the page cache (non-dax) and so can easily be recovered. > It may not be recoverable, in which case we need to issue log > messages indicating that data has been lost (path, offset, length), > and do the VMA walk and kill processes that map that page. Then we > can zero the page to clear the bad state. > > If, at any point we can't clear the bad state (e.g. the zeroing or > the read-back verification fails), then we need to make sure that > filesystem block is marked as allocated in the free space map, then > tell the reverse map that it's owner is now "bad storage" so it > never gets used again. i.e. this is the persistent bad block > storage, but doing it this way results in the rest of the filesystem > code never, ever seeing a poisoned page. And users never see it > either, because it will never be returned to the free space pool. > > Of course, this relies of the filesystem having reverse mapping > capability. XFS already has this funcitonality available as a mkfs > option (mkfs.xfs -m rmapbt=1 ...), and we really need this so we can > get rid of the association of a physical page with a mapping and > file offset that device-dax requires for hardware page error > handling. This means we don't need the physical storage to try to > hold filesystem layer reverse mapping information for us, and this > also removes the roadblock that the hardware error handling has > placed on implementing reflink w/ FS-DAX. > > IOWs, the problem you are trying to solve is a direct result of > filesysetms not being informed when a physical pmem page goes bad > and the current error handling being implemented at entirely the > wrong layer for FS-DAX. It may work for device-dax, but it's most > definitely insufficient for correct error handling for filesystems. > > > Anyway, partial page truncate can't ensure that data in rest of the page can > > be read back successfully. Memory can get poison after the write and > > hence read after truncate will still fail. > > Which is where the notification requirement comes in. Yes, we may > still get errors on read or write, but if memory at rest goes bad, > we want to handle that and correct it ASAP, not wait days or months > for something to trip over the poisoned page before we find out > about it. > > > Hence, all we are trying to ensure is that if a poison is known at the > > time of writing partial page, then we should return error to user space. > > I think within FS-DAX infrastructure, any access to the data (read > or write) within a poisoned page or a page marked with PageError() > should return EIO to the caller, unless it's the specific command to > clear the error/poison state on the page. What happens with that > error state is then up to the caller. > I agree with most of the above if you replace "device-dax error handling" with "System RAM error handling". It's typical memory error handling that injects the page->index and page->mappping dependency. In fact it was difficult to map this to device-dax without the hack that device-dax does not need to take a page lock. I do think that the FS needs this error information, at the same time it's also true that historically no FS pushed for this information for page-cache and in-memory metadata prior to FS-DAX. So, the design was not "device-dax first" it was "existing memory error handling first" and all the warts that implied. So you want the FS to have error handling for just pmem errors or all memory errors? And you want this to be done without the mm core using page->index to identify what to unmap when the error happens? Memory error scanning is not universally available on all pmem implementations, so FS will need to subscribe for memory-error handling events. The implementation is clearer for pmem, if the FS is only interested in memory error handling events for memory it ostensibly owns (mounted pmem) vs other pages that are only ephemerally related to the FS (typical RAM).
On Thu, Feb 27, 2020 at 02:11:43PM +1100, Dave Chinner wrote: > On Wed, Feb 26, 2020 at 11:57:56AM -0500, Vivek Goyal wrote: > > On Tue, Feb 25, 2020 at 02:49:30PM -0800, Dan Williams wrote: > > [..] > > > > > I'm ok with replacing blkdev_issue_zeroout() with a dax operation > > > > > callback that deals with page aligned entries. That change at least > > > > > makes the error boundary symmetric across copy_from_iter() and the > > > > > zeroing path. > > > > > > > > IIUC, you are suggesting that modify dax_zero_page_range() to take page > > > > aligned start and size and call this interface from > > > > __dax_zero_page_range() and get rid of blkdev_issue_zeroout() in that > > > > path? > > > > > > > > Something like. > > > > > > > > __dax_zero_page_range() { > > > > if(page_aligned_io) > > > > call_dax_page_zero_range() > > > > else > > > > use_direct_access_and_memcpy; > > > > } > > > > > > > > And other callers of blkdev_issue_zeroout() in filesystems can migrate > > > > to calling dax_zero_page_range() instead. > > > > > > > > If yes, I am not seeing what advantage do we get by this change. > > > > > > > > - __dax_zero_page_range() seems to be called by only partial block > > > > zeroing code. So dax_zero_page_range() call will remain unused. > > > > > > > > > > > > - dax_zero_page_range() will be exact replacement of > > > > blkdev_issue_zeroout() so filesystems will not gain anything. Just that > > > > it will create a dax specific hook. > > > > > > > > In that case it might be simpler to just get rid of blkdev_issue_zeroout() > > > > call from __dax_zero_page_range() and make sure there are no callers of > > > > full block zeroing from this path. > > > > > > I think you're right. The path I'm concerned about not regressing is > > > the error clearing on new block allocation and we get that already via > > > xfs_zero_extent() and sb_issue_zeroout(). > > > > Well I was wrong. I found atleast one user which uses __dax_zero_page_range() > > to zero full PAGE_SIZE blocks. > > > > xfs_io -c "allocsp 32K 0" foo.txt > > That ioctl interface is deprecated and likely unused by any new > application written since 1999. It predates unwritten extents (1998) > and I don't think any native linux applications have ever used it. A > newly written DAX aware application would almost certainly not use > this interface. > > IOWs, I wouldn't use it as justification for some special case > behaviour; I'm more likely to say "rip that ancient ioctl out" than > to jump through hoops because it's implementation behaviour. Hi Dave, Do you see any other path in xfs using iomap_zero_range() and zeroing full block. iomap_zero_range() already skips IOMAP_HOLE and IOMAP_UNWRITTEN. So it has to be a full block zeroing which is of not type IOMAP_HOLE and IOMAP_UNWRITTEN. My understanding is that ext4 and xfs both are initializing full blocks using blkdev_issue_zeroout(). Only partial blocks are being zeroed using this dax zeroing path. If there are no callers of full block zeroing through __dax_zero_page_range(), then I can simply get rid of blkdev_issue_zerout() call from __dax_zero_page_range(). Thanks Vivek
On Wed, Feb 26, 2020 at 08:19:37PM -0800, Dan Williams wrote: > On Wed, Feb 26, 2020 at 7:03 PM Dave Chinner <david@fromorbit.com> wrote: > > On Mon, Feb 24, 2020 at 10:38:44AM -0500, Vivek Goyal wrote: > > > Anyway, partial page truncate can't ensure that data in rest of the page can > > > be read back successfully. Memory can get poison after the write and > > > hence read after truncate will still fail. > > > > Which is where the notification requirement comes in. Yes, we may > > still get errors on read or write, but if memory at rest goes bad, > > we want to handle that and correct it ASAP, not wait days or months > > for something to trip over the poisoned page before we find out > > about it. > > > > > Hence, all we are trying to ensure is that if a poison is known at the > > > time of writing partial page, then we should return error to user space. > > > > I think within FS-DAX infrastructure, any access to the data (read > > or write) within a poisoned page or a page marked with PageError() > > should return EIO to the caller, unless it's the specific command to > > clear the error/poison state on the page. What happens with that > > error state is then up to the caller. > > > > I agree with most of the above if you replace "device-dax error > handling" with "System RAM error handling". It's typical memory error > handling that injects the page->index and page->mappping dependency. I disagree, but that's beside the point and not worth arguing. > So you want the FS to have error handling for just pmem errors or all > memory errors? Just pmem errors in the specific range the filesystem manages - we really only care storage errors because those are the only errors the filesystem is responsible for handling correctly. Somebody else can worry about errors that hit page cache pages - page cache pages require mapping/index pointers on each page anyway, so a generic mechanism for handling those errors can be built into the page cache. And if the error is in general kernel memory, then it's game over for the entire kernel at that point, not just the filesystem. > And you want this to be done without the mm core using > page->index to identify what to unmap when the error happens? Isn't that exactly what I just said? We get the page address that failed, the daxdev can turn that into a sector address and call into the filesystem with a {sector, len, errno} tuple. We then do a reverse mapping lookup on {sector, len} to find all the owners of that range in the filesystem. If it's file data, that owner record gives us the inode and the offset into the file, which then gives us a {mapping, index} tuple. Further, the filesytem reverse map is aware that it's blocks can be shared by multiple owners, so it will have a record for every inode and file offset that maps to that page. Hence we can simply iterate the reverse map and do that whacky collect_procs/kill_procs dance for every {mapping, index} pair that references the the bad range. Nothing ever needs to be stored on the struct page... > Memory > error scanning is not universally available on all pmem > implementations, so FS will need to subscribe for memory-error > handling events. No. Filesystems interact with the underlying device abstraction, not the physical storage that lies behind that device abstraction. The filesystem should not interface with pmem directly in any way (all direct accesses are hidden inside fs/dax.c!), nor should it care about the quirks of the pmem implementation it is sitting on. That's what the daxdev abstraction is supposed to hide from the users of the pmem. IOWs, the daxdev infrastructure subscribes to memory-error event subsystem, calls out to the filesystem when an error in a page in the daxdev is reported. The filesystem only needs to know the {sector, len, errno} tuple related to the error; it is the device's responsibility to handle the physical mechanics of listening, filtering and translating MCEs to a format the filesystem understands.... Another reason it should be provided by the daxdev as a {sector, len, errno} tuple is that it also allows non-dax block devices to implement the similar error notifications and provide filesystems with exactly the same information so the filesystem can start auto-recovery processes.... Cheers, Dave.
On Thu, Feb 27, 2020 at 10:25:17AM -0500, Vivek Goyal wrote: > On Thu, Feb 27, 2020 at 02:11:43PM +1100, Dave Chinner wrote: > > On Wed, Feb 26, 2020 at 11:57:56AM -0500, Vivek Goyal wrote: > > > On Tue, Feb 25, 2020 at 02:49:30PM -0800, Dan Williams wrote: > > > [..] > > > > > > I'm ok with replacing blkdev_issue_zeroout() with a dax operation > > > > > > callback that deals with page aligned entries. That change at least > > > > > > makes the error boundary symmetric across copy_from_iter() and the > > > > > > zeroing path. > > > > > > > > > > IIUC, you are suggesting that modify dax_zero_page_range() to take page > > > > > aligned start and size and call this interface from > > > > > __dax_zero_page_range() and get rid of blkdev_issue_zeroout() in that > > > > > path? > > > > > > > > > > Something like. > > > > > > > > > > __dax_zero_page_range() { > > > > > if(page_aligned_io) > > > > > call_dax_page_zero_range() > > > > > else > > > > > use_direct_access_and_memcpy; > > > > > } > > > > > > > > > > And other callers of blkdev_issue_zeroout() in filesystems can migrate > > > > > to calling dax_zero_page_range() instead. > > > > > > > > > > If yes, I am not seeing what advantage do we get by this change. > > > > > > > > > > - __dax_zero_page_range() seems to be called by only partial block > > > > > zeroing code. So dax_zero_page_range() call will remain unused. > > > > > > > > > > > > > > > - dax_zero_page_range() will be exact replacement of > > > > > blkdev_issue_zeroout() so filesystems will not gain anything. Just that > > > > > it will create a dax specific hook. > > > > > > > > > > In that case it might be simpler to just get rid of blkdev_issue_zeroout() > > > > > call from __dax_zero_page_range() and make sure there are no callers of > > > > > full block zeroing from this path. > > > > > > > > I think you're right. The path I'm concerned about not regressing is > > > > the error clearing on new block allocation and we get that already via > > > > xfs_zero_extent() and sb_issue_zeroout(). > > > > > > Well I was wrong. I found atleast one user which uses __dax_zero_page_range() > > > to zero full PAGE_SIZE blocks. > > > > > > xfs_io -c "allocsp 32K 0" foo.txt > > > > That ioctl interface is deprecated and likely unused by any new > > application written since 1999. It predates unwritten extents (1998) > > and I don't think any native linux applications have ever used it. A > > newly written DAX aware application would almost certainly not use > > this interface. > > > > IOWs, I wouldn't use it as justification for some special case > > behaviour; I'm more likely to say "rip that ancient ioctl out" than > > to jump through hoops because it's implementation behaviour. > > Hi Dave, > > Do you see any other path in xfs using iomap_zero_range() and zeroing > full block. Yes: - xfs_file_aio_write_checks() for zeroing blocks between the existing EOF and the start of the incoming write beyond EOF - xfs_setattr_size() on truncate up for zeroing blocks between the existing EOF and the new EOF. - xfs_reflink_zero_posteof() for zeroing blocks between the old EOF and where the new reflinked extents are going to land beyond EOF And don't think that blocks beyond EOF can't exist when DAX is enabled. We can turn DAX on and off, we can crash between allocation and file size extension, etc. Hence this code must be able to handle zeroing large ranges of blocks beyond EOF... > iomap_zero_range() already skips IOMAP_HOLE and > IOMAP_UNWRITTEN. So it has to be a full block zeroing which is of not type > IOMAP_HOLE and IOMAP_UNWRITTEN. > > My understanding is that ext4 and xfs both are initializing full blocks > using blkdev_issue_zeroout(). Only partial blocks are being zeroed using > this dax zeroing path. Look at the API, not the callers: iomap_zero_range takes a 64 bit length parameter. It can be asked to zero blocks across petabytes of a file. If there's a single block somewhere in that range, it will only zero that block. If the entire range is allocated, it will zero that entire range (yes, it will take forever!) as that it what it is intended to do. It should be pretty clear that needs to be able to zero entire pages, regardless of how it is currently used/called by filesystems. Cheers, Dave.
On Thu, Feb 27, 2020 at 5:31 PM Dave Chinner <david@fromorbit.com> wrote: > On Wed, Feb 26, 2020 at 08:19:37PM -0800, Dan Williams wrote: [..] > > So you want the FS to have error handling for just pmem errors or all > > memory errors? > > Just pmem errors in the specific range the filesystem manages - we > really only care storage errors because those are the only errors > the filesystem is responsible for handling correctly. > > Somebody else can worry about errors that hit page cache pages - > page cache pages require mapping/index pointers on each page anyway, > so a generic mechanism for handling those errors can be built into > the page cache. And if the error is in general kernel memory, then > it's game over for the entire kernel at that point, not just the > filesystem. > > > And you want this to be done without the mm core using > > page->index to identify what to unmap when the error happens? > > Isn't that exactly what I just said? We get the page address that > failed, the daxdev can turn that into a sector address and call into > the filesystem with a {sector, len, errno} tuple. We then do a > reverse mapping lookup on {sector, len} to find all the owners of > that range in the filesystem. If it's file data, that owner record > gives us the inode and the offset into the file, which then gives us > a {mapping, index} tuple. > > Further, the filesytem reverse map is aware that it's blocks can be > shared by multiple owners, so it will have a record for every inode > and file offset that maps to that page. Hence we can simply iterate > the reverse map and do that whacky collect_procs/kill_procs dance > for every {mapping, index} pair that references the the bad range. > > Nothing ever needs to be stored on the struct page... Ok, so fs/dax.c needs to coordinate with mm/memory-failure.c to say "don't perform generic memory-error-handling, there's an fs that owns this daxdev and wants to disposition errors". The fs/dax.c infrastructure that sets up ->index and ->mapping would still need to be there for ext4 until its ready to take on the same responsibility. Last I checked the ext4 reverse mapping implementation was not as capable as XFS. This goes back to why the initial design avoided relying on not universally available / stable reverse-mapping facilities and opted for extending the generic mm/memory-failure.c implementation. If XFS optionally supplants mm/memory-failure.c I would expect we'd want to do better than the "whacky collect_procs/kill_procs" implementation and let applications register for a notification format better than BUS_MCEERR_AO signals. > > Memory > > error scanning is not universally available on all pmem > > implementations, so FS will need to subscribe for memory-error > > handling events. > > No. Filesystems interact with the underlying device abstraction, not > the physical storage that lies behind that device abstraction. The > filesystem should not interface with pmem directly in any way (all > direct accesses are hidden inside fs/dax.c!), nor should it care > about the quirks of the pmem implementation it is sitting on. That's > what the daxdev abstraction is supposed to hide from the users of > the pmem. I wasn't proposing that XFS have a machine-check handler, I was trying to get to the next level detail of the async notification interface to the fs. > > IOWs, the daxdev infrastructure subscribes to memory-error event > subsystem, calls out to the filesystem when an error in a page in > the daxdev is reported. The filesystem only needs to know the > {sector, len, errno} tuple related to the error; it is the device's > responsibility to handle the physical mechanics of listening, > filtering and translating MCEs to a format the filesystem > understands.... > > Another reason it should be provided by the daxdev as a {sector, > len, errno} tuple is that it also allows non-dax block devices to > implement the similar error notifications and provide filesystems > with exactly the same information so the filesystem can start > auto-recovery processes.... The driver layer does already have 'struct badblocks' that both pmem and md use, just no current way for the FS to get a reference to it. However, my hope was to get away from the interface being sector based because the error granularity is already smaller than a sector in the daxdev case as compared to a bdev. A daxdev native error record should be a daxdev relative byte offset, not a sector. If the fs wants to align the blast radius of the error to sectors or fs-blocks that's its choice, but I don't think the driver interface should preclude sub-sector error handling. Essentially I don't want to add more bdev semantics to fs/dax.c while Vivek is busy removing them.
On Thu, Feb 27, 2020 at 07:28:41PM -0800, Dan Williams wrote: > "don't perform generic memory-error-handling, there's an fs that owns > this daxdev and wants to disposition errors". The fs/dax.c > infrastructure that sets up ->index and ->mapping would still need to > be there for ext4 until its ready to take on the same responsibility. > Last I checked the ext4 reverse mapping implementation was not as > capable as XFS. This goes back to why the initial design avoided > relying on not universally available / stable reverse-mapping > facilities and opted for extending the generic mm/memory-failure.c > implementation. The important but is that we stop using ->index and ->mapping when XFS is used as that enables using DAX+reflinks, which actually is the most requested DAX feature on XFS (way more than silly runtime flag switches for example).
On Fri, Feb 28, 2020 at 6:05 AM Christoph Hellwig <hch@infradead.org> wrote: > > On Thu, Feb 27, 2020 at 07:28:41PM -0800, Dan Williams wrote: > > "don't perform generic memory-error-handling, there's an fs that owns > > this daxdev and wants to disposition errors". The fs/dax.c > > infrastructure that sets up ->index and ->mapping would still need to > > be there for ext4 until its ready to take on the same responsibility. > > Last I checked the ext4 reverse mapping implementation was not as > > capable as XFS. This goes back to why the initial design avoided > > relying on not universally available / stable reverse-mapping > > facilities and opted for extending the generic mm/memory-failure.c > > implementation. > > The important but is that we stop using ->index and ->mapping when XFS > is used as that enables using DAX+reflinks, which actually is the most > requested DAX feature on XFS (way more than silly runtime flag switches > for example). Understood. To be clear the plan we are marching to is knock down all the known objections to the removal of the "experimental" designation. reflink is on that list and so is per-file dax. The thought was that pef-file dax was a nearer term goal than reflink.
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 075b11682192..e72959203253 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -74,14 +74,28 @@ static blk_status_t pmem_clear_poison(struct pmem_device *pmem, sector_t sector; long cleared; blk_status_t rc = BLK_STS_OK; + phys_addr_t start_aligned, end_aligned; + unsigned int len_aligned; - sector = (offset - pmem->data_offset) / 512; + /* + * Callers can pass arbitrary offset and len. But nvdimm_clear_poison() + * expects memory offset and length to meet certain alignment + * restrction (clear_err_unit). Currently nvdimm does not export + * required alignment. So align offset and length to sector boundary + * before passing it to nvdimm_clear_poison(). + */ + start_aligned = ALIGN(offset, SECTOR_SIZE); + end_aligned = ALIGN_DOWN((offset + len), SECTOR_SIZE) - 1; + len_aligned = end_aligned - start_aligned + 1; + + sector = (start_aligned - pmem->data_offset) / 512; - cleared = nvdimm_clear_poison(dev, pmem->phys_addr + offset, len); - if (cleared < len) + cleared = nvdimm_clear_poison(dev, pmem->phys_addr + start_aligned, + len_aligned); + if (cleared < len_aligned) rc = BLK_STS_IOERR; if (cleared > 0 && cleared / 512) { - hwpoison_clear(pmem, pmem->phys_addr + offset, cleared); + hwpoison_clear(pmem, pmem->phys_addr + start_aligned, cleared); cleared /= 512; dev_dbg(dev, "%#llx clear %ld sector%s\n", (unsigned long long) sector, cleared,
Currently pmem_clear_poison() expects offset and len to be sector aligned. Atleast that seems to be the assumption with which code has been written. It is called only from pmem_do_bvec() which is called only from pmem_rw_page() and pmem_make_request() which will only passe sector aligned offset and len. Soon we want use this function from dax_zero_page_range() code path which can try to zero arbitrary range of memory with-in a page. So update this function to assume that offset and length can be arbitrary and do the necessary alignments as needed. nvdimm_clear_poison() seems to assume offset and len to be aligned to clear_err_unit boundary. But this is currently internal detail and is not exported for others to use. So for now, continue to align offset and length to SECTOR_SIZE boundary. Improving it further and to align it to clear_err_unit boundary is a TODO item for future. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> --- drivers/nvdimm/pmem.c | 22 ++++++++++++++++++---- 1 file changed, 18 insertions(+), 4 deletions(-)