Message ID | 20210730100158.3117319-2-ruansy.fnst@fujitsu.com (mailing list archive)
---|---
State | Superseded
Series | fsdax: introduce fs query to support reflink
The filesystem part of the pmem failure handling is at minimum built on
PAGE_SIZE granularity - an inheritance from the general memory_failure
handling.  However, with Intel's DCPMEM technology, the error blast radius
is no more than 256 bytes, and might get smaller with future hardware
generations; there is also the advanced atomic 64B write to clear the
poison.  But I don't see how any of that could be incorporated, given that
the filesystem is notified of a corruption with a pfn rather than an exact
address.

So I guess this question is also for Dan: how to avoid unnecessarily
repairing a PMD range for a 256B corrupt range going forward?

thanks,
-jane

On 7/30/2021 3:01 AM, Shiyang Ruan wrote:
> When memory-failure occurs, we call this function which is implemented
> by each kind of devices. For the fsdax case, pmem device driver
> implements it. Pmem device driver will find out the filesystem in which
> the corrupted page located in. And finally call filesystem handler to
> deal with this error.
>
> The filesystem will try to recover the corrupted data if necessary.
Hi, ShiYang,

So I applied the v6 patch series to my 5.14-rc3, as that's what you
indicated v6 was based on, and injected a hardware poison.

I'm seeing the same problem that was reported a while ago after the poison
was consumed - in the SIGBUS payload, the si_addr is missing:

** SIGBUS(7): canjmp=1, whichstep=0, **
** si_addr(0x(nil)), si_lsb(0xC), si_code(0x4, BUS_MCEERR_AR) **

The si_addr ought to be 0x7f6568000000 - the vaddr of the first page in
this case.

Something is not right...

thanks,
-jane
> -----Original Message-----
> From: Jane Chu <jane.chu@oracle.com>
> Subject: Re: [PATCH RESEND v6 1/9] pagemap: Introduce ->memory_failure()
>
> So I applied the v6 patch series to my 5.14-rc3, as that's what you
> indicated v6 was based on, and injected a hardware poison.
>
> I'm seeing the same problem that was reported a while ago after the
> poison was consumed - in the SIGBUS payload, the si_addr is missing:
>
> ** SIGBUS(7): canjmp=1, whichstep=0, **
> ** si_addr(0x(nil)), si_lsb(0xC), si_code(0x4, BUS_MCEERR_AR) **
>
> The si_addr ought to be 0x7f6568000000 - the vaddr of the first page in
> this case.
>
> Something is not right...

Hi Jane,

Sorry for the late reply, and thanks for testing.

This address should have been reported by my code.  I'll check why it
ends up nil.

--
Thanks.
Ruan.
More information -

On 8/16/2021 10:20 AM, Jane Chu wrote:
> Hi, ShiYang,
>
> So I applied the v6 patch series to my 5.14-rc3, as that's what you
> indicated v6 was based on, and injected a hardware poison.
>
> I'm seeing the same problem that was reported a while ago after the
> poison was consumed - in the SIGBUS payload, the si_addr is missing:
>
> ** SIGBUS(7): canjmp=1, whichstep=0, **
> ** si_addr(0x(nil)), si_lsb(0xC), si_code(0x4, BUS_MCEERR_AR) **
>
> The si_addr ought to be 0x7f6568000000 - the vaddr of the first page
> in this case.

The failure came from here:

[PATCH RESEND v6 6/9] xfs: Implement ->notify_failure() for XFS

+static int
+xfs_dax_notify_failure(
...
+	if (!xfs_sb_version_hasrmapbt(&mp->m_sb)) {
+		xfs_warn(mp, "notify_failure() needs rmapbt enabled!");
+		return -EOPNOTSUPP;
+	}

I am not familiar with XFS, but I have a few questions I hope to get
answers to -

1) What does it take and cost to make
   xfs_sb_version_hasrmapbt(&mp->m_sb) return true?

2) For a running environment that fails the above check, is it
   okay to leave the poison handling in limbo, and why?

3) If the above regression is not acceptable, is there any potential
   remedy?

thanks!
-jane
On 8/17/2021 10:43 PM, Jane Chu wrote:
> More information -
>
> The failure came from here:
>
> [PATCH RESEND v6 6/9] xfs: Implement ->notify_failure() for XFS
>
> +static int
> +xfs_dax_notify_failure(
> ...
> +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb)) {
> +		xfs_warn(mp, "notify_failure() needs rmapbt enabled!");
> +		return -EOPNOTSUPP;
> +	}
>
> I am not familiar with XFS, but I have a few questions I hope to get
> answers to -
>
> 1) What does it take and cost to make
>    xfs_sb_version_hasrmapbt(&mp->m_sb) return true?
>
> 2) For a running environment that fails the above check, is it
>    okay to leave the poison handling in limbo, and why?
>
> 3) If the above regression is not acceptable, is there any potential
>    remedy?

How about moving the check to prior to the notifier registration, and
registering only if the check passes?  This seems better than the
alternative, which is to fall back to the legacy memory_failure handling
in case the filesystem returns -EOPNOTSUPP.

thanks,
-jane
> -----Original Message-----
> From: Jane Chu <jane.chu@oracle.com>
> Subject: Re: [PATCH RESEND v6 1/9] pagemap: Introduce ->memory_failure()
>
> > 1) What does it take and cost to make
> >    xfs_sb_version_hasrmapbt(&mp->m_sb) return true?

Enable the rmapbt feature when making the xfs filesystem:
  `mkfs.xfs -m rmapbt=1 /path/to/device`
BTW, reflink is enabled by default.

> > 2) For a running environment that fails the above check, is it
> >    okay to leave the poison handling in limbo, and why?

It will fall back to the old handler.  I think you already know this.

> > 3) If the above regression is not acceptable, is there any potential
> >    remedy?
>
> How about moving the check to prior to the notifier registration, and
> registering only if the check passes?  This seems better than the
> alternative, which is to fall back to the legacy memory_failure handling
> in case the filesystem returns -EOPNOTSUPP.

Sounds like a nice solution.  I think I can add an is_notify_supported()
interface in dax_holder_ops and check it when registering the dax_holder.

--
Thanks,
Ruan.
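For illustration, a minimal sketch of how such a registration-time check
could look.  is_notify_supported() and dax_holder_ops come from the
proposal above; everything else (the helper name, the exact callback
signatures) is an assumption, not code from this series:

```c
/* Sketch only - names and signatures are illustrative assumptions. */
struct dax_holder_operations {
	/* called by the device driver when a pfn range goes bad */
	int (*notify_failure)(struct dax_device *dax_dev, u64 offset,
			      u64 len, int mf_flags);
	/* lets the holder declare up front whether it can take over */
	bool (*is_notify_supported)(struct dax_device *dax_dev);
};

static bool dax_holder_can_notify(struct dax_device *dax_dev,
				  const struct dax_holder_operations *ops)
{
	/* no failure callback at all: behave like ext4, don't register */
	if (!ops || !ops->notify_failure)
		return false;
	/* holder exists but cannot service failures (e.g. no rmapbt) */
	if (ops->is_notify_supported && !ops->is_notify_supported(dax_dev))
		return false;
	return true;
}
```

With a check like this at registration time, an unsupported configuration
simply never registers a holder, and memory_failure() keeps taking the
legacy path instead of hitting -EOPNOTSUPP halfway through.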
On Tue, Aug 17, 2021 at 11:08:40PM -0700, Jane Chu wrote:
> On 8/17/2021 10:43 PM, Jane Chu wrote:
> > I am not familiar with XFS, but I have a few questions I hope to get
> > answers to -
> >
> > 1) What does it take and cost to make
> >    xfs_sb_version_hasrmapbt(&mp->m_sb) return true?

mkfs.xfs -m rmapbt=1

> > 2) For a running environment that fails the above check, is it
> >    okay to leave the poison handling in limbo, and why?
> >
> > 3) If the above regression is not acceptable, is there any potential
> >    remedy?
>
> How about moving the check to prior to the notifier registration, and
> registering only if the check passes?  This seems better than the
> alternative, which is to fall back to the legacy memory_failure handling
> in case the filesystem returns -EOPNOTSUPP.

"return -EOPNOTSUPP;" is the branching point where a future patch could
probe the (DRAM) buffer cache to bwrite the contents to restore the pmem
contents.

Right now the focus should be on landing the core code changes without
drawing any more NAKs from Dan.

--D
On Wed, Aug 18, 2021 at 12:52 AM ruansy.fnst@fujitsu.com
<ruansy.fnst@fujitsu.com> wrote:
> > How about moving the check to prior to the notifier registration, and
> > registering only if the check passes?  This seems better than the
> > alternative, which is to fall back to the legacy memory_failure
> > handling in case the filesystem returns -EOPNOTSUPP.
>
> Sounds like a nice solution.  I think I can add an is_notify_supported()
> interface in dax_holder_ops and check it when registering the dax_holder.

Shouldn't the fs avoid registering a memory failure handler if it is
not prepared to take over?  For example, shouldn't this case behave
identically to ext4 that will not even register a callback?
Hi, Shiyang,

> > 1) What does it take and cost to make
> >    xfs_sb_version_hasrmapbt(&mp->m_sb) return true?
>
> Enable the rmapbt feature when making the xfs filesystem:
>   `mkfs.xfs -m rmapbt=1 /path/to/device`
> BTW, reflink is enabled by default.

Thanks!  I tried

  mkfs.xfs -d agcount=2,extszinherit=512,su=2m,sw=1 -m reflink=0 -m rmapbt=1 -f /dev/pmem0

Again, injected a HW poison to the first page in a dax-file, had the
poison consumed and received a SIGBUS.  The result is better -

** SIGBUS(7): canjmp=1, whichstep=0, **
** si_addr(0x7ff2d8800000), si_lsb(0x15), si_code(0x4, BUS_MCEERR_AR) **

The SIGBUS payload looks correct.

However, "dmesg" has 2048 lines on sending SIGBUS, one per 512 bytes -

[ 7003.482326] Memory failure: 0x1850600: Sending SIGBUS to fsdax_poison_v1:4109 due to hardware memory corruption
[ 7003.507956] Memory failure: 0x1850800: Sending SIGBUS to fsdax_poison_v1:4109 due to hardware memory corruption
[ 7003.531681] Memory failure: 0x1850a00: Sending SIGBUS to fsdax_poison_v1:4109 due to hardware memory corruption
[ 7003.554190] Memory failure: 0x1850c00: Sending SIGBUS to fsdax_poison_v1:4109 due to hardware memory corruption
[ 7003.575831] Memory failure: 0x1850e00: Sending SIGBUS to fsdax_poison_v1:4109 due to hardware memory corruption
[ 7003.596796] Memory failure: 0x1851000: Sending SIGBUS to fsdax_poison_v1:4109 due to hardware memory corruption
....
[ 7045.738270] Memory failure: 0x194fe00: Sending SIGBUS to fsdax_poison_v1:4109 due to hardware memory corruption
[ 7045.758885] Memory failure: 0x1950000: Sending SIGBUS to fsdax_poison_v1:4109 due to hardware memory corruption
[ 7045.779495] Memory failure: 0x1950200: Sending SIGBUS to fsdax_poison_v1:4109 due to hardware memory corruption
[ 7045.800106] Memory failure: 0x1950400: Sending SIGBUS to fsdax_poison_v1:4109 due to hardware memory corruption

That's too much for a single process dealing with a single poison in a
PMD page.  If nothing else, given that .si_addr_lsb is 0x15, it doesn't
make sense to send a SIGBUS per 512B block.

Could you determine the user process' mapping size from the filesystem,
and take that as a hint to determine how many iterations to call
mf_dax_kill_procs() ?

thanks!
-jane
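For reference, a minimal userspace sketch - an assumption, not the
fsdax_poison_v1 test program used above - of how a SIGBUS handler obtains
the payload fields shown here (si_addr, si_addr_lsb, si_code) via
SA_SIGINFO:

```c
#define _GNU_SOURCE		/* for BUS_MCEERR_AR and si_addr_lsb */
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void sigbus_handler(int sig, siginfo_t *info, void *ucontext)
{
	(void)ucontext;
	/* fprintf() is not async-signal-safe; acceptable in a test sketch */
	fprintf(stderr, "** SIGBUS(%d): si_addr(%p), si_lsb(0x%x), si_code(%d%s) **\n",
		sig, info->si_addr, (unsigned int)info->si_addr_lsb,
		info->si_code,
		info->si_code == BUS_MCEERR_AR ? ", BUS_MCEERR_AR" : "");
	_exit(1);
}

int main(void)
{
	struct sigaction sa = { 0 };

	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigaction(SIGBUS, &sa, NULL);

	/* mmap() a file on the dax filesystem and touch the poisoned page here */
	pause();
	return 0;
}
```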
Sorry, correction in line.

On 8/19/2021 12:18 AM, Jane Chu wrote:
> However, "dmesg" has 2048 lines on sending SIGBUS, one per 512 bytes -

Actually that's one per 2MB, even though the poison is located in pfn
0x1850600 only.

> That's too much for a single process dealing with a single poison in a
> PMD page.  If nothing else, given that .si_addr_lsb is 0x15, it doesn't
> make sense to send a SIGBUS per 512B block.
>
> Could you determine the user process' mapping size from the filesystem,
> and take that as a hint to determine how many iterations to call
> mf_dax_kill_procs() ?

Sorry, scratch the 512-byte stuff... the filesystem has been notified of
the length of the poison blast radius, could it take a clue from that?

thanks,
-jane
> From: Jane Chu <jane.chu@oracle.com>
> Subject: Re: [PATCH RESEND v6 1/9] pagemap: Introduce ->memory_failure()
>
> > However, "dmesg" has 2048 lines on sending SIGBUS, one per 512 bytes -
>
> Actually that's one per 2MB, even though the poison is located in pfn
> 0x1850600 only.
>
> > That's too much for a single process dealing with a single poison in a
> > PMD page.
>
> Sorry, scratch the 512-byte stuff... the filesystem has been notified of
> the length of the poison blast radius, could it take a clue from that?

I think this is caused by a mistake I made in the 6th patch: the xfs
handler iterates the file range in block size (4k here) even when it is a
PMD page.  That's why so many messages show up when the poison is on a
PMD page.  I'll fix it in the next version.

--
Thanks,
Ruan.
On 8/19/2021 2:10 AM, ruansy.fnst@fujitsu.com wrote:
> I think this is caused by a mistake I made in the 6th patch: the xfs
> handler iterates the file range in block size (4k here) even when it is
> a PMD page.  That's why so many messages show up when the poison is on a
> PMD page.  I'll fix it in the next version.

Sorry, just to clarify: it looks like XFS has iterated through the entire
file in a 2MiB stride.  The test file size is 4GiB, which explains 'dmesg'
showing 2048 lines about sending SIGBUS.

thanks,
-jane
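To make the granularity point concrete, here is a rough sketch of walking
only the notified range and stepping by the size of the affected mapping,
rather than iterating every block of the whole file.  The function and
parameter names are illustrative assumptions, not the fix Ruan is
preparing:

```c
/* Sketch only - illustrative, not code from this series. */
static int notify_failure_range(u64 fail_off, u64 fail_len, u64 map_size,
				int (*kill_procs)(u64 off, u64 size))
{
	/* clamp to the range the driver actually reported as poisoned */
	u64 off = round_down(fail_off, map_size);
	u64 end = round_up(fail_off + fail_len, map_size);
	int rc = 0;

	/*
	 * One notification per affected mapping (4k page or 2M PMD),
	 * not one per fs block across the entire file.
	 */
	for (; off < end && rc == 0; off += map_size)
		rc = kill_procs(off, map_size);

	return rc;
}
```

With a 256B blast radius inside a 2MiB mapping, this walks exactly one
iteration instead of the 2048 seen above.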
On Fri, Jul 30, 2021 at 3:02 AM Shiyang Ruan <ruansy.fnst@fujitsu.com> wrote:
>
> When memory-failure occurs, we call this function which is implemented
> by each kind of devices. For the fsdax case, pmem device driver
> implements it. Pmem device driver will find out the filesystem in which
> the corrupted page located in. And finally call filesystem handler to
> deal with this error.
>
> The filesystem will try to recover the corrupted data if necessary.

This patch looks good to me, but I would fold it into the patch that
first populates ->memory_failure().
On Wed, Aug 18, 2021 at 10:10:51AM -0700, Dan Williams wrote:
> > Sounds like a nice solution. I think I can add an is_notify_supported()
> > interface in dax_holder_ops and check it when registering the dax_holder.
>
> Shouldn't the fs avoid registering a memory failure handler if it is
> not prepared to take over? For example, shouldn't this case behave
> identically to ext4 that will not even register a callback?

Yes.
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index c0e9d35889e8..ed6980087db5 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -87,6 +87,15 @@ struct dev_pagemap_ops {
 	 * the page back to a CPU accessible page.
 	 */
 	vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
+
+	/*
+	 * Handle the memory failure happens on a range of pfns. Notify the
+	 * processes who are using these pfns, and try to recover the data on
+	 * them if necessary. The flag is finally passed to the recover
+	 * function through the whole notify routine.
+	 */
+	int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
+			unsigned long nr_pfns, int flags);
 };
 
 #define PGMAP_ALTMAP_VALID	(1 << 0)
When memory-failure occurs, we call this function which is implemented
by each kind of devices. For the fsdax case, pmem device driver
implements it. Pmem device driver will find out the filesystem in which
the corrupted page located in. And finally call filesystem handler to
deal with this error.

The filesystem will try to recover the corrupted data if necessary.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
---
 include/linux/memremap.h | 9 +++++++++
 1 file changed, 9 insertions(+)
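For context, a rough sketch of how a dev_pagemap owner such as the pmem
driver might implement this callback and forward the failed range to the
filesystem holder.  The helper dax_holder_notify_failure(), the field
names and the offset math below are assumptions for illustration, not the
pmem patch posted in this series:

```c
/* Sketch only - not the pmem patch from this series. */
static int pmem_pagemap_memory_failure(struct dev_pagemap *pgmap,
		unsigned long pfn, unsigned long nr_pfns, int flags)
{
	struct pmem_device *pmem =
			container_of(pgmap, struct pmem_device, pgmap);
	/* translate the bad pfn range into a byte range on the dax device */
	u64 offset = PFN_PHYS(pfn) - pmem->phys_addr - pmem->data_offset;
	u64 len = (u64)nr_pfns << PAGE_SHIFT;

	/* hand the range to the filesystem holding the dax device */
	return dax_holder_notify_failure(pmem->dax_dev, offset, len, flags);
}

static const struct dev_pagemap_ops fsdax_pagemap_ops = {
	.memory_failure		= pmem_pagemap_memory_failure,
};
```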