Message ID | 20191024120938.11237-5-david@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Series | mm: Don't mark hotplugged pages PG_reserved (including ZONE_DEVICE) |

On Thu, Oct 24, 2019 at 5:12 AM David Hildenbrand <david@redhat.com> wrote:
>
> Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
> change that.
>
> KVM has this weird use case that you can map anything from /dev/mem
> into the guest. pfn_valid() is not a reliable check whether the memmap
> was initialized and can be touched. pfn_to_online_page() makes sure
> that we have an initialized memmap (and don't have ZONE_DEVICE memory).
>
> Rewrite is_invalid_reserved_pfn() similar to kvm_is_reserved_pfn() to make
> sure the function produces the same result once we stop setting ZONE_DEVICE
> pages PG_reserved.
>
> Cc: Alex Williamson <alex.williamson@redhat.com>
> Cc: Cornelia Huck <cohuck@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 2ada8e6cdb88..f8ce8c408ba8 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -299,9 +299,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
>   */
>  static bool is_invalid_reserved_pfn(unsigned long pfn)
>  {
> -	if (pfn_valid(pfn))
> -		return PageReserved(pfn_to_page(pfn));
> +	struct page *page = pfn_to_online_page(pfn);

Ugh, I just realized this is not a safe conversion until
pfn_to_online_page() is moved over to subsection granularity. As it
stands it will return true for any ZONE_DEVICE pages that share a
section with boot memory.


> On 07.11.2019 at 16:40, Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Thu, Oct 24, 2019 at 5:12 AM David Hildenbrand <david@redhat.com> wrote:
>> [...]
>>  static bool is_invalid_reserved_pfn(unsigned long pfn)
>>  {
>> -	if (pfn_valid(pfn))
>> -		return PageReserved(pfn_to_page(pfn));
>> +	struct page *page = pfn_to_online_page(pfn);
>
> Ugh, I just realized this is not a safe conversion until
> pfn_to_online_page() is moved over to subsection granularity. As it
> stands it will return true for any ZONE_DEVICE pages that share a
> section with boot memory.

That should not happen right now, and I commented back when you
introduced subsection support that I don’t want to have ZONE_DEVICE
mixed with online pages in a section. Having memory block devices that
partially span ZONE_DEVICE would be ... really weird.

With something like pfn_active() - as discussed - we could at least make
this check work - but I am not sure if we really want to go down that
path. In the worst case, some MB of RAM are lost ... I guess this needs
more thought.


On 07.11.19 19:22, David Hildenbrand wrote:
>> On 07.11.2019 at 16:40, Dan Williams <dan.j.williams@intel.com> wrote:
>> [...]
>> Ugh, I just realized this is not a safe conversion until
>> pfn_to_online_page() is moved over to subsection granularity. As it
>> stands it will return true for any ZONE_DEVICE pages that share a
>> section with boot memory.
>
> That should not happen right now and I commented back when you
> introduced subsection support that I don’t want to have ZONE_DEVICE
> mixed with online pages in a section. Having memory block devices that
> partially span ZONE_DEVICE would be ... really weird. With something
> like pfn_active() - as discussed - we could at least make this check
> work - but I am not sure if we really want to go down that path. In the
> worst case, some MB of RAM are lost ... I guess this needs more thought.
>

I just realized the "boot memory" part. Is that a real thing? IOW, can
we have ZONE_DEVICE falling into a memory block (with holes)? I somewhat
have doubts that this would work ...


On Thu, Nov 7, 2019 at 2:07 PM David Hildenbrand <david@redhat.com> wrote:
> [...]
> I just realized the "boot memory" part. Is that a real thing? IOW, can
> we have ZONE_DEVICE falling into a memory block (with holes)? I somewhat
> have doubts that this would work ...

One of the real world failure cases that started the subsection effort
is that Persistent Memory collides with System RAM on a 64MB boundary
on shipping platforms. System RAM ends on a 64MB boundary and due to a
lack of memory controller resources PMEM is mapped contiguously at the
end of that boundary. Some more details in the subsection cover letter
/ changelogs [1] [2]. It's not sufficient to just lose some memory;
that's the broken implementation that led to the subsection work,
because the lost memory may change from one boot to the next and
software can't reliably inject a padding that conforms to the x86
128MB section constraint.

Suffice to say I think we need your pfn_active() to get subsection
granularity pfn_to_online_page() before PageReserved() can be removed.

[1]: https://lore.kernel.org/linux-mm/156092349300.979959.17603710711957735135.stgit@dwillia2-desk3.amr.corp.intel.com/
[2]: https://lore.kernel.org/linux-mm/156092354368.979959.6232443923440952359.stgit@dwillia2-desk3.amr.corp.intel.com/

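
The collision described above comes down to section arithmetic. The following is a minimal user-space sketch, assuming 4 KiB pages and the x86-64 section size of 128 MiB (SECTION_SIZE_BITS = 27). The macro names mirror the kernel's but are redefined locally, and ram_and_pmem_share_section() is a hypothetical helper invented for illustration, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values: 4 KiB pages and 128 MiB sections, redefined for user space. */
#define PAGE_SHIFT        12
#define SECTION_SIZE_BITS 27
#define PFN_SECTION_SHIFT (SECTION_SIZE_BITS - PAGE_SHIFT)

/* Convert a physical address to its page frame number. */
static inline uint64_t phys_to_pfn(uint64_t phys)
{
	return phys >> PAGE_SHIFT;
}

/* Section number that a page frame number falls into. */
static inline uint64_t pfn_to_section_nr(uint64_t pfn)
{
	return pfn >> PFN_SECTION_SHIFT;
}

/*
 * Hypothetical helper: System RAM ends at ram_end_phys and PMEM is mapped
 * directly behind it. Returns true if the last RAM page and the first PMEM
 * page land in the same section.
 */
static bool ram_and_pmem_share_section(uint64_t ram_end_phys)
{
	uint64_t last_ram_pfn, first_pmem_pfn;

	/* The boundary must be page aligned and leave at least one RAM page. */
	assert(ram_end_phys >= (1ULL << PAGE_SHIFT));
	assert((ram_end_phys & ((1ULL << PAGE_SHIFT) - 1)) == 0);

	last_ram_pfn = phys_to_pfn(ram_end_phys) - 1;
	first_pmem_pfn = phys_to_pfn(ram_end_phys);

	return pfn_to_section_nr(last_ram_pfn) ==
	       pfn_to_section_nr(first_pmem_pfn);
}
```

With RAM ending at 4 GiB + 64 MiB the boundary sits in the middle of a section, so both kinds of memory share it; with RAM ending at 4 GiB + 128 MiB the boundary is section aligned and they do not.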

On 08.11.19 06:09, Dan Williams wrote:
> On Thu, Nov 7, 2019 at 2:07 PM David Hildenbrand <david@redhat.com> wrote:
>> [...]
>> I just realized the "boot memory" part. Is that a real thing? IOW, can
>> we have ZONE_DEVICE falling into a memory block (with holes)? I somewhat
>> have doubts that this would work ...
>
> One of the real world failure cases that started the subsection effort
> is that Persistent Memory collides with System RAM on a 64MB boundary
> on shipping platforms. System RAM ends on a 64MB boundary and due to a
> lack of memory controller resources PMEM is mapped contiguously at the
> end of that boundary. Some more details in the subsection cover letter
> / changelogs [1] [2]. It's not sufficient to just lose some memory;
> that's the broken implementation that led to the subsection work,
> because the lost memory may change from one boot to the next and
> software can't reliably inject a padding that conforms to the x86
> 128MB section constraint.

Thanks, I thought it was mostly for weird alignment where other parts of
the section are basically "holes" and not memory.

Yes, it is a real bug that ZONE_DEVICE pages fall into sections that are
marked SECTION_IS_ONLINE.

>
> Suffice to say I think we need your pfn_active() to get subsection
> granularity pfn_to_online_page() before PageReserved() can be removed.

I agree that we have to fix this. I don't like ZONE_DEVICE pages falling
into memory device blocks (e.g., cannot get offlined), but I guess that
train is gone :) As long as it's not for memory hotplug, I can most
probably live with this.

Also, I'd like to get Michal's opinion on this and the pfn_active()
approach, but I can understand he's busy.

This patch set can wait, I won't be working next week besides
reading/writing mails either way.

Is anybody looking into the pfn_active() thingy?

>
> [1]: https://lore.kernel.org/linux-mm/156092349300.979959.17603710711957735135.stgit@dwillia2-desk3.amr.corp.intel.com/
> [2]: https://lore.kernel.org/linux-mm/156092354368.979959.6232443923440952359.stgit@dwillia2-desk3.amr.corp.intel.com/
>


On 08.11.19 08:14, David Hildenbrand wrote:
> [...]
> Is anybody looking into the pfn_active() thingy?
>

I wonder if we should do something like this right now to fix this
(exclude the false positive ZONE_DEVICE pages we could have within an
online section, which was not possible before subsection hotplug):

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 384ffb3d69ab..490a9e9358b3 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -30,6 +30,8 @@ struct vmem_altmap;
 	if (___nr < NR_MEM_SECTIONS && online_section_nr(___nr) && \
 	    pfn_valid_within(___pfn)) \
 		___page = pfn_to_page(___pfn); \
+	if (unlikely(___page && is_zone_device_page(___page))) \
+		___page = NULL; \
 	___page; \
 })

Yeah, it's another is_zone_device_page(), but it should not be racy
here, as we want to exclude, not include ZONE_DEVICE.

I don't have time to look into this right now, unfortunately.

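
To make the effect of the proposed hunk concrete, here is a toy user-space model, not kernel code. It assumes a made-up geometry of 8 pages per section and 4 sections, and every name in it (struct toy_page, toy_memmap, toy_pfn_to_online_page() and so on) is hypothetical. It only contrasts a lookup that checks the online state per section with one that additionally rejects ZONE_DEVICE pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy geometry, far smaller than real sections, so the arrays stay tiny. */
#define TOY_PAGES_PER_SECTION 8UL
#define TOY_NR_SECTIONS       4UL
#define TOY_NR_PFNS           (TOY_PAGES_PER_SECTION * TOY_NR_SECTIONS)

/* Stand-in for struct page: only records whether the page is ZONE_DEVICE. */
struct toy_page {
	bool zone_device;
};

static struct toy_page toy_memmap[TOY_NR_PFNS];
static bool toy_section_online[TOY_NR_SECTIONS];

/*
 * Mark section nr online, with its first half as ordinary RAM and its
 * second half as ZONE_DEVICE, like the mixed sections discussed above.
 */
static void toy_online_mixed_section(unsigned long nr)
{
	unsigned long start = nr * TOY_PAGES_PER_SECTION;
	unsigned long half = TOY_PAGES_PER_SECTION / 2;

	assert(nr < TOY_NR_SECTIONS);
	toy_section_online[nr] = true;
	for (unsigned long i = 0; i < TOY_PAGES_PER_SECTION; i++)
		toy_memmap[start + i].zone_device = (i >= half);
}

/* Lookup that only knows whether the whole section is online. */
static struct toy_page *toy_pfn_to_online_page(unsigned long pfn)
{
	if (pfn >= TOY_NR_PFNS ||
	    !toy_section_online[pfn / TOY_PAGES_PER_SECTION])
		return NULL;
	return &toy_memmap[pfn];
}

/* Same lookup with the extra exclusion of ZONE_DEVICE pages. */
static struct toy_page *toy_pfn_to_online_page_excl_zdev(unsigned long pfn)
{
	struct toy_page *page = toy_pfn_to_online_page(pfn);

	if (page && page->zone_device)
		page = NULL;
	return page;
}
```

After toy_online_mixed_section(1), pfn 13 is a ZONE_DEVICE page inside an online section: the first lookup hands it back as if it were online, while the second returns NULL for it and still returns the RAM page at pfn 9.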

On Fri, Nov 8, 2019 at 2:22 AM David Hildenbrand <david@redhat.com> wrote:
> [...]
> I wonder if we should do something like this right now to fix this
> (exclude the false positive ZONE_DEVICE pages we could have within an
> online section, which was not possible before subsection hotplug):
>
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 384ffb3d69ab..490a9e9358b3 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -30,6 +30,8 @@ struct vmem_altmap;
>  	if (___nr < NR_MEM_SECTIONS && online_section_nr(___nr) && \
>  	    pfn_valid_within(___pfn)) \
>  		___page = pfn_to_page(___pfn); \
> +	if (unlikely(___page && is_zone_device_page(___page))) \
> +		___page = NULL; \
>  	___page; \
>  })
>
> Yeah, it's another is_zone_device_page(), but it should not be racy
> here, as we want to exclude, not include ZONE_DEVICE.
>
> I don't have time to look into this right now, unfortunately.

I don't want to band-aid without an actual bug report. I'll take a
look at a subsection-map for the online state.


On 08.11.19 19:29, Dan Williams wrote:
> On Fri, Nov 8, 2019 at 2:22 AM David Hildenbrand <david@redhat.com> wrote:
>> [...]
>> Yeah, it's another is_zone_device_page(), but it should not be racy
>> here, as we want to exclude, not include ZONE_DEVICE.
>>
>> I don't have time to look into this right now, unfortunately.
>
> I don't want to band-aid without an actual bug report. I'll take a
> look at a subsection-map for the online state.
>

Fair enough, but at least in what I proposed for pfn_active(), this
check would exist in pfn_to_online_page() in a similar way - and it is
certainly easier to backport. But yeah, triggering this might not be
easy.


diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2ada8e6cdb88..f8ce8c408ba8 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -299,9 +299,15 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
  */
 static bool is_invalid_reserved_pfn(unsigned long pfn)
 {
-	if (pfn_valid(pfn))
-		return PageReserved(pfn_to_page(pfn));
+	struct page *page = pfn_to_online_page(pfn);
+	/*
+	 * We treat any pages that are not online (not managed by the buddy)
+	 * as reserved - this includes ZONE_DEVICE pages and pages without
+	 * a memmap (e.g., mapped via /dev/mem).
+	 */
+	if (page)
+		return PageReserved(page);
 	return true;
 }


Right now, ZONE_DEVICE memory is always set PG_reserved. We want to
change that.

KVM has this weird use case that you can map anything from /dev/mem
into the guest. pfn_valid() is not a reliable check whether the memmap
was initialized and can be touched. pfn_to_online_page() makes sure
that we have an initialized memmap (and don't have ZONE_DEVICE memory).

Rewrite is_invalid_reserved_pfn() similar to kvm_is_reserved_pfn() to make
sure the function produces the same result once we stop setting ZONE_DEVICE
pages PG_reserved.

Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/vfio/vfio_iommu_type1.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)
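
The intent stated in the commit message can be tabulated with a small user-space model. This is only a sketch: struct toy_pfn_state and both functions are hypothetical stand-ins that reduce a pfn to three booleans, and the model assumes pfn_to_online_page() never returns a ZONE_DEVICE page, which is exactly the assumption questioned in the thread for sections shared with System RAM.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of what the kernel would know about one pfn. */
struct toy_pfn_state {
	bool has_memmap;  /* pfn_valid() would succeed */
	bool online;      /* pfn_to_online_page() would return a page */
	bool pg_reserved; /* PG_reserved is set on the page */
};

/* Model of the check before the patch: trust any pfn with a memmap. */
static bool old_is_invalid_reserved_pfn(const struct toy_pfn_state *s)
{
	if (s->has_memmap)
		return s->pg_reserved;
	return true;
}

/* Model of the check after the patch: only look at online pages. */
static bool new_is_invalid_reserved_pfn(const struct toy_pfn_state *s)
{
	/* An online page always has a memmap in this model. */
	assert(!s->online || s->has_memmap);

	if (s->online)
		return s->pg_reserved;
	return true;
}
```

For a ZONE_DEVICE pfn that still carries PG_reserved, both models answer true. Once PG_reserved is dropped from such a page, the old model flips to false while the new one keeps answering true, which is the "same result" the commit message asks for. Ordinary online RAM yields false in both, and a pfn without a memmap yields true in both.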