Message ID | 4e8c2e0facd46cfaf4ab79e19c9115958ab6f218.1536342881.git.yi.z.zhang@linux.intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Fix kvm misconceives NVDIMM pages as reserved mmio | expand |
On Fri, Sep 7, 2018 at 2:25 AM Zhang Yi <yi.z.zhang@linux.intel.com> wrote: > > For device specific memory space, when we move these area of pfn to > memory zone, we will set the page reserved flag at that time, some of > these reserved for device mmio, and some of these are not, such as > NVDIMM pmem. > > Now, we map these dev_dax or fs_dax pages to kvm for DIMM/NVDIMM > backend, since these pages are reserved, the check of > kvm_is_reserved_pfn() misconceives those pages as MMIO. Therefor, we > introduce 2 page map types, MEMORY_DEVICE_FS_DAX/MEMORY_DEVICE_DEV_DAX, > to identify these pages are from NVDIMM pmem and let kvm treat these > as normal pages. > > Without this patch, many operations will be missed due to this > mistreatment to pmem pages, for example, a page may not have chance to > be unpinned for KVM guest(in kvm_release_pfn_clean), not able to be > marked as dirty/accessed(in kvm_set_pfn_dirty/accessed) etc. > > Signed-off-by: Zhang Yi <yi.z.zhang@linux.intel.com> > Acked-by: Pankaj Gupta <pagupta@redhat.com> > --- > virt/kvm/kvm_main.c | 16 ++++++++++++++-- > 1 file changed, 14 insertions(+), 2 deletions(-) > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index c44c406..9c49634 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -147,8 +147,20 @@ __weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm, > > bool kvm_is_reserved_pfn(kvm_pfn_t pfn) > { > - if (pfn_valid(pfn)) > - return PageReserved(pfn_to_page(pfn)); > + struct page *page; > + > + if (pfn_valid(pfn)) { > + page = pfn_to_page(pfn); > + > + /* > + * For device specific memory space, there is a case > + * which we need pass MEMORY_DEVICE_FS[DEV]_DAX pages > + * to kvm, these pages marked reserved flag as it is a > + * zone device memory, we need to identify these pages > + * and let kvm treat these as normal pages > + */ > + return PageReserved(page) && !is_dax_page(page); Should we consider just not setting PageReserved for devm_memremap_pages()? Perhaps kvm is not be the only component making these assumptions about this flag? Why is MEMORY_DEVICE_PUBLIC memory specifically excluded? This has less to do with "dax" pages and more to do with devm_memremap_pages() established ranges. P2PDMA is another producer of these pages. If either MEMORY_DEVICE_PUBLIC or P2PDMA pages can be used in these kvm paths then I think this points to consider clearing the Reserved flag. That said I haven't audited all the locations that test PageReserved(). Sorry for not responding sooner I was on extended leave.
Am 19.09.18 um 04:53 schrieb Dan Williams: > On Fri, Sep 7, 2018 at 2:25 AM Zhang Yi <yi.z.zhang@linux.intel.com> wrote: >> >> For device specific memory space, when we move these area of pfn to >> memory zone, we will set the page reserved flag at that time, some of >> these reserved for device mmio, and some of these are not, such as >> NVDIMM pmem. >> >> Now, we map these dev_dax or fs_dax pages to kvm for DIMM/NVDIMM >> backend, since these pages are reserved, the check of >> kvm_is_reserved_pfn() misconceives those pages as MMIO. Therefor, we >> introduce 2 page map types, MEMORY_DEVICE_FS_DAX/MEMORY_DEVICE_DEV_DAX, >> to identify these pages are from NVDIMM pmem and let kvm treat these >> as normal pages. >> >> Without this patch, many operations will be missed due to this >> mistreatment to pmem pages, for example, a page may not have chance to >> be unpinned for KVM guest(in kvm_release_pfn_clean), not able to be >> marked as dirty/accessed(in kvm_set_pfn_dirty/accessed) etc. >> >> Signed-off-by: Zhang Yi <yi.z.zhang@linux.intel.com> >> Acked-by: Pankaj Gupta <pagupta@redhat.com> >> --- >> virt/kvm/kvm_main.c | 16 ++++++++++++++-- >> 1 file changed, 14 insertions(+), 2 deletions(-) >> >> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c >> index c44c406..9c49634 100644 >> --- a/virt/kvm/kvm_main.c >> +++ b/virt/kvm/kvm_main.c >> @@ -147,8 +147,20 @@ __weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm, >> >> bool kvm_is_reserved_pfn(kvm_pfn_t pfn) >> { >> - if (pfn_valid(pfn)) >> - return PageReserved(pfn_to_page(pfn)); >> + struct page *page; >> + >> + if (pfn_valid(pfn)) { >> + page = pfn_to_page(pfn); >> + >> + /* >> + * For device specific memory space, there is a case >> + * which we need pass MEMORY_DEVICE_FS[DEV]_DAX pages >> + * to kvm, these pages marked reserved flag as it is a >> + * zone device memory, we need to identify these pages >> + * and let kvm treat these as normal pages >> + */ >> + return PageReserved(page) && !is_dax_page(page); > > Should we consider just not setting PageReserved for > devm_memremap_pages()? Perhaps kvm is not be the only component making > these assumptions about this flag? I was asking the exact same question in v3 or so. I was recently going through all PageReserved users, trying to clean up and document how it is used. PG_reserved used to be a marker "not available for the page allocator". This is only partially true and not really helpful I think. My current understanding: " PG_reserved is set for special pages, struct pages of such pages should in general not be touched except by their owner. Pages marked as reserved include: - Kernel image (including vDSO) and similar (e.g. BIOS, initrd) - Pages allocated early during boot (bootmem, memblock) - Zero pages - Pages that have been associated with a zone but were not onlined (e.g. NVDIMM/pmem, online_page_callback used by XEN) - Pages to exclude from the hibernation image (e.g. loaded kexec images) - MCA (memory error) pages on ia64 - Offline pages Some architectures don't allow to ioremap RAM pages that are not marked as reserved. Allocated pages might have to be set reserved to allow for that - if there is a good reason to enforce this. Consequently, PG_reserved part of a user space table might be the indicator for the zero page, pmem or MMIO pages. " Swapping code does not care about PageReserved at all as far as I remember. This seems to be fine as it only looks at the way pages have been mapped into user space. I don't really see a good reason to set pmem pages as reserved. One question would be, how/if to exclude them from the hibernation image. But that could also be solved differently (we would have to double check how they are handled in hibernation code). A similar user of PageReserved to look at is: drivers/vfio/vfio_iommu_type1.c:is_invalid_reserved_pfn() It will not mark pages dirty if they are reserved. Similar to KVM code. > > Why is MEMORY_DEVICE_PUBLIC memory specifically excluded? > > This has less to do with "dax" pages and more to do with > devm_memremap_pages() established ranges. P2PDMA is another producer > of these pages. If either MEMORY_DEVICE_PUBLIC or P2PDMA pages can be > used in these kvm paths then I think this points to consider clearing > the Reserved flag. > > That said I haven't audited all the locations that test PageReserved(). > > Sorry for not responding sooner I was on extended leave. >
On Thu, Sep 20, 2018 at 7:11 AM Yi Zhang <yi.z.zhang@linux.intel.com> wrote: > > On 2018-09-19 at 09:20:25 +0200, David Hildenbrand wrote: > > Am 19.09.18 um 04:53 schrieb Dan Williams: > > > > > > Should we consider just not setting PageReserved for > > > devm_memremap_pages()? Perhaps kvm is not be the only component making > > > these assumptions about this flag? > > > > I was asking the exact same question in v3 or so. > > > > I was recently going through all PageReserved users, trying to clean up > > and document how it is used. > > > > PG_reserved used to be a marker "not available for the page allocator". > > This is only partially true and not really helpful I think. My current > > understanding: > > > > " > > PG_reserved is set for special pages, struct pages of such pages should > > in general not be touched except by their owner. Pages marked as > > reserved include: > > - Kernel image (including vDSO) and similar (e.g. BIOS, initrd) > > - Pages allocated early during boot (bootmem, memblock) > > - Zero pages > > - Pages that have been associated with a zone but were not onlined > > (e.g. NVDIMM/pmem, online_page_callback used by XEN) > > - Pages to exclude from the hibernation image (e.g. loaded kexec images) > > - MCA (memory error) pages on ia64 > > - Offline pages > > Some architectures don't allow to ioremap RAM pages that are not marked > > as reserved. Allocated pages might have to be set reserved to allow for > > that - if there is a good reason to enforce this. Consequently, > > PG_reserved part of a user space table might be the indicator for the > > zero page, pmem or MMIO pages. > > " > > > > Swapping code does not care about PageReserved at all as far as I > > remember. This seems to be fine as it only looks at the way pages have > > been mapped into user space. > > > > I don't really see a good reason to set pmem pages as reserved. One > > question would be, how/if to exclude them from the hibernation image. > > But that could also be solved differently (we would have to double check > > how they are handled in hibernation code). > > > > > > A similar user of PageReserved to look at is: > > > > drivers/vfio/vfio_iommu_type1.c:is_invalid_reserved_pfn() > > > > It will not mark pages dirty if they are reserved. Similar to KVM code. > Yes, kvm is not the only one user of the dax reserved page. > > > > > > > > Why is MEMORY_DEVICE_PUBLIC memory specifically excluded? > > > > > > This has less to do with "dax" pages and more to do with > > > devm_memremap_pages() established ranges. P2PDMA is another producer > > > of these pages. If either MEMORY_DEVICE_PUBLIC or P2PDMA pages can be > > > used in these kvm paths then I think this points to consider clearing > > > the Reserved flag. > > Thanks Dan/David's comments. > for MEMORY_DEVICE_PUBLIC memory, since host driver could manager the > memory resource to share to guest, Jerome says we could ignore it at > this time. > > And p2pmem, it seems mapped in a PCI bar space which should most likely > a mmio. I think kvm should treated as a reserved page. Ok, but the question you left unanswered is whether it would be better for devm_memremap_pages() to clear the PageReserved flag for MEMORY_DEVICE_{FS,DEV}_DAX rather than introduce a local kvm-only hack for what looks like a global problem.
On 2018-09-19 at 09:20:25 +0200, David Hildenbrand wrote: > Am 19.09.18 um 04:53 schrieb Dan Williams: > > > > Should we consider just not setting PageReserved for > > devm_memremap_pages()? Perhaps kvm is not be the only component making > > these assumptions about this flag? > > I was asking the exact same question in v3 or so. > > I was recently going through all PageReserved users, trying to clean up > and document how it is used. > > PG_reserved used to be a marker "not available for the page allocator". > This is only partially true and not really helpful I think. My current > understanding: > > " > PG_reserved is set for special pages, struct pages of such pages should > in general not be touched except by their owner. Pages marked as > reserved include: > - Kernel image (including vDSO) and similar (e.g. BIOS, initrd) > - Pages allocated early during boot (bootmem, memblock) > - Zero pages > - Pages that have been associated with a zone but were not onlined > (e.g. NVDIMM/pmem, online_page_callback used by XEN) > - Pages to exclude from the hibernation image (e.g. loaded kexec images) > - MCA (memory error) pages on ia64 > - Offline pages > Some architectures don't allow to ioremap RAM pages that are not marked > as reserved. Allocated pages might have to be set reserved to allow for > that - if there is a good reason to enforce this. Consequently, > PG_reserved part of a user space table might be the indicator for the > zero page, pmem or MMIO pages. > " > > Swapping code does not care about PageReserved at all as far as I > remember. This seems to be fine as it only looks at the way pages have > been mapped into user space. > > I don't really see a good reason to set pmem pages as reserved. One > question would be, how/if to exclude them from the hibernation image. > But that could also be solved differently (we would have to double check > how they are handled in hibernation code). > > > A similar user of PageReserved to look at is: > > drivers/vfio/vfio_iommu_type1.c:is_invalid_reserved_pfn() > > It will not mark pages dirty if they are reserved. Similar to KVM code. Yes, kvm is not the only one user of the dax reserved page. > > > > > Why is MEMORY_DEVICE_PUBLIC memory specifically excluded? > > > > This has less to do with "dax" pages and more to do with > > devm_memremap_pages() established ranges. P2PDMA is another producer > > of these pages. If either MEMORY_DEVICE_PUBLIC or P2PDMA pages can be > > used in these kvm paths then I think this points to consider clearing > > the Reserved flag. Thanks Dan/David's comments. for MEMORY_DEVICE_PUBLIC memory, since host driver could manager the memory resource to share to guest, Jerome says we could ignore it at this time. And p2pmem, it seems mapped in a PCI bar space which should most likely a mmio. I think kvm should treated as a reserved page. > > > > That said I haven't audited all the locations that test PageReserved(). > > > > Sorry for not responding sooner I was on extended leave. > > > > > -- > > Thanks, > > David / dhildenb
On 22/09/2018 00:47, Yi Zhang wrote: > On 2018-09-20 at 14:19:17 -0700, Dan Williams wrote: >> On Thu, Sep 20, 2018 at 7:11 AM Yi Zhang <yi.z.zhang@linux.intel.com> wrote: >>> >>> On 2018-09-19 at 09:20:25 +0200, David Hildenbrand wrote: >>>> Am 19.09.18 um 04:53 schrieb Dan Williams: >>>>> >>>>> Should we consider just not setting PageReserved for >>>>> devm_memremap_pages()? Perhaps kvm is not be the only component making >>>>> these assumptions about this flag? >>>> >>>> I was asking the exact same question in v3 or so. >>>> >>>> I was recently going through all PageReserved users, trying to clean up >>>> and document how it is used. >>>> >>>> PG_reserved used to be a marker "not available for the page allocator". >>>> This is only partially true and not really helpful I think. My current >>>> understanding: >>>> >>>> " >>>> PG_reserved is set for special pages, struct pages of such pages should >>>> in general not be touched except by their owner. Pages marked as >>>> reserved include: >>>> - Kernel image (including vDSO) and similar (e.g. BIOS, initrd) >>>> - Pages allocated early during boot (bootmem, memblock) >>>> - Zero pages >>>> - Pages that have been associated with a zone but were not onlined >>>> (e.g. NVDIMM/pmem, online_page_callback used by XEN) >>>> - Pages to exclude from the hibernation image (e.g. loaded kexec images) >>>> - MCA (memory error) pages on ia64 >>>> - Offline pages >>>> Some architectures don't allow to ioremap RAM pages that are not marked >>>> as reserved. Allocated pages might have to be set reserved to allow for >>>> that - if there is a good reason to enforce this. Consequently, >>>> PG_reserved part of a user space table might be the indicator for the >>>> zero page, pmem or MMIO pages. >>>> " >>>> >>>> Swapping code does not care about PageReserved at all as far as I >>>> remember. This seems to be fine as it only looks at the way pages have >>>> been mapped into user space. >>>> >>>> I don't really see a good reason to set pmem pages as reserved. One >>>> question would be, how/if to exclude them from the hibernation image. >>>> But that could also be solved differently (we would have to double check >>>> how they are handled in hibernation code). >>>> >>>> >>>> A similar user of PageReserved to look at is: >>>> >>>> drivers/vfio/vfio_iommu_type1.c:is_invalid_reserved_pfn() >>>> >>>> It will not mark pages dirty if they are reserved. Similar to KVM code. >>> Yes, kvm is not the only one user of the dax reserved page. >>>> >>>>> >>>>> Why is MEMORY_DEVICE_PUBLIC memory specifically excluded? >>>>> >>>>> This has less to do with "dax" pages and more to do with >>>>> devm_memremap_pages() established ranges. P2PDMA is another producer >>>>> of these pages. If either MEMORY_DEVICE_PUBLIC or P2PDMA pages can be >>>>> used in these kvm paths then I think this points to consider clearing >>>>> the Reserved flag. >>> >>> Thanks Dan/David's comments. >>> for MEMORY_DEVICE_PUBLIC memory, since host driver could manager the >>> memory resource to share to guest, Jerome says we could ignore it at >>> this time. >>> >>> And p2pmem, it seems mapped in a PCI bar space which should most likely >>> a mmio. I think kvm should treated as a reserved page. >> >> Ok, but the question you left unanswered is whether it would be better >> for devm_memremap_pages() to clear the PageReserved flag for >> MEMORY_DEVICE_{FS,DEV}_DAX rather than introduce a local kvm-only hack >> for what looks like a global problem. > > Remove the PageReserved flag sounds more reasonable. > And Could we still have a flag to identify it is a device private memory, or > where these pages coming from? We could use a page type for that or what you proposed. (as I said, we might have to change hibernation code to skip the pages once we drop the reserved flag).
On Fri, Sep 21, 2018 at 7:24 AM David Hildenbrand <david@redhat.com> wrote: [..] > > Remove the PageReserved flag sounds more reasonable. > > And Could we still have a flag to identify it is a device private memory, or > > where these pages coming from? > > We could use a page type for that or what you proposed. (as I said, we > might have to change hibernation code to skip the pages once we drop the > reserved flag). I think it would be reasonable to reject all ZONE_DEVICE pages in saveable_page().
On 21/09/2018 20:17, Dan Williams wrote: > On Fri, Sep 21, 2018 at 7:24 AM David Hildenbrand <david@redhat.com> wrote: > [..] >>> Remove the PageReserved flag sounds more reasonable. >>> And Could we still have a flag to identify it is a device private memory, or >>> where these pages coming from? >> >> We could use a page type for that or what you proposed. (as I said, we >> might have to change hibernation code to skip the pages once we drop the >> reserved flag). > > I think it would be reasonable to reject all ZONE_DEVICE pages in > saveable_page(). > Indeed, that sounds like the easiest solution - guess that answer was too easy for me to figure out :) .
On 2018-09-20 at 14:19:17 -0700, Dan Williams wrote: > On Thu, Sep 20, 2018 at 7:11 AM Yi Zhang <yi.z.zhang@linux.intel.com> wrote: > > > > On 2018-09-19 at 09:20:25 +0200, David Hildenbrand wrote: > > > Am 19.09.18 um 04:53 schrieb Dan Williams: > > > > > > > > Should we consider just not setting PageReserved for > > > > devm_memremap_pages()? Perhaps kvm is not be the only component making > > > > these assumptions about this flag? > > > > > > I was asking the exact same question in v3 or so. > > > > > > I was recently going through all PageReserved users, trying to clean up > > > and document how it is used. > > > > > > PG_reserved used to be a marker "not available for the page allocator". > > > This is only partially true and not really helpful I think. My current > > > understanding: > > > > > > " > > > PG_reserved is set for special pages, struct pages of such pages should > > > in general not be touched except by their owner. Pages marked as > > > reserved include: > > > - Kernel image (including vDSO) and similar (e.g. BIOS, initrd) > > > - Pages allocated early during boot (bootmem, memblock) > > > - Zero pages > > > - Pages that have been associated with a zone but were not onlined > > > (e.g. NVDIMM/pmem, online_page_callback used by XEN) > > > - Pages to exclude from the hibernation image (e.g. loaded kexec images) > > > - MCA (memory error) pages on ia64 > > > - Offline pages > > > Some architectures don't allow to ioremap RAM pages that are not marked > > > as reserved. Allocated pages might have to be set reserved to allow for > > > that - if there is a good reason to enforce this. Consequently, > > > PG_reserved part of a user space table might be the indicator for the > > > zero page, pmem or MMIO pages. > > > " > > > > > > Swapping code does not care about PageReserved at all as far as I > > > remember. This seems to be fine as it only looks at the way pages have > > > been mapped into user space. > > > > > > I don't really see a good reason to set pmem pages as reserved. One > > > question would be, how/if to exclude them from the hibernation image. > > > But that could also be solved differently (we would have to double check > > > how they are handled in hibernation code). > > > > > > > > > A similar user of PageReserved to look at is: > > > > > > drivers/vfio/vfio_iommu_type1.c:is_invalid_reserved_pfn() > > > > > > It will not mark pages dirty if they are reserved. Similar to KVM code. > > Yes, kvm is not the only one user of the dax reserved page. > > > > > > > > > > > Why is MEMORY_DEVICE_PUBLIC memory specifically excluded? > > > > > > > > This has less to do with "dax" pages and more to do with > > > > devm_memremap_pages() established ranges. P2PDMA is another producer > > > > of these pages. If either MEMORY_DEVICE_PUBLIC or P2PDMA pages can be > > > > used in these kvm paths then I think this points to consider clearing > > > > the Reserved flag. > > > > Thanks Dan/David's comments. > > for MEMORY_DEVICE_PUBLIC memory, since host driver could manager the > > memory resource to share to guest, Jerome says we could ignore it at > > this time. > > > > And p2pmem, it seems mapped in a PCI bar space which should most likely > > a mmio. I think kvm should treated as a reserved page. > > Ok, but the question you left unanswered is whether it would be better > for devm_memremap_pages() to clear the PageReserved flag for > MEMORY_DEVICE_{FS,DEV}_DAX rather than introduce a local kvm-only hack > for what looks like a global problem. Remove the PageReserved flag sounds more reasonable. And Could we still have a flag to identify it is a device private memory, or where these pages coming from? > _______________________________________________ > Linux-nvdimm mailing list > Linux-nvdimm@lists.01.org > https://lists.01.org/mailman/listinfo/linux-nvdimm
On 2018-09-21 at 21:29 David Hildenbrand <david@redhat.com> wrote: > On 21/09/2018 20:17, Dan Williams wrote: > > On Fri, Sep 21, 2018 at 7:24 AM David Hildenbrand <david@redhat.com> wrote: > > [..] > >>> Remove the PageReserved flag sounds more reasonable. > >>> And Could we still have a flag to identify it is a device private memory, or > >>> where these pages coming from? > >> > >> We could use a page type for that or what you proposed. (as I said, we > >> might have to change hibernation code to skip the pages once we drop the > >> reserved flag). > > > > I think it would be reasonable to reject all ZONE_DEVICE pages in > > saveable_page(). > > > > Indeed, that sounds like the easiest solution - guess that answer was > too easy for me to figure out :) . > Just to follow-up, is the plan to clear PageReserved for nvdimm pages instead of the approach taken in this patch set? Or should we special case nvdimm/dax pages in kvm_is_reserved_pfn()? Thanks, Barret
On 2018-10-19 at 12:33:48 -0400, Barret Rhoden wrote: > On 2018-09-21 at 21:29 David Hildenbrand <david@redhat.com> wrote: > > On 21/09/2018 20:17, Dan Williams wrote: > > > On Fri, Sep 21, 2018 at 7:24 AM David Hildenbrand <david@redhat.com> wrote: > > > [..] > > >>> Remove the PageReserved flag sounds more reasonable. > > >>> And Could we still have a flag to identify it is a device private memory, or > > >>> where these pages coming from? > > >> > > >> We could use a page type for that or what you proposed. (as I said, we > > >> might have to change hibernation code to skip the pages once we drop the > > >> reserved flag). > > > > > > I think it would be reasonable to reject all ZONE_DEVICE pages in > > > saveable_page(). > > > > > > > Indeed, that sounds like the easiest solution - guess that answer was > > too easy for me to figure out :) . > > > > Just to follow-up, is the plan to clear PageReserved for nvdimm pages > instead of the approach taken in this patch set? Or should we special > case nvdimm/dax pages in kvm_is_reserved_pfn()? Yes, we are going to remove the PageReserved flag for nvdimm pages. Added Alex, attached the patch-set. > > Thanks, > > Barret > > >
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index c44c406..9c49634 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -147,8 +147,20 @@ __weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm, bool kvm_is_reserved_pfn(kvm_pfn_t pfn) { - if (pfn_valid(pfn)) - return PageReserved(pfn_to_page(pfn)); + struct page *page; + + if (pfn_valid(pfn)) { + page = pfn_to_page(pfn); + + /* + * For device specific memory space, there is a case + * which we need pass MEMORY_DEVICE_FS[DEV]_DAX pages + * to kvm, these pages marked reserved flag as it is a + * zone device memory, we need to identify these pages + * and let kvm treat these as normal pages + */ + return PageReserved(page) && !is_dax_page(page); + } return true; }