diff mbox

[RFC,1/7] x86, mm: ZONE_DEVICE for "device memory"

Message ID 20150817214554.GA5976@gmail.com (mailing list archive)
State Superseded
Headers show

Commit Message

Jerome Glisse Aug. 17, 2015, 9:45 p.m. UTC
On Fri, Aug 14, 2015 at 07:11:27PM -0700, Dan Williams wrote:
> On Fri, Aug 14, 2015 at 3:33 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> > On Fri, Aug 14, 2015 at 3:06 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> >> On Fri, Aug 14, 2015 at 02:52:15PM -0700, Dan Williams wrote:
> >>> On Fri, Aug 14, 2015 at 2:37 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> >>> > On Wed, Aug 12, 2015 at 11:50:05PM -0400, Dan Williams wrote:
> > [..]
> >>> > What is the rational for not updating max_pfn, max_low_pfn, ... ?
> >>> >
> >>>
> >>> The idea is that this memory is not meant to be available to the page
> >>> allocator and should not count as new memory capacity.  We're only
> >>> hotplugging it to get struct page coverage.
> >>
> >> But this sounds bogus to me to rely on max_pfn to stay smaller than
> >> first_dev_pfn.  For instance you might plug a device that register
> >> dev memory and then some regular memory might be hotplug, effectively
> >> updating max_pfn to a value bigger than first_dev_pfn.
> >>
> >
> > True.
> >
> >> Also i do not think that the buddy allocator use max_pfn or max_low_pfn
> >> to consider page/zone for allocation or not.
> >
> > Yes, I took it out with no effects.  I'll investigate further whether
> > we should be touching those variables or not for this new usage.
> 
> Although it does not offer perfect protection if device memory is at a
> physically lower address than RAM, skipping the update of these
> variables does seem to be what we want.  For example /dev/mem would
> fail to allow write access to persistent memory if it fails a
> valid_phys_addr_range() check.  Since /dev/mem does not know how to
> write to PMEM in a reliably persistent way, it should not treat a
> PMEM-pfn like RAM.

So i attach is a patch that should keep ZONE_DEVICE out of consideration
for the buddy allocator. You might also want to keep page reserved and not
free inside the zone, you could replace the generic_online_page() using
set_online_page_callback() while hotpluging device memory.

Regarding /dev/mem i would not worry about highmem, as /dev/mem is already
broken in respect to memory hole that might exist (at least that is my
understanding). Alternatively if you really care about /dev/mem you could
add an arch valid_phys_addr_range() that could check valid zone.

Cheers,
Jérôme
From 45976e1186eee45ecb277fe5293a7cfa7466d740 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?= <jglisse@redhat.com>
Date: Mon, 17 Aug 2015 17:31:27 -0400
Subject: [PATCH] mm/ZONE_DEVICE: Keep ZONE_DEVICE out of allocation zonelist.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Memory inside a ZONE_DEVICE should never be consider by the buddy
allocator and thus any such zone should never be added to any of
the zonelist. This patch just do that.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 mm/page_alloc.c | 7 +++++++
 1 file changed, 7 insertions(+)

Comments

Dan Williams Aug. 18, 2015, 12:46 a.m. UTC | #1
On Mon, Aug 17, 2015 at 2:45 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> On Fri, Aug 14, 2015 at 07:11:27PM -0700, Dan Williams wrote:
>> Although it does not offer perfect protection if device memory is at a
>> physically lower address than RAM, skipping the update of these
>> variables does seem to be what we want.  For example /dev/mem would
>> fail to allow write access to persistent memory if it fails a
>> valid_phys_addr_range() check.  Since /dev/mem does not know how to
>> write to PMEM in a reliably persistent way, it should not treat a
>> PMEM-pfn like RAM.
>
> So i attach is a patch that should keep ZONE_DEVICE out of consideration
> for the buddy allocator. You might also want to keep page reserved and not
> free inside the zone, you could replace the generic_online_page() using
> set_online_page_callback() while hotpluging device memory.
>

Hmm, are we already protected by the fact that ZONE_DEVICE is not
represented in the GFP_ZONEMASK?
Jerome Glisse Aug. 18, 2015, 4:55 p.m. UTC | #2
On Mon, Aug 17, 2015 at 05:46:43PM -0700, Dan Williams wrote:
> On Mon, Aug 17, 2015 at 2:45 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> > On Fri, Aug 14, 2015 at 07:11:27PM -0700, Dan Williams wrote:
> >> Although it does not offer perfect protection if device memory is at a
> >> physically lower address than RAM, skipping the update of these
> >> variables does seem to be what we want.  For example /dev/mem would
> >> fail to allow write access to persistent memory if it fails a
> >> valid_phys_addr_range() check.  Since /dev/mem does not know how to
> >> write to PMEM in a reliably persistent way, it should not treat a
> >> PMEM-pfn like RAM.
> >
> > So i attach is a patch that should keep ZONE_DEVICE out of consideration
> > for the buddy allocator. You might also want to keep page reserved and not
> > free inside the zone, you could replace the generic_online_page() using
> > set_online_page_callback() while hotpluging device memory.
> >
> 
> Hmm, are we already protected by the fact that ZONE_DEVICE is not
> represented in the GFP_ZONEMASK?

Yeah seems you right, high_zoneidx (which is derive using gfp_zone()) will
always limit which zones are considered. I thought that under memory presure
it would go over all of the zonelist entry and eventualy consider the device
zone. But it doesn't seems to be that way.

Keeping the device zone out of the zonelist might still be a good idea, if
only to avoid pointless iteration for the page allocator. Unless someone can
think of a reason why this would be bad.

Cheers,
Jérôme
Dan Williams Aug. 18, 2015, 5:23 p.m. UTC | #3
On Tue, Aug 18, 2015 at 9:55 AM, Jerome Glisse <j.glisse@gmail.com> wrote:
> On Mon, Aug 17, 2015 at 05:46:43PM -0700, Dan Williams wrote:
>> On Mon, Aug 17, 2015 at 2:45 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
>> > On Fri, Aug 14, 2015 at 07:11:27PM -0700, Dan Williams wrote:
>> >> Although it does not offer perfect protection if device memory is at a
>> >> physically lower address than RAM, skipping the update of these
>> >> variables does seem to be what we want.  For example /dev/mem would
>> >> fail to allow write access to persistent memory if it fails a
>> >> valid_phys_addr_range() check.  Since /dev/mem does not know how to
>> >> write to PMEM in a reliably persistent way, it should not treat a
>> >> PMEM-pfn like RAM.
>> >
>> > So i attach is a patch that should keep ZONE_DEVICE out of consideration
>> > for the buddy allocator. You might also want to keep page reserved and not
>> > free inside the zone, you could replace the generic_online_page() using
>> > set_online_page_callback() while hotpluging device memory.
>> >
>>
>> Hmm, are we already protected by the fact that ZONE_DEVICE is not
>> represented in the GFP_ZONEMASK?
>
> Yeah seems you right, high_zoneidx (which is derive using gfp_zone()) will
> always limit which zones are considered. I thought that under memory presure
> it would go over all of the zonelist entry and eventualy consider the device
> zone. But it doesn't seems to be that way.
>
> Keeping the device zone out of the zonelist might still be a good idea, if
> only to avoid pointless iteration for the page allocator. Unless someone can
> think of a reason why this would be bad.
>

The other question I have is whether disabling ZONE_DMA is a realistic
tradeoff for enabling ZONE_DEVICE?  I.e. can ZONE_DMA default to off
going forward, lose some ISA device support, or do we need to figure
out how to enable > 4 zones.
Jerome Glisse Aug. 18, 2015, 7:06 p.m. UTC | #4
On Tue, Aug 18, 2015 at 10:23:38AM -0700, Dan Williams wrote:
> On Tue, Aug 18, 2015 at 9:55 AM, Jerome Glisse <j.glisse@gmail.com> wrote:
> > On Mon, Aug 17, 2015 at 05:46:43PM -0700, Dan Williams wrote:
> >> On Mon, Aug 17, 2015 at 2:45 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> >> > On Fri, Aug 14, 2015 at 07:11:27PM -0700, Dan Williams wrote:
> >> >> Although it does not offer perfect protection if device memory is at a
> >> >> physically lower address than RAM, skipping the update of these
> >> >> variables does seem to be what we want.  For example /dev/mem would
> >> >> fail to allow write access to persistent memory if it fails a
> >> >> valid_phys_addr_range() check.  Since /dev/mem does not know how to
> >> >> write to PMEM in a reliably persistent way, it should not treat a
> >> >> PMEM-pfn like RAM.
> >> >
> >> > So i attach is a patch that should keep ZONE_DEVICE out of consideration
> >> > for the buddy allocator. You might also want to keep page reserved and not
> >> > free inside the zone, you could replace the generic_online_page() using
> >> > set_online_page_callback() while hotpluging device memory.
> >> >
> >>
> >> Hmm, are we already protected by the fact that ZONE_DEVICE is not
> >> represented in the GFP_ZONEMASK?
> >
> > Yeah seems you right, high_zoneidx (which is derive using gfp_zone()) will
> > always limit which zones are considered. I thought that under memory presure
> > it would go over all of the zonelist entry and eventualy consider the device
> > zone. But it doesn't seems to be that way.
> >
> > Keeping the device zone out of the zonelist might still be a good idea, if
> > only to avoid pointless iteration for the page allocator. Unless someone can
> > think of a reason why this would be bad.
> >
> 
> The other question I have is whether disabling ZONE_DMA is a realistic
> tradeoff for enabling ZONE_DEVICE?  I.e. can ZONE_DMA default to off
> going forward, lose some ISA device support, or do we need to figure
> out how to enable > 4 zones.

That require some auditing a quick look and it seems to matter for s390
arch and there is still few driver that use it. I think we can forget
about ISA bus, i would be surprise if you could still run a recent kernel
on a computer that has ISA bus.

Thought maybe you don't need a new ZONE_DEV and all you need is valid
struct page for this device memory, and you don't want this page to be
useable by the general memory allocator. There is surely other ways to
achieve that like marking all as reserved when you hotplug them.

Cheers,
Jérôme
Dan Williams Aug. 20, 2015, 12:49 a.m. UTC | #5
On Tue, Aug 18, 2015 at 12:06 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> On Tue, Aug 18, 2015 at 10:23:38AM -0700, Dan Williams wrote:
> Thought maybe you don't need a new ZONE_DEV and all you need is valid
> struct page for this device memory, and you don't want this page to be
> useable by the general memory allocator. There is surely other ways to
> achieve that like marking all as reserved when you hotplug them.
>

Yes, there are other ways that can achieve the same thing, but I do
like the ability to do reverse page to zone lookups for debug if
anything.
diff mbox

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ef19f22..f3e26de 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3834,6 +3834,13 @@  static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist,
 	do {
 		zone_type--;
 		zone = pgdat->node_zones + zone_type;
+		/*
+		 * Device zone is special memory and should never be consider
+		 * for regular allocation. It is expected that page in device
+		 * zone will be allocated by other means.
+		 */
+		if (is_dev_zone(zone))
+			continue;
 		if (populated_zone(zone)) {
 			zoneref_set_zone(zone,
 				&zonelist->_zonerefs[nr_zones++]);