Message ID | 20150817214554.GA5976@gmail.com (mailing list archive) |
---|---|
State | Superseded |
On Mon, Aug 17, 2015 at 2:45 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> On Fri, Aug 14, 2015 at 07:11:27PM -0700, Dan Williams wrote:
>> Although it does not offer perfect protection if device memory is at a
>> physically lower address than RAM, skipping the update of these
>> variables does seem to be what we want. For example /dev/mem would
>> fail to allow write access to persistent memory if it fails a
>> valid_phys_addr_range() check. Since /dev/mem does not know how to
>> write to PMEM in a reliably persistent way, it should not treat a
>> PMEM-pfn like RAM.
>
> So attached is a patch that should keep ZONE_DEVICE out of consideration
> for the buddy allocator. You might also want to keep pages reserved and not
> free inside the zone; you could replace generic_online_page() using
> set_online_page_callback() while hotplugging device memory.
>

Hmm, are we already protected by the fact that ZONE_DEVICE is not
represented in the GFP_ZONEMASK?
On Mon, Aug 17, 2015 at 05:46:43PM -0700, Dan Williams wrote:
> On Mon, Aug 17, 2015 at 2:45 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> > On Fri, Aug 14, 2015 at 07:11:27PM -0700, Dan Williams wrote:
> >> Although it does not offer perfect protection if device memory is at a
> >> physically lower address than RAM, skipping the update of these
> >> variables does seem to be what we want. For example /dev/mem would
> >> fail to allow write access to persistent memory if it fails a
> >> valid_phys_addr_range() check. Since /dev/mem does not know how to
> >> write to PMEM in a reliably persistent way, it should not treat a
> >> PMEM-pfn like RAM.
> >
> > So attached is a patch that should keep ZONE_DEVICE out of consideration
> > for the buddy allocator. You might also want to keep pages reserved and not
> > free inside the zone; you could replace generic_online_page() using
> > set_online_page_callback() while hotplugging device memory.
> >
>
> Hmm, are we already protected by the fact that ZONE_DEVICE is not
> represented in the GFP_ZONEMASK?

Yeah, it seems you are right: high_zoneidx (which is derived using
gfp_zone()) will always limit which zones are considered. I thought that
under memory pressure it would go over all of the zonelist entries and
eventually consider the device zone. But it doesn't seem to be that way.

Keeping the device zone out of the zonelist might still be a good idea, if
only to avoid pointless iteration for the page allocator. Unless someone can
think of a reason why this would be bad.

Cheers,
Jérôme
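To illustrate the point about high_zoneidx, here is a minimal userspace sketch (not kernel code) of a zonelist walk that is capped by an index derived from the allocation flags. All names here (`model_zone_type`, `struct model_zone`, `first_usable_zone()`) and the zone ordering are hypothetical and exist only for this model; the only thing taken from the discussion is the idea that zones above high_zoneidx are never considered, even when every lower zone is empty.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical zone ordering for this model. The device zone sits above
 * the highest index that any ordinary allocation can request.
 */
enum model_zone_type {
	MODEL_ZONE_DMA32,
	MODEL_ZONE_NORMAL,
	MODEL_ZONE_MOVABLE,
	MODEL_ZONE_DEVICE,
	MODEL_MAX_NR_ZONES
};

struct model_zone {
	enum model_zone_type type;
	unsigned long free_pages;
};

/*
 * Walk a zonelist (ordered highest zone first) and return the first zone
 * whose type is <= high_zoneidx and that still has free pages. Zones above
 * high_zoneidx are skipped unconditionally, so they cannot be picked as a
 * fallback under memory pressure. Returns NULL if nothing qualifies.
 */
static const struct model_zone *
first_usable_zone(const struct model_zone *zonelist, size_t nr,
		  enum model_zone_type high_zoneidx)
{
	assert(zonelist != NULL);

	for (size_t i = 0; i < nr; i++) {
		if (zonelist[i].type > high_zoneidx)
			continue;	/* above the cap: never considered */
		if (zonelist[i].free_pages > 0)
			return &zonelist[i];
	}
	return NULL;
}
```

In this model, a request capped at MODEL_ZONE_MOVABLE fails outright when the ordinary zones are empty, rather than spilling into the device zone. The device zone entry is still visited and rejected on every walk, which is the pointless iteration that removing it from the zonelist would avoid.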
On Tue, Aug 18, 2015 at 9:55 AM, Jerome Glisse <j.glisse@gmail.com> wrote:
> On Mon, Aug 17, 2015 at 05:46:43PM -0700, Dan Williams wrote:
>> On Mon, Aug 17, 2015 at 2:45 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
>> > On Fri, Aug 14, 2015 at 07:11:27PM -0700, Dan Williams wrote:
>> >> Although it does not offer perfect protection if device memory is at a
>> >> physically lower address than RAM, skipping the update of these
>> >> variables does seem to be what we want. For example /dev/mem would
>> >> fail to allow write access to persistent memory if it fails a
>> >> valid_phys_addr_range() check. Since /dev/mem does not know how to
>> >> write to PMEM in a reliably persistent way, it should not treat a
>> >> PMEM-pfn like RAM.
>> >
>> > So attached is a patch that should keep ZONE_DEVICE out of consideration
>> > for the buddy allocator. You might also want to keep pages reserved and not
>> > free inside the zone; you could replace generic_online_page() using
>> > set_online_page_callback() while hotplugging device memory.
>> >
>>
>> Hmm, are we already protected by the fact that ZONE_DEVICE is not
>> represented in the GFP_ZONEMASK?
>
> Yeah, it seems you are right: high_zoneidx (which is derived using
> gfp_zone()) will always limit which zones are considered. I thought that
> under memory pressure it would go over all of the zonelist entries and
> eventually consider the device zone. But it doesn't seem to be that way.
>
> Keeping the device zone out of the zonelist might still be a good idea, if
> only to avoid pointless iteration for the page allocator. Unless someone can
> think of a reason why this would be bad.
>

The other question I have is whether disabling ZONE_DMA is a realistic
tradeoff for enabling ZONE_DEVICE. I.e. can ZONE_DMA default to off
going forward and lose some ISA device support, or do we need to figure
out how to enable > 4 zones?
On Tue, Aug 18, 2015 at 10:23:38AM -0700, Dan Williams wrote:
> On Tue, Aug 18, 2015 at 9:55 AM, Jerome Glisse <j.glisse@gmail.com> wrote:
> > On Mon, Aug 17, 2015 at 05:46:43PM -0700, Dan Williams wrote:
> >> On Mon, Aug 17, 2015 at 2:45 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> >> > On Fri, Aug 14, 2015 at 07:11:27PM -0700, Dan Williams wrote:
> >> >> Although it does not offer perfect protection if device memory is at a
> >> >> physically lower address than RAM, skipping the update of these
> >> >> variables does seem to be what we want. For example /dev/mem would
> >> >> fail to allow write access to persistent memory if it fails a
> >> >> valid_phys_addr_range() check. Since /dev/mem does not know how to
> >> >> write to PMEM in a reliably persistent way, it should not treat a
> >> >> PMEM-pfn like RAM.
> >> >
> >> > So attached is a patch that should keep ZONE_DEVICE out of consideration
> >> > for the buddy allocator. You might also want to keep pages reserved and not
> >> > free inside the zone; you could replace generic_online_page() using
> >> > set_online_page_callback() while hotplugging device memory.
> >> >
> >>
> >> Hmm, are we already protected by the fact that ZONE_DEVICE is not
> >> represented in the GFP_ZONEMASK?
> >
> > Yeah, it seems you are right: high_zoneidx (which is derived using
> > gfp_zone()) will always limit which zones are considered. I thought that
> > under memory pressure it would go over all of the zonelist entries and
> > eventually consider the device zone. But it doesn't seem to be that way.
> >
> > Keeping the device zone out of the zonelist might still be a good idea, if
> > only to avoid pointless iteration for the page allocator. Unless someone can
> > think of a reason why this would be bad.
> >
>
> The other question I have is whether disabling ZONE_DMA is a realistic
> tradeoff for enabling ZONE_DEVICE. I.e. can ZONE_DMA default to off
> going forward and lose some ISA device support, or do we need to figure
> out how to enable > 4 zones?

That requires some auditing. From a quick look, it seems to matter for the
s390 arch, and there are still a few drivers that use it. I think we can
forget about the ISA bus; I would be surprised if you could still run a
recent kernel on a computer that has an ISA bus.

Though maybe you don't need a new ZONE_DEV and all you need is a valid
struct page for this device memory, and you don't want these pages to be
usable by the general memory allocator. There are surely other ways to
achieve that, like marking them all as reserved when you hotplug them.

Cheers,
Jérôme
On Tue, Aug 18, 2015 at 12:06 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
> On Tue, Aug 18, 2015 at 10:23:38AM -0700, Dan Williams wrote:
> Though maybe you don't need a new ZONE_DEV and all you need is a valid
> struct page for this device memory, and you don't want these pages to be
> usable by the general memory allocator. There are surely other ways to
> achieve that, like marking them all as reserved when you hotplug them.
>

Yes, there are other ways that can achieve the same thing, but I do
like the ability to do reverse page-to-zone lookups, for debug if
anything.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ef19f22..f3e26de 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3834,6 +3834,13 @@ static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist,
 	do {
 		zone_type--;
 		zone = pgdat->node_zones + zone_type;
+		/*
+		 * The device zone is special memory and should never be
+		 * considered for regular allocation. Pages in the device
+		 * zone are expected to be allocated by other means.
+		 */
+		if (is_dev_zone(zone))
+			continue;
 		if (populated_zone(zone)) {
 			zoneref_set_zone(zone,
 				&zonelist->_zonerefs[nr_zones++]);
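The effect of the hunk above can be sketched in a small userspace model. This is an illustration under stated assumptions, not the kernel implementation: `struct model_zone`, `struct model_node`, the `is_device` flag, and `model_build_zonelist()` are hypothetical stand-ins for the kernel's zone structures, `is_dev_zone()`, and `build_zonelists_node()`. The loop shape (count down from the top zone, skip the device zone, record populated zones) mirrors the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical zone indices for this model only. */
enum model_zone_type {
	MODEL_ZONE_DMA32,
	MODEL_ZONE_NORMAL,
	MODEL_ZONE_MOVABLE,
	MODEL_ZONE_DEVICE,
	MODEL_MAX_NR_ZONES
};

struct model_zone {
	unsigned long present_pages;	/* 0 means the zone is unpopulated */
	bool is_device;			/* stand-in for is_dev_zone() */
};

struct model_node {
	struct model_zone zones[MODEL_MAX_NR_ZONES];
};

/*
 * Fill out[] with the indices of the zones that belong in the node's
 * zonelist, highest zone first, and return how many were recorded.
 * Device zones are skipped before the populated check, so they never
 * appear in the list regardless of how many pages they hold.
 */
static int model_build_zonelist(const struct model_node *node,
				int out[MODEL_MAX_NR_ZONES])
{
	int zone_type = MODEL_MAX_NR_ZONES;
	int nr_zones = 0;

	assert(node != NULL && out != NULL);

	do {
		const struct model_zone *zone;

		zone_type--;
		zone = &node->zones[zone_type];
		/*
		 * In a do/while loop, continue jumps to the loop condition,
		 * so the walk still terminates when zone_type reaches 0.
		 */
		if (zone->is_device)
			continue;
		if (zone->present_pages > 0)
			out[nr_zones++] = zone_type;
	} while (zone_type);

	return nr_zones;
}
```

With a node whose device zone holds pages and whose normal zone is empty, the resulting list contains only the populated non-device zones, so the page allocator never iterates over the device zone at all.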