Message ID | 1567503958-25831-4-git-send-email-anshuman.khandual@arm.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | arm64/mm: Enable memory hot remove | expand |
On Tue, Sep 03, 2019 at 03:15:58PM +0530, Anshuman Khandual wrote: > @@ -770,6 +1022,28 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, > void vmemmap_free(unsigned long start, unsigned long end, > struct vmem_altmap *altmap) > { > +#ifdef CONFIG_MEMORY_HOTPLUG > + /* > + * FIXME: We should have called remove_pagetable(start, end, true). > + * vmemmap and vmalloc virtual range might share intermediate kernel > + * page table entries. Removing vmemmap range page table pages here > + * can potentially conflict with a concurrent vmalloc() allocation. > + * > + * This is primarily because vmalloc() does not take init_mm ptl for > + * the entire page table walk and it's modification. Instead it just > + * takes the lock while allocating and installing page table pages > + * via [p4d|pud|pmd|pte]_alloc(). A concurrently vanishing page table > + * entry via memory hot remove can cause vmalloc() kernel page table > + * walk pointers to be invalid on the fly which can cause corruption > + * or worst, a crash. > + * > + * So free_empty_tables() gets called where vmalloc and vmemmap range > + * do not overlap at any intermediate level kernel page table entry. > + */ > + unmap_hotplug_range(start, end, true); > + if (!vmalloc_vmemmap_overlap) > + free_empty_tables(start, end); > +#endif > } > #endif /* CONFIG_SPARSEMEM_VMEMMAP */ I wonder whether we could simply ignore the vmemmap freeing altogether, just leave it around and not unmap it. This way, we could call unmap_kernel_range() for removing the linear map and we save some code. For the linear map, I think we use just above 2MB of tables for 1GB of memory mapped (worst case with 4KB pages we need 512 pte pages). For vmemmap we'd use slightly above 2MB for a 64GB hotplugged memory. Do we expect such memory to be re-plugged again in the same range? If we do, then I shouldn't even bother with removing the vmmemmap. I don't fully understand the use-case for memory hotremove, so any additional info would be useful to make a decision here.
On 10.09.19 18:17, Catalin Marinas wrote: > On Tue, Sep 03, 2019 at 03:15:58PM +0530, Anshuman Khandual wrote: >> @@ -770,6 +1022,28 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, >> void vmemmap_free(unsigned long start, unsigned long end, >> struct vmem_altmap *altmap) >> { >> +#ifdef CONFIG_MEMORY_HOTPLUG >> + /* >> + * FIXME: We should have called remove_pagetable(start, end, true). >> + * vmemmap and vmalloc virtual range might share intermediate kernel >> + * page table entries. Removing vmemmap range page table pages here >> + * can potentially conflict with a concurrent vmalloc() allocation. >> + * >> + * This is primarily because vmalloc() does not take init_mm ptl for >> + * the entire page table walk and it's modification. Instead it just >> + * takes the lock while allocating and installing page table pages >> + * via [p4d|pud|pmd|pte]_alloc(). A concurrently vanishing page table >> + * entry via memory hot remove can cause vmalloc() kernel page table >> + * walk pointers to be invalid on the fly which can cause corruption >> + * or worst, a crash. >> + * >> + * So free_empty_tables() gets called where vmalloc and vmemmap range >> + * do not overlap at any intermediate level kernel page table entry. >> + */ >> + unmap_hotplug_range(start, end, true); >> + if (!vmalloc_vmemmap_overlap) >> + free_empty_tables(start, end); >> +#endif >> } >> #endif /* CONFIG_SPARSEMEM_VMEMMAP */ > > I wonder whether we could simply ignore the vmemmap freeing altogether, > just leave it around and not unmap it. This way, we could call > unmap_kernel_range() for removing the linear map and we save some code. > > For the linear map, I think we use just above 2MB of tables for 1GB of > memory mapped (worst case with 4KB pages we need 512 pte pages). For > vmemmap we'd use slightly above 2MB for a 64GB hotplugged memory. Do we > expect such memory to be re-plugged again in the same range? If we do, > then I shouldn't even bother with removing the vmmemmap. > FWIW, I think we should do it cleanly. > I don't fully understand the use-case for memory hotremove, so any > additional info would be useful to make a decision here. > Especially in virtual environment, hotremove will be relevant. For physical environments - I have no idea how important that is for ARM.
On 09/10/2019 09:47 PM, Catalin Marinas wrote: > On Tue, Sep 03, 2019 at 03:15:58PM +0530, Anshuman Khandual wrote: >> @@ -770,6 +1022,28 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, >> void vmemmap_free(unsigned long start, unsigned long end, >> struct vmem_altmap *altmap) >> { >> +#ifdef CONFIG_MEMORY_HOTPLUG >> + /* >> + * FIXME: We should have called remove_pagetable(start, end, true). >> + * vmemmap and vmalloc virtual range might share intermediate kernel >> + * page table entries. Removing vmemmap range page table pages here >> + * can potentially conflict with a concurrent vmalloc() allocation. >> + * >> + * This is primarily because vmalloc() does not take init_mm ptl for >> + * the entire page table walk and it's modification. Instead it just >> + * takes the lock while allocating and installing page table pages >> + * via [p4d|pud|pmd|pte]_alloc(). A concurrently vanishing page table >> + * entry via memory hot remove can cause vmalloc() kernel page table >> + * walk pointers to be invalid on the fly which can cause corruption >> + * or worst, a crash. >> + * >> + * So free_empty_tables() gets called where vmalloc and vmemmap range >> + * do not overlap at any intermediate level kernel page table entry. >> + */ >> + unmap_hotplug_range(start, end, true); >> + if (!vmalloc_vmemmap_overlap) >> + free_empty_tables(start, end); >> +#endif >> } >> #endif /* CONFIG_SPARSEMEM_VMEMMAP */ Hello Catalin, > > I wonder whether we could simply ignore the vmemmap freeing altogether, > just leave it around and not unmap it. This way, we could call This would have been an option (even if we just ignore for a moment that it might not be the cleanest possible method) if present memory hot remove scenarios involved just system RAM of comparable sizes. But with persistent memory which will be plugged in as ZONE_DEVICE might ask for a vmem_atlamp based vmemmap mapping where the backing memory comes from the persistent memory range itself not from existing system RAM. IIRC altmap support was originally added because the amount persistent memory on a system might be order of magnitude higher than that of regular system RAM. During normal memory hot add (without altmap) would have caused great deal of consumption from system RAM just for persistent memory range's vmemmap mapping. In order to avoid such a scenario altmap was created to allocate vmemmap mapping backing memory from the device memory range itself. In such cases vmemmap must be unmapped and it's backing memory freed up for the complete removal of persistent memory which originally requested for altmap based vmemmap backing. Just as a reference, the upcoming series which enables altmap support on arm64 tries to allocate vmemmap mapping backing memory from the device range itself during memory hot add and free them up during memory hot remove. Those methods will not be possible if memory hot-remove does not really free up vmemmap backing storage. https://patchwork.kernel.org/project/linux-mm/list/?series=139299 > unmap_kernel_range() for removing the linear map and we save some code. > > For the linear map, I think we use just above 2MB of tables for 1GB of > memory mapped (worst case with 4KB pages we need 512 pte pages). For > vmemmap we'd use slightly above 2MB for a 64GB hotplugged memory. Do we You are right, the amount of memory required for kernel page table pages are dependent on mapping page size and size of the range to be mapped. But as explained below there might be hot remove situations where these ranges will remain unused for ever after hot remove. There are chances that some these pages (even empty) might remain unused for good. > expect such memory to be re-plugged again in the same range? If we do, > then I shouldn't even bother with removing the vmmemmap. > > I don't fully understand the use-case for memory hotremove, so any > additional info would be useful to make a decision here. Sure, these are some of the scenarios I could recollect. Physical Environment: A. Physical DIMM replacement Platform detects memory errors and initiates a DIMM replacement. - Hot remove selected DIMM with errors - Hot add a new DIMM in it's place on the same slot In normal circumstances, the new DIMM will require the same linear and vmemmap mapping. In such cases hot-remove could just unmap linear mapping, leave everything else and be done with it. Though I am not sure whether its a good idea to leave aside accessible struct pages which correspond to non-present pfns. B. Physical DIMM movement Platform can detect errors on a DIMM slot itself and initiates a DIMM movement into a different empty slot - Hot remove selected memory DIMM from defective slot - Hot add same memory DIMM into a different available empty slot Physical address range for the DIMM has now changed, it will require different linear and vmemmap mapping than what it had originally. Hence during hot remove we should not only unmap linear and vmemmap mapping but also free up all associated resources as this physical memory range is never going to be available again because the slot has gone bad permanently. C. Physical DIMM hot-remove Platform just initiates hot-remove of a DIMM and reduces available memory as instructed by the administrator. - Hot remove a selected DIMM This memory might never come back again or comes back on a different slot. Without that certainty, its is always better to unmap both linear and vmemmap mappings, free up all associated resources. D. Changing NUMA affinity After performance analysis, administrator through the platform initiates a DIMM hot-remove from a given node and a DIMM hot-add to another node to achieve better NUMA affinity. - Hot remove a selected DIMM from node N0 - Hot add selected DIMM to another node N1 Here both linear and vmemmap ranges will change after the movement and there is uncertainty regarding whether the now empty physical range on node N0 will ever get populated again. Without that certainty, its is always better to unmap both linear and vmemmap mapping, free up all associated resources. Virtual Environment: 1. Memory hot-remove can just be initiated by the admin from the host in order to reduce total physical memory entitlement of a guest which will reflect any changing hosting contracts etc. The memory might never come back again and in such cases hot-remove should be as clean freeing all associated resources. 2. Memory hot-remove on the guest can be initiated from the host after detecting memory errors on the backing physical DIMM. Memory hot-remove on the guest will be followed by memory hot-remove on the host itself. Replacement DIMM can be on the same slot taking over the same physical address range from host as before but guest might get back it's memory either on the same range previously or on some other guest physical range. 3. Changing NUMA binding for a guest on the host might require guest PFN realignment with respect to guest nodes as well. Persistent Memory: As mentioned previously, persistent memory has special vmemmap mapping requirements through vmem_altmap which would need freeing up backing memory from it's own range, for it to be completely removed. Device memory (FPGA cards, GPU cards, Network cards etc): In future, some of these coherent device memory might be plugged into ZONE_DEVICE and managed through drivers. They might be attached to the system via upcoming interfaces like CCIX. The managing drivers might need to offline the device memory range in order to service some high priority error, re-init and plug it back on a different physical range due to existing CCIX link errors or some other constraints. The point I am trying to make here is that there are many such possible combinations of events with respect to memory hot-remove in both physical and virtual environment for system RAM, persistent memory and other coherent device memory. Leaving aside kernel page table pages or even struct pages for unavailable (possibly forever) physical range might problematic. IMHO it is better to do this as much cleanly as possible.
On 09/12/2019 09:58 AM, Anshuman Khandual wrote: > > On 09/10/2019 09:47 PM, Catalin Marinas wrote: >> On Tue, Sep 03, 2019 at 03:15:58PM +0530, Anshuman Khandual wrote: >>> @@ -770,6 +1022,28 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, >>> void vmemmap_free(unsigned long start, unsigned long end, >>> struct vmem_altmap *altmap) >>> { >>> +#ifdef CONFIG_MEMORY_HOTPLUG >>> + /* >>> + * FIXME: We should have called remove_pagetable(start, end, true). >>> + * vmemmap and vmalloc virtual range might share intermediate kernel >>> + * page table entries. Removing vmemmap range page table pages here >>> + * can potentially conflict with a concurrent vmalloc() allocation. >>> + * >>> + * This is primarily because vmalloc() does not take init_mm ptl for >>> + * the entire page table walk and it's modification. Instead it just >>> + * takes the lock while allocating and installing page table pages >>> + * via [p4d|pud|pmd|pte]_alloc(). A concurrently vanishing page table >>> + * entry via memory hot remove can cause vmalloc() kernel page table >>> + * walk pointers to be invalid on the fly which can cause corruption >>> + * or worst, a crash. >>> + * >>> + * So free_empty_tables() gets called where vmalloc and vmemmap range >>> + * do not overlap at any intermediate level kernel page table entry. >>> + */ >>> + unmap_hotplug_range(start, end, true); >>> + if (!vmalloc_vmemmap_overlap) >>> + free_empty_tables(start, end); >>> +#endif >>> } >>> #endif /* CONFIG_SPARSEMEM_VMEMMAP */ > Hello Catalin, > >> I wonder whether we could simply ignore the vmemmap freeing altogether, >> just leave it around and not unmap it. This way, we could call > This would have been an option (even if we just ignore for a moment that > it might not be the cleanest possible method) if present memory hot remove > scenarios involved just system RAM of comparable sizes. > > But with persistent memory which will be plugged in as ZONE_DEVICE might > ask for a vmem_atlamp based vmemmap mapping where the backing memory comes > from the persistent memory range itself not from existing system RAM. IIRC > altmap support was originally added because the amount persistent memory on > a system might be order of magnitude higher than that of regular system RAM. > During normal memory hot add (without altmap) would have caused great deal > of consumption from system RAM just for persistent memory range's vmemmap > mapping. In order to avoid such a scenario altmap was created to allocate > vmemmap mapping backing memory from the device memory range itself. > > In such cases vmemmap must be unmapped and it's backing memory freed up for > the complete removal of persistent memory which originally requested for > altmap based vmemmap backing. > > Just as a reference, the upcoming series which enables altmap support on > arm64 tries to allocate vmemmap mapping backing memory from the device range > itself during memory hot add and free them up during memory hot remove. Those > methods will not be possible if memory hot-remove does not really free up > vmemmap backing storage. > > https://patchwork.kernel.org/project/linux-mm/list/?series=139299 > Just to add in here. There is an ongoing work which will enable allocating memory from the hot-add range itself even for normal system RAM. So this might not be specific to ZONE_DEVICE based device/persistent memory alone for a long time. https://lore.kernel.org/lkml/20190725160207.19579-1-osalvador@suse.de/
Hi Anshuman, Thanks for the details on the need for removing the page tables and vmemmap backing. Some comments on the code below. On Tue, Sep 03, 2019 at 03:15:58PM +0530, Anshuman Khandual wrote: > --- a/arch/arm64/mm/mmu.c > +++ b/arch/arm64/mm/mmu.c > @@ -60,6 +60,14 @@ static pud_t bm_pud[PTRS_PER_PUD] __page_aligned_bss __maybe_unused; > > static DEFINE_SPINLOCK(swapper_pgdir_lock); > > +/* > + * This represents if vmalloc and vmemmap address range overlap with > + * each other on an intermediate level kernel page table entry which > + * in turn helps in deciding whether empty kernel page table pages > + * if any can be freed during memory hotplug operation. > + */ > +static bool vmalloc_vmemmap_overlap; I'd say just move the static find_vmalloc_vmemmap_overlap() function here, the compiler should be sufficiently smart enough to figure out that it's just a build-time constant. > @@ -770,6 +1022,28 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, > void vmemmap_free(unsigned long start, unsigned long end, > struct vmem_altmap *altmap) > { > +#ifdef CONFIG_MEMORY_HOTPLUG > + /* > + * FIXME: We should have called remove_pagetable(start, end, true). > + * vmemmap and vmalloc virtual range might share intermediate kernel > + * page table entries. Removing vmemmap range page table pages here > + * can potentially conflict with a concurrent vmalloc() allocation. > + * > + * This is primarily because vmalloc() does not take init_mm ptl for > + * the entire page table walk and it's modification. Instead it just > + * takes the lock while allocating and installing page table pages > + * via [p4d|pud|pmd|pte]_alloc(). A concurrently vanishing page table > + * entry via memory hot remove can cause vmalloc() kernel page table > + * walk pointers to be invalid on the fly which can cause corruption > + * or worst, a crash. > + * > + * So free_empty_tables() gets called where vmalloc and vmemmap range > + * do not overlap at any intermediate level kernel page table entry. > + */ > + unmap_hotplug_range(start, end, true); > + if (!vmalloc_vmemmap_overlap) > + free_empty_tables(start, end); > +#endif > } So, I see the risk with overlapping and I guess for some kernel configurations (PAGE_SIZE == 64K) we may not be able to avoid it. If we can, that's great, otherwise could we rewrite the above functions to handle floor and ceiling similar to free_pgd_range()? (I wonder how this function works if you called it on init_mm and kernel address range). By having the vmemmap start/end information it avoids freeing partially filled page table pages. Another question: could we do the page table and the actual vmemmap pages freeing in a single pass (sorry if this has been discussed before)? > @@ -1048,10 +1322,18 @@ int p4d_free_pud_page(p4d_t *p4d, unsigned long addr) > } > > #ifdef CONFIG_MEMORY_HOTPLUG > +static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size) > +{ > + unsigned long end = start + size; > + > + WARN_ON(pgdir != init_mm.pgd); > + remove_pagetable(start, end, false); > +} I think the point I've made previously still stands: you only call remove_pagetable() with sparse_vmap == false in this patch. Just remove the extra argument and call unmap_hotplug_range() with sparse_vmap == false directly in remove_pagetable().
On 09/13/2019 01:45 AM, Catalin Marinas wrote: > Hi Anshuman, > > Thanks for the details on the need for removing the page tables and > vmemmap backing. Some comments on the code below. > > On Tue, Sep 03, 2019 at 03:15:58PM +0530, Anshuman Khandual wrote: >> --- a/arch/arm64/mm/mmu.c >> +++ b/arch/arm64/mm/mmu.c >> @@ -60,6 +60,14 @@ static pud_t bm_pud[PTRS_PER_PUD] __page_aligned_bss __maybe_unused; >> >> static DEFINE_SPINLOCK(swapper_pgdir_lock); >> >> +/* >> + * This represents if vmalloc and vmemmap address range overlap with >> + * each other on an intermediate level kernel page table entry which >> + * in turn helps in deciding whether empty kernel page table pages >> + * if any can be freed during memory hotplug operation. >> + */ >> +static bool vmalloc_vmemmap_overlap; > > I'd say just move the static find_vmalloc_vmemmap_overlap() function > here, the compiler should be sufficiently smart enough to figure out > that it's just a build-time constant. Sure, will do. > >> @@ -770,6 +1022,28 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, >> void vmemmap_free(unsigned long start, unsigned long end, >> struct vmem_altmap *altmap) >> { >> +#ifdef CONFIG_MEMORY_HOTPLUG >> + /* >> + * FIXME: We should have called remove_pagetable(start, end, true). >> + * vmemmap and vmalloc virtual range might share intermediate kernel >> + * page table entries. Removing vmemmap range page table pages here >> + * can potentially conflict with a concurrent vmalloc() allocation. >> + * >> + * This is primarily because vmalloc() does not take init_mm ptl for >> + * the entire page table walk and it's modification. Instead it just >> + * takes the lock while allocating and installing page table pages >> + * via [p4d|pud|pmd|pte]_alloc(). A concurrently vanishing page table >> + * entry via memory hot remove can cause vmalloc() kernel page table >> + * walk pointers to be invalid on the fly which can cause corruption >> + * or worst, a crash. >> + * >> + * So free_empty_tables() gets called where vmalloc and vmemmap range >> + * do not overlap at any intermediate level kernel page table entry. >> + */ >> + unmap_hotplug_range(start, end, true); >> + if (!vmalloc_vmemmap_overlap) >> + free_empty_tables(start, end); >> +#endif >> } > > So, I see the risk with overlapping and I guess for some kernel > configurations (PAGE_SIZE == 64K) we may not be able to avoid it. If we Did not see 64K config options to have overlap, do you suspect they might ? After the 52 bit KVA series has been merged, following configurations have the vmalloc-vmemmap range overlap problem. - 4K page size with 48 bit VA space - 16K page size with 48 bit VA space > can, that's great, otherwise could we rewrite the above functions to > handle floor and ceiling similar to free_pgd_range()? (I wonder how this > function works if you called it on init_mm and kernel address range). By Hmm, never tried that. Are you wondering if this can be used directly ? There are two distinct elements which make it very specific to user page tables, mmu_gather based TLB tracking and mm->pgtable_bytes accounting with mm_dec_nr_pxx(). > having the vmemmap start/end information it avoids freeing partially > filled page table pages. Did you mean page table pages which can partially overlap with vmalloc ? The problem (race) is not because of the inability to deal with partially filled table. We can handle that correctly as explained below [1]. The problem is with inadequate kernel page table locking during vmalloc() which might be accessing intermediate kernel page table pointers which is being freed with free_empty_tables() concurrently. Hence we cannot free any page table page which can ever have entries from vmalloc() range. Though not completely sure, whether I really understood the suggestion above with respect to the floor-ceiling mechanism as in free_pgd_range(). Are you suggesting that we should only attempt to free up those vmemmap range page table pages which *definitely* could never overlap with vmalloc by working on a modified (i.e cut down with floor-ceiling while avoiding vmalloc range at each level) vmemmap range instead ? This can be one restrictive version of the function free_empty_tables() called in case there is an overlap. So we will maintain two versions for free_empty_tables(). Please correct me if any the above assumptions or understanding is wrong. But yes, with this we should be able to free up some possible empty page table pages which were being left out in the current proposal when overlap happens. [1] Skipping partially filled page tables All free_pXX_table() functions take care in avoiding freeing partially filled page table pages whether they represent or ever represented linear or vmemmap or vmalloc mapping in init_mm. They go over each individual entry in a given page table making sure that each of them checks as pXX_none() before freeing the entire page table page. Though walking is restricted by the address range in question. free_empty_tables(start, end) free_empty_pud_table(pgdp, addr, next); free_empty_pmd_table(pudp, addr, next); free_empty_pte_table(pmdp, addr, next); Page table pages being examined here on the way while freeing might contain entries which once represented address beyond vmemmap range in removal. But thats a good thing IMHO. It can accommodate vmemmap tear down from a previous hot remove for an adjacent range which might not have been freed last time. pudp = pud_offset(pgdp, 0UL); pmdp = pmd_offset(pudp, 0UL); ptep = pte_offset_kernel(pmdp, 0UL); pxx_none() makes sure that in such cases freeing of the page table page is skipped. But yes, even though it is more thorough, it might attempt to free page table pages which might contains entries not belonging to the range being removed. > > Another question: could we do the page table and the actual vmemmap > pages freeing in a single pass (sorry if this has been discussed > before)? We could and some initial versions (till V5) of the series had that in fact. Initially Mark Rutland had suggested to do this in two passes. Some extracts from the previous discussion. https://lkml.org/lkml/2019/5/30/1159 ----------------------- Looking at this some more, I don't think this is quite right, and tI think that structure of the free_*() and remove_*() functions makes this unnecessarily hard to follow. We should aim for this to be obviously correct. The x86 code is the best template to follow here. As mentioned previously, I'm fairly certain it's not entirely correct (e.g. due to missing TLB maintenance), and we've already diverged a fair amount in fixing up obvious issues, so we shouldn't aim to mirror it. I think that the structure of unmap_region() is closer to what we want here -- do one pass to unmap leaf entries (and freeing the associated memory if unmapping the vmemmap), then do a second pass cleaning up any empty tables. ---------------------- Apart from the fact that two passes over the page table is cleaner and gives us more granular and modular infrastructure to use for later purposes, it is also a necessity in dealing with vmalloc-vmemmap overlap. free_empty_tables() which is the second pass, can be skipped cleanly when overlap is detected. > >> @@ -1048,10 +1322,18 @@ int p4d_free_pud_page(p4d_t *p4d, unsigned long addr) >> } >> >> #ifdef CONFIG_MEMORY_HOTPLUG >> +static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size) >> +{ >> + unsigned long end = start + size; >> + >> + WARN_ON(pgdir != init_mm.pgd); >> + remove_pagetable(start, end, false); >> +} > > I think the point I've made previously still stands: you only call > remove_pagetable() with sparse_vmap == false in this patch. Just remove > the extra argument and call unmap_hotplug_range() with sparse_vmap == > false directly in remove_pagetable(). Sure, will do. The original function signature was left unchanged in the hope that at a later point in time it can be called with "sparse_vmap == true" as mentioned by the comment in vmemmap_free(). Will change the comment as well.
On Fri, Sep 13, 2019 at 11:28:01AM +0530, Anshuman Khandual wrote: > On 09/13/2019 01:45 AM, Catalin Marinas wrote: > > On Tue, Sep 03, 2019 at 03:15:58PM +0530, Anshuman Khandual wrote: > >> @@ -770,6 +1022,28 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, > >> void vmemmap_free(unsigned long start, unsigned long end, > >> struct vmem_altmap *altmap) > >> { > >> +#ifdef CONFIG_MEMORY_HOTPLUG > >> + /* > >> + * FIXME: We should have called remove_pagetable(start, end, true). > >> + * vmemmap and vmalloc virtual range might share intermediate kernel > >> + * page table entries. Removing vmemmap range page table pages here > >> + * can potentially conflict with a concurrent vmalloc() allocation. > >> + * > >> + * This is primarily because vmalloc() does not take init_mm ptl for > >> + * the entire page table walk and it's modification. Instead it just > >> + * takes the lock while allocating and installing page table pages > >> + * via [p4d|pud|pmd|pte]_alloc(). A concurrently vanishing page table > >> + * entry via memory hot remove can cause vmalloc() kernel page table > >> + * walk pointers to be invalid on the fly which can cause corruption > >> + * or worst, a crash. > >> + * > >> + * So free_empty_tables() gets called where vmalloc and vmemmap range > >> + * do not overlap at any intermediate level kernel page table entry. > >> + */ > >> + unmap_hotplug_range(start, end, true); > >> + if (!vmalloc_vmemmap_overlap) > >> + free_empty_tables(start, end); > >> +#endif > >> } > > > > So, I see the risk with overlapping and I guess for some kernel > > configurations (PAGE_SIZE == 64K) we may not be able to avoid it. If we > > Did not see 64K config options to have overlap, do you suspect they might ? > After the 52 bit KVA series has been merged, following configurations have > the vmalloc-vmemmap range overlap problem. > > - 4K page size with 48 bit VA space > - 16K page size with 48 bit VA space OK. I haven't checked, so it was just a guess that 64K has this problem since the pgd entry coverage is fairly large. > > can, that's great, otherwise could we rewrite the above functions to > > handle floor and ceiling similar to free_pgd_range()? (I wonder how this > > function works if you called it on init_mm and kernel address range). By > > Hmm, never tried that. Are you wondering if this can be used directly ? > There are two distinct elements which make it very specific to user page > tables, mmu_gather based TLB tracking and mm->pgtable_bytes accounting > with mm_dec_nr_pxx(). Ah, I missed the mm_dec_nr_*(). So I don't think it would work directly. We could, however, use the same approach for kernel page tables. > > having the vmemmap start/end information it avoids freeing partially > > filled page table pages. > > Did you mean page table pages which can partially overlap with vmalloc ? Overlapping with the vmalloc range, not necessarily with a mapped vmalloc area. > The problem (race) is not because of the inability to deal with partially > filled table. We can handle that correctly as explained below [1]. The > problem is with inadequate kernel page table locking during vmalloc() > which might be accessing intermediate kernel page table pointers which is > being freed with free_empty_tables() concurrently. Hence we cannot free > any page table page which can ever have entries from vmalloc() range. The way you deal with the partially filled table in this patch is to avoid freeing if there is a non-empty entry (!p*d_none()). This is what causes the race with vmalloc. If you simply avoid freeing a pmd page, for example, if the range floor/ceiling is not aligned to PUD_SIZE, irrespective of whether the other entries are empty or not, you shouldn't have this problem. You do free the pte page if the range is aligned to PMD_SIZE but in this case it wouldn't overlap with the vmalloc space. That's how free_pgd_range() works. We may have some pgtable pages not freed at both ends of the range (maximum 6 in total) but I don't really see this an issue. They could be reused if something else gets mapped in that range. > Though not completely sure, whether I really understood the suggestion above > with respect to the floor-ceiling mechanism as in free_pgd_range(). Are you > suggesting that we should only attempt to free up those vmemmap range page > table pages which *definitely* could never overlap with vmalloc by working > on a modified (i.e cut down with floor-ceiling while avoiding vmalloc range > at each level) vmemmap range instead ? You can ignore the overlap check altogether, only free the page tables with floor/ceiling set to the start/size passed to arch_remove_memory() and vmemmap_free(). > This can be one restrictive version of the function > free_empty_tables() called in case there is an overlap. So we will > maintain two versions for free_empty_tables(). Please correct me if > any the above assumptions or understanding is wrong. I'd rather have a single version of free_empty_tables(). As I said above, the only downside is that a partially filled pgtable page would not be freed even though the other entries are empty. > But yes, with this we should be able to free up some possible empty page > table pages which were being left out in the current proposal when overlap > happens. > > [1] Skipping partially filled page tables > > All free_pXX_table() functions take care in avoiding freeing partially filled > page table pages whether they represent or ever represented linear or vmemmap > or vmalloc mapping in init_mm. They go over each individual entry in a given > page table making sure that each of them checks as pXX_none() before freeing > the entire page table page. Yes but that's what's causing the race with a vmalloc trying to create such entries. > > Another question: could we do the page table and the actual vmemmap > > pages freeing in a single pass (sorry if this has been discussed > > before)? > > We could and some initial versions (till V5) of the series had that in fact. > Initially Mark Rutland had suggested to do this in two passes. Some extracts > from the previous discussion. > > https://lkml.org/lkml/2019/5/30/1159 > > ----------------------- > Looking at this some more, I don't think this is quite right, and tI > think that structure of the free_*() and remove_*() functions makes this > unnecessarily hard to follow. We should aim for this to be obviously > correct. > > The x86 code is the best template to follow here. As mentioned > previously, I'm fairly certain it's not entirely correct (e.g. due to > missing TLB maintenance), and we've already diverged a fair amount in > fixing up obvious issues, so we shouldn't aim to mirror it. > > I think that the structure of unmap_region() is closer to what we want > here -- do one pass to unmap leaf entries (and freeing the associated > memory if unmapping the vmemmap), then do a second pass cleaning up any > empty tables. > ---------------------- > > Apart from the fact that two passes over the page table is cleaner and gives > us more granular and modular infrastructure to use for later purposes, it is > also a necessity in dealing with vmalloc-vmemmap overlap. free_empty_tables() > which is the second pass, can be skipped cleanly when overlap is detected. I'm fine with two passes for unmap and pgtable free for the time being and if they look fairly similar in a feature version, we can think of merging them. But for now, stick with two passes. The unmapping one in this patchset I think seems fine (though I haven't looked in detail). There is also a race with ptdump that I haven't looked into.
On 09/13/2019 03:39 PM, Catalin Marinas wrote: > On Fri, Sep 13, 2019 at 11:28:01AM +0530, Anshuman Khandual wrote: >> On 09/13/2019 01:45 AM, Catalin Marinas wrote: >>> On Tue, Sep 03, 2019 at 03:15:58PM +0530, Anshuman Khandual wrote: >>>> @@ -770,6 +1022,28 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, >>>> void vmemmap_free(unsigned long start, unsigned long end, >>>> struct vmem_altmap *altmap) >>>> { >>>> +#ifdef CONFIG_MEMORY_HOTPLUG >>>> + /* >>>> + * FIXME: We should have called remove_pagetable(start, end, true). >>>> + * vmemmap and vmalloc virtual range might share intermediate kernel >>>> + * page table entries. Removing vmemmap range page table pages here >>>> + * can potentially conflict with a concurrent vmalloc() allocation. >>>> + * >>>> + * This is primarily because vmalloc() does not take init_mm ptl for >>>> + * the entire page table walk and it's modification. Instead it just >>>> + * takes the lock while allocating and installing page table pages >>>> + * via [p4d|pud|pmd|pte]_alloc(). A concurrently vanishing page table >>>> + * entry via memory hot remove can cause vmalloc() kernel page table >>>> + * walk pointers to be invalid on the fly which can cause corruption >>>> + * or worst, a crash. >>>> + * >>>> + * So free_empty_tables() gets called where vmalloc and vmemmap range >>>> + * do not overlap at any intermediate level kernel page table entry. >>>> + */ >>>> + unmap_hotplug_range(start, end, true); >>>> + if (!vmalloc_vmemmap_overlap) >>>> + free_empty_tables(start, end); >>>> +#endif >>>> } >>> >>> So, I see the risk with overlapping and I guess for some kernel >>> configurations (PAGE_SIZE == 64K) we may not be able to avoid it. If we >> >> Did not see 64K config options to have overlap, do you suspect they might ? >> After the 52 bit KVA series has been merged, following configurations have >> the vmalloc-vmemmap range overlap problem. >> >> - 4K page size with 48 bit VA space >> - 16K page size with 48 bit VA space > > OK. I haven't checked, so it was just a guess that 64K has this problem > since the pgd entry coverage is fairly large. > >>> can, that's great, otherwise could we rewrite the above functions to >>> handle floor and ceiling similar to free_pgd_range()? (I wonder how this >>> function works if you called it on init_mm and kernel address range). By >> >> Hmm, never tried that. Are you wondering if this can be used directly ? >> There are two distinct elements which make it very specific to user page >> tables, mmu_gather based TLB tracking and mm->pgtable_bytes accounting >> with mm_dec_nr_pxx(). > > Ah, I missed the mm_dec_nr_*(). So I don't think it would work directly. > We could, however, use the same approach for kernel page tables. Right. > >>> having the vmemmap start/end information it avoids freeing partially >>> filled page table pages. >> >> Did you mean page table pages which can partially overlap with vmalloc ? > > Overlapping with the vmalloc range, not necessarily with a mapped > vmalloc area. > >> The problem (race) is not because of the inability to deal with partially >> filled table. We can handle that correctly as explained below [1]. The >> problem is with inadequate kernel page table locking during vmalloc() >> which might be accessing intermediate kernel page table pointers which is >> being freed with free_empty_tables() concurrently. Hence we cannot free >> any page table page which can ever have entries from vmalloc() range. > > The way you deal with the partially filled table in this patch is to > avoid freeing if there is a non-empty entry (!p*d_none()). This is what > causes the race with vmalloc. If you simply avoid freeing a pmd page, > for example, if the range floor/ceiling is not aligned to PUD_SIZE, > irrespective of whether the other entries are empty or not, you > shouldn't have this problem. You do free the pte page if the range is Right, the floor/ceiling alignment check should abort the process before scanning for non-empty entries. > aligned to PMD_SIZE but in this case it wouldn't overlap with the > vmalloc space. That's how free_pgd_range() works. Like free_pgd_range(), page table pages can be freed at lower levels when they have the right alignment wrt floor/ceiling. I will change all existing free_pxx_table() functions to accommodate floor/ceiling alignment checks to achieve this. > > We may have some pgtable pages not freed at both ends of the range > (maximum 6 in total) but I don't really see this an issue. They could be > reused if something else gets mapped in that range. I assume that the number 6 for maximum page possibility came from (floor edge + ceiling edge) * (PTE table + PMD table + PUD table) > >> Though not completely sure, whether I really understood the suggestion above >> with respect to the floor-ceiling mechanism as in free_pgd_range(). Are you >> suggesting that we should only attempt to free up those vmemmap range page >> table pages which *definitely* could never overlap with vmalloc by working >> on a modified (i.e cut down with floor-ceiling while avoiding vmalloc range >> at each level) vmemmap range instead ? > > You can ignore the overlap check altogether, only free the page tables > with floor/ceiling set to the start/size passed to arch_remove_memory() > and vmemmap_free(). Wondering if it will be better to use [VMEMMAP_START - VMEMMAP_END] and [PAGE_OFFSET - PAGE_END] as floor/ceiling respectively with vmemmap_free() and arch_remove_memory(). Not only it is safe to free all page table pages which span over these maximum possible mapping range but also it reduces the risk for alignment related wastage. > >> This can be one restrictive version of the function >> free_empty_tables() called in case there is an overlap. So we will >> maintain two versions for free_empty_tables(). Please correct me if >> any the above assumptions or understanding is wrong. > > I'd rather have a single version of free_empty_tables(). As I said > above, the only downside is that a partially filled pgtable page would > not be freed even though the other entries are empty. Sure. Also practically the limitation will be applicable only for vmemmap mapping but not for linear mappings where the chances of overlap might be negligible as it covers half kernel virtual address space. > >> But yes, with this we should be able to free up some possible empty page >> table pages which were being left out in the current proposal when overlap >> happens. >> >> [1] Skipping partially filled page tables >> >> All free_pXX_table() functions take care in avoiding freeing partially filled >> page table pages whether they represent or ever represented linear or vmemmap >> or vmalloc mapping in init_mm. They go over each individual entry in a given >> page table making sure that each of them checks as pXX_none() before freeing >> the entire page table page. > > Yes but that's what's causing the race with a vmalloc trying to create > such entries. free_pxx_table() needs to check for both floor-ceiling alignment before making sure that the page table page is completely empty before freeing. > >>> Another question: could we do the page table and the actual vmemmap >>> pages freeing in a single pass (sorry if this has been discussed >>> before)? >> >> We could and some initial versions (till V5) of the series had that in fact. >> Initially Mark Rutland had suggested to do this in two passes. Some extracts >> from the previous discussion. >> >> https://lkml.org/lkml/2019/5/30/1159 >> >> ----------------------- >> Looking at this some more, I don't think this is quite right, and tI >> think that structure of the free_*() and remove_*() functions makes this >> unnecessarily hard to follow. We should aim for this to be obviously >> correct. >> >> The x86 code is the best template to follow here. As mentioned >> previously, I'm fairly certain it's not entirely correct (e.g. due to >> missing TLB maintenance), and we've already diverged a fair amount in >> fixing up obvious issues, so we shouldn't aim to mirror it. >> >> I think that the structure of unmap_region() is closer to what we want >> here -- do one pass to unmap leaf entries (and freeing the associated >> memory if unmapping the vmemmap), then do a second pass cleaning up any >> empty tables. >> ---------------------- >> >> Apart from the fact that two passes over the page table is cleaner and gives >> us more granular and modular infrastructure to use for later purposes, it is >> also a necessity in dealing with vmalloc-vmemmap overlap. free_empty_tables() >> which is the second pass, can be skipped cleanly when overlap is detected. > > I'm fine with two passes for unmap and pgtable free for the time being > and if they look fairly similar in a feature version, we can think of > merging them. But for now, stick with two passes. The unmapping one in > this patchset I think seems fine (though I haven't looked in detail). Sure. > > There is also a race with ptdump that I haven't looked into. The second patch in the series deals with that.
On Tue, Sep 17, 2019 at 10:06:11AM +0530, Anshuman Khandual wrote: > On 09/13/2019 03:39 PM, Catalin Marinas wrote: > > On Fri, Sep 13, 2019 at 11:28:01AM +0530, Anshuman Khandual wrote: > >> The problem (race) is not because of the inability to deal with partially > >> filled table. We can handle that correctly as explained below [1]. The > >> problem is with inadequate kernel page table locking during vmalloc() > >> which might be accessing intermediate kernel page table pointers which is > >> being freed with free_empty_tables() concurrently. Hence we cannot free > >> any page table page which can ever have entries from vmalloc() range. > > > > The way you deal with the partially filled table in this patch is to > > avoid freeing if there is a non-empty entry (!p*d_none()). This is what > > causes the race with vmalloc. If you simply avoid freeing a pmd page, > > for example, if the range floor/ceiling is not aligned to PUD_SIZE, > > irrespective of whether the other entries are empty or not, you > > shouldn't have this problem. You do free the pte page if the range is [...] > > We may have some pgtable pages not freed at both ends of the range > > (maximum 6 in total) but I don't really see this an issue. They could be > > reused if something else gets mapped in that range. > > I assume that the number 6 for maximum page possibility came from > > (floor edge + ceiling edge) * (PTE table + PMD table + PUD table) Yes. > >> Though not completely sure, whether I really understood the suggestion above > >> with respect to the floor-ceiling mechanism as in free_pgd_range(). Are you > >> suggesting that we should only attempt to free up those vmemmap range page > >> table pages which *definitely* could never overlap with vmalloc by working > >> on a modified (i.e cut down with floor-ceiling while avoiding vmalloc range > >> at each level) vmemmap range instead ? > > > > You can ignore the overlap check altogether, only free the page tables > > with floor/ceiling set to the start/size passed to arch_remove_memory() > > and vmemmap_free(). > > Wondering if it will be better to use [VMEMMAP_START - VMEMMAP_END] and > [PAGE_OFFSET - PAGE_END] as floor/ceiling respectively with vmemmap_free() > and arch_remove_memory(). Not only it is safe to free all page table pages > which span over these maximum possible mapping range but also it reduces > the risk for alignment related wastage. That's indeed better. You pass the floor/ceiling as the enclosing range and start/end as the actual range to unmap is. We avoid the potential "leak" around the edges when falling within the floor/ceiling range (I think that's close to what free_pgd_range() does). > >> This can be one restrictive version of the function > >> free_empty_tables() called in case there is an overlap. So we will > >> maintain two versions for free_empty_tables(). Please correct me if > >> any the above assumptions or understanding is wrong. > > > > I'd rather have a single version of free_empty_tables(). As I said > > above, the only downside is that a partially filled pgtable page would > > not be freed even though the other entries are empty. > > Sure. Also practically the limitation will be applicable only for vmemmap > mapping but not for linear mappings where the chances of overlap might be > negligible as it covers half kernel virtual address space. If you have a common set of functions, it doesn't heart to pass the correct floor/ceiling in both cases.
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 6481964b6425..0079820af308 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -274,6 +274,9 @@ config ZONE_DMA32 config ARCH_ENABLE_MEMORY_HOTPLUG def_bool y +config ARCH_ENABLE_MEMORY_HOTREMOVE + def_bool y + config SMP def_bool y diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h index b61b50bf68b1..615dcd08acfa 100644 --- a/arch/arm64/include/asm/memory.h +++ b/arch/arm64/include/asm/memory.h @@ -54,6 +54,7 @@ #define MODULES_VADDR (BPF_JIT_REGION_END) #define MODULES_VSIZE (SZ_128M) #define VMEMMAP_START (-VMEMMAP_SIZE - SZ_2M) +#define VMEMMAP_END (VMEMMAP_START + VMEMMAP_SIZE) #define PCI_IO_END (VMEMMAP_START - SZ_2M) #define PCI_IO_START (PCI_IO_END - PCI_IO_SIZE) #define FIXADDR_TOP (PCI_IO_START - SZ_2M) diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index ee1bf416368d..980d761a7327 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -60,6 +60,14 @@ static pud_t bm_pud[PTRS_PER_PUD] __page_aligned_bss __maybe_unused; static DEFINE_SPINLOCK(swapper_pgdir_lock); +/* + * This represents if vmalloc and vmemmap address range overlap with + * each other on an intermediate level kernel page table entry which + * in turn helps in deciding whether empty kernel page table pages + * if any can be freed during memory hotplug operation. + */ +static bool vmalloc_vmemmap_overlap; + void set_swapper_pgd(pgd_t *pgdp, pgd_t pgd) { pgd_t *fixmap_pgdp; @@ -723,6 +731,250 @@ int kern_addr_valid(unsigned long addr) return pfn_valid(pte_pfn(pte)); } + +#ifdef CONFIG_MEMORY_HOTPLUG +static void free_hotplug_page_range(struct page *page, size_t size) +{ + WARN_ON(!page || PageReserved(page)); + free_pages((unsigned long)page_address(page), get_order(size)); +} + +static void free_hotplug_pgtable_page(struct page *page) +{ + free_hotplug_page_range(page, PAGE_SIZE); +} + +static void free_pte_table(pmd_t *pmdp, unsigned long addr) +{ + struct page *page; + pte_t *ptep; + int i; + + ptep = pte_offset_kernel(pmdp, 0UL); + for (i = 0; i < PTRS_PER_PTE; i++) { + if (!pte_none(READ_ONCE(ptep[i]))) + return; + } + + page = pmd_page(READ_ONCE(*pmdp)); + pmd_clear(pmdp); + __flush_tlb_kernel_pgtable(addr); + free_hotplug_pgtable_page(page); +} + +static void free_pmd_table(pud_t *pudp, unsigned long addr) +{ + struct page *page; + pmd_t *pmdp; + int i; + + if (CONFIG_PGTABLE_LEVELS <= 2) + return; + + pmdp = pmd_offset(pudp, 0UL); + for (i = 0; i < PTRS_PER_PMD; i++) { + if (!pmd_none(READ_ONCE(pmdp[i]))) + return; + } + + page = pud_page(READ_ONCE(*pudp)); + pud_clear(pudp); + __flush_tlb_kernel_pgtable(addr); + free_hotplug_pgtable_page(page); +} + +static void free_pud_table(pgd_t *pgdp, unsigned long addr) +{ + struct page *page; + pud_t *pudp; + int i; + + if (CONFIG_PGTABLE_LEVELS <= 3) + return; + + pudp = pud_offset(pgdp, 0UL); + for (i = 0; i < PTRS_PER_PUD; i++) { + if (!pud_none(READ_ONCE(pudp[i]))) + return; + } + + page = pgd_page(READ_ONCE(*pgdp)); + pgd_clear(pgdp); + __flush_tlb_kernel_pgtable(addr); + free_hotplug_pgtable_page(page); +} + +static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr, + unsigned long end, bool sparse_vmap) +{ + struct page *page; + pte_t *ptep, pte; + + do { + ptep = pte_offset_kernel(pmdp, addr); + pte = READ_ONCE(*ptep); + if (pte_none(pte)) + continue; + + WARN_ON(!pte_present(pte)); + page = sparse_vmap ? pte_page(pte) : NULL; + pte_clear(&init_mm, addr, ptep); + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); + if (sparse_vmap) + free_hotplug_page_range(page, PAGE_SIZE); + } while (addr += PAGE_SIZE, addr < end); +} + +static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr, + unsigned long end, bool sparse_vmap) +{ + unsigned long next; + struct page *page; + pmd_t *pmdp, pmd; + + do { + next = pmd_addr_end(addr, end); + pmdp = pmd_offset(pudp, addr); + pmd = READ_ONCE(*pmdp); + if (pmd_none(pmd)) + continue; + + WARN_ON(!pmd_present(pmd)); + if (pmd_sect(pmd)) { + page = sparse_vmap ? pmd_page(pmd) : NULL; + pmd_clear(pmdp); + flush_tlb_kernel_range(addr, next); + if (sparse_vmap) + free_hotplug_page_range(page, PMD_SIZE); + continue; + } + WARN_ON(!pmd_table(pmd)); + unmap_hotplug_pte_range(pmdp, addr, next, sparse_vmap); + } while (addr = next, addr < end); +} + +static void unmap_hotplug_pud_range(pgd_t *pgdp, unsigned long addr, + unsigned long end, bool sparse_vmap) +{ + unsigned long next; + struct page *page; + pud_t *pudp, pud; + + do { + next = pud_addr_end(addr, end); + pudp = pud_offset(pgdp, addr); + pud = READ_ONCE(*pudp); + if (pud_none(pud)) + continue; + + WARN_ON(!pud_present(pud)); + if (pud_sect(pud)) { + page = sparse_vmap ? pud_page(pud) : NULL; + pud_clear(pudp); + flush_tlb_kernel_range(addr, next); + if (sparse_vmap) + free_hotplug_page_range(page, PUD_SIZE); + continue; + } + WARN_ON(!pud_table(pud)); + unmap_hotplug_pmd_range(pudp, addr, next, sparse_vmap); + } while (addr = next, addr < end); +} + +static void unmap_hotplug_range(unsigned long addr, unsigned long end, + bool sparse_vmap) +{ + unsigned long next; + pgd_t *pgdp, pgd; + + do { + next = pgd_addr_end(addr, end); + pgdp = pgd_offset_k(addr); + pgd = READ_ONCE(*pgdp); + if (pgd_none(pgd)) + continue; + + WARN_ON(!pgd_present(pgd)); + unmap_hotplug_pud_range(pgdp, addr, next, sparse_vmap); + } while (addr = next, addr < end); +} + +static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr, + unsigned long end) +{ + pte_t *ptep, pte; + + do { + ptep = pte_offset_kernel(pmdp, addr); + pte = READ_ONCE(*ptep); + WARN_ON(!pte_none(pte)); + } while (addr += PAGE_SIZE, addr < end); +} + +static void free_empty_pmd_table(pud_t *pudp, unsigned long addr, + unsigned long end) +{ + unsigned long next; + pmd_t *pmdp, pmd; + + do { + next = pmd_addr_end(addr, end); + pmdp = pmd_offset(pudp, addr); + pmd = READ_ONCE(*pmdp); + if (pmd_none(pmd)) + continue; + + WARN_ON(!pmd_present(pmd) || !pmd_table(pmd) || pmd_sect(pmd)); + free_empty_pte_table(pmdp, addr, next); + free_pte_table(pmdp, addr); + } while (addr = next, addr < end); +} + +static void free_empty_pud_table(pgd_t *pgdp, unsigned long addr, + unsigned long end) +{ + unsigned long next; + pud_t *pudp, pud; + + do { + next = pud_addr_end(addr, end); + pudp = pud_offset(pgdp, addr); + pud = READ_ONCE(*pudp); + if (pud_none(pud)) + continue; + + WARN_ON(!pud_present(pud) || !pud_table(pud) || pud_sect(pud)); + free_empty_pmd_table(pudp, addr, next); + free_pmd_table(pudp, addr); + } while (addr = next, addr < end); +} + +static void free_empty_tables(unsigned long addr, unsigned long end) +{ + unsigned long next; + pgd_t *pgdp, pgd; + + do { + next = pgd_addr_end(addr, end); + pgdp = pgd_offset_k(addr); + pgd = READ_ONCE(*pgdp); + if (pgd_none(pgd)) + continue; + + WARN_ON(!pgd_present(pgd)); + free_empty_pud_table(pgdp, addr, next); + free_pud_table(pgdp, addr); + } while (addr = next, addr < end); +} + +static void remove_pagetable(unsigned long start, unsigned long end, + bool sparse_vmap) +{ + unmap_hotplug_range(start, end, sparse_vmap); + free_empty_tables(start, end); +} +#endif + #ifdef CONFIG_SPARSEMEM_VMEMMAP #if !ARM64_SWAPPER_USES_SECTION_MAPS int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, @@ -770,6 +1022,28 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, void vmemmap_free(unsigned long start, unsigned long end, struct vmem_altmap *altmap) { +#ifdef CONFIG_MEMORY_HOTPLUG + /* + * FIXME: We should have called remove_pagetable(start, end, true). + * vmemmap and vmalloc virtual range might share intermediate kernel + * page table entries. Removing vmemmap range page table pages here + * can potentially conflict with a concurrent vmalloc() allocation. + * + * This is primarily because vmalloc() does not take init_mm ptl for + * the entire page table walk and it's modification. Instead it just + * takes the lock while allocating and installing page table pages + * via [p4d|pud|pmd|pte]_alloc(). A concurrently vanishing page table + * entry via memory hot remove can cause vmalloc() kernel page table + * walk pointers to be invalid on the fly which can cause corruption + * or worst, a crash. + * + * So free_empty_tables() gets called where vmalloc and vmemmap range + * do not overlap at any intermediate level kernel page table entry. + */ + unmap_hotplug_range(start, end, true); + if (!vmalloc_vmemmap_overlap) + free_empty_tables(start, end); +#endif } #endif /* CONFIG_SPARSEMEM_VMEMMAP */ @@ -1048,10 +1322,18 @@ int p4d_free_pud_page(p4d_t *p4d, unsigned long addr) } #ifdef CONFIG_MEMORY_HOTPLUG +static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size) +{ + unsigned long end = start + size; + + WARN_ON(pgdir != init_mm.pgd); + remove_pagetable(start, end, false); +} + int arch_add_memory(int nid, u64 start, u64 size, struct mhp_restrictions *restrictions) { - int flags = 0; + int ret, flags = 0; if (rodata_full || debug_pagealloc_enabled()) flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS; @@ -1059,9 +1341,14 @@ int arch_add_memory(int nid, u64 start, u64 size, __create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start), size, PAGE_KERNEL, __pgd_pgtable_alloc, flags); - return __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT, + ret = __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT, restrictions); + if (ret) + __remove_pgd_mapping(swapper_pg_dir, + __phys_to_virt(start), size); + return ret; } + void arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap) { @@ -1069,14 +1356,47 @@ void arch_remove_memory(int nid, u64 start, u64 size, unsigned long nr_pages = size >> PAGE_SHIFT; struct zone *zone; - /* - * FIXME: Cleanup page tables (also in arch_add_memory() in case - * adding fails). Until then, this function should only be used - * during memory hotplug (adding memory), not for memory - * unplug. ARCH_ENABLE_MEMORY_HOTREMOVE must not be - * unlocked yet. - */ zone = page_zone(pfn_to_page(start_pfn)); __remove_pages(zone, start_pfn, nr_pages, altmap); + __remove_pgd_mapping(swapper_pg_dir, __phys_to_virt(start), size); +} + +static int __init find_vmalloc_vmemmap_overlap(void) +{ + unsigned long gap_start, gap_end; + + /* + * vmalloc and vmemmap address ranges should be linearly + * increasing and non-overlapping before the gap between + * them can be measured. + */ + BUILD_BUG_ON(VMALLOC_END <= VMALLOC_START); + BUILD_BUG_ON(VMEMMAP_END <= VMEMMAP_START); + BUILD_BUG_ON((VMEMMAP_START >= VMALLOC_START) + && (VMEMMAP_START <= VMALLOC_END)); + BUILD_BUG_ON((VMEMMAP_END >= VMALLOC_START) + && (VMEMMAP_END <= VMALLOC_END)); + + /* + * vmalloc and vmemmap can only be non-overlapping disjoint + * ranges. So [gap_start..gap_end] is determined which will + * represent the gap between the above mentioned ranges. + */ + if (VMEMMAP_START > VMALLOC_END) { + gap_start = VMALLOC_END; + gap_end = VMEMMAP_START; + } else { + gap_start = VMEMMAP_END; + gap_end = VMALLOC_START; + } + + /* + * Race condition during memory hot-remove exists when the + * gap edges share same PGD entry in kernel page table. + */ + if ((gap_start & PGDIR_MASK) == (gap_end & PGDIR_MASK)) + vmalloc_vmemmap_overlap = true; + return 0; } +early_initcall(find_vmalloc_vmemmap_overlap); #endif