Message ID: 20211207030750.30824-1-bhe@redhat.com (mailing list archive)
Series:     Avoid requesting page from DMA zone when no managed pages
Sorry, forgot adding x86 and x86/mm maintainers.

On 12/07/21 at 11:07am, Baoquan He wrote:
> ***Problem observed:
> On x86_64, when a crash is triggered and the kdump kernel is entered, a
> page allocation failure can always be seen.
>
>  ---------------------------------
>  DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations
>  swapper/0: page allocation failure: order:5, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
>  CPU: 0 PID: 1 Comm: swapper/0
>  Call Trace:
>   dump_stack+0x7f/0xa1
>   warn_alloc.cold+0x72/0xd6
>   ......
>   __alloc_pages+0x24d/0x2c0
>   ......
>   dma_atomic_pool_init+0xdb/0x176
>   do_one_initcall+0x67/0x320
>   ? rcu_read_lock_sched_held+0x3f/0x80
>   kernel_init_freeable+0x290/0x2dc
>   ? rest_init+0x24f/0x24f
>   kernel_init+0xa/0x111
>   ret_from_fork+0x22/0x30
>  Mem-Info:
>  ------------------------------------
>
> ***Root cause:
> The current kernel assumes that the DMA zone must have managed pages and
> tries to request pages from it if CONFIG_ZONE_DMA is enabled. This is
> not always true. E.g. in the kdump kernel of x86_64, only the low 1M is
> present and locked down at a very early stage of boot, so the low 1M
> won't be added into the buddy allocator to become managed pages of the
> DMA zone. This exception will always cause a page allocation failure if
> a page is requested from the DMA zone.
>
> ***Investigation:
> This failure has happened since the below commits were merged into
> Linus's tree:
>   1a6a9044b967 x86/setup: Remove CONFIG_X86_RESERVE_LOW and reservelow= options
>   23721c8e92f7 x86/crash: Remove crash_reserve_low_1M()
>   f1d4d47c5851 x86/setup: Always reserve the first 1M of RAM
>   7c321eb2b843 x86/kdump: Remove the backup region handling
>   6f599d84231f x86/kdump: Always reserve the low 1M when the crashkernel option is specified
>
> Before them, on x86_64, the low 640K area would be reused by the kdump
> kernel. So the content of the low 640K area was copied into a backup
> region for dumping before jumping into the kdump kernel. Then, except
> for the firmware-reserved regions in [0, 640K], the remaining area was
> added into the buddy allocator to become available managed pages of the
> DMA zone.
>
> However, after the above commits were applied, in the kdump kernel of
> x86_64 the low 1M is reserved by memblock and never released to the
> buddy allocator. So any later page allocation requested from the DMA
> zone will fail.
>
> This low 1M lockdown is needed because AMD SME encrypts memory, making
> the old backup region mechanism impossible when switching into the kdump
> kernel. An Intel engineer also mentioned that their TDX (Trusted Domain
> Extensions) support, which is under development in the kernel, needs the
> low 1M locked down as well. So we can't simply revert the above commits
> to fix the page allocation failure from the DMA zone, as someone
> suggested.
>
> ***Solution:
> Currently, only the DMA atomic pool and dma-kmalloc initialize
> themselves and request page allocations with GFP_DMA during bootup. So
> only initialize them when the DMA zone has available managed pages;
> otherwise just skip the initialization. From testing and code
> inspection, this doesn't matter. In the kdump kernel of x86_64, the page
> allocation failure disappears.
>
> ***Further thinking
> On x86_64, [0, 16M] is always placed into ZONE_DMA and (16M, 4G] into
> ZONE_DMA32 by default. The DMA zone covering the low 16M is there to
> take care of antique ISA devices. In fact, on a 64-bit system, ZONE_DMA
> (the low 16M) is rarely needed to support the almost extinct ISA
> devices. However, some components treat DMA as a generic concept: e.g.
> the slab allocator initializes dma-kmalloc for any later DMA-related
> buffer allocation, not limited to ISA DMA.
>
> On arm64, even though both CONFIG_ZONE_DMA and CONFIG_ZONE_DMA32 are
> enabled, ZONE_DMA covers the low 4G area and ZONE_DMA32 is empty, unless
> a platform reports a narrower limit (e.g. 30-bit on Raspberry Pi 4), in
> which case ZONE_DMA covers the first 1G and ZONE_DMA32 covers the rest
> of the 32-bit addressable memory.
>
> I am wondering if we can also make the sizes of the DMA and DMA32 zones
> dynamically adjusted, just as arm64 does. On x86_64 we could make
> ZONE_DMA cover the 32-bit addressable memory and leave ZONE_DMA32 empty
> by default. Once ISA_DMA_API is enabled, we go back to making ZONE_DMA
> cover the low 16M and ZONE_DMA32 the rest of the 32-bit addressable
> memory. (I am not familiar with ISA_DMA_API; will it require 24-bit
> addressable memory when enabled?)
>
> Change history:
>
> v2 post:
> https://lore.kernel.org/all/20210810094835.13402-1-bhe@redhat.com/T/#u
>
> v1 post:
> https://lore.kernel.org/all/20210624052010.5676-1-bhe@redhat.com/T/#u
>
> v2->v2 RESEND:
> John pinged to push the repost of this patchset. So fix one typo in the
> subject of patch 3/5, and fix a build error caused by a mixed
> declaration in patch 5/5. Both were found by John in his testing.
>
> v1->v2:
> Change to check whether a managed DMA zone exists. If the DMA zone has
> managed pages, go further and request pages from the DMA zone for
> initialization. Otherwise, just skip initializing the things that need
> pages from the DMA zone.
>
> Baoquan He (5):
>   docs: kernel-parameters: Update to reflect the current default size of
>     atomic pool
>   dma-pool: allow user to disable atomic pool
>   mm_zone: add function to check if managed dma zone exists
>   dma/pool: create dma atomic pool only if dma zone has managed pages
>   mm/slub: do not create dma-kmalloc if no managed pages in DMA zone
>
>  .../admin-guide/kernel-parameters.txt |  5 ++++-
>  include/linux/mmzone.h                | 21 +++++++++++++++++++
>  kernel/dma/pool.c                     | 11 ++++++----
>  mm/page_alloc.c                       | 11 ++++++++++
>  mm/slab_common.c                      |  9 ++++++++
>  5 files changed, 52 insertions(+), 5 deletions(-)
>
> --
> 2.17.2
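For readers following the series: patch 3/5 ("mm_zone: add function to
check if managed dma zone exists") is the pivot of the solution above. A
minimal sketch of what such a helper can look like, built from the
existing managed_zone() and for_each_online_pgdat() kernel helpers
(details may differ from the actual patch):

    /* Sketch of the helper patch 3/5 adds (declared in
     * include/linux/mmzone.h, defined in mm/page_alloc.c):
     * true if any node's ZONE_DMA has pages in the buddy allocator. */
    #ifdef CONFIG_ZONE_DMA
    bool has_managed_dma(void)
    {
            struct pglist_data *pgdat;

            for_each_online_pgdat(pgdat) {
                    struct zone *zone = &pgdat->node_zones[ZONE_DMA];

                    if (managed_zone(zone))
                            return true;
            }
            return false;
    }
    #endif

Callers such as dma_atomic_pool_init() and the dma-kmalloc cache setup
can then skip their GFP_DMA allocations when this returns false, which is
exactly the kdump situation described above.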
On 12/6/21 9:16 PM, Baoquan He wrote:
> Sorry, forgot adding x86 and x86/mm maintainers

Hi,

These commits need to be applied to Linux 5.15.0 (LTS) too, since it has
the original regression:

  1d659236fb43 ("dma-pool: scale the default DMA coherent pool size with
  memory capacity")

Maybe add "Fixes:" tags to the other commits?

> On 12/07/21 at 11:07am, Baoquan He wrote:
>> ***Problem observed:
>> On x86_64, when a crash is triggered and the kdump kernel is entered, a
>> page allocation failure can always be seen.
>>
>> [rest of the cover letter quoted in full; trimmed here]
On Tue, 7 Dec 2021, Baoquan He wrote:

> into ZONE_DMA32 by default. The DMA zone covering the low 16M is there
> to take care of antique ISA devices. In fact, on a 64-bit system,
> ZONE_DMA (the low 16M) is rarely needed to support the almost extinct
> ISA devices. However, some components treat DMA as a generic concept:
> e.g. the slab allocator initializes dma-kmalloc for any later
> DMA-related buffer allocation, not limited to ISA DMA.

The idea of the slab allocator DMA support is to have memory available
for devices that can only support a limited range of physical addresses.
These are only to be enabled for platforms that have such requirements.

The slab allocators guarantee that all kmalloc allocations are DMA-able
independent of specifying ZONE_DMA/ZONE_DMA32.

> On arm64, even though both CONFIG_ZONE_DMA and CONFIG_ZONE_DMA32 are
> enabled, ZONE_DMA covers the low 4G area and ZONE_DMA32 is empty, unless
> a platform reports a narrower limit (e.g. 30-bit on Raspberry Pi 4), in
> which case ZONE_DMA covers the first 1G and ZONE_DMA32 covers the rest
> of the 32-bit addressable memory.

ZONE_NORMAL should cover all memory. ARM does not need ZONE_DMA32.

> I am wondering if we can also make the sizes of the DMA and DMA32 zones
> dynamically adjusted, just as arm64 does. On x86_64 we could make
> ZONE_DMA cover the 32-bit addressable memory and leave ZONE_DMA32 empty
> by default. Once ISA_DMA_API is enabled, we go back to making ZONE_DMA
> cover the low 16M and ZONE_DMA32 the rest of the 32-bit addressable
> memory. (I am not familiar with ISA_DMA_API; will it require 24-bit
> addressable memory when enabled?)

The size of ZONE_DMA traditionally depends on the platform. On some it is
16MB, on some 1G and on some 4GB. ZONE_DMA32 is always 4GB and should
only be used if ZONE_DMA has already been used.

ZONE_DMA is dynamic in the sense of being different on different
platforms.

Generally, I guess it would be possible to use ZONE_DMA for generic
tagging of special memory that can be configured to have a dynamic size,
but that is not what it was designed to do.
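As an illustration of the opt-in Christoph describes, a minimal sketch of
a slab cache pinned to ZONE_DMA for a hypothetical device that can only
address the low 16M (the driver and names here are invented; the
SLAB_CACHE_DMA flag and kmem_cache_create() are the existing kernel API):

    #include <linux/slab.h>

    /* Hypothetical driver: SLAB_CACHE_DMA makes the cache's backing
     * pages come from ZONE_DMA, for 24-bit-addressing hardware. */
    static struct kmem_cache *isa_buf_cache;

    static int __init isa_buf_init(void)
    {
            isa_buf_cache = kmem_cache_create("isa_buf", 512, 0,
                                              SLAB_CACHE_DMA, NULL);
            return isa_buf_cache ? 0 : -ENOMEM;
    }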
On Mon, 6 Dec 2021 22:03:59 -0600 John Donnelly <John.p.donnelly@oracle.com> wrote:

> On 12/6/21 9:16 PM, Baoquan He wrote:
> > Sorry, forgot adding x86 and x86/mm maintainers
>
> Hi,
>
> These commits need to be applied to Linux 5.15.0 (LTS) too, since it has
> the original regression:
>
>   1d659236fb43 ("dma-pool: scale the default DMA coherent pool size with
>   memory capacity")
>
> Maybe add "Fixes:" tags to the other commits?

And cc:stable, please. "Fixes:" doesn't always mean "should be
backported".
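Taken together, the tags being asked for would look like this in the
affected patches' commit messages (using the Fixes: commit Baoquan
settles on later in the thread):

    Fixes: 6f599d84231f ("x86/kdump: Always reserve the low 1M when the crashkernel option is specified")
    Cc: <stable@vger.kernel.org>

Per Andrew's point, the Fixes: tag records which commit introduced the
bug, while the Cc: stable line is what actually requests the backport.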
On 12/7/21 22:33, Andrew Morton wrote:
> On Mon, 6 Dec 2021 22:03:59 -0600 John Donnelly <John.p.donnelly@oracle.com> wrote:
>
>> These commits need to be applied to Linux 5.15.0 (LTS) too, since it
>> has the original regression:
>>
>>   1d659236fb43 ("dma-pool: scale the default DMA coherent pool size
>>   with memory capacity")
>>
>> Maybe add "Fixes:" tags to the other commits?
>
> And cc:stable, please. "Fixes:" doesn't always mean "should be
> backported".

Hi. Does this mean we need a v3 version?
On 12/07/21 at 09:05am, Christoph Lameter wrote:
> On Tue, 7 Dec 2021, Baoquan He wrote:
>
> > into ZONE_DMA32 by default. The DMA zone covering the low 16M is there
> > to take care of antique ISA devices. [...]

Thanks a lot for your review and for sharing this.

> The idea of the slab allocator DMA support is to have memory available
> for devices that can only support a limited range of physical addresses.
> These are only to be enabled for platforms that have such requirements.
>
> The slab allocators guarantee that all kmalloc allocations are DMA-able
> independent of specifying ZONE_DMA/ZONE_DMA32.

Here, do you mean we guarantee that dma-kmalloc will be DMA-able
independent of specifying ZONE_DMA/DMA32, or the whole slab/slub
allocator?

Sorry for the late reply; I suddenly realized that one test case was
missed. In my earlier testing of this patchset, I only set
crashkernel=256M on the command line, which reserves 256M of memory under
4G. In the kdump kernel, all memory then belongs to ZONE_DMA32, so
requesting a DMA buffer with GFP_DMA ends up getting memory from
ZONE_DMA32, since ZONE_NORMAL doesn't exist.

Yesterday I tried crashkernel=256M,high, which reserves 256M above 4G
plus another 256M under 4G. Then ZONE_NORMAL has memory above 4G. With
this patchset applied, dma-kmalloc takes pages from ZONE_NORMAL, i.e.
pages above 4G. What disappointed me is that this patchset works there
too. So what confuses me is that the ata_scsi device driver requests a
DMA buffer with GFP_DMA, we feed it memory above 4G, and it still
succeeds. I added amd_iommu=off to the command line to disable the IOMMU.

Furthermore, on riscv and ia64, which only have ZONE_DMA32 and no
ZONE_DMA, if an ata_scsi device is deployed and requests a DMA buffer
with GFP_DMA but gets memory above 4G, isn't that wrong?

With my understanding, isn't the reasonable fallback sequence ZONE_DMA
first for GFP_DMA, then ZONE_DMA32, and finally ZONE_NORMAL? At least on
x86_64, I believe device driver developers would prefer to see this,
because most of the time both ZONE_DMA and ZONE_DMA32 are used for DMA
buffer allocation if the IOMMU is not enabled. However, when memory is
obtained from ZONE_NORMAL for a GFP_DMA request and it succeeds, does it
mean the developer doesn't take the GFP_DMA flag seriously and is just
trying to get a buffer?

  sr_probe()
  --> get_capabilities()
      --> buffer = kmalloc(512, GFP_KERNEL | GFP_DMA);
      --> scsi_mode_sense()
          --> scsi_execute_req()
              --> blk_rq_map_kern()
                  --> bio_copy_kern() or
                  --> bio_map_kern()

> > On arm64, even though both CONFIG_ZONE_DMA and CONFIG_ZONE_DMA32 are
> > enabled, ZONE_DMA covers the low 4G area and ZONE_DMA32 is empty, [...]
>
> ZONE_NORMAL should cover all memory. ARM does not need ZONE_DMA32.

I grep-ed all the arches which provide ZONE_DMA and/or ZONE_DMA32, and
summarize them below. Among the arches that have ZONE_DMA32, only x86_64
and mips (when not on an SGI_IP22 or SGI_IP28 platform) have a 16M
ZONE_DMA. Obviously that ZONE_DMA exists because they carry the legacy
burden of old ISA support.

Arm64 has ZONE_DMA cover the low 4G by default if ACPI/DT doesn't report
a narrower DMA capability limit, while riscv and ia64 both skip ZONE_DMA
and use ZONE_DMA32 alone to cover the low 4G. As for s390 and ppc64, they
both take the low 2G into ZONE_DMA and provide no ZONE_DMA32.

=============================
Arches which have DMA32:
             ZONE_DMA       ZONE_DMA32
 arm64       0~X            X~4G    (X comes from ACPI or DT; otherwise X
                                     is 4G by default and DMA32 is empty)
 ia64        None           0~4G
 mips        0 or 0~16M     X~4G    (ZONE_DMA is empty on SGI_IP22 or
                                     SGI_IP28, otherwise 16M by default,
                                     like i386)
 riscv       None           0~4G
 x86_64      16M            16M~4G

=============================
Arches which have no DMA32:
             ZONE_DMA
 alpha       0~16M, or empty if IOMMU is enabled
 arm         0~X (X is reported by fdt, 4G by default)
 m68k        0~total memory
 microblaze  0~total low memory
 powerpc     0~2G
 s390        0~2G
 sparc       0~total low memory
 i386        0~16M

> > I am wondering if we can also make the sizes of the DMA and DMA32
> > zones dynamically adjusted, just as arm64 does. [...]
>
> The size of ZONE_DMA traditionally depends on the platform. On some it
> is 16MB, on some 1G and on some 4GB. ZONE_DMA32 is always 4GB and should
> only be used if ZONE_DMA has already been used.

As said above, ia64 and riscv don't have ZONE_DMA at all; they just cover
the low 4G with ZONE_DMA32 alone.

> ZONE_DMA is dynamic in the sense of being different on different
> platforms.
>
> Generally, I guess it would be possible to use ZONE_DMA for generic
> tagging of special memory that can be configured to have a dynamic size,
> but that is not what it was designed to do.

Thanks again for this precious sharing. I am still a little confused by
the current ZONE_DMA and its usage, e.g. in slab, and need to explore
further.
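One way to check empirically which zone such an allocation actually
landed in is a throwaway debug module along these lines (not part of the
series; virt_to_page(), page_zone() and friends are standard kernel API):

    #include <linux/module.h>
    #include <linux/slab.h>
    #include <linux/mm.h>

    /* Debug sketch: allocate with GFP_DMA and report which zone the
     * backing page really came from. */
    static int __init gfp_dma_probe_init(void)
    {
            void *buf = kmalloc(512, GFP_KERNEL | GFP_DMA);
            struct page *page;

            if (!buf)
                    return -ENOMEM;

            page = virt_to_page(buf);
            pr_info("buf at pfn %lx, zone %s\n",
                    page_to_pfn(page), page_zone(page)->name);
            kfree(buf);
            return 0;
    }
    module_init(gfp_dma_probe_init);
    MODULE_LICENSE("GPL");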
On Thu, 9 Dec 2021, Baoquan He wrote:

> > The slab allocators guarantee that all kmalloc allocations are
> > DMA-able independent of specifying ZONE_DMA/ZONE_DMA32.
>
> Here, do you mean we guarantee that dma-kmalloc will be DMA-able
> independent of specifying ZONE_DMA/DMA32, or the whole slab/slub
> allocator?

All memory obtained via kmalloc -- independent of "dma-kmalloc",
ZONE_DMA, etc. -- must be DMA-able.

> With my understanding, isn't the reasonable fallback sequence ZONE_DMA
> first for GFP_DMA, then ZONE_DMA32, and finally ZONE_NORMAL? [...]
> However, when memory is obtained from ZONE_NORMAL for a GFP_DMA request
> and it succeeds, does it mean the developer doesn't take the GFP_DMA
> flag seriously and is just trying to get a buffer?

ZONE_NORMAL is also used for DMA allocations. ZONE_DMA and ZONE_DMA32 are
only used if the physical range of memory supported by a device does not
include all of normal memory.

> > The size of ZONE_DMA traditionally depends on the platform. On some it
> > is 16MB, on some 1G and on some 4GB. ZONE_DMA32 is always 4GB and
> > should only be used if ZONE_DMA has already been used.
>
> As said above, ia64 and riscv don't have ZONE_DMA at all; they just
> cover the low 4G with ZONE_DMA32 alone.

If you do not have devices that are crap and cannot address the full
memory, then you don't need these special zones.

Sorry, this subject has caused confusion multiple times over the years,
and there are still arches that are not implementing this in a consistent
way.
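A note that may help with the fallback question above: gfp_zone() picks
the *highest* zone the flags allow, and the zonelist then falls back
downward only. A simplified sketch of the mapping (the real gfp_zone() in
include/linux/gfp.h uses a bit-packed lookup table, and the
highmem/movable cases are omitted here):

    /* Illustrative only, not the real table-based implementation. The
     * allocator starts at the returned zone and falls back to *lower*
     * zones, so GFP_DMA is effectively pinned to ZONE_DMA, while a plain
     * GFP_KERNEL request can spill from ZONE_NORMAL down into
     * DMA32/DMA. */
    static inline enum zone_type gfp_zone_sketch(gfp_t flags)
    {
            if (flags & __GFP_DMA)
                    return ZONE_DMA;
            if (flags & __GFP_DMA32)
                    return ZONE_DMA32;
            return ZONE_NORMAL;
    }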
On 12/06/21 at 10:03pm, John Donnelly wrote:
> On 12/6/21 9:16 PM, Baoquan He wrote:
> > Sorry, forgot adding x86 and x86/mm maintainers
>
> Hi,
>
> These commits need to be applied to Linux 5.15.0 (LTS) too, since it has
> the original regression:
>
>   1d659236fb43 ("dma-pool: scale the default DMA coherent pool size with
>   memory capacity")

Yeah, Fixes and stable tags need to be added. Thanks for pointing that
out. As I said in the cover letter, this issue didn't occur until the
below commits were applied, so I will add

  Fixes: 6f599d84231f ("x86/kdump: Always reserve the low 1M when the crashkernel option is specified")

to patches 4 and 5. Patches 1 and 2 are cleanups/improvements, not
related to this issue.

  1a6a9044b967 x86/setup: Remove CONFIG_X86_RESERVE_LOW and reservelow= options
  23721c8e92f7 x86/crash: Remove crash_reserve_low_1M()
  f1d4d47c5851 x86/setup: Always reserve the first 1M of RAM
  7c321eb2b843 x86/kdump: Remove the backup region handling
  6f599d84231f x86/kdump: Always reserve the low 1M when the crashkernel option is specified

> Maybe add "Fixes:" tags to the other commits?
>
> [full quote of the cover letter trimmed]
On 12/09/21 at 01:59pm, Christoph Lameter wrote:
> On Thu, 9 Dec 2021, Baoquan He wrote:
>
> > Here, do you mean we guarantee that dma-kmalloc will be DMA-able
> > independent of specifying ZONE_DMA/DMA32, or the whole slab/slub
> > allocator?
>
> All memory obtained via kmalloc -- independent of "dma-kmalloc",
> ZONE_DMA, etc. -- must be DMA-able.

This has a prerequisite, as you said below: only if devices can address
the full memory, right?

> > With my understanding, isn't the reasonable fallback sequence ZONE_DMA
> > first for GFP_DMA, then ZONE_DMA32, and finally ZONE_NORMAL? [...]
>
> ZONE_NORMAL is also used for DMA allocations. ZONE_DMA and ZONE_DMA32
> are only used if the physical range of memory supported by a device does
> not include all of normal memory.

If devices can address the full memory, ZONE_NORMAL can also be used for
DMA allocations (this covers the systems where an IOMMU is provided). If
a device has an addressing limit, e.g. a 24-bit or 32-bit DMA mask,
ZONE_DMA and ZONE_DMA32 are needed.

> > As said above, ia64 and riscv don't have ZONE_DMA at all; they just
> > cover the low 4G with ZONE_DMA32 alone.
>
> If you do not have devices that are crap and cannot address the full
> memory, then you don't need these special zones.

I am not a DMA expert. With my understanding, on x86_64 and arm64 we have
PCIe devices whose DMA mask is 32-bit, meaning they can only address
ZONE_DMA32. Supporting addressing of the full memory might be too
expensive for devices, e.g. on these two arches the supported memory
could be deployed at petabytes of address space.

> Sorry, this subject has caused confusion multiple times over the years,
> and there are still arches that are not implementing this in a
> consistent way.

Seems so. And by the way, when reading the slub code I noticed a strange
phenomenon that I haven't found the reason for. When creating a cache
with kmem_cache_create(), the zone flags SLAB_CACHE_DMA and
SLAB_CACHE_DMA32 can be specified; allocflags stores them, and they are
used when allocating a new slab. Meanwhile, we can also specify gfpflags
at allocation time, but GFP_DMA32 is not allowed there because of
GFP_SLAB_BUG_MASK. I traced back through very old git history and didn't
find out why GFP_DMA32 can't be specified during kmem_cache_alloc(). We
could rely entirely on cache->allocflags to mark the zone we will request
pages from, yet we can also pass gfpflags to kmem_cache_alloc() to change
the zone, with GFP_DMA32 prohibited. The only reason I can see is
kmalloc(): kmalloc_large() has no created cache, so there is no
->allocflags to use. Is this expected? What can we do to clarify or
improve this, at least in code readability?

I am going to post v3 and will discard the 'Further thinking' part of the
cover letter according to your comments. Please help point out if
anything needs to be done or was missed. Thanks a lot.

Baoquan
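To make the asymmetry Baoquan describes concrete, a sketch under the flag
names visible in kernels of this era (illustrative only; the cache name
and function are invented):

    #include <linux/slab.h>

    static struct kmem_cache *dma32_cache;

    static int __init dma32_flags_demo(void)
    {
            void *ok;

            /* Allowed: pin the cache's backing pages to ZONE_DMA32 at
             * creation time; the flag is stored in cache->allocflags. */
            dma32_cache = kmem_cache_create("dma32_buf", 256, 0,
                                            SLAB_CACHE_DMA32, NULL);
            if (!dma32_cache)
                    return -ENOMEM;

            ok = kmem_cache_alloc(dma32_cache, GFP_KERNEL);

            /* Not allowed: __GFP_DMA32 in the per-call gfp mask. It is
             * part of GFP_SLAB_BUG_MASK
             * (__GFP_DMA32|__GFP_HIGHMEM|~__GFP_BITS_MASK), so the slab
             * page allocation path warns and strips it, whereas a
             * per-call GFP_DMA is accepted and steers kmalloc() to the
             * dma-kmalloc caches:
             *
             *   bad = kmem_cache_alloc(dma32_cache,
             *                          GFP_KERNEL | GFP_DMA32);
             */

            kmem_cache_free(dma32_cache, ok);
            return 0;
    }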
On Tue, Dec 07, 2021 at 09:05:26AM +0100, Christoph Lameter wrote:
> On Tue, 7 Dec 2021, Baoquan He wrote:
>
> > into ZONE_DMA32 by default. The DMA zone covering the low 16M is there
> > to take care of antique ISA devices. [...]
>
> The idea of the slab allocator DMA support is to have memory available
> for devices that can only support a limited range of physical addresses.
> These are only to be enabled for platforms that have such requirements.
>
> The slab allocators guarantee that all kmalloc allocations are DMA-able
> independent of specifying ZONE_DMA/ZONE_DMA32.

Yes. And we never supported slab for ZONE_DMA32, and we should work on
getting rid of it for ZONE_DMA as well. The only thing that guarantees
device addressability is the DMA API. The DMA API needs ZONE_DMA/DMA32 to
back its page allocations, but supporting this in slab is a bad idea,
only explained by historic reasons from before we had a DMA API.

> > On arm64, even though both CONFIG_ZONE_DMA and CONFIG_ZONE_DMA32 are
> > enabled, ZONE_DMA covers the low 4G area and ZONE_DMA32 is empty, [...]
>
> ZONE_NORMAL should cover all memory. ARM does not need ZONE_DMA32.

arm32 does not, arm64 does. And the Pi 4 is an arm64 device.

> > I am wondering if we can also make the sizes of the DMA and DMA32
> > zones dynamically adjusted, just as arm64 does. [...]
>
> The size of ZONE_DMA traditionally depends on the platform. On some it
> is 16MB, on some 1G and on some 4GB. ZONE_DMA32 is always 4GB and should
> only be used if ZONE_DMA has already been used.

ZONE_DMA32 should be (and generally is) used whenever there is a zone
covering the 32-bit CPU physical address limit.

> ZONE_DMA is dynamic in the sense of being different on different
> platforms.

Agreed.
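For contrast with GFP_DMA, a minimal sketch of the DMA API route
Christoph points to, for a hypothetical driver
(dma_set_mask_and_coherent() and dma_alloc_coherent() are the standard
entry points):

    #include <linux/dma-mapping.h>

    /* Sketch: let the DMA API, not a gfp zone flag, enforce device
     * addressability. 'dev' is the hypothetical device's struct device. */
    static int setup_ring(struct device *dev)
    {
            dma_addr_t ring_dma;
            void *ring;

            /* Declare what the hardware can address; the core then backs
             * allocations from a suitable zone (or bounces/remaps). */
            if (dma_set_mask_and_coherent(dev, DMA_BIT_MASK(32)))
                    return -EIO;

            ring = dma_alloc_coherent(dev, PAGE_SIZE, &ring_dma,
                                      GFP_KERNEL);
            if (!ring)
                    return -ENOMEM;

            /* ring_dma is guaranteed to fit the device's 32-bit mask. */
            return 0;
    }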
On Mon, Dec 13, 2021 at 03:39:25PM +0800, Baoquan He wrote:
> > > As said above, ia64 and riscv don't have ZONE_DMA at all; they just
> > > cover the low 4G with ZONE_DMA32 alone.
> >
> > If you do not have devices that are crap and cannot address the full
> > memory, then you don't need these special zones.
>
> I am not a DMA expert. With my understanding, on x86_64 and arm64 we
> have PCIe devices whose DMA mask is 32-bit

Yes, way too many, and they keep getting newly introduced as well. Also
weirdo masks like 40, 44 or 48 bits.

> , meaning they can only address ZONE_DMA32.

Yes and no. Offsets between CPU physical and device addresses make this
complicated, even ignoring IOMMUs.
On 12/13/21 at 02:25pm, Borislav Petkov wrote:
> On Tue, Dec 07, 2021 at 11:16:31AM +0800, Baoquan He wrote:
> > > This low 1M lockdown is needed because AMD SME encrypts memory,
> > > making the old backup region mechanism impossible when switching
> > > into the kdump kernel. An Intel engineer also mentioned that their
> > > TDX (Trusted Domain Extensions) support, which is under development
> > > in the kernel, needs the low 1M locked down as well. So we can't
> > > simply revert the above commits to fix the page allocation failure
> > > from the DMA zone, as someone suggested.
>
> Did you read
>
>   f1d4d47c5851 ("x86/setup: Always reserve the first 1M of RAM")
>
> carefully for a more generically important reason as to why the first 1M
> should not be used?

Apparently I didn't. I slacked off and just grabbed things stored in my
brain. That is the right justification, and I missed it. Thanks for
pointing it out.