Message ID | CAKGA1bmJ6c4LC=NAD4MU_rBZ2hZjf6qH=C+dqMCf0eP+2vSFdA@mail.gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Fri, Jan 18, 2013 at 07:11:32PM -0600, Matt Sealey wrote: > On Fri, Jan 18, 2013 at 3:08 PM, Russell King - ARM Linux > <linux@arm.linux.org.uk> wrote: > > On Fri, Jan 18, 2013 at 02:24:15PM -0600, Matt Sealey wrote: > >> Hello all, > >> > >> I wonder if anyone can shed some light on this linking problem I have > >> right now. If I configure my kernel without SMP support (it is a very > >> lean config for i.MX51 with device tree support only) I hit this error > >> on linking: > > > > Yes, I looked at this, and I've decided that I will _not_ fix this export, > > neither will I accept a patch to add an export. > > Understood.. > > > As far as I can see, this code is buggy in a SMP environment. There's > > apparantly no guarantee that: > > > > 1. the mapping will be created on a particular CPU. > > 2. the mapping will then be used only on this specific CPU. > > 3. no guarantee that another CPU won't speculatively prefetch from this > > region. > > 4. when the mapping is torn down, no guarantee that it's the same CPU that > > used the happing. > > > > So, the use of the local TLB flush leaves all the other CPUs potentially > > containing TLB entries for this mapping. > > I'm gonna put this out to the maintainers (Konrad, and Seth since he > committed it) that if this code is buggy it gets taken back out, even > if it makes zsmalloc "slow" on ARM, for the following reasons: Just to make sure I understand, you mean don't use page table mapping but instead use copying? > > * It's buggy on SMP as Russell describes above > * It might not be buggy on UP (opposite to Russell's description above > as the restrictions he states do not exist), but that would imply an > export for a really core internal MM function nobody should be using > anyway > * By that assessment, using that core internal MM function on SMP is > also bad voodoo that zsmalloc should not be doing 'local_tlb_flush' is bad voodoo? > > It also either smacks of a lack of comprehensive testing or defiance > of logic that nobody ever built the code without CONFIG_SMP, which > means it was only tested on a bunch of SMP ARM systems (I'm guessing.. > Pandaboard? :) or UP systems with SMP/SMP_ON_UP enabled (to expand on > that guess, maybe Beagleboard in some multiplatform Beagle/Panda > hybrid kernel). I am sure I was reading the mailing lists when that > patch was discussed, coded and committed and my guess is correct. In > this case, what we have here anyway is code which when PROPERLY > configured as so.. The initial patch were done on x86. Then Seth did the work to make sure it worked on PPC. Munchin looked on ARM and that is it. If you have an ARM server that you would be willing to part with I would be thrilled to look at it. > > diff --git a/drivers/staging/zsmalloc/zsmalloc-main.c > b/drivers/staging/zsmalloc/zsmalloc-main.c > index 09a9d35..ecf75fb 100644 > --- a/drivers/staging/zsmalloc/zsmalloc-main.c > +++ b/drivers/staging/zsmalloc/zsmalloc-main.c > @@ -228,7 +228,7 @@ struct zs_pool { > * mapping rather than copying > * for object mapping. > */ > -#if defined(CONFIG_ARM) > +#if defined(CONFIG_ARM) && defined(CONFIG_SMP) > #define USE_PGTABLE_MAPPING > #endif > > .. such that it even compiles in both "guess" configurations, the > slower Cortex-A8 600MHz single core system gets to use the slow copy > path and the dual-core 1GHz+ Cortex-A9 (with twice the RAM..) gets to > use the fast mapping path. Essentially all the patch does is "improve > performance" on the fastest, best-configured, large-amounts-of-RAM, > lots-of-CPU-performance ARM systems (OMAP4+, Snapdragon, Exynos4+, > marvell armada, i.MX6..) while introducing the problems Russell > describes, and leave performance exactly the same and potentially far > more stable on the slower, memory-limited ARM machines. Any ideas on how to detect that? > > Given the purpose of zsmalloc, zram, zcache etc. this somewhat defies > logic. If it's not making the memory-limited, slow ARM systems run > better, what's the point? > > So in summary I suggest "we" (Greg? or is it Seth's responsibility?) > should just back out that whole USE_PGTABLE_MAPPING chunk of code > introduced with f553646. Then Russell can carry on randconfiging and I > can build for SMP and UP and get the same code.. with less bugs. I get that you want to have this fixed right now. I think having it fixed the right way is a better choice. Lets discuss that first before we start tossing patches to disable parts of it.
On Fri, Jan 18, 2013 at 11:46:02PM -0500, Konrad Rzeszutek Wilk wrote: > On Fri, Jan 18, 2013 at 07:11:32PM -0600, Matt Sealey wrote: > > On Fri, Jan 18, 2013 at 3:08 PM, Russell King - ARM Linux > > <linux@arm.linux.org.uk> wrote: > > > On Fri, Jan 18, 2013 at 02:24:15PM -0600, Matt Sealey wrote: > > >> Hello all, > > >> > > >> I wonder if anyone can shed some light on this linking problem I have > > >> right now. If I configure my kernel without SMP support (it is a very > > >> lean config for i.MX51 with device tree support only) I hit this error > > >> on linking: > > > > > > Yes, I looked at this, and I've decided that I will _not_ fix this export, > > > neither will I accept a patch to add an export. > > > > Understood.. > > > > > As far as I can see, this code is buggy in a SMP environment. There's > > > apparantly no guarantee that: > > > > > > 1. the mapping will be created on a particular CPU. > > > 2. the mapping will then be used only on this specific CPU. > > > 3. no guarantee that another CPU won't speculatively prefetch from this > > > region. > > > 4. when the mapping is torn down, no guarantee that it's the same CPU that > > > used the happing. > > > > > > So, the use of the local TLB flush leaves all the other CPUs potentially > > > containing TLB entries for this mapping. > > > > I'm gonna put this out to the maintainers (Konrad, and Seth since he > > committed it) that if this code is buggy it gets taken back out, even > > if it makes zsmalloc "slow" on ARM, for the following reasons: > > Just to make sure I understand, you mean don't use page table > mapping but instead use copying? > > > > > * It's buggy on SMP as Russell describes above > > * It might not be buggy on UP (opposite to Russell's description above > > as the restrictions he states do not exist), but that would imply an > > export for a really core internal MM function nobody should be using > > anyway > > * By that assessment, using that core internal MM function on SMP is > > also bad voodoo that zsmalloc should not be doing > > 'local_tlb_flush' is bad voodoo? > > > > > It also either smacks of a lack of comprehensive testing or defiance > > of logic that nobody ever built the code without CONFIG_SMP, which > > means it was only tested on a bunch of SMP ARM systems (I'm guessing.. > > Pandaboard? :) or UP systems with SMP/SMP_ON_UP enabled (to expand on > > that guess, maybe Beagleboard in some multiplatform Beagle/Panda > > hybrid kernel). I am sure I was reading the mailing lists when that > > patch was discussed, coded and committed and my guess is correct. In > > this case, what we have here anyway is code which when PROPERLY > > configured as so.. > > The initial patch were done on x86. Then Seth did the work to make sure > it worked on PPC. Munchin looked on ARM and that is it. s/Munchin/Minchan > > If you have an ARM server that you would be willing to part with I would > be thrilled to look at it. > > > > > diff --git a/drivers/staging/zsmalloc/zsmalloc-main.c > > b/drivers/staging/zsmalloc/zsmalloc-main.c > > index 09a9d35..ecf75fb 100644 > > --- a/drivers/staging/zsmalloc/zsmalloc-main.c > > +++ b/drivers/staging/zsmalloc/zsmalloc-main.c > > @@ -228,7 +228,7 @@ struct zs_pool { > > * mapping rather than copying > > * for object mapping. > > */ > > -#if defined(CONFIG_ARM) > > +#if defined(CONFIG_ARM) && defined(CONFIG_SMP) > > #define USE_PGTABLE_MAPPING I don't get it. How to prevent the problem Russel described? The problem is that other CPU can prefetch _speculatively_ under us. > > #endif > > > > .. such that it even compiles in both "guess" configurations, the > > slower Cortex-A8 600MHz single core system gets to use the slow copy > > path and the dual-core 1GHz+ Cortex-A9 (with twice the RAM..) gets to > > use the fast mapping path. Essentially all the patch does is "improve > > performance" on the fastest, best-configured, large-amounts-of-RAM, > > lots-of-CPU-performance ARM systems (OMAP4+, Snapdragon, Exynos4+, > > marvell armada, i.MX6..) while introducing the problems Russell > > describes, and leave performance exactly the same and potentially far > > more stable on the slower, memory-limited ARM machines. > > Any ideas on how to detect that? > > > > Given the purpose of zsmalloc, zram, zcache etc. this somewhat defies > > logic. If it's not making the memory-limited, slow ARM systems run > > better, what's the point? > > > > So in summary I suggest "we" (Greg? or is it Seth's responsibility?) > > should just back out that whole USE_PGTABLE_MAPPING chunk of code > > introduced with f553646. Then Russell can carry on randconfiging and I > > can build for SMP and UP and get the same code.. with less bugs. > > I get that you want to have this fixed right now. I think having it > fixed the right way is a better choice. Lets discuss that first > before we start tossing patches to disable parts of it. If I don't miss something, we could have 2 choice. 1) use flush_tlb_kernel_range instead of local_flush_tlb_kernel_range Or 2) use only memory copy I guess everybody want 2 because it makes code very simple and maintainable. Even, rencently Joonsoo sent optimize patch. Look at https://lkml.org/lkml/2013/1/16/68 so zram/zcache effect caused by 2 would be minimized. But please give me the time and I will borrow quad-core embedded target board and test 1 on the phone with Seth's benchmark. > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/
Hi Minchan, On Sun, Jan 20, 2013 at 11:55 PM, Minchan Kim <minchan@kernel.org> wrote: > On Fri, Jan 18, 2013 at 11:46:02PM -0500, Konrad Rzeszutek Wilk wrote: >> On Fri, Jan 18, 2013 at 07:11:32PM -0600, Matt Sealey wrote: >> > >> > diff --git a/drivers/staging/zsmalloc/zsmalloc-main.c >> > b/drivers/staging/zsmalloc/zsmalloc-main.c >> > index 09a9d35..ecf75fb 100644 >> > --- a/drivers/staging/zsmalloc/zsmalloc-main.c >> > +++ b/drivers/staging/zsmalloc/zsmalloc-main.c >> > @@ -228,7 +228,7 @@ struct zs_pool { >> > * mapping rather than copying >> > * for object mapping. >> > */ >> > -#if defined(CONFIG_ARM) >> > +#if defined(CONFIG_ARM) && defined(CONFIG_SMP) >> > #define USE_PGTABLE_MAPPING > > I don't get it. How to prevent the problem Russel described? > The problem is that other CPU can prefetch _speculatively_ under us. It prevents no problems, but if that isn't there, kernels build without SMP support (i.e. specifically uniprocessor kernels) will fail at the linker stage. That's not desirable. We have 3 problems here, this solves the first of them, and creates the third. The second is constant regardless.. 1) zsmalloc will not build on ARM without CONFIG_SMP because on UP local_tlb_flush_kern_range uses a function which uses an export which isn't required on SMP Basically, with CONFIG_SMP (and CONFIG_UP_ON_SMP), local_tlb_flush_kern_range is calling functions by dereference from the per-cpu global cpu_tlb structure. On UP (!CONFIG_SMP), it is calling functions directly (in my case, v7wbi_local_tlb_flush_kern_range or whatever, but on v6, v5, v4 ARM processor kernel builds it may be different) which need to be exported outside of the MM core. If this difference is going to stick around - Russell is refusing here to export that/those direct functions - then the optimized vm mapping code simply should never be allowed to run on non-SMP systems to keep it building for everyone. The patch above is simply a build fix for !CONFIG_SMP in this case to force it to use the slow path for systems where the above missing export problem will cause the linker failure. 2) the optimized vm mapping isn't per-cpu aware as per Russell's arguments. I'll let you guys discuss that as I have no idea what the real implications are for SMP systems (and my current testing is only on a non-SMP CPU, I will have to go grab a couple boards from the lab for SMP) 3) it somewhat defeats the purpose of the optimization if UP systems (which tend to have less memory and might benefit from things like zsmalloc/zram more) cannot use it, but SMP systems which tend to have more memory (unless we're talking about a frugal configuration of a virtual machine, perhaps). Given the myriad use cases of zram that is not a pervasive or persuasive argument, I know, but it does seem slightly backwards. > If I don't miss something, we could have 2 choice. > > 1) use flush_tlb_kernel_range instead of local_flush_tlb_kernel_range > Or > 2) use only memory copy > > I guess everybody want 2 because it makes code very simple and maintainable. > Even, rencently Joonsoo sent optimize patch. > Look at https://lkml.org/lkml/2013/1/16/68 so zram/zcache effect caused by 2 > would be minimized. > > But please give me the time and I will borrow quad-core embedded target board > and test 1 on the phone with Seth's benchmark. In the meantime please take into account building a non-SMP kernel for this board and testing that; if there is a way to do the flush without using the particular function which uses the particular export that Russell will not export, then that would be better. Maybe for !CONFIG_SMP using flush_tlb_kernel_range is doing the exact same job and the real patch is not to disable the optimization with !CONFIG_SMP, but to additionally #if defined(CONFIG_SMP) around local_flush_tlb_kernel_range and alternatively for UP use flush_tlb_kernel_range which.. probably.. doesn't use that contentious export? This is far beyond the level I want to be digging around in the Linux kernel so I am not comfortable even trying that to find out.
On Fri, Jan 18, 2013 at 10:46 PM, Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> wrote: > On Fri, Jan 18, 2013 at 07:11:32PM -0600, Matt Sealey wrote: >> On Fri, Jan 18, 2013 at 3:08 PM, Russell King - ARM Linux >> >> I'm gonna put this out to the maintainers (Konrad, and Seth since he >> committed it) that if this code is buggy it gets taken back out, even >> if it makes zsmalloc "slow" on ARM, for the following reasons: > > Just to make sure I understand, you mean don't use page table > mapping but instead use copying? Yes, just back out the USE_PGTABLE_MAPPING code. But as I just replied with Minchan, maybe there is a better way. The only real problem here apart from the non-per-cpu usage Russell describes (which does not affect UP systems anyway) is that without CONFIG_SMP we have a FTBFS. However I am sure you agree on the odd fix of enabling pagetable mapping optimization only on "high end" systems and leaving the low end using the slow path. It *is* odd. Also, my rudimentary patch for disabling the code on !CONFIG_SMP is just *hiding* a misuse of a non-exported mm function... The FTBFS showed the problem, I don't want the fix to be to hide it, which is why I brought it up. >> * It's buggy on SMP as Russell describes above >> * It might not be buggy on UP (opposite to Russell's description above >> as the restrictions he states do not exist), but that would imply an >> export for a really core internal MM function nobody should be using >> anyway >> * By that assessment, using that core internal MM function on SMP is >> also bad voodoo that zsmalloc should not be doing > > 'local_tlb_flush' is bad voodoo? See previous mail to Minchan; local_tlb_flush_kernel_range calls cpu_tlb.flush_kernel_range on SMP, but a direct function call ("glue(_TLB, flush_kernel_range)" which resolves to v7wbi_flush_kernel_range etc. etc.) without CONFIG_SMP. That direct function call it resolves to is not an export and Russell just said he won't accept a patch to export it. > If you have an ARM server that you would be willing to part with I would > be thrilled to look at it. >> diff --git a/drivers/staging/zsmalloc/zsmalloc-main.c >> b/drivers/staging/zsmalloc/zsmalloc-main.c >> index 09a9d35..ecf75fb 100644 > > --- a/drivers/staging/zsmalloc/zsmalloc-main.c >> +++ b/drivers/staging/zsmalloc/zsmalloc-main.c >> @@ -228,7 +228,7 @@ struct zs_pool { >> * mapping rather than copying >> * for object mapping. >> */ >> -#if defined(CONFIG_ARM) >> +#if defined(CONFIG_ARM) && defined(CONFIG_SMP) >> #define USE_PGTABLE_MAPPING >> #endif >> >> .. such that it even compiles in both "guess" configurations, the >> slower Cortex-A8 600MHz single core system gets to use the slow copy >> path and the dual-core 1GHz+ Cortex-A9 (with twice the RAM..) gets to >> use the fast mapping path. Essentially all the patch does is "improve >> performance" on the fastest, best-configured, large-amounts-of-RAM, >> lots-of-CPU-performance ARM systems (OMAP4+, Snapdragon, Exynos4+, >> marvell armada, i.MX6..) while introducing the problems Russell >> describes, and leave performance exactly the same and potentially far >> more stable on the slower, memory-limited ARM machines. > > Any ideas on how to detect that? Nope, sorry. It would rely on knowing the precise configuration *and* the user intent of using zram which is far beyond the scope of zram or zsmalloc itself. It could be a VM, a phone with only 256MB RAM and slow storage, it could be a device with 2TB RAM but no fast local storage.. whether using it as a compressed block device or a swap device that is compressed, you can never know inside the driver what the real use it except the kernel build time config. All we can know in zram, at build time, is whether we're configuring for SMP, UP-on-SMP, or UP (!SMP) and which code path makes more sense to build (ideally, SMP and UP would run the pagetable mapping code alike). If we can make the pagetable mapping code compile on UP without the above patch being required, and it has the same effect, then this would be the best solution. Then the code needs to be fixed for proper operation on SMP anyway. If there are still a bunch of export problems with an alternate method of flushing the tlb for a range of kernel memory exposed by trying a different way around, this is just proving an issue here that the ARM guys disagree that things that can be built as modules should be doing such things, or that the cpu_tlb.flush_kernel_range vs. v7wbi_tlb_flush_kernel_range export thing is confusing as crap at the very least in that the CPU topology model the kernel lives by and is compiled for at build time causes build breakages if you don't want (or have) your driver to be knowledgable of the differences :) >> can build for SMP and UP and get the same code.. with less bugs. > > I get that you want to have this fixed right now. Somewhat. It is not urgent, since I fixed it "for now" in my tree, I am just worried that if that "fix" goes upstream it hides a real, more important issue here. I am mostly only concerned about whether zram as swap in any configuration causes any issues with performance on my system, as I would *like* to use it. We have noticed in the past (when it was compcache, and zram before the xsmalloc ->zsmalloc/zcache2 rewrite) that it did make a difference. Given the changes I just wanted to be sure. We have some acceptably performing storage but limited RAM, so there has to be a happy medium between reserving an in-memory compressed block device and an amount of real disk swap in the back, for use cases such as webservers with high number of connections, web browsers with high numbers of tabs, media playback and suchlike. Compression obviously takes CPU time, but swapping to disk blocks, so keeping the system feeling smooth from a UX point of view while allowing them to do a tiny bit more is what we're looking for. As I said, we saw great benefits a couple years ago, a year ago, I just wanted to pull some benchmarks from a current kernel. I was not under the impression, though, that this code would FTBFS on my preferred kernel configuration :D > fixed the right way is a better choice. Lets discuss that first > before we start tossing patches to disable parts of it. Agreed.
On Mon, Jan 21, 2013 at 10:20:38AM -0600, Matt Sealey wrote: > See previous mail to Minchan; local_tlb_flush_kernel_range calls > cpu_tlb.flush_kernel_range on SMP, but a direct function call > ("glue(_TLB, flush_kernel_range)" which resolves to > v7wbi_flush_kernel_range etc. etc.) without CONFIG_SMP. Actually, that's wrong - it's got nothing to do with SMP vs non-SMP. It's more to do with which CPUs are being supported. If they all use one single cache maintanence implementation, then direct calls are used as an optimization. If they require more than one cache maintanence implementation, they are indirect calls. SMP really doesn't come into that decision. So: > >> diff --git a/drivers/staging/zsmalloc/zsmalloc-main.c > >> b/drivers/staging/zsmalloc/zsmalloc-main.c > >> index 09a9d35..ecf75fb 100644 > > > --- a/drivers/staging/zsmalloc/zsmalloc-main.c > >> +++ b/drivers/staging/zsmalloc/zsmalloc-main.c > >> @@ -228,7 +228,7 @@ struct zs_pool { > >> * mapping rather than copying > >> * for object mapping. > >> */ > >> -#if defined(CONFIG_ARM) > >> +#if defined(CONFIG_ARM) && defined(CONFIG_SMP) Would be wrong.
> > The initial patch were done on x86. Then Seth did the work to make sure > > it worked on PPC. Munchin looked on ARM and that is it. > > s/Munchin/Minchan Thank you. I am sorry for butchering your name. > > > > > If you have an ARM server that you would be willing to part with I would > > be thrilled to look at it. > > > > > > > > diff --git a/drivers/staging/zsmalloc/zsmalloc-main.c > > > b/drivers/staging/zsmalloc/zsmalloc-main.c > > > index 09a9d35..ecf75fb 100644 > > > --- a/drivers/staging/zsmalloc/zsmalloc-main.c > > > +++ b/drivers/staging/zsmalloc/zsmalloc-main.c > > > @@ -228,7 +228,7 @@ struct zs_pool { > > > * mapping rather than copying > > > * for object mapping. > > > */ > > > -#if defined(CONFIG_ARM) > > > +#if defined(CONFIG_ARM) && defined(CONFIG_SMP) > > > #define USE_PGTABLE_MAPPING > > I don't get it. How to prevent the problem Russel described? > The problem is that other CPU can prefetch _speculatively_ under us. <nods> Not sure either. > > > > #endif > > > > > > .. such that it even compiles in both "guess" configurations, the > > > slower Cortex-A8 600MHz single core system gets to use the slow copy > > > path and the dual-core 1GHz+ Cortex-A9 (with twice the RAM..) gets to > > > use the fast mapping path. Essentially all the patch does is "improve > > > performance" on the fastest, best-configured, large-amounts-of-RAM, > > > lots-of-CPU-performance ARM systems (OMAP4+, Snapdragon, Exynos4+, > > > marvell armada, i.MX6..) while introducing the problems Russell > > > describes, and leave performance exactly the same and potentially far > > > more stable on the slower, memory-limited ARM machines. > > > > Any ideas on how to detect that? > > > > > > Given the purpose of zsmalloc, zram, zcache etc. this somewhat defies > > > logic. If it's not making the memory-limited, slow ARM systems run > > > better, what's the point? > > > > > > So in summary I suggest "we" (Greg? or is it Seth's responsibility?) > > > should just back out that whole USE_PGTABLE_MAPPING chunk of code > > > introduced with f553646. Then Russell can carry on randconfiging and I > > > can build for SMP and UP and get the same code.. with less bugs. > > > > I get that you want to have this fixed right now. I think having it > > fixed the right way is a better choice. Lets discuss that first > > before we start tossing patches to disable parts of it. > > If I don't miss something, we could have 2 choice. > > 1) use flush_tlb_kernel_range instead of local_flush_tlb_kernel_range > Or > 2) use only memory copy > > I guess everybody want 2 because it makes code very simple and maintainable. > Even, rencently Joonsoo sent optimize patch. > Look at https://lkml.org/lkml/2013/1/16/68 so zram/zcache effect caused by 2 > would be minimized. > > But please give me the time and I will borrow quad-core embedded target board > and test 1 on the phone with Seth's benchmark. > > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > Please read the FAQ at http://www.tux.org/lkml/ > > -- > Kind regards, > Minchan Kim
diff --git a/drivers/staging/zsmalloc/zsmalloc-main.c b/drivers/staging/zsmalloc/zsmalloc-main.c index 09a9d35..ecf75fb 100644 --- a/drivers/staging/zsmalloc/zsmalloc-main.c +++ b/drivers/staging/zsmalloc/zsmalloc-main.c @@ -228,7 +228,7 @@ struct zs_pool { * mapping rather than copying * for object mapping. */ -#if defined(CONFIG_ARM) +#if defined(CONFIG_ARM) && defined(CONFIG_SMP) #define USE_PGTABLE_MAPPING #endif