Message ID | 20170822125747.GB28024@arm.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Mon, Aug 28, 2017 at 09:53:00AM +0530, ankijain@codeaurora.org wrote: > Hi Will Deacon/ Al viro > > > -->Please find the attached kmsg.txt > <3>[17620.275249] BUG: sleeping function called from invalid context at /local/mnt/workspace/lnxbuild/project/trees_in_use/free_tree_platform_manifest_refs_tags_AU_LINUX_ANDROID_LA.UM.5.7.07.01.01.287.725_sdm660_64_commander_26168534/checkout/kernel/msm-4.4/arch/arm64/mm/fault.c:313 > <3>[17620.276504] in_atomic(): 0, irqs_disabled(): 0, pid: 10290, name: > stress-ng-dirde > <6>[17620.298995] ------------[ cut here ]------------ > <2>[17620.299009] kernel BUG at /local/mnt/workspace/lnxbuild/project/trees_in_use/free_tree_platform_manifest_refs_tags_AU_LINUX_ANDROID_LA.UM.5.7.07.01.01.287.725_sdm660_64_commander_26168534/checkout/kernel/msm-4.4/kernel/sched/core.c:8528! > <6>[17620.306372] ------------[ cut here ]------------ > <2>[17620.327239] kernel BUG at /local/mnt/workspace/lnxbuild/project/trees_in_use/free_tree_platform_manifest_refs_tags_AU_LINUX_ANDROID_LA.UM.5.7.07.01.01.287.725_sdm660_64_commander_26168534/checkout/kernel/msm-4.4/kernel/sched/core.c:8528! > > > --> we are using arm64 machine with kernel 4.4. > --> can you please guide us, how to capture ESR value while taking the > fault? > --> > - { do_page_fault, SIGSEGV, SEGV_MAPERR, "level 3 translation > fault" }, > + { do_translation_fault, SIGSEGV, SEGV_MAPERR, "level 3 > translation fault" }, > we will try with above changes and get back to you. > > -> config and kmsg are attached. > > Regards, > Ankit Jain > Qualcomm India Private Limited, on behalf of Qualcomm Innovation > Center, Inc. > Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a > Linux Foundation Collaborative Project Umm... Line numbers make no sense for 4.4. Could you post a reference to the actual tree used (repository + SHA1; again, it can't be vanilla 4.4, or stable/linux-4.4.y, for that matter) as well as your .config? In any case, looks like in_atomic() is false there, so we need an explicit pagefault_disable() to make sure it goes to no_context. Looking through the callchains... * __d_lookup() -> d_same_name() -> dentry_cmp() -> dentry_string_cmp() with rcu_read_lock() held by __d_lookup(). * d_alloc_parallel() -> d_same_name(), etc. rcu_read_lock() held by d_alloc_parallel() in one case, dentry->d_lock in another. * d_exact_alias() -> d_same_name(). inode->i_lock held by d_exact_alias(). * d_alloc_parallel() -> __d_lookup_rcu() -> dentry_cmp(). rcu_read_lock() held by d_alloc_parallel(). * lookup_fast() -> __d_lookup_rcu(), etc. rcu_read_lock() grabbed by path_init(). * full_name_hash(). Fuckloads. * hashlen_string(). Fewer, but... * link_path_walk() -> hash_name(). rcu_read_lock() held by path_init(). And then there's siphash(), but that one AFAICS should never see those faults. Hell knows... I'm somewhat tempted to slap pagefault_disable()/pagefault_enable() in dentry_string_cmp(), full_name_hash(), hashlen_string() and hash_name(). Regardless of the locks held by callers. Doing that in load_unaligned_zeropad() itself would be ridiculously costly, but these 4 would probably be saner... I still would like to see the details of config, though.
Hi Al Viro Could you please reply on below query. Are below error messages pointing to an issue which we can face later if we remove force panic? http://elixir.free-electrons.com/linux/v4.4.76/source/kernel/sched/core.c#L7605 http://elixir.free-electrons.com/linux/v4.4.76/source/kernel/sched/core.c#L7608 Regards, Ankit Jain On 2017-08-30 22:49, ankijain@codeaurora.org wrote: > Hi Al Viro > > Thanks for replying. > > We are using AOSP project tree. > You can refer http://elixir.free-electrons.com/linux/v4.4.76/source. > > http://elixir.free-electrons.com/linux/v4.4.76/source/arch/arm64/mm/fault.c#L302 > (might_sleep()) > > http://elixir.free-electrons.com/linux/v4.4.76/source/kernel/sched/core.c#L7592 > (___might_sleep()) > > Panic is added forcefully in our code after > http://elixir.free-electrons.com/linux/v4.4.76/source/kernel/sched/core.c#L7625 > . > > we have a query: > Are below error messages pointing to an issue which we can face later > if we remove force panic? > http://elixir.free-electrons.com/linux/v4.4.76/source/kernel/sched/core.c#L7605 > http://elixir.free-electrons.com/linux/v4.4.76/source/kernel/sched/core.c#L7608 > > > we will retest after removing the force panic and update you if any > issue occurs. > config file is attached. > > Regards, > Ankit Jain > Qualcomm India Private Limited, on behalf of Qualcomm Innovation > Center, Inc. > Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a > Linux Foundation Collaborative Project > > On 2017-08-28 11:50, Al Viro wrote: >> On Mon, Aug 28, 2017 at 09:53:00AM +0530, ankijain@codeaurora.org >> wrote: >>> Hi Will Deacon/ Al viro >>> >>> >>> -->Please find the attached kmsg.txt >>> <3>[17620.275249] BUG: sleeping function called from invalid context >>> at >>> /local/mnt/workspace/lnxbuild/project/trees_in_use/free_tree_platform_manifest_refs_tags_AU_LINUX_ANDROID_LA.UM.5.7.07.01.01.287.725_sdm660_64_commander_26168534/checkout/kernel/msm-4.4/arch/arm64/mm/fault.c:313 >>> <3>[17620.276504] in_atomic(): 0, irqs_disabled(): 0, pid: 10290, >>> name: >>> stress-ng-dirde >>> <6>[17620.298995] ------------[ cut here ]------------ >>> <2>[17620.299009] kernel BUG at >>> /local/mnt/workspace/lnxbuild/project/trees_in_use/free_tree_platform_manifest_refs_tags_AU_LINUX_ANDROID_LA.UM.5.7.07.01.01.287.725_sdm660_64_commander_26168534/checkout/kernel/msm-4.4/kernel/sched/core.c:8528! >>> <6>[17620.306372] ------------[ cut here ]------------ >>> <2>[17620.327239] kernel BUG at >>> /local/mnt/workspace/lnxbuild/project/trees_in_use/free_tree_platform_manifest_refs_tags_AU_LINUX_ANDROID_LA.UM.5.7.07.01.01.287.725_sdm660_64_commander_26168534/checkout/kernel/msm-4.4/kernel/sched/core.c:8528! >>> >>> >>> --> we are using arm64 machine with kernel 4.4. >>> --> can you please guide us, how to capture ESR value while taking >>> the >>> fault? >>> --> >>> - { do_page_fault, SIGSEGV, SEGV_MAPERR, "level 3 >>> translation >>> fault" }, >>> + { do_translation_fault, SIGSEGV, SEGV_MAPERR, "level 3 >>> translation fault" }, >>> we will try with above changes and get back to you. >>> >>> -> config and kmsg are attached. >>> >>> Regards, >>> Ankit Jain >>> Qualcomm India Private Limited, on behalf of Qualcomm Innovation >>> Center, Inc. >>> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a >>> Linux Foundation Collaborative Project >> >> Umm... Line numbers make no sense for 4.4. Could you post a >> reference >> to the actual tree used (repository + SHA1; again, it can't be vanilla >> 4.4, or stable/linux-4.4.y, for that matter) as well as your .config? >> >> In any case, looks like in_atomic() is false there, so we need an >> explicit >> pagefault_disable() to make sure it goes to no_context. >> >> Looking through the callchains... >> * __d_lookup() -> d_same_name() -> dentry_cmp() -> >> dentry_string_cmp() >> with rcu_read_lock() held by __d_lookup(). >> * d_alloc_parallel() -> d_same_name(), etc. rcu_read_lock() held by >> d_alloc_parallel() in one case, dentry->d_lock in another. >> * d_exact_alias() -> d_same_name(). inode->i_lock held by >> d_exact_alias(). >> * d_alloc_parallel() -> __d_lookup_rcu() -> dentry_cmp(). >> rcu_read_lock() held by d_alloc_parallel(). >> * lookup_fast() -> __d_lookup_rcu(), etc. rcu_read_lock() grabbed by >> path_init(). >> * full_name_hash(). Fuckloads. >> * hashlen_string(). Fewer, but... >> * link_path_walk() -> hash_name(). rcu_read_lock() held by >> path_init(). >> >> And then there's siphash(), but that one AFAICS should never see those >> faults. >> >> Hell knows... I'm somewhat tempted to slap >> pagefault_disable()/pagefault_enable() >> in dentry_string_cmp(), full_name_hash(), hashlen_string() and >> hash_name(). >> Regardless of the locks held by callers. Doing that in >> load_unaligned_zeropad() >> itself would be ridiculously costly, but these 4 would probably be >> saner... >> >> I still would like to see the details of config, though.
On Mon, Aug 28, 2017 at 09:53:00AM +0530, ankijain@codeaurora.org wrote: > Hi Will Deacon/ Al viro > > > -->Please find the attached kmsg.txt > <3>[17620.275249] BUG: sleeping function called from invalid context at /local/mnt/workspace/lnxbuild/project/trees_in_use/free_tree_platform_manifest_refs_tags_AU_LINUX_ANDROID_LA.UM.5.7.07.01.01.287.725_sdm660_64_commander_26168534/checkout/kernel/msm-4.4/arch/arm64/mm/fault.c:313 > <3>[17620.276504] in_atomic(): 0, irqs_disabled(): 0, pid: 10290, name: > stress-ng-dirde > <6>[17620.298995] ------------[ cut here ]------------ > <2>[17620.299009] kernel BUG at /local/mnt/workspace/lnxbuild/project/trees_in_use/free_tree_platform_manifest_refs_tags_AU_LINUX_ANDROID_LA.UM.5.7.07.01.01.287.725_sdm660_64_commander_26168534/checkout/kernel/msm-4.4/kernel/sched/core.c:8528! > <6>[17620.306372] ------------[ cut here ]------------ > <2>[17620.327239] kernel BUG at /local/mnt/workspace/lnxbuild/project/trees_in_use/free_tree_platform_manifest_refs_tags_AU_LINUX_ANDROID_LA.UM.5.7.07.01.01.287.725_sdm660_64_commander_26168534/checkout/kernel/msm-4.4/kernel/sched/core.c:8528! > > > --> we are using arm64 machine with kernel 4.4. > --> can you please guide us, how to capture ESR value while taking the > fault? > --> > - { do_page_fault, SIGSEGV, SEGV_MAPERR, "level 3 translation > fault" }, > + { do_translation_fault, SIGSEGV, SEGV_MAPERR, "level 3 > translation fault" }, > we will try with above changes and get back to you. Did you test with this change? Will
Hi Will >> - { do_page_fault, SIGSEGV, SEGV_MAPERR, "level 3 translation >> fault" }, >> + { do_translation_fault, SIGSEGV, SEGV_MAPERR, "level 3 >> translation fault" }, >> we will try with above changes and get back to you. we didn't try yet with above changes. we will start the test soon and update you. Regards, Ankit Jain Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc. Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project On 2017-09-13 01:56, Will Deacon wrote: > On Mon, Aug 28, 2017 at 09:53:00AM +0530, ankijain@codeaurora.org > wrote: >> Hi Will Deacon/ Al viro >> >> >> -->Please find the attached kmsg.txt >> <3>[17620.275249] BUG: sleeping function called from invalid context >> at >> /local/mnt/workspace/lnxbuild/project/trees_in_use/free_tree_platform_manifest_refs_tags_AU_LINUX_ANDROID_LA.UM.5.7.07.01.01.287.725_sdm660_64_commander_26168534/checkout/kernel/msm-4.4/arch/arm64/mm/fault.c:313 >> <3>[17620.276504] in_atomic(): 0, irqs_disabled(): 0, pid: 10290, >> name: >> stress-ng-dirde >> <6>[17620.298995] ------------[ cut here ]------------ >> <2>[17620.299009] kernel BUG at >> /local/mnt/workspace/lnxbuild/project/trees_in_use/free_tree_platform_manifest_refs_tags_AU_LINUX_ANDROID_LA.UM.5.7.07.01.01.287.725_sdm660_64_commander_26168534/checkout/kernel/msm-4.4/kernel/sched/core.c:8528! >> <6>[17620.306372] ------------[ cut here ]------------ >> <2>[17620.327239] kernel BUG at >> /local/mnt/workspace/lnxbuild/project/trees_in_use/free_tree_platform_manifest_refs_tags_AU_LINUX_ANDROID_LA.UM.5.7.07.01.01.287.725_sdm660_64_commander_26168534/checkout/kernel/msm-4.4/kernel/sched/core.c:8528! >> >> >> --> we are using arm64 machine with kernel 4.4. >> --> can you please guide us, how to capture ESR value while taking the >> fault? >> --> >> - { do_page_fault, SIGSEGV, SEGV_MAPERR, "level 3 translation >> fault" }, >> + { do_translation_fault, SIGSEGV, SEGV_MAPERR, "level 3 >> translation fault" }, >> we will try with above changes and get back to you. > > Did you test with this change? > > Will
Hi will Unfortunately issue path is not getting hit with our released binary(without your chnage) also with your suggested changes. so we cant confirm that your suggested changes are working or not. We will come back to you once will face issue again. Regards, Ankit Jain On 2017-09-13 22:35, ankijain@codeaurora.org wrote: > Hi Will > > > >>> - { do_page_fault, SIGSEGV, SEGV_MAPERR, "level 3 >>> translation >>> fault" }, >>> + { do_translation_fault, SIGSEGV, SEGV_MAPERR, "level 3 >>> translation fault" }, >>> we will try with above changes and get back to you. > > we didn't try yet with above changes. > we will start the test soon and update you. > > Regards, > Ankit Jain > Qualcomm India Private Limited, on behalf of Qualcomm Innovation > Center, Inc. > Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a > Linux Foundation Collaborative Project > > > > On 2017-09-13 01:56, Will Deacon wrote: >> On Mon, Aug 28, 2017 at 09:53:00AM +0530, ankijain@codeaurora.org >> wrote: >>> Hi Will Deacon/ Al viro >>> >>> >>> -->Please find the attached kmsg.txt >>> <3>[17620.275249] BUG: sleeping function called from invalid context >>> at >>> /local/mnt/workspace/lnxbuild/project/trees_in_use/free_tree_platform_manifest_refs_tags_AU_LINUX_ANDROID_LA.UM.5.7.07.01.01.287.725_sdm660_64_commander_26168534/checkout/kernel/msm-4.4/arch/arm64/mm/fault.c:313 >>> <3>[17620.276504] in_atomic(): 0, irqs_disabled(): 0, pid: 10290, >>> name: >>> stress-ng-dirde >>> <6>[17620.298995] ------------[ cut here ]------------ >>> <2>[17620.299009] kernel BUG at >>> /local/mnt/workspace/lnxbuild/project/trees_in_use/free_tree_platform_manifest_refs_tags_AU_LINUX_ANDROID_LA.UM.5.7.07.01.01.287.725_sdm660_64_commander_26168534/checkout/kernel/msm-4.4/kernel/sched/core.c:8528! >>> <6>[17620.306372] ------------[ cut here ]------------ >>> <2>[17620.327239] kernel BUG at >>> /local/mnt/workspace/lnxbuild/project/trees_in_use/free_tree_platform_manifest_refs_tags_AU_LINUX_ANDROID_LA.UM.5.7.07.01.01.287.725_sdm660_64_commander_26168534/checkout/kernel/msm-4.4/kernel/sched/core.c:8528! >>> >>> >>> --> we are using arm64 machine with kernel 4.4. >>> --> can you please guide us, how to capture ESR value while taking >>> the >>> fault? >>> --> >>> - { do_page_fault, SIGSEGV, SEGV_MAPERR, "level 3 >>> translation >>> fault" }, >>> + { do_translation_fault, SIGSEGV, SEGV_MAPERR, "level 3 >>> translation fault" }, >>> we will try with above changes and get back to you. >> >> Did you test with this change? >> >> Will
[looking through the old mail] On Tue, Sep 12, 2017 at 09:26:16PM +0100, Will Deacon wrote: > > - { do_page_fault, SIGSEGV, SEGV_MAPERR, "level 3 translation > > fault" }, > > + { do_translation_fault, SIGSEGV, SEGV_MAPERR, "level 3 > > translation fault" }, > > we will try with above changes and get back to you. > > Did you test with this change? FWIW, while that went in as commit 760bfb47c36a ("arm64: fault: Route pte translation faults via do_translation_fault"), I wonder if the same issue exists on arm. It looks like the pagefault handler there is fairly similar to arm64 one and the same shortcut is present there. The more I'm looking at that, the more it looks like we *really* need a comment in all instances of load_unaligned_zeropad() warning about that pitfall. Something like /* * Load an unaligned word from kernel space. * * In the (very unlikely) case of the word being a page-crosser * and the next page not being mapped, take the exception and * return zeroes in the non-existing part. * * NOTE: this relies upon the pagefault handler *NOT* blocking * in such situation (fault in kernel mode on kernel address with * exception fixup present). Verify that for your architecture * before using an equivalent of this approach. Note that * you can't count upon faulthandler_disabled() saving you; * this function can be called e.g. under a spinlock on non-preempt * kernels without pagefault_disable() done by caller. */ perhaps. That property holds on x86, ppc and (now) arm64, but as arm64 case shows, it might not be true for other architectures. As the matter of fact, e.g. sparc64 (which will not use that thing for obvious reasons anyway) it is *not* true, etc.
On Sat, Nov 04, 2017 at 12:17:57AM +0000, Al Viro wrote: > [looking through the old mail] > > On Tue, Sep 12, 2017 at 09:26:16PM +0100, Will Deacon wrote: > > > - { do_page_fault, SIGSEGV, SEGV_MAPERR, "level 3 translation > > > fault" }, > > > + { do_translation_fault, SIGSEGV, SEGV_MAPERR, "level 3 > > > translation fault" }, > > > we will try with above changes and get back to you. > > > > Did you test with this change? > > FWIW, while that went in as commit 760bfb47c36a ("arm64: fault: Route pte > translation faults via do_translation_fault"), I wonder if the same issue > exists on arm. It looks like the pagefault handler there is fairly > similar to arm64 one and the same shortcut is present there. So we never actually nailed this one down on arm64 (the problem mysteriously disappeared iirc, and I failed to reproduce it on my systems) but I fixed it anyway because it looked at least theoretically possible. In that case, yes, I think ARM needs the same sort of fixes for both 2 and 3 level tables where the handler branch directly to do_page_fault. > The more I'm looking at that, the more it looks like we *really* need > a comment in all instances of load_unaligned_zeropad() warning about > that pitfall. Something like > /* > * Load an unaligned word from kernel space. > * > * In the (very unlikely) case of the word being a page-crosser > * and the next page not being mapped, take the exception and > * return zeroes in the non-existing part. > * > * NOTE: this relies upon the pagefault handler *NOT* blocking > * in such situation (fault in kernel mode on kernel address with > * exception fixup present). Verify that for your architecture > * before using an equivalent of this approach. Note that > * you can't count upon faulthandler_disabled() saving you; > * this function can be called e.g. under a spinlock on non-preempt > * kernels without pagefault_disable() done by caller. > */ > perhaps. That property holds on x86, ppc and (now) arm64, but as > arm64 case shows, it might not be true for other architectures. > As the matter of fact, e.g. sparc64 (which will not use that > thing for obvious reasons anyway) it is *not* true, etc. I wonder if there's any mileage in a test module for this? That code rarely (if ever) gets run in practice. Will
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c index 2509e4fe6992..aa6c4a9e1113 100644 --- a/arch/arm64/mm/fault.c +++ b/arch/arm64/mm/fault.c @@ -611,7 +611,7 @@ static const struct fault_info fault_info[] = { { do_translation_fault, SIGSEGV, SEGV_MAPERR, "level 0 translation fault" }, { do_translation_fault, SIGSEGV, SEGV_MAPERR, "level 1 translation fault" }, { do_translation_fault, SIGSEGV, SEGV_MAPERR, "level 2 translation fault" }, - { do_page_fault, SIGSEGV, SEGV_MAPERR, "level 3 translation fault" }, + { do_translation_fault, SIGSEGV, SEGV_MAPERR, "level 3 translation fault" }, { do_bad, SIGBUS, 0, "unknown 8" }, { do_page_fault, SIGSEGV, SEGV_ACCERR, "level 1 access flag fault" }, { do_page_fault, SIGSEGV, SEGV_ACCERR, "level 2 access flag fault" },