Message ID | 156159454541.2964018.7466991316059381921.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)
---|---
State | New, archived
Series | filesystem-dax: Disable PMD support
On Wed, Jun 26, 2019 at 05:15:45PM -0700, Dan Williams wrote:
> Ever since the conversion of DAX to the Xarray a RocksDB benchmark has
> been encountering intermittent lockups. The backtraces always include
> the filesystem-DAX PMD path, multi-order entries have been a source of
> bugs in the past, and disabling the PMD path allows a test that fails in
> minutes to run for an hour.

On May 4th, I asked you:

    Since this is provoked by a fatal signal, it must have something to do
    with a killable or interruptible sleep. There's only one of those in the
    DAX code; fatal_signal_pending() in dax_iomap_actor(). Does rocksdb do
    I/O with write() or through a writable mmap()? I'd like to know before
    I chase too far down this fault tree analysis.
On Thu, Jun 27, 2019 at 5:34 AM Matthew Wilcox <willy@infradead.org> wrote:
> Since this is provoked by a fatal signal, it must have something to do
> with a killable or interruptible sleep. There's only one of those in the
> DAX code; fatal_signal_pending() in dax_iomap_actor(). Does rocksdb do
> I/O with write() or through a writable mmap()? I'd like to know before
> I chase too far down this fault tree analysis.

RocksDB in this case is using write() for writes and mmap() for reads.
On Thu, Jun 27, 2019 at 9:06 AM Dan Williams <dan.j.williams@intel.com> wrote:
> RocksDB in this case is using write() for writes and mmap() for reads.

It's not clear to me that a fatal signal is a component of the failure
as much as it's the way to detect that the benchmark has indeed locked
up.
On Thu, Jun 27, 2019 at 11:29 AM Dan Williams <dan.j.williams@intel.com> wrote:
> It's not clear to me that a fatal signal is a component of the failure
> as much as it's the way to detect that the benchmark has indeed locked
> up.

Even though db_bench is run with the mmap_read=1 option:

    cmd="${rocksdb_dir}/db_bench $params_r --benchmarks=readwhilewriting \
        --use_existing_db=1 \
        --mmap_read=1 \
        --num=$num_keys \
        --threads=$num_read_threads \

...when the lockup occurs there are db_bench processes in the write fault path:

[ 1666.635212] db_bench        D    0  2492   2435 0x00000000
[ 1666.641339] Call Trace:
[ 1666.644072]  ? __schedule+0x24f/0x680
[ 1666.648162]  ? __switch_to_asm+0x34/0x70
[ 1666.652545]  schedule+0x29/0x90
[ 1666.656054]  get_unlocked_entry+0xcd/0x120
[ 1666.660629]  ? dax_iomap_actor+0x270/0x270
[ 1666.665206]  grab_mapping_entry+0x14f/0x230
[ 1666.669878]  dax_iomap_pmd_fault.isra.42+0x14d/0x950
[ 1666.675425]  ? futex_wait+0x122/0x230
[ 1666.679518]  ext4_dax_huge_fault+0x16f/0x1f0
[ 1666.684288]  __handle_mm_fault+0x411/0x1350
[ 1666.688961]  ? do_futex+0xca/0xbb0
[ 1666.692760]  ? __switch_to_asm+0x34/0x70
[ 1666.697144]  handle_mm_fault+0xbe/0x1e0
[ 1666.701429]  __do_page_fault+0x249/0x4f0
[ 1666.705811]  do_page_fault+0x32/0x110
[ 1666.709903]  ? page_fault+0x8/0x30
[ 1666.713702]  page_fault+0x1e/0x30

...where __handle_mm_fault+0x411 is in wp_huge_pmd():

(gdb) li *(__handle_mm_fault+0x411)
0xffffffff812713d1 is in __handle_mm_fault (mm/memory.c:3800).
3795    static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
3796    {
3797            if (vma_is_anonymous(vmf->vma))
3798                    return do_huge_pmd_wp_page(vmf, orig_pmd);
3799            if (vmf->vma->vm_ops->huge_fault)
3800                    return vmf->vma->vm_ops->huge_fault(vmf, PE_SIZE_PMD);
3801
3802            /* COW handled on pte level: split pmd */
3803            VM_BUG_ON_VMA(vmf->vma->vm_flags & VM_SHARED, vmf->vma);
3804            __split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);

This bug feels like we failed to unlock, or unlocked the wrong entry,
and this hunk in the bisected commit looks suspect to me. Why do we
still need to drop the lock now that the radix_tree_preload() calls
are gone?

         /*
          * Besides huge zero pages the only other thing that gets
          * downgraded are empty entries which don't need to be
          * unmapped.
          */
-        if (pmd_downgrade && dax_is_zero_entry(entry))
-                unmap_mapping_pages(mapping, index & ~PG_PMD_COLOUR,
-                                PG_PMD_NR, false);
-
-        err = radix_tree_preload(
-                        mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
-        if (err) {
-                if (pmd_downgrade)
-                        put_locked_mapping_entry(mapping, index);
-                return ERR_PTR(err);
-        }
-        xa_lock_irq(&mapping->i_pages);
-
-        if (!entry) {
-                /*
-                 * We needed to drop the i_pages lock while calling
-                 * radix_tree_preload() and we didn't have an entry to
-                 * lock.  See if another thread inserted an entry at
-                 * our index during this time.
-                 */
-                entry = __radix_tree_lookup(&mapping->i_pages, index,
-                                NULL, &slot);
-                if (entry) {
-                        radix_tree_preload_end();
-                        xa_unlock_irq(&mapping->i_pages);
-                        goto restart;
-                }
+        if (dax_is_zero_entry(entry)) {
+                xas_unlock_irq(xas);
+                unmap_mapping_pages(mapping,
+                                xas->xa_index & ~PG_PMD_COLOUR,
+                                PG_PMD_NR, false);
+                xas_reset(xas);
+                xas_lock_irq(xas);
         }

-        if (pmd_downgrade) {
-                dax_disassociate_entry(entry, mapping, false);
-                radix_tree_delete(&mapping->i_pages, index);
-                mapping->nrexceptional--;
-                dax_wake_mapping_entry_waiter(&mapping->i_pages,
-                                index, entry, true);
-        }
+        dax_disassociate_entry(entry, mapping, false);
+        xas_store(xas, NULL);        /* undo the PMD join */
+        dax_wake_entry(xas, entry, true);
+        mapping->nrexceptional--;
On Thu, Jun 27, 2019 at 11:58 AM Dan Williams <dan.j.williams@intel.com> wrote:
> This bug feels like we failed to unlock, or unlocked the wrong entry,
> and this hunk in the bisected commit looks suspect to me. Why do we
> still need to drop the lock now that the radix_tree_preload() calls
> are gone?

Nevermind, unmap_mapping_pages() takes a sleeping lock, but then I
wonder why we don't restart the lookup like the old implementation.
On Thu, Jun 27, 2019 at 12:09:29PM -0700, Dan Williams wrote:
> > This bug feels like we failed to unlock, or unlocked the wrong entry,
> > and this hunk in the bisected commit looks suspect to me. Why do we
> > still need to drop the lock now that the radix_tree_preload() calls
> > are gone?
>
> Nevermind, unmap_mapping_pages() takes a sleeping lock, but then I
> wonder why we don't restart the lookup like the old implementation.

We have the entry locked:

        /*
         * Make sure 'entry' remains valid while we drop
         * the i_pages lock.
         */
        dax_lock_entry(xas, entry);

        /*
         * Besides huge zero pages the only other thing that gets
         * downgraded are empty entries which don't need to be
         * unmapped.
         */
        if (dax_is_zero_entry(entry)) {
                xas_unlock_irq(xas);
                unmap_mapping_pages(mapping,
                                xas->xa_index & ~PG_PMD_COLOUR,
                                PG_PMD_NR, false);
                xas_reset(xas);
                xas_lock_irq(xas);
        }

If something can remove a locked entry, then that would seem like the
real bug. Might be worth inserting a lookup there to make sure that it
hasn't happened, I suppose?
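[A minimal sketch of the kind of check Matthew is suggesting; this is
hypothetical — the thread does not show the exact check Dan added. It
relies on the v5.2-era fs/dax.c detail that dax_lock_entry() stores the
entry with the DAX_LOCKED bit set:]

        if (dax_is_zero_entry(entry)) {
                xas_unlock_irq(xas);
                unmap_mapping_pages(mapping,
                                xas->xa_index & ~PG_PMD_COLOUR,
                                PG_PMD_NR, false);
                xas_reset(xas);
                xas_lock_irq(xas);
                /*
                 * Hypothetical debug check: the entry was stored locked
                 * before the lock was dropped, so a fresh walk should
                 * return exactly that locked value.
                 */
                WARN_ON_ONCE(xas_load(xas) !=
                             xa_mk_value(xa_to_value(entry) | DAX_LOCKED));
        }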
On Thu, Jun 27, 2019 at 12:59 PM Matthew Wilcox <willy@infradead.org> wrote:
> If something can remove a locked entry, then that would seem like the
> real bug. Might be worth inserting a lookup there to make sure that it
> hasn't happened, I suppose?

Nope, added a check, we do in fact get the same locked entry back
after dropping the lock.

The deadlock revolves around the mmap_sem. One thread holds it for
read and then gets stuck indefinitely in get_unlocked_entry(). Once
that happens another rocksdb thread tries to mmap and gets stuck
trying to take the mmap_sem for write. Then all new readers, including
ps and top that try to access a remote vma, get queued behind that
write.

It could also be the case that we're missing a wake up.
On Thu, Jun 27, 2019 at 07:39:37PM -0700, Dan Williams wrote:
> Nope, added a check, we do in fact get the same locked entry back
> after dropping the lock.

Okay, good, glad to have ruled that out.

> The deadlock revolves around the mmap_sem. One thread holds it for
> read and then gets stuck indefinitely in get_unlocked_entry().
>
> It could also be the case that we're missing a wake up.

That was the conclusion I came to; that one thread holding the mmap sem
for read isn't being woken up when it should be. Just need to find it ...
obviously it's something to do with the PMD entries.
On Fri, Jun 28, 2019 at 9:37 AM Matthew Wilcox <willy@infradead.org> wrote:
> That was the conclusion I came to; that one thread holding the mmap sem
> for read isn't being woken up when it should be. Just need to find it ...
> obviously it's something to do with the PMD entries.

Can you explain to me one more time, yes I'm slow on the uptake on
this, the difference between xas_load() and xas_find_conflict() and
why it's ok for dax_lock_page() to use xas_load() while
grab_mapping_entry() uses xas_find_conflict()?
On Fri, Jun 28, 2019 at 09:39:01AM -0700, Dan Williams wrote:
> Can you explain to me one more time, yes I'm slow on the uptake on
> this, the difference between xas_load() and xas_find_conflict() and
> why it's ok for dax_lock_page() to use xas_load() while
> grab_mapping_entry() uses xas_find_conflict()?

When used with a non-zero 'order', xas_find_conflict() will return an
entry whereas xas_load() might return a pointer to a node.
dax_lock_page() always uses a zero order, so they would always do the
same thing (xas_find_conflict() would be less efficient).
On Thu, Jun 27, 2019 at 07:39:37PM -0700, Dan Williams wrote:
> The deadlock revolves around the mmap_sem. One thread holds it for
> read and then gets stuck indefinitely in get_unlocked_entry(). Once
> that happens another rocksdb thread tries to mmap and gets stuck
> trying to take the mmap_sem for write.
>
> It could also be the case that we're missing a wake up.

OK, I have a Theory.

get_unlocked_entry() doesn't check the size of the entry being waited
for. So dax_iomap_pmd_fault() can end up sleeping waiting for a PTE
entry, which is (a) foolish, because we know it's going to fall back,
and (b) can lead to a missed wakeup because it's going to sleep waiting
for the PMD entry to come unlocked. Which it won't, unless there's a
happy accident that happens to map to the same hash bucket.

Let's see if I can steal some time this weekend to whip up a patch.
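[For reference, the wait loop under discussion looked roughly like this
at the time — reconstructed from the v5.2-era fs/dax.c, so treat it as
a sketch rather than a verbatim quote. Note there is no comparison of
the found entry's order against the order the caller is faulting on:]

static void *get_unlocked_entry(struct xa_state *xas)
{
        void *entry;
        struct wait_exceptional_entry_queue ewait;
        wait_queue_head_t *wq;

        init_wait(&ewait.wait);
        ewait.wait.func = wake_exceptional_entry_func;

        for (;;) {
                entry = xas_find_conflict(xas);
                if (!entry || WARN_ON_ONCE(!xa_is_value(entry)) ||
                                !dax_is_locked(entry))
                        return entry;

                /*
                 * No order check here: a PMD fault that finds a locked
                 * PTE entry sleeps on it all the same, keyed off
                 * xas->xa_index (which is PMD-aligned for a PMD fault).
                 */
                wq = dax_entry_waitqueue(xas, entry, &ewait.key);
                prepare_to_wait_exclusive(wq, &ewait.wait,
                                          TASK_UNINTERRUPTIBLE);
                xas_unlock_irq(xas);
                xas_reset(xas);
                schedule();
                finish_wait(wq, &ewait.wait);
                xas_lock_irq(xas);
        }
}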
On Sat, Jun 29, 2019 at 9:03 AM Matthew Wilcox <willy@infradead.org> wrote:
> OK, I have a Theory.
>
> get_unlocked_entry() doesn't check the size of the entry being waited for.
> So dax_iomap_pmd_fault() can end up sleeping waiting for a PTE entry,
> which is (a) foolish, because we know it's going to fall back, and (b)
> can lead to a missed wakeup because it's going to sleep waiting for
> the PMD entry to come unlocked.
>
> Let's see if I can steal some time this weekend to whip up a patch.

Theory seems to have some evidence... I instrumented fs/dax.c to track
outstanding 'lock' entries and 'wait' events. At the time of the hang
we see no locks held and the waiter is waiting on a pmd entry:

[ 4001.354334] fs/dax locked entries: 0
[ 4001.358425] fs/dax wait entries: 1
[ 4001.362227] db_bench/2445 index: 0x0 shift: 6
[ 4001.367099]  grab_mapping_entry+0x17a/0x260
[ 4001.371773]  dax_iomap_pmd_fault.isra.43+0x168/0x7a0
[ 4001.377316]  ext4_dax_huge_fault+0x16f/0x1f0
[ 4001.382086]  __handle_mm_fault+0x411/0x1390
[ 4001.386756]  handle_mm_fault+0x172/0x360
On Sun, Jun 30, 2019 at 12:27 AM Dan Williams <dan.j.williams@intel.com> wrote:
> Theory seems to have some evidence... I instrumented fs/dax.c to track
> outstanding 'lock' entries and 'wait' events. At the time of the hang
> we see no locks held and the waiter is waiting on a pmd entry:
>
> [ 4001.354334] fs/dax locked entries: 0
> [ 4001.358425] fs/dax wait entries: 1
> [ 4001.362227] db_bench/2445 index: 0x0 shift: 6

In fact, this naive fix is holding up so far:

@@ -215,7 +216,7 @@ static wait_queue_head_t *dax_entry_waitqueue(struct xa_state *xas,
          * queue to the start of that PMD.  This ensures that all offsets in
          * the range covered by the PMD map to the same bit lock.
          */
-        if (dax_is_pmd_entry(entry))
+        //if (dax_is_pmd_entry(entry))
                 index &= ~PG_PMD_COLOUR;
         key->xa = xas->xa;
         key->entry_start = index;
On Sun, Jun 30, 2019 at 01:01:04AM -0700, Dan Williams wrote:
> In fact, this naive fix is holding up so far:
>
> @@ -215,7 +216,7 @@ static wait_queue_head_t *dax_entry_waitqueue(struct xa_state *xas,
>           * queue to the start of that PMD.  This ensures that all offsets in
>           * the range covered by the PMD map to the same bit lock.
>           */
> -        if (dax_is_pmd_entry(entry))
> +        //if (dax_is_pmd_entry(entry))
>                  index &= ~PG_PMD_COLOUR;
>          key->xa = xas->xa;
>          key->entry_start = index;

Hah, that's a great naive fix! Thanks for trying that out.

I think my theory was slightly mistaken, but your fix has the effect of
fixing the actual problem too.

The xas->xa_index for a PMD is going to be PMD-aligned (ie a multiple of
512), but xas_find_conflict() does _not_ adjust xa_index (... which I
really should have mentioned in the documentation). So we go to sleep
on the PMD-aligned index instead of the index of the PTE. Your patch
fixes this by using the PMD-aligned index for PTEs too.

I'm trying to come up with a clean fix for this. Clearly we shouldn't
wait for a PTE entry if we're looking for a PMD entry. But what should
get_unlocked_entry() return if it detects that case? We could have it
return an error code encoded as an internal entry, like
grab_mapping_entry() does. Or we could have it return the _locked_ PTE
entry, and have callers interpret that.

At least get_unlocked_entry() is static, but it's got quite a few
callers. Trying to discern which ones might ask for a PMD entry is a
bit tricky. So this seems like a large patch which might have bugs.

Thoughts?
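[To make the mismatch concrete, here is a stand-alone user-space model
of the pre-fix key derivation. The constants mirror fs/dax.c but the
hash is a simplified stand-in for the kernel's hash_long(), and a
64-bit unsigned long is assumed:]

#include <stdio.h>
#include <stdbool.h>

#define PG_PMD_COLOUR   511UL   /* (PMD_SIZE / PAGE_SIZE) - 1 on x86-64 */
#define WAIT_TABLE_BITS 8       /* fs/dax.c uses 256 waitqueues */

/* Crude stand-in for hash_long(); same golden-ratio multiplier */
static unsigned long bucket(unsigned long key)
{
        return (key * 0x61C8864680B583EBUL) >> (64 - WAIT_TABLE_BITS);
}

/*
 * Key derivation as in dax_entry_waitqueue() before the fix: the index
 * is only folded to the PMD start when the entry itself is a PMD.
 */
static unsigned long wait_key(unsigned long xa_index, bool pmd_entry)
{
        if (pmd_entry)
                xa_index &= ~PG_PMD_COLOUR;
        return xa_index;
}

int main(void)
{
        /*
         * The PMD faulter sleeps keyed on its PMD-aligned xa_index
         * (0x200): xas_find_conflict() found a PTE entry (so no
         * masking) but did not move xa_index to the PTE's slot...
         */
        unsigned long sleeper = wait_key(0x200, false);
        /*
         * ...while the thread unlocking that PTE entry (at 0x205)
         * wakes with the PTE's own index.
         */
        unsigned long waker = wait_key(0x205, false);

        printf("sleeper: key %#lx bucket %lu\n", sleeper, bucket(sleeper));
        printf("waker:   key %#lx bucket %lu\n", waker, bucket(waker));
        /*
         * The keys differ, so even in the same hash bucket the wake
         * function's key comparison fails: a missed wakeup. Masking
         * unconditionally (the naive fix) folds both to 0x200.
         */
        return 0;
}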
On Sun, Jun 30, 2019 at 8:23 AM Matthew Wilcox <willy@infradead.org> wrote:
> I'm trying to come up with a clean fix for this. Clearly we shouldn't
> wait for a PTE entry if we're looking for a PMD entry. But what should
> get_unlocked_entry() return if it detects that case?
>
> At least get_unlocked_entry() is static, but it's got quite a few
> callers. Trying to discern which ones might ask for a PMD entry is a
> bit tricky. So this seems like a large patch which might have bugs.
>
> Thoughts?

...but if it was a problem of just mismatched waitqueues I would have
expected it to trigger prior to commit b15cd800682f "dax: Convert page
fault handlers to XArray".

This hunk, if I'm reading it correctly, looks suspicious: @index in this
case is coming directly from vm->pgoff without pmd alignment adjustment
whereas after the conversion it's always pmd aligned from the
xas->xa_index. So perhaps the issue is that the lock happens at pte
granularity. I expect it would cause the old put_locked_mapping_entry()
to WARN, but maybe that avoids the lockup and was missed in the bisect.

@@ -884,21 +711,18 @@ static void *dax_insert_entry(struct address_space *mapping,
          * existing entry is a PMD, we will just leave the PMD in the
          * tree and dirty it if necessary.
          */
-        struct radix_tree_node *node;
-        void **slot;
-        void *ret;
-
-        ret = __radix_tree_lookup(pages, index, &node, &slot);
-        WARN_ON_ONCE(ret != entry);
-        __radix_tree_replace(pages, node, slot,
-                        new_entry, NULL);
+        void *old = dax_lock_entry(xas, new_entry);
+        WARN_ON_ONCE(old != xa_mk_value(xa_to_value(entry) |
+                                DAX_LOCKED));
         entry = new_entry;
+    } else {
+        xas_load(xas);        /* Walk the xa_state */
     }

     if (dirty)
-        radix_tree_tag_set(pages, index, PAGECACHE_TAG_DIRTY);
+        xas_set_mark(xas, PAGECACHE_TAG_DIRTY);

-    xa_unlock_irq(pages);
+    xas_unlock_irq(xas);
     return entry;
 }
On Sun 30-06-19 08:23:24, Matthew Wilcox wrote:
> I'm trying to come up with a clean fix for this. Clearly we shouldn't
> wait for a PTE entry if we're looking for a PMD entry. But what should
> get_unlocked_entry() return if it detects that case?
>
> At least get_unlocked_entry() is static, but it's got quite a few
> callers. Trying to discern which ones might ask for a PMD entry is a
> bit tricky. So this seems like a large patch which might have bugs.

Yeah. So get_unlocked_entry() is used in several cases:

1) Case where we already have the entry at the given index but it is
locked and we need it unlocked so that we can do our thing
(dax_writeback_one(), dax_layout_busy_page()).

2) Case where we want any entry covering the given index (in
__dax_invalidate_entry()). This is essentially the same as case 1)
since we have already looked up the entry (we just didn't propagate
that information from mm/truncate.c) - we want any unlocked entry
covering the given index.

3) Cases where we really want the entry at the given index and we have
some entry order constraints (dax_insert_pfn_mkwrite(),
grab_mapping_entry()).

Honestly, I'd make the rule that get_unlocked_entry() returns an entry
of any order that covers the given index. I agree it may be
unnecessarily waiting for a PTE entry lock in case 3), where we are
really looking only for a PMD entry, but that seems like a relatively
small cost for the simplicity of the interface.

BTW, looking into the xarray code, I think I found another difference
between the old radix tree code and the new xarray code that could
cause issues. In the old radix tree code, if we tried to insert a PMD
entry but there was some PTE entry in the covered range, we'd get an
EEXIST error back and the DAX fault code relies on this. I don't see
how similar behavior is achieved by xas_store()...

                                                                Honza
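[If Jan's "any covering entry" rule were combined with the keying fix
Dan stumbled on, the waitqueue key derivation might end up looking like
this — a sketch against the v5.2-era helpers, not a posted patch:]

static wait_queue_head_t *dax_entry_waitqueue(struct xa_state *xas,
                void *entry, struct exceptional_entry_key *key)
{
        unsigned long hash;
        unsigned long index = xas->xa_index;

        /*
         * Sketch: fold every index to the start of its PMD range,
         * regardless of the entry's order (entry is no longer
         * consulted).  A PMD faulter sleeping on a locked PTE entry
         * and the thread that later unlocks that PTE then agree on
         * the key, so no wakeup can be missed -- at the cost of
         * 512-way key aliasing within each 2MiB range.
         */
        index &= ~PG_PMD_COLOUR;
        key->xa = xas->xa;
        key->entry_start = index;

        hash = hash_long((unsigned long)xas->xa ^ index,
                         DAX_WAIT_TABLE_BITS);
        return wait_table + hash;
}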
On Sun, Jun 30, 2019 at 02:37:32PM -0700, Dan Williams wrote:
> ...but if it was a problem of just mismatched waitqueues I would have
> expected it to trigger prior to commit b15cd800682f "dax: Convert page
> fault handlers to XArray".

That commit converts grab_mapping_entry() (called by
dax_iomap_pmd_fault()) from calling get_unlocked_mapping_entry() to
calling get_unlocked_entry(). get_unlocked_mapping_entry() (eventually)
called __radix_tree_lookup() instead of xas_find_conflict().

> This hunk, if I'm reading it correctly, looks suspicious: @index in
> this case is coming directly from vm->pgoff without pmd alignment
> adjustment whereas after the conversion it's always pmd aligned from
> the xas->xa_index. So perhaps the issue is that the lock happens at
> pte granularity. I expect it would cause the old
> put_locked_mapping_entry() to WARN, but maybe that avoids the lockup
> and was missed in the bisect.

I don't think that hunk is the problem. The __radix_tree_lookup() is
going to return a 'slot' which points to the canonical slot, no matter
which of the 512 indices corresponding to that slot is chosen. So I
think it's going to do essentially the same thing.
On Mon, Jul 1, 2019 at 8:34 PM Matthew Wilcox <willy@infradead.org> wrote:
> I don't think that hunk is the problem. The __radix_tree_lookup() is
> going to return a 'slot' which points to the canonical slot, no matter
> which of the 512 indices corresponding to that slot is chosen. So I
> think it's going to do essentially the same thing.

Yeah, no warnings on the parent commit for the regression.

I'd be inclined to do the brute force fix of not trying to get fancy
with separate PTE/PMD waitqueues and then follow on with a more clever
performance enhancement later. Thoughts about that?
On 02/07/2019 18:37, Dan Williams wrote:
<>
> I'd be inclined to do the brute force fix of not trying to get fancy
> with separate PTE/PMD waitqueues and then follow on with a more clever
> performance enhancement later. Thoughts about that?

Sir Dan

I do not understand how separate waitqueues are any performance
enhancement. The whole point of the waitqueues is that there are enough
of them, and the hash function does a good randomization spread, to
effectively grab a single locker per waitqueue unless the system is
very contended and waitqueues are shared. Which is good, because it
means you effectively get back pressure to the app. (Because pmem IO
is mostly CPU bound with no long term sleeps, I do not think you will
ever get to that situation.)

So the way I understand it, having twice as many waitqueues serving two
types will be better performance overall than segregating the types,
each with half the number of queues.

(Regardless of the above problem where the segregation is not race
clean)

Thanks
Boaz
On Tue, Jul 2, 2019 at 5:23 PM Boaz Harrosh <openosd@gmail.com> wrote:
> I do not understand how separate waitqueues are any performance
> enhancement. The whole point of the waitqueues is that there are
> enough of them, and the hash function does a good randomization
> spread, to effectively grab a single locker per waitqueue unless the
> system is very contended and waitqueues are shared.

Right, the fix in question limits the input to the hash calculation by
masking the input to always be 2MB aligned.

> So the way I understand it, having twice as many waitqueues serving
> two types will be better performance overall than segregating the
> types, each with half the number of queues.

Yes, but the trick is how to manage cases where someone waiting on one
type needs to be woken up by an event on the other. So all I'm saying
is let's live with more hash collisions until we can figure out a race
free way to better scale waitqueue usage.
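[To put a number on the extra collisions Dan is accepting here — my
arithmetic, assuming x86-64 page sizes, not taken from the thread:]

#include <stdio.h>

#define PAGE_SHIFT      12
#define PMD_SHIFT       21
/* 511 on x86-64: low bits of a page index within one 2MiB PMD range */
#define PG_PMD_COLOUR   ((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1)

int main(void)
{
        unsigned long samples[] = { 0x200, 0x2ff, 0x3ff };
        int i;

        /* All 512 page indices in a PMD range feed the hash one value */
        for (i = 0; i < 3; i++)
                printf("index %#lx -> hash input %#lx\n",
                       samples[i], samples[i] & ~PG_PMD_COLOUR);
        return 0;
}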
On 03/07/2019 03:42, Dan Williams wrote:
> On Tue, Jul 2, 2019 at 5:23 PM Boaz Harrosh <openosd@gmail.com> wrote:
<>
> Yes, but the trick is how to manage cases where someone waiting on one
> type needs to be woken up by an event on the other.

Exactly. I'm totally with you on this.

> So all I'm saying is let's live with more hash collisions until we can
> figure out a race free way to better scale waitqueue usage.

Yes, and let's actually do real measurements to see if this really
hurts. Maybe not so much.

Thanks
Boaz
On Mon, Jul 01, 2019 at 02:11:19PM +0200, Jan Kara wrote:
> BTW, looking into the xarray code, I think I found another difference
> between the old radix tree code and the new xarray code that could
> cause issues. In the old radix tree code, if we tried to insert a PMD
> entry but there was some PTE entry in the covered range, we'd get an
> EEXIST error back and the DAX fault code relies on this. I don't see
> how similar behavior is achieved by xas_store()...

Are you referring to this?

-        entry = dax_make_locked(0, size_flag | DAX_EMPTY);
-
-        err = __radix_tree_insert(&mapping->i_pages, index,
-                        dax_entry_order(entry), entry);
-        radix_tree_preload_end();
-        if (err) {
-                xa_unlock_irq(&mapping->i_pages);
-                /*
-                 * Our insertion of a DAX entry failed, most likely
-                 * because we were inserting a PMD entry and it
-                 * collided with a PTE sized entry at a different
-                 * index in the PMD range. We haven't inserted
-                 * anything into the radix tree and have no waiters to
-                 * wake.
-                 */
-                return ERR_PTR(err);
-        }

If so, that can't happen any more because we no longer drop the i_pages
lock while the entry is NULL, so the entry is always locked while the
i_pages lock is dropped.
On Wed 03-07-19 08:47:00, Matthew Wilcox wrote:
> If so, that can't happen any more because we no longer drop the i_pages
> lock while the entry is NULL, so the entry is always locked while the
> i_pages lock is dropped.

Ah, I have misinterpreted what xas_find_conflict() does. I'm sorry for
the noise.

I find it somewhat unfortunate that xas_find_conflict() will not return
in any way the index where it has found the conflicting entry. We could
then use it for the wait logic as well and wouldn't have to resort to
some masking tricks...

                                                                Honza
diff --git a/fs/Kconfig b/fs/Kconfig
index f1046cf6ad85..85eecd0d4c5d 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -66,6 +66,9 @@ config FS_DAX_PMD
 	depends on FS_DAX
 	depends on ZONE_DEVICE
 	depends on TRANSPARENT_HUGEPAGE
+	# intermittent lockups since commit b15cd800682f "dax: Convert
+	# page fault handlers to XArray"
+	depends on BROKEN
 
 # Selected by DAX drivers that do not expect filesystem DAX to support
 # get_user_pages() of DAX mappings. I.e. "limited" indicates no support
Ever since the conversion of DAX to the Xarray a RocksDB benchmark has
been encountering intermittent lockups. The backtraces always include
the filesystem-DAX PMD path, multi-order entries have been a source of
bugs in the past, and disabling the PMD path allows a test that fails
in minutes to run for an hour.

The regression has been bisected to "dax: Convert page fault handlers
to XArray", but little progress has been made on the root cause debug.
Unless / until the root cause can be identified, mark CONFIG_FS_DAX_PMD
broken to preclude intermittent lockups. Reverting the Xarray
conversion also works, but that change is too big to backport. The
implementation is committed to the Xarray at this point.

Link: https://lore.kernel.org/linux-fsdevel/CAPcyv4hwHpX-MkUEqxwdTj7wCCZCN4RV-L4jsnuwLGyL_UEG4A@mail.gmail.com/
Fixes: b15cd800682f ("dax: Convert page fault handlers to XArray")
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: <stable@vger.kernel.org>
Reported-by: Robert Barror <robert.barror@intel.com>
Reported-by: Seema Pandit <seema.pandit@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/Kconfig | 3 +++
 1 file changed, 3 insertions(+)