[v3,00/13] dax: fix dma vs truncate and remove 'page-less' support

On Thu, 2017-10-26 at 12:58 +0200, Jan Kara wrote:
> On Fri 20-10-17 11:31:48, Christoph Hellwig wrote:
> > On Fri, Oct 20, 2017 at 09:47:50AM +0200, Christoph Hellwig wrote:
> > > I'd like to brainstorm how we can do something better.
> > > 
> > > How about:
> > > 
> > > If we hit a page with an elevated refcount in truncate / hole punch
> > > etc for a DAX file system we do not free the blocks in the file system,
> > > but add it to the extent busy list.  We mark the page as delayed
> > > free (e.g. page flag?) so that when it finally hits refcount zero we
> > > call back into the file system to remove it from the busy list.
> > 
> > Brainstorming some more:
> > 
> > Given that on a DAX file there shouldn't be any long-term page
> > references after we unmap it from the page table and don't allow
> > get_user_pages calls why not wait for the references for all
> > DAX pages to go away first?  E.g. if we find a DAX page in
> > truncate_inode_pages_range that has an elevated refcount we set
> > a new flag to prevent new references from showing up, and then
> > simply wait for it to go away.  Instead of a busy wait we can
> > do this through a few hashed waitqueues in dev_pagemap.  And in
> > fact put_zone_device_page already gets called when putting the
> > last page so we can handle the wakeup from there.
> > 
> > In fact if we can't find a page flag for the stop new callers
> > thing we could probably come up with a way to do that through
> > dev_pagemap somehow, but I'm not sure how efficient that would
> > be.
> 
> We were talking about this yesterday with Dan so some more brainstorming
> from us. We can implement the solution with an extent busy list in ext4
> relatively easily - we already have such a list, similar to XFS.
> There would be some modifications needed but nothing too complex. The
> biggest downside of this solution I see is that it requires a per-filesystem
> solution for busy extents - ext4 and XFS are reasonably fine, however btrfs
> may have problems and ext2 definitely will need some modifications.
> Invisible used blocks may be surprising to users at times although given
> page refs should be relatively short term, that should not be a big issue.
> But are we guaranteed page refs are short term? E.g. if someone creates a
> v4l2 videobuf in a MAP_SHARED mapping of a file on a DAX filesystem, page
> refs can be rather long-term, similar to the RDMA case. Also freeing of blocks
> on page reference drop is another async entry point into the filesystem
> which could unpleasantly surprise us but I guess workqueues would solve
> that reasonably fine.
> 
> WRT waiting for page refs to be dropped before proceeding with truncate (or
> punch hole for that matter - that case is even nastier since we don't have
> i_size to guard us). What I like about this solution is that it is very
> visible there's something unusual going on with the file being truncated /
> punched and so problems are easier to diagnose / fix from the admin side.
> So far we have guarded hole punching from concurrent faults (and
> get_user_pages() does fault once you do unmap_mapping_range()) with
> I_MMAP_LOCK (or its equivalent in ext4). We cannot easily wait for page
> refs to be dropped under I_MMAP_LOCK as that could deadlock - the most
> obvious case Dan came up with is when GUP obtains ref to page A, then hole
> punch comes grabbing I_MMAP_LOCK and waiting for page ref on A to be
> dropped, and then GUP blocks on trying to fault in another page.
> 
> I think we cannot easily prevent new page references from being grabbed as
> you write above since nobody expects stuff like get_page() to fail. But I
> think that unmapping relevant pages and then preventing them from being
> faulted in again is workable and stops GUP as well. The problem with that is
> what to do with page faults to such pages - you cannot just fail them for
> hole punch, and you cannot easily allocate new blocks either. So we are
> back at a situation where we need to detach blocks from the inode and then
> wait for page refs to be dropped - so some form of busy extents. Am I
> missing something?
> 

No, that's a good summary of what we talked about. However, I did go
back and give the new lock approach a try and was able to get my test
to pass. The new locking is not pretty especially since you need to
drop and reacquire the lock so that get_user_pages() can finish
grabbing all the pages it needs. Here are the two primary patches in
the series, do you think the extent-busy approach would be cleaner?
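
For comparison, here is a rough userspace model of the extent-busy idea
discussed above. It is only a sketch under stated assumptions: the names
(struct blk, blk_pin(), blk_truncate(), blk_unpin(), blk_alloc()) are
hypothetical and do not correspond to any ext4 or XFS interface, and a
plain counter stands in for the elevated page refcount.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, not kernel code: one entry per DAX filesystem block. */
enum blk_state { BLK_FREE, BLK_IN_FILE, BLK_BUSY };

struct blk {
    enum blk_state state;
    int pins;               /* stands in for an elevated page refcount */
};

/* get_user_pages()-like pin: only blocks still owned by a file can be pinned. */
static bool blk_pin(struct blk *b)
{
    if (b->state != BLK_IN_FILE)
        return false;
    b->pins++;
    return true;
}

/* Truncate / hole punch: detach from the file, defer the free while pinned. */
static void blk_truncate(struct blk *b)
{
    if (b->state != BLK_IN_FILE)
        return;
    b->state = b->pins > 0 ? BLK_BUSY : BLK_FREE;
}

/* Drop a pin; the last unpin of a busy block completes the deferred free. */
static void blk_unpin(struct blk *b)
{
    assert(b->pins > 0);
    if (--b->pins == 0 && b->state == BLK_BUSY)
        b->state = BLK_FREE;
}

/* Allocator: never hands out a block that is busy or still in a file. */
static struct blk *blk_alloc(struct blk *tbl, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].state == BLK_FREE) {
            tbl[i].state = BLK_IN_FILE;
            return &tbl[i];
        }
    }
    return NULL;
}
```

In this model truncate never blocks; the cost is that a pinned block stays
invisible to the allocator until the last pin is dropped.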

---

commit 5023d20a0aa795ddafd43655be1bfb2cbc7f4445
Author: Dan Williams <dan.j.williams@intel.com>
Date:   Wed Oct 25 05:14:54 2017 -0700

    mm, dax: handle truncate of dma-busy pages

    get_user_pages() pins file backed memory pages for access by dma
    devices. However, it only pins the memory pages not the page-to-file
    offset association. If a file is truncated the pages are mapped out of
    the file and dma may continue indefinitely into a page that is owned by
    a device driver. This breaks coherency of the file vs dma, but the
    assumption is that if userspace wants the file-space truncated it does
    not matter what data is inbound from the device, it is not relevant
    anymore.

    The assumptions of the truncate-page-cache model are broken by DAX where
    the target DMA page *is* the filesystem block. Leaving the page pinned
    for DMA, but truncating the file block out of the file, means that the
    filesystem is free to reallocate a block under active DMA to another
    file!

    Here are some possible options for fixing this situation ('truncate' and
    'fallocate(punch hole)' are synonymous below):

        1/ Fail truncate while any file blocks might be under dma

        2/ Block (sleep-wait) truncate while any file blocks might be under
           dma

        3/ Remap file blocks to a "lost+found"-like file-inode where
           dma can continue and we might see what inbound data from DMA was
           mapped out of the original file. Blocks in this file could be
           freed back to the filesystem when dma eventually ends.

        4/ List the blocks under DMA in the extent busy list and either hold
           off commit of the truncate transaction until commit, or otherwise
           keep the blocks marked busy so the allocator does not reuse them
           until DMA completes.

        5/ Disable dax until option 3 or another long term solution has been
           implemented. However, filesystem-dax is still marked experimental
           for concerns like this.

    Option 1 will throw failures where userspace has never expected them
    before, option 2 might hang the truncating process indefinitely, and
    option 3 requires per filesystem enabling to remap blocks from one inode
    to another.  Option 2 is implemented in this patch for the DAX path with
    the expectation that non-transient users of get_user_pages() (RDMA) are
    disallowed from setting up dax mappings and that the potential delay
    introduced to the truncate path is acceptable compared to the response
    time of the page cache case. This can only be seen as a stop-gap until
    we can solve the problem of safely sequestering unallocated filesystem
    blocks under active dma.

    The solution introduces a new inode semaphore that is held
    exclusively for get_user_pages() and held for read at truncate while
    sleep-waiting on a hashed waitqueue.
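
The sleep-wait half of option 2 might be modelled in userspace roughly as
follows. This is a sketch only: it uses pthread mutexes and condition
variables in place of kernel waitqueues, it does not model the inode
semaphore at all, and every name in it (wait_table, struct dax_page,
page_pin(), page_unpin(), truncate_wait(), dma_complete()) is hypothetical
rather than taken from the patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define WAIT_TABLE_SIZE 16      /* hypothetical number of hashed buckets */

struct wait_bucket {
    pthread_mutex_t lock;
    pthread_cond_t cond;
};

static struct wait_bucket wait_table[WAIT_TABLE_SIZE];

struct dax_page {
    int pins;                   /* stands in for the elevated refcount */
};

static void wait_table_init(void)
{
    for (size_t i = 0; i < WAIT_TABLE_SIZE; i++) {
        pthread_mutex_init(&wait_table[i].lock, NULL);
        pthread_cond_init(&wait_table[i].cond, NULL);
    }
}

/* Many pages share a few buckets, selected by hashing the page address. */
static struct wait_bucket *page_bucket(const struct dax_page *p)
{
    return &wait_table[((uintptr_t)p >> 4) % WAIT_TABLE_SIZE];
}

static void page_pin(struct dax_page *p)
{
    struct wait_bucket *b = page_bucket(p);

    pthread_mutex_lock(&b->lock);
    p->pins++;
    pthread_mutex_unlock(&b->lock);
}

/* Dropping the last pin wakes every waiter hashed to this bucket. */
static void page_unpin(struct dax_page *p)
{
    struct wait_bucket *b = page_bucket(p);

    pthread_mutex_lock(&b->lock);
    if (--p->pins == 0)
        pthread_cond_broadcast(&b->cond);
    pthread_mutex_unlock(&b->lock);
}

/* Truncate side: sleep until the page is no longer pinned. */
static void truncate_wait(struct dax_page *p)
{
    struct wait_bucket *b = page_bucket(p);

    pthread_mutex_lock(&b->lock);
    while (p->pins > 0)         /* recheck: the bucket is shared */
        pthread_cond_wait(&b->cond, &b->lock);
    pthread_mutex_unlock(&b->lock);
}

/* Thread body standing in for DMA completion dropping its pin. */
static void *dma_complete(void *arg)
{
    page_unpin(arg);
    return NULL;
}
```

Because buckets are shared between pages, a waiter can be woken for a
different page, which is why truncate_wait() rechecks the pin count in a
loop before returning.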

    Credit for option 3 goes to Dave Hansen, who proposed something similar
    as an alternative way to solve the problem that MAP_DIRECT was trying to
    solve. Credit for option 4 goes to Christoph Hellwig.

    Cc: Jan Kara <jack@suse.cz>
    Cc: Jeff Moyer <jmoyer@redhat.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Matthew Wilcox <mawilcox@microsoft.com>
    Cc: Alexander Viro <viro@zeniv.linux.org.uk>
    Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
    Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Reported-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Dan Williams <dan.j.williams@intel.com>

