Message ID | 20150409174916.5a2efef5@notabene.brown (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
NeilBrown <neilb@suse.de> wrote: > Is there a better way? Could a better way be created? Maybe > SEEK_DATA_RELIABLE ?? fiemap() maybe? > Also, if you do try to use fscache on btrfs with 3.19, then nothing gets > cached (as expected) and with a heavy load you can lose a race and get an > asserting fail in fscache_enqueue_operation Do you have the patches here applied? http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=fscache-fixes David -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Apr 09, 2015 at 10:23:08AM +0100, David Howells wrote: > NeilBrown <neilb@suse.de> wrote: > > > Is there a better way? Could a better way be created? Maybe > > SEEK_DATA_RELIABLE ?? > > fiemap() maybe? fiemap is not reliable for mapping holes - it returns extent info, not whether there is data in a range. i.e. there can be data over a hole (e.g. delayed allocation) and fiemap will return it as a hole. cp made this mistake back when fiemap was first introduced, resulting in corrupt file copies. SEEK_HOLE/SEEK_DATA is what you want, as they are page cache coherent, not extent based operations. And, really if you need it to really be able to find real holes, then a superblock flag might be a better way of marking filesystems with the required capability. Cheers, Dave.
On Fri, 10 Apr 2015 09:52:34 +1000 Dave Chinner <david@fromorbit.com> wrote: > On Thu, Apr 09, 2015 at 10:23:08AM +0100, David Howells wrote: > > NeilBrown <neilb@suse.de> wrote: > > > > > Is there a better way? Could a better way be created? Maybe > > > SEEK_DATA_RELIABLE ?? > > > > fiemap() maybe? > > fiemap is not reliable for mapping holes - it returns extent info, > not whether there is data in a range. i.e. there can be data over a > hole (e.g. delayed allocation) and fiemap will return it as a hole. > cp made this mistake back when fiemap was first introduced, > resulting in corrupt file copies. I assumed from the doco that FIEMAP_EXTENT_DELALLOC would get returned in this case. I guess not? > > SEEK_HOLE/SEEK_DATA is what you want, as they are page cache > coherent, not extent based operations. And, really if you need it to > really be able to find real holes, then a superblock flag might be a > better way of marking filesystems with the required capability. btrfs seems to use the same underlying functionality for both fiemap and SEEK_HOLE/SEEK_DATA... But yes: a new fs_flags flag is probably the go. Thanks, NeilBrown > > Cheers, > > Dave.
On Fri, Apr 10, 2015 at 10:42:21AM +1000, NeilBrown wrote: > On Fri, 10 Apr 2015 09:52:34 +1000 Dave Chinner <david@fromorbit.com> wrote: > > > On Thu, Apr 09, 2015 at 10:23:08AM +0100, David Howells wrote: > > > NeilBrown <neilb@suse.de> wrote: > > > > > > > Is there a better way? Could a better way be created? Maybe > > > > SEEK_DATA_RELIABLE ?? > > > > > > fiemap() maybe? > > > > fiemap is not reliable for mapping holes - it returns extent info, > > not whether there is data in a range. i.e. there can be data over a > > hole (e.g. delayed allocation) and fiemap will return it as a hole. > > cp made this mistake back when fiemap was first introduced, > > resulting in corrupt file copies. > > I assumed from the doco that FIEMAP_EXTENT_DELALLOC would get returned in > this case. I guess not? FIEMAP is advisory and implementing features in it are optional. It's also not guaranteed to be an accurate representation of the layout of the file because it can change even before hte FIEMAP ioctl returns to userspace. it's also a representation of the physical layout of the data in the file, but it does not need to be coherent with the current data in the file. i.e. FIEMAP is not required to report the current state of data in the files to the caller, just the working physical layout sitting in memory... > > SEEK_HOLE/SEEK_DATA is what you want, as they are page cache > > coherent, not extent based operations. And, really if you need it to > > really be able to find real holes, then a superblock flag might be a > > better way of marking filesystems with the required capability. > > btrfs seems to use the same underlying functionality for both fiemap and > SEEK_HOLE/SEEK_DATA... That's just an internal implementation detail - it does not define the behaviour and constraints of the information FIEMAP or SEEK_HOLE/SEEK_DATA APIs are supposed to provide. Cheers, Dave.
On Thu, 09 Apr 2015 10:23:08 +0100 David Howells <dhowells@redhat.com> wrote: > NeilBrown <neilb@suse.de> wrote: > > > Is there a better way? Could a better way be created? Maybe > > SEEK_DATA_RELIABLE ?? > > fiemap() maybe? > > > Also, if you do try to use fscache on btrfs with 3.19, then nothing gets > > cached (as expected) and with a heavy load you can lose a race and get an > > asserting fail in fscache_enqueue_operation > > Do you have the patches here applied? > > http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=fscache-fixes > Do I don't. I had looked through them and while they did seem to be addressing similar sorts of races, nothing seems like an obvious match. I haven't been able to reproduce the BUG_ON myself. I only have a report of it repeatedly affecting someone else: https://bugzilla.opensuse.org/show_bug.cgi?id=908706 I'll probably have to be happy with fixing usage on btrfs, and hope the other bug is fixed already or doesn't become a problem. Thanks, NeilBrown
Dave Chinner <david@fromorbit.com> wrote: > SEEK_HOLE/SEEK_DATA is what you want, as they are page cache > coherent, not extent based operations. And, really if you need it to > really be able to find real holes, then a superblock flag might be a > better way of marking filesystems with the required capability. Actually, I wonder if what I want is a kernel_read() that returns ENODATA upon encountering a hole at the beginning of the area to be read. Of course, what I really, really want is asynchronous, direct read and write within the kernel, where the read will notify you of any holes, but will read all the data it can around those holes. David -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Apr 10, 2015 at 02:28:16PM +0100, David Howells wrote: > Dave Chinner <david@fromorbit.com> wrote: > > > SEEK_HOLE/SEEK_DATA is what you want, as they are page cache > > coherent, not extent based operations. And, really if you need it to > > really be able to find real holes, then a superblock flag might be a > > better way of marking filesystems with the required capability. > > Actually, I wonder if what I want is a kernel_read() that returns ENODATA upon > encountering a hole at the beginning of the area to be read. NFS READ_PLUS could also make use of this, but someone needs to actually implement it. Until we have that lseek SEEK_HOLE/DATA is the way to go, and the horrible ->bmap hack needs to die ASAP, I can't believe you managed to sneak that in in the not too distant past. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 10 Apr 2015 11:24:31 +1000 NeilBrown <neilb@suse.de> wrote: > On Thu, 09 Apr 2015 10:23:08 +0100 David Howells <dhowells@redhat.com> wrote: > > > NeilBrown <neilb@suse.de> wrote: > > > > > Is there a better way? Could a better way be created? Maybe > > > SEEK_DATA_RELIABLE ?? > > > > fiemap() maybe? > > > > > Also, if you do try to use fscache on btrfs with 3.19, then nothing gets > > > cached (as expected) and with a heavy load you can lose a race and get an > > > asserting fail in fscache_enqueue_operation > > > > Do you have the patches here applied? > > > > http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=fscache-fixes > > > > Do I don't. I had looked through them and while they did seem to be > addressing similar sorts of races, nothing seems like an obvious match. > I haven't been able to reproduce the BUG_ON myself. I only have a report > of it repeatedly affecting someone else: > > https://bugzilla.opensuse.org/show_bug.cgi?id=908706 > > I'll probably have to be happy with fixing usage on btrfs, and hope the other > bug is fixed already or doesn't become a problem. I managed to reproduce the bug, and when I applied your patches I cannot any more. So it looks like you've fixed it - thanks. That just leave the bmap issue. I'll post a patch which causes lseek to be used when the fs says that is OK. Thanks, NeilBrown
NeilBrown <neilb@suse.de> wrote: > I managed to reproduce the bug, and when I applied your patches I cannot any > more. So it looks like you've fixed it - thanks. I hope so too. Now I just hope Linus takes the patches. > That just leave the bmap issue. I'll post a patch which causes lseek to be > used when the fs says that is OK. Okay, thanks. David -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/cachefiles/namei.c b/fs/cachefiles/namei.c index 1e51714eb33e..1389d8483d5d 100644 --- a/fs/cachefiles/namei.c +++ b/fs/cachefiles/namei.c @@ -20,6 +20,7 @@ #include <linux/namei.h> #include <linux/security.h> #include <linux/slab.h> +#include <linux/magic.h> #include "internal.h" #define CACHEFILES_KEYBUF_SIZE 512 @@ -647,7 +648,8 @@ lookup_again: ret = -EPERM; aops = object->dentry->d_inode->i_mapping->a_ops; - if (!aops->bmap) + if (!aops->bmap && + object->dentry->d_sb->s_magic != BTRFS_SUPER_MAGIC) goto check_error; object->backer = object->dentry; diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c index c6cd8d7a4eef..49fb330c0ab8 100644 --- a/fs/cachefiles/rdwr.c +++ b/fs/cachefiles/rdwr.c @@ -410,11 +410,11 @@ int cachefiles_read_or_alloc_page(struct fscache_retrieval *op, inode = object->backer->d_inode; ASSERT(S_ISREG(inode->i_mode)); - ASSERT(inode->i_mapping->a_ops->bmap); ASSERT(inode->i_mapping->a_ops->readpages); /* calculate the shift required to use bmap */ - if (inode->i_sb->s_blocksize > PAGE_SIZE) + if (inode->i_mapping->a_ops->bmap && + inode->i_sb->s_blocksize > PAGE_SIZE) goto enobufs; shift = PAGE_SHIFT - inode->i_sb->s_blocksize_bits; @@ -423,20 +423,36 @@ int cachefiles_read_or_alloc_page(struct fscache_retrieval *op, op->op.flags |= FSCACHE_OP_ASYNC; op->op.processor = cachefiles_read_copier; - /* we assume the absence or presence of the first block is a good - * enough indication for the page as a whole - * - TODO: don't use bmap() for this as it is _not_ actually good - * enough for this as it doesn't indicate errors, but it's all we've - * got for the moment - */ - block0 = page->index; - block0 <<= shift; - - block = inode->i_mapping->a_ops->bmap(inode->i_mapping, block0); - _debug("%llx -> %llx", - (unsigned long long) block0, - (unsigned long long) block); + if (inode->i_mapping->a_ops->bmap) { + /* we assume the absence or presence of the first block is a good + * enough indication for the page as a whole + * - TODO: don't use bmap() for this as it is _not_ actually good + * enough for this as it doesn't indicate errors, but it's all we've + * got for the moment + */ + block0 = page->index; + block0 <<= shift; + block = inode->i_mapping->a_ops->bmap(inode->i_mapping, block0); + _debug("%llx -> %llx", + (unsigned long long) block0, + (unsigned long long) block); + } else { + /* Use llseek */ + struct path path; + struct file *file; + path.mnt = cache->mnt; + path.dentry = object->backer; + file = dentry_open(&path, O_RDONLY, cache->cache_cred); + if (IS_ERR(file)) + goto enobufs; + block = vfs_llseek(file, page->index << PAGE_SHIFT, SEEK_DATA); + filp_close(file, NULL); + if (block != page->index << PAGE_SHIFT) + block = 0; + else + block = 1; + } if (block) { /* submit the apparently valid page to the backing fs to be * read from disk */ @@ -707,11 +723,11 @@ int cachefiles_read_or_alloc_pages(struct fscache_retrieval *op, inode = object->backer->d_inode; ASSERT(S_ISREG(inode->i_mode)); - ASSERT(inode->i_mapping->a_ops->bmap); ASSERT(inode->i_mapping->a_ops->readpages); /* calculate the shift required to use bmap */ - if (inode->i_sb->s_blocksize > PAGE_SIZE) + if (inode->i_mapping->a_ops->bmap && + inode->i_sb->s_blocksize > PAGE_SIZE) goto all_enobufs; shift = PAGE_SHIFT - inode->i_sb->s_blocksize_bits; @@ -729,21 +745,38 @@ int cachefiles_read_or_alloc_pages(struct fscache_retrieval *op, list_for_each_entry_safe(page, _n, pages, lru) { sector_t block0, block; - /* we assume the absence or presence of the first block is a - * good enough indication for the page as a whole - * - TODO: don't use bmap() for this as it is _not_ actually - * good enough for this as it doesn't indicate errors, but - * it's all we've got for the moment - */ - block0 = page->index; - block0 <<= shift; - - block = inode->i_mapping->a_ops->bmap(inode->i_mapping, - block0); - _debug("%llx -> %llx", - (unsigned long long) block0, - (unsigned long long) block); + if (inode->i_mapping->a_ops->bmap) { + /* we assume the absence or presence of the first block is a + * good enough indication for the page as a whole + * - TODO: don't use bmap() for this as it is _not_ actually + * good enough for this as it doesn't indicate errors, but + * it's all we've got for the moment + */ + block0 = page->index; + block0 <<= shift; + + block = inode->i_mapping->a_ops->bmap(inode->i_mapping, + block0); + _debug("%llx -> %llx", + (unsigned long long) block0, + (unsigned long long) block); + } else { + /* Use llseek */ + struct path path; + struct file *file; + path.mnt = cache->mnt; + path.dentry = object->backer; + file = dentry_open(&path, O_RDONLY, cache->cache_cred); + if (IS_ERR(file)) + goto all_enobufs; + block = vfs_llseek(file, page->index << PAGE_SHIFT, SEEK_DATA); + filp_close(file, NULL); + if (block != page->index << PAGE_SHIFT) + block = 0; + else + block = 1; + } if (block) { /* we have data - add it to the list to give to the * backing fs */