Message ID | 20190808082744.31405-3-cmaiolino@redhat.com (mailing list archive) |
---|---|
State | Superseded, archived |
Headers | show |
Series | New ->fiemap infrastructure and ->bmap removal | expand |
On Thu, Aug 08, 2019 at 10:27:37AM +0200, Carlos Maiolino wrote: > + block = page->index; > + block <<= shift; Can't this cause overflows? > + > + ret = bmap(inode, &block); > + ASSERT(!ret); I think we want some real error handling here instead of just an assert..
On Wed, Aug 14, 2019 at 01:15:35PM +0200, Christoph Hellwig wrote: > On Thu, Aug 08, 2019 at 10:27:37AM +0200, Carlos Maiolino wrote: > > + block = page->index; > > + block <<= shift; > > Can't this cause overflows? Hmm, I honestly don't know. I did look at the code, and I couldn't really spot anything concrete. Maybe if the block size is much smaller than PAGE_SIZE, but I am really not sure. Bear in mind though, I didn't change the logic here at all. I just reused one variable instead of juggling both (block0 and block) old variables. So, if this really can overflow, the code is already buggy even without my patch, I'm CC'ing dhowells just in case. > > > + > > + ret = bmap(inode, &block); > > + ASSERT(!ret); > > I think we want some real error handling here instead of just an > assert.. I left this ASSERT() here, to match the current logic. By now, the only error we can get is -EINVAL, which basically says ->bmap() method does not exist, which is basically what does happen today with: ASSERT(inode->i_mapping->a_ops->bmap); But I do agree, it will be better to provide some sort of error handling here, maybe I should do something like: ASSERT(ret == -EINVAL) to keep the logic exactly the same and do not blow up in the future if/when we expand possible error values from bmap() What you think?
Carlos Maiolino <cmaiolino@redhat.com> wrote: > > > + block = page->index; > > > + block <<= shift; > > > > Can't this cause overflows? > > Hmm, I honestly don't know. I did look at the code, and I couldn't really spot > anything concrete. Maybe, though we'd have to support file sizes over 16 Exabytes for that to be a problem. Note that bmap() is *only* used to find out if the page is present in the cache - and even that I'm not actually doing very well, since I really *ought* to check every block in the page. I really want to replace the use of bmap entirely with iov_iter doing DIO. Cachefiles currently does double buffering because it works through the pagecache of the underlying to do actual read or write - and this appears to cause memory management problems. David
Christoph Hellwig <hch@lst.de> wrote: > > + block = page->index; > > + block <<= shift; > > Can't this cause overflows? No, not unless the netfs allows files >16EiB in size and as long as block (type sector_t) is a 64-bit integer. A 16EiB-1 (0xffffffffffffffff) file would have 4P-1 (0xfffffffffffff) pages assuming a 4K page size. At a block size of 1 (and a shift therefore of 12), the maximum block number calculated would be 0xfffffffffffff000. David
On Tue, Aug 20, 2019 at 01:50:30PM +0100, David Howells wrote: > Carlos Maiolino <cmaiolino@redhat.com> wrote: > > > > > + block = page->index; > > > > + block <<= shift; > > > > > > Can't this cause overflows? > > > > Hmm, I honestly don't know. I did look at the code, and I couldn't really spot > > anything concrete. > > Maybe, though we'd have to support file sizes over 16 Exabytes for that to be > a problem. On 32-bit sysems page->index is a 32-bit value, so you'd overflow at pretty normal sizes of a few TB. > Note that bmap() is *only* used to find out if the page is present in the > cache - and even that I'm not actually doing very well, since I really *ought* > to check every block in the page. > > I really want to replace the use of bmap entirely with iov_iter doing DIO. > Cachefiles currently does double buffering because it works through the > pagecache of the underlying to do actual read or write - and this appears to > cause memory management problems. Not related to this patch, but using iov_iter with dio is trivial, what is the blocker therere?
Christoph Hellwig <hch@lst.de> wrote: > Not related to this patch, but using iov_iter with dio is trivial, what > is the blocker therere? The usual: time. The change as a whole is not actually trivial since it will involve completely overhauling the fscache data API and how the filesystems use it - and then having cachefiles perform the DIO asynchronously as per Trond's requirements for using fscache with NFS. I also need to work out how I'm going to do data/hole detection. Can I set, say, O_NOREADHOLE and then expect the DIO to stop early with a short read? Or do I need to use SEEK_DATA/SEEK_HOLE in advance to define the occupied regions? Maybe a better way would be to take a leaf out of the book of OpenAFS and suchlike and keep a parallel file that tracks the occupancy of a cache object (eg. a bitmap with 1 bit per 64k block) - but that the synchronisation and performance issues. David
On Fri, Aug 30, 2019 at 05:17:51PM +0100, David Howells wrote: > Christoph Hellwig <hch@lst.de> wrote: > > > Not related to this patch, but using iov_iter with dio is trivial, what > > is the blocker therere? > > The usual: time. > > The change as a whole is not actually trivial since it will involve completely > overhauling the fscache data API and how the filesystems use it - and then > having cachefiles perform the DIO asynchronously as per Trond's requirements > for using fscache with NFS. Well, doing in-kernel async I/O is actually rather trivial these days. Take a look at the loop driver for an example. > I also need to work out how I'm going to do data/hole detection. Can I set, > say, O_NOREADHOLE and then expect the DIO to stop early with a short read? Or > do I need to use SEEK_DATA/SEEK_HOLE in advance to define the occupied > regions? We'll you'd need to implement a IOCB_NOHOLE, but that wouldn't be all that hard. Having a READ_PLUS like interface that actually tells you how large the hole is might be hard.
Christoph Hellwig <hch@lst.de> wrote: > We'll you'd need to implement a IOCB_NOHOLE, but that wouldn't be all > that hard. Having a READ_PLUS like interface that actually tells you > how large the hole is might be hard. Actually, that raises another couple of questions: (1) Is a filesystem allowed to join up two disjoint blocks of data with a block of zeros to make a single extent? If this happens, I'll see the inserted block of zeros to be valid data. (2) Is a filesystem allowed to punch out a block of valid zero data to make a hole? This would cause me to refetch the data. David
On Sat, Aug 31, 2019 at 01:45:57AM +0100, David Howells wrote: > Christoph Hellwig <hch@lst.de> wrote: > > > We'll you'd need to implement a IOCB_NOHOLE, but that wouldn't be all > > that hard. Having a READ_PLUS like interface that actually tells you > > how large the hole is might be hard. > > Actually, that raises another couple of questions: > > (1) Is a filesystem allowed to join up two disjoint blocks of data with a > block of zeros to make a single extent? If this happens, I'll see the > inserted block of zeros to be valid data. Yes. > (2) Is a filesystem allowed to punch out a block of valid zero data to make a > hole? This would cause me to refetch the data. Yes. Essentially, assumptions that filesystems will not change the file layout even if the data does not change are invalid. copy-on-write, deduplication, speculative preallocation, etc mean the file layout can change even if the file data itself is not directly modified. If you want to cache physical file layout information and be notified when they may change, then that's what layout leases are for.... Cheers, Dave.
diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c index 44a3ce1e4ce4..073c14cae6aa 100644 --- a/fs/cachefiles/rdwr.c +++ b/fs/cachefiles/rdwr.c @@ -396,7 +396,7 @@ int cachefiles_read_or_alloc_page(struct fscache_retrieval *op, struct cachefiles_object *object; struct cachefiles_cache *cache; struct inode *inode; - sector_t block0, block; + sector_t block; unsigned shift; int ret; @@ -412,7 +412,6 @@ int cachefiles_read_or_alloc_page(struct fscache_retrieval *op, inode = d_backing_inode(object->backer); ASSERT(S_ISREG(inode->i_mode)); - ASSERT(inode->i_mapping->a_ops->bmap); ASSERT(inode->i_mapping->a_ops->readpages); /* calculate the shift required to use bmap */ @@ -428,12 +427,14 @@ int cachefiles_read_or_alloc_page(struct fscache_retrieval *op, * enough for this as it doesn't indicate errors, but it's all we've * got for the moment */ - block0 = page->index; - block0 <<= shift; + block = page->index; + block <<= shift; + + ret = bmap(inode, &block); + ASSERT(!ret); - block = inode->i_mapping->a_ops->bmap(inode->i_mapping, block0); _debug("%llx -> %llx", - (unsigned long long) block0, + (unsigned long long) (page->index << shift), (unsigned long long) block); if (block) { @@ -711,7 +712,6 @@ int cachefiles_read_or_alloc_pages(struct fscache_retrieval *op, inode = d_backing_inode(object->backer); ASSERT(S_ISREG(inode->i_mode)); - ASSERT(inode->i_mapping->a_ops->bmap); ASSERT(inode->i_mapping->a_ops->readpages); /* calculate the shift required to use bmap */ @@ -728,7 +728,7 @@ int cachefiles_read_or_alloc_pages(struct fscache_retrieval *op, ret = space ? -ENODATA : -ENOBUFS; list_for_each_entry_safe(page, _n, pages, lru) { - sector_t block0, block; + sector_t block; /* we assume the absence or presence of the first block is a * good enough indication for the page as a whole @@ -736,13 +736,14 @@ int cachefiles_read_or_alloc_pages(struct fscache_retrieval *op, * good enough for this as it doesn't indicate errors, but * it's all we've got for the moment */ - block0 = page->index; - block0 <<= shift; + block = page->index; + block <<= shift; + + ret = bmap(inode, &block); + ASSERT(!ret); - block = inode->i_mapping->a_ops->bmap(inode->i_mapping, - block0); _debug("%llx -> %llx", - (unsigned long long) block0, + (unsigned long long) (page->index << shift), (unsigned long long) block); if (block) {