Message ID: 1426093353-23709-1-git-send-email-ross.zwisler@linux.intel.com (mailing list archive)
State: New, archived
On 03/11/2015 07:02 PM, Ross Zwisler wrote:
> The functions copy_from_brd() and copy_to_brd() are written with an
> assumption that the bio_vec they are given has size <= PAGE_SIZE. This
> assumption is not enforced in any way, and if the bio_vec has size
> larger than PAGE_SIZE data will just be lost.
>
> Such a situation can occur with I/Os generated from in-kernel sources,
> or with coalesced bio_vecs.

I wish you could show me where in the kernel this can happen. Who
"coalesces bio_vecs"? Which kernel sources generate a bio_vec with
bv_len > PAGE_SIZE? I did try to look and could not find any. Sorry for
my slowness.

In fact I know of a couple of places that would break if this is true.

> This bug was originally reported against
> the pmem driver, where it was found using the Enmotus tiering engine.

This out-of-tree driver - non-GPL, with no source code - is the first I
have heard of this.

> Instead we should have brd explicitly tell the block layer that it can
> handle data segments of at most PAGE_SIZE.
>
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Reported-by: Hugh Daschbach <hugh.daschbach@enmotus.com>
> Cc: Roger C. Pao (Enmotus) <rcpao.enmotus@gmail.com>
> Cc: Boaz Harrosh <boaz@plexistor.com>
> Cc: linux-nvdimm@lists.01.org
> Cc: Nick Piggin <npiggin@kernel.dk>
> ---
>  drivers/block/brd.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/block/brd.c b/drivers/block/brd.c
> index 898b4f256782..7e4873361b64 100644
> --- a/drivers/block/brd.c
> +++ b/drivers/block/brd.c
> @@ -490,6 +490,7 @@ static struct brd_device *brd_alloc(int i)
>  	blk_queue_make_request(brd->brd_queue, brd_make_request);
>  	blk_queue_max_hw_sectors(brd->brd_queue, 1024);
>  	blk_queue_bounce_limit(brd->brd_queue, BLK_BOUNCE_ANY);
> +	blk_queue_max_segment_size(brd->brd_queue, PAGE_SIZE);

The only place that I can find that uses _max_segment_size is when
translating a bio list to an sg_list, where physical segments may
coalesce. I have never seen it at the bio level.

>
>  	brd->brd_queue->limits.discard_granularity = PAGE_SIZE;
>  	brd->brd_queue->limits.max_discard_sectors = UINT_MAX;
>

Cheers
Boaz
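For reference, the sg_list-translation behavior Boaz describes looks roughly like the sketch below; the helper name and the exact checks are illustrative approximations, not verbatim block-layer code.

#include <linux/blkdev.h>
#include <linux/io.h>

/*
 * Simplified sketch of the coalescing decision made when a bio list is
 * translated to a scatterlist: two biovecs are folded into one sg
 * segment only if they are physically adjacent and the combined length
 * stays within queue_max_segment_size().  Names are illustrative.
 */
static bool sketch_can_coalesce(struct request_queue *q,
				const struct bio_vec *prev,
				const struct bio_vec *cur,
				unsigned int cur_seg_len)
{
	phys_addr_t prev_end = page_to_phys(prev->bv_page) +
			       prev->bv_offset + prev->bv_len;
	phys_addr_t cur_start = page_to_phys(cur->bv_page) + cur->bv_offset;

	if (prev_end != cur_start)
		return false;		/* not physically adjacent */
	if (cur_seg_len + cur->bv_len > queue_max_segment_size(q))
		return false;		/* would exceed the segment cap */
	return true;			/* safe to grow the current sg segment */
}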
On Wed, 2015-03-11 at 19:17 +0200, Boaz Harrosh wrote:
> On 03/11/2015 07:02 PM, Ross Zwisler wrote:
> > The functions copy_from_brd() and copy_to_brd() are written with an
> > assumption that the bio_vec they are given has size <= PAGE_SIZE. This
> > assumption is not enforced in any way, and if the bio_vec has size
> > larger than PAGE_SIZE data will just be lost.
> >
> > Such a situation can occur with I/Os generated from in-kernel sources,
> > or with coalesced bio_vecs.
>
> I wish you could show me where in the kernel this can happen. Who
> "coalesces bio_vecs"? Which kernel sources generate a bio_vec with
> bv_len > PAGE_SIZE? I did try to look and could not find any. Sorry for
> my slowness.

In truth I'm not certain I know of a place either. :)  In part I'm quoting
the original bug report:

https://lists.01.org/pipermail/linux-nvdimm/2015-February/000079.html

The pertinent lines, in case you don't want to follow the link:

"
The biovec can present a size greater than PAGE_SIZE if an I/O buffer
contains physically contiguous pages.  This may be unusual for user space
pages, as the virtual to physical memory map gets fragmented.  But for
buffers allocated by the kernel with kmalloc, physical memory will be
contiguous.

Even if a single I/O request does not contain two contiguous pages, the
block layer may merge two requests that have contiguous pages.  It will
then attempt to coalesce biovecs.  You probably won't see that if you
avoid the I/O scheduler by capturing requests at make_request.

But it is still a good idea to declare your device's segment size
limitation with blk_queue_max_segment_size.  There are a couple drivers
in drivers/block/ that do just that to limit segments to PAGE_SIZE.
"

I wandered around a bit in the block code and I *think* that bvec
coalescing happens via the merge_bvec_fn() function pointers.  DM, for
instance, sets this to dm_merge_bvec() via the blk_queue_merge_bvec()
function.  After that it gets into lots of DM code.

> In fact I know of a couple of places that would break if this is true.

Yep, PMEM and BRD both currently break because of this.

> > This bug was originally reported against
> > the pmem driver, where it was found using the Enmotus tiering engine.
>
> This out-of-tree driver - non-GPL, with no source code - is the first I
> have heard of this.

It was hidden in the original bug report.  Same link as above, and here
are the relevant lines:

"
We caught this because the Enmotus tiering engine issues rather large I/O
requests to buffers that were allocated with kmalloc.  It is fairly
common for the tiering engine to allocate I/O buffers of 64KB or greater.
If the underlying block device supports it, we will submit a bio with a
biovec mapping many contiguous pages.  The entire buffer will possibly be
mapped by a single biovec.  The tiering engine uses max_segment_size to
determine how to build its biovec list.
"

I've never used it or heard of it before this either.

> > Instead we should have brd explicitly tell the block layer that it can
> > handle data segments of at most PAGE_SIZE.
> >
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> > Reported-by: Hugh Daschbach <hugh.daschbach@enmotus.com>
> > Cc: Roger C. Pao (Enmotus) <rcpao.enmotus@gmail.com>
> > Cc: Boaz Harrosh <boaz@plexistor.com>
> > Cc: linux-nvdimm@lists.01.org
> > Cc: Nick Piggin <npiggin@kernel.dk>
> > ---
> >  drivers/block/brd.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/drivers/block/brd.c b/drivers/block/brd.c
> > index 898b4f256782..7e4873361b64 100644
> > --- a/drivers/block/brd.c
> > +++ b/drivers/block/brd.c
> > @@ -490,6 +490,7 @@ static struct brd_device *brd_alloc(int i)
> >  	blk_queue_make_request(brd->brd_queue, brd_make_request);
> >  	blk_queue_max_hw_sectors(brd->brd_queue, 1024);
> >  	blk_queue_bounce_limit(brd->brd_queue, BLK_BOUNCE_ANY);
> > +	blk_queue_max_segment_size(brd->brd_queue, PAGE_SIZE);
>
> The only place that I can find that uses _max_segment_size is when
> translating a bio list to an sg_list, where physical segments may
> coalesce.  I have never seen it at the bio level.
>
> >
> >  	brd->brd_queue->limits.discard_granularity = PAGE_SIZE;
> >  	brd->brd_queue->limits.max_discard_sectors = UINT_MAX;
> >
>
> Cheers
> Boaz

Anyway, I thought your response to the original bug report against PMEM
was that you were alright with this one line change since it didn't hurt
anything, and perhaps it helped someone.  Do you have the same stance for
BRD, or do you think we need to track down if or how bio_vecs can make it
to the driver with more than one page of data first?

- Ross
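To make the quoted report concrete, the kind of in-kernel submitter it describes would look roughly like the sketch below. This is purely illustrative (it is not the Enmotus code; the helper name, buffer size, and error handling are made up), and it uses the two-argument submit_bio() of that era: a physically contiguous kmalloc() buffer described by a single bio_vec whose bv_len spans many pages.

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/mm.h>
#include <linux/sizes.h>
#include <linux/slab.h>

/*
 * Illustrative sketch only (not the Enmotus engine): a physically
 * contiguous kmalloc() buffer can be described by a single bio_vec,
 * giving bv_len = 64KB on a 4KB-page machine.
 */
static int sketch_submit_contig_write(struct block_device *bdev,
				      sector_t sector)
{
	void *buf = kmalloc(SZ_64K, GFP_KERNEL);	/* physically contiguous */
	struct bio *bio;

	if (!buf)
		return -ENOMEM;

	bio = bio_alloc(GFP_KERNEL, 1);			/* room for one biovec */
	if (!bio) {
		kfree(buf);
		return -ENOMEM;
	}

	bio->bi_bdev = bdev;
	bio->bi_iter.bi_sector = sector;
	bio->bi_io_vec[0].bv_page = virt_to_page(buf);
	bio->bi_io_vec[0].bv_offset = offset_in_page(buf);
	bio->bi_io_vec[0].bv_len = SZ_64K;		/* one segment, 16 pages */
	bio->bi_vcnt = 1;
	bio->bi_iter.bi_size = SZ_64K;

	submit_bio(WRITE, bio);				/* pre-4.8 two-argument form */
	return 0;
}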
On Wed, Mar 11, 2015 at 1:02 PM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> The functions copy_from_brd() and copy_to_brd() are written with an
> assumption that the bio_vec they are given has size <= PAGE_SIZE. This
> assumption is not enforced in any way, and if the bio_vec has size
> larger than PAGE_SIZE data will just be lost.
>
> Such a situation can occur with I/Os generated from in-kernel sources,
> or with coalesced bio_vecs. This bug was originally reported against
> the pmem driver, where it was found using the Enmotus tiering engine.
>
> Instead we should have brd explicitly tell the block layer that it can
> handle data segments of at most PAGE_SIZE.
>
> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> Reported-by: Hugh Daschbach <hugh.daschbach@enmotus.com>
> Cc: Roger C. Pao (Enmotus) <rcpao.enmotus@gmail.com>
> Cc: Boaz Harrosh <boaz@plexistor.com>
> Cc: linux-nvdimm@lists.01.org
> Cc: Nick Piggin <npiggin@kernel.dk>
> ---
>  drivers/block/brd.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/block/brd.c b/drivers/block/brd.c
> index 898b4f256782..7e4873361b64 100644
> --- a/drivers/block/brd.c
> +++ b/drivers/block/brd.c
> @@ -490,6 +490,7 @@ static struct brd_device *brd_alloc(int i)
>  	blk_queue_make_request(brd->brd_queue, brd_make_request);
>  	blk_queue_max_hw_sectors(brd->brd_queue, 1024);
>  	blk_queue_bounce_limit(brd->brd_queue, BLK_BOUNCE_ANY);
> +	blk_queue_max_segment_size(brd->brd_queue, PAGE_SIZE);

The block layer only honors this for request-based drivers.

I think we should just fix it properly.  Especially when we have
in-kernel users of persistent memory, allowing > PAGE_SIZE segments will
be a nice feature.
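One reading of "fix it properly" is to have the driver chunk an oversized segment itself instead of capping the queue limit. The rough sketch below shows the shape of such a loop; the brd_do_bvec() signature and the surrounding function are approximations, and the discard and range checks of the real brd_make_request() are omitted, so this is a sketch rather than a tested patch.

/*
 * Rough sketch: handle bv_len > PAGE_SIZE inside brd_make_request()
 * by walking each biovec in page-sized chunks.  The brd_do_bvec()
 * signature is approximate; range and discard handling are omitted.
 */
static void sketch_brd_make_request(struct request_queue *q, struct bio *bio)
{
	struct brd_device *brd = bio->bi_bdev->bd_disk->private_data;
	sector_t sector = bio->bi_iter.bi_sector;
	int rw = bio_data_dir(bio);
	struct bvec_iter iter;
	struct bio_vec bvec;
	int err = 0;

	bio_for_each_segment(bvec, bio, iter) {
		unsigned int len = bvec.bv_len;
		unsigned int off = bvec.bv_offset;

		while (len) {
			/* never cross a page boundary in a single copy */
			unsigned int chunk = min_t(unsigned int, len,
					PAGE_SIZE - (off & ~PAGE_MASK));

			err = brd_do_bvec(brd,
					  nth_page(bvec.bv_page, off >> PAGE_SHIFT),
					  chunk, off & ~PAGE_MASK, rw, sector);
			if (err)
				goto out;

			sector += chunk >> SECTOR_SHIFT;
			off += chunk;
			len -= chunk;
		}
	}
out:
	bio_endio(bio, err);	/* pre-4.3 two-argument form */
}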
On 03/12/2015 12:42 AM, Ross Zwisler wrote:
<>
>
> Anyway, I thought your response to the original bug report against PMEM was
> that you were alright with this one line change since it didn't hurt anything,
> and perhaps it helped someone. Do you have the same stance for BRD,

No. What I did in pmem is remove the BUG that was borrowed from brd,
because in pmem we can support any imaginable bv_len, since we are always
contiguous in memory.  (And the kernel mapping of contiguous physical
pages is also contiguous in virtual space.)

> or do you
> think we need to track down if or how bio_vecs can make it to the driver with
> more than one page of data first?
>

OK, so I have (again) audited every instance of *max_segment_size*, and
call me slow and stupid, but all I can see is this: all of the
"coalescing" behavior and every reference to max_segment_size that you
are talking about happens when translating a bio to an sg_list.

In fact, think about it: the block layer must not touch the bio+biovec,
because it is a contract with the upper layer.  For each page submission
there must be a one-to-one end_io notification; for instance, an FS
might have taken a page_lock on the page.

So if what you are saying is true and "bios are coalesced", it means
that the original bio list would have to be copied to a new, compacted
one.  This does not happen; the copy to a new list happens in the move
from a bio to an sg_list, and the original bio is untouched.

That said, this is all up to Jens - he is the maintainer of the block
layer.  If you are correct and bv_len can ever be > PAGE_SIZE, then I'm
sure he will gladly take this patch.

> - Ross

Just that I would really like to understand, that's all.

Thanks
Boaz
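For contrast, the reason a linearly mapped device can ignore bv_len entirely is sketched below. The names are illustrative (this is not the pmem driver), and it assumes the pages behind the biovec sit in the kernel direct map, e.g. a kmalloc() buffer, which is exactly the case described in the bug report.

#include <linux/mm.h>
#include <linux/string.h>
#include <linux/types.h>

/*
 * Sketch of a pmem-style copy: with the whole device visible as one
 * contiguous virtual mapping, a segment of any length is one memcpy.
 * Assumes the biovec's pages are direct-mapped (e.g. kmalloc'd memory);
 * illustrative names only, not the pmem driver.
 */
static void sketch_copy_to_linear_dev(void *dev_virt_base, sector_t sector,
				      struct page *page, unsigned int off,
				      unsigned int len)
{
	void *dst = dev_virt_base + (sector << 9);	/* sector -> byte offset */
	void *src = page_address(page) + off;		/* direct-mapped pages only */

	memcpy(dst, src, len);				/* fine even for len > PAGE_SIZE */
}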
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 898b4f256782..7e4873361b64 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -490,6 +490,7 @@ static struct brd_device *brd_alloc(int i)
 	blk_queue_make_request(brd->brd_queue, brd_make_request);
 	blk_queue_max_hw_sectors(brd->brd_queue, 1024);
 	blk_queue_bounce_limit(brd->brd_queue, BLK_BOUNCE_ANY);
+	blk_queue_max_segment_size(brd->brd_queue, PAGE_SIZE);

 	brd->brd_queue->limits.discard_granularity = PAGE_SIZE;
 	brd->brd_queue->limits.max_discard_sectors = UINT_MAX;
The functions copy_from_brd() and copy_to_brd() are written with an
assumption that the bio_vec they are given has size <= PAGE_SIZE.  This
assumption is not enforced in any way, and if the bio_vec has size larger
than PAGE_SIZE data will just be lost.

Such a situation can occur with I/Os generated from in-kernel sources, or
with coalesced bio_vecs.  This bug was originally reported against the
pmem driver, where it was found using the Enmotus tiering engine.

Instead we should have brd explicitly tell the block layer that it can
handle data segments of at most PAGE_SIZE.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Hugh Daschbach <hugh.daschbach@enmotus.com>
Cc: Roger C. Pao (Enmotus) <rcpao.enmotus@gmail.com>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: linux-nvdimm@lists.01.org
Cc: Nick Piggin <npiggin@kernel.dk>
---
 drivers/block/brd.c | 1 +
 1 file changed, 1 insertion(+)