Message ID: 20120830220745.GI27257@redhat.com (mailing list archive)
State: Deferred, archived
On Thu, Aug 30, 2012 at 06:07:45PM -0400, Vivek Goyal wrote:
> On Wed, Aug 29, 2012 at 10:13:45AM -0700, Kent Overstreet wrote:
>
> [..]
> > > Performance aside, punting submission to per device worker in case of deep
> > > stack usage sounds cleaner solution to me.
> >
> > Agreed, but performance tends to matter in the real world. And either
> > way the tricky bits are going to be confined to a few functions, so I
> > don't think it matters that much.
> >
> > If someone wants to code up the workqueue version and test it, they're
> > more than welcome...
>
> Here is one quick and dirty proof of concept patch. It checks for stack
> depth and if remaining space is less than 20% of stack size, then it
> defers the bio submission to per queue worker.

I can't think of any correctness issues. I see some stuff that could be
simplified (blk_drain_deferred_bios() is redundant, just make it a
wrapper around blk_deferred_bio_work()).

Still skeptical about the performance impact, though - frankly, on some
of the hardware I've been running bcache on this would be a visible
performance regression - probably double digit percentages, but I'd have
to benchmark it. That kind of hardware/usage is not normal today, but
I've put a lot of work into performance and I don't want to make things
worse without good reason.

Have you tested/benchmarked it?

There's scheduling behaviour, too. We really want the workqueue thread's
cpu time to be charged to the process that submitted the bio. (We could
use a mechanism like that in other places, too... not like this is a new
issue).

This is going to be a real issue for users that need strong isolation -
for any driver that uses non-negligible cpu (e.g. dm-crypt), we're
breaking that (not that it wasn't broken already, but this makes it
worse).

I could be convinced, but right now I prefer my solution.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
On Thu, Aug 30, 2012 at 6:43 PM, Kent Overstreet <koverstreet@google.com> wrote:
> On Thu, Aug 30, 2012 at 06:07:45PM -0400, Vivek Goyal wrote:
>> On Wed, Aug 29, 2012 at 10:13:45AM -0700, Kent Overstreet wrote:
>>
>> [..]
>> > > Performance aside, punting submission to per device worker in case of deep
>> > > stack usage sounds cleaner solution to me.
>> >
>> > Agreed, but performance tends to matter in the real world. And either
>> > way the tricky bits are going to be confined to a few functions, so I
>> > don't think it matters that much.
>> >
>> > If someone wants to code up the workqueue version and test it, they're
>> > more than welcome...
>>
>> Here is one quick and dirty proof of concept patch. It checks for stack
>> depth and if remaining space is less than 20% of stack size, then it
>> defers the bio submission to per queue worker.
>
> I can't think of any correctness issues. I see some stuff that could be
> simplified (blk_drain_deferred_bios() is redundant, just make it a
> wrapper around blk_deferred_bio_work()).
>
> Still skeptical about the performance impact, though - frankly, on some
> of the hardware I've been running bcache on this would be a visible
> performance regression - probably double digit percentages but I'd have
> to benchmark it. That kind of hardware/usage is not normal today,
> but I've put a lot of work into performance and I don't want to make
> things worse without good reason.

Here's another crazy idea - we don't really need another thread, just
more stack space. We could check if we're running out of stack space,
then if we are just allocate another two pages and memcpy the struct
thread_info over.

I think the main obstacle is that we'd need some per-arch code for
mucking with the stack pointer. And it'd break backtraces, but that's
fixable.
On Thu, Aug 30, 2012 at 06:43:59PM -0700, Kent Overstreet wrote:
> On Thu, Aug 30, 2012 at 06:07:45PM -0400, Vivek Goyal wrote:
> > On Wed, Aug 29, 2012 at 10:13:45AM -0700, Kent Overstreet wrote:
> >
> > [..]
> > > > Performance aside, punting submission to per device worker in case of deep
> > > > stack usage sounds cleaner solution to me.
> > >
> > > Agreed, but performance tends to matter in the real world. And either
> > > way the tricky bits are going to be confined to a few functions, so I
> > > don't think it matters that much.
> > >
> > > If someone wants to code up the workqueue version and test it, they're
> > > more than welcome...
> >
> > Here is one quick and dirty proof of concept patch. It checks for stack
> > depth and if remaining space is less than 20% of stack size, then it
> > defers the bio submission to per queue worker.
>
> I can't think of any correctness issues. I see some stuff that could be
> simplified (blk_drain_deferred_bios() is redundant, just make it a
> wrapper around blk_deferred_bio_work()).
>
> Still skeptical about the performance impact, though - frankly, on some
> of the hardware I've been running bcache on this would be a visible
> performance regression - probably double digit percentages but I'd have
> to benchmark it. That kind of hardware/usage is not normal today,
> but I've put a lot of work into performance and I don't want to make
> things worse without good reason.

Would you like to give this patch a quick try with bcache on your
hardware and see how much performance impact you see? Given the fact
that submission through a worker happens only when stack usage is high,
that should reduce the impact of the patch, and common use cases should
remain unaffected.

> Have you tested/benchmarked it?

No, I have not. I will run some simple workloads on SSD.

> There's scheduling behaviour, too. We really want the workqueue thread's
> cpu time to be charged to the process that submitted the bio. (We could
> use a mechanism like that in other places, too... not like this is a new
> issue).
>
> This is going to be a real issue for users that need strong isolation -
> for any driver that uses non-negligible cpu (e.g. dm-crypt), we're
> breaking that (not that it wasn't broken already, but this makes it
> worse).

There are so many places in the kernel where worker threads do work on
behalf of other processes. I think this is really a minor concern and I
would not be too worried about it.

What is concerning though really is the greater stack usage due to the
recursive nature of make_request() and the performance impact of
deferral to a worker thread.

Thanks
Vivek
Hello, Vivek.

On Thu, Aug 30, 2012 at 06:07:45PM -0400, Vivek Goyal wrote:
> Here is one quick and dirty proof of concept patch. It checks for stack
> depth and if remaining space is less than 20% of stack size, then it
> defers the bio submission to per queue worker.

So, it removes breadth-first walking of bio construction, instead
ensuring stack overflow never happens by bouncing to a workqueue if
stack usage seems too high.

I do like removal of breadth-first walking. It makes failure
scenarios a lot less mind-bending. That said, Kent is right that this
can incur significant overhead for certain configurations, and looking
at stack usage in the block layer is rather nasty both in design and
implementation.

If we're gonna need the rescuer anyway and can get it right, and the
mechanism can be contained in block proper relatively well, I think it
would be better to make breadth-first walking safe. Both are nasty in
their own ways after all.

Thanks.
On Thu, Aug 30, 2012 at 06:07:45PM -0400, Vivek Goyal wrote:
> On Wed, Aug 29, 2012 at 10:13:45AM -0700, Kent Overstreet wrote:
>
> [..]
> > > Performance aside, punting submission to per device worker in case of deep
> > > stack usage sounds cleaner solution to me.
> >
> > Agreed, but performance tends to matter in the real world. And either
> > way the tricky bits are going to be confined to a few functions, so I
> > don't think it matters that much.
> >
> > If someone wants to code up the workqueue version and test it, they're
> > more than welcome...
>
> Here is one quick and dirty proof of concept patch. It checks for stack
> depth and if remaining space is less than 20% of stack size, then it
> defers the bio submission to per queue worker.

Given that we are working around stack depth issues in the filesystems
already in several places, and now it seems like there's a reason to
work around it in the block layers as well, shouldn't we simply increase
the default stack size rather than introduce complexity and performance
regressions to try and work around not having enough stack?

I mean, we can deal with it like the ia32 4k stack issue was dealt with
(i.e. ignore those stupid XFS people, that's an XFS bug), or we can face
the reality that storage stacks have become so complex that 8k is no
longer a big enough stack for a modern system....

Cheers,

Dave.
On Mon, Sep 03, 2012 at 10:49:27AM +1000, Dave Chinner wrote:
> Given that we are working around stack depth issues in the
> filesystems already in several places, and now it seems like there's
> a reason to work around it in the block layers as well, shouldn't we
> simply increase the default stack size rather than introduce
> complexity and performance regressions to try and work around not
> having enough stack?
>
> I mean, we can deal with it like the ia32 4k stack issue was dealt
> with (i.e. ignore those stupid XFS people, that's an XFS bug), or
> we can face the reality that storage stacks have become so complex
> that 8k is no longer a big enough stack for a modern system....

I'm not arguing against increasing the default stack size (I really
don't have an opinion there) - but it's not a solution for the block
layer, as stacking block devices can require an unbounded amount of
stack without the generic_make_request() convert-recursion-to-iteration
thing.
On Fri, Aug 31, 2012 at 11:01:59AM -0400, Vivek Goyal wrote:
> On Thu, Aug 30, 2012 at 06:43:59PM -0700, Kent Overstreet wrote:
> > On Thu, Aug 30, 2012 at 06:07:45PM -0400, Vivek Goyal wrote:
> > > On Wed, Aug 29, 2012 at 10:13:45AM -0700, Kent Overstreet wrote:
> > >
> > > [..]
> > > > > Performance aside, punting submission to per device worker in case of deep
> > > > > stack usage sounds cleaner solution to me.
> > > >
> > > > Agreed, but performance tends to matter in the real world. And either
> > > > way the tricky bits are going to be confined to a few functions, so I
> > > > don't think it matters that much.
> > > >
> > > > If someone wants to code up the workqueue version and test it, they're
> > > > more than welcome...
> > >
> > > Here is one quick and dirty proof of concept patch. It checks for stack
> > > depth and if remaining space is less than 20% of stack size, then it
> > > defers the bio submission to per queue worker.
> >
> > I can't think of any correctness issues. I see some stuff that could be
> > simplified (blk_drain_deferred_bios() is redundant, just make it a
> > wrapper around blk_deferred_bio_work()).
> >
> > Still skeptical about the performance impact, though - frankly, on some
> > of the hardware I've been running bcache on this would be a visible
> > performance regression - probably double digit percentages but I'd have
> > to benchmark it. That kind of hardware/usage is not normal today,
> > but I've put a lot of work into performance and I don't want to make
> > things worse without good reason.
>
> Would you like to give this patch a quick try with bcache on your
> hardware and see how much performance impact you see?

If I can get a test system I can publish numbers on, set up with a
modern kernel, I will. Will take a bit though.

> Given the fact that submission through a worker happens only when
> stack usage is high, that should reduce the impact of the patch, and
> common use cases should remain unaffected.

Except depending on how users have their systems configured, it'll
either never happen or it'll happen for almost every bio. That makes
the performance overhead unpredictable, too.

> > Have you tested/benchmarked it?
>
> No, I have not. I will run some simple workloads on SSD.

Normal SATA SSDs are not going to show the overhead - AHCI is a pig and
it'll be lost in the noise.

> There are so many places in the kernel where worker threads do work on
> behalf of other processes. I think this is really a minor concern and I
> would not be too worried about it.

Yeah, but this is somewhat unprecedented in the amount of cpu time
you're potentially moving to worker threads. It is a concern.

> What is concerning though really is the greater stack usage due to the
> recursive nature of make_request() and the performance impact of
> deferral to a worker thread.

Your patch shouldn't increase stack usage (at least if your threshold is
safe - it's too high as is).
On Thu, 30 Aug 2012, Kent Overstreet wrote:
> On Thu, Aug 30, 2012 at 06:07:45PM -0400, Vivek Goyal wrote:
> > On Wed, Aug 29, 2012 at 10:13:45AM -0700, Kent Overstreet wrote:
> >
> > [..]
> > > > Performance aside, punting submission to per device worker in case of deep
> > > > stack usage sounds cleaner solution to me.
> > >
> > > Agreed, but performance tends to matter in the real world. And either
> > > way the tricky bits are going to be confined to a few functions, so I
> > > don't think it matters that much.
> > >
> > > If someone wants to code up the workqueue version and test it, they're
> > > more than welcome...
> >
> > Here is one quick and dirty proof of concept patch. It checks for stack
> > depth and if remaining space is less than 20% of stack size, then it
> > defers the bio submission to per queue worker.
>
> I can't think of any correctness issues. I see some stuff that could be
> simplified (blk_drain_deferred_bios() is redundant, just make it a
> wrapper around blk_deferred_bio_work()).
>
> Still skeptical about the performance impact, though - frankly, on some
> of the hardware I've been running bcache on this would be a visible
> performance regression - probably double digit percentages but I'd have
> to benchmark it. That kind of hardware/usage is not normal today,
> but I've put a lot of work into performance and I don't want to make
> things worse without good reason.
>
> Have you tested/benchmarked it?
>
> There's scheduling behaviour, too. We really want the workqueue thread's
> cpu time to be charged to the process that submitted the bio. (We could
> use a mechanism like that in other places, too... not like this is a new
> issue).
>
> This is going to be a real issue for users that need strong isolation -
> for any driver that uses non-negligible cpu (e.g. dm-crypt), we're
> breaking that (not that it wasn't broken already, but this makes it
> worse).

... or another possibility - start a timer when something is put to
current->bio_list and use that timer to pop entries off current->bio_list
and submit them to a workqueue. The timer can be cpu-local, so only
interrupt masking is required to synchronize against the timer.

This would normally run just like the current kernel, and in case of
deadlock the timer would kick in and resolve the deadlock.

> I could be convinced, but right now I prefer my solution.

It fixes the bio allocation problem, but not other similar mempool
problems in dm and md.

Mikulas
On Mon, Sep 03, 2012 at 04:41:37PM -0400, Mikulas Patocka wrote:
> ... or another possibility - start a timer when something is put to
> current->bio_list and use that timer to pop entries off current->bio_list
> and submit them to a workqueue. The timer can be cpu-local so only
> interrupt masking is required to synchronize against the timer.
>
> This would normally run just like the current kernel and in case of
> deadlock, the timer would kick in and resolve the deadlock.

Ugh. That's a _terrible_ idea.

Remember the old plugging code? You ever have to debug performance
issues caused by it?

> > I could be convinced, but right now I prefer my solution.
>
> It fixes bio allocation problem, but not other similar mempool problems in
> dm and md.

I looked a bit more, and actually I think the rest of the problem is
pretty limited in scope - most of those mempool allocations are per
request, not per split.

I'm willing to put some time into converting dm/md over to bioset's
front_pad. I'm having to learn the code for the immutable biovec work,
anyways.
On Mon, Sep 03, 2012 at 10:49:27AM +1000, Dave Chinner wrote:
> On Thu, Aug 30, 2012 at 06:07:45PM -0400, Vivek Goyal wrote:
> > On Wed, Aug 29, 2012 at 10:13:45AM -0700, Kent Overstreet wrote:
> >
> > [..]
> > > > Performance aside, punting submission to per device worker in case of deep
> > > > stack usage sounds cleaner solution to me.
> > >
> > > Agreed, but performance tends to matter in the real world. And either
> > > way the tricky bits are going to be confined to a few functions, so I
> > > don't think it matters that much.
> > >
> > > If someone wants to code up the workqueue version and test it, they're
> > > more than welcome...
> >
> > Here is one quick and dirty proof of concept patch. It checks for stack
> > depth and if remaining space is less than 20% of stack size, then it
> > defers the bio submission to per queue worker.
>
> Given that we are working around stack depth issues in the
> filesystems already in several places, and now it seems like there's
> a reason to work around it in the block layers as well, shouldn't we
> simply increase the default stack size rather than introduce
> complexity and performance regressions to try and work around not
> having enough stack?

Dave,

In this particular instance, we really don't have any bug reports of the
stack overflowing. We are just discussing what will happen if we make
generic_make_request() recursive again.

> I mean, we can deal with it like the ia32 4k stack issue was dealt
> with (i.e. ignore those stupid XFS people, that's an XFS bug), or
> we can face the reality that storage stacks have become so complex
> that 8k is no longer a big enough stack for a modern system....

So the first question will be: what's the right stack size? If we make
generic_make_request() recursive, then at some storage stack depth we
will overflow the stack anyway (if we have created too deep a stack).
Hence keeping the current logic kind of makes sense, as in theory we can
support arbitrary depth of storage stack.

Yes, if higher layers are consuming more stack, then it does raise the
question of whether to offload work to a worker and take the performance
hit, or to increase the stack size. I don't know what's the answer to
that question. I have only tried going through the archives, where some
people seem to have pushed for an even smaller stack size (4K).

Thanks
Vivek
On Fri, Aug 31, 2012 at 07:13:48PM -0700, Tejun Heo wrote:
> Hello, Vivek.
>
> On Thu, Aug 30, 2012 at 06:07:45PM -0400, Vivek Goyal wrote:
> > Here is one quick and dirty proof of concept patch. It checks for stack
> > depth and if remaining space is less than 20% of stack size, then it
> > defers the bio submission to per queue worker.
>
> So, it removes breadth-first walking of bio construction by ensuring
> stack overflow never happens by bouncing to workqueue if stack usage
> seems too high.
>
> I do like removal of breadth-first walking. It makes failure
> scenarios a lot less mind-bending. That said, Kent is right that this
> can incur significant overhead for certain configurations, and looking
> at stack usage in block layer is rather nasty both in design and
> implementation.
>
> If we're gonna need rescuer anyway and can get it right and the
> mechanism can be contained in block proper relatively well, I think it
> would be better to make breadth-first walking safe. Both are nasty in
> their own ways after all.

Hi Tejun,

That's fine. Breadth-first walking does make sense, given the fact that
the storage stack can be arbitrarily deep. And Kent's bioset rescuer
patch is not too bad. So I am fine with pursuing that patch.

Thanks
Vivek
Hello,

On Tue, Sep 04, 2012 at 09:54:23AM -0400, Vivek Goyal wrote:
> > Given that we are working around stack depth issues in the
> > filesystems already in several places, and now it seems like there's
> > a reason to work around it in the block layers as well, shouldn't we
> > simply increase the default stack size rather than introduce
> > complexity and performance regressions to try and work around not
> > having enough stack?
>
> Dave,
>
> In this particular instance, we really don't have any bug reports of
> stack overflowing. Just discussing what will happen if we make
> generic_make_request() recursive again.

I think there was one, and that's why we added the bio_list thing.

> > I mean, we can deal with it like the ia32 4k stack issue was dealt
> > with (i.e. ignore those stupid XFS people, that's an XFS bug), or
> > we can face the reality that storage stacks have become so complex
> > that 8k is no longer a big enough stack for a modern system....
>
> So first question will be, what's the right stack size? If we make
> generic_make_request() recursive, then at some storage stack depth we will
> overflow stack anyway (if we have created too deep a stack). Hence
> keeping current logic kind of makes sense as in theory we can support
> arbitrary depth of storage stack.

But, yeah, this can't be solved by enlarging the stack size. The
upper limit is unbounded.

Thanks.
Hello, Mikulas, Kent.

On Mon, Sep 03, 2012 at 08:41:00PM -0700, Kent Overstreet wrote:
> On Mon, Sep 03, 2012 at 04:41:37PM -0400, Mikulas Patocka wrote:
> > ... or another possibility - start a timer when something is put to
> > current->bio_list and use that timer to pop entries off current->bio_list
> > and submit them to a workqueue. The timer can be cpu-local so only
> > interrupt masking is required to synchronize against the timer.
> >
> > This would normally run just like the current kernel and in case of
> > deadlock, the timer would kick in and resolve the deadlock.
>
> Ugh. That's a _terrible_ idea.

That's exactly how workqueue rescuers work - rescuers kick in if new
worker creation doesn't succeed in a given amount of time. The
suggested mechanism already makes use of workqueues, so it's already
doing it. If you can think of a better way to detect the generic
stall condition, please be my guest.

> Remember the old plugging code? You ever have to debug performance
> issues caused by it?

That is not equivalent. Plugging was kicking in all the time, and it
wasn't entirely well-defined what the plugging / unplugging conditions
were. This type of rescuing for a forward-progress guarantee only kicks
in under severe memory pressure, and people expect finite latency and
throughput hits under such conditions. The usual bio / request /
scsi_cmd allocations could be failing under these circumstances, and
things could be progressing only thanks to the finite preallocated
pools. I don't think involving a rescue timer would be noticeably
detrimental.

Actually, if the timer approach can reduce the frequency of rescuer
involvement, I think it could actually be better.

Thanks.
On Tue, Sep 04, 2012 at 11:55:40AM -0700, Tejun Heo wrote:
> Actually, if the timer approach can reduce the frequency of rescuer
> involvement, I think it could actually be better.

Ooh, it wouldn't. It's kicking in only after alloc failure. I don't
know. I think conditioning it on alloc failure is cleaner, and
converting all per-bio allocations to front_pad makes sense. Using a
timer wouldn't make the mechanism any simpler, right?

Thanks.
On Mon, 3 Sep 2012, Kent Overstreet wrote:
> On Mon, Sep 03, 2012 at 04:41:37PM -0400, Mikulas Patocka wrote:
> > ... or another possibility - start a timer when something is put to
> > current->bio_list and use that timer to pop entries off current->bio_list
> > and submit them to a workqueue. The timer can be cpu-local so only
> > interrupt masking is required to synchronize against the timer.
> >
> > This would normally run just like the current kernel and in case of
> > deadlock, the timer would kick in and resolve the deadlock.
>
> Ugh. That's a _terrible_ idea.
>
> Remember the old plugging code? You ever have to debug performance
> issues caused by it?

Yes, I do remember it (and I fixed one bug that resulted in a missed
unplug and degraded performance).

But currently, deadlocks due to exhausted mempools and bios being
stalled in current->bio_list don't happen (or happen so rarely that
they aren't reported). If we add a timer, it will turn a deadlock into
an i/o delay, but it can't make things any worse.

BTW, can these new-style timerless plugs introduce deadlocks too? What
happens when some bios are indefinitely delayed because their requests
are held in a plug and a mempool runs out?

> > > I could be convinced, but right now I prefer my solution.
> >
> > It fixes bio allocation problem, but not other similar mempool problems in
> > dm and md.
>
> I looked a bit more, and actually I think the rest of the problem is
> pretty limited in scope - most of those mempool allocations are per
> request, not per split.
>
> I'm willing to put some time into converting dm/md over to bioset's
> front_pad. I'm having to learn the code for the immutable biovec work,
> anyways.

Currently, dm targets allocate request-specific data from a
target-specific mempool. There are mempools in dm-crypt, dm-delay,
dm-mirror, dm-snapshot, dm-thin and dm-verity. You could change them to
allocate request-specific data with the bio.

Mikulas
On Tue, Sep 04, 2012 at 03:26:19PM -0400, Mikulas Patocka wrote:

[..]

> BTW. can these new-style timerless plugs introduce deadlocks too? What
> happens when some bios are indefinitely delayed because their requests are
> held in a plug and a mempool runs out?

I think they will not deadlock, because these on-stack bios/requests
are flushed/dispatched when the process schedules out. So if a
submitter blocks on a mempool, it will be scheduled out and the
requests on the plug list will be dispatched.

Thanks
Vivek
On Tue, Sep 04, 2012 at 11:55:40AM -0700, Tejun Heo wrote:
> Hello, Mikulas, Kent.
>
> On Mon, Sep 03, 2012 at 08:41:00PM -0700, Kent Overstreet wrote:
> > On Mon, Sep 03, 2012 at 04:41:37PM -0400, Mikulas Patocka wrote:
> > > ... or another possibility - start a timer when something is put to
> > > current->bio_list and use that timer to pop entries off current->bio_list
> > > and submit them to a workqueue. The timer can be cpu-local so only
> > > interrupt masking is required to synchronize against the timer.
> > >
> > > This would normally run just like the current kernel and in case of
> > > deadlock, the timer would kick in and resolve the deadlock.
> >
> > Ugh. That's a _terrible_ idea.
>
> That's exactly how workqueue rescuers work - rescuers kick in if new
> worker creation doesn't succeed in given amount of time. The
> suggested mechanism already makes use of workqueue, so it's already
> doing it. If you can think of a better way to detect the generic
> stall condition, please be my guest.
>
> > Remember the old plugging code? You ever have to debug performance
> > issues caused by it?
>
> That is not equivalent. Plugging was kicking in all the time and it
> wasn't entirely well-defined what the plugging / unplugging conditions
> were. This type of rescuing for forward-progress guarantee only kicks
> in under severe memory pressure and people expect finite latency and
> throughput hits under such conditions. The usual bio / request /
> scsi_cmd allocations could be failing under these circumstances and
> things could be progressing only thanks to the finite preallocated
> pools. I don't think involving a rescue timer would be noticeably
> detrimental.
>
> Actually, if the timer approach can reduce the frequency of rescuer
> involvement, I think it could actually be better.

Ok, that was an overly harsh, emotional response. But I still hate the
idea.

You want to point me at the relevant workqueue code? I'd really like to
see what you did there; it's entirely possible you're aware of some
issue I'm not, but if not I'd like to take a stab at it.
On Tue, Sep 04, 2012 at 12:01:19PM -0700, Tejun Heo wrote:
> On Tue, Sep 04, 2012 at 11:55:40AM -0700, Tejun Heo wrote:
> > Actually, if the timer approach can reduce the frequency of rescuer
> > involvement, I think it could actually be better.
>
> Ooh, it wouldn't. It's kicking in only after alloc failure. I don't
> know. I think conditioning it on alloc failure is cleaner and
> converting all per-bio allocations to front-pad makes sense. Using a
> timer wouldn't make the mechanism any simpler, right?

Exactly
Hello,

On Tue, Sep 04, 2012 at 12:42:37PM -0700, Kent Overstreet wrote:
> You want to point me at the relevant workqueue code? I'd really like to
> see what you did there, it's entirely possible you're aware of some
> issue I'm not but if not I'd like to take a stab at it.

I was mistaken. The issue was that adding @gfp_flags to
kthread_create() wasn't trivial, involving updates to arch callbacks,
so the timer was added to side-step the issue. So, yeah, if it can be
made to work w/o the timer, I think that would be better.

Thanks.
On Tue, Sep 04, 2012 at 11:26:33AM -0700, Tejun Heo wrote:
> Hello,
>
> On Tue, Sep 04, 2012 at 09:54:23AM -0400, Vivek Goyal wrote:
> > > Given that we are working around stack depth issues in the
> > > filesystems already in several places, and now it seems like there's
> > > a reason to work around it in the block layers as well, shouldn't we
> > > simply increase the default stack size rather than introduce
> > > complexity and performance regressions to try and work around not
> > > having enough stack?
> >
> > Dave,
> >
> > In this particular instance, we really don't have any bug reports of
> > stack overflowing. Just discussing what will happen if we make
> > generic_make_request() recursive again.
>
> I think there was one and that's why we added the bio_list thing.

There was more than one - it was regular enough to be considered a
feature... :/

> > > I mean, we can deal with it like the ia32 4k stack issue was dealt
> > > with (i.e. ignore those stupid XFS people, that's an XFS bug), or
> > > we can face the reality that storage stacks have become so complex
> > > that 8k is no longer a big enough stack for a modern system....
> >
> > So first question will be, what's the right stack size? If we make
> > generic_make_request() recursive, then at some storage stack depth we will
> > overflow stack anyway (if we have created too deep a stack). Hence
> > keeping current logic kind of makes sense as in theory we can support
> > arbitrary depth of storage stack.
>
> But, yeah, this can't be solved by enlarging the stack size. The
> upper limit is unbounded.

Sure, but the recursion issue is isolated to the block layer.

If we can still submit IO directly through the block layer without
pushing it off to a work queue, then the overall stack usage problem
still exists. But if the block layer always pushes the IO off into
another workqueue to avoid stack overflows, then the context switches
are going to cause significant performance regressions for high IOPS
workloads. I don't really like either situation.

So while you are discussing stack issues, think a little about the
bigger picture outside of the immediate issue at hand - a better
solution for everyone might pop up....

Cheers,

Dave.
Hello, Dave.

On Wed, Sep 05, 2012 at 01:57:59PM +1000, Dave Chinner wrote:
> > But, yeah, this can't be solved by enlarging the stack size. The
> > upper limit is unbound.
>
> Sure, but recursion issue is isolated to the block layer.
>
> If we can still submit IO directly through the block layer without
> pushing it off to a work queue, then the overall stack usage problem
> still exists. But if the block layer always pushes the IO off into
> another workqueue to avoid stack overflows, then the context
> switches are going to cause significant performance regressions for
> high IOPS workloads.
>
> I don't really like either situation.

Kent's proposed solution doesn't do that. The rescuer work item is
used iff mempool allocation fails w/o GFP_WAIT. IOW, we're already
under severe memory pressure and stalls are expected all around the
kernel (somehow this sounds festive...) It doesn't alter the
breadth-first walk of bio decomposition and shouldn't degrade
performance in any noticeable way.

> So while you are discussing stack issues, think a little about the
> bigger picture outside of the immediate issue at hand - a better
> solution for everyone might pop up....

It's probably because I haven't been bitten much from stack overflow
but I'd like to keep thinking that stack overflows are extremely
unusual and the ones causing them are the bad ones.

Thank you very much. :p
Index: linux-2.6/include/linux/blkdev.h
===================================================================
--- linux-2.6.orig/include/linux/blkdev.h	2012-09-01 17:44:51.686485550 -0400
+++ linux-2.6/include/linux/blkdev.h	2012-09-01 18:09:58.805577658 -0400
@@ -430,6 +430,14 @@ struct request_queue {
 	/* Throttle data */
 	struct throtl_data *td;
 #endif
+
+	/*
+	 * Bio submission to queue can be deferred to a workqueue if stack
+	 * usage of submitter is high.
+	 */
+	struct bio_list		deferred_bios;
+	struct work_struct	deferred_bio_work;
+	struct workqueue_struct	*deferred_bio_workqueue;
 };
 
 #define QUEUE_FLAG_QUEUED	1	/* uses generic tag queueing */
Index: linux-2.6/block/blk-core.c
===================================================================
--- linux-2.6.orig/block/blk-core.c	2012-09-01 17:44:51.686485550 -0400
+++ linux-2.6/block/blk-core.c	2012-09-02 00:34:55.204091269 -0400
@@ -211,6 +211,23 @@ static void blk_delay_work(struct work_s
 	spin_unlock_irq(q->queue_lock);
 }
 
+static void blk_deferred_bio_work(struct work_struct *work)
+{
+	struct request_queue *q;
+	struct bio *bio = NULL;
+
+	q = container_of(work, struct request_queue, deferred_bio_work);
+
+	do {
+		spin_lock_irq(q->queue_lock);
+		bio = bio_list_pop(&q->deferred_bios);
+		spin_unlock_irq(q->queue_lock);
+		if (!bio)
+			break;
+		generic_make_request(bio);
+	} while (1);
+}
+
 /**
  * blk_delay_queue - restart queueing after defined interval
  * @q: The &struct request_queue in question
@@ -289,6 +306,7 @@ void blk_sync_queue(struct request_queue
 {
 	del_timer_sync(&q->timeout);
 	cancel_delayed_work_sync(&q->delay_work);
+	cancel_work_sync(&q->deferred_bio_work);
 }
 EXPORT_SYMBOL(blk_sync_queue);
 
@@ -351,6 +369,29 @@ void blk_put_queue(struct request_queue
 EXPORT_SYMBOL(blk_put_queue);
 
 /**
+ * blk_drain_deferred_bios - drain deferred bios
+ * @q: request_queue to drain deferred bios for
+ *
+ * Dispatch all currently deferred bios on @q through ->make_request_fn().
+ */
+static void blk_drain_deferred_bios(struct request_queue *q)
+{
+	struct bio_list bl;
+	struct bio *bio;
+	unsigned long flags;
+
+	bio_list_init(&bl);
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	bio_list_merge(&bl, &q->deferred_bios);
+	bio_list_init(&q->deferred_bios);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+
+	while ((bio = bio_list_pop(&bl)))
+		generic_make_request(bio);
+}
+
+/**
  * blk_drain_queue - drain requests from request_queue
  * @q: queue to drain
  * @drain_all: whether to drain all requests or only the ones w/ ELVPRIV
@@ -358,6 +399,10 @@ EXPORT_SYMBOL(blk_put_queue);
 * Drain requests from @q.  If @drain_all is set, all requests are drained.
 * If not, only ELVPRIV requests are drained.  The caller is responsible
 * for ensuring that no new requests which need to be drained are queued.
+ *
+ * Note: It does not drain bios on q->deferred_bios list.
+ * Call blk_drain_deferred_bios() if need be.
+ *
 */
 void blk_drain_queue(struct request_queue *q, bool drain_all)
 {
@@ -505,6 +550,9 @@ void blk_cleanup_queue(struct request_qu
 	spin_unlock_irq(lock);
 	mutex_unlock(&q->sysfs_lock);
 
+	/* First drain all deferred bios. */
+	blk_drain_deferred_bios(q);
+
 	/* drain all requests queued before DEAD marking */
 	blk_drain_queue(q, true);
 
@@ -614,11 +662,19 @@ struct request_queue *blk_alloc_queue_no
 	q->bypass_depth = 1;
 	__set_bit(QUEUE_FLAG_BYPASS, &q->queue_flags);
 
-	if (blkcg_init_queue(q))
+	bio_list_init(&q->deferred_bios);
+	INIT_WORK(&q->deferred_bio_work, blk_deferred_bio_work);
+	q->deferred_bio_workqueue = alloc_workqueue("kdeferbiod", WQ_MEM_RECLAIM, 0);
+	if (!q->deferred_bio_workqueue)
 		goto fail_id;
 
+	if (blkcg_init_queue(q))
+		goto fail_deferred_bio_wq;
+
 	return q;
 
+fail_deferred_bio_wq:
+	destroy_workqueue(q->deferred_bio_workqueue);
 fail_id:
 	ida_simple_remove(&blk_queue_ida, q->id);
 fail_q:
@@ -1635,8 +1691,10 @@ static inline int bio_check_eod(struct b
 	return 0;
 }
 
+
+
 static noinline_for_stack bool
-generic_make_request_checks(struct bio *bio)
+generic_make_request_checks_early(struct bio *bio)
 {
 	struct request_queue *q;
 	int nr_sectors = bio_sectors(bio);
@@ -1715,9 +1773,6 @@ generic_make_request_checks(struct bio *
 	 */
 	create_io_context(GFP_ATOMIC, q->node);
 
-	if (blk_throtl_bio(q, bio))
-		return false;	/* throttled, will be resubmitted later */
-
 	trace_block_bio_queue(q, bio);
 
 	return true;
@@ -1726,6 +1781,56 @@ end_io:
 	return false;
 }
 
+static noinline_for_stack bool
+generic_make_request_checks_late(struct bio *bio)
+{
+	struct request_queue *q;
+
+	q = bdev_get_queue(bio->bi_bdev);
+
+	/*
+	 * Various block parts want %current->io_context and lazy ioc
+	 * allocation ends up trading a lot of pain for a small amount of
+	 * memory.  Just allocate it upfront.  This may fail and block
+	 * layer knows how to live with it.
+	 */
+	create_io_context(GFP_ATOMIC, q->node);
+
+	if (blk_throtl_bio(q, bio))
+		return false;	/* throttled, will be resubmitted later */
+
+	return true;
+}
+
+static void __generic_make_request(struct bio *bio)
+{
+	struct request_queue *q;
+
+	if (!generic_make_request_checks_late(bio))
+		return;
+	q = bdev_get_queue(bio->bi_bdev);
+	q->make_request_fn(q, bio);
+}
+
+static void generic_make_request_defer_bio(struct bio *bio)
+{
+	struct request_queue *q;
+	unsigned long flags;
+
+	q = bdev_get_queue(bio->bi_bdev);
+
+	spin_lock_irqsave(q->queue_lock, flags);
+	if (unlikely(blk_queue_dead(q))) {
+		spin_unlock_irqrestore(q->queue_lock, flags);
+		bio_endio(bio, -ENODEV);
+		return;
+	}
+	set_bit(BIO_DEFERRED, &bio->bi_flags);
+	bio_list_add(&q->deferred_bios, bio);
+	spin_unlock_irqrestore(q->queue_lock, flags);
+	queue_work(q->deferred_bio_workqueue, &q->deferred_bio_work);
+}
+
 /**
  * generic_make_request - hand a buffer to its device driver for I/O
  * @bio:  The bio describing the location in memory and on the device.
@@ -1752,51 +1857,31 @@ end_io:
  */
 void generic_make_request(struct bio *bio)
 {
-	struct bio_list bio_list_on_stack;
+	unsigned long sp = 0;
+	unsigned int threshold = (THREAD_SIZE * 2)/10;
 
-	if (!generic_make_request_checks(bio))
-		return;
+	BUG_ON(bio->bi_next);
 
-	/*
-	 * We only want one ->make_request_fn to be active at a time, else
-	 * stack usage with stacked devices could be a problem.  So use
-	 * current->bio_list to keep a list of requests submited by a
-	 * make_request_fn function.  current->bio_list is also used as a
-	 * flag to say if generic_make_request is currently active in this
-	 * task or not.  If it is NULL, then no make_request is active.  If
-	 * it is non-NULL, then a make_request is active, and new requests
-	 * should be added at the tail
-	 */
-	if (current->bio_list) {
-		bio_list_add(current->bio_list, bio);
+	/* Submitting deferred bio from worker context. */
+	if (bio_flagged(bio, BIO_DEFERRED)) {
+		clear_bit(BIO_DEFERRED, &bio->bi_flags);
+		__generic_make_request(bio);
 		return;
 	}
 
-	/* following loop may be a bit non-obvious, and so deserves some
-	 * explanation.
-	 * Before entering the loop, bio->bi_next is NULL (as all callers
-	 * ensure that) so we have a list with a single bio.
-	 * We pretend that we have just taken it off a longer list, so
-	 * we assign bio_list to a pointer to the bio_list_on_stack,
-	 * thus initialising the bio_list of new bios to be
-	 * added.  ->make_request() may indeed add some more bios
-	 * through a recursive call to generic_make_request.  If it
-	 * did, we find a non-NULL value in bio_list and re-enter the loop
-	 * from the top.  In this case we really did just take the bio
-	 * of the top of the list (no pretending) and so remove it from
-	 * bio_list, and call into ->make_request() again.
-	 */
-	BUG_ON(bio->bi_next);
-	bio_list_init(&bio_list_on_stack);
-	current->bio_list = &bio_list_on_stack;
-	do {
-		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+	if (!generic_make_request_checks_early(bio))
+		return;
 
-		q->make_request_fn(q, bio);
+	/*
+	 * FIXME. Provide an arch dependent function to return left stack
+	 * space for current task. This is hack for x86_64.
+	 */
+	asm volatile("movq %%rsp,%0" : "=m"(sp));
 
-		bio = bio_list_pop(current->bio_list);
-	} while (bio);
-	current->bio_list = NULL; /* deactivate */
+	if ((sp - (unsigned long)end_of_stack(current)) < threshold)
+		generic_make_request_defer_bio(bio);
+	else
+		__generic_make_request(bio);
 }
 EXPORT_SYMBOL(generic_make_request);
 
Index: linux-2.6/block/blk-sysfs.c
===================================================================
--- linux-2.6.orig/block/blk-sysfs.c	2012-09-01 17:44:51.686485550 -0400
+++ linux-2.6/block/blk-sysfs.c	2012-09-01 18:09:58.808577661 -0400
@@ -505,6 +505,7 @@ static void blk_release_queue(struct kob
 
 	ida_simple_remove(&blk_queue_ida, q->id);
 	kmem_cache_free(blk_requestq_cachep, q);
+	destroy_workqueue(q->deferred_bio_workqueue);
 }
 
 static const struct sysfs_ops queue_sysfs_ops = {
Index: linux-2.6/include/linux/blk_types.h
===================================================================
--- linux-2.6.orig/include/linux/blk_types.h	2012-09-02 00:34:17.607086696 -0400
+++ linux-2.6/include/linux/blk_types.h	2012-09-02 00:34:21.997087104 -0400
@@ -105,6 +105,7 @@ struct bio {
 #define BIO_FS_INTEGRITY 9	/* fs owns integrity data, not block layer */
 #define BIO_QUIET	10	/* Make BIO Quiet */
 #define BIO_MAPPED_INTEGRITY 11/* integrity metadata has been remapped */
+#define BIO_DEFERRED	12	/* Bio was deferred for submission by worker */
 
 #define bio_flagged(bio, flag)	((bio)->bi_flags & (1 << (flag)))
 
 /*