Message ID | 1344457310-26442-1-git-send-email-Trond.Myklebust@netapp.com (mailing list archive) |
---|---|
State | New, archived |
On Thu, Aug 9, 2012 at 4:21 AM, Trond Myklebust <Trond.Myklebust@netapp.com> wrote: > Ever since commit 0a57cdac3f (NFSv4.1 send layoutreturn to fence > disconnected data server) we've been sending layoutreturn calls > while there is potentially still outstanding I/O to the data > servers. The reason we do this is to avoid races between replayed > writes to the MDS and the original writes to the DS. > > When this happens, the BUG_ON() in nfs4_layoutreturn_done can > be triggered because it assumes that we would never call > layoutreturn without knowing that all I/O to the DS is > finished. The fix is to remove the BUG_ON() now that the > assumptions behind the test are obsolete. > Isn't MDS supposed to recall the layout if races are possible between outstanding write-to-DS and write-through-MDS? And it causes data corruption for blocklayout if client returns layout while there is in-flight disk IO...
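[Editorial note: for readers following the patch, here is a minimal user-space model of the assumption the commit message says is now obsolete. The struct and function names are invented for illustration; this is not the fs/nfs kernel code or the actual diff.]

```c
/*
 * Illustrative model only -- NOT the fs/nfs code.  The old check assumed
 * layoutreturn completes only after every layout segment (lseg) reference
 * from DS I/O is gone; with layoutreturn now also used to fence a dead DS,
 * that assumption no longer holds.
 */
#include <assert.h>
#include <stdio.h>

struct layout_model {
	int outstanding_segs;	/* lsegs still referenced by in-flight DS I/O */
};

/* Old-style completion handler: only safe when all DS I/O has finished. */
static void layoutreturn_done_with_bug_on(struct layout_model *lo)
{
	assert(lo->outstanding_segs == 0);	/* the BUG_ON-style assumption */
}

/* Patched behaviour: tolerate outstanding segments (the fencing case). */
static void layoutreturn_done(struct layout_model *lo)
{
	if (lo->outstanding_segs)
		printf("layoutreturn completed with %d seg(s) still in flight (fencing)\n",
		       lo->outstanding_segs);
}

int main(void)
{
	struct layout_model fencing = { .outstanding_segs = 2 };
	struct layout_model clean   = { .outstanding_segs = 0 };

	layoutreturn_done(&fencing);		/* fine after the patch */
	layoutreturn_done_with_bug_on(&clean);	/* old check was only safe here */
	return 0;
}
```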
On Thu, 2012-08-09 at 22:30 +0800, Peng Tao wrote: > On Thu, Aug 9, 2012 at 4:21 AM, Trond Myklebust > <Trond.Myklebust@netapp.com> wrote: > > Ever since commit 0a57cdac3f (NFSv4.1 send layoutreturn to fence > > disconnected data server) we've been sending layoutreturn calls > > while there is potentially still outstanding I/O to the data > > servers. The reason we do this is to avoid races between replayed > > writes to the MDS and the original writes to the DS. > > > > When this happens, the BUG_ON() in nfs4_layoutreturn_done can > > be triggered because it assumes that we would never call > > layoutreturn without knowing that all I/O to the DS is > > finished. The fix is to remove the BUG_ON() now that the > > assumptions behind the test are obsolete. > > > Isn't MDS supposed to recall the layout if races are possible between > outstanding write-to-DS and write-through-MDS? Where do you read that in RFC5661? > And it causes data corruption for blocklayout if client returns layout > while there is in-flight disk IO... Then it needs to turn off fast failover to write-through-MDS. -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com
On Thu, Aug 9, 2012 at 10:36 PM, Myklebust, Trond <Trond.Myklebust@netapp.com> wrote: > On Thu, 2012-08-09 at 22:30 +0800, Peng Tao wrote: >> On Thu, Aug 9, 2012 at 4:21 AM, Trond Myklebust >> <Trond.Myklebust@netapp.com> wrote: >> > Ever since commit 0a57cdac3f (NFSv4.1 send layoutreturn to fence >> > disconnected data server) we've been sending layoutreturn calls >> > while there is potentially still outstanding I/O to the data >> > servers. The reason we do this is to avoid races between replayed >> > writes to the MDS and the original writes to the DS. >> > >> > When this happens, the BUG_ON() in nfs4_layoutreturn_done can >> > be triggered because it assumes that we would never call >> > layoutreturn without knowing that all I/O to the DS is >> > finished. The fix is to remove the BUG_ON() now that the >> > assumptions behind the test are obsolete. >> > >> Isn't MDS supposed to recall the layout if races are possible between >> outstanding write-to-DS and write-through-MDS? > > Where do you read that in RFC5661? > That's my (maybe mis-)understanding of how server works... But looking at rfc5661 section 18.44.3. layoutreturn implementation. " After this call, the client MUST NOT use the returned layout(s) and the associated storage protocol to access the file data. " And given commit 0a57cdac3f, client is using the layout even after layoutreturn, which IMHO is a violation of rfc5661. >> And it causes data corruption for blocklayout if client returns layout >> while there is in-flight disk IO... > > Then it needs to turn off fast failover to write-through-MDS. > If you still consider it following rfc5661, I'd choose to disable layoutreturn in before write-through-MDS for blocklayout, by adding some flag like PNFS_NO_LAYOUTRET_ON_FALLTHRU similar to objects' PNFS_LAYOUTRET_ON_SETATTR.
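[Editorial note: a hypothetical sketch of the opt-out flag Peng Tao proposes above. Neither the flag nor the helper exists in the kernel; the enum values mirror the names in the message but are simplified stand-ins, just to illustrate letting a layout driver (e.g. blocklayout) suppress the layoutreturn otherwise sent before falling back to write-through-MDS.]

```c
#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-ins for the flags discussed in the thread. */
enum ld_flag_model {
	LD_LAYOUTRET_ON_SETATTR		= 1 << 0,	/* modelled on the objects flag */
	LD_NO_LAYOUTRET_ON_FALLTHRU	= 1 << 1,	/* the proposed blocklayout opt-out */
};

struct layoutdriver_model {
	const char	*name;
	unsigned int	flags;
};

/* Would be consulted on the pnfs_ld_write_done() fallback path in this model. */
static bool send_layoutreturn_on_fallthru(const struct layoutdriver_model *ld)
{
	return !(ld->flags & LD_NO_LAYOUTRET_ON_FALLTHRU);
}

int main(void)
{
	struct layoutdriver_model blocklayout = {
		.name  = "blocklayout",
		.flags = LD_NO_LAYOUTRET_ON_FALLTHRU,
	};

	printf("%s: layoutreturn on MDS fallback? %s\n", blocklayout.name,
	       send_layoutreturn_on_fallthru(&blocklayout) ? "yes" : "no");
	return 0;
}
```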
On Thu, 2012-08-09 at 23:01 +0800, Peng Tao wrote: > On Thu, Aug 9, 2012 at 10:36 PM, Myklebust, Trond > <Trond.Myklebust@netapp.com> wrote: > > On Thu, 2012-08-09 at 22:30 +0800, Peng Tao wrote: > >> On Thu, Aug 9, 2012 at 4:21 AM, Trond Myklebust > >> <Trond.Myklebust@netapp.com> wrote: > >> > Ever since commit 0a57cdac3f (NFSv4.1 send layoutreturn to fence > >> > disconnected data server) we've been sending layoutreturn calls > >> > while there is potentially still outstanding I/O to the data > >> > servers. The reason we do this is to avoid races between replayed > >> > writes to the MDS and the original writes to the DS. > >> > > >> > When this happens, the BUG_ON() in nfs4_layoutreturn_done can > >> > be triggered because it assumes that we would never call > >> > layoutreturn without knowing that all I/O to the DS is > >> > finished. The fix is to remove the BUG_ON() now that the > >> > assumptions behind the test are obsolete. > >> > > >> Isn't MDS supposed to recall the layout if races are possible between > >> outstanding write-to-DS and write-through-MDS? > > > > Where do you read that in RFC5661? > > > That's my (maybe mis-)understanding of how server works... But looking > at rfc5661 section 18.44.3. layoutreturn implementation. > " > After this call, > the client MUST NOT use the returned layout(s) and the associated > storage protocol to access the file data. > " > And given commit 0a57cdac3f, client is using the layout even after > layoutreturn, which IMHO is a violation of rfc5661. No. It is using the layoutreturn to tell the MDS to fence off I/O to a data server that is not responding. It isn't attempting to use the layout after the layoutreturn: the whole point is that we are attempting write-through-MDS after the attempt to write through the DS timed out. > >> And it causes data corruption for blocklayout if client returns layout > >> while there is in-flight disk IO... > > > > Then it needs to turn off fast failover to write-through-MDS. > > > If you still consider it following rfc5661, I'd choose to disable > layoutreturn in before write-through-MDS for blocklayout, by adding > some flag like PNFS_NO_LAYOUTRET_ON_FALLTHRU similar to objects' > PNFS_LAYOUTRET_ON_SETATTR. I don't see how that will prevent corruption. In fact, IIRC, in NFSv4.2 Sorin's proposed changes specifically use the layoutreturn to communicate to the MDS that the DS is timing out via an error code (like the object layout has done all the time). How can you reconcile that change with a flag such as the one you propose? -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com
On Thu, Aug 9, 2012 at 11:39 PM, Myklebust, Trond <Trond.Myklebust@netapp.com> wrote: > On Thu, 2012-08-09 at 23:01 +0800, Peng Tao wrote: >> On Thu, Aug 9, 2012 at 10:36 PM, Myklebust, Trond >> <Trond.Myklebust@netapp.com> wrote: >> > On Thu, 2012-08-09 at 22:30 +0800, Peng Tao wrote: >> >> On Thu, Aug 9, 2012 at 4:21 AM, Trond Myklebust >> >> <Trond.Myklebust@netapp.com> wrote: >> >> > Ever since commit 0a57cdac3f (NFSv4.1 send layoutreturn to fence >> >> > disconnected data server) we've been sending layoutreturn calls >> >> > while there is potentially still outstanding I/O to the data >> >> > servers. The reason we do this is to avoid races between replayed >> >> > writes to the MDS and the original writes to the DS. >> >> > >> >> > When this happens, the BUG_ON() in nfs4_layoutreturn_done can >> >> > be triggered because it assumes that we would never call >> >> > layoutreturn without knowing that all I/O to the DS is >> >> > finished. The fix is to remove the BUG_ON() now that the >> >> > assumptions behind the test are obsolete. >> >> > >> >> Isn't MDS supposed to recall the layout if races are possible between >> >> outstanding write-to-DS and write-through-MDS? >> > >> > Where do you read that in RFC5661? >> > >> That's my (maybe mis-)understanding of how server works... But looking >> at rfc5661 section 18.44.3. layoutreturn implementation. >> " >> After this call, >> the client MUST NOT use the returned layout(s) and the associated >> storage protocol to access the file data. >> " >> And given commit 0a57cdac3f, client is using the layout even after >> layoutreturn, which IMHO is a violation of rfc5661. > > No. It is using the layoutreturn to tell the MDS to fence off I/O to a > data server that is not responding. It isn't attempting to use the > layout after the layoutreturn: the whole point is that we are attempting > write-through-MDS after the attempt to write through the DS timed out. > But it is RFC violation that there is in-flight DS IO when client sends layoutreturn, right? Not just in-flight, client is well possible to send IO to DS _after_ layoutreturn because some thread can hold lseg reference and not yet send IO. >> >> And it causes data corruption for blocklayout if client returns layout >> >> while there is in-flight disk IO... >> > >> > Then it needs to turn off fast failover to write-through-MDS. >> > >> If you still consider it following rfc5661, I'd choose to disable >> layoutreturn in before write-through-MDS for blocklayout, by adding >> some flag like PNFS_NO_LAYOUTRET_ON_FALLTHRU similar to objects' >> PNFS_LAYOUTRET_ON_SETATTR. > > I don't see how that will prevent corruption. > > In fact, IIRC, in NFSv4.2 Sorin's proposed changes specifically use the > layoutreturn to communicate to the MDS that the DS is timing out via an > error code (like the object layout has done all the time). How can you > reconcile that change with a flag such as the one you propose? I just intend to use the flag to disable layoutreturn in pnfs_ld_write_done. block extents are data access permissions per rfc5663. When we don't layoutreturn in pnfs_ld_write_done(), block layout works correctly because server can decide if there is data access race and if there is, MDS can recall the layout from client before applying the MDS writes. Sorin's proposed error code is just a client indication to server that there is disk access error. It is not intended to solve the data race between write-through-MDS and write-through-DS. 
Thanks, Tao
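[Editorial note: a toy model of the alternative Peng Tao is implying, in which the layoutreturn is simply deferred until the last layout-segment reference is dropped, so no thread that still holds an lseg can send DS I/O after the layout has been returned. Names are invented; this is not the client code.]

```c
#include <stdio.h>

struct lseg_model {
	int refcount;		/* threads that may still issue DS I/O */
	int return_wanted;	/* a layoutreturn has been requested */
};

static void request_layoutreturn(struct lseg_model *lseg)
{
	lseg->return_wanted = 1;
	if (lseg->refcount == 0)
		printf("no lseg references -> send LAYOUTRETURN now\n");
	else
		printf("deferring LAYOUTRETURN, %d reference(s) still held\n",
		       lseg->refcount);
}

static void lseg_put(struct lseg_model *lseg)
{
	if (--lseg->refcount == 0 && lseg->return_wanted)
		printf("last reference dropped -> safe to send LAYOUTRETURN\n");
}

int main(void)
{
	struct lseg_model lseg = { .refcount = 2 };

	request_layoutreturn(&lseg);	/* deferred: two writers still active */
	lseg_put(&lseg);
	lseg_put(&lseg);		/* now the deferred return can go out */
	return 0;
}
```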
On Fri, 2012-08-10 at 00:22 +0800, Peng Tao wrote: > On Thu, Aug 9, 2012 at 11:39 PM, Myklebust, Trond > <Trond.Myklebust@netapp.com> wrote: > > On Thu, 2012-08-09 at 23:01 +0800, Peng Tao wrote: > >> On Thu, Aug 9, 2012 at 10:36 PM, Myklebust, Trond > >> <Trond.Myklebust@netapp.com> wrote: > >> > On Thu, 2012-08-09 at 22:30 +0800, Peng Tao wrote: > >> >> On Thu, Aug 9, 2012 at 4:21 AM, Trond Myklebust > >> >> <Trond.Myklebust@netapp.com> wrote: > >> >> > Ever since commit 0a57cdac3f (NFSv4.1 send layoutreturn to fence > >> >> > disconnected data server) we've been sending layoutreturn calls > >> >> > while there is potentially still outstanding I/O to the data > >> >> > servers. The reason we do this is to avoid races between replayed > >> >> > writes to the MDS and the original writes to the DS. > >> >> > > >> >> > When this happens, the BUG_ON() in nfs4_layoutreturn_done can > >> >> > be triggered because it assumes that we would never call > >> >> > layoutreturn without knowing that all I/O to the DS is > >> >> > finished. The fix is to remove the BUG_ON() now that the > >> >> > assumptions behind the test are obsolete. > >> >> > > >> >> Isn't MDS supposed to recall the layout if races are possible between > >> >> outstanding write-to-DS and write-through-MDS? > >> > > >> > Where do you read that in RFC5661? > >> > > >> That's my (maybe mis-)understanding of how server works... But looking > >> at rfc5661 section 18.44.3. layoutreturn implementation. > >> " > >> After this call, > >> the client MUST NOT use the returned layout(s) and the associated > >> storage protocol to access the file data. > >> " > >> And given commit 0a57cdac3f, client is using the layout even after > >> layoutreturn, which IMHO is a violation of rfc5661. > > > > No. It is using the layoutreturn to tell the MDS to fence off I/O to a > > data server that is not responding. It isn't attempting to use the > > layout after the layoutreturn: the whole point is that we are attempting > > write-through-MDS after the attempt to write through the DS timed out. > > > But it is RFC violation that there is in-flight DS IO when client > sends layoutreturn, right? Not just in-flight, client is well possible > to send IO to DS _after_ layoutreturn because some thread can hold > lseg reference and not yet send IO. Once the write has been sent, how do you know that it is no longer 'in-flight' unless the DS responds? > >> >> And it causes data corruption for blocklayout if client returns layout > >> >> while there is in-flight disk IO... > >> > > >> > Then it needs to turn off fast failover to write-through-MDS. > >> > > >> If you still consider it following rfc5661, I'd choose to disable > >> layoutreturn in before write-through-MDS for blocklayout, by adding > >> some flag like PNFS_NO_LAYOUTRET_ON_FALLTHRU similar to objects' > >> PNFS_LAYOUTRET_ON_SETATTR. > > > > I don't see how that will prevent corruption. > > > > In fact, IIRC, in NFSv4.2 Sorin's proposed changes specifically use the > > layoutreturn to communicate to the MDS that the DS is timing out via an > > error code (like the object layout has done all the time). How can you > > reconcile that change with a flag such as the one you propose? > I just intend to use the flag to disable layoutreturn in > pnfs_ld_write_done. block extents are data access permissions per > rfc5663. 
When we don't layoutreturn in pnfs_ld_write_done(), block > layout works correctly because server can decide if there is data > access race and if there is, MDS can recall the layout from client > before applying the MDS writes. > > Sorin's proposed error code is just a client indication to server that > there is disk access error. It is not intended to solve the data race > between write-through-MDS and write-through-DS. Then how do you solve that race on a block device? -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com
On Fri, Aug 10, 2012 at 12:29 AM, Myklebust, Trond <Trond.Myklebust@netapp.com> wrote: > On Fri, 2012-08-10 at 00:22 +0800, Peng Tao wrote: >> On Thu, Aug 9, 2012 at 11:39 PM, Myklebust, Trond >> <Trond.Myklebust@netapp.com> wrote: >> > On Thu, 2012-08-09 at 23:01 +0800, Peng Tao wrote: >> >> On Thu, Aug 9, 2012 at 10:36 PM, Myklebust, Trond >> >> <Trond.Myklebust@netapp.com> wrote: >> >> > On Thu, 2012-08-09 at 22:30 +0800, Peng Tao wrote: >> >> >> On Thu, Aug 9, 2012 at 4:21 AM, Trond Myklebust >> >> >> <Trond.Myklebust@netapp.com> wrote: >> >> >> > Ever since commit 0a57cdac3f (NFSv4.1 send layoutreturn to fence >> >> >> > disconnected data server) we've been sending layoutreturn calls >> >> >> > while there is potentially still outstanding I/O to the data >> >> >> > servers. The reason we do this is to avoid races between replayed >> >> >> > writes to the MDS and the original writes to the DS. >> >> >> > >> >> >> > When this happens, the BUG_ON() in nfs4_layoutreturn_done can >> >> >> > be triggered because it assumes that we would never call >> >> >> > layoutreturn without knowing that all I/O to the DS is >> >> >> > finished. The fix is to remove the BUG_ON() now that the >> >> >> > assumptions behind the test are obsolete. >> >> >> > >> >> >> Isn't MDS supposed to recall the layout if races are possible between >> >> >> outstanding write-to-DS and write-through-MDS? >> >> > >> >> > Where do you read that in RFC5661? >> >> > >> >> That's my (maybe mis-)understanding of how server works... But looking >> >> at rfc5661 section 18.44.3. layoutreturn implementation. >> >> " >> >> After this call, >> >> the client MUST NOT use the returned layout(s) and the associated >> >> storage protocol to access the file data. >> >> " >> >> And given commit 0a57cdac3f, client is using the layout even after >> >> layoutreturn, which IMHO is a violation of rfc5661. >> > >> > No. It is using the layoutreturn to tell the MDS to fence off I/O to a >> > data server that is not responding. It isn't attempting to use the >> > layout after the layoutreturn: the whole point is that we are attempting >> > write-through-MDS after the attempt to write through the DS timed out. >> > >> But it is RFC violation that there is in-flight DS IO when client >> sends layoutreturn, right? Not just in-flight, client is well possible >> to send IO to DS _after_ layoutreturn because some thread can hold >> lseg reference and not yet send IO. > > Once the write has been sent, how do you know that it is no longer > 'in-flight' unless the DS responds? > >> >> >> And it causes data corruption for blocklayout if client returns layout >> >> >> while there is in-flight disk IO... >> >> > >> >> > Then it needs to turn off fast failover to write-through-MDS. >> >> > >> >> If you still consider it following rfc5661, I'd choose to disable >> >> layoutreturn in before write-through-MDS for blocklayout, by adding >> >> some flag like PNFS_NO_LAYOUTRET_ON_FALLTHRU similar to objects' >> >> PNFS_LAYOUTRET_ON_SETATTR. >> > >> > I don't see how that will prevent corruption. >> > >> > In fact, IIRC, in NFSv4.2 Sorin's proposed changes specifically use the >> > layoutreturn to communicate to the MDS that the DS is timing out via an >> > error code (like the object layout has done all the time). How can you >> > reconcile that change with a flag such as the one you propose? >> I just intend to use the flag to disable layoutreturn in >> pnfs_ld_write_done. block extents are data access permissions per >> rfc5663. 
When we don't layoutreturn in pnfs_ld_write_done(), block >> layout works correctly because server can decide if there is data >> access race and if there is, MDS can recall the layout from client >> before applying the MDS writes. >> >> Sorin's proposed error code is just a client indication to server that >> there is disk access error. It is not intended to solve the data race >> between write-through-MDS and write-through-DS. > > Then how do you solve that race on a block device? As mentioned above, block extents are permissions per RFC5663. So if MDS needs to access the disk, it needs the permission as well. So if there is data access race, MDS must recall the layout from client before processing the MDS writes. We've been dealing with the problem for years in MPFS and it works perfectly to rely on MDS's decisions. Thanks, Tao
On Fri, Aug 10, 2012 at 12:29 AM, Myklebust, Trond <Trond.Myklebust@netapp.com> wrote: > On Fri, 2012-08-10 at 00:22 +0800, Peng Tao wrote: >> On Thu, Aug 9, 2012 at 11:39 PM, Myklebust, Trond >> <Trond.Myklebust@netapp.com> wrote: >> > On Thu, 2012-08-09 at 23:01 +0800, Peng Tao wrote: >> >> On Thu, Aug 9, 2012 at 10:36 PM, Myklebust, Trond >> >> <Trond.Myklebust@netapp.com> wrote: >> >> > On Thu, 2012-08-09 at 22:30 +0800, Peng Tao wrote: >> >> >> On Thu, Aug 9, 2012 at 4:21 AM, Trond Myklebust >> >> >> <Trond.Myklebust@netapp.com> wrote: >> >> >> > Ever since commit 0a57cdac3f (NFSv4.1 send layoutreturn to fence >> >> >> > disconnected data server) we've been sending layoutreturn calls >> >> >> > while there is potentially still outstanding I/O to the data >> >> >> > servers. The reason we do this is to avoid races between replayed >> >> >> > writes to the MDS and the original writes to the DS. >> >> >> > >> >> >> > When this happens, the BUG_ON() in nfs4_layoutreturn_done can >> >> >> > be triggered because it assumes that we would never call >> >> >> > layoutreturn without knowing that all I/O to the DS is >> >> >> > finished. The fix is to remove the BUG_ON() now that the >> >> >> > assumptions behind the test are obsolete. >> >> >> > >> >> >> Isn't MDS supposed to recall the layout if races are possible between >> >> >> outstanding write-to-DS and write-through-MDS? >> >> > >> >> > Where do you read that in RFC5661? >> >> > >> >> That's my (maybe mis-)understanding of how server works... But looking >> >> at rfc5661 section 18.44.3. layoutreturn implementation. >> >> " >> >> After this call, >> >> the client MUST NOT use the returned layout(s) and the associated >> >> storage protocol to access the file data. >> >> " >> >> And given commit 0a57cdac3f, client is using the layout even after >> >> layoutreturn, which IMHO is a violation of rfc5661. >> > >> > No. It is using the layoutreturn to tell the MDS to fence off I/O to a >> > data server that is not responding. It isn't attempting to use the >> > layout after the layoutreturn: the whole point is that we are attempting >> > write-through-MDS after the attempt to write through the DS timed out. >> > >> But it is RFC violation that there is in-flight DS IO when client >> sends layoutreturn, right? Not just in-flight, client is well possible >> to send IO to DS _after_ layoutreturn because some thread can hold >> lseg reference and not yet send IO. > > Once the write has been sent, how do you know that it is no longer > 'in-flight' unless the DS responds? RFC5663 provides a way. " "blh_maximum_io_time" is the maximum time it can take for a client I/O to the storage system to either complete or fail " It is not perfect solution but still serves as a best effort. It solves the in-flight IO question for current writing thread. For in-flight IO from other concurrent threads, lseg reference is the source that we can rely on. And I think that the BUG_ON can be triggered much easily because of concurrent writing threads and one of them fails DS writes.
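[Editorial note: a sketch of how a client might apply the blh_maximum_io_time bound quoted above from RFC 5663, assuming a simple wall-clock check. After that interval a submitted disk I/O is guaranteed to have either completed or failed, so it can stop being treated as "in flight". This is a simplified model, not the blocklayout driver.]

```c
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

struct blk_io_model {
	time_t submitted_at;
};

/* Per the quoted text, the storage system guarantees the I/O has either
 * completed or failed once blh_maximum_io_time seconds have elapsed. */
static bool io_no_longer_in_flight(const struct blk_io_model *io,
				   time_t blh_maximum_io_time, time_t now)
{
	return now >= io->submitted_at + blh_maximum_io_time;
}

int main(void)
{
	time_t now = time(NULL);
	struct blk_io_model io = { .submitted_at = now - 30 };

	/* With a 20-second bound, this I/O is certainly done (or failed). */
	printf("in flight? %s\n",
	       io_no_longer_in_flight(&io, 20, now) ? "no" : "yes");
	return 0;
}
```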
On 08/09/2012 06:39 PM, Myklebust, Trond wrote: > If the problem is that the DS is failing to respond, how does the client > know that the in-flight I/O has ended? For the client, the above DS in question, has timed-out, we have reset it's session and closed it's sockets. And all it's RPC requests have been, or are being, ended with a timeout-error. So the timed-out DS is a no-op. All it's IO request will end very soon, if not already. A DS time-out is just a very valid, and meaningful response, just like an op-done-with-error. This was what Andy added to the RFC's errata which I agree with. > > No. It is using the layoutreturn to tell the MDS to fence off I/O to a > data server that is not responding. It isn't attempting to use the > layout after the layoutreturn: > the whole point is that we are attempting > write-through-MDS after the attempt to write through the DS timed out. > Trond STOP!!! this is pure bullshit. You guys took the opportunity of me being in Hospital, and the rest of the bunch not having a clue. And snuck in a patch that is totally wrong for everyone, not taking care of any other LD *crashes* . And especially when this patch is wrong even for files layout. This above here is where you are wrong!! You don't understand my point, and ignore my comments. So let me state it as clear as I can. (Lets assume files layout, for blocks and objects it's a bit different but mostly the same.) - Heavy IO is going on, the device_id in question has *3* DSs in it's device topography. Say DS1, DS2, DS3 - We have been queuing IO, and all queues are full. (we have 3 queues in in question, right? What is the maximum Q depth per files-DS? I know that in blocks and objects we usually have, I think, something like 128. This is a *tunable* in the block-layer's request-queue. Is it not some negotiated parameter with the NFS servers?) - Now, boom DS2 has timed-out. The Linux-client resets the session and internally closes all sockets of that session. All the RPCs that belong to DS2 are being returned up with a timeout error. This one is just the first of all those belonging to this DS2. They will be decrementing the reference for this layout very, very soon. - But what about DS1, and DS3 RPCs. What should we do with those? This is where you guys (Trond and Andy) are wrong. We must also wait for these RPC's as well. And opposite to what you think, this should not take long. Let me explain: We don't know anything about DS1 and DS3, each might be, either, "Having the same communication problem, like DS2". Or "is just working fine". So lets say for example that DS3 will also time-out in the future, and that DS1 is just fine and is writing as usual. * DS1 - Since it's working, it has most probably already done with all it's IO, because the NFS timeout is usually much longer then the normal RPC time, and since we are queuing evenly on all 3 DSs, at this point must probably, all of DS1 RPCs are already done. (And layout has been de-referenced). * DS3 - Will timeout in the future, when will that be? So let me start with, saying: (1). We could enhance our code and proactively, "cancel/abort" all RPCs that belong to DS3 (more on this below) (2). Or We can prove that DS3's RPCs will timeout at worst case 1 x NFS-timeout after above DS2 timeout event, or 2 x NFS-timeout after the queuing of the first timed-out RPC. And statistically in the average case DS3 will timeout very near the time DS2 timed-out. 
This is easy since the last IO we queued was the one that made DS2's queue full, and it was kept full because DS2 stopped responding and nothing emptied the queue. So the easiest we can do is wait for DS3 to timeout, soon enough, and once that happens, the session will be reset and all RPCs will end with an error. So in the worst case scenario we can recover 2 x NFS-timeout after a network partition, which is just 1 x NFS-timeout after your schizophrenic FENCE_ME_OFF, newly invented operation. What we can do to enhance our code to reduce error recovery to 1 x NFS-timeout: - DS3 above: (As I said DS1's queues are now empty, because it was working fine, So DS3 is a representation of all DS's that have RPCs at the time DS2 timed-out, which belong to this layout) We can proactively abort all RPCs belonging to DS3. If there is a way to internally abort RPC's use that. Else just reset its session and all sockets will close (and reopen), and all RPC's will end with a disconnect error. - Both DS2 that timed-out, and DS3 that was aborted. Should be marked with a flag. When new IO that belongs to some other inode through some other layout+device_id encounters a flagged device, it should abort and turn to MDS IO, with also invalidating its layout, and hence, soon enough the device_id for DS2&3 will be de-referenced and be removed from device cache. (And all referencing layouts are now gone) So we do not continue queuing new IO to dead devices. And since most probably MDS will not give us dead servers in new layout, we should be good. In summary. - FENCE_ME_OFF is a new operation, and is not === LAYOUT_RETURN. Client *must not* skb-send a single byte belonging to a layout, after the send of LAYOUT_RETURN. (It need not wait for OPT_DONE from DS to do that, it just must make sure that all its internal, or on-the-wire, requests are aborted by easily closing the sockets they belong to, and/or waiting for healthy DS's IO to be OPT_DONE. So the client is not dependent on any DS response, it is only dependent on its internal state being *clean* from any more skb-send(s)) - The proper implementation of LAYOUT_RETURN on error for fast turnover is not hard, and does not involve a new invented NFS operation such as FENCE_ME_OFF. A properly coded client, independently, without the aid of any FENCE_ME_OFF operation, can achieve a faster turnaround by actively returning all layouts that belong to a bad DS, and not waiting for a fence-off of a single layout, then encountering just the same error with all other layouts that have the same DS - And I know that just as you did not read my emails from before me going to Hospital, you will continue to not understand this one, or what I'm trying to explain, and will most probably ignore all of it. But please note one thing: YOU have sabotaged the NFS 4.1 Linux client, which is now totally not STD compliant, and have introduced CRASHes. And for no good reason. No thanks Boaz
On Sun, 2012-08-12 at 20:36 +0300, Boaz Harrosh wrote: > On 08/09/2012 06:39 PM, Myklebust, Trond wrote: > > If the problem is that the DS is failing to respond, how does the client > > know that the in-flight I/O has ended? > > For the client, the above DS in question, has timed-out, we have reset > it's session and closed it's sockets. And all it's RPC requests have > been, or are being, ended with a timeout-error. So the timed-out > DS is a no-op. All it's IO request will end very soon, if not already. > > A DS time-out is just a very valid, and meaningful response, just like > an op-done-with-error. This was what Andy added to the RFC's errata > which I agree with. > > > > > No. It is using the layoutreturn to tell the MDS to fence off I/O to a > > data server that is not responding. It isn't attempting to use the > > layout after the layoutreturn: > > > the whole point is that we are attempting > > write-through-MDS after the attempt to write through the DS timed out. > > > > Trond STOP!!! this is pure bullshit. You guys took the opportunity of > me being in Hospital, and the rest of the bunch not having a clue. And > snuck in a patch that is totally wrong for everyone, not taking care of > any other LD *crashes* . And especially when this patch is wrong even for > files layout. > > This above here is where you are wrong!! You don't understand my point, > and ignore my comments. So let me state it as clear as I can. YOU are ignoring the reality of SunRPC. There is no abort/cancel/timeout for an RPC call once it has started. This is why we need fencing _specifically_ for the pNFS files client. > (Lets assume files layout, for blocks and objects it's a bit different > but mostly the same.) That, and the fact that fencing hasn't been implemented for blocks and objects. The commit in question is 82c7c7a5a (NFSv4.1 return the LAYOUT for each file with failed DS connection I/O) and touches only fs/nfs/nfs4filelayout.c. It cannot be affecting blocks and objects. > - Heavy IO is going on, the device_id in question has *3* DSs in it's > device topography. Say DS1, DS2, DS3 > > - We have been queuing IO, and all queues are full. (we have 3 queues in > in question, right? What is the maximum Q depth per files-DS? I know > that in blocks and objects we usually have, I think, something like 128. > This is a *tunable* in the block-layer's request-queue. Is it not some > negotiated parameter with the NFS servers?) > > - Now, boom DS2 has timed-out. The Linux-client resets the session and > internally closes all sockets of that session. All the RPCs that > belong to DS2 are being returned up with a timeout error. This one > is just the first of all those belonging to this DS2. They will > be decrementing the reference for this layout very, very soon. > > - But what about DS1, and DS3 RPCs. What should we do with those? > This is where you guys (Trond and Andy) are wrong. We must also > wait for these RPC's as well. And opposite to what you think, this > should not take long. Let me explain: > > We don't know anything about DS1 and DS3, each might be, either, > "Having the same communication problem, like DS2". Or "is just working > fine". So lets say for example that DS3 will also time-out in the > future, and that DS1 is just fine and is writing as usual. 
> > * DS1 - Since it's working, it has most probably already done > with all it's IO, because the NFS timeout is usually much longer > then the normal RPC time, and since we are queuing evenly on > all 3 DSs, at this point must probably, all of DS1 RPCs are > already done. (And layout has been de-referenced). > > * DS3 - Will timeout in the future, when will that be? > So let me start with, saying: > (1). We could enhance our code and proactively, > "cancel/abort" all RPCs that belong to DS3 (more on this > below) Which makes the race _WORSE_. As I said above, there is no 'cancel RPC' operation in SUNRPC. Once your RPC call is launched, it cannot be recalled. All your discussion above is about the client side, and ignores what may be happening on the data server side. The fencing is what is needed to deal with the data server picture. > (2). Or We can prove that DS3's RPCs will timeout at worst > case 1 x NFS-timeout after above DS2 timeout event, or > 2 x NFS-timeout after the queuing of the first timed-out > RPC. And statistically in the average case DS3 will timeout > very near the time DS2 timed-out. > > This is easy since the last IO we queued was the one that > made DS2's queue to be full, and it was kept full because > DS2 stopped responding and nothing emptied the queue. > > So the easiest we can do is wait for DS3 to timeout, soon > enough, and once that will happen, session will be reset and all > RPCs will end with an error. You are still only discussing the client side. Read my lips: Sun RPC OPERATIONS DO NOT TIMEOUT AND CANNOT BE ABORTED OR CANCELED. Fencing is the closest we can come to an abort operation. > So in the worst case scenario we can recover 2 x NFS-timeout after > a network partition, which is just 1 x NFS-timeout, after your > schizophrenic FENCE_ME_OFF, newly invented operation. > > What we can do to enhance our code to reduce error recovery to > 1 x NFS-timeout: > > - DS3 above: > (As I said DS1's queues are now empty, because it was working fine, > So DS3 is a representation of all DS's that have RPCs at the > time DS2 timed-out, which belong to this layout) > > We can proactively abort all RPCs belonging to DS3. If there is > a way to internally abort RPC's use that. Else just reset it's > session and all sockets will close (and reopen), and all RPC's > will end with a disconnect error. Not on most servers that I'm aware of. If you close or reset the socket on the client, then the Linux server will happily continue to process those RPC calls; it just won't be able to send a reply. Furthermore, if the problem is that the data server isn't responding, then a socket close/reset tells you nothing either. > - Both DS2 that timed-out, and DS3 that was aborted. Should be > marked with a flag. When new IO that belong to some other > inode through some other layout+device_id encounters a flagged > device, it should abort and turn to MDS IO, with also invalidating > it's layout, and hens, soon enough the device_id for DS2&3 will be > de-referenced and be removed from device cache. (And all referencing > layouts are now gone) There is no RPC abort functionality Sun RPC. Again, this argument relies on functionality that _doesn't_ exist. > So we do not continue queuing new IO to dead devices. And since most > probably MDS will not give us dead servers in new layout, we should be > good. > In summery. > - FENCE_ME_OFF is a new operation, and is not === LAYOUT_RETURN. Client > *must not* skb-send a single byte belonging to a layout, after the send > of LAYOUT_RETURN. 
> (It need not wait for OPT_DONE from DS to do that, it just must make > sure, that all it's internal, or on-the-wire request, are aborted > by easily closing the sockets they belong too, and/or waiting for > healthy DS's IO to be OPT_DONE . So the client is not dependent on > any DS response, it is only dependent on it's internal state being > *clean* from any more skb-send(s)) Ditto > - The proper implementation of LAYOUT_RETURN on error for fast turnover > is not hard, and does not involve a new invented NFS operation such > as FENCE_ME_OFF. Proper codded client, independently, without > the aid of any FENCE_ME_OFF operation, can achieve a faster turnaround > by actively returning all layouts that belong to a bad DS, and not > waiting for a fence-off of a single layout, then encountering just > the same error with all other layouts that have the same DS What do you mean by "all layouts that belong to a bad DS"? Layouts don't belong to a DS, and so there is no way to get from a DS to a layout. > - And I know that just as you did not read my emails from before > me going to Hospital, you will continue to not understand this > one, or what I'm trying to explain, and will most probably ignore > all of it. But please note one thing: I read them, but just as now, they continue to ignore the reality about timeouts: timeouts mean _nothing_ in an RPC failover situation. There is no RPC abort functionality that you can rely on other than fencing. > YOU have sabotaged the NFS 4.1 Linux client, which is now totally > not STD complaint, and have introduced CRASHs. And for no good > reason. See above. -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com
On Mon, 2012-08-13 at 12:26 -0400, Trond Myklebust wrote: > On Sun, 2012-08-12 at 20:36 +0300, Boaz Harrosh wrote: > > We can proactively abort all RPCs belonging to DS3. If there is > > a way to internally abort RPC's use that. Else just reset it's > > session and all sockets will close (and reopen), and all RPC's > > will end with a disconnect error. > > Not on most servers that I'm aware of. If you close or reset the socket > on the client, then the Linux server will happily continue to process > those RPC calls; it just won't be able to send a reply. One small correction here: _If_ we are using NFSv4.2, and _if_ the client requests the EXCHGID4_FLAG_SUPP_FENCE_OPS in the EXCHANGE_ID operation, and _if_ the data server replies that it supports that, and _if_ the client gets a successful reply to a DESTROY_SESSION call to the data server, _then_ it can know that all RPC calls have completed. However, we're not supporting NFSv4.2 yet. > Furthermore, if the problem is that the data server isn't responding, > then a socket close/reset tells you nothing either. ...and we still have no solution for this case. -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com
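[Editorial note: a small sketch of the condition chain in the correction above; only when all of these hold can the client know that every RPC to the data server has completed. Field names are illustrative, not client code.]

```c
#include <stdbool.h>
#include <stdio.h>

struct ds_fence_model {
	bool nfs_v4_2;			/* mount is NFSv4.2 */
	bool supp_fence_ops;		/* EXCHGID4_FLAG_SUPP_FENCE_OPS negotiated */
	bool destroy_session_ok;	/* DESTROY_SESSION to the DS succeeded */
};

static bool all_ds_rpcs_known_complete(const struct ds_fence_model *ds)
{
	return ds->nfs_v4_2 && ds->supp_fence_ops && ds->destroy_session_ok;
}

int main(void)
{
	struct ds_fence_model ds = { .nfs_v4_2 = false };

	printf("all DS RPCs known complete? %s\n",
	       all_ds_rpcs_known_complete(&ds) ? "yes" : "no");
	return 0;
}
```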
On 08/13/2012 07:26 PM, Myklebust, Trond wrote:
>> This above here is where you are wrong!! You don't understand my point,
>> and ignore my comments. So let me state it as clear as I can.
>
> YOU are ignoring the reality of SunRPC. There is no abort/cancel/timeout
> for an RPC call once it has started. This is why we need fencing
> _specifically_ for the pNFS files client.
>

Again we have a communication problem between us. I say some words and
mean one thing, and you say and hear the same words but attach different
meanings to them. This is no one's fault; it just is.

Let's do an experiment: mount a regular NFSv4 export with -o soft and start
writing to the server, say with dd. Now disconnect the cable. After some
timeout the dd will return with "IO error" and will stop writing to the file.

This is the timeout I mean. Surely some RPC requests did not complete and
returned to the NFS core with some kind of error.

By RPC requests I do not mean the RPC protocol on the wire; I mean the
entity inside the Linux kernel which represents an RPC. Surely some
Linux RPC request objects were released not because a server "rpc-done"
was received, but because an internal mechanism called the "release"
method after a communication timeout.

So this is what I call "returned with a timeout". It does exist and is used
every day.

Even better: if I don't disconnect the wire but do an if_down or halt on the
server, the dd's IO error happens immediately, without waiting for any
timeout. This is because the socket is orderly closed and all
sends/receives return quickly with a "disconnect" error.

When I use a single server like the NFSv4 mount above, there is one fact
in the above scenario that I want to point out:

    At some point in the NFS core state, no more requests are issued,
    all old requests have been released, and an error is returned to the
    application. At that point the client will not call skb-send, and
    will not try further communication with the server.

This is what must happen with ALL DSs that belong to a layout before the
client should be LAYOUT_RETURN(ing). The client can only do its job. That is:

   STOP any skb-send to any of the DSs in a layout.
   Only then is it complying with the RFC.

So this is what I mean by "return with a timeout" below.

>> (Lets assume files layout, for blocks and objects it's a bit different
>>  but mostly the same.)
>
> That, and the fact that fencing hasn't been implemented for blocks and
> objects.

That's not true. Both at Panasas and at EMC there is fencing in place and
it is used every day. This is why I insist that it is very much
the same for all of us.

> The commit in question is 82c7c7a5a (NFSv4.1 return the LAYOUT
> for each file with failed DS connection I/O) and touches only
> fs/nfs/nfs4filelayout.c. It cannot be affecting blocks and objects.
>

OK, I had in mind the patches that Andy sent. I'll look again at what
actually went in. (It all happened while I was unavailable.)

>> - Heavy IO is going on, the device_id in question has *3* DSs in it's
>>   device topography. Say DS1, DS2, DS3
>>
>> - We have been queuing IO, and all queues are full. (we have 3 queues in
>>   in question, right? What is the maximum Q depth per files-DS? I know
>>   that in blocks and objects we usually have, I think, something like 128.
>>   This is a *tunable* in the block-layer's request-queue. Is it not some
>>   negotiated parameter with the NFS servers?)
>>
>> - Now, boom DS2 has timed-out. The Linux-client resets the session and
>>   internally closes all sockets of that session. All the RPCs that
>>   belong to DS2 are being returned up with a timeout error. This one
>>   is just the first of all those belonging to this DS2. They will
>>   be decrementing the reference for this layout very, very soon.
>>
>> - But what about DS1, and DS3 RPCs. What should we do with those?
>>   This is where you guys (Trond and Andy) are wrong. We must also
>>   wait for these RPC's as well. And opposite to what you think, this
>>   should not take long. Let me explain:
>>
>>   We don't know anything about DS1 and DS3, each might be, either,
>>   "Having the same communication problem, like DS2". Or "is just working
>>   fine". So lets say for example that DS3 will also time-out in the
>>   future, and that DS1 is just fine and is writing as usual.
>>
>>   * DS1 - Since it's working, it has most probably already done
>>     with all it's IO, because the NFS timeout is usually much longer
>>     then the normal RPC time, and since we are queuing evenly on
>>     all 3 DSs, at this point must probably, all of DS1 RPCs are
>>     already done. (And layout has been de-referenced).
>>
>>   * DS3 - Will timeout in the future, when will that be?
>>     So let me start with, saying:
>>     (1). We could enhance our code and proactively,
>>          "cancel/abort" all RPCs that belong to DS3 (more on this
>>          below)
>
> Which makes the race _WORSE_. As I said above, there is no 'cancel RPC'
> operation in SUNRPC. Once your RPC call is launched, it cannot be
> recalled. All your discussion above is about the client side, and
> ignores what may be happening on the data server side. The fencing is
> what is needed to deal with the data server picture.
>

Again, a misunderstanding. I never said we should not send
a LAYOUT_RETURN before writing through the MDS. The opposite is true;
I think it is a novel idea and gives you the kind of barrier that
will harden the system and make it more robust.

   WHAT I'm saying is that this cannot happen while the schizophrenic
   client is still busily skb-sending more and more bytes to all the
   other DSs in the layout, LONG AFTER THE LAYOUT_RETURN HAS BEEN SENT
   AND RESPONDED TO.

So what you are saying does not contradict what I want at all.

   "The fencing is what is needed to deal with the data server picture"

   Fine, but ONLY after the client has really stopped all sends.
   (Each one will do its job.)

BTW: The server does not *need* the client to send a LAYOUT_RETURN.
     It's just a nice-to-have, which I'm fine with.
     Both Panasas and EMC, when IO is sent through the MDS, will first
     recall overlapping layouts, and only then proceed with
     MDS processing. (This is some deeply rooted mechanism inside
     the FS, an MDS being just another client.)
     So this is a known problem that is taken care of. But I totally
     agree with you: the client LAYOUT_RETURN(ing) the layout will save
     lots of protocol time by avoiding the recalls.
     Now you understand why in objects we mandated this LAYOUT_RETURN
     on errors. And while at it we want the exact error reported.

>>     (2). Or We can prove that DS3's RPCs will timeout at worst
>>          case 1 x NFS-timeout after above DS2 timeout event, or
>>          2 x NFS-timeout after the queuing of the first timed-out
>>          RPC. And statistically in the average case DS3 will timeout
>>          very near the time DS2 timed-out.
>>
>>          This is easy since the last IO we queued was the one that
>>          made DS2's queue to be full, and it was kept full because
>>          DS2 stopped responding and nothing emptied the queue.
>>
>>      So the easiest we can do is wait for DS3 to timeout, soon
>>      enough, and once that will happen, session will be reset and all
>>      RPCs will end with an error.
>
>
> You are still only discussing the client side.
>
> Read my lips: Sun RPC OPERATIONS DO NOT TIMEOUT AND CANNOT BE ABORTED OR
> CANCELED. Fencing is the closest we can come to an abort operation.
>

Again, I did not mean the "Sun RPC OPERATIONS" on the wire. I meant
the Linux request entity which, while it exists, has the potential to be
submitted for skb-send. As seen above, these entities do time out in
"-o soft" mode, and once released they remove the potential of any more
future skb-sends on the wire.

BUT what I do not understand is: in the above example we are talking
about DS3. We assumed that DS3 has a communication problem. So no amount
of "fencing" or voodoo or any other kind of operation can ever affect
the client regarding DS3. Because even if the server-side pending requests
from the client on DS3 are fenced and discarded, these errors will not
be communicated back to the client. The client will sit idle on DS3
communication until the end of the timeout, regardless.

Actually, what I propose for DS3 in the most robust client is to
destroy DS3's sessions and therefore cause all Linux request entities
to return much, much faster than if *just waiting* for the timeout to
expire.

>> So in the worst case scenario we can recover 2 x NFS-timeout after
>> a network partition, which is just 1 x NFS-timeout, after your
>> schizophrenic FENCE_ME_OFF, newly invented operation.
>>
>> What we can do to enhance our code to reduce error recovery to
>> 1 x NFS-timeout:
>>
>> - DS3 above:
>>   (As I said DS1's queues are now empty, because it was working fine,
>>    So DS3 is a representation of all DS's that have RPCs at the
>>    time DS2 timed-out, which belong to this layout)
>>
>>   We can proactively abort all RPCs belonging to DS3. If there is
>>   a way to internally abort RPC's use that. Else just reset it's
>>   session and all sockets will close (and reopen), and all RPC's
>>   will end with a disconnect error.
>
> Not on most servers that I'm aware of. If you close or reset the socket
> on the client, then the Linux server will happily continue to process
> those RPC calls; it just won't be able to send a reply.
> Furthermore, if the problem is that the data server isn't responding,
> then a socket close/reset tells you nothing either.
>

Again, I'm talking about the NFS-internal request entities. These will
be released, thereby guaranteeing that no more threads will use any of
them to send any more bytes over to any DSs.

AND yes, yes. Once the client has done its job and stopped any future
skb-sends to *all* DSs in question, only then can it report to the MDS:

  "Hey, I'm done sending on all other routes here, LAYOUT_RETURN"
  (Now fencing happens on the servers)

  and the client goes on and says

  "Hey, can you, MDS, please also write this data"

Which is perfect for the MDS, because otherwise, if it wants to make sure,
it will need to recall all outstanding layouts, exactly for your
reason: concern for the data corruption that can happen.

>> - Both DS2 that timed-out, and DS3 that was aborted. Should be
>>   marked with a flag. When new IO that belong to some other
>>   inode through some other layout+device_id encounters a flagged
>>   device, it should abort and turn to MDS IO, with also invalidating
>>   it's layout, and hens, soon enough the device_id for DS2&3 will be
>>   de-referenced and be removed from device cache. (And all referencing
>>   layouts are now gone)
>
> There is no RPC abort functionality Sun RPC. Again, this argument relies
> on functionality that _doesn't_ exist.
>

Again, I mean internally at the client. For example, closing the socket
will have the effect I want. (And there are some other tricks; we can talk
about those later. Let's agree about the principle first.)

>>   So we do not continue queuing new IO to dead devices. And since most
>>   probably MDS will not give us dead servers in new layout, we should be
>>   good.
>> In summery.
>> - FENCE_ME_OFF is a new operation, and is not === LAYOUT_RETURN. Client
>>   *must not* skb-send a single byte belonging to a layout, after the send
>>   of LAYOUT_RETURN.
>>   (It need not wait for OPT_DONE from DS to do that, it just must make
>>    sure, that all it's internal, or on-the-wire request, are aborted
>>    by easily closing the sockets they belong too, and/or waiting for
>>    healthy DS's IO to be OPT_DONE . So the client is not dependent on
>>    any DS response, it is only dependent on it's internal state being
>>    *clean* from any more skb-send(s))
>
> Ditto
>
>> - The proper implementation of LAYOUT_RETURN on error for fast turnover
>>   is not hard, and does not involve a new invented NFS operation such
>>   as FENCE_ME_OFF. Proper codded client, independently, without
>>   the aid of any FENCE_ME_OFF operation, can achieve a faster turnaround
>>   by actively returning all layouts that belong to a bad DS, and not
>>   waiting for a fence-off of a single layout, then encountering just
>>   the same error with all other layouts that have the same DS
>
> What do you mean by "all layouts that belong to a bad DS"? Layouts don't
> belong to a DS, and so there is no way to get from a DS to a layout.
>

Why, sure there is: loop over all layouts and ask whether each one
references the specific DS.

>> - And I know that just as you did not read my emails from before
>>   me going to Hospital, you will continue to not understand this
>>   one, or what I'm trying to explain, and will most probably ignore
>>   all of it. But please note one thing:
>
> I read them, but just as now, they continue to ignore the reality about
> timeouts: timeouts mean _nothing_ in an RPC failover situation. There is
> no RPC abort functionality that you can rely on other than fencing.
>

I hope I have explained this by now. If not, please, please, let's organize
a phone call. We can use the Panasas conference number whenever you are
available. I think we communicate better in person.

Everyone else is also invited.

BUT there is one most important point for me:

   As stated by the RFC, the client must guarantee that no more bytes
   will be sent to any DSs in a layout once LAYOUT_RETURN is sent. This
   is the only definition of LAYOUT_RETURN, and of NO_MATCHING_LAYOUT as
   the response to a LAYOUT_RECALL. Which is:
   The client has indicated no more future sends on a layout. (And the
   server will enforce it with fencing.)

>>     YOU have sabotaged the NFS 4.1 Linux client, which is now totally
>>     not STD complaint, and have introduced CRASHs. And for no good
>>     reason.
>
> See above.
>

OK, we'll have to see about these crashes; let's talk about them.

Thanks
Boaz
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
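[Boaz's "loop on all layouts and ask if it has a specific DS" answer above is
easy to picture in code. Below is a minimal, self-contained sketch of that
idea; the struct layout and ds_addr types and all helper names are simplified
stand-ins for illustration, not the real fs/nfs/pnfs.h structures.]

#include <stdbool.h>

struct ds_addr { int id; };

struct layout {
	struct ds_addr *ds;	/* data servers this layout stripes over */
	int nr_ds;
	bool marked_for_return;	/* when set, new I/O falls back to the MDS */
	struct layout *next;
};

/* Does this layout reference the failed DS? */
static bool layout_uses_ds(const struct layout *lo, int failed_ds_id)
{
	for (int i = 0; i < lo->nr_ds; i++)
		if (lo->ds[i].id == failed_ds_id)
			return true;
	return false;
}

/*
 * Walk every cached layout and flag the ones that stripe over the bad DS,
 * so new I/O is redirected to the MDS and each layout can be returned once
 * its in-flight requests drain.
 */
static void mark_layouts_for_bad_ds(struct layout *layouts, int failed_ds_id)
{
	for (struct layout *lo = layouts; lo; lo = lo->next)
		if (layout_uses_ds(lo, failed_ds_id))
			lo->marked_for_return = true;
}

[In the real client the walk would have to take the appropriate locks and go
through the deviceid cache rather than a flat list; the sketch only shows the
DS-to-layout mapping Boaz is arguing is straightforward.]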
On Tue, 2012-08-14 at 02:39 +0300, Boaz Harrosh wrote:
> On 08/13/2012 07:26 PM, Myklebust, Trond wrote:
> > YOU are ignoring the reality of SunRPC. There is no abort/cancel/timeout
> > for an RPC call once it has started. This is why we need fencing
> > _specifically_ for the pNFS files client.
> [...]
> This is what must happen with ALL DSs that belong to a layout before the
> client should be LAYOUT_RETURN(ing). The client can only do its job. That is:
>
>    STOP any skb-send to any of the DSs in a layout.
>    Only then is it complying with the RFC.
>
> So this is what I mean by "return with a timeout" below.

I hear you, now listen to me.

Who _cares_ if the client sends an RPC call after the layoutreturn? In
the case of an unresponsive data server the client can't guarantee that
this won't happen even if it does wait.
A pNFS files server that doesn't propagate the layoutreturn to the data
server in a timely fashion is fundamentally _broken_ in the case where
the communication between the data server and client is down. It cannot
offer any data integrity guarantees when the client tries to write
through MDS, because the DSes may still be processing old write RPC
calls.

> > That, and the fact that fencing hasn't been implemented for blocks and
> > objects.
>
> That's not true. Both at Panasas and at EMC there is fencing in place and
> it is used every day. This is why I insist that it is very much
> the same for all of us.

I'm talking about the use of layoutreturn for client fencing, which is
only implemented for files.

However Tao admitted that the blocks client has not yet implemented the
timed-lease fencing as described in RFC5663, so there is still work to
be done there.

I've no idea what the object client is doing.

> OK, I had in mind the patches that Andy sent. I'll look again at what
> actually went in. (It all happened while I was unavailable.)

He sent a revised patch set, which should only affect the files layout.

> [...]
> BTW: The server does not *need* the client to send a LAYOUT_RETURN.
>      It's just a nice-to-have, which I'm fine with.
>      Both Panasas and EMC, when IO is sent through the MDS, will first
>      recall overlapping layouts, and only then proceed with
>      MDS processing. (This is some deeply rooted mechanism inside
>      the FS, an MDS being just another client.)

Again, we're not talking about blocks or objects.

> BUT what I do not understand is: in the above example we are talking
> about DS3. We assumed that DS3 has a communication problem. So no amount
> of "fencing" or voodoo or any other kind of operation can ever affect
> the client regarding DS3. Because even if the server-side pending requests
> from the client on DS3 are fenced and discarded, these errors will not
> be communicated back to the client. The client will sit idle on DS3
> communication until the end of the timeout, regardless.

We don't care about any receiving the errors. We've timed out. All we
want to do is to fence off the damned writes that have already been sent
to the borken DS and then resend them through the MDS.

> Actually, what I propose for DS3 in the most robust client is to
> destroy DS3's sessions and therefore cause all Linux request entities
> to return much, much faster than if *just waiting* for the timeout to
> expire.

I repeat: destroying the session on the client does NOTHING to help you.

> [...]
> Which is perfect for the MDS, because otherwise, if it wants to make sure,
> it will need to recall all outstanding layouts, exactly for your
> reason: concern for the data corruption that can happen.

So it recalls the layouts, and then... it _still_ has to fence off any
writes that are in progress on the broken DS.

All you've done is add a recall to the whole process. Why?

> Again, I mean internally at the client. For example, closing the socket
> will have the effect I want. (And there are some other tricks; we can talk
> about those later. Let's agree about the principle first.)

Timing out will prevent the damned client from sending more data. So
what?

> BUT there is one most important point for me:
>
>    As stated by the RFC, the client must guarantee that no more bytes
>    will be sent to any DSs in a layout once LAYOUT_RETURN is sent. [...]
>    The client has indicated no more future sends on a layout. (And the
>    server will enforce it with fencing.)

The client can't guarantee that. The protocol offers no way for it to do
so, no matter what the pNFS text may choose to say.

> OK, we'll have to see about these crashes; let's talk about them.
>
> Thanks
> Boaz

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
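[For readers following the argument, here is a rough, non-authoritative sketch
of the failover ordering Trond is defending for the files layout: fence first
via LAYOUTRETURN, then replay the data through the MDS. All type and function
names here are hypothetical placeholders, not the actual fs/nfs symbols.]

#include <stdbool.h>

struct io_request { int dummy; };	/* stands in for a pending NFS write */
struct layout_hdr { bool fenced; };	/* stands in for the pnfs layout header */

/* Ask the MDS to fence off this client's outstanding writes to the DS. */
static void send_layoutreturn(struct layout_hdr *lo)
{
	lo->fenced = true;
}

/* Replay the data through the MDS as an ordinary NFS WRITE. */
static void resend_through_mds(struct io_request *req)
{
	(void)req;
}

/* Called when a write to a data server times out. */
static void ds_write_timed_out(struct layout_hdr *lo, struct io_request *req)
{
	/*
	 * The client cannot cancel an RPC it has already handed to the DS,
	 * and it cannot know whether the DS will still process it later.
	 * The only safe ordering is therefore: fence first (LAYOUTRETURN to
	 * the MDS), then replay the write through the MDS.
	 */
	if (!lo->fenced)
		send_layoutreturn(lo);
	resend_through_mds(req);
}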
On 08/14/2012 03:16 AM, Myklebust, Trond wrote:
>
> The client can't guarantee that. The protocol offers no way for it to do
> so, no matter what the pNFS text may choose to say.
>

What? Why not? All the client needs to do is stop sending bytes.
What is not guaranteed? I completely fail to understand what you are
saying. How does stopping all sends not guarantee that?

Please explain

Thanks
Boaz
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
On Tue, 2012-08-14 at 03:28 +0300, Boaz Harrosh wrote:
> On 08/14/2012 03:16 AM, Myklebust, Trond wrote:
> >
> > The client can't guarantee that. The protocol offers no way for it to do
> > so, no matter what the pNFS text may choose to say.
> >
>
> What? Why not? All the client needs to do is stop sending bytes.
> What is not guaranteed? I completely fail to understand what you are
> saying. How does stopping all sends not guarantee that?

If the client has lost control of the transport, then it has no control
over what the data server sees and when. The data server can process the
client's write RPC 5 minutes from now, and the client will never know.

THAT is what is absurd about the whole "the client MUST NOT send..."
shebang. THAT is why there is no guarantee.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
On 2012-08-09 18:39, Myklebust, Trond wrote: > On Thu, 2012-08-09 at 23:01 +0800, Peng Tao wrote: >> On Thu, Aug 9, 2012 at 10:36 PM, Myklebust, Trond >> <Trond.Myklebust@netapp.com> wrote: >>> On Thu, 2012-08-09 at 22:30 +0800, Peng Tao wrote: >>>> On Thu, Aug 9, 2012 at 4:21 AM, Trond Myklebust >>>> <Trond.Myklebust@netapp.com> wrote: >>>>> Ever since commit 0a57cdac3f (NFSv4.1 send layoutreturn to fence >>>>> disconnected data server) we've been sending layoutreturn calls >>>>> while there is potentially still outstanding I/O to the data >>>>> servers. The reason we do this is to avoid races between replayed >>>>> writes to the MDS and the original writes to the DS. >>>>> >>>>> When this happens, the BUG_ON() in nfs4_layoutreturn_done can >>>>> be triggered because it assumes that we would never call >>>>> layoutreturn without knowing that all I/O to the DS is >>>>> finished. The fix is to remove the BUG_ON() now that the >>>>> assumptions behind the test are obsolete. >>>>> >>>> Isn't MDS supposed to recall the layout if races are possible between >>>> outstanding write-to-DS and write-through-MDS? >>> >>> Where do you read that in RFC5661? >>> >> That's my (maybe mis-)understanding of how server works... But looking >> at rfc5661 section 18.44.3. layoutreturn implementation. >> " >> After this call, >> the client MUST NOT use the returned layout(s) and the associated >> storage protocol to access the file data. >> " >> And given commit 0a57cdac3f, client is using the layout even after >> layoutreturn, which IMHO is a violation of rfc5661. > > No. It is using the layoutreturn to tell the MDS to fence off I/O to a > data server that is not responding. It isn't attempting to use the > layout after the layoutreturn: the whole point is that we are attempting > write-through-MDS after the attempt to write through the DS timed out. > I hear you, but this use case is valid after a time out / disconnect (which will translate to PNFS_OSD_ERR_UNREACHABLE for the objects layout) In other cases, I/Os to the DS might obviously be in flight and the BUG_ON indicates that. IMO, the right way to implement that is to initially mark the lsegs invalid and increment plh_block_lgets, as we do today in _pnfs_return_layout but actually send the layout return only when the last segment is dereferenced. Benny -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
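[A toy model of the approach Benny describes: mark the lsegs invalid, bump
plh_block_lgets, and only issue the LAYOUTRETURN when the last segment
reference is dropped. Apart from plh_block_lgets, which the thread itself
names, the fields and helpers below are simplified stand-ins, and locking is
omitted.]

#include <stdbool.h>

struct layout_hdr {
	int nr_lsegs;		/* live layout segment references */
	int plh_block_lgets;	/* while > 0, no new LAYOUTGETs are sent */
	bool return_pending;	/* a LAYOUTRETURN is owed once we drain */
};

static void send_layoutreturn(struct layout_hdr *lo)
{
	(void)lo;		/* issue the RPC; stub for this sketch */
}

/* Error path: invalidate the layout but do not send the LAYOUTRETURN yet. */
static void start_return_on_error(struct layout_hdr *lo)
{
	lo->plh_block_lgets++;	/* block new layout segments */
	lo->return_pending = true;
	/* the real code would mark the existing lsegs invalid here;
	 * in-flight I/O still holds references to them */
}

/* Called whenever a layout segment reference is dropped. */
static void put_lseg(struct layout_hdr *lo)
{
	if (--lo->nr_lsegs == 0 && lo->return_pending) {
		lo->return_pending = false;
		send_layoutreturn(lo);	/* only now is the layout quiesced */
	}
}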
On Tue, 2012-08-14 at 10:48 +0300, Benny Halevy wrote: > On 2012-08-09 18:39, Myklebust, Trond wrote: > > On Thu, 2012-08-09 at 23:01 +0800, Peng Tao wrote: > >> On Thu, Aug 9, 2012 at 10:36 PM, Myklebust, Trond > >> <Trond.Myklebust@netapp.com> wrote: > >>> On Thu, 2012-08-09 at 22:30 +0800, Peng Tao wrote: > >>>> On Thu, Aug 9, 2012 at 4:21 AM, Trond Myklebust > >>>> <Trond.Myklebust@netapp.com> wrote: > >>>>> Ever since commit 0a57cdac3f (NFSv4.1 send layoutreturn to fence > >>>>> disconnected data server) we've been sending layoutreturn calls > >>>>> while there is potentially still outstanding I/O to the data > >>>>> servers. The reason we do this is to avoid races between replayed > >>>>> writes to the MDS and the original writes to the DS. > >>>>> > >>>>> When this happens, the BUG_ON() in nfs4_layoutreturn_done can > >>>>> be triggered because it assumes that we would never call > >>>>> layoutreturn without knowing that all I/O to the DS is > >>>>> finished. The fix is to remove the BUG_ON() now that the > >>>>> assumptions behind the test are obsolete. > >>>>> > >>>> Isn't MDS supposed to recall the layout if races are possible between > >>>> outstanding write-to-DS and write-through-MDS? > >>> > >>> Where do you read that in RFC5661? > >>> > >> That's my (maybe mis-)understanding of how server works... But looking > >> at rfc5661 section 18.44.3. layoutreturn implementation. > >> " > >> After this call, > >> the client MUST NOT use the returned layout(s) and the associated > >> storage protocol to access the file data. > >> " > >> And given commit 0a57cdac3f, client is using the layout even after > >> layoutreturn, which IMHO is a violation of rfc5661. > > > > No. It is using the layoutreturn to tell the MDS to fence off I/O to a > > data server that is not responding. It isn't attempting to use the > > layout after the layoutreturn: the whole point is that we are attempting > > write-through-MDS after the attempt to write through the DS timed out. > > > > I hear you, but this use case is valid after a time out / disconnect > (which will translate to PNFS_OSD_ERR_UNREACHABLE for the objects layout) > In other cases, I/Os to the DS might obviously be in flight and the BUG_ON > indicates that. > > IMO, the right way to implement that is to initially mark the lsegs invalid > and increment plh_block_lgets, as we do today in _pnfs_return_layout > but actually send the layout return only when the last segment is dereferenced. This is what we do for object and block layout types, so your objects-specific objection is unfounded. As I understand it, iSCSI has different semantics w.r.t. disconnect and timeout, which means that the client can in principle rely on a timeout leaving the DS in a known state. Ditto for FCP. I've no idea about other block/object transport types, but I assume those that support multi-pathing implement similar devices. The problem is that RPC does not, so the files layout needs to be treated differently. -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@netapp.com www.netapp.com
On Tue, Aug 14, 2012 at 9:45 PM, Myklebust, Trond <Trond.Myklebust@netapp.com> wrote: > On Tue, 2012-08-14 at 10:48 +0300, Benny Halevy wrote: >> On 2012-08-09 18:39, Myklebust, Trond wrote: >> > On Thu, 2012-08-09 at 23:01 +0800, Peng Tao wrote: >> >> On Thu, Aug 9, 2012 at 10:36 PM, Myklebust, Trond >> >> <Trond.Myklebust@netapp.com> wrote: >> >>> On Thu, 2012-08-09 at 22:30 +0800, Peng Tao wrote: >> >>>> On Thu, Aug 9, 2012 at 4:21 AM, Trond Myklebust >> >>>> <Trond.Myklebust@netapp.com> wrote: >> >>>>> Ever since commit 0a57cdac3f (NFSv4.1 send layoutreturn to fence >> >>>>> disconnected data server) we've been sending layoutreturn calls >> >>>>> while there is potentially still outstanding I/O to the data >> >>>>> servers. The reason we do this is to avoid races between replayed >> >>>>> writes to the MDS and the original writes to the DS. >> >>>>> >> >>>>> When this happens, the BUG_ON() in nfs4_layoutreturn_done can >> >>>>> be triggered because it assumes that we would never call >> >>>>> layoutreturn without knowing that all I/O to the DS is >> >>>>> finished. The fix is to remove the BUG_ON() now that the >> >>>>> assumptions behind the test are obsolete. >> >>>>> >> >>>> Isn't MDS supposed to recall the layout if races are possible between >> >>>> outstanding write-to-DS and write-through-MDS? >> >>> >> >>> Where do you read that in RFC5661? >> >>> >> >> That's my (maybe mis-)understanding of how server works... But looking >> >> at rfc5661 section 18.44.3. layoutreturn implementation. >> >> " >> >> After this call, >> >> the client MUST NOT use the returned layout(s) and the associated >> >> storage protocol to access the file data. >> >> " >> >> And given commit 0a57cdac3f, client is using the layout even after >> >> layoutreturn, which IMHO is a violation of rfc5661. >> > >> > No. It is using the layoutreturn to tell the MDS to fence off I/O to a >> > data server that is not responding. It isn't attempting to use the >> > layout after the layoutreturn: the whole point is that we are attempting >> > write-through-MDS after the attempt to write through the DS timed out. >> > >> >> I hear you, but this use case is valid after a time out / disconnect >> (which will translate to PNFS_OSD_ERR_UNREACHABLE for the objects layout) >> In other cases, I/Os to the DS might obviously be in flight and the BUG_ON >> indicates that. >> >> IMO, the right way to implement that is to initially mark the lsegs invalid >> and increment plh_block_lgets, as we do today in _pnfs_return_layout >> but actually send the layout return only when the last segment is dereferenced. > > This is what we do for object and block layout types, so your > objects-specific objection is unfounded. > object layout is also doing layout return on IO error (commit fe0fe83585f8). And it doesn't take care of draining concurrent in-flight IO. I guess that's why Boaz saw the same BUG_ON.
On Tue, 2012-08-14 at 22:30 +0800, Peng Tao wrote:
> On Tue, Aug 14, 2012 at 9:45 PM, Myklebust, Trond
> <Trond.Myklebust@netapp.com> wrote:
> > On Tue, 2012-08-14 at 10:48 +0300, Benny Halevy wrote:
> > [...]
> > This is what we do for object and block layout types, so your
> > objects-specific objection is unfounded.
> >
> object layout is also doing layout return on IO error (commit
> fe0fe83585f8). And it doesn't take care of draining concurrent
> in-flight IO. I guess that's why Boaz saw the same BUG_ON.

Yes. I did notice that code when I was looking into this. However that's
Boaz's own patch, and it _only_ applies to the objects layout type. I
assumed that he had tested it when I applied it...

One way to fix that would be to keep a count of "outstanding
read/writes" in the layout, so that when the error occurs, and we want
to fall back to MDS, we just increment plh_block_lgets, invalidate the
layout, and then let the outstanding read/writes fall to zero before
sending the layoutreturn.
If the objects layout wants to do that, then I have no objection. As
I've said multiple times, though, I'm not convinced we want to do that
for the files layout.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
On 2012-08-14 17:53, Myklebust, Trond wrote: > On Tue, 2012-08-14 at 22:30 +0800, Peng Tao wrote: >> On Tue, Aug 14, 2012 at 9:45 PM, Myklebust, Trond >> <Trond.Myklebust@netapp.com> wrote: >>> On Tue, 2012-08-14 at 10:48 +0300, Benny Halevy wrote: >>>> On 2012-08-09 18:39, Myklebust, Trond wrote: >>>>> On Thu, 2012-08-09 at 23:01 +0800, Peng Tao wrote: >>>>>> On Thu, Aug 9, 2012 at 10:36 PM, Myklebust, Trond >>>>>> <Trond.Myklebust@netapp.com> wrote: >>>>>>> On Thu, 2012-08-09 at 22:30 +0800, Peng Tao wrote: >>>>>>>> On Thu, Aug 9, 2012 at 4:21 AM, Trond Myklebust >>>>>>>> <Trond.Myklebust@netapp.com> wrote: >>>>>>>>> Ever since commit 0a57cdac3f (NFSv4.1 send layoutreturn to fence >>>>>>>>> disconnected data server) we've been sending layoutreturn calls >>>>>>>>> while there is potentially still outstanding I/O to the data >>>>>>>>> servers. The reason we do this is to avoid races between replayed >>>>>>>>> writes to the MDS and the original writes to the DS. >>>>>>>>> >>>>>>>>> When this happens, the BUG_ON() in nfs4_layoutreturn_done can >>>>>>>>> be triggered because it assumes that we would never call >>>>>>>>> layoutreturn without knowing that all I/O to the DS is >>>>>>>>> finished. The fix is to remove the BUG_ON() now that the >>>>>>>>> assumptions behind the test are obsolete. >>>>>>>>> >>>>>>>> Isn't MDS supposed to recall the layout if races are possible between >>>>>>>> outstanding write-to-DS and write-through-MDS? >>>>>>> >>>>>>> Where do you read that in RFC5661? >>>>>>> >>>>>> That's my (maybe mis-)understanding of how server works... But looking >>>>>> at rfc5661 section 18.44.3. layoutreturn implementation. >>>>>> " >>>>>> After this call, >>>>>> the client MUST NOT use the returned layout(s) and the associated >>>>>> storage protocol to access the file data. >>>>>> " >>>>>> And given commit 0a57cdac3f, client is using the layout even after >>>>>> layoutreturn, which IMHO is a violation of rfc5661. >>>>> >>>>> No. It is using the layoutreturn to tell the MDS to fence off I/O to a >>>>> data server that is not responding. It isn't attempting to use the >>>>> layout after the layoutreturn: the whole point is that we are attempting >>>>> write-through-MDS after the attempt to write through the DS timed out. >>>>> >>>> >>>> I hear you, but this use case is valid after a time out / disconnect >>>> (which will translate to PNFS_OSD_ERR_UNREACHABLE for the objects layout) >>>> In other cases, I/Os to the DS might obviously be in flight and the BUG_ON >>>> indicates that. >>>> >>>> IMO, the right way to implement that is to initially mark the lsegs invalid >>>> and increment plh_block_lgets, as we do today in _pnfs_return_layout >>>> but actually send the layout return only when the last segment is dereferenced. >>> >>> This is what we do for object and block layout types, so your >>> objects-specific objection is unfounded. >>> >> object layout is also doing layout return on IO error (commit >> fe0fe83585f8). And it doesn't take care of draining concurrent >> in-flight IO. I guess that's why Boaz saw the same BUG_ON. > > Yes. I did notice that code when I was looking into this. However that's > Boaz's own patch, and it _only_ applies to the objects layout type. I > assumed that he had tested it when I applied it... 
> > One way to fix that would be to keep a count of "outstanding > read/writes" in the layout, so that when the error occurs, and we want > to fall back to MDS, we just increment plh_block_lgets, invalidate the > layout, and then let the outstanding read/writes fall to zero before > sending the layoutreturn. Sounds reasonable to me too. > If the objects layout wants to do that, then I have no objection. As > I've said multiple times, though, I'm not convinced we want to do that > for the files layout. > I just fear that removing the BUG_ON will prevent us from detecting cases where a LAYOUTRETURN is sent while there are layout segments in use in the error free or non-timeout case. Benny -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
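[And a comparable sketch of the counter-based variant Trond outlines above,
which Benny finds reasonable: keep a per-layout count of outstanding
reads/writes and defer the LAYOUTRETURN until it drains to zero. Again, only
plh_block_lgets comes from the discussion; the other names are illustrative
stand-ins, and the real code would do this under the layout lock.]

#include <stdatomic.h>
#include <stdbool.h>

struct layout_hdr {
	atomic_int plh_outstanding;	/* in-flight read/write RPCs */
	int plh_block_lgets;		/* while > 0, no new LAYOUTGETs */
	atomic_bool return_on_drain;	/* LAYOUTRETURN owed once idle */
};

static void send_layoutreturn(struct layout_hdr *lo)
{
	(void)lo;			/* issue the RPC; stub for this sketch */
}

static void io_start(struct layout_hdr *lo)
{
	atomic_fetch_add(&lo->plh_outstanding, 1);
}

/* Error path: block new layoutgets, invalidate, and arm the deferred return. */
static void fall_back_to_mds(struct layout_hdr *lo)
{
	lo->plh_block_lgets++;
	/* mark the existing layout segments invalid here */
	atomic_store(&lo->return_on_drain, true);
	if (atomic_load(&lo->plh_outstanding) == 0 &&
	    atomic_exchange(&lo->return_on_drain, false))
		send_layoutreturn(lo);
}

/* Completion path for every read/write, success or failure. */
static void io_done(struct layout_hdr *lo)
{
	if (atomic_fetch_sub(&lo->plh_outstanding, 1) == 1 &&
	    atomic_exchange(&lo->return_on_drain, false))
		send_layoutreturn(lo);
}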
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c index f94f6b3..c77d296 100644 --- a/fs/nfs/nfs4proc.c +++ b/fs/nfs/nfs4proc.c @@ -6359,12 +6359,8 @@ static void nfs4_layoutreturn_done(struct rpc_task *task, void *calldata) return; } spin_lock(&lo->plh_inode->i_lock); - if (task->tk_status == 0) { - if (lrp->res.lrs_present) { - pnfs_set_layout_stateid(lo, &lrp->res.stateid, true); - } else - BUG_ON(!list_empty(&lo->plh_segs)); - } + if (task->tk_status == 0 && lrp->res.lrs_present) + pnfs_set_layout_stateid(lo, &lrp->res.stateid, true); lo->plh_block_lgets--; spin_unlock(&lo->plh_inode->i_lock); dprintk("<-- %s\n", __func__);
Ever since commit 0a57cdac3f (NFSv4.1 send layoutreturn to fence disconnected data server) we've been sending layoutreturn calls while there is potentially still outstanding I/O to the data servers. The reason we do this is to avoid races between replayed writes to the MDS and the original writes to the DS. When this happens, the BUG_ON() in nfs4_layoutreturn_done can be triggered because it assumes that we would never call layoutreturn without knowing that all I/O to the DS is finished. The fix is to remove the BUG_ON() now that the assumptions behind the test are obsolete. Reported-by: Boaz Harrosh <bharrosh@panasas.com> Reported-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@vger.kernel.org [>=3.5] --- fs/nfs/nfs4proc.c | 8 ++------ 1 file changed, 2 insertions(+), 6 deletions(-)