[v2,0/2] enhance NFSv4.2 SSC to delay unmount source's export.

Message ID 20210402233031.36731-1-dai.ngo@oracle.com (mailing list archive)

Message

Dai Ngo April 2, 2021, 11:30 p.m. UTC
Hi,

Currently the source's export is mounted and unmounted on every
inter-server copy operation. This causes unnecessary overhead
for each copy.

This patch series is an enhancement that allows the export to remain
mounted for a configurable period (defaulting to 15 minutes). If the
export is not used within the configured time, it is unmounted by a
delayed task. If it is used again, its expiration time is extended
for another period.
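Roughly, the intended lifecycle of the cached mount looks like this (an
illustrative model in Python, not the actual nfsd code; the class and
method names here are made up for the sketch):

```python
UNMOUNT_DELAY = 15 * 60  # seconds; the series' default of 15 minutes

class SourceMount:
    """Toy model of a cached source-export mount with a sliding expiry."""

    def __init__(self, now):
        # Mounting starts the idle clock.
        self.expires_at = now + UNMOUNT_DELAY

    def use(self, now):
        # Each copy that reuses the mount pushes the expiry out again.
        self.expires_at = now + UNMOUNT_DELAY

    def should_unmount(self, now):
        # The delayed task unmounts only if the export sat idle a full period.
        return now >= self.expires_at

m = SourceMount(now=0)
m.use(now=600)                          # a copy at t=10min extends the lease
assert not m.should_unmount(now=900)    # t=15min: within the extended period
assert m.should_unmount(now=600 + UNMOUNT_DELAY)
```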

Since the mount and unmount are no longer done on each copy request,
this overhead is no longer used to decide whether the copy should be
done as an inter-server copy or a generic copy. The threshold used to
determine sync vs. async copy is now used for this decision.

-Dai

v2: fix compiler warning of missing prototype.

Comments

Chuck Lever April 6, 2021, 4:33 p.m. UTC | #1
> On Apr 2, 2021, at 7:30 PM, Dai Ngo <dai.ngo@oracle.com> wrote:
> 
> Hi,
> 
> Currently the source's export is mounted and unmounted on every
> inter-server copy operation. This causes unnecessary overhead
> for each copy.
> 
> This patch series is an enhancement to allow the export to remain
> mounted for a configurable period (default to 15 minutes). If the 
> export is not being used for the configured time it will be unmounted
> by a delayed task. If it's used again then its expiration time is
> extended for another period.
> 
> Since mount and unmount are no longer done on each copy request,
> this overhead is no longer used to decide whether the copy should
> be done with inter-server copy or generic copy. The threshold used
> to determine sync or async copy is now used for this decision.
> 
> -Dai
> 
> v2: fix compiler warning of missing prototype.

Hi Olga-

I'm getting ready to shrink-wrap the initial NFSD v5.13 pull request.
Have you had a chance to review Dai's patches?


--
Chuck Lever
Olga Kornievskaia April 6, 2021, 7:41 p.m. UTC | #2
On Tue, Apr 6, 2021 at 12:33 PM Chuck Lever III <chuck.lever@oracle.com> wrote:
>
>
>
> > On Apr 2, 2021, at 7:30 PM, Dai Ngo <dai.ngo@oracle.com> wrote:
> >
> > Hi,
> >
> > Currently the source's export is mounted and unmounted on every
> > inter-server copy operation. This causes unnecessary overhead
> > for each copy.
> >
> > This patch series is an enhancement to allow the export to remain
> > mounted for a configurable period (default to 15 minutes). If the
> > export is not being used for the configured time it will be unmounted
> > by a delayed task. If it's used again then its expiration time is
> > extended for another period.
> >
> > Since mount and unmount are no longer done on each copy request,
> > this overhead is no longer used to decide whether the copy should
> > be done with inter-server copy or generic copy. The threshold used
> > to determine sync or async copy is now used for this decision.
> >
> > -Dai
> >
> > v2: fix compiler warning of missing prototype.
>
> Hi Olga-
>
> I'm getting ready to shrink-wrap the initial NFSD v5.13 pull request.
> Have you had a chance to review Dai's patches?

Hi Chuck,

I apologize I haven't had the chance to review/test it yet. Can I have
until tomorrow evening to do so?

>
>
> --
> Chuck Lever
>
>
>
Chuck Lever April 6, 2021, 7:42 p.m. UTC | #3
> On Apr 6, 2021, at 3:41 PM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
> 
> On Tue, Apr 6, 2021 at 12:33 PM Chuck Lever III <chuck.lever@oracle.com> wrote:
>> 
>> 
>> 
>>> On Apr 2, 2021, at 7:30 PM, Dai Ngo <dai.ngo@oracle.com> wrote:
>>> 
>>> Hi,
>>> 
>>> Currently the source's export is mounted and unmounted on every
>>> inter-server copy operation. This causes unnecessary overhead
>>> for each copy.
>>> 
>>> This patch series is an enhancement to allow the export to remain
>>> mounted for a configurable period (default to 15 minutes). If the
>>> export is not being used for the configured time it will be unmounted
>>> by a delayed task. If it's used again then its expiration time is
>>> extended for another period.
>>> 
>>> Since mount and unmount are no longer done on each copy request,
>>> this overhead is no longer used to decide whether the copy should
>>> be done with inter-server copy or generic copy. The threshold used
>>> to determine sync or async copy is now used for this decision.
>>> 
>>> -Dai
>>> 
>>> v2: fix compiler warning of missing prototype.
>> 
>> Hi Olga-
>> 
>> I'm getting ready to shrink-wrap the initial NFSD v5.13 pull request.
>> Have you had a chance to review Dai's patches?
> 
> Hi Chuck,
> 
> I apologize I haven't had the chance to review/test it yet. Can I have
> until tomorrow evening to do so?

Next couple of days will be fine. Thanks!


--
Chuck Lever
Olga Kornievskaia April 6, 2021, 7:57 p.m. UTC | #4
On Tue, Apr 6, 2021 at 3:43 PM Chuck Lever III <chuck.lever@oracle.com> wrote:
>
>
>
> > On Apr 6, 2021, at 3:41 PM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
> >
> > On Tue, Apr 6, 2021 at 12:33 PM Chuck Lever III <chuck.lever@oracle.com> wrote:
> >>
> >>
> >>
> >>> On Apr 2, 2021, at 7:30 PM, Dai Ngo <dai.ngo@oracle.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> Currently the source's export is mounted and unmounted on every
> >>> inter-server copy operation. This causes unnecessary overhead
> >>> for each copy.
> >>>
> >>> This patch series is an enhancement to allow the export to remain
> >>> mounted for a configurable period (default to 15 minutes). If the
> >>> export is not being used for the configured time it will be unmounted
> >>> by a delayed task. If it's used again then its expiration time is
> >>> extended for another period.
> >>>
> >>> Since mount and unmount are no longer done on each copy request,
> >>> this overhead is no longer used to decide whether the copy should
> >>> be done with inter-server copy or generic copy. The threshold used
> >>> to determine sync or async copy is now used for this decision.
> >>>
> >>> -Dai
> >>>
> >>> v2: fix compiler warning of missing prototype.
> >>
> >> Hi Olga-
> >>
> >> I'm getting ready to shrink-wrap the initial NFSD v5.13 pull request.
> >> Have you had a chance to review Dai's patches?
> >
> > Hi Chuck,
> >
> > I apologize I haven't had the chance to review/test it yet. Can I have
> > until tomorrow evening to do so?
>
> Next couple of days will be fine. Thanks!
>

I also assumed there would be a v2 given that kbot complained about
the NFSD patch.

> --
> Chuck Lever
>
>
>
Chuck Lever April 6, 2021, 7:58 p.m. UTC | #5
> On Apr 6, 2021, at 3:57 PM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
> 
> On Tue, Apr 6, 2021 at 3:43 PM Chuck Lever III <chuck.lever@oracle.com> wrote:
>> 
>> 
>> 
>>> On Apr 6, 2021, at 3:41 PM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
>>> 
>>> On Tue, Apr 6, 2021 at 12:33 PM Chuck Lever III <chuck.lever@oracle.com> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On Apr 2, 2021, at 7:30 PM, Dai Ngo <dai.ngo@oracle.com> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> Currently the source's export is mounted and unmounted on every
>>>>> inter-server copy operation. This causes unnecessary overhead
>>>>> for each copy.
>>>>> 
>>>>> This patch series is an enhancement to allow the export to remain
>>>>> mounted for a configurable period (default to 15 minutes). If the
>>>>> export is not being used for the configured time it will be unmounted
>>>>> by a delayed task. If it's used again then its expiration time is
>>>>> extended for another period.
>>>>> 
>>>>> Since mount and unmount are no longer done on each copy request,
>>>>> this overhead is no longer used to decide whether the copy should
>>>>> be done with inter-server copy or generic copy. The threshold used
>>>>> to determine sync or async copy is now used for this decision.
>>>>> 
>>>>> -Dai
>>>>> 
>>>>> v2: fix compiler warning of missing prototype.
>>>> 
>>>> Hi Olga-
>>>> 
>>>> I'm getting ready to shrink-wrap the initial NFSD v5.13 pull request.
>>>> Have you had a chance to review Dai's patches?
>>> 
>>> Hi Chuck,
>>> 
>>> I apologize I haven't had the chance to review/test it yet. Can I have
>>> until tomorrow evening to do so?
>> 
>> Next couple of days will be fine. Thanks!
>> 
> 
> I also assumed there would be a v2 given that kbot complained about
> the NFSD patch.

This is the v2 (see Subject: )


--
Chuck Lever
Dai Ngo April 6, 2021, 7:58 p.m. UTC | #6
Hi Olga,

On 4/6/21 12:57 PM, Olga Kornievskaia wrote:
> On Tue, Apr 6, 2021 at 3:43 PM Chuck Lever III <chuck.lever@oracle.com> wrote:
>>
>>
>>> On Apr 6, 2021, at 3:41 PM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
>>>
>>> On Tue, Apr 6, 2021 at 12:33 PM Chuck Lever III <chuck.lever@oracle.com> wrote:
>>>>
>>>>
>>>>> On Apr 2, 2021, at 7:30 PM, Dai Ngo <dai.ngo@oracle.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Currently the source's export is mounted and unmounted on every
>>>>> inter-server copy operation. This causes unnecessary overhead
>>>>> for each copy.
>>>>>
>>>>> This patch series is an enhancement to allow the export to remain
>>>>> mounted for a configurable period (default to 15 minutes). If the
>>>>> export is not being used for the configured time it will be unmounted
>>>>> by a delayed task. If it's used again then its expiration time is
>>>>> extended for another period.
>>>>>
>>>>> Since mount and unmount are no longer done on each copy request,
>>>>> this overhead is no longer used to decide whether the copy should
>>>>> be done with inter-server copy or generic copy. The threshold used
>>>>> to determine sync or async copy is now used for this decision.
>>>>>
>>>>> -Dai
>>>>>
>>>>> v2: fix compiler warning of missing prototype.
>>>> Hi Olga-
>>>>
>>>> I'm getting ready to shrink-wrap the initial NFSD v5.13 pull request.
>>>> Have you had a chance to review Dai's patches?
>>> Hi Chuck,
>>>
>>> I apologize I haven't had the chance to review/test it yet. Can I have
>>> until tomorrow evening to do so?
>> Next couple of days will be fine. Thanks!
>>
> I also assumed there would be a v2 given that kbot complained about
> the NFSD patch.

Yes, I sent the v2 patch to fix the compiler warning on 4/2.

Thanks,

-Dai

>
>> --
>> Chuck Lever
>>
>>
>>
Olga Kornievskaia April 6, 2021, 8:43 p.m. UTC | #7
On Tue, Apr 6, 2021 at 3:58 PM Chuck Lever III <chuck.lever@oracle.com> wrote:
>
>
>
> > On Apr 6, 2021, at 3:57 PM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
> >
> > On Tue, Apr 6, 2021 at 3:43 PM Chuck Lever III <chuck.lever@oracle.com> wrote:
> >>
> >>
> >>
> >>> On Apr 6, 2021, at 3:41 PM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
> >>>
> >>> On Tue, Apr 6, 2021 at 12:33 PM Chuck Lever III <chuck.lever@oracle.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>>> On Apr 2, 2021, at 7:30 PM, Dai Ngo <dai.ngo@oracle.com> wrote:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> Currently the source's export is mounted and unmounted on every
> >>>>> inter-server copy operation. This causes unnecessary overhead
> >>>>> for each copy.
> >>>>>
> >>>>> This patch series is an enhancement to allow the export to remain
> >>>>> mounted for a configurable period (default to 15 minutes). If the
> >>>>> export is not being used for the configured time it will be unmounted
> >>>>> by a delayed task. If it's used again then its expiration time is
> >>>>> extended for another period.
> >>>>>
> >>>>> Since mount and unmount are no longer done on each copy request,
> >>>>> this overhead is no longer used to decide whether the copy should
> >>>>> be done with inter-server copy or generic copy. The threshold used
> >>>>> to determine sync or async copy is now used for this decision.
> >>>>>
> >>>>> -Dai
> >>>>>
> >>>>> v2: fix compiler warning of missing prototype.
> >>>>
> >>>> Hi Olga-
> >>>>
> >>>> I'm getting ready to shrink-wrap the initial NFSD v5.13 pull request.
> >>>> Have you had a chance to review Dai's patches?
> >>>
> >>> Hi Chuck,
> >>>
> >>> I apologize I haven't had the chance to review/test it yet. Can I have
> >>> until tomorrow evening to do so?
> >>
> >> Next couple of days will be fine. Thanks!
> >>
> >
> > I also assumed there would be a v2 given that kbot complained about
> > the NFSD patch.
>
> This is the v2 (see Subject: )

Sigh. Thank you. I somehow missed v2 patches themselves and only saw
the originals. Again I'll test/review the v2 by the end of the day
tomorrow!

Actually, a question for Dai: have you done performance tests with your
patches, and can you show that small copies still perform well? Can you
please post your numbers with the patch series? When we posted the
original patch set we did provide performance numbers to support the
choices we made (i.e., not hurting the performance of small copies).

While I agree that delaying the unmount on the server is beneficial,
I'm not so sure that dropping the client restriction is wise, because
a small (singular) copy would suffer the setup cost of the initial
mount. Just my initial thoughts...

>

>
> --
> Chuck Lever
>
>
>
Dai Ngo April 7, 2021, 1:12 a.m. UTC | #8
On 4/6/21 1:43 PM, Olga Kornievskaia wrote:
> On Tue, Apr 6, 2021 at 3:58 PM Chuck Lever III <chuck.lever@oracle.com> wrote:
>>
>>
>>> On Apr 6, 2021, at 3:57 PM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
>>>
>>> On Tue, Apr 6, 2021 at 3:43 PM Chuck Lever III <chuck.lever@oracle.com> wrote:
>>>>
>>>>
>>>>> On Apr 6, 2021, at 3:41 PM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
>>>>>
>>>>> On Tue, Apr 6, 2021 at 12:33 PM Chuck Lever III <chuck.lever@oracle.com> wrote:
>>>>>>
>>>>>>
>>>>>>> On Apr 2, 2021, at 7:30 PM, Dai Ngo <dai.ngo@oracle.com> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Currently the source's export is mounted and unmounted on every
>>>>>>> inter-server copy operation. This causes unnecessary overhead
>>>>>>> for each copy.
>>>>>>>
>>>>>>> This patch series is an enhancement to allow the export to remain
>>>>>>> mounted for a configurable period (default to 15 minutes). If the
>>>>>>> export is not being used for the configured time it will be unmounted
>>>>>>> by a delayed task. If it's used again then its expiration time is
>>>>>>> extended for another period.
>>>>>>>
>>>>>>> Since mount and unmount are no longer done on each copy request,
>>>>>>> this overhead is no longer used to decide whether the copy should
>>>>>>> be done with inter-server copy or generic copy. The threshold used
>>>>>>> to determine sync or async copy is now used for this decision.
>>>>>>>
>>>>>>> -Dai
>>>>>>>
>>>>>>> v2: fix compiler warning of missing prototype.
>>>>>> Hi Olga-
>>>>>>
>>>>>> I'm getting ready to shrink-wrap the initial NFSD v5.13 pull request.
>>>>>> Have you had a chance to review Dai's patches?
>>>>> Hi Chuck,
>>>>>
>>>>> I apologize I haven't had the chance to review/test it yet. Can I have
>>>>> until tomorrow evening to do so?
>>>> Next couple of days will be fine. Thanks!
>>>>
>>> I also assumed there would be a v2 given that kbot complained about
>>> the NFSD patch.
>> This is the v2 (see Subject: )
> Sigh. Thank you. I somehow missed v2 patches themselves and only saw
> the originals. Again I'll test/review the v2 by the end of the day
> tomorrow!
>
> Actually a question for Dai: have you done performance tests with your
> patches and can show that small copies still perform? Can you please
> post your numbers with the patch series? When we posted the original
> patch set we did provide performance numbers to support the choices we
> made (ie, not hurting performance of small copies).

Currently the source and destination exports are mounted with the
default rsize of 524288, and the patch uses a threshold of
(rsize * 2 = 1048576) bytes to decide whether to do an inter-server
copy or a generic copy.

I ran performance tests on my test VMs, with and without the patch,
using four file sizes: 1048576, 1049490, 2048000 and 7341056 bytes.
I ran each test 5 times and took the average. I include the results
of 'cp' for reference:

  size     cp          with patch             without patch
----------------------------------------------------------------
1048576  0.031    0.032 (generic)           0.029 (generic)
1049490  0.032    0.042 (inter-server)      0.037 (generic)
2048000  0.051    0.047 (inter-server)      0.053 (generic)
7341056  0.157    0.074 (inter-server)      0.185 (inter-server)
----------------------------------------------------------------

Note that without the patch, the threshold to do an inter-server copy
is (524288 * 14 = 7340032) bytes. With the patch, the threshold to do
an inter-server copy is (524288 * 2 = 1048576) bytes, the same as the
threshold used to decide between sync and async for an intra-server copy.
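The decision rule implied by the table and these thresholds can be
checked mechanically (a sketch in Python; `use_inter_server` is an
illustrative name, not a real kernel function):

```python
RSIZE = 524288  # default rsize on the test mounts

def use_inter_server(size, threshold_multiple):
    # Inter-server copy is chosen only when the file exceeds the threshold.
    return size > RSIZE * threshold_multiple

# With the patch, the sync/async threshold (rsize * 2 = 1048576) decides:
assert not use_inter_server(1048576, 2)   # generic, matching row 1
assert use_inter_server(1049490, 2)       # inter-server, matching row 2

# Without the patch, the old mount-overhead threshold (rsize * 14 = 7340032):
assert not use_inter_server(2048000, 14)  # generic, matching row 3
assert use_inter_server(7341056, 14)      # inter-server, matching row 4
```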

>
> While I agree that delaying the unmount on the server is beneficial
> I'm not so sure that dropping the client restriction is wise because
> the small (singular) copy would suffer the setup cost of the initial
> mount.

Right, but only the 1st copy. The export remains mounted for
15 minutes, so subsequent small copies do not incur the mount and
unmount overhead.

I think ideally we want the server to do an inter-server copy only if
it's faster than the generic copy. We can probably come up with a
number after some testing, and that number cannot be based on the
rsize as it is now, since the rsize can be changed by a mount option.
This could be a fixed number, 1M/2M/etc., and it should be
configurable. What do you think? I'm open to any other options.
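A fixed, configurable threshold like the one proposed here could look
roughly like this (illustrative Python; the environment variable is a
stand-in for whatever knob — module parameter, sysctl — the server
would actually expose, and all names are assumptions):

```python
import os

DEFAULT_INTER_COPY_THRESHOLD = 2 * 1024 * 1024  # e.g. a fixed 2M default

def inter_copy_threshold():
    # Hypothetical tunable: fall back to the default when not configured.
    raw = os.environ.get("INTER_COPY_THRESHOLD")
    return int(raw) if raw else DEFAULT_INTER_COPY_THRESHOLD

def choose_copy_method(size):
    # Inter-server copy only when the file is larger than the fixed cutoff,
    # independent of the rsize negotiated at mount time.
    return "inter-server" if size > inter_copy_threshold() else "generic"

assert choose_copy_method(1048576) == "generic"
assert choose_copy_method(7341056) == "inter-server"
```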

>   Just my initial thoughts...

Thanks,
-Dai

>
>> --
>> Chuck Lever
>>
>>
>>
Dai Ngo April 7, 2021, 1:23 a.m. UTC | #9
On 4/6/21 6:12 PM, dai.ngo@oracle.com wrote:
>
> On 4/6/21 1:43 PM, Olga Kornievskaia wrote:
>> On Tue, Apr 6, 2021 at 3:58 PM Chuck Lever III 
>> <chuck.lever@oracle.com> wrote:
>>>
>>>
>>>> On Apr 6, 2021, at 3:57 PM, Olga Kornievskaia 
>>>> <olga.kornievskaia@gmail.com> wrote:
>>>>
>>>> On Tue, Apr 6, 2021 at 3:43 PM Chuck Lever III 
>>>> <chuck.lever@oracle.com> wrote:
>>>>>
>>>>>
>>>>>> On Apr 6, 2021, at 3:41 PM, Olga Kornievskaia 
>>>>>> <olga.kornievskaia@gmail.com> wrote:
>>>>>>
>>>>>> On Tue, Apr 6, 2021 at 12:33 PM Chuck Lever III 
>>>>>> <chuck.lever@oracle.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On Apr 2, 2021, at 7:30 PM, Dai Ngo <dai.ngo@oracle.com> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Currently the source's export is mounted and unmounted on every
>>>>>>>> inter-server copy operation. This causes unnecessary overhead
>>>>>>>> for each copy.
>>>>>>>>
>>>>>>>> This patch series is an enhancement to allow the export to remain
>>>>>>>> mounted for a configurable period (default to 15 minutes). If the
>>>>>>>> export is not being used for the configured time it will be 
>>>>>>>> unmounted
>>>>>>>> by a delayed task. If it's used again then its expiration time is
>>>>>>>> extended for another period.
>>>>>>>>
>>>>>>>> Since mount and unmount are no longer done on each copy request,
>>>>>>>> this overhead is no longer used to decide whether the copy should
>>>>>>>> be done with inter-server copy or generic copy. The threshold used
>>>>>>>> to determine sync or async copy is now used for this decision.
>>>>>>>>
>>>>>>>> -Dai
>>>>>>>>
>>>>>>>> v2: fix compiler warning of missing prototype.
>>>>>>> Hi Olga-
>>>>>>>
>>>>>>> I'm getting ready to shrink-wrap the initial NFSD v5.13 pull 
>>>>>>> request.
>>>>>>> Have you had a chance to review Dai's patches?
>>>>>> Hi Chuck,
>>>>>>
>>>>>> I apologize I haven't had the chance to review/test it yet. Can I 
>>>>>> have
>>>>>> until tomorrow evening to do so?
>>>>> Next couple of days will be fine. Thanks!
>>>>>
>>>> I also assumed there would be a v2 given that kbot complained about
>>>> the NFSD patch.
>>> This is the v2 (see Subject: )
>> Sigh. Thank you. I somehow missed v2 patches themselves and only saw
>> the originals. Again I'll test/review the v2 by the end of the day
>> tomorrow!
>>
>> Actually a question for Dai: have you done performance tests with your
>> patches and can show that small copies still perform? Can you please
>> post your numbers with the patch series? When we posted the original
>> patch set we did provide performance numbers to support the choices we
>> made (ie, not hurting performance of small copies).
>
> Currently the source and destination export was mounted with default
> rsize of 524288 and the patch uses threshold of (rsize * 2 = 1048576)
> to decide whether to do inter-server copy or generic copy.
>
> I ran performance tests on my test VMs, with and without the patch,
> using 4 file sizes 1048576, 1049490, 2048000 and 7341056 bytes. I ran
> each test 5 times and took the average. I include the results of 'cp'
> for reference:
>
> size            cp          with patch                  without patch
> ----------------------------------------------------------------
> 1048576  0.031    0.032 (generic)             0.029 (generic)
> 1049490  0.032    0.042 (inter-server)      0.037 (generic)
> 2048000  0.051    0.047 (inter-server)      0.053 (generic)
> 7341056  0.157    0.074 (inter-server)      0.185 (inter-server)
> ----------------------------------------------------------------

Sorry, times are in seconds.

-Dai

>
> Note that without the patch, the threshold to do inter-server
> copy is (524288 * 14 = 7340032) bytes. With the patch, the threshold
> to do inter-server is (524288 * 2 = 1048576) bytes, same as
> threshold to decide to sync/async for intra-copy.
>
>>
>> While I agree that delaying the unmount on the server is beneficial
>> I'm not so sure that dropping the client restriction is wise because
>> the small (singular) copy would suffer the setup cost of the initial
>> mount.
>
> Right, but only the 1st copy. The export remains to be mounted for
> 15 mins so subsequent small copies do not incur the mount and unmount
> overhead.
>
> I think ideally we want the server to do inter-copy only if it's faster
> than the generic copy. We can probably come up with a number after some
> testing and that number can not be based on the rsize as it is now since
> the rsize can be changed by mount option. This can be a fixed number,
> 1M/2M/etc, and it should be configurable. What do you think? I'm open
> to any other options.
>
>>   Just my initial thoughts...
>
> Thanks,
> -Dai
>
>>
>>> -- 
>>> Chuck Lever
>>>
>>>
>>>
Dai Ngo April 7, 2021, 5:13 p.m. UTC | #10
On 4/7/21 9:30 AM, Olga Kornievskaia wrote:
> On Tue, Apr 6, 2021 at 9:23 PM <dai.ngo@oracle.com> wrote:
>>
>> On 4/6/21 6:12 PM, dai.ngo@oracle.com wrote:
>>> On 4/6/21 1:43 PM, Olga Kornievskaia wrote:
>>>> On Tue, Apr 6, 2021 at 3:58 PM Chuck Lever III
>>>> <chuck.lever@oracle.com> wrote:
>>>>>
>>>>>> On Apr 6, 2021, at 3:57 PM, Olga Kornievskaia
>>>>>> <olga.kornievskaia@gmail.com> wrote:
>>>>>>
>>>>>> On Tue, Apr 6, 2021 at 3:43 PM Chuck Lever III
>>>>>> <chuck.lever@oracle.com> wrote:
>>>>>>>
>>>>>>>> On Apr 6, 2021, at 3:41 PM, Olga Kornievskaia
>>>>>>>> <olga.kornievskaia@gmail.com> wrote:
>>>>>>>>
>>>>>>>> On Tue, Apr 6, 2021 at 12:33 PM Chuck Lever III
>>>>>>>> <chuck.lever@oracle.com> wrote:
>>>>>>>>>
>>>>>>>>>> On Apr 2, 2021, at 7:30 PM, Dai Ngo <dai.ngo@oracle.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Currently the source's export is mounted and unmounted on every
>>>>>>>>>> inter-server copy operation. This causes unnecessary overhead
>>>>>>>>>> for each copy.
>>>>>>>>>>
>>>>>>>>>> This patch series is an enhancement to allow the export to remain
>>>>>>>>>> mounted for a configurable period (default to 15 minutes). If the
>>>>>>>>>> export is not being used for the configured time it will be
>>>>>>>>>> unmounted
>>>>>>>>>> by a delayed task. If it's used again then its expiration time is
>>>>>>>>>> extended for another period.
>>>>>>>>>>
>>>>>>>>>> Since mount and unmount are no longer done on each copy request,
>>>>>>>>>> this overhead is no longer used to decide whether the copy should
>>>>>>>>>> be done with inter-server copy or generic copy. The threshold used
>>>>>>>>>> to determine sync or async copy is now used for this decision.
>>>>>>>>>>
>>>>>>>>>> -Dai
>>>>>>>>>>
>>>>>>>>>> v2: fix compiler warning of missing prototype.
>>>>>>>>> Hi Olga-
>>>>>>>>>
>>>>>>>>> I'm getting ready to shrink-wrap the initial NFSD v5.13 pull
>>>>>>>>> request.
>>>>>>>>> Have you had a chance to review Dai's patches?
>>>>>>>> Hi Chuck,
>>>>>>>>
>>>>>>>> I apologize I haven't had the chance to review/test it yet. Can I
>>>>>>>> have
>>>>>>>> until tomorrow evening to do so?
>>>>>>> Next couple of days will be fine. Thanks!
>>>>>>>
>>>>>> I also assumed there would be a v2 given that kbot complained about
>>>>>> the NFSD patch.
>>>>> This is the v2 (see Subject: )
>>>> Sigh. Thank you. I somehow missed v2 patches themselves and only saw
>>>> the originals. Again I'll test/review the v2 by the end of the day
>>>> tomorrow!
>>>>
>>>> Actually a question for Dai: have you done performance tests with your
>>>> patches and can show that small copies still perform? Can you please
>>>> post your numbers with the patch series? When we posted the original
>>>> patch set we did provide performance numbers to support the choices we
>>>> made (ie, not hurting performance of small copies).
>>> Currently the source and destination export was mounted with default
>>> rsize of 524288 and the patch uses threshold of (rsize * 2 = 1048576)
>>> to decide whether to do inter-server copy or generic copy.
>>>
>>> I ran performance tests on my test VMs, with and without the patch,
>>> using 4 file sizes 1048576, 1049490, 2048000 and 7341056 bytes. I ran
>>> each test 5 times and took the average. I include the results of 'cp'
>>> for reference:
>>>
>>> size            cp          with patch                  without patch
>>> ----------------------------------------------------------------
>>> 1048576  0.031    0.032 (generic)             0.029 (generic)
>>> 1049490  0.032    0.042 (inter-server)      0.037 (generic)
>>> 2048000  0.051    0.047 (inter-server)      0.053 (generic)
>>> 7341056  0.157    0.074 (inter-server)      0.185 (inter-server)
>>> ----------------------------------------------------------------
>> Sorry, times are in seconds.
> Thank you for the numbers. #2 case is what I'm worried about.

Regarding the performance numbers, the patch does better than the
original code in #3 and #4 and worse than the original code in #1 and
#2. Run #4 shows the benefit of the patch when doing an inter-server
copy. The #2 case can be mitigated by using a configurable threshold.
In general, I think it's more important to have good performance on
large files than on small files when using inter-server copy. Note
that the original code does not do well with small files either, as
shown above.

>
> I don't believe the code works. In my 1st test doing "nfstest_ssc
> --runtest inter01" and then doing it again. What I see from inspecting
> the traces is that indeed unmount doesn't happen but for the 2nd copy
> the mount happens again.
>
> I'm attaching the trace. my servers are .114 (dest), .110 (src). my
> client .68. The first run of "inter01" places a copy in frame 364.
> frame 367 has the beginning of the "mount" between .114 and .110. then
> read happens. then a copy offload callback happens. No unmount happens
> as expected. inter01 continues with its verification and clean up. By
> frame 768 the test is done. I'm waiting a bit. So there is a heartbeat
> between the .114 and .110 in frame 769. Then the next run of the
> "inter01", COPY is placed in frame 1110. The next thing that happens
> are PUTROOTFH+bunch of GETATTRs that are part of the mount. So what is
> the saving here? a single EXCHANGE_ID? Either the code doesn't work or
> however it works provides no savings.

The savings are EXCHANGE_ID, CREATE_SESSION, RECLAIM_COMPLETE,
DESTROY_SESSION and DESTROY_CLIENTID for *every* inter-server copy
request. The savings are reflected in the numbers for test run #4 above.

Note that the overhead of the copy in the current code includes the
mount *and* the unmount. However, the threshold computed in
__nfs4_copy_file_range includes only the guesstimated mount overhead
and not the unmount overhead, so it is not correct.

-Dai


>
> Honestly I don't understand the whole need of a semaphore and all.

The semaphore is there to prevent the export from being unmounted
while it's being used.
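The guard being described can be modeled like this (illustrative
Python using a semaphore; the real code is kernel-side, so the names
and structure here are assumptions, not the actual implementation):

```python
import threading

export_busy = threading.Semaphore(1)  # stands in for the per-mount semaphore

def begin_copy():
    # A copy in flight holds the semaphore for its whole duration.
    export_busy.acquire()

def end_copy():
    export_busy.release()

def delayed_unmount():
    # The delayed task only unmounts if no copy holds the export;
    # otherwise it backs off and a later run retries.
    if export_busy.acquire(blocking=False):
        export_busy.release()
        return "unmounted"
    return "busy, retry later"

begin_copy()
assert delayed_unmount() == "busy, retry later"   # unmount deferred
end_copy()
assert delayed_unmount() == "unmounted"           # idle: safe to unmount
```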

-Dai

> My
> approach that I tried before was to create a delayed work item but I
> don't recall why I dropped it.
> https://patchwork.kernel.org/project/linux-nfs/patch/20170711164416.1982-43-kolga@netapp.com/
>
>
>> -Dai
>>
>>> Note that without the patch, the threshold to do inter-server
>>> copy is (524288 * 14 = 7340032) bytes. With the patch, the threshold
>>> to do inter-server is (524288 * 2 = 1048576) bytes, same as
>>> threshold to decide to sync/async for intra-copy.
>>>
>>>> While I agree that delaying the unmount on the server is beneficial
>>>> I'm not so sure that dropping the client restriction is wise because
>>>> the small (singular) copy would suffer the setup cost of the initial
>>>> mount.
>>> Right, but only the 1st copy. The export remains to be mounted for
>>> 15 mins so subsequent small copies do not incur the mount and unmount
>>> overhead.
>>>
>>> I think ideally we want the server to do inter-copy only if it's faster
>>> than the generic copy. We can probably come up with a number after some
>>> testing and that number can not be based on the rsize as it is now since
>>> the rsize can be changed by mount option. This can be a fixed number,
>>> 1M/2M/etc, and it should be configurable. What do you think? I'm open
>>> to any other options.
>>>
>>>>    Just my initial thoughts...
>>> Thanks,
>>> -Dai
>>>
>>>>> --
>>>>> Chuck Lever
>>>>>
>>>>>
>>>>>
Olga Kornievskaia April 7, 2021, 7:01 p.m. UTC | #11
On Wed, Apr 7, 2021 at 1:13 PM <dai.ngo@oracle.com> wrote:
>
>
> On 4/7/21 9:30 AM, Olga Kornievskaia wrote:
> > On Tue, Apr 6, 2021 at 9:23 PM <dai.ngo@oracle.com> wrote:
> >>
> >> On 4/6/21 6:12 PM, dai.ngo@oracle.com wrote:
> >>> On 4/6/21 1:43 PM, Olga Kornievskaia wrote:
> >>>> On Tue, Apr 6, 2021 at 3:58 PM Chuck Lever III
> >>>> <chuck.lever@oracle.com> wrote:
> >>>>>
> >>>>>> On Apr 6, 2021, at 3:57 PM, Olga Kornievskaia
> >>>>>> <olga.kornievskaia@gmail.com> wrote:
> >>>>>>
> >>>>>> On Tue, Apr 6, 2021 at 3:43 PM Chuck Lever III
> >>>>>> <chuck.lever@oracle.com> wrote:
> >>>>>>>
> >>>>>>>> On Apr 6, 2021, at 3:41 PM, Olga Kornievskaia
> >>>>>>>> <olga.kornievskaia@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> On Tue, Apr 6, 2021 at 12:33 PM Chuck Lever III
> >>>>>>>> <chuck.lever@oracle.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> On Apr 2, 2021, at 7:30 PM, Dai Ngo <dai.ngo@oracle.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> Currently the source's export is mounted and unmounted on every
> >>>>>>>>>> inter-server copy operation. This causes unnecessary overhead
> >>>>>>>>>> for each copy.
> >>>>>>>>>>
> >>>>>>>>>> This patch series is an enhancement to allow the export to remain
> >>>>>>>>>> mounted for a configurable period (default to 15 minutes). If the
> >>>>>>>>>> export is not being used for the configured time it will be
> >>>>>>>>>> unmounted
> >>>>>>>>>> by a delayed task. If it's used again then its expiration time is
> >>>>>>>>>> extended for another period.
> >>>>>>>>>>
> >>>>>>>>>> Since mount and unmount are no longer done on each copy request,
> >>>>>>>>>> this overhead is no longer used to decide whether the copy should
> >>>>>>>>>> be done with inter-server copy or generic copy. The threshold used
> >>>>>>>>>> to determine sync or async copy is now used for this decision.
> >>>>>>>>>>
> >>>>>>>>>> -Dai
> >>>>>>>>>>
> >>>>>>>>>> v2: fix compiler warning of missing prototype.
> >>>>>>>>> Hi Olga-
> >>>>>>>>>
> >>>>>>>>> I'm getting ready to shrink-wrap the initial NFSD v5.13 pull
> >>>>>>>>> request.
> >>>>>>>>> Have you had a chance to review Dai's patches?
> >>>>>>>> Hi Chuck,
> >>>>>>>>
> >>>>>>>> I apologize I haven't had the chance to review/test it yet. Can I
> >>>>>>>> have
> >>>>>>>> until tomorrow evening to do so?
> >>>>>>> Next couple of days will be fine. Thanks!
> >>>>>>>
> >>>>>> I also assumed there would be a v2 given that kbot complained about
> >>>>>> the NFSD patch.
> >>>>> This is the v2 (see Subject: )
> >>>> Sigh. Thank you. I somehow missed v2 patches themselves and only saw
> >>>> the originals. Again I'll test/review the v2 by the end of the day
> >>>> tomorrow!
> >>>>
> >>>> Actually a question for Dai: have you done performance tests with your
> >>>> patches and can show that small copies still perform? Can you please
> >>>> post your numbers with the patch series? When we posted the original
> >>>> patch set we did provide performance numbers to support the choices we
> >>>> made (ie, not hurting performance of small copies).
> >>> Currently the source and destination export was mounted with default
> >>> rsize of 524288 and the patch uses threshold of (rsize * 2 = 1048576)
> >>> to decide whether to do inter-server copy or generic copy.
> >>>
> >>> I ran performance tests on my test VMs, with and without the patch,
> >>> using 4 file sizes 1048576, 1049490, 2048000 and 7341056 bytes. I ran
> >>> each test 5 times and took the average. I include the results of 'cp'
> >>> for reference:
> >>>
> >>> size            cp          with patch                  without patch
> >>> ----------------------------------------------------------------
> >>> 1048576  0.031    0.032 (generic)             0.029 (generic)
> >>> 1049490  0.032    0.042 (inter-server)      0.037 (generic)
> >>> 2048000  0.051    0.047 (inter-server)      0.053 (generic)
> >>> 7341056  0.157    0.074 (inter-server)      0.185 (inter-server)
> >>> ----------------------------------------------------------------
> >> Sorry, times are in seconds.
> > Thank you for the numbers. #2 case is what I'm worried about.
>
> Regarding performance numbers, the patch does better than the original
> code in #3 and #4 and worse than the original code in #1 and #2. #4 run
> shows the benefit of the patch when doing inter-copy. The #2 case can
> be mitigated by using a configurable threshold. In general, I think it's
> more important to have good performance on large files than small files
> when using inter-server copy.  Note that the original code does not
> do well with small files either as shown above.

I think the current approach tries to be very conservative to achieve
the goal of never being worse than cp. I'm not sure what you mean
when you say the current code doesn't do well with small files. For
small files it falls back to the generic copy.

> > I don't believe the code works. In my 1st test doing "nfstest_ssc
> > --runtest inter01" and then doing it again. What I see from inspecting
> > the traces is that indeed unmount doesn't happen but for the 2nd copy
> > the mount happens again.
> >
> > I'm attaching the trace. my servers are .114 (dest), .110 (src). my
> > client .68. The first run of "inter01" places a copy in frame 364.
> > frame 367 has the beginning of the "mount" between .114 and .110. then
> > read happens. then a copy offload callback happens. No unmount happens
> > as expected. inter01 continues with its verification and clean up. By
> > frame 768 the test is done. I'm waiting a bit. So there is a heartbeat
> > between the .114 and .110 in frame 769. Then the next run of the
> > "inter01", COPY is placed in frame 1110. The next thing that happens
> > are PUTROOTFH+bunch of GETATTRs that are part of the mount. So what is
> > the saving here? a single EXCHANGE_ID? Either the code doesn't work or
> > however it works provides no savings.
>
> The savings are EXCHANGE_ID, CREATE_SESSION, RECLAIM_COMPLETE,
> DESTROY_SESSION and DESTROY_CLIENTID for *every* inter-copy request.
> The savings are reflected in the numbers for test run #4 above.

Can't we do better than that? Since you are keeping a list of umounts,
can't it be searched before doing the vfs_mount()? You could instead get
the mount structure from the list (and not call vfs_mount() at all) and
remove it from the umount list; wouldn't that save all of the calls?

> Note that the overhead of the copy in the current code includes mount
> *and* unmount. However the threshold computed in __nfs4_copy_file_range
> includes only the guesstimated mount overhead and not the unmount
> overhead, so it is not correct.
>
> -Dai
>
>
> >
> > Honestly I don't understand the whole need of a semaphore and all.
>
> The semaphore is to prevent the export from being unmounted while it's
> being used.
>
> -Dai
>
> > My
> > approach that I tried before was to create a delayed work item but I
> > don't recall why I dropped it.
> > > https://patchwork.kernel.org/project/linux-nfs/patch/20170711164416.1982-43-kolga@netapp.com/
> >
> >
> >> -Dai
> >>
> >>> Note that without the patch, the threshold to do inter-server
> >>> copy is (524288 * 14 = 7340032) bytes. With the patch, the threshold
> >>> to do inter-server is (524288 * 2 = 1048576) bytes, same as
> >>> threshold to decide to sync/async for intra-copy.
> >>>
> >>>> While I agree that delaying the unmount on the server is beneficial
> >>>> I'm not so sure that dropping the client restriction is wise because
> >>>> the small (singular) copy would suffer the setup cost of the initial
> >>>> mount.
> >>> Right, but only the 1st copy. The export remains mounted for
> >>> 15 mins so subsequent small copies do not incur the mount and unmount
> >>> overhead.
> >>>
> >>> I think ideally we want the server to do inter-copy only if it's faster
> >>> than the generic copy. We can probably come up with a number after some
> >>> testing and that number can not be based on the rsize as it is now since
> >>> the rsize can be changed by mount option. This can be a fixed number,
> >>> 1M/2M/etc, and it should be configurable. What do you think? I'm open
> >>> to any other options.
> >>>
> >>>>    Just my initial thoughts...
> >>> Thanks,
> >>> -Dai
> >>>
> >>>>> --
> >>>>> Chuck Lever
> >>>>>
> >>>>>
> >>>>>
Dai Ngo April 7, 2021, 8:16 p.m. UTC | #12
On 4/7/21 12:01 PM, Olga Kornievskaia wrote:
> On Wed, Apr 7, 2021 at 1:13 PM <dai.ngo@oracle.com> wrote:
>>
>> On 4/7/21 9:30 AM, Olga Kornievskaia wrote:
>>> On Tue, Apr 6, 2021 at 9:23 PM <dai.ngo@oracle.com> wrote:
>>>> On 4/6/21 6:12 PM, dai.ngo@oracle.com wrote:
>>>>> On 4/6/21 1:43 PM, Olga Kornievskaia wrote:
>>>>>> On Tue, Apr 6, 2021 at 3:58 PM Chuck Lever III
>>>>>> <chuck.lever@oracle.com> wrote:
>>>>>>>> On Apr 6, 2021, at 3:57 PM, Olga Kornievskaia
>>>>>>>> <olga.kornievskaia@gmail.com> wrote:
>>>>>>>>
>>>>>>>> On Tue, Apr 6, 2021 at 3:43 PM Chuck Lever III
>>>>>>>> <chuck.lever@oracle.com> wrote:
>>>>>>>>>> On Apr 6, 2021, at 3:41 PM, Olga Kornievskaia
>>>>>>>>>> <olga.kornievskaia@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On Tue, Apr 6, 2021 at 12:33 PM Chuck Lever III
>>>>>>>>>> <chuck.lever@oracle.com> wrote:
>>>>>>>>>>>> On Apr 2, 2021, at 7:30 PM, Dai Ngo <dai.ngo@oracle.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> [...]
>>>>>>>>>>>> v2: fix compiler warning of missing prototype.
>>>>>>>>>>> Hi Olga-
>>>>>>>>>>>
>>>>>>>>>>> I'm getting ready to shrink-wrap the initial NFSD v5.13 pull
>>>>>>>>>>> request.
>>>>>>>>>>> Have you had a chance to review Dai's patches?
>>>>>>>>>> Hi Chuck,
>>>>>>>>>>
>>>>>>>>>> I apologize I haven't had the chance to review/test it yet. Can I
>>>>>>>>>> have
>>>>>>>>>> until tomorrow evening to do so?
>>>>>>>>> Next couple of days will be fine. Thanks!
>>>>>>>>>
>>>>>>>> I also assumed there would be a v2 given that kbot complained about
>>>>>>>> the NFSD patch.
>>>>>>> This is the v2 (see Subject: )
>>>>>> Sigh. Thank you. I somehow missed v2 patches themselves and only saw
>>>>>> the originals. Again I'll test/review the v2 by the end of the day
>>>>>> tomorrow!
>>>>>>
>>>>>> Actually a question for Dai: have you done performance tests with your
>>>>>> patches and can show that small copies still perform? Can you please
>>>>>> post your numbers with the patch series? When we posted the original
>>>>>> patch set we did provide performance numbers to support the choices we
>>>>>> made (ie, not hurting performance of small copies).
>>>>> Currently the source and destination export was mounted with default
>>>>> rsize of 524288 and the patch uses threshold of (rsize * 2 = 1048576)
>>>>> to decide whether to do inter-server copy or generic copy.
>>>>>
>>>>> I ran performance tests on my test VMs, with and without the patch,
>>>>> using 4 file sizes 1048576, 1049490, 2048000 and 7341056 bytes. I ran
>>>>> each test 5 times and took the average. I include the results of 'cp'
>>>>> for reference:
>>>>>
>>>>> size            cp          with patch                  without patch
>>>>> ----------------------------------------------------------------
>>>>> 1048576  0.031    0.032 (generic)             0.029 (generic)
>>>>> 1049490  0.032    0.042 (inter-server)      0.037 (generic)
>>>>> 2048000  0.051    0.047 (inter-server)      0.053 (generic)
>>>>> 7341056  0.157    0.074 (inter-server)      0.185 (inter-server)
>>>>> ----------------------------------------------------------------
>>>> Sorry, times are in seconds.
>>> Thank you for the numbers. #2 case is what I'm worried about.
>> Regarding performance numbers, the patch does better than the original
>> code in #3 and #4 and worse than the original code in #1 and #2. #4 run
>> shows the benefit of the patch when doing inter-copy. The #2 case can
>> be mitigated by using a configurable threshold. In general, I think it's
>> more important to have good performance on large files than small files
>> when using inter-server copy.  Note that the original code does not
>> do well with small files either as shown above.
> I think the current approach tries to be very conservative to achieve
> the goal of never being worse than cp. I'm not sure what you mean
> when you say the current code doesn't do well with small files. For
> small files it falls back to the generic copy.

In this table, the only case where the current code has an advantage
over 'cp' is run 1, and I don't know why; the rest are slower than 'cp'.
I don't yet have the copy size at which inter-copy in the current code
starts showing better performance, but even at ~7MB it is still
slower than 'cp'. So for any size smaller than about 7MB, the
inter-server copy will be slower than 'cp'. Compare that with the
patch, where the benefit of inter-server copy starts at ~2MB.

>   
>
>>> I don't believe the code works. In my 1st test doing "nfstest_ssc
>>> --runtest inter01" and then doing it again. What I see from inspecting
>>> the traces is that indeed unmount doesn't happen but for the 2nd copy
>>> the mount happens again.
>>>
>>> I'm attaching the trace. my servers are .114 (dest), .110 (src). my
>>> client .68. The first run of "inter01" places a copy in frame 364.
>>> frame 367 has the beginning of the "mount" between .114 and .110. then
>>> read happens. then a copy offload callback happens. No unmount happens
>>> as expected. inter01 continues with its verification and clean up. By
>>> frame 768 the test is done. I'm waiting a bit. So there is a heartbeat
>>> between the .114 and .110 in frame 769. Then the next run of the
>>> "inter01", COPY is placed in frame 1110. The next thing that happens
>>> are PUTROOTFH+bunch of GETATTRs that are part of the mount. So what is
>>> the saving here? a single EXCHANGE_ID? Either the code doesn't work or
>>> however it works provides no savings.
>> The savings are EXCHANGE_ID, CREATE_SESSION, RECLAIM_COMPLETE,
>> DESTROY_SESSION and DESTROY_CLIENTID for *every* inter-copy request.
>> The savings are reflected in the numbers for test run #4 above.
> Can't we do better than that? Since you are keeping a list of umounts,
> can't it be searched before doing the vfs_mount()? You could instead get
> the mount structure from the list (and not call vfs_mount() at all) and
> remove it from the umount list; wouldn't that save all of the calls?

I thought about this. My problem here is that we don't have much to key
on to search the list. The only thing in the COPY arguments that can be
used for this purpose is the list of IP addresses of the source server.
I think that is not enough: there can be multiple exports from the
same server, so how do we find the right one? It can get complicated.
I'm happy to consider any suggestion you have for this.

I think the patch is an improvement, both in performance for copying
large files (if you consider a 2MB file large) and in removing the bug
in the overhead computation in __nfs4_copy_file_range. Note that we
can always improve it further; it doesn't all have to be done at once.

-Dai

>
>> Note that the overhead of the copy in the current code includes mount
>> *and* unmount. However the threshold computed in __nfs4_copy_file_range
>> includes only the guesstimated mount overhead and not the unmount
>> overhead, so it is not correct.
>>
>> -Dai
>>
>>
>>> Honestly I don't understand the whole need of a semaphore and all.
>> The semaphore is to prevent the export from being unmounted while it's
>> being used.
>>
>> -Dai
>>
>>> My
>>> approach that I tried before was to create a delayed work item but I
>>> don't recall why I dropped it.
>>> https://patchwork.kernel.org/project/linux-nfs/patch/20170711164416.1982-43-kolga@netapp.com/
>>>
>>>
>>>> -Dai
>>>>
>>>>> Note that without the patch, the threshold to do inter-server
>>>>> copy is (524288 * 14 = 7340032) bytes. With the patch, the threshold
>>>>> to do inter-server is (524288 * 2 = 1048576) bytes, same as
>>>>> threshold to decide to sync/async for intra-copy.
>>>>>
>>>>>> While I agree that delaying the unmount on the server is beneficial
>>>>>> I'm not so sure that dropping the client restriction is wise because
>>>>>> the small (singular) copy would suffer the setup cost of the initial
>>>>>> mount.
>>>>> Right, but only the 1st copy. The export remains mounted for
>>>>> 15 mins so subsequent small copies do not incur the mount and unmount
>>>>> overhead.
>>>>>
>>>>> I think ideally we want the server to do inter-copy only if it's faster
>>>>> than the generic copy. We can probably come up with a number after some
>>>>> testing and that number can not be based on the rsize as it is now since
>>>>> the rsize can be changed by mount option. This can be a fixed number,
>>>>> 1M/2M/etc, and it should be configurable. What do you think? I'm open
>>>>> to any other options.
>>>>>
>>>>>>     Just my initial thoughts...
>>>>> Thanks,
>>>>> -Dai
>>>>>
>>>>>>> --
>>>>>>> Chuck Lever
>>>>>>>
>>>>>>>
>>>>>>>
Olga Kornievskaia April 7, 2021, 9:40 p.m. UTC | #13
On Wed, Apr 7, 2021 at 4:16 PM <dai.ngo@oracle.com> wrote:
>
>
> On 4/7/21 12:01 PM, Olga Kornievskaia wrote:
> > On Wed, Apr 7, 2021 at 1:13 PM <dai.ngo@oracle.com> wrote:
> >>
> >> On 4/7/21 9:30 AM, Olga Kornievskaia wrote:
> >>> On Tue, Apr 6, 2021 at 9:23 PM <dai.ngo@oracle.com> wrote:
> >>>> On 4/6/21 6:12 PM, dai.ngo@oracle.com wrote:
> >>>>> On 4/6/21 1:43 PM, Olga Kornievskaia wrote:
> >>>>>> On Tue, Apr 6, 2021 at 3:58 PM Chuck Lever III
> >>>>>> <chuck.lever@oracle.com> wrote:
> >>>>>>>> On Apr 6, 2021, at 3:57 PM, Olga Kornievskaia
> >>>>>>>> <olga.kornievskaia@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> On Tue, Apr 6, 2021 at 3:43 PM Chuck Lever III
> >>>>>>>> <chuck.lever@oracle.com> wrote:
> >>>>>>>>>> On Apr 6, 2021, at 3:41 PM, Olga Kornievskaia
> >>>>>>>>>> <olga.kornievskaia@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Apr 6, 2021 at 12:33 PM Chuck Lever III
> >>>>>>>>>> <chuck.lever@oracle.com> wrote:
> >>>>>>>>>>>> On Apr 2, 2021, at 7:30 PM, Dai Ngo <dai.ngo@oracle.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> [...]
> >>>>>>>>>>>> v2: fix compiler warning of missing prototype.
> >>>>>>>>>>> Hi Olga-
> >>>>>>>>>>>
> >>>>>>>>>>> I'm getting ready to shrink-wrap the initial NFSD v5.13 pull
> >>>>>>>>>>> request.
> >>>>>>>>>>> Have you had a chance to review Dai's patches?
> >>>>>>>>>> Hi Chuck,
> >>>>>>>>>>
> >>>>>>>>>> I apologize I haven't had the chance to review/test it yet. Can I
> >>>>>>>>>> have
> >>>>>>>>>> until tomorrow evening to do so?
> >>>>>>>>> Next couple of days will be fine. Thanks!
> >>>>>>>>>
> >>>>>>>> I also assumed there would be a v2 given that kbot complained about
> >>>>>>>> the NFSD patch.
> >>>>>>> This is the v2 (see Subject: )
> >>>>>> Sigh. Thank you. I somehow missed v2 patches themselves and only saw
> >>>>>> the originals. Again I'll test/review the v2 by the end of the day
> >>>>>> tomorrow!
> >>>>>>
> >>>>>> Actually a question for Dai: have you done performance tests with your
> >>>>>> patches and can show that small copies still perform? Can you please
> >>>>>> post your numbers with the patch series? When we posted the original
> >>>>>> patch set we did provide performance numbers to support the choices we
> >>>>>> made (ie, not hurting performance of small copies).
> >>>>> Currently the source and destination export was mounted with default
> >>>>> rsize of 524288 and the patch uses threshold of (rsize * 2 = 1048576)
> >>>>> to decide whether to do inter-server copy or generic copy.
> >>>>>
> >>>>> I ran performance tests on my test VMs, with and without the patch,
> >>>>> using 4 file sizes 1048576, 1049490, 2048000 and 7341056 bytes. I ran
> >>>>> each test 5 times and took the average. I include the results of 'cp'
> >>>>> for reference:
> >>>>>
> >>>>> size            cp          with patch                  without patch
> >>>>> ----------------------------------------------------------------
> >>>>> 1048576  0.031    0.032 (generic)             0.029 (generic)
> >>>>> 1049490  0.032    0.042 (inter-server)      0.037 (generic)
> >>>>> 2048000  0.051    0.047 (inter-server)      0.053 (generic)
> >>>>> 7341056  0.157    0.074 (inter-server)      0.185 (inter-server)
> >>>>> ----------------------------------------------------------------
> >>>> Sorry, times are in seconds.
> >>> Thank you for the numbers. #2 case is what I'm worried about.
> >> Regarding performance numbers, the patch does better than the original
> >> code in #3 and #4 and worse than the original code in #1 and #2. #4 run
> >> shows the benefit of the patch when doing inter-copy. The #2 case can
> >> be mitigated by using a configurable threshold. In general, I think it's
> >> more important to have good performance on large files than small files
> >> when using inter-server copy.  Note that the original code does not
> >> do well with small files either as shown above.
> > I think the current approach tries to be very conservative to achieve
> > the goal of never being worse than cp. I'm not sure what you mean
> > when you say the current code doesn't do well with small files. For
> > small files it falls back to the generic copy.
>
> In this table, the only case where the current code has an advantage
> over 'cp' is run 1, and I don't know why; the rest are slower than 'cp'.
> I don't yet have the copy size at which inter-copy in the current code
> starts showing better performance, but even at ~7MB it is still
> slower than 'cp'. So for any size smaller than about 7MB, the
> inter-server copy will be slower than 'cp'. Compare that with the
> patch, where the benefit of inter-server copy starts at ~2MB.

I went back to Jorge Mora's perf numbers we posted. You are right, we
did report a performance degradation for any copy smaller than 16MB when
we didn't cap the copy size to be at least 14*rsize (I think the
assumption was that rsize=1M, making the cutoff 14M). I'm just uneasy
about opening it up to even smaller sizes. I think we should explicitly
change the threshold to 16MB instead of removing the restriction. Again,
I think the policy we want is to do no worse than cp.

> >>> I don't believe the code works. In my 1st test doing "nfstest_ssc
> >>> --runtest inter01" and then doing it again. What I see from inspecting
> >>> the traces is that indeed unmount doesn't happen but for the 2nd copy
> >>> the mount happens again.
> >>>
> >>> I'm attaching the trace. my servers are .114 (dest), .110 (src). my
> >>> client .68. The first run of "inter01" places a copy in frame 364.
> >>> frame 367 has the beginning of the "mount" between .114 and .110. then
> >>> read happens. then a copy offload callback happens. No unmount happens
> >>> as expected. inter01 continues with its verification and clean up. By
> >>> frame 768 the test is done. I'm waiting a bit. So there is a heartbeat
> >>> between the .114 and .110 in frame 769. Then the next run of the
> >>> "inter01", COPY is placed in frame 1110. The next thing that happens
> >>> are PUTROOTFH+bunch of GETATTRs that are part of the mount. So what is
> >>> the saving here? a single EXCHANGE_ID? Either the code doesn't work or
> >>> however it works provides no savings.
> >> The savings are EXCHANGE_ID, CREATE_SESSION, RECLAIM_COMPLETE,
> >> DESTROY_SESSION and DESTROY_CLIENTID for *every* inter-copy request.
> >> The savings are reflected in the numbers for test run #4 above.
> > Can't we do better than that? Since you are keeping a list of umounts,
> > can't it be searched before doing the vfs_mount()? You could instead get
> > the mount structure from the list (and not call vfs_mount() at all) and
> > remove it from the umount list; wouldn't that save all of the calls?
>
> I thought about this. My problem here is that we don't have much to key
> on to search the list. The only thing in the COPY arguments that can be
> used for this purpose is the list of IP addresses of the source server.
> I think that is not enough: there can be multiple exports from the
> same server, so how do we find the right one? It can get complicated.
> I'm happy to consider any suggestion you have for this.

I believe an IP address is exactly what's needed for keying. Each of
those "mounts" to the same server is shared (that's normal behaviour
for the client), meaning there is just one "mount" for a given IP
(there is a single nfs4_client structure).

> I think the patch is an improvement, both in performance for copying
> large files (if you consider a 2MB file large) and in removing the bug
> in the overhead computation in __nfs4_copy_file_range. Note that we
> can always improve it further; it doesn't all have to be done at once.

I don't think saving 3 RPCs out of 14 is a good enough improvement
when it can be made to save them all (unless you can convince me that
we can't save all 14).

>
> -Dai
>
> >
> >> Note that the overhead of the copy in the current code includes mount
> >> *and* unmount. However the threshold computed in __nfs4_copy_file_range
> >> includes only the guesstimated mount overhead and not the unmount
> >> overhead, so it is not correct.
> >>
> >> -Dai
> >>
> >>
> >>> Honestly I don't understand the whole need of a semaphore and all.
> >> The semaphore is to prevent the export from being unmounted while it's
> >> being used.
> >>
> >> -Dai
> >>
> >>> My
> >>> approach that I tried before was to create a delayed work item but I
> >>> don't recall why I dropped it.
> >>> https://patchwork.kernel.org/project/linux-nfs/patch/20170711164416.1982-43-kolga@netapp.com/
> >>>
> >>>
> >>>> -Dai
> >>>>
> >>>>> Note that without the patch, the threshold to do inter-server
> >>>>> copy is (524288 * 14 = 7340032) bytes. With the patch, the threshold
> >>>>> to do inter-server is (524288 * 2 = 1048576) bytes, same as
> >>>>> threshold to decide to sync/async for intra-copy.
> >>>>>
> >>>>>> While I agree that delaying the unmount on the server is beneficial
> >>>>>> I'm not so sure that dropping the client restriction is wise because
> >>>>>> the small (singular) copy would suffer the setup cost of the initial
> >>>>>> mount.
> >>>>> Right, but only the 1st copy. The export remains mounted for
> >>>>> 15 mins so subsequent small copies do not incur the mount and unmount
> >>>>> overhead.
> >>>>>
> >>>>> I think ideally we want the server to do inter-copy only if it's faster
> >>>>> than the generic copy. We can probably come up with a number after some
> >>>>> testing and that number can not be based on the rsize as it is now since
> >>>>> the rsize can be changed by mount option. This can be a fixed number,
> >>>>> 1M/2M/etc, and it should be configurable. What do you think? I'm open
> >>>>> to any other options.
> >>>>>
> >>>>>>     Just my initial thoughts...
> >>>>> Thanks,
> >>>>> -Dai
> >>>>>
> >>>>>>> --
> >>>>>>> Chuck Lever
> >>>>>>>
> >>>>>>>
> >>>>>>>
Dai Ngo April 7, 2021, 10:50 p.m. UTC | #14
On 4/7/21 2:40 PM, Olga Kornievskaia wrote:
> On Wed, Apr 7, 2021 at 4:16 PM <dai.ngo@oracle.com> wrote:
>>
>> On 4/7/21 12:01 PM, Olga Kornievskaia wrote:
>>> On Wed, Apr 7, 2021 at 1:13 PM <dai.ngo@oracle.com> wrote:
>>>> On 4/7/21 9:30 AM, Olga Kornievskaia wrote:
>>>>> On Tue, Apr 6, 2021 at 9:23 PM <dai.ngo@oracle.com> wrote:
>>>>>> On 4/6/21 6:12 PM, dai.ngo@oracle.com wrote:
>>>>>>> On 4/6/21 1:43 PM, Olga Kornievskaia wrote:
>>>>>>>> On Tue, Apr 6, 2021 at 3:58 PM Chuck Lever III
>>>>>>>> <chuck.lever@oracle.com> wrote:
>>>>>>>>>> On Apr 6, 2021, at 3:57 PM, Olga Kornievskaia
>>>>>>>>>> <olga.kornievskaia@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On Tue, Apr 6, 2021 at 3:43 PM Chuck Lever III
>>>>>>>>>> <chuck.lever@oracle.com> wrote:
>>>>>>>>>>>> On Apr 6, 2021, at 3:41 PM, Olga Kornievskaia
>>>>>>>>>>>> <olga.kornievskaia@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Apr 6, 2021 at 12:33 PM Chuck Lever III
>>>>>>>>>>>> <chuck.lever@oracle.com> wrote:
>>>>>>>>>>>>>> On Apr 2, 2021, at 7:30 PM, Dai Ngo <dai.ngo@oracle.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [...]
>>>>>>>>>>>>>> v2: fix compiler warning of missing prototype.
>>>>>>>>>>>>> Hi Olga-
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm getting ready to shrink-wrap the initial NFSD v5.13 pull
>>>>>>>>>>>>> request.
>>>>>>>>>>>>> Have you had a chance to review Dai's patches?
>>>>>>>>>>>> Hi Chuck,
>>>>>>>>>>>>
>>>>>>>>>>>> I apologize I haven't had the chance to review/test it yet. Can I
>>>>>>>>>>>> have
>>>>>>>>>>>> until tomorrow evening to do so?
>>>>>>>>>>> Next couple of days will be fine. Thanks!
>>>>>>>>>>>
>>>>>>>>>> I also assumed there would be a v2 given that kbot complained about
>>>>>>>>>> the NFSD patch.
>>>>>>>>> This is the v2 (see Subject: )
>>>>>>>> Sigh. Thank you. I somehow missed v2 patches themselves and only saw
>>>>>>>> the originals. Again I'll test/review the v2 by the end of the day
>>>>>>>> tomorrow!
>>>>>>>>
>>>>>>>> Actually a question for Dai: have you done performance tests with your
>>>>>>>> patches and can show that small copies still perform? Can you please
>>>>>>>> post your numbers with the patch series? When we posted the original
>>>>>>>> patch set we did provide performance numbers to support the choices we
>>>>>>>> made (ie, not hurting performance of small copies).
>>>>>>> Currently the source and destination export was mounted with default
>>>>>>> rsize of 524288 and the patch uses threshold of (rsize * 2 = 1048576)
>>>>>>> to decide whether to do inter-server copy or generic copy.
>>>>>>>
>>>>>>> I ran performance tests on my test VMs, with and without the patch,
>>>>>>> using 4 file sizes 1048576, 1049490, 2048000 and 7341056 bytes. I ran
>>>>>>> each test 5 times and took the average. I include the results of 'cp'
>>>>>>> for reference:
>>>>>>>
>>>>>>> size            cp          with patch                  without patch
>>>>>>> ----------------------------------------------------------------
>>>>>>> 1048576  0.031    0.032 (generic)             0.029 (generic)
>>>>>>> 1049490  0.032    0.042 (inter-server)      0.037 (generic)
>>>>>>> 2048000  0.051    0.047 (inter-server)      0.053 (generic)
>>>>>>> 7341056  0.157    0.074 (inter-server)      0.185 (inter-server)
>>>>>>> ----------------------------------------------------------------
>>>>>> Sorry, times are in seconds.
>>>>> Thank you for the numbers. #2 case is what I'm worried about.
>>>> Regarding performance numbers, the patch does better than the original
>>>> code in #3 and #4 and worse than the original code in #1 and #2. #4 run
>>>> shows the benefit of the patch when doing inter-copy. The #2 case can
>>>> be mitigated by using a configurable threshold. In general, I think it's
>>>> more important to have good performance on large files than small files
>>>> when using inter-server copy.  Note that the original code does not
>>>> do well with small files either as shown above.
>>> I think the current approach tries to be very conservative to achieve
>>> the goal of never being worse than cp. I'm not sure what you mean
>>> when you say the current code doesn't do well with small files. For
>>> small files it falls back to the generic copy.
>> In this table, the only case where the current code has an advantage
>> over 'cp' is run 1, and I don't know why; the rest are slower than 'cp'.
>> I don't yet have the copy size at which inter-copy in the current code
>> starts showing better performance, but even at ~7MB it is still
>> slower than 'cp'. So for any size smaller than about 7MB, the
>> inter-server copy will be slower than 'cp'. Compare that with the
>> patch, where the benefit of inter-server copy starts at ~2MB.
> I went back to Jorge Mora's perf numbers we posted. You are right, we
> did report perf degradation for any copies smaller than 16MB for when
> we didn't cap the copy size to be at least 14*rsize (I think the
> assumption was that rsize=1M and making it 14M). I'm just uneasy to
> open it to even smaller sizes. I think we should explicitly change it
>>> to 16MB instead of removing the restriction. Again, I think the policy
>>> we want is to do no worse than cp.

I can make this a module's configurable parameter and default it to 16MB.
However, why 16MB when the measurement I did shows inter-copy starts
performing better than 'cp' at ~2MB? Your previous measurement might no
longer be valid for the latest code with the patch. Can you verify that the
patch performs better than cp even with a 2048000-byte copy?
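The decision being debated can be modeled in a few lines of userspace C. This is only a sketch of the proposal in this thread, not actual kernel code: the function name, the `ssc_min_copy_len` tunable, and the 16MB default are all illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace model of the threshold debate above: offload a copy as
 * inter-server only when it is long enough to amortize the setup cost.
 * The patch ties the threshold to rsize (rsize * 2); the counter-proposal
 * adds a configurable floor defaulting to 16MB. */

enum copy_mode { COPY_GENERIC, COPY_INTER_SERVER };

static uint64_t ssc_min_copy_len = 16ULL * 1024 * 1024; /* proposed tunable */

static enum copy_mode choose_copy_mode(uint64_t count, uint64_t rsize)
{
    uint64_t threshold = 2 * rsize;     /* patch's rsize-based threshold */

    if (threshold < ssc_min_copy_len)   /* apply the proposed 16MB floor */
        threshold = ssc_min_copy_len;
    return count > threshold ? COPY_INTER_SERVER : COPY_GENERIC;
}
```

With the floor in place, a 2048000-byte copy falls back to generic copy, while copies above 16MB go inter-server; setting the tunable lower restores the patch's rsize-only behavior.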

>
>>>>> I don't believe the code works. In my 1st test doing "nfstest_ssc
>>>>> --runtest inter01" and then doing it again. What I see from inspecting
>>>>> the traces is that indeed unmount doesn't happen but for the 2nd copy
>>>>> the mount happens again.
>>>>>
>>>>> I'm attaching the trace. my servers are .114 (dest), .110 (src). my
>>>>> client .68. The first run of "inter01" places a copy in frame 364.
>>>>> frame 367 has the beginning of the "mount" between .114 and .110. then
>>>>> read happens. then a copy offload callback happens. No unmount happens
>>>>> as expected. inter01 continues with its verification and clean up. By
>>>>> frame 768 the test is done. I'm waiting a bit. So there is a heartbeat
>>>>> between the .114 and .110 in frame 769. Then the next run of the
>>>>> "inter01", COPY is placed in frame 1110. The next thing that happens
>>>>> are PUTROOTFH+bunch of GETATTRs that are part of the mount. So what is
>>>>> the saving here? a single EXCHANGE_ID? Either the code doesn't work or
>>>>> however it works provides no savings.
>>>> The savings are EXCHANGE_ID, CREATE_SESSION, RECLAIM_COMPLETE,
>>>> DESTROY_SESSION and DESTROY_CLIENTID for *every* inter-copy request.
>>>> The saving is reflected in the numbers for test run #4 above.
>>> Can't we do better than that? Since you are keeping a list of umounts,
>>> can't they be searched before doing the vfs_mount() and instead get
>>> the mount structure from the list (and not call the vfs_mount at all)
>>> and remove it from the umount list (wouldn't that save all the calls)?
>> I thought about this. My problem here is that we don't have much to key
>> on to search the list. The only thing in the COPY argument that can be
>> used for this purpose is the list of IP addresses of the source server.
>> I think that is not enough, there can be multiple exports from the
>> same server, how do we find the right one? it can get complicated.
>> I'm happy to consider any suggestion you have for this.
> I believe an IP address is exactly what's needed for keying. Each of
> those "mounts" to the same server is shared (that's normal behaviour
> for the client) -- meaning there is just   1 "mount" for a given IP
> (there is a single nfs4_client structure).

ok, I can give this a try. However, I think this only reduces the
number of RPCs, which are a bunch of GETATTRs, so I don't think it will
help the performance numbers significantly.
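The keying idea above can be sketched as a userspace model: before doing the equivalent of vfs_kern_mount(), search the delayed-umount list by the source server's IP address and reuse a hit. The struct, the fixed-size table, and every name here are made up for illustration; the real code would share a vfsmount/nfs4_client, not a toy array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define MAX_CACHED   8
#define EXPIRY_SECS  (15 * 60)          /* default 15-minute idle timeout */

struct cached_mount {
    char   src_ip[64];                  /* key: source server address */
    time_t expires;                     /* when the delayed task may umount */
    int    in_use;
};

static struct cached_mount cache[MAX_CACHED];

static struct cached_mount *ssc_find_mount(const char *src_ip)
{
    for (int i = 0; i < MAX_CACHED; i++) {
        struct cached_mount *cm = &cache[i];

        if (cm->in_use && strcmp(cm->src_ip, src_ip) == 0) {
            cm->expires = time(NULL) + EXPIRY_SECS; /* extend the lease */
            return cm;                  /* hit: no new mount RPCs */
        }
    }
    return NULL;                        /* miss: caller mounts for real */
}

static struct cached_mount *ssc_add_mount(const char *src_ip)
{
    for (int i = 0; i < MAX_CACHED; i++) {
        if (!cache[i].in_use) {
            snprintf(cache[i].src_ip, sizeof(cache[i].src_ip), "%s", src_ip);
            cache[i].expires = time(NULL) + EXPIRY_SECS;
            cache[i].in_use = 1;
            return &cache[i];
        }
    }
    return NULL;                        /* table full */
}
```

On a cache hit the second copy would skip the whole mount sequence, not just the EXCHANGE_ID/CREATE_SESSION handshake, which is the extra saving being asked for here.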

>
>> I think the patch is an improvement, in performance for copying large
>> files (if you consider 2MB file is large) and for removing the bug
>> of computing overhead in __nfs4_copy_file_range. Note that we can
>> always improve it and not necessary doing it all at once.
> I don't think saving 3 RPCs out of 14 is a good enough improvement
> when it can be made to save them all (unless you can convince me that
> we can't save all 14).

I don't understand your numbers here. Without the patch, there are
14 RPCs for the mount, 2 RPCs for the unmount, and the overhead of
creating and tearing down a TCP connection each time. With the patch,
there are 8 RPCs for the mount (PUTROOTFH and GETATTRs), 0 for the
unmount, and no TCP connection setup and teardown. The RPCs that the
patch saves are mostly heavy-weight requests; as the test showed, the
patch outperforms the current code even at 2MB.

-Dai

>
>> -Dai
>>
>>>> Note that the overhead of the copy in the current code includes mount
>>>> *and* unmount. However the threshold computed in __nfs4_copy_file_range
>>>> includes only the guesstimated mount overhead and not the unmount
>>>> overhead, so it is not correct.
>>>>
>>>> -Dai
>>>>
>>>>
>>>>> Honestly I don't understand the whole need of a semaphore and all.
>>>> The semaphore is to prevent the export to be unmounted while it's
>>>> being used.
>>>>
>>>> -Dai
>>>>
>>>>> My
>>>>> approach that I tried before was to create a delayed work item but I
>>>>> don't recall why I dropped it.
>>>>> https://patchwork.kernel.org/project/linux-nfs/patch/20170711164416.1982-43-kolga@netapp.com/
>>>>>
>>>>>
>>>>>> -Dai
>>>>>>
>>>>>>> Note that without the patch, the threshold to do inter-server
>>>>>>> copy is (524288 * 14 = 7340032) bytes. With the patch, the threshold
>>>>>>> to do inter-server is (524288 * 2 = 1048576) bytes, same as
>>>>>>> threshold to decide to sync/async for intra-copy.
>>>>>>>
>>>>>>>> While I agree that delaying the unmount on the server is beneficial
>>>>>>>> I'm not so sure that dropping the client restriction is wise because
>>>>>>>> the small (singular) copy would suffer the setup cost of the initial
>>>>>>>> mount.
>>>>>>> Right, but only the 1st copy. The export remains mounted for
>>>>>>> 15 mins so subsequent small copies do not incur the mount and unmount
>>>>>>> overhead.
>>>>>>>
>>>>>>> I think ideally we want the server to do inter-copy only if it's faster
>>>>>>> than the generic copy. We can probably come up with a number after some
>>>>>>> testing and that number can not be based on the rsize as it is now since
>>>>>>> the rsize can be changed by mount option. This can be a fixed number,
>>>>>>> 1M/2M/etc, and it should be configurable. What do you think? I'm open
>>>>>>> to any other options.
>>>>>>>
>>>>>>>>      Just my initial thoughts...
>>>>>>> Thanks,
>>>>>>> -Dai
>>>>>>>
>>>>>>>>> --
>>>>>>>>> Chuck Lever
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
Olga Kornievskaia April 8, 2021, 12:58 a.m. UTC | #15
On Wed, Apr 7, 2021 at 6:50 PM <dai.ngo@oracle.com> wrote:
>
>
> On 4/7/21 2:40 PM, Olga Kornievskaia wrote:
> > On Wed, Apr 7, 2021 at 4:16 PM <dai.ngo@oracle.com> wrote:
> >>
> >> On 4/7/21 12:01 PM, Olga Kornievskaia wrote:
> >>> On Wed, Apr 7, 2021 at 1:13 PM <dai.ngo@oracle.com> wrote:
> >>>> On 4/7/21 9:30 AM, Olga Kornievskaia wrote:
> >>>>> On Tue, Apr 6, 2021 at 9:23 PM <dai.ngo@oracle.com> wrote:
> >>>>>> On 4/6/21 6:12 PM, dai.ngo@oracle.com wrote:
> >>>>>>> On 4/6/21 1:43 PM, Olga Kornievskaia wrote:
> >>>>>>>> On Tue, Apr 6, 2021 at 3:58 PM Chuck Lever III
> >>>>>>>> <chuck.lever@oracle.com> wrote:
> >>>>>>>>>> On Apr 6, 2021, at 3:57 PM, Olga Kornievskaia
> >>>>>>>>>> <olga.kornievskaia@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Apr 6, 2021 at 3:43 PM Chuck Lever III
> >>>>>>>>>> <chuck.lever@oracle.com> wrote:
> >>>>>>>>>>>> On Apr 6, 2021, at 3:41 PM, Olga Kornievskaia
> >>>>>>>>>>>> <olga.kornievskaia@gmail.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Tue, Apr 6, 2021 at 12:33 PM Chuck Lever III
> >>>>>>>>>>>> <chuck.lever@oracle.com> wrote:
> >>>>>>>>>>>>>> On Apr 2, 2021, at 7:30 PM, Dai Ngo <dai.ngo@oracle.com> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Currently the source's export is mounted and unmounted on every
> >>>>>>>>>>>>>> inter-server copy operation. This causes unnecessary overhead
> >>>>>>>>>>>>>> for each copy.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> This patch series is an enhancement to allow the export to remain
> >>>>>>>>>>>>>> mounted for a configurable period (default to 15 minutes). If the
> >>>>>>>>>>>>>> export is not being used for the configured time it will be
> >>>>>>>>>>>>>> unmounted
> >>>>>>>>>>>>>> by a delayed task. If it's used again then its expiration time is
> >>>>>>>>>>>>>> extended for another period.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Since mount and unmount are no longer done on each copy request,
> >>>>>>>>>>>>>> this overhead is no longer used to decide whether the copy should
> >>>>>>>>>>>>>> be done with inter-server copy or generic copy. The threshold used
> >>>>>>>>>>>>>> to determine sync or async copy is now used for this decision.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> -Dai
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> v2: fix compiler warning of missing prototype.
> >>>>>>>>>>>>> Hi Olga-
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I'm getting ready to shrink-wrap the initial NFSD v5.13 pull
> >>>>>>>>>>>>> request.
> >>>>>>>>>>>>> Have you had a chance to review Dai's patches?
> >>>>>>>>>>>> Hi Chuck,
> >>>>>>>>>>>>
> >>>>>>>>>>>> I apologize I haven't had the chance to review/test it yet. Can I
> >>>>>>>>>>>> have
> >>>>>>>>>>>> until tomorrow evening to do so?
> >>>>>>>>>>> Next couple of days will be fine. Thanks!
> >>>>>>>>>>>
> >>>>>>>>>> I also assumed there would be a v2 given that kbot complained about
> >>>>>>>>>> the NFSD patch.
> >>>>>>>>> This is the v2 (see Subject: )
> >>>>>>>> Sigh. Thank you. I somehow missed v2 patches themselves and only saw
> >>>>>>>> the originals. Again I'll test/review the v2 by the end of the day
> >>>>>>>> tomorrow!
> >>>>>>>>
> >>>>>>>> Actually a question for Dai: have you done performance tests with your
> >>>>>>>> patches and can show that small copies still perform? Can you please
> >>>>>>>> post your numbers with the patch series? When we posted the original
> >>>>>>>> patch set we did provide performance numbers to support the choices we
> >>>>>>>> made (ie, not hurting performance of small copies).
> >>>>>>> Currently the source and destination export was mounted with default
> >>>>>>> rsize of 524288 and the patch uses threshold of (rsize * 2 = 1048576)
> >>>>>>> to decide whether to do inter-server copy or generic copy.
> >>>>>>>
> >>>>>>> I ran performance tests on my test VMs, with and without the patch,
> >>>>>>> using 4 file sizes 1048576, 1049490, 2048000 and 7341056 bytes. I ran
> >>>>>>> each test 5 times and took the average. I include the results of 'cp'
> >>>>>>> for reference:
> >>>>>>>
> >>>>>>> size            cp          with patch                  without patch
> >>>>>>> ----------------------------------------------------------------
> >>>>>>> 1048576  0.031    0.032 (generic)             0.029 (generic)
> >>>>>>> 1049490  0.032    0.042 (inter-server)      0.037 (generic)
> >>>>>>> 2048000  0.051    0.047 (inter-server)      0.053 (generic)
> >>>>>>> 7341056  0.157    0.074 (inter-server)      0.185 (inter-server)
> >>>>>>> ----------------------------------------------------------------
> >>>>>> Sorry, times are in seconds.
> >>>>> Thank you for the numbers. #2 case is what I'm worried about.
> >>>> Regarding performance numbers, the patch does better than the original
> >>>> code in #3 and #4 and worse then original code in #1 and #2. #4 run
> >>>> shows the benefit of the patch when doing inter-copy. The #2 case can
> >>>> be mitigated by using a configurable threshold. In general, I think it's
> >>>> more important to have good performance on large files than small files
> >>>> when using inter-server copy.  Note that the original code does not
> >>>> do well with small files either as shown above.
> >>> I think the current approach tries to be very conservative to achieve
> >>> the goal of never being worse than the cp. I'm not sure what you mean
> >>> that current code doesn't do well with small files. For small files it
> >>> falls back to the generic copy.
> >> In this table, the only advantage the current code has over 'cp' is
> >> run 1 which I don't know why. The rest is slower than 'cp'. I don't
> >> have the size of the copy where the inter-copy in the current code
> >> starts showing better performance yet, but even at ~7MB it is still
> >> slower than 'cp'. So for any size that is smaller than 7MB+, the
> >> inter-server copy will be slower than 'cp'. Compare that with the
> >> patch, the benefit of inter-server copy starts at ~2MB.
> > I went back to Jorge Mora's perf numbers we posted. You are right, we
> > did report perf degradation for any copies smaller than 16MB for when
> > we didn't cap the copy size to be at least 14*rsize (I think the
> > assumption was that rsize=1M and making it 14M). I'm just uneasy to
> > open it to even smaller sizes. I think we should explicitly change it
> > to 16MB instead of removing the restriction. Again I think the policy
> > we want it to do no worse than cp.
>
> I can make this a module's configurable parameter and default it to 16MB.
> However, why 16MB when the measurement I did shows inter-copy starts
> performing better than 'cp' at ~2MB? Your previous measurement might no
> longer be valid for the latest code with the patch. Can you verify that the
> patch performs better than cp even with a 2048000-byte copy?

16MB came from hardware measurements, which are a lot more realistic
than VM measurements. Can you please get hardware measurements? I
haven't gotten around to the hardware testing because I think the
solution needs to be changed (i.e., the whole mount needs to be saved,
not just part of the mount). Once we have that solution, we can
measure performance and see what numbers make sense for the client
side. Having it as a module configuration parameter is fine with me.
Updating it to a new set of hardware-based numbers is ok with me too.

> >>>>> I don't believe the code works. In my 1st test doing "nfstest_ssc
> >>>>> --runtest inter01" and then doing it again. What I see from inspecting
> >>>>> the traces is that indeed unmount doesn't happen but for the 2nd copy
> >>>>> the mount happens again.
> >>>>>
> >>>>> I'm attaching the trace. my servers are .114 (dest), .110 (src). my
> >>>>> client .68. The first run of "inter01" places a copy in frame 364.
> >>>>> frame 367 has the beginning of the "mount" between .114 and .110. then
> >>>>> read happens. then a copy offload callback happens. No unmount happens
> >>>>> as expected. inter01 continues with its verification and clean up. By
> >>>>> frame 768 the test is done. I'm waiting a bit. So there is a heatbeat
> >>>>> between the .114 and .110 in frame 769. Then the next run of the
> >>>>> "inter01", COPY is placed in frame 1110. The next thing that happens
> >>>>> are PUTROOTFH+bunch of GETATTRs that are part of the mount. So what is
> >>>>> the saving here? a single EXCHANGE_ID? Either the code doesn't work or
> >>>>> however it works provides no savings.
> >>>> The saving are EXCHANGE_ID, CREATE_SESSION, RECLAIM COMPLETE,
> >>>> DESTROY_SESSION and DESTROY_CLIENTID for *every* inter-copy request.
> >>>> The saving is reflected in the number of #4 test run above.
> >>> Can't we do better than that? Since you are keeping a list of umounts,
> >>> can't they be searched before doing the vfs_mount() and instead get
> >>> the mount structure from the list (and not call the vfs_mount at all)
> >>> and remove it from the umount list (wouldn't that save all the calls)?
> >> I thought about this. My problem here is that we don't have much to key
> >> on to search the list. The only thing in the COPY argument can be used
> >> for this purpose is the list of IP addresses of the source server.
> >> I think that is not enough, there can be multiple exports from the
> >> same server, how do we find the right one? it can get complicated.
> >> I'm happy to consider any suggestion you have for this.
> > I believe an IP address is exactly what's needed for keying. Each of
> > those "mounts" to the same server is shared (that's normal behaviour
> > for the client) -- meaning there is just   1 "mount" for a given IP
> > (there is a single nfs4_client structure).
>
> ok, I can give this a try. However, I think this only reduces the
> number of RPCs, which are a bunch of GETATTRs, so I don't think it will
> help the performance numbers significantly.

They are RPCs and metadata operations that go into the file system. I
would think they affect performance...

> >> I think the patch is an improvement, in performance for copying large
> >> files (if you consider 2MB file is large) and for removing the bug
> >> of computing overhead in __nfs4_copy_file_range. Note that we can
> >> always improve it and not necessary doing it all at once.
> > I don't think saving 3 RPCs out of 14 is a good enough improvement
> > when it can be made to save them all (unless you can convince me that
> > we can't save all 14).
>
> I don't understand your numbers here. Without the patch, there are
> 14 RPCs for the mount, 2 RPCs for the unmount, and the overhead of
> creating and tearing down a TCP connection each time. With the patch,
> there are 8 RPCs for the mount (PUTROOTFH and GETATTRs), 0 for the
> unmount, and no TCP connection setup and teardown. The RPCs that the
> patch saves are mostly heavy-weight requests; as the test showed, the
> patch outperforms the current code even at 2MB.

I counted mount operation savings. I agreed with you that we saved
the umount. I'm saying that given that you are keeping track of the
mount structures, those 8 RPCs shouldn't happen.

I think my approach didn't keep track of the mount and just delayed
the execution of the cleanup function, so it was not possible to "save
the mount RPCs". Your code takes it further but not far enough.

As for the savings as I said I'd like to see the performance numbers
based on the hardware setup not VMs. And yes the existing hardware
numbers are based on the old code and need to be re-run on the
existing code. I will try to do that tomorrow.

I think at this point you need to have one of the maintainers weigh in
on whether or not this intermediate step of saving a few RPCs is worth
checking in before going after the savings of the whole mount. It's
really up to them. In my opinion, if we are trying to save mount
operations, we should do it all the way.

Furthermore, I still really would like Bruce or Chuck to weigh in on
the use of the semaphore. We are holding onto a semaphore while doing
vfs_kern_mount(); I thought it's frowned upon to hold any locks while
doing network operations. What if vfs_kern_mount() is trying to reach
an unresponsive server? Then all other unmounts are blocked, right? I
still don't understand why we need a semaphore. I think a spin lock
that protects access to the list of umounts is sufficient. I probably
should be commenting on the actual patch here.
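The locking scheme suggested above can be sketched in userspace, with a pthread mutex standing in for the kernel spin lock. This is only a model of the idea, not the actual nfsd code; all structure and function names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>

/* Lock held only for list/counter updates, plus a per-mount refcount so
 * an export in use by a copy is never unmounted. The slow mount/umount
 * calls run with no lock held, so an unresponsive source server cannot
 * block unrelated unmounts. */

struct ssc_mount {
    int refcount;                       /* > 0: a copy is using this mount */
};

static pthread_mutex_t umount_lock = PTHREAD_MUTEX_INITIALIZER;

static void ssc_get(struct ssc_mount *m)
{
    pthread_mutex_lock(&umount_lock);
    m->refcount++;                      /* pin: delayed task must skip it */
    pthread_mutex_unlock(&umount_lock);
}

static void ssc_put(struct ssc_mount *m)
{
    pthread_mutex_lock(&umount_lock);
    m->refcount--;
    pthread_mutex_unlock(&umount_lock);
}

/* Delayed task: unmount only if nobody holds a reference. */
static int ssc_try_umount(struct ssc_mount *m)
{
    int idle;

    pthread_mutex_lock(&umount_lock);
    idle = (m->refcount == 0);
    pthread_mutex_unlock(&umount_lock);
    /* the real umount would run here, with the lock already dropped */
    return idle;
}
```

The point of the design is that the critical sections contain only counter updates, never network operations, so the lock can be a spin lock rather than a semaphore.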


> -Dai
>
> >
> >> -Dai
> >>
> >>>> Note that the overhead of the copy in the current code includes mount
> >>>> *and* unmount. However the threshold computed in __nfs4_copy_file_range
> >>>> includes only the guesstimated mount overhead and not the unmount
> >>>> overhead so it not correct.
> >>>>
> >>>> -Dai
> >>>>
> >>>>
> >>>>> Honestly I don't understand the whole need of a semaphore and all.
> >>>> The semaphore is to prevent the export to be unmounted while it's
> >>>> being used.
> >>>>
> >>>> -Dai
> >>>>
> >>>>> My
> >>>>> approach that I tried before was to create a delayed work item but I
> >>>>> don't recall why I dropped it.
> >>>>> https://patchwork.kernel.org/project/linux-nfs/patch/20170711164416.1982-43-kolga@netapp.com/
> >>>>>
> >>>>>
> >>>>>> -Dai
> >>>>>>
> >>>>>>> Note that without the patch, the threshold to do inter-server
> >>>>>>> copy is (524288 * 14 = 7340032) bytes. With the patch, the threshold
> >>>>>>> to do inter-server is (524288 * 2 = 1048576) bytes, same as
> >>>>>>> threshold to decide to sync/async for intra-copy.
> >>>>>>>
> >>>>>>>> While I agree that delaying the unmount on the server is beneficial
> >>>>>>>> I'm not so sure that dropping the client restriction is wise because
> >>>>>>>> the small (singular) copy would suffer the setup cost of the initial
> >>>>>>>> mount.
> >>>>>>> Right, but only the 1st copy. The export remains to be mounted for
> >>>>>>> 15 mins so subsequent small copies do not incur the mount and unmount
> >>>>>>> overhead.
> >>>>>>>
> >>>>>>> I think ideally we want the server to do inter-copy only if it's faster
> >>>>>>> than the generic copy. We can probably come up with a number after some
> >>>>>>> testing and that number can not be based on the rsize as it is now since
> >>>>>>> the rsize can be changed by mount option. This can be a fixed number,
> >>>>>>> 1M/2M/etc, and it should be configurable. What do you think? I'm open
> >>>>>>> to any other options.
> >>>>>>>
> >>>>>>>>      Just my initial thoughts...
> >>>>>>> Thanks,
> >>>>>>> -Dai
> >>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Chuck Lever
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
Dai Ngo April 8, 2021, 7:19 a.m. UTC | #16
On 4/7/21 5:58 PM, Olga Kornievskaia wrote:
> On Wed, Apr 7, 2021 at 6:50 PM <dai.ngo@oracle.com> wrote:
>>
>> On 4/7/21 2:40 PM, Olga Kornievskaia wrote:
>>> On Wed, Apr 7, 2021 at 4:16 PM <dai.ngo@oracle.com> wrote:
>>>> On 4/7/21 12:01 PM, Olga Kornievskaia wrote:
>>>>> On Wed, Apr 7, 2021 at 1:13 PM <dai.ngo@oracle.com> wrote:
>>>>>> On 4/7/21 9:30 AM, Olga Kornievskaia wrote:
>>>>>>> On Tue, Apr 6, 2021 at 9:23 PM <dai.ngo@oracle.com> wrote:
>>>>>>>> On 4/6/21 6:12 PM, dai.ngo@oracle.com wrote:
>>>>>>>>> On 4/6/21 1:43 PM, Olga Kornievskaia wrote:
>>>>>>>>>> On Tue, Apr 6, 2021 at 3:58 PM Chuck Lever III
>>>>>>>>>> <chuck.lever@oracle.com> wrote:
>>>>>>>>>>>> On Apr 6, 2021, at 3:57 PM, Olga Kornievskaia
>>>>>>>>>>>> <olga.kornievskaia@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Apr 6, 2021 at 3:43 PM Chuck Lever III
>>>>>>>>>>>> <chuck.lever@oracle.com> wrote:
>>>>>>>>>>>>>> On Apr 6, 2021, at 3:41 PM, Olga Kornievskaia
>>>>>>>>>>>>>> <olga.kornievskaia@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Apr 6, 2021 at 12:33 PM Chuck Lever III
>>>>>>>>>>>>>> <chuck.lever@oracle.com> wrote:
>>>>>>>>>>>>>>>> On Apr 2, 2021, at 7:30 PM, Dai Ngo <dai.ngo@oracle.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Currently the source's export is mounted and unmounted on every
>>>>>>>>>>>>>>>> inter-server copy operation. This causes unnecessary overhead
>>>>>>>>>>>>>>>> for each copy.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This patch series is an enhancement to allow the export to remain
>>>>>>>>>>>>>>>> mounted for a configurable period (default to 15 minutes). If the
>>>>>>>>>>>>>>>> export is not being used for the configured time it will be
>>>>>>>>>>>>>>>> unmounted
>>>>>>>>>>>>>>>> by a delayed task. If it's used again then its expiration time is
>>>>>>>>>>>>>>>> extended for another period.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Since mount and unmount are no longer done on each copy request,
>>>>>>>>>>>>>>>> this overhead is no longer used to decide whether the copy should
>>>>>>>>>>>>>>>> be done with inter-server copy or generic copy. The threshold used
>>>>>>>>>>>>>>>> to determine sync or async copy is now used for this decision.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -Dai
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> v2: fix compiler warning of missing prototype.
>>>>>>>>>>>>>>> Hi Olga-
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'm getting ready to shrink-wrap the initial NFSD v5.13 pull
>>>>>>>>>>>>>>> request.
>>>>>>>>>>>>>>> Have you had a chance to review Dai's patches?
>>>>>>>>>>>>>> Hi Chuck,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I apologize I haven't had the chance to review/test it yet. Can I
>>>>>>>>>>>>>> have
>>>>>>>>>>>>>> until tomorrow evening to do so?
>>>>>>>>>>>>> Next couple of days will be fine. Thanks!
>>>>>>>>>>>>>
>>>>>>>>>>>> I also assumed there would be a v2 given that kbot complained about
>>>>>>>>>>>> the NFSD patch.
>>>>>>>>>>> This is the v2 (see Subject: )
>>>>>>>>>> Sigh. Thank you. I somehow missed v2 patches themselves and only saw
>>>>>>>>>> the originals. Again I'll test/review the v2 by the end of the day
>>>>>>>>>> tomorrow!
>>>>>>>>>>
>>>>>>>>>> Actually a question for Dai: have you done performance tests with your
>>>>>>>>>> patches and can show that small copies still perform? Can you please
>>>>>>>>>> post your numbers with the patch series? When we posted the original
>>>>>>>>>> patch set we did provide performance numbers to support the choices we
>>>>>>>>>> made (ie, not hurting performance of small copies).
>>>>>>>>> Currently the source and destination export was mounted with default
>>>>>>>>> rsize of 524288 and the patch uses threshold of (rsize * 2 = 1048576)
>>>>>>>>> to decide whether to do inter-server copy or generic copy.
>>>>>>>>>
>>>>>>>>> I ran performance tests on my test VMs, with and without the patch,
>>>>>>>>> using 4 file sizes 1048576, 1049490, 2048000 and 7341056 bytes. I ran
>>>>>>>>> each test 5 times and took the average. I include the results of 'cp'
>>>>>>>>> for reference:
>>>>>>>>>
>>>>>>>>> size            cp          with patch                  without patch
>>>>>>>>> ----------------------------------------------------------------
>>>>>>>>> 1048576  0.031    0.032 (generic)             0.029 (generic)
>>>>>>>>> 1049490  0.032    0.042 (inter-server)      0.037 (generic)
>>>>>>>>> 2048000  0.051    0.047 (inter-server)      0.053 (generic)
>>>>>>>>> 7341056  0.157    0.074 (inter-server)      0.185 (inter-server)
>>>>>>>>> ----------------------------------------------------------------
>>>>>>>> Sorry, times are in seconds.
>>>>>>> Thank you for the numbers. #2 case is what I'm worried about.
>>>>>> Regarding performance numbers, the patch does better than the original
>>>>>> code in #3 and #4 and worse then original code in #1 and #2. #4 run
>>>>>> shows the benefit of the patch when doing inter-copy. The #2 case can
>>>>>> be mitigated by using a configurable threshold. In general, I think it's
>>>>>> more important to have good performance on large files than small files
>>>>>> when using inter-server copy.  Note that the original code does not
>>>>>> do well with small files either as shown above.
>>>>> I think the current approach tries to be very conservative to achieve
>>>>> the goal of never being worse than the cp. I'm not sure what you mean
>>>>> that current code doesn't do well with small files. For small files it
>>>>> falls back to the generic copy.
>>>> In this table, the only advantage the current code has over 'cp' is
>>>> run 1 which I don't know why. The rest is slower than 'cp'. I don't
>>>> have the size of the copy where the inter-copy in the current code
>>>> starts showing better performance yet, but even at ~7MB it is still
>>>> slower than 'cp'. So for any size that is smaller than 7MB+, the
>>>> inter-server copy will be slower than 'cp'. Compare that with the
>>>> patch, the benefit of inter-server copy starts at ~2MB.
>>> I went back to Jorge Mora's perf numbers we posted. You are right, we
>>> did report perf degradation for any copies smaller than 16MB for when
>>> we didn't cap the copy size to be at least 14*rsize (I think the
>>> assumption was that rsize=1M and making it 14M). I'm just uneasy to
>>> open it to even smaller sizes. I think we should explicitly change it
>>> to 16MB instead of removing the restriction. Again I think the policy
>>> we want it to do no worse than cp.
>> I can make this a module's configurable parameter and default it to 16MB.
>> However, why 16MB when the measurement I did shows inter-copy starts
>> performing better than 'cp' at ~2MB? Your previous measurement might no
>> longer be valid for the latest code with the patch. Can you verify that the
>> patch performs better than cp even with a 2048000-byte copy?
> 16MB came from hardware measurements which are a lot more realistic
> than VM measurements. Can you please get hardware measurements? I
> haven't gotten around to the hardware testing because I think the
> solution needs to be changed (ie the whole mount needs to be saved and
> not just part of the mount). Once we have that solution, we can
> measure performance and see what numbers make sense for the client
> side. Having it as a module configuration parameter is fine with me.
> Updating it to a new set of hardware based numbers is ok with me too.

I'm also ok with a module configuration parameter. I don't have the
hardware for testing, so I will use the default value of 16MB for now.
Since it's configurable, anyone can adjust it to the desired value. I
will send a new patch for this.

>
>>>>>>> I don't believe the code works. In my 1st test doing "nfstest_ssc
>>>>>>> --runtest inter01" and then doing it again. What I see from inspecting
>>>>>>> the traces is that indeed unmount doesn't happen but for the 2nd copy
>>>>>>> the mount happens again.
>>>>>>>
>>>>>>> I'm attaching the trace. my servers are .114 (dest), .110 (src). my
>>>>>>> client .68. The first run of "inter01" places a copy in frame 364.
>>>>>>> frame 367 has the beginning of the "mount" between .114 and .110. then
>>>>>>> read happens. then a copy offload callback happens. No unmount happens
>>>>>>> as expected. inter01 continues with its verification and clean up. By
>>>>>>> frame 768 the test is done. I'm waiting a bit. So there is a heatbeat
>>>>>>> between the .114 and .110 in frame 769. Then the next run of the
>>>>>>> "inter01", COPY is placed in frame 1110. The next thing that happens
>>>>>>> are PUTROOTFH+bunch of GETATTRs that are part of the mount. So what is
>>>>>>> the saving here? a single EXCHANGE_ID? Either the code doesn't work or
>>>>>>> however it works provides no savings.
>>>>>> The saving are EXCHANGE_ID, CREATE_SESSION, RECLAIM COMPLETE,
>>>>>> DESTROY_SESSION and DESTROY_CLIENTID for *every* inter-copy request.
>>>>>> The saving is reflected in the number of #4 test run above.
>>>>> Can't we do better than that? Since you are keeping a list of umounts,
>>>>> can't they be searched before doing the vfs_mount() and instead get
>>>>> the mount structure from the list (and not call the vfs_mount at all)
>>>>> and remove it from the umount list (wouldn't that save all the calls)?
>>>> I thought about this. My problem here is that we don't have much to key
>>>> on to search the list. The only thing in the COPY argument can be used
>>>> for this purpose is the list of IP addresses of the source server.
>>>> I think that is not enough, there can be multiple exports from the
>>>> same server, how do we find the right one? it can get complicated.
>>>> I'm happy to consider any suggestion you have for this.
>>> I believe an IP address is exactly what's needed for keying. Each of
>>> those "mounts" to the same server is shared (that's normal behaviour
>>> for the client) -- meaning there is just   1 "mount" for a given IP
>>> (there is a single nfs4_client structure).
>> ok, I can give this a try. However, I think this only reduces the
>> number of RPCs, which are a bunch of GETATTRs, so I don't think it will
>> help the performance numbers significantly.
> They are RPCs and metadata operations that go into the file system. I
> would think they affect performance...
>
>>>> I think the patch is an improvement, in performance for copying large
>>>> files (if you consider 2MB file is large) and for removing the bug
>>>> of computing overhead in __nfs4_copy_file_range. Note that we can
>>>> always improve it and not necessary doing it all at once.
>>> I don't think saving 3 RPCs out of 14 is a good enough improvement
>>> when it can be made to save them all (unless you can convince me that
>>> we can't save all 14).
>> I don't understand your numbers here. Without the patch, there are
>> 14 RPCs for the mount, 2 RPCs for the unmount, and the overhead of
>> creating and tearing down a TCP connection each time. With the patch,
>> there are 8 RPCs for the mount (PUTROOTFH and GETATTRs), 0 for the
>> unmount, and no TCP connection setup and teardown. The RPCs that the
>> patch saves are mostly heavy-weight requests; as the test showed, the
>> patch outperforms the current code even at 2MB.
> I counted the mount operation savings. I agreed with you that we saved
> the umount. I'm saying that, given that you are keeping track of the
> mount structures, those 8 RPCs shouldn't happen.
>
> I think my approach didn't keep track of the mount and just delayed
> the execution of the cleanup function, so it was not possible to "save
> the mount rpcs". Your code takes it further, but not far enough.

ok, let's hold off on this patch; I will look into the IP address
approach and also see if we can completely remove all RPCs related to
the mount.
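For what it's worth, the IP-keyed reuse being discussed could look roughly
like the following userspace sketch. All names here (ssc_mount,
ssc_find_or_create) are made up for illustration, not the real nfsd API;
the idea is just that a lookup keyed on the source server's address reuses
an existing mount and only falls through to vfs_kern_mount() on a miss:

```c
/* Hypothetical sketch: cache source-server mounts keyed by IP address.
 * A cache hit bumps a reference count and skips the mount entirely. */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct ssc_mount {
	char addr[64];			/* source server IP: the lookup key */
	int refcount;			/* pins the mount while a copy runs */
	struct ssc_mount *next;
};

static struct ssc_mount *mount_list;

/* Reuse an existing mount for this server IP, or create a new one. */
static struct ssc_mount *ssc_find_or_create(const char *addr)
{
	struct ssc_mount *m;

	for (m = mount_list; m; m = m->next) {
		if (strcmp(m->addr, addr) == 0) {
			m->refcount++;	/* cache hit: no vfs_kern_mount() */
			return m;
		}
	}
	m = calloc(1, sizeof(*m));
	strncpy(m->addr, addr, sizeof(m->addr) - 1);
	m->refcount = 1;		/* would call vfs_kern_mount() here */
	m->next = mount_list;
	mount_list = m;
	return m;
}
```

With something like this, the second copy to the same server finds the
cached entry and pays none of the mount RPCs.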

>
> As for the savings, as I said, I'd like to see the performance numbers
> based on a hardware setup, not VMs. And yes, the existing hardware
> numbers are based on the old code and need to be re-run against the
> current code. I will try to do that tomorrow.

yes, that's a useful data point.

>
> I think at this point you need to have one of the maintainers weigh in
> on whether or not this intermediate step of saving a few RPCs is worth
> checking in, with the saving of the whole mount to follow. It's really
> up to them. In my opinion, if we are trying to save mount
> operations, we should do it all the way.
>
> Furthermore, I still really would like Bruce or Chuck to weigh in on
> the use of the semaphore. We are holding onto a semaphore while doing
> vfs_kern_mount(), I thought it's frowned upon to hold any locks while
> doing network operations. What if vfs_kern_mount() is trying to reach
> an unresponsive server, then all other unmounts are blocked, right? I
> still don't understand why we need a semaphore. I think a spin lock
> that protects access to the list of umounts is sufficient. I probably
> should be commenting on the actual patch here.

Let's hold off on this until I submit a new patch.
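As a rough userspace sketch of the spin-lock idea raised above (with
pthread_mutex_t standing in for the kernel spin lock, and every name
invented for illustration), the point is that the slow, network-bound
mount happens with no lock held, and the lock only guards the short
list update:

```c
/* Sketch: guard only the umount list; never hold the lock across the
 * (potentially slow, network-bound) mount call. */
#include <pthread.h>

static pthread_mutex_t umount_list_lock = PTHREAD_MUTEX_INITIALIZER;
static int umount_list_len;

static void slow_network_mount(void)
{
	/* Stand-in for vfs_kern_mount(): may block for a long time on
	 * an unresponsive server, so no list lock may be held here. */
}

static void ssc_add_to_umount_list(void)
{
	slow_network_mount();			/* outside the lock */

	pthread_mutex_lock(&umount_list_lock);
	umount_list_len++;			/* short critical section */
	pthread_mutex_unlock(&umount_list_lock);
}
```

That way an unresponsive source server stalls only its own copy, not
every other unmount waiting on the same lock.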

Thanks,
-Dai

>
>
>> -Dai
>>
>>>> -Dai
>>>>
>>>>>> Note that the overhead of the copy in the current code includes mount
>>>>>> *and* unmount. However, the threshold computed in __nfs4_copy_file_range
>>>>>> includes only the guesstimated mount overhead and not the unmount
>>>>>> overhead, so it is not correct.
>>>>>>
>>>>>> -Dai
>>>>>>
>>>>>>
>>>>>>> Honestly I don't understand the whole need of a semaphore and all.
>>>>>> The semaphore is to prevent the export from being unmounted while
>>>>>> it's being used.
>>>>>>
>>>>>> -Dai
>>>>>>
>>>>>>> My
>>>>>>> approach that I tried before was to create a delayed work item but I
>>>>>>> don't recall why I dropped it.
>>>>>>> https://patchwork.kernel.org/project/linux-nfs/patch/20170711164416.1982-43-kolga@netapp.com/
>>>>>>>
>>>>>>>
>>>>>>>> -Dai
>>>>>>>>
>>>>>>>>> Note that without the patch, the threshold to do inter-server
>>>>>>>>> copy is (524288 * 14 = 7340032) bytes. With the patch, the threshold
>>>>>>>>> to do inter-server copy is (524288 * 2 = 1048576) bytes, the same as
>>>>>>>>> the threshold used to decide sync/async for intra-copy.
>>>>>>>>>
>>>>>>>>>> While I agree that delaying the unmount on the server is beneficial,
>>>>>>>>>> I'm not so sure that dropping the client restriction is wise, because
>>>>>>>>>> the small (singular) copy would suffer the setup cost of the initial
>>>>>>>>>> mount.
>>>>>>>>> Right, but only the 1st copy. The export remains mounted for
>>>>>>>>> 15 minutes, so subsequent small copies do not incur the mount and
>>>>>>>>> unmount overhead.
>>>>>>>>>
>>>>>>>>> I think ideally we want the server to do inter-server copy only if
>>>>>>>>> it's faster than the generic copy. We can probably come up with a
>>>>>>>>> number after some testing, and that number cannot be based on the
>>>>>>>>> rsize as it is now, since the rsize can be changed by a mount option.
>>>>>>>>> This can be a fixed number, 1M/2M/etc., and it should be configurable.
>>>>>>>>> What do you think? I'm open to any other options.
>>>>>>>>>
>>>>>>>>>>       Just my initial thoughts...
>>>>>>>>> Thanks,
>>>>>>>>> -Dai
>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Chuck Lever
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
Chuck Lever April 8, 2021, 3:25 p.m. UTC | #17
> On Apr 7, 2021, at 8:58 PM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
> 
> Furthermore, I still really would like Bruce or Chuck to weigh in on
> the use of the semaphore.

The semaphore caught my eye too.

The usual trick to prevent a mount from disappearing is to
bump a reference count until it is safe for the mount to
possibly go away. Dai, if you can make this work with just
an extra reference count, I would prefer that.
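A minimal userspace sketch of that reference-count idea follows. The names
are invented for illustration (the kernel would use refcount_t or a kref,
not a bare int): a copy takes a reference before using the mount, and the
delayed task only tears the mount down once the count drops to zero.

```c
/* Sketch: pin a cached mount with a reference count instead of
 * holding a semaphore across the whole copy. */
#include <stdbool.h>

struct cached_mount {
	int refcount;		/* kernel code would use refcount_t */
	bool unmounted;
};

static void ssc_get(struct cached_mount *m)
{
	m->refcount++;		/* copy in progress: pin the mount */
}

static void ssc_put(struct cached_mount *m)
{
	m->refcount--;		/* copy done: drop the pin */
}

/* Delayed task: unmount only when no copy still holds a reference. */
static bool ssc_try_unmount(struct cached_mount *m)
{
	if (m->refcount > 0)
		return false;	/* still in use; retry later */
	m->unmounted = true;	/* would do the real unmount here */
	return true;
}
```

Nothing blocks: an in-flight copy simply makes the delayed unmount
reschedule itself rather than wait on a lock.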


> We are holding onto a semaphore while doing
> vfs_kern_mount(), I thought it's frowned upon to hold any locks while
> doing network operations. What if vfs_kern_mount() is trying to reach
> an unresponsive server, then all other unmounts are blocked, right? I
> still don't understand why we need a semaphore. I think a spin lock
> that protects access to the list of umounts is sufficient. I probably
> should be commenting on the actual patch here.




--
Chuck Lever