Message ID | 1644349472-31077-1-git-send-email-quic_abhinavk@quicinc.com
---|---
State | Not Applicable
Series | devcoredump: increase the device delete timeout to 10 mins
On Tue, 2022-02-08 at 11:44 -0800, Abhinav Kumar wrote:
> There are cases where depending on the size of the devcoredump and the speed
> at which the usermode reads the dump, it can take longer than the current 5 mins
> timeout.
>
> This can lead to incomplete dumps as the device is deleted once the timeout expires.
>
> One example is below where it took 6 mins for the devcoredump to be completely read.
>
> 04:22:24.668 23916 23994 I HWDeviceDRM::DumpDebugData: Opening /sys/class/devcoredump/devcd6/data
> 04:28:35.377 23916 23994 W HWDeviceDRM::DumpDebugData: Freeing devcoredump node
>
> Increase the timeout to 10 mins to accommodate system delays and large coredump
> sizes.
>

No real objection, I guess, but can the data actually disappear *while*
the sysfs file is open?!

Or did it take 5 minutes to open the file?

If the former, maybe we should fix that too (or instead)?

johannes
Hi Johannes

Thanks for the response.

On 2/8/2022 12:35 PM, Johannes Berg wrote:
> On Tue, 2022-02-08 at 11:44 -0800, Abhinav Kumar wrote:
>> There are cases where depending on the size of the devcoredump and the speed
>> at which the usermode reads the dump, it can take longer than the current 5 mins
>> timeout.
>>
>> This can lead to incomplete dumps as the device is deleted once the timeout expires.
>>
>> One example is below where it took 6 mins for the devcoredump to be completely read.
>>
>> 04:22:24.668 23916 23994 I HWDeviceDRM::DumpDebugData: Opening /sys/class/devcoredump/devcd6/data
>> 04:28:35.377 23916 23994 W HWDeviceDRM::DumpDebugData: Freeing devcoredump node
>>
>> Increase the timeout to 10 mins to accommodate system delays and large coredump
>> sizes.
>>
>
> No real objection, I guess, but can the data actually disappear *while*
> the sysfs file is open?!
>
> Or did it take 5 minutes to open the file?
>
> If the former, maybe we should fix that too (or instead)?
>
> johannes

It opened the file right away but could not finish reading before the
timeout. The device gets deleted, so the corresponding data node
disappears too (as it sits under devcd*/data):

static void devcd_del(struct work_struct *wk)
{
	struct devcd_entry *devcd;

	devcd = container_of(wk, struct devcd_entry, del_wk.work);

	device_del(&devcd->devcd_dev);
	put_device(&devcd->devcd_dev);
}

Are you suggesting we implement logic like this:

a) If usermode has started reading the data but has not finished yet (we
can detect the former with something like devcd->data_read_ongoing = 1,
and we know it has finished when it acks, at which point we clear the
flag), then in the timeout del_wk we delay the delete timer by another
TIMEOUT amount of time to give usermode time to finish reading the data.

b) If usermode acks, we clear the flag and delete the device as usual.

But there is a corner case here:

c) If usermode starts the read but then crashes for some reason, the
timer will expire, try to delete the device, detect that usermode is
still reading, and keep the device. How do we detect this case?

That's why I thought the easier way right now is to increase the
timeout.
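For concreteness, a minimal sketch of what (a) and (b) could look like in
devcd_del(), assuming a hypothetical devcd->data_read_ongoing flag; neither
the flag nor the re-arm exists in the driver as posted:

/* Sketch only: data_read_ongoing is a made-up field, not part of the
 * current devcoredump driver; set when a reader opens the data file,
 * cleared when userspace acks.
 */
static void devcd_del(struct work_struct *wk)
{
	struct devcd_entry *devcd;

	devcd = container_of(wk, struct devcd_entry, del_wk.work);

	/* (a) a reader is mid-dump: push deletion out by another timeout */
	if (devcd->data_read_ongoing) {
		schedule_delayed_work(&devcd->del_wk, DEVCD_TIMEOUT);
		return;
	}

	/* (b) acked, or nobody ever read: tear the device down as before */
	device_del(&devcd->devcd_dev);
	put_device(&devcd->devcd_dev);
}

As corner case (c) points out, if the reader dies without acking this
would keep re-arming indefinitely, so some liveness check would still be
needed.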
On Tue, 2022-02-08 at 13:40 -0800, Abhinav Kumar wrote:
>>
> I am checking what usermode sees and will get back (I didn't see an
> error so most likely it was EOF). I didn't follow the second part.

I think probably it got -ENODEV, looking at kernfs_file_read_iter().

> If the file descriptor read returns EOF, even if we consider them
> separate how will it resolve this issue?
>
> My earlier questions were related to fixing it in devcoredump to detect
> and fix it there. Are you suggesting to fix in usermode instead? How?
>

Yeah, no, you cannot fix it in userspace.

But I just followed the rabbit hole down kernfs and all, and it looks
like indeed the read would be cut short with -ENODEV, sorry.

It doesn't look like there's good API for this, but it seems at least
from the underlying kernfs POV it should be possible to get_device() in
open and put_device() in release, so that the device sticks around while
somebody has the file open? It's entirely virtual, so this should be OK?

johannes
Hi Johannes

On 2/8/2022 1:54 PM, Johannes Berg wrote:
> On Tue, 2022-02-08 at 13:40 -0800, Abhinav Kumar wrote:
>>>
>> I am checking what usermode sees and will get back (I didn't see an
>> error so most likely it was EOF). I didn't follow the second part.
>
> I think probably it got -ENODEV, looking at kernfs_file_read_iter().
>
>> If the file descriptor read returns EOF, even if we consider them
>> separate how will it resolve this issue?
>>
>> My earlier questions were related to fixing it in devcoredump to detect
>> and fix it there. Are you suggesting to fix in usermode instead? How?
>>
>
> Yeah, no, you cannot fix it in userspace.
>
> But I just followed the rabbit hole down kernfs and all, and it looks
> like indeed the read would be cut short with -ENODEV, sorry.
>
> It doesn't look like there's good API for this, but it seems at least
> from the underlying kernfs POV it should be possible to get_device() in
> open and put_device() in release, so that the device sticks around while
> somebody has the file open? It's entirely virtual, so this should be OK?
>
> johannes

Are you suggesting something like below?

diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index 42dcf96..14203d0 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -32,6 +32,23 @@ static const struct sysfs_ops *sysfs_file_ops(struct kernfs_node *kn)
 	return kobj->ktype ? kobj->ktype->sysfs_ops : NULL;
 }

+static int sysfs_kf_open(struct kernfs_open_file *of)
+{
+	struct kobject *kobj = of->kn->parent->priv;
+	struct device *dev = kobj_to_dev(kobj);
+
+	get_device(dev);
+	return 0;
+}
+
+static void sysfs_kf_release(struct kernfs_open_file *of)
+{
+	struct kobject *kobj = of->kn->parent->priv;
+	struct device *dev = kobj_to_dev(kobj);
+
+	put_device(dev);
+}
+
 /*
  * Reads on sysfs are handled through seq_file, which takes care of hairy
  * details like buffering and seeking. The following function pipes
@@ -211,6 +228,8 @@ static const struct kernfs_ops sysfs_file_kfops_wo = {
 };

 static const struct kernfs_ops sysfs_file_kfops_rw = {
+	.open		= sysfs_kf_open,
+	.release	= sysfs_kf_release,
 	.seq_show	= sysfs_kf_seq_show,
 	.write		= sysfs_kf_write,
 };

If so, don't you think this will be a more intrusive change just for the
sake of devcoredump? Any other way to keep the changes limited to
devcoredump?

Thanks

Abhinav
On Tue, 2022-02-08 at 17:55 -0800, Abhinav Kumar wrote:
>
> Are you suggesting something like below?
>
> diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
> index 42dcf96..14203d0 100644
> --- a/fs/sysfs/file.c
>

No, for sure not, but I guess from the looks of this patch there's no
way to do something like that for just an individual attribute...

Oh well.

johannes
Hi Johannes

On 2/8/2022 11:50 PM, Johannes Berg wrote:
> On Tue, 2022-02-08 at 17:55 -0800, Abhinav Kumar wrote:
>>
>> Are you suggesting something like below?
>>
>> diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
>> index 42dcf96..14203d0 100644
>> --- a/fs/sysfs/file.c
>>
>
> No, for sure not, but I guess from the looks of this patch there's no
> way to do something like that for just an individual attribute...
>
> Oh well.
>
> johannes

In that case, I was not clear on the previous solution you suggested.
Are you suggesting then that we can go ahead with the timeout increase?
If so, can I please get your ack?

Thanks

Abhinav
On Tue, Feb 08, 2022 at 11:44:32AM -0800, Abhinav Kumar wrote:
> There are cases where depending on the size of the devcoredump and the speed
> at which the usermode reads the dump, it can take longer than the current 5 mins
> timeout.
>
> This can lead to incomplete dumps as the device is deleted once the timeout expires.
>
> One example is below where it took 6 mins for the devcoredump to be completely read.
>
> 04:22:24.668 23916 23994 I HWDeviceDRM::DumpDebugData: Opening /sys/class/devcoredump/devcd6/data
> 04:28:35.377 23916 23994 W HWDeviceDRM::DumpDebugData: Freeing devcoredump node

What makes this so slow? Reading from the kernel shouldn't be the
limit, is it where the data is being sent to?

> Increase the timeout to 10 mins to accommodate system delays and large coredump
> sizes.

Nit, please wrap your changelog texts at 72 columns.

And what is "large"?

thanks,

greg k-h
On Tue, Feb 08, 2022 at 05:55:18PM -0800, Abhinav Kumar wrote:
> Hi Johannes
>
> On 2/8/2022 1:54 PM, Johannes Berg wrote:
> > On Tue, 2022-02-08 at 13:40 -0800, Abhinav Kumar wrote:
> > > >
> > > I am checking what usermode sees and will get back (I didn't see an
> > > error so most likely it was EOF). I didn't follow the second part.
> >
> > I think probably it got -ENODEV, looking at kernfs_file_read_iter().
> >
> > > If the file descriptor read returns EOF, even if we consider them
> > > separate how will it resolve this issue?
> > >
> > > My earlier questions were related to fixing it in devcoredump to detect
> > > and fix it there. Are you suggesting to fix in usermode instead? How?
> > >
> >
> > Yeah, no, you cannot fix it in userspace.
> >
> > But I just followed the rabbit hole down kernfs and all, and it looks
> > like indeed the read would be cut short with -ENODEV, sorry.
> >
> > It doesn't look like there's good API for this, but it seems at least
> > from the underlying kernfs POV it should be possible to get_device() in
> > open and put_device() in release, so that the device sticks around while
> > somebody has the file open? It's entirely virtual, so this should be OK?
> >
> > johannes
>
> Are you suggesting something like below?
>
> diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
> index 42dcf96..14203d0 100644
> --- a/fs/sysfs/file.c
> +++ b/fs/sysfs/file.c
> @@ -32,6 +32,23 @@ static const struct sysfs_ops *sysfs_file_ops(struct kernfs_node *kn)
>  	return kobj->ktype ? kobj->ktype->sysfs_ops : NULL;
>  }
>
> +static int sysfs_kf_open(struct kernfs_open_file *of)
> +{
> +	struct kobject *kobj = of->kn->parent->priv;
> +	struct device *dev = kobj_to_dev(kobj);
> +
> +	get_device(dev);
> +	return 0;
> +}
> +
> +static void sysfs_kf_release(struct kernfs_open_file *of)
> +{
> +	struct kobject *kobj = of->kn->parent->priv;
> +	struct device *dev = kobj_to_dev(kobj);
> +
> +	put_device(dev);
> +}

That obviously does not work as not everything in sysfs is a struct
device :(
Hi Greg

Thanks for the response.

On 2/11/2022 3:09 AM, Greg KH wrote:
> On Tue, Feb 08, 2022 at 11:44:32AM -0800, Abhinav Kumar wrote:
>> There are cases where depending on the size of the devcoredump and the speed
>> at which the usermode reads the dump, it can take longer than the current 5 mins
>> timeout.
>>
>> This can lead to incomplete dumps as the device is deleted once the timeout expires.
>>
>> One example is below where it took 6 mins for the devcoredump to be completely read.
>>
>> 04:22:24.668 23916 23994 I HWDeviceDRM::DumpDebugData: Opening /sys/class/devcoredump/devcd6/data
>> 04:28:35.377 23916 23994 W HWDeviceDRM::DumpDebugData: Freeing devcoredump node
>
> What makes this so slow? Reading from the kernel shouldn't be the
> limit, is it where the data is being sent to?

We are still checking this. We are seeing better read times when we bump up
the priority of the thread which was reading this.

We are also trying to check if bumping up CPU speed is helping.
But results have not been consistently good enough. So we thought we should
also increase the timeout to be safe.

>
>> Increase the timeout to 10 mins to accommodate system delays and large coredump
>> sizes.
>
> Nit, please wrap your changelog texts at 72 columns.
>

Yes, I will fix this when I re-post.

> And what is "large"?

We are seeing devcoredumps in the range of 2.5MB-3MB. I can also mention
this in the commit text in the next post.

Thanks

Abhinav

>
> thanks,
>
> greg k-h
On Fri, Feb 11, 2022 at 10:59:39AM -0800, Abhinav Kumar wrote:
> Hi Greg
>
> Thanks for the response.
>
> On 2/11/2022 3:09 AM, Greg KH wrote:
> > On Tue, Feb 08, 2022 at 11:44:32AM -0800, Abhinav Kumar wrote:
> > > There are cases where depending on the size of the devcoredump and the speed
> > > at which the usermode reads the dump, it can take longer than the current 5 mins
> > > timeout.
> > >
> > > This can lead to incomplete dumps as the device is deleted once the timeout expires.
> > >
> > > One example is below where it took 6 mins for the devcoredump to be completely read.
> > >
> > > 04:22:24.668 23916 23994 I HWDeviceDRM::DumpDebugData: Opening /sys/class/devcoredump/devcd6/data
> > > 04:28:35.377 23916 23994 W HWDeviceDRM::DumpDebugData: Freeing devcoredump node
> >
> > What makes this so slow? Reading from the kernel shouldn't be the
> > limit, is it where the data is being sent to?
>
> We are still checking this. We are seeing better read times when we bump up
> the priority of the thread which was reading this.

Where is the thread sending the data to?

> We are also trying to check if bumping up CPU speed is helping.
> But results have not been consistently good enough. So we thought we should
> also increase the timeout to be safe.

Why would 10 minutes be better than 30? What should the limit be? :)

thanks,

greg k-h
Hi Greg

On 2/11/2022 11:04 PM, Greg KH wrote:
> On Fri, Feb 11, 2022 at 10:59:39AM -0800, Abhinav Kumar wrote:
>> Hi Greg
>>
>> Thanks for the response.
>>
>> On 2/11/2022 3:09 AM, Greg KH wrote:
>>> On Tue, Feb 08, 2022 at 11:44:32AM -0800, Abhinav Kumar wrote:
>>>> There are cases where depending on the size of the devcoredump and the speed
>>>> at which the usermode reads the dump, it can take longer than the current 5 mins
>>>> timeout.
>>>>
>>>> This can lead to incomplete dumps as the device is deleted once the timeout expires.
>>>>
>>>> One example is below where it took 6 mins for the devcoredump to be completely read.
>>>>
>>>> 04:22:24.668 23916 23994 I HWDeviceDRM::DumpDebugData: Opening /sys/class/devcoredump/devcd6/data
>>>> 04:28:35.377 23916 23994 W HWDeviceDRM::DumpDebugData: Freeing devcoredump node
>>>
>>> What makes this so slow? Reading from the kernel shouldn't be the
>>> limit, is it where the data is being sent to?
>>
>> We are still checking this. We are seeing better read times when we bump up
>> the priority of the thread which was reading this.
>
> Where is the thread sending the data to?

The thread is writing the data to a file in local storage. From our
profiling, the read is the one taking the time, not the write.

>
>> We are also trying to check if bumping up CPU speed is helping.
>> But results have not been consistently good enough. So we thought we should
>> also increase the timeout to be safe.
>
> Why would 10 minutes be better than 30? What should the limit be? :)

Again, this is from our profiling. We are seeing a worst case time of
7 mins to finish the read for our data. That's where the 10 mins came from.
Just doubling what we have currently. I am not sure where the current 5
mins timeout came from.

>
> thanks,
>
> greg k-h
On Fri, 2022-02-11 at 23:52 -0800, Abhinav Kumar wrote:
>
> The thread is writing the data to a file in local storage. From our
> profiling, the read is the one taking the time, not the write.
>

That seems kind of hard to believe, let's say it's a 4/3 split (4
minutes reading, 3 minutes writing, to make read > write as you say),
and 3MiB size, that'd mean you get 12.8KiB/sec? That seems implausibly
low, unless you're reading with really tiny buffers?

Can you strace this somehow? (with timestamp info)

> Just doubling what we have currently. I am not sure where the current 5
> mins timeout came from.
>

To be honest it came out of thin air, and wasn't really meant as a limit
on how fast you can read (feels like even if it's tens of MiB you should
read it in milliseconds into userspace), but more of a maximum time that
we're willing to waste kernel memory if nobody is around to read the
data.

I thought it'd be better if we could somehow pin it while the userspace
is reading it, but OTOH maybe that's actually bad, since that means
userspace (though suitably privileged) could pin this kernel memory
indefinitely.

johannes
On Fri, Feb 11, 2022 at 11:52:41PM -0800, Abhinav Kumar wrote:
> Hi Greg
>
> On 2/11/2022 11:04 PM, Greg KH wrote:
> > On Fri, Feb 11, 2022 at 10:59:39AM -0800, Abhinav Kumar wrote:
> > > Hi Greg
> > >
> > > Thanks for the response.
> > >
> > > On 2/11/2022 3:09 AM, Greg KH wrote:
> > > > On Tue, Feb 08, 2022 at 11:44:32AM -0800, Abhinav Kumar wrote:
> > > > > There are cases where depending on the size of the devcoredump and the speed
> > > > > at which the usermode reads the dump, it can take longer than the current 5 mins
> > > > > timeout.
> > > > >
> > > > > This can lead to incomplete dumps as the device is deleted once the timeout expires.
> > > > >
> > > > > One example is below where it took 6 mins for the devcoredump to be completely read.
> > > > >
> > > > > 04:22:24.668 23916 23994 I HWDeviceDRM::DumpDebugData: Opening /sys/class/devcoredump/devcd6/data
> > > > > 04:28:35.377 23916 23994 W HWDeviceDRM::DumpDebugData: Freeing devcoredump node
> > > >
> > > > What makes this so slow? Reading from the kernel shouldn't be the
> > > > limit, is it where the data is being sent to?
> > >
> > > We are still checking this. We are seeing better read times when we bump up
> > > the priority of the thread which was reading this.
> >
> > Where is the thread sending the data to?
>
> The thread is writing the data to a file in local storage. From our
> profiling, the read is the one taking the time, not the write.

The read is coming directly from memory, there should not be any
slowdown at all here. How can that be the delay? Have a trace
somewhere?

thanks,

greg k-h
Hi Greg

On 2/12/2022 12:29 AM, Greg KH wrote:
> On Fri, Feb 11, 2022 at 11:52:41PM -0800, Abhinav Kumar wrote:
>> Hi Greg
>>
>> On 2/11/2022 11:04 PM, Greg KH wrote:
>>> On Fri, Feb 11, 2022 at 10:59:39AM -0800, Abhinav Kumar wrote:
>>>> Hi Greg
>>>>
>>>> Thanks for the response.
>>>>
>>>> On 2/11/2022 3:09 AM, Greg KH wrote:
>>>>> On Tue, Feb 08, 2022 at 11:44:32AM -0800, Abhinav Kumar wrote:
>>>>>> There are cases where depending on the size of the devcoredump and the speed
>>>>>> at which the usermode reads the dump, it can take longer than the current 5 mins
>>>>>> timeout.
>>>>>>
>>>>>> This can lead to incomplete dumps as the device is deleted once the timeout expires.
>>>>>>
>>>>>> One example is below where it took 6 mins for the devcoredump to be completely read.
>>>>>>
>>>>>> 04:22:24.668 23916 23994 I HWDeviceDRM::DumpDebugData: Opening /sys/class/devcoredump/devcd6/data
>>>>>> 04:28:35.377 23916 23994 W HWDeviceDRM::DumpDebugData: Freeing devcoredump node
>>>>>
>>>>> What makes this so slow? Reading from the kernel shouldn't be the
>>>>> limit, is it where the data is being sent to?
>>>>
>>>> We are still checking this. We are seeing better read times when we bump up
>>>> the priority of the thread which was reading this.
>>>
>>> Where is the thread sending the data to?
>>
>> The thread is writing the data to a file in local storage. From our
>> profiling, the read is the one taking the time, not the write.
>
> The read is coming directly from memory, there should not be any
> slowdown at all here. How can that be the delay? Have a trace
> somewhere?
>
> thanks,
>
> greg k-h

Yes, like I mentioned in my previous comment, we are still checking why
it's taking so long. We will update with our findings if we have any.

Alright, we will try to capture a trace to share, and will update this
thread if we find something as well.
Hi Johannes

On 2/12/2022 12:24 AM, Johannes Berg wrote:
> On Fri, 2022-02-11 at 23:52 -0800, Abhinav Kumar wrote:
>>
>> The thread is writing the data to a file in local storage. From our
>> profiling, the read is the one taking the time, not the write.
>>
>
> That seems kind of hard to believe, let's say it's a 4/3 split (4
> minutes reading, 3 minutes writing, to make read > write as you say),
> and 3MiB size, that'd mean you get 12.8KiB/sec? That seems implausibly
> low, unless you're reading with really tiny buffers?
>
> Can you strace this somehow? (with timestamp info)
>

Yes, like I have already mentioned in earlier comments, we continue to
check what's taking that long.

Once we find something from our analysis and also have the trace, we
will update the thread.

>> Just doubling what we have currently. I am not sure where the current 5
>> mins timeout came from.
>>
>
> To be honest it came out of thin air, and wasn't really meant as a limit
> on how fast you can read (feels like even if it's tens of MiB you should
> read it in milliseconds into userspace), but more of a maximum time that
> we're willing to waste kernel memory if nobody is around to read the
> data.
>
> I thought it'd be better if we could somehow pin it while the userspace
> is reading it, but OTOH maybe that's actually bad, since that means
> userspace (though suitably privileged) could pin this kernel memory
> indefinitely.
>
> johannes
Hi Johannes and Greg

On 2/12/2022 12:35 AM, Abhinav Kumar wrote:
> Hi Johannes
>
> On 2/12/2022 12:24 AM, Johannes Berg wrote:
>> On Fri, 2022-02-11 at 23:52 -0800, Abhinav Kumar wrote:
>>>
>>> The thread is writing the data to a file in local storage. From our
>>> profiling, the read is the one taking the time, not the write.
>>>
>>
>> That seems kind of hard to believe, let's say it's a 4/3 split (4
>> minutes reading, 3 minutes writing, to make read > write as you say),
>> and 3MiB size, that'd mean you get 12.8KiB/sec? That seems implausibly
>> low, unless you're reading with really tiny buffers?
>>
>> Can you strace this somehow? (with timestamp info)
>>
>
> Yes, like I have already mentioned in earlier comments, we continue to
> check what's taking that long.
>
> Once we find something from our analysis and also have the trace, we
> will update the thread.
>
>>> Just doubling what we have currently. I am not sure where the current 5
>>> mins timeout came from.
>>>
>>
>> To be honest it came out of thin air, and wasn't really meant as a limit
>> on how fast you can read (feels like even if it's tens of MiB you should
>> read it in milliseconds into userspace), but more of a maximum time that
>> we're willing to waste kernel memory if nobody is around to read the
>> data.
>>
>> I thought it'd be better if we could somehow pin it while the userspace
>> is reading it, but OTOH maybe that's actually bad, since that means
>> userspace (though suitably privileged) could pin this kernel memory
>> indefinitely.
>>
>> johannes

So, we were able to narrow down the bottleneck further. The tiny buffers
Johannes was referring to come from the kernfs read path below. It
defaults to one PAGE_SIZE worth of data per read, which results in a
very large number of reads. If we increase the length to match the size
of our data, like below, we are able to finish the read almost
instantly:

--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -184,10 +184,11 @@ static const struct seq_operations kernfs_seq_ops = {
 static ssize_t kernfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 {
 	struct kernfs_open_file *of = kernfs_of(iocb->ki_filp);
-	ssize_t len = min_t(size_t, iov_iter_count(iter), PAGE_SIZE);
+	ssize_t len = min_t(size_t, iov_iter_count(iter), (PAGE_SIZE * 768));
 	const struct kernfs_ops *ops;
 	char *buf;

+	pr_err("[hbc debug] %s, len:%zd\n", __func__, len);
 	buf = of->prealloc_buf;
 	if (buf)
 		mutex_lock(&of->prealloc_mutex);

(768 because the size of our dump was ~3MB, so that would be ~768 4KB
blocks)

We also did some profiling around how much increasing the block size
helps and here is the data:

Block size   cost

4KB          229s
8KB          86s
3MB          2s

So it looks like 2x or 4x the block size already helps quite a bit.
Hence, while we are exploring some options like reducing the size of the
dump, I wanted to also check if increasing the block size to something
like 4 * 4KB could be a solution, because it brings down the read times
drastically in our profiling.

Thanks

Abhinav
From: Abhinav Kumar
> Sent: 28 February 2022 21:38
...
> We also did some profiling around how much increasing the block size
> helps and here is the data:
>
> Block size   cost
>
> 4KB          229s
> 8KB          86s

You must have an O(n^2) operation in there - find it.

	David
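A rough model of that scaling, assuming (as the next reply explains) that
the whole output is re-rendered for every block served: for a dump of
size n read in blocks of size B, that is n/B reads at O(n) rendering work
each, so roughly n^2/B total, i.e. cost proportional to 1/B. That fits
the table above reasonably well: doubling 4KB to 8KB should roughly halve
the 229s (86s measured, a bit better than predicted), and B = n collapses
everything to a single render, which is the 2s case for the ~3MB dump.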
On Mon, Feb 28, 2022 at 10:49 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Abhinav Kumar
> > Sent: 28 February 2022 21:38
> ...
> > We also did some profiling around how much increasing the block size
> > helps and here is the data:
> >
> > Block size   cost
> >
> > 4KB          229s
> > 8KB          86s
>
> You must have an O(n^2) operation in there - find it.

The problem is how the devcoredump/sysfs interface works, which
results in "re-rendering" the output for each block.. it's fine for
moderate size sysfs files, but scales quite badly once you get into
couple MB size sysfs files.

It could be fixed by having some way to keep state across successive
read callbacks.

BR,
-R
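As an illustration of the pattern being described (not the actual msm or
devcoredump code; render_dump() is a made-up stand-in for whatever
formats the crash state), a dev_coredumpm()-style read callback that
keeps no state between calls ends up doing this:

/* Sketch: a stateless read callback is O(n^2) overall, because the
 * whole dump is regenerated for every block served.
 */
static ssize_t example_read(char *buffer, loff_t offset, size_t count,
			    void *data, size_t datalen)
{
	size_t full_len;
	char *full;

	/* re-render the ENTIRE dump just to serve this one block */
	full = render_dump(data, &full_len);	/* hypothetical helper */
	if (!full)
		return -ENOMEM;

	if (offset >= full_len) {
		kfree(full);
		return 0;
	}

	count = min_t(size_t, count, full_len - offset);
	memcpy(buffer, full + offset, count);
	kfree(full);

	/* the next block triggers another full render: n/B calls, O(n) each */
	return count;
}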
On Tue, 2022-03-01 at 09:45 -0800, Rob Clark wrote:
> On Mon, Feb 28, 2022 at 10:49 PM David Laight <David.Laight@aculab.com> wrote:
> >
> > From: Abhinav Kumar
> > > Sent: 28 February 2022 21:38
> > ...
> > > We also did some profiling around how much increasing the block size
> > > helps and here is the data:
> > >
> > > Block size   cost
> > >
> > > 4KB          229s
> > > 8KB          86s
> >
> > You must have an O(n^2) operation in there - find it.
>
> The problem is how the devcoredump/sysfs interface works, which
> results in "re-rendering" the output for each block.. it's fine for
> moderate size sysfs files, but scales quite badly once you get into
> couple MB size sysfs files.
>
> It could be fixed by having some way to keep state across successive
> read callbacks.

I'm not sure that's true? Perhaps for dev_coredumpm(), but only if you
implemented read() badly. If you have e.g. dev_coredumpsg() or
dev_coredumpv() that's just a simple read from the existing buffer.

johannes
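For contrast, a sketch of the cheap case being referred to: when the dump
was handed over as one prebuilt buffer (as with dev_coredumpv()), each
read can be served straight out of that buffer, so the per-block cost
stays constant. This mirrors the buffer-backed path conceptually; the
exact in-tree code may differ:

/* Sketch: reads against a prebuilt buffer are one bounded copy each. */
static ssize_t buffer_read(char *buffer, loff_t offset, size_t count,
			   void *data, size_t datalen)
{
	/* memory_read_from_buffer() clamps count against datalen and
	 * copies from data + offset: no re-rendering per call.
	 */
	return memory_read_from_buffer(buffer, count, &offset, data, datalen);
}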
diff --git a/drivers/base/devcoredump.c b/drivers/base/devcoredump.c
index f4d794d..6b83ae5 100644
--- a/drivers/base/devcoredump.c
+++ b/drivers/base/devcoredump.c
@@ -18,8 +18,8 @@ static struct class devcd_class;
 /* global disable flag, for security purposes */
 static bool devcd_disabled;

-/* if data isn't read by userspace after 5 minutes then delete it */
-#define DEVCD_TIMEOUT	(HZ * 60 * 5)
+/* if data isn't read by userspace after 10 minutes then delete it */
+#define DEVCD_TIMEOUT	(HZ * 60 * 10)

 struct devcd_entry {
 	struct device devcd_dev;
There are cases where depending on the size of the devcoredump and the
speed at which the usermode reads the dump, it can take longer than the
current 5 mins timeout.

This can lead to incomplete dumps as the device is deleted once the
timeout expires.

One example is below where it took 6 mins for the devcoredump to be
completely read.

04:22:24.668 23916 23994 I HWDeviceDRM::DumpDebugData: Opening /sys/class/devcoredump/devcd6/data
04:28:35.377 23916 23994 W HWDeviceDRM::DumpDebugData: Freeing devcoredump node

Increase the timeout to 10 mins to accommodate system delays and large
coredump sizes.

Signed-off-by: Abhinav Kumar <quic_abhinavk@quicinc.com>
---
 drivers/base/devcoredump.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)