xfs_repair: fix progress reporting

Message ID	c4df68a7-706b-0216-b2a0-a177789f380f@redhat.com (mailing list archive)
State	Superseded, archived
Headers	show Return-Path: <SRS0=xgbg=7A=vger.kernel.org=linux-xfs-owner@kernel.org> To: linux-xfs <linux-xfs@vger.kernel.org> From: Eric Sandeen <sandeen@redhat.com> Subject: [PATCH] xfs_repair: fix progress reporting Cc: Leonardo Vaz <lvaz@redhat.com> Message-ID: <c4df68a7-706b-0216-b2a0-a177789f380f@redhat.com> Date: Mon, 18 May 2020 17:35:33 -0500 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:68.0) Gecko/20100101 Thunderbird/68.8.0 MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: linux-xfs-owner@vger.kernel.org Precedence: bulk
Series	xfs_repair: fix progress reporting \| expand xfs_repair: fix progress reporting

Message ID

c4df68a7-706b-0216-b2a0-a177789f380f@redhat.com (mailing list archive)

State

Superseded, archived

Headers

To: linux-xfs <linux-xfs@vger.kernel.org>
From: Eric Sandeen <sandeen@redhat.com>
Subject: [PATCH] xfs_repair: fix progress reporting
Cc: Leonardo Vaz <lvaz@redhat.com>
Message-ID: <c4df68a7-706b-0216-b2a0-a177789f380f@redhat.com>
Date: Mon, 18 May 2020 17:35:33 -0500
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:68.0)
 Gecko/20100101 Thunderbird/68.8.0
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: 7bit
Sender: linux-xfs-owner@vger.kernel.org
Precedence: bulk

Series

xfs_repair: fix progress reporting | expand

Commit Message

Eric Sandeen May 18, 2020, 10:35 p.m. UTC

Long ago, a young developer tried to fix a segfault in xfs_repair where
a short progress reporting interval could cause a timer to go off and try
to print a progress mesage before any had been properly set up because
we were still busy zeroing the log, and a NULL pointer dereference
ensued.

That young developer got it wrong, and completely broke progress
reporting, because the change caused us to exit early from the pthread
start routine, and not initialize the progress timer at all.

That developer is now slightly older and wiser, and finally realizes that
the simple and correct solution here is to initialize the message format
to the first one in the list, so that we will be ready to go with a
progress message no matter when the first timer goes off.

Reported-by: Leonardo Vaz <lvaz@redhat.com>
Fixes: 7f2d6b811755 ("xfs_repair: avoid segfault if reporting progre...")
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---

It might be nice to add progress reporting for the log zeroing, but that
requires renumbering all these macros, and we don't/can't actually get
any fine-grained progress at all, so probably not worth it.

Comments

Darrick J. Wong May 19, 2020, 12:58 a.m. UTC | #1

On Mon, May 18, 2020 at 05:35:33PM -0500, Eric Sandeen wrote:
> Long ago, a young developer tried to fix a segfault in xfs_repair where
> a short progress reporting interval could cause a timer to go off and try
> to print a progress mesage before any had been properly set up because
> we were still busy zeroing the log, and a NULL pointer dereference
> ensued.
> 
> That young developer got it wrong, and completely broke progress
> reporting, because the change caused us to exit early from the pthread
> start routine, and not initialize the progress timer at all.
> 
> That developer is now slightly older and wiser, and finally realizes that
> the simple and correct solution here is to initialize the message format
> to the first one in the list, so that we will be ready to go with a
> progress message no matter when the first timer goes off.
> 
> Reported-by: Leonardo Vaz <lvaz@redhat.com>
> Fixes: 7f2d6b811755 ("xfs_repair: avoid segfault if reporting progre...")
> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
> ---
> 
> It might be nice to add progress reporting for the log zeroing, but that
> requires renumbering all these macros, and we don't/can't actually get
> any fine-grained progress at all, so probably not worth it.
> 
> diff --git a/repair/progress.c b/repair/progress.c
> index 5ee08229..d7baa606 100644
> --- a/repair/progress.c
> +++ b/repair/progress.c
> @@ -125,7 +125,11 @@ init_progress_rpt (void)
>  	 */
>  
>  	pthread_mutex_init(&global_msgs.mutex, NULL);
> -	global_msgs.format = NULL;
> +	/*
> +	 * Ensure the format string is not NULL in case the first timer
> +	 * goes off before any stage calls set_progress_msg() to set it.
> +	 */
> +	global_msgs.format = &progress_rpt_reports[0];

Hmm so does that mean the first progress report could be for "scanning
freespace"?

Or could you append a new entry to progress_rpt_reports for "getting my
shit together and moving out of my parents basement" and initialize it
to that?

--D

>  	global_msgs.count = glob_agcount;
>  	global_msgs.interval = report_interval;
>  	global_msgs.done   = prog_rpt_done;
> @@ -171,10 +175,6 @@ progress_rpt_thread (void *p)
>  	msg_block_t *msgp = (msg_block_t *)p;
>  	uint64_t percent;
>  
> -	/* It's possible to get here very early w/ no progress msg set */
> -	if (!msgp->format)
> -		return NULL;
> -
>  	if ((msgbuf = (char *)malloc(DURATION_BUF_SIZE)) == NULL)
>  		do_error (_("progress_rpt: cannot malloc progress msg buffer\n"));
>  
>

Eric Sandeen May 19, 2020, 1:03 a.m. UTC | #2

On 5/18/20 7:58 PM, Darrick J. Wong wrote:
> On Mon, May 18, 2020 at 05:35:33PM -0500, Eric Sandeen wrote:
>> Long ago, a young developer tried to fix a segfault in xfs_repair where
>> a short progress reporting interval could cause a timer to go off and try
>> to print a progress mesage before any had been properly set up because
>> we were still busy zeroing the log, and a NULL pointer dereference
>> ensued.
>>
>> That young developer got it wrong, and completely broke progress
>> reporting, because the change caused us to exit early from the pthread
>> start routine, and not initialize the progress timer at all.
>>
>> That developer is now slightly older and wiser, and finally realizes that
>> the simple and correct solution here is to initialize the message format
>> to the first one in the list, so that we will be ready to go with a
>> progress message no matter when the first timer goes off.
>>
>> Reported-by: Leonardo Vaz <lvaz@redhat.com>
>> Fixes: 7f2d6b811755 ("xfs_repair: avoid segfault if reporting progre...")
>> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
>> ---
>>
>> It might be nice to add progress reporting for the log zeroing, but that
>> requires renumbering all these macros, and we don't/can't actually get
>> any fine-grained progress at all, so probably not worth it.
>>
>> diff --git a/repair/progress.c b/repair/progress.c
>> index 5ee08229..d7baa606 100644
>> --- a/repair/progress.c
>> +++ b/repair/progress.c
>> @@ -125,7 +125,11 @@ init_progress_rpt (void)
>>  	 */
>>  
>>  	pthread_mutex_init(&global_msgs.mutex, NULL);
>> -	global_msgs.format = NULL;
>> +	/*
>> +	 * Ensure the format string is not NULL in case the first timer
>> +	 * goes off before any stage calls set_progress_msg() to set it.
>> +	 */
>> +	global_msgs.format = &progress_rpt_reports[0];
> 
> Hmm so does that mean the first progress report could be for "scanning
> freespace"?

Yes.  But unless "zeroing the log" takes more than the report interval,
which by default is 15 minutes, it won't be wrong.

> Or could you append a new entry to progress_rpt_reports for "getting my
> shit together and moving out of my parents basement" and initialize it
> to that?

Oh I guess it could be appended and wouldn't have to be out of order but
honestly I don't think it's worth it, even a big slow log shouldn't take
long enough to need a progress report.

-Eric

> --D
> 
>>  	global_msgs.count = glob_agcount;
>>  	global_msgs.interval = report_interval;
>>  	global_msgs.done   = prog_rpt_done;
>> @@ -171,10 +175,6 @@ progress_rpt_thread (void *p)
>>  	msg_block_t *msgp = (msg_block_t *)p;
>>  	uint64_t percent;
>>  
>> -	/* It's possible to get here very early w/ no progress msg set */
>> -	if (!msgp->format)
>> -		return NULL;
>> -
>>  	if ((msgbuf = (char *)malloc(DURATION_BUF_SIZE)) == NULL)
>>  		do_error (_("progress_rpt: cannot malloc progress msg buffer\n"));
>>  
>>
>

Darrick J. Wong May 19, 2020, 1:13 a.m. UTC | #3

On Mon, May 18, 2020 at 08:03:09PM -0500, Eric Sandeen wrote:
> On 5/18/20 7:58 PM, Darrick J. Wong wrote:
> > On Mon, May 18, 2020 at 05:35:33PM -0500, Eric Sandeen wrote:
> >> Long ago, a young developer tried to fix a segfault in xfs_repair where
> >> a short progress reporting interval could cause a timer to go off and try
> >> to print a progress mesage before any had been properly set up because
> >> we were still busy zeroing the log, and a NULL pointer dereference
> >> ensued.
> >>
> >> That young developer got it wrong, and completely broke progress
> >> reporting, because the change caused us to exit early from the pthread
> >> start routine, and not initialize the progress timer at all.
> >>
> >> That developer is now slightly older and wiser, and finally realizes that
> >> the simple and correct solution here is to initialize the message format
> >> to the first one in the list, so that we will be ready to go with a
> >> progress message no matter when the first timer goes off.
> >>
> >> Reported-by: Leonardo Vaz <lvaz@redhat.com>
> >> Fixes: 7f2d6b811755 ("xfs_repair: avoid segfault if reporting progre...")
> >> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
> >> ---
> >>
> >> It might be nice to add progress reporting for the log zeroing, but that
> >> requires renumbering all these macros, and we don't/can't actually get
> >> any fine-grained progress at all, so probably not worth it.
> >>
> >> diff --git a/repair/progress.c b/repair/progress.c
> >> index 5ee08229..d7baa606 100644
> >> --- a/repair/progress.c
> >> +++ b/repair/progress.c
> >> @@ -125,7 +125,11 @@ init_progress_rpt (void)
> >>  	 */
> >>  
> >>  	pthread_mutex_init(&global_msgs.mutex, NULL);
> >> -	global_msgs.format = NULL;
> >> +	/*
> >> +	 * Ensure the format string is not NULL in case the first timer
> >> +	 * goes off before any stage calls set_progress_msg() to set it.
> >> +	 */
> >> +	global_msgs.format = &progress_rpt_reports[0];
> > 
> > Hmm so does that mean the first progress report could be for "scanning
> > freespace"?
> 
> Yes.  But unless "zeroing the log" takes more than the report interval,
> which by default is 15 minutes, it won't be wrong.
> 
> > Or could you append a new entry to progress_rpt_reports for "getting my
> > shit together and moving out of my parents basement" and initialize it
> > to that?
> 
> Oh I guess it could be appended and wouldn't have to be out of order but
> honestly I don't think it's worth it, even a big slow log shouldn't take
> long enough to need a progress report.

Zeroing a 2048MB log in 900s = 2.28MB/s

So I guess that's unlikely... but it still feels like leaving some kind
of weird logic bomb lurking where if we decrease the interval or someone
throws us a slow cloudy block store, we'll start issuing weird progress
messsages about some other part of xfs_repair which hasn't even started
yet.  <shrug>

--D

> -Eric
> 
> > --D
> > 
> >>  	global_msgs.count = glob_agcount;
> >>  	global_msgs.interval = report_interval;
> >>  	global_msgs.done   = prog_rpt_done;
> >> @@ -171,10 +175,6 @@ progress_rpt_thread (void *p)
> >>  	msg_block_t *msgp = (msg_block_t *)p;
> >>  	uint64_t percent;
> >>  
> >> -	/* It's possible to get here very early w/ no progress msg set */
> >> -	if (!msgp->format)
> >> -		return NULL;
> >> -
> >>  	if ((msgbuf = (char *)malloc(DURATION_BUF_SIZE)) == NULL)
> >>  		do_error (_("progress_rpt: cannot malloc progress msg buffer\n"));
> >>  
> >>
> >

Eric Sandeen May 19, 2020, 1:15 a.m. UTC | #4

On 5/18/20 8:13 PM, Darrick J. Wong wrote:
> On Mon, May 18, 2020 at 08:03:09PM -0500, Eric Sandeen wrote:
>> On 5/18/20 7:58 PM, Darrick J. Wong wrote:
>>> On Mon, May 18, 2020 at 05:35:33PM -0500, Eric Sandeen wrote:
>>>> Long ago, a young developer tried to fix a segfault in xfs_repair where
>>>> a short progress reporting interval could cause a timer to go off and try
>>>> to print a progress mesage before any had been properly set up because
>>>> we were still busy zeroing the log, and a NULL pointer dereference
>>>> ensued.
>>>>
>>>> That young developer got it wrong, and completely broke progress
>>>> reporting, because the change caused us to exit early from the pthread
>>>> start routine, and not initialize the progress timer at all.
>>>>
>>>> That developer is now slightly older and wiser, and finally realizes that
>>>> the simple and correct solution here is to initialize the message format
>>>> to the first one in the list, so that we will be ready to go with a
>>>> progress message no matter when the first timer goes off.
>>>>
>>>> Reported-by: Leonardo Vaz <lvaz@redhat.com>
>>>> Fixes: 7f2d6b811755 ("xfs_repair: avoid segfault if reporting progre...")
>>>> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
>>>> ---
>>>>
>>>> It might be nice to add progress reporting for the log zeroing, but that
>>>> requires renumbering all these macros, and we don't/can't actually get
>>>> any fine-grained progress at all, so probably not worth it.
>>>>
>>>> diff --git a/repair/progress.c b/repair/progress.c
>>>> index 5ee08229..d7baa606 100644
>>>> --- a/repair/progress.c
>>>> +++ b/repair/progress.c
>>>> @@ -125,7 +125,11 @@ init_progress_rpt (void)
>>>>  	 */
>>>>  
>>>>  	pthread_mutex_init(&global_msgs.mutex, NULL);
>>>> -	global_msgs.format = NULL;
>>>> +	/*
>>>> +	 * Ensure the format string is not NULL in case the first timer
>>>> +	 * goes off before any stage calls set_progress_msg() to set it.
>>>> +	 */
>>>> +	global_msgs.format = &progress_rpt_reports[0];
>>>
>>> Hmm so does that mean the first progress report could be for "scanning
>>> freespace"?
>>
>> Yes.  But unless "zeroing the log" takes more than the report interval,
>> which by default is 15 minutes, it won't be wrong.
>>
>>> Or could you append a new entry to progress_rpt_reports for "getting my
>>> shit together and moving out of my parents basement" and initialize it
>>> to that?
>>
>> Oh I guess it could be appended and wouldn't have to be out of order but
>> honestly I don't think it's worth it, even a big slow log shouldn't take
>> long enough to need a progress report.
> 
> Zeroing a 2048MB log in 900s = 2.28MB/s
> 
> So I guess that's unlikely... but it still feels like leaving some kind
> of weird logic bomb lurking where if we decrease the interval or someone
> throws us a slow cloudy block store, we'll start issuing weird progress
> messsages about some other part of xfs_repair which hasn't even started
> yet.  <shrug>

welp, alright, I'll add a log status message.

diff --git a/repair/progress.c b/repair/progress.c
index 5ee08229..d7baa606 100644
--- a/repair/progress.c
+++ b/repair/progress.c
@@ -125,7 +125,11 @@  init_progress_rpt (void)
 	 */
 
 	pthread_mutex_init(&global_msgs.mutex, NULL);
-	global_msgs.format = NULL;
+	/*
+	 * Ensure the format string is not NULL in case the first timer
+	 * goes off before any stage calls set_progress_msg() to set it.
+	 */
+	global_msgs.format = &progress_rpt_reports[0];
 	global_msgs.count = glob_agcount;
 	global_msgs.interval = report_interval;
 	global_msgs.done   = prog_rpt_done;
@@ -171,10 +175,6 @@  progress_rpt_thread (void *p)
 	msg_block_t *msgp = (msg_block_t *)p;
 	uint64_t percent;
 
-	/* It's possible to get here very early w/ no progress msg set */
-	if (!msgp->format)
-		return NULL;
-
 	if ((msgbuf = (char *)malloc(DURATION_BUF_SIZE)) == NULL)
 		do_error (_("progress_rpt: cannot malloc progress msg buffer\n"));

xfs_repair: fix progress reporting

Commit Message

Comments

Patch