Message ID | 20190426110922.21888-1-robbieko@synology.com |
---|---|
State | New, archived |
Series | Btrfs: avoid allocating too many data chunks on massive concurrent write |
On Fri, Apr 26, 2019 at 12:10 PM robbieko <robbieko@synology.com> wrote:
>
> From: Robbie Ko <robbieko@synology.com>
>
> I found an issue where btrfs allocates much more space than it actually
> needs under massive concurrent writes. That may consume all free space,
> and when btrfs then needs to allocate more space for metadata, the
> result is ENOSPC.
>
> I reproduced the issue by running 5000 concurrent dd processes as a
> write stress test.
>
> The space info after ENOSPC:
> Overall:
>     Device size:           926.91GiB
>     Device allocated:      926.91GiB
>     Device unallocated:      1.00MiB
>     Device missing:            0.00B
>     Used:                  211.59GiB
>     Free (estimated):      714.18GiB  (min: 714.18GiB)
>     Data ratio:                 1.00
>     Metadata ratio:             2.00
>     Global reserve:        512.00MiB  (used: 0.00B)
>
> Data,single: Size:923.77GiB, Used:209.59GiB
>     /dev/devname   923.77GiB
>
> Metadata,DUP: Size:1.50GiB, Used:1022.50MiB
>     /dev/devname     3.00GiB
>
> System,DUP: Size:72.00MiB, Used:128.00KiB
>     /dev/devname   144.00MiB

So can you provide more details? What parameters did you give to the dd
processes?

I tried to reproduce it like this:

for ((i = 0; i < 5000; i++)); do
    dd if=/dev/zero of=/mnt/sdi/dd$i bs=2M oflag=dsync &
    dd_pids[$i]=$!
done

wait ${dd_pids[@]}

But after it finished, the used data space was about the same as the
allocated data space (it was on a 200G fs), so I didn't observe the
problem you mention, even though at first glance the patch looks ok.

Thanks.

>
> We can see that the Metadata space (1022.50MiB + 512.00MiB) is almost
> full, but Data has allocated much more space (923.77GiB) than it
> actually needs (209.59GiB).
>
> When data space runs out, all 5000 dd processes call do_chunk_alloc()
> to allocate more space.
>
> In the while loop of do_chunk_alloc(), the variable force is changed to
> CHUNK_ALLOC_FORCE by the second iteration, and should_alloc_chunk()
> always returns true when force is CHUNK_ALLOC_FORCE. That means every
> concurrent dd process allocates a chunk in do_chunk_alloc().
>
> Fix this by keeping the original value of force and re-assigning it to
> force at the end of the loop.
>
> Signed-off-by: Robbie Ko <robbieko@synology.com>
> ---
>  fs/btrfs/extent-tree.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index c5880329ae37..73856f70db31 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4511,6 +4511,7 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags,
>         bool wait_for_alloc = false;
>         bool should_alloc = false;
>         int ret = 0;
> +       int orig_force = force;
>
>         /* Don't re-enter if we're already allocating a chunk */
>         if (trans->allocating_chunk)
> @@ -4544,6 +4545,7 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags,
>                  */
>                 wait_for_alloc = true;
>                 spin_unlock(&space_info->lock);
> +               force = orig_force;
>                 mutex_lock(&fs_info->chunk_mutex);
>                 mutex_unlock(&fs_info->chunk_mutex);
>         } else {
> --
> 2.17.1
>
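The over-allocation the quoted commit message describes can be illustrated with a small standalone C model of the do_chunk_alloc() retry behaviour. This is a minimal sketch, not kernel code: the struct fields, the 1 GiB chunk size, the "almost full" heuristic, and the assumption that space_info->force_alloc is what raises the waiters' local force level are all simplifications chosen for illustration.

```c
/*
 * Toy, single-threaded model of the retry behaviour discussed above.
 * NOT btrfs code: structures, sizes and the escalation source are
 * simplified assumptions for illustration only.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

enum { CHUNK_ALLOC_NO_FORCE, CHUNK_ALLOC_LIMITED, CHUNK_ALLOC_FORCE };

struct space_info {
	int force_alloc;        /* pending force level seen by new callers */
	long long total_bytes;  /* bytes covered by allocated chunks       */
	long long bytes_used;   /* bytes actually written                  */
};

/* Simplified: CHUNK_ALLOC_FORCE always allocates; otherwise allocate
 * only when the existing chunks are full. */
static bool should_alloc_chunk(const struct space_info *s, int force)
{
	if (force == CHUNK_ALLOC_FORCE)
		return true;
	return s->bytes_used >= s->total_bytes;
}

/*
 * nr_callers tasks hit a full data space_info at the same time: one
 * wins and allocates a chunk, the rest block and recheck afterwards.
 * Returns the total number of chunks allocated.
 */
static int simulate(int nr_callers, bool patched)
{
	struct space_info s = {
		.force_alloc = CHUNK_ALLOC_FORCE, /* assumed already raised */
		.total_bytes = 1LL << 30,
		.bytes_used  = 1LL << 30,         /* data space currently full */
	};
	int *force = malloc(sizeof(*force) * nr_callers);
	int chunks = 0;

	/* First pass: every caller raises its local force level. */
	for (int i = 0; i < nr_callers; i++) {
		force[i] = CHUNK_ALLOC_NO_FORCE;   /* the caller's orig_force */
		if (force[i] < s.force_alloc)
			force[i] = s.force_alloc;  /* now CHUNK_ALLOC_FORCE   */
	}

	/* Caller 0 wins: allocates one chunk and clears force_alloc. */
	s.total_bytes += 1LL << 30;
	s.force_alloc = CHUNK_ALLOC_NO_FORCE;
	chunks++;

	/* The waiters wake up one by one and re-run the loop body. */
	for (int i = 1; i < nr_callers; i++) {
		if (patched)
			force[i] = CHUNK_ALLOC_NO_FORCE; /* force = orig_force */
		if (should_alloc_chunk(&s, force[i])) {
			s.total_bytes += 1LL << 30;      /* yet another chunk  */
			chunks++;
		}
	}
	free(force);
	return chunks;
}

int main(void)
{
	printf("unpatched: %d chunks allocated\n", simulate(5000, false));
	printf("patched:   %d chunks allocated\n", simulate(5000, true));
	return 0;
}
```

Under these assumptions the unpatched run allocates one chunk per caller (5000 chunks), because each waiter's locally escalated force value makes should_alloc_chunk() return true on its retry, while the patched run allocates a single chunk and the remaining waiters back off once they see the new free space.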
Filipe Manana wrote on 2019-05-03 18:53:
> On Fri, Apr 26, 2019 at 12:10 PM robbieko <robbieko@synology.com> wrote:
>>
>> From: Robbie Ko <robbieko@synology.com>
>>
>> I found an issue where btrfs allocates much more space than it
>> actually needs under massive concurrent writes. That may consume all
>> free space, and when btrfs then needs to allocate more space for
>> metadata, the result is ENOSPC.
>>
>> I reproduced the issue by running 5000 concurrent dd processes as a
>> write stress test.
>>
>> The space info after ENOSPC:
>> Overall:
>>     Device size:           926.91GiB
>>     Device allocated:      926.91GiB
>>     Device unallocated:      1.00MiB
>>     Device missing:            0.00B
>>     Used:                  211.59GiB
>>     Free (estimated):      714.18GiB  (min: 714.18GiB)
>>     Data ratio:                 1.00
>>     Metadata ratio:             2.00
>>     Global reserve:        512.00MiB  (used: 0.00B)
>>
>> Data,single: Size:923.77GiB, Used:209.59GiB
>>     /dev/devname   923.77GiB
>>
>> Metadata,DUP: Size:1.50GiB, Used:1022.50MiB
>>     /dev/devname     3.00GiB
>>
>> System,DUP: Size:72.00MiB, Used:128.00KiB
>>     /dev/devname   144.00MiB
>
> So can you provide more details? What parameters did you give to the
> dd processes?
>
> I tried to reproduce it like this:
>
> for ((i = 0; i < 5000; i++)); do
>     dd if=/dev/zero of=/mnt/sdi/dd$i bs=2M oflag=dsync &
>     dd_pids[$i]=$!
> done
>
> wait ${dd_pids[@]}
>
> But after it finished, the used data space was about the same as the
> allocated data space (it was on a 200G fs), so I didn't observe the
> problem you mention, even though at first glance the patch looks ok.
>
> Thanks.
>

I used the following script:

#!/bin/sh
for (( i=1; i<=1000000; i++ ))
do
        dd if=/dev/urandom of="/mnt/dir/sa${i}" bs=1k count=1024 status=none &
        remain=$(( $i % 5000 ))
        if [ $remain -eq 0 ]; then
                echo "i = ${i}"
                wait
        fi
done

The problem occurred after the script had run for about one day.

Thanks.

>>
>> We can see that the Metadata space (1022.50MiB + 512.00MiB) is almost
>> full, but Data has allocated much more space (923.77GiB) than it
>> actually needs (209.59GiB).
>>
>> When data space runs out, all 5000 dd processes call do_chunk_alloc()
>> to allocate more space.
>>
>> In the while loop of do_chunk_alloc(), the variable force is changed
>> to CHUNK_ALLOC_FORCE by the second iteration, and should_alloc_chunk()
>> always returns true when force is CHUNK_ALLOC_FORCE. That means every
>> concurrent dd process allocates a chunk in do_chunk_alloc().
>>
>> Fix this by keeping the original value of force and re-assigning it
>> to force at the end of the loop.
>>
>> Signed-off-by: Robbie Ko <robbieko@synology.com>
>> ---
>>  fs/btrfs/extent-tree.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index c5880329ae37..73856f70db31 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -4511,6 +4511,7 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags,
>>         bool wait_for_alloc = false;
>>         bool should_alloc = false;
>>         int ret = 0;
>> +       int orig_force = force;
>>
>>         /* Don't re-enter if we're already allocating a chunk */
>>         if (trans->allocating_chunk)
>> @@ -4544,6 +4545,7 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags,
>>                  */
>>                 wait_for_alloc = true;
>>                 spin_unlock(&space_info->lock);
>> +               force = orig_force;
>>                 mutex_lock(&fs_info->chunk_mutex);
>>                 mutex_unlock(&fs_info->chunk_mutex);
>>         } else {
>> --
>> 2.17.1
>>
```diff
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index c5880329ae37..73856f70db31 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4511,6 +4511,7 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags,
 	bool wait_for_alloc = false;
 	bool should_alloc = false;
 	int ret = 0;
+	int orig_force = force;
 
 	/* Don't re-enter if we're already allocating a chunk */
 	if (trans->allocating_chunk)
@@ -4544,6 +4545,7 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags,
 		 */
 		wait_for_alloc = true;
 		spin_unlock(&space_info->lock);
+		force = orig_force;
 		mutex_lock(&fs_info->chunk_mutex);
 		mutex_unlock(&fs_info->chunk_mutex);
 	} else {
```
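For readability, this is roughly how the touched lines read with the patch applied. It is reconstructed only from the two hunks above with short explanatory comments added; everything outside the hunks is elided, and the comments describe the intent as stated in the commit message rather than the exact upstream source.

```c
	/* First hunk: remember the force level the caller passed in. */
	bool wait_for_alloc = false;
	bool should_alloc = false;
	int ret = 0;
	int orig_force = force;		/* added: snapshot of the caller's force */

	/* Don't re-enter if we're already allocating a chunk */
	if (trans->allocating_chunk)
	/* ... */

	/*
	 * Second hunk: inside the retry loop, when another task is already
	 * allocating, drop any locally escalated force level back to the
	 * caller's original one before blocking on chunk_mutex, so the
	 * recheck after the wait does not unconditionally allocate a chunk.
	 */
		wait_for_alloc = true;
		spin_unlock(&space_info->lock);
		force = orig_force;		/* added by this patch */
		mutex_lock(&fs_info->chunk_mutex);
		mutex_unlock(&fs_info->chunk_mutex);
	} else {
	/* ... */
```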