Message ID: 20240711050750.17792-1-kundan.kumar@samsung.com
Series: block: add larger order folio instead of pages
On Thu, Jul 11, 2024 at 10:37:45AM +0530, Kundan Kumar wrote:
> User space memory is mapped in the kernel in the form of a page array.
> These pages are iterated and added to the BIO. In the process, pages
> are also checked for contiguity and merged.
>
> When mTHP is enabled the pages generally belong to a larger order
> folio. This patch series enables adding a large folio to a bio. It
> fetches the folio for each page in the page array. The page might start
> at an offset in the folio which could be a multiple of PAGE_SIZE.
> Subsequent pages in the page array might belong to the same folio.
> Using the folio length, folio offset, and remaining size, it determines
> the length within the folio which can be added to the bio. It checks
> whether pages are contiguous and belong to the same folio; if so,
> further processing for the contiguous pages is skipped.
>
> This scheme reduces the overhead of iterating through pages.
>
> perf diff before and after this change (with mTHP enabled):
>
> Perf diff for write I/O with 128K block size:
>     1.24%   -0.20%   [kernel.kallsyms]   [k] bio_iov_iter_get_pages
>     1.71%            [kernel.kallsyms]   [k] bvec_try_merge_page
> Perf diff for read I/O with 128K block size:
>     4.03%   -1.59%   [kernel.kallsyms]   [k] bio_iov_iter_get_pages
>     5.14%            [kernel.kallsyms]   [k] bvec_try_merge_page

This is not just about mTHP use though: this can also affect buffered IO
and direct IO patterns, and that needs to be considered and tested as
well.

I've given this a spin on top of the LBS patches [0] and used the LBS
patches as a baseline. The good news is I see a considerable amount of
larger IOs for buffered IO and direct IO; however, for buffered IO there
is an increase in misalignment to the target filesystem block size, and
that can affect performance.

You can test this with Daniel Gomez's blkalgn tool for IO introspection:

wget https://raw.githubusercontent.com/dkruces/bcc/lbs/tools/blkalgn.py
mv blkalgn.py /usr/local/bin/
apt-get install python3-bpfcc

And so let's try to make things "bad" by forcing a million small 4k
files onto a 64k block size filesystem; with this series we see an
increase in IOs aligned to only 4k by a factor of about 2133:

fio -name=1k-files-per-thread --nrfiles=1000 -direct=0 -bs=512 \
    -ioengine=io_uring --group_reporting=1 \
    --alloc-size=2097152 --filesize=4KiB --readwrite=randwrite \
    --fallocate=none --numjobs=1000 --create_on_open=1 --directory=$DIR

# Force any pending IO from the page cache
umount /xfs-64k/

You can use blkalgn with something like this:

mkfs.xfs -f -b size=64k /dev/nvme0n1
blkalgn -d nvme0n1 --ops Write --json-output 64k-next-20240723.json
# Hit CTRL-C after you umount above.

cat 64k-next-20240723.json
{
    "Block size": {
        "13": 1,
        "12": 6,
        "18": 244899,
        "16": 5236751,
        "17": 13088
    },
    "Algn size": {
        "18": 244899,
        "12": 6,
        "17": 9793,
        "13": 1,
        "16": 5240047
    }
}

And with this series, say 64k-next-20240723-block-folios.json:

{
    "Block size": {
        "16": 1018244,
        "9": 7,
        "17": 507163,
        "13": 16,
        "10": 4,
        "15": 51671,
        "12": 11,
        "14": 43,
        "11": 5
    },
    "Algn size": {
        "15": 6651,
        "16": 1018244,
        "13": 17620,
        "12": 23468,
        "17": 507163,
        "14": 4018
    }
}

The keys on the left-hand side are orders: an order-n bucket counts IOs
of 2^n bytes, so 12 is 4k and 16 is 64k. For example, we see only six 4k
IOs aligned to 4k with the baseline of just LBS on top of next-20240723.
With these patches the 4k IOs only increase to eleven, but 23,468 IOs
end up aligned to only 4k.

When using direct IO, since applications typically do the right thing, I
see only improvements. And so this needs a bit more testing and
evaluation of the impact on alignment for buffered IO.

[0] https://github.com/linux-kdevops/linux/tree/large-block-folio-for-next

  Luis
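To make the coalescing scheme in the quoted cover letter concrete, here
is a rough Python model of it. This is a sketch only: the kernel
implementation is in C, and the (folio_id, offset) page representation
and the coalesce() helper are invented for illustration. The idea is
that a run of pages sitting contiguously inside one folio is emitted as
a single bio segment, so the per-page merge checks are skipped for the
rest of the run:

PAGE_SIZE = 4096

def coalesce(pages, folio_sizes):
    """pages: list of (folio_id, byte offset of the page in its folio).
    folio_sizes: folio_id -> folio size in bytes."""
    segments = []  # resulting (folio_id, offset, length) bio segments
    i = 0
    while i < len(pages):
        folio, offset = pages[i]
        length = PAGE_SIZE
        # Absorb every following page that continues this folio
        # contiguously, instead of processing each page on its own.
        while (i + 1 < len(pages)
               and pages[i + 1] == (folio, offset + length)
               and offset + length < folio_sizes[folio]):
            length += PAGE_SIZE
            i += 1
        segments.append((folio, offset, length))
        i += 1
    return segments

# Eight contiguous pages of one 32k folio become a single segment:
pages = [(0, n * PAGE_SIZE) for n in range(8)]
print(coalesce(pages, {0: 32768}))  # -> [(0, 0, 32768)]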
On Thu, Aug 08, 2024 at 04:04:03PM -0700, Luis Chamberlain wrote:
> This is not just about mTHP use though: this can also affect buffered
> IO and direct IO patterns, and that needs to be considered and tested
> as well.

Not sure what the above is supposed to mean. Besides small tweaks to
very low-level helpers, the changes are entirely in the direct I/O path,
and they optimize that path for folios larger than PAGE_SIZE.

> I've given this a spin on top of the LBS patches [0] and used the LBS
> patches as a baseline. The good news is I see a considerable amount of
> larger IOs for buffered IO and direct IO; however, for buffered IO
> there is an increase in misalignment to the target filesystem block
> size, and that can affect performance.

Compared to what? There is nothing in this series changing buffered I/O
patterns. What do you compare? If this series changes buffered I/O
patterns, that is very well hidden and accidental, so we need to bisect
which patch does it and figure out why, but it would surprise me a lot.
On Mon, Aug 12, 2024 at 03:38:43PM +0200, Christoph Hellwig wrote:
> On Thu, Aug 08, 2024 at 04:04:03PM -0700, Luis Chamberlain wrote:
> > This is not just about mTHP use though: this can also affect buffered
> > IO and direct IO patterns, and that needs to be considered and tested
> > as well.
>
> Not sure what the above is supposed to mean. Besides small tweaks to
> very low-level helpers, the changes are entirely in the direct I/O
> path, and they optimize that path for folios larger than PAGE_SIZE.

Which was my expectation as well.

> > I've given this a spin on top of the LBS patches [0] and used the LBS
> > patches as a baseline. The good news is I see a considerable amount
> > of larger IOs for buffered IO and direct IO; however, for buffered IO
> > there is an increase in misalignment to the target filesystem block
> > size, and that can affect performance.
>
> Compared to what? There is nothing in this series changing buffered I/O
> patterns. What do you compare? If this series changes buffered I/O
> patterns, that is very well hidden and accidental, so we need to bisect
> which patch does it and figure out why, but it would surprise me a lot.

The comparison was without the patches vs. with the patches on the same
fio run with buffered IO. I'll re-test more times and bisect.

Thanks,

  Luis
On 12/08/24 09:35AM, Luis Chamberlain wrote:
> On Mon, Aug 12, 2024 at 03:38:43PM +0200, Christoph Hellwig wrote:
> > On Thu, Aug 08, 2024 at 04:04:03PM -0700, Luis Chamberlain wrote:
> > > This is not just about mTHP use though: this can also affect
> > > buffered IO and direct IO patterns, and that needs to be considered
> > > and tested as well.
> >
> > Not sure what the above is supposed to mean. Besides small tweaks to
> > very low-level helpers, the changes are entirely in the direct I/O
> > path, and they optimize that path for folios larger than PAGE_SIZE.
>
> Which was my expectation as well.
>
> > > I've given this a spin on top of the LBS patches [0] and used the
> > > LBS patches as a baseline. The good news is I see a considerable
> > > amount of larger IOs for buffered IO and direct IO; however, for
> > > buffered IO there is an increase in misalignment to the target
> > > filesystem block size, and that can affect performance.
> >
> > Compared to what? There is nothing in this series changing buffered
> > I/O patterns. What do you compare? If this series changes buffered
> > I/O patterns, that is very well hidden and accidental, so we need to
> > bisect which patch does it and figure out why, but it would surprise
> > me a lot.
>
> The comparison was without the patches vs. with the patches on the same
> fio run with buffered IO. I'll re-test more times and bisect.

I ran tests with LBS + the block folio patches and couldn't observe the
alignment issue. Also, the changes in this series are not executed when
we issue buffered I/O.
On Fri, Aug 16, 2024 at 02:15:41PM +0530, Kundan Kumar wrote:
> On 12/08/24 09:35AM, Luis Chamberlain wrote:
> > On Mon, Aug 12, 2024 at 03:38:43PM +0200, Christoph Hellwig wrote:
> > > On Thu, Aug 08, 2024 at 04:04:03PM -0700, Luis Chamberlain wrote:
> > > > This is not just about mTHP use though: this can also affect
> > > > buffered IO and direct IO patterns, and that needs to be
> > > > considered and tested as well.
> > >
> > > Not sure what the above is supposed to mean. Besides small tweaks
> > > to very low-level helpers, the changes are entirely in the direct
> > > I/O path, and they optimize that path for folios larger than
> > > PAGE_SIZE.
> >
> > Which was my expectation as well.
> >
> > > > I've given this a spin on top of the LBS patches [0] and used the
> > > > LBS patches as a baseline. The good news is I see a considerable
> > > > amount of larger IOs for buffered IO and direct IO; however, for
> > > > buffered IO there is an increase in misalignment to the target
> > > > filesystem block size, and that can affect performance.
> > >
> > > Compared to what? There is nothing in this series changing buffered
> > > I/O patterns. What do you compare? If this series changes buffered
> > > I/O patterns, that is very well hidden and accidental, so we need
> > > to bisect which patch does it and figure out why, but it would
> > > surprise me a lot.
> >
> > The comparison was without the patches vs. with the patches on the
> > same fio run with buffered IO. I'll re-test more times and bisect.
>
> I ran tests with LBS + the block folio patches and couldn't observe the
> alignment issue. Also, the changes in this series are not executed when
> we issue buffered I/O.

I can't quite understand yet why buffered IO is implicated either, but
data does not lie. The good news is I re-tested twice and get similar
results; *however*, what I failed to notice is that we also get a lot
more IOs, and this ends up even helping in other ways. So this is not
bad; in the end it is only good.
It is hard to visualize this, but an image says more than 1000 words, so
here you go:

https://lh3.googleusercontent.com/pw/AP1GczML6LevSkZ8yHTF9zu0xtXkzy332kd98XBp7biDrxyGWG2IXfgyKNpy6YItUYaWnVeLQSABGJgpiJOANppix7lIYb82_pjl_ZtCCjXenvkDgHGV3KlvXlayG4mAFR762jLugrI4osH0uoKRA1WGZk50xA=w1389-h690-s-no-gm

So the volume is what counts in the end. Say we use a tool like the
following, which takes the blkalgn JSON file as input and outputs a
worst-case workload WAF computation:

#!/usr/bin/python3
import json
import argparse
import math

def order_to_kb(order):
    return (2 ** order) / 1024

def calculate_waf(file_path, iu):
    with open(file_path, 'r') as file:
        data = json.load(file)

    block_size_orders = data["Block size"]
    algn_size_orders = data["Algn size"]

    # Calculate total host writes
    total_host_writes_kb = sum(order_to_kb(int(order)) * count
                               for order, count in block_size_orders.items())

    # Calculate total internal writes based on the provided logic
    total_internal_writes_kb = 0
    for order, count in block_size_orders.items():
        size_kb = order_to_kb(int(order))
        if size_kb >= iu:
            total_internal_writes_kb += size_kb * count
        else:
            total_internal_writes_kb += math.ceil(size_kb / iu) * iu * count

    # Calculate WAF
    waf = total_internal_writes_kb / total_host_writes_kb
    return waf

def main():
    parser = argparse.ArgumentParser(description="Calculate the Worst-case Write Amplification Factor (WAF) from JSON data.")
    parser.add_argument('file', type=str, help='Path to the JSON file containing the IO data.')
    parser.add_argument('--iu', type=int, default=16, help='Indirection Unit (IU) size in KB (default: 16)')
    args = parser.parse_args()

    file_path = args.file
    iu = args.iu

    waf = calculate_waf(file_path, iu)
    print(f"Worst-case WAF: {waf:.10f}")

if __name__ == "__main__":
    main()

compute-waf.py lbs.json --iu 64
Worst-case WAF: 1.0000116423

compute-waf.py lbs+block-folios.json --iu 64
Worst-case WAF: 1.0000095356

On my second run I have:

cat 01-next-20240723-LBS.json
{
    "Block size": {
        "18": 6960,
        "14": 302,
        "16": 2339746,
        "13": 165,
        "12": 88,
        "19": 117886,
        "10": 49,
        "9": 33,
        "11": 31,
        "17": 42707,
        "15": 89393
    },
    "Algn size": {
        "16": 2351238,
        "19": 117886,
        "18": 3353,
        "17": 34823,
        "15": 13067,
        "12": 40060,
        "14": 13583,
        "13": 23351
    }
}

cat 02-next-20240723-LBS+block-folios.json
{
    "Block size": {
        "11": 38,
        "10": 49,
        "12": 88,
        "15": 91949,
        "18": 33858,
        "17": 104329,
        "19": 129301,
        "9": 34,
        "13": 199,
        "16": 4912264,
        "14": 344
    },
    "Algn size": {
        "16": 4954494,
        "14": 10166,
        "13": 20527,
        "17": 82125,
        "19": 129301,
        "15": 13111,
        "12": 48897,
        "18": 13820
    }
}

compute-waf.py 01-next-20240723-LBS.json --iu 64
Worst-case WAF: 1.0131538374

compute-waf.py 02-next-20240723-LBS+block-folios.json --iu 64
Worst-case WAF: 1.0073550532

Things are even better for direct IO :) and so I encourage you to test,
as this is all nice.

Tested-by: Luis Chamberlain <mcgrof@kernel.org>

  Luis
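A quick hand check of the rounding rule the script applies may help; the
snippet below is not part of the thread, and the two-bucket histogram is
just an excerpt of the 64k-next-20240723-block-folios.json data above,
with a 64 KB indirection unit assumed. Any write of at least one IU
passes through unchanged, while each smaller write costs a whole IU in
the worst case, and the WAF is internal bytes divided by host bytes:

import math

# Hand check of compute-waf.py's rounding rule on two buckets excerpted
# from 64k-next-20240723-block-folios.json above ({order: count}).
iu = 64                            # indirection unit in KB
buckets = {16: 1018244, 12: 11}    # order 16 = 64k IOs, order 12 = 4k IOs

host_kb = sum((2 ** o / 1024) * n for o, n in buckets.items())

internal_kb = 0
for o, n in buckets.items():
    size_kb = 2 ** o / 1024
    # Writes of at least one IU pass through unchanged; smaller writes
    # cost a whole IU each in the worst case.
    if size_kb >= iu:
        internal_kb += size_kb * n
    else:
        internal_kb += math.ceil(size_kb / iu) * iu * n

print(internal_kb / host_kb)  # ~1.0000101: each 4k write burns a full 64k IU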