Message ID | 20240327160237.2355-1-bharata@amd.com
---|---
Series | Hot page promotion optimization for large address space
Bharata B Rao <bharata@amd.com> writes:

> In order to check how efficiently the existing NUMA balancing
> based hot page promotion mechanism can detect hot regions and
> promote pages for workloads with large memory footprints, I
> wrote and tested a program that allocates a huge amount of
> memory but routinely touches only small parts of it.
>
> This microbenchmark provisions memory on both the DRAM node and the CXL node.
> It then divides the entire allocated memory into chunks of smaller
> size and randomly chooses a chunk for generating memory accesses.
> Each chunk is then accessed for a fixed number of iterations to
> create the notion of hotness. Within each chunk, the individual
> pages at 4K granularity are again accessed in random fashion.
>
> When a chunk is taken up for access in this manner, its pages
> can be residing either on DRAM or on CXL. In the latter case, the NUMA
> balancing driven hot page promotion logic is expected to detect and
> promote the hot pages that reside on CXL.
>
> The experiment was conducted on a 2P AMD Bergamo system that has
> CXL as the 3rd node.
>
> $ numactl -H
> available: 3 nodes (0-2)
> node 0 cpus: 0-127,256-383
> node 0 size: 128054 MB
> node 1 cpus: 128-255,384-511
> node 1 size: 128880 MB
> node 2 cpus:
> node 2 size: 129024 MB
> node distances:
> node   0   1   2
>   0:  10  32  60
>   1:  32  10  50
>   2: 255 255  10
>
> It is seen that the number of pages that get promoted is really low, and
> the reason is that the NUMA hint fault latency turns out to be much
> higher than the hot threshold most of the time. Here are a few latency
> and threshold sample values captured from the
> should_numa_migrate_memory() routine when the benchmark was run:
>
> latency    threshold (in ms)
> 20620      1125
> 56185      1125
> 98710      1250
> 148871     1375
> 182891     1625
> 369415     1875
> 630745     2000

The access latency of your workload is 20s to 630s, which appears too
long. Can you try to increase the threshold range to deal with that?
For example,

echo 100000 > /sys/kernel/debug/sched/numa_balancing/hot_threshold_ms

[snip]

--
Best Regards,
Huang, Ying
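[Editor's note: for readers following along, the check being discussed compares the
"hint fault latency" (time from the NUMA scan that made the PTE inaccessible to the
hint fault it later triggers) against a hot threshold. Below is a minimal illustrative
sketch of that comparison, not the verbatim kernel code; the helper name and the
standalone main are invented for the example.]

#include <stdbool.h>
#include <stdio.h>

/*
 * Simplified illustration of the tiering-mode promotion check discussed
 * in this thread: a page on the slow node is promoted only if the time
 * from its last NUMA scan to the current hint fault is below the hot
 * threshold.
 */
static bool promote_on_hint_fault(unsigned int scan_time_ms,
				  unsigned int fault_time_ms,
				  unsigned int hot_threshold_ms)
{
	unsigned int latency_ms = fault_time_ms - scan_time_ms;

	return latency_ms < hot_threshold_ms;
}

int main(void)
{
	/* One of the samples above: latency 20620ms vs. threshold 1125ms. */
	printf("promote? %d\n", promote_on_hint_fault(0, 20620, 1125));   /* 0 */
	/* Same access, with the threshold raised to 100000ms as suggested. */
	printf("promote? %d\n", promote_on_hint_fault(0, 20620, 100000)); /* 1 */
	return 0;
}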
On 28-Mar-24 11:05 AM, Huang, Ying wrote:
> Bharata B Rao <bharata@amd.com> writes:
>
<snip>
>
>> It is seen that the number of pages that get promoted is really low, and
>> the reason is that the NUMA hint fault latency turns out to be much
>> higher than the hot threshold most of the time. Here are a few latency
>> and threshold sample values captured from the
>> should_numa_migrate_memory() routine when the benchmark was run:
>>
>> latency    threshold (in ms)
>> 20620      1125
>> 56185      1125
>> 98710      1250
>> 148871     1375
>> 182891     1625
>> 369415     1875
>> 630745     2000
>
> The access latency of your workload is 20s to 630s, which appears too
> long. Can you try to increase the threshold range to deal with that?
> For example,
>
> echo 100000 > /sys/kernel/debug/sched/numa_balancing/hot_threshold_ms

That of course should help. But I was exploring alternatives where the
notion of hotness can be de-linked from the absolute scanning time to
the extent possible. For large memory workloads where only parts of memory
get accessed at once, the scanning time can lag behind the actual access
time significantly, as the data above shows. Wondering if such cases can
be addressed without having to be workload-specific.

Regards,
Bharata.
Bharata B Rao <bharata@amd.com> writes:

> On 28-Mar-24 11:05 AM, Huang, Ying wrote:
>> Bharata B Rao <bharata@amd.com> writes:
>>
<snip>
>>
>> The access latency of your workload is 20s to 630s, which appears too
>> long. Can you try to increase the threshold range to deal with that?
>> For example,
>>
>> echo 100000 > /sys/kernel/debug/sched/numa_balancing/hot_threshold_ms
>
> That of course should help. But I was exploring alternatives where the
> notion of hotness can be de-linked from the absolute scanning time to

In fact, only the relative time from scan to hint fault is recorded and
used in the calculation; we have only a limited number of bits for it.

> the extent possible. For large memory workloads where only parts of memory
> get accessed at once, the scanning time can lag behind the actual access
> time significantly, as the data above shows. Wondering if such cases can
> be addressed without having to be workload-specific.

Does it really matter to promote pages that are quite cold (accessed less
often than once every 20s)? And if so, how can we adjust the current
algorithm to cover that? I think that may be possible by extending the
threshold range. And I think that we can find some way to extend the
range by default if necessary.

--
Best Regards,
Huang, Ying
On 28-Mar-24 11:33 AM, Huang, Ying wrote:
> Bharata B Rao <bharata@amd.com> writes:
>
<snip>
>
>> That of course should help. But I was exploring alternatives where the
>> notion of hotness can be de-linked from the absolute scanning time to
>
> In fact, only the relative time from scan to hint fault is recorded and
> used in the calculation; we have only a limited number of bits for it.
>
>> the extent possible. For large memory workloads where only parts of memory
>> get accessed at once, the scanning time can lag behind the actual access
>> time significantly, as the data above shows. Wondering if such cases can
>> be addressed without having to be workload-specific.
>
> Does it really matter to promote pages that are quite cold (accessed less
> often than once every 20s)? And if so, how can we adjust the current
> algorithm to cover that? I think that may be possible by extending the
> threshold range. And I think that we can find some way to extend the
> range by default if necessary.

I don't think the pages are cold; rather, the existing mechanism fails
to categorize them as hot. This is because the pages were scanned way
before the accesses start happening. When repeated accesses are made to
a chunk of memory that was scanned a while back, none of those accesses
get classified as hot because the scan time is way behind the current
access time. That's the reason we are seeing latency values ranging from
20s to 630s, as shown above.

Regards,
Bharata.
Bharata B Rao <bharata@amd.com> writes:

> On 28-Mar-24 11:33 AM, Huang, Ying wrote:
>> Bharata B Rao <bharata@amd.com> writes:
>>
<snip>
>>
>> Does it really matter to promote pages that are quite cold (accessed less
>> often than once every 20s)? And if so, how can we adjust the current
>> algorithm to cover that? I think that may be possible by extending the
>> threshold range. And I think that we can find some way to extend the
>> range by default if necessary.
>
> I don't think the pages are cold; rather, the existing mechanism fails
> to categorize them as hot. This is because the pages were scanned way
> before the accesses start happening. When repeated accesses are made to
> a chunk of memory that was scanned a while back, none of those accesses
> get classified as hot because the scan time is way behind the current
> access time. That's the reason we are seeing latency values ranging from
> 20s to 630s, as shown above.

If the repeated accesses continue, the page will be identified as hot
when it is scanned the next time, even if we don't expand the threshold
range. If the repeated accesses last only a very short time, it makes
little sense to identify the pages as hot. Right?

The bits available to record the scan time or a hint page fault count
are limited, so it's possible for them to overflow anyway. We can scale
the time stamp if necessary (for example, from 1ms to 10ms granularity).
But it's hard to scale a fault counter. And nobody can guarantee that
the hint page fault frequency stays below 1/ms; if it's 10/ms, the
counter can only cover a short interval before overflowing.

--
Best Regards,
Huang, Ying
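[Editor's note: to illustrate the "limited bits" point, a per-page time stamp stored
in a narrow field wraps quickly, and coarsening its unit stretches the range it can
represent. This is a small self-contained sketch; the field width and helper names
are chosen only for illustration and are not the kernel's actual layout.]

#include <stdio.h>

#define TIME_BITS	14			/* assumed width, for illustration */
#define TIME_MASK	((1u << TIME_BITS) - 1)

/* Record "now" in the limited field, at a given granularity (ms per unit). */
static unsigned int record_time(unsigned int now_ms, unsigned int unit_ms)
{
	return (now_ms / unit_ms) & TIME_MASK;
}

/* Latency from the recorded stamp to "now", tolerating one wrap-around. */
static unsigned int latency_ms(unsigned int stamp, unsigned int now_ms,
			       unsigned int unit_ms)
{
	return (((now_ms / unit_ms) - stamp) & TIME_MASK) * unit_ms;
}

int main(void)
{
	/* With 1ms units, 14 bits cover only ~16s before wrapping ...     */
	printf("range @1ms unit : %u ms\n", TIME_MASK * 1);
	/* ... scaling to 10ms units stretches the same bits to ~164s.     */
	printf("range @10ms unit: %u ms\n", TIME_MASK * 10);

	unsigned int scan = record_time(5000, 10);	 /* scanned at t=5s  */
	printf("latency: %u ms\n", latency_ms(scan, 25000, 10)); /* 20000  */
	return 0;
}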
On 29-Mar-24 6:44 AM, Huang, Ying wrote:
> Bharata B Rao <bharata@amd.com> writes:
<snip>
>> I don't think the pages are cold; rather, the existing mechanism fails
>> to categorize them as hot. This is because the pages were scanned way
>> before the accesses start happening. When repeated accesses are made to
>> a chunk of memory that was scanned a while back, none of those accesses
>> get classified as hot because the scan time is way behind the current
>> access time. That's the reason we are seeing latency values ranging from
>> 20s to 630s, as shown above.
>
> If the repeated accesses continue, the page will be identified as hot
> when it is scanned the next time, even if we don't expand the threshold
> range. If the repeated accesses last only a very short time, it makes
> little sense to identify the pages as hot. Right?

The total allocated memory here is 192G and the chunk size is 1G. Each
time, one such 1G chunk is taken up randomly for generating memory
accesses. Within that 1G, 262144 random accesses are performed, and such
a set of 262144 accesses is repeated 512 times. I thought that should be
enough to classify that chunk of memory as hot. But as we see, the scan
time often lags the access time by a large value.

Let me instrument the code further to learn more insights (if possible)
about the scanning/fault time behaviors here.

Leaving the fault-count-based threshold apart, do you think there is
value in updating the scan time for skipped pages/PTEs during every
scan so that the scan time remains current for all the pages?

> The bits available to record the scan time or a hint page fault count
> are limited, so it's possible for them to overflow anyway. We can scale
> the time stamp if necessary (for example, from 1ms to 10ms granularity).
> But it's hard to scale a fault counter. And nobody can guarantee that
> the hint page fault frequency stays below 1/ms; if it's 10/ms, the
> counter can only cover a short interval before overflowing.

Yes, with the approach I have taken, the time factor is out of the
equation and the notion of hotness is purely a function of the number of
faults (or accesses).

Regards,
Bharata.
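[Editor's note: for concreteness, a minimal C reconstruction of the access pattern as
described above. This is not the actual benchmark program; NUMA placement across DRAM
and CXL and the timing instrumentation are omitted, and a plain malloc() stands in for
the real provisioning.]

#include <stdlib.h>

#define TOTAL_SZ	(192UL << 30)	/* 192G footprint, as described     */
#define CHUNK_SZ	(1UL << 30)	/* 1G chunks                        */
#define PAGE_SZ		4096UL
#define ACCESSES	262144		/* random 4K-page touches per pass  */
#define ITERS		512		/* passes over one chosen chunk     */

int main(void)
{
	/* The real program provisions this across the DRAM and CXL nodes
	 * (e.g. via numactl/mbind); malloc keeps the sketch self-contained. */
	char *mem = malloc(TOTAL_SZ);
	if (!mem)
		return 1;

	unsigned long nr_chunks = TOTAL_SZ / CHUNK_SZ;
	unsigned long pages_per_chunk = CHUNK_SZ / PAGE_SZ;

	for (int pick = 0; pick < 1024; pick++) {
		/* Randomly choose one 1G chunk ...                        */
		char *chunk = mem + (rand() % nr_chunks) * CHUNK_SZ;

		/* ... and touch random 4K pages within it: 512 iterations
		 * of 262144 accesses, which is expected to make it "hot". */
		for (int it = 0; it < ITERS; it++)
			for (int i = 0; i < ACCESSES; i++)
				chunk[(rand() % pages_per_chunk) * PAGE_SZ] += 1;
	}

	free(mem);
	return 0;
}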
Bharata B Rao <bharata@amd.com> writes:

> On 29-Mar-24 6:44 AM, Huang, Ying wrote:
>> Bharata B Rao <bharata@amd.com> writes:
> <snip>
>> If the repeated accesses continue, the page will be identified as hot
>> when it is scanned the next time, even if we don't expand the threshold
>> range. If the repeated accesses last only a very short time, it makes
>> little sense to identify the pages as hot. Right?
>
> The total allocated memory here is 192G and the chunk size is 1G. Each
> time, one such 1G chunk is taken up randomly for generating memory
> accesses. Within that 1G, 262144 random accesses are performed, and such
> a set of 262144 accesses is repeated 512 times. I thought that should be
> enough to classify that chunk of memory as hot.

IIUC, some pages are accessed within a very short time (maybe within
1ms). This isn't repeated access over a long period. I think that pages
accessed repeatedly over a long period are good candidates for
promotion, but pages accessed frequently within only a very short time
aren't.

> But as we see, the scan
> time often lags the access time by a large value.
>
> Let me instrument the code further to learn more insights (if possible)
> about the scanning/fault time behaviors here.
>
> Leaving the fault-count-based threshold apart, do you think there is
> value in updating the scan time for skipped pages/PTEs during every
> scan so that the scan time remains current for all the pages?

No, I don't think so. That would make the hint page fault latency even
more inaccurate.

>> The bits available to record the scan time or a hint page fault count
>> are limited, so it's possible for them to overflow anyway. We can scale
>> the time stamp if necessary (for example, from 1ms to 10ms granularity).
>> But it's hard to scale a fault counter. And nobody can guarantee that
>> the hint page fault frequency stays below 1/ms; if it's 10/ms, the
>> counter can only cover a short interval before overflowing.
>
> Yes, with the approach I have taken, the time factor is out of the
> equation and the notion of hotness is purely a function of the number of
> faults (or accesses).

Sorry, I don't get your idea here. I think that the fault count may be
worse than time in quite a few cases.

--
Best Regards,
Huang, Ying
On 02-Apr-24 7:33 AM, Huang, Ying wrote:
> Bharata B Rao <bharata@amd.com> writes:
>
>> On 29-Mar-24 6:44 AM, Huang, Ying wrote:
>>> Bharata B Rao <bharata@amd.com> writes:
>> <snip>
>>> If the repeated accesses continue, the page will be identified as hot
>>> when it is scanned the next time, even if we don't expand the threshold
>>> range. If the repeated accesses last only a very short time, it makes
>>> little sense to identify the pages as hot. Right?
>>
>> The total allocated memory here is 192G and the chunk size is 1G. Each
>> time, one such 1G chunk is taken up randomly for generating memory
>> accesses. Within that 1G, 262144 random accesses are performed, and such
>> a set of 262144 accesses is repeated 512 times. I thought that should be
>> enough to classify that chunk of memory as hot.
>
> IIUC, some pages are accessed within a very short time (maybe within
> 1ms). This isn't repeated access over a long period. I think that pages
> accessed repeatedly over a long period are good candidates for
> promotion, but pages accessed frequently within only a very short time
> aren't.

Here are the numbers for the 192nd chunk:

Each iteration of 262144 random accesses takes around ~10ms.
512 such iterations take ~5s.
numa_scan_seq is 16 when this chunk is accessed.
No page promotions were done from this chunk: every time,
should_numa_migrate_memory() found the NUMA hint fault latency to be
higher than the threshold.

Are these time periods considered too short for the pages to be
detected as hot and promoted?

>> But as we see, the scan
>> time often lags the access time by a large value.
>>
>> Let me instrument the code further to learn more insights (if possible)
>> about the scanning/fault time behaviors here.
>>
>> Leaving the fault-count-based threshold apart, do you think there is
>> value in updating the scan time for skipped pages/PTEs during every
>> scan so that the scan time remains current for all the pages?
>
> No, I don't think so. That would make the hint page fault latency even
> more inaccurate.

For the case that I have shown, depending on an old value of the scan
time doesn't work well when pages get accessed a long time after being
scanned. At least with the scheme I show in patch 2/2, the probability
of detecting pages as hot increases.

Regards,
Bharata.
Bharata B Rao <bharata@amd.com> writes:

> On 02-Apr-24 7:33 AM, Huang, Ying wrote:
>> Bharata B Rao <bharata@amd.com> writes:
>>
<snip>
>>
>> IIUC, some pages are accessed within a very short time (maybe within
>> 1ms). This isn't repeated access over a long period. I think that pages
>> accessed repeatedly over a long period are good candidates for
>> promotion, but pages accessed frequently within only a very short time
>> aren't.
>
> Here are the numbers for the 192nd chunk:
>
> Each iteration of 262144 random accesses takes around ~10ms.
> 512 such iterations take ~5s.
> numa_scan_seq is 16 when this chunk is accessed.
> No page promotions were done from this chunk: every time,
> should_numa_migrate_memory() found the NUMA hint fault latency to be
> higher than the threshold.
>
> Are these time periods considered too short for the pages to be
> detected as hot and promoted?

Yes, I think so. This is burst accessing, not repeated accessing. IIUC,
NUMA balancing based promotion only works for accesses repeated over a
long time, for example, >100s.

>>> Leaving the fault-count-based threshold apart, do you think there is
>>> value in updating the scan time for skipped pages/PTEs during every
>>> scan so that the scan time remains current for all the pages?
>>
>> No, I don't think so. That would make the hint page fault latency even
>> more inaccurate.
>
> For the case that I have shown, depending on an old value of the scan
> time doesn't work well when pages get accessed a long time after being
> scanned. At least with the scheme I show in patch 2/2, the probability
> of detecting pages as hot increases.

Yes, this may help your cases, but it will hurt other cases through the
now-inaccurate hint page fault latency. To resolve your issue, we can
increase the max value of the hot threshold automatically. We can work
on that if you can find a real workload.

--
Best Regards,
Huang, Ying
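[Editor's note: as a rough illustration of the "increase the max value of the hot
threshold automatically" idea, a feedback loop could raise or lower the threshold each
period based on how many promotion candidates were seen versus a rate-limit-derived
target. The sketch below is assumption-laden (names, step size, and the cap are
invented for this example) and is not the in-kernel adjustment code.]

#include <stdio.h>

static unsigned int adjust_threshold(unsigned int th_ms,
				     unsigned long candidates_in_period,
				     unsigned long target_per_period,
				     unsigned int default_th_ms,
				     unsigned int max_th_ms)
{
	unsigned int step = default_th_ms / 16;

	if (candidates_in_period > target_per_period + target_per_period / 10)
		th_ms = (th_ms > 2 * step) ? th_ms - step : step;      /* tighten */
	else if (candidates_in_period < target_per_period - target_per_period / 10)
		th_ms = (th_ms + step < max_th_ms) ? th_ms + step : max_th_ms;

	return th_ms;
}

int main(void)
{
	unsigned int th = 1000;	/* start at an assumed default threshold (ms) */

	/* Few candidates found for several periods: the threshold creeps up
	 * toward the cap; the cap is the "max value" being discussed here.  */
	for (int period = 0; period < 8; period++) {
		th = adjust_threshold(th, 100, 1000, 1000, 10000);
		printf("period %d: threshold = %u ms\n", period, th);
	}
	return 0;
}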
On 03-Apr-24 2:10 PM, Huang, Ying wrote:
> Bharata B Rao <bharata@amd.com> writes:
>
<snip>
>
>> Here are the numbers for the 192nd chunk:
>>
>> Each iteration of 262144 random accesses takes around ~10ms.
>> 512 such iterations take ~5s.
>> numa_scan_seq is 16 when this chunk is accessed.
>> No page promotions were done from this chunk: every time,
>> should_numa_migrate_memory() found the NUMA hint fault latency to be
>> higher than the threshold.
>>
>> Are these time periods considered too short for the pages to be
>> detected as hot and promoted?
>
> Yes, I think so. This is burst accessing, not repeated accessing. IIUC,
> NUMA balancing based promotion only works for accesses repeated over a
> long time, for example, >100s.

Hmm... a page is accessed 512 times over a period of 5s and is still not
detected as hot. This would be understandable if fresh scanning couldn't
be done because the accesses were bursty and hence couldn't be captured
via NUMA hint faults. But here an access that was captured via a hint
fault is being rejected as not hot because the scanning was done a while
back. That said, I do see the challenge, since we depend on the scanning
time to obtain the frequency-of-access metric.

BTW, for the same scenario with numa_balancing_mode=1, the remote
accesses do get detected and migration to the source node is tried. It
is a different matter that, in this specific scenario, the pages
eventually can't be migrated as the src node is already full.

Regards,
Bharata.
Bharata B Rao <bharata@amd.com> writes:

> On 03-Apr-24 2:10 PM, Huang, Ying wrote:
>> Bharata B Rao <bharata@amd.com> writes:
>>
<snip>
>>
>>> Here are the numbers for the 192nd chunk:
>>>
>>> Each iteration of 262144 random accesses takes around ~10ms.
>>> 512 such iterations take ~5s.
>>> numa_scan_seq is 16 when this chunk is accessed.
>>> No page promotions were done from this chunk: every time,
>>> should_numa_migrate_memory() found the NUMA hint fault latency to be
>>> higher than the threshold.
>>>
>>> Are these time periods considered too short for the pages to be
>>> detected as hot and promoted?
>>
>> Yes, I think so. This is burst accessing, not repeated accessing. IIUC,
>> NUMA balancing based promotion only works for accesses repeated over a
>> long time, for example, >100s.
>
> Hmm... a page is accessed 512 times over a period of 5s and is still not
> detected as hot. This would be understandable if fresh scanning couldn't
> be done because the accesses were bursty and hence couldn't be captured
> via NUMA hint faults. But here an access that was captured via a hint
> fault is being rejected as not hot because the scanning was done a while
> back. That said, I do see the challenge, since we depend on the scanning
> time to obtain the frequency-of-access metric.

Consider some pages that are accessed once every hour: should we
consider them hot or not? Will your proposed method deal with that
correctly?

> BTW, for the same scenario with numa_balancing_mode=1, the remote
> accesses do get detected and migration to the source node is tried. It
> is a different matter that, in this specific scenario, the pages
> eventually can't be migrated as the src node is already full.

--
Best Regards,
Huang, Ying
On 12-Apr-24 12:58 PM, Huang, Ying wrote:
> Bharata B Rao <bharata@amd.com> writes:
>
<snip>
>
> Consider some pages that are accessed once every hour: should we
> consider them hot or not? Will your proposed method deal with that
> correctly?

The proposed method removes absolute time as a factor in the decision
and instead relies on the number of hint faults that have occurred since
that page was last scanned. As long as enough hint faults happen in that
1 hour (which means many other accesses have been captured in that
1 hour), that page shouldn't be considered hot. You did mention earlier
that the hint fault rate varies a lot, and one thing I haven't tried yet
is to vary the fault threshold based on the current or historical fault
rate.

Regards,
Bharata.
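[Editor's note: a bare-bones illustration of the fault-count-based criterion as
described in the paragraph above. The names, the global counter, and the threshold
value are hypothetical; this is not the code from patch 2/2.]

#include <stdbool.h>
#include <stdio.h>

static unsigned long total_hint_faults;	/* bumped on every hint fault */

/* Snapshot taken for a page when the scanner makes its PTE inaccessible. */
static unsigned long scan_snapshot(void)
{
	return total_hint_faults;
}

/*
 * At hint-fault time: if only a few other faults were recorded since this
 * page was scanned, the page is among the more actively touched ones and
 * is treated as hot; if many other accesses intervened, it is not.
 */
static bool page_is_hot(unsigned long snapshot, unsigned long fault_threshold)
{
	return (total_hint_faults - snapshot) < fault_threshold;
}

int main(void)
{
	unsigned long snap = scan_snapshot();

	total_hint_faults += 50;	/* few faults elsewhere since the scan */
	printf("hot? %d\n", page_is_hot(snap, 1000));	/* 1 */

	total_hint_faults += 100000;	/* many other accesses intervened      */
	printf("hot? %d\n", page_is_hot(snap, 1000));	/* 0 */
	return 0;
}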
Bharata B Rao <bharata@amd.com> writes:

> On 12-Apr-24 12:58 PM, Huang, Ying wrote:
>> Bharata B Rao <bharata@amd.com> writes:
>>
<snip>
>>
>> Consider some pages that are accessed once every hour: should we
>> consider them hot or not? Will your proposed method deal with that
>> correctly?
>
> The proposed method removes absolute time as a factor in the decision
> and instead relies on the number of hint faults that have occurred since
> that page was last scanned. As long as enough hint faults happen in that
> 1 hour (which means many other accesses have been captured in that
> 1 hour), that page shouldn't be considered hot. You did mention earlier
> that the hint fault rate varies a lot, and one thing I haven't tried yet
> is to vary the fault threshold based on the current or historical fault
> rate.

In your original example, if many other accesses occur between the NUMA
balancing page table scanning and the 512 page accesses, you cannot
identify the page as hot either, right? If the NUMA balancing page table
scanning period is much longer than 5s, it's highly possible that we
cannot distinguish between 1 and 512 page accesses within 5s, with
either your method or the original method.

It would be better to discuss the behavior with a more detailed example,
for example: when the page is scanned, how many pages are accessed, how
long between accesses, etc.

--
Best Regards,
Huang, Ying