Message ID | 20200304102840.2801-1-thomas_os@shipmail.org (mailing list archive)
---|---
Series | Huge page-table entries for TTM
On 3/4/20 11:28 AM, Thomas Hellström (VMware) wrote:
> In order to reduce CPU usage [1] and, in theory, TLB misses, this patchset enables
> huge and giant page-table entries for TTM and TTM-enabled graphics drivers.
>
> Patches 1 and 2 introduce a vma_is_special_huge() function to make the mm code
> take the same path as DAX when splitting huge and giant page-table entries
> (which currently means zapping the page-table entry and relying on re-faulting).
>
> Patch 3 makes the mm code split existing huge page-table entries
> on huge_fault fallbacks, typically on COW or on buffer objects that want
> write-notify. COW and write-notification are always done on the lowest
> page-table level. See the patch log message for additional considerations.
>
> Patch 4 introduces functions to allow the graphics drivers to manipulate
> the caching and encryption flags of huge page-table entries without ugly
> hacks.
>
> Patch 5 implements the huge_fault handler in TTM.
> This enables huge page-table entries, provided that the kernel is configured
> to support transhuge pages, either by default or using madvise().
> However, they are unlikely to be inserted unless the kernel buffer object
> pfns and user-space addresses align perfectly. There are various options
> here, but since buffer objects that reside in system pages typically start
> at huge page boundaries if they are backed by huge pages, we try to enforce
> buffer object starting pfns and user-space addresses to be huge page-size
> aligned if their size exceeds a huge page size. If PUD-size transhuge
> ("giant") pages are enabled by the arch, the same holds for those.
>
> Patch 6 implements a specialized huge_fault handler for vmwgfx.
> The vmwgfx driver may perform dirty-tracking and needs some special code
> to handle that correctly.
>
> Patch 7 implements a drm helper to align user-space addresses according
> to the above scheme, if possible.
>
> Patch 8 implements a TTM range manager for vmwgfx that does the same for
> graphics IO memory. This may later be reused by other graphics drivers
> if necessary.
>
> Patch 9 finally hooks up the helpers of patches 7 and 8 to the vmwgfx driver.
> A similar change is needed for graphics drivers that want a reasonable
> likelihood of actually using huge page-table entries.
>
> If a buffer object size is not huge-page or giant-page aligned,
> its size will NOT be inflated by this patchset. This means that the buffer
> object tail will use smaller-size page-table entries and thus no memory
> overhead occurs. Drivers that want to pay the memory overhead price need to
> implement their own scheme to inflate buffer-object sizes.
>
> PMD-size huge page-table entries have been tested with vmwgfx and found to
> work well both with system memory backed and IO memory backed buffer objects.
>
> PUD-size giant page-table entries have seen limited (fault and COW) testing
> using a modified kernel (to support 1GB page allocations) and a fake vmwgfx
> TTM memory type. The vmwgfx driver does not otherwise support 1GB-size IO
> memory resources.
>
> Comments and suggestions welcome.
> Thomas
>
> Changes since RFC:
> * Check for buffer objects present in contiguous IO memory (Christian König)
> * Rebased on the vmwgfx emulated coherent memory functionality. That rebase
>   adds patch 5.
> Changes since v1:
> * Make the new TTM range manager vmwgfx-specific. (Christian König)
> * Minor fixes for configs that don't support or only partially support
>   transhuge pages.
> Changes since v2:
> * Minor coding style and doc fixes in patch 5/9 (Christian König)
> * Patch 5/9 doesn't touch mm. Remove from the patch title.
> Changes since v3:
> * Added reviews and acks
> * Implemented ugly but generic ttm_pgprot_is_wrprotecting() instead of
>   arch-specific code.
> Changes since v4:
> * Added timings (Andrew Morton)
> * Updated function documentation (Andrew Morton)
> Changes since v6:
> * Fix drm build error with !CONFIG_MMU
>
> [1]
> The test program below generates the following GNU time output when run on a
> vmwgfx-enabled kernel without the patch series:
>
> 4.78user 6.02system 0:10.91elapsed 99%CPU (0avgtext+0avgdata 1624maxresident)k
> 0inputs+0outputs (0major+640077minor)pagefaults 0swaps
>
> and with the patch series:
>
> 1.71user 3.60system 0:05.40elapsed 98%CPU (0avgtext+0avgdata 1656maxresident)k
> 0inputs+0outputs (0major+20079minor)pagefaults 0swaps
>
> A consistent reduction in graphics page faults can be seen with normal
> graphics applications, but due to the aggressive buffer-object caching in the
> vmwgfx user-space drivers the CPU time reduction is within the error margin.
>
> #include <fcntl.h>     /* open() */
> #include <stdio.h>     /* perror() */
> #include <stdlib.h>    /* exit() */
> #include <string.h>
> #include <sys/mman.h>
> #include <unistd.h>
> #include <xf86drm.h>
>
> static void checkerr(int ret, const char *name)
> {
>         if (ret < 0) {
>                 perror(name);
>                 exit(-1);
>         }
> }
>
> int main(int argc, const char *argv[])
> {
>         struct drm_mode_create_dumb c_arg = {0};
>         struct drm_mode_map_dumb m_arg = {0};
>         struct drm_mode_destroy_dumb d_arg = {0};
>         int ret, i, fd;
>         void *map;
>
>         fd = open("/dev/dri/card0", O_RDWR);
>         checkerr(fd, argv[0]);
>
>         for (i = 0; i < 10000; ++i) {
>                 /* Create a 4 MiB dumb buffer object and map it. */
>                 c_arg.bpp = 32;
>                 c_arg.width = 1024;
>                 c_arg.height = 1024;
>                 ret = drmIoctl(fd, DRM_IOCTL_MODE_CREATE_DUMB, &c_arg);
>                 checkerr(ret, argv[0]);
>
>                 m_arg.handle = c_arg.handle;
>                 ret = drmIoctl(fd, DRM_IOCTL_MODE_MAP_DUMB, &m_arg);
>                 checkerr(ret, argv[0]);
>
>                 map = mmap(NULL, c_arg.size, PROT_READ | PROT_WRITE, MAP_SHARED, fd,
>                            m_arg.offset);
>                 checkerr(map == MAP_FAILED ? -1 : 0, argv[0]);
>
>                 /* Ask for transhuge mappings, then write to fault in the pages. */
>                 (void) madvise((void *) map, c_arg.size, MADV_HUGEPAGE);
>                 memset(map, 0x67, c_arg.size);
>                 munmap(map, c_arg.size);
>
>                 d_arg.handle = c_arg.handle;
>                 ret = drmIoctl(fd, DRM_IOCTL_MODE_DESTROY_DUMB, &d_arg);
>                 checkerr(ret, argv[0]);
>         }
>
>         close(fd);
>         return 0;
> }
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: "Christian König" <christian.koenig@amd.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
>
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel

Andrew, would it be possible to have an ack for merge using a DRM tree
for the -mm patches?

Thanks,
Thomas
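For readers following along, the vma_is_special_huge() predicate that patches 1 and 2 introduce boils down to a small check on the VMA. A minimal sketch of what such a check can look like, where the exact condition is an assumption derived from the cover letter's description rather than a quote of the patch:

    #include <linux/dax.h>
    #include <linux/mm.h>

    /*
     * Sketch only: treat DAX VMAs and file-backed PFN/mixed mappings (as used
     * by TTM) as "special", so that splitting one of their huge entries zaps
     * the entry and relies on re-faulting instead of materializing a table of
     * smaller entries.
     */
    static inline bool vma_is_special_huge(const struct vm_area_struct *vma)
    {
            return vma_is_dax(vma) ||
                   (vma->vm_file &&
                    (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)));
    }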
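Patch 5's TTM huge_fault handler, in turn, plugs into the existing huge_fault callback of struct vm_operations_struct. A rough sketch of how a TTM-based driver might wire this up; my_ttm_fault, my_ttm_huge_fault and my_ttm_insert_huge are hypothetical names and not functions from the series, while the callback signature is the one used by kernels of this era:

    #include <linux/huge_mm.h>
    #include <linux/mm.h>

    /* Hypothetical helpers: a real driver would resolve the buffer object
     * from vmf->vma and insert a PFN of the requested order. */
    vm_fault_t my_ttm_fault(struct vm_fault *vmf);
    vm_fault_t my_ttm_insert_huge(struct vm_fault *vmf, unsigned int page_order);

    static vm_fault_t my_ttm_huge_fault(struct vm_fault *vmf,
                                        enum page_entry_size pe_size)
    {
            switch (pe_size) {
            case PE_SIZE_PMD:
                    /* Try a PMD-sized entry; misaligned BOs fall back below. */
                    return my_ttm_insert_huge(vmf, PMD_SHIFT - PAGE_SHIFT);
    #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
            case PE_SIZE_PUD:
                    return my_ttm_insert_huge(vmf, PUD_SHIFT - PAGE_SHIFT);
    #endif
            default:
                    /* The mm core retries the fault at the PTE level. */
                    return VM_FAULT_FALLBACK;
            }
    }

    static const struct vm_operations_struct my_ttm_vm_ops = {
            .fault      = my_ttm_fault,      /* PTE-level path */
            .huge_fault = my_ttm_huge_fault, /* PMD/PUD-level path */
    };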
On Mon, 16 Mar 2020 13:32:08 +0100 Thomas Hellström (VMware) <thomas_os@shipmail.org> wrote:

> Andrew, would it be possible to have an ack for merge using a DRM tree
> for the -mm patches?

Yes, please do. It's all pretty straightforward addition of new
functionality which won't affect existing code.
On 3/19/20 12:27 AM, Andrew Morton wrote:
> On Mon, 16 Mar 2020 13:32:08 +0100 Thomas Hellström (VMware) <thomas_os@shipmail.org> wrote:
>
>> Andrew, would it be possible to have an ack for merge using a DRM tree
>> for the -mm patches?
>
> Yes, please do. It's all pretty straightforward addition of new
> functionality which won't affect existing code.

Thanks Andrew. Can I add your Acked-by: to the mm patches for Linus'
reference?

Thanks,
Thomas
On Thu, 19 Mar 2020 11:20:44 +0100 Thomas Hellström (VMware) <thomas_os@shipmail.org> wrote:

> Thanks Andrew. Can I add your Acked-by: to the mm patches for Linus'
> reference?

Please do.
On 3/4/20 11:28 AM, Thomas Hellström (VMware) wrote:
> In order to reduce CPU usage [1] and, in theory, TLB misses, this patchset enables
> huge and giant page-table entries for TTM and TTM-enabled graphics drivers.

Hi, Christian,

I think this should be OK to merge now. Is it OK if I ask Dave to pull
this separately?

Thanks,
Thomas
Yeah, sure, go ahead. It's just that I am out of office because of COVID-19
and won't be able to help if it goes up in flames :)

Cheers,
Christian

On 24.03.2020 11:03, "Thomas Hellström (VMware)" <thomas_os@shipmail.org> wrote:
> Hi, Christian,
>
> I think this should be OK to merge now. Is it OK if I ask Dave to pull
> this separately?
>
> Thanks,
> Thomas