[0/2] Introduce DMA_HEAP_IOCTL_ALLOC_AND_READ

Message ID 20240711074221.459589-1-link@vivo.com (mailing list archive)

Huan Yang July 11, 2024, 7:42 a.m. UTC
Background
====
We are currently facing some challenges when loading the model file into DMA-BUF.
  1. Our camera application algorithm model has reached the 1GB level.
  2. Our AI application's 3B model has reached the 1GB level, and the 7B model
     has reached the 3GB level.
The above-mentioned internal applications all require reading the model files
into dma-buf for sharing between the CPU and DMA devices.

Consider the current pathway for loading model files into DMA-BUF:
  1. open dma-heap, get heap fd
  2. open file, get fd
  3. allocate dma-buf, get dma-buf fd
  4. mmap dma-buf fd, get vaddr
  5. read(file_fd, vaddr, file_size) into dma-buf pages
  6. share, attach, whatever you want

IMO, the above process has two inefficiencies:
  1. We must wait for the dma-buf allocation to succeed before we can start
     loading the file into it.
  2. Loading the file into the dma-buf goes through the page cache.
As mentioned above, we currently have scenarios where we need to load
gigabyte-sized files into DMA-BUF.
That means:
  1. The dma-buf also needs to be gigabytes in size, so if available memory is
     not enough, allocation enters the slow path and waits. If we start
     loading the file into the memory that is already allocated, we can save
     time by running allocation and file reads in parallel.
  2. At GB scale the page cache does not help speed up the file load (it is
     reclaimed quickly), and we need two copies to load the file into the
     dma-buf:
     a) load the file into the page cache
     b) memcpy from the page cache to the dma-buf

DMA_HEAP_IOCTL_ALLOC_AND_READ
===
This patchset implements a new ioctl, DMA_HEAP_IOCTL_ALLOC_AND_READ, which can
be used to allocate and read a file into a dma-buf in a single operation.
This ioctl is similar to DMA_HEAP_IOCTL_ALLOC, but it also reads the file into
the dma-buf.

Unlike DMA_HEAP_IOCTL_ALLOC, the user does not pass the size of the dma-buf,
but rather the file descriptor of the opened file.
The user can also provide a `batch` size: each time the allocated memory
reaches it, file I/O is triggered. The default is 128MB.
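
To illustrate what `batch` means for a large file, here is a small userspace
sketch that splits a file into batch-sized read work items. It is only an
illustration: `struct read_work` and `plan_read_works` are hypothetical names,
128MB is assumed to mean 128 MiB, and how the kernel actually splits the work
is defined by the patches, not by this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: the 128MB default is 128 MiB. */
#define DEFAULT_BATCH (128ULL << 20)

/* Hypothetical description of one file read triggered by a full batch. */
struct read_work {
    uint64_t offset;   /* file offset where this read starts */
    uint64_t len;      /* bytes to read, at most one batch */
};

/*
 * Split file_size bytes into batch-sized work items. At most max items are
 * stored in works[]; the return value is the number of items needed.
 * A batch of 0 selects DEFAULT_BATCH.
 */
size_t plan_read_works(uint64_t file_size, uint64_t batch,
                       struct read_work *works, size_t max)
{
    size_t n = 0;

    if (batch == 0)
        batch = DEFAULT_BATCH;

    for (uint64_t off = 0; off < file_size; off += batch) {
        uint64_t left = file_size - off;

        if (n < max) {
            works[n].offset = off;
            works[n].len = left < batch ? left : batch;
        }
        n++;
    }
    return n;
}
```

With these assumptions, the 3GB test file used below splits into 24 work
items, so the first read can start after 128 MiB has been allocated instead of
after the full 3GB.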

Both buffered I/O and direct I/O (O_DIRECT) are accepted, but if the file size
reaches the GB level, a warning is issued when buffered I/O is used.

In kernel space, a heap_fwork_t kthread consumes all produced file read work.
Reading is single-threaded (with reads this large, multiple threads may not
help).


Reference
===
Currently, we have many patches that aim to make dma-buf support direct I/O in
userspace.

Liu's recent work:
https://lore.kernel.org/all/20240710140948.25870-1-liulei.rjpt@vivo.com/

However, this patchset is not focused on enabling dma-buf to perform direct
I/O in userspace. The main goal is to have the file contents already loaded
into the dma-buf by the time the allocation completes. Buffered I/O and direct
I/O are both just ways to perform the file read.


Performance
===
Here are some self-test results:

3GB test file created with dd, phone with 12GB RAM, UFS 4.0, no memory pressure.
MemTotal:       11583824 kB
MemFree:         2307972 kB
MemAvailable:    7287640 kB

Note: mtk_mm-uncached is the heap on our phone. You can test with system_heap
or another heap (it needs to support DMA_HEAP_IOCTL_ALLOC_AND_READ).

1. original
```shell
# create a model file
dd if=/dev/zero of=./model.txt bs=1M count=3072
# drop page cache
echo 3 > /proc/sys/vm/drop_caches
./dmabuf-heap-file-read mtk_mm-uncached normal
> result is total cost 2370513769ns
```

2. DMA_HEAP_IOCTL_ALLOC_AND_READ O_DIRECT
```shell
# create a model file
dd if=/dev/zero of=./model.txt bs=1M count=3072
# drop page cache
echo 3 > /proc/sys/vm/drop_caches
./dmabuf-heap-file-read mtk_mm-uncached direct_io
> result is total cost 1269239770ns
# use direct_io_check to verify that the dma-buf content matches the file.
```

3. DMA_HEAP_IOCTL_ALLOC_AND_READ buffered I/O
```shell
# create a model file
dd if=/dev/zero of=./model.txt bs=1M count=3072
# drop page cache
echo 3 > /proc/sys/vm/drop_caches
./dmabuf-heap-file-read mtk_mm-uncached normal_io
> result is total cost 2268621769ns
```

------------------
3GB test file created with dd, phone with 12GB RAM, UFS 4.0, 4GB of memory
pressure from stressapptest.

1. original
```shell
# create a model file
dd if=/dev/zero of=./model.txt bs=1M count=3072
# drop page cache
echo 3 > /proc/sys/vm/drop_caches
./dmabuf-heap-file-read mtk_mm-uncached normal
> result is total cost 13087213847ns
```

2. DMA_HEAP_IOCTL_ALLOC_AND_READ O_DIRECT
```shell
# create a model file
dd if=/dev/zero of=./model.txt bs=1M count=3072
# drop page cache
echo 3 > /proc/sys/vm/drop_caches
./dmabuf-heap-file-read mtk_mm-uncached direct_io
> result is total cost 2902386846ns
# use direct_io_check to verify that the dma-buf content matches the file.
```

3. DMA_HEAP_IOCTL_ALLOC_AND_READ buffered I/O
```shell
# create a model file
dd if=/dev/zero of=./model.txt bs=1M count=3072
# drop page cache
echo 3 > /proc/sys/vm/drop_caches
./dmabuf-heap-file-read mtk_mm-uncached normal_io
> result is total cost 5735579385ns
```



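As the results show, without memory pressure O_DIRECT cuts the load time by
about 46% (2.37s to 1.27s), while buffered I/O improves it only slightly
(about 4%).
Under memory pressure the gap becomes more significant: O_DIRECT cuts the load
time by about 78% (13.09s to 2.90s) and buffered I/O by about 56% (13.09s to
5.74s).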
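
The improvement can be re-derived from the raw "total cost" numbers above. A
small sketch of the arithmetic (`time_saved_pct` is a hypothetical helper
name, used only for this illustration):

```c
#include <stdint.h>
#include <stdio.h>

/* Percentage of load time saved relative to the baseline run. */
double time_saved_pct(uint64_t baseline_ns, uint64_t new_ns)
{
    return 100.0 * (1.0 - (double)new_ns / (double)baseline_ns);
}

/* Print the savings for the four ALLOC_AND_READ runs reported above. */
void print_time_saved(void)
{
    /* "original" runs: no memory pressure, then stressapptest pressure */
    const uint64_t base_idle = 2370513769ULL;
    const uint64_t base_pressure = 13087213847ULL;

    printf("O_DIRECT, no pressure: %.1f%%\n",
           time_saved_pct(base_idle, 1269239770ULL));
    printf("buffered, no pressure: %.1f%%\n",
           time_saved_pct(base_idle, 2268621769ULL));
    printf("O_DIRECT, pressure:    %.1f%%\n",
           time_saved_pct(base_pressure, 2902386846ULL));
    printf("buffered, pressure:    %.1f%%\n",
           time_saved_pct(base_pressure, 5735579385ULL));
}
```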

Here is the test program, which you can build yourself.
```c
#define _GNU_SOURCE /* needed for O_DIRECT */
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

#include <linux/dma-buf.h>
#include <linux/dma-heap.h>

#define HEAP_DEVPATH "/dev/dma_heap"

#define TEST_FILE "./model.txt"


enum {
    NORMAL_DMABUF_TEST,
    NORMAL_IO_DMABUF_TEST,
    DIRECT_IO_DMABUF_TEST,
    DIRECT_IO_DMABUF_CHECK_TEST,
};

/* Local assert: always evaluated, safe to use as a single statement. */
#define assert(as)                                 \
	do {                                       \
		if (!(as)) {                       \
			printf("%s is failed\n", #as); \
			exit(-1);                  \
		}                                  \
	} while (0)

int dmabuf_heap_open(char* name) {
    int ret, fd;
    char buf[256];

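    /* snprintf bounds the write; also reject a truncated path. */
    ret = snprintf(buf, sizeof(buf), "%s/%s", HEAP_DEVPATH, name);
    if (ret < 0 || ret >= (int)sizeof(buf)) {
        printf("snprintf failed!\n");
        return -1;
    }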

    fd = open(buf, O_RDWR);
    if (fd < 0) printf("open %s failed!\n", buf);
    return fd;
}

int dmabuf_heap_alloc_read_file(int heap_fd, int file_fd, unsigned int flags,
                                int* dmabuf_fd) {
    struct dma_heap_allocation_file_data data = {
            .file_fd = file_fd,
            .fd_flags = O_RDWR | O_CLOEXEC,
            .heap_flags = flags,
    };
    int ret;

    if (dmabuf_fd == NULL) return -EINVAL;

    ret = ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC_AND_READ, &data);
    if (ret < 0) return ret;
    *dmabuf_fd = (int)data.fd;
    return ret;
}


int dmabuf_heap_alloc(int fd, size_t len, unsigned int flags, int* dmabuf_fd) {
    struct dma_heap_allocation_data data = {
            .len = len,
            .fd_flags = O_RDWR | O_CLOEXEC,
            .heap_flags = flags,
    };
    int ret;

    if (dmabuf_fd == NULL) return -EINVAL;

    ret = ioctl(fd, DMA_HEAP_IOCTL_ALLOC, &data);
    if (ret < 0) return ret;
    *dmabuf_fd = (int)data.fd;
    return ret;
}

void dmabuf_heap_test(int type, char *heap_name) {
    int heapfd = dmabuf_heap_open(heap_name);
    assert(heapfd >= 0);

    if (type == NORMAL_DMABUF_TEST) {
        int file_fd = open(TEST_FILE, O_RDONLY);
        assert(file_fd >= 0);
        unsigned long fsize;
        int dma_buf_fd = -1; /* stays -1 if the allocation fails */
        struct stat ftat;
        assert(fstat(file_fd, &ftat) == 0);
        fsize = ftat.st_size;

        dmabuf_heap_alloc(heapfd, fsize, 0, &dma_buf_fd);
        assert(dma_buf_fd >= 0);

        void *file_addr = mmap(NULL, fsize, PROT_READ, MAP_SHARED, file_fd, 0);
        assert(file_addr != MAP_FAILED);
        void *dma_buf_addr = mmap(NULL, fsize, PROT_WRITE, MAP_SHARED,
                                  dma_buf_fd, 0);
        assert(dma_buf_addr != MAP_FAILED);

        memcpy(dma_buf_addr, file_addr, fsize);

        munmap(file_addr, fsize);
        munmap(dma_buf_addr, fsize);
        close(file_fd);
        close(dma_buf_fd);
    } else {
        int file_fd;
        if (type == NORMAL_IO_DMABUF_TEST)
            file_fd = open(TEST_FILE, O_RDONLY);
        else
            file_fd = open(TEST_FILE, O_RDONLY | O_DIRECT);
        assert(file_fd >= 0);
        int dma_buf_fd = -1; /* stays -1 if the ioctl fails */

        dmabuf_heap_alloc_read_file(heapfd, file_fd, 0, &dma_buf_fd);
        assert(dma_buf_fd >= 0);

        if (type == DIRECT_IO_DMABUF_CHECK_TEST) {
            struct stat ftat;
            assert(fstat(file_fd, &ftat) == 0);
            unsigned long size = ftat.st_size;

            /* mmap reports failure with MAP_FAILED, not NULL. */
            char *dmabuf_addr = (char *)mmap(NULL, size, PROT_READ,
                    MAP_SHARED, dma_buf_fd, 0);
            assert(dmabuf_addr != MAP_FAILED);
            char *file_addr = (char *)mmap(NULL, size, PROT_READ, MAP_SHARED,
                    file_fd, 0);
            assert(file_addr != MAP_FAILED);

            /* Compare page by page; the last chunk may be shorter than 4096. */
            unsigned long i = 0;
            for (; i < size; i += 4096) {
                unsigned long chunk = size - i < 4096 ? size - i : 4096;
                if (memcmp(&dmabuf_addr[i], &file_addr[i], chunk) != 0)
                    printf("cur %lu comp size %lu\n", i, size);
                assert(memcmp(&dmabuf_addr[i], &file_addr[i], chunk) == 0);
            }
            munmap(dmabuf_addr, size);
            munmap(file_addr, size);
        }
        close(file_fd);
        close(dma_buf_fd);
    }
    close(heapfd);
}

int main(int argc, char* argv[]) {
    char* dmabuf_heap_name;
    char* type_name;
    int type;
    struct timespec ts_start, ts_end;
    long long start, end;
    if (argc < 3) {
        printf("usage: dmabuf-heap-file-read <heap name> "
               "<normal|normal_io|direct_io|direct_io_check>\n");
        return -1;
    }

    dmabuf_heap_name = argv[1];
    type_name = argv[2];

    if (strcmp(type_name, "normal") == 0)
        type = NORMAL_DMABUF_TEST;
    else if (strcmp(type_name, "direct_io") == 0)
        type = DIRECT_IO_DMABUF_TEST;
    else if (strcmp(type_name, "direct_io_check") == 0)
        type = DIRECT_IO_DMABUF_CHECK_TEST;
    else if (strcmp(type_name, "normal_io") == 0)
        type = NORMAL_IO_DMABUF_TEST;
    else
        exit(-1);

    printf("Testing dmabuf %s", dmabuf_heap_name);

    printf("\n---------------------------------------------\n");
    clock_gettime(CLOCK_MONOTONIC, &ts_start);
    dmabuf_heap_test(type, dmabuf_heap_name);
    clock_gettime(CLOCK_MONOTONIC, &ts_end);
    /* LL suffix keeps the multiplication in 64 bits even if time_t is 32-bit. */
    start = ts_start.tv_sec * 1000000000LL + ts_start.tv_nsec;
    end = ts_end.tv_sec * 1000000000LL + ts_end.tv_nsec;

    printf("total cost %lldns\n", end - start);

    return 0;
}
```

Huan Yang (2):
  dma-buf: heaps: DMA_HEAP_IOCTL_ALLOC_READ_FILE framework
  dma-buf: heaps: system_heap support DMA_HEAP_IOCTL_ALLOC_AND_READ

 drivers/dma-buf/dma-heap.c          | 525 +++++++++++++++++++++++++++-
 drivers/dma-buf/heaps/system_heap.c |  53 ++-
 include/linux/dma-heap.h            |  57 ++-
 include/uapi/linux/dma-heap.h       |  32 ++
 4 files changed, 660 insertions(+), 7 deletions(-)


base-commit: 523b23f0bee3014a7a752c9bb9f5c54f0eddae88
--
2.45.2