[0/2] x86/sgx: Fix two data races in EAUG/EREMOVE flows

Message ID	20240429104330.3636113-1-dmitrii.kuvaiskii@intel.com (mailing list archive)
Headers	show Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.15]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 94F9B3987C; Mon, 29 Apr 2024 10:51:52 +0000 (UTC) From: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com> To: dave.hansen@linux.intel.com, jarkko@kernel.org, kai.huang@intel.com, haitao.huang@linux.intel.com, reinette.chatre@intel.com, linux-sgx@vger.kernel.org, linux-kernel@vger.kernel.org Cc: mona.vij@intel.com, kailun.qin@intel.com Subject: [PATCH 0/2] x86/sgx: Fix two data races in EAUG/EREMOVE flows Date: Mon, 29 Apr 2024 03:43:28 -0700 Message-Id: <20240429104330.3636113-1-dmitrii.kuvaiskii@intel.com> Precedence: bulk MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Organization: Intel Deutschland GmbH - Registered Address: Am Campeon 10, 85579 Neubiberg, Germany Content-Transfer-Encoding: 8bit
Series	x86/sgx: Fix two data races in EAUG/EREMOVE flows \| expand [0/2] x86/sgx: Fix two data races in EAUG/EREMOVE flows [1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS [2/2] x86/sgx: Resolve EREMOVE page vs EAUG page data race

Message ID

20240429104330.3636113-1-dmitrii.kuvaiskii@intel.com (mailing list archive)

Headers

From: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
To: dave.hansen@linux.intel.com,
	jarkko@kernel.org,
	kai.huang@intel.com,
	haitao.huang@linux.intel.com,
	reinette.chatre@intel.com,
	linux-sgx@vger.kernel.org,
	linux-kernel@vger.kernel.org
Cc: mona.vij@intel.com,
	kailun.qin@intel.com
Subject: [PATCH 0/2] x86/sgx: Fix two data races in EAUG/EREMOVE flows
Date: Mon, 29 Apr 2024 03:43:28 -0700
Message-Id: <20240429104330.3636113-1-dmitrii.kuvaiskii@intel.com>
Precedence: bulk
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Organization: Intel Deutschland GmbH - Registered Address: Am Campeon 10,
 85579 Neubiberg, Germany
Content-Transfer-Encoding: 8bit

Series

x86/sgx: Fix two data races in EAUG/EREMOVE flows | expand

Message

Dmitrii Kuvaiskii April 29, 2024, 10:43 a.m. UTC

SGX runtimes such as Gramine may implement EDMM-based lazy allocation of
enclave pages and may support MADV_DONTNEED semantics [1]. The former
implies #PF-based page allocation, and the latter implies the usage of
SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl.

A trivial program like below (run under Gramine and with EDMM enabled)
stresses these two flows in the SGX driver and hangs:

/* repeatedly touch different enclave pages at random and mix with
 * `madvise(MADV_DONTNEED)` to stress EAUG/EREMOVE flows */
static void* thread_func(void* arg) {
    size_t num_pages = 0xA000 / page_size;
    for (int i = 0; i < 5000; i++) {
        size_t page = get_random_ulong() % num_pages;
        char data = READ_ONCE(((char*)arg)[page * page_size]);

        page = get_random_ulong() % num_pages;
        madvise(arg + page * page_size, page_size, MADV_DONTNEED);
    }
}

addr = mmap(NULL, 0xA000, PROT_READ | PROT_WRITE, MAP_ANONYMOUS, -1, 0);
pthread_t threads[16];
for (int i = 0; i < 16; i++)
    pthread_create(&threads[i], NULL, thread_func, addr);

This program uncovers two data races in the SGX driver. The remaining
patches describe and fix these races.

I performed several stress tests to verify that there are no other data
races (at least with the test program above):

- On Icelake server with 128GB of PRMRR (EPC), without madvise(). This
  stresses the first data race. A Gramine SGX test suite running in the
  background for additional stressing. Result: 1,000 runs without hangs
  (result without the first bug fix: hangs every time).
- On Icelake server with 128GB of PRMRR (EPC), with madvise(). This
  stresses the second data race. A Gramine SGX test suite running in the
  background for additional stressing. Result: 1,000 runs without hangs
  (result with the first bug fix but without the second bug fix: hangs
  approx. once in 50 runs).
- On Icelake server with 4GB of PRMRR (EPC), with madvise(). This
  additionally stresses the enclave page swapping flows. Two Gramine SGX
  test suites running in the background for additional stressing of
  swapping (I observe 100% CPU utilization from ksgxd which confirms that
  swapping happens). Result: 1,000 runs without hangs.

(Sorry for the previous copy of this email, accidentally sent to
stable@vger.kernel.org. Failed to use `--suppress-cc` during a test send.)

Dmitrii Kuvaiskii (2):
  x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
  x86/sgx: Resolve EREMOVE page vs EAUG page data race

 arch/x86/kernel/cpu/sgx/encl.c  | 10 +++++++---
 arch/x86/kernel/cpu/sgx/encl.h  |  3 +++
 arch/x86/kernel/cpu/sgx/ioctl.c |  1 +
 3 files changed, 11 insertions(+), 3 deletions(-)

Comments

Jarkko Sakkinen April 29, 2024, 1:06 p.m. UTC | #1

On Mon Apr 29, 2024 at 1:43 PM EEST, Dmitrii Kuvaiskii wrote:
> SGX runtimes such as Gramine may implement EDMM-based lazy allocation of
> enclave pages and may support MADV_DONTNEED semantics [1]. The former
> implies #PF-based page allocation, and the latter implies the usage of
> SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl.
>
> A trivial program like below (run under Gramine and with EDMM enabled)
> stresses these two flows in the SGX driver and hangs:
>
> /* repeatedly touch different enclave pages at random and mix with
>  * `madvise(MADV_DONTNEED)` to stress EAUG/EREMOVE flows */
> static void* thread_func(void* arg) {
>     size_t num_pages = 0xA000 / page_size;
>     for (int i = 0; i < 5000; i++) {
>         size_t page = get_random_ulong() % num_pages;
>         char data = READ_ONCE(((char*)arg)[page * page_size]);
>
>         page = get_random_ulong() % num_pages;
>         madvise(arg + page * page_size, page_size, MADV_DONTNEED);
>     }
> }
>
> addr = mmap(NULL, 0xA000, PROT_READ | PROT_WRITE, MAP_ANONYMOUS, -1, 0);
> pthread_t threads[16];
> for (int i = 0; i < 16; i++)
>     pthread_create(&threads[i], NULL, thread_func, addr);

I'm not convinced that kernel is the problem here but it could be also
how Gramine is implemented.

So maybe you could make a better case of that. The example looks a bit
artificial to me.

>
> This program uncovers two data races in the SGX driver. The remaining
> patches describe and fix these races.
>
> I performed several stress tests to verify that there are no other data
> races (at least with the test program above):
>
> - On Icelake server with 128GB of PRMRR (EPC), without madvise(). This
>   stresses the first data race. A Gramine SGX test suite running in the
>   background for additional stressing. Result: 1,000 runs without hangs
>   (result without the first bug fix: hangs every time).
> - On Icelake server with 128GB of PRMRR (EPC), with madvise(). This
>   stresses the second data race. A Gramine SGX test suite running in the
>   background for additional stressing. Result: 1,000 runs without hangs
>   (result with the first bug fix but without the second bug fix: hangs
>   approx. once in 50 runs).
> - On Icelake server with 4GB of PRMRR (EPC), with madvise(). This
>   additionally stresses the enclave page swapping flows. Two Gramine SGX
>   test suites running in the background for additional stressing of
>   swapping (I observe 100% CPU utilization from ksgxd which confirms that
>   swapping happens). Result: 1,000 runs without hangs.
>
> (Sorry for the previous copy of this email, accidentally sent to
> stable@vger.kernel.org. Failed to use `--suppress-cc` during a test send.)
>
> Dmitrii Kuvaiskii (2):
>   x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
>   x86/sgx: Resolve EREMOVE page vs EAUG page data race
>
>  arch/x86/kernel/cpu/sgx/encl.c  | 10 +++++++---
>  arch/x86/kernel/cpu/sgx/encl.h  |  3 +++
>  arch/x86/kernel/cpu/sgx/ioctl.c |  1 +
>  3 files changed, 11 insertions(+), 3 deletions(-)

BR, Jarkko

Dmitrii Kuvaiskii April 30, 2024, 2:35 p.m. UTC | #2

On Mon, Apr 29, 2024 at 04:06:39PM +0300, Jarkko Sakkinen wrote:
> On Mon Apr 29, 2024 at 1:43 PM EEST, Dmitrii Kuvaiskii wrote:
> > SGX runtimes such as Gramine may implement EDMM-based lazy allocation of
> > enclave pages and may support MADV_DONTNEED semantics [1]. The former
> > implies #PF-based page allocation, and the latter implies the usage of
> > SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl.
> >
> > A trivial program like below (run under Gramine and with EDMM enabled)
> > stresses these two flows in the SGX driver and hangs:
> >
> > /* repeatedly touch different enclave pages at random and mix with
> >  * `madvise(MADV_DONTNEED)` to stress EAUG/EREMOVE flows */
> > static void* thread_func(void* arg) {
> >     size_t num_pages = 0xA000 / page_size;
> >     for (int i = 0; i < 5000; i++) {
> >         size_t page = get_random_ulong() % num_pages;
> >         char data = READ_ONCE(((char*)arg)[page * page_size]);
> >
> >         page = get_random_ulong() % num_pages;
> >         madvise(arg + page * page_size, page_size, MADV_DONTNEED);
> >     }
> > }
> >
> > addr = mmap(NULL, 0xA000, PROT_READ | PROT_WRITE, MAP_ANONYMOUS, -1, 0);
> > pthread_t threads[16];
> > for (int i = 0; i < 16; i++)
> >     pthread_create(&threads[i], NULL, thread_func, addr);
> 
> I'm not convinced that kernel is the problem here but it could be also
> how Gramine is implemented.
> 
> So maybe you could make a better case of that. The example looks a bit
> artificial to me.

I believe that these are the bugs in the kernel (in the SGX driver). I
provided more detailed descriptions of the races and ensuing bugs in the
other two replies, please check them.

The example is a stress test written to debug very infrequent hangs of
real-world applications that are run with Gramine, EDMM, and two
optimizations (lazy allocation and MADV_DONTNEED semantics). We observed
hangs of Node.js, PyTorch, R, iperf, Blender, Nginx. To root cause these
hangs, we wrote this artificial stress test. This test succeeds on vanilla
Linux, so ideally it should also pass on Gramine.

Please also note that the optimizations of lazy allocation and
MADV_DONTNEED provide significant performance improvement for some
workloads that run on Gramine. For example, a Java workload with a 16GB
enclave size has approx. 57x improvement in total runtime. Thus, we
consider it important to permit these optimizations in Gramine, which IIUC
requires bug fixes in the SGX driver.

You can find more info at
https://github.com/gramineproject/gramine/pull/1513.

Which parts do you consider artificial, and how could I modify the stress
test?

--
Dmitrii Kuvaiskii