
[RFC,2/2] OOM, PM: make OOM detection in the freezer path raceless

Message ID 1416345006-8284-2-git-send-email-mhocko@suse.cz (mailing list archive)
State RFC, archived

Commit Message

Michal Hocko Nov. 18, 2014, 9:10 p.m. UTC
Commit 5695be142e20 (OOM, PM: OOM killed task shouldn't escape PM suspend)
has left a race window open: the OOM killer may still manage to
note_oom_kill after freeze_processes has checked the counter. The race
window is quite small and really unlikely to be hit, so a partial solution
was deemed sufficient at the time of submission.

Tejun wasn't happy with this partial solution, though, and insisted on a
full one. That requires full exclusion between the OOM killer and the
freezer's task freezing. This is what this patch does: it introduces an
oom_sem RW semaphore and turns oom_killer_disable() into a full OOM
barrier.

oom_killer_disabled is now checked at the out_of_memory level, which takes
the lock for reading. This also means that the page fault path is covered
now as well, although it was assumed to be safe before. As per Tejun, "We
used to have freezing points deep in file system code which may be
reachable from page fault.", so it is better and more robust not to rely
on freezing points here. The same applies to the memcg OOM killer.

out_of_memory now tells the caller whether the OOM killer was allowed to
trigger, and the callers are supposed to handle the situation. The page
allocation path simply fails the allocation, same as before. The page
fault path will keep retrying the fault until the freezer fails, and the
Sysrq OOM trigger will simply complain to the log.

oom_killer_disable takes oom_sem for writing and, after it disables
further OOM killer invocations, it checks for any OOM victims which are
still alive (because they haven't woken up to handle the pending signal
yet). Victims are counted via {un}mark_tsk_oom_victim. The last victim
signals the completion via oom_victims_wait, on which oom_killer_disable()
waits if it sees a non-zero oom_victims count. This is safe because
mark_tsk_oom_victim cannot be called after oom_killer_disabled is set,
unmark_tsk_oom_victim signals the completion only for the last OOM victim
while OOM is disabled, and oom_killer_disable waits for the completion
only if there was at least one victim at the time it disabled the OOM
killer.
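Condensed, the disable side of the handshake (see the diff below) looks
like this:

bool oom_killer_disable(void)
{
	int count;

	/* make sure we do not race with an ongoing OOM kill */
	down_write(&oom_sem);
	if (!test_tsk_thread_flag(current, TIF_MEMDIE))
		oom_killer_disabled = true;
	count = atomic_read(&oom_victims);
	up_write(&oom_sem);

	/* wait for the victims which are already on their way out */
	if (count && oom_killer_disabled)
		wait_for_completion(&oom_victims_wait);

	return oom_killer_disabled;
}

and the last exiting victim wakes it up from unmark_tsk_oom_victim via
complete(&oom_victims_wait).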

As oom_killer_disable is a full OOM barrier now, we can postpone it until
after all freezable tasks have been frozen by the PM freezer. This reduces
the time during which the OOM killer is put out of order and so reduces
the chances of misbehavior due to unexpected allocation failures.

TODO:
The Android low memory killer abuses mark_tsk_oom_victim in lowmem_scan
and has to learn about the oom_disable logic as well.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 drivers/tty/sysrq.c    |  6 ++--
 include/linux/oom.h    | 26 ++++++++------
 kernel/power/process.c | 60 +++++++++-----------------------
 mm/memcontrol.c        |  4 ++-
 mm/oom_kill.c          | 94 +++++++++++++++++++++++++++++++++++++++++---------
 mm/page_alloc.c        | 32 ++++++++---------
 6 files changed, 132 insertions(+), 90 deletions(-)

Comments

Rafael J. Wysocki Nov. 27, 2014, 12:47 a.m. UTC | #1
On Tuesday, November 18, 2014 10:10:06 PM Michal Hocko wrote:
> 5695be142e20 (OOM, PM: OOM killed task shouldn't escape PM suspend)
> has left a race window when OOM killer manages to note_oom_kill after
> freeze_processes checks the counter. The race window is quite small and
> really unlikely and partial solution deemed sufficient at the time of
> submission.
> 
> Tejun wasn't happy about this partial solution though and insisted on a
> full solution. That requires the full OOM and freezer's task freezing
> exclusion, though. This is done by this patch which introduces oom_sem
> RW lock and turns oom_killer_disable() into a full OOM barrier.
> 
> oom_killer_disabled is now checked at out_of_memory level which takes
> the lock for reading. This also means that the page fault path is
> covered now as well although it was assumed to be safe before. As per
> Tejun, "We used to have freezing points deep in file system code which
> may be reacheable from page fault." so it would be better and more
> robust to not rely on freezing points here. Same applies to the memcg
> OOM killer.
> 
> out_of_memory tells the caller whether the OOM was allowed to
> trigger and the callers are supposed to handle the situation. The page
> allocation path simply fails the allocation same as before. The page
> fault path will be retrying the fault until the freezer fails and Sysrq
> OOM trigger will simply complain to the log.
> 
> oom_killer_disable takes oom_sem for writing and after it disables
> further OOM killer invocations it checks for any OOM victims which
> are still alive (because they haven't woken up to handle the pending
> signal). Victims are counted via {un}mark_tsk_oom_victim. The
> last victim signals the completion via oom_victims_wait on which
> oom_killer_disable() waits if it sees non zero oom_victims.
> This is safe against both mark_tsk_oom_victim which cannot be called
> after oom_killer_disabled is set and unmark_tsk_oom_victim signals the
> completion only for the last oom_victim when oom is disabled and
> oom_killer_disable waits for completion only of there was at least one
> victim at the time it disabled the oom.
> 
> As oom_killer_disable is a full OOM barrier now we can postpone it to
> later after all freezable tasks are frozen during PM freezer. This
> reduces the time when OOM is put out order and so reduces chances of
> misbehavior due to unexpected allocation failures.
> 
> TODO:
> Android lowmemory killer abuses mark_tsk_oom_victim in lowmem_scan
> and it has to learn about oom_disable logic as well.
> 
> Suggested-by: Tejun Heo <tj@kernel.org>
> Signed-off-by: Michal Hocko <mhocko@suse.cz>

This appears to do the right thing to me, although I admit I haven't checked
the details very carefully.

Tejun?

> ---
>  drivers/tty/sysrq.c    |  6 ++--
>  include/linux/oom.h    | 26 ++++++++------
>  kernel/power/process.c | 60 +++++++++-----------------------
>  mm/memcontrol.c        |  4 ++-
>  mm/oom_kill.c          | 94 +++++++++++++++++++++++++++++++++++++++++---------
>  mm/page_alloc.c        | 32 ++++++++---------
>  6 files changed, 132 insertions(+), 90 deletions(-)
> 
> diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
> index 42bad18c66c9..6818589c1004 100644
> --- a/drivers/tty/sysrq.c
> +++ b/drivers/tty/sysrq.c
> @@ -355,8 +355,10 @@ static struct sysrq_key_op sysrq_term_op = {
>  
>  static void moom_callback(struct work_struct *ignored)
>  {
> -	out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), GFP_KERNEL,
> -		      0, NULL, true);
> +	if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL),
> +			   GFP_KERNEL, 0, NULL, true)) {
> +		printk(KERN_INFO "OOM request ignored because killer is disabled\n");
> +	}
>  }
>  
>  static DECLARE_WORK(moom_work, moom_callback);
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 8f7e74f8ab3a..d802575c9307 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -72,22 +72,26 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
>  		unsigned long totalpages, const nodemask_t *nodemask,
>  		bool force_kill);
>  
> -extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> +extern bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  		int order, nodemask_t *mask, bool force_kill);
>  extern int register_oom_notifier(struct notifier_block *nb);
>  extern int unregister_oom_notifier(struct notifier_block *nb);
>  
> -extern bool oom_killer_disabled;
> -
> -static inline void oom_killer_disable(void)
> -{
> -	oom_killer_disabled = true;
> -}
> +/**
> + * oom_killer_disable - disable OOM killer
> + *
> + * Forces all page allocations to fail rather than trigger OOM killer.
> + * Will block and wait until all OOM victims are dead.
> + *
> + * Returns true if successfull and false if the OOM killer cannot be
> + * disabled.
> + */
> +extern bool oom_killer_disable(void);
>  
> -static inline void oom_killer_enable(void)
> -{
> -	oom_killer_disabled = false;
> -}
> +/**
> + * oom_killer_enable - enable OOM killer
> + */
> +extern void oom_killer_enable(void);
>  
>  static inline bool oom_gfp_allowed(gfp_t gfp_mask)
>  {
> diff --git a/kernel/power/process.c b/kernel/power/process.c
> index 5a6ec8678b9a..a4306e39f35c 100644
> --- a/kernel/power/process.c
> +++ b/kernel/power/process.c
> @@ -108,30 +108,6 @@ static int try_to_freeze_tasks(bool user_only)
>  	return todo ? -EBUSY : 0;
>  }
>  
> -static bool __check_frozen_processes(void)
> -{
> -	struct task_struct *g, *p;
> -
> -	for_each_process_thread(g, p)
> -		if (p != current && !freezer_should_skip(p) && !frozen(p))
> -			return false;
> -
> -	return true;
> -}
> -
> -/*
> - * Returns true if all freezable tasks (except for current) are frozen already
> - */
> -static bool check_frozen_processes(void)
> -{
> -	bool ret;
> -
> -	read_lock(&tasklist_lock);
> -	ret = __check_frozen_processes();
> -	read_unlock(&tasklist_lock);
> -	return ret;
> -}
> -
>  /**
>   * freeze_processes - Signal user space processes to enter the refrigerator.
>   * The current thread will not be frozen.  The same process that calls
> @@ -142,7 +118,6 @@ static bool check_frozen_processes(void)
>  int freeze_processes(void)
>  {
>  	int error;
> -	int oom_kills_saved;
>  
>  	error = __usermodehelper_disable(UMH_FREEZING);
>  	if (error)
> @@ -157,27 +132,11 @@ int freeze_processes(void)
>  	pm_wakeup_clear();
>  	printk("Freezing user space processes ... ");
>  	pm_freezing = true;
> -	oom_kills_saved = oom_kills_count();
>  	error = try_to_freeze_tasks(true);
>  	if (!error) {
>  		__usermodehelper_set_disable_depth(UMH_DISABLED);
> -		oom_killer_disable();
> -
> -		/*
> -		 * There might have been an OOM kill while we were
> -		 * freezing tasks and the killed task might be still
> -		 * on the way out so we have to double check for race.
> -		 */
> -		if (oom_kills_count() != oom_kills_saved &&
> -		    !check_frozen_processes()) {
> -			__usermodehelper_set_disable_depth(UMH_ENABLED);
> -			printk("OOM in progress.");
> -			error = -EBUSY;
> -		} else {
> -			printk("done.");
> -		}
> +		printk("done.\n");
>  	}
> -	printk("\n");
>  	BUG_ON(in_atomic());
>  
>  	if (error)
> @@ -206,6 +165,18 @@ int freeze_kernel_threads(void)
>  	printk("\n");
>  	BUG_ON(in_atomic());
>  
> +	/*
> +	 * Now that everything freezable is handled we need to disbale
> +	 * the OOM killer to disallow any further interference with
> +	 * killable tasks.
> +	 */
> +	printk("Disabling OOM killer ... ");
> +	if (!oom_killer_disable()) {
> +		printk("failed.\n");
> +		error = -EAGAIN;
> +	} else
> +		printk("done.\n");
> +
>  	if (error)
>  		thaw_kernel_threads();
>  	return error;
> @@ -222,8 +193,6 @@ void thaw_processes(void)
>  	pm_freezing = false;
>  	pm_nosig_freezing = false;
>  
> -	oom_killer_enable();
> -
>  	printk("Restarting tasks ... ");
>  
>  	__usermodehelper_set_disable_depth(UMH_FREEZING);
> @@ -251,6 +220,9 @@ void thaw_kernel_threads(void)
>  {
>  	struct task_struct *g, *p;
>  
> +	printk("Enabling OOM killer again.\n");
> +	oom_killer_enable();
> +
>  	pm_nosig_freezing = false;
>  	printk("Restarting kernel threads ... ");
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 302e0fc6d121..34bcbb053132 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2128,6 +2128,8 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
>  	current->memcg_oom.order = order;
>  }
>  
> +extern bool oom_killer_disabled;
> +
>  /**
>   * mem_cgroup_oom_synchronize - complete memcg OOM handling
>   * @handle: actually kill/wait or just clean up the OOM state
> @@ -2155,7 +2157,7 @@ bool mem_cgroup_oom_synchronize(bool handle)
>  	if (!memcg)
>  		return false;
>  
> -	if (!handle)
> +	if (!handle || oom_killer_disabled)
>  		goto cleanup;
>  
>  	owait.memcg = memcg;
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 8b6e14136f4f..b3ccd92bc6dc 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -405,30 +405,63 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
>  }
>  
>  /*
> - * Number of OOM killer invocations (including memcg OOM killer).
> - * Primarily used by PM freezer to check for potential races with
> - * OOM killed frozen task.
> + * Number of OOM victims in flight
>   */
> -static atomic_t oom_kills = ATOMIC_INIT(0);
> +static atomic_t oom_victims = ATOMIC_INIT(0);
> +static DECLARE_COMPLETION(oom_victims_wait);
>  
> -int oom_kills_count(void)
> +bool oom_killer_disabled __read_mostly;
> +static DECLARE_RWSEM(oom_sem);
> +
> +void mark_tsk_oom_victim(struct task_struct *tsk)
>  {
> -	return atomic_read(&oom_kills);
> +	BUG_ON(oom_killer_disabled);
> +	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
> +		return;
> +	atomic_inc(&oom_victims);
>  }
>  
> -void note_oom_kill(void)
> +void unmark_tsk_oom_victim(struct task_struct *tsk)
>  {
> -	atomic_inc(&oom_kills);
> +	int count;
> +
> +	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
> +		return;
> +
> +	down_read(&oom_sem);
> +	/*
> +	 * There is no need to signal the lasst oom_victim if there
> +	 * is nobody who cares.
> +	 */
> +	if (!atomic_dec_return(&oom_victims) && oom_killer_disabled)
> +		complete(&oom_victims_wait);
> +	up_read(&oom_sem);
>  }
>  
> -void mark_tsk_oom_victim(struct task_struct *tsk)
> +bool oom_killer_disable(void)
>  {
> -	set_tsk_thread_flag(tsk, TIF_MEMDIE);
> +	/*
> +	 * Make sure to not race with an ongoing OOM killer
> +	 * and that the current is not the victim.
> +	 */
> +	down_write(&oom_sem);
> +	if (!test_tsk_thread_flag(current, TIF_MEMDIE))
> +		oom_killer_disabled = true;
> +
> +	count = atomic_read(&oom_victims);
> +	up_write(&oom_sem);
> +
> +	if (count && oom_killer_disabled)
> +		wait_for_completion(&oom_victims_wait);
> +
> +	return oom_killer_disabled;
>  }
>  
> -void unmark_tsk_oom_victim(struct task_struct *tsk)
> +void oom_killer_enable(void)
>  {
> -	clear_thread_flag(TIF_MEMDIE);
> +	down_write(&oom_sem);
> +	oom_killer_disabled = false;
> +	up_write(&oom_sem);
>  }
>  
>  #define K(x) ((x) << (PAGE_SHIFT-10))
> @@ -626,7 +659,7 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
>  }
>  
>  /**
> - * out_of_memory - kill the "best" process when we run out of memory
> + * __out_of_memory - kill the "best" process when we run out of memory
>   * @zonelist: zonelist pointer
>   * @gfp_mask: memory allocation flags
>   * @order: amount of memory being requested as a power of 2
> @@ -638,7 +671,7 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
>   * OR try to be smart about which process to kill. Note that we
>   * don't have to be perfect here, we just have to be good.
>   */
> -void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> +static void __out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  		int order, nodemask_t *nodemask, bool force_kill)
>  {
>  	const nodemask_t *mpol_mask;
> @@ -703,6 +736,31 @@ out:
>  		schedule_timeout_killable(1);
>  }
>  
> +/** out_of_memory -  tries to invoke OOM killer.
> + * @zonelist: zonelist pointer
> + * @gfp_mask: memory allocation flags
> + * @order: amount of memory being requested as a power of 2
> + * @nodemask: nodemask passed to page allocator
> + * @force_kill: true if a task must be killed, even if others are exiting
> + *
> + * invokes __out_of_memory if the OOM is not disabled by oom_killer_disable()
> + * when it returns false. Otherwise returns true.
> + */
> +bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> +		int order, nodemask_t *nodemask, bool force_kill)
> +{
> +	bool ret = false;
> +
> +	down_read(&oom_sem);
> +	if (!oom_killer_disabled) {
> +		__out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill);
> +		ret = true;
> +	}
> +	up_read(&oom_sem);
> +
> +	return ret;
> +}
> +
>  /*
>   * The pagefault handler calls here because it is out of memory, so kill a
>   * memory-hogging task.  If any populated zone has ZONE_OOM_LOCKED set, a
> @@ -712,12 +770,16 @@ void pagefault_out_of_memory(void)
>  {
>  	struct zonelist *zonelist;
>  
> +	down_read(&oom_sem);
>  	if (mem_cgroup_oom_synchronize(true))
> -		return;
> +		goto unlock;
>  
>  	zonelist = node_zonelist(first_memory_node, GFP_KERNEL);
>  	if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) {
> -		out_of_memory(NULL, 0, 0, NULL, false);
> +		if (!oom_killer_disabled)
> +			__out_of_memory(NULL, 0, 0, NULL, false);
>  		oom_zonelist_unlock(zonelist, GFP_KERNEL);
>  	}
> +unlock:
> +	up_read(&oom_sem);
>  }
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9cd36b822444..d44d69aa7b70 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -242,8 +242,6 @@ void set_pageblock_migratetype(struct page *page, int migratetype)
>  					PB_migrate, PB_migrate_end);
>  }
>  
> -bool oom_killer_disabled __read_mostly;
> -
>  #ifdef CONFIG_DEBUG_VM
>  static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
>  {
> @@ -2241,10 +2239,11 @@ static inline struct page *
>  __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  	struct zonelist *zonelist, enum zone_type high_zoneidx,
>  	nodemask_t *nodemask, struct zone *preferred_zone,
> -	int classzone_idx, int migratetype)
> +	int classzone_idx, int migratetype, bool *oom_failed)
>  {
>  	struct page *page;
>  
> +	*oom_failed = false;
>  	/* Acquire the per-zone oom lock for each zone */
>  	if (!oom_zonelist_trylock(zonelist, gfp_mask)) {
>  		schedule_timeout_uninterruptible(1);
> @@ -2252,14 +2251,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  	}
>  
>  	/*
> -	 * PM-freezer should be notified that there might be an OOM killer on
> -	 * its way to kill and wake somebody up. This is too early and we might
> -	 * end up not killing anything but false positives are acceptable.
> -	 * See freeze_processes.
> -	 */
> -	note_oom_kill();
> -
> -	/*
>  	 * Go through the zonelist yet one more time, keep very high watermark
>  	 * here, this is only to catch a parallel oom killing, we must fail if
>  	 * we're still under heavy pressure.
> @@ -2289,8 +2280,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  			goto out;
>  	}
>  	/* Exhausted what can be done so it's blamo time */
> -	out_of_memory(zonelist, gfp_mask, order, nodemask, false);
> -
> +	if (!out_of_memory(zonelist, gfp_mask, order, nodemask, false))
> +		*oom_failed = true;
>  out:
>  	oom_zonelist_unlock(zonelist, gfp_mask);
>  	return page;
> @@ -2716,8 +2707,8 @@ rebalance:
>  	 */
>  	if (!did_some_progress) {
>  		if (oom_gfp_allowed(gfp_mask)) {
> -			if (oom_killer_disabled)
> -				goto nopage;
> +			bool oom_failed;
> +
>  			/* Coredumps can quickly deplete all memory reserves */
>  			if ((current->flags & PF_DUMPCORE) &&
>  			    !(gfp_mask & __GFP_NOFAIL))
> @@ -2725,10 +2716,19 @@ rebalance:
>  			page = __alloc_pages_may_oom(gfp_mask, order,
>  					zonelist, high_zoneidx,
>  					nodemask, preferred_zone,
> -					classzone_idx, migratetype);
> +					classzone_idx, migratetype,
> +					&oom_failed);
> +
>  			if (page)
>  				goto got_pg;
>  
> +			/*
> +			 * OOM killer might be disabled and then we have to
> +			 * fail the allocation
> +			 */
> +			if (oom_failed)
> +				goto nopage;
> +
>  			if (!(gfp_mask & __GFP_NOFAIL)) {
>  				/*
>  				 * The oom killer is not called for high-order
>
Tejun Heo Dec. 2, 2014, 10:08 p.m. UTC | #2
Hello, sorry about the delay.  Was on vacation.

Generally looks good to me.  Some comments below.

> @@ -355,8 +355,10 @@ static struct sysrq_key_op sysrq_term_op = {
>  
>  static void moom_callback(struct work_struct *ignored)
>  {
> -	out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), GFP_KERNEL,
> -		      0, NULL, true);
> +	if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL),
> +			   GFP_KERNEL, 0, NULL, true)) {
> +		printk(KERN_INFO "OOM request ignored because killer is disabled\n");
> +	}
>  }

CodingStyle line 157 says "Do not unnecessarily use braces where a
single statement will do.".
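I.e. just:

	if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL),
			   GFP_KERNEL, 0, NULL, true))
		printk(KERN_INFO "OOM request ignored because killer is disabled\n");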

> +/**
> + * oom_killer_disable - disable OOM killer
> + *
> + * Forces all page allocations to fail rather than trigger OOM killer.
> + * Will block and wait until all OOM victims are dead.
> + *
> + * Returns true if successfull and false if the OOM killer cannot be
> + * disabled.
> + */
> +extern bool oom_killer_disable(void);

And function comments usually go where the function body is, not where
the function is declared, no?

> @@ -157,27 +132,11 @@ int freeze_processes(void)
>  	pm_wakeup_clear();
>  	printk("Freezing user space processes ... ");
>  	pm_freezing = true;
> -	oom_kills_saved = oom_kills_count();
>  	error = try_to_freeze_tasks(true);
>  	if (!error) {
>  		__usermodehelper_set_disable_depth(UMH_DISABLED);
> -		oom_killer_disable();
> -
> -		/*
> -		 * There might have been an OOM kill while we were
> -		 * freezing tasks and the killed task might be still
> -		 * on the way out so we have to double check for race.
> -		 */
> -		if (oom_kills_count() != oom_kills_saved &&
> -		    !check_frozen_processes()) {
> -			__usermodehelper_set_disable_depth(UMH_ENABLED);
> -			printk("OOM in progress.");
> -			error = -EBUSY;
> -		} else {
> -			printk("done.");
> -		}
> +		printk("done.\n");

A delta but shouldn't it be pr_cont()?

...
> @@ -206,6 +165,18 @@ int freeze_kernel_threads(void)
>  	printk("\n");
>  	BUG_ON(in_atomic());
>  
> +	/*
> +	 * Now that everything freezable is handled we need to disbale
> +	 * the OOM killer to disallow any further interference with
> +	 * killable tasks.
> +	 */
> +	printk("Disabling OOM killer ... ");
> +	if (!oom_killer_disable()) {
> +		printk("failed.\n");
> +		error = -EAGAIN;
> +	} else
> +		printk("done.\n");

Ditto on pr_cont() and CodingStyle line 169 says "This does not apply
if only one branch of a conditional statement is a single statement;
in the latter case use braces in both branches:"
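I.e. something like:

	if (!oom_killer_disable()) {
		pr_cont("failed.\n");
		error = -EAGAIN;
	} else {
		pr_cont("done.\n");
	}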

> @@ -251,6 +220,9 @@ void thaw_kernel_threads(void)
>  {
>  	struct task_struct *g, *p;
>  
> +	printk("Enabling OOM killer again.\n");

Do we really need this printk?  The same goes for Disabling OOM
killer.  For freezing it makes some sense because freezing may take a
considerable amount of time and even occasionally fail due to
timeout.  We aren't really expecting those to happen for OOM victims.

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 302e0fc6d121..34bcbb053132 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2128,6 +2128,8 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
>  	current->memcg_oom.order = order;
>  }
>  
> +extern bool oom_killer_disabled;

Ugh... don't we wanna put this in a header file?

> +void mark_tsk_oom_victim(struct task_struct *tsk)
>  {
> -	return atomic_read(&oom_kills);
> +	BUG_ON(oom_killer_disabled);

WARN_ON_ONCE() is prolly a better option here?

> +	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))

Can a task actually be selected as an OOM victim multiple times?

> +		return;
> +	atomic_inc(&oom_victims);
>  }
>  
> -void note_oom_kill(void)
> +void unmark_tsk_oom_victim(struct task_struct *tsk)
>  {
> -	atomic_inc(&oom_kills);
> +	int count;
> +
> +	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
> +		return;

Maybe test this inline in exit_mm()?  e.g.

	if (test_thread_flag(TIF_MEMDIE))
		unmark_tsk_oom_victim(current);

Also, can the function ever be called by someone other than current?
If not, why would it take @task?

> +
> +	down_read(&oom_sem);
> +	/*
> +	 * There is no need to signal the lasst oom_victim if there
> +	 * is nobody who cares.
> +	 */
> +	if (!atomic_dec_return(&oom_victims) && oom_killer_disabled)
> +		complete(&oom_victims_wait);

I don't think using completion this way is safe.  Please read on.

> +	up_read(&oom_sem);
>  }
>  
> -void mark_tsk_oom_victim(struct task_struct *tsk)
> +bool oom_killer_disable(void)
>  {
> -	set_tsk_thread_flag(tsk, TIF_MEMDIE);
> +	/*
> +	 * Make sure to not race with an ongoing OOM killer
> +	 * and that the current is not the victim.
> +	 */
> +	down_write(&oom_sem);
> +	if (!test_tsk_thread_flag(current, TIF_MEMDIE))
> +		oom_killer_disabled = true;

Prolly "if (TIF_MEMDIE) { unlock; return; }" is easier to follow.

> +
> +	count = atomic_read(&oom_victims);
> +	up_write(&oom_sem);
> +
> +	if (count && oom_killer_disabled)
> +		wait_for_completion(&oom_victims_wait);

So, each complete() increments the done count and wait decs.  The
above code works iff the complete()'s and wait()'s are always balanced
which usually isn't true in this type of wait code.  Either use
reinit_completion() / complete_all() combos or wait_event().

> +
> +	return oom_killer_disabled;

Maybe 0 / -errno would be a better choice as return values?

> +/** out_of_memory -  tries to invoke OOM killer.

Formatting?

> + * @zonelist: zonelist pointer
> + * @gfp_mask: memory allocation flags
> + * @order: amount of memory being requested as a power of 2
> + * @nodemask: nodemask passed to page allocator
> + * @force_kill: true if a task must be killed, even if others are exiting
> + *
> + * invokes __out_of_memory if the OOM is not disabled by oom_killer_disable()
> + * when it returns false. Otherwise returns true.
> + */
> +bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> +		int order, nodemask_t *nodemask, bool force_kill)
> +{
> +	bool ret = false;
> +
> +	down_read(&oom_sem);
> +	if (!oom_killer_disabled) {
> +		__out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill);
> +		ret = true;
> +	}
> +	up_read(&oom_sem);
> +
> +	return ret;

Ditto on return value.  0 / -EBUSY seem like a better choice to me.

> @@ -712,12 +770,16 @@ void pagefault_out_of_memory(void)
>  {
>  	struct zonelist *zonelist;
>  
> +	down_read(&oom_sem);
>  	if (mem_cgroup_oom_synchronize(true))
> -		return;
> +		goto unlock;
>  
>  	zonelist = node_zonelist(first_memory_node, GFP_KERNEL);
>  	if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) {
> -		out_of_memory(NULL, 0, 0, NULL, false);
> +		if (!oom_killer_disabled)
> +			__out_of_memory(NULL, 0, 0, NULL, false);
>  		oom_zonelist_unlock(zonelist, GFP_KERNEL);

Is this a condition which can happen and we can deal with?  With
userland fully frozen, there shouldn't be page faults which lead to
memory allocation, right?  Shouldn't we document how oom
disable/enable is supposed to be used (it only makes sense while the
whole system is in quiescent state) and at least trigger
WARN_ON_ONCE() if the above code path gets triggered while oom killer
is disabled?

Thanks.
Michal Hocko Dec. 4, 2014, 2:16 p.m. UTC | #3
On Tue 02-12-14 17:08:04, Tejun Heo wrote:
> Hello, sorry about the delay.  Was on vacation.
> 
> Generally looks good to me.  Some comments below.
> 
> > @@ -355,8 +355,10 @@ static struct sysrq_key_op sysrq_term_op = {
> >  
> >  static void moom_callback(struct work_struct *ignored)
> >  {
> > -	out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), GFP_KERNEL,
> > -		      0, NULL, true);
> > +	if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL),
> > +			   GFP_KERNEL, 0, NULL, true)) {
> > +		printk(KERN_INFO "OOM request ignored because killer is disabled\n");
> > +	}
> >  }
> 
> CodingStyle line 157 says "Do not unnecessarily use braces where a
> single statement will do.".

Sure. Fixed

> > +/**
> > + * oom_killer_disable - disable OOM killer
> > + *
> > + * Forces all page allocations to fail rather than trigger OOM killer.
> > + * Will block and wait until all OOM victims are dead.
> > + *
> > + * Returns true if successfull and false if the OOM killer cannot be
> > + * disabled.
> > + */
> > +extern bool oom_killer_disable(void);
> 
> And function comments usually go where the function body is, not where
> the function is declared, no?

Fixed

> > @@ -157,27 +132,11 @@ int freeze_processes(void)
> >  	pm_wakeup_clear();
> >  	printk("Freezing user space processes ... ");
> >  	pm_freezing = true;
> > -	oom_kills_saved = oom_kills_count();
> >  	error = try_to_freeze_tasks(true);
> >  	if (!error) {
> >  		__usermodehelper_set_disable_depth(UMH_DISABLED);
> > -		oom_killer_disable();
> > -
> > -		/*
> > -		 * There might have been an OOM kill while we were
> > -		 * freezing tasks and the killed task might be still
> > -		 * on the way out so we have to double check for race.
> > -		 */
> > -		if (oom_kills_count() != oom_kills_saved &&
> > -		    !check_frozen_processes()) {
> > -			__usermodehelper_set_disable_depth(UMH_ENABLED);
> > -			printk("OOM in progress.");
> > -			error = -EBUSY;
> > -		} else {
> > -			printk("done.");
> > -		}
> > +		printk("done.\n");
> 
> A delta but shouldn't it be pr_cont()?

kernel/power/process.c doesn't use pr_* so I've stayed with what the
rest of the file is using. I can add a patch which transforms all of
them.
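Something along these lines, for example:

-	printk("Freezing user space processes ... ");
+	pr_info("Freezing user space processes ... ");
[...]
-		printk("done.\n");
+		pr_cont("done.\n");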

> ...
> > @@ -206,6 +165,18 @@ int freeze_kernel_threads(void)
> >  	printk("\n");
> >  	BUG_ON(in_atomic());
> >  
> > +	/*
> > +	 * Now that everything freezable is handled we need to disbale
> > +	 * the OOM killer to disallow any further interference with
> > +	 * killable tasks.
> > +	 */
> > +	printk("Disabling OOM killer ... ");
> > +	if (!oom_killer_disable()) {
> > +		printk("failed.\n");
> > +		error = -EAGAIN;
> > +	} else
> > +		printk("done.\n");
> 
> Ditto on pr_cont() and
>
> CodingStyle line 169 says "This does not apply
> if only one branch of a conditional statement is a single statement;
> in the latter case use braces in both branches:"

Fixed

> > @@ -251,6 +220,9 @@ void thaw_kernel_threads(void)
> >  {
> >  	struct task_struct *g, *p;
> >  
> > +	printk("Enabling OOM killer again.\n");
> 
> Do we really need this printk?  The same goes for Disabling OOM
> killer.  For freezing it makes some sense because freezing may take a
> considerable amount of time and even occassionally fail due to
> timeout.  We aren't really expecting those to happen for OOM victims.

I just considered them useful so that if there are follow-up allocation
failure messages it is clear they are due to the OOM killer being disabled.
I can remove them.

> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 302e0fc6d121..34bcbb053132 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2128,6 +2128,8 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> >  	current->memcg_oom.order = order;
> >  }
> >  
> > +extern bool oom_killer_disabled;
> 
> Ugh... don't we wanna put this in a header file?

Who else would need the declaration? This is not something random code
should look at.

> > +void mark_tsk_oom_victim(struct task_struct *tsk)
> >  {
> > -	return atomic_read(&oom_kills);
> > +	BUG_ON(oom_killer_disabled);
> 
> WARN_ON_ONCE() is prolly a better option here?

Well, something fishy is going on when oom_killer_disabled is set and we
mark a new OOM victim. This is a clear bug. Why would we only warn and
allow the follow-up breakage?

> > +	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
> 
> Can a task actually be selected as an OOM victim multiple times?

AFAICS nothing prevents the global OOM and memcg OOM killers from racing.
 
> > +		return;
> > +	atomic_inc(&oom_victims);
> >  }
> >  
> > -void note_oom_kill(void)
> > +void unmark_tsk_oom_victim(struct task_struct *tsk)
> >  {
> > -	atomic_inc(&oom_kills);
> > +	int count;
> > +
> > +	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
> > +		return;
> 
> Maybe test this inline in exit_mm()?  e.g.
> 
> 	if (test_thread_flag(TIF_MEMDIE))
> 		unmark_tsk_oom_victim(current);

Why do you think testing TIF_MEMDIE in exit_mm is better? I would like
to reduce the usage of the flag as much as possible.

> Also, can the function ever be called by someone other than current?
> If not, why would it take @task?

Changed to use current only. If there is anybody who needs that we can
change that later. I wanted to have it symmetric to mark_tsk_oom_victim
but that is not that important.

> > +
> > +	down_read(&oom_sem);
> > +	/*
> > +	 * There is no need to signal the lasst oom_victim if there
> > +	 * is nobody who cares.
> > +	 */
> > +	if (!atomic_dec_return(&oom_victims) && oom_killer_disabled)
> > +		complete(&oom_victims_wait);
> 
> I don't think using completion this way is safe.  Please read on.
> 
> > +	up_read(&oom_sem);
> >  }
> >  
> > -void mark_tsk_oom_victim(struct task_struct *tsk)
> > +bool oom_killer_disable(void)
> >  {
> > -	set_tsk_thread_flag(tsk, TIF_MEMDIE);
> > +	/*
> > +	 * Make sure to not race with an ongoing OOM killer
> > +	 * and that the current is not the victim.
> > +	 */
> > +	down_write(&oom_sem);
> > +	if (!test_tsk_thread_flag(current, TIF_MEMDIE))
> > +		oom_killer_disabled = true;
> 
> Prolly "if (TIF_MEMDIE) { unlock; return; }" is easier to follow.

OK

> > +
> > +	count = atomic_read(&oom_victims);
> > +	up_write(&oom_sem);
> > +
> > +	if (count && oom_killer_disabled)
> > +		wait_for_completion(&oom_victims_wait);
> 
> So, each complete() increments the done count and wait decs.  The
> above code works iff the complete()'s and wait()'s are always balanced
> which usually isn't true in this type of wait code.  Either use
> reinit_completion() / complete_all() combos or wait_event().

Hmm, I thought that only a single instance of freeze_kernel_threads
(which calls oom_killer_disable) can run at a time. But I am currently
not sure that all paths are called under lock_system_sleep.
I am not familiar with reinit_completion API. Is the following correct?
[...]
@@ -434,10 +434,23 @@ void unmark_tsk_oom_victim(struct task_struct *tsk)
 	 * is nobody who cares.
 	 */
 	if (!atomic_dec_return(&oom_victims) && oom_killer_disabled)
-		complete(&oom_victims_wait);
+		complete_all(&oom_victims_wait);
 	up_read(&oom_sem);
 }
[...]
@@ -445,16 +458,23 @@ bool oom_killer_disable(void)
 	 * and that the current is not the victim.
 	 */
 	down_write(&oom_sem);
-	if (!test_tsk_thread_flag(current, TIF_MEMDIE))
-		oom_killer_disabled = true;
+	if (test_thread_flag(TIF_MEMDIE)) {
+		up_write(&oom_sem);
+		return false;
+	}
+
+	/* unmark_tsk_oom_victim is calling complete_all */
+	if (!oom_killer_disabled)
+		reinit_completion(&oom_victims_wait);
 
+	oom_killer_disabled = true;
 	count = atomic_read(&oom_victims);
 	up_write(&oom_sem);
 
-	if (count && oom_killer_disabled)
+	if (count)
 		wait_for_completion(&oom_victims_wait);
 
-	return oom_killer_disabled;
+	return true;
 }

> > +
> > +	return oom_killer_disabled;
> 
> Maybe 0 / -errno is better choice as return values?

I do not have a problem changing this if you feel strongly about it, but
true/false sounds easier to me and it allows the caller to decide what to
report. If there were multiple reasons to fail then sure, but that is not
the case.
 
> > +/** out_of_memory -  tries to invoke OOM killer.
> 
> Formatting?

fixed

> > + * @zonelist: zonelist pointer
> > + * @gfp_mask: memory allocation flags
> > + * @order: amount of memory being requested as a power of 2
> > + * @nodemask: nodemask passed to page allocator
> > + * @force_kill: true if a task must be killed, even if others are exiting
> > + *
> > + * invokes __out_of_memory if the OOM is not disabled by oom_killer_disable()
> > + * when it returns false. Otherwise returns true.
> > + */
> > +bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> > +		int order, nodemask_t *nodemask, bool force_kill)
> > +{
> > +	bool ret = false;
> > +
> > +	down_read(&oom_sem);
> > +	if (!oom_killer_disabled) {
> > +		__out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill);
> > +		ret = true;
> > +	}
> > +	up_read(&oom_sem);
> > +
> > +	return ret;
> 
> Ditto on return value.  0 / -EBUSY seem like a better choice to me.
> 
> > @@ -712,12 +770,16 @@ void pagefault_out_of_memory(void)
> >  {
> >  	struct zonelist *zonelist;
> >  
> > +	down_read(&oom_sem);
> >  	if (mem_cgroup_oom_synchronize(true))
> > -		return;
> > +		goto unlock;
> >  
> >  	zonelist = node_zonelist(first_memory_node, GFP_KERNEL);
> >  	if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) {
> > -		out_of_memory(NULL, 0, 0, NULL, false);
> > +		if (!oom_killer_disabled)
> > +			__out_of_memory(NULL, 0, 0, NULL, false);
> >  		oom_zonelist_unlock(zonelist, GFP_KERNEL);
> 
> Is this a condition which can happen and we can deal with? With
> userland fully frozen, there shouldn't be page faults which lead to
> memory allocation, right?

Except for racing OOM victims which were missed by try_to_freeze_tasks
because they didn't get a CPU slice to wake up from the freezer. The task
would die on the way out from the page fault exception. I have updated
the changelog to be more verbose about this.

> Shouldn't we document how oom disable/enable is supposed to be used

Well, the API shouldn't be used outside of the PM freezer IMO. This is not
a general API that other parts of the kernel should be using. I can surely
add more documentation for the PM usage though. I have rewritten the
changelog:
"
    As oom_killer_disable() is a full OOM barrier now, we can postpone it in
    the PM freezer until after all freezable user tasks are considered
    frozen (i.e. to freeze_kernel_threads).

    Normally there wouldn't be any unfrozen user tasks at this moment so
    the function will not block. But if there was an OOM killer racing with
    try_to_freeze_tasks and the OOM victim didn't finish yet then we have to
    wait for it. This should complete in a finite time, though, because
        - the victim cannot loop in the page fault handler (it would die
          on the way out from the exception)
        - it cannot loop in the page allocator because all further
          allocations would fail
        - it shouldn't be blocked on any locks held by frozen tasks
          (try_to_freeze expects lockless context) and kernel threads and
          work queues are not frozen yet
"

And I've added:
+/**
+ * oom_killer_disable - disable OOM killer
+ *
+ * Forces all page allocations to fail rather than trigger OOM killer.
+ * Will block and wait until all OOM victims are dead.
+ *
+ * The function cannot be called when there are runnable user tasks because
+ * the userspace would see unexpected allocation failures as a result. Any
+ * new usage of this function should be consulted with MM people.
+ *
+ * Returns true if successful and false if the OOM killer cannot be
+ * disabled.
+ */
 bool oom_killer_disable(void)

> (it only makes sense while the whole system is in quiescent state)
> and at least trigger WARN_ON_ONCE() if the above code path gets
> triggered while oom killer is disabled?

I can add a WARN_ON(!test_thread_flag(TIF_MEMDIE)).

Thanks for the review!
Tejun Heo Dec. 4, 2014, 2:44 p.m. UTC | #4
On Thu, Dec 04, 2014 at 03:16:23PM +0100, Michal Hocko wrote:
> > A delta but shouldn't it be pr_cont()?
> 
> kernel/power/process.c doesn't use pr_* so I've stayed with what the
> rest of the file is using. I can add a patch which transforms all of
> them.

The console output becomes wrong when printk() is used on
continuation.  So, yeah, it'd be great to fix it.

> > > +extern bool oom_killer_disabled;
> > 
> > Ugh... don't we wanna put this in a header file?
> 
> Who else would need the declaration? This is not something random code
> should look at.

Let's say, somebody changes the type to ulong for whatever reason
later and forgets to update this declaration.  What happens then on a
big endian machine?

Jesus, this is basic C programming.  You don't sprinkle external
declarations which the compiler can't verify against the actual
definitions.  There's absolutely no compelling reason to do that here.
Why would you take out compiler verification for no reason?
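E.g. just keep it in include/linux/oom.h next to the other declarations:

	extern bool oom_killer_disabled;

	extern bool oom_killer_disable(void);
	extern void oom_killer_enable(void);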

> > > +void mark_tsk_oom_victim(struct task_struct *tsk)
> > >  {
> > > -	return atomic_read(&oom_kills);
> > > +	BUG_ON(oom_killer_disabled);
> > 
> > WARN_ON_ONCE() is prolly a better option here?
> 
> Well, something fishy is going on when oom_killer_disabled is set and we
> mark new OOM victim. This is a clear bug. Why would be warning and a
> allow the follow up breakage?

Because the system is more likely to be able to go on and we don't BUG
when we can WARN as a general rule.  A working system is almost always
better than a dead one, even for debugging.

> > > +	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
> > 
> > Can a task actually be selected as an OOM victim multiple times?
> 
> AFAICS nothing prevents from global OOM and memcg OOM killers racing.

Maybe it'd be a good idea to note that in the comment?

> > > -void note_oom_kill(void)
> > > +void unmark_tsk_oom_victim(struct task_struct *tsk)
> > >  {
> > > -	atomic_inc(&oom_kills);
> > > +	int count;
> > > +
> > > +	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
> > > +		return;
> > 
> > Maybe test this inline in exit_mm()?  e.g.
> > 
> > 	if (test_thread_flag(TIF_MEMDIE))
> > 		unmark_tsk_oom_victim(current);
> 
> Why do you think testing TIF_MEMDIE in exit_mm is better? I would like
> to reduce the usage of the flag as much as possible.

Because it's adding a function call/return to a hot path for everybody.
It sure is a minuscule cost but we're adding it for no good reason.

> > So, each complete() increments the done count and wait decs.  The
> > above code works iff the complete()'s and wait()'s are always balanced
> > which usually isn't true in this type of wait code.  Either use
> > reinit_completion() / complete_all() combos or wait_event().
> 
> Hmm, I thought that only a single instance of freeze_kernel_threads
> (which calls oom_killer_disable) can run at a time. But I am currently
> not sure that all paths are called under lock_system_sleep.
> I am not familiar with reinit_completion API. Is the following correct?

Hmmm... wouldn't wait_event() be easier to read in this case?
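Something along these lines, for example (untested sketch, reusing the
existing oom_victims counter, with oom_victims_wait turned into a
waitqueue):

	static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);

	/* unmark_tsk_oom_victim(), instead of complete() */
	if (!atomic_dec_return(&oom_victims))
		wake_up_all(&oom_victims_wait);

	/* oom_killer_disable(), instead of wait_for_completion() */
	wait_event(oom_victims_wait, !atomic_read(&oom_victims));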

...
> > Maybe 0 / -errno is better choice as return values?
> 
> I do not have problem to change this if you feel strong about it but
> true/false sounds easier to me and it allows the caller to decide what to
> report. If there were multiple reasons to fail then sure but that is not
> the case.

It's not a big deal, but except for functions which have clear boolean
behavior - functions which try/attempt something or query or decide
certain things - randomly thrown-in bool returns tend to become
confusing, especially because the bool fail value is the opposite of the
0/-errno fail value.  So, "this function only fails for one reason" is
usually a bad and arbitrary reason for choosing a bool return; it causes
confusion at callsites and headaches when the function develops more
reasons to fail.

...
> > > @@ -712,12 +770,16 @@ void pagefault_out_of_memory(void)
> > >  {
> > >  	struct zonelist *zonelist;
> > >  
> > > +	down_read(&oom_sem);
> > >  	if (mem_cgroup_oom_synchronize(true))
> > > -		return;
> > > +		goto unlock;
> > >  
> > >  	zonelist = node_zonelist(first_memory_node, GFP_KERNEL);
> > >  	if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) {
> > > -		out_of_memory(NULL, 0, 0, NULL, false);
> > > +		if (!oom_killer_disabled)
> > > +			__out_of_memory(NULL, 0, 0, NULL, false);
> > >  		oom_zonelist_unlock(zonelist, GFP_KERNEL);
> > 
> > Is this a condition which can happen and we can deal with? With
> > userland fully frozen, there shouldn't be page faults which lead to
> > memory allocation, right?
> 
> Except for racing OOM victims which were missed by try_to_freeze_tasks
> because they didn't get cpu slice to wake up from the freezer. The task
> would die on the way out from the page fault exception. I have updated
> the changelog to be more verbose about this.

That's really not obvious.  Let's please add a comment
explaining that.
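E.g. something like:

	if (!oom_killer_disabled) {
		__out_of_memory(NULL, 0, 0, NULL, false);
	} else {
		/*
		 * Only an OOM victim which raced with the freezer can get
		 * here while the OOM killer is disabled; it will die on its
		 * way out of the exception handler.
		 */
		WARN_ON(!test_thread_flag(TIF_MEMDIE));
	}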

> > (it only makes sense while the whole system is in quiescent state)
> > and at least trigger WARN_ON_ONCE() if the above code path gets
> > triggered while oom killer is disabled?
> 
> I can add a WARN_ON(!test_thread_flag(tsk, TIF_MEMDIE)).

Yeah, that makes sense to me.

Thanks.

Patch

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index 42bad18c66c9..6818589c1004 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -355,8 +355,10 @@  static struct sysrq_key_op sysrq_term_op = {
 
 static void moom_callback(struct work_struct *ignored)
 {
-	out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL), GFP_KERNEL,
-		      0, NULL, true);
+	if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL),
+			   GFP_KERNEL, 0, NULL, true)) {
+		printk(KERN_INFO "OOM request ignored because killer is disabled\n");
+	}
 }
 
 static DECLARE_WORK(moom_work, moom_callback);
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 8f7e74f8ab3a..d802575c9307 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -72,22 +72,26 @@  extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
 		unsigned long totalpages, const nodemask_t *nodemask,
 		bool force_kill);
 
-extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+extern bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *mask, bool force_kill);
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
 
-extern bool oom_killer_disabled;
-
-static inline void oom_killer_disable(void)
-{
-	oom_killer_disabled = true;
-}
+/**
+ * oom_killer_disable - disable OOM killer
+ *
+ * Forces all page allocations to fail rather than trigger OOM killer.
+ * Will block and wait until all OOM victims are dead.
+ *
+ * Returns true if successful and false if the OOM killer cannot be
+ * disabled.
+ */
+extern bool oom_killer_disable(void);
 
-static inline void oom_killer_enable(void)
-{
-	oom_killer_disabled = false;
-}
+/**
+ * oom_killer_enable - enable OOM killer
+ */
+extern void oom_killer_enable(void);
 
 static inline bool oom_gfp_allowed(gfp_t gfp_mask)
 {
diff --git a/kernel/power/process.c b/kernel/power/process.c
index 5a6ec8678b9a..a4306e39f35c 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -108,30 +108,6 @@  static int try_to_freeze_tasks(bool user_only)
 	return todo ? -EBUSY : 0;
 }
 
-static bool __check_frozen_processes(void)
-{
-	struct task_struct *g, *p;
-
-	for_each_process_thread(g, p)
-		if (p != current && !freezer_should_skip(p) && !frozen(p))
-			return false;
-
-	return true;
-}
-
-/*
- * Returns true if all freezable tasks (except for current) are frozen already
- */
-static bool check_frozen_processes(void)
-{
-	bool ret;
-
-	read_lock(&tasklist_lock);
-	ret = __check_frozen_processes();
-	read_unlock(&tasklist_lock);
-	return ret;
-}
-
 /**
  * freeze_processes - Signal user space processes to enter the refrigerator.
  * The current thread will not be frozen.  The same process that calls
@@ -142,7 +118,6 @@  static bool check_frozen_processes(void)
 int freeze_processes(void)
 {
 	int error;
-	int oom_kills_saved;
 
 	error = __usermodehelper_disable(UMH_FREEZING);
 	if (error)
@@ -157,27 +132,11 @@  int freeze_processes(void)
 	pm_wakeup_clear();
 	printk("Freezing user space processes ... ");
 	pm_freezing = true;
-	oom_kills_saved = oom_kills_count();
 	error = try_to_freeze_tasks(true);
 	if (!error) {
 		__usermodehelper_set_disable_depth(UMH_DISABLED);
-		oom_killer_disable();
-
-		/*
-		 * There might have been an OOM kill while we were
-		 * freezing tasks and the killed task might be still
-		 * on the way out so we have to double check for race.
-		 */
-		if (oom_kills_count() != oom_kills_saved &&
-		    !check_frozen_processes()) {
-			__usermodehelper_set_disable_depth(UMH_ENABLED);
-			printk("OOM in progress.");
-			error = -EBUSY;
-		} else {
-			printk("done.");
-		}
+		printk("done.\n");
 	}
-	printk("\n");
 	BUG_ON(in_atomic());
 
 	if (error)
@@ -206,6 +165,18 @@  int freeze_kernel_threads(void)
 	printk("\n");
 	BUG_ON(in_atomic());
 
+	/*
+	 * Now that everything freezable is handled we need to disable
+	 * the OOM killer to disallow any further interference with
+	 * killable tasks.
+	 */
+	printk("Disabling OOM killer ... ");
+	if (!oom_killer_disable()) {
+		printk("failed.\n");
+		error = -EAGAIN;
+	} else
+		printk("done.\n");
+
 	if (error)
 		thaw_kernel_threads();
 	return error;
@@ -222,8 +193,6 @@  void thaw_processes(void)
 	pm_freezing = false;
 	pm_nosig_freezing = false;
 
-	oom_killer_enable();
-
 	printk("Restarting tasks ... ");
 
 	__usermodehelper_set_disable_depth(UMH_FREEZING);
@@ -251,6 +220,9 @@  void thaw_kernel_threads(void)
 {
 	struct task_struct *g, *p;
 
+	printk("Enabling OOM killer again.\n");
+	oom_killer_enable();
+
 	pm_nosig_freezing = false;
 	printk("Restarting kernel threads ... ");
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 302e0fc6d121..34bcbb053132 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2128,6 +2128,8 @@  static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
 	current->memcg_oom.order = order;
 }
 
+extern bool oom_killer_disabled;
+
 /**
  * mem_cgroup_oom_synchronize - complete memcg OOM handling
  * @handle: actually kill/wait or just clean up the OOM state
@@ -2155,7 +2157,7 @@  bool mem_cgroup_oom_synchronize(bool handle)
 	if (!memcg)
 		return false;
 
-	if (!handle)
+	if (!handle || oom_killer_disabled)
 		goto cleanup;
 
 	owait.memcg = memcg;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8b6e14136f4f..b3ccd92bc6dc 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -405,30 +405,63 @@  static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
 }
 
 /*
- * Number of OOM killer invocations (including memcg OOM killer).
- * Primarily used by PM freezer to check for potential races with
- * OOM killed frozen task.
+ * Number of OOM victims in flight
  */
-static atomic_t oom_kills = ATOMIC_INIT(0);
+static atomic_t oom_victims = ATOMIC_INIT(0);
+static DECLARE_COMPLETION(oom_victims_wait);
 
-int oom_kills_count(void)
+bool oom_killer_disabled __read_mostly;
+static DECLARE_RWSEM(oom_sem);
+
+void mark_tsk_oom_victim(struct task_struct *tsk)
 {
-	return atomic_read(&oom_kills);
+	BUG_ON(oom_killer_disabled);
+	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
+		return;
+	atomic_inc(&oom_victims);
 }
 
-void note_oom_kill(void)
+void unmark_tsk_oom_victim(struct task_struct *tsk)
 {
-	atomic_inc(&oom_kills);
+	int count;
+
+	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
+		return;
+
+	down_read(&oom_sem);
+	/*
+	 * There is no need to signal the last oom_victim if there
+	 * is nobody who cares.
+	 */
+	if (!atomic_dec_return(&oom_victims) && oom_killer_disabled)
+		complete(&oom_victims_wait);
+	up_read(&oom_sem);
 }
 
-void mark_tsk_oom_victim(struct task_struct *tsk)
+bool oom_killer_disable(void)
 {
-	set_tsk_thread_flag(tsk, TIF_MEMDIE);
+	/*
+	 * Make sure to not race with an ongoing OOM killer
+	 * and that the current is not the victim.
+	 */
+	down_write(&oom_sem);
+	if (!test_tsk_thread_flag(current, TIF_MEMDIE))
+		oom_killer_disabled = true;
+
+	count = atomic_read(&oom_victims);
+	up_write(&oom_sem);
+
+	if (count && oom_killer_disabled)
+		wait_for_completion(&oom_victims_wait);
+
+	return oom_killer_disabled;
 }
 
-void unmark_tsk_oom_victim(struct task_struct *tsk)
+void oom_killer_enable(void)
 {
-	clear_thread_flag(TIF_MEMDIE);
+	down_write(&oom_sem);
+	oom_killer_disabled = false;
+	up_write(&oom_sem);
 }
 
 #define K(x) ((x) << (PAGE_SHIFT-10))
@@ -626,7 +659,7 @@  void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
 }
 
 /**
- * out_of_memory - kill the "best" process when we run out of memory
+ * __out_of_memory - kill the "best" process when we run out of memory
  * @zonelist: zonelist pointer
  * @gfp_mask: memory allocation flags
  * @order: amount of memory being requested as a power of 2
@@ -638,7 +671,7 @@  void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
  * OR try to be smart about which process to kill. Note that we
  * don't have to be perfect here, we just have to be good.
  */
-void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+static void __out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		int order, nodemask_t *nodemask, bool force_kill)
 {
 	const nodemask_t *mpol_mask;
@@ -703,6 +736,31 @@  out:
 		schedule_timeout_killable(1);
 }
 
+/** out_of_memory -  tries to invoke OOM killer.
+ * @zonelist: zonelist pointer
+ * @gfp_mask: memory allocation flags
+ * @order: amount of memory being requested as a power of 2
+ * @nodemask: nodemask passed to page allocator
+ * @force_kill: true if a task must be killed, even if others are exiting
+ *
+ * invokes __out_of_memory if the OOM is not disabled by oom_killer_disable()
+ * when it returns false. Otherwise returns true.
+ */
+bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+		int order, nodemask_t *nodemask, bool force_kill)
+{
+	bool ret = false;
+
+	down_read(&oom_sem);
+	if (!oom_killer_disabled) {
+		__out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill);
+		ret = true;
+	}
+	up_read(&oom_sem);
+
+	return ret;
+}
+
 /*
  * The pagefault handler calls here because it is out of memory, so kill a
  * memory-hogging task.  If any populated zone has ZONE_OOM_LOCKED set, a
@@ -712,12 +770,16 @@  void pagefault_out_of_memory(void)
 {
 	struct zonelist *zonelist;
 
+	down_read(&oom_sem);
 	if (mem_cgroup_oom_synchronize(true))
-		return;
+		goto unlock;
 
 	zonelist = node_zonelist(first_memory_node, GFP_KERNEL);
 	if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) {
-		out_of_memory(NULL, 0, 0, NULL, false);
+		if (!oom_killer_disabled)
+			__out_of_memory(NULL, 0, 0, NULL, false);
 		oom_zonelist_unlock(zonelist, GFP_KERNEL);
 	}
+unlock:
+	up_read(&oom_sem);
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9cd36b822444..d44d69aa7b70 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -242,8 +242,6 @@  void set_pageblock_migratetype(struct page *page, int migratetype)
 					PB_migrate, PB_migrate_end);
 }
 
-bool oom_killer_disabled __read_mostly;
-
 #ifdef CONFIG_DEBUG_VM
 static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
 {
@@ -2241,10 +2239,11 @@  static inline struct page *
 __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, struct zone *preferred_zone,
-	int classzone_idx, int migratetype)
+	int classzone_idx, int migratetype, bool *oom_failed)
 {
 	struct page *page;
 
+	*oom_failed = false;
 	/* Acquire the per-zone oom lock for each zone */
 	if (!oom_zonelist_trylock(zonelist, gfp_mask)) {
 		schedule_timeout_uninterruptible(1);
@@ -2252,14 +2251,6 @@  __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	}
 
 	/*
-	 * PM-freezer should be notified that there might be an OOM killer on
-	 * its way to kill and wake somebody up. This is too early and we might
-	 * end up not killing anything but false positives are acceptable.
-	 * See freeze_processes.
-	 */
-	note_oom_kill();
-
-	/*
 	 * Go through the zonelist yet one more time, keep very high watermark
 	 * here, this is only to catch a parallel oom killing, we must fail if
 	 * we're still under heavy pressure.
@@ -2289,8 +2280,8 @@  __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 			goto out;
 	}
 	/* Exhausted what can be done so it's blamo time */
-	out_of_memory(zonelist, gfp_mask, order, nodemask, false);
-
+	if (!out_of_memory(zonelist, gfp_mask, order, nodemask, false))
+		*oom_failed = true;
 out:
 	oom_zonelist_unlock(zonelist, gfp_mask);
 	return page;
@@ -2716,8 +2707,8 @@  rebalance:
 	 */
 	if (!did_some_progress) {
 		if (oom_gfp_allowed(gfp_mask)) {
-			if (oom_killer_disabled)
-				goto nopage;
+			bool oom_failed;
+
 			/* Coredumps can quickly deplete all memory reserves */
 			if ((current->flags & PF_DUMPCORE) &&
 			    !(gfp_mask & __GFP_NOFAIL))
@@ -2725,10 +2716,19 @@  rebalance:
 			page = __alloc_pages_may_oom(gfp_mask, order,
 					zonelist, high_zoneidx,
 					nodemask, preferred_zone,
-					classzone_idx, migratetype);
+					classzone_idx, migratetype,
+					&oom_failed);
+
 			if (page)
 				goto got_pg;
 
+			/*
+			 * OOM killer might be disabled and then we have to
+			 * fail the allocation
+			 */
+			if (oom_failed)
+				goto nopage;
+
 			if (!(gfp_mask & __GFP_NOFAIL)) {
 				/*
 				 * The oom killer is not called for high-order