From: Ingo Molnar <mingo@kernel.org>
To: Andy Lutomirski <luto@kernel.org>
Cc: x86@kernel.org, linux-kernel@vger.kernel.org,
	"Frédéric Weisbecker" <fweisbec@gmail.com>,
	"Rik van Riel" <riel@redhat.com>,
	"Oleg Nesterov" <oleg@redhat.com>,
	"Denys Vlasenko" <vda.linux@googlemail.com>,
	"Borislav Petkov" <bp@alien8.de>,
	"Kees Cook" <keescook@chromium.org>,
	"Brian Gerst" <brgerst@gmail.com>
Subject: Re: [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code
Date: Wed, 17 Jun 2015 12:32:26 +0200
Message-ID: <20150617103226.GA30325@gmail.com>
In-Reply-To: <cover.1434485184.git.luto@kernel.org>


* Andy Lutomirski <luto@kernel.org> wrote:

> The main things that are missing are that I haven't done the 32-bit parts 
> (anyone want to help?) and therefore I haven't deleted the old C code.  I also 
> think this may break UML for trivial reasons.

So I'd suggest moving most of the SYSRET fast path to C too.

This is what it looks like now, after your patches:

	testl	$_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
	jnz	tracesys
entry_SYSCALL_64_fastpath:
#if __SYSCALL_MASK == ~0
	cmpq	$__NR_syscall_max, %rax
#else
	andl	$__SYSCALL_MASK, %eax
	cmpl	$__NR_syscall_max, %eax
#endif
	ja	1f				/* return -ENOSYS (already in pt_regs->ax) */
	movq	%r10, %rcx
	call	*sys_call_table(, %rax, 8)
	movq	%rax, RAX(%rsp)
1:
/*
 * Syscall return path ending with SYSRET (fast path).
 * Has incompletely filled pt_regs.
 */
	LOCKDEP_SYS_EXIT
	/*
	 * We do not frame this tiny irq-off block with TRACE_IRQS_OFF/ON,
	 * it is too small to ever cause noticeable irq latency.
	 */
	DISABLE_INTERRUPTS(CLBR_NONE)

	/*
	 * We must check ti flags with interrupts (or at least preemption)
	 * off because we must *never* return to userspace without
	 * processing exit work that is enqueued if we're preempted here.
	 * In particular, returning to userspace with any of the one-shot
	 * flags (TIF_NOTIFY_RESUME, TIF_USER_RETURN_NOTIFY, etc) set is
	 * very bad.
	 */
	testl	$_TIF_ALLWORK_MASK, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
	jnz	int_ret_from_sys_call_irqs_off	/* Go to the slow path */

Most of that can be done in C.
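
To make that a bit more concrete, here is a minimal user-space sketch of the
dispatch-and-check logic above, written in C. All names in it
(fake_entry_regs, fake_sys_call_table, FAKE_NR_SYSCALL_MAX,
FAKE_TIF_ALLWORK_MASK, do_syscall_fast) are hypothetical stand-ins invented
for this example, the mask value is a placeholder, and the real thing would
still have to run the flags check with IRQs off:

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the pt_regs fields used on this path. */
struct fake_entry_regs {
	unsigned long ax;	/* preset to -ENOSYS, overwritten with the result */
	unsigned long di, si, dx, r10, r8, r9;
};

typedef long (*fake_sys_fn)(unsigned long, unsigned long, unsigned long,
			    unsigned long, unsigned long, unsigned long);

/* Toy "syscall" so the table has something to call. */
static long fake_sys_add(unsigned long a, unsigned long b, unsigned long c,
			 unsigned long d, unsigned long e, unsigned long f)
{
	(void)c; (void)d; (void)e; (void)f;
	return (long)(a + b);
}

#define FAKE_NR_SYSCALL_MAX	0UL		/* highest valid syscall nr */
#define FAKE_TIF_ALLWORK_MASK	0x0000ffffUL	/* placeholder value */

static const fake_sys_fn fake_sys_call_table[FAKE_NR_SYSCALL_MAX + 1] = {
	fake_sys_add,
};

/*
 * Dispatch syscall 'nr' and report whether the fast return path may be
 * used.  An out-of-range 'nr' leaves the preset -ENOSYS in regs->ax,
 * like the 'ja 1f' branch in the assembly.
 */
static bool do_syscall_fast(struct fake_entry_regs *regs, unsigned long nr,
			    unsigned long ti_flags)
{
	if (nr <= FAKE_NR_SYSCALL_MAX)
		regs->ax = (unsigned long)fake_sys_call_table[nr](
				regs->di, regs->si, regs->dx,
				regs->r10, regs->r8, regs->r9);

	/* Any pending exit work forces the slow path. */
	return !(ti_flags & FAKE_TIF_ALLWORK_MASK);
}
```

The asm stub would then only need to test the return value to pick between
the fast and slow return paths.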

And I think we could convert the IRET syscall return slow path to C too:

GLOBAL(int_ret_from_sys_call)
	SAVE_EXTRA_REGS
	movq	%rsp, %rdi
	call	syscall_return_slowpath	/* returns with IRQs disabled */
	RESTORE_EXTRA_REGS

	/*
	 * Try to use SYSRET instead of IRET if we're returning to
	 * a completely clean 64-bit userspace context.
	 */
	movq	RCX(%rsp), %rcx
	movq	RIP(%rsp), %r11
	cmpq	%rcx, %r11			/* RCX == RIP */
	jne	opportunistic_sysret_failed

	/*
	 * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
	 * in kernel space.  This essentially lets the user take over
	 * the kernel, since userspace controls RSP.
	 *
	 * If width of "canonical tail" ever becomes variable, this will need
	 * to be updated to remain correct on both old and new CPUs.
	 */
	.ifne __VIRTUAL_MASK_SHIFT - 47
	.error "virtual address width changed -- SYSRET checks need update"
	.endif

	/* Change top 16 bits to be the sign-extension of 47th bit */
	shl	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
	sar	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx

	/* If this changed %rcx, it was not canonical */
	cmpq	%rcx, %r11
	jne	opportunistic_sysret_failed

	cmpq	$__USER_CS, CS(%rsp)		/* CS must match SYSRET */
	jne	opportunistic_sysret_failed

	movq	R11(%rsp), %r11
	cmpq	%r11, EFLAGS(%rsp)		/* R11 == RFLAGS */
	jne	opportunistic_sysret_failed

	/*
	 * SYSRET can't restore RF.  SYSRET can restore TF, but unlike IRET,
	 * restoring TF results in a trap from userspace immediately after
	 * SYSRET.  This would cause an infinite loop whenever #DB happens
	 * with register state that satisfies the opportunistic SYSRET
	 * conditions.  For example, single-stepping this user code:
	 *
	 *           movq	$stuck_here, %rcx
	 *           pushfq
	 *           popq %r11
	 *   stuck_here:
	 *
	 * would never get past 'stuck_here'.
	 */
	testq	$(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
	jnz	opportunistic_sysret_failed

	/* nothing to check for RSP */

	cmpq	$__USER_DS, SS(%rsp)		/* SS must match SYSRET */
	jne	opportunistic_sysret_failed

	/*
	 * We win! This label is here just for ease of understanding
	 * perf profiles. Nothing jumps here.
	 */
syscall_return_via_sysret:
	/* rcx and r11 are already restored (see code above) */
	RESTORE_C_REGS_EXCEPT_RCX_R11
	movq	RSP(%rsp), %rsp
	USERGS_SYSRET64

opportunistic_sysret_failed:
	SWAPGS
	jmp	restore_c_regs_and_iret
END(entry_SYSCALL_64)


Basically there would be a single C function we'd call, which returns a condition 
(or fixes up its return address on the stack directly) to choose between the 
SYSRET and IRET return paths.
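
As a rough sketch of such a function, the opportunistic SYSRET checks from
the assembly above translate to C fairly directly. This is a user-space
model only: fake_exit_regs, fake_is_canonical, can_use_sysret and the FAKE_*
constants are hypothetical names made up for this example (the constant
values mirror the x86-64 ones):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the pt_regs fields the checks look at. */
struct fake_exit_regs {
	uint64_t cx, r11;
	uint64_t ip, cs, flags, sp, ss;
};

#define FAKE_VIRTUAL_MASK_SHIFT	47
#define FAKE_USER_CS		0x33ULL
#define FAKE_USER_DS		0x2bULL
#define FAKE_EFLAGS_TF		0x00000100ULL
#define FAKE_EFLAGS_RF		0x00010000ULL

/* Canonical: bits 63..47 are either all zero or all one. */
static bool fake_is_canonical(uint64_t addr)
{
	uint64_t top = addr >> FAKE_VIRTUAL_MASK_SHIFT;

	return top == 0 || top == (UINT64_MAX >> FAKE_VIRTUAL_MASK_SHIFT);
}

/* Returns true if SYSRET can be used, false if we must fall back to IRET. */
static bool can_use_sysret(const struct fake_exit_regs *regs)
{
	if (regs->cx != regs->ip)		/* RCX == RIP */
		return false;
	if (!fake_is_canonical(regs->ip))	/* avoid #GP in kernel mode */
		return false;
	if (regs->cs != FAKE_USER_CS)		/* CS must match SYSRET */
		return false;
	if (regs->r11 != regs->flags)		/* R11 == RFLAGS */
		return false;
	if (regs->flags & (FAKE_EFLAGS_RF | FAKE_EFLAGS_TF))
		return false;			/* SYSRET can't handle RF/TF */
	if (regs->ss != FAKE_USER_DS)		/* SS must match SYSRET */
		return false;

	return true;				/* nothing to check for RSP */
}
```

The asm side would shrink to a call plus one conditional jump to either
syscall_return_via_sysret or opportunistic_sysret_failed.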

Moving this to C too has immediate benefits: that way we could easily add 
instrumentation to see how efficient these various return methods are, etc.
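
For illustration, such instrumentation could be as small as a pair of
counters bumped at the point where the return method is chosen. The names
below (fake_exit_path, fake_count_exit, fake_sysret_percent) are hypothetical,
and a kernel version would presumably use per-CPU counters rather than a
plain global array:

```c
/* Which hardware return method was used for a given syscall exit. */
enum fake_exit_path {
	FAKE_EXIT_SYSRET,
	FAKE_EXIT_IRET,
	FAKE_EXIT_NR		/* number of paths, keep last */
};

static unsigned long fake_exit_count[FAKE_EXIT_NR];

/* Record one return to userspace via 'path'. */
static void fake_count_exit(enum fake_exit_path path)
{
	fake_exit_count[path]++;
}

/* Percentage of recorded returns that used SYSRET, 0 if none recorded. */
static unsigned long fake_sysret_percent(void)
{
	unsigned long total = fake_exit_count[FAKE_EXIT_SYSRET] +
			      fake_exit_count[FAKE_EXIT_IRET];

	return total ? fake_exit_count[FAKE_EXIT_SYSRET] * 100 / total : 0;
}
```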

I.e. I don't think there are two ways about this: once the entry code moves to the 
domain of C code, we get the most benefit by moving as much of it as possible. 

The only low-level bits remaining in assembly will be low-level hardware ABI 
details: saving registers and restoring registers to the expected format - no 
'active' code whatsoever.

Thanks,

	Ingo
