From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752912AbbFQKck (ORCPT ); Wed, 17 Jun 2015 06:32:40 -0400 Received: from mail-wg0-f53.google.com ([74.125.82.53]:34729 "EHLO mail-wg0-f53.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750967AbbFQKcb (ORCPT ); Wed, 17 Jun 2015 06:32:31 -0400 Date: Wed, 17 Jun 2015 12:32:26 +0200 From: Ingo Molnar To: Andy Lutomirski Cc: x86@kernel.org, linux-kernel@vger.kernel.org, =?iso-8859-1?Q?Fr=E9d=E9ric?= Weisbecker , Rik van Riel , Oleg Nesterov , Denys Vlasenko , Borislav Petkov , Kees Cook , Brian Gerst Subject: Re: [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code Message-ID: <20150617103226.GA30325@gmail.com> References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Andy Lutomirski wrote: > The main things that are missing are that I haven't done the 32-bit parts > (anyone want to help?) and therefore I haven't deleted the old C code. I also > think this may break UML for trivial reasons. So I'd suggest moving most of the SYSRET fast path to C too. This is how it looks like now after your patches: testl $_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS) jnz tracesys entry_SYSCALL_64_fastpath: #if __SYSCALL_MASK == ~0 cmpq $__NR_syscall_max, %rax #else andl $__SYSCALL_MASK, %eax cmpl $__NR_syscall_max, %eax #endif ja 1f /* return -ENOSYS (already in pt_regs->ax) */ movq %r10, %rcx call *sys_call_table(, %rax, 8) movq %rax, RAX(%rsp) 1: /* * Syscall return path ending with SYSRET (fast path). * Has incompletely filled pt_regs. */ LOCKDEP_SYS_EXIT /* * We do not frame this tiny irq-off block with TRACE_IRQS_OFF/ON, * it is too small to ever cause noticeable irq latency. */ DISABLE_INTERRUPTS(CLBR_NONE) /* * We must check ti flags with interrupts (or at least preemption) * off because we must *never* return to userspace without * processing exit work that is enqueued if we're preempted here. * In particular, returning to userspace with any of the one-shot * flags (TIF_NOTIFY_RESUME, TIF_USER_RETURN_NOTIFY, etc) set is * very bad. */ testl $_TIF_ALLWORK_MASK, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS) jnz int_ret_from_sys_call_irqs_off /* Go to the slow path */ Most of that can be done in C. And I think we could also convert the IRET syscall return slow path to C too: GLOBAL(int_ret_from_sys_call) SAVE_EXTRA_REGS movq %rsp, %rdi call syscall_return_slowpath /* returns with IRQs disabled */ RESTORE_EXTRA_REGS /* * Try to use SYSRET instead of IRET if we're returning to * a completely clean 64-bit userspace context. */ movq RCX(%rsp), %rcx movq RIP(%rsp), %r11 cmpq %rcx, %r11 /* RCX == RIP */ jne opportunistic_sysret_failed /* * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP * in kernel space. This essentially lets the user take over * the kernel, since userspace controls RSP. * * If width of "canonical tail" ever becomes variable, this will need * to be updated to remain correct on both old and new CPUs. */ .ifne __VIRTUAL_MASK_SHIFT - 47 .error "virtual address width changed -- SYSRET checks need update" .endif /* Change top 16 bits to be the sign-extension of 47th bit */ shl $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx sar $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx /* If this changed %rcx, it was not canonical */ cmpq %rcx, %r11 jne opportunistic_sysret_failed cmpq $__USER_CS, CS(%rsp) /* CS must match SYSRET */ jne opportunistic_sysret_failed movq R11(%rsp), %r11 cmpq %r11, EFLAGS(%rsp) /* R11 == RFLAGS */ jne opportunistic_sysret_failed /* * SYSRET can't restore RF. SYSRET can restore TF, but unlike IRET, * restoring TF results in a trap from userspace immediately after * SYSRET. This would cause an infinite loop whenever #DB happens * with register state that satisfies the opportunistic SYSRET * conditions. For example, single-stepping this user code: * * movq $stuck_here, %rcx * pushfq * popq %r11 * stuck_here: * * would never get past 'stuck_here'. */ testq $(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11 jnz opportunistic_sysret_failed /* nothing to check for RSP */ cmpq $__USER_DS, SS(%rsp) /* SS must match SYSRET */ jne opportunistic_sysret_failed /* * We win! This label is here just for ease of understanding * perf profiles. Nothing jumps here. */ syscall_return_via_sysret: /* rcx and r11 are already restored (see code above) */ RESTORE_C_REGS_EXCEPT_RCX_R11 movq RSP(%rsp), %rsp USERGS_SYSRET64 opportunistic_sysret_failed: SWAPGS jmp restore_c_regs_and_iret END(entry_SYSCALL_64) Basically there would be a single C function we'd call, which returns a condition (or fixes up its return address on the stack directly) to determine between the SYSRET and IRET return paths. Moving this to C too has immediate benefits: that way we could easily add instrumentation to see how efficient these various return methods are, etc. I.e. I don't think there's two ways about this: once the entry code moves to the domain of C code, we get the best benefits by moving as much of it as possible. The only low level bits remaining in assembly will be low level hardware ABI details: saving registers and restoring registers to the expected format - no 'active' code whatsoever. Thanks, Ingo