From: Andy Lutomirski
Date: Wed, 17 Jun 2015 07:23:49 -0700
Subject: Re: [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code
To: Ingo Molnar
Cc: Andy Lutomirski, X86 ML, "linux-kernel@vger.kernel.org", Frédéric Weisbecker, Rik van Riel, Oleg Nesterov, Denys Vlasenko, Borislav Petkov, Kees Cook, Brian Gerst
In-Reply-To: <20150617103226.GA30325@gmail.com>
References: <20150617103226.GA30325@gmail.com>

On Wed, Jun 17, 2015 at 3:32 AM, Ingo Molnar wrote:
>
> * Andy Lutomirski wrote:
>
>> The main things that are missing are that I haven't done the 32-bit parts
>> (anyone want to help?) and therefore I haven't deleted the old C code.  I also
>> think this may break UML for trivial reasons.
>
> So I'd suggest moving most of the SYSRET fast path to C too.
>
> This is how it looks like now after your patches:
>
>         testl   $_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
>         jnz     tracesys
> entry_SYSCALL_64_fastpath:
> #if __SYSCALL_MASK == ~0
>         cmpq    $__NR_syscall_max, %rax
> #else
>         andl    $__SYSCALL_MASK, %eax
>         cmpl    $__NR_syscall_max, %eax
> #endif
>         ja      1f                      /* return -ENOSYS (already in pt_regs->ax) */
>         movq    %r10, %rcx
>         call    *sys_call_table(, %rax, 8)
>         movq    %rax, RAX(%rsp)
> 1:
> /*
>  * Syscall return path ending with SYSRET (fast path).
>  * Has incompletely filled pt_regs.
>  */
>         LOCKDEP_SYS_EXIT
>         /*
>          * We do not frame this tiny irq-off block with TRACE_IRQS_OFF/ON,
>          * it is too small to ever cause noticeable irq latency.
>          */
>         DISABLE_INTERRUPTS(CLBR_NONE)
>
>         /*
>          * We must check ti flags with interrupts (or at least preemption)
>          * off because we must *never* return to userspace without
>          * processing exit work that is enqueued if we're preempted here.
>          * In particular, returning to userspace with any of the one-shot
>          * flags (TIF_NOTIFY_RESUME, TIF_USER_RETURN_NOTIFY, etc) set is
>          * very bad.
>          */
>         testl   $_TIF_ALLWORK_MASK, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
>         jnz     int_ret_from_sys_call_irqs_off  /* Go to the slow path */
>
> Most of that can be done in C.
>
> And I think we could also convert the IRET syscall return slow path to C too:
>
> GLOBAL(int_ret_from_sys_call)
>         SAVE_EXTRA_REGS
>         movq    %rsp, %rdi
>         call    syscall_return_slowpath /* returns with IRQs disabled */
>         RESTORE_EXTRA_REGS
>
>         /*
>          * Try to use SYSRET instead of IRET if we're returning to
>          * a completely clean 64-bit userspace context.
>          */
>         movq    RCX(%rsp), %rcx
>         movq    RIP(%rsp), %r11
>         cmpq    %rcx, %r11              /* RCX == RIP */
>         jne     opportunistic_sysret_failed
>
>         /*
>          * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
>          * in kernel space.  This essentially lets the user take over
>          * the kernel, since userspace controls RSP.
>          *
>          * If width of "canonical tail" ever becomes variable, this will need
>          * to be updated to remain correct on both old and new CPUs.
>          */
>         .ifne __VIRTUAL_MASK_SHIFT - 47
>         .error "virtual address width changed -- SYSRET checks need update"
>         .endif
>
>         /* Change top 16 bits to be the sign-extension of 47th bit */
>         shl     $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
>         sar     $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
>
>         /* If this changed %rcx, it was not canonical */
>         cmpq    %rcx, %r11
>         jne     opportunistic_sysret_failed
>
>         cmpq    $__USER_CS, CS(%rsp)    /* CS must match SYSRET */
>         jne     opportunistic_sysret_failed
>
>         movq    R11(%rsp), %r11
>         cmpq    %r11, EFLAGS(%rsp)      /* R11 == RFLAGS */
>         jne     opportunistic_sysret_failed
>
>         /*
>          * SYSRET can't restore RF.  SYSRET can restore TF, but unlike IRET,
>          * restoring TF results in a trap from userspace immediately after
>          * SYSRET.  This would cause an infinite loop whenever #DB happens
>          * with register state that satisfies the opportunistic SYSRET
>          * conditions.  For example, single-stepping this user code:
>          *
>          *           movq    $stuck_here, %rcx
>          *           pushfq
>          *           popq %r11
>          *   stuck_here:
>          *
>          * would never get past 'stuck_here'.
>          */
>         testq   $(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
>         jnz     opportunistic_sysret_failed
>
>         /* nothing to check for RSP */
>
>         cmpq    $__USER_DS, SS(%rsp)    /* SS must match SYSRET */
>         jne     opportunistic_sysret_failed
>
>         /*
>          * We win! This label is here just for ease of understanding
>          * perf profiles. Nothing jumps here.
>          */
> syscall_return_via_sysret:
>         /* rcx and r11 are already restored (see code above) */
>         RESTORE_C_REGS_EXCEPT_RCX_R11
>         movq    RSP(%rsp), %rsp
>         USERGS_SYSRET64
>
> opportunistic_sysret_failed:
>         SWAPGS
>         jmp     restore_c_regs_and_iret
> END(entry_SYSCALL_64)
>
>
> Basically there would be a single C function we'd call, which returns a condition
> (or fixes up its return address on the stack directly) to determine between the
> SYSRET and IRET return paths.
>
> Moving this to C too has immediate benefits: that way we could easily add
> instrumentation to see how efficient these various return methods are, etc.
>
> I.e. I don't think there's two ways about this: once the entry code moves to the
> domain of C code, we get the best benefits by moving as much of it as possible.

This is almost certainly true.  There are a lot more cleanups possible
here.  I want to nail down the 32-bit case first so we can delete the
old code.

>
> The only low level bits remaining in assembly will be low level hardware ABI
> details: saving registers and restoring registers to the expected format - no
> 'active' code whatsoever.

I think this is true for syscalls.  Getting the weird special cases
(IRET and GS fault) for error_entry to work correctly in C could be
tricky.

--Andy
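
For illustration only, here is a rough user-space sketch of the "single C
function ... which returns a condition" idea from the quoted text, mirroring
the opportunistic SYSRET checks in the assembly above. It is not code from
the patch series. The struct layout, the sketch_* names, and the constant
values (selector values, flag bits) are assumptions made for this example;
the real definitions live in the kernel headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* All names and values below are assumptions for this sketch. */
#define SKETCH_VIRTUAL_MASK_SHIFT 47            /* same width the .ifne check expects */
#define SKETCH_USER_CS            0x33u         /* assumed 64-bit user code selector */
#define SKETCH_USER_DS            0x2bu         /* assumed user data/stack selector */
#define SKETCH_EFLAGS_TF          (1ull << 8)   /* trap flag */
#define SKETCH_EFLAGS_RF          (1ull << 16)  /* resume flag */

/* Hypothetical stand-in for the saved user register frame (pt_regs). */
struct sketch_regs {
	uint64_t r11, cx, ip, cs, flags, sp, ss;
};

/*
 * An address is canonical if bits 63..47 are all equal.  This is the
 * portable equivalent of the shl/sar sign-extension compare in the asm.
 */
static bool sketch_is_canonical(uint64_t addr)
{
	uint64_t top = addr >> SKETCH_VIRTUAL_MASK_SHIFT;   /* bits 63..47 */
	uint64_t all_ones = (1ull << (64 - SKETCH_VIRTUAL_MASK_SHIFT)) - 1;

	return top == 0 || top == all_ones;
}

/* Returns true if the frame could be restored with SYSRET, false for IRET. */
static bool sketch_can_sysret(const struct sketch_regs *regs)
{
	if (regs->cx != regs->ip)                /* RCX == RIP */
		return false;
	if (!sketch_is_canonical(regs->ip))      /* non-canonical RIP would #GP */
		return false;
	if (regs->cs != SKETCH_USER_CS)          /* CS must match SYSRET */
		return false;
	if (regs->r11 != regs->flags)            /* R11 == RFLAGS */
		return false;
	if (regs->flags & (SKETCH_EFLAGS_RF | SKETCH_EFLAGS_TF))
		return false;                    /* RF/TF need the IRET path */
	if (regs->ss != SKETCH_USER_DS)          /* SS must match SYSRET */
		return false;
	/* nothing to check for RSP */
	return true;
}
```

A caller would branch on the result: true takes the SYSRET path, false
falls back to IRET. Having the decision in one C function is also where
instrumentation counters for each return method could be added.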