From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1755416AbbFQOYR (ORCPT <rfc822;w@1wt.eu>);
	Wed, 17 Jun 2015 10:24:17 -0400
Received: from mail-la0-f42.google.com ([209.85.215.42]:36253 "EHLO
	mail-la0-f42.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1753509AbbFQOYL (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Wed, 17 Jun 2015 10:24:11 -0400
MIME-Version: 1.0
In-Reply-To: <20150617103226.GA30325@gmail.com>
References: <cover.1434485184.git.luto@kernel.org> <20150617103226.GA30325@gmail.com>
From: Andy Lutomirski <luto@amacapital.net>
Date: Wed, 17 Jun 2015 07:23:49 -0700
Message-ID: <CALCETrU2oEiHiqb9gu+ZnDU+zOMk+JqDG2dYFVHsAh5xm2tGtw@mail.gmail.com>
Subject: Re: [RFC/INCOMPLETE 00/13] x86: Rewrite exit-to-userspace code
To: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>, X86 ML <x86@kernel.org>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        =?UTF-8?B?RnLDqWTDqXJpYyBXZWlzYmVja2Vy?= <fweisbec@gmail.com>,
        Rik van Riel <riel@redhat.com>, Oleg Nesterov <oleg@redhat.com>,
        Denys Vlasenko <vda.linux@googlemail.com>,
        Borislav Petkov <bp@alien8.de>, Kees Cook <keescook@chromium.org>,
        Brian Gerst <brgerst@gmail.com>
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Jun 17, 2015 at 3:32 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Andy Lutomirski <luto@kernel.org> wrote:
>
>> The main things that are missing are that I haven't done the 32-bit parts
>> (anyone want to help?) and therefore I haven't deleted the old C code.  I also
>> think this may break UML for trivial reasons.
>
> So I'd suggest moving most of the SYSRET fast path to C too.
>
> This is what it looks like now after your patches:
>
>         testl   $_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
>         jnz     tracesys
> entry_SYSCALL_64_fastpath:
> #if __SYSCALL_MASK == ~0
>         cmpq    $__NR_syscall_max, %rax
> #else
>         andl    $__SYSCALL_MASK, %eax
>         cmpl    $__NR_syscall_max, %eax
> #endif
>         ja      1f                              /* return -ENOSYS (already in pt_regs->ax) */
>         movq    %r10, %rcx
>         call    *sys_call_table(, %rax, 8)
>         movq    %rax, RAX(%rsp)
> 1:
> /*
>  * Syscall return path ending with SYSRET (fast path).
>  * Has incompletely filled pt_regs.
>  */
>         LOCKDEP_SYS_EXIT
>         /*
>          * We do not frame this tiny irq-off block with TRACE_IRQS_OFF/ON,
>          * it is too small to ever cause noticeable irq latency.
>          */
>         DISABLE_INTERRUPTS(CLBR_NONE)
>
>         /*
>          * We must check ti flags with interrupts (or at least preemption)
>          * off because we must *never* return to userspace without
>          * processing exit work that is enqueued if we're preempted here.
>          * In particular, returning to userspace with any of the one-shot
>          * flags (TIF_NOTIFY_RESUME, TIF_USER_RETURN_NOTIFY, etc) set is
>          * very bad.
>          */
>         testl   $_TIF_ALLWORK_MASK, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
>         jnz     int_ret_from_sys_call_irqs_off  /* Go to the slow path */
>
> Most of that can be done in C.
>
> And I think we could also convert the IRET syscall return slow path to C too:
>
> GLOBAL(int_ret_from_sys_call)
>         SAVE_EXTRA_REGS
>         movq    %rsp, %rdi
>         call    syscall_return_slowpath /* returns with IRQs disabled */
>         RESTORE_EXTRA_REGS
>
>         /*
>          * Try to use SYSRET instead of IRET if we're returning to
>          * a completely clean 64-bit userspace context.
>          */
>         movq    RCX(%rsp), %rcx
>         movq    RIP(%rsp), %r11
>         cmpq    %rcx, %r11                      /* RCX == RIP */
>         jne     opportunistic_sysret_failed
>
>         /*
>          * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
>          * in kernel space.  This essentially lets the user take over
>          * the kernel, since userspace controls RSP.
>          *
>          * If width of "canonical tail" ever becomes variable, this will need
>          * to be updated to remain correct on both old and new CPUs.
>          */
>         .ifne __VIRTUAL_MASK_SHIFT - 47
>         .error "virtual address width changed -- SYSRET checks need update"
>         .endif
>
>         /* Change top 16 bits to be the sign-extension of 47th bit */
>         shl     $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
>         sar     $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
>
>         /* If this changed %rcx, it was not canonical */
>         cmpq    %rcx, %r11
>         jne     opportunistic_sysret_failed
>
>         cmpq    $__USER_CS, CS(%rsp)            /* CS must match SYSRET */
>         jne     opportunistic_sysret_failed
>
>         movq    R11(%rsp), %r11
>         cmpq    %r11, EFLAGS(%rsp)              /* R11 == RFLAGS */
>         jne     opportunistic_sysret_failed
>
>         /*
>          * SYSRET can't restore RF.  SYSRET can restore TF, but unlike IRET,
>          * restoring TF results in a trap from userspace immediately after
>          * SYSRET.  This would cause an infinite loop whenever #DB happens
>          * with register state that satisfies the opportunistic SYSRET
>          * conditions.  For example, single-stepping this user code:
>          *
>          *           movq       $stuck_here, %rcx
>          *           pushfq
>          *           popq %r11
>          *   stuck_here:
>          *
>          * would never get past 'stuck_here'.
>          */
>         testq   $(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
>         jnz     opportunistic_sysret_failed
>
>         /* nothing to check for RSP */
>
>         cmpq    $__USER_DS, SS(%rsp)            /* SS must match SYSRET */
>         jne     opportunistic_sysret_failed
>
>         /*
>          * We win! This label is here just for ease of understanding
>          * perf profiles. Nothing jumps here.
>          */
> syscall_return_via_sysret:
>         /* rcx and r11 are already restored (see code above) */
>         RESTORE_C_REGS_EXCEPT_RCX_R11
>         movq    RSP(%rsp), %rsp
>         USERGS_SYSRET64
>
> opportunistic_sysret_failed:
>         SWAPGS
>         jmp     restore_c_regs_and_iret
> END(entry_SYSCALL_64)
>
>
> Basically there would be a single C function we'd call, which returns a condition
> (or fixes up its return address on the stack directly) to determine between the
> SYSRET and IRET return paths.
>
> Moving this to C too has immediate benefits: that way we could easily add
> instrumentation to see how efficient these various return methods are, etc.
>
> I.e. I don't think there's two ways about this: once the entry code moves to the
> domain of C code, we get the best benefits by moving as much of it as possible.

This is almost certainly true.  There are a lot more cleanups possible here.
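
To make the idea concrete, here is a rough, untested sketch of how the
opportunistic SYSRET checks quoted above might read as a single C predicate.
This is an illustration only, not code from the series: struct fake_regs,
is_canonical(), can_use_sysret() and the FAKE_* constants are hypothetical
stand-ins chosen so the snippet builds in userspace, and the selector values
are assumptions standing in for __USER_CS and __USER_DS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for pt_regs; only the fields the checks read. */
struct fake_regs {
	uint64_t r11, cx, ip, cs, flags, sp, ss;
};

#define VIRTUAL_MASK_SHIFT 47		/* same assumption as the asm */
#define FAKE_USER_CS	0x33ULL		/* assumed value of __USER_CS */
#define FAKE_USER_DS	0x2bULL		/* assumed value of __USER_DS */
#define FAKE_EFLAGS_TF	(1ULL << 8)	/* trap flag */
#define FAKE_EFLAGS_RF	(1ULL << 16)	/* resume flag */

/* Mirrors the .ifne/.error guard: fail the build if the width changes. */
static_assert(VIRTUAL_MASK_SHIFT == 47,
	      "virtual address width changed -- SYSRET checks need update");

/* Canonical means bits 63..47 are all zeros or all ones. */
static inline bool is_canonical(uint64_t addr)
{
	uint64_t top = addr >> VIRTUAL_MASK_SHIFT;

	return top == 0 || top == (UINT64_MAX >> VIRTUAL_MASK_SHIFT);
}

/* Returns true if SYSRET would restore this context exactly. */
bool can_use_sysret(const struct fake_regs *regs)
{
	if (regs->cx != regs->ip)		/* RCX == RIP */
		return false;
	if (!is_canonical(regs->ip))		/* avoid #GP in kernel space */
		return false;
	if (regs->cs != FAKE_USER_CS)		/* CS must match SYSRET */
		return false;
	if (regs->r11 != regs->flags)		/* R11 == RFLAGS */
		return false;
	if (regs->flags & (FAKE_EFLAGS_RF | FAKE_EFLAGS_TF))
		return false;			/* SYSRET mishandles RF/TF */
	if (regs->ss != FAKE_USER_DS)		/* SS must match SYSRET */
		return false;
	return true;				/* nothing to check for RSP */
}
```

The asm stub would then only need to test the return value and pick between
the SYSRET and IRET restore sequences, and instrumentation could be added
inside the C function.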

I want to nail down the 32-bit case first so we can delete the old code.

>
> The only low level bits remaining in assembly will be low level hardware ABI
> details: saving registers and restoring registers to the expected format - no
> 'active' code whatsoever.

I think this is true for syscalls.  Getting the weird special cases
(IRET and GS fault) for error_entry to work correctly in C could be
tricky.

--Andy